Amazon Bedrock Part 5: Multilingual semantic search with Bedrock, Cohere multilingual embeddings and Cohere Command LLM

Diptiman Raichaudhuri
8 min read · Nov 2, 2023


Disclosure: All opinions expressed in this article are my own, and represent no one but myself and not those of my current or any previous employers.

This article is the 5th part of my Amazon Bedrock series.

Here are the links to the previous articles:

Amazon Bedrock Part 1 : RAG with Pinecone Vector Store, Anthropic Claude LLM and LangChain — Income Tax FAQ Q&A bot

Amazon Bedrock Part 2: Hindi Text Summarization using Amazon Bedrock and Anthropic Claude

Amazon Bedrock Part 3: Amazon Neptune Graph database Q&A LangChain Agent with Amazon Bedrock and Anthropic Claude

Amazon Bedrock Part 4: AWS Cost and Usage Report(CUR) Summary from S3, using Bedrock, LangChain and Anthropic Claude

In this article I build a quick prototype of multilingual semantic search using Cohere multilingual-22-12 embeddings and the Cohere Command LLM.

Product companies with a global footprint and a global consumer base frequently release product safety manuals and work instruction manuals for entire product lines, with text in many languages. Many LLMs still perform noticeably worse on text written in languages other than English.

Cohere's multilingual-22-12 embedding model is trained on text from a large number of websites in many languages; Cohere lists support for over 100 languages. It is designed for semantic search both within a single language and across languages. According to Cohere’s documentation, the model maps text in different languages into the same vector space.

Cohere’s blog explains this two-step process very clearly:

Source : https://txt.cohere.com/multilingual/

Thus, for both multilingual search (query and results in the same language) and cross-lingual search (query and results in different languages), a single embedding model places sentences with similar meanings at nearby positions in the vector space, regardless of language.

Source : https://txt.cohere.com/search-cohere-langchain/
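To make the idea of a shared vector space concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are hypothetical values made up for illustration (real embeddings from the model have far more dimensions and come from the API), and cosine similarity is just one common way to measure closeness:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for this illustration only.
toy_embeddings = {
    "How to clean the product?": [0.90, 0.10, 0.05],
    "제품을 청소하는 방법?": [0.88, 0.12, 0.07],
    "Do not store the product near an airbag.": [0.10, 0.95, 0.20],
}

query_text = "How to clean the product?"
query_vector = toy_embeddings[query_text]

# Rank the other sentences by similarity to the English query.
ranked = sorted(
    (text for text in toy_embeddings if text != query_text),
    key=lambda text: cosine_similarity(query_vector, toy_embeddings[text]),
    reverse=True,
)
print(ranked[0])
```

With these made-up numbers, the Korean sentence with the same meaning ranks ahead of the unrelated English one, which is the behaviour a cross-lingual embedding model aims for.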

In this article, I follow the steps below :

  1. Take a publicly available product safety manual in Korean.
  2. Create embeddings using Cohere multilingual-22-12 and store them in an in-memory FAISS index.
  3. Execute a similarity search directly on the embeddings to retrieve results in the manual's language.
  4. Use the Cohere Command LLM with Amazon Bedrock and a LangChain QA chain to run question answering in English on the Korean manual.

Step 1:

Here’s the link to the safety manual written in Korean. I ran a PDF split on the main safety manual, which is written in more than 50 languages, and retained only the Korean sections in the linked file (kr_safety_information.pdf). Here’s the link to the original safety manual.

Step 2:

I open a SageMaker Studio notebook in my SageMaker domain (to get started, please go through the other Bedrock articles in sequence) and get going. A “Data Science 3.0” image on an ml.t3.medium instance is good enough to run this.

Install dependencies :

%pip install --upgrade --quiet boto3 botocore langchain cohere python-dotenv faiss-cpu pypdf

Import libraries

import boto3
import os
import json

from langchain.embeddings.cohere import CohereEmbeddings

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyPDFDirectoryLoader

from langchain.chains.question_answering import load_qa_chain
from langchain.vectorstores import FAISS
from langchain.prompts import PromptTemplate
from langchain.chains import RetrievalQA
from dotenv import load_dotenv

First, I create the embeddings from the Korean manual. Using Cohere multilingual-22-12 requires a COHERE_API_KEY. Create the key by registering for a trial key on the Cohere portal.

Add COHERE_API_KEY in .env file.

Load the key :

load_dotenv()

Create a folder kr_doc and upload the Korean Manual. Load the manual using PyPDFDirectoryLoader :

loader = PyPDFDirectoryLoader("./kr_doc/")

Split the document into chunks (chunking) so that it is broken down into smaller segments; this helps optimize the relevance of the content returned from the vector store.

documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    # Chunks of up to 1000 characters, with 100 characters shared between neighbours.
    chunk_size=1000,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)
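To see what chunk_size and chunk_overlap mean, here is a simplified sketch of overlapping chunking in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers to break on separators such as paragraphs and sentences); the helper name chunk_text is hypothetical, and it only cuts fixed-size character windows:

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.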

Instantiate CohereEmbeddings with the multilingual-22-12 model and create an in-memory FAISS vector store for query retrieval:

embeddings = CohereEmbeddings(model="multilingual-22-12")
vectorstore_faiss = FAISS.from_documents(
    docs,
    embeddings
)

Step 3:

Now that the embeddings have been created by the Cohere multilingual model, let’s test a simple similarity_search query, written in English against the Korean chunks:

query = """How to clean the product?"""
docs = vectorstore_faiss.similarity_search(query)
print(docs)

And, as expected, I get a Korean response from the vector store for my query:

[Document(page_content='제품을 청소할 때 주의하세요.\n 
•제품을 청소할 때는 연필용 지우개나 부드러운 천으로 가볍게 닦으세요.\n
•제품의 충전 단자를 청소할 때는 면봉이나 부드러운 천으로 닦으세요.\n
•제품을 청소하기 위해 독한 화학 물질이나 강한 세제 등을 사용하지 마세요. 제품의 외관이 \n변색되거나 부식될 수 있으며, 화재 또는 감전의 위험이 있습니다.\n
•먼지, 땀, 잉크, 기름, 화학 제품(화장품, 항균 스프레이, 손 세정제, 세제, 살충제)이 닿지 않도록 \n주의하세요. 제품의 외관이나 내부 부품이 손상되어 성능이 저하될 수 있습니다. 해당 물질이 \n
묻은 경우엔 보풀이 없는 부드러운 천으로 닦으세요.', metadata={'source': 'kr_doc/kr_safety_information.pdf', 'page': 8}), Document(page_content='•자동차에 설치된 휴대용 기기나 관련 장치들이 단단히 고정되었는지 확인하세요.\n
•에어백이 장착된 차량에서 사용할 경우 제품을 에어백 주변에 설치하거나 보관하지 마세요. \n
에어백 작동 시 제품이 파손되거나 제품에 부딪혀 다칠 수 있습니다.', metadata={'source': 'kr_doc/kr_safety_information.pdf', 'page': 7}), Document(page_content='•제품을 너무 낮거나 너무 높은 온도에서 보관하지 마세요.\n
•제품을 너무 높은 온도에서 보관하면 제품이 고장 나거나 배터리 수명이 단축될 수 있습니다.\n
•배터리의 음극 단자와 양극 단자를 직접 연결하지 말고, 배터리에 금속 물체가 닿지 않게 \n
하세요. 배터리가 고장 날 수 있습니다.', metadata={'source': 'kr_doc/kr_safety_information.pdf', 'page': 1}), Document(page_content='기기인지 확인 후 사용하세요.\n심한 매연이나 증기를 피하세요.\n
제품 외관이 훼손되거나 고장 날 수 있습니다.\n보청기를 사용하는 경우 전자파 관련 정보를 확인 후 사용하세요.\n
일부 보청기는 제품 전자파로 인해 제대로 동작하지 않을 수 있습니다. 보청기 제조 회사에 확인한 \n후 사용하세요.', metadata={'source': 'kr_doc/kr_safety_information.pdf', 'page': 3})]

This is perfect! The top hit is the manual's section on cleaning the product.

Let’s run similarity_search again with the same query, this time written in Korean:

query = """제품을 청소하는 방법?"""

And I get a perfect response again, identical to the earlier one:

[Document(page_content='제품을 청소할 때 주의하세요.\n •제품을 청소할 때는 연필용 지우개나 부드러운 천으로 가볍게 닦으세요.\n •제품의 충전 단자를 청소할 때는 면봉이나 부드러운 천으로 닦으세요.\n •제품을 청소하기 위해 독한 화학 물질이나 강한 세제 등을 사용하지 마세요. 제품의 외관이 \n변색되거나 부식될 수 있으며, 화재 또는 감전의 위험이 있습니다.\n •먼지, 땀, 잉크, 기름, 화학 제품(화장품, 항균 스프레이, 손 세정제, 세제, 살충제)이 닿지 않도록 \n주의하세요. 제품의 외관이나 내부 부품이 손상되어 성능이 저하될 수 있습니다. 해당 물질이 \n묻은 경우엔 보풀이 없는 부드러운 천으로 닦으세요.', metadata={'source': 'kr_doc/kr_safety_information.pdf', 'page': 8}), Document(page_content='•자동차에 설치된 휴대용 기기나 관련 장치들이 단단히 고정되었는지 확인하세요.\n •에어백이 장착된 차량에서 사용할 경우 제품을 에어백 주변에 설치하거나 보관하지 마세요. \n에어백 작동 시 제품이 파손되거나 제품에 부딪혀 다칠 수 있습니다.', metadata={'source': 'kr_doc/kr_safety_information.pdf', 'page': 7}), Document(page_content='•제품을 너무 낮거나 너무 높은 온도에서 보관하지 마세요.\n •제품을 너무 높은 온도에서 보관하면 제품이 고장 나거나 배터리 수명이 단축될 수 있습니다.\n •배터리의 음극 단자와 양극 단자를 직접 연결하지 말고, 배터리에 금속 물체가 닿지 않게 \n하세요. 배터리가 고장 날 수 있습니다.', metadata={'source': 'kr_doc/kr_safety_information.pdf', 'page': 1}), Document(page_content='기기인지 확인 후 사용하세요.\n심한 매연이나 증기를 피하세요.\n제품 외관이 훼손되거나 고장 날 수 있습니다.\n보청기를 사용하는 경우 전자파 관련 정보를 확인 후 사용하세요.\n일부 보청기는 제품 전자파로 인해 제대로 동작하지 않을 수 있습니다. 보청기 제조 회사에 확인한 \n후 사용하세요.', metadata={'source': 'kr_doc/kr_safety_information.pdf', 'page': 3})]

Step 4:

Now, let’s add contextualisation using the Cohere Command LLM on Bedrock and run a semantic search through a LangChain QA chain.

Let’s create a LangChain prompt_template :

prompt_template = """Text: {context}

Question: {question}

Answer the question based on the text provided. If the text doesn't contain the answer, reply that the answer is not available."""

PROMPT = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

Set up the Amazon Bedrock runtime client to invoke the model:

modelId = 'cohere.command-text-v14'
accept = 'application/json'
contentType = 'application/json'

bedrock_runtime = boto3.client(
    service_name="bedrock-runtime",
    region_name="us-east-1"
)

I have copied this handy method to call different LLMs supported by Bedrock from the Amazon Bedrock workshop:

def get_inference_parameters(model):  # return a default set of parameters based on the model's provider
    bedrock_model_provider = model.split('.')[0]  # grab the model provider from the first part of the model id

    if (bedrock_model_provider == 'anthropic'):  # Anthropic model
        return {
            "max_tokens_to_sample": 512,
            "temperature": 0,
            "top_k": 250,
            "top_p": 1,
            "stop_sequences": ["\n\nHuman:"]
        }

    elif (bedrock_model_provider == 'ai21'):  # AI21
        return {
            "maxTokens": 512,
            "temperature": 0,
            "topP": 0.5,
            "stopSequences": [],
            "countPenalty": {"scale": 0},
            "presencePenalty": {"scale": 0},
            "frequencyPenalty": {"scale": 0}
        }

    elif (bedrock_model_provider == 'cohere'):  # Cohere
        return {
            "max_tokens": 512,
            "temperature": 0,
            "p": 0.01,
            "k": 0,
            "stop_sequences": [],
            "return_likelihoods": "NONE"
        }

    else:  # Amazon
        # For the LangChain Bedrock implementation, these parameters will be added to the
        # textGenerationConfig item that LangChain creates for us
        return {
            "maxTokenCount": 512,
            "stopSequences": [],
            "temperature": 0,
            "topP": 0.9
        }

Set up Bedrock for Cohere Command:

from langchain.llms.bedrock import Bedrock

model_kwargs = get_inference_parameters(modelId)
llm = Bedrock(
    model_id=modelId,  # use the requested model
    client=bedrock_runtime,  # reuse the bedrock-runtime client created above
    model_kwargs=model_kwargs
)
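For readers curious about what the LangChain wrapper does underneath, here is a rough sketch of calling the model directly with boto3. The request and response shapes (a JSON body with a prompt field, and a generations list in the response) are my assumption about the Cohere Command format on Bedrock, so please check the Bedrock model documentation before relying on them. The helper names are hypothetical, and only the offline part at the bottom runs without AWS access:

```python
import json

def build_cohere_request(prompt, max_tokens=512, temperature=0):
    """Serialize a prompt plus inference parameters into a JSON request body."""
    return json.dumps({
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    })

def extract_generation(response_body):
    """Pull the generated text out of a JSON response body (assumed shape)."""
    payload = json.loads(response_body)
    return payload["generations"][0]["text"]

def invoke_cohere(client, prompt, model_id="cohere.command-text-v14"):
    """Call the model through a boto3 bedrock-runtime client (needs AWS access)."""
    response = client.invoke_model(
        modelId=model_id,
        body=build_cohere_request(prompt),
        accept="application/json",
        contentType="application/json",
    )
    return extract_generation(response["body"].read())

# Offline check with a hand-written response in the assumed shape.
fake_response = json.dumps({"generations": [{"text": " Wipe it with a soft cloth."}]})
print(extract_generation(fake_response).strip())
```

With real credentials, invoke_cohere(bedrock_runtime, "some prompt") would be the direct equivalent of what the Bedrock LLM object does for each call.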

Build a LangChain QA chain, with the FAISS vector store as the retriever:

chain_type_kwargs = {"prompt": PROMPT}

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore_faiss.as_retriever(),
    chain_type_kwargs=chain_type_kwargs,
    return_source_documents=True
)
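The "stuff" chain type simply stuffs all retrieved chunks into the {context} slot of the prompt before calling the LLM. Here is a minimal sketch of that step in plain Python, assuming the chunks are joined with a blank line between them; the helper stuff_prompt and the placeholder chunks are hypothetical:

```python
prompt_template = """Text: {context}

Question: {question}

Answer the question based on the text provided. If the text doesn't contain the answer, reply that the answer is not available."""

def stuff_prompt(chunks, question, separator="\n\n"):
    """Join the retrieved chunks and substitute them into the template."""
    context = separator.join(chunks)
    return prompt_template.format(context=context, question=question)

# Placeholders standing in for the Korean chunks returned by the retriever.
retrieved_chunks = ["<retrieved chunk 1>", "<retrieved chunk 2>"]
filled = stuff_prompt(retrieved_chunks, "How to clean the product")
print(filled)
```

This is also why the chunk size matters: every retrieved chunk ends up in a single prompt, so the total has to fit within the model's context window.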

Let’s run the query :

query = """How to clean the product"""

And print the answer:

answer = qa({"query": query})
result = answer["result"].replace("\n","").replace("Answer:","")
print(f"Question: {query}")
print(f"Answer: {result}")

And, I get the following response :

Question: How to clean the product
Answer: The text provides instructions for cleaning the product.
Here is a summary of the instructions:
- Clean the product gently with a soft cloth or a brush.
- Clean the battery terminals with a soft cloth or a brush.
- Do not use strong chemicals or abrasive cleaners when cleaning the product.
- Be careful not to damage the external appearance or internal components of the product.
- If the product comes into contact with dust, sweat, ink, oil, or chemical products, clean it gently with a soft dry cloth.The text also provides some additional guidelines for handling the product:
- Check that any installed portable devices or related accessories have been securely fixed.
- Do not place the product in an environment with extreme temperatures.
- Do not expose the product to high temperatures, as this may cause it to malfunction or reduce its battery life.
- Avoid direct contact between the positive and negative terminals of the battery.
- Do not use the product if it has been damaged or if there are any signs of damage.Thus, the text does provide guidelines for cleaning the product and handling it properly.
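Because the chain is built with return_source_documents=True, the result also carries the retrieved chunks under the "source_documents" key, which helps to check where an answer came from. Here is a small sketch that counts chunks per page; the helper pages_by_source is hypothetical, and the sample objects are stand-ins for LangChain Document objects, using the metadata values from the search results shown earlier:

```python
from collections import Counter
from types import SimpleNamespace

def pages_by_source(source_documents):
    """Count how many retrieved chunks came from each (source, page) pair."""
    return Counter(
        (doc.metadata.get("source"), doc.metadata.get("page"))
        for doc in source_documents
    )

# Stand-ins for Document objects; only the .metadata attribute is used here.
sample_docs = [
    SimpleNamespace(metadata={"source": "kr_doc/kr_safety_information.pdf", "page": 8}),
    SimpleNamespace(metadata={"source": "kr_doc/kr_safety_information.pdf", "page": 7}),
    SimpleNamespace(metadata={"source": "kr_doc/kr_safety_information.pdf", "page": 1}),
    SimpleNamespace(metadata={"source": "kr_doc/kr_safety_information.pdf", "page": 3}),
]

for (source, page), count in sorted(pages_by_source(sample_docs).items()):
    print(f"{source} page {page}: {count} chunk(s)")
```

In the notebook, the same helper could be called as pages_by_source(answer["source_documents"]).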

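The multilingual semantic search works end to end: the English question is answered from the Korean manual using Cohere’s multilingual-22-12 embedding model and the Cohere Command text LLM, with Bedrock serving the LLM. The answer captures the gist, although some details are paraphrased loosely; for example, the manual recommends a pencil eraser or a soft cloth, which the model renders as a soft cloth or a brush.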

This opens up broad opportunities for industries such as Financial Services, Manufacturing and Retail that have global operations and a global customer base.

Here’s the notebook attached.

Happy coding!
