How to Use Chroma to Build Your First Similarity Search

Ajithkumar M
7 min read · Oct 5, 2023


Chroma is an open-source embedding database that can store embeddings and their metadata, embed documents and queries, and search over embeddings. Chroma makes it easy to build LLM apps by making knowledge, facts, and skills pluggable for LLMs.

Chroma provides its own Python and JavaScript/TypeScript client SDKs, which can be used to connect to the database.

There are two ways to use Chroma: as an embedded (in-memory or locally persisted) database inside your application, or as a standalone DB server running in Docker.

What is a Vector Store?

A vector DB is used to efficiently store and query vector embeddings. It provides the capabilities required to scale, optimize, manage, and secure high-dimensional vector data for a variety of use cases. Some examples of vector DBs are Chroma, Pinecone, Weaviate, Milvus, AwaDB, DeepLake, BagelDB, etc.
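To make the idea concrete, here is a toy, pure-Python sketch of what a vector store does at query time: compare the query vector against every stored vector and return the closest match. (Real vector DBs use approximate indexes such as HNSW so this scales to millions of vectors; the document ids and 3-d vectors below are invented for illustration.)

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "database" of embeddings keyed by document id.
store = {
    "doc1": [0.9, 0.1, 0.0],
    "doc2": [0.0, 0.8, 0.6],
    "doc3": [0.7, 0.7, 0.1],
}

def search(query_vector: list) -> str:
    """Brute-force nearest neighbour by cosine similarity."""
    return max(store, key=lambda doc_id: cosine_similarity(query_vector, store[doc_id]))

print(search([1.0, 0.0, 0.0]))  # → doc1
```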

Word embedding or Vector

A Word Embedding, also known as a Word Vector, is a numeric vector that represents a word in a lower-dimensional space. It allows words with similar meanings to have comparable representations, and embeddings can also approximate meaning. A word vector with 50 values can represent 50 distinct features. Here is an example of vector data:

[ -0.022300036624073982, -0.036522310227155685, -0.06973360478878021, 
0.011544489301741123, -0.006613695528358221, -0.026300203055143356,
0.018448151648044586, -0.026300203055143356, 0.006446260958909988,
-0.007453453727066517, 0.012707822024822 ]

There are different word embedding algorithms, such as SBERT, Word2Vec, FastText, GloVe, etc.
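A classic illustration of how embeddings "approximate meaning" is vector arithmetic. With hand-crafted toy 3-d vectors (illustrative only; real embeddings come from trained models and have hundreds of dimensions), the result of king − man + woman lands nearest to queen:

```python
# Hand-crafted toy vectors; a trained model would learn these from data.
words = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.1],
    "woman": [0.5, 0.2, 0.8],
    "apple": [0.1, 0.1, 0.1],
}

def nearest(vector: list, exclude: set) -> str:
    """Return the word whose vector is closest (squared Euclidean) to `vector`."""
    def dist(w):
        return sum((a - b) ** 2 for a, b in zip(vector, words[w]))
    return min((w for w in words if w not in exclude), key=dist)

# king - man + woman ≈ queen
target = [k - m + w for k, m, w in zip(words["king"], words["man"], words["woman"])]
print(nearest(target, exclude={"king", "man", "woman"}))  # → queen
```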

Let's see how to build a prototype.

Here we are using the Chroma vector database in embedded mode (persisted to local disk) and Sentence Transformers for embedding text.

Step 1: Setup

Using a terminal, install the ChromaDB, LangChain, and Sentence Transformers libraries.

pip3 install langchain

pip3 install chromadb

pip3 install sentence-transformers

Step 2: Create data file

Here is a sample plain-text file. I used three newlines as a separator to identify each context.

Mahatma Gandhi, byname of Mohandas Karamchand Gandhi, (born Oct. 2, 1869, Porbandar, India—died Jan. 30, 1948, Delhi), Preeminent leader of Indian nationalism and prophet of nonviolence in the 20th century.
Gandhi grew up in a home steeped in religion, and he took for granted religious tolerance and the doctrine of ahimsa (noninjury to all living beings). He studied law in England from 1888 to 1891, and in 1893 he took a job with an Indian firm in South Africa. There he became an effective advocate for Indian rights.
In 1906 he first put into action satyagraha, his technique of nonviolent resistance. His success in South Africa gave him an international reputation, and in 1915 he returned to India and within a few years became the leader of a nationwide struggle for Indian home rule. By 1920 Gandhi commanded influence hitherto unattained by any political leader in India.
He refashioned the Indian National Congress into an effective political instrument of Indian nationalism and undertook major campaigns of nonviolent resistance in 1920–22, 1930–34 (including his momentous march to the sea to collect salt to protest a government monopoly), and 1940–42. In the 1930s he also campaigned to end discrimination against India’s lower-caste “untouchables” (Dalits; officially designated as Scheduled Castes) and concentrated on educating rural India and promoting cottage industry.
India achieved dominion status in 1947, but the partition of the subcontinent into India and Pakistan was a great disappointment to Gandhi, who had long worked for Hindu-Muslim unity. In September 1947 he ended rioting in Calcutta (Kolkata) by fasting. Known as the Mahatma (“Great-Souled”), Gandhi had won the affection and loyalty of millions. In January 1948 he was shot and killed by a young Hindu fanatic.


Resistors are the commonly used components in the electronic circuits. A resistor is an electronic component that limits the electric current or flow of electrons to certain level. It consists of two terminals.


The opposition a resistor offers to electric current is called resistance; it is measured in ohms and represented by the symbol Ω.


Capacitors are the most widely used electronic components after the resistors. Capacitors temporarily store the electrical energy in the form of static electric field.

OOPs (Object-Oriented Programming System).
Object means a real-world entity such as a pen, chair, table, computer, watch, etc. Object-Oriented Programming is a methodology or paradigm to design a program using classes and objects. It simplifies software development and maintenance by providing some concepts: Encapsulation, Data Abstraction, Polymorphism, and Inheritance are the 4 basics of OOPs.
Encapsulation
Encapsulation is the mechanism of hiding of data implementation by restricting access to public methods. Instance variables are kept private and accessor methods are made public to achieve this.
Abstraction
Abstract means a concept or an Idea which is not associated with any particular instance. Using abstract class/Interface we express the intent of the class rather than the actual implementation. In a way, one class should not know the inner details of another in order to use it, just knowing the interfaces should be good enough.
Inheritance
Inheritance expresses an "is-a" and/or "has-a" relationship between two objects. Using inheritance, we can reuse the code of existing superclasses in derived classes. In Java, the concept of "is-a" is based on class inheritance (using extends) or interface implementation (using implements).
For example, FileInputStream "is-a" InputStream that reads from a file.
Polymorphism
It means one name many forms. It is further of two types — static and dynamic. Static polymorphism is achieved using method overloading and dynamic polymorphism using method overriding. It is closely related to inheritance. We can write a code that works on the superclass, and it will work with any subclass type as well.


Vector Store: One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.


LangChain is a framework for developing applications powered by language models. It enables applications that:
Are context-aware: connect a language model to sources of context (prompt instructions, few shot examples, content to ground its response in, etc.)
Reason: rely on a language model to reason (about how to answer based on provided context, what actions to take, etc.)
The main value props of LangChain are:
Components: abstractions for working with language models, along with a collection of implementations for each abstraction. Components are modular and easy-to-use, whether you are using the rest of the LangChain framework or not
Off-the-shelf chains: a structured assembly of components for accomplishing specific higher-level tasks
Off-the-shelf chains make it easy to get started. For complex applications, components make it easy to customize existing chains and build new ones.


Step 3: Declare SentenceTransformer and Chroma Client

import chromadb
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
client = chromadb.PersistentClient(path="my_chroma_db")

multi-qa-MiniLM-L6-cos-v1 is the Sentence Transformers embedding model used here (Chroma's built-in default is all-MiniLM-L6-v2). my_chroma_db is the directory where Chroma persists the collection data. Note that because we compute embeddings ourselves and pass them explicitly, the collection's embedding_function is not actually invoked in this example.

Step 4: Create chroma collection

collection = client.get_or_create_collection(name="collection1", embedding_function=embedding_model)

Step 5: Function to read the data file and return a list of contexts.

def getDataFromText() -> list:
    try:
        with open('../test_data/data2.txt', 'r', encoding="utf-8") as file:
            content: str = file.read()
        splited_data: list = content.split("\n\n\n")
        return splited_data
    except Exception as e:
        print("Read data from text failed : ", e)
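The separator logic can be sanity-checked with a tiny inline string instead of the file (the text below is made up; `splited_data` follows the article's naming):

```python
# Three contexts separated by three newlines, as in the data file.
content = "Context one.\n\n\nContext two.\n\n\nContext three."
splited_data = content.split("\n\n\n")
print(len(splited_data))   # → 3
print(splited_data[0])     # → Context one.
```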

Step 6: Function to insert embeddings/vectors into ChromaDB.

def addVectorDataToDb() -> None:
    embeddings: list = []
    metadatas: list = []
    documents: list = []
    ids: list = []
    splited_data: list = getDataFromText()
    try:
        for index, data in enumerate(splited_data):
            embeddings.append(embedding_model.encode(data).tolist())
            metadatas.append({"Chapter": str(index + 1)})
            documents.append(data)
            ids.append(str(index + 1))
        collection.add(
            embeddings=embeddings,
            metadatas=metadatas,
            documents=documents,
            ids=ids
        )
        print("Data added to collection")
    except Exception as e:
        print("Add data to db failed : ", e)

embedding_model.encode() converts text to a vector/embedding.

collection.add() stores the vector data in ChromaDB.

Step 7: Function to search data by vector.

def searchDataByVector(query: str):
    try:
        query_vector = embedding_model.encode(query).tolist()
        res = collection.query(
            query_embeddings=[query_vector],
            n_results=1,
            include=['distances', 'embeddings', 'documents', 'metadatas'],
        )
        print("Query", "\n--------------")
        print(query)
        print("Result", "\n--------------")
        print(res['documents'][0][0])
        print("Vector", "\n--------------")
        print(res['embeddings'][0][0])
        print("")
        print("")
        print("Complete Response", "\n-------------------------")
        print(res)
    except Exception as e:
        print("Vector search failed : ", e)

Pass a query to the above function to search. First, embedding_model.encode() converts the text query to vector form, then collection.query() returns the nearest similar results. n_results specifies how many results should be returned; the top n most similar documents are returned accordingly.
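The distances in the response indicate how close each hit is to the query (smaller = more similar). As a sketch (assuming Chroma's default squared-L2 space and unit-length embeddings, which the cos-v1 Sentence Transformers models produce), squared L2 distance and cosine similarity are directly related by distance = 2 − 2·similarity:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Two unit vectors 60 degrees apart.
a = [1.0, 0.0]
b = [0.5, math.sqrt(3) / 2]

sim = cosine_similarity(a, b)   # 0.5
dist = squared_l2(a, b)         # 2 - 2 * 0.5 = 1.0

# For unit vectors the identity holds exactly.
assert abs(dist - (2 - 2 * sim)) < 1e-9
print(sim, dist)
```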

Example Result

Query
--------------
What is Vector Store

Result
--------------
Vector Store: One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you.

Vector
--------------
[-0.022300036624073982, -0.036522310227155685, -0.06973360478878021,
0.07131890207529068, 0.011544489301741123, -0.006613695528358221,
-0.026300203055143356, 0.006446260958909988, 0.018448151648044586,
-0.0069664353504776955, -0.009745685383677483, ... ]

Complete Response
-------------------------
{
'ids': [['46']],
'distances': [[0.534490925007332]],
'metadatas': [[{'Chapter': '46'}]],
'embeddings': [[[-0.022300036624073982, -0.036522310227155685, -0.06973360478878021, 0.07131890207529068, ....]]],
'documents': [["Vector Store: One of the most common ways to store and search over unstructured data is to embed it and store the resulting embedding vectors, and then at query time to embed the unstructured query and retrieve the embedding vectors that are 'most similar' to the embedded query. A vector store takes care of storing embedded data and performing vector search for you."]]
}

Step 8: Full Code Snippet


import chromadb
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
client = chromadb.PersistentClient(path="my_chroma_db_1")
collection = client.get_or_create_collection(name="immi_collection", embedding_function=embedding_model)


def getDataFromText() -> list:
    try:
        with open('../test_data/data.txt', 'r', encoding="utf-8") as file:
            content: str = file.read()
        splited_data: list = content.split("\n\n\n")
        return splited_data
    except Exception as e:
        print("Read data from text failed : ", e)


def addVectorDataToDb() -> None:
    embeddings: list = []
    metadatas: list = []
    documents: list = []
    ids: list = []
    splited_data: list = getDataFromText()
    try:
        for index, data in enumerate(splited_data):
            embeddings.append(embedding_model.encode(data).tolist())
            metadatas.append({"Chapter": str(index + 1)})
            documents.append(data)
            ids.append(str(index + 1))
        collection.add(
            embeddings=embeddings,
            metadatas=metadatas,
            documents=documents,
            ids=ids
        )
        print("Data added to collection")
    except Exception as e:
        print("Add data to db failed : ", e)


def searchDataByVector(query: str):
    try:
        query_vector = embedding_model.encode(query).tolist()
        res = collection.query(
            query_embeddings=[query_vector],
            n_results=1,
            include=['distances', 'embeddings', 'documents', 'metadatas'],
        )
        print("Query", "\n--------------")
        print(query)
        print("Result", "\n--------------")
        print(res['documents'][0][0])
        print("Vector", "\n--------------")
        print(res['embeddings'][0][0])
        print("")
        print("")
        print("Complete Response", "\n-------------------------")
        print(res)
    except Exception as e:
        print("Vector search failed : ", e)


# addVectorDataToDb()

query = "What is Vector Store"
searchDataByVector(query=query)

References

  1. Chroma
  2. LangChain and Vector Store
  3. Sentence Embedding Methods
  4. Sentence Transformers
