Working with FAISS for Similarity Search

Ajithkumar M
6 min read · Nov 1, 2023

FAISS

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

FAISS is written in C++ with complete wrappers for Python. Some of the most useful algorithms are implemented on the GPU. It is developed by Facebook AI Research.

What is similarity search?

Similarity search is a technique used in information retrieval and data analysis to find items that are similar to a given query item within a dataset. The goal of similarity search is to identify items that are most closely related to the query item based on a similarity metric or distance measure.

Here’s how similarity search works:

  1. Data Representation: The first step is to represent the items in the dataset in a way that can be compared for similarity. This often involves converting items into numerical vectors or other suitable representations. For example, in text data, you might represent documents as TF-IDF vectors, and in image data, you might represent images as feature vectors extracted from their pixels.
  2. Similarity Metric: A similarity metric or distance measure is chosen to quantify how similar two items are. Common similarity metrics include Euclidean distance, cosine similarity, Jaccard similarity, and many others. The choice of metric depends on the nature of the data and the specific problem.
  3. Search: Given a query item, the system computes the similarity between the query item and all items in the dataset. This is done by applying the chosen similarity metric. The items are then ranked based on their similarity to the query item.
  4. Retrieval: The most similar items are returned as the search results. The number of results returned may be specified in advance or depend on the user’s preferences.
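
To make the four steps concrete, here is a minimal pure-Python sketch of an exact (brute-force) search. The three-dimensional vectors, the document names and the helper functions are made up for illustration; this is not how FAISS works internally, only the idea it accelerates.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (larger means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, dataset, k=2):
    """Steps 3 and 4: score every item against the query, rank, return the top k."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in dataset.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Step 1: items represented as vectors (hypothetical, hand-made values).
dataset = {
    "doc_finance": [0.9, 0.1, 0.0],
    "doc_banking": [0.8, 0.2, 0.1],
    "doc_cooking": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

# Step 2 is the choice of metric; cosine similarity is used here.
results = brute_force_search(query, dataset, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because every item is compared with the query, the cost grows linearly with the size of the dataset, which is the motivation for libraries such as FAISS.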

Similarity search is used in various applications, including:

  • Information Retrieval: In text search engines, similarity search helps find documents that are similar to a search query, rather than exact matches.
  • Recommendation Systems: In collaborative filtering and content-based recommendation systems, similarity search is used to find items (e.g., movies, products) similar to what a user has liked or interacted with.
  • Image and Video Retrieval: In multimedia applications, similarity search helps locate images or videos that are visually similar to a given image or video.
  • Genomics: In bioinformatics, similarity search is used to find genes or sequences with similar genetic structures.
  • Anomaly Detection: In cybersecurity and network monitoring, similarity search can help detect anomalies by identifying patterns in network traffic that deviate from normal behavior.
  • Clustering: Similarity search can be used as a component of clustering algorithms to group similar data points together.

The choice of similarity metric and the efficiency of the search algorithm can significantly impact the performance of a similarity search system. Various indexing structures and search algorithms have been developed to speed up similarity search in high-dimensional spaces, as performing exact similarity search for large datasets can be computationally expensive.
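
One common way to avoid scanning every vector is to partition the dataset into buckets and search only the buckets closest to the query. FAISS offers index types built on this kind of idea (for example `IndexIVFFlat`). The pure-Python sketch below only illustrates the principle: the function names are hypothetical, and the centroids are hand-picked rather than learned from the data as a real index would do.

```python
import math

def l2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_buckets(vectors, centroids):
    """Assign each vector id to the bucket of its nearest centroid."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec_id, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(vec_id)
    return buckets

def search(query, vectors, centroids, buckets, k=1, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe_order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vid for c in probe_order[:nprobe] for vid in buckets[c]]
    candidates.sort(key=lambda vid: l2(query, vectors[vid]))
    return candidates[:k]

# Toy 2-D data (hypothetical values).
vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.9, 0.1]]
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]  # hand-picked, not learned

buckets = build_buckets(vectors, centroids)
nearest_ids = search([5.0, 5.0], vectors, centroids, buckets, k=1, nprobe=1)
print(buckets)
print(nearest_ids)
```

With `nprobe=1` only two of the five vectors are compared with the query. The trade-off is that a true nearest neighbor sitting in a bucket that was not probed is missed, so this kind of search is approximate.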

Let’s get started with the implementation of similarity search using FAISS, LangChain and Hugging Face!

Before getting started, install the following libraries:

pip install langchain
pip install torch
pip install transformers
pip install sentence-transformers
pip install datasets
pip install faiss-cpu

Import the libraries

from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

Here we use the Hugging Face dataset databricks-dolly-15k.

# Specify the dataset name and the column containing the content
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context" # or any other column you're interested in

# Create a loader instance
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

# Load the data
data = loader.load()

We use RecursiveCharacterTextSplitter for text splitting. It takes a large text and splits it into chunks of a specified size. It does this by trying a list of separator characters in order, moving to the next separator only when a piece is still too large. The default separators are ["\n\n", "\n", " ", ""].

# Create an instance of the RecursiveCharacterTextSplitter class with specific parameters.
# It splits text into chunks of 1000 characters each with a 150-character overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# 'data' holds the text you want to split, split the text into documents using the text splitter.
docs = text_splitter.split_documents(data)
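
To show what "recursive" means here, the following is a simplified pure-Python sketch of the idea. It is not LangChain's implementation: the function name `recursive_split` is hypothetical, it ignores `chunk_overlap`, and its merging rules are only an approximation of what the real splitter does.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first, recursing on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # Last resort (empty separator or none left): cut at fixed positions.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too big: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "aaa bbb ccc\n\nddd eee fff ggg"
chunks = recursive_split(sample, chunk_size=11)
print(chunks)
```

The first paragraph fits in one chunk, so it is kept whole. The second paragraph is too long, so it is split again on spaces. This is why the splitter tends to keep paragraphs and sentences together where it can.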

Sentence Transformers is used for text embedding.

# Define the path to the pre-trained model you want to use
modelPath = "sentence-transformers/all-MiniLM-l6-v2"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,         # Provide the pre-trained model's path
    model_kwargs=model_kwargs,    # Pass the model configuration options
    encode_kwargs=encode_kwargs   # Pass the encoding options
)
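
A note on the 'normalize_embeddings' option: when vectors are scaled to unit length, squared Euclidean distance and cosine similarity are tied together by the identity d² = 2 − 2·cos, so ranking by one is equivalent to ranking by the other. With the option set to False, as above, the two metrics can rank results differently. The small pure-Python check below uses made-up vectors and helper names to illustrate the identity.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0])   # becomes [0.6, 0.8]
b = normalize([1.0, 0.0])   # already unit length

cosine = dot(a, b)               # for unit vectors the dot product is the cosine
distance_sq = squared_l2(a, b)   # squared Euclidean distance
print(cosine, distance_sq, 2 - 2 * cosine)
```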

Here we use FAISS as the vector store. After building the index, we save it to disk, so you don’t have to recreate it every time you use it.

db = FAISS.from_documents(docs, embeddings)
db.save_local(folder_path="../database/faiss_db", index_name="myFaissIndex")

Once the vector store and FAISS index have been created, you can load the index and run a query against it:

# Note: newer LangChain releases also require allow_dangerous_deserialization=True here
db = FAISS.load_local(folder_path="../database/faiss_db", embeddings=embeddings, index_name="myFaissIndex")
searchDocs = db.similarity_search("What is investment banking?")
print(searchDocs[0].page_content)

Your output may look like the following:

Investment banking pertains to certain activities of a financial services 
company or a corporate division that consist in advisory-based financial
transactions on behalf of individuals, corporations, and governments.
Traditionally associated with corporate finance, such a bank might assist
in raising financial capital by underwriting or acting as the client's agent
in the issuance of debt or equity securities.

The full code is given below. I created two separate scripts: one for creating the vector store and one for running queries.

# create_faiss.py

from langchain.document_loaders import HuggingFaceDatasetLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Specify the dataset name and the column containing the content
dataset_name = "databricks/databricks-dolly-15k"
page_content_column = "context" # or any other column you're interested in

# Create a loader instance
loader = HuggingFaceDatasetLoader(dataset_name, page_content_column)

# Load the data
data = loader.load()

# Display the first 2 entries
print(data[:2])

# Create an instance of the RecursiveCharacterTextSplitter class with specific parameters.
# It splits text into chunks of 1000 characters each with a 150-character overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)

# 'data' holds the text you want to split, split the text into documents using the text splitter.
docs = text_splitter.split_documents(data)

# Define the path to the pre-trained model you want to use
modelPath = "sentence-transformers/all-MiniLM-l6-v2"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,         # Provide the pre-trained model's path
    model_kwargs=model_kwargs,    # Pass the model configuration options
    encode_kwargs=encode_kwargs   # Pass the encoding options
)

try:
    # Embed the chunks, build the FAISS index and save it to disk
    db = FAISS.from_documents(docs, embeddings)
    db.save_local(folder_path="../database/faiss_db", index_name="myFaissIndex")
    print("Faiss index created")
except Exception as e:
    print("Faiss store failed\n", e)

# search_faiss.py

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Define the path to the pre-trained model you want to use
modelPath = "sentence-transformers/all-MiniLM-l6-v2"

# Create a dictionary with model configuration options, specifying to use the CPU for computations
model_kwargs = {'device':'cpu'}

# Create a dictionary with encoding options, specifically setting 'normalize_embeddings' to False
encode_kwargs = {'normalize_embeddings': False}

# Initialize an instance of HuggingFaceEmbeddings with the specified parameters
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,         # Provide the pre-trained model's path
    model_kwargs=model_kwargs,    # Pass the model configuration options
    encode_kwargs=encode_kwargs   # Pass the encoding options
)

try:
    db = FAISS.load_local(folder_path="../database/faiss_db", embeddings=embeddings, index_name="myFaissIndex")
    print("Faiss index loaded")
except Exception as e:
    print("Faiss index loading failed\n", e)
    raise SystemExit(1)  # without an index there is nothing to search

while True:
    question = input("Enter your query (leave empty to quit): ")
    if not question.strip():
        break
    searchDocs = db.similarity_search(question)
    print(searchDocs[0].page_content)

Run create_faiss.py once to create the FAISS database, then run search_faiss.py to perform similarity searches.

Advantages of FAISS

FAISS has various advantages, including:

  • Efficient similarity search: FAISS provides efficient methods for similarity search and clustering that can handle large-scale, high-dimensional data.
  • GPU support: FAISS includes GPU implementations of some of its algorithms, which enable further acceleration and can greatly increase search performance on large-scale datasets.
  • Scalability: FAISS is designed to be extremely scalable and capable of handling datasets containing billions of vectors.

Ajithkumar M

Software Engineer | R&D | ML | LLM | AI | IoT | Python | ChatGPT| React Native