Similarity Search with Language Embeddings

Implementing DistilBERT embeddings, Faiss, and Pinecone for effective semantic search and retrieval
AI
Deep Learning
NLP
Project
Study
Author

Mahmut Osmanovic

Published

August 4, 2024

1.0 | Language Embeddings and Hadiths

Language embeddings have revolutionized how we process and understand text data. This project explores their application in analyzing and searching through Hadiths from Sahih Bukhari: recorded sayings and actions of Prophet Muhammad, considered highly authentic by Muslims, which provide guidance on various aspects of Islamic life. The Hadiths are readily available online in PDF format. Their static nature and typically short length make them ideal candidates for text processing, embedding, and subsequent storage in a vector database for efficient similarity search.

2.0 | Preprocessing & Tokenization of Hadiths

The following Python code demonstrates the process of extracting text from a PDF containing Hadiths, removing irrelevant information, and splitting the text into individual Hadiths:

Show Code
import pdfplumber

def extract_text_from_pdf(pdf_path):
  """Extracts text from a PDF containing Hadiths.

  Args:
    pdf_path: The path to the PDF file.

  Returns:
    The extracted text content.
  """

  text = ""
  with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
      page_text = page.extract_text()
      if page_text:  # extract_text() can return None for empty pages
        text += page_text
  return text

pdf_path = 'part93-sb.pdf'
text = extract_text_from_pdf(pdf_path)

import re

def remove_string(text, string_to_remove):
  """Removes all occurrences of a specific string from a given text.

  Args:
    text: The input text.
    string_to_remove: The string to be removed.

  Returns:
    The text with the specified string removed.
  """

  return re.sub(re.escape(string_to_remove), '', text)

def split_hadiths(text):
  """Splits the Hadith text into a list of individual hadiths.

  Args:
    text: The text containing all the hadiths.

  Returns:
    A list of strings, where each string is an individual Hadith with its number
    and content.
  """

  hadiths = []
  hadith_start_pattern = r"Sahih Bukhari Volume (\d+), Book (\d+), Hadith Number (\d+)"
  hadith_end_pattern = r"\nSAHIH BUKHARI BOOK 2. BELIEF\n"

  for match in re.finditer(hadith_start_pattern, text):
    end_match = re.search(hadith_end_pattern, text[match.end():])
    if end_match:  # The footer pattern marks the end of this Hadith
      end_index = end_match.start() + match.end()
    else:
      end_index = len(text)  # Fall back to the end of the text
    hadith_text = text[match.start():end_index]
    hadiths.append(hadith_text)

  return hadiths


# Remove unwanted string before splitting
text = remove_string(text, "Hadith Collection | www.HadithCollection.com")

# Split the text into individual hadiths
hadiths_v1b2 = split_hadiths(text)

This code first extracts the text from the PDF using the pdfplumber library. Then, it removes any unwanted strings like collection information using the remove_string function. Finally, the split_hadiths function employs regular expressions to identify Hadith start and end patterns, splitting the text into a list of individual Hadiths containing their reference numbers and content.

The resulting list, hadiths_v1b2, now contains pre-processed Hadiths ready for the next step in the workflow: embedding creation.

3.0 | Embeddings: Transforming Text into Numerical Representations

With the Hadiths extracted and split, the next step involves converting them into numerical representations suitable for computational analysis. This process is known as embedding. Embeddings capture the semantic and syntactic information of text, allowing for quantitative comparisons and analysis.
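To make "quantitative comparison" concrete, here is a minimal sketch (an addition for illustration, using toy vectors; real DistilBERT embeddings have 768 dimensions) of cosine similarity, the measure used for all searches in this post:

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal ones."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" for illustration only
print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.15]))  # close to 1.0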

Implementation: I employed a pre-trained language model, DistilBERT (freely available on Hugging Face), to generate dense vector representations for each Hadith. This model has been trained on a massive amount of text data and can effectively capture the underlying meaning of the Hadiths.

Show Code
import torch
from transformers import AutoTokenizer, AutoModel

model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_embedding(text):
  """Generates an embedding for a given text.

  Args:
    text: The input text.

  Returns:
    The embedding as a NumPy array.
  """

  inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
  with torch.no_grad():
    outputs = model(**inputs)
  # Use mean pooling to get the sentence vector
  embeddings = outputs.last_hidden_state.mean(dim=1)
  return embeddings.squeeze().numpy()

# Generate embeddings for all Hadiths
hadith_embeddings = [get_embedding(hadith) for hadith in hadiths_v1b2]

The resulting hadith_embeddings list contains numerical representations for each Hadith, ready for similarity search and other downstream tasks.
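As a quick sanity check (an addition, not in the original post), each vector should be a 768-dimensional float32 array, matching the hidden size of distilbert-base-uncased:

import numpy as np

first = np.asarray(hadith_embeddings[0])
print(first.shape, first.dtype)  # expected: (768,) float32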

4.0 | Similarity Search with Faiss

To efficiently find Hadiths similar to a given query, I utilize the Faiss library, which is optimized for similarity search over large collections of dense vectors. By indexing the Hadith embeddings, I can rapidly retrieve the most similar Hadiths based on their semantic and syntactic closeness.

Show Code
import faiss
import numpy as np

def find_most_similar_hadith(hadiths_embeddings, query_embedding):
  """Finds the most similar hadith to a given query embedding using cosine similarity.

  Args:
    hadiths_embeddings: A list of embeddings for all hadiths.
    query_embedding: The embedding of the query hadith.

  Returns:
    The index of the most similar hadith.
  """

  # Normalize embeddings so that inner product equals cosine similarity
  # (faiss.normalize_L2 requires contiguous float32 arrays)
  hadiths_embeddings = np.array(hadiths_embeddings, dtype='float32')
  faiss.normalize_L2(hadiths_embeddings)
  query_embedding = np.array([query_embedding], dtype='float32')
  faiss.normalize_L2(query_embedding)

  # Create a Faiss index
  dimension = hadiths_embeddings.shape[1]
  index = faiss.IndexFlatIP(dimension)  # Use inner product for cosine similarity

  # Add embeddings to the index
  index.add(hadiths_embeddings)

  # Search for the most similar hadith
  k = 2  # Find the top 2 nearest neighbors (including the query itself)
  distances, indices = index.search(query_embedding, k)

  # Return the index of the most similar hadith (excluding the query itself)
  return indices[0][1]

By constructing a Faiss index over the Hadith embeddings, I can efficiently retrieve the most similar Hadiths to a given query, enabling various applications such as recommendation systems, information retrieval, and clustering.
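Note that find_most_similar_hadith rebuilds the index on every call, which is fine for a demonstration but wasteful in practice. A sketch of the alternative (an addition, reusing hadith_embeddings from Section 3.0): build the index once and query it repeatedly.

import faiss
import numpy as np

# Build the index a single time from all Hadith embeddings
embeddings = np.vstack(hadith_embeddings).astype('float32')
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

def search(query_embedding, k=3):
    """Returns (scores, indices) of the k nearest Hadiths."""
    query = np.asarray([query_embedding], dtype='float32')
    faiss.normalize_L2(query)
    return index.search(query, k)

scores, indices = search(hadith_embeddings[0])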

4.1 | Example Usage

To illustrate the process, consider the following example (see code and output below):

# Find the most similar Hadith to the first Hadith
most_similar_hadith_index = find_most_similar_hadith(hadith_embeddings, hadith_embeddings[0])
print("Original Hadith:")
print(hadiths_v1b2[0])

print("Most similar Hadith:")
print(hadiths_v1b2[most_similar_hadith_index])
Show Output
Sahih Bukhari Volume 1, Book 2, Hadith Number 7
Narated By Ibn ‘Umar: Allah’s Apostle said: Islam is based on (the following) five
(principles):
1. To testify that none has the right to be worshiped but Allah and Muhammad is Allah’s
Apostle.
2. To offer the (compulsory congregational) prayers dutifully and perfectly.
3. To pay Zakat. (i.e. obligatory charity)
4. To perform Hajj. (i.e. Pilgrimage to Mecca)
5. To observe fast during the month of Ramadan.

Sahih Bukhari Volume 1, Book 2, Hadith Number 54
Narated By Jarir bin Abdullah: I gave the pledge of allegiance to Allah’s Apostle for
the following:
1. Offer prayers perfectly.
2. Pay the Zakat. (obligatory charity)
3. And be sincere and true to every Muslim.

This code snippet demonstrates how to find the most similar Hadith to a given Hadith using the find_most_similar_hadith function. In this example, the most similar Hadith to the first Hadith in the dataset is found and printed.

Note

The actual output of the most similar Hadith will depend on the specific Hadiths in your dataset and the performance of the embedding model.
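If retrieval quality matters more than simplicity, one option (a suggestion, not part of the original pipeline) is to swap in a checkpoint trained specifically for sentence similarity; get_embedding works unchanged because it only relies on AutoTokenizer and AutoModel:

from transformers import AutoTokenizer, AutoModel

# Example alternative checkpoint from the Hugging Face hub; it produces
# 384-dimensional vectors, so any Faiss or Pinecone index must match.
alt_model_name = 'sentence-transformers/all-MiniLM-L6-v2'
tokenizer = AutoTokenizer.from_pretrained(alt_model_name)
model = AutoModel.from_pretrained(alt_model_name)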


5.0 | Similarity Search using Pinecone

As referenced in 1.0 | Language Embeddings and Hadiths, one often stores embeddings within a vector database to facilitate efficient retrieval and similarity search. Vector databases are particularly useful when working with high-dimensional data, such as language embeddings, enabling quick and accurate search through large datasets. A widely used vector database, as of the date of this blog post, is Pinecone. Pinecone’s architecture is optimized for scalable vector searches, making it an ideal choice for tasks such as querying religious texts or other structured content.

In the following code snippets, I demonstrate how to use Pinecone to store and query embeddings derived from a collection of Hadiths. After embedding the Hadiths using a language model, the vectors are formatted and upserted into a Pinecone index for storage. I then perform a similarity search to find Hadiths most similar to a specified query vector.

import os
from pinecone import Pinecone

PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]  # read the key from the environment
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("hadith-index")  # name of the Pinecone index

# Pinecone expects plain lists of floats, so convert each NumPy vector
formatted_vectors = [
    {"id": f"hadith_{i}", "values": vector.tolist()}
    for i, vector in enumerate(hadith_embeddings)
]

First, I initialize a Pinecone client using the API key and connect to an index named hadith-index. The Hadith embeddings are then formatted and prepared for insertion into the index.
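The snippet above assumes the index already exists. If it does not, it can be created first; the following is a sketch (an addition, with an example serverless region that would need adjusting to your account):

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=PINECONE_API_KEY)
if "hadith-index" not in pc.list_indexes().names():
    pc.create_index(
        name="hadith-index",
        dimension=768,  # must match the DistilBERT embedding size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # example region
    )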

index.upsert(
    vectors=formatted_vectors,
    namespace="v1b2"
)
Show Output
{'upserted_count': 49}

The embeddings are upserted into the Pinecone index under the namespace v1b2. The upsert operation confirms that 49 vectors have been successfully added to the index.

# Query the index using the stored vector for 'hadith_0'
res = index.query(
    namespace="v1b2",
    id="hadith_0",  # query by the id of an already-stored vector
    top_k=3,
    include_values=False
)

print("Query Results:")
vector_ids = []
for match in res.get('matches', []):
    vector_id = match.get("id")
    score = match.get("score")
    # Recover the integer position encoded in the "hadith_<i>" id
    vector_ids.append(int(vector_id.split("_")[1]))

    print(f"ID: {vector_id}, Score: {score}")
Show Output
Query Results:
ID: hadith_0, Score: 1.00310218
ID: hadith_47, Score: 0.959191442
ID: hadith_8, Score: 0.944332957

In this snippet, I query the Pinecone index for Hadiths similar to hadith_0. The query method retrieves the top 3 most similar Hadiths, displaying their IDs and similarity scores. Note that the first match is hadith_0 itself, since a vector is trivially most similar to itself.

for idx in vector_ids:
    print(hadiths_v1b2[idx])
    print()
Show Output
Sahih Bukhari Volume 1, Book 2, Hadith Number 7
Narated By Ibn ‘Umar: Allah’s Apostle said: Islam is based on (the following) five
(principles):
1. To testify that none has the right to be worshiped but Allah and Muhammad is Allah’s
Apostle.
2. To offer the (compulsory congregational) prayers dutifully and perfectly.
3. To pay Zakat. (i.e. obligatory charity)
4. To perform Hajj. (i.e. Pilgrimage to Mecca)
5. To observe fast during the month of Ramadan.

Sahih Bukhari Volume 1, Book 2, Hadith Number 54
Narated By Jarir bin Abdullah: I gave the pledge of allegiance to Allah’s Apostle for
the following:
1. Offer prayers perfectly.
2. Pay the Zakat. (obligatory charity)
3. And be sincere and true to every Muslim.

Sahih Bukhari Volume 1, Book 2, Hadith Number 15
Narated By Anas: The Prophet said, “Whoever possesses the following three qualities
will have the sweetness (delight) of faith:
1. The one to whom Allah and His Apostle becomes dearer than anything else.
2. Who loves a person and he loves him only for Allah’s sake.
3. Who hates to revert to Atheism (disbelief) as he hates to be thrown into the fire.”

Finally, the integer indices recovered from the retrieved IDs are used to look up the corresponding Hadith texts in the original dataset, specifically the second book of Volume 1 of Sahih Bukhari. The first Hadith in the response is the query Hadith itself, included as a reference point. The other two retrieved Hadiths show strong semantic overlap with the query, as their high similarity scores indicate.

Note

It’s important to note that the query does not necessarily have to be an existing Hadith; it could be any text, and the model would still return the most semantically similar Hadiths in the database. This capability makes Pinecone an effective tool for exploring and retrieving related content from large collections of text.
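As a sketch of that idea (an addition, reusing get_embedding from Section 3.0 and a made-up query string), an arbitrary question can be embedded and passed to Pinecone as a raw vector:

# Embed free-form text and query by vector rather than by stored id
query_text = "What are the pillars of Islam?"
query_vector = get_embedding(query_text).tolist()

res = index.query(
    namespace="v1b2",
    vector=query_vector,
    top_k=3,
    include_values=False,
)
for match in res.get("matches", []):
    print(match["id"], match["score"])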

6.0 | Concluding Remarks

In this project, I explored the application of language embeddings for analyzing and searching through Hadiths from Sahih Bukhari. By preprocessing and splitting the Hadiths, I prepared the text data for embedding creation using a pre-trained language model, DistilBERT. The embeddings captured the semantic and syntactic information of the Hadiths, enabling efficient similarity search both locally through the Faiss library and at scale through the Pinecone vector database. This approach facilitates rapid retrieval of the most relevant Hadiths based on their content, opening up possibilities for applications such as recommendation systems, information retrieval, and clustering. The combination of these techniques demonstrates a powerful method for handling and analyzing religious texts in a meaningful way.


References

  1. pdfplumber, v0.11.2, 2024. https://pypi.org/project/pdfplumber/.
  2. Hugging Face, DistilBERT, 2024. https://huggingface.co/distilbert/distilbert-base-uncased.
  3. Faiss, 2024. https://github.com/facebookresearch/faiss.
  4. Pinecone, 2024. https://www.pinecone.io/.