1.0 | Language Embeddings and Hadiths
Language embeddings have revolutionized how we process and understand text data. This project explores their application in analyzing and searching through Hadiths from Sahih Bukhari. Hadiths from Sahih Bukhari are recorded sayings and actions of Prophet Muhammad, considered highly authentic by Muslims, providing guidance on various aspects of Islamic life. The hadiths are readily available online in PDF format. Their static nature and typically short length make them ideal candidates for text processing, embedding, and subsequent storage in vector databases for efficient similarity searches.
2.0 | Preprocessing & Tokenization of Hadiths
The following Python code demonstrates the process of extracting text from a PDF containing Hadiths, removing irrelevant information, and splitting the text into individual Hadiths:
import pdfplumber

def extract_text_from_pdf(pdf_path):
    """Extracts text from a PDF containing Hadiths.
    Args:
        pdf_path: The path to the PDF file.
    Returns:
        The extracted text content.
    """
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Guard against pages with no extractable text
            text += page.extract_text() or ""
    return text

pdf_path = 'part93-sb.pdf'
text = extract_text_from_pdf(pdf_path)
import re

def remove_string(text, string_to_remove):
    """Removes all occurrences of a specific string from a given text.
    Args:
        text: The input text.
        string_to_remove: The string to be removed.
    Returns:
        The text with the specified string removed.
    """
    return re.sub(re.escape(string_to_remove), '', text)

def split_hadiths(text):
    """Splits the Hadith text into a list of individual hadiths.
    Args:
        text: The text containing all the hadiths.
    Returns:
        A list of strings, where each string is an individual Hadith with its number
        and content.
    """
    hadiths = []
    hadith_start_pattern = r"Sahih Bukhari Volume (\d+), Book (\d+), Hadith Number (\d+)"
    hadith_end_pattern = r"\nSAHIH BUKHARI BOOK 2. BELIEF\n"

    start_index = 0
    for match in re.finditer(hadith_start_pattern, text):
        end_match = re.search(hadith_end_pattern, text[match.end():])
        if end_match:  # Check if a match was found
            end_index = end_match.start() + match.end()
        else:
            end_index = len(text)  # Set end to the end of the text if no match found
        hadith_text = text[match.start():end_index]

        hadiths.append(hadith_text)
        start_index = end_index

    return hadiths

# Assuming you have already extracted the text from your PDF using pdfplumber
text = extract_text_from_pdf(pdf_path)

# Remove unwanted string before splitting
text = remove_string(text, "Hadith Collection | www.HadithCollection.com")

# Split the text into individual hadiths
hadiths_v1b2 = split_hadiths(text)
This code first extracts the text from the PDF using the pdfplumber library. Then, it removes any unwanted strings, such as the collection's website footer, using the remove_string function. Finally, the split_hadiths function employs regular expressions to identify Hadith start and end patterns, splitting the text into a list of individual Hadiths containing their reference numbers and content.
The resulting list, hadiths_v1b2, now contains pre-processed Hadiths ready for the next step in the workflow: embedding creation.
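To make the splitting idea concrete, here is a minimal, self-contained sketch on a made-up two-entry string. The sample text is a placeholder and is not taken from the PDF. The sketch slices from each reference line to the next one, which is a simplification of split_hadiths rather than the same end-pattern logic, and the function name is introduced here for illustration only.

```python
import re

# Same reference-line pattern as in split_hadiths
HADITH_START = r"Sahih Bukhari Volume (\d+), Book (\d+), Hadith Number (\d+)"

def split_on_reference_lines(text):
    """Slice text into chunks that each begin at a reference line."""
    starts = [m.start() for m in re.finditer(HADITH_START, text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]

# Placeholder text for illustration only
sample = (
    "Sahih Bukhari Volume 1, Book 2, Hadith Number 1\n"
    "First placeholder entry.\n"
    "Sahih Bukhari Volume 1, Book 2, Hadith Number 2\n"
    "Second placeholder entry.\n"
)

chunks = split_on_reference_lines(sample)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk keeps its reference line, so the volume, book, and Hadith number travel with the text into later steps.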
3.0 | Embeddings: Transforming Text into Numerical Representations
With the Hadiths tokenized, the next step involves converting them into numerical representations suitable for computational analysis. This process is known as embedding. Embeddings capture the semantic and syntactic information of text, allowing for quantitative comparisons and analysis.
Implementation: I employed a pre-trained language model, DistilBERT (freely available on Hugging Face), to generate dense vector representations for each Hadith. This model has been trained on a large corpus of English text and can capture much of the underlying meaning of the Hadiths.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def get_embedding(text):
    """Generates an embedding for a given text.
    Args:
        text: The input text.
    Returns:
        The embedding as a NumPy array.
    """
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use mean pooling to get the sentence vector
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.squeeze().numpy()

# Generate embeddings for all Hadiths
hadith_embeddings = [get_embedding(hadith) for hadith in hadiths_v1b2]
The resulting hadith_embeddings list contains numerical representations for each Hadith, ready for similarity search and other downstream tasks.
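Mean pooling simply averages the per-token vectors into one vector per text. get_embedding embeds one Hadith at a time, so no padding is involved; if texts were batched instead, padded positions would need to be excluded from the average. The following is a sketch of that masked variant in NumPy with toy numbers. The arrays are made up and stand in for last_hidden_state and the tokenizer's attention mask, and the function name is introduced here for illustration.

```python
import numpy as np

def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors per sequence, ignoring padded positions.

    token_vectors: shape (batch, tokens, dim)
    attention_mask: shape (batch, tokens), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_vectors.dtype)
    summed = (token_vectors * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy batch: 2 sequences, 3 token slots, 2 dimensions
token_vectors = np.array(
    [[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third slot is padding
     [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]],
    dtype=np.float32,
)
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

pooled = masked_mean_pool(token_vectors, attention_mask)
print(pooled)
```

The padded slot in the first sequence does not affect its average, which is the point of applying the mask.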
4.0 | Similarity Search with Faiss
To efficiently find Hadiths similar to a given query, I utilize the Faiss library, optimized for similarity search on large datasets of dense vectors. By indexing the Hadith embeddings, I can rapidly retrieve the most similar Hadiths based on their semantic and syntactic closeness.
import faiss
import numpy as np

def find_most_similar_hadith(hadiths_embeddings, query_embedding):
    """Finds the most similar hadith to a given query embedding using cosine similarity.
    Args:
        hadiths_embeddings: A list of embeddings for all hadiths.
        query_embedding: The embedding of the query hadith.
    Returns:
        The index of the most similar hadith.
    """
    # Normalize embeddings to use cosine similarity (Faiss expects float32 arrays)
    hadiths_embeddings = np.array(hadiths_embeddings, dtype=np.float32)
    faiss.normalize_L2(hadiths_embeddings)

    query_embedding = np.array([query_embedding], dtype=np.float32)
    faiss.normalize_L2(query_embedding)

    # Create a Faiss index
    dimension = hadiths_embeddings.shape[1]
    index = faiss.IndexFlatIP(dimension)  # Use inner product for cosine similarity

    # Add embeddings to the index
    index.add(hadiths_embeddings)

    # Search for the most similar hadith
    k = 2  # Find the top 2 nearest neighbors (including the query itself)
    distances, indices = index.search(query_embedding, k)

    # Return the index of the most similar hadith (excluding the query itself)
    return indices[0][1]
By constructing a Faiss index over the Hadith embeddings, I can efficiently retrieve the most similar Hadiths to a given query, enabling various applications such as recommendation systems, information retrieval, and clustering.
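An inner-product index yields cosine similarity because, once every vector is scaled to unit length, the dot product of two vectors equals the cosine of the angle between them. The following NumPy sketch reproduces that brute-force search on toy vectors without Faiss; the function name and the data are illustrative only.

```python
import numpy as np

def cosine_top_k(vectors, query, k):
    """Return (indices, scores) of the k vectors most similar to query."""
    vectors = np.asarray(vectors, dtype=np.float32)
    query = np.asarray(query, dtype=np.float32)
    # Scale to unit length so the dot product equals cosine similarity
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_vectors @ unit_query
    order = np.argsort(-scores)[:k]  # highest score first
    return order, scores[order]

toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
indices, scores = cosine_top_k(toy_vectors, toy_vectors[0], k=2)
print(indices, scores)
```

As in find_most_similar_hadith, the query vector ranks first when it is part of the collection, so the second entry is the nearest other item.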
4.1 | Example Usage
To illustrate the process, consider the following example (see code and output below):
# Find the most similar Hadith to the first Hadith
most_similar_hadith_index = find_most_similar_hadith(hadith_embeddings, hadith_embeddings[0])
print("Original Hadith:")
print(hadiths_v1b2[0])
print("Most similar Hadith:")
print(hadiths_v1b2[most_similar_hadith_index])
Output:
Sahih Bukhari Volume 1, Book 2, Hadith Number 7
Narated By Ibn ‘Umar: Allah’s Apostle said: Islam is based on (the following) five
(principles):
1. To testify that none has the right to be worshiped but Allah and Muhammad is Allah’s
Apostle.
2. To offer the (compulsory congregational) prayers dutifully and perfectly.
3. To pay Zakat. (i.e. obligatory charity)
4. To perform Hajj. (i.e. Pilgrimage to Mecca)
5. To observe fast during the month of Ramadan.
Sahih Bukhari Volume 1, Book 2, Hadith Number 54
Narated By Jarir bin Abdullah: I gave the pledge of allegiance to Allah’s Apostle for
the following:
1. Offer prayers perfectly.
2. Pay the Zakat. (obligatory charity)
3. And be sincere and true to every Muslim.
This code snippet demonstrates how to find the most similar Hadith to a given Hadith using the find_most_similar_hadith function. In this example, the most similar Hadith to the first Hadith in the dataset is found and printed.
The actual output of the most similar Hadith will depend on the specific Hadiths in your dataset and the performance of the embedding model.
5.0 | Similarity Search using Pinecone
As referenced in 1.0 | Language Embeddings and Hadiths, one often stores embeddings within a vector database to facilitate efficient retrieval and similarity search. Vector databases are particularly useful when working with high-dimensional data, such as language embeddings, enabling quick and accurate search through large datasets. A widely used vector database, as of the date of this blog post, is Pinecone. Pinecone’s architecture is optimized for scalable vector searches, making it an ideal choice for tasks such as querying religious texts or other structured content.
In the following code snippets, I demonstrate how to use Pinecone to store and query embeddings derived from a collection of Hadiths. After embedding the Hadiths using a language model, the vectors are formatted and upserted into a Pinecone index for storage. I then perform a similarity search to find Hadiths most similar to a specified query vector.
import os
from pinecone import Pinecone

# Read the API key from an environment variable
PINECONE_API_KEY = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=PINECONE_API_KEY)
index = pc.Index("hadith-index")  # name of pinecone index

formatted_vectors = [
    {"id": f"hadith_{i}", "values": vector}
    for i, vector in enumerate(hadith_embeddings)
]
First, I initialize a Pinecone client using the API key and connect to an index named hadith-index. The Hadith embeddings are then formatted and prepared for insertion into the index.
index.upsert(
    vectors=formatted_vectors,
    namespace="v1b2"
)
Output:
{'upserted_count': 49}
The embeddings are upserted into the Pinecone index under the namespace v1b2. The upsert operation confirms that 49 vectors have been successfully added to the index.
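With 49 vectors, a single upsert call is enough. For larger collections it is common to send records in batches; the helper below is a sketch of that idea in plain Python. The batch size of 100 is an assumed value, not a figure from this project, the records are placeholders, and the upsert call is shown only as a comment.

```python
from itertools import islice

def batched(records, batch_size=100):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Placeholder records with the same shape as formatted_vectors
records = [{"id": f"hadith_{i}", "values": [0.0, 0.0]} for i in range(250)]

batch_sizes = [len(batch) for batch in batched(records)]
print(batch_sizes)

# Each batch would then be sent with:
#     index.upsert(vectors=batch, namespace="v1b2")
```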
# Fetch the vector for 'hadith_0'
query_id = "hadith_0"
fetch_response = index.fetch(ids=[query_id], namespace="v1b2")

res = index.query(
    namespace="v1b2",
    id="hadith_0",
    top_k=3,
    include_values=True
)

print("Query Results:")
for match in res.get('matches', []):
    vector_id = match.get("id")
    score = match.get("score")
    values = match.get("values")

    print(f"ID: {vector_id}, Score: {score}")
Output:
Query Results:
ID: hadith_0, Score: 1.00310218
ID: hadith_47, Score: 0.959191442
ID: hadith_8, Score: 0.944332957
In this snippet, I query the Pinecone index for Hadiths similar to hadith_0. The query method retrieves the top 3 most similar Hadiths, displaying their IDs and similarity scores. Note that the first result is hadith_0 itself, since any vector is most similar to itself.
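for match in res.get('matches', []):
    # IDs have the form "hadith_<i>"; recover the integer list index
    hadith_index = int(match.get("id").split("_")[1])
    print(hadiths_v1b2[hadith_index])
    print()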
Output:
Sahih Bukhari Volume 1, Book 2, Hadith Number 7
Narated By Ibn ‘Umar: Allah’s Apostle said: Islam is based on (the following) five
(principles):
1. To testify that none has the right to be worshiped but Allah and Muhammad is Allah’s
Apostle.
2. To offer the (compulsory congregational) prayers dutifully and perfectly.
3. To pay Zakat. (i.e. obligatory charity)
4. To perform Hajj. (i.e. Pilgrimage to Mecca)
5. To observe fast during the month of Ramadan.
Sahih Bukhari Volume 1, Book 2, Hadith Number 54
Narated By Jarir bin Abdullah: I gave the pledge of allegiance to Allah’s Apostle for
the following:
1. Offer prayers perfectly.
2. Pay the Zakat. (obligatory charity)
3. And be sincere and true to every Muslim.
Sahih Bukhari Volume 1, Book 2, Hadith Number 15
Narated By Anas: The Prophet said, “Whoever possesses the following three qualities
will have the sweetness (delight) of faith:
1. The one to whom Allah and His Apostle becomes dearer than anything else.
2. Who loves a person and he loves him only for Allah’s sake.
3. Who hates to revert to Atheism (disbelief) as he hates to be thrown into the fire.”
Finally, the IDs of the retrieved Hadiths are used to fetch the corresponding Hadith texts from the original dataset, specifically from the second book in Volume 1 of Sahih Bukhari. The first Hadith in the query response is the Hadith that was initially queried, included here to serve as a reference point. This allows for direct comparison with the other two Hadiths retrieved, which are the most similar to the query Hadith. These two Hadiths demonstrate strong semantic similarity to the query Hadith, as indicated by their high similarity scores, highlighting the close relationship between their content and the original query.
It’s important to note that the query does not necessarily have to be an existing Hadith; it could be any text, and the model would still return the most semantically similar Hadiths in the database. This capability makes Pinecone an effective tool for exploring and retrieving related content from large collections of text.
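A free-text query follows the same path as the stored Hadiths: embed the text with the same model, then pass the resulting vector to the index instead of an ID. The helper below is a hypothetical sketch. query_by_text and its parameters are names introduced here, embed_fn stands for a function such as get_embedding, and index stands for the Pinecone index object used earlier.

```python
def query_by_text(index, embed_fn, text, top_k=3, namespace="v1b2"):
    """Embed free text and return (id, score) pairs for the nearest stored vectors.

    index: an object exposing a Pinecone-style query(vector=..., top_k=..., namespace=...)
    embed_fn: callable mapping a string to a sequence of floats (assumed to be
              the same model that produced the stored embeddings)
    """
    vector = [float(x) for x in embed_fn(text)]  # plain floats for the API call
    response = index.query(vector=vector, top_k=top_k, namespace=namespace)
    return [(m.get("id"), m.get("score")) for m in response.get("matches", [])]

# Hypothetical usage, assuming the objects defined earlier in this post:
#     results = query_by_text(index, get_embedding, "What is Islam based on?")
```

Using the same embedding model for queries and stored texts matters, since vectors from different models are not comparable.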
6.0 | Concluding Remarks
In this project, I explored the application of language embeddings for analyzing and searching through Hadiths from Sahih Bukhari. By preprocessing and splitting the Hadiths, I prepared the text data for embedding creation using a pre-trained language model, DistilBERT. The embeddings captured the semantic and syntactic information of the Hadiths, enabling efficient similarity search through the Faiss library and the Pinecone vector database. This approach facilitates rapid retrieval of the most relevant Hadiths based on their content, opening up possibilities for various applications such as recommendation systems, information retrieval, and clustering. The combination of these techniques demonstrates a practical method for handling and analyzing religious texts in a meaningful way.
References
- pdfplumber, v0.11.2, 2024. https://pypi.org/project/pdfplumber/.
- Hugging Face, DistilBERT, 2024. https://huggingface.co/distilbert/distilbert-base-uncased.
- Faiss, 2024. https://github.com/facebookresearch/faiss.
- Pinecone, 2024. https://www.pinecone.io/.