Monday, 21 October 2024

Building a Retrieval-Augmented Generation (RAG) System for Academic Papers: An In-depth Research Paper Elaboration with Code

Terminology

  1. NLP: NLP stands for Natural Language Processing, a branch of Artificial Intelligence that allows computers to understand, analyze, reason about, and generate human language, and to learn continuously from human feedback for accurate and precise performance.
  2. Token: A token is the fundamental unit of text: an instance of a sequence of characters in a particular document that are grouped together as a useful semantic unit for processing.
  3. Tokenization: Tokenization is defined as the process of converting text into tokens, where a token can be a word, sub-word, or character.
  4. Vector Search: Vector Search is defined as a search technique where texts are converted into mathematical vectors [data points in multi-dimensional space] and information is retrieved based on the similarity between two vectors.
  5. RAG: RAG stands for Retrieval-Augmented Generation, a way of generating responses with a large language model in which generation takes place only after enough context has been retrieved and a semantic relation has been established between the user’s query and the provided documents, which update the model’s knowledge base for that instance.
  6. Cosine similarity: Cosine similarity is a mathematical measure of the similarity between two vectors [text embeddings in vector form], used to measure document/text similarity in text analysis.
  7. Contextually aware text generation: This term means that the response-generating engine [language model] actually attends to the context of the query and provides a contextual response, instead of generating a response based only on the query’s keywords and semantic similarity.
  8. BERT: BERT stands for Bidirectional Encoder Representations from Transformers, an open-source language model based on an encoder-only architecture that can represent text as meaningful embeddings using weights learnt through self-supervised learning during its training phase.
  9. HNSW Index: It stands for Hierarchical Navigable Small World Index, a specialized data structure used for robust vector search that allows retrieval of the most relevant information by organizing the data points/vectors in multi-dimensional space.
  10. Semantic Similarity: This refers to the degree to which two pieces of text are similar in meaning, even if they use different words or phrases.
  11. FAISS: Facebook AI Similarity Search is an open-source library developed by Meta (formerly Facebook) that helps with fast and efficient vector searches.
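
A minimal NumPy sketch of the cosine-similarity measure defined above, for concreteness:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Vectors pointing the same way score 1.0; orthogonal vectors score 0.0.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # 1.0 (up to floating point)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```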

What is Retrieval-Augmented Generation (RAG) and How Does It Transform Academic Research?

RAG is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs’ generative process.

RAG basically involves a two-step operation for generating a response to the user’s query. Retrieval takes place first, which means identifying the most relevant sections of a document, or the most relevant documents, according to the requirements. It is followed by generation, which means receiving the relevant sections/documents and then producing a precise response to the user’s query.
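
The two-step operation can be sketched as follows. Note this is only an illustrative skeleton: the keyword-overlap retriever and the templated generator are stand-ins for the real embedding-based retrieval and LLM call described later in this write-up:

```python
def retrieve(query, documents, k=1):
    """Step 1: rank documents by naive keyword overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(documents, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def generate(query, context):
    """Step 2: stand-in for an LLM call that answers from retrieved context."""
    return f"Answer to {query!r} based on: {' '.join(context)}"

docs = ["RAG retrieves relevant documents first",
        "Tokenization splits text into tokens"]
context = retrieve("how does RAG retrieve documents", docs)
print(generate("how does RAG retrieve documents", context))
```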

Applications of RAG can be found in sectors like customer support [Clinc, Kasisto] and automatic content creation for blogs and web content using a custom knowledge base [Quill, Writesonic, and Wizdom]. RAG addresses problems of large language models, including hallucinations and outdated knowledge, increasing the accuracy and credibility of the generated content.

Specifically, this research paper uses the popular arXiv dataset from Kaggle, developed and distributed by Cornell University. The dataset contains a metadata file in JSON format for the research papers indexed on arXiv.org. The metadata JSON file contains the following fields for each research paper.

  • id: ArXiv ID (can be used to access the paper, see below)
  • submitter: Who submitted the paper
  • authors: Authors of the paper
  • title: Title of the paper
  • comments: Additional info, such as number of pages and figures
  • journal-ref: Information about the journal the paper was published in
  • doi: [Digital Object Identifier](https://www.doi.org)
  • abstract: The abstract of the paper
  • categories: Categories / tags in the ArXiv system
  • versions: A version history
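
The metadata fields above can be streamed with a few lines of Python. This sketch assumes the one-JSON-object-per-line layout of the Kaggle snapshot file and demonstrates it on synthetic records, since the real file is several gigabytes:

```python
import json
import os
import tempfile

def iter_papers(path, limit=None):
    """Stream metadata records: one JSON object per line of the snapshot file."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if limit is not None and i >= limit:
                break
            yield json.loads(line)

# Demo with two synthetic records shaped like the fields listed above.
demo = [{"id": "0704.0001", "title": "A sample paper",
         "abstract": "We study ...", "categories": "math.ST"},
        {"id": "0704.0002", "title": "Another paper",
         "abstract": "We propose ...", "categories": "cs.CL"}]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write("\n".join(json.dumps(r) for r in demo))
papers = list(iter_papers(f.name, limit=1))
os.unlink(f.name)
print(papers[0]["id"], "|", papers[0]["categories"])
```

Streaming line by line keeps memory flat even over millions of records, which matters at the scale of the full arXiv snapshot.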

This research paper emphasizes the value of RAG for searching research papers and for interacting with documents that include content like text, graphs, and tables, both for research purposes and for learning from the papers.

How Vector Search and Semantic Similarity Revolutionize Data Retrieval in RAG Systems

Vector search and semantic similarity have changed the landscape of RAG systems by providing a more effective way to retrieve and process information, which leads to more accurate, precise, and contextually relevant responses.

For performing vector search, creating embeddings is the essential part, and an embedding model is required: the textual data needs to be converted into numerical representations. This research paper first implemented the pioneering embedding model Word2vec, which fell short on sentence-level and paragraph-level context, leading the research to use SBERT [a distilled version of BERT].
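
A sketch of the embedding step. The real call to the sentence-transformers package is shown in comments (the model name all-MiniLM-L6-v2 is an illustrative choice, not necessarily the one used in the paper); the hashed bag-of-words encoder below is only a runnable stand-in so the text-to-vector flow can be executed without a model download:

```python
import numpy as np

# Real usage (requires the sentence-transformers package):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   vectors = model.encode(["first abstract", "second abstract"])

def encode(texts, dim=64):
    """Stand-in encoder: map each text to a fixed-size hashed bag-of-words vector."""
    out = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            out[i, hash(word) % dim] += 1.0
    return out

vecs = encode(["vector search with embeddings", "embeddings enable vector search"])
print(vecs.shape)  # (2, 64)
```

Whatever encoder is used, the downstream pipeline only sees a fixed-dimension matrix of one row per text, which is what the indexing steps below operate on.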

This research paper solved the shortcoming of the small token limit by using the SBERT model to embed individual chunks and then calculating the mean vector across those chunks [a summary of the chunked data as vectors], which made it possible to use multiple chunks of data at a time without any token-limit problems. Vector search and semantic search enabled this RAG system to become independent of any single one of the above methods and to provide hybrid search functionality. Vector search is also a good option in terms of scalability: it is designed to handle large datasets efficiently, allowing seamless performance of the RAG system.
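
The mean-vector trick reduces to one NumPy call per document: embed each chunk, then average the rows into a single document-level vector.

```python
import numpy as np

def mean_vector(chunk_vectors):
    """Collapse the embeddings of a document's chunks into one summary vector."""
    return np.mean(np.asarray(chunk_vectors, dtype=float), axis=0)

chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy chunk embeddings
doc_vec = mean_vector(chunks)
print(doc_vec)  # [0.66666667 0.66666667]
```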

How Do HNSW and FAISS Indexing Boost the Performance of Large-Scale Vector Search Systems?

HNSW and FAISS are two popular indexing techniques that significantly enhance the performance of large-scale vector search systems. These techniques are specifically designed to handle massive datasets efficiently and provide fast retrieval of similar vectors. Naive vector indexing solutions did not provide sufficient scalability; this research paper addressed that problem by introducing a vector index as the solution, which employs specialized data structures like Locality Sensitive Hashing (LSH) and HNSW to organize the vectors and accelerate vector search.

This paper combined multiple search techniques [cosine similarity for semantic comparison, L2 (Euclidean distance between vectors), and HNSW] to build an efficient graph-based index for faster vector search, which also outperformed FlatIndexL2. FAISS reduced the number of comparisons needed and leveraged parallel computing. This system is a perfect fit for applications that require real-time search across massive datasets [millions of research papers from arXiv.org]. HNSW improves accuracy and computational time by focusing on local neighbourhoods, ensuring that even subtle relationships between the data points are captured. The graph-based indexing provided by HNSW ensures that only semantically related vectors are connected, forming local neighbourhoods. When a query is processed, the search begins at a higher level and moves closer to relevant neighbourhoods, gradually refining the results. FAISS search reduced the computational cost, as traditional vector searches rely on brute-force comparisons, which are resource intensive. In contrast, graph-based indexing like HNSW reduces the number of comparisons by narrowing down the search path. Together, they empower RAG systems to efficiently handle real-time, high-volume data retrieval, making them invaluable for applications like academic research, customer support, and knowledge-intensive tasks.
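
A small sketch of the comparison described above: the NumPy code computes exact brute-force L2 search (what a flat L2 index returns), while the equivalent FAISS HNSW calls are shown in comments, since they require the faiss-cpu package:

```python
import numpy as np

rng = np.random.default_rng(0)
xb = rng.standard_normal((1000, 32)).astype("float32")  # database vectors
xq = xb[:3] + 0.01  # queries: slightly perturbed copies of known vectors

# Brute-force L2 baseline: squared distance from every query to every vector.
d2 = ((xb[None, :, :] - xq[:, None, :]) ** 2).sum(-1)
nearest = d2.argmin(axis=1)
print(nearest)  # [0 1 2]

# The approximate FAISS HNSW equivalent (requires the faiss-cpu package):
#   import faiss
#   index = faiss.IndexHNSWFlat(32, 32)  # dimension, M neighbours per node
#   index.add(xb)
#   distances, labels = index.search(xq, 1)
```

The brute-force version compares every query against all 1000 vectors; HNSW's graph traversal visits only a small neighbourhood of candidates, which is where the speedup at arXiv scale comes from.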

How Do Hybrid Search Methods Improve Accuracy in Retrieval-Augmented Generation Systems?

The hybrid search method uses a combination of search techniques, typically integrating semantic search with other similarity metrics such as cosine similarity, L2 distance, and inner product. This integrates the strengths of multiple techniques, ensuring that more relevant information is retrieved even in complex datasets. In this research paper’s methodology, vector-based semantic search finds conceptually related content, while keyword-based searches or clustering help filter results further by topic relevance. FAISS and HNSW also contribute a great deal to this hybrid search through optimized indexing: FAISS handled the large datasets by organizing vectors into efficient clusters for fast search, while HNSW creates a graph-based index, optimizing retrieval by navigating local neighbourhoods of similar content. As a result, the system quickly provides the most relevant papers and sections, ensuring that the researcher gets precise answers.
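
One way to sketch such a hybrid score is a weighted blend of cosine similarity and keyword overlap. The weighting parameter alpha and the toy two-dimensional vectors below are illustrative assumptions, not the paper’s exact formula:

```python
import numpy as np

def hybrid_search(query_vec, query_terms, docs, doc_vecs, alpha=0.7, k=2):
    """Rank docs by alpha * cosine similarity + (1 - alpha) * keyword overlap."""
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    cos = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    kw = np.array([len(set(query_terms) & set(d.lower().split()))
                   / max(len(query_terms), 1) for d in docs])
    score = alpha * cos + (1 - alpha) * kw
    return np.argsort(-score)[:k]  # indices of the k best documents

docs = ["graph based vector index", "keyword search engine", "cooking recipes"]
doc_vecs = [[1.0, 0.2], [0.3, 1.0], [0.0, 0.1]]  # toy embeddings
top = hybrid_search([1.0, 0.1], ["vector", "index"], docs, doc_vecs, k=2)
print(top)
```

The blend lets a document that matches on keywords but embeds imperfectly (or vice versa) still surface, which is the accuracy gain the section describes.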

Performance Evaluation: LLMs and Their Role in RAG Systems

In terms of LLM performance, the research compared multiple models, but OpenAI’s GPT-3.5-turbo emerged as the most reliable for generating accurate responses from retrieved content. GPT-3.5’s ability to maintain information density, subject proportion, and entity recognition was superior to its competitors, making it a natural fit for academic research tasks. However, Meta’s LLaMA 3.1, though promising, underperformed in several areas. It struggled with information density and named entity recognition, producing less coherent responses compared to GPT-3.5. While LLaMA’s subject focus showed potential, outliers in entity recognition and accuracy made it less suitable for handling highly technical academic content. This highlighted the importance of selecting the right LLM for tasks that demand high precision.

What is the Future of Retrieval-Augmented Generation (RAG) Systems in Academic Research and Beyond?

The research highlights the power of RAG (Retrieval-Augmented Generation) systems in making academic research more accessible by combining efficient retrieval with advanced language models. Through extensive experimentation with multiple search methodologies, embeddings, and LLMs, the study demonstrates how systems can be optimized to handle the complexities of large datasets, such as academic papers.

One of the critical takeaways from this research was the importance of multi-layered search strategies. Specifically, the first layer of retrieval (focused on abstracts) worked well in most cases, but there were instances where discrepancies emerged. This raised further questions: Do abstracts truly capture the essence of full papers? Or, is there a risk that some information in the paper’s content might not be reflected accurately in the abstract?

To address this, the team calculated cosine similarity scores between individual PDF chunks and their corresponding abstracts. These scores were averaged to form a mean similarity score, visualized alongside a confidence interval to check for normal distribution. While the majority of documents aligned well with their abstracts, outliers were identified — cases where the abstract and main paper differed significantly. This discrepancy prompted further investigation into whether summarization via abstracts is sufficient for accurate retrieval or if deeper content analysis is required.
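
The mean-similarity-with-confidence-interval check can be sketched as follows. The per-chunk scores here are hypothetical, and the normal-approximation interval is an assumption about the exact statistical method the team used:

```python
import numpy as np

def abstract_alignment(chunk_scores, z=1.96):
    """Mean chunk-vs-abstract cosine score with a normal-approximation 95% CI."""
    s = np.asarray(chunk_scores, dtype=float)
    mean = s.mean()
    half_width = z * s.std(ddof=1) / np.sqrt(len(s))
    return mean, (mean - half_width, mean + half_width)

# Hypothetical cosine similarities between a paper's chunks and its abstract.
scores = [0.82, 0.79, 0.88, 0.84, 0.80]
mean, (lo, hi) = abstract_alignment(scores)
print(round(mean, 3), round(lo, 3), round(hi, 3))
```

Papers whose mean score falls well below the corpus-wide interval are the outlier candidates where the abstract may not faithfully summarize the full text.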

Code
