Monday 21 October 2024

Building a Retrieval-Augmented Generation (RAG) System for Academic Papers: An In-depth Research Paper Elaboration with Code


 

Terminology

  1. NLP: NLP stands for Natural Language Processing, a branch of Artificial Intelligence that allows computers to understand, analyze, reason about, and generate human language, and to keep learning from human feedback for more accurate and precise performance.
  2. Token: A token is the fundamental unit of text: an instance of a sequence of characters in a particular document that are grouped together as a useful semantic unit for processing.
  3. Tokenization: Tokenization is the process of converting text into tokens, where a token can be a word, a sub-word or a character.
  4. Vector Search: Vector search is a search technique in which texts are converted into mathematical vectors [data points in a multi-dimensional space] and information is retrieved based on the similarity between vectors.
  5. RAG: RAG stands for Retrieval-Augmented Generation, a way of generating responses with a large language model in which generation takes place only after enough context has been retrieved, based on the semantic relation between the user’s query and the provided documents. The retrieved documents act as an update to the model’s knowledge base for that instance.
  6. Cosine similarity: Cosine similarity is a mathematical measure of the similarity between two vectors [text embeddings in vector form], computed as the cosine of the angle between them, and is used to measure document/text similarity in text analysis.
  7. Contextually aware text generation: This term means that the response-generating engine [the language model] takes the context of the query into account and provides a contextual response, instead of generating a response based only on the query’s keywords and semantic similarity.
  8. BERT: BERT stands for Bidirectional Encoder Representations from Transformers, an open-source language model based on an encoder-only architecture that can represent text as meaningful embeddings, using the weights learnt through self-supervised learning during its training phase.
  9. HNSW Index: It stands for Hierarchical Navigable Small World index, a specialized data structure used for robust vector search, which allows retrieval of the most relevant information by organizing the data points/vectors in a multi-dimensional space.
  10. Semantic Similarity: This refers to the degree to which two pieces of text are similar in meaning, even if they use different words or phrases.
  11. FAISS: Facebook AI Similarity Search is an open-source library developed by Meta (formerly Facebook) that helps with fast and efficient vector searches.
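
As a small illustration of the Token and Tokenization entries above, the sketch below splits one sentence into word-level and character-level tokens using only the Python standard library. The sample sentence and the regular expression are assumptions chosen for illustration. Sub-word tokenizers, such as the WordPiece tokenizer used by BERT, learn their vocabulary from data and are not reproduced here.

```python
import re

text = "RAG grounds LLMs."

# Word-level tokens: runs of word characters, with punctuation kept as separate tokens.
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Character-level tokens: every character, including spaces, is its own token.
char_tokens = list(text)

print(word_tokens)
print(len(char_tokens))
```

The same text yields a handful of word tokens but many more character tokens, which is one reason the choice of tokenization affects how much text fits within a model’s token limit.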

What is Retrieval-Augmented Generation (RAG) and How Does It Transform Academic Research?

RAG is an AI framework for retrieving facts from an external knowledge base to ground large language models (LLMs) on the most accurate, up-to-date information and to give users insight into LLMs’ generative process.

RAG involves a two-step operation for generating the response to a user’s query. Retrieval comes first: the system identifies the most relevant documents, or the most relevant sections of a document, for the query. Generation follows: the language model receives those documents or sections and generates a precise response to the user’s query.

Applications of RAG can be found in sectors like customer support [Clinc, Kasisto] and automatic content creation for blogs and web content using a custom knowledge base [Quill, Writesonic and Wizdom]. RAG addresses known problems of large language models, including hallucinations and outdated knowledge, which increases the accuracy and credibility of the generated content.

Specifically, this research paper uses the popular arXiv dataset from Kaggle, developed and distributed by Cornell University. The dataset contains a metadata file in JSON format for the research papers indexed on arXiv.org. The metadata JSON file contains the following fields for each research paper.

  • id: ArXiv ID (can be used to access the paper, see below)
  • submitter: Who submitted the paper
  • authors: Authors of the paper
  • title: Title of the paper
  • comments: Additional info, such as number of pages and figures
  • journal-ref: Information about the journal the paper was published in
  • doi: Digital Object Identifier (https://www.doi.org)
  • abstract: The abstract of the paper
  • categories: Categories / tags in the ArXiv system
  • versions: A version history
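
The fields listed above can be read with the standard `json` module. The following is a minimal sketch, assuming the metadata file stores one JSON object per line and that `categories` is a space-separated string. The sample record is made up for illustration and is not taken from the dataset, and `iter_papers` is a hypothetical helper name.

```python
import io
import json

# Assumption: one JSON object per line. This record is invented for illustration.
sample_file = io.StringIO(
    '{"id": "0000.00000", "submitter": "A. Author", "authors": "A. Author, B. Author", '
    '"title": "An Example Title", "comments": "10 pages, 2 figures", "journal-ref": null, '
    '"doi": null, "abstract": "  An example abstract.\\n", "categories": "cs.IR cs.CL", '
    '"versions": [{"version": "v1"}]}\n'
)

def iter_papers(lines):
    """Yield a trimmed-down dict for every non-empty metadata line."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        yield {
            "id": record["id"],
            "title": record["title"].strip(),
            "abstract": record["abstract"].strip(),
            "categories": record["categories"].split(),
        }

papers = list(iter_papers(sample_file))
print(papers[0]["title"], papers[0]["categories"])
```

Reading line by line keeps memory use low, which matters when the real file holds metadata for a very large number of papers.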

This research paper emphasizes the value of RAG for searching research papers and for interacting with documents that include content such as text, graphs and tables, both for research purposes and for learning from the research papers.

How Vector Search and Semantic Similarity Revolutionize Data Retrieval in RAG Systems

Vector search and semantic similarity have changed the landscape of RAG systems by providing a more effective way to retrieve and process information, which leads to more accurate, precise and contextually relevant responses.

Creating embeddings is the essential first step of vector search, and it requires an embedding model: the textual data needs to be converted into numerical representations. This research paper first implemented the pioneering embedding model Word2vec, which fell short on sentence-level and paragraph-level context. That led the research to use SBERT [Sentence-BERT, a BERT-based model tuned to produce sentence embeddings].

The research paper worked around the small token limit by splitting each document into chunks, embedding each chunk with the SBERT model, and calculating the mean of the chunk vectors [a summary of the chunk vectors]. This allows multiple chunks of data to be used at a time without running into the token limit. Combining vector search with semantic search freed the RAG system from depending on any single one of the above methods and gave it hybrid search functionality. Vector search is also a good option in terms of scalability: it is designed to handle large datasets efficiently, allowing seamless performance of the RAG system.
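
The chunk-and-average idea can be sketched in a few lines. This is an illustration under stated assumptions, not the paper’s implementation: `embed_chunk` is a hypothetical stand-in that returns a three-number toy vector where a real system would call an SBERT encoder, and the chunk size of five words is arbitrary.

```python
def chunk_words(text, max_words):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed_chunk(chunk):
    """Hypothetical stand-in for an SBERT encoder: returns a 3-dimensional toy vector."""
    words = chunk.split()
    return [
        float(len(words)),                           # number of words
        sum(len(w) for w in words) / len(words),     # mean word length
        float(sum(w[0].isupper() for w in words)),   # count of capitalised words
    ]

def mean_vector(vectors):
    """Element-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

document = "Retrieval Augmented Generation grounds a language model in retrieved text"
chunks = chunk_words(document, max_words=5)
doc_vector = mean_vector([embed_chunk(c) for c in chunks])

print(len(chunks), doc_vector)
```

Each chunk stays under the size limit of the encoder, while the averaged vector still gives one fixed-length representation per document.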

How HNSW and FAISS Indexing Boost the Performance of Large-Scale Vector Search Systems?

HNSW, a graph-based index, and FAISS, a similarity-search library, are two popular tools that significantly enhance the performance of large-scale vector search systems. Both are designed to handle massive datasets efficiently and provide fast retrieval of similar vectors. Naive vector indexing solutions do not scale well as the dataset grows. The research paper addresses this problem by introducing a vector index that employs specialized data structures such as Locality Sensitive Hashing (LSH) and HNSW to organize the vectors and accelerate vector search.

The paper combined multiple search techniques [cosine similarity for semantic comparison, L2 (Euclidean distance between vectors) and HNSW] to build an efficient graph-based index for faster vector search, which also outperformed FAISS’s brute-force flat L2 index (IndexFlatL2). FAISS reduced the number of comparisons needed and leveraged parallel computing. The system is a good fit for applications that require real-time search across massive datasets [millions of research papers from arXiv.org].

HNSW improves accuracy and computation time by focusing on local neighbourhoods, so that even subtle relationships between data points are captured. Its graph-based index connects each vector to semantically related vectors, forming local neighbourhoods. When a query is processed, the search begins at a higher level of the graph and moves closer to the relevant neighbourhoods, gradually refining the results. Traditional vector search relies on brute-force comparisons, which are resource intensive. In contrast, graph-based indexing like HNSW reduces the number of comparisons by narrowing down the search path, which is how the FAISS search reduced the computational cost. Together, they empower RAG systems to efficiently handle real-time, high-volume data retrieval, making them invaluable for applications like academic research, customer support, and knowledge-intensive tasks.
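
The coarse-to-fine search described above can be illustrated with a deliberately simplified two-layer graph. This is a sketch under stated assumptions and not HNSW itself: real HNSW assigns nodes to layers at random and keeps a list of candidates during search, whereas here the layers are fixed by hand, the links are built by brute force, and the data is a toy set of points on a line.

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_layer(vectors, node_ids, k):
    """Link each node to its k nearest neighbours within one layer (brute force)."""
    graph = {}
    for i in node_ids:
        others = sorted((j for j in node_ids if j != i),
                        key=lambda j: l2(vectors[i], vectors[j]))
        graph[i] = others[:k]
    return graph

def greedy_search(vectors, graph, entry, query):
    """Hop to the neighbour closest to the query until no neighbour is closer."""
    current, best = entry, l2(vectors[entry], query)
    while True:
        candidate = min(graph[current], key=lambda n: l2(vectors[n], query))
        distance = l2(vectors[candidate], query)
        if distance >= best:
            return current
        current, best = candidate, distance

# Toy data: 20 points on a line. The upper layer holds every fifth point.
vectors = [(float(i), 0.0) for i in range(20)]
upper = build_layer(vectors, list(range(0, 20, 5)), k=2)
lower = build_layer(vectors, list(range(20)), k=2)

query = (13.2, 0.0)
entry = greedy_search(vectors, upper, 0, query)        # coarse hops on the sparse layer
nearest = greedy_search(vectors, lower, entry, query)  # refinement on the dense layer
brute = min(range(len(vectors)), key=lambda i: l2(vectors[i], query))

print(entry, nearest, brute)
```

The sparse upper layer covers most of the distance in a few long hops, and the dense lower layer only has to refine the answer locally, which is the intuition behind searching from a higher level down to the relevant neighbourhood.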

How Hybrid Search Methods Improve Accuracy in Retrieval-Augmented Generation Systems?

A hybrid search method uses a combination of search techniques, typically integrating semantic search with similarity metrics such as cosine similarity, L2 distance and inner product. This combines the strengths of multiple techniques, ensuring that more relevant information is retrieved even in complex datasets. In this research paper, vector-based semantic search finds conceptually related content, while keyword-based search or clustering helps filter the results further by topic relevance. FAISS and HNSW also contribute a lot to this hybrid search by optimizing the index: FAISS handles the large dataset by organizing vectors into efficient clusters for fast search, and HNSW creates a graph-based index that optimizes retrieval by navigating local neighbourhoods of similar content. As a result, the system quickly provides the most relevant papers and sections, so that the researcher gets precise answers.
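
One way to picture the combination of vector ranking and keyword filtering is the sketch below. It is an illustration only: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `hybrid_search` with its `top_k` parameter is a hypothetical helper rather than the paper’s code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy corpus: the vectors are invented stand-ins for real embeddings.
papers = [
    {"title": "Graph indexes for vector search", "vector": [0.9, 0.1, 0.0]},
    {"title": "Vector quantization for images", "vector": [0.8, 0.2, 0.1]},
    {"title": "A survey of protein folding", "vector": [0.0, 0.1, 0.9]},
]

def hybrid_search(query_vector, keywords, papers, top_k=2):
    """Rank by cosine similarity, then keep only hits whose title contains a keyword."""
    ranked = sorted(papers, key=lambda p: cosine(query_vector, p["vector"]), reverse=True)
    wanted = {k.lower() for k in keywords}
    return [p["title"] for p in ranked[:top_k]
            if wanted & set(p["title"].lower().split())]

results = hybrid_search([1.0, 0.0, 0.0], ["search"], papers)
print(results)
```

The semantic step narrows the corpus to conceptually close papers, and the keyword step then removes hits that are close in vector space but off-topic for the query terms.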

Performance Evaluation: LLMs and Their Role in RAG Systems

In terms of LLM performance, the research compared multiple models, and OpenAI’s GPT-3.5-turbo emerged as the most reliable for generating accurate responses from retrieved content. GPT-3.5’s ability to maintain information density, subject proportion, and entity recognition was superior to that of its competitors, making it a natural fit for academic research tasks.

Meta’s LLaMA 3.1, though promising, underperformed in several areas. It struggled with information density and named entity recognition, producing less coherent responses compared to GPT-3.5. While LLaMA’s subject focus showed potential, outliers in entity recognition and accuracy made it less suitable for handling highly technical academic content. This highlighted the importance of selecting the right LLM for tasks that demand high precision.

What is the Future of Retrieval-Augmented Generation (RAG) Systems in Academic Research and Beyond?

The research highlights the power of RAG (Retrieval-Augmented Generation) systems in making academic research more accessible by combining efficient retrieval with advanced language models. Through extensive experimentation with multiple search methodologies, embeddings, and LLMs, the study demonstrates how systems can be optimized to handle the complexities of large datasets, such as academic papers.

One of the critical takeaways from this research was the importance of multi-layered search strategies. Specifically, the first layer of retrieval (focused on abstracts) worked well in most cases, but there were instances where discrepancies emerged. This raised further questions: Do abstracts truly capture the essence of full papers? Or, is there a risk that some information in the paper’s content might not be reflected accurately in the abstract?

To address this, the team calculated cosine similarity scores between individual PDF chunks and their corresponding abstracts. These scores were averaged to form a mean similarity score, visualized alongside a confidence interval to check for normal distribution. While the majority of documents aligned well with their abstracts, outliers were identified — cases where the abstract and main paper differed significantly. This discrepancy prompted further investigation into whether summarization via abstracts is sufficient for accurate retrieval or if deeper content analysis is required.
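
The abstract-versus-content check can be sketched as follows. The vectors are hand-written toy values, and the outlier rule (a mean score more than one standard deviation below the average across papers) is an assumption chosen for illustration. The source does not state which threshold the team used.

```python
import math
import statistics

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy embeddings: one abstract vector and several PDF-chunk vectors per paper.
papers = {
    "paper_a": {"abstract": [1.0, 0.0], "chunks": [[0.9, 0.1], [1.0, 0.2]]},
    "paper_b": {"abstract": [0.0, 1.0], "chunks": [[0.1, 0.9], [0.2, 1.0]]},
    "paper_c": {"abstract": [1.0, 0.0], "chunks": [[0.1, 1.0], [0.0, 1.0]]},
}

def mean_abstract_similarity(paper):
    """Average cosine similarity between each chunk and the paper's abstract."""
    return statistics.mean([cosine(chunk, paper["abstract"]) for chunk in paper["chunks"]])

scores = {name: mean_abstract_similarity(p) for name, p in papers.items()}

# Assumed outlier rule: more than one standard deviation below the overall mean.
threshold = statistics.mean(scores.values()) - statistics.stdev(scores.values())
outliers = [name for name, score in scores.items() if score < threshold]

print(scores)
print(outliers)
```

A paper flagged this way is a candidate for deeper, full-text retrieval, since its abstract alone appears to be a poor proxy for its content.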

Code

    Friday 18 October 2024

    Python For Everyone

     

    What is Python Programming Language?

    Python is a high-level programming language known for its simplicity and readability. It was created by Guido van Rossum and first released in 1991. Python is widely used in various fields such as web development, data science, artificial intelligence, scientific computing, automation and more. Python comes with a comprehensive standard library that provides modules and packages for tasks ranging from file I/O to network programming, making it convenient for various applications.


    Who can excel in Python?

    People with the potential to excel in Python tend to have analytical thinking, curiosity, attention to detail, problem-solving skills, adaptability and a passion for technology, along with a little bit of madness for being exceptional. These traits form a solid foundation for acquiring proficiency in Python programming and thriving in the dynamic world of software development.

    How to install Python on your computer?

    • Download the latest Python version for Windows from the official Python website https://www.python.org/.
    • Open the downloaded file from your file manager and double-click it to start the installation process. If prompted by the User Account Control dialog, click ‘Yes’ to allow the installer to make changes to your system.
    • In the installation wizard, click the ‘Install Now’ button to start the installation.
    • Once the installation is complete, you can verify it by opening a command prompt and typing ‘python --version’. This should display the installed Python version.

    Let’s start coding some Python.

    1. Hello, World

    In Python, you can start by printing “Hello World!” to the screen. Open up a text editor like VS Code or PyCharm and type:

    print("Hello, World!")

    Now save the file under a name ending in .py and run it from the editor. You will then see Hello, World! printed on the screen.

    2. Variables in Python

    In Python, a variable is like a container that holds a value. You can think of it as a labeled box where you can store different types of information, such as numbers, text, or even more complex data structures. You need to follow the given guidelines while declaring variables in Python.

    • Variable names can contain letters, digits, and underscores but they cannot start with a digit.
    • Variable names are case-sensitive.
    • Use descriptive names for readability.
    • Use the ‘=’ operator to assign a value to a variable.
    • Python supports various data types: integers, floats, strings, Booleans, lists, tuples, dictionaries, etc.
    • Variables in Python have scope, which defines where they can be accessed from in the code.
    • Variables declared inside a function have local scope, while variables declared outside functions have global scope.

    # Integer variable
    age = 25

    # Float variable
    height = 5.8

    # String variable
    name = "Alice"

    # Boolean variable
    is_student = True

    # Multiple assignment
    x = y = z = 10 # Assigns the value 10 to all three variables

    # Variable reassignment
    x = 5
    y = 3
    z = x + y # z is now assigned the value 8

    # Using variables in print statements
    print("Name:", name)
    print("Age:", age)
    print("Height:", height)
    print("Is student?", is_student)
    print("Sum of x and y:", z)
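
The guidelines above mention local and global scope, which the variable examples do not show. The following is a minimal sketch of the difference. The names `greeting`, `greet` and `message` are chosen for illustration.

```python
greeting = "Hello"  # global scope: defined outside any function

def greet(name):
    message = f"{greeting}, {name}!"  # local scope: exists only while greet() runs
    return message

result = greet("Alice")
print(result)

# 'message' is local to greet(), so referring to it out here raises NameError.
try:
    message
except NameError:
    message_visible = False
else:
    message_visible = True

print(message_visible)
```

The function can read the global variable `greeting`, but the local variable `message` is not visible once the function has returned.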

    Loops in Python

    In Python, loops are like repetitive tasks that help you do something over and over again without having to write the same code multiple times. There are two main types of loops in Python: ‘for’ loops and ‘while’ loops.

    For loop

    A for loop is like a magic spell that repeats a set of instructions for each item in a collection, like a list or a range of numbers.

    # Let's say you have a list of fruits:
    fruits = ["apple", "banana", "cherry"]

    # You can use a for loop to print each fruit:
    for fruit in fruits:
        print(fruit)

    This loop goes through each fruit in the list and prints it. It starts with ‘apple’ then ‘banana’ and finally ‘cherry’.

    While loop

    A while loop is like a never-ending adventure that keeps going until a condition is no longer true.

    # Let's say you want to count from 1 to 5:
    count = 1
    while count <= 5:
        print(count)
        count = count + 1

    This loop starts with a count of 1. As long as the count is less than or equal to 5, it prints the current count and then increases it by 1. It repeats this until the count becomes 6, at which point the condition ‘count <= 5’ becomes false, and the loop stops.

    Conditional Statements

    Conditional statements are like decision-making tools that help your code choose what to do based on certain conditions. The two main types of conditional statements in Python are ‘if’ statements and ‘else’ statements.

    If Statement

    An ‘if’ statement is like a gatekeeper that checks if a condition is true, and if it is, it lets your code do something.

    # Let's say you want to check if a number is greater than 10:
    number = 15
    if number > 10:
        print("The number is greater than 10!")

    In this example, the ‘if’ statement checks if the ‘number’ variable is greater than 10. If it is, the code inside the ‘if’ block gets executed, and you’ll see the message “The number is greater than 10!” printed.

    Else Statement

    An ‘else’ statement is like a backup plan that gets executed when the condition in an ‘if’ statement is not true.

    # Let's say you want to check if a number is greater than 10, and if it's not, you want to do something else:
    number = 5
    if number > 10:
        print("The number is greater than 10!")
    else:
        print("The number is not greater than 10.")

    In this example, if the ‘number’ variable is not greater than 10, the code inside the ‘else’ block gets executed and you’ll see the message “The number is not greater than 10.”

    Basic Data Structures in Python

    List

    A list in Python is like a collection of items stored in a particular order. It’s like having a backpack where you can put different things.

    # Let's create a list of fruits:
    fruits = ["apple", "banana", "cherry"]

    # You can access individual items in the list using their index:
    print(fruits[0]) # Prints "apple"
    print(fruits[1]) # Prints "banana"
    print(fruits[2]) # Prints "cherry"

    # You can also add new items to the list:
    fruits.append("orange")

    # And remove items from the list:
    fruits.remove("banana")

    # Lists are flexible and can hold different types of items, like strings, numbers, or even other lists.

    Tuples

    A tuple is similar to a list, but it’s immutable, which means you can’t change its contents after creating it. It’s like having a sealed envelope with some information inside.

    # Let's create a tuple of coordinates:
    point = (10, 20)

    # You can access individual items in the tuple using their index:
    print(point[0]) # Prints 10
    print(point[1]) # Prints 20

    # Tuples are useful when you want to store a fixed set of values that shouldn't change.

    Dictionaries

    A dictionary in Python is like a phone book where you can look up information using a particular key. It’s like having a key-value pair where each key is unique.

    # Let's create a dictionary of student information:
    student = {"name": "Alice", "age": 10, "grade": 5}

    # You can access values in the dictionary using their keys:
    print(student["name"]) # Prints "Alice"
    print(student["age"]) # Prints 10

    # You can also add new key-value pairs to the dictionary:
    student["school"] = "ABC School"

    # And remove key-value pairs from the dictionary:
    del student["grade"]

    # Dictionaries are great for storing and retrieving data based on specific keys.

    Arrays

    In Python, arrays are similar to lists, but they are usually used in numerical computations. You can use the ‘numpy’ library to work with arrays effectively. Arrays are optimized for numerical operations and can efficiently handle large datasets.

    import numpy as np

    # Let's create an array of numbers:
    numbers = np.array([1, 2, 3, 4, 5])

    # You can perform various numerical operations on arrays:
    print(numbers.sum()) # Prints the sum of all numbers
    print(numbers.mean()) # Prints the mean of all numbers

    # Arrays are powerful for numerical computations and data manipulation.

    Control Flow

    Break Statement

    The ‘break’ statement is like an emergency exit that allows you to immediately stop the execution of a loop, even if the loop condition hasn’t been fully satisfied.

    # Let's say you want to find the first occurrence of a number in a list:
    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    target = 5

    for number in numbers:
        if number == target:
            print("Number found!")
            break # Stop the loop as soon as the number is found

    In this example, when the ‘target’ number is found in the ‘numbers’ list, the ‘break’ statement immediately exits the loop, even though there might be more numbers in the list.

    Continue Statement

    The ‘continue’ statement is like a skip button that allows you to skip the current iteration of a loop and move on to the next one.

    # Let's say you want to print only odd numbers from a list:
    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    for number in numbers:
        if number % 2 == 0: # Check if the number is even
            continue # Skip even numbers and move to the next iteration
        print(number)

    In this example, when an even number is encountered, the ‘continue’ statement skips printing it and moves on to the next iteration of the loop, ensuring that only odd numbers are printed.

    Private Variables in Python

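    Private variables are like secrets that are meant to be hidden from the outside world. They’re meant to be accessed only inside the class where they are defined.

    In Python, private variables are indicated by prefixing the variable name with double underscores ‘__’. Python does not strictly enforce this privacy: the double underscore triggers name mangling, which stores the attribute under a name of the form ‘_ClassName__variable’ so that it is not reachable from outside under its original name.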

    class Car:
        def __init__(self, brand, model):
            self.__brand = brand # Private variable
            self.__model = model # Private variable

        def get_brand(self):
            return self.__brand

        def get_model(self):
            return self.__model

    # Create a Car object
    my_car = Car("Toyota", "Camry")

    # Accessing private variables directly from outside the class will result in an error:
    # print(my_car.__brand) # This will raise an AttributeError

    # But you can access them using public methods:
    print(my_car.get_brand()) # Prints "Toyota"
    print(my_car.get_model()) # Prints "Camry"

    In this example ‘__brand’ and ‘__model’ are private variables of the ‘Car’ class. They cannot be accessed directly from outside the class. However, public methods (‘get_brand()’ and ‘get_model()’) are provided to access these private variables indirectly.

    String Formatting

    String formatting in Python allows you to create formatted strings with placeholders that can be replaced with values. There are multiple ways to format strings in Python, including using the “%” operator, the ‘format()’ method and f-strings (formatted string literals).

    Let’s discuss each method.

    Using ‘%’ Operator

    name = "Alice"
    age = 25
    formatted_string = "My name is %s and I am %d years old." % (name, age)
    print(formatted_string)

    Using ‘format()’ Method

    name = "Bob"
    age = 30
    formatted_string = "My name is {} and I am {} years old.".format(name, age)
    print(formatted_string)

    Using f-strings (Formatted String Literals)

    name = "Charlie"
    age = 35
    formatted_string = f"My name is {name} and I am {age} years old."
    print(formatted_string)

    All three methods achieve the same result, but f-strings are considered more modern and are preferred for their simplicity and readability.

    List Comprehension in Python

    List comprehension in Python is a concise way to create lists based on existing lists or other iterable objects. It allows you to write compact and readable code for creating lists in a single line.

    The basic syntax of list comprehension is:

    new_list = [expression for item in iterable if condition]

    Here is a breakdown of each part:

    • ‘expression’: The expression to be evaluated and added to the new list.
    • ‘item’: The variable representing each item in the iterable.
    • ‘iterable’: The existing list, tuple, string, or any iterable object.
    • ‘condition’: An optional condition that filters the items added to the new list.

    Here are some examples of list comprehension:

    Creating a list of squares:

    numbers = [1, 2, 3, 4, 5]
    squares = [x**2 for x in numbers]
    # squares will be [1, 4, 9, 16, 25]

    Filtering even numbers:

    numbers = [1, 2, 3, 4, 5]
    even_numbers = [x for x in numbers if x % 2 == 0]
    # even_numbers will be [2, 4]

    Converting strings to uppercase:

    words = ["hello", "world", "python"]
    uppercase_words = [word.upper() for word in words]
    # uppercase_words will be ['HELLO', 'WORLD', 'PYTHON']

    Creating a list of tuples:

    numbers = [1, 2, 3]
    squares = [(x, x**2) for x in numbers]
    # squares will be [(1, 1), (2, 4), (3, 9)]

    Lambda Functions

    Lambda functions, also known as anonymous functions or lambda expressions, are small, inline functions that can have any number of arguments but can only have one expression. They are particularly useful when you need a simple function for a short period of time, such as for use as arguments in higher-order functions like ‘map()’, ‘filter()’ and ‘sorted()’, or in situations where defining a named function is unnecessary or cumbersome.

    lambda arguments: expression

    Lambda functions can be used wherever a function object is required. Here are some examples:

    Using lambda with ‘map()’:

    numbers = [1, 2, 3, 4, 5]
    squared = list(map(lambda x: x**2, numbers)) # map() returns an iterator, so wrap it in list()
    # squared will be [1, 4, 9, 16, 25]

    Using lambda with ‘filter()’:

    numbers = [1, 2, 3, 4, 5]
    even_numbers = list(filter(lambda x: x % 2 == 0, numbers)) # filter() also returns an iterator
    # even_numbers will be [2, 4]

    Using lambda with ‘sorted()’:

    words = ["apple", "banana", "cherry", "date"]
    sorted_words = sorted(words, key=lambda x: len(x))
    # sorted_words will be ["date", "apple", "banana", "cherry"]

    Conclusion

    Overall, Python’s simplicity, readability and extensive library ecosystem make it a popular choice for a wide range of applications, including web development, data analysis, machine learning, automation and more. Whether you’re a beginner learning to code or an experienced developer tackling complex problems, Python offers tools and features to meet your needs effectively.