
Vivek Singh

Vivek Singh, a Machine Learning Engineer, enjoys crafting scalable, machine-learning-infused solutions for customers. He's an avid reader of tech blogs and spends his free time building small proof-of-concepts. You might also find him watching or playing football.

Posts by Vivek Singh

Aug 02, 2024 | 2 min. read

Enhance your RAG with RAGA: The Key to Precision and Relevance

What is RAGA?

RAGA, or Retrieval-Augmented Generation Assessment, is your toolkit for ensuring your AI's output meets expectations. It assesses whether the data retrieved and the answers generated by your LLM align with the question. Key metrics include:

Faithfulness: Are the generated claims supported by the context?
Answer Relevancy: Is the generated answer relevant to the question?
Context Relevancy: Does the context pertain to the question?
Context Precision: How accurate is the retrieved context?

Unlocking the RAGA Advantage

By implementing these metrics as custom checks, it's possible to score and rank feature rollouts and enforce strict thresholds before anything goes live. Here's the current state of the application:

Component    Context Precision    Context Relevancy
Retriever    0.57                 0.647

Component    Faithfulness    Answer Relevancy
Generator    0.81            0.82

For the generator, the formulas implemented were:

Faithfulness = (# claims that can be derived from the context) / (# total claims)
Answer Relevancy = mean(cosine_similarity(questions artificially generated from the answer, original question))

For the retriever, the formulas used were:

Context Precision = (# of relevant chunks ranked in the top k) / (total # of chunks)
Context Relevancy = |S| / (total # of sentences in the context)

Here, k is the rank position at which precision is evaluated for the retrieved chunks, and S is the set of sentences in the context that are relevant to the question.

Adjusting parameters like chunk size and overlap led to new experiments and the following results:

Component    Context Precision    Context Relevancy
Retriever    0.53                 0.6

Component    Faithfulness    Answer Relevancy
Generator    0.85            0.86

The custom approach to RAGA has highlighted areas for improvement and identified what can remain unchanged:

Retriever Performance: With context precision at 0.53 and context relevancy at 0.6, there is significant room for improvement. Enhancing the relevance and accuracy of retrieved chunks and further adjusting chunk size and overlap are essential next steps.
Generator Performance: The generator shows promising results, with a faithfulness score of 0.85 and answer relevancy of 0.86, indicating that the LLM generates responses consistent with the provided context and relevant to the questions asked.

RAGA in Action: Elevating AI Excellence

RAGA (Retrieval-Augmented Generation Assessment) metrics are game-changers for improving the quality and reliability of RAG-based AI applications. They provide crucial insights into the performance of both the retriever and generator components, offering a quantitative basis for continuous improvement. Adopting RAGA addresses the "Lost in the Middle" problem and establishes a robust framework for evaluating and enhancing AI applications. This approach ensures more accurate, relevant, and reliable AI-generated responses, leading to a better user experience and increased trust in the system.

Charting the Future with RAGA

Looking ahead, iterating on these metrics and incorporating additional RAGA parameters will continuously refine the RAG pipeline. This commitment to quantifiable quality assurance is essential for navigating the rapidly evolving landscape of AI and LLM applications.

Ready to elevate your RAG game? Dive into the world of RAGA and transform your AI application into a powerhouse of precision and relevancy!
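The post does not include the scoring code, but the four formulas are easy to prototype. Below is a minimal sketch, assuming the claim verdicts, relevance judgments, and embedding vectors are produced upstream by your own LLM and embedding model; the helper inputs here are hypothetical stand-ins, not part of the original pipeline.

```python
import numpy as np

def faithfulness(claim_supported: list[bool]) -> float:
    # Faithfulness = (# claims derivable from the context) / (# total claims).
    # One verdict per claim extracted from the generated answer.
    return sum(claim_supported) / len(claim_supported)

def answer_relevancy(question_emb: np.ndarray, generated_question_embs: list[np.ndarray]) -> float:
    # Mean cosine similarity between the original question and questions
    # artificially generated back from the answer.
    sims = [
        float(np.dot(question_emb, emb) / (np.linalg.norm(question_emb) * np.linalg.norm(emb)))
        for emb in generated_question_embs
    ]
    return float(np.mean(sims))

def context_precision(chunk_relevant: list[bool], k: int) -> float:
    # Context Precision = (# relevant chunks ranked in the top k) / (total # of chunks).
    return sum(chunk_relevant[:k]) / len(chunk_relevant)

def context_relevancy(relevant_sentences: int, total_sentences: int) -> float:
    # Context Relevancy = |S| / (total # of sentences in the context).
    return relevant_sentences / total_sentences

if __name__ == "__main__":
    print(faithfulness([True, True, False, True]))                       # 0.75
    q = np.array([0.1, 0.9, 0.3])
    gen = [np.array([0.2, 0.8, 0.3]), np.array([0.0, 1.0, 0.4])]
    print(round(answer_relevancy(q, gen), 3))
    print(context_precision([True, False, True, False, False], k=3))     # 0.4
    print(context_relevancy(relevant_sentences=6, total_sentences=10))   # 0.6
```

Wiring these scores into a gate in CI, with the thresholds mentioned above, is what turns them from dashboards into release criteria.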

Jul 23, 2024 | 4 min. read

Hidden patterns in embedding optimization — Part I

As Generative AI becomes more prevalent, the need for efficient vector storage in Large Language Model (LLM) applications is more important than ever. Embeddings serve as the foundation of many AI applications, transforming raw data into meaningful vector representations. However, the challenge of managing high-dimensional embeddings calls for optimization strategies that reduce storage costs and improve processing efficiency. This article bridges theory and practice by demonstrating how Principal Component Analysis (PCA) can be used to optimize vector storage, making LLM personalization, especially when using Retrieval-Augmented Generation (RAG), more practical and efficient.

The Challenge of Vector Storage

A key consideration when working with vector databases is storage efficiency. Converting text snippets or images into vectors and storing them in the database can increase infrastructure costs due to the significant size of each vector, especially at high dimensionality. For example, a single 1536-dimensional vector can take around 14 KB. Compare that with the raw data: if each text chunk processed is 1024 characters long, the raw data size is only about 1 KB. This suggests that the text representation might be overparameterized, resulting in unnecessarily large vectors.

Principal Component Analysis to the Rescue

Principal Component Analysis (PCA) captures the essential components of vector representations, effectively reducing dimensionality and significantly cutting storage requirements. This helps tackle the challenge of overly detailed embeddings: PCA can reduce vector dimensionality, capture the most informative aspects, and minimize storage requirements without sacrificing performance.

Image Credit: Principal Component Analysis (PCA) Explained Visually with Zero Math

Optimizing Vector Storage

This section outlines a systematic strategy for improving vector storage efficiency without compromising accuracy. By setting up a vector database, ingesting data with a high-quality vectorizer, conducting iterative analysis, measuring precision, and optimizing dimensions, storage requirements were significantly reduced while maintaining high precision in vector representations.

Vector Database Setup: A vector database was established to store and manage embeddings efficiently.
Data Ingestion: The dataset was ingested into the vector database using the text-embedding-ada-002 vectorizer, known for generating high-quality text embeddings.
Iterative Analysis: A series of analyses were conducted, focusing on calculating precision while systematically reducing vector dimensions. This iterative process allowed observation of how dimensionality reduction affected accuracy.
Precision Measurement: With each iteration of dimension reduction, the precision of the model was measured to understand the trade-off between vector size and accuracy.
Optimization: By analyzing the results of each iteration, the optimal number of dimensions that balanced storage efficiency with precision was identified.

This methodical approach allowed fine-tuning of the embeddings, significantly reducing storage requirements while maintaining a high level of accuracy in the vector representations (a code sketch of this loop appears at the end of this post).

Key Observations

After conducting the experiments, the following results were observed:

Performance Impact on Time
Precision vs. Latency: Reducing dimensionality from 1536 to 512 results in a 14% drop in precision but decreases average latency by 50%, effectively doubling the capacity for calls.
Maintaining Precision: Using 1024 dimensions maintains the same precision while reducing latency by 27%.
Context-Specific Use Cases: Dimensionality below 512 is not ideal for Retrieval-Augmented Generation (RAG) applications that require high context specificity.

Impact on Storage
Storage Efficiency: The storage size required for vectors is reduced by 10% at 1024 dimensions and by 42% at 512 dimensions.
Overhead Consideration: Implementing dimensionality reduction introduces additional storage overhead.

Future Directions
Scaling this methodology to millions of vectors without memory issues.
Identifying optimal datasets for data modeling.
Detecting data drift and determining when to rerun dimensionality reduction.
Establishing the ideal top-k to match your recall rate.
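The exact dataset and pipeline are not published here, but the iterative analysis described above is straightforward to prototype. The sketch below, a minimal example assuming scikit-learn and NumPy, fits PCA on a matrix of stored embeddings, projects queries to a few target dimensionalities, and measures precision@k against the full-dimensional nearest neighbors. The synthetic vectors and the `target_dims` values are placeholders for the real text-embedding-ada-002 corpus and the 1536/1024/512 settings discussed in the post.

```python
# Sketch: PCA-based embedding compression with a precision@k check.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
corpus = rng.normal(size=(5000, 1536)).astype(np.float32)   # stand-in for stored document vectors
queries = rng.normal(size=(100, 1536)).astype(np.float32)   # stand-in for incoming query vectors
top_k = 10

# Ground truth: top-k neighbors in the original, full-dimensional space.
full_index = NearestNeighbors(n_neighbors=top_k, metric="cosine").fit(corpus)
_, true_ids = full_index.kneighbors(queries)

for target_dims in (1024, 512, 256):
    pca = PCA(n_components=target_dims).fit(corpus)          # learn the projection once
    reduced_corpus = pca.transform(corpus)
    reduced_queries = pca.transform(queries)                 # queries must use the same projection

    index = NearestNeighbors(n_neighbors=top_k, metric="cosine").fit(reduced_corpus)
    _, ids = index.kneighbors(reduced_queries)

    # Precision@k: overlap between reduced-space and full-space neighbors.
    overlap = [len(set(a) & set(b)) / top_k for a, b in zip(ids, true_ids)]
    # Naive size estimate; real vector databases add per-vector overhead on top of this.
    storage_ratio = target_dims / corpus.shape[1]
    print(f"{target_dims:>4} dims  precision@{top_k}={np.mean(overlap):.3f}  "
          f"storage ~ {storage_ratio:.0%} of original")
```

With real embeddings rather than random vectors, sweeping `target_dims` in this loop is what surfaces the knee point the post describes, where precision holds steady while storage and latency drop.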