CXL enabled Retrieval-Augmented Generation system
Research Assistant, Advisor: Prof. Mosharaf Chowdhury
Mar 2024-Now
Retrieval-Augmented Generation (RAG) is a popular technique for improving the reliability of Large Language Models (LLMs) by reducing hallucination. Implementing this effectively often requires repeatedly searching through large vector databases. At scale, these vector searches become a substantial computational bottleneck for RAG-enabled LLM inference. In this project we introduce Aether, a prototype system that enhances search efficiency across massive sharded datasets through scheduling optimizations and dynamic file management. Aether is a highly adaptable system that is designed to scale across multi-tier memory systems, such as CXL-enabled clusters. Through comprehensive evaluations across diverse workloads and under varying memory constraints, we demonstrate that Aether significantly outperforms baseline approaches. Leveraging asynchronous I/O, our scheduling optimizations achieve an average throughput improvement of 19.7% over their synchronous counterpart. Additionally, dynamic index file management using a novel LFU+ caching policy outperforms traditional LRU by 14.5% in serving throughput.
Materials