Researchers developed the technique of Retrieval-Augmented Generation (RAG) in the 2020 paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
Before we jump into the code, let's take a moment to answer the following questions:
- Why do we use RAG?
- Why would we run models locally?
- How can I access Llama 2?
- I don't have a GPU, what can I do?
Why do we use RAG?
Picture this: you want to ask an LLM questions about your extensive financial report, such as "What were the earnings for Q1?" or "Why has revenue fallen over the past 4 months?"
Well, you've got a problem, as your financial report was not included in the model's initial training data. Retraining the model to include up-to-date information can be extremely costly, and coupled with having to retrain every quarter, it quickly becomes an impractical approach.
Generally speaking, these models do not have access to the outside world, nor can they readily keep up to date on information.
The solution is to use RAG. By allowing the model to dynamically access external repositories of information, you can provide recent context, allowing the model to answer questions with greater reliability and mitigating the risk of hallucinations.
Why do we run models locally?
The biggest reason is privacy.
When working with clients, especially large ones, privacy becomes a massive point of contention. Sending sensitive data to other companies via API is not something compliance would sign off on, so being able to keep the data and models in house is a priority.
There are some other added benefits too, such as:
- Reduced long term costs
- Greater control and customisation
- Lower latency as data no longer needs to be transmitted
How can I access Llama 2?
Head over to this form here. You'll shortly receive emails from Meta with a unique link to download the weights.
I don't have a GPU, what can I do?
The best option for individuals who can't access a GPU is to use Google Colab. You'll be able to use a T4 GPU absolutely free.
The source code for this project can be found here
Now without further ado let's jump into the code.
Section 1: Dependencies
Here is the list of dependencies used in this project, with links to learn more:
| Dependency | Version |
|---|---|
| transformers | 4.31.0 |
| sentence-transformers | 2.2.2 |
| pinecone-client | 2.2.2 |
| datasets | 2.14.0 |
| accelerate | 0.21.0 |
| einops | 0.6.1 |
| langchain | 0.0.240 |
| xformers | 0.0.20 |
| bitsandbytes | 0.41.0 |
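If you need to create the requirements.txt referenced below yourself, it simply pins the versions listed above:
transformers==4.31.0
sentence-transformers==2.2.2
pinecone-client==2.2.2
datasets==2.14.0
accelerate==0.21.0
einops==0.6.1
langchain==0.0.240
xformers==0.0.20
bitsandbytes==0.41.0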
It is recommended that you use a virtual environment for this project; you can find out how to do that here.
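For example, on Linux or macOS you could create and activate one with the following commands (on Windows the activation script lives under venv\Scripts instead):
python -m venv venv
source venv/bin/activate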
To install dependencies run the following in your terminal
pip install -r requirements.txt
or run the following in a code cell in your Jupyter Notebook
!pip install -r requirements.txt
Section 2: Building the Embedding Pipeline
Consider embeddings to be coordinates of words in an n-dimensional space. Words which are semantically similar, such as 'light, happy, fun', will cluster together, while words with the opposite meaning, such as 'dark, unhappy, boring', may be placed at the opposite end of the spectrum.
To illustrate this we can use the following image by the researchers of the paper Cross-domain sentiment-aware word embeddings for review sentiment analysis.
In this case we will be using the all-MiniLM-L6-v2 model, which maps sentences to a 384-dimensional vector space. You can follow the link and try the hosted Inference API widget to compare sentence similarities.
Let's load the model onto our GPU and embed two example sentences.
Each sentence will be converted into a list whose length is the embedding dimension, which as stated above is 384. See for yourself when we implement this.
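The loading of the embedding model itself isn't shown in the snippet below, so here is a minimal sketch of how embed_model could be initialised, assuming LangChain's HuggingFaceEmbeddings wrapper around sentence-transformers:
from langchain.embeddings import HuggingFaceEmbeddings

# Load all-MiniLM-L6-v2 onto the GPU
embed_model = HuggingFaceEmbeddings(
    model_name='sentence-transformers/all-MiniLM-L6-v2',
    model_kwargs={'device': 'cuda'},
    encode_kwargs={'batch_size': 32}  # batch size here is an arbitrary choice
)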
input:
# Let's now use the model to embed two sentences
docs = [
"this is one document",
"and another document"
]
# Embeddings will be a list where each element is a list of 384 values
embeddings = embed_model.embed_documents(docs)
# Extract the number of dimensions per sentence
number_of_dimensions = len(embeddings[0])
print(f"We have {len(embeddings)} embeddings, each with {number_of_dimensions} dimensions.")
output:
We have 2 embeddings, each with 384 dimensions.
Section 3: Building the Vector Index
In order for the model to successfully retrieve our information, we will need to store our embeddings in a vector database. To do this it is recommended that you use Pinecone's free tier.
It's good practice to store your API key in a .env file in the following format:
PINECONE_API_KEY=YOUR_KEY
First we will instantiate the Pinecone class with your API key and then create a new index for this project.
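The instantiation itself isn't shown in the index-creation snippet below, so here is a minimal sketch of what it could look like, assuming the version of the pinecone client that exposes the Pinecone class and PodSpec used later in this section:
import os
from dotenv import load_dotenv
from pinecone import Pinecone, PodSpec

# Read the API key from the .env file and instantiate the client
load_dotenv()
pinecone = Pinecone(api_key=os.environ.get('PINECONE_API_KEY'))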
Creating an index requires four values:
- index_name: Any arbitrary name that you can use to identify this index
- dimension: The number of dimensions the model creates for each sentence, which in our case is 384
- metric: The method of measurement used to calculate the distance between vectors in the database. We will use cosine, which is typically used in text analysis (see the short sketch after this list). You can learn more here.
- spec: We will use the default spec associated with a free account.
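To make the cosine metric concrete, here is a small sketch of what it computes, using the two example embeddings from Section 2 (plain NumPy, nothing Pinecone-specific):
import numpy as np

a, b = np.array(embeddings[0]), np.array(embeddings[1])

# Cosine similarity: 1.0 for identical directions, near 0.0 for unrelated vectors
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity between the two example sentences: {cosine_similarity:.3f}")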
Let's create our index and connect to it
input:
# Create the index
index_name = 'llama-2-rag'
if index_name not in pinecone.list_indexes().names():
    pinecone.create_index(
        name=index_name,
        dimension=number_of_dimensions,
        metric='cosine',
        spec=PodSpec(environment="gcp-starter")
    )
# Check if the index is ready to use
if pinecone.describe_index(index_name).status['ready']:
    print("Ready to go!")
# Connect to the index
index = pinecone.Index(index_name)
index.describe_index_stats()
output:
Ready to go!
{'dimension': 384,
'index_fullness': 0.0,
'namespaces': {},
'total_vector_count': 0}
Section 4: Load & Embed the Dataset
We will be using the jamescalam/llama-2-arxiv-papers-chunked dataset, which contains excerpts from the Llama 2 paper split into chunks.
Once we obtain the data via the Hugging Face datasets module, we will convert it into a pandas DataFrame and extract the IDs, embeddings and metadata to be sent to Pinecone in batches.
input:
from datasets import load_dataset
data = load_dataset(
'jamescalam/llama-2-arxiv-papers-chunked',
split='train'
)
data = data.to_pandas()
# Iterate through the data in batches (the batch size here is an arbitrary choice; adjust to your memory limits)
batch_size = 32
for i in range(0, len(data), batch_size):
    # Calculate the final index for each batch, avoiding an index error on the final batch
    i_end = min(len(data), i + batch_size)
    # Extract the current batch
    batch = data.iloc[i:i_end]
    # Create a unique ID from doi + chunk-id
    ids = [f"{row['doi']}-{row['chunk-id']}" for _, row in batch.iterrows()]
    # Extract the text data and create embeddings
    texts = [row['chunk'] for _, row in batch.iterrows()]
    embeddings = embed_model.embed_documents(texts)
    # Generate metadata
    metadata = [
        {
            'text': row['chunk'],
            'source': row['source'],
            'title': row['title']
        } for _, row in batch.iterrows()
    ]
    # Upload to Pinecone
    index.upsert(vectors=zip(ids, embeddings, metadata))

index.describe_index_stats()
output:
{'dimension': 384,
'index_fullness': 0.04448,
'namespaces': {'': {'vector_count': 4448}},
'total_vector_count': 4448}
We can see our total_vector_count went from 0 to 4,448.
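Before wiring the index up to an LLM, you can sanity-check retrieval by querying it directly. The following is a sketch that reuses the embed_model and index objects from above:
# Embed a test question and fetch the three most similar chunks
query_vector = embed_model.embed_query("What is Llama 2?")
results = index.query(vector=query_vector, top_k=3, include_metadata=True)

for match in results.matches:
    print(match.score, match.metadata['title'])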
Section 5: Initialize the Large Language Model
Now that our vector index has been set up, the next step is to load our LLM along with its tokenizer.
We will also be using the bitsandbytes library to quantize the model so that it fits in less GPU memory.
Quantization is the process of reducing the precision of tensors to lower memory requirements and get faster inference from a model. This comes at a cost of some performance, but for our use case this is okay.
Below I've demonstrated what happens to a tensor value after quantization.
| Weight Value Before Quantization | Weight Value After Quantization |
|---|---|
| 0.0238434627463647 | 0.02384346 |
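The 4-bit quantization that bitsandbytes performs is more involved than a simple cast, but you can get a feel for the precision loss by casting a value to a lower-precision dtype in PyTorch. A small illustrative sketch:
import torch

# A weight stored at full (32-bit) precision
weight = torch.tensor(0.0238434627463647, dtype=torch.float32)

# Casting to bfloat16 keeps far fewer significant digits
weight_bf16 = weight.to(torch.bfloat16)

print(f"float32:  {weight.item():.16f}")
print(f"bfloat16: {weight_bf16.item():.16f}")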
First, we will set up the quantization configuration from bitsandbytes with the following settings:
- load_in_4bit=True: This will set weights to 4-bit precision
- bnb_4bit_quant_type='nf4': This will use the nf4 scheme, the normalised 4-bit data type
- bnb_4bit_use_double_quant=True: This will allow us to use nested quantization, where the quantization constants are quantized again
- bnb_4bit_compute_dtype=bfloat16: This will set computations to use the bfloat16 data type
Model configuration will simply include the name of the model and your Hugging Face access token.
Let's use these configurations to load our model and set it to evaluation mode.
input:
import os
from dotenv import load_dotenv
from torch import cuda, bfloat16
import transformers
model_id = 'meta-llama/Llama-2-13b-chat-hf'
# Set quantization configuration
quantization_config = transformers.BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=bfloat16
)
# Load Hugging Face Token
load_dotenv()
hugging_face_token = os.environ.get('HF_AUTH_TOKEN')
# Set model configuration
model_config = transformers.AutoConfig.from_pretrained(
pretrained_model_name_or_path=model_id,
token=hugging_face_token
)
# Load model with quantization and model configurations
model = transformers.AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path=model_id,
trust_remote_code=True,
config=model_config,
quantization_config=quantization_config,
device_map='auto',
token=hugging_face_token
)
# Set model to evaluation mode
model.eval()
# Check that a CUDA-capable GPU is available
print(cuda.is_available())
Now that the model is loaded, let's load the tokenizer for the model and construct the pipeline. We will also test to see if our implementation is working.
# Load the Tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
model_id,
token=hugging_face_token
)
# Construct the Pipeline
from langchain.llms import HuggingFacePipeline
generate_text = transformers.pipeline(
model=model, tokenizer=tokenizer,
return_full_text=True, # langchain expects the full text
task='text-generation',
temperature=0.01, # 'randomness' of outputs, 0.0 is the min and 1.0 the max
max_new_tokens=512, # max number of tokens to generate in the output
repetition_penalty=1.1 # without this output begins repeating
)
llm = HuggingFacePipeline(pipeline=generate_text)
llm(prompt="Explain to me the difference between nuclear fission and fusion.")
Section 6: Implementing RAG
Now that we have our model loaded let's allow our model to access the Vector Index.
We will create a Pinecone object and connect it to a LangChain pipeline so that querying is easier.
from langchain.vectorstores import Pinecone
from langchain.chains import RetrievalQA
# The metadata field that contains the raw text of each chunk
text_field = 'text'
vectorstore = Pinecone(
index,
embed_model.embed_query,
text_field
)
rag_pipeline = RetrievalQA.from_chain_type(
llm=llm, chain_type='stuff',
retriever=vectorstore.as_retriever()
)
# Use our RAG pipeline
rag_pipeline('what is so special about llama 2?')
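If you also want to see which chunks the answer was grounded in, RetrievalQA can return the retrieved documents alongside the answer. Here is a sketch reusing the objects above:
# Rebuild the pipeline so that it also returns the retrieved chunks
rag_pipeline_with_sources = RetrievalQA.from_chain_type(
    llm=llm, chain_type='stuff',
    retriever=vectorstore.as_retriever(),
    return_source_documents=True
)

response = rag_pipeline_with_sources({'query': 'what is so special about llama 2?'})
print(response['result'])
for doc in response['source_documents']:
    print(doc.metadata['title'])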
There we go! You are now able to provide large language models with additional context without the hassle of re-training.