How to get started with AI, ML and Data Science, particularly LLMs and RAG?
The purpose of this article is to help someone enter the emerging fields of AI/ML, particularly LLMs (Large Language Models).
What are Large Language Models?
You must have heard of ChatGPT. The technology upon which ChatGPT is built is called an LLM. The particular LLM that powers ChatGPT was trained on a large chunk of the Internet by a company called OpenAI.
But OpenAI is not the only company that has its own LLM. Meta (Facebook) has its amazing series of LLMs called Llama. Google has the Gemini series. There are several more companies, like Anthropic with its Claude models, that have phenomenal LLMs.
There are thousands of such LLMs. A really good place to find them and use them in your own projects is Hugging Face.
What is the difference between an LLM and a ChatGPT-like chatbot?
An LLM is a huge neural network with billions of parameters, trained on thousands of GPUs on a huge amount of data for weeks or months. Training an LLM can cost millions of dollars, so only very large companies like OpenAI, Google and Meta can do it.
A ChatGPT-like chatbot is an application built on top of an LLM. But a chatbot is not the only application that can be built on an LLM; there could be many others.
How can I build an Application on top of an LLM?
First, I want to explain what you can build on top of an LLM. Honestly, the rise of LLMs is like the rise of the Internet: there could be millions of applications that we don't see yet, and millions of products will be built on LLMs. If you want some examples, head to Y Combinator's website and check out the companies they have funded; you'll find hundreds of ideas.
One way to build an application on top of an LLM is to host an LLM yourself on the cloud, but this approach could be super expensive for most folks.
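If you do want to experiment with self-hosting, open-weight models from Hugging Face can be run locally with the transformers library. Here is a minimal sketch, assuming transformers and torch are installed; it uses gpt2 only because it is small enough to run on a laptop, not because it is a strong model:

```python
# Minimal sketch of running an open-weight model locally with Hugging Face transformers.
# Assumes `pip install transformers torch`; gpt2 is used here only because it is tiny.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Once upon a", max_new_tokens=10)
print(result[0]["generated_text"])
```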
Another way is to use the APIs provided by companies like OpenAI or Google (for Gemini). They charge you per token. A token is basically a small chunk of text, roughly a word or part of a word, that you send to the LLM or that the LLM produces.
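To get a feel for what counts as a token, you can count tokens locally with OpenAI's tiktoken tokenizer. A small sketch, assuming `pip install tiktoken`:

```python
# Count tokens the way recent OpenAI models do, using the tiktoken tokenizer.
# Assumes `pip install tiktoken`.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by many recent OpenAI models
text = "Once upon a time there was a language model."
tokens = enc.encode(text)
print(len(tokens), "tokens:", tokens)
print([enc.decode([t]) for t in tokens])  # see how the text is split into tokens
```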
There are three primary ways you will be charged: when you create embeddings, when you do inference, and when you fine-tune a model.
Let's break these concepts down one by one:
What are vector embeddings?
The thing with computers is that they can only understand numbers; they cannot understand words. But language consists of words, not numbers. So engineers came up with a way to represent words as numbers. These representations are called embeddings: it's basically the process of embedding words as numbers.
Embedding models are themselves trained models, and there are many different ones; you can find a lot of them on Hugging Face. OpenAI provides its own embedding models (for example, text-embedding-3-small) through its API. (Tiktoken, which you may also hear about in this context, is OpenAI's tokenizer rather than an embedding model.)
Each embedding is basically a vector in an n-dimensional space. So, for example, "cat" could be [1, 2, 2] and "dog" could be [3, 2, 1]. Words with similar meanings tend to have a smaller distance from each other in this space, and words with different meanings are far apart.
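To make the "distance" idea concrete, here is a toy sketch that compares made-up 3-dimensional vectors with cosine similarity. Real embeddings have hundreds or thousands of dimensions; these numbers are invented purely for illustration:

```python
# Toy illustration of vector similarity; the vectors are made up, not real embeddings.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.array(a, dtype=float), np.array(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, car = [1, 2, 2], [3, 2, 1], [9, 0, 1]
print(cosine_similarity(cat, dog))  # higher score: related meanings
print(cosine_similarity(cat, car))  # lower score: unrelated meanings
```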
OpenAI provides an API to convert text to embeddings, which is really important to understand in order to build applications.
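Here is a minimal sketch of calling that embeddings API with the official openai Python package (v1.x), assuming OPENAI_API_KEY is set in your environment and using the text-embedding-3-small model:

```python
# Minimal sketch: convert text to a vector embedding with the OpenAI API (openai>=1.0).
# Assumes OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="The cat sat on the mat.",
)
vector = response.data[0].embedding
print(len(vector))  # dimensionality of the embedding vector
```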
What is inference?
The process of getting a prediction from an LLM is called inference. Basically, the way LLMs work is that they predict the next token. So, for example, if you feed "Once upon a" into an LLM, the LLM will likely predict "time". This type of prediction is called inference. OpenAI provides an API for inference.
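A minimal sketch of doing inference through OpenAI's chat completions API (openai Python package v1.x, OPENAI_API_KEY set; the model name is just one inexpensive option):

```python
# Minimal sketch: ask an LLM to continue a phrase via the OpenAI chat completions API.
# Assumes openai>=1.0 and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Complete this phrase: Once upon a"}],
)
print(response.choices[0].message.content)
```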
What is fine-tuning?
Fine-tuning means changing the weights of a neural network by training the LLM on your custom data, so that its outputs align more closely with that data. Fine-tuning can be expensive because you need to retrain the model and then host the retrained version. The alternative to fine-tuning for building applications on your custom data is RAG.
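For a sense of what fine-tuning looks like in practice, here is a rough sketch of the flow with OpenAI's fine-tuning API (openai v1.x; "training_data.jsonl" is a hypothetical file of chat-formatted training examples you would prepare yourself, and the base model name is just one fine-tunable option):

```python
# Rough sketch of starting a fine-tuning job with the OpenAI API (openai>=1.0).
# "training_data.jsonl" is a hypothetical file of chat-formatted training examples.
from openai import OpenAI

client = OpenAI()

# Upload the training data, then start a fine-tuning job on a base model.
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```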
What is RAG (Retrieval Augmented Generation)?
Suppose you have a large amount of your own data. It can be in the form of PDFs, spreadsheets or databases. How can you chat with this custom data? One option is to fine-tune the LLM with this data. But as mentioned, that can be expensive, and it can actually also be ineffective.
The second, and more prominent, option is to use Retrieval Augmented Generation. This technique uses the vector embeddings and LLM inference mentioned earlier.
For example, suppose you have a large PDF and you want to chat with it. You can't feed this large PDF into the LLM directly because LLMs usually have token limits. So you first divide the PDF into small chunks. Then you convert all these chunks into vector embeddings using an embedding model such as OpenAI's text-embedding-3-small. These vector embeddings are then stored in a vector database like Pinecone, Chroma or Milvus.
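As a simple illustration of the chunking step, here is a sketch that splits a long text into overlapping fixed-size character chunks. Real projects often use smarter splitters (by paragraph, sentence or token count), but the idea is the same:

```python
# Simple sketch: split a long document into overlapping character-based chunks.
def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

document = "your extracted PDF text " * 200  # stand-in for text pulled from a PDF
print(len(chunk_text(document)), "chunks")
```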
Now suppose the user asks a question. You convert this question into a vector embedding as well, and as a first step you do a similarity search against all the chunks you stored earlier. You take the most similar chunks, combine them with the question, and send that to the LLM for inference.
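Putting the pieces together, here is a hedged end-to-end sketch of that retrieval step using OpenAI embeddings and a plain in-memory similarity search. In a real project, a vector database like Pinecone, Chroma or Milvus would replace the numpy part; the chunk texts are invented for illustration. Assumes openai>=1.0, numpy, and OPENAI_API_KEY set:

```python
# Sketch of RAG retrieval: embed chunks, embed the question, pick the closest chunks,
# and send them to the LLM together with the question. A simple in-memory search
# stands in for a vector database here.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

chunks = [
    "Our refund policy allows returns within 30 days.",
    "Shipping takes 3-5 business days within the US.",
    "Support is available by email at support@example.com.",
]  # hypothetical chunks taken from your documents
chunk_vectors = embed(chunks)

question = "How long do I have to return a product?"
question_vector = embed([question])[0]

# Cosine similarity between the question and every stored chunk.
scores = chunk_vectors @ question_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(question_vector)
)
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]

prompt = "Answer using only this context:\n" + "\n".join(top_chunks) + "\n\nQuestion: " + question
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print(answer.choices[0].message.content)
```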
In this process there are some variables you can play with: you can chunk the text in multiple ways (for example, smaller or larger chunks), and you can also do the similarity search in multiple ways. Two very popular libraries for dealing with RAG are LangChain and LlamaIndex.
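As one example of how these libraries help, LangChain ships ready-made text splitters so you don't have to write your own chunker. A small sketch (import paths vary across LangChain versions; this one assumes the langchain-text-splitters package is installed):

```python
# Sketch: chunking with LangChain's RecursiveCharacterTextSplitter.
# Assumes `pip install langchain-text-splitters`; import paths differ by version.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text("your extracted PDF text " * 200)
print(len(chunks), "chunks")
```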
In the next article we will dive more deeply into RAG.