How to Improve Your OpenAI Embeddings?


1. Background of the RAG Pipeline

Retrieval Augmented Generation (RAG) is a prominent framework that bridges Large Language Models (LLMs) with external knowledge bases. Building a RAG pipeline typically involves the following steps:

  1. First convert the external knowledge passages into vector embeddings using a pre-trained embedding model.
  2. When a query is posed, convert the query into a vector form using the same embedding model.
  3. Retrieve the top-k knowledge passages with the highest embedding similarity to the query.
  4. Supply the query, together with the retrieved passages as contextual information, to the LLM, which then generates the response.

The RAG pipeline can effectively reduce the propensity of the LLM to produce 'hallucinated' content and offers an efficient adaptation of the LLM to varied applications.
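The four steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, the passages are made up, and the final LLM call is left out so that the example runs offline.

```python
import zlib
import numpy as np

def toy_embed(texts, dim=256):
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            vecs[i, zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)

def retrieve(query, passages, passage_vecs, k=2):
    """Steps 2-3: embed the query with the same model and return the top-k passages."""
    query_vec = toy_embed([query])[0]
    scores = passage_vecs @ query_vec          # cosine similarity (vectors are unit length)
    top = np.argsort(-scores)[:k]
    return [passages[i] for i in top]

def build_prompt(query, retrieved):
    """Step 4: pack the retrieved passages and the query into one prompt for the LLM."""
    context = "\n".join(f"- {p}" for p in retrieved)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

passages = [
    "embedding models map text to vectors",
    "the cat sat on the mat",
    "stock prices fell sharply today",
]
passage_vecs = toy_embed(passages)             # step 1: embed the knowledge passages once

query = "how do embedding models map text"
retrieved = retrieve(query, passages, passage_vecs, k=2)
prompt = build_prompt(query, retrieved)
print(prompt)
```

In a real pipeline, `toy_embed` would be replaced by calls to the chosen embedding model, and `prompt` would be sent to the LLM.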

Get early access to Vectify's RAG platform. We provide a hassle-free hosted retrieval API for your data and AI applications.

2. Fine-Tune Embedding Models Without Labels

There are many pre-trained embedding models available for quick deployment, such as the black-box OpenAI/text-embedding-ada-002 and the open-source embedding models listed on the Hugging Face MTEB leaderboard. These embedding models are pre-trained on general corpora, so they are usually not optimal for your specific domain. One way to improve embedding performance is to fine-tune the embedding model on domain-specific documents. However, the classic fine-tuning method relies on ranking labels for each query, which are hard to obtain in practice. Recently, a blog post from LlamaIndex introduced the following method to fine-tune embedding models without labels:

  1. For each chunk, use the LLM to generate hypothetical queries that align with the content of the chunk, which gives us the training data pairs: $(\text{query}_1, \text{chunk}_1), \cdots, (\text{query}_N, \text{chunk}_N)$.
  2. Fine-tune the embedding model to optimize the following criteria:
    • The embedding of $\text{query}_n$ should be closer to the embedding of its corresponding $\text{chunk}_n$;
    • The embedding of $\text{query}_n$ should be distanced from the embeddings of other chunks: $\text{chunk}_m$ where $m \neq n$.

In the machine learning literature, this training method is also referred to as self-supervised contrastive learning [1], and it can help the neural network learn better embedding representations for domain-specific data.
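One common way to express the two criteria as a single objective is an in-batch cross-entropy loss, where each query treats its own chunk as the positive and every other chunk in the batch as a negative. The sketch below is an illustration under assumptions: the function name, the cosine scoring, and the `scale` value are illustrative choices, and the embeddings are random stand-ins rather than model outputs.

```python
import numpy as np

def in_batch_contrastive_loss(query_vecs, chunk_vecs, scale=20.0):
    """Cross-entropy over similarity scores; for row n the correct 'class' is chunk n."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = scale * (q @ c.T)                              # (N, N) scaled cosine similarities
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))              # positives sit on the diagonal

rng = np.random.default_rng(0)
chunks = rng.normal(size=(4, 8))                            # stand-in chunk embeddings
aligned_queries = chunks + 0.1 * rng.normal(size=(4, 8))    # query n is close to chunk n
misaligned_queries = aligned_queries[::-1]                  # query n is close to a different chunk

loss_aligned = in_batch_contrastive_loss(aligned_queries, chunks)
loss_misaligned = in_batch_contrastive_loss(misaligned_queries, chunks)
print(loss_aligned, loss_misaligned)
```

The loss is small when each query is nearest to its own chunk and large otherwise, so minimizing it pulls matching pairs together and pushes non-matching pairs apart.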

In the LlamaIndex blog post, a pre-trained BAAI/bge-small-en is used as the base model and fine-tuned on a domain-specific training dataset. This model is lightweight and can be easily fine-tuned on a personal laptop. The following figure shows the test retrieval accuracy evaluated after every training epoch, where we can see that fine-tuning provides a significant boost in retrieval performance (Epoch 0 represents the performance of the pre-trained model). Please refer to the original LlamaIndex blog post for more details.


However, even after fine-tuning, the retrieval performance is still worse than that of the OpenAI/text-embedding-ada-002 model. This is because the BAAI/bge-small-en model is not as powerful as OpenAI/text-embedding-ada-002. Is there any way to take advantage of both the powerful pre-trained capabilities of OpenAI's model and the flexibility of customized fine-tuning? The answer is yes! We introduce the fine-tuning augmented embedding method in the next section.

3. Improving Your OpenAI Embeddings with Fine-Tuning

In an ideal scenario, we would like to directly fine-tune OpenAI's embedding model to enhance its performance on domain-specific data. However, since OpenAI's embedding model operates as a black box and no fine-tuning API is provided, we are limited to obtaining embeddings from its service.

To circumvent this limitation, we propose augmenting OpenAI's embeddings using a trainable, open-source model, which can be represented as

$$\text{augmented\_embedding} = [\text{openai\_embedding}, \text{finetunable\_embedding}].$$

For example, the OpenAI/text-embedding-ada-002 model has an output embedding size of 1536. When it is augmented with the trainable BAAI/bge-small-en model, which has an output embedding dimension of 384, the augmented embedding has a dimension of $1536 + 384 = 1920$. With this approach, we fine-tune only the trainable part of the embedding model, so the computational demands stay as low as in the traditional fine-tuning approach, making it feasible to train even on a personal laptop! We term this approach Fine-tuning Augmented Embedding (FAE).
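The concatenation can be illustrated with NumPy. In this sketch the vectors are random stand-ins for real model outputs, and normalizing each part to unit length is an assumption made for illustration. Only the two dimensions, 1536 and 384, come from the text.

```python
import numpy as np

OPENAI_DIM = 1536   # output size of OpenAI/text-embedding-ada-002
BGE_DIM = 384       # output size of BAAI/bge-small-en

rng = np.random.default_rng(0)

def random_unit(dim):
    """Random unit vector standing in for a real embedding."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

# Stand-ins for the embeddings of one query and one chunk from each model
openai_q, openai_c = random_unit(OPENAI_DIM), random_unit(OPENAI_DIM)
bge_q, bge_c = random_unit(BGE_DIM), random_unit(BGE_DIM)

# augmented_embedding = [openai_embedding, finetunable_embedding]
aug_q = np.concatenate([openai_q, bge_q])
aug_c = np.concatenate([openai_c, bge_c])
print(aug_q.shape)  # (1920,)

# The dot product of concatenated vectors is the sum of the per-part dot products
combined = aug_q @ aug_c
parts = openai_q @ openai_c + bge_q @ bge_c
print(np.isclose(combined, parts))  # True
```

Because the dot-product score splits into a fixed OpenAI term plus a trainable term, fine-tuning the small model shifts the overall ranking without touching the OpenAI embeddings.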

4. Demonstration

For demonstration, we augment the OpenAI/text-embedding-ada-002 embeddings with those from BAAI/bge-small-en, and we fine-tune only the augmenting model (BAAI/bge-small-en) on our training dataset. The training objective and the construction of the training/evaluation datasets remain the same as described in the LlamaIndex blog post.
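To make "fine-tune only the augmenting model" concrete, here is a toy sketch of the training loop. It is not the code from the demonstration repository. The frozen block uses random stand-ins for cached OpenAI embeddings, and a hypothetical linear map `W` replaces the small trainable model so that the gradient can be written by hand. The loss is an in-batch cross-entropy on dot-product scores, which is an assumption about the exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_FROZEN, D_IN, D_OUT = 8, 32, 16, 8

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Frozen part: stand-ins for OpenAI embeddings, computed once and reused at every step
frozen_c = normalize(rng.normal(size=(N, D_FROZEN)))
frozen_q = normalize(frozen_c + 0.2 * rng.normal(size=(N, D_FROZEN)))
frozen_scores = frozen_q @ frozen_c.T            # constant during training

# Trainable part: hypothetical linear map W standing in for the small model
feat_c = rng.normal(size=(N, D_IN))
feat_q = feat_c + 0.5 * rng.normal(size=(N, D_IN))
W = 0.1 * rng.normal(size=(D_IN, D_OUT))

def loss_and_grad(W):
    tq, tc = feat_q @ W, feat_c @ W              # trainable query / chunk embeddings
    scores = frozen_scores + tq @ tc.T           # dot product of the concatenated embeddings
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(np.diag(probs)))      # query n should pick chunk n
    g = (probs - np.eye(N)) / N                  # d loss / d scores
    grad_W = feat_q.T @ (g @ tc) + feat_c.T @ (g.T @ tq)
    return loss, grad_W

losses = []
for step in range(200):
    loss, grad_W = loss_and_grad(W)
    losses.append(loss)
    W = W - 0.01 * grad_W                        # only W is updated; the frozen block never is

print(losses[0], losses[-1])
```

The frozen scores enter the loss as constants, so they shape which pairs the trainable part has to correct but receive no gradient themselves.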

Please visit our GitHub repo for the open-source FAE demonstration!

The following Figure shows the test retrieval accuracy evaluated after every training epoch.


We can see that before fine-tuning starts, augmenting the OpenAI/text-embedding-ada-002 embeddings with the BAAI/bge-small-en model yields worse retrieval performance than using the OpenAI/text-embedding-ada-002 model alone. However, as fine-tuning proceeds, the retrieval performance of FAE gradually surpasses that of OpenAI/text-embedding-ada-002!
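The text does not spell out how retrieval accuracy is computed. A common choice, assumed here, is the hit rate at k: the fraction of test queries whose own chunk appears among the k most similar chunks. The function name and the toy one-hot embeddings below are illustrative only.

```python
import numpy as np

def hit_rate_at_k(query_vecs, chunk_vecs, k=5):
    """Fraction of queries whose matching chunk (same row index) is in the top-k results."""
    scores = query_vecs @ chunk_vecs.T                    # (num_queries, num_chunks)
    top_k = np.argsort(-scores, axis=1)[:, :k]            # indices of the k best chunks
    targets = np.arange(len(query_vecs))[:, None]
    return float((top_k == targets).any(axis=1).mean())

chunk_vecs = np.eye(4)                                    # four toy chunk embeddings
good_queries = np.eye(4)                                  # query n points exactly at chunk n
bad_queries = np.roll(np.eye(4), 1, axis=0)               # query n points at a different chunk

print(hit_rate_at_k(good_queries, chunk_vecs, k=1))  # 1.0
print(hit_rate_at_k(bad_queries, chunk_vecs, k=1))   # 0.0
```

Evaluating this metric on held-out (query, chunk) pairs after each epoch would produce a curve like the one described above. The same function works for OpenAI-only, fine-tuned-only, and augmented embeddings.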

5. Looking to Improve the Embedding Model for Your RAG Pipelines?

In this blog, we introduce the concept of Fine-tuning Augmented Embedding (FAE), which comes with several compelling benefits:

  1. Better retrieval performance: By fine-tuning on domain-specific data, we can enhance retrieval performance.
  2. No labels required: It uses LLMs to generate the queries that are paired with the chunks as the training data.
  3. Low computational cost: Only a small augmented embedding model is fine-tuned.
  4. No additional OpenAI embedding cost for fine-tuning: OpenAI embeddings for the training data are generated only once and are reused throughout training.

This strategy can be employed to improve retrieval performance across a wide range of LLM applications, such as coding assistants, legal chatbots, customer service, medical analysis, and more!

Get early access to Vectify's embedding model fine-tuning service and get your RAG pipelines improved from today!