All about synthetic data generation


Tags: LLM, Data
Published: November 19, 2024
Author: Shahul Es

Introduction

Synthetic data refers to artificially generated data that mimics the characteristics and patterns of real-world data but is created through algorithms. Synthetic data generation is one of the most highly regarded applications of LLMs, and the ability to design pipelines capable of generating high-quality data samples with these models is an increasingly valuable skill, with applications ranging from pre-training LLMs to evaluating LLM applications.
 
But what challenges arise in generating synthetic data? What methodologies can ensure the creation of high-quality datasets tailored to specific applications? This blog aims to provide you with a structured mental framework to address your data needs effectively.
 
 

Challenges

The fundamental goal of synthetic data generation is to automate the process of data curation, thereby saving the time and cost involved in manual data annotation and curation. But there are many challenges in scaling synthetic data generation, some of which are:
  • Ensuring diversity in the generated samples: Diversity is one of the most important aspects of any machine learning dataset, whether it is used for training or evaluation. LLMs do not keep track of their past inputs and outputs. They are also trained to converge, which means their generations usually follow a common path when producing output for a given input text. These two factors make it challenging to produce large amounts of data using LLMs that are also diverse in nature.
  • Distribution shift with real-world data: The synthetic data generated by models may not represent the distribution of data observed in real-world scenarios. For example, users’ queries or conversations can be incomplete or filled with grammar mistakes, whereas synthetic data generated by LLMs is mostly complete and grammatically correct.
  • Factuality and fidelity: Ensuring the factuality and fidelity of synthetic data is challenging because of inherent issues with LLMs like hallucination (Huang et al.).
  • Bias and fairness: LLMs are proven to exhibit certain biases (Gallegos et al.), and the synthetic data generated by these models will also reflect them. There is an added risk that future models trained on such data can amplify these biases.
The goal when designing synthetic data generation pipelines is to overcome these challenges using various techniques. We will now discuss various methods proposed for generating high-quality synthetic datasets, for applications ranging from model pre-training to evaluating LLM applications.
 
The process for generating synthetic data differs from task to task, but in general the idea is to prompt an LLM with a task specification, a set of conditions to ensure diversity, and optionally a few examples, to generate output. A minimal sketch of this pattern is shown below; in the coming sections, we will dive deeper into methods proposed for synthetically generating datasets for different tasks.
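As a rough sketch of this general pattern (using the openai Python client; the model name, topic and style pools, and the few-shot example are illustrative assumptions, not from any specific paper):

import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Illustrative pools of conditions used to diversify generations
topics = ["photosynthesis", "supply chains", "musical scales"]
styles = ["simple explanation", "step-by-step tutorial"]

def generate_sample() -> str:
    # Task specification + sampled conditions + one few-shot example
    prompt = (
        "Task: write a question and answer pair about the given topic.\n"
        f"Topic: {random.choice(topics)}\n"
        f"Style: {random.choice(styles)}\n\n"
        "Example:\nQ: What causes tides?\nA: Mainly the gravitational pull of the moon.\n\n"
        "Now write a new Q and A."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content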
 

Synthetic data for model pre-training

The task of pre-training foundational models using self-supervised learning requires plain text data with billions or trillions of tokens, depending on the model size and budget. Most of the recent open research on pre-training foundational models has consistently focused on the quality of the pre-training dataset. Some recent models like Yi (Young et al.) from 01.AI have attributed their superior model quality to sophisticated data preprocessing and filtering pipelines.
The three most important works on using synthetic data generation for LLM pre-training are TinyStories (Eldan et al.), the phi series (Gunasekar et al. and Bubeck et al.) from Microsoft Research, and Cosmopedia (Allal et al.), which is an open replication of phi-1.5.
The TinyStories paper focuses on preparing a diverse dataset of stories to pre-train small language models (below 10M parameters) that generate fluent and coherent text. To ensure diversity in the generated stories, a vocabulary of about 1500 words separated into nouns, adjectives, and verbs is first prepared. Then, for each story, three words, one from each category, are randomly sampled, and the LLM is prompted to generate a story that contains these words.
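A minimal sketch of this sampling scheme (the word lists below are tiny stand-ins for the paper's ~1500-word vocabulary):

import random

nouns = ["cat", "garden", "boat"]
verbs = ["jump", "paint", "whisper"]
adjectives = ["tiny", "shiny", "brave"]

def story_prompt() -> str:
    # One word from each category conditions the story, forcing diversity
    words = [random.choice(nouns), random.choice(verbs), random.choice(adjectives)]
    return (
        "Write a short story for young children using simple words. "
        f"The story must contain the words: {', '.join(words)}."
    )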
While TinyStories ensures diversity using seed words, the phi series and Cosmopedia ensure diversity by constraining generations on topics and audience styles. Phi-1 focuses on generating textbook-like data to improve the coding capabilities of small LMs, while phi-1.5 and Cosmopedia focus on eliciting better common-sense reasoning from small LMs. To add diversity to generations, a list of topics is extracted from sites like WikiHow, Khan Academy, etc. To generate more topics, web datasets like RefinedWeb are clustered and topics are extracted from each cluster using an LLM. An added level of diversity is achieved by using audience style, which refers to the persona of the target audience for each topic and can range from high-school students to college professors. The same topic can be repurposed multiple times by altering the target audience to generate diverse data.
 
 
[Figure: example of how audience style can influence an LLM's output]
 
Synthetic data in pre-training is highly useful and will probably be explored much more in the near future; we can even expect a Chinchilla-like scaling law that can be used to calculate the right mixture of synthetic and real data in model pre-training.
 

Synthetic data for Model Finetuning

Model finetuning, or instruction finetuning, is done to make the model better understand and follow instructions given by the user. The data used for this supervised learning procedure is in the form of (instruction, response) pairs. If the model is expected to engage in multi-turn conversations with the user, each data point will contain multiple rounds of such instruction and response pairs. Synthetic data has emerged as an effective alternative to manual annotation for model finetuning, and we will discuss most of the notable works and methods used to create synthetic data for this purpose. To provide an overview before getting into details, there are two broad methods used to synthesize data for instruction finetuning:
  1. Keyword driven: this method makes use of seed topics and/or subtopics, collected or generated from multiple sources, as keywords to generate (instruction, response) pairs.
  2. Persona driven: persona descriptions are regarded as carriers of world knowledge; given a plain text, different personas associated with it can be inferred using an LLM. These personas can then be used when generating instructions for any given task, as in the sketch below.
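A rough sketch of the persona-driven idea (the personas and the prompt wording are illustrative, not from a specific paper):

# Personas that might be inferred from web text by an LLM
personas = [
    "a pediatric nurse preparing training material",
    "a freight logistics coordinator",
]

def persona_instruction_prompt(persona: str) -> str:
    # The persona conditions what kind of instruction gets generated
    return (
        f"You are {persona}. Write one realistic instruction or question "
        "that someone like you would ask an AI assistant."
    )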
We will now dive into and discuss methods and works that propose different variants of these methods.
Self-Instruct (Wang et al.) was one of the earliest works on generating synthetic data for instruction finetuning. Here, a set of 175 seed instructions was used to bootstrap more instructions by conditioning GPT-3 on several randomly selected seed instructions to generate new ones. For classification tasks, the output labels are first fixed, and input text is then generated based on them. To avoid duplication, ROUGE-L scores against already generated instructions are calculated, and samples above a certain threshold are filtered out. At the end of this process, a dataset containing 52k (instruction, response) pairs is obtained. Stanford Alpaca (Taori et al.) later used a similar pipeline with the GPT-3.5 model to form the Alpaca dataset, and Peng et al. later used the same instructions to query responses from GPT-4.
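A sketch of the deduplication step using the rouge-score package (Self-Instruct filters candidates whose ROUGE-L similarity with any existing instruction exceeds 0.7):

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def is_novel(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    # Keep the candidate only if it is not too similar to any existing instruction
    return all(
        scorer.score(existing, candidate)["rougeL"].fmeasure < threshold
        for existing in pool
    )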

Inducing instruction diversity

Towards the end of 2023, several works stressed the importance of instruction diversity in finetuning datasets. One of the most popular ideas proposed to achieve this is varying the levels of complexity of the instructions present in the dataset. The Evol-Instruct method proposed by the WizardLM paper (Xu et al.) is one of the most interesting works on this: they propose to improve the diversity of the SFT dataset by introducing a query evolution framework through which seed instructions are evolved in complexity using transformations like concretization, adding constraints, increasing reasoning, etc. While WizardLM uses predefined transformations to evolve instructions, the framework proposed by CodecLM (Wang et al.) argues that such generic transformations may not generalize to instructions from various domains and can also be vague. Instead, they propose self-rubrics: their work extracts metadata (use case and skills required to respond) from each instruction and then uses it to generate rubrics for complicating instructions.
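A simplified sketch of how such an evolution prompt can be composed (the transformation texts paraphrase the kinds of operations Evol-Instruct describes, not the paper's exact prompts):

import random

EVOLUTIONS = [
    "Rewrite the instruction by adding one more constraint or requirement.",
    "Rewrite the instruction by replacing general concepts with more specific ones.",
    "Rewrite the instruction so that answering it requires multi-step reasoning.",
]

def evolve_prompt(instruction: str) -> str:
    # Randomly pick a transformation and wrap it around the seed instruction
    return (
        f"{random.choice(EVOLUTIONS)}\n"
        "The rewritten instruction must remain reasonable and answerable.\n\n"
        f"#Instruction#: {instruction}\n#Rewritten Instruction#:"
    )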
 
[Figure: example of instruction evolution]
 
For example, for a query with the use case of "business plan development" requiring skills in "market research and planning," generic advice such as "add reasoning steps" is too vague and unhelpful. In contrast, self-rubrics can generate specific actions like "add SWOT analysis" and "include comparison with market competitors."
While the works discussed above focus on modifying instruction complexity and diversity, works like Orca (Mukherjee et al. & Mitra et al.) use instructions from existing SFT datasets to sample responses with explanations, enabling small LLMs (7B to 13B parameters) to achieve the same level of reasoning as larger LLMs (30B+ parameters). Orca 1 introduces the concept of explanation-style tuning, where they sample explanation-style responses for instructions from the FLAN dataset using GPT-3.5 and GPT-4 to form the Orca dataset (5M samples). These responses are collected by prompting the LLM with several predefined system messages designed to elicit reasoning steps.
[Figure: an example of Orca-style data]
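A sketch of how varied system messages can be paired with existing instructions to elicit explanation-style responses (the messages paraphrase the kind Orca uses):

SYSTEM_MESSAGES = [
    "You are a helpful assistant. Think step by step and justify your answer.",
    "Explain your answer as if teaching a five-year-old.",
]

def orca_style_request(instruction: str, system_message: str) -> list[dict]:
    # The teacher model's reply to this request becomes the training response
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": instruction},
    ]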
 

Multi-turn model finetuning

All the works discussed above generate single-turn (instruction, response) pairs, which cannot be used to finetune a model for meaningful multi-turn conversations. To fill this gap, UltraChat (Ding et al.) proposes pipelines to create high-quality and diverse multi-turn conversations. The dataset is created by bootstrapping topics, entities, etc. for different conversation types, such as information seeking (questions about an existing entity) and information transformation (tasks like summarisation). Then an LLM is used to act as both user and assistant to generate dialogues given a topic and conversation type. Each conversation in this dataset contains roughly 3-7 rounds of dialogue.
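A minimal self-chat sketch along these lines (the chat helper, model name, and prompts are assumptions, not UltraChat's exact pipeline):

from openai import OpenAI

client = OpenAI()

def chat(messages: list[dict]) -> str:
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content

def self_chat(topic: str, rounds: int = 3) -> list[dict]:
    # The same LLM alternates between the user and assistant roles
    dialogue = []
    user_msg = chat([{"role": "user", "content":
        f"You are a curious user. Ask an opening question about: {topic}"}])
    for _ in range(rounds):
        dialogue.append({"role": "user", "content": user_msg})
        dialogue.append({"role": "assistant", "content": chat(dialogue)})
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in dialogue)
        user_msg = chat([{"role": "user", "content":
            f"Given this conversation, write the user's natural follow-up question:\n{transcript}"}])
    return dialogue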
 
 
 
[Figure: an example of a multi-turn conversation]
 

Finetuning for tool use

The tool-use abilities of LLMs can be improved by finetuning language models on datasets specifically created to teach LLMs to identify the needed API and pass the required arguments while in a conversation with the user. Works like Toolformer (Schick et al.) and ToolAlpaca (Tang et al.) both use data synthesized by feeding API documentation to an LLM to simulate tool-use instances.
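A rough sketch of this idea (the weather API documentation below is a made-up example):

API_DOC = """get_weather(city: str, unit: str = "celsius") -> dict
Returns the current weather for the given city."""

def tool_use_prompt(api_doc: str) -> str:
    # The LLM's completion becomes a synthetic tool-use training instance
    return (
        "Below is the documentation of an API:\n"
        f"{api_doc}\n\n"
        "Simulate a short user-assistant conversation in which the assistant "
        "decides to call this API with concrete arguments and then answers "
        "the user based on the (simulated) API response."
    )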
 

Domain-specific finetuning

Synthetic data generation can be very useful in improving models’ capabilities to adapt to domain-specific tasks. To improve the math problem-solving capabilities of LLMs, the MetaMath paper (Yu et al.) uses questions bootstrapped from existing datasets like GSM8K as seed questions and evolves them using several transformations, like self-verification, that are specific to the math domain. Another paper (Xu et al.) makes use of knowledge graphs available in clinical datasets to generate synthetic QA pairs. To improve the coding abilities of LLMs using instruction finetuning, Magicoder (Wei et al.) proposes a pipeline where (instruction, response) pairs are created by sampling 1-15 consecutive lines from code document corpora and prompting an LLM to generate the pairs from them. Data cleaning and decontamination are also applied on top of this.
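A sketch of the seeding step of this pipeline (the prompt wording paraphrases the OSS-Instruct idea, not Magicoder's exact prompt):

import random

def sample_code_snippet(source: str, min_lines: int = 1, max_lines: int = 15) -> str:
    # Sample a window of consecutive lines from a code file
    lines = source.splitlines()
    n = random.randint(min_lines, min(max_lines, len(lines)))
    start = random.randint(0, len(lines) - n)
    return "\n".join(lines[start:start + n])

def oss_instruct_prompt(snippet: str) -> str:
    return (
        "Gain inspiration from the following code snippet and create a "
        "self-contained coding problem together with a correct solution.\n\n"
        f"{snippet}"
    )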
 

Finetuning text embeddings

Synthetic data is widely used to improve text embeddings for various applications via finetuning. Finetuning embedding models using a contrastive loss requires samples containing a query, a positive document, and a negative document.
 
[Figure: an example of a query, positive document, and hard negative document]
In the work Improving Text Embeddings with Large Language Models (Wang et al.), seed topics are used as keywords to generate queries, and positive documents and hard negative documents for the corresponding queries are generated by prompting an LLM. Works like M3 and LlamaIndex use document chunks sampled from various sources to generate queries targeted at the given document, and finetune models using MultipleNegativesRankingLoss.
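A sketch of such finetuning with the sentence-transformers library (the base model and the toy triplet are placeholders):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Synthetic (query, positive document, hard negative document) triplets
train_examples = [
    InputExample(texts=[
        "What are the symptoms of dehydration?",            # query
        "Common symptoms of dehydration include thirst...", # positive document
        "Hydration strategies for marathon runners...",     # hard negative
    ]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)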

Synthetic data for Model alignment

Aligning a language model’s output with shared human values means ensuring honesty, helpfulness, and harmlessness (HHH). There are several methods proposed to achieve this, but two of the most popular ones are reinforcement learning from human feedback (RLHF; Ouyang et al.) and direct preference optimization (DPO; Rafailov et al.). Both of these require a dataset that contains multiple responses to each instruction, ranked according to human preferences. For example:
 
[Figure: an example of ranked responses used for RM/DPO training]
 
The obvious method to acquire this data is to have humans annotate and rank different responses to different instructions. Synthetic data generation methods proposed to acquire the same data have proven to be an effective alternative. There are two key approaches that have been proposed to curate data in this format using synthetic data generation:
  1. Teacher model approach: This method commonly uses multiple closed and open-source models to sample responses for instructions curated from various datasets, such as those discussed in the finetuning section. Then a powerful model like GPT-4 is prompted to act as a teacher model and rank these responses based on provided rubrics.
  2. Self-refine-based approach: In this approach, an LLM is first used to sample a response to each instruction in the dataset, and then the same model is used to provide feedback on the previously generated response. This feedback is then used to improve the response by feeding the initial response plus the feedback back to the model, and the process is repeated iteratively until the required quality is achieved (see the sketch after this list). This yields a dataset that can later be used for training models with RLHF/DPO.
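A minimal sketch of that self-refine loop (chat stands for any prompt-to-completion helper; the prompts and fixed iteration count are assumptions):

def self_refine(instruction: str, chat, max_iters: int = 3) -> list[str]:
    # Returns the sequence of responses; (earlier, later) pairs can serve
    # as (rejected, chosen) examples for DPO-style training
    responses = [chat(f"Respond to the instruction:\n{instruction}")]
    for _ in range(max_iters):
        feedback = chat(
            f"Instruction: {instruction}\nResponse: {responses[-1]}\n"
            "Give concise feedback on how to improve this response."
        )
        responses.append(chat(
            f"Instruction: {instruction}\nResponse: {responses[-1]}\n"
            f"Feedback: {feedback}\nRewrite the response applying the feedback."
        ))
    return responses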
Works like Peng et al., GRATH (Chen et al.), and UltraFeedback (Cui et al.) use the teacher model approach. Peng et al. sample multiple responses for instructions in the Alpaca 52k dataset using various models; GPT-4 is then used to rank these responses to form an RLHF dataset. GRATH focuses on improving the truthfulness of LLMs using DPO; for this, they generate a pair of correct and incorrect responses for each instruction from datasets like HellaSwag, MMLU, etc. A similar approach is used in (Castricato et al.): when LLMs are instructed to avoid discussing a certain entity (a 'Pink Elephant') and instead discuss a preferred entity (a 'Grey Elephant'), most LLMs tend to forget this constraint after a certain number of dialogue turns. This work aims to align LLMs to improve their capabilities in this particular direction. For this, seed topics and subtopics are used as keywords to synthesise conversations with desired and undesired responses, which are later used for DPO training.
 
UltraFeedback uses a pipeline in which instructions are sampled from datasets like UltraChat and ShareGPT, and GPT-4 is used to score, critique, and rank the responses. The ranked responses form the dataset used to train the reward model, and the free-form feedback is also used to finetune a critic model that can provide feedback on model responses.
rDPO (Gallego et al.) and Configurable Safety Tuning for LLMs (Gallego et al.) use the self-refine approach to build the alignment dataset. rDPO uses instructions from datasets like Harmful Behaviors (Zou et al.) to sample responses using the Mixtral model; a critique step followed by a revision step is then applied to improve the safety of the responses. Configurable Safety Tuning aims to train language models whose safety preferences can be configured at run time using system prompts. For this, they first sample uncensored responses from models and then refine them to form safe responses. The DPO training is done by altering the responses depending on the system prompt.
 
 
[Figure: working of self-refine]
 

Synthetic data for Evaluating LLM applications

Evaluating and testing LLM applications is a crucial part of taking LLMs to production. While manually curating data is challenging, synthetic data can make this process faster and help simulate test data points with minimal human supervision.
 

Agentic workflows with LLMs

Evaluating AI agents here refers to assessing the API-calling capabilities of an LLM. While works like Toolformer use static, predefined dialogue histories to evaluate these capabilities, there is a trend towards automating this by synthetically generating test cases. AutoDE (Mu et al.) uses synthetic data generation for dynamic evaluation of AI agents, which also improves the coverage of test cases. This is done by first synthesizing a user script from API documentation using an LLM; a user script provides the essential background for the conversation. Then an LLM is prompted to act as a user agent and generate conversations according to the provided user script. These conversations, with ground truth (API calls with labels), are then used to evaluate the API-calling abilities of the LLM. In their experiments, they found that this method also has a high correlation with human evaluation.
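A rough sketch of the two stages (the prompts compress what the paper describes; chat is an assumed prompt-to-completion helper):

def synthesize_user_script(api_doc: str, chat) -> str:
    # Stage 1: derive a user script (background, goal, argument values) from API docs
    return chat(
        "Read this API documentation and write a short user script describing "
        f"the user's background, goal, and the argument values in mind:\n\n{api_doc}"
    )

def next_user_turn(user_script: str, history: str, chat) -> str:
    # Stage 2: an LLM plays the user so the agent under test can be
    # evaluated on the API calls it produces against the ground truth
    return chat(
        f"User script:\n{user_script}\n\nConversation so far:\n{history}\n\n"
        "Write the user's next message, staying consistent with the script."
    )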
 
 

Retrieval augment generation (RAG)

RAG is one of the most popular applications of LLMs, and many works have proposed different methods and metrics to automate the evaluation of RAG systems. Automating the evaluation of RAG also requires synthetically generating (question, answer) pairs from a given set of documents. One of the easiest ways to do this is to feed document chunks to an LLM and prompt it to create QA pairs from them. Another approach, proposed by (Guinet et al.), uses documents to create multiple-choice question-answer pairs.
 
Pitfalls in the simple approach
But generating QA pairs from random chunks or documents can yield a poor test dataset because:
1. This approach fails to create any multi-hop queries from the given document set.
2. The QA pairs created from predefined chunks will be biased towards the predefined chunk size, which in most cases is also the chunk size used for building the RAG system. So it is much more likely that you retrieve the right chunk compared to a real production scenario.
3. The questions generated by conditioning the LLM on random chunks are too specific most of the time and hardly reflect what is seen in production.
4. The nature and variety of questions this approach can create is very limited, as the LLM is only exposed to a part of the information from your whole database, which may contain multiple documents that are related to one another.
 
A KG-based approach
The synthetic test data generation approach proposed by Ragas tries to tackle these challenges by generating diverse synthetic QA pairs using a knowledge graph-based algorithm. The core idea is to explore the documents and connect related documents using information extracted from each of them. Once a knowledge graph is formed, different types of questions can be synthetically created by exploring the formed relationships and providing appropriate context to the LLM to frame question and answer pairs. The algorithm also introduces different query types, styles, and personas while creating questions.
 
To concretise the idea: if we know that document A and document B have a few entities in common, we can safely assume these two documents are related and use them to synthesise a multi-hop question.
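A toy sketch of that heuristic, connecting documents whose extracted entity sets overlap (entity extraction itself is assumed to be done beforehand, e.g. with an LLM or NER model):

from itertools import combinations

def build_relation_edges(doc_entities: dict[str, set[str]], min_overlap: int = 2):
    # doc_entities maps a document id to the set of entities extracted from it
    edges = []
    for (a, ents_a), (b, ents_b) in combinations(doc_entities.items(), 2):
        shared = ents_a & ents_b
        if len(shared) >= min_overlap:
            edges.append((a, b, shared))  # candidate pair for a multi-hop question
    return edges

# Example: documents sharing "solar panels" and "inverter" become multi-hop candidates
edges = build_relation_edges({
    "doc_a": {"solar panels", "inverter", "net metering"},
    "doc_b": {"solar panels", "inverter", "battery storage"},
})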
 
Question types are broadly classified into two:
 
  1. Single Hop: A single-hop query is a straightforward question that requires retrieving information from a single document or source to provide a relevant answer. It involves only one step to arrive at the answer.
  2. Multi Hop: A multi-hop query involves multiple steps of reasoning, requiring information from two or more chunks. The system must retrieve information from various documents and connect the dots to generate an accurate answer.
 
Let’s see how you can use the high-level API of Ragas to create queries from a given document set.
 
  • install the ragas python library
pip install ragas
  • load documents using a LangChain/LlamaIndex document loader and run test generation. See the getting started guide here.
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)
This uses the default settings to explore document relations, query types, etc., all of which can be customised according to your requirements. Learn more about it here.
 

Conclusion

Through this article, we discussed many of the methods for synthetically generating datasets for different tasks. At an abstract level, there are certain patterns one can identify that are reused across them regardless of the task, but there are also methods that are very specific to a task and domain. In the future, we can expect more such methods, even extending synthetic data generation to multi-modality.
This blog is the result of over a year of our research and experimentation with synthetic data generation for applications from model training to LLM application evaluation. Follow us on X and subscribe to our newsletter for more.