Can we create pedagogically valuable multi-turn synthetic datasets from Cosmopedia?

Community Article Published May 7, 2024

tl;dr: I'm working on a multi-turn dataset using Cosmopedia as a starting point. You can find the current version at https://huggingface.co/datasets/davanstrien/cosmochat. I will be working on this in public. Feel free to laugh at all the typos in my prompts or suggest ideas for improvements!

What and why?

Synthetic datasets are increasingly helping to push forward the quality of open-source LLMs. One recent example of this is Cosmopedia:

Cosmopedia is a dataset of synthetic textbooks, blogposts, stories, posts and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. The dataset contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.

One of the goals of this dataset was to reproduce the “Textbooks Are All You Need” paper. The tl;dr of that paper is that you can potentially improve the performance of LLMs (especially with limited parameters) by focusing a lot of effort on the data used to train those models.

The intuition behind this approach is that textbooks could be a very useful format for training models since they aim to convey information in a dense and efficient format. You can learn more about Cosmopedia in this blog post.

Can we create chat data from this?

Since Cosmopedia was released, I have been curious about building a chat-format dataset from Cosmopedia. A previous attempt is here: davanstrien/cosmopedia_chat.

In particular, since textbook-style content is informationally dense, and the goal of Cosmopedia was to present high-quality information in a pedagogically helpful way, it seems like a promising starting point for a synthetic chat-style dataset that builds on the same pedagogical approach.

Can we create high-quality multi-turn data from this?

One of the main reasons some people find chat models useful is for trying to learn about a new concept. This doesn't always work well since a model can give wrong answers confidently. Leaving this issue aside for now, chatting with a model to learn more about a topic can still be beneficial. In contrast to a text where the content is static and may not address the area you are struggling with in detail, a chat model can be asked to delve into a particular area. If you don't understand its answer, you can say so and ask for a reformulation. You can also ask questions to check your own understanding, e.g., questions like 'Would it be fair to say that...'.

One of the limits of many existing open datasets is that they are single-turn, i.e., question/answer responses. In practice, one of the reasons an LLM might be useful is precisely when you go beyond this initial question/answer pair. This is what I am working on building. Multi-turn data is important for improving the performance of language models in real-world conversational scenarios. Unlike single-turn question-answer pairs, multi-turn data captures the dynamic nature of human conversations, allowing models to learn how to maintain context, resolve ambiguities, follow conversational flow, adapt to changes in user intent, and engage in complex interactions. By training on multi-turn data, language models can hopefully play this pedagogical role better.
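
To make the distinction concrete, here is a minimal sketch of the two shapes in the common role/content messages format. The message contents are made up for illustration, and the field names are an assumption rather than the confirmed schema of the cosmochat dataset.

```python
# A single-turn record: one question and one answer.
single_turn = [
    {"role": "user", "content": "What is a first-order ODE system?"},
    {"role": "assistant", "content": "A set of equations that only involve first derivatives."},
]

# A multi-turn record: the student keeps going after the first answer.
multi_turn = single_turn + [
    {"role": "user", "content": "I'm still confused. Could you explain it differently?"},
    {"role": "assistant", "content": "Think of tracking position and velocity together."},
    {"role": "user", "content": "Would it be fair to say velocity becomes its own variable?"},
    {"role": "assistant", "content": "Yes, that is the auxiliary variable."},
]


def count_turns(messages):
    """Count exchanges, treating each user message as the start of a turn."""
    return sum(1 for message in messages if message["role"] == "user")


print(count_turns(single_turn), count_turns(multi_turn))
```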

I have a lot of ideas about different aspects I could pursue with this (shiny objects 😅), but the rough goal is:

create multi-turn data. At the moment, I am focusing on hard-coded turns, but I am thinking about ways of more "organically" determining the number of turns that make sense for a particular chat.
the initial question for the chat is generated via the text from Cosmopedia
the initial question should be appropriate for the audience level defined in Cosmopedia (grade school student or college)
the follow-up questions may sometimes reflect a student's lack of understanding and try to test how well the model can clarify. Other times the student may seek to go deeper.
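
The goals above can be sketched as a simple loop with a hard-coded number of turns. This is a minimal illustration in plain Python, not the distilabel pipeline itself: `build_conversation`, `complete`, `chat`, and the two prompt-builder parameters are hypothetical names, and the stand-in functions at the bottom only exist so the sketch runs without a model.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def build_conversation(
    text: str,
    audience: str,
    complete: Callable[[str], str],
    chat: Callable[[List[Message]], str],
    first_question_prompt: Callable[[str, str], str],
    follow_up_prompt: Callable[[List[Message], str], str],
    num_turns: int = 3,
) -> List[Message]:
    """Alternate student questions and model answers for a fixed number of turns."""
    messages: List[Message] = []
    # The first question is generated from the source text.
    question = complete(first_question_prompt(text, audience))
    for turn in range(num_turns):
        messages.append({"role": "user", "content": question})
        answer = chat(messages)
        messages.append({"role": "assistant", "content": answer})
        # Later questions are generated from the conversation so far.
        if turn < num_turns - 1:
            question = complete(follow_up_prompt(messages, audience))
    return messages


# Stand-ins (assumptions) so the sketch runs offline; swap in real LLM calls.
def fake_complete(prompt: str) -> str:
    return f"Question based on a {len(prompt)}-character prompt?"


def fake_chat(messages: List[Message]) -> str:
    return f"Answer to message {len(messages)}."


conversation = build_conversation(
    text="Some Cosmopedia passage.",
    audience="college student",
    complete=fake_complete,
    chat=fake_chat,
    first_question_prompt=lambda text, audience: f"{audience}: {text}",
    follow_up_prompt=lambda messages, audience: f"{audience}: {len(messages)} messages so far",
)
for message in conversation:
    print(f"{message['role']}: {message['content']}")
```

Keeping the question generator and the answering model as separate callables means they could be different models, and replacing the fixed `num_turns` with a stopping check would be one way to explore more "organic" turn counts.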

For this work I'm using the wonderful distilabel library. I will share the full pipeline and process (once I've tidied the initial version!), but the prompts roughly look like this at the moment.

Prompt to generate initial question

def cosmopedia_to_question_prompt(text: str, audience: str) -> str:
    """Build the prompt asking the model for an opening question about `text`."""
    return f"""You will play a {audience}.
Based on some text I will show you, please write a question about the topic discussed which could be asked by a {audience}.
The question should focus on topics discussed in the text. You should write the question assuming the student has some familiarity with the topic but not the text itself.
<text>
{text.strip()}
</text>
If the text includes analogies, examples, metaphors, etc., you should not include them in the question unless it is reasonable to assume these would be commonly used in educational materials for a {audience}.
For example, if the text discusses the concept of gravity using the analogy of a bowling ball on a trampoline, you should not include the analogy in the question unless it is reasonable to assume that a {audience} would be familiar with this analogy.
Return only the question you would ask, not the text itself or any other information.
"""

Prompt to generate follow-up question

import random


def format_to_generate_second_question(questions_and_answer_messages, audience) -> str:
    # Candidate levels of understanding; one is sampled for each follow-up question.
    understanding_levels = [
        "a very poor understanding of the topic. This student may be confused or have misconceptions about the topic which will be expressed in their question.",
        "a poor grasp of the topic and wants to clarify their understanding.",
        "a good understanding of the topic. They are likely to follow up about a specific component of the answer given.",
        f"a very deep understanding of the topic and wants to explore the topic further. This student may ask questions beyond what would be expected from their level of study as a {audience}. They may make connections to other topics or ask about advanced concepts.",
    ]
    student_understanding = random.choice(understanding_levels)
    # Flatten the chat history into a simple Q/A transcript.
    messages_formatted = ""
    for message in questions_and_answer_messages:
        if message["role"] == "user":
            messages_formatted += f"Q: {message['content']}\n"
        # Model answers use the "assistant" role in standard chat-format messages.
        elif message["role"] == "assistant":
            messages_formatted += f"A: {message['content']}\n"
    return f"""Based on the conversation below, write a follow-up question from a student.
    <conversation>
    {messages_formatted}
    </conversation>
    Remember, the student is a {audience} with {student_understanding}
    If the response from the model is beyond what would be expected from a {audience}, reflect this in the question.
    Just respond with the question.
    """

An example of a full conversation produced by this approach so far:

Example chat

Turn 1

User: How does the conversion of a second-order ODE to a first-order ODE system affect the complexity of solving the initial value problem, and what are the implications for numerical methods such as the Euler methods?

Assistant: Converting a second-order ordinary differential equation (ODE) to a first-order ODE system can significantly impact the complexity of solving the initial value problem and has important implications for numerical methods like the Euler methods.

Advantages:

Simplification of the problem: By converting a second-order ODE to a first-order system, we reduce the order of the equation, making it easier to analyze and solve.
Easier implementation of numerical methods: First-order systems are more amenable to numerical methods, such as the Euler methods, which are widely used for solving ODEs. These methods are designed for first-order systems, so converting a second-order ODE to a first-order system makes it possible to apply these methods directly.
Improved numerical stability: First-order systems are generally more stable than second-order systems, which means that small errors in the numerical solution are less likely to grow exponentially.

Disadvantages:

Increased dimensionality: When converting a second-order ODE to a first-order system, we introduce an additional variable, which increases the dimensionality of the problem. This can lead to additional computational cost and memory requirements.
Loss of physical insight: By introducing an auxiliary variable, we may lose some physical insight into the problem, as the new variable may not have a direct physical interpretation.

Implications for numerical methods:

Euler methods: The Euler methods, including the Forward Euler and Backward Euler methods, are designed for first-order systems. By converting a second-order ODE to a first-order system, we can apply these methods directly. However, the increased dimensionality of the problem may lead to additional computational cost.
Stability and accuracy: The stability and accuracy of numerical methods like the Euler methods are affected by the conversion to a first-order system. The introduction of an auxiliary variable can lead to additional error sources, which may affect the overall accuracy of the numerical solution.
Choice of step size: The choice of step size for numerical methods like the Euler methods may be influenced by the conversion to a first-order system. A smaller step size may be required to maintain stability and accuracy, which can increase the computational cost.

Example:

Consider the second-order ODE:

y''(t) + 4y(t) = 0

with initial conditions y(0) = 1 and y'(0) = 0. We can convert this to a first-order system by introducing an auxiliary variable z(t) = y'(t):

dy/dt = z
dz/dt = -4y

with initial conditions z(0) = 0 and y(0) = 1. Now, we can apply the Euler methods to this first-order system.

In conclusion, converting a second-order ODE to a first-order system simplifies the problem and makes it more amenable to numerical methods like the Euler methods. However, it also increases the dimensionality of the problem and may lead to additional error sources and computational cost. The choice of numerical method and step size must be carefully considered to ensure stability and accuracy.

Turn 2

User: "I'm still a bit confused about how converting a second-order ODE to a first-order system makes it easier to solve. Doesn't it just add more variables and equations to deal with? How does that make it simpler?"

Assistant: I'm glad you asked for clarification.

You're right that converting a second-order ODE to a first-order system does add more variables and equations to deal with. However, the key point is that first-order systems are more tractable and easier to analyze and solve, both theoretically and numerically. Here are some reasons why:

1. Simplified theory: When dealing with first-order systems, many theoretical results and tools are available to help analyze and solve them. For example, the existence and uniqueness theorem for first-order ODEs is well-established, and techniques like separation of variables, integrating factors, and substitution methods are well-developed. In contrast, second-order ODEs are more challenging to analyze, and fewer general results are available.

2. Easier numerical implementation: First-order systems are more amenable to numerical methods because they can be solved using straightforward iterative schemes. For example, the Euler method, which is a simple and intuitive numerical method, is designed for first-order systems. When applied to a first-order system, the Euler method is easy to implement and understand. In contrast, applying the Euler method to a second-order ODE requires more complex modifications, which can be error-prone.

3. Reduced complexity in numerical methods: Although converting a second-order ODE to a first-order system adds more variables and equations, it reduces the complexity of the numerical methods used to solve them. Many numerical methods, like Runge-Kutta methods, are designed to solve first-order systems. These methods can be applied directly to the converted system, without requiring additional modifications. This simplifies the numerical implementation and reduces the potential for errors.

4. Better stability and accuracy: First-order systems are generally more stable and accurate than second-order systems when solved numerically. This is because first-order systems are less prone to oscillations and overshooting, which can lead to numerical instability. By converting a second-order ODE to a first-order system, we can take advantage of these stability and accuracy benefits.

5. Easier implementation of boundary conditions: When solving a second-order ODE, implementing boundary conditions can be challenging, especially if the conditions are non-homogeneous. Converting the ODE to a first-order system makes it easier to implement these boundary conditions, as they can be incorporated directly into the system.

While it's true that converting a second-order ODE to a first-order system adds more variables and equations, the benefits of simplified theory, easier numerical implementation, reduced complexity, better stability and accuracy, and easier implementation of boundary conditions make it a valuable approach for solving many ODE problems.

Turn 3

User: How do the trade-offs between the advantages and disadvantages of converting a second-order ODE to a first-order system influence the choice of numerical method and step size, and what are some general guidelines for determining when the benefits of conversion outweigh the added complexity?

Assistant: The trade-offs between the advantages and disadvantages of converting a second-order ODE to a first-order system indeed influence the choice of numerical method and step size. Here's how:

Influence on numerical method choice:

Stability considerations: If the converted first-order system is more stable than the original second-order ODE, it may be advantageous to use a numerical method that takes advantage of this stability, such as an implicit method (e.g., Backward Euler) or a stiff ODE solver.
Accuracy requirements: If high accuracy is required, a higher-order numerical method (e.g., Runge-Kutta 4th order) may be necessary to compensate for the added complexity of the converted system.
Computational cost: If the converted system is large or computationally expensive, a more efficient numerical method (e.g., Euler's method) may be preferred, even if it sacrifices some accuracy.

Influence on step size choice:

Stability constraints: The added variables and equations in the converted system may introduce additional stability constraints, which may require a smaller step size to maintain stability.
Accuracy constraints: A smaller step size may be necessary to achieve the desired accuracy, especially if the converted system is more sensitive to numerical errors.
Computational cost: A larger step size may be desirable to reduce computational cost, but this must be balanced against the potential loss of accuracy and stability.

General guidelines for determining when conversion is beneficial:

Stiffness: If the second-order ODE is stiff (i.e., has a large difference in eigenvalues), conversion to a first-order system can help improve stability and accuracy.
Boundary conditions: If the boundary conditions are complex or non-homogeneous, conversion to a first-order system can make it easier to implement them accurately.
Physical interpretation: If the converted system has a clear physical interpretation, it may be easier to understand and analyze the behavior of the system.
Computational cost: If the converted system is not too large or computationally expensive, the benefits of conversion may outweigh the added complexity.
Problem structure: If the second-order ODE has a specific structure (e.g., separable, linear), conversion to a first-order system may not be necessary or may not provide significant benefits.

When to avoid conversion:

Simple second-order ODEs: If the second-order ODE is simple and easy to solve directly, conversion to a first-order system may not be necessary.
Low-dimensional systems: If the converted system is very large or has a high dimensionality, the added complexity may not be justified.
Well-established methods: If there are well-established and efficient methods for solving the second-order ODE directly (e.g., finite difference methods), conversion to a first-order system may not be necessary.

In summary, the decision to convert a second-order ODE to a first-order system depends on the specific problem characteristics, numerical method choices, and step size considerations. By weighing the trade-offs between advantages and disadvantages, you can determine when conversion is beneficial and choose the most suitable numerical method and step size for your problem.
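
As a quick sanity check on the Turn 1 example, the converted system dy/dt = z, dz/dt = -4y with y(0) = 1 and z(0) = 0 can be stepped with forward Euler and compared against the exact solution y(t) = cos(2t). This is a minimal sketch and not part of the generated chat or the dataset pipeline; the function name and step counts are illustrative choices.

```python
import math


def forward_euler(y0, z0, step, n_steps):
    """Integrate dy/dt = z, dz/dt = -4y with forward Euler; return (y, z) at the end."""
    y, z = y0, z0
    for _ in range(n_steps):
        # Both updates use the values from the start of the step.
        y, z = y + step * z, z - 4.0 * step * y
    return y, z


t_end = 1.0
exact = math.cos(2.0 * t_end)  # exact solution of y'' + 4y = 0 with y(0)=1, y'(0)=0
errors = []
for n_steps in (100, 1000, 10000):
    y, _ = forward_euler(1.0, 0.0, t_end / n_steps, n_steps)
    errors.append(abs(y - exact))
    print(n_steps, errors[-1])
```

Because forward Euler is a first-order method, the error should shrink roughly in proportion to the step size. A check like this is one cheap way to test specific numerical claims in a generated answer.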

TODO

The next steps I want to explore are:

At the moment, I feel that the final turn doesn't always follow the user's "character" that well, i.e., they go from not understanding a topic in question two to asking a complex question. I think it could be interesting to create a user "persona" propagated through all the turn prompts.
Preference data: I am keen to turn this into a preference dataset, and in particular, to evaluate the model's performance in replying to the final turn, which aims to push the model's ability to communicate concepts to a student.
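
One possible shape for the persona idea is to sample the student's level of understanding once per conversation and pass the same persona into every follow-up prompt, instead of re-sampling it on each turn. The sketch below is an assumption about how this might look, not the current pipeline: `StudentPersona`, `sample_persona`, and `follow_up_prompt` are hypothetical names, and the prompt wording is shortened.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StudentPersona:
    audience: str
    understanding: str


UNDERSTANDING_LEVELS = [
    "a very poor understanding of the topic",
    "a poor grasp of the topic",
    "a good understanding of the topic",
    "a very deep understanding of the topic",
]


def sample_persona(audience, rng):
    """Pick the student's level once so it stays fixed for the whole chat."""
    return StudentPersona(audience=audience, understanding=rng.choice(UNDERSTANDING_LEVELS))


def follow_up_prompt(transcript, persona):
    """Build a follow-up prompt that always describes the same student."""
    return (
        "Based on the conversation below, write a follow-up question from a student.\n"
        f"<conversation>\n{transcript}\n</conversation>\n"
        f"Remember, the student is a {persona.audience} with {persona.understanding}.\n"
        "Just respond with the question."
    )


rng = random.Random(0)  # seeded only to make the example repeatable
persona = sample_persona("college student", rng)
prompts = [
    follow_up_prompt(f"Q: question {turn}\nA: answer {turn}", persona)
    for turn in (1, 2)
]
print(prompts[0])
```

Storing the sampled persona alongside each conversation would also make it possible to filter or analyse the dataset by student level later.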

I will share the code in this GitHub repo and keep iterating on the dataset. I will use the community tab for the dataset to track ideas and the next steps.
