The llm-dataset-converter library (and its dependent libraries) can be used for converting Large Language Model (LLM) datasets from one format into another. It has support for the following domains:

  • Pretrain
  • Supervised
    • Classification
    • Pairs (Q&A, P/R)
  • Translation

Please refer to the dataset formats section for more details on supported formats.

The library does more than convert datasets: you can also slot in complex filter pipelines to process and clean the data as it is converted.
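
To make the idea of a filter pipeline concrete, here is a minimal conceptual sketch in plain Python. It does not use the llm-dataset-converter API; every name in it (`read_alpaca_like`, `strip_whitespace`, `drop_empty`, `to_jsonl`, `convert`) is hypothetical and exists only to illustrate the reader, filters, writer flow, assuming a simple prompt/response (pairs) dataset.

```python
import json
from typing import Callable, Dict, Iterable, Iterator, List

# Illustration only: these names are hypothetical and are NOT part of
# the llm-dataset-converter API.
Record = Dict[str, str]
Filter = Callable[[Iterable[Record]], Iterator[Record]]


def read_alpaca_like(text: str) -> Iterator[Record]:
    """Reader: yield instruction/output pairs from a JSON array."""
    for item in json.loads(text):
        yield {"instruction": item["instruction"], "output": item["output"]}


def strip_whitespace(records: Iterable[Record]) -> Iterator[Record]:
    """Filter: trim leading and trailing whitespace in every field."""
    for rec in records:
        yield {key: value.strip() for key, value in rec.items()}


def drop_empty(records: Iterable[Record]) -> Iterator[Record]:
    """Filter: discard pairs with an empty instruction or output."""
    for rec in records:
        if rec["instruction"] and rec["output"]:
            yield rec


def to_jsonl(records: Iterable[Record]) -> str:
    """Writer: serialize the records as JSON lines."""
    return "\n".join(json.dumps(rec) for rec in records)


def convert(text: str, filters: List[Filter]) -> str:
    """Chain reader -> filters (in order) -> writer."""
    records: Iterable[Record] = read_alpaca_like(text)
    for apply_filter in filters:
        records = apply_filter(records)
    return to_jsonl(records)


SAMPLE = json.dumps([
    {"instruction": "  What is 2 + 2? ", "output": "4"},
    {"instruction": "Say hi", "output": "   "},
])

if __name__ == "__main__":
    print(convert(SAMPLE, [strip_whitespace, drop_empty]))
```

Because each stage is a generator, records stream through one at a time, and the order of the filters matters: here the whitespace is stripped first, so the second pair becomes empty and is dropped.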

On this website you can find examples for:

Examples for the additional libraries: