Databricks launches API for generating synthetic datasets


Databricks’ new API allows users to easily create synthetic datasets for machine learning.

Databricks has introduced an API for creating synthetic datasets for machine learning projects. The API is part of Mosaic AI Agent Evaluation, a tool for evaluating the quality, cost and speed of AI applications.

Generate in three steps

Synthetic data generated by AI is a faster and more cost-effective way to build training datasets than manual methods. The new API focuses on generating question-and-answer collections for applications built on LLMs. The process involves three steps: loading the relevant data into an Apache Spark or pandas DataFrame, specifying the desired number of questions and answers, and customizing the output style and usage scenario.
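The three steps can be pictured as three groups of inputs. The sketch below is only an illustration and does not call Databricks' API: the class `GenerationRequest`, its field names and the `content`/`doc_uri` row keys are all hypothetical, and plain Python dictionaries stand in for the rows of a Spark or pandas DataFrame.

```python
from dataclasses import dataclass


@dataclass
class GenerationRequest:
    """Hypothetical bundle of the three inputs described in the article."""
    docs: list                  # step 1: rows of source text (a DataFrame in practice)
    num_questions: int          # step 2: how many question-and-answer items to request
    style_guidelines: str = ""  # step 3: desired output style
    usage_scenario: str = ""    # step 3: what the application is for

    def check(self) -> None:
        """Fail early on inputs that could not produce a usable dataset."""
        if not self.docs:
            raise ValueError("at least one source document is required")
        for row in self.docs:
            if not str(row.get("content", "")).strip():
                raise ValueError("every document row needs non-empty 'content'")
        if self.num_questions < 1:
            raise ValueError("num_questions must be a positive integer")


# Illustrative values only; none of this text comes from Databricks.
request = GenerationRequest(
    docs=[
        {"content": "Spark jobs run on clusters.", "doc_uri": "docs/clusters"},
        {"content": "Delta tables store versioned data.", "doc_uri": "docs/delta"},
    ],
    num_questions=10,
    style_guidelines="Short questions a new user might ask.",
    usage_scenario="Support chatbot for a data platform.",
)
request.check()  # raises ValueError if any of the three inputs is unusable
```

Checking the inputs before generation is a design choice of this sketch, not a documented behavior of the API.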

Because incorrect training data can degrade the quality of AI models, the API is designed to simplify data validation. Instead of complete answers, it generates the facts needed to answer each question, which are easier for a reviewer to check.
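To show why a list of facts is easier to validate than a full answer, here is a minimal sketch of a fact-based check. It assumes a record with a question and a list of expected facts; the record layout, the function `missing_facts` and the naive substring matching are assumptions made for illustration, not how Databricks evaluates answers.

```python
def missing_facts(answer: str, expected_facts: list[str]) -> list[str]:
    """Return the expected facts that do not appear in the answer.

    Matching is a case-insensitive substring test after collapsing
    whitespace, which is deliberately simplistic.
    """
    normalized = " ".join(answer.lower().split())
    return [
        fact for fact in expected_facts
        if " ".join(fact.lower().split()) not in normalized
    ]


# Hypothetical record: a question plus the facts a good answer must contain.
record = {
    "question": "Where do Spark jobs run?",
    "expected_facts": ["run on clusters"],
}
answer = "Spark jobs run on clusters managed by the platform."
print(missing_facts(answer, record["expected_facts"]))  # prints []
```

A reviewer only has to confirm that each short fact is true, rather than reading and judging a whole generated answer.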

New features will be added in 2024, including a graphical interface for faster reviews and tools to track changes in datasets.

Earlier this year, Databricks integrated Nvidia GPUs into its platform, allowing users to accelerate AI workloads on the Data Intelligence Platform.