Harness the Power of Generative AI by Training Your LLM on Custom Data
In 2022, generative AI exploded into the mainstream when OpenAI released ChatGPT. A year later, OpenAI released GPTs, which let users create customized versions of ChatGPT tailored to their specific needs.
While these generative AI tools have huge potential to transform business across industries, they’re only as good as the components and data they’re built on, and how well the model is engineered. That includes the large language model (LLM) that powers the application, the data that feeds into the LLM, and the capabilities of the database that houses that data.
And while there are many uses for broad, open source datasets, certain use cases that require specialized and highly accurate functionality can only be achieved by using your own data and databases. This is part of why finetuning proprietary GPTs and deploying your own open source LLMs have become popular.
By building a generative AI model trained on their private data, MariaDB customers can create highly tailored applications that differentiate their offerings from those of their competitors. By using MindsDB and MariaDB Enterprise Server together, finetuning, model building, training, and retrieval-augmented generation (RAG) become quite approachable.
How Training Your LLM on Custom Data Helps Modern Business Innovate Effectively
Generative AI has opened up a whole new level of innovation. While there are limits to the use cases for this technology, the benefits we’re already seeing are tangible.
Benefits of Fine Tuning and Training an LLM On Custom Data
Fine Tuning an LLM using custom data allows you to:
- Gain competitive advantage as you make use of your data to streamline resource-intensive processes, gain deeper insight from your customer base, identify and respond quickly to shifts in the market, and much, much more.
- Enhance your application’s functionality by enabling the LLM to answer questions about domain-specific data that isn’t available anywhere else (e.g., “What were our fourth quarter sales results?” or “Who are our top five customers?”).
- Optimize the LLM’s performance to refine predictions and improve accuracy by incorporating large volumes of contextual information.
- Simplify operational analytics by using AI/ML’s powerful analytic capabilities, as well as a simpler, natural language interface on your specialized or unique datasets stored in operational or columnar databases.
- Maintain privacy and security by keeping your data internal, allowing you to implement proper controls, enforce security policies and remain compliant with relevant regulations.
The Different Approaches to Training an LLM on Custom Data
Today, there are various ways to leverage LLMs and custom data, depending on your budget, resources, and requirements.
Some of the most popular include:
- Finetuning existing models is best suited for creating models specialized in a particular domain or task. This approach relies on a fixed dataset that doesn’t change unless the model is retrained. It is the least complex approach and allows only moderate customization.
- Retrieval-augmented generation (RAG) works for tasks that require access to a broad range of information. It retrieves information from external data sources and supplies it to a pre-existing model, and it offers limited customization opportunities. This approach requires moderate expertise in generative AI and data integration.
- Building your own model(s) is useful in situations where existing models aren’t accurate enough or simply can’t accomplish the task needed. Developing a completely customized model and training it on your datasets requires significant expertise.
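To make the RAG approach more concrete, here is a minimal sketch of its retrieval step in Python. It is an illustration only: it ranks text snippets by naive word overlap with the question, whereas a production system would more likely use embeddings and a vector search. The function names and the sample snippets are made up for this example, and the step that sends the prompt to an LLM is left out.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=2):
    """Return up to top_k documents that share the most words with the question."""
    question_words = tokenize(question)
    scored = [(len(question_words & tokenize(doc)), doc) for doc in documents]
    # Drop documents with no overlap, then sort by score (highest first).
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def build_prompt(question, documents, top_k=2):
    """Assemble a prompt that places the retrieved context ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Made-up snippets standing in for rows pulled from a private database.
DOCUMENTS = [
    "Fourth quarter sales results are stored in the quarterly_sales table.",
    "Top five customers are ranked by total revenue in the customer_summary view.",
    "The office is closed on public holidays.",
]

prompt = build_prompt("What were our fourth quarter sales results?", DOCUMENTS, top_k=1)
```

The resulting prompt would then be passed to whichever LLM you use. The model itself is unchanged, which is why RAG can draw on data that changes without retraining.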
Fine Tuning Your LLM with Data Stored in MariaDB Enterprise Server
External middleware, like MindsDB, can simplify the process of connecting your LLM with data stored in your MariaDB instance, allowing you to build your own models, finetune existing ones or implement RAG. This helps with rapid prototyping; simplifies the build-from-scratch, finetuning and retraining processes; and unifies the development approach through a single domain-specific language (DSL).
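As a rough illustration of what that DSL can look like, the Python sketch below only builds two SQL-style statements as strings: one that registers a MariaDB instance as a MindsDB data source and one that trains a model from a query against it. The statement shapes follow MindsDB's CREATE DATABASE and CREATE MODEL syntax as commonly documented, so verify them against the current MindsDB documentation before relying on them. The helper functions, data source name, table, columns and credentials are all hypothetical.

```python
import json

def create_datasource_sql(name, host, port, database, user, password):
    """Build a statement that registers a MariaDB instance as a MindsDB data source."""
    params = {
        "host": host,
        "port": port,
        "database": database,
        "user": user,
        "password": password,
    }
    return (
        f"CREATE DATABASE {name}\n"
        f"WITH ENGINE = 'mariadb',\n"
        f"PARAMETERS = {json.dumps(params, indent=2)};"
    )

def create_model_sql(model, datasource, query, target):
    """Build a statement that trains a model on the result of a query."""
    return (
        f"CREATE MODEL mindsdb.{model}\n"
        f"FROM {datasource}\n"
        f"  ({query})\n"
        f"PREDICT {target};"
    )

# Hypothetical values; in practice, load credentials from a secure source.
datasource_sql = create_datasource_sql(
    "mariadb_datasource", "127.0.0.1", 3306, "travel", "app_user", "change-me"
)
model_sql = create_model_sql(
    "flight_price_model", "mariadb_datasource", "SELECT * FROM flights", "price"
)
```

The generated statements could then be run from the MindsDB SQL editor or from any client connected to your MindsDB instance.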
What Kind of Private Data Can LLMs Use from MariaDB Enterprise Server?
The integration between MariaDB Enterprise Server and the LLM you’re fine tuning is powerful because of the array of database workloads MariaDB Enterprise Server can support, including:
- Online Transaction Processing (OLTP): Transactional workloads that scale up or down with variable demand, including clustered or replicated deployments for read scaling and high availability.
- Online Analytical Processing (OLAP): Operational analytics to perform complex queries and data analysis on historical data, aggregated information, and other datasets optimized for analytical queries.
- JSON Document Data: For mixed SQL / NoSQL workloads that include JSON document data, MariaDB allows you to query the semi-structured data using SQL and join JSON data with existing tables.
This versatility broadens what you can do when finetuning. It lets you draw on all the types of data your business generates, from x-ray scans to historic sales data, to further hone the LLM’s capabilities.
Use Case: Building an AI Travel Assistant Using MariaDB and MindsDB
MariaDB partnered with Coding Entrepreneurs to develop a full-stack tutorial on how to build a system that can forecast future flight prices and generate potential itineraries from natural language. This tutorial is recommended for both front-end and back-end developers using JavaScript and Python. Developers will use technology like Jupyter, FastAPI, Flowbite, Pydantic, SQLAlchemy, Pandas, gretel.ai for synthetic data, TailwindCSS, Next.js and more.
Throughout the tutorial, you’ll learn how to generate these forecasts based on a private dataset in MariaDB Enterprise Server, which has been customized and expanded with synthetic data. The presenter uses an LLM as a chatbot interface to predict airline prices and travel data. Next, he explains how those values can be fed into another AI to parse the data and suggest the best option for the user.
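For a sense of what a forecasting model definition might look like, the Python sketch below builds a time-series CREATE MODEL statement as a string. It is an assumption-laden illustration rather than the tutorial's actual code: the ORDER BY, GROUP BY, WINDOW and HORIZON clauses follow MindsDB's time-series syntax as commonly documented, and the helper function, model, data source, table and column names are hypothetical.

```python
def create_forecast_model_sql(model, datasource, query, target, order_by,
                              group_by=(), window=8, horizon=4):
    """Build a time-series CREATE MODEL statement.

    window:  how many past rows the model looks back on for each prediction.
    horizon: how many future rows the model is asked to forecast.
    """
    lines = [
        f"CREATE MODEL mindsdb.{model}",
        f"FROM {datasource}",
        f"  ({query})",
        f"PREDICT {target}",
        f"ORDER BY {order_by}",
    ]
    if group_by:
        # One independent series per combination of these columns.
        lines.append("GROUP BY " + ", ".join(group_by))
    lines.append(f"WINDOW {window}")
    lines.append(f"HORIZON {horizon};")
    return "\n".join(lines)

# Hypothetical example: forecast prices separately for each route.
forecast_sql = create_forecast_model_sql(
    "flight_price_forecast",
    "mariadb_datasource",
    "SELECT * FROM flights",
    target="price",
    order_by="flight_date",
    group_by=("origin", "destination"),
)
```

Grouping by route in this sketch reflects the idea that prices on different routes behave as separate series; whether that suits your data is a modeling decision.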
Quick Overview of the Tutorial:
- Part 1: Learn the basics by example: Forecasting
  - Set up and load example dataset into MariaDB
  - Integrate MariaDB & MindsDB
  - Create and train a regression model using example Kaggle data
  - Create forecasting model & predict future values
- Part 2: Doing it for real with Travel Data
  - Load into MariaDB and prepare data
  - Train and refine predictor model
  - Create data models and input validation schemas
  - Create REST API and integrate into MindsDB predictor model
- Part 3: Frontend UI
  - Frontend for list and detail views
  - Layout and CSS styling
  - Error handling and input handling
  - Dropdown airport selector, prediction results table
- Part 4: Putting it all together
  - Integrate OpenAI for flight recommendations via LLM input
  - Recommendation response UI, purchase links
  - Use Gretel.ai to enrich dataset
  - Load additional records to MariaDB via Pandas
  - Train forecast model
While there may be open source historical flight data, it could be outdated, incomplete, or lacking important context. For example, if the dataset doesn’t tie price fluctuations to the month of the year, it may be difficult for the AI to adjust prices during popular holidays.
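The point about month-of-year context can be illustrated with a small Python sketch. It derives a month feature from each flight date and averages prices per month, which is one simple way to surface seasonal patterns for a model. The field names and the sample records are invented for illustration and do not come from the tutorial's dataset.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

def add_month_feature(records):
    """Return copies of the records with a 'month' field derived from flight_date."""
    return [{**record, "month": record["flight_date"].month} for record in records]

def average_price_by_month(records):
    """Average the price of all records that fall in the same calendar month."""
    prices_by_month = defaultdict(list)
    for record in add_month_feature(records):
        prices_by_month[record["month"]].append(record["price"])
    return {month: mean(prices) for month, prices in sorted(prices_by_month.items())}

# Invented sample records, for illustration only.
SAMPLE_FLIGHTS = [
    {"flight_date": date(2023, 3, 10), "price": 200.0},
    {"flight_date": date(2023, 3, 24), "price": 220.0},
    {"flight_date": date(2023, 12, 22), "price": 400.0},
    {"flight_date": date(2023, 12, 23), "price": 440.0},
]

monthly_averages = average_price_by_month(SAMPLE_FLIGHTS)
```

With the month available as an explicit column, a model has a direct signal to associate with holiday-season price changes instead of having to infer it from raw dates.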
By using private data, the presenter was able to refine the application’s predictions. This accuracy is important for identifying the best option based on the criteria selected. A hypothetical end user would use this tool because it’s able to help them identify the best flight for the least amount of money. Inaccurate or unreliable predictions would likely cause the end user to switch to a competitor’s tool.
See How You Can Apply LLM Fine Tuning in Your Business
Regardless of your industry or your application’s use case, implementing and fine tuning LLMs can significantly expand your application’s potential. MariaDB Enterprise Server and MindsDB can remove limitations on the types of data you can use to finetune your LLM.
Watch this step-by-step tutorial on how to connect your database to LLMs to empower applications with machine learning and generative AI capabilities.