The Secret to More Accurate, Intelligent LLMs Isn’t Data—It’s Metadata

To improve the accuracy of your large language models, you need to provide them with context, and metadata is the best way to do that.

    Given the endless daily headlines published on the topic, you’d be forgiven for thinking that artificial intelligence is the only thing that matters in today’s business technology landscape. The mainstreaming of AI and machine learning (ML) is definitely ushering in a new era for our industry. The potential for technologies such as generative AI to transform business seems limitless.

    But while the potential for AI is enormous, so are the risks: its outputs can be unreliable and difficult to diagnose, and they can prove costly when bad data is used to make important business decisions. As I’ve written about previously, before jumping into gen AI and large language models (LLMs), you need to ensure your data fundamentals are in place. This includes making sure you have the fullest context possible for your data in the form of metadata.

    What is metadata?

    Metadata is everywhere. In fact, you may be capturing it without even realizing it. When you take a photo with your smartphone, you may only be thinking about the image itself. But attached to that photo is a large amount of metadata your device captured automatically behind the scenes: the timestamp, the location coordinates, the type of device you used to take the photo, the size of the photo file, and more. And while you might not think about the value of this metadata while you’re taking the picture, without it you couldn’t search your photo library for photos from a particular event, or those taken in a specific location. That metadata gives important context to your photos and turns your gallery into a useful database of memories.
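
    To make this idea concrete, here’s a minimal sketch in Python, using the Pillow library, of pulling that hidden metadata out of an image file. The file path is hypothetical, and the exact tags you’ll see depend on the device that took the photo:

```python
# A minimal sketch of reading a photo's hidden metadata with the Pillow
# library. "photo.jpg" is a hypothetical path; the tags present depend
# on the camera or phone that produced the file.
from PIL import Image
from PIL.ExifTags import TAGS

image = Image.open("photo.jpg")
exif = image.getexif()  # the EXIF metadata block attached to the image

for tag_id, value in exif.items():
    name = TAGS.get(tag_id, tag_id)  # translate numeric tag IDs to names
    print(f"{name}: {value}")        # e.g. DateTime, Model, ImageLength
```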

    In a similar way, many businesses are passively generating metadata in the process of building out and using their data platform. Nearly every team is consuming it; they just don’t know it. If you look at a data cataloging tool, it’s all metadata. When you look at a glossary of what all your table columns mean, that’s not data—it’s metadata. This is because you’re seeing what the column is, not what’s in the column. Even though teams may not immediately recognize the value of this metadata, for any company that wants to build and train a useful LLM, metadata is an essential element.
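
    As a small, self-contained illustration, here’s a Python sketch using SQLite’s built-in catalog; the claims table and its columns are hypothetical. Notice that the query returns facts about the columns, not the rows inside them:

```python
# A self-contained sketch: databases store metadata about your tables
# alongside the data itself. Here we ask SQLite to describe a table's
# columns rather than return the rows inside it. The "claims" table is
# hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE claims (
        claim_id   INTEGER PRIMARY KEY,
        filed_at   TEXT NOT NULL,
        amount_usd REAL
    )
""")

# PRAGMA table_info returns metadata only: column name, declared type,
# nullability, default value, and primary-key flag -- no data rows.
for cid, name, col_type, notnull, default, pk in conn.execute(
    "PRAGMA table_info(claims)"
):
    print(f"{name}: {col_type} (not null={bool(notnull)}, pk={bool(pk)})")
```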

    Contextualizing LLM training with metadata

    When popular consumer LLMs like ChatGPT first came out, there was a lot of buzz about how they would affect the data space: “You don’t need humans, you don’t need SQL—just load this data into the LLM and it will give you answers.” We all know how that turned out. LLMs make things up; they hallucinate. And the quality of their answers is directly tied to the quality of the data you feed them.

    But the accuracy of an LLM is not achieved simply by feeding the model more and more data and hoping it will eventually find patterns. That approach doesn’t scale: it’s very expensive from a compute point of view, and the return on investment runs out at some point. The better way is to provide not only clean, quality data, but also the context that goes with it. That context is metadata.

    It’s this context that’s critical if you want to improve the accuracy of your LLM. If, for example, your goal were to train an LLM to generate SQL, you would need to train it on the relationships between tables and columns, what particular columns mean, which columns contain good-quality data and which contain bad, and so on.
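
    To illustrate, here’s a hedged Python sketch of that idea. The schema, column descriptions, and data-quality notes are all hypothetical, and the resulting prompt would go to whatever model your stack provides; the point is that the metadata travels with the question:

```python
# A hedged sketch of supplying metadata as context for SQL generation.
# The schema, descriptions, and data-quality notes are hypothetical; the
# resulting prompt would go to whatever model your stack provides.
SCHEMA_METADATA = """
Table: policies
  policy_id (INTEGER): unique policy identifier
  premium_usd (REAL): annual premium in US dollars
Table: claims
  claim_id (INTEGER): unique claim identifier
  policy_id (INTEGER): foreign key to policies.policy_id
  amount_usd (REAL): claim payout; reliable after 2020-01, sparse before
"""

def build_prompt(question: str) -> str:
    # The metadata travels with the question, so the model doesn't have
    # to guess table relationships, column meanings, or quality caveats.
    return (
        "Given this schema and its notes:\n"
        f"{SCHEMA_METADATA}\n"
        f"Write a SQL query to answer: {question}\n"
    )

print(build_prompt("What was the average claim amount per policy in 2023?"))
```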

    The generic, general-purpose LLMs like OpenAI’s ChatGPT and Google’s Gemini (previously Bard) are fun for people to play with and can be good at a lot of things—but not great at one specific thing. Not only are they of limited value to a business, they could also open that business up to risk. If I’m running an insurance company, I want my LLM to understand the nuances of my data so that when I ask it a question, it gives me as accurate an answer as possible.

    That’s where domain-specific LLMs come in: models trained on a company’s own data, with the right context, which is to say metadata. To train such a model, a business can start with a smaller LLM, such as one of Meta’s Llama 3 models with 8 billion or 70 billion parameters rather than GPT-4’s estimated 1.5 trillion, and pair it with a retrieval-augmented generation (RAG) system, which supplies metadata context alongside your business data.
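
    Here’s a deliberately simplified sketch of the RAG side of that setup. A real system would embed the question and metadata with an embedding model and search a vector store; a toy word-overlap score stands in for similarity here so the example stays self-contained, and the metadata snippets are hypothetical:

```python
# A deliberately simplified RAG sketch. A real system would embed the
# question and metadata with an embedding model and search a vector
# store; a toy word-overlap score stands in for similarity here. The
# metadata snippets are hypothetical.
METADATA_SNIPPETS = [
    "claims.amount_usd: claim payout in US dollars, reliable after 2020",
    "policies.premium_usd: annual premium, updated nightly from billing",
    "customers.region: two-letter region code, joins to regions.code",
]

def score(query: str, doc: str) -> int:
    # Stand-in for cosine similarity over embeddings.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    # Return the k metadata snippets most relevant to the question.
    ranked = sorted(METADATA_SNIPPETS, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

question = "How do claim payouts in US dollars compare to premiums?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}"
print(prompt)  # this enriched prompt goes to the smaller, domain-tuned model
```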

    Which brings us back to where we started: none of that would be feasible without a proper data foundation.

    A solid data foundation makes all the difference

    When we talk about having a solid data foundation, what we mean is one that harnesses your metadata and purposefully uses it to help you build, manage, and iterate on your data projects as efficiently as possible. When you use modern data tools and solutions that incorporate rich, column-level metadata into every aspect of your data platform’s architecture, you have the key to delivering business-ready data quickly and consistently. This is what we mean when we talk about the importance of column-aware architecture.

    In addition, enriching your data with this valuable context helps to optimize your query performance, lowering your overall compute costs. Think of it this way: What does your brain power look like when you’re trying to figure something out without any clues? You have to think a lot harder. But if someone gives you a clue, you can start to narrow it down and it becomes easier. It’s the same logic—metadata offers those clues.
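
    This is exactly how modern query engines use metadata such as file-level min/max statistics to skip data they don’t need to scan. A simplified Python sketch, with made-up file names and statistics:

```python
# A sketch of metadata-driven pruning. File-level min/max statistics
# (metadata) let a query skip files that can't possibly match a filter,
# so less data is scanned. File names and statistics are made up.
FILE_STATS = {
    "claims_2021.parquet": {"min_filed_at": "2021-01-01", "max_filed_at": "2021-12-31"},
    "claims_2022.parquet": {"min_filed_at": "2022-01-01", "max_filed_at": "2022-12-31"},
    "claims_2023.parquet": {"min_filed_at": "2023-01-01", "max_filed_at": "2023-12-31"},
}

def files_to_scan(date_from: str, date_to: str) -> list[str]:
    # Keep only files whose min/max range overlaps the requested window.
    return [
        name for name, stats in FILE_STATS.items()
        if stats["min_filed_at"] <= date_to and stats["max_filed_at"] >= date_from
    ]

# With the metadata "clue", a query filtered to 2023 touches one file
# instead of all three.
print(files_to_scan("2023-03-01", "2023-06-30"))  # ['claims_2023.parquet']
```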

    What being metadata-driven really means

    Here at Coalesce we talk about the value of metadata all the time, and for good reason—we’ve seen firsthand the enormous difference data teams experience when working with a solid data foundation that makes the most of their metadata. Taking a metadata-driven approach to building out your platform means choosing tooling that captures as much metadata as possible and leverages it to bring context to all your data. It’s the difference between what you and your team might be spending time on today, such as firefighting and troubleshooting, and what you’d rather be spending it on in the future: fine-tuning your LLMs, launching new data projects, and bringing value to the business.

    To learn more about how Coalesce can help you on this journey, contact us or start a free trial.
