Before Jumping Into Generative AI and LLMs, Ensure Data Fundamentals Are in Place

AI is changing the data space—and we plan to move in lockstep with that change, with the proper foundation and strategy in place.

In my two decades in the data industry, I’ve seen one pattern repeat over and over again: every time something new comes up, we collectively jump on it and fixate on the belief that it’s going to solve everything. We rush to that new technology, motivated by the fear of missing out or of being left behind. A few years later, we realize we should have fixed the foundation first, and we start all over again.

In the past decade, data modeling took a backseat as organizations rushed to the cloud and the promise of Big Data-powered insights. Many still struggle to extract insights from the data lakes where they dumped all their data, effectively turning them into data swamps. Now data modeling is making a comeback–and rightfully so.

Today, with Artificial Intelligence (AI) and Large Language Models (LLMs) dominating headlines and conversation, a new gold rush has emerged and the race is on to adopt these latest technologies. Without a doubt, AI is changing our industry, and quickly. But it carries the classic risk of garbage in, garbage out: when the underlying data is poor, the outputs become unreliable and hard to diagnose.

Failed experiments can be costly, especially if they are used to drive business decisions. It’s essential for leaders to focus on data fundamentals before committing to spend millions on an LLM.

The importance of data fundamentals

Analytics and statistics have been at the core of data teams since the invention of the database. Even today, these are enough to guide business decisions, understand customer behavior, and optimize growth. Before embarking on an ML & AI journey, we should first ask, “Is our data clean and consistent? Can we understand our business? Our customers? The industry?”
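
As a rough illustration of what asking “Is our data clean and consistent?” can look like in practice, here is a minimal Python sketch that counts duplicate keys and missing values in a small set of records. The function name `profile_rows`, the sample `customers` records, and the choice of checks are all assumptions made for this example; real checks would depend on your own data and tooling.

```python
from collections import Counter


def profile_rows(rows, key, required):
    """Return simple data-quality counts for a list of dict records.

    Hypothetical helper for illustration only: it flags duplicated key
    values and counts empty or missing values in the required columns.
    """
    key_counts = Counter(row.get(key) for row in rows)
    duplicate_keys = sorted(
        k for k, n in key_counts.items() if k is not None and n > 1
    )
    missing = {
        col: sum(1 for row in rows if row.get(col) in (None, ""))
        for col in required
    }
    return {
        "row_count": len(rows),
        "duplicate_keys": duplicate_keys,
        "missing": missing,
    }


# Made-up sample data: one duplicated customer_id, one empty email,
# and one missing country.
customers = [
    {"customer_id": 1, "email": "a@example.com", "country": "US"},
    {"customer_id": 2, "email": "", "country": "DE"},
    {"customer_id": 2, "email": "b@example.com", "country": None},
]

report = profile_rows(customers, key="customer_id", required=["email", "country"])
print(report)
```

Even a check this small surfaces the kinds of problems (duplicate keys, empty fields) that would quietly degrade any analytics or model built on top of the data.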

LLMs require massive amounts of training data to generate results. ChatGPT was reportedly trained on a dataset of roughly 45TB. The PaLM 2 LLM reportedly has 340 billion parameters, and its predecessor, PaLM, has 540 billion. If you’ve played with these tools, you might notice the outputs are impressive, though often flawed. This goes back to the garbage in, garbage out problem: if you work with flawed data, you get flawed results. Worse, if you work with bad data and inefficiently built pipelines (not an uncommon combination), you get bad results at a potentially astronomical cost.

We envision a future where businesses will build smaller LLMs off of their internal data, groomed to meet their specific business use cases and customers’ needs. Those LLMs will not be built on billions of parameters; they will be smaller, more efficient, and domain-specific. With that, compute costs will be manageable. But to get there, we must ensure we are not working with garbage data. We must put proper data fundamentals in place.

So, what do we mean by “fundamentals?”

They may differ based on your business or industry, but you’ve probably heard various categories within the context of the Modern Data Stack (MDS). For most, we can lump these into four groups, in the order of their place in the data value chain:

  • Collection
  • Modeling
  • Governance
  • Consumption

These are generalized groups (e.g. storage is a subset of collection, transformation is a subset of modeling, and documentation and literacy are subsets of governance), but they illustrate data priorities well. Without first collecting quality data, modeling and transformation efforts will suffer. Similarly, governing improperly modeled data is futile. It’s only after curating consumable, governed data that we’d recommend pursuing AI/ML initiatives.

At each of these stages, consider two overarching concepts: automation and maintainability, and collaboration and transparency. Both mitigate overhead (labor and cloud costs) and improve efficiency.

  • Automation and maintainability. We believe these are the biggest prerequisites for setting the right data foundation at your organization. Anything pattern-based can be automated. So can any process that is repeatable. Automation will improve maintainability, reduce costs, and ultimately set organizations up to pursue AI/ML initiatives and to work with or build LLMs that benefit the business.
  • Collaboration and transparency. You don’t want to create silos, be it in your data or data processes. At Coalesce, we believe that data platforms and tools should support multiple roles and stakeholders, regardless of technical skill, and provide governed access and transparency of data initiatives and projects across the organization. We’ve built and continually develop new Coalesce features with collaboration in mind.
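
To make “anything pattern-based can be automated” concrete, here is a minimal Python sketch that generates repetitive staging SQL from column metadata instead of writing each statement by hand. It is not a description of how Coalesce works: the function `staging_view_sql`, the `stg_` naming convention, and the table and column names are all hypothetical, chosen only to show the idea.

```python
def staging_view_sql(source_table, columns):
    """Build a CREATE VIEW statement that renames and casts columns.

    `columns` is a list of (source_name, target_name, sql_type) tuples.
    Hypothetical helper: the stg_ prefix is an assumed convention.
    """
    select_lines = [
        f"    CAST({src} AS {sql_type}) AS {dst}"
        for src, dst, sql_type in columns
    ]
    view_name = f"stg_{source_table.lower()}"
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        "SELECT\n"
        + ",\n".join(select_lines)
        + f"\nFROM {source_table};"
    )


# Made-up metadata: the same pattern is applied to every table.
specs = {
    "RAW_ORDERS": [
        ("ORDER_ID", "order_id", "INTEGER"),
        ("AMT", "amount", "NUMERIC(12,2)"),
    ],
    "RAW_CUSTOMERS": [
        ("CUST_ID", "customer_id", "INTEGER"),
        ("EMAIL_ADDR", "email", "VARCHAR"),
    ],
}

statements = {
    table: staging_view_sql(table, cols) for table, cols in specs.items()
}

for sql in statements.values():
    print(sql)
    print()
```

Because the pattern lives in one function, changing the convention (say, a new naming rule or an added audit column) means editing one place rather than every statement, which is where the maintainability benefit comes from.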

A logical approach to the “data” problem

This all sounds great in theory, but how do we focus on fundamentals and prioritize?

To do so, we have to start with a few basic questions:

  • Where are we now?
  • Where would we like to be in the future?
  • How will we get there?
  • How long will it take?

When considering these questions, prioritize direction over speed. You could jump on the Metro as soon as possible, but if you board the wrong line, you’ll never reach your destination.

Where are we now? What does our data look like? Can we answer basic questions about the business, in a timely manner? Does our organization trust data? Do we have silos? Can anyone answer these questions or just the data team? Do the people who are responsible for reporting have all the support they need: tools, knowledge, skill set?

Where would we like to be in the future? Now it’s time to set realistic goals. For many, simply building trust in data and the data team is a win. Trust is notoriously difficult to establish and easy to lose. Building from foundations helps to ensure reliable data, timely reporting, and subsequent trust.

Perhaps your team is already there—if so, what’s the next step? A self-service platform for non-technical users? Nuanced financial reporting? Or are you ready to pursue a decentralized structure–data mesh–that eliminates bottlenecks and improves efficiency? These are ambitious goals, but they can all be reached with planning, prioritization, and consistent effort.

How will we get there? This is where your peers and experts are invaluable. We’ve noticed that many data leaders are open about their challenges and solutions. As such, we highly recommend attending conferences like Snowflake Summit to meet and brainstorm with industry leaders. As much as we think of ourselves as innovators, every problem has a solution… and most are not entirely novel.

How long will it take? Be patient, take incremental steps, and focus on fundamentals: it’s much easier to do it right the first time than to backtrack and rebuild. That said, quick wins can be a powerful motivator: progress becomes more visible, and feedback arrives earlier in the delivery lifecycle. If you can find and execute one data project quickly and show that it generates value for the business, chances are your team will gain the trust of leadership and key stakeholders, who will then be more willing to allow the time needed to set the fundamental framework of your processes and data.

Your data team is key to a successful data strategy

In a world that’s hyping artificial intelligence, we suggest prioritizing human capital. We believe that underlying all data initiatives are quality, hard-working individuals.

That means finding leaders for each domain of your business: data rockstars with a willingness to learn. We cannot overstate the power of learning and advancement in passionate people. The job of a leader is akin to that of a gardener—curate the right environment and bring in the best candidates for growth. The rest will fall into place.

Conclusion

Before you jump into the latest, trendiest technologies and solutions, ensure you’ve built a proper foundation for your data organization. By focusing on prioritization and people, fundamentals will fall into place.

Coalesce provides the first and only automated and scalable transformation framework for those in the Snowflake ecosystem. Our product is built on time-tested tech: while our method revolutionizes transformation, we start from simplicity. SQL transformations, structured intelligently to eliminate bottlenecks, are at the core of our solution.

Are we looking into integrating generative AI into our product? Of course. We cannot ignore the profound potential for change that cutting-edge technology brings, not only for the Modern Data Stack and our industry, but for the world and humanity as a whole. But, as Snowflake CEO Frank Slootman said in his keynote speech at Snowflake Summit 2023, “In order to have an AI strategy, you need to have a data strategy.”

The data space is going to change because of AI and with AI–and we plan to move in lockstep with that change–with the proper foundation and strategy in place. To stay in touch with us on this journey, contact us or start your free trial here.
