Hands-On With Coalesce MCPs: Transform, Catalog, and Quality

Part 2: Going from ad-hoc to production prompts with Claude skills

    In part 1 of this series, we looked at how to use MCPs across the Coalesce suite, and how they let us speed up workflows like root cause analysis and tagging owners. In part 2, we go a level up and look at how to use skills to package those workflows into reproducible recipes that follow the same path each time.

    Introduction to skills

    A skill is a markdown file (SKILL.md) plus optional supporting files that provide instructions on how to perform a task. While skills are not unique to Claude, we’ll focus on Claude’s approach for the remainder of this article. In short, a skill tells Claude what the task is, what tools to use, what good output looks like, and what to avoid.
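
    To make the shape concrete, here is a minimal sketch in Python that renders a SKILL.md skeleton. The name and description frontmatter fields follow Claude’s skill format as we understand it; the section headings, the render_skill_md helper, and the example values are illustrative assumptions, not a required layout.

```python
def render_skill_md(name: str, description: str, task: str,
                    tools: list[str], good_output: str, avoid: str) -> str:
    """Render a SKILL.md skeleton: YAML frontmatter plus four instruction sections.

    Assumes the values contain no characters that would need YAML quoting.
    The section headings are illustrative, not a required layout.
    """
    tool_lines = "\n".join(f"- {tool}" for tool in tools)
    return (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "---\n\n"
        f"## Task\n{task}\n\n"
        f"## Tools\n{tool_lines}\n\n"
        f"## What good output looks like\n{good_output}\n\n"
        f"## What to avoid\n{avoid}\n"
    )


# Hypothetical example values; only the two tool names appear in this article.
skill_md = render_skill_md(
    name="weekly-data-quality-report",
    description="Generates a weekly data quality report from Coalesce Quality and posts it to Slack.",
    task="Summarize open incidents and issues from the past 7 days.",
    tools=["list_incidents", "get_issue_impact"],
    good_output="A short summary first, with supporting evidence in the thread.",
    avoid="Dense reports that bury the most important issues.",
)
print(skill_md)
```

    The four sections mirror the four things a skill tells Claude: the task, the tools, what good output looks like, and what to avoid.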

    Claude is designed to pick the right skill on its own. When a conversation starts, it reads each skill’s description and tries to match it against what the user is asking. You can also invoke a skill explicitly: typing / brings up the available skills and lets you pick one directly.

    Skills are only as good as the underlying data they can access, so the more context you can pass through MCPs, the better the skills you can build on top. Our Coalesce examples here rely heavily on this, using metadata around data transformations, governance definitions, and data quality.

    Planning a skill suite

    For data roles specifically, I like to think of data skills as jobs to be done. These can be grouped into areas of responsibility such as data quality, data modelling, analytics, and data governance, with each area containing specific jobs. This is powerful: for example, it lets us encode our testing philosophy as a skill, so each time data engineers want to add new tests they can use the add-tests-to-node skill, which follows our standards. (Internally we’ve taken this a step further and encoded our data quality guide as a skill with step-by-step instructions for adding high-impact tests.)

    Organize skills around jobs to be done

    As a rule of thumb, the more often you find yourself doing a workflow, and the more time it takes, the more likely it is to benefit from being converted into a skill.

    Each skill is written with a description that helps Claude quickly navigate to the right one for the job. So when someone asks to “triage this issue,” the model consistently points them to that skill and they get all the right context, instead of the model chaining LLM calls and reinventing the workflow each time. That’s the benefit of building out a deliberate skill suite, and a good way to figure out which skills you should build.

    Let’s put it into practice by building an actual skill.

    Deep dive: designing a skill for a weekly data quality report

    A common challenge we see customers face is wanting a weekly data quality report. We’ve designed this inside our product, but customers often want bespoke metrics that fit their use cases.

    This may look simple on the surface, but designing it as a skill requires some deliberate choices. A few things to consider before we start writing the skill:

    • What should the structure of the report look like?
    • What counts as a “data issue”? Are all severities included, or only errors?
    • Should it be actionable: tagging owners, linking to data products, suggesting next steps?

    We could start documenting all these steps manually but creating a skill for it is a much better idea.

    Using skill-creator to create the skill

    Claude’s skill-creator is excellent for kickstarting the process, after which you can iterate with your own feedback. It takes a description of what you want and walks you through creating the skill.

    The prompt we used:

    Use skill-creator to create a skill that generates a weekly data quality report from Coalesce Quality and posts it to Slack. Use when the user asks for a “weekly data quality report,” “Coalesce Quality weekly summary,” “data health digest,” “weekly DQ recap,” or anything that combines Coalesce Quality incidents/issues/test results with a Slack delivery. Covers open incidents, open issues, failing monitors, recent execution failures, and affected entities over the past 7 days.

    There are three key parts to a good description: what it does + when to use it + key capabilities.
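
    As a rough illustration of that three-part structure, the sketch below assembles a description from its parts. The build_description helper is hypothetical; the example phrases are taken from the prompt above.

```python
def build_description(what: str, when: list[str], capabilities: list[str]) -> str:
    """Join the three parts: what it does + when to use it + key capabilities."""
    triggers = ", ".join(f'"{phrase}"' for phrase in when)
    covers = ", ".join(capabilities)
    return f"{what} Use when the user asks for {triggers}. Covers {covers}."


description = build_description(
    what="Generates a weekly data quality report from Coalesce Quality and posts it to Slack.",
    when=["weekly data quality report", "data health digest", "weekly DQ recap"],
    capabilities=["open incidents", "open issues", "failing monitors"],
)
print(description)
```

    Keeping the parts separate makes it easy to spot when one is missing, for instance a description that says what the skill does but never when to trigger it.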

    With this information, Claude will create a skill that follows the best-practice structure, based on its best understanding of the tools available in the connected MCPs. Importantly for our example, it defaults to prescriptive instructions for each step and names the tool each step uses. This gives us confidence that every run follows the same steps and produces a consistent result, which is exactly what we want.

    Anatomy of a skill

    The better the description is, the better the first iteration of the skill is. Below are a few examples of descriptions that make weaker starting points.

    “Create a weekly data quality report”. This says nothing about when to trigger or what it contains, and leaves too much up to interpretation about which metrics to use.

    “Use list_incidents… to create a weekly data quality report…”. Specifying this level of detail is too technical for a first draft. Claude will automatically capture a lot of this from a good natural language description, and you can instead go back and edit the skill with these details later on.

    The result

    A few real-world things the first draft didn’t handle, which we ended up editing in by hand:

    • Whether the report goes for review before being sent, or posts directly
    • Which Slack channel it should go to (and what happens when the default doesn’t exist in the workspace)
    • Whether the time window is adjustable (for example, what if we want a monthly report instead of a weekly one?)
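
    One way to think about these three gaps is as explicit settings rather than implicit behavior. The sketch below is an assumption about how they could be modelled, not how the skill stores them: ReportConfig, its field names, and the channel names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ReportConfig:
    """Hypothetical settings covering the three gaps listed above."""
    review_before_posting: bool = True       # draft for review vs. post directly
    channel: str = "data-updates"            # example default channel
    fallback_channel: str = "data-alerts"    # example fallback channel
    window_days: int = 7                     # 7 for weekly, roughly 30 for monthly


def report_window(config: ReportConfig, now: datetime) -> tuple[datetime, datetime]:
    """Return the (start, end) of the reporting window ending at `now`."""
    return now - timedelta(days=config.window_days), now


def resolve_channel(config: ReportConfig, existing_channels: set[str]) -> str:
    """Pick the default channel, then the fallback, else fail loudly."""
    for candidate in (config.channel, config.fallback_channel):
        if candidate in existing_channels:
            return candidate
    raise LookupError("Neither the default nor the fallback channel exists")


monthly = ReportConfig(window_days=30)
start, end = report_window(monthly, datetime.now(timezone.utc))
print(start.date(), "to", end.date())
```

    Failing loudly when no channel exists is a deliberate choice in this sketch: a report that silently goes nowhere is worse than one that errors.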

    We also had to iterate a few times on how the data is presented. For example, the initial report was too dense, so we adjusted the skill to send a brief summary and then reply with the evidence in the Slack thread.
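
    The summary-plus-thread pattern can be sketched as a pure function that splits findings into one short parent message and a list of thread replies. The Finding type, the split_for_slack helper, and the message wording are assumptions for illustration; how the messages actually get posted depends on the Slack tooling available to the skill.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical shape of one item in the report."""
    title: str
    owner: str
    evidence: str


def split_for_slack(findings: list[Finding],
                    max_summary_items: int = 3) -> tuple[str, list[str]]:
    """Return (summary message, thread replies) for a list of findings."""
    # Slack's mrkdwn uses single asterisks for bold.
    lines = [f"*Weekly data quality report*: {len(findings)} open finding(s)"]
    for finding in findings[:max_summary_items]:
        lines.append(f"• {finding.title} (owner: {finding.owner})")
    remaining = len(findings) - max_summary_items
    if remaining > 0:
        lines.append(f"...and {remaining} more in the thread")
    # One reply per finding keeps the evidence out of the parent message.
    replies = [f"*{finding.title}*\n{finding.evidence}" for finding in findings]
    return "\n".join(lines), replies


# Made-up findings, purely to show the output shape.
summary, replies = split_for_slack([
    Finding("Example freshness check failing", "team-a", "Last successful run: 3 days ago"),
    Finding("Example null-rate spike", "team-b", "Null rate rose above the threshold"),
])
print(summary)
```

    Each reply would then be posted into the thread of the summary message, so the channel only ever shows the short version.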

    A simple weekly overview of ongoing data quality issues with details around specific failures, owners and recommended next steps.

    The powerful thing about our data quality report is that it will look consistent next week, when someone else is on the data ops rota and runs it. We can also schedule it to run weekly and send out the update automatically. In other words, we’ve gone from using MCPs ad hoc to a production-grade report that’s a key part of our data quality workflow.

    From report to action with a triage skill

    A weekly report tells you what is broken, but triaging the issues is often still a manual process: deciding whether each issue should be prioritized, routing it to an owner, and pasting information into Linear tickets.

    We built another skill called data-ops-weekly-rota-triage. It pulls the same Coalesce Quality data the report skill uses, but proposes Linear tickets instead of formatting for Slack.

    The interesting design problem is how each issue gets triaged. The skill scores issues by severity, downstream impact (does it hit a P1 data product?), and ownership status (is anyone already investigating?), then assigns one of four actions: create a ticket, skip it because someone’s on it or it’s already been filed, acknowledge it as known variance, or flag it for monitor tuning if it’s a known noisy alert.
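
    The decision logic can be sketched roughly as follows. This is not the skill’s actual implementation: the Issue fields, the severity labels, the weights, and the order in which the rules are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CREATE_TICKET = "create a ticket"
    SKIP = "skip"
    ACKNOWLEDGE = "acknowledge as known variance"
    TUNE_MONITOR = "flag for monitor tuning"


@dataclass
class Issue:
    """Hypothetical view of an issue; field names are assumptions."""
    severity: str                      # assumed labels: "error" or "warning"
    hits_p1_product: bool = False
    being_investigated: bool = False
    already_ticketed: bool = False
    known_noisy: bool = False
    known_variance: bool = False


SEVERITY_WEIGHT = {"error": 2, "warning": 1}   # hypothetical weights


def priority(issue: Issue) -> int:
    """Score used to order new tickets: severity plus a bonus for P1 impact."""
    return SEVERITY_WEIGHT.get(issue.severity, 0) + (2 if issue.hits_p1_product else 0)


def triage(issue: Issue) -> Action:
    """Assign one of the four actions; the rule order is an assumption."""
    if issue.being_investigated or issue.already_ticketed:
        return Action.SKIP
    if issue.known_noisy:
        return Action.TUNE_MONITOR
    if issue.known_variance:
        return Action.ACKNOWLEDGE
    return Action.CREATE_TICKET


issue = Issue(severity="error", hits_p1_product=True)
print(triage(issue).value, priority(issue))
```

    Checking ownership first means the skill never proposes a duplicate ticket, whatever the severity.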

    This goes to show how different skills can tie into each other and start automating manual parts of the workflow that would otherwise take a long time to do.

    Evaluating our skill

    We’ve now created a skill that’s part of our workflow. Hopefully it saves the team hours every week, and more importantly, distributes a sense of ownership over data quality without anyone having to call it out explicitly.

    In many cases, eyeballing the results gives us a good idea of how well a skill works. But sometimes it makes sense to approach it more systematically. When evaluating skills, it helps to look through three lenses:

    1. Does it trigger on the right requests?
    2. Does it produce what we expect it to?
    3. How does it handle edge cases?

    Which of these matters most depends on the use case. For a weekly report where users can also explicitly invoke the skill, triggering accuracy matters less. For an agentic analytics skill that’s expected to fire on any business question, triggering is critical. In our case, what we care about most is output quality and edge-case handling: the report should be good, and stay good when the data looks different from how it looks this week.

    Putting the skill through actual evals

    We ran a small benchmark: three test cases, each run twice, once with the skill loaded (Claude reads SKILL.md and follows it) and once without (Claude makes structural decisions from scratch using the same tools). Claude’s skill-creator comes with this baked in and can help generate the test cases for you.

    Each output was graded against five objective assertions per eval: does it contain a TL;DR, does it deduplicate the noisy monitor, does it call out P1 downstream impact, is it actually Slack-formatted, and so on.
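
    To give a feel for what “objective” means here, the sketch below grades a report against four of the checks named above. The string and regex heuristics are our own simplifications and not the graders the benchmark used; the function names and the sample report are made up.

```python
import re


def has_tldr(report: str) -> bool:
    return "tl;dr" in report.lower()


def dedupes_monitor(report: str, monitor_name: str) -> bool:
    # Crude proxy: the noisy monitor is mentioned at most once.
    return report.count(monitor_name) <= 1


def calls_out_p1(report: str) -> bool:
    return re.search(r"\bP1\b", report) is not None


def is_slack_formatted(report: str) -> bool:
    # Slack's mrkdwn uses *bold*; Markdown headings and **bold** do not render.
    has_md_heading = re.search(r"^#{1,6} ", report, re.MULTILINE) is not None
    return "**" not in report and not has_md_heading


def grade(report: str, noisy_monitor: str) -> dict[str, bool]:
    """Run every assertion and return a name -> passed mapping."""
    return {
        "has_tldr": has_tldr(report),
        "dedupes_monitor": dedupes_monitor(report, noisy_monitor),
        "calls_out_p1": calls_out_p1(report),
        "slack_formatted": is_slack_formatted(report),
    }


def pass_rate(results: dict[str, bool]) -> float:
    return sum(results.values()) / len(results)


# Made-up report text, purely to exercise the checks.
sample = "*TL;DR*: 2 open issues\n• example_monitor failing, hits a P1 data product"
results = grade(sample, noisy_monitor="example_monitor")
print(results, pass_rate(results))
```

    Because each check returns a plain pass or fail, the same report can be graded identically for the run with the skill and the run without it.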

    Result from the test benchmark

    Eval review

    In short, the skill gives us better output (more of our assertions pass), but on average it also takes longer and uses more tokens to complete the task. The extra tokens are the price of the skill pulling more data (get_issue_impact per top issue, deeper execution history) and following a stricter template. For a once-a-week report, that’s fine.

    With skills, the variance also drops, as they make each run much more predictable. Without one, Claude makes structural decisions fresh every time: sometimes it writes a great report, sometimes it skips the TL;DR or forgets to deduplicate noise. With a skill, the template is enforced. That’s the real case for skills: without one, every run is a roll of the dice. We shouldn’t read too much into these numbers with such a small sample, but they still illustrate the point.
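
    If you want to put a number on “variance drops” for your own benchmark, one simple option is to compare the mean and spread of per-run pass rates with and without the skill. The sketch below assumes you already have those pass rates; the summarize and compare helpers are hypothetical, and with only a handful of runs the output is indicative at best.

```python
from statistics import mean, pstdev


def summarize(pass_rates: list[float]) -> dict[str, float]:
    """Mean and population standard deviation of per-run pass rates."""
    return {"mean": mean(pass_rates), "spread": pstdev(pass_rates)}


def compare(with_skill: list[float], without_skill: list[float]) -> dict[str, float]:
    """Positive mean_gain and negative spread_change both favor the skill."""
    a, b = summarize(with_skill), summarize(without_skill)
    return {
        "mean_gain": a["mean"] - b["mean"],
        "spread_change": a["spread"] - b["spread"],
    }
```

    Calling compare with your own two lists of pass rates shows whether the skill raised the average, narrowed the spread, or both.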

    Summary and what’s next

    Skills are great when the task is well-defined and bounded. They start to break when the workflow has multiple decision points, needs to integrate with ticketing or PR review, or has to coordinate across teams. That’s where playbooks and integrated AI agents can help.

    Part 3 picks up there: sharing skills across teams, instrumenting them so you can see which ones actually get used, building a skill overview with evals and usage metrics, and using purpose-built agents that are part of the Coalesce suite.

    Frequently Asked Questions About Coalesce MCP

    A Claude skill is a reusable recipe (a SKILL.md file plus optional supporting files) that tells Claude how to perform a specific task: what the task is, which tools to use, what good output looks like, and what to avoid. In a data engineering context, skills sit on top of the Coalesce Transform, Catalog, and Quality MCPs, using that metadata to run real workflows like root cause analysis, owner tagging, or a weekly data quality report in a consistent, repeatable way.

    Ad-hoc prompting works well for exploration when you can verify the answer yourself, but each run starts fresh, so the same data quality question can come back structured differently each time. A skill enforces one path: the same investigation steps, the same report layout, the same triage logic on every run. Variance drops and output stays consistent across team members and across weeks, which matters when a workflow is part of your data operations rather than a one-off question.

    A rule of thumb: the more often your team repeats a workflow, and the more time it takes, the better it fits as a skill. It helps to think of data skills as jobs to be done, grouped by area of responsibility (data quality, modelling, analytics, governance), with specific jobs in each. This is where skills get powerful: you can encode your own standards, like a testing philosophy, so engineers add new tests with an add-tests-to-node skill that follows your conventions every time instead of doing it their own way.

    Claude’s skill-creator handles most of the initial work. You give it a natural-language description of the workflow and when it should trigger, and it generates a skill following best-practice structure, using its understanding of the tools in your connected Coalesce MCPs. It defaults to prescriptive, tool-specific steps so runs stay consistent. Keep the description natural (what it does, when to use it, key capabilities) rather than hard-coding tool names up front; you can refine technical detail later.

    Most skills need a few iterations. Common gaps in a first version of a data quality report skill: whether output goes for review or posts directly, which Slack channel it targets and the fallback if that channel doesn’t exist, and whether the time window is configurable (say, monthly instead of weekly). Presentation usually needs work too. Ours started too dense, so we changed it to post a short summary with the supporting evidence in the Slack thread.

    Skills can build on each other, and that’s where they pay off. A weekly data quality report skill formats Coalesce Quality data for Slack. A second skill, data-ops-weekly-rota-triage, pulls the same Quality data but proposes Linear tickets, scoring each issue by severity, downstream impact (does it hit a P1 data product?), and whether someone’s already investigating, then assigning an action: create a ticket, skip it, acknowledge it as known variance, or flag it for monitor tuning. Together they automate a rota workflow that would otherwise be manual.

    Evaluate it on three things: does it trigger on the right requests, does it produce what you expect, and how does it handle edge cases. The priority depends on the workflow: for a weekly report that you can also invoke manually, output quality and edge-case handling matter more than triggering. You can eyeball results, or run a small benchmark grading each output against objective checks (has a TL;DR, deduplicates a noisy monitor, calls out P1 downstream impact, is correctly Slack-formatted). In testing, skills passed more of these checks but took a bit longer, since pulling deeper data and following a stricter template costs extra tokens. That’s a fair trade for a once-a-week report.

    Skills work best for well-defined, bounded tasks. They start to strain when a workflow has multiple decision points, needs to integrate with ticketing or PR review, or has to coordinate across teams. That’s the boundary where playbooks and integrated AI agents in the Coalesce suite take over, which is what part 3 of this series covers.