Data Engineering · September 24, 2025 · 4 min read

How MCP Prompts Automated 600-Line Kusto Query Debugging

I recently helped an old colleague debug one of their product's most critical Kusto queries. This wasn't a toy query: 600+ lines, years of data, 10+ joins, mv-applys, 20+ summarize operators. The kind of query that powers core functionality.

These guys have been working with Kusto for over a decade, but even then… when a query gets this big, figuring out why it's slow becomes a nightmare.


The Traditional Debugging Approach

The way I've approached these problems for years is simple but effective: I focus on two key measures – total CPU and scanned/total extents.

  • Total CPU is the actual computation time needed, independent of parallelism or external factors. It tells me if the query is worth optimizing.
  • Scanned Extents reveal when queries span too many shards, which often signals major optimization opportunities.
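
Both numbers come from the query statistics the cluster reports with every run. Here's a minimal sketch of pulling them after the fact with the .show queries management command (the filter values are placeholders, and the exact column names can vary a bit between cluster versions):

// Resource stats for recent runs; TotalCpu and ScannedExtentsStatistics are the two numbers I care about
.show queries
| where StartedOn > ago(30m) and Text has "Sales"
| project StartedOn, Duration, TotalCpu, ScannedExtentsStatistics
| order by StartedOn desc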

The hard part? Identifying which parts of the query contribute most to these two metrics. You have to break the query into pieces and run them again and again. Take a simple join like:

Sales
| join (Customers) on customerId

To measure it properly I need to:

  1. Run the base query before the join.
  2. Run the inner query of the join.
  3. Run the complete query and subtract the baseline costs to isolate the JOIN operation's performance impact.

And since the total CPU metric can sometimes be inconsistent, I repeat the steps a few times for accuracy. That's 10+ runs… just for one small join. Now imagine doing that for 600+ lines. 😅
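
For the toy Sales/Customers join above, the three pieces I actually run look like this (just a sketch; in the real query each piece is far bigger):

// 1. The base (left-side) query before the join
Sales

// 2. The inner (right-side) query of the join
Customers

// 3. The complete query; subtracting the costs of 1 and 2 isolates the join itself
Sales
| join (Customers) on customerId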


Automating the Process with MCP

🔗 Remember my post where I said MCP tools are ingredients, MCP prompts are plating them? This was the perfect place to apply it.

I extended my Kusto MCP server to return query usage stats, then defined a workflow in an MCP prompt that tells Claude:

  • 👉 break the query into logical steps
  • 👉 measure total CPU + scanned extents
  • 👉 assemble a structured performance report

The Automated Performance Report

The report gives you:

  • Bottlenecks called out
  • Total CPU increases visualized step by step
  • Recommendations on where to optimize

Here's what the automated performance report looks like (from my testing, not the actual 600+ line query):

Performance Analysis Results

Based on the 3 runs per check, here's the performance impact analysis:

Scanned Extents Impact:

- Check #1 (Base query): 31 extents (baseline)
- Check #2 (+ Lookup): 62 extents (+31 extents, +100%)
- Checks #4-7: All remained at 62 extents (no additional overhead)

Total CPU Impact:

| Operation                     | Avg CPU (seconds) | CPU Increase from Previous |
|-------------------------------|-------------------|----------------------------|
| Base query (Check #1)         | ~0.08s            | Baseline                   |
| + Lookup (Check #3)           | ~18.39s           | +18.31s (+23,000%)         |
| + First summarize (Check #4)  | ~28.07s           | +9.68s (+53%)              |
| + Second summarize (Check #5) | ~28.06s           | -0.01s (0%)                |
| + Order by (Check #6)         | ~27.92s           | -0.14s (-0%)               |
| + Top 10 (Check #7)           | ~31.77s           | +3.85s (+14%)              |

Key Findings:

🔴 MAJOR BOTTLENECK: Lookup Operation
- Scanned Extents: Doubled from 31 to 62 (+100%)
- Total CPU Impact: Increased by ~18.3 seconds (+23,000%)
- The lookup operation is by far the most expensive part of your query

🟡 MODERATE IMPACT: First Summarize
- Added ~9.7 seconds of total CPU time (+53% over lookup)
- No additional extent scanning overhead

🟢 MINIMAL IMPACT: Second Summarize, Order By
- Virtually no additional total CPU overhead
- These operations are very efficient on the reduced dataset

🟡 MODERATE IMPACT: Top 10
- Added ~3.9 seconds (+14%)
- Likely due to sorting overhead on the final result set

The result? I went from drowning in manual runs to getting an instant, data-driven report on what hurts most. And in this case, the biggest culprit wasn't the joins at all — it was a table with tons of tiny extents (1/200 the recommended size).

Left unchecked, these tiny extents would keep accumulating, degrading performance every single day.
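
If you want to check whether one of your own tables has this problem, here's a rough sketch (the table name is a placeholder, and the exact columns returned by .show table ... extents are worth double-checking on your cluster):

// Count the extents and their average original size to spot "lots of tiny extents"
.show table MyFactTable extents
| summarize ExtentCount = count(), AvgOriginalSizeMB = avg(OriginalSize) / 1024.0 / 1024.0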


The Key Recommendations

My two main recommendations were:

  1. Introduce a custom merge policy → allow dramatically more aggressive merging, shrinking the extent count and getting extents back to the recommended size.
  2. Bound one of the join's inner tables to a constant time frame → given the nature of the data and the product, this was safe and cut the join's right side down massively (rough sketches of both below).
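
Neither change is exotic. A minimal sketch of both, with placeholder table names, an assumed time column, and illustrative policy values (check the merge policy documentation for the exact property names and sensible thresholds for your cluster before applying anything):

// 1. Allow far more extents to be merged on the offending table (values are illustrative)
.alter table MyFactTable policy merge @'{"MaxExtentsToMerge": 1024, "RowCountUpperBoundForMerge": 16000000}'

// 2. Bound the join's inner table to a fixed time window (UpdatedAt and 90d are assumptions)
Sales
| join (
    Customers
    | where UpdatedAt > ago(90d)
) on customerId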

These two changes achieved:

  • Significant reduction in the join's total CPU cost, thanks to fewer records on the right-hand side.
  • Major improvement in scanned extents across the entire query.

💡 Lesson: sometimes the difference between hours of detective work and a clear, actionable optimization is just "plating the ingredients" with the right MCP prompt.
