Data Engineering · August 3, 2025 · 4 min read

AI Agents Debug Spark Faster


When I first had to debug a Spark job on Databricks, I expected a long, painful cycle. Distributed data, cluster orchestration, Python code I wasn't fluent in — the usual recipe for frustration. Instead, I ended up with a surprisingly tight debug loop using Databricks CLI, DuckDB, and a few pragmatic tricks. This is the story of that cycle.


The Problem

I had a job running in Databricks that was supposed to process a large set of Parquet files. On paper, the pipeline was simple: filter rows, apply a few transformations, and write the result back. In practice, the job was misbehaving — inconsistent row counts, slow execution, and cluster utilization that looked suspiciously low.
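To give a sense of the shape of the job, here's a stripped-down sketch of what it roughly looked like. The input path matches the CLI examples later in this post; the output path, column names, and transformation are illustrative, not the real ones.

from pyspark.sql import functions as F

# `spark` is the SparkSession that Databricks provides automatically in notebooks and jobs.
df = spark.read.parquet("dbfs:/mnt/raw-data/")

result = (
    df.filter(F.col("event_type") == "click")            # filter rows
      .withColumn("event_date", F.to_date("event_ts"))   # a few transformations
)

result.write.mode("overwrite").parquet("dbfs:/mnt/processed-data/")  # write the result back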

I needed to figure out:

  • Was the issue in the filtering logic?
  • Was Spark misconfigured?
  • Was I misunderstanding the schema?

But chasing all of that inside Databricks notebooks alone felt slow and heavyweight. I wanted a local, iterative loop.


Step 1: Databricks CLI to the Rescue

Instead of poking around in the web UI, I leaned on the Databricks CLI. With a few commands, I could:

  • List and download Parquet files directly from DBFS:
databricks fs ls dbfs:/mnt/raw-data/
databricks fs cp dbfs:/mnt/raw-data/part-0001.parquet ./sample.parquet
  • Trigger jobs without clicking buttons in the UI:
databricks jobs run-now --job-id 1234
  • Inspect cluster configs and logs quickly.

This meant I could pull real data samples down to my laptop instead of constantly iterating in the cloud.
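If you need more than one or two sample files, the same CLI calls are easy to script. Here's a minimal sketch in Python, assuming the Databricks CLI is installed and configured; the part file names are placeholders, and a quick databricks fs ls dbfs:/mnt/raw-data/ will show you the real ones.

import subprocess
from pathlib import Path

SOURCE = "dbfs:/mnt/raw-data/"
DEST = Path("samples")
DEST.mkdir(exist_ok=True)

# Copy a handful of Parquet parts locally; the file names here are illustrative.
for name in ["part-0001.parquet", "part-0002.parquet", "part-0003.parquet"]:
    subprocess.run(
        ["databricks", "fs", "cp", SOURCE + name, str(DEST / name)],
        check=True,  # fail loudly if a copy doesn't succeed
    )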


Step 2: DuckDB for Local Debugging

Here's where DuckDB entered the picture. Once I had a few Parquet files locally, I opened them with DuckDB:

SELECT COUNT(*) FROM 'sample.parquet';
SELECT col1, col2 FROM 'sample.parquet' WHERE event_type = 'click';

DuckDB gave me instant answers. No Spark startup time, no cluster cost. I could validate:

  • The schema matched my expectations.
  • The filters worked as intended.
  • The suspicious row counts were reproducible.

In minutes, I nailed down a couple of issues that would've taken hours if I'd stuck to Databricks alone.
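The same checks work from Python if you'd rather keep everything in one script. Here's a minimal sketch using the duckdb package, pointed at a local samples/ directory of downloaded Parquet files; event_type is the column from the queries above, so substitute your own.

import duckdb

con = duckdb.connect()  # in-memory database; plenty for local spot checks

# 1. Does the schema match my expectations?
print(con.sql("DESCRIBE SELECT * FROM 'samples/*.parquet'"))

# 2. Do the filters work as intended?
print(con.sql("""
    SELECT event_type, COUNT(*) AS n
    FROM 'samples/*.parquet'
    GROUP BY event_type
    ORDER BY n DESC
"""))

# 3. Are the suspicious row counts reproducible on the sample?
total = con.sql("SELECT COUNT(*) FROM 'samples/*.parquet'").fetchone()[0]
clicks = con.sql("SELECT COUNT(*) FROM 'samples/*.parquet' WHERE event_type = 'click'").fetchone()[0]
print(f"{clicks} of {total} rows survive the filter")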


Step 3: Replaying Fixes in Spark

Once I had confidence locally, I pushed the fixes back into the Spark job:

  • Adjusted filter expressions.
  • Fixed schema assumptions.
  • Tightened up partitioning logic.

I reran the job on the Databricks cluster. This time, logs showed balanced executor utilization and output counts that matched what I'd tested in DuckDB.
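For a sense of what those fixes looked like in code, here's a hedged sketch; the schema, column names, and partition column are illustrative rather than the exact ones from my job.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Illustrative: declare the schema explicitly instead of trusting inference,
# which is where one of my bad assumptions had crept in.
schema = StructType([
    StructField("event_type", StringType(), True),
    StructField("event_ts", TimestampType(), True),
    StructField("col1", StringType(), True),
    StructField("col2", StringType(), True),
])

# `spark` is the SparkSession Databricks provides automatically.
df = spark.read.schema(schema).parquet("dbfs:/mnt/raw-data/")

fixed = (
    df.filter(F.col("event_type") == "click")            # adjusted filter expression
      .withColumn("event_date", F.to_date("event_ts"))
      .repartition("event_date")                         # spread work more evenly across executors
)

fixed.write.mode("overwrite").partitionBy("event_date").parquet("dbfs:/mnt/processed-data/")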


Step 4: Iterating the Cycle

The magic wasn't in any single tool — it was in the cycle:

  1. Databricks CLI → Grab data and run jobs programmatically.
  2. DuckDB locally → Validate assumptions fast.
  3. Databricks cluster → Rerun with fixes at scale.

Rinse and repeat.

The whole process turned debugging from a multi-hour cluster slog into a tight feedback loop I could run from my terminal.


Why This Works

  • DuckDB is the perfect local playground for columnar data like Parquet.
  • Databricks CLI makes the platform scriptable, not just a UI experience.
  • Together, they let you treat the cloud as execution muscle while keeping iteration speed local.

It's a reminder that sometimes the best debugging strategy isn't to pick one environment, but to bridge several of them.


Closing Thoughts

I went into this expecting to drown in Spark logs. Instead, I found a workflow that was almost fun: grab data, test locally, rerun at scale, repeat.

If you're working with Databricks and find yourself in a slow debug cycle, try mixing in DuckDB. The combination can shrink your iteration time dramatically — and save you from racking up needless cluster hours.
