AI Agents Debug Spark Faster

When I first had to debug a Spark job on Databricks, I expected a long, painful cycle. Distributed data, cluster orchestration, Python code I wasn't fluent in: the usual recipe for frustration. Instead, I ended up with a surprisingly tight debug loop built on the Databricks CLI, DuckDB, and a few pragmatic tricks. This is the story of that cycle.

The Problem

I had a job running in Databricks that was supposed to process a large set of Parquet files. On paper, the pipeline was simple: filter rows, apply a few transformations, and write the result back. In practice, the job was misbehaving — inconsistent row counts, slow execution, and cluster utilization that looked suspiciously low.

I needed to debug:

Was the issue in the filtering logic?
Was Spark misconfigured?
Was I misunderstanding the schema?

But doing all of that only inside Databricks notebooks felt like overkill. I wanted a local, iterative loop.

Step 1: Databricks CLI to the Rescue

Instead of poking around in the web UI, I leaned on the Databricks CLI. With a few commands, I could:

List and download Parquet files directly from DBFS:

databricks fs ls dbfs:/mnt/raw-data/
databricks fs cp dbfs:/mnt/raw-data/part-0001.parquet ./sample.parquet

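Trigger jobs without clicking buttons in the UI (the flag form below is the legacy CLI syntax; newer CLI releases take the job ID as a positional argument, so check the help output for your version):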

databricks jobs run-now --job-id 1234

Inspect cluster configs and logs quickly.

This meant I could pull real data samples down to my laptop instead of constantly iterating in the cloud.
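If you want to script the download step rather than type it each time, a minimal Python sketch could wrap the same databricks fs cp command shown above. The helper names (build_cp_command, download_samples) and the list of file names are hypothetical, and the sketch assumes the Databricks CLI is already installed and authenticated on your machine.

```python
import subprocess
from pathlib import Path


def build_cp_command(remote_path: str, local_path: str) -> list[str]:
    """Build the argument list for copying one file with `databricks fs cp`."""
    return ["databricks", "fs", "cp", remote_path, local_path]


def download_samples(remote_dir: str, file_names: list[str], dest_dir: str,
                     run=subprocess.run) -> list[Path]:
    """Copy the named files from a DBFS directory into a local directory.

    `run` is injectable so the function can be dry-run without the CLI.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for name in file_names:
        target = dest / name
        remote = f"{remote_dir.rstrip('/')}/{name}"
        # check=True raises CalledProcessError if the CLI exits non-zero.
        run(build_cp_command(remote, str(target)), check=True)
        downloaded.append(target)
    return downloaded
```

Calling download_samples("dbfs:/mnt/raw-data/", ["part-0001.parquet"], "./samples") would run one copy per file. Keeping the command builder separate makes it easy to print the commands first and confirm they match what you would type by hand.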

Step 2: DuckDB for Local Debugging

Here's where DuckDB entered the picture. Once I had a few Parquet files locally, I opened them with DuckDB:

SELECT COUNT(*) FROM 'sample.parquet';
SELECT col1, col2 FROM 'sample.parquet' WHERE event_type = 'click';

DuckDB answered in seconds, with no Spark startup time and no cluster cost. I could confirm that:

The schema matched my expectations.
The filters worked as intended.
The suspicious row counts were reproducible.

In minutes, I nailed down a couple of issues that would've taken hours if I'd stuck to Databricks alone.
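The schema check in particular is easy to automate. Here is a rough sketch using DuckDB's Python package: one helper reads the column names and types from a local Parquet file, and another compares them to what the job expects. The function names and the expected types are my own assumptions for illustration; only the column names col1 and event_type come from the queries above.

```python
def read_parquet_schema(path: str) -> dict[str, str]:
    """Return {column name: DuckDB type name} for a local Parquet file."""
    # Imported here so the comparison helper below works without DuckDB
    # installed. Third-party package: pip install duckdb
    import duckdb

    rel = duckdb.read_parquet(path)
    return {name: str(dtype) for name, dtype in zip(rel.columns, rel.types)}


def schema_mismatches(expected: dict[str, str], actual: dict[str, str]) -> list[str]:
    """List human-readable differences between an expected and actual schema."""
    problems = []
    for column, expected_type in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            problems.append(
                f"{column}: expected {expected_type}, found {actual[column]}"
            )
    for column in actual:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems


# Hypothetical expectation; replace with the types your job assumes.
EXPECTED = {"col1": "BIGINT", "event_type": "VARCHAR"}
# problems = schema_mismatches(EXPECTED, read_parquet_schema("sample.parquet"))
```

An empty list means the file matches the assumed schema. Anything else is a concrete starting point before touching the Spark job.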

Step 3: Replaying Fixes in Spark

Once I had confidence locally, I pushed the fixes back into the Spark job:

Adjusted filter expressions.
Fixed schema assumptions.
Tightened up partitioning logic.

I reran the job on the Databricks cluster. This time, logs showed balanced executor utilization and output counts that matched what I'd tested in DuckDB.
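Comparing counts between the two environments can also be scripted. The sketch below is one way to do it, assuming both sides were computed over the same input files and that you have copied the cluster-side counts out of the job output or a notebook by hand. The function names and the choice to group by event_type are assumptions for illustration, not what the original job did.

```python
def local_counts(path: str, column: str = "event_type") -> dict:
    """Count rows per distinct value of `column` in a local Parquet file."""
    # Third-party package: pip install duckdb
    import duckdb

    # The column name is interpolated as a quoted identifier; the file
    # path is passed as a bound parameter.
    query = f'SELECT "{column}", COUNT(*) FROM read_parquet(?) GROUP BY 1'
    return dict(duckdb.execute(query, [path]).fetchall())


def count_diffs(local: dict, cluster: dict) -> dict:
    """Return {key: (local_count, cluster_count)} where the counts differ.

    A key missing from one side is treated as a count of 0.
    """
    diffs = {}
    for key in set(local) | set(cluster):
        local_n = local.get(key, 0)
        cluster_n = cluster.get(key, 0)
        if local_n != cluster_n:
            diffs[key] = (local_n, cluster_n)
    return diffs


# Hypothetical usage, with cluster counts pasted in from the Spark run:
# diffs = count_diffs(local_counts("sample.parquet"), {"click": 1200})
```

An empty result means the two environments agree on every group, which is a stronger check than comparing a single total.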

Step 4: Iterating the Cycle

The magic wasn't in any single tool — it was in the cycle:

Databricks CLI → Grab data and run jobs programmatically.
DuckDB locally → Validate assumptions fast.
Databricks cluster → Rerun with fixes at scale.

Rinse and repeat.

The whole process turned debugging from a multi-hour cluster slog into a tight feedback loop I could run from my terminal.

Why This Works

DuckDB is a fast local playground for columnar data like Parquet: it queries the files in place, with no server or cluster to start.
Databricks CLI makes the platform scriptable, not just a UI experience.
Together, they let you treat the cloud as execution muscle while keeping iteration speed local.

It's a reminder that sometimes the best debugging strategy isn't to pick one environment but to bridge several.

Closing Thoughts

I went into this expecting to drown in Spark logs. Instead, I found a workflow that was almost fun: grab data, test locally, rerun at scale, repeat.

If you're working with Databricks and find yourself in a slow debug cycle, try mixing in DuckDB. The combination can shrink your iteration time dramatically — and save you from racking up needless cluster hours.