The Interactive Turn: How DocETL and Dialog Engineering are Reshaping LLM Data Analysis

Tags: LLM, Podcast

Author: Junichiro Iwasawa

Published: April 25, 2025

Large Language Models (LLMs) hold immense promise for extracting insights from the vast oceans of unstructured data we navigate daily. Yet, anyone who has tried to apply them to complex, real-world tasks knows the reality is often messy. Optimizing LLM pipelines for accuracy and efficiency, especially at scale, can quickly become a frustrating exercise in manual tuning.

Enter DocETL, a framework developed by UC Berkeley researcher Shreya Shankar and colleagues (paper available here) that has been garnering attention. DocETL aims to tame this complexity. Insights from Shankar’s recent interview on the TWIML Podcast illuminate not only DocETL’s innovative approach but also a more fundamental truth about what it takes to work with LLMs productively.

The Harsh Reality of LLM Data Processing: Beyond Slick Demos

As detailed in the interview and the DocETL announcement blog post, simply throwing a complex task at an LLM often yields disappointing results. Consider analyzing decades of presidential debate transcripts for evolving themes. The sheer scale can overwhelm context windows. The complexity involves more than extraction – it requires identifying themes, tracking changes over time, and synthesizing viewpoints across documents. And the ever-present challenge of accuracy means dealing with hallucinations and missed information.

Shankar’s work on another project analyzing police misconduct records in California highlights the high stakes. Sifting through potentially thousands of pages of unstructured text for patterns requires precision; mistakes are not an option, yet traditional manual annotation is incredibly time-consuming.

Many developers attempt to solve this by manually chunking data, crafting intricate prompts, and orchestrating multiple LLM calls. But as Shankar points out, this often leads to brittle pipelines that are difficult to modify and may only yield mediocre results after days of painstaking effort.

DocETL: A Declarative Framework with LLM Agent-Powered Optimization

DocETL offers a different path: a declarative framework for building and optimizing LLM-powered data processing pipelines. Users define their desired operations – like Map (apply to each document), Reduce (aggregate results), Split (chunk documents), Gather (add context to chunks), or Resolve (normalize similar entities) – along with prompts describing the task for each step, using YAML or Python.
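
To make the declarative style concrete, here is a rough, hypothetical sketch of such a pipeline specification written as plain Python data. The operation names mirror those above, but the keys and template syntax are illustrative assumptions, not DocETL’s actual schema (the project documentation has the real YAML format).

```python
# Hypothetical, illustrative pipeline spec (not DocETL's actual schema).
# Each operation pairs a type with a natural-language prompt and an output schema;
# the step list wires the operations together over a dataset of documents.
pipeline_spec = {
    "dataset": {"path": "debates.json"},  # e.g., one record per debate transcript
    "operations": [
        {
            "name": "extract_main_theme",
            "type": "map",       # applied to each document independently
            "prompt": "What is the dominant theme of this debate? {{ input.text }}",
            "output_schema": {"main_theme": "string"},
        },
        {
            "name": "normalize_themes",
            "type": "resolve",   # merge near-duplicate theme names ("health care" vs. "healthcare")
            "comparison_prompt": "Do '{{ left.main_theme }}' and '{{ right.main_theme }}' refer to the same theme?",
        },
        {
            "name": "summarize_theme_over_time",
            "type": "reduce",    # aggregate all debates that share a theme
            "reduce_key": "main_theme",
            "prompt": "Summarize how this theme evolved across these debates: {{ inputs }}",
        },
    ],
    "steps": [
        {
            "input": "dataset",
            "operations": ["extract_main_theme", "normalize_themes", "summarize_theme_over_time"],
        },
    ],
}
```

The Split and Gather operations mentioned above would slot in before the map whenever individual transcripts exceed the model’s context window.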

The core innovation lies beyond simple execution. DocETL employs LLM agents to automatically rewrite and optimize the user’s initial pipeline for better accuracy. This involves two key steps:

  1. Pipeline Rewriting: Based on predefined rules (covering data decomposition, adding intermediate steps, and LLM-specific improvements), DocETL’s agents propose alternative pipeline structures. For instance, a complex Map operation might be automatically broken down into a sequence: split the document, gather relevant context for each chunk, apply a simpler map to each chunk, and then aggregate the results.
  2. Quality Assessment & Selection: The agents generate task-specific validation criteria (e.g., “Were all instances of misconduct extracted?”, “Can each extracted instance be traced back to the source document?”). They then use an “LLM-as-a-judge” approach to evaluate the outputs of different candidate pipelines on sample data, ultimately selecting the plan predicted to yield the highest accuracy (a minimal sketch of this selection step follows the list).
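
As promised, here is a minimal sketch of that second step. It assumes hypothetical helper callables `run_pipeline` and `ask_llm` supplied by the caller (neither is DocETL’s real API); the point is only the shape of the idea: run each candidate plan on a small document sample, score every output against the generated yes/no criteria with a judge model, and keep the highest-scoring plan.

```python
from typing import Callable, Sequence

# Example task-specific validation criteria, as in the misconduct-extraction task above.
CRITERIA = [
    "Were all instances of misconduct extracted?",
    "Can each extracted instance be traced back to the source document?",
]

def judge_output(ask_llm: Callable[[str], str], document: str, output: str) -> int:
    """Score one pipeline output by asking the judge model a yes/no question per criterion."""
    score = 0
    for criterion in CRITERIA:
        verdict = ask_llm(
            f"Document:\n{document}\n\nExtraction:\n{output}\n\n"
            f"Question: {criterion} Answer yes or no."
        )
        score += verdict.strip().lower().startswith("yes")
    return score

def pick_best_plan(
    candidate_plans: Sequence[dict],
    sample_docs: Sequence[str],
    run_pipeline: Callable[[dict, Sequence[str]], list[str]],
    ask_llm: Callable[[str], str],
) -> dict:
    """Run each rewritten candidate plan on the sample documents and keep the best scorer."""
    def total_score(plan: dict) -> int:
        outputs = run_pipeline(plan, sample_docs)
        return sum(judge_output(ask_llm, doc, out) for doc, out in zip(sample_docs, outputs))
    return max(candidate_plans, key=total_score)
```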

This approach frees analysts from wrestling with low-level implementation details like optimal chunk sizes or error recovery logic, allowing them to focus on their analytical goals.

The Inescapable Need for “Interaction”

Perhaps the most compelling insight from Shankar’s interview transcends DocETL’s technical architecture. She repeatedly emphasizes the crucial role of interaction in working effectively with LLMs. The notion that one can write a perfect, one-shot prompt is largely a myth.

“Users have no idea how to write the prompt until they see the initial output,” Shankar explains. “They have to see what the LLM came up with the first time… and then say ‘Oh, like actually…’ They changed the task; they redefine [e.g.] misconduct.”

“The more users iterate [based on seeing intermediate outputs], the more complex the prompts get… That’s very fascinating to me because I think there’s a lot of work in automated prompt engineering… that really puts the human out [of the loop].”

It’s through seeing the LLM’s output – its successes, failures, and ambiguities – that users truly understand the nuances of their task and refine their instructions. This human-driven iterative refinement process, Shankar suggests, is fundamental to successfully leveraging LLMs for complex problems.

Echoes of Jeremy Howard’s “Dialog Engineering”

Shankar’s findings resonate strongly with the concept of “Dialog Engineering,” championed by Jeremy Howard, co-founder of fast.ai and now leading Answer.AI. Howard argues that the common approach of throwing large, monolithic prompts at an AI and hoping for hundreds of lines of perfect code is fundamentally flawed for real-world development.

Dialog Engineering proposes the opposite: a tight interactive loop between human and LLM, where code or other artifacts are co-constructed in small, manageable increments. Validation happens at each step.

Answer.AI’s tool, solveit (currently in private beta), aims to embody this philosophy. It provides an interface blending chat and a REPL (Read-Eval-Print Loop), allowing users to give instructions in natural language, receive small code suggestions, and immediately execute and verify them within the same environment. The LLM always sees the current context (conversation, files), and users can easily step back or re-run parts of the process if things go awry or requirements change. Simple tests can even be embedded directly in the conversation flow to continuously check for regressions.
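
As a tool-agnostic illustration of that style (plain Python, not solveit itself, which remains in private beta), the sketch below shows the kind of small increments with immediate inline checks the workflow encourages: each step is verified before the next request builds on it.

```python
# Illustrative only: dialog-engineering-style increments, each verified immediately,
# instead of asking the model for hundreds of lines of code in one shot.

# Step 1: ask the assistant for a small helper, then check it right away.
def normalize_name(raw: str) -> str:
    """Lowercase, strip, and collapse internal whitespace."""
    return " ".join(raw.lower().split())

assert normalize_name("  Jane   DOE ") == "jane doe"  # quick inline regression check

# Step 2: only after step 1 is confirmed, build on it with the next small request.
def names_match(a: str, b: str) -> bool:
    """Compare two names after normalization."""
    return normalize_name(a) == normalize_name(b)

assert names_match("Jane Doe", "jane   doe")
```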

The development style solveit enables – stating a small goal, getting a suggestion, immediately testing it, then accepting or refining – directly mirrors the iterative process Shankar observed as essential in her DocETL research. Both DocETL’s findings and solveit’s approach point to the same conclusion: we need to treat LLMs less like black-box instruction takers and more like collaborative partners engaged in a dialog. Shankar’s research provides compelling empirical validation for the principles behind Dialog Engineering, extending their relevance beyond code generation to complex data analysis.

The Road Ahead: Challenges and Opportunities

DocETL is still evolving, and Shankar acknowledges several areas for future work:

  • Interfaces: Moving beyond YAML to more intuitive UIs that facilitate visualization and iteration on large documents and LLM outputs.
  • Agent Reliability: Improving the consistency and fault tolerance of the LLM agents performing the optimization.
  • Optimization Speed & Transparency: Making the optimization process faster and more debuggable for users.
  • Benchmarking: Developing new benchmarks specifically designed to evaluate LLM performance on the complex, long-context data processing tasks DocETL targets.

Conclusion: Dialog is Key in the LLM Era

DocETL represents a significant step forward in harnessing LLMs for complex unstructured data analysis. Its declarative framework and agent-based optimization offer a powerful alternative to manual pipeline construction. However, the research journey itself highlights a critical lesson: the technology alone isn’t enough.

The true potential of LLMs in these domains will be unlocked not just by better models or automation, but by better interfaces and workflows that embrace human-LLM interaction. We need tools that facilitate the back-and-forth, the iterative refinement, the dialog that allows us to clarify our intent and leverage the LLM’s capabilities effectively. The direction indicated by DocETL’s research and tools like Jeremy Howard’s solveit suggests that this interactive paradigm is central to the future of AI-assisted analysis and development.