AI coding tools show reliability gaps in structured output tasks
A new study from the University of Waterloo finds that leading artificial intelligence coding tools still fail in roughly one out of four cases when generating structured outputs, raising concerns about their reliability in real-world software development workflows.
The research, released on March 16 and scheduled for presentation at the International Conference on Learning Representations 2026, evaluated 11 large language models across 18 structured output formats and 44 tasks. Even the best-performing proprietary systems reached only about 75 percent accuracy, while top open source models achieved close to 67 percent.
Structured output remains a critical weak point
The study, titled “StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs,” focused on formats commonly used in development pipelines, including JSON, YAML, CSV, HTML, React and SVG. These formats are essential for integrating AI-generated code into production systems.
Researchers assessed model outputs using a combination of syntax validation, keyword matching and visual question answering. The results showed that while models performed reasonably well on text-based tasks such as documentation and simple data structures, they struggled with more complex outputs.
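To make those evaluation stages concrete, here is a minimal sketch of what the first two checks might look like for JSON and CSV output. It is an illustration only, with invented function names and a simple keyword-coverage score; it is not the study's actual evaluation harness.

```python
import csv
import io
import json

# Toy versions of two evaluation stages described in the study:
# syntax validation and keyword matching. Names and scoring here
# are invented for illustration.

def valid_json(text: str) -> bool:
    # Syntax check: does the output parse as JSON at all?
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def valid_csv(text: str) -> bool:
    # Syntax check for CSV: parseable, with a consistent column count.
    try:
        rows = list(csv.reader(io.StringIO(text)))
    except csv.Error:
        return False
    return bool(rows) and len({len(r) for r in rows}) == 1

def keyword_score(text: str, required: list[str]) -> float:
    # Keyword matching: fraction of required keys present in the output.
    return sum(kw in text for kw in required) / len(required)

model_output = '{"name": "StructEval", "tasks": 44}'
print(valid_json(model_output))                        # True
print(keyword_score(model_output, ["name", "tasks"]))  # 1.0
```

Checks like these catch malformed output, but a syntactically valid response can still fail to address the task, which is why the study paired them with visual question answering for rendered formats.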
Tasks involving visual or layout elements, including image generation, video content, dynamic web design and diagram code, produced the highest error rates. The study also found that generation tasks, in which models produce structured output directly from natural language instructions, were significantly harder than conversion tasks, in which models translate between two existing formats.
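The toy example below illustrates that distinction; the data and instruction are invented and do not come from the benchmark. A conversion task starts from an existing structure, while a generation task starts from prose alone.

```python
import json

# Conversion task: the source structure is given, so the mapping to the
# target format is largely mechanical (here, a JSON list to CSV lines).
source = json.loads('[{"id": 1, "lang": "Python"}, {"id": 2, "lang": "Rust"}]')
csv_lines = ["id,lang"] + [f'{row["id"]},{row["lang"]}' for row in source]
print("\n".join(csv_lines))

# Generation task: only a natural-language instruction is given, and a
# model must produce a well-formed structure from scratch -- both valid
# syntax and the right fields, which is where the study saw more failures.
instruction = "Return a JSON list with fields 'id' and 'lang' for two languages."
```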
Human oversight remains essential
The research team included Dongfu Jiang, Jialin Yang and Wenhu Chen, supported by a group of 17 contributors involved in annotation and evaluation. According to Jiang, the study measured both syntactic correctness and whether outputs meaningfully addressed the task.
He noted that despite rapid advances, AI coding systems still require close human supervision. Developers using these tools cannot rely solely on automated outputs, particularly in environments where precision is critical.
Chen emphasized the collaborative research model at Waterloo, where students both contribute to and lead benchmarking efforts, reflecting a broader trend in AI research of pairing rapid experimentation with systematic evaluation.
Widespread adoption meets practical limitations
The findings come at a time when AI-assisted coding tools have become deeply embedded in software engineering workflows. A recent survey by The Pragmatic Engineer indicates that 95 percent of respondents use AI tools at least weekly, and 75 percent rely on them for at least half of their engineering tasks.
Platforms such as GitHub Copilot, Claude Code and Cursor are now standard in many development environments. However, the Waterloo study highlights a key risk: errors in structured outputs may not always be immediately visible, increasing the likelihood of hidden bugs or configuration issues.
In complex systems, such issues can propagate and lead to broader failures, making validation and review processes more important than ever.
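One common mitigation is a validation gate that checks both syntax and shape before AI-generated output enters a pipeline. The sketch below shows the idea for a hypothetical deployment config in JSON; the required fields are invented for illustration.

```python
import json

# Hypothetical required fields for an AI-generated deployment config.
REQUIRED = {"service": str, "port": int, "replicas": int}

def validate_config(text: str) -> list[str]:
    try:
        cfg = json.loads(text)  # syntactic check: is it valid JSON?
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    errors = []
    for field, expected_type in REQUIRED.items():  # shape check
        if field not in cfg:
            errors.append(f"missing field: {field}")
        elif not isinstance(cfg[field], expected_type):
            errors.append(f"wrong type for {field}")
    return errors

generated = '{"service": "api", "port": "8080", "replicas": 2}'
print(validate_config(generated))  # ['wrong type for port']
```

The generated config passes the syntax check but fails the shape check: the port is a string rather than an integer, exactly the kind of quietly wrong output that can surface later as a configuration failure.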
The study has been published in Transactions on Machine Learning Research and contributes to ongoing discussions about the role of large language models in production-grade software development.