To evaluate Diana, we run her against an internal benchmark of 150+ dbt-related engineering tasks spanning documentation, testing, and modelling. We run the full benchmark twice every time we deploy an improvement. The goal is to track Diana’s ability to solve real-world data engineering problems consistently and accurately across a meaningful problem space.
Each test consists of a task prompt and a list of success criteria. While many benchmark tasks can be evaluated deterministically with data and unit tests, some require reasoning to judge. For those, we built a review agent that evaluates Diana’s work and generates a report card for each task. The report card lists what Diana did well and where she fell short, and a second step scores her work on a percentage scale against rigid guidelines across five metrics: Context Gathering & Understanding, Implementation, Validation, Testing, and Finalization. Anything below 85% counts as a fail. From this, we can quickly assess her performance across both the deterministic and the agent-judged benchmark tests.
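To make the pass/fail logic concrete, here is a minimal Python sketch of how a report card could be represented and scored. The class, field, and task names are hypothetical illustrations rather than Diana’s actual internals; the five metric names and the 85% threshold come from the description above, and treating the threshold as per-metric (rather than an aggregate score) is an assumption.

```python
from dataclasses import dataclass, field

# The 85% pass threshold described above.
PASS_THRESHOLD = 85

# The five metrics the review agent scores against rigid guidelines.
METRICS = (
    "Context Gathering & Understanding",
    "Implementation",
    "Validation",
    "Testing",
    "Finalization",
)


@dataclass
class ReportCard:
    """Illustrative review-agent output for a single benchmark task."""
    task_id: str
    strengths: list[str] = field(default_factory=list)    # what Diana did well
    weaknesses: list[str] = field(default_factory=list)   # where she fell short
    scores: dict[str, int] = field(default_factory=dict)  # metric name -> 0-100 score

    def passed(self) -> bool:
        # Assumption: a task passes only if every metric meets the threshold.
        return all(self.scores.get(m, 0) >= PASS_THRESHOLD for m in METRICS)


# Hypothetical example: a task that fails because Testing falls below 85%.
card = ReportCard(
    task_id="dbt_docs_017",
    strengths=["Added column-level descriptions to the staging model"],
    weaknesses=["Did not add a uniqueness test on the surrogate key"],
    scores={
        "Context Gathering & Understanding": 92,
        "Implementation": 90,
        "Validation": 88,
        "Testing": 70,
        "Finalization": 95,
    },
)
print(card.passed())  # False
```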
Once the benchmark is complete, we aim to reach 250+ tasks across different data environments and to open-source both the benchmark and Diana’s historical performance.
If you have any questions, reach out to william @ artemisdata.io