The AI Developer That Knows When It's Right (And When It's Not)
May 25, 2025 • Jesper Svensson and Ludvig Olsson
We introduce Codev – an AI system for autonomous software development.
Codev combines agentic AI workflows, conventional software tools, and quantitative multi-sampling to resolve issues and assess its own work. It scores 49% on SWE-bench Lite, and its confidence correlates with actual success at R² = 0.82.
This post covers key aspects of our approach to building a reliable AI system for software development.
Research
Codev starts by analyzing the given problem statement and studying the specified repository. SWE-bench tasks come from repos that typically include a few thousand files, among them 100 to 3,000 Python files containing between 20,000 and 800,000 lines of code, ranging from relatively small repos like Flask to large repos like SymPy.
It builds a code graph of the source code to navigate the contents of the repo. It then starts a parallelized problem-solving process inspired by the scientific method, following these steps:
- Formulating a hypothesis on what the root cause is, or what the solution might be. This also helps generate relevant content for the next step.
- Specific queries per source type are generated to find evidence in the codebase, based on the problem and the hypothesis. Anthropic's Claude-3.5-Haiku is used to generate both the hypothesis and the queries.
- Retrieval of code blocks such as functions, classes, or methods. Different sources – a code graph, two types of string search, and an embeddings source – gather interesting code blocks using their specialized queries and search methods. The first three sources execute locally in software, while the embeddings source is built on a fast and very cheap AI model for semantic search; OpenAI's text-embedding-small is used for embeddings.
- An LLM-based analyst reads and analyzes all relevant files found by the first four sources and suggests additional code blocks of relevance that the other sources may have missed. The ideal model here is fast, cheap, and able to read a lot of code quickly; Google's Gemini-2.5-Flash is used for this.
- An agentic search runs in parallel with the steps above. In this step, a more advanced model uses the problem statement to navigate the repository top-down and suggest the best location to solve the problem, complementing the other sources' bottom-up approach. The agentic search is run using Gemini-2.5-Pro.
- Retrieved code blocks are evaluated in a standardized way across all six sources. A fast and cheap LLM scores and tags each code block as: Solution Target – code blocks where the solution could be implemented; Supporting – code blocks which support the hypothesis; Interesting – code blocks which seem generally interesting for the solution or the problem; and Not Relevant – code which seems not relevant. Finally, all files and code blocks get a total score and an intrinsically derived probability, subsequently referred to as 'confidence' (a minimal sketch of this scoring step follows the list). Figure 1 shows a strong correlation between Codev's confidence and the actual chance of having found the correct file.
- Planning – for files with high confidence, development plans are formulated by Anthropic's Claude-3.7-Sonnet. Since files can span thousands of lines of code, a custom code consolidation mechanism is used. It starts with the highest scoring code blocks and progressively incorporates adjacent blocks, their parent/child relationships, and forward and backward call dependencies, incrementally building up to a target code size. This approach ensures the planning and coding LLMs receive the most relevant code context without having to process entire files. This saves time and cost, and limits confusion from excessive code context.
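To make the scoring and confidence step concrete, here is a minimal sketch of how block-level tags could be rolled up into file-level scores and a normalized confidence. The tag names come from the list above, but the numeric weights and the simple normalization are illustrative assumptions, not Codev's actual intrinsic probability derivation.

```python
from collections import defaultdict

# Illustrative tag weights; Codev's actual weights and its intrinsic
# probability derivation are not published, so these values are assumptions.
TAG_WEIGHTS = {"Solution Target": 3.0, "Supporting": 1.5, "Interesting": 0.5, "Not Relevant": 0.0}

def file_confidence(tagged_blocks):
    """tagged_blocks: iterable of dicts like {"file": "flask/app.py", "tag": "Supporting"}.
    Returns per-file scores and a normalized stand-in for 'confidence'."""
    scores = defaultdict(float)
    for block in tagged_blocks:
        scores[block["file"]] += TAG_WEIGHTS[block["tag"]]
    total = sum(scores.values()) or 1.0
    confidence = {path: score / total for path, score in scores.items()}
    return dict(scores), confidence
```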
This parallelized research approach is scalable and can easily be adjusted for repositories of varying sizes, for example by changing the number of sources, the number of queries, and the number of collected blocks per query. However, we used a one-size-fits-all configuration, which typically yielded 30-50 evaluated code blocks and most often identified the ideal file and location multiple times from different sources.
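The code consolidation mechanism used in the planning step could look roughly like the sketch below. The block objects, their `related()` relationships (adjacent blocks, parent/child scopes, and call dependencies), and the character budget are hypothetical stand-ins for Codev's internal representation.

```python
def consolidate_context(scored_blocks, target_chars=12_000):
    """Greedily build a code context: start from the highest-scoring blocks and
    keep pulling in related blocks (adjacent, parent/child, callers/callees)
    until the character budget is reached."""
    selected, seen, size = [], set(), 0
    frontier = sorted(scored_blocks, key=lambda b: b.score, reverse=True)
    while frontier and size < target_chars:
        block = frontier.pop(0)
        if block.id in seen:
            continue
        seen.add(block.id)
        if size + len(block.text) > target_chars:
            continue  # too large to fit; try the next candidate
        selected.append(block)
        size += len(block.text)
        # Structurally related blocks become new candidates for the context.
        frontier.extend(b for b in block.related() if b.id not in seen)
        frontier.sort(key=lambda b: b.score, reverse=True)
    return "\n\n".join(b.text for b in selected)
```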
File Reliability Diagram – Correctness vs Intrinsic Confidence
The relationship between system confidence and actual file correctness
Figure 1: This graph displays how likely it is that the system found the correct file as a function of its confidence. All SWE-bench Lite issues are sorted by the system's confidence in its most probable file. The x-axis shows the average confidence for groups of 15 consecutive issues in the sorted list (ranging from 35% to 100% confidence). The y-axis shows the average actual correctness for the same groups of issues (ranging from 14% to 100%).
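For reference, the points in Figure 1 can be reproduced from per-issue data roughly as follows, assuming a list of (confidence, was_correct) pairs; the bucket size of 15 matches the figure.

```python
def reliability_points(issues, bucket_size=15):
    """issues: list of (confidence, was_correct) pairs, one per SWE-bench Lite issue.
    Returns (mean confidence, mean correctness) per bucket of consecutive issues,
    mirroring how Figure 1 is constructed."""
    ordered = sorted(issues, key=lambda pair: pair[0])
    points = []
    for start in range(0, len(ordered), bucket_size):
        bucket = ordered[start:start + bucket_size]
        mean_conf = sum(conf for conf, _ in bucket) / len(bucket)
        mean_correct = sum(correct for _, correct in bucket) / len(bucket)
        points.append((mean_conf, mean_correct))
    return points
```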
Number of files and rollouts - The number of interesting files can vary, and we allow up to three different files to be selected for development rollouts. Every file above a certain confidence threshold gets its own development plan and is passed on to development. If only one file is deemed interesting, 3 rollouts are assigned to that file; if 2-3 files are interesting enough, 2 rollouts are given for each file. At least 2 rollouts make for more robust patch generation. The development plan is created slightly differently for consecutive rollouts of the same file, to introduce variance and find new solutions.
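A minimal sketch of this allocation rule, assuming a dictionary of file confidences; the threshold value and the fallback when no file clears it are assumptions, since the post does not state them.

```python
def assign_rollouts(file_confidences, threshold=0.8, max_files=3):
    """file_confidences: {"path/to/file.py": confidence}. Returns a rollout count
    per selected file: 3 rollouts for a single file, 2 each for 2-3 files."""
    ranked = sorted(file_confidences.items(), key=lambda item: item[1], reverse=True)
    selected = [path for path, conf in ranked if conf >= threshold][:max_files]
    if not selected:  # fallback (assumption): keep the top-ranked file
        selected = [ranked[0][0]]
    rollouts_per_file = 3 if len(selected) == 1 else 2
    return {path: rollouts_per_file for path in selected}
```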
File Candidates
Golden File or Most Commonly Resolved File - Plan for correct file for Editing
Figure 2: For each issue, Codev generates a ranked list of probable files where the problem could be solved. This graph displays all issues sorted by Codev's confidence in its first file choice. For the first 190 issues, Codev was highly confident and only attempted to work in a single file. For the next 97 issues, it attempted solutions in two files, and for 13 issues, it tried three different files. In 20 issues, Codev did not find the correct file at all. In 13 issues, the correct file was found but ranked too low to be selected for development, though 7 of those would have been implemented if one more file had been allowed.
Development
Regression test baseline – Before development starts, a specialized workflow is run to identify and run the repo's existing regression tests. The first step of this workflow is to determine the test runner, i.e. the command that runs tests in the given repo. For repositories in the SWE-bench dataset this is often pytest or tox, but it can also be an arbitrary custom script, for example Django's runtests.py or SymPy's bin/test.
Well-maintained repos and commercial products often have large numbers of regression tests. Therefore, it is desirable to identify which tests are most relevant when editing a given file. Ranking of which tests to run is done through static import analysis, module name comparison, and LLM calls.
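A simplified sketch of that ranking using only the two local signals, static import analysis and module-name similarity; the LLM-based ranking is omitted and the relative weighting is an assumption.

```python
import ast
from difflib import SequenceMatcher
from pathlib import Path

def rank_test_modules(edited_file, test_paths):
    """Rank candidate test modules for the file under edit, most relevant first."""
    edited_module = Path(edited_file).stem

    def imported_names(source):
        try:
            tree = ast.parse(source)
        except SyntaxError:
            return set()
        names = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.ImportFrom) and node.module:
                names.update(node.module.split("."))
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                names.update(alias.name.split(".")[-1] for alias in node.names)
        return names

    def score(test_path):
        source = Path(test_path).read_text(errors="ignore")
        import_hit = 1.0 if edited_module in imported_names(source) else 0.0
        name_similarity = SequenceMatcher(None, edited_module, Path(test_path).stem).ratio()
        return 2.0 * import_hit + name_similarity  # import evidence weighted higher (assumption)

    return sorted(test_paths, key=score, reverse=True)
```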
Test Module Coverage
Issues where Codev finds the relevant test module within a given number of test modules
Figure 3: This figure shows how likely it is that the test discovery process finds the correct test module(s) that SWE-bench has selected for evaluation. The system ranks the test modules it finds in the repo from most to least likely to be relevant. The x-axis depicts how many regression test modules are run. The y-axis shows the percentage of all SWE-bench Lite issues in which Codev finds the relevant test modules.
Running the most relevant regression tests before development establishes a baseline of passing and failing tests. This enables the system to isolate the impact of its changes from pre-existing issues in the codebase, which reduces the risk of being misguided by noise.
With the research phase complete and regression tests established, development begins. The system proceeds to implement the solution with the help of the problem description, development plan, and condensed file context.
Issue Reproduction and Solution Validation - Once a solution is syntactically correct and mergeable, the development process faces its most challenging task: creating tests which reproduce the original problem and validate the new solution. While typical solution patches may modify 5-50 lines of code, the test scripts for reproduction and validation often span 100-300 lines. The challenge stems from the need to generate comprehensive test cases covering various edge cases and catering for repo-specific setups. The test script must also accurately distinguish between reproduction of the issue in the original source code and validation of success in the updated code.
To alleviate some of the difficulty for the LLM, it is asked to structure the test cases so that a test runner can execute them in a standardized manner, with results falling into four possible categories: a) crashes with a traceback, b) executes but no test cases return as expected, c) executes and some, but not all, test cases return as expected, and d) executes and all test cases return as expected.
For each iteration, which may include updates to both source code and test cases, the test script is run on both the original and the updated source code. Progress is measured based on how the tests execute on 1) the original code and 2) the updated code.
Custom system feedback
Custom situational feedback sent to LLM
| | Original Code | Updated Code |
|---|---|---|
| a) Tests crash | Tailored feedback to the LLM based on which code had the problem and which problems were detected | |
| b) Runs – not ok | | |
| c) Runs – some ok | | |
| d) Runs – all ok | All fail: Success | All pass: Success |
Table 1: Illustration of the evaluation framework for tracking progress in both issue reproduction and solution validation. The test script reproduces the bug or missing functionality in the original source code and validates the functionality in the updated source code.
Complete success is achieved only when the same test script produces failing results on the original code (reproducing the issue) and passing results on the updated code (validating the fix). Given the complexity of correctly interpreting these results, strict built-in logic categorizes the outcomes and gives tailored feedback to the LLM. See Table 1.
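A sketch of this strict success logic under Table 1's categories; the feedback strings are simplified placeholders for Codev's tailored feedback rather than the system's actual messages.

```python
from enum import Enum

class Outcome(Enum):
    CRASH = "a) tests crash"
    NONE_OK = "b) runs - not ok"
    SOME_OK = "c) runs - some ok"
    ALL_OK = "d) runs - all ok"

def judge(original: Outcome, updated: Outcome) -> str:
    """Success requires category d) on both runs: on the original code 'as expected'
    means all test cases fail (issue reproduced); on the updated code it means all
    test cases pass (fix validated)."""
    if original is Outcome.ALL_OK and updated is Outcome.ALL_OK:
        return "SUCCESS: issue reproduced on original code, fix validated on updated code"
    if Outcome.CRASH in (original, updated):
        return "FEEDBACK: the test script crashed - report the traceback and which run failed"
    if original is not Outcome.ALL_OK:
        return "FEEDBACK: the issue is not fully reproduced on the original code"
    return "FEEDBACK: the fix is not fully validated on the updated code"
```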
Regression testing - When a solution has been validated by the test script, it is passed to regression testing. If it passes all regression tests, the solution is considered a good solution candidate. If a regression test fails, the results are passed to a reasoning model to judge whether the regression test is still relevant given the new solution, or whether the test should be updated. This is a crucial step to capture the value of regression testing, so that working solutions are not discarded because of soon-to-be-outdated regression tests. OpenAI's o4-mini reasoning model is used for this assessment.
Development iterations - Six development iterations are allowed for patching source code, creating test scripts, and iterating until both test scripts and regression tests pass. Additional iterations are permitted only if Codev demonstrates measurable progress according to the outcomes defined in Table 1. This approach balances cost and time efficiency while maintaining flexibility to continue when progress indicates a potential solution is emerging.
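A rough sketch of this iteration budget; how Codev actually quantifies "measurable progress" is not specified, so the numeric ranking of Table 1's categories below is an assumption.

```python
MAX_BASE_ITERATIONS = 6

# Ordered progress levels matching Table 1's categories (an assumed numeric ranking).
PROGRESS = {"a) crash": 0, "b) runs - not ok": 1, "c) runs - some ok": 2, "d) runs - all ok": 3}

def should_continue(iteration, history):
    """history: one (original_category, updated_category) pair per completed iteration,
    using the keys of PROGRESS. Beyond the base budget, continue only while the latest
    iteration improves on the best previous one."""
    if iteration < MAX_BASE_ITERATIONS:
        return True

    def score(attempt):
        return sum(PROGRESS[category] for category in attempt)

    return score(history[-1]) > max(score(attempt) for attempt in history[:-1])
```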
Development Results - Codev's declaration of Success for a patch during the development process is reliable for all but the hardest issues in the SWE-bench dataset. For issues where at least 20% of SWE-bench submissions have succeeded, Codev predicts successful fail-to-pass tests with 85% Recall and 88% Precision, for an F1 Score of 86%. For the hardest and the unresolved issues on SWE-bench, Codev drops to 50% accuracy for fail-to-pass. Declaring Pass for pass-to-pass tests is accurate across the board, with 85% Recall and 92% Precision, for an F1 Score of 88% across all issues.
Recall, Precision and F1
Breakdown of recall, precision and F1 metrics for test script and regression tests respectively
| Test Script / Fail-to-Pass* | |
|---|---|
| Recall | 85% |
| Precision | 88% |
| F1 | 86% |

| Regression Tests / Pass-to-Pass | |
|---|---|
| Recall | 85% |
| Precision | 92% |
| F1 | 88% |
Tables 2 and 3: The tables show to what extent Codev's test script and execution of regression tests predict the corresponding fail-to-pass and pass-to-pass tests.
*Table 2 includes only issues that at least 20% of SWE-bench submissions have resolved. Table 3 includes all issues on SWE-bench Lite.
The following graphs illustrate the relationship between issue difficulty and the test script Success and regression test Pass predictions, by sorting issues based on their resolution frequency in SWE-bench Lite submissions:
Recall, Precision and F1 versus Issue Difficulty
Breakdown of metrics from Tables 2 and 3 over varying issue difficulty
Test script / Fail-to-Pass
Regression tests / Pass-to-Pass
Figures 4 and 5: The x-axis shows buckets of issues by issue difficulty. The first bucket contains all issues that have never been solved by any SWE-bench submission. The second bucket contains all issues that have been solved by 0.33-20% of all submissions, and so on.
Scaling and picking
After the development rollouts, the system holds several solution candidates and test scripts. Depending on file confidence, 1-3 files (1.41 on average) were attempted per issue. An average of 3.45 rollouts per issue gave an average of 5.9 patches and at least as many functioning test scripts per issue.
Codev proceeds to combine these solution candidates with all the test scripts in a cross-validation algorithm, where each test script is run against each solution candidate to derive individual success counts per solution.
Finally, each solution candidate is scored using a custom scoring function in which Success from test scripts and Pass from regression tests are the strongest factors, but a number of additional parameters generated during the development process also prove to have predictive power. The highest-scoring solution candidate is submitted for evaluation.
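A condensed sketch of this cross-validation and scoring step. The run_script callable, the regression bonus weight, and the omission of the additional development-time parameters are simplifications and assumptions, not Codev's actual scoring function.

```python
def pick_best_candidate(candidates, test_scripts, regression_pass, run_script):
    """candidates: patch identifiers; test_scripts: test script identifiers;
    run_script(script, candidate) -> True if the script reports Success against
    that candidate; regression_pass[candidate] -> True if all regression tests pass.
    Returns the highest-scoring candidate."""
    def score(candidate):
        test_successes = sum(run_script(script, candidate) for script in test_scripts)
        regression_bonus = 2.0 if regression_pass.get(candidate, False) else 0.0
        return test_successes + regression_bonus

    return max(candidates, key=score)
```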
Figure 8 and Table 4 illustrate the performance of solution scoring in patch selection. To enhance interpretability, Figures 6 and 7 display confidence derived through post-hoc linear calibration. The R² values of 0.82 and 0.90 shown in Figures 6 and 7 also directly represent the predictive power of the underlying solution Score.
Figure 7 also includes the actually submitted solutions, which demonstrates a best-of effect in terms of confidence, i.e. submitted patches have higher average confidence than the full set of generated patches. As intended, this translates to a higher average resolve rate. Interestingly, the average resolve rate improves disproportionately, leading to a regression line above the y = x diagonal, which gives an additional improvement in resolve rate.
Solution Reliability Diagram – Resolve Rate vs Confidence
Correlation between system calibrated confidence and actual resolve rate
All Issues
Multi-Resolved Issues
Figures 6 and 7: The x-axis shows all generated solutions sorted by system confidence and grouped into buckets of 40 candidates. The y-axis shows the actual average resolve rate for the same buckets. Figure 6 shows 1700+ solutions across all 300 issues and Figure 7 contains 1000+ solutions across the 216 Multi-Resolved Issues. For Figure 7, the green dots and regression line show how actually submitted patches correlate with resolve rates.
We define Multi-Resolved Issues as SWE-bench Lite issues that 2 or more submissions have successfully resolved, providing reliable signal that the tests accurately measure solution quality rather than test artifacts. By excluding problems with ≤1 successful submission, this subset filters out cases where failure is likely due to overly restrictive testing rather than genuine difficulty, making the remaining Multi-Resolved Issues more suitable for confidence calibration and evaluation.
It is also worth pointing out that in many cases there are several resolving solutions for a given issue and the picker only needs to pick one of them.
Scaling and Picking in Practice - Limiting development to a single file with only one rollout yielded 125 resolved issues. By generating more solution candidates across additional files and development plans, resolving solutions were found for 29 additional issues – a 23% potential improvement over the baseline.
However, this approach introduced a critical trade-off: the original set of 125 resolved issues became diluted with new solution candidates, some of which were non-resolving. Among the 125 baseline issues, 54 issues included at least one non-resolving candidate (requiring selection) and 71 issues included only resolving candidates (no additional risk).
The scaling created two distinct categories: 83 issues (29 + 54) where a picker must decide, and 71 issues where there are only resolving solutions (no picking needed). To surpass the unscaled baseline, the selection mechanism needed to achieve >65% accuracy on the 83 selection-critical issues (54 / 83).
In practice, the system successfully picked resolving solutions in 76 out of 83 cases, with 7 suboptimal selections despite available resolving alternatives. It captured 22 of the theoretical upside of 29 issues, resulting in a total of 147 resolved issues. See Figure 8.
Scaling and Picking Resolving Candidates
Attempting more rollouts and files imposes risk - requiring good picking.
Figure 8: Illustration of the effects of scaling, where 125 resolves without scaling increase to 147 with scaling. The first intermediate bar shows the effect of generating potential resolves at the risk of losing existing resolves, and the second intermediate bar shows the picking of candidates that ultimately increases performance.
The three metrics – pick accuracy, upside captured, and improvement over baseline – highlight the effectiveness of the scaling and picking process. Pick accuracy reflects how often the system selected a resolving solution when multiple candidates were present. Upside captured measures how much of the potential improvement was actually achieved, while improvement over baseline shows the overall gain in resolved issues. See Table 4.
Picking Performance Metrics
| Metric | Value |
|---|---|
| Pick Accuracy | 91.6% (76 of 83) |
| Upside captured | 75.9% (22 of 29) |
| Improvement over baseline | 17.6% (22 over 125) |
Table 4: Scaling improvement over baseline as a result of pick accuracy and upside captured
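The three values in Table 4 follow directly from the counts quoted above; a quick check:

```python
picked_ok, selection_critical = 76, 83   # correct picks among issues that required a choice
captured, potential_upside = 22, 29      # extra issues gained through scaling
baseline = 125                           # resolves from a single rollout in a single file

print(f"Pick accuracy:             {picked_ok / selection_critical:.1%}")  # 91.6%
print(f"Upside captured:           {captured / potential_upside:.1%}")     # 75.9%
print(f"Improvement over baseline: {captured / baseline:.1%}")             # 17.6%
```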
Scaling Efficiency - With the data gathered from development scaling, we measured 125 resolves (41.67%) for the first rollout in the most probable file. With an average of 3.45 rollouts and picking, we got 147 resolves (49%). From this we estimate the effect of doubling development effort to be a 9.5% relative increase in resolves, or a 4.1 percentage point increase. See Table 5.
Scaling Efficiency Analysis
2x Development Scaling gives 9.5% higher resolve or 4.1 percentage points.
| | Unscaled | Scaled | 2x Dev Scaling: Calc | 2x Dev Scaling: Effect |
|---|---|---|---|---|
| Dev. Rollouts | 1.0 | 3.45 | | |
| Resolve | 125 | 147 | (147 / 125)^(1 / log₂ 3.45) − 1 | 9.5% |
| Percent Resolve | 41.67% | 49.0% | (49.0% − 41.67%) / log₂ 3.45 | 4.1 pp |
Table 5: The unscaled resolve of 125 and the scaled resolve of 147 were generated in the same run, where 125 is the resolve count of the first rollout in the first file.
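The "2x Dev Scaling" column assumes the gain grows with the logarithm of development effort, so the measured improvement is spread over log₂ 3.45 ≈ 1.79 doublings; the table's figures can be reproduced as follows.

```python
import math

unscaled, scaled = 125, 147              # resolved issues
unscaled_pct, scaled_pct = 41.67, 49.0   # percent of the 300 issues
rollouts = 3.45                          # average development rollouts per issue
doublings = math.log2(rollouts)          # ≈ 1.79 doublings of development effort

relative_gain_per_doubling = (scaled / unscaled) ** (1 / doublings) - 1
pp_gain_per_doubling = (scaled_pct - unscaled_pct) / doublings

print(f"{relative_gain_per_doubling:.1%}")  # ≈ 9.5% higher resolve per 2x effort
print(f"{pp_gain_per_doubling:.1f} pp")     # ≈ 4.1 percentage points per 2x effort
```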
Results
Starting with all 300 issues, the system finds and reads the Correct file in 280 instances (93% conversion). In 263 of those issues, Codev also creates a plan and implements a solution in the Correct file (94% conversion). After development, the system selects a patch from the Correct file in 248 instances (94% conversion). The step with the lowest conversion is implementing a fix that successfully passes the fail-to-pass tests, i.e. implementing the desired functionality; this is achieved in 161 issues (65% conversion). The difficulty lies primarily in correctly understanding the issue and implementing a good solution. It is worth noting, though, that the fail-to-pass tests sometimes only accept particular implementations of the solution, while other implementations may also solve the stated problem.
When the system successfully passes the fail-to-pass tests, it tends to pass the regression tests as well, yielding 154 issues with resolving patches (96% conversion). Finally, 147 patches are correctly picked and submitted for evaluation (96% conversion), ultimately resolving 49% of the 300 issues on SWE-bench Lite.
This progression shows that while research and localization, regression test conversion, and patch picking all have very high conversion (95% on average), the primary loss happens when implementing the solution in a given file (65% conversion). This is an area for further improvement, where expanding the development budget beyond six iterations and using more sophisticated models may both be part of the solution.
Issue Resolution Funnel
Finding and editing Correct file* and creating a resolving patch selected for submission
Figure 9: *In almost all issues, the golden file in the SWE-bench patch is also the most commonly resolving file across submissions to SWE-bench, but for a few issues a different file is the most commonly resolving file. In those cases, that file is viewed as the Correct file.
Peer comparison - SWE-bench has revolutionized AI coding evaluation, and thanks to all submissions it is also a great asset for analyzing different approaches to AI software engineering. In the graph below we analyze how different submission skill brackets score across issue bins of increasing difficulty. The top bracket (the average of the Top 7 submissions) scores better than the All Average across all bins, with the Top 14 average and Top 50% average falling in between. With these relatively large bins of issues and submission skill brackets – containing 7, 14, 35, and 70 submissions respectively – smooth and clearly understandable patterns emerge.
Codev's 49% resolve rate is close to the Top 7 average resolve rate, but its resolve profile across issue difficulty differs in an interesting way. For the more commonly resolved issues (1-150), Codev outperforms the peer group, and for the less commonly resolved issues (151-250), it underperforms. Note that the last bin consists entirely of unresolved issues.
Resolve Profile Compared to Peers
Average Resolve Rate for Brackets of Submissions Across Bins of Issue Difficulty
Figure 10: Issues are sorted by the average resolution rate of all SWE-bench Lite submissions (made public on the SWE-bench leaderboard up until May 23) and grouped in bins of 50 issues. The first bin, with the easiest issues, has an average resolution rate of 80%, and the last bin, with issues 251-300, has never been resolved by any submission. Submissions are ordered by SWE-bench resolve rate and placed in brackets, with each bracket's average resolve rate graphed.
Unarguably, a higher resolve rate is more desirable for autonomous coding systems. However, different resolve profiles, i.e. the distribution of resolved tasks across issue difficulty, are suitable for different use cases.
When automation is the main objective, systems that excel at resolving well-defined tasks with very high probability are most useful. For more explorative work, systems with a flatter resolve profile can sometimes solve more complex tasks, though this requires more human oversight, even for easier tasks.
Regardless of the resolve profile, a system's ability to assess and communicate confidence in its own suggestions improves trust and helps focus human interaction where it is most needed.
Appendix
Model usage
Percent usage of different models based on total number of tokens.
Figure 11: Total number of input and output tokens by model. Note that cost per token varies across models.