Introduction
As AI tools continue to gain traction, more developers are turning to them for help with writing and understanding code. From generating snippets to spotting bugs and suggesting fixes, AI is becoming a valuable part of the developer’s toolkit. But with so many models available—both open-source and proprietary—it’s important to evaluate how well they actually perform on real coding tasks. Not all models are created equal, and choosing the right one can make a big difference in productivity and reliability.
In this blog, we’ll explore key evaluation datasets used to assess LLMs across a range of coding tasks—from code completion to debugging. We’ll also take a look at how top open-source and proprietary models perform, helping you identify which ones excel in different areas. Whether you're looking to catch bugs more efficiently, improve code quality, or speed up development, understanding model performance is a critical step.
TL;DR
Among the benchmarks presented, SWE-Bench (Verified) stands out as the most practical for assessing models on everyday software engineering tasks. Unlike synthetic or narrowly scoped benchmarks, SWE-Bench is grounded in real-world GitHub issues and requires models to reason about, edit, and integrate code within large, complex codebases. Its focus on realistic patch generation, cross-file dependencies, and actual bug fixes makes it the most representative benchmark for real-world developer workflows.
When evaluated specifically on SWE-Bench (Verified), Claude 3.7 Sonnet and Deepseek R1 stood out as the strongest performers among popular proprietary and open-source models, respectively—demonstrating notable effectiveness in handling realistic software engineering tasks.
Benchmark Datasets
Numerous benchmark datasets are designed to evaluate various aspects of coding performance, including code generation, bug fixing, codebase navigation, code comprehension and explanation, and real-time interactive development.
Table 1: Overview of Some of the Most Widely Used Benchmarks.
| Benchmarks | Primary Task Type | What it Evaluates | Evaluated Language |
|---|---|---|---|
| HumanEval | Code Generation | Write correct standalone functions from a prompt with signature and docstring | Python |
| MBPP | Code Generation | Solve small programming problems using input/output examples | Python |
| SWE-Bench | Bug Fixing / Patching in Large Systems | Generate code edits that resolve real GitHub issues across large codebases | Python |
| LiveCodeBench | Interactive Coding & Debugging | Perform iterative code editing using compiler/test feedback loops | Python |
| MultiPL-E | Multi-language Code Generation | Write functions from descriptions in various programming languages | 18+ languages (e.g., Bash, C++, Go, Java, JavaScript, R, Racket, Ruby, Rust, TypeScript, etc.) |
| APPS | General Programming Problem Solving | Solve diverse problems with varying difficulty | Python |
| SWE-Lancer | Real-world Dev Tasks & Management | Fix bugs, implement features, and make engineering decisions | Python |
| SAFIM | Fill-in-the-Middle | Complete semantically meaningful code segments (blocks, conditions, API calls) | Python, Java, C++, C# |
| HumanEvalExplain | Code Understanding & Regeneration | Explain code and regenerate it from explanation | Python, JavaScript, Java, Go, C++, and Rust |
Popular Benchmarks
These benchmarks frequently appear in general LLM performance overviews that often include aspects of coding evaluation.
1. HumanEval
HumanEval is a benchmark introduced in OpenAI’s 2021 paper, “Evaluating Large Language Models Trained on Code,” developed in collaboration with engineers from OpenAI, Anthropic, and Zipline.
What it Assesses
It evaluates an LLM’s ability to write Python functions that solve specific tasks, testing language comprehension, reasoning, algorithms, and simple math.
Dataset Structure
HumanEval includes 164 handwritten programming problems. Each record contains:
- Description (docstring describing expected behavior)
- Test cases (an average of 7.7 unit tests to verify correctness)
- Starter code (function signature)
Check out the full dataset on HuggingFace.
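If you want to poke around the benchmark yourself, here is a minimal sketch (assuming the Python `datasets` library is installed and that the `openai_humaneval` dataset ID on the Hub is still current) that loads HumanEval and prints the fields of one record:

```python
# Minimal sketch for browsing HumanEval records; assumes `pip install datasets`
# and that the "openai_humaneval" dataset ID is still the published one.
from datasets import load_dataset

humaneval = load_dataset("openai_humaneval", split="test")
sample = humaneval[0]

print(sample["task_id"])      # e.g. "HumanEval/0"
print(sample["prompt"])       # starter code: function signature plus docstring
print(sample["test"])         # unit tests used to check a candidate solution
print(sample["entry_point"])  # name of the function the tests call
```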
2. Mostly Basic Python Problems (MBPP)
The Mostly Basic Python Problems Dataset (MBPP) was developed by the Google Research Team and introduced in the 2021 paper, "Program Synthesis with Large Language Models".
What it Assesses
MBPP is similar to HumanEval but differs in the formatting of the prompts. Like HumanEval, it assesses an LLM’s ability to synthesize short functional Python programs based on a description.
Dataset Structure
MBPP consists of 974 crowd-sourced Python programming problems (426 of which were hand-verified and edited by the paper’s authors), designed to be solvable by entry-level programmers.
Each dataset record includes:
- Description of problem
- Test cases (3 automated test cases)
- A code solution
Check out the full dataset on HuggingFace.
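To get a feel for the prompt formatting difference, here is a small sketch that loads MBPP and assembles a prompt in the spirit of the paper, where the problem description is followed by the assert-style test cases. The dataset ID and field names (`text`, `test_list`, `code`) reflect the Hugging Face card at the time of writing and may change:

```python
# Sketch: load MBPP and assemble a prompt from one record (description + tests).
# Assumes `pip install datasets`; field names follow the Hugging Face dataset card.
from datasets import load_dataset

mbpp = load_dataset("google-research-datasets/mbpp", split="test")
record = mbpp[0]

prompt = (
    "You are an expert Python programmer, and here is your task:\n"
    f"{record['text']}\n"
    "Your code should pass these tests:\n"
    + "\n".join(record["test_list"])
)
print(prompt)
print(record["code"])  # the reference solution shipped with each record
```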
3. SWE-Bench
The SWE-Bench benchmark was introduced in the 2024 paper, "SWE-Bench: Can Language Models Resolve Real-World GitHub Issues?" and was developed by researchers at Princeton University and the University of Chicago.
What it Assesses
SWE-Bench assesses an LLM’s ability to resolve real GitHub issues and feature requests by generating patches against full repositories. This dataset evaluates models on large systems and multi-file environments, requiring them to navigate complex codebases, understand interactions between different files, identify errors, and generate code patches. It simulates a realistic software engineering environment, making it highly applicable to real-world applications.
Dataset Structure
The full SWE-Bench dataset contains 2,294 unique software engineering (SE) issues sourced from GitHub. The SWE-Bench Verified dataset is a subset of the original, containing 500 samples that have been human-verified for quality. This version is typically used in technical reports evaluating LLMs.
Each dataset record includes:
- Description of the problem
- Test cases (fail-to-pass tests the patch must fix, and pass-to-pass tests that must continue to pass)
- The codebase (repository, base commit)
- Hints (though hints are not allowed when evaluating models)
Check out the full SWE-Bench Verified dataset on HuggingFace.
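The sketch below (assuming the `datasets` library and the `princeton-nlp/SWE-bench_Verified` dataset ID; field names follow the Hugging Face card at the time of writing) shows what a single task looks like:

```python
# Sketch: peek at one SWE-Bench Verified task. Assumes `pip install datasets`;
# the dataset ID and field names follow the Hugging Face card and may change.
from datasets import load_dataset

swebench = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = swebench[0]

print(task["repo"])               # GitHub repository the issue comes from
print(task["base_commit"])        # commit the model's patch is applied on top of
print(task["problem_statement"])  # the issue text shown to the model
print(task["FAIL_TO_PASS"])       # tests that must go from failing to passing
print(task["PASS_TO_PASS"])       # tests that must keep passing
print(task["hints_text"])         # hints, excluded from the standard evaluation
```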
4. LiveCodeBench
LiveCodeBench was introduced in the 2024 paper “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code” by researchers from UC Berkeley, MIT and Cornell.
LiveCodeBench was developed to address the limitations of older benchmarks like MBPP and HumanEval, which were often too simple and not as practical for real-world coding scenarios, especially for evaluating newer and better models.
What it Assesses
The benchmark evaluates the following four aspects of model performance:
- Code generation: The model generates code from natural language problem statements and is evaluated against unseen test cases.
- Self-repair: The model is given a natural language problem and a self-generated candidate program. If the program contains an error, the model receives feedback and is tasked with producing a corrected solution (a schematic version of this loop appears after this list).
- Code execution: The model must predict the output given an input and a program snippet.
- Test case output prediction: The model must predict the output given an input and a natural language description of the problem (without access to the function implementation).
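To make the self-repair scenario concrete, here is a schematic and deliberately simplified feedback loop: the model proposes a program, the program is executed on a test input, and any failure is fed back for another attempt. The `generate_code` function is a placeholder for whatever LLM call you use; the real LiveCodeBench harness handles prompting, sandboxing, and scoring itself.

```python
# Schematic self-repair loop in the spirit of LiveCodeBench's scenario.
# `generate_code` is a placeholder for your own LLM call, not part of the benchmark.
import subprocess
import sys

def generate_code(problem: str, feedback: str | None = None) -> str:
    raise NotImplementedError("plug in your model call here")

def run_candidate(source: str, stdin_data: str) -> tuple[bool, str]:
    """Run a candidate program on one test input and capture its output or error."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        input=stdin_data, capture_output=True, text=True, timeout=10,
    )
    return proc.returncode == 0, proc.stdout if proc.returncode == 0 else proc.stderr

def solve_with_repair(problem: str, test_input: str, expected: str, rounds: int = 2) -> bool:
    feedback = None
    for _ in range(rounds):
        candidate = generate_code(problem, feedback)
        ok, output = run_candidate(candidate, test_input)
        if ok and output.strip() == expected.strip():
            return True       # solved; the first round corresponds to pass@1
        feedback = output     # error message or wrong output becomes the repair hint
    return False
```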
Dataset Structure
The LiveCodeBench dataset consists of 500+ problems sourced from LeetCode, AtCoder, and Codeforces, each evaluated against a set of test cases. These problems are intended to test models on their ability to handle a range of coding tasks and provide holistic evaluations of their capabilities.
Check out the full dataset on HuggingFace.
Additional Noteworthy Benchmarks
These benchmarks provide a deeper dive into evaluating various aspects of coding and tackle more complex tasks compared to the more widely known, older benchmarks.
5. MultiPL-E
MultiPL-E is a benchmark introduced in the 2023 paper "MultiPL-E: A Scalable and Polyglot Approach to Benchmarking Neural Code Generation," which extended the MBPP and HumanEval datasets to evaluate code generation across 18+ additional programming languages.
What it Assesses
MultiPL-E evaluates an LLM's ability to generate code in multiple programming languages, expanding upon the tasks from MBPP and HumanEval to cover a broad range of languages. It currently supports 22 languages.
Dataset Structure
MultiPL-E consists of the MBPP and HumanEval tasks translated into multiple programming languages, including Java, JavaScript, C++, Rust, and more.
Each dataset record includes:
- Description of the problem
- Test cases
- The target solution language
Check out the full dataset on HuggingFace.
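As a quick illustration of how the translated problems are organized, the sketch below loads the Rust translation of the HumanEval problems. The dataset ID, config names (such as `humaneval-rs`), and field names follow the Hugging Face card at the time of writing and may change:

```python
# Sketch: load the Rust translation of the HumanEval problems from MultiPL-E.
# Assumes `pip install datasets`; config and field names follow the dataset card.
from datasets import load_dataset

multipl_e = load_dataset("nuprl/MultiPL-E", "humaneval-rs", split="test")
problem = multipl_e[0]

print(problem["language"])  # target language for this config
print(problem["prompt"])    # translated signature and docstring to complete
print(problem["tests"])     # unit tests written in the target language
```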
6. APPS (Automated Programming Process Standard)
APPS is a benchmark introduced in the 2021 paper, "Measuring Coding Challenge Competence With APPS" by researchers at UC Berkeley, UChicago, UIUC, and Cornell.
What it Assesses
It evaluates an LLM’s ability to understand problem statements and generate correct code implementations, with problems categorized into different difficulty levels ranging from basic scripting tasks to advanced algorithmic challenges.
Dataset Structure
APPS consists of 10,000 coding problems sourced from Codewars, AtCoder, Kattis, and Codeforces, organized into three difficulty levels:
- Simple introductory problems (3,639 problems for beginner programmers)
- Interview-level problems (5,000 more algorithmic problems)
- Coding competition challenges (1,361 problems from competitions such as USACO, IOI, and ACM)
Each dataset record includes:
- Description of task with example inputs and expected outputs
- Test cases, averaging 21.2 test cases per record
- Potential starter code (such as a function header), though not all records include this
Check out the full dataset on HuggingFace.
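If you want to browse the problems by difficulty, the sketch below loads the introductory subset. It assumes the `codeparrot/apps` mirror on Hugging Face and its `introductory`/`interview`/`competition` configs, which reflect the dataset card at the time of writing:

```python
# Sketch: load the introductory split of APPS and inspect one problem.
# Assumes `pip install datasets`; dataset ID and configs follow the dataset card.
import json
from datasets import load_dataset

apps = load_dataset("codeparrot/apps", "introductory", split="test")
problem = apps[0]

print(problem["question"])      # task description with example inputs/outputs
print(problem["starter_code"])  # may be empty; not every record includes starter code
io_pairs = json.loads(problem["input_output"] or "{}")  # test inputs and expected outputs
print(len(io_pairs.get("inputs", [])), "test cases")
```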
7. SWE-Lancer
SWE-Lancer is a benchmark introduced in the 2025 paper, “SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?” by researchers at OpenAI.
What it Assesses
It assesses an LLM’s ability to provide bug fixes and/or feature implementations and even perform managerial tasks, where models must evaluate technical implementation proposals.
The dataset consists of both individual contributor (IC) tasks and managerial tasks. IC tasks focus on creating patches or implementing features, while managerial tasks involve evaluating freelancer proposals and selecting the best one. Grading for IC tasks is based on end-to-end tests and decisions verified by experienced software engineers, while managerial tasks are assessed against the choices of the original engineering managers. The overall evaluation is determined by the percentage of tasks solved and the corresponding payout earned by the model, using real freelance rates, with a total payout of up to $1 million.
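Since the headline number is the share of the total task value a model actually earns, scoring reduces to summing payouts over solved tasks. A toy illustration with made-up task prices (not real SWE-Lancer data) looks like this:

```python
# Toy illustration of SWE-Lancer-style scoring: the score is the payout earned on
# solved tasks as a fraction of the total available payout. Prices here are made up.
tasks = [
    {"payout": 250.0,  "solved": True},   # e.g. a small bug fix
    {"payout": 1000.0, "solved": False},  # e.g. a feature implementation
    {"payout": 500.0,  "solved": True},   # e.g. a managerial proposal selection
]

earned = sum(t["payout"] for t in tasks if t["solved"])
total = sum(t["payout"] for t in tasks)
print(f"Earned ${earned:,.0f} of ${total:,.0f} ({earned / total:.1%})")
```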
Dataset Structure
SWE-Lancer contains over 1,400 freelance software engineering tasks sourced from Upwork. Each dataset record includes:
- Description of issue
- End-to-end tests or managerial task correct choices
- Associated payout price
Check out this GitHub repository to view the dataset.
8. SAFIM (Syntax-Aware Fill-in-the-Middle)
SAFIM is a benchmark introduced in the 2024 paper “Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks” by researchers from UC Berkeley and Meta AI.
What it Assesses
SAFIM evaluates an LLM’s ability to perform code Fill-in-the-Middle (FIM) tasks.
In these tasks, the model is given the beginning and end of a code snippet and must correctly "fill in the middle" — a challenge that requires more than simple autocomplete. Unlike traditional code completion tasks that predict the next token or line, FIM tasks require the model to understand context from both ends and generate syntactically correct, logically coherent code.
This benchmark emphasizes syntax-aware completions for critical program structures such as code blocks and conditional expressions. FIM tasks are designed to work with syntactic units rather than filling in randomly masked lines. SAFIM includes three primary subtasks:
- Algorithmic Block Completion: Completing a critical code block for solving the problem.
- Control-Flow Expression Completion: Completing key conditional expressions essential for the program’s logic.
- API Function Call Completion: Completing function calls or object constructor calls from popular API libraries, with hints provided as necessary.
Evaluations are conducted by applying unit tests (execution-based testing) or checking for syntax matching against the ground truth.
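In practice, a FIM prompt is built by rearranging the prefix and suffix around special sentinel tokens before asking the model for the missing middle. The sketch below uses placeholder sentinels; real models (and SAFIM's different prompt designs) each define their own tokens and orderings:

```python
# Sketch of how a fill-in-the-middle prompt is typically assembled.
# The sentinel strings are placeholders; actual FIM tokens are model-specific.
PREFIX_TOKEN = "<fim_prefix>"
SUFFIX_TOKEN = "<fim_suffix>"
MIDDLE_TOKEN = "<fim_middle>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Prefix-Suffix-Middle ordering: the model is asked to generate the middle."""
    return f"{PREFIX_TOKEN}{prefix}{SUFFIX_TOKEN}{suffix}{MIDDLE_TOKEN}"

prefix = "def is_even(n):\n    "
suffix = "\n    return result\n"
print(build_fim_prompt(prefix, suffix))
# A correct completion for the masked middle would be something like:
#     result = n % 2 == 0
```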
Dataset Structure
SAFIM includes 17,720 examples from multiple programming languages, sourced from platforms like Codeforces and GitHub.
Each dataset record includes:
- A prompt (Code to Fill-in-the-Middle), presented in five different forms to account for the influence of prompt design on model performance.
- Unit tests
- Ground truth
Check out the full dataset on HuggingFace.
9. HumanEvalExplain
HumanEvalExplain is part of HumanEvalPack, which extends the HumanEval dataset to 3 coding scenarios across 6 languages (Python, JavaScript, Java, Go, C++, and Rust).
What it Assesses
HumanEvalExplain assesses an LLM’s ability to not only understand code but also explain it and then regenerate the code from its own explanation. This task involves two runs: one to generate the explanation and another to regenerate the solution based on that explanation.
This benchmark provides insight into how a model handles tasks that convert code into text, such as explaining code, generating docstrings, or adding comments, and can help improve code clarity.
Check out the full dataset on HuggingFace.
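The two-run protocol is easy to picture as pseudocode. In the sketch below, `ask_model` is a placeholder for your own LLM call; the real benchmark uses the HumanEvalPack harness to prompt the model and score the regenerated code against the original HumanEval unit tests:

```python
# Schematic of the two-run HumanEvalExplain protocol (explain, then regenerate).
# `ask_model` is a placeholder for an LLM call, not part of the benchmark itself.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model call here")

def humanevalexplain_round_trip(signature: str, reference_solution: str) -> str:
    # Run 1: explain the reference solution in natural language, with no code allowed.
    explanation = ask_model(
        "Explain what the following function does, without writing any code:\n"
        + reference_solution
    )
    # Run 2: regenerate the function from the explanation alone; the original code
    # is hidden, so the explanation has to carry all of the information.
    return ask_model(
        f"Write the function {signature} based on this description:\n{explanation}"
    )
```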
Model Evaluations
Now that we've discussed some of the key benchmark datasets, let's dive into some state-of-the-art (SOTA) proprietary and open-source LLMs, which are specifically designed for coding tasks as well as general tasks that can be applied to coding.
For further details on these models, refer to the Appendix. You can also explore the lm-evaluation-harness to evaluate popular metrics on both pre-trained and custom models yourself.
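For example, a local run with the harness's Python API might look like the sketch below, assuming your installed version exposes `simple_evaluate` and a `humaneval` task (task availability varies by release, and code-generation tasks execute model-written code, so run them in a sandboxed environment and check the harness docs for any required opt-in flags):

```python
# Hedged sketch of an lm-evaluation-harness run via its Python API.
# Assumes `pip install lm-eval` and that the installed release provides a
# "humaneval" task; check the harness docs for the task names in your version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2.5-Coder-32B-Instruct",
    tasks=["humaneval"],
    batch_size=8,
)
print(results["results"])
```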
Evaluation Results
This section presents reported performance metrics across the previously introduced benchmarks.
For HumanEval, MBPP, SWE-Bench (Verified), and LiveCodeBench, we report the pass@1 metric, which represents the percentage of tasks where a correct solution is generated on the first attempt. A solution is considered correct if it passes all the provided test cases for the corresponding problem.
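pass@1 is the k = 1 case of the unbiased pass@k estimator from the HumanEval paper: with n generated samples per problem, c of which pass all tests, the estimate is 1 - C(n-c, k) / C(n, k), averaged over problems. A direct translation:

```python
# Unbiased pass@k estimator from the HumanEval paper: given n samples per problem,
# c of which pass all tests, estimate the chance that at least one of k draws passes.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # too few failing samples to fill a draw of size k: certain success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=1, c=1, k=1))   # 1.0: a single sample either passes or it doesn't
print(pass_at_k(n=20, c=5, k=1))  # 0.25: with 20 samples and 5 passing, pass@1 = 5/20
```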
Proprietary Models
Table 2: Performance of proprietary LLMs over various datasets, measured by pass@1.
| Model | HumanEval | SWE-Bench (Verified) | LiveCodeBench |
|---|---|---|---|
| OpenAI o1 | — | 48.9% | 63.4% |
| OpenAI o1-mini | 92.4% | 41.6% | 53.8% |
| OpenAI o3-mini (high) | 97.6% | 49.3% | 74.1% |
| OpenAI 4o | 90.2% | 38.8% | 34.2% |
| OpenAI 4o-mini | 87.2% | — | 23.0% |
| Claude 3 Opus | 84.9% | 11.7% | 34.6% |
| Claude 3.7 Sonnet | 97.8% | 70.3% | — |
| Claude 3 Haiku | 75.9% | — | — |
| Google Gemini 2.5 Pro | 98.5% | 63.8% | 70.4% |
| Codestral 22B | 81.1% | — | 31.0% |
| Mistral Large 2 | 89.8% | — | 29.3% |
Figure 1: Performance of proprietary LLMs over various datasets, measured by pass@1.
Open-source Models
Table 3: Performance of open-source LLMs over various datasets, measured by pass@1.
| Model | HumanEval | MBPP | SWE-Bench (Verified) | LiveCodeBench |
|---|---|---|---|---|
| Google Gemma 3 27B | 48.8% | 65.6% | — | — |
| Google Codegemma 7B | 44.5% | 56.2% | — | — |
| Deepseek R1 | — | — | 49.2% | 65.9% |
| Deepseek V3 | 82.6% | — | 42.0% | — |
| Deepseek Coder-V2 Instruct | 90.2% | — | 12.7% | 43.4% |
| Qwen2.5 72B Inst | 80.4% | — | 23.8% | — |
| Qwen 2.5-coder 32B Inst | 92.7% | 90.2% | 31.4% | — |
| Codegeex4-all-9B | 82.3% | 75.7% | — | — |
Figure 2: Performance of open-source LLMs over various datasets, measured by pass@1.
Evaluation Analysis
Here are the top performing models for each benchmark.
1. HumanEval
- Top Performing Models: Gemini 2.5 Pro (proprietary), Qwen 2.5-coder 32B Inst (open-source)
- What it Means: The models showed strong proficiency in small-scale code generation tasks, excelling in providing precise solutions for brief problem statements.
2. MBPP
- Top Performing Models: Qwen 2.5-coder 32B Inst (open-source)
- What it Means: This model showed excellent performance in its ability to solve basic coding problems typically handled by entry-level programmers.
3. SWE-Bench
- Top Performing Models: Claude 3.7 Sonnet (proprietary), Deepseek R1 (open-source)
- What it Means: These models showed impressive skills in managing real-world software engineering tasks. They can handle complex software engineering issues, including understanding and solving problems across various codebases, which makes them well suited for large-scale, multifaceted tasks.
4. LiveCodeBench
- Top Performing Models: o3-mini (high) (proprietary), Deepseek R1 (open-source)
- What it Means: These models excel in tasks involving code debugging, execution, and output prediction.
Summary:
- Gemini 2.5 Pro (proprietary): Excels at precise, small-scale code generation and solving well-defined programming tasks.
- Claude 3.7 Sonnet (proprietary): Excels at complex software engineering problems and understanding real-world issues across large codebases.
- OpenAI o3-mini (high) (proprietary): Excels at live, interactive reasoning and multi-turn programming tasks requiring iteration and debugging.
- Qwen 2.5-coder 32B Inst (open-source): Excels in generating accurate code for both simple and moderately complex problems; ideal for entry-level and intermediate programming tasks.
- Deepseek R1 (open-source): Strong real-world coding capabilities, with good performance on both engineering-grade and interactive, iterative coding problems.
Conclusion
While benchmarks like HumanEval, SWE-Bench, and LiveCodeBench offer valuable insights into model strengths across a range of coding tasks, they capture only a slice of overall performance. Real-world software development is far more complex and can depend on other factors.
It’s also important to recognize that both models and benchmarks are evolving rapidly. New datasets are regularly introduced, and existing ones are continuously refined to better mirror real-world challenges.
To stay up to date on the newest models for coding tasks, check out popular leaderboards like the BigCodeBench Leaderboard, LiveCodeBench Leaderboard, and EvalPlus Leaderboard, which track model performance across a wide range of coding tasks.
Appendix
Here’s some additional information about the models selected for evaluation.
OpenAI Models
- o1: A reasoning-focused model tuned for complex problem solving, coding, and planning.
- o1-mini: A lighter version of o1 optimized for efficiency.
- o3-mini: A model optimized for reasoning and coding, offering better performance than o1-mini while remaining lightweight.
- 4o: OpenAI’s flagship multimodal model for fast, accurate, and interactive tasks across text, audio, and vision.
- 4o-mini: A smaller, efficient version of GPT-4o
Claude
- Claude 3 Opus: Anthropic's most advanced and capable model, designed for top-tier performance in complex tasks like research, strategy, and high-level reasoning.
- Claude 3 Haiku: Anthropic’s fastest Claude 3 model, optimized for speed and efficiency while still offering solid coding and reasoning capabilities.
- Claude 3.7 Sonnet: A balance of intelligence and speed, making it suitable for a wide range of tasks with solid performance and responsiveness.
Google
- Gemini 2.5 Pro: Google DeepMind’s top proprietary model for coding and complex tasks, designed for multimodal reasoning and large-scale AI applications.
- Gemma 3 27B: Google DeepMind’s latest open-weight model for advanced text generation and reasoning, built for flexibility across devices and research use.
- CodeGemma 7B: A text-to-code model from Google DeepMind, fine-tuned for pure code generation without instruction tuning.
Deepseek
- DeepSeek R1: An open-source LLM focused on advanced reasoning tasks like math, coding, and logic.
- DeepSeek V3: A large-scale model designed for advanced tasks including chat, coding, and general reasoning.
- Coder V2: A specialized model fine-tuned for coding, excelling in code generation and mathematical reasoning, trained on a diverse corpus of source code, math, and natural language data.
CodeGeeX4-All
- CodeGeeX4-All-9B: A model specifically fine-tuned to assist with code generation, completion, and problem-solving, optimizing productivity for developers across a variety of programming languages.
Mistral
- Codestral: A language model optimized for coding, specializing in low-latency, high-frequency tasks like fill-in-the-middle (FIM), code correction, and test generation.
- Mistral-Large: A top-tier reasoning model designed for high-complexity tasks requiring advanced logical analysis and problem-solving.
Qwen
- Qwen2.5: A model excelling in natural language understanding and generation.
- Qwen 2.5 Coder: A model tailored for coding tasks, leveraging 32 billion parameters to assist with code generation, debugging, and problem-solving.
Sources of reported metric values
Articles and Company Webpages
Anthropic: Claude 3.7 Sonnet and Claude Code
Anthropic: Introducing the next generation of Claude
Google DeepMind: Gemini
OpenAI: OpenAI o1-mini
Qwen Team: Qwen2.5-Coder Series: Powerful, Diverse, Practical.
Hugging Face
google/codegemma-7b
google/gemma-3-27b-pt
deepseek-ai/DeepSeek-R1
deepseek-ai/DeepSeek-V3
THUDM/codegeex4-all-9b