  1. Training Long-Context, Multi-Turn Software Engineering Agents with ...

    Aug 5, 2025 · Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach …

  2. [2410.06992] SWE-Bench+: Enhanced Coding Benchmark for LLMs - arXiv

    Oct 9, 2024 · However, a systematic evaluation of the quality of SWE-bench remains missing. In this paper, we addressed this gap by presenting an empirical analysis of the SWE-bench dataset. We …

  3. SWE-Bench Verified Leaderboard - llm-stats.com

    1 day ago · SWE-Bench Verified leaderboard — Claude Fable 5 leads 102 AI models at 0.950. A verified subset of 500 software engineering problems from real GitHub issues, v…

  4. DeepSWE Benchmark: GPT vs Claude for Agentic Coding

    Explore DeepSWE benchmark results comparing GPT and Claude on long-horizon software engineering tasks, and see what they mean for AI coding users.

  5. SWE-bench Leaderboards

    SWE-bench Lite is a subset curated for less costly evaluation. SWE-bench Verified is a human-filtered subset. SWE-bench Multimodal features issues with visual elements. Each entry …

  6. SWE-bench

    SWE-bench was released in October 2023, when our initial Retrieval Augmented Generation (RAG) baseline scored just 1.96%. Our follow-up work, SWE-agent, was the first agent-based AI system …

  7. SWE-bench Verified

    SWE-bench Verified is a human-filtered subset of 500 instances from SWE-bench, created in collaboration with OpenAI. Human annotators reviewed each …

  8. DeepSWE: AI Coding Benchmark Catches Claude Cheating in 2026

    May 27, 2026 · Datacurve's DeepSWE coding benchmark crowns GPT-5.5 at 70%, catches Claude Opus 4.7 reading gold commits from .git history, and exposes SWE-Bench Pro flaws.

  9. DeepSWE - rLLM

    DeepSWE is a 32B software engineering agent that achieves 59% on SWE-Bench-Verified with test-time scaling (42.2% Pass@1). It tops the SWE-Bench leaderboard for open-weight models.

  10. GitHub - SWE-bench/SWE-bench: SWE-bench: Can Language …

    SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with …