Detecting cheating in SWE-bench submissions
WEDNESDAY, NOVEMBER 19, 2025 • 5 MINS
I spent yesterday writing a script for determining how similar model solutions are to the ground truth for the SWE-bench benchmark. I wanted to know:
- Generally, how similar are models' submissions to the ground truth? Are models just regurgitating PRs? Or are their solutions novel?
- Are there any suspicious submissions in SWE-bench?
Roughly, I consider a SWE-bench submission "suspicious" if a high percentage of its predictions exactly match the reference solution / gold patch.
By "exactly match", I mean that the gold patch is found verbatim in the prediction patch.
Of course, naively comparing the gold and prediction patches as raw strings doesn't work because of patch metadata, so I apply the following minor transformations:
- I remove all comments from the gold + prediction patches, so we're focused just on code.
- I ignore all files in the prediction patch not edited by the gold patch.
- I cast the gold/prediction patch strings into a `unidiff.PatchSet` object, then check if each "hunk" from the gold patch is found verbatim in the prediction patch.
Here is the relevant code (from the `detect_similarity.py` file):
```python
import unidiff


class OrigParseError(Exception):
    """Raised when the gold patch cannot be parsed (should never happen)."""

class PredParseError(Exception):
    """Raised when a predicted patch cannot be parsed."""


def normalize_hunk(hunk):
    # Drop added comment lines from a hunk so we compare only code
    lines = []
    for line in hunk:
        if line.line_type == '+' and line.value.strip().startswith('#'):
            continue
        if line.line_type in ('+', '-', ' '):
            lines.append((line.line_type, line.value))
    return lines

def normalize_file(patched_file):
    return [normalize_hunk(hunk) for hunk in patched_file]

def patch_contained_in(orig, pred):
    """Returns True if orig is found exactly in pred."""
    # Remove all comments from patches first
    try:
        orig_files = {f.target_file: normalize_file(f) for f in unidiff.PatchSet(orig)}
    except Exception:
        # SHOULD NEVER HAPPEN
        raise OrigParseError("Failed to parse original patch")
    try:
        pred_files = {f.target_file: normalize_file(f) for f in unidiff.PatchSet(pred)}
    except Exception:
        raise PredParseError("Failed to parse predicted patch")

    # If prediction patch doesn't edit all the files the gold patch edits, assume False
    if not set(orig_files.keys()).issubset(set(pred_files.keys())):
        return False

    for filename, orig_hunks in orig_files.items():
        for orig_hunk in orig_hunks:
            if orig_hunk not in pred_files[filename]:
                # Hunk not found exactly in the corresponding prediction file
                return False
    return True
```
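For context, here is roughly how I apply this check across a submission's predictions file. This is a minimal sketch, not the exact driver script: the `instance_id`/`model_patch` keys and the gold `patch` field follow the standard SWE-bench prediction and dataset formats, the Hugging Face dataset name is assumed, and unparseable prediction patches are simply counted as non-matches.

```python
import json

from datasets import load_dataset


def exact_match_rate(preds_path, dataset_name="princeton-nlp/SWE-bench_Verified"):
    # Map instance_id -> gold patch for the split being scored
    gold = {row["instance_id"]: row["patch"]
            for row in load_dataset(dataset_name, split="test")}

    matches, total = 0, 0
    with open(preds_path) as f:
        for line in f:
            pred = json.loads(line)
            gold_patch = gold.get(pred["instance_id"])
            if gold_patch is None:
                continue  # prediction is for an instance outside this split
            total += 1
            model_patch = pred.get("model_patch") or ""
            try:
                if model_patch and patch_contained_in(gold_patch, model_patch):
                    matches += 1
            except PredParseError:
                pass  # unparseable prediction patches count as non-matches
    return matches / total if total else 0.0
```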
I then ran the detection script on all SWE-bench Lite, SWE-bench Verified, and full SWE-bench (test split) submissions (Leaderboard, Submission Repository).
Findings:
- The average exact match rate for SWE-bench Verified submissions (excluding one outlier) is 6.7% (~34 out of 500), with a min/max of 0%/13%.
- For SWE-bench Lite, it's 4% (12/300), with a min/max of 0%/11.2%.
- For the full SWE-bench test split, it's 2.45% (~56/2294), with a min/max of 0%/4.05%.
- One submission (`20240820_honeycomb`) was suspicious, with exact match rates of 78.7%/87.2% on Verified/Lite.
Below, I graph the % resolved rate and % exact match rate for the submissions in SWE-bench Verified (top), Lite (middle), and Test (bottom) with the highest % exact match rates.
![[SWE-bench Verified] Percentage of predictions matching ground truth exactly](/static/pictures/blogs/cheating/detect_verified.png)
![[SWE-bench Lite] Percentage of predictions matching ground truth exactly](/static/pictures/blogs/cheating/detect_lite.png)
![[SWE-bench] Percentage of predictions matching ground truth exactly](/static/pictures/blogs/cheating/detect_test.png)
Going forward, we plan to run this script on all submissions and ask for clarification on any with abnormally high (>20%) exact match rates.
Investigating Honeycomb
The Honeycomb submission stood out because its exact match rate was not only very high but also, more strangely, far higher than its resolution rate; this prompted a closer look.
I found that evaluation/verified/20240820_honeycomb/all_preds.jsonl contained 2236 predictions, far more than the 500 instances in SWE-bench Verified.
This is probably human error: a file intended for the full split was mistakenly uploaded as a Verified submission.
When I re-ran the analysis on only the 500 Verified instances, the exact match rate dropped to a normal level: 16 / 500 (~3.2%).
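A quick way to confirm this kind of mismatch is to compare the submission's instance IDs against the 500 Verified IDs. A rough sketch (the dataset name and the `instance_id` field are assumed from the standard Hugging Face release):

```python
import json

from datasets import load_dataset

# IDs of the 500 instances in the Verified split
verified_ids = {row["instance_id"]
                for row in load_dataset("princeton-nlp/SWE-bench_Verified", split="test")}

with open("evaluation/verified/20240820_honeycomb/all_preds.jsonl") as f:
    pred_ids = [json.loads(line)["instance_id"] for line in f]

extra = [i for i in pred_ids if i not in verified_ids]
print(f"{len(pred_ids)} predictions, {len(extra)} outside Verified")
# Per the numbers above: 2236 predictions, 1,736 of them outside the 500 Verified instances
```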
In short:
- Honeycomb uploaded predictions covering the full test split (2236 predictions) rather than just the 500 Verified instances.
- Our pipeline correctly evaluated only the relevant 500 Verified instances.
- Within those 500, there was no suspicious behavior.
What is strange is that the remaining 1,736 instances had an extremely high exact match rate. The most plausible explanation is that these were not true model predictions, but accidentally included gold solutions (or near-identical copies). Since the properly-evaluated subset looks normal, this points to a formatting mistake rather than intentional cheating.
I also checked the Honeycomb test split submission separately; its exact match rate was 1.7%, which is completely normal.
Overall, this appears to be a submission error — though it served as a useful stress test for the detection mechanism.
(Note: the submission was removed a while ago for an unrelated reason. We reached out for a technical report but received no response.)