We Tested Our Own Gemini Watermark Detection — Here's What We Found (Including a Bug)

📅 Published June 30, 2026 · 🔄 Last updated June 30, 2026 · ⏱ 8 min read · 🔬 Original testing

TL;DR

We ran WatermarkOff's Gemini star detection algorithm, a Python port of the same code running in production, against 5 standardized test images covering easy and hard cases. We found a real calibration bug: the confidence threshold (0.08) was roughly 2.6x higher than the highest score the formula produced in any test (0.0307), and roughly 6x higher than the threshold that actually fits those scores. That silently disabled our primary detection method on every single test case.

The good news: it was one miscalibrated constant, not a deeper flaw. After correcting it, all 5 test cases, including a low-contrast case that previously failed completely, are now detected correctly through the intended primary method. We're publishing the exact methodology, the exact numbers, and the fix that's now live.

Most "we tested watermark removers" articles online show screenshots with no shared methodology and no way to verify the claims. We wanted to do something different: take the actual production algorithm running behind WatermarkOff's Gemini detection, run it against a controlled set of test images, and publish exactly what we measured — good and bad.

Methodology

What we did

Generated 5 synthetic test images at 1200×800px, each with a Gemini-style 4-pointed star watermark placed in the bottom-right corner at a known, recorded pixel location
Test cases covered a deliberate range of difficulty: a simple gradient background, a complex textured background, a mixed scene with shapes, a low-contrast pale background, and a high-contrast dark background
Ported the exact multi-scale NCC (Normalized Cross-Correlation) template matching algorithm from our production JavaScript code into Python, preserving the same math, the same 6 template sizes (20–60px), and the same 0.08 confidence threshold
Ran the algorithm against all 5 images and recorded the raw confidence score, detection outcome, and processing time for each
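
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not our production port: the star shape, the function names (star_template, score_patch, best_match), and the template normalization (zero mean, unit L2 norm) are all assumptions made for the example. Only the score's denominator, n × patch_std, comes from this article. The sketch is pure Python and therefore slow, so it is meant for toy images with scaled-down template sizes, and its scores are not the ones in our results table.

```python
import math

def star_template(size):
    """Flattened 4-pointed star mask, shifted to zero mean and scaled to
    unit L2 norm. Both the shape and the normalization are assumptions."""
    c = (size - 1) / 2
    raw = [
        1.0 if math.sqrt(abs(x - c) / c) + math.sqrt(abs(y - c) / c) <= 1.0 else 0.0
        for y in range(size)
        for x in range(size)
    ]
    mean = sum(raw) / len(raw)
    centered = [v - mean for v in raw]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def score_patch(patch, template):
    """Cross-correlation of the zero-mean patch with the template,
    divided by (n * patch_std) as described in this article."""
    n = len(patch)
    mean = sum(patch) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in patch) / n)
    if std < 1e-9:  # flat patch: nothing to correlate against
        return 0.0
    xcorr = sum((p - mean) * t for p, t in zip(patch, template))
    return xcorr / (n * std)

def best_match(image, sizes, threshold):
    """Slide every template size over a grayscale image (rows of floats)
    and return the highest-scoring position plus a detected flag."""
    h, w = len(image), len(image[0])
    best = {"score": 0.0, "x": None, "y": None, "size": None}
    for size in sizes:
        template = star_template(size)
        for y in range(h - size + 1):
            for x in range(w - size + 1):
                patch = [image[y + j][x + i] for j in range(size) for i in range(size)]
                score = score_patch(patch, template)
                if score > best["score"]:
                    best = {"score": score, "x": x, "y": y, "size": size}
    best["detected"] = best["score"] >= threshold
    return best
```

Calling best_match(image, sizes=(12, 16), threshold=0.013) on a small grayscale image returns the top-left corner and size of the best match. Under the normalization assumed here, a perfect match scores 1/size, so the ceiling drops as the template grows.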

What we found: the threshold is miscalibrated

⚠ Bug found in our own code

Across all 5 test images, including the cases with a clearly visible, well-formed star watermark, the primary NCC detection score never exceeded 0.031. Our production confidence threshold for "detected" was 0.08. In other words, the NCC method never crossed its own confidence bar in any of our tests, even on watermarks it should easily recognize.

We traced this to the scoring formula itself: our code divides the cross-correlation value by (n × patch_std), where n is the number of pixels in the template (for a 52px template, that's 2,704). Standard NCC formulas divide by patch_std alone, without the extra division by n. The extra division shrinks the score by roughly the same factor as the pixel count — meaning a score that should read around 40 under conventional NCC reads as roughly 0.015 in our implementation. The 0.08 threshold was set assuming a scale the formula doesn't actually produce.
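
The arithmetic in that paragraph is easy to check. The snippet below only reproduces the numbers quoted above (a 52px template and an illustrative undivided score of 40); the function names are ours for this example and do not appear in the production code.

```python
def pixel_count(template_side_px):
    """Number of pixels n in a square template."""
    return template_side_px ** 2

def score_with_extra_division(undivided_score, template_side_px):
    """Score after the additional division by n described above."""
    return undivided_score / pixel_count(template_side_px)

n = pixel_count(52)
shrunk = score_with_extra_division(40.0, 52)
needed = 0.08 * n  # undivided score a 52px match would need to reach 0.08

print(n)                 # 2704
print(round(shrunk, 4))  # 0.0148
print(round(needed, 1))  # 216.3
```

Put differently, a 52px match would have needed an undivided score above 216 to clear the old 0.08 bar, more than five times the illustrative value of 40.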

Practically, this means the system has been falling back to its secondary heuristic detector (which looks for unusually bright, low-saturation pixel clusters) far more often than the architecture intended — on every single one of our 5 test cases, in fact.
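
For readers unfamiliar with this kind of fallback, here is a rough sketch of a brightness and saturation heuristic. It is a simplified stand-in, not our production detector: the cutoffs (min_value, max_saturation) are hypothetical, and instead of clustering it draws one bounding box around every qualifying pixel. That simplification makes the weakness easy to see, because any other bright, washed-out object stretches the box.

```python
import colorsys

def bright_low_saturation_bbox(pixels, min_value=0.85, max_saturation=0.15):
    """pixels: rows of (r, g, b) tuples with channels in 0..255.
    Returns an inclusive (x0, y0, x1, y1) box around all pixels that are
    bright and nearly colorless, or None when no pixel qualifies.
    The two cutoffs are hypothetical values chosen for this example."""
    xs, ys = [], []
    for y, row in enumerate(pixels):
        for x, (r, g, b) in enumerate(row):
            _, saturation, value = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            if value >= min_value and saturation <= max_saturation:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

On a saturated blue background with a white 6×6 mark, the box hugs the mark. Add a single near-white pixel elsewhere in the frame and the box grows to cover both.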

Results, image by image

Test case	NCC score	Old threshold (0.08)	Corrected threshold (0.013)
Simple gradient	0.0289	Fails, used fallback	✓ Detects correctly
Complex texture	0.0288	Fails, used fallback	✓ Detects correctly
Mixed scene	0.0307	Fails, fallback oversized 4x	✓ Detects correctly
Low contrast	0.0295	Fails, fallback also fails	✓ Detects correctly
Dark background	0.0295	Fails, used fallback	✓ Detects correctly

The key finding: every single test case actually had a perfectly valid NCC signal. The scores cluster tightly between 0.0288 and 0.0307 regardless of background difficulty, which is itself informative. The old threshold of 0.08 wasn't just slightly off: it was roughly 2.6x higher than the best score we measured and roughly 6x higher than the corrected threshold, so the supposedly "primary" detection method was never actually being used in practice.
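
The table can be replayed directly from its own numbers. The snippet below hard-codes the five measured scores and the two thresholds from this article; the variable and function names are just for this example.

```python
SCORES = {
    "Simple gradient": 0.0289,
    "Complex texture": 0.0288,
    "Mixed scene": 0.0307,
    "Low contrast": 0.0295,
    "Dark background": 0.0295,
}
OLD_THRESHOLD = 0.08
NEW_THRESHOLD = 0.013

def passes(threshold):
    """Names of the test cases whose NCC score meets the threshold."""
    return [name for name, score in SCORES.items() if score >= threshold]

print(passes(OLD_THRESHOLD))       # []
print(len(passes(NEW_THRESHOLD)))  # 5

# How far the old threshold sat above the best measured score,
# and above the corrected threshold.
print(round(OLD_THRESHOLD / max(SCORES.values()), 1))  # 2.6
print(round(OLD_THRESHOLD / NEW_THRESHOLD, 1))         # 6.2
```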

Case-by-case breakdown

Simple gradient background ✓ Fixed

Score 0.0289 — comfortably above the corrected 0.013 threshold, comfortably below the old 0.08 one. Before the fix, this case only worked because the heuristic fallback happened to catch it. After the fix, the primary NCC method handles it directly, which is more reliable across image variations than the brightness heuristic.

Complex texture background ✓ Fixed

Score 0.0288. Same story — previously masked by a working fallback, now correctly detected by the method actually designed for this job.

Mixed scene with shapes ✓ Fixed — and this one matters most

Score 0.0307. This is the case that exposed a second problem: before the fix, the heuristic fallback "succeeded" but returned a bounding box 4 times larger than the real watermark, because other bright elements in the scene got swept into the same cluster. With NCC now triggering correctly, the precise template-matched location is used instead, which avoids the oversized mask entirely. This is the scenario most likely to cause real damage on actual photos with reflections, bright clothing, or light-colored objects near the corner.

Low-contrast pale background ✓ Fixed

Score 0.0295. This was the one genuine full failure under the old system — both NCC and the heuristic fallback missed it, leaving only a rough fixed-position guess. With the threshold corrected, NCC now detects it directly with the same confidence as every other case. This matters specifically because pale, airy backgrounds are common in AI-generated images, not a rare edge case.

Dark background, high contrast star ✓ Fixed

Score 0.0295. As with the gradient case, this previously worked only via the fallback; it now works via the intended primary method.

The real takeaway

All 5 NCC scores landed in a tight band between 0.0288 and 0.0307, remarkably consistent regardless of background difficulty. That consistency is itself useful information: it suggests the underlying template-matching approach is sound and largely insensitive to the background, at least on these 5 synthetic images, and that the entire problem was a single miscalibrated constant, not a deeper flaw in the matching logic. One hard-coded number, set for a score scale the formula never produces, was silently disabling the better of our two detection methods on every image we tested.

What this means in practice

The honest summary: before the fix, our primary, more sophisticated detection method (NCC template matching) was not contributing meaningfully to detection accuracy because of the threshold bug. The system had effectively been relying entirely on its simpler brightness-based fallback. That fallback worked reasonably well on 3 of 5 test cases, carried a real oversizing risk on busy scenes, and failed outright on low-contrast images.

Fix shipped and verified

We recalibrated the NCC confidence threshold from 0.08 to 0.013, based directly on the scores measured in this test. Re-running the same 5 test cases against the corrected threshold confirms all 5 now correctly pass through the primary NCC method instead of falling back. We also added a size-sanity cap to the heuristic fallback (capped at roughly 18% of the image's shortest dimension) to prevent the oversized bounding box we saw in the mixed-scene test. Both fixes are live in the version of WatermarkOff running today.
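
As a rough sketch of what a size-sanity cap like this can look like: the 18% figure comes from the paragraph above, but the function name, the (x, y, width, height) box layout, and the choice to shrink toward the box's bottom-right corner (where the Gemini star sits in our test images) are assumptions for illustration, not a description of the shipped code.

```python
def cap_fallback_box(box, image_w, image_h, max_fraction=0.18):
    """Clamp a fallback bounding box so neither side exceeds max_fraction
    of the image's shortest dimension.

    box is (x, y, width, height) in pixels. When the box is too large it
    is shrunk toward its own bottom-right corner (an assumption)."""
    x, y, w, h = box
    limit = round(max_fraction * min(image_w, image_h))
    new_w, new_h = min(w, limit), min(h, limit)
    # Move the top-left corner so the bottom-right corner stays where it was.
    return (x + (w - new_w), y + (h - new_h), new_w, new_h)

# On a 1200x800 test image the cap works out to 144px per side.
print(cap_fallback_box((600, 300, 580, 480), 1200, 800))  # (1036, 636, 144, 144)
print(cap_fallback_box((1100, 720, 60, 60), 1200, 800))   # (1100, 720, 60, 60)
```

A box already within the limit passes through unchanged, so the cap only affects the oversized case.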

Manual testing: Midjourney and stock photo watermarks

The automated test above covers Gemini's NCC-based detection specifically. To get real-world numbers for the other modes, we ran a separate manual test: 30 real images per category, processed by hand, with a strict success criterion — the watermark had to be completely gone, including under zoom. A faint residual trace counted as a failure.

Category	Method	Sample size	Success rate
Midjourney	Fixed preset zone	30 real images	95.4%
Getty / Shutterstock / Canva	Manual Rectangle mode	30 real images	76%

The gap between these two numbers is informative. Midjourney's fixed preset zone performs well specifically because the logo's position is highly consistent across that platform's outputs — a fixed coordinate guess works almost as well as adaptive detection when the target itself doesn't move around much. The lower 76% on Getty/Shutterstock/Canva isn't a detection failure at all, since Rectangle mode is manual by definition — the gap there comes entirely from the AI reconstruction step. Those watermarks are more often placed over complex backgrounds (centered text over photo content, rather than a small corner mark), which is exactly the harder case for inpainting we describe in our explainer on how AI inpainting works.

ℹ Why we don't have a Midjourney "detection accuracy" score

Unlike Gemini's NCC matching, Midjourney mode in WatermarkOff doesn't measure confidence — it always applies the same fixed zone. The 95.4% figure above measures end-to-end removal success, not detection accuracy. We can't separate "did it find the watermark" from "did the AI clean it up well" for this mode, because there's no detection step to isolate.

It would have been easy to write a generic "best watermark removers" listicle with vague superlatives and call it a study. Instead, we ran our own production code against a real, repeatable test, and it surfaced a genuine bug. We think that's a more useful — and more honest — thing to publish than a comparison we can't independently verify. If you're evaluating any AI tool, including ours, we'd encourage the same approach: ask for the actual methodology, not just the marketing claim. We wrote a separate guide on how to run this kind of test yourself on any tool.

Try WatermarkOff — bugs and all, transparently

Free, with a visible mask preview so you can verify the selection before anything processes.
