User:Multichill/Scaled-down duplicates
People sometimes make the mistake of transferring a thumbnail to Commons instead of the original image from a Wikipedia. This page describes a bot to spot and mark these kinds of mistakes.
Process
Find pairs to work on
First we need pairs of images to work on. These pairs can be found in several ways:
- On Commons we have an image and at some Wikipedia we have an image with the same name, but a different hash
- On Commons we have an image with a name in the form <number>px-<name>.<extension> where an image <name>.<extension> exists at some Wikipedia or Commons
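The second case can be detected from the file name alone. A minimal sketch, assuming the thumbnail naming convention described above; the function name is hypothetical:

```python
import re

# Matches thumbnail-style names such as "120px-Example.jpg":
# a pixel width, "px-", then the original name with its extension.
THUMB_NAME = re.compile(r'^(\d+)px-(.+\.[A-Za-z]+)$')

def original_name(filename):
    """Return the presumed original file name for a thumbnail-style
    name, or None if the name does not look like a thumbnail."""
    match = THUMB_NAME.match(filename)
    if match:
        return match.group(2)
    return None
```

A hit only means the name looks like a thumbnail; the actual comparison still has to be done on the image data.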
We should probably divide the work:
- Batch runs to find old duplicates
- Daily run to find yesterday's duplicates
Match duplicates
We work on the pairs found above and try to match them.
Size
One of the images should be smaller in size. This is the image that could be marked in the end.
Aspect ratio
The images should have about the same aspect ratio. For example, with a 20% margin: 0.8 < (height of image A / width of image A) / (height of image B / width of image B) < 1.2
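The check above can be sketched in a few lines. This is an assumption about how the bot might implement it; the function name and the `(width, height)` tuple convention are mine:

```python
def similar_aspect_ratio(size_a, size_b, margin=0.2):
    """Check whether two (width, height) pairs have roughly the same
    aspect ratio, within the given margin (20% by default)."""
    width_a, height_a = size_a
    width_b, height_b = size_b
    ratio = (height_a / width_a) / (height_b / width_b)
    return (1 - margin) < ratio < (1 + margin)
```

Pairs that fail this cheap check can be discarded before the more expensive histogram comparison.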
Histogram
Histograms are the core of the matching. First the larger image has to be scaled down to the same size as the smaller one. It's probably best to make a couple of histograms:
- Whole images
- Top left part of the images
- Top right part of the images
- Bottom left part of the images
- Bottom right part of the images
- Central part of the images
Each pair of histograms will match to a certain percentage. If this percentage is above a certain threshold, we have a match.
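A minimal sketch of the two building blocks: crop boxes for the six regions (in the `(left, upper, right, lower)` style of PIL's `Image.crop`), and a histogram overlap score. The exact region boundaries and the overlap formula are assumptions; in practice an image library such as PIL would do the scaling, cropping, and histogram extraction:

```python
def region_boxes(width, height):
    """Crop boxes for the whole image, the four quadrants and the
    centre. The centre box covers the middle 50% in each dimension."""
    return {
        'whole': (0, 0, width, height),
        'top_left': (0, 0, width // 2, height // 2),
        'top_right': (width // 2, 0, width, height // 2),
        'bottom_left': (0, height // 2, width // 2, height),
        'bottom_right': (width // 2, height // 2, width, height),
        'centre': (width // 4, height // 4, 3 * width // 4, 3 * height // 4),
    }

def match_percentage(hist_a, hist_b):
    """Overlap between two equal-length histograms as a percentage:
    the counts shared per bin, relative to the total count of the
    first histogram."""
    shared = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    return 100.0 * shared / sum(hist_a)
```

Since both images are the same size after scaling, their histograms have the same total count and the overlap score is symmetric in practice.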
Mark duplicates
The lower-quality image of the matched pair should be marked with a template containing:
- The location of the higher quality image
- The size of this image and the other image
- The height of this image and the other image
- The width of this image and the other image
- Maybe the aspect ratios
- The results of the histogram calculations
- The match percentage
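The marking step could build wikitext along these lines. The template name `{{Scaled-down duplicate}}` and its parameter names are hypothetical, not the actual template used on Commons:

```python
def mark_template(original_name, this_size, original_size, match):
    """Build the wikitext to place on the lower-quality image,
    pointing at the higher-quality original."""
    this_width, this_height = this_size
    orig_width, orig_height = original_size
    return ('{{Scaled-down duplicate'
            f'|original={original_name}'
            f'|width={this_width}|height={this_height}'
            f'|original_width={orig_width}|original_height={orig_height}'
            f'|match={match:.1f}'
            '}}')
```

Recording the measurements in the template lets a human reviewer judge the match without rerunning the comparison.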
Implementation
The first implementation is available in the pywikipedia package and is called match_images.py (source).