User:Multichill/Scaled-down duplicates

People sometimes make the mistake of transferring a thumbnail, rather than the original image, from a Wikipedia to Commons. This page describes a bot to spot and mark this kind of mistake.

Process

Find pairs to work on

First we need pairs of images to work on. These pairs can be found in several ways:

  1. On Commons we have an image, and some Wikipedia has an image with the same name but a different hash
  2. On Commons we have an image with a name of the form <number>px-<name>.<extension>, where an image <name>.<extension> exists at some Wikipedia or on Commons
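The second kind of pair can be recognised from the file name alone. The following is a minimal sketch, not the bot's actual code; the names THUMB_NAME and original_name are made up for this illustration, and checking whether the original really exists on some wiki is left out.

```python
import re

# Hypothetical pattern for names such as "800px-Example.jpg":
# digits, then "px-", then the original "<name>.<extension>".
THUMB_NAME = re.compile(r"^(?P<width>\d+)px-(?P<name>.+\.[^.]+)$")


def original_name(filename):
    """Return (width, original name) if filename looks like a thumbnail, else None."""
    match = THUMB_NAME.match(filename)
    if match is None:
        return None
    return int(match.group("width")), match.group("name")


if __name__ == "__main__":
    print(original_name("800px-Example.jpg"))
    print(original_name("Example.jpg"))
```

A name that passes this test is only a candidate: the bot still has to look up <name>.<extension> and run the matching steps before anything is marked.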

We should probably split the search into two kinds of runs:

  1. Batch runs to find old duplicates
  2. A daily run to find yesterday's duplicates

Match duplicates

Each pair is then tested to see whether the two images match.

Size

One of the images should be smaller than the other. This is the image that may be marked in the end.

Aspect ratio

The images should have about the same aspect ratio. For example, with a 20% margin: 80% < (height of image A / width of image A) / (height of image B / width of image B) * 100% < 120%
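The aspect ratio test can be written as a small function. This is a sketch under the 20% margin used in the example above; the function name and the margin parameter are assumptions made for this illustration.

```python
def similar_aspect_ratio(width_a, height_a, width_b, height_b, margin=0.20):
    """Return True if the two height/width ratios differ by less than the margin."""
    ratio_a = height_a / width_a
    ratio_b = height_b / width_b
    # Equivalent to 80% < ratio_a / ratio_b * 100% < 120% for margin=0.20.
    return (1 - margin) < ratio_a / ratio_b < (1 + margin)


if __name__ == "__main__":
    print(similar_aspect_ratio(800, 600, 400, 300))  # same shape, scaled down
    print(similar_aspect_ratio(800, 600, 400, 400))  # landscape versus square
```

Because thumbnails have rounded dimensions, the ratios of a real pair are rarely identical, which is why a margin is used instead of an exact comparison.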

Histogram

Histograms are the core of the matching. First the bigger image has to be scaled down to the same size as the other image. It's probably best to make several histograms:

  • Whole images
  • Top left part of the images
  • Top right part of the images
  • Bottom left part of the images
  • Bottom right part of the images
  • Central part of the images

Each pair of corresponding histograms matches to a certain percentage. If this percentage is above a certain threshold, we have a match.
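The comparison described above could look roughly like the sketch below. It makes several assumptions that are not taken from this page: both images are already scaled to the same size and given as rows of grayscale values from 0 to 255, each image is at least 2 by 2 pixels, the match percentage is computed as a histogram intersection, and the threshold of 90 is an arbitrary placeholder. All function names are made up for this illustration.

```python
def histogram(pixels, bins=16):
    """Count grayscale values (0-255) into equally wide bins."""
    counts = [0] * bins
    for row in pixels:
        for value in row:
            counts[value * bins // 256] += 1
    return counts


def crop(pixels, left, top, right, bottom):
    """Return the rectangle [left, right) x [top, bottom) of the image."""
    return [row[left:right] for row in pixels[top:bottom]]


def regions(pixels):
    """Split an image into the six parts listed above."""
    height, width = len(pixels), len(pixels[0])
    half_w, half_h = width // 2, height // 2
    quarter_w, quarter_h = width // 4, height // 4
    return {
        "whole": pixels,
        "top left": crop(pixels, 0, 0, half_w, half_h),
        "top right": crop(pixels, half_w, 0, width, half_h),
        "bottom left": crop(pixels, 0, half_h, half_w, height),
        "bottom right": crop(pixels, half_w, half_h, width, height),
        "central": crop(pixels, quarter_w, quarter_h,
                        width - quarter_w, height - quarter_h),
    }


def intersection(hist_a, hist_b):
    """Histogram intersection as a percentage (an assumed metric)."""
    total = sum(hist_a)
    if total == 0:
        return 0.0
    return 100.0 * sum(min(a, b) for a, b in zip(hist_a, hist_b)) / total


def match_scores(pixels_a, pixels_b):
    """Match percentage per region for two images of equal size."""
    parts_a, parts_b = regions(pixels_a), regions(pixels_b)
    return {name: intersection(histogram(parts_a[name]), histogram(parts_b[name]))
            for name in parts_a}


def is_match(pixels_a, pixels_b, threshold=90.0):
    """True if every region scores at least the (placeholder) threshold."""
    return all(score >= threshold
               for score in match_scores(pixels_a, pixels_b).values())
```

Comparing the parts separately, and not only the whole image, helps reject two different images that happen to share an overall brightness distribution. In practice the scaling and pixel access would be done with an imaging library instead of nested lists.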

Mark duplicates

The lower-quality image of the match should be marked with a template containing:

  • The location of the higher quality image
  • The size of this image and the other image
  • The height of this image and the other image
  • The width of this image and the other image
  • Maybe aspect ratio
  • The results of the histogram calculations
  • The match percentage
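The fields listed above could be assembled into wikitext as in the following sketch. The template name "Scaled-down duplicate" and every parameter name are hypothetical, chosen only for this illustration; this page does not say which template the bot uses.

```python
TEMPLATE_NAME = "Scaled-down duplicate"  # hypothetical template name


def build_marker(original_title, small, large, scores, match_percentage):
    """Build the wikitext that marks the smaller image.

    small and large are dicts with the keys "size", "height" and "width";
    scores maps a histogram name to its match percentage.
    All parameter names below are assumptions.
    """
    params = [
        ("original", original_title),
        ("size", small["size"]),
        ("original_size", large["size"]),
        ("height", small["height"]),
        ("original_height", large["height"]),
        ("width", small["width"]),
        ("original_width", large["width"]),
        ("aspect_ratio", "%.3f" % (small["height"] / small["width"])),
    ]
    params += [("histogram_" + name.replace(" ", "_"), "%.1f" % score)
               for name, score in sorted(scores.items())]
    params.append(("match", "%.1f" % match_percentage))

    lines = ["{{" + TEMPLATE_NAME]
    lines += ["|%s=%s" % pair for pair in params]
    lines.append("}}")
    return "\n".join(lines)
```

Putting all values into the template lets a human reviewer see why the bot flagged the image without rerunning the comparison.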

Implementation

The first implementation is available in the pywikipedia package and is called match_images.py (source).