User:Multichill/Scaled-down duplicates
People sometimes make the mistake of transferring a thumbnail to Commons instead of the original image from a Wikipedia. This page describes a bot to spot and mark these kinds of mistakes.
Process
Find pairs to work on
First we need pairs of images to work on. These pairs can be found in several ways:
- On Commons we have an image and at some Wikipedia we have an image with the same name, but a different hash
- On Commons we have an image with a name in the form <number>px-<name>.<extension> where an image <name>.<extension> exists at some Wikipedia or Commons
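The second case can be detected from the file name alone. A minimal sketch, assuming the thumbnail naming convention described above; the function name is hypothetical:

```python
import re

# Matches thumbnail-style names such as "120px-Example.jpg":
# a pixel width, "px-", then the original name with its extension.
THUMB_NAME = re.compile(r'^(\d+)px-(.+\.[A-Za-z]+)$')

def original_name(filename):
    """Return the presumed original file name for a thumbnail-style
    name, or None if the name does not look like a thumbnail."""
    match = THUMB_NAME.match(filename)
    if match:
        return match.group(2)
    return None
```

A hit only means the name looks like a thumbnail; the actual comparison still has to be done on the image data.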
We should probably divide the work:
- Batch runs to find old duplicates
- Daily run to find yesterday's duplicates
Match duplicates
We work on the pairs found above and try to match them.
Size
One of the images should be smaller in size. This is the image that could be marked in the end.
Aspect ratio
The images should have about the same aspect ratio. For example, with a 20% margin: 0.8 < (height of image A / width of image A) / (height of image B / width of image B) < 1.2
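The check above can be sketched in a few lines. This is an assumption about how the bot might implement it; the function name and the `(width, height)` tuple convention are mine:

```python
def similar_aspect_ratio(size_a, size_b, margin=0.2):
    """Check whether two (width, height) pairs have roughly the same
    aspect ratio, within the given margin (20% by default)."""
    width_a, height_a = size_a
    width_b, height_b = size_b
    ratio = (height_a / width_a) / (height_b / width_b)
    return (1 - margin) < ratio < (1 + margin)
```

Pairs that fail this cheap check can be discarded before the more expensive histogram comparison.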
Histogram
Histograms are the core of the matching. First the larger image has to be scaled down to the same size as the smaller one. It's probably best to make a couple of histograms:
- Whole images
- Top left part of the images
- Top right part of the images
- Bottom left part of the images
- Bottom right part of the images
- Central part of the images
Each pair of histograms will match to a certain percentage. If this percentage is above a certain threshold, we have a match.
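A minimal sketch of the two building blocks: crop boxes for the six regions (in the `(left, upper, right, lower)` style of PIL's `Image.crop`), and a histogram overlap score. The exact region boundaries and the overlap formula are assumptions; in practice an image library such as PIL would do the scaling, cropping, and histogram extraction:

```python
def region_boxes(width, height):
    """Crop boxes for the whole image, the four quadrants and the
    centre. The centre box covers the middle 50% in each dimension."""
    return {
        'whole': (0, 0, width, height),
        'top_left': (0, 0, width // 2, height // 2),
        'top_right': (width // 2, 0, width, height // 2),
        'bottom_left': (0, height // 2, width // 2, height),
        'bottom_right': (width // 2, height // 2, width, height),
        'centre': (width // 4, height // 4, 3 * width // 4, 3 * height // 4),
    }

def match_percentage(hist_a, hist_b):
    """Overlap between two equal-length histograms as a percentage:
    the counts shared per bin, relative to the total count of the
    first histogram."""
    shared = sum(min(a, b) for a, b in zip(hist_a, hist_b))
    return 100.0 * shared / sum(hist_a)
```

Since both images are the same size after scaling, their histograms have the same total count and the overlap score is symmetric in practice.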
Mark duplicates
The lower-quality image of the matched pair should be marked with a template containing:
- The location of the higher quality image
- The size of this image and the other image
- The height of this image and the other image
- The width of this image and the other image
- Maybe the aspect ratios
- The results of the histogram calculations
- The match percentage
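The marking step could build wikitext along these lines. The template name `{{Scaled-down duplicate}}` and its parameter names are hypothetical, not the actual template used on Commons:

```python
def mark_template(original_name, this_size, original_size, match):
    """Build the wikitext to place on the lower-quality image,
    pointing at the higher-quality original."""
    this_width, this_height = this_size
    orig_width, orig_height = original_size
    return ('{{Scaled-down duplicate'
            f'|original={original_name}'
            f'|width={this_width}|height={this_height}'
            f'|original_width={orig_width}|original_height={orig_height}'
            f'|match={match:.1f}'
            '}}')
```

Recording the measurements in the template lets a human reviewer judge the match without rerunning the comparison.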
Implementation
The first implementation is available in the pywikipedia package and is called match_images.py (source).