Notice: If you want to see Python source code that supports some of my projects, go to Github and help yourself. The code is not written with reuse in mind... -- (talk) 15:57, 15 May 2018 (UTC)

If you are concerned that a category gets flooded with automated uploads, check that a template like {{Disambig}}, {{Photographs}}, {{Categorise}}, {{CatDiffuse}} or {{CatCat}} has been applied before complaining. In the case of my batch upload projects, any category marked this way will not be added to new photographs. -- (talk) 16:32, 20 September 2018 (UTC)
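As a rough illustration of the check described above, here is a minimal sketch. It assumes the wikitext of each category page is already available as a string; the function names and the exact list of templates are hypothetical and are not taken from the actual batch upload code.

```python
import re

# Assumed exclusion list: only the templates named in the note above.
EXCLUDING_TEMPLATES = {"disambig", "photographs", "categorise", "catdiffuse", "catcat"}

# Captures the template name at the start of {{Name}} or {{Name|...}}.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")


def templates_in(wikitext):
    """Return the lower-cased names of templates used in the wikitext."""
    return {name.strip().lower() for name in TEMPLATE_NAME.findall(wikitext)}


def category_is_excluded(wikitext):
    """True if the category page carries one of the excluding templates."""
    return bool(templates_in(wikitext) & EXCLUDING_TEMPLATES)


def filter_categories(candidates, wikitext_by_category):
    """Keep only the candidate categories that are not marked as excluded."""
    return [
        name
        for name in candidates
        if not category_is_excluded(wikitext_by_category.get(name, ""))
    ]
```

With this sketch, a category whose page contains {{CatDiffuse}} would be dropped from the list of categories applied to a new upload, while an unmarked category would be kept.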




@: While searching for a photo of Frances Carpenter here on Commons, I came across the images from Category:Chevy Chase Club and noticed the TIF files. I was wondering whether it makes sense for these files to be added to the same categories as, for example, the jpg files. Thank you for your time. Also, this being the last day of 2018, allow me to present my very best wishes for 2019! Lotje (talk) 13:32, 31 December 2018 (UTC)

This search (for example) lists pictures in the collection of State Library of Victoria that are out of copyright. Each is available (via the "Available online" link) as a jpg, and as a high-res tif. The resultant "Download" pages list them as "Out of Copyright". Could you automate downloading all their OoC artworks, please? They have an API. Category:Paintings in the State Library of Victoria has very few members. Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 17:04, 16 January 2019 (UTC)

Hi Fæ! Thanks a lot for User:Fæ/Project list/ESA! I'm categorizing many pictures from Category:ESA images (review needed) and found some duplicates of pictures previously imported in Category:ESA_files_uploaded_by_Revent. I didn't check all of them but it's likely all of these 254 photos have been duplicated. What should we do? Keep the oldest ones (those imported by Revent) or yours? vip (talk) 15:39, 25 December 2018 (UTC)

@Don-vip: The decision of which to keep should be based on which image is better quality (presumably these will all be digitally identical), and which has the better description. There's no issue with deleting my uploads where they are duplicates. I am still puzzled as to exactly how these are created, as I do automatic duplicate checks before upload. -- (talk) 09:19, 27 February 2019 (UTC)

@Don-vip: I have started an initial test generating a local (i.e. off-wiki) JSON database of image hashes for all ESA uploads. There are 2,689 subcategories, and I do not yet know how many unique files that includes, so it will take many hours/days to generate the database, but once generated, testing new files will be quick. Depending on eventual size, I may have to rewrite my code if it becomes impossible to hold the results in an in-memory array. These hashes can then be referenced to see if a version of the same photograph already exists on Commons before upload. A simple-minded check, so that a version with the same extension that is not a higher resolution will never be uploaded, makes sense for the vast majority of uploads we are interested in hosting from the ESA. This would allow for a TIFF to be uploaded which is identical to an existing jpeg, but not another jpeg.
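The simple-minded check can be sketched roughly as follows. This is an illustration only, not the actual upload code: the `FileRecord` type, the `should_upload` function, and the comparison of resolution by pixel count are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical record for one file in the local hash database."""
    image_hash: str  # hex image hash, e.g. "2001191209594c68"
    extension: str   # file extension, with or without a leading dot
    width: int
    height: int


def normalise_extension(extension):
    """Treat jpeg/jpg and tiff/tif as the same file type (an assumption)."""
    ext = extension.lower().lstrip(".")
    return {"jpeg": "jpg", "tiff": "tif"}.get(ext, ext)


def should_upload(candidate, existing):
    """Refuse a candidate only if a file with the same hash and the same
    extension already exists at an equal or higher resolution."""
    cand_ext = normalise_extension(candidate.extension)
    cand_pixels = candidate.width * candidate.height
    for record in existing:
        if record.image_hash != candidate.image_hash:
            continue  # a different image altogether
        if normalise_extension(record.extension) != cand_ext:
            continue  # another file type is allowed, e.g. TIFF next to jpeg
        if cand_pixels <= record.width * record.height:
            return False  # same type and not a higher resolution
    return True
```

Under these assumptions, a TIFF matching an existing jpeg would pass, a second jpeg of the same size would be refused, and a larger jpeg would still be uploaded.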

Here is a typical example of what the image hashes can discover and could prevent. These two photographs were not previously linked or marked as duplicates. They are identical with a hex image hash of '2001191209594c68', the difference probably coming from the fact that one has EXIF data and the other does not. There may be other 'hidden' differences that appear to cause one thumbnail to look blurred in the gallery preview (at least in Chrome):

Here's a counterintuitive gallery created from 4 existing uploads which match the hex image hash "4070f8ecece9f162" and are 'virtually' identical because they were taken by a satellite only seconds apart. In the envisioned new upload process only the first would get uploaded:

I was tempted by the idea of adding the image hashes to the Commons image pages, either as an infobox parameter or as hidden metadata, but I am not sure that anyone else would ever use them. Even my specific way of generating the thumbnail to be hashed, and then which type of image hash to apply, is not an agreed 'standard'.

Note that the SHA1 hash already available through the Commons API shows which files are 100% digitally identical. The image hash is effectively 'looking' at the image and shows up 'virtually' identical images, even if they have different resolutions or are hosted in different formats like jpeg/TIFF/png. -- (talk) 09:31, 28 March 2019 (UTC)
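To make the contrast concrete, here is a minimal sketch of a difference hash written in plain Python. It assumes the image has already been reduced to a greyscale thumbnail of 8 rows by 9 columns, and the bit convention (a bit is set when a pixel is brighter than its left-hand neighbour) is an assumption for the example; it is not necessarily the thumbnail method or hash variant used for the ESA database.

```python
import hashlib


def dhash_hex(pixels):
    """Difference hash of a greyscale thumbnail given as 8 rows of 9 values.

    Each of the 64 bits records whether a pixel is brighter than the pixel
    to its left, so the result is a 16-character hex string.
    """
    if len(pixels) != 8 or any(len(row) != 9 for row in pixels):
        raise ValueError("expected 8 rows of 9 greyscale values")
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if right > left else 0)
    return f"{bits:016x}"


def sha1_hex(pixels):
    """SHA1 of the raw pixel bytes, standing in for a file-level SHA1."""
    return hashlib.sha1(bytes(v for row in pixels for v in row)).hexdigest()


# A made-up thumbnail and a slightly brighter copy of it.
thumb = [[(x * 37 + y * 11) % 97 for x in range(9)] for y in range(8)]
brighter = [[value + 3 for value in row] for row in thumb]

same_image_hash = dhash_hex(thumb) == dhash_hex(brighter)
same_sha1 = sha1_hex(thumb) == sha1_hex(brighter)
```

In this sketch `same_image_hash` is true while `same_sha1` is false: the brighter copy keeps the same left-to-right brightness pattern, so the image hash is unchanged, but its bytes differ, so the SHA1 does not match.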

By the way, these hashes are being generated on my 8-year-old second-hand laptop, which is now my main machine. This type of experiment is at the limit of what I can do at home, and I have to push it down to the lowest processing priority to avoid overheating. I could do some of this on labs, but it's not such a good environment for playing around with early testing. If it was less of a hassle, I could shape a project like this into a grant request, and maybe upgrade some of my (literally) dying kit. But, it's a hassle, and the sun is shining today. -- (talk) 11:10, 28 March 2019 (UTC)

@: Great experiment! It would be useful for all imports, not only ESA. Has the Foundation never considered adding it as an official Commons feature? vip (talk) 22:33, 29 March 2019 (UTC)

You can find more background and experiments at User:Fæ/Imagehash.
There was a discussion on the VP at the time, and there was a phabricator ticket (linked on the right), but the idea of the WMF picking this up was effectively abandoned.
No doubt I could get some funding and run after this as a potential Commons improvement, especially for tracking copyright violations, but my feelings are about the same as they were in 2017: I'm not sure I want to vanish down this rabbit hole. With a strategic eye, it may be that recent changes in copyright law, putting more onus on 'hosts' to track copyright violations, will force WMF development to spend some time on media file matching solutions at the time of upload, beyond SHA1 matching. Leaving this one on pause for now, apart from the odd interesting experiment like this one... -- (talk) 11:51, 30 March 2019 (UTC)
Among the new uploads, an animated GIF

After playing around a bit more with this today, I will try a new upload run. Not only are imagehash matches checked, but near matches are checked within a difference of 2 (which means very close matches). An example of a match stopping an upload is quite interesting, as it highlights duplicates at the ESA source, even though they have different ID numbers and descriptions:
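A near-match test of this kind can be sketched as below. The sketch assumes that a "difference of 2" means at most 2 differing bits between two 64-bit hashes (the Hamming distance); the function names are made up for the example.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hex image hashes of equal length."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def find_near_matches(candidate, known_hashes, max_distance=2):
    """Return the known hashes within max_distance bits of the candidate."""
    return [
        known
        for known in known_hashes
        if hamming_distance(candidate, known) <= max_distance
    ]
```

An exact match is simply the case where the distance is zero, so one pass over the database covers both exact and near matches. A linear scan like this is only a sketch; a larger database might need an index to stay quick.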

On the other hand, here is an example of an exact difference hash match, which in truth is not an exact match as the same data is used to generate a different colour image. Potentially I could use a different imagehash to account for colour, but I have no plan to do this at the moment:

Large TIFF duplicates will not be tested, mainly because of the limits of running this on my laptop*. However, these seem less likely than png, gif and jpeg duplicates.
* The source image is loaded into RAM and resized to 80px wide using 4 different methods, and then 4 hashes are generated, often but not always giving the same hash. This added complexity is an attempt to match how Commons does its thumbnail generation, without exactly understanding the detailed methods. Clearly, as TIFFs might be >200MB, it is unrealistic to handle those the same way.

There is potential to retrospectively go back through the past ~36,000 (unique) uploads from ESA and mark duplicates and matches with different filetypes, but that will be a separate exercise. -- (talk) 12:11, 3 April 2019 (UTC)

Revisiting the image hash database, I have started to tease out duplicates in more detail and examine 'outliers'. One amusing discovery was to find 30 ESA images with image hashes of zero, definitely a "what?" moment:

Then, perhaps more funky, is the discovery of 29 ESA images with a precise image (difference) hash of f0f0f0f0f0f0f0f0, presumably because the data for most of these was computer enhanced to have all pixels perfectly "balanced".

Eventually, with a bit more debugging on real/not real duplicates, I'll probably be creating a "duplicates" category and adding cross-referencing galleries for (literal) non-duplicates with matching & near match hashes in the other versions parameter.

Before someone else picks this up, some of the images are not ESA. They get included because they are inherited in the automatic searching through all subcategories of the top ESA category. Unfortunately that's down to the not-necessarily logical way that Commons lets human volunteers make subjective choices about category hierarchies. -- (talk) 12:30, 6 April 2019 (UTC)

Hi Fæ, it looks like there has been no update since 2019-03-18. Is there some problem? I appreciate your list very much. Could you update it again? Thanks --DenghiùComm (talk) 06:29, 2 May 2019 (UTC)

I did not realize it was failing. Travelling, so looking at this may have to wait until next week. It may be that the forced move of bot tools from one server to another has broken it, or the ghastly arbitrary limit on length of SQL queries might have been tampered with, again. -- (talk) 06:50, 2 May 2019 (UTC)
Ok. Thank you for your answer. --DenghiùComm (talk) 22:05, 8 May 2019 (UTC)


Field ipb_reason has been replaced by ipb_reason_id in table ipblocks by the WMF; this seems to have broken the report. Now you have to get ipb_reason_id, search table comment for the row where comment_id = ipb_reason_id, and select field comment_text, which should be unique. -- (talk) 13:47, 13 May 2019 (UTC)
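The lookup described above amounts to a join between the two tables. The sketch below shows it against a toy in-memory SQLite database rather than the real replica: only ipb_reason_id, comment_id and comment_text come from the note above, while the other columns and the sample rows are assumptions added so that the example runs.

```python
import sqlite3

# Toy stand-in for the two tables; the real schema has many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE comment (
        comment_id   INTEGER PRIMARY KEY,
        comment_text TEXT
    );
    CREATE TABLE ipblocks (
        ipb_id        INTEGER PRIMARY KEY,
        ipb_address   TEXT,     -- assumed column, used only for display
        ipb_reason_id INTEGER   -- replaces the old ipb_reason text field
    );
    INSERT INTO comment VALUES (1, 'Example block reason');
    INSERT INTO ipblocks VALUES (10, 'ExampleUser', 1);
    """
)

# Fetch the block reason by joining on comment_id = ipb_reason_id.
QUERY = """
    SELECT ipb.ipb_address, c.comment_text
    FROM ipblocks AS ipb
    JOIN comment AS c ON c.comment_id = ipb.ipb_reason_id
"""

rows = conn.execute(QUERY).fetchall()
```

A report that used to select ipb_reason directly would need the same kind of join to get the reason text back.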

Dear Fæ: I just noticed some of the 50,000 photos in the Archive of the German Colonial Association, which can be searched via a search engine. I guess that you are occasionally searching for ideas for new import projects, and if so, you might find a lot of pre-1919 photos there. Although the template of the search engine can be switched to English, I had to enter the German word "Kamerun" in the top field instead of "cameroon" to find what I was searching for. --NearEMPTiness (talk) 06:46, 9 June 2019 (UTC)

I've taken a brief look today. It certainly can be mined and the images uploaded, but I have doubts about how to make a judgement on copyright. Raising for feedback on the VP. -- (talk) 12:33, 18 June 2019 (UTC)

I have noticed some double uploads:
File:Operation Overlord (the Normandy Landings), 6 June 1944 B5018.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5018.jpg
File:The British Army in Normandy 1944 B5027.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5027.jpg
File:The British Army in Normandy 1944 B5028.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5028.jpg
File:D-day - British Forces during the Invasion of Normandy, 6th June 1944 B5029.jpg File:The British Army in Normandy, 1944 B5029.jpg
File:The British Army in Normandy 1944 B5037.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5037.jpg
File:The British Army in Normandy 1944 B5038.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5038.jpg
File:The British Army in Normandy 1944 B5039.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5039.jpg
File:Operation Overlord (the Normandy Landings), 6 June 1944 B5042.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5042.jpg
File:The British Army in Normandy 1944 B5043.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5043.jpg
File:Crashed Horsa glider near Ranville.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5050.jpg
File:D-day - British Forces during the Invasion of Normandy, 6th June 1944 B5102.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5102.jpg
File:Landing_on_Queen_Red_Beach,_Sword_Area.jpg File:British_commandos_of_1st_Special_Service_Brigade,_led_by_Lord_Lovat,_landing_on_%27Queen_Red%27_sector_of_Sword_Beach,_at_La_Breche,_on_the_morning_of_6_June_1944._B5103.jpg
File:Operation Overlord (the Normandy Landings)- D-day 6 June 1944 B5111.jpg File:D-day - British Forces during the Invasion of Normandy 6 June 1944 B5111.jpg
File:The British Army in the Normandy Campaign 1944 B5233.jpg File:Operation Overlord (the Normandy Landings)- D-day 6 June 1944 B5233.jpg
File:The British Army in the Normandy Campaign 1944 B5266.jpg File:Operation Overlord (the Normandy Landings)- D-day 6 June 1944 B5266.jpg
File:The British Army in the Normandy Campaign 1944 B5289.jpg File:The British Army in Normandy 1944 B5289.jpg

