Commons:Batch uploading/Commanster

James K. Lindsey maintains a site with pictures of all things living in the Commanster area in the Belgian High Ardennes. All species pictures were uploaded in July 2009 by User:Sarefo.

First try

Actually, the first batch was done much earlier, around 2007. At the time, Commonist was used, with lots of inefficient cut-and-paste into the descriptions. Although I tried to maintain a consistent appearance, the uploads were done in several chunks of 100-1000 pictures, and the result varies. For example, species names are wikified in only some of those batches. Also, some files were given names that differ from the original filenames, and not in a consistent way.

Update 2009

In 2009, the original Commanster site and the wiki upload had diverged so much that about 3,000 new pictures needed to be uploaded. Steps were taken to be more consistent this time.

  • All original filenames from the Commanster site were kept, with the suffix ".-.lindsey" added.

File name logic

  • Coding of numbers in filenames:
      • no number = male or sex unknown
      • 2 = female
      • 3 = couple
      • 4/5 = juvenile
      • 9 = wing
      • other numbers are sometimes used for special cases.
  • In rather rare cases, a single letter is added to the original filename. For example, when there are two pictures of females of a species, the second picture is called Genus.species2b.jpg.
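The number coding above can be decoded mechanically. The following is only a minimal sketch: the regular expression assumes original filenames of the form Genus.species, then an optional number, then an optional single letter, then ".jpg", and the names CODES, NAME_RE and parse_name are made up for this illustration. Numbers not listed above are simply reported as "other".

```python
import re

# Meaning of the number in a filename, as listed above.
# Any number not listed here is reported as "other".
CODES = {
    "": "male or sex unknown",
    "2": "female",
    "3": "couple",
    "4": "juvenile",
    "5": "juvenile",
    "9": "wing",
}

# Assumed pattern: Genus.species[number[letter]].jpg
NAME_RE = re.compile(r"^([A-Z][a-z]+)\.([a-z-]+)(?:(\d+)([a-z]?))?\.jpg$")


def parse_name(filename):
    """Split an original filename into genus, species, meaning and variant letter.

    Returns None if the filename does not follow the assumed pattern.
    """
    m = NAME_RE.match(filename)
    if m is None:
        return None
    genus, species, number, letter = m.groups()
    return {
        "genus": genus,
        "species": species,
        "meaning": CODES.get(number or "", "other"),
        "variant": letter or "",
    }


if __name__ == "__main__":
    for name in ["Genus.species.jpg", "Genus.species2b.jpg", "Genus.species9.jpg"]:
        print(name, parse_name(name))
```

Filenames that do not fit the assumed pattern return None, so they can be collected and checked by hand instead of being silently misread.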

Preparing the data

All work was done on Debian testing.

  • I first created a complete mirror of the Commanster site with wget. Then I moved all species pictures into one directory.
  • To be able to distinguish files that had already been uploaded in 2007 from new ones, I computed an md5 hash for each file. All files with a hash occurring in the already uploaded batch were deleted. (Some awk/comm magic did this, iirc, including an inelegant hack that temporarily included the md5 in the filename.)
  • Then I prepared the data for mass upload using Erik Moeller's upload.pl, which I still had lying around from July 2008. It requires a files.txt inside the picture directory.
  • I put data like the OTRS ticket, the Lindsey category and the copyright into the "@" field, which goes into all descriptions. Later, Duduman found that it is better to create a special template for this, so that the data only needs to be changed in one place, if necessary.
  • As I use Vim all the time, I was too lazy to create a nice Python script, and just created the individual descriptions using Vim's macro functionality for some mass automated cut and paste.
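The md5 comparison described above was originally done with awk and comm. As an illustration only, here is roughly how the same idea could look in Python. The function names and the two-directory layout (one directory with the 2007 files, one with the fresh mirror) are assumptions made for this sketch, and it only lists the new files rather than deleting anything.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Return the md5 hex digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def new_files(old_dir, new_dir):
    """List files in new_dir whose content does not occur in old_dir.

    Files are compared by content hash, so a renamed copy of an
    already uploaded picture is still recognised as old.
    """
    old_hashes = {md5_of(p) for p in Path(old_dir).iterdir() if p.is_file()}
    return sorted(
        p for p in Path(new_dir).iterdir()
        if p.is_file() and md5_of(p) not in old_hashes
    )
```

Comparing by hash rather than by name matters here because the 2007 filenames were not always kept identical to the originals.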

Lessons learned

  • Think first, then think again, then upload ;)
  • Better to create one single script for all steps than to hack your way towards the goal. This way, if there's an error, you don't need to walk all the way again. This would also make future updates easier, as update functionality could be added to the script.

Remaining problems

  • There is currently no good system in place to ensure automated, consistent transfer of information between Lindsey's site and Commons. Jim himself only changes filenames when the identification changes. But if he has a better picture for the female of "Genus species", he will delete the old picture and give the new one the old name. Thus, filenames are still not 100% reliable. Also, his file system puts each species in one of four season folders (spring, summer, ...), according to first sightings. So if he suddenly encounters a species in spring that he previously only knew from summer, the species page moves to a different folder; this will break the Source field in Commons.
  • The old batch of about 3,000 files should be worked into the same look as the new one. Probably about 90% of this could be done by a bot, but I have no idea how much has been changed by human editors in the last two years.
  • Many pictures are not in galleries yet.
  • Source links to the species pages are being worked on, but only for the new batch of 3,000. The data for this, to be found in User:Sarefo/data, was produced by searching the Commanster mirror with "find" for .html files with the same genus.species name as the picture.
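The lookup of species pages was done with "find" on the mirror. As an illustration only, a rough Python equivalent might look like the sketch below. It assumes that species pages are named Genus.species.html somewhere below the mirror root and that pictures start with the same Genus.species prefix; the function names are hypothetical.

```python
import re
from pathlib import Path

# Assumed prefix shared by a picture and its species page: Genus.species
PREFIX_RE = re.compile(r"^([A-Z][a-z]+\.[a-z-]+)")


def index_species_pages(mirror_dir):
    """Map 'Genus.species' to every matching .html file below mirror_dir."""
    pages = {}
    for page in Path(mirror_dir).rglob("*.html"):
        pages.setdefault(page.stem, []).append(page)
    return pages


def find_source_pages(picture_name, pages):
    """Return the species pages whose name matches the picture's prefix."""
    m = PREFIX_RE.match(picture_name)
    if m is None:
        return []
    return pages.get(m.group(1), [])
```

A picture with no match, or with more than one match, would need a manual check. Because species pages can move between season folders, such an index would also have to be rebuilt from a fresh mirror before each update.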