Commons:Batch uploading/Starr images

Forest & Kim Starr have a site with about 60,000 images of plants. All these images were uploaded in March 2009 by User:Multichill.

How did we get the images?

All the images were already released under a CC-BY license and available for download from their site. Some of the images had already been uploaded to Commons, but transferring single images was a lot of work. I noticed the site showing a database error, so I figured they kept all the metadata in a database. I contacted Forest & Kim Starr to ask if I could use this database, and got an instant response with a link to it.

Processing the data

I don't have MS Access, so I converted the database to CSV and imported it into OpenOffice. I generated the descriptions from this table: the relevant data was in separate cells, and these cells plus some fixed text gave me everything I needed to upload the images. Each image had either the species or the genus in its metadata, so adding categories right away was easy. I used User:Multichill/Starr to generate the actual pages.
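As a rough illustration of the "cells + some text" step, here is a minimal Python sketch that builds one template call per CSV row. It is not the spreadsheet method actually used. The template parameter names are taken from the example upload line on this page; the CSV column headers (`uri`, `species`, and so on) and the sample row layout are assumptions.

```python
import csv
import io

# Hypothetical CSV export; the real column headers are not documented here.
SAMPLE_CSV = """uri,species,description,location,island,date
050831-7718,Asplenium trichomanes subsp. densum,sori,Polipoli,Maui,31-08-2005
"""


def build_description(row):
    """Combine the cells of one row with fixed text into a template call."""
    return (
        "{{subst:User:Multichill/Starr|subst=subst:"
        f"|URI={row['uri']}"
        f"|Species={row['species']}"
        f"|Description={row['description']}"
        f"|Location={row['location']}"
        f"|Island={row['island']}"
        f"|Date={row['date']}"
        "}}"
    )


def descriptions_from_csv(text):
    """Return one description string per data row in the CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    return [build_description(row) for row in reader]


if __name__ == "__main__":
    for description in descriptions_from_csv(SAMPLE_CSV):
        print(description)
```

For the full database, the same function could be applied to a file opened with `open(path, newline="")` instead of the in-memory sample.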

Uploading

In the previous step I created a 60,000-line script with lines like

upload.py -keep -noverify http://www.hear.org/starr/images/full/starr-050831-7718.jpg -filename:"Starr 050831-7718 Asplenium trichomanes subsp. densum.jpg" "{{subst:User:Multichill/Starr|subst=subst:|URI=050831-7718|Species=Asplenium trichomanes subsp. densum|Description=sori|Location=Polipoli|Island=Maui|Date=31-08-2005}}"
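A line of this shape could be assembled along the following lines. This is only a sketch: the URL and filename patterns are inferred from the single example above, the function name is hypothetical, and `shlex.quote` produces single-quoted arguments rather than the double quotes shown, which a POSIX shell treats the same way for these values.

```python
import shlex

# Base URL inferred from the one example line; assumed to hold for all images.
BASE_URL = "http://www.hear.org/starr/images/full/"


def build_upload_command(uri, species, description_text):
    """Return one shell-safe upload.py command line for a single image."""
    source_url = f"{BASE_URL}starr-{uri}.jpg"
    target_name = f"Starr {uri} {species}.jpg"
    args = [
        "upload.py",
        "-keep",
        "-noverify",
        source_url,
        f"-filename:{target_name}",
        description_text,
    ]
    # Quote every argument so spaces, braces and pipes survive the shell.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(
        build_upload_command(
            "050831-7718",
            "Asplenium trichomanes subsp. densum",
            "{{subst:User:Multichill/Starr|subst=subst:|URI=050831-7718}}",
        )
    )
```

Quoting matters here because the description contains `|` and `{`, which an unquoted shell line would interpret as a pipe and brace expansion.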

I split this up into several batches and uploaded most batches from the toolserver. It would probably be easier to use import, but upload.py was very stable.
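Splitting a long list of command lines into batches can be sketched as below. The batch size is an assumption; the page does not say how many lines each batch held or how the split was actually done.

```python
def split_into_batches(lines, batch_size):
    """Split a list of script lines into consecutive batches.

    The last batch may be shorter than batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]


if __name__ == "__main__":
    # Hypothetical example: 60,000 placeholder lines in batches of 5,000.
    script_lines = [f"line {n}" for n in range(60000)]
    batches = split_into_batches(script_lines, 5000)
    print(len(batches), "batches")
```

Each batch could then be written to its own script file, so a failed run only needs that one batch restarted.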

After care

A lot of the categories didn't exist yet and had to be created. Some category names were incorrect and needed to be moved, but most images were in a proper category right away. I knowingly uploaded a number of dupes of images already on Commons, because I knew my uploads were properly sourced. I marked the other copies as dupes and some users sorted this out.

Lessons learned

  • This was probably by far the easiest batch upload I'll ever do. If the metadata is properly structured, as it was here, a batch upload is easy. We should always try to get good metadata from the source.
  • Using import for this number of images is much easier, but you need to find a shell user to do it for you.


Comments

This is a very interesting collection. For some reason, I missed it earlier or didn't really look into it.

  1. As the focus of the categorization seems to be plant-based, many images could benefit from additional categorization (e.g. animals, landscapes). Images showing "habitat" don't necessarily show the plant being categorized (or I missed it). As I'm not at ease with plant categorization, I didn't remove plant categories on these (an exception might be the heliport where I had most files renamed and re-wrote descriptions). We might want to invite others to categorize more of these images.
  2. There are sets of images of Florida (~2000), Nevada (~600), Midway (~5500). These and others might benefit from additional location based categories. I started on Midway. The "Flora of" categories on the plant categories partially take care of this though.
  3. There are a few personal pictures of F&K Starr. Personally I wouldn't have imported those. Many are taken at interesting locations though.
  4. Minor thing: the date in the information template isn't in standard format and, e.g., Emijrpbot seems to skip those that aren't unambiguous.
  5. {{Information2}} has a "location" field. I was wondering if this one should have been used, but it's probably better that we avoided that template; the standardized "Location: .." in the description is fine too.
  6. BTW nice consistent prefixes in the filenames, e.g. "Starr 041014-0001". I tried to keep this when requesting renames.

Thanks for the trip to Hawaii! Excellent work. -- User:Docu at 09:07, 23 October 2009 (UTC)

Made a numbered list for easy response:
  1. Categorization can always be improved. The species categories are just a start (a good start!)
  2. If you think this is useful you should add it
  3. I didn't manually check all images. It's about 60,000. The maybe 100 personal pictures won't hurt Commons.
  4. Do you have an example? I clicked some random images and they looked fine. Multichill (talk) 10:17, 23 October 2009 (UTC)
  5. {{Information2}} is a bad template high on my nuke list. Just add {{Location}}
  6. I try to use the identifiers in all my batch uploads to prevent naming conflicts
Multichill (talk) 10:17, 23 October 2009 (UTC)
Good idea to number it.
1. The species ones are good indeed. I think there is also a benefit that different persons categorize various aspects of one image.
3. It's less from the perspective of Commons than the one of the people in the image that I had thought of this. As a courtesy, maybe we should remove some.
4. I think it's any date before the 13th of a month, e.g. Starr 080604-6085
5. Yeah, {{Location}} would be good :) For now, I'm still busy with {{Object location}} for the categories though.
-- User:Docu at 11:32, 23 October 2009 (UTC)
3. Nah, don't bother. Got an email from them and they seem to like it that all their images are at Commons
4. Oh, that's an easy fix. Running a bot now.
Multichill (talk) 12:01, 23 October 2009 (UTC)
They have about 120k images on their Flickr page. Another batch upload would really be worth it. --McZusatz (talk) 12:38, 13 November 2013 (UTC)
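On point 4 above: a day-first date such as 04-06-2008 can be read as either 4 June or 6 April, which fits the observation that only dates before the 13th of a month cause trouble. The bot run mentioned in the thread is not described, so the following is only one plausible normalization, assuming the uploaded dates are all DD-MM-YYYY as in the example upload line. The sample date for Starr 080604-6085 is itself an assumption derived from the filename pattern.

```python
from datetime import datetime

DAY_FIRST_FORMAT = "%d-%m-%Y"  # assumed input format, e.g. 31-08-2005


def to_iso_date(day_first):
    """Convert a DD-MM-YYYY string to the unambiguous YYYY-MM-DD form."""
    return datetime.strptime(day_first, DAY_FIRST_FORMAT).strftime("%Y-%m-%d")


def is_ambiguous(day_first):
    """Return True if day and month could be swapped and still be valid."""
    return datetime.strptime(day_first, DAY_FIRST_FORMAT).day <= 12


if __name__ == "__main__":
    for sample in ("31-08-2005", "04-06-2008"):
        print(sample, to_iso_date(sample), is_ambiguous(sample))
```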