Commons:Batch uploading/Starr images
Forest & Kim Starr have a site with about 60,000 images of plants. All these images were uploaded in March 2009 by User:Multichill.
How did we get the images?
All the images were already released under a cc-by license and available for download at their site. Some of the images had already been uploaded to Commons, but transferring single images was a lot of work. I noticed the site showing a database error, so I figured they kept all the metadata in a database. I contacted Forest & Kim Starr to ask if I could use this database and got an instant response with a link to it.
Processing the data
I don't have MS Access, so I converted the database to CSV and imported it into Open Office. I generated the descriptions from this table: the relevant data was in different cells, and these cells plus some text gave me everything I needed to upload the images. Each image had either the species or the genus in its metadata, so adding categories right away was easy. I used User:Multichill/Starr to generate the actual pages.
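The cell-plus-text step can also be sketched in Python instead of a spreadsheet. This is only an illustration, not what was actually run: the CSV column names below are hypothetical (the real database export may have used different ones), while the template parameters and the sample values are taken from the example upload line on this page.

```python
import csv
import io

# Hypothetical column names; the real export may have been laid out differently.
SAMPLE_CSV = """uri,species,description,location,island,date
050831-7718,Asplenium trichomanes subsp. densum,sori,Polipoli,Maui,31-08-2005
"""


def build_filename(row):
    """Target filename on Commons: fixed prefix, image identifier, species name."""
    return f"Starr {row['uri']} {row['species']}.jpg"


def build_description(row):
    """Substituted call to User:Multichill/Starr, filled from one table row."""
    return (
        "{{subst:User:Multichill/Starr|subst=subst:"
        f"|URI={row['uri']}|Species={row['species']}"
        f"|Description={row['description']}|Location={row['location']}"
        f"|Island={row['island']}|Date={row['date']}"
        "}}"
    )


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    print(build_filename(row))
    print(build_description(row))
```

Keeping the filename and the description in separate functions mirrors the separate cells in the spreadsheet: each piece can be checked on its own before anything is uploaded.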
Uploading
In the previous step I created a 60,000-line script with lines like:
upload.py -keep -noverify http://www.hear.org/starr/images/full/starr-050831-7718.jpg -filename:"Starr 050831-7718 Asplenium trichomanes subsp. densum.jpg" "{{subst:User:Multichill/Starr|subst=subst:|URI=050831-7718|Species=Asplenium trichomanes subsp. densum|Description=sori|Location=Polipoli|Island=Maui|Date=31-08-2005}}"
I split this up into several batches and uploaded most batches from the toolserver. It would probably be easier to use import, but upload.py was very stable.
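Splitting a long script into batches is simple to sketch in Python. The batch size and the placeholder lines below are assumptions for illustration only; the batch sizes actually used are not recorded here.

```python
def split_into_batches(lines, batch_size):
    """Return consecutive slices of `lines`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]


# Placeholder lines standing in for the real upload.py commands.
script_lines = [f"command for image {n}" for n in range(1, 11)]

batches = split_into_batches(script_lines, batch_size=4)
for number, batch in enumerate(batches, start=1):
    # Each batch could be written to its own script file and run separately.
    print(f"batch {number}: {len(batch)} lines")
```

Because the batches are consecutive slices, a failed batch can be rerun on its own without touching the others.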
After care
A lot of the categories didn't exist yet and had to be created. Some category names were incorrect and needed to be moved, but most images were in a proper category right away. I uploaded a number of dupes. I did this on purpose because I knew my uploads were properly sourced. I marked all the other images as dupes and some users sorted this out.
Lessons learned
- This was probably by far the easiest batch upload I'll ever do. If the metadata is properly structured, as it was here, a batch upload is easy. We should always try to get good metadata from the source.
- Using import for this number of images is much easier, but you need to find a shell user to do it for you.
Comments
This is a very interesting collection. For some reason, I missed it earlier or didn't really look into it.
1. As the focus of the categorization seems to be plant-based, many images could benefit from additional categorization (e.g. animals, landscapes). Images showing "habitat" don't necessarily show the plant being categorized (or I missed it). As I'm not at ease with plant categorization, I didn't remove plant categories on these (an exception might be the heliport where I had most files renamed and re-wrote descriptions). We might want to invite others to categorize more of these images.
2. There are sets of images of Florida (~2000), Nevada (~600), Midway (~5500). These and others might benefit from additional location-based categories. I started on Midway. The "Flora of" categories on the plant categories partially take care of this though.
3. There are a few personal pictures of F&K Starr. Personally I wouldn't have imported those. Many are taken at interesting locations though.
4. Minor thing: the date in the information template isn't in standard format and, e.g., Emijrpbot seems to skip those that aren't unambiguous.
5. {{Information2}} has a "location" field. I was wondering if this one should have been used, but it's probably better that we avoided that template; the standardized "Location: .." in the description is fine too.
6. BTW nice consistent prefixes in the filenames, e.g. "Starr 041014-0001". I tried to keep this when requesting renames.
Thanks for the trip to Hawaii! Excellent work. -- User:Docu at 09:07, 23 October 2009 (UTC)
- Made a numbered list for easy response:
- 1. Categorization can always be improved. The species categories are just a start (a good start!)
- 2. If you think this is useful you should add it
- 3. I didn't manually check all images. It's about 60,000. The maybe 100 personal pictures won't hurt Commons.
- 4. Do you have an example? I clicked some random images and they looked fine. Multichill (talk) 10:17, 23 October 2009 (UTC)
- 5. {{Information2}} is a bad template high on my nuke list. Just add {{Location}}
- 6. I try to use the identifiers in all my batch uploads to prevent naming conflicts
- Multichill (talk) 10:17, 23 October 2009 (UTC)
- Good idea to number it.
- 1. The species ones are good indeed. I think there is also a benefit that different persons categorize various aspects of one image.
- 3. It's less from the perspective of Commons than the one of the people in the image that I had thought of this. As a courtesy, maybe we should remove some.
- 4. I think it's any date before the 13th of a month, e.g. Starr 080604-6085
- 5. Yeah, {{Location}} would be good :) For now, I'm still busy with {{Object location}} for the categories though.
- -- User:Docu at 11:32, 23 October 2009 (UTC)
- 3. Nah, don't bother. Got an email from them and they seem to like it that all their images are at Commons
- 4. Oh, that's an easy fix. Running a bot now.
- Multichill (talk) 12:01, 23 October 2009 (UTC)
- The most recent image that was uploaded seems to be from February 2009 (Special:AllPages/File:Starr_090213-25). Looking at the database, in the meantime, at least another 14k images are available. I think it would be worth doing an additional batch. -- User:Docu at 07:32, 13 November 2009 (UTC)
- They have about 120k images on their flickr page. Another batch upload would really be worth it. --McZusatz (talk) 12:38, 13 November 2013 (UTC)