Commons:Batch uploading/NPS Maps

NPS Maps

  • Source to upload from: http://npmaps.com/
    • Do the media URLs follow a pattern? All files are one giant directory: http://npmaps.com/wp-content/uploads/
    • Does the site have an API? No
    • What else could ease uploading? (is the site valid XHTML, do they use a WCM…?) Park detail pages follow a consistent format.
    • Did you contact the site owner? Yes
  • Describe the works to be uploaded in detail (audio files, images by …):

Public domain maps of U.S. National Parks, published by the National Park Service.

  • Which license tag(s) should be applied?

{{PD-USGov-NPS}}

  • Is there a template that could be used on the file description pages? Do you think a special template should be created?

No

This looks like it will need to be done by screen scraping. There is a page for each park and each page contains thumbnails of the map images where the filename ends in -thumb.jpg; the thumbnail links to the full-size version (GIF/JPG formats; may need to convert the GIFs to PNGs). There is a short blurb for each map image which may also include a link to a PDF version, so we should be able to upload both versions if we want. The files probably should be put into a category "Maps of XXX", to be created if it doesn't already exist.

Things to watch out for:

  • Park name on site may not match our category name
  • There are a handful of maps that are not from NPS
  • Please credit "National Park Service, restoration/cleanup by Matt Holly"

howcheng {chat} 17:53, 3 May 2017 (UTC)
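
A minimal sketch of the thumbnail-to-full-size extraction described in the request above, in Python using only the standard library. It assumes each map appears as a link wrapping an image whose `src` ends in `-thumb.jpg`; the real page markup was not checked here, and the names `ThumbLinkParser` and `full_size_links` are made up for illustration.

```python
from html.parser import HTMLParser


class ThumbLinkParser(HTMLParser):
    """Collect the href of every <a> that wraps an <img> ending in -thumb.jpg."""

    def __init__(self):
        super().__init__()
        self._open_href = None      # href of the <a> we are currently inside
        self.full_size_urls = []    # hrefs collected so far, in page order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._open_href = attrs.get("href")
        elif tag == "img" and self._open_href:
            src = attrs.get("src") or ""
            # Assumption: thumbnails are recognizable by this filename suffix.
            if src.endswith("-thumb.jpg"):
                self.full_size_urls.append(self._open_href)

    def handle_endtag(self, tag):
        if tag == "a":
            self._open_href = None


def full_size_links(html):
    """Return the full-size file URLs that the thumbnails in `html` link to."""
    parser = ThumbLinkParser()
    parser.feed(html)
    return parser.full_size_urls
```

Links that wrap other images (logos, cover pictures) are ignored, which is the main reason to go through the park pages rather than the raw uploads directory.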

Opinions

@Howcheng: I could do that, but I guess several files in the directory (mostly the thumbnails, but also some of the other files, e.g. the map covers, the Amazon thumbnails, or files like joshua-tree-things-to-do.jpg) are not usable and do not need to be copied. I guess requiring a minimum file size of about 20 kB will filter them out.

Most of the PDF files are mere copies of the corresponding JPG files; I'm not sure whether it is helpful to upload them to Commons. Most Wikimedia projects will only use the JPG files, but maybe the PDFs are useful for Wikivoyage. --Reinhard Kraasch (talk) 09:57, 4 May 2017 (UTC)
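
The size filter suggested above could look roughly like this. It is only a sketch: the 20 kB cutoff is the rough figure from the comment, the file sizes in any test data would have to come from somewhere (for instance the directory listing or HTTP headers), and the function names are hypothetical.

```python
MIN_BYTES = 20 * 1000  # "some 20 kByte"; the exact cutoff is a judgment call


def is_probable_map(url, size_bytes, min_bytes=MIN_BYTES):
    """Heuristic: drop thumbnails by name, then anything below the cutoff."""
    if url.lower().endswith("-thumb.jpg"):
        return False
    return size_bytes >= min_bytes


def filter_candidates(files, min_bytes=MIN_BYTES):
    """Keep the URLs from (url, size_bytes) pairs that pass the heuristic."""
    return [url for url, size in files if is_probable_map(url, size, min_bytes)]
```

A size cutoff alone cannot tell a small map from a large non-map image, so the result would still need a manual look.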

That's why I suggest the screen scraping, because then we should be able to filter out the images we don't need, plus we'd get the descriptions. If you just go through the directory, it's just a bunch of random files and then manual work to figure out what each one is. I think it would be easier to manually clean up any unnecessary content than to manually input all that data. In addition, this site is still a work in progress, so we might want to look ahead to re-running the bot at regular intervals to grab anything new. howcheng {chat} 15:58, 4 May 2017 (UTC)
@Howcheng: Well, I think the job is not really "screen scraping", but analyzing the HTML of the pages linked from http://npmaps.com/parks/ and further.
I would describe the job as:
  • get the park names from http://npmaps.com/parks/, then
  • analyze the linked page, e.g. http://npmaps.com/grand-teton/, find the "xx free ... maps" on the page, and then
  • extract the file names, file descriptions, etc. from the HTML of that page and store it somewhere.
  • this list can be used later to identify new files
I guess I will do that and come back with the results we can then talk about. --Reinhard Kraasch (talk) 19:46, 4 May 2017 (UTC)
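
The steps listed above can be sketched as follows, again in standard-library Python. This is an outline under assumptions, not the script that was actually run: `fetch` is a hypothetical callable that returns the HTML for a URL, every link on the index page is treated as a park page (a real run would need to skip navigation links), and files are recognized only by their extension.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FILE_SUFFIXES = (".jpg", ".gif", ".pdf")  # formats mentioned in the request


class LinkParser(HTMLParser):
    """Collect every <a href> on a page as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def crawl(fetch, index_url="http://npmaps.com/parks/"):
    """Return {park_page_url: [file_url, ...]} for every page linked from the index."""
    records = {}
    for park_url in extract_links(fetch(index_url), index_url):
        page_links = extract_links(fetch(park_url), park_url)
        records[park_url] = [
            u for u in page_links if u.lower().endswith(FILE_SUFFIXES)
        ]
    return records


def new_files(previous, current):
    """List file URLs present in `current` but absent from an earlier crawl."""
    seen = {u for urls in previous.values() for u in urls}
    return sorted(
        {u for urls in current.values() for u in urls if u not in seen}
    )
```

Storing the result of `crawl` after each run and passing it to `new_files` on the next run is one way to pick up only the maps added since the last upload.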
OK, great. That was, in fact, what I meant by "screen-scraping", although it turns out "web scraping" is the actual term. howcheng {chat} 23:12, 4 May 2017 (UTC)
@Howcheng: I have since gathered the data and made a table with file URLs, file sizes, proposed descriptions and categories here: User:Reinhard Kraasch/NPS maps. Any comments are welcome. --Reinhard Kraasch (talk) 19:27, 9 May 2017 (UTC)
And here is a sample upload: File:NPS_acadia-map.jpg --RKBot (talk) 20:11, 9 May 2017 (UTC)
Mostly good, with a few issues.
  1. Please credit "U.S. National Park Service, restoration/cleanup by Matt Holly"
  2. If a "Maps of XXX" category doesn't exist, let's create it
  3. There are a few categories that need manual adjustment, which I can do in a bit.
    1. Category:Marsh-Billings-Rockefeller National Historical Park (HABS) seems like it should hold only HABS images, so I'll have to create a parent category for regular images.
    2. Category:National Trails of the United States is more of a meta-category. But there aren't that many here, so that's easily cleaned up with Cat-a-Lot.
Otherwise, I think we are about good to go. Nice work! howcheng {chat} 23:07, 9 May 2017 (UTC)
@Howcheng: There are 6 categories I chose using a "best guess" strategy. Maybe it's better to leave these few maps uncategorized, put them in some separate category, e.g. Category:NPS Maps to be categorized, or just create the categories manually in advance (I've already done this for Category:Marsh-Billings-Rockefeller National Historical Park):
I have since created the "Maps of XXX" categories, e.g. Category:Maps of Capitol Reef National Park, and updated File:NPS acadia-map.jpg as well as User:Reinhard Kraasch/NPS maps. --Reinhard Kraasch (talk) 10:24, 10 May 2017 (UTC)
I would also propose, contrary to your original concept, using a special creator and license template. Possible changes to attribution and licensing are then easier to maintain, and this is an easy way to put all the images in a single maintenance category (e.g. "NPS Maps uploaded by RKBot") to access them later on. --Reinhard Kraasch (talk) 11:54, 10 May 2017 (UTC)
Sample upload, now using the mentioned templates: File:NPS acadia-map.pdf --Reinhard Kraasch (talk) 15:58, 10 May 2017 (UTC)
@Reinhard Kraasch: I like your suggestions. howcheng {chat} 22:18, 10 May 2017 (UTC)

@Howcheng: I've corrected some errors, added "other version" where there is one, and uploaded some more samples; they are in Category:Maps of Acadia National Park --Reinhard Kraasch (talk) 01:09, 11 May 2017 (UTC)
Plus another 30 sample uploads, all in Category:Files from the National Park Service uploaded by RKBot. If you think these are OK, I will upload the remaining 1,900 or so over the coming night. --Reinhard Kraasch (talk) 08:12, 12 May 2017 (UTC)

For the missing national park categories mentioned above, I have created the categories:

and categorized them "as well as possible".

Plus another 32 sample uploads in Category:Files from the National Park Service uploaded by RKBot. --Reinhard Kraasch (talk) 18:42, 13 May 2017 (UTC)

So the job is done: all 1,968 files are uploaded and in Category:Files from the National Park Service uploaded by RKBot --Reinhard Kraasch (talk) 12:36, 17 May 2017 (UTC)
@Reinhard Kraasch: Excellent! Thanks so much. howcheng {chat} 15:59, 17 May 2017 (UTC)
Assigned to        Progress   Bot name   Category
Reinhard Kraasch   Done       RKBot