Commons:Batch uploading/US National Archives

US National Archives

I plan to use a bot to upload images from the US National Archives' digital files. I currently have access to a cache of over 120,000 TIFF master files which are ready for upload. The bot is a custom pywikipediabot script written by Multichill (code) and it relies on slakr's toolserver tool to translate NARA metadata into Commons upload code. It will upload images using the custom {{NARA-image-full}} template. Each page will be uploaded with that template filled out with the imported NARA metadata, plus {{Uncategorized-NARA}} to facilitate the categorization of these files. Dominic (talk) 19:15, 20 July 2011 (UTC)

Opinions

Moved from Commons:Bots/Requests/US National Archives bot. I wrote a bot to do the uploads. I added the link to the source. Multichill (talk) 19:54, 19 July 2011 (UTC)


Dates in titles
*Would you do a test run? Currently there are just three files.
Date ranges in file names aren't ideal and the quotes are a bit confusing ("Side view of adobe house with water in foreground," Acoma Pueblo (National Historic Landmark)). --  Docu  at 20:14, 19 July 2011 (UTC)
    • The name comes directly from the title in the National Archives' catalog record. They won't usually have quotes and brackets, but it can happen. Many of these do have dates or date ranges, but that is their style. (I'll run a fuller test batch in a few hours when I get the chance.) Dominic (talk) 20:19, 19 July 2011 (UTC)[reply]
  Comment For photographs, like the 3 example uploads, I would suggest looking into a way to add more categories:
  • Author category
  • Date category
  • Subject category
  • Medium category (photographs, paintings, handwritten documents, etc.)
  • etc.
For other types of records other category types might be suitable. It is easier to add some of those categories before the upload. --Jarekt (talk) 01:43, 20 July 2011 (UTC)[reply]
I'm not sure how we could do any of these in an automated way. Not all documents have subjects, and the ones that do do not map onto Commons categories anyway. The same is true of the medium and author fields. The dates also seem difficult. Some of the dates are ranges, some are exact days, just months, or just years. Dates can represent dates of creation, copyright, publication, or broadcast. I am hoping we will be able to organize a major community effort for categorizing these, as it will take humans. The one thing that we can do is categorize them hierarchically according to the National Archives catalog structure. For example, each of the Ansel Adams items would go in a category for the "Ansel Adams Photographs of National Parks and Monuments, compiled 1941 - 1942, documenting the period ca. 1933 - 1942" series. Of course, some of the series are less descriptive than others, but it's a start. Dominic (talk) 03:00, 20 July 2011 (UTC)
I think we should try 2 approaches. First, add categories based on the NARA catalog structure. We could make them hidden categories and encourage people to move images out of them, but this way we can group similar images together. Second, I still think that we should try to match NARA authors with Commons creators and add appropriate categories. In my WGA upload all images have a Creator template and a matching author category. Maybe a way to accomplish that would be to create a translation table where each NARA author is matched with a creator and category. Then your bot would read this table and use it to add the proper templates and categories. The table can easily be added to the bot if it is implemented as an external CSV file. We probably do not need to match every NARA author, since some might be quite obscure, but we should at least match all authors that already have a creator template and authors with a large number of records. Dominic, do you think it would be possible to put somewhere a list of all authors of the files you are planning to upload and how many records are associated with them? I can try to see how many I can match. --Jarekt (talk) 16:01, 22 July 2011 (UTC)
This is what I did. I made {{NARA-Author}} for all of the authors. Every author (or person listed as a "contributor", whether it's a photographer, artist, director, etc.) has an ID and a page in the catalog that links to the records they are associated with. That template creates a URL to these author records in the catalog. I am not sure if that helps or hinders the attempt to make categories for them, but maybe we can use the template in some way to add categories based on those unique IDs? I will note, though, that it's actually uncommon for authors to be listed at all. Most documents are created by uncredited federal workers, and others are grouped into series based on the author, but the author field in the record isn't actually used (cf. this series). The full list of author records could actually be extracted from the dataset, if anyone is brave enough to try. Dominic (talk) 16:18, 22 July 2011 (UTC)[reply]
I did not notice {{NARA-Author}} before. If it is added to all the images that have an author, then we can easily add creator templates and categories later. BTW I did not see author records in the NARA dataset or its description. --Jarekt (talk) 17:00, 22 July 2011 (UTC)
I do not know if there are separate XML files for the person authority records, like there are for items. However, if an item has a contributor mentioned in its record, the contributor's ID is also there in a field in the item's data file. This is how I am able to upload the files with that information. Dominic (talk) 19:04, 22 July 2011 (UTC)[reply]
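A minimal sketch of the translation table suggested above, assuming an external CSV with three columns. The column names (`nara_id`, `creator`, `category`), the sample ID and the helper names are all hypothetical; none of them come from NARA_uploader.py or from the NARA dataset.

```python
import csv
import io

# Hypothetical table layout: NARA contributor ID, Commons creator page, category.
# The ID below is a placeholder, not a real NARA identifier.
SAMPLE_TABLE = u"""nara_id,creator,category
1234567,Creator:Ansel Adams,Ansel Adams
"""


def load_author_table(handle):
    """Return {nara_id: (creator, category)} read from a CSV file object."""
    table = {}
    for row in csv.DictReader(handle):
        table[row["nara_id"].strip()] = (row["creator"].strip(),
                                         row["category"].strip())
    return table


def author_wikitext(nara_id, table):
    """Return creator template plus category wikitext, or '' if unmatched."""
    match = table.get(str(nara_id))
    if match is None:
        # Obscure or unmatched contributors are simply left alone.
        return u""
    creator, category = match
    return u"{{%s}}\n[[Category:%s]]" % (creator, category)


table = load_author_table(io.StringIO(SAMPLE_TABLE))
print(author_wikitext(1234567, table))
```

Keying the table on the contributor ID rather than on the name avoids having to reconcile spelling variants, and unmatched contributors fall through without blocking the upload.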


Change extensions to ".tif"
  Comment Another minor comment: can we use the "tif" or "tiff" file extension instead of "TIF"? Maybe it is just me, but I do not like capitalized extensions. --Jarekt (talk) 01:48, 20 July 2011 (UTC)
Sure, I have changed it. Dominic (talk) 03:00, 20 July 2011 (UTC)[reply]


Remove "Item from "
  Comment Wouldn't it look more natural to remove "Item from Record Group" in Record group=Item from Record Group 79: Records of the National Park Service, 1785 - 2006 and the second "series" in

As a test run, I have gone ahead and finished the Ansel Adams batch (220 files). [1] Dominic (talk) 04:45, 20 July 2011 (UTC)[reply]


MediaWiki:Stockphoto.js bug
*Looks good. BTW the "Use this file on the web" button doesn't work with the NARA templates. Not sure if the template needs fixing or the function. --  Docu  at 05:31, 20 July 2011 (UTC)
That seems to depend on the file; for example, I get the MediaWiki:Stockphoto.js buttons at File:"Two Medicine Lake, Glacier National Park," Montana, 1933 - 1942 - NARA - 519874.tif. I think it is a MediaWiki:Stockphoto.js issue and I will discuss it there. --Jarekt (talk) 15:41, 22 July 2011 (UTC)
  Deferred to MediaWiki talk:Stockphoto.js#No buttons on some pages

End of move. Multichill (talk) 19:26, 20 July 2011 (UTC) I moved the discussion here from Commons:Bots/Requests/US National Archives bot. We have two pages:

Why did I make this split? Because bot requests take ages when we start discussing batch requests, and a request gets closed when we actually want to provide more feedback. Can everyone please respect this? Multichill (talk) 19:26, 20 July 2011 (UTC)


Overall this looks pretty good. There are just a few things I would tweak a bit:

(a) Personally I'd use the format "NARA number - <title>" instead of "<title> - NARA - number".

I usually prefer "NARA number - <title>" instead of "<title> - NARA - number" format, but before my WGA upload several users insisted on "<title> - NARA - number" format so all the images from a single source do not start with the same letter and {{TOC}} templates can be used. I think we should stay with the current format. --Jarekt (talk) 17:07, 23 July 2011 (UTC)[reply]

(b) The following file names could use further normalization:

(c) In file descriptions, such as NARA 512467, the date seems to get repeated, once in the title field and once in the date field. This is possibly due to the way the source presents it and gets parsed. For comparison, check: NARA 530898.

(d) It would be helpful if something could be done about the categorization. Currently, e.g. NARA 530898 gets added into three NARA categories, but no topical one. Even just adding Category:Indians of North America (already present in the source) would be an improvement.

Hope this helps. --  Docu  at 06:14, 23 July 2011 (UTC)[reply]

Great images. Making different source categories might already do a presorting of the subjects. (Indian affairs, World War, ...) so it would largely accelerate final categorisations. As a test, I made already Category:Lower Brule Sioux, Category:Department of the Interior, Office of Indian Affairs, Category:Cherokee census cards, Category:Propaganda of advertising sections of the United States and Category:Office of the U.S. Coast Survey . For the last category, search seems to fail for some reason. --Foroa (talk) 07:39, 23 July 2011 (UTC)[reply]
(a) I am not bothered either way, though it sounds like it should stay as is unless there are more people who want it the other way.
(b) The idea of adding a ... and an end quote to the end of a truncation is a good one. I still have qualms about removing the dates. As noted, these are the items' exact titles in the NARA catalog, so I think we should be faithfully reflecting them. Besides which, as you can see, the dates in the title field are not machine readable like the ones in the date field. They can come in any format, whether it's a single year, a range, or a "ca.". How would you distinguish the date in "Compromise of 1850" from one of those WWI posters? Certainly not just by trusting that a comma signifies it's a date appended to the name. I don't think it's worth the trouble.
(c) This is how the National Archives catalogs such items. Some dates are appended to titles, but there is also a date field. This small amount of duplication seems innocuous to me. If we're not going to remove the dates, as I noted above, we still want a date to put in our date field, anyway. So it will be repeated. I don't see the harm.
(d) I am a little confused by what you are suggesting here. We should add non-existent categories based on the NARA subject headings? That sounds like it would be controversial, and I'm not sure why it would be useful. There are a lot of things we can and should do automatically, but I think topical categories are mostly the job of human volunteers. This is why I created Category:Media from the National Archives and Records Administration needing categories, in anticipation of organizing a community effort specifically targeting these files for categorization.
I realize I just disagreed with most of your points, but I don't mean to be contrary. I'm glad to have the batch request scrutinized by others before it gets underway, because we want things to be as close to perfect as they can be before applying a given convention to thousands of file pages. :-) Dominic (talk) 00:26, 24 July 2011 (UTC)
(a) No problem. It's a "nice to have", e.g. the Tropenmuseum batches were done that way.
(b) I only meant dates that are date ranges. IMHO it's a good thing to have accurate dates (e.g. a year) in file names, but date ranges tend to be confusing (there is also the NARA number just next to it). If you post the list of dates to a page here, I could sort it out for you. For the remainder see (c) below.
(c) Looking at the HTML source of the NARA website, it seems to me that the date (sFCextra) is appended to the title (sFC) when displaying a record. It's clearly two separate fields. For Commons, sFC only should go into the title field of the file description page, sFCextra in the date field. You could check if they are identical before discarding them.
(d) You could prefix the categories with "NARA topic: ". Anything consistent you can add on upload (or later by bot) makes it easier for volunteers to categorize them manually. It can save so much time if at least one or two categories are added beforehand. Practically, it doesn't matter if the category exists, but obviously it's better if they do. For the QLD upload, we mapped some of the topics beforehand, but this adds a new level of complexity you might want to skip. The solution suggested by Foroa is helpful.
Personally, I don't mind if you disagree as long as you have good reasons to do so and you bear in mind that afterwards we will have thousands of images presented that way. --  Docu  at 07:40, 24 July 2011 (UTC) (edited)[reply]
If you put the 120,000 files in one single category, it will take about 4 to 7 years before they are all categorised. If you can manage to spread them over several categories with different subjects/topics/sources, then 90 % might be categorised in one or two years. --Foroa (talk) 15:38, 24 July 2011 (UTC)[reply]
(b) and (c) are definitively resolved. I would also consider all of the general comments on topical categorization, like (d), resolved at this point. (See my comment at #Resolusions.) I think (a) is a non-starter at this point, but if there is indication that more people actually want to make that change, we can do so. As I said, I don't mind either way.


More date comments
I wasn't aware of the sFCextra field before, but checking some of them out, it looks like it includes all dates, like the accurate ones that you like. Do you like precise dates more than you dislike ranges? :-) Dominic (talk) 21:12, 24 July 2011 (UTC)
This is now also done. Compare the title in [2] with the title in File:SULFUR-DUSTING OF GRAPE VINES - NARA - 542506.tif. Dominic (talk) 04:50, 5 August 2011 (UTC)[reply]


Comments on categorization
I changed the template to add categories based on the series, like Category:US National Archives series: Signal Corps Photographs of American Military Activity, compiled 1754 - 1954 and Category:US National Archives series: Enrollment Cards, compiled 1898 - 1914 (some of these might get even longer, though). Beyond that, I suppose I am still confused how we would do categorization in any automated way. We could do the mapping beforehand like QLD, but I want to get these done sooner rather than later. I'm only at NARA for another month. Maybe we should just include a field in {{NARA-image-full}} that copies over the subjects, if that's helpful. Dominic (talk) 04:17, 25 July 2011 (UTC)
Thank you. That is already a significant improvement. A deeper subject categorisation might indeed be useful, provided that there are not several thousands; the most important is to group them by theme. --Foroa (talk) 05:43, 25 July 2011 (UTC)[reply]
Looks good indeed. I asked slakr to look into the date thing for the information template and will try to provide a suggested fix for the upload title later today. --  Docu  at 06:38, 25 July 2011 (UTC)[reply]
After some further tests, I think that one should not categorise too deep because categories are permanent (template generated), so cannot be moved or emptied after final categorisation. --Foroa (talk) 07:11, 25 July 2011 (UTC)[reply]
Can you clarify what you mean by "deep" categorization? I'm a little confused by that terminology. Dominic (talk) 08:35, 25 July 2011 (UTC)[reply]
Too deep = too many categories that contain each only a couple of images. As seen with KIT images in Special:WantedCategories, it takes years to clean them out, especially the small ones. --Foroa (talk) 11:49, 26 July 2011 (UTC)[reply]
As a note, the series categories are not intended to be temporary. I think they are a nice touch, since that is part of the NARA catalog structure. We can even include the series-level descriptions from NARA's series catalog records on their category pages. Dominic (talk) 14:35, 26 July 2011 (UTC)[reply]
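As a rough illustration of the series categories discussed above, the sketch below builds the category wikitext from a series title and lists the series that would receive only a few files, which is the "too deep" situation Foroa describes. The category prefix is taken from the examples in this thread; the function names and the threshold of 3 are assumptions made for this sketch, not part of the {{NARA-image-full}} template.

```python
from collections import Counter

# Prefix as it appears in the category examples quoted in this discussion.
SERIES_PREFIX = u"US National Archives series: "


def normalize(series_title):
    """Collapse stray whitespace so equal series produce equal names."""
    return u" ".join(series_title.split())


def series_category(series_title):
    """Return the category wikitext for one NARA series title."""
    return u"[[Category:%s%s]]" % (SERIES_PREFIX, normalize(series_title))


def small_series(series_titles, minimum=3):
    """Return series with fewer than `minimum` files, sorted by name.

    `minimum` is an arbitrary illustrative threshold.
    """
    counts = Counter(normalize(title) for title in series_titles)
    return sorted(name for name, count in counts.items() if count < minimum)


print(series_category(u"Enrollment Cards, compiled 1898 - 1914"))
```

Running `small_series` over the series titles of a planned batch before uploading would show how many near-empty categories the template is about to create.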


More date comments
See below for the suggested update. --  Docu  at 21:24, 25 July 2011 (UTC)
Source
  • Suggested update to NARA_uploader.py:
  • Insert after "getDescription", replacing everything before "main".
def getDate(description):
    dateRe = re.compile(r'^\|Date=(.+)$', re.MULTILINE)
    dateMatch = dateRe.search(description)
    if dateMatch:
        dateText = dateMatch.group(1)
    else:
        dateText = ""
    return dateText.strip()

def fixDescription(description, dateText):
    description = description.replace(u"{{int:license}}", u"{{int:license-header}}")
    titleRe = re.compile(r'^\|Title=(.+)$', re.MULTILINE)
    titleMatch = titleRe.search(description)
    # Nothing to trim when there is no title line or no date.
    if not titleMatch or not dateText:
        return description
    titleText = titleMatch.group(1)
    # Drop the date and the ", " separator when the title ends with the date.
    if titleText.endswith(dateText):
        description = description.replace(titleText, titleText[:(-len(dateText)-2)])
    return description

def getTitle(fileId, description, dateText):
    titleRe = re.compile(r'^\|Title=(.+)$', re.MULTILINE)
    titleMatch = titleRe.search(description)
    titleText = titleMatch.group(1)
    titleText = cleanUpTitle(titleText)
    suffix = ""
    # Only short dates (shorter than 11 characters) are added to the file name.
    if 0 < len(dateText) < 11:
        suffix = " (" + dateText + ")"
    if len(titleText + suffix) > 120:
        titleText = titleText[0 : 120 - len(suffix)]
        # An odd number of quotes means the cut fell inside a quotation: close it.
        if titleText.count('"') % 2 != 0:
            titleText = titleText[:-3] + '.."'
    title = u'NARA %s: %s.tif' % (fileId, titleText + suffix)
    return title.replace(u" ", u"_")
    

def cleanUpTitle(title):
    '''
    Clean up the title of a potential mediawiki page. Otherwise the title of
    the page might not be allowed by the software.

    '''
    title = title.strip()
    title = re.sub(u"[<{\\[]", u"(", title)
    title = re.sub(u"[>}\\]]", u")", title)
    title = re.sub(u"[ _]?\\(!\\)", u"", title)
    title = re.sub(u",:[ _]", u", ", title)
    title = re.sub(u"[;:][ _]", u", ", title)
    title = re.sub(u"[\t\n ]+", u" ", title)
    title = re.sub(u"[\r\n ]+", u" ", title)
    title = re.sub(u"[\n]+", u"", title)
    title = re.sub(u"[?!]([.\"]|$)", u"\\1", title)
    title = re.sub(u"[&#%?!]", u"^", title)
    title = re.sub(u"[;]", u",", title)
    title = re.sub(u"[/+\\\\:]", u"-", title)
    title = re.sub(u"--+", u"-", title)
    title = re.sub(u",,+", u",", title)
    title = re.sub(u"[-,^]([.]|$)", u"\\1", title)
    return title


  • In "main", replace "description = getDescription(fileId)" with the following:
                description = getDescription(fileId)
                dateText = getDate(description)
                description = fixDescription(description, dateText)


  • In main, replace "title = getTitle(fileId, description)" with the following:
 
                title = getTitle(fileId, description, dateText)


i18n
*  Question Could the 'Record creator' & 'Location' fields be i18n ? Eg, 'Department of the Interior. Office of Indian Affairs. Office of the Commissioner to the Five Civilized Tribes. (1893 - 1914)' would be in French 'Département de l’Intérieur, Bureau des affaires indiennes' − indeed, maybe the names should also be linked to the relevant WP articles. Jean-Fred (talk) 11:40, 26 July 2011 (UTC)[reply]
Yes, I added {{NARA-image-full/I18n}}, which allows translation of several field names and all the locations. I will add an option for translating record creators. I still have to add links to the relevant WP articles. --Jarekt (talk) 15:25, 2 August 2011 (UTC)


Teofilo's block request
* I am asking that the bot be blocked until the file name bug is solved: Commons:Administrators'_noticeboard/Blocks_and_protections#User:US_National_Archives_bot Teofilo (talk) 22:39, 30 July 2011 (UTC)

  Dominic (talk) 16:13, 11 August 2011 (UTC)[reply]


Records in TIFF format

There are a series of textual records included in the trial upload. Commons is a multimedia database and as such doesn't host primarily text documents. There is a sample on the right. For more see: Category:US National Archives series: Enrollment Cards, compiled 1898 - 1914. Which percentage of the 120,000 tiffs do they represent? Is there a planned use for them on a WikiMedia project? --  Docu  at 06:24, 26 July 2011 (UTC) (edited)[reply]

Er, what? Of course Commons hosts textual documents. Our Wikisource project is humming along. :-) Dominic (talk) 13:15, 26 July 2011 (UTC)[reply]
Nothing stops you from converting them in ASCII text format; then we don't need them anymore. --Foroa (talk) 13:37, 26 July 2011 (UTC)[reply]
No, Commons most definitely hosts textual documents. Such documents are the entire purpose of one of our projects. It sounds like you should get to know Wikisource better. Saying we should transcribe a document and then we don't need it anymore is kind of like saying we should caption a photo and then we don't need it anymore. Besides which, they are important historical documents. We would host them whether or not they belong on a project. Dominic (talk) 13:44, 26 July 2011 (UTC)
I guess you did not capture my irony. --Foroa (talk) 14:08, 26 July 2011 (UTC)[reply]
Heh, sorry if I am jumpy. This page is quickly sucking the humor out of me. I do appreciate the irony now. :-) Dominic (talk) 14:26, 26 July 2011 (UTC)
I can't find census records on s:Wikisource:WikiProject NARA/To prepare or s:Wikisource:WikiProject_NARA/To_retrieve. Is this the type of records they generally transcribe ? --  Docu  at 06:39, 27 July 2011 (UTC)[reply]
I haven't added the census records because they haven't finished uploading yet. As historically significant government documents, they're easily within the project's scope. Dominic (talk) 12:21, 27 July 2011 (UTC)

ARC number

Another solution for ARC could be: store them all in a separate template page so that series ARC=408 would give "Record group 79: Records of the National Park Service, 1785 - 2006 (ARC identifier: 408)". This would make page description more concise and would also allow to add translations of record group and series names by editing only one page.--Zolo (talk) 01:51, 28 July 2011 (UTC)[reply]

We could even store more data than that in the template so that we would only need to provide the document ARC in the file description. This would not be as efficient, but it would minimize duplicate info and would provide cleaner, potentially reusable data. Additionally, this would make file descriptions even simpler by hiding away info that in most cases should not be changed by users. I have created a toy template in {{ARC/sandbox2}}. {{ARC/sandbox2|306514}} gives
 
This media is available in the holdings of the National Archives and Records Administration, cataloged under the National Archives Identifier (NAID) 306514.

This tag does not indicate the copyright status of the attached work. A normal copyright tag is still required. See Commons:Licensing.

  • Record group: Committee Papers, compiled 1806 - 2000 (: 306513)
  • Series: 128: Records of Joint Committees of Congress, 1789 - 2004 (: 457)

This means that {{ARC/data}} will need to be quite large. To make it smaller, it could also be used for record groups and series only, and not for individual documents. But it would make it less useful.--Zolo (talk) 03:29, 29 July 2011 (UTC)[reply]

ParserFunctions are a bit beyond me, but could that possibly work with tens of thousands of records? Dominic (talk) 12:54, 29 July 2011 (UTC)[reply]
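To make the lookup idea concrete outside of template syntax, here is a small Python sketch: names are stored once, keyed by ARC identifier, and the display string is built on demand. The dictionary stands in for the proposed {{ARC/data}} page, and `ARC_DATA` and `arc_label` are hypothetical names. It says nothing about whether ParserFunctions would cope with tens of thousands of entries, which remains the open question.

```python
# Hypothetical stand-in for the data page: ARC identifier -> stored name.
# The single entry is the example given earlier in this section.
ARC_DATA = {
    408: u"Record group 79: Records of the National Park Service, 1785 - 2006",
}


def arc_label(arc_id, data=ARC_DATA):
    """Return the display string for an ARC identifier.

    Falls back to the bare identifier when the name is not in the table,
    so a missing entry never hides the ID itself.
    """
    name = data.get(arc_id)
    if name is None:
        return u"ARC identifier: %d" % arc_id
    return u"%s (ARC identifier: %d)" % (name, arc_id)


print(arc_label(408))
```

Keeping the names in one table means a translation or correction is made once rather than on every file page, which is the benefit Zolo describes.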

Template

* While we are at modifying the bot script, could it be possible to add
  • I think it would be useful for our users to have on each page a link to the relevant "Scope & Content" page of the photographic series the picture belongs to on the ARC website. These "Scope & Contents" pages contain valuable information on the origins of the pictures. As they are 2 clicks away from Wikimedia Commons, among a number of not-so-useful links, I think most users won't find them if we don't provide a direct link (We might also copy them to wikisource and link to the corresponding wikisource pages. We might copy them here on Commons if we get community approval for using gallery pages for that purpose). So for example, on this file it would be good to have the following : "Series: Signal Corps Photographs of American Military Activity, compiled 1754 - 1954 (Scope & Content)". I think "Scope & Content" is more important, for a first reading, than "Details". The "record ID" and "Source" fields should be merged and called "Source". Teofilo (talk) 23:21, 1 August 2011 (UTC).[reply]
    I changed my mind. I feel more like removing all the Record group, Series, NAIL Control Number information. The {{NARA-image}} template with its single arcweb link is enough. The users who want to know more can click on that single link which is an entrance to all the extra information. The "Record ID" field is not useful save the Nara-image template. Teofilo (talk) 08:44, 2 August 2011 (UTC)[reply]
    I doubt you'll find anyone agreeing with that point of view. And note that the ARC ID is more just the identifier that refers to the catalog record and allows us to make predictable URLs. The series and record group are actually descriptive metadata assigned by the archives that relate to the document creator and/or subject. Dominic (talk) 04:50, 5 August 2011 (UTC)[reply]
    I am afraid you are swapping the parts. Until now hardly any upload from the NARA was made by including those extravagant and noisy data which are not useful to a majority of users. You will find hardly anyone among those who uploaded contents from NARA in the past who agrees with you. For example File:USS Intrepid (CV-11) - Nov 44 a.jpg. That these extra data are not useful is common sense. For example, let's see how the Bundesarchiv pictures are documented. In the case of File:Bundesarchiv Bild 101I-731-0388-38, Frankreich, nach der Invasion, Infanteristen.jpg, all the extra information such as
  • Inventory: Bild 101 I - Propagandakompanien der Wehrmacht - Heer und Luftwaffe
  • Classification: Sachklassifikation/E {Zweiter Weltkrieg 1939-1945}/Ee {Kriegsschauplätze und Feldzüge}/Ee 300 {Westfeldzug}/Ee 350 / 360 / 370 / 380 {Frankreich*}/Ee 380 {Frankreich nach der Invasion (ab 6.6.1944)}/Ee 381 {Infanterie} Sachklassifikation/E {Zweiter Weltkrieg 1939-1945}/Ed {Truppen- und Formationsgeschichte*}/Ed 100 / 200 {Heer*}/Ed 110 {Infanterie}
was removed. Removing is the right thing to do. Please note also that the creator template was made collapsible because a lot of people found it too noisy. There is wide support for the idea of keeping description pages streamlined and simple. Teofilo (talk) 09:24, 5 August 2011 (UTC)
  • Each page contains 2 links to en:U.S. National Archives and Records Administration. I think this is one too many (or two too many if you count commons:National Archives and Records Administration). Couldn't we just get rid of the "Current location" field altogether? Isn't the {{NARA-image}} template sufficient to mean that the pictures are located there ? Teofilo (talk) 23:02, 1 August 2011 (UTC)[reply]
    • NARA is a major US government agency with more than two dozen facilities. It's not a location. The location field is the record of where the physical document digitized on Commons is located. That the institution's name is linked more than once is because there are three separate templates used on the pages that are complete; it seems pretty trivial. Dominic (talk) 20:12, 2 August 2011 (UTC)[reply]
      Brainwashing the user by repeating three times the same message is an advertising technique amounting to using Wikimedia for a promotional campaign at the expense of usability. It overcrowds the template and makes the other information such as the author, date, or description fields proportionately less visible. The reason why the Artwork template contains both a "location" field and a "source" field is that we are dealing with photographs of paintings and photographs of sculptures. The "location" field is for the location of the painting/sculpture, while the source field is for the source of the photograph. For this reason, NARA uploads of paintings such as File:"Crocodile and Snake Fighting" - NARA - 558928.tif are wrong. The "location" field should be filled with "unknown", or with the name of the museum or of the private owner who owns the painting. Writing "National Archives and Records Administration, Still Picture Records Section, Special Media Archives Services Division (NWCS-S)" in the "current location" field of this painting is a mistake (for example compare with File:Serapis Louvre AO1027 profil.jpg, and count the number of occurrences of the word "Louvre" there). For works that are just photographs, not photographs of paintings or photographs of sculptures, the "location" field should be removed. Teofilo (talk) 09:24, 5 August 2011 (UTC)
      These files are the records of a government agency, and the location field is the listing of the repository in which the records are held. That is not extraneous or unusual information. Your accusations of brainwashing and advertising are getting tiresome. The institution you are talking about is a public agency that holds public records; it is graciously making its high-res scans available to Commons with no strings attached. The "advertising" you are talking about is metadata added and maintained by Wikimedians because it is useful. Nothing of the sort has been demanded or even asked by the institution you are maligning. Dominic (talk) 16:13, 11 August 2011 (UTC)[reply]


line spacing bug
*On some pages like this file, the template is pre-filled with |Author=|Location= on the same line. It makes filling the |Author line a little cumbersome. This is yet another reason, even if a weak one, to remove the |location field altogether! Teofilo (talk) 12:08, 5 August 2011 (UTC)
This has nothing to do with removing the location field. It was a line spacing bug. Dominic (talk) 16:13, 11 August 2011 (UTC)[reply]

File name maximum length and file name cutting format

The following is copied from Commons:Administrators' noticeboard/Blocks and protections#User:US National Archives bot

I think the bot should be blocked until the file-name issue is solved. See the "File:Combat memorable..." entry in Commons:National Archives and Records Administration/Error reporting or compare this NARA upload (name cut after "Gene") with previously uploaded picture with full name. Look at this list of 50 uploaded files where most of the file names are cut. It is not realistic to correct all these file name errors afterwards one by one, tagging each picture with {{Rename}}. The upload software bug must be solved so that the files are uploaded with the full name, without cut. Cut names not only produce an impression of bad quality upon users, it also creates a lot of potential wrong keyword searches in search engines. Someone looking for a "gene" (a biological system) should not find the "Alphonse Juin, Commanding Gene" picture in his search results. Teofilo (talk) 22:23, 30 July 2011 (UTC)[reply]

Er, you want it blocked? I can just turn it off, you know. I'm not exactly sure what the issue is, though. The titles get cut off when they reach the length limit. "The upload software bug must be solved so that the files are uploaded with the full name, without cut" is an impossible solution. This doesn't seem like a huge problem, certainly not one that's more important than getting the content on Commons. Most end users are going to be viewing the images on the projects, so the idea that these titles somehow negatively affect users because they are stylistically displeasing is a little baffling to me. Dominic (talk) 23:29, 30 July 2011 (UTC)[reply]
Oh? How come there was no polite enquiry from Teofilo on either Commons:Batch uploading/US National Archives or User talk:Dominic? Oh wait... Jean-Fred (talk) 23:37, 30 July 2011 (UTC)
For Jean-Frédéric, here is the Commons:National Archives and Records Administration/Error reporting link again, where the problem was debated between Dominic and me below the "File:Combat memorable..." entry. Teofilo (talk) 11:11, 31 July 2011 (UTC)
Actually, I posted a fix a couple of days ago for the problem Teofilo mentions. Oddly it hasn't been applied yet. --  Docu  at 05:36, 31 July 2011 (UTC)[reply]
Thank you for doing so. I was not aware that you had prepared a fix. Teofilo (talk) 11:11, 31 July 2011 (UTC)[reply]
I thought that that was about the dates appended to the end of titles. I don't see where you mentioned the issue Teofilo is concerned about anywhere on the page. Dominic (talk) 19:10, 31 July 2011 (UTC)[reply]
Who is(are) the person(s) in charge of the upload software ? According to en:Wikipedia:Naming_conventions_(technical_restrictions)#Title_length, "Titles must be less than 256 bytes long when encoded in UTF-8.". Measured with http://bytesizematters.com/ , File:US Navy 050419-N-5313A-049 A U.S. Marine Corps AV-8B Harrier launches from the flight deck of the amphibious assault ship USS Kearsarge (LHD 3) during flight operations in the Mediterranean Sea.jpg is 202 bytes long and File:Combat memorable donne le 22, 7re 1779, entre le Captaine Pearson commandant le Serapis et Paul Jones commandant le Bonh - NARA - 532895.tif is only 145 bytes long. So it looks possible to add 256-145=111 more characters into NARA uploads' file names. The full title "Combat memorable donne le 22, 7re 1779, entre le Captaine Pearson commandant le Serapis et Paul Jones commandant le Bonhomme Richard et son escadre, 07/22/1779" being 159 characters long, it should be OK. With 249 characters, "Pvt. Jonathan Hoag,...of a chemical battalion, is awarded the Croix de Guerre by General Alphonse Juin, Commanding General of the F.E.C., for courage shown in treatingwounded, even though he, himself, was wounded. Pozzuoli area, Italy.", 03/21/1944" is perhaps only one or two characters longer than the 256 limit after adding "File:" and ".tif". Also it could be decided to cut whole words instead of cutting in the middle of the words, and to use (…) at the location where the cut is performed, like I did for this upload of mine. Perhaps it would be best to always keep the date at the end of the title, and to cut the words located before the date. Teofilo (talk) 12:40, 1 August 2011 (UTC)[reply]
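The byte counting in the comment above can be reproduced without an external website. The following is only a minimal sketch: it assumes, as that comment does, that the limit is counted over the UTF-8 encoding of the whole page name including "File:" and the extension, and it reads "less than 256 bytes" as a maximum of 255. The function names and the sample page name are hypothetical.

```python
# Assumption: the limit is measured in UTF-8 bytes over the whole page name,
# including the "File:" prefix and the extension, as in the comment above.
MAX_TITLE_BYTES = 255  # "less than 256 bytes"


def utf8_len(text):
    """Return the length of text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def bytes_left(page_name, limit=MAX_TITLE_BYTES):
    """Return how many more bytes fit under limit (negative if already over)."""
    return limit - utf8_len(page_name)


if __name__ == "__main__":
    name = "File:Example title - NARA - 532895.tif"  # hypothetical page name
    print(utf8_len(name), bytes_left(name))
```

Note that bytes and characters differ as soon as a title contains accented letters or an ellipsis character, so a count in characters is only an approximation of the remaining room.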
I am running a script that was written by Multichill; he's not in charge of the bot's actions, but I am not a programmer, so I can't easily make changes without him. I was not originally aware that the character limit was that high. I had thought that the limit was being imposed by the upload form, not by the bot's script, which is why I was saying it wasn't fixable. I see now that we can allow even longer titles, but I am not sure if we should. This should be discussed at Commons:Batch uploading/US National Archives, as the names already seem rather long and unwieldy to me. Your suggestion to not have it cut off titles mid-word, though, is a good one, I agree. In any case, I don't think this is a dealbreaker. The full titles are all contained in the template's "title" parameter, so we wouldn't have to go back and rename anything manually anyway, since a bot can extend the names using that data. I think it is more important to get the files actually uploaded at this point. Dominic (talk) 14:27, 1 August 2011 (UTC)[reply]

End of copy from Commons:Administrators' noticeboard/Blocks and protections#User:US National Archives bot

Do you have a deadline after which the files won't be available any longer ? File renaming is an activity which consumes a lot of resources and which is generally frowned upon unless there is a good reason to do so. I am afraid the massive file renaming operation will be refused. When there is a problem in a car factory you stop the production line until the problem is solved. You don't sell the cars first and recall them a year later to change the defective part. The latter is more expensive. I think we need more opinions from people with bot software writing experience and help from people who would be willing to actually modify the script or write the file renaming bot's script. I am going to copy the present talk on Commons:Batch uploading/US National Archives. Teofilo (talk) 17:00, 1 August 2011 (UTC)[reply]
Well, I am only here for a couple more weeks. The files are not available on the Internet, but on hard drives here in the office. So it wouldn't be wrong to say there is a deadline of sorts. I am not sure the analogy to the factory is appropriate, as we're not recalling anything, just changing a name on a wiki. I'm not even sure if this is important enough that we would want to go back and change past uploads, even if we do change the convention going forward. They are not erroneous, just truncated. Dominic (talk) 17:21, 1 August 2011 (UTC)[reply]

For those who don't want to read all that text, the question is whether we want to make use of the full 250 characters we are allowed for the file names, which can be quite long, or whether we want to truncate it at a shorter length. The script is currently truncating at 120 characters, which isn't exactly short either, but does cause a lot of titles to get cut off. Dominic (talk) 17:21, 1 August 2011 (UTC)[reply]

I agree the file name issue should be fixed before the next batch of uploads, and I think we should keep titles short. Let's concentrate on the issue of how to do it. Dominic, is this still the code you are running? If so, then I assume that the issue is with the "if len(titleText)>120: titleText = titleText[0 : 120]" line. Docu, did you say you posted a fix somewhere? If so, where? I think we can solve this issue in a timely manner so as not to slow down Dominic too much. --Jarekt (talk) 17:43, 1 August 2011 (UTC)[reply]
Yes, that is the code. It seems easy enough to change, except this is more a question of style than a bug in the code, so I'm not sure what change, if any, to apply. (I think Docu is referring to the date issue, not this one, but I am not sure.) Dominic (talk) 17:59, 1 August 2011 (UTC)[reply]
The date issue appears on the NARA website too. It is not a simple upload bot script problem, although a script could help remove the extra date. I don't think there might be so many files with the date duplicate issue, so I guess it won't be so bad if we leave that issue unsolved. Teofilo (talk) 18:19, 1 August 2011 (UTC)[reply]
I have inquired, and these are actually not errors so much as limitations in the NARA catalog software. That "coverage dates" field, which is used to refer to the dates depicted in the document's subject rather than the document's creation, can only take ranges. When you put in a single day, it still makes it into a range. This isn't something they are going to fix. Dominic (talk) 18:35, 1 August 2011 (UTC)[reply]
A few more ideas:
1) Unwieldy ? Of course they are but we are in a situation where we must choose between the less unwieldy of two unwieldy possibilities. The possibility with extra-long names, and the possibility with names cut in an automatic fashion which creates wordings that are at times perfectly meaningless. It should not be forgotten that for a number of users English is a foreign language and it is less obvious when you don't master the language to understand that a sentence was cut and you should not even try to read a meaning. Also we should try as much as possible not to misrepresent the quality of the NARA's work. The NARA's work might have a number of shortcomings, but in any case the NARA does not produce botched file names.
2) While the files with a cut name are, in my opinion, a problem, there is no reason to prevent the bot from uploading all the other files with a short name. One possibility would be to quickly modify the bot script so that the files with long names are avoided for the time being, and to upload them later, after we have decided what to do with them.
3) One option would be to decide the new shorter names manually, on a case by case basis. We would have a bot write all the long file names in the left column of a table, and then we would request Wikimedians to write the shorter names with (…) in the right column. Then when all shorter names are available, the upload bot would be able to pick up the shortened files names from the table. Teofilo (talk) 17:48, 1 August 2011 (UTC)[reply]
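Points 2 and 3 above (upload the short names now, set the long ones aside in a table for manual shortening) could be sketched roughly as follows. This is an illustration, not the bot's actual code: the record format of (ID, title) pairs, the function names and the CSV column names are all assumptions, and 120 is simply the cut-off the script currently uses.

```python
import csv
import io

MAX_TITLE_CHARS = 120  # the cut-off the script currently uses


def split_by_length(records, limit=MAX_TITLE_CHARS):
    """Split (arc_id, title) pairs into those to upload now and those to defer."""
    ready, deferred = [], []
    for arc_id, title in records:
        (ready if len(title) <= limit else deferred).append((arc_id, title))
    return ready, deferred


def write_review_table(deferred, out):
    """Write deferred records as CSV, with an empty column for a hand-written short title."""
    writer = csv.writer(out)
    writer.writerow(["arc_id", "full_title", "short_title"])
    for arc_id, title in deferred:
        writer.writerow([arc_id, title, ""])


if __name__ == "__main__":
    sample = [("1", "A short title"), ("2", "x" * 130)]  # hypothetical records
    ready, deferred = split_by_length(sample)
    buffer = io.StringIO()
    write_review_table(deferred, buffer)
    print(len(ready), "ready;", len(deferred), "deferred")
    print(buffer.getvalue())
```

The resulting table opens in any spreadsheet program; once the third column is filled in, the upload script could read it back and use the hand-written names for the deferred files.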
Ensuring that we don't cut off names mid-word will help, as would adding "..." to the end when a name is cut off. Note that even at 250 characters, some titles will be cut off. I am not sure (especially judging by Jarekt's reply) that there is agreement to do that, though. Dominic (talk) 17:59, 1 August 2011 (UTC)[reply]
I see 2 possible solutions:
  • Automatic: if the filename is longer than 120 characters, then look for periods, semicolons or commas and trim there. If the string is still longer than 120, then trim at the end of a word. Add ... in the last case, and maybe also when trimming at a comma.
  • Manual: if the filename is longer than 120 characters, then (as Teofilo suggested) skip it for the time being, while writing its ID and title to some log file. Then from time to time read the log file in Excel (or some other spreadsheet) and manually trim the title. Or post the file somewhere, so others can help (Teofilo?). Then alter your bot to allow upload of those specific files with the provided filenames. I should be able to help with this part, if you need help.
The first solution is much less work. So that would be my preference. --Jarekt (talk) 18:49, 1 August 2011 (UTC)[reply]
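The automatic option could look roughly like the sketch below. It is one possible reading of the rule described above, not the script's actual fix: the function name is hypothetical, the "..." is counted inside the 120-character limit, and a period is treated as a sentence end even when it belongs to an abbreviation such as "Pvt.", which a real implementation would probably want to handle.

```python
MAX_TITLE_CHARS = 120
ELLIPSIS = "..."


def shorten_title(title, limit=MAX_TITLE_CHARS):
    """Shorten title to at most limit characters, preferring punctuation breaks."""
    if len(title) <= limit:
        return title

    # 1) Prefer the last period or semicolon that fits: no ellipsis needed.
    head = title[:limit]
    cut = max(head.rfind("."), head.rfind(";"))
    if cut > 0:
        return head[:cut].rstrip()

    # 2) Otherwise leave room for the ellipsis and cut at the last comma,
    #    or failing that at the last space, so no word is split.
    room = title[:limit - len(ELLIPSIS)]
    cut = room.rfind(",")
    if cut <= 0:
        cut = room.rfind(" ")
    if cut <= 0:
        cut = len(room)  # one very long word: a hard cut is unavoidable
    return room[:cut].rstrip(" ,;") + ELLIPSIS


if __name__ == "__main__":
    print(shorten_title("Alpha beta, gamma delta epsilon", limit=20))
```

Keeping the date at the end of the title, as suggested earlier in this thread, would need an extra step that splits the date off before shortening and appends it again afterwards.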
1) If you are patient enough to read 120 characters, why aren't you patient enough to read 256 ? Both the NARA website designers and the Library of Congress website designers have felt it normal to require their users to read titles longer than that. For example the HTML <title> element of http://www.loc.gov/pictures/item/2004670247/ is 330 characters long. What is wrong with that ? If the Library of Congress asked you for advice, what advice would you give ? Also, the fact that a title is displayed on your browser page does not mean you have to read the whole of it. If you are tired of reading, you can stop reading and look at some other area of the page.
2) I tagged one of the NARA uploads with {{Rename}} diff. The file was renamed today. Here is the result and I think it is much better (although I forgot to include the date). And I don't feel it is too long. If you remove the last part, the dramatic - tragedy - effect meant by the creator is lost. Sometimes titles are pieces of literature, meant to create emotions. Many of these pictures were used for propaganda. The caption was perhaps as important as the scene represented. Teofilo (talk) 22:18, 1 August 2011 (UTC)[reply]
3) For people who are unhappy with file names longer than 140 characters (while being shorter than 256 characters) it may be possible to create a JavaScript (or gadget, or full-fledged MediaWiki extension) which automatically cuts the name that is displayed onscreen (with the possibility to read the longer version in a mouseover). Teofilo (talk) 23:41, 1 August 2011 (UTC)[reply]

I think you are looking at this entirely the wrong way. Relatively few people are looking at the images on Commons itself, and the ones that are are usually the editors that are maintaining them, not the people using the images. No one is really concerned about a long title looking a little unsightly at the top of a description page. We do, however, have to think about how this is going to be used on the projects, and huge file names make article text hard to read in the edit view and make Wikisource index pages incredibly odd-looking. And for what? You're writing as if the file name, which is clearly marked off with a "File:" and a ".tif" and has other data in it, is the title itself. It may be true that titles are pieces of literature and that they are important, but no one wants to remove the title. There is a title field in the metadata for that, quite apart from the file name. Dominic (talk) 00:15, 2 August 2011 (UTC)[reply]

The view that Commons is for Wikipedia is not very popular here. A lot of people insist that Commons should be viewed as a media repository independently of its value for Wikipedia. The file name is also important as being the caption you read when your mouse hovers on a file name below a thumbnail in a category page. Teofilo (talk) 00:39, 2 August 2011 (UTC)[reply]
4) I have found the following pictures from a batch upload a (171 B), b (170 B), c (176 B), d (177 B), e (174 B), f (176 B), g (180 B), which probably means the uploader did not find these lengths annoying. Teofilo (talk) 00:39, 2 August 2011 (UTC)[reply]

For me, a filename needs to meet 2 requirements: be meaningful and be unique. The second part (<20 characters) provides uniqueness, and the first part is trying to be meaningful, and I think 100 characters is plenty to accomplish that. I find long names to be distracting and awkward, and wikitext using them hard to read. However raising the maximum length of the filename would be by far the simplest way to "fix" the issue. --Jarekt (talk) 03:38, 2 August 2011 (UTC)[reply]

In my view, filenames need to be authentic. If Shakespeare called his play "Romeo & Juliet" you can't rename it "Richard & Julia" because you have a personal liking for these names. If some obscure Office of War Information bureaucrat during World War II decided to call a picture "Members of the 6888th Central Postal Directory Battalion take part in a parade ceremony in honor of Joan d'Arc at the marketplace where she was burned at the stake" you cannot change it. The only alternative would be to use a totally cryptic name, like 43-0194a.gif. I don't think there is a middle unauthentic term between a totally cryptic name and the full authentic name. The argument that the full name is written in the "title" field of the template anyway, fails to convince me, because putting an unauthentic name in a more prominent place than the authentic name remains an aggression against authenticity. The choosing of a long caption or name in association with a picture by some administration during World War II is a historical fact. Even if you find that fact distracting or ugly, you can't change it. By the same token, some pictures happen to be ugly. But for authenticity's sake one should not retouch an ugly historical picture to make it look nicer. If a picture has an ugly title, you can't change it either. You can't retouch "Romeo & Juliet". Teofilo (talk) 09:17, 2 August 2011 (UTC)[reply]
For this file and this one, key information (location and year) is cut. Teofilo (talk) 16:04, 2 August 2011 (UTC)[reply]
It is quite clear by now what your opinion is, Teofilo. What we are looking for is other opinions to see if anyone actually agrees with you. Dominic (talk) 20:17, 2 August 2011 (UTC)[reply]
The only absolute criteria for filenames are 1) uniqueness (easily done with the ARC) and 2) length is under the technical limit (easily done by truncation). All other considerations are cosmetic, as the full metadata is listed in the info template. The filename is just a key for the file database: it doesn't have to contain a perfect description of the image, most files at Commons don't. To be honest, we could call all images "NARA image - ARC 123456.tiff" and be done with it. So I don't think it matters where we chop the description. I'd lean towards shorter, as long filenames can be a pain at Wikisource (we have the full name in the Page: namespace, for example), but that is a minor gripe. The metadata will always be in the info area, and only the ARC is required to uniquely identify the image. So, I'd say truncate at whatever is most convenient. Inductiveload (talk) 23:29, 2 August 2011 (UTC)[reply]
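The two criteria above (uniqueness from the ARC identifier, length under the technical limit) can be sketched as a single name-building step. This is a sketch under assumptions rather than the script's code: it follows the "<title> - NARA - <ID>.tif" pattern visible in the existing uploads, the function name is hypothetical, and the 255-byte default and what exactly counts toward it (here, only the string the function returns) should be checked against the actual limit.

```python
def build_file_name(title, arc_id, ext="tif", max_bytes=255):
    """Return '<title> - NARA - <arc_id>.<ext>', trimming only the title to fit max_bytes."""
    suffix = " - NARA - {}.{}".format(arc_id, ext)
    budget = max_bytes - len(suffix.encode("utf-8"))
    if budget <= 0:
        raise ValueError("max_bytes is too small for the identifier suffix")
    raw = title.encode("utf-8")[:budget]
    # errors="ignore" drops a multi-byte character that the cut split in half.
    trimmed = raw.decode("utf-8", errors="ignore").rstrip()
    return trimmed + suffix


if __name__ == "__main__":
    print(build_file_name("Example title", 532895))
```

Because the identifier is appended after trimming, two records with the same or similarly truncated titles still get distinct names.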

This file name cut removed the most important part: Captain Harry Truman. Teofilo (talk) 21:40, 3 August 2011 (UTC)[reply]

Teofilo, you provided dozens of examples of trimmed filenames. However, to me the only issue with those is that they are too long. I agree with Inductiveload that "filename is just a key for the file database" and that descriptions can be found inside file descriptions. --Jarekt (talk) 02:39, 4 August 2011 (UTC)[reply]
You wrote "I agree the file name issue should be fixed" above on this page on 1 August (diff). If you agree with Inductiveload that "filename is just a key for the file database", what is the issue which you want to fix ? Or have you changed your mind since 1 August ? Teofilo (talk) 12:58, 4 August 2011 (UTC)[reply]
Note that truncated names now only terminate at the end of complete words and include a "..." when there is any truncation. Dominic (talk) 04:31, 4 August 2011 (UTC)[reply]

File matching tool

I think we need a developer for the development of a file matching tool. That tool would use an interface similar to that of Cat-a-lot, with the possibility to select two files from a gallery page. Then the tool would

This does not make sense to me. What gallery page? How will non-identical versions be detected by a bot? The eventual plan is to add JPG/DjVu versions of all these files by bot, so they will all have linked file in "Other versions" that will be usable on the projects at some point. Dominic (talk) 16:13, 11 August 2011 (UTC)[reply]

Author information retrieving bot

We need a bot to explore systematically all www.archives.gov pages similar to http://www.archives.gov/research/military/ww2/photos/ in order to retrieve author information. At present such author information is not provided by the upload bot. Perhaps it is simpler to do this separately with another bot. I think I am personally getting tired of adding this information manually (for example, see this diff). Teofilo (talk) 12:08, 5 August 2011 (UTC)[reply]

Those are not structured pages and I see no way for a bot to extract author information from them. There are some tasks that simply require a human. Dominic (talk) 15:43, 5 August 2011 (UTC)[reply]
All captions from http://www.archives.gov/research/military/ww2/photos/ (example : "Danny Kaye, well known stage and screen star, entertains 4,000 5th Marine Div. occupation troops at Sasebo, Japan. The crude sign across the front of the stage says: `Officers keep out! Enlisted men's country.'" Pfc. H. J. Grimm, October 25, 1945. 127-N-138204) and similar pages should be extracted (by a bot or human) and put into the left column of a table. Then a bot should say if the file was uploaded on Commons or not, and if so, provide a link to the file uploaded on Commons, and say if the |author= is still void. Then humans could pickup the author name from the full caption. This would ensure that this is done in a systematic way, and that no chance was missed to find author names. Teofilo (talk) 15:30, 6 August 2011 (UTC)[reply]

Actually a bot could compare the string of characters in the full caption at http://www.archives.gov/research/military/ww2/photos/ and the string of characters in the |title= field on Commons. For example, comparing ["Danny Kaye, well known stage and screen star, entertains 4,000 5th Marine Div. occupation troops at Sasebo, Japan. The crude sign across the front of the stage says: `Officers keep out! Enlisted men's country.'" Pfc. H. J. Grimm, October 25, 1945. 127-N-138204] with [|Title=Danny Kaye, well known stage and screen star, entertains 4,000 5th Marine Division occupation troops at Sasebo, Japan. The crude sign across the front of the stage says: "Officers keep out! Enlisted men's country."] would reveal that "Pfc. H. J. Grimm, October 25, 1945. 127-N-138204" was left out. After all left out parts are neatly listed in a table by a bot, humans could try to figure out what they can do with them. Teofilo (talk) 15:44, 6 August 2011 (UTC)[reply]
I think you missed the point. How do you know what to compare? You have a Commons image file, and then you have a string of characters on a random webpage. If a human has to find http://www.archives.gov/research/military/ww2/photos/ and point the script to the line on the page that has the information, it kind of defeats the purpose. Dominic (talk) 17:10, 8 August 2011 (UTC)[reply]
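For the narrower step of comparing one caption with one |title= value, once the pair is already known (which, as noted above, is the hard part and is not addressed here), a rough sketch using Python's standard difflib module might look like this. The function name is hypothetical, and the approach is a heuristic: small wording differences such as "Div." versus "Division" are tolerated, but an accidental match late in the caption would shorten the reported leftover.

```python
from difflib import SequenceMatcher


def caption_leftover(caption, title):
    """Return the tail of caption that follows the last stretch also found in title."""
    # autojunk=False keeps frequent characters (spaces, common letters) in play,
    # which matters for strings of 200 characters or more.
    matcher = SequenceMatcher(None, caption, title, autojunk=False)
    blocks = [block for block in matcher.get_matching_blocks() if block.size > 0]
    if not blocks:
        return caption.strip()
    last = blocks[-1]
    return caption[last.a + last.size:].strip()


if __name__ == "__main__":
    # Hypothetical, shortened example in the style of the captions discussed above.
    print(caption_leftover("A cat sat. J. Doe, 1945.", "A cat sat."))
```

The leftover strings could then be listed in a table for humans to review, since deciding which part is an author name, a date or an identifier is not something this comparison can do.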

Categorizing progress statistics software

[concerning Commons:National Archives and Records Administration/Categorize/Progress ]

Hello,

Would it be possible for BernsteinBot to compile more data ? At present the "categorized" column on, for example, this page only provides a boolean "categorized" YES/NO parameter. Would it be possible to retrieve the number of added categories and to calculate the percentage of files with 2 or more categories, with 3 or more categories, etc... ? Especially if the number of categories is only one, I consider that the job is not finished. Files should have at least 2 or 3 categories, in most cases. It would be good to have a way to find the files with only one category, so that people can quickly go to those files to finish the job. Teofilo (talk) 12:58, 1 August 2011 (UTC)[reply]

The above is a copy of a message I left on Bernsteinbot's owner talk page Teofilo (talk) 12:12, 5 August 2011 (UTC)[reply]

I think we need also statistics to control whether the |author field has been completed or is still left blank. Teofilo (talk) 12:19, 5 August 2011 (UTC)[reply]

It operates based on normal Commons procedure. Files are either uncategorized or they're not. I don't see much evidence for your opinion that files with only one category are "unfinished". It would be nice to collect some of these statistics for measuring outcomes, but I'm not convinced it would be very useful (or very much used) by people categorizing. It's certainly not a pressing need. Dominic (talk) 17:05, 8 August 2011 (UTC)[reply]
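The statistics requested in this thread amount to simple counting once the number of categories per file is known. The sketch below assumes that number has already been collected by some other means into a mapping from file name to category count; how BernsteinBot gathers its data is not shown here, and the function names and sample data are hypothetical.

```python
def share_with_at_least(category_counts, minimum):
    """Return the percentage of files that have at least `minimum` categories."""
    if not category_counts:
        return 0.0
    hits = sum(1 for count in category_counts.values() if count >= minimum)
    return 100.0 * hits / len(category_counts)


def files_with_exactly(category_counts, count):
    """Return the sorted names of files that have exactly `count` categories."""
    return sorted(name for name, n in category_counts.items() if n == count)


if __name__ == "__main__":
    counts = {"a.tif": 1, "b.tif": 2, "c.tif": 3, "d.tif": 0}  # hypothetical data
    print(share_with_at_least(counts, 2))
    print(files_with_exactly(counts, 1))
```

Whether maintenance or series categories should be included in the count is a separate decision that would change the numbers considerably.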

Using en language templates

Dunno if there'll be any further bot edits to the already uploaded images, but I guess there will. So if there is a chance, could someone please add {{en|…}} around the descriptions (title and general notes)? I'm a bit surprised that this (apparently) didn't happen already on upload. Using the template would make future translations a bit easier, and is generally recommended here on Commons for internationalization issues (even if it's only regarded as helpful for users who don't speak English, to allow quick and easy identification of the language used). Many thanks in advance --:bdk: 14:32, 20 August 2011 (UTC)[reply]
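The wrapping itself is a small string operation. A minimal sketch, assuming the bot already has the title or general-notes text as a plain string (the function name is hypothetical; the explicit "1=" is used so that an equals sign inside the text is not read as a parameter name):

```python
def wrap_en(text):
    """Wrap text in {{en|1=...}} unless it is empty or already wrapped in {{en|...}}."""
    stripped = text.strip()
    if not stripped or stripped.lower().startswith("{{en|"):
        return text
    return "{{en|1=" + stripped + "}}"


if __name__ == "__main__":
    print(wrap_en("Side view of adobe house with water in foreground"))
```

Applying this to files that are already uploaded would additionally require locating the right template parameters in each page's wikitext, which is not covered here.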

Resolutions

This page is getting very unwieldy. I am going to be marking and collapsing threads that seem to be resolved so that it is easier to navigate the page and see what needs to be addressed. If anyone feels that I have erroneously marked something as resolved, please feel free to uncollapse it and say so. Dominic (talk) 17:45, 11 August 2011 (UTC)[reply]

I marked general questions of categorization as resolved, as we have developed a process for assisting editors in categorizing. Every image uploaded is given {{Uncategorized-NARA}}, which places it in Category:Media contributed by the National Archives and Records Administration. Each file is also automatically placed in a category for its NARA series. We have an automatically updated project page at Commons:National Archives and Records Administration/Categorize/Progress where Commons editors can see the progress of per-series categorizing and navigate down to a list of individual images that need categorizing. In this way, hopefully adding topical categories for all files will be manageable. Dominic (talk) 18:43, 11 August 2011 (UTC)[reply]
Open issues

I am trying to summarize the issues that are in any way open, so we can bring some closure to this and the uploading can be completely above board.

  1. Can we automatically match NARA author data with Creator: templates and categories on Commons? — I'd like to work on this, but it can be done within the template, so it doesn't need to block uploads.
  2. Do we want to move the "NARA - <ID> - " part of file names to the front? — It will stay as is unless we hear from more people that they want this.
  3. Storing metadata on a separate template. — I wasn't entirely sure how useful or even possible this is, so I have left it alone in case others have thoughts.
  4. Teofilo's requests:

It seems to me that all of these fall into the category of things that can be worked on during/after the actual upload of files, with the possible exception of the file name lengths. However, that and several others either do not seem very well supported or thought out. New comments, even if it's just simple agreement or disagreement, would help clarify the level of support. Dominic (talk) 19:09, 11 August 2011 (UTC)[reply]

Uploaded | Progress | Recent uploads | Category
1,524,016 | 617% | Gallery | Category:Media contributed by the National Archives and Records Administration

617.4% completed (estimate)

Assigned to | Progress | Bot name | Category
Dominic | 617% | US National Archives bot | Category:Media contributed by the National Archives and Records Administration