User:Jean-Frédéric/Batch uploading vision

Some notes about my vision of a batch uploading.

Please also read my Python library's README − it goes a long way towards explaining things.

Some notes, then

  • The GLAM provides us with metadata − the format is not important − it may be CSV, XML, embedded IPTC metadata, whatever. Ultimately, metadata should be reducible[1] to attribute−value pairs. We may have things like Cote=51Fi480 or Auteur=Trutat, E.
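As a minimal sketch of that reduction (assuming a hypothetical CSV export − the column names here are just examples), each record becomes a plain attribute−value mapping:

```python
import csv
import io

# Hypothetical CSV export from the GLAM; in practice the source could
# just as well be XML or embedded IPTC metadata.
raw = io.StringIO(
    "Cote,Auteur,Titre\n"
    '51Fi480,"Trutat, E.","[Avignon (Vaucluse). Remparts]"\n'
)

# Reduce every record to attribute-value pairs, whatever the source was.
records = [dict(row) for row in csv.DictReader(raw)]
# records[0] == {'Cote': '51Fi480', 'Auteur': 'Trutat, E.',
#                'Titre': '[Avignon (Vaucluse). Remparts]'}
```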

Data ingestion

The whole idea is to end up with a MediaWiki template structure which reflects the GLAM metadata − for example

{{DataIngestionTemplate
|Type de document=Positif N&B
|_ext=jpg
|categories=[[Category:Avignon]]
|Format= {{Size|cm|8.5|10}}
|_filename=FRAC31555_51Fi480.jpg
|Support=glass
|Technique=photo
|Analyse=Remparts à Avignon (Vaucluse). Fin 19e siècle. Vue perspective des remparts prise du côté ville. Au premier plan à gauche façade de bâtiments, au centre route, charrettes, à droite murs de fortifications.
|Origine=Dépôt de l'association "Les Toulousains de Toulouse et Amis du Vieux Toulouse", le 14/12/2006.
|Cote=51Fi480
|Auteur={{Creator:Jules Lévy}}
|Observations=Moitié d'une vue stéréoscopique, l'autre porte de la cote 51Fi368.
|_url=/home/jfk/Tuile/MyStuff/TrutatBis/images/FRAC31555_51Fi480.jpg
|date={{Other date|end|{{Other date|century|19}}}}
|Titre=[Avignon (Vaucluse). Remparts]
|Réalisé en=
}}

Where {{DataIngestionTemplate}} is, as it says on the tin, a data ingestion template. It passes values through to a real Commons template − be it {{Photograph}}, {{Artwork}}, whatever is appropriate − while putting everything into place.
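Rendering such a template call from the attribute−value pairs is mechanical. A rough sketch (the function name is made up; the real library's API may differ):

```python
def make_ingestion_wikitext(fields, template="DataIngestionTemplate"):
    """Render attribute-value pairs as a call to the data ingestion template."""
    lines = ["{{" + template]
    for name, value in fields.items():
        lines.append("|%s=%s" % (name, value))
    lines.append("}}")
    return "\n".join(lines)

print(make_ingestion_wikitext({"Cote": "51Fi480", "Support": "glass"}))
# → {{DataIngestionTemplate
#   |Cote=51Fi480
#   |Support=glass
#   }}
```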

  • For an easy case, just put a value into its place:
|accession number = {{{Cote|}}}
  • Or apply simple formatting by template-wrapping:
|references = {{Archives municipales de Toulouse - FET link|{{{Cote|}}}}}

Alignment

As explained above, stuff like Cote=51Fi480 is easy − just give it to the data ingestion template.

But Auteur=Trutat, E. is less easy. We could just pass it through and end up with {{Photograph|Auteur = Trutat E.}}, but that’s suboptimal − we want to make the link with Commons metadata. Wouldn’t it be nice if we could somehow make the bot understand that « Trutat, E. » = this guy named Eugène Trutat = Creator:Eugène Trutat? And we could add Category:Photographs by Eugène Trutat while we are at it!

Enter the alignment. First, crawl through the metadata, gather all the values, and put them on Commons as a wikitable − like this or like this. Awesome volunteers then get busy matching that to what exists on Commons. After that, it’s easy enough to give the wikitable back to the bot and order it: « when you find the field named 'Auteur', look it up in this table and substitute the associated value ». Et voilà, we now have Auteur={{Creator:Eugène Trutat}}. And magic being magic, the category gets retrieved too.
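The lookup step could be sketched like this − the alignment table below is a hypothetical hard-coded stand-in for the volunteer-matched wikitable, and the function name is invented for illustration:

```python
# Hypothetical alignment table: field name -> {raw value -> aligned wikitext},
# standing in for the volunteer-curated wikitable on Commons.
ALIGNMENT = {
    "Auteur": {"Trutat, E.": "{{Creator:Eugène Trutat}}"},
}
# Categories retrieved alongside the aligned value.
CATEGORIES = {
    "Trutat, E.": "[[Category:Photographs by Eugène Trutat]]",
}

def align(record):
    """Replace raw field values with their aligned wikitext, adding categories."""
    aligned = dict(record)
    extra = []
    for field, table in ALIGNMENT.items():
        raw = record.get(field)
        if raw in table:
            aligned[field] = table[raw]
            if raw in CATEGORIES:
                extra.append(CATEGORIES[raw])
    if extra:
        aligned["categories"] = aligned.get("categories", "") + "".join(extra)
    return aligned

aligned = align({"Auteur": "Trutat, E.", "Cote": "51Fi480"})
# aligned["Auteur"] == "{{Creator:Eugène Trutat}}"
# aligned["categories"] == "[[Category:Photographs by Eugène Trutat]]"
```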

Post-processing

So after retrieving the metadata, and just before uploading, we go through our metadata and, for given fields, we align. That’s roughly what we refer to as "post-processing". But there’s so much more we can do! We can do complex parsing to identify dates and wrap them into crazy {{Other date}} constructs; take a field and split it into N fields; apply a template to parts of a field value; update a field based on another field… The good news is: most of the logic is hidden in the library. You only have to write the actual code that does stuff.
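A sketch of what one such per-field post-processor might look like − the registry and function names are assumptions, not the library's actual API − using the "Fin 19e siècle" date from the example above:

```python
import re

def process_date(value):
    """Wrap a plain 'Fin 19e siècle'-style date into {{Other date}} (sketch)."""
    m = re.search(r"Fin (\d+)e siècle", value)
    if m:
        return "{{Other date|end|{{Other date|century|%s}}}}" % m.group(1)
    return value

# Hypothetical registry: field name -> post-processing callable.
POST_PROCESSORS = {"date": process_date}

def post_process(record):
    """Run each field through its registered post-processor, if any."""
    return {k: POST_PROCESSORS.get(k, lambda v: v)(v) for k, v in record.items()}

print(post_process({"date": "Fin 19e siècle", "Cote": "51Fi480"})["date"])
# → {{Other date|end|{{Other date|century|19}}}}
```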

But why…

…aren’t you just quick-and-dirty cleaning the data with Excel or Refine or whatever? It could be quicker.
Maybe so. But I want to be able to take the original metadata file, run the program and have the final output − in a word, for the upload to be reproducible.
…isn’t everything done in Python? Why the data ingestion template stuff?
Data ingestion templates are nice for several reasons.
  • No need to customise that part of the code for each batch upload − which is a pain, trust me.
  • Some things are actually easier to do in MediaWiki/parser functions.
  • People can help without getting their hands in the code − they can just edit the template.
…isn’t everything done with the Data ingestion template then? Why the Python post-processing?
Because other things are easier to do in Python ;-þ. Like using crazy regexes to extract {{Other date|circa|1905-07}} from « Place des Carmes, côté nord. Vers juillet 1905. Vue perspective… »
(Parser functions and Lua can go a long way though, and could take over some of this − maybe the alignment matching. But still.)
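That date extraction could be sketched along these lines − the month table and function name are my own assumptions for illustration:

```python
import re

# Hypothetical mapping of French month names to numbers.
MONTHS = {"janvier": "01", "février": "02", "mars": "03", "avril": "04",
          "mai": "05", "juin": "06", "juillet": "07", "août": "08",
          "septembre": "09", "octobre": "10", "novembre": "11", "décembre": "12"}

def extract_circa_date(text):
    """Turn 'Vers juillet 1905' into {{Other date|circa|1905-07}} (sketch)."""
    m = re.search(r"Vers (%s) (\d{4})" % "|".join(MONTHS), text)
    if m:
        return "{{Other date|circa|%s-%s}}" % (m.group(2), MONTHS[m.group(1)])
    return None

text = "Place des Carmes, côté nord. Vers juillet 1905. Vue perspective…"
print(extract_circa_date(text))
# → {{Other date|circa|1905-07}}
```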
…aren’t you using the GlamWiki Toolset? It is awesome!
It is indeed. :-) If the files are available online, I use my MassUploadLibrary to generate an XML file compatible with the Toolset, then use the Toolset with the ingestion template as the custom MediaWiki template. Best of both worlds :)

Footnotes

  1. Not always; I once encountered nested, cross-referenced metadata. That is a pain to process. :-þ