Commons:Batch uploading/Bibliothèque Nationale de France

French National Library

Thanks to the partnership between Wikimédia France and the French National Library (BnF, Bibliothèque Nationale de France) [1] [2] [3], we will have to upload 1416 DjVu files to Commons. The total volume is about 100 GB.

We have a server on which we must first create the DjVu files (the BnF gave us image files plus OCR files). I'm not sure, but some files may be larger than 100 MB (the limit for HTTP upload). Would it be possible for someone to upload these DjVu files from our server on May 15 or shortly after? Thanks a lot ~ Seb35 [^_^] [mail] 18:38, 2 May 2010 (UTC)

UPDATE: the files should be available in a few days (Wednesday?). Is it possible to get a contact or an answer quickly? We must return the server (which we rent) on Monday, 24 May. Otherwise we could rent it for another month, but we would prefer to stop on the 24th. ~ Seb35 [^_^] 20:20, 16 May 2010 (UTC)

Opinions

  • Support: the first sample can be found here: File:Anomyme - Raoul de Cambrai.djvu. VIGNERON * discut. 00:29, 13 May 2010 (UTC)

We hope we can make a test upload very soon. Early feedback on the sample Vigneron pointed to would be much appreciated (for obvious things we might have missed). Jean-Fred (talk) 23:27, 17 May 2010 (UTC)

Sounds like a nice project. You need to end up with 1416 pairs of files. Each pair would consist of <some file>.djvu, the DjVu file, and <some file>.txt, a UTF-8 encoded text file containing the description of the file in wikitext format. These files need to be available on a publicly accessible website (you can keep the address private) so that one of the sysadmins can download them and import them into Commons.
So how to get there?
You probably have the metadata in some format. You have to use this to generate descriptions. You can find some programs I wrote in the past here, maybe that helps. Is it possible to make this metadata available somewhere so I can comment on it?
You should create a bot account for the import, user:BNFbot is probably a nice username. If you create the account while logged in it's attached to your account (examples).
You should probably add a tracker category to {{BNF cooperation project}}.
Far from complete, but I hope this can get you going. Feel free to contact me on IRC (freenode) if you have any questions. Multichill (talk) 07:08, 18 May 2010 (UTC)
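As an illustration of the "pairs of files" requirement described above, here is a minimal sketch that checks a directory before handing it over. The function name `find_unpaired` and the flat directory layout are assumptions made for this example; they are not part of the project's actual tooling.

```python
from pathlib import Path


def find_unpaired(directory):
    """Check that each .djvu has a matching UTF-8 .txt description.

    Returns three sorted lists of base names:
    (djvu without txt, txt without djvu, txt that is not valid UTF-8).
    Assumes all files sit directly in `directory` (hypothetical layout).
    """
    root = Path(directory)
    djvu = {p.stem for p in root.glob("*.djvu")}
    txt = {p.stem for p in root.glob("*.txt")}

    bad_encoding = []
    for stem in sorted(djvu & txt):
        try:
            # Decoding raises UnicodeDecodeError if the file is not UTF-8.
            (root / f"{stem}.txt").read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad_encoding.append(stem)

    return sorted(djvu - txt), sorted(txt - djvu), bad_encoding
```

Running it once over the export directory should return three empty lists when all pairs are complete.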
Thanks for your comments.
  • Thanks to the hints you gave us on your talk page, our process already produces a tar file containing the DjVu and one text file with the wikitext. We also have a dedicated SSH entry point to the server for the sysadmins (Plyd has the details).
  • We have two sources of metadata:
  • We already wrote some Python code that parses the XML files to produce the following:
    • The Commons description page (metadata, as described above);
    • The Wikisource Book namespace page (metadata);
    • Every Wikisource Page namespace page (text);
    • The DjVu (text and images).
At the moment, we still have some issues with the DjVu, but hopefully these will be resolved today. The rest of the process is okay.
Jean-Fred (talk) 10:45, 18 May 2010 (UTC)
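For readers who want to reproduce the packaging step mentioned above (one tar file holding the DjVu and its wikitext description), here is a minimal sketch using Python's standard `tarfile` module. The function name `pack_book` and the naming of the text file after the DjVu are assumptions for illustration; the project's real code may differ.

```python
import io
import tarfile
from pathlib import Path


def pack_book(djvu_path, wikitext, tar_path):
    """Write a tar archive containing the DjVu and its description.

    The description is stored as UTF-8 text under the same base name
    as the DjVu, with a .txt extension (an assumed convention).
    """
    djvu_path = Path(djvu_path)
    data = wikitext.encode("utf-8")

    # Describe the in-memory text file so it can be added without
    # writing it to disk first.
    info = tarfile.TarInfo(name=djvu_path.with_suffix(".txt").name)
    info.size = len(data)

    with tarfile.open(str(tar_path), "w") as tar:
        tar.add(str(djvu_path), arcname=djvu_path.name)
        tar.addfile(info, io.BytesIO(data))
```

Storing both members under their bare file names keeps the archive flat, so the importer does not need to know the directory layout of the build server.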
Looks good! It would probably be nice to create {{Creator}} templates for the authors, such as Creator:Émile Zola, and also add a category for each author. Could you publish the code you've written so other projects can benefit from it? Multichill (talk) 18:14, 18 May 2010 (UTC)
Thanks :-)
  • The authors: unfortunately, our metadata about them is too poor: only names, and sometimes misspelled. VIGNERON has done the huge job of manually checking the authors list at s:fr:Wikisource:Dialogue BnF/Auteurs. With this, we will probably be able to generate proper author names, but we do not have enough metadata to generate decent Creator templates. We're looking into testing whether a Creator template exists and adding it in that case.
  • Yeah, we will definitely publish the code when it is completed.
Jean-Fred (talk) 22:12, 19 May 2010 (UTC)
I like creator templates, so I hacked something together to generate them based on info from Wikipedia. See Commons:Batch uploading/Bibliothèque Nationale de France/Auteurs for the progress. Each creator template should probably be checked and maybe improved, but that doesn't stop the upload from already using them. The corresponding categories should also be created; I could probably do that based on the info from Wikipedia too. Multichill (talk) 16:19, 23 May 2010 (UTC)
I'm done, I used en, fr & de Wikipedia as sources. Multichill (talk) 18:08, 23 May 2010 (UTC)
Wow, that is great. Thanks! Jean-Fred (talk) 00:22, 24 May 2010 (UTC)
I got to this page while investigating why we suddenly have over 150 creator templates in Category:Creator templates without home category, after I recently managed to resolve all the creator pages in that category by creating proper categories and adding them to the existing category tree (see File:Creator Category Schema.png). Creating proper categories would be much easier if the script run by User:Multichill that generated those pages updated them with birth/death locations and full dates of birth/death, which are often present in the articles. Wiki-links to the Wikipedia articles (in the NAME field) and descriptions (like French writer, etc.) would also be great. I realize that what I am asking for is much more than "something hacked together", but I believe it could be used over and over for creators related to other mass uploads. I recently fixed over 200 creator pages created by user:BrooklynMuseumBot by hand, and it is a lot of work. --Jarekt (talk) 17:06, 14 June 2010 (UTC)
I can create the categories, but I was waiting for the checking effort to finish. Just say when I should run and I'll create the categories. Multichill (talk) 17:33, 14 June 2010 (UTC)
I am a newcomer to this discussion. Is there any checking/improving effort going on? --Jarekt (talk) 17:59, 14 June 2010 (UTC)
Afaik User:VIGNERON and User:Yann are working on this (linkie). Multichill (talk) 19:06, 14 June 2010 (UTC)
Welcome Jarekt! The process was delayed because the Wikisource folks wanted to re-check the list (book titles and author names). The checking is now finished, and we plan to launch the generation of the files, like, tomorrow :-). I will upload the file File:Frédéric II de Prusse - Correspondance avec Voltaire, tome 5.djvu later tonight with the new descriptions and metadata. Let us know your opinions on it, but as soon as possible (if you do not mind). It is quite urgent, because it takes about six days to generate the files, and we are behind schedule. We would like to launch the generation tomorrow evening. Jean-Fred (talk) 19:46, 16 June 2010 (UTC)
Gasp. That book is too big, I could not upload it :-/. I updated File:Paris, Gaston - Le roman du comte de Toulouse.djvu so you can see what the description page looks like now. This file is a pretty good example of what metadata we have.
Note: If the book is anonymous, the script uses {{Anonymous}} as author and Category:Anonymous books.
Jean-Fred (talk) 23:14, 16 June 2010 (UTC)
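The anonymous-book rule noted above could look roughly like the following. This is only a sketch: the function name `author_fields` is hypothetical, and the non-anonymous branch (plain author name plus a category named after the author) is an assumption, not the script's confirmed behaviour.

```python
def author_fields(author_name):
    """Return (author wikitext, category wikitext) for a description page.

    An empty or missing author is treated as an anonymous book, which
    gets {{Anonymous}} and Category:Anonymous books as described in the
    note above. The other branch is an assumed, simplified behaviour.
    """
    if not author_name or not author_name.strip():
        return "{{Anonymous}}", "[[Category:Anonymous books]]"
    name = author_name.strip()
    return name, "[[Category:" + name + "]]"
```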
I do not want to nag, but are there any plans to create categories (and categorize them) for creator pages which do not have them? They can be found in Category:Creator templates without home category. I could easily run a bot to create those categories and add category:people by name and {{creator:{{PAGENAME}}}} to them, but my bot cannot add death/birth year categories or categories like category:Writers from France. That would have to be done by hand unless some other bot can do it. --Jarekt (talk) 13:53, 8 July 2010 (UTC)
Yes, I have been doing that, but manually it takes a lot of time. If you could run a bot, you are more than welcome. Yann (talk) 16:34, 8 July 2010 (UTC)
I started my bot. The new categories will be temporarily added to Category:Creator templates to fix until they are checked and categories like category:Writers from France are added. --Jarekt (talk) 03:42, 9 July 2010 (UTC)
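To make the content of those bot-created category pages concrete, here is a minimal sketch of the wikitext such a page might start with, based only on the elements named in this thread. The function name `creator_category_wikitext` is hypothetical, and the bot's real output is not shown here, so treat the exact layout as an assumption.

```python
def creator_category_wikitext(needs_review=True):
    """Build starter wikitext for a creator's home category.

    Includes the creator template and Category:People by name, plus the
    temporary review category while the page still needs manual checks.
    """
    lines = [
        "{{Creator:{{PAGENAME}}}}",
        "[[Category:People by name]]",
    ]
    if needs_review:
        # Temporary holding category until a human adds categories
        # such as profession by country and birth/death years.
        lines.append("[[Category:Creator templates to fix]]")
    return "\n".join(lines)
```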

Test upload

Just came out today.

Reviews

Multichill already raised the following issues:

  • Manual "<year> books" category (done by template = bad): ✓ Done
  • Creator templates and author categories: ✓ Done as far as coding the description pages is concerned, but this relies on the table on Wikisource, which needs to be updated/checked beforehand. Follow-up on the French Scriptorium. Jean-Fred (talk) 22:29, 25 May 2010 (UTC)

First batch ready

Out of the ~1400 books, we already have 777 ready. The book size is on average 180 MB, ranging from 250 MB to 300 MB, so as we suspected we will have to rely on a server-side import. We have everything ready (file + description text, all in a tar file). Could you please point us to the relevant people? Jean-Fred (talk) 20:52, 26 June 2010 (UTC)

Could you please wait a while before downloading the files from our server? Given the large sizes (one book weighs 1 GB and all are several hundred MB), it would be better, IMHO, to review our process, since there may be an alternative that drastically reduces the size: using bitonal images instead of multicolor images, since all pages are black-and-white. ~ Seb35 [^_^] 21:47, 27 June 2010 (UTC)
Yes, 1 GB seems unreasonable for a DjVu file. Yann (talk) 04:42, 28 June 2010 (UTC)

We revised our process, and the DjVus are much lighter now (the average is 11 MiB, with only three above 100 MiB). Much, much better :-)

We just requested a server-side upload from Tim Starling (though the books are small, there are quite a lot of them). Expect 1416 books to hit Commons after the weekend.

Jean-Fred (talk) 23:34, 2 July 2010 (UTC)

I created the category Category:Books provided by the BNF. Yann (talk) 13:26, 8 July 2010 (UTC)
We need a list of files without OCR text. Yann (talk) 17:15, 8 July 2010 (UTC)
I suppose you changed the configuration of the conversion to DjVu. What about sharing it (on s:en:Help:DjVu files or whatever)? Thank you, Nemo 18:23, 20 July 2010 (UTC)
Nothing really fancy: we basically switched from multicolor to shades of gray (long story short; Seb35 has the details).
Our current plan is to finish the technical report of the project next week. It will explain our whole process and our decisions, and will probably be made public soon after. After that, we will clean up our code and put it on FishEye, GitHub or similar, so that others can reuse and improve it. Our idea is to make a central place to build some solid code which could create any DjVu from any standard source (daydreaming a bit ;-)
Jean-Fred (talk) 23:05, 20 July 2010 (UTC)
Great, I love this daydream. :-) --Nemo 19:23, 26 July 2010 (UTC)

Categories

Is there any way the upload bot dealing with these files could check on existing categories before adding new ones? For example, File:Frédéric II de Prusse - Correspondance avec Voltaire, tome 2.djvu, together with 19 other files belonging to the same batch, is tagged Category:Frédéric II de Prusse. This category doesn't exist, but the correct category already does, i.e. Category:Friedrich II of Prussia. I have corrected a few cases manually, such as Category:Sophie Rostopchine, Comtesse de Ségur (instead of Category:Comtesse de Ségur), but after a while it seems like a huge waste of time and I doubt I'll be able to keep up with the bot for very long :-) Mu (talk) 01:28, 14 July 2010 (UTC)

Yes, this should be corrected. I think the upload is complete. Yann (talk) 05:07, 14 July 2010 (UTC)
Thanks for the info - too bad I missed the bus. But maybe the point could be noted in case of future uploads. - Mu (talk) 10:48, 14 July 2010 (UTC)
Most of those bot-created categories have to be checked by hand and merged with the existing category structure; multiple names for the same person are quite common. Right now over 100 categories are waiting in (a temporary category I used before, whose name is no longer correct) to be checked for merging with other categories and for inclusion in the profession-by-country category tree. --Jarekt (talk) 12:55, 14 July 2010 (UTC)
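For future uploads, the check requested at the top of this section could be sketched as below. This is an illustration under stated assumptions, not the upload bot's code: `resolve_category` is a hypothetical helper, `existing` stands for a set of category names already known to exist on Commons, and `aliases` is a hand-maintained table of alternative names.

```python
def resolve_category(proposed, existing, aliases):
    """Pick the category to use for a file and say why.

    Returns (category name, status), where status is one of
    "exists", "alias" or "needs review".
    """
    if proposed in existing:
        return proposed, "exists"
    target = aliases.get(proposed)
    if target in existing:
        # A known alternative name maps to an existing category.
        return target, "alias"
    # Nothing matched: keep the proposed name but flag it for a human.
    return proposed, "needs review"


existing = {"Friedrich II of Prussia"}
aliases = {"Frédéric II de Prusse": "Friedrich II of Prussia"}
print(resolve_category("Frédéric II de Prusse", existing, aliases))
```

The alias table still has to be built by hand, since the same person is often known under several names, but flagging unmatched names before the upload would avoid creating duplicate categories.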

TIFFs

Hi, we intend to upload the TIFF files (the original files given by the BnF), which would make it possible to rework these files if needed. The TIFF handler is not functional yet, but it seems it will be in some time [4]. Jean-Fred has already uploaded one test file, probably with the same metadata/wikitext we put with the DjVu files; that should be a good and easy way to do it. ~ Seb35 [^_^] 20:40, 2 August 2010 (UTC)

Don't forget to add links between the files in the other versions field. Multichill (talk) 20:50, 2 August 2010 (UTC)
Seb is right, I used the very same wikitext as for the DjVu.
@Multichill: Yeah, but {{Book}} does not have such a field. Should be trivial to add though.
Question: Should we dump the files in the same tracking category, or set up a different one?
Jean-Fred (talk) 21:01, 2 August 2010 (UTC)
I added the Other versions parameter in the program. For the category, is Category:Scanned French books in TIFF correct? And I think we can use the same Category:Books provided by the BNF with the same template. ~ Seb35 [^_^] 07:50, 16 August 2010 (UTC)
Do we keep the category "Scanned French books in TIFF" or not? Since we do not have much time before we have to return the server, I propose to replace the template "BnF cooperation project" with "BnF cooperation project/TIFF" (which would simply be a redirect) so that we can make some adjustments after the upload and possibly subst/replace it with something else later. I intend to launch the computation of the notices this evening and upload in the next few days. ~ Seb35 [^_^] 19:10, 19 August 2010 (UTC)
I intend to upload the files tomorrow evening (25/08/2010 17:00 UTC). If you have any objection to the first upload, File:Sand - La dernière Aldini. Simon.tif, please say so before then. One small point we have not discussed is the extension, .tif or .tiff; I prefer the former for better compatibility with older systems such as Windows 98. ~ Seb35 [^_^] 22:03, 24 August 2010 (UTC)
OK for me. I created the category. Yann (talk) 03:49, 25 August 2010 (UTC)
The upload is finished for 1397 books. The 19 others are bigger than 100 MiB and cannot be uploaded via an external bot; they are currently on the toolserver since we no longer have our server. I don't know if it is worthwhile to try to upload them. Any suggestions? ~ Seb35 [^_^] 18:22, 30 August 2010 (UTC)
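Separating the files a bot can upload from those that need a server-side import, as described above, can be sketched like this. The 100 MiB threshold comes from this thread; the function name `split_by_size` is an assumption for illustration.

```python
from pathlib import Path

# Upload limit mentioned in this discussion: 100 MiB.
LIMIT_BYTES = 100 * 1024 * 1024


def split_by_size(paths, limit=LIMIT_BYTES):
    """Split files into (within limit, over limit) by size on disk."""
    small, large = [], []
    for path in map(Path, paths):
        if path.stat().st_size > limit:
            large.append(path)
        else:
            small.append(path)
    return small, large
```

The second list is then the set of files to hand over for a server-side import.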

Just added |Other versions= to {{Book}}.

See File:Vinet - Boutmy - Quelques idées sur la création d'une faculté libre d'enseignement supérieur.tif. Should we display the link as a link or as a thumb? If the latter, should we set up a template to i18n the caption « DjVu version »?

Jean-Fred (talk) 16:30, 14 August 2010 (UTC)

A link, in my view: the two versions are pretty similar, so we don't need the thumbnail image. ~ Seb35 [^_^] 07:52, 16 August 2010 (UTC)
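A plain-link value for the Other versions parameter could be generated along these lines. This is a sketch under two assumptions: that the DjVu and TIFF files share the same base name, and that a plain wikilink with the caption "DjVu version" is what is wanted. The function name `other_version_link` is hypothetical.

```python
def other_version_link(tiff_name, caption="DjVu version"):
    """Build a wikilink from a TIFF file name to its DjVu counterpart.

    Only the last extension is replaced, so dots inside the title
    (as in "Sand - La dernière Aldini. Simon.tif") are preserved.
    """
    stem = tiff_name.rsplit(".", 1)[0]
    return "[[:File:" + stem + ".djvu|" + caption + "]]"
```

The leading colon makes the wikilink point to the file page instead of embedding the file, which matches the "link, not thumb" preference expressed above.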

File not public domain

Hello, Please add your opinion here: Commons:Deletion requests/File:Allais - À se tordre.djvu. Illustrations by Pierre Delarue-Nouvelliere (1889-1973). Yann (talk) 07:02, 27 August 2010 (UTC)

Last modified on 11 December 2013, at 10:55