User:Rama/Structured Data

Most people have by now heard of Wikipedia, the Free encyclopedia, but the Wikimedia Foundation supports a number of other projects of which the general public may be less aware. Among these are Wikimedia Commons, a Free media repository that provides images, sound files and videos; and Wikidata, a knowledge base and ontology for the semantic web. All these projects have a life of their own, but their function is most easily explained by the services they render to Wikipedia.

Commons provides a single repository whose images are available as if they were stored locally on any given Wikipedia: there is therefore no need to upload the same image to the Wikipedias in various languages for it to be displayed in the French-, German- and English-language articles. This centralisation also allows for more efficient management of file metadata such as licences, descriptions, etc. From this has arisen a thriving community that encourages better-quality images through competitions and peer-awarded labels, and more volume through partnerships or special events that promote photography, such as Wiki Loves Monuments, Wiki Loves Earth, Wikicheese, etc.

Wikidata is a knowledge base that uses the Wikibase software to store information as identifier-property-value triples (for instance, the object Q684661, the Jet d'Eau in Geneva, has a "location" property whose value is "Geneva"). Originally, its function was to centralise "interwikis", that is, links between equivalent Wikipedia articles in various languages (for instance en:apple has links to fr:pomme and als:Öpfel), much like Commons pools images. It was soon realised that the place that stored interwikis could also hold information: an object would therefore be created with an arbitrary identifier (Q89 for the apple) and labels in several languages ("apple" for English, "pomme" for French, etc.). This object could then be associated with an unlimited quantity of further information, coded as triples. For instance, to transcribe the notion that "an apple may be red", we would build the triple "apple"-"colour"-"red" (on Wikidata, the corresponding identifiers would be Q89-P462-Q3142); to further note that an apple might just as well be green, we would simply add another triple, Q89-P462-Q3133, and so on.
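
To make the idea of triples concrete, here is a minimal sketch in Python of the apple example. It is only a toy in-memory structure and says nothing about how Wikibase actually stores its data; the class name TripleStore and its methods are invented for this illustration, while the identifiers are those quoted above.

```python
from collections import defaultdict

# Identifiers taken from the text: Q89 = apple, P462 = colour,
# Q3142 = red, Q3133 = green.
APPLE, COLOUR, RED, GREEN = "Q89", "P462", "Q3142", "Q3133"


class TripleStore:
    """Toy in-memory store (hypothetical); not how Wikibase is implemented."""

    def __init__(self):
        # Maps (item, property) to the list of values recorded for it.
        self._values = defaultdict(list)

    def add(self, item, prop, value):
        """Record one item-property-value triple, ignoring exact duplicates."""
        if value not in self._values[(item, prop)]:
            self._values[(item, prop)].append(value)

    def values(self, item, prop):
        """Return every value stored for this item and property."""
        return list(self._values.get((item, prop), []))


store = TripleStore()
store.add(APPLE, COLOUR, RED)    # "an apple may be red"
store.add(APPLE, COLOUR, GREEN)  # "an apple may also be green"
print(store.values(APPLE, COLOUR))  # ['Q3142', 'Q3133']
```

The point of the sketch is that a second colour is simply one more triple: nothing about the first statement has to change.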

Structuring information in Wikidata yields multiple advantages. To begin with, Wikidata is intrinsically multilingual, and so is any project that makes use of its services. It also allows users to perform complex searches using the SPARQL query language, which returns lists of objects that have the properties and satisfy the conditions specified by the user (for instance, "fruits that may be red but not green, and with pips rather than stones"). Furthermore, a wealth of visualisation tools lets users choose maps when data bear geographical coordinates ("birthplaces of female authors who are former students of the University of Edinburgh"), timelines when they bear dates ("founding dates of European universities"), etc.
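
As an illustration of what such a search might look like, the Python sketch below builds the text of a SPARQL query covering only the colour part of the example ("may be red but not green"); the conditions on fruits and pips are left out because the identifiers they would need are not given here. The function name colour_query and its parameters are invented for this sketch. The wd:, wdt:, wikibase: and bd: prefixes are those predefined by the Wikidata Query Service, and the code only produces the query text without sending it anywhere.

```python
import re


def colour_query(required, excluded, language="en", limit=20):
    """Build a SPARQL query for items that have one colour but not another.

    `required` and `excluded` are Wikidata item identifiers such as "Q3142".
    This helper is a hypothetical illustration, not part of any library.
    """
    for identifier in (required, excluded):
        # Refuse anything that is not a plain item identifier.
        if not re.fullmatch(r"Q\d+", identifier):
            raise ValueError(f"not a Wikidata item identifier: {identifier!r}")
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item wdt:P462 wd:{required} .\n"
        f"  FILTER NOT EXISTS {{ ?item wdt:P462 wd:{excluded} }}\n"
        "  SERVICE wikibase:label "
        f'{{ bd:serviceParam wikibase:language "{language}" . }}\n'
        "}\n"
        f"LIMIT {limit}"
    )


# Items whose colour (P462) may be red (Q3142) but not green (Q3133),
# with labels requested in French.
print(colour_query("Q3142", "Q3133", language="fr"))
```

Changing the language argument is enough to ask for the same results with labels in another language, which is the multilingual property described above.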

Commons is currently organised in an informal and unstructured manner, which makes it difficult to search for files, and thus to improve or make use of its content. The Structured Data project is an initiative that aims to deploy Wikibase, the software engine behind Wikidata, on Wikimedia Commons. Each file on Wikimedia Commons is indeed an object in the sense of Wikidata, one that can be described with properties and values. For instance, any given photograph has a subject, which can point to another Wikidata item; a shutter time (a number of seconds); a licence (another Wikidata object); etc.

Many of these properties are already provided automatically: a file from a modern digital camera holds EXIF metadata that detail such elements as the time of the shot, photographic parameters (focal length, shutter time, sensor sensitivity, aperture, etc.), the model of the lens, and even the location of the camera when it is equipped with a GPS receiver. Other properties must be provided by a human (the subject, for instance), or even specifically by the copyright holder (the licence, which is legally binding).
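
The split between automatic and human-supplied properties can be sketched as follows in Python. The keys on the left of the mapping are standard EXIF tag names; the property names on the right, the function describe_file and the sample values are all invented for this illustration and do not correspond to actual Commons or Wikidata properties.

```python
# Standard EXIF tag names mapped to illustrative property names.
# The names on the right are hypothetical, not real Wikidata properties.
EXIF_TO_PROPERTY = {
    "DateTimeOriginal": "time of shot",
    "FocalLength": "focal length",
    "ExposureTime": "shutter time",
    "ISOSpeedRatings": "sensor sensitivity",
    "FNumber": "aperture",
    "LensModel": "lens model",
}

# Properties that only a person (or the copyright holder) can supply.
HUMAN_PROPERTIES = ("subject", "licence")


def describe_file(exif, human_input=None):
    """Return (statements, missing) for one file.

    `statements` merges what the EXIF data provides with what a person
    entered; `missing` lists the human-only properties still to be filled.
    """
    human_input = human_input or {}
    statements = {
        prop: exif[tag] for tag, prop in EXIF_TO_PROPERTY.items() if tag in exif
    }
    for prop in HUMAN_PROPERTIES:
        if prop in human_input:
            statements[prop] = human_input[prop]
    missing = [prop for prop in HUMAN_PROPERTIES if prop not in statements]
    return statements, missing


# Made-up sample values, for illustration only.
sample_exif = {"ExposureTime": 0.004, "FocalLength": 50.0}
statements, missing = describe_file(sample_exif, {"subject": "Q1374"})
print(statements)
print(missing)  # ['licence']
```

Here the camera fills in the technical fields, a person names the subject, and the licence remains flagged as missing until the copyright holder states it.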

Using Wikibase on Commons might also let users enjoy the power of Wikidata objects in file upload forms: this would allow users with limited or no command of English to contribute in their own language. Drop-down menus could eventually appear in their own language for fields with a limited number of options (such as licences). For more complex properties, such as the image description, a field identical to those found on Wikidata could allow the user to enter a value and have their entry auto-completed into a drop-down menu proposing known Wikidata objects: for instance, entering "Cervin" would yield a list with choices for the family name "Cervin", for a hill in Antarctica, and for the Italian-Swiss mountain; by selecting the latter, the user would link their image to object Q1374 on Wikidata, and the caption could automatically be displayed as "Matterhorn" to German and English speakers, but also as "Маттерхорн" to Russophones and "マッターホルン" to Japanese speakers.
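
A rough Python sketch of this behaviour is given below. The labels for Q1374 are those quoted above, plus the French name typed by the user; the two other entries use placeholder identifiers and labels, since their real ones are not given here, and the functions suggest and display_label are invented for the illustration.

```python
# Labels for Q1374 come from the text. The two other entries use placeholder
# identifiers and labels (assumptions), as the text does not give real ones.
ITEMS = {
    "Q1374": {"fr": "Cervin", "en": "Matterhorn", "de": "Matterhorn",
              "ru": "Маттерхорн", "ja": "マッターホルン"},
    "Q-placeholder-family-name": {"fr": "Cervin", "en": "Cervin"},
    "Q-placeholder-antarctic-hill": {"fr": "Cervin", "en": "Cervin"},
}


def suggest(typed, lang):
    """Return (identifier, label) pairs whose label in `lang` starts with `typed`."""
    needle = typed.casefold()
    return [
        (item_id, labels[lang])
        for item_id, labels in ITEMS.items()
        if lang in labels and labels[lang].casefold().startswith(needle)
    ]


def display_label(item_id, lang, fallback="en"):
    """Label of an item in the reader's language, else in the fallback language."""
    labels = ITEMS[item_id]
    return labels.get(lang, labels[fallback])


# A French speaker types "Cerv" and is offered three candidates.
for candidate in suggest("Cerv", "fr"):
    print(candidate)

# Once Q1374 is chosen, each reader sees the label in their own language.
print(display_label("Q1374", "ru"))  # Маттерхорн
print(display_label("Q1374", "ja"))  # マッターホルン
```

What gets stored with the file is the identifier, not the text typed by the uploader, which is why the caption can later be shown in any language that has a label.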

This project also opens tremendous opportunities for information retrieval on Commons. Search is currently limited to string matching in filenames and descriptions. With Structured Data, it would become possible to issue multiple-criteria searches, such as "images that have Q12495 as a subject, ordered by date", so as to retrace the construction steps of the Burj Khalifa tower in Dubai, or to place images of a ship on a world map so as to follow her journeys. We could even envision complex queries that make use of the geographic coordinates of the camera and of the subject to compute the angle of the picture.
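
As an example of the last idea, the Python sketch below computes the compass direction in which the camera was pointing, given the coordinates of the camera and of the subject. Reading "the angle of the picture" as this bearing is an assumption, as is treating the Earth as a sphere; the function name bearing_degrees is invented for the illustration. With latitudes φ1, φ2 and longitude difference Δλ, the usual formula for the initial bearing is atan2(sin Δλ · cos φ2, cos φ1 · sin φ2 − sin φ1 · cos φ2 · cos Δλ).

```python
import math


def bearing_degrees(camera_lat, camera_lon, subject_lat, subject_lon):
    """Initial compass bearing from camera to subject, in degrees.

    0 is north, 90 east, 180 south, 270 west. Inputs are decimal degrees.
    Assumes a spherical Earth, which is enough for a rough direction.
    """
    phi1 = math.radians(camera_lat)
    phi2 = math.radians(subject_lat)
    delta_lambda = math.radians(subject_lon - camera_lon)

    y = math.sin(delta_lambda) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(delta_lambda))

    # atan2 returns an angle between -180 and 180 degrees;
    # the modulo folds negative angles into the 0-360 range.
    return math.degrees(math.atan2(y, x)) % 360.0


# A subject due east of the camera, both on the equator.
print(round(bearing_degrees(0.0, 0.0, 0.0, 1.0)))  # 90
```

Sorting the photographs of one subject by this value would, for instance, arrange them as a walk around it.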

Categorisation is another function for which much can be expected. On Commons, images are grouped into categories and sub-categories, which become more and more specific as the number of images grows. For instance, a rather uncommon subject, such as the little 5000-inhabitant town of Cervino, in Italy, has a single category to itself, with a few images in it. A subject with more photographs, such as the 760-inhabitant town of Esino Lario, smaller but famous for hosting Wikimania in 2016, has its numerous images subdivided into many largely arbitrary subcategories, which are labelled only in English, and in which searching for a specific image is not straightforward. The running joke on the subject is the phrase "looking left", an allusion to the many bizarre subcategories such as "women looking left". With Wikibase, these arbitrary and English-only categories would be replaced with properties in unlimited numbers, which do not interfere with one another, and which can be taken into account or not according to the needs of the users.

This ambitious project might bear its first fruits around 2018 or 2019, with the Wikibase engine going live and the files available on Commons starting to be processed.


Thanks to Sandra Fauconnier for reviewing the draft and for her many useful remarks.