Commons:WMF support for Commons/Commons Impact Metrics/GLAM Prototype Data Model and Implementation Plan
Overview
This document describes the data model for the GLAM metrics prototype to be presented at the GLAM conference 2023 and to be used as a base for the final implementation of the GLAM metrics. It also explains the decisions made when designing the datasets, shows how to use the proposed datasets to extract the currently existing GLAM metrics, and gives implementation suggestions for the calculations to generate those datasets.
None of the choices made here are final; the structure and naming of the datasets and their fields are tentative and open for discussion. So please add any comments you see fit!
Assumptions
- We are focusing on providing easy access to accurate metrics that cover the existing use cases in the community’s GLAM tools, like BaGLAMa2 or GLAM Wiki Dashboard.
- Each GLAM institution has a primary category that contains all their subcategories and media files.
- We have access to the list of GLAM primary categories, either via an allow-list or a similar method.
- There are around 1000 GLAM primary categories, and each of those has on average ~1000 media files. This proxy was collected from BaGLAMa2 and GLAM Wiki Dashboard.
- The main release channels for these datasets are going to be dumps and AQS.
Datasets
This section describes 5 datasets that intend to cover most (if not all) of the existing GLAM tool use cases. The proposed datasets are organized into two types: snapshot-based and incremental.
Snapshot-based datasets
These datasets describe the current state of things at the time the snapshot is taken. For instance, a snapshot of the state of all GLAM Commons categories and their properties as of 2021-03, or a snapshot of the state of all GLAM media files and their properties as of 2023-10.
GLAM Commons categories
This dataset gives information about the GLAM Commons categories at the time the snapshot is taken. It includes both primary categories and all their subcategories, recursively down the tree.
FIELD | TYPE | DESCRIPTION |
---|---|---|
name | string | The category name. |
primary_categories | array[string] | An array with the names of the primary categories this category belongs to. There should normally be only one, but since MediaWiki does not prevent multiple, this field allows it. If the category in question is a primary category, its name will be in the primary_categories array. |
parent_categories | array[string] | An array with the names of the parent categories of this category, if any. |
child_categories | array[string] | An array with the names of the subcategories of this category, if any. |
media_files | int | The number of media files directly included under this category. |
tree_media_files | int | The number of media files included (recursively) under this category tree. |
media_files_used | int | The number of media files included under this category that are displayed in at least one wiki article. |
tree_media_files_used | int | The number of files included (recursively) under this category tree that are displayed in at least one wiki article. |
leveraging_wikis | int | The number of wikis with at least one article displaying at least one media file directly included under this category. |
tree_leveraging_wikis | int | The number of wikis with at least one article displaying at least one media file included (recursively) under this category tree. |
leveraging_articles | int | The number of articles displaying at least one media file directly included under this category. |
tree_leveraging_articles | int | The number of articles displaying at least one media file included (recursively) under this category tree. |
snapshot | string | The snapshot label with YYYY-MM format. |
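For illustration, here is a minimal Hive-style DDL sketch for this dataset. The names and types mirror the tentative model above; partitioning the table by snapshot is an assumption, not a decision.

-- Hypothetical DDL sketch for the categories snapshot dataset.
CREATE TABLE glam_commons_categories (
    name                     STRING        COMMENT 'The category name',
    primary_categories       ARRAY<STRING> COMMENT 'Primary categories this category belongs to',
    parent_categories        ARRAY<STRING> COMMENT 'Direct parent categories',
    child_categories         ARRAY<STRING> COMMENT 'Direct subcategories',
    media_files              BIGINT        COMMENT 'Media files directly included under this category',
    tree_media_files         BIGINT        COMMENT 'Media files included (recursively) under this category tree',
    media_files_used         BIGINT        COMMENT 'Directly included media files displayed in at least one article',
    tree_media_files_used    BIGINT        COMMENT 'Tree media files displayed in at least one article',
    leveraging_wikis         BIGINT        COMMENT 'Wikis using directly included media files',
    tree_leveraging_wikis    BIGINT        COMMENT 'Wikis using media files from this category tree',
    leveraging_articles      BIGINT        COMMENT 'Articles using directly included media files',
    tree_leveraging_articles BIGINT        COMMENT 'Articles using media files from this category tree'
)
PARTITIONED BY (snapshot STRING COMMENT 'The snapshot label, YYYY-MM')
STORED AS PARQUET;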
Data size guess
Depending on how many subcategories a tree has, the size of this dataset will vary. Let’s assume there are 50 subcategories in an average tree (20 images per category), and that each category row is about 500 bytes. The rough uncompressed size of this dataset’s snapshot should be 1000 * 50 * 500 = 25MB.
GLAM Commons media files
This dataset gives information about the GLAM Commons media files at the time the snapshot is taken. It includes category information so that the user can filter by category.
FIELD | TYPE | DESCRIPTION |
---|---|---|
name | string | The media file name. |
media_type | string | The media type (image, video, audio, etc.). |
primary_categories | array[string] | An array with the names of the primary categories this media file belongs to. There should normally be only one, but since MediaWiki does not prevent multiple, this field allows it. |
categories | array[string] | An array with the names of the immediate categories this media file belongs to. |
leveraging_wikis | int | The number of wikis with at least one article displaying this media file. |
leveraging_articles | int | The number of articles displaying this media file. |
snapshot | string | The snapshot label with YYYY-MM format. |
Data size guess
Since we assume there are about 1000 primary categories and 1000 media files per primary category, we expect there to be 1M media files in each snapshot. If each media file row takes 300 bytes, the size of the snapshot should be around 300MB uncompressed.
Design considerations of snapshot-based datasets
- Use category and media file names as unique identifiers instead of IDs. This seems simpler and more useful when reading query results.
- You cannot blindly aggregate metrics across these datasets. The user must always filter or group by category / media file, and must always specify a snapshot; otherwise, the results would contain duplicate counts (see the query sketch after this list).
- These datasets do not include URLs for categories or media files; they can be easily rebuilt from the category or media file names.
- These datasets do not include event-based metrics like media_files_viewed. Those are tied to an interval of time and are not suitable for a snapshot-based dataset; to extract them, the user will have to query the incremental (event-based) datasets.
- Release frequency should be monthly, and thus the snapshots should be monthly (YYYY-MM). The source data for these datasets (MediaWiki tables) is imported monthly and is also snapshot-based.
- Hive should be good enough as underlying tech for the table. I don’t think we need Iceberg here, since monthly snapshots are handled well by Hive. But Iceberg is the default tech now, and we should use it anyway.
- These datasets do not allow reruns or back-filling.
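To make the aggregation caveat above concrete, here is a sketch of a safe aggregation versus one that double-counts. Names follow the tentative model; the snapshot value is only an example.

-- Safe: pick one snapshot and restrict to primary categories, so each
-- category tree is counted exactly once.
SELECT SUM(tree_media_files)
FROM glam_commons_categories
WHERE snapshot = '2023-10'
  AND array_contains(primary_categories, name);

-- Unsafe: summing over all rows counts every media file once per ancestor
-- category, and once more per extra snapshot.
-- SELECT SUM(tree_media_files) FROM glam_commons_categories;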
Incremental datasets
These datasets contain aggregated metrics about GLAM-related events that are tied to a timeline. For instance, views of articles containing media files belonging to a GLAM Commons category, or media file uploads made to a Commons category by a given contributor.
GLAM article views by category
This dataset gives information about the number of views GLAM articles receive over time, broken down by GLAM category. A GLAM article is a wiki article (ns=0 page) that contains at least one media file belonging to a GLAM Commons category.
FIELD | TYPE | DESCRIPTION |
---|---|---|
category | string | The name of the category for which the metrics are aggregated. You should always filter or breakdown by this dimension. |
wiki | string | The wiki that the viewed article belongs to. |
article | string | The name of the viewed article. |
media_from_direct_category | boolean | This is true when the viewed article contains at least one media file belonging to the direct (top) category; it can contain media files from the downstream category tree as well and still be true. Use this when aggregating to count direct category view counts vs. category tree view counts (see the query sketch after this table). |
article_views | int | The number of views the article received in the specified month. |
month | string | The month the article views happened (YYYY-MM). |
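As an illustration of the media_from_direct_category flag, here is a query sketch that separates direct-category views from whole-tree views for one category and month. Names follow the tentative model; the category and month values are only examples.

-- Hypothetical sketch: direct vs. category-tree view counts.
SELECT
    SUM(CASE WHEN media_from_direct_category THEN article_views ELSE 0 END)
        AS direct_category_views,
    SUM(article_views) AS category_tree_views
FROM glam_article_views_by_category
WHERE category = 'My Category'
  AND month = '2021-01';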
Data size guess
This one is more difficult to model… Assuming we have 1M media files, let’s say 900K are used. Let’s also assume the number of GLAM articles equals the number of GLAM used media files. Depending on the granularity of the data, a bigger or smaller share of the articles will receive at least 1 view (which makes them appear in the dataset); let’s imagine 50% of the articles are viewed. We still need to multiply that by the number of subcategories in a category tree, which we said was about 50 on average. However, the two dimensions (article and category) are not orthogonal, so their product would be reduced by a factor; let’s assume it’s 0.25. Finally, we multiply by the number of months in a year. Assuming each record takes 250B, then: 900K * 0.25 * 50 * 0.5 * 12 * 250 = ~17GB/year uncompressed. This calculation is crazy, has too many assumptions, and probably some mistakes on my part, so it could be very far from reality.
GLAM article views by media file
This dataset also gives information about the number of views GLAM articles receive across time. However, it breaks them down by the GLAM media file. A GLAM article is a wiki article (ns=0 page) that contains at least one media file belonging to a GLAM Commons category.
FIELD | TYPE | DESCRIPTION |
---|---|---|
media_file | string | The name of the media file for which the metrics are aggregated. You should always filter or breakdown by this dimension. |
primary_categories | array[string] | The list of GLAM primary Commons categories in whose category tree this media file falls. |
categories | array[string] | The list of Commons categories this media file directly belongs to. |
wiki | string | The wiki that the viewed article belongs to. |
article | string | The name of the viewed article. |
article_views | int | The number of views the article received in the specified month. |
month | string | The month the article views happened (YYYY-MM). |
Data size guess
This one is similar to the previous one in terms of difficulty and fuzziness… We can use the previous data size approximation (GLAM article views by category) to calculate this one. There are two differences: 1) this dataset contains two extra arrays of strings, and 2) in this dataset we have to multiply the media_file dimension cardinality by the article dimension cardinality (instead of category by article). Although the cardinality of the media_file dimension is much greater than that of the category dimension, the two dimensions in question are very aligned: many images appear in articles with the same name. Still, I imagine this dataset will be a bit bigger than the previous one; let’s say it’s double the size: ~34GB/year uncompressed.
GLAM Commons edits
This dataset contains a record for each edit (creation, update, deletion) to a Commons media file. It also records the user who made the edit and the categories the media file belongs to.
FIELD | TYPE | DESCRIPTION |
---|---|---|
user | string | The username of the contributor who edited the media file. |
edit_type | string | The type of edit: create, update, delete. |
media_file | string | The filename of the media file. |
primary_categories | array[string] | A list of primary GLAM category names whose category tree includes the media file. |
categories | array[string] | A list of category names that the media file directly belongs to. |
timestamp | timestamp | The timestamp of the edit. |
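For example, assuming the edits are stored unaggregated (as suggested in the design considerations below), a sketch of a query counting monthly uploads per primary GLAM category could look like this. Names follow the tentative model.

-- Hypothetical sketch: monthly media file uploads per primary GLAM category.
SELECT
    primary_category,
    date_format(`timestamp`, 'yyyy-MM') AS month,
    COUNT(*) AS uploads
FROM glam_commons_edits
    LATERAL VIEW explode(primary_categories) pc AS primary_category
WHERE edit_type = 'create'
GROUP BY primary_category, date_format(`timestamp`, 'yyyy-MM')
ORDER BY primary_category, month;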
Data size guess
For this dataset we can assume that the number of edits is going to be close to the number of media files, or a bit above that considering updates and deletes, say +30%. Let’s assume each record takes about ~300B. Times 1M media files, the overall size will be around ~400MB uncompressed.
Design considerations of incremental datasets
- The two article views datasets have the potential to become medium-big (in cluster terms). But in terms of downloadable data via dumps, they can become quite big if we choose to release them at a daily granularity. Thus, I believe monthly granularity is better (to start with at least), even if it doesn’t show possible daily trends in the data.
- We can store the edits dataset without aggregation, since I believe it will be small enough.
- In these datasets you can aggregate across most dimensions without breaking the data (i.e. without counting duplicates). The only ones that you cannot aggregate across are `category` in “GLAM article views by category” and `media_file` in “GLAM article views by media file”.
- The data size guesses for the article views datasets are wild; take them with a bucket of salt.
- Again, use category names, media file names and user names as unique identifiers. It’s simpler and more direct to chart.
- Release frequency: Since the snapshot-based datasets are monthly, and these (incremental) datasets probably should have monthly granularity for size reasons, I guess the best would be to release these monthly as well.
- Tech: In this case, I think Iceberg would be a better option. We could partition the article views datasets by both category / media file and time, since we always want to either filter or break down by category / media file. Also, because these datasets are incremental, we can take advantage of Iceberg (see the DDL sketch after this list).
- These datasets allow reruns or back-filling.
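As a sketch of the partitioning idea in the Tech bullet above, assuming Spark with Iceberg and the tentative field names, the by-category views table could be declared roughly like this:

-- Hypothetical Iceberg DDL sketch, partitioned by category and month.
CREATE TABLE glam_article_views_by_category (
    category                   STRING,
    wiki                       STRING,
    article                    STRING,
    media_from_direct_category BOOLEAN,
    article_views              BIGINT,
    month                      STRING
)
USING iceberg
PARTITIONED BY (category, month);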
General design considerations
Reruns and back-fills
We cannot rerun or back-fill snapshot-based datasets. We can do both with incremental datasets, but with the drift explained below.
Category / media file drift
We will base incremental metrics on snapshot categories and media files, so there can be minimal drift from month to month (a category might change its name in the middle of a month, so we’ll see it as 2 different categories, one for the first part of the month and the other for the second). The drift can be bigger for reruns or back-fills.
Array fields
In some cases the array[string] fields are not strictly needed to fulfill the existing GLAM tool use cases. We might want to keep them because they are interesting and allow for extra querying options, but at the same time they add size to the data.
Category vs. category tree
I tried to store metrics for both primary GLAM categories and any subcategory within their category trees. I’m not sure how much that will be used by the GLAM people. I understand there’s value in having subcategory metrics, but at the same time reporting on them makes the datasets considerably larger.
Usage examples
Total views per primary category
Get total views and time-since-tracked for all primary categories, as in the baglama2 main view.
select category, min(month), sum(article_views) from glam_article_views_by_category where array_contains(primary_categories, category) group by category order by category asc limit 3;
category | min(month) | sum(article_views) |
---|---|---|
2021 Africa Cup of Nations | 2021-05 | 45689457 |
Aeroseum | 2013-08 | 5698677 |
American Currency Collection | 2005-10 | 668799 |
- This query would read all the data in the glam_article_views_by_category dataset. It seems it would be OK for a cluster, but tools outside the cluster might have problems, although indexes (on is_primary?) might solve them? In any case, we could load this data into Druid and serve it via AQS or Superset querying the public Druid instance, I think.
- To determine whether a category is primary, the query uses "array_contains(primary_categories, category)", which assumes a primary_categories field on this dataset. We could add an "is_primary" field instead, maybe? This would allow for indexes (see the sketch below).
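For illustration, with a hypothetical is_primary field the query above could be simplified to something like the following sketch.

-- Sketch assuming a hypothetical is_primary boolean field on the dataset.
select category, min(month), sum(article_views) from glam_article_views_by_category where is_primary group by category order by category asc limit 3;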
Monthly views for a given category
Get monthly views for a given category. From the baglama2 category view, or the glamwikidashboard page-views view (monthly instead of daily).
select month, sum(article_views) from glam_article_views_by_category where category = 'My Category' group by month order by month asc limit 3;
month | sum(article_views) |
---|---|
2021-01 | 47536745 |
2021-02 | 87867334 |
2021-03 | 24565677 |
- This query would benefit from the Iceberg table being partitioned by category. If queried outside the cluster, an index on the category field would help.
Articles viewed and article views
Get the number of articles viewed and article views broken down by wiki for a given category and month, like in the baglama2 category view.
select wiki, count(distinct article), sum(article_views) from glam_article_views_by_category where category = 'My Category' and month = '2021-01' group by wiki order by wiki asc limit 3;
wiki | count(distinct article) | sum(article_views) |
---|---|---|
en.wikipedia.org | 13 | 33475 |
he.wikipedia.org | 2 | 655 |
hi.wikipedia.org | 1 | 535 |
- This query would also benefit from the Iceberg table being partitioned by category. If queried outside the cluster, an index on the category field would help.
Breakdown by wiki and category
Get a breakdown of article views by article for a given category, wiki and month, like in the baglama2 wiki view.
select article, sum(article_views) from glam_article_views_by_category where category = 'My Category' and wiki = 'my.wiki.org' and month = '2021-01' group by article order by article asc limit 3;
article | sum(article_views) |
---|---|
Minimum support price | 45456 |
Sonia Mann | 4564 |
Balbir Singh Rajewal | 456 |
- This query would also benefit from the Iceberg table being partitioned by category and time. If queried outside the cluster, indexes on the category and time fields would help.
List category media files
Get a list of media files and their media types belonging to a given category, like in the glamorgan main view.
select name, media_type from glam_commons_media_files where array_contains(categories, 'My Category') and snapshot = '2021-01' limit 3;
name | media_type |
---|---|
Saab AJ-37 Viggen 37094 57 (7609646264).jpg | image |
Inside the Aeroseum entrance tunnel (7490290326).jpg | image |
Aeroseum in Gothenburg, Sweden.jpg | image |
- This query is fine to execute from anywhere and doesn’t require any special partitions/indexes, since the data is small.
Article views by category media files
Get the total article views for a given category, broken down by media file, like in the glamorgan main view.
select media_file, sum(article_views) from glam_article_views_by_media_file where array_contains(categories, 'My Category') group by media_file order by sum(article_views) desc limit 3;
media_file | sum(article_views) |
---|---|
Saab AJ-37 Viggen 37094 57 (7609646264).jpg | 96786 |
Inside the Aeroseum entrance tunnel (7490290326).jpg | 42879 |
Aeroseum in Gothenburg, Sweden.jpg | 5467 |
- This query would read all the data in the glam_article_views_by_media_file dataset. It seems it would be OK for a cluster, but tools outside the cluster might have problems, although indexes might solve them? In any case, we could load this data to Druid and serve it via AQS or Superset querying the public Druid instance, I think.
- If we assume that there’s only 1 primary category for each media file, we could have a `primary_category` field instead of a `primary_categories` field, and we’d be able to partition/index on that.
Prototype implementation suggestions
This is an implementation plan draft for the GLAM Commons Impact Metrics datasets, specifically for the prototype. The steps are listed in the order in which they are believed to be necessary.
1. Collect/generate a list of GLAM primary categories. In production this could be an allow-list file in some repo or, as Desiree suggested, structured data in Commons.
2. Extract GLAM secondary categories (recursive subcategories of the primary categories) and create an intermediate category dataset (including the primary categories). In production, we’d probably need a Spark fixed-point algorithm to extract those (using the category table plus the categorylinks table); see the sketch after this list. We’ll have to take into account possible duplicates and even loops in the category “tree”. In the same job, I guess we could record the parent_categories and subcategories of each category in question. For the prototype, we choose a couple of convenient primary category trees and manually extract all secondary categories.
   FIELD | TYPE |
   ---|---|
   name | string |
   primary_categories | array[string] |
   parent_categories | array[string] |
   subcategories | array[string] |
3. List all media files contained in GLAM categories (primary or secondary) and build an intermediate media file dataset with them. For this, a SparkSql query is probably enough, using the refined category information above (2), plus the image table and the imagelinks table. In the process we could include the media type and store the primary and direct categories of each media file.
   FIELD | TYPE |
   ---|---|
   name | string |
   media_type | string |
   primary_categories | array[string] |
   categories | array[string] |
4. Calculate glam_commons_categories and glam_commons_media_files. We need the data above, (2) and (3), plus imagelinks. A set of SparkSql queries is probably enough? They don’t seem super complicated, but there are several metrics, so there is some tinkering to be done…
   Output: the tables as described in the data model.
5. Calculate glam_article_views_by_category and glam_article_views_by_media_file. A SparkSql query that uses the data above (4) and pageviews_hourly (or mediarequests?). We probably need several queries, one for category and one for category_tree. In my head this needs some joining and aggregating, but it doesn’t seem impossible.
   Output: the datasets as described in the data model.
6. Calculate glam_commons_edits. A SparkSql query that uses the data above (4) plus an edit-based dataset like the revision table or MediaWiki History. This one is probably the easiest?
   Output: the dataset as described in the data model.
7. Populate the spreadsheet with the data samples, plus add queries and charts to it. Apparently, Google Sheets lets you use SQL-like syntax to query data. There could be 1 sheet per initial dataset, and then more sheets as needed for the queries and charts.
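As a rough illustration of the fixed-point idea in step 2, the sketch below shows one expansion iteration in SparkSql. The glam_category_tree view name is an assumption for illustration; categorylinks and page refer to the raw MediaWiki tables.

-- Hypothetical sketch: one iteration of the category-tree expansion.
-- glam_category_tree starts as (category, primary_category) pairs for the
-- primary categories themselves; re-run this step (swapping the views)
-- until the row count stops growing. DISTINCT plus that stop condition
-- handles duplicate paths and cycles in the category "tree".
CREATE OR REPLACE TEMPORARY VIEW glam_category_tree_next AS
SELECT DISTINCT category, primary_category
FROM (
    -- Everything discovered so far.
    SELECT category, primary_category
    FROM glam_category_tree
    UNION ALL
    -- Subcategories one level below the categories discovered so far.
    SELECT p.page_title AS category, t.primary_category
    FROM categorylinks cl
    JOIN page p ON p.page_id = cl.cl_from
    JOIN glam_category_tree t ON cl.cl_to = t.category
    WHERE cl.cl_type = 'subcat'
) all_levels;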