Commons:WMF support for Commons/Commons Impact Metrics/GLAM Prototype Data Model and Implementation Plan
Overview
This document describes the data model for the GLAM metrics prototype to be presented at the GLAM conference 2023 and to be used as a base for the final implementation of the GLAM metrics. It also explains the decisions made when designing the datasets, shows how to use the proposed datasets to extract the currently existing GLAM metrics, and gives implementation suggestions for the calculations to generate those datasets.
None of the choices made here are final; the structure and naming of the datasets and their fields are tentative and open for discussion. So please add any comments you see fit!
Assumptions
- We are focusing on providing easy access to accurate metrics that cover the existing use cases in the community’s GLAM tools, like BaGLAMa2 or GLAM Wiki Dashboard.
- Each GLAM institution has a primary category that contains all their subcategories and media files.
- We have access to the list of GLAM primary categories, either via an allow-list or a similar method.
- There are around 1000 GLAM primary categories, and each of those has on average ~1000 media files. This proxy was collected from BaGLAMa2 and GLAM Wiki Dashboard.
- The main release channels for these datasets are going to be dumps and AQS.
Datasets
This section describes 5 datasets that intend to cover most (if not all) of the existing GLAM tool use cases. The proposed datasets are organized into two types: snapshot-based and incremental.
Snapshot-based datasets
These datasets describe the current state of things at the time the snapshot is taken. For instance, a snapshot of the state of all GLAM Commons categories and their properties as of 2021-03, or a snapshot of the state of all GLAM media files and their properties as of 2023-10.
GLAM Commons categories
This dataset gives information about the GLAM Commons categories at the time the snapshot is taken. It includes both primary categories and all their subcategories, recursively down the tree.
FIELD | TYPE | DESCRIPTION |
---|---|---|
name | string | The category name. |
primary_categories | array[string] | An array with the names of the primary categories this category belongs to. There should normally be only one, but since MediaWiki does not prevent multiple, this field allows it. If the category in question is a primary category, its name will be in the primary_categories array. |
parent_categories | array[string] | An array with the names of the parent categories of this category, if any. |
child_categories | array[string] | An array with the names of the subcategories of this category, if any. |
media_files | int | The number of media files directly included under this category. |
tree_media_files | int | The number of media files included (recursively) under this category tree. |
media_files_used | int | The number of media files included under this category that are displayed in at least one wiki article. |
tree_media_files_used | int | The number of files included (recursively) under this category tree that are displayed in at least one wiki article. |
leveraging_wikis | int | The number of wikis with at least one article displaying at least one media file directly included under this category. |
tree_leveraging_wikis | int | The number of wikis with at least one article displaying at least one media file included (recursively) under this category tree. |
leveraging_articles | int | The number of articles displaying at least one media file directly included under this category. |
tree_leveraging_articles | int | The number of articles displaying at least one media file included (recursively) under this category tree. |
snapshot | string | The snapshot label with YYYY-MM format. |
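For illustration, here is a minimal Hive-style DDL sketch for this dataset. The names and types mirror the tentative model above; partitioning the table by snapshot is an assumption, not a decision.

-- Hypothetical DDL sketch for the categories snapshot dataset.
CREATE TABLE glam_commons_categories (
    name                     STRING        COMMENT 'The category name',
    primary_categories       ARRAY<STRING> COMMENT 'Primary categories this category belongs to',
    parent_categories        ARRAY<STRING> COMMENT 'Direct parent categories',
    child_categories         ARRAY<STRING> COMMENT 'Direct subcategories',
    media_files              BIGINT        COMMENT 'Media files directly included under this category',
    tree_media_files         BIGINT        COMMENT 'Media files included (recursively) under this category tree',
    media_files_used         BIGINT        COMMENT 'Directly included media files displayed in at least one article',
    tree_media_files_used    BIGINT        COMMENT 'Tree media files displayed in at least one article',
    leveraging_wikis         BIGINT        COMMENT 'Wikis using directly included media files',
    tree_leveraging_wikis    BIGINT        COMMENT 'Wikis using media files from this category tree',
    leveraging_articles      BIGINT        COMMENT 'Articles using directly included media files',
    tree_leveraging_articles BIGINT        COMMENT 'Articles using media files from this category tree'
)
PARTITIONED BY (snapshot STRING COMMENT 'The snapshot label, YYYY-MM')
STORED AS PARQUET;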
Data size guess
Depending on how many subcategories a tree has, the size of this dataset will vary. Let’s assume there are 50 subcategories in an average tree (20 images per category), and that each category row is about 500 bytes. The rough uncompressed size of this dataset’s snapshot should be 1000 * 50 * 500 = 25MB.
GLAM Commons media files
This dataset gives information about the GLAM Commons media files at the time the snapshot is taken. It includes category information so that the user can filter by category.
FIELD | TYPE | DESCRIPTION |
---|---|---|
name | string | The media file name. |
media_type | string | The media type (image, video, audio, etc.). |
primary_categories | array[string] | An array with the names of the primary categories this media file belongs to. There should normally be only one, but since MediaWiki does not prevent multiple, this field allows it. |
categories | array[string] | An array with the names of the immediate categories this media file belongs to. |
leveraging_wikis | int | The number of wikis with at least one article displaying this media file. |
leveraging_articles | int | The number of articles displaying this media file. |
snapshot | string | The snapshot label with YYYY-MM format. |
Data size guess
Since we assume there are about 1000 primary categories and 1000 media files per primary category, we expect there to be 1M media files in each snapshot. If each media file row takes 300 bytes, the size of the snapshot should be around 300MB uncompressed.
Design considerations of snapshot-based datasets
- Use category and media file names as unique identifiers instead of IDs. This seems simpler and more useful when reading query results.
- You cannot blindly aggregate metrics across these datasets. The user must always filter or group by category / media file, and must always specify a snapshot; otherwise, the results would contain duplicate counts (see the query sketch after this list).
- These datasets do not include URLs for categories or media files; they can be easily rebuilt from the category or media file names.
- These datasets do not include event-based metrics like media_files_viewed. Those are tied to an interval of time and are not suitable for a snapshot-based dataset; to extract them, the user will have to query the incremental (event-based) datasets.
- Release frequency should be monthly, and thus the snapshots should be monthly (YYYY-MM). The source data for these datasets (MediaWiki tables) is imported monthly and is also snapshot-based.
- Hive should be good enough as underlying tech for the table. I don’t think we need Iceberg here, since monthly snapshots are handled well by Hive. But Iceberg is the default tech now, and we should use it anyway.
- These datasets do not allow reruns or back-filling.
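To make the aggregation caveat above concrete, here is a sketch of a safe aggregation versus one that double-counts. Names follow the tentative model; the snapshot value is only an example.

-- Safe: pick one snapshot and restrict to primary categories, so each
-- category tree is counted exactly once.
SELECT SUM(tree_media_files)
FROM glam_commons_categories
WHERE snapshot = '2023-10'
  AND array_contains(primary_categories, name);

-- Unsafe: summing over all rows counts every media file once per ancestor
-- category, and once more per extra snapshot.
-- SELECT SUM(tree_media_files) FROM glam_commons_categories;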
Incremental datasets
These datasets contain aggregated metrics about GLAM-related events that are tied to a timeline. For instance, views of articles containing media files belonging to a GLAM Commons category, or media file uploads made to a Commons category by a given contributor.
GLAM article views by category
This dataset gives information about the number of views GLAM articles receive over time, broken down by GLAM category. A GLAM article is a wiki article (ns=0 page) that contains at least one media file belonging to a GLAM Commons category.
FIELD | TYPE | DESCRIPTION |
---|---|---|
category | string | The name of the category for which the metrics are aggregated. You should always filter or breakdown by this dimension. |
wiki | string | The wiki that the viewed article belongs to. |
article | string | The name of the viewed article. |
media_from_direct_category | boolean | This is true when the viewed article contains at least one media file belonging to the direct (top) category; it can contain media files from the downstream category tree as well and still be true. Use this when aggregating to count direct category view counts vs. category tree view counts (see the query sketch after this table). |
article_views | int | The number of views the article received in the specified month. |
month | string | The month the article views happened (YYYY-MM). |
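As an illustration of the media_from_direct_category flag, here is a query sketch that separates direct-category views from whole-tree views for one category and month. Names follow the tentative model; the category and month values are only examples.

-- Hypothetical sketch: direct vs. category-tree view counts.
SELECT
    SUM(CASE WHEN media_from_direct_category THEN article_views ELSE 0 END)
        AS direct_category_views,
    SUM(article_views) AS category_tree_views
FROM glam_article_views_by_category
WHERE category = 'My Category'
  AND month = '2021-01';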
Data size guess
This one is more difficult to model… Assuming we have 1M media files, let’s say 900K are used. Let’s also assume the number of GLAM articles equals the number of GLAM used media files. Depending on the granularity of the data, a bigger or smaller share of the articles will receive at least 1 view (which makes them appear in the dataset); let’s imagine 50% of the articles are viewed. We still need to multiply that by the number of subcategories in a category tree, which we said was about 50 on average. However, the two dimensions (article and category) are not orthogonal, so their product would be reduced by a factor; let’s assume it’s 0.25. Finally, we multiply by the number of months in a year. Assuming each record takes 250B, then: 900K * 0.25 * 50 * 0.5 * 12 * 250 = ~17GB/year uncompressed. This calculation is crazy, has too many assumptions, and probably some mistakes on my part, so it could be very far from reality.
GLAM article views by media file
This dataset also gives information about the number of views GLAM articles receive across time. However, it breaks them down by the GLAM media file. A GLAM article is a wiki article (ns=0 page) that contains at least one media file belonging to a GLAM Commons category.
FIELD | TYPE | DESCRIPTION |
---|---|---|
media_file | string | The name of the media file for which the metrics are aggregated. You should always filter or breakdown by this dimension. |
primary_categories | array[string] | The list of GLAM primary Commons categories in whose category tree this media file falls. |
categories | array[string] | The list of Commons categories this media file directly belongs to. |
wiki | string | The wiki that the viewed article belongs to. |
article | string | The name of the viewed article. |
article_views | int | The number of views the article received in the specified month. |
month | string | The month the article views happened (YYYY-MM). |
Data size guess
This one is similar to the previous one in terms of difficulty and fuzziness… We can use the previous data size approximation (GLAM article views by category) to calculate this one. There are two differences: 1) this dataset contains two extra arrays of strings, and 2) in this dataset we have to multiply the media_file dimension cardinality by the article dimension cardinality (instead of category by article). Although the cardinality of the media_file dimension is much greater than that of the category dimension, the two dimensions in question are very aligned: many images appear in articles with the same name. Still, I imagine this dataset will be a bit bigger than the previous one; let’s say it’s double the size: ~34GB/year uncompressed.
GLAM Commons edits
This dataset contains a record for each edit (creation, update, deletion) to a Commons media file. It also records the user who made the edit and the categories the media file belongs to.
FIELD | TYPE | DESCRIPTION |
---|---|---|
user | string | The username of the contributor who edited the media file. |
edit_type | string | The type of edit: create, update, delete. |
media_file | string | The filename of the media file. |
primary_categories | array[string] | A list of primary GLAM category names whose category tree includes the media file. |
categories | array[string] | A list of category names that the media file directly belongs to. |
timestamp | timestamp | The timestamp of the edit. |
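For example, assuming the edits are stored unaggregated (as suggested in the design considerations below), a sketch of a query counting monthly uploads per primary GLAM category could look like this. Names follow the tentative model.

-- Hypothetical sketch: monthly media file uploads per primary GLAM category.
SELECT
    primary_category,
    date_format(`timestamp`, 'yyyy-MM') AS month,
    COUNT(*) AS uploads
FROM glam_commons_edits
    LATERAL VIEW explode(primary_categories) pc AS primary_category
WHERE edit_type = 'create'
GROUP BY primary_category, date_format(`timestamp`, 'yyyy-MM')
ORDER BY primary_category, month;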
Data size guess
For this dataset we can assume that the number of edits is going to be close to the number of media files, or a bit above that considering updates and deletes, say +30%. Let’s assume each record takes about ~300B. Times 1M media files, the overall size will be around ~400MB uncompressed.
Design considerations of incremental datasets
- The two article views datasets have the potential to become medium-big (in cluster terms). But in terms of downloadable data via dumps, they can become quite big if we choose to release them at a daily granularity. Thus, I believe monthly granularity is better (to start with at least), even if it doesn’t show possible daily trends in the data.
- We can store the edits dataset without aggregation, since I believe it will be small enough.
- In these datasets you can aggregate across most dimensions without breaking the data (i.e. without counting duplicates). The only ones that you cannot aggregate across are `category` in “GLAM article views by category” and `media_file` in “GLAM article views by media file”.
- The data size guesses for the article views datasets are wild; take them with a bucket of salt.
- Again, use category names, media file names and user names as unique identifiers. It’s simpler and more direct to chart.
- Release frequency: Since the snapshot-based datasets are monthly, and these (incremental) datasets probably should have monthly granularity for size reasons, I guess the best would be to release these monthly as well.
- Tech: In this case, I think Iceberg would be a better option. We could partition the article views datasets by both category / media file and time, since we always want to either filter or break down by category / media file. Also, because these datasets are incremental, we can take advantage of Iceberg (see the DDL sketch after this list).
- These datasets allow reruns or back-filling.
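As a sketch of the partitioning idea in the Tech bullet above, assuming Spark with Iceberg and the tentative field names, the by-category views table could be declared roughly like this:

-- Hypothetical Iceberg DDL sketch, partitioned by category and month.
CREATE TABLE glam_article_views_by_category (
    category                   STRING,
    wiki                       STRING,
    article                    STRING,
    media_from_direct_category BOOLEAN,
    article_views              BIGINT,
    month                      STRING
)
USING iceberg
PARTITIONED BY (category, month);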
General design considerations
Reruns and back-fills
We cannot rerun or back-fill snapshot-based datasets. We can do both with incremental datasets, but with the drift explained below.
Category / media file drift
We will base incremental metrics on snapshot categories and media files, so there can be minimal drift from month to month (a category might change its name in the middle of a month, so we’ll see it as 2 different categories, one for the first part of the month and the other for the second). The drift can be bigger for reruns or back-fills.
Array fields
In some cases the array[string] fields are not strictly needed to fulfill the existing GLAM tool use cases. We might want to keep them because they are interesting and allow for extra querying options, but at the same time they add size to the data.
Category vs. category tree
I tried to store metrics for both primary GLAM categories and any subcategory within their category trees. I’m not sure how much that will be used by the GLAM people. I understand there’s value in having subcategory metrics, but at the same time reporting on them makes the datasets considerably larger.
Usage examples
Total views per primary category
Get total views and time-since-tracked for all primary categories, as in the baglama2 main view.
select category, min(month), sum(article_views) from glam_article_views_by_category where array_contains(primary_categories, category) group by category order by category asc limit 3;
category | min(month) | sum(article_views) |
---|---|---|
2021 Africa Cup of Nations | 2021-05 | 45689457 |
Aeroseum | 2013-08 | 5698677 |
American Currency Collection | 2005-10 | 668799 |
- This query would read all the data in the glam_article_views_by_category dataset. It seems it would be OK for a cluster, but tools outside the cluster might have problems, although indexes (on is_primary?) might solve them? In any case, we could load this data into Druid and serve it via AQS or Superset querying the public Druid instance, I think.
- To determine whether a category is primary, the query uses "array_contains(primary_categories, category)", which assumes a primary_categories field on this dataset. We could add an "is_primary" field instead, maybe? This would allow for indexes (see the sketch below).
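For illustration, with a hypothetical is_primary field the query above could be simplified to something like the following sketch.

-- Sketch assuming a hypothetical is_primary boolean field on the dataset.
select category, min(month), sum(article_views) from glam_article_views_by_category where is_primary group by category order by category asc limit 3;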
Monthly views for a given category
Get monthly views for a given category. From the baglama2 category view, or the glamwikidashboard page-views view (monthly instead of daily).
select month, sum(article_views) from glam_article_views_by_category where category = 'My Category' group by month order by month asc limit 3;
month | sum(article_views) |
---|---|
2021-01 | 47536745 |
2021-02 | 87867334 |
2021-03 | 24565677 |
- This query would benefit from the Iceberg table being partitioned by category. If queried outside the cluster, an index on the category field would help.
Articles viewed and article views
Get the number of articles viewed and article views broken down by wiki for a given category and month, like in the baglama2 category view.
select wiki, count(distinct article), sum(article_views) from glam_article_views_by_category where category = 'My Category' and month = '2021-01' group by wiki order by wiki asc limit 3;
wiki | count(distinct article) | sum(article_views) |
---|---|---|
en.wikipedia.org | 13 | 33475 |
he.wikipedia.org | 2 | 655 |
hi.wikipedia.org | 1 | 535 |
- This query would also benefit from the Iceberg table being partitioned by category. If queried outside the cluster, an index on the category field would help.
Breakdown by wiki and category
Get a breakdown of article views by article for a given category, wiki and month, like in the baglama2 wiki view.
select article, sum(article_views) from glam_article_views_by_category where category = 'My Category' and wiki = 'my.wiki.org' and month = '2021-01' group by article order by article asc limit 3;
article | sum(article_views) |
---|---|
Minimum support price | 45456 |
Sonia Mann | 4564 |
Balbir Singh Rajewal | 456 |
- This query would also benefit from the Iceberg table being partitioned by category and time. If queried outside the cluster, indexes on the category and time fields would help.
List category media files
Get a list of media files and their media types belonging to a given category, like in the glamorgan main view.
select name, media_type from glam_commons_media_files where array_contains(categories, 'My Category') and snapshot = '2021-01' limit 3;
name | media_type |
---|---|
Saab AJ-37 Viggen 37094 57 (7609646264).jpg | image |
Inside the Aeroseum entrance tunnel (7490290326).jpg | image |
Aeroseum in Gothenburg, Sweden.jpg | image |
- This query is fine to execute from anywhere and doesn’t require any special partitions/indexes, since the data is small.
Article views by category media files
Get the total article views for a given category, broken down by media file, like in the glamorgan main view.
select media_file, sum(article_views) from glam_article_views_by_media_file where array_contains(categories, 'My Category') group by media_file order by sum(article_views) desc limit 3;
media_file | sum(article_views) |
---|---|
Saab AJ-37 Viggen 37094 57 (7609646264).jpg | 96786 |
Inside the Aeroseum entrance tunnel (7490290326).jpg | 42879 |
Aeroseum in Gothenburg, Sweden.jpg | 5467 |
- This query would read all the data in the glam_article_views_by_media_file dataset. It seems it would be OK for a cluster, but tools outside the cluster might have problems, although indexes might solve them? In any case, we could load this data to Druid and serve it via AQS or Superset querying the public Druid instance, I think.
- If we assume that there’s only 1 primary category for each media file, we could have a `primary_category` field instead of a `primary_categories` field, and we’d be able to partition/index on that.
Prototype implementation suggestions
This is an implementation plan draft for the GLAM Commons Impact Metrics datasets, specifically for the prototype. The steps are listed in the order in which they are believed to be necessary.
1. Collect/generate a list of GLAM primary categories. In production this could be an allow-list file in some repo or, as Desiree suggested, structured data in Commons.
2. Extract GLAM secondary categories (recursive subcategories of the primary categories) and create an intermediate category dataset (including the primary categories). In production, we’d probably need a Spark fixed-point algorithm to extract those (using the category table plus the categorylinks table); see the sketch after this list. We’ll have to take into account possible duplicates and even loops in the category “tree”. In the same job, I guess we could record the parent_categories and subcategories of each category in question. For the prototype, we choose a couple of convenient primary category trees and manually extract all secondary categories.
   FIELD | TYPE |
   ---|---|
   name | string |
   primary_categories | array[string] |
   parent_categories | array[string] |
   subcategories | array[string] |
3. List all media files contained in GLAM categories (primary or secondary) and build an intermediate media file dataset with them. For this, a SparkSql query is probably enough, using the refined category information above (2), plus the image table and the imagelinks table. In the process we could include the media type and store the primary and direct categories of each media file.
   FIELD | TYPE |
   ---|---|
   name | string |
   media_type | string |
   primary_categories | array[string] |
   categories | array[string] |
4. Calculate glam_commons_categories and glam_commons_media_files. We need the data above, (2) and (3), plus imagelinks. A set of SparkSql queries is probably enough? They don’t seem super complicated, but there are several metrics, so there is some tinkering to be done…
   Output: the tables as described in the data model.
5. Calculate glam_article_views_by_category and glam_article_views_by_media_file. A SparkSql query that uses the data above (4) and pageviews_hourly (or mediarequests?). We probably need several queries, one for category and one for category_tree. In my head this needs some joining and aggregating, but it doesn’t seem impossible.
   Output: the datasets as described in the data model.
6. Calculate glam_commons_edits. A SparkSql query that uses the data above (4) plus an edit-based dataset like the revision table or MediaWiki History. This one is probably the easiest?
   Output: the dataset as described in the data model.
7. Populate the spreadsheet with the data samples, plus add queries and charts to it. Apparently, Google Sheets lets you use SQL-like syntax to query data. There could be 1 sheet per initial dataset, and then more sheets as needed for the queries and charts.
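As a rough illustration of the fixed-point idea in step 2, the sketch below shows one expansion iteration in SparkSql. The glam_category_tree view name is an assumption for illustration; categorylinks and page refer to the raw MediaWiki tables.

-- Hypothetical sketch: one iteration of the category-tree expansion.
-- glam_category_tree starts as (category, primary_category) pairs for the
-- primary categories themselves; re-run this step (swapping the views)
-- until the row count stops growing. DISTINCT plus that stop condition
-- handles duplicate paths and cycles in the category "tree".
CREATE OR REPLACE TEMPORARY VIEW glam_category_tree_next AS
SELECT DISTINCT category, primary_category
FROM (
    -- Everything discovered so far.
    SELECT category, primary_category
    FROM glam_category_tree
    UNION ALL
    -- Subcategories one level below the categories discovered so far.
    SELECT p.page_title AS category, t.primary_category
    FROM categorylinks cl
    JOIN page p ON p.page_id = cl.cl_from
    JOIN glam_category_tree t ON cl.cl_to = t.category
    WHERE cl.cl_type = 'subcat'
) all_levels;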