Commons:GLAMwiki Toolset Project/NARA analytics pilot

The NARA analytics pilot is a pilot project to produce the first image usage analytics of a GLAM upload. This is a quick and dirty approach done at the Zürich Hackathon 2014. Beware, the data and analytics didn't go through any qualilty control so it might be full of errors.

History edit

 
Report on Requirements for Usage and Reuse Statistics for GLAM Content

At the moment the Wikimedia Foundation collects page view statistics, but no image view statistics. This means we don't know how many views our images and other media files get, we try to derive it based on page views. A long time ago someone enabled logging of image views for NARA images. The data has been collecting for several years, but nothing has been done with the data yet. Unfortunately the data was rotated (old files deleted) so the dataset we worked on was from 2014-02-05 to 2014-05-09. The logs are sampled 1:10 so as a rule of thumb you night to multiple the numbers with 10. We made an overview of requirements for usage and re-usage statistics for GLAM content so we know what questions we want to answer.

Steps edit

  • udplog collects NARA data. The Raw data format is at wikitech:Analytics/Data_access#Request_logs
  • This data is imported to Hadoop on stat1002.eqiad.wmnet to the table webrequest_glam_nara on database Otto (thanks Andrew)
  • We do a query to clean up the data and to get the same image names (multiple url's for the same image, I hope we solved the fileviews problem)
CREATE TABLE webrequest_glam_nara_cleaned(
  dt string COMMENT 'from deserializer', 
  country_code string COMMENT 'from deserializer', 
  uri string COMMENT 'from deserializer', 
  content_type string COMMENT 'from deserializer', 
  referer string COMMENT 'from deserializer')
--PARTITIONED BY ( 
--  year int, 
--  month int, 
--  day int)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\t' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
INSERT INTO TABLE
  webrequest_glam_nara_cleaned
SELECT
  dt, country_code, regexp_replace(uri, 'https?://upload.wikimedia.org/wikipedia/commons(/thumb)?/(\\w/\\w\\w)/([^\/]+)(.*)', '//upload.wikimedia.org/wikipedia/commons/$2/$3'), content_type, referer
FROM
  webrequest_glam_nara
WHERE 
    year=2014 
  AND 
    (uri LIKE 'http://upload.wikimedia.org/%' OR uri like 'https://upload.wikimedia.org/%');

First results edit

Total number of views edit

SELECT COUNT(*) FROM webrequest_glam_nara_cleaned;

Over the time period we had a total of 21,547,979 hits. So, taking into account the 1:10 sampling, this means roughly 200 million hits in 3 months. We could split this out by day and make a histogram

Views per day edit

"Give me a daily overviewcount of the views my media files had (FileViews)"

See dataset and visualisation.

There is a spike on March 24. Further analysis shows that the biggest referral on that day is Dorothy Height. Turns out this lady was featured on a Google Doodle on that day.

Views per month edit

"Give me a monthly overview of the views my media files had (FileViews)"

Period Views
2014-02 (from 05) 5,394,192
2014-03 7,222,593
2014-04 6,848,673
2014-05 (end 09) 2,082,521

The most popular images edit

SELECT 
  count(*) as freq, uri
FROM
  webrequest_glam_nara_cleaned 
GROUP BY 
  uri 
ORDER BY
  freq desc 
LIMIT 20;

Most popular images by month edit

February 2014
March 2014
April 2014

Imageviews by country edit

SELECT 
  country_code as country, count(*) as freq
FROM
  webrequest_glam_nara_cleaned 
GROUP BY 
  country_code 
ORDER BY
  freq desc
LIMIT 2000;

We could make a heatmap of this one.

country freq
US 10806780
GB 1338132
XX 1050605
DE 879286
CA 822878
JP 566449
FR 479325
AU 459854
IN 370386
NL 293125
RU 235243
MX 226716
IT 217636
CN 199184
ES 194763
BR 144575
PH 134343
SE 126166
PL 113083
EU 109736
CH 104411
AT 103840
IE 93439
CO 81917
NZ 81675
SG 79750
NO 78881
TR 75130

Imageviews per country per month edit

Give me a monthly overview of the provenance of the users that see my media files

February 2014
country freq
US 2710846
GB 343203
XX 248388
DE 224909
CA 203417
JP 175721
FR 126888
AU 103175
IN 96543
MX 65935
NL 65767
IT 60654
ES 55110
RU 50584
CN 40986
PH 39484
SE 30751
BR 30243
EU 28404
CH 25599
March 2014
country freq
US 3540291
GB 465823
XX 351699
DE 303557
CA 273932
JP 189155
FR 163550
AU 152841
IN 123431
NL 105033
RU 91192
MX 84487
CN 73168
IT 68658
ES 65726
PH 51155
BR 46257
SE 43836
PL 41642
EU 37951
April 2014
country freq
US 3476374
GB 404892
XX 344029
CA 269814
DE 267976
AU 158220
JP 154575
FR 145757
IN 117831
NL 96214
RU 74868
IT 67890
CN 64992
MX 58800
ES 57575
BR 47004
SE 39736
PL 36828
PH 33647
EU 33524

Top referers edit

SELECT
  referer, count(*) as freq 
FROM
  webrequest_glam_nara_cleaned
GROUP BY
  referer
ORDER BY
  freq
DESC
LIMIT 20;
referer freq
<none> 906001
http://en.wikipedia.org/wiki/World_War_I 505325
http://en.m.wikipedia.org/wiki/World_War_I 392841
http://en.wikipedia.org/wiki/Japanese_American_internment 197975
http://en.wikipedia.org/wiki/Vietnam_War 196377
http://en.wikipedia.org/wiki/Franklin_D._Roosevelt 191373
http://en.wikipedia.org/wiki/Atomic_bombings_of_Hiroshima_and_Nagasaki 188715
http://en.wikipedia.org/wiki/Martin_Luther_King,_Jr. 180489
http://en.wikipedia.org/wiki/World_War_II 166533
http://en.wikipedia.org/wiki/John_F._Kennedy 164430
http://en.m.wikipedia.org/wiki/List_of_Presidents_of_the_United_States 149187
http://en.m.wikipedia.org/wiki/Vietnam_War 141585
http://en.wikipedia.org/wiki/George_H._W._Bush 138513
http://en.wikipedia.org/wiki/List_of_Presidents_of_the_United_States 138324
https://www.google.com/ 136984
http://en.m.wikipedia.org/wiki/Atomic_bombings_of_Hiroshima_and_Nagasaki 136312
http://en.m.wikipedia.org/wiki/John_F._Kennedy 131820
http://en.m.wikipedia.org/wiki/World_War_II 128544
http://en.m.wikipedia.org/wiki/Franklin_D._Roosevelt 126696
http://en.wikipedia.org/wiki/March_on_Washington_for_Jobs_and_Freedom 113333

Top external referers edit

SELECT
  referer, count(*) as freq 
FROM
  webrequest_glam_nara_cleaned
WHERE
  referer not like '%wiki%'
GROUP BY
  referer
ORDER BY
  freq
DESC
LIMIT 20;
referer freq
https://www.google.com/ 136984
http://www.google.com/ 47612
http://www.google.com/blank.html 33039
http://hainn12.blogspot.com/ 12769
https://www.google.co.uk/ 11585
http://www.reddit.com/ 7940
https://www.google.ca/ 7713
https://www.google.co.in/ 4959
http://www.google.co.uk/ 4165
https://www.google.fr/ 3816
https://www.google.com.au/ 3708
https://www.google.com.mx/ 3228
https://www.google.es/ 3182
http://hainn8x12.blogspot.com/ 3128
https://www.google.it/ 2687
https://www.google.co.jp/ 2646
https://www.google.de/ 2565
http://www.google.de/ 2508
http://www.google.co.uk/blank.html 2460

(removed the none one)

Top external referers (without Google) edit

SELECT
  referer, count(*) as freq 
FROM
  webrequest_glam_nara_cleaned
WHERE
  referer not like '%wiki%' and referer not like '%google%'
GROUP BY
  referer
ORDER BY
  freq
DESC
LIMIT 20;
http://hainn12.blogspot.com/ 12769
http://www.reddit.com/ 7940
http://hainn8x12.blogspot.com/ 3128
http://newtownliterary.wordpress.com/wp-admin/post.php?post=767&action=edit&message=1 2379
http://image-search.kik.com/ 1336
http://hainn12.blogspot.com.es/ 1292
http://www.reddit.com/r/all/ 954
http://ex-frat-man.tumblr.com/ 915
http://hainn12.blogspot.co.uk/ 913
http://blogs.discovermagazine.com/bodyhorrors/ 882
http://www.reddit.com/r/HistoryPorn/ 710
http://www.moava.org/index.php?pageID=222 670
http://hainn12.blogspot.de/ 615
http://www.answers.com/topic/environmental-protection-agency 608
http://hainn12.blogspot.com.tr/ 587
http://hainn12.blogspot.tw/ 569
http://www.answers.com/topic/martin-luther-king-jr 509
http://www.answers.com/topic/world-war-i 481
http://www.answers.com/topic/new-deal 445

Source code edit

See https://github.com/Commonists/limn-glam for a visualization code base which was based on now deprecated Limn.