
User:Fæ/Project list/CDC videos

[Example videos from the collection: one giving easy to understand pregnancy tips; a summary of the Surgeon General's 50 year history of anti-smoking campaigning; and a Spanish video encouraging the use of insect repellent to defend against the Zika virus.]

This batch upload project populates Category:CDC videos.

The videos are focused on health promotion, prevention and preparedness activities in the United States, though some relate to international programmes.

The project is run by Fæ as an independent volunteer. Questions and comments can be raised at User talk:Fæ. https://petscan.wmflabs.org/?psid=1181470 gives a report of uploads in the last 7 days.


Introduction

The Centers for Disease Control and Prevention (CDC) is a U.S. federal agency, so works created under its projects are automatically in the public domain. The source Youtube channel used for the videos, CDCStreamingHealth, is the agency's official channel, linking back to its home website www.cdc.gov. In practice, videos displayed at www.cdc.gov are actually Youtube-hosted videos embedded within CDC webpages.

This batch upload project for CDC videos was suggested in an email from James Heilman. Video transcoding and uploading pose challenges for Wikimedia Commons, creating so many barriers that video remains only a very small part of Commons' collections.

Technical pointers

Youtube makes a range of formats available behind the scenes for hosted videos. However the default, and normally best, format is mp4, which is not an open standard for video. As Commons is limited to open formats, the files have to be reprocessed into an open standard. For video the most commonly accepted format is webm, using the VP8 or VP9 codec for video and vorbis for audio.

Example formats available (as listed by youtube-dl's "-F" option) for https://www.youtube.com/watch?v=YovSyrTUpxc, a CDC video uploaded as File:I Am CDC- Linda Schieb.webm:

[info] Available formats for YovSyrTUpxc:
format code  extension  resolution note
139          m4a        audio only DASH audio   49k , m4a_dash container, mp4a.40.5@ 48k (22050Hz), 540.46KiB
249          webm       audio only DASH audio   55k , opus @ 50k, 546.69KiB
250          webm       audio only DASH audio   78k , opus @ 70k, 765.37KiB
171          webm       audio only DASH audio  121k , vorbis@128k, 1.09MiB
140          m4a        audio only DASH audio  129k , m4a_dash container, mp4a.40.2@128k (44100Hz), 1.40MiB
251          webm       audio only DASH audio  192k , opus @160k, 1.80MiB
278          webm       256x144    144p   90k , webm container, vp9, 15fps, video only, 760.28KiB
160          mp4        256x144    DASH video  112k , avc1.4d400c, 15fps, video only, 1.14MiB
242          webm       426x240    240p  169k , vp9, 30fps, video only, 1.14MiB
133          mp4        426x240    DASH video  261k , avc1.4d4015, 30fps, video only, 2.61MiB
243          webm       640x360    360p  293k , vp9, 30fps, video only, 2.05MiB
134          mp4        640x360    DASH video  352k , avc1.4d401e, 30fps, video only, 2.43MiB
244          webm       854x480    480p  514k , vp9, 30fps, video only, 3.36MiB
135          mp4        854x480    DASH video  650k , avc1.4d401f, 30fps, video only, 4.90MiB
247          webm       1280x720   720p 1057k , vp9, 30fps, video only, 6.90MiB
136          mp4        1280x720   720p 1197k , avc1.4d401f, 30fps, video only, 9.67MiB
302          webm       1280x720   720p60 1634k , vp9, 60fps, video only, 11.15MiB
298          mp4        1280x720   DASH video 2819k , avc1.4d4020, 60fps, video only, 19.25MiB
17           3gp        176x144    small , mp4v.20.3, mp4a.40.2@ 24k
36           3gp        320x180    small , mp4v.20.3, mp4a.40.2
43           webm       640x360    medium , vp8.0, vorbis@128k
18           mp4        640x360    medium , avc1.42001E, mp4a.40.2@ 96k
22           mp4        1280x720   hd720 , avc1.64001F, mp4a.40.2@192k (best)

In the example above, a Commons-compliant video could be created by downloading one of the webm video-only files and merging in a vorbis format audio stream. This is relatively fast, as neither the video nor the audio needs transcoding on the local (the 'client') machine. This was done in past projects, for example as part of the import of Youtube videos for the LGBT free media collective. However the range of available formats varies, so this type of video creation has required a manual choice at the time of merging: the lists of webm video streams and vorbis audio streams were presented to the operator in a terminal and picked out by hand. The hand-picking is not trivial, as very high resolution video is unsuitable for Commons, where it will fail to display on the image page.
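For illustration, a minimal sketch of that merge in Python, using format codes 244 (VP9, video only) and 171 (vorbis audio) from the listing above; the file names and the use of subprocess here are my own placeholders, not the project's actual code:

from subprocess import call

url = "https://www.youtube.com/watch?v=YovSyrTUpxc"

# Download a video-only webm stream and a vorbis audio stream picked
# from the format listing above.
call(["youtube-dl", "-f", "244", "-o", "video.webm", url])
call(["youtube-dl", "-f", "171", "-o", "audio.webm", url])

# Mux the two streams without re-encoding; "-c copy" is what keeps
# this step fast compared with a full transcode.
call(["ffmpeg", "-i", "video.webm", "-i", "audio.webm", "-c", "copy", "merged.webm"])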

Due to the number of CDC videos, and the shortage of 'technical' volunteer time to run the process, a fully automated approach was preferred. This is much slower, as it relies on letting Youtube recommend the default best quality video for download and then having the local machine transcode from mp4 (h264 video with aac audio) to webm (VP8 video with vorbis audio).
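A sketch of that download step; the format selector "best[ext=mp4]" and the URL are my assumptions (by default youtube-dl picks the best quality available):

from subprocess import call

# Fetch Youtube's recommended best single-file mp4 and save it under
# the name the transcoding step below expects.
call(["youtube-dl", "-f", "best[ext=mp4]", "-o", "youtube-x.mp4",
      "https://www.youtube.com/watch?v=YovSyrTUpxc"])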

The choice of VP8 is based on transcoding times and the negligible difference in end quality that using VP9 would make. Further research might shift this viewpoint.

2018

Taking account of upgrades to the youtube-dl module, a much simpler equivalent call is used:

youtube-dl -o <local> <webpage_url> --recode-video webm

This uses the DASH manifest to pick the best available video and audio streams, then recodes them to webm format using the local installation of ffmpeg. Example: Hurricane Preparedness.webm

Code

The upload script uses Pywikibot-core. However at the time the batch upload started, standard Pywikibot-core would fail to upload files at or above around 230 MB in size. A patch by Zhuyifei1999, discussed in phab:T129216, adds an experimental version of asynchronous uploads to Pywikibot-core by adapting the site.py module. Unfortunately this makes the project hard to replicate, unless users are technical enough to install (informal) patches.

The guts of handling Youtube videos is done with calls to youtube-dl, see https://youtube-dl.org/; this is open source code available on Github. The alternative would be to access the Youtube API, which offers similar ways of reporting on Youtube video metadata, in particular for examining playlists under a Youtube channel account. However youtube-dl was a quick fix based on previously available code, so this option saved on our limited (free) volunteer development time. youtube-dl has options for returning results in JSON format.

youtube-dl → JSON

As a code note, probably to myself rather than in the expectation that others will reuse this: after a lot of trial and error, the trick was:

  1. Take youtube-dl's terminal output using the "-j" option to get JSON.
  2. The output looks like ["data-wanted\n", null], so take output[0] to drop the null, then [:-1] to chop off the redundant trailing newline.
  3. re.sub('\n',',', output) so that each video's JSON in the playlist is separated by a comma, not a newline.
  4. data = json.loads('[' + output + ']') to wrap the naked sequence of records in brackets so it can be read as correct JSON.

Getting this right was surprisingly time consuming. Though the result is inelegant, if starting from scratch it would probably be better to parse the output as a long string rather than as JSON.
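For reference, a minimal sketch of the four steps above, assuming Python 3 with youtube-dl on the PATH; the playlist URL is a placeholder:

import json
import re
from subprocess import Popen, PIPE

# Step 1: "-j" makes youtube-dl print one JSON record per video in the
# playlist.
proc = Popen(["youtube-dl", "-j",
              "https://www.youtube.com/playlist?list=EXAMPLE"],
             stdout=PIPE)
output = proc.communicate()  # a (stdout, stderr) pair, hence the null

# Step 2: output[0] drops the null; [:-1] chops the trailing newline.
text = output[0].decode("utf-8")[:-1]

# Step 3: separate each video's JSON with a comma instead of a newline.
text = re.sub('\n', ',', text)

# Step 4: wrap in brackets so the whole string parses as one JSON array.
data = json.loads('[' + text + ']')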

Video transcoding is done using a slightly complex call to ffmpeg, https://ffmpeg.org/, probably the best known open source multi-format converter tool. Transcoding takes approximately 12x "real time" for high quality video, i.e. a 1 hour high quality video will take about 12 hours to transcode. This is based on using a 2012 Mac mini with 16 GB of RAM and a 2.5 GHz i5 processor. Faster times would be achieved on better machines, or if WMF-Labs were used as the transcoding environment. That was not done because youtube-dl is not available on labs; again, the trade-off is between getting the project done with limited volunteer time and investing that time in further research on getting the same thing working in the labs environment.

The source code itself is "fairly" basic and was quickly hacked together from a previous bit of code, along with odd local dependencies, so it is not a good bit of example code to learn from. The principles, however, could be duplicated if someone wanted to create alternative code:

  • It makes calls to youtube-dl to examine the CDC video channel, pulls out all playlists, then loops through each playlist to then generate data for each video to be uploaded.
  • A local copy of the mp4 best version of the Youtube video is created with a youtube-dl call.
  • The youtube-dl information about the video is used to create the Commons image page text and to provide a filename. Extra useful data included are the Youtube "categories" and the Youtube "tags".
  • The local copy is transcoded using ffmpeg.
  • Pywikibot (patched) uploads the file to Commons. Local copies are then removed.
  • Uploads use the Youtube Playlist name as a category name. This will be a red-link until created manually.

Checks made:

  1. The intended filename is checked against Commons, and the file is skipped if it already exists.
  2. The code checks whether the "youtube ID" (the ?v=<id> part of website calls) is already on Commons, presuming that the alpha string will be unique. If found, the file is skipped on the presumption that it exists under another filename (see the sketch after this list).
  3. If the Commons upload fails twice, the code gives up on the file, presuming there is a fundamental problem with the file's encoding.
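A minimal sketch of the first two checks, assuming a configured Pywikibot-core installation; the function name and the insource: search query are my own illustration, not the project's actual code:

import pywikibot

site = pywikibot.Site("commons", "commons")

def already_on_commons(filename, youtube_id):
    # Check 1: does the intended filename already exist?
    if pywikibot.FilePage(site, filename).exists():
        return True
    # Check 2: does the Youtube ID appear in any file page's wikitext?
    # Namespace 6 is File:; one hit is enough to decide to skip.
    query = 'insource:"%s"' % youtube_id
    return any(site.search(query, namespaces=[6], total=1))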

The critical call for the transcoding is:

call(["nice","-n","15", "ffmpeg","-i","youtube-x.mp4","-c:v", "libvpx", "-crf", "4", "-b:v", "2M", "-c:a", "libvorbis", local])

"call" is from the Python subprocess module, creating a new process to do the transcoding. This is running in OSX, so the command "nice" sets the priority of the process, in this case to a low priority (higher 'niceness') so that the local machine is not overloaded for other jobs. The variable local is where the file is getting created based on the original youtube-x.mp4. The other ffmpeg options are based on the experience of creating videos for Commons over the last couple of years.

Upload runs

15 May 2017

A test run created File:2017 NHSN Training - Standardized Antibiotic Administration Ratio.webm and started Category:I Am CDC.

16 May 2017

Full automated run started. The file upload comment, in the file history, points to this project page. One unintended consequence of using a local copy for transcoding was that it rapidly exhausted my Google Drive allowance (15 GB), as files were being synced to my Drive account. Switching off the sync while this project ran fixed the issue, but a better solution for similar projects would be to let the transcoding run on a stand-alone machine, as storage, processor time and bandwidth are all seriously affected.

23 May 2017

Run restarted, with an improvement to identifying videos from the youtube-dl playlist JSON listing. There seem to be two different ways that the list may be returned: one has a list of 'entries' within an array for the playlist, the other is a list of videos as a naked set of arrays, which caused errors in JSON parsing of the output. Previous uploads skipped these errors, but were consequently missing some videos or sets of videos, for example the CDC Zika virus videos.
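A sketch of handling both shapes, assuming the whole youtube-dl playlist output has been read into a string; the helper name is mine:

import json

def playlist_entries(output):
    # Return the per-video records whichever shape youtube-dl used:
    # a single playlist object carrying an 'entries' list, or one
    # naked JSON record per line.
    lines = [l for l in output.splitlines() if l.strip()]
    first = json.loads(lines[0])
    if isinstance(first, dict) and "entries" in first:
        return first["entries"]
    return [json.loads(l) for l in lines]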

29 July 2017

First run completed, re-run started. It seems that the transcoding takes long enough that many new videos are published while the upload run is happening, so regular refreshes may be helpful.

17 May 2018

First run with the new, simpler command, leaving post-processing choices to youtube-dl.