This batch upload project populates Category:CDC videos.
The videos are focused on health promotion, prevention and preparedness activities in the United States, though some relate to international programmes.
The Centers for Disease Control and Prevention (CDC) is a U.S. federal agency, so all works created under its projects by federal employees are automatically in the public domain. The source Youtube channel, CDCStreamingHealth, is the agency's official channel and links back to its home website, www.cdc.gov. In practice, videos displayed at www.cdc.gov are Youtube-hosted videos embedded within CDC webpages.
This batch upload project for CDC videos was suggested in an email from James Heilman. Video transcoding and uploading present significant challenges for Wikimedia Commons; these barriers mean that video remains only a very small part of Commons' collections.
Youtube makes a range of formats available behind the scenes for hosted videos. However the default, and normally best, format is mp4, which is not an open standard for video. As Commons is limited to open formats, the files have to be reprocessed into an open standard. For video the most commonly accepted format is webm, using the VP8 or VP9 codec for video and vorbis for audio.
[info] Available formats for YovSyrTUpxc:
format code  extension  resolution  note
139          m4a        audio only  DASH audio   49k , m4a_dash container, mp4a.40.5@ 48k (22050Hz), 540.46KiB
249          webm       audio only  DASH audio   55k , opus @ 50k, 546.69KiB
250          webm       audio only  DASH audio   78k , opus @ 70k, 765.37KiB
171          webm       audio only  DASH audio  121k , vorbis@128k, 1.09MiB
140          m4a        audio only  DASH audio  129k , m4a_dash container, mp4a.40.2@128k (44100Hz), 1.40MiB
251          webm       audio only  DASH audio  192k , opus @160k, 1.80MiB
278          webm       256x144     144p   90k , webm container, vp9, 15fps, video only, 760.28KiB
160          mp4        256x144     DASH video  112k , avc1.4d400c, 15fps, video only, 1.14MiB
242          webm       426x240     240p  169k , vp9, 30fps, video only, 1.14MiB
133          mp4        426x240     DASH video  261k , avc1.4d4015, 30fps, video only, 2.61MiB
243          webm       640x360     360p  293k , vp9, 30fps, video only, 2.05MiB
134          mp4        640x360     DASH video  352k , avc1.4d401e, 30fps, video only, 2.43MiB
244          webm       854x480     480p  514k , vp9, 30fps, video only, 3.36MiB
135          mp4        854x480     DASH video  650k , avc1.4d401f, 30fps, video only, 4.90MiB
247          webm       1280x720    720p 1057k , vp9, 30fps, video only, 6.90MiB
136          mp4        1280x720    720p 1197k , avc1.4d401f, 30fps, video only, 9.67MiB
302          webm       1280x720    720p60 1634k , vp9, 60fps, video only, 11.15MiB
298          mp4        1280x720    DASH video 2819k , avc1.4d4020, 60fps, video only, 19.25MiB
17           3gp        176x144     small , mp4v.20.3, mp4a.40.2@ 24k
36           3gp        320x180     small , mp4v.20.3, mp4a.40.2
43           webm       640x360     medium , vp8.0, vorbis@128k
18           mp4        640x360     medium , avc1.42001E, mp4a.40.2@ 96k
22           mp4        1280x720    hd720 , avc1.64001F, mp4a.40.2@192k (best)
In the example above, a Commons-compliant video could be created by downloading one of the webm format video-only files and merging in a vorbis format audio stream. This is relatively fast, as neither the video nor the audio needs transcoding on the local ('client') machine. This has been done in past projects, for example as part of the import of Youtube videos for the LGBT free media collective. However the range of available formats varies from video to video, so this type of video creation has relied on a manual choice at the time of merging: the lists of webm and vorbis streams were presented to the operator in a terminal and picked out by hand. The hand-picking is not trivial, as very high resolution video is unsuitable for Commons, where it will fail to display on the image page.
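As an illustration only (the format codes are those from the listing above, availability varies per video, and the merge step relies on a local ffmpeg install), youtube-dl can fetch a specific video-only webm stream together with a vorbis audio stream and mux them without transcoding:

youtube-dl -f 247+171 --merge-output-format webm -o cdc-video.webm https://www.youtube.com/watch?v=YovSyrTUpxc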
Due to the number of CDC videos, and the shortage of 'technical' volunteer time to run the process, a fully automated approach was preferred. This is much slower, as it relies on letting Youtube recommend the default best quality video for download, then having the local machine transcode from mp4 (h264 video with aac audio) to webm (VP8 video with vorbis audio).
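In outline, and with illustrative filenames, the two automated steps have roughly this shape (the exact ffmpeg options used in this project are shown later on this page):

youtube-dl -f best -o youtube-x.mp4 https://www.youtube.com/watch?v=YovSyrTUpxc
ffmpeg -i youtube-x.mp4 -c:v libvpx -c:a libvorbis youtube-x.webm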
The choice of VP8 is based on transcoding times and the negligible difference in end quality that using VP9 would make. Further research might shift this viewpoint.
Taking account of upgrades to the youtube-dl module, a much simpler equivalent call is now used:
youtube-dl -o <local> <webpage_url> --recode-video webm
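The same call can be wrapped from Python with subprocess, in the style of the transcoding call shown later; the output template and URL below are illustrative, and --recode-video relies on ffmpeg being installed:

from subprocess import call

local = "cdc-video.%(ext)s"  # youtube-dl output template, illustrative only
webpage_url = "https://www.youtube.com/watch?v=YovSyrTUpxc"
call(["youtube-dl", "-o", local, webpage_url, "--recode-video", "webm"])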
The upload script uses Pywikibot-core. However, at the time this batch upload started, the standard Pywikibot-core would fail to upload files at or above around 230 MB in size. A patch by Zhuyifei1999, discussed in phab:T129216, adds an experimental version of asynchronous uploads to Pywikibot-core by adapting the site.py module. Unfortunately this makes the project hard to replicate, unless users are technical enough to install (informal) patches.
The guts of handling Youtube videos is done by making calls to youtube-dl, see https://youtube-dl.org/. This is open source code available on Github. The alternative would be to access the Youtube API, which can report similar Youtube video metadata, in particular for examining the playlists under a Youtube channel account. However youtube-dl was a quick fix based on previously available code, so this option saved on our limited (free) volunteer development time. youtube-dl has options for returning results in JSON format.
As a code note, probably to myself rather than expecting others to reuse this, after a lot of trial and error the trick was:
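(The original snippet is not reproduced here; the sketch below only shows the general shape, with an illustrative playlist URL. youtube-dl's -j option emits one JSON object per video, one per line, rather than a single JSON document, so each line has to be parsed separately.)

import json
from subprocess import check_output

# Sketch only: -j prints one JSON object per line, not a JSON array.
raw = check_output([
    "youtube-dl", "-j", "--flat-playlist",
    "https://www.youtube.com/user/CDCStreamingHealth/playlists",
])
videos = [json.loads(line) for line in raw.splitlines() if line.strip()]
for v in videos:
    print(v.get("id"), v.get("title"))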
Getting this right was surprisingly time consuming. Though inelegant, if starting from scratch it is probably easier to parse the output as one long string rather than as JSON.
Video transcoding is done using a slightly complex call to ffmpeg, https://ffmpeg.org/, probably the best known open source multi-format converter tool. Transcoding runs at approximately 12× "real time" for high quality video, i.e. a 1 hour high quality video takes about 12 hours to transcode. This is based on using a 2012 Mac mini with 16 GB of RAM and a 2.5 GHz i5 processor. Faster times would be achieved on better machines, or if WMF-Labs were used as the transcoding environment; this was not done, as youtube-dl is not available on labs. Again, the trade-off is between getting the project done with limited volunteer time and investing that time in getting the same thing working in the labs environment.
The source code itself is "fairly" basic and was quickly hacked from a previous bit of code, along with odd local dependencies, so it is not a good example to learn from. The principles, however, could be duplicated if someone wanted to create alternative code (a sketch follows the list below):
- It makes calls to youtube-dl to examine the CDC video channel, pulls out all playlists, then loops through each playlist to generate data for each video to be uploaded.
- A local copy of the best mp4 version of the Youtube video is created with a youtube-dl call.
- The youtube-dl metadata for the video is used to create the Commons image page text and provide a filename. Extra useful data includes the Youtube "categories" and the Youtube "tags".
- The local copy is transcoded using ffmpeg.
- Pywikibot (patched) uploads the file to Commons. Local copies are then removed.
- Uploads use the Youtube playlist name as a category name. This will be a red-link until the category is created manually.
- The intended filename is checked against Commons; if it already exists, the file is skipped.
- The code checks whether the "youtube ID" (the ?v=<id> part of website calls) is already on Commons, presuming that the alpha string is unique. If found, the file is skipped on the presumption that it exists under another filename.
- If the Commons upload fails twice, the code gives up on the file, presuming there is a fundamental problem with the file encoding.
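A minimal sketch of that loop, where commons_has_file(), commons_has_youtube_id(), build_page_text() and upload() are hypothetical stand-ins for project-specific code not reproduced here:

from subprocess import call

def process_video(video):
    # Skip duplicates: first by intended filename, then by the ?v=<id> string.
    filename = video["title"] + ".webm"
    if commons_has_file(filename) or commons_has_youtube_id(video["id"]):
        return
    # Fetch the default best quality mp4, then transcode to webm locally.
    call(["youtube-dl", "-o", "youtube-x.mp4", video["webpage_url"]])
    call(["ffmpeg", "-i", "youtube-x.mp4",
          "-c:v", "libvpx", "-c:a", "libvorbis", "youtube-x.webm"])
    # Build the image page text (playlist category, Youtube categories, tags)
    # and upload, giving up after a second failure.
    page_text = build_page_text(video)
    for attempt in range(2):
        if upload("youtube-x.webm", filename, page_text):
            break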
The critical call for the transcoding is:
call(["nice","-n","15", "ffmpeg","-i","youtube-x.mp4","-c:v", "libvpx", "-crf", "4", "-b:v", "2M", "-c:a", "libvorbis", local])
"call" is from the Python subprocess module, creating a new process to do the transcoding. This is running in OSX, so the command "nice" sets the priority of the process, in this case to a low priority (higher 'niceness') so that the local machine is not overloaded for other jobs. The variable local is where the file is getting created based on the original youtube-x.mp4. The other ffmpeg options are based on the experience of creating videos for Commons over the last couple of years.
- 15 May 2017
A test run created File:2017 NHSN Training - Standardized Antibiotic Administration Ratio.webm and started Category:I Am CDC.
- 16 May 2017
Full automated run started. The file upload comment, visible in each file's history, points to this project page. One unintended consequence of using a local copy for transcoding was that it rapidly exhausted my Google Drive allowance (15 GB), as files were being synced to my Drive account. Switching off the sync while this project ran fixed the issue, but a better solution for similar projects would be to let the transcoding run on a stand-alone machine, as storage, processor time and bandwidth are all seriously affected.
- 23 May 2017
Run restarted, with an improvement to identifying videos from the youtube-dl playlist JSON listing. There seem to be two different ways the list may be returned: one has a list of 'entries' within an array for the playlist, and the other is a naked list of video objects, which was causing errors in JSON parsing of the output. Previous uploads skipped these errors, but as a consequence missed some videos or sets of videos, for example the CDC Zika virus videos.
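As a hedged sketch, the kind of normalisation needed looks like this (the exact shapes returned vary with youtube-dl versions):

import json

def playlist_entries(raw_json_text):
    # Return a flat list of video dicts from either playlist shape.
    data = json.loads(raw_json_text)
    if isinstance(data, dict) and "entries" in data:
        return data["entries"]   # playlist object wrapping an 'entries' array
    if isinstance(data, list):
        return data              # naked list of video objects
    return [data]                # single video object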
- 29 July 2017
First run completed, re-run started. It seems a full run takes long enough that many new videos appear on the channel while the run is in progress; regular refreshes may be helpful.
- 17 May 2018
First run with the new, simpler command, leaving post-processing choices to youtube-dl.