Last modified on 14 October 2014, at 07:10

Help:Converting

Converting a document from one format to another for Commons can be useful for making derivatives that are more readily accessible.

Converting from PDF to images and text

  • Although PDF documents are accepted by Commons, they can nevertheless be difficult to access.
  • Extracting images:
    • Use the File/Convert menu item of the open-source software GSview to convert any sequence of PDF pages to a sequence of images, choosing any of the listed output formats (from bit to tiffpack) and a resolution; then use "Create panorama image..." in IrfanView's Image menu to combine the sequence into a single vertical image.
    • Or use the online tool PDFaid to extract the images from a PDF in JPG, GIF, PNG or BMP format.
  • Extracting text:
    • If the PDF contains the text in an easily extracted form, then use either GSview's "Edit" / "Text extract..." or Adobe Acrobat viewer's "Save as text", otherwise:
    • Follow the advice in "Extracting images" above, then follow the advice in "Converting from image formats to text" below.
  • Converting to SVG:
    • Use pdf2svg (available for Linux) to convert the PDF to SVG if the entire file should be used as an image, e.g., if it is a diagram generated by some program.
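
The GSview steps above are interactive. As a rough command-line counterpart, the sketch below builds (but does not run) a command for Ghostscript, the interpreter that GSview is a front end for, and one for pdf2svg. The function names, the `gs` executable name, the default device and the 300 dpi resolution are assumptions chosen for illustration and do not come from this page; on Windows the Ghostscript console executable usually has a different name.

```python
import shlex


def ghostscript_images_command(pdf, first_page, last_page,
                               device="png16m", dpi=300,
                               pattern="page-%03d.png"):
    """Build a Ghostscript command that renders a page range to images.

    `device` is a Ghostscript output device (for example png16m or
    tiffpack); `pattern` is the output file name, where %03d is replaced
    by a running number. Defaults are illustrative assumptions.
    """
    return [
        "gs", "-dNOPAUSE", "-dBATCH",
        f"-sDEVICE={device}",
        f"-r{dpi}",
        f"-dFirstPage={first_page}",
        f"-dLastPage={last_page}",
        f"-sOutputFile={pattern}",
        pdf,
    ]


def pdf2svg_command(pdf, svg, page=1):
    """Build a pdf2svg command that converts one page of a PDF to SVG."""
    return ["pdf2svg", pdf, svg, str(page)]


# Print the commands instead of running them, so they can be inspected first.
print(shlex.join(ghostscript_images_command("input.pdf", 1, 3)))
print(shlex.join(pdf2svg_command("diagram.pdf", "diagram.svg")))
```

Once a printed command looks right, it can be pasted into a terminal or passed to `subprocess.run`.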

Converting from image formats to text

  • Use IrfanView's Option menu "Start OCR" plugin (OCR is optical character recognition) to extract the text. As of 2010-02-20 the KADMOS OCR plugin for IrfanView is limited to around six pages, depending on your computer's free memory (about one gigabyte is needed per 10 pages), so you may need to convert in sections. The conversion is not perfect, so you will need to correct the generated text manually.
  • Or use the free, open-source Tesseract software (Linux, Mac OS X or Windows):
    • download both the "tesseract" software and the "tessdata" language packs relevant to the languages appearing in the scanned document, and unpack them into the same folder; no installation is needed for the Windows executable
    • obtain the highest resolution scan possible, and if necessary further enlarge the scan image (use fast resize, avoid resample filters) until characters are over 20 pixels high (experiment for best results) and save it as an uncompressed TIFF (use ImageMagick or IrfanView); you will need a lot of disk space
    • try small fragments first because recognition can take several minutes per page
    • tesseract may crash with input fragments larger than about 12 pages
    • use the command "tesseract.exe input.tif output" (on Linux or Mac OS X, "tesseract input.tif output"); the recognized text is written to output.txt
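
The Tesseract advice above involves a little arithmetic and bookkeeping, which can be sketched as follows. This is only an illustration: the function names are hypothetical, the whole-number enlargement factor is an assumption (it suits a fast resize without resampling filters), and the limits of 20 pixels and 12 pages are taken from the rough figures in the list, not from Tesseract itself.

```python
import math
import shlex


def enlargement_factor(char_height_px, target_px=20):
    """Smallest whole-number factor that makes characters taller than target_px."""
    if char_height_px <= 0:
        raise ValueError("character height must be positive")
    return max(1, math.floor(target_px / char_height_px) + 1)


def page_batches(pages, max_pages=12):
    """Split a list of pages into fragments of at most max_pages each."""
    return [pages[i:i + max_pages] for i in range(0, len(pages), max_pages)]


def tesseract_command(tif, output_base, lang=None):
    """Build a Tesseract command; the text ends up in output_base + '.txt'.

    `lang` is an optional language code matching an unpacked tessdata
    language pack, passed with Tesseract's -l option.
    """
    cmd = ["tesseract", tif, output_base]
    if lang:
        cmd += ["-l", lang]
    return cmd


# Characters 8 pixels high need to be enlarged 3 times to exceed 20 pixels.
print(enlargement_factor(8))

# A 30-page scan is split into fragments of 12, 12 and 6 pages.
print([len(batch) for batch in page_batches(list(range(1, 31)))])

# Print the command instead of running it.
print(shlex.join(tesseract_command("input.tif", "output", lang="pol")))
```

Starting with a single small fragment, as recommended above, keeps a failed run cheap before the whole document is processed.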

Converting from image formats to GIF, JPEG, PNG or TIFF

  • Use the free software IrfanView, and jpegcrop (for advanced lossless cropping and other transformations)
  • Or use the free software suite ImageMagick
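
For ImageMagick, a conversion between these formats is typically a single command in which the output format follows from the target file extension. The sketch below builds such a command without running it; the function name is hypothetical, and the program name `convert` is an assumption (newer ImageMagick releases also provide a `magick` command).

```python
import shlex
from pathlib import Path


def imagemagick_convert_command(source, target_format, program="convert"):
    """Build an ImageMagick command that converts `source` to `target_format`.

    The target file keeps the source name and gets the new extension,
    from which ImageMagick infers the output format.
    """
    target = Path(source).with_suffix("." + target_format.lower())
    return [program, str(source), str(target)]


# Print the command instead of running it.
print(shlex.join(imagemagick_convert_command("scan.tiff", "png")))
```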

Converting from video formats to Theora

Converting from audio formats to Vorbis ogg

See also