Converting a document from one format to another for Commons can be useful in order to make derivatives that are more readily accessible.

Converting from PDF to images and text

  • Although PDF documents are accepted by Commons, they can nevertheless be difficult to access.
  • Extracting images:
    • Use the open-source GSview's File/Convert menu item to convert any sequence of PDF pages to a sequence of images, in any output format it lists (from bit to tiffpack) and at a resolution you choose; then use IrfanView's Image menu item "Create panorama image..." to combine the sequence into a single vertical image.
  • Extracting text:
    • If the PDF contains the text in an easily extracted form, use either GSview's "Edit" / "Text extract..." or Adobe Acrobat viewer's "Save as text". Otherwise:
    • Follow the advice in "Extracting images" above, then follow the advice in "Converting from image formats to text" below.
  • Converting to SVG:
    • Use pdf2svg (available for Linux) to convert the PDF to SVG if the entire file should be used as an image, e.g., if it is a diagram generated by some program.
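
The image and SVG steps above can also be scripted. The following is a minimal sketch, not the procedure this page prescribes: it assumes that Ghostscript (the renderer GSview is a front end for) is installed as `gs` and that `pdf2svg` is on the PATH. The function names, the default `png16m` device, and the output file patterns are illustrative choices, and the functions only build the command lists, so nothing is executed until you pass one to `subprocess.run`.

```python
from pathlib import Path


def ghostscript_render_cmd(pdf, out_pattern="page-%03d.png", device="png16m",
                           dpi=300, first_page=None, last_page=None):
    """Build a Ghostscript command that renders PDF pages to image files.

    Assumption: the Ghostscript executable is called "gs" (on Windows it is
    usually "gswin64c" or "gswin32c").
    """
    cmd = ["gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
           f"-sDEVICE={device}", f"-r{dpi}"]
    # Restrict the page range only when the caller asks for it.
    if first_page is not None:
        cmd.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        cmd.append(f"-dLastPage={last_page}")
    # %03d in the pattern is expanded by Ghostscript to the page counter.
    cmd += [f"-sOutputFile={out_pattern}", str(pdf)]
    return cmd


def pdf2svg_cmd(pdf, page=1):
    """Build a pdf2svg command that writes one page next to the PDF.

    The "<stem>-<page>.svg" output name is a hypothetical convention.
    """
    pdf = Path(pdf)
    svg = pdf.with_name(f"{pdf.stem}-{page}.svg")
    return ["pdf2svg", str(pdf), str(svg), str(page)]


if __name__ == "__main__":
    # Print the commands instead of running them; to execute one, use
    # subprocess.run(cmd, check=True).
    print(" ".join(ghostscript_render_cmd("scan.pdf", first_page=1, last_page=3)))
    print(" ".join(pdf2svg_cmd("diagram.pdf")))
```

Building the argument list separately from running it makes it easy to inspect the command before processing a large document.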

Converting from image formats to text

  • Use IrfanView's Option menu "Start OCR" plugin (OCR is optical character recognition) to extract the text. As of 2010-02-20, the KADMOS OCR plugin for IrfanView is limited to around six pages, depending on your computer's free memory (about one gigabyte is needed per 10 pages), so you may need to convert in sections. You will need to correct the generated text manually because the conversion is not perfect.
  • Or use the free, open-source tesseract software (Linux, Mac OS X or Windows):
    • Download both the "tesseract" software and the "tessdata" language packs for the languages that appear in the scanned document, and unpack them into the same folder; no installation is needed for the Windows executable.
    • Obtain the highest-resolution scan possible and, if necessary, enlarge the scanned image further (use fast resize, avoid resample filters) until the characters are over 20 pixels high (experiment for best results). Save it as an uncompressed TIFF (use ImageMagick or IrfanView); you will need a lot of disk space.
    • Try small fragments first, because recognition can take several minutes per page.
    • tesseract may crash with input fragments larger than about 12 pages.
    • Use the command "tesseract.exe input.tif output".
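
The tesseract steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a tested workflow: it assumes ImageMagick's legacy `convert` command and a `tesseract` executable are on the PATH, and the function names and the 21-pixel default (the smallest whole number "over 20") are hypothetical. The functions only compute values and build command lists; run a command with `subprocess.run(cmd, check=True)` once it looks right.

```python
import math


def upscale_percent(char_height_px, target_px=21):
    """Smallest whole percentage that makes characters at least target_px high."""
    if char_height_px <= 0:
        raise ValueError("character height must be positive")
    if char_height_px >= target_px:
        return 100  # already tall enough, no enlargement needed
    return math.ceil(100 * target_px / char_height_px)


def enlarge_cmd(src, dst_tif, percent):
    """Build an ImageMagick command that enlarges a scan into an uncompressed TIFF.

    -sample resizes by point sampling (no resample filter), and
    -compress None turns TIFF compression off.
    """
    return ["convert", str(src), "-sample", f"{percent}%",
            "-compress", "None", str(dst_tif)]


def tesseract_cmd(image, output_base, lang=None):
    """Build the tesseract command; the text is written to <output_base>.txt."""
    cmd = ["tesseract", str(image), str(output_base)]
    if lang:
        # The language code must match an unpacked tessdata pack, e.g. "deu".
        cmd += ["-l", lang]
    return cmd


if __name__ == "__main__":
    percent = upscale_percent(12)  # characters measured at 12 pixels high
    print(" ".join(enlarge_cmd("scan.png", "input.tif", percent)))
    print(" ".join(tesseract_cmd("input.tif", "output")))
```

Measuring the character height on a small crop first, and running the commands on that crop, is a cheap way to follow the advice about experimenting before converting a whole document.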

Converting from image formats to GIF, JPEG, PNG or TIFF

See also