Converting a document from one format to another can be useful for Commons, because the resulting derivatives are more readily accessible.
Converting from PDF to images and text
- Although PDF documents are accepted by Commons, they can nevertheless be difficult to access.
- Extracting images:
- Extracting text:
- If the PDF contains the text in an easily extracted form, then use either GSview's "Edit" / "Text extract..." or Adobe Acrobat viewer's "Save as text", otherwise:
- Follow the advice in "Extracting images" above, then follow the advice in "Converting from image formats to text" below.
Converting from image formats to text
- Use IrfanView's "Start OCR" plugin from the Options menu (OCR is optical character recognition) to extract the text. As of 2010-02-20, the KADMOS OCR plugin for IrfanView is limited to around six pages, depending on your computer's free memory (about one gigabyte is needed per 10 pages), so you may need to convert in sections. The conversion is not perfect, so you will need to correct the generated text manually.
- Or use the free, open-source Tesseract software (Linux, Mac OS X or Windows):
- download both the "tesseract" software and the "tessdata" language packs relevant to the languages appearing in the scanned document, and unpack them into the same folder; no installation is needed for the Windows executable
- obtain the highest resolution scan possible, and if necessary further enlarge the scan image (use fast resize, avoid resample filters) until characters are over 20 pixels high (experiment for best results) and save it as an uncompressed TIFF (use ImageMagick or IrfanView); you will need a lot of disk space
- try small fragments first because recognition can take several minutes per page
- tesseract may crash with input fragments larger than about 12 pages
- use the command "tesseract.exe input.tif output"; the recognized text is written to "output.txt"
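
The Tesseract steps above can be sketched as a small Python helper. This is only an illustration under stated assumptions: the function names (enlargement_factor, batches, tesseract_command, run_ocr) are hypothetical, the whole-number scale factor is one way to follow the "fast resize, avoid resample filters" advice, and the 20-pixel and 12-page limits are taken from the rough figures in the list, so adjust them after experimenting.

```python
import math
import subprocess
from pathlib import Path

MIN_CHAR_HEIGHT_PX = 20  # advice above: characters should be over 20 pixels high
MAX_PAGES_PER_RUN = 12   # advice above: larger inputs may crash tesseract


def enlargement_factor(char_height_px, target_px=MIN_CHAR_HEIGHT_PX):
    """Smallest whole-number scale that makes characters taller than target_px.

    A whole-number factor is an assumption made here so that a fast
    (non-resampling) resize can be used.
    """
    if char_height_px <= 0:
        raise ValueError("character height must be positive")
    return max(1, math.floor(target_px / char_height_px) + 1)


def batches(pages, size=MAX_PAGES_PER_RUN):
    """Split a list of page files into groups of at most `size` pages."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]


def tesseract_command(tiff_path, output_base, exe="tesseract"):
    """Build the command line 'tesseract input.tif output' as an argument list."""
    return [exe, str(tiff_path), str(output_base)]


def run_ocr(tiff_paths, exe="tesseract"):
    """Run tesseract once per TIFF; each run writes <name>.txt next to the TIFF.

    Requires tesseract to be on the PATH (or pass exe="tesseract.exe").
    """
    for tiff in tiff_paths:
        output_base = Path(tiff).with_suffix("")  # page1.tif -> page1
        subprocess.run(tesseract_command(tiff, output_base, exe), check=True)
```

For example, enlargement_factor(8) suggests enlarging a scan with 8-pixel characters three times, and batches(pages) splits a long scan into groups that can be processed and checked one at a time, starting with a small group as recommended above.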