Help:Zoomable images/dezoomify.py

Dezoomify is now hosted at SourceForge: https://sourceforge.net/projects/dezoomify/, following a switch to the jpegtran library and a code overhaul by User:Quibik. Please use the files and documentation there, and if you have a problem, use the SourceForge issue tracker.

Archive version

This is the last version before the upgraded version at SourceForge, and it is no longer maintained. It requires PIL, whereas the current version needs jpegtran. If you'd rather use PIL, use the code below.

For a list of scripts that can stitch images from a Zoomify viewer, see Help:Zoomable images.

The dezoomify.py Python script grabs images from one or more Zoomify folders and automatically stitches them together with the Python Imaging Library (PIL). The script takes the URL of a page containing the Zoomify Flash object, searches for the appropriate directory path, determines the maximum zoom depth and number of tiles, and then downloads the tiles and combines them without intervention.

The image location is given with the "-i" parameter. By default, this is the URL of a page containing the Zoomify object, for example http://www.bl.uk/onlinegallery/onlineex/evancoll/a/zoomify72459.html

If you set the "-b" flag, the script will take the "-i" parameter as the raw base directory of the Zoomify file structure. In this case it would be: http://ogimages.bl.uk/images/zoomify/014/014EVA000000000U05043000. There is no slash at the end of the URL.
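The relationship between the page URL and the base directory can be sketched as follows. This is a minimal Python 3 illustration (the archived script itself targets Python 2), showing only the first of the patterns the script searches for. The function name find_base_directory, the example.org URLs, and the sample HTML are assumptions made up for the example.

```python
import re
from urllib.parse import urljoin

# First of the patterns the archived script tries; the others are omitted here.
ZOOMIFY_PATH = re.compile(r"""zoomifyImagePath=([^'"&]*)['"&]""")

def find_base_directory(page_url, html):
    """Return the absolute Zoomify base directory named in html, or None."""
    match = ZOOMIFY_PATH.search(html)
    if match is None:
        return None
    # The path may be relative, so resolve it against the page URL.
    return urljoin(page_url, match.group(1))

# Hypothetical page content, for illustration only.
sample_html = '<param name="FlashVars" value="zoomifyImagePath=/images/zoomify/example&zoomifyX=0">'
print(find_base_directory("http://www.example.org/gallery/page.html", sample_html))
# http://www.example.org/images/zoomify/example
```

If the lookup fails on a particular site, finding the base directory by hand and passing it with -b is the fallback.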

If you set the "-l" flag, the script takes as input a local file containing a list of URLs, one per line. Each of these is interpreted as before (you can give a simple page URL, or a base URL with the -b parameter). If you give a list, you currently cannot specify a unique file name for each output file; instead, each output file gets a numeric suffix to differentiate it.
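As a rough illustration of the numeric suffixes, the sketch below mirrors the naming used in the source code further down (a zero-based, three-digit suffix inserted before the extension). It is written for Python 3, and the helper name numbered_outputs is hypothetical.

```python
import os

def numbered_outputs(output_path, count):
    """Return the output file names used for a list of `count` images."""
    root, ext = os.path.splitext(output_path)
    # Zero-based, zero-padded three-digit suffix placed before the extension.
    return ["%s%03d%s" % (root, i, ext) for i in range(count)]

print(numbered_outputs("output.jpg", 3))
# ['output000.jpg', 'output001.jpg', 'output002.jpg']
```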

The "-e" flag sets the extension of the files we will be collecting and tiling: in all cases I have seen, this is the default, jpg. The "-q" parameter allows you to set the quality that PIL saves the image at. The default is 75, which is more than adequate for most images. Images with fine, sharp detail may need a higher quality. It is not recommended to exceed 95, as this will produce a huge image with no noticeable quality improvement. 100% quality is included in PIL mostly for testing purposes, not for actual use.

The script is currently designed to grab images from the BaseDirectory/TileGroupX/DEPTH-COL-ROW.jpg locations, and will search these locations automatically. The information about the number of tiles, tile size, etc. is gathered or derived from BaseDirectory/ImageProperties.xml.
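The layout described above can be sketched in a few lines of Python 3 (the archived script is Python 2). The function names and the example.org base URL are assumptions for illustration; the sample XML is the one quoted in the script's comments. Note one simplification: this sketch numbers tiles from the per-level tile counts, whereas the script derives them from the pixel dimensions of each level, so the two may disagree for some image sizes.

```python
import re
from math import ceil

def parse_properties(xml_text):
    """Extract (width, height, tile_size) from ImageProperties.xml text."""
    values = []
    for name in ("WIDTH", "HEIGHT", "TILESIZE"):
        match = re.search(r'\b%s="(\d+)"' % name, xml_text)
        if match is None:
            raise ValueError("%s not found in ImageProperties.xml" % name)
        values.append(int(match.group(1)))
    return tuple(values)

def zoom_levels(width, height, tile_size):
    """List (columns, rows) in tiles per zoom level, level 0 (smallest) first."""
    cols = int(ceil(width / float(tile_size)))
    rows = int(ceil(height / float(tile_size)))
    levels = [(cols, rows)]
    while (cols, rows) != (1, 1):
        # Each lower zoom level halves the tile grid, rounding up.
        cols = int(ceil(cols / 2.0))
        rows = int(ceil(rows / 2.0))
        levels.append((cols, rows))
    levels.reverse()
    return levels

def tile_url(base, levels, level, col, row, ext="jpg"):
    """Build the URL of one tile, including its TileGroup folder."""
    # Tiles are numbered level by level, then row by row; 256 tiles per group.
    index = sum(c * r for c, r in levels[:level]) + row * levels[level][0] + col
    return "%s/TileGroup%d/%d-%d-%d.%s" % (base, index // 256, level, col, row, ext)

xml = '<IMAGE_PROPERTIES WIDTH="2679" HEIGHT="4000" NUMTILES="241" NUMIMAGES="1" VERSION="1.8" TILESIZE="256"/>'
width, height, tile_size = parse_properties(xml)
levels = zoom_levels(width, height, tile_size)
print(levels)
# [(1, 1), (2, 2), (3, 4), (6, 8), (11, 16)]
print(tile_url("http://www.example.org/zoomify/image", levels, len(levels) - 1, 10, 15))
# http://www.example.org/zoomify/image/TileGroup0/4-10-15.jpg
```

For this sample image the tile counts add up to 241, which matches the NUMTILES attribute, and every tile falls in TileGroup0 because there are fewer than 256 tiles in total.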

PIL is required to handle the image files and paste them into position.

This script was written by Inductiveload; contact him on his talk page in case of bugs, suggestions, fixes, or questions. Happy scraping!

Disclaimer: As always, only download what is legal in your area. I will not be held responsible for what you do with this script.

Troubleshooting: Scraping Zoomify objects can be non-trivial. See this page for troubleshooting information.

Examples

You run the script by writing a command like the examples below into the Windows or Linux (or whatever) command line, and not into the Python interactive prompt. If your prompt begins ">>>", you are in Python already, and need to exit.

To download an image from a page containing a Zoomify object:

python dezoomify.py -i http://www.bl.uk/onlinegallery/onlineex/evancoll/a/zoomify72459.html -o c:\output.jpg

To download an image from a page containing a Zoomify object, saving at 90% quality:

python dezoomify.py -i http://www.bl.uk/onlinegallery/onlineex/evancoll/a/zoomify72459.html -q 90 -o c:\output.jpg

To download from the base URL, add the -b flag but keep the -i parameter (-b only changes how -i is interpreted; it does not replace it):

python dezoomify.py -i http://fotothek.slub-dresden.de/zooms/df/dk/0001000/df_dk_0001708 -b -o  c:\output.jpg

To download from pages containing Zoomify objects, but using a list of page URLs:

python dezoomify.py  -i c:\list.txt -l -o c:\output.jpg

To download from pages containing Zoomify objects, but using a list of base URLs:

python dezoomify.py  -i c:\list.txt -l -b -o c:\output.jpg

To get the image at a different zoom level (0 is the lowest; the highest depends on the image), add -z <level>, e.g. "-z 3":

python dezoomify.py -i http://www.bl.uk/onlinegallery/onlineex/evancoll/a/zoomify72459.html -z 3 -o c:\output.jpg
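To get a feel for what a given -z value produces, the sketch below reproduces the size calculation from the source code: each step down from the highest level halves the width and height, rounding down. It is written for Python 3, the function name size_at_level is hypothetical, and the figures (2679 by 4000 pixels, five levels) come from the sample ImageProperties.xml quoted in the script's comments, not from the British Library image used above.

```python
def size_at_level(max_width, max_height, num_levels, level):
    """Pixel size of the stitched image at a zoom level (0 is the smallest)."""
    if not 0 <= level < num_levels:
        # The archived script falls back to the highest level in this case.
        level = num_levels - 1
    scale = 2 ** (num_levels - level - 1)
    return max_width // scale, max_height // scale

for level in range(5):
    print(level, size_at_level(2679, 4000, 5, level))
# 0 (167, 250)
# 1 (334, 500)
# 2 (669, 1000)
# 3 (1339, 2000)
# 4 (2679, 4000)
```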

To save the tiles as they are downloaded, add -s. The tiles are written to a directory named after the output file without its extension (here c:\output):

python dezoomify.py -i http://www.bl.uk/onlinegallery/onlineex/evancoll/a/zoomify72459.html -s -o c:\output.jpg

Source Code

#!/usr/bin/python
# coding=utf8
 
"""
TAKE A URL CONTAINING A PAGE CONTAINING A ZOOMIFY OBJECT, A ZOOMIFY BASE
DIRECTORY OR A LIST OF THESE, AND RECONSTRUCT THE FULL RESOLUTION IMAGE
 
 
====LICENSE=====================================================================
 
This software is licensed under the Expat License (also called the MIT license).
"""
 
import sys, time, os, re, cStringIO, urllib2, urlparse, optparse, threading, Queue
from math import ceil, floor

try:
    import Image
except ImportError:
    print('(ERR) Needs PIL to run. You can get PIL at http://www.pythonware.com/products/pil/. Exiting.')
    sys.exit()
 
def main():
 
    parser = optparse.OptionParser(usage='Usage: %prog -i <source> <options> -o <output file>')
    parser.add_option('-i', dest='url', action='store',\
                             help='the URL of a page containing a Zoomify object (unless -b or -l flags are set) (required)')
    parser.add_option('-b', dest='base', action='store_true', default=False,\
                             help='the URL is the base directory for the Zoomify tile structure' )
    parser.add_option('-l', dest='list', action='store_true', default=False,\
                             help='the URL refers to a local file containing a list of URLs or base directories to dezoomify' )
    parser.add_option('-d', dest='debug', action='store_true', default=False,\
                             help='toggle debugging information' )
    parser.add_option('-e', dest='ext', action='store', default='jpg',\
                             help='input file extension (default = jpg)' )
    parser.add_option('-q', dest='qual', action='store', default='75',\
                             help='output image quality (default=75)' )
    parser.add_option('-z', dest='zoomLevel', action='store', default=False,\
                             help='zoomlevel to grab image at (can be useful if some of a higher zoomlevel is corrupted or missing)' )
    parser.add_option('-s', dest='store', action='store_true', default=False,\
                             help='save all tiles locally' )
    parser.add_option('-t', dest='nthreads', action='store', type='int', default=16,\
                             help='how many downloads will be made in parallel (default: 16)' )
    parser.add_option('-o', dest='out', action='store',\
                             help='the output file for the image' )
 
    (opts, args) = parser.parse_args()
 
    # check mandatory options
    if (opts.url is None):
        print("ERR: The input option '-i' must be given\n")
        parser.print_help()
        exit(-1)
 
    if (opts.out is None) :
        parsedURL = urlparse.urlparse(opts.url)
        basename = os.path.splitext(os.path.basename(parsedURL.path))[0]
        opts.out = "%s-%s.%s" % (parsedURL.netloc, basename, opts.ext)
        if opts.debug:
            print("(INF): No output file (-o) given. Choosing %s" % opts.out)
 
    if (int(opts.qual) > 95) :
        print("INF: Output quality over 95% is discouraged due to large filesize without useful quality increase\n")
        cont = raw_input("Continue? [y/n] >")
        if ((cont == 'n') or (cont == 'N') ):
            exit(-1)
 
    Dezoomify(opts)

def getFilePath(level, col, row, ext):
    """
    Return the file name of an image at a given position in the zoomify
    structure.
 
    The tilegroup is NOT included
    """
    return "%d-%d-%d.%s" % (level, col, row, ext)

def getUrl(url, debug=True):
    """
    getUrl accepts a URL string and returns the contents of the file,
    retrying with an increasing delay if the download fails.
 
    Keyword arguments:
    url -- the url to fetch
    debug -- currently unused
    """
    # spoof the user-agent and referrer, in case that matters.
    req_headers = {
        'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/525.13 (KHTML, like Gecko) Chrome/0.A.B.C Safari/525.13',
        'Referer': url
        }
 
    request = urllib2.Request(url, headers=req_headers) # create a request object for the URL
    opener = urllib2.build_opener() # create an opener object
    timer = 1
    for i in range(6):
        try:
            return opener.open(request).read()
        except:
            time.sleep(timer)
            timer *= 2
    raise IOError("Unable to download %s."%url)

def printLine (line):
    '''Replace the last line of the console with a new line'''
    sys.stdout.write('\r' + line)
    sys.stdout.flush()

class Dezoomify():
 
    def getImageDirectory(self, url):
        """
        Gets the Zoomify image base directory for the image tiles. This function
        is called if the user does NOT supply a base directory explicitly. It works
        by parsing the HTML code of the given page and looking for 
        zoomifyImagePath=....
 
        Keyword arguments
        url -- The URL of the page to look for the base directory on
        """
 
        try:
            content = urllib2.urlopen(url).read()
        except:
            print("ERR: Specified directory not found. Check the URL.\nException: %s " % sys.exc_info()[1])
            sys.exit()

        #Path patterns, tried in order: Flash viewer parameter, HTML5 viewer showImage() call,
        #imagepath assignment, and a quoted path containing /TileGroup0
        pathPatterns = ('zoomifyImagePath=([^\'"&]*)[\'"&]',\
        "showImage\([^),]*,[^\"']*[\"']([^'\"]*)[^)]*\)", \
        "imagepath[ \\t]*=([^&\"']+)", \
        '["\']([^"]+)/TileGroup0[^"]*\\1')
        for pattern in pathPatterns:
            m = re.search(pattern, content)
            if m: break
        if not m:
            print("ERR: Source directory not found. Ensure the given URL contains a Zoomify object.")
            sys.exit()
        else:
            imagePath = m.group(1)
            imageDir = urlparse.urljoin(url, imagePath) #add the relative url to the current url
 
            if self.debug:
                print('INF: Found image directory: ' + imageDir)
            return imageDir
 
    def getMaxZoom(self):
        """Construct a list of all zoomlevels with sizes in tiles"""
 
        zoomLevel = 0 #here, 0 is the deepest level
        width = int(ceil(self.maxWidth/float(self.tileSize))) #width of full image in tiles
        height = int(ceil(self.maxHeight/float(self.tileSize))) #height
 
        self.levels = []
 
        while True:
 
            self.levels.append((width, height))
 
            if width == 1 and height == 1:
                break
            else:
                width = int(ceil(width/2.0)) #each zoom level is a factor of two smaller
                height = int(ceil(height/2.0))
 
        self.levels.reverse() # make the 0th level the smallest zoom, and higher levels, higher zoom
 
 
 
    def getProperties(self, imageDir, zoomLevel):
        """
        Retrieve the XML properties file and extract the needed information.
 
        Sets the relevant variables for the grabbing phase.
 
        Keyword arguments
        imageDir -- the Zoomify base directory
        zoomLevel -- the level which we want to get
        """
 
        #READ THE XML FILE AND RETRIEVE THE ZOOMIFY PROPERTIES NEEDED TO RECONSTRUCT (WIDTH, HEIGHT AND TILESIZE)
        xmlUrl = imageDir + '/ImageProperties.xml' #this file contains information about the image tiles
 
        content = getUrl(xmlUrl) #get the file's contents
        #example: <IMAGE_PROPERTIES WIDTH="2679" HEIGHT="4000" NUMTILES="241" NUMIMAGES="1" VERSION="1.8" TILESIZE="256"/>
 
        print (xmlUrl)
        m = re.search('WIDTH="(\d+)"', content)
        if m:
            self.maxWidth = int(m.group(1))
        else:
            print('ERR: Width not found in ImageProperties.xml')
            sys.exit()
 
        m = re.search('HEIGHT="(\d+)"', content)
        if m:
            self.maxHeight = int(m.group(1))
        else:
            print('ERR: Height not found in ImageProperties.xml')
            sys.exit()
 
        m = re.search('TILESIZE="(\d+)"', content)
        if m:
            self.tileSize = int(m.group(1))
        else:
            print('ERR: Tile size not found in ImageProperties.xml')
            sys.exit()
 
        #PROCESS PROPERTIES TO GET ADDITIONAL DERIVABLE PROPERTIES
 
        self.getMaxZoom() #build self.levels, the list of zoom levels with their sizes in tiles
 
        self.maxZoom = len(self.levels)
 
        #GET THE REQUESTED ZOOMLEVEL
        if not zoomLevel: # none requested, using maximum
            self.zoomLevel = self.maxZoom-1
        else:
            zoomLevel = int(zoomLevel)
            if zoomLevel < self.maxZoom and zoomLevel >= 0:
                self.zoomLevel = zoomLevel
            else:
                self.zoomLevel = self.maxZoom-1
                if self.debug:
                    print ('ERR: the requested zoom level is not available, defaulting to maximum (%d)' % self.zoomLevel )
 
        #GET THE SIZE AT THE REQUESTED ZOOM LEVEL
        self.width = self.maxWidth / 2**(self.maxZoom - self.zoomLevel - 1)
        self.height = self.maxHeight / 2**(self.maxZoom - self.zoomLevel - 1)
 
        #GET THE NUMBER OF TILES AT THE REQUESTED ZOOM LEVEL
        self.maxxTiles = self.levels[-1][0]
        self.maxyTiles = self.levels[-1][1]
 
        self.xTiles = self.levels[self.zoomLevel][0]
        self.yTiles = self.levels[self.zoomLevel][1]
 
 
        if self.debug:
            print( '\tMax zoom level:    %d (working zoom level: %d)' % (self.maxZoom-1, self.zoomLevel)  )
            print( '\tWidth (overall):   %d (at given zoom level: %d)' % (self.maxWidth, self.width)  )
            print( '\tHeight (overall):  %d (at given zoom level: %d)' % (self.maxHeight, self.height ))
            print( '\tTile size:         %d' % self.tileSize )
            print( '\tWidth (in tiles):  %d (at given level: %d)' % (self.maxxTiles, self.xTiles) )
            print( '\tHeight (in tiles): %d (at given level: %d)' % (self.maxyTiles, self.yTiles) )
            print( '\tTotal tiles:       %d (to be retrieved: %d)' % (self.maxxTiles * self.maxyTiles, self.xTiles * self.yTiles))
 
 
    def getTileIndex(self, level, x, y):
        """
        Get the zoomify index of a tile in a given level, at given co-ordinates
        This is needed to get the tilegroup.
 
        Keyword arguments:
        level -- the zoomlevel of the tile
        x,y -- the co-ordinates of the tile in that level
 
        Returns -- the zoomify index
        """
 
        #use the full-size dimensions: self.width and self.height are already scaled to the requested zoom level
        index = x + y * int(ceil( floor(self.maxWidth/pow(2, self.maxZoom - level - 1)) / self.tileSize ) )
 
        for i in range(1, level+1):
            index += int(ceil( floor(self.maxWidth /pow(2, self.maxZoom - i)) / self.tileSize ) ) * \
                     int(ceil( floor(self.maxHeight/pow(2, self.maxZoom - i)) / self.tileSize ) )
 
        return index
 
 
    def constructBlankImage(self):
        try:
            self.image = Image.new('RGB', (self.width, self.height), "#000000")
        except MemoryError:
            print ("ERR: Image too large to fit into memory. Exiting")
            sys.exit(2)
        return
 
    def blankTile(self):
        return Image.new('RGB', (self.tileSize, self.tileSize), "#000000")

    def addTiles(self, imageDir):
        '''Fetch each tile in imageDir in a separate thread. When a tile is
        downloaded, paste it at the right position (in the main thread).'''
        print ("Tile download started.")
        waitingTilesQueue = Queue.Queue() #Queue for tiles that have not been downloaded
        downloadedQueue = Queue.Queue() #New tiles will be appended to this queue when they are downloaded

        for col in range(self.xTiles):
            for row in range(self.yTiles):
                waitingTilesQueue.put((col, row, imageDir))

        def fetchTile(col, row, imageDir):
            '''Fetches a tile from the server, and puts a tuple containing
            (col, row, tile) on downloadedQueue.
            Tile is an Image() object, or a blank tile if the tile couldn't be
            fetched.'''
            tileIndex = self.getTileIndex(self.zoomLevel, col, row)
            tileGroup = tileIndex // 256

            filepath = getFilePath(self.zoomLevel, col, row, self.ext) #construct the filename (zero indexed level)
            url = imageDir + '/' + 'TileGroup%d'%tileGroup + '/' + filepath

            tile = None #Initially, there is no tile
            fileStr=None
            try:
                fileStr = getUrl(url, debug=self.debug)
            except IOError as error:
                if self.debug:
                    print ("INF: Failed to retrieve tile (%s)."%error)
            #construct the image using the data.
            try:
                imageData = cStringIO.StringIO(fileStr) # constructs a StringIO holding the image
                tile = Image.open(imageData) # converts the downloaded data to an image
            except: #failure to read the image tile
                tile = self.blankTile()
                if self.debug:
                    print ('\t\tERR: Tile not found or corrupted, skipping.')
            downloadedQueue.put( (col, row, tile) )
            waitingTilesQueue.task_done()

            if not waitingTilesQueue.empty():
                col, row, imageDir = waitingTilesQueue.get()
                fetchTile(col, row, imageDir)

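        #Never start more threads than there are tiles: get() would block forever on an empty queue
        nthreads = min(int(self.nthreads), self.xTiles * self.yTiles)
        for i in range(nthreads):
            try:
                firstTile = waitingTilesQueue.get_nowait()
            except Queue.Empty:
                break #the threads already running have taken the remaining tiles
            t = threading.Thread(target=fetchTile, args=firstTile)
            t.setDaemon(True)
            t.start()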
            

        print("") #Prints a newline
        i=1
        totalTiles = self.xTiles * self.yTiles
        #Each tile yields exactly one entry on downloadedQueue, so collect exactly totalTiles entries
        while i <= totalTiles:
            percent = 100*i / totalTiles
            printLine("INF: Waiting for a tile download to complete... (%2d%%) " % (percent))
            col, row, tile = downloadedQueue.get()
            if self.store:
                #filepath is local to fetchTile, so rebuild the tile's file name here
                filepath = getFilePath(self.zoomLevel, col, row, self.ext)
                tile.save(os.path.join(self.store , filepath), quality=int(self.qual) ) #save the tile
            try:
                self.image.paste(tile, (self.tileSize * col, self.tileSize * row)) #paste into position
            except:
                if self.debug:
                    print("Invalid tile at position (%d, %d)" % (col, row))
                self.image.paste(self.blankTile(), (self.tileSize * col, self.tileSize * row)) #paste into position
            i+=1
        print("\nTile download finished.")

    def getUrls(self, opts):
        '''Returns a list of base URLs for the given Dezoomify object(s)'''
        if not opts.list: #if we are dealing with a single object
            if not opts.base:
                self.imageDirs = [ self.getImageDirectory(opts.url) ]  # locate the base directory of the zoomify tile images
            else:
                self.imageDirs = [ opts.url ]         # it was given directly
 
        else: #if we are dealing with a file with a list of objects
            listFile = open( opts.url, 'r')
            self.imageDirs = [] #empty list of directories
 
            for line in listFile:
                line = line.strip()
                if not opts.base:
                    self.imageDirs.append( self.getImageDirectory(line) )  # locate the base directory of the zoomify tile images
                else:
                    self.imageDirs.append( line ) # it was given directly
 
 
    def setupDirectory(self):
        '''if we will save the tiles, set up the directory to save in'''
        if self.store:
            root, ext = os.path.splitext(self.out)
            if not os.path.exists(root):
                if self.debug:
                    print( 'INF: Creating image storage directory: %s' % root)
                os.mkdir(root)
            self.store = root
        else:
            self.store = None
 
 
 
    def __init__(self, opts):
        self.debug = opts.debug
        self.ext = opts.ext
        self.store = opts.store
        self.qual = opts.qual
        self.out = opts.out
        self.nthreads = int(opts.nthreads) #make sure the thread count is an integer

        self.setupDirectory()
        self.getUrls(opts)
 
        i = 0
        for imageDir in self.imageDirs:
            self.getProperties(imageDir, opts.zoomLevel)       # inspect the ImageProperties.xml file to get properties, and derive the rest
 
            self.constructBlankImage() # create the blank image to fill with tiles
            self.addTiles(imageDir)            # find, download and paste tiles into place
 
            if opts.list: #add a suffix to the output file names if needed
                root, ext = os.path.splitext(self.out)
                destination = root + '%03d' % i + ext
            else:
                destination = self.out
 
            self.image.save(destination, quality=int(self.qual) ) #save the dezoomified file
 
            if self.debug:
                print( 'INF: Dezoomified image created and saved to ' + destination )
 
            i += 1
 
if __name__ == "__main__":
    main()

License

 
This file is licensed under the Expat License, sometimes known as the MIT License:

Copyright © Inductiveload

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

The Software is provided "as is", without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the authors or copyright holders be liable for any claim, damages or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the Software or the use or other dealings in the Software.