Commons:Batch uploading/Brooklyn Museum/HowTo

How do I process this batch job

First, sorry for my bad English ;) This is a short description of how I do this job. I hope it helps a bit if you start your own bot. You should be a Linux or Unix user to understand it. Currently I have to use Xubuntu (although I dislike it), so everything below was done on Xubuntu.

I currently work in these steps:

  • analyse the website and find the best way to extract the images and the metadata
  • write some scripts with a lot of loops, sed and grep commands, then download everything I need (images & metadata)
  • if needed, parse the metadata again, check and format it, then create a script to upload everything
  • run upload tests, then upload (from another, headless machine)

First step: I tested the API and found out that I can get all information about an object by its itemId. The simplest way to get all itemIds is to parse the search results, so I wrote this simple bash script:

#!/bin/bash
outfile=itemIds.txt

# 30 results per page: pages = item count / 30 + 1, numbered from 0
for i in {0..136} ; do

        index=$(( i * 30 ))

        echo "Page #${i} ..."
        lynx --source "http://www.brooklynmuseum.org/opencollection/search/?type=object&start_index=${index}&q=africa*&prev_q=&x=25&y=14"|tr '[\n\r\t]' ' '|sed 's/<div /\n<div /g'|grep 'item-info' |grep -v 'item-info-no-image'|grep '/opencollection/objects/[0-9]*/'|cut -d '"' -f 2| cut -d '/' -f 4 >> ${outfile}.tmp

done

sort -u "${outfile}.tmp" > "${outfile}"
rm "${outfile}.tmp"
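
The page range 0..136 in the script above is hard-coded from the item count. As a small sketch, the same numbers can be computed instead. The names page_count and start_index are made up for this example, and it assumes 30 results per page as in the script above.

```bash
#!/bin/bash
# Number of result pages for TOTAL hits at PER_PAGE hits per page (rounded up).
page_count() {
    local total="$1" per_page="$2"
    echo $(( (total + per_page - 1) / per_page ))
}

# Value of start_index for the 0-based page number PAGE.
start_index() {
    local page="$1" per_page="$2"
    echo $(( page * per_page ))
}
```

With a hypothetical total of 4100 hits, `page_count 4100 30` gives 137 pages, which matches the pages 0 to 136 used above. Note that this rounds up, so it differs from "item count / 30 + 1" when the count is an exact multiple of 30.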

After this I found only 1568 objects with images. OK, next I made a script that downloads the XML data for each object into its own file (I suggest using a sub-folder for each object). You will need an API key, which you can get here. I tuned the parameters a little to get the highest resolution and all the other useful information.

#!/bin/bash
apikey="<insert your api key>"
cat itemIds.txt | while read item ; do

        # create folder in 'files'
        mkdir -p "files/${item}"

        echo "ItemId: ${item} ..."

        # get xml as-is
        lynx -source "http://www.brooklynmuseum.org/opencollection/api/?method=collection.getItem&version=1&api_key=${apikey}&item_type=object&item_id=${item}&image_results_limit=20&include_html_style_block=true&max_image_size=1536" > files/${item}/${item}.xml
done
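
If the download is interrupted, it is handy to restart it without fetching everything again. The following is only a sketch under that assumption: item_url and needs_download are names invented for this example, the URL parameters are copied from the script above, and the files/<item>/<item>.xml layout is the one used there.

```bash
#!/bin/bash
# Print the collection.getItem URL for one item id (parameters as in the script above).
item_url() {
    local apikey="$1" item="$2"
    printf '%s' "http://www.brooklynmuseum.org/opencollection/api/?method=collection.getItem&version=1&api_key=${apikey}&item_type=object&item_id=${item}&image_results_limit=20&include_html_style_block=true&max_image_size=1536"
}

# Succeed if the XML for item $1 still has to be fetched (file missing or empty).
needs_download() {
    [ ! -s "files/$1/$1.xml" ]
}
```

Inside the download loop this could be used as `needs_download "${item}" && lynx -source "$(item_url "${apikey}" "${item}")" > "files/${item}/${item}.xml"`.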

To analyse the available licences I wrote this script (extractRights.sh) and then piped its output through sort and uniq.

#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do
        cat "${file}" | tr '[\r\n]' ' '|sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' |awk -F\" '{print $2}'
done
# run the script and count how often each rights type occurs
bash extractRights.sh | sort | uniq -c

This is the result:

      1 1.0
     80 copyright_artist_or_artists_estate
   1450 creative_commons_by_nc
     37 no_known_copyright_restrictions

Now I know the keywords for the licences I can use (creative_commons_by_nc and no_known_copyright_restrictions), so I wrote a script that removes all files that do not have one of these licences.

Hint: Currently there is a mistake on the museum's side. They marked the images as CC-BY on the website but as CC-BY-NC in the API. We are sure they mean CC-BY in the API too, so 'creative_commons_by_nc' in the API means 'creative_commons_by'.

#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do

        rightstype="`cat \"${file}\" | tr '[\r\n]' ' '|sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"

        if [ "$rightstype" != "creative_commons_by_nc" ] && [ "$rightstype" != "no_known_copyright_restrictions" ] ; then
        
                rm "$file"
                rmdir "$(dirname "$file")"

        fi
done
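
Because the script above deletes files, a dry run first is a good idea. The sketch below only lists what would be removed. The names is_allowed_rights and list_rejected are invented for this example, and the attribute is read with grep -o instead of the sed/awk pipeline, assuming the attribute sits on one line.

```bash
#!/bin/bash
# Succeed only for the two rights types that are kept.
is_allowed_rights() {
    case "$1" in
        creative_commons_by_nc|no_known_copyright_restrictions) return 0 ;;
        *) return 1 ;;
    esac
}

# Print (but do not delete) every XML file below $1 whose rightstype is not allowed.
list_rejected() {
    local file rightstype
    find "$1" -type f -name '*.xml' | while read -r file ; do
        rightstype="$(grep -o 'rightstype="[^"]*"' "$file" | head -n 1 | cut -d '"' -f 2)"
        is_allowed_rights "$rightstype" || echo "$file"
    done
}
```

`list_rejected ./files` prints the candidates, so they can be checked by hand before the real removal script is run. A file without any rightstype attribute is listed too.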

I do the same with the attribute 'collection' to keep only the items that are in the Arts of Africa collection.

#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do

        collection="`cat \"${file}\" | tr '[\r\n]' ' '|sed '/collection/s/\(.*collection=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"

        if [ "$collection" != "Arts of Africa" ] ; then
        
                rm "$file"
                rmdir "$(dirname "$file")"

        fi
done

OK, let's count:

find ./files -type f -name '*.xml'  | wc -l

OK, that leaves 1392 objects now.

I wrote a bash script that extracts all information from the XML files, puts each piece into its own file, and downloads and renames the images. I know that bash is not the perfect scripting language for this job, but I like to play around with it and it is easy to develop in. Do not be surprised that I put every piece of information into a separate file (I love working with single files); this keeps the upload script small and easy to develop.

#!/bin/bash
find ./files -type f -name '*.xml' | while read file ; do

        echo "Process ${file} ..."

        xml="`cat \"${file}\" | tr '[\r\n]' ' '`"

        # extract all information with some grep and sed magic
        # please do not try to understand this unless you are a little bit crazy ;)

        id="`echo \"${xml}\" | sed 's/id=/\nid=/g'| grep "id=" | head -n 1|sed '/id/s/\(.*id=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        title="`echo \"${xml}\" | sed 's/title=/\ntitle=/g'| grep "title=" | head -n 1|sed '/title/s/\(.*title=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        uri="`echo \"${xml}\" | sed 's/uri=/\nuri=/g'| grep "uri=" | head -n 1|sed '/uri/s/\(.*uri=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        accession_number="`echo \"${xml}\" | sed 's/accession_number=/\naccession_number=/g'| grep "accession_number=" | head -n 1|sed '/accession_number/s/\(.*accession_number=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        object_date="`echo \"${xml}\"| sed 's/object_date=/\nobject_date=/g'| grep "object_date=" | head -n 1|sed '/object_date/s/\(.*object_date=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        medium="`echo \"${xml}\" | sed 's/medium=/\nmedium=/g'| grep "medium=" | head -n 1|sed '/medium/s/\(.*medium=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
        dimensions="`echo \"${xml}\"| sed 's/dimensions=/\ndimensions=/g'| grep "dimensions=" | head -n 1|sed '/dimensions/s/\(.*dimensions=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
        credit_line="`echo \"${xml}\" | sed 's/credit_line=/\ncredit_line=/g'| grep "credit_line=" | head -n 1|sed '/credit_line/s/\(.*credit_line=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        classification="`echo \"${xml}\" | sed 's/classification=/\nclassification=/g'| grep "classification=" | head -n 1|sed '/classification/s/\(.*classification=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        description="`echo \"${xml}\" | sed 's/description=/\ndescription=/g'| grep "description=" | head -n 1|sed '/description/s/\(.*description=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
        location="`echo \"${xml}\" | sed 's/location=/\nlocation=/g'| grep "location=" | head -n 1|sed '/location/s/\(.*location=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        label="`echo \"${xml}\" | sed 's/label=/\nlabel=/g'| grep "label=" | head -n 1|sed '/label/s/\(.*label=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|sed 's/&lt;/</g'|sed 's/&gt;/>/g'|sed -e 's/<[^>]*>//g'`"
        #collection="`echo \"${xml}\" | sed 's/collection=/\ncollection=/g'| grep "collection=" | head -n 1|sed '/collection/s/\(.*collection=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        #rightstype="`echo \"${xml}\" | sed 's/rightstype=/\nrightstype=/g'| grep "rightstype=" | head -n 1|sed '/rightstype/s/\(.*rightstype=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        markings="`echo \"${xml}\" | sed 's/markings=/\nmarkings=/g'| grep "markings=" | head -n 1|sed '/markings/s/\(.*markings=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        dynasty="`echo \"${xml}\" | sed 's/dynasty=/\ndynasty=/g'| grep "dynasty=" | head -n 1|sed '/dynasty/s/\(.*dynasty=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        signed="`echo \"${xml}\" | sed 's/signed=/\nsigned=/g'| grep "signed=" | head -n 1|sed '/signed/s/\(.*signed=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
        period="`echo \"${xml}\" | sed 's/period=/\nperiod=/g'| grep "period=" | head -n 1|sed '/period/s/\(.*period=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"

        if [ "$id" != "" ] ; then
                echo -n "$id" > $file.id
        fi

        if [ "$title" != "" ] ; then
                echo -n "$title" > $file.title
        fi

        if [ "$uri" != "" ] ; then
                echo -n "$uri" > $file.uri
        fi

        if [ "$accession_number" != "" ] ; then
                echo -n "$accession_number" > $file.accession_number
        fi

        if [ "$object_date" != "" ] ; then
                echo -n "$object_date" > $file.object_date
        fi

        if [ "$medium" != "" ] ; then
                echo -n "$medium" > $file.medium
        fi

        if [ "$dimensions" != "" ] ; then
                echo -n "$dimensions" > $file.dimensions
        fi

        if [ "$credit_line" != "" ] ; then
                echo -n "$credit_line" > $file.credit_line
        fi

        if [ "$classification" != "" ] ; then
                echo -n "$classification" > $file.classification
        fi

        if [ "$description" != "" ] ; then
                echo -n "$description" > $file.description
        fi

        if [ "$label" != "" ] ; then
                echo -n "$label" > $file.label
        fi

        if [ "$location" != "" ] ; then
                echo -n "$location" > $file.location
        fi

        #if [ "$collection" != "" ] ; then
        #       echo -n "$collection" > $file.collection
        #fi

        #if [ "$rightstype" != "" ] ; then
        #       echo -n "$rightstype" > $file.rightstype
        #fi

        # others
        ###################################################

        if [ "$markings" != "" ] ; then
                echo "* Markings: $markings" >> "$file.other"
        fi

        if [ "$signed" != "" ] ; then
                echo "* Signed: $signed" >> "$file.other"
        fi

        if [ "$dynasty" != "" ] ; then
                echo "* Dynasty: $dynasty" >> "$file.other"
        fi

        if [ "$period" != "" ] ; then
                echo "* Period: $period" >> "$file.other"
        fi

        # artists (different values)
        echo "${xml}" | sed 's/<artist /\n<artist /g' | grep '<artist ' | while read artist ; do

                artist_role="`echo \"${artist}\" | sed '/role/s/\(.*role=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
                artist_name="`echo \"${artist}\" | sed '/name/s/\(.*name=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"

                echo "* ${artist_role}: ${artist_name}" >> "$file.other"

        done
        
        # geolocations (different values)
        echo "${xml}" | sed 's/<geolocation /\n<geolocation /g' | grep '<geolocation ' | while read geolocation ; do

                geolocation_name="`echo \"${geolocation}\" | sed '/name/s/\(.*name=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
                geolocation_type="`echo \"${geolocation}\" | sed '/location_type/s/\(.*location_type=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"

                echo "* ${geolocation_type}: ${geolocation_name}" >> "$file.other"

        done

        # images
        image_count=0
        echo "${xml}" | sed 's/<image uri=/\n<image uri=/g' | grep '<image uri=' | sed 's/\/size[0-9]\//\/size4\//g' |while read image ; do

                image_link="`echo \"${image}\" | sed 's/uri=/\nuri=/g'| grep "uri=" | head -n 1|sed '/uri/s/\(.*uri=\)\(.*\)/\2/' |awk -F\\" '{print $2}'`"
                image_color="`echo \"${image}\" | sed 's/is_color=/\nis_color=/g'| grep "is_color=" | head -n 1|sed '/is_color/s/\(.*is_color=\)\(.*\)/\2/' |awk -F\\" '{print $2}'|grep 'true'`"
                image_xray="`echo \"${image_link}\" | grep '_xrs_\|_xray_' &> /dev/null && echo \"true\"`"
                image_name="`basename \"${image_link}\"`"
                image_ext="`echo \"${image_name}\" | rev | cut -d '.' -f 1 | rev | tr '[A-Z]' '[a-z]'`"

                image_count=`expr ${image_count} + 1`
                if [ "$image_count" -gt "1" ] ; then
                        upload_name="Brooklyn_Museum_${accession_number}_`basename \"${uri}\"`_(${image_count}).${image_ext}"
                else
                        upload_name="Brooklyn_Museum_${accession_number}_`basename \"${uri}\"`.${image_ext}"
                fi

                echo "> Download ${image_name} ..."
                
                wget "${image_link}" -O "files/${id}/${upload_name}" &> "files/${id}/${upload_name}.log" || echo "ERROR!" >> "files/${id}/${upload_name}.log"

                echo "File:${upload_name}" >> "$file.gallery"

                if [ "${image_link}" != "" ] ; then
                        echo -n "$image_link" > "files/${id}/${upload_name}.link"
                fi

                if [ "${image_name}" != "" ] ; then
                        echo -n "$image_name" > "files/${id}/${upload_name}.name"
                fi

                if [ "${image_color}" != "" ] ; then
                        echo -n "$image_color" > "files/${id}/${upload_name}.color"
                fi

                if [ "${image_xray}" != "" ] ; then
                        echo -n "$image_xray" > "files/${id}/${upload_name}.xray"
                fi

        done

done
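
The many nearly identical lines for the simple attributes in the script above could be folded into a loop. The following is a sketch of that idea and not the script I ran: xml_attr, strip_html and extract_fields are made-up names. It assumes that every attribute is preceded by a space, and it strips HTML from all fields, whereas the script above does so only for medium, dimensions, description and label. The "other" file, the artists, the geolocations and the images are not covered.

```bash
#!/bin/bash
# Print the value of the first NAME="value" attribute found on stdin.
xml_attr() {
    tr '\r\n' '  ' | grep -o " $1=\"[^\"]*\"" | head -n 1 | cut -d '"' -f 2
}

# Decode &lt; and &gt;, then drop the HTML tags this reveals.
strip_html() {
    sed -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/<[^>]*>//g'
}

# Write every non-empty attribute of XML file $1 to "$1.<attribute>".
extract_fields() {
    local file="$1" field value
    for field in id title uri accession_number object_date medium dimensions \
                 credit_line classification description location label ; do
        value="$(xml_attr "$field" < "$file" | strip_html)"
        if [ -n "$value" ] ; then
            printf '%s' "$value" > "$file.$field"
        fi
    done
}
```

It would be called once per file, for example `find ./files -type f -name '*.xml' | while read -r file ; do extract_fields "$file" ; done`.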

This results in a file listing like this for each XML file:

$ ls -l
total 564
-rw-rw-r-- 1 xxx xxx   2238 Oct 15 20:27 2910.xml
-rw-rw-r-- 1 xxx xxx      6 Oct 20 13:05 2910.xml.accession_number
-rw-rw-r-- 1 xxx xxx      9 Oct 20 13:05 2910.xml.classification
-rw-rw-r-- 1 xxx xxx     14 Oct 20 13:05 2910.xml.collection
-rw-rw-r-- 1 xxx xxx     56 Oct 20 13:05 2910.xml.credit_line
-rw-rw-r-- 1 xxx xxx    547 Oct 20 13:05 2910.xml.description
-rw-rw-r-- 1 xxx xxx     52 Oct 20 13:05 2910.xml.dimensions
-rw-rw-r-- 1 xxx xxx     70 Oct 20 13:05 2910.xml.gallery
-rw-rw-r-- 1 xxx xxx      4 Oct 20 13:05 2910.xml.id
-rw-rw-r-- 1 xxx xxx     18 Oct 20 13:05 2910.xml.medium
-rw-rw-r-- 1 xxx xxx     31 Oct 20 13:05 2910.xml.object_date
-rw-rw-r-- 1 xxx xxx     87 Oct 20 13:05 2910.xml.other
-rw-rw-r-- 1 xxx xxx     22 Oct 20 13:05 2910.xml.rightstype
-rw-rw-r-- 1 xxx xxx      5 Oct 20 13:05 2910.xml.title
-rw-rw-r-- 1 xxx xxx     63 Oct 20 13:05 2910.xml.uri
-rw-rw-r-- 1 xxx xxx 161155 Mar 10  2012 Brooklyn_Museum_22.233_Stool_(2).jpg
-rw-rw-r-- 1 xxx xxx     80 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.link
-rw-rw-r-- 1 xxx xxx    927 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.log
-rw-rw-r-- 1 xxx xxx     13 Oct 20 13:05 Brooklyn_Museum_22.233_Stool_(2).jpg.name
-rw-rw-r-- 1 xxx xxx 162477 Mar 15  2012 Brooklyn_Museum_22.233_Stool.jpg
-rw-rw-r-- 1 xxx xxx      4 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.color
-rw-rw-r-- 1 xxx xxx     91 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.link
-rw-rw-r-- 1 xxx xxx    932 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.log
-rw-rw-r-- 1 xxx xxx     24 Oct 20 13:05 Brooklyn_Museum_22.233_Stool.jpg.name

Hint: Depending on the source there can be unusable filenames with %XX characters, double dots or double underscores. You should find these before the upload and rename all dependent files correctly, otherwise the upload script fails silently. (You can pipe all output to a log file and analyse it afterwards, e.g. python pywikipedia/upload.py ... &>> alluploads.log)
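
A quick way to find such names before the upload could look like the sketch below. The function name is_suspicious_name is invented here, and the three patterns are just the cases named in the hint (a % followed by two hex digits, two dots in a row, two underscores in a row), so other problem cases are not caught.

```bash
#!/bin/bash
# Succeed if the file name $1 contains a %XX escape, a double dot or a double underscore.
is_suspicious_name() {
    printf '%s\n' "$1" | grep -q -e '%[0-9A-Fa-f][0-9A-Fa-f]' -e '\.\.' -e '__'
}
```

For example, `find ./files -name '*.jpg' | while read -r f ; do is_suspicious_name "$(basename "$f")" && echo "$f" ; done` lists the images that need a closer look. The renaming itself (image plus its .link, .log, .name, .color and .xray files, and the entry in the .gallery file) still has to be done afterwards.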

Now we can upload the files, using pywikipedia and this upload script:

find ./files/ -name '*.jpg' | while read file ; do

        if ! grep -qxF -- "${file}" upload.log 2> /dev/null ; then

                path="`dirname \"${file}\"`"
                number="`basename \"${path}\"`"
                filename="`basename \"${file}\"`"

                id="`cat \"${path}/${number}.xml.id\" 2> /dev/null`"
                uri="`cat \"${path}/${number}.xml.uri\" 2> /dev/null`"
                accession_number="`cat \"${path}/${number}.xml.accession_number\" 2> /dev/null`"
                medium="`cat \"${path}/${number}.xml.medium\" 2> /dev/null`"
                dimensions="`cat \"${path}/${number}.xml.dimensions\" 2> /dev/null`"
                credit_line="`cat \"${path}/${number}.xml.credit_line\" 2> /dev/null`"

                image_link="`cat \"${file}.link\" 2> /dev/null`"
                image_name="`cat \"${file}.name\" 2> /dev/null`"

                # prepare title
                if test -e "${path}/${number}.xml.title" ; then
                        title="{{en|`cat \"${path}/${number}.xml.title\" 2> /dev/null`}}"
                else
                        title=""
                fi

                # prepare date
                if test -e "${path}/${number}.xml.object_date" ; then
                        if grep "^[0-9]*th century$" "${path}/${number}.xml.object_date" &> /dev/null ; then
                                yy="`cat \"${path}/${number}.xml.object_date\"| sed 's/[a-zA-Z]//g' | sed 's/[ ]*//g'`"
                                object_date="{{other_date|century|${yy}}}"
                        else
                                object_date="{{en|`cat \"${path}/${number}.xml.object_date\" 2> /dev/null`}}"
                        fi
                else
                        object_date=""
                fi

                # prepare description (the line break and the empty line in the environment variable are important)
                description="`cat \"${path}/${number}.xml.description\" 2> /dev/null`"
                label="`cat \"${path}/${number}.xml.label\" 2> /dev/null`"

                if [ "${description}" != "" ] &&  [ "${label}" != "" ] ; then
                        description="{{en|${description}}}

{{en|${label}}}"
                else
                        if [ "${description}" == "" ] &&  [ "${label}" == "" ] ; then
                                description="${title}"
                        else
                                description="{{en|${description}${label}}}"
                        fi
                fi

                # prepare location
                location="`cat \"${path}/${number}.xml.location\" 2> /dev/null`"
                if test -e "${path}/${number}.xml.location" ; then
                        location="{{Brooklyn Museum location|collection=africa}} ${location}"
                else
                        location="{{Brooklyn Museum location|collection=africa}}"
                fi

                # prepare additional notes
                notes=""
                if test -e "${path}/${number}.xml.other" 2> /dev/null ; then 
                        notes="`cat \"${path}/${number}.xml.other\" | sed 's/ place / Place /g' 2> /dev/null`"
                else
                        notes=""
                fi

                # add gallery if more than one image (the line breaks in the environment variables are important)
                image_count="`cat \"${path}/${number}.xml.gallery\" 2> /dev/null | wc -l`"
                if [ "${image_count}" -gt "1" ] ; then
                        gallery="<gallery>
`cat \"${path}/${number}.xml.gallery\" 2> /dev/null`
</gallery>"

                else
                        gallery=""
                fi

                # add categories for b&w or x-ray (the line breaks in the environment variables are important)
                add_categories=""
                if test -e "${file}.xray" ; then
                        add_categories="
[[Category:X-rays of objects]]"
                else
                        if ! test -e "${file}.color" ; then
                                add_categories="
[[Category:Black and white photographs]]"
                        fi
                fi

                # upload...
                echo "Uploading $filename => "
                starttime=$(date +"%s")

                yes N | python pywikipedia/upload.py -simulate -keep "-filename:${filename}" -noverify "${file}" "{{Artwork
 | Artist            = {{unknown}}
 | Title             = ${title}
 | Year              = ${object_date}
 | Description       = ${description}
 | Technique         = 
 | Dimensions        = ${dimensions}
 | Institution       = {{Institution:Brooklyn Museum}}
 | Location          = ${location}
 | Credit_line       = ${credit_line}
 | Inscriptions      = 
 | Notes             = ${notes}
 | Source            = [http://www.brooklynmuseum.org/opencollection/objects/${id} Online Collection] of [[w:Brooklyn Museum|Brooklyn Museum]]; Photo: Brooklyn Museum, [${image_link} ${image_name}]
 | accession number  = [http://www.brooklynmuseum.org/opencollection/objects/${id} ${accession_number}]
 | Permission        = {{WikiAfrica/Brooklyn Museum}}
 | Other_versions    = ${gallery}
}}

[[Category:African art in the Brooklyn Museum]]
[[Category:Import by User:Slick-o-bot/Brooklyn Museum]]${add_categories}" && echo "${file}" >> upload.log  

                # set throttle (means: $throttle uploads per minute)
                throttle=4

                stoptime=$(date +"%s")
                uploadtime=$(($stoptime-$starttime))
                sleep=`expr \( 60 - ${throttle} \* ${uploadtime} \) / \( ${throttle} - 1 \)`
                if [[ ${sleep} -lt 0 ]] ; then sleep=0 ; fi

                echo "-----------------------------------------------------------------"
                echo ">> upload time was ${uploadtime} seconds, sleeping ${sleep} seconds"
                echo "-----------------------------------------------------------------"
                sleep ${sleep}

        fi
done
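
Two pieces of the upload script above are easier to test as standalone functions: the date formatting and the throttle. This is a sketch with invented names (format_date, throttle_sleep). The throttle follows my reading of the formula above: ${throttle} uploads per minute leave 60 - throttle * uploadtime seconds, spread over throttle - 1 pauses. The case throttle=1 is an added assumption, because the formula above would divide by zero there.

```bash
#!/bin/bash
# Wrap an object date: "<NN>th century" becomes {{other_date|century|NN}},
# any other non-empty text becomes {{en|...}}, an empty date stays empty.
format_date() {
    if printf '%s\n' "$1" | grep -q '^[0-9]*th century$' ; then
        printf '{{other_date|century|%s}}' "${1%%th century}"
    elif [ -n "$1" ] ; then
        printf '{{en|%s}}' "$1"
    fi
}

# Seconds to sleep after an upload that took $2 seconds at $1 uploads per minute.
throttle_sleep() {
    local throttle="$1" uploadtime="$2" pause
    if [ "$throttle" -le 1 ] ; then
        # assumption: with one upload per minute, just wait out the rest of the minute
        pause=$(( 60 - uploadtime ))
    else
        pause=$(( (60 - throttle * uploadtime) / (throttle - 1) ))
    fi
    if [ "$pause" -lt 0 ] ; then pause=0 ; fi
    echo "$pause"
}
```

With throttle=4, an upload that took 5 seconds leads to a pause of 13 seconds, and an upload that took 20 seconds leads to no pause at all.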
Last modified on 1 November 2012, at 11:21