Scanning to searchable PDF

It is often very useful to keep a document in scanned form so that it is easy to access. One problem you will have to deal with is that scanned documents cannot be searched by default. Searching is easy when you can search the text inside a document, but really hard when the document contains only pixel data.

Systems that can help here are, for example, Nepomuk or anything else that lets you add tags and comments to files. The downside of this strategy is that it takes a lot of time to add enough tags to find a document later.

Another idea is to use an OCR tool to create a new document. I've tried several different OCR systems, but you cannot expect to get a searchable 1:1 copy of your scan. So the downside of this approach is the possible loss of information.

A compromise between these two solutions is to create a searchable PDF that includes a copy of the scanned image with the recognised text behind it. As a side effect, a PDF viewer is able to find the position of a word inside the document, which makes it a lot easier to find something in a multi-page PDF (for example one concatenated with pdftk). The technology that helps to generate such a document is hOCR. It is an annotated HTML format which can be produced with cuneiform or OCRopus and consumed by hocr2pdf from ExactImage to create the final PDF. Using these tools is quite easy, but I've created two scripts which help me a little when scanning many images.

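img2pdf:
#! /bin/sh
# img2pdf: turn one scanned image into two searchable PDFs (color and black/white).
set -e

if [ -z "$1" ] || [ ! -f "$1" ]; then
	echo "Usage: $0 scan.img" >&2
	exit 1
fi

# Work in a private temporary directory and remove it on exit, even on failure.
TMPDIR="$(mktemp -t -d img2pdf.XXXXXXXXXX)"
trap 'rm -rf "$TMPDIR"' EXIT

# Convert the input to TIFF, then derive a black/white version for the OCR step.
econvert -i "$1" -o "$TMPDIR/scan.tiff"
optimize2bw -n -i "$TMPDIR/scan.tiff" -o "$TMPDIR/bw.tiff"
# Recognise German text (-l ger) and write the result as hOCR.
cuneiform -l ger -f hocr -o "$TMPDIR/hocr.html" "$TMPDIR/bw.tiff"
# Combine the recognised text with each image variant.
hocr2pdf -s -i "$TMPDIR/scan.tiff" -o "$1.color.pdf" < "$TMPDIR/hocr.html"
hocr2pdf -s -i "$TMPDIR/bw.tiff" -o "$1.bw.pdf" < "$TMPDIR/hocr.html"
echo "$1.color.pdf and $1.bw.pdf created"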
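
For a pile of images that have already been scanned, img2pdf can be run in a loop. The following is only a sketch: the function name convert_all is made up for illustration, and it assumes that the img2pdf script above is installed somewhere in PATH.

```shell
#!/bin/sh
# Sketch: run img2pdf on every file given and keep going after a failure.
# convert_all is a hypothetical helper name, not part of any of the tools.
convert_all() {
	failed=0
	for img in "$@"; do
		if [ ! -f "$img" ]; then
			echo "skipping $img: not a regular file" >&2
			failed=$((failed + 1))
			continue
		fi
		if ! img2pdf "$img"; then
			echo "img2pdf failed for $img" >&2
			failed=$((failed + 1))
		fi
	done
	# Return a non-zero status if any file was skipped or failed.
	[ "$failed" -eq 0 ]
}
```

Called as `convert_all *.tiff`, it leaves a .color.pdf and a .bw.pdf next to each input file and reports problems on stderr instead of stopping at the first one.
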
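
scan2pdf:
#! /bin/sh
# scan2pdf: scan one page and convert it to searchable PDFs via img2pdf.
set -e

# Use the first argument as the base name, or fall back to a timestamp.
NAME="$1"
if [ -z "$NAME" ]; then
	NAME="$(date '+%Y%m%d_%H_%M_%S_%N')"
fi

# Work in a private temporary directory and remove it on exit, even on failure.
TMPDIR="$(mktemp -t -d scan2pdf.XXXXXXXXXX)"
trap 'rm -rf "$TMPDIR"' EXIT

# Scan at a resolution of 300 into the temporary directory.
scanimage --format=tiff --resolution 300 > "$TMPDIR/scan.tiff"
# img2pdf (the script above) must be in PATH; it writes its PDFs next to its input.
img2pdf "$TMPDIR/scan.tiff"
mv "$TMPDIR/scan.tiff.bw.pdf" "$NAME.bw.pdf"
mv "$TMPDIR/scan.tiff.color.pdf" "$NAME.color.pdf"
echo "$NAME.color.pdf and $NAME.bw.pdf created"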
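
The per-page PDFs produced by scan2pdf can then be joined into one multi-page document with pdftk, as mentioned above. A minimal sketch, assuming pdftk is installed; join_pdfs is a hypothetical helper name and not part of any of the tools:

```shell
#!/bin/sh
# Sketch: join several per-page PDFs into one file with pdftk.
# Usage: join_pdfs combined.pdf page1.pdf page2.pdf ...
join_pdfs() {
	out="$1"
	shift
	if [ "$#" -eq 0 ]; then
		echo "join_pdfs: no input files given" >&2
		return 1
	fi
	# "pdftk <inputs> cat output <file>" concatenates the inputs in the given order.
	pdftk "$@" cat output "$out"
}
```

For example, `join_pdfs letter.bw.pdf page1.bw.pdf page2.bw.pdf` should give a single letter.bw.pdf in which the viewer's search covers both pages.
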

P.S.: A Debian package for cuneiform is currently in the NEW queue, and Daniel Baumann uploaded a package for exactimage to Debian.