If you have print assets that you want to publish on the web, until now you had two options. The first one: the scanned material is OCR-ed, the text is then extracted into an editable text format (Word or XML), formatting and structure are applied throughout, and then the document is converted to a web-friendly format, like HTML. This is an expensive process even if done offshore. The second option: you can simply publish an OCR-ed scanned PDF file and accept that it will be clumsily rendered, non-interactive and only somewhat searchable, and that large documents may crash your browser.
A couple of weeks ago, CanLII deployed a beta version featuring a third approach to digitizing print assets. Since CanLII’s announcement, we at Lexum (a technology supplier for the CanLII site) have received quite a few questions on how it is done and what technology we use.
So here you go.
- The scanned images are OCR-ed with text zones defined (things such as page headers and marginal notes are omitted in this project, which deals with content from case law reports).
- In addition to a PDF file, most OCR software solutions produce an XML map containing the text equivalents of all words in the image file and their position on the page. The map gives you the precise coordinates of each word in twips or pixels.
- The output that we (“we” here and below designates software and algorithms, of course) generate and use for publishing is an HTML file composed of two layers – the image and the text. Only one is displayed to the user – if the image is loaded, the text will be hidden.
- As opposed to the OCR-generated XML map, which is organized on a word-by-word basis, our text layer map is organized on a line-by-line basis. We recreate the text map by positioning the lines based on the positions of their first and last words. Why this trouble? It allows for a more fluid text selection. You have certainly noticed that text selection in most OCR-ed PDFs looks like the words have been cut from a newspaper. We wanted to avoid this inelegant result.
- Once our lines are positioned on the map, an extra step is needed for the perfect superposition of each individual character. Each character from the text layer has to be precisely positioned where its corresponding image from the image layer is. You don’t get this at the previous step since even though the lines are positioned, the words and characters within them can be off. To achieve this, we recalculate the font size, word spacing and character spacing in the text layer.
- This is done in two steps. First, we calculate the font size that the text layer needs in order to match the image layer. The font size gives you an exact superposition on the y-axis, but it can leave your rendering off on the x-axis due to some typesetting approaches in the print version. To cope with these, we then adjust word spacing and letter spacing, which stretch or compress the text horizontally. Adjustments of font size and spacing can take several rounds.
- Further, all pages are aligned to a common vertical axis, because scanned pages do not necessarily have the same left and right margins throughout a book. This effort also allows users to seamlessly select text across multiple pages.
- End-of-line hyphenated words are checked against a dictionary and single words are reassembled for the purposes of search indexing.
- That’s it. When it comes to web page rendering, we use a lazy loading approach. Based on the user location in the file, only the current page, the two previous pages and the two subsequent pages will be loaded from the image layer. However, the text layer is loaded in full to allow you to, say, CTRL+F a word that’s on page 487 out of 600.
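For the curious, here is a rough Python sketch of the line-building and spacing steps described in the list above. It is not the code that runs on CanLII. The `Word` and `Line` structures, the `tolerance` parameter and especially the `char_width_ratio` width model are assumptions made for illustration; a real implementation would measure the text with the actual font rather than guess an average character width, and would also adjust letter spacing and possibly iterate.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Word:
    """One entry of the OCR word map (hypothetical structure)."""
    text: str
    x: float  # left edge, in pixels
    y: float  # top edge, in pixels
    w: float  # width
    h: float  # height


@dataclass
class Line:
    """One entry of the line-by-line text layer map (hypothetical structure)."""
    text: str
    x: float
    y: float
    width: float
    height: float


def group_into_lines(words: List[Word], tolerance: float = 0.5) -> List[Line]:
    """Group word boxes whose vertical centres are close, then span each
    line from the left edge of its first word to the right edge of its last."""
    rows: List[List[Word]] = []
    for word in sorted(words, key=lambda w: (w.y + w.h / 2, w.x)):
        centre = word.y + word.h / 2
        if rows:
            ref = rows[-1][0]
            if abs(centre - (ref.y + ref.h / 2)) <= tolerance * ref.h:
                rows[-1].append(word)
                continue
        rows.append([word])

    lines: List[Line] = []
    for row in rows:
        row.sort(key=lambda w: w.x)
        first, last = row[0], row[-1]
        top = min(w.y for w in row)
        bottom = max(w.y + w.h for w in row)
        text = " ".join(w.text for w in row)
        lines.append(Line(text, first.x, top, last.x + last.w - first.x, bottom - top))
    return lines


def fit_line(line: Line, char_width_ratio: float = 0.5) -> Dict[str, float]:
    """Take the font size from the line height, then spread the remaining
    width difference over the spaces between words.

    ASSUMPTION: the text's natural width is estimated as
    len(text) * font_size * char_width_ratio, a crude stand-in for
    measuring the rendered text with the real font.
    """
    font_size = line.height
    natural_width = len(line.text) * font_size * char_width_ratio
    gaps = line.text.count(" ")
    word_spacing = (line.width - natural_width) / gaps if gaps else 0.0
    return {"font_size": font_size, "word_spacing": word_spacing}
```

The two values returned by `fit_line` correspond to what a browser would receive as font size and word spacing for that line. A negative word spacing simply means that the estimated text is wider than the scanned line and has to be tightened.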
Here are some of the benefits that we have achieved with this approach.
- The PDF file is directly rendered within the webpage – there is no need to open or download stuff, or to use external plug-ins or applications.
- Large PDFs are as quick to load as any text or HTML files (see the note about the lazy loading approach above).
- Text selection is fluid within a page and across pages.
- Copy-paste to Word works nicely and it’s faithful to text structure and formatting.
- End-of-line hyphens are ignored and words are indexed as a whole, when applicable.
- Search terms can be highlighted in search results.
- Legal citations can be hyperlinked; all kinds of links can be built (the output file is an HTML file, after all…).
- The file can be annotated with available commercial or free web annotation tools.
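To give an idea of how the end-of-line hyphen handling mentioned above could work, here is a minimal Python sketch. It is an illustration and not the production code: the function name `join_hyphenated`, the toy dictionary and the fallback rule (keep the hyphen when the joined word is not in the dictionary, as in "well-known") are all assumptions.

```python
from typing import List, Optional, Set


def join_hyphenated(lines: List[str], dictionary: Set[str]) -> List[str]:
    """Return index tokens, rejoining a word split by an end-of-line hyphen
    when the joined form is found in the dictionary."""
    tokens: List[str] = []
    carry: Optional[str] = None  # fragment ending in "-" from the previous line
    for line in lines:
        words = line.split()
        if not words:
            continue
        if carry is not None:
            joined = carry[:-1] + words[0]
            lookup = joined.strip(".,;:!?()\"'").lower()
            if lookup in dictionary:
                tokens.append(joined)  # e.g. "judg-" + "ment" -> "judgment"
            else:
                # ASSUMPTION: an unknown joined form is treated as a real
                # hyphenated compound, so the hyphen is kept.
                tokens.append(carry + words[0])
            words = words[1:]
            carry = None
        if words and len(words[-1]) > 1 and words[-1].endswith("-"):
            carry = words.pop()
        tokens.extend(words)
    if carry is not None:
        tokens.append(carry)  # trailing fragment with nothing to join to
    return tokens
```

With the lines "The court gave its judg-", "ment on a well-" and "known issue." and a dictionary containing "judgment", the index would receive "judgment" as a single word, while "well-known" keeps its hyphen.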
This method copes with some of the major limitations of scanned PDFs. It does not replace Word or XML conversion but it is highly cost-efficient and offers a new cost-benefit angle.
And, finally, as a forward-looking conclusion, Internet Explorer 10 (1.2% of visitors), 9 (0.7%) and 8 (0.7%) are not supported. If you run one of these browsers, please download something else, if not for the new PDF feature on CanLII, at least for the sake of your browsing pleasure and security.
Enjoy CanLII’s smart PDF!