“Digitization” is certainly a term to conjure with in libraries these days. A variety of reasons has motivated digitization projects. The physical degradation of irreplaceable collections is a considerable spur, as is the trend toward greater openness and improved access to information. The Library of Congress has been developing significant digital collections, including the American Memory project, THOMAS (the legislative archive), and newspaper collections. In Canada, government and university libraries are looking closely at their holdings, with an eye to making rare materials available via the web. Library and Archives Canada is also building digital collections of literary works, maps and government documents. Academic libraries are opening their thesis collections.
It’s encouraging to see these projects – this willingness to adapt and to put information where the clients are may help, in the long run, to ensure that these resources remain available to future generations, and help to revitalize the profession of librarianship. It may also help to keep our libraries funded!
The Ontario Digitization Initiative (ODI) has been making rapid progress through ancient legislative documents. I also know of small government libraries which have been scanning other key documents – internal reports, administrative documents and internal policies, statutory consolidations, photos, maps – anything they can get their hands on. Done properly, the process is time-consuming and requires a high level of attention to detail. All the more reason to do it right, once, and to think about access.
Taking a lesson from Steve Matthews’ last column, perhaps the answer is to scan once, distribute widely. The Internet Archive is able to accept contributions, and suggests that they be made in PDF format. Knowledge Ontario is also willing to accept contributions to the Our Ontario portal. Instructions are available on the site. Library and Archives Canada has a program to collect born-digital documents in the unimaginatively named Electronic Collection. Their selection guidelines are available at http://www.collectionscanada.gc.ca/collection/003-200-e.html – the sections for government documents and private donations are still in development. I’ve not inquired if the team at CanLII would be willing to accept donations of digital statute consolidations to bolster their historic statute collections. I suspect that a random collection of forty years of Ontario Human Rights Codes would be more trouble than it’s worth, and the ODI will have a firm handle on statutes in a short time.
For those of you who are digitizing collections, a few observations from the projects I’ve done to date:
- It’s going to take longer than you expect. Be prepared to test, tweak, retest.
- This is not a job for those who do not pay attention to details. Proofreading is essential, especially if you are digitizing in order to get rid of the paper originals.
- OCR everything.
- Review every OCR job to make sure it interpreted the document correctly.
- If you’re housing the collection inside your organization, talk to your IT folk about where you’re going to keep these documents. Find out about limits to what they can do with them, and how you’re going to get them into the repository. Can they help you auto-generate metadata, or are you going to have to catalogue the documents to make them accessible?
- .rtf or .doc format works best with screen readers and other assistive devices – think about your audience when you choose the document format. PDF format may not be readable by some devices.
- .tif files are your last-resort choice. They aren’t searchable, you can’t cut and paste from them, and they cannot be read by assistive devices.
- DIY may not be the best option for high-volume jobs. There are lots of options out there for out-sourcing, and large volumes of documents can be done for relatively little money.
- If you outsource the scanning, make sure you can recall documents on an as-needed basis – the job always takes longer than you expect, and you may need to access materials while they’re off site. Make a box list to keep the process efficient. It’s also a good way to make sure that you’ve gotten everything back.
- Document collections which are consistently formatted are the easiest to convert and to store. Variations in page size and layout will make the process longer and more difficult.
- Annotation is not your friend when scanning, but can be an important source of context/knowledge, so don’t discount it.
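If you like to script such things, the box list mentioned above lends itself to a quick automated check. What follows is only a minimal sketch in Python, not a description of any tool I have used: it assumes the box list is kept as a CSV export with hypothetical columns `box`, `item_id` and `title`, and that you have a list of the item identifiers that actually came back from the vendor. The function names and sample data are made up for illustration.

```python
import csv
import io


def load_box_list(box_list_csv):
    """Read the box list and map each item identifier to its box number."""
    reader = csv.DictReader(io.StringIO(box_list_csv))
    return {row["item_id"].strip(): row["box"].strip() for row in reader}


def reconcile_box_list(box_list_csv, returned_ids):
    """Compare what was sent off site with what came back."""
    sent = load_box_list(box_list_csv)
    returned = {item.strip() for item in returned_ids}
    return {
        # On the box list but not among the returned items.
        "missing": sorted(set(sent) - returned),
        # Returned items that were never on the box list.
        "unexpected": sorted(returned - set(sent)),
    }


# Hypothetical box list: column names and identifiers are assumptions.
BOX_LIST = """box,item_id,title
1,RPT-001,Annual report 1978
1,RPT-002,Annual report 1979
2,POL-014,Records retention policy
"""

result = reconcile_box_list(BOX_LIST, ["RPT-001", "POL-014", "MAP-900"])
print(result)
```

The same lookup helps when you need to recall a document mid-project: `load_box_list(BOX_LIST)["RPT-002"]` tells you which box to ask the vendor for.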
A new digitization project emerged as I was finishing this column – we have a small collection of cassette tapes of talks given by various staff and executives of one of my client tribunals. The technology to play them is rapidly disappearing, and it might be interesting to have these recordings available in future years. Fortunately, our IT department has an old Dictaphone which will accept this size of cassette. I set up a laptop, equipped with Audacity and a good microphone, and I’m just letting the two machines talk to each other. The beauty of Audacity is that I can go back and edit the sound file afterward, if needed. To compress the size of the file, we’ll run it through Flash, and I’m hoping we can catalogue the new digital file into the library catalogue in the same way we would an electronic document. If that’s not possible, we can always create an audio library in SharePoint 2007.
Share your experiences and learning – what is the smartest thing that you did in a digitization project? What was your most profound learning? What collections are you developing, and where are you going to keep them? What is the best public repository that you have contributed to? What platforms are you developing to store internal documents?