Column

There Is No Simple Way to Organize 12,000,000 Titles

When I was working on my master’s degree in library and information studies, it was (and may still be) fashionable for libraries to discontinue use of formal classification schemes (like Dewey, Library of Congress, or my favourite, KF Modified) and switch to a bookstore model of organization by topic. This was described as an improvement over other options because the majority of people using libraries don’t understand what the numbers mean, and organizing by topic helps them find books on subjects they are interested in without having to learn a scheme first. There were those who felt this was an error that reduced rigour in libraries, and others who were quite happy to make the change, but there was always a large portion of libraries that didn’t consider it at all – those with substantial holdings that could not be efficiently organized in this way. The University of Toronto, for example, is credited with holding more than 12 million print volumes, and there is no simple way to organize that number of physical objects so that each item can be found easily while maintaining a logical flow of subject matter.

Arranging books in a way that makes sense to library users and staff, while remaining within the constraints of physical space, is a long-standing problem for libraries – people need to be able to find specific items they want and titles on subjects they are interested in, and the physical books need to fit in the building without creating too great a fire hazard. Recently, with computer assistance, the requirement to manage books logically in physical space has been reduced for portions of many large library collections: items are equipped with RFID tags and stored by size in boxes, which are retrieved as necessary from storage. This method obviates the physical organization element of classification by managing the books through electronic descriptions in a database for the purposes of intellectual arrangement, and treating physical location as a storage problem rather than an intellectual exercise. In a digital system, books can be everywhere it makes sense for them to be – books by a particular author, books on contracts or cats, or books published in Toronto in 1975 can all be identified and found, while being stored in whatever way makes the most efficient use of space in a warehouse navigated by robots.
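The separation described here – intellectual arrangement living in a database, physical location treated as a pure storage problem – can be sketched in a few lines. This is a toy illustration, not any real library system: the tags, fields, and warehouse slots are all invented.

```python
# Sketch: the catalogue record carries the intellectual description, while the
# physical location is just an arbitrary warehouse slot keyed by RFID tag.
# All tags, fields, and records here are invented for illustration.

catalogue = {
    "tag-0001": {"author": "Smith", "subject": "contracts", "year": 1975, "place": "Toronto"},
    "tag-0002": {"author": "Jones", "subject": "cats", "year": 1998, "place": "Vancouver"},
    "tag-0003": {"author": "Smith", "subject": "torts", "year": 1975, "place": "Toronto"},
}

# Physical storage is organized by box size, not subject: a slot tells the
# retrieval robot where to go, and nothing more.
warehouse = {
    "tag-0001": "aisle-12/box-447",
    "tag-0002": "aisle-03/box-091",
    "tag-0003": "aisle-12/box-448",
}

def find(**criteria):
    """Return (tag, slot) pairs for every record matching all the criteria."""
    return [
        (tag, warehouse[tag])
        for tag, record in catalogue.items()
        if all(record.get(k) == v for k, v in criteria.items())
    ]

# A book can be "everywhere it makes sense to be": the same object is found
# by author, by subject, or by place and year, regardless of where it sits.
print(find(author="Smith"))
print(find(place="Toronto", year=1975))
print(find(subject="cats"))
```

The point of the design is that no single shelf order has to serve every question: each query is just a different view over the same records, while the warehouse stays packed for space efficiency.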

Using print volumes as a stand-in for other forms of information organization is useful for the purposes of discussion because they are easy to visualize, but much of the information we access in law is not easy to organize in series: either it is not in a physical format that lends itself to being ordered and viewed sequentially, like caselaw, which is generally ordered by date in print reporters, or it is not in physical form at all, as is the case for most caselaw published now. In Canada’s most famous instance of topically arranged caselaw, the Canadian Abridgment, there is ongoing tension between listing cases everywhere it makes sense for researchers to look for them and the limits on customers’ willingness to pay for and store additional volumes. Should it cease to be printed, this tension will likely disappear and more content will appear in multiple places, as the electronic system is released from the constraints imposed by the logistics of print.

Digital systems have improved organization and retrieval of information, which is why many types of information, such as library catalogues, are not generally available in print anymore, but there is still a delicate balance among retrieving too many results, ease of use of the system, and retrieving everything relevant. Google solves this problem by making inferences about what a person is looking for and presenting that – this makes Google easy to use and mostly solves the problem of too many results by rarely requiring the majority of searchers to go past the first page to find something suitable. CanLII retrieves everything that fits the parameters for the search as entered, presenting what is evaluated as most relevant at the top, and, in its new incarnation, provides ways to limit the search after it is run.

The Westlaw Next default search runs algorithms that determine how precise to be based on how many results will be retrieved and other considerations; however, the system continues to recognize Boolean search commands, which override the default algorithms and retrieve results that explicitly match the query. Westlaw Next also provides the ability to narrow search results after the search is run. This system may give excellent results but, like Google’s search results, it involves some opacity about what is being left out, which can be concerning to some.

Quicklaw uses a default “OR” in a search entered into its main search box, but recognizes Boolean logic if it is entered, and includes a natural language search box on another part of the page. The default “OR” and natural language searches, which don’t generally require the presence of all entered terms, have some advantages, as virtually any search that is entered will return something. This helps if the search didn’t include a necessary synonym, and it avoids the disappointment of retrieving nothing, which can be frustrating for novice searchers. However, it can also produce unexpected results.
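The behavioural difference between a default-“OR” search and an explicit Boolean “AND” can be shown with a toy index. This is a minimal sketch of the general technique, not how Quicklaw or Westlaw actually score documents: the documents, tokenization, and ranking rule are all invented for illustration.

```python
# Toy documents and a naive term-matching search, illustrating why a
# default-OR search "always returns something" but can surprise the searcher.
docs = {
    1: "negligence duty of care owed to plaintiff",
    2: "contract formation offer and acceptance",
    3: "duty to mitigate damages in contract",
}

def search(query, mode="OR"):
    terms = query.lower().split()
    hits = []
    for doc_id, text in docs.items():
        words = text.split()
        matched = [t for t in terms if t in words]
        if mode == "OR" and matched:
            # any single term is enough to retrieve the document
            hits.append((doc_id, len(matched)))
        elif mode == "AND" and len(matched) == len(terms):
            # every term must be present, as with an explicit Boolean query
            hits.append((doc_id, len(matched)))
    # crude relevance ranking: documents matching more terms come first
    return [doc_id for doc_id, _ in sorted(hits, key=lambda h: -h[1])]

print(search("duty contract", mode="OR"))   # [3, 1, 2] – everything matches something
print(search("duty contract", mode="AND"))  # [3] – only doc 3 contains both terms
```

The OR search retrieves all three documents, including two that a researcher looking for “duty” in a contract context probably doesn’t want; the AND search retrieves only the document containing every term, at the cost of returning nothing when a synonym was missed.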

As electronic systems get older, they accrue more content: organically, as newly created content is added, and inorganically, as legacy content is digitized or otherwise acquired. Researchers widely express the desire that online systems become more comprehensive, and as systems approach this goal the parameters for success become more stringent. The success of publishers in acquiring content is paired with the challenge of presenting it in a way that is navigable by researchers with differing levels of search expertise and that, ideally, presents important results before unimportant ones. As a law librarian, I am commonly asked not only for something on a subject, but for something “good,” and this desire for “good” results is implicit in users’ expectations of online systems.

This leads publishers of online legal information to continue developing algorithms to improve relevancy ranking, and, with increased sophistication, these come closer to human-mediated editing in quality. This is often paired with tools to control results using facets after a search is run, allowing users to decide how to slice the result set based on more information than is generally available before a researcher sees the results. These two components are becoming central to legal and more general research tools as the bodies of information included in online systems get bigger. Faceted control of search results is transparent in its function and shouldn’t behave in unexpected ways, in contrast with ranking algorithms, which are generally proprietary and opaque to the user.
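The transparency of facets comes from their mechanics: each facet is just a count over a metadata field, and narrowing is just a filter the user chooses to apply. A minimal sketch, with invented cases and fields (no real system’s schema is implied):

```python
# Sketch of faceted narrowing after a search has run: the result set carries
# metadata, each facet summarizes one field, and the user applies filters.
# Titles, courts, and years are invented for illustration.

results = [
    {"title": "R v A", "court": "SCC", "year": 2010},
    {"title": "B v C", "court": "ONCA", "year": 2010},
    {"title": "D v E", "court": "SCC", "year": 1995},
]

def facet_counts(results, field):
    """Count how many results fall under each value of a facet field."""
    counts = {}
    for r in results:
        counts[r[field]] = counts.get(r[field], 0) + 1
    return counts

def narrow(results, field, value):
    """Apply a facet: keep only the results with the chosen value."""
    return [r for r in results if r[field] == value]

print(facet_counts(results, "court"))  # {'SCC': 2, 'ONCA': 1}
print(narrow(results, "year", 2010))   # the two 2010 decisions
```

Because the counts are shown before the filter is applied, the user knows exactly what each narrowing step will keep and discard – the contrast with a proprietary ranking algorithm, where the basis of ordering is hidden, is the point made above.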

As time goes on, it is easy to foresee that how best to present results in online systems with very large result sets will remain a live question for information providers, and that ongoing development and ingenuity will be devoted to making better and easier ways to navigate them. This will likely mean that algorithms become more closely aligned with the editorial process and are “massaged and kneaded by caring craftsmen to deliver a premium product.” It will also likely mean that which results are presented, when, and why becomes more opaque – and for most users this will be an improvement, just as Google’s search results put what you are likely to want at the top.

Comments

  1. Knowing some context for a retrieved digital tidbit – how it appears on screen alongside other related information, and/or what its source is – can help users filter through long search results.

    One hopes that even basically skilled Internet users are at least minimally interested in the information source. But then there may be questions of authenticity, accuracy, point-in-time status, and relevance…

    The sad thing is that Google now “encrypts” even simple search strings, so open Internet publishers – even bloggers – can no longer learn the search word patterns that brought Google searchers to their sites, information that would help them make their web content more accessible. This is not about search engine optimization (SEO), which can tend to have monetary associations.

    It is like constructing a house with cunning stairways and windows, but one is not even sure how the visitors got into the house.