Slaw The Harsh Spotlight of Science – Coming Soon to a Vendor Near You

In an email conversation with a well-known figure of the US eDiscovery world a few weeks back, I realized that we had both noticed the eDiscovery world’s dirty little secret: some eDiscovery vendors, offering both software and services, promise more than they can deliver.

Let me hasten to add, before I start receiving a torrent of protests, that there are many, many vendors who deliver what they promise, and that a failure to deliver is often not a reflection of the vendor’s capabilities (or their promises), but rather the result of issues with communication, changes in project scope, or simply the limits of the technology. One terabyte of data cannot be fully processed by next week by anyone that I know of. That is due to the speed at which computers run, not your vendor’s incompetence.

There have been some attempts to correct this seeming disconnect between what some vendors can actually do and what their sales staff say they can do. For example, John Randall ^[1] came out with a review ^[2] in 2006 of some of the leading eDiscovery processing software in use at the time. He was refreshingly honest in his assessment of what the various applications would and wouldn’t do, and received letters threatening litigation from some vendors’ lawyers as a result. Astonishing, but true. On a more informal level, perhaps the most common single request on the Yahoo LitSupport ^[3] group is a plea for someone to recommend a vendor in a given location, or for a particular kind of work.

A specific example of how the effectiveness of technology and/or methodology can be overestimated is found in the area of searching litigation documents and data, where some significant research has been done over the past few years. Some background might be helpful here. As most of you know, once paper documents and data have been collected from the client and processed appropriately, they are loaded into a litigation review tool so the legal team can review them for relevance, privilege, and importance to the issues in the case. A page-by-page review with indexing of the documents was the traditional way to accomplish this, but with the vast amount of data that usually results from eDiscovery (even after culling based on duplication, date ranges and the like), some kind of search methodology is necessary to avoid spending years on review.

Most lawyers are familiar with the basic search technology used in simple litigation review tools such as Summation and Concordance. You type in your keywords, perhaps with some Boolean connectors, hit “search”, and the documents you’re looking for are returned in the search results. Unfortunately, this simple keyword-based method of searching is not adequate on its own: it tends to miss a large share of the relevant documents.

Enter the harsh spotlight of science.

In 1985, David Blair and M. E. Maron published a study on the effectiveness of Boolean searching: “An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System,” Communications of the Association for Computing Machinery at 289-99, March 1985. The working paper is available here ^[4], and the abstract of the final, published paper is here ^[5]. I strongly recommend you read it for yourself. One of the things that made this study interesting was that it was actually done using one of the very earliest technologies used as a litigation support system: IBM’s “STAIRS” (STorage And Information Retrieval System). No doubt it wasn’t the most user-friendly piece of software ever designed, but in the early 1980s it was revolutionary.

The results were quite surprising: the lawyers using STAIRS believed that they would be able to find as much as 75% of the relevant documents in the collection with keywords. In fact, the number was a paltry and worrisome 20%. Of those documents retrieved with the chosen keywords, 80% were considered relevant. In other words, the recall (number of relevant documents retrieved/total number of relevant documents in collection) was low, while the precision (number of relevant documents retrieved/total number of documents retrieved) was comparatively high.
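For readers who prefer to see the arithmetic, here is a minimal sketch of the two measures in Python. The document counts are hypothetical round numbers that I have chosen only so that they reproduce the 20% and 80% figures; they are not the counts from the study itself, and the function names are my own.

```python
def recall(relevant_retrieved: int, relevant_in_collection: int) -> float:
    """Share of all relevant documents in the collection that the search found."""
    if relevant_in_collection <= 0:
        raise ValueError("the collection must contain at least one relevant document")
    return relevant_retrieved / relevant_in_collection


def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Share of the retrieved documents that turned out to be relevant."""
    if total_retrieved <= 0:
        raise ValueError("the search must retrieve at least one document")
    return relevant_retrieved / total_retrieved


# Hypothetical counts (assumptions for illustration, not data from the study).
relevant_in_collection = 1000  # relevant documents that exist in the collection
total_retrieved = 250          # documents returned by the keyword searches
relevant_retrieved = 200       # returned documents that reviewers marked relevant

print(f"recall:    {recall(relevant_retrieved, relevant_in_collection):.0%}")
print(f"precision: {precision(relevant_retrieved, total_retrieved):.0%}")
```

With these assumed counts, four out of five documents the lawyers looked at were relevant, which is presumably why the searches felt successful, yet 800 of the 1,000 relevant documents were never retrieved at all.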

I should point out that the searches were iterative. The lawyers would review the documents retrieved by each search, mark them according to relevance, and then revise the search terms so as to obtain more relevant documents. It wasn’t simply a one-shot search; the searches were revised based on the perceived effectiveness of the original search.

Why were the keyword searches so ineffective? The researchers point out that those developing or using Boolean keyword-based search systems have assumed that it is an easy matter to predict accurately which keywords will be found in the relevant documents, and only in the relevant documents. The study demonstrated that this is not the case, and gives some specific examples from the documents used in the study of the huge variation in the language used, even in relatively formal business documents, when referring to the same thing (e.g., the “accident” was sometimes referred to as an “unfortunate occurrence”, an “incident” or a “situation”).
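The vocabulary problem is easy to reproduce on a toy scale. The sketch below is only an illustration under stated assumptions: the five “documents” are sentences I invented around the terms quoted above, and `keyword_search` is a hypothetical stand-in for a simple Boolean OR search, not the way STAIRS or any review tool actually works.

```python
import re

# Invented mini-collection (assumption): four documents discuss the same
# event in different words, and one is unrelated.
documents = {
    "memo-1": "The accident occurred during the night shift.",
    "memo-2": "Following the unfortunate occurrence, operations were suspended.",
    "memo-3": "We are reviewing the incident with the safety team.",
    "memo-4": "The situation last week has delayed shipments.",
    "memo-5": "Quarterly sales figures are attached.",
}
relevant = {"memo-1", "memo-2", "memo-3", "memo-4"}


def keyword_search(docs: dict[str, str], keywords: list[str]) -> set[str]:
    """Return ids of documents containing ANY keyword (case-insensitive, whole words)."""
    patterns = [
        re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
        for keyword in keywords
    ]
    return {
        doc_id
        for doc_id, text in docs.items()
        if any(pattern.search(text) for pattern in patterns)
    }


def recall(found: set[str], relevant_ids: set[str]) -> float:
    """Fraction of the relevant documents that the search found."""
    return len(found & relevant_ids) / len(relevant_ids)


first_pass = keyword_search(documents, ["accident"])
revised = keyword_search(documents, ["accident", "incident"])

print(sorted(first_pass), f"recall = {recall(first_pass, relevant):.0%}")
print(sorted(revised), f"recall = {recall(revised, relevant):.0%}")
```

Even after one revision, the search still misses the documents that say “unfortunate occurrence” and “situation”. Adding a term as general as “situation” would catch the last one here, but in a real collection it would presumably also pull in a great many irrelevant documents.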

Fast forward to 2006. Technology has clearly advanced beyond that found in STAIRS. But keyword searches are still the bread and butter of finding key documents in a litigation matter and haven’t changed that much. With that in mind, the Text Retrieval Conference ^[6] (“TREC”), a joint project of the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense, decided to add a “Legal Track ^[7]”. The results of its studies have been very informative. Its experiments indicate that Boolean keyword searching retrieves only 22% (as reported in the 2007 ^[8] paper) of the relevant documents in the dataset. Newer search technologies, which make extensive use of probabilistic, rule-based and linguistic techniques (for definitions, see the Sedona Conference’s Best Practices ^[9] paper), can achieve higher recall rates, though not much higher when the techniques are fully automated. Where newer techniques were used in conjunction with feedback and advice from designated “topic experts” about the topics that were the target of the searches, recall could be increased to as much as 80% – far more in line with typical lawyer expectations.

The keyword search issues are just one example of something that I, and many others in the eDiscovery and litigation support field, are concerned about. There is a real need for basic research to determine the effectiveness of the tools and methodologies we rely on in eDiscovery matters. Studies like the one that John Randall conducted in 2006, which aimed to give an independent overview of what certain data processing technologies did well and did not do well, are very necessary. Independently derived metrics against which a vendor’s performance, methodology or software can be measured are also necessary. Open source software, where programmers far better than I can examine what the source code is actually doing and improve it in the context of an open debate, is one as-yet largely unexplored way to increase transparency in the industry. Development of generally accepted eDiscovery processes, procedures and standards would help, especially if claims of adherence to them can be independently audited. A “consumer reports” type of user feedback on software and vendors, free of threats of litigation, would also drive standards higher.

These are all necessary because litigators need to be able to defend their eDiscovery processes in court. They are necessary because litigation support staff at law firms, companies and vendors need to know the limits of the technology they are using, and to develop documented workarounds and alternate methods for what their standard technology cannot handle.

I do not mean to bash vendors of eDiscovery software and services in this column. There are many good vendors of both services and technology out there who do excellent work within the limits of the technology currently available. But I do believe that as lawyers move from a basic to a more sophisticated understanding of eDiscovery, there will be a realization that “I don’t know” is not a defensible answer to “how did you handle your data”. And if lawyers are going to have more of an answer than “I don’t know”, eDiscovery technology and processes have to move away from what is something of an unverified black box model towards independently verified and/or open and transparent models. eDiscovery vendors may need to be willing to provide subject matter experts to defend their methodologies and technologies (in the US, there is some caselaw that subjects selection of search terms to Daubert analysis – see this article ^[10]). Alternatively, they may wish to submit their technology to a full-fledged validation by an independent third party – but many vendors will avoid this due to the high cost.

As eDiscovery continues to develop as a field, I believe that we will see far more pressure on vendors to provide evidence that the tools and methods they are using are valid, just as computer forensics professionals are currently expected to do.

It will be interesting to see how this area develops over the next two or three years.