The Harsh Spotlight of Science – Coming Soon to a Vendor Near You

In email conversation with a well-known figure of the US eDiscovery world a few weeks back, I realized that we had both noticed that the eDiscovery world has a dirty little secret: there are some eDiscovery vendors out there, offering both software and services, who promise more than they can deliver.

Let me hasten to add, before I start receiving a torrent of protests, that there are many, many vendors who deliver what they promise, and that often failure to deliver is not a reflection of the vendor’s capabilities (or their promises), but rather issues with communication, changes in project scope, or simply limits of the technology. One terabyte of data cannot be fully processed by next week by anyone that I know of. It’s due to the speed at which computers run, not your vendor’s incompetence.

There have been some attempts to correct this disconnect between what some vendors can actually do and what their sales staff say they can do. For example, John Randall came out with a review in 2006 of some of the leading eDiscovery processing software in use at the time. He was refreshingly honest in his assessment of what the various applications would and wouldn’t do, and received letters threatening litigation from some vendors’ lawyers as a result. Astonishing, but true. In a more informal way, perhaps the most common single request on the Yahoo LitSupport group is a plea for someone to recommend a vendor in a given location, or for a particular kind of work.

A specific example of how the effectiveness of technology and/or methodology can be overestimated is found in the area of searching litigation documents and data. Over the past few years there has been some significant research done in the field of litigation search technology. Some background might be helpful here to understand the issues. As most of you know, once paper documents and data have been collected from the client and processed appropriately, they are loaded into a litigation review tool so the legal team can review them for relevance, privilege, and importance to the issues in the case. A page-by-page review with indexing of the documents was the traditional way to accomplish this, but with the vast amount of data that usually results from eDiscovery (even after culling based on duplication, date ranges and the like), some kind of search methodology is necessary to avoid spending years on review.

Most lawyers are familiar with the basic search technology used in simple litigation review tools such as Summation and Concordance. You type in your keywords, perhaps with some Boolean connectors, hit “search”, and the documents you’re looking for are returned in the search results. Unfortunately, this simple keyword-based method of searching is not adequate.

Enter the harsh spotlight of science.

In 1985, David Blair and M. E. Maron published a study on the effectiveness of Boolean searching: “An Evaluation of Retrieval Effectiveness for a Full-Text Document Retrieval System,” Communications of the ACM, vol. 28, no. 3, pp. 289-99, March 1985. The working paper is available here; the abstract of the final, published paper is here. I strongly recommend you read it for yourself. One of the things that made this study interesting was that it was actually done using one of the very earliest technologies used as a litigation support system: IBM’s “STAIRS” (STorage And Information Retrieval System). No doubt it wasn’t the most user-friendly piece of software ever designed, but in the early 1980s it was revolutionary.

The results were quite surprising: the lawyers using STAIRS believed that they would be able to find as much as 75% of the relevant documents in the collection with keywords. In fact, the number was a paltry and worrisome 20%. Of those documents retrieved with the chosen keywords, 80% were considered relevant. So the recall (number of relevant documents retrieved/total number of relevant documents in collection) was low. Conversely, the precision (number of relevant documents retrieved/total number of documents retrieved) was comparatively high.
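To make the two measures concrete, here is a minimal sketch in Python. The document ids and counts are made up for illustration (they are not the study's data); they are chosen only so that the result matches the 20% recall and 80% precision reported above. The function name `recall_and_precision` is my own.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for a set of retrieved document ids
    measured against the set of truly relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that the search found
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical collection: 1,000 relevant documents exist (ids 0-999).
relevant = set(range(1000))
# Hypothetical search result: 200 relevant documents plus 50 irrelevant ones.
retrieved = set(range(200)) | set(range(5000, 5050))

recall, precision = recall_and_precision(retrieved, relevant)
print(f"recall={recall:.0%} precision={precision:.0%}")
# prints: recall=20% precision=80%
```

Note that the reviewer only ever sees the 250 retrieved documents, most of which look relevant. The 800 relevant documents that were never retrieved are invisible, which may help explain why the lawyers' estimate was so far off.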

I should point out that the searches were iterative. The lawyers would review the documents retrieved by each search, mark them according to relevance, and then revise the search terms so as to obtain more relevant documents. It wasn’t simply a one-shot search; the searches were revised based on the perceived effectiveness of the original search.

Why were the keyword searches so ineffective? The researchers point out that it has been assumed by those developing or using Boolean keyword-based search systems that it’s an easy matter to accurately predict what keywords will be found in relevant documents, and only the relevant documents. This study demonstrated that this is not the case, and gives some specific examples from the documents used in the study as to the huge variation of language used in even relatively formal business documents when referring to the same thing. (E.g. The “accident” was sometimes referred to as “unfortunate occurrence”, “incident”, “situation”).
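The vocabulary problem can be sketched in a few lines of Python. The five documents below are invented for illustration (only the synonyms come from the study's example), and `keyword_search` is a hypothetical stand-in for a simple Boolean OR search, not the way STAIRS or any review tool actually works.

```python
import re

# Hypothetical mini-collection; documents 1-4 all refer to the same event.
documents = {
    1: "Report on the accident at the plant last Tuesday.",
    2: "Follow-up regarding the unfortunate occurrence last week.",
    3: "Management was briefed on the incident.",
    4: "We need to discuss the situation before the board meets.",
    5: "Quarterly sales figures are attached.",
}
relevant = {1, 2, 3, 4}


def keyword_search(docs, keywords):
    """Return ids of documents containing ANY of the keywords (Boolean OR)."""
    patterns = [
        re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        for word in keywords
    ]
    return {
        doc_id
        for doc_id, text in docs.items()
        if any(p.search(text) for p in patterns)
    }


first = keyword_search(documents, ["accident"])
revised = keyword_search(documents, ["accident", "incident"])

print(sorted(first), len(first & relevant) / len(relevant))
# prints: [1] 0.25
print(sorted(revised), len(revised & relevant) / len(relevant))
# prints: [1, 3] 0.5
```

Even after one round of revision, half of the relevant documents are still missed, and nothing in the search results tells the searcher that “unfortunate occurrence” or “situation” should have been on the list.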

Fast forward to 2006. Technology has clearly advanced beyond that found in STAIRS. But keyword searches are still the bread and butter of finding key documents in a litigation matter and haven’t changed that much. With that in mind, the Text Retrieval Conference (“TREC”), a joint project of the National Institute of Standards and Technology (NIST) and the U.S. Department of Defense, decided to add a “Legal Track”. The results of their studies have been very informative. Their experiments indicate that Boolean keyword searching retrieves only 22% of the relevant documents in the dataset (as reported in their 2007 paper). However, by taking advantage of the newer search technologies out there, which make extensive use of probabilistic, rule-based and linguistic techniques (for definitions, see the Sedona Conference’s Best Practices paper), you can get higher recall rates (but not much higher for the automated techniques). Where newer techniques were used in conjunction with feedback and advice from designated “topic experts” about the topics that were the target of the searches, recall could be increased to as much as 80% – far more in line with typical lawyer expectations.

The keyword search issues are just one example of something that I, and many others in the eDiscovery and litigation support field, are concerned about. There is a real need for basic research to determine the effectiveness of the tools and methodologies we rely on in eDiscovery matters. Studies like the one that John Randall conducted in 2006, which aimed to give an independent overview of what certain data processing technologies did well and did not do well, are very necessary. Independently derived metrics that a vendor’s performance, methodology or software can be measured against are also necessary. Open source software, where programmers far better than I can examine what the source code is actually doing and improve it in the context of an open debate, is one as-yet largely unexplored possibility to increase transparency in the industry. Development of generally accepted eDiscovery processes, procedures and standards would help, especially if claims of adherence to such standards can be independently audited. A “consumer reports” type of user feedback on software and vendors, free of threats of litigation, would also drive standards higher.

These are all necessary because litigators need to be able to defend their eDiscovery processes in court. They are necessary because litigation support staff at law firms, companies and vendors need to know the limits of the technology they are using, and to develop documented workarounds and alternate methods to handle what their standard technology cannot handle.

I do not mean to bash vendors of eDiscovery software and services in this column. There are many good vendors of both services and technology out there who do excellent work within the limits of the technology currently available. But I do believe that as lawyers move from a basic to a more sophisticated understanding of eDiscovery, there will be a realization that “I don’t know” is not a defensible answer to “how did you handle your data”. And if lawyers are going to have more of an answer than “I don’t know”, eDiscovery technology and processes have to move away from what is something of an unverified black box model towards independently verified, and/or open and transparent, models. eDiscovery vendors may need to be willing to provide subject matter experts to defend their methodologies and technologies (in the US, there is some caselaw that subjects selection of search terms to Daubert analysis – see this article). Alternatively, they may wish to submit their technology to a full-fledged validation by an independent third party – but many vendors will avoid this due to the high cost.

As eDiscovery continues to develop as a field, I believe that we will see far more pressure on vendors to provide evidence that the tools and methods they are using are valid, just as computer forensics professionals are currently expected to do.

It will be interesting to see how this area develops over the next two or three years.


Comments

  1. This post shows that the problems I identified in my earlier comment today on Bar Associations and Legal Research are worse than we might have thought.

  2. I wanted to respond to your comment that “no vendor can process a terabyte in less than a week”. It is true that most litigation support vendors are in this situation; however, that is not the case for Nuix customers. As the CEO of Nuix, I am happy to unequivocally state that vendors who use Nuix Enterprise software can easily process 1TB of data – on a single server – with full text and metadata extraction – in less than a day – and faster using multiple servers.

    In our lab we have achieved faster speeds, processing 1TB in 22 hours (on a single Dell dual quad-core box).

    The reasons Nuix can achieve such speeds are (1) we don’t use Microsoft MAPI APIs to access PST files etc., but rather read files from their binary code; and (2) we are able to use all the resources of a server (or indeed multiple servers) on datasets.

    Anyone interested in finding out more about how more advanced technologies are changing the face of the litigation support industry can read some new white papers and insight resources we have written specifically for the litigation support industry at http://www.nuix.com. If anyone wants a demonstration, trial or references – please don’t hesitate to contact me (eddie.sheehy@nuix.com).

  3. I also wanted to respond to your comment stating that “One terabyte of data cannot be fully processed by next week by anyone that I know of.” I guess you might want to clarify what is meant by “fully processed”.

    At eClaris, we can process and deliver two (2) terabytes of ESI in 24 hours or less using just about 15 to 20% of our computer resources. We will extract all available metadata contained in the ESI, de-dupe the data using both SHA-1 and MD5 hash values, index the entire data set and provide a platform to search and query the data.

    The issue here is not the lack of computer resources; all well-equipped service providers can attest to that.

    Jacques at eClaris, Inc.

  4. Perhaps I shouldn’t have been so specific in referring to a terabyte of data, although when I wrote “processing” I admit I was thinking of timelines that start with the initial “dear vendor, here’s a bunch of preserved ESI” through to delivery of the fully processed and coded data, which obviously is not what most would understand by “processing”. So I apologize for the confusion. To retain the meaning of the point I was trying to make, feel free to replace “terabyte” with “petabyte” if you wish.

    My intent with my flawed example was to point out that there are limitations to how much computing power, people and other resources you can throw at any given technological issue, and that buyers of products and services need to have reasonable expectations of what is possible.

    To have reasonable expectations, it’s important to know what the software can and cannot do, what it does badly, and why it doesn’t handle, say, NSF files well. Unfortunately there is a lack of independent verification (and documentation) of what software claims to do.

    Buyers can of course do their own testing, but most buyers of software and services don’t have time (or money) to test every piece of software they’re interested in using. Vendors can, and do, test, but that information is obviously proprietary and doesn’t always make its way down to the purchaser.

    That is why I believe there is a need for independent research and testing to verify the claims made.

    Personally I think we are going to see more scrutiny of vendors’ claims in the future, and that is a good thing for the industry.

  5. Whether it is a terabyte or a petabyte, the point is moot. Can you list 5 law firms or organizations capable of reviewing a petabyte of ESI data in a week, a month or six months and claiming to have met their (FRCP) 26(g) requirements? Would processing and reviewing a petabyte of data be a “reasonable inquiry” within a week’s notice or even a month’s notice?

    I support your call for more scrutiny of vendors’ claims, but I would also call for more scrutiny of the purchaser’s eDiscovery practices altogether. A key lesson learned from recent opinions, sanctions and rulings on eDiscovery cases is that preparation is the key. I think you would be surprised to find out what can be achieved when a competent eDiscovery consultant is brought in early on a case.

  6. I would be very surprised if any organization could review a petabyte of data in any reasonable time period at all.

    Conversely, I wouldn’t be at all surprised to find out what a competent eDiscovery consultant can do. Consulting is what I do, after all.

    I have worked with law firms to help them improve their internal processes: improvement is not just about the vendors. A look at my previous articles for Slaw would (I hope) demonstrate that I am as interested in improving how purchasers make their decisions about technology, and their eDiscovery processes in general, as I am about improving vendors of software and services.

  7. Hi Debbie. I absolutely agree that third party verification would be a great thing. Nuix would support John Randall or any organization to undertake such a benchmark study – and would commit to providing appropriate hardware and software for such a test. For too long, many software vendors have been claiming things that just don’t stand up to thorough (and sometimes simple) scrutiny. I believe Nuix is an exception. I am sure there are others. Who would everybody trust to be that independent judge and which software vendors would have their software scrutinised? That would be interesting.

  8. I don’t know that there is one good way to tackle this issue. Academic research may be useful as we’re seeing with the TREC LegalTrack project. Independent verification and/or auditing is another option. A third option is to simply let the market do its work: news about bad products and vendors does get around eventually, but it’s slow and not everyone gets the memo in time to avoid some painful experiences.

    I don’t think that just one person can set up as “judge”. As with most professions, I think we would look to the professional organization (in our case, ALSP) for guidelines we should be adhering to, but given the relative newness of the profession, other options may be appropriate.

    For example, the EDRM project now has a number of relevant projects that may help address some of the issues I’ve raised.
    The EDRM Data Set project aims to put together a 100GB set of standard data that can be used to test software and services.
    There is also the EDRM Model Code of Conduct project for both vendors and consumers.
    Plus the EDRM Metrics project hopes to provide an effective means of tracking the time, money and volumes associated with eDiscovery.

    All of these are relevant, I think, to the issue at hand.

    I would argue that the ALSP is perhaps the most appropriate body to take on this task (whatever that task may actually be), but it is still a young organization and has other more pressing items on its “to-do” list.

    As to which vendors (software and otherwise) would be scrutinised, well that is an interesting question. Should it be a compulsory or voluntary process? Either could work, but a lot will depend on how rigorous the standards are, and on how many vendors participate. If no-one participates in testing for a rigorous standard then it’s of less value than a less rigorous, but more widely adopted standard.