So There’s Been Some Buzz About Legal Data Lately …
It seems that interest in legal data has reached such a level of hype that people have started asking me about it unprompted, which is an interesting development. I had assumed that when I spoke to people about this I was buttonholing them, and that they wanted to be anywhere else, talking about anything else (except of course for Tim Knight, but that’s part of the reason we’re friends). It does make sense that it’s happening now. Legal data is interesting: it describes rules and systems that affect all our lives, it is commercially valuable, and it hasn’t been analyzed as thoroughly as comparable datasets such as medical information. Given this surge of interest, I thought I would share a few thoughts on the matter here.
One particular area of research interest is applying artificial intelligence techniques to case law for various applications, especially predictive analytics. I have written about this before here: “Like Moneyball for Lawyers?” on October 17, 2016, and generally my opinions haven’t changed in the last year and a half. There is not enough data in court decisions to provide good analytics for individual judges in particular areas of law. Adequately assessing a prospective professional ball player requires thousands of swings in an activity with relatively simple inputs and results. Most judges will not write more than a few hundred decisions in a long and active career, and those decisions have complex inputs and outputs, only some of which are available for analysis, as many court activities don’t leave a readily available written record. It’s not impossible to quantify human interactions like this, but doing so leaves out important nuance.
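To put rough numbers on that comparison, here is a back-of-the-envelope sketch (the 60% rate and the sample sizes are hypothetical) showing how slowly the uncertainty around an estimated rate shrinks as the number of observations grows:

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence interval half-width for an estimated rate."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical numbers: assume the "true" rate of some outcome is 60% and see
# how precisely it could be estimated from samples of different sizes.
scenarios = [
    ("decisions by one judge in one area of law", 30),
    ("decisions over a judge's whole career", 300),
    ("plate appearances for a prospective ball player", 5_000),
]
for label, n in scenarios:
    print(f"{label} (n={n}): 60% ± {ci_half_width(0.60, n):.1%}")
```

A few dozen decisions in a particular area of law leave a margin of error of roughly plus or minus seventeen points, and even that overstates the legal case, since the written decisions capture only part of what actually happened in court.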
Aside from publicity materials and hype-induced press coverage, I have not heard positive stories about the application of artificial intelligence in law. In fact, what I hear from people trying to apply AI to legal materials is general frustration. Start-ups are pivoting away from legal analysis to subject areas with more accessible datasets and less complicated source material, and those that haven’t pivoted frequently struggle to answer simple questions. There are many applications for automated analysis of legal documents, but as far as I can tell they tend toward extracting particular information, such as judges’ names, and as the field has moved on this is no longer considered “AI”. Even something as simple as saying what a case is about turns out to require nuance that computer programs struggle with (in fairness, on occasion I have struggled with that too).
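For a sense of what that kind of extraction looks like in practice, here is a minimal, hypothetical sketch; the pattern and the sample text are mine, not drawn from any real product, and a production system would need far more than one regular expression:

```python
import re

# Pull judges' names out of decision text with a simple pattern. Useful,
# but a long way from anything that understands the document, and the
# pattern is illustrative only.
JUDGE = re.compile(r"(?:MR\.|MADAM)?\s*JUSTICE\s+([A-Z][A-Z'\-]+)", re.IGNORECASE)

sample = (
    "REASONS FOR JUDGMENT OF THE HONOURABLE MADAM JUSTICE SMITH, "
    "on appeal from the order of Mr. Justice Doherty."
)

print([name.upper() for name in JUDGE.findall(sample)])  # ['SMITH', 'DOHERTY']
```

Even this trivial example is brittle: the same pattern would happily report “SYSTEM” as a judge if the phrase “justice system” appeared in the text, which is a small taste of why saying what a case is about is so much harder.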
The application of AI to legal data also suffers from a pair of related issues: access to raw data is restricted, while the computing power required to process it is generally available to everyone. In the first week of an MBA program they teach that for a business to be successful long term it needs some kind of competitive advantage, and using third-party resources to parse a dataset is readily replicable.
I recently heard Geordie Rose speak, and what he said is that AI is hitting the limit of what can be accomplished with free-text analysis, because the programs have no context for what they are analyzing: they have no frame of reference for what an apple is, only that it is associated with strings of text like “pie” and “tree”. He believes that the emergence of true artificial intelligence is imminent (and is quite alarming on the subject), but that it will likely require building robots for it to explore the world.
Current AI systems look at streams of binary-encoded text and try to find patterns, but they have no conception of which parts of that text are significant or what any of the words mean. Legal documents are some of the most complex writing in English, and it is unlikely that the nuance of what they mean will be an easy target.
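As a toy illustration of that limitation, here is a short sketch (the three-sentence corpus is obviously made up): a co-occurrence count can tell you that “apple” shows up near “pie” and “tree”, and nothing more.

```python
from collections import Counter
from itertools import combinations

# A made-up corpus. Counting which words appear together captures association,
# not meaning: nothing here encodes what an apple actually is.
corpus = [
    "the apple pie cooled on the sill",
    "an apple fell from the tree",
    "the old apple tree shaded the yard",
]
stopwords = {"the", "a", "an", "on", "from", "of"}

counts = Counter()
for sentence in corpus:
    words = set(sentence.split()) - stopwords
    for pair in combinations(sorted(words), 2):
        counts[pair] += 1

# Words most often seen alongside "apple", strongest association first.
apple_pairs = [(pair, n) for pair, n in counts.items() if "apple" in pair]
for pair, n in sorted(apple_pairs, key=lambda item: -item[1]):
    print(pair, n)
```

Modern systems use far more sophisticated versions of this idea, but the underlying material is still strings of text rather than any model of the world.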
David Runciman recently explained this rather well in the London Review of Books:
Alpha-Zero may have overcome thousands of years of human civilisation in a few days, but those same thousands of years of civilisation have taught us to register in an instant forms of communication that no machine is close to being able to comprehend. Chess is a problem to be solved, but language is not and this kind of open-ended intelligence isn’t either. Nor is language simply a problem-solving mechanism. It is what enables us to model the world around us; it allows us to decide which problems are the ones worth solving. These are forms of intelligence that machines have yet to master. (Diary, 25 January 2018, https://www.lrb.co.uk/v40/n02/david-runciman/diary)
Another area of interest in legal data is looking at statistical elements of the justice system. As an example, the question I’ve always wanted the answer to is how much more likely people accused in criminal cases are to plead guilty the farther they live from the courthouse, given the increased difficulty involved in traveling that far. In fact, I would be thrilled to know the answer if anyone does the research. The problem is that this isn’t an easy thing to extract from published legal literature. Not all court decisions are published, especially in routine matters in the lower levels of court, and the kind of data that would interest social scientists is not generally recorded for analysis. The cases that are published usually have something unusual about them that makes them worth writing up. For the traditional practice of law this doesn’t matter, because the outlying cases define the range, and that is what practitioners and courts are looking for. Several legal research tools are built on this principle, especially for sentencing and personal injury awards.
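Coming back to the distance-to-courthouse question, here is a sketch of how it might be framed if someone collected the underlying data; the records, the field layout, and the 50 km cut-off are all hypothetical placeholders, and a real study would use something like a logistic regression with controls rather than a two-bucket comparison:

```python
# Hypothetical records of the kind that would have to be compiled from court
# files: (distance from residence to courthouse in km, pleaded guilty or not).
records = [
    (5, False), (12, True), (8, False), (60, True), (75, True),
    (3, False), (45, True), (90, True), (20, False), (110, True),
]

def plea_rate(rows):
    """Share of accused in these rows who pleaded guilty."""
    return sum(guilty for _, guilty in rows) / len(rows)

near = [row for row in records if row[0] < 50]
far = [row for row in records if row[0] >= 50]

print(f"plea rate within 50 km of the courthouse: {plea_rate(near):.0%}")
print(f"plea rate beyond 50 km of the courthouse: {plea_rate(far):.0%}")
```

The hard part is not the arithmetic; it is that the distances, the pleas, and the routine outcomes mostly never make it into the published record.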
This data is not suitable for predicting actual awards from a statistical distribution, because the majority of the data points are not included in the set. Most statistical tools assume a normal distribution, with most of the data points grouped in the middle of the range, and either a random sample or a complete set of data points.

“A selection of Normal Distribution Probability Density Functions (PDFs). Both the mean, μ, and variance, σ², are varied. The key is given on the graph.” https://commons.wikimedia.org/wiki/File:Normal_Distribution_PDF.svg.
But court judgements aren’t a random sample. To get one would require manually compiling outcomes from court files. In British Columbia and Quebec this could be assisted by the online court document systems available in those provinces, but in other jurisdictions it would likely require physically traveling to a courthouse to access paper files or collecting data live over a period of time. There is room to bring techniques from the social sciences into the legal system, but the data collection required will be onerous. To all the intrepid legal researchers, criminologists, and others who are trying to do this: I salute you and wish you well, but you should expect it to be difficult. That said, it’s a good opportunity to look for insights no one else has had before.
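To see why the published set misleads, here is a small simulation (every number in it is hypothetical): awards are drawn from a normal distribution, but only the unusual ones, far from the middle, get “published”.

```python
import random
import statistics

# Hypothetical awards drawn from a normal distribution; only the outliers,
# more than 1.5 standard deviations from the mean, are "published".
random.seed(1)
TRUE_MEAN, TRUE_SD = 50_000, 15_000

all_awards = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(10_000)]
published = [a for a in all_awards if abs(a - TRUE_MEAN) > 1.5 * TRUE_SD]

def share_near_middle(awards):
    """Share of awards within one standard deviation of the true mean."""
    return sum(abs(a - TRUE_MEAN) <= TRUE_SD for a in awards) / len(awards)

print(f"published: {len(published)} of {len(all_awards)} awards")
print(f"share of all awards near the middle:       {share_near_middle(all_awards):.0%}")
print(f"share of published awards near the middle: {share_near_middle(published):.0%}")
print(f"spread (std dev) of all awards:       {statistics.stdev(all_awards):,.0f}")
print(f"spread (std dev) of published awards: {statistics.stdev(published):,.0f}")
```

A tool built only on the published set maps the range well, which is exactly what practitioners and courts want it for, but it says very little about what the ordinary, unpublished case looks like.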
A notable exception to this lack of data is the First Nations Court in British Columbia, which has been collecting statistics on outcomes for its clients to better describe the value of its approach. I wrote about the First Nations Court here, but I’m sure there are better sources if you care to look for them. If there are others, I invite you to add them in the comments below.
Just because it’s going to be difficult doesn’t mean it’s not worth doing. Consider John Snow’s manually compiled map of cholera deaths from 1854:

“Original map made by John Snow in 1854. Cholera cases are highlighted in black”. 1854. https://commons.wikimedia.org/wiki/File:Snow-cholera-map-1.jpg.
He saved millions of lives in his pioneering work on disease transmission by looking at the patterns of distribution. There is great work that can be done in law, but the ease of getting there has been overstated.
I feel like you really hit the nail on the head. I have been working with lawyers to automate court forms and legal processes for 15+ years. I steered away from an AI/ML approach early on because I knew that courts only document the interesting cases, and that even those are “lightly” documented in the written record. Instead, I expect practicing lawyers to act as subject-matter experts (SMEs) who can build the decision tree with confidence from long experience, rather than trying to guess the decision tree from outputs (case law, statutes, etc.).
Today’s AI hype around legal decision making has not disabused me of this notion. You have articulated it beautifully.