Thursday Thinkpiece: Legal Data and Information in Practice

Periodically on Thursdays, we present a significant excerpt, usually from a recently published book or journal article. In every case the proper permissions have been obtained. If you are a publisher who would like to participate in this feature, please let us know via the site’s contact form.

Legal Data and Information in Practice: How Data and the Law Interact

Author: Sarah Sutherland
ISBN: 9780367649883
Publisher: Routledge
Page Count: 170
Publication Date: January 2022
Regular Price: $52 (softcover)

Excerpt: Chapter 6: Issues with using legal data, sections 6.1-6.7

6.1 Introduction

The increased use of data in the legal system has been much anticipated, with both positive and negative effects expected. Among other anticipated outcomes, people look forward to the development and adoption of systems that promise long-delayed productivity gains, which they hope will reduce the cost of legal services and thereby improve access to justice for more people. At the same time, many are apprehensive that these changes will create new sources of injustice they cannot anticipate, and that many current participants in the legal system will lose employment and income. Both the hopes and the concerns are warranted.

Most of the applications and systems anticipated to transform the legal system are driven by data in one form or another, and creating custom datasets to be used as inputs is frequently prohibitively expensive. This means that many actors in this space are looking to access datasets that already exist. However, there are frequently issues with data not being available in desired formats and with suitable licensing. There are also concerns about the suitability of available data for use in many of the applications that people want to create. These concerns will need to be addressed and understood if the projected changes are to happen in the ways people want.

6.2 Availability

Before considering what data does not exist or is not recorded yet, it is useful to consider the data that exists now and how or if it is available for particular analyses. The availability of primary law in the form of court documents and legislation is a major issue in the pursuit of a legal system that is more responsive to quantitative methods, though to what extent it is an issue varies around the world. While almost all courts and legislative bodies make their documents available to be published for use by individuals doing legal research, many see their current modes of distribution as adequate and may not wish to increase them to allow for further analysis. Recent examples include the State of Georgia’s litigation to stop Carl Malamud and Public.Resource.Org from publishing the Official Code of Georgia Annotated in the United States (Georgia et al. v. Public.Resource.Org, Inc. 2020) and a law introduced in 2019 in France which prohibits the publication of statistical analysis of court decisions (Légifrance 2019).

Beyond the availability of legal documents, there are also often limits on the availability of information on analysis that is done, especially that which is carried out by companies. Researchers in these organizations may not publish their research, and even where they do publish about their enquiries, it is common for them not to include enough detail in their articles to allow others to reproduce their work. To reproduce machine learning techniques in particular requires the code involved in the project, but also details about the dataset, metadata, and detailed instructions. It is the norm in academic communication that scientific articles include enough detail to recreate the work, but this frequently runs into conflict with the commercial interests of the companies funding the research. This is understandable, but it is not conducive to scientific communication (Heaven 2020).

6.3 What is missing

Even where governments have the best intentions of making data available, it is inescapable that data will be missing. As discussed in Section 2.2, the structure of case law in particular means there will always be significant gaps in the data available. Because courts are designed to decide difficult and high-conflict issues, court data misses much of the record of disputes in society as a whole. Access to the documents in court files helps expand the data available beyond what was finally decided by a judge, but those documents over-represent certain kinds of issues, particularly complex matters involving rich people.

Beyond these concerns, by definition no dataset can be complete. Some data is known to be missing, such as court cases under publication bans. Other types of missing data are less obvious. One of the most important categories of missing data is not knowing what outcomes would have been in other circumstances. What would the outcome have been:

  • With other lawyers representing the parties?
  • If the parties had different identities?
  • With a different legislative regime?

This missing counterfactual data makes it difficult to evaluate the causes of outcomes.

It is also difficult to measure many aspects of the world that we want to know about, so instead we develop proxies for what we really want to know (Hand 2020), like billable hours or arrest statistics. These do not measure the actual contribution of a lawyer to a firm’s profitability or crime levels in communities, but they are easily counted metrics that can be used instead. This can lead to gaming the system: lawyers are encouraged to take longer on tasks than they otherwise might, and because time can be written off or clients may refuse to pay, hours worked do not correlate directly with revenue. In fact, analysis has found that the two can be quite removed from each other (Clio 2019).
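The gap between these proxies and what they stand in for can be made concrete with some simple arithmetic. The figures below are entirely invented, but they illustrate how hours worked, hours invoiced, and revenue actually collected can diverge:

```python
# Hypothetical figures for one lawyer-month (all numbers invented):
hours_worked = 160     # total hours at the desk
hours_billed = 110     # hours actually invoiced to clients
hourly_rate = 300      # nominal hourly rate in dollars

invoiced = hours_billed * hourly_rate   # 33,000
collected = 24_000                      # what clients paid after write-offs

utilization = hours_billed / hours_worked   # share of work that is billable
collection_rate = collected / invoiced      # share of invoices actually paid
revenue_per_hour_worked = collected / hours_worked

print(f"utilization:      {utilization:.0%}")              # 69%
print(f"collection rate:  {collection_rate:.0%}")          # 73%
print(f"$ per hour worked: {revenue_per_hour_worked:.0f}") # 150
```

On these invented numbers, a nominal $300 hour yields $150 of actual revenue per hour worked, which is why hours alone are a weak proxy for profitability.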

Another important issue is whether it is possible to extrapolate beyond available data. Any set of data will by necessity have limits in its coverage. It is impossible to know what happened in the past before data was recorded, and it is impossible to know what will happen in the future. This is particularly important when dealing with complex human decision making, because people can respond to the findings based on existing data. New judges are called to the bench, governments change after elections, and people learn.

These are just some of the reasons why data may be missing. All these considerations mean that it is important to consider what is not in the data at hand and account for it in research methodology and how much confidence to put in the findings.

6.4 Ambiguity

Beyond the relatively simple issues of access and completeness of data, there are other elements that make the use of legal data complicated due to the nature of law. Human societies are complex, and human language is imprecise:

Classical logical models may break down in dealing with legal indeterminacy, a common feature of legal reasoning: even when advocates agree on the facts in issue and the rules for deciding a matter, they can still make reasonable arguments for or against a proposition.

(Ashley 2017, 38).

Though many people who look to legal computing as a way forward are unhappy with this imprecision, in some cases it may in fact be an advantage. For example, many people look to smart contracts to increase efficiency, but as science fiction writer Ken Liu has observed, smart contracts will likely never work well in real life for complex matters. Proponents see the inefficiencies of the existing environment, but they do not see how the human interaction of bargaining and negotiation leaves room for ambiguity, which can help make an agreement possible. Programmers want to take the ambiguity out of the process, because that is not how they think (Doctorow, Liu, and Newitz 2020).

That said, there is more ambiguity in legal documents than there needs to be, and research is being done into how to better separate the ambiguity in law that is advantageous from that which is not. One of the primary drivers of this at the time of writing is rules as code, which aims to express more precisely what is meant in legal documents. Human languages have syntactic ambiguity in their structure, which serves no legal purpose, and it may be advantageous to find ways to remove it. At the same time, some topics may be too complex to be legally coded, in which case it may be appropriate to indicate that a test such as “reasonableness” will be applied by human decision makers. The goal is to identify each of these sources of ambiguity and ensure they are used appropriately (Morris 2021). See Section 3.6 for further discussion of ambiguity in the context of data formats.

6.5 Limitations on language processing

As so much currently contemplated legal data research is being done with existing data in the form of free text documents, it is impossible to avoid the limitations of natural human language as a data source. The room for interpretation and ambiguity in human language is significant, and using written documents as source data requires interpretation when it is used for analysis. This is particularly difficult when using language created by people without subject matter expertise in law, as they may not be able to articulate their issues in a structured way.

Search queries are an example of a way that people interact with legal materials, and these datasets could be significant sources of data into how people understand and interact with the law. It seems that legal information retrieval systems, such as research platforms, should be able to interpret queries in the ways people intend, but so far there is generally no way to accomplish this (Ashley 2017, 316). The disconnect between human beings’ intentions and the literal nature of computers’ programming means that there continue to be limitations on what systems can do:

Algorithms do exactly what they are programmed to do, which sometimes creates a problem for programmers. . . . Right now machines will deliver you exactly what you wish for — and we’re not capable yet of wishing for the right thing. We can certainly use algorithms to supplement our thinking but knowing what’s ahead still requires people who can listen, analyze, and make connections.

(Webb 2016, 91)

While researchers have had some success in categorizing the writing of legal experts, they have had much less with non-experts. An analysis of self-represented parties seeking to make claims against their lawyers to the American Bar Association found that their descriptions of the facts and issues of their cases were organized in such a way that attempts to predict outcomes were only slightly better than random assignment (Branting et al. 2020). It is possible that automated means will never be sufficient to analyze this kind of writing.

That said, there are also issues with analyzing experts’ writing. For example, it is an important problem in legal artificial intelligence to be able to discern whether judges believe the assertions they make to be true — many times they are simply stating the positions of the parties (Ashley 2017, 369). These kinds of complexities will need to be addressed for the creation of fully functioning systems able to use existing legal content in the ways many anticipate.

As decision-making applications become more sophisticated, problems are anticipated because computer programs can apply classical logical deduction to problems, but they cannot support arguing both for and against a proposition, which makes them inadequate for modeling legal arguments (Ashley 2017, 128). Inferences also change once information is added or becomes invalid. When dealing with law in particular, it is important to consider that “Legal claims need not be ‘true’; they only need to satisfy a given proof standard” (Ashley 2017, 129).

6.6 Sampling

Sampling is the process of collecting and selecting what data points will be included in analysis. In most situations, it is not feasible to analyze full datasets, so a sample is selected to stand in for the whole. Statistical analysis requires adequate sampling to be reliable. The sample size and attributes are selected based on considerations like how confident researchers want to be in their results, but one of the most important elements of a sample is that it must be a random set of points from the complete underlying dataset. Legal datasets tend not to have this structure, because the recorded data does not include all the possible points, and the points that are included are not randomly distributed. Instead they are selected by people according to their own preferences or according to various rules: “Sometimes the data contain features that, for spurious reasons such as coincidence or biased selection, happen to be associated with the outcomes of cases in a particular collection” (Ashley 2017, 111). Matters are usually selected for inclusion in the written case law in particular based on attributes of the data. The most common criterion for recording a matter is that it is unusual, because legal systems focus on defining ranges of possible outcomes, or that the parties have extensive resources.
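A short simulation can make the effect of this kind of biased selection concrete. The scenario below is entirely invented and is not drawn from any real court data: it compares a random sample of a hypothetical population of disputes with a “reported” subset made up of only the highest-value matters, mimicking selection into the written case law.

```python
import random

random.seed(0)

# Hypothetical population of 10,000 disputes, with skewed "damages" values
# so that a handful of high-value matters dominate (numbers are illustrative).
population = [random.expovariate(1 / 50_000) for _ in range(10_000)]

# A true random sample approximates the population mean.
random_sample = random.sample(population, 500)

# "Reported" cases: suppose only the 500 highest-value matters are recorded,
# mimicking selection of unusual, well-resourced matters into case law.
reported = sorted(population)[-500:]

def mean(values):
    return sum(values) / len(values)

print(f"population mean:    {mean(population):,.0f}")
print(f"random sample mean: {mean(random_sample):,.0f}")
print(f"reported-only mean: {mean(reported):,.0f}")  # far above the others
```

The random sample lands near the population mean, while statistics computed from the “reported” cases alone are wildly inflated; any inference drawn from the biased subset describes the selection rule, not the underlying population of disputes.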

Analytics and sample size

One major issue with legal data for machine learning applications in particular is the small sample size of many datasets. There were approximately 120,000 intellectual property cases in the United States over the ten-year period leading up to 2006 (Walker 2019, 121), but over the same period a dataset for the Supreme Court of Canada would include approximately 1,000 cases. These numbers quickly become quite small if they are further parsed to reflect only particular areas of law or individual adjudicators, lawyers, or parties. Few datasets internationally are as large as the American intellectual property corpus.

It is difficult to run machine learning on such small numbers of documents. Researchers looking at decisions issued by the Singapore Supreme Court found that there were 6,227 decisions issued between 2000 and 2019. However, when they examined how to run machine learning over the data, they found that some techniques, such as using pre-trained language models rather than task-specific language models, were quite effective (Soh, Lim, and Chai 2019).
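To see how little signal a model trained only on a small corpus has to work with, consider a toy classifier. The sketch below is not the method used in the study cited above; it is a bag-of-words nearest-centroid classifier over an invented four-document “corpus”, and it works only because the vocabularies of the two areas barely overlap:

```python
from collections import Counter
import math

# Toy corpus standing in for court decisions (labels and texts invented).
train = [
    ("contract", "the parties agreed to terms of the contract for delivery"),
    ("contract", "breach of contract damages awarded for failure to deliver"),
    ("criminal", "the accused was convicted of theft and sentenced"),
    ("criminal", "bail was denied pending trial on the assault charge"),
]

def vector(text):
    """Bag-of-words representation: word -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    denominator = norm(a) * norm(b)
    return dot / denominator if denominator else 0.0

# Build one centroid (summed word counts) per legal area.
centroids = {}
for label, text in train:
    centroids.setdefault(label, Counter()).update(vector(text))

def classify(text):
    v = vector(text)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))

print(classify("damages for breach of the delivery contract"))  # contract
print(classify("the accused sought bail before trial"))         # criminal
```

With realistic vocabularies and only a handful of documents per class, a task-specific model like this overfits its tiny training set almost immediately, which is one intuition for why transferring a model pre-trained on large general corpora tends to help in small jurisdictions.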

These kinds of techniques will be necessary if machine learning and other techniques are going to be used in smaller jurisdictions and on topics with lower case volumes. This is important, because narrowing analytics applications to particular areas of law is one of the ways they can be made effective. It is easier to analyze decision making in narrow and discrete areas of the law, like bail or refugee hearings, than in something broad like commercial disputes or other heterogeneous matters (McGill and Salyzyn 2021).

6.7 Cost

It remains to be seen how effective applications using legal data will be, but there is hope that they will be the source of needed change. However, there are real worries that even if data-driven applications are useful, the cost of setting them up, especially the cost of creating adequate data to build them with, will be prohibitively expensive. That said, the legal services market in the United States alone was forecast to be worth US$767.1 billion in 2021 (Statista Research Department n.d.), which does not include the value of the portion of the legal system provided by others, such as governments and non-profit groups.

“There are numerous ways that artificial intelligence and machine learning could improve the service sector, but that improvement will result in systems that are more costly than current ones. AI advances are fueled by data, and data-gathering done well is expensive.” (Pasquale 2020, 197)

These values imply that significant gains can be made at the system level, and that they would have the potential to offset substantial investment in productivity improvements. However, these amounts are spread over large numbers of organizations and jurisdictions, many of which will not be able to share solutions. To get some idea of how much difference productivity gains can make, consider that at one time computers were so expensive that only large institutions and governments could have them. Now they can be added to household devices like thermostats and pregnancy tests that are cost competitive with similar non-digital products, but these kinds of price changes are considerably less likely in law, especially in the short to medium terms.


Ashley, Kevin D. 2017. Artificial Intelligence and Legal Analytics: New Tools for Law Practice in the Digital Age. Cambridge: Cambridge University Press.

Branting, Karl, Carlos Balhana, Craig Pfeifer, John Aberdeen, and Bradford Brown. 2020. “Judges Are from Mars, Pro Se Litigants Are from Venus: Predicting Decisions from Lay Text.” In Frontiers in Artificial Intelligence and Applications, edited by Serena Villata, Jakub Harašta, and Petr Křemen, 215–218. IOS Press.

Clio. 2019. “2019 Legal Trends Report.” Burnaby, Canada: Clio.

Doctorow, Cory, Ken Liu, and Annalee Newitz. 2020. “Cory Doctorow – Tech in Sci-Fi & ATTACK SURFACE w/ Ken Liu & Annalee Newitz.” Virtual, October 20.

Georgia et al. v. Public.Resource.Org, Inc. 2020. Supreme Court of the United States.

Hand, David J. 2020. Dark Data: Why What You Don’t Know Matters. Princeton: Princeton University Press.

Heaven, Will Douglas. 2020. “AI Is Wrestling with a Replication Crisis.” MIT Technology Review (blog). November 12, 2020.

Légifrance. 2019. “Article 33 — LOI N° 2019–222 Du 23 Mars 2019 de Programmation 2018–2022 et de Réforme Pour La Justice (1).” Légifrance. March 24, 2019.

McGill, Jena, and Amy Salyzyn. 2021. “Judging by Numbers: How Will Judicial Analytics Impact the Justice System and Its Stakeholders?” Dalhousie Law Journal 44 (1):

Morris, Jason. 2021. (Principal Research Engineer, Symbolic Artificial Intelligence for the Singapore Management University Centre for Computational Law), in discussion with the author.

Pasquale, Frank. 2020. New Laws of Robotics: Defending Human Expertise in the Age of AI. Cambridge, MA: Belknap Press of Harvard University Press.

Soh, Jerrold, How Khang Lim, and Ian Ernst Chai. 2019. “Legal Area Classification: A Comparative Study of Text Classifiers on Singapore Supreme Court Judgments.” In
Proceedings of the Natural Legal Language Processing Workshop 2019, 67–77. Minneapolis, Minnesota: Association for Computational Linguistics.

Statista Research Department. n.d. “Size of the Global Legal Services Market 2015–2023.” Statista. Accessed March 21, 2021.

Walker, Joshua. 2019. On Legal AI. Washington, DC: Full Court Press.

Webb, Amy. 2016. The Signals Are Talking: Why Today’s Fringe Is Tomorrow’s Mainstream. New York: PublicAffairs.
