Designing Data Projects Using Court Records

AALL Spectrum
Author: Rebecca Fordon, Heidi Frostestad Kuehl, Tom Gaylord, & Adam R. Pah, AALL Spectrum

This submission is part of a column swap with the American Association of Law Libraries (AALL) bimonthly member magazine, AALL Spectrum. Published six times a year, AALL Spectrum is designed to further professional development and education within the legal information industry. Slaw and the AALL Spectrum board have agreed to hand-select several columns each year as part of this exchange. 

Lessons from the SCALES OKN Project, Including its Cost, Data, and Development Challenges

With the increasing availability of electronic court records, many law librarians have seen more requests from their patrons to pull data from those records to answer questions. These questions may be about the courts’ workload, the success rate of a particular type of legal action, or comparing the practices of different judges. Yet the records are not always so easily reduced to data. They may be dispersed across different systems, such that the librarian must contact multiple courts to obtain records. Even with the records in hand, the metadata (e.g., dates and descriptions of each docket entry) may not be normalized and may be difficult to join with other data sets. Further, researchers must keep ethical considerations in mind, such as those arising from court rules on sealing case records and on personally identifiable information. Luckily, organizing information is what librarians do, and these projects are well within our wheelhouse.

A core part of succeeding with data projects is aligning the technical development with the ultimate end users—whether they be legal scholars, lawyers, or judges—and what questions they want answers to. Once the range of questions that need to be answered is identified, it will define and drive efforts to acquire court record data (i.e., which courts, what timeframe, which types of cases) and what information needs to be extracted from these records and turned into usable data for further analysis.

Building a Data Project

Building a data project such as the Systematic Content Analysis of Litigation Events Open Knowledge Network (SCALES OKN), whose diverse potential user base ranges from legal scholars to journalists, requires significant work at the outset to understand what questions users want answered before serious technical work can begin. We conducted hours of in-depth interviews with many potential users to understand both what questions they currently answer using available tools and what questions they would like to answer but lack the data or tools to address. For our own project, generalizing across these interviews helped identify key user questions, such as “how often does X event happen in Y cases” and “how do cases with Z type of parties end?” Those questions ultimately defined both the breadth of data that needed to be acquired and the effort needed to enhance the court records with additional metadata from further computational modeling of the data.
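
A question of the form “how often does X event happen in Y cases” can be made concrete with a small amount of code once docket entries carry a case type and an event type. The following is a minimal sketch, not the SCALES OKN implementation: the docket entries, case types, and event labels are all hypothetical, and real records would first need the cleanup described later in this article.

```python
from collections import defaultdict

# Hypothetical docket entries: (case_id, case_type, event_type).
# None of these values come from real court records.
entries = [
    ("1:20-cv-001", "contract", "motion_to_dismiss"),
    ("1:20-cv-001", "contract", "answer"),
    ("1:20-cv-002", "contract", "answer"),
    ("1:20-cv-003", "civil_rights", "motion_to_dismiss"),
    ("1:20-cv-003", "civil_rights", "motion_to_dismiss"),
    ("1:20-cv-004", "civil_rights", "summary_judgment"),
    ("1:20-cv-005", "civil_rights", "motion_to_dismiss"),
]


def event_rate(entries, event_type):
    """For each case type, return the share of cases with at least one
    docket entry of the given event type."""
    all_cases = defaultdict(set)
    cases_with_event = defaultdict(set)
    for case_id, case_type, ev in entries:
        all_cases[case_type].add(case_id)
        if ev == event_type:
            cases_with_event[case_type].add(case_id)
    return {
        case_type: len(cases_with_event[case_type]) / len(case_ids)
        for case_type, case_ids in all_cases.items()
    }


rates = event_rate(entries, "motion_to_dismiss")
for case_type, rate in sorted(rates.items()):
    print(f"{case_type}: {rate:.0%} of cases include a motion to dismiss")
```

Counting cases rather than docket entries is a deliberate choice here: a case with two motions to dismiss still counts once, which matches the way the question is phrased.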

The pool of people who have a lawyer’s understanding of civil procedure and litigation practice, a software engineer’s ability to build software, and a data scientist’s knowledge of how to develop machine learning models is extremely small, so successfully executing a data project requires building a diverse team. Whether it comes from the end user or from someone directly on the team, legal expertise is necessary to understand and identify what is transpiring in the court records. That expertise creates the core data the technical team uses to aid software development, whether that means simply programming data extraction from the source documents or training more sophisticated models that can identify and classify events in a case as distinct types of legal actions. The questions identified at the outset will establish how large the technical team needs to be and what expertise it needs to have. Building software to acquire and extract data from court records, potentially allowing for better search than what is currently available, will require more expertise in software development to ensure that it is robust. As the questions demand a greater understanding of what the content of the court records actually is, not just what is written, the team will need data analytics and data science skills to retrain existing state-of-the-art language models to make predictions about the contents of the court records, with help from legal experts on the team. Because of these difficulties, it is often best to work from public APIs and data sources like those that CourtListener makes available, which decreases the amount of technical and subject-domain expertise needed to start the project. However, as with most issues related to analyzing federal court data, the pertinent data will not necessarily be publicly available because of the PACER paywall, and care should be exercised to fully understand both what is and isn’t available before advancing with a data project.

Of course, with a project like SCALES OKN, we are starting with a very large data set, and one that many people would expect to be fairly uniform throughout, since PACER is meant to be relatively uniform. The problem is that there are 94 U.S. District Courts, and each behaves a little differently from the others. Compounding that issue, even within a single district court the different judges’ chambers behave differently. One chamber might treat every interlocutory order as an opinion that should be freely accessible, while others don’t believe anything other than a final disposition should be labeled an opinion. The judges themselves usually have the final say in whether an opinion should be “published” in the strict sense of that term.

In addition, judges (or their clerks) in different courts will often label the same type of filing or disposition using different identifiers (or using the parties’ identifiers), so cleaning up that data requires creating a taxonomic system that can identify which filings with different names are the same thing. For identifying the cause of action, PACER already has a taxonomic system, the Nature of Suit (NOS) system. The problem with that, of course, is that the attorney filing the initial complaint controls what the NOS is, so it is a very subjective system. It also fails to capture instances in which a case covers potentially a half dozen or more natures of suit but the civil docket filing sheet shows only a single one.
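
The idea of a taxonomy that maps differently worded docket labels to one canonical filing type can be illustrated with a simple rule-based lookup. This is only a sketch under stated assumptions: the canonical labels and patterns below are hypothetical, and hand-written rules are a simplified stand-in for the trained models mentioned earlier, not a description of how SCALES OKN classifies filings.

```python
import re

# Hypothetical taxonomy: canonical filing type -> patterns for label variants.
# A real taxonomy would be built and reviewed with legal experts.
TAXONOMY = {
    "motion_to_dismiss": [r"\bmotion to dismiss\b", r"\bmtd\b", r"\b12\(b\)\(6\)"],
    "summary_judgment": [r"\bsummary judgment\b", r"\bmsj\b"],
    "notice_of_appeal": [r"\bnotice of appeal\b"],
}

# Compile once, case-insensitively, because docket text mixes capitalization.
COMPILED = {
    label: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for label, patterns in TAXONOMY.items()
}


def classify(description):
    """Return the first canonical label whose pattern matches the docket
    entry description, or None when nothing matches."""
    for label, patterns in COMPILED.items():
        if any(p.search(description) for p in patterns):
            return label
    return None


samples = [
    "MOTION to Dismiss for Failure to State a Claim by Acme Corp.",
    "Defendant's MSJ",
    "Rule 12(b)(6) motion filed by defendant",
    "Minute entry",
]
for text in samples:
    print(f"{text!r} -> {classify(text)}")
```

Entries that return `None` are the ones worth routing to someone with legal expertise, since they may be new label variants or filing types the taxonomy does not yet cover.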

Furthermore, SCALES OKN itself is not looking at state courts at this time (perhaps it will in the future). Doing so would open a whole different can of worms because there is no uniform state docketing system like the one PACER provides for the federal courts. Rather, each state does its own thing and, in many cases, different trial courts within the same state use different docketing systems, which makes retrieving information in any uniform manner across a single state difficult to almost impossible. One can see examples of this simply by searching the docket sources for the major legal databases like Bloomberg, LexisNexis, and Westlaw, and noticing that in many states only a small subset of counties or lower judicial districts/circuits are accounted for, because some of the smaller courts in rural counties offer no electronic access that can be scraped or downloaded.

Courts themselves must also adhere to certain ethical considerations that can create challenges for coherent access.

Breaking Down the System

  • COURT RULES AND JUDICIARY POLICIES: Every federal court strives to provide open access to available print records through the Clerk of the Court, with associated fees (if any) designated by Court operations and the appropriate handbooks or manuals of the Court. Further digital access is provided through the judiciary’s own PACER and CM/ECF systems, private vendors like Bloomberg Law, and the equivalent e-filing and retention systems for state court records. Through a court-focused lens, budgetary allocations provide monies to sustain the maintenance and archiving of records. These archiving systems and the transfer of comprehensive records vary from court to court, but the researcher will at the very least be able to find a full docket and record trail (in most instances). The Guide to Judiciary Policy governs federal judiciary employee conduct for disclosure of information, including an oath of office and confidentiality when appropriate for court documents and retention of records by Clerks of Court and other employees. The judiciary continues to thoughtfully consider open access to federal court records in conjunction with organized retention strategies and transparency of access to records, and this issue is continually on the radar of federal judiciary committees and in the daily operations of the Court. The Clerk of the Court’s office and the circuit librarian often work together to facilitate access to materials for researchers and buttress the overall research enterprise.
  • ORGANIZATION OF RECORDS AND NARA: Each state or federal court also has organized policies for retention of case records and transfer of those records, as appropriate, for preservation of files and open access to files. At the federal level, District Court and Appellate Court case records are often transferred to NARA (National Archives and Records Administration) after a certain amount of time and court documents are filed electronically (but sometimes records are also transferred back to the lower District Court after an appeal). The researcher should contact the Clerk of Court and consider the provenance of the case files and type of case (civil, criminal, magistrate, or bankruptcy) to determine the scope of electronic or print records and where they might be currently housed. Most federal case files opened prior to 1999 will be in print. NARA has detailed information and online searching of records available on their website. For courts in the Seventh Circuit, for example, the court records would be openly available from NARA in Chicago (unless the court records are sealed or transferred back to the District Courts in Illinois, Indiana, or Wisconsin).
  • SEALING DOCUMENTS FOR PARTIES AND COUNSEL IN CASES: Certain categories of cases, including criminal law or constitutional law cases, might have sealed records or closed hearings that may or may not be unsealed and generally available at some point in time. Typically, high-profile constitutional law cases in various circuits only allow the attorneys of record in the case or parties to the case to view case documents in PACER or CM/ECF. They are then unsealed at some point after the decision or disposition of the case record. If they are not unsealed or made available to the public, then the researcher will likely only see the timeline for the case in PACER or CM/ECF and will not be able to access associated documents for a comprehensive data set. Standard fees and court procedure would apply to those records once they are unsealed and available for public viewing and access per judicial order.
  • PROTECTION OF PERSONALLY IDENTIFIABLE INFORMATION (PII): In addition to sealing the records of the case, the protection of personally identifiable information (or PII) may affect access to court records and other filings. This is especially true for cases involving financial information, such as bankruptcy records, or those of financial crimes or in high-profile habeas cases in the federal courts. The record will not provide access to those personal details to protect the individual(s) in the case (by striking out the information or otherwise barring access to materials per local court customs and procedure). Researchers will notice that the PII for cases will be unavailable in PACER, CM/ECF, and through other traditional avenues for uncovering case files. Local procedures will also vary for accessing PII at the state level, and often state court rules bar filings with PII included.
  • LOCAL RULES MAY DEVIATE FROM FEDERAL OR STATE NORMS: It may be frustrating to researchers to realize that federal and state norms are often not the same for court records and filings and retention of those same records. This is also a challenge for development of a comprehensive national search of court documentation. Many courts still adhere to local customs, fee schedules for requesting documents, and other unique data entry systems for the chronology or timeline of court records in the docket. Typically, researchers must be tenacious in their quest for accurate case timelines and not get frustrated when cobbling together data for large research projects (which tend to be homegrown, like the SCALES project). Innovation is sure to happen, though, and will become easier once case files are digitized in the public domain and searchable. Researchers may build court record data sets like SCALES from the federal courts for their specific research needs and then aspire to make them searchable like other existing open access projects for legal or political science data (such as ICPSR or Harvard’s Case Law project). Comprehensive projects are feasible with identifiable goals and adequate funding or institutional support.
  • COST OF OBTAINING RECORDS NECESSITATES A PROJECT LIKE SCALES: Finally, it is important to acknowledge the sheer cost of obtaining the mammoth data sets for case law-searching projects and innovative digitization efforts like SCALES. Federal dockets are lengthy, and associated documents with the filings are huge. It’s an amazing and laudable initiative with lots of hard work by the partners in this project to free up those case documents for public viewing and create a useful algorithm for searching beyond PACER.
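
To give a sense of how the acquisition costs in the last point above scale, the sketch below estimates document fees under a per-page charge with a per-document cap. The default rates (10 cents per page, capped at 3 dollars per document) and the sample page counts are illustrative assumptions, not figures from this article; check the current fee schedule, including any exemptions or waivers, before budgeting a project.

```python
def estimate_fees_cents(page_counts, per_page_cents=10, cap_cents=300):
    """Estimate total fees, in cents, for a list of per-document page counts.

    per_page_cents and cap_cents are assumed rates for illustration only.
    Working in whole cents avoids floating-point rounding in the total.
    """
    return sum(min(pages * per_page_cents, cap_cents) for pages in page_counts)


# Hypothetical page counts for three documents on a single docket.
sample_pages = [5, 30, 120]
total_cents = estimate_fees_cents(sample_pages)
print(f"Estimated fees: ${total_cents / 100:.2f}")
```

Even with a per-document cap, multiplying an estimate like this across every document in every case for many courts shows why the data set for a national project quickly becomes expensive.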

The Upside to Data Projects

As you can see, there are a lot of startup costs, both in terms of acquiring data and in terms of creating a team, which are immediately followed by a plethora of pitfalls once the data is obtained: cleaning it up, merging it, and making sure it’s not revealing information that is not meant for public consumption. There is no one way to do this, but coming up with a plan along these lines before you jump in and try to take on a project of this scope and dimension is absolutely necessary.


Rebecca Fordon
Reference Librarian & Adjunct Professor

Ohio State University Moritz Law Library
Columbus, OH

Heidi Frostestad Kuehl
Director of the David C. Shapiro Memorial Law Library
Associate Professor of Law

Northern Illinois University College of Law
DeKalb, IL

Tom Gaylord
Faculty Services & Scholarly Communications Librarian

Pritzker School of Law
Northwestern University
Chicago, IL

Adam R. Pah
Clinical Assistant Professor

Kellogg School of Management
Northwestern University
Chicago, IL
