As more scholars look at doing statistical analysis of case law, I wanted to give some advice on how to do that given the way court decisions are written and published.
The first thing to understand about a dataset of case law is that it is not a representative sample of all the matters that appear before the courts. Jury verdicts and many oral reasons in various areas of law are never written down, so they are not distributed to CanLII and other publishers. This is particularly common for routine issues in areas of law like criminal, family, or small claims, which makes these cases difficult to discover for research. Statistically, it means that you cannot assume a normal distribution, so standard statistical tools like t-tests are not usable: any sample taken will be skewed to the extremes.
This is less of an issue with courts or tribunals that publish a higher proportion of their decisions: the Supreme Court of Canada or appeal courts are better candidates for a representative sample, but they are less useful for studying the social impacts of the legal system. There will also still be missing decisions for a number of reasons, such as publication bans. Getting a truly random sample of all the matters that appear before a lower court or tribunal may require a trip to the registry and an understanding of the structure of docket numbers to generate a workable set. Some tribunals are mandated by statute to publish all decisions, which would make things easier.
Neutral citations are applied by the issuing body, not CanLII or another publisher. Some issuing bodies haven’t adopted the neutral citation standard yet. For these decisions CanLII applies a CanLII citation with the following format: <year> CanLII <sequential number>. You can read more about it here.
If written decisions will meet your needs, you can generate a random set using the parts of a neutral citation: a year, the alphabetical code for the court or tribunal, and a random number corresponding to the sequential number at the end.
This method may be limited if there are numbers in the sequence that are missing from the source you’re looking at. This may mean several things: the decision may have been left out of the system due to error, there may be a reason it is not available like a publication ban, or the issuing body may have a policy of not always adding the numbers sequentially. In this case you may have to verify with the issuing body.
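The neutral citation approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `random_neutral_citations` is my own, and the court code `ONSC` and the upper bound of 7000 are placeholder inputs, not real counts. You would need to find the highest sequential number actually issued for the body and year you are studying.

```python
import random

def random_neutral_citations(year, court_code, max_number, k, seed=None):
    """Return k distinct candidate citations such as '2017 ONSC 1234'.

    max_number is the highest sequential number issued that year
    (an input you must look up; it is not known to this sketch).
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    # sample() draws without replacement, so no number is picked twice
    numbers = rng.sample(range(1, max_number + 1), k)
    return [f"{year} {court_code} {n}" for n in numbers]

# Placeholder inputs for illustration only
candidates = random_neutral_citations(2017, "ONSC", 7000, k=10, seed=42)
for citation in candidates:
    print(citation)
```

Because some numbers in the sequence may be missing from the source, it may help to draw a few more candidates than you need and keep a note of every citation that could not be found, so the gaps can be checked with the issuing body.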
The non-random elements in the distribution of the decisions start with the way they are scheduled: hearings are scheduled for many non-random reasons, such as availability of facilities and people, or the time of year. They are then written by human decision makers who take varying lengths of time, and they are released for publication based on administrative processes carried out by a mixture of human intervention and automated systems that vary greatly among institutions. Decisions may be released months after a court date, so it is advisable to backdate the start of your analysis.
Decisions are not displayed in a random order on CanLII. Decisions are posted by the month and day of the decision, but that date is not closely correlated with when CanLII receives them. There are layers of human intervention: decisions may be written up and sent well after the decision date. When decisions arrive at CanLII, the editorial team monitors a mostly automated system. Generally this won’t much affect the timeline of when decisions are published, but there is human intervention in the selection of the order.
There are stringent protections to make sure decisions are never missed once they reach CanLII, but the fact that a document isn’t in CanLII’s databases doesn’t mean that it doesn’t exist.
So here’s my advice on building a random sample for analysis:
- Pick a date several months ago to start your analysis from. If you are starting today, I suggest picking a year before 2018.
- Use a random number generator to generate sequential numbers to use. I usually use the =RANDBETWEEN() function in Excel.
- If you want to randomly select from different decision-making bodies, assign a number to each body and use the random number generator to select them that way.
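The last step can be sketched the same way as the Excel approach, with `randint` standing in for =RANDBETWEEN(). The numbering of bodies and the highest sequential numbers below are hypothetical placeholders, and `draw_citation` is a name I made up for this example.

```python
import random

# Hypothetical numbering of bodies: number -> (court code, highest
# sequential number for the year). The counts are placeholders.
bodies = {
    1: ("ONSC", 7000),
    2: ("BCSC", 2400),
    3: ("ONCA", 1000),
}

def draw_citation(year, bodies, rng):
    """Pick a body by its assigned number, then a sequential number."""
    body_number = rng.randint(1, len(bodies))  # like =RANDBETWEEN(1, 3)
    code, max_number = bodies[body_number]
    return f"{year} {code} {rng.randint(1, max_number)}"

rng = random.Random(2017)  # seeded so the draw can be repeated
sample = [draw_citation(2017, bodies, rng) for _ in range(5)]
print(sample)
```

Note that this gives every body the same chance of being picked regardless of how many decisions it issues, and it can return the same citation twice, so duplicates should be discarded and redrawn.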
If you need to sample matters with no written decisions, I suggest contacting the issuing body to make sure you understand the local docket numbering process, and building a sample based on that. This process will only include matters that are decided in an adjudicative process. If you want to include statistics about individuals who don’t go to court because they give up or make a deal before or during trial, you will have to explore other options that may involve in-person observation.
The silver lining is that, for a truly random sample from a dataset with a normal distribution, you don’t need as large a sample as you might think, because the ideal sample size depends very little on the size of the population. You can start reading more about that here.
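To see why population size matters so little, here is a small sketch using one common textbook approach, Cochran's formula for estimating a proportion with a finite population correction. This is my choice of illustration and not necessarily the method described in the linked reading; it assumes a simple random sample, and the defaults (95% confidence via z = 1.96, and p = 0.5 as the most conservative guess) are conventional assumptions.

```python
import math

def sample_size(margin, z=1.96, p=0.5, population=None):
    """Sample size needed to estimate a proportion within +/- margin.

    z is the normal critical value (1.96 for about 95% confidence),
    p is the expected proportion (0.5 gives the largest sample).
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: only matters for small populations
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

for size in (10_000, 1_000_000, None):
    print(size, sample_size(0.05, population=size))
```

With a 5% margin of error the required sample stays in the range of a few hundred decisions whether the population is ten thousand or a million, which is the point made above.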
Note: I have written from the perspective that you are using CanLII to do this work. Other sources may be similar, but you should check the details with the publisher if it really matters.
I would like to thank Lisa Trabucco for instigating my exploration of the details of this issue and discussing it with me. I would also like to thank Frédéric Pelletier at Lexum for confirming my understanding of the publishing process.