How Informative Is the Text of Securities Complaints?

Adam B. Badawi is Professor of Law at the University of California, Berkeley School of Law. This post is based on his recent paper.

Much of the research in law and finance reduces long, complex texts down to a small number of variables. Prominent examples of this practice include the coding of corporate charters as an entrenchment index and characterizing dense securities complaints by using the amount at issue, the statutes alleged to have been violated, and the presence of an SEC investigation. A persistent concern of legal scholars is that this type of reduction loses much of the nuance and detail embedded in legal text. In a recent paper, I use text analysis and machine learning to assess what we might be losing by not taking text seriously enough. Or, to put it another way, I ask what we can learn from a closer analysis of legal documents. The answer, it turns out, is quite a lot.

The body of text that I use in the paper is a corpus of over five thousand private securities class action complaints that collectively contain over 90 million words. There are a couple of attractive features of using this source of legal documents. The first is that these complaints are subject to the heightened pleading requirements of the Private Securities Litigation Reform Act (“PSLRA”), which means that many of them go into significant detail about the underlying allegations. This is particularly true for the consolidated complaints that get filed after the selection of lead counsel, a group of documents that average nearly 25,000 words each. The second desirable feature is that the vast majority of these cases result in one of only two outcomes. The cases are either dismissed—sometimes voluntarily and sometimes via a motion to dismiss—or they produce a settlement. The binary nature of these outcomes makes it a bit easier to generate predictions through machine learning.

So what do the machine learning models tell us? Most importantly, they provide insight into what is likely to happen in these cases. The highest performing models correctly predict whether a case will settle or get dismissed at rates approaching seventy percent, which is well above baseline settlement rates. This outperformance holds for both the first-filed complaints and the consolidated complaints, and it persists through a series of robustness checks.
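For readers who want a concrete picture of what such a model can look like, the sketch below trains a simple classifier to predict whether a complaint settles or gets dismissed from the complaint text alone. The file name, the column names (complaint_text, settled), and the TF-IDF-plus-logistic-regression setup are illustrative assumptions, not the pipeline used in the paper.

```python
# Minimal sketch of a text-based settle/dismiss classifier. The file name,
# the column names, and the TF-IDF + logistic regression pipeline are
# illustrative assumptions rather than the paper's actual models.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

complaints = pd.read_csv("complaints.csv")  # hypothetical: one row per complaint

X_train, X_test, y_train, y_test = train_test_split(
    complaints["complaint_text"], complaints["settled"],
    test_size=0.2, random_state=0, stratify=complaints["settled"])

vectorizer = TfidfVectorizer(max_features=20000, ngram_range=(1, 2), stop_words="english")
clf = LogisticRegression(max_iter=1000)

clf.fit(vectorizer.fit_transform(X_train), y_train)
predictions = clf.predict(vectorizer.transform(X_test))
print("Held-out accuracy:", accuracy_score(y_test, predictions))
```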

It is fair to ask whether a roughly seventy percent accuracy rate is a meaningful number. Perhaps the best way to test that would be to ask a group of seasoned securities lawyers to predict whether a given complaint is likely to settle or get dismissed. That approach is, to say the least, cost-prohibitive. But previous work shows that participants in the stock market are able to assess the merits of securities lawsuits within a few days of a complaint being filed.

To compare the performance of market participants and the machine learning models, I construct long-short portfolios of the cases that the models predict are most likely to get dismissed and most likely to produce settlements. For the first-filed complaints, those portfolios earn an average of roughly five percentage points of abnormal return in the ten-day window after filing. These results show that the machine learning models perform well relative to those who trade on the content of securities complaints.
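The sketch below illustrates the logic of that comparison under stated assumptions: it takes a model score for each case and the defendant's cumulative abnormal return over the ten trading days after filing, goes long the cases the model flags as most likely to be dismissed, and shorts those flagged as most likely to settle. The column names, the quintile cutoffs, and the sign convention are hypothetical and are not the paper's event-study design.

```python
# Illustrative long-short comparison, not the paper's exact design. Assumes a
# DataFrame with one row per case containing a model score ("prob_dismiss")
# and the defendant's ten-day cumulative abnormal return after filing
# ("car_10d"); both column names are hypothetical.
import pandas as pd

def long_short_car(cases: pd.DataFrame, tail: float = 0.2) -> float:
    """Average ten-day abnormal return of a portfolio that goes long the
    defendants in the cases scored as most likely to be dismissed and short
    those scored as most likely to settle."""
    hi_cut = cases["prob_dismiss"].quantile(1 - tail)
    lo_cut = cases["prob_dismiss"].quantile(tail)
    long_leg = cases.loc[cases["prob_dismiss"] >= hi_cut, "car_10d"].mean()
    short_leg = cases.loc[cases["prob_dismiss"] <= lo_cut, "car_10d"].mean()
    return long_leg - short_leg
```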

Although machine learning is often maligned, somewhat unfairly, as a black box, simplified versions of the models reveal the words that are most useful in predicting the outcomes of these securities cases. It is unsurprising, for example, that words associated with challenges to mergers under the securities laws—a notoriously weak strain of these cases—are associated with dismissals. While it would be easy to code for this type of information directly, other predictive words pick up more subtle features of the cases. Some of these relate to the substantive allegations. For example, words that allege fraudulent conduct over long periods of time are associated with cases that produce settlements. There may also be indications of lawyerly quality embedded in the text. Excessive reliance on media reports—which some allege is a sign of lower quality lawyering—appears to be associated with the eventual dismissal of cases.
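One common way to see which words a linear model leans on is to rank its coefficients. The snippet below assumes the vectorizer and classifier objects from the earlier sketch, with settlement coded as the positive class; it illustrates the general technique rather than the feature-importance method used in the paper.

```python
# Sketch of ranking the words a linear model relies on, assuming the
# `vectorizer` and `clf` objects from the earlier TF-IDF / logistic
# regression sketch, with settlement coded as the positive class.
import numpy as np

feature_names = np.array(vectorizer.get_feature_names_out())
coefs = clf.coef_.ravel()

order = np.argsort(coefs)
print("Words pushing toward dismissal:", feature_names[order[:15]])
print("Words pushing toward settlement:", feature_names[order[-15:]])
```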

Why does this matter for researchers in law and finance? As I discussed at the outset, there are real concerns about reducing text down to a handful of variables. To get a sense of whether this is a problem, I compare the performance of models that only use text-based variables to those that only use non-text-based variables. Importantly, the text-based models do a better job of predicting outcomes than the non-text-based models. In most cases, combining the two types of variables produces only marginal improvements.
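One way to run that comparison is to fit one model on the complaint text alone and another on a handful of coded covariates alone, scoring each by cross-validation. In the sketch below, the structured columns (the amount at issue, an SEC-investigation flag, and indicators for the statutes alleged) echo the examples mentioned at the outset, but the column names and the setup are hypothetical.

```python
# Rough sketch of comparing a text-only model with one built on a few coded
# covariates. Assumes the `complaints` DataFrame from the earlier sketch;
# the structured column names are hypothetical stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

structured_cols = ["amount_at_issue", "sec_investigation", "section_10b", "section_11"]

text_model = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=20000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
structured_model = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

text_acc = cross_val_score(
    text_model, complaints["complaint_text"], complaints["settled"], cv=5).mean()
struct_acc = cross_val_score(
    structured_model, complaints[structured_cols], complaints["settled"], cv=5).mean()
print(f"Text-only accuracy: {text_acc:.3f}   Structured-only accuracy: {struct_acc:.3f}")
```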

The better performance of the text-based models helps to confirm the fears of some legal scholars: reducing text to a small number of variables leaves important information behind. Analysis that does not incorporate this information runs the risk of omitted variable bias. In the context of securities litigation, that omission may make it seem as if an easily coded variable is driving outcomes when, in fact, more subtle information embedded in the text is doing so. My aim in this paper is not just to raise concerns about leaving this information by the wayside. As text analysis and machine learning become more common in law and finance, and as it becomes easier to analyze large corpora, researchers should aim to control for the content of, and variation in, the legal texts that they analyze.

You can find the most recent version of the paper here.
