Predicting Litigation Risk via Machine Learning

James Naughton is Associate Professor of Accounting at the University of Virginia Darden School of Business. This post is based on a recent paper by Mr. Naughton; Gene Moo Lee, Assistant Professor of Accounting and Information Systems at the University of British Columbia Sauder School of Business; Xin Zheng, Assistant Professor of Accounting and Information Systems at the University of British Columbia Sauder School of Business; and Dexin Zhou, Assistant Professor of Economics and Finance at CUNY Baruch College Zicklin School of Business.

Traditionally, empirical models in accounting and finance have focused on parameter estimation, that is, on the relation between the dependent and independent variables (i.e., does X cause Y?). However, a number of these studies generate inferences using variables that are estimates of unobservable firm attributes derived from traditional regression models. For example, estimates of securities litigation risk, a rare but important economic event, are typically generated from logistic regression models. Because of how these estimates are used in the literature, their accuracy has important implications for the conclusions drawn in a number of studies.

In the case of securities litigation risk, researchers generally use estimates in two ways. First, researchers include litigation risk as a control variable in a regression specification where litigation is a correlated omitted variable. In these models, an inaccurate estimate of litigation risk can bias the coefficients of interest, thus affecting subsequent inferences. Second, researchers use these estimates to create treated and control observations in a natural experiment. Typically, researchers identify a regulatory event that affects litigation risk and examine its impact using a difference-in-differences methodology, where the treated firms have high values of litigation risk and the control firms have low values. In these studies, an inaccurate estimate of litigation risk can result in firms being misclassified, again affecting subsequent inferences.

In our paper Predicting Litigation Risk via Machine Learning, we suggest that these underlying estimates can be predicted more accurately using machine learning algorithms, and that these more accurate estimates should aid future research. Machine learning revolves around the problem of prediction (i.e., producing predictions of Y from X), which reflects the actual usage of litigation risk in the broader literature. The appeal of machine learning over traditional models is its ability to discover complex structures that are not specified in advance. For example, increasing accounts receivable could be a predictor of litigation risk only when sales are declining. Additionally, machine learning is especially adept at classifying rare events when compared to traditional linear models. Machine learning techniques manage to fit complex and very flexible functional forms to the data without simply overfitting; they find functions that work well out-of-sample. We evaluate a comprehensive set of twelve machine learning techniques and benchmark their performance against the most commonly used logistic regression models in Kim and Skinner (2012), using data from 1996 through 2015 (i.e., the post-PSLRA period). We split our data into three samples for use in three separate procedures: training, cross-validation, and holdout. Because the third sample is not used for parameter estimation or tuning, it provides a true out-of-sample evaluation of each model's predictive performance.
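To make the evaluation design concrete, the sketch below partitions observations into training, cross-validation, and holdout samples, with the holdout sample never touched during fitting or tuning. The firm-year identifiers and the 60/20/20 proportions are hypothetical illustrations, not the paper's actual data or split ratios:

```python
import random

def three_way_split(observations, train_frac=0.6, cv_frac=0.2, seed=42):
    """Partition observations into training, cross-validation, and holdout
    samples. The holdout sample is reserved for a single out-of-sample
    evaluation and is never used for estimation or hyperparameter tuning."""
    rng = random.Random(seed)
    shuffled = observations[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_cv = int(n * cv_frac)
    train = shuffled[:n_train]
    cv = shuffled[n_train:n_train + n_cv]
    holdout = shuffled[n_train + n_cv:]
    return train, cv, holdout

# Hypothetical firm-year identifiers spanning the post-PSLRA period
firm_years = [f"firm{i}_{year}" for i in range(100) for year in (1996, 2005, 2015)]
train, cv, holdout = three_way_split(firm_years)
```

Because the holdout sample is carved out before any model sees the data, performance measured on it is a genuine out-of-sample test rather than an artifact of tuning.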

These machine learning models improve the prediction of litigation risk substantially, with hourglass-shaped and convolutional neural networks being the most effective. We evaluate the effectiveness of each model using the F-Score, a statistic calculated from the precision and recall of the test. The F-Score is especially appropriate in our setting because securities litigation is a relatively rare event. Precision can be viewed as a measure of quality (i.e., how well the model discriminates between litigation and non-litigation events), and recall as a measure of quantity (i.e., how many litigation events are identified). Higher precision means that the model returns more relevant results than irrelevant ones, and higher recall means that the model returns most of the relevant results. Overall, we find that the highest-performing machine learning models shift a number of firm-year observations to a lower likelihood of litigation relative to Kim and Skinner (2012). As a result, the machine learning models exhibit a substantial improvement in precision and a smaller deterioration in recall; this tradeoff generates the substantial improvement in the F1 score. The improvement in precision also implies that variants of the F-Score that weight precision (the most salient attribute of litigation risk estimates as used in the empirical literature) more heavily than recall will show an even larger improvement for machine learning approaches.
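The precision/recall tradeoff described above follows directly from the standard F-Score formulas. The sketch below computes precision, recall, and the F-beta score from confusion-matrix counts; the counts themselves are hypothetical, not figures from the paper:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-beta from confusion-matrix counts.
    beta = 1 gives the F1 score; beta < 1 weights precision more heavily."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# Hypothetical counts for a model whose precision exceeds its recall
p, r, f1 = precision_recall_f(tp=40, fp=10, fn=20)
# F_0.5 weights precision more than recall
_, _, f_half = precision_recall_f(tp=40, fp=10, fn=20, beta=0.5)
```

When precision exceeds recall, as in this example, a precision-weighted variant such as F_0.5 exceeds the F1 score, which illustrates why precision-weighted F-Scores would favor the machine learning models even more strongly.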

Overall, our results suggest that the joint consideration of economically meaningful predictors and machine learning techniques maximizes the effectiveness of litigation risk estimates. Our results have implications for both practitioners and researchers. To aid future studies, we produce firm-year litigation risk estimates from a convolutional neural network model that uses recursive feature elimination on a pool of 68 possible parameters. These estimates are available upon request.
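Recursive feature elimination iteratively refits a model and drops the weakest predictor until a target number remains. The sketch below is illustrative only: it scores features by univariate correlation with the label rather than by the convolutional network used in the paper, and the feature names and values are hypothetical:

```python
import math

def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def recursive_feature_elimination(features, labels, n_keep):
    """Repeatedly drop the feature least associated (in absolute value)
    with the label until n_keep features remain. A real application would
    re-score the surviving features with the fitted model each round."""
    remaining = dict(features)  # feature name -> list of values
    while len(remaining) > n_keep:
        weakest = min(remaining, key=lambda f: abs(correlation(remaining[f], labels)))
        del remaining[weakest]
    return sorted(remaining)

# Hypothetical predictors; "noise" is unrelated to the label and goes first
labels = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "returns_skew": [0.1, 0.2, 0.9, 0.8, 0.15, 0.85, 0.05, 0.95],
    "size":         [1.0, 2.0, 8.0, 7.0, 1.5, 9.0, 0.5, 8.5],
    "noise":        [0.4, 0.6, 0.5, 0.4, 0.6, 0.5, 0.4, 0.6],
}
kept = recursive_feature_elimination(features, labels, n_keep=2)
```

Starting from a wide pool (68 candidate parameters in the paper) and pruning uninformative predictors in this fashion reduces overfitting risk while retaining the economically meaningful signals.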

The complete paper is available for download here.
