**March 18, 2018**

*Vijay Nair*

I’ve been working as a “quant” in a large bank for the last two years. Before that, I spent 15 years in a research lab and another 25 years as an academic. Much to my surprise, I didn’t find the transition to the banking industry especially difficult: data and statistics are ubiquitous and play a central role in decision-making in financial institutions. The role of statistics in banking is too wide to cover in this short note, so I’ll limit myself to credit risk, setting aside topics in investment banking, operational risks such as fraud detection, and others.

Credit risk is generally defined as the risk that a borrower fails to meet its obligations to repay a loan according to agreed-upon terms. Banks (both retail/consumer and wholesale) derive a significant amount of their revenue from secured and unsecured loans to individuals, businesses, corporations, and even governments. Examples include primary and secondary mortgages, home equity lines of credit, auto loans, business loans, student loans, credit cards, and so on. Therefore, it is critical that banks model and quantify their exposure to this risk. This is also a legal requirement for large financial institutions.

Credit risk modeling deals with three terminal events associated with a loan: default, bankruptcy, and prepayment. These are usually treated as competing risks. Since we are modeling a time-to-event or the probability of an event, the basic techniques come from survival or event history analysis. However, instead of modeling the distribution of a time-to-event, the common practice is to use binary regression techniques to model the conditional probability of defaulting in a future month, given that the loan has not defaulted up to the current month. The typical prediction horizon is on the order of three to four years. Logistic regression models with both static and time-varying covariates are very prevalent. Covariates include account-level variables such as the borrower’s credit (FICO) score at origination and over time, the interest rate at origination and over time, the ratio of the loan amount to the value of the security (the loan-to-value ratio) over time, etc. Macroeconomic variables, such as unemployment and inflation, are also key features in the model. In fact, one of the critical uses of these models is to quantify the bank’s credit exposure under “stressed” environments by predicting risk under economic conditions of varying severity. This is often called “stress testing”.
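To make the conditional-probability idea concrete, here is a minimal sketch in Python. It assumes a single terminal event (default only, ignoring the competing risks of bankruptcy and prepayment) and uses made-up coefficients and covariate paths. The function names, the two covariates, and all numeric values are hypothetical illustrations, not a fitted model.

```python
import math

def monthly_hazard(intercept, coefs, covariates):
    """Logistic model for P(default in a month | loan has not defaulted before it)."""
    score = intercept + sum(b * x for b, x in zip(coefs, covariates))
    return 1.0 / (1.0 + math.exp(-score))

def cumulative_default_probability(hazards):
    """P(default within the horizon) = 1 - product of monthly survival probabilities."""
    survival = 1.0
    for h in hazards:
        survival *= 1.0 - h
    return 1.0 - survival

# Hypothetical coefficients: intercept, (FICO - 700), unemployment rate in percent.
INTERCEPT = -6.0
COEFS = [-0.01, 0.30]

# Hypothetical 36-month covariate paths: same borrower, two economic scenarios.
HORIZON_MONTHS = 36
baseline_path = [[20.0, 4.0] for _ in range(HORIZON_MONTHS)]  # unemployment at 4%
stressed_path = [[20.0, 9.0] for _ in range(HORIZON_MONTHS)]  # unemployment at 9%

baseline_pd = cumulative_default_probability(
    [monthly_hazard(INTERCEPT, COEFS, x) for x in baseline_path]
)
stressed_pd = cumulative_default_probability(
    [monthly_hazard(INTERCEPT, COEFS, x) for x in stressed_path]
)

print(f"baseline 36-month PD: {baseline_pd:.3f}")
print(f"stressed 36-month PD: {stressed_pd:.3f}")
```

Running the same model along a baseline and a stressed macroeconomic path is the essence of the stress-testing use described above. In practice the coefficients would be estimated from loan-month data and the covariate paths would vary over time.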

The models can be at an individual account level or at some appropriately aggregated pool level, such as vintage (origination time). Account-level data for certain types of loans can run into several billion records with hundreds of covariates, and each observation is a time series. So, big-data management and algorithms are important considerations. Currently, modelers often finesse the problem by basing their models on a sampled subset of the data. However, there is an increasing trend towards analyzing the entire data set. SAS is probably the most common data analysis platform, although one sees increasing use of R, Python libraries, and other open-source software. A lot of the data analyst’s effort goes into data management and preprocessing before actual model development. This includes identifying sources of data, transferring and merging databases, data cleaning, dealing with outliers and missing data, etc. So, in addition to programming, some knowledge of database management and data retrieval is needed.

There are some unique features of credit risk analysis and prediction that distinguish it from traditional survival analysis. The figure below characterizes the situation for a particular loan, which originates at time *v* and will terminate (say, default) at some future time. We are currently at time *s* (snapshot or current time) and the goal is to predict risk (probability of default, bankruptcy, or prepayment) at some future time *t* (performance time).

There are at least two time dimensions that come into play in modeling and prediction: __age of the loan__ at snapshot time and __calendar time__ (since the macroeconomic variables depend on calendar time). In addition, there is important information associated with origination time (such as originating interest rates) that must be incorporated into the analysis. Traditional models ignore the effects of calendar time and analyze time to default as a function of only age (or equivalently origination time). These are called *vintage models* in the literature. On the other hand, there are models that ignore the effect of age (or make some assumptions) and predict default probability as a function of horizon only (see figure above). These are called *horizon models*. In reality, neither of these models is adequate and we have to account for the effects of both age and horizon (or equivalently calendar time) using *dual-time models*.
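The relationship between the time dimensions can be sketched with a small Python helper. It assumes time is measured in integer months and takes "horizon" to mean the gap between performance time *t* and snapshot time *s*. The function name and record fields are hypothetical and chosen only for illustration.

```python
def loan_month_records(origination, snapshot, horizon_months):
    """Build one record per future performance month for a single loan.

    origination (v), snapshot (s), and performance time (t) are integer
    month indices on a common calendar, with v <= s < t.
    """
    if snapshot < origination:
        raise ValueError("snapshot time must not precede origination time")
    records = []
    for k in range(1, horizon_months + 1):
        t = snapshot + k  # performance (calendar) time
        records.append({
            "vintage": origination,               # v
            "age_at_snapshot": snapshot - origination,  # s - v
            "horizon": k,                         # t - s
            "age_at_performance": t - origination,      # t - v
            "calendar_month": t,                  # t, used to look up macro variables
        })
    return records

# A loan originated in month 100, observed at month 124, predicted 3 months ahead.
records = loan_month_records(origination=100, snapshot=124, horizon_months=3)
for r in records:
    print(r)
```

Every record satisfies age at performance = age at snapshot + horizon, and calendar month = vintage + age at performance. A vintage model would use only the age column, a horizon model only the horizon column, and a dual-time model would use age together with horizon or calendar month.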

Another feature worth noting is that most of the account-level covariates are known only up to snapshot time, so their future values are not available to the prediction model. Similar problems arise in biostatistics and medical applications as well. One obvious approach is to model the past covariate histories and use those models to predict future covariate values. For some reason, this is not commonly done, and various ad hoc techniques, including the use of varying coefficients to capture missing information in the predictors, have become common in the industry.

While traditional statistical techniques and diagnostics are used in developing credit risk models, the primary metric for assessing model performance is “back-testing”: how the models perform on hold-out historical data. But the major challenge here, as with any prediction exercise that uses the past to predict the future, is that the past must be representative of the future. In this instance, economists say that the past must include at least one full economic cycle. But this is certainly not the case here, as the economic conditions since the last financial crisis are not typical of what we saw before the crisis or what we might see in the next few years. This is of course a difficult challenge, one that has to be addressed through subject matter expertise, extensive discussion and systematic elicitation of information, and careful scrutiny of any proposed modifications to the statistical results.
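As a rough illustration of a back-test, the Python sketch below compares average predicted default probability with the observed default rate on a hold-out period. It assumes an out-of-time split at a calendar cutoff and that the predictions came from a model fitted only on data before that cutoff. The record layout, the field names, and the toy data are hypothetical.

```python
def backtest(records, cutoff_month):
    """Compare mean predicted PD with the observed default rate on the hold-out.

    Each record is a dict with keys "month" (calendar month index),
    "predicted_pd" (model output), and "defaulted" (0 or 1).
    Records with month >= cutoff_month form the hold-out set.
    """
    holdout = [r for r in records if r["month"] >= cutoff_month]
    if not holdout:
        raise ValueError("no records at or after the cutoff month")
    n = len(holdout)
    mean_predicted = sum(r["predicted_pd"] for r in holdout) / n
    observed_rate = sum(r["defaulted"] for r in holdout) / n
    return mean_predicted, observed_rate

# Toy data: two development-period records and four hold-out records.
toy_records = [
    {"month": 10, "predicted_pd": 0.02, "defaulted": 0},
    {"month": 11, "predicted_pd": 0.03, "defaulted": 0},
    {"month": 12, "predicted_pd": 0.10, "defaulted": 0},
    {"month": 12, "predicted_pd": 0.20, "defaulted": 1},
    {"month": 13, "predicted_pd": 0.30, "defaulted": 0},
    {"month": 13, "predicted_pd": 0.40, "defaulted": 0},
]

mean_predicted, observed_rate = backtest(toy_records, cutoff_month=12)
print(f"mean predicted PD: {mean_predicted:.2f}, observed default rate: {observed_rate:.2f}")
```

A close match on the hold-out says little if the hold-out period does not resemble the conditions the model will face, which is exactly the representativeness concern raised above.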

I’ve not even scratched the surface of how statistical methods are used in the banking industry. Credit scoring, or assessing the creditworthiness of loan applicants, is by now a mature topic. Others that may not be as well known to the statistical community include fraud detection (related to anomaly detection in other areas), anti-money laundering, compliance monitoring, and of course the huge area of investment banking. Machine learning techniques are currently receiving a lot of attention in finance, but that is a topic for another discussion.