Predicting Corporate Default Using Text

Ashok Banerjee & Sanjeev Kumar

Ashok Banerjee, Ph.D., is Professor, Finance and Control, Indian Institute of Management Calcutta (IIM-C). He is also the faculty in-charge of the Financial Research and Trading Lab at IIM-C. His primary research interests are in the areas of financial time series, news analytics, and mergers and acquisitions.

Sanjeev Kumar, Consultant (Analytics Practice), TCG-Digital

Rising corporate debt and higher default rates have led to a continuous increase in distressed loans in the Indian financial system. The stressed asset ratio rose from 7.6% in March 2012 to 11.5% in March 2016 and further to 12% in March 2017. As of June 2016, the Gross Non-Performing Assets (NPA) of public and private sector banks totalled around Rs. 6 lakh crore (almost $90 billion). Alarmed by the deteriorating asset quality, the Reserve Bank of India (RBI) urged all commercial banks in April 2015 to put in place an early warning system to prevent financial fraud. In March 2016, the Securities and Exchange Board of India (SEBI), the Ministry of Corporate Affairs (MCA) and the Institute of Chartered Accountants of India (ICAI) emphasised the need for an early warning system aimed at zeroing in on companies that have taken funds from the public and whose balance sheet parameters show that they may renege on repayment.

The problem with this approach, generating early warning signals from financial statements, is that it may lack predictive power. This would be particularly true for firms that ‘window dress’ their financial numbers to ‘defer’ the release of bad news. Lenders typically concentrate on financial parameters at the time of loan origination and subsequently track the behaviour of borrowers through financial statements and other financial data furnished by the borrower. However, the information in the financial statements may not reveal the actual state of affairs of a borrower.

Take the following example (Table 1). These three companies defaulted in 2015. Their financials showed no sign of trouble or irregularity in 2012, three years before the default. In fact, the leverage (debt-equity ratio) of two companies was much less than one, and operating profit margins were in double digits for two firms. Altman’s Z-score[1] was well above the comfort-zone threshold for all three companies in 2012. One might point out that the EMS predicts distress only one year ahead and not so early. However, even in the year of default (2015), the EMS was above 2.6 for all three companies.


[1] The Altman Z-Score is used as a tool for analysing the level of distress a firm might face in the next one year. Altman et al. (1995) introduced a revised Z-score model for non-manufacturing and manufacturing companies operating in developing countries, using a sample of Mexican companies. They called the revised model the EMS (Emerging Market Score). The present study uses the EMS. A firm with an EMS of 1.1 or below has a high risk of default, while a firm with an EMS of 2.6 or above has a low risk of default.
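
The cut-offs in this footnote amount to a simple decision rule, which can be sketched as follows. This is only an illustration: the function name and the "grey zone" label for scores between the two cut-offs are assumptions, and the EMS itself is taken as already computed.

```python
def ems_risk_zone(ems: float, low_cutoff: float = 1.1, high_cutoff: float = 2.6) -> str:
    """Map an EMS value to a risk label using the cut-offs quoted in the footnote.

    The label for scores between the cut-offs ("grey zone") is an assumption;
    the footnote only defines the high-risk and low-risk ends.
    """
    if ems <= low_cutoff:
        return "high risk"   # EMS of 1.1 or below
    if ems >= high_cutoff:
        return "low risk"    # EMS of 2.6 or above
    return "grey zone"       # neither condition met


if __name__ == "__main__":
    for score in (0.8, 2.0, 3.4):
        print(score, ems_risk_zone(score))
```

Under this rule, a defaulting firm with an EMS above 2.6 in the year of default would still be labelled low risk, which is the weakness the article points to.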

Much of the research so far has explored the relationship between financial distress and historical accounting information. However, quantitative financial information comprises only approximately 20% of all the information contained in annual reports (Beattie et al. 2004). Therefore, to obtain a complete picture of the financial health of a company, one must also use the qualitative information provided in corporate annual reports. Of late, there has been growing interest among finance and accounting researchers in analysing and quantifying the qualitative information present in annual reports. Loughran and McDonald (2011) analysed the tone (sentiment) of corporate annual reports and observed that the sentiment expressed in annual report text is significantly correlated with profitability, trading volume, and unexpected earnings for listed companies in the USA.

Table 1: Financial Health of Three Companies

Realizing the need for greater scrutiny of annual reports, the RBI[2] instructed banks to undertake a detailed study of the Annual Report and not concentrate merely on financial statements. At present, detection of loan frauds takes an unusually long time, which may delay action against a fraudulent entity and cause huge losses to financial institutions. Early detection of any trouble or distress among borrowers would therefore help in controlling the menace of non-performing assets. Lenders in India should learn the art of extracting information from large text documents and improve their present rating system by supplementing financial parameters with text-based information. This would make the existing rating system more robust.


We have observed, after manually going through hundreds of corporate annual reports, that firms reveal more in the ‘text’ part of the annual report. Companies, more so the listed ones, are careful while presenting financial statements simply because this section of the annual report is scrutinised most by analysts, investors and lenders. We have developed a proprietary text-based model for estimating the default probability of firms, and we claim that our model has much better predictive power than Altman’s. Our proposed model is equally effective for unlisted firms. Further, our text-based model is designed to capture any kind of trouble or uncertainty that a firm faces, in addition to default risk.

Words reveal more

Our model is developed using the text present in the annual report of a company. We have used only three sections of an annual report: the Directors' Report (including Management Discussion and Analysis), the Audit Report and the Notes to Accounts. It is important to note that the annual report (except the audit report) is a self-report of a company, and hence such a document is bound to have a strong bias. Yet we were amazed by the quality of information that one can extract from such biased text.

Let us take the case of Vijay Textile (mentioned in Table 1). The company reported an operating margin of more than 28% in 2012 with a debt-equity ratio of less than 1.5. Even in the year of default, the debt-equity ratio did not cross 2, though sales growth was negative. However, if one looks at the annual reports of the company over the few years prior to the year of default, one would notice that the company had started facing financial hardship at least four years before 2015 (Table 2). It is interesting to note that the Altman EMS improved over the years, whereas the text of the annual reports clearly showed that the firm was burdened with financial hardship, so much so that the company had to dispose of some assets as far back as 2011. The firm witnessed inventory pile-up and lower profitability in 2012, and the situation did not improve thereafter, leading to huge pressure on liquidity in 2014. The material information captured in the text of the annual report, in this example, shows that it makes economic sense to analyse non-financial information as seriously as financial information. We find that the Directors' Report provides most of the material information and the Audit Report provides the least marginal information.


Magnusson et al. (2005) use self-organizing maps to visualize the changes in the writing style of the annual reports of telecommunication companies. They observed that when a company is expected to perform well, the tone of the report remains positive with extensive use of optimistic vocabulary as compared to a less optimistic and more conservative tone when expecting worse financial performance.

Table 2: Excerpts from Annual Report

Methodology Explained

Each piece of annual report text provides one aspect of reality about a firm’s condition for a particular financial year. But the text contains a lot of noise or irrelevant information, which makes extracting only the useful information with computational tools somewhat cumbersome. Cleaning the text data is therefore an important first task before performing any analysis on it.

For cleaning the dataset, we have used the following steps:

  1. Remove all hypertext data, URLs, etc.
  2. Remove hyphens selectively: un-realistic is converted to unrealistic and un-certain to uncertain, but profit-loss is converted to profit loss, not profitloss. We identify the specific prefixes that change or add stress to the sentiment of a word.
  3. Remove all non-informative text such as numbers, dates, serial numbers at the start of list items, commas, full stops, and anything between () or {} or [].
  4. Remove all phrases that are general accounting terms, such as profit and loss or gain and loss, and all words in capital letters.
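
The cleaning steps listed so far can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' proprietary implementation: the prefix list and the phrase list are hypothetical (the article names only `un-` and two phrases), and the steps are reordered slightly so that capitalised words are dropped before the text is lower-cased.

```python
import re

# Hypothetical prefix list: the article only gives "un-" as an example.
SENTIMENT_PREFIXES = ("un", "non", "dis", "mis")

# Hypothetical phrase list: the article names only these two phrases.
ACCOUNTING_PHRASES = ("profit and loss", "gain and loss")


def clean_text(text: str) -> str:
    """Apply the four cleaning steps to one block of annual report text."""
    # Step 1: remove URLs and HTML-like tags.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)

    # Step 3 (part): drop anything inside (), {} or [] (non-nested only).
    text = re.sub(r"\([^()]*\)|\{[^{}]*\}|\[[^\[\]]*\]", " ", text)

    # Step 4 (part): drop words written entirely in capital letters.
    # Must run before lower-casing; single letters such as "A" are kept.
    text = re.sub(r"\b[A-Z]{2,}\b", " ", text)
    text = text.lower()

    # Step 4 (part): drop general accounting phrases.
    for phrase in ACCOUNTING_PHRASES:
        text = text.replace(phrase, " ")

    # Step 2: join the hyphen after a sentiment prefix (un-certain -> uncertain),
    # and turn every other hyphen into a space (profit-loss -> profit loss).
    prefix_pattern = r"\b(" + "|".join(SENTIMENT_PREFIXES) + r")-(?=[a-z])"
    text = re.sub(prefix_pattern, r"\1", text)
    text = text.replace("-", " ")

    # Step 3 (rest): drop numbers, dates, punctuation and other non-letters.
    text = re.sub(r"[^a-z\s]", " ", text)

    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "Demand remained un-certain (refer Note 4). Sales fell 12% in 2012."
    print(clean_text(sample))
```

Handling hyphens before stripping punctuation matters here: if all non-letters were removed first, `un-certain` would split into `un certain` and the negated sentiment would be lost.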