
Alternative quantitative techniques to logistic regression

The Scoring Center of BNP Paribas Personal Finance uses scoring methods to steer credit granting, debt collection and its commercial loans. To do so, it relies essentially on logistic regression.
The aim of the apprenticeship was to find and apply a new scoring method as an alternative to logistic regression. During this year I studied the « Forest Augmented Naive Bayes classifier » (FAN) method and compared its results with those of logistic regression.

BNP Paribas Personal Finance-Centre de Scoring

The Personal Finance (PF) division is part of the Retail Banking branch of BNP Paribas. BNP Paribas Personal Finance is the European leader in consumer and real estate credit.
Born in 2008 from the merger of Cetelem, active in the consumer credit market, and UCB, a European real estate specialist, BNPP-PF accounted at the time for €108 billion worth of credit and 20 million clients, was present on 4 continents and in over 30 countries, and employed 26,500 people.


Part of the Risk Department, the Scoring Center is in charge of providing statistical tools that facilitate risk control. Scores are derived from the statistical study of the observed behavior of a population, and aim to assign each file a rating based on its level of risk. The rules that drive the calculation of those scores are then embedded in expert systems (software tools that reproduce the reasoning of an expert in a given field), which make it possible to measure the level of risk and to select specific files accordingly.
The Scoring Center is also responsible for supervising decision support tools within PF. Together with IT tools, decision support tools allow subsidiaries to run granting or collection strategies. The Scoring Center ensures the performance of those tools and guides subsidiaries in their use (training, assistance, support...).
The main missions of the Scoring Center are:

  • To build the scores of Personal Finance
  • To provide support and training for the use of decision tools
  • To maintain the expert systems for credit ranking decisions
  • To carry out technology watch on statistical methods
  • To carry out audits of decision plans
The Scoring Center builds 40 to 50 scores a year. Today, over 200 different scores are in use at PF.
The R&D division of the Scoring Center gathers knowledge and promotes technology watch activities.

Context of the internship

The modeling techniques currently used in credit risk to distinguish good from bad clients (binary criterion) are essentially based on scoring and the use of logistic regression. This technique has been used at BNPP PF for several decades.
The objective was to research, model and test a new technique as an alternative for scoring, and to compare the results obtained with those of logistic regression.
After a literature review of existing models, I chose to study the « Forest Augmented Naive Bayes classifier » (FAN) model, which is an extension of the « Naive Bayes classifier » model.
This method drew my attention because of its simplicity, the good results reported for the « Naive Bayes classifier » in many data mining problems, and its ability to model a polytomous criterion.
To challenge the two methods against each other, a performance comparison was carried out. Both the classification results (confusion matrix) and the ranking results (rating score and ROC curve) were studied and compared with those of logistic regression. The FAN method was moreover developed on a polytomous criterion, that is, a criterion with more than 2 modalities.
Several SAS macros were written to study this method. I also developed other SAS and C macros on behalf of the Scoring Center.

Forest Augmented Naive Bayes classifier

Naive Bayes classifier

The Naive Bayes model is widely used in data mining as a classification algorithm. In this model, the probability of belonging to a class (good/bad client) is estimated using conditional probabilities.
For that purpose, probabilities are computed per class for each variable (age, maximum delay of payment, job tenure) from a test sample, and are defined as a function of the different modalities of each variable (between 18 and 26 years old, between 27 and 35 years old, between 36 and 44 years old...). In practice, all the files that belong to the same modality of a variable and to the same class are counted (for example, the number of clients between 18 and 26 years old who belong to the "good client" class) and divided by the number of files in that class. We therefore obtain the probability of belonging to a given modality of a variable, knowing the class (good/bad client).

Test Sample
| Id | Age | Bank seniority | Job tenure | ... | Number of postponements in collection | Class |
|---|---|---|---|---|---|---|
| 1 | Between 18 and 26 years old | Between 0 and 2 years | Between 0 and 2 years | ... | 0 | 1 |
| 2 | More than 45 years old | More than 10 years | Between 10 and 15 years | ... | Between 1 and 2 | 0 |
| 3 | Between 27 and 35 years old | Between 3 and 6 years | Between 3 and 5 years | ... | Between 1 and 2 | 0 |
| 4 | Between 27 and 35 years old | Between 0 and 2 years | Between 0 and 2 years | ... | More than 2 | 0 |
| 5 | Between 18 and 26 years old | Between 0 and 2 years | Between 0 and 2 years | ... | 0 | 0 |
| 6 | More than 45 years old | More than 10 years | Between 10 and 15 years | ... | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 16000 | Between 18 and 26 years old | Between 0 and 2 years | Between 0 and 2 years | ... | 0 | 0 |
Bayes' formula for the probability we are looking for is:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

where A is the class (good/bad client) and B is the set of characteristics of the file (age, bank seniority, etc.). Thanks to the test sample, we can compute the probability of B knowing A. For the variable age, for instance, we count the number of files of the chosen class that have the modality "Between 18 and 26 years old" and divide by the number of files with Class=0 (if we choose A=0) or Class=1 (if we choose A=1). In the same way, the probability of A is computed as the number of files in which the variable "class" equals 0 or 1, divided by the total number of files.
Here the marginal probability of B (the divisor in Bayes' formula) does not matter, because the client's file remains the same: its value is identical whatever the class. To save computation time, we therefore do not calculate this probability.
We can now compute the probability of belonging to each class given the file of each client. We then classify the file according to the maximum a posteriori criterion, that is, into the class to which the file has the highest probability of belonging:

$$\hat{A} = \arg\max_{a} \; P(A = a) \prod_{i} P(B_i \mid A = a)$$

where B_i is the modality taken by the i-th variable of the file, the variables being assumed independent given the class (the "naive" assumption).
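As an illustration, the counting and the maximum a posteriori rule described above can be sketched in Python. This is only a minimal sketch: the work itself was done with SAS macros that are not reproduced here, the function names `train_naive_bayes` and `classify` are hypothetical, and the toy sample keeps only the age, job tenure and class columns of the first six rows of the table.

```python
from collections import Counter

def train_naive_bayes(rows, class_key="class"):
    """Estimate P(A) and P(B_i | A) by counting files in the sample."""
    class_counts = Counter(row[class_key] for row in rows)
    # cond_counts[(variable, modality, class)] = number of matching files
    cond_counts = Counter()
    for row in rows:
        for var, modality in row.items():
            if var != class_key:
                cond_counts[(var, modality, row[class_key])] += 1
    total = len(rows)
    priors = {c: n / total for c, n in class_counts.items()}
    likelihoods = {key: n / class_counts[key[2]] for key, n in cond_counts.items()}
    return priors, likelihoods

def classify(file, priors, likelihoods):
    """Return the maximum a posteriori class and the unnormalised posteriors.

    The divisor P(B) is deliberately left out: it is the same for every class.
    Simplification: a modality never seen in a class gets probability 0.
    """
    posteriors = {}
    for c, prior in priors.items():
        value = prior
        for var, modality in file.items():
            value *= likelihoods.get((var, modality, c), 0.0)
        posteriors[c] = value
    return max(posteriors, key=posteriors.get), posteriors

# Toy sample (assumption: reduced to two variables for readability)
sample = [
    {"age": "18-26", "job_tenure": "0-2", "class": 1},
    {"age": "45+", "job_tenure": "10-15", "class": 0},
    {"age": "27-35", "job_tenure": "3-5", "class": 0},
    {"age": "27-35", "job_tenure": "0-2", "class": 0},
    {"age": "18-26", "job_tenure": "0-2", "class": 0},
    {"age": "45+", "job_tenure": "10-15", "class": 0},
]

priors, likelihoods = train_naive_bayes(sample)
predicted, posteriors = classify({"age": "18-26", "job_tenure": "0-2"}, priors, likelihoods)
print(predicted, posteriors)
```

With so few rows the estimates are of course not meaningful; the point is only to show that every quantity in the formula is a ratio of counts.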

Forest Augmented Naive Bayes classifier

The Forest Augmented Naive Bayes classifier model calculates the probability of belonging to a class knowing the value of a first variable combined with a second one. Indeed, a second late payment over 20 years of banking seniority does not mean the same thing as the same number of late payments in only one year. Variables are therefore paired on the basis of the conditional mutual information criterion. We then keep only the pairs of variables whose mutual information is high enough, in order to avoid "overfitting" issues (the model becomes over-reliant on the test sample and does not perform well enough on new ones). The classification is still based on the maximum a posteriori criterion. We then obtain what is called a forest of Bayesian models:

Figure: Example of a Bayesian forest
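The pairing criterion can be illustrated with a short Python sketch. It estimates the conditional mutual information I(X;Y|C) = Σ P(x,y,c) log[ P(x,y|c) / (P(x|c) P(y|c)) ] from counts and keeps the pairs above a threshold. The function names, the variable names, the data and the threshold value are all assumptions made for the example, and the construction of the forest itself from the retained pairs is not shown.

```python
import math
from collections import Counter
from itertools import combinations

def conditional_mutual_information(xs, ys, cs):
    """Estimate I(X; Y | C) in nats from three equally long lists of modalities."""
    n = len(cs)
    n_c = Counter(cs)
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_xyc = Counter(zip(xs, ys, cs))
    cmi = 0.0
    for (x, y, c), n_joint in n_xyc.items():
        # P(x,y|c) / (P(x|c) P(y|c)) expressed directly with counts
        ratio = n_joint * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        cmi += (n_joint / n) * math.log(ratio)
    return cmi

def select_pairs(columns, cs, threshold):
    """Keep the variable pairs whose conditional mutual information exceeds threshold."""
    kept = []
    for a, b in combinations(columns, 2):
        cmi = conditional_mutual_information(columns[a], columns[b], cs)
        if cmi > threshold:
            kept.append((a, b, cmi))
    return sorted(kept, key=lambda item: -item[2])

# Hypothetical coded data: two strongly linked variables and one unrelated variable
classes = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "late_payments": [0, 1, 0, 1, 0, 1, 0, 1],
    "bank_seniority": [0, 1, 0, 1, 0, 1, 0, 1],
    "age": [0, 0, 1, 1, 0, 0, 1, 1],
}

pairs = select_pairs(columns, classes, threshold=0.1)
print(pairs)
```

In this toy data only the pair (late_payments, bank_seniority) is retained, since the third variable carries no information about the other two once the class is known.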

The goal of the study was to compare the performance of the studied model with that of logistic regression, so we needed the equivalent of a rating score for ranking purposes.
The difficulty was that the two probabilities (of belonging to each of the two classes) cannot be compared across files. A file with a high probability of being classified as a good client will not necessarily be classified as "good": it all depends on its probability of being classified as "bad". If the latter is even higher than the former, the file will be classified as "bad", even though its probability of being "good" is higher than that of other files which have been classified as "good". We therefore could not take a single probability as a reference to rank the files.

Test Sample
| Criterion | File | Probability of class 0 | Probability of class 1 | Credit Score |
|---|---|---|---|---|
| 1 | 15121 | 7.760E-3 | 5.631E-2 | 1 |
| 1 | 15122 | 2.597E-1 | 1.090E-1 | 0 |
The probability of class 1 of file 15122 is higher than that of file 15121. However, file 15122 is not classified in class 1, whereas file 15121 is, because its probability of belonging to class 0 is even higher.
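The comparison between these two files can be checked in a few lines of Python, using the probabilities given in the table. The helper name `map_class` is an assumption made for the illustration.

```python
# (probability of class 0, probability of class 1) for the two files of the table
files = {
    15121: (7.760e-3, 5.631e-2),
    15122: (2.597e-1, 1.090e-1),
}

def map_class(p0, p1):
    """Maximum a posteriori rule for a binary criterion."""
    return 0 if p0 > p1 else 1

# Class assigned to each file
assigned = {file_id: map_class(p0, p1) for file_id, (p0, p1) in files.items()}

# Ranking the files on the probability of class 1 alone, highest first
ranked_by_p1 = sorted(files, key=lambda file_id: files[file_id][1], reverse=True)

print(assigned)      # file 15121 goes to class 1, file 15122 to class 0
print(ranked_by_p1)  # yet file 15122 comes first when ranking on class 1 only
```

The two outputs disagree, which is why a single class probability cannot serve as a rating score.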

I therefore decided to use the difference between the two probabilities to rank the files. However, differences of varying magnitude occurred between files without reflecting a difference in ranking. I then decided to divide this difference by the probability of being a good client (tests showed that the results are the same whether one probability, the other, or the sum of both is used), in order to standardize the results across files.
We now use this formula to compute the Gini coefficient, which is used at BNP to evaluate the effectiveness of a model, and to compare our results on a binary criterion with those of logistic regression.
Figure: FAN score rating
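A possible reading of this rating, and of its use in a Gini coefficient, is sketched below in Python. Several points are assumptions: the sign of the difference (good minus bad) is not stated in the text, the Gini coefficient is computed through the usual relation Gini = 2 × AUC − 1 with a simple pairwise AUC, and the probabilities and labels are invented for the example.

```python
def fan_score(p_good, p_bad):
    """Difference between the two class probabilities, divided by p_good.

    Assumption: the difference is taken as good minus bad.
    p_good must be non-zero.
    """
    return (p_good - p_bad) / p_good

def gini(scores, is_good):
    """Gini = 2 * AUC - 1, with the AUC computed over all (good, bad) pairs."""
    good = [s for s, g in zip(scores, is_good) if g]
    bad = [s for s, g in zip(scores, is_good) if not g]
    wins = 0.0
    for s_good in good:
        for s_bad in bad:
            if s_good > s_bad:
                wins += 1.0   # the good file is ranked above the bad one
            elif s_good == s_bad:
                wins += 0.5   # ties count for half
    auc = wins / (len(good) * len(bad))
    return 2 * auc - 1

# Hypothetical (p_good, p_bad) pairs and observed outcomes
probabilities = [(0.30, 0.05), (0.20, 0.10), (0.02, 0.06), (0.10, 0.08), (0.01, 0.09)]
observed_good = [True, True, False, False, True]

scores = [fan_score(p_good, p_bad) for p_good, p_bad in probabilities]
print(scores)
print(gini(scores, observed_good))
```

The same `gini` function can be applied to the scores of a logistic regression on the same files, which makes the two models directly comparable on the ranking.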

Results

The objective of the apprenticeship was to test a new method as an alternative to logistic regression within the framework of scoring.
The apprenticeship took place in several stages:

  • Research on methodologies
  • Writing of summary notes on the different methods reviewed
  • Choice of one modeling method
  • Thorough study of the FAN model
  • Development of SAS programs
  • Performance tests against logistic regression on different databases
  • Study of the method on a polytomous criterion and development of SAS macros for polytomous score modeling
The methodological study also led me to study the FAN model on a polytomous score.
  • In terms of ranking, the performance of the FAN model matches that of logistic regression on a binary criterion. The time saved is however too marginal to justify replacing logistic regression. The model is nevertheless useful to challenge the results of logistic regression.
  • In terms of classification, the FAN model is overall better than logistic regression. This result is especially true on a polytomous criterion, where the FAN model is able to classify files into more than 2 categories. However, this technique shows the same drawbacks as logistic regression when one modality of the criterion is overrepresented compared to the others.
The results observed on a polytomous criterion are very satisfying. Indeed, until now there was no existing method for building a polytomous score at PF (a working group is currently exploring the subject). A full methodology was set up, together with the SAS macros needed to build a polytomous score; it has since been used, and continues to be used, by the R&D team working on this subject.
Based on the lessons learned from the FAN model and the HUM method, we completed the following parts of the polytomous modeling:
  • variable selection
  • selection of variable crossings
  • performance calculation
  • risk level selection
I was also tasked with writing SAS and C macros on behalf of the Scoring Center during the year.