The Scoring Center of BNP Paribas Personal Finance uses scoring methods to steer credit granting, debt collection and its commercial loans. To do so, it relies essentially on logistic regression.
The aim of the apprenticeship was to find and apply a new scoring method as an alternative to logistic regression. During this year I studied the "Forest Augmented Naive Bayes classifier" (FAN) method and compared its results with those of logistic regression.
The Personal Finance (PF) division is part of the Retail Banking branch of BNP Paribas. BNP Paribas Personal Finance is the European leader in consumer and real estate credit.
Born in 2008 from the merger of Cetelem, active in the consumer credit market, and UCB, a European real estate credit specialist, BNPP-PF accounted at the time for €108 billion worth of credit and 20 million clients, was present on 4 continents and in over 30 countries, and employed 26,500 people.
Based in the Risk Department, the Scoring Center is in charge of providing statistical tools that facilitate risk control. Scores are derived from the statistical study of the observed behavior of a population, and aim to assign a rating to each file according to its level of risk. The rules that govern the calculation of those scores are then embedded in expert systems (software tools that reproduce the reasoning of an expert in a given field) and make it possible to measure the level of risk and to select specific files accordingly.
The Scoring Center is also responsible for supervising decision support tools within PF. Together with IT tools, decision support tools allow subsidiaries to carry out granting or collection strategies. The Scoring Center ensures the performance of those tools and guides subsidiaries in their use (training, assistance, support...).
The main missions of the Scoring Center are:
The modelling techniques currently used in credit risk to distinguish good from bad clients (binary criterion) are essentially based on scoring and the use of logistic regression. This technique has been used at BNPP PF for several decades.
The objective is to find, model and test a new alternative scoring technique and to compare its results with those obtained with logistic regression.
After a bibliographic search of existing models, I chose to study the "Forest Augmented Naive Bayes classifier" model (FAN), which is an extension of the "Naive Bayes classifier" model.
This method drew my attention because of its simplicity, the good results obtained by the "Naive Bayes classifier" method in many data mining problems, and its ability to model a polytomous criterion.
To compare the two methods, a performance comparison was carried out. The classification results (confusion matrix) as well as the ranking results (rating score and ROC curve) were studied and compared with those of logistic regression. The FAN method was moreover developed on a polytomous criterion, that is, a criterion with more than 2 modalities.
Several SAS macros were written to study this method. I also developed other SAS and C macros on behalf of the Scoring Center.
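The two kinds of comparison mentioned here can be illustrated with a minimal Python sketch. It is not the SAS code used in the study: the function names are hypothetical, class 1 is arbitrarily treated as the positive class, and the area under the ROC curve is computed by comparing every pair of files from opposite classes.

```python
def confusion_matrix(actual, predicted):
    """Count files per (actual class, predicted class) for binary labels 0/1."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


def auc(actual, scores):
    """Area under the ROC curve, computed pairwise (ties count for one half).

    Assumption: label 1 is the positive class and a higher score means
    'more likely to be in class 1'.
    """
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, for illustration only.
actual = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
predicted = [1 if s >= 0.5 else 0 for s in scores]
matrix = confusion_matrix(actual, predicted)
area = auc(actual, scores)
```

The confusion matrix only looks at the final classification, whereas the area under the ROC curve depends on how the files are ranked by their score, which is why both were needed to compare the models.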
The Naive Bayes model is a widely used classification algorithm in data mining. In this model, the probability of belonging to a class (good/bad client) is estimated using conditional probabilities.
For that purpose, the probabilities associated with each class are computed for each variable (age, maximum delay of payment, job tenure) from a training sample, as a function of the different modalities of each variable (between 18 and 26 years old, between 27 and 35 years old, between 36 and 44 years old...). In practice, all the files that belong to a given modality of a variable and to a given class (for example, the number of clients between 18 and 26 years old who belong to the "good client" class) are counted and divided by the total number of files in that class. We therefore obtain, for each class (good/bad client), the probability of observing a given modality of a variable knowing that the file belongs to that class.
|Id||Age||Bank seniority||Job tenure||...||Number of postponements in collection||Class|
|1||Between 18 and 26 years old||Between 0 and 2 years||Between 0 and 2 years||...||0||1|
|2||More than 45 years old||More than 10 years||Between 10 and 15 years||...||Between 1 and 2||0|
|3||Between 27 and 35 years old||Between 3 and 6 years||Between 3 and 5 years||...||Between 1 and 2||0|
|4||Between 27 and 35 years old||Between 0 and 2 years||Between 0 and 2 years||...||More than 2||0|
|5||Between 18 and 26 years old||Between 0 and 2 years||Between 0 and 2 years||...||0||0|
|6||More than 45 years old||More than 10 years||Between 10 and 15 years||...||0||0|
|16000||Between 18 and 26 years old||Between 0 and 2 years||Between 0 and 2 years||...||0||0|
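The counting procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SAS macros written during the apprenticeship: the function names, the variable names and the four toy files are hypothetical, and no smoothing is applied to modalities that were never observed in a class.

```python
from collections import Counter


def fit_naive_bayes(rows, labels):
    """Estimate P(class) and P(modality | class) by simple counting."""
    n = len(labels)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    pair_counts = Counter()
    for row, c in zip(rows, labels):
        for var, modality in row.items():
            pair_counts[(var, modality, c)] += 1
    # Files in (modality, class) divided by all the files of the class.
    cond = {key: count / class_counts[key[2]] for key, count in pair_counts.items()}
    return priors, cond


def class_scores(row, priors, cond):
    """P(class) times the product of P(modality | class), for each class."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for var, modality in row.items():
            score *= cond.get((var, modality, c), 0.0)  # unseen modality: 0
        scores[c] = score
    return scores


def classify(row, priors, cond):
    """Maximum a posteriori rule: keep the class with the highest score."""
    scores = class_scores(row, priors, cond)
    return max(scores, key=scores.get)


# Hypothetical toy sample (not the study data).
rows = [
    {"age": "18-26", "job_tenure": "0-2"},
    {"age": "45+", "job_tenure": "10-15"},
    {"age": "27-35", "job_tenure": "3-5"},
    {"age": "18-26", "job_tenure": "0-2"},
]
labels = [1, 0, 0, 0]
priors, cond = fit_naive_bayes(rows, labels)
predicted = classify({"age": "45+", "job_tenure": "10-15"}, priors, cond)
```

The "naive" part of the model is the product in `class_scores`: each variable is assumed to be independent of the others once the class is known.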
The Forest Augmented Naive Bayes classifier model calculates the probabilities of belonging to a class knowing the value of a first variable combined with a second one. Indeed, having a second late payment over 20 years of bank seniority is not the same as having the same number of late payments in only one year. Variables are therefore paired according to the conditional mutual information criterion. Then, to avoid "overfitting" issues (the model becomes over-reliant on the training sample and does not perform well enough on new ones), we only keep the pairs of variables whose mutual information is high enough. The classification is still based on the maximum a posteriori criterion. We then obtain what is called a forest of Bayesian models.
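As an illustration of the pairing criterion, the sketch below estimates the conditional mutual information I(X; Y | C) between two variables given the class from observed frequencies, and keeps the pairs that exceed a threshold. The function names and the `threshold` parameter are hypothetical, and the sketch covers only the scoring and filtering of pairs: the step that assembles the retained pairs into a forest is not shown, and this is not the SAS implementation used in the study.

```python
from collections import Counter
from itertools import combinations
from math import log


def conditional_mutual_information(xs, ys, cs):
    """Estimate I(X; Y | C) from observed frequencies (natural logarithm)."""
    n = len(cs)
    n_xyc = Counter(zip(xs, ys, cs))
    n_xc = Counter(zip(xs, cs))
    n_yc = Counter(zip(ys, cs))
    n_c = Counter(cs)
    total = 0.0
    for (x, y, c), count in n_xyc.items():
        # P(x, y | c) / (P(x | c) * P(y | c)) expressed with raw counts.
        ratio = count * n_c[c] / (n_xc[(x, c)] * n_yc[(y, c)])
        total += (count / n) * log(ratio)
    return total


def candidate_pairs(columns, classes, threshold):
    """Pairs of variables whose conditional mutual information exceeds threshold.

    `columns` maps a variable name to its list of modalities, one per file.
    """
    pairs = []
    for a, b in combinations(sorted(columns), 2):
        cmi = conditional_mutual_information(columns[a], columns[b], classes)
        if cmi > threshold:
            pairs.append((a, b, cmi))
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```

Two variables that are independent within each class get a value of zero, so they would not be paired; the higher the value, the more information one variable carries about the other once the class is known.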
The goal of the study was to compare the performance of the studied model with that of logistic regression; we therefore needed the equivalent of a rating score for ranking purposes.
The difficulty was that the two probabilities (of belonging to each of the two classes) cannot be compared across files. A file with a high probability of being classified as a good client will not necessarily be classified as "good". It all depends on its probability of being classified as "bad". If the latter is even higher than the former, the file will be classified as "bad", even though its probability of being "good" is higher than that of other files which have been classified as "good". We therefore could not take a single probability as a reference to establish a ranking of the files.
|Criterion||File||Probability of class 0||Probability of class 1||Credit Score||Comment|
|1||15122||2.597E-1||1.090E-1||0||The probability of class 1 of file 15122 is higher than that of file 15121. However, this file is not classified in class 1, whereas file 15121 is. This is because its probability of belonging to class 0 is even higher.|
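The situation described in the table can be reproduced with a short Python sketch. The values for file 15122 come from the table; the second file and its values are hypothetical, since the figures for file 15121 are not given here. The `normalized_score` function is only one possible way of deriving a single comparable score from the two probabilities and is an assumption, not necessarily the solution adopted in the study.

```python
def map_class(p0, p1):
    """Maximum a posteriori rule: the class with the larger probability wins."""
    return 1 if p1 > p0 else 0


def normalized_score(p0, p1):
    """Share of class 1 in the total of the two probabilities.

    Assumption: one possible ranking score, shown for illustration only.
    It requires p0 + p1 > 0.
    """
    return p1 / (p0 + p1)


# File 15122, values taken from the table.
p0_a, p1_a = 2.597e-1, 1.090e-1
# Hypothetical second file: lower class 1 probability, yet classified in class 1.
p0_b, p1_b = 2.0e-2, 5.0e-2

class_a = map_class(p0_a, p1_a)  # class 0, despite the higher class 1 probability
class_b = map_class(p0_b, p1_b)  # class 1
```

Ranking the two files on the class 1 probability alone would put file 15122 first, which contradicts the classification. A score built from both probabilities, such as the normalized one above, ranks the hypothetical file first and stays consistent with the maximum a posteriori rule, because it exceeds 0.5 exactly when class 1 wins.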
The objective of the apprenticeship was to test a new alternative method to logistic regression in the context of scoring.
The apprenticeship took place in several stages: