Application case of data mining technology in credit card business

Credit card business is characterized by a large total overdraft volume made up of small individual amounts, which makes the application of data mining technology in this business inevitable. Foreign credit card issuers have widely used data mining to promote the development of their credit card business and achieve comprehensive performance management. Since the first credit card was issued in China in 1985, the credit card business has developed by leaps and bounds and accumulated a huge amount of data, and the importance of data mining in this business is becoming more and more obvious.

I. Application of data mining technology in credit card business

The application of data mining technology in credit card business mainly covers analytical customer relationship management (CRM), risk management and operation management.

1. Analytical CRM

Analytical CRM applications include market segmentation, customer acquisition, cross-selling and customer churn analysis. Credit card analysts collect and process a large amount of data and look for patterns in it. They analyze the characteristics, consumption habits, consumption tendencies and consumption needs of a customer group, infer that group's next consumption behavior, and then actively market specific products to the identified group. Compared with traditional mass marketing, which does not distinguish between consumers, this approach greatly reduces marketing cost and improves marketing results, bringing more profit to the bank. The purchase probability predicted by a response model determines which marketing method to use for each customer. Customers with a high response probability merit more active and personal methods, such as telephone marketing and door-to-door marketing. For customers with a low response probability, low-cost e-mail and letter marketing can be chosen.

In addition to acquiring new customers, it is also important to maintain the loyalty of existing high-quality customers, because retaining an existing customer costs far less than developing a new one. In customer relationship management, data mining can identify the characteristics of lost customers and the patterns behind their loss. The bank can then take retention measures for cardholders with similar characteristics before they leave, so that high-quality customers continue to create value for the bank.
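The channel choice described above can be sketched in a few lines, assuming a response model has already produced a probability for each customer. The function name `choose_channel`, the 0.5 threshold and the sample customers are illustrative assumptions, not values from the case.

```python
def choose_channel(response_probability, threshold=0.5):
    """Map a predicted response probability to a marketing channel.

    The threshold is a hypothetical cut-off; a real issuer would tune it
    against marketing cost and expected profit.
    """
    if not 0.0 <= response_probability <= 1.0:
        raise ValueError("response_probability must be between 0 and 1")
    if response_probability >= threshold:
        # High expected response: spend more on personal contact.
        return "telephone or door-to-door"
    # Low expected response: use the cheapest channels.
    return "e-mail or letter"


# Hypothetical customers with probabilities from a response model.
customers = {"A": 0.82, "B": 0.07, "C": 0.5}
plan = {customer_id: choose_channel(p) for customer_id, p in customers.items()}
```

In practice more than two tiers may be useful, but the two-tier rule mirrors the distinction drawn in the text.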

2. Risk management

Another important application of data mining in credit card business is risk management, where it is used to build various credit scoring models. There are three main models: the application scorecard, the behavior scorecard and the collection scorecard, which provide credit risk control before, during and after the extension of credit respectively.

The application scoring model is used for the credit evaluation of new applicants and fits the credit review stage of card issuance. From the personal information filled in by the applicant, it can identify and grade customer quality quickly and effectively, decide whether to approve the application, and set the initial credit limit for approved applicants, helping the issuing bank control risk at the source. Because the application scoring model does not depend on subjective judgment or experience, it helps issuing banks implement unified and standardized credit policies.

The behavior scoring model is aimed at existing cardholders. It monitors and predicts cardholder behavior in order to evaluate credit risk. Based on the model results, the bank can decide whether to adjust a customer's credit limit, whether to approve a transaction at authorization, and whether to renew the card when it expires, and can receive early warning of possible problems.

The collection scoring model supplements the application and behavior scoring models and is built for cardholders who are overdue or have bad debts. The collection scorecard predicts and evaluates the effectiveness of measures taken against bad debts, such as the likelihood that a customer responds to a warning letter. The issuing bank can then take different measures for customers with different degrees of delinquency according to the model's predictions.

The data used to build these three scoring models are mainly demographic data and behavioral data. Demographic data include age, gender, marital status, educational background, characteristics of family members, housing situation, occupation, professional title, income and so on. Behavioral data include the cardholder's past frequency of use, amounts, repayment record and other performance information. Data mining therefore enables banks to build effective credit risk control systems covering the stages before, during and after the extension of credit.

3. Operation management

Although operation management is not the most important area of application for data mining in credit card business, many foreign card issuers have achieved a great deal with it in improving production efficiency, optimizing processes, forecasting funding and service demand, and arranging the order of services.

II. Commonly used data mining methods

Many tools can be used to develop the predictive and descriptive models behind the applications above. Some are statistical methods, such as linear regression and logistic regression; others are non-statistical or hybrid methods, such as neural networks, genetic algorithms, decision trees and regression trees. Only a few typical methods are discussed here.

1. Linear regression

Simple linear regression is a statistical technique for quantifying the relationship between two continuous variables: an independent variable (the predictor) and a dependent variable (the predicted variable). The method finds the line through the data that minimizes the sum of squared deviations between the line and the data points. When modeling marketing, risk and customer relationships, there are usually several independent variables. Predicting a continuous variable from multiple independent variables is called multiple linear regression, and models built by linear regression are usually robust.
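A minimal sketch of simple linear regression by ordinary least squares, using only the Python standard library. The function name `fit_line` and the income and spending figures are made up for illustration and do not come from the case.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: monthly income vs. monthly card spending (thousands).
income = [3.0, 4.0, 5.0, 6.0]
spend = [1.0, 1.5, 2.0, 2.5]
intercept, slope = fit_line(income, spend)
```

Multiple linear regression follows the same least-squares principle with several predictors, but requires solving a system of equations rather than a single ratio.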

2. Logistic regression

Logistic regression is the most widely used modeling technique and is similar to linear regression. The main difference is that the dependent variable (predicted variable) of logistic regression is not continuous but discrete, that is, a categorical variable. For an application scoring model, logistic regression can be used to select the key variables and determine the regression coefficients. The key variables of the applicant are taken as the independent variables $x_1, x_2, \dots, x_m$, and the dependent variable is $y = 1$ if the applicant is a bad customer and $y = 0$ if the applicant is a good customer. For such a two-valued dependent variable, the probability that a customer turns bad is generally assumed to be

$$P(y = 1) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_m x_m}}$$
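The logistic formula for the probability of a bad customer can be evaluated as in the sketch below. It assumes the coefficients have already been estimated elsewhere; the function name `bad_probability` and the numbers in the example are hypothetical.

```python
import math


def bad_probability(intercept, coefficients, features):
    """Evaluate P(y = 1) = e^z / (1 + e^z), with z = b0 + b1*x1 + ... + bm*xm."""
    if len(coefficients) != len(features):
        raise ValueError("need one coefficient per feature")
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients and applicant features (not fitted to real data).
p_bad = bad_probability(0.5, [1.0, -2.0], [1.0, 0.75])
```

A larger linear score always gives a larger probability, so ranking applicants by score and by probability is equivalent.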

3. Neural network

Neural network processing is very different from regression. It does not assume any particular probability distribution but imitates the function of the human brain, and can be thought of as extracting and learning information from every experience. A neural network consists of a series of nodes, similar to the neurons of the human brain, that are connected to each other in a network. Given input data, they work together to determine the patterns in the data. The network is made up of an input layer, an intermediate layer (or hidden layer) and an output layer, all connected to each other. The hidden layer consists of multiple nodes and does most of the network's work. The output layer delivers the result of the analysis.
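To make the three layers concrete, here is a minimal sketch of a single forward pass through a network with one hidden layer. The sigmoid activation, the function names and all weights are assumptions chosen for illustration; the text does not specify an architecture, and training is not shown.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_bias):
    """Propagate inputs through one hidden layer to a single output node."""
    # Hidden layer: each node combines all inputs with its own weights.
    hidden = [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Output layer: combines the hidden activations into one value.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Hypothetical network: 2 inputs, 2 hidden nodes, 1 output.
score = forward(
    inputs=[0.4, 0.9],
    hidden_weights=[[0.5, -0.3], [0.8, 0.2]],
    hidden_biases=[0.1, -0.1],
    output_weights=[1.2, -0.7],
    output_bias=0.05,
)
```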

4. Genetic algorithm

Like a neural network, a genetic algorithm does not assume any particular probability distribution. It is modeled on the evolutionary process of "survival of the fittest". It first encodes the possible solutions of the problem in some form; an encoded solution is called a chromosome. N chromosomes are randomly selected as the initial population, and the fitness value of each chromosome is calculated with a predetermined evaluation function, so that better-performing chromosomes have higher fitness values. Chromosomes with higher fitness are selected for replication, and genetic operators generate from them a new group of chromosomes better adapted to the environment, forming a new population. The process repeats until the population converges to the individual best adapted to the environment, which gives the optimal solution of the problem.
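The loop described above (encode, evaluate, select, recombine) can be sketched on a toy problem. Everything specific here is an assumption for illustration: chromosomes are bit lists, fitness is the number of 1 bits, the better half of the population is used as parents, and the best chromosome is carried over unchanged (elitism).

```python
import random


def fitness(chromosome):
    """Toy evaluation function: count the 1 bits."""
    return sum(chromosome)


def evolve(length=20, population_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Return (best chromosome, best fitness recorded at each generation)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        next_population = [population[0][:]]  # elitism: keep the current best
        parents = population[:population_size // 2]  # selection of the fitter half
        while len(next_population) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)  # single-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each gene with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = evolve()
```

Because the best chromosome always survives, the recorded best fitness never decreases from one generation to the next.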

5. Decision tree

The goal of a decision tree is to classify data into different groups or branches step by step, creating at each step the strongest partition with respect to the value of the dependent variable. Because the resulting classification rules are intuitive, they are easy to understand. Figure 1 is a decision tree of customer response, from which it is easy to identify the group with the highest response rate.
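A single step of tree building, finding the strongest partition on one variable, can be sketched as follows. Gini impurity is used as the split criterion here as an assumption; the text does not name one. The ages and responses are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def best_split(values, labels):
    """Return (threshold, weighted impurity) of the best split 'value <= threshold'."""
    best = None
    # The largest value is skipped because it would leave the right branch empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical campaign data: customer age and whether the customer responded.
ages = [22, 25, 31, 38, 45, 52]
responded = [1, 1, 1, 0, 0, 0]
split = best_split(ages, responded)
```

A full tree applies the same search recursively to each branch and across all candidate variables.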

III. Case analysis

Taking the use of logistic regression to build a credit card application scoring model as an example, this section illustrates the application of data mining technology in credit card business. The design of an application scoring model can be divided into seven basic steps.

1. Define the criteria for good customers and bad customers.

The standards for good and bad customers are defined according to management needs. According to foreign experience, building a risk model that predicts customer quality requires at least 1000 samples. In the initial stage of the credit card market, a bank's main sources of income are merchant commissions, credit card interest, handling fees and the operating spread on funds, so banks generally make reducing the customer overdue rate their main management goal in order to avoid risk. For example, a bad customer can be defined as one who has been overdue for more than 60 days, and a good customer as one who has never been overdue for more than 30 days and is not overdue at present.

Generally speaking, within the same sample space the number of good customers is far greater than the number of bad customers. To ensure that the model has a high ability to identify bad customers, good and bad customers are sampled at a ratio of 1:1.
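The example definitions and the 1:1 sampling can be sketched as below. The "indeterminate" label for customers who meet neither definition is an assumption added for completeness, and the function names and sample records are hypothetical.

```python
import random


def label_customer(max_days_overdue, currently_overdue):
    """Apply the example definitions: bad if ever > 60 days overdue,
    good if never > 30 days overdue and not overdue now."""
    if max_days_overdue > 60:
        return "bad"
    if max_days_overdue <= 30 and not currently_overdue:
        return "good"
    return "indeterminate"  # assumption: excluded from modeling


def balanced_sample(records, seed=0):
    """Draw equally many good and bad customers (1:1 ratio)."""
    rng = random.Random(seed)
    good = [r for r in records if r["label"] == "good"]
    bad = [r for r in records if r["label"] == "bad"]
    n = min(len(good), len(bad))
    return rng.sample(good, n) + rng.sample(bad, n)


# Hypothetical payment histories: (longest delinquency in days, overdue now?).
payment_history = [(0, False), (10, False), (75, True), (45, False), (20, False), (90, False)]
records = [{"id": i, "label": label_customer(days, overdue)}
           for i, (days, overdue) in enumerate(payment_history)]
sample = balanced_sample(records)
```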

2. Determine the sample space

When determining the sample space, consider whether the sample is representative. A customer counts as good only if the cardholder has performed well throughout an observation period, whereas a single "bad" record is enough to identify a customer as bad. The observation period for good customers is therefore generally longer than that for bad customers. Good and bad customers can be drawn from different time periods, that is, from different sample spaces. For example, the sample space of good customers can be applicants from June 2003 to December 2003, and the sample space of bad customers can be applicants from June 2003 to May 2004. This ensures both a long performance period for good customers and a sufficient number of bad customers. In either case, the sampled customers must be representative.

3. Data source

In the United States, credit bureaus score personal credit, and the result is usually called the "FICO score". Banks, credit card companies and financial institutions in the United States can use the personal data reported by credit reporting agencies when analyzing customers' credit risk. In China, because the credit information system is still incomplete, modeling data come mainly from the application form. As China's national credit information system gradually improves, some of the data for future modeling can be collected from credit information agencies.

4. Data collation

A large amount of sampled data must be sorted out before it can finally enter the model. During data processing, check the logical consistency of the data, distinguish between "missing data" and "0", infer some values by logic, and find abnormal data and evaluate its authenticity. Calculating the minimum, maximum and average values gives a preliminary check on whether the sampled data are random and representative.
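A minimal sketch of such a check for one numeric field, keeping "missing" and "0" apart. It assumes missing values are represented as `None`; the function name `summarize` and the income figures are hypothetical.

```python
def summarize(values):
    """Summarize a numeric field, counting missing values and zeros separately."""
    present = [v for v in values if v is not None]
    summary = {
        "missing": len(values) - len(present),  # field left blank
        "zeros": sum(1 for v in present if v == 0),  # field filled in as 0
        "min": None,
        "max": None,
        "mean": None,
    }
    if present:
        summary["min"] = min(present)
        summary["max"] = max(present)
        summary["mean"] = sum(present) / len(present)
    return summary


# Hypothetical monthly income field from application forms.
monthly_income = [0, None, 3000, 5000, None, 4000]
report = summarize(monthly_income)
```

Treating the blanks as 0 would pull the mean down, which is why the two cases are counted separately.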

5. Variable selection

The chosen variables should not only be statistically sound but also have explanatory power for the actual credit card business. Logistic regression looks for the independent variables that predict the dependent variable as accurately as possible and gives each a weight. With too few independent variables the fit is poor and the dependent variable cannot be predicted well; with too many, the model overfits and again predicts poorly. The set of independent variables therefore has to be prepared and reduced, for example by using dummy variables to represent variables that cannot be quantified and by using univariate analysis and decision tree analysis to screen variables. Categories of an independent variable that have almost the same correlation with the dependent variable can be merged into one category. Take the influence of region on the probability of being a bad customer: if the correlations of Guangdong Province and Fujian Province with bad customers are -0.381 and -0.380 respectively, the two regions can be merged into one category. In addition, some independent variables can be constructed from the information in the application form, such as combining the fields "marital status" and "raising children" according to experience and common sense.
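The two ideas in this step, dummy variables and merging categories with nearly equal correlations, can be sketched as follows. The tolerance of 0.005, the "Other" region with its correlation value, and both function names are illustrative assumptions; only the Guangdong and Fujian figures come from the text.

```python
def dummy_encode(value, categories):
    """Encode a categorical value as 0/1 indicators.

    The first category is the reference level and gets no indicator.
    """
    return [1 if value == category else 0 for category in categories[1:]]


def merge_similar(correlations, tolerance=0.005):
    """Group categories whose correlations with the bad-customer flag are close.

    Categories are sorted by correlation; a category joins the previous group
    when it differs from that group's last member by at most `tolerance`.
    """
    groups = []
    for name, corr in sorted(correlations.items(), key=lambda item: item[1]):
        if groups and abs(corr - groups[-1][-1][1]) <= tolerance:
            groups[-1].append((name, corr))
        else:
            groups.append([(name, corr)])
    return [[name for name, _ in group] for group in groups]


region_correlation = {"Guangdong": -0.381, "Fujian": -0.380, "Other": 0.12}
merged = merge_similar(region_correlation)
encoded = dummy_encode("married", ["single", "married", "divorced"])
```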

6. Model structure

The variables were screened by stepwise regression using SAS 9 software. The algorithm designed for this purpose has six steps.

Step 1: Compute the multivariate correlation matrix (for dummy variables, a coefficient > 0.5 counts as relatively correlated; for general variables, > 0.7-0.8 counts as relatively correlated).

Step 2: Perform rotated principal component analysis (for general variables, a coefficient > 0.8 counts as relatively correlated; for dummy variables, > 0.6-0.7 counts as relatively correlated).

Step 3: Select 15 variables from each of the first and second principal components, 30 variables in total.

Step 4: Calculate the good/bad correlation of all 30 variables, find out the variables with high correlation and add them to the variables obtained in step 3.

Step 5: Calculate the variance inflation factor (VIF) of each variable. If a VIF value is large, look up the corresponding pair of correlated variables in the correlation matrix from step 1, analyze the influence of each of the two variables on the model, and eliminate the one with the lower correlation.

Step 6: Repeat steps 4 and 5 until the final set of variables is found, in which the correlations among the variables in the multivariate correlation matrix are low and each individual variable contributes substantially to the model.
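The correlation matrix of step 1 and the VIF of step 5 can be sketched in plain Python. This is not the SAS procedure used in the case; it relies on the standard identity that the VIF of each variable equals the corresponding diagonal element of the inverse correlation matrix. All function names and data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Pairwise correlations between all variables (one list per variable)."""
    return [[pearson(a, b) for b in columns] for a in columns]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: variables are perfectly collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """Variance inflation factor of each variable."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[i][i] for i in range(len(columns))]


# Hypothetical variables; x and y have a correlation of 0.8.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
vif_values = vif([x, y])
```

A VIF near 1 means a variable is almost uncorrelated with the others; which value counts as "large" is a modeling choice the text leaves open.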

7. Model validation

When the data are collected, all sorted data are divided into modeling samples, used to build the model, and control samples, used to verify it. The control samples are used to verify the overall predictive power and stability of the model. Test indicators for a scoring model include the K-S value, ROC, AR and others. Although affected by objective factors such as unclean data, the K-S value of the scoring model in this case exceeded 0.4, reaching a usable level.
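The K-S value is the largest gap between the cumulative score distributions of bad and good customers. A minimal sketch of its calculation on a control sample follows; the function name `ks_statistic` and the scores are made up for illustration and are unrelated to the model in this case.

```python
def ks_statistic(scores, is_bad):
    """Maximum distance between the cumulative distributions of bad and good scores."""
    n_bad = sum(is_bad)
    n_good = len(is_bad) - n_bad
    if n_bad == 0 or n_good == 0:
        raise ValueError("need at least one good and one bad customer")
    pairs = sorted(zip(scores, is_bad))
    cum_bad = cum_good = 0
    largest_gap = 0.0
    i = 0
    while i < len(pairs):
        # Process all customers sharing the same score before comparing.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                cum_bad += 1
            else:
                cum_good += 1
            j += 1
        largest_gap = max(largest_gap, abs(cum_bad / n_bad - cum_good / n_good))
        i = j
    return largest_gap


# Hypothetical control sample: credit scores and bad-customer flags.
scores = [300, 350, 400, 450, 500, 550, 600, 650]
is_bad = [1, 1, 0, 1, 0, 0, 1, 0]
ks = ks_statistic(scores, is_bad)
```

A value of 0 means the scores do not separate the two groups at all, and 1 means perfect separation.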

IV. Development prospects of data mining in the domestic credit card market

In foreign countries, the credit card business is highly computerized and a large volume of data is kept in databases. Various models built with data mining technology have been successfully implemented in credit card business there. Domestic credit card issuers have so far used data mining first of all to build application scoring models: as a first step, many issuing banks have built customized application scoring models from their own historical data. Generally speaking, the application of data mining in China's credit card business is held back by data quality problems, which make it difficult to build business models.

Domestic card-issuing banks have established or started to establish data warehouses, in which data from different operational sources are stored in a centralized environment and properly cleaned and converted. This provides a good platform for data mining and makes it more convenient and capable. The personal credit information system of the People's Bank of China has also been launched, forming a centralized nationwide store of personal credit data. As the internal and external environment continues to improve, data mining technology will have increasingly broad application prospects in credit card business.