How to calculate AUC for an SVM in Python
Draw ROC curve in Python and calculate AUC value

Preface

ROC curves and AUC are commonly used to evaluate binary classifiers. This article briefly introduces ROC and AUC, then demonstrates with an example how to draw an ROC curve and calculate AUC in Python.

Introduction to AUC

AUC (Area Under Curve) is a very commonly used evaluation metric for binary classification models in machine learning. Compared with the F1-score, AUC is more tolerant of class imbalance. Common machine learning libraries (such as scikit-learn) generally include a function for computing this metric, but sometimes the model is standalone or hand-written. In that case, to evaluate the trained model you have to build an AUC calculation module yourself. While researching this, I found that libsvm-tools contains a very easy-to-understand AUC calculation, so I extracted it for future use.

AUC calculation

The calculation of AUC is divided into the following three steps:

1. Prepare the data. If model training has only a training set, cross-validation is generally used to obtain the scores; if there is a separate evaluation set, AUC can be computed on it directly. The data generally consists of the predicted score and its target class (note that this is the target class, not the predicted class).

2. For each threshold, compute a point whose horizontal coordinate is X (false positive rate) and whose vertical coordinate is Y (true positive rate).

3. After connecting the coordinate points into a curve, calculate the area under the curve, which is the value of AUC.
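The three steps above can be sketched in a few lines of pure Python. This is a minimal illustration, not the libsvm-tools code: the function names `roc_points` and `auc_trapezoid` and the sample data are made up for this example, each row is assumed to be a single (score, label) pair with label 1 for positive and 0 for negative, and the area is computed with the trapezoidal rule.

```python
def roc_points(scored):
    """Return ROC points [(fpr, tpr), ...] for (score, label) pairs, label in {0, 1}."""
    pos = sum(1 for _, label in scored if label == 1)
    neg = len(scored) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative sample")
    # Step 1: rank the samples by predicted score, highest first
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Step 2: lower the threshold one score at a time and record (FPR, TPR)
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores
        last_of_group = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_group:
            points.append((fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Step 3: area under the piecewise-linear curve through the points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical sample data: (predicted score, target class)
sample = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0), (0.2, 0)]
points = roc_points(sample)
print(points)
print(auc_trapezoid(points))
```

For this sample, 8 of the 9 positive/negative pairs are ranked correctly, so the area should come out as 8/9.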

The Python code is as follows:

# -*- coding: utf-8 -*-
import pylab as pl
from math import log, exp, sqrt

evaluate_result = "your file path"

db = []  # each entry: [score, nonclk, clk]
pos, neg = 0, 0

# Read "nonclk<TAB>clk<TAB>score" lines and count positives and negatives
with open(evaluate_result, 'r') as fs:
    for line in fs:
        nonclk, clk, score = line.strip().split('\t')
        nonclk = int(nonclk)
        clk = int(clk)
        score = float(score)
        db.append([score, nonclk, clk])
        pos += clk
        neg += nonclk

# Sort by score, highest first
db = sorted(db, key=lambda x: x[0], reverse=True)

# Calculate ROC coordinate points
xy_arr = []
tp, fp = 0., 0.
for i in range(len(db)):
    tp += db[i][2]
    fp += db[i][1]
    xy_arr.append([fp / neg, tp / pos])

# Calculate the area under the curve (rectangle rule)
auc = 0.
prev_x = 0
for x, y in xy_arr:
    if x != prev_x:
        auc += (x - prev_x) * y
        prev_x = x

print("auc is %s" % auc)

x = [_v[0] for _v in xy_arr]
y = [_v[1] for _v in xy_arr]
pl.title("ROC curve of %s (AUC = %.4f)" % ('svm', auc))
pl.xlabel("False Positive Rate")
pl.ylabel("True Positive Rate")
pl.plot(x, y)  # plot x and y with pylab
pl.show()  # show the plot on the screen

The input data set can be taken from the SVM prediction results.

Its format is:

nonclk \t clk \t score

where:

1. nonclk: the number of records that were not clicked, which can be treated as the number of negative samples.

2. clk: the number of clicks, which can be treated as the number of positive samples.

3. score: the predicted score. Using it as the key to pre-aggregate the counts of positive and negative samples reduces the amount of computation needed for AUC.
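As an illustration of this format, the following sketch parses a few tab-separated lines held in memory instead of a file. The helper name `parse_rows` and the example values are hypothetical; the sketch assumes every non-blank line has exactly three fields in the order described above.

```python
def parse_rows(lines):
    """Parse 'nonclk<TAB>clk<TAB>score' lines.

    Returns (rows, pos, neg), where rows is a list of [score, nonclk, clk]
    sorted by score in descending order.
    """
    rows = []
    pos = neg = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        nonclk, clk, score = line.split("\t")
        nonclk, clk, score = int(nonclk), int(clk), float(score)
        rows.append([score, nonclk, clk])
        pos += clk      # clicks count as positive samples
        neg += nonclk   # non-clicks count as negative samples
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows, pos, neg


# Made-up example: three score groups with pre-aggregated counts
example_lines = ["3\t1\t0.20", "1\t4\t0.90", "2\t2\t0.55"]
rows, pos, neg = parse_rows(example_lines)
print(rows)
print(pos, neg)
```

Because each line already carries the counts for one score, the number of ROC points equals the number of distinct scores rather than the number of raw samples.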

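The result of running the script is a plot titled "ROC curve of svm (AUC = ...)", with the false positive rate on the horizontal axis and the true positive rate on the vertical axis.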

If pylab is not installed on your machine, you can simply comment out the pylab import and the plotting part.

Notes

The code published above:

1. Can only compute results for binary classification (how you encode the two class labels is up to you).

2. Treats every score as a threshold, which is quite inefficient in practice. You can subsample the samples, or compute points only at equally spaced positions along the horizontal axis.
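One possible reading of the sampling suggestion is to thin the list of ROC points before plotting or integrating. The sketch below is only an assumption about how that could look: `thin_points` and `auc_trapezoid` are hypothetical helper names, the curve is synthetic, and the thinned area is an approximation of the full one.

```python
def thin_points(points, max_points=100):
    """Keep at most max_points ROC points at an even stride, always keeping both ends."""
    if max_points < 2:
        raise ValueError("max_points must be at least 2")
    if len(points) <= max_points:
        return list(points)
    step = (len(points) - 1) / (max_points - 1)
    # A set removes duplicate indices produced by rounding
    indices = sorted({round(i * step) for i in range(max_points)})
    return [points[i] for i in indices]


def auc_trapezoid(points):
    """Area under the piecewise-linear curve through (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Synthetic curve for illustration only: y = sqrt(x) sampled at 1001 points
curve = [(i / 1000, (i / 1000) ** 0.5) for i in range(1001)]
thinned = thin_points(curve, max_points=11)
print(len(curve), len(thinned))
print(auc_trapezoid(curve), auc_trapezoid(thinned))
```

The fewer points are kept, the cheaper plotting and integration become and the coarser the area estimate gets, so `max_points` trades accuracy for speed.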