Current location - Education and Training Encyclopedia - Graduation thesis - Brief introduction of isolated forest algorithm
Brief introduction of isolated forest algorithm
Isolated forest algorithm is an unsupervised anomaly detection method suitable for continuous data. It was first put forward by Professor Zhou Zhihua of Nanjing University in 2008, and then an improved version was put forward on 20 12. Different from other anomaly detection algorithms, which describe the alienation between samples by quantitative indicators such as distance and density, the isolated forest algorithm detects outliers by isolating sample points. Specifically, the algorithm uses a binary search tree structure called isolation tree to isolate samples. Because the number of outliers is small and alienated from most samples, outliers will be isolated earlier, that is, outliers will be closer to the root node, while normal values will be farther away from the root node. In addition, compared with traditional algorithms such as LOF and K-means, the isolated forest algorithm is more robust to high latitude data.

Firstly, we give the definition of isolation tree and the path length of sample points in isolation tree.

Isolated tree: if it is a node of an isolated tree, there are two situations: the external node has no child nodes, and the internal node has two child nodes and a test. The test in consists of a property and a split point, which belongs to and vice versa.

Path length of the sample point in the isolated tree: the number of edges of the sample point from the root node to the leaf node.

From the following figure, we can intuitively see that relatively abnormal points are separated from the whole only after four cuts, while relatively normal points are separated from the whole only after 1 1 splits. This also embodies the basic idea of isolated forest algorithm. (ps: the picture is from the original paper)

Next, let's introduce the isolated forest algorithm in detail. The algorithm can be roughly divided into two stages. In the first stage, we need to train an isolated tree to form an isolated forest. Then we bring each sample into each isolated tree in the forest, calculate the average height, and then calculate the abnormal score of each sample.

Step 1: For a given data set, randomly select a subset of sample points and put it into the root node.

Step 2: Randomly specify a dimension from the dimensions and randomly generate a cutting point in the current data.

Step 3: This cutting point generates a hyperplane, and divides the current data space into two subspaces: the specified sample points with dimension less than p are put into the left child node, and the sample points with dimension greater than or equal to p are put into the right child node.

Step 4: Perform steps 2 and 3 recursively until all leaf nodes have only one sample point or the isolated tree has reached the specified height.

Step 5: Loop step 1 to step 4 until an isolated tree is generated.

The second stage:

Step 1: For each data point, traverse each isolated tree, calculate the average height of the point in the forest, and normalize the average height of all points. The calculation formula of abnormal value score is as follows:

Among them,

Reference: https://dl.acm.org/citation.cfm? doid = 2 13360.26665436366