I. Introduction
Information technology security is an important issue, and much effort has been invested in research on intrusion and insider threat detection. Many contributions have been made on handling security-related data [1]-[4], detecting botnets [5]-[8], port scans [9]-[12], brute-force attacks [13]-[16], and so on. What all these works have in common is that they require representative network-based data sets. Furthermore, benchmark data sets are a good basis for evaluating and comparing the quality of different network intrusion detection systems (NIDS). Given a labeled data set in which each data point is assigned to the class normal or attack, the number of detected attacks or the number of false alarms can be used as evaluation criteria.
Unfortunately, there are not many representative data sets. Sommer and Paxson [17] (2010) consider the lack of representative publicly available data sets to be one of the biggest challenges for anomaly-based intrusion detection. Malowidzki et al. (2015) [18] and Haider et al. (2017) [19] make similar statements. However, the community is working on this problem, as several intrusion detection data sets have been published in the past few years. Among them, the Australian Cyber Security Centre released the UNSW-NB15 [20] data set, Coburg University released the CIDDS-001 [21] data set, and the University of New Brunswick released the CICIDS 2017 [22] data set. More data sets are likely to follow. However, there is no comprehensive index of existing data sets, which makes it difficult to keep track of the latest developments.
This paper provides a survey of existing network-based intrusion detection data sets. First, the underlying data is examined in more detail. Network-based data appears in packet-based or flow-based formats. Flow-based data contains only meta-information about network connections, while packet-based data also contains payloads. Then, typical data set attributes used in the literature are analyzed and grouped in order to assess the quality of network data sets. The main contribution of this survey is a detailed literature review of network-based data sets and an analysis of which data sets fulfill which data set attributes. This paper focuses on the attack scenarios within the data sets and emphasizes relationships between the data sets. Furthermore, besides typical data sets, we briefly discuss traffic generators and data repositories as further sources of network traffic, and provide some observations and recommendations. As its main benefit, this survey establishes a set of data set attributes as a basis for comparing available data sets and for identifying suitable data sets for given evaluation scenarios. In addition, we have created a website which references all mentioned data sets and data repositories and which we intend to keep up to date.
The remainder of this paper is organized as follows. Related work is discussed in the next section. Section III analyzes packet-based and flow-based network data in more detail. Section IV discusses typical data set attributes used in the literature to assess the quality of intrusion detection data sets. Section V provides an overview of existing data sets and reviews each data set with respect to the attributes identified in Section IV. Section VI briefly introduces further sources of network data. Section VII discusses observations and recommendations before the paper is concluded with a summary.
II. Related Work
This section reviews related work on network-based intrusion detection data sets. It should be noted that this paper does not consider host-based intrusion detection data sets such as ADFA [23]. Readers can find detailed information on host-based intrusion detection data in the survey by Glass-Vanderlan et al. [24].
Malowidzki et al. [18] discuss missing data sets as a significant problem for intrusion detection, formulate requirements for good data sets, and list available data sets. Koch et al. [25] provide another overview of intrusion detection data sets, analyzing 13 data sources and evaluating them against 8 data set attributes. Nehinbe [26] provides a critical evaluation of data sets for IDS and intrusion prevention systems (IPS). The author studies seven data sets from different sources (e.g., the DARPA and DEFCON data sets), highlights their limitations, and proposes methods for creating more realistic data sets. Since many data sets have been published in the past few years, we continue the work of [18], [25], and [26] from the years 2011 to 2015, but provide a more up-to-date and more detailed overview than our predecessors.
While many data set papers (e.g., CIDDS-002 [27], ISCX [28], or UGR'16 [29]) give only a brief overview of some intrusion detection data sets, Sharafaldin et al. provide a more comprehensive review [30]. Their main contribution is a new framework for generating intrusion detection data sets. Sharafaldin et al. also analyze 11 available intrusion detection data sets and evaluate them against 11 data set attributes. In contrast to earlier data set papers, our work focuses on providing a neutral overview of existing network-based data sets rather than on providing an additional data set.
Other recent papers also touch on network-based data sets, but with a different main focus. Bhuyan et al. [31] present a comprehensive survey of network anomaly detection. The authors describe nine existing data sets and analyze the data sets used by existing anomaly detection methods. Similarly, Nisioti et al. [32] focus on unsupervised intrusion detection methods and briefly refer to 12 existing network-based data sets. Yavanoglu and Aydos [33] analyze and compare the most commonly used intrusion detection data sets. However, their review contains only seven data sets, including data sets of other types such as HTTP CSIC 2010 [34]. In summary, these works often pursue different research objectives and touch on network-based data sets only peripherally.
III. Data
Typically, network traffic is captured in packet-based or flow-based format. Packet-level network traffic is usually captured by mirroring ports on network devices. Packet-based data contains full payload information. Flow-based data is more condensed and usually contains only metadata about network connections. Wheelus et al. emphasize this difference with an illustrative comparison: "A good example of the difference between captured packet inspection and NetFlow is walking through the forest on foot versus flying over it in a hot air balloon" [35]. In this work, a third category (other data) is introduced. The other category has no standard format and varies for each data set.
A. Packet-based data
Packet-based data is usually captured in pcap format and contains payloads. The available metadata depends on the network and transport protocols used. There are many different protocols, the most important being TCP, UDP, ICMP, and IP. Figure 1 shows the different headers. TCP is a reliable transport protocol whose header contains metadata such as the sequence number, acknowledgment number, TCP flags, or checksum value. UDP is a connectionless transport protocol with a smaller header than TCP, containing only four fields, namely source port, destination port, length, and checksum. ICMP is a supporting protocol that carries status messages and therefore has an even smaller header. Next to the header of the transport protocol, there is typically an IP header available. The IP header provides information such as the source and destination IP addresses, as shown in Figure 1.
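As a minimal illustration of this metadata, the following sketch parses the four 16-bit fields of a UDP header with Python's struct module. The concrete port, length, and checksum values are invented for the example; real captures would supply these bytes from a pcap file.

```python
import struct

def parse_udp_header(segment: bytes) -> dict:
    """Parse the four 16-bit fields of a UDP header (source port,
    destination port, length, checksum), all in network byte order."""
    src_port, dst_port, length, checksum = struct.unpack("!HHHH", segment[:8])
    return {"src_port": src_port, "dst_port": dst_port,
            "length": length, "checksum": checksum}

# Hypothetical header of a DNS query: ephemeral source port, destination
# port 53, total length 33 bytes, and a made-up checksum value.
header = struct.pack("!HHHH", 53124, 53, 33, 0x1C46)
fields = parse_udp_header(header)
print(fields["dst_port"])  # 53
```

Parsing TCP or IP headers follows the same pattern, only with more fields and variable-length option blocks.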
B. Flow-based data
Flow-based network data is a more condensed format that mainly contains meta-information about network connections. Flow-based data aggregates all packets that share certain attributes within a time window into one flow and usually does not contain any payload. The default five-tuple, i.e., source IP address, source port, destination IP address, destination port, and transport protocol [37], is a widely used attribute set for matching packets into flows. Flows can appear in unidirectional or bidirectional format. The unidirectional format aggregates all packets from host A to host B that share the above attributes into one flow, while all packets from host B to host A are aggregated into another unidirectional flow. In contrast, a bidirectional flow summarizes all packets between hosts A and B, regardless of their direction.
Typical flow-based formats are NetFlow [38], IPFIX [37], sFlow [39], and OpenFlow [40]. Table I summarizes typical attributes of flow-based network traffic. Depending on the specific flow format and flow exporter, additional attributes can be extracted, such as bytes per second, bytes per packet, the TCP flags of the first packet, or even the computed entropy of the payload.
Furthermore, tools like nfdump or YAF can be used to convert packet-based data into flow-based data (but not vice versa). Readers interested in the differences between flow exporters can find more details in [41], which analyzes how different flow exporters affect botnet classification.
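The five-tuple aggregation described above can be made concrete with a short sketch. The packet records and attribute names below are simplified assumptions for illustration; real flow exporters additionally apply active and inactive timeouts to split long connections into separate flows.

```python
from collections import defaultdict

# Simplified packet records: (src_ip, src_port, dst_ip, dst_port, proto, size)
packets = [
    ("192.168.0.2", 40001, "10.0.0.5", 80, "TCP", 60),
    ("192.168.0.2", 40001, "10.0.0.5", 80, "TCP", 1500),
    ("10.0.0.5", 80, "192.168.0.2", 40001, "TCP", 1500),
]

def aggregate_unidirectional(packets):
    """Group packets into unidirectional flows keyed by the five-tuple,
    accumulating packet and byte counts per flow."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for src_ip, src_port, dst_ip, dst_port, proto, size in packets:
        key = (src_ip, src_port, dst_ip, dst_port, proto)
        flows[key]["packets"] += 1
        flows[key]["bytes"] += size
    return dict(flows)

flows = aggregate_unidirectional(packets)
print(len(flows))  # 2: one flow per direction
```

A bidirectional exporter would instead normalize the key (e.g., sort the two endpoints) so that both directions of a connection fall into the same flow.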
C. Other data
This category includes all data sets that are neither purely packet-based nor flow-based. An example is a flow-based data set that has been enriched with additional information from packet-based data or host-based log files. The KDD Cup 1999 [42] data set is a well-known representative of this category. Each data point has network-based attributes, such as the number of transmitted source bytes or TCP flags, but also host-based attributes, such as the number of failed logins. Consequently, each data set in this category has its own attribute set. Since each data set would have to be analyzed individually, we make no general statements about the available attributes.
IV. Data Set Attributes
To compare different intrusion detection data sets and to help researchers find data sets that suit their specific evaluation scenario, it is necessary to define common attributes as a basis for evaluation. Therefore, we study typical data set attributes used in the literature to evaluate intrusion detection data sets. FAIR [43] defines four principles that scholarly data should follow, namely findability, accessibility, interoperability, and reusability. In line with this general concept, this work uses more detailed data set attributes to provide a critical comparison of network-based intrusion detection data sets. Generally, different data sets emphasize different attributes. For example, the UGR'16 data set [29] emphasizes a long recording time to capture periodic effects, while the ISCX data set [28] emphasizes precise labeling. Since our goal is to examine the more general attributes of network-based intrusion detection data sets, we try to unify and generalize the attributes used in the literature rather than adopting all of them. For instance, some works evaluate the presence of specific attack types, such as DoS (denial of service) or browser injection. The presence of certain attack types may be a relevant attribute for detection methods aimed at those specific attack types, but it is meaningless for other methods. Therefore, we use the general attribute attack traffic to describe the presence of malicious network traffic (see Table III). Section V provides more details on the different attack types within the data sets and discusses further specific attributes.
We do not assign evaluation scores like Haider et al. [19] or Sharafaldin et al. [30], since we do not want to judge the importance of the different data set attributes. We believe that the importance of certain attributes depends on the specific evaluation scenario and should not be judged universally in a survey. Instead, readers should be able to find the data sets that suit their needs. To support a systematic search, we divide the data set attributes discussed below into five categories. Figure 2 summarizes all data set attributes and their value ranges.
A. General description
The following four attributes reflect general information about a data set, namely the year of creation, its availability, and the presence of normal and malicious network traffic.
1) Year of creation: Since network traffic is subject to concept drift and new attack scenarios appear every day, the age of an intrusion detection data set plays an important role. This attribute describes the year of creation. The year in which the underlying network traffic was captured is more relevant to a data set's recency than its year of publication.
2) Public availability: Intrusion detection data sets should be publicly available in order to serve as a basis for comparing different intrusion detection methods. Furthermore, the quality of a data set can only be checked by third parties if it is publicly available. Table III uses three different values for this attribute: yes, o.r. (on request), and no. On request means that access is granted after sending a message to the authors or responsible persons.
3) Normal user behavior: This attribute indicates the availability of normal user behavior within a data set and takes the value yes or no. The value yes indicates that normal user behavior is present in the data set, but it makes no statement about the presence of attacks. Generally, the quality of an intrusion detection system is primarily determined by its attack detection rate and false alarm rate. Therefore, the presence of normal user behavior is indispensable for evaluating an IDS. However, the absence of normal user behavior does not make a data set unusable; it rather indicates that the data set must be merged with other data sets or with real-world network traffic. Such a merging step is usually called overlaying or salting [44], [45].
4) Attack traffic: IDS data sets should contain various attack scenarios. This attribute indicates the presence of malicious network traffic within a data set and is set to yes if the data set contains at least one attack. Table IV provides additional information about the specific attack types.
B. Nature of data
The attributes of this category describe the format of the data set and the existence of meta-information.
1) Metadata: It is difficult for third parties to make content-related interpretations of packet-based and flow-based network traffic. Therefore, data sets should come with additional information about the network structure, IP addresses, attack scenarios, etc., in the form of metadata. This attribute indicates whether such additional metadata is available.
2) Format: Network intrusion detection data sets appear in different formats. We roughly divide them into three formats (see Section III). (1) Packet-based network traffic (e.g., pcap) contains network traffic including payloads. (2) Flow-based network traffic (e.g., NetFlow) contains only meta-information about network connections. (3) Other types of data sets may contain, for example, flow-based traces with additional attributes from packet-based data or even from host-based log files.
3) Anonymity: For privacy reasons, intrusion detection data sets are often not published at all or only in anonymized form. This attribute indicates whether the data is anonymized and which attributes are affected. The value none in Table III indicates that no anonymization was performed. The value yes (IPs) indicates that IP addresses have been anonymized or removed from the data set. Likewise, the value yes (payload) indicates that payload information has been anonymized or removed from packet-based network traffic.
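One common way to anonymize IP addresses is to replace each address with a stable pseudonym. The following sketch assumes a simple keyed-hash scheme with a locally kept secret; this is only an illustration, not the method used by any particular data set, and unlike prefix-preserving schemes (e.g., Crypto-PAn) it destroys subnet structure.

```python
import hashlib
import hmac

# Assumption for the sketch: a secret key held only by the data set creator.
SECRET_KEY = b"replace-with-a-random-key"

def pseudonymize_ip(ip: str) -> str:
    """Map an IP address to a stable pseudonym via a keyed hash.
    The same address always yields the same pseudonym, so traffic
    patterns remain analyzable while real addresses stay hidden."""
    digest = hmac.new(SECRET_KEY, ip.encode(), hashlib.sha256).hexdigest()
    return "anon-" + digest[:12]

a = pseudonymize_ip("192.168.0.2")
b = pseudonymize_ip("192.168.0.2")
c = pseudonymize_ip("192.168.0.3")
assert a == b and a != c  # stable, and distinct hosts stay distinct
```

Because the mapping is consistent, flows belonging to the same host can still be correlated after anonymization, which is exactly what many published data sets need.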
C. Data volume
The attributes in this category describe the dataset in terms of capacity and duration.
1) Count: The attribute count describes the size of a data set, either as the number of contained packets/flows/points or as its physical size (e.g., in GB).
2) Duration: Data sets should cover network traffic over a long period of time in order to capture periodic effects (e.g., day vs. night or weekdays vs. weekends) [29]. The attribute duration provides the recording time of each data set.
D. Recording environment
The attributes in this category describe the network environment and conditions under which the dataset was captured.
1) Traffic type: This attribute describes three possible origins of network traffic: real, emulated, or synthetic. Real means that real network traffic was captured in a productive network environment. Emulated means that real network traffic was captured in a test bed or emulated network environment. Synthetic means that the network traffic was created synthetically (e.g., by a traffic generator) rather than captured by a real (or virtual) network device.
2) Network type: The network environment of SMEs is essentially different from that of Internet service providers (ISP). Therefore, different environments need different security systems, and the evaluation data set should adapt to the specific environment. This property describes the underlying network environment in which the corresponding data set was created.
3) Complete network: This attribute follows the approach of Sharafaldin et al. [30] and indicates whether the data set contains complete network traffic from a network environment with multiple hosts, routers, etc. The value is set to no if the data set contains only network traffic from a single host (e.g., a honeypot) or only traffic of certain protocols (e.g., exclusively SSH traffic).
E. Evaluation
The following attributes relate to the evaluation of intrusion detection methods with network-based data sets. More precisely, these attributes cover the availability of predefined subsets, the balance of the data sets, and the presence of labels.
1) Predefined splits: Sometimes it is difficult to compare the quality of different IDS even when they are evaluated on the same data set. In that case, it must be clear whether the same subsets were used for training and evaluation. This attribute indicates whether a data set comes with predefined subsets for training and evaluation.
2) Balance: Anomaly-based intrusion detection often applies machine learning and data mining methods. During the training phase of such methods (e.g., a decision tree classifier), data sets should be balanced with respect to their class labels. Accordingly, a data set should contain the same number of data points from each class (normal and attack). Real-world network traffic, however, is imbalanced and contains much more normal user behavior than attack traffic. This attribute indicates whether a data set is balanced with respect to its class labels. Imbalanced data sets should be balanced by appropriate preprocessing before data mining algorithms are applied. He and Garcia [46] provide a good overview of learning from imbalanced data.
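One simple preprocessing step of the kind mentioned above is random undersampling of the majority class. The sketch below assumes a binary-labeled data set with classes "normal" and "attack"; the record layout and function name are invented for the example, and more elaborate techniques (e.g., SMOTE-style oversampling) are surveyed in [46].

```python
import random

def undersample(dataset, label_key="label", seed=42):
    """Balance a binary-labeled data set by randomly undersampling
    the majority class down to the size of the minority class."""
    rng = random.Random(seed)
    normal = [d for d in dataset if d[label_key] == "normal"]
    attack = [d for d in dataset if d[label_key] == "attack"]
    n = min(len(normal), len(attack))
    balanced = rng.sample(normal, n) + rng.sample(attack, n)
    rng.shuffle(balanced)
    return balanced

# 90:10 class ratio, as is typical for real-world traffic.
data = [{"label": "normal"}] * 90 + [{"label": "attack"}] * 10
balanced = undersample(data)
print(len(balanced))  # 20
```

Note that undersampling discards data; whether that is acceptable depends on the size of the data set and the evaluation scenario.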
3) Labeled: Labeled data sets are indispensable for training supervised methods as well as for evaluating supervised and unsupervised intrusion detection methods. This attribute indicates whether a data set is labeled. It is set to yes if there are at least the two classes normal and attack. Possible values for this attribute are: yes, yes with BG (background), yes (IDS), indirect, and no. Yes with BG means that the labels contain a third class background in addition to normal and attack; packets, flows, or data points of the class background could be either normal or attack. Yes (IDS) means that some kind of intrusion detection system was used to create the labels, so some labels may be wrong since an IDS is not perfect. Indirect means that the data set has no explicit labels, but labels can be created from other log files.
V. Data Sets
In our opinion, the attributes labeled and format are the most decisive ones when searching for suitable network-based data sets. The intrusion detection method (supervised or unsupervised) determines whether labels are required and what kind of data (packets, flows, or other) is needed. Therefore, Table II categorizes all investigated network-based data sets with respect to these two attributes. Table III provides a more detailed overview of the network-based intrusion detection data sets with respect to the data set attributes of Section IV. The presence of specific attack scenarios is an important aspect when searching for network-based data sets. Therefore, Table III indicates the presence of attack traffic, while Table IV provides details about the specific attacks within the data sets. The papers accompanying the data sets describe attacks at different levels of abstraction. For example, Vasudevan et al. describe the attack traffic in their data set (SSENET-2011) as follows: "Nmap, Nessus, Angry IP Scanner, Port Scanner, Metasploit, Backtrack OS, LOIC, etc. were some of the attack tools used by the participants to launch attacks." In contrast, Ring et al. specify the number and the different types of executed port scans in their CIDDS-002 data set [27]. Consequently, the level of abstraction of the attack descriptions in Table IV may vary. A detailed description of all attack types is beyond the scope of this paper. Instead, we refer interested readers to the open access paper by Anwar et al., "From Intrusion Detection to an Intrusion Response System: Fundamentals, Requirements, and Future Directions". Furthermore, some data sets are modifications or combinations of other data sets. Figure 3 shows the interrelationships between several well-known data sets.
In the following, the network-based data sets are discussed in alphabetical order.
AWID [49]. AWID is a publicly available data set focused on 802.11 networks. Its creators used a small network environment (11 clients) and captured WLAN traffic in packet-based format. Within one hour, 37 million packets were captured. 156 attributes were extracted from each packet. Malicious network traffic was generated by executing specific attacks against the 802.11 network. AWID is labeled and split into a training subset and a test subset.
Booters [50]. Booters are distributed denial of service (DDoS) attacks offered as a service by criminals. Santanna et al. [50] published a data set that includes traces of nine attacks from different Booters, executed against a null-routed IP address within their network environment. The resulting data set is recorded in packet-based format and contains more than 250 GB of network traffic. The individual packets are not labeled, but the different Booters attacks are stored in separate files. The data set is publicly available, but the names of the Booters are anonymized for privacy reasons.
Botnet [5]. The Botnet data set is a combination of existing data sets and is publicly available. Its creators used the overlay methodology of [44] to combine (parts of) the ISOT [57], ISCX 2012 [28], and CTU-13 [3] data sets. The resulting data set contains various botnets as well as normal user behavior. The Botnet data set is divided into a 5.3 GB training subset and an 8.5 GB test subset, both in packet-based format.
CIC DoS [51]. CIC DoS is a data set from the Canadian Institute for Cybersecurity and is publicly available. The authors' intention was to create an intrusion detection data set with application-layer DoS attacks. Therefore, they executed eight different DoS attacks on the application layer. The resulting traces were combined with attack-free traffic from the ISCX 2012 [28] data set to generate normal user behavior. The resulting data set is in packet-based format and contains 24 hours of network traffic.
CICIDS 2017 [22]. CICIDS 2017 was created over a period of five days in an emulated environment and contains network traffic in packet-based and bidirectional flow-based format. For each flow, the authors extracted more than 80 attributes and provide additional metadata about IP addresses and attacks. Normal user behavior was executed via scripts. The data set contains a wide range of attack types, such as SSH brute force, heartbleed, botnet, DoS, DDoS, web, and infiltration attacks. CICIDS 2017 is publicly available.
CIDDS-001 [21]. The CIDDS-001 data set was captured in 2017 within an emulated small business environment and contains four weeks of unidirectional flow-based network traffic along with a detailed technical report and additional information. A special feature of this data set is that it includes an external server which was attacked on the Internet. In contrast to honeypots, this server was also regularly used by the clients of the emulated environment. Both normal and malicious user behavior were executed via publicly available Python scripts on GitHub. These scripts allow the continuous generation of new data sets and can be used by other research groups. The CIDDS-001 data set is publicly available and contains SSH brute force, DoS, and port scan attacks as well as several attacks captured from the wild.
CIDDS-002 [27]. CIDDS-002 is a port scan data set created with the scripts of CIDDS-001. The data set contains two weeks of unidirectional flow-based network traffic within an emulated small business environment. CIDDS-002 contains normal user behavior as well as a wide range of different port scan attacks. A technical report provides additional meta-information about the data set, in which external IP addresses are anonymized. The data set is publicly available.