Abstract

Distributed Denial of Service (DDoS) attacks bring revenue loss, productivity loss, reputation damage, theft, etc. to large banking and business firms. This creates the need for good prevention and detection techniques. This paper aims to provide a better solution to these problems using feature analysis. The statistical characteristics, or parameters, of the incoming packets are absolute time interval, absolute session count, absolute session interval, absolute page access count, absolute bandwidth consumption, and absolute ratio of packet count. The incoming packets are classified as normal or attack by deploying the K-Means, J48, and Naïve Bayes classifier algorithms, using the normal and attack profiles from previously available datasets. The information gain algorithm is used to decrease false positive and false negative errors and to increase the efficiency of detection by reducing the number of parameters. Performance increases, with more consistency, after the application of information gain. The detection efficiencies before and after the application of information gain are 98% and 99.5% respectively. In this paper, CAIDA datasets are used for feature selection and classification.

Introduction

A DoS attack is an intentional attempt by malicious users to completely disrupt or degrade the availability of services or resources to legitimate users. DoS attacks are of two types. One is the single-source attack. Such attacks are easily countered by several defense mechanisms, since the source of the attack can simply be blocked. The other is the multiple-source attack (DDoS), in which multiple systems are used to perform the attack.

A distributed denial of service (DDoS) attack, which makes a server respond slowly to clients or even refuse their access, is one of the major threats that will continue in the future. A DDoS attack is an attempt to make an online service unavailable by overwhelming it with traffic from multiple sources. Such attacks target a wide variety of important resources, from banks to news websites, and present a major challenge to making sure people can publish and access important information.

Most common DDoS attacks use a layered structure, as shown in Figure 1, in which the attackers use a client program to connect to the handlers. The handlers are compromised systems that give commands to the bots, or zombie agents, to perform a DDoS attack. These bots or zombie agents are compromised by the attackers through the handlers. The attackers compromise the systems using mechanisms such as Trojans or other malware. In the attack, the attackers give the command to the handlers, the handlers command the bots, and the bots flood the victim with tremendous amounts of traffic, consuming all of the victim's resources.

Fig 1: An illustration of DDoS Attack

Types of DDoS Attacks

There are several types of DDoS attacks. Some common ones (both past and present) are:

UDP Flood: User Datagram Protocol (UDP) is a sessionless networking protocol. In a UDP flood, random ports on the target machine are flooded with packets, which causes it to check for an application listening on each port and report back with an ICMP packet.

SYN Flood: The attacker sends repeated spoofed requests from a variety of sources to the target server.


The server responds to each such request with a SYN-ACK packet and waits for the final ACK that would complete the TCP connection, but that ACK never arrives, so the half-open connection is held until it times out. Eventually, with a strong attack, the host's resources will be exhausted and the server will go offline.

HTTP Flood: In an HTTP flood DDoS attack, the attacker exploits seemingly legitimate GET and POST requests to attack a web server or application. HTTP floods do not use malformed packets, spoofing, or reflection techniques, and require less bandwidth than other attacks to bring down the targeted site or server. The attack is most effective when it forces the server or application to allocate the maximum resources possible in response to each single request.

Ping Flood Attack: This is one of the simplest attacks, in which the attacker floods the victim's computer with ICMP Echo Request (ping) packets. For each ping packet from the attacker, the victim replies with an Echo Reply packet. Thus it consumes both the outgoing and incoming bandwidth. This attack is more effective when the attacker has more bandwidth than the victim.

Reflected Attack: In this attack, an attacker creates forged packets that are sent out to as many computers as possible. When those computers receive the packets they reply, but the reply goes to a spoofed address that routes to the target. All the computers respond at once, and this causes the site to be bogged down with requests until the server resources are exhausted.

Peer-to-Peer Attacks: In this type of attack, a peer-to-peer server provides an opportunity for attackers. Instead of using a botnet to siphon traffic towards the target, a peer-to-peer server is exploited to route traffic to the target website. People using the file-sharing hub are then sent to the target website instead, until the website is overwhelmed and goes offline.

Slowloris: This type of DDoS attack can be difficult to mitigate. It is a tool that allows an attacker to use fewer resources during an attack. During the attack, connections to the target machine are opened with partial requests and kept open for as long as possible. The tool sends HTTP headers at certain intervals but never completes the requests, keeping more connections open for longer until the target website can no longer stay online.

On the basis of protocol, DDoS attacks can be further classified as network/transport-level and application-level DDoS attacks.

Network/transport-level DDoS attack: In network-layer DDoS attacks, attackers send a large number of bogus packets (packets with bogus payload and invalid SYN and ACK numbers) toward the victim server, and attackers normally use IP spoofing.

In network-layer DDoS attacks, the victim server or IDS can easily distinguish legitimate packets from DDoS packets. The transport layer is especially vulnerable to Denial of Service (DoS) and Distributed Denial of Service (DDoS) attacks. The two most popular protocols used in the transport layer are TCP (Transmission Control Protocol) and UDP (User Datagram Protocol). At this level, mostly TCP, UDP, ICMP, and DNS protocol packets are used to launch the attacks.

Application-layer DDoS attack: These attacks generally consume less bandwidth and are stealthier in nature than volumetric attacks. However, they can have a similar impact on service, as they target specific characteristics of well-known applications such as HTTP, DNS, VoIP, or Simple Mail Transfer Protocol (SMTP). These attacks focus on disrupting legitimate users' services by exhausting resources. An application-level DDoS attack overloads an application server, for example by making excessive login, database lookup, or search requests. Application attacks are harder to detect than other kinds of DDoS attacks. Since the connections are already established, the requests may appear to be from legitimate users.

Request-Flooding Attacks: High rates of seemingly legitimate application requests (such as HTTP GETs, DNS queries, and SIP INVITEs) deluge web servers to degrade and disrupt their normal functioning.

Asymmetric Attacks: "High-workload" requests that take a heavy toll on server resources such as CPU, memory, or disk space.

Repeated Single Attacks: An isolated "high-workload" request sent across many TCP sessions, a stealthier way to combine asymmetric and request-flooding layer-seven DDoS attacks.

Application-Exploit Attacks: The attack vectors here are vulnerabilities in applications, for instance hidden-field manipulation, buffer overflows, scripting vulnerabilities, cross-site scripting, cookie poisoning, and SQL injection.

Related Work

Distributed Denial of Service (DDoS) attacks have become a common threat to online businesses. With over 50,000 distinct attacks per week, DDoS attacks have become a highly visible and costly form of cyber-crime. Detection mechanisms are classified into statistical-based and heuristic-based detection algorithms.

A statistical-based detection system (SBDS) determines what normal traffic/packet data looks like and then generalizes the scope of normal. Traffic/packets that fall outside that scope are treated as an attack (or as anomalous). So, to improve accuracy, an SBDS needs to keep learning the traffic patterns for as long as it is active on the network.
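As a rough illustration of the statistical idea, the sketch below learns a "scope of normal" from traffic known to be legitimate and flags anything that falls too far outside it. The z-score, the threshold of 3 standard deviations, the function names, and the packets-per-second samples are all assumptions made for this example; the systems discussed here may use quite different statistics.

```python
import statistics


def fit_normal_profile(normal_rates):
    """Learn the mean and standard deviation of traffic known to be normal."""
    mean = statistics.fmean(normal_rates)
    stdev = statistics.pstdev(normal_rates)
    return mean, stdev


def anomaly_score(rate, mean, stdev):
    """Distance of an observation from the normal mean, in standard deviations."""
    if stdev == 0:
        return 0.0 if rate == mean else float("inf")
    return abs(rate - mean) / stdev


def is_attack(rate, mean, stdev, threshold=3.0):
    """Treat an observation as anomalous when its score exceeds the threshold."""
    return anomaly_score(rate, mean, stdev) > threshold


# Hypothetical packets-per-second samples taken during normal operation.
normal = [100, 110, 95, 105, 90, 100]
mean, stdev = fit_normal_profile(normal)

print(is_attack(102, mean, stdev))  # close to the learned profile
print(is_attack(900, mean, stdev))  # far outside the learned profile
```

Re-fitting the profile periodically on fresh normal traffic is one simple way to keep such a detector aligned with a network whose baseline drifts.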

Network traffic/packet information is processed with machine learning algorithms, which differentiate the attack traffic/packets from the normal patterns of the established network. All traffic is assigned an anomaly score for the specific event, and if the score is higher than the threshold, the detection system takes further action on the attack traffic/packets.

A heuristic-based detection system (HBDS) applies logic derived from statistical analysis of the network traffic to its threshold decisions. An HBDS requires fine tuning to adapt to network traffic and minimize false negatives and false positives.

The quality of DDoS detection with an SBDS depends on its false negative and false positive errors. The detection can separate normal traffic from traffic that is more likely to be an attack. However, some botnets, e.g. Mydoom, can bypass these detection approaches on the way to the victim. This is because the approaches consider only the transport layer and/or the network layer. Therefore, botnets that generate HTTP packets similar to legitimate ones can avoid detection. A further weakness of these approaches is their inability to handle legitimate traffic mixed with attack traffic.

Ahmed et al. use change point analysis of the packet arrival rate of new source IP addresses. The method is based on the non-parametric CUSUM technique.

Salem et al. proposed a solution for early detection of flooding attacks in backbone networks.

Bhattacharyya and Kalita presented a new framework for detecting anomalies by employing a least mean square (LMS) filter and Pearson chi-square divergence on random aggregations of flows in a 2D sketch data structure. The method can also detect low-rate attacks, apart from high-rate flooding attacks, with high detection accuracy and false alarm rate.

Tang et al. proposed an efficient online detection scheme for Session Initiation Protocol (SIP) flooding attacks that can detect both high-rate and low-rate flooding attacks.

Xie and Yu created a DDoS detection scheme that monitors web flash crowd traffic in order to reveal dynamic shifts in normal burst traffic, which might signal the onset of application-layer DDoS attacks during the flash crowd event.

Zargar et al. analyzed the scope of DDoS flooding attacks and categorized the attacks and the available countermeasures based on where and when these methods could prevent, detect, and respond to DDoS flooding attacks.

Exploration of the machine learning features considered to train and test the model

Metrics need to be explored as an alternative to packet patterns. From a detailed exploration of the constraints observed in the existing contemporary models described in the related work, it is evident that, in a distributed environment, a diversified packet flow is easy to achieve within minimal time frames and session times. The arrival rate produced by human users, including those behind a proxy server, appears to constitute the non-pattern (random) case. Hence, to address this constraint, this manuscript devises a novel set of metrics that are derived from the absolute time interval rather than from session time and packet patterns.

Absolute time interval: This denotes the absolute time taken by the set of sessions initiated within a given threshold time frame. This feature is considered significant, as an HTTP flood is the cumulative result of multiple sessions and a diversified packet flow. The remaining features are defined with respect to the absolute time interval.

Absolute session count: This feature represents the average number of sessions found in a defined absolute time interval. It is considered because the load on any target web server is estimated by the number of sessions in a given time interval.

Absolute session interval: This feature represents the average time taken by each session in a defined absolute time interval. This feature is critical, as the session time indicates the time spent by a source on the target web server with an intention of fair use or of attack.

Absolute page access count: This feature represents the average number of requests in a defined absolute time interval. It is also critical among the considered features, since the page access count together with the absolute session interval improves detection of the load on the target web server.

Ratio of packet count: This feature represents the average number of packets from the divergent sources that initiate sessions in a defined absolute time interval. The ratio of packet count is used along with the session intervals to detect the load on the server.

Ratio of request between intervals: This feature represents the average number of requests, between time intervals, from the sources that initiate sessions in an absolute time interval. This feature is critical, as it reflects the ratio of session time spent by a source on the web server.

Absolute bandwidth consumption: This feature represents the average bandwidth consumed by the requests found in a defined absolute time interval. It is also considered significant, since the estimation of bandwidth consumption is critical in load assessment.

The record structure is given below:

Absolute time interval, Absolute session count, Absolute session interval, Absolute page access count, Ratio of packet count, Ratio of request between intervals, Absolute bandwidth consumption

The absolute time interval and the other features are estimated as follows:

The sessions initiated within a given time frame threshold are grouped, and for each group the time spent to complete all the sessions in that group is taken as the time interval of that group. The sum of the average of these time intervals and the root mean square distance of the respective session groups is then taken as the absolute time interval.

The number of sessions rendered in each absolute time interval is taken as the absolute session count of that interval.

The sum of the average session completion time within a given absolute time interval and its root mean square distance is the absolute session completion time of that interval.

Similarly, the average page access time for a given absolute time interval and its root mean square distance are aggregated, giving the absolute page access time of that interval.

Further, the total number of pages rendered in a given absolute time interval is taken as the absolute page access count of that interval.

The ratio of eminent sources to the total number of divergent sources found in a given absolute time interval is taken as the eminent source diversity ratio of that interval.

The total bandwidth consumed by the requests found in a given absolute time interval is the absolute bandwidth consumption of that interval.

The dataset preparation

This section explores the dataset preprocessing used to train the devised model.

The labeled transactions given for the training phase are partitioned into flood and normal transaction sets, TF and TN. These partitioned sets are then used to extract the features considered for the training phase. The absolute time intervals are defined for the corresponding datasets TF and TN. The features are then extracted from TF and TN into two record sets, one per transaction set, which are used in the further discussion. Each record of the respective sets represents an absolute time interval and the respective values of the other dependent features.
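The feature estimation described above can be sketched in a few lines, under several assumptions that are not stated in the text. "Root mean square distance" is read here as the root mean square deviation from the average. Sessions are hypothetical dictionaries with `start`, `end`, `pages`, and `bytes` fields. The resulting absolute time interval is used as the window length for the per-interval features. Only four of the features are shown.

```python
import math
from collections import defaultdict


def mean_plus_rms(values):
    """Average of the values plus their root mean square distance from it."""
    mean = sum(values) / len(values)
    rms = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean + rms


def absolute_time_interval(sessions, frame):
    """Group sessions by the time frame they start in (frame is in seconds),
    take each group's completion time, and combine the group times."""
    groups = defaultdict(list)
    for s in sessions:
        groups[int(s["start"] // frame)].append(s)
    intervals = [
        max(s["end"] for s in g) - min(s["start"] for s in g)
        for g in groups.values()
    ]
    return mean_plus_rms(intervals)


def window_features(sessions, ati):
    """Build one record per window of length ati (assumed window semantics)."""
    if ati <= 0:
        raise ValueError("absolute time interval must be positive")
    windows = defaultdict(list)
    for s in sessions:
        windows[int(s["start"] // ati)].append(s)
    records = []
    for key in sorted(windows):
        w = windows[key]
        records.append({
            "session_count": len(w),
            "session_interval": mean_plus_rms([s["end"] - s["start"] for s in w]),
            "page_access_count": sum(s["pages"] for s in w),
            "bandwidth": sum(s["bytes"] for s in w),
        })
    return records


# Hypothetical sessions, times in seconds.
sessions = [
    {"start": 0, "end": 4, "pages": 3, "bytes": 300},
    {"start": 2, "end": 8, "pages": 5, "bytes": 500},
    {"start": 10, "end": 12, "pages": 1, "bytes": 100},
    {"start": 13, "end": 16, "pages": 2, "bytes": 200},
]
ati = absolute_time_interval(sessions, frame=10)
records = window_features(sessions, ati)
print(ati, records)
```

Running the same two functions separately over TF and TN would give the two record sets used for training.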

The record structure is as follows:

Absolute time interval, Absolute session count, Absolute session interval, Absolute page access count, Ratio of packet count, Ratio of request between intervals, Absolute bandwidth consumption

The values are listed column by column:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54

60 51 58 52 42 58 46 53 46 55 70 67 59 59 68 67 63 70 67 54 70 66 65 63 54 57 60 58 59 56

2.84 2.59 3.04 3.16 3.12 3.08 3.28 2.76 2.59 2.84 2.31 1.78 2.88 3 2.72 1.54 2.23 2.15 2.84 1.58 1.66 2.55 3.24 1.7 1.74 2.67 2.84 3.04 2.88 3.12

1020 663 928 780 588 754 552 795 690 605 1190 1474 1121 1239 1428 1206 1197 1470 1407 1026 1470 1452 1105 1197 1134 1026 1020 928 1003 1008

35.71 24.75 58.97 67.77 4.42 29.84 14.88 76.31 20.89 38.36 98.45 78.19 73.13 21.4 35.91 36.3 73.2 50.95 65.04 43.57 83.28 85.39 34.71 43.57 17.83 55.91 35.75 58.97 78.89 31.99

These attributes are referred to as a set in the rest of the article. Each record has 7 attributes, which is the size of this set.

Further, these record sets, formed for the respective transaction sets TF and TN, are used to train the bio-inspired strategy called Jaccard search.

Optimal Feature Selection using Jaccard Similarity:

Attributes that have similar values in both record sets are usually not suited to assessing whether requests are prone to a flood attack or not. Moreover, the values obtained for these attributes are dynamic and vary with the training dataset given. Hence it is necessary to identify the optimal attributes for flood and normal transactions. This section explores the optimal feature selection for a given flood and normal transaction dataset, which is as follows.

The similarity between the values taken by an attribute in the two datasets is estimated using the Jaccard index. For each attribute:

Extract all values observed for the attribute in the respective record sets, as two sets.
Remove duplicates from each set.
Find the distance of the flood value set towards the normal value set using the Jaccard index.
Similarly, find the distance of the normal value set towards the flood value set using the Jaccard index.

The Jaccard index is the ratio of the number of common elements in both sets (the intersection of the two sets) to the number of unique elements across both sets (their union), which is the similarity score under the Jaccard index.
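A minimal sketch of this comparison follows. It computes the Jaccard similarity of the two value sets for each attribute, turns it into a distance by subtracting it from the maximum score of 1, and keeps the attributes whose distance reaches a threshold. The direction of the threshold test (`>=`), the record layout (one dictionary per record), and the names `jaccard_distance` and `select_features` are assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 minus |A intersect B| / |A union B|; two empty sets count as identical."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def select_features(flood_records, normal_records, threshold):
    """Keep attributes whose flood and normal value sets differ enough.

    Records are dicts mapping attribute name to value (hypothetical schema).
    """
    selected = []
    for attr in flood_records[0]:
        flood_values = {r[attr] for r in flood_records}    # sets drop duplicates
        normal_values = {r[attr] for r in normal_records}
        if jaccard_distance(flood_values, normal_values) >= threshold:
            selected.append(attr)
    return selected


# Hypothetical records: "flag" takes the same values in both sets,
# so it carries no information for telling flood from normal traffic.
flood = [{"session_count": 70, "flag": 1}, {"session_count": 67, "flag": 0}]
normal = [{"session_count": 51, "flag": 1}, {"session_count": 67, "flag": 0}]

print(select_features(flood, normal, threshold=0.5))
```

Because the Jaccard index is symmetric, the distance from the flood set to the normal set equals the distance in the other direction, so a single computation per attribute is enough in this sketch. For real-valued attributes, exact matches between the two sets are rare, so values would likely need rounding or binning first.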

This value is subtracted from the maximum similarity score to obtain the distance under the Jaccard index. The attributes whose distance satisfies the distance threshold are then considered the optimal attributes of the respective record sets. These optimal attributes are referred to as a set for each of the two record sets in what follows.

Absolute session count, Absolute session interval, Absolute page access count, Absolute bandwidth consumption, Ratio of packet count

The values are listed column by column:

60 51 58 52 42 58 46 53 46 55 70 67 59 59 68 67 63 70 67 54 70 66 65 63 54 57 60 58 59 56

2.84 2.59 3.04 3.16 3.12 3.08 3.28 2.76 2.59 2.84 2.31 1.78 2.88 3 2.72 1.54 2.23 2.15 2.84 1.58 1.66 2.55 3.24 1.7 1.74 2.67 2.84 3.04 2.88 3.12

1020 663 928 780 588 754 552 795 690 605 1190 1474 1121 1239 1428 1206 1197 1470 1407 1026 1470 1452 1105 1197 1134 1026 1020 928 1003 1008

35.71 24.75 58.97 67.77 4.42 29.84 14.88 76.31 20.89 38.36 98.45 78.19 73.13 21.4 35.91 36.3 73.2 50.95 65.04 43.57 83.28 85.39 34.71 43.57 17.83 55.91 35.75 58.97 78.89 31.99

Cluster the data using any algorithm:

Absolute session count, Absolute session interval, Absolute page access count, Absolute bandwidth consumption, Ratio of packet count

Classification:

Any of the K-Means, J48, and Naïve Bayes algorithms can be used as the classifier algorithm. Naïve Bayes is a machine learning approach that uses the probabilities of all the attributes to make a prediction. This approach makes a strong assumption, namely that all the attributes are independent of one another. This assumption not only keeps the prediction accurate but also makes it faster. The dataset is stored as a CSV file (spreadsheet format). The data is loaded from this file into the program and randomly split into training and test datasets.

Fig 2: Block diagram of Proposed System.

Training phase
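As a minimal sketch of the loading, random split, and training steps, the code below implements a Gaussian Naïve Bayes classifier using only the standard library. The Gaussian variant, the CSV layout (a header row, numeric feature columns, label in the last column), the 70/30 split, and all function names are assumptions; the text does not specify them, and J48 and K-Means are not shown.

```python
import csv
import math
import random
from collections import defaultdict


def load_csv(path):
    """Read rows as (features, label); assumes a header row and label last."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        return [([float(x) for x in row[:-1]], row[-1]) for row in reader]


def train_test_split(rows, test_ratio=0.3, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]


def train(rows):
    """Estimate a prior plus a per-attribute mean and variance for each class."""
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append(features)
    model = {}
    for label, items in by_label.items():
        stats = []
        for column in zip(*items):
            mean = sum(column) / len(column)
            var = sum((x - mean) ** 2 for x in column) / len(column)
            stats.append((mean, var + 1e-9))  # small term avoids zero variance
        model[label] = (len(items) / len(rows), stats)
    return model


def predict(model, features):
    """Pick the class with the highest log posterior, treating attributes
    as independent given the class (the Naive Bayes assumption)."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for x, (mean, var) in zip(features, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled records: (session count, session interval), label.
rows = [
    ([60, 2.8], "normal"), ([58, 3.0], "normal"), ([52, 3.1], "normal"),
    ([70, 1.6], "attack"), ([67, 1.8], "attack"), ([68, 1.5], "attack"),
]
model = train(rows)
print(predict(model, [69, 1.7]), predict(model, [55, 3.0]))
```

In practice the rows would come from `load_csv`, the model would be trained on the first part returned by `train_test_split`, and accuracy would be measured on the second part.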