How OVH protects its customers against SYN flood attacks
Each day, OVH detects and mitigates over 1,500 attacks against its customers’ servers. About one third of these attacks are of the "SYN flood" type. This particular technique, which aims to cause a denial of service (DoS), exploits the characteristics of the TCP protocol. TCP, the Transmission Control Protocol, accounts for over 90% of the traffic received by OVH and is one of the foundations on which the Internet is built: it is the protocol used to transmit e-mails, web pages, etc. The defensive measures against these attacks are not new. However, OVH has adopted a more original approach, fully developed in-house, through an implementation on an FPGA combined with a Linux kernel patch. This implementation is the subject of the present article. But first, we will dissect the way a SYN flood attack is carried out.
SYN flood: a well-known attack
This type of attack emerged in the 1990s. Today, it is normal to deal with attacks of this kind reaching 60 million packets per second, which is slightly less than 40 Gbps. Don't be fooled by the fact that 40 Gbps seems modest compared to the 1 Tbps attack we had to counter at OVH last September. In this type of attack, what counts is not the bit rate, but the number of packets that need to be processed. It's just like in "The Lord of the Rings": what counts, during the battle, is not the size of the Orcs, but how many of them you are facing. Each Orc needs to be taken out individually. In a computer attack, the principle is the same: each packet must be individually inspected and filtered.
In the timeline of computing history, SYN floods are a fairly old technique, and the mitigation techniques are therefore well known and mastered. The most famous and most effective of these countermeasures is based on the use of "SYN cookies". The technique was first presented in 1996 by Daniel J. Bernstein, an expert in cryptography, best known for his involvement in the development of new security standards. As part of our new OVH IP Load Balancing offer, we have implemented this mitigation technique on an FPGA, coupled with a Linux kernel specially adapted to take over at the final ACK, if it is legitimate. But before detailing this technique, let's have a look at what a TCP connection entails.
Anatomy of a TCP connection
TCP is a stream protocol that works in connected mode. That's pretty much the sum of it.
To be more precise, saying that TCP is a stream protocol means that the data will be delivered to the receiver in the same order it was sent by the sender. Since packets can get jumbled or dropped while crossing the network, the receiver uses sequence numbers to put the packets back in order and to acknowledge their receipt to the sender. And in order to prevent a malicious client from injecting packets into an arbitrary TCP stream, the sender and receiver agree on their initial sequence numbers during the negotiation phase of the connection.
In practice, a TCP connection is established via the exchange of three packets between the client and the server:
The client sends a packet requesting SYNchronization of its sequence number: this is the SYN. The server ACK-cepts the sequence number, and thus the connection request, and SYNchronizes its own sequence number: this is the SYN-ACK. Finally, the client ACK-cepts the server's sequence number: this is the final ACK. The connection is now established, and the client and server can communicate.
In other words, it is somewhat like a standard phone call:
1. The phone rings, the "server" picks up: this is the SYN
2. The "server" answers "Hello?": This is the SYN-ACK
3. The "client" replies "Hello!": This is the final ACK
But sometimes the phone rings, you pick up, and no one answers. So you say "Hello?" once, twice, three times before giving up and hanging up. The TCP protocol follows exactly the same logic. Once it has "picked up", that is to say, received the SYN, the server stores it in a queue of all connections in the process of being established. The server then sends a first SYN-ACK. If it does not receive an ACK in reply within the next second, it retries and waits 2 seconds, then 4, and so on, until it receives a reply or reaches the predefined maximum number of attempts. As soon as the server receives the final ACK, the connection can be accepted by the application. The connection is then taken out of the waiting queue, freeing up a spot for a new connection waiting to be established.
In Linux, the maximum number of SYN-ACK retransmissions can be controlled through the net.ipv4.tcp_synack_retries parameter. By default, this parameter is set to 5, meaning 6 attempts followed by waits of 1, 2, 4, 8, 16 and 32 seconds. Unanswered requests are thus aborted after 63 seconds of trying to establish a connection. It is possible to monitor "half-open" connections with tools such as netstat. They will be listed with a status of "SYN-RECV". If you see a very large number of requests listed with this status, it means that you either have an overloaded server or ... you are under attack.
In this example, I manually forged a SYN packet using an address and a source port that do not belong to me, just like an attacker would do. The connection being established remains stuck in the half-open "SYN-RECV" status.
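The retry arithmetic above can be checked with a few lines of Python. This is a simplified model, assuming a 1-second initial timeout that doubles after every attempt; the function name synack_schedule is made up for this illustration and does not exist in any kernel or library.

```python
def synack_schedule(retries: int = 5, initial_timeout: float = 1.0):
    """Model the waits that follow each SYN-ACK transmission.

    `retries` plays the role of net.ipv4.tcp_synack_retries: the first
    SYN-ACK plus `retries` retransmissions gives retries + 1 attempts.
    The wait after each attempt doubles (assumed simplification).
    """
    waits = [initial_timeout * 2 ** i for i in range(retries + 1)]
    return waits, sum(waits)


if __name__ == "__main__":
    for retries in (5, 2):
        waits, total = synack_schedule(retries)
        print(f"retries={retries}: waits={waits}, abandoned after {total:.0f} s")
```

With the default of 5 the model gives 63 seconds, and with 2 retransmissions it gives 7 seconds.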
Vulnerabilities and attacks
To stick with the phone call analogy, when you pick up the receiver and say "Hello?", sometimes nobody answers, even after several attempts. This can happen when you are called by a prankster or an unscrupulous telemarketing call center. But while you were waiting to get a reply from the caller, your phone line was busy. If you were expecting an important call, you would not have been able to receive it. This is what you call a denial of service.
The same applies to a server. The operating system maintains a server queue for all connections being established. This queue can hold a limited number of connections waiting for the final ACK. The queue is called a "SYN backlog". If the queue is clogged with pending connections, all subsequent incoming connection requests will not be accepted during the waiting time (63 seconds by default on Linux) and service will be denied during this time span. In Linux, the size of the backlog can be controlled with the net.ipv4.tcp_max_syn_backlog parameter. By default, this parameter is set to 256, and is automatically adjusted depending on the total RAM available on the actual machine.
Whatever the size of the queue, it is possible to clog it. All you have to do is send as many SYNs as the queue can hold within the duration of the retry period, and keep doing so. For instance, since a queue may contain 256 pending connections (the Linux default value, without adjustment), it can theoretically be filled up simply by sending slightly more than 4 packets per second (PPS)!
Even worse, since the malicious client does not need to inspect the SYN-ACK response, it can forge a random source address and source port and thus evade statistical detection measures.
Add to this that the mere receipt of a packet on a server consumes RAM and CPU resources. Even if the packet cannot be queued, it will have consumed temporary resources and impacted the performance of the machine. If these resources are sufficient, the next bottleneck is the network capacity. This is the realm of the DDoS, the distributed denial of service, but we will not dwell on it as it is outside the scope of this post. Still, the concept is the same. We need to protect our customers at the gate of the network. A lone server does not stand a chance.
There are a number of mitigation strategies and they are usually combined for greater efficiency. For instance, as a first line of defense, the net.ipv4.tcp_synack_retries parameter can be decreased. Since the waiting time grows exponentially with this value, each decrement roughly halves the maximum waiting time for an ACK, but also increases the risk of losing legitimate connections. Depending on the capacities of the machine, net.ipv4.tcp_max_syn_backlog can also be increased.
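As a back-of-the-envelope check of that figure, the sketch below divides the backlog size by the time an unanswered entry occupies a slot. The helper name min_fill_rate is hypothetical, and the model ignores any recycling of old entries by the operating system.

```python
def min_fill_rate(backlog_size: int = 256, hold_seconds: float = 63.0) -> float:
    """SYN rate (packets per second) needed to keep a backlog full.

    Each forged SYN occupies one slot for `hold_seconds`, so the attacker
    only has to replace `backlog_size` entries over that period.
    """
    return backlog_size / hold_seconds


if __name__ == "__main__":
    print(f"default backlog, 63 s hold: {min_fill_rate():.2f} PPS")
    print(f"default backlog,  7 s hold: {min_fill_rate(256, 7.0):.2f} PPS")
```

Shortening the hold time or enlarging the queue only raises the required rate linearly, which is why neither is a sufficient defense on its own.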
As a second line of defense, the operating system can decide to recycle older connection requests in the queue so as to be able to accept new ones, without having to wait for all SYN-ACK retransmissions to be sent. Linux tries to reserve half of the queue entries for connection "embryos". A "young" connection is one that only has up to 2 retransmissions, meaning 7 seconds retry rather than the standard 63.
Finally, in the third line of defense, you have the most powerful weapon: the SYN-cookies. They are the main topic of this article.
A SYN flood attack has been successful if the connection queue has overflowed. You can enlarge the queue, but not indefinitely because resources are limited and you want to keep them for legitimate clients. The SYN-cookie method follows a radically different approach: if the aim of the attack is to saturate the queue, then let's just delete the queue! It's a bit like when you have appendicitis: if the appendix is inflamed, the surgeons will simply remove it.
The principle is simple. When the server receives a SYN, it computes a cryptographic signature based on the source address, the destination address, the source port, the destination port, the client’s initial sequence number, a value derived from the MSS (Maximum Segment Size), if available, and a secret value that changes periodically. The implementation may vary from one operating system to another, but the idea remains the same: to encode, using a secret, a signature of the 4-tuple defining the connection and of the client’s initial sequence number. The cryptographic signature is then used as the initial sequence number of the server in the SYN-ACK. If the client is legitimate, it will respond to the SYN-ACK with an ACK number equal to this value plus 1. The server will then recalculate the signature. If the result of the calculation is consistent, then the connection is accepted, just as if it had been registered in the queue. In other words, the network itself acts as the queue.
The SYN-cookie method allows you to dispense with the intermediate queue. It would therefore be tempting to permanently activate this method and deactivate the queue for ongoing connection attempts. While it may look perfect, this technique actually has three drawbacks. The first is intuitive: removing the queue means that the SYN-ACK can no longer be retransmitted if it is lost in a network failure. This is quite a minor inconvenience. If the SYN-ACK is lost, the client will just retransmit the SYN. A legitimate client will do it anyway, and as far as the malicious client is concerned, don't worry: massive retransmissions are its specialty...
The second drawback is similar to the first one. It can happen that the SYN-ACK has reached its proper destination but the final ACK has gotten lost somewhere along the line in the maze of the Internet. In this case, the problem is far more serious. Normally, if the ACK is not received, the queue will re-trigger the sending of the SYN-ACK. The client detects a retransmission and in turn triggers a retransmission of the final ACK. But, in our case, this will never happen. Two things can then happen. Either the client is first to speak: this is the case for HTTP and SSL connections, where the application dialogue is initiated by the client. The server, not having received the ACK, has no connection and will thus respond with an RST. Or the client waits for the server to send a "banner" before sending its request. This is the case, for instance, with SSH, IMAP and SMTP connections. In this case, you will have to wait for the client to detect a timeout before resetting the connection, if it has been programmed in this way. But that is not always the case. When doing low-level development, it is always risky to prejudge the quality of the implementations of the network applications…
The third and last drawback is that some TCP options are only sent in the SYN. And some of these options are particularly important in terms of performance. For instance, the MSS (Maximum Segment Size) is the equivalent of the MTU at TCP level. It avoids the fragmentation of the IP packet. This option is so critical that it is encoded in the signature of the SYN cookie so that it can be recovered from the ACK. The SACK or "Selective ACK" is used to indicate exactly what has been received (or not) in order to limit the amount of data that will have to be retransmitted in case of losses. The WScale or "Window scale" allows you to raise the maximum size of the TCP window from 64 KB to 1 GB. This is especially important for connections that have a high latency × bandwidth product. The ECN or "Explicit Congestion Notification" allows the receiver to notify the sender proactively that congestion has occurred on the network, thus allowing the sender to adapt its rate before any loss occurs. This is not an actual option but simply a part of the TCP header. In the SYN, it indicates that the sender is able to signal congestion. Otherwise, it simply indicates congestion. This feature is not widely supported but can be important where it is needed.
These options (MSS, WScale, SACK Permitted) are only sent in the SYN. They are normally stored in the connection queue and are thus lost when there is no queue. Ideally the client could resend them. Since Linux 2.6.26, these options are encoded in the least significant bits of the timestamp, somewhat in the manner of steganography. The timestamp is a little-known extension of TCP, yet it is widely used by operating systems. It allows them, among other things, to accurately measure the RTT (round-trip time), meaning the time between sending a packet and receiving a response, just like an ICMP ping. This option works with two fields: the timestamp of the sender and a copy of the last timestamp received. In practice, if the server encodes the options in its own timestamp, the client will return them to it in the ACK. The server can then decode and recover them. This remedies one of the major limitations of SYN cookies. And it does not require specific support on the client side.
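To make the principle concrete, here is a minimal Python sketch of a SYN cookie. It is not the algorithm used by Linux or by our FPGA: the use of HMAC-SHA256, the 5/3/24-bit layout (time counter, MSS index, truncated signature), the MSS_TABLE values and the function names make_cookie and check_cookie are all assumptions chosen for illustration.

```python
import hashlib
import hmac
import ipaddress
import struct

MASK32 = 0xFFFFFFFF
MSS_TABLE = (536, 1300, 1440, 1460)  # illustrative values (assumption)


def _mac24(secret, flow, client_isn, counter, mss_idx):
    """24-bit keyed signature of the 4-tuple, client ISN, counter and MSS index."""
    src_ip, src_port, dst_ip, dst_port = flow
    msg = struct.pack(
        "!4sH4sHIIB",
        ipaddress.IPv4Address(src_ip).packed, src_port,
        ipaddress.IPv4Address(dst_ip).packed, dst_port,
        client_isn & MASK32, counter & MASK32, mss_idx,
    )
    return int.from_bytes(hmac.new(secret, msg, hashlib.sha256).digest()[:3], "big")


def make_cookie(secret, flow, client_isn, client_mss, counter):
    """Build the server's initial sequence number for the SYN-ACK."""
    # Largest table entry that does not exceed the MSS announced by the client.
    mss_idx = max((i for i, m in enumerate(MSS_TABLE) if m <= client_mss), default=0)
    mac = _mac24(secret, flow, client_isn, counter, mss_idx)
    return ((counter & 0x1F) << 27) | (mss_idx << 24) | mac


def check_cookie(secret, flow, ack_number, client_seq, counter_now, max_age=2):
    """Validate the final ACK. Returns the recovered MSS, or None if invalid."""
    cookie = (ack_number - 1) & MASK32      # the client acknowledges cookie + 1
    client_isn = (client_seq - 1) & MASK32  # the final ACK carries client ISN + 1
    mss_idx = (cookie >> 24) & 0x7
    if mss_idx >= len(MSS_TABLE):
        return None
    for age in range(max_age):              # accept the current and previous period
        counter = counter_now - age
        if counter < 0 or (counter & 0x1F) != cookie >> 27:
            continue
        if _mac24(secret, flow, client_isn, counter, mss_idx) == cookie & 0xFFFFFF:
            return MSS_TABLE[mss_idx]
    return None


if __name__ == "__main__":
    secret = b"change-me-periodically"
    flow = ("198.51.100.7", 40000, "203.0.113.10", 443)
    cookie = make_cookie(secret, flow, client_isn=12345, client_mss=1460, counter=100)
    print(check_cookie(secret, flow, cookie + 1, 12346, counter_now=100))
```

Note that nothing is stored between make_cookie and check_cookie: everything needed to rebuild the connection comes back in the final ACK.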
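The timestamp trick can be sketched as follows. The bit layout used here (4 bits of window scale, one SACK bit, one ECN bit) and the names encode_ts and decode_ts are assumptions for illustration; the exact layout depends on the kernel version.

```python
MASK32 = 0xFFFFFFFF
OPT_BITS = 6                      # assumed number of low bits reserved for options
OPT_MASK = (1 << OPT_BITS) - 1
WSCALE_MASK = 0x0F                # 0xF is used here to mean "no window scaling"
SACK_BIT = 1 << 4
ECN_BIT = 1 << 5


def encode_ts(clock, wscale, sack, ecn):
    """Hide the SYN options in the low bits of the server's timestamp value."""
    opts = (0xF if wscale is None else wscale) & WSCALE_MASK
    if sack:
        opts |= SACK_BIT
    if ecn:
        opts |= ECN_BIT
    ts = (clock & ~OPT_MASK & MASK32) | opts
    if ts > clock:                # never announce a timestamp from the future
        ts = (ts - (1 << OPT_BITS)) & MASK32
    return ts


def decode_ts(tsecr):
    """Recover the options from the timestamp echoed back in the final ACK."""
    wscale = tsecr & WSCALE_MASK
    return (None if wscale == 0xF else wscale,
            bool(tsecr & SACK_BIT),
            bool(tsecr & ECN_BIT))


if __name__ == "__main__":
    ts = encode_ts(clock=1_000_000, wscale=7, sack=True, ecn=False)
    print(ts, decode_ts(ts))
```

The cost is a small loss of timestamp precision, since the low bits no longer carry time information.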
Because of these limitations, in Linux, the SYN cookie mechanism will only be activated during a queue overflow if the net.ipv4.tcp_syncookies parameter is set to 1.
More anecdotally, the trick of encoding the main options in the timestamp was integrated into Linux at a time when SYN cookies were considered potentially obsolete. One reason behind this was the loss of these options. It was also suggested that computers had become powerful enough to handle such attacks without requiring a particular defense mechanism. But several developers stepped up and refuted this suggestion, offering benchmarks as proof. One of these developers is Willy Tarreau, author of HAProxy, on which the new IP Load Balancing offered by OVH is based. If you want more information, I encourage you to read the following excellent article.
As demonstrated by Willy Tarreau, the problem is not solved without SYN cookies. And even with SYN cookies, the computational load increased from 60% to 70% in the tests he ran. Indeed, although it is no longer possible to get the connection queue to overflow, it is still possible to overload the CPU and render the machine completely unavailable by flooding it with SYN packets, because of the processing time required for each packet.
In Linux, the "SYNPROXY" iptables target, available since kernel version 3.12, optimizes the handling of SYN cookies by processing them much earlier in the packet path, before the TCP stack. This takes load off the CPU and allows it to better bear the burden. This implementation uses a slightly different strategy. Instead of rebuilding the connection when the ACK arrives, it regenerates the SYN packet that should have been received by the TCP stack and ensures that the sequence numbers are consistent. This approach is possible because the iptables target lives in the same Linux kernel and therefore has access to the same primitives.
When we designed the infrastructure of the new IP Load Balancing offer, we paid particular attention to the various types of TCP/IP attacks, including attacks like SYN Flood, that unfortunately are all too common.
On a machine equipped with a 24-core Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz, without HyperThreading, 256GB of RAM and a Mellanox ConnectX-4 100Gbit/s card, synchronized at 40Gbit/s (for better stability), we started to lose connections at 4M PPS (2 Gbps) and the machine became unreachable over SSH from 7M PPS onwards. These tests were performed with the iptables SYNPROXY target.
7M PPS before losing a machine may seem a high value. But it is, in fact, entirely insufficient. We have already faced attacks spewing several tens of millions of packets per second, and protected our customers from them. While the technique is a good one per se, a single machine cannot withstand such a sustained attack.
Mitigation through the IP Load Balancer, featuring "Armor"
To be able to withstand all attacks, even the most violent, each Load Balancing IP zone has its own instance of "Armor". Armor is the Anti-DDoS that has been fully developed in-house by OVH. It sits at the core of the new generation of VAC. Most Armor mitigation strategies are implemented in software, in a highly optimized way. But, in order to be able to withstand a load of several hundred Gbps/millions of packets per second, the mitigation strategies for the most massive attacks are implemented directly on an FPGA. The FPGA or "field-programmable gate array" is a reconfigurable electronic chip. The principle is simple: configure any digital electronic circuit on the chip without having to solder transistors by hand. This allows us to create our own chip, dedicated to mitigating attacks, with a significantly higher performance than if we did it only in software.
An FPGA consists of different basic elements that can be assembled in order to design specialized logic circuits. For instance, in an Intel Arria 10 GT 1150 model, there are about 2 million registers that serve as variables, 1 million lookup tables to compose the logical operators, 1,500 specialized digital signal processing (DSP) blocks for mathematical calculations, 2,000 RAM blocks totaling 50 Mb as well as input/output elements and a very large number of "wires" to connect all of these together and build the circuit.
What makes these chips so interesting is that each component runs independently. An FPGA can typically operate at 200 MHz, which is low compared to a processor. But since all elements work in parallel, it is possible, for instance, to have 1 million basic calculations, 2,000 reads and 2,000 writes within the internal memory at every clock tick. This means 2×10^14 (200,000 billion) operations per second and 4×10^11 (400 billion) writes and reads in the internal memory per second.
For DDoS mitigation, the gain is achieved at several levels. On the one hand, the chip is organized like a very long pipeline wherein each packet received is processed step by step, like on an assembly line in a factory. There are hundreds of steps to analyze the packet and make a decision: delete it, modify it, or let it through. Each stage of the pipeline operates in parallel with the others. In addition, the chip also provides direct access to network interfaces (40 or 100 Gbps) as well as QDR-II SRAM memory modules that are specialized in low-latency random access. These memories are used to hold the data structures that must be accessed for each packet: search trees, hash tables containing white lists or counters. SRAM is the type of memory that is used to build the registers and caches of microprocessors.
The drawback of this level of performance is that FPGA development time is, for now, longer than on CPU. This is due to the fact that development is done at a very low level and there are very few reusable open-source libraries, even though we are seeing improvements on these two aspects lately. This is the reason why the FPGA is used to counter massive attacks that are relatively simple to detect. This means it is possible to withstand extremely high packet rates even with minimum-size packets.
In the case of SYN flood mitigation, the FPGA calculates the cookies in a totally parallel manner, which ensures the processing of 60 million packets per second by the card used upstream of the IP Load Balancing service. This rate is the maximum that a 40G link can receive, so there is no risk of losing packets as long as the attack does not saturate the link.
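The figure of roughly 60 million packets per second follows from Ethernet framing. The sketch below assumes minimum-size 64-byte frames plus the 20 bytes of preamble and inter-frame gap that each frame occupies on the wire; the helper name max_pps is made up for this example.

```python
ETH_MIN_FRAME = 64   # bytes, smallest Ethernet frame including the FCS
ETH_OVERHEAD = 20    # bytes on the wire per frame: preamble + SFD (8) + inter-frame gap (12)


def max_pps(link_bps: float, frame_bytes: int = ETH_MIN_FRAME) -> float:
    """Maximum number of frames per second that a link can carry."""
    return link_bps / ((frame_bytes + ETH_OVERHEAD) * 8)


if __name__ == "__main__":
    print(f"40G link:  {max_pps(40e9) / 1e6:.1f} Mpps")
    print(f"100G link: {max_pps(100e9) / 1e6:.1f} Mpps")
```

A 40G link therefore tops out at about 59.5 million minimum-size packets per second, which is the rate the card has to sustain.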
In the final stages of the pipeline, once the FPGA has validated the cookie sent by the client, it marks the connection as valid in the QDR memory, marks the packet as being valid, and lets it through. This is the first packet of the connection to reach an IP Load Balancing server. When this packet arrives, the kernel attempts to validate the cookie using its own secret. Since each server has its own secret, the validation fails and the server sends back an RST packet to reject the connection. The client must retry its connection attempt in order for it to work. This runs counter to our aim of making mitigation transparent for visitors.
One solution would be to synchronize the secret across all machines and the FPGA. However, in doing so, the secret would be exposed outside the kernel, which would have the effect of increasing the attack surface, when our objective is to reduce it. Moreover, as we have learned from experience, synchronizing machines is always more complicated than it seems at first. The dozens of articles published at each introduction of a "leap second" in time servers are a testimony to that.
Looking more closely, if we use the same algorithm for generating the cookie and the timestamp, it is possible to rely on the standard Linux implementation to rebuild the connection and its options. The only step that would need adaptation would be the validation of the cookie itself. In this case, when the FPGA validates a connection, it inserts a secret mark in the packet that cannot be forged by a malicious client. The kernel detects and validates this secret mark and passes the task over to the standard implementation.
Mission accomplished! Almost...
The FPGA does not know which physical machine will take over the connection once it is validated. For this reason, it cannot "predict" the "proper" initial sequence number in the SYN cookie. Similarly, it cannot predict the "proper" value of the timestamp. Indeed, the timestamp is a 4-byte integer derived from the uptime of the machine, meaning it is specific to each machine. Subsequently, when the server sends a packet to the client, for instance with the response to an HTTP request, the client compares the timestamp value with the one received in the SYN-ACK. If the difference between the two values is higher than two billion, i.e. half of the maximum counter value, the client considers that this is a retransmission of an "old" packet and ignores it. This is the "PAWS" mechanism, or "Protect Against Wrapped Sequences" (https://www.ietf.org/rfc/rfc1323.txt). This protection was designed to handle the case of a wrapping sequence number, when the number goes back through 0. Since this packet and the following ones are ignored, the connection is blocked.
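The comparison made by the client can be sketched with 32-bit modular arithmetic. This is a simplified model of the PAWS test, written for illustration; the names ts_before and paws_reject are hypothetical, and real stacks add further conditions (for instance on idle connections).

```python
MASK32 = 0xFFFFFFFF


def ts_before(a: int, b: int) -> bool:
    """True if 32-bit timestamp `a` is older than `b`, allowing for wrap-around.

    Equivalent to checking that the signed 32-bit difference a - b is negative.
    """
    return ((a - b) & MASK32) > 0x7FFFFFFF


def paws_reject(ts_recent: int, seg_tsval: int) -> bool:
    """Drop a segment whose timestamp is older than the last one accepted."""
    return ts_before(seg_tsval, ts_recent)


if __name__ == "__main__":
    # Normal case: the timestamp moves forward, the segment is accepted.
    print(paws_reject(ts_recent=1000, seg_tsval=1001))
    # Counter wrapped through 0: still accepted.
    print(paws_reject(ts_recent=0xFFFFFFF0, seg_tsval=5))
    # SYN-ACK stamped by one clock, data stamped by a machine with a much
    # smaller uptime-based clock: the segment looks "old" and is rejected.
    print(paws_reject(ts_recent=1_000_000_000, seg_tsval=500_000))
```

The last case is the one that blocks the connection when the timestamp chosen by the FPGA and the clock of the server are unrelated.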
One solution would be to disable the timestamp option. But that would mean sacrificing performance since performance-related options are encoded in precisely this field. Thus, this solution is not feasible.
Another approach, which is currently being deployed, is to compute the offset between the value selected by the FPGA and the current value of the Linux kernel. One might think that this is very intrusive, since it involves recording the offset in each TCP socket and applying the difference to each outgoing packet. That is indeed the case. But the good news is that this code already exists in the kernel since the commit. This commit was added in 2013 by an OpenVZ developer within the CRIU project ("Checkpoint and Restore in Userspace"), in other words, a live migration process. In this context, it is necessary to migrate the state of TCP connections between machines having a different uptime and therefore a different timestamp. This is why we implemented our patch on top of it: the IP Load Balancing service relies on this mechanism to rebuild connections transparently when receiving the final ACK.
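The idea of a per-connection offset can be illustrated as follows. This is a sketch of the principle only, not the kernel code: the class TimestampTranslator and its methods are hypothetical names, and the clocks are modeled as plain 32-bit counters.

```python
MASK32 = 0xFFFFFFFF


class TimestampTranslator:
    """Per-connection offset between the FPGA's timestamp and the local clock."""

    def __init__(self, fpga_tsval: int, kernel_clock: int):
        # Computed once, when the connection is rebuilt from the final ACK.
        self.offset = (fpga_tsval - kernel_clock) & MASK32

    def outgoing_tsval(self, kernel_clock: int) -> int:
        """Timestamp to place in outgoing packets: continues the FPGA's series."""
        return (kernel_clock + self.offset) & MASK32

    def incoming_tsecr(self, tsecr: int) -> int:
        """Convert the timestamp echoed by the client back to the local clock."""
        return (tsecr - self.offset) & MASK32


if __name__ == "__main__":
    t = TimestampTranslator(fpga_tsval=1_000_000_000, kernel_clock=500_000)
    # 250 ticks later, the client sees a timestamp 250 ticks after the SYN-ACK's.
    print(t.outgoing_tsval(500_250))
    print(t.incoming_tsecr(1_000_000_250))
```

From the client's point of view the timestamps increase monotonically from the value sent in the SYN-ACK, so the PAWS check passes.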