Why AWS NLB stickiness is not always sticky

05/10/2021
Posted in AWS Blog
Rutger Beyen


We were recently working on an AWS setup that involved a Network Load Balancer (NLB) with a TCP listener and a requirement for sticky sessions. We were seeing some strange behavior that we couldn’t immediately explain and that might be linked to session stickiness, so we decided to build a small test setup.

The problem

Unlike an ALB, where session stickiness is accomplished with cookies, the NLB uses a built-in 5-tuple hash table to maintain stickiness across backend servers. We access the NLB through its DNS name, which returns the IPs of the two NLB endpoints (one per AZ) in round-robin fashion with a TTL of 60 seconds.
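To make the hashing idea concrete, the following Python sketch picks a target by hashing a connection’s 5-tuple. It is a toy model, not AWS’s implementation: the real hash function and its exact inputs are internal to AWS, so SHA-256 stands in for them, and `FlowKey`, `pick_target`, the addresses and the target names are all hypothetical.

```python
import hashlib
from typing import NamedTuple, Sequence


class FlowKey(NamedTuple):
    """The 5-tuple that identifies a single TCP connection (flow)."""
    protocol: str
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def pick_target(flow: FlowKey, targets: Sequence[str]) -> str:
    """Map a flow to one target with a stable hash; SHA-256 stands in for AWS's hash."""
    key = "|".join(str(part) for part in flow).encode()
    bucket = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return targets[bucket % len(targets)]


# Hypothetical addresses and target names, for illustration only.
targets = ["server-az1", "server-az2"]
flow = FlowKey("TCP", "198.51.100.7", 50123, "203.0.113.10", 443)

# Every packet of this connection carries the same 5-tuple, so it lands on one target.
print(pick_target(flow, targets))
# A new connection gets a new ephemeral source port and is hashed afresh.
print(pick_target(flow._replace(src_port=50124), targets))
```

Note that a 5-tuple hash by itself only pins a single connection, since each new connection uses a new ephemeral source port. Stickiness across connections is a separate mechanism: AWS documents NLB target-group stickiness as source-IP based (`stickiness.type` = `source_ip`).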

We wanted to answer the following question. Suppose our end user resolves the DNS name and picks the IP of the first NLB endpoint to start the connection; the session is then routed to one of the backend servers. After 60 seconds, however, the client could re-issue the DNS query and open its next connection through the other NLB endpoint. How do stickiness and cross-zone load balancing behave in that case? Will the end user’s connection be routed to the initial server again, even if that means crossing AZ boundaries?
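The DNS side of this question can be simulated in a few lines. The sketch below assumes that the resolver returns the two endpoint IPs in strict rotation and that the client caches the answer for exactly the 60-second TTL and connects to the first IP; real resolvers and client libraries may shuffle, reorder or cache differently. `CachingClient` and the IP addresses are hypothetical.

```python
from itertools import cycle

# Hypothetical NLB endpoint IPs, one per AZ.
NLB_ENDPOINT_IPS = ["203.0.113.10", "203.0.113.20"]
DNS_TTL_SECONDS = 60


class CachingClient:
    """Client that caches the NLB's DNS answer for one TTL and connects to the first IP."""

    def __init__(self, endpoint_ips, ttl=DNS_TTL_SECONDS):
        n = len(endpoint_ips)
        # Assumption: each DNS query returns the records rotated by one (strict round-robin).
        self._answers = cycle([endpoint_ips[i:] + endpoint_ips[:i] for i in range(n)])
        self._ttl = ttl
        self._cached = []
        self._expires_at = float("-inf")

    def endpoint_for_connection(self, now):
        if now >= self._expires_at:
            # Cached answer expired: resolve again and restart the TTL clock.
            self._cached = next(self._answers)
            self._expires_at = now + self._ttl
        return self._cached[0]


client = CachingClient(NLB_ENDPOINT_IPS)
for t in (0, 30, 59, 60, 90, 120):
    print(f"t={t:>3}s -> {client.endpoint_for_connection(t)}")
```

Connections at 0, 30 and 59 seconds enter through 203.0.113.10; from 60 seconds on, the client enters through 203.0.113.20, and at 120 seconds it is back on the first endpoint. That change of entry point is exactly the situation the question describes.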

The situation

We started with a classic setup consisting of an NLB with an endpoint in each AZ and a target group with one instance as a target in each AZ.
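The four scenarios below only toggle two settings. As a sketch, assuming boto3’s `elbv2` client and load balancer and target group ARNs you supply, they map onto the load balancer attribute `load_balancing.cross_zone.enabled` and the target group attributes `stickiness.enabled` and `stickiness.type`. `SCENARIOS`, `apply_scenario` and the helper names are hypothetical.

```python
# (cross_zone, stickiness) per test scenario, as listed in this post.
SCENARIOS = {
    1: (False, False),
    2: (True, False),
    3: (False, True),
    4: (True, True),
}


def lb_attributes(cross_zone: bool) -> list:
    """Load balancer attributes toggling cross-zone load balancing."""
    return [{"Key": "load_balancing.cross_zone.enabled", "Value": str(cross_zone).lower()}]


def tg_attributes(stickiness: bool) -> list:
    """Target group attributes; NLB target groups use source-IP stickiness."""
    attrs = [{"Key": "stickiness.enabled", "Value": str(stickiness).lower()}]
    if stickiness:
        attrs.append({"Key": "stickiness.type", "Value": "source_ip"})
    return attrs


def apply_scenario(lb_arn: str, tg_arn: str, scenario: int) -> None:
    """Apply one scenario with boto3 (needs AWS credentials; not called below)."""
    import boto3  # imported lazily so the payload helpers work without boto3

    cross_zone, stickiness = SCENARIOS[scenario]
    elbv2 = boto3.client("elbv2")
    elbv2.modify_load_balancer_attributes(
        LoadBalancerArn=lb_arn, Attributes=lb_attributes(cross_zone))
    elbv2.modify_target_group_attributes(
        TargetGroupArn=tg_arn, Attributes=tg_attributes(stickiness))


for number, (cross_zone, stickiness) in SCENARIOS.items():
    print(number, lb_attributes(cross_zone), tg_attributes(stickiness))
```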

Scenario #1

  • Cross-zone load balancing: Disabled
  • Target group stickiness: Disabled

How does it behave?

  1. Client connects to the IP of the first NLB node: the connection is routed to the server in AZ 1.
  2. Client connects to the IP of the second NLB node: the connection is routed to the server in AZ 2.

Since there is only one healthy target per AZ and cross-zone load balancing is not enabled, this setup results in ‘AZ stickiness’: traffic stays in the AZ where it arrived. The setup relies on DNS to spread client connections evenly across both NLB endpoints, but nothing guarantees that a specific user connection is always directed to the same NLB endpoint, let alone to the same backend server.
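Scenario #1 can be expressed as a small routing model. This is a hypothetical sketch of the observed behavior, not of the NLB internals: with cross-zone load balancing disabled, a node only considers the targets in its own AZ, and the source port modulo the pool size stands in for the real flow hash. `TARGETS_BY_AZ`, `eligible_targets` and `route` are illustrative names.

```python
# Hypothetical test topology: one NLB node and one healthy target per AZ.
TARGETS_BY_AZ = {"az1": ["server-az1"], "az2": ["server-az2"]}


def eligible_targets(entry_az: str, cross_zone: bool) -> list:
    """Targets an NLB node may use: its own AZ only, unless cross-zone is enabled."""
    if cross_zone:
        return [t for targets in TARGETS_BY_AZ.values() for t in targets]
    return list(TARGETS_BY_AZ[entry_az])


def route(entry_az: str, src_port: int, cross_zone: bool = False) -> str:
    """Pick a target for a new connection; a port-based index stands in for the flow hash."""
    pool = eligible_targets(entry_az, cross_zone)
    return pool[src_port % len(pool)]


# Scenario 1: cross-zone disabled, no stickiness.
for port in range(50000, 50004):
    print(port, route("az1", port), route("az2", port))
```

Every row prints `server-az1` for the first node and `server-az2` for the second: with one target per AZ, the entry node alone decides the backend.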

Scenario #2

  • Cross-zone load balancing: Enabled
  • Target group stickiness: Disabled

Now we allow cross-zone load balancing and don’t require any stickiness. This should give us complete randomness, shouldn’t it?

And so it does. We’re now fully randomized and hit every backend server, regardless of the NLB endpoint we use as our ‘point of entry’. Works as expected.
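A comparable toy model for Scenario #2 hashes every new connection over the targets of both AZs. It again assumes SHA-256 as a stand-in for the NLB’s flow hash and uses hypothetical names and addresses; the entry node is part of the hash key because its address is part of the 5-tuple.

```python
import hashlib
from collections import Counter

# Hypothetical test topology: one healthy target per AZ.
TARGETS_BY_AZ = {"az1": ["server-az1"], "az2": ["server-az2"]}
ALL_TARGETS = [t for targets in TARGETS_BY_AZ.values() for t in targets]


def route_cross_zone(entry_az: str, src_ip: str, src_port: int) -> str:
    """Cross-zone on, no stickiness: hash each new flow over every target in every AZ."""
    key = f"{entry_az}|{src_ip}|{src_port}".encode()
    digest = hashlib.sha256(key).digest()
    return ALL_TARGETS[int.from_bytes(digest[:8], "big") % len(ALL_TARGETS)]


# One client opening 1,000 connections, alternating between the two entry nodes.
hits = Counter(
    route_cross_zone("az1" if port % 2 else "az2", "198.51.100.7", port)
    for port in range(50000, 51000)
)
print(hits)
```

Over 1,000 connections from one client, both backends show up in the counts, roughly evenly, which matches the randomness observed above.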

Scenario #3

  • Cross-zone load balancing: Disabled
  • Target group stickiness: Enabled

We’ve now enabled stickiness on the target group and disabled cross-zone load balancing again. Let’s hope our client connection is now sticky to a specific backend server.

  1. Client connects to the IP of the first NLB node: the connection is routed to the server in AZ 1.
  2. Client connects to the IP of the second NLB node: the connection is routed to the server in AZ 2.

OK, wait: we’ve asked our target group to be sticky, but our connection is still balanced over both backend servers? What’s going on?

Because our NLB does not allow cross-zone load balancing, the connection apparently cannot reach the same backend every time. What happens when the connection enters via NLB endpoint 1 but stickiness has decided that it should go to the server in AZ 2? Stickiness fails; the disabled cross-zone load balancing wins…

With only one healthy backend per AZ, this behaves the same as not enabling stickiness at all. We’re pretty sure that with more than one backend per AZ, stickiness is maintained… within that AZ only. Interesting!
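The Scenario #3 result can be reproduced with a model in which every NLB node keeps its own source-IP stickiness table and may only choose from its local targets. That per-node table is an assumption inferred from the observed behavior, not something taken from AWS documentation; `NlbNode` and the target names are hypothetical. The last lines illustrate the untested hypothesis from the paragraph above, with two targets in one AZ.

```python
import hashlib


class NlbNode:
    """Toy NLB node with cross-zone disabled and a private source-IP stickiness table."""

    def __init__(self, local_targets):
        self.local_targets = list(local_targets)  # only targets in this node's AZ
        self.sticky = {}  # source IP -> target; not shared with other nodes (assumption)

    def route(self, src_ip: str, src_port: int) -> str:
        if src_ip not in self.sticky:
            # First connection from this IP via this node: hash it onto a local target.
            digest = hashlib.sha256(f"{src_ip}|{src_port}".encode()).digest()
            self.sticky[src_ip] = self.local_targets[digest[0] % len(self.local_targets)]
        return self.sticky[src_ip]


client_ip = "198.51.100.7"  # hypothetical client
node_az1 = NlbNode(["server-az1"])
node_az2 = NlbNode(["server-az2"])
print(node_az1.route(client_ip, 50000))  # server-az1
print(node_az2.route(client_ip, 50001))  # server-az2: node 2 knows nothing of node 1's entry

# Hypothesis only: with two targets in one AZ, the client stays on one of them.
node_two_targets = NlbNode(["server-az1-a", "server-az1-b"])
print({node_two_targets.route(client_ip, port) for port in range(50000, 50010)})  # one target
```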

Scenario #4

  • Cross-zone load balancing: Enabled
  • Target group stickiness: Enabled

Let’s solve this. We’ve enabled both cross-zone load balancing and target group stickiness. We should hit the same backend server every time now.

And so it does. Only now do we reach true stickiness and hit the same backend server every time, no matter how hard we try by entering via the load balancer node in the other AZ.
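For Scenario #4, the model changes in two ways: every node may choose from all targets, and all nodes consult one shared source-IP stickiness table. The shared table mirrors our interpretation of the result, not a documented AWS internal; `SharedStickiness` and the target names are hypothetical.

```python
import hashlib

# Hypothetical targets; with cross-zone enabled every node may use all of them.
ALL_TARGETS = ["server-az1", "server-az2"]


class SharedStickiness:
    """Toy model: all NLB nodes consult one source-IP stickiness table."""

    def __init__(self, targets):
        self.targets = list(targets)
        self.table = {}  # source IP -> target, shared by every node

    def route(self, entry_az: str, src_ip: str, src_port: int) -> str:
        # entry_az only matters for the very first connection from this client IP.
        if src_ip not in self.table:
            digest = hashlib.sha256(f"{entry_az}|{src_ip}|{src_port}".encode()).digest()
            self.table[src_ip] = self.targets[digest[0] % len(self.targets)]
        return self.table[src_ip]


nlb = SharedStickiness(ALL_TARGETS)
client_ip = "198.51.100.7"  # hypothetical client
seen = {nlb.route(az, client_ip, port) for az in ("az1", "az2") for port in range(50000, 50010)}
print(seen)  # a single target, whichever entry node is used
```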

The conclusion

If you don’t allow cross-zone load balancing, stickiness is only active within AZ boundaries. Because DNS round-robin can direct a client to a different point of entry once the TTL has expired, strict stickiness is not guaranteed.

So if you really need stickiness to a specific backend target, you need to enable cross-zone load balancing (and live with the extra cost of inter-AZ traffic). Only then do the different load balancer nodes share the “client-to-target” stickiness hash table.

 

Kinda logical, though…

 

PS: The NLB idle timeout for TCP connections is 350 seconds. Once the timeout is reached or the session is terminated, the NLB forgets the stickiness: incoming packets are considered a new flow and can be load balanced to a new target.
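The idle-timeout behavior from the PS can be sketched as a stickiness table whose entries expire after 350 seconds without traffic. This is a simplified, hypothetical model (`ExpiringStickyTable` is an illustrative name): it covers only the timeout, not explicit session termination, and the re-hash after expiry may or may not land on the previous target.

```python
import hashlib

IDLE_TIMEOUT_SECONDS = 350  # NLB TCP idle timeout as quoted in this post


class ExpiringStickyTable:
    """Toy source-IP stickiness table whose entries are forgotten after the idle timeout."""

    def __init__(self, targets, idle_timeout=IDLE_TIMEOUT_SECONDS):
        self.targets = list(targets)
        self.idle_timeout = idle_timeout
        self._entries = {}  # source IP -> (target, time of last traffic)

    def is_sticky(self, src_ip: str, now: float) -> bool:
        entry = self._entries.get(src_ip)
        return entry is not None and now - entry[1] < self.idle_timeout

    def route(self, src_ip: str, src_port: int, now: float) -> str:
        if self.is_sticky(src_ip, now):
            target = self._entries[src_ip][0]
        else:
            # Expired or unknown: a new flow, hashed again (it may or may not pick the old target).
            digest = hashlib.sha256(f"{src_ip}|{src_port}".encode()).digest()
            target = self.targets[digest[0] % len(self.targets)]
        self._entries[src_ip] = (target, now)  # traffic resets the idle timer
        return target


table = ExpiringStickyTable(["server-az1", "server-az2"])
table.route("198.51.100.7", 50000, now=0)
print(table.is_sticky("198.51.100.7", now=349))  # True: still within the idle timeout
print(table.is_sticky("198.51.100.7", now=350))  # False: forgotten, next connection is a new flow
```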
