Denial of Service on an LDAP Server

on Monday, April 9, 2018

This came about accidentally and was caused by three mistakes combining into a larger problem. It’s similar to the way flights are forced to land: it’s never one thing that requires a plane to land mid-flight. Statistically, airplanes that fail while flying have seven small things go wrong that, combined, keep the plane from functioning as a whole. So, when two or three small things go wrong while the plane is in the air, the crew generally lands the flight and fixes them.

The three mistakes were:

  • On each login attempt to a website, an LDAP connection was established but never closed.
  • The number of open LDAP connections to the Identity server was limited to around 100.
  • A database lock caused slow page response times after login, which made users think they weren’t logged in and try again, causing a spike in login attempts.

So, the root issue was that on each login attempt to a website, an LDAP connection was established but never closed. This had gone undetected for years because the connection would time out after 2 minutes, so the number of open connections stayed below anyone’s radar. Recently we had noticed that something was leaving connections open, but we didn’t spot the offending line of code until the issue reached the Denial of Service level. The fix was straightforward: close the connection properly after each usage. Y’know, how it’s supposed to be done.
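
For illustration, here’s a minimal sketch of that fix, assuming a Java web app using JNDI for the LDAP bind (the server URL, class, and method names are placeholders, not our actual code). The point is the finally block: every login attempt releases its connection whether the bind succeeds or not.

```java
import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class LdapLogin {

    /** Attempts a simple bind as the user and always releases the connection. */
    public static boolean authenticate(String userDn, String password) {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://identity.example.com:389"); // placeholder host
        env.put(Context.SECURITY_AUTHENTICATION, "simple");
        env.put(Context.SECURITY_PRINCIPAL, userDn);
        env.put(Context.SECURITY_CREDENTIALS, password);

        DirContext ctx = null;
        try {
            ctx = new InitialDirContext(env); // opens the LDAP connection and binds
            return true;                      // bind succeeded, credentials are good
        } catch (NamingException e) {
            return false;                     // bad credentials or server unreachable
        } finally {
            if (ctx != null) {
                try {
                    ctx.close();              // the missing piece: release the connection
                } catch (NamingException ignored) {
                    // nothing useful to do if close itself fails
                }
            }
        }
    }
}
```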

The number of open LDAP connections to the Identity server was limited to around 100. There had been a recent update to the Identity server that provided LDAP services, and one change in that update, which we didn’t know about, lowered the limit on simultaneous open connections to around 100. If connections had been closed properly, there wouldn’t have been an issue. But at this lower limit, even a small number of leaked connections easily pushed the server toward its maximum.

The big change that aggravated the situation was that the website users were logging into changed the amount of data loaded after login. Previously all data was lazy loaded as needed, but a recent update moved the data load to right after login (I might write more about this another day). That data load revealed database locking/contention with other websites/services that were using the same database tables. The contention wasn’t found during load tests in the Test environment because not enough systems were involved to replicate the real-world production demands on the database (if we ever figure out how to do that, there will definitely be a blog post). This new table lock changed the initial Login page response time from 2~3 seconds to 70+ seconds. After about 10~15 seconds, users would feel like something had gone wrong with their login attempt and would try again. This continued for hours until enough people were retrying over and over that all 100 simultaneous LDAP connections were in use and the LDAP server was effectively having its service denied.

Ultimately, a single fix relieved enough pressure on the system to make things work. We changed the offending stored procedures in the database to no longer lock the table (allowing dirty reads for a while). The Login page response times returned to 2~3 seconds immediately, and the number of login attempts fell back to a normal rate.
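
The real change lived inside the stored procedures, but the same idea expressed from the application side looks roughly like this, as a sketch assuming a JDBC client against a SQL Server-style database (the connection string, table, and column names are made up): the query runs at READ UNCOMMITTED, so it neither waits on nor takes the shared table lock, at the cost of possibly reading uncommitted rows.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PostLoginData {

    // Placeholder connection string; the real system called a stored procedure
    // on a shared database.
    private static final String DB_URL =
            "jdbc:sqlserver://db.example.com;databaseName=Portal";

    /** Loads the post-login data without blocking on writers (dirty reads allowed). */
    public static void loadUserData(String userId) throws SQLException {
        try (Connection conn = DriverManager.getConnection(DB_URL)) {
            // READ UNCOMMITTED: don't take or wait for shared locks on the table,
            // accepting that rows from in-flight transactions may be seen.
            conn.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);

            String sql = "SELECT DisplayName, Preferences FROM UserProfile WHERE UserId = ?";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, userId);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // ... map the row onto the page's view model ...
                    }
                }
            }
        }
    }
}
```

Dirty reads were acceptable here only as a stopgap: the data being read was display data, and a slightly stale or uncommitted row was far less harmful than a 70+ second login page.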

This wasn’t a permanent fix, but a temporary measure that bought us time to work on the three points above for a stable, long-term solution.
