Denial of Service on an LDAP Server

on Monday, April 9, 2018

This came about accidentally and was caused by three mistakes combining into a larger problem. It’s similar to the way flights are forced to land: it’s never just one thing that requires a plane to land mid-flight. Statistically, airplanes that fail while flying tend to have around seven small things go wrong that, combined, cause the plane to no longer function as a whole. So, when two or three small things go wrong while the plane is in the air, they generally land the flight and fix them.

The three mistakes were:

  • On each login attempt to a website, an LDAP connection was established but never closed.
  • The number of open LDAP connections to the Identity server was limited to around 100.
  • A database lock caused slow page response times after login, which made users think they weren’t logged in and try again, causing a spike in login attempts.

So, the root issue was that on each login attempt to a website, an LDAP connection was established but never closed. This had gone undetected for years because the connections would time out after 2 minutes and the number of open connections stayed below anyone’s radar. Recently, we had found out that something was leaving connections open, but we didn’t see the offending line of code until this issue got to the Denial of Service level. The fix was straightforward: close the connection properly after each use. Ya’ know. How it’s supposed to be done.
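The pattern for the fix looks roughly like this (a minimal sketch, assuming the System.DirectoryServices.Protocols LDAP client; the server name is a placeholder):

Add-Type -AssemblyName System.DirectoryServices.Protocols

$connection = New-Object System.DirectoryServices.Protocols.LdapConnection("ldap.example.org")
try {
    $connection.Bind()    # authenticate the login attempt
    # ... run whatever searches the login flow needs ...
}
finally {
    # the missing piece: release the connection instead of waiting on
    # the 2 minute server-side timeout to clean it up
    $connection.Dispose()
}

And a quick way to spot this kind of leak from a web server is to count established connections to the standard LDAP ports:

Get-NetTCPConnection -RemotePort 389, 636 -State Established -ErrorAction SilentlyContinue |
    Measure-Object | Select-Object -ExpandProperty Count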

The number of open LDAP connections to the Identity server was limited to around 100. There had been a recent update to the Identity server that provided LDAP services, and one change that went unnoticed was that the number of simultaneous open connections was lowered to around 100. If connections had been closed properly there wouldn’t have been an issue. But, at this lower level, even a small number of lingering open connections easily pushed the server towards its limit.

The big change that aggravated the situation was that the website users were logging into changed the amount of data loaded after login. Previously all data was lazy loaded as needed, but a recent update moved the data load to right after login (I might write more about this another day). The data load revealed that there was some database locking/contention with other websites/services that were also using the same database tables. This contention wasn’t found in the Test environment during load tests because not enough systems were involved to replicate the real world production data demands on the database (if we ever figure out how to do that, there will definitely be a blog post). This new table lock turned an initial Login page response time of 2~3 seconds into 70+ seconds. After about 10~15 seconds, users would feel like something had gone wrong with their login attempt and would try again. This continued for hours until enough people were retrying over and over that all 100 simultaneous LDAP connections were in use and the LDAP server was effectively having its service denied.

Ultimately, a single fix relieved enough pressure on the system to make things work. We changed the offending stored procedures in the database to no longer lock the table (allowing dirty reads for a while). The Login page response times returned to 2~3 seconds immediately, and the number of login attempts fell back to a normal rate.
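To illustrate the idea (this is not the actual procedure change; the connection string, table, and column below are made up), a NOLOCK hint lets a read skip the shared locks that were queuing behind the writer, at the cost of possibly seeing uncommitted data:

$conn = New-Object System.Data.SqlClient.SqlConnection("Server=dbserver;Database=AppDb;Integrated Security=True")
$conn.Open()
try {
    $cmd = $conn.CreateCommand()
    # WITH (NOLOCK) has the same effect as the change made inside the stored procedures:
    # the read no longer waits on the table lock (a dirty read)
    $cmd.CommandText = "SELECT TOP (1) Status FROM dbo.UserLoginData WITH (NOLOCK) WHERE UserId = @uid"
    [void]$cmd.Parameters.AddWithValue("@uid", 42)
    $status = $cmd.ExecuteScalar()
}
finally {
    $conn.Dispose()
}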

This wasn’t a permanent fix, but a temporary solution so we could have time to work on the three points above for a stable long term solution.

Approve/Revoke API Keys in Apigee Through PS

on Monday, April 2, 2018

The Apigee API Gateway grants access using API Keys (in this example), and those keys are provisioned for each application that will use an API. This can be a bit confusing, because when you first sign up to use an API you expect that you’re going to get an API Key. But that’s not the case: it’s your application that gets the key. This grants finer-grained control over which applications have access to what, and gives malicious activity a narrower impact. It does make the REST API Management backend a little confusing, as you always need to specify a developer’s email address when updating an application. Applications are owned by a developer, so you have to know the developer before updating the application.

There’s a little more confusion because Applications don’t have direct access to APIs. Applications are given approval to use API Products, and those API Products grant access to different API Proxies. This layer of abstraction is helpful when you want to make a very customized API Product for an individual customer, and that can happen more often than you might expect.

[image]
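A quick way to see that hierarchy through the Management API (a sketch that assumes the Invoke-ApigeeRest helper defined further down; the developer and app names are made up):

# list the apps a developer owns, then look at the API products attached to one app's key
$apps = Invoke-ApigeeRest -ApiPath "/developers/dev@example.org/apps"
$app  = Invoke-ApigeeRest -ApiPath "/developers/dev@example.org/apps/someapp"
$app.credentials[0].apiProducts | Select-Object apiproduct, status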

So, in order to approve or revoke an API key on an Application you will actually need:

  • The developer’s email address
  • The application name
  • The API product name
  • And, either to approve or revoke the status

This script uses the developer email address and application name to look up the application. If no specific API product name is given, then the Approve or Revoke status will be applied to the Application and all associated API Products. If a specific API product name is given, then only that API Product’s status will be updated.

$global:ApigeeModule = @{}
$global:ApigeeModule.ApiUrl = "https://api.enterprise.apigee.com/v1/organizations/{org-name}"
$global:ApigeeModule.AuthHeader = @{ Authorization = "Basic {base64-encoded-credentials}" }
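# Note: {base64-encoded-credentials} is just a Base64 "username:password" pair for HTTP Basic auth.
# A rough sketch of building it (the account name here is hypothetical):
#   $pair = "admin@example.org:{password}"
#   $global:ApigeeModule.AuthHeader = @{ Authorization = "Basic " + [Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes($pair)) }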

<#
.SYNOPSIS
	Makes a call to the Apigee Management API. It sets the approved/revoked status on
    an individual API Product for an Application. If no individual API Product is specified
    then the status will change for the Application and all associated API Products.

.PARAMETER Email
	The developer email address that owns the Application

.PARAMETER AppName
    The application to update

.PARAMETER ApiProductName
    The individual API product to update (optional).

.PARAMETER Status
    Either 'approved' or 'revoked'

.EXAMPLE
	$appInfo = Set-ApigeeDeveloperAppStatus -Email it@company.org -AppName someapp -ApiProductName calendard_api -Status approved
#>
Function Set-ApigeeDeveloperAppStatus {
[CmdletBinding()]
Param (
    [Parameter(Mandatory = $true)]
    [string] $Email,
    [Parameter(Mandatory = $true)]
    [string] $AppName,
    [Parameter(Mandatory = $false)]
    [string] $ApiProductName = [String]::Empty,
    [Parameter(Mandatory = $true)]
    [ValidateSet("approved","revoked")]
    [string] $Status
)

    $action = "approve"
    if($Status -eq "revoked") { $action = "revoke" }


    # check the app exists (maybe use error handling here?)
    $appPath = "/developers/$Email/apps/$AppName"
    $app = Invoke-ApigeeRest -ApiPath $appPath


    # api key/oauth key are stored in the first credentials
    # (all api products are stored within this credential)
    $creds = $app.credentials[0]

    $keysPath = "$appPath/keys"
    $keyPath = "$keysPath/$($creds.consumerKey)"


    # if no api product is selected, approve or deny the entire developer app
    # (the api products are going to be approved/revoked as well)
    if($ApiProductName -eq [String]::Empty) {

        # it's very rare that the App will have its status changed, but just in case ...
        if($app.status -ne $Status) { 
            $path = $appPath + "?action=$action"
            Write-Verbose "Setting app '$AppName' ($Email) to status '$Status'"
            Invoke-ApigeeRest -ApiPath $path -Method Post -ContentType "application/octet-stream"
        }
    

        # if no api product is selected,
        # then approve or deny the entire set of api products
        foreach($apiProduct in $creds.apiProducts) {
            if($apiProduct.status -ne $Status) {
                $path ="$keyPath/apiproducts/$($apiproduct.apiproduct)?action=$action"
                Write-Verbose "Setting api product '$($apiproduct.apiproduct)' on app '$AppName' ($Email) to status '$Status'"
                Invoke-ApigeeRest -ApiPath $path -Method Post -ContentType "application/octet-stream"
            }
        }

    } else {
    # if an api product name is given, then only update that product

        # check the api product exists
        $apiProduct = $creds.apiProducts | Where-Object apiproduct -eq $ApiProductName
        if(-not $apiProduct) {

            Write-Verbose "Could not find api product '$ApiProductName' for app '$AppName' ($Email)"

        } else {
            
            $path = "$keyPath/apiproducts/$($ApiProductName)?action=$action"
            Write-Verbose "Setting api product '$ApiProductName' on app '$AppName' ($Email) to status '$Status'"
            Invoke-ApigeeRest -ApiPath $path -Method Post -ContentType "application/octet-stream"

        }
    }
    

    $app = Invoke-ApigeeRest -ApiPath $appPath
    return $app
}



<#
.SYNOPSIS
	Makes a call to the Apigee Management API. It adds the authorization header and
	uses the root url for our organization's management api endpoint ($global:ApigeeModule.ApiUrl).
	This returns the body of the response as a string.

.PARAMETER ApiPath
	The sub path that will be added onto $global:ApigeeModule.ApiUrl.

.EXAMPLE
	$developers = Invoke-ApigeeRest -ApiPath "/developers"
#>
Function Invoke-ApigeeRest {
[CmdletBinding()]
Param (
	[Parameter(Mandatory = $true)]
	[string] $ApiPath,
	[ValidateSet("Default","Delete","Get","Head","Merge","Options","Patch","Post","Put","Trace")]
	[string] $Method = "Default",
	[object] $Body = $null,
	[string] $ContentType = "application/json",
    [string] $OutFile = $null
)

	if($ApiPath.StartsWith("/") -eq $false) {
		$ApiPath = "/$ApiPath"
	}
	$Uri = $global:ApigeeModule.ApiUrl + $ApiPath

	if($Body -eq $null) {
		$result = Invoke-RestMethod `
                    -Uri $Uri `
                    -Method $Method `
                    -Headers $global:ApigeeModule.AuthHeader `
                    -ContentType $ContentType `
                    -OutFile $OutFile
	} else {
		$result = Invoke-RestMethod `
                    -Uri $Uri `
                    -Method $Method `
                    -Headers $global:ApigeeModule.AuthHeader `
                    -Body $Body `
                    -ContentType $ContentType `
                    -OutFile $OutFile
	}
	
	return $result
}
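Putting the two functions together, usage looks something like this (the developer, app, and product names are made up):

# revoke a single API product on an app
Set-ApigeeDeveloperAppStatus -Email dev@example.org -AppName someapp -ApiProductName calendar_api -Status revoked -Verbose

# approve the app and every API product attached to its key
Set-ApigeeDeveloperAppStatus -Email dev@example.org -AppName someapp -Status approved -Verbose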

IIS Proxy & App Web Performance Optimizations Pt. 4

on Friday, March 16, 2018

Last time we took the new architecture to its theoretical limit and pushed more of the load toward the database. This time …

What we changed (Using 2 Load Test Suites)

  • Turned on Output Caching on the Proxy Server, which defaults to caching js, css, and images and works really well with really old sites. (See the sketch after this list.)
  • We also lowered the number of users as the Backend Services ramped up to 100%.
  • Forced Test Agents to run in 64-bit mode. This resolved an Out Of Memory exception that we were getting when the Test Agents ran into the 2 GB memory cap of their 32-bit processes.
  • Found a problem with the Test Suite that was allowing all tests to complete without hitting the backend service. (This really affected the number of calls that made it to the Impacted Backend Services.)
  • Added a second Test Suite which also used the same database. The load on this suite wasn’t very high; it just added more real world requests.
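A rough sketch of what turning that caching on looks like from PowerShell (using the WebAdministration module; the site name, extensions, and cache policy are assumptions to adjust for your own proxy):

Import-Module WebAdministration

# add an output-cache profile per static extension on the proxy site
foreach ($ext in ".js", ".css", ".png", ".gif", ".jpg") {
    Add-WebConfigurationProperty -PSPath "IIS:\Sites\Default Web Site" `
        -Filter "system.webServer/caching/profiles" -Name "." `
        -Value @{ extension = $ext; policy = "CacheUntilChange"; kernelCachePolicy = "CacheUntilChange" }
}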

Test Setup

  • Constant Load Pattern
    • 1000 users
    • 7 Test Agents (64-bit mode)
  • Main Proxy
    • 4 vCPU / 8 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000
    • WebFarm Request Timeout: 120 seconds
    • Output Caching (js, css, images)
  • Impacted Web App Server
    • 3 VMs
    • AppPool Queue Length: 50,000
  • Impacted Backend Service Server
    • 8 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled
    • A 2nd Test Suite which Impacts other applications in the environment is also run. (This is done off a different VS 2017 Test Controller)

Test Results

  • Main Proxy
    • CPU: 28% (down 37)
    • Max Concurrent Connections: – (Didn’t Record)
  • Impacted Web App
    • CPU: 56% (down 10)
  • Impacted Backend Service
    • CPU: 100% (up 50)
  • DB
    • CPU: 30% (down 20)
  • VS 2017 Test Suite
    • Total Tests: 95,000 (down 30,000)
    • Tests/Sec: 869 (down 278)

This more “real world” test really highlighted that the impacted systems weren’t going to have a huge impact on the database shared by the other systems that would be using it at the same time.

We had successfully moved the load from the Main Proxy onto the backend services, but not all the way to the database. With some further testing we found that adding CPUs and new VMs to the Impacted Backend Servers had a direct 1:1 relationship with handling more requests. The unfortunate side of that is that we weren’t comfortable with the cost of the CPUs compared to the increased performance.

The real big surprise was the significant CPU utilization decrease that came from turning on Output Caching on the Main Proxy.

And, with that good news, we called it a day.

So, the final architecture looks like this …

[image]

What we learned …

  • SSL Encryption/Decryption can put a significant load on your main proxy/load balancer server. The number of requests processed by that server will directly scale into CPU utilization. You can reduce this load by moving static content to CDNs.
  • Even if your main proxy/load balancer does SSL offloading and requests to the backend services aren’t SSL encrypted, the extra socket connections still have an impact on the server’s CPU utilization. You can lower this impact on both the main proxy and the Impacted Web App servers by using Output Caching for static content (js, css, images).
    • We didn’t have the need to use bundling and we didn’t have the ability to do spriting; but we would strongly encourage anyone to use those if they are an option.
  • Moving backend service requests to an internal proxy doesn’t significantly lower the number of requests through the main proxy. It’s really images that create the greatest number of requests to render a web page (especially with an older Classic ASP site).
  • In Visual Studio, double check that your suite of web tests is doing exactly what you think it is doing. Also, go the extra step and check that the HTTP Status Code returned on each request is the code that you expect. If you expect a 302, check that it’s a 302 instead of considering a 200 to be satisfactory.

IIS Proxy & App Web Performance Optimizations Pt. 3

on Monday, March 12, 2018

We left off last time after serving 3rd party JS and CSS files from https://cdnjs.com/ CDNs and raising the Main Proxy server’s Application Pool Queue Length from 1,000 to 50,000.

We are about to add more CPUs to the Main Proxy and see if that improves throughput.

What we changed (Add CPUs)

  • Doubled the number of CPUs to 4 vCPU / 8 vCore.
    • So far, the number of connections into the proxy directly correlates to CPU utilization/load. Hopefully, by adding more processing power, we can scale up the number of Test Agents and the overall load.

Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 4 Test Agents
  • Main Proxy
    • 4 vCPU / 8 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 2 VMs
  • Impacted Backend Service Server
    • 6 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 65% (down 27)
    • Max Concurrent Connections: 15,000 (down 2,500)
  • Impacted Web App
    • CPU: 87%
  • Impacted Backend Service
    • CPU: 75%
  • VS 2017 Test Suite
    • Total Tests: 87,000 (up 22,000)
    • Tests/Sec: 794 (up 200)

Adding the processing power seemed to help out everything. The extra processors allowed more requests to be processed in parallel, which let requests be passed through and completed quicker, lowering the number of concurrent requests. With the increased throughput, the number of Tests that could be completed increased, which raised the number of Tests/Sec.

Adding more CPUs to the Proxy helps everything in the system move faster. It parallelizes the requests flowing through it and prevents process contention.

So, where does the new bottleneck exist?

Now that the requests are making it to the Impacted Web App, the CPU load has transferred to those servers and their associated Impacted Backend Services. This is a good thing. We’re moving the load further down the stack. Doing that successfully would push the load down to the database (DB), which is currently not under much load at all.

[image]

What we changed (Add more VMs)

  • Added 1 more Impacted Web App Server
  • Added 2 more Impacted Backend Services Servers
    • The goal with these additions was to use parallelization to allow more requests to be processed at once and push the bottleneck toward the database.

Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 4 Test Agents
  • Main Proxy
    • 4 vCPU / 8 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 3 VMs
  • Impacted Backend Service Server
    • 8 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 62% (~ the same)
    • Max Concurrent Connections: 14,000 (down 1,000)
  • Impacted Web App
    • CPU: 60%
  • Impacted Backend Service
    • CPU: 65%
  • VS 2017 Test Suite
    • Total Tests: 95,000 (up 8,000)
    • Tests/Sec: 794 (~ the same)

The extra servers helped get requests through the system faster, so the overall number of Tests that completed increased. This helped push the load a little further down.

The Cloud philosophy of handling more load simultaneously through parallelization works. Obvious, right?

So, in that iteration, there was no bottleneck, and we are hitting numbers similar to what we expect on the day of the event. But what we really need to do is leave ourselves some headroom in case more users show up than we expect. So, let’s add in more Test Agents and see what it can really handle.

What we changed (More Users Than We Expect)

  • Added more Test Agents in order to overload the system.

Test Setup

  • Step Load Pattern
    • 2000 initial users, 200 users every 10 seconds, max 4000 users
    • 7 Test Agents
  • Main Proxy
    • 4 vCPU / 8 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 3 VMs
  • Impacted Backend Service Server
    • 8 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 65% (~ same)
    • Max Concurrent Connections: 18,000 (up 4,000)
  • Impacted Web App
    • CPU: 63%
  • Impacted Backend Service
    • CPU: 54%
  • VS 2017 Test Suite
    • Total Tests: 125,000 (up 30,000)
    • Tests/Sec: 1147 (up 282)

So, the “isolated environment” limit is pretty solid, but we noticed that at these limits the response time on the requests had slowed down at the beginning of the Test iteration.

.asp Page Response Times

[image]

The theory is that with 7 Test Agents, all of which started out with 2,000 initial users and no caches primed, everything made requests for js, css, and images at once, which swamped the Main Proxy and the Impacted Web App servers. Once the caches started being used in the tests, things started to smooth out and stabilize.

From this test we found two errors started occurring on the proxy: 502.3 Gateway Timeout and 503 Service Unavailable. Looking at the IIS logs on the Impacted Web App server, we could see that many requests (both 200 and 500 return status codes) were resolving with a Win32 Status Code of 64.

To resolve the Proxy 502.3 and the Impacted Web App Win32 Status Code 64 problems, we increased the Web Farm Request Timeout to 120 seconds. This isn’t ideal, but from what you can see in the graphic above, the average response time is consistently quick. So, this ensures all users will get a response, even though some may have a severely degraded experience. Chances are, their next request will process quickly.

Happily, the 503 Service Unavailable was not being generated on the Main Proxy server. It was actually being generated on the Impacted Web App servers, which still had their Application Pool Queue Length set to the default 1,000 requests. We increased those to 50,000 and that removed the problem. (A sketch of both settings follows.)
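For reference, here is roughly what both changes look like in PowerShell (the app pool name, web farm name, and the exact ARR configuration path are assumptions to verify against your own servers):

Import-Module WebAdministration

# raise the application pool request queue limit from the default 1,000
Set-ItemProperty -Path "IIS:\AppPools\ImpactedWebApp" -Name queueLength -Value 50000

# raise the ARR / Web Farm Framework request timeout to 120 seconds
Set-WebConfigurationProperty -PSPath "MACHINE/WEBROOT/APPHOST" `
    -Filter "webFarms/webFarm[@name='ImpactedWebAppFarm']/applicationRequestRouting/protocol" `
    -Name "timeout" -Value "00:02:00"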

Next Time …

We’ll add another Test Suite to run alongside it and look into more Caching.

IIS Proxy & App Web Performance Optimizations Pt. 2

on Friday, March 9, 2018

Continuing from where we left off in IIS Proxy & App Web Performance Optimizations Pt. 1, we’re now ready to run some initial tests and get some performance baselines.

The goal of each test iteration is to attempt to load the systems to a point that a bottleneck occurs and then find how to relieve that bottleneck.

Initial Test Setup

  • Step Load Pattern
    • 100 initial users, 20 users every 10 seconds, max 400 users
    • 1 Test Agent

Initial Test Results

There was no data really worth noting on this run, as we found the addition of the second proxy server lowered the overhead on the Main Proxy enough that no systems were a bottleneck at this point. So, we added more Test Agents and re-ran the test with:

Real Baseline Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 3 Test Agents
  • Main Proxy
    • 2 vCPU / 4 vCore
    • 24 GB RAM
    • AppPool Queue Length: 1000 (default)
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 2 VMs
  • Impacted Backend Service Server
    • 6 VMs
  • Classic ASP App
    • No CDNs used for JS, CSS, or images
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Disabled

Real Baseline Test Results

  • Main Proxy
    • CPU: 99%
    • Max Concurrent Connections: 17,000
  • VS 2017 Test Suite
    • Total Tests: 37,000
    • Tests/Sec: 340

In this test we discovered that around 14,000 connections was the limit of the Main Proxy before we started to receive 503 Service Unavailable responses. We didn’t yet understand that there was more to it, but we set about trying to lower the number of connections by lowering the number of requests for js, css, and images. Looking through the IIS logs, we also saw that the majority of requests were for static content, which made it look like information wasn’t being cached between calls. So, we found a setting in VS 2017’s Web Test that allowed us to enable caching. (We also saw a lot of the SocketExceptions mentioned in the previous post, but we didn’t understand what they meant at that time.)

What we changed (CDNs and Browser Caching)

  • We took all of the 3rd party JS and CSS files that we use and referenced them from https://cdnjs.com/ CDNs. In total, there were 4 js files and 1 css file.
    • The reason this hadn’t been done before is that there wasn’t enough time to test the fallback strategies (if the CDN doesn’t serve the js/css, the browser should request the files from our servers). We implemented these fallbacks this time.
  • We updated the VS 2017 Web Test configuration to enable caching. Whenever a new Test scenario is run, the test agent will not have caching enabled in order to replicate a “new user” experience; each subsequent call in the scenario will use cached js, css, and images. (This cut around 50% of the requests made in the baseline test.)
    • The majority of the requests into the Main Proxy were image requests. But, the way the application was written, we couldn’t risk a) moving the images to a CDN or b) spriting the images. (It is a Classic ASP app, so it doesn’t have all the bells and whistles that newer frameworks have.)

Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 3 Test Agents
  • Main Proxy
    • 2 vCPU / 4 vCore
    • 24 GB RAM
    • AppPool Queue Length: 1000 (default)
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 2 VMs
  • Impacted Backend Service Server
    • 6 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 82% (down 17)
    • Max Concurrent Connections: 10,400 (down 6,600)
  • VS 2017 Test Suite
    • Total Tests: 69,000 (up 32,000)
    • Tests/Sec: 631 (up 289, but with 21% failure rate)

Offloading the common third party js and css files really lowered the number of requests into the Main Proxy server (38% lower). And, with that overhead removed, the CPU utilization came down from a pegged 99% to 82%.

Because caching was also enabled, the test suite was able to churn through the follow-up page requests much quicker. That increase in rate nearly doubled the number of Tests/Sec completed.

Move 3rd party static content to CDNs when possible (https://cdnjs.com/ is a great service). When doing so, try to implement fallbacks for when those resources fail to load.

But, we still had high CPU utilization on the Main Proxy. And, we had a pretty high failure rate with lots of 503 Service Unavailable errors and some 502.3 Gateway Timeouts. We determined the cause of the 503s was that the Application Pool’s Queue Length was being hit. We considered this to be the new bottleneck.

What we changed (AppPool Queue Length)

  • We raised the application pool queue length from 1,000 to 50,000. This would allow us to queue up more requests and lower the 503 Service Unavailable error rate.
  • We also had enough headroom in the CPU to add another Test Agent.

Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 4 Test Agents
  • Main Proxy
    • 2 vCPU / 4 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 2 VMs
  • Impacted Backend Service Server
    • 6 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 92% (up 10)
    • Max Concurrent Connections: 17,500 (up 7,100)
  • VS 2017 Test Suite
    • Total Tests: 65,000 (down 4,000)
    • Tests/Sec: 594 (down 37, but with only 3% failure rate)

This helped fix the failure rate issue. Without all the 503s forcing the Tests to end early, it took slightly longer to complete each test, and that caused the number of Tests/Sec to fall a bit. This also meant we had more requests queued up, bringing the number of concurrent connections back up.

For heavily trafficked sites, set your Application Pool Queue Length well above the default 1,000 requests. This is only needed if you don’t have a Network Load Balancer in front of your proxy.

At this point we were very curious what would happen if we added more processors to the Main Proxy. We were also curious what the average response time was for the Classic .asp pages. (NOTE: all the js, css, and image response times are higher than the page result time.)

.asp Page Response Times on Proxy

[image]

Next Time …

We’ll add more CPUs to the proxy and see if we can’t push the bottleneck further down the line.

IIS Proxy & App Web Performance Optimizations Pt. 1

on Monday, March 5, 2018

We’re ramping up towards a day where our web farm fields around 40 times the normal load. It’s not much load compared to truly popular websites, but it’s a lot more than what we normally deal with. It’s somewhere around the order of ~50,000 people trying to use the system in an hour. And, the majority of the users hit the system in the first 15 minutes of the hour.

So, of course, we tried to simulate more than the expected load in our test environment and see what sort of changes we can make to ensure stability and responsiveness.

A quick note: This won’t be very applicable to Azure/Cloud based infrastructure. A lot of this will be done for you on the Cloud.

Web Farm Architecture

These systems run in a private Data Center. So, the servers and software don’t have a lot of the very cool features that the cloud offers.

The servers are all Win 2012 R2, IIS 8.5 with ARR 3.0, URL Rewrite 7.2, and Web Farm Framework 1.1.

Normally, the layout of the systems is similar to this diagram. This gives a general idea that there is a front-end proxy, a number of applications, backend services, and a database which are all involved in this yearly event. And, that a single Web App is significantly hit and its main supporting Backend Service is also significantly hit. The Backend Service is also shared by the other Web Apps involved in the event; but they are not the main clients during that hour.

[image]

Testing Setup

For testing we are using Visual Studio 2017 with a Test Controller and several Agents. It’s a very simple web test suite with a single scenario. This is the main use case during that hour. A user logs in to check their status, and then may take a few actions on other web applications.

Starting Test Load

  • Step Pattern
    • 100 users, 10 user step every 10 seconds, max 400 users
    • 1 Agent

We eventually get to this Test Load

  • Step Pattern
    • 1000 users, 200 user step every 10 seconds, max 2500 users
    • 7 Agents

We found that over 2,500 concurrent users would result in SocketExceptions on the Agent machines. Our belief is that each agent attempts to run the max user load defined by the test, and that the Agent process runs out of something (sockets?) when spawning new users to make calls, which results in the SocketExceptions. To alleviate the issue, we added more Agents to the Controller and lowered the maximum number of concurrent users.

SocketExceptions on VS 2017 Test Agents can be prevented by lowering the maximum number of concurrent users. (You can then add in more Agents to the Test Controller in order to get the numbers back up.)

Initial Architecture Change

We’ve been through this load for many years, so we already have some standard approaches that we take every year to help with the load:

  • Add more Impacted Backend Service servers
  • Add more CPU/Memory to the Impacted Web App

This year we went further by:

  • Adding another proxy server to ensure Backend Service calls from the Impacted Web App don’t route through the Main Proxy to the Impacted Backend Services. This helps reduce the number of connections through the Main Proxy.
  • Adding 6 more Impacted Backend Service servers. These are the servers that always take the worst hit. These servers don’t need sticky sessions, so they can easily spread the load between them.
  • Adding a second Impacted Web App server. This server usually doesn’t have the same level of high CPU load that the Proxy and Impacted Backend Services do. These servers do require sticky sessions, so there are potential issues with the load not being balanced.

If you don’t have to worry about sticky sessions, adding more processing servers can always help distribute a load. That’s why Cloud based services with “Sliders” are fantastic!

[image]

Next Time …

In the next section we’ll look at the initial testing results and the lessons learned on each testing iteration.

Apigee Response CORS Headers using Javascript

on Monday, February 26, 2018

Apigee provides a quick “Add CORS Headers” option when creating a new API Proxy. It’s straightforward and will get you started adding CORS headers to the replies from your first API endpoints. The problem is that CORS headers are used in “preflight” and aren’t that useful after the call has successfully completed. Apigee OPTIONS Response for Preflight/CORS can help you set up preflight responses.

But, it’s still useful to add CORS headers to your responses in order to ensure that your endpoints are communicating their security requirements. To do this you can use javascript to inspect the responses and add in any missing CORS headers. This sample javascript will:

  • Ensure Access-Control-Allow-Origin is defined. Sets the default value to ‘*’.
  • Ensure Access-Control-Allow-Headers is defined. Sets the default value to ‘origin, x-requested-with, accept, my-api-key, my-api-version, authorization, content-type’.
    • my-api-key and my-api-version are custom headers specific to the Apigee endpoints this script is used with. If the Resource Service doesn’t return these headers, then they will be added in.
  • Ensure Access-Control-Max-Age is defined. Sets the default value to ‘3628800’ seconds (42 days … I have no idea why that was chosen.)
  • Ensure Access-Control-Allow-Methods is defined. Sets the default value to ‘GET, PUT, POST, DELETE’. This should really be set by the Resource Service, so use it only if you feel comfortable.

This should be created as a Shared Flow and applied to the Proxy Endpoint's PostFlow.

//  The *.values variables come back as a collection whose toString() wraps the
//  values in brackets (e.g. "[a, b]"), so strip a leading '[' and trailing ']' first.

//  Access-Control-Allow-Origin
var accessControlAllowOrigin = context.getVariable("response.header.Access-Control-Allow-Origin.values").toString();
if(accessControlAllowOrigin.startsWith('[')) { accessControlAllowOrigin = accessControlAllowOrigin.substring(1, accessControlAllowOrigin.length() - 1); }
if(accessControlAllowOrigin.endsWith(']')) { accessControlAllowOrigin = accessControlAllowOrigin.substring(0, accessControlAllowOrigin.length() - 1); }
if(accessControlAllowOrigin.length() === 0) {
    accessControlAllowOrigin = "*";
}
context.setVariable("response.header.Access-Control-Allow-Origin", accessControlAllowOrigin);

//  Access-Control-Allow-Headers
var accessControlAllowHeaders = context.getVariable("response.header.Access-Control-Allow-Headers.values").toString();
if(accessControlAllowHeaders.startsWith('[')) { accessControlAllowHeaders = accessControlAllowHeaders.substring(1, accessControlAllowHeaders.length() - 1); }
if(accessControlAllowHeaders.endsWith(']')) { accessControlAllowHeaders = accessControlAllowHeaders.substring(0, accessControlAllowHeaders.length() - 1); }
if(accessControlAllowHeaders.length() === 0) {
    accessControlAllowHeaders = "origin, x-requested-with, accept, my-api-key, my-api-version, authorization, content-type";
}
if(accessControlAllowHeaders.indexOf("my-api-key") === -1) {
    accessControlAllowHeaders += ", my-api-key";
}
if(accessControlAllowHeaders.indexOf("my-api-version") === -1) {
    accessControlAllowHeaders += ", my-api-version";
}
context.setVariable("response.header.Access-Control-Allow-Headers", accessControlAllowHeaders);

//  Access-Control-Max-Age
var accessControlMaxAge = context.getVariable("response.header.Access-Control-Max-Age.values").toString();
if(accessControlMaxAge.startsWith('[')) { accessControlMaxAge = accessControlMaxAge.substring(1, accessControlMaxAge.length() - 1); }
if(accessControlMaxAge.endsWith(']')) { accessControlMaxAge = accessControlMaxAge.substring(0, accessControlMaxAge.length() - 1); }
if(accessControlMaxAge.length() === 0) {
    accessControlMaxAge = "3628800";
}
context.setVariable("response.header.Access-Control-Max-Age", accessControlMaxAge);

//  Access-Control-Allow-Methods
var accessControlAllowMethods = context.getVariable("response.header.Access-Control-Allow-Methods.values").toString();
if(accessControlAllowMethods.startsWith('[')) { accessControlAllowMethods = accessControlAllowMethods.substring(1, accessControlAllowMethods.length() - 1); }
if(accessControlAllowMethods.endsWith(']')) { accessControlAllowMethods = accessControlAllowMethods.substring(0, accessControlAllowMethods.length() - 1); }
if(accessControlAllowMethods.length() === 0) {
    accessControlAllowMethods = "GET, PUT, POST, DELETE";
}
context.setVariable("response.header.Access-Control-Allow-Methods", accessControlAllowMethods);

