IIS Proxy & App Web Performance Optimizations Pt. 2

on Friday, March 9, 2018

Continuing from where we left off in IIS Proxy & App Web Performance Optimizations Pt. 1, we’re now ready to run some initial tests and get some performance baselines.

The goal of each test iteration is to load the systems to the point where a bottleneck occurs, and then figure out how to relieve that bottleneck.

Initial Test Setup

  • Step Load Pattern
    • 100 initial users, 20 users every 10 seconds, max 400 users
    • 1 Test Agent

Initial Test Results

There was no data really worth noting on this run: the addition of the second proxy server lowered the overhead on the Main Proxy enough that no system was a bottleneck at this point. So, we added more Test Agents and re-ran the test with:

Real Baseline Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 3 Test Agents
  • Main Proxy
    • 2 vCPU / 4 vCore
    • 24 GB RAM
    • AppPool Queue Length: 1000 (default)
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 2 VMs
  • Impacted Backend Service Server
    • 6 VMs
  • Classic ASP App
    • No CDNs used for JS, CSS, or images
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Disabled

Real Baseline Test Results

  • Main Proxy
    • CPU: 99%
    • Max Concurrent Connections: 17,000
  • VS 2017 Test Suite
    • Total Tests: 37,000
    • Tests/Sec: 340

In this test we discovered that around 14,000 connections was the limit of the Main Proxy before we started to receive 503 Service Unavailable responses. We didn’t yet understand that there was more to it, but we set about trying to lower the number of connections by reducing the number of requests for JS, CSS, and images. Looking through the IIS logs, we also saw that the majority of requests were for static content, which made it look like nothing was being cached between calls. So, we found a setting in VS 2017’s Web Test that allowed us to enable caching. (We also saw a lot of the SocketExceptions mentioned in the previous post, but we didn’t understand what they meant at the time.)
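
To get a quick breakdown of where that request volume is going, one option is a Log Parser query over the proxy's IIS logs, grouping hits by file extension. This is only a sketch: it assumes Log Parser 2.2 is installed and the logs are in the default W3C format, and the log path is a placeholder for your environment.

    LogParser.exe -i:IISW3C "SELECT EXTRACT_EXTENSION(cs-uri-stem) AS Ext, COUNT(*) AS Hits FROM C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log GROUP BY Ext ORDER BY Hits DESC"

An output sorted like this makes it obvious when js, css, and image requests dominate the traffic.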

What we changed (CDNs and Browser Caching)

  • We took all of the 3rd party JS and CSS files that we use and referenced them from https://cdnjs.com/ CDNs. In total, there were 4 JS files and 1 CSS file.
    • The reason this hadn’t been done before is that there hadn’t been enough time to test the fallback strategy: if the CDN fails to serve the JS/CSS, the browser should request the files from our servers. We implemented those fallbacks this time (see the sketch after this list).
  • We updated the VS 2017 Web Test configuration to enable caching. Whenever a new Test scenario is run, the test agent will not have caching enabled in order to replicate a “new user” experience; each subsequent call in the scenario will use cached js, css, and images. (This cut around 50% of the requests made in the baseline test)
    • The majority of the requests into the Main Proxy were image requests. But, given the way the application was written, we couldn’t risk a) moving the images to a CDN or b) spriting the images. (It is a Classic ASP app, so it doesn’t have all the bells and whistles that newer frameworks have.)
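
As an illustration of the fallback approach mentioned above, the usual pattern is to load the library from the CDN, then test for a global it should have defined and document.write a script tag pointing back at our own servers if it is missing. This is only a sketch; the library, version, and local path below are placeholders, not the actual files we moved:

    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.3.1/jquery.min.js"></script>
    <script>
        // If the CDN load failed or was blocked, the library's global won't exist; fall back to our copy.
        window.jQuery || document.write('<script src="/scripts/jquery.min.js"><\/script>');
    </script>

CSS fallbacks need a slightly different check (for example, testing a known rule on a hidden element after the stylesheet should have loaded), but the idea is the same: the CDN is the fast path and our servers are the safety net.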

Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 3 Test Agents
  • Main Proxy
    • 2 vCPU / 4 vCore
    • 24 GB RAM
    • AppPool Queue Length: 1000 (default)
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 2 VMs
  • Impacted Backend Service Server
    • 6 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 82% (down 17)
    • Max Concurrent Connections: 10,400 (down 6,600)
  • VS 2017 Test Suite
    • Total Tests: 69,000 (up 32,000)
    • Tests/Sec: 631 (up 289, but with 21% failure rate)

Offloading the common third-party JS and CSS files really lowered the number of requests into the Main Proxy server (38% fewer). And, with that overhead removed, CPU utilization came down from a pegged 99% to 82%.

Because caching was also enabled, the test suite was able to churn through the follow-up page requests much quicker. That increase in rate nearly doubled the number of Tests/Sec completed.

Move 3rd party static content to CDNs when possible (https://cdnjs.com/ is a great service). When doing so, be sure to detect failed CDN loads and fall back to serving those resources from your own servers.

But, we still had high CPU utilization on the Main Proxy. And, we had a pretty high failure rate, with lots of 503 Service Unavailable errors and some 502.3 gateway timeouts. We determined that the cause of the 503s was the Application Pool’s queue length being exceeded. We considered this to be the new bottleneck.

What we changed (Application Pool Queue Length)

  • We raised the application pool queue length from 1,000 to 50,000. This would allow more requests to queue up and lower the 503 Service Unavailable error rate (see the example after this list).
  • We also had enough CPU headroom to add another Test Agent.
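
For reference, the queue length is a property of the application pool itself. One way to change it is with appcmd (a sketch only; "MainProxyAppPool" is a placeholder for whatever the pool is actually named):

    %windir%\system32\inetsrv\appcmd.exe set apppool "MainProxyAppPool" /queueLength:50000

The same value is exposed in IIS Manager under the application pool’s Advanced Settings as "Queue Length", and it ends up as the queueLength attribute on the pool’s entry in applicationHost.config.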

Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 4 Test Agents
  • Main Proxy
    • 2 vCPU / 4 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 2 VMs
  • Impacted Backend Service Server
    • 6 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 92% (up 10)
    • Max Concurrent Connections: 17,500 (up 7,100)
  • VS 2017 Test Suite
    • Total Tests: 65,000 (down 4,000)
    • Tests/Sec: 594 (down 37, but with only 3% failure rate)

This helped fix the failure rate issue. Without all the 503s forcing the Tests to end early, each test took slightly longer to complete, which caused the number of Tests/Sec to fall a bit. This also meant we had more requests queued up, bringing the number of concurrent connections back up.

For heavily trafficked sites, set your Application Pool Queue Length well above the default 1,000 requests. This is only needed if you don’t have a Network Load Balancer in front of your proxy.

At this point we were very curious what would happen if we added more processors to the Main Proxy. We were also curious what the average response time was from the Classic .asp pages. (NOTE: all the js, css, and image response times are higher than the page result time.)
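
One way to pull those page timings out of the proxy’s IIS logs is another Log Parser query, this time averaging the time-taken field (in milliseconds) for .asp requests. Again a sketch: it assumes the time-taken field is enabled in the site’s W3C logging configuration and the log path is a placeholder.

    LogParser.exe -i:IISW3C "SELECT cs-uri-stem, COUNT(*) AS Hits, AVG(time-taken) AS AvgMs FROM C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log WHERE EXTRACT_EXTENSION(cs-uri-stem) = 'asp' GROUP BY cs-uri-stem ORDER BY Hits DESC"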

.asp Page Response Times on Proxy

[Chart: .asp page response times as measured at the Main Proxy]

Next Time …

We’ll add more CPUs to the proxy and see if we can’t push the bottleneck further down the line.
