IIS Proxy & App Web Performance Optimizations Pt. 3

on Monday, March 12, 2018

We left off last time after resolving the 3rd party JS and CSS files from https://cdnjs.com/ CDNs and raising the Main Proxy server’s Application Pool Queue Length from 1,000 to 50,000.

We are about to add more CPUs to the Main Proxy and see if that improves throughput.

What we changed (Add CPUs)

  • Doubled the number of CPUs to 4 vCPU / 8 vCore.
    • So far, the number of connections into the proxy directly correlates with the amount of CPU utilization / load. Hopefully, by adding more processing power, we can scale up the number of Test Agents and the overall load.

Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 4 Test Agents
  • Main Proxy
    • 4 vCPU / 8 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000 (up from the default 1,000)
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 2 VMs
  • Impacted Backend Service Server
    • 6 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 65% (down 27)
    • Max Concurrent Connections: 15,000 (down 2,500)
  • Impacted Web App
    • CPU: 87%
  • Impacted Backend Service
    • CPU: 75%
  • VS 2017 Test Suite
    • Total Tests: 87,000 (up 22,000)
    • Tests/Sec: 794 (up 200)

Adding the processing power seemed to help everything. The extra processors allowed more requests to be processed in parallel, so requests passed through and completed more quickly, lowering the number of concurrent connections. The increased throughput raised the total number of Tests that could be completed, which in turn raised the Tests/Sec.

Adding more CPUs to the Proxy helps everything in the system move faster. It parallelizes the requests flowing through it and prevents process contention.

So, where does the new bottleneck exist?

Now that the requests are making it to the Impacted Web App servers, the CPU load has transferred to them and their associated Impacted Backend Services. This is a good thing. We’re moving the load further down the stack. Doing that successfully would push the load down to the database (DB), which is currently not under much load at all.

What we changed (Add more VMs)

  • Added 1 more Impacted Web App Server
  • Added 2 more Impacted Backend Service Servers
    • The goal with these additions is to use parallelization to process more requests at once and push the bottleneck toward the database.

Test Setup

  • Step Load Pattern
    • 1000 initial users, 200 users every 10 seconds, max 4000 users
    • 4 Test Agents
  • Main Proxy
    • 4 vCPU / 8 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000 (up from the default 1,000)
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 3 VMs
  • Impacted Backend Service Server
    • 8 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 62% (~ the same)
    • Max Concurrent Connections: 14,000 (down 1,000)
  • Impacted Web App
    • CPU: 60%
  • Impacted Backend Service
    • CPU: 65%
  • VS 2017 Test Suite
    • Total Tests: 95,000 (up 8,000)
    • Tests/Sec: 794 (~ the same)

The extra servers helped get requests through the system faster, so the overall number of Tests that completed increased. This helped push the load a little further down the stack.

The Cloud philosophy of handling more load simultaneously through parallelization works. Obvious, right?

So, in that iteration there was no bottleneck, and we are hitting numbers similar to what we expect on the day of the event. But what we really need to do is leave ourselves some headroom in case more users show up than we expect. So, let’s add more Test Agents and see what the system can really handle.

What we changed (More Users Than We Expect)

  • Added more Test Agents in order to overload the system.

Test Setup

  • Step Load Pattern
    • 2000 initial users, 200 users every 10 seconds, max 4000 users
    • 7 Test Agents
  • Main Proxy
    • 4 vCPU / 8 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000
    • WebFarm Request Timeout: 30 seconds (default)
  • Impacted Web App Server
    • 3 VMs
  • Impacted Backend Service Server
    • 8 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled

Test Results

  • Main Proxy
    • CPU: 65% (~ the same)
    • Max Concurrent Connections: 18,000 (up 4,000)
  • Impacted Web App
    • CPU: 63%
  • Impacted Backend Service
    • CPU: 54%
  • VS 2017 Test Suite
    • Total Tests: 125,000 (up 30,000)
    • Tests/Sec: 1147 (up 282)

So, the “isolated environment” limit is pretty solid, but we noticed that at these limits the response times slowed down at the beginning of the Test iteration.

.asp Page Response Times

[Image: average .asp page response times over the test run]

The theory is that the 7 Test Agents, each of which started out with 2,000 initial users and no primed caches, all made requests for the JS, CSS, and images at once, which swamped the Main Proxy and the Impacted Web App servers. Once the caches started being used in the tests, things smoothed out and stabilized.

From this test we found two errors occurring on the proxy: 502.3 Gateway Timeout and 503 Service Unavailable. Looking at the IIS logs on the Impacted Web App server, we could see that many requests (with both 200 and 500 status codes) were finishing with a Win32 status code of 64 (ERROR_NETNAME_DELETED), which indicates the connection was dropped before the response finished being sent.
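A quick way to spot those Win32 status 64 entries is to scan the W3C logs on the Web App servers. Here is a minimal PowerShell sketch; the log path and W3SVC1 site folder are placeholders, and it assumes the default W3C log format, which includes the sc-status and sc-win32-status fields.

```powershell
# Sketch: count IIS log entries that ended with Win32 status 64, grouped by
# HTTP status. The log path below is a placeholder -- point it at the actual
# W3C log file on the Impacted Web App server.
$log = 'C:\inetpub\logs\LogFiles\W3SVC1\u_ex180312.log'

# W3C logs name their columns in a '#Fields:' header line.
$fields = (Select-String -Path $log -Pattern '^#Fields:' |
           Select-Object -Last 1).Line -replace '^#Fields:\s*', '' -split ' '

Get-Content $log |
    Where-Object { $_ -notmatch '^#' } |          # skip comment/header lines
    ForEach-Object {
        $values = $_ -split ' '
        $row = @{}
        for ($i = 0; $i -lt $fields.Count; $i++) { $row[$fields[$i]] = $values[$i] }
        [pscustomobject]$row
    } |
    Where-Object { $_.'sc-win32-status' -eq '64' } |
    Group-Object 'sc-status' |
    Select-Object Name, Count
```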

To resolve the Proxy 502.3 errors and the Impacted Web App Win32 status 64 entries, we increased the Web Farm Request Timeout to 120 seconds. This isn’t ideal, but as the graphic above shows, the average response time is consistently quick. The longer timeout ensures that every user gets a response, even though some may have a severely degraded experience. Chances are, their next request will process quickly.
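For reference, the timeout can be raised from script with the WebAdministration module. This is only a sketch under assumptions: 'MainWebFarm' is a placeholder farm name, and the exact section path and attribute can vary by ARR version, so verify them against the <webFarms> element in applicationHost.config on the proxy before running it.

```powershell
# Sketch: raise the ARR / Web Farm request timeout to 120 seconds.
# 'MainWebFarm' is a placeholder farm name; verify the configuration path
# against the <webFarms> section of applicationHost.config for your install.
Import-Module WebAdministration

Set-WebConfigurationAttribute -PSPath 'MACHINE/WEBROOT/APPHOST' `
    -Filter "webFarms/webFarm[@name='MainWebFarm']/applicationRequestRouting/protocol" `
    -Name 'timeout' `
    -Value '00:02:00'
```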

Happily, the 503 Service Unavailable errors were not being generated on the Main Proxy server. They were actually being generated on the Impacted Web App servers, which still had their Application Pool Queue Length set to the default 1,000 requests. We increased those to 50,000, which removed the problem.
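The queue length change itself is a one-liner per server. A minimal sketch, assuming the pool is named 'ImpactedWebAppPool' (a placeholder; use the real pool name on each Impacted Web App server):

```powershell
# Sketch: raise the application pool request queue limit from 1,000 to 50,000.
# 'ImpactedWebAppPool' is a placeholder pool name.
Import-Module WebAdministration

Set-ItemProperty -Path 'IIS:\AppPools\ImpactedWebAppPool' -Name 'queueLength' -Value 50000

# Confirm the new value.
Get-ItemProperty -Path 'IIS:\AppPools\ImpactedWebAppPool' -Name 'queueLength'
```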

Next Time …

We’ll add another Test Suite to run alongside it and look into more Caching.
