IIS Proxy & App Web Performance Optimizations Pt. 4

on Friday, March 16, 2018

Last time we took the new architecture to its theoretical limit and pushed more of the load toward the database. This time …

What we changed (Using 2 Load Test Suites)

  • Turned on Output Caching on the Proxy Server, which defaults to caching js, css, and images and works really well with really old sites. (A sample configuration follows this list.)
  • We also lowered the number of users as the Backend Services ramped up to 100%.
  • Forced the Test Agents to run in 64-bit mode. This resolved an Out Of Memory exception we were getting when the Test Agents ran into the 2 GB memory cap of their 32-bit processes.
  • Found a problem with the Test Suite that was allowing all tests to complete without hitting the backend service. (This really affected the number of calls that made it to the Impacted Backend Services.)
  • Added a second Test Suite which also used the same database. The load on this suite wasn’t very high; it just added more real-world requests.
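
For reference, IIS Output Caching profiles for static content live under system.webServer/caching in the proxy site’s web.config. Below is a minimal sketch; the extensions and cache policies shown are illustrative rather than our exact configuration.

    <configuration>
      <system.webServer>
        <!-- Cache static responses in the IIS output cache (and kernel cache) until the source file changes. -->
        <caching enabled="true" enableKernelCache="true">
          <profiles>
            <add extension=".js"  policy="CacheUntilChange" kernelCachePolicy="CacheUntilChange" />
            <add extension=".css" policy="CacheUntilChange" kernelCachePolicy="CacheUntilChange" />
            <add extension=".png" policy="CacheUntilChange" kernelCachePolicy="CacheUntilChange" />
            <add extension=".gif" policy="CacheUntilChange" kernelCachePolicy="CacheUntilChange" />
          </profiles>
        </caching>
      </system.webServer>
    </configuration>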

Test Setup

  • Constant Load Pattern
    • 1000 users
    • 7 Test Agents (64-bit mode)
  • Main Proxy
    • 4 vCPU / 8 vCore
    • 24 GB RAM
    • AppPool Queue Length: 50,000
    • WebFarm Request Timeout: 120 seconds
    • Output Caching (js, css, images)
  • Impacted Web App Server
    • 3 VMs
    • AppPool Queue Length: 50,000
  • Impacted Backend Service Server
    • 8 VMs
  • Classic ASP App
    • CDNs used for 4 JS files and 1 CSS file
      • Custom JS and CSS coming from Impacted Web App
      • Images still coming from Impacted Web App
    • JS is minified
  • VS 2017 Test Suite
    • WebTest Caching Enabled
    • A 2nd Test Suite which impacts other applications in the environment is also run. (This is done from a different VS 2017 Test Controller.)

Test Results

  • Main Proxy
    • CPU: 28% (down 37)
    • Max Concurrent Connections: – (Didn’t Record)
  • Impacted Web App
    • CPU: 56% (down 10)
  • Impacted Backend Service
    • CPU: 100% (up 50)
  • DB
    • CPU: 30% (down 20)
  • VS 2017 Test Suite
    • Total Tests: 95,000 (down 30,000)
    • Tests/Sec: 869 (down 278)

       This more “real world” test really highlighted that the impacted systems weren’t going to have a huge impact on the database shared by the other systems, which will be using it at the same time.

       We had successfully moved the load from the Main Proxy onto the backend services, but not all the way to the database. With some further testing we found that adding CPUs and new VMs to the Impacted Backend Service servers had a direct 1:1 relationship with handling more requests. The unfortunate side of that is that we weren’t comfortable with the cost of the additional CPUs compared to the increased performance.

       The really big surprise was the significant CPU utilization decrease that came from turning on Output Caching on the Main Proxy.

      And, with that good news, we called it a day.

      So, the final architecture looks like this …

      image

      What we learned …

       • SSL Encryption/Decryption can put a significant load on your main proxy/load balancer server. The number of requests processed by that server scales directly with CPU utilization. You can reduce this load by moving static content to CDNs.
       • Even if your main proxy/load balancer does SSL offloading and requests to the backend services aren’t SSL encrypted, the extra socket connections still have an impact on the server’s CPU utilization. You can lower this impact on both the main proxy and the Impacted Web App servers by using Output Caching for static content (js, css, images).
         • We didn’t have the need to use bundling and we didn’t have the ability to do spriting, but we would strongly encourage anyone to use those if they are an option.
       • Moving backend service requests to an internal proxy doesn’t significantly lower the number of requests through the main proxy. It’s really images that create the largest number of requests when rendering a web page (especially with an older Classic ASP site).
       • In Visual Studio, double check that your suite of web tests is doing exactly what you think it is doing. Also, go the extra step and check that the HTTP Status Code returned on each request is the code that you expect. If you expect a 302, check that it’s a 302 instead of considering a 200 to be satisfactory. (A sample validation rule follows this list.)
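
       As an illustration of that last point, a small custom validation rule can enforce the expected status code on a request. This is a minimal sketch (the class and property names are made up for the example); it uses the Microsoft.VisualStudio.TestTools.WebTesting API that VS 2017 web and load tests are built on.

           using Microsoft.VisualStudio.TestTools.WebTesting;

           // Fails a request when the HTTP status code differs from the expected one,
           // e.g. when you expect a 302 redirect and a 200 would hide a broken flow.
           public class ExpectedStatusCodeRule : ValidationRule
           {
               public int ExpectedStatusCode { get; set; }

               public override void Validate(object sender, ValidationEventArgs e)
               {
                   int actual = (int)e.Response.StatusCode;
                   e.IsValid = actual == ExpectedStatusCode;
                   e.Message = string.Format(
                       "Expected HTTP {0} but received HTTP {1} for {2}",
                       ExpectedStatusCode, actual, e.Request.Url);
               }
           }

       Once compiled into the test project, the rule can be attached to individual requests through the Web Test editor’s Add Validation Rule dialog.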

      IIS Proxy & App Web Performance Optimizations Pt. 3

      on Monday, March 12, 2018

       We left off last time after serving 3rd party JS and CSS files from https://cdnjs.com/ CDNs and raising the Main Proxy server’s Application Pool Queue Length from 1,000 to 50,000.

      We are about to add more CPUs to the Main Proxy and see if that improves throughput.

      What we changed (Add CPUs)

       • Doubled the number of CPUs to 4 vCPU / 8 vCore.
         • So far, the number of connections into the proxy directly correlates to the amount of CPU utilization/load. Hopefully, by adding more processing power, we can scale up the number of Test Agents and the overall load.

      Test Setup

      • Step Load Pattern
        • 1000 initial users, 200 users every 10 seconds, max 4000 users
        • 4 Test Agents
      • Main Proxy
        • 4 vCPU / 8 vCore
        • 24 GB RAM
        • AppPool Queue Length: 50,000
        • WebFarm Request Timeout: 30 seconds (default)
      • Impacted Web App Server
        • 2 VMs
      • Impacted Backend Service Server
        • 6 VMs
      • Classic ASP App
        • CDNs used for 4 JS files and 1 CSS file
          • Custom JS and CSS coming from Impacted Web App
          • Images still coming from Impacted Web App
        • JS is minified
      • VS 2017 Test Suite
        • WebTest Caching Enabled

      Test Results

      • Main Proxy
        • CPU: 65% (down 27)
        • Max Concurrent Connections: 15,000 (down 2,500)
      • Impacted Web App
        • CPU: 87%
      • Impacted Backend Service
        • CPU: 75%
      • VS 2017 Test Suite
        • Total Tests: 87,000 (up 22,000)
        • Tests/Sec: 794 (up 200)

       Adding the processing power seemed to help out everything. The extra processors allowed more requests to be processed in parallel, so requests were passed through and completed more quickly, lowering the number of concurrent requests. The increased throughput raised the number of Tests that could be completed, which in turn raised the number of Tests/Sec.

      Adding more CPUs to the Proxy helps everything in the system move faster. It parallelizes the requests flowing through it and prevents process contention.

      So, where does the new bottleneck exist?

       Now that the requests are making it to the Impacted Web App servers, the CPU load has transferred to them and their associated Impacted Backend Services. This is a good thing: we’re moving the load further down the stack. Doing that successfully would push the load down to the database (DB), which is currently not under much load at all.

      image

      What we changed (Add more VMs)

      • Added 1 more Impacted Web App Server
      • Added 2 more Impacted Backend Services Servers
        • The goal with these additions was to use parallelization to allow more requests to be processed at once and push the bottleneck towards the database.

      Test Setup

      • Step Load Pattern
        • 1000 initial users, 200 users every 10 seconds, max 4000 users
        • 4 Test Agents
      • Main Proxy
        • 4 vCPU / 8 vCore
        • 24 GB RAM
        • AppPool Queue Length: 50,000
        • WebFarm Request Timeout: 30 seconds (default)
      • Impacted Web App Server
        • 3 VMs
      • Impacted Backend Service Server
        • 8 VMs
      • Classic ASP App
        • CDNs used for 4 JS files and 1 CSS file
          • Custom JS and CSS coming from Impacted Web App
          • Images still coming from Impacted Web App
        • JS is minified
      • VS 2017 Test Suite
        • WebTest Caching Enabled

      Test Results

       • Main Proxy
         • CPU: 62% (~ the same)
         • Max Concurrent Connections: 14,000 (down 1,000)
       • Impacted Web App
         • CPU: 60%
       • Impacted Backend Service
         • CPU: 65%
       • VS 2017 Test Suite
         • Total Tests: 95,000 (up 8,000)
         • Tests/Sec: 794 (~ the same)

          The extra servers helped get requests through the system faster. So, the overall number of Tests that completed increased. This helped push the load a little further down.

          The Cloud philosophy of handling more load simultaneously through parallelization works. Obvious, right?

           So, in that iteration, there was no bottleneck. And, we are hitting numbers similar to what we expect on the day of the event. But, what we really need to do is leave ourselves some headroom in case more users show up than we expect. So, let’s add in more Test Agents and see what the system can really handle.

          What we changed (More Users Than We Expect)

          • Added more Test Agents in order to overload the system.

          Test Setup

          • Step Load Pattern
            • 2000 initial users, 200 users every 10 seconds, max 4000 users
            • 7 Test Agents
          • Main Proxy
            • 4 vCPU / 8 vCore
            • 24 GB RAM
            • AppPool Queue Length: 50,000
            • WebFarm Request Timeout: 30 seconds (default)
          • Impacted Web App Server
            • 3 VMs
          • Impacted Backend Service Server
            • 8 VMs
          • Classic ASP App
            • CDNs used for 4 JS files and 1 CSS file
              • Custom JS and CSS coming from Impacted Web App
              • Images still coming from Impacted Web App
            • JS is minified
          • VS 2017 Test Suite
            • WebTest Caching Enabled

          Test Results

           • Main Proxy
             • CPU: 65% (~ same)
             • Max Concurrent Connections: 18,000 (up 4,000)
           • Impacted Web App
             • CPU: 63%
           • Impacted Backend Service
             • CPU: 54%
           • VS 2017 Test Suite
             • Total Tests: 125,000 (up 30,000)
             • Tests/Sec: 1147 (up 282)

                So, the “isolated environment” limit is pretty solid, but we noticed that at these limits the response times on the requests slowed down at the beginning of the Test iteration.

                .asp Page Response Times

                image

                The theory is that the 7 Test Agents, all of which started out with 2,000 initial users and no caches primed, all made requests for js, css, and images at once, which swamped the Main Proxy and the Impacted Web App servers. Once the caches started being used in the tests, things smoothed out and stabilized.

                From this test we found two error messages started occurring on the proxy: 502.3 Gateway Timeout and 503 Service Unavailable. Looking at the IIS logs on the Impacted Web App server, we could see that many requests (with both 200 and 500 return status codes) were resolving with a Win32 Status Code of 64. (One way to pull those out of the logs is shown below.)
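
                For what it’s worth, a query along these lines against the IIS logs is one way to surface those requests, assuming Microsoft’s Log Parser 2.2 is available (the log path is illustrative):

                    LogParser.exe -i:IISW3C "SELECT sc-status, sc-win32-status, COUNT(*) AS Hits FROM C:\inetpub\logs\LogFiles\W3SVC1\u_ex*.log WHERE sc-win32-status = 64 GROUP BY sc-status, sc-win32-status"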

                To resolve the Proxy 502.3 and the Impacted Web App Win32 Status Code 64 problems, we increased the Web Farm Request Timeout to 120 seconds. This isn’t ideal, but as you can see in the graphic above, the average response time is consistently quick. So, this will ensure all users get a response, even though some may have a severely degraded experience. Chances are, their next request will process quickly. (A configuration sketch follows.)
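
                For reference, the Request Timeout set in the IIS ARR UI is stored with the web farm’s routing settings in applicationHost.config, roughly like the sketch below. The farm and server names are illustrative, and attribute names can vary by ARR version, so treat this as a guide rather than a drop-in config.

                    <webFarms>
                      <webFarm name="ImpactedWebAppFarm" enabled="true">
                        <server address="impactedwebapp01" enabled="true" />
                        <applicationRequestRouting>
                          <!-- Allow up to 2 minutes before the proxy gives up and returns a 502.3. -->
                          <protocol timeout="00:02:00" />
                        </applicationRequestRouting>
                      </webFarm>
                    </webFarms>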

                Happily, the 503 Service Unavailable was not being generated on the Main Proxy server. It was actually being generated on the Impacted Web App servers, which still had their Application Pool Queue Length set to the default 1,000 requests. We increased those to 50,000 and that removed the problem. (A configuration sketch follows.)
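
                The queue length itself is an attribute on the application pool definition in applicationHost.config (it can also be changed in IIS Manager under the pool’s Advanced Settings). A minimal sketch with an illustrative pool name:

                    <system.applicationHost>
                      <applicationPools>
                        <!-- Raise the HTTP.sys request queue for this pool from the default 1,000 to 50,000. -->
                        <add name="ImpactedWebAppPool" queueLength="50000" />
                      </applicationPools>
                    </system.applicationHost>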

                Next Time …

                We’ll add another Test Suite to run alongside this one and look into more caching.

                IIS Proxy & App Web Performance Optimizations Pt. 2

                on Friday, March 9, 2018

                Continuing from where we left off in IIS Proxy & App Web Performance Optimizations Pt. 1, we’re now ready to run some initial tests and get some performance baselines.

                The goal of each test iteration is to attempt to load the systems to a point that a bottleneck occurs and then find how to relieve that bottleneck.

                Initial Test Setup

                • Step Load Pattern
                  • 100 initial users, 20 users every 10 seconds, max 400 users
                  • 1 Test Agent

                Initial Test Results

                There was no data really worth noting on this run, as we found the addition of the second proxy server lowered the overhead on the Main Proxy enough that no system was a bottleneck at this point. So, we added more Test Agents and re-ran the test with:

                Real Baseline Test Setup

                • Step Load Pattern
                  • 1000 initial users, 200 users every 10 seconds, max 4000 users
                  • 3 Test Agents
                • Main Proxy
                  • 2 vCPU / 4 vCore
                  • 24 GB RAM
                  • AppPool Queue Length: 1000 (default)
                  • WebFarm Request Timeout: 30 seconds (default)
                • Impacted Web App Server
                  • 2 VMs
                • Impacted Backend Service Server
                  • 6 VMs
                • Classic ASP App
                  • No CDNs used for JS, CSS, or images
                  • JS is minified
                • VS 2017 Test Suite
                  • WebTest Caching Disabled

                Real Baseline Test Results

                • Main Proxy
                  • CPU: 99%
                  • Max Concurrent Connections: 17,000
                • VS 2017 Test Suite
                  • Total Tests: 37,000
                  • Tests/Sec: 340

                In this test we discovered that around 14,000 connections was the limit of the Main Proxy before we started to receive 503 Service Unavailable responses. We didn’t yet understand that there was more to it, but we set about trying to lower the number of connections by lowering the number of requests for js, css, and images. Looking through the IIS logs, we also saw that the majority of requests were for static content, which made it look like nothing was being cached between calls. So, we found a setting in VS 2017’s Web Test that allowed us to enable caching. (We also saw a lot of the SocketExceptions mentioned in the previous post, but we didn’t understand what they meant at that time.)

                What we changed (CDNs and Browser Caching)

                • We took all of the 3rd party JS and CSS files that we use and referenced them from https://cdnjs.com/ CDNs. In total, there were 4 JS files and 1 CSS file.
                  • The reason this hadn’t been done before is that there wasn’t enough time to test the fallback strategy: if the CDN doesn’t serve the js/css, the browser should request the files from our servers. We implemented these fallbacks this time (see the example after this list).
                • We updated the VS 2017 Web Test configuration to enable caching. Whenever a new Test scenario is run, the test agent will not have caching enabled in order to replicate a “new user” experience; each subsequent call in the scenario will use cached js, css, and images. (This cut around 50% of the requests made in the baseline test.)
                  • The majority of the requests into the Main Proxy were image requests. But, given the way the application was written, we couldn’t risk a) moving the images to a CDN or b) spriting the images. (It is a Classic ASP app, so it doesn’t have all the bells and whistles that newer frameworks have.)
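
                The fallback pattern mentioned above is the usual “test for the library’s global, then write a local script tag” approach. A minimal sketch using jQuery as a stand-in; the actual libraries, versions, and local paths in our app differ:

                    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.12.4/jquery.min.js"></script>
                    <script>
                        // If the CDN request failed, window.jQuery is undefined,
                        // so fall back to the copy served from our own servers.
                        window.jQuery || document.write('<script src="/scripts/jquery.min.js"><\/script>');
                    </script>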

                Test Setup

                • Step Load Pattern
                  • 1000 initial users, 200 users every 10 seconds, max 4000 users
                  • 3 Test Agents
                • Main Proxy
                  • 2 vCPU / 4 vCore
                  • 24 GB RAM
                  • AppPool Queue Length: 1000 (default)
                  • WebFarm Request Timeout: 30 seconds (default)
                • Impacted Web App Server
                  • 2 VMs
                • Impacted Backend Service Server
                  • 6 VMs
                • Classic ASP App
                  • CDNs used for 4 JS files and 1 CSS file
                    • Custom JS and CSS coming from Impacted Web App
                    • Images still coming from Impacted Web App
                  • JS is minified
                • VS 2017 Test Suite
                  • WebTest Caching Enabled

                Test Results

                • Main Proxy
                  • CPU: 82% (down 17)
                  • Max Concurrent Connections: 10,400 (down 6,600)
                • VS 2017 Test Suite
                  • Total Tests: 69,000 (up 32,000)
                  • Tests/Sec: 631 (up 289, but with 21% failure rate)

                Offloading the common third party js and css files really lowered the number of requests into the Main Proxy server (38% lower). And, with that overhead removed, the CPU utilization came down from a pegged 99% to 82%.

                Because caching was also enabled, the test suite was able to churn through the follow-up page requests much quicker. That increase in rate nearly doubled the number of Tests/Sec completed.

                Move 3rd party static content to CDNs when possible (https://cdnjs.com/ is a great service). When doing so, try to implement fallbacks for failed loads of those resources.

                But, we still had high CPU utilization on the Main Proxy. And, we had a pretty high failure rate with lots of 503 Service Unavailable and some 502.3 Gateway Timeouts. We determined the cause of the 503s was that the Application Pool Queue Length limit was being hit. We considered this to be the new bottleneck.

                What we changed (AppPool Queue Length)

                • We set the application pool queue length from 1,000 to 50,000. This would allow us to queue up more requests and lower the 503 Service Unavailable error rate.
                • We also had enough head room in the CPU to add another Test Agent.

                Test Setup

                • Step Load Pattern
                  • 1000 initial users, 200 users every 10 seconds, max 4000 users
                  • 4 Test Agents
                • Main Proxy
                  • 2 vCPU / 4 vCore
                  • 24 GB RAM
                  • AppPool Queue Length: 50,000
                  • WebFarm Request Timeout: 30 seconds (default)
                • Impacted Web App Server
                  • 2 VMs
                • Impacted Backend Service Server
                  • 6 VMs
                • Classic ASP App
                  • CDNs used for 4 JS files and 1 CSS file
                    • Custom JS and CSS coming from Impacted Web App
                    • Images still coming from Impacted Web App
                  • JS is minified
                • VS 2017 Test Suite
                  • WebTest Caching Enabled

                Test Results

                • Main Proxy
                  • CPU: 92% (up 10)
                  • Max Concurrent Connections: 17,500 (up 7,100)
                • VS 2017 Test Suite
                  • Total Tests: 65,000 (down 4,000)
                  • Tests/Sec: 594 (down 37, but with only 3% failure rate)

                This helped fix the failure rate issue. Without all the 503s forcing the Tests to end early, it took slightly longer to complete each test and that caused the number of Tests/Sec to fall a bit. This also meant we had more requests queued up, bringing the number of concurrent connections back up.

                For heavily trafficked sites, set your Application Pool Queue Length well above the default 1,000 requests. This is only needed if you don’t have a Network Load Balancer in front of your proxy.

                At this point we were very curious what would happen if we added more processors to the Main Proxy. We were also curious what the average response time was from the Classic .asp pages. (NOTE: all the js, css, and image response times are higher than the page result time.)

                .asp Page Response Times on Proxy

                image

                Next Time …

                We’ll add more CPUs to the proxy and see if we can’t push the bottleneck further down the line.

                IIS Proxy & App Web Performance Optimizations Pt. 1

                on Monday, March 5, 2018

                We’re ramping up towards a day where our web farm fields around 40 times the normal load. It’s not much load compared to truly popular websites, but it’s a lot more than what we normally deal with. It’s somewhere on the order of ~50,000 people trying to use the system in an hour, and the majority of those users hit the system in the first 15 minutes of the hour.

                So, of course, we tried to simulate more than the expected load in our test environment and see what sort of changes we can make to ensure stability and responsiveness.

                A quick note: This won’t be very applicable to Azure/Cloud based infrastructure. A lot of this will be done for you on the Cloud.

                Web Farm Architecture

                These systems run in a private Data Center. So, the servers and software don’t have a lot of the very cool features that the cloud offers.

                The servers are all Win 2012 R2, IIS 8.5 with ARR 3.0, URL Rewrite 7.2, and Web Farm Framework 1.1.

                Normally, the layout of the systems is similar to this diagram. It gives a general idea that there is a front-end proxy, a number of applications, backend services, and a database which are all involved in this yearly event, and that a single Web App is significantly hit and its main supporting Backend Service is also significantly hit. The Backend Service is also shared by the other Web Apps involved in the event, but they are not the main clients during that hour.

                image

                Testing Setup

                For testing we are using Visual Studio 2017 with a Test Controller and several Test Agents. It’s a very simple web test suite with a single scenario covering the main use case during that hour: a user logs in to check their status, and then may take a few actions on other web applications.

                Starting Test Load

                • Step Pattern
                  • 100 users, 10 user step every 10 seconds, max 400 users
                  • 1 Agent

                We eventually get to this Test Load

                • Step Pattern
                  • 1000 users, 200 user step every 10 seconds, max 2500 users
                  • 7 Agents

                We found that over 2,500 concurrent users would result in SocketExceptions on the Agent machines. Our belief is that each agent attempts to run the max user load defined by the test, and that the Agent process runs out of resources (sockets?) to spawn new users to make calls, which results in SocketExceptions. To alleviate the issue, we added more Agents to the Controller and lowered the maximum number of concurrent users.

                SocketExceptions on VS 2017 Test Agents can be prevented by lowering the maximum number of concurrent users. (You can then add in more Agents to the Test Controller in order to get the numbers back up.)

                Initial Architecture Change

                We’ve been through this load for many years so we already have some standard approaches that we take every year to help with the load:

                • Add more Impacted Backend Service servers
                • Add more CPU/Memory to the Impacted Web App

                This year we went further by

                • Adding another proxy server to ensure Backend Service calls from the Impacted Web App don’t route through the Main Proxy to the Impacted Backend Services. This helps reduce the number of connections through the Main Proxy. (See the rewrite-rule sketch after this list.)
                • Adding 6 more Impacted Backend Service servers. These are the servers that always take the worst hit. These servers don’t need sticky sessions, so they can easily spread the load between them.
                • Adding a second Impacted Web App server. This server usually doesn’t have the same level of high CPU load that the Proxy and Impacted Backend Services do. These servers do require sticky sessions, so there are potential issues with the load not being balanced.
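
                On the new internal proxy, routing the Impacted Web App’s backend calls to the Impacted Backend Service servers is an ARR web farm plus a URL Rewrite rule, along the lines of the sketch below. The farm name and URL pattern are illustrative, not our actual rule.

                    <system.webServer>
                      <rewrite>
                        <rules>
                          <!-- Send backend-service traffic to the ARR web farm for the
                               Impacted Backend Service servers instead of the Main Proxy. -->
                          <rule name="RouteToImpactedBackendFarm" stopProcessing="true">
                            <match url="^backendservice/(.*)" />
                            <action type="Rewrite" url="http://ImpactedBackendFarm/backendservice/{R:1}" />
                          </rule>
                        </rules>
                      </rewrite>
                    </system.webServer>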

                If you don’t have to worry about sticky sessions, adding more processing servers can always help distribute the load. That’s why Cloud-based services with “Sliders” are fantastic!

                image

                Next Time …

                In the next section we’ll look at the initial testing results and the lessons learned on each testing iteration.

