I don't install PowerShell scripts as Windows Tasks every day (and probably need to find a way for another system to manage that responsibility), so it's easy to forget how to do it. Here's a quick template to install a Windows Task on a remote machine:
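Something along these lines works as a starting point (a sketch only; the server name, script path, task name, and schedule below are placeholders, and it runs as SYSTEM for simplicity):

# Register a scheduled task on a remote machine (placeholder names throughout)
$server = 'APPSERVER01'
Invoke-Command -ComputerName $server -ScriptBlock {
    $action    = New-ScheduledTaskAction -Execute 'powershell.exe' `
                     -Argument '-NoProfile -ExecutionPolicy Bypass -File C:\Scripts\Do-NightlyWork.ps1'
    $trigger   = New-ScheduledTaskTrigger -Daily -At 3am
    # Runs as SYSTEM here; swap in a service-account principal if the script needs domain access
    $principal = New-ScheduledTaskPrincipal -UserId 'SYSTEM' -LogonType ServiceAccount -RunLevel Highest
    Register-ScheduledTask -TaskName 'Nightly Maintenance' -Action $action -Trigger $trigger -Principal $principal -Force
}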
Labels: powershell, server
Microsoft is a very large company, and there is never going to be a single statement which encapsulates the exact direction that every division in the company is moving. The leaders of the company try to point in a wise direction and hope like hell that the organization sees their wisdom and starts to move in that direction.
One of those big directional statements was Microsoft throwing its support behind Docker. MS runs a very large cloud provider in Azure, and looking over the statistics they have (and, I would assume, internal feedback), they are most likely seeing a large shift towards the use of Docker within their customers' systems. It would only be reasonable for them to provide a platform that supports the systems their customers are using.
But, what are the challenges that they face with Docker …
- Docker Containers are not Domain Joined
Microsoft's security methodology for many years has been based around Kerberos/AD and domain credentials. In order to provide least-privilege access for your applications, you create an AD account for each application and then set up permissions for that account based upon its needs. Those credentials are authenticated against Domain Controllers, and a Kerberos token is then passed everywhere to authn/authz the account on all services within the domain (SQL Server, Disk Access, LDAP).
So, if the Docker instance isn't domain joined, running an application under a domain account becomes difficult. How do they deal with this?
- Docker Hosts and gMSA accounts
Group Managed Service Accounts (gMSA) are a concept that was introduced into Active Directory prior to Docker. The idea behind these accounts is that they are more locked down/secure than a normal AD user account. They are registered within AD to only be usable on a particular set of machines within the domain, and those machines need to pre-register the accounts before they can be used.
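For context, creating a gMSA and pre-registering it on a host looks roughly like this (a hedged sketch; the account name, DNS name, and host group are made up, and it assumes the AD PowerShell module and an existing KDS root key):

# Create the gMSA in AD, restricted to a group of host machines
New-ADServiceAccount -Name 'WebApp01' -DNSHostName 'WebApp01.contoso.com' `
    -PrincipalsAllowedToRetrieveManagedPassword 'DockerHostsGroup'

# Run on each allowed host to pre-register the account locally
Install-ADServiceAccount -Identity 'WebApp01'
Test-ADServiceAccount -Identity 'WebApp01'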
Microsoft architects looked at this and thought: if we already have these accounts registered on the Docker Host machine, and the Docker container can interact with the Host machine, maybe we can find a way to slide the authenticated Kerberos credentials into the Docker Container for use?
Which they did (a sketch of the resulting workflow follows the list below). But, there are a number of 'gotchas' along the way to make gMSA accounts work with Containers:
- The Container hostname must match the gMSA name for Windows Server 2016 and Windows 10, versions 1709 and 1803.
- You can't use gMSAs with Hyper-V isolated containers on Windows 10 versions 1703, 1709, and 1803.
- Container initialization will hang or fail when you try to use a gMSA with a Hyper-V isolated container on Windows 10 and Windows Server versions 1703, 1709, and 1803.
- Using a gMSA with more than one container simultaneously leads to intermittent failures on Windows Server 2016 and Windows 10, versions 1709 and 1803.
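The workflow they ended up with, roughly, is to generate a "credential spec" JSON file from the gMSA on the Docker Host and hand it to the container at run time. A hedged sketch (it assumes Microsoft's CredentialSpec helper module, reuses the gMSA name from above, and the image and file names are illustrative; note the container hostname matching the gMSA name, per the first gotcha):

# On the Docker Host: write a credential spec JSON for the gMSA
# (the module prints the path/name of the JSON file it writes)
Install-Module CredentialSpec
New-CredentialSpec -AccountName 'WebApp01'

# Start the container with that credential spec; hostname must match the gMSA name
docker run --security-opt "credentialspec=file://WebApp01.json" --hostname WebApp01 -d myorg/my-windows-app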
So, with all the issues listed above, here's the Use Case:
Can you use Microsoft SQL Server within a Docker Container?
I think the answer is “I guess … but it feels like the SQL Server team (or the MS Docker Team) is focusing on supporting SQL Server in Linux Containers more than on Windows Containers.”
- Microsoft SQL Server Images for Windows Are Not Being Updated
The last MSSQL image for Windows containers was built in Feb. 2018 against Windows 10 1709 / Windows Server 2017-GA / Windows Server Core 2017. So, there hasn't been an update for Windows 10 1803/1809 or Windows Server 2019.
So, why is that? I don't know the answer, but maybe it's because there isn't a lot of usage of SQL Server in Containers due to licensing costs. Or because setting up a SQL Server instance to run under a gMSA account doesn't necessarily mean it will be able to authenticate Kerberos tokens/SSPI from clients (I never got to a place where I could test this). Or maybe Azure usage statistics show that people aren't using MSSQL in Windows Containers.
Either way, I’m not sure the MSSQL Team is really sold on investing their time into that platform. Only they know.
- Microsoft SQL Server Images for Linux are Working Great!
However, they are keeping the SQL Server for Linux images up to date. Using Linux simplifies things: it breaks out of the constraints of using Kerberos/SSPI for authentication and only needs to support the SQL Login authentication model.
Potentially, that's a good enough reason on its own to make supporting a container easier for the MSSQL team. But I wonder if they have statistics from Azure that show the market strongly prefers this configuration when using containers?
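For comparison, standing up the Linux image has none of the gMSA or hostname constraints; it's basically a one-liner (the tag and password below are examples, not a recommendation):

docker run -e "ACCEPT_EULA=Y" -e "SA_PASSWORD=YourStrong!Passw0rd" `
    -p 1433:1433 --name sql1 -d mcr.microsoft.com/mssql/server:2019-latest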
So, Is SQL Server looking to Dockerize on Windows?
It just doesn’t feel like it.
IIS Proxy & App Web Performance Optimizations Pt. 4
Posted by Steven Maglio on Friday, March 16, 2018
Last time we took the new architecture to its theoretical limit and pushed more of the load toward the database. This time …
What we changed (Using 2 Load Test Suites)
- Turned on Output Caching on the Proxy Server, which defaults to caching js, css, and images. This works really well with really old sites. (See the config sketch after this list.)
- We also lowered the number of users as the Backend Services ramped up to 100%.
- Forced Test Agents to run in 64-bit mode. This resolved an Out Of Memory exception that we were getting when the Test Agents ran into the 2 GB memory cap of their 32-bit processes.
- Found a problem with the Test Suite that was allowing all tests to complete without hitting the backend service. (This really affected the number of calls that made it to the Impacted Backend Services.)
- Added a second Test Suite which also used the same database. The load on this suite wasn’t very high; it just added more real world requests.
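For reference, enabling Output Caching for static content on the proxy can be scripted with the WebAdministration module. This is a sketch only; the site name and extension list are examples, not our exact configuration:

Import-Module WebAdministration

# Add output cache profiles for static content types on the proxy site
foreach ($ext in '.js', '.css', '.png', '.gif', '.jpg') {
    Add-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' -Location 'Default Web Site' `
        -Filter 'system.webServer/caching/profiles' -Name '.' `
        -Value @{ extension = $ext; policy = 'CacheUntilChange'; kernelCachePolicy = 'CacheUntilChange' }
}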
Test Setup
- Constant Load Pattern
- 1000 users
- 7 Test Agents (64-bit mode)
- Main Proxy
- 4 vCPU / 8 vCore
- 24 GB RAM
- AppPool Queue Length: 50,000
- WebFarm Request Timeout: 120 seconds
- Output Caching (js, css, images)
- Impacted Web App Server
- 3 VMs
- AppPool Queue Length: 50,000
- Impacted Backend Service Server
- 8 VMs
- Classic ASP App
- CDNs used for 4 JS files and 1 CSS file
- Custom JS and CSS coming from Impacted Web App
- Images still coming from Impacted Web App
- JS is minified
- VS 2017 Test Suite
- WebTest Caching Enabled
- A 2nd Test Suite which Impacts other applications in the environment is also run. (This is done off a different VS 2017 Test Controller)
Test Results
- Main Proxy
- CPU: 28% (down 37)
- Max Concurrent Connections: – (Didn’t Record)
- Impacted Web App
- CPU: 56% (down 10)
- Impacted Backend Service
- CPU: 100% (up 50)
- DB
- CPU: 30% (down 20)
- VS 2017 Test Suite
- Total Tests: 95,000 (down 30,000)
- Tests/Sec: 869 (down 278)
This more "real world" test really highlighted that the impacted systems weren't going to have a huge impact on the database shared by the other systems which will be using it at the same time.
We had successfully moved the load from the Main Proxy onto the backend services, but not all the way to the database. With some further testing we found that adding CPUs and new VMs to the Impacted Backend Servers had a direct 1:1 relationship with handling more requests. The unfortunate side of that is that we weren't comfortable with the cost of the additional CPUs compared to the increased performance.
The real big surprise was the significant CPU utilization decrease that came from turning on Output Caching on the Main Proxy.
And, with that good news, we called it a day.
So, the final architecture looks like this …
What we learned …
- SSL Encryption/Decryption can put a significant load on your main proxy/load balancer server. The number of requests processed by that server will directly scale into CPU utilization. You can reduce this load by moving static content to CDNs.
- Even if your main proxy/load balancer does SSL offloading and requests to the backend services aren't SSL encrypted, the extra socket connections still have an impact on the servers' CPU utilization. You can lower this impact on both the main proxy and the Impacted Web App servers by using Output Caching for static content (js, css, images).
- We didn’t have the need to use bundling and we didn’t have the ability to do spriting; but we would strongly encourage anyone to use those if they are an option.
- Moving backend service requests to an internal proxy doesn't significantly lower the number of requests through the main proxy. It's really images that create the largest number of requests to render a web page (especially with an older Classic ASP site).
- In Visual Studio, double-check that your suite of web tests is doing exactly what you think it is doing. Also, go the extra step and check that the HTTP Status Code returned on each request is the code you expect. If you expect a 302, check that it's a 302 instead of considering a 200 to be satisfactory.
Labels: iis, server, visual studio
IIS Proxy & App Web Performance Optimizations Pt. 3
Posted by Steven Maglio on Monday, March 12, 2018
We left off last time after resolving 3rd party JS and CSS files from https://cdnjs.com/ CDNs, and having raised the Main Proxy server's Application Pool Queue Length from 1,000 to 50,000.
We are about to add more CPUs to the Main Proxy and see if that improves throughput.
What we changed (Add CPUs)
- Doubled the number of CPUs to 4 vCPU / 8 vCore.
- So far, the number of connections into the proxy directly correlates to the amount of CPU utilization / load. Hopefully, by adding more processing power, we can scale up the number of Test Agents and the overall load.
Test Setup
- Step Load Pattern
- 1000 initial users, 200 users every 10 seconds, max 4000 users
- 4 Test Agents
- Main Proxy
- 4 vCPU / 8 vCore
- 24 GB RAM
- AppPool Queue Length: 50,000
- WebFarm Request Timeout: 30 seconds (default)
- Impacted Web App Server
- 2 VMs
- Impacted Backend Service Server
- 6 VMs
- Classic ASP App
- CDNs used for 4 JS files and 1 CSS file
- Custom JS and CSS coming from Impacted Web App
- Images still coming from Impacted Web App
- JS is minified
- VS 2017 Test Suite
- WebTest Caching Enabled
Test Results
- Main Proxy
- CPU: 65% (down 27)
- Max Concurrent Connections: 15,000 (down 2,500)
- Impacted Web App
- CPU: 87%
- Impacted Backend Service
- CPU: 75%
- VS 2017 Test Suite
- Total Tests: 87,000 (up 22,000)
- Tests/Sec: 794 (up 200)
Adding the processing power seemed to help out everything. The extra processors allowed more requests to be processed in parallel, which let requests pass through and complete more quickly, lowering the number of concurrent requests. The increased throughput raised the number of Tests that could be completed, which increased the Tests/Sec.
Adding more CPUs to the Proxy helps everything in the system move faster. It parallelizes the requests flowing through it and prevents process contention.
So, where does the new bottleneck exist?
Now that the requests are making it to the Impacted Web App, the CPU load has transferred to those servers and their associated Impacted Backend Services. This is a good thing. We're moving the load further down the stack. Doing that successfully would push the load down to the database (DB), which is currently not under much load at all.
What we changed (Add more VMs)
- Added 1 more Impacted Web App Server
- Added 2 more Impacted Backend Services Servers
- The goal with these additions was to use parallelization to allow more requests to be processed at once and push the bottleneck towards the database.
Test Setup
- Step Load Pattern
- 1000 initial users, 200 users every 10 seconds, max 4000 users
- 4 Test Agents
- Main Proxy
- 4 vCPU / 8 vCore
- 24 GB RAM
- AppPool Queue Length: 50,000
- WebFarm Request Timeout: 30 seconds (default)
- Impacted Web App Server
- 3 VMs
- Impacted Backend Service Server
- 8 VMs
- Classic ASP App
- CDNs used for 4 JS files and 1 CSS file
- Custom JS and CSS coming from Impacted Web App
- Images still coming from Impacted Web App
- JS is minified
- VS 2017 Test Suite
- WebTest Caching Enabled
Test Results
- Main Proxy
- CPU: 62% (~ the same)
- Max Concurrent Connections: 14,000 (down 1,000)
- Impacted Web App
- CPU: 60%
- Impacted Backend Service
- CPU: 65%
- VS 2017 Test Suite
- Total Tests: 95,000 (up 8,000)
- Tests/Sec: 794 (~ the same)
The extra servers helped get requests through the system faster. So, the overall number of Tests that completed increased. This helped push the load a little further down.
The Cloud philosophy of handling more load simultaneously through parallelization works. Obvious, right?
So, in that iteration, there was no bottleneck. And we are hitting numbers similar to what we expect on the day of the event. But what we really need to do is leave ourselves some head room in case more users show up than we expect. So, let's add in more Test Agents and see what the system can really handle.
What we changed (More Users Than We Expect)
- Added more Test Agents in order to overload the system.
Test Setup
- Step Load Pattern
- 2000 initial users, 200 users every 10 seconds, max 4000 users
- 7 Test Agents
- Main Proxy
- 4 vCPU / 8 vCore
- 24 GB RAM
- AppPool Queue Length: 50,000
- WebFarm Request Timeout: 30 seconds (default)
- Impacted Web App Server
- 3 VMs
- Impacted Backend Service Server
- 8 VMs
- Classic ASP App
- CDNs used for 4 JS files and 1 CSS file
- Custom JS and CSS coming from Impacted Web App
- Images still coming from Impacted Web App
- JS is minified
- VS 2017 Test Suite
- WebTest Caching Enabled
Test Results
- Main Proxy
- CPU: 65% (~ same)
- Max Concurrent Connections: 18,000 (up 4,000)
- Impacted Web App
- CPU: 63%
- Impacted Backend Service
- CPU: 54%
- VS 2017 Test Suite
- Total Tests: 125,000 (up 30,000)
- Tests/Sec: 1147 (up 282)
So, the "isolated environment" limit is pretty solid, but we noticed that at these limits the response time on the requests slowed down at the beginning of the Test iteration.
.asp Page Response Times
The theory is that with 7 Test Agents, all of which started out with 2,000 initial users and no caches primed, every agent made requests for js, css, and images, which swamped the Main Proxy and the Impacted Web App servers. Once the caches started being used in the tests, things smoothed out and stabilized.
From this test we found two error messages started occurring on the proxy: 502.3 Gateway Timeout and 503 Service Unavailable. Looking at the IIS logs on the Impacted Web App server we could see that many requests (with both 200 and 500 return status codes) were resolving with a Win32 Status Code of 64.
To resolve the Proxy 502.3 errors and the Impacted Web App Win32 Status Code 64 problems, we increased the Web Farm Request Timeout to 120 seconds. This isn't ideal, but as you can see in the graphic above, the average response time is consistently quick. So, this ensures all users will get a response, even though some may have a severely degraded experience. Chances are, their next request will process quickly.
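If you're scripting that change, the ARR Web Farm timeout lives in applicationHost.config and can be set with the WebAdministration module (a sketch; 'MyWebFarm' is a placeholder for the actual farm name):

Import-Module WebAdministration

# Raise the Web Farm request timeout from the default 30 seconds to 120 seconds
Set-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' `
    -Filter "webFarms/webFarm[@name='MyWebFarm']/applicationRequestRouting/protocol" `
    -Name 'timeout' -Value '00:02:00'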
Happily, the 503 Service Unavailable errors were not being generated on the Main Proxy server. They were actually being generated on the Impacted Web App servers, which still had their Application Pool Queue Length set to the default 1,000 requests. We increased those to 50,000, which removed the problem.
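The queue length change is just an app pool property, so it's easy to push out to each server (a sketch; the app pool name is a placeholder):

Import-Module WebAdministration

# Raise the HTTP.sys queue length on the Impacted Web App's app pool from 1,000 to 50,000
Set-ItemProperty -Path 'IIS:\AppPools\ImpactedWebApp' -Name 'queueLength' -Value 50000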
Next Time …
We’ll add another Test Suite to run along side it and look into more Caching.
Labels: iis, server, visual studio
IIS Proxy & App Web Performance Optimizations Pt. 2
Posted by Steven Maglio on Friday, March 9, 2018
Continuing from where we left off in IIS Proxy & App Web Performance Optimizations Pt. 1, we're now ready to run some initial tests and get some performance baselines.
The goal of each test iteration is to attempt to load the systems to a point that a bottleneck occurs and then find how to relieve that bottleneck.
Initial Test Setup
- Step Load Pattern
- 100 initial users, 20 users every 10 seconds, max 400 users
- 1 Test Agent
Initial Test Results
There was no data really worth noting on this run, as we found the addition of the second proxy server lowered the overhead on the Main Proxy enough that no systems were a bottleneck at this point. So, we added more Test Agents and re-ran the test with:
Real Baseline Test Setup
- Step Load Pattern
- 1000 initial users, 200 users every 10 seconds, max 4000 users
- 3 Test Agents
- Main Proxy
- 2 vCPU / 4 vCore
- 24 GB RAM
- AppPool Queue Length: 1000 (default)
- WebFarm Request Timeout: 30 seconds (default)
- Impacted Web App Server
- 2 VMs
- Impacted Backend Service Server
- 6 VMs
- Classic ASP App
- No CDNs used for JS, CSS, or images
- JS is minified
- VS 2017 Test Suite
- WebTest Caching Disabled
Real Baseline Test Results
- Main Proxy
- CPU: 99%
- Max Concurrent Connections: 17,000
- VS 2017 Test Suite
- Total Tests: 37,000
- Tests/Sec: 340
In this test we discovered that around 14,000 connections was the limit of the Main Proxy before we started to receive 503 Service Unavailable responses. We didn't yet understand that there was more to it, but we set about trying to lower the number of connections by lowering the number of requests for js, css, and images. Looking through the IIS logs we also saw that the majority of requests were for static content, which made it look like information wasn't being cached between calls. So, we found a setting in VS 2017's Web Test that allowed us to enable caching. (We also saw a lot of the SocketExceptions mentioned in the previous post, but we didn't understand what they meant at that time.)
What we changed (CDNs and Browser Caching)
- We took all of the 3rd party JS and CSS files that we use and referenced them from https://cdnjs.com/ CDNs. In total, there were 4 js files and 1 css file.
- The reason this hadn't been done before is that there wasn't enough time to test the fallback strategies (if the CDN doesn't serve the js/css, the browser should request the files from our servers). We implemented these fallbacks this time.
- We updated the VS 2017 Web Test configuration to enable caching. Whenever a new Test scenario is run, the test agent will not have caching enabled in order to replicate a “new user” experience; each subsequent call in the scenario will use cached js, css, and images. (This cut around 50% of the requests made in the baseline test)
- The majority of the requests into the Main Proxy were image requests. But the way the application was written, we couldn't risk a) moving the images to a CDN or b) spriting the images. (It is a Classic ASP app, so it doesn't have all the bells and whistles that newer frameworks have.)
Test Setup
- Step Load Pattern
- 1000 initial users, 200 users every 10 seconds, max 4000 users
- 3 Test Agents
- Main Proxy
- 2 vCPU / 4 vCore
- 24 GB RAM
- AppPool Queue Length: 1000 (default)
- WebFarm Request Timeout: 30 seconds (default)
- Impacted Web App Server
- 2 VMs
- Impacted Backend Service Server
- 6 VMs
- Classic ASP App
- CDNs used for 4 JS files and 1 CSS file
- Custom JS and CSS coming from Impacted Web App
- Images still coming from Impacted Web App
- JS is minified
- VS 2017 Test Suite
- WebTest Caching Enabled
Test Results
- Main Proxy
- CPU: 82% (down 17)
- Max Concurrent Connections: 10,400 (down 6,600)
- VS 2017 Test Suite
- Total Tests: 69,000 (up 32,000)
- Tests/Sec: 631 (up 289, but with 21% failure rate)
Offloading the common third party js and css files really lowered the number of requests into the Main Proxy server (38% lower). And, with that overhead removed, the CPU utilization came down from a pegged 99% to 82%.
Because caching was also enabled, the test suite was able to churn through the follow-up page requests much quicker. That increase in rate nearly doubled the number of Tests/Sec completed.
Move 3rd party static content to CDNs when possible (https://cdnjs.com/ is a great service). When doing so, try to implement fallbacks for failed loads of those resources.
But, we still had high CPU utilization on the Main Proxy. And, we had a pretty high failure rate with lots of 503 Service Unavailable and some 502.3 Gateway Timeouts. We determined the cause of the 503s was that the Application Pool Queue Length was being hit. We considered this to be the new bottleneck.
What we changed (AppPool Queue Length and another Test Agent)
- We set the application pool queue length from 1,000 to 50,000. This would allow us to queue up more requests and lower the 503 Service Unavailable error rate.
- We also had enough head room in the CPU to add another Test Agent.
Test Setup
- Step Load Pattern
- 1000 initial users, 200 users every 10 seconds, max 4000 users
- 4 Test Agents
- Main Proxy
- 2 vCPU / 4 vCore
- 24 GB RAM
- AppPool Queue Length: 50,000
- WebFarm Request Timeout: 30 seconds (default)
- Impacted Web App Server
- 2 VMs
- Impacted Backend Service Server
- 6 VMs
- Classic ASP App
- CDNs used for 4 JS files and 1 CSS file
- Custom JS and CSS coming from Impacted Web App
- Images still coming from Impacted Web App
- JS is minified
- VS 2017 Test Suite
- WebTest Caching Enabled
Test Results
- Main Proxy
- CPU: 92% (up 10)
- Max Concurrent Connections: 17,500 (up 7,100)
- VS 2017 Test Suite
- Total Tests: 65,000 (down 4,000)
- Tests/Sec: 594 (down 37, but with only 3% failure rate)
This helped fix the failure rate issue. Without all the 503s forcing the Tests to end early, it took slightly longer to complete each test and that caused the number of Tests/Sec to fall a bit. This also meant we had more requests queued up, bringing the number of concurrent connections back up.
For heavily trafficked sites, set your Application Pool Queue Length well above the default 1,000 requests. This is only needed if you don’t have a Network Load Balancer in front of your proxy.
At this point we were very curious what would happen if we added more processors to the Main Proxy. We were also curious what the average response time was from the Classic .asp pages. (NOTE: all the js, css, and image response times are higher than the page result time.)
.asp Page Response Times on Proxy
Next Time …
We’ll add more CPUs to the proxy and see if we can’t push the bottleneck further down the line.
Labels: iis, server, visual studio
IIS Proxy & App Web Performance Optimizations Pt. 1
Posted by Steven Maglio on Monday, March 5, 2018
We're ramping up towards a day where our web farm fields around 40 times the normal load. It's not much load compared to truly popular websites, but it's a lot more than what we normally deal with. It's somewhere around the order of ~50,000 people trying to use the system in an hour. And, the majority of the users hit the system in the first 15 minutes of the hour.
So, of course, we tried to simulate more than the expected load in our test environment and see what sort of changes we can make to ensure stability and responsiveness.
A quick note: This won’t be very applicable to Azure/Cloud based infrastructure. A lot of this will be done for you on the Cloud.
Web Farm Architecture
These systems run in a private Data Center. So, the servers and software don’t have a lot of the very cool features that the cloud offers.
The servers are all Win 2012 R2, IIS 8.5 with ARR 3.0, URL Rewrite 7.2, and Web Farm Framework 1.1.
Normally, the layout of the systems is similar to this diagram. It gives a general idea that there is a front-end proxy, a number of applications, backend services, and a database which are all involved in this yearly event; that a single Web App is significantly hit; and that its main supporting Backend Service is also significantly hit. The Backend Service is also shared by the other Web Apps involved in the event, but they are not the main clients during that hour.
Testing Setup
For testing we are using Visual Studio 2017 with a Test Controller and several Agents. It’s a very simple web test suite with a single scenario. This is the main use case during that hour. A user logs in to check their status, and then may take a few actions on other web applications.
Starting Test Load
- Step Pattern
- 100 users, 10 user step every 10 seconds, max 400 users
- 1 Agent
We eventually get to this Test Load
- Step Pattern
- 1000 users, 200 user step every 10 seconds, max 2500 users
- 7 agents
We found that over 2,500 concurrent users would result in a SocketException on the Agent machines. Our belief is that each agent attempts to run the max user load defined by the test, and that the Agent Process runs out of resources (sockets?) to spawn new users to make calls, which results in SocketExceptions. To alleviate the issue, we added more Agents to the Controller and lowered the maximum number of concurrent users.
SocketExceptions on VS 2017 Test Agents can be prevented by lowering the maximum number of concurrent users. (You can then add in more Agents to the Test Controller in order to get the numbers back up.)
Initial Architecture Change
We’ve been through this load for many years so we already have some standard approaches that we take every year to help with the load:
- Add more Impacted Backend Service servers
- Add more CPU/Memory to the Impacted Web App
This year we went further by
- Adding another proxy server to ensure Backend Service Calls from the Impacted Web App don’t route through the Main Proxy to the Impacted Backend Services. This helps reduce the number of connections through the Main Proxy.
- Adding 6 more Impacted Backend Service servers. These are the servers that always take the worst hit. These servers don’t need sticky sessions, so they can easily spread the load between them.
- Adding a second Impacted Web App server. This server usually doesn’t have the same level of high CPU load that the Proxy and Impacted Backend Services do. These servers do require sticky sessions, so there are potential issues with the load not being balanced.
If you don't have to worry about sticky sessions, adding more processing servers can always help distribute a load. That's why Cloud based services with "Sliders" are fantastic!
Next Time …
In the next section we’ll look at the initial testing results and the lessons learned on each testing iteration.
Labels: iis, server, visual studio
Back in 2015, we started using Win2012 R2 servers and within a day of Production usage we started seeing Out of Memory errors on the servers. Looking at the Task Manager, we could easily see that a massive amount of Kernel Memory was being used. But why?
Using some forum posts, SysInternals, and I think a Scott Hanselman blog entry, we were able to use PoolMon.exe to see that the system using all the Kernel Memory was Wnf. We had no idea what it was and went down some rabbit holes before finding this forum post.
Microsoft Support would later tell us the problem had something to do with a design change to Remote Registry and how it deals with going idle, and another design change in Windows Server 2012 R2 about how it chooses which services to make idle. Anyways, the fix was easy to implement (just a real pain to find):
If you want the service to not stop when Idle, you can set this registry key:
key  : HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\RemoteRegistry
name : DisableIdleStop
type : REG_DWORD
data : 1
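If you'd rather script it than click through regedit, something like this should do it (a sketch; the key may not exist until you create it):

# Create the key if needed and set DisableIdleStop = 1
$key = 'HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\RemoteRegistry'
if (-not (Test-Path $key)) { New-Item -Path $key -Force | Out-Null }
New-ItemProperty -Path $key -Name 'DisableIdleStop' -PropertyType DWord -Value 1 -Force | Out-Null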
Here’s what it looks like when the leak is happening:
Labels: server