Don’t update IIS’ applicationHost.config too fast

on Monday, June 3, 2019

IIS’ applicationHost.config file is the persistent backing store for a server's IIS configuration. A running IIS instance monitors that file for changes on disk, and any change triggers a reload of the configuration file and an update to IIS’ configuration, including application pool updates. This is a really nice feature which allows engineering teams to perform updates to IIS hosts using file operations; which makes it much more amenable to alternative configuration management solutions (hand-written scripts, Chef, Puppet, etc.).

Unfortunately, there is a risk involved with updating applicationHost.config outside of the standard appcmd.exe or PowerShell modules (WebAdministration and IISAdministration). Because the file is read in from disk after each update, a series of rapid updates can cause a pseudo race condition. Even though the file system should prevent reads from occurring while a write is occurring, there is a reproducible problem where IIS sometimes reads in only a partial XML configuration file (applicationHost.config) instead of the full file. It’s almost as if updating the file either prevents the read from finishing, or the read starts picking up the changes halfway through. This only happens sometimes, but if your IIS server is busy enough and you perform enough writes to the applicationHost.config file you can get this error to occur:

The worker process for application pool ‘xxxxxxxxxxxxxx’ encountered an error ‘Configuration file is not well-formed XML’ trying to read configuration data from ‘\\?\C:\inetpub\temp\apppools\xxxxxxxxxxxxxx\xxxxxxxxxxxxx.config’, line number ‘3’. The data field contains the error code.


An odd thing to note is that the error message has an unusual value for the .config file name. It uses the application pool name instead of the normal ‘web.config’ (ie. ‘\\?\C:\inetpub\apppools\apppoolname1\apppoolname1.config’).

This is pretty easy to make happen with a for loop in a script. For example:
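A minimal sketch of the idea (it assumes an elevated PowerShell session on a test server and a placeholder app pool name; not something to run against production):

    $configPath = "$env:windir\System32\inetsrv\config\applicationHost.config"

    1..50 | ForEach-Object {
        [xml]$config = Get-Content -Path $configPath -Raw
        # toggle a harmless attribute on a test app pool so every write is a real change
        $pool = $config.SelectSingleNode("//applicationPools/add[@name='TestAppPool']")
        $pool.SetAttribute('queueLength', "$(1000 + ($_ % 2))")
        $config.Save($configPath)
        # intentionally no Start-Sleep -- the rapid writes are what trigger the partial reads
    }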

To prevent the reading problem from happening, there are a number of ways:

  • Use appcmd.exe as Microsoft would suggest
  • Use the PowerShell modules that Microsoft provides, along with the Start-WebCommitDelay / Stop-WebCommitDelay functions
  • Or, put the script to sleep for a few seconds to let IIS process the previous update. This is the most flexible option, as you can perform updates over a remote network share; the others require an active RPC/WinRM session on the IIS server. (Example below)
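A minimal sketch of that approach (the file share and server names are placeholders):

    $stagedConfigs = Get-ChildItem -Path '\\fileshare\iis-staged-configs' -Filter '*.config'
    $liveConfig    = '\\webserver01\c$\Windows\System32\inetsrv\config\applicationHost.config'

    foreach ($staged in $stagedConfigs) {
        Copy-Item -Path $staged.FullName -Destination $liveConfig -Force
        Start-Sleep -Seconds 5   # give IIS time to apply the previous change before writing again
    }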

IIS Healthchecks Spam on Win Server 2016 & 2019

on Monday, May 27, 2019

I’ve written before about how IIS healthchecks can spam application servers because IIS web farms are not directly linked to application pools. IIS is designed so that any web application (application pool) can route a request to any web farm defined on the system. This means that all the application pools on a system read in all of the web farm configurations, and each one believes it is supposed to monitor the healthchecks of those web farms. Because of that, the more application pools on a server, the more healthchecks occur.

The issue was introduced in the first version of Web Farm Framework 1.1, but continues within the latest release of Application Request Routing 3.0 (Latest Release 1/25/2018, Web PI Version 3.0.1988).

I first discovered the problem when working on a Windows 2008 server, and was able to reproduce it on a Windows 2012R2 after that. I then filed a Microsoft Support ticket (#116110914917188) in which the product team did try to create a hotfix. But, the fix didn’t work and the ticket was closed stating that the issue would be fixed in a future release of IIS.

I’ve recently re-tested the feature in Windows Server 2016 and 2019; both have the same problematic behavior. There’s a good chance that Application Request Routing is the subsystem which implements web farms, so this might not be an IIS-specific issue; it might be that Application Request Routing needs to be updated.

Lowering Database Impact of IIS Healthcheck Spam

on Monday, May 6, 2019

I’ve mentioned before that IIS 8.5 has an issue with spamming healthchecks because its web farms aren’t directly associated with application pools. Instead, every running application pool believes it needs to monitor the health of all the web farms, which causes redundant healthcheck calls.

I’m also a proponent of using Healthchecks to test connectivity to databases, external web APIs, and core functionality: Healthchecks Should Not Be Pings.

Because I prefer healthchecks that verify the database is available and that the domain account can log into the database, the healthcheck spam can have at least a minimal effect on database performance and connection thresholds. It has never actually shown up as a problem, but it’s theoretically something to be concerned about. So … here are two techniques that can lower the impact that the healthcheck spam can have on databases:

Cache the Healthcheck Response for a Little While

There are a number of output caching systems built into ASP.NET which can easily provide this functionality:

But! Nothing was ever made for ASP.NET WebApi 2, which was really surprising. Luckily, there is a great community package that provides the functionality while following the standards that Microsoft was using:

Use Database.Exists (instead of a real call) to help prevent locking

To test if your application is going to be able to communicate with a database successfully, the easiest (and maybe best) way is to use a real call. Just use a function that is a part of the application. This not only tests connectivity, but it also checks if permissions are setup correctly. However, this can drive some concern around table locking if the storm of healthcheck calls aren’t prevented (by the caching mentioned above). Another way to prevent table locking is to use Entity Frameworks Database.Exists method.

IIS Healthcheck & Ninject Memory Leak

on Monday, April 22, 2019

This post is definitely not related to the new Healthcheck system introduced in ASP.NET Core.

I’ve written before about a problem with IIS where the Web Farm healthcheck subsystem creates “healthcheck spamming”. The memory leak described in this post is exacerbated by the healthcheck spamming problem and can create application pools with excessive memory use.

The Ninject memory leak problem is not a problem with Ninject itself, but a problem with the implementation that I had prescribed within our architecture. To be clear: Ninject doesn’t have a memory leak problem; I created an implementation which used Ninject improperly and that resulted in a memory leak.

Originally, in our Top Level applications (ie. Company.Dealership.Web) the NinjectWebCommon loading system would configure a series of XxxHealthcheckClient and XxxHealthcheckProxy objects to be loaded In Transient Scope. That is normally fine, as long as you use the created object within a using block, which ensures the object is disposed. However, the way I was writing the code did not use a using block.

This meant that when a request came into the Company.Dealership.Web.Healthcheck controller, an instance of Company.Car.HealthcheckClient and Company.Car.HealthcheckProxy would be created and loaded into memory. The request would then progress through Company.Dealership.Web and result in the Proxy making a call to a second website, Company.Car.Web.Healthcheck (Controller). The problem was that once all the calls had completed, the Client and Proxy objects within the Company.Dealership.Web website would not be disposed (the memory leak).

For a low utilization website/webapi this could go unnoticed as IIS’ built-in application pool recycle time of 29 hours could clean up a small memory leak before anyone notices. But, when you compound this memory leak issue with IIS’ healthcheck spamming the issue can become apparent very quickly. In my testing, a website with a single healthcheck client/proxy pair could consume about 100 MB of memory every hour when there are ~200 application pools on the proxy server. (200 appPools x 1 healthcheck every 30 seconds per appPool = 24,000 healthchecks per hour).

The guidance from Ninject’s web development side is to change your configurations to no longer use Transient Scope when handling web requests. Instead, configurations should scope instances to In Singleton Scope or In Request Scope. I did some testing, and In Singleton Scope consistently removed the memory leak every time, while In Request Scope didn’t. I tested In Request Scope a few times and, on one of those runs, the memory leak reoccurred. Unfortunately, I could not determine why it was leaking and it truly made no sense to me why it happened (it was most likely a misconfigured build). But, either should work.

When using Ninject within a website, any classes which extend Entity Framework’s DBContext should always be configured In Request Scope.

Here is some code which can detect if a class is currently configured to use In Transient Scope (or is not configured at all) and will reconfigure (rebind) the class to In Singleton Scope:

Let’s Encrypt, IIS Central Cert Store and Powershell

on Monday, February 18, 2019

Let’s Encrypt is a pretty popular tool with a mission to generate free SSL certificates in order to create a more secure internet. The goal is to ensure that the price of SSL certificates does not stand in the way of using them. Unfortunately, when you don’t charge for a product you really have to cut down on the amount of money you spend on customer service.

Their website is a model for limited user interaction. They provide documentation, help guides, and then they point you away from their site and towards the sites of many supporting tool providers which implement their SSL generation platform. But, you will be hard pressed to find a “Contact Us” or “User Support Forum” area on letsencrypt.org. To summarize their site: Here’s how it works, here’s the client providers, read the client providers documentation please.

I don’t fully understand the ACME protocol, but to me it reads like a strict Process and API for validating requests and provisioning signed certificates. Normally there might be a handy website that will guide you through this process with step-by-step instructions but, because there are so many different types of computer systems and programming languages that can implement the ACME protocol, they leave those guides up to the implementers of the ACME clients for each of those systems.

My preference is PowerShell, and I found the Posh-ACME guide gave me a good start, but it didn’t help me through the final steps of installing the certificate for use with IIS; in this case, an IIS Centralized Certificate Store. So, hopefully this can help others with a start-to-finish script showing the end-user process, instead of hunting down individual steps from different sites.
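A condensed sketch of that flow (Posh-ACME's New-PACertificate does the ACME work; the share path, contact address, and PFX password are placeholders, and a challenge plugin is assumed to already be configured):

    Import-Module Posh-ACME
    Set-PAServer LE_PROD                                   # use the production Let's Encrypt endpoint

    $domain  = 'www.example.com'
    $ccsPath = '\\fileserver\CentralCertStore'             # IIS Centralized Certificate Store share
    $pfxPass = 'SamePasswordConfiguredOnTheCertStore'

    # Request (or renew) the certificate from Let's Encrypt
    $cert = New-PACertificate -Domain $domain -AcceptTOS -Contact 'admin@example.com' -PfxPass $pfxPass

    # The Central Cert Store expects one PFX per host name, named after the host name
    Copy-Item -Path $cert.PfxFullChain -Destination (Join-Path $ccsPath "$domain.pfx") -Force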

AWS ALB Price Planning w/ IIS : Add IisLogWebAppId

on Monday, October 29, 2018

This post continues the series from AWS ALB Price Planning w/ IIS : Grouped Sites.

This doesn’t really help figure out much in the larger picture. But, I wanted to separate out statistics about the Web API applications from the normal web applications. Web API applications are strong candidates for rewrites as Serverless ASP.NET Core 2.0 Applications on Lambda Functions. Changing these applications to Lambda Functions won’t reduce the cost of the ALB as they will still use host names that will be serviced by the ALB. But, this will help figure out the tiny tiny costs that the Lambda Functions will charge each month.

This is just an intermediary step to add WebAppId’s to all of the requests.

Background Info

Instead of adding a WebAppId column onto the IisLog, I’m going to create a new table which will link the IisLog table entries to the ProxyWebApp table entries. The reason for this is that the IisLog table has 181,507,680 records and takes up 400 GB of space on disk. Adding a new column, even a single integer column, could be a very dangerous operation because I don’t know how much data the system might want to rearrange on disk.

Plan of Action and Execution

Instead, I’m going to

  1. Add a WebAppId int Identity column onto table dbo.ProxyWebApp. The identity column won’t be part of the Primary Key, but it’s also a super tiny table.
  2. Create a new table called dbo.IisLogWebAppId which takes the Primary Key of table dbo.IisLog and combines it with WebAppId.
  3. Create a script to populate dbo.IisLogWebAppId.
  4. Create a stored procedure to add new entries nightly.

The scripts are below, but I think it’s worthwhile to note that the script to populate dbo.IisLogWebAppId took 4h57m to run against the 181,507,680 records, and the new table takes up about 15 GB of disk space.
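The original scripts aren't reproduced here; a rough sketch of the population step, with assumed key and column names, could look like this:

    $query = "
        INSERT INTO dbo.IisLogWebAppId (IisLogId, WebAppId)
        SELECT  l.IisLogId, w.WebAppId
        FROM    dbo.IisLog l
                JOIN dbo.ProxyWebApp w
                  ON l.SiteName = w.SiteName
                 AND l.UriStem LIKE w.AppPath + '%'
        WHERE   NOT EXISTS (SELECT 1 FROM dbo.IisLogWebAppId x WHERE x.IisLogId = l.IisLogId);"
    # QueryTimeout 0 = no timeout; the real run took almost 5 hours
    Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'IisLogs' -Query $query -QueryTimeout 0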

AWS ALB Price Planning w/ IIS : Grouped Sites

on Monday, October 22, 2018

This post continues the series from AWS ALB Price Planning w/ IIS : Rule Evaluations Pt. 2.

Having all the IIS log data in the database and generating all the hourly LCU totals helps define what the monthly charges could be. But, my expectation is that I will need to split the 71 DNS host names over multiple ALBs in order to reduce the total cost of the LCUs. My biggest fear is the Rule Evaluation dimension: the more host names on a single ALB, the more likely a request will go past the 10 free rule evaluations.

To do this, I need to build a script/program that will generate out possible DNS host name groupings and then evaluate the LCUs based upon those groupings.

In the last post I had already written a simple script to group sites based upon the number of sub-applications (ie. rules) they contain. That script didn’t take the next step, which is to evaluate the combined LCU aggregates and recalculate the fourth dimension (the Rule Evaluation LCU).

But, before that …

The Full Database Model

So, instead of making you comb through the previous posts and cobble together the database: I think the database schema is generic enough that it’s fine to share all of it. So …

New Additions:

  • ALBGrouping

    This is the grouping table that the script/program will populate.
  • vw_ALB*

    These use the ALBGrouping table to recalculate the LCUs.
  • usp_Aggregate_ALBLCUComparison_For_DateRange

    This combines all the aggregates (similar to vw_ProxySiteLCUComparison). But, the way the aggregation works, you can’t filter the result by Date. So, I needed a way to pass a start and end date to filter the results.

ALB Grouping Results

Wow! It’s actually cheaper to go with a single ALB rather than even two ALBs. It’s way cheaper than creating individual ALBs for sites with more than 10 sub-applications.

I wish I had more confidence in these numbers; there’s a really good chance I’m not calculating the original LCU statistics correctly. But, I think they should be in the ballpark.

[image: ALB grouping results]

I have it display statistics on its internal trials before settling on a final value. And, from the internal trials, it looks like the least expensive option is actually using a single ALB ($45.71)!

Next Up, AWS ALB Price Planning w/ IIS : Add IisLogWebAppId.

AWS ALB Price Planning w/ IIS : Rule Evaluations Pt. 2

on Monday, October 15, 2018

This post continues the series from AWS ALB Price Planning w/ IIS : Rule Evaluations.

In the last post, I looked at the basics of pricing a single site on an ALB server. This time I’m going to dig in a little further into how to group multiple sites onto multiple ALB servers. This would be to allow a transition from a single IIS proxy server to multiple ALB instances.

Background

  • The IIS proxy server hosts 73 websites with 233 web applications.
  • Any site with 8 or more web applications within it will be given its own ALB. The LCU cost of having over 10 rule evaluations on a single ALB is so dominant that it’s best to keep the number of rules below 10.
  • Of the 73 websites, only 6 sites have 8 or more web applications within them. Leaving 67 other websites containing 103 web applications.

Findings from last time

I looked at grouping by number of rules and by average request counts.

If you have figured out how to get all the import jobs, tables, and stored procedures set up from the last couple posts, then you are amazing! I definitely left out a number of scripts for database objects, and some of the scripts have morphed throughout the series. But, if you were able to get everything set up, here’s a nice little view to help look at the expenses of rule evaluation LCUs.

Just like in the last post, there is a section at the bottom to get a more accurate grouping.

Simple Grouping

To do the simple groupings, I’m first going to generate some site statistics with usp_Regenerate_ProxySiteRequestStats. They aren’t really that useful on their own, but they give you something to work with.

You can combine those stats with the WebAppCounts and use them as input into a PowerShell function. This PowerShell function attempts to:

  • Separate Single Host ALBs from Shared Host ALBs
    • $singleSiteRuleLimit sets the minimum number of sub-applications a site can have before it is required to be on its own ALB
  • Group Host names into ALBs when possible
    • It creates a number of shared ALBs (“bags”) which it can place sites into.
    • It uses a bit of an elevator algorithm to try and evenly distribute the sites into ALBs.
  • Enforce Rule Limits
    • Unfortunately, elevator algorithms aren’t great at finding a good match every time. So, if adding a site to a bag would bring the total number of evaluation rules over $sharedSiteRuleLimit, then it tries to fit the site into the next bag (and so on).
  • Give Options for how the sites will be prioritized for sorting
    • Depending on how the sites are sorted before going into the elevator algorithm, you can get different results. So, $sortByOptions lets you choose a few ways to sort them and see the results of each option side by side.

The results look something like this:

[image: grouping script output]

So, sorting by WebAppCount (ie. # of sub-applications) got it down to 19 ALBs. That’s 6 single-site ALBs and 13 shared ALBs.

Conclusion:

The cost of 19 ALBs without LCU charges is $307.80 per month ($0.0225 ALB per hour * 24 hours * 30 days * 19 ALBs). Our current IIS proxy server, which can run on a t2.2xlarge EC2 image, would cost $201.92 per month on a prepaid standard 1-year term.

The Sorting PowerShell Script
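The full script isn't reproduced here; below is a condensed sketch of the grouping logic described above (a first-fit placement standing in for the elevator algorithm; the property names like WebAppCount are assumptions):

    param(
        [object[]] $Sites,                  # each item: SiteName, WebAppCount, RequestCount
        [int] $singleSiteRuleLimit = 8,
        [int] $sharedSiteRuleLimit = 10,
        [string] $sortBy = 'WebAppCount'
    )

    # 1. Any site big enough gets its own ALB
    $singleAlbSites = $Sites | Where-Object { $_.WebAppCount -ge $singleSiteRuleLimit }
    $sharedSites    = $Sites | Where-Object { $_.WebAppCount -lt $singleSiteRuleLimit } |
                      Sort-Object -Property $sortBy -Descending

    # 2. Distribute the remaining sites into shared ALBs ("bags") without going over the rule limit
    $bags = @()
    foreach ($site in $sharedSites) {
        $placed = $false
        foreach ($bag in $bags) {
            if (($bag.RuleCount + $site.WebAppCount) -le $sharedSiteRuleLimit) {
                $bag.Sites     += $site.SiteName
                $bag.RuleCount += $site.WebAppCount
                $placed = $true
                break
            }
        }
        if (-not $placed) {
            $bags += [pscustomobject]@{ Sites = @($site.SiteName); RuleCount = $site.WebAppCount }
        }
    }

    "{0} single-site ALBs, {1} shared ALBs" -f $singleAlbSites.Count, $bags.Count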

How to get more accurate groupings:

  • Instead of generating hourly request statistics based upon Date, Time, and SiteName; the hourly request statistics need to be based upon Date, Time, SiteName, and AppPath. To do this you would need to:
    • Assign a WebAppId to the dbo.ProxyWebApps table
    • Write a SQL query that would use the dbo.ProxyWebApps data to evaluate all requests in dbo.IisLogs and assign the WebAppId to every request
    • Regenerate all hourly statistics over ALL dimensions using Date, Time, SiteName and AppPath.
  • Determine a new algorithm for ALB groupings that would attempt to make the number of rules in each group 10. But, the algorithm should leave extra space for around 1000~1500 (1.0~1.5 LCU) worth of requests per ALB. The applications with the lowest number of requests should be added to the ALBs at this point.
    • You need to ensure that all applications with the same SiteName have to be grouped together.
    • The base price per hour for an ALB is equivalent to about 2.8 LCUs. So, if you can keep this dimension below 2.8 LCU, it’s cheaper to get charged for the rule evaluations than to create a new ALB.

Next Up, AWS ALB Price Planning W/ IIS : Grouped Sites.

    AWS ALB Price Planning w/ IIS : Rule Evaluations

    on Monday, October 8, 2018

    This post continues the series from AWS ALB Price Planning w/ IIS : Bandwidth. Here are a couple things to note about the query to grab Rule Evaluations:

    • This can either be the most straightforward or most difficult dimension to calculate. For me, it was the most difficult.
    • The IIS logs I’m working with have 73 distinct sites (sometimes referred to as DNS host names or IP addresses). But there are 233 web applications spread across those 73 distinct sites. An ALB is bound to an IP address, so all of the sub-applications under a single site will become rules within that ALB. At least, this is the way I’m planning on setting things up. I want every application/rule under a site to be pointed at a separate target server list.
    • An important piece of background information is that the first 10 rule evaluations on a request are free. So, if you have less than 10 rules to evaluate on an ALB, you will never get charged for this dimension.
    • Another important piece of information: Rules are evaluated in order until a match is found. So, you can put heavily used sub-applications at the top of the rules list to ensure they don’t go over the 10 free rule evaluation per request limit.
      • However, you also need to be aware of evaluating rules in the order of “best match”. For example, you should place “/webservices/cars/windows” before “/webservices/cars”, because the opposite ordering would send all requests to /webservices/cars.
    • The point being, you can tweak the ordering of the rules to ensure the least used sub-application is the only one which goes over the 10 free rule evaluations limit.

    With all that background information, the number of rule evaluations is obviously going to be difficult to calculate. And, that’s why I fudged the numbers a lot. If you want some ideas on how to make more accurate predictions please see the notes at the bottom.

    Here were some assumptions I made up front:

    • If the site has over 8 sub-applications, that site should have its own ALB. It should not share that ALB with another site. (Because the first 10 rule evaluations are free.)
    • All sites with less than 8 sub-applications should be grouped onto shared ALBs.
    • For simplicity, the groupings will be based on the number of rule evaluations. The number of requests for each sub-application will not be used to influence the groupings.

    Findings

    Here were my biggest takeaways from this:

    • When an ALB is configured with more than the 10 free rule evaluations allowed, the rule evaluation LCUs can become the most dominant trait. But, that only occurs if the number of requests is very high and the ordering of the rules is very unfavorable.
    • The most influential metric on the LCU cost of a site is the number of requests it receives. You really need a high traffic site to push the LCU cost.
    • As described in the “How to get more accurate numbers” section below, the hourly base price of an ALB is $0.0225 and the hourly LCU price is $0.008. So, as long as you don’t go over 2.8 LCUs per hour, it’s cheaper to bundle multiple sites onto a single ALB than to make a new one.

    To demonstrate this, here was the second most heavily “requested” site. That site has 22 sub-applications. I used some guerrilla math and came up with a statement of “on average there will be 6 rule evaluations per request” ((22 sub-applications / 2) – (10 free rule evaluations / 2)). Looking at August 1st 2018 by itself, the Rule Evaluations LCU was always lower than the Bandwidth LCU.

    [image: Rule Evaluation LCU vs Bandwidth LCU, August 1st 2018]

    How to Gather the Data

    Since I wanted every application under a site to need a rule, I first needed to get the number of web applications on the IIS server. I do not have that script attached, but you should be able to write something using the WebAdministration or IISAdministration PowerShell modules. I threw those values into a very simple table:
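    A rough sketch of that enumeration (not the original script; the database and column names are assumptions):

        Import-Module WebAdministration

        foreach ($site in Get-Website) {
            foreach ($app in Get-WebApplication -Site $site.Name) {
                # one row per web application under the site
                $query = "INSERT INTO dbo.ProxyWebApps (SiteName, AppPath) VALUES ('$($site.Name)', '$($app.Path)')"
                Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'IisLogs' -Query $query
            }
        }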

    Once you get your data into dbo.ProxyWebApps, you can populate dbo.ProxyWebAppCounts easily with:
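    (A version with assumed column names; the original query isn't attached:)

        $query = "
            INSERT INTO dbo.ProxyWebAppCounts (SiteName, WebAppCount)
            SELECT SiteName, COUNT(*)
            FROM   dbo.ProxyWebApps
            GROUP BY SiteName;"
        Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'IisLogs' -Query $query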

    Now, we need to calculate the number of requests per application for each hour.

    And, finally, generate the LCUs for rule evaluations and compare it with the LCU values from the previous dimensions:

    How to get more accurate numbers:

    • Instead of generating hourly request statistics based upon Date, Time, and SiteName; the hourly request statistics need to be based upon Date, Time, SiteName, and AppPath. To do this you would need to:
      • Assign a WebAppId to the dbo.ProxyWebApps table
      • Write a SQL query that would use the dbo.ProxyWebApps data to evaluate all requests in dbo.IisLogs and assign the WebAppId to every request
      • Regenerate all hourly statistics over ALL dimensions using Date, Time, SiteName and AppPath.
    • Determine a new algorithm for ALB groupings that would attempt to make the number of rules in each group 10. But, the algorithm should leave extra space for around 1000~1500 (1.0~1.5 LCU) worth of requests per ALB. The applications with the lowest number of requests should be added to the ALBs at this point.
      • You need to ensure that all applications with the same SiteName have to be grouped together.
      • The base price per hour for an ALB is equivalent to about 2.8 LCUs. So, if you can keep this dimension below 2.8 LCU, it’s cheaper to get charged for the rule evaluations than to create a new ALB.

    Next Up, AWS ALB Price Planning w/ IIS : Rule Evaluations Pt. 2.

      AWS ALB Price Planning w/ IIS : Bandwidth

      on Monday, October 1, 2018

      This post continues the series from AWS ALB Price Planning w/ IIS : Active Connections. Here are a couple things to note about the query to grab Bandwidth:

      • This is one thing that IIS logs can accurately evaluate. You can get the number of bytes sent and received through an IIS/ARR proxy server by turning on the cs-bytes and sc-bytes W3C logging values. (see screen shot below)
      • AWS does their pricing based on average usage per hour. So, the SQL will aggregate the data into hour increments in order to return results.

      [screenshot: enabling the cs-bytes and sc-bytes W3C logging fields]

      Bandwidth SQL Script
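      A sketch of the bandwidth aggregation (not the original script; the CsBytes/ScBytes column names are assumed):

          $query = "
              -- average Mbps and bandwidth LCUs per hour (CsBytes/ScBytes come from cs-bytes/sc-bytes)
              SELECT  LogDate,
                      DATEPART(hour, LogTime) AS LogHour,
                      (SUM(CAST(CsBytes AS bigint) + CAST(ScBytes AS bigint)) * 8.0) / 1000000.0 / 3600.0 AS AvgMbps,
                      SUM(CAST(CsBytes AS bigint) + CAST(ScBytes AS bigint)) / 1073741824.0 AS BandwidthLCU  -- 1 LCU = 1 GB processed per hour
              FROM    dbo.IisLog
              GROUP BY LogDate, DATEPART(hour, LogTime)
              ORDER BY LogDate, LogHour;"
          Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'IisLogs' -Query $query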

      Graphing the output from the script shows:

      • Mbps per Hour (for a month)
        • The jump in the average number of new connections in the beginning of the month corresponded to a return of students to campus. During the beginning of the month, school was not in session and then students returned.
        • The dip at the end of the month has to do with a mistake I made loading some data. There is one day of data for which I forgot to import the IIS logs, but I don’t really want to go back and correct it. It will disappear from the database in about a month.
      • Mbps per Hour LCUs
        • This is a critical number. We put 215+ websites through the proxy server. The two AWS ALB dimensions that will have the biggest impact on the price (the number of LCUs) will be the Bandwidth usage and the Rule Evaluations.
        • I’m very surprised that the average LCUs per hour for a month is around 2.3 LCUs, which is very low.

      [images: Mbps per Hour; Mbps per Hour LCUs]

      Next Up, AWS ALB Price Planning w/ IIS : Rule Evaluations.

      AWS ALB Price Planning w/ IIS : Active Connections

      on Monday, September 24, 2018

      This post continues the series from AWS ALB Price Planning w/ IIS : New Connections. Here are a couple things to note about the query to grab Active Connections:

      • This query is largely based on the same query used in AWS ALB Price Planning w/ IIS : New Connections [link needed].
        • It’s slightly modified by getting the number of connections per minute rather than per second. But, all the same problems that were outlined in the last post are still true.
      • AWS does their pricing based on average usage per hour. So, the sql will aggregate the data into hour increments in order to return results.

      Active Connections SQL Script

        Graphing the output from the script shows:

        • # of Active Connections per Minute by Hour (for a month)
          • The jump in the average number of new connections in the beginning of the month corresponds to a return of students to campus. During the beginning of the month, school was not in session and then the students returned.
          • The dip at the end of the month has to do with a mistake I made loading some data. There is one day of IIS logs that I forgot to import, but I don’t really want to go back and correct the data. It will disappear from the database in about a month.
        • # of Active Connections per Second by Hour Frequency
          • This doesn’t help visualize it as well as I would have hoped. But, it does demonstrate that the number of active connections per minute will usually be less than 3,000; so it will be less than 1 LCU (1 LCU = 3,000 active connections per minute).

        [images: Active Connections per Minute by Hour; Active Connections per Second by Hour Frequency]

        Next Up, AWS ALB Price Planning w/ IIS : Bandwidth.

        AWS ALB Price Planning w/ IIS : New Connections

        on Monday, September 17, 2018

        AWS ALB Pricing is not straightforward, but that’s because they are trying to save their customers money while appropriately covering their costs. The way they have broken up the calculation for pricing indicates that they understand there are multiple different reasons to use an ALB, and they’re only going to charge you for the feature (ie. dimension) that’s most important to you. That feature comes with a resource cost, and they try to charge you appropriately for the resource that’s associated with that feature.

        Today, I’m going to (somewhat) figure out how to calculate one of those dimensions using IIS logs from an on-premises IIS/ARR proxy server. This will help me figure out what the projected costs will be to replace the on-premise proxy server with an AWS ALB. I will need to calculate out all the different dimensions, but today I’m just focusing on New Connections.

        I’m gonna use the database that was created in Putting IIS Logs into a Database will Eat Disk Space. The IisLog table has 9 indexes on it, so we can get some pretty quick results even when the where clauses are ill conceived. Here are a couple things to note about the query to grab New Connections:

        • As AWS notes, most connections have multiple requests flowing through them before they’re closed. And, IIS logs the requests, not the connections. So, you have to fudge the numbers a bit to get the number of new connections. I’m going to assume that each unique IP address per second is a “new connection”.
          • There are all sorts of things wrong with this assumption:
            • Browsers often use multiple connections to pull down webpage resources in parallel. Chrome uses up to 6 at once.
            • I have no idea how long browsers actually hold open connections.
            • Some of the sites use the websocket protocol (wss://) and others use long polling, so there are definitely connections being held open for a long time which aren’t being accounted for.
          • And I’m probably going to reuse this poorly defined “fudging” for the number of Active Connections per Minute (future post / [link needed]).
        • When our internal web app infrastructure reaches out for data using our internal web services, those connections are generally one request per connection. So, for all of the requests that are going to “services”, it will be assumed each request is a new connection.
        • AWS does their pricing based on average usage per hour. So, the sql will aggregate the data into hour increments in order to return results.
        • Sidenote: Because the AWS pricing is calculated per hour, I can’t roll these numbers up into a single “monthly” value. I will need to calculate out all the dimensions for each hour before having a price calculation for that hour. And, one hour is the largest unit of time that I can “average”. After that, I have to sum the results to find out a “monthly” cost.

        New Connections SQL Script
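        A sketch of the aggregation this heading refers to (not the original script; column names are guesses, and the "one request per connection for services" rule is left out for brevity):

            $query = "
                -- approximate new connections: distinct client IPs per second, averaged over each hour
                -- (seconds with zero requests aren't counted, so this slightly overestimates the average)
                WITH PerSecond AS (
                    SELECT  LogDate,
                            DATEPART(hour, LogTime) AS LogHour,
                            LogTime,
                            COUNT(DISTINCT ClientIp) AS NewConnections
                    FROM    dbo.IisLog
                    GROUP BY LogDate, DATEPART(hour, LogTime), LogTime
                )
                SELECT  LogDate, LogHour,
                        AVG(NewConnections * 1.0)        AS AvgNewConnectionsPerSec,
                        AVG(NewConnections * 1.0) / 25.0 AS NewConnectionLCU   -- 1 LCU = 25 new connections/sec
                FROM    PerSecond
                GROUP BY LogDate, LogHour
                ORDER BY LogDate, LogHour;"
            Invoke-Sqlcmd -ServerInstance 'localhost' -Database 'IisLogs' -Query $query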

        Graphing the output from the script shows:

        • # of New Connections per Second by Hour (for a month)
          • The jump in the average number of new connections in the beginning of the month corresponds to a return of students to campus. During the beginning of the month, school was not in session and then the students returned.
          • The dip at the end of the month has to do with a mistake I made loading some data. There is one day of IIS logs that I forgot to import, but I don’t really want to go back and correct the data. It will disappear from the database in about a month.
        • # of New Connections per Second by Hour Frequency
          • This just helps to visualize where the systems averages are at. It helps show that most hours will be less than 30 connections per second; which is less than 2 LCU. (1 LCU = 25 new connections per second)

        [images: New Connections per Second by Hour; New Connections per Second by Hour Frequency]

        Next Up, AWS ALB Price Planning w/ IIS : Active Connections.

        Putting IIS Logs into a Database will Eat Disk Space

        on Monday, September 3, 2018

        The first thing is, Don’t Do This.

        I needed to put our main proxy server’s IIS logs into a database in order to calculate the total bytes sent and total bytes received over time. The reason for the analytics was to estimate the expected cost of running a similar setup in AWS with an Application Load Balancer.

        To load the data, I copied a PowerShell script from this blog (Not so many…), which was a modification of the original script from the same blog. The script is below. It is meant to be run as a scheduled task and to log the details of each run using PowerShellLogging. The script is currently set up to only import a single day of data, but it can be altered to load many days without much effort.
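        That script isn't reproduced here, but the core idea is roughly this simplified sketch (the log folder path is a placeholder, real W3C field names contain dashes, and a real import needs a proper bulk load rather than a row-at-a-time pipeline):

            # read yesterday's W3C log files and push the parsed rows into dbo.IisLog
            $logDay   = (Get-Date).AddDays(-1).ToString('yyMMdd')
            $logFiles = Get-ChildItem 'D:\IISLogs' -Recurse -Filter "u_ex$logDay.log"

            foreach ($file in $logFiles) {
                $fields = $null
                $rows = foreach ($line in Get-Content $file.FullName) {
                    if ($line -like '#Fields:*') { $fields = $line.Substring(9).Split(' '); continue }
                    if ($line.StartsWith('#') -or -not $fields) { continue }
                    $values = $line.Split(' ')
                    $row = [ordered]@{}
                    for ($i = 0; $i -lt $fields.Count; $i++) { $row[$fields[$i]] = $values[$i] }
                    [pscustomobject]$row
                }
                # Write-SqlTableData is from the SqlServer module; the column mapping is assumed to line up
                $rows | Write-SqlTableData -ServerInstance 'localhost' -DatabaseName 'IisLogs' `
                                           -SchemaName 'dbo' -TableName 'IisLog' -Force
            }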

        But I want to focus on the size of the database and the time it takes to load.

        IIS Log Information

        90 Sites
        38 days of logs
        58.8 GB of IIS Logs

        Database Configuration Information

        My Personal Work Machine (not an isolated server)
        MS SQL Server 2016 SP 1
        1TB SSD Drive
        Limited to 5GB Memory
        Core i7-6700
        Windows 10 1803

        First Attempt – An Unstructured Database

        This was “not thinking ahead” in a nutshell. I completely ignored the fact that I needed to query the data afterwards and simply loaded all of it into a table which contained no Primary Key or Indexes.

        The good news was it loaded “relatively” quickly.

        Stats

        • 151 Million Records
        • 161 GB of Disk Space (about 2.7× the size of the raw logs)
        • 7h 30m Running Time

        The data was useless as I couldn’t look up anything without a full table scan. I realized this problem before running my first query, so I have no data on how long that would have taken; but I figure it would have been a long time.

        First Attempt Part 2 – Adding Indexes (Bad Idea)

        Foolishly, I thought I could add the indexes to the table. So, I turned on Simple Logging and tried to add a Primary Key.

        Within 1h 30m the database had grown to over 700 GB and a lot of error messages started popping up. I had to forcefully stop MSSQL Server and delete the .mdf/.ldf files by hand.

        So, that was a really bad idea.

        Second Attempt – Table with Indexes

        This time I created a table with 9 indexes (1 PK, 8 IDX) before loading the data. Script below.

        With the additional indexes and a primary key having a different sort order than the way the data was being loaded, it took significantly longer to load.

        Stats

        • 151 Million Records
        • 362 GB of Disk Space (about 6.2× the size of the raw logs)
          • 77 GB Data
          • 288 GB Indexes
        • 25h Running Time

        I was really surprised to see the indexes taking up that much space. It was a lot of indexes, but I wanted to be covered for a lot of different querying scenarios.

        Daily Imports

        Stats

        • 3.8 Million Records
        • 6 GB of Disk Space
        • 30m Running Time

        Initial Thoughts …

        Don’t Do This.

        There are a lot of great log organizing companies out there: Splunk, New Relic, DataDog, etc. I have no idea how much they cost, but the amount of space and the amount of time it takes to organize this data for querying absolutely justifies the need for their existence.

        Use PowerShell to Process Dump an IIS w3wp Process

        on Monday, August 27, 2018

        Sometimes processes go wild and you would like to collect information on them before killing or restarting the process. The collection options are generally:

        • Your custom made logging
        • Open source logging: Elmah, log4Net, etc
        • Built in logging on the platform (like AppInsights)
        • Event Viewer Logs
        • Log aggregators Splunk, New Relic, etc
        • and, almost always last on the list, a Process Dump

        Process dumps are old enough that they are very well documented, but obscure enough that very few people know how or when to use them. I certainly don’t! But, when you’re really confused about why an issue is occurring a process dump may be the only way to really figure out what was going on inside of a system.

        Unfortunately, they are so rarely used that it’s often difficult to re-learn how to get a process dump when an actual problem is occurring. Windows tried to make things easier by adding Create dump file as an option in the Task Manager.

        [screenshot: Task Manager’s “Create dump file” option]

        But, logging onto a server to debug a problem is becoming a less frequent occurrence. With Cloud systems the first debugging technique is to just delete the VM/Container/App Service and create a new instance. And, On-Premise web farms are often interacted with through scripting commands.

        So here’s another one: New-WebProcDump

        This command will take in a ServerName and Url and attempt to take a process dump and put it in a shared location. It does require a number of pre-requisites to work:

        • The Powershell command must be in a folder with a subfolder named Resources that contains procdump.exe.
        • Your web servers are using IIS and ASP.NET Full Framework
        • The computer running the command has a D drive
          • The D drive has a Temp folder (D:\Temp)
        • Remote computers (ie. Web Servers) have a C:\IT\Temp folder.
        • You have PowerShell Remoting (ie. winrm quickconfig -force) turned on for all the computers in your domain/network.
        • The application pools on the Web Server must have names that match up with the url of the site. For example https://unittest.some.company.com should have an application pool of unittest.some.company.com. A second example would be https://unittest.some.company.com/subsitea/ should have an application pool of unittest.some.company.com_subsitea.
        • Probably a bunch more that I’m forgetting.

        So, here are the scripts that make it work:

        • WebAdmin.New-WebProcDump.ps1

          Takes a procdump of the w3wp process associated with a given url (either locally or remote). Transfers the process dump to a communal shared location for retrieval.
        • WebAdmin.Test-WebAppExists.ps1

          Checks if an application pool exists on a remote server.
        • WebAdmin.Test-IsLocalComputerName.ps1

          Tests if the command will need to run locally or remotely.
        • WebAdmin.ConvertTo-UrlBasedAppPoolName.ps1

          The name kind of covers it. For example https://unittest.some.company.com should have an application pool of unittest.some.company.com. A second example would be https://unittest.some.company.com/subsitea/ should have an application pool of unittest.some.company.com_subsitea.
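        Pulling those helpers together, the core flow is roughly the following sketch (procdump flags, temp paths, and the shared location follow the prerequisites above; the helper parameter names are assumed, and this is not the full original script):

            function New-WebProcDump {
                param(
                    [Parameter(Mandatory)] [string] $ServerName,
                    [Parameter(Mandatory)] [string] $Url
                )

                # e.g. https://unittest.some.company.com/subsitea/ -> unittest.some.company.com_subsitea
                $appPoolName = ConvertTo-UrlBasedAppPoolName -Url $Url   # helper from the list above

                # make sure procdump.exe is available on the remote server
                Copy-Item (Join-Path $PSScriptRoot 'Resources\procdump.exe') "\\$ServerName\C$\IT\Temp\" -Force

                Invoke-Command -ComputerName $ServerName -ArgumentList $appPoolName -ScriptBlock {
                    param($appPoolName)
                    Import-Module WebAdministration

                    # find the w3wp.exe worker process backing the app pool and dump it
                    $w3wp = Get-ChildItem "IIS:\AppPools\$appPoolName\WorkerProcesses" | Select-Object -First 1
                    if (-not $w3wp) { throw "No running worker process found for '$appPoolName'" }

                    & 'C:\IT\Temp\procdump.exe' -accepteula -ma $w3wp.processId "C:\IT\Temp\$appPoolName.dmp"
                }

                # copy the dump back to the communal location for retrieval
                Copy-Item "\\$ServerName\C$\IT\Temp\$appPoolName.dmp" 'D:\Temp\' -Force
            }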


        IIS Proxy & App Web Performance Optimizations Pt. 4

        on Friday, March 16, 2018

        Last time we took the new architecture to its theoretical limit and pushed more of the load toward the database. This time …

        What we changed (Using 2 Load Test Suites)

             
        • Turn on Output Caching on the Proxy Server, which defaults to caching js, css, and images. This works really well with really old sites.
        • We also lowered the number of users as the Backend Services ramped up to 100%.
        • Forced Test Agents to run in 64-bit mode. This resolved an Out Of Memory exception that we were getting when the Test Agents were running into the 2 GB memory cap of their 32-bit processes.
        • Found a problem with the Test Suite that was allowing all tests to complete without hitting the backend service. (This really affected the number of calls that made it to the Impacted Backend Services.)
        • Added a second Test Suite which also used the same database. The load on this suite wasn’t very high; it just added more real world requests.

        Test Setup

        • Constant Load Pattern
          • 1000 users
          • 7 Test Agents (64-bit mode)
        • Main Proxy
          • 4 vCPU / 8 vCore
          • 24 GB RAM
          • AppPool Queue Length: 50,000
          • WebFarm Request Timeout: 120 seconds
          • Output Caching (js, css, images)
        • Impacted Web App Server
          • 3 VMs
          • AppPool Queue Length: 50,000
        • Impacted Backend Service Server
          • 8 VMs
        • Classic ASP App
          • CDNs used for 4 JS files and 1 CSS file
            • Custom JS and CSS coming from Impacted Web App
            • Images still coming from Impacted Web App
          • JS is minified
        • VS 2017 Test Suite
          • WebTest Caching Enabled
          • A 2nd Test Suite which Impacts other applications in the environment is also run. (This is done off a different VS 2017 Test Controller)

        Test Results

        • Main Proxy
          • CPU: 28% (down 37)
          • Max Concurrent Connections: – (Didn't Record)
        • Impacted Web App
          • CPU: 56% (down 10)
        • Impacted Backend Service
          • CPU: 100% (up 50)
        • DB
          • CPU: 30% (down 20)
        • VS 2017 Test Suite
          • Total Tests: 95,000 (down 30,000)
          • Tests/Sec: 869 (down 278)

            This more “real world” test really highlighted that the impacted systems weren’t going to have a huge impact on the database shared by the other systems which will be using it at the same time.

            We had successfully moved the load from the Main Proxy onto the backend services, but not all the way to the database. With some further testing we found that adding CPUs and new VMs to the Impacted Backend Servers had a direct 1:1 relationship with handling more requests. The unfortunate side of that is that we weren’t comfortable with the cost of the CPUs compared to the increased performance.

            The real big surprise was the significant CPU utilization decrease that came from turning on Output Caching on the Main Proxy.

            And, with that good news, we called it a day.

            So, the final architecture looks like this …

            [diagram: final architecture]

            What we learned …

            • SSL Encryption/Decryption can put a significant load on your main proxy/load balancer server. The number of requests processed by that server will directly scale into CPU utilization. You can reduce this load by moving static content to CDNs.
            • Even if your main proxy/load balancer does SSL offloading and requests to the backend services aren’t SSL encrypted, the extra socket connections still have an impact on the server’s CPU utilization. You can lower this impact on both the main proxy and the Impacted Web App servers by using Output Caching for static content (js, css, images).
              • We didn’t have the need to use bundling and we didn’t have the ability to do spriting; but we would strongly encourage anyone to use those if they are an option.
            • Moving backend service requests to an internal proxy doesn’t significantly lower the number of requests through the main proxy. It’s really images that create the largest number of requests to render a web page (especially with an older Classic ASP site).
            • In Visual Studio, double check that your suite of web tests are doing exactly what you think they are doing. Also, go the extra step and check that the HTTP Status Code returned on each request is the code that you expect. If you expect a 302, check that it’s a 302 instead of considering a 200 to be satisfactory.

            IIS Proxy & App Web Performance Optimizations Pt. 3

            on Monday, March 12, 2018

            We left off last time after resolving 3rd party JS and CSS files from https://cdnjs.com/ CDNs, and after having raised the Main Proxy server’s Application Pool Queue Length from 1,000 to 50,000.

            We are about to add more CPUs to the Main Proxy and see if that improves throughput.

            What we changed (Add CPUs)

            • Double the number of CPUs to 4 vCPU / 8 vCore.
              • So far the number of connections into the proxy directly correlates to the amount of CPU utilization / load. Hopefully, by adding more processing power, we can scale up the number of Test Agents and the overall load.

            Test Setup

            • Step Load Pattern
              • 1000 initial users, 200 users every 10 seconds, max 4000 users
              • 4 Test Agents
            • Main Proxy
              • 4 vCPU / 8 vCore
              • 24 GB RAM
              • AppPool Queue Length: 50,000
              • WebFarm Request Timeout: 30 seconds (default)
            • Impacted Web App Server
              • 2 VMs
            • Impacted Backend Service Server
              • 6 VMs
            • Classic ASP App
              • CDNs used for 4 JS files and 1 CSS file
                • Custom JS and CSS coming from Impacted Web App
                • Images still coming from Impacted Web App
              • JS is minified
            • VS 2017 Test Suite
              • WebTest Caching Enabled

            Test Results

            • Main Proxy
              • CPU: 65% (down 27)
              • Max Concurrent Connections: 15,000 (down 2,500)
            • Impacted Web App
              • CPU: 87%
            • Impacted Backend Service
              • CPU: 75%
            • VS 2017 Test Suite
              • Total Tests: 87,000 (up 22,000)
              • Tests/Sec: 794 (up 200)

            Adding the processing power seemed to help out everything. The extra processors allowed more requests to be processed in parallel, which let requests be passed through and completed quicker, lowering the number of concurrent requests. With the increased throughput, the number of Tests that could be completed went up, which increased the Tests/Sec.

            Adding more CPUs to the Proxy helps everything in the system move faster. It parallelizes the requests flowing through it and prevents process contention.

            So, where does the new bottleneck exist?

            Now that the requests are making it to the Impacted Web App servers, the CPU load has transferred to them and their associated Impacted Backend Services. This is a good thing. We’re moving the load further down the stack. Doing that successfully would push the load down to the database (DB), which is currently not under much load at all.

            [image]

            What we changed (Add more VMs)

            • Added 1 more Impacted Web App Server
            • Added 2 more Impacted Backend Services Servers
              • The goal with these additions is to use parallelization to allow more requests to be processed at once and push the bottleneck toward the database.

            Test Setup

            • Step Load Pattern
              • 1000 initial users, 200 users every 10 seconds, max 4000 users
              • 4 Test Agents
            • Main Proxy
              • 4 vCPU / 8 vCore
              • 24 GB RAM
              • AppPool Queue Length: 50,000
              • WebFarm Request Timeout: 30 seconds (default)
            • Impacted Web App Server
              • 3 VMs
            • Impacted Backend Service Server
              • 8 VMs
            • Classic ASP App
              • CDNs used for 4 JS files and 1 CSS file
                • Custom JS and CSS coming from Impacted Web App
                • Images still coming from Impacted Web App
              • JS is minified
            • VS 2017 Test Suite
              • WebTest Caching Enabled

            Test Results

            • Main Proxy
              • CPU: 62% (~ the same)
              • Max Concurrent Connections: 14,000 (down 1,000)
            • Impacted Web App
              • CPU: 60%
            • Impacted Backend Service
              • CPU: 65%
            • VS 2017 Test Suite
              • Total Tests: 95,000 (up 8,000)
              • Tests/Sec: 794 (~ the same)

                The extra servers helped get requests through the system faster. So, the overall number of Tests that completed increased. This helped push the load a little further down.

                The Cloud philosophy of handling more load simultaneously through parallelization works. Obvious, right?

                So, in that iteration, there was no bottleneck. And, we are hitting numbers similar to what we expect on the day of the event. But, what we really need to do is leave ourselves some head room in case more users show up than we expect. So, let’s add in more Test Agents and see what it can really handle.

                What we changed (More Users Than We Expect)

                • Added more Test Agents in order to overload the system.

                Test Setup

                • Step Load Pattern
                  • 2000 initial users, 200 users every 10 seconds, max 4000 users
                  • 7 Test Agents
                • Main Proxy
                  • 4 vCPU / 8 vCore
                  • 24 GB RAM
                  • AppPool Queue Length: 50,000
                  • WebFarm Request Timeout: 30 seconds (default)
                • Impacted Web App Server
                  • 3 VMs
                • Impacted Backend Service Server
                  • 8 VMs
                • Classic ASP App
                  • CDNs used for 4 JS files and 1 CSS file
                    • Custom JS and CSS coming from Impacted Web App
                    • Images still coming from Impacted Web App
                  • JS is minified
                • VS 2017 Test Suite
                  • WebTest Caching Enabled

                Test Results

                • Main Proxy
                  • CPU: 65% (~ same)
                  • Max Concurrent Connections: 18,000 (up 4,000)
                • Impacted Web App
                  • CPU: 63%
                • Impacted Backend Service
                  • CPU: 54%
                • VS 2017 Test Suite
                  • Total Tests: 125,000 (up 30,000)
                  • Tests/Sec: 1147 (up 282)

                      So, the “isolated environment” limit is pretty solid but we noticed that at these limits the response time on the requests had slowed down in the beginning of the Test iteration.

                      .asp Page Response Times

                      [chart: .asp page response times]

                      The theory is that with 7 Test Agents, all of which started out with 2,000 initial users and no caches primed, the requests for js, css, and images swamped the Main Proxy and the Impacted Web App servers. Once the caches started being used in the tests, things smoothed out and stabilized.

                      From this test we found two error messages occurring on the proxy: 502.3 Gateway Timeout and 503 Service Unavailable. Looking at the IIS logs on the Impacted Web App server we could see that many requests (both 200 and 500 return status codes) were resolving with a Win32 Status Code of 64.

                      To resolve the Proxy 502.3 and the Impacted Web App Win32 Status Code 64 problems, we increased the Web Farm Request Timeout to 120 seconds. This isn’t ideal, but as you can see in the chart above, the average response time is consistently quick. So, this will ensure all users get a response, even though some may have a severely degraded experience. Chances are, their next request will process quickly.

                      Happily, the 503 Service Unavailable errors were not being generated on the Main Proxy server. They were actually being generated on the Impacted Web App servers, which still had their Application Pool Queue Length set to the default 1,000 requests. We increased those to 50,000 and that removed the problem.

                      Next Time …

                      We’ll add another Test Suite to run along side it and look into more Caching.

                      IIS Proxy & App Web Performance Optimizations Pt. 2

                      on Friday, March 9, 2018

                      Continuing from where we left off in IIS Proxy & App Web Performance Optimizations Pt. 1, we’re now ready to run some initial tests and get some performance baselines.

                      The goal of each test iteration is to attempt to load the systems to a point that a bottleneck occurs and then find how to relieve that bottleneck.

                      Initial Test Setup

                      • Step Load Pattern
                        • 100 initial users, 20 users every 10 seconds, max 400 users
                        • 1 Test Agent

                      Initial Test Results

                      There was no data really worth noting on this run, as we found the addition of the second proxy server lowered the overhead on the Main Proxy enough that no systems were a bottleneck at this point. So, we added more Test Agents and re-ran the test with:

                      Real Baseline Test Setup

                      • Step Load Pattern
                        • 1000 initial users, 200 users every 10 seconds, max 4000 users
                        • 3 Test Agents
                      • Main Proxy
                        • 2 vCPU / 4 vCore
                        • 24 GB RAM
                        • AppPool Queue Length: 1000 (default)
                        • WebFarm Request Timeout: 30 seconds (default)
                      • Impacted Web App Server
                        • 2 VMs
                      • Impacted Backend Service Server
                        • 6 VMs
                      • Classic ASP App
                        • No CDNs used for JS, CSS, or images
                        • JS is minified
                      • VS 2017 Test Suite
                        • WebTest Caching Disabled

                      Real Baseline Test Results

                      • Main Proxy
                        • CPU: 99%
                        • Max Concurrent Connections: 17,000
                      • VS 2017 Test Suite
                        • Total Tests: 37,000
                        • Tests/Sec: 340

                      In this test we discovered that around 14,000 connections was the limit of the Main Proxy before we started to receive 503 Service Unavailable responses. We didn’t yet understand that there was more to it, but we set about trying to lower the number of connections by lowering the number of requests for js, css, and images. Looking through the IIS logs we also saw the majority of requests were for static content, which made it look like information wasn’t being cached between calls. So, we found a setting in VS 2017’s Web Test that allowed us to enable caching. (We also saw a lot of the SocketExceptions mentioned in the previous post, but we didn’t understand what they meant at that time.)

                      What we changed (CDNs and Browser Caching)

                      • We took all of the 3rd party JS and CSS files that we use and referenced them from https://cdnjs.com/ CDNs. In total, there were 4 js files and 1 css file.
                        • The reason this hadn’t been done before is there wasn’t enough time to test the fallback strategies: if the CDN doesn’t serve the js/css, the browser should request the files from our servers. We implemented those fallbacks this time.
                      • We updated the VS 2017 Web Test configuration to enable caching. Whenever a new Test scenario is run, the test agent will not have caching enabled in order to replicate a “new user” experience; each subsequent call in the scenario will use cached js, css, and images. (This cut around 50% of the requests made in the baseline test)
                        • The majority of the requests into the Main Proxy were image requests. But, the way the application was written we couldn’t risk a) moving the images to a CDN or b) spriting the images. (It is a Classic ASP app, so it doesn’t have all the bells and whistles that newer frameworks have)

                      Test Setup

                      • Step Load Pattern
                        • 1000 initial users, 200 users every 10 seconds, max 4000 users
                        • 3 Test Agents
                      • Main Proxy
                        • 2 vCPU / 4 vCore
                        • 24 GB RAM
                        • AppPool Queue Length: 1000 (default)
                        • WebFarm Request Timeout: 30 seconds (default)
                      • Impacted Web App Server
                        • 2 VMs
                      • Impacted Backend Service Server
                        • 6 VMs
                      • Classic ASP App
                        • CDNs used for 4 JS files and 1 CSS file
                          • Custom JS and CSS coming from Impacted Web App
                          • Images still coming from Impacted Web App
                        • JS is minified
                      • VS 2017 Test Suite
                        • WebTest Caching Enabled

                      Test Results

                      • Main Proxy
                        • CPU: 82% (down 17)
                        • Max Concurrent Connections: 10,400 (down 6,600)
                      • VS 2017 Test Suite
                        • Total Tests: 69,000 (up 32,000)
                        • Tests/Sec: 631 (up 289, but with 21% failure rate)

                      Offloading the common third party js and css files really lowered the number of requests into the Main Proxy server (38% lower). And, with that overhead removed, the CPU utilization came down from a pegged 99% to 82%.

                      Because caching was also enabled, the test suite was able to churn through the follow-up page requests much quicker. That increase in rate nearly doubled the number of Tests/Sec completed.

                      Move 3rd party static content to CDNs when possible (https://cdnjs.com/ is a great service). When doing so, try to implement fallbacks so that resources which fail to load from the CDN are requested from your own servers instead.

                      But, we still had high CPU utilization on the Main Proxy. And, we had a pretty high failure rate, with lots of 503 Service Unavailable errors and some 502.3 Gateway Timeouts. We determined the cause of the 503s was that the Application Pool’s queue length was being hit. We considered this to be the new bottleneck.

                      What we changed (Application Pool Queue Length and Test Agents)

                      • We set the application pool queue length from 1,000 to 50,000. This would allow us to queue up more requests and lower the 503 Service Unavailable error rate.
                      • We also had enough head room in the CPU to add another Test Agent.

                      Test Setup

                      • Step Load Pattern
                        • 1000 initial users, 200 users every 10 seconds, max 4000 users
                        • 4 Test Agents
                      • Main Proxy
                        • 2 vCPU / 4 vCore
                        • 24 GB RAM
                        • AppPool Queue Length: 50,000
                        • WebFarm Request Timeout: 30 seconds (default)
                      • Impacted Web App Server
                        • 2 VMs
                      • Impacted Backend Service Server
                        • 6 VMs
                      • Classic ASP App
                        • CDNs used for 4 JS files and 1 CSS file
                          • Custom JS and CSS coming from Impacted Web App
                          • Images still coming from Impacted Web App
                        • JS is minified
                      • VS 2017 Test Suite
                        • WebTest Caching Enabled

                      Test Results

                      • Main Proxy
                        • CPU: 92% (up 10)
                        • Max Concurrent Connections: 17,500 (up 7,100)
                      • VS 2017 Test Suite
                        • Total Tests: 65,000 (down 4,000)
                        • Tests/Sec: 594 (down 37, but with only 3% failure rate)

                      This helped fix the failure rate issue. Without all the 503s forcing the Tests to end early, it took slightly longer to complete each test and that caused the number of Tests/Sec to fall a bit. This also meant we had more requests queued up, bringing the number of concurrent connections back up.

                      For heavily trafficked sites, set your Application Pool Queue Length well above the default 1,000 requests. This is only needed if you don’t have a Network Load Balancer in front of your proxy.
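                      For reference, here’s a minimal sketch of making that change with the WebAdministration module; the application pool name below is hypothetical:

                      Import-Module WebAdministration
                      
                      # raise the HTTP.sys request queue length for the proxy's application pool
                      # (the default is 1,000; we used 50,000 in this test)
                      Set-ItemProperty "IIS:\AppPools\MainProxyAppPool" -Name queueLength -Value 50000
                      
                      # confirm the new value
                      (Get-ItemProperty "IIS:\AppPools\MainProxyAppPool" -Name queueLength).Value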

                      At this point we were very curious what would happen if we added more processors to the Main Proxy. We were also curious what the average response time was from the Classic .asp pages. (NOTE: all the js, css, and image response times are higher than the page result time.)

                      .asp Page Response Times on Proxy

                      [image: .asp page response times on the proxy]

                      Next Time …

                      We’ll add more CPUs to the proxy and see if we can’t push the bottleneck further down the line.

                      IIS Proxy & App Web Performance Optimizations Pt. 1

                      on Monday, March 5, 2018

                      We’re ramping up towards a day when our web farm fields around 40 times the normal load. It’s not much load compared to truly popular websites, but it’s a lot more than what we normally deal with. It’s somewhere on the order of 50,000 people trying to use the system in an hour, and the majority of the users hit the system in the first 15 minutes of the hour.

                      So, of course, we tried to simulate more than the expected load in our test environment and see what sort of changes we can make to ensure stability and responsiveness.

                      A quick note: This won’t be very applicable to Azure/Cloud based infrastructure. A lot of this will be done for you on the Cloud.

                      Web Farm Architecture

                      These systems run in a private Data Center. So, the servers and software don’t have a lot of the very cool features that the cloud offers.

                      The servers are all Win 2012 R2, IIS 8.5 with ARR 3.0, URL Rewrite 7.2, and Web Farm Framework 1.1.

                      Normally, the layout of the systems is similar to this diagram. It gives a general idea that there is a front-end proxy, a number of applications, backend services, and a database, all of which are involved in this yearly event. A single Web App is hit significantly, and its main supporting Backend Service is also hit hard. The Backend Service is also shared by the other Web Apps involved in the event, but they are not the main clients during that hour.

                      [image: normal web farm architecture diagram]

                      Testing Setup

                      For testing we are using Visual Studio 2017 with a Test Controller and several Agents. It’s a very simple web test suite with a single scenario, which is the main use case during that hour: a user logs in to check their status, and then may take a few actions on other web applications.

                      Starting Test Load

                      • Step Pattern
                      • 100 users, 10 user step every 10 seconds, max 400 users
                      • 1 Agent

                      We eventually get to this Test Load

                      • Step Pattern
                      • 1000 users, 200 user step every 10 seconds, max 2500 users
                      • 7 agents

                      We found that over 2500 concurrent users would result in SocketExceptions on the Agent machines. Our belief is that each agent attempts to run the max user load defined by the test, and that the Agent process runs out of resources (sockets?) needed to spawn new users to make calls, which results in the SocketExceptions. To alleviate the issue, we added more Agents to the Controller and lowered the maximum number of concurrent users.

                      SocketExceptions on VS 2017 Test Agents can be prevented by lowering the maximum number of concurrent users. (You can then add in more Agents to the Test Controller in order to get the numbers back up.)
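                      If you want to see whether an Agent is getting close to that limit while a test is running, a quick check of its TCP connection states and ephemeral port range looks something like the sketch below (this is our guess at the cause, not a confirmed diagnosis; requires Win 8 / Win 2012 or later):

                      # count connections by state; a large pile of TimeWait/Established entries
                      # during a run suggests the agent is burning through ephemeral ports
                      Get-NetTCPConnection | Group-Object State | Sort-Object Count -Descending |
                          Select-Object Name, Count
                      
                      # show the dynamic (ephemeral) port range the agent has to work with
                      netsh int ipv4 show dynamicport tcp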

                      Initial Architecture Change

                      We’ve been through this event for many years, so we already have some standard approaches that we take every year to help with the load:

                      • Add more Impacted Backend Service servers
                      • Add more CPU/Memory to the Impacted Web App

                      This year we went further by:

                      • Adding another proxy server to ensure Backend Service Calls from the Impacted Web App don’t route through the Main Proxy to the Impacted Backend Services. This helps reduce the number of connections through the Main Proxy.
                      • Adding 6 more Impacted Backend Service servers. These are the servers that always take the worst hit. These servers don’t need sticky sessions, so they can easily spread the load between them.
                      • Adding a second Impacted Web App server. This server usually doesn’t have the same level of high CPU load that the Proxy and Impacted Backend Services do. These servers do require sticky sessions, so there are potential issues with the load not being balanced.

                      If you don’t have to worry about sticky sessions, adding more processing servers can always help distribute the load. That’s why Cloud based services with “Sliders” are fantastic!
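                      As an aside, adding the extra Impacted Backend Service servers to their web farm can be scripted against the webFarms section that ARR/Web Farm Framework keeps in applicationHost.config. A sketch (the farm and server names here are made up):

                      Import-Module WebAdministration
                      
                      # add a new server to an existing web farm defined in applicationHost.config
                      Add-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' `
                          -Filter "webFarms/webFarm[@name='ImpactedBackendFarm']" `
                          -Name "." `
                          -Value @{ address = 'backend-svc-07' }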

                      [image: updated web farm architecture diagram]

                      Next Time …

                      In the next section we’ll look at the initial testing results and the lessons learned on each testing iteration.

                      Self-Signed Certificates for Win10

                      on Friday, November 24, 2017

                      Browsers have implemented all sorts of great new security measures to ensure that certificates are actually valid. So, using a self-signed certificate today is more difficult than it used to be. Also, IIS on Win8/10 gained the ability to use a Central Certificate Store. So, here are some scripts that:

                      • Create a Self-Signed Cert
                        • Creates a self-signed cert with a DNS Name (browsers don’t like it when the Subject Alternative Name doesn’t list the DNS Name).
                        • Creates a Shared SSL folder on disk and grants read permissions to the account that IIS’s Central Certificate Store uses to read the certs.
                        • Exports the cert to the Shared SSL folder as a .pfx.
                        • Re-imports the cert into the machine’s Trusted Root Authorities store (needed for browsers to verify the cert is trusted)
                        • Adds the 443/SSL binding to the site (if it exists) in IIS
                      • Re-Add Cert to Trusted Root Authority
                        • Before Win10, Microsoft implemented a background task which periodically checks the certs installed in your machine’s Trusted Root Authorities store and removes any that are self-signed. So, this script re-installs them.
                        • It looks through the shared SSL folder created by the previous script and adds back any certs that are missing from the local machine’s Trusted Root Authorities store.
                      • Re-Add Cert to Trusted Root Authority Scheduled Task
                        • Schedules the script to run on a recurring Scheduled Task (the install script below creates a daily task)
                      ### Create-SelfSignedCert.ps1
                      
                      $name = "site.name.com" # only need to edit this
                      
                      
                      # get the shared ssl password for dev - this will be applied to the cert
                      # (Export-PfxCertificate and Import-PfxCertificate expect a SecureString)
                      $pfxPassword = ConvertTo-SecureString "your pfx password" -AsPlainText -Force
                      
                      # you can only create a self-signed cert in the \My store
                      $certLoc = "Cert:\LocalMachine\My"
                      $cert = New-SelfSignedCertificate `
                                  -FriendlyName $name `
                                  -KeyAlgorithm RSA `
                                  -KeyLength 4096 `
                                  -CertStoreLocation $certLoc `
                                  -DnsName $name
                      
                      # ensure the directory for the central certificate store exists and is set up with permissions
                      # NOTE: This assumes that IIS is already set up with the Central Certificate Store, where
                      #       1) the user account is "Domain\AccountName", and
                      #       2) the Certificate Private Key Password is $pfxPassword
                      $sharedPath = "D:\AllContent\SharedSSL\Local"
                      if((Test-Path $sharedPath) -eq $false) {
                          mkdir $sharedPath
                      
                          $acl = Get-Acl $sharedPath
                          $objUser = New-Object System.Security.Principal.NTAccount("Domain\AccountName") 
                      	$rule = New-Object System.Security.AccessControl.FileSystemAccessRule($objUser, "ReadAndExecute,ListDirectory", "ContainerInherit, ObjectInherit", "None", "Allow")
                      	$acl.AddAccessRule($rule)
                      	Set-Acl $sharedPath $acl
                      }
                      
                      
                      # export from the \My store to the Central Cert Store on disk
                      $thumbprint = $cert.Thumbprint
                      $certPath = "$certLoc\$thumbprint"
                      $pfxPath = "$sharedPath\$name.pfx"
                      if(Test-Path $pfxPath) { del $pfxPath }
                      Export-PfxCertificate `
                          -Cert $certPath `
                          -FilePath $pfxPath `
                          -Password $pfxPassword
                      
                      
                      # reimport the cert into the Trusted Root Authorities
                      $authRootLoc = "Cert:\LocalMachine\AuthRoot"
                      Import-PfxCertificate `
                          -FilePath $pfxPath `
                          -CertStoreLocation $authRootLoc `
                          -Password $pfxPassword `
                          -Exportable
                      
                      
                      # delete it from the \My store
                      del $certPath # removes from cert:\localmachine\my
                      
                      
                      # if the website doesn't have the https binding, add it
                      Import-Module WebAdministration
                      
                      if(Test-Path "IIS:\Sites\$name") {
                          $httpsBindings = Get-WebBinding -Name $name -Protocol "https"
                          $found = $httpsBindings |? { $_.bindingInformation -eq "*:443:$name" -and $_.sslFlags -eq 3 }
                          if($found -eq $null) {
                              New-WebBinding -Name $name -Protocol "https" -Port 443 -IPAddress "*" -HostHeader $name -SslFlags 3
                          }
                      }
                      ### Add-SslCertsToAuthRoot.ps1
                      
                      $Error.Clear()
                      
                      Import-Module PowerShellLogging
                      $name = "Add-SslCertsToAuthRoot"
                      $start = [DateTime]::Now
                      $startFormatted = $start.ToString("yyyyMMddHHmmss")
                      $logdir = "E:\Logs\Scripts\IIS\$name"
                      $logpath = "$logdir\$name-log-$startFormatted.txt"
                      $log = Enable-LogFile $logpath
                      
                      try {
                      
                          #### FUNCTIONS - START ####
                          Function Get-X509Certificate {
                      	Param (
                              [Parameter(Mandatory=$True)]
                      		[ValidateScript({Test-Path $_})]
                      		[String]$PfxFile,
                      		[Parameter(Mandatory=$True)]
                      		[string]$PfxPassword=$null
                      	)
                      
                      	    # Create new, empty X509 Certificate (v2) object
                      	    $X509Certificate = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2
                      
                      	    # Call class import method using password
                              try {
                      			$X509Certificate.Import($PfxFile,$PfxPassword,"PersistKeySet")
                      			Write-Verbose "Successfully accessed Pfx certificate $PfxFile."
                      		} catch {
                      			Write-Warning "Error processing $PfxFile. Please check the Pfx certificate password."
                      			Return $false
                      		}
                      	
                              Return $X509Certificate
                          }
                      
                          # http://www.orcsweb.com/blog/james/powershell-ing-on-windows-server-how-to-import-certificates-using-powershell/
                          Function Import-PfxCertificate {
                          Param(
                      	    [Parameter(Mandatory = $true)]
                      	    [String]$CertPath,
                      	    [ValidateSet("CurrentUser","LocalMachine")]
                      	    [String]$CertRootStore = "LocalMachine",
                      	    [String]$CertStore = "My",
                      	    $PfxPass = $null
                          )
                              Process {
                      	        $pfx = new-object System.Security.Cryptography.X509Certificates.X509Certificate2
                      	        if ($pfxPass -eq $null) {$pfxPass = read-host "Enter the pfx password" -assecurestring}
                      	        $pfx.import($certPath,$pfxPass,"Exportable,PersistKeySet")
                       
                      	        $store = new-object System.Security.Cryptography.X509Certificates.X509Store($certStore,$certRootStore)
                      
                      	        $serverName = [System.Net.Dns]::GetHostName();
                      	        Write-Warning ("Adding certificate " + $pfx.FriendlyName + " to $CertRootStore/$CertStore on $serverName. Thumbprint = " + $pfx.Thumbprint)
                      	        $store.open("MaxAllowed")
                      	        $store.add($pfx)
                      	        $store.close()
                      	        Write-Host ("Added certificate " + $pfx.FriendlyName + " to $CertRootStore/$CertStore on $serverName. Thumbprint = " + $pfx.Thumbprint)
                              }
                          }
                          #### FUNCTIONS - END ####
                      
                      
                          #### SCRIPT - START ####
                          $sharedPath = "D:\AllContent\SharedSSL\Local"
                          $authRootLoc = "Cert:\LocalMachine\AuthRoot"
                          
                          $pfxPassword = "your password" # need to set this
                      
                          $pfxs = dir $sharedPath -file -Filter *.pfx
                          foreach($pfx in $pfxs) {    
                              $cert = Get-X509Certificate -PfxFile $pfx.FullName -PfxPassword $pfxPassword
                              $certPath = "$authRootLoc\$($cert.Thumbprint)"
                              if((Test-Path $certPath) -eq $false) {
                                  # use the locally defined Import-PfxCertificate (it accepts a plain-text password)
                                  $null = Import-PfxCertificate -CertPath $pfx.FullName -CertRootStore "LocalMachine" -CertStore "AuthRoot" -PfxPass $pfxPassword
                                  Write-Host "$($cert.Subject) ($($cert.Thumbprint)) Added"
                              } else {
                                  Write-Host "$($cert.Subject) ($($cert.Thumbprint)) Already Exists"
                              }
                          }
                          #### SCRIPT - END ####
                      
                      } finally {
                          foreach($er in $Error) { $er }
                      
                          Disable-LogFile $log
                      }
                      ### Install-Add-SslCertsToAuthRoot.ps1
                      
                      $yourUsername = "your username" # needs local admin rights on your machine (you probably have it)
                      $yourPassword = "your password"
                      
                      $name = "Add-SslCertsToAuthRoot"
                      $filename = "$name.ps1"
                      $fp = "D:\AllContent\Scripts\IIS\$filename"
                      $taskName = $name
                      $fp = "powershell $fp"
                      
                      $found = . schtasks.exe /query /tn "$taskName" 2>$null
                      if($found -ne $null) {
                          . schtasks.exe /delete /tn "$taskName" /f
                          $found = $null
                      }
                      if($found -eq $null) {
                          . schtasks.exe /create /ru $yourUsername /rp $yourPassword /tn "$taskName" /sc daily /st "01:00" /tr "$fp"
                          . schtasks.exe /run /tn "$taskName"
                      }

                      Get-FullDomainAccount

                      on Friday, April 1, 2016

                      In a previous post I forgot to include the PowerShell code for Get-FullDomainAccount. Sorry about that.

                      Here it is:

                      $env:USERDOMAIN = "<your domain>"
                      <#
                      .SYNOPSIS
                      	Ensures that the given domain account also has the domain prefix. For example,
                      	if the -DomainAccount is "IUSR_AbcXyz" the "<your domain>\IUSR_AbcXyz" would most likely
                      	be returned. The domain is pulled from the current users domain, $env:USERDOMAIN.
                      
                      	If -Environment is provided, this will also run the -DomainAccount through
                      	Get-EnvironmentDomainAccount to replace any environment specific information.
                      
                      .LINK
                      	Get-EnvironmentDomainAccount
                      		Used to apply environment specific value to the domain account
                      
                      .EXAMPLE
                          $result = Get-FullDomainAccount -DomainAccount "IUSR_AbcXyz"
                          $result -eq "<your domain>\IUSR_AbcXyz"
                      #>
                      Function Get-FullDomainAccount {
                      [CmdletBinding()]
                      Param (
                      	[Parameter(Mandatory=$true)]
                      	[string] $DomainAccount,
                      	[string]$Environment = ""
                      )
                      	$accountName = $DomainAccount;
                      
                      	if($Environment -ne "") {
                              $accountName = Get-EnvironmentDomainAccount -Environment $Environment -DomainAccount $DomainAccount;
                      	}
                      
                          if($accountName -match "ApplicationPoolIdentity") {
                              $accountName = "IIS AppPool\$accountName"
                          }
                      
                          if($accountName -match "LocalSystem") {
                              $accountName = "$($env:COMPUTERNAME)\$accountName"
                          }
                      
                      	if($accountName -notmatch "\\") {
                      		$accountName = $env:USERDOMAIN + "\" + $accountName;
                      	}
                      	return $accountName;
                      }

