How to Crash Exchange Using IIS Healthchecks

on Saturday, September 23, 2017

So, I had a bad week. I crashed a multi-server, redundant, highly available Exchange Server setup using the IIS Healthchecks of a single website in Dev and Test (not even Prod).

How did I do this? Well …

  • Start with a website that is only in Dev & Test and hasn’t moved to Prod.
    • All of the database objects are only in Dev & Test.
  • Do a database refresh from Prod and overlay Dev & Test.
    • The database refresh takes 2 hours, but for the next 17 hours the Dev & Test environments don’t have the database objects available to them, because those objects weren’t a part of the refresh.
  • So, now you have 19 hours of a single website being unable to properly make a database call.
  • Why wasn’t anyone notified? Well, that’s all on me. It was the Dev & Test version of the website, and I was ignoring those error messages (those many, many error messages).
  • Those error messages were from ELMAH. If you use ASP.NET and don’t know ELMAH, then please learn about it; it’s amazing!
    • In this case, I was using ELMAH with WebAPI, so I was using the Elmah.Contrib.WebAPI package. I’m not singling them out as a problem; I just want to spread the word that WebAPI applications need to use this package to get error reporting. (A rough sketch of the error-mail wiring is after this list.)
  • Finally, you have the IIS WebFarm Healthcheck system.
    • The IIS WebFarm healthcheck system is meant to help a WebFarm route requests to healthy application servers behind a proxy. If a single server is having a problem, then requests are no longer routed to it and only the healthy servers are sent requests to process. It’s a really good idea.
    • Unfortunately, … (You know what? … I’ll get back to this below)
    • Our proxy servers have around 215 web app pools.
    • The way IIS healthchecks are implemented, every one of those web app pools will run the healthchecks on every web farm. So, this one single application gets 215 healthchecks every 30 seconds (the default healthcheck interval).
    • That’s 2 healthchecks per minute, by 215 application pools …
    • Or 430 healthchecks per minute … per server
    • Times 3 servers (1 Dev & 2 Test Application Servers) … 1290 healthchecks per minute
    • Times 60 minutes per hour, times 19 hours … 1,470,600 healthchecks in 19 hours.
  • Every one of the 1,470,600 healthchecks produced an error, and ELMAH diligently reported every one of those errors. (First email type)
  • Now for Exchange
    • Even if we didn’t have a multi-server, redundant, highly available Exchange server, 1.5 million emails would have probably crashed it.
    • But, things got crazier because we have a multiple server, redundant, highly available setup.
    • So, the error emails went to a single recipient, me.
    • And, eventually my Inbox filled up (a 6 GB limit on my Inbox), which started to produce response emails saying “This Inbox is Full”. (Second email type)
    • Well … those response emails went back to the sender … which was a fake email address I used for the website (it’s never supposed to be responded to).
    • Unfortunately, that fake email address has the same domain as my account (@place.com), which sent all the responses back to the same Exchange server.
    • Those “Inbox is Full” error messages then triggered Exchange to send back messages that said “This email address doesn’t exist”. (Third email type)
    • I’m not exactly sure how this happened, but there were a number of retry attempts on the [First Email Type], which again re-triggered the Second and Third email types. I call the retries the (Fourth email type).
    • Once all of the error messages get factored into the equation, the 1.5 million healthcheck emails generated around 4.5 million healthcheck and SMTP error emails.
    • Way before we hit the 4.5 million mark, our Exchange server filled up …
      • Its database
      • The disk on the actual Exchange servers
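
For reference, the error-email side of this is just ELMAH’s ErrorMail module wired up in web.config. Here’s a rough sketch of that wiring under my assumptions (the SMTP server name and subject are placeholders, and the exact config in our apps may differ; the from/to addresses are the ones described in this post):

<configuration>
  <configSections>
    <sectionGroup name="elmah">
      <section name="errorMail" requirePermission="false" type="Elmah.ErrorMailSectionHandler, Elmah" />
    </sectionGroup>
  </configSections>
  <elmah>
    <!-- some.website@place.com is the unmonitored from-address; smtp.place.com is a placeholder -->
    <errorMail from="some.website@place.com" to="my.name@place.com"
               subject="Website error" async="true"
               smtpServer="smtp.place.com" smtpPort="25" />
  </elmah>
  <system.webServer>
    <modules>
      <!-- sends one email per logged error -->
      <add name="ErrorMail" type="Elmah.ErrorMailModule, Elmah" preCondition="managedHandler" />
    </modules>
  </system.webServer>
</configuration>

The Elmah.Contrib.WebAPI piece is a separate registration in the WebAPI code; the point here is just that every logged error becomes an outbound SMTP message from that unmonitored from-address.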

So, I don’t really understand Exchange too well. I’m trying to understand this diagram a little better. One thing that continues to puzzle me is why the Exchange server sent error emails to “itself”. (My email address is my.name@place.com and the ELMAH emails were from some.website@place.com … so the error emails were sent to @place.com, which that Exchange server owns). Or does it …

  • So, from the diagram, consultation, and my limited understanding … our configuration is this:
    • We have a front end email firewall that owns the MX record (DNS routing address) for @place.com.
      • The front end email firewall is supposed to handle external email DDOS attacks and ridiculous spam emails.
    • We have an internal Client Access Server / Hub Transport Server which takes in the ELMAH emails from our applications and routes them into the Exchange Servers.
    • We have 2 Exchange servers with 2 Databases behind them, which our email inboxes are split across.
    • So, the flow might be (again, I don’t have this pinned down)
      • The application sent the error email to the Client Access Server
      • The Client Access Server queued the error email and determined which Exchange server to process it with (let’s say Exchange1)
      • Exchange1 found that the mailbox was full, and per SMTP it needed to send an “Inbox is full” error message. Exchange1 looked up the MX record for where to send it and found that it needed to send it to the Email Firewall. It sent it …
      • The Email Firewall then found that some.website@place.com wasn’t an actual address and maybe sent it to Exchange2 for processing?
      • Exchange2 found it was a fake address and sent back a “This address doesn’t exist” email, which went back to the Email Firewall.
      • The Email Firewall forwarded the email or dropped it?
      • And, somewhere in all this mess, the emails that couldn’t be delivered to my real address my.name@place.com because my “Inbox was full” got put into a retry queue … in case my inbox cleared up. And, this helped generate more “Inbox is full” and “This address doesn’t exist” emails.
  • Sidenote: I said above, “One thing that continues to puzzle me is why the Exchange server sent error emails to ‘itself’.”
    • I kinda get it. Exchange does an MX lookup for @place.com and finds the Email Firewall as the IP address, which isn’t itself. But …
    • Shouldn’t Exchange know that it owns @place.com? Why does it need to send the error email?

So … the biggest problem in this whole equation is me. I knew that IIS had this healthcheck problem beforehand. And, I had even created a support ticket with Microsoft to get it fixed (which they say has been escalated to the Product Group … but nothing has happened for months).

I knew of the problem, I implemented ELMAH, and I completely forgot that the database refresh would wipe out the db objects which the applications would need.

Of course, we/I’ve now gone about implementing fixes, but I want to dig into this IIS Healthcheck issue a little more. Here’s how it works.

  • IIS has a feature called ARR (Application Request Routing)
    • It’s used all the time in Azure. You may have set up a Web App, which requires an “App Service”. The App Service is actually a proxy server that sits in front of your Web App. The proxy server uses ARR to route the requests to your Web App. But, in Azure they literally create a single proxy server for your single web application server. If you want to scale up and “move the slider”, more application servers are created behind the proxy. BUT, in Azure, the number of Web Apps that can sit behind an App Service/Proxy Service is very limited (less than 5). <rant>Nowhere in the IIS documentation do they tell you to limit yourself to 5 applications; and the “/Build conference” videos from the IIS team make you believe that IIS is meant to handle hundreds of websites.</rant>
  • We use ARR to route requests for all our custom made websites (~215) to the application servers behind our proxy.
  • ARR uses webfarms to determine where to route requests. The purpose of the webfarms is to have multiple backend Application Servers, which handle load balancing.
  • The webfarms have a Healthcheck feature, which allows the web farms to check whether the application servers behind the proxy are healthy. If one of the application servers isn’t healthy, then it’s taken out of the pool until it’s healthy again. (A sample of the healthcheck configuration is after this list.)
    • I really like this feature and it makes a lot of sense.
  • The BIG PROBLEM with this setup is that the WEBFARMS AREN’T DIRECTLY LINKED TO APPLICATION POOLS.
    • So, every application pool that runs on the frontend proxy server, loads the entire list of webfarms into memory.
    • If any of those webfarms happens to have a healthcheck url, then that application pool will consider itself the responsible party to check that healthcheck url.
    • So, if a healthcheck url has a healthcheck interval of 30 seconds …
    • And a proxy server has 215 application pools on it; then that is 215 healthchecks every 30 seconds.
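
To make the numbers above concrete, here’s roughly what one of those web farms with a healthcheck looks like in applicationHost.config today. This is a sketch from memory of the ARR schema, so treat the attribute names loosely; the healthcheck URL is a made-up placeholder:

<webFarms>
  <webFarm name="wf_johndoe.place.com_lab" enabled="true">
    <server address="wa100.place.com" enabled="true" />
    <applicationRequestRouting>
      <!-- the healthcheck belongs to the farm, but (per the behavior above) every app pool on the proxy runs it -->
      <healthCheck url="http://johndoe.place.com/healthcheck" interval="00:00:30" responseMatch="Healthy" />
    </applicationRequestRouting>
  </webFarm>
</webFarms>

Notice there’s nothing in that farm definition that says which application pool it belongs to; that’s the gap the proposal below tries to close.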

I think the design of the Healthcheck feature is great. But, the IMPLEMENTATION is flawed. HEALTHCHECKS ARE NOT IMPLEMENTED THE WAY THEY WERE DESIGNED.

Of course I’ve worked on other ways to prevent this problem in the future. But, IIS NEEDS TO FIX THE WAY HEALTHCHECKS ARE IMPLEMENTED.

I get bothered when people complain without a solution, so here’s the solution I propose:

  • Create a new XML node in the <webFarm> section of applicationHost.config which directly links webfarms to application pools.
  • Example (sorry, I’m having a lot of trouble getting code snippets to work in this version of my LiveWriter):
<webFarm enabled="true" name="wf_johndoe.place.com_lab">
  <applicationPool name="johndoe.place.com_lab" />
  <server enabled="true" address="wa100.place.com" />
  <applicationRequestRouting>
    <protocol reverseRewriteHostInResponseHeaders="false" timeout="00:00:30">
      <cache enabled="false" queryStringHandling="Accept" />
    </protocol>
    <affinity cookieName="ARRAffinity_johndoe.place.com_lab" useCookie="true" />
    <loadBalancing algorithm="WeightedRoundRobin" />
  </applicationRequestRouting>
</webFarm>
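
If that <applicationPool> link worked the way I intend, only the johndoe.place.com_lab application pool would run this farm’s healthcheck, so the Dev website would see 2 healthchecks per minute from each proxy server instead of 430.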
