Alter PathSuffix in Apigee with/out Load Balancer

on Friday, February 16, 2018

Apigee’s API Gateway is by many measures a proxy server with some really nice bells and whistles attached. But, it’s still a proxy server at its core, which means it should be able to transform an incoming request before it’s sent to the backend/target resource servers. It can do that, but it’s not as easy as you might hope.

With an API Gateway, a common transformation would be to remove a version number from a URL before sending the request to the backend server. This scenario crops up when the developer of the resource API didn’t design their system with version numbers in mind. The scenario looks like this:

image

So, in this scenario, the Body API Proxy has a BasePath of /body (proxy.basepath), and the PathSuffix would be /v1/wheels?drive=4WD (proxy.pathsuffix). The developer of the resource service didn’t build the version into the URL path and is expecting a URL without it.

Without a Load Balancer Configuration

To make this transformation, we are going to need to artificially create the target endpoint’s URL during the flow process. Seeing that the API Gateway is a proxy server, you would think you’d just need to overwrite the request or proxy variables, but most of those are actually read-only. Here’s what you’ll need to do:

  1. You’ll use the request.uri and proxy.basepath to figure out the full path suffix.
  2. If the path contains a version number (/v1/) then you will …
  3. Set target.copy.pathsuffix to false. (At the moment, you have to use a Javascript Callout. There is a bug with using an AssignMessage Policy).
    1. This must occur in the Target Endpoint flows (most likely the PreFlow). You can’t do this in the Proxy Endpoint, because the target variables haven’t been created yet. So, they aren’t “in scope”.
  4. You’ll then remove the version number (/v1/) to get the “new” path suffix.
  5. And, finally, you will set the target.url to the constructed path. (target.url is one of the few read/write variables.)

Target Endpoint with No Load Balancer

image
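For reference, the Target Endpoint in the no-Load-Balancer case looks roughly like the sketch below. The step name and backend URL are placeholders; the Javascript Callout later in this post is what flips target.copy.pathsuffix and rewrites target.url.

<TargetEndpoint name="default">
    <PreFlow name="PreFlow">
        <Request>
            <!-- Javascript Callout that sets target.copy.pathsuffix = false and rewrites target.url -->
            <Step>
                <Name>JS-Rewrite-PathSuffix</Name>
            </Step>
        </Request>
    </PreFlow>
    <HTTPTargetConnection>
        <!-- placeholder backend; target.url starts out as this value -->
        <URL>https://backend.example.com/body</URL>
    </HTTPTargetConnection>
</TargetEndpoint>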

With a Load Balancer Configuration

Apigee uses a Load Balancer configuration in the Target Endpoint to allow the Resource Server DNS hostnames to differ between environments. Unfortunately, when this is used, the target.url variable is no longer used. And, you need to set target.copy.queryparams to false as well.

To do this, you’ll follow the same steps above, but this time you’ll …

  1. Skip setting target.url. (Once a Load Balancer is configured, target.url is ignored.)
  2. Set target.copy.queryparams to false.
  3. Set the {newpathsuffix} variable, which will be referenced in the Target Endpoint’s Path (see the sketch below).

Target Endpoint with Load Balancer

image
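And the Load Balancer version of the Target Endpoint is roughly this sketch (the TargetServer and step names are placeholders); the <Path> element is where the {newpathsuffix} variable set by the Javascript Callout gets substituted in:

<TargetEndpoint name="default">
    <PreFlow name="PreFlow">
        <Request>
            <Step>
                <Name>JS-Rewrite-PathSuffix</Name>
            </Step>
        </Request>
    </PreFlow>
    <HTTPTargetConnection>
        <LoadBalancer>
            <!-- placeholder TargetServer defined per environment -->
            <Server name="body-backend" />
        </LoadBalancer>
        <Path>/{newpathsuffix}</Path>
    </HTTPTargetConnection>
</TargetEndpoint>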

Javascript Callout for Target Endpoint PreFlow: (note that the variable {newpathsuffix} isn’t needed when no Load Balancer is involved. It’s being used to make both implementations look similar.)

//  parses the original request to remove the version piece ("/v1", etc)
var basepath = context.getVariable("proxy.basepath")
print("basepath: " + basepath);
var uri = context.getVariable("request.uri");
print("uri: " + uri);
var pathsuffix = uri.substring(basepath.length)
var regex = /(.*)\/v[0-9]+\/(.*)/
var found = regex.exec(pathsuffix)
print("found: " + found)
if(found !== null) {
    //  prevents the request to the backend server from using the original "request.pathSuffix"
    //  this is very important!
    //  the original "request.path" will overwrite whatever we do here if this isn't set
    context.setVariable("target.copy.pathsuffix", false)
    
    // remove the "/v1" part
    var newPathSuffix = found[1]
    if(newPathSuffix.length > 0) { newPathSuffix += "/" }
    newPathSuffix += found[2]
    
    print("newPathSuffix: " + newPathSuffix)
    context.setVariable("newpathsuffix", newPathSuffix)
    
    var targetUrl = context.getVariable("target.url")
    print("target url: " + targetUrl)
    if(targetUrl !== null) {
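        //  no Load Balancer here: target.url exists
        //  if the configured target URL contains the literal "{ucsbpathsuffix}" placeholder, swap it for the new suffix; otherwise just append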
        
        var pathSuffixRegex = /(.*){ucsbpathsuffix}(.*)/
        var pathSuffixFound = pathSuffixRegex.exec(targetUrl)
        print("pathSuffixFound: " + pathSuffixFound)
        
        if(pathSuffixFound !== null) {
            
            var newUrl = pathSuffixFound[1] + newPathSuffix + pathSuffixFound[2]
            print("new url (replace): " + newUrl)
            context.setVariable("target.url", newUrl);
            
        } else {
            var newUrl = targetUrl + newPathSuffix
            print("new url (append): " + newUrl)
            context.setVariable("target.url", newUrl);
            
        }
    } else {
        // using load balancer
        context.setVariable("target.copy.queryparams", "false") // needed on load balancer
        // the load balancer can use the variable substitution on the  innerText
    }
} else {
    print("newpathsuffix: [empty string]")
    context.setVariable("newpathsuffix", "")
}

target.copy.pathsuffix and target.copy.queryparams

So, these are the key variables that make overwriting the target path possible. The creation of these variables probably has good reasoning behind it, but from an outside perspective they seem really odd. Apigee’s internal system allows you to do a variety of alterations and checks through the Proxy and Target Endpoint flows. These flows can alter most things within the system at the time they execute within the pipeline. BUT, the proxy.pathsuffix and proxy.queryparams are (a) read-only and (b) will overwrite any changes you make to the target.url value unless these copy variables are set to false. They just ignore everything that happened in the pipeline and override it. This behavior seems to conflict with the way the “flow” system was designed.

Apigee OPTIONS Response for Preflight/CORS

on Monday, February 12, 2018

Apigee comes with the ability to add CORS headers to responses right out of the box. This really isn’t that useful, though, and it instills a false sense that it’s actually providing valuable CORS information so the developer doesn’t have to think about it.

image

CORS is really implemented in browsers to prevent requests from going to unauthorized endpoints. To do this, many browsers (like Chrome) use a “Preflight” request to pull back a couple of headers which let the browser know that a web service does allow requests from other “origins” (or, DNS names). Essentially, CORS headers state:

These websites can use this web service. And, this web service allows these methods (GET, POST, etc) to be called from that website for the resource in question. (With web services, a lot of the time, “These websites” is actually “All websites.”)

Back to Apigee’s initial setup: The problem with adding CORS headers on all responses is that the Preflight request isn’t going to match one of the normal endpoints on an API. So, the response will most likely be a 404 Not Found. And, browsers will consider that an error, and they won’t allow the real request to go through.

This is a big deal for https://editor.swagger.io/. The basic Swagger UI Tester uses fetch, which does the Preflight request/check. The Swagger editor and tester are used all over the place, and most browsers will try to do a Preflight check, which will result in this error message (the image is from Chrome’s developer tools):

image

In Chrome’s Network tab it will look like this:

image

Take note that the Preflight request is asking the server not only if it’s okay if http://editor.swagger.io is calling, but it wants to know if the ‘ucsb-api-version’ header is acceptable. This means the preflight request doesn’t actually send across any security information. It’s asking if it’s okay to send across security information.
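For reference, the preflight itself is just an OPTIONS request that describes what the real call wants to do; it looks roughly like this (the path here is made up):

OPTIONS /somepath/v1/things HTTP/1.1
Host: {org}-{env}.apigee.net
Origin: http://editor.swagger.io
Access-Control-Request-Method: GET
Access-Control-Request-Headers: ucsb-api-version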

So, if you want to have an Apigee web service that is compatible with the standard Swagger UI editor and tester you need to watch for the OPTIONS preflight request and return an acceptable response. Luckily, this can be done by taking the original CORS headers response and turning it into a PreFlow response. Start out by creating a Shared Flow that looks for OPTIONS requests:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<SharedFlow name="default">
    <Step>
        <Name>OPTIONS-CORS-Headers-Response</Name>
        <Condition>request.verb = "OPTIONS"</Condition>
    </Step>
</SharedFlow>

Then add a RaiseFault Policy that will return all the CORS headers and successful status code:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<RaiseFault async="false" continueOnError="false" enabled="true" name="OPTIONS-CORS-Headers-Response">
    <DisplayName>OPTIONS CORS Headers Response</DisplayName>
    <Properties/>
    <FaultResponse>
        <Set>
            <Headers>
                <Header name="Access-Control-Allow-Origin">*</Header>
                <Header name="Access-Control-Allow-Headers">origin, x-requested-with, accept, ucsb-api-key, ucsb-api-version, authorization</Header>
                <Header name="Access-Control-Max-Age">3628800</Header>
                <Header name="Access-Control-Allow-Methods">GET, PUT, POST, DELETE</Header>
            </Headers>
            <Payload contentType="text/plain"/>
            <StatusCode>200</StatusCode>
            <ReasonPhrase>OK</ReasonPhrase>
        </Set>
    </FaultResponse>
    <IgnoreUnresolvedVariables>true</IgnoreUnresolvedVariables>
</RaiseFault>

Now, all you need to do is add the Shared Flow as the very first Step in your API Proxies’ Proxy Endpoint PreFlow. The Shared Flow step must come before the API Key verification step because the OPTIONS request will not contain security authorization information. It should look something like this:

image
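In the Proxy Endpoint XML, that ordering comes out to roughly this (the FlowCallout and API Key policy names below are stand-ins for whatever yours are called):

<ProxyEndpoint name="default">
    <PreFlow name="PreFlow">
        <Request>
            <!-- FlowCallout policy that invokes the OPTIONS shared flow -->
            <Step>
                <Name>FC-OPTIONS-CORS-Headers-Response</Name>
            </Step>
            <!-- existing security steps come after it -->
            <Step>
                <Name>Verify-API-Key</Name>
            </Step>
        </Request>
    </PreFlow>
    ...
</ProxyEndpoint>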

Once this is all set up, preflight requests from https://editor.swagger.io/ will pass without error. And now you should get a successful response:

image

But, this isn’t a perfect solution. There are still faults with it because your API Proxy is now blindly stating that it will take requests from almost anywhere and for a variety of different METHOD types.

The best possible solution would be to detect OPTIONS Preflight requests by looking for the OPTIONS method and checking that the 3 required headers exist. If all of those conditions are met, then flow the request down to the resource server and let it determine the exact CORS response it can serve. But, that’s all configuration for another day.
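A condition for spotting a real preflight might look something like this (assuming the three headers in question are Origin, Access-Control-Request-Method, and Access-Control-Request-Headers); it could then gate a pass-through route instead of the RaiseFault above:

<Condition>(request.verb = "OPTIONS") and
           (request.header.Origin != null) and
           (request.header.Access-Control-Request-Method != null) and
           (request.header.Access-Control-Request-Headers != null)</Condition>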

Apigee’s security is NOT Default Deny

on Friday, February 9, 2018

Apigee makes a great API Gateway product. But, one thing that’s been surprising is that the security system is not Deny Access by Default. It’s a very forward thinking design, but it surprised many of us who assumed the common security practice of Default Deny was the starting point.

For an application to have access to an API, the application must first be approved to use an API Product. The piece that’s surprising is that if an API Product has no API Proxies or Resource Path restrictions applied to it, then it gives full access to all API Proxies.

Don’t do this. Always attach at least one API Proxy to your API Products.

image

Once you have an API Product set up with an API Proxy, you have restricted access to just that API Proxy’s endpoint, which is a good step forward. But, it gives you access to the entire endpoint, with no filtering on the “known paths”.

Beware of not applying resource path restrictions. Without resource restrictions, everything going through your API Proxy’s endpoint is passed through.

image

For example, if an API Proxy has a base path of:

https://{org}-{env}.apigee.net/firstapiproxy

And, that API Proxy has “known paths” (e.g., flows) of

  • GET /cars
  • GET /trucks
  • GET /vans

Because there are no resource path restrictions, requests to other paths under that base path will also be passed straight through to the backend (for example, a GET to /firstapiproxy/motorcycles), even though no flow is defined for them.

To apply the Resource Path restrictions, use the API Product interface:

image

Or, for even stricter security, create a DefaultNotFound Flow within your API Proxy, like the Send404NotFoundResponse used in the oauth2/proxy example:

image
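That flow from the sample is roughly a catch-all flow listed last, whose only job is to run a RaiseFault policy (Send404NotFoundResponse here) for any path that didn’t match a known flow:

<Flow name="DefaultNotFound">
    <Description>Unknown resource</Description>
    <Request>
        <Step>
            <Name>Send404NotFoundResponse</Name>
        </Step>
    </Request>
</Flow>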

Apigee OAuth Tester in Powershell

on Monday, February 5, 2018

New Apigee instances/organizations come with a built-in OAuth 2.0 server. Their default security mechanism is an API Key, but they fully support OAuth 2.0 right out of the box.

A new instance will come with an active OAuth 2.0 endpoint deployed to your Dev, Test, and Prod instances.

The default OAuth 2.0 endpoint is very similar to this proxy example. But, the tutorial on Apigee’s website sends the grant_type as a form parameter. So, a quick swap can change the grant_type lookup:

image
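In the OAuthV2 policy, that swap is just pointing the <GrantType> lookup at the form parameter instead of the query parameter, something like this (the policy name is whatever the sample proxy uses):

<OAuthV2 name="GenerateAccessTokenClient">
    <Operation>GenerateAccessToken</Operation>
    <ExpiresIn>3600000</ExpiresIn>
    <SupportedGrantTypes>
        <GrantType>client_credentials</GrantType>
    </SupportedGrantTypes>
    <!-- read grant_type from the form body instead of the query string -->
    <GrantType>request.formparam.grant_type</GrantType>
    <GenerateResponse enabled="true"/>
</OAuthV2>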

Once that’s changed over, you’ll need to request an access token from the endpoint. To do this go into one of your applications and get the client_id and client_secret:

image

And now we can throw this info into a PowerShell script to get back our bearer token:

$apigeeHost = "{organization}-{environment}.apigee.net"
$clientId = "{your client id}"
$clientSecret = "{your client secret}"

$authUrl = "https://$apigeeHost/oauth/client_credential/accesstoken"
$authHeaders = @{
    "Content-Type" = "application/x-www-form-urlencoded"
}
$authBody = "grant_type=client_credentials" + `
            "&client_id=$clientId" + `
            "&client_secret=$clientSecret"

$authResponse = Invoke-WebRequest -Method POST -Headers $authHeaders -Body $authBody -Uri $authUrl

if($authResponse.StatusCode -ne 200) {
    throw ("Authorization Failure`r`n" + $authResponse)
}

$authInfo = ConvertFrom-Json $authResponse.Content

$authInfo

image

Before making a call to a resource, make sure to set up the resource API Proxy with an OAuth Verification:

image

image

You actually only need the <Operation>VerifyAccessToken</Operation>, but it doesn’t hurt to leave the rest.
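Pared down, the verification policy can be as small as this sketch:

<OAuthV2 name="Verify-OAuth-v20-Access-Token">
    <Operation>VerifyAccessToken</Operation>
</OAuthV2>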

Now that we have a bearer token, we can use it as an authorization header to make a call to our resource:

# use your resource url here
$resourceUrl = "https://$apigeeHost/sa/quartercalendar/oauth/v1/quarters?quarter=20154"
$resourceHeaders = @{
    Authorization = "Bearer $($authInfo.access_token)"
}
$resourceResponse = Invoke-WebRequest -Method GET -Uri $resourceUrl -Headers $resourceHeaders
ConvertFrom-Json $resourceResponse.Content

image

Self-Signed Certificates for Win10

on Friday, November 24, 2017

Browsers have implemented all sorts of great new security measures to ensure that certificates are actually valid. So, using a self-signed certificate today is more difficult than it used to be. Also, IIS for Win8/10 gained the ability to use a Central Certificate Store. So, here are some scripts that:

  • Create a Self-Signed Cert
    • Creates a self-signed cert with a DNS Name (browsers don’t like it when the Subject Alternative Name doesn’t list the DNS Name).
    • Creates a Shared SSL folder on disk and adds permissions so that IIS’s Central Certificate Store account can read the certs.
    • Exports the cert to the Shared SSL folder as a .pfx.
    • Reimports the cert into the machine’s Trusted Root Authorities store (needed for browsers to verify the cert is trusted).
    • Adds the 443/SSL binding to the site (if it exists) in IIS
  • Re-Add Cert to Trusted Root Authority
    • Since before Win10, Windows has run a background task which periodically checks the certs installed in your Machine Trusted Root Authorities store and removes the self-signed ones. So, this script re-installs them.
    • It will look through the shared SSL folder created in the previous script and add any certs back to the local Machine Trusted Root Authority that are missing.
  • Re-Add Cert to Trusted Root Authority Scheduled Task
    • Schedules the script to run daily (matching the schtasks setup below).
### Create-SelfSignedCert.ps1

$name = "site.name.com" # only need to edit this


# the shared ssl password for dev - this will be applied to the cert (Export/Import-PfxCertificate expect a SecureString)
$pfxPassword = ConvertTo-SecureString "your pfx password" -AsPlainText -Force

# you can only create a self-signed cert in the \My store
$certLoc = "Cert:\LocalMachine\My"
$cert = New-SelfSignedCertificate `
            -FriendlyName $name `
            -KeyAlgorithm RSA `
            -KeyLength 4096 `
            -CertStoreLocation $certLoc `
            -DnsName $name

# ensure the path the directory for the central certificate store is setup with permissions
# NOTE: This assumes that IIS is already setup with Central Cert Store, where
#       1) The user account is "Domain\AccountName"
#       2) The $pfxPassword Certificate Private Key Password
$sharedPath = "D:\AllContent\SharedSSL\Local"
if((Test-Path $sharedPath) -eq $false) {
    mkdir $sharedPath

    $acl = Get-Acl $sharedPath
    $objUser = New-Object System.Security.Principal.NTAccount("Domain\AccountName") 
    $rule = New-Object System.Security.AccessControl.FileSystemAccessRule($objUser, "ReadAndExecute,ListDirectory", "ContainerInherit, ObjectInherit", "None", "Allow")
    $acl.AddAccessRule($rule)
    Set-Acl $sharedPath $acl
}


# export from the \My store to the Central Cert Store on disk
$thumbprint = $cert.Thumbprint
$certPath = "$certLoc\$thumbprint"
$pfxPath = "$sharedPath\$name.pfx"
if(Test-Path $pfxPath) { del $pfxPath }
Export-PfxCertificate `
    -Cert $certPath `
    -FilePath $pfxPath `
    -Password $pfxPassword


# reimport the cert into the Trusted Root Authorities
$authRootLoc = "Cert:\LocalMachine\AuthRoot"
Import-PfxCertificate `
    -FilePath $pfxPath `
    -CertStoreLocation $authRootLoc `
    -Password $pfxPassword `
    -Exportable


# delete it from the \My store
del $certPath # removes from cert:\localmachine\my


# if the website doesn't have the https binding, add it
Import-Module WebAdministration

if(Test-Path "IIS:\Sites\$name") {
    $httpsBindings = Get-WebBinding -Name $name -Protocol "https"
    $found = $httpsBindings |? { $_.bindingInformation -eq "*:443:$name" -and $_.sslFlags -eq 3 }
    if($found -eq $null) {
        New-WebBinding -Name $name -Protocol "https" -Port 443 -IPAddress "*" -HostHeader $name -SslFlags 3
    }
}
### Add-SslCertsToAuthRoot.ps1

$Error.Clear()

Import-Module PowerShellLogging
$name = "Add-SslCertsToAuthRoot"
$start = [DateTime]::Now
$startFormatted = $start.ToString("yyyyMMddHHmmss")
$logdir = "E:\Logs\Scripts\IIS\$name"
$logpath = "$logdir\$name-log-$startFormatted.txt"
$log = Enable-LogFile $logpath

try {

    #### FUNCTIONS - START ####
    Function Get-X509Certificate {
	Param (
        [Parameter(Mandatory=$True)]
		[ValidateScript({Test-Path $_})]
		[String]$PfxFile,
		[Parameter(Mandatory=$True)]
		[string]$PfxPassword=$null
	)

	    # Create new, empty X509 Certificate (v2) object
	    $X509Certificate = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2

	    # Call class import method using password
        try {
			$X509Certificate.Import($PfxFile,$PfxPassword,"PersistKeySet")
			Write-Verbose "Successfully accessed Pfx certificate $PfxFile."
		} catch {
			Write-Warning "Error processing $PfxFile. Please check the Pfx certificate password."
			Return $false
		}
	
        Return $X509Certificate
    }

    # http://www.orcsweb.com/blog/james/powershell-ing-on-windows-server-how-to-import-certificates-using-powershell/
    Function Import-PfxCertificate {
    Param(
	    [Parameter(Mandatory = $true)]
	    [String]$CertPath,
	    [ValidateSet("CurrentUser","LocalMachine")]
	    [String]$CertRootStore = "LocalMachine",
	    [String]$CertStore = "My",
	    $PfxPass = $null
    )
        Process {
	        $pfx = new-object System.Security.Cryptography.X509Certificates.X509Certificate2
	        if ($pfxPass -eq $null) {$pfxPass = read-host "Enter the pfx password" -assecurestring}
	        $pfx.import($certPath,$pfxPass,"Exportable,PersistKeySet")
 
	        $store = new-object System.Security.Cryptography.X509Certificates.X509Store($certStore,$certRootStore)

	        $serverName = [System.Net.Dns]::GetHostName();
	        Write-Warning ("Adding certificate " + $pfx.FriendlyName + " to $CertRootStore/$CertStore on $serverName. Thumbprint = " + $pfx.Thumbprint)
	        $store.open("MaxAllowed")
	        $store.add($pfx)
	        $store.close()
	        Write-Host ("Added certificate " + $pfx.FriendlyName + " to $CertRootStore/$CertStore on $serverName. Thumbprint = " + $pfx.Thumbprint)
        }
    }
    #### FUNCTIONS - END ####


    #### SCRIPT - START ####
    $sharedPath = "D:\AllContent\SharedSSL\Local"
    $authRootLoc = "Cert:\LocalMachine\AuthRoot"
    
    $pfxPassword = "your password" # need to set this

    $pfxs = dir $sharedPath -file -Filter *.pfx
    foreach($pfx in $pfxs) {    
        $cert = Get-X509Certificate -PfxFile $pfx.FullName -PfxPassword $pfxPassword
        $certPath = "$authRootLoc\$($cert.Thumbprint)"
        if((Test-Path $certPath) -eq $false) {
            $null = Import-PfxCertificate -CertPath $pfx.FullName -CertRootStore "LocalMachine" -CertStore "AuthRoot" -PfxPass $pfxPassword
            Write-Host "$($cert.Subject) ($($cert.Thumbprint)) Added"
        } else {
            Write-Host "$($cert.Subject) ($($cert.Thumbprint)) Already Exists"
        }
    }
    #### SCRIPT - END ####

} finally {
    foreach($er in $Error) { $er }

    Disable-LogFile $log
}
### Install-Add-SslCertsToAuthRoot.ps1

$yourUsername = "your username" # needs local admin rights on your machine (you probably have it)
$yourPassword = "your password"

$name = "Add-SslCertsToAuthRoot"
$filename = "$name.ps1"
$fp = "D:\AllContent\Scripts\IIS\$filename"
$taskName = $name
$fp = "powershell $fp"

$found = . schtasks.exe /query /tn "$taskName" 2>$null
if($found -ne $null) {
    . schtasks.exe /delete /tn "$taskName" /f
    $found = $null
}
if($found -eq $null) {
    . schtasks.exe /create /ru $yourUsername /rp $yourPassword /tn "$taskName" /sc daily /st "01:00" /tr "$fp"
    . schtasks.exe /run /tn "$taskName"
}

Wnf Kernel Memory Leak

on Friday, November 17, 2017

Back in 2015, we started using Win2012 R2 servers and within a day of Production usage we started seeing Out of Memory errors on the servers. Looking at the Task Manager, we could easily see that a massive amount of Kernel Memory was being used. But why?

Using some forum posts, SysInternals tools, and I think a Scott Hanselman blog entry, we were able to use PoolMon.exe to see that the pool tag using all the Kernel Memory was Wnf. We had no idea what it was and went down some rabbit holes before finding this forum post.

Microsoft Support would later tell us the problem had something to do with a design change to Remote Registry and how it deals with going idle, and another design change in Windows Server 2012 R2 about how it chooses which services to make idle. Anyways, the fix was easy to implement (just a real pain to find):

If you want the service to not stop when idle, you can set this registry key:
key:  HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\RemoteRegistry
name: DisableIdleStop
type: REG_DWORD, data: 1
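Or, the same thing from PowerShell (run elevated; this assumes the RemoteRegistry key exists at that path):

# tell the service control manager not to stop Remote Registry when it goes idle
$key = "HKLM:\SOFTWARE\Microsoft\Windows NT\CurrentVersion\RemoteRegistry"
Set-ItemProperty -Path $key -Name "DisableIdleStop" -Value 1 -Type DWord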

Here’s what it looks like when the leak is happening:

image

How to Crash Exchange Using IIS Healthchecks

on Saturday, September 23, 2017

So, I had a bad week. I crashed a multiple server, redundant, highly available Exchange Server setup using the IIS Healthchecks of a single website in Dev and Test (not even Prod).

How did I do this? Well …

  • Start with a website that is only in Dev & Test; and hasn’t moved to Prod.
    • All of the database objects are only in Dev & Test.
  • Do a database refresh from Prod and overlay Dev & Test.
    • The database refresh takes 2 hours; but the next 17 hours is a period where the Dev & Test environments don’t have the database objects available to them, because those objects weren’t a part of the refresh.
  • So, now you have 19 hours of a single website being unable to properly make a database call.
  • Why wasn’t anyone notified? Well, that’s all on me. It was the Dev & Test version of the website, and I was ignoring those error messages (those many, many error messages).
  • Those error messages were from ELMAH. If you use ASP.NET and don’t know ELMAH, then please learn about it; it’s amazing!
    • In this case, I was using ELMAH with WebAPI, so I was using the Elmah.Contrib.WebAPI package. I’m not singling them out as a problem, I just want to spread the word that WebAPI applications need to use this package to get error reporting.
  • Finally, you have the IIS WebFarm Healthcheck system.
    • The IIS WebFarm healthcheck system is meant to help a WebFarm route requests to healthy application servers behind a proxy. If a single server is having a problem, then requests are no longer routed to it and only the healthy servers are sent requests to process. It’s a really good idea.
    • Unfortunately, … (You know what? … I’ll get back to this below)
    • Our proxy servers have around 215 web app pools.
    • The way IIS healthchecks are implemented, every one of those web app pools will run the healthchecks on every web farm. So, this one single application gets 215 healthchecks every 30 seconds (the default healthcheck interval).
    • That’s 2 healthchecks per minute, by 215 application pools …
    • Or 430 healthchecks per minute … per server
    • Times 3 servers (1 Dev & 2 Test Application Servers) … 1290 healthchecks per minute
    • Times 60 per hour, times 19 hours … 1,470,600 healthchecks in 19 hours.
  • Every one of the 1,470,600 healthchecks produced an error, and ELMAH diligently reported every one of those errors. (First email type)
  • Now for Exchange
    • Even if we didn’t have a multi-server, redundant, highly available Exchange server, 1.5 million emails would have probably crashed it.
    • But, things got crazier because we have a multiple server, redundant, highly available setup.
    • So, the error emails went to a single recipient, me.
    • And, eventually my Inbox filled up (6 GBs limit on my Inbox), which started to produce response emails saying “This Inbox is Full”. (Second email type)
    • Well … those response emails went back to the sender … which was a fake email address I used for the website (it’s never supposed to be responded to).
    • Unfortunately, that fake email address has the same domain as my account (@place.com), which sent all the responses back to the same Exchange server.
    • Those “Inbox is Full” error messages then triggered Exchange to send back messages that said “This email address doesn’t exist”. (Third email type)
    • I’m not exactly sure how this happened, but there were a number of retry attempts on the First email type, which again re-triggered the Second and Third email types. I call the retries the Fourth email type.
    • Once all of the error messages get factored into the equation, the 1.5 million healthcheck emails generated 4.5 million healthcheck and SMTP error emails.
    • Way before we hit the 4.5 million mark, our Exchange server filled up …
      • Its database
      • The disk on the actual Exchange servers

So, I don’t really understand Exchange too well. I’m trying to understand this diagram a little better. One thing that continues to puzzle me is why the Exchange server sent out error emails to “itself”. (My email address is my.name@place.com and the ELMAH emails were from some.website@place.com … so the error emails were sent to @place.com, which that Exchange server owns). Or does it …

  • So, from the diagram, consultation, and my limited understanding … our configuration is this:
    • We have a front end email firewall that owns the MX record (DNS routing address) for @place.com.
      • The front end email firewall is supposed to handle external email DDOS attacks and ridiculous spam emails.
    • We have an internal Client Access Server / Hub Transport Server which takes in the ELMAH emails from our applications and routes them into the Exchange Servers.
    • We have 2 Exchange servers with 2 Databases behind them, which our email inboxes are split across.
    • So, the flow might be (again, I don’t have this pinned down)
      • The application sent the error email to the Client Access Server
      • The Client Access Server queued the error email and determined which Exchange server to process it with (let’s say Exchange1)
      • Exchange1 found that the mailbox was full and using SMTP protocols it needed to send an “Inbox is full error message”. Exchange1 looked up the MX record of where to send and found that it needed to send it to the Email Firewall. It sent it ..
      • The Email Firewall then found that some.website@place.com wasn’t an actual address and maybe sent it to Exchange2 for processing?
      • Exchange2 found it was a fake address and sent back a “This address doesn’t exist” email, which went back to the Email Firewall.
      • The Email Firewall forwarded the email or dropped it?
      • And, somewhere in all this mess, the emails that couldn’t be delivered to my real address my.name@place.com because my “Inbox was full” got put into a retry queue … in case my inbox cleared up. And, this helped generate more “Inbox is full” and “This address doesn’t exist” emails.
  • Sidenote: I said above, “One thing that continues to puzzle me is why the Exchange server sent out error emails to ‘itself’.”
    • I kinda get it. Exchange does an MX lookup for @place.com and finds the Email Firewall as the IP address, which isn’t itself. But …
    • Shouldn’t Exchange know that it owns @place.com? Why does it need to send the error email?

So … the biggest problem in this whole equation is me. I knew that IIS had this healthcheck problem beforehand. And, I had even created a support ticket with Microsoft to get it fixed (which they say has been escalated to the Product Group … but nothing has happened for months).

I knew of the problem, I implemented ELMAH, and I completely forgot that the database refresh would wipe out the db objects which the applications would need.

Of course, we/I’ve now gone about implementing fixes, but I want to dig into this IIS Healthcheck issue a little more. Here’s how it works.

  • IIS has a feature called ARR (Application Request Routing)
    • It’s used all the time in Azure. You may have set up a Web App, which requires an “App Service”. The App Service is actually a proxy server that sits in front of your Web App. The proxy server uses ARR to route the requests to your Web App. But, in Azure they literally create a single proxy server for your single web application server. If you want to scale up and “move the slider”, more application servers are created behind the proxy. BUT, in Azure, the number of Web Apps that can sit behind an App Service/Proxy Service is very limited (less than 5). <rant>Nowhere in the IIS documentation do they tell you to limit yourself to 5 applications; and the “/Build conference” videos from the IIS team make you believe that IIS is meant to handle hundreds of websites.</rant>
  • We use ARR to route requests for all our custom made websites (~215) to the application servers behind our proxy.
  • ARR uses webfarms to determine where to route requests. The purpose of the webfarms is to have multiple backend Application Servers, which handles load balancing.
  • The webfarms have a Healthcheck feature, which allows the web farms to check if the application servers behind the proxy are Healthy. If one of the application servers isn’t healthy then it’s taken out of the pool until it’s healthy again.
    • I really like this feature and it makes a lot of sense.
  • The BIG PROBLEM with this setup is that the WEBFARMS AREN’T DIRECTLY LINKED TO APPLICATION POOLS.
    • So, every application pool that runs on the frontend proxy server loads the entire list of webfarms into memory.
    • If any of those webfarms happens to have a healthcheck url, then that application pool will consider itself the responsible party to check that healthcheck url.
    • So, if a healthcheck url has a healthcheck interval of 30 seconds …
    • And a proxy server has 215 application pools on it; then that is 215 healthchecks every 30 seconds.

I think the design of the Healthcheck feature is great. But, the IMPLEMENTATION is flawed. HEALTHCHECKS ARE NOT DESIGNED THE WAY THEY ARE IMPLEMENTED.

Of course I’ve worked on other ways to prevent this problem in the future. But, IIS NEEDS TO FIX THE WAY HEALTHCHECKS ARE IMPLEMENTED.

I get bothered when people complain without a solution, so here’s the solution I propose:

  • Create a new xmlnode in the <webfarm> section of applicationHost.config which directly links webfarms to application pools.
  • Example (sorry, I’m having a lot of problems getting code snippets to work in this version of my LiveWriter):
<webFarm enabled="true" name="wf_johndoe.place.com_lab">
  <applicationPool name="johndoe.place.com_lab" />
  <server enabled="true" address="wa100.place.com" />
  <applicationRequestRouting>
    <protocol reverseRewriteHostInResponseHeaders="false" timeout="00:00:30">
      <cache enabled="false" queryStringHandling="Accept" />
    </protocol>
    <affinity cookieName="ARRAffinity_johndoe.place.com_lab" useCookie="true" />
    <loadBalancing algorithm="WeightedRoundRobin" />
  </applicationRequestRouting>
</webFarm>

