Submitting a Bug/Pull Request for Asp.Net Core

on Monday, November 18, 2019

It’s great that the Asp.Net Core team has made the project open source. It makes finding bugs, resolving them, and submitting updates/pull requests tremendously more satisfying than the old routine of calling Microsoft support, reporting your bug, and hoping the product team considered your problem high enough priority to justify their time.

Last week I put together a post on a NullReferenceException within the Microsoft.AspNetCore.Http.ItemsCollection object. I finished the post by filing a bug with the AspNetCore team (#16938). The next step was to download the source code for Asp.Net Core and create a fix for it.

So, the standard forking of the repository and creating a new branch to hold the fix was easy enough, but then came the interesting part: submitting the Pull Request and receiving their feedback. The first thing that was pretty amazing was the turnaround. I submitted the fix around 10 AM on Saturday morning (Pull Request #16947), and by noon that same Saturday someone from the Asp.Net Core team had reviewed the pull request and already found improvements to the code. The best part of the review is that they are very talented programmers: they found multiple ways to improve the unit tests I submitted and even found other errors in the original file being updated. They didn’t just review the code, they looked for ways to improve the overall product.

The fix I found for the NullReferenceException was to use a null-conditional operator to ensure that the exception didn’t occur. But they went further: they searched the entire class for every place this might occur and suggested where else a null-conditional operator could be applied to prevent future issues. They are detailed.
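As a simplified sketch of the pattern (not the exact code from the pull request), the change amounts to guarding a possibly-null backing dictionary with the null-conditional operator and falling back to an empty enumerator:

```csharp
using System.Collections.Generic;
using System.Linq;

// Simplified sketch of the fix pattern; not the actual code from PR #16947.
public class ItemsDictionarySketch
{
    // The backing dictionary stays null until the first item is written.
    private IDictionary<object, object> _items;

    public IEnumerator<KeyValuePair<object, object>> GetEnumerator()
    {
        // Before: _items.GetEnumerator() throws a NullReferenceException
        // when nothing has populated the collection yet.
        // After: the null-conditional operator short-circuits to an empty enumerator.
        return _items?.GetEnumerator()
            ?? Enumerable.Empty<KeyValuePair<object, object>>().GetEnumerator();
    }
}
```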

The parts of my pull request they had the most feedback on were the unit tests, and the suggestions were useful for simplifying the code and getting at the core of what was being tested. When running the unit tests on my local machine, I could tell that they really focused on how to make the unit tests as fast and efficient as possible. The dotnet test runner could run the 169 unit tests in the Microsoft.AspNetCore.Http library in under 4 seconds. For comparison, I’ve mostly been working with PowerShell unit tests for a while, and loading up the PowerShell runtime and the Pester module, before even running the tests, usually takes a good ~5 seconds.

Overall it was a pretty easy process and they accepted the update for inclusion in the Asp.Net Core 5.0 preview1 release. Now, for getting the fix merged into Asp.Net Core 3.X, that was a little more difficult (#17068).

NullReferenceException in Http.ItemsCollection

on Monday, November 11, 2019

The other day a coworker and I were adding the ElmahCore/ElmahCore library to an AspNetCore web project and we ran into a strange NullReferenceException when using it. The problem didn’t occur when the library was used in an AspNetCore 2.2 project, and this was the first time trying it on an AspNetCore 3.0 project. So, we wanted to believe it was something to do with 3.0, in which case there would probably be other people running into the issue and Microsoft would have a fix ready for it very soon. But, we needed to move forward on the project, so we wanted to find a workaround in the meantime.

Personally, I found a bit of humor in this exception, because at first glance it looked like it was occurring within an Exception Management system (Elmah). I mean, the one thing that an Exception Management system definitely doesn’t want to do is throw exceptions. However, since I was already troubleshooting an exception problem that seemed kind of funny, I leaned into the ridiculousness and decided to debug the Exception Management system with another Exception Management system, Stackify Prefix ( … it’s more of an APM, but it’s close enough).

The truly laugh-out-loud moment came when I hooked up Stackify Prefix, and it fixed the Elmah exception. Elmah started working perfectly normally once Stackify Prefix was set up in the application. So … one Exception Management system fixed another Exception Management system. (Or, at least, that’s how it felt.)

Of course, I needed to figure out what Stackify did to fix Elmah, so I needed to pull Stackify back out of the application and really dig into Elmah. Which meant grabbing a clone of ElmahCore to work with.

Even before having the source code available, I had the function name and stack trace of the exception. So once I had the code in hand, zeroing in on the problem was pretty straightforward. But, what I found was unexpected.

As best I could tell, the problematic code wasn’t in Elmah, but stemmed from an underlying change inside Microsoft.AspNetCore.Http.ItemsDictionary. An internal field, _items, was null within the ItemsDictionary object, and when the enumerator for the collection was used, a NullReferenceException was generated.

(More detailed information at ElmahCore Issue #51)

It seemed like the workaround for the issue was to populate the request.Items collection with a value, in order to ensure the internal _items collection was not null. And to double check this, I loaded back up Stackify Prefix and checked what the value of the collection was when Stackify was involved. Sure enough, Stackify had populated the dictionary with a correlation id; and that’s why it fixed the issue.
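A minimal sketch of that workaround, as a piece of middleware registered early in the pipeline (the key name here is made up; any write to the collection works):

```csharp
using Microsoft.AspNetCore.Builder;

// Hypothetical middleware sketch: one write into HttpContext.Items forces the
// lazily-created internal _items dictionary into existence, so later
// enumeration can no longer hit the null field.
public static class ItemsWorkaroundExtensions
{
    public static IApplicationBuilder UseItemsInitializationWorkaround(this IApplicationBuilder app)
    {
        return app.Use(async (context, next) =>
        {
            context.Items["__ItemsInitialized"] = true; // made-up key name
            await next();
        });
    }
}
```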

For ElmahCore, I implemented a really bad code fix and created a pull request for them. It is a really bad fix because the solution triggers the exception, catches it, and then swallows it, which lets the Elmah code continue to execute successfully. But, any .NET profiler that ties into the really low-level .NET Profiler APIs (these are the APIs which power APM solutions like Stackify Retrace, AppDynamics, etc.) will record that exception and report it as if it’s an exception occurring within your application.
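In sketch form (a hypothetical simplification, not ElmahCore’s actual code), the bad fix amounts to this:

```csharp
using System;
using System.Collections.Generic;
using Microsoft.AspNetCore.Http;

public static class ItemsProbe
{
    // Hypothetical simplification of the "bad fix": enumerate the collection
    // and swallow the NullReferenceException thrown when it is empty.
    public static List<KeyValuePair<object, object>> SafeItems(HttpContext context)
    {
        try
        {
            return new List<KeyValuePair<object, object>>(context.Items);
        }
        catch (NullReferenceException)
        {
            // Swallowed on purpose so the caller can keep going; a profiler
            // hooked into the .NET Profiling APIs still records this exception.
            return new List<KeyValuePair<object, object>>();
        }
    }
}
```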

At this point, the next step is to inform Microsoft of the problem and get a bug fix/pull request going. I’ve opened the bug (AspNetCore Issue #16938), but I will need to get the code compiling on my machine before submitting a pull request.

Performance Gains by Caching Google JWT kids

on Monday, November 4, 2019

To start with, a “kid” is a key id within a JSON Web Key Set (JWKS). Within the OpenID Connect protocol (which is kind of like an OAuth2 extension), Authentication Services can ensure the data integrity of their JWT tokens by signing them. They sign the tokens with a private key, and the token signature can then be verified using a public certificate; this is somewhat similar to SSL certificates for websites over https. With https, the public certificate is immediately given to the browser as soon as you navigate to the website. But, for JWT tokens your application has to go “look up” the certificate. And that’s where OpenID Connect comes in.

Standardizing JWT tokens and certificate signing wasn’t OpenID Connect’s primary goal; it was a secondary feature. However, that secondary feature was pretty well done, and OAuth2 enthusiasts adopted that part of the protocol while leaving the rest of it alone. OpenID Connect specifies that there should be a “well known” endpoint where any system can look up common configuration information about an Authentication Service, for example:

https://accounts.google.com/.well-known/openid-configuration

One of the standard values is `jwks_uri`, which is the link to the JSON Web Key Set. In this case:

https://www.googleapis.com/oauth2/v3/certs
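That endpoint returns a JSON Web Key Set: a `keys` array where each entry describes one signing key. A trimmed-down sample of the shape (the values here are shortened and fabricated):

```json
{
  "keys": [
    {
      "kty": "RSA",
      "alg": "RS256",
      "use": "sig",
      "kid": "8a63fe71e53067524cbbc6a3a58463b3864c0787",
      "n": "vnMTRCMvsS04M1yaKR9C...",
      "e": "AQAB"
    }
  ]
}
```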

In the example above, the public key material is carried in the `n` value (the RSA modulus), and the `kid` is the key used to look up which certificate to apply. So, that’s what kids are; they’re the ids of signing certificates & algorithms.

So, where does the performance gain come in?

The performance gain for these publicly available certificates is that they can be cached on your application servers. If your application is going to use Google OAuth for authentication, and use JWT tokens to pass the user information around, then you can verify the token signatures using cached certificates. This puts all the authentication overhead on your application server and not in a synchronous callback to an Authentication Service.
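As a minimal sketch of that idea (assuming the Microsoft.IdentityModel.Protocols.OpenIdConnect and System.IdentityModel.Tokens.Jwt packages), the metadata and signing keys are fetched once and cached by ConfigurationManager, and every token after that is validated locally:

```csharp
using System.IdentityModel.Tokens.Jwt;
using System.Security.Claims;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.IdentityModel.Protocols;
using Microsoft.IdentityModel.Protocols.OpenIdConnect;
using Microsoft.IdentityModel.Tokens;

public static class GoogleJwtValidator
{
    // ConfigurationManager caches the downloaded metadata/JWKS and refreshes it
    // on an interval, so only the first validation pays the network cost.
    private static readonly ConfigurationManager<OpenIdConnectConfiguration> ConfigManager =
        new ConfigurationManager<OpenIdConnectConfiguration>(
            "https://accounts.google.com/.well-known/openid-configuration",
            new OpenIdConnectConfigurationRetriever());

    public static async Task<ClaimsPrincipal> ValidateAsync(string jwt)
    {
        var config = await ConfigManager.GetConfigurationAsync(CancellationToken.None);

        var parameters = new TokenValidationParameters
        {
            ValidIssuer = "https://accounts.google.com",
            ValidateAudience = false, // in real code, set ValidAudience to your client id
            IssuerSigningKeys = config.SigningKeys
        };

        // The handler matches the token header's "kid" to one of the cached keys.
        return new JwtSecurityTokenHandler()
            .ValidateToken(jwt, parameters, out SecurityToken _);
    }
}
```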

But, there is a small performance penalty in the first call to retrieve the JWKS.

What’s the first call performance penalty look like?

Not much, about 500 ms. But, here’s what it looks like with an actual example.

First call that includes the JWT Token:

  • It reaches out to https://accounts.google.com/.well-known/openid-configuration, which has the configuration information
  • The configuration information indicates where to get the “kids”: https://www.googleapis.com/oauth2/v3/certs
  • It downloads the JWKS and caches them
  • Then it performs validation against the JWT token (my token was expired in all of the screenshots, which is why there are “bugs” indicated)
  • Processing time: 582 ms
  • Processing time overhead for JWT: about 500 ms (in the list on the left side, the request just before it was the same request with no JWT token; it took about 99 ms)



[Screenshot: Stackify Prefix trace of the first call (info was larger than one screenshot could capture)]

Second call with JWT:

  • The caching worked as expected and the calls to Google didn’t occur.
  • Processing Time: 102 ms
  • So, the 500 ms overhead of the Google calls doesn’t happen when the caching is working.


[Screenshot: Stackify Prefix trace of the second call (info was larger than one screenshot could capture)]

Test Setup:

  • The first call is the application load time. This also included an Entity Framework first-load penalty when it called a database to verify whether I had permission to view the requested record.
    • Processing Time: 4851 ms
    • This call did not include the JWT.
  • The second call was to baseline the call without a JWT.
    • Processing Time: 96 ms
  • The third call was to verify the baseline without a JWT.
    • Processing Time: 99 ms

Wrap Up …

So, it isn’t much of a performance gain. But, it’s enough to make caching the value and keeping all the authentication logic on your application server worthwhile.

(Ohh … and Stackify Prefix is pretty awesome! If you haven’t tried it, you should: https://stackify.com/prefix-download/)

Translate a BDD User Story to a Pester/Selenium Test

on Monday, October 28, 2019

This is a very simple example of taking a User Story that’s written in a BDD-style given-when-then pattern and converting it into a Selenium-based UI test. To do this, I’m going to use Pester and a Selenium PowerShell module (mentioned previously in Selenium & PowerShell) to write the executable test.

So, let’s start out with a very common scenario: logging into a website. It’s not the greatest example, as it’s generally assumed that development teams will need to ensure this functionality works, so it might not make it into requirements written by end users or driven by business value statements. But, here’s the simplified use case anyway:

Scenario: User logs in with valid credentials

Given that I am not logged in,

When I enter my username and password and click the Log In button,

Then I will be logged in and I will see the Search page.

So, let’s translate this to some example code:
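Here’s a sketch of what that translation could look like with Pester and the Selenium PowerShell module (the URL, element ids, and credentials are all made up for illustration):

```powershell
# Sketch only: the site URL, element ids, and credentials are placeholders.
Describe 'User logs in with valid credentials' {

    BeforeAll {
        $driver = Start-SeChrome
    }

    It 'shows the Search page after logging in' {
        # Given that I am not logged in
        Enter-SeUrl -Driver $driver -Url 'https://example.com/login'

        # When I enter my username and password and click the Log In button
        $username = Find-SeElement -Driver $driver -Id 'username'
        Send-SeKeys -Element $username -Keys 'test.user'

        $password = Find-SeElement -Driver $driver -Id 'password'
        Send-SeKeys -Element $password -Keys 'P@ssw0rd!'

        $logInButton = Find-SeElement -Driver $driver -Id 'log-in'
        Invoke-SeClick -Element $logInButton

        # Then I will be logged in and I will see the Search page
        $driver.Url | Should -BeLike '*search*'
    }

    AfterAll {
        Stop-SeDriver $driver
    }
}
```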

What’s useful in a deployment summary?

on Monday, October 21, 2019

There’s a larger question that the title question stems from: What information is useful when troubleshooting problematic code in production? If a deployment goes out and things aren’t working as expected, what information would be useful for the development team to track down the problem?

Some things that would be great, would be:

  • Complete observability of the functions that were having the problem with information about inputs and the bad outputs.
  • Knowledge of what exactly was changed in the deployment compared to a previously known good deployment.
  • Confidence that the last update to the deployment could be the only reason that things aren’t functioning correctly.

All three of those things would be fantastic to have, but I’m only going to focus on the middle one. And, I’m only going to focus on the image above.

How do we gain quick and insightful knowledge into what changed in a deployment compared to a previously known ‘good’ deployment?

I guess a good starting place is to write down the elements that are important for the development team to know about a deployment when troubleshooting:

  • You’ll need to look up the deployment history information, so you’ll need a unique identifier that can be used to look up the info. (I’m always hopeful that this identifier is readily known or very easy to figure/find out. That’s not always the case, but it’s something to shoot for.)
  • When the deployment occurred, date and time?

    This would be useful for knowing whether the problem definitely started after the deployment went out.
  • Links to all of the work items that were part of the deployment?

    Sometimes you can guess which work item is most likely associated with an issue by the use of a key term or reference in its description. This can help narrow down where in logs or source control you may need to look.

    If they are described in a way that is easily understood by the development team (and with luck, members outside the development team) that would be great.
  • Links to the build that went into the production deployment? Or, all previous builds since the last production deployment?

    Knowing the dates and the details of the previous builds can help track the issue back to code commits or similar behavior in testing environments.

    Of course, if you can get to full CI/CD (one commit per build/deployment), then tracking down which work item / commit had the problem becomes a whole lot easier.
  • Links to the source control commit history diffs?

    If you want to answer the question “What exactly changed?”, a commit diff can answer it effectively.
  • Links directly to sql change diffs?

    In the same vein as source control commit history diffs, how about diffs to what changed in the databases.
  • Statistics on prior build testing results? If a previous build didn’t make it to production, why didn’t it make it there? Were there failing unit tests? How about failing integration tests or healthchecks?

    Would quick statistics on the number of tests run on each build (and passed) help pinpoint a production issue? How about code coverage percentages? Hmmm … I don’t know if those statistics would lead to more effective troubleshooting.

Another thing that seems obvious from the picture, but might not always be obvious, is “linkability”. A deployment summary can give you an at-a-glance view into the deployment, but when you need to find out more information about a particular aspect, having links to drill down is incredibly useful.

But, there has to be more. What other elements are good to have in a deployment summary?

A FakeHttpMessageResponse Handler

on Monday, October 14, 2019

The Microsoft ASP.NET Team has done a great deal of work over the years to make their libraries more friendly to unit testing. Some of the best parts of their efforts were their adoption of open source testing frameworks like xUnit and Moq.

An interesting use case when testing is the need to mock data from a data source. Generally this means mocking the result from a database call, which the EF Core Team has made considerably easier with the addition of UseInMemoryDatabase.

But, sometimes your data store is going to be a Web API. And in those cases, you might use an HttpClient to retrieve your data from the service. So, how would you test that? The way Microsoft intended for you to handle this is to create an HttpMessageHandler (or DelegatingHandler) and add it into the baseline handlers when creating your HttpClient through the HttpClientFactory. Unfortunately, that isn’t always as straightforward as it would seem.

Luckily, Sebastian Gingter has this great way of using Moq to simplify the process with a FakeHttpMessageHandler. It really cleans up the clutter in the code and makes the usage readable.
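Here’s a minimal sketch of the Moq-based approach (simplified; the helper and the canned payload are made up for illustration):

```csharp
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using Moq;
using Moq.Protected;

public static class FakeHttpMessageHandlerExample
{
    // Builds an HttpClient whose protected SendAsync is mocked to return a
    // canned response, so tests never touch the network.
    public static HttpClient CreateFakeClient(string jsonPayload)
    {
        var handler = new Mock<HttpMessageHandler>();

        handler.Protected()
            .Setup<Task<HttpResponseMessage>>(
                "SendAsync",
                ItExpr.IsAny<HttpRequestMessage>(),
                ItExpr.IsAny<CancellationToken>())
            .ReturnsAsync(new HttpResponseMessage
            {
                StatusCode = HttpStatusCode.OK,
                Content = new StringContent(jsonPayload)
            });

        return new HttpClient(handler.Object);
    }
}
```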

This example has a small twist on his implementation, as it doesn’t pass the mocked HttpMessageHandler in through the constructor. Instead, it exposes the HttpClient’s internal handler through a public property so it can be replaced after the instantiation of the HttpClient object. This is definitely not something Microsoft wants you to do.

But, here we go anyway:

  • AlphaProxyTests.cs

    This is example Test code which shows the FakeHttpMessageHandler in use.
  • FakeHttpMessageHandler.cs

    This is the slightly modified version of Sebastian Gingter’s FakeHttpMessageHandler class.
  • HttpClientWrapper.cs

    This is a wrapper class for HttpClient which exposes the internal HttpMessageHandler as a public property. This is not recommended by Microsoft.
  • HttpMessageInvokeExtensions.cs

    This uses reflection to access internal fields in order to directly manipulate the internal HttpMessageHandler that HttpClient will call. Because this accesses internal fields from HttpClient it is unsupported by Microsoft and it will break from time to time. It has already broken once, and there is a good chance it may not work with all ASP.NET Core implementations.

Analysis of an Antivirus False Positive

on Monday, October 7, 2019

Here’s the background scenario:

A team member zipped up a website from a production server and copied it to their local machine. This particular production server did not have an antivirus scanner on it at the time, and when the zip file reached the team member’s machine, this warning message appeared from the machine’s antivirus software:

[Screenshot: the antivirus warning message]
The team member did the right thing: they deleted the files from disk and contacted the security team to investigate.

Soo … what are the right steps to perform from here? I’m not sure, but here are some steps that could be taken.

  • Despite the constant warnings of security trainers, don’t assume that a virus has now been distributed onto the server and then onto the team member’s workstation. This is a classic fear-based training technique used to prevent the spread of an attack, but it may not always be the right thing to do.
  • Instead, look at the evidence presented so far and try to see what more information can be learned from it.

Okay, so what can be determined from the information so far:

  • A server could possibly have a backdoor virus called PHP/Remoteshell.B on it.
    • If the backdoor was executed, then potentially a different/higher privileged malware could be on the server.
  • A team member machine could possibly have a different/higher privileged malware on it.
  • The website was running Sitefinity
    • Sitefinity is an ASP.NET product, not a PHP product
  • The virus signature was found in a file called Error.*.log
    • Sitefinity uses Error.*.log files to record messages about errors it detects or encounters in order for administrators to review them later.
    • Therefore the Error.*.log files would contain error messages, not executable code.

And, what assumptions could we make from the information so far:

  • Given that it’s an error log, viewing the log file would probably be benign and could possibly give information about how an attack started and whether it stopped.
  • Given that it’s an ASP.NET web application and not a PHP application, the attack did not succeed. So, there is very little to no risk in assuming the server is not infected with a backdoor or virus.
  • However, if it was infected with a virus, and that virus was capable of proliferating itself through RDP, then the team member had already brought the virus into their/your network.
  • In this highly unlikely scenario, if the team member had already brought the virus onto your network then it wouldn’t be terrible to connect with an isolated VM instance from that network which could be used to look at the files on the server.

At this point, using a disposable VM to view the actual log files on the server, at the risk of proliferating a virus onto it, seemed like a good risk-vs-reward tradeoff.

So, using a throw away virtual machine, you could connect up to the server and take a look at the Error.*.log files, which might show this:

[Screenshot: the Error.*.log entries]

There are a number of lines that look very similar to the ones above, and they all occurred over a short 30-second period and then stopped. The requested URLs varied, with syntaxes ranging from PHP and .NET to regular old SQL injection.

So, what’s the new information that we can learn from this:

  • This looks like an automated attack signature from a bot that was penetration testing.

This server is fortunate enough to get routine penetration testing from our automated penetration testing system around every two weeks. If any of the attacks are found to be successful, then a report is generated and sent to the Security and Development teams to implement fixes.

So, this could have been a nice moment to call the analysis done, but it did highlight a few missing things. It resulted in the antivirus software being hooked up to the server and a scan being performed, along with a good amount of information sharing and more awareness of the automated penetration testing system.

There’s probably a lot more that could be done, and a lot more safety precautions that could have been taken during the analysis. At least, there’s always tomorrow to improve.

