Sunday, 17 March 2013

Going Cloudy Part 5 - Scalability without breaking the bank

As detailed previously, the application consists of three major components:
  • The web site, which predictably is used by everyone.
  • The asmx web service which is used by about 80% of users but performs fairly low-resource actions.
  • The REST service is hit only by external systems importing/retrieving data on a schedule, so it doesn’t need a lot of resources either.

 Going with the default cloud approach of separating everything would be prohibitively expensive. Currently a 2 instance cloud service is about £18 a month. Add on the monitoring project and this means 4 cloud services, giving me an £80 quid bill every month on servers whose utilisation would be pretty poor.
The smart thing to do is merge them all on to a single cloud service. It’s important that merging multiple sites/services on a single cloud service is only done where appropriate. Putting lots of services on the same instance when doing so will result in a shortage of resources defeats the point of moving to Cloud hosting. In this instance, they are all low-resource services which for the foreseeable future will happily sit on the same instance without problem. When traffic increases to the point where either of the services needs to be separated out to an instance of its own, this will be easy to do when required. The details of how to merge all of the services in to a single instance are detailed in my previous post.

 During initial deployment, I was getting some really appalling performance on all services, which I initially put down to the traffic manager. Using Remote Desktop to log on to one of the instances revealed that there was a process called VSPerf.exe running which was sapping all of the server’s CPU and a great deal of its RAM. There was one of these for each site initially (so 4) and then another 1 for each site was added whenever I did an upgrade deployment. This resulted in one instance where I had 16 VSPerfs running, it took me nearly 15 minutes to remote on to the server to kill them!
I could only find some anecdotal references to this problem on the internet which revealed no solutions other than to reboot after every deployment. Eventually I worked out that I had inadvertently turned on profiling at some point. I’d accidentally turned this on and didn’t need it, so I turned it off and the problem was solved. I’m sure the majority of users of profiling do not have this issue, but if you turn on profiling and then your performance falls through the floor, take a look at Task Manager and see if you are one of the unlucky few.

Friday, 8 March 2013

New Relic for Azure

Ever since my company moved our main management system to Azure, I've been slowly working on the ability to monitor the application better in order to find the bottle necks and improve the user experience. To date, I've added:

  1. Timing on every request so I can query for the slowest pages.
  2. Azure Diagnostics
  3. StackExchange MiniProfiler to give me some insight in to every request, this has proved extremely useful in tracking down the exact cause of slow pages.

More recently, I've been planning to add Glimpse as well, as I like the insight it gives you in to the MVC ViewEngine and routing systems.

All these tools are great but each one adds an extra element of complexity to the system that is unavoidable, or at least it was until now.

Yesterday, I discovered New Relic. Their .net agent promises to hook in to your application with no code changes and it actually achieves it!
Shamelessly lifted from the New Relic blog, this is how:
Code run by the CLR is considered ‘managed’ code, i.e., the CLR provides a managed environment in which memory object garbage collection and other services are ‘managed’ by the CLR. The Profiler API provides a mechanism for a profiler, such as the New Relic .NET agent, to inject code into whatever managed-code functions it desires. These injected bytes are in the form of MSIL, the .NET assembly language.

Personally, I think this is damn impressive, and as I already mentioned it allows the agent to hook in with no code changes. The agent can be installed on any server, but what impresses me is the simplicity with which this can be added to Azure Cloud Services.

The New Relic blog gives step by step instructions on how to add the agent using Nuget here, however this isn't all that is required. I deployed once and there was no data being reported, logging on to the server revealed that the agent had not been installed.
 I found that while the nuget package added a newrelic.cmd file to install the agent, it didn't add an entry to the cloud service's servicedefinition.csdef, so the script was never getting fired. After a few attempts, I found that the following entry works.
<Task commandLine="newrelic.cmd" executionContext="elevated" taskType="foreground" />
My initial attempt utilised a taskType of background, which meant that the task was processed asynchronously and everything was already initialised by the time the installation had completed - the agent had missed the boat to get it's hooks in.
I contacted New Relic support prior to working out the solution and they suggested that this would help them with a problem they were having with the nuget package (presumably, it was supposed to add it to the servicedefinition.csdef). The knowledge that I have possibly aided in making use of NR even smoother for other users is great, it feels good to contribute.

If you sign up through Azure, you get the Standard account free with a pro trial, all the details are on the aforementioned blog post, link below for those too lazy to scroll back up.

Note: I am not affiliated with New Relic in any way other than being a (free) customer and they have not paid me for this post (why would they, no one will read it!), I just think this is a really great tool.

Going Cloudy Part 4 - FTPServer and REST Service configuration

The FTP Server is a Windows Server 2012 Extra Small VM, running IIS for FTP and our in-house import/export agent.
The FTP server is used to exchange data between the application and external services. The reason for FTP is that the main external service we deal with only has that capability. We hope that external service can eventually switch to using JMS which can then be bridged to Service Bus, but for now this is all we have.
For the purposes of resilience and quick recovery after a failure, all files including the FTP folders and the executables for the agent are kept on a separate data disk and all of the configuration needed to get from brand new VM to all systems go is scripted with Powershell.
If the server ever dies, it’s just a case of firing up a new VM, adding the data disk, and running the Powershell script. I also intend to image the OS to give me the reimage option as well. The Powershell script is nothing special, but I intend to publish it for the sake of completeness.

You’ll notice from the diagram that the rest service is being directly accessed from, bypassing the load balancer. You will also notice that only the EU instance is being used, the US instances are sitting there doing nothing.
There is a good reason for this.
The REST service can theoretically be accessed by any external system that is capable (as long as we grant it permission of course). Some of the imports carried out by the REST service are fairly destructive. If traffic was going through the load balancer, there is the possibility that the REST service in the US could be carrying out an import at the same time as the REST service in the EU, having some very scary consequences.
Now, in an ideal world, the REST service and the underlying data model would’ve been built to deal with this possibility. But it wasn’t, so we have to deal with that. Eventually, I intend for the REST service to simply be a receiver of files, which then puts the received data in to a queue to be picked up by a worker process, or worker processes, which WILL be designed to deal with multiple instances working at the same time without screwing everything up.
Having the entirety traffic go straight to the EU instance means that only one REST service will be taking request at any one time. It also means the US ones are doing nothing, but they use little to no resources if they are not being sent any work to do so I don’t think this is a major problem.
The only way around this would be to have a separate service definition set up for the US and other regions, which it seems to me is unnecessary duplication and extra work.
An alternative set up that I am considering is to have a second traffic manager configured for failover which is there just for the REST service. This would allow failover to the US instances if the EU ones ever became unavailable.