This project is read-only.

Using StoryQ for performance critical testing

May 3, 2011 at 7:43 AM

I should probably start by saying that I am a HUGE fan of StoryQ. It's syntactically superior to the other .NET BDD frameworks out there, and using it has encouraged some of our more reluctant developers to adopt it and start writing much cleaner, more readable, maintainable test code. It's all good, and my beating the drum has finally started to pay off... but I've run into a snag, and from a commercial perspective it's a rather huge one.

I've been fine-tuning our build processes. We build about 10 different products with all of their interdependencies, and we have a staged release process that starts with an integration build (which runs continuously for our day-to-day development). That build presently takes about 20 minutes for a single API, and about 40 minutes all up. The API is heavily tested, and while all manner of optimization has gone into both the implementation code and the build process itself, I have yet to deal with the tests. I've taken a number of measurements and learned that the tests account for about 65% of the build time, and this will only grow as we add more modules to the system. With about 95% test coverage, we've got a lot invested in the testing process, so changes to the way we test will have a big impact on us.

So that's a nice little background story, but here's the rub. When I compare tests written in (to all intents and purposes) an identical manner, I find that the StoryQ-based tests run somewhere in the order of 25-30% slower than if I were to do things another way. A few rough post-profiling calculations on the back of a napkin (taking into account the JIT overheads etc.) show that I could effectively reduce a 20-minute build to nearly 8 minutes if I were to give up my dreams of taking over the world one StoryQ test at a time. The numbers I'm quoting here are really rough and come from some seriously contrived test cases that don't quite mirror what happens on our integration system, but they do give a reasonably good indication of the differences between similar test cases with and without StoryQ.

Now, I'm not saying that StoryQ is the only thing we need to fix. Yes, we've been looking at how our tests are written, removing code bottlenecks, and throwing faster hardware at the problem, while also working on parallel build processing and all manner of smoke-and-mirror technology. I really don't want to abandon StoryQ, as it has been brilliant at helping me clean up the collective acts of some of our developers, and has seriously aided us in streamlining other processes, improving our reporting and so on. However much people might disagree with the resources we are throwing at this process, we have very good commercial reasons for doing so, and removing bottlenecks from our release process has become quite a critical business requirement, so I'm stuck with trying to "fix" this.

So I guess the real question to put forward is: why is StoryQ having such a big impact on the execution time of our test code? Is there some mysteriously optimal way of writing tests with StoryQ that we're missing, or is there an inherent problem in the design of StoryQ itself that results in comparatively poor execution time? If the latter, what can we do about it, and are there any plans to look into the performance of StoryQ? Or is this a case of trading off performance for readability/maintainability?

Cheers!

:-)

May 3, 2011 at 8:30 AM

I'd say that the reason StoryQ is slower than normal unit testing is probably down to its heavy use of reflection, which is how it displays method names for you. Another possibility is the report output, which is pretty XML-intensive. Are you using .Execute() or .ExecuteWithReport()?
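Per step, the work is roughly this kind of thing (a rough sketch of the idea only, not StoryQ's actual code; the helper below is made up for illustration):

    using System;
    using System.Text.RegularExpressions;

    public static class StepRunner
    {
        // Roughly what each step costs: one reflection lookup for the method
        // name, a little string munging to make it readable, and a delegate call.
        public static string Run(Action step)
        {
            string name = step.Method.Name;                          // reflection
            string text = Regex.Replace(name, @"(\B[A-Z])", " $1")   // un-camel the name
                               .Replace('_', ' ')
                               .ToLowerInvariant();
            step();                                                   // the actual test code
            return text;
        }
    }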

Unfortunately I don't think this can be optimised much further, as we've already boiled everything down to the delegate's Method.Name and Invoke(), but can you please let me know:

  1. How many storyqified unit tests are you running?
  2. How many steps are being executed (roughly)?
  3. Are you using ExecuteWithReport()?
  4. Which unit testing framework are you using, and what execution environment is invoking it (TFS? TeamCity? NUnit.exe?)

Cheers - Rob

P.S. As an aside, we have about 10x as many plain old unit tests as story tests; that's roughly the ratio between unit tests and true specifications on my latest StoryQified project.

May 3, 2011 at 12:07 PM

Like Rob, most of my tests run in the unit test area, with fewer integration tests and fewer still system tests. I've documented this elsewhere as layering tests according to the test automation pyramid.

  • Is this the case for you? What percentage of your tests are unit tests, by count and by runtime?
  • What is your take on JB Rainsberger's idea that integration tests are a scam? Are you trying to do exhaustive testing through StoryQ?
  • Freeman and Pryce use the definitions below; how do they relate to your test strategy?
  1. Acceptance: Does the whole system work?
  2. Integration: Does our code work against code we can't change?
  3. Unit: Do our objects do the right thing, and are they convenient to work with?

In short, speeding up tests may mask an underlying test-strategy problem. I have often found this to be the case: as the codebase grows in size and complexity, my test strategy doesn't always scale and has to be refactored. Could this be the case here?

May 5, 2011 at 3:36 AM
Edited May 5, 2011 at 4:05 AM

In answer to Rob:

At the moment, about 30-40% of my tests are written with StoryQ. I use ExecuteWithReport(), but the measurements I've taken show that under most normal conditions this doesn't impact the overall execution time too heavily. I tend to write my tests with only a handful of steps, usually a single Given/When/Then set, and sometimes only Given/Then. Sometimes I add a few extra steps for clarity, or when code reuse seems important... but this isn't too often. I try to stick to the rule that "each test should test only one thing!", and in a twisted way I try to apply that to the test setup as well. Everything runs under nunit-console, driven from a Python script, with RhinoMocks thrown in for good measure.
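For the record, a typical test of mine has roughly this shape (a minimal sketch; the story, fixture and step names are placeholders, not our real code):

    using System.Reflection;
    using NUnit.Framework;
    using StoryQ;

    [TestFixture]
    public class WithdrawCashStory
    {
        private decimal _balance;   // placeholder state; the real tests use RhinoMocks

        [Test]
        public void Account_holder_withdraws_cash()
        {
            new Story("withdraw cash")
                .InOrderTo("get cash when the branch is closed")
                .AsA("account holder")
                .IWant("to withdraw cash from an ATM")
                .WithScenario("account is in credit")
                    .Given(The_account_balance_is_100_dollars)
                    .When(I_withdraw_20_dollars)
                    .Then(The_account_balance_should_be_80_dollars)
                .ExecuteWithReport(MethodBase.GetCurrentMethod());
        }

        private void The_account_balance_is_100_dollars() { _balance = 100m; }
        private void I_withdraw_20_dollars() { _balance -= 20m; }
        private void The_account_balance_should_be_80_dollars() { Assert.AreEqual(80m, _balance); }
    }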

To answer Todd:

We don't write integration tests. Our architecture is such that we don't end up with the same sort of legacy overheads that many places do, although this may change at some stage, so we may need a limited number of such tests in future to highlight the things that would prove too costly to improve. We do, however, execute thousands of unit tests as an integral part of an automated process on the build server where developers' code is integrated into the mainline. When we talk about integration, this is what we mean. Test failures break the build, and the rule of thumb is that the build should NEVER break; if it does, it's probably because someone did not implement their code test-first against a fresh source snapshot, or even worse, didn't run the tests prior to check-in. We use this to highlight procedural flaws and to encourage developers to adhere to the procedures that we know work for us. Test layering is less of an issue, as our products are designed to be both layered and modular, so the testing reflects this naturally.

As to whether the test strategy scales... in terms of doing what it should and providing the data and safety net we want, yes it does. In terms of overall speed and using developer time efficiently, that is the question I'm trying to answer. The key may simply be to throw better hardware at it and use a brute-force approach to getting things done quickly. The problem is that upgrading a build system is relatively cheap, but upgrading everyone's system to keep pace is not, and we have to deal with the fact that developers' machines run at different speeds to each other and to the main build system. Also, simply throwing hardware at the problem doesn't help you learn to do things more efficiently, and doesn't improve your skills as a developer. I'm looking for the break-even point that will let me say how much effort we should put into improving our code. Using a good profiler has been a good start for identifying the critical performance hot-spots, and unfortunately everything is pointing at the tests themselves, particularly the use of StoryQ, although there is also a HUGE JIT hit that I'm still trying to get a handle on; I suspect the two are related. So I'm at that classic trade-off point where I must choose between beautiful test code, efficient test code, or ignoring both and supercharging the hardware. My gut tells me that I need to look at all three over time, but it's a question of choosing sensible options and scheduling them appropriately.

One thing I've realized that I think is worth a mention: it's my understanding that RhinoMocks also uses reflection quite heavily (or the Castle framework does... or both), and yet it doesn't seem to impact the tests as heavily. I wrote a few very simple tests using Rhino and found that, for those particular tests, I could execute them about 3800 times per second, but when I added StoryQ into the mix I could only execute about 100-120 tests per second. I realise that there are a few additional function calls in there as a result, but not enough to reduce performance so sharply, which is why I ended up posting this discussion. It's what has me questioning whether it was something I hadn't taken into account, or whether there is a problem in StoryQ itself... and while all this is academically interesting, it leaves me at the point where I may or may not need to make a few hard decisions. Whatever I decide will be right for my business, but I'd rather do it with my eyes open!  :-)
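For what it's worth, the rough measurement was along these lines (a contrived sketch of the kind of harness I used, not the real tests):

    using System;
    using System.Diagnostics;

    public static class ThroughputCheck
    {
        // Very rough throughput measurement: run a test body repeatedly and
        // report executions per second. The warm-up call keeps the JIT cost
        // out of the timed loop.
        public static double ExecutionsPerSecond(Action testBody, int iterations = 1000)
        {
            testBody();
            var stopwatch = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                testBody();
            }
            stopwatch.Stop();
            return iterations / stopwatch.Elapsed.TotalSeconds;
        }
    }

I ran the same body twice, once as a bare Rhino-based test and once wrapped in a StoryQ scenario, and compared the two numbers.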

Thanks for your thoughts!

S.

May 5, 2011 at 8:39 AM

Thanks MrBDD

I guess I still just need to know what you're using to run your tests (TFS? TeamCity? NUnit.exe?), so that I can profile StoryQ.

Thanks for your comments about Rhino too; it makes me think that there probably is room for improvement.

May 5, 2011 at 10:07 AM

Hey Rob!

That would be NUnit for the tests, although I run everything via the console app so I can script what to call and when.

I hope you guys don't think I'm being too harsh on your efforts with StoryQ. It's a great product, and given some of the measurements I've been taking lately, most of the difficulty in trying to optimize a product like this seems to be that you very quickly reach performance limits due to the way reflection works. And my comment about Rhino was probably a little misleading, because it does impact test execution time quite significantly by itself... and it's possible some of my test data wasn't very clean. Some additional testing today showed me that the two products are roughly on par in terms of the load they place on tests. That's the problem with applying a scientific approach to learning about stuff... you have to do the same thing over and over in lots of different ways, and the data changes as the conditions change! :-)

I've noticed that I've tended to farm out a lot of the methods I get StoryQ to use to something I like to call a test harness class (see the sketch below). I'm toying with the idea of trying to inline some of the methods, but I can't see how I'd do that without diving into your code... something I'd rather avoid, as I like to leave 3rd-party products as-is so that I don't have to port customizations every time something changes.
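By a "test harness class" I mean something along these lines (an illustrative sketch, not our actual code):

    using NUnit.Framework;

    // The StoryQ test itself stays thin; the steps live on a harness object
    // that owns the fixture state, the mocks, and any shared setup.
    public class AccountHarness
    {
        private decimal _balance;

        public void The_account_balance_is_100_dollars() { _balance = 100m; }
        public void I_withdraw_20_dollars() { _balance -= 20m; }
        public void The_account_balance_should_be_80_dollars() { Assert.AreEqual(80m, _balance); }
    }

The scenario then points straight at the harness, e.g. .Given(harness.The_account_balance_is_100_dollars), rather than at private methods on the test fixture.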

FYI, the profiling tool I've been using is the ANTS Performance Profiler. Sorry for the product plug... but it's been a good find and I've been getting a lot of mileage out of it.

Cheers!

May 5, 2011 at 10:17 AM
Edited May 5, 2011 at 10:19 AM

I've had harsher ;)

Can you have a look at my new PerformanceTest (just checked in at http://storyq.codeplex.com/SourceControl/changeset/changes/d354b64226b4)? I got the execution time down from 6 seconds to 34 ms by buffering the calls to Console.Out.WriteLine (that change is part of the same commit).
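The gist of the change is simply to collect the report text and write it out in one go rather than line by line (a simplified sketch of the idea, not the actual commit):

    using System;
    using System.Text;

    // Many small Console.Out.WriteLine calls can be very slow, especially when
    // the test runner has swapped Console.Out for its own TextWriter. Buffering
    // the lines and flushing once avoids most of that cost.
    public sealed class BufferedOutput
    {
        private readonly StringBuilder _buffer = new StringBuilder();

        public void WriteLine(string line)
        {
            _buffer.AppendLine(line);
        }

        public void Flush()
        {
            Console.Out.Write(_buffer.ToString());
            _buffer.Length = 0;
        }
    }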

If that's representative of your problem, can you try getting the latest code, building StoryQ.dll, and trying that?

It was in fact ANTS that pointed out the high cost of all the calls to WriteLine for me, although my favourite is dotTrace ;)

In my defence, this performance problem is related to whatever the current test runner decides to replace Console.Out with :D

You can in fact call methods on other objects from a StoryQ test; all that matters is the method name. One argument against inlining them, though, is that you'd lose the ability to put formatting attributes on the methods and their parameters.
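Something along these lines (a sketch from memory; the attribute name is an assumption on my part, so check the StoryQ.Formatting namespace for the exact spelling):

    using StoryQ.Formatting;   // assumed namespace; the attribute name below is from memory

    public class AccountSteps
    {
        // Overrides the text that would otherwise be derived from the method
        // name via reflection, so the step reads as business language.
        [OverrideMethodFormat("the account is overdrawn by $20")]
        public void The_account_is_overdrawn_by_20_dollars()
        {
            // arrange an overdrawn account here
        }
    }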

Cheers - Rob

May 6, 2011 at 1:33 AM

Now that's what I call an optimization!

It's amazing how easy it is to forget the impact that messing with strings can have, and how important it can be to use the "right" tool for the job.  I have a test that was showing me I could only get about 120 executions per second, but with the change you've made it jumped to just under 1500!!

That shifts the balance significantly... but of course it now creates much more work for me as I'll need to look for problems elsewhere if StoryQ is no longer to blame! :-P

Thanks for that change, and for reminding me that I'll need to look for similar optimizations in our products! 

:-)