Antifragile Development

I just ordered the book Antifragile by Nassim Nicholas Taleb.  As I was thinking about the thesis of the book, I thought to myself: that sounds a lot like what our team is trying to create, and to be. For those of you not familiar with the book, its full title is “Antifragile: Things That Gain from Disorder”.  The core theme is that Antifragile is more than just Resilient: Resilient merely survives, while Antifragile feeds on chaos and disorder and gets better because of it.

Software development is about creatively solving problems and managing complexity; the team continuously encounters “shocks” from ongoing development and operations.  The production system goes down; bugs get introduced when fragile code is modified; framework bugs surface long after the code first ran problem-free; a developer with critical knowledge of the system leaves the company.

The team, the code-base, and the operational infrastructure often start out fragile. The small team is stressed; developers trade off speed to market against “doing the right thing”; operations takes on the risk of outages while the business tries to figure out how to scale.  Creating a Resilient organization and infrastructure is a great first step, but that is not enough.  The team needs to feed on everything that happens, both the things that go unexpectedly badly and the things that go unexpectedly well, and use the disorder to get better all the time.  Retrospectives with corrective actions help with continuous improvement. Introducing the right amount of stress, whether external or internal, makes us better. Antifragile is our goal.


Optimizing Hibernate performance through lazy, fetch and batch-size settings

In this post, I am going to focus on a particular example of our Hibernate usage and how we improved its performance.

We have a User object (who doesn’t?), which maps to many Badge objects like so:

<class name="com.utest.User">
   ... 
   <bag name="badges" table="Badges" lazy="false">
      <!-- <key> and <one-to-many> details omitted -->
   </bag>
</class>

Notice that lazy is set to false, because we need to make sure the User object is fully initialized before passing it out of the session context.
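To illustrate why, here is a minimal sketch (not our actual code; getBadges() is simply the assumed getter for the badges collection):

// With a lazy collection, touching badges after the session is closed would fail;
// lazy="false" initializes the collection while the session is still open.
User user = (User) session.get(User.class, userId);
session.close();
user.getBadges().size();  // LazyInitializationException if the collection were lazy and uninitialized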

Loading Users

Now we want to load some users, say 67 of them, and we use HQL like this:

from User where userId in (:userIds)

Unfortunately, this results in 68 queries (the N+1 selects problem): one main query to load all the Users, and then 67 queries to load the badges one user at a time.
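For reference, here is roughly how that HQL is executed with the Session API (a sketch; the variable names are illustrative, not our actual DAO code):

// session is an open org.hibernate.Session
List<User> users = session
        .createQuery("from User where userId in (:userIds)")
        .setParameterList("userIds", userIds)
        .list();
// With lazy="false" and the default fetch="select", Hibernate runs one query for
// the Users, then one additional query per User to initialize its badges.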

One way we could try to optimize this is to add fetch="join" to the mapping:

<bag name="badges" table="Badges" lazy="false" fetch="join">

We are hoping that Hibernate will now run a single query, with an outer join to the Badges table, but in practice it does not.  The reason is that Hibernate ignores the fetch attribute in the mapping when an HQL query is executed.

In order for the join to happen, we must either explicitly specify a fetch join in the HQL:

from User u left join fetch u.badges where u.userId in (:userIds)

or, we must use the Hibernate Criteria API to load the users:

session.createCriteria(User.class)
       .add(Restrictions.in("userId", userIds))
       // the joined results contain one root User per badge row, so collapse them back to distinct Users
       .setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY)
       .list()

Quick note: if you are using a third-party wrapper around Hibernate (we use Hibernate Generic DAO, for example), you might want to check whether it uses HQL or Criteria under the hood.  Ours uses HQL, which is what prompted the investigation behind this blog post.

But which fetch strategy is faster?

Sorry, no easy answers here.  It depends on the other parts of the User object mapping (how complex the single query will be), how many badges you expect per user, and even your network latency.

Here are the timings from my own machine using our actual code and loading 67 users:

1) with fetch="select" (the default), 67+1 queries, time taken is: 287ms
2) with fetch="join" and using the distinct Criteria query, 1 query, time taken is: 664ms
3) with fetch="select" but adding batch-size="100", 3+1 queries, time taken is: 170ms

It turns out that forcing the join fetch isn’t actually the faster option; in fact, it is much slower than doing 67 extra queries in our case.

The best option here is #3.  By adding batch-size="100" to the collection mapping (sketched below), we are telling Hibernate to load the Badges separately, but 100 at a time.  (Aside: for reasons still unclear to me, Hibernate issues 3 selects to batch load the Badges; if batch-size were set to a perfect 67, it would issue only 1.)
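For reference, the batch-size hint goes right on the collection mapping, with everything else unchanged:

<bag name="badges" table="Badges" lazy="false" batch-size="100">
   ...
</bag>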

In this case, doing the much simpler optimization of allowing Hibernate to batch load the badges reduces the total query time by 40%. And we don’t have to change any HQL code.

Conclusions

I think the biggest conclusions here are to
1) pay close attention to the SQL queries that Hibernate actually runs against the database, and
2) try out different mappings and time them.

To help with #1, you should enable Hibernate SQL logging while doing database development, by either (examples of both are sketched below):
a) setting the show_sql setting to “true” in the Hibernate configuration, or
b) setting the logging level for “org.hibernate.SQL” to DEBUG to see the generated SQL, and perhaps also setting the “org.hibernate.type” level to TRACE to see all of the parameters being used in the queries.
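For example, assuming hibernate.cfg.xml and log4j are in play (adjust for your own configuration and logging setup):

<!-- hibernate.cfg.xml -->
<property name="show_sql">true</property>

# log4j.properties
log4j.logger.org.hibernate.SQL=DEBUG
log4j.logger.org.hibernate.type=TRACE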


Continuous Deployment at uTest

It’s been a busy few weeks for the Platform Team at uTest! Earlier this week, we launched our brand new, next-generation uTest platform to all new customers. We’ve also been making improvements to our development and deployment process – we are happy to announce that we’ve started our practice of Continuous Deployment!

For those of you not familiar with Continuous Deployment, the idea is that we deploy code to production very frequently (up to multiple times a day). Code is delivered to customers quickly, and we get to learn and iterate very rapidly. And if we do it right, quality should improve as well.  Purists might say Continuous Deployment must deploy on every commit and that every new code roll-out should go via this route, but we’ve decided to take a more pragmatic approach: we deploy in small batches instead of on every commit, and we keep a separate weekly deployment schedule for the bigger, riskier features.

So what’s behind our new Continuous Deployment process?  Others have written about it, but here’s our take:

1) Automated testing: The backbone of Continuous Deployment is automated testing. It gives the team the confidence to release new code without breaking things and without a high cost per release: manual regression testing is too expensive and slow for releasing small changes frequently. We use a combination of unit tests (FuncUnit, JUnit), web service tests (soapUI), and UI tests (Selenium) to ensure that the application does not break.

2) Automated deployment scripts: we have buttons that trigger merges, test runs, and deployments to various internal servers for testing, as well as to production. We reduce the cost of deployment and make this a reliable, repeatable process.

3) Monitoring system: We use Nagios as our monitoring and alert mechanism. If something breaks as a result of our deployment, we will know right away. We monitor a lot of subcomponents of our system individually (probably a different post some day for details). We don’t use Nagios to monitor business events today, but we are planning to add them to our dashboards to ensure that not only are the servers and applications running, but that our application is functioning correctly to deliver business results to the company.

4) Team mentality: This is the most important of all!  No matter how great you make the infrastructure and process, Continuous Deployment will not succeed if the team has not fully bought in.  Conversely, if the team has the drive to make it work, you can make it work, even without 100% test coverage, or a sophisticated real time dashboard for production health monitoring. The key thing is that we all strive to get code deployed quickly (because that’s how we deliver value – work in progress is not zero, but negative value!), and that we are always looking to improve on how we do things.

So, what are the results of adopting Continuous Deployment?  It’s too early to tell. We still need to prove that this process actually gives us quicker feature time to market and more rapid iteration than our current once-a-week deployment, and to confirm that our infrastructure and process can deliver quality software.  But we’ve already felt a strong impact from the change: the whole team is excited that they can introduce their work to the world without waiting.  Because if you built it and no one came and used it, did you really build it?

 

PS -

I’ve heard some questions regarding our process so let me address them here:
1) So how does this fit in with uTest’s In-the-Wild testing service?
Good question! We are continuing to test bigger riskier features using our own service. We think that combining a strong automated testing infrastructure with manual in-the-wild testing is the most effective way to deliver value to customers fast and reliably.

And in fact we had a classic case study this past weekend while testing our new HTML app: our tester found a problem with our use of the Rich Text Editor that was 100% reproducible for the tester, but we couldn’t figure out why.  Our developer worked with the tester and discovered that the “disappearing text editor” problem was caused by the tester’s language setting. In-the-wild testing would have been the only way to catch this before releasing the code.

2) Why did we keep the weekly release cycle?
For one, we wanted to test some of the bigger, riskier features in-the-wild as described above. In addition, we do not have Test Automation in place for our core Flex application, which shares the backend infrastructure with our “next-gen” HTML app. So when we touch the back-end code, we risk breaking the legacy app, and currently we do regression testing manually (via our Testing Community).

Reinforcing Our Culture – via “Report Cards”

As I mentioned in my first post, where I described our tech and process infrastructure, culture is a big focal point of my activities. You might say that “culture comes with people,” and that’s absolutely true: people are the basis for team culture, and you can’t enforce a culture top-down. But you can help shape it, and in a team that is undergoing changes (via new blood and new processes), it is very important to keep reinforcing the culture that we want to build and grow.

We’ve made some changes and put together a series of activities to help promote the behaviors we value, for example:

  • Open Office Layout (only partially so far – our Cambridge and Seattle offices are completely open; Southborough still has cubes, but the new layout encourages more collaboration)
  • Weekly Engineering Brown Bag lunches
  • Science Fairs
  • Retrospectives (sprint retrospectives, and also deep post-mortems for production issues)
  • Developer Conference Reimbursements
  • and this Engineering Blog

I just went through a round of employee reviews (yes, as a result of flattening the organization, everyone on the Platform Team now reports to me directly), and I used a “report card” format, much like what my children bring home from school a couple of times a year.  It consists of a list of criteria, each with a numeric score on a scale of 1-4, plus free-text comments for each theme.  The scores aren’t scientific, and I’m not trying to use them to compare people.  I don’t believe that a developer with a “GPA” of 3.6 is any more valuable than one with a 3.5. I also don’t tie compensation to the numerical score on the report card (although if I did some analysis, I’d probably find a strong correlation). Instead, it is a way to organize discussion points around the behaviors and skills that I value in my organization.

The list looks like this:

Performance

  • “GSD”
  • Accomplish work that is important
  • Attention to quality
 How much value are you delivering to uTest and our customers? How much code are you deploying? Is your work consistently of high quality?

Judgment

  • Wise decisions despite ambiguity
  • Separate tactical steps from later improvements
How attuned are you to the context around you? How conscious are you of your decision-making process? Are you able to make good decisions in ambiguous situations?

Communication

  • Listen well
  • Concise in writing and speaking
  • Treat people with respect
  • Work towards solving problems (understand problems in context)
  • Effective collaboration
  • Assertive when appropriate
Do you always understand the whys and the hows before taking on a task? Do you update others to minimize surprises, and to make their work more productive? Do you solicit feedback early to validate or improve on your ideas?

Curiosity

  • Learn new things (tech, customers, industry)
  • Suggest and try new ideas

Passion

  • Care intensely about Testing (and uTest success)
  • Celebrate wins and accomplishments
  • Constant improvement
  • Pride in what you produce
  • Inspire others with your drive to do better
Passion and enthusiasm are contagious. Are you excited about work? Do you make your colleagues excited about work? Are you always working to get better at what you do?
Key:
4 = Consistently overachieve
3 = Usually achieve
2 = Satisfactory
1 = Unsatisfactory, needs work

 

You might have noticed that this list looks similar to the now-famous Netflix Culture deck posted a while ago (if you are working to build a good culture, it’s a must-read); I did borrow heavily from its content.

How much discussion do you have about culture in your performance reviews? Do you think a “report card” is an effective way to organize the discussion and reinforce the culture?

 

 

The Life of Bugs

Bugs are born through the touch of our fingers, but are found using all of our senses. Some bugs are visible only to the trained eye. Some bugs loudly cry out to us through our tools. Other bugs make our software feel wrong even before we can put our fingers on why that is. Patterns of code containing bugs smell, so we learn to detect their scents. Then there are the bugs which leave a bad taste in our mouths, which we resolve never to taste again.

Bugs can be found in a myriad of ways. Some bugs that are found are fixed. Other bugs that are found are not fixed. Most bugs are never found. We only remember some of the bugs we did find. We often forget those which were prevented.

Of the bugs we remember, we often also remember how and when we found those bugs. Sometimes we remember which of us “found a bug.” We grow to love the tools and processes which successfully “prevent bugs.” We always remember when a customer “discovers a bug.” We should all forget “who wrote the bug.”

Here are some of the ways and times that bugs are found or discovered (in order of preference):

  1. During high-level design
  2. During detailed design
  3. While writing code in our favorite editors
  4. At compile time
  5. At link time
  6. By unit tests
  7. By integration tests
  8. By static analysis tools
  9. During exploratory testing before code review
  10. During code review
  11. By pre-submit tests
  12. By a continuous build system just after submitting a change
  13. By a continuous build system down the road
  14. By integration tests run before a release candidate build is cut
  15. By acceptance tests run on a release candidate build in a staging environment
  16. By exploratory testing of a release candidate build in a staging environment
  17. By probes or other continuous monitoring tools in a staging environment
  18. By probes or other continuous monitoring tools once a build is pushed to production
  19. By exploratory testing of a new build in production
  20. Through inspection of system logs and error messages
  21. By a member of the engineering team using the product in production
  22. By an employee of the company using the product in production
  23. By a partner or other friend of the company using the product in production
  24. By a customer
  25. By a potential customer
  26. By a competitor
  27. By a member of a news organization
  28. By a hacker

We have full control over a number of the engineering tools and processes which can find our bugs. However, our tools are never perfect. Our processes are never perfectly designed or executed. We often forget that perfection is not our goal. That would take too much time.

Our time is limited. We use our time “pragmatically.” We use the majority of our time to create great software. Some of that time is spent writing bugs. We use some of our time to create, learn, and apply new tests and tools which help us to find the bugs we write. We use some of our time to create, learn, and execute processes which help us find each other’s bugs. When we have a surplus of time, we share and watch viral videos from the web.

When bugs are “discovered” by someone from outside of our engineering team, we take time to learn from each other’s “mistakes.” We reflect on how our existing tools and processes “failed us.” We may decide that new tools and processes are needed. We are reactive, but our goal is genuine: we want to “prevent” this kind of bug from ever being discovered again.

We resolve to continue introducing bugs into our systems going forward; just less of the kinds that were discovered before.


Does reusing a jQueryUI Dialog make a difference?

For my first blog post (ever) I am going to talk about a small performance test we did to determine whether reusing a pre-created jQueryUI dialog component is really any better than creating a new instance each time we want to display a modal dialog.  For a dynamic, single-page, JavaScript-powered application such as the one we are building now, these sorts of things do matter (more on our stack in future posts).

The setup (before each test run):

Pre-create a jQueryUI dialog with id "dialog1" and the autoOpen option set to false, so that it starts out hidden.

$('<div id="dialog1" />').dialog({
     ...
     autoOpen: false
});

1) Reuse the pre-created dialog, passing in new HTML

$('#dialog1').html('<div><div style="font-size: 20pt;">Some html goes here</div></div>');
$('#dialog1').dialog('open');

2) Reuse the pre-created jQueryUI dialog, passing in new HTML, but changing most of the dialog options, such as its title, buttons, height, and width

This will be the most common scenario for us since we would be reusing one dialog component for any sort of modal interaction our application might need throughout a user session.

$('#dialog1').html('<div><div style="font-size: 20pt;">Some html goes here</div></div>');
$('#dialog1').dialog('option', {
     ... see link below for full details, basically changed many options ...
});
$('#dialog1').dialog('open');

3) Create a completely new jQueryUI dialog

$('<div><div><div style="font-size: 20pt;">Some html goes here</div></div></div>').dialog({
    ...
    autoOpen: true
});

The results:

You can see the full code and results on the jsPerf.com test case page

Performance Results

As it turns out, it is ~9000 times faster to reuse a previously created jQueryUI dialog than to create a new one each time.  It doesn’t matter much if the options and contents are completely different from the previous use.  This was an even bigger difference than we anticipated, so now we know (and you know too) to reuse jQueryUI dialogs.

P.S. Thanks to uTest and Fumi for providing a space for our engineers to share these sorts of observations!


Inaugural Post

It’s been 4 months since I formally took over running the Platform Team, and I figured it’s time we started publishing an engineering blog to share what we do, and what we think about.

Our vision of this blog is to talk about our technology, obstacles we encounter, discoveries, and also about the culture and engineering practices that allow us to build great, innovative software.  Our team is a work in progress, just like our software is, continuously improving and moving towards a vision.

For this inaugural post, I’d like to describe to you at a high level where we stand in terms of our product architecture, and the engineering practice and process around it.  Frankly, there’d been quite a bit of “infrastructure debt” that we had to pay down before I felt comfortable having you all see where we are; but we’re finally at that stage. Yes, you will still find things we have or do that are not perfect. Over the next several postings, I’d like to paint the vision of where our team is striving to get to, as well as provide details around our ongoing transition – with the hope that it might help others like us going through a similar transition.

Product Architecture At a Glance 

Our current product platform consists of a pretty standard J2EE-based web application stack. We use MySQL, Hibernate, and Spring running in JBoss, with Apache in front, all hosted on Amazon EC2. We store media, including all attachments (images, videos, and other files), in S3.  We use CloudFront as the CDN for some files.

Our user interface used to be 100% Flex (yes, we made that unfortunate choice back in 2010, and we are paying for it dearly through a high cost of release due to the lack of test automation); we’ve been working on an HTML-based user experience for our next major release.  uTest Express customers have already been enjoying the intuitive, snappy user experience running on our HTML-based framework since September.

We provide a small subset of our product functionality via an API. Partners and customers who have asked for our API in the past have probably been disappointed by the lack of documentation and the difficulty of using the API.  We are in the process of revamping it, with clear documentation to accompany the API. If you are interested in using the API, please let us know (devsandbox@utest.com), and we can send you an early-access API doc and point you to our integration sandbox server to explore integration.

Our Development Process and Engineering Practices At a Glance

This is where we had a lot of catching up to do! To give you some context for our future blog posts detailing our transition, I will illustrate our process and practices “before and after”.

The Team:

The Platform Team used to be 2 separate teams: the engineering group, led by the VP of Engineering, and the products group, led by the VP of Product Management. The engineering group primarily came from a background accustomed to the waterfall development model. Four months ago, we flattened the organization and combined the groups into one Platform Team, encouraging closer collaboration between the developers and the stakeholders.

Development process:

The team followed a mostly waterfall software development process: for each release, the engineering group required a full set of specs and requirements from the product team, which were “handed off” to the dev team; once development was done, the work was “handed off” to the QA team. We didn’t (and still don’t) have an internal QA team; manual testing of course was (and still is) done by our testing community. The team released anywhere from once a month to once every two months.

We have since moved to a much more agile model. We spent a few release cycles trying Scrum, but it didn’t work very well (look out for a future blog entry on this topic!).  We are now using a Kanban dev process and release once a week.  The weekly release cadence required a lot of process and infrastructure changes.

Testing:

We do not have an internal QA team doing manual testing (we rely on our tester community!). We also did not have a dedicated QA automation engineer, and the team had not invested developer time in creating a test automation framework. Our use of Flex did not help either: the lack of an automated test suite resulted in costly releases and a high likelihood of regression bugs sneaking into each release. The QA practice also did not include much test-case-driven testing, which contributed to poor test coverage for each release. Today, our manual testing for each release has much higher coverage, thanks to the rich set of test cases developed by our community.

We are determined to tackle the quality issue by having the entire team focused on quality.  For the past three months, the Platform Team has been working on creating a framework for automated testing. We now have a Software Engineer in Test working full-time so we can catch up on automated testing. We now run more developer-created tests, and we have a framework in place to run automation scripts created by our community testers in our continuous integration environment.  We also have a much more formal code review process so that we can catch bugs earlier.  And we stopped our practice of having a handful of Product and Project Managers test each deployment on our production server, which had resulted in late Wednesday nights every time we deployed.

Build and Deployment Process:

Over the summer, I discovered that two developers who had been working for us for over two months did not know we had an integration server; naturally, they didn’t know how to trigger builds or how to propagate changes to the database schema if they made any.  Clearly something had to be done!  To be fair, with the waterfall development process and infrequent releases, our lack of continuous integration, automated builds, and automated deployments was not as critical. But not having these now-common systems in place meant that each release was very risky in terms of schedule and quality.

Manual build and deployment steps meant that there was a lot of waiting and inefficiency in the deployment cycle.  For example, when we were ready for an external test cycle, we 1) waited for the developer to check in his last bit of code, after which 2) an email was manually sent to the sysops person, who would go through manual steps to migrate the database, build the code, make configuration changes, copy files, and cycle the application server; and when he was done, 3) an email was sent out to our product/release manager, who would log in to our platform to activate the test cycle.  This delay was not necessary!  We’ve since automated and streamlined the process so that there is no waiting between these steps; anyone can now build and deploy to our staging server for testing.

Over the last couple of months, the team made a lot of progress in terms of automating the process:

  • A continuous build system that is triggered on each check-in; test failures or build failures send notifications to the team, so that issues can be fixed quickly
  • An automated UI testing framework that runs on the integration server
  • Automated deployment to our staging server (anyone can press the button)
  • Automated deployment to our production server (currently restricted to Sysops), including recording of downtime (if any)

Looking Forward

Experienced startup VPEs would probably read this post and shake his head in disbelief (“no automated testing!? how can you expect to ship quality code!”).  Our platform software evolved from a humble HTML form-based app that was outsourced to a development company in Argentina, to a full blown app; our engineering process and practices are also evolving from an immature, slow, and unreliable one to one that is agile, responsive, and one that results in higher quality software. Please follow this blog if you would like to ride along for the journey!