It’s been four months since I formally took over running the Platform Team, and I figured it’s time we started publishing an engineering blog to share what we do, and what we think about.
Our vision for this blog is to talk about our technology, the obstacles we encounter, our discoveries, and also about the culture and engineering practices that allow us to build great, innovative software. Our team, just like our software, is a work in progress, continuously improving and moving toward a vision.
For this inaugural post, I’d like to describe at a high level where we stand in terms of our product architecture and the engineering practices and processes around it. Frankly, there was quite a bit of “infrastructure debt” that we had to pay down before I felt comfortable having you all see where we are; but we’re finally at that stage. Yes, you will still find things we have or do that are not perfect. Over the next several posts, I’d like to paint the vision of where our team is striving to get to, as well as provide details around our ongoing transition – with the hope that it might help others going through a similar one.
Product Architecture At a Glance
Our current product platform consists of a pretty standard J2EE-based web application stack. We use MySQL, Hibernate, and Spring running in JBoss, with Apache in front, all hosted on Amazon EC2. We store media, including all attachments – images, videos, and other files – in S3. We use CloudFront as the CDN for some files.
Our user interface used to be 100% Flex (yes, we made that unfortunate choice back in 2010, and we are paying dearly for it: the lack of test automation makes every release costly); we’ve been working on an HTML-based user experience for our next major release. uTest Express customers have already been enjoying the intuitive, snappy user experience running on our HTML-based framework since September.
We provide a small subset of our product functionality via an API. Partners and customers who have asked for our API in the past have probably been disappointed by the lack of documentation and by how difficult the API was to use. We are in the process of revamping it, with clear documentation to accompany the API. If you are interested in using the API, please let us know (firstname.lastname@example.org), and we can send you an early-access API doc and point you to our integration sandbox server to explore integration.
Our Development Process and Engineering Practices At a Glance
This is where we had a lot of catching up to do! To give you some context for our future blog posts detailing our transition, I will illustrate our process and practices “before and after”.
The Platform Team used to be two separate teams: the engineering group, led by our VP of Engineering, and the products group, led by our VP of Product Management. The engineering group primarily came from a background accustomed to the waterfall development model. Four months ago, we flattened the organization and combined the groups into one Platform Team, encouraging closer collaboration between the developers and the stakeholders.
The team followed a mostly waterfall software development process: for each release, the engineering group required a full set of specs and requirements from the product team, which were “handed off” to the dev team; once development was done, the work was “handed off” to the QA team. We didn’t (and still don’t) have an internal QA team – manual testing of course was (and still is) done by our testing community. The team released anywhere from once a month to once every two months.
We have since adopted a much more agile model. We spent a few release cycles trying Scrum, but it didn’t work very well (look out for a future blog entry on this topic!). We are now using a Kanban dev process, and release once a week. The weekly release cadence required a lot of process and infrastructure changes.
We do not have an internal QA team doing manual testing (we rely on our tester community!). We also had not had a dedicated QA automation engineer, and the team had not invested developer time in creating a test automation framework. Our use of Flex did not help either – the lack of an automated test suite resulted in costly releases and a high likelihood of regression bugs sneaking into each release. The QA practice also did not include much test-case-driven testing, which contributed to poor test coverage for each release. Today, our manual testing for each release has much higher coverage thanks to the rich set of test cases developed by our community.
We are determined to tackle the quality issue by having the entire team focus on quality. For the past three months, the platform team has been working on creating a framework for automated testing. We now have a Software Engineer in Test working full-time so we can catch up on automated testing. We now run more developer-created tests, and have a framework in place so that automation scripts created by our community testers run in our continuous integration environment. We also have a much more formal code review process so that we can catch bugs earlier. And we stopped our practice of having a handful of Product and Project Managers test each deployment on our production server, which had resulted in late Wednesday nights every time we deployed.
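To make “developer-created tests” concrete, here is a minimal sketch of the kind of unit test that can run in a continuous integration environment on every check-in. The function and the business rule in it are hypothetical stand-ins for illustration, not our actual platform code:

```python
import unittest

# Hypothetical domain rule, standing in for real platform logic:
# a test cycle may only be activated during business hours.
def activation_window_open(hour, start_hour=9, end_hour=17):
    """Return True if a test cycle may be activated at the given hour (0-23)."""
    return start_hour <= hour < end_hour

class ActivationWindowTest(unittest.TestCase):
    def test_inside_window(self):
        self.assertTrue(activation_window_open(10))

    def test_outside_window(self):
        self.assertFalse(activation_window_open(20))

    def test_boundaries(self):
        self.assertTrue(activation_window_open(9))
        self.assertFalse(activation_window_open(17))
```

A CI runner picks up files like this automatically (e.g. `python -m unittest`), so a failing check-in notifies the team within minutes instead of surfacing as a regression weeks later.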
Build and Deployment Process At a Glance
Over the summer, I discovered we had two developers who had been working for us for over two months without knowing we had an integration server – naturally, they didn’t know how to trigger builds, or how to propagate any database schema changes they made. Clearly something had to be done! To be fair, with the waterfall development process and infrequent releases, our lack of continuous integration, automated builds, and automated deployments was not as critical. But not having these now-common systems in place meant that each release carried a lot of schedule and quality risk.
Manual build and deployment steps meant a lot of waiting and inefficiency in the deployment cycle. For example, when we were ready for an external test cycle, we 1) waited for the developer to check in his last bit of code, after which 2) an email was manually sent to the sysops person, who would go through manual steps to migrate the database, build the code, make configuration changes, copy files, and cycle the application server; when he was done, 3) an email was sent to our product/release manager, who would log in to our platform to activate the test cycle. None of this delay was necessary! We’ve since automated and streamlined the process so that there is no waiting between these steps; anyone can now build and deploy to our staging server for testing.
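The hand-offs above collapse naturally into a single scripted pipeline that runs the steps in order and stops on the first failure. The sketch below shows the shape of such a script; the command names (`./migrate_db.sh` and friends) are placeholders for illustration, not our actual tooling:

```python
import subprocess

# Ordered deployment steps; each entry is (description, command).
# The commands here are hypothetical placeholders.
DEPLOY_STEPS = [
    ("migrate database", ["./migrate_db.sh", "staging"]),
    ("build and package", ["mvn", "clean", "package"]),
    ("copy artifacts and config", ["./copy_artifacts.sh", "staging"]),
    ("cycle app server", ["./restart_jboss.sh", "staging"]),
]

def deploy(steps=DEPLOY_STEPS, dry_run=False):
    """Run each step in order; a non-zero exit aborts the pipeline.

    With dry_run=True, only report what would run (useful for review).
    Returns the list of step descriptions that completed.
    """
    completed = []
    for description, command in steps:
        if not dry_run:
            subprocess.check_call(command)  # raises CalledProcessError on failure
        completed.append(description)
    return completed
```

Because the whole sequence lives in one script, “anyone can deploy to staging” becomes a one-button operation instead of an email chain between three people.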
Over the last couple of months, the team made a lot of progress in terms of automating the process:
- a continuous build system triggered on each check-in; test failures or build failures send notifications to the team, so that issues can be fixed quickly
- an automated UI testing framework that gets run on the integration server
- automated deployment to our staging server (anyone can press the button)
- automated deployment to our production server (currently restricted to Sysops), including recording of down time (if any)
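As an illustration of the downtime recording mentioned in the last bullet, here is a small sketch of the bookkeeping involved. The class and method names are assumptions for illustration, not our production code:

```python
from datetime import datetime

class DeploymentRecord:
    """Tracks the outage window (if any) for one production deployment.

    Illustrative sketch: field and method names are assumptions.
    """

    def __init__(self):
        self.down_at = None
        self.up_at = None

    def mark_down(self, when):
        """Record the moment the app server was taken down."""
        self.down_at = when

    def mark_up(self, when):
        """Record the moment the app server came back up."""
        self.up_at = when

    def downtime_seconds(self):
        """Length of the outage window; zero-downtime deploys report 0."""
        if self.down_at is None or self.up_at is None:
            return 0.0
        return (self.up_at - self.down_at).total_seconds()
```

Recording this per deployment lets us put a number on release cost and verify that the weekly cadence is not degrading availability.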
Experienced startup VPEs reading this post are probably shaking their heads in disbelief (“no automated testing!? how can you expect to ship quality code!”). Our platform software evolved from a humble HTML form-based app, outsourced to a development company in Argentina, into a full-blown app; our engineering process and practices are likewise evolving – from immature, slow, and unreliable to agile, responsive, and capable of producing higher-quality software. Please follow this blog if you would like to ride along for the journey!