Supernova Award Category
Tech Optimization and Modernization
The Problem
Databases are really difficult to test, especially when they are designed to be distributed, real-time, and scalable. At MemSQL, we have millions of unique queries to test, highly variable runtimes, and some tests that take hours at 100 percent CPU usage. We have over 10,000 unique tests, and test transformations can multiply that number by another factor of one hundred.
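To make the multiplication concrete, here is a minimal sketch of how pairing each base test with each transformation grows the run count. The test and transformation names are hypothetical placeholders, not Psyduck's actual identifiers, and the counts are the round numbers quoted above.

```python
from itertools import product


def expand(tests, transformations):
    """Pair every base test with every transformation to get individual runs."""
    return [f"{test}[{transform}]" for test, transform in product(tests, transformations)]


# Hypothetical names; only the counts (10,000 tests, 100 transformations) come from the text.
base_tests = [f"test_{i}" for i in range(10_000)]
transformations = [f"transform_{j}" for j in range(100)]

runs = expand(base_tests, transformations)
print(len(runs))  # 1000000
```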
Our tests also require gigabytes of RAM and multiple cores. That’s the kind of scale you have to think about for testing a platform like MemSQL. We also ran into new testing challenges as our product can take advantage of Intel’s unique AVX technology and vectorization. These modern architectures bring awesome advantages, but they also add to testing scenarios. We started with off-the-shelf tools, but once you see all the things we’re doing, it becomes clear you can’t just throw our tests onto a common test platform. Commercial testing solutions are not designed for distributed applications.
The Solution
We started our first database test cluster about five years ago, and named it Psyduck after the Pokémon character.
Initially we had a mixture of homegrown, bare-metal boxes in the office. Additionally, we had a bare-bones Amazon EC2 presence for bursty job requirements. That was our initial scaling strategy: manually managed VMs to take on additional load.
From there we looked at operationalizing the whole stack. First we invested in operationalizing VMs on bare metal. We took 25 machines, cleared them, and built an OpenStack cluster. We scaled that up to about 60 machines on-premises. This allowed us to eliminate EC2, but as we scaled that cluster we experienced a lot of pain and complexity. So we took a portion of the cluster and ran Eucalyptus instead. That ended up being interesting, but not very mature compared to OpenStack.
Then we tested Psyduck with containers. Docker gave us the ability to spin up containers in seconds (rather than VMs in minutes).
The Results
We’re very happy with the outcome of this journey. The container abstraction is solid for performance, isolation, ephemerality — everything that matters in software unit testing. We don’t care about persistence: the test runners come up as containers, do a bunch of work, write out to a distributed file system, and disappear. This abstraction allows us to not worry about what happens to containers, only that they complete their job and then go away.
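As an illustration of that lifecycle, the sketch below assembles a `docker run` command for one throwaway test runner. It is not Psyduck's actual launcher: the image name, test command, resource limits, and the `/mnt/dfs/run-123` mount path are all assumptions made for the example. Only standard Docker CLI flags are used (`--rm`, `--cpus`, `--memory`, `-v`), and the command is built but not executed.

```python
import shlex


def docker_run_command(image, test_command, results_dir, cpus=4, memory="8g"):
    """Build a `docker run` invocation for a single ephemeral test runner."""
    return [
        "docker", "run",
        "--rm",                            # remove the container once it exits
        "--cpus", str(cpus),               # cap the CPU share for this runner
        "--memory", memory,                # cap the RAM for this runner
        "-v", f"{results_dir}:/results",   # results land on storage that outlives the container
        image,
        *test_command,
    ]


# Hypothetical image, command, and mount point, for illustration only.
cmd = docker_run_command(
    "example/test-runner:latest",
    ["run-tests", "--out", "/results"],
    "/mnt/dfs/run-123",
)
print(shlex.join(cmd))
```

Because the container is removed on exit, the only thing that survives a run is whatever was written under the mounted results directory, which matches the "do the work, write out, disappear" model described above.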
In the end, Psyduck helped us alleviate most of the headaches around testing software at our scale. Running containers on bare metal, we can afford to put our own products through the most demanding testing regime in our industry.
Metrics
Today, we successfully run a testing cluster composed of hundreds of machines, which executes millions of tests every day. Psyduck drives engineering productivity and allows us to iterate on our database software in a fast and safe way. Psyduck runs several months of computing time in one day.
Hundreds of machines
Millions of tests
Thousands of engineer interactions per day, each of which can save an engineer up to an hour of work
Allows engineers to do work they individually can’t do
Enables them to work in parallel with themselves
One engineer can work on multiple projects, which can each be tested in parallel by Psyduck while the engineer works on something else
We have been able to build an award-winning database in a competitive market with a very small engineering team because of Psyduck
The Technology
Psyduck - this includes MemSQL IP
Amazon EC2
Docker
Kubernetes
Diamanti
Disruptive Factor
The biggest challenge was trying to fit a complex problem into third-party solutions. We went through many iterations. All of these failed and forced us to write more of Psyduck.
At the end of the day, Psyduck enables our engineering team to be more productive and build a more complex solution with a smaller team. Psyduck is critical to the engineering team, and is vital for building MemSQL technology for external customers.
Unlike other testing frameworks, a single Psyduck test run can be composed of thousands of individual container executions that happen over a large distributed system. Each of these containers may require large amounts of dedicated computing resources and Psyduck is designed to handle this case. One additional complexity of our tests is that most of them are integration tests that require spinning up. Other frameworks tend to be more static and focused on unit-testing, which can be more easily parallelized.
A unit test verifies that a small piece of functionality is independently correct. It looks only at that specific piece, in the absence of other parts of the system.
An integration test operates on the system or product as a whole. These are hard to parallelize and set up, since each one requires a large amount of configuration and dedicated resources. Psyduck focuses on providing a solution that is optimized to run integration tests, as well as unit tests, in parallel at scale.
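To show why dedicated resources make integration tests harder to parallelize, here is a minimal sketch of placing jobs onto machines by core and RAM requirements. It uses a simple first-fit-decreasing heuristic chosen for illustration. Psyduck's real scheduling is not described here, and the `Machine`, `Job`, and `schedule` names and all the sizes are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    free_cores: int
    free_ram_gb: int
    jobs: list = field(default_factory=list)


@dataclass
class Job:
    name: str
    cores: int
    ram_gb: int


def schedule(jobs, machines):
    """Place the largest jobs first on the first machine with room; return jobs left waiting."""
    waiting = []
    for job in sorted(jobs, key=lambda j: (j.cores, j.ram_gb), reverse=True):
        for machine in machines:
            if machine.free_cores >= job.cores and machine.free_ram_gb >= job.ram_gb:
                # Reserve the resources so no other job can share them.
                machine.free_cores -= job.cores
                machine.free_ram_gb -= job.ram_gb
                machine.jobs.append(job.name)
                break
        else:
            # No machine has enough dedicated capacity right now.
            waiting.append(job)
    return waiting


# Hypothetical cluster and workload.
machines = [Machine("node-a", 16, 64), Machine("node-b", 8, 32)]
jobs = [
    Job("integration-1", 8, 32),
    Job("integration-2", 8, 32),
    Job("integration-3", 8, 32),
    Job("unit-1", 1, 2),
]
waiting = schedule(jobs, machines)
for machine in machines:
    print(machine.name, machine.jobs)
print("waiting:", [job.name for job in waiting])
```

In this example the three large integration jobs consume the whole cluster, so even a tiny unit test has to wait. A real system would also need queueing, retries, and release of resources when a container finishes, none of which is modeled here.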
Shining Moment
Growing Psyduck to the size where it became a vital team member, and moving it out of the office closet into dedicated space at a colocation facility in SF.
