In the past few years, alternative data management systems have attracted a fair bit of attention. New systems appear on an almost weekly basis, promising benefits of all kinds – easy graph traversal, lightning-fast writes, extreme scalability, RESTful APIs, availability in the face of natural disasters, and ponies.
The feature lists are attractive, but features aren’t all that needs to be considered before betting your project on new technologies. This short article is an attempt to list a few questions that one must answer before taking the NoSQL plunge.
Do you really need it?
People have built many great applications over the years using traditional technologies. That’s how they became “traditional” in the first place. They are very good at what they do, they have been deployed at thousands of businesses, and their bugs have been ironed out over many years. The pool of trained engineers who are familiar with these systems and ready to go from day one is also much wider than it is for newer systems.
The first question you have to answer is “why not MySQL / Postgres / SQLite / commercial package of choice?” And you had better have a good answer. Good answers include, but are not limited to, extreme data volumes, an extreme number of transactions per second, and data models that are just not a good fit for a traditional RDBMS (although perhaps that just means you need to reconsider the data model). “Amazon / Google / Yahoo / Facebook / Twitter does it” is not a good answer. You may want to take a look at the Yahoo Cloud Serving Benchmark (1), which attempts to measure the performance of different systems for a variety of workloads. The code for the benchmark is open source and modular, so you can tune the input parameters to your expected volumes and compare the results with what your current system gives you.
Expertise for hire?
The great thing about the various NoSQL solutions is that by and large they have an energetic and excited community behind them, and they are fun to work on. But make no mistake – most of these systems are fairly young. They will have bugs. They will have corner cases. They will lack some features you want, or supporting libraries in your favorite language. Tracking down problems will often involve reading the code. The upside is that, without years of accumulated cruft weighing the code down, they are likely to change faster. Jumping into the NoSQL pool means that you are likely to be tracking JIRA tickets on apache.org in the near future.
Risk profile and mitigation strategy
If you are going NoSQL, you should be doing so only because you either don’t have a choice or expect to achieve a significant gain. New technology is a risk; it might not work out. How do you plan to test that the system you built will stay up in production? What’s your plan for backing the data up? How will you recover when someone accidentally issues the NoSQL equivalent of “rm -rf /”? Whatever your biggest parameter is, test at an order of magnitude beyond it – more reads, more writes, more terabytes, more exploding network cards. Think again about hiring! You are awesome, but you will have a lot of stuff to do. You will want to be able to find someone who understands enough Erlang to debug CouchDB replication while you are sunning yourself in Cyprus (by the way, did you know that CouchDB doesn’t support continuous replication across server restarts? Now, did you really?).
About tools, libraries, migration
There are a number of tools for SQL solutions that have no easy analogues in the NoSQL world. Clients, language bindings, admin UIs, ETL tools, Emacs modes, Excel plugins, you name it. Actually, don’t name it. You probably don’t need most of it. But of the stuff you do need – what exists? What are you comfortable going without, or creating yourself?
Starting from a clean slate is generally pretty straightforward. You might have legacy systems, however. How do you do bulk imports? One strategy is to run the two systems in parallel for a while, taking writes, simulating reads, making sure everything is kosher in NoSQL land. Is there an ORM that can help you abstract the (vastly different) storage layer? Think also about the reverse – getting data out of NoSQL and into standard reporting tools. Many NoSQL systems are optimized for specific types of access, and doing things like exporting data or performing analytic queries can be difficult. For example, at Twitter, we store the social graph in FlockDB, which serves our online needs very well. However, we found that in order to analyze patterns in this data, we had to set up replication from FlockDB into Hadoop and HBase, and perform the analysis offline.
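The "run the two systems in parallel" strategy can be sketched roughly as follows. This is a minimal illustration, not a description of how any particular company does it: `DictStore` is a hypothetical in-memory stand-in for a real storage client, and `ShadowedStore` is an invented wrapper name. The legacy system stays authoritative; the candidate receives the same writes, and every read is compared so that divergence gets recorded instead of served.

```python
class DictStore:
    """Hypothetical in-memory stand-in for a real storage client."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class ShadowedStore:
    """Serve from the legacy store while exercising the candidate in parallel."""

    def __init__(self, legacy, candidate):
        self.legacy = legacy
        self.candidate = candidate
        self.mismatches = []        # (key, legacy_value, candidate_value)
        self.candidate_errors = []  # (key, exception)

    def put(self, key, value):
        # The legacy write is authoritative and is allowed to raise.
        self.legacy.put(key, value)
        try:
            self.candidate.put(key, value)
        except Exception as exc:
            # A failing candidate must never break production traffic.
            self.candidate_errors.append((key, exc))

    def get(self, key):
        expected = self.legacy.get(key)
        try:
            actual = self.candidate.get(key)
        except Exception as exc:
            self.candidate_errors.append((key, exc))
        else:
            if actual != expected:
                self.mismatches.append((key, expected, actual))
        # Callers only ever see the legacy answer.
        return expected
```

In a real deployment the shadow read would probably be sampled or done asynchronously so it does not add latency to every request; the sketch keeps it inline for clarity.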
How to operationalize this thing
By far the weakest spot I have consistently discovered with NoSQL systems is operationalization. What is the deployment story? Are there tools to examine the health of the system? What about figuring out what went wrong – and remember, something always goes wrong – can I run ad-hoc queries on that thing? Many NoSQL databases are built for querying based on a key. What if you want to scan a large part of the data, looking for some condition? Will this take your website down to a crawl? How do you set up monitoring for the system? When an alert fires, what exactly are you supposed to do? The operations engineers in your company probably have a lot of experience running things like Linux boxes and MySQL instances. They know a great deal about bizarre replication quirks, device driver bugs, and TCP throughput collapse. They don’t have the same experience with your distributed key-value hash table and its gossip protocol. They know what it looks like when a MySQL node goes insane. What does an insane Cassandra node look like? Does it matter if you always do consistent reads? How do you check for replication lag? HBase region servers can survive master failure – what does that mean, exactly? Can this wait till morning?
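One generic way to approach the replication-lag and "can this wait till morning?" questions is a heartbeat check: periodically write a timestamp to a well-known key on the primary, read it back from a replica, and treat the age of what the replica sees as the lag. The sketch below is an assumption-laden illustration, not the mechanism of any specific database: plain dicts stand in for the primary and the replica, and the key name and thresholds are made up.

```python
import time

HEARTBEAT_KEY = "ops:heartbeat"  # hypothetical key name


def write_heartbeat(primary, now=None):
    """Record the current time on the primary under a well-known key."""
    now = time.time() if now is None else now
    primary[HEARTBEAT_KEY] = now
    return now


def replication_lag(replica, now=None):
    """Seconds between now and the newest heartbeat the replica has seen."""
    now = time.time() if now is None else now
    last_seen = replica.get(HEARTBEAT_KEY)
    if last_seen is None:
        # The replica has never received a heartbeat: treat as unbounded lag.
        return float("inf")
    return max(0.0, now - last_seen)


def classify(lag_seconds, warn_after=30.0, page_after=300.0):
    """Map lag to an action; the thresholds are illustrative assumptions."""
    if lag_seconds >= page_after:
        return "page"  # wake someone up
    if lag_seconds >= warn_after:
        return "warn"  # this can wait till morning
    return "ok"
```

The measured value also includes the heartbeat interval itself, so the thresholds need to sit comfortably above how often the heartbeat is written.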
The answers to these questions are, of course, different for different projects. Chances are, these answers will be a lot more complete by the time you read this than they are at the time I write this. The NoSQL “movement” has produced very interesting, exciting technologies, and they address real needs in the IT space. If you’ve answered all of these questions to your satisfaction, and still want to jump in – take a good running start, and go for it. I’ll be in there with you, looking out for sharks.
(1) Information about the benchmark can be found at http://wiki.github.com/brianfrankcooper/YCSB/
Dmitriy V. Ryaboy
Please answer these questions before taking the NoSQL plunge!