Why you need (and don’t need) a NoSQL database.

Let’s begin with a quick background behind relational and NoSQL databases. Historically, relational databases have been the general purpose go-to database of choice. They are battle-tested, feature rich, and proven to work. However, as the time went by — the volume, velocity and variety of data has increased dramatically… And as a result, we have seen the rise of NoSQL databases — specialized databases for various scenarios. Today, I work on a NoSQL database (specifically a document-oriented database… shout out to Azure DocumentDB), and a question I get frequently asked is “when should I use a relational vs. NoSQL database”?

Before I go deeper… I’d like to call out that NoSQL is a bit of an over-bloated buzzword… there are many kinds of NoSQL databases — including key-value (e.g. Azure Table Storage, AWS DynamoDB, Riak, etc.), document-oriented (e.g. Azure DocumentDB, MongoDB, CouchDB, etc.), and graph databases (e.g. Neo4j). And they behave nothing alike.

Maslow famously once said — “if all you have is a hammer, everything looks like a nail”

Hammer that screw!

Relational databases were that hammer. And today, we are living in a database renaissance — in which, we have screw drivers, drills, and all sorts of specialized tools popping up. We now have the freedom to choose the right tools for the right job.

Choosing the right tool for the right job

Currently, I still view relational databases as a great general purpose database… That said, I believe the area that relational databases truly excel in are scenarios that involve… well… highly-relational data (many-to-many relationships). The biggest challenges for relational databases tend to be scenarios that involve a high variety of data (heterogeneous data, see below) and a high volume and/or velocity workload (sharding is hard).

On the other extreme, I view key-value stores are fantastic for dealing with large volumes and/or velocity of data (they are dirt-cheap and make sharding/partitioning relatively easy). Their biggest drawback tends to be queryability (what if I need to perform queries on something other than the primary key?). Many key-value stores don’t support secondary indices, and the ones that do can become very expensive to operate (some key-value databases handle secondary-index support by effectively performing a double-write).

I see document-oriented databases as a great trade-off between the relational databases and key-value stores. They can handle high volume/velocity scenarios quite well; but the area they especially stand out to me is variety. They offer a schema-agnostic database with a reasonable amount of queryability over de-normalized [and normalized] JSON. Some even support automatic indexing over every property in every JSON document (which is pretty fucking awesome IMO).

Here are a couple example scenarios that I believe are a great fit for document-oriented databases:

1) Heterogeneous data (data with varying schema)

Example: Let’s consider that you are building an e-commerce site where you sell everything from books to video games to laptops. It is really hard to fit this kind of data in a relational database.

You could create and index a column for each product attribute… but that becomes horribly inefficient because there are too many varying attributes among your various products. How much ram does your book have? Who is the author of the laptop? Why waste precious space by writing nulls in sparsely populated product attribute columns?

You could create a separate table for each product type… but that sucks when you have an expansive product catalog. Creating and maintaining 1000s of tables for 1000s of product types quickly becomes an operational nightmare.

Abstracting separate Product and ProductAttributes tables is usually the go-to answer. However, that introduces additional complexity and forces a JOIN. JOINs are cross-products; and using a cross-product on two very large tables generally means performance slows to a crawl. JOINs are great for many-to-many relationships, but are relatively inefficient for one-to-many (hierarchical) relationships.

De-normalizing and storing everything as JSON in a single text column could be another option… but then you lose the ability to easily index and query off various nested attributes.

Storing heterogeneous data in a schema-agnostic database is easy; just store the data with whatever fields you need as JSON in a document-oriented database. Fields can be automatically indexed and you can query off any fields you want. It’s fast. It’s efficient. It just works. Simple.

2) When you don’t get to dictate the schema (e.g. pulling data from 3rd party data sources)

Example: Let’s say you have an application that pulls data from various 3rd party APIs from around the internet, e.g. Reddit, Github, and StackOverflow. JSON has become the de facto data interchange format of the internet. Extracting fields out of JSON from a 3rd party REST API to fit a tabular structure can be tedious; and even worse… what happens when your data source changes their schema tomorrow? Data loss occurs! This is another area where schema-agnostic databases shine. You can store JSON passed back from 3rd party sources directly in to the datastore without having to worry about data loss due to schema changes. Simply update your application’s queries to reflect the latest schema changes and you are back up and running.

TL;DR: Databases are not one-size-fits-all solutions. You should first look at your scenario, before picking a database. Picking a database first, and then fitting it to your scenario leads to problems.