Imagine — You’re in a system design interview and need to pick a database to store, let’s say, order-related data in an e-commerce system. Your data is structured and needs to be consistent, but your query pattern doesn’t match with a standard relational DB’s. You need your transactions to be isolated, and atomic and all things ACID… But OMG it needs to scale infinitely like Cassandra!! So how would you decide what storage solution to choose? Well, let’s see!
First of all, what kind of data are we working with? Is it records or file systems or audio/video content? And what kind of processing do we intend to do on that data? Do we need to search for something or run sophisticated analytics algorithms?
What are the different types of storage solutions out there?
Based on our requirements and how we want to use or access our data, we could be looking for the following storage solutions:
- Caching Solutions — If we are designing a read-heavy system like Twitter or Facebook, we might end up catching a lot of data, even complete timelines, to meet the low latency requirement. Some options here would be Redis or Memcached.
- File system storage — If we are designing some sort of asset delivery service, where we might need to store images or audio/video files, we might need to use something called blob storage. An extremely popular example is Amazon S3.
- Text search engine — What if we are designing a system like Amazon and need to implement a search feature. The thing about search features is that we need to consider typos as well. Suppose a user wants to search for “shirt” but types “shrt” instead. Now if we don’t show any results, it would be very poor user experience. Our system needs to be smart enough to show results for “shirt” or “shorts”. This is known as fuzzy search and this is where we use text search engines like Elasticsearch.
- Data Warehouses — I know! We have been discussing data and storage all this time, so how can we not consider Big Data! Sometimes we just need to dump all the data in a single store where we can later perform all sorts of analytics. These systems are used more for offline reporting that usually transactions. This is where we end up using data warehouse solutions like Hadoop.
Now, you might have noticed that we have been talking about “storage solutions” not “databases”. So let’s have a look at Databases now!
SQL? NoSQL? What is going on here?
Well, there are a few factors basis which we can decide what kind of database to go with, these factors being — the structure of the data, query pattern, and scale.
A little confusing I know! This is why I have added the link to the video I referred to for this article.
They have explained it beautifully, but we are also going to go over it in the coming sections, so read on.
Now, scale, structure, and query pattern. Right. If the information is structured and can be represented as a table, and if we need our transactions to be atomic, consistent, isolated, and durable (ACID), we go with a relational database. The most commonly used one would be MySQL.
Now if the ACID properties are not required, well you could still use a relational database, or you could go with a NoSQL alternative. But if your data lacks a structure, it cannot be represented as a table and now we need to use a NoSQL DB like MongoDB, Cassandra, HBase, Couchbase, etc. And this is where the query pattern becomes a deciding factor.
Pssst: Elasticsearch is sort of a special case of document DB.
If we have a vast variety of attributes in our data and a vast variety of queries, we use a Document DB like MongoDB or Couchbase. But if we have to work on an extremely large scale but have few types of queries we need to run, then we go for a Columnar DB like Cassandra or HBase. And even between columnar DBs, as you might know, HBase is built on top of Hadoop.
So while setting up HBase we first need to set up Hadoop and the related components, and then set up HBase on top of it. This adds a level of complexity while setting up the system, so I would personally go for Cassandra if only for the sake of simplicity. Performance-wise both give similar results.
Now, the thing with Columnar databases like Cassandra is that they work majorly by partitioning and duplicating the data. So if you can choose the partition key such that all of your queries use the common partition key in the where clause, Cassandra is the way to go.
I came across this article about how to choose the best storage solution by “codekarle”, which beautifully explains when it might make sense to use a columnar DB vs a document DB with an example of how Uber interacts with their driver and rider sides of the system. Let me try to explain this using the same scenario.
Let’s say Uber has saved the ride related information in a Cassandra with driver id as the partition key. Now when we run a query to fetch all data for a particular driver on a day, it fetches it based on the partition key driver id. This was the partition side of Cassandra’s solution. Now, what if we try to query a customer’s rides on a particular day by customer id. Now the query will be sent out to all the partitions and there goes the efficiency! This is where the replication side of the solution comes in.
We can simply replicate the whole data and now use customer id as the partition key. Now when a query comes in based on customer id, it will be directed to the instance using customer id as the partition key. And this is why Cassandra can scale infinitely. Remember the query pattern for Cassandra? We mentioned it is only useful if there is a limited variety of queries. That is because we can only replicate the data so many times.
Let’s get real though, is one database enough?
Now, we have seen various storage solutions and also how to choose between various DBs based on our requirements and the type of information we need to store. But is that enough? Referring to another example from the above-mentioned article, there are some cases where a single DB is not enough for our requirement.
For example, in the case of Amazon, orders data needs to follow ACID properties, but it needs to be infinitely scalable like a Columnar DB. In such a case we will use a combination of databases like MySQL + Cassandra. Now, all the information about ongoing orders, which need to follow ACID properties, will be stored in a MySQL DB and once they are completed, we can move them to Cassandra, which can be used as a permanent store. So as long as we require the ACID properties, the data remains in a relational DB and then is moved to a columnar DB which can scale according to the size of data. Problem solved!
I hope this can help you come up with a checklist of sorts to refer to during your next system design interview. Think I missed an important database/storage solution? Let me know in the comments!
Previously published behind a paywall at https://towardsdatascience.com/choosing-the-right-database-in-a-system-design-interview-b8af9c6dc525