Clickhouse, an analytics database for the 21st century

Clickhouse is a fairly new column store database. It’s developed by the guys over at Yandex (the Google of Russia), made to scale horizontally reasonably well and run high speed aggregate queries on hundreds of billions of rows of data.

It uses it’s own SQL dialect and it matches pl/pgSQL in terms of expressivity and simplicity. It even includes higher order functions for working with nested object within rows, such as arrayMap and arrayFilter.

What is Clickhouse used for ?

In short, Clickhouse is used to run fast analytics on very large amount of data. It’s rather bad as a transactional database, but it can run aggregate queries on billions of rows in sub second times.

It fills the ever increasing niche that technologies like Hadoop, Spark, Druid, Big Query, Redshift, Athena and MonetDb aim for. It allow a team to easily store and analyze large sets of data with high dimensionality. More importantly, it blows the aforementioned solutions out of the water in terms of speed and query complexity.

It should be noted, Clickhouse is meant solely for analytic workloads, so as long as you have a lot of GROUP BY and aggregated functions such as COUNT, SUM and DISTINCT it will be blazing fast, even if your number of rows is in the 11 to 12 digits area. But if you want to do a lot of large scale joins and point queries, a typical RDBMS is still going to be the better choice.

It should be noted here, Clickhouse does not yet support queries like DELETE and UPDATE in it’s stable syntax. Even when it will, they will be slow due to it’s compression mechanism. Again, this is not a database meant for transactional workloads.

It will also compress your data, the space this saves is based on the column granularity and on how much query speed you are willing to sacrifice. The way Clickhouse compresses and partitions data is a key to it’s quick queries.

If you want to read more about how column stores work, I would suggest you start with the Wikipedia article, since most of the information there is quite easy to understand and accurate.

What exactly makes Clickhouse amazing ?

1It’s extremely easy to setup and use,even if you need data replication and data distribution on a few dozen machines.

Deploying a Clickhouse cluster on 3 machines is almost instant if you are familiar with what you are doing and could take you a few hours to do properly if you are learning everything from scratch.

No need for obscure libraries, no endless configuration files, user permission issues and arcane problems with data distribution and replication.

2 It’s not fragile. One horrible thing I’ve noticed in my rendezvous with various “big data” technologies (spark is a chief offender here) is that they tend to often break with no warning and for no apparent reason. Even worse, they give little information as to why the problem happened within the humongous logs they generate.

Clickhouse is designed to recover from failure and it produces quite a reasonably sized system log with clear explanations of the problems it encountered.

3 It doesn’t reinvent the wheel, it adds a steam engine to the carriage. Perhaps the most amazing thing about Clickhouse is that it doesn’t try to push a lot of limitations on you nor does it force you to completely re-think the way you store and query your data.

Its data types cover everything from nested structure to matrices, it has a gigantic library of functions and one of the most versatile query language that I’ve ever seen.

Unlike Map-reduce or other column store solutions, that will force you to rethink your entire schema, you can pretty much duplicate a transactional database schema with Clickhouse and it will do just fine.

4It’s actually got quite an amazing documentation and a great community. Perhaps the nastiest thing with “big data” technologies nowadays is that most of them are designed in order to sell support.

A prime example of this would be Druid, open source and free in theory, but an unbreakable bastion of weird complexity coupled with a complete lack of documentation and community involvement.

Clickhouse’s documentation is relatively easy to search through and it’s being groomed, expanded and improved constantly.

The community is nothing short of amazing, it’s quite tiny, consisting of the developers and users from Yandex plus a few other early adopters, but that hasn’t stopped every question and issue I raised on Github and Stack from being answered within hours.

The developer also seem to very receptive to contributors external to Yandex.

Enough talk, lets install the thing

The first step is to install Clickhouse, for the purpose of this tutorial, it’s best to have a machine running Debian or Ubuntu, since that’s the official OS Yandex supports.

It’s also quite easy to install it on Fedora, CentOS, Suse and RHEL, I’ve personally installed it with success from the AUR repository on Arch, it exists as a package for Gentoo. If all else fails, you can use the docker image or build it from source.

To install it on Ubuntu, the execute the following:

deb http://repo.yandex.ru/clickhouse/deb/stable/ main/
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E0C56BD4

sudo apt-get update

sudo apt-get install clickhouse-client clickhouse-server

Now you can start the server as a daemon by running sudo systemctl start clickhouse-server and connect to it via the client by running clickhouse-client.

If you want to play around with the configurations, you can find them inside /etc/clickhouse-server and /etc/clickhouse-client respectively.

Playing with some data

For this introduction, I chose to use the NYC property taxes dataset, it’s a rather small dataset of ~1 million rows, but it means downloading and insertion will be very quick. The query speed will scale linearly with larger datasets as long as they fit on a local RAID and won’t suffer to much if data is distributed.

In case you’ve installed an older version you may want to run:

SET experimental_allow_extended_storage_definition_syntax = 1;

This will allow you to use the syntax used throughout this tutorial. If your version is new enough, trying to run this will just throw an exception, no need to worry, you are all set.

Let’s create our database and table:

CREATE DATABASE test_database;
USE test_database;
CREATE TABLE tax_bills_nyc (
bbl Int64,
owner_name String,
address String,
tax_class String,
tax_rate String,
emv Float64,
tbea Float64,
bav Float64,
tba String,
property_tax String,
condonumber String,
condo String,
insertion_date DateTime MATERIALIZED now()
)
ENGINE = MergeTree PARTITION BY tax_class ORDER BY (owner_name)

Before we insert that data, let’s look at the things which might be a bit unclear here.

MATERIALIZED is a way to create to column that is automatically set whenever a row is inserted based on the function after the keyword. In this case, the function I chose to use is now(), in order to get the insertion timestamp for each row.

ENGINE = MergeTree sets the storage engine for the table. There are many types of table engines that clickhouse supports. Including your bog standard MEMORY engine, BUFFER engines that are meant to buffer rows when a very high write rate is needed, JOIN engines that are meant to speed up joins and many others. MergeTrees are the best table engines for most analytic queries, and the MergeTree engine is the most simple of those, and usually the best choice.

Whilst I the specific of the MergeTree engines would take too long to explain here. If you are familiar with long-structured merge-trees, Radix trees, immutable “Vectors”, Merkel trees or Patricia trees… then thing of MergeTrees as being something conceptually close to that.

If you aren’t, I’d like to invite you to the data structures sermon when your local church of Functional programming next meets.

PARTITION BY tax_class will partition the table using the tax_class column. Partitions are the largest storage unit under tables in clickhouse. You can delete partitions, attach partitions and temporarily drop partitions. Efficient clickhouse replication and distribution also depends on good partitioning.

From the end user perspective, what you should care about most is that partitions are very fast to operate on using >, < and ==.

For the column you partition by, you should chose one with somewhat low granularity, as a general rule of thumb. A good rule of thumb is to have less than 1000 distinct values per billion of rows. When your data associated with a point in time, a good partition key is the date associated with every row.

ORDER BY is similar to a btree index in an RDBMS, from a user perspective. It speeds up queries using comparison operators. Unlike PARTITION, you can use high granularity data here without losing performance. Again, in the case of a table where you often query based on a time associate with each data point, a good candidate for this value is the timestamp of each row. You can order your table by a combination of columns if you so desire.

Now you should be familiar with most of the new concepts you need to get along with Clickhouse, the rest should feel quite natural to you.

Let’s insert our data:

curl -X GET 'http://taxbills.nyc/tax_bills_june15_bbls.csv'  | /usr/bin/clickhouse-client --input_format_allow_errors_num=10 --query="INSERT INTO test_database.tax_bills_nyc FORMAT CSV"

That should take a few seconds or minutes based on your internet speed, once it’s done, we can start testing some queries.

If you want to work with a bigger dataset but don’t have time to define your own table, just run the above command a few more times 🙂

Let’s take it for a ride

Ok, lets run our first query, don’t blink: SELECT COUNT(*)/pow(10,3) FROM tax_bills_nyc. That query should have been relatively instant, probably done in sub ms time if you are using server grade hardware.

Lets try something a bit harder:

SELECT SUM(emv) as total_emv, SUM(tbea) as total_tbea, SUM(bav) as total_bav, tax_class,tax_rate 
FROM tax_bills_nyc
GROUP BY tax_class, tax_rate

That should have taken somewhere between 50 to 10ms depending on your machine, quite the feat for doing a GROUP BY 2 dimensions and summing up 4 different columns.

Ok, what if we add a high dimensionality column in the GROUP BY, let’s say the owner’s name, since there are about 885,000 distinct names in the dataset:

SELECT SUM(emv) as total_emv, SUM(tbea) as total_tbea, SUM(bav) as total_bav, tax_class,tax_rate 
FROM tax_bills_nyc
GROUP BY tax_class, tax_rate, owner_name

That was probably slightly slower, but it still probably took under half a second, despite the high granularity of our new grouping criteria.

But, if we built our schema correctly, we should be able to filter most queries using our order or partition column, let’s add a where clause based on the tax_class column:

SELECT SUM(emv) as total_emv, SUM(tbea) as total_tbea, SUM(bav) as total_bav, tax_class,tax_rate 
FROM tax_bills_nyc
WHERE tax_class='1c - condo in 1-3 unit building'
GROUP BY tax_class, tax_rate, owner_name

Suddenly, our query time drops in the sub 10ms range again.

Clickhouse not only offers speed, but it has various type and value manipulation functions that can allow us to easily write a query which output something that’s easily human readable:

SELECT concat(toString(uniq(owner_name) / pow(10, 3)), ' thousand distinct names among the owners of ', toString(uniq(address) / pow(10, 3)), ' thousand properties') AS result FROM tax_bills_nyc

This should give you a result similar to:

85.058 thousand distinct names among the owners of 989.466 thousand properties

Ok, let’s try and see what happens if we try grouping by every single column in this table. To save you some typing, we’ll use the DISTINCT(*) function, which will essentially do just that.

SELECT COUNT(DISTINCT(*)) FROM tax_bills_nyc

Surprisingly enough, this once again takes under half a second.

It might actually be quicker then some of the previous GROUP BYs since we didn’t have to transform any of the string typed columns into a printable version.

In my books, that was quite impressive when I first saw it.

Have I peeked() your interest ?

Hopefully you can see why I’m very excited about this database due to the expressive syntax and amazing query speed alone.

This was mostly a toy dataset, meant to make this tutorial easy to follow. So I encourage you to try and go up to a few billion rows and see for yourself how unbelievable this database is.

As for the dataset you want to do that with, I leave that up to you. Maybe try one of your large internal tables on which you often have trouble running analytical queries. Try formatting some of the data in your “data lake”, which you’d usually use a tool like Spark or Athena to query, and insert it into Clickhouse.

If enough people are interested, I might produce a followup article or series with some more tips, tricks and real world usecases for Clickhouse.

read original article here