April 2nd 2020
This article will be of particular interest to people who are curious about how powerful encryption can be implemented in high performance systems. It is also for people who just want to know a bit more about how strong encryption works and how all the pieces fit together.
Everyone says they want more secure systems, but people don’t like using systems that are slow or hard to use, and making a system more secure can add delays and reduce convenience. It is clearly bad when people avoid using a system that is secure, and instead choose a system that is less safe but faster or easier to use.
Because databases store the data that we most need to protect, they are a major piece of the security puzzle. How can we add strong encryption to a database without unduly compromising speed or usability? Looking at how various companies have done this can help.
Badger, a Fast Database
KVDBs have become popular because they are one of the simplest and fastest kind of databases, and they are easy to learn because they use a data structure that is common in most programming languages.
The challenge for the team was to figure out how to implement encryption without compromise. We knew that this should be possible, because companies were able to (eventually) figure out how to add encryption to http (creating https) without making it noticeably slower for the end user.
As we shall see, Badger’s advanced implementation positively affected how encryption was added to it.
Implementation of Badger
Separating Keys from Values
An advanced feature of Badger’s implementation is splitting up the key-value pairs and storing the values separately from the keys. Keys are typically short strings, so storing them together in one place makes searching for a key significantly faster. Keys typically don’t change as often as values, so more extensive optimization of the key files (including the hash tables used to look up a key) can be done to improve performance even more.
On the other hand, values are typically larger and change more often than keys. For example, a value can be an entire document that is updated as the document is edited. Multiple values are stored in a value log file (vlog). Associated with each key is a pointer to its value, consisting of the vlog file id, the offset of the value in the file, and the size of the value. Note that as an optimization, values smaller than a user specified size (default 32 bytes) are stored with their associated keys rather than in a vlog file.
Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs. This is small enough to fit inside the RAM of virtually any computer, which means that often a key search can be done in high-speed memory, which can be orders of magnitude faster than searching secondary storage.
Advanced Encryption Standard
One Key to Rule Them All
Instead of using the AES encryption key directly to encrypt data, Badger uses two types of keys: 1) A user-provided AES encryption key, called the master key, is used to encrypt data keys. 2) Data keys are generated automatically by Badger, and used to encrypt and decrypt data on disk. Each encrypted data key is stored on disk along with the encrypted data.
By default, Badger rotates the data keys every ten days, but this can be changed by the user.
You should rotate your master key on a regular schedule. Fortunately, because the master key is used to encrypt only data keys, which are (even all together) much smaller compared to the data stored in the database, the master key does not need to be rotated as often (as data keys) to prevent key leak. Even better, when the master key is rotated, only the data keys need to be decrypted using the old master key, and then re-encrypted using the new master key. This is much faster than re-encrypting all of the data on disk.
Avoiding Encrypted Duplicates
When data keys are used to encrypt data stored in the database, the same data will often be encrypted multiple times before the rotation period for the data key expires. Encrypting the same data with the same data key always generates the same encrypted text to be stored in the database. This increases the ability of an attacker to reverse engineer the original plaintext data.
In addition to reducing data predictability, the use of IVs increases performance because an XOR operation is vastly faster than running the AES encryption algorithm directly on the data to be encrypted. The overhead of this increased security and performance is the extra storage used for each 16-byte IV. Because of this, a potential issue with using IVs is that it can lead to data bloat.
Reducing the Storage Overhead of IVs
The files containing the keys are each divided into blocks, where each block is 4KB in size. Badger uses a unique IV to encrypt each of these blocks and stores the IV in plaintext at the end of the encrypted block. The storage overhead of a 16B IV over a 4KB block is 0.4%, which is reasonable.
Note that storing the IVs in plaintext is still secure. Assume that a cracker gets access to the IV and an encrypted block. To decrypt the block, they’d need to encrypt the IV with the data key (to XOR the encrypted data back to plaintext). But to get access to the data key, they’d need to decrypt it using the master key. If master key is safe and secure, that won’t be possible, rendering the effort futile. Thus, knowing the IV is not sufficient to decrypt the data.
Avoiding the bloat of attaching a 16-byte IV to every value
Next, we need to encrypt the values stored in the vlog files. Each value is read individually from a vlog file, using the value pointer associated with each key, which stores the vlog file id, the offset in the file, and the size of the value. If we encrypt a whole block at a time, having multiple values in a block would cause a significant performance slowdown because considerable extra data would need to be decrypted to read an individual value.
As values are often accessed individually, we decided to encrypt each value individually. Because each encrypted value is required to have its own random IV, using one IV per value adds a 16B overhead for each value. For a vlog file that contains 10,000 entries, storing an IV with each value would require 160,000 bytes, which is not good.
It isn’t quite as bad as that, because small values are stored directly in the key files, instead of pointers, and that value gets decrypted along with its key. But even with this optimization, unless all values are very small it is easy to imagine that having one IV per value could double the size of a vlog file. Luckily, we came up with a better solution.
To optimize the encryption of the vlog entries, Badger uses a unique technique. Instead of generating a 16-byte IV and storing it at the end of each value in the vlog, Badger generates a single 12-byte IV that is used for all values in a vlog file.
Along with it, Badger attaches 4 bytes that contain the offset of the value in the vlog file, which together make up the required 16-byte IV. For decryption, the 4-byte vlog offsets are available from the (decrypted) value pointer stored with each key. The 12-byte IVs are unique for each vlog, and the 4-byte offsets are unique within a single vlog file, so each composite 16-byte IV is also unique.
This technique saves 16 bytes of space on disk for every value in a vlog file. Instead, this technique only requires 12 bytes per vlog file, regardless of the number of values in a file. That’s an even better savings than for the key files.
We added encryption to Badger using established and proven standards and best practices, implemented in a way that works well with the existing features and benefits of Badger. Most importantly, the addition of encryption did not appreciably slow Badger down, make it difficult or inconvenient to use, or significantly increase the storage used.
Badger is open source, and has been used as the storage engine for many large systems, and now it includes encryption for free. We hope Badger will make it easier for even more projects, especially ones that need to securely store encrypted data.