Fast Insert Performance Custom Database
The heart of our SEO Analytics product is a custom database built for fast insert performance.
In this blog post, we cover the reasons we developed a custom database. In short, MySQL/PostgreSQL insert performance degrades after 50-100GB of data, from 14,000 inserts per second down to 3,000. This limitation forces the use of clusters, which can make the monthly cost of operation very high. A cluster that can hold 10TB of uncompressed data on SSDs costs $4,000 to $30,000 USD per month, depending on the web host and the sophistication of the dev ops.
Just for comparison, our custom database can hold 10TB of compressed data with indexes for $300 USD per month; adding redundancy in the form of a second server brings the total to $600 USD per month. We wrote a post just about why we would want to build a custom database.
Besides the lower cost, our database reaches insert speeds of 100,000 to 300,000 inserts per second, depending on the data structure and entropy. The insert speed remains constant and does not degrade as the data grows.
To achieve this insert performance, we needed a design unlike that of standard modern databases. Otherwise, we would simply have used a modern database.
Because the database is custom, all definitions are made at the code level. It would be possible to make the system more flexible, and once we discover more uses for the database, we will probably make that change.
One of the advantages of a custom database is knowing precisely what you need it for. In our case, we don’t need ACID compliance. The reason is that we insert the data from files, so if there’s a failure, we can always resume the inserts from the log files.
Inserted data is not used until there is a full rescan, so we don’t need to do any locking, which allows for greater insert/select speed.
The database tables are similar to modern database tables but have fewer options.
The next paragraphs show how we define a table such as the URLs table:
The table supports a primary key function that can span one field or many fields.
The primary key can be used in two ways:
- An index that is used to retrieve records.
- An index that is only used while rebuilding the database to remove duplicate records.
The table supports the following field types:
- Char (1 byte)
- Short (2 bytes)
- Int (4 bytes)
- Constant size string (N bytes)
- Variable size string (1+actual byte size of string)
- Variable size compressed string (2 bytes + either the compressed size of the string or, if compression doesn’t help, the N bytes of the original string)
- Variable size combined compressed string (during a rebuild, the database tries a special dictionary build for URLs and uses the best compression per string)
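Since all definitions live at the code level, a table definition might look something like the sketch below. All names and fields here are hypothetical, invented for illustration; the actual system’s definitions are not shown in this post.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch of a code-level table definition.
// Field types mirror the list above.
enum class FieldType : uint8_t {
    Char,             // 1 byte
    Short,            // 2 bytes
    Int,              // 4 bytes
    FixedString,      // N bytes
    VarString,        // 1-byte length prefix + string bytes
    VarCompressed,    // 2-byte prefix + compressed or raw bytes
    VarDictCompressed // dictionary-compressed during rebuild
};

struct FieldDef {
    std::string name;
    FieldType type;
    uint32_t fixedSize;  // used only for FixedString
    bool primaryKeyPart; // the primary key may span several fields
};

struct TableDef {
    std::string name;
    std::vector<FieldDef> fields;
    bool primaryKeyIsIndex; // true: retrieval index; false: dedup-only at rebuild
};

// Example: a URLs table keyed on the URL itself (fields are illustrative).
TableDef makeUrlsTable() {
    return TableDef{
        "urls",
        {
            {"url",         FieldType::VarCompressed, 0, true},
            {"flags",       FieldType::Char,          0, false},
            {"http_status", FieldType::Short,         0, false},
            {"first_seen",  FieldType::Int,           0, false},
        },
        true,
    };
}
```

The `primaryKeyIsIndex` flag captures the two primary-key modes described above: a real retrieval index versus a key used only to remove duplicates during a rebuild.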
It’s possible to create an index on any single field; the number of indexes is not limited.
A table can be divided into 1,296 partitions. Each record is assigned to a partition by the first two characters of its primary key hash. Partitioning keeps each table’s size under a certain threshold.
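Assuming the hash characters come from the 0-9 and a-z alphabet used for the directory layout later in the post, two characters give 36 × 36 = 1,296 partitions. A minimal sketch of that mapping (function names are ours):

```cpp
#include <string>

// Map one hash character (0-9 or a-z) to 0..35.
int charIndex(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    return 10 + (c - 'a'); // assumes lowercase a-z
}

// Map the first two characters of a primary-key hash to one of
// 36 * 36 = 1296 partitions.
int partitionOf(const std::string& hash) {
    return charIndex(hash[0]) * 36 + charIndex(hash[1]); // 0..1295
}
```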
A table is created for every entity. In our example, we have a site called CNN.com; each site has five tables, so the database creates five files on the drive.
The file system block size is 4KB, which means that many small files under 4KB waste space because of slack. The table rebuild solves this issue.
Inserting the Data
Data is inserted in bulk from a file; the database has a preprogrammed parser to read the data and insert it into the right tables.
The database caches all the inserted keys in memory. This allows it to ignore duplicate records within the insert data.
The data is first inserted into memory. Every 15 files, the database dumps to the SSD any tables that hold more than 50 records. Once memory is exhausted, the database flushes all data to the SSD and resets the memory; this flush occurs every 1,000 files.
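The buffering policy above can be sketched roughly as follows. This is an illustrative model, not the actual implementation: the structure name, counters, and the use of strings for records are all assumptions.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: inserts accumulate in memory; every 15 input
// files, tables holding more than 50 records are dumped to SSD;
// every 1000 files, everything is flushed and memory is reset.
struct InsertBuffer {
    std::map<std::string, std::vector<std::string>> tables; // table -> records
    int filesSeen = 0;
    int dumps = 0;   // partial dumps performed
    int flushes = 0; // full flushes performed

    void insert(const std::string& table, const std::string& record) {
        tables[table].push_back(record);
    }

    void endOfFile() {
        ++filesSeen;
        if (filesSeen % 1000 == 0) {
            // Full flush: write everything to SSD (omitted) and reset memory.
            tables.clear();
            ++flushes;
        } else if (filesSeen % 15 == 0) {
            for (auto& kv : tables) {
                if (kv.second.size() > 50) {
                    // Write kv.second to the table's SSD file (omitted).
                    kv.second.clear();
                    ++dumps;
                }
            }
        }
    }
};
```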
Problem with STL
When we first started to test the custom database, it worked fine for a few iterations, but after an hour or so it started to slow down. It took us a while to track down the problem.
We discovered that STL allocates and deallocates a lot of memory. The memory becomes fragmented, and the default memory manager tries to defragment it, which causes the slowdowns.
Our Own Memory Manager
The first solution was to replace STL memory management with our own, which partially solved the problem. Our memory manager works by allocating a big chunk of memory (100MB) at a time, which is much faster and uses less memory than the standard C++ memory management.
However, the speed came at a cost: it can only be used by a single thread and we needed multiple threads to write the data.
Since STL performs its own internal allocations, we can’t fully control it and tell it not to allocate memory.
Our Own Data Structures
We ended up writing our own data structures, some of which STL does not provide. Although it took time, it gave us full control over when and how memory was allocated.
For example, if we know we will not try to reuse the memory, the allocation strategy is different than with memory we do plan to reuse.
Another strategy we used: after each flush, we discarded the memory manager entirely (similar to discarding a heap), which allowed the next batch to allocate its memory differently.
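This pattern is commonly known as an arena (or bump) allocator. A minimal sketch, with our own invented names and a configurable chunk size standing in for the 100MB chunks described above:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical arena-style memory manager: grab large chunks from the
// system, hand out allocations by bumping an offset, and free everything
// at once by discarding the whole arena after a flush.
class Arena {
public:
    explicit Arena(size_t chunkSize = 100u * 1024 * 1024)
        : chunkSize_(chunkSize) {}

    void* allocate(size_t n) {
        n = (n + 7) & ~size_t(7); // round up to 8-byte alignment
        if (chunks_.empty() || used_ + n > chunks_.back().size()) {
            chunks_.emplace_back(std::max(n, chunkSize_));
            used_ = 0;
        }
        void* p = chunks_.back().data() + used_;
        used_ += n;
        return p;
    }

    // "Discard the memory manager": one cheap operation frees everything,
    // with no per-object deallocation and no fragmentation.
    void reset() { chunks_.clear(); used_ = 0; }

    size_t chunkCount() const { return chunks_.size(); }

private:
    size_t chunkSize_;
    size_t used_ = 0;
    std::vector<std::vector<uint8_t>> chunks_;
};
```

Note that, as the post says, a bump allocator like this is not thread-safe by itself; each writing thread would need its own arena.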
Tables in the Database
The database contains the following tables that are created for every site:
- URLs – URLs of a website
- Backlinks – All backlinks pointing to a specific site
- Outlinks – All outlinks from a specific site
Tables that are related to keywords are built for every keyword:
- Keyword links – Points to all links by the keyword name (to allow searching for links)
- Keyword titles – Points to all URLs that contain the keyword in the title
- Keyword meta description – Points to all URLs that contain the keyword in the meta description
- Keyword meta keywords – Points to all URLs that contain the keyword in the meta keywords
- Keywords – All the keywords in the database
- Hosts – All the hosts in the database
The database can have more types of tables. These are the tables needed for our design.
Keywords and hosts are repetitive data, so they are normalized to a 6-byte hash. Data that relates to keywords or hosts contains only the hash, to save space.
The server contains three 4TB SSDs, with directories divided into 3 levels of 0-9 and a-z, which gives 36^3 directories. The directory is chosen by the first three characters of the hash of the host or keyword. This structure allows us to avoid having many files in a single directory.
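As a small illustration, choosing the directory from the hash might look like this (the function name is ours, and we assume the hash is already rendered in 0-9a-z characters):

```cpp
#include <string>

// Hypothetical sketch: derive a 3-level directory path from the first
// three characters of a host/keyword hash, giving 36^3 = 46,656 possible
// directories and keeping per-directory file counts low.
std::string directoryFor(const std::string& hash) {
    std::string path;
    path += hash[0]; path += '/';
    path += hash[1]; path += '/';
    path += hash[2];
    return path; // e.g. "a/1/b"
}
```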
Data is appended sequentially without any index build or recomputation, which means the index on an existing table only points to the old data.
To integrate the new data and new files into the database, a full rebuild is needed.
The rebuild allows us to save time on index builds, since the data is not needed immediately and can be updated over days rather than seconds.
The first task of the rebuild is to remove any duplicate records. By design, there are duplicate records, because it’s faster and more memory-efficient to write every record and remove duplicates later than to try to remove them on every write.
Furthermore, some records are updated. For example, if there is a changed title for a URL, the deduplication process knows how to take the most updated record and adjust the dates of the record.
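A simple way to express this keep-the-newest deduplication is to sort by key with the most recent record first, then keep the first record per key. This is a sketch under our own assumptions (record layout, field names, and the timestamp comparison are all invented):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical rebuild-time deduplication: sort by key, newest first
// within each key, then keep only the first record per key.
struct UrlRecord {
    std::string url;   // the key
    std::string title;
    int updatedAt;     // e.g. days since epoch
};

std::vector<UrlRecord> dedupe(std::vector<UrlRecord> rows) {
    std::stable_sort(rows.begin(), rows.end(),
        [](const UrlRecord& a, const UrlRecord& b) {
            if (a.url != b.url) return a.url < b.url;
            return a.updatedAt > b.updatedAt; // newest first within a key
        });
    std::vector<UrlRecord> out;
    for (const auto& r : rows) {
        if (out.empty() || out.back().url != r.url) out.push_back(r);
    }
    return out;
}
```

The real process also adjusts record dates when merging updates, which is omitted here.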
Data is compressed in blocks of 64KB to allow for fast index retrieval. This means that data sorted in different ways will have a different compression rate. This is unlike compressing the whole file at once, which is more efficient but forces the system to decompress the entire file to read a single record.
To increase the compression rate, the table is sorted by every field and compressed. The sorted field that has yielded the best compression will be used for that table.
The sort before compression means that different tables from the same type (for example, URL tables) will be sorted differently because of the different data types.
Indexes are computed after the sort order has been determined. The index itself is sorted by key, to allow for the creation of sparse indexes. After the index is generated, it too is compressed in 64KB blocks, and the newly created index is placed at the end of the data file.
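A sparse index over sorted, block-compressed data typically stores one entry per block: the first key in the block plus the block’s file offset. Looking up a record is then a binary search over the entries followed by decompressing a single 64KB block. A sketch under those assumptions (names are ours):

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sparse index: one entry per compressed 64KB block.
struct SparseEntry {
    std::string firstKey; // smallest key stored in the block
    uint64_t offset;      // byte offset of the compressed block in the file
};

// Returns the offset of the block that may contain `key`.
// Assumes `index` is sorted by firstKey (the data is sorted before
// compression, so this holds by construction).
uint64_t blockFor(const std::vector<SparseEntry>& index, const std::string& key) {
    auto it = std::upper_bound(index.begin(), index.end(), key,
        [](const std::string& k, const SparseEntry& e) { return k < e.firstKey; });
    if (it == index.begin()) return index.front().offset; // key precedes first block
    return (it - 1)->offset; // last block whose firstKey <= key
}
```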
There are two types of indexes:
- Primary key, the key is unique.
- Regular key, the key is not unique.
Creating File Packs
To save the space lost to slack (the difference between file size and the SSD block size), files under a specific size are packed into one file under the relevant directory.
An index is also created for that file to allow for fast retrieval of the files.
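A file pack of this kind can be modeled as concatenated contents plus an index mapping each member’s name to its offset and size. This in-memory sketch is illustrative only; the actual on-disk format is not described in the post.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical file pack: many small files concatenated into one blob,
// with an index mapping file name -> (offset, size) for direct reads.
struct Pack {
    std::string blob; // concatenated file contents
    std::map<std::string, std::pair<uint64_t, uint64_t>> index;

    void add(const std::string& name, const std::string& contents) {
        index[name] = {blob.size(), contents.size()};
        blob += contents;
    }

    std::string read(const std::string& name) const {
        auto e = index.at(name);
        return blob.substr(e.first, e.second);
    }
};
```

Packing members back to back means no per-file slack: only the pack file itself rounds up to the 4KB block size.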
The first stage of data retrieval was to add the read feature to our data library. This means that any C++ program can access the data without the need for a server.
We took one of our socket servers and connected it to the data library. This way, we could serve the data over the network to our website and API server.
The server is multithreaded and has data caching for regular and chunked retrieval just like a regular database.
We also added a feature that allows us to go from a raw dataset in the cache to a filtered dataset without querying the data again. Once there is no more data in the raw dataset, the server will query the data again.
Could It Be Done with Another Database?
People who read our database story, or the people I have spoken with, always have a particular database in mind. But there’s a difference between theory and practice: the database being considered may have good performance in general, yet behave quite differently with our schema.
Also, another database might perform well, but the tradeoff is space. SSD/NVMe space is not cheap! If a database takes 3 to 5 times more storage than ours, the cost of running it goes up and continues to rise as more data is added.
Other companies have backlinks databases with petabytes of storage; some even use a custom database. Could they benefit from our database? Maybe, time will tell.
The reason for this blog post about custom database development is to show the details of our system and draw attention to it for anyone whose use case requires exactly what we provide. If you think this is the answer you have been looking for and need a paid solution, contact us and we can see if we can work together.