'Performance about distributed database on a single physical machine?
Currently I'm working with time-series data(Tickers) and stored the pandas dataframe in .h5 and .parquet format. My workload is basically querying the columns of data & writing new columns of data.
It starts to get out of control when the files increases and scatters around, so I'm considering using a database. I saw some NoSQL database like Cassandra and HBase.
These database often emphasis that they are distributed or elastic database. But I only have one single physical machine(a 3990x 64-core threadripper).
So my question is basically two part:
What is the performance if I run these
distributeddatabase on a single node? Will they still outperform a traditional database like mySQL regarding my workload?Can I use virtualization to build a cluster? Say, using ESXI to virtualize multiple nodes? Will they outperform a single node setup?
Solution 1:[1]
No.
You will spend a lot of time and effort trying to figure out how to use "distributed" when it is not necessary.
Start with a database with
CREATE TABLE stocks (
ticker_id MEDIUMINT UNSIGNED NOT NULL,
date DATE NOT NULL,
open, close, volume, etc
PRIMARY KEY(ticker_id, date),
INDEX(date)
) ENGINE=InnoDB;
CREATE TABLE tickers (
ticker_id MEDIUMINT UNSIGNED AUTO_INCREMENT NOT NULL,
ticker VARCHAR(33) CHARACTER SET ascii NOT NULL,
PRIMARY KEY(ticker_id),
INDEX(ticker)
) ENGINE=InnoDB;
This will be good enough for a billion rows on a single server even with very few cores.
Come back with the SELECTs after you have a few million rows. We can discuss the next steps to make it work well.
Solution 2:[2]
You should understand the benefits of using distributed/elastic DB systems vs (not-so-easily-distributed) DB systems like MySQL. Distributed systems are good from a horizontal scalability perspective; as you need to store more and more data, you can (often) seamlessly add new nodes and partition your data. The strategies vary with Databases, like Cassandra uses consistent hashing across nodes while Couchbase uses concept of virtual buckets across nodes.
If you have constraint of using a single machine, horizontal scalability is thrown out of picture. With virtualization, what you are can achieve is have multiple multiple DBs run in sandboxes. While these virtual machines may seem independent, your system resources are being used by all. Eg Your actual storage will remain same.
With that out of picture, I will advise you some column oriented DB and avoid NoSQL DBs for your usecase (unless you are specifically looking for schemaless support, and little to no joins are needed).
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Rick James |
| Solution 2 | agyeya |
