Most efficient way to bind several CSVs of 4 million rows each, up to 80 million rows?
I have several files containing ~4 million rows each, all with the same 4 column IDs. I am looking for the most efficient way to bind all of them (the total would be around 80 million rows); I think this is equivalent to concatenating all the rows. In R I would simply use
rbind(csv1, csv2)
but I've tried that and it took really long. I don't know if there is a more efficient way to do this, even considering other tools. I am running this on my laptop (8 GB RAM).
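Roughly, what I tried looks like this (just a sketch; the file names, the .csv pattern and the comma separator are placeholders, since my real files may be delimited differently):

files <- list.files(pattern = "\\.csv$")            # placeholder pattern for my per-chromosome files
combined <- NULL
for (f in files) {
  part <- read.csv(f, stringsAsFactors = FALSE)     # adjust sep/header to the real format
  combined <- rbind(combined, part)                 # this rbind step is what gets slow as `combined` grows
}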
The number of rows differs between files, ranging from 4 to 2 million each. A sample file looks like this:
id chr pos genotype
rs7349153 1 565490 TC
rs568632519 1 565596 GA
rs534091456 1 565619 AT
rs539860681 1 565643 TC
rs572552962 1 565658 TC
rs375428604 1 565696 CA
where id chr pos genotype are the column names. All rows are different; the only pattern is that the data is split across files by the chr column (so there is one file with chr1, another with chr2, etc.). The final output I expect is a txt file with all those rows concatenated, such as the sample below (a sketch of the write-out step follows it):
id chr pos genotype
rs4349153 1 565490 TC
rs468622519 1 565396 GA
rs534091456 2 565319 TT
rs639810381 2 565443 TT
rs572552362 3 564658 AC
rs675422304 3 565396 CA
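For the write-out step I was thinking of something like this (again just a sketch; the output file name is made up and it assumes the `combined` data frame from the loop above):

# write the combined table to a single whitespace-delimited txt (file name is made up)
write.table(combined, "all_chromosomes.txt", sep = " ",
            quote = FALSE, row.names = FALSE)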
I am open to using any other tools. I've never used a database but I can give it a try.
I also thought about using bash's cat, but I don't know whether I'd run into the same problems as with rbind.
Thank you for your insights!
EDIT: Added more details.