How to slowly migrate a huge amount of data from one PostgreSQL server to another?
I have a PostgreSQL database holding a large amount of data (~400 GB) that I need to transfer to another PostgreSQL instance running on a different server. What are my options?
Both DBs live in k8s pods on AWS; the pods are orchestrated by the same Kubernetes cluster but are otherwise independent. Also, the destination DB is under serious write load, so I would prefer to migrate the data slowly, perhaps over the course of several days, to keep the extra write load minimal.
I understand that there are standard approaches using pg_dump and dblink, but if I understand correctly, pg_dump would first have to create the export file in full, which is not really feasible given the database size, while dblink would try to select all of the data matching a SELECT * FROM ... query at once, so it would also fail.
One more thing about the data I need to migrate: it has a timestamp field and is distributed more or less evenly across a long time range. But I don't know whether there is a way to run something purely on the PG server side that would loop over small time windows, fetch the data for each window, store it locally, and then move on to the next window.
If all else fails, I'll just write a standalone ad-hoc service deployed alongside those DB services to do exactly that. But I'd prefer a solution based on already existing tools.
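For reference, the windowed loop described above can be sketched with plain psql and its \copy command, without any custom service. Everything here is an assumption for illustration: the table name `events`, the timestamp column `ts`, the connection strings, and the monthly window size; GNU date is assumed for the date arithmetic. With `DRY_RUN=1` the script only prints the windows it would copy.

```shell
#!/bin/sh
# Sketch: copy one timestamp window at a time between two Postgres instances,
# pausing between windows to keep the extra write load low.
# Hypothetical placeholders: table "events", column "ts", $SRC/$DST URLs.
SRC="postgresql://user@src-host/db"
DST="postgresql://user@dst-host/db"
DRY_RUN=1            # set to anything else to actually run the copies
cur=2021-01-01
stop=2021-04-01
while [ "$cur" != "$stop" ]; do
  next=$(date -u -d "$cur + 1 month" +%F)   # GNU date relative arithmetic
  sql_out="\\copy (SELECT * FROM events WHERE ts >= '$cur' AND ts < '$next') TO STDOUT"
  sql_in="\\copy events FROM STDIN"
  if [ "$DRY_RUN" = 1 ]; then
    echo "would copy window $cur .. $next"
  else
    # Stream one window from source to destination without a temp file
    psql "$SRC" -c "$sql_out" | psql "$DST" -c "$sql_in"
    sleep 60         # throttle: raise this to spread the migration over days
  fi
  cur=$next
done
```

Because each window is a separate, resumable copy, the loop can be stopped and restarted from the last completed window if anything goes wrong.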
Solution 1:[1]
I had great success in the past by creating a pipeline between two PostgreSQL databases. This works because pg_dump emits normal SQL that can be fed directly into the destination database:
$ pg_dump <from first database> | psql <to second database>
Instead of creating an export file, it just sends the export to the other database.
I only did it one table at a time, and only for the table data, but you can tell pg_dump whether to include the schema and table definitions, or just the data.
Actually, I'd recommend doing a first run just to create the schema and tables, and then a separate run to copy the data. That way, if something goes wrong, you can re-copy specific tables. It also gives you a chance to verify the schema (which should be quick) before copying the data (which will be slow).
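Concretely, the two runs might look like the following. The host, database, and table names are placeholders; pg_dump's --schema-only, --data-only, and --table flags select what gets dumped.

```shell
# Phase 1: schema and table definitions only -- quick; verify the result
# on the destination before moving any data
pg_dump --schema-only -h src-host sourcedb | psql -h dst-host targetdb

# Phase 2: data only, one table at a time, so a failure stays isolated
# to a single table that can simply be re-run
pg_dump --data-only --table=events -h src-host sourcedb | psql -h dst-host targetdb
```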
I ran the command on a separate computer, so effectively it was going db -> computer -> db.
Solution 2:[2]
You could write a Python script to get this done. In fact, there is already a script on GitHub whose goal is to move lots of data between PostgreSQL databases quickly. That is not quite the same as your requirement, but you could easily customize it to introduce some logic that limits the rate. Hope this is helpful.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | John Rotenstein |
| Solution 2 | BlueMice |
