What is the best way to handle Django migration data with over 500k records for MySQL?
- The first migration handles creating two new fields, action_duplicate and status_duplicate
- The second migration copies the data from the action and status fields to the two newly created fields:
    def remove_foreign_keys_from_user_request(apps, schema_editor):
        UserRequest = apps.get_model("users", "UserRequest")
        for request_initiated in UserRequest.objects.all().select_related("action", "status"):
            request_initiated.action_duplicate = request_initiated.action.name
            request_initiated.status_duplicate = request_initiated.status.name
            request_initiated.save()
- The third migration is supposed to remove/delete the old action and status fields
- The fourth migration renames the new duplicate fields to the old, now-deleted, field names (a sketch of these schema operations appears after the model below)
The solution here is to remove the dependency on the status and action models, to avoid unnecessary database queries, since the status in particular will only ever be pending and completed.
My question is about the second migration. There are between 300k and 600k records, so I need a more efficient way to do this that does not use up all the available memory.
Note: The Database is MySQL.
A trimmed-down version of the UserRequest model
    class UserRequest(models.Model):
        id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
        reference = models.CharField(max_length=50, null=True, blank=True)
        requester = models.ForeignKey(User, blank=True, null=True, on_delete=models.CASCADE)
        action = models.ForeignKey(Action, on_delete=models.CASCADE)
        action_duplicate = models.CharField(
            max_length=50, choices=((ACTION_A, ACTION_A), (ACTION_B, ACTION_B)), default=ACTION_A
        )
        status = models.ForeignKey(ProcessingStatus, on_delete=models.CASCADE)
        status_duplicate = models.CharField(
            max_length=50,
            choices=((PENDING, PENDING), (PROCESSED, PROCESSED)),
            default=PENDING,
        )
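For illustration only (this code is not part of the original question): the third and fourth migrations described above are typically plain schema operations. A minimal sketch follows, collapsed into a single migration file for brevity, with a placeholder dependency name:

    from django.db import migrations


    class Migration(migrations.Migration):

        dependencies = [
            ("users", "0003_copy_action_and_status"),  # placeholder: the data migration
        ]

        operations = [
            # Third migration in the list above: drop the old ForeignKey columns.
            migrations.RemoveField(model_name="userrequest", name="action"),
            migrations.RemoveField(model_name="userrequest", name="status"),
            # Fourth migration: rename the duplicate columns to the old names.
            migrations.RenameField(
                model_name="userrequest",
                old_name="action_duplicate",
                new_name="action",
            ),
            migrations.RenameField(
                model_name="userrequest",
                old_name="status_duplicate",
                new_name="status",
            ),
        ]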
Solution 1:[1]
You can work with a Subquery expression [Django-doc], and do the update in bulk:
    from django.db.models import OuterRef, Subquery

    def remove_foreign_keys_from_user_request(apps, schema_editor):
        UserRequest = apps.get_model('users', 'UserRequest')
        Action = apps.get_model('users', 'Action')
        Status = apps.get_model('users', 'ProcessingStatus')
        UserRequest.objects.update(
            action_duplicate=Subquery(
                Action.objects.filter(
                    pk=OuterRef('action_id')
                ).values('name')[:1]
            ),
            status_duplicate=Subquery(
                Status.objects.filter(
                    pk=OuterRef('status_id')
                ).values('name')[:1]
            )
        )
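Not part of the original answer, but for context: a data-migration function like this is usually wired into a migration file with RunPython. A minimal sketch, assuming a placeholder dependency name:

    from django.db import migrations


    class Migration(migrations.Migration):

        dependencies = [
            # placeholder: the migration that added action_duplicate/status_duplicate
            ("users", "0002_userrequest_duplicate_fields"),
        ]

        operations = [
            migrations.RunPython(
                remove_foreign_keys_from_user_request,   # the function shown above
                reverse_code=migrations.RunPython.noop,  # the copied data is left in place on rollback
            ),
        ]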
That being said, it looks like what you are doing is actually the opposite of database normalization [wiki]: usually, if there is duplicated data, you introduce an extra model with one Action/Status record per value, and thus avoid storing the same value in action_duplicate/status_duplicate many times in the database. Duplicating the values will make the database larger and harder to maintain.
Note: normally a Django model is given a singular name, so UserRequest instead of UserRequests.
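As an aside that is not from the original answer: if the copied values ever needed per-row Python logic that cannot be expressed as a Subquery, a chunked iterator()/bulk_update() loop (Django 2.2+) is a common memory-conscious fallback. The batch size below is an arbitrary illustration, and on MySQL the database driver may still buffer the raw result set on the client:

    def remove_foreign_keys_from_user_request(apps, schema_editor):
        UserRequest = apps.get_model("users", "UserRequest")
        batch_size = 2000  # arbitrary; tune for the server's memory and row width
        batch = []
        queryset = UserRequest.objects.select_related("action", "status")
        for user_request in queryset.iterator(chunk_size=batch_size):
            user_request.action_duplicate = user_request.action.name
            user_request.status_duplicate = user_request.status.name
            batch.append(user_request)
            if len(batch) >= batch_size:
                UserRequest.objects.bulk_update(
                    batch, ["action_duplicate", "status_duplicate"]
                )
                batch = []
        if batch:  # flush the final partial batch
            UserRequest.objects.bulk_update(
                batch, ["action_duplicate", "status_duplicate"]
            )

This avoids instantiating all 500k model objects at once and issues one UPDATE per batch instead of one per row, though it is still slower than the single bulk UPDATE shown above.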
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Willem Van Onsem |
