Understanding GPU usage in Hugging Face classification
I am building a classifier using Hugging Face and would like to understand the line `Total train batch size (w. parallel, distributed & accumulation) = 64` from the training log below:
```
Num examples = 7000
Num Epochs = 3
Instantaneous batch size per device = 4
Total train batch size (w. parallel, distributed & accumulation) = 64
Gradient Accumulation steps = 16
Total optimization steps = 327
```
I have 7000 rows of data, I have set the number of epochs to 3, `per_device_train_batch_size = 4`, and `per_device_eval_batch_size = 16`. I also see where `Total optimization steps = 327` comes from (roughly 7000 * 3 / 64).
But I am not clear about `Total train batch size (w. parallel, distributed & accumulation) = 64`. Does it mean that there are 16 devices, since 16 * 4 (with `Instantaneous batch size per device = 4`) comes to 64?
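To make the step count concrete, here is a minimal sketch of the arithmetic, assuming a single device and that the Trainer floors the number of optimizer updates per epoch (an assumption based on its source; the exact 7000 * 3 / 64 would give 328.125, not 327):

```python
import math

num_examples = 7000
num_epochs = 3
per_device_train_batch_size = 4
gradient_accumulation_steps = 16

# Batches the dataloader yields per epoch (assuming drop_last=False).
batches_per_epoch = math.ceil(num_examples / per_device_train_batch_size)  # 1750

# One optimizer update every 16 batches; a leftover partial group is dropped.
updates_per_epoch = batches_per_epoch // gradient_accumulation_steps  # 109

print(num_epochs * updates_per_epoch)  # 327
```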
Solution 1:
The variable used for printing that summary is this one: https://github.com/huggingface/transformers/blob/master/src/transformers/trainer.py#L1211.
The total train batch size is defined as `train_batch_size * gradient_accumulation_steps * world_size`, so in your case 4 * 16 * 1 = 64. `world_size` is always 1 except when you are using a TPU or training in a distributed/parallel setup; see https://github.com/huggingface/transformers/blob/master/src/transformers/training_args.py#L1127.
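In code, the formula amounts to the following minimal sketch (the variable names mirror the `TrainingArguments` fields, and `world_size = 1` is assumed for a single-GPU run):

```python
per_device_train_batch_size = 4   # "Instantaneous batch size per device"
gradient_accumulation_steps = 16
world_size = 1                    # 1 unless training on multiple devices or a TPU

# Effective number of samples contributing to each optimizer update.
total_train_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * world_size
)
print(total_train_batch_size)  # 64
```

So the 64 does not imply 16 devices: on a single device, gradient accumulation alone makes each optimizer step see 4 * 16 = 64 examples.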
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ewz93 |
