When fine-tuning large language models (LLMs) locally, using large batch sizes is often impractical due to their substantial GPU memory consumption. To overcome this limitation, a technique called gradient accumulation is commonly used to simulate larger batch sizes. Instead of updating the model weights after processing each batch, gradient accumulation sums the gradients over several smaller mini-batches. The model weights are updated only after a predetermined number of these mini-batches have been processed. This method effectively mimics training with a larger batch size without the memory overhead typically associated with it.
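The loop below is a minimal sketch of the idea for a toy linear model with MSE loss, using plain NumPy; all the names (`accum_steps`, `grad_accum`, the learning rate) are illustrative, not the API of any particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))          # 32 samples, 4 features
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.normal(size=32)

w = np.zeros(4)                        # model weights
lr = 0.1
accum_steps = 32                       # mini-batches to accumulate per update

grad_accum = np.zeros_like(w)
for step, (xi, yi) in enumerate(zip(X, y), start=1):
    # Per-sample gradient of the MSE loss, scaled by 1/accum_steps so the
    # accumulated sum equals the gradient of the mean loss over all samples.
    err = xi @ w - yi
    grad_accum += (2 * err * xi) / accum_steps
    if step % accum_steps == 0:
        w -= lr * grad_accum           # single optimizer step, then reset
        grad_accum[:] = 0.0
```

Only one sample's gradient lives in memory at a time, yet the weight update matches a full-batch step, which is the memory saving described above.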
For instance, setting a mini-batch size of 1 and accumulating gradients over 32 mini-batches should be equivalent to training with a full batch size of 32. However, I found that gradient accumulation often yields significantly degraded performance compared with training on larger actual batch sizes in popular deep-learning frameworks such as Transformers.
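As a sanity check, the claimed equivalence does hold mathematically when each mini-batch gradient is scaled correctly. The snippet below verifies it numerically for a toy linear model with MSE loss (an illustrative setup, not the internals of any framework): 32 accumulated size-1 gradients, each scaled by 1/32, match the full-batch gradient of the mean loss.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
w = rng.normal(size=4)

# Gradient of the mean MSE loss over the full batch of 32 samples.
full_grad = 2 * X.T @ (X @ w - y) / 32

# Accumulated gradient from 32 mini-batches of size 1, each scaled by 1/32.
accum_grad = np.zeros_like(w)
for xi, yi in zip(X, y):
    accum_grad += 2 * (xi @ w - yi) * xi / 32

print(np.allclose(full_grad, accum_grad))  # → True
```

The degraded performance reported above therefore has to come from how frameworks apply this scaling in practice, not from the math itself.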
After I shared this issue on X and Reddit, Daniel Han from Unsloth AI reproduced the problem. He found that it affected not only gradient accumulation but also multi-GPU setups. In such…