When using torch.nn.parallel.DistributedDataParallel(), you may get this error: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. It means that some model parameters did not receive a gradient in the previous iteration, so DDP could not finish synchronizing gradients. This tutorial shows how to fix it.
How to fix this error?
There are two ways to fix it.
Method 1: use find_unused_parameters=True
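model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[self.local_rank], broadcast_buffers=False, find_unused_parameters=True)
With this flag set, DDP looks for parameters that did not take part in the forward pass and marks them as ready for reduction, so the RuntimeError no longer occurs.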
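As a rough illustration of Method 1, here is a minimal sketch that runs on CPU in a single process. The model BranchyModel, the helper run_two_steps, the "gloo" backend and the file-based rendezvous are all assumptions made for this example; they are not part of the original training code.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


class BranchyModel(nn.Module):
    """Hypothetical model: `extra` is registered but never used in forward()."""

    def __init__(self):
        super().__init__()
        self.main = nn.Linear(8, 4)
        self.extra = nn.Linear(8, 4)  # takes no part in the forward pass

    def forward(self, x):
        return self.main(x)


def run_two_steps():
    # Single-process "gloo" group on CPU, created only so DDP can be built.
    store_file = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store_file}", rank=0, world_size=1
    )
    try:
        # find_unused_parameters=True lets DDP skip `extra` during reduction.
        model = DDP(BranchyModel(), find_unused_parameters=True)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        losses = []
        # Two iterations, because the error is reported when the next
        # iteration starts, not during the one that caused it.
        for _ in range(2):
            optimizer.zero_grad()
            loss = model(torch.randn(16, 8)).sum()
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
        return losses
    finally:
        dist.destroy_process_group()


losses = run_two_steps()
```

In a real multi-GPU job the process group is usually created by the launcher and device_ids is passed as in the line above; the single-process setup here only keeps the sketch self-contained.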
Method 2: remove all model forward() outputs that are not used when calculating the loss.
model = nn.parallel.DistributedDataParallel(model, device_ids=[self.local_rank], broadcast_buffers=False, find_unused_parameters=True)
y_pred, y_tgt = model(x)
loss = cross_entropy(y_pred, labels)  # labels: ground-truth classes (name assumed); y_tgt is never used
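In this example, the model's forward() returns two variables: y_pred and y_tgt.
However, only y_pred is used when computing the cross entropy loss; y_tgt is not used.
As a result, the RuntimeError occurs even with find_unused_parameters=True, because the parameters that contribute only to y_tgt are reachable from the forward() outputs but never receive a gradient.
To fix this error, change the model so that it does not return y_tgt.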
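The fix for Method 2 might look like the sketch below. TwoHeadModel, its layer sizes, the random inputs and the helper train_steps are hypothetical names chosen for illustration; the single-process "gloo" group on CPU is also an assumption, used only to make the example self-contained.

```python
import os
import tempfile

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP


class TwoHeadModel(nn.Module):
    """Hypothetical model with a shared backbone and two output heads."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 16)
        self.pred_head = nn.Linear(16, 3)
        self.tgt_head = nn.Linear(16, 3)

    def forward(self, x):
        h = torch.relu(self.backbone(x))
        # Before the fix: return self.pred_head(h), self.tgt_head(h)
        return self.pred_head(h)  # return only what the loss consumes


def train_steps(num_steps=2):
    store_file = os.path.join(tempfile.mkdtemp(), "ddp_store")
    dist.init_process_group(
        "gloo", init_method=f"file://{store_file}", rank=0, world_size=1
    )
    try:
        model = DDP(
            TwoHeadModel(), broadcast_buffers=False, find_unused_parameters=True
        )
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
        for _ in range(num_steps):
            x = torch.randn(16, 8)
            labels = torch.randint(0, 3, (16,))  # random class indices
            optimizer.zero_grad()
            y_pred = model(x)  # a single output now
            loss = F.cross_entropy(y_pred, labels)
            loss.backward()
            optimizer.step()
        # The head that feeds the loss should have received a gradient.
        has_grad = model.module.pred_head.weight.grad is not None
        return loss.item(), has_grad
    finally:
        dist.destroy_process_group()


final_loss, pred_head_has_grad = train_steps()
```

Here tgt_head is still registered on the module but is no longer called, so find_unused_parameters=True is kept to let DDP skip it. If the second head is not needed at all, removing it from the model is the simpler option.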