How to reduce "Cuda Memcpy Async" events and why you should beware of boolean mask operations
This is the third part of a series of posts on the topic of analyzing and optimizing PyTorch models using PyTorch Profiler and TensorBoard. Our intention has been to highlight the benefits of performance profiling and optimization of GPU-based training workloads and their potential impact on the speed and cost of training. In particular, we would like to demonstrate the accessibility of profiling tools such as PyTorch Profiler and TensorBoard to all ML developers. You do not need to be a CUDA expert in order to derive meaningful performance gains from applying the techniques we discuss in our posts.
In our first post we demonstrated how the different views of the PyTorch Profiler TensorBoard plugin can be used to identify performance issues and reviewed a few common techniques for accelerating training. In the second post we showed how the TensorBoard plugin Trace View can be used to identify when tensors are being copied from the CPU to the GPU, and back. Such movement of data, which can cause points of synchronization and slow the speed of training considerably, is often unintentional and can sometimes be easily avoided. The topic of this post will be situations in which we encounter points of synchronization between the GPU and CPU that are not associated with tensor copies. As in the case of tensor copies, these can cause stagnation in your training step and slow the overall time of your training considerably. We will demonstrate the existence of such occurrences, how they can be identified using PyTorch Profiler and the PyTorch Profiler TensorBoard plugin Trace View, and the potential performance benefits of building your model in a way that minimizes such synchronization events.
As in our previous posts, we will define a toy PyTorch model and then iteratively profile its performance, identify bottlenecks, and attempt to fix them. We will run our experiments on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) and use the official AWS PyTorch 2.0 Docker image. Keep in mind that some of the behaviors we describe may vary between versions of PyTorch.
In the following blocks we introduce a toy PyTorch model that performs semantic segmentation on a 256x256 input image, i.e., it takes a 256x256 RGB image and outputs a 256x256 map of "per-pixel" labels from a set of ten semantic categories.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
import torch.profiler
import torch.utils.data
from torch import Tensor


class Net(nn.Module):
    def __init__(self, num_hidden=10, num_classes=10):
        super().__init__()
        self.conv_in = nn.Conv2d(3, 10, 3, padding='same')
        hidden = []
        for i in range(num_hidden):
            hidden.append(nn.Conv2d(10, 10, 3, padding='same'))
            hidden.append(nn.ReLU())

        self.hidden = nn.Sequential(*hidden)
        self.conv_out = nn.Conv2d(10, num_classes, 3, padding='same')

    def forward(self, x):
        x = F.relu(self.conv_in(x))
        x = self.hidden(x)
        x = self.conv_out(x)
        return x
```
To train our model we will use the standard cross-entropy loss with a few modifications:
- We will assume that the target labels include an ignore value indicating pixels that we want to exclude from the loss calculation.
- We will assume that one of the semantic labels identifies certain pixels as belonging to the "background" of the image. We define our loss function to treat these as ignore labels.
- We will update our model weights only when we encounter batches whose target tensors include at least two unique values.
Whereas we’ve got chosen these modifications for the needs of our demonstration, a lot of these operations should not unusual and will be discovered in lots of “commonplace” PyTorch fashions. Since we’re already “consultants” at efficiency profiling, we’ve got already gone forward and wrapped every of the operations in our loss perform with a torch.profiler.record_function context supervisor, (as described in our second put up).
```python
class MaskedLoss(nn.Module):
    def __init__(self, ignore_val=-1, num_classes=10):
        super().__init__()
        self.ignore_val = ignore_val
        self.num_classes = num_classes
        self.loss = torch.nn.CrossEntropyLoss()

    def cross_entropy(self, pred: Tensor, target: Tensor) -> Tensor:
        # create a boolean mask of valid labels
        with torch.profiler.record_function('create mask'):
            mask = target != self.ignore_val

        # permute the logits in preparation for masking
        with torch.profiler.record_function('permute'):
            permuted_pred = torch.permute(pred, [0, 2, 3, 1])

        # apply the boolean mask to the targets and logits
        with torch.profiler.record_function('mask'):
            masked_target = target[mask]
            masked_pred = permuted_pred[mask.unsqueeze(-1).expand(-1, -1, -1,
                                                                  self.num_classes)]
            masked_pred = masked_pred.reshape(-1, self.num_classes)

        # calculate the cross-entropy loss
        with torch.profiler.record_function('calc loss'):
            loss = self.loss(masked_pred, masked_target)
        return loss

    def ignore_background(self, target: Tensor) -> Tensor:
        # discover all indices where the target label is "background"
        with torch.profiler.record_function('non_zero'):
            inds = torch.nonzero(target == self.num_classes - 1, as_tuple=True)

        # reset all "background" labels to the ignore index
        with torch.profiler.record_function('index assignment'):
            target[inds] = self.ignore_val
        return target

    def forward(self, pred: Tensor, target: Tensor) -> Tensor:
        # ignore background labels
        target = self.ignore_background(target)

        # retrieve a list of unique elements in target
        with torch.profiler.record_function('unique'):
            unique = torch.unique(target)

        # check if the number of unique items passes the threshold
        with torch.profiler.record_function('numel'):
            ignore_loss = torch.numel(unique) < 2

        # calculate the cross-entropy loss
        loss = self.cross_entropy(pred, target)

        # zero the loss in the case that the number of unique elements
        # is below the threshold
        if ignore_loss:
            loss = 0. * loss

        return loss
```
Our loss function seems innocent enough, right? Wrong! As we will see below, the loss function includes a number of operations that trigger host-device synchronization events and slow the speed of training considerably, none of which involve copying tensors into or out of the GPU. As in our previous post, we challenge you to try to identify three opportunities for performance optimization before reading on.
For the purposes of our demo, we use randomly generated images and per-pixel label maps, as defined below.
```python
from torch.utils.data import Dataset


# A dataset with random images and label maps
class FakeDataset(Dataset):
    def __init__(self, num_classes=10):
        super().__init__()
        self.num_classes = num_classes
        self.img_size = [256, 256]

    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3] + self.img_size, dtype=torch.float32)
        rand_label = torch.randint(low=-1, high=self.num_classes,
                                   size=self.img_size)
        return rand_image, rand_label


train_set = FakeDataset()
train_loader = torch.utils.data.DataLoader(train_set, batch_size=256,
                                           shuffle=True, num_workers=8,
                                           pin_memory=True)
```
Lastly, we define our training step with the PyTorch Profiler configured to our liking:
```python
device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = MaskedLoss().cuda(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
model.train()

# training loop wrapped with profiler object
with torch.profiler.profile(
        schedule=torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1),
        on_trace_ready=torch.profiler.tensorboard_trace_handler('/tmp/prof'),
        record_shapes=True,
        profile_memory=True,
        with_stack=True) as prof:
    for step, data in enumerate(train_loader):
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True)
        if step >= (1 + 4 + 3) * 1:
            break
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        prof.step()
```
If you were to naively run this training script, you would probably see high GPU utilization (~90%) and not know that there was anything wrong with it. It is only through profiling that we are able to identify the underlying performance bottlenecks and potential opportunities for training acceleration. So, without further ado, let's see how our model performs.
In this post we will focus on the Trace View of the PyTorch Profiler TensorBoard plugin. Please see our previous posts for tips on how to use some of the other views supported by the plugin.
In the image below we show the Trace View of a single training step of our toy model.
We can clearly see that our 1.3 second long training step is completely dominated by the torch.nonzero operator in the first line of our loss function. All of the other operations appear bunched together on either side of the huge cudaMemcpyAsync event. What is going on??!! Why would such a seemingly innocent operation cause such a huge eyesore?
Perhaps we shouldn't be so surprised, since the torch.nonzero documentation does include the following note: "When input is on CUDA, torch.nonzero() causes host-device synchronization." The need for synchronization arises from the fact that, contrary to other common PyTorch ops, the size of the tensor that is returned by torch.nonzero is not pre-determined. The CPU does not know how many non-zero elements there are in the input tensor ahead of time. It needs to wait for the sync event from the GPU in order to perform the appropriate GPU memory allocation and properly prepare the subsequent PyTorch ops.
Note that the length of cudaMemcpyAsync is not indicative of the complexity of the torch.nonzero op, but rather reflects the amount of time that the CPU needs to wait for the GPU to finish all of the previous kernels that the CPU launched. For example, were we to make an additional torch.nonzero call immediately after our first one, our second cudaMemcpyAsync event would appear significantly shorter than the first since the CPU and GPU are already more or less "in sync". (Keep in mind that this explanation is coming from a non-CUDA expert, so make of it what you will...)
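You can see the data-dependent output size for yourself with the following minimal sketch (ours, not from the original post; it assumes a CUDA device is available):

```python
import torch

t = torch.randint(0, 2, (1000, 1000), device='cuda')

# the comparison and nonzero kernels are launched asynchronously, but the
# CPU must block here until the GPU reports how many non-zero elements exist
inds = torch.nonzero(t)

# the output shape depends on the data, e.g. roughly [500000, 2] here
print(inds.shape)
```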
Now that we understand the source of the bottleneck, the challenge becomes finding an alternative sequence of operations that performs the same logic but does not trigger a host-device synchronization event. In the case of our loss function, we can easily accomplish this using the torch.where operator as shown in the code block below:
```python
def ignore_background(self, target: Tensor) -> Tensor:
    with torch.profiler.record_function('update background'):
        target = torch.where(target == self.num_classes - 1,
                             -1 * torch.ones_like(target), target)
    return target
```
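As an aside, torch.where avoids the synchronization because its output has the same fixed shape as its input. Another fixed-shape alternative (our own suggestion, not from the original post) would be Tensor.masked_fill:

```python
def ignore_background(self, target: Tensor) -> Tensor:
    with torch.profiler.record_function('update background'):
        # masked_fill returns a tensor with the same (fixed) shape as its
        # input, so, like torch.where, it should not force a host-device sync
        target = target.masked_fill(target == self.num_classes - 1,
                                    self.ignore_val)
    return target
```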
In the image below we show the Trace View following this change.
Whereas we’ve got succeeded in eradicating the cudaMempyAsync coming from the torch.nonzero op, it has been instantly changed with one coming from the torch.distinctive op, and our step time has not budged. Right here the PyTorch documentation is much less sort, however based mostly on our earlier expertise we will assume that, as soon as once more, we’re affected by a host-device synchronization occasion attributable to our use of tensors with undetermined dimension.
Replacing the torch.unique operator with an equivalent alternative is not always possible. However, in our case we don't actually need to know the values of the unique labels; we need to know only the number of unique labels. This can be calculated by applying the torch.sort op to the flattened target tensor and counting the number of steps in the resultant step function.
```python
def forward(self, pred: Tensor, target: Tensor) -> Tensor:
    # ignore background labels
    target = self.ignore_background(target)

    # sort the list of labels
    with torch.profiler.record_function('sort'):
        sorted, _ = torch.sort(target.flatten())

    # identify the steps of the resultant step function
    with torch.profiler.record_function('deriv'):
        deriv = sorted[1:] - sorted[:-1]

    # count the number of steps
    with torch.profiler.record_function('count_nonzero'):
        num_unique = torch.count_nonzero(deriv) + 1

    # calculate the cross-entropy loss
    loss = self.cross_entropy(pred, target)

    # zero the loss in the case that the number of unique elements
    # is below the threshold
    with torch.profiler.record_function('where'):
        loss = torch.where(num_unique < 2, 0. * loss, loss)

    return loss
```
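As a quick sanity check (our own, not part of the original post), we can verify on a random target tensor that the sort-based count agrees with torch.unique:

```python
target = torch.randint(low=-1, high=10, size=(4, 256, 256))

# for a sorted tensor, the number of unique values equals the number of
# positions where consecutive elements differ, plus one
sorted_t, _ = torch.sort(target.flatten())
num_unique = torch.count_nonzero(sorted_t[1:] - sorted_t[:-1]) + 1

assert num_unique.item() == torch.unique(target).numel()
```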
In the image below we capture the Trace View following our second optimization:
Once again, we have solved one bottleneck only to be faced with a new one, this time coming from the boolean mask routine.
Boolean masking is a routine we commonly use in order to reduce the overall number of required device operations. In our case, our intention was to reduce the amount of computation by removing the "ignore" pixels and limiting the cross-entropy calculation to the pixels of interest. Clearly, this has backfired. As before, applying a boolean mask results in a tensor of undetermined size, and the cudaMemcpyAsync that it triggers greatly overshadows any of the savings from excluding the "ignore" pixels.
In our case, fixing this issue is rather simple since the PyTorch CrossEntropyLoss has a built-in option for setting an ignore_index.
```python
class MaskedLoss(nn.Module):
    def __init__(self, ignore_val=-1, num_classes=10):
        super().__init__()
        self.ignore_val = ignore_val
        self.num_classes = num_classes
        self.loss = torch.nn.CrossEntropyLoss(ignore_index=-1)

    def cross_entropy(self, pred: Tensor, target: Tensor) -> Tensor:
        with torch.profiler.record_function('calc loss'):
            loss = self.loss(pred, target)
        return loss
```
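To illustrate what ignore_index does (a small example of our own, not from the original post), consider a batch in which all but one pixel carries the ignore value:

```python
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-1)

pred = torch.randn(1, 10, 4, 4)                       # per-pixel logits for 10 classes
target = torch.full((1, 4, 4), -1, dtype=torch.long)  # all pixels marked "ignore"
target[0, 0, 0] = 3                                   # a single valid pixel

# the loss is averaged over the single valid pixel only, with no
# data-dependent tensor shapes and no host-device synchronization
loss = loss_fn(pred, target)
```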
In the image below we show the resultant Trace View:
Holy cow!! Our step time has dropped all the way down to 5.4 milliseconds. That's 240 (!!) times faster than what we started with. By simply changing around a few function calls and without any modification to the loss function logic, we were able to optimize the performance of the training step dramatically.
Important note: In the toy example we have chosen, the steps that we took to reduce the number of cudaMemcpyAsync events had a clear impact on the training step time. However, there may be situations where the same types of changes will harm performance rather than improve it. For example, in the case of boolean masking, if our mask is extremely sparse and the original tensors extremely large, the savings in computation from applying the mask might outweigh the price of the host-device synchronization. Importantly, the impact of each optimization should be evaluated on a case-by-case basis.
On this put up we’ve got targeted on efficiency points in coaching functions which can be brought on by host-device synchronization occasions. We noticed a number of examples of PyTorch operators that set off such occasions — the frequent property of all of them being that the dimensions of the tensors that they output are depending on the enter. You may also encounter synchronization occasions from different operators, not coated on this put up. We demonstrated how efficiency analyzers corresponding to PyTorch Profiler and its related TensorBoard plugin can be utilized to determine these sorts of occasions.
In the case of our toy example, we were able to find equivalent alternatives to the problematic operators that use fixed-size tensors and avoid the need for synchronization events. These led to a significant improvement in training time. However, in practice you might find it much harder, sometimes even impossible, to solve these kinds of bottlenecks. Sometimes, overcoming them might require redesigning parts of your model.