EfficientDet Implementation for Object Detection

nekot0

Posted on April 3, 2023

I have been interested in machine learning for years but had left it untouched. I finally decided to train myself so that I could gain insight into working with data and build the ability to code models on my own. I found an interesting image competition and started with it. The competition had already finished, but the data is still available, and I can still submit predictions and get a score.

The tasks are straightforward: object detection and classification. EfficientDet is a widely used model that handles both of these tasks, so I decided to build a model with it. However, the implementation was extremely hard. Some error messages required me to edit the imported packages themselves, which I couldn't manage in a Kaggle notebook. Therefore, I restarted the EfficientDet implementation with a simple dataset.

A useful example I found is a blog post written in Japanese in October 2021. The task there is to detect a red circle on a black square background. The source code from the blog worked for the most part, but I ran into some errors when I tested it in April 2023. Below is a note on the errors and their remedies, and on the accuracy of the result.

Errors and remedies

  • view size is not compatible with input tensor's size and stride

We train the model by feeding it images and bounding boxes, as below:

# Training loop
for epoch in range(1, args.epoch+1):
  ...
  for (inputs, targets) in t:
    ...
    losses = bench(inputs, targets)
    ...
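For context, here is a minimal sketch of how the bench around this loop can be set up. The blog builds its own model; the create_model factory call, the tf_efficientdet_d0 model name, and the AdamW optimizer below are my assumptions rather than the original code (args and loader come from the surrounding setup):

import torch
from tqdm import tqdm
from effdet import create_model

# bench_task='train' wraps the EfficientDet model in DetBenchTrain,
# which computes the detection losses from (inputs, targets)
bench = create_model('tf_efficientdet_d0', bench_task='train',
                     num_classes=1, pretrained=True)
optimizer = torch.optim.AdamW(bench.parameters(), lr=1e-4)

for epoch in range(1, args.epoch+1):
  t = tqdm(loader, leave=False)
  for inputs, targets in t:
    optimizer.zero_grad()
    # targets is a dict holding the per-image 'bbox' and 'cls' tensors
    losses = bench(inputs, targets)
    losses['loss'].backward()
    optimizer.step()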

The error below occurred:

# Error message
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-7-9f3114f08672> in <cell line: 19>()
     33     targets['cls'] = targets['cls']
     34     optimizer.zero_grad()
---> 35     losses = bench(inputs, targets)
     36     loss = losses['loss']
     37     loss.backward()

/usr/local/lib/python3.9/dist-packages/effdet/anchors.py in batch_label_anchors(self, gt_boxes, gt_classes, filter_valid)
    396                     cls_targets[count:count + steps].view([feat_size[0], feat_size[1], -1]))
    397                 box_targets_out[level_idx].append(
--> 398                     box_targets[count:count + steps].view([feat_size[0], feat_size[1], -1]))
    399                 count += steps
    400                 if last_sample:

RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.

The 'view' function is called to reshape the bounding boxes when the model receives the images and bounding boxes. The error message suggests using 'reshape' instead of 'view'. I edited 'anchors.py' in the effdet package as below:

# Line 398, before correction:
#   box_targets[count:count + steps].view([feat_size[0], feat_size[1], -1]))
# After correction:
box_targets[count:count + steps].reshape([feat_size[0], feat_size[1], -1]))
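Why the fix works: 'view' requires the tensor's memory layout to be contiguous, while 'reshape' falls back to copying the data when it is not. A quick self-contained illustration (my example, not from the blog):

import torch

t = torch.arange(12).reshape(3, 4).t()  # transposing makes t non-contiguous
print(t.is_contiguous())                # False
# t.view(-1)  # raises: view size is not compatible with input tensor's size and stride
print(t.reshape(-1))                    # works: reshape copies when necessary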
  • Labels vanish when the bounding boxes are cropped out of the image

The dataset applies augmentation before returning the data. The first of these steps randomly crops the input image, which sometimes deletes the bounding box and label information when the bounding boxes are cropped out of the original image. The sample code handles this case, but it only defines a new bounding box and not new labels, which causes the error:

class CircleDataset(Dataset):
  ...

  def __getitem__(self, idx):
    ...

    # If augmentation cropped out every box, pad with one dummy box.
    # Note: the matching labels tensor is never re-created here (the bug)
    if bboxes.shape[0] == 0:
      bboxes = torch.zeros([1, 4], dtype=bboxes.dtype)

    ...
    return x, y

  ...
# Error message
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-10-9f3114f08672> in <cell line: 19>()
     28   t = tqdm(loader, leave=False)
     29 
---> 30   for inputs, targets in t:
     31     inputs = inputs
     32     targets['bbox'] = targets['bbox']
/usr/local/lib/python3.9/dist-packages/torch/utils/data/_utils/collate.py in collate_tensor_fn(batch, collate_fn_map)
    161         storage = elem.storage()._new_shared(numel, device=elem.device)
    162         out = elem.new(storage).resize_(len(batch), *list(elem.size()))
--> 163     return torch.stack(batch, 0, out=out)
    164 
    165 
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [1] at entry 1

To avoid this, a new label needs to be defined along with the new bounding box when they vanish:

# After correction
if bboxes.shape[0] == 0:
    bboxes = torch.zeros([1, 4], dtype=bboxes.dtype)
    labels = torch.FloatTensor(np.array([0])) # Added
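The underlying cause is the DataLoader's default collate function, which stacks each target field across the batch, and torch.stack requires equal shapes. A minimal illustration (my example, not from the blog):

import torch

empty_labels = torch.zeros(0)  # sample whose boxes were all cropped out
one_label = torch.zeros(1)     # sample that kept one (dummy) box
# torch.stack([empty_labels, one_label])
# RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [1] at entry 1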

Accuracy

I obtained a prediction by taking one image at random from the training set and feeding it into the trained model.

Prediction uses DetBenchPredict from the effdet package. The original image size is (3, 512, 512), while DetBenchPredict takes a batch as its input, so I added a batch dimension using the 'unsqueeze' function.

DetBenchPredict outputs an (N, 6) tensor per image. N is the number of predicted bounding boxes, and the six elements are:

  • x-coordinate of the bounding box's top-left corner
  • y-coordinate of the bounding box's top-left corner
  • x-coordinate of the bounding box's bottom-right corner
  • y-coordinate of the bounding box's bottom-right corner
  • confidence score of the detection
  • predicted class

The code is below. Bounding boxes are drawn if the confidence score is over 50%.

import torch
import matplotlib.pyplot as pp
import matplotlib.patches as patches
from effdet import DetBenchPredict

image, targets = dataset[0]
image = image.unsqueeze(0)  # (3, 512, 512) -> (1, 3, 512, 512)

bench = DetBenchPredict(model)
with torch.no_grad():
  output = bench(image)

# Draw the predictions with over 50% confidence
fig, ax = pp.subplots()
ax.imshow(image[0].permute(1, 2, 0))  # CHW -> HWC so imshow can render it

for i in range(output.shape[1]):
  if output[0, i, 4] > 0.5:
    x1 = int(output[0, i, 0])
    y1 = int(output[0, i, 1])
    width = int(output[0, i, 2] - output[0, i, 0])
    height = int(output[0, i, 3] - output[0, i, 1])
    rect = patches.Rectangle((x1, y1), width, height, edgecolor='r', facecolor='none')
    ax.add_patch(rect)
    print(output[0, i, :])

pp.show()

The predictions after 1 epoch look like this:

output[0, i, :]
tensor([ 14.0453, 114.7553,  26.5884, 158.7972,   0.6781,   1.0000])
tensor([144.7045, 129.4016, 182.4770, 259.8239,   0.6156,   1.0000])
tensor([ -0.6067, 162.9664,  68.7289, 175.3027,   0.5549,   1.0000])
tensor([ -4.6260,   7.1583, 156.3810, 120.1586,   0.5246,   1.0000])
tensor([ 29.6035,  88.9964,  99.8469, 168.4458,   0.5069,   1.0000])
tensor([182.1268, 257.2897, 182.7585, 465.5251,   0.5004,   1.0000])

(Figure: predicted bounding boxes after 1 epoch)

The predictions after 10 epochs look like this:

(Figure: predicted bounding boxes after 10 epochs)
