Parallel versus sequential processing

The ANPR/ALPR detector uses a Convolutional Neural Network (ConvNet/CNN). The input layer has shape [300, 300, 3], which means height=300, width=300 and channels=3 (R, G, B). Every image is resized to a fixed 300x300 resolution and converted to RGB_888 (a.k.a. RGB24) to match the input layer. The output layer is more complex; the element that interests us in this section is the prediction boxes.
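
The preprocessing step can be sketched as follows. This is a minimal illustration assuming nearest-neighbor resizing; the actual SDK may use a different interpolation filter:

```python
# Sketch of the input-layer preprocessing: resize any RGB_888 frame to the
# fixed [300, 300, 3] input tensor. Nearest-neighbor resizing is an
# assumption for illustration, not necessarily what the SDK uses.
import numpy as np

INPUT_H, INPUT_W, INPUT_C = 300, 300, 3  # the [300, 300, 3] input layer

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Resize an HxWx3 RGB frame to the fixed 300x300x3 input tensor."""
    h, w = frame.shape[:2]
    rows = np.arange(INPUT_H) * h // INPUT_H  # source row for each output row
    cols = np.arange(INPUT_W) * w // INPUT_W  # source col for each output col
    return frame[rows][:, cols, :INPUT_C]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # e.g. a 720p RGB_888 frame
tensor = preprocess(frame)
assert tensor.shape == (300, 300, 3)
```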

The CNN always outputs 100 prediction boxes regardless of the input. This is where the post-processing operation comes in: its role is to filter and fuse these boxes. The filtering is based on the scores/confidences and the fusion is based on the anchors. At the end of the post-processing operation you’ll have the real detections, which could range from zero up to 10 (an arbitrary maximum chosen at training time).

As you may expect, the post-processing operation is very CPU intensive and makes the detection slow, which is bad news. For example, one operation executed in the post-processing stage is NMS (non-maximum suppression).
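
To make the cost concrete, here is a minimal greedy NMS sketch. The [x1, y1, x2, y2] box format and the 0.5 IoU threshold are assumptions for illustration, not the SDK's actual parameters:

```python
# Greedy non-maximum suppression: keep the highest-scoring box, drop any
# box that overlaps it too much, then repeat with the next survivor.
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two overlapping candidates and one distinct one -> two detections survive.
boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]]
scores = [0.9, 0.8, 0.7]
assert nms(boxes, scores) == [0, 2]
```

Note that the greedy loop is quadratic in the number of boxes, which is part of why post-processing dominates the CPU time.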

The good news is that the post-processing operation can be split into two passes: the first pass is very fast and yields the real predictions with 98% accuracy, while the second pass is very slow but can be scheduled to run in parallel with the next detection.

Here is the idea behind the parallel processing:

  1. The decoder accepts a video frame N of any size, converts it to 300x300 RGB_888 and passes it to the CNN as input.

  2. The CNN predicts 100 bounding boxes representing possible license plates.

  3. The first-pass post-processing operation runs to get candidate boxes with 98% accuracy.

  4. The second-pass post-processing operation is asynchronously scheduled on a parallel thread and its result is registered for recognition.

  5. The first-pass result is returned to the user. At this step the recognition isn’t done yet, but the user can use the first-pass result to determine if the frame potentially contains license plates. For example, the user can crop the frame using the license plate bounding box coordinates and save it for later.

  6. The user provides video frame N+1 to the decoder.

  7. While the user is preparing frame N+1, the decoder runs the second pass in the background and passes the result to the recognizer.

  8. Frame N+1 will have the same fate as frame N (see steps 1 to 5).
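
The scheduling above can be sketched with a worker thread: the slow second pass for frame N runs in the background while the caller moves on to frame N+1. The function names and box formats here are illustrative, not the SDK's API:

```python
# Two-pass pipeline sketch: the fast first pass runs synchronously, the
# slow second pass is submitted to a worker thread so it overlaps with the
# preparation and detection of the next frame.
from concurrent.futures import ThreadPoolExecutor

def first_pass(frame):          # fast: ~98%-accurate candidate boxes
    return [f"candidate-box@{frame}"]

def second_pass(candidates):    # slow: final filtering and fusion
    return [c.replace("candidate", "final") for c in candidates]

executor = ThreadPoolExecutor(max_workers=1)
pending = None                  # second-pass future for the previous frame

results = []
for frame in ["N", "N+1"]:
    candidates = first_pass(frame)        # steps 1-3: fast, synchronous
    if pending is not None:
        results.append(pending.result())  # recognition input for previous frame
    pending = executor.submit(second_pass, candidates)  # step 4: async
    # step 5: `candidates` is returned to the user here, overlapping with
    # the background second pass of the same frame
results.append(pending.result())          # drain the last frame
executor.shutdown()
assert results == [["final-box@N"], ["final-box@N+1"]]
```

A single worker is enough here because only one second pass is ever in flight at a time; the overlap comes from the caller not blocking on it.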

As you may have noticed, the preparation and detection operations for frame N+1 overlap (parallel execution) with the second-pass post-processing and the recognition of frame N. This means you’ll have the recognition result for frame N while you’re in mid-process for frame N+1. When the pipeline is running at 47fps, you’ll have the recognition result for frame N within 5 to 10 milliseconds of providing frame N+1 for detection.

Please note that if the first pass outputs K boxes and the second pass outputs M boxes, then:

  • every one of the M second-pass boxes is also among the K first-pass boxes, which means the second pass will never add new boxes to the prediction, only remove some (M ≤ K)

On Android devices we have noticed that parallel processing can speed up the pipeline by up to 120% on some devices, while on the Raspberry Pi 4 the gain is marginal. On the RockPi 4B (ARM64) the code is 5 times faster when parallel processing is enabled.

Please note that enabling parallel processing increases memory usage because more threads are used. We recommend using OpenVINO instead of TensorFlow to decrease memory usage.