Real-time Object Detection with Phoenix and Python

This article is not just about Machine Learning and Object Detection, it's about Elixir interoperability and how we can take advantage of Python's fantastic set of ML libraries, bringing their features into the Elixir world.

We'll see how to bring YOLO, a state-of-the-art real-time object detection system, into a Phoenix web app.

We start with Python, building a small app which does the actual object detection. Then we focus on the Elixir-Python interoperability, building an Elixir wrapper around the Python app using Ports.

The second part of the article is all about using our YOLO Elixir module in Phoenix, at first detecting objects in single images and then doing real-time object detection using the computer's webcam.

In the poeticoding/yolo_example GitHub repo you can find all the code we see here, both the Phoenix examples and the object detection Python script.

Making Elixir and Python work together

We are not going to implement the YOLO algorithm ourselves, that's for sure! It would not make any sense, since there are great easy-to-use Python libraries that implement YOLOv3 for us.

cvlib it’s a high level library that runs object detection with just a few lines of code; it uses OpenCV and TensorFlow under the hood. We don’t even need to train a model our-self: cvlib uses a model pre-trained on the COCO dataset, capable of detecting 80 common objects.

But how can we take advantage of this Python library, letting Elixir talk with Python?

The simplest way would be to use System.cmd: we run our Python object detection script, passing the image path as an argument and waiting for the program to exit and return the result. Unfortunately this is too slow: before detecting the objects, our Python code needs to load the libraries and the YOLOv3 model into memory, which can take a few seconds (on my laptop it takes around 2 seconds).
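
To make this concrete, the naive approach would look more or less like the snippet below (just a sketch, and it assumes a hypothetical variant of the detection script that accepts the image path as a command-line argument):

# one OS process per detection: every call pays the full startup cost
# (loading the libraries and the YOLOv3 model) before the actual ~0.2s detection
{output, 0} = System.cmd("python", ["detect.py", "dog.jpg"])
IO.puts(output)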

Since we can’t wait to load the model for each detection, we are going to use Port to run our Python app as a long running process (operating system process, external to the Erlang VM) which holds the model in memory and communicates with Elixir via stdin/stdout.

Elixir Port with Python process

Python receives the data via stdin and sends back the result by writing it to stdout. Everything written to stdout is sent to the mailbox of the Elixir process that opened the Port. We'll see in detail how to use Ports to build our Elixir wrapper, but if you have never used Ports, Outside Elixir (written by Saša Jurić) is a great in-depth read!

There are also other ways to handle the Elixir-Python interoperability. We could, for example, use Pyrlang to run a Python node as part of our Elixir cluster. Or we could run the Python app with an HTTP server like Flask, letting Elixir and Python communicate via HTTP. Each one has its own pros and cons.

I preferred to go with Ports because it’s really easy to detect crashes and it’s a solution that works seamlessly on my computer, on a server or on an embedded device (like the Nvidia Jetson Nano or Raspberry Pi).

YOLO Object Detection in Python

Let’s start easy, with a really simple Python script that processes only one image. It starts by loading cvlib and the YOLOv3 model, then detects the objects present in the dog.jpg image.

First, we need to create a new Python virtual environment and install OpenCV, TensorFlow and cvlib. Anaconda makes it easy to create a new Python virtual environment. With the conda command we create a new Python 3.6 environment called yolo.

$ conda create -n yolo python=3.6

Once created, we need to activate the new environment and install OpenCV and TensorFlow with conda, and cvlib with pip.

$ conda activate yolo
$ conda install tensorflow opencv
$ pip install cvlib

# detect.py
import cv2
import cvlib as cv

img = cv2.imread("dog.jpg")
boxes, labels, _conf = cv.detect_common_objects(img, model="yolov3")

print(labels, boxes)

dog.jpg

This script is really simple: it imports cv2 (OpenCV) and cvlib, then it loads the dog.jpg image (shown above) into memory and passes it to the cv.detect_common_objects function, using the YOLOv3 model. At the end it prints the labels and bounding boxes of the detected objects.

$ python detect.py
Using TensorFlow backend.
Downloading yolov3.cfg from https://github.com/arunponnusamy/object-detection-opencv/raw/master/yolov3.cfg
Downloading yolov3.weights from https://pjreddie.com/media/files/yolov3.weights
Downloading yolov3_classes.txt from https://github.com/arunponnusamy/object-detection-opencv/raw/master/yolov3.txt

['dog', 'bicycle', 'truck'] [[122, 223, 320, 543], [117, 124, 569, 432], [472, 86, 692, 166]]

Fantastic! With just a few lines of code we are able to detect objects in an image! The script tells us there are a dog, a bicycle and a truck, and where they are located.

The first time you run the script, cvlib downloads three files for us (yolov3.cfg, yolov3.weights and yolov3_classes.txt) which are used to load the YOLOv3 model.

What about speed?

$ time python detect.py
...
real	0m2.252s
user	0m3.176s
sys	0m0.638s

On my MacBook Pro 2018 (with an i9) it takes more than 2 seconds… too much if we want to detect objects in real-time. But most of this time is spent loading the model; the detection itself takes around 0.2s.

# detect.py
...
import time

start = time.time()
boxes, labels, _conf = cv.detect_common_objects(img, model="yolov3")
print("first detection: ", time.time() - start)

start = time.time()
boxes, labels, _conf = cv.detect_common_objects(img, model="yolov3")
print("second detection: ", time.time() - start)
$ python detect.py
Using TensorFlow backend.
first detection:  0.63
second detection:  0.21

Loading the cv2 and cvlib libraries takes around 1.4s. The first time we call cv.detect_common_objects(img, model="yolov3") it takes 0.63s, since cvlib needs to load the model into memory, but the second time is much faster (0.21s). That's why we can't run this script with System.cmd for each detection, and why we need a long-running process which keeps the model in memory!

0.21s means that the best I can get from my laptop (MacBook Pro 15 2018 with a 6-core 2.9GHz i9) is around 4 detections per second. Can we do any better? Definitely, but with a GPU: using an Nvidia GTX 1080 we should reach 0.03s (30ms) per detection (check TensorFlow GPU).

We can also use Darknet, a neural network framework written in C and CUDA. When running only on the CPU, the OpenCV implementation is faster than Darknet; but Darknet really shines when compiled with CUDA and running on a GPU!

I did many benchmarks, both locally and on the cloud. The fastest my computer could process the dog.jpg image was ~0.2s. On the cloud I've tried to run YOLO on both CPU and GPU: on AWS, to reach 0.2s per image, I needed a C5.4xlarge instance (which costs $0.68/hour). But I've got the most interesting result with a P3 instance (an expensive one with an Nvidia Tesla GPU!) and Darknet, processing an image in just 0.03s!

I’ve bought an Nvidia Jetson Nano, a small computer with a 128-core Nvidia GPU that runs with only 10W. My idea is to install Elixir on it, compile Darknet with CUDA and write a NIF for Yolo object detection (more on this in the coming weeks, stay tuned!)

To make our object detection faster, at the expense of accuracy, we can use a smaller model called tiny YOLO (yolov3-tiny).

boxes, labels, _conf = cv.detect_common_objects(
  img, 
  model="yolov3-tiny"
)

$ python detect.py
...
detection after warmup:  0.032
['dog', 'car'] [[124, 218, 382, 518], [466, 82, 686, 172]]

The tiny version is 8 times faster on my laptop (only 32ms to process dog.jpg), but it's also less accurate: it doesn't detect the bicycle, and the truck is now detected as a car.

Elixir Ports

Let’s see first a simple example on how to use Port to communicate with a Python script.

Elixir Port and Python communication via stdio

In this example Elixir sends Python a string with a list of numbers. The Python script converts this string into a list of integers, sums the numbers and sends the result back to Elixir.

Each message, sent from both sides, is a string ending with a newline – this way it is really easy to distinguish different messages, because they are just separate lines. However, we'll see later that when sending images we can't rely on newlines as a separator, and we'll have to find a different approach.

# python_scripts/add.py
import sys

for line in sys.stdin:
    
    # expecting a line in the form "num,num\n"
    line = line.strip()
    # stop on an empty line
    if line == "": break
    
    # strings to ints, and sum
    values = line.split(",")
    nums = map(int, values)
    result = sum(nums)

    # send the result via stdout
    sys.stdout.write(str(result) + "\n")
    sys.stdout.flush()

This Python script reads a line from stdin, strips whitespace and newlines, splits the string and converts the elements into integers, sums them and writes the result as a string to stdout.

The Elixir process will then receive the result as a message in the mailbox.

Ok, let’s now use Port.open/2 to run the add.py python script and Port.command/3 to send data to it.

iex> port = Port.open({:spawn, "python add.py"}, [:binary])
#Port<0.5>
iex> Port.command(port, "2,5\n")
true
iex> flush() 
{#Port<0.5>, {:data, "7\n"}}

We see how, by sending the "2,5\n" string to the stdin of the Python application via port, we receive a message with the result in the process mailbox.

Let’s write a function that does everything for us: encodes the list of integers into a string, sends the message to the python script, waits for the result message and returns it as an integer.

add = fn port, nums ->
  
  # integers to a string
  msg = 
    nums
    |> Enum.map(&to_string/1)
    |> Enum.join(",")

  #  sending the msg and ending "\n" as iolist
  Port.command(port, [msg, "\n"])
  
  # receive the result and convert it to an integer
  receive do
    {^port, {:data, result}} -> 
      String.trim(result)
      |> String.to_integer()
  end
end

iex> add.(port, [1,2,3,4,5])
15

Ports and detect.py

The goal of this part is to write a detect.py Python script that receives images from Elixir and sends back the result of the detection.

The idea is similar to what we've seen in the previous example: Elixir sends an image to the Python script through a Port; on the other side, the Python script reads the image from stdin, runs object detection and sends the result to Elixir by writing to stdout.

Elixir sends an image to Python and receives the result

Using strings with ports is a simple and quick solution, but we now need to send images and we can't rely anymore on newlines as a separator between messages. An easy way to get the job done would be to encode the image to a base64 string, but this would add a 33% overhead to the message size, plus the encoding/decoding steps.

To understand where a message terminates, we prepend a 4-byte header to each message. In this header we put the message's size, encoded as an unsigned big-endian integer.

Message with its Size

Along with the image, we send an image id as well, which is useful to keep track of multiple images sent asynchronously to the Python process. In this way, when Elixir receives an object detection result, it knows which image the result refers to.

A variable-length image id would require us to send its size too. For simplicity we make it fixed-length, using a 16-byte UUID4 (we can use the uuid library to generate UUID4 ids).
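
With the uuid package (which we'll add to the Phoenix app later), getting the id as 16 raw bytes is a one-liner – just to show why we can treat its size as fixed:

iex> UUID.uuid4() |> UUID.string_to_binary!() |> byte_size()
16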

Full Message with Model, Image id and Image

On the Elixir side, by opening a port with the {:packet, 4} option, we don't have to think about adding and reading the message's size.

port = Port.open(
  {:spawn, "python3 detect.py"}, 
  [:binary, {:packet, 4}]
)

When sending data to Python (using Port.command), the message's size is automatically prepended; when reading data from the Python stdout, the Port automatically reads the first 4 bytes to get the message's size.
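
Just to see what {:packet, 4} is saving us from, this is roughly the framing we would otherwise have to build by hand with Elixir's binary syntax (shown only as an illustration, we don't need this code in the app):

# without {:packet, 4} we would add the length header ourselves
payload = [image_id, image]
header = <<IO.iodata_length(payload)::unsigned-big-integer-size(32)>>
Port.command(port, [header, payload])

# ...and parse the same 4-byte header on every message coming back:
# <<total_msg_size::unsigned-big-integer-size(32), rest::binary>> = data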

At the moment we are using stdin and stdout, which isn't great, since any of the libraries could write to stdout. We'd also like to use stdout ourselves, to print debugging messages on the terminal.

By adding :nouse_stdio to the Port.open options, we ask the Port to use file descriptor 3 (instead of stdin) and 4 (instead of stdout). So, the Port.open code becomes:

port = Port.open(
  {:spawn, "python3 detect.py"}, 
  [:binary, :nouse_stdio, {:packet, 4}]
)

On the Python side, we open the file descriptors 3 and 4 in binary mode with os.fdopen. We define a setup_io() function, which returns a tuple with two opened file objects connected to the file descriptors.

# detect.py
import os

# setup of FD 3 for input (instead of stdin)
# FD 4 for output (instead of stdout)
def setup_io():
    return os.fdopen(3,"rb"), os.fdopen(4,"wb")

To read the first 4 bytes (an unsigned big-endian int) and get an integer, we use unpack with the "!I" format (! is for big-endian byte order and I for a 4-byte unsigned integer).

# detect.py

import numpy as np
import cv2, sys

from struct import unpack, pack

UUID4_SIZE = 16

def read_message(input_f):
    # reading the first 4 bytes with the length of the data
    # the other 16 bytes are the UUID bytes
    # the rest is the image

    header = input_f.read(4)
    if len(header) != 4: 
        return None # EOF
    
    (total_msg_size,) = unpack("!I", header)

    # image id
    image_id = input_f.read(UUID4_SIZE)
    
    # read image data
    image_data = input_f.read(total_msg_size - UUID4_SIZE)

    # converting the binary to an OpenCV image
    nparr = np.frombuffer(image_data, np.uint8)
    image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

    return {'id': image_id, 'image': image}

The read_message() function reads the first 4 bytes from the input_f (the input file object connected to the file descriptor 3) and unpack("!I", header) decodes the header to the total_msg_size integer. It then reads

  • the 16-byte UUID4. We don't decode it to a string – we keep it as it is
  • total_msg_size - 16 bytes of image data

At the end, the function converts the received data to a ready-to-use OpenCV image and returns a dictionary with image and id. When the header is shorter than 4 bytes, it returns None instead. This happens when the Port is closed: closing a port doesn't kill the Python process, it just closes the input (stdin or 3) and output (stdout or 4) file descriptors.
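
On the Elixir side, shutting the script down cleanly is then just a matter of closing the port (a quick sketch, assuming port is a Port opened towards detect.py, as we'll do shortly):

iex> Port.close(port)
true

The Python process isn't killed, but its input file descriptor is closed: the script reads an incomplete header, read_message() returns None and the read loop can end.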

Then we write the detect(image, model) function, which detects the objects using cvlib and returns a tuple.

# detect.py
import cvlib as cv

def detect(image, model):
    boxes, labels, _conf = cv.detect_common_objects(image, model=model)
    return boxes, labels

The first argument is an OpenCV image and the second is the model name (like "yolov3" or "yolov3-tiny").

We now just need a write_result function, taking the output file object, the image_id, the image shape, boxes and labels as arguments.

Python message

# detect.py
import json
from struct import unpack, pack

def write_result(output, image_id, shape, boxes, labels):
    result = json.dumps({
        'shape': shape,
        'boxes': boxes, 
        'labels': labels
    }).encode("ascii")

    total_msg_size = len(result) + UUID4_SIZE

    header = pack("!I", total_msg_size)
    output.write(header)
    output.write(image_id)
    output.write(result)
    output.flush()

We encode the result into a JSON string and write to output_f the message's total size, total_msg_size, which is the result string length + 16 (the UUID4 size).

pack("!I", total_msg_size) converts the integer into a 4-bytes header. We then write the image_id and the result.

We now have everything we need to write the detect.py script mainloop!

# detect.py

# def read_msg ...
# def detect ...
# def write_result ...

def run(model):
    input_f, output_f = setup_io()
    
    while True:
        msg = read_message(input_f)
        if msg is None: break
        
        #image shape
        height, width, _ = msg["image"].shape
        shape = {'width': width, 'height': height}

        #detect object
        boxes, labels = detect(msg["image"], model)

        #send result back to elixir
        write_result(output_f, msg["id"], shape, boxes, labels)

if __name__ == "__main__":
    model = "yolov3"
    if len(sys.argv) > 1: 
        model = sys.argv[1]
        
    run(model)

At the end of the script we deal with the arguments and call run(model). By default detect.py runs the full yolov3 model; by passing a different model name as the script's argument we can load yolov3-tiny instead.

$ python3 detect.py yolov3-tiny
Using TensorFlow backend.

You can find the full script in the GitHub repo.

Let’s now open a port running the detect.py script and detect objects on dog.jpg image

iex> port = Port.open({:spawn, "python3 detect.py"}, [:binary, :nouse_stdio, {:packet, 4}])
#Port<0.5>
iex> id = :crypto.strong_rand_bytes(16)
<<225, 211, 65, 208, ...>>
iex> image = File.read!("dog.jpg")
iex> Port.command(port, [id, image])
true
iex> flush
{#Port<0.5>,
 {:data,
  <<225, 211, 65, 208, 60, ...>>}}

To get a random 16-byte image id we've simply used :crypto.strong_rand_bytes(16) (we'll later use uuid to get a UUID4). Then, we send the id and the image binary as an iolist. Using flush, we see that we've received a message from port.

Let’s use pattern matching to extract the image_id and the result’s json string.

iex> Port.command(port, [id, image])
true
iex> receive do
...>{^port, {:data, <<image_id::binary-size(16), json_string::binary()>>}} -> 
...>  {image_id, json_string}
...> end
{<<225, 211, 65, 208, ...>>, 
"{\"labels\": [\"dog\", \"bicycle\", \"truck\"], \"shape\": {\"width\": 768, \"height\": 576}, \"boxes\": [[123, 222, 319, 544], [118, 124, 568, 432], [473, 86, 691, 166]]}"}

Yolo Phoenix app

Let’s now create a yolo Phoenix project in which we’ll write the rest of the code. You find the full code with the examples on the poeticoding/yolo_examples GitHub repo.

Since we don’t need a database, we pass the --no-ecto option

$ mix phx.new yolo --no-ecto

We then add the uuid library to the dependencies in mix.exs

# mix.exs
def deps do
  [
    ...
    {:uuid, "~> 1.1"},
  ]
end

and run mix deps.get.

Yolo.Worker GenServer

poeticoding/yolo_example/lib/yolo/worker.ex

Building a Yolo.Worker module that implements the GenServer behaviour and wraps our Port brings many advantages: it becomes easy to supervise the process, we can hide the complexity behind a simple interface and we can easily spawn a pool of Yolo workers.

Yolo.Worker should handle multiple asynchronous requests from different processes, while taking care of the communication with detect.py via Port.

Yolo.Worker handles multiple requests

In the diagram above, both the #PID<0.110.0> and #PID<0.112.0> processes send an image to Yolo.Worker. When the Yolo.Worker process receives the image with id <<id_1>> from #PID<0.110.0>, it forwards this request to the Python process. While waiting for a result, Yolo.Worker can accept new requests and keeps a record of all the pending requests (image ids and requesting process pids). In this way Yolo.Worker knows to which process it has to forward each result – in the example above, once it receives the result with image id <<id_1>>, Yolo.Worker forwards it to #PID<0.110.0>.

start_link, init and config

Let’s start by writing the module’s start_link/1 and init/1 functions.

# lib/yolo/worker.ex

defmodule Yolo.Worker do
  use GenServer

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, :ok, opts)
  end

  def init(:ok) do
    config = config()
    
    port = Port.open(
      {:spawn_executable, config.python}, 
      [:binary, :nouse_stdio, {:packet, 4}, 
      args: [config.detect_script, config.model]
    ])

    {:ok, %{port: port, requests: %{}}}
  end

  ...
end

We start the Port with :spawn_executable instead of :spawn. With :spawn we passed the full shell command, while with :spawn_executable we pass the full path to the Python executable – all the arguments (detect_script and model) are passed with the :args option.

But before starting the port, init(:ok) loads the worker configuration to get the model name and the full paths of the Python executable and the detect.py script. We set the configuration in config/dev.exs

# config/dev.exs
...
config :yolo, Yolo.Worker,
  python: "/opt/anaconda3/envs/yolo/bin/python",
  detect_script: "/Users/alvise/yolo/python_scripts/detect.py",
  model: {:system, "YOLO_MODEL"}

and config/0 loads the configuration, getting the :model value from the YOLO_MODEL environment variable.

# lib/yolo/worker.ex

@default_config [
  python: "python", 
  detect_script: "python_scripts/detect.py",
  model: "yolov3"
]

def config do
  @default_config
  |> Keyword.merge(Application.get_env(:yolo, __MODULE__, []))
  
  #loads the values from env variables when {:system, env_var_name}
  |> Enum.map(fn 
    
    # it finds the full path when not provided
    {:python, path} -> {:python, System.find_executable(path)}

    # it loads the value from the environment variable
    # when the env variable is not set, it defaults to @default_config[option]
    {option, {:system, env_variable}} -> 
      {option, System.get_env(env_variable, @default_config[option])}
    
    # all the other options
    config -> config
  
  end)
  |> Enum.into(%{})
end

In my case, the yolo env’s python executable is at /opt/anaconda3/envs/yolo/bin/python; when using this full path we don’t need to load the yolo anaconda’s environment, like we did before with conda activate yolo. I’ve also placed detect.py into the python_scripts directory of the yolo Phoenix app.

When the YOLO_MODEL environment variable isn’t set it defaults to "yolov3".

init(:ok) then returns a state with the opened port and an empty requests Map which we’ll use to keep track of the pending requests.

Request a detection

The function we call to request an object detection is request_detection/3

# lib/yolo/worker.ex

def request_detection(pid, image) do
  image_id = UUID.uuid4() |> UUID.string_to_binary!()
  request_detection(pid, image_id, image)
end

@uuid4_size 16
def request_detection(pid, image_id, image) 
  when byte_size(image_id) == @uuid4_size do
  GenServer.call(pid, {:detect, image_id, image})
end

request_detection/3 needs the pid of the Yolo.Worker GenServer, a 16-byte image_id and the image data. It also checks the size of the image id.
The function makes a GenServer.call: it sends {:detect, image_id, image} to the GenServer and waits for the reply. The detection itself is asynchronous: Yolo.Worker doesn't wait for the detection result from Python – it immediately replies with the image_id. I preferred to use a call, instead of a cast, to have confirmation that the Yolo.Worker GenServer received the request.

In case we do not provide an image_id ourselves, request_detection/2 generates a UUID4 image_id for us.

# lib/yolo/worker.ex

def handle_call({:detect, image_id, image_data}, {from_pid, _}, worker) do
  Port.command(worker.port, [image_id, image_data])
  worker = put_in(worker, [:requests, image_id], from_pid)
  {:reply, image_id, worker}
end

The handle_call/3 callback is pretty simple. Once Yolo.Worker receives a :detect request, it sends image_id and image_data to the port, which is held in the worker map (the process state). To keep track of the pending detection, the image_id is set as a key of the worker.requests map, with from_pid as its value.

Handling the result

When detect.py has processed the image and sent the result, the port sends a message to the Yolo.Worker process. This message is handled by the handle_info/2 callback.

# lib/yolo/worker.ex

def handle_info({port, 
  {:data, <<image_id::binary-size(@uuid4_size),json_string::binary()>>}}, 
  %{port: port}=worker) 
do
  result = get_result!(json_string)

  # getting from pid and removing the request from the map
  {from_pid, worker} = pop_in(worker, [:requests, image_id])

  # sending the result map to from_pid
  send(from_pid, {:detected, image_id, result})
  {:noreply, worker}
end

defp get_result!(json_string) do
  result = Jason.decode!(json_string)
  %{
    shape: %{width: result["shape"]["width"], height: result["shape"]["height"]},
    objects: get_objects(result["labels"], result["boxes"])
  }
end

As we saw, the type of message that port sends is {#Port<...>, {:data, <<...>>}}. We pattern match the message, making sure that the port is the one in the process’ state; we also extract the image_id and json_string with

<<image_id::binary-size(@uuid4_size),json_string::binary()>>

where @uuid4_size is 16.

Instead of just decoding the JSON string to a Map, with get_result!(json_string) we build a Map with :shape and a detected :objects list. The :objects list is generated by get_objects/2, which we'll see in a moment.

handle_info/2 then pops from_pid from the worker.requests map and sends the result to from_pid. At the end it returns an updated worker state.

# lib/yolo/worker.ex

def handle_info(...) do
  result = get_result!(json_string)
  
  # get from_pid and removing the request from the map
  {from_pid, worker} = pop_in(worker, [:requests, image_id])

  # send the result map to from_pid
  send(from_pid, {:detected, image_id, result})

  {:noreply, worker}
end

get_objects(labels, boxes)

In the JSON string we don’t have a list of objects, we just have a two separate lists, labels and boxes. The first box in boxes refers to the first label in labels and so on…

labels = ["dog", "bicycle", "truck"] 
boxes = [[122, 224, 320, 542], [118, 124, 568, 432], [473, 86, 691, 166]]

Each box element is a bounding-box: top-left and bottom-right x,y coordinates.

Bounding box original coordinates

get_objects(labels, boxes) transforms the two lists into an object list where each object is a map with a :label and :x,:y top-left coordinates, :w (for width) and :h (for height) of the bounding box.

# lib/yolo/worker.ex

def get_objects(labels, boxes) do
  Enum.zip(labels, boxes)
  |> Enum.map(fn {label, [x, y, bottom_right_x, bottom_right_y]}->
    w = bottom_right_x - x
    h = bottom_right_y - y
    %{label: label, x: x, y: y, w: w, h: h}
  end)
end

iex> Yolo.Worker.get_objects(
...> ["dog", "bicycle", "truck"],
...> [[122, 224, 320, 542], [118, 124, 568, 432], [473, 86, 691, 166]])
[
  %{h: 318, label: "dog", w: 198, x: 122, y: 224},
  %{h: 308, label: "bicycle", w: 450, x: 118, y: 124},
  %{h: 80, label: "truck", w: 218, x: 473, y: 86}
]

Try it on iex

Let’s try Yolo.Worker on iex!

iex> {:ok, worker_pid} = Yolo.Worker.start_link([])
{:ok, #PID<0.304.0>}
iex> image = File.read!("dog.jpg")
<<255, 216, 255, 225, ...>>
iex> image_id = Yolo.Worker.request_detection(worker_pid, image)
<<3, 76, 254, 221, ...>>

iex> flush
{:detected,
 <<3, 76, 254, 221, ...>>,
 %{
   objects: [
     %{h: 318, label: "dog", w: 198, x: 122, y: 224},
     %{h: 308, label: "bicycle", w: 450, x: 118, y: 124},
     %{h: 80, label: "truck", w: 218, x: 473, y: 86}
   ],
   shape: %{height: 576, width: 768}
 }}

Great, it works! 🎉

await/2

It’s useful to have an await(image_id, timeout) function that awaits a :detected message and returns the result – let’s add it to the Yolo.Worker module.

# lib/yolo/worker.ex

def await(image_id, timeout \\ 5_000) do
  receive do
    {:detected, ^image_id, result} -> result
  after
    timeout -> {:timeout, image_id}
  end
end

iex> worker_pid \
...> |> Yolo.Worker.request_detection(image) \
...> |> Yolo.Worker.await()
 %{
   objects: [...],
   shape: %{...}
 }

Supervised Yolo.Worker

We can easily make Yolo.Worker supervised by adding it as a child of the application Supervisor, in Yolo.Application.start/2. By passing the [name: Yolo.Worker] option, the process is registered locally with the given name. In this way we can just use the Yolo.Worker name instead of the pid – this is pretty useful since the pid can change due to crashes or Supervisor restarts.

defmodule Yolo.Application do
  use Application

  def start(_type, _args) do
    children = [
      YoloWeb.Endpoint,
      
      # one worker named Yolo.Worker
      {Yolo.Worker, [name: Yolo.Worker]},
    ]

    opts = [strategy: :one_for_one, name: Yolo.Supervisor]
    Supervisor.start_link(children, opts)
  end

  ...
end

$ iex -S mix
Using TensorFlow backend
iex> Yolo.Worker.request_detection(Yolo.Worker, File.read!("dog.jpg")) \
...> |> Yolo.Worker.await()
%{ objects: [...], ...}

When starting the application (an iex session in this case), Yolo.Worker is started automatically by the Supervisor (cvlib, in the Python script, prints the Using TensorFlow backend message on stderr).

Without closing iex, let’s take another terminal and see what happens when we kill the python process and send another detection request.

# on another terminal
$ ps aux | grep -i detect.py
alvise  15206 ... /opt/anaconda3/envs/yolo/bin/python python_scripts/detect.py yolov3
$ kill -9 15206

iex> Yolo.Worker.request_detection(Yolo.Worker, File.read!("dog.jpg")) |> Yolo.Worker.await()
[error] GenServer Yolo.Worker terminating
** (ArgumentError) argument error
:erlang.port_command(#Port<0.6>, ...)
...
Using TensorFlow backend.

iex> Yolo.Worker.request_detection(Yolo.Worker, File.read!("dog.jpg")) |> Yolo.Worker.await()
%{ objects: [...], ...}

When we kill the Python process (or, more realistically, when it crashes on its own) the port closes automatically. Then, when Yolo.Worker tries to send data to the closed port by calling Port.command(port, [image_id, image]), the Yolo.Worker process crashes. The Supervisor catches the crash and starts another Yolo.Worker process, ready to serve new requests.
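
If we wanted Yolo.Worker to notice the crash immediately, instead of only when the next Port.command fails, one option (just a sketch, not part of the article's code) is to open the port with the :exit_status option and handle the message the port sends when the external program terminates:

# in init/1, add :exit_status to the Port.open options
port = Port.open(
  {:spawn_executable, config.python},
  [:binary, :nouse_stdio, :exit_status, {:packet, 4},
   args: [config.detect_script, config.model]]
)

# in Yolo.Worker, stop as soon as the Python process exits,
# letting the Supervisor start a fresh worker
def handle_info({port, {:exit_status, status}}, %{port: port} = worker) do
  {:stop, {:python_exited, status}, worker}
end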

Detect Objects in Uploaded Images

Now that Yolo.Worker does the heavy lifting, we can use it in our Phoenix app to detect objects in uploaded images. When a user uploads an image via a <form>, we run object detection on the uploaded image and show labels and bounding boxes using SVG.

Upload an image and render svg boxes and labels

Let’s start by creating a new YoloWeb.UploadController module in lib/yolo_web/controllers/upload_controller.ex with new and create actions, an empty YoloWeb.UploadView, and add the new routes in YoloWeb.Router (/lib/yolo_web/router.ex).

# lib/yolo_web/router.ex

defmodule YoloWeb.Router do
  ...
  scope "/", YoloWeb do
    ...
    resources "/uploads", UploadController, only: [:new, :create]
  end
end

# lib/yolo_web/views/upload_view.ex

defmodule YoloWeb.UploadView do
  use YoloWeb, :view
end

# lib/yolo_web/controllers/upload_controller.ex

defmodule YoloWeb.UploadController do
  use YoloWeb, :controller

  def new(conn, _params) do
    render(conn, "new.html")
  end

  ...
end

The template lib/yolo_web/templates/upload/new.html.eex is shown below:

<%= form_for @conn, Routes.upload_path(@conn, :create), 
    [multipart: true], fn f-> %>
    <%= file_input f, :upload, class: "form-control" %>
    <%= submit "Detect", class: "button"%>
<% end %>

The new action in UploadController renders a form, which simply uploads the selected image to the /uploads path via HTTP POST.

If you want to take a deeper look at uploads in Phoenix, I wrote a series of articles on how to handle uploads on Phoenix, upload with Javascript and make a progress bar.

When the image is uploaded, the create action in YoloWeb.UploadController is called, passing a Plug.Upload struct in params.

# lib/yolo_web/controllers/upload_controller.ex

def create(conn, %{"upload" => %Plug.Upload{}=upload}=_params) do
  data = File.read!(upload.path)
  detection = 
    Yolo.Worker.request_detection(Yolo.Worker, data) 
    |> Yolo.Worker.await()

  base64_image = base64_inline_image(data, upload.content_type)
  render(conn, "show.html", image: base64_image, detection: detection)
end

defp base64_inline_image(data, content_type) do
  image64 = Base.encode64(data)
  "data:#{content_type};base64, #{image64}"
end

create/2 reads the image data from the temporary path upload.path and simply runs the detection, awaiting the result.

Since I really didn’t want to use JavaScript for this example, I decided to render the final result (image, boxes and labels) using just SVG. It turned out to be much easier than playing with JavaScript Canvas.

To avoid storing the image locally (and having to serve it), we can embed it in the SVG. To do so, we need to convert data to its base64 representation with base64_inline_image/2.
create/2 then renders the show.html template, passing image and detection.

lib/yolo_web/templates/upload/show.html.eex

<svg width="<%= @detection.shape.width %>" 
     height="<%= @detection.shape.height %>">

<g fill="grey" transform="scale(0.5 0.5)">
  <image width="<%= @detection.shape.width %>" height="<%= @detection.shape.height %>" 
  xlink:href="<%= @image %>"></image>

  <%= for o <- @detection.objects do %>

    <rect x="<%= o.x - 2%>" y="<%= o.y - 20%>" height="20" width="100" fill="blue"/>
    <text x="<%= o.x %>" y="<%= o.y %>" dy="-5" font-family="sans-serif" font-size="16px" font-weight="bold" fill="white"><%= o.label %></text>

    <rect x="<%= o.x %>" y="<%= o.y %>" width="<%= o.w %>" height="<%= o.h %>" style="fill:rgb(0,0,0,0);stroke-width:3;stroke:rgb(0,0,255)" /> 

  <% end %>
</g>

</svg>

We render an svg, setting its width and height to the original image’s shape. Inside the svg tag, we render an <image> with the inline base64 @image set to the xlink:href attribute.

Then, we enumerate the detected objects, rendering a <text> tag for the label and <rect> for the bounding box. To position these svg elements we simply use the object coordinates.

Image with SVG rendered labels and boxes

Fantastic, we can finally see labels and bounding boxes on an image, noticing how accurate YOLO is!

Some considerations

Can we use it straight away on a production cloud server? As always, it depends! YOLOv3 is fast (with a good Nvidia GPU it takes only 30 milliseconds to detect the objects in an image), but it's an expensive computation that can easily exhaust a server's CPU/GPU!

So, it depends on the throughput we need (the number of images processed per unit of time), the hardware or budget we have, and the accuracy we want to get.

I’ve previously talked about speed; to process and image in ~0.2s we need an AWS C5.4xlarge instance, which isn’t cheap. Now, this could be more than enough in some situations but it could form a bottleneck in others: on an AWS C5.4xlarge instance, if we’d need to detect objects in real-time on 10-15 images per second, the requests would pile-up leading to timeouts.

We could delegate the object detection job to services like AWS Rekognition or Google Vision, which are fantastic, but they are not a silver bullet. Especially when we are just interested in running real-time object detection on an embedded device: we'd need an internet connection, each frame would suffer from the delay introduced by the network, and we would also risk seeing the cloud bill grow really fast!

Object Detection with a Webcam

Let’s make it more interesting, processing frames coming from the computer’s webcam feed! This can be useful on embedded devices with a camera, which need to take decisions based on detected objects.

For simplicity, in this example we are going to use a browser and HTML5 to get frames from the webcam and render the labels and bounding boxes on the webpage.

We use JavaScript and webcamjs on the front-end to get 720p camera frames and send them to the Phoenix server via Channels. The channel's process sends an asynchronous request to Yolo.Worker with the given frame and, once it receives the detection result, it pushes a detected event to the browser.

Webcam, Phoenix Channel and Yolo.Worker

Frontend – Phoenix Channel, Webcamjs and canvas.

Let’s start with the frontend. We create a new YoloWeb.WebcamController which simply renders lib/templates/webcam/index.html.eex, where we have div#camera, which is the element where we show the webcam stream, canvas#objects where we render labels and boxes, and button#detect_button to start and stop the detection.

<button id="start_stop">Start</button>
<div class="camera_container">
    <div id="camera"></div>
    <canvas id="objects" width="1280" height="720"></canvas>
</div>
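
The controller and route themselves are not shown here; a minimal version could look like this (a sketch following the usual Phoenix conventions, assuming we serve the page at /webcam):

# lib/yolo_web/router.ex - inside the "/" browser scope
get "/webcam", WebcamController, :index

# lib/yolo_web/controllers/webcam_controller.ex
defmodule YoloWeb.WebcamController do
  use YoloWeb, :controller

  def index(conn, _params) do
    render(conn, "index.html")
  end
end

# lib/yolo_web/views/webcam_view.ex
defmodule YoloWeb.WebcamView do
  use YoloWeb, :view
end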

After adding the webcamjs library in assets/package.json, we create a new JavaScript module in assets/js/webcam.js and import it in app.js. In webcam.js, we import Webcam, connect the socket and join the webcam:detection channel.

// assets/js/webcam.js

import Webcam from "webcamjs"

import { Socket } from "phoenix"
let socket = new Socket("/socket")

socket.connect()

// Now that you are connected, you can join channels with a topic:
let channel = socket.channel("webcam:detection", {})

channel.join()
    .receive("ok", resp => { console.log(`Joined successfully to "webcam:detection"`, resp) })
    .receive("error", resp => { console.log("Unable to join", resp) });

Then, we set the camera options and attach it to the #camera element.

// assets/js/webcam.js

Webcam.set({
    width: 1280,
    height: 720,
    image_format: 'jpeg',
    jpeg_quality: 90,
    fps: 30
});
Webcam.attach("#camera")

We define a capture function that takes a snapshot and sends a "frame" event, with the base64 encoded (data URI scheme) frame, to the WebcamChannel process.

// assets/js/webcam.js

function capture() {
    Webcam.snap(function (data_uri, canvas, context) {
        channel.push("frame", { "frame": data_uri})
    });
}

On the back-end, when an image has been processed, WebcamChannel sends a detected event with the detected objects to the frontend. When we receive a detected event on the front-end, draw_objects(result) is called and renders labels and bounding boxes on the canvas.

// assets/js/webcam.js

//listen to "detected" events and calls draw_objects() for each event

channel.on("detected", draw_objects);

//our canvas element
let canvas = document.getElementById('objects');
let ctx = canvas.getContext('2d');
const boxColor = "blue";
//labels font size
const fontSize = 18;

function draw_objects(result) {
    
    let objects = result.objects;

    //clear the canvas from the previous rendering
    ctx.clearRect(0, 0, canvas.width, canvas.height);
    ctx.lineWidth = 4;
    ctx.font = `${fontSize}px Helvetica`;

    //for each detected object render label and box
    objects.forEach(function(obj) {
        let width = ctx.measureText(obj.label).width;
        
        // box
        ctx.strokeStyle = boxColor;
        ctx.strokeRect(obj.x, obj.y, obj.w, obj.h);

        // white label + background
        ctx.fillStyle = boxColor;
        ctx.fillRect(obj.x - 2, obj.y - fontSize, width + 10, fontSize);
        ctx.fillStyle = "white";
        ctx.fillText(obj.label, obj.x, obj.y - 2);
    });
}

The Start/Stop button starts and stops an interval that calls capture every 1000/FPS milliseconds. We start with FPS=1 (my laptop should be able to process 4 FPS with the YOLO OpenCV implementation).

// assets/js/webcam.js

//toggle button starts and stops an interval
const FPS = 1; // frames per second
let intervalID = null;

document.getElementById("start_stop")
        .addEventListener("click", function(){

  if(intervalID == null) {
      intervalID = setInterval(capture, 1000/FPS);
      this.textContent = "Stop";
  } else {
    clearInterval(intervalID);
    intervalID = null;
    this.textContent = "Start";
  }
});

export default socket
//EOF

Working webcam stream

The browser requests access to the camera – once accepted, it starts showing the video in the #camera element. The browser fails to join the channel though… we still need to write the WebcamChannel on the backend.

Backend – WebcamChannel

We start with a really simple implementation of YoloWeb.WebcamChannel, making a detection request for every frame event.

We first update the YoloWeb.UserSocket module (in lib/yolo_web/channels/user_socket.ex), adding a channel route.

#lib/yolo_web/channels/user_socket.ex

defmodule YoloWeb.UserSocket do
  use Phoenix.Socket

  ## Channels
  channel "webcam:*", YoloWeb.WebcamChannel
  ...
end

Then, we define the YoloWeb.WebcamChannel module in lib/yolo_web/channels/webcam_channel.ex.

#lib/yolo_web/channels/webcam_channel.ex

defmodule YoloWeb.WebcamChannel do
  use Phoenix.Channel

  def join("webcam:detection", _params, socket) do
    {:ok, socket}
  end

  def handle_in("frame", %{"frame" => "data:image/jpeg;base64,"<> base64frame}=_event, socket) do
    frame = Base.decode64!(base64frame)
    Yolo.Worker.request_detection(Yolo.Worker, frame)
    {:noreply, socket}
  end

  def handle_info({:detected, _image_id, result}, socket) do
    push(socket, "detected", result)
    {:noreply, socket}
  end
end

When a frame event is sent from the browser, handle_in/3 pattern matches the data URI, extracting the base64-encoded frame. After decoding base64frame, we send a detection request to Yolo.Worker without awaiting the result, which would block the channel process. Instead, when Yolo.Worker finishes processing the image and sends a {:detected, image_id, result} message to the channel process, handle_info/2 pushes a "detected" event with the result to the browser.

1fps real-time object detection

But what happens when, for any reason, Yolo.Worker can't process all the incoming frames fast enough?

Worker unable to keep up with the requests

Even if it doesn’t crash, when unable to keep up with the requests it slows down the application, piling up requests and showing the results with visible delays.

It’s easy to simulate: my computer can’t run the full YOLOv3 model at 10fps on the CPU. Just increasing the FPS constant to 10 in webcam.js, we see how the tracking slows down immediately with delays of seconds.

10FPS, delays in detection – Yolo.Worker busy processing old frames.

Drop frames and dynamically adapt

To avoid exhausting Yolo.Worker, we can implement a simple mechanism in WebcamChannel that drops frames while Yolo.Worker is still busy processing a previous request.

Dropping frames

# lib/yolo_web/channels/webcam_channel.ex

defmodule YoloWeb.WebcamChannel do
  use Phoenix.Channel

  def join("webcam:detection", _params, socket) do
    socket =
      socket
      |> assign(:current_image_id, nil)
      |> assign(:latest_frame, nil)
    {:ok, socket}
  end

  def handle_in("frame", 
    %{"frame" => "data:image/jpeg;base64,"<> base64frame}=_event, 
    %{assigns: %{current_image_id: image_id}}=socket) 
do
    if image_id == nil do
      {:noreply, detect(socket, base64frame)}
    else
      {:noreply, assign(socket, :latest_frame, base64frame)}
    end
  end

  # only the result of the current_image_id
  def handle_info({:detected, image_id, result},  
      %{assigns: %{current_image_id: image_id}}=socket), 
  do: handle_detected(result, socket)

  # skipping results we are not waiting for
  def handle_info({:detected, _, _}, socket), 
  do: {:noreply, socket}

  def detect(socket, b64frame) do
    frame = Base.decode64!(b64frame)
    image_id = Yolo.Worker.request_detection(Yolo.Worker, frame)
    
    socket
    |> assign(:current_image_id, image_id)
    |> assign(:latest_frame, nil)
  end

  def handle_detected(result, socket) do
    push(socket, "detected", result)

    socket =
      socket
      |> assign(:current_image_id, nil)
      |> detect_if_need()

    {:noreply, socket}
  end

  def detect_if_need(socket) do
    if socket.assigns.latest_frame != nil do
      detect(socket, socket.assigns.latest_frame)
    else
      socket
    end
  end

end

When the browser joins the channel, we assign a nil value to current_image_id and latest_frame. In current_image_id we set the id returned by Yolo.Worker.request_detection/2 and in latest_frame we keep the latest received frame.

When the channel receives a new frame, if current_image_id is not nil it means that Yolo.Worker is still processing a frame for us. So we just keep the frame in latest_frame without making any detection request.

If current_image_id is nil it means that we can call detect/2 which makes a detection request, assigns a new current_image_id and returns an updated socket.

When the result of a detection is ready, the handle_info({:detected, image_id, result}, socket) function is called. We make sure that the result’s image_id is equal to current_image_id.

handle_detected/2 sends the result to the browser and sets current_image_id to nil. If latest_frame isn't nil, it means that the most recent frame hasn't been processed yet – since Yolo.Worker is now free, we request a new detection for this frame.

Let’s now try to set FPS to 20fps, and see what happens.

Skipping frames. Sending 20fps, adapting to 4fps object detection

We see that WebcamChannel is much more responsive than the previous implementation. It simply processes frames at Yolo.Worker's pace, skipping the rest. The server is local, so we obviously don't suffer from any network delays! (The real fps is ~4, 0.25s per image.)

It’s obviously just an initial implementation and we could add many other features. For example, if Yolo.Worker crashes while processing a frame, WebcamChannel will continue to drop frames waiting for a detection result – something we could solve with a detection timeout mechanism.

Tiny YOLO

Ah, wait… there is still the yolov3-tiny model to try with the webcam – my computer can run it at more than 10fps.

$ YOLO_MODEL="yolov3-tiny" mix phx.server

yolov3-tiny

What’s next?

The YOLO OpenCV implementation runs much faster on the CPU than the original YOLO Darknet. But Darknet really shines when compiled with CUDA and runs using a Nvidia GPU – I’m really tempted to buy an eGPU!

It’s simple and fun to use a browser and HTML5 to get webcam’s frames and to show the tracked objects. But as soon as we try to reach >= 30fps we see that this solution has a toll on the overall performance. 30fps means that we only have ~30ms to send a frame, process it and receive a result. On the browser, just making a snapshot and encoding it to base64 takes ~7ms, then, to decode the base64 image to a binary is another ~5ms… when we have a 30ms restriction all these milliseconds become precious. So, I’ll try to get camera frames using OpenCV directly.

In the next few weeks I want to try to use Darknet on the Jetson Nano I've just bought; it should easily reach 4fps with the full YOLOv3 (like my laptop with an i9 CPU!). Since Darknet is written in C, I'm thinking of writing a NIF. To render the frames and detected objects I could use Phoenix or Scenic! More on this in further articles!

As I mentioned at the beginning, I’ve also explored other ways to talk with Python, like Pyrlang for example, which deserves an article on its own.


A very special thanks to Evadne Wu, who gave me great advice and feedback!