Run multiple models with multiple Edge TPUs

The Edge TPU includes a small amount of RAM that's used to store the model's parameter data locally, enabling faster inference speed compared to fetching the data from external memory. Typically, this means performance is best when running just one model per Edge TPU, because running a second model requires swapping the model's parameter data in RAM, which slows down the entire pipeline. One solution is to simply run each model on a different Edge TPU, as described on this page.

Alternatively, you might reduce the overhead cost of swapping parameter data by co-compiling your models. Co-compiling allows the Edge TPU to store the parameter data for multiple models in RAM together, which means it typically works well only for small models. To learn more about this option, read about parameter data caching and co-compiling. Otherwise, keep reading here if you want to distribute multiple models across multiple Edge TPUs.

Performance considerations

Before you add more Edge TPUs in your system, consider the following possible performance issues:

  • Python threads cannot run CPU-bound operations in parallel (read about the Python global interpreter lock (GIL)). However, we have optimized the Edge TPU Python API (but not the TensorFlow Lite Python API) to work within Python's multi-threading environment for all Edge TPU operations: because they are I/O-bound, running them from multiple threads can improve performance (see the threading sketch after this list). But beware that CPU-bound operations such as image downscaling will likely slow down the pipeline when you run multiple models, because those operations cannot run in parallel across Python threads.

  • When using multiple USB Accelerators, your inference speed will eventually be bottlenecked by the host USB bus’s speed, especially when running large models.

  • If you connect multiple USB Accelerators through a USB hub, be sure that each USB port can provide at least 500mA when using the default operating frequency or 900mA when using the maximum frequency (refer to the USB Accelerator performance settings). Otherwise, the device might not be able to draw enough power to function properly.

  • If you use an external USB hub, connect the USB Accelerators to the primary ports only. Some USB hubs include sub-hubs with secondary ports that are not compatible—our API cannot establish an Edge TPU context on these ports. For example, if you type lsusb -t, you should see ports printed as shown below. The first two usbfs ports shown will work fine, but the last one will not.

    /:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/7p, 5000M
        |__ Port 3: Dev 36, If 0, Class=Hub, Driver=hub/4p, 5000M
            |__ Port 1: Dev 51, If 0, Class=Vendor Specific Class, Driver=usbfs, 5000M  # WORKS
            |__ Port 2: Dev 40, If 0, Class=Hub, Driver=hub/4p, 5000M
                |__ Port 1: Dev 41, If 0, Class=Vendor Specific Class, Driver=usbfs, 5000M  # WORKS
                |__ Port 2: Dev 39, If 0, Class=Vendor Specific Class, Driver=usbfs, 5000M  # DOESN'T WORK
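
For example, here's a minimal sketch of that threading pattern, using ClassificationEngine from the Edge TPU Python API described later on this page (the model and image file names are hypothetical). The two inference loops can overlap because the Edge TPU operations are I/O-bound, while the image decoding and resizing is CPU-bound and still runs one thread at a time:

import threading

from edgetpu.classification.engine import ClassificationEngine
from PIL import Image

def classify_loop(engine, image_path, results, key):
    # CPU-bound work (decoding/resizing the image) still serializes on the GIL
    image = Image.open(image_path)
    for _ in range(100):
        results[key] = engine.classify_with_image(image, top_k=1)

# Each engine is automatically assigned to a different Edge TPU
engine_a = ClassificationEngine('model_a_edgetpu.tflite')
engine_b = ClassificationEngine('model_b_edgetpu.tflite')

results = {}
threads = [
    threading.Thread(target=classify_loop, args=(engine_a, 'image_a.jpg', results, 'a')),
    threading.Thread(target=classify_loop, args=(engine_b, 'image_b.jpg', results, 'b')),
]
for t in threads:
    t.start()
for t in threads:
    t.join()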

Using the TensorFlow Lite Python API

If you're using the TensorFlow Lite Python API to run inference and you have multiple Edge TPUs, you can specify which Edge TPU each Interpreter should use via the load_delegate() function.

Simply pass load_delegate() a dictionary with one entry, "device", specifying the Edge TPU device you want to use. It accepts one of the following values:

  • "usb": Use the default USB-connected Edge TPU.
  • "usb:<index>": Use the USB-connected Edge TPU indicated by the enumerated device index.
  • "pci": Use the default PCIe-connected Edge TPU.
  • "pci:<index>": Use the PCIe-connected Edge TPU indicated by the enumerated device index.

For example, if you have two USB Accelerators attached, you can create two Interpreter objects, each assigned to a different USB Accelerator, as follows:

from tflite_runtime.interpreter import Interpreter, load_delegate

# Assign each model to its own USB-connected Edge TPU by device index
interpreter_1 = Interpreter(model_1_path,
    experimental_delegates=[load_delegate('libedgetpu.so.1', {"device": "usb:0"})])

interpreter_2 = Interpreter(model_2_path,
    experimental_delegates=[load_delegate('libedgetpu.so.1', {"device": "usb:1"})])

If you don't specify separate Edge TPUs this way, then both models execute on the same Edge TPU, which is slower, as described in the introduction above.
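
Once created, each interpreter runs its model on its assigned Edge TPU through the standard TensorFlow Lite workflow. Here's a minimal sketch that runs the first interpreter with a dummy input (in practice you'd feed real, preprocessed data):

import numpy as np

# Allocate tensors for both interpreters (each is bound to its own Edge TPU)
for interpreter in (interpreter_1, interpreter_2):
    interpreter.allocate_tensors()

# Build a dummy input that matches model 1's input shape and dtype
input_details = interpreter_1.get_input_details()[0]
dummy_input = np.zeros(input_details['shape'], dtype=input_details['dtype'])

interpreter_1.set_tensor(input_details['index'], dummy_input)
interpreter_1.invoke()
output = interpreter_1.get_tensor(interpreter_1.get_output_details()[0]['index'])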

Using the TensorFlow Lite C++ API

If you're using the TensorFlow Lite C++ API to run inference and you have multiple Edge TPUs, you can specify which Edge TPU each Interpreter should use when you create the EdgeTpuContext via EdgeTpuManager::OpenDevice().

The OpenDevice() method includes a parameter for device_type, which accepts one of two values:

  • DeviceType::kApexUsb: Use the default USB-connected Edge TPU.
  • DeviceType::kApexPci: Use the default PCIe-connected Edge TPU.

If you have multiple Edge TPUs of the same type, then you must also specify the second parameter, device_path. To get the specific device path for each available Edge TPU, call EdgeTpuManager::EnumerateEdgeTpu().

If you don't specify separate Edge TPUs this way, then both models execute on the same Edge TPU, which is slower, as described in the introduction above.

For an example, see two_models_two_tpus_threaded.cc.

Also see the API details in edgetpu.h.

Using the Edge TPU Python API

If you're using the Edge TPU Python API to run inference and you have multiple Edge TPUs, the Edge TPU API automatically assigns each inference engine (such as ClassificationEngine and DetectionEngine) to a different Edge TPU. So you don't need to write any extra code if you have an equal number of inference engines and Edge TPUs—unlike the TensorFlow Lite API above.

For example, if you have two Edge TPUs and two models, you can run each model on separate Edge TPUs by simply creating the inference engines as usual:

from edgetpu.classification.engine import ClassificationEngine
from edgetpu.detection.engine import DetectionEngine

# Each engine is automatically assigned to a different Edge TPU
engine_a = ClassificationEngine(classification_model)
engine_b = DetectionEngine(detection_model)

Then they'll automatically run on separate Edge TPUs.

If you have just one Edge TPU, then this code still works and they both use the same Edge TPU.

However, if you have multiple Edge TPUs (N) and you have N + 1 (or more) models, then you must specify which Edge TPU to use for each additional inference engine. Otherwise, you'll receive an error that says your engine does not map to an Edge TPU device.

For example, if you have two Edge TPUs and three models, you must set the third engine to run on the same Edge TPU as one of the others (you decide which). The following code shows how you can do this for engine_c by specifying the device_path argument to be the same device used by engine_b:

# The third engine is purposely assigned to the same Edge TPU as the second
engine_a = ClassificationEngine(classification_model)
engine_b = DetectionEngine(detection_model)
engine_c = DetectionEngine(other_detection_model, engine_b.device_path())

You can also get a list of available Edge TPU device paths from ListEdgeTpuPaths().
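
For example, here's a minimal sketch that lists the unassigned devices and pins an engine to a specific one (the module and constant names assume the Edge TPU Python library's edgetpu.basic.edgetpu_utils module; the model path is hypothetical):

from edgetpu.basic import edgetpu_utils
from edgetpu.classification.engine import ClassificationEngine

# All Edge TPUs that have not yet been assigned to an inference engine
device_paths = edgetpu_utils.ListEdgeTpuPaths(edgetpu_utils.EDGE_TPU_STATE_UNASSIGNED)
print(device_paths)

# Explicitly bind an engine to the first available device
engine = ClassificationEngine('model_edgetpu.tflite', device_paths[0])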

For example code, see two_models_inference.py.

Note: All Edge TPUs connected over USB are treated equally; there's no prioritization when distributing the models. But if you attach a USB Accelerator to a Dev Board, the system always prefers the on-board (PCIe) Edge TPU before using the USB devices.