Pipeline a model with multiple Edge TPUs

Model pipelining allows you to execute different segments of the same model on different Edge TPUs. This can improve throughput for high-speed applications and can reduce total latency for large models that otherwise cannot fit into the cache of a single Edge TPU.

To start pipelining, just pass your TensorFlow Lite model to the Edge TPU Compiler and specify the number of segments you want. Then use the PipelinedModelRunner C++ API to run inferences. (We currently do not support pipelining with Python.) The rest of this page describes this process in detail.

Note: This API is in beta and may change.

Overview

When using a single Edge TPU, throughput is bottlenecked by the fact that the model can execute only one input at a time—the model cannot accept a new input until it finishes processing the previous input.

One way to solve this is with data parallelism: Load the same model on multiple Edge TPUs, so if one Edge TPU is busy, you can feed new input to the same model on another Edge TPU. However, data parallelism typically works best only if your model fits in the Edge TPU cache (~8 MB). If your model doesn't fit, then you can instead use pipeline parallelism: Divide your model into multiple segments and run each segment on a different Edge TPU.

For example, if you divide your model into two segments, then two Edge TPUs can run each segment in sequence. When the first Edge TPU finishes processing an input for the first segment of the model, it passes an intermediate tensor to a second Edge TPU to continue processing on the next segment. Now the first Edge TPU can accept a new input, instead of waiting for the first input to flow through the whole model.
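
To make the throughput gain concrete, suppose (purely as a hypothetical) that each of the two segments takes 5 ms to process one input. A single Edge TPU running the whole model can accept a new input only about every 10 ms, whereas the two-segment pipeline can accept a new input about every 5 ms once both stages are busy, roughly doubling throughput. The latency for any single input stays around 10 ms, plus the transfer overhead described in the note below.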

Additionally, segmenting your model distributes the executable and parameter data across the cache on multiple Edge TPUs. So with enough Edge TPUs, you can fit any model into the Edge TPU cache (collectively). Thus, the total latency should be lower because the Edge TPU doesn't need to fetch data from external memory.

You can pipeline your model with as many Edge TPUs (and segments) as necessary to either meet your throughput needs or fit your entire model into Edge TPU cache.

Note: Segmenting any model will add some latency, because intermediate tensors must be transferred from one Edge TPU to another. However, the amount of added latency from this I/O transaction depends on various factors such as the tensor sizes and how the Edge TPUs are integrated in your system (such as via PCIe or USB bus), and such latency is usually offset by gains in overall throughput and additional Edge TPU caching. So you should carefully measure the performance benefits for your models.

Segment a model

To pipeline your model, you must segment the model into separate .tflite files for each Edge TPU. You can do this by specifying the num_segments argument when you pass your model to the Edge TPU Compiler. For example:

edgetpu_compiler --num_segments=4 model.tflite

The compiler outputs each segment as a separate .tflite file with an enumerated filename. You'll then pass each segment to PipelinedModelRunner in the order matching the filenames.

The number of segments you should use will vary, depending on the size of your model and whether you're trying to fit a large model entirely into the Edge TPU cache (to reduce latency) or you're trying to increase your model's throughput:

  • If you just need to fit your model into the Edge TPU cache, then you can incrementally increase the number of segments until the compiler reports that the "Off-chip memory used" is 0.00B for all segments.
  • If you want to increase the model throughput, then finding the ideal number of segments might be a little trickier. That's because although the Edge TPU Compiler divides your model so that each segment has the same amount of parameter data, each segment may still have a different amount of latency. For example, one layer might receive much larger input tensors than others, and that added processing can create a bottleneck in your pipeline. So you might improve throughput further by simply adding an extra segment. Based on our experiments, we found the following formula creates a well-distributed pipeline:

    num_segments = [Model size] MB / 6 MB

    Then round up to a whole number. For example, if your model is 20 MB, the result is 3.3, so you should try 4 segments. (A short sketch of this calculation appears below.)
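
As a rough illustration of that rule of thumb, here's a minimal sketch of the calculation (the function name is just illustrative, and the 6 MB divisor is the heuristic above, not a hard limit):

    #include <cmath>

    // Suggested starting number of segments, per the rule of thumb above:
    // compiled model size in MB divided by 6 MB, rounded up.
    int SuggestedNumSegments(double model_size_mb) {
      return static_cast<int>(std::ceil(model_size_mb / 6.0));
    }

    // Example: SuggestedNumSegments(20.0) == 4, matching the 20 MB case above.

Treat the result as a starting point and measure; as noted above, adding an extra segment can sometimes improve throughput further.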

If you want complete control of where each segment is cut, you can instead manually cut your model into separate .tflite files using the TensorFlow toco_convert tool. Be sure you name each .tflite segment with a number corresponding to the order it belongs in the pipeline. Also beware that toco_convert is only compatible with models using uint8 quantized parameters, so it's not compatible with post-training quantized models (which use int8).

Run a segmented model in a pipeline

To pipeline your segmented model, you need to use the PipelinedModelRunner C++ API. This API is similar to a FIFO queue manager: Once you pass it each model segment, you push it your input tensor and then pop the output tensor. PipelinedModelRunner manages the pipeline by passing intermediate tensors between the segments running on separate Edge TPUs.

Tip: If you just want to see example code, see model_pipelining.cc.

To get started with the pipeline API, you need to be familiar with our existing C++ API for inferencing on the Edge TPU (how to create an EdgeTpuContext and TensorFlow Interpreter with the Edge TPU custom operator). Assuming you're familiar with that, the basic procedure to pipeline your segmented model is as follows:

  1. Create a PipelinedModelRunner by passing the constructor a vector of all Interpreter objects, each corresponding to a model segment and Edge TPU in the pipeline. For example (a sketch of one way to build these interpreters appears after this procedure):

    std::vector<tflite::Interpreter*> interpreters(num_segments);

    // Code goes here to populate interpreters with TF Interpreter objects,
    // each one initialized with a different model segment and EdgeTpuContext.
    // The order of elements in the vector must match the model segmentation order.

    std::unique_ptr<coral::PipelinedModelRunner> runner(
        new coral::PipelinedModelRunner(interpreters));

    Optionally, you can pass the constructor your own Allocator objects to be used when allocating memory for the input and output tensors.

  2. Pass your input tensors to PipelinedModelRunner.Push() as a vector of PipelineTensor objects—each PipelineTensor holds a pointer to the input tensor data. For example:

    std::vector<std::vector<coral::PipelineTensor>> input_requests(num_inferences);

    // Code goes here to populate input_requests with input tensors.

    auto request_producer = [&runner, &input_requests] {
      for (const auto& request : input_requests) {
        runner->Push(request);
      }
      runner->Push({});
    };
    auto producer = std::thread(request_producer);
    producer.join();

    PipelineTensor is a struct that defines the tensor data type, the data byte size, and a pointer to that data. The Push() method takes a vector in order to accommodate models that have multiple inputs. So if your model requires multiple inputs, the order in the vector must match the order expected, as specified by Interpreter.inputs().

    Note: You're responsible for allocating the input tensors. Although not shown in the above code, you can use the pipeline's default allocator by calling PipelinedModelRunner.GetInputTensorAllocator(), which returns an Allocator object to use when populating input tensors (or pass your own allocator to the PipelinedModelRunner constructor).

    As shown after the above loop, passing an empty vector to Push() signals PipelinedModelRunner that you're done and it can close the pipeline. That causes PipelinedModelRunner.Pop() (the next code snippet) to return false.

  3. Finally, receive the output tensors from PipelinedModelRunner.Pop() by passing a pointer to another PipelineTensor vector. This is a blocking call. It returns true when PipelinedModelRunner receives new output tensors—at which time, it updates the given PipelineTensor vector with pointers to the output tensor data. For example:

    auto request_consumer = [&runner] {
      std::vector<coral::PipelineTensor> output_tensors;
      while (runner->Pop(&output_tensors)) {
        coral::FreeTensors(output_tensors, runner->GetOutputTensorAllocator());
        output_tensors.clear();
      }
      std::cout << "All tensors consumed" << std::endl;
    };
    auto consumer = std::thread(request_consumer);
    consumer.join();
    

When the pipeline is done (you passed an empty vector to Push()), Pop() returns false.

Note: You're responsible for deallocating the output tensors, which you can do with FreeTensors(), as shown above.

For some example code, see model_pipelining.cc.
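
To make step 1 of the procedure more concrete, below is a minimal sketch of one way to populate the interpreters vector, using the standard Edge TPU C++ inferencing flow referenced above (FlatBufferModel, a BuiltinOpResolver with the Edge TPU custom op, and a separate EdgeTpuContext per segment). The BuildInterpreter helper, the segment_paths variable, and the filenames shown are illustrative, not part of the pipeline API; see model_pipelining.cc for the exact setup used by the example.

    #include <memory>
    #include <string>
    #include <vector>

    #include "edgetpu.h"
    #include "tensorflow/lite/interpreter.h"
    #include "tensorflow/lite/kernels/register.h"
    #include "tensorflow/lite/model.h"

    // Illustrative helper: builds one Interpreter for a single model segment,
    // bound to the given Edge TPU device context.
    std::unique_ptr<tflite::Interpreter> BuildInterpreter(
        const tflite::FlatBufferModel& model, edgetpu::EdgeTpuContext* context) {
      tflite::ops::builtin::BuiltinOpResolver resolver;
      resolver.AddCustom(edgetpu::kCustomOp, edgetpu::RegisterCustomOp());
      std::unique_ptr<tflite::Interpreter> interpreter;
      tflite::InterpreterBuilder(model, resolver)(&interpreter);
      interpreter->SetExternalContext(kTfLiteEdgeTpuContext, context);
      return interpreter;
    }

    // Inside your setup code:
    // Compiled segment filenames, in pipeline order (example names).
    std::vector<std::string> segment_paths = {
        "model_segment_0_of_2_edgetpu.tflite",
        "model_segment_1_of_2_edgetpu.tflite",
    };
    // Assumes at least segment_paths.size() Edge TPUs are attached.
    const auto all_tpus =
        edgetpu::EdgeTpuManager::GetInstance()->EnumerateEdgeTpu();

    // These objects must outlive the PipelinedModelRunner that uses the
    // interpreters built from them.
    std::vector<std::shared_ptr<edgetpu::EdgeTpuContext>> contexts;
    std::vector<std::unique_ptr<tflite::FlatBufferModel>> models;
    std::vector<std::unique_ptr<tflite::Interpreter>> owned_interpreters;

    std::vector<tflite::Interpreter*> interpreters;
    for (size_t i = 0; i < segment_paths.size(); ++i) {
      contexts.push_back(edgetpu::EdgeTpuManager::GetInstance()->OpenDevice(
          all_tpus[i].type, all_tpus[i].path));
      models.push_back(
          tflite::FlatBufferModel::BuildFromFile(segment_paths[i].c_str()));
      owned_interpreters.push_back(BuildInterpreter(*models[i], contexts[i].get()));
      interpreters.push_back(owned_interpreters.back().get());
    }
    // interpreters can now be passed to the PipelinedModelRunner constructor
    // (step 1 above).

The key points are that each interpreter is bound to a different Edge TPU (via its own EdgeTpuContext) and that the vector order matches the segment order produced by the compiler.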

Caution:
  • The default allocator does not set memory limits, so if you feed the pipeline new inputs faster than the model can consume them, the allocations could overrun your available memory. You can solve this with a custom allocator that implements your own memory management strategy, such as blocking new input if your pre-allocated memory is full (a simpler request-throttling alternative is sketched below).
  • If your system has multiple processes that use the Edge TPUs, then you must provide a central resource manager to be sure that none of the Edge TPUs assigned to pipelining can be suddenly interrupted by a different process.
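
If you'd rather not implement a full custom Allocator, one simple alternative is to throttle the producer so only a fixed number of requests are in flight at a time. The class below is an illustrative sketch (it is not part of the pipeline API): the producer blocks in Acquire() before each Push() once the cap is reached, and the consumer calls Release() after it frees each popped output.

    #include <condition_variable>
    #include <mutex>

    // Illustrative helper (not part of the pipeline API): caps how many inputs
    // have been pushed but not yet popped, so input-tensor allocations made
    // with the default allocator can't grow without bound.
    class MaxInflightLimiter {
     public:
      explicit MaxInflightLimiter(int max_inflight) : slots_(max_inflight) {}

      // Blocks until a slot is free. Call before runner->Push(request).
      void Acquire() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return slots_ > 0; });
        --slots_;
      }

      // Frees a slot. Call after freeing the tensors returned by runner->Pop().
      void Release() {
        {
          std::lock_guard<std::mutex> lock(mu_);
          ++slots_;
        }
        cv_.notify_one();
      }

     private:
      std::mutex mu_;
      std::condition_variable cv_;
      int slots_;
    };

For example, construct MaxInflightLimiter limiter(4) and share it between the producer and consumer threads shown above. A custom Allocator passed to the PipelinedModelRunner constructor gives you the same kind of control at the allocation level, which is what the caution above recommends.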

Add the pipelining library to your project

The PipelinedModelRunner API isn't available as a stand-alone library, so you need to build it with Bazel as a dependency in your project. (If you're new to Bazel, try building this simple C++ example—the program just lists the Edge TPU devices available in your system, but offers an example of how to build a project with Bazel for the Edge TPU.)

You can find the pipeline API source code here, and an example Bazel build rule here (for the model_pipelining target).

As an example, you can build the model_pipelining.cc example as follows:

git clone https://github.com/google-coral/edgetpu.git

cd edgetpu

# Build for aarch64 (Coral Dev Board)
make DOCKER_IMAGE=debian:stretch DOCKER_CPUS="aarch64" DOCKER_TARGETS="examples" docker-build

When it finishes (5+ min), you can find the model_pipelining binary in ./out/aarch64/examples/.

When you run this example, you must provide 4 arguments, in this order:

  • The directory where the model segments are stored
  • The original model's filename (without .tflite)
  • The number of segments
  • The number of inferences to run

For example, if you segment the model inception_v3_quant.tflite (using the Edge TPU Compiler) into 3 segments, you can run the example as follows:

./model_pipelining ./model_dir/ inception_v3_quant 3 100

Beware that, although pipelining can work with any model that's compatible with the Edge TPU, not all models are compatible with the Edge TPU Compiler. For information about model compatibility, read TensorFlow models on the Edge TPU.