Pipeline a model with multiple Edge TPUs

Model pipelining allows you to execute different segments of the same model on different Edge TPUs. This can improve throughput for high-speed applications and can reduce total latency for large models that otherwise cannot fit into the cache of a single Edge TPU.

To start pipelining, pass your TensorFlow Lite model to the Edge TPU Compiler and specify the number of segments you want. Then pass the compiled segments to the PipelinedModelRunner API and use it to run your inference. The pipeline API is available in Python and in C++. PipelinedModelRunner assigns each model segment to its own Edge TPU and manages the pipeline flow to give you the final output. The rest of this page describes this process in detail.

Overview

When using a single Edge TPU, throughput is bottlenecked by the fact that the model can execute only one input at a time—the model cannot accept a new input until it finishes processing the previous input.

One way to solve this is with data parallelism: Load the same model on multiple Edge TPUs, so if one Edge TPU is busy, you can feed new input to the same model on another Edge TPU. However, data parallelism typically works best only if your model fits in the Edge TPU cache (~8 MB). If your model doesn't fit, then you can instead use pipeline parallelism: Divide your model into multiple segments and run each segment on a different Edge TPU.

For example, if you divide your model into two segments, then two Edge TPUs can run the segments in sequence, one segment per Edge TPU. When the first Edge TPU finishes processing an input for the first segment of the model, it passes an intermediate tensor to a second Edge TPU to continue processing on the next segment. Now the first Edge TPU can accept a new input, instead of waiting for the first input to flow through the whole model.

Additionally, segmenting your model distributes the executable and parameter data across the cache on multiple Edge TPUs. So with enough Edge TPUs, you can fit any model into the Edge TPU cache (collectively). Thus, the total latency should be lower because the Edge TPU doesn't need to fetch data from external memory.

You can pipeline your model with as many Edge TPUs (and segments) as necessary to either meet your throughput needs or fit your entire model into Edge TPU cache.

Note: Segmenting any model adds some latency, because intermediate tensors must be transferred from one Edge TPU to another. The amount of added latency from this I/O transaction depends on various factors, such as the tensor sizes and how the Edge TPUs are integrated in your system (such as via PCIe or USB bus). This latency is usually offset by gains in overall throughput and additional Edge TPU caching, but you should carefully measure the performance benefits for your models.
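
As a rough way to reason about this trade-off, the following sketch models a pipeline under simplifying assumptions that do not come from the Coral documentation: each segment has a fixed latency, each hop between Edge TPUs costs a fixed transfer time, and a transfer does not overlap with compute on the sending stage. The function name and all of the numbers are hypothetical, and the model ignores any caching benefit, so treat it as a thinking aid rather than a substitute for measurement.

```python
def pipeline_estimate(segment_ms, transfer_ms):
    """Return (latency_ms, interval_ms) for an idealized pipeline.

    latency_ms: time for one input to traverse every segment, including
        one transfer between each pair of adjacent segments.
    interval_ms: steady-state time between consecutive outputs, set by
        the slowest stage (its compute time plus one outgoing transfer).
    """
    if not segment_ms:
        raise ValueError("at least one segment is required")
    hops = len(segment_ms) - 1
    latency_ms = sum(segment_ms) + hops * transfer_ms
    # Every stage except the last pays one transfer to hand off its tensor.
    stage_costs = [t + transfer_ms for t in segment_ms[:-1]] + [segment_ms[-1]]
    interval_ms = max(stage_costs)
    return latency_ms, interval_ms


# Hypothetical numbers: a 30 ms model split into two segments.
single = pipeline_estimate([30], transfer_ms=2)
split = pipeline_estimate([14, 16], transfer_ms=2)
print(single)  # (30, 30)
print(split)   # (32, 16)
```

With these made-up numbers, splitting the model adds 2 ms to the latency of a single input but roughly halves the time between outputs.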

Segment a model

To pipeline your model, you must segment the model into separate .tflite files for each Edge TPU. You can do this by specifying the num_segments argument when you pass your model to the Edge TPU Compiler. For example:

edgetpu_compiler --num_segments=4 model.tflite

The compiler outputs each segment as a separate .tflite file with an enumerated filename. You'll then pass each segment to PipelinedModelRunner in the order matching the filenames.
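
Because the order of the segments matters, it can help to build the list of segment paths programmatically instead of typing each one. The sketch below assumes the segments are named `<base>_segment_<i>_of_<n>_edgetpu.tflite`, with `i` starting at 0. That naming pattern is an assumption here, so check it against the files your compiler version actually writes. The helper name `segment_paths` is hypothetical.

```python
import os


def segment_paths(model_dir, base_name, num_segments):
    """Build the ordered list of segment file paths.

    ASSUMPTION: segments are named
    '<base>_segment_<i>_of_<n>_edgetpu.tflite' with i starting at 0.
    Check the names your compiler version actually writes.
    """
    return [
        os.path.join(
            model_dir,
            f"{base_name}_segment_{i}_of_{num_segments}_edgetpu.tflite",
        )
        for i in range(num_segments)
    ]


paths = segment_paths("model_dir", "model", 4)
for p in paths:
    print(p)
```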

The number of segments you should use will vary, depending on the size of your model and whether you're trying to fit a large model entirely into the Edge TPU cache (to reduce latency) or you're trying to increase your model's throughput:

  • If you just need to fit your model into the Edge TPU cache, then you can incrementally increase the number of segments until the compiler reports "Off-chip memory used" as 0.00B for all segments.
  • If you want to increase the model throughput, then finding the ideal number of segments might be a little trickier. That's because although the Edge TPU Compiler divides your model so that each segment has the same amount of parameter data, each segment may still have a different amount of latency. For example, one layer might receive much larger input tensors than others, and that added processing can create a bottleneck in your pipeline. So you might improve throughput further by simply adding an extra segment. Based on our experiments, we found the following formula creates a well-distributed pipeline:

    num_segments = [Model size] MB / 6 MB

    Then round up to a whole number. For example, if your model is 20 MB, the result is 3.3, so you should try 4 segments.

    If the pipelined model latency is still too high, then it might be because one segment of your model is slowing down the whole pipeline, so you should try our profiling-based partitioner (the profiling_partioner). This tool profiles the latency for each segment and then re-segments the model to balance the latency across all segments.
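
The segment-count formula above can be written as a small helper. This is a minimal sketch: the function name is hypothetical, and the 6 MB divisor is taken directly from the formula.

```python
import math

TARGET_SEGMENT_MB = 6  # divisor from the formula above


def suggested_num_segments(model_size_mb):
    """Model size in MB divided by 6 MB, rounded up; at least 1."""
    return max(1, math.ceil(model_size_mb / TARGET_SEGMENT_MB))


print(suggested_num_segments(20))  # 4
```

The result is only a starting point. As described above, the best segment count for throughput still depends on how latency is distributed across the segments.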

If you want complete control of where each segment is cut, you can instead manually cut your model into separate .tflite files using the TensorFlow toco_convert tool. Be sure you name each .tflite segment with a number corresponding to its position in the pipeline. Also beware that toco_convert is only compatible with models using uint8 quantized parameters, so it's not compatible with post-training quantized models (which use int8).

Run a segmented model in a pipeline

To pipeline your segmented model, you need to use the PipelinedModelRunner API, available in Python and C++. The PipelinedModelRunner API works like a FIFO (first in, first out) queue manager: Once you initialize it with your model segments, you push your input tensors to it and pop the output tensors from it. PipelinedModelRunner manages the pipeline by passing intermediate tensors between the segments on each Edge TPU, and it delivers the output from the final segment.

The following sections describe the basic API workflow, either with Python or with C++.

Note: Although the Python and C++ versions of PipelinedModelRunner share many semantics, the Python API does not support custom allocators. So if memory optimization is important for your application, we recommend using the C++ API, because it allows you to define your own memory allocator for the input and output tensors.

Run a pipeline with Python

To get started with the pipeline API, you should already be familiar with how to run an inference on the Edge TPU with Python. The pipeline API builds upon that, so the following is only a basic walk-through of the pipeline API workflow. For more complete example code, see model_pipelining_classify_image.py.

The basic procedure to pipeline your model in Python is as follows:

  1. Be sure you install and import the coral Python module as shown in Run inference on the Edge TPU with Python.
  2. Create a PipelinedModelRunner by passing the constructor a list of all Interpreter objects, each corresponding to a model segment and Edge TPU in the pipeline. For example:
    interpreters = []
    #
    # Code goes here to populate `interpreters` with TF Interpreter objects,
    # each initialized with a different model segment and Edge TPU device.
    # The order of elements in the list must match the model segmentation order
    # (as they were output by the Edge TPU compiler).
    #
    for interpreter in interpreters:
      interpreter.allocate_tensors()
    
    runner = coral.pipeline.PipelinedModelRunner(interpreters)
    
  3. To run each inference, pass your input tensors to PipelinedModelRunner.push() as a list. For example, here's how to push one image input:
    size = coral.adapters.classify.input_size(runner.interpreters()[0])
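    # Image.LANCZOS is the same filter as the older Image.ANTIALIAS alias,
    # which recent Pillow releases no longer provide.
    image = np.array(Image.open(image_path).convert('RGB').resize(size, Image.LANCZOS))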
    
    runner.push([image])
    

    If your model requires multiple inputs, the order of inputs in the list must match the order expected, as specified by Interpreter.inputs().

    Passing an empty list to push() signals PipelinedModelRunner that you're done and it can close the pipeline. That causes PipelinedModelRunner.pop() (the next code snippet) to return None.

  4. Then, receive the pipeline output tensors from PipelinedModelRunner.pop(). For example, this loop repeatedly accepts new outputs until the result is None:
    while True:
      result = runner.pop()
      if not result:
        break
      #
      # Code goes here to process the result
      #
    

To see more code with the pipeline API, including separate threads to deliver input and receive output, see model_pipelining_classify_image.py.
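
To illustrate the threading pattern without Edge TPU hardware, the sketch below uses a hypothetical stand-in class, `FakeRunner`, that mimics only the push() and pop() behavior described in the steps above (an empty list closes the pipeline and makes pop() return None). It is not the Coral PipelinedModelRunner, and the "inference" it performs (doubling each input) is purely illustrative. One thread delivers inputs while another receives outputs.

```python
import queue
import threading


class FakeRunner:
    """Hypothetical stand-in that mimics only the push()/pop() behavior
    described above; it is NOT the Coral PipelinedModelRunner."""

    def __init__(self):
        self._outputs = queue.Queue()

    def push(self, input_tensors):
        # An empty list closes the pipeline; otherwise fake an "inference".
        self._outputs.put([t * 2 for t in input_tensors] or None)

    def pop(self):
        # Blocks until an output (or the end-of-pipeline None) is available.
        return self._outputs.get()


def run_pipeline(runner, inputs):
    results = []

    def producer():
        for item in inputs:
            runner.push([item])
        runner.push([])  # signal that no more inputs are coming

    def consumer():
        while True:
            result = runner.pop()
            if not result:
                break
            results.append(result)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipeline(FakeRunner(), [1, 2, 3]))  # [[2], [4], [6]]
```

Running the producer and consumer on separate threads lets new inputs enter the pipeline while earlier outputs are still being collected, which is the point of pipelining.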

Run a pipeline with C++

To get started with the pipeline API, you should already be familiar with our C++ API for inferencing on the Edge TPU (how to create an EdgeTpuContext and TensorFlow Interpreter with the Edge TPU custom operator). The pipeline API builds upon that, so the following is only a basic walk-through of the pipeline API workflow. For complete example code, see model_pipelining.cc.

The basic procedure to pipeline your model in C++ is as follows:

  1. Create a PipelinedModelRunner by passing the constructor a vector of all Interpreter objects, each corresponding to a model segment and Edge TPU in the pipeline. For example:

    std::vector<tflite::Interpreter*> interpreters(num_segments);
    //
    //  Code goes here to populate interpreters with TF Interpreter objects,
    //  each one initialized with a different model segment and EdgeTpuContext.
    //  The order of elements in the vector must match the model segmentation order.
    //
    std::unique_ptr<coral::PipelinedModelRunner> runner(
        new coral::PipelinedModelRunner(interpreters));
    

    Optionally, you can pass the constructor your own Allocator objects to be used when allocating memory for the input and output tensors.

  2. Pass your input tensors to PipelinedModelRunner.Push() as a vector of PipelineTensor objects—each PipelineTensor holds a pointer to the input tensor data. For example:

    std::vector<std::vector<coral::PipelineTensor>> input_requests(num_inferences);
    //
    //  Code goes here to populate input_requests with input tensors.
    //
    auto request_producer = [&runner, &input_requests]() {
      for (const auto& request : input_requests) {
        runner->Push(request);
      }
      runner->Push({});
    };
    auto producer = std::thread(request_producer);
    producer.join();
    

    PipelineTensor is a struct that defines the tensor data type, the data byte size, and a pointer to that data. The Push() method takes a vector in order to accommodate models that have multiple inputs. So if your model requires multiple inputs, the order in the vector must match the order expected, as specified by Interpreter.inputs().

    Note: You're responsible for allocating the input tensors. Although not shown in the above code, you can use the pipeline's default allocator by calling PipelinedModelRunner.GetInputTensorAllocator(), which returns an Allocator object to use when populating input tensors (or pass your own allocator to the PipelinedModelRunner constructor).

    As shown after the above loop, passing an empty vector to Push() signals PipelinedModelRunner that you're done and it can close the pipeline. That causes PipelinedModelRunner.Pop() (the next code snippet) to return false.

  3. Finally, receive the output tensors from PipelinedModelRunner.Pop() by passing a pointer to another PipelineTensor vector. This is a blocking call. It returns true when PipelinedModelRunner receives the latest output tensor, at which time it fills the given vector with PipelineTensor objects that point to the output tensor data. For example:

    auto request_consumer = [&runner]() {
      std::vector<coral::PipelineTensor> output_tensors;
      while (runner->Pop(&output_tensors)) {
        coral::FreeTensors(output_tensors, runner->GetOutputTensorAllocator());
        output_tensors.clear();
      }
      std::cout << "All tensors consumed" << std::endl;
    };
    auto consumer = std::thread(request_consumer);
    consumer.join();
    

When the pipeline is done (you passed an empty vector to Push()), Pop() returns false.

Note: You're responsible for deallocating the output tensors, which you can do with FreeTensors(), as shown above.

To see more code, see model_pipelining.cc.

Caution:
  • The default allocator does not set memory limits, so if you feed the pipeline new inputs faster than the model can consume them, the allocations could overrun your available memory. You can solve this with a custom allocator that implements your own memory management strategy, such as blocking new input if your pre-allocated memory is full.
  • If your system has multiple processes that use the Edge TPUs, then you must provide a central resource manager to be sure that none of the Edge TPUs assigned to pipelining can be suddenly interrupted by a different process.
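
The "block new input when memory is full" strategy from the first caution can be illustrated independently of the allocator interface. The sketch below is not the C++ Allocator API and not part of the Coral library. It is a hypothetical Python helper, `InFlightLimiter`, that caps how many inputs may be in the pipeline at once, which indirectly bounds the memory those inputs can occupy.

```python
import threading


class InFlightLimiter:
    """Caps the number of inputs that are queued or being processed.

    Hypothetical helper, not part of the Coral API.
    """

    def __init__(self, max_in_flight):
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def acquire(self, timeout=None):
        # Call before pushing an input. Blocks while the pipeline is full;
        # returns False if the timeout expires first.
        return self._slots.acquire(timeout=timeout)

    def release(self):
        # Call after the matching output has been popped and freed.
        self._slots.release()


limiter = InFlightLimiter(max_in_flight=2)
print(limiter.acquire(timeout=0.01))  # True
print(limiter.acquire(timeout=0.01))  # True
print(limiter.acquire(timeout=0.01))  # False: both slots are taken
limiter.release()
print(limiter.acquire(timeout=0.01))  # True: a slot was freed
```

In a real application, the producer thread would call acquire() before each push and the consumer thread would call release() after each pop, so the producer stalls instead of allocating without bound.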

Add the C++ library to your project

The PipelinedModelRunner C++ API isn't available as a stand-alone library, so you need to build it with Bazel as a dependency in your project. (If you're new to Bazel, try building this simple C++ example. The program just lists the Edge TPU devices available in your system, but it offers an example of how to build a project with Bazel for the Edge TPU.)

You can find the pipeline API source code here, and an example Bazel build rule here (for the model_pipelining target).

As an example, you can build the model_pipelining.cc example as follows:

git clone https://github.com/google-coral/libcoral.git

cd libcoral

# Build for aarch64 (Coral Dev Board)
make DOCKER_IMAGE=debian:stretch DOCKER_CPUS="aarch64" DOCKER_TARGETS="examples" docker-build

When it finishes (5+ min), you can find the model_pipelining binary in ./out/aarch64/examples/.

When you run this example, you must provide 4 arguments, in this order:

  • The directory where the model segments are stored
  • The original model's filename (without .tflite)
  • The number of segments
  • The number of inferences to run

For example, if you segment the model inception_v3_quant.tflite (using the Edge TPU Compiler) into 3 segments, you can run the example as follows:

./model_pipelining ./model_dir/ inception_v3_quant 3 100

Beware that, although pipelining can work with any model that's compatible with the Edge TPU, not all models are compatible with the Edge TPU Compiler. For information about model compatibility, read TensorFlow models on the Edge TPU.