Pipeline a model with multiple Edge TPUs
Model pipelining allows you to execute different segments of the same model on different Edge TPUs. This can improve throughput for high-speed applications and can reduce total latency for large models that otherwise cannot fit into the cache of a single Edge TPU.
To start pipelining, just pass your TensorFlow Lite model to the Edge TPU Compiler and specify the
number of segments you want. Then pass the compiled segments to the PipelinedModelRunner
API and
use it to run your inference—the pipeline API is available in
Python and in
C++. PipelinedModelRunner
delegates each model segment to its own
Edge TPU and manages the pipeline flow to give you the final output. The rest of this page describes
this process in detail.
Overview
When using a single Edge TPU, throughput is bottlenecked by the fact that the model can execute only one input at a time—the model cannot accept a new input until it finishes processing the previous input.
One way to solve this is with data parallelism: Load the same model on multiple Edge TPUs, so if one Edge TPU is busy, you can feed new input to the same model on another Edge TPU. However, data parallelism typically works best only if your model fits in the Edge TPU cache (~8 MB). If your model doesn't fit, then you can instead use pipeline parallelism: Divide your model into multiple segments and run each segment on a different Edge TPU.
For example, if you divide your model into two segments, then two Edge TPUs can run each segment in sequence. When the first Edge TPU finishes processing an input for the first segment of the model, it passes an intermediate tensor to a second Edge TPU, which continues processing with the next segment. Now the first Edge TPU can accept a new input, instead of waiting for the first input to flow through the whole model.
Additionally, segmenting your model distributes the executable and parameter data across the cache on multiple Edge TPUs. So with enough Edge TPUs, you can fit any model into the Edge TPU cache (collectively). Thus, the total latency should be lower because the Edge TPU doesn't need to fetch data from external memory.
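For a purely illustrative calculation (the numbers here are hypothetical, not measurements): if a model takes 10 ms per inference on one Edge TPU, a single device tops out near 100 inferences per second, because each input must wait for the previous one to finish. Split into two segments of roughly 5 ms each on two Edge TPUs, the pipeline can accept a new input about every 5 ms, so throughput approaches 200 inferences per second while the latency for any single input stays near 10 ms. And if segmenting lets all the parameter data fit in the two caches (about 16 MB combined), latency can drop further because nothing must be fetched from external memory.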
You can pipeline your model with as many Edge TPUs (and segments) as necessary to either meet your throughput needs or fit your entire model into Edge TPU cache.
Segment a model
To pipeline your model, you must segment the model into separate .tflite
files for each Edge TPU.
You can do this by specifying the num_segments
argument when you pass your model to the Edge TPU
Compiler. For example:
edgetpu_compiler --num_segments=4 model.tflite
The compiler outputs each segment as a separate .tflite
file with an enumerated filename.
You'll then pass each segment to PipelinedModelRunner
in the order matching the filenames, as
described in the following sections.
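For example, compiling model.tflite with --num_segments=4 typically produces segment files named like the following (the exact names can vary by compiler version, so check the compiler's output):

model_segment_0_of_4_edgetpu.tflite
model_segment_1_of_4_edgetpu.tflite
model_segment_2_of_4_edgetpu.tflite
model_segment_3_of_4_edgetpu.tflite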
The number of segments you should use will vary, depending on the size of your model and whether you're trying to fit a large model entirely into the Edge TPU cache (to reduce latency) or you're trying to increase your model's throughput. Also beware that although the Edge TPU Compiler divides your model so that each segment has roughly equal amounts of parameter data, that does not mean each segment will have the same latency—one segment could still be a bottleneck in the pipeline.
For details about how to choose an optimal number of segments, and how to minimize latency bottlenecks across segments (by using the profiling-based partitioner), see the Edge TPU Compiler guide.
Run a segmented model in a pipeline
To pipeline your segmented model, you need to use the PipelinedModelRunner
API, available in
Python and C++. The PipelinedModelRunner
API works like a FIFO (first in, first out) queue
manager: Once you initialize it with your model segments, you push it your input tensors and pop
the output tensors. PipelinedModelRunner
manages the pipeline by passing intermediate tensors
between the segments on each Edge TPU, and it delivers the output from the final segment.
The following sections describe the basic API workflow, either with Python or with C++.
Although the Python and C++ versions of PipelinedModelRunner share many semantics, the Python API does not support custom allocators. So if memory optimization is important
for your application, we recommend using the C++ API, because it allows you to define your
own memory allocator for the input and output tensors.
Run a pipeline with Python
To get started with the pipeline API, you should already be familiar with how to run an inference on the Edge TPU with Python. Because the pipeline API builds upon that, the following is a basic walk-through of the pipeline API workflow. For more complete example code, see model_pipelining_classify_image.py.
The basic procedure to pipeline your model in Python is as follows:
- Be sure you install and import the coral Python module as shown in Run inference on the Edge TPU with Python.
- Create a PipelinedModelRunner by passing the constructor a list of all Interpreter objects, each corresponding to a model segment and Edge TPU in the pipeline. (One way to populate that list is sketched after this procedure.) For example:

  interpreters = []

  #
  # Code goes here to populate `interpreters` with TF Interpreter objects,
  # each initialized with a different model segment and Edge TPU device.
  # The order of elements in the list must match the model segmentation order
  # (as they were output by the Edge TPU compiler).
  #

  for interpreter in interpreters:
    interpreter.allocate_tensors()

  runner = coral.pipeline.PipelinedModelRunner(interpreters)
- To run each inference, pass your input tensors to PipelinedModelRunner.push() as a list. For example, here's how to push one image input:

  size = coral.adapters.classify.input_size(runner.interpreters()[0])
  image = np.array(Image.open(image_path).convert('RGB').resize(size, Image.ANTIALIAS))
  runner.push([image])

  If your model requires multiple inputs, the order of inputs in the list must match the order expected, as specified by Interpreter.inputs().

  Passing an empty list to push() signals PipelinedModelRunner that you're done and it can close the pipeline. That causes PipelinedModelRunner.pop() (the next code snippet) to return None.
- Then, receive the pipeline output tensors from PipelinedModelRunner.pop(). For example, this loop repeatedly accepts new outputs until the result is None:

  while True:
    result = runner.pop()
    if not result:
      break
    #
    # Code goes here to process the result
    #
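The "code goes here" comment in the first step is intentionally left open. As one possible sketch (an assumption, not code from the official example), you can build the interpreters list with the pycoral helpers make_interpreter() and list_edge_tpus() from pycoral.utils.edgetpu, binding each segment to its own Edge TPU by device index. The segment_paths list below is a hypothetical example; adjust the imports if your installation exposes these helpers through the coral module used elsewhere in this guide:

from pycoral.utils.edgetpu import list_edge_tpus, make_interpreter

# Hypothetical: segment files in the order produced by the Edge TPU Compiler.
segment_paths = ['model_segment_%d_of_4_edgetpu.tflite' % i for i in range(4)]

# One Edge TPU per segment; device ':N' selects the N-th enumerated Edge TPU.
assert len(list_edge_tpus()) >= len(segment_paths)
interpreters = [make_interpreter(path, device=':%d' % i)
                for i, path in enumerate(segment_paths)]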
To see more code with the pipeline API, including separate threads to deliver input and receive output, see model_pipelining_classify_image.py.
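That example follows a simple producer/consumer pattern: one thread pushes inputs while another pops results. The following is a simplified sketch of that pattern (an illustration only, not the example's exact code; num_inferences is assumed to be defined, and image comes from the earlier snippet):

import threading

def producer():
  for _ in range(num_inferences):  # num_inferences: assumed to be defined
    runner.push([image])           # reuses the `image` input from above
  runner.push([])                  # an empty list closes the pipeline

def consumer():
  while True:
    result = runner.pop()
    if not result:
      break
    # Code goes here to process each result

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
  t.start()
for t in threads:
  t.join()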
Run a pipeline with C++
To get started with the pipeline API, you should already be familiar with our C++ API for
inferencing on the Edge TPU (how to create an
EdgeTpuContext
and TensorFlow
Interpreter
with the Edge
TPU custom operator). Because the pipeline API builds
upon that, the following is a basic walk-through of the pipeline API workflow.
For complete example code, see model_pipelining.cc.
The basic procedure to pipeline your model in C++ is as follows:
- Create a PipelinedModelRunner by passing the constructor a vector of all Interpreter objects, each corresponding to a model segment and Edge TPU in the pipeline. (One way to populate that vector is sketched after this procedure.) For example:

  std::vector<tflite::Interpreter*> interpreters(num_segments);

  //
  // Code goes here to populate `interpreters` with TF Interpreter objects,
  // each one initialized with a different model segment and EdgeTpuContext.
  // The order of elements in the vector must match the model segmentation order.
  //

  std::unique_ptr<coral::PipelinedModelRunner> runner(
      new coral::PipelinedModelRunner(interpreters));

  Optionally, you can pass the constructor your own Allocator objects to be used when allocating memory for the input and output tensors.
- Pass your input tensors to PipelinedModelRunner.Push() as a vector of PipelineTensor objects—each PipelineTensor holds a pointer to the input tensor data. For example:

  std::vector<std::vector<coral::PipelineTensor>> input_requests(num_inferences);

  //
  // Code goes here to populate `input_requests` with input tensors.
  //

  auto request_producer = [&runner, &input_requests]() {
    for (const auto& request : input_requests) {
      runner->Push(request);
    }
    runner->Push({});
  };

  auto producer = std::thread(request_producer);
  producer.join();

  PipelineTensor is a struct that defines the tensor data type, the data byte size, and a pointer to that data. The Push() method takes a vector in order to accommodate models that have multiple inputs. So if your model requires multiple inputs, the order in the vector must match the order expected, as specified by Interpreter.inputs().

  Note: You're responsible for allocating the input tensors. Although not shown in the above code, you can use the pipeline's default allocator by calling PipelinedModelRunner.GetInputTensorAllocator(), which returns an Allocator object to use when populating input tensors (or pass your own allocator to the PipelinedModelRunner constructor).

  As shown after the above loop, passing an empty vector to Push() signals PipelinedModelRunner that you're done and it can close the pipeline. That causes PipelinedModelRunner.Pop() (the next code snippet) to return false.
- Finally, receive the output tensors from PipelinedModelRunner.Pop() by passing a pointer to another PipelineTensor vector. This is a blocking call. It returns true when PipelinedModelRunner receives the latest output tensors—at which time, it updates the given PipelineTensor vector with pointers to the output tensor data. For example:

  auto request_consumer = [&runner]() {
    std::vector<coral::PipelineTensor> output_tensors;
    while (runner->Pop(&output_tensors)) {
      coral::FreeTensors(output_tensors, runner->GetOutputTensorAllocator());
      output_tensors.clear();
    }
    std::cout << "All tensors consumed" << std::endl;
  };

  auto consumer = std::thread(request_consumer);
  consumer.join();

  When the pipeline is done (you passed an empty vector to Push()), Pop() returns false.

  You're responsible for releasing the memory allocated for each output tensor, which you can do by passing the tensors to FreeTensors(), as shown above.
To see more code, see model_pipelining.cc.
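The "code goes here" comments above leave out how the interpreters vector is built. The following is a rough sketch of one way to do it (an assumption, not code taken from model_pipelining.cc), using the libedgetpu C++ API to open one Edge TPU per segment. The segment_paths argument is a hypothetical list of compiled segment files, and error handling is omitted:

#include <memory>
#include <string>
#include <vector>

#include "edgetpu.h"
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

// Builds an interpreter for one compiled segment, bound to one Edge TPU.
std::unique_ptr<tflite::Interpreter> MakeEdgeTpuInterpreter(
    const tflite::FlatBufferModel& model, edgetpu::EdgeTpuContext* context) {
  tflite::ops::builtin::BuiltinOpResolver resolver;
  // Register the Edge TPU custom op used by compiled segments.
  resolver.AddCustom(edgetpu::kCustomOp, edgetpu::RegisterCustomOp());
  std::unique_ptr<tflite::Interpreter> interpreter;
  tflite::InterpreterBuilder(model, resolver)(&interpreter);
  // Bind this interpreter to the given Edge TPU device.
  interpreter->SetExternalContext(kTfLiteEdgeTpuContext, context);
  interpreter->SetNumThreads(1);
  return interpreter;
}

// Builds one interpreter per segment. The returned raw pointers can be passed
// to the PipelinedModelRunner constructor; the contexts, models, and owned
// interpreters must stay alive for as long as the runner uses them.
std::vector<tflite::Interpreter*> BuildPipelineInterpreters(
    const std::vector<std::string>& segment_paths,
    std::vector<std::shared_ptr<edgetpu::EdgeTpuContext>>& contexts,
    std::vector<std::unique_ptr<tflite::FlatBufferModel>>& models,
    std::vector<std::unique_ptr<tflite::Interpreter>>& owned) {
  auto* manager = edgetpu::EdgeTpuManager::GetSingleton();
  const auto devices = manager->EnumerateEdgeTpu();
  const int num_segments = static_cast<int>(segment_paths.size());
  contexts.resize(num_segments);
  models.resize(num_segments);
  owned.resize(num_segments);
  std::vector<tflite::Interpreter*> interpreters(num_segments);
  for (int i = 0; i < num_segments; ++i) {
    // One Edge TPU per segment; segment order must match the compiler output.
    contexts[i] = manager->OpenDevice(devices[i].type, devices[i].path);
    models[i] = tflite::FlatBufferModel::BuildFromFile(segment_paths[i].c_str());
    owned[i] = MakeEdgeTpuInterpreter(*models[i], contexts[i].get());
    interpreters[i] = owned[i].get();
  }
  return interpreters;
}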
Also beware of the following:
- The default allocator does not set memory limits, so if you feed the pipeline new inputs faster than the model can consume them, the allocations could overrun your available memory. You can solve this with a custom allocator that implements your own memory management strategy, such as blocking new input if your pre-allocated memory is full.
- If your system has multiple processes that use the Edge TPUs, then you must provide a central resource manager to be sure that none of the Edge TPUs assigned to pipelining can be suddenly interrupted by a different process.
Add the C++ library to your project
The PipelinedModelRunner
C++ API isn't available as a stand-alone library, so you must build it
with Bazel as a dependency in your project. (If you're new to Bazel, try
building this simple C++
example—the program just
lists the Edge TPU devices available in your system, but offers an example of how to build a project
with Bazel for the Edge TPU.)
You can find the
pipeline API source code here,
and an
example Bazel build rule here
(for the model_pipelining
target).
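If you're adding the pipeline API to your own Bazel workspace, your build rule might look roughly like the sketch below. The dependency labels here are hypothetical placeholders, not verified targets; copy the real ones from the libcoral BUILD files linked above.

cc_binary(
    name = "my_pipeline_app",
    srcs = ["my_pipeline_app.cc"],
    deps = [
        # Hypothetical labels -- check the libcoral BUILD files for the
        # actual pipeline, TensorFlow Lite, and libedgetpu targets.
        "//coral/pipeline:pipelined_model_runner",
        "//coral/pipeline:common",
        "@libedgetpu//tflite/public:edgetpu",
        "@org_tensorflow//tensorflow/lite:framework",
    ],
)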
As an example, you can build the
model_pipelining.cc
example as follows:
git clone https://github.com/google-coral/libcoral.git
cd libcoral
# Build for aarch64 (Coral Dev Board)
make DOCKER_IMAGE=debian:stretch DOCKER_CPUS="aarch64" DOCKER_TARGETS="examples" docker-build
When it finishes (5+ min), you can find the model_pipelining
binary in ./out/aarch64/examples/
.
When you run this example, you must provide 4 arguments, in this order:
- The directory where the model segments are stored
- The original model's filename (without .tflite)
- The number of segments
- The number of inferences to run
For example, if you segment the model inception_v3_quant.tflite
(using the Edge TPU
Compiler) into 3 segments, you can run the example as follows:
./model_pipelining ./model_dir/ inception_v3_quant 3 100
Beware that, although pipelining can work with any model that's compatible with the Edge TPU, not every model can be compiled for the Edge TPU in the first place. For information about model compatibility, read TensorFlow models on the Edge TPU.