Two common distribution problems emerge in robotics systems: distributing computation and distributing communication.

ROS 2 (robotics middleware built on DDS for message passing and more) is purpose-built for the communication problem — coordinating messages between robots, drivers, and software components across network boundaries. It is sometimes applied to the computation problem — parallelizing work inside a single system or process.

In prototyping, this approach can accelerate development. In more mature deployments, however, it can produce architectures where too many processing steps exist as separate nodes. This introduces serialization, network overhead, duplicated runtimes, and additional process management, where lighter concurrency primitives would suffice.

The core observation: the system’s communication topology and the computation graph do not need to be identical. They can often be designed independently, reducing coupling to points where cross-process or cross-machine communication is genuinely required.

The two problems

Understanding this anti-pattern requires distinguishing between two distinct distribution challenges:

Distributed communication. Moving messages between processes, machines, or robots. ROS topics, services, and actions are designed for this; alternatives include ZeroMQ, RabbitMQ, Redis pub/sub, Kafka, or direct DDS for high-throughput or non-robotics applications.

Distributed computation. Splitting work across cores, devices, or machines. This is handled by threading, multiprocessing, native accelerators (GPU, TPU, …), or distributed frameworks (Ray, Dask, Celery, …).

Note that “distributed” is usually reserved for work spread across multiple machines, while threads and processes on a single machine are described as parallelism or concurrency; here we discuss them together because the engineering trade-offs often overlap.

Most systems involve both: communication delivers results across boundaries, while computation accelerates processing within those boundaries. Problems can arise when tools designed for one are applied to the other.

When ROS becomes the default

Faced with a parallel processing task, a tempting default is to implement each stage as a separate ROS node, with communication primitives wired between them:

/sensor_input → /preprocess → /algorithm → /post_process → /output

Costs of this pattern include:

  • Serialization/deserialization and usually copies at each hop
  • Adaptation overhead and glue code for message types
  • Latency even for local operations (IPC ≠ direct call)
  • Duplicated memory/runtime across multiple processes, even when threads could suffice
  • Complexity in debugging and lifecycle management

This illustrates using inter-system communication tools for intra-system computation. ROS 2 mitigations (composable nodes and intra-process transport) reduce per-hop overhead when nodes are colocated and the ROS execution model is followed, but it is still worth asking whether the processing work needs to run through ROS at all.
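
As a minimal sketch of that mitigation, the launch description below colocates two image_proc components in a single container process with intra-process communication enabled; the container name is illustrative, and the plugin names and intra-process flag follow the ROS 2 composition demos (the image_proc package is assumed to be installed):

from launch import LaunchDescription
from launch_ros.actions import ComposableNodeContainer
from launch_ros.descriptions import ComposableNode

def generate_launch_description():
    # Two pipeline stages share one process, so messages between them can
    # use intra-process transport instead of per-hop serialization.
    container = ComposableNodeContainer(
        name="pipeline_container",
        namespace="",
        package="rclcpp_components",
        executable="component_container",
        composable_node_descriptions=[
            ComposableNode(
                package="image_proc",
                plugin="image_proc::DebayerNode",
                extra_arguments=[{"use_intra_process_comms": True}],
            ),
            ComposableNode(
                package="image_proc",
                plugin="image_proc::RectifyNode",
                extra_arguments=[{"use_intra_process_comms": True}],
            ),
        ],
    )
    return LaunchDescription([container])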

Computation tools

Given this distinction, different tools address different aspects of the computation problem. Each offers specific trade-offs in concurrency, isolation, and performance:

AsyncIO — cooperative I/O

Useful for network or file I/O and coordinating multiple streams when callbacks are cooperative. Low scheduler overhead and avoids IPC when used within a single process. Less suitable for CPU-bound Python computation.
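
A minimal sketch of this pattern; read_stream is a placeholder standing in for real network or serial reads, and the names and delays are illustrative:

import asyncio

async def read_stream(name, delay):
    # Placeholder for an awaitable driver or socket read.
    await asyncio.sleep(delay)
    return f"{name}: sample"

async def main():
    # Several I/O-bound reads progress concurrently in one process, with no IPC.
    results = await asyncio.gather(
        read_stream("lidar", 0.05),
        read_stream("imu", 0.01),
        read_stream("camera", 0.10),
    )
    print(results)

asyncio.run(main())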

Threading — blocking I/O and native libraries

Threads enable shared memory access while handling blocking I/O or delegating to native libraries that release the Python GIL. Practical for I/O-heavy tasks and interfacing with C/C++ extensions.
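
A hedged sketch, assuming the bulk of the work happens inside NumPy's native code (which releases the GIL for many operations); actual parallel speedup depends on how much time is spent outside the interpreter:

import concurrent.futures
import numpy as np

def filter_cloud(points: np.ndarray) -> np.ndarray:
    # Heavy NumPy filtering runs largely in native code, which can release the GIL.
    return points[np.linalg.norm(points, axis=1) < 10.0]

clouds = [np.random.rand(100_000, 3) * 20 for _ in range(4)]

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    filtered = list(pool.map(filter_cloud, clouds))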

Multiprocessing — parallelism and isolation

Provides true parallelism for CPU-bound Python code. Offers process-level fault isolation and independent memory spaces.
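
A minimal sketch; estimate_pose is a hypothetical stand-in for CPU-bound pure-Python work. Note that arguments and results are pickled across the process boundary, which matters for large payloads:

import multiprocessing as mp

def estimate_pose(scan):
    # Stand-in for CPU-bound pure-Python work (e.g., scan matching).
    return sum(x * x for x in scan)

if __name__ == "__main__":
    scans = [list(range(10_000)) for _ in range(8)]
    with mp.Pool(processes=4) as pool:
        results = pool.map(estimate_pose, scans)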

GPU acceleration — high-throughput, local compute backend

GPUs accelerate dense, data-parallel kernels such as matrix operations, convolutions, and large point-cloud filters. Practical considerations:

  • Batching and kernel aggregation: Minimize tiny host↔device transfers or kernel launches by combining operations or batching inputs.
  • CUDA streams: Overlap computation and data transfer to improve device utilization. Input for batch n+1 can transfer while batch n executes.
  • Multi-GPU scaling: When a single GPU is insufficient, workloads can be partitioned across multiple GPUs (e.g., DistributedDataParallel in torch), or across nodes using distributed frameworks like Ray.
  • Shared memory / zero-copy: Large intermediate results exchanged with CPU processes or other GPUs benefit from shared memory (multiprocessing.shared_memory, CUDA pinned memory, or zero-copy DDS) to avoid repeated serialization.
  • Broader CUDA ecosystem: Unified Memory simplifies host/device transfer management; NCCL provides efficient multi-GPU communication.

Illustrative pattern:

import torch

def infer(model, batch, device="cuda"):
    # as_tensor avoids an extra host-side copy when batch is already an ndarray/tensor
    x = torch.as_tensor(batch).to(device, non_blocking=True)
    with torch.no_grad():
        y = model(x)
    return y.cpu().numpy()

Core points:

  • Host↔device transfer latency and (de)serialization can dominate unless batching and stream concurrency are used.
  • Multi-GPU or distributed workloads require careful synchronization and memory management.
  • GPU computation often integrates with ROS or other frameworks at defined boundaries; maintaining computation in-process with clear interfaces simplifies coordination.
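
To make the batching point concrete, a sketch that aggregates small inputs into a single pinned-memory transfer per batch; model is assumed to be a preloaded torch module, and the helper name and batch size are illustrative:

import numpy as np
import torch

def infer_batched(model, samples, device="cuda", batch_size=32):
    # Aggregate small inputs so each host->device copy and kernel launch
    # amortizes over many samples.
    outputs = []
    for i in range(0, len(samples), batch_size):
        chunk = torch.as_tensor(np.stack(samples[i:i + batch_size])).pin_memory()
        x = chunk.to(device, non_blocking=True)  # async copy from pinned host memory
        with torch.no_grad():
            outputs.append(model(x).cpu())
    return torch.cat(outputs).numpy()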

Ray / Dask — cluster-scale computation

Used for multi-node scaling, distributed scheduling, and resilient execution. Appropriate when workloads exceed a single machine or require advanced scheduling/fault tolerance.
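
A minimal Ray sketch, assuming a local installation (ray.init() with no arguments starts or connects to a local instance); the task body and chunk sizes are placeholders:

import ray

ray.init()  # local by default; pass an address to join an existing cluster

@ray.remote
def process_chunk(chunk):
    # Placeholder for CPU-heavy work that Ray schedules across cores or machines.
    return sum(chunk)

futures = [process_chunk.remote(list(range(i, i + 1000))) for i in range(0, 4000, 1000)]
results = ray.get(futures)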

ROS — system integration and cross-language wiring

ROS is most appropriate for hardware abstraction, driver stacks, multi-robot coordination, and interoperable message formats. Internal computation stages benefit from other concurrency primitives unless process isolation or robotics-specific tools are required.

Selection criteria

These tools address different computational contexts:

  • ROS: integration surfaces, drivers, cross-process messaging
  • AsyncIO: I/O-heavy pipelines inside a process
  • Threading: blocking I/O or native libraries that release the GIL
  • Multiprocessing: CPU-bound Python, isolation, multi-core scaling
  • GPU: dense kernels, batched inference, heavy ops (minimize host↔device transfers)
  • Ray/Dask: cluster-scale computation and scheduling

The performance characteristics of each approach help guide selection:

Performance trade-offs

  • In-process calls: microseconds → sub-millisecond (depends on the function).
  • Local serialized IPC (ROS/DDS): sub-millisecond → multiple milliseconds per hop; depends on QoS, message size, and transport.
  • Host↔device transfers: typically tens to hundreds of microseconds per transfer on PCIe; can be higher on lower-end buses. Batching can amortize cost.
  • Memory: multiple processes duplicate runtime and loaded libraries; a single process with threads/multiprocessing or ROS composition can save tens to hundreds of megabytes per service.

Actual numbers depend heavily on DDS vendor, QoS settings, transport (shared memory vs UDP), message size, and hardware. These ranges are indicative rather than guarantees; profile your particular stack.

Practical patterns

These performance characteristics and tool capabilities suggest particular implementation patterns:

Communication outside, computation inside

Subsystems define communication boundaries; internal computation can use the most appropriate primitive.

import asyncio

class ProcessingSystem:
    def __init__(self):
        # ComputeCore and ROSInterface are placeholders for project-specific classes
        self.compute_core = ComputeCore()    # threads/processes/GPU
        self.ros_interface = ROSInterface()  # marshal/unmarshal ROS messages

    async def run(self):
        async with asyncio.TaskGroup() as tg:  # Python 3.11+
            tg.create_task(self.compute_core.process())
            tg.create_task(self.ros_interface.publish())

Communication nodes = architectural boundaries

Grouping by responsibility rather than pipeline stages:

/sensors     # drivers and raw aggregation
/processing  # heavy compute and inference
/planning    # decision-making
/control     # actuator commands and low-latency loops

Map internal concurrency to the problem inside each node:

  • sensors: async streams / threads
  • processing: GPU + batching or multiprocessing
  • planning: single-process optimization (or Ray if distributed across machines)
  • control: low-latency threads or a dedicated real-time process

This approach scales from single machines to larger deployments. For multi-machine or multi-robot systems, container orchestration becomes relevant at the system level.

Orchestration boundaries

Docker and Kubernetes manage containerized nodes, scheduling workloads across machines and enabling scaling or fault tolerance. These tools excel at system-level deployment concerns: distributing /sensors, /processing, /planning, and /control nodes across a fleet of robots or a compute cluster.

It’s worth noting that heavy orchestrators like Kubernetes are mostly used in cloud robotics or enterprise backends; embedded and real-time robots often can’t run them at all.

The anti-pattern emerges when orchestration tools are applied to fine-grained computation tasks that belong within a single process. Containerizing every step of a processing pipeline (where direct function calls or threads would suffice) introduces unnecessary serialization, network latency, and operational complexity.

The boundary principle applies here: use orchestration at system boundaries to manage node lifecycles and resource allocation, while internal computation leverages threads, async, multiprocessing, or GPU parallelism within those boundaries.

Context matters

Different deployment contexts favor different strategies:

  • Edge devices (limited resources) — Keep process counts low; ROS overhead is significant on embedded boards.
  • Development environments (rapid iteration) — More nodes may help with debugging and modularity; process isolation eases restarts.
  • Mature deployments (production systems) — Optimize for reliability, performance, and efficiency; consolidate over-distributed graphs.
  • Real-time systems (deterministic timing) — Favor direct calls and lightweight threading; avoid (local) network hops in control loops.

Note that sometimes the overhead of separate processes is intentional: isolating a flaky perception stack or ML model in its own process can prevent crashes from propagating to critical planners or controllers. Furthermore, ROS 2 executors are not hard real-time safe; deterministic control loops are usually isolated in dedicated RT threads or processes outside the ROS executor.

Integration considerations

When mixing communication and computation approaches, several implementation details become important:

  • ROS 2 intra-process / composition: reduces serialization overhead but ties execution to ROS executors and lifecycle; most effective when nodes are colocated in a single process. Note that ROS 2 intra-process transport does not always achieve true zero-copy; behavior depends on message mutability and middleware support.
  • rclpy and the GIL: MultiThreadedExecutor allows concurrent callbacks, but only one Python thread executes at a time; CPU-bound work is typically offloaded to processes, native extensions, or GPU kernels. These limitations are specific to rclpy; C++ nodes avoid the GIL but still face the same broader architectural trade-offs between threads, processes, and IPC.
  • Mixing models (asyncio ↔ threads ↔ processes ↔ ROS): feasible but increases complexity; boundaries benefit from explicit documentation.
  • ROS in a background thread with asyncio: rclpy.spin can run in a thread with callbacks forwarded to the asyncio loop via asyncio.run_coroutine_threadsafe (see the sketch after this list).
  • Blocking calls in asyncio: handled by wrapping calls with await loop.run_in_executor(None, blocking_ros_call, args) to maintain responsiveness.
  • MultiThreadedExecutor: prevents long-running callbacks from stalling others.
  • Bulk data transfer: shared memory or zero-copy DDS reduces overhead for large images, point clouds, or tensors.
  • ROS at system boundaries: keeps core compute ROS-agnostic, isolating framework dependencies at the edges.
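
A hedged sketch of the background-thread bridge mentioned above, assuming rclpy and a std_msgs/String topic named chatter; the node name and queue are illustrative:

import asyncio
import threading

import rclpy
from rclpy.node import Node
from std_msgs.msg import String

async def main():
    rclpy.init()
    node = Node("asyncio_bridge")
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()

    def on_msg(msg):
        # Runs in the ROS spin thread; hand the payload to the asyncio loop.
        asyncio.run_coroutine_threadsafe(queue.put(msg.data), loop)

    node.create_subscription(String, "chatter", on_msg, 10)

    # Spin ROS in a background thread so the asyncio loop stays responsive.
    threading.Thread(target=rclpy.spin, args=(node,), daemon=True).start()

    while True:
        print(await queue.get())

asyncio.run(main())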

Migration considerations

When refactoring existing distributed systems, the objective is to reduce unnecessary communication overhead while preserving system boundaries that serve architectural purposes:

  • Profile latency, copies, memory per process, and host↔device traffic
  • Identify nodes that share large mutable state or exchange frequent large payloads, and consider merging them
  • Preserve external APIs (topics/services) where clients rely on them
  • Verify timing-sensitive control loops after consolidation
  • Consider composition/intra-process transport or a single process with clear concurrency primitives and documented boundaries

The overarching goal: use communication at boundaries and select computation primitives appropriate to in-process workloads.