Skip to main content

ROS 2 Architecture: Understanding Distributed Robotics Systems

Introduction

Welcome to the world of distributed robotics! In this section, you'll learn the fundamental architecture that powers ROS 2 (Robot Operating System 2), the leading middleware for building modern robotic systems. Unlike traditional monolithic programs where all functionality exists in a single process, ROS 2 embraces a distributed architecture where independent components communicate over a network.

This design philosophy isn't arbitrary—it mirrors the inherent complexity of robotics. A humanoid robot doesn't have a single "brain" that controls everything. Instead, it has sensors (cameras, IMUs, force sensors), actuators (motors), decision-making systems (AI agents), and safety monitors—all working simultaneously. ROS 2 provides the infrastructure to coordinate these diverse components.

Why Distributed Architecture for Robotics?

The Monolithic Problem

Imagine writing a single Python script that:

  • Reads data from 20 cameras at 30 FPS
  • Processes sensor fusion from IMUs and force sensors
  • Runs AI inference for object detection
  • Controls 30+ motors with real-time constraints
  • Monitors battery levels and safety limits
  • Logs data to disk

This monolithic approach creates several problems:

  1. Complexity Explosion: A single file with thousands of lines becomes unmaintainable
  2. No Fault Isolation: If one component crashes, the entire system dies
  3. Difficult Testing: Can't test individual components in isolation
  4. Poor Resource Utilization: Can't distribute computation across multiple cores/machines
  5. Rigid Deployment: Can't easily move components to different hardware

The ROS 2 Solution: Nodes as Building Blocks

ROS 2 solves this by breaking systems into nodes—independent processes that each handle a specific responsibility. For example:

  • Camera Node: Captures and publishes image data
  • Object Detector Node: Subscribes to images, runs AI inference, publishes detections
  • Motion Planner Node: Receives detections and current robot state, outputs motion commands
  • Motor Controller Node: Subscribes to motion commands, controls hardware

Each node is:

  • Independent: Runs in its own process with isolated memory
  • Single-Purpose: Does one thing well (Unix philosophy)
  • Replaceable: Can swap implementations without changing other nodes
  • Testable: Can be tested in isolation with mock data

Core Communication Paradigms

ROS 2 provides three primary communication patterns, each suited for different use cases:

1. Topics (Publish-Subscribe Pattern)

Use Case: Continuous streaming data where multiple consumers might be interested.

How It Works:

  • Publishers send messages to named topics (e.g., /camera/image)
  • Subscribers listen to topics and receive messages via callbacks
  • Asynchronous: Publishers don't wait for subscribers
  • Many-to-Many: Multiple publishers and subscribers can share a topic

Real-World Examples:

  • Camera streaming images (1 publisher → many subscribers)
  • Sensor data (joint positions, IMU readings, battery status)
  • Robot telemetry (position, velocity, diagnostics)

Key Characteristics:

  • Fire-and-Forget: Publishers don't know if anyone is listening
  • Latest-Value Semantics: Subscribers typically process the most recent message
  • High Frequency: Ideal for data published at 10Hz, 100Hz, or faster

2. Services (Request-Response Pattern)

Use Case: Blocking request-response interactions where you need an answer.

How It Works:

  • Service Server: Waits for requests, processes them, returns responses
  • Service Client: Sends request, blocks until response is received
  • Synchronous: Client waits for server
  • One-to-One: Each request gets exactly one response

Real-World Examples:

  • Triggering a calibration routine (request: "calibrate", response: "success/failure")
  • Querying robot state (request: "get current position", response: position data)
  • Resetting a component (request: "reset", response: "done")

Key Characteristics:

  • Blocking: Client waits for server to respond
  • Transactional: Guaranteed request-response pairing
  • Lower Frequency: Ideal for occasional interactions

3. Actions (Long-Running Tasks)

Use Case: Tasks that take significant time and need progress feedback.

How It Works:

  • Action Server: Accepts goals, executes them asynchronously, sends feedback and results
  • Action Client: Sends goals, receives periodic feedback, can cancel goals
  • Asynchronous with Feedback: Client doesn't block but gets progress updates
  • Cancelable: Client can abort in-progress actions

Real-World Examples:

  • Navigation: "Go to position X" (feedback: current distance, result: "arrived" or "failed")
  • Grasping: "Pick up object" (feedback: gripper position, result: "grasped" or "failed")
  • Motion execution: "Execute trajectory" (feedback: % complete, result: success/failure)

Key Characteristics:

  • Goal-Oriented: Represents a task to accomplish
  • Preemptable: Can be canceled mid-execution
  • Stateful: Provides feedback during execution

Comparison Table: Topics vs Services vs Actions

FeatureTopicsServicesActions
CommunicationPub-SubRequest-ResponseGoal-Feedback-Result
BlockingNoYesNo (async)
FrequencyHigh (continuous)Low (on-demand)Low (tasks)
FeedbackNoNoYes (progress)
CancelableN/AN/AYes
Use CaseStreaming dataQuick queriesLong-running tasks
ExampleSensor dataGet battery %Navigate to waypoint

The Data Distribution Service (DDS) Layer

Unlike ROS 1, which used a custom TCP-based protocol, ROS 2 is built on top of DDS (Data Distribution Service)—an industry-standard middleware used in aerospace, defense, and industrial automation.

What is DDS?

DDS is a peer-to-peer communication standard that provides:

  • Discovery: Nodes automatically find each other on the network (no central master)
  • Quality of Service (QoS): Fine-grained control over reliability, latency, durability
  • Real-Time Capable: Designed for systems with strict timing requirements
  • Security: Built-in authentication, encryption, access control

Benefits for Robotics

  1. No Single Point of Failure: Unlike ROS 1's master node, ROS 2 has no central coordinator
  2. Multi-Robot Systems: Easier to build systems with multiple robots communicating
  3. Network Flexibility: Works across Ethernet, Wi-Fi, shared memory, etc.
  4. Quality of Service: Can prioritize critical data (e.g., safety messages) over telemetry

DDS in Action (Simplified)

When you publish a message in ROS 2:

Your Code → rclpy → DDS Implementation → Network → DDS Implementation → rclpy → Subscriber Code

The DDS layer handles:

  • Serialization (converting Python objects to bytes)
  • Network transport (UDP, TCP, or shared memory)
  • Discovery (finding subscribers)
  • Reliability (resending lost packets if QoS requires it)

ROS 1 vs ROS 2: Key Improvements

If you've encountered ROS 1 (used in many older tutorials), here's what changed in ROS 2:

AspectROS 1ROS 2
ArchitectureMaster-based (single point of failure)Peer-to-peer (distributed discovery)
MiddlewareCustom TCPROS protocolDDS (industry standard)
Real-TimeLimited supportReal-time capable with DDS QoS
SecurityNone (plaintext, no auth)DDS security (encryption, access control)
PlatformsLinux only (practically)Linux, Windows, macOS
PythonPython 2.7 (obsolete)Python 3.6+
Build Systemcatkinament/colcon
LifecycleSimple start/stopManaged lifecycle nodes

Key Takeaway: ROS 2 was redesigned from the ground up to be production-ready, not just for research.

Node Lifecycle (Managed Nodes)

ROS 2 introduces lifecycle management for nodes, allowing fine-grained control over startup and shutdown. While not required for basic nodes, managed lifecycle nodes have explicit states:

  1. Unconfigured: Node exists but isn't ready
  2. Inactive: Node is configured but not processing data
  3. Active: Node is fully operational
  4. Finalized: Node is shutting down

Benefits:

  • Controlled initialization (e.g., "configure" then "activate")
  • Graceful shutdown (clean up resources)
  • Runtime reconfiguration (deactivate → reconfigure → activate)

When to Use:

  • Production systems where startup order matters
  • Nodes that manage hardware (need clean shutdown)
  • Systems requiring dynamic reconfiguration

For Learning: We'll start with simple nodes and introduce lifecycle management later.

Quality of Service (QoS) Profiles

QoS allows you to tune communication behavior for different needs:

Common QoS Settings:

  • Reliability: Best-effort (fast, may drop messages) vs Reliable (guarantees delivery)
  • Durability: Volatile (only current subscribers get messages) vs Transient-Local (new subscribers get last message)
  • History: Keep-Last-N (buffer N messages) vs Keep-All (unlimited buffer)

Example Use Cases:

  • Sensor Data: Best-effort, keep-last-10 (OK to drop old data, prioritize freshness)
  • Commands: Reliable, keep-last-1 (critical commands must arrive)
  • Initialization Data: Reliable, transient-local (new nodes get last configuration)

Default: Most beginners use default QoS (reliable, volatile, keep-last-10) which works well for learning.

Putting It All Together: A Simple System

Imagine a simple robot system:

[Camera Node] --/image--> [Object Detector] --/detections--> [Motion Planner] --/cmd_vel--> [Motor Controller]
^
|
[Service: /reset_planner]

Communication Breakdown:

  1. Camera → Detector: Topic (continuous image stream)
  2. Detector → Planner: Topic (detections as they occur)
  3. Planner → Motors: Topic (continuous velocity commands)
  4. Reset Service: Service (one-time reset request)

Why This Design:

  • Each node is independently testable (can feed mock images to detector)
  • Fault isolation (if detector crashes, camera and motors keep running)
  • Replaceable components (can swap detector algorithm without touching planner)
  • Parallel execution (all nodes run simultaneously on different CPU cores)

Summary

ROS 2 provides a distributed architecture for building complex robotic systems by:

  1. Decomposing systems into nodes: Independent, single-purpose processes
  2. Three communication patterns: Topics (streaming), Services (request-response), Actions (tasks)
  3. Built on DDS: Industry-standard middleware with discovery, QoS, and security
  4. Improved from ROS 1: No master node, real-time capable, multi-platform, production-ready
  5. Lifecycle management: Fine-grained control over node states (optional)
  6. Quality of Service: Tunable reliability, durability, and history settings

In the next section, we'll move from concepts to practice: creating your first ROS 2 package.

Review Questions

  1. What is the main advantage of ROS 2's distributed architecture over a monolithic program?

    Details

    Answer Fault isolation (components can fail independently), easier testing (test nodes in isolation), better resource utilization (parallel execution), and flexibility in deployment.

  2. When would you use a Service instead of a Topic?

    Details

    Answer When you need a synchronous request-response interaction (e.g., querying data, triggering a one-time action) rather than continuous streaming. Services guarantee a response, while topics are fire-and-forget.

  3. What is the role of DDS in ROS 2?

    Details

    Answer DDS is the middleware layer that handles peer-to-peer discovery, network communication, serialization, and Quality of Service. It eliminates the need for ROS 1's central master node.

  4. Give an example of when you would use an Action instead of a Service.

    Details

    Answer For long-running tasks that need progress feedback and/or cancellation capability. Example: navigating a robot to a goal (you want periodic feedback on distance remaining and the ability to cancel mid-navigation).

  5. What are the three key communication patterns in ROS 2?

    Answer
    1. Topics (publish-subscribe for streaming data), 2) Services (request-response for synchronous queries), 3) Actions (goal-feedback-result for long-running tasks).

Next: Package Development - Learn how to create your first ROS 2 package