ROS 2 Architecture: Understanding Distributed Robotics Systems
Introduction
Welcome to the world of distributed robotics! In this section, you'll learn the fundamental architecture that powers ROS 2 (Robot Operating System 2), a widely used middleware for building modern robotic systems. Unlike traditional monolithic programs where all functionality lives in a single process, ROS 2 embraces a distributed architecture in which independent components communicate by passing messages, whether they run on one machine or across a network.
This design philosophy isn't arbitrary—it mirrors the inherent complexity of robotics. A humanoid robot doesn't have a single "brain" that controls everything. Instead, it has sensors (cameras, IMUs, force sensors), actuators (motors), decision-making systems (AI agents), and safety monitors—all working simultaneously. ROS 2 provides the infrastructure to coordinate these diverse components.
Why Distributed Architecture for Robotics?
The Monolithic Problem
Imagine writing a single Python script that:
- Reads data from 20 cameras at 30 FPS
- Processes sensor fusion from IMUs and force sensors
- Runs AI inference for object detection
- Controls 30+ motors with real-time constraints
- Monitors battery levels and safety limits
- Logs data to disk
This monolithic approach creates several problems:
- Complexity Explosion: A single file with thousands of lines becomes unmaintainable
- No Fault Isolation: If one component crashes, the entire system dies
- Difficult Testing: Can't test individual components in isolation
- Poor Resource Utilization: Can't distribute computation across multiple cores/machines
- Rigid Deployment: Can't easily move components to different hardware
The ROS 2 Solution: Nodes as Building Blocks
ROS 2 solves this by breaking systems into nodes: independent units of computation, typically separate processes, that each handle a specific responsibility. For example:
- Camera Node: Captures and publishes image data
- Object Detector Node: Subscribes to images, runs AI inference, publishes detections
- Motion Planner Node: Receives detections and current robot state, outputs motion commands
- Motor Controller Node: Subscribes to motion commands, controls hardware
Each node is:
- Independent: Typically runs in its own process with isolated memory (ROS 2 can also compose several nodes into one process when lower overhead matters)
- Single-Purpose: Does one thing well (Unix philosophy)
- Replaceable: Can swap implementations without changing other nodes
- Testable: Can be tested in isolation with mock data
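The fault-isolation property can be demonstrated with the standard library alone. The sketch below is not ROS 2 code: it launches two throwaway Python processes (the names `detector` and `camera` are purely illustrative) and shows that a crash in one leaves the other unaffected.

```python
import subprocess
import sys


def run_as_process(source):
    """Run a Python snippet in its own OS process; return (exit code, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()


# Hypothetical "nodes": the detector crashes, the camera does its job.
detector_result = run_as_process("raise RuntimeError('detector crashed')")
camera_result = run_as_process("print('camera still publishing')")

print("detector exit code:", detector_result[0])
print("camera result:", camera_result)
```

In a monolithic script the same exception would have taken the camera code down with it. Here the failure is confined to one process and shows up only as a non-zero exit code.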
Core Communication Paradigms
ROS 2 provides three primary communication patterns, each suited for different use cases:
1. Topics (Publish-Subscribe Pattern)
Use Case: Continuous streaming data where multiple consumers might be interested.
How It Works:
- Publishers send messages to named topics (e.g., /camera/image)
- Subscribers listen to topics and receive messages via callbacks
- Asynchronous: Publishers don't wait for subscribers
- Many-to-Many: Multiple publishers and subscribers can share a topic
Real-World Examples:
- Camera streaming images (1 publisher → many subscribers)
- Sensor data (joint positions, IMU readings, battery status)
- Robot telemetry (position, velocity, diagnostics)
Key Characteristics:
- Fire-and-Forget: Publishers don't know if anyone is listening
- Latest-Value Semantics: Subscribers typically care about the most recent message; older messages may be dropped once the queue is full
- High Frequency: Ideal for data published at 10Hz, 100Hz, or faster
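To make the publish-subscribe semantics concrete, here is a minimal in-process model. It is a toy under stated assumptions, not the rclpy API: `ToyTopicBus` and its methods are invented for illustration, and callbacks run inline rather than being delivered through middleware and an executor.

```python
from collections import defaultdict


class ToyTopicBus:
    """In-process stand-in for topic-based publish-subscribe (not the rclpy API)."""

    def __init__(self):
        self._callbacks = defaultdict(list)  # topic name -> subscriber callbacks

    def subscribe(self, topic, callback):
        self._callbacks[topic].append(callback)

    def publish(self, topic, message):
        # Fire-and-forget: nothing is returned, and publishing to a topic
        # with no subscribers is not an error.
        for callback in self._callbacks.get(topic, []):
            callback(message)


bus = ToyTopicBus()
detector_inbox, logger_inbox = [], []

# One publisher, several subscribers on the same topic.
bus.subscribe("/camera/image", detector_inbox.append)
bus.subscribe("/camera/image", logger_inbox.append)

bus.publish("/camera/image", "frame-0")
bus.publish("/battery", 0.87)  # nobody is listening; the message is simply dropped

print(detector_inbox, logger_inbox)
```

The publisher never learns who received `frame-0`, or whether anyone did. That decoupling is what lets subscribers be added or removed without touching the publisher.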
2. Services (Request-Response Pattern)
Use Case: Blocking request-response interactions where you need an answer.
How It Works:
- Service Server: Waits for requests, processes them, returns responses
- Service Client: Sends a request, then waits for the response (rclpy offers both a blocking call() and a future-based call_async())
- Synchronous: Client waits for server
- One-to-One: Each request gets exactly one response
Real-World Examples:
- Triggering a calibration routine (request: "calibrate", response: "success/failure")
- Querying robot state (request: "get current position", response: position data)
- Resetting a component (request: "reset", response: "done")
Key Characteristics:
- Blocking: Client waits for server to respond
- Transactional: Each response is paired with the request that triggered it
- Lower Frequency: Ideal for occasional interactions
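The request-response flow can be modeled in a few lines. As before, this is a hedged sketch rather than real ROS 2 code: `ToyServiceRegistry`, the `/calibrate` name, and the request and response dictionaries are all assumptions made for the example.

```python
class ToyServiceRegistry:
    """In-process stand-in for request-response services (not the rclpy API)."""

    def __init__(self):
        self._handlers = {}  # service name -> handler function

    def advertise(self, name, handler):
        if name in self._handlers:
            raise ValueError(f"service {name!r} already has a server")
        self._handlers[name] = handler

    def call(self, name, request):
        # The caller waits here until the handler returns its response.
        if name not in self._handlers:
            raise LookupError(f"no server for service {name!r}")
        return self._handlers[name](request)


def handle_calibrate(request):
    # Hypothetical handler: succeed only for a sensor name we recognise.
    return {"success": request["sensor"] == "imu"}


services = ToyServiceRegistry()
services.advertise("/calibrate", handle_calibrate)

response = services.call("/calibrate", {"sensor": "imu"})
print(response)
```

Unlike a topic, the caller gets exactly one answer back, and a missing server is an error the caller has to handle rather than a silent drop.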
3. Actions (Long-Running Tasks)
Use Case: Tasks that take significant time and need progress feedback.
How It Works:
- Action Server: Accepts goals, executes them asynchronously, sends feedback and results
- Action Client: Sends goals, receives periodic feedback, can cancel goals
- Asynchronous with Feedback: Client doesn't block but gets progress updates
- Cancelable: Client can abort in-progress actions
Real-World Examples:
- Navigation: "Go to position X" (feedback: current distance, result: "arrived" or "failed")
- Grasping: "Pick up object" (feedback: gripper position, result: "grasped" or "failed")
- Motion execution: "Execute trajectory" (feedback: % complete, result: success/failure)
Key Characteristics:
- Goal-Oriented: Represents a task to accomplish
- Preemptable: Can be canceled mid-execution
- Stateful: Provides feedback during execution
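The goal, feedback, result, and cancel roles can be sketched with a plain function. This is a simplified model, not the rclpy action API: `navigate` and its parameters are hypothetical, and the loop runs synchronously, whereas a real action client would not block while the goal executes.

```python
import threading


def navigate(goal_distance, step, on_feedback, cancel_requested):
    """Toy action: move toward a goal, report distance remaining, honor cancel."""
    remaining = goal_distance
    while remaining > 0:
        if cancel_requested.is_set():
            return "canceled"
        remaining = max(0.0, remaining - step)
        on_feedback(remaining)  # periodic progress update
    return "arrived"


# Case 1: run to completion, collecting feedback along the way.
feedback_log = []
result = navigate(3.0, 1.0, feedback_log.append, threading.Event())

# Case 2: the client cancels as soon as the first feedback message arrives.
cancel = threading.Event()
partial_log = []


def cancel_on_first_feedback(remaining):
    partial_log.append(remaining)
    cancel.set()


canceled_result = navigate(3.0, 1.0, cancel_on_first_feedback, cancel)

print(result, feedback_log)
print(canceled_result, partial_log)
```

The server checks for cancellation between steps. A long-running task has to cooperate in the same way for a cancel request to take effect.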
Comparison Table: Topics vs Services vs Actions
| Feature | Topics | Services | Actions |
|---|---|---|---|
| Communication | Pub-Sub | Request-Response | Goal-Feedback-Result |
| Blocking | No | Yes | No (async) |
| Frequency | High (continuous) | Low (on-demand) | Low (tasks) |
| Feedback | No | No | Yes (progress) |
| Cancelable | N/A | N/A | Yes |
| Use Case | Streaming data | Quick queries | Long-running tasks |
| Example | Sensor data | Get battery % | Navigate to waypoint |
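The table can be condensed into a rule of thumb. The helper below is only an illustration of that reasoning: `choose_pattern` and its two flags are invented here, and real designs weigh more factors, such as QoS and data rate.

```python
def choose_pattern(needs_reply, long_running):
    """Suggest a ROS 2 communication pattern from two coarse questions."""
    if long_running:
        return "action"   # progress feedback and cancellation matter
    if needs_reply:
        return "service"  # quick query with exactly one answer
    return "topic"        # streaming data, no reply expected


print(choose_pattern(needs_reply=False, long_running=False))  # sensor data
print(choose_pattern(needs_reply=True, long_running=False))   # get battery %
print(choose_pattern(needs_reply=True, long_running=True))    # navigate to waypoint
```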
The Data Distribution Service (DDS) Layer
Unlike ROS 1, which used a custom TCP-based protocol, ROS 2 is built on top of DDS (Data Distribution Service)—an industry-standard middleware used in aerospace, defense, and industrial automation.
What is DDS?
DDS is a peer-to-peer communication standard that provides:
- Discovery: Nodes automatically find each other on the network (no central master)
- Quality of Service (QoS): Fine-grained control over reliability, latency, durability
- Real-Time Capable: Designed for systems with strict timing requirements
- Security: Built-in authentication, encryption, access control
Benefits for Robotics
- No Single Point of Failure: Unlike ROS 1's master node, ROS 2 has no central coordinator
- Multi-Robot Systems: Easier to build systems with multiple robots communicating
- Network Flexibility: Works across Ethernet, Wi-Fi, shared memory, etc.
- Quality of Service: Can prioritize critical data (e.g., safety messages) over telemetry
DDS in Action (Simplified)
When you publish a message in ROS 2:
Your Code → rclpy → DDS Implementation → Network → DDS Implementation → rclpy → Subscriber Code
The DDS layer handles:
- Serialization (converting Python objects to bytes)
- Network transport (UDP, TCP, or shared memory)
- Discovery (finding subscribers)
- Reliability (resending lost packets if QoS requires it)
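The serialization step is the easiest one to show in isolation. The sketch below uses Python's `struct` module as a stand-in: the pose message and its layout are assumptions for illustration, whereas the CDR wire format that DDS actually uses is generated from message definitions.

```python
import struct

# Hypothetical 2D pose message: x and y in metres, heading in radians,
# encoded as three little-endian 64-bit floats.
POSE_FORMAT = "<ddd"


def serialize_pose(x, y, theta):
    """Convert a pose to the bytes that would travel over the transport."""
    return struct.pack(POSE_FORMAT, x, y, theta)


def deserialize_pose(payload):
    """Rebuild the pose on the subscriber side."""
    return struct.unpack(POSE_FORMAT, payload)


payload = serialize_pose(1.5, -2.0, 0.25)
pose = deserialize_pose(payload)

print(len(payload), pose)
```

Because both sides agree on the layout in advance, the subscriber can rebuild the message without sharing any memory with the publisher.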
ROS 1 vs ROS 2: Key Improvements
If you've encountered ROS 1 (used in many older tutorials), here's what changed in ROS 2:
| Aspect | ROS 1 | ROS 2 |
|---|---|---|
| Architecture | Master-based (single point of failure) | Peer-to-peer (distributed discovery) |
| Middleware | Custom TCPROS protocol | DDS (industry standard) |
| Real-Time | Limited support | Real-time capable with DDS QoS |
| Security | None (plaintext, no auth) | DDS security (encryption, access control) |
| Platforms | Linux only (practically) | Linux, Windows, macOS |
| Python | Python 2.7 (Python 3 only in Noetic) | Python 3.6+ |
| Build System | catkin | ament/colcon |
| Lifecycle | Simple start/stop | Managed lifecycle nodes |
Key Takeaway: ROS 2 was redesigned from the ground up to be production-ready, not just for research.
Node Lifecycle (Managed Nodes)
ROS 2 introduces lifecycle management for nodes, allowing fine-grained control over startup and shutdown. While not required for basic nodes, managed lifecycle nodes have explicit states:
- Unconfigured: Node exists but isn't ready
- Inactive: Node is configured but not processing data
- Active: Node is fully operational
- Finalized: Node is shutting down
Benefits:
- Controlled initialization (e.g., "configure" then "activate")
- Graceful shutdown (clean up resources)
- Runtime reconfiguration (deactivate → reconfigure → activate)
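The states and transitions can be written down as a small state machine. This is a simplified sketch, not the rclpy lifecycle API: `ToyLifecycleNode` is invented for illustration, and it leaves out the intermediate transition states and the error handling that real managed nodes have.

```python
# Allowed transitions: (current state, transition) -> next state.
TRANSITIONS = {
    ("unconfigured", "configure"): "inactive",
    ("inactive", "activate"): "active",
    ("active", "deactivate"): "inactive",
    ("inactive", "cleanup"): "unconfigured",
    ("unconfigured", "shutdown"): "finalized",
    ("inactive", "shutdown"): "finalized",
    ("active", "shutdown"): "finalized",
}


class ToyLifecycleNode:
    """Tracks the primary lifecycle state and rejects invalid transitions."""

    def __init__(self):
        self.state = "unconfigured"

    def trigger(self, transition):
        key = (self.state, transition)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {transition!r} from state {self.state!r}")
        self.state = TRANSITIONS[key]
        return self.state


node = ToyLifecycleNode()
history = [
    node.trigger(name)
    for name in ("configure", "activate", "deactivate", "activate", "shutdown")
]
print(history)
```

Trying to activate a node that was never configured raises an error. That is the kind of startup-ordering mistake lifecycle management is meant to catch.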
When to Use:
- Production systems where startup order matters
- Nodes that manage hardware (need clean shutdown)
- Systems requiring dynamic reconfiguration
For Learning: We'll start with simple nodes and introduce lifecycle management later.
Quality of Service (QoS) Profiles
QoS allows you to tune communication behavior for different needs:
Common QoS Settings:
- Reliability: Best-effort (fast, may drop messages) vs Reliable (guarantees delivery)
- Durability: Volatile (only current subscribers get messages) vs Transient-Local (new subscribers get last message)
- History: Keep-Last-N (buffer N messages) vs Keep-All (unlimited buffer)
Example Use Cases:
- Sensor Data: Best-effort, keep-last-10 (OK to drop old data, prioritize freshness)
- Commands: Reliable, keep-last-1 (critical commands must arrive)
- Initialization Data: Reliable, transient-local (new nodes get last configuration)
Default: The default QoS profile (reliable, volatile, keep-last-10) works well for learning, so most beginners start with it.
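History depth and durability are easier to grasp with a small model. The class below is a toy, not the rclpy `QoSProfile` API: `ToyQosTopic` is a hypothetical name, reliability is not modeled, and late joiners receive whatever the keep-last buffer still holds.

```python
from collections import deque


class ToyQosTopic:
    """Models keep-last-N history and volatile vs transient-local durability."""

    def __init__(self, depth, transient_local=False):
        self._history = deque(maxlen=depth)  # keep-last-N: oldest samples fall out
        self._transient_local = transient_local
        self._subscribers = []

    def publish(self, message):
        self._history.append(message)
        for callback in self._subscribers:
            callback(message)

    def subscribe(self, callback):
        # Transient-local: replay retained samples to a late-joining subscriber.
        if self._transient_local:
            for message in self._history:
                callback(message)
        self._subscribers.append(callback)


# Initialization data: keep-last-1, transient-local.
config_topic = ToyQosTopic(depth=1, transient_local=True)
config_topic.publish("map-v1")
config_topic.publish("map-v2")
late_config = []
config_topic.subscribe(late_config.append)  # joins after both publishes

# Telemetry: keep-last-10, volatile.
telemetry_topic = ToyQosTopic(depth=10)
telemetry_topic.publish("t0")
late_telemetry = []
telemetry_topic.subscribe(late_telemetry.append)  # misses "t0"
telemetry_topic.publish("t1")

print(late_config, late_telemetry)
```

The late configuration subscriber still receives the newest retained sample. The volatile telemetry subscriber only sees messages published after it joined.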
Putting It All Together: A Simple System
Imagine a simple robot system:
[Camera Node] --/image--> [Object Detector] --/detections--> [Motion Planner] --/cmd_vel--> [Motor Controller]
                                                                    ^
                                                                    |
                                                        [Service: /reset_planner]
Communication Breakdown:
- Camera → Detector: Topic (continuous image stream)
- Detector → Planner: Topic (detections as they occur)
- Planner → Motors: Topic (continuous velocity commands)
- Reset Service: Service (one-time reset request)
Why This Design:
- Each node is independently testable (can feed mock images to detector)
- Fault isolation (if detector crashes, camera and motors keep running)
- Replaceable components (can swap detector algorithm without touching planner)
- Parallel execution (nodes run concurrently and can be spread across CPU cores or machines)
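The testability and replaceability points can be sketched without any ROS 2 machinery. Everything below is hypothetical: the detector, the planner, and the list-of-pixels "image" are stand-ins chosen to keep the example short, and plain function calls replace the /image, /detections, and /cmd_vel topics.

```python
def threshold_detector(image):
    """Hypothetical detector: report indices of pixels brighter than 200."""
    return [index for index, pixel in enumerate(image) if pixel > 200]


def simple_planner(detections):
    """Hypothetical planner: stop if anything was detected, else drive forward."""
    return {"linear": 0.0} if detections else {"linear": 0.5}


def run_pipeline(image, detector, planner):
    # Mirrors /image -> detector -> /detections -> planner -> /cmd_vel.
    return planner(detector(image))


# Test the detector alone by feeding it a mock image.
mock_image = [10, 250, 30, 220]
detections = threshold_detector(mock_image)

# Run the whole chain on a clear scene and on the mock scene.
cmd_clear = run_pipeline([10, 20, 30], threshold_detector, simple_planner)
cmd_blocked = run_pipeline(mock_image, threshold_detector, simple_planner)

# Swap in a different detector without touching the planner.
cmd_swapped = run_pipeline(mock_image, lambda image: [], simple_planner)

print(detections, cmd_clear, cmd_blocked, cmd_swapped)
```

Because each stage only depends on the shape of the data it receives, a stage can be tested or replaced on its own. Topics give ROS 2 nodes the same property.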
Summary
ROS 2 provides a distributed architecture for building complex robotic systems by:
- Decomposing systems into nodes: Independent, single-purpose processes
- Three communication patterns: Topics (streaming), Services (request-response), Actions (tasks)
- Built on DDS: Industry-standard middleware with discovery, QoS, and security
- Improved from ROS 1: No master node, real-time capable, multi-platform, production-ready
- Lifecycle management: Fine-grained control over node states (optional)
- Quality of Service: Tunable reliability, durability, and history settings
In the next section, we'll move from concepts to practice: creating your first ROS 2 package.
Review Questions
- What is the main advantage of ROS 2's distributed architecture over a monolithic program?
  Answer: Fault isolation (components can fail independently), easier testing (test nodes in isolation), better resource utilization (parallel execution), and flexibility in deployment.
- When would you use a Service instead of a Topic?
  Answer: When you need a synchronous request-response interaction (e.g., querying data, triggering a one-time action) rather than continuous streaming. Services guarantee a response, while topics are fire-and-forget.
- What is the role of DDS in ROS 2?
  Answer: DDS is the middleware layer that handles peer-to-peer discovery, network communication, serialization, and Quality of Service. It eliminates the need for ROS 1's central master node.
- Give an example of when you would use an Action instead of a Service.
  Answer: For long-running tasks that need progress feedback and/or cancellation capability. Example: navigating a robot to a goal (you want periodic feedback on distance remaining and the ability to cancel mid-navigation).
- What are the three key communication patterns in ROS 2?
  Answer: 1) Topics (publish-subscribe for streaming data), 2) Services (request-response for synchronous queries), 3) Actions (goal-feedback-result for long-running tasks).