nav2_ros_common: add lifecycle-managed Subscription wrapper #5834
Lotusymt wants to merge 37 commits into
Thanks for checking! This was just a draft attempt, and I haven’t addressed the CI issues yet. I’ll follow up after discussing the proposed approach for this issue. |
4ae2015 to 1be146b
|
I think this is ready to be applied to all subscriptions now! I still have to verify that below is correct, but otherwise LGTM. Can you clarify where you found this + the |
|
Hi @SteveMacenski, this is following ROS 2 core (rclcpp) behavior/pattern:
|
|
OK sounds good! I think the remaining bit is to update for all the servers :-) |
|
@Lotusymt any update? I would love to get this moving |
@SteveMacenski sorry for the slow update, I misinterpreted “apply to all servers” as “move on to the remaining server interfaces.” I believe the “apply everywhere” step is already done, as remaining |
|
Don't we need to activate all the subscriptions in |
Yes, for now they (all interfaces) are activated/deactivated at call sites. Per the #5298 issue description:
So my understanding is:
|
|
Hi @SteveMacenski, just wanted to gently follow up here in case this got buried. Please let me know if I missed anything. |
|
Sorry, I've been really slammed and haven't been able to clear all the github comments I need to respond to each day. Sorry that this was one that was temporarily deprioritized. I was traveling and this one was hard to review on my small laptop screen. My bad that it was just Q&A, I thought it was going to be another full review of many files :-) Ooops.
To start, let's do it manually, as we do for publishers and action servers. The ordering here is really important for bringup stability and for avoiding segfaults, so I want that to be thoughtfully done.
I would want to change to use the auto activate part once we have pub/sub/service/service clients/action clients all (maybe action server; maybe not yet since that's a special child). |
|
Hi Steve, just want to let you know I am actively working on it and here's a little update. I managed to pass all tests except for some in nav2_system_test. Seems like some activation deadlock issue? Hopefully I can solve it soon : ) |
|
Ready for me now? :-) |
void on_activate() override
{
  rclcpp_lifecycle::SimpleManagedEntity::on_activate();
}

void on_deactivate() override
{
  rclcpp_lifecycle::SimpleManagedEntity::on_deactivate();
}
These don't need to be overridden if they just call the base class, no? :-)
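To illustrate the review point, here is a minimal sketch of why the two overrides are redundant. `ManagedEntityBase` and `SubscriptionWrapper` are hypothetical stand-ins written for this example; they are not the rclcpp_lifecycle or nav2 types, and the real base class may carry more state.

```cpp
#include <cassert>

// Stand-in for a managed-entity base class. This is a hypothetical type for
// illustration, not rclcpp_lifecycle::SimpleManagedEntity itself.
class ManagedEntityBase
{
public:
  virtual ~ManagedEntityBase() = default;
  virtual void on_activate() {activated_ = true;}
  virtual void on_deactivate() {activated_ = false;}
  bool is_activated() const {return activated_;}

private:
  bool activated_{false};
};

// No overrides here: the inherited virtual functions are used as-is, which is
// equivalent to overriding them only to call the base implementation.
class SubscriptionWrapper : public ManagedEntityBase
{
};

inline void example_usage()
{
  SubscriptionWrapper sub;
  ManagedEntityBase & entity = sub;  // called through the base interface
  entity.on_activate();
  assert(sub.is_activated());
  entity.on_deactivate();
  assert(!sub.is_activated());
}
```

An override only earns its place once the wrapper needs to do extra work around the base call.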
Not yet, unfortunately qwq. I ran into a few more issues while testing this. For now I’m trying a workaround: for any Transient Local topic, I avoid having the subscription exist in an “inactive” state at all. Concretely, I activate the managed-entity wrapper right before creating the underlying rclcpp::Subscription, so the subscription is effectively “active” for its entire lifespan and we don’t risk losing the latched/transient sample while the wrapper is inactive. My reasoning:
Even with this change, I’m still seeing failures in some system tests. That makes me suspect there may be other subscriptions that also need to be processed earlier, but I’m still tracking down exactly which ones and why. Does this workaround sound reasonable to you? Any suggestions would be really appreciated!! |
|
@fujitatomoya a question for you based on the last comment - you don't need to read the whole thread: Is there a way for the subscription callback to reject a message for a later delivery or create a subscriber object on the network but not add it to be serviced by the executor yet?

We're trying to implement lifecycle versions of the subscriptions/services/clients since they don't exist in rclcpp_lifecycle and we're having a problem with how to indicate "No! I'm not active, don't even try it" for services and subscriptions. Services will just happily accept a response of a default constructed result if we exit the callback when the node is not active, and we have problems with subscriptions that are transient local since the delivery will happen only once. Streaming data being dropped (like a sensor) is fine, but this is a case where it's not fine. We have long been able to do this with our Simple Action Server since there's an explicit "reject goal" API for actions so the client knows that its request was thrown out.

I don't think that is sensible for services/subscriptions, but I feel like there should be a way to construct them and connect them on bringup without servicing them until active. This seems like a gap in the available API that after some digging I couldn't see how to address. But you're more in these details so I'm curious about your thoughts :-) As always, I don't really want to maintain my own version of things that have general purpose; happy to have these all donated to rclcpp_lifecycle (and simple action server to rclcpp_action) :-) |
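The transient-local problem described in this comment can be reproduced with a toy model that uses no ROS at all. `LatchedTopic` and `GatedSubscriber` are hypothetical names for this sketch. It assumes the latched sample is offered once, when the subscriber connects, and that the wrapper gates inside the callback.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Toy model of a transient-local (latched) topic: the stored sample is
// offered exactly once to a subscriber, at the moment it connects.
class LatchedTopic
{
public:
  using Callback = std::function<void(const std::string &)>;

  void publish(const std::string & msg)
  {
    last_ = msg;
    has_last_ = true;
  }

  // Connecting delivers the stored sample once; there is no redelivery.
  void connect(const Callback & callback) const
  {
    if (has_last_) {
      callback(last_);
    }
  }

private:
  std::string last_;
  bool has_last_{false};
};

// Subscriber that exists from construction but gates its callback on a flag.
struct GatedSubscriber
{
  bool active{false};
  int received{0};

  void on_message(const std::string &)
  {
    if (!active) {
      return;  // dropped: the latched sample is not offered again
    }
    ++received;
  }
};

inline void example_usage()
{
  LatchedTopic map_topic;
  map_topic.publish("map");

  GatedSubscriber sub;
  // Connected while inactive, so the one-time delivery is thrown away.
  map_topic.connect([&sub](const std::string & m) {sub.on_message(m);});
  sub.active = true;  // activation arrives too late
  assert(sub.received == 0);
}
```

In this model a dropped sensor sample would simply be replaced by the next one, but the latched sample never arrives again, which is the loss the comment is concerned about.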
|
@Lotusymt all things should be processed in active state, with the sole exception of TF which is required to be running beforehand so we can check TF transformations as existing as part of the activation phase.
That is strange and should generally not be true. The nav2 lifecycle manager should bring all nodes into configure before all nodes into active. So by the time something is being activated, they're all configured but should not be processing data (except those already activated beforehand). There should not be dependency on anything processing messages in configure -- and if there are that's a bug we need to fix (again with the exception of TF). Could be ordering of bringup -- but also could be some interaction we should address. Do you know what node(s) or what topic(s) are the offenders? What are the failures? |
|
1st thing's 1st to answer your question.
AFAIK, no... unfortunately, there is no such thing yet... probably what you guys need is something like ros2/rclcpp#2715? IMO LifecycleEntity makes sense for some use cases; there are a few primary states that we need to consider by design.
i think that this is doable... but i am not sure if DDS or zenoh have these managed states internally to be mapped to lifecycle entities of ROS 2. even if they don't, maybe we can land somewhere on "skip to take out the data in inactive state". i may be missing some things here, we obviously need to take some time to consider more details and design before implementation. |
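One way to read the "skip to take out the data in inactive state" idea is sketched below as a plain C++ model. `DeferredTakeSubscription` is a hypothetical class for this sketch, not an rclcpp or rmw API. It assumes the middleware keeps samples queued until they are taken, and it ignores real queue-depth limits.

```cpp
#include <cassert>
#include <deque>
#include <string>
#include <vector>

// Toy model: samples stay in the queue until the executor takes them.
class DeferredTakeSubscription
{
public:
  void on_activate() {active_ = true;}
  void on_deactivate() {active_ = false;}

  // Called by the (modelled) middleware when a sample arrives.
  void enqueue(const std::string & msg) {queue_.push_back(msg);}

  // Called by the (modelled) executor. While inactive nothing is taken,
  // so samples remain queued instead of being consumed and dropped.
  std::vector<std::string> spin_some()
  {
    std::vector<std::string> delivered;
    if (!active_) {
      return delivered;
    }
    while (!queue_.empty()) {
      delivered.push_back(queue_.front());
      queue_.pop_front();
    }
    return delivered;
  }

private:
  bool active_{false};
  std::deque<std::string> queue_;
};

inline void example_usage()
{
  DeferredTakeSubscription sub;
  sub.enqueue("map");                // arrives while inactive
  assert(sub.spin_some().empty());   // not taken, therefore not lost
  sub.on_activate();
  const std::vector<std::string> out = sub.spin_some();
  assert(out.size() == 1 && out[0] == "map");
}
```

Whether DDS or zenoh can back this behavior is exactly the open question raised above; the sketch only shows the intended semantics.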
|
Thanks for the correction! I initially suspected that AMCL’s map_sub couldn’t be handled in the node's on_activate. But I looked into it again, and it should be fine. Maybe I missed something, and the timeout error happened when I added buffering logic for transient-local in the subscription wrapper at that time. I’m currently not able to reliably reproduce the exact same timeout anymore. So I think the next step would be to try again when
is available. |
This suffers from the same issues, no? :( It creates the subscription on creation -- so if you send a message when the node is in the inactive state, it'll be received by the callback (even though it's rejected for processing, if it's transient local it will not be reattempted later, resulting in meaningful loss). Though @Lotusymt note the use of I think you pointed out the exact thing we need in This doesn't just impact subscriptions, but also services. Clients and publishers are easy because they can just come up and then just check a flag so that, if someone asks them to do something, they simply say "no". Subscriptions and services need something to say "no" before we try to execute a callback. I think as such there should be something to create an interface, have it discoverable, but not serviceable until it's said to be ready.
Either of these (and I imagine more) would work. |
|
IMO (i believe everybody agrees...) this feature should be implemented in the core base classes like |
|
OK - for now @Lotusymt I think we should create it on the active state to get past the issue (even if not ideal) for now. Do you also mind starting a thread in discourse with this? I could but I want you to get credit for uncovering this gap :-) |
|
Thanks Steve! really appreciate you encouraging me to start the Discourse thread. I submitted the topic in Discourse, but it’s currently pending moderator approval. I’ll post the link here as soon as it’s visible. If there’s anything you’d like me to adjust, please let me know. On the PR side: I also switched the workaround to the |
|
Great - thanks! |
Signed-off-by: lotusymt <mengtiy5@uci.edu> Signed-off-by: lotusymt <mengtiy5@uci.edu>
Signed-off-by: lotusymt luiseyang36@gmail.com Signed-off-by: lotusymt <mengtiy5@uci.edu>
Signed-off-by: lotusymt <mengtiy5@uci.edu>
Signed-off-by: lotusymt <mengtiy5@uci.edu>
Removed unnecessary CMake module path and include statement. Signed-off-by: Mengting Yang <87471734+Lotusymt@users.noreply.github.com>
Co-authored-by: Steve Macenski <stevenmacenski@gmail.com> Signed-off-by: Mengting Yang <87471734+Lotusymt@users.noreply.github.com>
Co-authored-by: Steve Macenski <stevenmacenski@gmail.com> Signed-off-by: Mengting Yang <87471734+Lotusymt@users.noreply.github.com>
Co-authored-by: Steve Macenski <stevenmacenski@gmail.com> Signed-off-by: Mengting Yang <87471734+Lotusymt@users.noreply.github.com>
Signed-off-by: lotusymt <mengtiy5@uci.edu>
Signed-off-by: lotusymt <mengtiy5@uci.edu>
Signed-off-by: lotusymt <mengtiy5@uci.edu>
Signed-off-by: lotusymt <mengtiy5@uci.edu>
Signed-off-by: lotusymt <mengtiy5@uci.edu>
Signed-off-by: lotusymt <mengtiy5@uci.edu>
Per maintainer feedback on PR ros-navigation#5834 (SteveMacenski, 2026-03-13: "I'd say all"), redesign nav2::Subscription so the underlying rclcpp::Subscription is constructed in on_activate() and released in on_deactivate(), for all subscriptions (not just transient_local). The previous LifecycleSubscription override that gated handle_message() and the transient_local auto-activate hack are removed: while inactive, no DDS endpoint exists and no callback can fire.

- subscription.hpp: drop LifecycleSubscription, store callback/qos/options and a weak ref to the topics interface; build the rclcpp::Subscription via SubscriptionFactory inside on_activate(); reset it in on_deactivate().
- test_subscription_latched: explicitly activate the subscription before publishing (the old "transient_local auto-activates at init" guarantee is gone by design).
- Revert geojson churn in nav2_route/graphs/{aws_graph,sample_graph}.geojson (unrelated to this PR per maintainer review).

Signed-off-by: lotusymt <mengtiy5@uci.edu>
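The create-on-activate design described in this commit message can be sketched roughly as follows. This is a ROS-free model under stated assumptions: `Endpoint` and `CreateOnActivateSubscription` are hypothetical names, the stored QoS, options, and weak topics-interface reference are left out, and a plain `std::shared_ptr` stands in for the real subscription handle.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>

// Stand-in for the underlying subscription handle (hypothetical type).
struct Endpoint
{
  std::string topic;
  std::function<void(const std::string &)> callback;
};

class CreateOnActivateSubscription
{
public:
  using Callback = std::function<void(const std::string &)>;

  // Only the construction arguments are stored; no endpoint exists yet.
  CreateOnActivateSubscription(std::string topic, Callback callback)
  : topic_(std::move(topic)), callback_(std::move(callback))
  {
    assert(static_cast<bool>(callback_));
  }

  // Build the endpoint on activation. A repeated call is a no-op.
  void on_activate()
  {
    if (!endpoint_) {
      endpoint_ = std::make_shared<Endpoint>(Endpoint{topic_, callback_});
    }
  }

  // Release the endpoint so nothing can be delivered while inactive.
  void on_deactivate() {endpoint_.reset();}

  bool is_activated() const {return endpoint_ != nullptr;}

  // Models delivery: returns false when there is no endpoint to deliver to.
  bool deliver(const std::string & msg) const
  {
    if (!endpoint_) {
      return false;
    }
    endpoint_->callback(msg);
    return true;
  }

private:
  std::string topic_;
  Callback callback_;
  std::shared_ptr<Endpoint> endpoint_;
};
```

Because the endpoint does not exist while inactive, the wrapper needs no gating logic inside the callback in this model.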
PR ros-navigation#5834 redesigned nav2::Subscription to construct the underlying rclcpp::Subscription only at on_activate() and release it at on_deactivate() (no auto-activate-on-init for transient_local). Two production call sites still depended on the old behavior:

* StaticLayer::activate() skipped on_activate() for transient_local map subscriptions, relying on them being created already-active. Now they are not, so the topic endpoint never appeared and Costmap2DROS hung in tests like test_collision_checker / inflation_tests / plugin_container_tests waiting for the latched /map message.
* MapSaver::saveMapTopicToFile() had the same conditional skip for transient_local map subscriptions, which silently never delivered.

Signed-off-by: lotusymt <mengtiy5@uci.edu>
…igation#5834

Tests that built nav2::Subscriptions via test fixtures were depending on the pre-redesign behavior where the underlying rclcpp::Subscription existed at construction. Now subscriptions exist only after on_activate(), which the test fixtures bypassed because they invoked the on_configure()/on_activate() override callbacks directly rather than using the lifecycle state machine -- so the auto-activate-if-active branch in nav2::LifecycleNode::create_subscription never fired.

* test_costmap_filter_info_server, test_vector_object_server: explicit subscription_->on_activate() after construction so the tester actually receives the latched message the server publishes on activation.
* test_costmap_subscriber: drive the lifecycle node through configure() / activate() so its managed entities -- including those added internally by CostmapSubscriber via lc_node->create_subscription -- are activated together. Removes the redundant per-sub on_activate calls.
* test_costmap_2d_publisher: defer LayerSubscriber's executor thread start until after the underlying subscription has been built in on_activate(); also activate the LayerSubscriber before activating the costmap, so the subscription is discovered before mapUpdateLoop publishes its first latched costmap message (avoids relying on transient_local replay, which is mid-rewrite in rmw_fastrtps_cpp 9.4.7).

Signed-off-by: lotusymt <mengtiy5@uci.edu>
LifecycleServiceClient's constructor blocks on wait_for_service() in a loop. If the rcl context is invalidated mid-construction (e.g. the process receives SIGINT before the dependent service comes up), both wait_for_service() and Rate::sleep() return immediately, which spins the loop tight and prevents the process from exiting.

Surfaced by test_costmap_subscriber_exec: after the gtest binary now exits in ~50ms (post-PR-ros-navigation#5834 fixes), the launch wrapper SIGINTs a lifecycle_manager whose constructor was still in this loop, and the ctest timeout fires while the manager refuses to die.

Add an rclcpp::ok() check to the loop condition so context shutdown breaks construction cleanly.

Signed-off-by: lotusymt <mengtiy5@uci.edu>
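The loop fix in this commit message has roughly the following shape. It is a sketch under stated assumptions: `wait_for_service_or_shutdown` is a hypothetical helper, the two callables stand in for the wait_for_service() and rclcpp::ok() checks named above, and the sleep between polls is omitted.

```cpp
#include <cassert>
#include <functional>

// Polls `service_ready` until it reports true or `context_ok` reports false.
// Returns true only if the service became available. Without the
// `context_ok` check in the loop condition, a service that never appears
// would keep this loop running forever.
inline bool wait_for_service_or_shutdown(
  const std::function<bool()> & service_ready,
  const std::function<bool()> & context_ok)
{
  while (context_ok()) {
    if (service_ready()) {
      return true;
    }
  }
  return false;
}

inline void example_usage()
{
  int polls = 0;
  // The service never comes up; the context is invalidated after three polls.
  const bool found = wait_for_service_or_shutdown(
    [&polls]() {++polls; return false;},
    [&polls]() {return polls < 3;});
  assert(!found);
  assert(polls == 3);
}
```

The caller still has to handle the `false` result, since construction finished without the service being available.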
2cffbe8 to 990927f
Previous algorithm_build hit CircleCI's 60-minute job timeout while still making progress (5/15 packages complete, 4 still building). All other CI signals — core_build, jazzy, kilted, pre-commit, linters, DCO — passed on the same code. Signed-off-by: lotusymt <mengtiy5@uci.edu>
Round 1 (#57229): algorithm_build hit 60-min timeout at 5/15 packages, likely cold caches. Round 2 (#57233): algorithm_build hit 60-min timeout at 13/15 packages (2 still building when killed) — caching from #57229 helped. This round expects to finish: caching from #57233 should push us under the limit. Also retries build-docker (jazzy)/(kilted), which both failed with GitHub-runner→archive.ubuntu.com connection timeouts (pure network infrastructure issue, unrelated to this PR's code). Signed-off-by: lotusymt <mengtiy5@uci.edu>
Signed-off-by: lotusymt <mengtiy5@uci.edu>
836bf0f to 73f7ff0
…uring ACTIVATING
Two related bugs surfaced in CI system tests after switching to the
create-on-activate / destroy-on-deactivate Subscription wrapper:
1. nav2::LifecycleNode::create_{subscription,publisher} only auto-activated
when the node was already in PRIMARY_STATE_ACTIVE. Costmap filters
create their mask subscription lazily inside filterInfoCallback, which
fires from the parent costmap node's internal executor *during* the
on_activate transition (state == TRANSITION_STATE_ACTIVATING). The
wrapper was registered as a managed entity but never received
on_activate(), so the underlying rclcpp subscription was never created
and the filter mask never arrived — failing test_keepout_filter,
test_speed_filter_global, test_speed_filter_local.
Extend the auto-activate predicate to also fire while the node is
ACTIVATING. The wrapper's own on_activate() is idempotent so a later
explicit activation by user code remains a no-op.
2. nav2_costmap_2d::CostmapSubscriber owns two nav2::Subscription wrappers
that are added as managed entities of its LifecycleNode parent at
construction time, but it exposed no way to activate them. RouteServer
constructs a CostmapSubscriber in on_configure and then never wired
it up — the costmap topic was never subscribed, leaving "No costmap
yet received!" in test_route.
Add on_activate()/on_deactivate() to CostmapSubscriber that fan out to
the wrapped subscriptions, and call them from RouteServer's lifecycle
callbacks.
Signed-off-by: lotusymt <mengtiy5@uci.edu>
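The predicate change in item 1 of this commit message amounts to the following. This is a sketch with a simplified, hypothetical `State` enum, not the lifecycle_msgs state IDs, and the function names are made up for illustration.

```cpp
#include <cassert>

// Simplified subset of lifecycle states (names are illustrative only).
enum class State {Unconfigured, Configuring, Inactive, Activating, Active};

// Old predicate: auto-activate only entities created on an already-active node.
inline bool should_auto_activate_old(State s)
{
  return s == State::Active;
}

// Extended predicate: also auto-activate entities created while the node is
// in the middle of its on_activate transition.
inline bool should_auto_activate(State s)
{
  return s == State::Active || s == State::Activating;
}

inline void example_usage()
{
  // A subscription created lazily during ACTIVATING was missed before.
  assert(!should_auto_activate_old(State::Activating));
  assert(should_auto_activate(State::Activating));
  // Entities created during configuration still wait for explicit activation.
  assert(!should_auto_activate(State::Configuring));
}
```

This relies on the wrapper's on_activate() being idempotent, as the commit message states, so a later explicit activation does no harm.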
…pSubscribers

After the create-on-activate Subscription wrapper, two more places were silently leaving lifecycle subscriptions unactivated:

1. nav2_behaviors::BehaviorServer creates local_costmap_sub_ and global_costmap_sub_ in on_configure (CostmapSubscribers wrapping nav2::Subscription) but only activated the *footprint* subscriptions in on_activate. Test assisted_teleop_behavior surfaced this as a stream of "Costmap is not available" errors. Activate (and symmetrically deactivate) the costmap subscribers.
2. nav2_route route_server's plugin-owned CostmapSubscribers (the ones CollisionMonitor and CostmapScorer allocate themselves when their costmap topic differs from the server topic) had no path to be activated; RouteServer only owns the shared subscriber. Surfaced as test_route's "Collision Monitor could not obtain a costmap from topic: local_costmap/costmap_raw". Add virtual activate()/deactivate() hooks to RouteOperation and EdgeCostFunction (default no-op), override them in CollisionMonitor and CostmapScorer to fan out to the subscriber they own (tracked via owns_costmap_subscriber_), and forward through OperationsManager, EdgeScorer, RouteTracker, and RoutePlanner so that RouteServer::on_activate/on_deactivate wakes the whole plugin chain.

Signed-off-by: lotusymt <mengtiy5@uci.edu>
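The hook fan-out described in item 2 follows a common pattern: default no-op virtual hooks on the plugin base, overrides only in the plugins that own a subscriber, and a manager that forwards to all of them. The classes below are hypothetical stand-ins for this sketch, not the nav2_route types.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Plugin base with default no-op hooks (hypothetical stand-in).
class Operation
{
public:
  virtual ~Operation() = default;
  virtual void activate() {}
  virtual void deactivate() {}
};

// A plugin that owns its own subscriber overrides the hooks.
class CostmapOwningOperation : public Operation
{
public:
  void activate() override {subscriber_active_ = true;}
  void deactivate() override {subscriber_active_ = false;}
  bool subscriber_active() const {return subscriber_active_;}

private:
  bool subscriber_active_{false};
};

// The manager forwards lifecycle calls to every plugin it holds.
class OperationsManager
{
public:
  void add(std::shared_ptr<Operation> op) {ops_.push_back(std::move(op));}

  void activate()
  {
    for (auto & op : ops_) {
      op->activate();
    }
  }

  void deactivate()
  {
    for (auto & op : ops_) {
      op->deactivate();
    }
  }

private:
  std::vector<std::shared_ptr<Operation>> ops_;
};

inline void example_usage()
{
  auto owning = std::make_shared<CostmapOwningOperation>();
  OperationsManager manager;
  manager.add(std::make_shared<Operation>());  // keeps the default no-op hooks
  manager.add(owning);
  manager.activate();
  assert(owning->subscriber_active());
  manager.deactivate();
  assert(!owning->subscriber_active());
}
```

Plugins without their own subscriber need no changes, because the base hooks do nothing by default.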
ament_uncrustify flagged the single-line for-loop bodies introduced in the previous commit. Expand them to a multi-line form. Signed-off-by: lotusymt <mengtiy5@uci.edu>
… setup

The create-on-activate Subscription wrapper left LoopbackSimulator's initial_pose_sub_ and cmd_vel_sub_ dormant: both are created in on_configure (state CONFIGURING) so the framework auto-activate doesn't fire, and on_activate didn't call on_activate() on them explicitly. Symptom: test_loopback_simulator's InitialPoseSetsMapToOdom / CmdVelMovesRobot / OdometryContainsTwist / RotationUpdatesYaw failed because the simulator never received either /initialpose or /cmd_vel. Add the missing on_activate/on_deactivate fan-out.

ClockPublisherTest used a raw nav2::LifecycleNode that was never configured/activated, so subscriptions created via node_->create_subscription() now stay inactive and miss every /clock message. Drive the test node to ACTIVE in SetUp() so subscriptions auto-activate as designed.

Signed-off-by: lotusymt <mengtiy5@uci.edu>
|
@SteveMacenski I am very sorry for the delay, it was more complicated than I expected and I was kind of occupied by something else. Please take a look when you have time and let me know if there's anything wrong. Thanks :D |
Basic Info

Description of contribution in a few bullet points
- nav2::Subscription wrapper (similar to nav2::Publisher) using rclcpp_lifecycle::SimpleManagedEntity, so subscriptions can participate in lifecycle transitions.
- … is_activated() and emit a one-time warning when messages are received before activation (messages are dropped until activated).
- … rclcpp::Node (no lifecycle manager to trigger activation), the wrapper auto-activates so behavior matches existing non-lifecycle usage.
- … test_actions slightly to make the callback type/signature match the new subscription wrapper expectations (no new tests added).

Description of documentation updates required from your changes

Description of how this change was tested
- … nav2_ros_common and ran its test suite via colcon test (and verified results with colcon test-result).

Future work that may be required in bullet points
- LifecycleNode::activateInterfaces() ordering work (exporters before consumers; deactivate in reverse), and then remove the manual activation boilerplate in task servers.
- … nav2::Publisher).

For Maintainers:
- … backport-*.
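The ordering mentioned under future work (exporters before consumers, deactivation in reverse) could look roughly like this. `InterfaceRegistry` is a hypothetical helper for this sketch, not the planned LifecycleNode::activateInterfaces() implementation, and it assumes callers register interfaces in the desired activation order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records interfaces in registration order, which doubles as activation order.
class InterfaceRegistry
{
public:
  // Register exporters first and consumers after them.
  void add(const std::string & name) {names_.push_back(name);}

  // Activation walks the list front to back.
  std::vector<std::string> activation_order() const {return names_;}

  // Deactivation walks it back to front, so consumers stop first.
  std::vector<std::string> deactivation_order() const
  {
    return std::vector<std::string>(names_.rbegin(), names_.rend());
  }

private:
  std::vector<std::string> names_;
};

inline void example_usage()
{
  InterfaceRegistry registry;
  registry.add("costmap_publisher");   // exporter (illustrative name)
  registry.add("costmap_subscriber");  // consumer (illustrative name)
  assert(registry.activation_order().front() == "costmap_publisher");
  assert(registry.deactivation_order().front() == "costmap_subscriber");
}
```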