fix model name delivery and log more info #1745

Open
Harold-lkk wants to merge 7 commits into InternLM:agent_dev from Harold-lkk:lkk/fix_r3
Conversation

@Harold-lkk
Member

No description provided.

if self.lmdeploy_actor is None:
self.lmdeploy_actor = ray.get_actor(SHARED_STORE, namespace=SHARED_STORE_NAMESPACE)
assert self.lmdeploy_actor is not None, "LMDeploy actor should be available in the shared store."
routed_experts_ref = self.lmdeploy_actor.get.remote(routed_experts)
Collaborator

@YanhuiDua YanhuiDua May 1, 2026

routed_experts = ray.get(self.lmdeploy_actor.get.remote(routed_experts))
routed_experts_ref = ray.put(routed_experts)

extra_info["routed_experts"] = routed_experts
# Turn 1: materialize tensor and hand ownership to the store.
decoded = self._decode_routed_experts(routed_experts)
if isinstance(decoded, ObjectRef):
Collaborator

@YanhuiDua YanhuiDua May 8, 2026

In _decode_routed_experts, add await _LMDEPLOY_ACTOR.get.remote(routed_experts) and free the obj_ref; at that point what you get back is already the tensor.
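A minimal sketch of the pattern suggested here, written against a plain asyncio stand-in rather than a real Ray actor. `InMemoryStore`, its `put`/`get`/`release` methods, and `decode_routed_experts` are hypothetical names for illustration and are not taken from the PR; the point is only that the decode step returns the materialized payload and frees the stored copy in the same place, so callers never see a reference.

```python
import asyncio


class InMemoryStore:
    """Hypothetical stand-in for the shared routed-experts store actor."""

    def __init__(self):
        self._data = {}

    async def put(self, key, value):
        self._data[key] = value

    async def get(self, key):
        return self._data[key]

    async def release(self, key):
        # Drop the entry; releasing an unknown key is a no-op.
        self._data.pop(key, None)

    def __len__(self):
        return len(self._data)


async def decode_routed_experts(store, key):
    """Resolve a store key to its payload, then free the stored copy."""
    value = await store.get(key)
    await store.release(key)
    return value


async def demo():
    store = InMemoryStore()
    await store.put("turn-1", [[0, 3], [1, 2]])
    decoded = await decode_routed_experts(store, "turn-1")
    return decoded, len(store)


decoded, remaining = asyncio.run(demo())
```

With decoding done this way, downstream code only has to handle the payload type, which is what makes the later ObjectRef branches unnecessary.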

Comment thread xtuner/v1/ray/rollout/worker.py Outdated
# same history key. Store TTL GC handles cleanup.
history_ref = await store.get_ref.remote(history_routed_experts_key)
history_routed_experts = await history_ref
elif isinstance(history_routed_experts_key, ObjectRef):
Collaborator

Then this branch can probably be deleted, since history_routed_experts_key can no longer be an ObjectRef.

Comment thread xtuner/v1/ray/rollout/worker.py Outdated
elif isinstance(history_routed_experts_key, ObjectRef):
history_routed_experts = await history_routed_experts_key
ray.internal.free([history_routed_experts_key], local_only=False)
elif isinstance(history_routed_experts_key, (bytes, bytearray)):
Collaborator

Same as above.

Comment thread xtuner/v1/ray/rollout/worker.py Outdated
history_routed_experts = await history_ref
elif isinstance(history_routed_experts_key, ObjectRef):
history_routed_experts = await history_routed_experts_key
ray.internal.free([history_routed_experts_key], local_only=False)
Collaborator

Why does this still use the ray.internal.free interface rather than the routed_expert_store's release interface?

# tokenize.py base64-decoded to bytes.
history_ref = cloudpickle.loads(history_routed_experts_key)
history_routed_experts = await history_ref
ray.internal.free([history_ref], local_only=False)
Collaborator

It would be best to change every free to call the release interface instead.
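One way to route every free through a single release call is a small helper that wraps fetch and use in try/finally, sketched below. The store here is a hypothetical in-memory stand-in, and `CountingStore`, `consume`, and the `release` signature are assumptions for illustration; the real routed_expert_store interface may differ.

```python
import asyncio


class CountingStore:
    """Hypothetical stand-in store that records which keys were released."""

    def __init__(self, data):
        self._data = dict(data)
        self.released = []

    async def get(self, key):
        return self._data[key]

    async def release(self, key):
        self._data.pop(key, None)
        self.released.append(key)


async def consume(store, key, use):
    """Fetch `key`, pass the value to `use`, and always release the key."""
    try:
        value = await store.get(key)
        return use(value)
    finally:
        # Runs on success and on error, so no call site frees by hand.
        await store.release(key)


async def demo():
    store = CountingStore({"a": [1, 2], "b": [3]})
    total = await consume(store, "a", sum)
    try:
        await consume(store, "b", lambda v: 1 / 0)  # simulated failure
    except ZeroDivisionError:
        pass
    return total, store.released


total, released = asyncio.run(demo())
```

Keeping the release in one helper means the error path frees the entry too, which is easy to miss when each branch calls a free function directly.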

Comment thread xtuner/v1/ray/rollout/controller.py Outdated
# rather than an ObjectRef. Legacy ObjectRef path kept as a
# defensive fallback and encoded the same way as before so older
# clients still decode correctly.
if isinstance(response.extra_info.get("routed_experts"), ray.ObjectRef):
Collaborator

This check can be deleted as well.

@YanhuiDua
Collaborator

I suggest that in xtuner/v1/ray/environment/install_agent_env.py, when rollout_finish_reason == "failed" or result == "Failed", or when rollout_finish_reason == "skipped" or rollout_state in (RolloutState.SKIPPED, "skipped"), the corresponding routed experts be cleaned up.

Then, after training finishes, run one more check on routed_expert_store; I don't seem to see that code at the moment.
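The two suggestions above could look roughly like the sketch below. Everything other than the quoted conditions is an assumption: the store is a plain dict stand-in, `RolloutState` is a hypothetical two-member enum, `should_release` and `leftover_keys` are illustrative names, and the "stop" / "Success" values are made up for the demo.

```python
from enum import Enum


class RolloutState(Enum):
    # Hypothetical stand-in; only SKIPPED is named in the comment above.
    COMPLETED = "completed"
    SKIPPED = "skipped"


def should_release(rollout_finish_reason, result, rollout_state):
    """Mirror the two conditions quoted in the comment above."""
    failed = rollout_finish_reason == "failed" or result == "Failed"
    skipped = rollout_finish_reason == "skipped" or rollout_state in (
        RolloutState.SKIPPED,
        "skipped",
    )
    return failed or skipped


def leftover_keys(store):
    """End-of-training check: list keys that were never released."""
    return sorted(store)


# Plain dict standing in for routed_expert_store (assumption).
store = {"r1": [0, 1], "r2": [2, 3], "r3": [4, 5]}
rollouts = [
    ("r1", "stop", "Success", RolloutState.COMPLETED),
    ("r2", "failed", "Failed", RolloutState.COMPLETED),
    ("r3", "stop", "Success", "skipped"),
]
for key, reason, result, state in rollouts:
    if should_release(reason, result, state):
        store.pop(key, None)  # clean up routed experts for this rollout

remaining = leftover_keys(store)
```

In this toy run only the failed and skipped rollouts are dropped. In a real run, entries for successful rollouts would be released once consumed, so a non-empty `leftover_keys` result at the end of training would point to a leak worth logging.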


2 participants