feat: add ENTITY_ATTRIBUTES support for domain-specific entity enrich…#2986

Draft
akshay-saraswat wants to merge 1 commit into HKUDS:main from Evols-AI:feat/entity-attributes
Allows operators to configure per-entity attribute extraction via the ENTITY_ATTRIBUTES environment variable (JSON list of attribute names), without any schema migrations or breaking changes to existing deployments.

Domain-specific LightRAG deployments often need entities enriched with context beyond name, type, and description. For example:

  • A customer support knowledge graph benefits from sentiment and urgency attributes on its entities.
  • A research graph benefits from confidence scores on extracted claims.

Previously, this required forking the project and hand-editing the extraction prompt; there was no supported mechanism to pass the attribute list through config.

The feature is implemented as a clean opt-in extension to the existing entity-extraction pipeline:

  1. ENTITY_ATTRIBUTES env var (default: empty list) is parsed in config.py and passed through addon_params → extract_entities, the same path used by ENTITY_TYPES.

  2. When the list is non-empty, two placeholders are injected into the extraction prompts:

    • {entity_attributes}: the comma-separated attribute list
    • {entity_attributes_instruction}: a human-readable instruction string telling the LLM to append a compact single-line JSON object as a 5th field on each entity line.

  3. _handle_single_entity_extraction now accepts 4 or 5 fields. The 5th field is parsed as JSON into a dict; malformed JSON is logged and dropped, so extraction still succeeds for the entity itself.

  4. Attributes are merged (last non-null value per key wins) across all extraction instances of the same entity, both in _merge_nodes_then_upsert (normal insert path) and _rebuild_single_entity (rebuild path).

  5. The merged attributes dict is serialised as a JSON string and stored in the graph node under the key 'attributes'. All existing graph backends (NetworkX, Postgres, Neo4j, etc.) can round-trip it without schema changes. Consumers deserialise with json.loads.
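
Steps 3 and 4 can be illustrated with a minimal sketch. This is not the code from the patch: parse_entity_line and merge_attributes are hypothetical names, the returned dict layout is an assumption, and the delimiter is simply copied from the example entity line in this description.

```python
import json
import logging
from typing import Any, Dict, Iterable, Optional

logger = logging.getLogger(__name__)

DELIM = "<|#|>"  # delimiter as it appears in the example entity line


def parse_entity_line(line: str) -> Optional[Dict[str, Any]]:
    """Hypothetical stand-in for the 4-or-5 field parsing described in step 3."""
    fields = line.split(DELIM)
    if len(fields) not in (4, 5) or fields[0].strip() != "entity":
        return None  # not an entity record this sketch understands

    record: Dict[str, Any] = {
        "entity_name": fields[1].strip(),
        "entity_type": fields[2].strip(),
        "description": fields[3].strip(),
        "attributes": {},
    }
    if len(fields) == 5:
        try:
            parsed = json.loads(fields[4])
        except json.JSONDecodeError:
            # Malformed JSON: log it and keep the entity without attributes.
            logger.warning("Dropping malformed attributes for %r", record["entity_name"])
        else:
            if isinstance(parsed, dict):
                record["attributes"] = parsed
            else:
                logger.warning("Attributes for %r are not a JSON object", record["entity_name"])
    return record


def merge_attributes(instances: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Last non-null value per key wins, as described in step 4."""
    merged: Dict[str, Any] = {}
    for attrs in instances:
        for key, value in attrs.items():
            if value is not None:
                merged[key] = value
    return merged
```

With this sketch, merging `{"urgency": "low"}` followed by `{"urgency": None}` keeps "low", because a null value never overwrites an earlier one.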

Entity line with attributes:
entity<|#|>Pain Point X<|#|>painpoint<|#|>Users report...<|#|>{"sentiment":"negative","urgency":"high","confidence":0.87}

Stored node property:
"attributes": "{\"sentiment\": \"negative\", \"urgency\": \"high\", \"confidence\": 0.87}"

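On the consumer side, the stored string has to be decoded before use. The following is a sketch only: the node is modelled as a plain dict, read_attributes is a hypothetical helper, and treating a missing or undecodable value as an empty dict is an assumption, not behaviour confirmed by the patch.

```python
import json
from typing import Any, Dict

merged = {"sentiment": "negative", "urgency": "high", "confidence": 0.87}

# Writer side: the merged dict is stored as a JSON string under "attributes".
node = {"attributes": json.dumps(merged)}


def read_attributes(node: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical consumer helper: decode node["attributes"] defensively."""
    raw = node.get("attributes")
    if not raw:
        return {}  # nodes written before the feature have no such key
    try:
        value = json.loads(raw)
    except (TypeError, ValueError):
        return {}
    return value if isinstance(value, dict) else {}


attrs = read_attributes(node)
print(attrs["urgency"])
```

Nodes created without ENTITY_ATTRIBUTES simply yield an empty dict here, so callers do not need a separate existence check.
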
  • ENTITY_ATTRIBUTES defaults to [] → extraction prompt is byte-for-byte identical to pre-patch behaviour.

  • Parser still accepts 4-field records → existing extracted graphs are unaffected.

  • When ENTITY_ATTRIBUTES is empty, a 5-field record is treated as malformed: the entity is still extracted, but without attributes.
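
The byte-for-byte guarantee can be pictured as an instruction builder that returns an empty string when no attributes are configured. This is a sketch under assumptions: the template text, the instruction wording, and the names build_attributes_instruction and render_prompt are all hypothetical; only the {entity_attributes_instruction} placeholder name comes from this description.

```python
from typing import List

# Hypothetical prompt fragment; the real extraction prompt is much longer.
USER_PROMPT_TEMPLATE = (
    "Extract entities of types [{entity_types}].{entity_attributes_instruction}"
)


def build_attributes_instruction(entity_attributes: List[str]) -> str:
    """Return an empty string when no attributes are configured."""
    if not entity_attributes:
        return ""
    names = ", ".join(entity_attributes)
    return (
        " For each entity, append a compact single-line JSON object"
        " as a 5th field with the keys: " + names + "."
    )


def render_prompt(entity_types: List[str], entity_attributes: List[str]) -> str:
    return USER_PROMPT_TEMPLATE.format(
        entity_types=", ".join(entity_types),
        entity_attributes_instruction=build_attributes_instruction(entity_attributes),
    )
```

Because the placeholder expands to nothing for an empty list, the rendered prompt in this sketch is identical to one rendered from a template that never had the placeholder.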

    ENTITY_ATTRIBUTES='["sentiment","urgency","confidence"]'

Combined with ENTITY_TYPES for a customer support graph:

ENTITY_TYPES='["Person","Product","PainPoint"]'
ENTITY_ATTRIBUTES='["sentiment","urgency","confidence"]'
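
Values like these are JSON lists, so reading them amounts to a json.loads call plus validation. The helper below is a hedged sketch: read_json_list_env is a hypothetical name, and raising ValueError on bad input is an assumption rather than the behaviour of config.py.

```python
import json
import os
from typing import List, Optional


def read_json_list_env(name: str, default: Optional[List[str]] = None) -> List[str]:
    """Hypothetical helper: read an env var holding a JSON list of strings."""
    raw = os.environ.get(name, "").strip()
    if not raw:
        return list(default or [])  # unset or blank means "feature off"
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{name} must be a JSON list, got {raw!r}") from exc
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        raise ValueError(f"{name} must be a JSON list of strings, got {raw!r}")
    return value


# Example: mirrors the shell assignments shown above.
os.environ["ENTITY_ATTRIBUTES"] = '["sentiment","urgency","confidence"]'
entity_attributes = read_json_list_env("ENTITY_ATTRIBUTES")
print(entity_attributes)
```

Failing loudly on a malformed value is one possible design choice; a deployment could equally fall back to the empty default and log a warning.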

Description

  • Adds ENTITY_ATTRIBUTES env var (JSON list of strings, default []) that requests extra per-entity attributes to be extracted alongside name/type/description.
  • When non-empty, injects an instruction into the extraction prompt asking the LLM to append a compact single-line JSON object as a 5th field on each entity line.
  • Parses and merges the 5th field across all extraction instances of the same entity; stored as a JSON string under node["attributes"].
  • Zero change to default behaviour — empty ENTITY_ATTRIBUTES produces byte-identical prompts and 4-field records as before.

Related Issues

N/A

Changes Made

| File | Change |
| --- | --- |
| lightrag/constants.py | DEFAULT_ENTITY_ATTRIBUTES = [] |
| lightrag/api/config.py | Read ENTITY_ATTRIBUTES env var; pass into args.entity_attributes |
| lightrag/api/lightrag_server.py | Pass entity_attributes through addon_params |
| lightrag/prompt.py | Document optional 5th field in system prompt; add {entity_attributes_instruction} to user prompt |
| lightrag/operate.py | Read from addon_params; build instruction string; thread through _process_extraction_result and _handle_single_entity_extraction; merge and store attributes in both insert and rebuild paths |

Checklist

  • Changes tested locally
  • Code reviewed
  • Documentation updated (if necessary)
  • Unit tests added (if applicable)

Additional Notes

N/A

akshay-saraswat marked this pull request as draft on April 29, 2026, 03:51