Removes unnecessary pixelpipe cache invalidation after in-place color space conversion #21251

Open: masterpiga wants to merge 1 commit into darktable-org:master from masterpiga:invalidate
@masterpiga
Collaborator

I went down a bit of a rabbit hole trying to improve the responsiveness of interactive edits, and I stumbled upon these cache invalidations. The analysis is from Claude, and it seems correct to me. @jenshannoschwalm, you certainly know better. Would you please take a look?

Pixelpipe Cache Invalidation After In-Place Colorspace Conversion

Background

Before a module processes its input, darktable checks whether the buffer's current
colorspace (cst_from = input_format->cst) matches the module's required input
colorspace (cst_to = module->input_colorspace(...)). When they differ, the buffer
is converted in-place — the same allocation serves as both source and destination —
to avoid a second large allocation and a full buffer copy.
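
The check described above can be sketched as follows. This is a minimal toy model, not darktable code: `toy_cst_t`, `toy_dsc_t`, `toy_convert_inplace` and `toy_prepare_input` are hypothetical names, and the "conversion" only shifts pixel values so that the mutation is observable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for darktable's colorspace enum and buffer descriptor. */
typedef enum { CS_RAW, CS_RGB, CS_LAB } toy_cst_t;

typedef struct
{
  toy_cst_t cst; /* colorspace the buffer currently holds */
} toy_dsc_t;

/* Toy conversion: shifts every value so the change is visible.
   A real conversion would run colour math here. Source and destination
   are the same allocation, so no second buffer is needed. */
static void toy_convert_inplace(float *buf, size_t n, toy_cst_t from, toy_cst_t to)
{
  assert(buf != NULL);
  const float delta = (float)((int)to - (int)from);
  for(size_t i = 0; i < n; i++) buf[i] += delta;
}

/* Converts only when the buffer's colorspace differs from what the module
   needs. Returns 1 if a conversion ran, 0 if the buffer was usable as-is. */
static int toy_prepare_input(float *buf, size_t n, toy_dsc_t *input_format, toy_cst_t cst_to)
{
  const toy_cst_t cst_from = input_format->cst;
  if(cst_from == cst_to) return 0;
  toy_convert_inplace(buf, n, cst_from, cst_to);
  input_format->cst = cst_to; /* descriptor now matches the buffer content */
  return 1;
}
```

Calling `toy_prepare_input` twice with the same `cst_to` converts on the first call and is a no-op on the second, because the descriptor was updated together with the data.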

Two commits by Hanno introduced cache invalidation after this conversion:

| Commit | Date | Path |
| --- | --- | --- |
| ba97902e6f | 2026-05-23 | CPU path: _pixelpipe_process_on_CPU() |
| 9fbdc4b51e | 2026-05-24 | OpenCL-tiling fallback: _dev_pixelpipe_process_rec() |

Both added dt_dev_pixelpipe_invalidate_cacheline(pipe, input) immediately after the
in-place transform, with the comment:

the cacheline cst does not reflect the module input colorspace any more!
FIXME let's invalidate for now

Both also proposed the correct fix: "implement code for the pipe cache that modifies
the data cst."


The Three Processing Paths

The pixelpipe has three distinct code paths for module execution. The stale-descriptor
concern that motivated the invalidations applies to only two of them.

Path 1 — Pure OpenCL (no bug, never touched)

When a module runs entirely on the GPU, the colorspace conversion is performed on a
GPU buffer (cl_mem_input), never on the CPU cache buffer. The codebase tracks the
evolving GPU-side colorspace in a local shadow variable input_cst_cl,
initialised from input_format->cst at the start of the OpenCL block and then
updated independently:

dt_iop_colorspace_type_t input_cst_cl = input_format->cst;  // local shadow
...
dt_ioppr_transform_image_colorspace_cl(..., input_cst_cl, cst_to, &input_cst_cl, ...);
// input_format->cst (i.e. the CPU cache descriptor) is NOT touched here

The CPU cache is synchronised only at explicit copy-back points, when the GPU buffer
is downloaded to host memory:

input_format->cst = input_cst_cl;  // lines 2476, 2678, 2721, 2753

At those points the CPU buffer genuinely holds the converted data and the cache
descriptor is written to match. If the GPU input is never copied back (the common
case when important_cl_input is false), the CPU cache retains its pre-conversion
data in cst_from with dsc.cst = cst_from — also self-consistent; a future cache
hit for the same hash will trigger a fresh in-place conversion on the next pass, which
is the correct behaviour.

No invalidation was ever added to this path, and none is needed. The FIXME comments
themselves stated this correctly:

"in opencl code path this is different as the in-data colorspace conversion is
always done in cl_mem and thus does not affect pipecache."

Path 2 — CPU processing (bug fixed in ba97902e6f)

_pixelpipe_process_on_CPU() handles both direct CPU processing and CPU tiling.
The in-place conversion mutates the cached CPU buffer and writes the resulting
colorspace back through &input_format->cst. Commit ba97902e6f added the
now-removed invalidation here.

Path 3 — OpenCL-tiling fallback (bug fixed in 9fbdc4b51e)

When module->process_tiling_cl is attempted but fails, the code falls back to CPU
tiling. Before handing control to the CPU tiler it performs an in-place colorspace
conversion on the CPU buffer and writes back through &input_format->cst. Commit
9fbdc4b51e added the now-removed invalidation here.


The Cache Pointer Mechanism

The claim that "the cacheline cst does not reflect the module input colorspace any
more" rests on a misreading of how input_format is wired to the cache.

dt_dev_pixelpipe_cache_get is the function that hands out input buffers. In both
the cache-hit path and the cache-miss (new allocation) path it redirects the caller's
pointer to point directly into the live cache descriptor array:

// cache-miss path (pixelpipe_cache.c:349-351)
cache->dsc[cline] = **dsc;   // copy caller's descriptor into the cache slot
*dsc = &cache->dsc[cline];   // redirect the caller's pointer INTO the cache slot

// cache-hit path (_get_by_hash, pixelpipe_cache.c:284)
*dsc = &cache->dsc[k];       // same redirection on a hit

The caller passes &input_format, so after the call, input_format is no longer a
pointer to a local struct — it points directly into cache->dsc[k]. Every field
write through input_format is a direct write into the live cache descriptor.

dt_ioppr_transform_image_colorspace takes &input_format->cst as its
converted_cst output parameter and writes the resulting colorspace back through it:

// _transform_matrix (iop_profile.c:612)
*converted_cst = cst_to;   // set first; reset to cst_from only on invalid pair

// _transform_lcms2 (iop_profile.c:286)
*converted_cst = cst_to;   // identical pattern

Both inner functions set *converted_cst = cst_to before doing any work. They reset
it to cst_from only when the requested conversion is an outright invalid pair
(e.g. IOP_CS_RAW ↔ IOP_CS_RGB), in which case the buffer is left completely
unchanged — a self-consistent no-op. There is no partial-conversion failure mode:
the image loops are pure arithmetic that either complete fully or do not run at all.
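
The "set first, reset only on an invalid pair" pattern looks roughly like this. It is a toy sketch: `toy_transform` is a hypothetical function, the rule that any pair involving RAW is invalid is an assumption made for the example, and the pixel math is a placeholder scale factor rather than a real RGB/Lab conversion.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { CS_RAW, CS_RGB, CS_LAB } toy_cst_t;

/* In-place toy transform. *converted_cst always ends up describing what
   the buffer actually contains when the function returns. */
static void toy_transform(float *buf, size_t n,
                          toy_cst_t cst_from, toy_cst_t cst_to,
                          toy_cst_t *converted_cst)
{
  assert(buf != NULL && converted_cst != NULL);
  *converted_cst = cst_to; /* set first */
  if(cst_from == cst_to) return;

  if(cst_from == CS_RAW || cst_to == CS_RAW)
  {
    /* Invalid pair (toy rule): leave the buffer untouched and report
       the colorspace it still holds. */
    *converted_cst = cst_from;
    return;
  }

  /* Placeholder arithmetic standing in for the real conversion loop. */
  const float scale = (cst_to == CS_LAB) ? 100.0f : 0.01f;
  for(size_t i = 0; i < n; i++) buf[i] *= scale;
}
```

Passing the address of the descriptor's own `cst` field as `converted_cst` means the descriptor is updated by the same call that mutates the data, so the two cannot drift apart in this model.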

Consequence: after dt_ioppr_transform_image_colorspace returns,
cache->dsc[k].cst already holds the correct value for the buffer's actual content.
The cache descriptor was never stale. Both the data and the metadata were correct
before the invalidation line ran.

This mechanism was already in place when both FIXME commits were authored, confirming
that the invalidations were added under an incorrect assumption.


What dt_dev_pixelpipe_invalidate_cacheline Does

// pixelpipe_cache.c:371-375
static void _mark_invalid_cacheline(const dt_dev_pixelpipe_cache_t *cache, const int k)
{
  cache->hash[k] = DT_INVALID_HASH;
  cache->ioporder[k] = 0;
}

It sets the cacheline's hash to a sentinel that can never match any real pipeline
hash. The buffer allocation and the dsc struct (including its now-correct cst
field) are left intact, but the slot can no longer be found by hash lookup. It
becomes a zombie: occupying a slot in the cache's fixed-size table, counted in
cache->linvalid, and unavailable to any future lookup until LRU eviction eventually
recycles it.


Performance Implications of the (Removed) Invalidation

Module N-1 never benefits from caching. On every pipeline pass, module N
performs an in-place conversion on the buffer that contains module N-1's output, then
immediately invalidates it. The next pass finds DT_INVALID_HASH for what should be
a valid cached result, misses, and forces module N-1 to recompute from scratch. This
repeats indefinitely. Module N-1 gets zero reuse from the cache for the entire
lifetime of the session.
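
A small simulation makes the recompute pattern concrete. It is a deliberately simplified model with a single cacheline for module N-1's output and a hash that never changes, since only module N's parameters are edited. `toy_count_upstream_recomputes` is a hypothetical helper, not a measurement of darktable.

```c
#include <assert.h>
#include <stdint.h>

typedef struct
{
  uint64_t hash;
  int valid;
} toy_line_t;

/* Runs `passes` pipeline passes in which only module N changes.
   Returns how many times module N-1 had to be recomputed. */
static int toy_count_upstream_recomputes(int passes, int invalidate_after_convert)
{
  assert(passes >= 0);
  const uint64_t upstream_hash = 0x1234u; /* N-1's parameters never change */
  toy_line_t line = { 0, 0 };
  int recomputes = 0;

  for(int p = 0; p < passes; p++)
  {
    const int hit = line.valid && line.hash == upstream_hash;
    if(!hit)
    {
      recomputes++; /* module N-1 runs again */
      line.hash = upstream_hash;
      line.valid = 1;
    }
    /* Module N converts its input in place here. With the invalidation
       in place, the cacheline is poisoned right after the conversion. */
    if(invalidate_after_convert) line.valid = 0;
  }
  return recomputes;
}
```

In this model, ten passes cost ten upstream recomputes with the invalidation and one without it.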

Zombie slot accumulation. Each pass produces a fresh zombie: the old cacheline
is invalidated, module N-1 runs again and is assigned a new cacheline (possibly the
same physical slot after LRU eviction, possibly a different one). The
cache->linvalid counter grows proportionally to the number of modules that require a
colorspace conversion — in a typical pipeline this is several modules. The effective
size of the reusable cache is reduced.

Interaction with the important flag. The pipeline marks certain input buffers
as important (higher LRU priority) — the focused module's input, the last-changed
module's input, and modules with IOP_FLAGS_WRITE_PIPECACHE. An important but
invalidated cacheline retains its priority weight but is permanently unreachable. It
both wastes a high-priority slot and fails to serve its purpose.

Cascading recompute cost. If module N-1 is expensive (a denoiser, a complex
tone-mapper), the forced recompute can be the dominant cost of a pipeline pass
triggered only by a downstream parameter change — a change that should have left
module N-1's output entirely untouched.


The Blend Conversion Edge Case

After module N processes its input, a second in-place conversion may follow for blend
masking (pixelpipe_hb.c:1577):

dt_ioppr_transform_image_colorspace(module, input, input,
                                    roi_in->width, roi_in->height,
                                    input_format->cst, blend_cst, &input_format->cst,
                                    work_profile);

This converts the buffer from cst_to (the module's working space) to blend_cst
(the blend operator's working space), again writing back through &input_format->cst.
After this, cache->dsc[k].cst = blend_cst and the buffer content is in blend_cst.

On the next pass (without invalidation), a hash lookup for module N-1's cacheline
hits, returns blend_cst data, and module N then observes cst_from = blend_cst.
This is a different starting point than the first pass, so the analysis splits by
case.

Common case — blend_cst == cst_from: This is the overwhelmingly typical
configuration. Both are IOP_CS_RGB, the universal default. The blend conversion
is IOP_CS_RGB → IOP_CS_RGB, a no-op. The cached buffer is left in cst_to (from
the module's working-space conversion), not further mutated. On the next pass,
cst_from = cst_to, no conversion is needed, and the buffer is used directly. Fully
correct and maximally efficient.

Less common case — blend_cst ≠ cst_from: Consider cst_from = IOP_CS_RGB,
cst_to = IOP_CS_LAB, blend_cst = IOP_CS_RGB. Pass 1 leaves the cache in
IOP_CS_RGB (the blend conversion round-tripped back from LAB to RGB). Pass 2: the
cache returns IOP_CS_RGB data, module N converts RGB → LAB and then blends
LAB → RGB again. The buffer seen by module N on pass 2 is
LAB(RGB(LAB(original))). With lossless conversions this is identical to
LAB(original). With IEEE 754 float32 the error is in the 6th–7th significant digit
— sub-pixel, sub-quantisation. Critically, it does not accumulate: after pass 2 the
cache stabilises in IOP_CS_RGB (post-blend) and every subsequent pass follows the
same RGB → LAB → RGB cycle, producing numerically identical results.
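
To get a feel for the size of a float32 round-trip error, here is a toy experiment. It does not use darktable's conversions: the invertible pair below is just the affine part of the CIE L* formula (L = 116 t - 16), picked as an assumption because it needs no math library. `toy_forward`, `toy_backward` and `toy_roundtrip_error` are hypothetical names.

```c
#include <assert.h>

/* Toy invertible pair standing in for "working space -> blend space -> back". */
static float toy_forward(float t) { return 116.0f * t - 16.0f; }
static float toy_backward(float l) { return (l + 16.0f) / 116.0f; }

/* Applies `cycles` forward/backward round trips to t and returns the
   absolute difference between the final value and the starting value. */
static float toy_roundtrip_error(float t, int cycles)
{
  assert(cycles >= 0);
  float v = t;
  for(int i = 0; i < cycles; i++) v = toy_backward(toy_forward(v));
  const float err = v - t;
  return err < 0.0f ? -err : err;
}
```

For this toy pair a single round trip stays far below 8-bit or 16-bit quantisation steps. The `cycles` parameter lets a reader check how the error behaves over repeated passes for the toy pair only; it says nothing about the behaviour of darktable's actual RGB/Lab code.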

The case blend_cst ≠ cst_from and blend_cst ≠ cst_to (three distinct
colorspaces) is theoretically possible but has no known practical instance in the
darktable module set: all blend modes operate in either IOP_CS_RGB or IOP_CS_LAB,
both of which coincide with at least one of the module's endpoint colorspaces in any
real pipeline.


Remaining Legitimate Invalidations

Two other dt_dev_pixelpipe_invalidate_cacheline(pipe, input) calls remain in the
file and are both correct:

  • Line 1296 (_module_pipe_stop): the pipe is aborting mid-pass; the buffer may
    be partially written or in an inconsistent state and must not be reused.
  • Line 2769 (valid_input_on_gpu_only): the input was computed entirely on the
    GPU and was never downloaded to host memory; the CPU cache slot holds stale data
    from a previous pass and must be poisoned.

These were not changed.


Summary

| Condition | With invalidation | Without invalidation |
| --- | --- | --- |
| Module N-1 cache reuse | Never (forced recompute every pass) | Normal (reused when parameters unchanged) |
| Cache descriptor accuracy | Irrelevant (hash poisoned) | Always correct (written by transform write-back) |
| Zombie cacheline growth | Proportional to passes × converting modules | None |
| Blend round-trip (blend_cst == cst_from, common) | No effect (cache discarded) | No extra conversion |
| Blend round-trip (blend_cst ≠ cst_from, uncommon) | No effect (cache discarded) | One extra float32 round-trip on first post-blend pass; stabilises immediately, negligible precision impact |
| OpenCL path | Unaffected (no invalidation was present) | Unaffected |
| Correctness | Correct | Correct |

The invalidations were introduced under the belief that the cache descriptor was a
stale copy that needed explicit fixup. That belief was incorrect: input_format is a
pointer directly into cache->dsc[k] in both the hit and miss paths of
dt_dev_pixelpipe_cache_get, so the transform function's write-back via
&input_format->cst already keeps the cache self-consistent. Removing the
invalidations restores correct caching behaviour with no correctness cost.

masterpiga added the labels "bugfix" (pull request fixing a bug) and "scope: performance" (doing everything the same but faster) on Jun 6, 2026.
TurboGit added this to the 5.8 milestone on Jun 6, 2026.
TurboGit added the label "feature: enhancement" (current features to improve) and removed the label "bugfix" on Jun 6, 2026.
@TurboGit
Member

TurboGit commented Jun 6, 2026

To me this is not a "fix", so it is not for 5.6. I prefer a speed penalty to unwanted behavior in the pixelpipe.

@jenshannoschwalm
Collaborator

The uncommon blend_cst ≠ cst_from case from "The Blend Conversion Edge Case" is the one I was worried about when I made the invalidation commits. After rethinking and checking the code: the invalidation is not required, and it does hurt on modules that have to correct the colorspace when they run without OpenCL. That doesn't happen a lot, though.

@masterpiga
Collaborator Author

Let's wait until after the release, then. With this and #21243 I get some nice responsiveness improvements.

@jenshannoschwalm
Collaborator

See #21274 ... also explaining why we had these invalidations.
