stdenv: PURL fetcher introduction#454333
Conversation
|
we need to squash all commits later, just for transparency let's keep the commits separated in order to understand what has been changed. |
18f4eea to
016c6e6
Compare
016c6e6 to
c8afdf6
Compare
@wolfgangwalther that link is now dead. This PR was significantly simplified since - would it be possible to do another performance check? How would one do that? |
The performance data is available in every CI run. Just click your way to the details page of the CI jobs listed at the bottom of the PR and scroll your way through. |
Gotcha, for the 'Eval Summary' job, so https://github.com/NixOS/nixpkgs/actions/runs/22830597798?pr=454333 - so no noticeable change in speed, and a size increase of 0.05%-0.3%. |
raboof
left a comment
There was a problem hiding this comment.
Starting to look attractive! Some comments here and there.
|
|
||
| ### Package URL {#sec-meta-identifiers-purl} | ||
|
|
||
| [Package-URL](https://github.com/package-url/purl-spec) (PURL) is a specification to reliably identify and locate software packages. Through identification of software packages, additional (non-major) use cases are e.g. software license cross-verification via third party databases or initial vulnerability response management. Package URL's shall default to the mkDerivation.src, as the original consumed software package is the single point of truth. |
There was a problem hiding this comment.
"Package URL's shall default to the mkDerivation.src" is no longer true 'in nixpkgs', but consuming tools are expected to make this association (at least for now). That seems like a good start to me.
There was a problem hiding this comment.
src or srcs is implementation specific, the tool outside should not try to guess which one is the right approach. The interface is drv.meta.identifiers.purl and the implementation (the drv) should define the "where to search". I would like to keep this
There was a problem hiding this comment.
I am in agreement that we shouldn't push tools to trying to guess. Especially with tools that try to be efficient and can parallelize things, it makes the concurrency more difficult. Imo, it would be great if drv.meta.identifiers.purl defaults to drv.src.meta.identifiers.purl. I know people would be concerned with eval performance.
There was a problem hiding this comment.
@RossComputerGuy it would be helpful that the SBOM team meets and agrees on an approach. CRA is due in september this year and this should get included in nixos 26.05 - at least the reduced fetcher part
There was a problem hiding this comment.
I get that, though I don't think this is something we can or should rush. I want PURL's just as much as anyone else does. However, we should do this properly. 26.05 is close but I am not super confident that it's possible at this point to get PURL's in.
I think my concern about drv.src.meta vs drv.meta is something that's big enough (considering that it impacts how the identifier should be discovered & consumed). I don't think it's a bad idea for the SBOM team to sync up. However, it is difficult to get people to sync up for something like this with these sorts of projects. I do think that there should be more of a discussion and thorough review before we attempt to merge this PR.
There was a problem hiding this comment.
I agree 'deeper' integration would likely be too ambitious for 26.05, but 'just' being able to populate the drv.meta seems feasible and already a helpful step in the right direction, doesn't it?
There was a problem hiding this comment.
Yes, as long as whatever is scanning nixpkgs doesn't have to figure out where the identifiers coming from and can just assume drv.meta like with CPE's, I think that is good. With these sorts of things, usually the first NixOS release doesn't have much use of it. Usually it gets flushed out over the following 2 or 3 releases.
The only blocker I see is not needing to figure out if it's drv.src or drv.srcs for the identifiers. Once that's solved, I don't have any objections to merging this PR.
There was a problem hiding this comment.
populate the drv.meta seems feasible and already a helpful step in the right direction, doesn't it?
@raboof the current PR is feasible. Please keep in mind that despite 95% of purls coming out of the box, it is required to manually maintain the remaining ones. This is something which may land 26.11, but no one is investing into something which is an open PR. My ask would be to merge this as a first step into the right direction, i can then work on contributing missing purl informations - since this takes time.
The only blocker I see is not needing to figure out if it's drv.src or drv.srcs for the identifiers. Once that's solved, I don't have any objections to merging this PR.
@RossComputerGuy blocking the full PR does not make sense IMHO, we may go and land this as a first milestone? Maybe even mark it as experimental, which gives time and others confidence that it may not be "stable"?
0caddcd to
c6035ec
Compare
raboof
left a comment
There was a problem hiding this comment.
I don't think this PR is perfect (it still doesn't quite make sense to me to split the purlParts rather than using a simple string, and I'm not sure the field name spec makes sense to me).
Nonetheless, having a place to record upstream purls is valuable, so I'm reluctantly approving this PR.
I think we should feel free to possibly refactor the structure later.
i think we should outline a list of possible enhancements, which will come later once this PR will get merged:
|
raboof
left a comment
There was a problem hiding this comment.
Based on the existing approvals on this PR and following discussion in the Software Provenance Team I'm going to merge this.
While there's some implementation choices I'm not super thrilled about (see above), the current post-26.05-branchoff phase is the right time to make this available for wider use, and based on experience we can still change things as needed.
46cc4b5 to
3016365
Compare
| type = "generic"; | ||
| # https://github.com/package-url/purl-spec/blob/18fd3e395dda53c00bc8b11fe481666dc7b3807a/types-doc/generic-definition.md | ||
| spec = "${repo}?vcs_url=https://${githubBase}/${owner}/${repo}@${(lib.revOrTag rev tag)}"; | ||
| }; |
There was a problem hiding this comment.
Shouldn't this be type = "github" with repository_url?
There was a problem hiding this comment.
I think that would be a good improvement.
| identifiers.purlParts = { | ||
| type = "github"; | ||
| spec = "jqlang/jq@jq-${finalAttrs.version}"; | ||
| }; |
There was a problem hiding this comment.
Is this accurate?
nix-repl> :p jq.src.meta.identifiers.purls
[ ]
nix-repl> :p jq.meta.identifiers.purls
[ "pkg:github/jqlang/jq@jq-1.8.1" ]
Since the PURL spec seems to be missing an important detail: whether or not the identity is about the “logical” package, or the provenance of the source file, there's some interpretation that needs to be done...
My current head-canon, especially since there is pkg:deb/* and other misc. provenances, is that PURLs are supposed to represent the provenance of an artifact, rather than what "logical" package something is. And that would seem to align with your implementation on fetchers.
(See the comment for popt about src.meta vs. meta)
So, in this case I'm more interested in the pkg:github PURL usage.
This is, as far as I understand, not a GitHub type PURL.
This is a generic fetch of a tarball that was created via CI and coincidentally uploaded to GitHub.
See these steps:
This is creating an archive that differs from the commit's contents.
There was a problem hiding this comment.
while https://github.com/package-url/purl-spec/blob/main/types-doc/github-definition.md is sadly low on details, it is my understanding that you should understand GitHub purls not as literally "this is exactly the source found at this Git repository", but more abstractly "this is the software published by the owners of this repository" - so using this Purl to refer to a tarball that was uploaded to this repo seems OK to me here.
| identifiers.purlParts = { | ||
| type = "github"; | ||
| spec = "rpm-software-management/popt@popt-${finalAttrs.version}-release"; | ||
| }; |
There was a problem hiding this comment.
Compare:
nix-repl> :p popt.src.meta.identifiers.purls
[ ]
nix-repl> :p popt.meta.identifiers.purls
[ "pkg:github/rpm-software-management/popt@popt-1.19-release" ]
and
nix-repl> :p hub.src.meta.identifiers.purls
[ "pkg:github/github/hub@38bcd4ae469e5f53f01901340b715c7658ab417a" ]
nix-repl> :p hub.meta.identifiers.purls
[ ]
Why is the purl on the package here rather than on the src?
Since the PURL spec seems to be missing an important detail: whether or not the identity is about the “logical” package, or the provenance of the source file, there's some interpretation that needs to be done...
My current head-canon, especially since there is pkg:deb/* and other misc. provenances, is that PURLs are supposed to represent the provenance of an artifact, rather than what "logical" package something is. And that would seem to align with your implementation on fetchers.
Though in this particular instance, the src ends-up with no PURL, and the package itself is getting a PURL.
And, I'm not even sure it would be a "valid" PURL. A PURL for a derivation might really should probably be a pkg:nix/drv@... or something or other representing the provenance in a faithful-enough manner. No?
There was a problem hiding this comment.
Why is the
purlon the package here rather than on thesrc?
The purl is on the popt package because we manually encoded this knowledge. The purl is not on the hub package because we didn't specify it explicitly. In a lot of cases, unless otherwise specified, likely the purl of the package can be derived from the purl of the source, and the purl of the source can be derived from the fetcher. However, we didn't want to bake that assumption into nixpkgs yet: for various reasons it seemed better / more conservative / more precise to leave that up to whatever tool will be consuming these identifiers.
My current head-canon, especially since there is pkg:deb/* and other misc. provenances, is that PURLs are supposed to represent the provenance of an artifact, rather than what "logical" package something is
I'd say it's actually more about 'logical' 'identity' than 'provenance', though indeed they get mixed up somewhat.
A PURL for a derivation might really should probably be a pkg:nix/drv@... or something or other representing the provenance in a faithful-enough manner. No?
My thinking is we might indeed want pkg:nix/drv Purls at some point (package-url/purl-spec#314), but the 'meta' should contain the identification of the package as it's 'known' - similar to how the 'website' field is not the website of the derivation but the website of the software represented by the derivation.
| spec = "jqlang/jq@jq-${finalAttrs.version}"; | ||
| }; | ||
| }; | ||
| }) |
There was a problem hiding this comment.
Additional thread for jq.meta.purl. Let's ignore how this might not build, and focus on the implications for the PURL.
The proper tracking of provenance is, imo, another reason why the PURL should be on the src in this case:
nix-repl> :p (jq.overrideAttrs(_: { src = fetchurl { url = "https://github.com/jqlang/jq/releases/download/jq-1.8.0/jq-1.8.0.tar.gz"; hash = "sha256-kYEVd/kdmmGV/1DCv/7JtyyEKdwF7D6gIv2VwG0rMZw="; }; } )).meta.identifiers.purls
[ "pkg:github/jqlang/jq@jq-1.8.1" ]
Sure, this is bad form to override the source without overriding the version accordingly, but here it's important not to confidently produce wrong output. The overriding mechanisms will be used in surprising manners, and in the case of ${...}.meta.purl, it almost definitely will produce the wrong result at some point.
Compare this to if it had been set on jq.meta.src, the simple fact of overriding the src attribute would have made it harder to come to this result. Right?
There was a problem hiding this comment.
I think there's some case to be made that the purl to be encoded here should be the purl without the version component (version is optional in the spec and there's some precedence for this in CVE (https://github.com/CVEProject/cve-schema/blob/main/schema/docs/CVE_Record_Format_bundled.json#L449), as long as the version can be reliably determined by the version field. Of course, that also fails if you override just the source...
| patches = [ | ||
| ./musl.patch | ||
| ./CVE-2026-32316.patch | ||
| ./CVE-2026-33947.patch | ||
| ./CVE-2026-33948.patch | ||
| ./CVE-2026-39979.patch | ||
| ./CVE-2026-40164.patch | ||
| ] | ||
| ++ lib.optionals stdenv.hostPlatform.is32bit [ | ||
| # needed because epoch conversion test here is right at the end of 32 bit integer space | ||
| # See also: https://github.com/jqlang/jq/blob/859a8073ee8a21f2133154eea7c2bd5e0d60837f/tests/optional.test#L15-L18 | ||
| # "-D_TIME_BITS=64 -D_FILE_OFFSET_BITS=64" would be preferrable, but breaks with dynamic linking, | ||
| # unless done globally in stdenv for all of 32 bit. | ||
| ./disable-end-of-epoch-conversion-test.patch | ||
| ]; |
There was a problem hiding this comment.
Should patches be included in purls too? After all, these are things that have some form of provenance, and materially change the results for a given package.
(Adding more weight to my understanding that ${...}.meta.purl should be encoding the resulting derivation...)
There was a problem hiding this comment.
I think other documents (like an SBOM) can express "This is software with purl X with patch Y applied", and the patch information doesn't need to be in the purl itself.
There was a problem hiding this comment.
and the patch information doesn't need to be in the purl itself.
I tried to express this carefully, but apparently failed.
Should patches be included in
purlstoo?
I think I got confused with the list of purls, and thinking about the aggregation of the PURLs of a given component. In this case, I'm thinking about the fact we would be missing PURLs for the patches. Though they are vendored for jq, it would be more relevant for fetchpatch'd patches.
There was a problem hiding this comment.
and the patch information doesn't need to be in the purl itself.
I tried to express this carefully, but apparently failed.
My bad, it's late here :). I now see the relevance in your formatting :)
Should patches be included in
purlstoo?I think I got confused with the list of
purls, and thinking about the aggregation of the PURLs of a given component. In this case, I'm thinking about the fact we would be missing PURLs for the patches. Though they are vendored forjq, it would be more relevant forfetchpatch'd patches.
Right. While I now see your point (I think :) ), I don't see a super obvious use case for this information. For now my thinking is that whatever tool is extracting the provenance information from a nix structure could/should likely just get that information from the patches, and wouldn't benefit from encoding something in the purls. But if somehow a derivation included a patch that 'represents introducing a whole separate component', I could see adding a purl for that component in the purls, I think, yes.
There was a problem hiding this comment.
Yeah, sorry, I also was getting a bit tired.
I don't have a strong opinion, but the way I'm understanding the purpose of PURLs, I'm seeing a weird disconnect here:
For now my thinking is that whatever tool is extracting the provenance information from a nix structure could/should likely just get that information from the patches
I'm assuming you're implying that the patches themselves would have meta.purl each to consume.
Otherwise, and the less charitable interpretation, is that the tooling would have to extract the URLs and “get that information” (read: generate PURLs) from the patches. If that was the case, the same would apply to the package source's own PURLs, making this change unneeded. Which is why I'm thinking you meant patches.[...].meta.purl.
Forgive this abrupt tangent, but I've had this thought floating about in my mind since I thought about the problem space yesterday: for an hypothetical "new semantics mkDerivation", all "remote inputs" (i.e. that requires a fetch) like FODs should be described in a way that makes them actual “inputs” that can be overridden confidently, and more importantly, intrinsically introspectable, so they can be inspected.
... now back to the comment.
While I focused on patches, there is the broader concern that there is no way to list "all the 'remote' inputs for this derivation", other than heuristics after the fact from sifting through the .drv in the --requisites of the derivation. Unless I'm missing something.
So I guess from thinking a bit more and one night's sleep, I'm seeing that this comment chain really should have been: We are still missing information about provenance, and is that something that matters regarding the intent behind PURLs in the broader ecosystem? 🤔
There was a problem hiding this comment.
Otherwise, and the less charitable interpretation, is that the tooling would have to extract the URLs and “get that information” (read: generate PURLs) from the patches. If that was the case, the same would apply to the package source's own PURLs, making this change unneeded. Which is why I'm thinking you meant
patches.[...].meta.purl.
Taking a step back, I think in the vast majority of use cases there is no need for purls for individual patches, so this feels a bit hypothetical - but yes I could see tools looking at patches.[...].meta.purl.
there is the broader concern that there is no way to list "all the 'remote' inputs for this derivation", other than heuristics after the fact from sifting through the .drv in the --requisites of the derivation. Unless I'm missing something.
I don't think you're missing something, I indeed expect provenance analysis tools to need to sift through the --requisites of the derivation
We are still missing information about provenance, and is that something that matters regarding the intent behind PURLs in the broader ecosystem?
We're still missing a few big pieces (notably being able to find the purl metadata of a requisite drv and analyzing "aggregate FODs" that collect many npm/cargo/mvn/... dependencies), and that definitely matters - though I'm not sure that's what you're trying to say?
#421125 was merged and reverted later, because of regressions.
the background is described here: #421125 (comment)
@wolfgangwalther outlined the conditions and would like to enhance CI - over time. This is a continuous approach, which is in line with packages which have been found to be defunct and which need a fix. There may be more packages which have problems and we would like to prevent further fallout by a feature flag (prevents accessing + inheritance of drv.src / drv.srcs).
packages list: #453322 (comment)
list of real broken packages: #453322 (comment)
broken packages fix (deferrable PR: #457769)
With the old PR + the broken&platform check fix from #453291 + the feature flag, we enable maintainers to gather experience with PURL and set appropriate information (e.g. jq example, where fetchurl is used instead of fetchFromGithub)
Things done
passthru.tests.nixpkgs-reviewon this PR. See nixpkgs-review usage../result/bin/.Add a 👍 reaction to pull requests you find important.