Added Legolas with first node/edge schemes.#43
Conversation
Co-authored-by: Martijn Visser <mgvisser@gmail.com>
|
I've added a load of TODOs, not all of them need to be fixed in this PR. Creating a JSON scheme can be done, but getting all the little rules in there (connectors) will be a pain and there's no automation for it in the current ecosystem. I had to fix the tests to adhere to the Schema we defined, so that's great already. |
| sv = schema() | ||
| validate(Tables.schema(table), sv) | ||
| R = Legolas.record_type(sv) | ||
| foreach(R, Tables.rows(table)) # construct each row |
There was a problem hiding this comment.
Do you know roughly how long this takes for the national schematization? Especially the forcing can contain millions of rows, and sometimes only a fraction of them is used for short calculations.
Also, the result of this loop is not kept, that means that we only get errors, but any changes like clamp in the Legolas tour is never done, or not?
There was a problem hiding this comment.
That's correct, at the moment I only check the types and for empty strings. And I rather throw on error than clamp some values, but that's up for debate.
There was a problem hiding this comment.
Ok yeah right now we don't use something like clamp, and I can't directly think of a use case where it would be better than an error, so let's keep it like this.
Performance was answered here: #43 (comment)
| profile = DataFrame(read_table(config["profile"])) | ||
| forcing = DataFrame(read_table(config["forcing"])) | ||
| # Load data and validate schema + rows for required field tyopes and values | ||
| node = read_table(config["node"]; schema = NodeV1SchemaVersion) |
There was a problem hiding this comment.
Is there any reason to pass the type, and not its instance, NodeV1SchemaVersion()? The latter feels a bit more natural. And the schema keyword can then be set to ::Union{SchemaVersion, Nothing}.
There was a problem hiding this comment.
Because Legolas code requires the type, I passed the type. I can make it an instance and use typeof?
There was a problem hiding this comment.
Ok if Legolas works on the type we can keep it as is, but https://github.com/Deltares/Ribasim.jl/pull/43/files#diff-cd7159638f60b3d2b30ed4bbe7b622ffcdc834f0a2beafe12f46e219a4aa932aR69 (sv = schema()) suggests it doesn't.
| @assert issorted(forcing.time) | ||
| startrow = searchsortedfirst(forcing.time, starttime) | ||
| endrow = searchsortedlast(forcing.time, endtime) | ||
| forcing = @view forcing[startrow:endrow, :] | ||
| # then keep only IDs we use | ||
| forcing = filter(:id => in(ids), forcing; view = true) |
There was a problem hiding this comment.
Good catch that the assert must be moved up because of the searchsorted first.
In general in this function before I only validated the subset that we are using, and now we are validating the whole dataset. If it is cheap enough that is fine, but good to check the cost.
There was a problem hiding this comment.
The slowest is indeed validation the forcing on 3M rows, which takes 1.7s.
[ Info: Validating Legolas.SchemaVersion{Symbol("ribasim.node"), 1} for 29616 rows.
0.019118 seconds (206.28 k allocations: 4.905 MiB)
┌ Warning: unsupported ARROW:extension:name type: "geoarrow.linestring", arrow type = Vector{Tuple{Float64, Float64}}
└ @ Arrow ~/.julia/packages/Arrow/QsQ3U/src/eltypes.jl:53
[ Info: Validating Legolas.SchemaVersion{Symbol("ribasim.edge"), 1} for 39043 rows.
0.024101 seconds (426.03 k allocations: 10.671 MiB)
[ Info: Validating Legolas.SchemaVersion{Symbol("ribasim.state"), 1} for 8531 rows.
0.028598 seconds (75.18 k allocations: 1.541 MiB, 76.58% gc time)
[ Info: Validating Legolas.SchemaVersion{Symbol("ribasim.static"), 1} for 5464 rows.
0.010356 seconds (52.27 k allocations: 1.375 MiB)
[ Info: Validating Legolas.SchemaVersion{Symbol("ribasim.profile"), 1} for 46560 rows.
0.029184 seconds (601.92 k allocations: 13.447 MiB)
[ Info: Validating Legolas.SchemaVersion{Symbol("ribasim.forcing"), 1} for 3088222 rows.
1.741670 seconds (33.91 M allocations: 837.849 MiB, 4.59% gc time)There was a problem hiding this comment.
Thanks for the numbers, I'd say 1.7s is no reason to worry for now, we don't have to optimize further. But perhaps later on if needed we will validate only what we use.
Fixes #24. Legolas only really does validation on a schema level, but the defined struct allows for per row validation. Next up is to validate the graph itself.