feat(csv): make parseLine the synchronous primitive (refs #3765) #7118
MukundaKatta wants to merge 2 commits into
Refactor the CSV parser so a single synchronous parseLine handles all field-level rules, with parse() (sync) and CsvParseStream (async) becoming thin line-iteration shells on top of it.

- _io.ts: introduce sync parseLine; rewrite the existing async parseRecord as a thin reader.readLine accumulator that delegates to parseLine. Error column tracking now resolves through embedded newlines so error messages stay correct for multi-line quoted records.
- parse.ts: drop the duplicate field-parsing loop that lived inside Parser.#parseRecord; both Parser and the new public parseLine share the same primitive. Public parseLine has the simple (line, options) -> string[] signature requested in denoland#3765, including BOM strip and trailing CR/LF/CRLF normalization.
- parse_test.ts: add 12 parseLine-specific tests covering happy path, custom separator, escapes, BOM, trailing newlines, multi-line quoted body, lazyQuotes, comment lines, and unclosed-field error.

All 133 existing parse + parse_stream tests still pass; new tests bring the total to 145.
Codecov Report
❌ Patch coverage is

Additional details and impacted files

Coverage Diff

| | main | #7118 | +/- |
|---|---|---|---|
| Coverage | 94.61% | 94.61% | |
| Files | 634 | 634 | |
| Lines | 51799 | 51769 | -30 |
| Branches | 9329 | 9327 | -2 |
| Hits | 49009 | 48982 | -27 |
| Misses | 2216 | 2211 | -5 |
| Partials | 574 | 576 | +2 |
The deno_lint no-unused-vars check flagged the parameter on parseLine (and the matching one on parseRecord and Parser.#parseRecord) — it was threaded through but never read inside the function bodies because the locate() helper computes line offsets from embedded newlines in the joined fullLine instead. Removing the param simplifies the call sites without changing behavior: all 145 parse + parse_stream + parseLine tests still pass.
@std/csv is v1.0.6 and csv/mod.ts:184 re-exports everything from ./parse.ts, so the new parseLine lands on the stable surface. Per .github/CONTRIBUTING.md, new public APIs in v1.0.0+ packages need to live in csv/unstable_parse_line.ts, must not be re-exported from mod.ts, and must carry @experimental **UNSTABLE**: New API, yet to be vetted.. Title would shift to feat(csv/unstable): add parseLine.
The underlying refactor — unifying Parser.#parseRecord and the streaming parseRecord onto one field-state machine in _io.ts — is fine; the locate() helper correctly recomputes (line, col) for embedded \n and the EOF branches match the old line.length === 0 / line.length > 0 split.
- nit: parse.ts:14 was previously a \uXXXX-escaped BOM constant; this PR replaces it with the raw U+FEFF character, which renders invisibly in most editors and breaks grep. The new BOM test in parse_test.ts does the same. Please keep the escape form; parse_test.ts already defines a BYTE_ORDER_MARK constant using it at the top of the file.
bartlomieju left a comment
Thanks for the rework — collapsing the three copies of the field state machine into one shared sync primitive is the right shape, and the (line, options) public surface matches what #3765 sketched. Error-location preservation across multi-line quoted records via locate() is a nice touch, and the new tests cover the cases that matter (BOM, custom separator, escaped quotes, embedded \n, lazyQuotes, comment, unclosed-field error).
Two things to fix before merge, plus a few smaller ones inline:
- Quadratic re-parsing for multi-line quoted records. Both wrappers (parseRecord in _io.ts and Parser.#parseRecord in parse.ts) re-run parseLine over the entire accumulated buffer every time another line is appended. For an N-line quoted field that's O(N²) field-scan work plus O(N²) string-concat allocation, where the old code was linear. Multi-megabyte quoted fields (embedded JSON/HTML in a cell, a real pattern in data exports) would regress noticeably. Worth a benchmark; if confirmed, parseLine needs a resume-position mode or the wrappers need to keep incremental state.
- BOM literal replaces "\ufeff". The U+FEFF character is invisible in most editors and diff viewers, so a future reader will see the constant as "". Keep the escape.
The public surface looks right; one missing piece is that parse() validates separator against ["\r", "\n", '"'] but the new public parseLine doesn't (inline). Also, issue #3765 had a second bullet about requiring TextLineStream upstream of CsvParseStream — this PR only does the first bullet, which is fine, but worth a follow-up note so it doesn't get lost.
Not blocking: the PR description still has an unchecked reviewer-confirmation item on the public shape — confirming that (line, options) → string[] matches #3765's spirit.
```diff
 export type { ParseResult, RecordWithColumn };

-const BYTE_ORDER_MARK = "\ufeff";
+const BYTE_ORDER_MARK = "";
```
Please keep this as "\ufeff". The literal U+FEFF is invisible in most editors and diff viewers — a maintainer scanning this line will read it as an empty string. Same issue in the new test at csv/parse_test.ts:1071.
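To make the point concrete, here is a minimal sketch of the escaped constant in use. The `stripBom` helper name is hypothetical and only for illustration; it is not part of the PR.

```typescript
// The escaped form is visible in source and greppable; it denotes the same
// single UTF-16 code unit (U+FEFF) as the raw character.
const BYTE_ORDER_MARK = "\ufeff";

// Hypothetical helper (assumption, not from the PR): drop one leading BOM.
function stripBom(text: string): string {
  return text.startsWith(BYTE_ORDER_MARK) ? text.slice(1) : text;
}

console.log(BYTE_ORDER_MARK.length); // 1
console.log(stripBom("\ufeffa,b,c")); // a,b,c
```

Searching the codebase for `\ufeff` finds the escaped constant, whereas the raw character is easy to miss in review.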
```ts
export function parseLine(
  line: string,
  options: Omit<ParseOptions, "skipFirstRow" | "columns" | "fieldsPerRecord"> =
    {},
): string[] {
  const { separator = ",", trimLeadingSpace = false, comment, lazyQuotes } =
    options;
  const stripped = line.startsWith(BYTE_ORDER_MARK) ? line.slice(1) : line;
  // Treat a single trailing CR/LF/CRLF as a record terminator (callers that
  // forgot to trim should not see a phantom empty trailing field).
  const normalized = stripped.endsWith("\r\n")
    ? stripped.slice(0, -2)
    : stripped.endsWith("\n") || stripped.endsWith("\r")
    ? stripped.slice(0, -1)
    : stripped;
  const readOptions: ReadOptions = {
    separator,
    trimLeadingSpace,
    ...(comment !== undefined ? { comment } : {}),
    ...(lazyQuotes !== undefined ? { lazyQuotes } : {}),
  };
  const result = parseLineInternal(normalized, readOptions, 0, true);
  return result ?? [];
}
```
Two gaps on the public parseLine:
- Separator not validated. parse() rejects separators in INVALID_RUNE = ["\r", "\n", '"'] (see the existing check in Parser.parse). parseLine("a\"b", { separator: '"' }) would behave unpredictably here. Apply the same guard.
- comment behavior is undocumented and silently lossy. parseLine("# x", { comment: "#" }) returns [], indistinguishable from an empty line (see the new test at parse_test.ts:1113). Either drop comment from the public single-line surface (it's a record-stream concept, not a per-line one) or document explicitly in the JSDoc that comment lines return [].
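A rough sketch of what the separator guard could look like. The `INVALID_RUNE` list is the one quoted above; the `assertValidSeparator` name, the error type, and the message are assumptions for illustration and may not match what `Parser.parse` actually throws.

```typescript
// Characters that cannot act as a field separator (list taken from the review).
const INVALID_RUNE = ["\r", "\n", '"'];

// Hypothetical guard (assumption): reject separators that collide with
// record terminators or the quote character.
function assertValidSeparator(separator: string): void {
  if (INVALID_RUNE.includes(separator)) {
    throw new Error(`Invalid separator: ${JSON.stringify(separator)}`);
  }
}

assertValidSeparator(","); // accepted
assertValidSeparator("\t"); // accepted
```

Calling the guard at the top of the public `parseLine` would make the single-line entry point fail the same way `parse()` does for a bad separator.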
```ts
let accumulated = first;
while (true) {
  const result = parseLineInternal(
    accumulated,
    this.#options,
    zeroBasedStartLine,
    this.#isEOF(),
  );
  if (result !== null) return result;
  const next = this.#readLine();
  if (next === null) {
    // Force the EOF decision (will throw unless lazyQuotes is set).
    return parseLineInternal(
      accumulated,
      this.#options,
      zeroBasedStartLine,
      true,
    ) ?? [];
  }
  accumulated += "\n" + next;
}
```
Quadratic re-parse. Every additional line, parseLineInternal re-scans accumulated from byte 0. For an N-line quoted field this is O(N²) parse work plus the O(N²) string concatenation on line 161. The old #parseRecord parsed each pulled line incrementally. Real CSV exports do put multi-MB quoted blobs (HTML, JSON, base64) in a single cell — this is a real regression risk. Worth benchmarking against a synthetic input like one record with 10k newlines inside a quoted field.
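One way to set up such a benchmark, as a sketch only: the helper names are hypothetical, and `parseUnderTest` is a placeholder for the real parse function (a trivial stand-in is used here so the snippet runs on its own).

```typescript
// Build one CSV record whose middle field is quoted and contains
// `embeddedNewlines` newline characters.
function buildMultilineRecord(embeddedNewlines: number): string {
  const bodyLines: string[] = [];
  for (let i = 0; i <= embeddedNewlines; i++) {
    bodyLines.push(`row ${i}`);
  }
  return `id,"${bodyLines.join("\n")}",end\n`;
}

// Time a single call in milliseconds. Date.now() is coarse, so use large inputs.
function timeMs(fn: () => void): number {
  const start = Date.now();
  fn();
  return Date.now() - start;
}

// Run the parser over inputs of increasing size and collect timings.
function scalingReport(
  parseUnderTest: (input: string) => unknown,
  sizes: number[],
): Array<{ newlines: number; ms: number }> {
  return sizes.map((newlines) => {
    const input = buildMultilineRecord(newlines);
    return { newlines, ms: timeMs(() => parseUnderTest(input)) };
  });
}

// Stand-in parser (assumption): replace with the real parse() when benchmarking.
const report = scalingReport(
  (input) => input.split("\n").length,
  [2500, 5000, 10000],
);
console.log(report);
```

With a linear parser, doubling the newline count should roughly double the time; with quadratic re-parsing it should roughly quadruple.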
If the benchmark confirms it, options are: (a) have parseLineInternal accept a resume position so the wrapper only feeds it the new tail, (b) make it return progress state, or (c) keep a separate incremental path for the streaming case and only use the shared primitive for true single-line callers.
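As a rough illustration of the incremental-state idea, the sketch below carries only an "inside quotes" flag from line to line, scans each new line once, and joins the pieces once at the end. All names are hypothetical, and it assumes well-formed quoting where an escaped quote is written as `""`; lazyQuotes and error reporting are not handled.

```typescript
interface QuoteState {
  inQuotes: boolean;
}

// Scan only the newly read line. Toggling on every quote works for
// well-formed input because an escaped quote ("") toggles twice.
function scanLine(line: string, state: QuoteState): QuoteState {
  let inQuotes = state.inQuotes;
  for (let i = 0; i < line.length; i++) {
    if (line[i] === '"') inQuotes = !inQuotes;
  }
  return { inQuotes };
}

// Collect the physical lines of one record starting at `start`.
// Each line is scanned once and the text is joined once, so the work is
// linear in the record size.
function readRecordText(
  lines: string[],
  start: number,
): { text: string; next: number } {
  const parts: string[] = [];
  let state: QuoteState = { inQuotes: false };
  let i = start;
  while (i < lines.length) {
    const line = lines[i] ?? "";
    parts.push(line);
    state = scanLine(line, state);
    i++;
    if (!state.inQuotes) break;
  }
  return { text: parts.join("\n"), next: i };
}

const demo = readRecordText(['a,"x', "y", 'z",b', "c,d"], 0);
console.log(demo.next); // 3
```

The completed record text could then be handed to the shared field parser exactly once, which keeps a single set of field rules without re-scanning the buffer on every appended line.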
```ts
    // narrowing for the type system.
    return eofResult ?? [];
  }
  accumulated += "\n" + next;
```
Same quadratic-reparse concern as the Parser.#parseRecord wrapper in parse.ts:142-161. Per call to parseLine here, the entire accumulated is re-scanned from start. For streaming, this matters more than for parse() because CsvParseStream is the documented path for memory-bounded ingest of large CSVs — a quoted field spanning many chunks would balloon to O(N²).
```ts
const eofResult = parseLine(
  accumulated,
  options,
  zeroBasedRecordStartLine,
  true,
);
// parseLine with atEof=true cannot return null; this is a defensive
// narrowing for the type system.
return eofResult ?? [];
```
The comment admits this branch is unreachable. Rather than carry a ?? [] fallback the type system can't prove away, express the invariant: overload parseLine so atEof: true returns string[] and atEof: false returns string[] | null. Then this block can just be return parseLine(accumulated, options, zeroBasedRecordStartLine, true) and the dead-code comment goes away.
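A sketch of how the overloads might look. The function is named `parseLineSketch` to make clear it is not the PR's code; its signature is reduced to two parameters, and the body is a toy placeholder (naive comma split), all of which are assumptions for illustration.

```typescript
// Overloads: a literal `true` for atEof narrows the return type to string[].
function parseLineSketch(fullLine: string, atEof: true): string[];
function parseLineSketch(fullLine: string, atEof: false): string[] | null;
function parseLineSketch(fullLine: string, atEof: boolean): string[] | null;
function parseLineSketch(fullLine: string, atEof: boolean): string[] | null {
  // Toy body: an odd number of quotes stands in for "quoted field still open".
  const quoteCount = fullLine.split('"').length - 1;
  const unterminated = quoteCount % 2 === 1;
  if (unterminated && !atEof) return null; // caller should append another line
  if (unterminated) throw new SyntaxError("unterminated quoted field");
  return fullLine.split(","); // placeholder for the real field parser
}

// No `?? []` needed: the first overload proves the result is never null.
const fields: string[] = parseLineSketch("a,b,c", true);
console.log(fields.length); // 3
```

The third overload keeps call sites that pass a runtime boolean (such as an EOF check) compiling, at the cost of the wider return type there.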
```diff
  */
+export function parseLine(
   fullLine: string,
-  reader: LineReader,
   options: ReadOptions,
-  zeroBasedRecordStartLine: number,
-  zeroBasedLine: number = zeroBasedRecordStartLine,
-): Promise<Array<string>> {
+  zeroBasedRecordStartLine: number = 0,
+  atEof: boolean = true,
+): string[] | null {
   // line starting with comment character is ignored
```
Nit on the docstring: it says parse builds on this function, but parse lives in parse.ts and goes through the Parser class — parseRecord (just below) and Parser.#parseRecord are the actual callers. Worth rewording to avoid implying a direct dependency from parse().
```ts
const locate = (absPos: number): { line: number; col: number } => {
  let line = zeroBasedRecordStartLine;
  let lastNewline = -1;
  for (let i = 0; i < absPos; i++) {
    if (fullLine[i] === "\n") {
      line++;
      lastNewline = i;
    }
  }
  const col = codePointLength(fullLine.slice(lastNewline + 1, absPos));
  return { line, col };
};
```
Nit: locate rescans fullLine[0..absPos] on each call. It's only hit on error paths so this isn't a real perf concern, but if you're touching the file anyway, a one-pass precomputed lineStarts: number[] indexed via findLastIndex would be cleaner and let you reuse it across both error sites.
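A minimal sketch of the precomputed variant, with hypothetical names (`buildLineStarts`, `locateWithStarts`). It uses a plain `forEach` rather than `findLastIndex` so it does not depend on an ES2023 runtime, and `Array.from(...).length` stands in for the file's `codePointLength` helper.

```typescript
// One pass over the text: record the offset at which each line begins.
function buildLineStarts(text: string): number[] {
  const starts = [0];
  for (let i = 0; i < text.length; i++) {
    if (text[i] === "\n") starts.push(i + 1);
  }
  return starts;
}

// Map an absolute offset to (line, col) using the precomputed table.
function locateWithStarts(
  text: string,
  lineStarts: number[],
  absPos: number,
  zeroBasedRecordStartLine: number,
): { line: number; col: number } {
  let lineIndex = 0;
  let lineStart = 0;
  // lineStarts is ascending, so the last entry <= absPos wins.
  lineStarts.forEach((start, i) => {
    if (start <= absPos) {
      lineIndex = i;
      lineStart = start;
    }
  });
  // Count code points, not UTF-16 code units, for the column.
  const col = Array.from(text.slice(lineStart, absPos)).length;
  return { line: zeroBasedRecordStartLine + lineIndex, col };
}

const sample = 'a,"x\nyz\nq"';
const starts = buildLineStarts(sample);
console.log(locateWithStarts(sample, starts, 6, 10)); // { line: 11, col: 1 }
```

The table can be built lazily on the first error and reused by both error sites, so the success path pays nothing for it.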
Moved to draft while I rework — thanks for the careful review. Plan:
Will re-request once (2) is benchmarked and clean.
Summary
- Make parseLine the actual internal CSV primitive that both parse() and CsvParseStream build on, addressing the design feedback from #7114 (closed) and aligning with #3765's intent.
- Drop the duplicate field loop in Parser.#parseRecord in parse.ts: both parse() (sync) and the streaming path now share one set of field/quote rules.
- parseLine(line, options) -> string[] is the simple shape #3765 asked for, with BOM strip and trailing CR/LF/CRLF normalization.

What changed
- csv/_io.ts: new sync parseLine carries the whole field-parsing state machine (separator, quotes, escapes, lazyQuotes, comment, trim). The existing async parseRecord becomes a small wrapper that pulls more lines from the LineReader and re-calls parseLine until a record completes. Error column tracking maps absolute positions in the joined input back to (line, column) so multi-line quoted records still report the right line.
- csv/parse.ts: drop Parser.#parseRecord's duplicate field loop; Parser now defers to parseLine from _io.ts. Add the public parseLine export with a clean (line, options) signature.
- csv/parse_test.ts: 12 new tests pin parseLine behavior (happy path, custom separator, escaped quotes, BOM, trailing newline, multi-line quoted body, lazyQuotes, comment, unclosed-field error).

Test plan
- Reviewer to confirm that parseLine's public surface matches #3765's spirit and that the (line, options) shape is what was wanted.

cc @bartlomieju: this replaces #7114 with the design you sketched in the review there.