feat(csv): make parseLine the synchronous primitive (refs #3765)#7118

Draft
MukundaKatta wants to merge 2 commits into
denoland:mainfrom
MukundaKatta:feat/csv-parse-line-primitive

Conversation

@MukundaKatta
Contributor

Summary

What changed

  • csv/_io.ts: new sync parseLine carries the whole field-parsing state machine (separator, quotes, escapes, lazyQuotes, comment, trim). The existing async parseRecord becomes a small wrapper that pulls more lines from the LineReader and re-calls parseLine until a record completes. Error column tracking maps absolute positions in the joined input back to (line, column) so multi-line quoted records still report the right line.
  • csv/parse.ts: drop Parser.#parseRecord's duplicate field loop; Parser now defers to parseLine from _io.ts. Add the public parseLine export with a clean (line, options) signature.
  • csv/parse_test.ts: 12 new tests pin parseLine behavior (happy path, custom separator, escaped quotes, BOM, trailing newline, multi-line quoted body, lazyQuotes, comment, unclosed-field error).
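
To make the (line, options) shape concrete, here is a minimal, self-contained sketch of a synchronous single-line parser. It is not the code from csv/_io.ts: the name parseLineSketch, the LineOptions type, the lenient treatment of bare quotes, and the empty-array result for an empty line are assumptions made for illustration. Only the separator, quoted fields, escaped quotes, BOM stripping, and a single trailing line terminator are covered.

```typescript
// Hypothetical options type; the real ParseOptions has more members.
interface LineOptions {
  separator?: string;
}

// Illustrative single-line CSV parser, NOT the @std/csv implementation.
function parseLineSketch(line: string, options: LineOptions = {}): string[] {
  const separator = options.separator ?? ",";
  // Strip a leading BOM, then at most one trailing CRLF, LF, or CR.
  let text = line.startsWith("\ufeff") ? line.slice(1) : line;
  text = text.replace(/\r?\n$|\r$/, "");
  // Assumption: an empty line yields an empty record.
  if (text.length === 0) return [];

  const fields: string[] = [];
  let field = "";
  let inQuotes = false;
  let i = 0;
  while (i < text.length) {
    const ch = text.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (text.charAt(i + 1) === '"') {
          // A doubled quote inside a quoted field is one literal quote.
          field += '"';
          i += 2;
          continue;
        }
        inQuotes = false; // closing quote
      } else {
        field += ch;
      }
    } else if (ch === '"' && field.length === 0) {
      inQuotes = true; // opening quote at the start of a field
    } else if (text.startsWith(separator, i)) {
      fields.push(field);
      field = "";
      i += separator.length;
      continue;
    } else {
      // Bare quotes in an unquoted field are kept literally in this sketch.
      field += ch;
    }
    i++;
  }
  if (inQuotes) throw new SyntaxError("unclosed quoted field");
  fields.push(field);
  return fields;
}

console.log(parseLineSketch('a,"b,""c""",d\r\n'));
console.log(parseLineSketch("\ufeffx;y", { separator: ";" }));
```

Because the function is synchronous and takes a plain string, callers that already hold a line never need a reader or a stream.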

Test plan

  • All 133 existing parse + parse_stream steps still pass (145 total with the new parseLine tests).
  • Existing error-message assertions (StartLine1, StartLine2, ParseErrorLine, OddQuotes, etc.) preserved with no changes to test expectations.
  • Reviewer to confirm that parseLine's public surface matches the spirit of #3765 (suggestion: investigate simpler CSV-parsing APIs) and that the (line, options) shape is what was wanted.

cc @bartlomieju — this replaces #7114 with the design you sketched in the review there.

Refactor the CSV parser so a single synchronous parseLine handles all
field-level rules, with parse() (sync) and CsvParseStream (async)
becoming thin line-iteration shells on top of it.

- _io.ts: introduce sync parseLine; rewrite the existing async
  parseRecord as a thin reader.readLine accumulator that delegates to
  parseLine. Error column tracking now resolves through embedded
  newlines so error messages stay correct for multi-line quoted records.
- parse.ts: drop the duplicate field-parsing loop that lived inside
  Parser.#parseRecord; both Parser and the new public parseLine share
  the same primitive. Public parseLine has the simple
  (line, options) -> string[] signature requested in denoland#3765, including
  BOM strip and trailing CR/LF/CRLF normalization.
- parse_test.ts: add 12 parseLine-specific tests covering happy path,
  custom separator, escapes, BOM, trailing newlines, multi-line quoted
  body, lazyQuotes, comment lines, and unclosed-field error.

All 133 existing parse + parse_stream tests still pass; new tests bring
the total to 145.
@github-actions github-actions Bot added the csv label Apr 28, 2026
@codecov

codecov Bot commented Apr 28, 2026

Codecov Report

❌ Patch coverage is 91.96429% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.61%. Comparing base (cd03740) to head (ccb627d).

Files with missing lines    Patch %    Lines
csv/parse.ts                82.60%     6 Missing and 2 partials ⚠️
csv/_io.ts                  98.48%     0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7118   +/-   ##
=======================================
  Coverage   94.61%   94.61%           
=======================================
  Files         634      634           
  Lines       51799    51769   -30     
  Branches     9329     9327    -2     
=======================================
- Hits        49009    48982   -27     
+ Misses       2216     2211    -5     
- Partials      574      576    +2     


The deno_lint no-unused-vars check flagged the parameter on parseLine
(and the matching one on parseRecord and Parser.#parseRecord) — it was
threaded through but never read inside the function bodies because the
locate() helper computes line offsets from embedded newlines in the
joined fullLine instead.

Removing the param simplifies the call sites without changing behavior:
all 145 parse + parse_stream + parseLine tests still pass.

@fibibot fibibot left a comment

@std/csv is v1.0.6 and csv/mod.ts:184 re-exports everything from ./parse.ts, so the new parseLine lands on the stable surface. Per .github/CONTRIBUTING.md, new public APIs in v1.0.0+ packages need to live in csv/unstable_parse_line.ts, must not be re-exported from mod.ts, and must carry the tag @experimental **UNSTABLE**: New API, yet to be vetted. The title would then shift to feat(csv/unstable): add parseLine.

The underlying refactor — unifying Parser.#parseRecord and the streaming parseRecord onto one field-state machine in _io.ts — is fine; the locate() helper correctly recomputes (line, col) for embedded \n and the EOF branches match the old line.length === 0 / line.length > 0 split.

  • nit: parse.ts:14 was previously a \uXXXX-escaped BOM constant; this PR replaces it with the raw U+FEFF character, which renders invisibly in most editors and breaks grep. The new BOM test in parse_test.ts does the same. Please keep the escape form — parse_test.ts already defines a BYTE_ORDER_MARK constant using it at the top of the file.

Member

@bartlomieju bartlomieju left a comment

Thanks for the rework — collapsing the three copies of the field state machine into one shared sync primitive is the right shape, and the (line, options) public surface matches what #3765 sketched. Error-location preservation across multi-line quoted records via locate() is a nice touch, and the new tests cover the cases that matter (BOM, custom separator, escaped quotes, embedded \n, lazyQuotes, comment, unclosed-field error).

Two things to fix before merge, plus a few smaller ones inline:

  1. Quadratic re-parsing for multi-line quoted records. Both wrappers (parseRecord in _io.ts and Parser.#parseRecord in parse.ts) re-run parseLine over the entire accumulated buffer every time another line is appended. For an N-line quoted field that's O(N²) field-scan work plus O(N²) string-concat allocation, where the old code was linear. Multi-megabyte quoted fields (embedded JSON/HTML in a cell — a real pattern in data exports) would regress noticeably. Worth a benchmark; if confirmed, parseLine needs a resume-position mode or the wrappers need to keep incremental state.

  2. BOM literal replaces "\ufeff". The U+FEFF character is invisible in most editors and diff viewers — a future reader will see the constant as "". Keep the escape.

The public surface looks right; one missing piece is that parse() validates separator against ["\r", "\n", '"'] but the new public parseLine doesn't (inline). Also, issue #3765 had a second bullet about requiring TextLineStream upstream of CsvParseStream — this PR only does the first bullet, which is fine, but worth a follow-up note so it doesn't get lost.

Not blocking: the PR description still has an unchecked reviewer-confirmation item on the public shape — confirming that (line, options) → string[] matches #3765's spirit.

Comment thread csv/parse.ts
  export type { ParseResult, RecordWithColumn };

- const BYTE_ORDER_MARK = "\ufeff";
+ const BYTE_ORDER_MARK = "";
Member

Please keep this as "\ufeff". The literal U+FEFF is invisible in most editors and diff viewers — a maintainer scanning this line will read it as an empty string. Same issue in the new test at csv/parse_test.ts:1071.
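
A small sketch of why the escape form loses nothing: "\ufeff" evaluates to the same one-character string as the raw character, but stays visible in source and greppable. The stripBom helper is hypothetical and only illustrates the stripping step.

```typescript
// Escaped form: visibly a one-character string containing U+FEFF.
const BYTE_ORDER_MARK = "\ufeff";

// Hypothetical helper (not part of @std/csv): drop one leading BOM if present.
function stripBom(input: string): string {
  return input.startsWith(BYTE_ORDER_MARK)
    ? input.slice(BYTE_ORDER_MARK.length)
    : input;
}

const withBom = BYTE_ORDER_MARK + "a,b,c";
console.log(BYTE_ORDER_MARK.length); // 1
console.log(stripBom(withBom)); // "a,b,c"
console.log(stripBom("a,b,c")); // unchanged: "a,b,c"
```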

Comment thread csv/parse.ts
Comment on lines +51 to +74
export function parseLine(
  line: string,
  options: Omit<ParseOptions, "skipFirstRow" | "columns" | "fieldsPerRecord"> =
    {},
): string[] {
  const { separator = ",", trimLeadingSpace = false, comment, lazyQuotes } =
    options;
  const stripped = line.startsWith(BYTE_ORDER_MARK) ? line.slice(1) : line;
  // Treat a single trailing CR/LF/CRLF as a record terminator (callers that
  // forgot to trim should not see a phantom empty trailing field).
  const normalized = stripped.endsWith("\r\n")
    ? stripped.slice(0, -2)
    : stripped.endsWith("\n") || stripped.endsWith("\r")
    ? stripped.slice(0, -1)
    : stripped;
  const readOptions: ReadOptions = {
    separator,
    trimLeadingSpace,
    ...(comment !== undefined ? { comment } : {}),
    ...(lazyQuotes !== undefined ? { lazyQuotes } : {}),
  };
  const result = parseLineInternal(normalized, readOptions, 0, true);
  return result ?? [];
}
Member

Two gaps on the public parseLine:

  1. Separator not validated. parse() rejects separators in INVALID_RUNE = ["\r", "\n", '"'] (see the existing check in Parser.parse). parseLine("a\"b", { separator: '"' }) would behave unpredictably here. Apply the same guard.
  2. comment behavior is undocumented and silently lossy. parseLine("# x", { comment: "#" }) returns [], indistinguishable from an empty line — see the new test at parse_test.ts:1113. Either drop comment from the public single-line surface (it's a record-stream concept, not a per-line one) or document explicitly in the JSDoc that comment lines return [].
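
A sketch of the guard from point 1, assuming the same three-character set quoted above. The function name assertValidSeparator and the error message are hypothetical; the existing check in Parser.parse may differ in wording and in what else it validates.

```typescript
// Same set the review quotes: CR, LF, and the double quote.
const INVALID_RUNE = ["\r", "\n", '"'];

// Hypothetical guard to run at the top of the public parseLine.
function assertValidSeparator(separator: string): void {
  if (INVALID_RUNE.includes(separator)) {
    throw new Error(
      `Cannot parse line: invalid separator ${JSON.stringify(separator)}`,
    );
  }
}

assertValidSeparator(","); // accepted
assertValidSeparator(";"); // accepted
try {
  assertValidSeparator('"');
} catch (error) {
  console.log((error as Error).message);
}
```

Running the guard before any field scanning keeps the state machine free of separator/quote ambiguity.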

Comment thread csv/parse.ts
Comment on lines +142 to 162
    let accumulated = first;
    while (true) {
      const result = parseLineInternal(
        accumulated,
        this.#options,
        zeroBasedStartLine,
        this.#isEOF(),
      );
      if (result !== null) return result;
      const next = this.#readLine();
      if (next === null) {
        // Force the EOF decision (will throw unless lazyQuotes is set).
        return parseLineInternal(
          accumulated,
          this.#options,
          zeroBasedStartLine,
          true,
        ) ?? [];
      }
      accumulated += "\n" + next;
    }
Member

Quadratic re-parse. Every additional line, parseLineInternal re-scans accumulated from byte 0. For an N-line quoted field this is O(N²) parse work plus the O(N²) string concatenation on line 161. The old #parseRecord parsed each pulled line incrementally. Real CSV exports do put multi-MB quoted blobs (HTML, JSON, base64) in a single cell — this is a real regression risk. Worth benchmarking against a synthetic input like one record with 10k newlines inside a quoted field.

If the benchmark confirms it, options are: (a) have parseLineInternal accept a resume position so the wrapper only feeds it the new tail, (b) make it return progress state, or (c) keep a separate incremental path for the streaming case and only use the shared primitive for true single-line callers.
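
As a rough illustration of the "keep incremental state" direction, the sketch below carries only the in-quotes flag across physical lines, so each character is visited once and the pieces are joined a single time at the end. IncrementalQuoteTracker and readRecord are hypothetical names, and the quote-toggling rule (an escaped "" pair toggles twice and cancels out) only decides where a record ends; it does not split fields or validate bare quotes.

```typescript
// Hypothetical tracker: remembers whether we are inside a quoted field.
class IncrementalQuoteTracker {
  private inQuotes = false;
  visited = 0; // characters inspected so far, for comparison purposes

  /** Feed the next physical line; returns true once the record is complete. */
  feed(line: string): boolean {
    for (let i = 0; i < line.length; i++) {
      this.visited++;
      if (line.charAt(i) === '"') this.inQuotes = !this.inQuotes;
    }
    return !this.inQuotes;
  }
}

// Collect physical lines until the tracker reports a complete record.
function readRecord(lines: string[]): { record: string; visited: number } {
  const tracker = new IncrementalQuoteTracker();
  const parts: string[] = [];
  for (const line of lines) {
    parts.push(line);
    if (tracker.feed(line)) break;
  }
  // One join at the end instead of repeated `accumulated += "\n" + next`.
  return { record: parts.join("\n"), visited: tracker.visited };
}

console.log(readRecord(['"ab', "cd", "ef", 'gh"', "next,record"]));
```

The real field state machine would need to persist more than one flag (current field buffer, position, error location), but the cost profile is the point here: work grows with the size of the new tail, not with the whole buffer.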

Comment thread csv/_io.ts
      // narrowing for the type system.
      return eofResult ?? [];
    }
    accumulated += "\n" + next;
Member

Same quadratic-reparse concern as the Parser.#parseRecord wrapper in parse.ts:142-161. Per call to parseLine here, the entire accumulated is re-scanned from start. For streaming, this matters more than for parse() because CsvParseStream is the documented path for memory-bounded ingest of large CSVs — a quoted field spanning many chunks would balloon to O(N²).
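
The growth can be made visible without the real parser by counting how many characters a "re-scan accumulated from offset 0" loop touches for one multi-line quoted field. This is a cost model under stated assumptions, not a benchmark of the PR: makeQuotedRecord and countRescanVisits are hypothetical helpers, and quote toggling stands in for the full field state machine.

```typescript
// Build one record whose single quoted field spans `lineCount` physical lines.
function makeQuotedRecord(lineCount: number, lineLength = 10): string[] {
  const body = "x".repeat(lineLength);
  const lines: string[] = [];
  for (let i = 0; i < lineCount; i++) {
    const open = i === 0 ? '"' : "";
    const close = i === lineCount - 1 ? '"' : "";
    lines.push(open + body + close);
  }
  return lines;
}

// Mimic the wrapper loop: append a line, then scan the whole buffer again.
function countRescanVisits(lines: string[]): number {
  let accumulated = "";
  let visited = 0;
  let first = true;
  for (const line of lines) {
    accumulated = first ? line : accumulated + "\n" + line;
    first = false;
    let inQuotes = false;
    for (let j = 0; j < accumulated.length; j++) {
      visited++;
      if (accumulated.charAt(j) === '"') inQuotes = !inQuotes;
    }
    if (!inQuotes) break; // record complete
  }
  return visited;
}

for (const n of [100, 1000]) {
  const lines = makeQuotedRecord(n);
  const inputLength = lines.join("\n").length;
  console.log({ lines: n, inputLength, visited: countRescanVisits(lines) });
}
```

Comparing visited against inputLength for growing n shows the ratio rising roughly in proportion to the number of lines, which is the pattern a real benchmark would need to confirm or rule out.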

Comment thread csv/_io.ts
Comment on lines +272 to +280
      const eofResult = parseLine(
        accumulated,
        options,
        zeroBasedRecordStartLine,
        true,
      );
      // parseLine with atEof=true cannot return null; this is a defensive
      // narrowing for the type system.
      return eofResult ?? [];
Member

The comment admits this branch is unreachable. Rather than carry a ?? [] fallback the type system can't prove away, express the invariant: overload parseLine so atEof: true returns string[] and atEof: false returns string[] | null. Then this block can just be return parseLine(accumulated, options, zeroBasedRecordStartLine, true) and the dead-code comment goes away.
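
A sketch of the overload idea, with a deliberately naive body standing in for the real state machine. parseLineWithEof is a hypothetical name, and the second overload takes a general boolean rather than the literal false, on the assumption that the wrappers also pass a runtime flag such as the result of an EOF check.

```typescript
// With atEof known to be true, the result is never null.
function parseLineWithEof(fullLine: string, atEof: true): string[];
// With a runtime flag, the caller must still handle "need more input".
function parseLineWithEof(fullLine: string, atEof: boolean): string[] | null;
function parseLineWithEof(fullLine: string, atEof: boolean): string[] | null {
  // Placeholder logic only: an odd number of quotes means a field is open.
  const quoteCount = fullLine.split('"').length - 1;
  if (quoteCount % 2 === 1) {
    if (!atEof) return null; // caller should append another line and retry
    throw new SyntaxError("unclosed quoted field");
  }
  // Naive split that ignores quoting rules; fine for this type-level demo.
  return fullLine.split(",");
}

// No `?? []` needed: the first overload already promises string[].
const complete: string[] = parseLineWithEof("a,b", true);
const pending: string[] | null = parseLineWithEof('"a', false);
console.log(complete, pending);
```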

Comment thread csv/_io.ts
Comment on lines +77 to 84
   */
  export function parseLine(
    fullLine: string,
-   reader: LineReader,
    options: ReadOptions,
-   zeroBasedRecordStartLine: number,
-   zeroBasedLine: number = zeroBasedRecordStartLine,
- ): Promise<Array<string>> {
+   zeroBasedRecordStartLine: number = 0,
+   atEof: boolean = true,
+ ): string[] | null {
    // line starting with comment character is ignored
Member

Nit on the docstring: it says parse builds on this function, but parse lives in parse.ts and goes through the Parser class — parseRecord (just below) and Parser.#parseRecord are the actual callers. Worth rewording to avoid implying a direct dependency from parse().

Comment thread csv/_io.ts
Comment on lines +104 to +116
  const locate = (absPos: number): { line: number; col: number } => {
    let line = zeroBasedRecordStartLine;
    let lastNewline = -1;
    for (let i = 0; i < absPos; i++) {
      if (fullLine[i] === "\n") {
        line++;
        lastNewline = i;
      }
    }
    const col = codePointLength(fullLine.slice(lastNewline + 1, absPos));
    return { line, col };
  };

Member

Nit: locate rescans fullLine[0..absPos] on each call. It's only hit on error paths so this isn't a real perf concern, but if you're touching the file anyway, a one-pass precomputed lineStarts: number[] indexed via findLastIndex would be cleaner and let you reuse it across both error sites.
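
One possible shape for that suggestion, as a standalone sketch: compute the line-start offsets once, then resolve any absolute position against them. buildLineStarts and locateWithStarts are hypothetical names, codePointLength here is a local stand-in for the helper used in _io.ts, and the backward loop could be replaced by Array.prototype.findLastIndex where the compile target includes ES2023.

```typescript
// Local stand-in: count Unicode code points rather than UTF-16 code units.
function codePointLength(s: string): number {
  return Array.from(s).length;
}

// One pass over the joined input: offset of the first character of each line.
function buildLineStarts(fullLine: string): number[] {
  const starts = [0];
  for (let i = 0; i < fullLine.length; i++) {
    if (fullLine.charAt(i) === "\n") starts.push(i + 1);
  }
  return starts;
}

// Resolve an absolute position to (line, col) using the precomputed offsets.
function locateWithStarts(
  fullLine: string,
  lineStarts: number[],
  absPos: number,
  zeroBasedRecordStartLine: number,
): { line: number; col: number } {
  // Find the last line start that is <= absPos.
  let idx = lineStarts.length - 1;
  while (idx > 0 && (lineStarts[idx] ?? 0) > absPos) idx--;
  const start = lineStarts[idx] ?? 0;
  return {
    line: zeroBasedRecordStartLine + idx,
    col: codePointLength(fullLine.slice(start, absPos)),
  };
}

const joined = 'a,"b\ncd\ne"x';
const starts = buildLineStarts(joined);
console.log(starts);
console.log(locateWithStarts(joined, starts, 10, 3));
```

The same starts array can be shared by every error site in one parse call, so the input is walked once regardless of how many positions need resolving.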

@MukundaKatta MukundaKatta marked this pull request as draft May 27, 2026 05:47
@MukundaKatta
Contributor Author

Moved to draft while I rework — thanks for the careful review. Plan:

  1. Unstable surface (per @fibibot): move parseLine to csv/unstable_parse_line.ts, drop the mod.ts re-export, add @experimental **UNSTABLE** tag, retitle to feat(csv/unstable): add parseLine.
  2. Quadratic re-parse (per @bartlomieju): convert both wrappers to a stateful incremental parser — feed each pulled line into a persistent field-state machine instead of re-scanning accumulated from offset 0. Will measure on a multi-MB single quoted field before re-requesting review.
  3. parseLine overloads: split into atEof: true → string[] and atEof: false → string[] | null so the unreachable ?? [] fallback in _io.ts:280 goes away cleanly.
  4. Doc + locate cleanups: reword the parseRecord docstring (parse goes through the Parser class, not this fn directly), and precompute lineStarts if I'm touching the error paths anyway.

Will re-request once (2) is benchmarked and clean.
