Skip to content

Add Short and BigDecimal type mappings to SqlTypeUtils#763

Open
gknz wants to merge 3 commits into
apache:mainfrom
gknz:feature/sqltypeutils-short-bigdecimal-mapping
Open

Add Short and BigDecimal type mappings to SqlTypeUtils#763
gknz wants to merge 3 commits into
apache:mainfrom
gknz:feature/sqltypeutils-short-bigdecimal-mapping

Conversation

@gknz
Copy link
Copy Markdown

@gknz gknz commented May 30, 2026

Summary

Adds proper SQL type support for Short and java.math.BigDecimal across the
table-sink write path. Previously neither had a JDBC type mapping, so the sink
created such columns as VARCHAR(255) (numbers stored as text, with no numeric
ordering/aggregation in the DB). This PR maps them to real numeric SQL types and
binds their values correctly on both the Java and Spark execution paths.

Changes

1. Type mapping (SqlTypeUtils)

  • Short / shortSMALLINT
  • BigDecimalNUMERIC(38,18)
  • The PostgreSQL dialect inherits both from the default map.

2. Value binding (the DDL type alone is not enough)
Mapping the column type without binding the value would bind a string into a
numeric column. Each in-JVM sink is fixed:

  • JavaTableSink: bind via PreparedStatement.setShort(...) / setBigDecimal(...)
    instead of the setString fallback.
  • SparkTableSink: map Short → DataTypes.ShortType and
    BigDecimal → DataTypes.createDecimalType(38, 18), so the DataFrame schema
    (and the JDBC write derived from it) produces SMALLINT / NUMERIC(38,18).

Why NUMERIC(38,18) rather than a bare NUMERIC

A bare NUMERIC is not portable for decimals. Per the SQL standard, omittting
the scale means scale 0, so most engines create an integer column and silently
truncate the fractional part: H2, MySQL (DECIMAL(10,0)), SQL Server
(DECIMAL(18,0)), Derby, DB2 all do this (e.g. 12.345 → 12). PostgreSQL is the
lenient exception (bare NUMERIC is unbounded), which can hide the bug. An
explicit scale avoids the truncation on every database.

Why specifically 38 and 18

These are not arbitrary: (38, 18) is Spark's DecimalType.SYSTEM_DEFAULT
the precision/scale Spark assigns to a decimal when none is otherwise known
(38 is also the maximum precision Spark's Decimal supports, and a value widely
supported by Oracle/SQL Server). We deliberately reuse the same values on the
Java/DDL path (SqlTypeUtils → NUMERIC(38,18)):

  • 38 = max total significant digits (generous headroom),
  • 18 = digits after the decimal point (ample for monetary/rate data),
  • and, crucially, this makes the Java and Spark sinks emit identical column
    types
    for a BigDecimal. In Wayang the optimizer decides which platform runs
    the sink, so the output schema must not depend on that choice — matching Spark's
    default keeps both paths consistent.

Trade-off: a fixed scale pads short values with trailing zeros
(12.345 → 12.345000000000000000). This is numerically identical and is exactly
what the Spark path already produced;=, in return decimals aree portable and lossless
across all databases.

Tests

  • SqlTypeUtilsTest: the full default type map (incl. Short → SMALLINT,
    BigDecimal → NUMERIC(38,18)) and PostgreSQL inheritance.
  • SparkTableSinkTest:
    • direct mapping checks for getSparkDataType(...) — the whole Java-class →
      Spark DataType table, the String fallback, and explicitly
      BigDecimal → DecimalType(precision=38, scale=18);
    • an end-to-end write of all supported types to in-memory H2, asserting the
      round-tripped values and the created column types (SMALLINT, NUMERIC(38,18)).
  • JavaTableSinkTest: the analogous end-to-end test for the Java sink.

All tests run against in-memory H2 — no external database required.

Notes

  • The in-database JDBC sinks (PostgresTableSinkOperator, etc.) are unaffected :
    they compose CREATE TABLE ... AS SELECT and let the DB infer types, so they
    never bind a Java value or use SqlTypeUtils.
  • getSparkDataType was widened from private to package-private only to make
    the mapping unit-testable.

gepa and others added 2 commits May 30, 2026 16:39
Short and java.math.BigDecimal had no entry in the JDBC type map, so columns of those Java types were created as VARCHAR(255) by the JDBC table sink in overwrite mode. Map Short (and short) to SMALLINT and BigDecimal to NUMERIC; the PostgreSQL dialect map inherits both from the default map.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
SqlTypeUtils now maps Short->SMALLINT and BigDecimal->NUMERIC for the generated
DDL, but the sinks still bound those values via the String fallback. Complete the
write path:
- JavaTableSink: bind via setShort / setBigDecimal
- SparkTableSink: map Short->ShortType and BigDecimal->DecimalType(38,18) so the
  Spark DataFrame writes SMALLINT / NUMERIC columns instead of text
@mspruc
Copy link
Copy Markdown
Contributor

mspruc commented Jun 1, 2026

Hi thanks for the contribution, if possible it would also be nice to have a unit test for this

@zkaoudi
Copy link
Copy Markdown
Contributor

zkaoudi commented Jun 1, 2026

@mspruc what unit test do you have in mind? This PR seems to simply add two more data types

@mspruc
Copy link
Copy Markdown
Contributor

mspruc commented Jun 1, 2026

@zkaoudi well something that tests the sinks ideally, e.g. for https://github.com/gknz/wayang/blob/efb35979111a6bb84192acd2ade8424bcd1ee463/wayang-platforms/wayang-spark/src/main/java/org/apache/wayang/spark/operators/SparkTableSink.java#L156

we have two magic numbers, I'd like to make sure this conversion is stable.

Add unit and end-to-end tests pinning the Short/BigDecimal handling across the
table sinks, and fix a decimal-truncation issue the tests surfaced.

- SqlTypeUtils: map BigDecimal to NUMERIC(38,18) instead of a bare NUMERIC. A bare
  NUMERIC defaults to scale 0 on most engines (H2, MySQL, SQL Server, Derby, ...)
  and silently truncates the fractional part; an explicit scale fixes this on every
  database and matches Spark's DecimalType default (38, 18).
- SqlTypeUtilsTest: cover the full default type map (including Short and BigDecimal)
  and confirm the PostgreSQL dialect inherits them.
- SparkTableSink: make getSparkDataType package-private so it can be unit-tested.
- SparkTableSinkTest: assert the Java-class to Spark DataType mapping (including
  BigDecimal to DecimalType(38,18) and the String fallback), plus an end-to-end
  write of all supported types to H2 verifying values and SMALLINT/NUMERIC(38,18).
- JavaTableSinkTest: end-to-end write of all supported types to H2 verifying values
  and column types.
@gknz
Copy link
Copy Markdown
Author

gknz commented Jun 2, 2026

@mspruc hey! I've added unit tests for sqlTypeUtils, SparkTableSink, and JavaTableSink. Please feel free to read the updated descriptions for all the changes/additions I went with and why. If you have any comments, please feel free to let me know!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants