[python] Support query auth (row filter & column masking) for REST catalog#8136
MgjLLL wants to merge 3 commits
Adds query-auth support to the Python client so it honors the row-level
filter and column masking rules returned by a REST catalog, matching the
existing JVM client behavior.
When the new option `query-auth.enabled` is set to true, the client
calls `POST /v1/.../databases/{db}/tables/{tb}/auth` before producing a
plan, receives `{ filter, columnMasking }`, and applies them on the
read path:
* `predicate_json_parser` parses Paimon predicate JSON into a
PyArrow compute filter (EQ/NEQ/LT/LTEQ/GT/GTEQ/IS_NULL/IS_NOT_NULL/
IN/NOT_IN/STARTS_WITH/ENDS_WITH/CONTAINS/AND/OR/NOT).
* `AuthFilterReader` / `AuthMaskingReader` / `ColumnProjectReader`
perform row filtering, column masking transforms (NULL, FIELD_REF,
CAST, UPPER, LOWER, CONCAT, CONCAT_WS) and final projection back to
the user's requested columns.
* `TableQueryAuth` / `TableQueryAuthResult` wrap the result and
convert each split to a `QueryAuthSplit`.
Behavior is gated by `CoreOptions.QUERY_AUTH_ENABLED` (default false),
so existing users see no change.
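The order of the three read-path stages (filter, then mask, then project back) can be sketched on plain dictionaries. This is a minimal illustration only: `apply_query_auth` and the row/transform shapes are hypothetical and are not the PR's `AuthFilterReader` / `AuthMaskingReader` / `ColumnProjectReader`, which operate on Arrow batches. It also assumes masking transforms read from the unmasked row.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def apply_query_auth(
    rows: Iterable[Row],
    row_filter: Callable[[Row], bool],
    column_masking: Dict[str, Callable[[Row], Any]],
    requested_columns: List[str],
) -> Iterator[Row]:
    """Hypothetical helper: filter rows, mask columns, then project."""
    for row in rows:
        # 1. Row filter: drop rows the policy does not allow.
        if not row_filter(row):
            continue
        # 2. Column masking: each transform sees the original row
        #    (an assumption of this sketch).
        masked = dict(row)
        for column, transform in column_masking.items():
            masked[column] = transform(row)
        # 3. Projection: keep only what the user asked for, dropping
        #    columns that were read solely for the filter.
        yield {name: masked[name] for name in requested_columns}


rows = [
    {"id": 1, "name": "ann", "region": "eu"},
    {"id": 2, "name": "bob", "region": "us"},
]
visible = list(
    apply_query_auth(
        rows,
        row_filter=lambda r: r["region"] == "eu",
        column_masking={"name": lambda r: r["name"].upper()},
        requested_columns=["id", "name"],
    )
)
```

Here `region` is read only because the filter needs it and is removed by the final projection.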
I found a few correctness issues in the query-auth paths introduced here:
- Ray: use table.new_read_builder() instead of direct ReadBuilder()
- Streaming: pass query_auth to AsyncStreamingTableScan, apply to all plans
- Merge reader: add RecordReaderToBatchAdapter for primary-key tables
- Parallel: use _create_reader_for_split, add raw_convertible proxy
Fixes for issues raised by @JingsongLi, plus one additional issue found during analysis. Fix 1: Ray read path bypasses auth (
@JingsongLi All 3 issues fixed (+ 1 additional parallel path bypass found during analysis). See updated PR description. PTAL.
Return None instead of a local lambda from table_query_auth() when auth is disabled, since pickle cannot serialize local lambdas. This fixes serializable_test and ray_sink_test failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
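The pickling problem can be reproduced with the standard library alone. The two factory functions below are hypothetical stand-ins for `table_query_auth()`, not the PR's code; only `pickle` behavior is being demonstrated.

```python
import pickle


def table_query_auth_with_lambda(enabled: bool):
    """Hypothetical variant: returns a no-op local lambda when disabled."""
    if not enabled:
        return lambda select: None  # defined inside a function: not picklable
    raise NotImplementedError("enabled path omitted in this sketch")


def table_query_auth_with_none(enabled: bool):
    """Hypothetical variant: returns None when disabled."""
    if not enabled:
        return None
    raise NotImplementedError("enabled path omitted in this sketch")


def is_picklable(obj) -> bool:
    """Return True if pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        # The exact exception type for local functions varies by
        # Python version, so several are caught here.
        return False
```

Callers then check for `None` before applying auth, instead of invoking a no-op callable.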
    if not self.filter and not self.column_masking:
        return plan
    auth_splits = [QueryAuthSplit(split, self) for split in plan.splits()]
    return Plan(auth_splits)
Java TableScan.Plan does not carry a snapshot id, but Python Plan does and the update / row-id update paths use it as check_from_snapshot. Wrapping the plan here drops plan.snapshot_id, so a query-auth table planned from a non-empty snapshot becomes snapshot_id=None; table_update then emits commit messages with -1, which disables the row-id conflict checks (and related global-index update checks). Please preserve the original plan metadata, e.g. Plan(auth_splits, snapshot_id=plan.snapshot_id).
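A minimal sketch of the suggested fix, using simplified stand-in classes. The `Plan` and `QueryAuthSplit` definitions below are hypothetical reductions for illustration; the real pypaimon classes carry more state, and the `snapshot_id` keyword is taken from the suggestion above rather than verified against the actual constructor.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class Plan:
    """Simplified stand-in for the Python Plan (assumed shape)."""
    _splits: List[Any]
    snapshot_id: Optional[int] = None

    def splits(self) -> List[Any]:
        return self._splits


@dataclass
class QueryAuthSplit:
    """Simplified stand-in: pairs a split with its auth result."""
    split: Any
    auth_result: Any


def wrap_plan(plan: Plan, auth_result: Any) -> Plan:
    """Wrap every split while carrying the plan metadata forward."""
    auth_splits = [QueryAuthSplit(s, auth_result) for s in plan.splits()]
    # Preserve snapshot_id so downstream conflict checks still see
    # the snapshot the plan was produced from.
    return Plan(auth_splits, snapshot_id=plan.snapshot_id)
```

If `Plan` later gains more metadata fields, copying them one by one becomes fragile; `dataclasses.replace(plan, _splits=auth_splits)` would be one way to keep the wrapper in sync under the same assumptions.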
        return reader

    def _create_split_read_with_read_type(self, split, read_type):
This auth-specific construction bypasses the normal PK read path above. In _create_split_read, PK tables inject missing sequence.field columns into the inner read type and then project them back out, matching the Java withReadType + outer projection behavior. Here, if query auth is enabled and the user projects id,val from a PK table with sequence.field=ts, MergeFileSplitRead is built without ts; that can either fail with sequence.field ... not found or merge by file sequence instead of the configured user sequence. Please reuse the existing _create_split_read widening/project-back logic for effective_read_type, or factor it so the auth path cannot drift from the normal PK path.
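The widening and project-back idea can be sketched on plain rows. All three helpers below are hypothetical and do not reflect the `MergeFileSplitRead` internals; the sketch only shows why the sequence field must be read even when the user did not request it.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def widen_read_type(read_fields: List[str], sequence_fields: List[str]) -> List[str]:
    """Append sequence fields that the user's projection omitted."""
    missing = [f for f in sequence_fields if f not in read_fields]
    return list(read_fields) + missing


def merge_latest(rows: List[Row], key: str, sequence_field: str) -> List[Row]:
    """Keep, per key, the row with the highest sequence value."""
    latest: Dict[Any, Row] = {}
    for row in rows:
        current = latest.get(row[key])
        if current is None or row[sequence_field] >= current[sequence_field]:
            latest[row[key]] = row
    return list(latest.values())


def project_back(rows: List[Row], read_fields: List[str]) -> List[Row]:
    """Drop the fields that were added only for merging."""
    return [{name: row[name] for name in read_fields} for row in rows]


user_fields = ["id", "val"]
inner_fields = widen_read_type(user_fields, ["ts"])
# The newer value (ts=5) appears first in file order, so merging by
# file order alone would wrongly keep "old".
rows = [
    {"id": 1, "val": "new", "ts": 5},
    {"id": 1, "val": "old", "ts": 2},
]
result = project_back(merge_latest(rows, "id", "ts"), user_fields)
```

Routing the auth path through the same widening helper as the normal PK path keeps the two from diverging.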
    elif function == "LIKE":
        raw = literals[0]
        escaped = re.escape(raw)
        pattern = escaped.replace("%", ".*").replace("_", ".")
This does not match the JVM LIKE semantics. Java treats backslash as the default escape character before expanding % / _, so a policy predicate like LIKE admin\\_% matches admin_foo and not adminXfoo. Escaping the whole string first and then replacing every % / _ makes escaped wildcards behave as wildcards (or requires a literal backslash), so Python can allow/deny different rows from the Java client for the same auth filter. Please port the Java Like.sqlToRegexLike behavior, including invalid escape handling.
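One way to handle the escape character before expanding wildcards is a single left-to-right scan. The sketch below is not a port of `Like.sqlToRegexLike`: the function names are hypothetical, and raising on an invalid escape sequence is an assumption about the desired behavior that should be checked against the Java implementation.

```python
import re


def sql_like_to_regex(pattern: str, escape: str = "\\") -> str:
    """Translate a SQL LIKE pattern to a regex, honoring the escape char."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == escape:
            # An escape must be followed by %, _ or the escape char itself.
            if i + 1 >= len(pattern):
                raise ValueError(f"dangling escape in LIKE pattern: {pattern!r}")
            nxt = pattern[i + 1]
            if nxt not in ("%", "_", escape):
                raise ValueError(f"invalid escape sequence in LIKE pattern: {pattern!r}")
            out.append(re.escape(nxt))  # escaped wildcard becomes a literal
            i += 2
            continue
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
        i += 1
    return "".join(out)


def like(value: str, pattern: str) -> bool:
    """Return True if value matches the SQL LIKE pattern in full."""
    regex = sql_like_to_regex(pattern)
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None
```

With this scan, `like("admin_foo", "admin\\_%")` is true and `like("adminXfoo", "admin\\_%")` is false, which is the behavior described in the comment above.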
Purpose
Adds query-auth support to the Python client so it honors the row-level filter and column masking rules returned by a REST catalog, matching the existing JVM client behavior.

When the new option `query-auth.enabled` is set to `true`, before producing a `Plan` the client calls `POST /v1/.../databases/{db}/tables/{tb}/auth` with the projected fields, receives `{ filter, columnMasking }`, and applies them on the read path:

* `RESTApi.auth_table_query` issues the call (new request/response models `AuthTableQueryRequest` / `AuthTableQueryResponse`, new path in `ResourcePaths.auth_table`).
* `TableQueryAuth` / `TableQueryAuthResult` (`catalog/table_query_auth.py`) wrap the result and convert each split to a `QueryAuthSplit`.
* `predicate_json_parser` (`common/predicate_json_parser.py`) parses Paimon predicate JSON into a PyArrow compute filter (EQ/NEQ/LT/LTEQ/GT/GTEQ/IS_NULL/IS_NOT_NULL/IN/NOT_IN/STARTS_WITH/ENDS_WITH/CONTAINS/AND/OR/NOT).
* `AuthFilterReader` / `AuthMaskingReader` / `ColumnProjectReader` (`read/reader/auth_masking_reader.py`) implement row filtering, column masking transforms (`NULL`, `FIELD_REF`, `CAST`, `UPPER`, `LOWER`, `CONCAT`, `CONCAT_WS`) and final projection back to the user's requested columns.
* `read_builder` / `stream_read_builder` / `table_read` / `table_scan` / `file_store_table` / `catalog_environment` / `rest_catalog` are wired to invoke the auth call and pull extra fields required only by the auth filter.

Behavior is gated by the new `CoreOptions.QUERY_AUTH_ENABLED` (`query-auth.enabled`, default `false`), so existing users see no change.

Tests
Three new test files (994+ lines, all passing locally under `pytest`):

* `paimon-python/pypaimon/tests/predicate_json_parser_test.py`: covers each predicate kind, nested AND/OR/NOT, type coercion, null handling, and `extract_referenced_fields`.
* `paimon-python/pypaimon/tests/auth_masking_reader_test.py`: covers each masking transform, missing-field validation, and projection back to the user-requested columns.
* `paimon-python/pypaimon/tests/table_query_auth_test.py`: end-to-end coverage. The REST catalog calls `auth_table_query`, the result is plumbed into the plan, splits become `QueryAuthSplit`, and reads return filtered and masked rows.

Local check:

API and Format

* `query-auth.enabled` (boolean, default `false`).
* `POST /v1/{prefix}/databases/{db}/tables/{tb}/auth`. Request `{ "select": [...] }`, response `{ "filter": [<predicate-json>...], "columnMasking": { <col>: <transform-json>, ... } }`. The contract follows the existing Java client; no server-side change is required for catalogs that already implement query auth.
* The new classes (`AuthTableQueryRequest`, `AuthTableQueryResponse`, `TableQueryAuth`, `TableQueryAuthResult`, `QueryAuthSplit`, `AuthFilterReader`, `AuthMaskingReader`, `ColumnProjectReader`) are additive and live under existing modules.

Documentation
The new option `query-auth.enabled` should be reflected in the Python configuration reference. Happy to add the docs entry in this PR or in a follow-up; please advise.

This closes #8135
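
The request and response shapes described under API and Format can be illustrated with the standard library. The helper names are hypothetical and are not the PR's `AuthTableQueryRequest` / `AuthTableQueryResponse` models. The predicate and transform payloads are kept as opaque placeholders because their JSON layout is not shown in this description, and URL-encoding of path segments is omitted.

```python
import json
from typing import Any, Dict, List, Tuple


def auth_table_path(prefix: str, database: str, table: str) -> str:
    """Build the auth endpoint path from the template in this description."""
    return f"/v1/{prefix}/databases/{database}/tables/{table}/auth"


def build_auth_request(select: List[str]) -> str:
    """Serialize the request body: the projected field names."""
    return json.dumps({"select": list(select)})


def parse_auth_response(body: str) -> Tuple[List[Any], Dict[str, Any]]:
    """Return (filter, columnMasking), defaulting to empty when absent."""
    payload = json.loads(body)
    return payload.get("filter") or [], payload.get("columnMasking") or {}


# Placeholder values stand in for real predicate/transform JSON.
sample_response = json.dumps(
    {"filter": ["<predicate-json>"], "columnMasking": {"name": "<transform-json>"}}
)
filters, masking = parse_auth_response(sample_response)
```

Treating a missing `filter` or `columnMasking` as empty is an assumption of this sketch; it lines up with the early return for plans that have neither.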