[SYSTEMDS-3863] Add PowerTransformer built-in functions #2499

Open
WenliangCao wants to merge 2 commits into apache:main from WenliangCao:systemds-3863-power-transformer

@WenliangCao

Summary

This pull request introduces an initial implementation of the PowerTransformer built-in functions in Apache SystemDS.

The implementation follows a fit-and-apply structure, with separate functions for estimating transformation parameters and applying the transformation to new data.

Changes

  • Add powerTransform.dml, which estimates one Yeo-Johnson lambda per input column and applies the transform.
  • Add powerTransformApply.dml for applying the transformation with previously estimated parameters.
  • Implement the Yeo-Johnson transformation for positive, zero, and negative input values.
  • Estimate the optimal lambda value independently for each column.
  • Use golden-section search as the current lambda optimization method.
  • Add DML scripts for integration testing.
  • Add an R reference implementation for result validation.
  • Add a Java integration test that compares the SystemDS output with the R reference output.
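For readers unfamiliar with the three cases named above, the following is a minimal Python sketch of the standard Yeo-Johnson transform with one lambda per column. It is not the DML code from this PR; the function names `yeo_johnson` and `yeo_johnson_columns` and the `eps` threshold for detecting the special lambda values are assumptions made for illustration.

```python
import math


def yeo_johnson(x: float, lam: float, eps: float = 1e-12) -> float:
    """Yeo-Johnson transform of a single value (hypothetical helper)."""
    if x >= 0:
        # Non-negative branch; x == 0 always maps to 0.
        if abs(lam) < eps:
            return math.log1p(x)  # lambda = 0: log(x + 1)
        return ((x + 1.0) ** lam - 1.0) / lam
    # Negative branch.
    if abs(lam - 2.0) < eps:
        return -math.log1p(-x)  # lambda = 2: -log(1 - x)
    return -(((1.0 - x) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


def yeo_johnson_columns(X, lambdas):
    """Apply one lambda per column to a row-major list-of-lists matrix."""
    return [[yeo_johnson(v, lam) for v, lam in zip(row, lambdas)] for row in X]
```

With `lam = 1.0` both branches reduce to the identity, which matches the smoke-test expectation described in the commit message.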

Testing

The implementation was tested with:

mvn -Dtest=BuiltinPowerTransformTest test

Test result:

Tests run: 1, Failures: 0, Errors: 0, Skipped: 0
BUILD SUCCESS

The test workflow performs the following steps:

  1. Generate the input matrix in the Java test.
  2. Execute the PowerTransformer implementation in SystemDS.
  3. Execute the equivalent reference implementation in R.
  4. Compare the SystemDS and R output matrices with a numerical tolerance.
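Step 4 can be pictured as an element-wise comparison against a fixed tolerance. The sketch below is plain Python, not the Java test utilities used in the PR; the helper names and the default `eps = 1e-8` are assumptions, since the PR text does not state the tolerance.

```python
def max_abs_diff(A, B):
    """Largest element-wise absolute difference between two matrices."""
    if len(A) != len(B) or any(len(ra) != len(rb) for ra, rb in zip(A, B)):
        raise ValueError("matrices must have the same shape")
    return max(
        (abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb)),
        default=0.0,
    )


def matrices_match(A, B, eps=1e-8):
    """True if every pair of entries differs by at most eps (assumed value)."""
    return max_abs_diff(A, B) <= eps
```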

Current Status

The current implementation establishes the main transformation pipeline and verifies the Yeo-Johnson apply step against an R reference implementation for fixed lambda values (0, 1, and 2).

The following components are currently available:

  • Column-wise lambda estimation.
  • Yeo-Johnson transformation.
  • Separate fit and apply functions.
  • R-based reference validation.
  • Java integration testing through Maven.

Current Limitations

  • Lambda estimation currently uses golden-section search.
  • A Brent-based optimization method is planned for the final implementation.
  • Box-Cox transformation is not yet implemented.
  • Additional edge cases and numerical stability tests are still required.
  • More datasets and end-to-end experiments will be added.
  • Comparisons with existing scaling methods will be completed during the final project phase.
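As background for the first limitation, here is a generic golden-section search in Python. It is a sketch of the technique, not the DML implementation in this PR: the function name, the tolerance, the iteration cap, and the idea of passing the objective as a callable are all assumptions. In a PowerTransformer setting the objective would typically be the negative Yeo-Johnson log-likelihood of one column as a function of lambda, which is not reproduced here.

```python
import math

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # 1/phi, roughly 0.618


def golden_section_min(f, lo, hi, tol=1e-8, max_iter=200):
    """Minimize a unimodal function f on [lo, hi] (hypothetical helper)."""
    # Two interior probe points placed symmetrically by the golden ratio.
    c = hi - INV_PHI * (hi - lo)
    d = lo + INV_PHI * (hi - lo)
    fc, fd = f(c), f(d)
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        if fc < fd:
            # Minimum lies in [lo, d]; the old c becomes the new d.
            hi, d, fd = d, c, fc
            c = hi - INV_PHI * (hi - lo)
            fc = f(c)
        else:
            # Minimum lies in [c, hi]; the old d becomes the new c.
            lo, c, fc = c, d, fd
            d = lo + INV_PHI * (hi - lo)
            fd = f(d)
    return (lo + hi) / 2.0
```

Each iteration reuses one of the two previous function values, so only one new evaluation of the objective is needed per step. The method assumes a single minimum inside the bracket; Brent's method, listed as planned, combines this kind of bracketing with parabolic interpolation.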

Future Work

  • Replace or extend golden-section search with Brent's optimization method.
  • Add support for the Box-Cox transformation.
  • Add standardization after the power transformation.
  • Extend tests for boundary lambda values, constant columns, zero values, and mixed-sign inputs.
  • Add numerical comparisons with scikit-learn.
  • Evaluate PowerTransformer against existing scaling methods for regression, classification, and clustering.
  • Add user documentation and usage examples.

Related Issue

SYSTEMDS-3863

Add two DML builtins for the midterm PowerTransformer prototype:
- powerTransform.dml estimates one Yeo-Johnson lambda per input column,
  uses lambda = 1 for constant columns, and applies the transform.
- powerTransformApply.dml applies the Yeo-Johnson transform with supplied
  per-column lambdas.

Register both builtins so they can be resolved by SystemDS.

Add powerTransformSmokeTest.dml to verify:
- powerTransformApply with lambda = 1 behaves as identity
- powerTransform returns output dimensions matching the input
- one lambda is returned per input column
- constant columns use lambda = 1
- transformed output and lambdas do not contain NaN or Inf

The smoke test passes with:
./bin/systemds src/test/scripts/functions/builtin/powerTransformSmokeTest.dml

Add a focused PowerTransformApply test that compares the DML builtin against
an independent R reference implementation.

The test adds:
- powerTransformApply.dml as a small DML wrapper around the registered builtin
- powerTransformApply.R as the reference Yeo-Johnson apply implementation
- BuiltinPowerTransformTest.java to run the DML and R scripts and compare Y

The test covers the key apply branches with fixed lambdas:
- lambda = 0 for the positive log branch
- lambda = 1 for the identity case, where both the positive and negative branches reduce to y = x
- lambda = 2 for the negative log branch
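
For reference, the standard Yeo-Johnson definition shows why these three lambdas exercise the special-case branches. The symbol \psi is used here for illustration only and does not appear in the PR.

```latex
\psi(\lambda, x) =
\begin{cases}
\dfrac{(x+1)^{\lambda} - 1}{\lambda} & x \ge 0,\ \lambda \ne 0 \\[4pt]
\log(x+1) & x \ge 0,\ \lambda = 0 \\[4pt]
-\dfrac{(1-x)^{2-\lambda} - 1}{2-\lambda} & x < 0,\ \lambda \ne 2 \\[4pt]
-\log(1-x) & x < 0,\ \lambda = 2
\end{cases}

% lambda = 1: both general branches collapse to the identity
\psi(1, x) = \frac{(x+1) - 1}{1} = x \quad (x \ge 0), \qquad
\psi(1, x) = -\frac{(1-x) - 1}{1} = x \quad (x < 0)
```

Note that lambda = 0 only triggers the log branch for non-negative inputs and lambda = 2 only for negative inputs; entries of the opposite sign still go through the general power formulas.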

Also translate the remaining PowerTransformer comments from Chinese to English
in the prototype DML files.

Verified with:
Rscript -e 'parse(file="src/test/scripts/functions/builtin/powerTransformApply.R"); cat("R syntax OK\n")'
./bin/systemds src/test/scripts/functions/builtin/powerTransformSmokeTest.dml
mvn -Dtest=BuiltinPowerTransformTest test
