Schema Evolution
Delta Lake offers online schema change capabilities that enable efficient DDL execution without rewriting data. Key features, such as column mapping, facilitate these changes. In addition, Spark provides the writer option mergeSchema for dynamically adding columns and overwriteSchema for replacing the entire dataset along with its schema. Relyt supports both scenarios when reading from Delta Lake.
Typical online schema change operations include:
- Adding columns
- Dropping and renaming columns
- Changing column comments or ordering
- Changing column types
How an online schema change is implemented on the write side determines how easily it can be supported on the read side. For example, schema changes applied by overwriting the data, such as modifying column types or names, are seamless by default and require no additional design.
This topic introduces Spark's implementations of online schema changes for Delta Lake tables, highlighting which changes are metadata-only and which require data rewriting.
How Spark implements online schema changes
Spark provides the following implementations of online schema changes:
- The overwriteSchema option rewrites all existing data files to apply schema changes, such as adding, dropping, or renaming columns, or modifying column comments and order.
- Delta table's column mapping feature allows you to explicitly drop and rename columns without rewriting the existing data.
- The mergeSchema option in Spark enables replacing and adding columns, as well as modifying column comments and order, using a metadata-only approach.
- Changing column types without rewriting data, known as type widening, is not currently supported in open-source Spark. This feature is available in public preview through the enableTypeWidening feature in the Databricks runtime (see the sketch after this list).
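If your environment includes the Databricks type widening preview, the type change is expressed through DDL rather than a data rewrite. The following is a minimal sketch, assuming a hypothetical table my_table with an INT column id that is widened to BIGINT; the delta.enableTypeWidening table property and the ALTER COLUMN ... TYPE syntax are Databricks-specific and may not be available in open-source Spark:
-- Sketch only: requires a Databricks runtime with the type widening preview.
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');
-- Widen the column type as a metadata-only change, without rewriting data files.
ALTER TABLE my_table ALTER COLUMN id TYPE BIGINT;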
DDL statements for updating Delta Lake table schemas
This section outlines the DDL statements available for updating Delta Lake table schemas.
To run these DDL statements, make sure the DPS cluster you are using is an Extreme DPS cluster.
Replace columns (metadata-only operation):
ALTER TABLE <table_name> REPLACE COLUMNS (<col_name1> <col_type1> [COMMENT <col_comment1>], ...)
Add columns, change column comments, or ordering (metadata-only operation):
ALTER TABLE <table_name> ADD COLUMNS (<col_name> <data_type> [COMMENT <col_comment>] [FIRST | AFTER <colA_name>], ...)
ALTER TABLE <table_name> ALTER [COLUMN] <col_name> (COMMENT <col_comment> | FIRST | AFTER <colA_name>)
Rename columns, or drop columns (metadata-only operation):
ALTER TABLE <table_name> RENAME COLUMN <old_col_name> TO <new_col_name>
ALTER TABLE <table_name> DROP COLUMN <col_name>
Change column type (requires rewriting data):
from pyspark.sql.functions import col

(spark.read.table(...)                                        # read the existing table
    .withColumn("birthDate", col("birthDate").cast("date"))  # cast the column to the new type
    .write
    .mode("overwrite")
    .option("overwriteSchema", "true")                        # allow the table schema to be replaced
    .saveAsTable(...)                                         # overwrite the table with the new schema
)
Turn on Delta Lake table column mapping:
ALTER TABLE {schemaName}.{tableName} SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5'
)
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name'
)
For parameter descriptions, see ALTER TABLE.
Write principles for merging schemas
During write operations like append or merge, any columns in the incremental data that do not exist in the current table (identified by both name and type) will be added to the table schema.
You can enable Spark's mergeSchema option using the Spark DataFrame API. This option applies only to a single write operation for one table, making it ideal for enabling schema evolution on a specific table. To activate this functionality for all table writes, use the Spark session configuration spark.databricks.delta.schema.autoMerge.enabled.
Databricks recommends enabling schema evolution for each write operation instead of relying on a global Spark configuration. For more details, visit Delta Lake Schema Evolution.
# Enable schema evolution for this single write: new columns in df are added to the table schema.
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable(f"{schemaName}.{tableName}")

# Alternatively, enable automatic schema merging for every write in the Spark session.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
Automatic schema evolution for Delta Lake merges
For details, see the Apache Spark MERGE INTO SQL reference.
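As a rough sketch (not taken from the Relyt documentation), the example below assumes two hypothetical Delta tables, target and source, where source contains a column that target does not yet have. With automatic schema merging enabled for the session, the MERGE adds the new column to target's schema:
-- Sketch only: target and source are hypothetical Delta tables joined on an id column.
SET spark.databricks.delta.schema.autoMerge.enabled = true;
MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *     -- updates all columns, including newly added ones
WHEN NOT MATCHED THEN INSERT *;    -- inserts all columns from source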