Schema Evolution
Delta Lake offers online schema change capabilities that enable efficient DDL execution without rewriting data. Key features, such as column mapping, facilitate these changes. In addition, Spark provides the writer option mergeSchema for dynamically adding columns and overwriteSchema for replacing the entire dataset along with its schema. Relyt supports both scenarios when reading from Delta Lake.
Typical online schema change operations include:
- Adding columns
- Dropping and renaming columns
- Changing column comments or ordering
- Changing column types
How an online schema change is implemented on the write side determines how easily it can be supported on the read side. For example, schema changes applied by overwriting the data, such as modifying column types or names, are seamless by default and require no additional design.
This topic introduces Spark's implementations of online schema changes for Delta Lake tables, highlighting which changes are metadata-only and which require data rewriting.
How Spark implements online schema changes
Spark provides the following implementations of online schema changes:
- The overwriteSchema option rewrites all existing data files to apply schema changes, such as adding, dropping, or renaming columns, or modifying column comments and order.
- Delta table's column mapping feature allows you to explicitly drop and rename columns without rewriting the existing data.
- The mergeSchema option in Spark enables replacing and adding columns, as well as modifying column comments and order, using a metadata-only approach.
- Changing column types without rewriting data, known as type widening, is not currently supported in open-source Spark. This feature is available in public preview through the enableTypeWidening feature in the Databricks runtime (see the sketch after this list).
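If your environment includes the Databricks type widening preview, the type change is expressed through DDL rather than a data rewrite. The following is a minimal sketch, assuming a hypothetical table my_table with an INT column id that is widened to BIGINT; the delta.enableTypeWidening table property and the ALTER COLUMN ... TYPE syntax are Databricks-specific and may not be available in open-source Spark:
-- Sketch only: requires a Databricks runtime with the type widening preview.
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');
-- Widen the column type as a metadata-only change, without rewriting data files.
ALTER TABLE my_table ALTER COLUMN id TYPE BIGINT;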
DDL statements for updating Delta Lake table schemas
This section outlines the DDL statements available for updating Delta Lake table schemas.
To run these DDL statements, make sure the DPS cluster you are using is an Extreme DPS cluster.
Replace columns (metadata-only operation):
ALTER TABLE <table_name> REPLACE COLUMNS (<col_name1> <col_type1> [COMMENT <col_comment1>], ...)
Add columns, change column comments, or ordering (metadata-only operation):
ALTER TABLE <table_name> ADD COLUMNS (<col_name> <data_type> [COMMENT <col_comment>] [FIRST | AFTER <colA_name>], ...)
ALTER TABLE <table_name> ALTER [COLUMN] <col_name> (COMMENT <col_comment> | FIRST | AFTER <colA_name>)
Rename columns, or drop columns (metadata-only operation):
ALTER TABLE <table_name> RENAME COLUMN <old_col_name> TO <new_col_name>
ALTER TABLE <table_name> DROP COLUMN <col_name>
Change column type (requires rewriting data):
from pyspark.sql.functions import col

(spark.read.table(...)                                        # read the existing table
    .withColumn("birthDate", col("birthDate").cast("date"))  # cast the column to the new type
    .write
    .mode("overwrite")
    .option("overwriteSchema", "true")                        # allow the table schema to be replaced
    .saveAsTable(...)                                         # overwrite the table with the new schema
)
Turn on Delta Lake table column mapping:
ALTER TABLE {schemaName}.{tableName} SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5'
)
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name'
)
For parameter descriptions, see ALTER TABLE.
Write principles for merging schemas
During write operations like append or merge, any columns in the incremental data that do not exist in the current table (identified by both name and type) will be added to the table schema.
You can enable Spark's mergeSchema option using the Spark DataFrame API. This option applies only to a single write operation for one table, making it ideal for enabling schema evolution on a specific table. To activate this functionality for all table writes, use the Spark session configuration spark.databricks.delta.schema.autoMerge.enabled.
Databricks recommends enabling schema evolution for each write operation instead of relying on a global Spark configuration. For more details, visit Delta Lake Schema Evolution.
# Enable schema evolution for this single write: new columns in df are added to the table schema.
df.write.format("delta") \
    .mode("append") \
    .option("mergeSchema", "true") \
    .saveAsTable(f"{schemaName}.{tableName}")

# Alternatively, enable automatic schema merging for every write in the Spark session.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
Automatic schema evolution for Delta Lake merges
For details, see the Apache Spark MERGE INTO SQL reference.
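As a rough sketch (not taken from the Relyt documentation), the example below assumes two hypothetical Delta tables, target and source, where source contains a column that target does not yet have. With automatic schema merging enabled for the session, the MERGE adds the new column to target's schema:
-- Sketch only: target and source are hypothetical Delta tables joined on an id column.
SET spark.databricks.delta.schema.autoMerge.enabled = true;
MERGE INTO target AS t
USING source AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *     -- updates all columns, including newly added ones
WHEN NOT MATCHED THEN INSERT *;    -- inserts all columns from source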