Databricks & Delta Lake — Deep Dive

A concise, engineer-friendly walkthrough of Databricks Lakehouse, Delta Tables, commits, reads, temporary storage, and DBFS.

What is Databricks?

Databricks is a unified data and AI platform built on Apache Spark that merges the best of data lakes and data warehouses into a Lakehouse. You store data once in low-cost cloud object storage, query it like a warehouse, run batch and streaming pipelines, and build ML models, all on the same platform.

Why teams use it

  • Scalable Spark compute
  • ACID Delta Tables
  • Streaming + batch
  • SQL analytics
  • MLflow & AI
  • Unity Catalog governance
  • Open formats (Parquet/Delta)
  • Runs on AWS/Azure/GCP

Lakehouse in one line

Object Storage + Delta (transactions) + Spark (compute) + Governance

Delta Tables: What & Where

A Delta Table is a set of Parquet files plus a _delta_log/ transaction log stored in your cloud object storage:

/path/to/table/
 ├─ part-0000.snappy.parquet
 ├─ part-0001.snappy.parquet
 └─ _delta_log/
     ├─ 00000000000000000000.json
     ├─ 00000000000000000001.json
     └─ ...
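
A minimal PySpark sketch (assuming a Databricks notebook where spark, dbutils, and display are predefined; the dbfs:/tmp/demo/sales_delta path is just an example):

# Write a tiny Delta table: Parquet data files and _delta_log/ are created together.
df = spark.createDataFrame([("EU", 10.0), ("US", 20.0)], ["region", "amount"])
df.write.format("delta").mode("overwrite").save("dbfs:/tmp/demo/sales_delta")

# Inspect the transaction log that makes this directory a Delta table.
display(dbutils.fs.ls("dbfs:/tmp/demo/sales_delta/_delta_log/"))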

Managed table

CREATE TABLE sales (region STRING, amount DOUBLE);

Stored under the workspace-managed path (e.g., dbfs:/user/hive/warehouse/).
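
To confirm where the catalog placed it, a quick notebook check (sketch; spark is predefined):

# Managed tables report a location under the metastore-managed root.
spark.sql("DESCRIBE DETAIL sales").select("location", "format").show(truncate=False)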

External table

CREATE TABLE sales
USING DELTA
LOCATION 'abfss://datalake@account.dfs.core.windows.net/bronze/sales';

Physically stored in your ADLS/S3/GCS path; catalog just points to it.
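
The same check now points at the lake path, and dropping the table behaves differently (sketch):

# For an external table, location is the ADLS/S3/GCS path you supplied.
spark.sql("DESCRIBE DETAIL sales").select("location").show(truncate=False)

# DROP TABLE removes only the catalog entry for an external table; the Parquet
# files and _delta_log/ stay in your storage. (Dropping a managed table also
# deletes the underlying files.)
spark.sql("DROP TABLE sales")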

How a Delta Write & Commit Works

  1. Executors write new Parquet data to temporary locations.
  2. Driver prepares a JSON action file under _delta_log/.
  3. Atomic commit: the new log version (e.g., 00000000000000000002.json) becomes visible.
  4. Readers see a consistent snapshot; older readers continue on old versions (snapshot isolation).
Figure: Delta Lake write path, from temporary files to atomic commit in _delta_log/.
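
A quick way to watch this happen, reusing the hypothetical path from the sketch above:

# Append new rows; each successful write produces exactly one new commit file.
new_rows = spark.createDataFrame([("EU", 120.0), ("US", 80.0)], ["region", "amount"])
new_rows.write.format("delta").mode("append").save("dbfs:/tmp/demo/sales_delta")

# A new JSON version should now appear in _delta_log/.
display(dbutils.fs.ls("dbfs:/tmp/demo/sales_delta/_delta_log/"))

# One history entry per atomic commit.
spark.sql("DESCRIBE HISTORY delta.`dbfs:/tmp/demo/sales_delta`") \
    .select("version", "operation", "timestamp").show(truncate=False)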

How a Delta Read Builds a Snapshot

Readers list _delta_log/, replay the actions from the JSON commits (and Parquet checkpoints) to determine the active data files for the requested version, then read only those Parquet files.

Figure: read path, _delta_log + checkpoints → snapshot → filtered file reads.
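
Because each version is just a different set of active files, time travel is a read-time option (sketch, same hypothetical path):

# Both reads rebuild their snapshot from _delta_log/; the old version simply
# ignores files added by later commits.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("dbfs:/tmp/demo/sales_delta")
latest = spark.read.format("delta").load("dbfs:/tmp/demo/sales_delta")
print(v0.count(), latest.count())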

Where Temporary Storage Lives

Before a commit, new data passes through two temporary layers: each executor stages its Parquet output on local disk (e.g., /local_disk0), then uploads it to a temporary path in cloud object storage. Nothing becomes part of the table until the commit is recorded in _delta_log/.

Figure: two-stage temp storage, executor local disk → cloud temporary path.

What is DBFS?

DBFS is a Databricks filesystem abstraction that lets you access cloud object storage with simple paths like /dbfs/mnt/datalake/ or dbfs:/mnt/datalake/. Mounts map to S3/ADLS/GCS; managed tables live under dbfs:/user/hive/warehouse/.
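
A sketch of the two path styles (the /mnt/datalake mount and the file name are assumptions; dbutils is provided by the notebook):

# Spark/dbutils-style URI
display(dbutils.fs.ls("dbfs:/mnt/datalake/"))

# POSIX-style path through the /dbfs FUSE mount (hypothetical file)
with open("/dbfs/mnt/datalake/readme.txt") as f:
    print(f.read())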

Figure: DBFS connects your cluster to cloud object storage via mounts; /local_disk0 is ephemeral.

How “Folders” Work on Object Storage

Object stores are flat key–value systems. Delta uses naming conventions and prefix listings to simulate directories. Example keys:

s3://bucket/sales/_delta_log/00000000000000000000.json
s3://bucket/sales/part-0001.snappy.parquet

The slashes in keys are just delimiters used for grouping in listings and UIs.
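
A small boto3 sketch (bucket and keys hypothetical) showing that a "folder" is nothing more than a key prefix:

import boto3

s3 = boto3.client("s3")

# Listing by prefix is what makes s3://bucket/sales/ look like a directory.
resp = s3.list_objects_v2(Bucket="bucket", Prefix="sales/_delta_log/", Delimiter="/")
for obj in resp.get("Contents", []):
    print(obj["Key"])   # e.g. sales/_delta_log/00000000000000000000.json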