Databricks & Delta Lake — Deep Dive

A concise, engineer-friendly walkthrough of Databricks Lakehouse, Delta Tables, commits, reads, temporary storage, and DBFS.

What is Databricks?

Databricks is a unified data and AI platform built on Apache Spark that merges the best of data lakes and data warehouses into a Lakehouse. You store data once in low-cost cloud object storage, query it like a warehouse, run batch and streaming pipelines, and build ML models, all on the same platform.

Why teams use it

  • Scalable Spark compute
  • ACID Delta Tables
  • Streaming + batch
  • SQL analytics
  • MLflow & AI
  • Unity Catalog governance
  • Open formats (Parquet/Delta)
  • Runs on AWS/Azure/GCP

Lakehouse in one line

Object Storage + Delta (transactions) + Spark (compute) + Governance

Delta Tables: What & Where

A Delta Table is a set of Parquet files plus a _delta_log/ transaction log stored in your cloud object storage:

/path/to/table/
 ├─ part-0000.snappy.parquet
 ├─ part-0001.snappy.parquet
 └─ _delta_log/
     ├─ 00000000000000000000.json
     ├─ 00000000000000000001.json
     └─ ...
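
A minimal PySpark sketch (assuming a Databricks notebook where spark, dbutils, and display are predefined; the dbfs:/tmp/demo/sales_delta path is just an example):

# Write a tiny Delta table: Parquet data files and _delta_log/ are created together.
df = spark.createDataFrame([("EU", 10.0), ("US", 20.0)], ["region", "amount"])
df.write.format("delta").mode("overwrite").save("dbfs:/tmp/demo/sales_delta")

# Inspect the transaction log that makes this directory a Delta table.
display(dbutils.fs.ls("dbfs:/tmp/demo/sales_delta/_delta_log/"))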

Managed table

CREATE TABLE sales (region STRING, amount DOUBLE);

Stored under the workspace-managed path (e.g., dbfs:/user/hive/warehouse/).
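
To confirm where the catalog placed it, a quick notebook check (sketch; spark is predefined):

# Managed tables report a location under the metastore-managed root.
spark.sql("DESCRIBE DETAIL sales").select("location", "format").show(truncate=False)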

External table

CREATE TABLE sales
USING DELTA
LOCATION 'abfss://datalake@account.dfs.core.windows.net/bronze/sales';

Physically stored in your ADLS/S3/GCS path; catalog just points to it.
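
The same check now points at the lake path, and dropping the table behaves differently (sketch):

# For an external table, location is the ADLS/S3/GCS path you supplied.
spark.sql("DESCRIBE DETAIL sales").select("location").show(truncate=False)

# DROP TABLE removes only the catalog entry for an external table; the Parquet
# files and _delta_log/ stay in your storage. (Dropping a managed table also
# deletes the underlying files.)
spark.sql("DROP TABLE sales")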

How a Delta Write & Commit Works

  1. Executors write new Parquet data to temporary locations.
  2. Driver prepares a JSON action file under _delta_log/.
  3. Atomic commit: the new log version (e.g., 00000000000000000002.json) becomes visible.
  4. Readers see a consistent snapshot; older readers continue on old versions (snapshot isolation).
Figure: Delta Lake write path, from temporary files to atomic commit in _delta_log/.
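
A quick way to watch this happen, reusing the hypothetical path from the sketch above:

# Append new rows; each successful write produces exactly one new commit file.
new_rows = spark.createDataFrame([("EU", 120.0), ("US", 80.0)], ["region", "amount"])
new_rows.write.format("delta").mode("append").save("dbfs:/tmp/demo/sales_delta")

# A new JSON version should now appear in _delta_log/.
display(dbutils.fs.ls("dbfs:/tmp/demo/sales_delta/_delta_log/"))

# One history entry per atomic commit.
spark.sql("DESCRIBE HISTORY delta.`dbfs:/tmp/demo/sales_delta`") \
    .select("version", "operation", "timestamp").show(truncate=False)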

How a Delta Read Builds a Snapshot

Readers list _delta_log/, replay the actions from the JSON commits (and Parquet checkpoints) to determine the active data files for the requested version, then read only those Parquet files.

Figure: read path, _delta_log + checkpoints → snapshot → filtered file reads.
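
Because each version is just a different set of active files, time travel is a read-time option (sketch, same hypothetical path):

# Both reads rebuild their snapshot from _delta_log/; the old version simply
# ignores files added by later commits.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("dbfs:/tmp/demo/sales_delta")
latest = spark.read.format("delta").load("dbfs:/tmp/demo/sales_delta")
print(v0.count(), latest.count())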

Where Temporary Storage Lives

Before a commit, new data passes through two temporary layers: each executor stages its Parquet output on local disk (e.g., /local_disk0), then uploads it to a temporary path in cloud object storage. Nothing becomes part of the table until the commit is recorded in _delta_log/.

Figure: two-stage temp storage, executor local disk → cloud temporary path.

What is DBFS?

DBFS is a Databricks filesystem abstraction that lets you access cloud object storage with simple paths like /dbfs/mnt/datalake/ or dbfs:/mnt/datalake/. Mounts map to S3/ADLS/GCS; managed tables live under dbfs:/user/hive/warehouse/.
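
A sketch of the two path styles (the /mnt/datalake mount and the file name are assumptions; dbutils is provided by the notebook):

# Spark/dbutils-style URI
display(dbutils.fs.ls("dbfs:/mnt/datalake/"))

# POSIX-style path through the /dbfs FUSE mount (hypothetical file)
with open("/dbfs/mnt/datalake/readme.txt") as f:
    print(f.read())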

Figure: DBFS connects your cluster to cloud object storage via mounts; /local_disk0 is ephemeral.

How “Folders” Work on Object Storage

Object stores are flat key–value systems. Delta uses naming conventions and prefix listings to simulate directories. Example keys:

s3://bucket/sales/_delta_log/00000000000000000000.json
s3://bucket/sales/part-0001.snappy.parquet

The slashes in keys are just delimiters used for grouping in listings and UIs.
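
A small boto3 sketch (bucket and keys hypothetical) showing that a "folder" is nothing more than a key prefix:

import boto3

s3 = boto3.client("s3")

# Listing by prefix is what makes s3://bucket/sales/ look like a directory.
resp = s3.list_objects_v2(Bucket="bucket", Prefix="sales/_delta_log/", Delimiter="/")
for obj in resp.get("Contents", []):
    print(obj["Key"])   # e.g. sales/_delta_log/00000000000000000000.json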