Course Outline
Introduction, Objectives, and Migration Strategy
- Course goals, alignment with participant profiles, and success criteria.
- High-level migration approaches and risk considerations.
- Setting up workspaces, repositories, and lab datasets.
Day 1 — Migration Fundamentals and Architecture
- Lakehouse concepts, Delta Lake overview, and Databricks architecture.
- SMP vs MPP differences and their implications for migration.
- Medallion (Bronze→Silver→Gold) design and Unity Catalog overview.
Day 1 Lab — Translating a Stored Procedure
- Hands-on migration of a sample stored procedure to a notebook.
- Mapping temp tables and cursors to DataFrame transformations.
- Validating the notebook's results against the original procedure's output.
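
The validation step in this lab can be pictured as an order-independent comparison of two result sets. The following is a minimal plain-Python sketch, not the course's lab code: the function name `diff_rows` and the sample tuples are hypothetical, and in a notebook the same idea would more likely be expressed with DataFrame operations such as PySpark's `exceptAll`.

```python
from collections import Counter


def diff_rows(legacy_rows, migrated_rows):
    """Compare two result sets as multisets, ignoring row order.

    Returns (missing, extra): rows the migrated output lacks, and rows it
    produced that the legacy output did not contain. Duplicates count.
    """
    legacy = Counter(legacy_rows)
    migrated = Counter(migrated_rows)
    missing = legacy - migrated   # present in legacy, absent (or fewer) in migrated
    extra = migrated - legacy     # present in migrated, absent (or fewer) in legacy
    return sorted(missing.elements()), sorted(extra.elements())


# Hypothetical sample data: (customer_id, total) tuples.
legacy_output = [(1, 100.0), (2, 50.0), (2, 50.0), (3, 75.0)]
migrated_output = [(3, 75.0), (1, 100.0), (2, 50.0), (4, 10.0)]

missing, extra = diff_rows(legacy_output, migrated_output)
print("missing from migrated:", missing)
print("unexpected in migrated:", extra)
```

Comparing as multisets rather than sets matters here because a migrated procedure that silently drops or duplicates rows would otherwise pass the check.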
Day 2 — Advanced Delta Lake & Incremental Loading
- ACID transactions, commit logs, versioning, and time travel.
- Auto Loader, MERGE INTO patterns, upserts, and schema evolution.
- OPTIMIZE, VACUUM, Z-ORDER, partitioning, and storage tuning.
Day 2 Lab — Incremental Ingestion & Optimization
- Implementing Auto Loader ingestion and MERGE workflows.
- Applying OPTIMIZE, Z-ORDER, and VACUUM; validating results.
- Measuring read/write performance improvements.
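
The upsert behaviour behind the MERGE workflows in this lab can be sketched in plain Python before touching Delta tables. This is an illustration under stated assumptions only: `merge_upsert`, the `id` key, and the sample rows are hypothetical, and the sketch lets the last update for a key win, whereas a real MERGE generally expects the source batch to be deduplicated per key first.

```python
def merge_upsert(target, updates, key):
    """Apply MERGE-style upsert semantics to an in-memory 'table'.

    target and updates are lists of dicts; key names the join column.
    Matched rows are updated, unmatched source rows are inserted.
    Returns a new list sorted by key and leaves target unchanged.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in updates:
        if row[key] in merged:
            merged[row[key]].update(row)   # analogous to WHEN MATCHED THEN UPDATE
        else:
            merged[row[key]] = dict(row)   # analogous to WHEN NOT MATCHED THEN INSERT
    return [merged[k] for k in sorted(merged)]


# Hypothetical target table and incoming batch.
target = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}]
batch = [{"id": 2, "city": "Quito"}, {"id": 3, "city": "Cairo"}]

result = merge_upsert(target, batch, "id")
print(result)
```

Applying the same batch a second time leaves the result unchanged, which is the property that makes upserts a safe fit for incremental loads that may be replayed.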
Day 3 — SQL in Databricks, Performance & Debugging
- Analytical SQL features: window functions, higher-order functions, JSON/array handling.
- Reading the Spark UI, DAGs, shuffles, stages, tasks, and bottleneck diagnosis.
- Query tuning patterns: broadcast joins, hints, caching, and spill reduction.
Day 3 Lab — SQL Refactoring & Performance Tuning
- Refactor a heavy SQL process into optimized Spark SQL.
- Use Spark UI traces to identify and fix skew and shuffle issues.
- Benchmark before/after and document tuning steps.
Day 4 — Tactical PySpark: Replacing Procedural Logic
- Spark execution model: driver, executors, lazy evaluation, and partitioning strategies.
- Transforming loops and cursors into vectorized DataFrame operations.
- Modularization, UDFs/pandas UDFs, widgets, and reusable libraries.
Day 4 Lab — Refactoring Procedural Scripts
- Refactor a procedural ETL script into modular PySpark notebooks.
- Introduce parametrization, unit-style tests, and reusable functions.
- Code review and best-practice checklist application.
Day 5 — Orchestration, End-to-End Pipeline & Best Practices
- Databricks Workflows: job design, task dependencies, triggers, and error handling.
- Designing incremental Medallion pipelines with quality rules and schema validation.
- Integration with Git (GitHub/Azure DevOps), CI, and testing strategies for PySpark logic.
Day 5 Lab — Build a Complete End-to-End Pipeline
- Assemble a Bronze→Silver→Gold pipeline orchestrated with Workflows.
- Implement logging, auditing, retries, and automated validations.
- Run the full pipeline, validate outputs, and prepare deployment notes.
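
The logging and retry items in this lab can be illustrated with a small wrapper. This is a hedged sketch in plain Python, not the lab solution: `run_with_retries`, `flaky_load`, and the task name `silver_load` are hypothetical, and Databricks Workflows provides its own task-level retry settings that would normally be configured instead of, or alongside, code like this.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def run_with_retries(task, name, max_attempts=3, base_delay=0.01):
    """Run task(), logging each attempt and retrying with exponential backoff.

    Re-raises the last exception once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = task()
            log.info("task=%s attempt=%d status=ok", name, attempt)
            return result
        except Exception as exc:
            log.warning("task=%s attempt=%d status=failed error=%s",
                        name, attempt, exc)
            if attempt == max_attempts:
                raise
            # Delay doubles after each failed attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))


attempts = {"count": 0}


def flaky_load():
    """Hypothetical task that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("transient failure")
    return "loaded"


print(run_with_retries(flaky_load, "silver_load"))
```

Logging one structured line per attempt gives a simple audit trail: the number of attempts and the final status of each task can be recovered from the log alone.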
Operationalization, Governance, and Production Readiness
- Unity Catalog governance, lineage, and access-control best practices.
- Cost management, cluster sizing, autoscaling, and job concurrency patterns.
- Deployment checklists, rollback strategies, and runbook creation.
Final Review, Knowledge Transfer, and Next Steps
- Participant presentations of migration work and lessons learned.
- Gap analysis, recommended follow-up activities, and training materials handoff.
- References, further learning paths, and support options.
Requirements
- A foundational understanding of data engineering concepts.
- Experience with SQL and stored procedures (Synapse or SQL Server).
- Familiarity with ETL orchestration concepts (such as ADF or similar tools).
Audience
- Technology managers with a data engineering background.
- Data engineers migrating procedural OLAP logic to Lakehouse patterns.
- Platform engineers responsible for driving Databricks adoption.