A hands-on lab where you deploy a real distributed database cluster using Citus + PostgreSQL, implement fragmentation, run distributed queries, and simulate 2-phase commit transactions across 4 geographic sites.
Follow these 6 steps in order before writing any SQL. Steps 1–3 are required for everyone. Docker (Steps 4–6) is the core environment for this lab.
https://classroom.github.com/a/xxxxxxxx — opens GitHub Classroom and creates your personal repo at github.com/[org]/tp-distributed-[your-username] — only you and the professor can see it. Your pushes go into YOUR repo inside the Classroom organisation — never into the professor's template. 74 students = 74 separate repos, all isolated.
```
# Open PowerShell (Windows) or Terminal (Mac/Linux)

# Go to Desktop
cd Desktop

# Clone — replace YOUR-USERNAME with your GitHub username
git clone https://github.com/[org]/tp-distributed-YOUR-USERNAME.git

# Enter the folder
cd tp-distributed-YOUR-USERNAME

# Verify the files are there
dir   # Windows
ls    # Mac / Linux

# Expected output:
# README.md  data/  docker-compose.yml  exercises/  solutions/
```
Download at git-scm.com/downloads → install → reopen your terminal → retry the commands above.
Docker packages software into isolated "containers". Instead of manually installing PostgreSQL + Citus on every PC, one command docker-compose up -d launches the full 4-node cluster — identical on every machine in the class.
```
# Check version
docker --version
# Docker version 24.x.x ✓

# Quick test
docker run hello-world
# "Hello from Docker!" ✓
```
Docker Desktop is not open. Start it from Start Menu / Applications and wait for the green whale.
```
wsl --install
# Restart PC, then reopen Docker Desktop
```
```
# Make sure you are in the right folder
cd Desktop/tp-distributed-YOUR-USERNAME

# Start the 4-node cluster (first time = ~4 min download)
docker-compose up -d

# Verify all 4 containers are running
docker ps

# Expected — you must see 4 containers:
# CONTAINER ID   IMAGE                  STATUS
# aaa111         citusdata/citus:12.1   Up    (citus_master)
# bbb222         citusdata/citus:12.1   Up    (citus_worker1)
# ccc333         citusdata/citus:12.1   Up    (citus_worker2)
# ddd444         citusdata/citus:12.1   Up    (citus_worker3)
```
University network may block Docker Hub. Solutions:
```
# Option A: use your phone hotspot, then:
docker-compose up -d

# Option B: pre-download at home before the lab:
docker pull citusdata/citus:12.1
```
Docker launched 4 PostgreSQL+Citus containers that talk to each other on an internal network. citus_master is the coordinator — all your SQL goes there. The 3 workers store the actual shards.
```
# 1. Connect to the coordinator and create tables
docker exec -it citus_master psql -U postgres -d mediAI \
  -f /data/schema-mediAI.sql

# 2. Register workers + distribute tables with Citus
docker exec -it citus_master psql -U postgres -d mediAI \
  -f /data/init-cluster.sql

# 3. Insert 20 patients + medical records + transactions
docker exec -it citus_master psql -U postgres -d mediAI \
  -f /data/seed-mediAI.sql

# 4. Verify everything loaded correctly
docker exec -it citus_master psql -U postgres -d mediAI -c \
  "SELECT 'Patients' AS t, COUNT(*) FROM Patients
   UNION ALL SELECT 'MedicalRecords', COUNT(*) FROM MedicalRecords
   UNION ALL SELECT 'TrainingData',   COUNT(*) FROM TrainingData
   UNION ALL SELECT 'Transactions',   COUNT(*) FROM Transactions;"

# Expected:
#        t        | count
# ----------------+-------
#  Patients       |    20
#  MedicalRecords |    14
#  TrainingData   |    13
#  Transactions   |    18
```
All 4 tables are created and distributed across the 3 worker nodes by Citus. You can now start the lab exercises.
```
# Connect to the coordinator (master)
docker exec -it citus_master psql -U postgres -d mediAI

# You are now inside psql. Try:
SELECT name, country, siteOrigin FROM Patients WHERE country = 'France' LIMIT 5;

# Exit psql:
\q
```
Download free at dbeaver.io. Then create a new connection with these settings:
```
Type     : PostgreSQL
Host     : localhost
Port     : 5432
Database : mediAI
Username : postgres
Password : postgres
```
Citus cluster running, tables distributed, data loaded. Start with Part 1 — Fragmentation and work through each section in order.
MediAI runs on a Citus cluster: 1 coordinator (Paris HQ) + 3 worker nodes (Tunis, Montréal, Tokyo). The coordinator receives all queries and routes them to the right workers transparently.
| Table | Distribution Key | Type | 🇫🇷 Paris | 🇹🇳 Tunis | 🇨🇦 Montréal | 🇯🇵 Tokyo |
|---|---|---|---|---|---|---|
| Patients | country | Horizontal | France | Tunisia | Canada | Japan |
| MedicalRecords | country | Horizontal | co-loc | co-loc | co-loc | co-loc |
| TrainingData | siteOrigin | Horizontal | Paris | Tunis | Montreal | Tokyo |
| Transactions | country | Hybrid | France | Tunisia | Canada | Japan |
Fragmentation divides a table across multiple sites. You will implement all three types: horizontal (rows), vertical (columns), and hybrid (both combined).
Split a table into subsets of rows based on a predicate. Each site stores only the rows that belong to it. The union of all fragments reconstructs the full table.
Completeness: every row belongs to at least one fragment.
Disjointness: no row appears in two fragments (for horizontal fragmentation).
Reconstructibility: UNION of all fragments = original table.
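Once the four per-site views exist, these conditions can be verified directly. The following is a sketch; the view names TrainingData_Montreal and TrainingData_Tokyo assume you name your TODO views consistently with the Paris and Tunis examples.

```sql
-- Reconstructibility: the UNION of all fragments must equal the table.
SELECT COUNT(*) FROM TrainingData;          -- original row count
SELECT COUNT(*) FROM (
    SELECT * FROM TrainingData_Paris
    UNION ALL SELECT * FROM TrainingData_Tunis
    UNION ALL SELECT * FROM TrainingData_Montreal   -- assumed view name
    UNION ALL SELECT * FROM TrainingData_Tokyo      -- assumed view name
) AS all_fragments;                          -- must match the count above
```

If the two counts match (and siteOrigin is never NULL, so no row falls outside every predicate), completeness and disjointness hold together, since any missing or duplicated row would make the totals differ.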
Fragment TrainingData by siteOrigin.

```sql
-- Fragment Paris
CREATE OR REPLACE VIEW TrainingData_Paris AS
SELECT * FROM TrainingData
WHERE siteOrigin = ___;        -- ← complete

-- Fragment Tunis
CREATE OR REPLACE VIEW TrainingData_Tunis AS
SELECT * FROM TrainingData
WHERE ___ = 'Tunis';           -- ← complete

-- TODO: write views for Montreal and Tokyo
___
```
Split a table into subsets of columns based on access patterns. Each fragment contains the primary key + a subset of attributes. Used when different user groups need different columns.
This is the reconstructibility condition. The JOIN on idRecord must rebuild the original table perfectly.
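As an illustration, a vertical split of MedicalRecords could separate clinical columns from AI columns. This is a sketch, not the exercise solution; the fragment names MedRec_Clinical and MedRec_AI are made up here, and the column list is taken from the rows used elsewhere in this lab.

```sql
-- Clinical fragment: primary key + medical columns
CREATE OR REPLACE VIEW MedRec_Clinical AS
SELECT idRecord, date, examType, result FROM MedicalRecords;

-- AI fragment: primary key + AI-related columns
CREATE OR REPLACE VIEW MedRec_AI AS
SELECT idRecord, aiModelUsed, aiScore, aiVersion FROM MedicalRecords;

-- Reconstruction: a JOIN on the shared primary key rebuilds the table
SELECT c.idRecord, c.date, c.examType, c.result,
       a.aiModelUsed, a.aiScore, a.aiVersion
FROM MedRec_Clinical c
JOIN MedRec_AI a ON c.idRecord = a.idRecord;
```

Note that both fragments must carry the primary key; without it, the reconstruction JOIN is impossible and information is lost.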
Apply horizontal fragmentation first, then vertical fragmentation on each horizontal fragment — or vice versa. The result is a matrix of fragments. Applied here to the Transactions table.
```sql
-- France Financial
CREATE OR REPLACE VIEW Trans_FR_Financial AS
SELECT idTrans, idPatient, date, amount, currency
FROM Transactions
WHERE country = 'France';

-- France Management
CREATE OR REPLACE VIEW Trans_FR_Management AS
SELECT idTrans, idPatient, type, status
FROM Transactions
WHERE country = 'France';

-- TODO: Tunisia, Canada, Japan (6 more views)
___

-- TODO: Reconstruction JOIN for France
SELECT ___
FROM Trans_FR_Financial fin
JOIN Trans_FR_Management mgt ON ___ = ___;
```
Citus implements horizontal fragmentation automatically. You declare the distribution key — Citus creates shards and places them on workers.
```sql
-- Step 1: Register the 3 workers in the coordinator
SELECT citus_add_node('citus_worker1', 5432);
SELECT citus_add_node('citus_worker2', 5432);
SELECT citus_add_node('citus_worker3', 5432);

-- Verify all 3 workers are active
SELECT nodeid, nodename, nodeport, isactive
FROM pg_dist_node
ORDER BY nodeid;

-- Step 2: Distribute tables
SELECT create_distributed_table('Patients',       'country');
SELECT create_distributed_table('TrainingData',   'siteOrigin');
SELECT create_distributed_table('MedicalRecords', 'country');
SELECT create_distributed_table('Transactions',   'country');

-- Step 3: Inspect shards
SELECT s.logicalrelid, s.shardid, p.nodename
FROM pg_dist_shard s
JOIN pg_dist_shard_placement p ON s.shardid = p.shardid
ORDER BY s.logicalrelid, s.shardid;
```
Query pg_dist_shard to find: how many shards does each table have? Which worker stores TrainingData rows from Tokyo?

Replication keeps copies of data on multiple nodes to improve availability and read performance. Allocation decides which fragments go on which sites.
Citus replicates each shard across multiple workers. A replication factor of 2 means each shard exists on 2 different workers — if one fails, the other still serves the data.
```sql
-- Set replication factor to 2
SET citus.shard_replication_factor = 2;

-- Check current factor
SHOW citus.shard_replication_factor;

-- See all shard placements
SELECT shardid, nodename, nodeport, shardstate
FROM pg_dist_shard_placement
ORDER BY shardid;
```
What does shardstate = 1 mean?

| Strategy | Description | Used in MediAI |
|---|---|---|
| Partitioned | Each fragment on exactly 1 site | TrainingData |
| Replicated | Each fragment on all sites | Schema (DDL) |
| Mixed | Hot data replicated, cold partitioned | Possible |
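For the "Replicated" strategy, Citus offers reference tables, which are copied in full to every worker so any node can join against them locally. A sketch follows; the Currencies table is an illustrative example, not part of the lab schema.

```sql
-- A reference table is fully replicated to every worker node:
-- the "Replicated" allocation strategy from the table above.
CREATE TABLE Currencies (code CHAR(3) PRIMARY KEY, name TEXT);
SELECT create_reference_table('Currencies');
```

Reference tables suit small, rarely updated "hot" data (lookup tables, currency codes), while large fact tables stay partitioned.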
Write SQL queries that Citus transparently distributes across worker nodes. Use EXPLAIN to understand where each query runs and how results are merged.
```sql
-- Citus routes this to a single worker (co-located JOIN on country)
SELECT p.name, p.age, p.city, p.country,
       mr.date, mr.examType, mr.result, mr.aiModelUsed, mr.aiScore
FROM Patients p
JOIN MedicalRecords mr
  ON p.idPatient = mr.idPatient
 AND p.country = mr.country        -- ← distribution key must match
WHERE p.name = 'Mohamed Benali'
ORDER BY mr.date DESC;
```
Run EXPLAIN (VERBOSE) on the query above. Answer: which worker(s) execute it? Why does co-locating Patients and MedicalRecords on the same country key avoid a cross-node JOIN?

Two-Phase Commit (2PC) guarantees that a transaction spanning multiple nodes is either fully committed on ALL nodes or fully rolled back on ALL nodes — atomicity at scale.
Patient Yuki Tanaka (Tokyo) has an emergency consultation in Paris. Two operations must be atomic: create a medical record on the Tokyo worker AND charge a transaction on the France worker.
```sql
-- Phase 1: open transaction and prepare it
BEGIN;

INSERT INTO MedicalRecords (idPatient, country, date, examType, result,
                            aiModelUsed, aiScore, aiVersion)
VALUES (16, 'Japan', NOW()::DATE, 'Emergency consultation',
        'General check — patient travelling', 'DiagNet-3', 0.8934, 'v3.2');

INSERT INTO Transactions (idPatient, country, date, type, amount, currency, status)
VALUES (16, 'Japan', NOW(), 'consultation', 15000, 'JPY', 'pending');

PREPARE TRANSACTION 'mediAI_emergency_yuki_001';

-- Check the prepared transaction is waiting
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

-- Phase 2a: all workers ready → COMMIT
COMMIT PREPARED 'mediAI_emergency_yuki_001';

-- Phase 2b: simulate failure → ROLLBACK
-- ROLLBACK PREPARED 'mediAI_emergency_yuki_001';
```
Run the pg_prepared_xacts query. What does the gid field contain? Then COMMIT the transaction and verify the data was inserted correctly.

```
# In a SEPARATE terminal:
docker stop citus_worker3

# Back in psql — try to commit:
COMMIT PREPARED 'your_txn_id';
# Observe the error message

# Restart the worker:
docker start citus_worker3
```
No email. No PDF. Your GitHub repository is your submission. Every commit you push is timestamped and visible to the professor.
data/ folder.

### Exercise 3.2 — AI model performance

**My query:**
```sql
SELECT p.siteOrigin, mr.aiModelUsed,
       COUNT(*) AS nb_exams,
       ROUND(AVG(mr.aiScore)::numeric, 4) AS avg_score
FROM MedicalRecords mr
JOIN Patients p ON ...
GROUP BY ...
ORDER BY avg_score DESC;
```

**Result from psql:**
```
 siteOrigin | aiModelUsed  | nb_exams | avg_score
------------+--------------+----------+-----------
 Tunis      | NephroAI-1   |        2 |    0.9812
 Paris      | DiagNet-3    |        1 |    0.9812
 ...
```
```
# After Part 1 — Fragmentation
git add .
git commit -m "Part 1 done – horizontal, vertical, hybrid fragmentation"
git push

# After Part 2 — Replication
git add .
git commit -m "Part 2 done – replication and allocation analysis"
git push

# After Part 3 — Distributed Queries
git add .
git commit -m "Part 3 done – distributed SQL queries with EXPLAIN"
git push

# After Part 4 — 2PC
git add .
git commit -m "Part 4 done – 2PC simulation and failure scenarios"
git push

# FINAL submission
git add .
git commit -m "FINAL SUBMISSION – all parts completed"
git push
# ✓ Your last push before the deadline = your submission
```
What was the hardest part of this lab? Describe one specific error you encountered and how you solved it. (min 3 sentences)
In your own words — without the course slides — explain the difference between horizontal and vertical fragmentation. Give a concrete MediAI example for each.
Why does Citus distribute Patients and MedicalRecords using the same key (country)? What would happen to JOIN performance if they used different keys?
In a real hospital, a patient record creation and its billing must be atomic. Is 2PC the right choice? What are the trade-offs in terms of availability vs consistency?
✅ Your last git push before the deadline is your submission.
✅ Push after each part — progressive commits show real work.
✅ Paste psql results as plain text — no screenshots.
✅ Answer all 4 reflection questions in your own words.
❌ Do not modify the solutions/ folder — modifications are tracked.
❌ One commit at 23:58 with everything done = suspicious — commit progressively.