Building a Sensor Data Pipeline with Python's pathlib
Introduction
Every Python program that touches the filesystem eventually accumulates a tangle of os.path.join calls, os.path.splitext acrobatics, and open(os.path.join(base, subdir, filename)) chains. The code works — until someone on Windows has a different path separator, until a developer drops a trailing slash in the wrong place, or until the same path gets constructed five different ways across the codebase and one of them is subtly wrong.
Python's pathlib module replaces all of that with a single Path object that knows what it is, where it lives, and how to reach its neighbors. Paths become objects with attributes — .stem, .suffix, .parent — not strings that must be sliced and rejoined. Directory traversal becomes a one-liner with glob(). Reading and writing a file becomes path.read_text() and path.write_text().
This tutorial builds a sensor data ingestion pipeline for a climate research network. Along the way it replaces every fragile os.path pattern with the equivalent pathlib operation, explaining exactly what breaks in the string-manipulation version and what the path-object version makes safe.
Background
pathlib models a filesystem path as an object rather than a string. The core class is Path. Instantiating it returns a PosixPath on Linux and macOS or a WindowsPath on Windows, so the same code runs on all three without any conditional logic.
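A quick way to see which concrete class Path produces is to inspect the type of an instance. The snippet below is a minimal sketch; the relative path used in it is purely illustrative and nothing is read from or written to disk.

```python
import os
from pathlib import Path, PosixPath, WindowsPath

# Illustrative relative path; it does not need to exist.
p = Path("sensor_archive") / "NST-01"

# Path() picks the concrete subclass that matches the running interpreter.
expected = WindowsPath if os.name == "nt" else PosixPath

print(type(p).__name__, isinstance(p, expected))
```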
A Path object exposes every component of its address as a named attribute:
.name: the final component, including the extension (2024-01-17.csv)
.stem: the final component without its extension (2024-01-17)
.suffix: the extension, including the dot (.csv)
.parent: the containing directory, itself a Path object
.parts: all components as a tuple (('/', 'home', 'coder', 'learning', ...))
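These attributes can be checked without touching the filesystem. The sketch below uses PurePosixPath, the disk-free POSIX variant of Path, so it behaves the same on every operating system. The example path is an assumption modeled on the archive layout used later in this tutorial.

```python
from pathlib import PurePosixPath

# PurePosixPath never touches the disk, so this runs identically everywhere.
# The path is a made-up example, not a file that has to exist.
p = PurePosixPath("/home/coder/learning/sensor_archive/NST-01/2024-01-17.csv")

components = {
    "name": p.name,           # final component with extension
    "stem": p.stem,           # final component without extension
    "suffix": p.suffix,       # extension, including the dot
    "parent": str(p.parent),  # containing directory
    "parts": p.parts,         # every component as a tuple
}

for label, value in components.items():
    print(f".{label:<7} {value!r}")
```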
Paths are constructed by passing a string to Path(), or by joining components with the / operator, which replaces every os.path.join call. Path objects are comparable and sortable. Since Python 3.6 they also implement the os.PathLike protocol, so open() and most os and standard library functions accept them directly; for the occasional API that insists on a string, str(path) converts explicitly.
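As a rough illustration of construction, joining, and conversion back to a string, consider the sketch below. It uses PurePosixPath so the printed result does not depend on the operating system, and the directory names are placeholders chosen for this example.

```python
import os
from pathlib import PurePosixPath

base = PurePosixPath("/home/coder/learning")   # illustrative base directory

# The / operator accepts strings or other path objects on the right-hand side.
archive = base / "sensor_archive"
csv_path = archive / "NST-01" / "2024-01-17.csv"

# os.fspath() is the protocol hook that open() and os functions rely on
# to obtain a plain string from a path object.
as_string = os.fspath(csv_path)

# Path objects of the same flavor support ordering, so sorted() works directly.
ordered = sorted([archive / "NST-02", archive / "NST-01"])

print(as_string)
print([p.name for p in ordered])
```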
Practical Scenario
A climate research network operates weather stations across three geographic regions. Each station uploads daily CSV files containing temperature and humidity readings taken at three intervals throughout the day. The files accumulate in a shared archive, organized by station ID.
An ingestion pipeline runs at the end of each day. It scans the archive for the most recent readings per station, computes daily averages, writes a formatted summary report to a separate reports directory, and produces an audit log with file sizes and modification times for the compliance team.
The pipeline runs on Linux field servers and on analysts' laptops running macOS and Windows. Any path logic that hardcodes separators, relies on string slicing, or uses OS-specific calls risks breaking silently on at least one of those platforms. The compliance audit requires that every file processed be listed with its exact size in bytes, metadata that must be queried from the filesystem, not inferred from the data.
The Problem with String-Based Path Manipulation
The first version of the pipeline uses os.path throughout — the standard approach before pathlib existed.
Create a new file:
touch sensor_pipeline.py
Add the code below to it, then run it with:
python3 sensor_pipeline.py
import os
import csv

BASE = "/home/coder/learning/sensor_archive"
STATIONS = ["NST-01", "NST-02", "NST-03"]
DATES = ["2024-01-15", "2024-01-16", "2024-01-17"]
READINGS = {
    "NST-01": [(21.4, 65), (24.1, 58), (19.8, 72)],
    "NST-02": [(18.2, 70), (20.5, 63), (16.9, 78)],
    "NST-03": [(25.3, 55), (27.8, 49), (23.1, 61)],
}

def setup_archive():
    for station in STATIONS:
        station_dir = os.path.join(BASE, station)
        os.makedirs(station_dir, exist_ok=True)
        for date in DATES:
            file_path = os.path.join(station_dir, date + ".csv")
            with open(file_path, "w") as f:
                f.write("timestamp,temp_c,humidity\n")
                for i, (temp, hum) in enumerate(READINGS[station]):
                    hour = 8 + i * 5
                    f.write(f"{date}T{hour:02d}:00:00,{temp},{hum}\n")

def process_station(station_id):
    station_dir = os.path.join(BASE, station_id)
    files = [f for f in os.listdir(station_dir) if f.endswith(".csv")]
    files.sort()
    latest = files[-1]
    latest_path = os.path.join(station_dir, latest)
    date_label = os.path.splitext(latest)[0]
    with open(latest_path) as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]
    avg = sum(temps) / len(temps)
    print(f"Station {station_id} | Latest: {date_label} | Avg temp: {avg:.1f}°C")

setup_archive()
for station in STATIONS:
    process_station(station)
Station NST-01 | Latest: 2024-01-17 | Avg temp: 21.8°C
Station NST-02 | Latest: 2024-01-17 | Avg temp: 18.5°C
Station NST-03 | Latest: 2024-01-17 | Avg temp: 25.4°C
Every path in this code is a raw string built by hand. os.path.join(station_dir, latest) works, but station_dir is itself the result of an earlier os.path.join, so tracing where it came from means reading the call above. Extracting the date from the filename requires os.path.splitext, a separate function that returns a two-element tuple just to strip four characters. os.listdir returns bare filenames with no location attached, so every result must be rejoined before it can be used. On Windows, os.path.join appends a backslash to the forward-slash BASE, producing a mixed-separator string that usually still opens but trips up any code that compares, splits, or logs paths as text.
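The cross-platform difference can be observed from any machine, because the standard library ships both flavors of os.path as importable modules: posixpath and ntpath. The sketch below is only an illustration of how the same join behaves under each flavor, alongside the pathlib equivalent; it does not touch the filesystem.

```python
import ntpath                      # Windows flavor of os.path, importable anywhere
import posixpath                   # POSIX flavor of os.path
from pathlib import PureWindowsPath

BASE = "/home/coder/learning/sensor_archive"

# What os.path.join produces on each platform for the same inputs.
on_posix = posixpath.join(BASE, "NST-01")
on_windows = ntpath.join(BASE, "NST-01")      # forward slashes plus one backslash

# pathlib normalizes the separators instead of mixing them.
normalized = PureWindowsPath(BASE) / "NST-01"

print(on_posix)
print(on_windows)
print(normalized)
print(normalized.as_posix())
```

The second line of output mixes / and \ in one string, while the pathlib version uses a single separator throughout and can still be rendered with forward slashes through as_posix().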
Path Construction and the / Operator
pathlib.Path replaces os.path.join with the / operator. Each operand is either a Path object or a string, and the result is a new Path — not a string. The path carries its full location at every step without any manual joining. path.mkdir(parents=True, exist_ok=True) replaces os.makedirs with the same semantics, called directly on the directory path.
Replace setup_archive and process_station with the following, and update BASE:
from pathlib import Path

BASE = Path("/home/coder/learning/sensor_archive")

def setup_archive():
    for station in STATIONS:
        station_dir = BASE / station
        station_dir.mkdir(parents=True, exist_ok=True)
        for date in DATES:
            file_path = station_dir / (date + ".csv")
            with open(file_path, "w") as f:
                f.write("timestamp,temp_c,humidity\n")
                for i, (temp, hum) in enumerate(READINGS[station]):
                    hour = 8 + i * 5
                    f.write(f"{date}T{hour:02d}:00:00,{temp},{hum}\n")

def process_station(station_id):
    station_dir = BASE / station_id
    files = sorted(f for f in os.listdir(station_dir) if f.endswith(".csv"))
    latest_path = station_dir / files[-1]
    date_label = os.path.splitext(files[-1])[0]
    with open(latest_path) as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]
    avg = sum(temps) / len(temps)
    print(f"Station {station_id} | Latest: {date_label} | Avg temp: {avg:.1f}°C")

setup_archive()
for station in STATIONS:
    process_station(station)
The output is unchanged.
BASE / station reads as a path expression rather than a function call. The result is a Path object at every step, so it can be passed to open(), compared with ==, checked with .exists(), and extended further with another / — all without converting back to a string. The separator is resolved correctly for the current platform by Path itself, so the same code works on Linux, macOS, and Windows without any conditionals.
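The properties described above can be tried in isolation. The following is a minimal sketch that works inside a temporary directory rather than the tutorial's real archive; the directory and file names are stand-ins chosen for this example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "sensor_archive"        # hypothetical archive root
    station_dir = base / "NST-01"

    existed_before = station_dir.exists()      # nothing has been created yet
    station_dir.mkdir(parents=True, exist_ok=True)
    station_dir.mkdir(parents=True, exist_ok=True)   # repeating the call is harmless
    exists_after = station_dir.is_dir()

    # Two different routes to the same location produce equal Path objects.
    same = station_dir == base.joinpath("NST-01") == Path(tmp, "sensor_archive", "NST-01")

    # open() accepts the Path directly; no str() conversion is needed.
    sample = station_dir / "2024-01-17.csv"
    with open(sample, "w") as f:
        f.write("timestamp,temp_c,humidity\n")
    first_line = sample.read_text().splitlines()[0]

print(existed_before, exists_after, same, first_line)
```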
Path Properties: name, stem, suffix, parent
A Path exposes every component of its address as an attribute. Instead of calling os.path.splitext, os.path.basename, and os.path.dirname — three separate functions for three pieces of the same path — you read .stem, .name, and .parent from one object.
Replace process_station:
def process_station(station_id):
    station_dir = BASE / station_id
    paths = sorted(station_dir / f for f in os.listdir(station_dir) if f.endswith(".csv"))
    latest_path = paths[-1]
    print(f" .name : {latest_path.name}")
    print(f" .stem : {latest_path.stem}")
    print(f" .suffix : {latest_path.suffix}")
    print(f" .parent : {latest_path.parent.name}")
    with open(latest_path) as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]
    avg = sum(temps) / len(temps)
    print(f"Station {station_id} | Date: {latest_path.stem} | Avg temp: {avg:.1f}°C\n")

for station in STATIONS:
    process_station(station)
.name : 2024-01-17.csv
.stem : 2024-01-17
.suffix : .csv
.parent : NST-01
Station NST-01 | Date: 2024-01-17 | Avg temp: 21.8°C
.name : 2024-01-17.csv
.stem : 2024-01-17
.suffix : .csv
.parent : NST-02
Station NST-02 | Date: 2024-01-17 | Avg temp: 18.5°C
.name : 2024-01-17.csv
.stem : 2024-01-17
.suffix : .csv
.parent : NST-03
Station NST-03 | Date: 2024-01-17 | Avg temp: 25.4°C
latest_path.stem gives the date string directly — no tuple unpacking, no slicing. latest_path.parent.name gives the station ID from the path itself rather than from the outer variable, which means the information comes from where the file actually lives, not from a string passed in separately.
Each attribute is a named property of the object. A reader sees .stem and understands "filename without extension" immediately — more precise than tracing what [0] means in os.path.splitext(files[-1])[0]. Because .parent is itself a Path, the chain extends naturally: path.parent.name, path.parent.parent, or any further navigation without a new os.path call at each step.
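The chaining described here extends to a few related operations that were not used in the pipeline itself: .parents, relative_to(), and with_suffix(). The sketch below shows them on a disk-free PurePosixPath; the archive root is assumed to match the BASE value used in this tutorial.

```python
from pathlib import PurePosixPath

base = PurePosixPath("/home/coder/learning/sensor_archive")  # assumed archive root
path = base / "NST-03" / "2024-01-17.csv"

station_id = path.parent.name             # directory that holds the file
archive_name = path.parent.parent.name    # one level further up

# .parents lists every ancestor, nearest first.
ancestors = [str(a) for a in list(path.parents)[:2]]

# relative_to() strips a known prefix, leaving the station/file portion.
relative = path.relative_to(base)

# with_suffix() swaps the extension without any string slicing.
json_twin = path.with_suffix(".json").name

print(station_id, archive_name, relative, json_twin)
```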
Existence Checks and Globbing
os.listdir returns bare filenames with no filtering: you get everything in the directory and must filter and rejoin each result manually. Path.glob() returns Path objects that match a pattern and already include the directory they were found in, so they are ready to use. Before scanning, path.is_dir() and path.exists() let you validate paths as objects rather than checking string-constructed guesses.
Replace process_station and add a scan_archive function:
def process_station(station_id):
    station_dir = BASE / station_id
    if not station_dir.is_dir():
        print(f" {station_id}: no data directory found")
        return
    csv_files = sorted(station_dir.glob("*.csv"))
    if not csv_files:
        print(f" {station_id}: no CSV files")
        return
    latest_path = csv_files[-1]
    with open(latest_path) as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]
    avg = sum(temps) / len(temps)
    print(f"Station {station_id} | Date: {latest_path.stem} | Avg temp: {avg:.1f}°C")

def scan_archive():
    all_files = sorted(BASE.rglob("*.csv"))
    print(f"\nArchive contains {len(all_files)} CSV files:")
    for path in all_files:
        print(f" {path.parent.name}/{path.name}")

for station in STATIONS:
    process_station(station)
scan_archive()
Station NST-01 | Date: 2024-01-17 | Avg temp: 21.8°C
Station NST-02 | Date: 2024-01-17 | Avg temp: 18.5°C
Station NST-03 | Date: 2024-01-17 | Avg temp: 25.4°C
Archive contains 9 CSV files:
NST-01/2024-01-15.csv
NST-01/2024-01-16.csv
NST-01/2024-01-17.csv
NST-02/2024-01-15.csv
NST-02/2024-01-16.csv
NST-02/2024-01-17.csv
NST-03/2024-01-15.csv
NST-03/2024-01-16.csv
NST-03/2024-01-17.csv
station_dir.glob("*.csv") returns matching Path objects — no rejoining before each result can be used. BASE.rglob("*.csv") descends the full directory tree, the equivalent of os.walk plus filtering reduced to a single expression. station_dir.is_dir() checks the path object directly, which makes the guard condition read as a statement about the path rather than a string comparison.
Glob patterns are declarative: *.csv reads as "CSV files in this directory," **/*.csv reads as "CSV files anywhere below this point." The returned objects carry their full paths, so path.parent.name extracts the station ID from where the file actually lives — no separate tracking variable that could drift out of sync with the real path.
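Note: glob() and rglob() return generators that yield matches in no guaranteed order. Calling sorted() on them is the standard way to get a list in a consistent order before indexing or iterating more than once. Taking the last element as the latest file works here because the filenames are ISO 8601 dates, which sort chronologically when compared as text.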
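The one-shot nature of the glob() result is easy to verify. The sketch below creates a few empty files in a temporary directory, deliberately out of date order, and then consumes a glob result twice. The file names are invented for the example, and max() is shown as one possible alternative when only the newest file is needed.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    station_dir = Path(tmp) / "NST-01"       # throwaway stand-in for a station folder
    station_dir.mkdir()
    # Create the files out of date order on purpose, plus one non-CSV file.
    for name in ["2024-01-16.csv", "2024-01-17.csv", "2024-01-15.csv", "notes.txt"]:
        (station_dir / name).touch()

    matches = station_dir.glob("*.csv")      # lazy: nothing is listed yet
    ordered = sorted(matches)                # materialize and order the results
    leftover = list(matches)                 # the iterator is now exhausted

    # max() finds the newest ISO-dated file without building a sorted list.
    latest = max(station_dir.glob("*.csv"))

    names = [p.name for p in ordered]
    latest_name = latest.name

print(names, latest_name, leftover)
```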
Reading and Writing with Path Methods
path.open() works like open(path) but is called as a method on the path object itself. For simple whole-file operations, path.read_text() and path.write_text() go further — they open, read or write, and close without requiring a context manager at the call site.
Replace process_station and add a write_report function:
def process_station(station_id):
    station_dir = BASE / station_id
    latest_path = sorted(station_dir.glob("*.csv"))[-1]
    with latest_path.open() as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]
    avg = sum(temps) / len(temps)
    return station_id, latest_path.stem, avg

def write_report(results):
    report_dir = BASE.parent / "reports"
    report_dir.mkdir(exist_ok=True)
    report_path = report_dir / "daily_summary.txt"
    lines = ["Daily Station Summary", "=" * 30]
    for station_id, date, avg in results:
        lines.append(f"{station_id} {date} {avg:.1f}°C")
    report_path.write_text("\n".join(lines) + "\n")
    print(f"Report written to: {report_path}")
    print(report_path.read_text())

results = [process_station(s) for s in STATIONS]
write_report(results)
Report written to: /home/coder/learning/reports/daily_summary.txt
Daily Station Summary
==============================
NST-01 2024-01-17 21.8°C
NST-02 2024-01-17 18.5°C
NST-03 2024-01-17 25.4°C
latest_path.open() gives a standard file handle, identical to open(latest_path), but tied to the object that already represents the file. report_path.write_text(...) creates or overwrites the file and handles the lifecycle internally. report_path.read_text() returns the full content as a string without a separate with block.
For structured formats like CSV, path.open() integrates file access into the path object rather than scattering open(os.path.join(...)) calls across the code. For simple text files, read_text() and write_text() remove the context manager overhead when the full content is what you need. Both methods accept an encoding parameter, as in path.read_text(encoding="utf-8"). Without it they fall back to the platform's default encoding, which differs between systems, so passing it explicitly keeps non-ASCII content such as the °C in the report from being misread on another machine.
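A short round trip illustrates the effect of pinning the encoding. This is a sketch run against a temporary file rather than the real report, and the single report line is copied from the sample output above.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    report_path = Path(tmp) / "daily_summary.txt"   # temporary stand-in for the report
    line = "NST-01 2024-01-17 21.8°C\n"

    # Pin the encoding on both sides so the result does not depend on the locale.
    report_path.write_text(line, encoding="utf-8")
    round_trip = report_path.read_text(encoding="utf-8")

    # The degree sign is a multi-byte sequence in UTF-8, visible via read_bytes().
    raw = report_path.read_bytes()
    degree_bytes = "°".encode("utf-8")

print(round_trip == line, degree_bytes in raw)
```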
File Metadata with stat()
path.stat() returns the same metadata as os.stat() — file size, modification time, permissions — called directly on the path object. The size in bytes comes from st_size, and the modification time comes from st_mtime as a Unix timestamp that datetime.fromtimestamp converts to a readable format.
Add the following function to sensor_pipeline.py and call it:
from datetime import datetime

def audit_report():
    print("\nAudit log — latest readings per station\n")
    for station in STATIONS:
        station_dir = BASE / station
        latest = sorted(station_dir.glob("*.csv"))[-1]
        info = latest.stat()
        size = info.st_size
        mtime = datetime.fromtimestamp(info.st_mtime).strftime("%Y-%m-%d %H:%M:%S")
        print(f"{station} {latest.name} {size} B {mtime}")

audit_report()
Audit log — latest readings per station
NST-01 2024-01-17.csv 110 B 2024-05-13 10:42:05
NST-02 2024-01-17.csv 110 B 2024-05-13 10:42:05
NST-03 2024-01-17.csv 110 B 2024-05-13 10:42:05
The timestamps reflect when the script runs, so they will differ from the values shown here. The file size is deterministic on Linux and macOS: each file has a 26-byte header row and three 28-byte data rows, totaling 110 bytes. On Windows, text-mode writes turn each newline into a two-byte \r\n sequence, so the same file is 114 bytes.
path.stat() is called on the object that already represents the file — no need to pass the path string into a separate os.stat() call or import os for this purpose alone. Combined with .name, .stem, and .parent, every piece of information about a file comes from the single object that describes it: its location, its components, its content interface, and its metadata.
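For an audit trail it can help to wrap the stat() call in a small helper that returns the fields together. The sketch below is one possible shape, not part of the pipeline above: the describe function is a hypothetical name, and reporting the modification time in UTC is an assumption made here so that logs from servers in different time zones line up.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def describe(path):
    """Return (name, size in bytes, modification time as an ISO 8601 UTC string)."""
    info = path.stat()                       # one stat call, reused for both fields
    mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return path.name, info.st_size, mtime.isoformat(timespec="seconds")


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "2024-01-17.csv"    # throwaway file, not the real archive
    # write_bytes() skips newline translation, so the size is the same on every OS.
    sample.write_bytes(b"timestamp,temp_c,humidity\n")
    name, size, stamp = describe(sample)

print(name, size, stamp)
```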
Summary
Python's pathlib module replaces the scattered os.path API with a single Path object that carries its location, exposes every component as a named attribute, and provides methods for the most common filesystem operations. In this tutorial we built a sensor data pipeline that demonstrated the full range:
Path() converts a string into a path object; the / operator joins segments the same way os.path.join does but produces a Path at every step, with the separator resolved correctly for the current platform
.name, .stem, .suffix, .parent, and .parts provide every component of a path as a named attribute, replacing os.path.splitext, os.path.basename, and os.path.dirname with readable attribute access on a single object
.exists(), .is_file(), and .is_dir() validate a path as a check on the object rather than a string comparison, keeping path logic local and readable
path.glob("*.csv") returns Path objects that match a pattern and already include their directory; path.rglob("*.csv") descends the entire subtree. Both replace os.listdir plus filtering plus rejoining with a single declarative expression
path.open() is the object-oriented equivalent of open(path) for structured formats; path.read_text() and path.write_text() handle whole-file reads and writes without an explicit context manager
path.mkdir(parents=True, exist_ok=True) creates the full directory chain in one call, replacing os.makedirs with the same two keyword arguments called directly on the directory path
path.stat().st_size and path.stat().st_mtime return file metadata directly from the path object, with no os import or string conversion required