
Building a Sensor Data Pipeline with Python's pathlib

Dima Jun 30, 2026

Introduction

Every Python program that touches the filesystem eventually accumulates a tangle of os.path.join calls, os.path.splitext acrobatics, and open(os.path.join(base, subdir, filename)) chains. The code works — until someone on Windows has a different path separator, until a developer drops a trailing slash in the wrong place, or until the same path gets constructed five different ways across the codebase and one of them is subtly wrong.

Python's pathlib module replaces all of that with a single Path object that knows what it is, where it lives, and how to reach its neighbors. Paths become objects with attributes — .stem, .suffix, .parent — not strings that must be sliced and rejoined. Directory traversal becomes a one-liner with glob(). Reading and writing a file becomes path.read_text() and path.write_text().

This tutorial builds a sensor data ingestion pipeline for a climate research network. Along the way it replaces every fragile os.path pattern with the equivalent pathlib operation, explaining exactly what breaks in the string-manipulation version and what the path-object version makes safe.


Background

pathlib models a filesystem path as an object rather than a string. The core class is Path; instantiating it returns a PosixPath on Linux and macOS or a WindowsPath on Windows, so the same code runs on all three platforms without any conditional logic.

A Path object exposes every component of its address as a named attribute:

  • .name — the final component, including extension (2024-01-17.csv)
  • .stem — the final component without its extension (2024-01-17)
  • .suffix — the extension including the dot (.csv)
  • .parent — the containing directory, itself a Path object
  • .parts — all components as a tuple (('/', 'home', 'coder', 'learning', ...))

Paths are constructed by passing a string to Path(), or by joining components with the / operator, which replaces every os.path.join call. Path objects of the same flavour are comparable, sortable, and hashable. They implement the os.PathLike protocol, so open(), the os module, and most standard library functions that take a path accept them directly; str(path) gives the plain string for the few APIs that do not.
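
As a quick illustration of these attributes, the sketch below uses PurePosixPath, a variant of Path that only manipulates the path and never touches the disk, so it prints the same thing on every platform. The location is the tutorial's example archive; nothing needs to exist for this to run.

```python
from pathlib import PurePosixPath

# Build the path with the / operator; no file has to exist for this.
base = PurePosixPath("/home/coder/learning/sensor_archive")
reading = base / "NST-01" / "2024-01-17.csv"

print(reading.name)        # 2024-01-17.csv
print(reading.stem)        # 2024-01-17
print(reading.suffix)      # .csv
print(reading.parent)      # /home/coder/learning/sensor_archive/NST-01
print(reading.parts[:3])   # ('/', 'home', 'coder')
```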


Practical Scenario

A climate research network operates weather stations across three geographic regions. Each station uploads daily CSV files containing temperature and humidity readings taken at three intervals throughout the day. The files accumulate in a shared archive, organized by station ID.

An ingestion pipeline runs at the end of each day. It scans the archive for the most recent readings per station, computes daily averages, writes a formatted summary report to a separate reports directory, and produces an audit log with file sizes and modification times for the compliance team.

The pipeline runs on Linux field servers and on analysts' laptops running macOS and Windows. Any path logic that hardcodes separators, relies on string slicing, or uses OS-specific calls will silently break on at least one platform. The compliance audit requires that every file processed be listed with its exact size in bytes, metadata that must be queried from the filesystem, not inferred from the data.


The Problem with String-Based Path Manipulation

The first version of the pipeline uses os.path throughout — the standard approach before pathlib existed.

Create a new file:

touch sensor_pipeline.py

After adding the code below, run it using:

python3 sensor_pipeline.py

import os
import csv

BASE = "/home/coder/learning/sensor_archive"
STATIONS = ["NST-01", "NST-02", "NST-03"]
DATES = ["2024-01-15", "2024-01-16", "2024-01-17"]
READINGS = {
    "NST-01": [(21.4, 65), (24.1, 58), (19.8, 72)],
    "NST-02": [(18.2, 70), (20.5, 63), (16.9, 78)],
    "NST-03": [(25.3, 55), (27.8, 49), (23.1, 61)],
}

def setup_archive():
    for station in STATIONS:
        station_dir = os.path.join(BASE, station)
        os.makedirs(station_dir, exist_ok=True)
        for date in DATES:
            file_path = os.path.join(station_dir, date + ".csv")
            with open(file_path, "w") as f:
                f.write("timestamp,temp_c,humidity\n")
                for i, (temp, hum) in enumerate(READINGS[station]):
                    hour = 8 + i * 5
                    f.write(f"{date}T{hour:02d}:00:00,{temp},{hum}\n")

def process_station(station_id):
    station_dir = os.path.join(BASE, station_id)
    files = [f for f in os.listdir(station_dir) if f.endswith(".csv")]
    files.sort()
    latest = files[-1]
    latest_path = os.path.join(station_dir, latest)
    date_label = os.path.splitext(latest)[0]

    with open(latest_path) as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]

    avg = sum(temps) / len(temps)
    print(f"Station {station_id} | Latest: {date_label} | Avg temp: {avg:.1f}°C")

setup_archive()
for station in STATIONS:
    process_station(station)


Station NST-01 | Latest: 2024-01-17 | Avg temp: 21.8°C
Station NST-02 | Latest: 2024-01-17 | Avg temp: 18.5°C
Station NST-03 | Latest: 2024-01-17 | Avg temp: 25.4°C


Every path in this code is a raw string built by hand. os.path.join(station_dir, latest) works, but station_dir is already the result of a previous os.path.join, so tracing where it came from requires reading the call above. Extracting the date from the filename requires os.path.splitext, a separate function that returns a two-element tuple just to strip four characters. os.listdir returns bare filenames with no location attached, so every result must be rejoined immediately to be usable. On Windows, os.path.join appends backslash-separated components to the forward-slash BASE, producing a mixed-separator string; Windows usually accepts such a path, but any code that compares paths as strings or splits them on a separator can go wrong without raising an error.
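
These weak spots are easy to reproduce in isolation. The sketch below uses ntpath, the implementation that os.path resolves to on Windows, so the Windows join can be seen on any machine; the temporary directory and its single empty file are stand-ins for the real archive.

```python
import ntpath      # what os.path is on Windows
import os
import tempfile

base = "/home/coder/learning/sensor_archive"

# A Windows-style join appends a backslash to the forward-slash base.
mixed = ntpath.join(base, "NST-01")
print(mixed)       # /home/coder/learning/sensor_archive\NST-01

# splitext returns a (root, extension) tuple that has to be unpacked.
root, ext = os.path.splitext("2024-01-17.csv")
print(root, ext)   # 2024-01-17 .csv

# os.listdir returns bare names; the directory must be rejoined by hand.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "2024-01-17.csv"), "w").close()
    names = os.listdir(tmp)
    rejoined_exists = os.path.exists(os.path.join(tmp, names[0]))

print(names)             # ['2024-01-17.csv']
print(rejoined_exists)   # True
```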


Path Construction and the / Operator

pathlib.Path replaces os.path.join with the / operator. Each operand is either a Path object or a string, and the result is a new Path — not a string. The path carries its full location at every step without any manual joining. path.mkdir(parents=True, exist_ok=True) replaces os.makedirs with the same semantics, called directly on the directory path.

Replace setup_archive and process_station with the following, and update BASE. Keep the existing import os and import csv lines, since process_station still uses both:

from pathlib import Path

BASE = Path("/home/coder/learning/sensor_archive")

def setup_archive():
    for station in STATIONS:
        station_dir = BASE / station
        station_dir.mkdir(parents=True, exist_ok=True)
        for date in DATES:
            file_path = station_dir / (date + ".csv")
            with open(file_path, "w") as f:
                f.write("timestamp,temp_c,humidity\n")
                for i, (temp, hum) in enumerate(READINGS[station]):
                    hour = 8 + i * 5
                    f.write(f"{date}T{hour:02d}:00:00,{temp},{hum}\n")

def process_station(station_id):
    station_dir = BASE / station_id
    files = sorted(f for f in os.listdir(station_dir) if f.endswith(".csv"))
    latest_path = station_dir / files[-1]
    date_label = os.path.splitext(files[-1])[0]

    with open(latest_path) as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]

    avg = sum(temps) / len(temps)
    print(f"Station {station_id} | Latest: {date_label} | Avg temp: {avg:.1f}°C")

setup_archive()
for station in STATIONS:
    process_station(station)


The output is unchanged.

BASE / station reads as a path expression rather than a function call. The result is a Path object at every step, so it can be passed to open(), compared with ==, checked with .exists(), and extended further with another / — all without converting back to a string. The separator is resolved correctly for the current platform by Path itself, so the same code works on Linux, macOS, and Windows without any conditionals.
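
To see the separator handling without switching machines, the following sketch uses the pure path classes PurePosixPath and PureWindowsPath, which apply each platform's rules without touching the filesystem. The C:/data/sensor_archive root is a made-up example, not a location the tutorial uses.

```python
from pathlib import PurePosixPath, PureWindowsPath

posix = PurePosixPath("/home/coder/learning/sensor_archive") / "NST-01" / "2024-01-17.csv"
windows = PureWindowsPath("C:/data/sensor_archive") / "NST-01" / "2024-01-17.csv"

print(posix)               # /home/coder/learning/sensor_archive/NST-01/2024-01-17.csv
print(windows)             # C:\data\sensor_archive\NST-01\2024-01-17.csv
print(windows.as_posix())  # C:/data/sensor_archive/NST-01/2024-01-17.csv

# Caveat: an absolute right-hand operand replaces everything to its left.
reset = PurePosixPath("/home/coder") / "/etc"
print(reset)               # /etc
```

The last line shows a behavior the / operator shares with os.path.join: a component that starts at the root discards the path before it, so station IDs and filenames should be passed without a leading separator.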


Path Properties: name, stem, suffix, parent

A Path exposes every component of its address as an attribute. Instead of calling os.path.splitext, os.path.basename, and os.path.dirname — three separate functions for three pieces of the same path — you read .stem, .name, and .parent from one object.

Replace process_station:

def process_station(station_id):
    station_dir = BASE / station_id
    paths = sorted(station_dir / f for f in os.listdir(station_dir) if f.endswith(".csv"))
    latest_path = paths[-1]

    print(f"  .name   : {latest_path.name}")
    print(f"  .stem   : {latest_path.stem}")
    print(f"  .suffix : {latest_path.suffix}")
    print(f"  .parent : {latest_path.parent.name}")

    with open(latest_path) as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]

    avg = sum(temps) / len(temps)
    print(f"Station {station_id} | Date: {latest_path.stem} | Avg temp: {avg:.1f}°C\n")

for station in STATIONS:
    process_station(station)


  .name   : 2024-01-17.csv
  .stem   : 2024-01-17
  .suffix : .csv
  .parent : NST-01
Station NST-01 | Date: 2024-01-17 | Avg temp: 21.8°C

  .name   : 2024-01-17.csv
  .stem   : 2024-01-17
  .suffix : .csv
  .parent : NST-02
Station NST-02 | Date: 2024-01-17 | Avg temp: 18.5°C

  .name   : 2024-01-17.csv
  .stem   : 2024-01-17
  .suffix : .csv
  .parent : NST-03
Station NST-03 | Date: 2024-01-17 | Avg temp: 25.4°C


latest_path.stem gives the date string directly — no tuple unpacking, no slicing. latest_path.parent.name gives the station ID from the path itself rather than from the outer variable, which means the information comes from where the file actually lives, not from a string passed in separately.

Each attribute is a named property of the object. A reader sees .stem and understands "filename without extension" immediately — more precise than tracing what [0] means in os.path.splitext(files[-1])[0]. Because .parent is itself a Path, the chain extends naturally: path.parent.name, path.parent.parent, or any further navigation without a new os.path call at each step.
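
The same attributes make derived values straightforward. The sketch below, again using PurePosixPath so that no files are needed, parses the stem into a date and derives a sibling filename with with_suffix(). The .json target and the archive.tar.gz name are hypothetical examples, not part of the pipeline.

```python
from datetime import date
from pathlib import PurePosixPath

reading = PurePosixPath("sensor_archive/NST-01/2024-01-17.csv")

# The stem is an ISO date, so it parses directly.
reading_date = date.fromisoformat(reading.stem)
print(reading_date.year)            # 2024

# with_suffix() returns a new path; the original is left unchanged.
summary = reading.with_suffix(".json")
print(summary)                      # sensor_archive/NST-01/2024-01-17.json

# .parents walks upward; index 0 is the same as .parent.
print(reading.parents[1])           # sensor_archive

# Only the last extension counts as the suffix.
bundle = PurePosixPath("archive.tar.gz")
print(bundle.stem, bundle.suffix)   # archive.tar .gz
print(bundle.suffixes)              # ['.tar', '.gz']
```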


Existence Checks and Globbing

os.listdir returns bare filenames with no filtering: you get everything in the directory and must filter and rejoin each result manually. Path.glob() yields Path objects that already include the directory they were found in, filtered by a pattern and ready to use. Before scanning, path.is_dir() and path.exists() let you validate paths as objects rather than checking string-constructed guesses.

Replace process_station and add a scan_archive function:

def process_station(station_id):
    station_dir = BASE / station_id
    if not station_dir.is_dir():
        print(f"  {station_id}: no data directory found")
        return
    csv_files = sorted(station_dir.glob("*.csv"))
    if not csv_files:
        print(f"  {station_id}: no CSV files")
        return
    latest_path = csv_files[-1]

    with open(latest_path) as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]

    avg = sum(temps) / len(temps)
    print(f"Station {station_id} | Date: {latest_path.stem} | Avg temp: {avg:.1f}°C")

def scan_archive():
    all_files = sorted(BASE.rglob("*.csv"))
    print(f"\nArchive contains {len(all_files)} CSV files:")
    for path in all_files:
        print(f"  {path.parent.name}/{path.name}")

for station in STATIONS:
    process_station(station)
scan_archive()


Station NST-01 | Date: 2024-01-17 | Avg temp: 21.8°C
Station NST-02 | Date: 2024-01-17 | Avg temp: 18.5°C
Station NST-03 | Date: 2024-01-17 | Avg temp: 25.4°C

Archive contains 9 CSV files:
  NST-01/2024-01-15.csv
  NST-01/2024-01-16.csv
  NST-01/2024-01-17.csv
  NST-02/2024-01-15.csv
  NST-02/2024-01-16.csv
  NST-02/2024-01-17.csv
  NST-03/2024-01-15.csv
  NST-03/2024-01-16.csv
  NST-03/2024-01-17.csv


station_dir.glob("*.csv") returns matching Path objects — no rejoining before each result can be used. BASE.rglob("*.csv") descends the full directory tree, the equivalent of os.walk plus filtering reduced to a single expression. station_dir.is_dir() checks the path object directly, which makes the guard condition read as a statement about the path rather than a string comparison.

Glob patterns are declarative: *.csv reads as "CSV files in this directory," **/*.csv reads as "CSV files anywhere below this point." The returned objects carry their full paths, so path.parent.name extracts the station ID from where the file actually lives — no separate tracking variable that could drift out of sync with the real path.

Note: glob() and rglob() return generators, and the order in which they yield results is not guaranteed. Calling sorted() on them is the standard way to get a list in consistent order before indexing or iterating multiple times.
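
A self-contained way to try this is to build a throwaway archive in a temporary directory. The sketch below assumes the same station and date layout as the tutorial but creates empty files, plus a notes.txt file invented here to show the filtering. It also shows max() as an alternative to sorted(...)[-1] when only the newest file is needed, which works because ISO dates sort correctly as text.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "sensor_archive"
    for station in ("NST-01", "NST-02"):
        station_dir = base / station
        station_dir.mkdir(parents=True, exist_ok=True)
        for day in ("2024-01-15", "2024-01-16", "2024-01-17"):
            (station_dir / f"{day}.csv").touch()
        (station_dir / "notes.txt").touch()   # must not match *.csv

    # glob() looks in one directory only.
    one_station = [p.name for p in sorted((base / "NST-01").glob("*.csv"))]
    print(one_station)

    # rglob() descends into every subdirectory.
    everything = sorted(base.rglob("*.csv"))
    print(len(everything))           # 6

    # max() finds the newest file without building a sorted list.
    latest = max((base / "NST-02").glob("*.csv"))
    print(latest.name)               # 2024-01-17.csv

    # The result of rglob() can be consumed only once.
    matches = base.rglob("*.csv")
    first_pass = len(list(matches))
    second_pass = len(list(matches))
    print(first_pass, second_pass)   # 6 0
```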


Reading and Writing with Path Methods

path.open() works like open(path) but is called as a method on the path object itself. For simple whole-file operations, path.read_text() and path.write_text() go further — they open, read or write, and close without requiring a context manager at the call site.

Replace process_station and add a write_report function:

def process_station(station_id):
    station_dir = BASE / station_id
    latest_path = sorted(station_dir.glob("*.csv"))[-1]

    with latest_path.open() as f:
        reader = csv.DictReader(f)
        temps = [float(r["temp_c"]) for r in reader]

    avg = sum(temps) / len(temps)
    return station_id, latest_path.stem, avg

def write_report(results):
    report_dir = BASE.parent / "reports"
    report_dir.mkdir(exist_ok=True)
    report_path = report_dir / "daily_summary.txt"
    lines = ["Daily Station Summary", "=" * 30]
    for station_id, date, avg in results:
        lines.append(f"{station_id}  {date}  {avg:.1f}°C")
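    # The report contains a degree sign, so the encoding is set explicitly.
    report_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    print(f"Report written to: {report_path}")
    print(report_path.read_text(encoding="utf-8"))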

results = [process_station(s) for s in STATIONS]
write_report(results)


Report written to: /home/coder/learning/reports/daily_summary.txt
Daily Station Summary
==============================
NST-01  2024-01-17  21.8°C
NST-02  2024-01-17  18.5°C
NST-03  2024-01-17  25.4°C


latest_path.open() gives a standard file handle, identical to open(latest_path), but tied to the object that already represents the file. report_path.write_text(...) creates or overwrites the file and handles the lifecycle internally. report_path.read_text() returns the full content as a string without a separate with block.

For structured formats like CSV, path.open() integrates file access into the path object rather than scattering open(os.path.join(...)) calls across the code. For simple text files, read_text() and write_text() remove the context manager overhead when the full content is what you need. Both methods accept an encoding parameter, as in path.read_text(encoding="utf-8"). Setting it explicitly avoids depending on the platform's default encoding, which differs between systems and can garble or reject non-ASCII content such as the degree sign in the report.
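
The round trip can be tried in isolation. The sketch below writes a small report with an explicit encoding and reads a CSV through path.open(). It works in a temporary directory, and the sample rows are illustrative rather than read from the archive. Passing newline="" when opening a CSV file follows the csv module's documented recommendation.

```python
import csv
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)

    # Whole-file write and read; utf-8 keeps the degree sign portable.
    report = work / "reports" / "daily_summary.txt"
    report.parent.mkdir(parents=True, exist_ok=True)
    report.write_text("NST-01  2024-01-17  21.8°C\n", encoding="utf-8")
    text = report.read_text(encoding="utf-8")
    print(text, end="")

    # Structured read: path.open() feeds csv.DictReader as usual.
    readings = work / "2024-01-17.csv"
    readings.write_text(
        "timestamp,temp_c,humidity\n"
        "2024-01-17T08:00:00,21.4,65\n"
        "2024-01-17T13:00:00,24.1,58\n",
        encoding="utf-8",
    )
    with readings.open(newline="", encoding="utf-8") as f:
        temps = [float(row["temp_c"]) for row in csv.DictReader(f)]

    average = sum(temps) / len(temps)
    print(f"{average:.2f}")   # 22.75
```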


File Metadata with stat()

path.stat() returns the same metadata as os.stat() — file size, modification time, permissions — called directly on the path object. The size in bytes comes from st_size, and the modification time comes from st_mtime as a Unix timestamp that datetime.fromtimestamp converts to a readable format.

Add the following function to sensor_pipeline.py and call it:

from datetime import datetime

def audit_report():
    print("\nAudit log - latest readings per station\n")
    for station in STATIONS:
        station_dir = BASE / station
        latest = sorted(station_dir.glob("*.csv"))[-1]
        info = latest.stat()
        size = info.st_size
        mtime = datetime.fromtimestamp(info.st_mtime).strftime("%Y-%m-%d %H:%M:%S")
        print(f"{station}  {latest.name}  {size} B  {mtime}")

audit_report()


Audit log - latest readings per station

NST-01  2024-01-17.csv  110 B  2024-05-13 10:42:05
NST-02  2024-01-17.csv  110 B  2024-05-13 10:42:05
NST-03  2024-01-17.csv  110 B  2024-05-13 10:42:05


The timestamps reflect when the script runs, so they will differ from the values shown here. The file size follows from the content: each file has a 26-byte header row and three 28-byte data rows, totaling 110 bytes on Linux and macOS. On Windows, text-mode writes translate each newline to \r\n, so the same four rows occupy 114 bytes.

path.stat() is called on the object that already represents the file — no need to pass the path string into a separate os.stat() call or import os for this purpose alone. Combined with .name, .stem, and .parent, every piece of information about a file comes from the single object that describes it: its location, its components, its content interface, and its metadata.
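
Because the audit depends on exact byte counts, it helps to see how st_size relates to what was written. The sketch below writes known bytes to a temporary file and compares the two. The sample content is invented for the illustration, and formatting the modification time in UTC is an assumption about what a compliance log might want, not something the tutorial specifies.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path

payload = "timestamp,temp_c\n2024-01-17T08:00:00,21.4\n".encode("utf-8")

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    # write_bytes() stores the data verbatim, with no newline translation.
    sample.write_bytes(payload)

    info = sample.stat()
    size = info.st_size
    print(size == len(payload))   # True

    # An aware UTC timestamp reads the same on servers and laptops.
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    print(modified.isoformat(timespec="seconds"))
```

Text-mode writes behave differently: on Windows, open(path, "w") translates each \n to \r\n by default, so the same rows occupy more bytes there than on Linux or macOS.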


Summary

Python's pathlib module replaces the scattered os.path API with a single Path object that carries its location, exposes every component as a named attribute, and provides methods for the most common filesystem operations. In this tutorial we built a sensor data pipeline that demonstrated the full range:

  • Path() converts a string into a path object; the / operator joins segments the same way os.path.join does but produces a Path at every step, with the separator resolved correctly for the current platform
  • .name, .stem, .suffix, .parent, and .parts provide every component of a path as a named attribute, replacing os.path.splitext, os.path.basename, and os.path.dirname with readable attribute access on a single object
  • .exists(), .is_file(), and .is_dir() validate a path as a check on the object rather than a string comparison, keeping path logic local and readable
  • path.glob("*.csv") yields Path objects that already include the directory being searched; path.rglob("*.csv") descends the entire subtree. Both replace os.listdir plus filtering plus rejoining with a single declarative expression
  • path.open() is the object-oriented equivalent of open(path) for structured formats; path.read_text() and path.write_text() handle whole-file reads and writes without an explicit context manager
  • path.mkdir(parents=True, exist_ok=True) creates the full directory chain in one call, replacing os.makedirs with the same two keyword arguments called directly on the directory path
  • path.stat().st_size and path.stat().st_mtime return file metadata directly from the path object, without importing os or converting the path to a string
