Java

How to process large datasets with Java streams and lambdas

grivel Jun 24, 2026

Introduction

An order analytics service that processes thousands of records per run using mutable loops and temporary lists is not just verbose — it is fragile. Every intermediate ArrayList is a place where a missed .add() call silently drops records. Every nested for loop that filters and accumulates creates invisible coupling between iteration and business logic. When the requirements change — add a new grouping dimension, switch from sum to average, filter by an extra condition — the loop body grows and the places where bugs can hide multiply.

Java's Stream API replaces mutable iteration with declarative pipelines. A stream pipeline describes what to compute, not how to iterate — filter, map, collect, groupingBy — and the runtime handles the mechanics. The result is code that reads like the specification, where each transformation is a named, composable operation rather than a block inside a loop.

This tutorial builds an e-commerce order analytics service. Each section replaces one pattern from the mutable version — loops, temporary lists, nested aggregations — with the Stream equivalent, showing the exact output at each step.

Background

Java streams are not data structures. A stream is a pipeline of operations over a source. Key properties:

Lazy evaluation: Intermediate operations (filter, map, sorted, flatMap) do not execute until a terminal operation (collect, reduce, forEach, findFirst) is called.
Single-use: A stream can only be consumed once. Calling a terminal operation closes it.
Sources: Collection.stream(), Arrays.stream(), Stream.of(...), IntStream.range(...).
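The lazy-evaluation and single-use properties above can be observed directly. The following is a minimal sketch, not part of the analytics service: the class name LazyStreamDemo, the helper methods, and the log list are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyStreamDemo {

    // Builds a pipeline but does not run it: no terminal operation is called here.
    // Every element the filter inspects is appended to log, so laziness is observable.
    static Stream<Integer> buildPipeline(List<Integer> source, List<Integer> log) {
        return source.stream()
            .filter(n -> {
                log.add(n);            // side effect for demonstration only
                return n % 2 == 0;
            })
            .map(n -> n * 10);
    }

    // Returns true if a second terminal operation on the same stream is rejected.
    static boolean secondUseFails(Stream<Integer> alreadyConsumed) {
        try {
            alreadyConsumed.collect(Collectors.toList());
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Integer> log = new ArrayList<>();
        Stream<Integer> pipeline = buildPipeline(Arrays.asList(1, 2, 3, 4), log);

        // Nothing has been filtered yet, so the log is still empty.
        System.out.println("Inspected before terminal op: " + log);

        List<Integer> result = pipeline.collect(Collectors.toList());
        System.out.println("Result: " + result);
        System.out.println("Inspected after terminal op: " + log);

        System.out.println("Second use rejected: " + secondUseFails(pipeline));
    }
}
```

The filter lambda runs only once collect is called, and a second terminal operation on the same stream object fails with IllegalStateException. When the same data is needed twice, call stream() on the source collection again.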

Terminal operations and collectors:

collect(Collectors.toList()): gathers elements into a list
collect(Collectors.groupingBy(fn)): groups elements into a Map<Key, List<T>>
collect(Collectors.groupingBy(fn, Collectors.summarizingDouble(fn2))): groups and produces a DoubleSummaryStatistics per group (count, sum, min, max, average)
collect(Collectors.groupingBy(fn, Collectors.counting())): counts the elements in each group
reduce(identity, accumulator): combines elements into a single value

Method references (ClassName::method, instance::method) are shorthand for lambdas that only call an existing method.

Practical Scenario

An e-commerce platform generates tens of thousands of order records each day. The analytics service runs nightly to produce the numbers that feed the business dashboard: total revenue, revenue by category, the top revenue-generating products, average order value by customer segment, and a list of high-value orders that need manual review.

The team's first implementation uses a series of for loops, each maintaining its own accumulator variables and temporary lists. Adding a new metric requires finding the right loop, adding variables, and making sure the accumulator is initialized correctly and the aggregation is applied in the right place. The code is correct for the specific metrics it was written for, but it does not compose — adding a new dimension means touching existing loops rather than adding a new pipeline.

The platform processes the same OrderRecord objects for all metrics. What is needed is a processing model that treats each metric as an independent declaration over the same source data, where the business logic is visible and the iteration mechanics are invisible.

The Problem

A mutable-loop analytics service computes three metrics over a daily order batch.

Create a new file:

touch OrderAnalytics.java

Compile and run:

javac OrderAnalytics.java && java OrderAnalytics

import java.util.*;

class OrderRecord {
    String orderId;
    String customerId;
    String category;
    String productId;
    double revenue;
    boolean flaggedForReview;

    OrderRecord(String orderId, String customerId, String category,
                String productId, double revenue, boolean flaggedForReview) {
        this.orderId         = orderId;
        this.customerId      = customerId;
        this.category        = category;
        this.productId       = productId;
        this.revenue         = revenue;
        this.flaggedForReview = flaggedForReview;
    }
}

public class OrderAnalytics {

    static List<OrderRecord> sampleOrders() {
        return Arrays.asList(
            new OrderRecord("ORD-001", "CUST-A", "electronics", "PROD-101", 349.99, false),
            new OrderRecord("ORD-002", "CUST-B", "clothing",    "PROD-202", 89.50,  false),
            new OrderRecord("ORD-003", "CUST-A", "electronics", "PROD-101", 349.99, true),
            new OrderRecord("ORD-004", "CUST-C", "books",       "PROD-303", 24.99,  false),
            new OrderRecord("ORD-005", "CUST-B", "clothing",    "PROD-204", 129.00, false),
            new OrderRecord("ORD-006", "CUST-D", "electronics", "PROD-105", 899.00, true),
            new OrderRecord("ORD-007", "CUST-C", "books",       "PROD-303", 24.99,  false),
            new OrderRecord("ORD-008", "CUST-D", "electronics", "PROD-101", 349.99, false),
            new OrderRecord("ORD-009", "CUST-A", "clothing",    "PROD-202", 89.50,  false),
            new OrderRecord("ORD-010", "CUST-E", "electronics", "PROD-105", 899.00, true)
        );
    }

    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Total revenue
        double totalRevenue = 0;
        for (OrderRecord o : orders) {
            totalRevenue += o.revenue;
        }
        System.out.printf("Total revenue: $%.2f%n", totalRevenue);

        // Revenue by category
        Map<String, Double> revenueByCategory = new HashMap<>();
        for (OrderRecord o : orders) {
            revenueByCategory.merge(o.category, o.revenue, Double::sum);
        }
        for (Map.Entry<String, Double> entry : revenueByCategory.entrySet()) {
            System.out.printf("  %s: $%.2f%n", entry.getKey(), entry.getValue());
        }

        // Flagged orders
        List<String> flagged = new ArrayList<>();
        for (OrderRecord o : orders) {
            if (o.flaggedForReview) {
                flagged.add(o.orderId);
            }
        }
        System.out.println("Flagged for review: " + flagged);
    }
}

Total revenue: $3205.95
  electronics: $2847.97
  clothing: $308.00
  books: $49.98
Flagged for review: [ORD-003, ORD-006, ORD-010]

Each metric requires its own loop, its own accumulator, and its own insertion logic. The three loops share no structure — they are independent blocks of imperative code that happen to iterate over the same list. Adding a fourth metric means adding a fourth loop. Changing which orders are eligible for a metric means finding the right loop and editing its body rather than composing a filter onto an existing pipeline.

filter, map, and collect

The three foundational stream operations replace the most common loop patterns. filter keeps elements that match a predicate. map transforms each element. collect(Collectors.toList()) gathers results.

Replace the main method with the following:

    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Flagged order IDs using filter + map + collect
        List<String> flaggedIds = orders.stream()
            .filter(o -> o.flaggedForReview)
            .map(o -> o.orderId)
            .collect(java.util.stream.Collectors.toList());
        System.out.println("Flagged for review: " + flaggedIds);

        // High-value electronics orders (revenue > 300)
        List<String> highValueElectronics = orders.stream()
            .filter(o -> o.category.equals("electronics"))
            .filter(o -> o.revenue > 300.0)
            .map(o -> o.orderId + " ($" + o.revenue + ")")
            .collect(java.util.stream.Collectors.toList());
        System.out.println("High-value electronics: " + highValueElectronics);
    }

Flagged for review: [ORD-003, ORD-006, ORD-010]
High-value electronics: [ORD-001 ($349.99), ORD-003 ($349.99), ORD-006 ($899.0), ORD-008 ($349.99), ORD-010 ($899.0)]

Each pipeline is a declaration: "take orders, keep the flagged ones, extract the order ID, collect into a list." The iteration and the list construction are handled by the runtime. Streams do not add null-safety, though: a null category would still throw a NullPointerException inside the predicate. Adding another filter requires one more .filter(...) call, with no loop body to edit.

Lambdas passed to filter and map are stateless and isolated. There are no accumulators to forget to initialize, no variables shared between conditions. Each stage of the pipeline does exactly one thing, and the stages compose without touching each other.
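Because each condition is a separate stateless lambda, conditions can also be named and combined with the default methods on java.util.function.Predicate. The sketch below assumes a cut-down Order class and the constant names IS_ELECTRONICS and IS_HIGH_VALUE, all of which are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PredicateCompositionDemo {

    // Hypothetical cut-down order type for this sketch; not the tutorial's OrderRecord.
    static class Order {
        final String id;
        final String category;
        final double revenue;

        Order(String id, String category, double revenue) {
            this.id = id;
            this.category = category;
            this.revenue = revenue;
        }
    }

    // Named predicates: each condition can be read, reused, and tested on its own.
    static final Predicate<Order> IS_ELECTRONICS = o -> "electronics".equals(o.category);
    static final Predicate<Order> IS_HIGH_VALUE  = o -> o.revenue > 300.0;

    // Returns the IDs of orders matching the given condition, in encounter order.
    static List<String> idsMatching(List<Order> orders, Predicate<Order> condition) {
        return orders.stream()
            .filter(condition)
            .map(o -> o.id)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
            new Order("ORD-001", "electronics", 349.99),
            new Order("ORD-002", "clothing", 89.50),
            new Order("ORD-004", "books", 24.99),
            new Order("ORD-006", "electronics", 899.00));

        // and() keeps the same elements as two chained filter calls
        System.out.println(idsMatching(orders, IS_ELECTRONICS.and(IS_HIGH_VALUE)));
        // negate() inverts a condition without rewriting it
        System.out.println(idsMatching(orders, IS_ELECTRONICS.negate()));
    }
}
```

Chained filter calls and a single composed predicate keep the same elements. Naming the predicates mainly pays off when the same condition appears in several pipelines.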

Collectors.groupingBy

Collectors.groupingBy replaces the manual HashMap maintenance pattern. It groups stream elements by a key function and collects each group into a list — or, with a downstream collector, into a summary.

Add the following import at the top of the file:

import java.util.stream.*;

Replace main with:

    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Group orders by category
        Map<String, List<OrderRecord>> byCategory = orders.stream()
            .collect(Collectors.groupingBy(o -> o.category));

        byCategory.forEach((category, categoryOrders) -> {
            System.out.printf("%s: %d orders%n", category, categoryOrders.size());
        });

        // Group by category and count
        Map<String, Long> countByCategory = orders.stream()
            .collect(Collectors.groupingBy(o -> o.category, Collectors.counting()));

        System.out.println("\nOrder counts:");
        countByCategory.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .forEach(e -> System.out.printf("  %s: %d%n", e.getKey(), e.getValue()));
    }

electronics: 5 orders
clothing: 3 orders
books: 2 orders

Order counts:
  electronics: 5
  clothing: 3
  books: 2

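groupingBy with a downstream collector of Collectors.counting() produces a Map<String, Long> directly. The merge loop from the original version maintained mutable state across iterations; groupingBy expresses the same aggregation as a single declarative statement. The returned map makes no ordering guarantee, so the first three lines may print in a different category order; the second listing is sorted explicitly by count.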

groupingBy composes with any downstream collector. Switching from counting to summing revenue, or from summing to computing statistics, is one argument change — not a rewrite of the accumulation loop.
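To make the "one argument change" concrete, here is a small sketch with two methods that differ only in the downstream collector. It uses the three-argument groupingBy overload with TreeMap::new so that keys come back sorted. The Order class and method names are hypothetical, chosen only for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class DownstreamSwapDemo {

    // Hypothetical minimal order type for this sketch.
    static class Order {
        final String category;
        final double revenue;

        Order(String category, double revenue) {
            this.category = category;
            this.revenue = revenue;
        }
    }

    // Count per category; TreeMap::new makes the key order deterministic (sorted).
    static Map<String, Long> countByCategory(List<Order> orders) {
        return orders.stream()
            .collect(Collectors.groupingBy(o -> o.category, TreeMap::new,
                Collectors.counting()));
    }

    // Same classifier, different downstream collector: sum revenue instead of counting.
    static Map<String, Double> revenueByCategory(List<Order> orders) {
        return orders.stream()
            .collect(Collectors.groupingBy(o -> o.category, TreeMap::new,
                Collectors.summingDouble(o -> o.revenue)));
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
            new Order("electronics", 100.0),
            new Order("books", 10.0),
            new Order("books", 5.5));

        System.out.println(countByCategory(orders));
        System.out.println(revenueByCategory(orders));
    }
}
```

The classifier lambda is identical in both methods; only the last argument changes. Supplying a map factory is optional, but it avoids depending on the iteration order of whatever map the two-argument overload happens to return.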

Collectors.summarizingDouble

Collectors.summarizingDouble computes count, sum, min, max, and average in a single pass, returned as a DoubleSummaryStatistics object. No separate loops, no running totals.

Replace main with:

    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Revenue statistics per category in one pass
        Map<String, DoubleSummaryStatistics> statsByCategory = orders.stream()
            .collect(Collectors.groupingBy(
                o -> o.category,
                Collectors.summarizingDouble(o -> o.revenue)
            ));

        System.out.println("Revenue statistics by category:");
        statsByCategory.entrySet().stream()
            .sorted(Map.Entry.comparingByKey())
            .forEach(e -> {
                DoubleSummaryStatistics s = e.getValue();
                System.out.printf("  %-12s  count=%-2d  total=$%-8.2f  avg=$%.2f%n",
                    e.getKey(), s.getCount(), s.getSum(), s.getAverage());
            });

        // Overall revenue summary
        DoubleSummaryStatistics overall = orders.stream()
            .collect(Collectors.summarizingDouble(o -> o.revenue));
        System.out.printf("%nOverall: %d orders, total=$%.2f, avg=$%.2f%n",
            overall.getCount(), overall.getSum(), overall.getAverage());
    }

Revenue statistics by category:
  books         count=2   total=$49.98     avg=$24.99
  clothing      count=3   total=$308.00    avg=$102.67
  electronics   count=5   total=$2847.97   avg=$569.59

Overall: 10 orders, total=$3205.95, avg=$320.60

summarizingDouble computes all five statistics in a single stream pass. The manual equivalent would require four accumulators (count, sum, min, max), each initialized correctly and updated inside a single loop, plus a guarded division at the end for the average.

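DoubleSummaryStatistics is standard library infrastructure, available everywhere without custom aggregation classes. It covers the empty case (getAverage() returns 0 when no values were recorded), and its Javadoc allows getSum() to use compensated summation to reduce floating-point error. Paired with groupingBy, it gives per-group statistics with no iteration code at all.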
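For comparison, the hand-written loop that summarizingDouble replaces might look like the sketch below. The class and method names are hypothetical. The input values are chosen so the sums are exact; with arbitrary doubles, the library's sum may differ from a naive loop in the last digits.

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.stream.Collectors;

class SummaryStatsDemo {

    // Manual single-loop equivalent: four accumulators plus a derived average.
    // Returns {count, sum, min, max, average} packed into an array for brevity.
    static double[] manualStats(List<Double> values) {
        long count = 0;
        double sum = 0.0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            count++;
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        // The empty case has to be guarded by hand to avoid 0.0 / 0 producing NaN.
        double average = count == 0 ? 0.0 : sum / count;
        return new double[] { count, sum, min, max, average };
    }

    // Stream equivalent: one collector, no visible accumulators.
    static DoubleSummaryStatistics streamStats(List<Double> values) {
        return values.stream().collect(Collectors.summarizingDouble(Double::doubleValue));
    }

    public static void main(String[] args) {
        List<Double> revenues = Arrays.asList(24.0, 26.0, 100.0);

        System.out.println(Arrays.toString(manualStats(revenues)));
        System.out.println(streamStats(revenues));
    }
}
```

Both versions make a single pass over the data. The difference is that the stream version has no initial values to get wrong and no empty-input branch to remember.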

sorted and Comparators

Stream pipelines can sort elements using sorted(Comparator). Comparator.comparingDouble, thenComparing, and .reversed() compose multi-level sort criteria in a single expression.

Replace main with:

    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Top 5 orders by revenue, descending
        System.out.println("Top 5 orders by revenue:");
        orders.stream()
            .sorted(Comparator.comparingDouble((OrderRecord o) -> o.revenue).reversed())
            .limit(5)
            .forEach(o -> System.out.printf("  %s  %-12s  $%.2f%n",
                o.orderId, o.category, o.revenue));

        // Orders sorted by category, then revenue; the trailing reversed() flips the whole chain, so both keys sort descending
        System.out.println("\nOrders by category then revenue:");
        orders.stream()
            .sorted(Comparator.comparing((OrderRecord o) -> o.category)
                .thenComparingDouble((OrderRecord o) -> o.revenue).reversed())
            .forEach(o -> System.out.printf("  %-8s  %-12s  $%.2f%n",
                o.orderId, o.category, o.revenue));
    }

Top 5 orders by revenue:
  ORD-006  electronics   $899.00
  ORD-010  electronics   $899.00
  ORD-001  electronics   $349.99
  ORD-003  electronics   $349.99
  ORD-008  electronics   $349.99

Orders by category then revenue:
  ORD-006   electronics   $899.00
  ORD-010   electronics   $899.00
  ORD-001   electronics   $349.99
  ORD-003   electronics   $349.99
  ORD-008   electronics   $349.99
  ORD-005   clothing      $129.00
  ORD-002   clothing      $89.50
  ORD-009   clothing      $89.50
  ORD-004   books         $24.99
  ORD-007   books         $24.99

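Comparator.comparingDouble(...).reversed() and thenComparingDouble(...) compose without any intermediate variables. A trailing reversed() flips the entire chained comparator, so the second pipeline sorts both category and revenue in descending order, and orders that tie on both keys keep their original order because the sort is stable. limit(5) truncates the output to five elements, but sorted is a stateful operation that must consume and sort the whole input first, so the sort itself is not shortened.

The sort logic reads as a specification: "compare by category, then by revenue, all reversed." A Collections.sort with a manual Comparator implementing multi-level sort in an if/else chain expresses the same intent in four to eight lines where the logic is spread across a comparator body.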
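The scope of reversed() in a comparator chain is easy to misread. The following sketch contrasts reversing the whole chain with reversing only the secondary key. The Order class, the two comparator constants, and sortedIds are hypothetical names for this illustration.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class ReversedScopeDemo {

    // Hypothetical minimal order type for this sketch.
    static class Order {
        final String id;
        final String category;
        final double revenue;

        Order(String id, String category, double revenue) {
            this.id = id;
            this.category = category;
            this.revenue = revenue;
        }
    }

    // Trailing reversed() flips the whole chain: category descending AND revenue descending.
    static final Comparator<Order> ALL_REVERSED =
        Comparator.comparing((Order o) -> o.category)
            .thenComparingDouble(o -> o.revenue)
            .reversed();

    // Reversing only the secondary key: category ascending, revenue descending within it.
    static final Comparator<Order> REVENUE_DESC_WITHIN_CATEGORY =
        Comparator.comparing((Order o) -> o.category)
            .thenComparing(Comparator.comparingDouble((Order o) -> o.revenue).reversed());

    // Sorts with the given comparator and returns only the IDs, for easy comparison.
    static List<String> sortedIds(List<Order> orders, Comparator<Order> order) {
        return orders.stream()
            .sorted(order)
            .map(o -> o.id)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Order> orders = Arrays.asList(
            new Order("A", "books", 10.0),
            new Order("B", "electronics", 100.0),
            new Order("C", "books", 20.0),
            new Order("D", "electronics", 50.0));

        System.out.println(sortedIds(orders, ALL_REVERSED));                  // [B, D, C, A]
        System.out.println(sortedIds(orders, REVENUE_DESC_WITHIN_CATEGORY)); // [C, A, B, D]
    }
}
```

Passing an already-reversed comparator to thenComparing keeps the primary key ascending, which is usually what "revenue descending within category" is meant to say.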

flatMap and reduce

flatMap transforms each element into a stream and concatenates the results — essential when each record contains a collection that needs to be processed at the element level. reduce collapses a stream to a single value using an accumulator function.

The examples in this section need a List<String> tags field on OrderRecord, with the constructor and sample data updated to match. The complete updated file, trimmed to five sample orders:

import java.util.*;
import java.util.stream.*;

class OrderRecord {
    String orderId;
    String customerId;
    String category;
    String productId;
    double revenue;
    boolean flaggedForReview;
    List<String> tags;

    OrderRecord(String orderId, String customerId, String category,
                String productId, double revenue, boolean flaggedForReview,
                List<String> tags) {
        this.orderId          = orderId;
        this.customerId       = customerId;
        this.category         = category;
        this.productId        = productId;
        this.revenue          = revenue;
        this.flaggedForReview = flaggedForReview;
        this.tags             = tags;
    }
}

public class OrderAnalytics {

    static List<OrderRecord> sampleOrders() {
        return Arrays.asList(
            new OrderRecord("ORD-001", "CUST-A", "electronics", "PROD-101", 349.99, false,
                Arrays.asList("express", "gift-wrap")),
            new OrderRecord("ORD-002", "CUST-B", "clothing",    "PROD-202",  89.50, false,
                Arrays.asList("sale")),
            new OrderRecord("ORD-003", "CUST-A", "electronics", "PROD-101", 349.99, true,
                Arrays.asList("express", "return-eligible")),
            new OrderRecord("ORD-004", "CUST-C", "books",       "PROD-303",  24.99, false,
                Arrays.asList("gift-wrap")),
            new OrderRecord("ORD-005", "CUST-B", "clothing",    "PROD-204", 129.00, false,
                Arrays.asList("sale", "express"))
        );
    }

    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // All unique tags across all orders
        List<String> allTags = orders.stream()
            .flatMap(o -> o.tags.stream())
            .distinct()
            .sorted()
            .collect(Collectors.toList());
        System.out.println("All tags: " + allTags);

        // Total revenue using reduce
        double totalRevenue = orders.stream()
            .map(o -> o.revenue)
            .reduce(0.0, Double::sum);
        System.out.printf("Total revenue: $%.2f%n", totalRevenue);

        // Most expensive order using reduce
        Optional<OrderRecord> mostExpensive = orders.stream()
            .reduce((a, b) -> a.revenue >= b.revenue ? a : b);
        mostExpensive.ifPresent(o ->
            System.out.printf("Most expensive: %s ($%.2f)%n", o.orderId, o.revenue));
    }
}

All tags: [express, gift-wrap, return-eligible, sale]
Total revenue: $943.47
Most expensive: ORD-001 ($349.99)

flatMap eliminates the nested loop pattern. Without it, collecting all tags would require an outer loop over orders and an inner loop over order.tags, with manual deduplication using a Set. With flatMap, the inner collection becomes a continuation of the stream, and distinct() handles deduplication in one call.

reduce with an identity and a BinaryOperator is the standard pattern for any fold operation: sum, product, string concatenation, maximum. The single-argument reduce, which takes only the accumulator, returns Optional because the stream might be empty, which forces the caller to handle the empty case rather than returning a wrong default silently.
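The difference between the two reduce forms shows up most clearly on an empty input. The sketch below also includes mapToDouble(...).sum(), a primitive-stream alternative that avoids boxing each revenue. The class and method names are hypothetical, introduced only for this example.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

class ReduceFormsDemo {

    // Identity form: an empty stream yields the identity value (0.0 here).
    static double totalWithIdentity(List<Double> revenues) {
        return revenues.stream().reduce(0.0, Double::sum);
    }

    // Accumulator-only form: an empty stream yields Optional.empty().
    static Optional<Double> maxRevenue(List<Double> revenues) {
        return revenues.stream().reduce(Double::max);
    }

    // Primitive alternative: mapToDouble avoids boxing, and sum() is a built-in reduction.
    static double totalPrimitive(List<Double> revenues) {
        return revenues.stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        List<Double> revenues = Arrays.asList(10.0, 30.0, 20.0);
        List<Double> none = Collections.emptyList();

        System.out.println(totalWithIdentity(revenues)); // 60.0
        System.out.println(maxRevenue(revenues));        // Optional[30.0]
        System.out.println(totalPrimitive(revenues));    // 60.0

        System.out.println(totalWithIdentity(none));     // 0.0
        System.out.println(maxRevenue(none));            // Optional.empty
    }
}
```

A total of 0.0 is a sensible answer for no orders, so the identity form fits a sum. There is no sensible "maximum of nothing", which is why the accumulator-only form hands back an Optional instead.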

Method References

Method references replace single-method lambdas with a more direct syntax. ClassName::staticMethod, instance::instanceMethod, and ClassName::instanceMethod are the three common forms.

Replace main with:

    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // ClassName::instanceMethod — comparator
        orders.stream()
            .map(o -> o.orderId)
            .sorted(String::compareTo)
            .forEach(System.out::println);  // instance::method on System.out

        System.out.println();

        // Static method reference for formatting each order
        orders.stream()
            .map(OrderAnalytics::formatOrder)   // static method reference
            .forEach(System.out::println);
    }

    static String formatOrder(OrderRecord o) {
        return String.format("[%s] %s  $%.2f  tags=%s",
            o.category, o.orderId, o.revenue, o.tags);
    }

ORD-001
ORD-002
ORD-003
ORD-004
ORD-005

[electronics] ORD-001  $349.99  tags=[express, gift-wrap]
[clothing] ORD-002  $89.50  tags=[sale]
[electronics] ORD-003  $349.99  tags=[express, return-eligible]
[books] ORD-004  $24.99  tags=[gift-wrap]
[clothing] ORD-005  $129.00  tags=[sale, express]

System.out::println is equivalent to x -> System.out.println(x). String::compareTo is equivalent to (a, b) -> a.compareTo(b). OrderAnalytics::formatOrder is equivalent to o -> OrderAnalytics.formatOrder(o). The method reference form is shorter and names the method explicitly, which aids searchability and readability.

Method references are not just syntactic sugar — they make the intent more explicit. map(OrderAnalytics::formatOrder) communicates that formatting is a named, reusable operation defined elsewhere, not an anonymous transformation defined inline. This makes the pipeline easier to test: formatOrder can be unit-tested independently of the stream pipeline that calls it.
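The three forms can be placed side by side with plain functional interfaces. This is an illustrative sketch only: the class MethodReferenceFormsDemo, the label helper, and the constant names are hypothetical and not part of the analytics service.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Collectors;

class MethodReferenceFormsDemo {

    // A named static helper; it can be called and tested without any stream.
    static String label(String orderId) {
        return "[" + orderId + "]";
    }

    // ClassName::staticMethod, same as s -> MethodReferenceFormsDemo.label(s)
    static final Function<String, String> STATIC_REF = MethodReferenceFormsDemo::label;

    // instance::instanceMethod, same as s -> "ORD-".concat(s)
    static final Function<String, String> BOUND_REF = "ORD-"::concat;

    // ClassName::instanceMethod, same as (a, b) -> a.compareTo(b)
    static final BiFunction<String, String, Integer> UNBOUND_REF = String::compareTo;

    // The same helper used inside a pipeline.
    static List<String> labelAll(List<String> orderIds) {
        return orderIds.stream()
            .map(MethodReferenceFormsDemo::label)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(STATIC_REF.apply("ORD-001"));       // [ORD-001]
        System.out.println(BOUND_REF.apply("002"));            // ORD-002
        System.out.println(UNBOUND_REF.apply("a", "b") < 0);   // true
        System.out.println(labelAll(Arrays.asList("ORD-001", "ORD-002")));
    }
}
```

Because label is an ordinary static method, a unit test can call it directly with a single string, with no stream or sample data involved.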

Summary

Java Streams replace mutable loops and temporary accumulators with declarative pipelines that describe what to compute rather than how to iterate. This tutorial built an e-commerce order analytics service across six stream patterns:

filter(predicate) keeps elements that match a condition — chain multiple filter calls to compose conditions without nesting logic inside a loop body
map(function) transforms each element independently — the transformation is stateless and does not accumulate, making it safe to reason about in isolation
collect(Collectors.toList()) is the standard terminal operation for gathering stream elements; collect(Collectors.groupingBy(fn)) groups the stream into a Map<Key, List<T>> keyed by the grouping function
Collectors.groupingBy with a downstream collector like Collectors.counting() or Collectors.summarizingDouble(fn) computes per-group aggregations in a single pass — no manual HashMap maintenance
sorted(Comparator) with Comparator.comparingDouble(...).reversed() and .thenComparing(...) expresses multi-level sort criteria as a single composable expression
flatMap(o -> o.collection.stream()) flattens nested collections into a single stream — use it wherever an element contains a list that needs to be processed at the element level
reduce(identity, BinaryOperator) folds a stream into a single value; the single-argument form returns Optional because the stream may be empty, forcing explicit handling of that case
Method references (ClassName::method, instance::method) replace single-method lambdas when the method already exists — they also make the operation testable as an independent unit