How to process large datasets with Java streams and lambdas
Introduction
An order analytics service that processes thousands of records per run using mutable loops and temporary lists is not just verbose — it is fragile. Every intermediate ArrayList is a place where a missed .add() call silently drops records. Every nested for loop that filters and accumulates creates invisible coupling between iteration and business logic. When the requirements change — add a new grouping dimension, switch from sum to average, filter by an extra condition — the loop body grows and the places where bugs can hide multiply.
Java's Stream API replaces mutable iteration with declarative pipelines. A stream pipeline describes what to compute, not how to iterate — filter, map, collect, groupingBy — and the runtime handles the mechanics. The result is code that reads like the specification, where each transformation is a named, composable operation rather than a block inside a loop.
This tutorial builds an e-commerce order analytics service. Each section replaces one pattern from the mutable version — loops, temporary lists, nested aggregations — with the Stream equivalent, showing the exact output at each step.
Background
Java streams are not data structures. A stream is a pipeline of operations over a source. Key properties:
- Lazy evaluation: Intermediate operations (filter, map, sorted, flatMap) do not execute until a terminal operation (collect, reduce, forEach, findFirst) is called.
- Single-use: A stream can only be consumed once. Calling a terminal operation closes it.
- Sources: Collection.stream(), Arrays.stream(), Stream.of(...), IntStream.range(...).
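Laziness and single use can both be observed directly. The following is a minimal sketch, not part of the analytics service: the class name LazyStreamDemo and the log list are illustrative, and the side effect inside filter exists only to make the evaluation order visible.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class LazyStreamDemo {

    // Records which elements the filter actually inspects (illustrative only).
    static final List<String> log = new ArrayList<>();

    // Builds a pipeline but does not run it: there is no terminal operation yet.
    static Stream<String> buildPipeline() {
        return Stream.of("books", "clothing", "electronics")
                .filter(c -> {
                    log.add("filter:" + c); // side effect only to make laziness visible
                    return c.length() > 5;
                });
    }

    // Returns true if a second terminal operation on the same stream is rejected.
    static boolean secondUseFails() {
        Stream<String> s = Stream.of("a", "b");
        s.count(); // first terminal operation consumes the stream
        try {
            s.count(); // second terminal operation on the same stream
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Stream<String> pipeline = buildPipeline();
        System.out.println("Log size before terminal op: " + log.size());
        List<String> result = pipeline.collect(Collectors.toList());
        System.out.println("Log size after terminal op: " + log.size());
        System.out.println("Result: " + result);
        System.out.println("Second use fails: " + secondUseFails());
    }
}
```

The log stays empty until collect runs, and reusing a consumed stream raises IllegalStateException rather than silently returning nothing.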
Terminal operations and collectors:
- collect(Collectors.toList()): gathers elements into a list
- collect(Collectors.groupingBy(fn)): groups elements into a Map<Key, List<T>>
- collect(Collectors.groupingBy(fn, Collectors.summarizingDouble(fn2))): groups and produces a DoubleSummaryStatistics per group (count, sum, min, max, average)
- Collectors.counting(): counts elements; passed to groupingBy as the downstream collector, it counts the elements in each group
- reduce(identity, accumulator): combines elements into a single value
Method references (ClassName::method, instance::method) are shorthand for lambdas that do nothing but call one existing method.
Practical Scenario
An e-commerce platform generates tens of thousands of order records each day. The analytics service runs nightly to produce the numbers that feed the business dashboard: total revenue, revenue by category, the top revenue-generating products, average order value by customer segment, and a list of high-value orders that need manual review.
The team's first implementation uses a series of for loops, each maintaining its own accumulator variables and temporary lists. Adding a new metric requires finding the right loop, adding variables, and making sure the accumulator is initialized correctly and the aggregation is applied in the right place. The code is correct for the specific metrics it was written for, but it does not compose — adding a new dimension means touching existing loops rather than adding a new pipeline.
The platform processes the same OrderRecord objects for all metrics. What is needed is a processing model that treats each metric as an independent declaration over the same source data, where the business logic is visible and the iteration mechanics are invisible.
The Problem
A mutable-loop analytics service computes three metrics over a daily order batch.
Create a new file:
touch OrderAnalytics.java
Add the code below to the file, then compile and run it with:
javac OrderAnalytics.java && java OrderAnalytics
import java.util.*;

class OrderRecord {
    String orderId;
    String customerId;
    String category;
    String productId;
    double revenue;
    boolean flaggedForReview;

    OrderRecord(String orderId, String customerId, String category,
                String productId, double revenue, boolean flaggedForReview) {
        this.orderId = orderId;
        this.customerId = customerId;
        this.category = category;
        this.productId = productId;
        this.revenue = revenue;
        this.flaggedForReview = flaggedForReview;
    }
}

public class OrderAnalytics {
    static List<OrderRecord> sampleOrders() {
        return Arrays.asList(
            new OrderRecord("ORD-001", "CUST-A", "electronics", "PROD-101", 349.99, false),
            new OrderRecord("ORD-002", "CUST-B", "clothing", "PROD-202", 89.50, false),
            new OrderRecord("ORD-003", "CUST-A", "electronics", "PROD-101", 349.99, true),
            new OrderRecord("ORD-004", "CUST-C", "books", "PROD-303", 24.99, false),
            new OrderRecord("ORD-005", "CUST-B", "clothing", "PROD-204", 129.00, false),
            new OrderRecord("ORD-006", "CUST-D", "electronics", "PROD-105", 899.00, true),
            new OrderRecord("ORD-007", "CUST-C", "books", "PROD-303", 24.99, false),
            new OrderRecord("ORD-008", "CUST-D", "electronics", "PROD-101", 349.99, false),
            new OrderRecord("ORD-009", "CUST-A", "clothing", "PROD-202", 89.50, false),
            new OrderRecord("ORD-010", "CUST-E", "electronics", "PROD-105", 899.00, true)
        );
    }

    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Total revenue
        double totalRevenue = 0;
        for (OrderRecord o : orders) {
            totalRevenue += o.revenue;
        }
        System.out.printf("Total revenue: $%.2f%n", totalRevenue);

        // Revenue by category
        Map<String, Double> revenueByCategory = new HashMap<>();
        for (OrderRecord o : orders) {
            revenueByCategory.merge(o.category, o.revenue, Double::sum);
        }
        for (Map.Entry<String, Double> entry : revenueByCategory.entrySet()) {
            System.out.printf(" %s: $%.2f%n", entry.getKey(), entry.getValue());
        }

        // Flagged orders
        List<String> flagged = new ArrayList<>();
        for (OrderRecord o : orders) {
            if (o.flaggedForReview) {
                flagged.add(o.orderId);
            }
        }
        System.out.println("Flagged for review: " + flagged);
    }
}
Total revenue: $3205.95
electronics: $2847.97
clothing: $308.00
books: $49.98
Flagged for review: [ORD-003, ORD-006, ORD-010]
Each metric requires its own loop, its own accumulator, and its own insertion logic. The three loops share no structure — they are independent blocks of imperative code that happen to iterate over the same list. Adding a fourth metric means adding a fourth loop. Changing which orders are eligible for a metric means finding the right loop and editing its body rather than composing a filter onto an existing pipeline.
filter, map, and collect
The three foundational stream operations replace the most common loop patterns. filter keeps elements that match a predicate. map transforms each element. collect(Collectors.toList()) gathers results.
Replace the main method with the following:
    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Flagged order IDs using filter + map + collect
        List<String> flaggedIds = orders.stream()
            .filter(o -> o.flaggedForReview)
            .map(o -> o.orderId)
            .collect(java.util.stream.Collectors.toList());
        System.out.println("Flagged for review: " + flaggedIds);

        // High-value electronics orders (revenue > 300)
        List<String> highValueElectronics = orders.stream()
            .filter(o -> o.category.equals("electronics"))
            .filter(o -> o.revenue > 300.0)
            .map(o -> o.orderId + " ($" + o.revenue + ")")
            .collect(java.util.stream.Collectors.toList());
        System.out.println("High-value electronics: " + highValueElectronics);
    }
Flagged for review: [ORD-003, ORD-006, ORD-010]
High-value electronics: [ORD-001 ($349.99), ORD-003 ($349.99), ORD-006 ($899.0), ORD-008 ($349.99), ORD-010 ($899.0)]
Each pipeline is a declaration: "take orders, keep the flagged ones, extract the order ID, collect into a list." The iteration and the list construction are handled by the runtime. Streams do not add null-safety, though: a null element in the source would still throw a NullPointerException inside the predicate. Adding another condition requires one more .filter(...) call, with no loop body to edit.
The lambdas passed to filter and map here are stateless and isolated, which is what the Stream API expects of them. There are no accumulators to forget to initialize and no variables shared between conditions. Each stage of the pipeline does exactly one thing, and the stages compose without touching each other.
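When the same conditions recur across several metrics, they can be given names and combined. The sketch below is one possible way to do that with java.util.function.Predicate; the Order class is a simplified stand-in for OrderRecord, and the constant and method names are illustrative assumptions rather than part of the tutorial's code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

class PredicateCompositionDemo {

    // Simplified stand-in for the tutorial's OrderRecord (illustrative only).
    static class Order {
        final String orderId;
        final String category;
        final double revenue;
        final boolean flaggedForReview;

        Order(String orderId, String category, double revenue, boolean flaggedForReview) {
            this.orderId = orderId;
            this.category = category;
            this.revenue = revenue;
            this.flaggedForReview = flaggedForReview;
        }
    }

    // Named, reusable conditions (hypothetical names).
    static final Predicate<Order> IS_ELECTRONICS = o -> o.category.equals("electronics");
    static final Predicate<Order> IS_HIGH_VALUE = o -> o.revenue > 300.0;
    static final Predicate<Order> IS_FLAGGED = o -> o.flaggedForReview;

    // One generic pipeline; the condition is supplied by the caller.
    static List<String> idsMatching(List<Order> orders, Predicate<Order> condition) {
        return orders.stream()
                .filter(condition)
                .map(o -> o.orderId)
                .collect(Collectors.toList());
    }

    static List<Order> sample() {
        return Arrays.asList(
                new Order("ORD-001", "electronics", 349.99, false),
                new Order("ORD-002", "clothing", 89.50, false),
                new Order("ORD-003", "electronics", 349.99, true),
                new Order("ORD-006", "electronics", 899.00, true));
    }

    public static void main(String[] args) {
        List<Order> orders = sample();
        // and(...) combines two conditions; negate() inverts one.
        System.out.println(idsMatching(orders, IS_ELECTRONICS.and(IS_HIGH_VALUE)));
        System.out.println(idsMatching(orders,
                IS_ELECTRONICS.and(IS_HIGH_VALUE).and(IS_FLAGGED.negate())));
    }
}
```

Chaining two filter calls and passing a single predicate built with and(...) select the same elements; the named form mainly helps when a condition is shared by more than one pipeline.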
Collectors.groupingBy
Collectors.groupingBy replaces the manual HashMap maintenance pattern. It groups stream elements by a key function and collects each group into a list — or, with a downstream collector, into a summary.
Add the following import at the top of the file:
import java.util.stream.*;
Replace main with:
    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Group orders by category
        Map<String, List<OrderRecord>> byCategory = orders.stream()
            .collect(Collectors.groupingBy(o -> o.category));
        byCategory.forEach((category, categoryOrders) -> {
            System.out.printf("%s: %d orders%n", category, categoryOrders.size());
        });

        // Group by category and count
        Map<String, Long> countByCategory = orders.stream()
            .collect(Collectors.groupingBy(o -> o.category, Collectors.counting()));
        System.out.println("\nOrder counts:");
        countByCategory.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .forEach(e -> System.out.printf(" %s: %d%n", e.getKey(), e.getValue()));
    }
electronics: 5 orders
clothing: 3 orders
books: 2 orders
Order counts:
electronics: 5
clothing: 3
books: 2
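groupingBy with a downstream collector of Collectors.counting() produces a Map<String, Long> directly. The merge loop from the original version maintained mutable state across iterations; groupingBy expresses the same aggregation as a single declarative statement. The first listing iterates the map returned by groupingBy directly, and the API does not guarantee that map's iteration order, so the category order shown there may differ. The second listing sorts the entries explicitly and is deterministic.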
groupingBy composes with any downstream collector. Switching from counting to summing revenue, or from summing to computing statistics, is one argument change — not a rewrite of the accumulation loop.
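As an illustration of that one-argument change, the revenue-by-category merge loop from the mutable version could be written with Collectors.summingDouble as the downstream collector. This is a sketch under assumptions: Order is a reduced stand-in for OrderRecord, and the TreeMap map factory is an optional addition that makes the key order deterministic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class RevenueByCategoryDemo {

    // Reduced stand-in for the tutorial's OrderRecord (illustrative only).
    static class Order {
        final String category;
        final double revenue;

        Order(String category, double revenue) {
            this.category = category;
            this.revenue = revenue;
        }
    }

    // Groups by category and sums revenue per group.
    // TreeMap::new is the optional map factory; it keeps keys sorted.
    static Map<String, Double> revenueByCategory(List<Order> orders) {
        return orders.stream()
                .collect(Collectors.groupingBy(
                        o -> o.category,
                        TreeMap::new,
                        Collectors.summingDouble(o -> o.revenue)));
    }

    static List<Order> sample() {
        return Arrays.asList(
                new Order("electronics", 349.99),
                new Order("clothing", 89.50),
                new Order("electronics", 899.00),
                new Order("books", 24.99),
                new Order("clothing", 129.00));
    }

    public static void main(String[] args) {
        revenueByCategory(sample()).forEach((category, total) ->
                System.out.printf("%s: $%.2f%n", category, total));
    }
}
```

Replacing summingDouble with counting() or summarizingDouble changes the value type of the map but leaves the rest of the pipeline untouched.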
Collectors.summarizingDouble
Collectors.summarizingDouble computes count, sum, min, max, and average in a single pass, returned as a DoubleSummaryStatistics object. No separate loops, no running totals.
Replace main with:
    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Revenue statistics per category in one pass
        Map<String, DoubleSummaryStatistics> statsByCategory = orders.stream()
            .collect(Collectors.groupingBy(
                o -> o.category,
                Collectors.summarizingDouble(o -> o.revenue)
            ));

        System.out.println("Revenue statistics by category:");
        statsByCategory.entrySet().stream()
            .sorted(Map.Entry.comparingByKey())
            .forEach(e -> {
                DoubleSummaryStatistics s = e.getValue();
                System.out.printf(" %-12s count=%-2d total=$%-8.2f avg=$%.2f%n",
                    e.getKey(), s.getCount(), s.getSum(), s.getAverage());
            });

        // Overall revenue summary
        DoubleSummaryStatistics overall = orders.stream()
            .collect(Collectors.summarizingDouble(o -> o.revenue));
        System.out.printf("%nOverall: %d orders, total=$%.2f, avg=$%.2f%n",
            overall.getCount(), overall.getSum(), overall.getAverage());
    }
Revenue statistics by category:
books count=2 total=$49.98 avg=$24.99
clothing count=3 total=$308.00 avg=$102.67
electronics count=5 total=$2847.97 avg=$569.59
Overall: 10 orders, total=$3205.95, avg=$320.60
summarizingDouble computes all five statistics in a single stream pass. The manual equivalent would need four accumulators (count, sum, min, and max) updated inside a single loop, with min and max initialized correctly and the average derived from sum and count at the end.
DoubleSummaryStatistics is part of the standard library, so no custom aggregation class is needed. It also defines the empty case: with no recorded values, getAverage() returns 0 instead of dividing by zero. Paired with groupingBy, it gives per-group statistics with no iteration code at all.
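The getters not used in the example above, and the behavior for an empty input, can be checked in isolation. This is a small sketch; the class and method names are illustrative, and it uses DoubleStream.summaryStatistics(), which yields the same DoubleSummaryStatistics type as Collectors.summarizingDouble.

```java
import java.util.DoubleSummaryStatistics;
import java.util.stream.DoubleStream;

class SummaryStatsDemo {

    // Summarizes the given revenues in one pass.
    static DoubleSummaryStatistics summarize(double... revenues) {
        return DoubleStream.of(revenues).summaryStatistics();
    }

    public static void main(String[] args) {
        DoubleSummaryStatistics s = summarize(24.99, 89.50, 899.00);
        System.out.printf("count=%d min=%.2f max=%.2f sum=%.2f avg=%.2f%n",
                s.getCount(), s.getMin(), s.getMax(), s.getSum(), s.getAverage());

        // Empty input: count is 0 and the average is defined as 0.
        // min and max report +Infinity and -Infinity, so check getCount()
        // before displaying them.
        DoubleSummaryStatistics empty = summarize();
        System.out.println("empty count=" + empty.getCount()
                + " avg=" + empty.getAverage()
                + " min=" + empty.getMin()
                + " max=" + empty.getMax());
    }
}
```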
sorted and Comparators
Stream pipelines can sort elements using sorted(Comparator). Comparator.comparingDouble, thenComparing, and .reversed() compose multi-level sort criteria in a single expression.
Replace main with:
    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // Top 5 orders by revenue, descending
        System.out.println("Top 5 orders by revenue:");
        orders.stream()
            .sorted(Comparator.comparingDouble((OrderRecord o) -> o.revenue).reversed())
            .limit(5)
            .forEach(o -> System.out.printf(" %s %-12s $%.2f%n",
                o.orderId, o.category, o.revenue));

        // Orders sorted by category, then by revenue within each category.
        // reversed() applies to the whole chain, so both keys sort descending.
        System.out.println("\nOrders by category then revenue:");
        orders.stream()
            .sorted(Comparator.comparing((OrderRecord o) -> o.category)
                .thenComparingDouble((OrderRecord o) -> o.revenue).reversed())
            .forEach(o -> System.out.printf(" %-8s %-12s $%.2f%n",
                o.orderId, o.category, o.revenue));
    }
Top 5 orders by revenue:
ORD-006 electronics $899.00
ORD-010 electronics $899.00
ORD-001 electronics $349.99
ORD-003 electronics $349.99
ORD-008 electronics $349.99
Orders by category then revenue:
ORD-006 electronics $899.00
ORD-010 electronics $899.00
ORD-001 electronics $349.99
ORD-003 electronics $349.99
ORD-008 electronics $349.99
ORD-005 clothing $129.00
ORD-002 clothing $89.50
ORD-009 clothing $89.50
ORD-004 books $24.99
ORD-007 books $24.99
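Comparator.comparingDouble(...).reversed() and thenComparingDouble(...) compose without any intermediate variables. sorted is a stateful operation: it has to consume and order every element before it can emit the first one, so limit(5) does not reduce the sorting work. It only stops the downstream forEach after five elements. For an ordered source such as a list, the sort is stable, so orders with equal keys keep their original relative order.
The sort logic reads as a specification: "compare by category, then by revenue, and reverse the result." Because reversed() is called on the finished chain, it flips both keys, which is why electronics is listed before books. A Collections.sort call with a hand-written Comparator that implements the same multi-level ordering in an if/else chain typically takes four to eight lines, with the logic spread across the comparator body.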
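Where reversed() is placed decides how much of the comparator it flips. The sketch below contrasts reversing only the revenue key with reversing the whole chain; Order is a reduced stand-in for OrderRecord and the comparator names are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class ComparatorScopeDemo {

    // Reduced stand-in for the tutorial's OrderRecord (illustrative only).
    static class Order {
        final String orderId;
        final String category;
        final double revenue;

        Order(String orderId, String category, double revenue) {
            this.orderId = orderId;
            this.category = category;
            this.revenue = revenue;
        }
    }

    // Category ascending, revenue descending: only the second key is reversed.
    static final Comparator<Order> CATEGORY_ASC_REVENUE_DESC =
            Comparator.comparing((Order o) -> o.category)
                    .thenComparing(Comparator.comparingDouble((Order o) -> o.revenue).reversed());

    // reversed() at the end flips the whole chain: both keys descending.
    static final Comparator<Order> WHOLE_CHAIN_REVERSED =
            Comparator.comparing((Order o) -> o.category)
                    .thenComparingDouble(o -> o.revenue)
                    .reversed();

    static List<String> sortedIds(List<Order> orders, Comparator<Order> order) {
        return orders.stream()
                .sorted(order)
                .map(o -> o.orderId)
                .collect(Collectors.toList());
    }

    static List<Order> sample() {
        return Arrays.asList(
                new Order("ORD-002", "clothing", 89.50),
                new Order("ORD-004", "books", 24.99),
                new Order("ORD-005", "clothing", 129.00),
                new Order("ORD-006", "electronics", 899.00),
                new Order("ORD-001", "electronics", 349.99));
    }

    public static void main(String[] args) {
        System.out.println(sortedIds(sample(), CATEGORY_ASC_REVENUE_DESC));
        System.out.println(sortedIds(sample(), WHOLE_CHAIN_REVERSED));
    }
}
```

The first comparator lists books first and electronics last; the second lists electronics first. Within each category, both put the higher revenue first.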
flatMap and reduce
flatMap transforms each element into a stream and concatenates the results — essential when each record contains a collection that needs to be processed at the element level. reduce collapses a stream to a single value using an accumulator function.
Add a List<String> tags field to OrderRecord and update the constructor and the sample data, which is trimmed to five orders in this section. The complete file, including the new main method, becomes:
import java.util.*;
import java.util.stream.*;

class OrderRecord {
    String orderId;
    String customerId;
    String category;
    String productId;
    double revenue;
    boolean flaggedForReview;
    List<String> tags;

    OrderRecord(String orderId, String customerId, String category,
                String productId, double revenue, boolean flaggedForReview,
                List<String> tags) {
        this.orderId = orderId;
        this.customerId = customerId;
        this.category = category;
        this.productId = productId;
        this.revenue = revenue;
        this.flaggedForReview = flaggedForReview;
        this.tags = tags;
    }
}

public class OrderAnalytics {
    static List<OrderRecord> sampleOrders() {
        return Arrays.asList(
            new OrderRecord("ORD-001", "CUST-A", "electronics", "PROD-101", 349.99, false,
                Arrays.asList("express", "gift-wrap")),
            new OrderRecord("ORD-002", "CUST-B", "clothing", "PROD-202", 89.50, false,
                Arrays.asList("sale")),
            new OrderRecord("ORD-003", "CUST-A", "electronics", "PROD-101", 349.99, true,
                Arrays.asList("express", "return-eligible")),
            new OrderRecord("ORD-004", "CUST-C", "books", "PROD-303", 24.99, false,
                Arrays.asList("gift-wrap")),
            new OrderRecord("ORD-005", "CUST-B", "clothing", "PROD-204", 129.00, false,
                Arrays.asList("sale", "express"))
        );
    }

    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // All unique tags across all orders
        List<String> allTags = orders.stream()
            .flatMap(o -> o.tags.stream())
            .distinct()
            .sorted()
            .collect(Collectors.toList());
        System.out.println("All tags: " + allTags);

        // Total revenue using reduce
        double totalRevenue = orders.stream()
            .map(o -> o.revenue)
            .reduce(0.0, Double::sum);
        System.out.printf("Total revenue: $%.2f%n", totalRevenue);

        // Most expensive order using reduce
        Optional<OrderRecord> mostExpensive = orders.stream()
            .reduce((a, b) -> a.revenue >= b.revenue ? a : b);
        mostExpensive.ifPresent(o ->
            System.out.printf("Most expensive: %s ($%.2f)%n", o.orderId, o.revenue));
    }
}
All tags: [express, gift-wrap, return-eligible, sale]
Total revenue: $943.47
Most expensive: ORD-001 ($349.99)
flatMap eliminates the nested loop pattern. Without it, collecting all tags would require an outer loop over orders and an inner loop over order.tags, with manual deduplication using a Set. With flatMap, the inner collection becomes a continuation of the stream, and distinct() handles deduplication in one call.
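To make that comparison concrete, here is a sketch that puts the nested-loop version next to the flatMap version. It works on plain lists of tags rather than OrderRecord objects, and the class and method names are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

class FlatMapDemo {

    // Nested-loop version: outer loop over orders, inner loop over each tag list,
    // with a sorted Set for manual deduplication.
    static List<String> tagsWithLoops(List<List<String>> tagLists) {
        Set<String> seen = new TreeSet<>();
        for (List<String> tags : tagLists) {
            for (String tag : tags) {
                seen.add(tag);
            }
        }
        return new ArrayList<>(seen);
    }

    // Stream version: flatMap turns each inner list into part of one stream.
    static List<String> tagsWithFlatMap(List<List<String>> tagLists) {
        return tagLists.stream()
                .flatMap(List::stream)
                .distinct()
                .sorted()
                .collect(Collectors.toList());
    }

    static List<List<String>> sample() {
        return Arrays.asList(
                Arrays.asList("express", "gift-wrap"),
                Arrays.asList("sale"),
                Arrays.asList("express", "return-eligible"));
    }

    public static void main(String[] args) {
        System.out.println(tagsWithLoops(sample()));
        System.out.println(tagsWithFlatMap(sample()));
    }
}
```

Both methods return the same sorted, deduplicated list for the same input.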
reduce with an identity and a BinaryOperator is the standard pattern for any fold operation: sum, product, string concatenation, maximum. The single-argument reduce, which takes only the BinaryOperator and no identity, returns Optional because the stream might be empty. That forces the caller to handle the empty case rather than silently returning a wrong default.
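The difference between the two reduce forms shows up on an empty input. The following sketch assumes revenues are already extracted into a List<Double>; the class and method names are illustrative. It also includes mapToDouble(...).sum(), a primitive alternative for the common case of summing.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

class ReduceDemo {

    // With an identity: an empty stream yields the identity value (0.0).
    static double total(List<Double> revenues) {
        return revenues.stream().reduce(0.0, Double::sum);
    }

    // Without an identity: an empty stream yields Optional.empty().
    static Optional<Double> largest(List<Double> revenues) {
        return revenues.stream().reduce(Double::max);
    }

    // Primitive alternative for sums; avoids boxing each intermediate result.
    static double totalPrimitive(List<Double> revenues) {
        return revenues.stream().mapToDouble(Double::doubleValue).sum();
    }

    public static void main(String[] args) {
        List<Double> revenues = Arrays.asList(24.99, 899.00, 349.99);
        List<Double> none = Collections.emptyList();

        System.out.printf("total=%.2f%n", total(revenues));
        System.out.println("largest=" + largest(revenues));
        System.out.println("total of empty=" + total(none));
        System.out.println("largest of empty present? " + largest(none).isPresent());
    }
}
```

A sum has a natural identity, so 0.0 for an empty batch is a correct answer. A maximum has none, which is why the Optional form is the safer choice there.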
Method References
Method references replace single-method lambdas with a more direct syntax. ClassName::staticMethod, instance::instanceMethod, and ClassName::instanceMethod are the three common forms.
Replace main with:
    public static void main(String[] args) {
        List<OrderRecord> orders = sampleOrders();

        // ClassName::instanceMethod used as a comparator
        orders.stream()
            .map(o -> o.orderId)
            .sorted(String::compareTo)
            .forEach(System.out::println); // instance::method on System.out
        System.out.println();

        // Static method reference for formatting each order
        orders.stream()
            .map(OrderAnalytics::formatOrder) // static method reference
            .forEach(System.out::println);
    }

    static String formatOrder(OrderRecord o) {
        return String.format("[%s] %s $%.2f tags=%s",
            o.category, o.orderId, o.revenue, o.tags);
    }
ORD-001
ORD-002
ORD-003
ORD-004
ORD-005
[electronics] ORD-001 $349.99 tags=[express, gift-wrap]
[clothing] ORD-002 $89.50 tags=[sale]
[electronics] ORD-003 $349.99 tags=[express, return-eligible]
[books] ORD-004 $24.99 tags=[gift-wrap]
[clothing] ORD-005 $129.00 tags=[sale, express]
System.out::println is equivalent to x -> System.out.println(x). String::compareTo is equivalent to (a, b) -> a.compareTo(b). OrderAnalytics::formatOrder is equivalent to o -> OrderAnalytics.formatOrder(o). The method reference form is shorter and names the method explicitly, which aids searchability and readability.
Method references are not just syntactic sugar — they make the intent more explicit. map(OrderAnalytics::formatOrder) communicates that formatting is a named, reusable operation defined elsewhere, not an anonymous transformation defined inline. This makes the pipeline easier to test: formatOrder can be unit-tested independently of the stream pipeline that calls it.
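The forms can be compared side by side by assigning each to a functional interface. This is an illustrative sketch: the class, the shout helper, and the constant names are hypothetical, and the constructor reference (ClassName::new) is a fourth form that the tutorial's service does not use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

class MethodReferenceFormsDemo {

    // Hypothetical helper so there is a static method to refer to.
    static String shout(String s) {
        return s.toUpperCase() + "!";
    }

    // ClassName::staticMethod, same as s -> MethodReferenceFormsDemo.shout(s)
    static final Function<String, String> STATIC_REF = MethodReferenceFormsDemo::shout;

    // instance::instanceMethod, same as s -> "ORD-".concat(s)
    static final Function<String, String> BOUND_REF = "ORD-"::concat;

    // ClassName::instanceMethod, same as s -> s.length()
    static final Function<String, Integer> UNBOUND_REF = String::length;

    // ClassName::new, same as () -> new ArrayList<>()
    static final Supplier<List<String>> CONSTRUCTOR_REF = ArrayList::new;

    public static void main(String[] args) {
        System.out.println(STATIC_REF.apply("sale"));
        System.out.println(BOUND_REF.apply("001"));
        System.out.println(UNBOUND_REF.apply("express"));
        System.out.println(CONSTRUCTOR_REF.get());
    }
}
```

Because each reference points at an ordinary named method, that method can be called and asserted on directly in a unit test, without building a stream around it.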
Summary
Java Streams replace mutable loops and temporary accumulators with declarative pipelines that describe what to compute rather than how to iterate. This tutorial built an e-commerce order analytics service across six stream patterns:
- filter(predicate) keeps elements that match a condition; chain multiple filter calls to compose conditions without nesting logic inside a loop body
- map(function) transforms each element independently; the transformation is stateless and does not accumulate, making it safe to reason about in isolation
- collect(Collectors.toList()) is the standard terminal operation for gathering stream elements; collect(Collectors.groupingBy(fn)) partitions the stream into a Map<Key, List<T>> keyed by the grouping function
- Collectors.groupingBy with a downstream collector like Collectors.counting() or Collectors.summarizingDouble(fn) computes per-group aggregations in a single pass, with no manual HashMap maintenance
- sorted(Comparator) with Comparator.comparingDouble(...).reversed() and .thenComparing(...) expresses multi-level sort criteria as a single composable expression
- flatMap(o -> o.collection.stream()) flattens nested collections into a single stream; use it wherever an element contains a list that needs to be processed at the element level
- reduce(identity, BinaryOperator) folds a stream into a single value; the single-argument form returns Optional because the stream may be empty, forcing explicit handling of that case
- Method references (ClassName::method, instance::method) replace single-method lambdas when the method already exists; they also make the operation testable as an independent unit