Account

Exploring Gold Price Trends with Python: A Pandas and Matplotlib Tutorial

Exploring Gold Price Trends with Python: A Pandas and Matplotlib Tutorial


Introduction

Python has become the go-to language for data analysis and visualization, and some of the most popular and powerful libraries are Pandas and Matplotlib.

Pandas provides robust data manipulation capabilities and Matplotlib allows you to create visualizations to better understand your data. Together, these libraries form a powerful toolkit for statistical analysis.

In this tutorial, we'll explore historical gold prices using real-world data to learn how to load, clean, filter, and visualize time-series data. Gold prices are particularly interesting because they reflect economic conditions, inflation trends, and global market sentiment. By analyzing this data, you'll learn practical techniques that you can apply to any time-series dataset, be it stock prices or weather patterns.

For this tutorial, we will make a couple assumptions so that we can provide exact commands and file paths:
- You are running Financial & Statistical Sandbox available on Bytestark.com
- You are using the provided gold price data file available at data_sources/gold_prices.csv
- Your starting folder when you open the Financial & Statistical Sandbox is /home/coder/learning. When paths are referenced, they will be relative to this root path.

Understanding the Dataset and Libraries

Before we dive into coding, let's understand the libraries we'll be using:

Pandas is a data manipulation library that provides data structures like DataFrames and powerful tools for reading, filtering, and transforming data.

Matplotlib is the foundational plotting library in Python. It gives you total control over your graphs, from axes to colors to labels.

Seaborn is built on top of Matplotlib and provides a higher-level interface for creating attractive statistical graphics with less code. We'll use it to enhance the styling of our plots.

Our dataset contains historical gold prices with dates and corresponding gold values. The dates are in DD/MM/YYYY format, and we'll need to convert them to proper datetime objects so Python can work with them effectively.
The data is presented in the following format:


Loading and Inspecting the Data

Let's start by creating a Python script to load and examine our gold price data. Create a new file called gold_analysis.py and add the following code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.dates as mdates

# Load the gold price data
result = pd.read_csv("data_sources/gold_prices.csv")

# Display the first few rows to understand the structure
print("First 5 rows of the dataset:")
print(result.head())

# Check the data types
print("\nData types:")
print(result.dtypes)

# Get basic statistics
print("\nBasic statistics:")
print(result.describe())


What this code does:

  • We import all the necessary libraries at the top
  • pd.read_csv() loads the CSV file into a Pandas DataFrame, which is the primary data structure we'll work with
  • head() shows us the first 5 rows of data
  • dtypes reveals the data type of each column
  • describe() provides summary statistics like mean, median and standard deviation

To run the script use the following command:

python3 gold_analysis.py


The result should be similar to the one below:

First 5 rows of the dataset:
         Date  Value
0  01/01/1915  19.25
1  02/01/1915  19.25
2  03/01/1915  19.25
3  04/01/1915  19.25
4  05/01/1915  19.25

Data types:
Date      object
Value    float64
dtype: object

Basic statistics:
             Value
count  1333.000000
mean    387.771207
std     618.579873
min      19.250000
25%      35.000000
50%      40.800000
75%     400.000000
max    5296.834900


You may notice that the "Date" column is an 'object' type rather than a datetime type. We need to convert it so that it's easier to work with.


Converting Date Strings to Datetime Objects

Working with dates as strings is problematic because Python won't understand chronological ordering or allow us to filter by year, month, or day. We need to convert the date strings into proper datetime objects. Change the code in gold_analysis.py by pasting the following instead:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.dates as mdates

# Load the gold price data
result = pd.read_csv("data_sources/gold_prices.csv")

# Convert the Date column from string to datetime objects
result["Date"] = result["Date"].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))

# Verify the conversion
print("Data types after conversion:")
print(result.dtypes)
print("\nFirst 5 rows after conversion:")
print(result.head())

# Check the date range in our dataset
print(f"\nDate range: {result['Date'].min()} to {result['Date'].max()}")


Executing the script will result in the following output:

Data types after conversion:
Date     datetime64[ns]
Value           float64
dtype: object

First 5 rows after conversion:
        Date  Value
0 1915-01-01  19.25
1 1915-01-02  19.25
2 1915-01-03  19.25
3 1915-01-04  19.25
4 1915-01-05  19.25

Date range: 1915-01-01 00:00:00 to 2026-01-01 00:00:00


Let's understand the date conversion:

  • apply() applies a function to every value in the "Date" column
  • lambda x: creates an anonymous function that takes each date string as input
  • datetime.strptime(x, "%d/%m/%Y") parses the string according to the format pattern:
    • %d = day (01-31)
    • %m = month (01-12)
    • %Y = four-digit year (e.g., 2024)
  • After conversion, the "Date" column will show as datetime64[ns] type instead of object

Filtering Data by Date Range

Now that we have proper datetime objects, we can easily filter our data to focus on a specific time period. Often when analyzing trends, you want to focus on recent years or a specific period of interest. Let's filter our data to only include years from 2000 to 2026:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.dates as mdates

# Load and convert the data
result = pd.read_csv("data_sources/gold_prices.csv")
result["Date"] = result["Date"].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))

# Filter data for years 2000-2026
data = result[(result["Date"].dt.year >= 2000) & (result["Date"].dt.year <= 2026)]

print(f"Original dataset size: {len(result)} rows")
print(f"Filtered dataset size: {len(data)} rows")
print(f"\nFiltered date range: {data['Date'].min()} to {data['Date'].max()}")
print(f"\nPrice range: ${data['Value'].min():.2f} to ${data['Value'].max():.2f}")


Running the code will result in:

Original dataset size: 1333 rows
Filtered dataset size: 313 rows

Filtered date range: 2000-01-01 00:00:00 to 2026-01-01 00:00:00

Price range: $259.05 to $5296.83


Let's go over the changes:

  • result["Date"].dt.year accesses the year component of each datetime object
  • >= 2000 and <= 2026 create boolean conditions for the year range
  • & combines both conditions (both must be True)
  • The square brackets [] apply this filter to create a new DataFrame called data
  • We use len() to compare the original and filtered dataset sizes

Note: The .dt attribute is crucial here as it provides access to datetime properties like .year, .month, .day, .dayofweek, etc. Without it, we'll get an error when trying to access these properties.


Creating First Visualization

Processing data is just half of the job, when we have processed it we need to see it to understand it better.
A line plot is perfect for time-series data like gold prices because it shows trends over time. Let's create a basic plot:

The instance of VSCode you are currently running does not have the ability to render graphs by default, so we will need to run the next code examples in Jupyter Notebook. To create a notebook press F1 in the VSCode window and type Create: New Jupyter Notebook. When the newly Notebook is created paste the following code in one of the cells and press the key combination Shift + Enter to execute the code within a cell. The first execution will prompt to select a Python kernel - select the default one Python 3.12.4.


import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.dates as mdates

# Load, convert, and filter the data
result = pd.read_csv("data_sources/gold_prices.csv")
result["Date"] = result["Date"].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))
data = result[(result["Date"].dt.year >= 2000) & (result["Date"].dt.year <= 2026)]

# Create the plot
plt.figure(figsize=(12, 6))
plt.plot(data["Date"], data["Value"])
plt.xlabel("Year")
plt.ylabel("Price (USD)")
plt.title("Historical Gold Prices (2000-2026)")
plt.grid(True)
plt.tight_layout()
plt.show()


If the notebook was created and configured successfully the following graph should appear:

Here's what each plotting command does:

  • plt.figure(figsize=(12, 6)) creates a new figure with width=12 inches and height=6 inches
  • plt.plot(data["Date"], data["Value"]) creates a line plot with dates on x-axis and prices on y-axis
  • plt.xlabel() and plt.ylabel() add descriptive labels to the axes
  • plt.title() adds a title to help viewers understand what they're seeing
  • plt.grid(True) adds a grid to make it easier to read values
  • plt.tight_layout() automatically adjusts spacing to prevent labels from being cut off
  • plt.show() displays the plot (or saves it if you're not in an interactive environment)

Improving Date Formatting on the X-Axis

You might notice that the x-axis dates look crowded or poorly formatted. Let's improve this by formatting the dates to show only years and rotating the labels for better readability.
Note - In Jupyter you don't have to replace the existing code, you can just use a different cell and execute just that cell using the same Shift + Enter. This is one of the features that makes Jupyter so convenient for data analysis.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.dates as mdates

# Load, convert, and filter the data
result = pd.read_csv("data_sources/gold_prices.csv")
result["Date"] = result["Date"].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))
data = result[(result["Date"].dt.year >= 2000) & (result["Date"].dt.year <= 2026)]

# Create the plot with improved date formatting
plt.figure(figsize=(12, 6))
plt.plot(data["Date"], data["Value"], linewidth=2, color='#FFD700')
plt.xlabel("Year", fontsize=12)
plt.ylabel("Price (USD)", fontsize=12)
plt.title("Historical Gold Prices (2000-2026)", fontsize=14, fontweight='bold')

# Format x-axis to show years
plt.gca().xaxis.set_major_locator(mdates.YearLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
plt.xticks(rotation=45)

plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()


The updated graph should be similar to the one below:

Let's understand the date formatting improvements:

  • plt.gca() gets the current axes object, giving us access to more advanced formatting options
  • xaxis.set_major_locator(mdates.YearLocator()) places tick marks at yearly intervals
  • xaxis.set_major_formatter(mdates.DateFormatter("%Y")) formats those ticks to show only the year
  • plt.xticks(rotation=45) rotates the date labels 45 degrees to prevent overlap
  • We've also added linewidth=2 to make the line more visible and color='#FFD700' (gold color) for thematic consistency
  • alpha=0.3 makes the grid lines semi-transparent so they don't dominate the visualization

Adding Statistical Analysis

Combining graphical visualization with statistical analysis provides more powerful and deeper insights.
Let's calculate some key statistics and add them to our visualization:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.dates as mdates

# Load, convert, and filter the data
result = pd.read_csv("data_sources/gold_prices.csv")
result["Date"] = result["Date"].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))
data = result[(result["Date"].dt.year >= 2000) & (result["Date"].dt.year <= 2026)]

# Calculate statistics
mean_price = data["Value"].mean()
median_price = data["Value"].median()
std_price = data["Value"].std()
min_price = data["Value"].min()
max_price = data["Value"].max()

# Find the dates of min and max prices
min_date = data.loc[data["Value"].idxmin(), "Date"]
max_date = data.loc[data["Value"].idxmax(), "Date"]

# Create the plot
plt.figure(figsize=(14, 7))
plt.plot(data["Date"], data["Value"], linewidth=2, color='#FFD700', label='Gold Price')

# Add horizontal lines for mean and median
plt.axhline(y=mean_price, color='red', linestyle='--', linewidth=1.5, label=f'Mean: ${mean_price:.2f}')
plt.axhline(y=median_price, color='blue', linestyle='--', linewidth=1.5, label=f'Median: ${median_price:.2f}')

plt.xlabel("Year", fontsize=12)
plt.ylabel("Price (USD)", fontsize=12)
plt.title("Historical Gold Prices with Statistical Analysis (2000-2026)", fontsize=14, fontweight='bold')

# Format x-axis
plt.gca().xaxis.set_major_locator(mdates.YearLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
plt.xticks(rotation=45)

plt.grid(True, alpha=0.3)
plt.legend(loc='upper left')
plt.tight_layout()
plt.show()

# Print statistics
print("=" * 50)
print("GOLD PRICE STATISTICS (2000-2026)")
print("=" * 50)
print(f"Mean Price:           ${mean_price:.2f}")
print(f"Median Price:         ${median_price:.2f}")
print(f"Standard Deviation:   ${std_price:.2f}")
print(f"Minimum Price:        ${min_price:.2f} on {min_date.strftime('%Y-%m-%d')}")
print(f"Maximum Price:        ${max_price:.2f} on {max_date.strftime('%Y-%m-%d')}")
print(f"Price Range:          ${max_price - min_price:.2f}")
print("=" * 50)


Let's take a look at the result:

And also break down the statistical calculations:

  • mean() calculates the average price across all data points
  • median() finds the middle value when all prices are sorted
  • std() measures the standard deviation, showing how spread out the prices are
  • idxmin() and idxmax() find the index positions of minimum and maximum values
  • loc[] retrieves the actual date values at those index positions
  • axhline() draws horizontal reference lines on the plot
  • legend() adds a legend box explaining what each line represents

The standard deviation is particularly useful metric. A high standard deviation means gold prices were volatile, while a low one means they were relatively stable.


Creating a Multi-Panel Analysis

For more comprehensive analysis, let's create a figure with multiple subplots showing different aspects of the data. This gives us a dashboard-like view of the gold prices:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.dates as mdates

# Load, convert, and filter the data
result = pd.read_csv("data_sources/gold_prices.csv")
result["Date"] = result["Date"].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))
data = result[(result["Date"].dt.year >= 2000) & (result["Date"].dt.year <= 2026)]

# Create a figure with multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('Comprehensive Gold Price Analysis (2000-2026)', fontsize=16, fontweight='bold')

# Subplot 1: Time series with trend
axes[0, 0].plot(data["Date"], data["Value"], linewidth=1.5, color='#FFD700', alpha=0.7)
axes[0, 0].plot(data["Date"], data["Value"].rolling(window=30).mean(), 
                linewidth=2, color='darkred', label='30-Day Moving Average')
axes[0, 0].set_xlabel("Year")
axes[0, 0].set_ylabel("Price (USD)")
axes[0, 0].set_title("Price Trend with Moving Average")
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Subplot 2: Distribution histogram
axes[0, 1].hist(data["Value"], bins=50, color='#FFD700', edgecolor='black', alpha=0.7)
axes[0, 1].axvline(data["Value"].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
axes[0, 1].axvline(data["Value"].median(), color='blue', linestyle='--', linewidth=2, label='Median')
axes[0, 1].set_xlabel("Price (USD)")
axes[0, 1].set_ylabel("Frequency")
axes[0, 1].set_title("Price Distribution")
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# Subplot 3: Year-over-year change
data['Year'] = data['Date'].dt.year
yearly_avg = data.groupby('Year')['Value'].mean()
axes[1, 0].bar(yearly_avg.index, yearly_avg.values, color='#FFD700', edgecolor='black', alpha=0.7)
axes[1, 0].set_xlabel("Year")
axes[1, 0].set_ylabel("Average Price (USD)")
axes[1, 0].set_title("Average Annual Gold Price")
axes[1, 0].grid(True, alpha=0.3, axis='y')
plt.setp(axes[1, 0].xaxis.get_majorticklabels(), rotation=45)

# Subplot 4: Box plot by year (last 10 years for clarity)
recent_data = data[data['Year'] >= 2016]
years = sorted(recent_data['Year'].unique())
box_data = [recent_data[recent_data['Year'] == year]['Value'].values for year in years]
axes[1, 1].boxplot(box_data, labels=years)
axes[1, 1].set_xlabel("Year")
axes[1, 1].set_ylabel("Price (USD)")
axes[1, 1].set_title("Price Distribution by Year (2016-2026)")
axes[1, 1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()


Running this code will generate the following graph:

This multi-panel visualization provides four different perspectives:

  • Top-left: Time series with a 30-day moving average to smooth out daily fluctuations and reveal the underlying trend
  • Top-right: Histogram showing the distribution of prices, helping us see if prices cluster around certain values
  • Bottom-left: Bar chart of average annual prices, making year-to-year comparisons easy
  • Bottom-right: Box plots showing the spread and quartiles of prices for recent years

Key functions explained:

  • plt.subplots(2, 2) creates a 2×2 grid of subplots
  • rolling(window=30).mean() calculates a moving average over 30 data points
  • hist() creates a histogram with 50 bins
  • groupby('Year')['Value'].mean() groups data by year and calculates the mean for each group
  • boxplot() shows the median, quartiles, and outliers for each year

Calculating and Visualizing Returns

In financial analysis, we often want to see not just prices but returns—the percentage change from one period to another. Let's calculate daily returns and visualize them:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from datetime import datetime
import matplotlib.dates as mdates

# Load, convert, and filter the data
result = pd.read_csv("data_sources/gold_prices.csv")
result["Date"] = result["Date"].apply(lambda x: datetime.strptime(x, "%d/%m/%Y"))
data = result[(result["Date"].dt.year >= 2000) & (result["Date"].dt.year <= 2026)].copy()

# Calculate daily returns (percentage change)
data['Daily_Return'] = data['Value'].pct_change() * 100

# Calculate cumulative returns
data['Cumulative_Return'] = ((1 + data['Value'].pct_change()).cumprod() - 1) * 100

# Create visualization
fig, axes = plt.subplots(2, 1, figsize=(14, 10))
fig.suptitle('Gold Price Returns Analysis', fontsize=16, fontweight='bold')

# Daily returns
axes[0].plot(data["Date"], data["Daily_Return"], linewidth=0.5, color='#FFD700', alpha=0.6)
axes[0].axhline(y=0, color='black', linestyle='-', linewidth=1)
axes[0].fill_between(data["Date"], data["Daily_Return"], 0, 
                      where=(data["Daily_Return"] > 0), color='green', alpha=0.3, label='Positive Returns')
axes[0].fill_between(data["Date"], data["Daily_Return"], 0, 
                      where=(data["Daily_Return"] <= 0), color='red', alpha=0.3, label='Negative Returns')
axes[0].set_ylabel("Daily Return (%)")
axes[0].set_title("Daily Returns Over Time")
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Cumulative returns
axes[1].plot(data["Date"], data["Cumulative_Return"], linewidth=2, color='darkgreen')
axes[1].fill_between(data["Date"], data["Cumulative_Return"], 0, alpha=0.3, color='lightgreen')
axes[1].set_xlabel("Year")
axes[1].set_ylabel("Cumulative Return (%)")
axes[1].set_title("Cumulative Returns Since 2000")
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print return statistics
print("\n" + "=" * 50)
print("RETURN STATISTICS")
print("=" * 50)
print(f"Average Daily Return:    {data['Daily_Return'].mean():.4f}%")
print(f"Daily Return Std Dev:    {data['Daily_Return'].std():.4f}%")
print(f"Best Daily Return:       {data['Daily_Return'].max():.4f}%")
print(f"Worst Daily Return:      {data['Daily_Return'].min():.4f}%")
print(f"Total Cumulative Return: {data['Cumulative_Return'].iloc[-1]:.2f}%")
print("=" * 50)


After running the code above we should see the following graph:

Understanding returns calculations:

  • pct_change() calculates the percentage change from the previous value
  • * 100 converts decimal returns to percentages (e.g., 0.05 becomes 5%)
  • cumprod() calculates the cumulative product, giving us cumulative returns
  • fill_between() colors the area between the line and zero, with different colors for positive and negative returns
  • where= parameter specifies a condition for when to apply each color

Note: The .copy() after filtering is important—it creates a copy of the DataFrame so we can safely add new columns without triggering warnings about modifying a slice of the original data.


Best Practices and Performance Considerations

When working with Pandas and Matplotlib for data analysis, following these best practices will help you create more efficient and maintainable code:

1. Load Data Once

Always load your data once at the beginning of your script and reuse that DataFrame. Reading CSV files repeatedly is slow and unnecessary. If you're working with very large datasets, consider using pd.read_csv() with the usecols parameter to load only the columns you need.

2. Use Vectorized Operations

Pandas operations like apply(), pct_change(), and rolling() are vectorized, meaning they operate on entire columns at once. This is much faster than using Python loops. Avoid writing code like for i in range(len(df)): when working with DataFrames.

3. Handle Missing Data

Real-world datasets often have missing values. Use data.isnull().sum() to check for missing data, and decide how to handle it with methods like fillna(), dropna(), or interpolate() depending on your analysis needs.

4. Save Your Plots

Instead of just displaying plots with plt.show(), you can save them to files using plt.savefig('filename.png', dpi=300, bbox_inches='tight'). The dpi=300 parameter ensures high resolution, and bbox_inches='tight' prevents labels from being cut off.

5. Use Appropriate Chart Types

Line plots are great for time-series data, but other chart types serve different purposes: histograms for distributions, scatter plots for relationships between variables, box plots for comparing groups, and bar charts for categorical comparisons. Choose the visualization that best tells your data's story.

6. Comment Your Code

Statistical analysis code can be complex. Add comments explaining what each calculation does and why you chose specific parameters (like the 30-day window for moving averages).



You need to be logged in to access the cloud lab and experiment with the code presented in this tutorial.

Log in