A Step-by-Step Guide to Exploring the Iris Dataset

Posted on 2025-04-11

Do you want to learn how data scientists find patterns in numbers? Data analysis is fun and easier than you think! In this guide, we will use Python to explore the Iris dataset, a simple set of flower measurements. You will learn how to load data, clean it, check it out, make cool charts, and find interesting facts. By the end, you will feel like a data pro. Let's get started!

Step 1: Loading the Iris Dataset

The Iris dataset contains measurements of 150 iris flowers. It records four things in centimeters: sepal length, sepal width, petal length, and petal width, for three species (setosa, versicolor, virginica), with 50 flowers of each. It's perfect for beginners!

We will use Python's Seaborn library to load it. The first call downloads the file, so you need an internet connection. Try this code:

import seaborn as sns
import pandas as pd

# Load the data
iris = sns.load_dataset('iris')

# Look at the first few rows
print(iris.head())

Output:

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

Each row is one flower, with its four measurements and its species. Always look at the first few rows to confirm the data loaded the way you expect!
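Beyond `head()`, a few other quick checks confirm the table's size, its column types, and how many flowers of each species it holds. The sketch below wraps them in a small helper. `quick_look` is a hypothetical name, and the five-row `sample` frame just re-types the rows printed above so the example runs without a download.

```python
import pandas as pd

def quick_look(df):
    """Return basic facts about a DataFrame: size, column types, class counts."""
    return {
        "shape": df.shape,                                  # (rows, columns)
        "dtypes": df.dtypes.astype(str).to_dict(),          # column name -> type name
        "species_counts": df["species"].value_counts().to_dict(),
    }

# Stand-in data: the five rows shown in the head() output above.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

info = quick_look(sample)
print(info["shape"])
print(info["species_counts"])
```

With the full dataset you would call `quick_look(iris)` instead and should see 150 rows, 5 columns, and 50 flowers per species.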

Step 2: Cleaning the Data

Dirty data can mess things up, so let's make sure ours is clean. We'll check for missing values and repeated rows.

Here's the code:

# Check for missing values
print(iris.isnull().sum())

# Check for duplicates
print("Duplicates:", iris.duplicated().sum())

# Remove duplicates
iris = iris.drop_duplicates()

Output:

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
Duplicates: 1

Great news: the Iris dataset has no missing values! The check found one duplicated row, which drop_duplicates() removes, leaving 149 rows. Now our data is ready to explore.
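Before dropping duplicates, it can help to see which rows are involved, since two different flowers could happen to share the same measurements. This sketch uses `duplicated(keep=False)` to show every copy, then drops the repeats and renumbers the index. The tiny `df` below is made-up illustration data, not the real Iris rows.

```python
import pandas as pd

# Made-up rows for illustration; the last row repeats the second one.
df = pd.DataFrame({
    "petal_length": [1.4, 5.1, 4.7, 5.1],
    "petal_width": [0.2, 1.9, 1.4, 1.9],
    "species": ["setosa", "virginica", "versicolor", "virginica"],
})

# keep=False marks every copy of a repeated row, not just the later ones.
copies = df[df.duplicated(keep=False)]
print(copies)

# Drop the repeats (the first copy stays) and renumber the rows from 0.
cleaned = df.drop_duplicates().reset_index(drop=True)
print(len(df), "->", len(cleaned))
```

Resetting the index is optional, but it avoids gaps in the row numbers after rows are removed.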

Step 3: Checking Out the Data

Let's learn about our data with exploratory data analysis (EDA). It's like asking, "What's special about these flowers?"

First, get a quick summary:

print(iris.describe())

This shows averages; for example, the mean sepal length is about 5.8 cm. Next, let's compare the flower types:

print(iris.groupby('species').mean())

This tells us:

Setosa flowers have the smallest petals and the shortest sepals, but the widest sepals.
Virginica flowers have the biggest petals.
Versicolor falls in between on petal size and sepal length.

These clues help us understand the data before making charts.
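To back up a comparison like this with numbers, one option is to ask pandas which species has the largest and smallest average for each measurement. A minimal sketch, assuming a frame shaped like `iris`. The six rows here are hand-typed stand-ins (two per species) so the example runs on its own, so treat its output as a demonstration and not as the dataset's summary.

```python
import pandas as pd

# Hand-typed stand-in rows (two per species), not the full dataset.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.4, 5.5, 6.3, 7.1],
    "sepal_width": [3.5, 3.0, 3.2, 2.3, 3.3, 3.0],
    "petal_length": [1.4, 1.4, 4.5, 4.0, 6.0, 5.9],
    "petal_width": [0.2, 0.2, 1.5, 1.3, 2.5, 2.1],
    "species": ["setosa", "setosa", "versicolor", "versicolor",
                "virginica", "virginica"],
})

# Average of every numeric column within each species.
means = df.groupby("species").mean(numeric_only=True)

# For each measurement, the species with the largest and smallest average.
largest = means.idxmax()
smallest = means.idxmin()
print(pd.DataFrame({"largest": largest, "smallest": smallest}))
```

Running the same lines on the full `iris` frame gives the comparison for all 149 cleaned rows.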

Step 4: Making Charts

Charts make data fun to look at! We'll use Seaborn to create a few. (You'll need Matplotlib too for showing them.)

Bar Charts (Histograms)

These show how measurements spread out:

import matplotlib.pyplot as plt

# set_theme is the current name for sns.set (available since Seaborn 0.11, like histplot)
sns.set_theme(style="whitegrid")

plt.figure(figsize=(8, 6))
for i, column in enumerate(iris.columns[:-1], 1):
    plt.subplot(2, 2, i)
    sns.histplot(data=iris, x=column, hue='species')
    plt.title(f'{column}')
plt.tight_layout()
plt.show()

Output:

[Figure: histograms of the four measurements, colored by species]

You'll see setosa's petals are tiny, while virginica's are long.

Box Charts

Box charts (box plots) show the median, the middle half of the values, and the overall range of each measurement:

plt.figure(figsize=(8, 6))
for i, column in enumerate(iris.columns[:-1], 1):
    plt.subplot(2, 2, i)
    sns.boxplot(x='species', y=column, data=iris)
    plt.title(f'{column}')
plt.tight_layout()
plt.show()

Output:

[Figure: box charts of the four measurements, one box per species]

These show that each species' measurements stay within a fairly narrow range, with only a few outliers (the dots drawn beyond the whiskers).
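The points a box chart draws as separate dots can also be listed with pandas. By default, Seaborn and Matplotlib let the whiskers reach at most 1.5 times the interquartile range (IQR) beyond the box, and anything further out is drawn as a dot. The sketch below applies that rule. `iqr_outliers` is a hypothetical helper, and the numbers are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values, whis=1.5):
    """Return the values lying more than whis * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    low, high = q1 - whis * iqr, q3 + whis * iqr
    return values[(values < low) | (values > high)]

# Made-up petal lengths for one species; 3.0 is deliberately far from the rest.
petal_length = pd.Series([1.3, 1.4, 1.4, 1.5, 1.5, 1.6, 3.0])
outliers = iqr_outliers(petal_length)
print(outliers.tolist())
```

On the real data, `iris.groupby('species')['petal_length'].apply(iqr_outliers)` would list the flagged flowers for each species.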

Dot Chart (Scatter Plot)

Let's compare sepal length and width:

plt.figure(figsize=(6, 4))
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=iris)
plt.title('Sepal Length vs Width')
plt.show()

Output:

[Figure: scatter plot of sepal length vs sepal width, colored by species]

Setosa dots are separate, but versicolor and virginica mix a bit.
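A scatter plot shows how two measurements relate. A correlation coefficient sums that up in one number between -1 and 1. Here is a minimal sketch using petal length and petal width, with made-up values so it runs on its own. On the real data you would call the same method on the columns of `iris`.

```python
import pandas as pd

# Made-up measurements that rise together, for illustration only.
df = pd.DataFrame({
    "petal_length": [1.4, 1.5, 4.5, 4.7, 5.9, 6.0],
    "petal_width": [0.2, 0.2, 1.5, 1.4, 2.1, 2.5],
})

# Pearson correlation: close to 1 means the two columns increase together.
r = df["petal_length"].corr(df["petal_width"])
print(round(r, 2))
```

A value near 0 would mean no straight-line relationship, and a value near -1 would mean one measurement falls as the other rises.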

Step 5: What We Learned

After exploring, here’s what we found:

No missing data: our dataset is clean!
Removed one duplicate row, so we have 149 rows.
Setosa flowers have the smallest petals and the shortest sepals, though their sepals are the widest.
Virginica has the biggest petals, easy to spot.
Sepal width overlaps a lot between species, so it's less helpful on its own.
Petal length and width go together (they're strongly correlated!).

These facts could help build a tool to guess flower types.
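As a rough idea of what such a tool might look like, here is a tiny rule-based guesser built on the petal findings. It is only a sketch: `guess_species` is a hypothetical function, and the two cutoffs are assumptions picked by eye from charts like the ones above, not values fitted to the data.

```python
def guess_species(petal_length, petal_width):
    """Guess an iris species from two petal measurements (in cm).

    The cutoffs are assumptions chosen by eye, not values learned from data.
    """
    if petal_length < 2.5:   # setosa petals are much shorter than the rest
        return "setosa"
    if petal_width < 1.7:    # narrower petals suggest versicolor
        return "versicolor"
    return "virginica"

print(guess_species(1.4, 0.2))
print(guess_species(4.5, 1.5))
print(guess_species(6.0, 2.5))
```

A real classifier would learn its cutoffs from the data and be checked on flowers it has not seen, because versicolor and virginica overlap and simple rules like these will get some of them wrong.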