A Step-by-Step Guide to Exploring the Iris Dataset


Do you want to learn how data scientists find patterns in numbers? Data analysis is fun and easier than you think! In this guide, we will use Python to explore the Iris dataset, a simple set of flower measurements. You will learn how to load data, clean it, check it out, make cool charts, and find interesting facts. By the end, you will feel like a data pro. Let's get started!

Step 1: Loading the Iris Dataset

The Iris dataset contains measurements of 150 iris flowers. Each row records four things: sepal length, sepal width, petal length, and petal width, plus the species of the flower (setosa, versicolor, or virginica). It's perfect for beginners!

We will use Python's Seaborn library to load it. Try this code:

Python
import seaborn as sns
import pandas as pd

# Load the data
iris = sns.load_dataset('iris')

# Look at the first few rows
print(iris.head())

Output:

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

This shows a table with the measurements. Always check your data to make sure it's ready!
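Looking at the first rows is a good start, but a few more checks help confirm the data is ready. Here is a minimal sketch of what that could look like. It uses the five rows printed above, typed in by hand so it runs without a download, and `quick_overview` is a made-up helper name, not part of pandas or Seaborn.

```python
import pandas as pd

# The five rows printed above, typed in by hand so the sketch runs offline.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})


def quick_overview(df):
    """Hypothetical helper: collect a few basic facts about a DataFrame."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        # Column types as plain strings, e.g. 'float64'
        "dtypes": df.dtypes.astype(str).to_dict(),
        # How many rows belong to each species
        "species_counts": df["species"].value_counts().to_dict(),
    }


overview = quick_overview(sample)
print(overview)
```

Calling `quick_overview(iris)` on the full dataset should report 150 rows and three species instead of one.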

Step 2: Cleaning the Data

Dirty data can mess things up, so let's make sure ours is clean. We'll check for missing values and duplicate rows.

Here's the code:

Python
# Check for missing values
print(iris.isnull().sum())

# Check for duplicates
print("Duplicates:", iris.duplicated().sum())

# Remove duplicates
iris = iris.drop_duplicates()

Output:

sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
Duplicates: 1

Great news: the Iris dataset has no missing values! The output shows 1 duplicate row, which drop_duplicates() removes, leaving 149 rows. Now our data is ready to explore.
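Before dropping duplicates, it can help to look at which rows are repeated. The sketch below shows one way to do that. The three rows are toy values made up for illustration, not the real dataset.

```python
import pandas as pd

# Toy rows for illustration only; the last row repeats the first one exactly.
toy = pd.DataFrame({
    "petal_length": [5.1, 1.4, 5.1],
    "petal_width": [1.9, 0.2, 1.9],
    "species": ["virginica", "setosa", "virginica"],
})

# keep=False marks every copy of a repeated row, so all copies can be inspected.
repeated = toy[toy.duplicated(keep=False)]
print(repeated)

# drop_duplicates() keeps the first copy; reset_index() renumbers the rows from 0.
cleaned = toy.drop_duplicates().reset_index(drop=True)
print(len(toy), "->", len(cleaned))
```

Resetting the index is optional. Without it, the row labels keep a gap where the duplicate used to be.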

Step 3: Checking Out the Data

Let's learn about our data with exploratory data analysis (EDA). It's like asking, "What's special about these flowers?"

First, get a quick summary:

Python
print(iris.describe())

# describe() shows summary statistics; for example, the mean sepal length is about 5.8 cm.

# Next, compare the flower types by averaging each measurement per species
print(iris.groupby('species').mean())

This tells us:

  • Setosa flowers have the smallest petals and the shortest sepals, although their sepals are the widest on average.
  • Virginica flowers have the biggest petals.
  • Versicolor is in the middle.

These clues help us understand the data before making charts.
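To read a table of group averages without scanning it by eye, you can ask pandas which species comes out on top for each measurement. This is a rough sketch on six toy rows (two per species) typed in for illustration, so the numbers are not the real averages.

```python
import pandas as pd

# Six toy rows, two per species, for illustration only.
toy = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Average of each measurement per species
means = toy.groupby("species").mean()

# For each measurement, the species with the largest and smallest average
largest = means.idxmax()
smallest = means.idxmin()

print(means)
print(largest)
```

The same three lines (`groupby`, `idxmax`, `idxmin`) should work unchanged on the full `iris` table.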

Step 4: Making Charts

Charts make data fun to look at! We'll use Seaborn to create a few. (You'll need Matplotlib too for showing them.)

Bar Charts (Histograms)

These show how each measurement is spread out:

Python
import matplotlib.pyplot as plt
sns.set_theme(style="whitegrid")

# Draw one histogram per measurement in a 2x2 grid, colored by species
plt.figure(figsize=(8, 6))
for i, column in enumerate(iris.columns[:-1], 1):  # every column except 'species'
    plt.subplot(2, 2, i)
    sns.histplot(data=iris, x=column, hue='species')
    plt.title(column)
plt.tight_layout()
plt.show()

Output:

[Figure: histograms of the four measurements, colored by species]

You'll see setosa's petals are tiny, while virginica's are long.

Box Charts (Box Plots)

Box charts show the median, the middle half of the values, and the overall range of each measurement:

Python
# Draw one box chart per measurement, with one box per species
plt.figure(figsize=(8, 6))
for i, column in enumerate(iris.columns[:-1], 1):  # every column except 'species'
    plt.subplot(2, 2, i)
    sns.boxplot(x='species', y=column, data=iris)
    plt.title(column)
plt.tight_layout()
plt.show()

Output:

[Figure: box charts of the four measurements, one box per species]

These show that measurements within each species are fairly consistent, with only a few outliers.
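The "odd ones" in a box chart are the points drawn beyond the whiskers. By default, Seaborn and Matplotlib place the whiskers using the 1.5 × IQR rule, and the sketch below applies the same rule by hand. `iqr_outliers` is a made-up helper name, and the test values are invented for illustration.

```python
import pandas as pd


def iqr_outliers(values, whis=1.5):
    """Hypothetical helper: return values more than whis * IQR outside the quartiles."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1  # width of the box in a box chart
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return s[(s < lower) | (s > upper)]


# Invented values: 100 sits far away from the rest.
print(iqr_outliers([1, 2, 3, 4, 5, 100]).tolist())
```

To check one box from the chart, you could pass in a single species, for example `iqr_outliers(iris.loc[iris['species'] == 'setosa', 'sepal_width'])`.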

Dot Chart (Scatter Plot)

Let's compare sepal length and width:

Python
# Plot sepal length against sepal width, one color per species
plt.figure(figsize=(6, 4))
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=iris)
plt.title('Sepal Length vs Width')
plt.show()

Output:

[Figure: scatter plot of sepal length vs sepal width, colored by species]

Setosa dots form a separate cluster, but versicolor and virginica overlap.

Step 5: What We Learned

After exploring, here's what we found:

  • No missing data, so there was nothing to fill in.
  • Removed one duplicate row, leaving 149 rows.
  • Setosa flowers have the smallest petals and the shortest sepals, though their sepals are the widest.
  • Virginica has the biggest petals, easy to spot.
  • Sepal width overlaps a lot between species, so it's less helpful for telling them apart.
  • Petal length and width go together (they're super related!).

These facts could help build a tool to guess flower types.
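The claim that petal length and width "go together" can be checked with a correlation coefficient, where values near 1 mean the two measurements rise and fall together. The sketch below shows the call on six toy rows typed in for illustration, so the printed number is not the value for the full dataset.

```python
import pandas as pd

# Six toy rows for illustration only.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# Pearson correlation between the two columns (between -1 and 1)
r = toy["petal_length"].corr(toy["petal_width"])
print(round(r, 2))
```

On the full table, `iris.drop(columns='species').corr()` should give the correlation for every pair of measurements at once.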
