A step-by-step guide to exploring the Iris dataset
Do you want to learn how data scientists find patterns in numbers? Data analysis is fun and easier than you think! In this guide, we will use Python to explore the Iris dataset, a simple set of flower measurements. You will learn how to load data, clean it, check it out, make cool charts, and find interesting facts. By the end, you will feel like a data pro. Let's get started!
Step 1: Loading the Iris Dataset
The Iris dataset describes 150 iris flowers, 50 from each of three species (setosa, versicolor, virginica). For every flower it records four measurements in centimeters: sepal length, sepal width, petal length, and petal width. It's perfect for beginners!
We will use Python's Seaborn library to load it. Try this code:
import seaborn as sns
import pandas as pd

# Load the data
iris = sns.load_dataset('iris')

# Look at the first few rows
print(iris.head())
Output:
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
This shows a table with the measurements. Always check your data to make sure it's ready!
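A quick way to confirm that the table looks as expected is to check its size, its numeric columns, and how many rows each species has. The sketch below wraps those checks in a small helper. The name quick_check is hypothetical, and the tiny hand-typed frame (the five rows printed above) stands in for iris so the example runs without downloading anything.

```python
import pandas as pd

def quick_check(df: pd.DataFrame) -> dict:
    """Collect a few basic facts about a table before exploring it."""
    return {
        "shape": df.shape,  # (rows, columns)
        "numeric_columns": list(df.select_dtypes("number").columns),
        "rows_per_species": df["species"].value_counts().to_dict(),
    }

# Hand-typed stand-in: the five rows shown in the output above.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6, 5.0],
    "sepal_width": [3.5, 3.0, 3.2, 3.1, 3.6],
    "petal_length": [1.4, 1.4, 1.3, 1.5, 1.4],
    "petal_width": [0.2, 0.2, 0.2, 0.2, 0.2],
    "species": ["setosa"] * 5,
})

report = quick_check(sample)
print(report)
```

On the full dataset, quick_check(iris) should report a shape of (150, 5) and 50 rows per species.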
Step 2: Cleaning the Data
Dirty data can mess things up, so let's make sure ours is clean. We'll check for missing numbers and repeated rows.
Here's the code:
# Check for missing values
print(iris.isnull().sum())

# Check for duplicates
print("Duplicates:", iris.duplicated().sum())

# Remove duplicates
iris = iris.drop_duplicates()
Output:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
Duplicates: 1
Great news: the Iris dataset has no missing values! The check found 1 duplicate row, which drop_duplicates() removes, leaving 149 rows. Now our data is ready to explore.
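Before dropping duplicates, it can help to see exactly which rows repeat. The sketch below uses duplicated(keep=False) to show every copy of a repeated row, then drop_duplicates() to keep only the first. The small frame is hand-typed for illustration and is not the real dataset; the same calls work on iris.

```python
import pandas as pd

# Hand-typed illustration: the second and fourth rows are identical.
flowers = pd.DataFrame({
    "petal_length": [6.0, 5.1, 5.9, 5.1],
    "petal_width": [2.5, 1.9, 2.1, 1.9],
    "species": ["virginica"] * 4,
})

# keep=False marks every copy of a repeated row, not just the later ones.
repeats = flowers[flowers.duplicated(keep=False)]
print(repeats)

# drop_duplicates() keeps the first copy and removes the rest.
cleaned = flowers.drop_duplicates().reset_index(drop=True)
print(len(flowers), "->", len(cleaned))
```

Identical rows are not always mistakes, since two flowers can share the same measurements. Whether to drop them is a judgment call; this guide removes them.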
Step 3: Checking Out the Data
Let's learn about our data with exploratory data analysis (EDA). It's like asking, "What's special about these flowers?"
First, get a quick summary:
print(iris.describe())
This shows summary statistics for each column. For example, the average sepal length is about 5.8 cm. Next, let’s compare the flower types:
print(iris.groupby('species').mean())
This tells us:
- Setosa flowers have the smallest petals and the shortest sepals, although their sepals are the widest.
- Virginica flowers have the biggest petals.
- Versicolor is in the middle for petal size and sepal length.
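To back up comparisons like these with numbers, groupby can report several statistics at once. This is a minimal sketch on a hand-typed sample (two illustrative rows per species, not the full dataset); the same agg call works on iris.

```python
import pandas as pd

# Illustrative rows only: two flowers per species.
sample = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "sepal_width": [3.5, 3.2, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Mean, smallest, and largest value of each measurement, per species.
stats = sample.groupby("species").agg(["mean", "min", "max"])
print(stats)

# Rank the species by average petal length, smallest first.
order = (
    sample.groupby("species")["petal_length"]
    .mean()
    .sort_values()
    .index.tolist()
)
print(order)
```

Looking at the minimum and maximum next to the mean shows whether two species overlap on a measurement or are cleanly separated.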
Step 4: Making Charts
Charts make data fun to look at! We'll use Seaborn to create a few. (You'll need Matplotlib too for showing them.)
Bar Charts (Histograms)
These show how the values of each measurement are spread out, by counting how many flowers fall into each range:
import matplotlib.pyplot as plt

sns.set(style="whitegrid")
plt.figure(figsize=(8, 6))
for i, column in enumerate(iris.columns[:-1], 1):
    plt.subplot(2, 2, i)
    sns.histplot(data=iris, x=column, hue='species')
    plt.title(f'{column}')
plt.tight_layout()
plt.show()
Output: [figure: a 2x2 grid of histograms, one panel per measurement, colored by species]
You'll see setosa's petals are tiny, while virginica's are long.
Box Charts
Box charts show the median, the middle half of the values, and the overall range of each measurement:
plt.figure(figsize=(8, 6))
for i, column in enumerate(iris.columns[:-1], 1):
    plt.subplot(2, 2, i)
    sns.boxplot(x='species', y=column, data=iris)
    plt.title(f'{column}')
plt.tight_layout()
plt.show()
Output: [figure: a 2x2 grid of box charts, one panel per measurement, with one box per species]
These show that the measurements within each species stay in a fairly narrow range, with only a few outliers (drawn as separate points beyond the whiskers).
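By default, the separate points on a box chart follow a simple rule: a value is flagged if it lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third. The helper below is a sketch of that rule. The name iqr_outliers is hypothetical and the numbers are made up for illustration.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values that fall outside the whisker limits."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    return values[(values < lower) | (values > upper)]

# Made-up measurements: one value sits far from the rest.
widths = pd.Series([3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 4.8])
flagged = iqr_outliers(widths)
print(flagged.tolist())
```

On the real data, the helper could be applied to one species at a time, for example to the sepal_width values of the setosa rows, to list the points the chart draws separately.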
Dot Chart (Scatter Plot)
Let's compare sepal length and width:
plt.figure(figsize=(6, 4))
sns.scatterplot(x='sepal_length', y='sepal_width', hue='species', data=iris)
plt.title('Sepal Length vs Width')
plt.show()
Output: [figure: scatter plot of sepal length against sepal width, colored by species]
Setosa dots are separate, but versicolor and virginica mix a bit.
Step 5: What We Learned
After exploring, here’s what we found:
- No missing data—our dataset is clean!
- Removed one duplicate row, leaving 149 rows.
- Setosa flowers have the smallest petals and the shortest sepals, although their sepals are the widest.
- Virginica has the biggest petals, easy to spot.
- Sepal width overlaps the most between species, so it's the least helpful for telling them apart.
- Petal length and petal width are strongly related: flowers with longer petals tend to have wider ones too.
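The link between petal length and petal width can be checked with a correlation coefficient, where a value close to 1 means the two measurements rise together. Here is a minimal sketch on a hand-typed sample (illustrative rows, not the full dataset):

```python
import pandas as pd

# Illustrative rows only: two flowers per species.
sample = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# Pearson correlation between the two columns (ranges from -1 to 1).
r = sample["petal_length"].corr(sample["petal_width"])
print(round(r, 2))
```

On the full dataset, iris['petal_length'].corr(iris['petal_width']) should come out at roughly 0.96.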
These facts could help build a tool to guess flower types.
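As a rough illustration of such a tool, the sketch below guesses a species from petal measurements using two hand-picked cut-offs. The function name guess_species and the thresholds (2.5 cm and 1.75 cm) are assumptions chosen for illustration, not values derived in this guide; a real model would learn them from the data.

```python
def guess_species(petal_length: float, petal_width: float) -> str:
    """Guess the iris species from two petal measurements (in cm)."""
    if petal_length < 2.5:   # setosa petals are much shorter than the others
        return "setosa"
    if petal_width < 1.75:   # assumed cut-off between the two larger species
        return "versicolor"
    return "virginica"

print(guess_species(1.4, 0.2))  # first row of the table in Step 1
print(guess_species(4.5, 1.5))  # illustrative mid-sized petal
print(guess_species(6.0, 2.5))  # illustrative large petal
```

A rule this simple will misjudge some flowers where versicolor and virginica overlap, which is why the overlap seen in the charts matters.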