Personal · For fun

Pokémon Analysis with Python

Exploratory data analysis of 890+ Pokémon across 8 generations — types, BMI, and battle stats. Built while learning Python and statistics, with a dataset worth actually caring about.

Python · EDA · Matplotlib · Pandas · 2021

Background

I was learning statistics with Python and needed a dataset I'd actually enjoy working with. The Pokémon dataset — 890 Pokémon through Generation VIII, compiled by Mario Tormo Romero on Kaggle — covered types, height, weight, and six battle stats for every Pokémon and its regional varieties.

What made it a good learning exercise was that I already had prior knowledge to test against. Findings I couldn't verify from game knowledge alone felt genuinely surprising.

Pokémon through the generations

The dataset spans Generation I (1990s) through Generation VIII (2019–20). It has more than 890 entries because regional varieties like Meowth and Alolan Meowth share a Pokédex number but count as separate rows.

Figure: Pokémon count by generation and type distribution. Left: Pokémon added per generation. Right: distribution of primary types across all Pokémon.

Water is the most common type, at 13% of all Pokémon. Normal and Grass follow. Flying ranks surprisingly low as a primary type despite how many Pokémon have it as a secondary.
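
Counts like these could be produced with a few `value_counts` calls. This is a minimal sketch, not the original notebook: the column names (`generation`, `type_1`, `type_2`) are assumptions about how the dataset is laid out, and the six rows are a hand-typed toy sample rather than the real data.

```python
import pandas as pd

# Toy sample standing in for the real dataset.
# Column names are assumptions, not confirmed against the Kaggle file.
df = pd.DataFrame(
    {
        "name": ["Squirtle", "Psyduck", "Bulbasaur", "Pidgey", "Chikorita", "Totodile"],
        "generation": [1, 1, 1, 1, 2, 2],
        "type_1": ["Water", "Water", "Grass", "Normal", "Grass", "Water"],
        "type_2": [None, None, "Poison", "Flying", None, None],
    }
)

# Rows added per generation, in generation order.
per_generation = df["generation"].value_counts().sort_index()

# Share of each primary type as a fraction of all rows.
primary_share = df["type_1"].value_counts(normalize=True)

# How often each type appears as a secondary type (missing values are skipped).
secondary_counts = df["type_2"].value_counts()

print(per_generation)
print(primary_share.round(2))
print(secondary_counts)
```

Comparing `primary_share` with `secondary_counts` is one way to see a type that is rare as a primary but common as a secondary, which is the pattern described for Flying.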

Height, weight & BMI

The scatter plot shows a positive correlation between height and weight, which is expected and consistent with what we'd see in real animals. There is one clear outlier, though: near-zero height and a weight of roughly 1,000 kg.

Figure: scatter plot of height (m) vs weight (kg) for all Pokémon. The extreme outlier at bottom-right is Cosmoem.

The outlier is Cosmoem, a Psychic-type Protostar Pokémon. At 0.1 m tall and 999.9 kg, it's one of the densest objects in the Pokédex, modelled on a protostellar core. The data is technically correct.
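
To make the size of the outlier concrete: BMI is weight in kilograms divided by height in metres squared, so Cosmoem comes out at 999.9 / 0.1² ≈ 99,990. A small sketch of that calculation follows. The column names `height_m` and `weight_kg` are assumptions, and the comparison rows are typed in by hand for illustration.

```python
import pandas as pd

# Hand-typed toy rows; column names are assumptions about the dataset.
df = pd.DataFrame(
    {
        "name": ["Pikachu", "Snorlax", "Onix", "Cosmoem"],
        "height_m": [0.4, 2.1, 8.8, 0.1],
        "weight_kg": [6.0, 460.0, 210.0, 999.9],
    }
)

# BMI = weight (kg) / height (m) squared.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Pick out the row with the largest BMI.
outlier = df.loc[df["bmi"].idxmax()]

print(df.sort_values("bmi", ascending=False)[["name", "bmi"]].round(1))
print("Largest BMI:", outlier["name"])
```

Because height is squared in the denominator, a very short Pokémon with a large weight dominates any BMI scale, which is why a single row can distort a whole type.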

With the Psychic type excluded (dominated by Cosmoem's extreme BMI), the BMI distribution by type shows something coherent: Steel and Rock types are the densest, Fairy types the lightest.

Figure: box plot of BMI distribution by primary type (Psychic excluded as an outlier due to Cosmoem).
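
The per-type comparison could be sketched roughly as below: filter out the Psychic rows, then summarise BMI per primary type. Again, `type_1`, `height_m`, and `weight_kg` are assumed column names, and the five rows are a toy sample, so the ordering it prints says nothing about the full dataset.

```python
import pandas as pd

# Toy sample; column names are assumptions, values typed by hand.
df = pd.DataFrame(
    {
        "name": ["Registeel", "Geodude", "Clefairy", "Togepi", "Cosmoem"],
        "type_1": ["Steel", "Rock", "Fairy", "Fairy", "Psychic"],
        "height_m": [1.9, 0.4, 0.6, 0.3, 0.1],
        "weight_kg": [205.0, 20.0, 7.5, 1.5, 999.9],
    }
)
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Exclude the Psychic type, whose scale is dominated by Cosmoem.
no_psychic = df[df["type_1"] != "Psychic"]

# Median BMI per primary type, densest first.
median_bmi = (
    no_psychic.groupby("type_1")["bmi"].median().sort_values(ascending=False)
)

print(median_bmi.round(1))
```

The median is used here because it is less sensitive to extreme rows than the mean. With Matplotlib installed, `no_psychic.boxplot(column="bmi", by="type_1")` would draw one box per type from the same filtered frame.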

Battle stats

The six battle stats (HP, Attack, Defense, Special Attack, Special Defense, Speed) all correlate positively with each other, but only weakly to moderately. The strongest relationship is between Defense and Special Defense (r = 0.54). Speed and Defense show almost no correlation, which makes sense intuitively: fast Pokémon tend to be fragile, which plausibly offsets the general tendency for stronger Pokémon to score higher across the board.

Figure: box plots and correlation heatmap of battle stats. Left: range of each battle stat. Right: pairwise correlation between all six stats.
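
A pairwise correlation matrix like the one in the heatmap can be computed with `DataFrame.corr`, which uses Pearson's r by default. The sketch below assumes stat columns named `hp`, `attack`, `defense`, `sp_attack`, `sp_defense`, and `speed`, and runs on five hand-typed rows, so its numbers will not match the r = 0.54 reported for the full dataset.

```python
from itertools import combinations

import pandas as pd

# Assumed column names for the six battle stats.
stat_cols = ["hp", "attack", "defense", "sp_attack", "sp_defense", "speed"]

# Toy rows typed by hand for illustration only.
df = pd.DataFrame(
    [
        ["Pikachu", 35, 55, 40, 50, 50, 90],
        ["Snorlax", 160, 110, 65, 65, 110, 30],
        ["Shuckle", 20, 10, 230, 10, 230, 5],
        ["Mewtwo", 106, 110, 90, 154, 90, 130],
        ["Blissey", 255, 10, 10, 75, 135, 55],
    ],
    columns=["name"] + stat_cols,
)

# 6 x 6 matrix of Pearson correlations between the stats.
corr = df[stat_cols].corr()

# Each unordered pair of distinct stats appears once (15 pairs in total).
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(stat_cols, 2)}

# The pair with the largest absolute correlation.
strongest = max(pairs, key=lambda pair: abs(pairs[pair]))

print(corr.round(2))
print("Strongest pair:", strongest)
```

Listing the pairs explicitly avoids reading the diagonal (always 1) or counting each pair twice when looking for the strongest relationship.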

Looking at the highest-rated Pokémon per stat:

Battle stat     Highest-rated Pokémon
HP              Blissey
Attack          Mega Mewtwo X
Defense         Eternatus Eternamax
Sp. Attack      Mega Mewtwo Y
Sp. Defense     Eternatus Eternamax
Speed           Deoxys Speed Forme

Blissey tops HP, which is exactly why it dominates gym defense: its absurd HP makes it the hardest gym defender to knock out. Seeing the data confirm something any competitive player already knows felt satisfying.
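
A lookup like the table above can be done with `idxmax`, which returns the index label of the largest value in each column. This is a sketch under assumptions: the stat column names are guesses at the dataset's layout, and the stat values are typed in by hand for five Pokémon, so they should be checked against the dataset before being relied on.

```python
import pandas as pd

# Assumed column names for the six battle stats.
stat_cols = ["hp", "attack", "defense", "sp_attack", "sp_defense", "speed"]

# Hand-typed toy rows; verify the values against the real dataset.
df = pd.DataFrame(
    [
        ["Blissey", 255, 10, 10, 75, 135, 55],
        ["Mega Mewtwo X", 106, 190, 100, 154, 100, 130],
        ["Mega Mewtwo Y", 106, 150, 70, 194, 120, 140],
        ["Deoxys Speed Forme", 50, 95, 90, 95, 90, 180],
        ["Eternatus Eternamax", 255, 115, 250, 125, 250, 130],
    ],
    columns=["name"] + stat_cols,
)

# With names as the index, idxmax returns a name per stat column.
# On ties, idxmax keeps the first matching row in the frame.
top_per_stat = df.set_index("name")[stat_cols].idxmax()

print(top_per_stat)
```

In this toy frame two rows share the top HP value, so the result for `hp` depends on row order. Checking for ties, for example with `df[df["hp"] == df["hp"].max()]`, is a cheap safeguard before reporting a single winner.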

What I took from it

The value of this project wasn't the Pokémon — it was learning to let domain knowledge guide analysis. I knew roughly what the data should say, which made anomalies stand out immediately. Cosmoem's BMI wasn't a data error; it was a signal worth investigating.

It also reinforced something about EDA: the most interesting findings often come from the outliers, not the averages.
