Pokémon Analysis with Python

Background

I was learning statistics with Python and needed a dataset I'd actually enjoy working with. The Pokémon dataset — 890+ Pokémon through Generation VIII, compiled by Mario Tormo Romero on Kaggle — covered types, height, weight, and six battle stats for every Pokémon and its regional varieties.

What made it a good learning exercise was that I already had domain knowledge to test results against. Findings that matched what I knew gave me confidence in the method, and the ones I couldn't have checked from game knowledge alone felt genuinely surprising.

Pokémon through the generations

The dataset runs from Generation I (1990s) through Generation VIII (2019–20). It has more than 890 entries because regional varieties such as Meowth and Alolan Meowth share a Pokédex number but count as separate rows.
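The gap between row count and Pokédex count can be checked directly. This is a minimal sketch on a few hand-typed rows rather than the real file, and the field names `pokedex_number` and `name` are assumptions about how the CSV is laid out, not confirmed column names.

```python
from collections import Counter

# Hand-typed sample rows; the field names are assumed, not taken from the real CSV.
rows = [
    {"pokedex_number": 52, "name": "Meowth"},
    {"pokedex_number": 52, "name": "Alolan Meowth"},
    {"pokedex_number": 53, "name": "Persian"},
    {"pokedex_number": 53, "name": "Alolan Persian"},
    {"pokedex_number": 54, "name": "Psyduck"},
]


def count_varieties(rows):
    """Return (total rows, distinct Pokédex numbers, numbers used by more than one row)."""
    per_number = Counter(row["pokedex_number"] for row in rows)
    shared = sorted(number for number, count in per_number.items() if count > 1)
    return len(rows), len(per_number), shared


total_rows, distinct_numbers, shared_numbers = count_varieties(rows)
print(total_rows, distinct_numbers, shared_numbers)
```

Any Pokédex number in the third value has at least one extra row, which is where the additional entries come from.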

Figure: Pokémon count by generation and type distribution. Left: Pokémon added per generation. Right: distribution of primary types across all Pokémon.

Water is the most common primary type, at 13% of all Pokémon. Normal and Grass follow. Flying ranks surprisingly low as a primary type despite how many Pokémon have it as a secondary type.
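The primary versus secondary comparison comes down to counting each type slot separately. The sketch below does that on a handful of hand-typed rows; the tuple layout (name, primary type, secondary type) is an assumption made for illustration, and the percentages it prints describe only this tiny sample, not the dataset.

```python
from collections import Counter

# (name, primary type, secondary type or None); hand-typed illustrative rows.
rows = [
    ("Squirtle", "Water", None),
    ("Gyarados", "Water", "Flying"),
    ("Pidgey", "Normal", "Flying"),
    ("Charizard", "Fire", "Flying"),
    ("Tornadus", "Flying", None),
    ("Bulbasaur", "Grass", "Poison"),
]


def primary_type_shares(rows):
    """Percentage of rows per primary type, most common first."""
    counts = Counter(primary for _name, primary, _secondary in rows)
    return {ptype: 100 * n / len(rows) for ptype, n in counts.most_common()}


def slot_counts(rows, type_name):
    """How often a type appears in the primary slot and in the secondary slot."""
    as_primary = sum(1 for _n, primary, _s in rows if primary == type_name)
    as_secondary = sum(1 for _n, _p, secondary in rows if secondary == type_name)
    return as_primary, as_secondary


print(primary_type_shares(rows))
print(slot_counts(rows, "Flying"))
```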

Height, weight & BMI

The scatter plot shows a positive correlation between height and weight, which is expected and consistent with what we'd see in real animals. But there's a clear outlier with near-zero height and a weight of roughly 1,000 kg.

Figure: Scatter plot of height (m) vs weight (kg) for all Pokémon. The extreme outlier at top-left is Cosmoem.

The outlier is Cosmoem, a Psychic-type Protostar Pokémon. At 0.1 m tall and 999.9 kg, it's one of the densest objects in the Pokédex, modelled on a protostellar core. The data is technically correct.

With the Psychic type excluded (Cosmoem's extreme BMI dominates it), the BMI distribution by type shows a coherent pattern: Steel and Rock types have the highest BMI, Fairy types the lowest.
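A minimal sketch of that comparison, assuming BMI is the usual weight in kilograms divided by height in metres squared. Cosmoem's figures are the ones quoted above; the other rows are illustrative values typed in by hand, and the function names are hypothetical.

```python
from collections import defaultdict
from statistics import median

# (name, primary type, height in m, weight in kg).
# Only Cosmoem's numbers come from the text; the rest are illustrative.
sample = [
    ("Cosmoem", "Psychic", 0.1, 999.9),
    ("Abra", "Psychic", 0.9, 19.5),
    ("Aron", "Steel", 0.4, 60.0),
    ("Geodude", "Rock", 0.4, 20.0),
    ("Clefairy", "Fairy", 0.6, 7.5),
    ("Togepi", "Fairy", 0.3, 1.5),
]


def bmi(height_m, weight_kg):
    """Assumed definition: weight divided by height squared."""
    return weight_kg / height_m ** 2


def median_bmi_by_type(rows, exclude=()):
    """Median BMI per primary type, skipping any types listed in `exclude`."""
    groups = defaultdict(list)
    for _name, ptype, height_m, weight_kg in rows:
        if ptype not in exclude:
            groups[ptype].append(bmi(height_m, weight_kg))
    return {ptype: median(values) for ptype, values in groups.items()}


print(round(bmi(0.1, 999.9)))  # Cosmoem, roughly 99,990
print(median_bmi_by_type(sample, exclude={"Psychic"}))
```

Using the median rather than the mean is a design choice made here, not something the post states: it keeps a single extreme value from dragging a whole type's summary.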

Figure: Box plot of BMI by primary type (Psychic excluded as an outlier due to Cosmoem).

Battle stats

The six battle stats (HP, Attack, Defense, Special Attack, Special Defense, Speed) all correlate positively with each other, though only weakly to moderately. The strongest relationship is between Defense and Special Defense (r = 0.54). Speed and Defense show almost no correlation, which makes sense intuitively: plenty of fast Pokémon are fragile, so knowing a Pokémon's Speed tells you little about how sturdy it is.
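The pairwise r values can be computed with a plain Pearson correlation. The sketch below writes the formula out by hand and runs it on six hand-typed rows, so the numbers it prints will not match the figures quoted above; the stat names in `STATS` are assumed labels, not confirmed column names.

```python
from itertools import combinations
from math import sqrt

STATS = ["hp", "attack", "defense", "sp_attack", "sp_defense", "speed"]

# Illustrative rows typed in by hand, one value per entry in STATS.
sample = {
    "Blissey": (255, 10, 10, 75, 135, 55),
    "Shuckle": (20, 10, 230, 10, 230, 5),
    "Pikachu": (35, 55, 40, 50, 50, 90),
    "Snorlax": (160, 110, 65, 65, 110, 30),
    "Mewtwo": (106, 110, 90, 154, 90, 130),
    "Ninjask": (61, 90, 45, 50, 50, 160),
}


def pearson(xs, ys):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def pairwise_correlations(table):
    """Map each pair of stat names to its Pearson r across all rows."""
    columns = list(zip(*table.values()))  # one tuple of values per stat
    return {
        (STATS[i], STATS[j]): pearson(columns[i], columns[j])
        for i, j in combinations(range(len(STATS)), 2)
    }


corr = pairwise_correlations(sample)
strongest = max(corr, key=lambda pair: abs(corr[pair]))
print(strongest, round(corr[strongest], 2))
```

With the full dataset loaded into pandas, `DataFrame.corr()` should produce the same matrix in one call, since its default method is Pearson.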

Looking at the highest-rated Pokémon per stat:

Battle Stat	Highest Rated Pokémon
HP	Blissey
Attack	Mega Mewtwo X
Defense	Eternatus Eternamax
Sp. Attack	Mega Mewtwo Y
Sp. Defense	Eternatus Eternamax
Speed	Deoxys Speed Forme

Blissey tops HP, which confirms something any competitive player already knows: its absurd HP makes it the hardest gym defender to knock out. Finding that in the data felt satisfying.
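A leaderboard like the table above amounts to taking the row with the maximum value for each stat. This is a sketch on five hand-typed rows with assumed stat labels, not the code that produced the table.

```python
STATS = ("hp", "attack", "defense", "sp_attack", "sp_defense", "speed")

# Illustrative rows typed in by hand, one value per entry in STATS.
sample = [
    ("Blissey", (255, 10, 10, 75, 135, 55)),
    ("Mega Mewtwo X", (106, 190, 100, 154, 100, 130)),
    ("Mega Mewtwo Y", (106, 150, 70, 194, 120, 140)),
    ("Deoxys Speed Forme", (50, 95, 90, 95, 90, 180)),
    ("Eternatus Eternamax", (255, 115, 250, 125, 250, 130)),
]


def top_per_stat(rows):
    """Return {stat: name of the row with the highest value for that stat}."""
    leaders = {}
    for index, stat in enumerate(STATS):
        # max() keeps the first row it meets when several rows tie.
        name, _values = max(rows, key=lambda row: row[1][index])
        leaders[stat] = name
    return leaders


for stat, name in top_per_stat(sample).items():
    print(f"{stat}: {name}")
```

Because `max` keeps the first row on a tie, row order decides the winner whenever two Pokémon share a top value, which is worth checking before reading too much into a single name.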

What I took from it

The real takeaway was how much faster analysis goes when you already know the domain. I could spot anomalies immediately because I knew what the data should look like. Cosmoem's BMI wasn't a data error; it was a signal worth investigating.
