US National Baby Names · Michael Holtzscher

The data for this project comes from the Kaggle US Baby Names dataset, which contains Social Security Administration baby-name data from 1880 to 2014. For privacy, a name is included only if at least 5 babies received it in a given year (or in a given year and state, for the state-level data).

Preparing The Data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
font_size = 16

df = pd.read_csv('NationalNames.csv')
df.head()

	Id	Name	Year	Gender	Count
0	1	Mary	1880	F	7065
1	2	Anna	1880	F	2604
2	3	Emma	1880	F	2003
3	4	Elizabeth	1880	F	1939
4	5	Minnie	1880	F	1746

Births Per Year

Note: This plot does not exactly represent the number of births in the US, because the data only includes names given to at least 5 babies in a year. It is meant to give a rough sense of how the number of births has changed over time.

# Sum only the Count column; summing the whole frame would also add up Id and the text columns
total_per_year = df.groupby('Year')['Count'].sum()
axis = total_per_year.plot.line(figsize=(20,10), fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Births', fontsize=30)
axis.set_title('Births Per Year(Estimated)', fontsize=30)
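
To illustrate why the totals above are an undercount, here is a minimal sketch using made-up numbers. The names, counts, and the `MIN_COUNT` constant are hypothetical and only mirror the "at least 5 babies" rule described in the note. They are not taken from the dataset.

```python
import pandas as pd

# Hypothetical counts for a single year; these numbers are invented for illustration.
true_counts = pd.DataFrame({
    'Name': ['Mary', 'Anna', 'Zelpha', 'Orvilla'],
    'Count': [7065, 2604, 4, 3],
})

MIN_COUNT = 5  # assumed privacy threshold, as described in the dataset notes

# Rows below the threshold would never appear in the published data.
published = true_counts[true_counts['Count'] >= MIN_COUNT]

true_total = int(true_counts['Count'].sum())
published_total = int(published['Count'].sum())
missing = true_total - published_total  # births that the published data cannot show
```

In this toy case the published total misses 7 births. In the real data the size of the gap is unknown, which is why the plot is labeled as an estimate.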

Births Per Year By Gender

gender = df.groupby(['Year', 'Gender'])['Count'].sum()
axis = gender.unstack().plot.line(figsize=(20,10), fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Births', fontsize=30)
axis.set_title('Births Per Year By Gender', fontsize=30)
axis.legend(fontsize=font_size)

Proportion of Male to Female Births Per Year

I was curious how the numbers of male and female births compare. I expected the split to be close to 50/50, but this was not the case. This sparked some interesting questions: “Why are there more males born in the last 70 years?” and “Why the flip around the 1940s?” Perhaps these will be a future exploration.

# Total births per year, with one column per gender
ratio = df.groupby(['Year', 'Gender'])['Count'].sum().unstack()
# Male births divided by female births for each year
ratio['ratio'] = ratio['M'] / ratio['F']
axis = ratio['ratio'].plot.line(figsize=(20,10), grid=True, fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Proportion(Male/Female)', fontsize=30)
axis.set_title('Proportion of Male to Female Births Per Year', fontsize=30)
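
As a small worked example of the ratio calculation, the sketch below runs the same group-then-divide steps on a tiny hand-made frame. The `toy` data is invented for illustration and only borrows the column names from the dataset.

```python
import pandas as pd

# Invented rows using the same columns as NationalNames.csv
toy = pd.DataFrame({
    'Name': ['Mary', 'John', 'Mary', 'John', 'Anna'],
    'Year': [1880, 1880, 1881, 1881, 1881],
    'Gender': ['F', 'M', 'F', 'M', 'F'],
    'Count': [100, 90, 80, 120, 40],
})

# One row per year, one column per gender
by_gender = toy.groupby(['Year', 'Gender'])['Count'].sum().unstack()

# Values above 1 mean more recorded male births than female births
ratio = by_gender['M'] / by_gender['F']
```

Here 1880 gives 90 / 100 = 0.9 and 1881 gives 120 / 120 = 1.0, so a value of exactly 1 corresponds to the 50/50 split I expected.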

Top 10 Names All Time

# Total count per name across all years, keeping the 10 largest
top_names = df.groupby('Name')['Count'].sum()
top_10 = top_names.nlargest(10).index.tolist()

top_names = df[df['Name'].isin(top_10)].groupby(['Year', 'Name'])['Count'].sum()
axis = top_names.unstack().plot.line(figsize=(20,10), fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Births', fontsize=30)
axis.set_title('Top 10 Names (1880-2014)', fontsize=30)
axis.legend(fontsize=font_size)

Top 10 Female Names All Time

females = df[df['Gender'] == 'F']
top_10 = females.groupby('Name')['Count'].sum().nlargest(10).index.tolist()

# Plot from the female-only rows so male babies with the same names are not counted
top_names = females[females['Name'].isin(top_10)].groupby(['Year', 'Name'])['Count'].sum()
axis = top_names.unstack().plot.line(figsize=(20,10), fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Births', fontsize=30)
axis.set_title('Top 10 Female Names (1880-2014)', fontsize=30)
axis.legend(fontsize=font_size)

Top 10 Male Names All Time

males = df[df['Gender'] == 'M']
top_10 = males.groupby('Name')['Count'].sum().nlargest(10).index.tolist()

# Plot from the male-only rows so female babies with the same names are not counted
top_names = males[males['Name'].isin(top_10)].groupby(['Year', 'Name'])['Count'].sum()
axis = top_names.unstack().plot.line(figsize=(20,10), fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Births', fontsize=30)
axis.set_title('Top 10 Male Names (1880-2014)', fontsize=30)
axis.legend(fontsize=font_size)

Diversity In Names

Looking at the top 10 names, I noticed that their counts drop off as they approach the present day. This was interesting considering that births per year have trended up consistently. This plot shows the number of distinct names used each year.

# Count the distinct names that appear in each year
diversity = df.groupby('Year')['Name'].nunique()
axis = diversity.plot.line(figsize=(20,10), fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Number of Names Used', fontsize=30)
axis.set_title('Diversity In Names', fontsize=30)
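
To make the "number of names used" measure concrete, here is a minimal sketch on invented data. It assumes a name should count once per year even when it is recorded for both genders. The `toy` frame is hypothetical and only reuses the dataset's column names.

```python
import pandas as pd

# Invented rows; note that Mary appears for both genders in 1881
toy = pd.DataFrame({
    'Name': ['Mary', 'Anna', 'Mary', 'Anna', 'Emma', 'Mary'],
    'Year': [1880, 1880, 1881, 1881, 1881, 1881],
    'Gender': ['F', 'F', 'F', 'F', 'F', 'M'],
    'Count': [100, 50, 90, 60, 30, 5],
})

# nunique counts each distinct name once per year, regardless of gender or count
names_per_year = toy.groupby('Year')['Name'].nunique()
```

In this example 1880 has 2 distinct names and 1881 has 3, since the two Mary rows in 1881 count as one name.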

Most Consistent Names

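After seeing the variety of names greatly increase, I was curious whether any names had been used consistently. This plot shows the 10 names with the lowest standard deviation in yearly count over the full time range. Only names that appear in every year are considered, because a name with any missing year gets a NaN standard deviation (skipna=False) and is then dropped.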

# One row per name, one column per year; restrict to Count so std only sees numbers
consistent = df.groupby(['Name', 'Year'])['Count'].sum()
consistent = consistent.unstack().std(axis=1, skipna=False).dropna().sort_values()[:10]

top_10 = consistent.index.tolist()

con_names = df[df['Name'].isin(top_10)].groupby(['Year', 'Name'])['Count'].sum()
axis = con_names.unstack().plot.line(figsize=(20,10), fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Births', fontsize=30)
axis.set_title('Most Consistent Names', fontsize=30)
axis.legend(fontsize=font_size)
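
The ranking by standard deviation can be sketched on a tiny invented frame. This example is only meant to show how `skipna=False` followed by `dropna()` excludes names that are missing from any year. The `toy` data is hypothetical.

```python
import pandas as pd

# Invented rows; Emma has no entry for 1881
toy = pd.DataFrame({
    'Name': ['Mary', 'Mary', 'Mary', 'Anna', 'Anna', 'Anna', 'Emma', 'Emma'],
    'Year': [1880, 1881, 1882, 1880, 1881, 1882, 1880, 1882],
    'Count': [100, 100, 100, 10, 20, 30, 7, 7],
})

# One row per name, one column per year; missing years become NaN
wide = toy.groupby(['Name', 'Year'])['Count'].sum().unstack()

# With skipna=False, any NaN in a row makes that row's std NaN, so dropna() removes it
spread = wide.std(axis=1, skipna=False).dropna().sort_values()
```

Mary has a spread of 0 and ranks first, Anna follows with a spread of 10, and Emma is excluded even though her recorded counts never change. One consequence is that a low standard deviation favors names with small counts overall, since small numbers cannot vary by much in absolute terms.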

Least Consistent Names

This plot looks at the 10 names with the greatest standard deviation over the full time range.

# One row per name, one column per year; restrict to Count so std only sees numbers
consistent = df.groupby(['Name', 'Year'])['Count'].sum()
consistent = consistent.unstack().std(axis=1, skipna=False).dropna().sort_values()[-10:]

top_10 = consistent.index.tolist()

con_names = df[df['Name'].isin(top_10)].groupby(['Year', 'Name'])['Count'].sum()
axis = con_names.unstack().plot.line(figsize=(20,10), fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Births', fontsize=30)
axis.set_title('Least Consistent Names', fontsize=30)
axis.legend(fontsize=font_size)

All The Michaels

Just for fun, here is the history of my own name.

michael = df[df['Name'] == 'Michael'].groupby('Year')['Count'].sum()
axis = michael.plot.line(figsize=(20,10), fontsize=font_size)
axis.set_xlabel('Year', fontsize=30)
axis.set_ylabel('Births', fontsize=30)
axis.set_title('Babies Named Michael By Year', fontsize=30)

The Jupyter Notebook for this work can be found on GitHub.