The data for this project comes from Google’s public BigQuery datasets. It is available on Kaggle as the GitHub Repos dataset.

Getting The Data

import pandas as pd
from google.cloud import bigquery
from bq_helper import BigQueryHelper
# Helper bound to the public GitHub Repos dataset
bq = BigQueryHelper('bigquery-public-data', 'github_repos')
QUERY = """
        SELECT license, COUNT(license) as license_count
        FROM `bigquery-public-data.github_repos.licenses`
        GROUP BY license
        """

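# Runs the query only if its estimated scan size is under the helper's limit (1 GB by default)
df = bq.query_to_pandas_safe(QUERY)
# Horizontal bar chart, sorted so the most common license ends up at the top
ax = df.sort_values('license_count').plot.barh(x='license', y='license_count', legend=False)
ax.set_title('Distribution of GitHub Licenses')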
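
To see what the query computes without access to BigQuery, here is a minimal sketch of the same GROUP BY and COUNT step in plain Python. The sample rows and the count_licenses helper are made up for illustration and are not taken from the real dataset. Rows with a missing license are skipped, since COUNT(license) does not count NULL values.

```python
from collections import Counter

# Hypothetical sample rows (repo_name, license); not real data from the dataset.
sample_rows = [
    ("user/a", "mit"),
    ("user/b", "apache-2.0"),
    ("user/c", "mit"),
    ("user/d", None),
    ("user/e", "gpl-3.0"),
    ("user/f", "mit"),
]


def count_licenses(rows):
    """Return (license, count) pairs sorted by ascending count, then by name."""
    # Tally each non-missing license, like COUNT(license) ... GROUP BY license.
    counts = Counter(lic for _, lic in rows if lic is not None)
    # Ascending order matches the sort_values('license_count') call used for the plot.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))


license_counts = count_licenses(sample_rows)
for name, n in license_counts:
    print(f"{name}: {n}")
```

The resulting list of pairs has the same shape as the two columns of df, so it can serve as a quick sanity check on the logic before running the full query.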