The data for this project is taken from Google’s public BigQuery datasets. The dataset can be easily accessed on Kaggle via the GitHub Repos dataset.

Getting The Data

import pandas as pd
from google.cloud import bigquery
from bq_helper import BigQueryHelper
bq = BigQueryHelper('bigquery-public-data', 'github_repos')
QUERY = """
        SELECT license, COUNT(license) as license_count
        FROM `bigquery-public-data.github_repos.licenses`
        GROUP BY license
        """

df = bq.query_to_pandas_safe(QUERY)
ax = df.sort_values('license_count').plot.barh(x='license', y='license_count', legend=False)
ax.set_xlabel('License')
ax.set_ylabel('Count')
ax.set_title('Distribution of GitHub Licenses')

png