{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "26Kc3Glqz45E" }, "source": [ "# Loading and Cleaning the Data\n", "\n", "First, we'll load in the raw Kaggle data. We're not working in Dataquest's code editor, so we have to load in the dataset ourselves from the root directory." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_fX7lwcM1_KO" }, "outputs": [], "source": [ "import csv\n", "\n", "with open('kaggle2021-short.csv') as f:\n", " reader = csv.reader(f, delimiter=\",\")\n", " kaggle_data = list(reader)\n", " \n", "column_names = kaggle_data[0]\n", "survey_responses = kaggle_data[1:]" ] }, { "cell_type": "markdown", "metadata": { "id": "xsaz4FF92Qdb" }, "source": [ "We've loaded in the raw dataset where all of the data is in terms of strings. Before we do any analysis, we'll make sure that each column is properly represented in the appropriate type. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "7eyhM2272psk" }, "outputs": [], "source": [ "# Iterate over the indices so that we can update all of the data\n", "num_rows = len(survey_responses)\n", "for i in range(num_rows):\n", "\n", " # experience_coding\n", " survey_responses[i][0] = float(survey_responses[i][0]) \n", " \n", " # python_user\n", " if survey_responses[i][1] == \"TRUE\":\n", " survey_responses[i][1] = True\n", " else:\n", " survey_responses[i][1] = False\n", " \n", " # r_user\n", " if survey_responses[i][2] == \"TRUE\":\n", " survey_responses[i][2] = True\n", " else:\n", " survey_responses[i][2] = False\n", "\n", " # sql_user\n", " if survey_responses[i][3] == \"TRUE\":\n", " survey_responses[i][3] = True\n", " else:\n", " survey_responses[i][3] = False\n", "\n", " # most_used\n", " if survey_responses[i][4] == \"None\":\n", " survey_responses[i][4] = None\n", " else:\n", " survey_responses[i][4] = survey_responses[i][4]\n", "\n", "\n", " # compensation\n", " survey_responses[i][5] = int(survey_responses[i][5]) " ] }, { "cell_type": "markdown", "metadata": { "id": "acMHmS-X7Dml" }, "source": [ "# Counting People\n", "\n", "As a first exercise, we'll count how many people report knowing Python, R, and SQL. We'll combine an `if-else` statement with a `for` loop. We only need to do something if we see a `True`, so we don't need an `else` branch here. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "h10O9uvv8RGp", "outputId": "de856e27-e350-49b4-dd17-756597111e7b" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of Python users: 21860\n", "Number of R users: 5335\n", "Number of SQL users: 10757\n", "Proportion of Python users: 0.8416432449081739\n", "Proportion of R users: 0.20540561352173412\n", "Proportion of SQL users: 0.4141608593539445\n" ] } ], "source": [ "python_user_count = 0\n", "r_user_count = 0\n", "sql_user_count = 0\n", "\n", "for i in range(num_rows):\n", "\n", " # Detect if python_user column is True\n", " if survey_responses[i][1]:\n", " python_user_count = python_user_count + 1\n", " \n", " # Detect if r_user column is True\n", " if survey_responses[i][2]:\n", " r_user_count = r_user_count + 1\n", "\n", " # Detect if sql_user column is True\n", " if survey_responses[i][3]:\n", " sql_user_count = sql_user_count + 1\n", "\n", "print(\"Number of Python users: \" + str(python_user_count))\n", "print(\"Number of R users: \" + str(r_user_count))\n", "print(\"Number of SQL users: \" + str(sql_user_count))\n", "\n", "print(\"Proportion of Python users: \" + str(python_user_count / num_rows))\n", "print(\"Proportion of R users: \" + str(r_user_count / num_rows))\n", "print(\"Proportion of SQL users: \" + str(sql_user_count / num_rows))" ] }, { "cell_type": "markdown", "metadata": { "id": "Mr9Ui7kj4VRW" }, "source": [ "# Aggregating Information\n", "\n", "Here, we'll summarize the `experience_coding` and `compensation` columns to learn more about the survey participants. More specifically, we'll check both the range and average of each column. The range will be useful for understanding how spread out the values are, while the average helps indicate what a \"typical\" value looks like." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "qJgm852k4-R9" }, "outputs": [], "source": [ "# Aggregating all years of experience and compensation together into a single list\n", "experience_coding_column = []\n", "compensation_column = []\n", "\n", "for i in range(num_rows):\n", " experience_coding_column.append(survey_responses[i][0])\n", " compensation_column.append(survey_responses[i][5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "tqi1yYsm6H_e", "outputId": "42243651-3466-4a3a-b32c-4bb77541bbbb" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Minimum years of experience: 0.0\n", "Maximum years of experience: 30.0\n", "Average years of experience: 5.297231740653729\n" ] } ], "source": [ "# Summarizing the experience_coding column\n", "min_experience_coding = min(experience_coding_column)\n", "max_experience_coding = max(experience_coding_column)\n", "avg_experience_coding = sum(experience_coding_column) / num_rows\n", "\n", "print(\"Minimum years of experience: \" + str(min_experience_coding))\n", "print(\"Maximum years of experience: \" + str(max_experience_coding))\n", "print(\"Average years of experience: \" + str(avg_experience_coding))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oAV0ssAs6nCf", "outputId": "5592ce62-e5f7-4425-ff0f-c211faf7e662" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Minimum compensation: 0\n", "Maximum compensation: 1492951\n", "Average compensation: 53252.81696377007\n" ] } ], "source": [ "# Summarizing the compensation column\n", "min_compensation = min(compensation_column)\n", "max_compensation = max(compensation_column)\n", "avg_compensation = sum(compensation_column) / num_rows\n", "\n", "print(\"Minimum compensation: \" + str(min_compensation))\n", "print(\"Maximum compensation: \" + str(max_compensation))\n", "print(\"Average compensation: \" + str(avg_compensation))" ] }, { "cell_type": "markdown", "metadata": { "id": "ClVcAE04zWj1" }, "source": [ "# Categorizing Years of Experience\n", "\n", "To do a more detailed analysis, we'll need to categorize everyone in terms of their years of experience. We'll add a new column to the dataset that contains this category. We'll bin years of experience in five-year increments." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "AMlTXz7xzja8" }, "outputs": [], "source": [ "for i in range(num_rows):\n", "\n", " if survey_responses[i][0] < 5:\n", " survey_responses[i].append(\"<5 Years\")\n", " \n", " elif survey_responses[i][0] >= 5 and survey_responses[i][0] < 10:\n", " survey_responses[i].append(\"5-10 Years\")\n", "\n", " elif survey_responses[i][0] >= 10 and survey_responses[i][0] < 15:\n", " survey_responses[i].append(\"10-15 Years\")\n", " \n", " elif survey_responses[i][0] >= 15 and survey_responses[i][0] < 20:\n", " survey_responses[i].append(\"15-20 Years\")\n", "\n", " elif survey_responses[i][0] >= 20 and survey_responses[i][0] < 25:\n", " survey_responses[i].append(\"20-25 Years\")\n", " \n", " else:\n", " survey_responses[i].append(\"25+ Years\")" ] }, { "cell_type": "markdown", "metadata": { "id": "Hf5YKC4oAjI8" }, "source": [ "# Distibution of Experience and Compensation\n", "\n", "Now that we have a new category for years of experience, we'll use these to create a set of lists that contain the `compensation` values for *each* category." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BZk4JK63AkfD" }, "outputs": [], "source": [ "bin_0_to_5 = []\n", "bin_5_to_10 = []\n", "bin_10_to_15 = []\n", "bin_15_to_20 = []\n", "bin_20_to_25 = []\n", "bin_25_to_30 = []\n", "\n", "for i in range(num_rows):\n", " \n", " if survey_responses[i][6] == \"<5 Years\":\n", " bin_0_to_5.append(survey_responses[i][5])\n", " \n", " elif survey_responses[i][6] == \"5-10 Years\":\n", " bin_5_to_10.append(survey_responses[i][5])\n", " \n", " elif survey_responses[i][6] == \"10-15 Years\":\n", " bin_10_to_15.append(survey_responses[i][5])\n", " \n", " elif survey_responses[i][6] == \"15-20 Years\":\n", " bin_15_to_20.append(survey_responses[i][5])\n", " \n", " elif survey_responses[i][6] == \"20-25 Years\":\n", " bin_20_to_25.append(survey_responses[i][5])\n", "\n", " else:\n", " bin_25_to_30.append(survey_responses[i][5])" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "d5NTqv1oFAlN", "outputId": "dfc3b431-0cfa-43fa-d3eb-04447759d66c" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "People with < 5 years of experience: 18753\n", "People with 5 - 10 years of experience: 3167\n", "People with 10 - 15 years of experience: 1118\n", "People with 15 - 20 years of experience: 1069\n", "People with 20 - 25 years of experience: 925\n", "People with 25+ years of experience: 941\n" ] } ], "source": [ "# Checking the distribution of experience in the dataset\n", "print(\"People with < 5 years of experience: \" + str(len(bin_0_to_5)))\n", "print(\"People with 5 - 10 years of experience: \" + str(len(bin_5_to_10)))\n", "print(\"People with 10 - 15 years of experience: \" + str(len(bin_10_to_15)))\n", "print(\"People with 15 - 20 years of experience: \" + str(len(bin_15_to_20)))\n", "print(\"People with 20 - 25 years of experience: \" + str(len(bin_20_to_25)))\n", "print(\"People with 25+ years of experience: \" + str(len(bin_25_to_30)))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "1B33wsPUFqb6", "outputId": "3a2e7c5d-7662-4799-8680-cdbb56c409c6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average salary of people with < 5 years of experience: 45047.87484669119\n", "Average salary of people with 5 - 10 years of experience: 59312.82033470161\n", "Average salary of people with 10 - 15 years of experience: 80226.75581395348\n", "Average salary of people with 15 - 20 years of experience: 75101.82694106642\n", "Average salary of people with 20 - 25 years of experience: 103159.80432432433\n", "Average salary of people with 25+ years of experience: 90444.98512221042\n" ] } ], "source": [ "# Checking the distribution of experience in the dataset\n", "print(\"Average salary of people with < 5 years of experience: \" + str(sum(bin_0_to_5) / len(bin_0_to_5)))\n", "print(\"Average salary of people with 5 - 10 years of experience: \" + str(sum(bin_5_to_10) / len(bin_5_to_10)))\n", "print(\"Average salary of people with 10 - 15 years of experience: \" + str(sum(bin_10_to_15) / len(bin_10_to_15)))\n", "print(\"Average salary of people with 15 - 20 years of experience: \" + str(sum(bin_15_to_20) / len(bin_15_to_20)))\n", "print(\"Average salary of people with 20 - 25 years of experience: \" + str(sum(bin_20_to_25) / len(bin_20_to_25)))\n", "print(\"Average salary of people with 25+ years of experience: \" + str(sum(bin_25_to_30) / len(bin_25_to_30)))" ] }, { "cell_type": "markdown", "metadata": { "id": "aGIl-aGEGcmw" }, "source": [ "# Summary of Findings\n", "\n", "Based on the number of people in each experience category, most of the people who took the survey have just started their career. Over 18,000 people have less than five years of experience coding. The next-highest category is the journeymen, with 5-10 years of experience. After that, there are several people in each of the long-term programmers who have more than 10 years of experience.\n", "\n", "Average salary seems to increase with experience, but this increase doesn't seem to be linear. There are times when the average salary dips when we move into a category of higher experience. There might be several reasons why this happens, but we don't have any data to help explain this. Overall, being a data professional provides a solid living, based on the reported data. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "m7ytNsu2H_ak" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }