{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Popular Data Science Questions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal in this project is to use [Data Science Stack Exchange](https://datascience.stackexchange.com) to determine what content a data science education company should create, based on interest by subject." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stack Exchange" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### What kind of questions are welcome on this site?\n", "\n", "On DSSE's help center's [section on questions](https://datascience.stackexchange.com/help/asking), we can read that we should:\n", "\n", "* Avoid subjective questions.\n", "* Ask practical questions about Data Science; there are adequate sites for theoretical questions.\n", "* Ask specific questions.\n", "* Make questions relevant to others.\n", "\n", "All of these characteristics, if followed, should help us reach our goal.\n", "\n", "In the help center we also learned that, in addition to the sites mentioned in the _Learn_ section, there are two other relevant sites:\n", "\n", "* [Open Data](https://opendata.stackexchange.com/help/on-topic) (Dataset requests)\n", "* [Computational Science](https://scicomp.stackexchange.com/help/on-topic) (Software packages and algorithms in applied mathematics)\n", "\n", "#### What, other than questions, does DSSE's [home](https://datascience.stackexchange.com) subdivide into?\n", "\n", "On the [home page](https://datascience.stackexchange.com/) we can see that we have four sections:\n", "\n", "* [Questions](https://datascience.stackexchange.com/questions): a list of all questions asked;\n", "* [Tags](https://datascience.stackexchange.com/tags): a list of tags (keywords or labels that categorize questions);\n", "\n", " ![tags_ds](https://dq-content.s3.amazonaws.com/469/tags_ds.png)\n", "* 
[Users](https://datascience.stackexchange.com/users): a list of users;\n", "* [Unanswered](https://datascience.stackexchange.com/unanswered): a list of unanswered questions.\n", "\n", "The tagging system used by Stack Exchange looks like just what we need to solve this problem, as it allows us to quantify how many questions are asked about each subject.\n", "\n", "Something else we can learn from exploring the help center is that Stack Exchange's sites are heavily moderated by the community; this gives us some confidence in using the tagging system to derive conclusions.\n", "\n", "#### What information is available in each post?\n", "\n", "Looking, just as an example, at [this](https://datascience.stackexchange.com/questions/19141/linear-model-to-generate-probability-of-each-possible-output?rq=1) question, some of the information we see is:\n", "\n", "* For both questions and answers:\n", " * The post's score;\n", " * The post's title;\n", " * The post's author;\n", " * The post's body;\n", "* For questions only:\n", " * How many users have it on their \"\n", " * The last time the question was active;\n", " * How many times the question was viewed;\n", " * Related questions;\n", " * The question's tags;\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Stack Exchange Data Explorer\n", "\n", "Perusing the table names, a few stand out as relevant for our goal:\n", "\n", "* Posts\n", "* PostTags\n", "* Tags\n", "* TagSynonyms\n", "\n", "Running a few exploratory queries leads us to focus our efforts on the `Posts` table. 
For example, the `Tags` table looked very promising because it tells us how many times each tag was used, but there's no way to tell from this alone whether interest in these tags is recent or a thing of the past.\n", "\n", "
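| Id | TagName | Count | ExcerptPostId | WikiPostId |
|---|---|---|---|---|
| 2 | machine-learning | 6919 | 4909 | 4908 |
| 46 | python | 3907 | 5523 | 5522 |
| 81 | neural-network | 2923 | 8885 | 8884 |
| 194 | deep-learning | 2786 | 8956 | 8955 |
| 77 | classification | 1899 | 4911 | 4910 |
| 324 | keras | 1736 | 9251 | 9250 |
| 128 | scikit-learn | 1303 | 5896 | 5895 |
| 321 | tensorflow | 1224 | 9183 | 9182 |
| 47 | nlp | 1162 | 147 | 146 |
| 24 | r | 1114 | 49 | 48 |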
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Getting the Data\n", "\n", "To get the relevant data we run the following query.\n", "\n", "```\n", "SELECT Id, CreationDate,\n", " Score, ViewCount, Tags,\n", " AnswerCount, FavoriteCount\n", " FROM posts\n", " WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;\n", "```\n", "\n", "Here's what the first few rows look like:\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
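| Id | PostTypeId | CreationDate | Score | ViewCount | Tags | AnswerCount | FavoriteCount |
|---|---|---|---|---|---|---|---|
| 44419 | 1 | 2019-01-23 09:21:13 | 1 | 21 | &lt;machine-learning&gt;&lt;data-mining&gt; | 0 | |
| 44420 | 1 | 2019-01-23 09:34:01 | 0 | 25 | &lt;machine-learning&gt;&lt;regression&gt;&lt;linear-regression&gt;&lt;regularization&gt; | 0 | |
| 44423 | 1 | 2019-01-23 09:58:41 | 2 | 1651 | &lt;python&gt;&lt;time-series&gt;&lt;forecast&gt;&lt;forecasting&gt; | 0 | |
| 44427 | 1 | 2019-01-23 10:57:09 | 0 | 55 | &lt;machine-learning&gt;&lt;scikit-learn&gt;&lt;pca&gt; | 1 | |
| 44428 | 1 | 2019-01-23 11:02:15 | 0 | 19 | &lt;dataset&gt;&lt;bigdata&gt;&lt;data&gt;&lt;speech-to-text&gt; | 0 | |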
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring the Data\n", "\n", "We can read in the data while immediately making sure `CreationDate` will be stored as a datetime object:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# We import everything that we'll use\n", "\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "%matplotlib inline" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "questions = pd.read_csv(\"2019_questions.csv\", parse_dates=[\"CreationDate\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running [`questions.info()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) should give us a lot of useful information." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "RangeIndex: 8839 entries, 0 to 8838\n", "Data columns (total 7 columns):\n", "Id 8839 non-null int64\n", "CreationDate 8839 non-null datetime64[ns]\n", "Score 8839 non-null int64\n", "ViewCount 8839 non-null int64\n", "Tags 8839 non-null object\n", "AnswerCount 8839 non-null int64\n", "FavoriteCount 1407 non-null float64\n", "dtypes: datetime64[ns](1), float64(1), int64(4), object(1)\n", "memory usage: 483.5+ KB\n" ] } ], "source": [ "questions.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that only `FavoriteCount` has missing values. 
A missing value in this column probably means that the question is not present in any user's favorite list, so we can replace the missing values with zero.\n", "\n", "The types seem adequate for every column. However, after we fill in the missing values in `FavoriteCount`, there is no reason to store the values as floats.\n", "\n", "Since the `object` dtype is a catch-all type, let's see what types the objects in `questions[\"Tags\"]` are." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([<class 'str'>], dtype=object)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "questions[\"Tags\"].apply(lambda value: type(value)).unique()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that every value in this column is a string. On Stack Exchange, each question can have a maximum of five tags ([source](https://meta.stackexchange.com/a/18879)), so one way to deal with this column is to create five columns in `questions` called `Tag1`, `Tag2`, `Tag3`, `Tag4`, and `Tag5` and populate the columns with the tags in each row.\n", "\n", "However, since this doesn't help in relating tags from one question to another, we'll just keep them as a list." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cleaning the Data\n", "\n", "We'll begin by fixing `FavoriteCount`."
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Id int64\n", "CreationDate datetime64[ns]\n", "Score int64\n", "ViewCount int64\n", "Tags object\n", "AnswerCount int64\n", "FavoriteCount int64\n", "dtype: object" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "questions.fillna(value={\"FavoriteCount\": 0}, inplace=True)\n", "questions[\"FavoriteCount\"] = questions[\"FavoriteCount\"].astype(int)\n", "questions.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now modify `Tags` to make it easier to work with." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id CreationDate Score ViewCount \\\n", "511 56382 2019-07-25 15:00:20 0 34 \n", "2178 58312 2019-08-28 09:44:00 1 41 \n", "2536 58151 2019-08-25 01:01:29 0 37 \n", "\n", " Tags AnswerCount \\\n", "511 [machine-learning, python, pandas, natural-lan... 0 \n", "2178 [neural-network, pytorch] 0 \n", "2536 [dataset, audio-recognition] 2 \n", "\n", " FavoriteCount \n", "511 0 \n", "2178 1 \n", "2536 0 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# regex=True is required for the pattern to be treated as a regular expression in recent pandas versions\n", "questions[\"Tags\"] = questions[\"Tags\"].str.replace(\"^<|>$\", \"\", regex=True).str.split(\"><\")\n", "questions.sample(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Most Used and Most Viewed\n", "\n", "We'll begin by counting how many times each tag was used." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "tag_count = dict()\n", "\n", "for tags in questions[\"Tags\"]:\n", "    for tag in tags:\n", "        if tag in tag_count:\n", "            tag_count[tag] += 1\n", "        else:\n", "            tag_count[tag] = 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For improved aesthetics, let's transform `tag_count` into a dataframe." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Count\n", "machine-learning 2693\n", "data-mining 217\n", "regression 347\n", "linear-regression 175\n", "regularization 50\n", "python 1814\n", "time-series 466\n", "forecast 34\n", "forecasting 85\n", "scikit-learn 540" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tag_count = pd.DataFrame.from_dict(tag_count, orient=\"index\")\n", "tag_count.rename(columns={0: \"Count\"}, inplace=True)\n", "tag_count.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now sort this dataframe by `Count` and visualize the top 20 results." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Count\n", "machine-learning-model 224\n", "statistics 234\n", "clustering 257\n", "predictive-modeling 265\n", "r 268\n", "dataset 340\n", "regression 347\n", "pandas 354\n", "lstm 402\n", "time-series 466\n", "cnn 489\n", "nlp 493\n", "scikit-learn 540\n", "tensorflow 584\n", "classification 685\n", "keras 935\n", "neural-network 1055\n", "deep-learning 1220\n", "python 1814\n", "machine-learning 2693" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "most_used = tag_count.sort_values(by=\"Count\").tail(20)\n", "most_used" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The threshold of `20` is somewhat arbitrary, and we could experiment with others. However, the popularity of the tags declines rapidly, so looking at these tags should be enough to help us with our goal. Let's visualize these data." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "most_used.plot(kind=\"barh\", figsize=(16,8))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some tags are very, very broad and are unlikely to be useful; e.g.: `python`, `dataset`, `r`. Before we investigate the tags a little deeper, let's repeat the same process for views.\n", "\n", "We'll use Python's builtin [`enumerate()`](https://docs.python.org/3/library/functions.html#enumerate) function. Its utility is best understood by seeing it in action." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 I\n", "1 t\n", "2 e\n", "3 r\n", "4 a\n", "5 t\n", "6 e\n", "7 \n", "8 t\n", "9 h\n", "10 i\n", "11 s\n", "12 !\n" ] } ], "source": [ "some_iterable = \"Iterate this!\"\n", "\n", "for i,c in enumerate(some_iterable):\n", "    print(i,c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to the elements of `some_iterable`, `enumerate` gives us the index of each of them." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "tag_view_count = dict()\n", "\n", "for idx, tags in enumerate(questions[\"Tags\"]):\n", "    for tag in tags:\n", "        if tag in tag_view_count:\n", "            tag_view_count[tag] += questions[\"ViewCount\"].iloc[idx]\n", "        else:\n", "            # Start from this question's views, not from 1, so the first question counts too\n", "            tag_view_count[tag] = questions[\"ViewCount\"].iloc[idx]\n", "\n", "tag_view_count = pd.DataFrame.from_dict(tag_view_count, orient=\"index\")\n", "tag_view_count.rename(columns={0: \"ViewCount\"}, inplace=True)\n", "\n", "most_viewed = tag_view_count.sort_values(by=\"ViewCount\").tail(20)\n", "\n", "most_viewed.plot(kind=\"barh\", figsize=(16,8))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see them side by side." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([],\n", " dtype=object)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "fig, axes = plt.subplots(nrows=1, ncols=2)\n", "fig.set_size_inches((24, 10))\n", "most_used.plot(kind=\"barh\", ax=axes[0], subplots=True)\n", "most_viewed.plot(kind=\"barh\", ax=axes[1], subplots=True)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "in_used = pd.merge(most_used, most_viewed, how=\"left\", left_index=True, right_index=True)\n", "in_viewed = pd.merge(most_used, most_viewed, how=\"right\", left_index=True, right_index=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Relations Between Tags" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way of trying to gauge how pairs of tags are related to each other is to count how many times each pair appears together. Let's do this.\n", "\n", "We'll begin by creating a list of all tags." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "all_tags = list(tag_count.index)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll now create a dataframe where each row will represent a tag, and each column as well. Something like this:\n", "\n", "
| | tag1 | tag2 | tag3 |
|---|---|---|---|
| tag1 | | | |
| tag2 | | | |
| tag3 | | | |
" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " machine-learning data-mining regression linear-regression\n", "machine-learning NaN NaN NaN NaN\n", "data-mining NaN NaN NaN NaN\n", "regression NaN NaN NaN NaN\n", "linear-regression NaN NaN NaN NaN" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "associations = pd.DataFrame(index=all_tags, columns=all_tags)\n", "associations.iloc[0:4,0:4]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now fill this dataframe with zeroes and then, for each list of tags in `questions[\"Tags\"]`, we will increment the cells for every pair of tags in that list by one. The end result will be a dataframe that tells us, for each pair of tags, how many times they were used together." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "associations.fillna(0, inplace=True)\n", "\n", "for tags in questions[\"Tags\"]:\n", "    associations.loc[tags, tags] += 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataframe is quite large. Let's focus our attention on the most used tags. We'll add some colors to make it easier to talk about the dataframe. (At the time of this writing, GitHub's renderer does not display the colors, so we suggest you use this solution notebook together with [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/).)"
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
machine-learning-model statistics clustering predictive-modeling r dataset regression pandas lstm time-series cnn nlp scikit-learn tensorflow classification keras neural-network deep-learning python machine-learning
machine-learning-model 224 3 3 21 7 12 8 4 5 7 4 4 18 9 21 17 10 19 37 139
statistics 3 234 3 16 16 17 16 3 1 22 1 3 6 0 19 3 11 12 35 89
clustering 3 3 257 0 16 5 2 5 3 20 0 9 24 0 12 0 8 2 45 61
predictive-modeling 21 16 0 265 13 7 28 4 13 31 6 1 12 6 27 11 13 32 35 123
r 7 16 16 13 268 6 10 2 3 22 2 4 1 1 10 10 9 5 24 63
dataset 12 17 5 7 6 340 6 14 7 6 11 11 9 9 28 13 20 32 53 99
regression 8 16 2 28 10 6 347 6 11 24 6 2 37 9 34 31 42 21 59 119
pandas 4 3 5 4 2 14 6 354 7 19 1 3 37 3 3 3 1 1 244 62
lstm 5 1 3 13 3 7 11 7 402 87 24 19 2 43 20 133 69 103 61 71
time-series 7 22 20 31 22 6 24 19 87 466 8 0 12 9 25 51 33 44 105 131
cnn 4 1 0 6 2 11 6 1 24 8 489 7 0 57 20 116 118 160 62 124
nlp 4 3 9 1 4 11 2 3 19 0 7 493 12 11 35 23 24 72 71 113
scikit-learn 18 6 24 12 1 9 37 37 2 12 0 12 540 15 47 34 24 16 235 188
tensorflow 9 0 0 6 1 9 9 3 43 9 57 11 15 584 20 256 108 136 167 106
classification 21 19 12 27 10 28 34 3 20 25 20 35 47 20 685 58 65 59 98 259
keras 17 3 0 11 10 13 31 3 133 51 116 23 34 256 58 935 235 247 280 195
neural-network 10 11 8 13 9 20 42 1 69 33 118 24 24 108 65 235 1055 305 137 366
deep-learning 19 12 2 32 5 32 21 1 103 44 160 72 16 136 59 247 305 1220 160 429
python 37 35 45 35 24 53 59 244 61 105 62 71 235 167 98 280 137 160 1814 499
machine-learning 139 89 61 123 63 99 119 62 71 131 124 113 188 106 259 195 366 429 499 2693
" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "relations_most_used = associations.loc[most_used.index, most_used.index]\n", "\n", "def style_cells(x):\n", "    helper_df = pd.DataFrame('', index=x.index, columns=x.columns)\n", "    helper_df.loc[\"time-series\", \"r\"] = \"background-color: yellow\"\n", "    helper_df.loc[\"r\", \"time-series\"] = \"background-color: yellow\"\n", "    for k in range(helper_df.shape[0]):\n", "        helper_df.iloc[k,k] = \"color: blue\"\n", "\n", "    return helper_df\n", "\n", "relations_most_used.style.apply(style_cells, axis=None)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The cells highlighted in yellow tell us that `time-series` was used together with `r` 22 times. The values in blue tell us how many times each of the tags was used. We saw earlier that `machine-learning` was used 2693 times, and this dataframe confirms it.\n", "\n", "It's hard for a human to understand what is going on in this dataframe. Let's create a heatmap. But before we do it, let's get rid of the values in blue; otherwise the colors will be too skewed." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# pd.np is no longer available in recent pandas versions, so we import NumPy directly\n", "import numpy as np\n", "\n", "for i in range(relations_most_used.shape[0]):\n", "    relations_most_used.iloc[i,i] = np.nan" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(12,8))\n", "sns.heatmap(relations_most_used, cmap=\"Greens\", annot=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most used tags also seem to have the strongest relationships, as shown by the dark concentration in the bottom right corner. However, this could simply be because each of these tags is used a lot, so they end up being used together a lot, possibly without any strong relation between them.\n", "\n", "A more intuitive manifestation of this phenomenon is the following. A lot of people buy bread, a lot of people buy toilet paper, so they end up being purchased together a lot, but purchasing one of them doesn't increase the chances of purchasing the other." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another shortcoming of this attempt is that it only looks at relations between pairs of tags and not between larger groups of tags. For example, it could be the case that when used together, `dataset` and `scikit-learn` have a \"strong\" relation to `pandas`, but each by itself doesn't.\n", "\n", "So how do we attack both these problems? There is a powerful data mining technique that allows us to handle this: [association rules](https://en.wikipedia.org/wiki/Association_rule_learning). Association rules allow us to analytically spot relations like \"people who purchase milk, also purchase eggs\". 
Moreover, we can also measure how strong these relations are on several fronts: how common the relation is, how strong it is, and how independent the components of the relationship are (toilet paper and bread are probably more independent than eggs and milk; you'll learn more about [statistical independence](https://en.wikipedia.org/wiki/Independence_(probability_theory)) in the next step).\n", "\n", "\n", "We won't get into the details of it, as the technique is out of scope for this course, but it is a path worth investigating!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Enter Domain Knowledge\n", "\n", "[Keras](https://keras.io/), [scikit-learn](https://scikit-learn.org/), and [TensorFlow](https://www.tensorflow.org/) are all Python libraries that allow their users to work with neural networks; deep learning is the branch of machine learning built on neural networks with many layers.\n", "\n", "Most of the top tags are intimately related to one central machine learning theme: deep learning. If we want to be very specific, we can suggest the creation of Python content that uses deep learning for classification problems (and other variations of this suggestion).\n", "\n", "At a glance, someone with sufficient domain knowledge can tell that the most popular topic at the moment, as shown by our analysis, is deep learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Just a Fad?\n", "\n", "Let's read the file into a dataframe called `all_q`. We'll parse the dates at read-time." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "all_q = pd.read_csv(\"all_questions.csv\", parse_dates=[\"CreationDate\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use the same technique as before to clean the tags column."
] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# regex=True is required for the pattern to be treated as a regular expression in recent pandas versions\n", "all_q[\"Tags\"] = all_q[\"Tags\"].str.replace(\"^<|>$\", \"\", regex=True).str.split(\"><\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before deciding which questions should be classified as deep learning questions, we should decide which tags are deep learning tags.\n", "\n", "The definition we'll use is: a deep learning tag is a tag that belongs to the list `[\"lstm\", \"cnn\", \"scikit-learn\", \"tensorflow\", \"keras\", \"neural-network\", \"deep-learning\"]`.\n", "\n", "This list was obtained by looking at all the tags in `most_used` and seeing which ones had any relation to deep learning. You can use Google and read the tag descriptions to reach similar results.\n", "\n", "We'll now create a function that assigns `1` to deep learning questions and `0` otherwise, and then apply it." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "def class_deep_learning(tags):\n", "    for tag in tags:\n", "        if tag in [\"lstm\", \"cnn\", \"scikit-learn\", \"tensorflow\",\n", "                   \"keras\", \"neural-network\", \"deep-learning\"]:\n", "            return 1\n", "    return 0" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "all_q[\"DeepLearning\"] = all_q[\"Tags\"].apply(class_deep_learning)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Id CreationDate \\\n", "15231 44675 2019-01-28 06:20:18 \n", "440 55639 2019-07-14 11:45:43 \n", "11720 51523 2019-05-07 04:57:35 \n", "6262 27232 2018-01-30 09:53:38 \n", "19292 64930 2019-12-16 14:38:17 \n", "\n", " Tags DeepLearning \n", "15231 [model-selection] 0 \n", "440 [machine-learning, dataset, machine-learning-m... 0 \n", "11720 [neural-network, gradient-descent, batch-norma... 1 \n", "6262 [python, convergence] 0 \n", "19292 [neural-network, deep-learning, keras, convolu... 1 " ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_q.sample(5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks good!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data-science-techonology landscape isn't something as dynamic to merit daily, weekly, or even monthly tracking. Let's track it quarterly.\n", "\n", "Since we don't have all the data for the first quarter of 2020, we'll get rid of those dates:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "all_q = all_q[all_q[\"CreationDate\"].dt.year < 2020]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's create a column that identifies the quarter in which a question was asked." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "def fetch_quarter(datetime):\n", " year = str(datetime.year)[-2:]\n", " quarter = str(((datetime.month-1) // 3) + 1)\n", " return \"{y}Q{q}\".format(y=year, q=quarter)\n", "\n", "all_q[\"Quarter\"] = all_q[\"CreationDate\"].apply(fetch_quarter)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Id</th>\n", "      <th>CreationDate</th>\n", "      <th>Tags</th>\n", "      <th>DeepLearning</th>\n", "      <th>Quarter</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>0</th>\n", "      <td>45416</td>\n", "      <td>2019-02-12 00:36:29</td>\n", "      <td>[python, keras, tensorflow, cnn, probability]</td>\n", "      <td>1</td>\n", "      <td>19Q1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>1</th>\n", "      <td>45418</td>\n", "      <td>2019-02-12 00:50:39</td>\n", "      <td>[neural-network]</td>\n", "      <td>1</td>\n", "      <td>19Q1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>2</th>\n", "      <td>45422</td>\n", "      <td>2019-02-12 04:40:51</td>\n", "      <td>[python, ibm-watson, chatbot]</td>\n", "      <td>0</td>\n", "      <td>19Q1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>3</th>\n", "      <td>45426</td>\n", "      <td>2019-02-12 04:51:49</td>\n", "      <td>[keras]</td>\n", "      <td>1</td>\n", "      <td>19Q1</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>4</th>\n", "      <td>45427</td>\n", "      <td>2019-02-12 05:08:24</td>\n", "      <td>[r, predictive-modeling, machine-learning-mode...</td>\n", "      <td>0</td>\n", "      <td>19Q1</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Id CreationDate \\\n", "0 45416 2019-02-12 00:36:29 \n", "1 45418 2019-02-12 00:50:39 \n", "2 45422 2019-02-12 04:40:51 \n", "3 45426 2019-02-12 04:51:49 \n", "4 45427 2019-02-12 05:08:24 \n", "\n", " Tags DeepLearning Quarter \n", "0 [python, keras, tensorflow, cnn, probability] 1 19Q1 \n", "1 [neural-network] 1 19Q1 \n", "2 [python, ibm-watson, chatbot] 0 19Q1 \n", "3 [keras] 1 19Q1 \n", "4 [r, predictive-modeling, machine-learning-mode... 0 19Q1 " ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "all_q.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the final stretch, we'll group by quarter and:\n", "\n", "* Count the number of deep learning questions.\n", "* Count the total number of questions.\n", "* Compute the ratio between the two numbers above." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Quarter</th>\n", "      <th>DeepLearningQuestions</th>\n", "      <th>TotalQuestions</th>\n", "      <th>DeepLearningRate</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n",
"    <tr>\n", "      <th>17</th>\n", "      <td>18Q3</td>\n", "      <td>685</td>\n", "      <td>1512</td>\n", "      <td>0.453042</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>7</th>\n", "      <td>16Q1</td>\n", "      <td>110</td>\n", "      <td>516</td>\n", "      <td>0.213178</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>6</th>\n", "      <td>15Q4</td>\n", "      <td>66</td>\n", "      <td>382</td>\n", "      <td>0.172775</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>22</th>\n", "      <td>19Q4</td>\n", "      <td>809</td>\n", "      <td>2036</td>\n", "      <td>0.397348</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>9</th>\n", "      <td>16Q3</td>\n", "      <td>161</td>\n", "      <td>585</td>\n", "      <td>0.275214</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Quarter DeepLearningQuestions TotalQuestions DeepLearningRate\n", "17 18Q3 685 1512 0.453042\n", "7 16Q1 110 516 0.213178\n", "6 15Q4 66 382 0.172775\n", "22 19Q4 809 2036 0.397348\n", "9 16Q3 161 585 0.275214" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "quarterly = all_q.groupby('Quarter').agg({\"DeepLearning\": ['sum', 'size']})\n", "quarterly.columns = ['DeepLearningQuestions', 'TotalQuestions']\n", "quarterly[\"DeepLearningRate\"] = (quarterly[\"DeepLearningQuestions\"]\n", "                                 / quarterly[\"TotalQuestions\"])\n", "# Reset the index so that Quarter becomes a regular column for plotting.\n", "quarterly.reset_index(inplace=True)\n", "quarterly.sample(5)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "# Line plot of the deep learning share per quarter (left axis).\n", "ax1 = quarterly.plot(x=\"Quarter\", y=\"DeepLearningRate\",\n", "                     kind=\"line\", linestyle=\"-\", marker=\"o\", color=\"orange\",\n", "                     figsize=(24,12)\n", "                     )\n", "\n", "# Bar plot of the total question count per quarter (secondary axis).\n", "ax2 = quarterly.plot(x=\"Quarter\", y=\"TotalQuestions\",\n", "                     kind=\"bar\", ax=ax1, secondary_y=True, alpha=0.7, rot=45)\n", "\n", "# Label each bar with its question count.\n", "for idx, t in enumerate(quarterly[\"TotalQuestions\"]):\n", "    ax2.text(idx, t, str(t), ha=\"center\", va=\"bottom\")\n", "xlims = ax1.get_xlim()\n", "\n", "ax1.get_legend().remove()\n", "\n", "# Merge the legends of both axes into a single legend.\n", "handles1, labels1 = ax1.get_legend_handles_labels()\n", "handles2, labels2 = ax2.get_legend_handles_labels()\n", "ax1.legend(handles=handles1 + handles2,\n", "           labels=labels1 + labels2,\n", "           loc=\"upper left\", prop={\"size\": 12})\n", "\n", "\n", "for ax in (ax1, ax2):\n", "    for where in (\"top\", \"right\"):\n", "        ax.spines[where].set_visible(False)\n", "    ax.tick_params(right=False, labelright=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Deep learning questions have been a high-growth trend since the start of DSSE, and that growth now looks like it is plateauing. There is no evidence to suggest that interest in deep learning is decreasing, so we maintain our earlier proposal to create deep learning content." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.5" } }, "nbformat": 4, "nbformat_minor": 4 }