{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Popular Data Science Questions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our goal in this project is to use [Data Science Stack Exchange](https://datascience.stackexchange.com) to determine what content should a data science education company create, based on interest by subject."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stack Exchange"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### What kind of questions are welcome on this site?\n",
"\n",
"On DSSE's help center's [section on questions](https://datascience.stackexchange.com/help/asking) , we can read that we should:\n",
"\n",
"* Avoid subjective questions.\n",
"* Ask practical questions about Data Science — there are adequate sites for theoretical questions.\n",
"* Ask specific questions.\n",
"* Make questions relevant to others.\n",
"\n",
"All of these characteristics, if employed, should be helpful attributes to our goal.\n",
"\n",
"In the help center we also learned that in addition to the sites mentioned in the _Learn_ section, there are other two sites that are relevant:\n",
"\n",
"* [Open Data](https://opendata.stackexchange.com/help/on-topic) (Dataset requests)\n",
"* [Computational Science](https://scicomp.stackexchange.com/help/on-topic) (Software packages and algorithms in applied mathematics)\n",
"\n",
"#### What, other than questions, does DSSE's [home](https://datascience.stackexchange.com) subdivide into?\n",
"\n",
"On the [home page](https://datascience.stackexchange.com/) we can see that we have four sections:\n",
"\n",
"* [Questions](https://datascience.stackexchange.com/questions) — a list of all questions asked;\n",
"* [Tags](https://datascience.stackexchange.com/tags) — a list of tags (keywords or labels that categorize questions);\n",
"\n",
" ![tags_ds](https://dq-content.s3.amazonaws.com/469/tags_ds.png)\n",
"* [Users](https://datascience.stackexchange.com/users) — a list of users;\n",
"* [Unanswered](https://datascience.stackexchange.com/unanswered) — a list of unanswered questions;\n",
"\n",
"The tagging system used by Stack Exchange looks just like what we need to solve this problem as it allow us to quantify how many questions are asked about each subject.\n",
"\n",
"Something else we can learn from exploring the help center, is that Stack Exchange's sites are heavily moderated by the community; this gives us some confidence in using the tagging system to derive conclusions.\n",
"\n",
"#### What information is available in each post?\n",
"\n",
"Looking, just as an example, at [this](https://datascience.stackexchange.com/questions/19141/linear-model-to-generate-probability-of-each-possible-output?rq=1) question, some of the information we see is:\n",
"\n",
"* For both questions and answers:\n",
" * The posts's score;\n",
" * The posts's title;\n",
" * The posts's author;\n",
" * The posts's body;\n",
"* For questions only:\n",
" * How many users have it on their \"\n",
" * The last time the question as active;\n",
" * How many times the question was viewed;\n",
" * Related questions;\n",
" * The question's tags;\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Stack Exchange Data Explorer\n",
"\n",
"Perusing the table names, a few stand out as relevant for our goal:\n",
"\n",
"* Posts\n",
"* PostTags\n",
"* Tags\n",
"* TagSynonyms\n",
"\n",
"Running a few exploratory queries, leads us to focus our efforts on `Posts` table. For examples, the `Tags` table looked very promising as it tells us how many times each tag was used, but there's no way to tell just from this if the interest in these tags is recent or a thing from the past.\n",
"\n",
"\n",
"
\n",
"
\n",
"
Id
\n",
"
TagName
\n",
"
Count
\n",
"
ExcerptPostId
\n",
"
WikiPostId
\n",
"
\n",
"
\n",
"
2
\n",
"
machine-learning
\n",
"
6919
\n",
"
4909
\n",
"
4908
\n",
"
\n",
"
\n",
"
46
\n",
"
python
\n",
"
3907
\n",
"
5523
\n",
"
5522
\n",
"
\n",
"
\n",
"
81
\n",
"
neural-network
\n",
"
2923
\n",
"
8885
\n",
"
8884
\n",
"
\n",
"
\n",
"
194
\n",
"
deep-learning
\n",
"
2786
\n",
"
8956
\n",
"
8955
\n",
"
\n",
"
\n",
"
77
\n",
"
classification
\n",
"
1899
\n",
"
4911
\n",
"
4910
\n",
"
\n",
"
\n",
"
324
\n",
"
keras
\n",
"
1736
\n",
"
9251
\n",
"
9250
\n",
"
\n",
"
\n",
"
128
\n",
"
scikit-learn
\n",
"
1303
\n",
"
5896
\n",
"
5895
\n",
"
\n",
"
\n",
"
321
\n",
"
tensorflow
\n",
"
1224
\n",
"
9183
\n",
"
9182
\n",
"
\n",
"
\n",
"
47
\n",
"
nlp
\n",
"
1162
\n",
"
147
\n",
"
146
\n",
"
\n",
"
\n",
"
24
\n",
"
r
\n",
"
1114
\n",
"
49
\n",
"
48
\n",
"
\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting the Data\n",
"\n",
"To get the relevant data we run the following query.\n",
"\n",
"```\n",
"SELECT Id, CreationDate,\n",
" Score, ViewCount, Tags,\n",
" AnswerCount, FavoriteCount\n",
" FROM posts\n",
" WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;\n",
"```\n",
"\n",
"Here's what the first few rows look like:\n",
"\n",
"
"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring the Data\n",
"\n",
"We can read in the data while immediately making sure `CreationDate` will be stored as a datetime object:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# We import everything that we'll use\n",
"\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"questions = pd.read_csv(\"2019_questions.csv\", parse_dates=[\"CreationDate\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Running [`questions.info()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) should gives a lot of useful information."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"RangeIndex: 8839 entries, 0 to 8838\n",
"Data columns (total 7 columns):\n",
"Id 8839 non-null int64\n",
"CreationDate 8839 non-null datetime64[ns]\n",
"Score 8839 non-null int64\n",
"ViewCount 8839 non-null int64\n",
"Tags 8839 non-null object\n",
"AnswerCount 8839 non-null int64\n",
"FavoriteCount 1407 non-null float64\n",
"dtypes: datetime64[ns](1), float64(1), int64(4), object(1)\n",
"memory usage: 483.5+ KB\n"
]
}
],
"source": [
"questions.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that only `FavoriteCount` has missing values. A missing value on this column probably means that the question was is not present in any users' favorite list, so we can replace the missing values with zero.\n",
"\n",
"The types seem adequate for every column, however, after we fill in the missing values on `FavoriteCount`, there is no reason to store the values as floats.\n",
"\n",
"Since the `object` dtype is a catch-all type, let's see what types the objects in `questions[\"Tags\"]` are."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([], dtype=object)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"questions[\"Tags\"].apply(lambda value: type(value)).unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We see that every value in this column is a string. On Stack Exchange, each question can only have a maximum of five tags ([source](https://meta.stackexchange.com/a/18879)), so one way to deal with this column is to create five columns in `questions` called `Tag1`, `Tag2`, `Tag3`, `Tag4`, and `Tag5` and populate the columns with the tags in each row.\n",
"\n",
"However, since doesn't help is relating tags from one question to another, we'll just keep them as a list."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Cleaning the Data\n",
"\n",
"We'll begin by fixing `FavoriteCount`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Id int64\n",
"CreationDate datetime64[ns]\n",
"Score int64\n",
"ViewCount int64\n",
"Tags object\n",
"AnswerCount int64\n",
"FavoriteCount int64\n",
"dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"questions.fillna(value={\"FavoriteCount\": 0}, inplace=True)\n",
"questions[\"FavoriteCount\"] = questions[\"FavoriteCount\"].astype(int)\n",
"questions.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's now modify `Tags` to make it easier to work with."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Id
\n",
"
CreationDate
\n",
"
Score
\n",
"
ViewCount
\n",
"
Tags
\n",
"
AnswerCount
\n",
"
FavoriteCount
\n",
"
\n",
" \n",
" \n",
"
\n",
"
511
\n",
"
56382
\n",
"
2019-07-25 15:00:20
\n",
"
0
\n",
"
34
\n",
"
[machine-learning, python, pandas, natural-lan...
\n",
"
0
\n",
"
0
\n",
"
\n",
"
\n",
"
2178
\n",
"
58312
\n",
"
2019-08-28 09:44:00
\n",
"
1
\n",
"
41
\n",
"
[neural-network, pytorch]
\n",
"
0
\n",
"
1
\n",
"
\n",
"
\n",
"
2536
\n",
"
58151
\n",
"
2019-08-25 01:01:29
\n",
"
0
\n",
"
37
\n",
"
[dataset, audio-recognition]
\n",
"
2
\n",
"
0
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id CreationDate Score ViewCount \\\n",
"511 56382 2019-07-25 15:00:20 0 34 \n",
"2178 58312 2019-08-28 09:44:00 1 41 \n",
"2536 58151 2019-08-25 01:01:29 0 37 \n",
"\n",
" Tags AnswerCount \\\n",
"511 [machine-learning, python, pandas, natural-lan... 0 \n",
"2178 [neural-network, pytorch] 0 \n",
"2536 [dataset, audio-recognition] 2 \n",
"\n",
" FavoriteCount \n",
"511 0 \n",
"2178 1 \n",
"2536 0 "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"questions[\"Tags\"] = questions[\"Tags\"].str.replace(\"^<|>$\", \"\").str.split(\"><\")\n",
"questions.sample(3)"
]
},
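{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a cross-check on the cleaning step above, here is a minimal sketch of the same parsing in plain Python, without pandas. It assumes the raw tag strings look like `<machine-learning><python>`, as in the Data Explorer export; `parse_tags` is a hypothetical helper name, not part of the original analysis.\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def parse_tags(raw):\n",
"    # Grab every run of characters enclosed in angle brackets\n",
"    return re.findall(r'<([^<>]+)>', raw)\n",
"\n",
"example = '<machine-learning><python><pandas>'\n",
"print(parse_tags(example))\n",
"```\n",
"\n",
"Running it on a few raw strings from the CSV is a cheap way to confirm that the regex-and-split approach produces the lists we expect."
]
},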
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Most Used and Most Viewed\n",
"\n",
"We'll begin by counting how many times each tag was used"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"tag_count = dict()\n",
"\n",
"for tags in questions[\"Tags\"]:\n",
" for tag in tags:\n",
" if tag in tag_count:\n",
" tag_count[tag] += 1\n",
" else:\n",
" tag_count[tag] = 1"
]
},
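{
"cell_type": "markdown",
"metadata": {},
"source": [
"The counting loop above can also be written with `collections.Counter` from the standard library. The sketch below runs on a small made-up list of tag lists (`sample_tags` and `toy_tag_count` are illustrative names, and the data is not taken from the dataset).\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"# Toy input standing in for questions['Tags']\n",
"sample_tags = [['python', 'pandas'], ['python', 'keras'], ['keras']]\n",
"\n",
"# Flatten the lists and count each tag in one pass\n",
"toy_tag_count = Counter(tag for tags in sample_tags for tag in tags)\n",
"print(toy_tag_count.most_common(2))\n",
"```\n",
"\n",
"Either approach gives the same counts; the explicit loop just makes the bookkeeping visible."
]
},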
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For improved aesthetics, let's transform `tag_count` in a dataframe."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
"
],
"text/plain": [
" Count\n",
"machine-learning-model 224\n",
"statistics 234\n",
"clustering 257\n",
"predictive-modeling 265\n",
"r 268\n",
"dataset 340\n",
"regression 347\n",
"pandas 354\n",
"lstm 402\n",
"time-series 466\n",
"cnn 489\n",
"nlp 493\n",
"scikit-learn 540\n",
"tensorflow 584\n",
"classification 685\n",
"keras 935\n",
"neural-network 1055\n",
"deep-learning 1220\n",
"python 1814\n",
"machine-learning 2693"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"most_used = tag_count.sort_values(by=\"Count\").tail(20)\n",
"most_used"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The threshold of `20` is somewhat arbitrary and we can experiment with others, however, popularity of the tags rapidly declines, so looking at these tags should be enough to help us with our goal. Let's visualize these data."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"most_used.plot(kind=\"barh\", figsize=(16,8))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some tags are very, very broad and are unlikely to be useful; e.g.: `python`, `dataset`, `r`. Before we investigate the tags a little deeper, let's repeat the same process for views.\n",
"\n",
"We'll use Python's builtin [`enumerate()`](https://docs.python.org/3/library/functions.html#enumerate) function. Its utility is well understood by seeing it action."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 I\n",
"1 t\n",
"2 e\n",
"3 r\n",
"4 a\n",
"5 t\n",
"6 e\n",
"7 \n",
"8 t\n",
"9 h\n",
"10 i\n",
"11 s\n",
"12 !\n"
]
}
],
"source": [
"some_iterable = \"Iterate this!\"\n",
"\n",
"for i,c in enumerate(some_iterable):\n",
" print(i,c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In addition to the elements of `some_iterable`, `enumerate` gives us the index of each of them."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"tag_view_count = dict()\n",
"\n",
"for idx, tags in enumerate(questions[\"Tags\"]):\n",
" for tag in tags:\n",
" if tag in tag_view_count:\n",
" tag_view_count[tag] += questions[\"ViewCount\"].iloc[idx]\n",
" else:\n",
" tag_view_count[tag] = 1\n",
" \n",
"tag_view_count = pd.DataFrame.from_dict(tag_view_count, orient=\"index\")\n",
"tag_view_count.rename(columns={0: \"ViewCount\"}, inplace=True)\n",
"\n",
"most_viewed = tag_view_count.sort_values(by=\"ViewCount\").tail(20)\n",
"\n",
"most_viewed.plot(kind=\"barh\", figsize=(16,8))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see them side by side."
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([],\n",
" dtype=object)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"fig, axes = plt.subplots(nrows=1, ncols=2)\n",
"fig.set_size_inches((24, 10))\n",
"most_used.plot(kind=\"barh\", ax=axes[0], subplots=True)\n",
"most_viewed.plot(kind=\"barh\", ax=axes[1], subplots=True)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"in_used = pd.merge(most_used, most_viewed, how=\"left\", left_index=True, right_index=True)\n",
"in_viewed = pd.merge(most_used, most_viewed, how=\"right\", left_index=True, right_index=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Relations Between Tags"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way of trying to gauge how pairs of tags are related to each other, is to count how many times each pair appears together. Let's do this.\n",
"\n",
"We'll begin by creating a list of all tags."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"all_tags = list(tag_count.index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll now create a dataframe where each row will represent a tag, and each column as well. Something like this:\n",
"\n",
"
"
],
"text/plain": [
" machine-learning data-mining regression linear-regression\n",
"machine-learning NaN NaN NaN NaN\n",
"data-mining NaN NaN NaN NaN\n",
"regression NaN NaN NaN NaN\n",
"linear-regression NaN NaN NaN NaN"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"associations = pd.DataFrame(index=all_tags, columns=all_tags)\n",
"associations.iloc[0:4,0:4]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will now fill this dataframe with zeroes and then, for each lists of tags in `questions[\"Tags\"]`, we will increment the intervening tags by one. The end result will be a dataframe that for each pair of tags, it tells us how many times they were used together."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"associations.fillna(0, inplace=True)\n",
"\n",
"for tags in questions[\"Tags\"]:\n",
" associations.loc[tags, tags] += 1"
]
},
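{
"cell_type": "markdown",
"metadata": {},
"source": [
"For a sense of what the loop above computes, here is a small standard-library sketch that counts how often each unordered pair of tags appears together. The tag lists are invented for illustration, and `pair_counts` is a hypothetical name. Unlike the `associations` dataframe, this version stores each pair once and leaves out the diagonal (a tag paired with itself).\n",
"\n",
"```python\n",
"from collections import Counter\n",
"from itertools import combinations\n",
"\n",
"# Toy input standing in for questions['Tags']\n",
"sample_tags = [['python', 'pandas'], ['python', 'pandas', 'keras'], ['keras', 'cnn']]\n",
"\n",
"pair_counts = Counter()\n",
"for tags in sample_tags:\n",
"    # Sort so that (a, b) and (b, a) map to the same key\n",
"    for pair in combinations(sorted(set(tags)), 2):\n",
"        pair_counts[pair] += 1\n",
"\n",
"print(pair_counts[('pandas', 'python')])\n",
"```\n",
"\n",
"The dataframe version trades this compactness for a layout that is easy to slice and plot as a heatmap."
]
},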
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataframe is quite large. Let's focus our attention on the most used tags. We'll add some colors to make it easier to talk about the dataframe. (At the time of this writing, GitHub's renderer does not display the colors, we suggest you use this solution notebook together with [JupyterLab](https://jupyterlab.readthedocs.io/en/stable/)."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
machine-learning-model
statistics
clustering
predictive-modeling
r
dataset
regression
pandas
lstm
time-series
cnn
nlp
scikit-learn
tensorflow
classification
keras
neural-network
deep-learning
python
machine-learning
\n",
"
\n",
"
machine-learning-model
\n",
"
224
\n",
"
3
\n",
"
3
\n",
"
21
\n",
"
7
\n",
"
12
\n",
"
8
\n",
"
4
\n",
"
5
\n",
"
7
\n",
"
4
\n",
"
4
\n",
"
18
\n",
"
9
\n",
"
21
\n",
"
17
\n",
"
10
\n",
"
19
\n",
"
37
\n",
"
139
\n",
"
\n",
"
\n",
"
statistics
\n",
"
3
\n",
"
234
\n",
"
3
\n",
"
16
\n",
"
16
\n",
"
17
\n",
"
16
\n",
"
3
\n",
"
1
\n",
"
22
\n",
"
1
\n",
"
3
\n",
"
6
\n",
"
0
\n",
"
19
\n",
"
3
\n",
"
11
\n",
"
12
\n",
"
35
\n",
"
89
\n",
"
\n",
"
\n",
"
clustering
\n",
"
3
\n",
"
3
\n",
"
257
\n",
"
0
\n",
"
16
\n",
"
5
\n",
"
2
\n",
"
5
\n",
"
3
\n",
"
20
\n",
"
0
\n",
"
9
\n",
"
24
\n",
"
0
\n",
"
12
\n",
"
0
\n",
"
8
\n",
"
2
\n",
"
45
\n",
"
61
\n",
"
\n",
"
\n",
"
predictive-modeling
\n",
"
21
\n",
"
16
\n",
"
0
\n",
"
265
\n",
"
13
\n",
"
7
\n",
"
28
\n",
"
4
\n",
"
13
\n",
"
31
\n",
"
6
\n",
"
1
\n",
"
12
\n",
"
6
\n",
"
27
\n",
"
11
\n",
"
13
\n",
"
32
\n",
"
35
\n",
"
123
\n",
"
\n",
"
\n",
"
r
\n",
"
7
\n",
"
16
\n",
"
16
\n",
"
13
\n",
"
268
\n",
"
6
\n",
"
10
\n",
"
2
\n",
"
3
\n",
"
22
\n",
"
2
\n",
"
4
\n",
"
1
\n",
"
1
\n",
"
10
\n",
"
10
\n",
"
9
\n",
"
5
\n",
"
24
\n",
"
63
\n",
"
\n",
"
\n",
"
dataset
\n",
"
12
\n",
"
17
\n",
"
5
\n",
"
7
\n",
"
6
\n",
"
340
\n",
"
6
\n",
"
14
\n",
"
7
\n",
"
6
\n",
"
11
\n",
"
11
\n",
"
9
\n",
"
9
\n",
"
28
\n",
"
13
\n",
"
20
\n",
"
32
\n",
"
53
\n",
"
99
\n",
"
\n",
"
\n",
"
regression
\n",
"
8
\n",
"
16
\n",
"
2
\n",
"
28
\n",
"
10
\n",
"
6
\n",
"
347
\n",
"
6
\n",
"
11
\n",
"
24
\n",
"
6
\n",
"
2
\n",
"
37
\n",
"
9
\n",
"
34
\n",
"
31
\n",
"
42
\n",
"
21
\n",
"
59
\n",
"
119
\n",
"
\n",
"
\n",
"
pandas
\n",
"
4
\n",
"
3
\n",
"
5
\n",
"
4
\n",
"
2
\n",
"
14
\n",
"
6
\n",
"
354
\n",
"
7
\n",
"
19
\n",
"
1
\n",
"
3
\n",
"
37
\n",
"
3
\n",
"
3
\n",
"
3
\n",
"
1
\n",
"
1
\n",
"
244
\n",
"
62
\n",
"
\n",
"
\n",
"
lstm
\n",
"
5
\n",
"
1
\n",
"
3
\n",
"
13
\n",
"
3
\n",
"
7
\n",
"
11
\n",
"
7
\n",
"
402
\n",
"
87
\n",
"
24
\n",
"
19
\n",
"
2
\n",
"
43
\n",
"
20
\n",
"
133
\n",
"
69
\n",
"
103
\n",
"
61
\n",
"
71
\n",
"
\n",
"
\n",
"
time-series
\n",
"
7
\n",
"
22
\n",
"
20
\n",
"
31
\n",
"
22
\n",
"
6
\n",
"
24
\n",
"
19
\n",
"
87
\n",
"
466
\n",
"
8
\n",
"
0
\n",
"
12
\n",
"
9
\n",
"
25
\n",
"
51
\n",
"
33
\n",
"
44
\n",
"
105
\n",
"
131
\n",
"
\n",
"
\n",
"
cnn
\n",
"
4
\n",
"
1
\n",
"
0
\n",
"
6
\n",
"
2
\n",
"
11
\n",
"
6
\n",
"
1
\n",
"
24
\n",
"
8
\n",
"
489
\n",
"
7
\n",
"
0
\n",
"
57
\n",
"
20
\n",
"
116
\n",
"
118
\n",
"
160
\n",
"
62
\n",
"
124
\n",
"
\n",
"
\n",
"
nlp
\n",
"
4
\n",
"
3
\n",
"
9
\n",
"
1
\n",
"
4
\n",
"
11
\n",
"
2
\n",
"
3
\n",
"
19
\n",
"
0
\n",
"
7
\n",
"
493
\n",
"
12
\n",
"
11
\n",
"
35
\n",
"
23
\n",
"
24
\n",
"
72
\n",
"
71
\n",
"
113
\n",
"
\n",
"
\n",
"
scikit-learn
\n",
"
18
\n",
"
6
\n",
"
24
\n",
"
12
\n",
"
1
\n",
"
9
\n",
"
37
\n",
"
37
\n",
"
2
\n",
"
12
\n",
"
0
\n",
"
12
\n",
"
540
\n",
"
15
\n",
"
47
\n",
"
34
\n",
"
24
\n",
"
16
\n",
"
235
\n",
"
188
\n",
"
\n",
"
\n",
"
tensorflow
\n",
"
9
\n",
"
0
\n",
"
0
\n",
"
6
\n",
"
1
\n",
"
9
\n",
"
9
\n",
"
3
\n",
"
43
\n",
"
9
\n",
"
57
\n",
"
11
\n",
"
15
\n",
"
584
\n",
"
20
\n",
"
256
\n",
"
108
\n",
"
136
\n",
"
167
\n",
"
106
\n",
"
\n",
"
\n",
"
classification
\n",
"
21
\n",
"
19
\n",
"
12
\n",
"
27
\n",
"
10
\n",
"
28
\n",
"
34
\n",
"
3
\n",
"
20
\n",
"
25
\n",
"
20
\n",
"
35
\n",
"
47
\n",
"
20
\n",
"
685
\n",
"
58
\n",
"
65
\n",
"
59
\n",
"
98
\n",
"
259
\n",
"
\n",
"
\n",
"
keras
\n",
"
17
\n",
"
3
\n",
"
0
\n",
"
11
\n",
"
10
\n",
"
13
\n",
"
31
\n",
"
3
\n",
"
133
\n",
"
51
\n",
"
116
\n",
"
23
\n",
"
34
\n",
"
256
\n",
"
58
\n",
"
935
\n",
"
235
\n",
"
247
\n",
"
280
\n",
"
195
\n",
"
\n",
"
\n",
"
neural-network
\n",
"
10
\n",
"
11
\n",
"
8
\n",
"
13
\n",
"
9
\n",
"
20
\n",
"
42
\n",
"
1
\n",
"
69
\n",
"
33
\n",
"
118
\n",
"
24
\n",
"
24
\n",
"
108
\n",
"
65
\n",
"
235
\n",
"
1055
\n",
"
305
\n",
"
137
\n",
"
366
\n",
"
\n",
"
\n",
"
deep-learning
\n",
"
19
\n",
"
12
\n",
"
2
\n",
"
32
\n",
"
5
\n",
"
32
\n",
"
21
\n",
"
1
\n",
"
103
\n",
"
44
\n",
"
160
\n",
"
72
\n",
"
16
\n",
"
136
\n",
"
59
\n",
"
247
\n",
"
305
\n",
"
1220
\n",
"
160
\n",
"
429
\n",
"
\n",
"
\n",
"
python
\n",
"
37
\n",
"
35
\n",
"
45
\n",
"
35
\n",
"
24
\n",
"
53
\n",
"
59
\n",
"
244
\n",
"
61
\n",
"
105
\n",
"
62
\n",
"
71
\n",
"
235
\n",
"
167
\n",
"
98
\n",
"
280
\n",
"
137
\n",
"
160
\n",
"
1814
\n",
"
499
\n",
"
\n",
"
\n",
"
machine-learning
\n",
"
139
\n",
"
89
\n",
"
61
\n",
"
123
\n",
"
63
\n",
"
99
\n",
"
119
\n",
"
62
\n",
"
71
\n",
"
131
\n",
"
124
\n",
"
113
\n",
"
188
\n",
"
106
\n",
"
259
\n",
"
195
\n",
"
366
\n",
"
429
\n",
"
499
\n",
"
2693
\n",
"
\n",
"
"
],
"text/plain": [
""
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"relations_most_used = associations.loc[most_used.index, most_used.index]\n",
"\n",
"def style_cells(x):\n",
" helper_df = pd.DataFrame('', index=x.index, columns=x.columns)\n",
" helper_df.loc[\"time-series\", \"r\"] = \"background-color: yellow\"\n",
" helper_df.loc[\"r\", \"time-series\"] = \"background-color: yellow\"\n",
" for k in range(helper_df.shape[0]):\n",
" helper_df.iloc[k,k] = \"color: blue\"\n",
" \n",
" return helper_df\n",
"\n",
"relations_most_used.style.apply(style_cells, axis=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cells highlighted in yellow tell us that `time-series` was used together with `r` 22 times. The values in blue tell us how many times each of the tags was used. We saw earlier that `machine-learning` was used 2693 times and we confirm it in this dataframe.\n",
"\n",
"It's hard for a human to understand what is going on in this dataframe. Let's create a heatmap. But before we do it, let's get rid of the values in blue, otherwise the colors will be too skewed."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"for i in range(relations_most_used.shape[0]):\n",
" relations_most_used.iloc[i,i] = pd.np.NaN"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "\n",
"text/plain": [
"
"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.figure(figsize=(12,8))\n",
"sns.heatmap(relations_most_used, cmap=\"Greens\", annot=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The most used tags also seem to have the strongest relationships, as given by the dark concentration in the bottom right corner. However, this could simply be because each of these tags is used a lot, and so end up being used together a lot without possibly even having any strong relation between them.\n",
"\n",
"A more intuitive manifestation of this phenomenon is the following. A lot of people buy bread, a lot of people buy toilet paper, so they end up being purchased together a lot, but purchasing one of them doesn't increase the chances of purchasing the other."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another shortcoming of this attempt is that it only looks at relations between pairs of tags and not between multiple groups of tags. For example, it could be the case that when used together, `dataset` and `scikit-learn` have a \"strong\" relation to `pandas`, but each by itself doesn't.\n",
"\n",
"So how do we attack both these problems? There is a powerful data mining technique that allows us to handle this: [association rules](https://en.wikipedia.org/wiki/Association_rule_learning). Association rules allow us to analytically spot relations like \"people who purchase milk, also purchase eggs\". Moreover, we can also measure how strong this relations are on several fronts: how common the relation is, how strong it is, and how independent the components of the relationship are (toilet paper and bread are probably more independent than eggs and milk — you'll learn more about [statistical independence](https://en.wikipedia.org/wiki/Independence_(probability_theory)) in the next step).\n",
"\n",
"\n",
"We won't get into the details of it, as the technique is out of scope for this course, but it is a path worth investigating!"
]
},
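{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the three measures concrete, here is a minimal sketch that computes support, confidence, and lift for one rule on a handful of invented shopping baskets. The data and the `rule_metrics` helper are assumptions for illustration; a real analysis would more likely rely on a dedicated library, and the function assumes both items occur at least once.\n",
"\n",
"```python\n",
"def rule_metrics(baskets, antecedent, consequent):\n",
"    n = len(baskets)\n",
"    has_a = sum(1 for b in baskets if antecedent in b)\n",
"    has_c = sum(1 for b in baskets if consequent in b)\n",
"    has_both = sum(1 for b in baskets if antecedent in b and consequent in b)\n",
"    support = has_both / n              # how common the pair is\n",
"    confidence = has_both / has_a       # share of antecedent baskets that also hold the consequent\n",
"    lift = confidence / (has_c / n)     # 1.0 means the items look independent\n",
"    return support, confidence, lift\n",
"\n",
"# Invented baskets, not real purchase data\n",
"baskets = [{'milk', 'eggs'}, {'milk', 'eggs', 'bread'}, {'bread'}, {'milk'}]\n",
"print(rule_metrics(baskets, 'milk', 'eggs'))\n",
"```\n",
"\n",
"A lift well above 1 suggests the items appear together more often than their individual popularity would explain, which is exactly what the raw co-occurrence counts cannot tell us."
]
},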
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Enter Domain Knowledge\n",
"\n",
"[Keras](https://keras.io/), [scikit-learn](https://scikit-learn.org/), [TensorFlow](https://www.tensorflow.org/) are all Python libraries that allow their users to employ deep learning (a type of neural network).\n",
"\n",
"Most of the top tags are all intimately related with one central machine learning theme: deep learning. If we want to be very specific, we can suggest the creation of Python content that uses deep learning for classification problems (and other variations of this suggestion).\n",
"\n",
"At the glance of an eye, someone with sufficient domain knowledge can tell that the most popular topic at the moment, as shown by our analysis, is deep learning."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Just a Fad?\n",
"\n",
"Let's read in the file into a dataframe called `all_q`. We'll parse the dates at read-time."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"all_q = pd.read_csv(\"all_questions.csv\", parse_dates=[\"CreationDate\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can use the same technique as before to clean the tags column."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"all_q[\"Tags\"] = all_q[\"Tags\"].str.replace(\"^<|>$\", \"\").str.split(\"><\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before deciding which questions should be classified as being deep learning questions, we should decide what tags are deep learning tags.\n",
"\n",
"The definition of what constitutes a deep learning tag we'll use is: a tag that belongs to the list `[\"lstm\", \"cnn\", \"scikit-learn\", \"tensorflow\", \"keras\", \"neural-network\", \"deep-learning\"]`.\n",
"\n",
"This list was obtained by looking at all the tags in `most_used` and seeing which ones had any relation to deep learning. You can use Google and read the tags descriptions to reach similar results.\n",
"\n",
"We'll now create a function that assigns `1` to deep learning questions and `0` otherwise; and we use it."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def class_deep_learning(tags):\n",
" for tag in tags:\n",
" if tag in [\"lstm\", \"cnn\", \"scikit-learn\", \"tensorflow\",\n",
" \"keras\", \"neural-network\", \"deep-learning\"]:\n",
" return 1\n",
" return 0"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"all_q[\"DeepLearning\"] = all_q[\"Tags\"].apply(class_deep_learning)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Id
\n",
"
CreationDate
\n",
"
Tags
\n",
"
DeepLearning
\n",
"
\n",
" \n",
" \n",
"
\n",
"
15231
\n",
"
44675
\n",
"
2019-01-28 06:20:18
\n",
"
[model-selection]
\n",
"
0
\n",
"
\n",
"
\n",
"
440
\n",
"
55639
\n",
"
2019-07-14 11:45:43
\n",
"
[machine-learning, dataset, machine-learning-m...
\n",
"
0
\n",
"
\n",
"
\n",
"
11720
\n",
"
51523
\n",
"
2019-05-07 04:57:35
\n",
"
[neural-network, gradient-descent, batch-norma...
\n",
"
1
\n",
"
\n",
"
\n",
"
6262
\n",
"
27232
\n",
"
2018-01-30 09:53:38
\n",
"
[python, convergence]
\n",
"
0
\n",
"
\n",
"
\n",
"
19292
\n",
"
64930
\n",
"
2019-12-16 14:38:17
\n",
"
[neural-network, deep-learning, keras, convolu...
\n",
"
1
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id CreationDate \\\n",
"15231 44675 2019-01-28 06:20:18 \n",
"440 55639 2019-07-14 11:45:43 \n",
"11720 51523 2019-05-07 04:57:35 \n",
"6262 27232 2018-01-30 09:53:38 \n",
"19292 64930 2019-12-16 14:38:17 \n",
"\n",
" Tags DeepLearning \n",
"15231 [model-selection] 0 \n",
"440 [machine-learning, dataset, machine-learning-m... 0 \n",
"11720 [neural-network, gradient-descent, batch-norma... 1 \n",
"6262 [python, convergence] 0 \n",
"19292 [neural-network, deep-learning, keras, convolu... 1 "
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_q.sample(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Looks good!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data-science-techonology landscape isn't something as dynamic to merit daily, weekly, or even monthly tracking. Let's track it quarterly.\n",
"\n",
"Since we don't have all the data for the first quarter of 2020, we'll get rid of those dates:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"all_q = all_q[all_q[\"CreationDate\"].dt.year < 2020]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's create a column that identifies the quarter in which a question was asked."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"def fetch_quarter(datetime):\n",
" year = str(datetime.year)[-2:]\n",
" quarter = str(((datetime.month-1) // 3) + 1)\n",
" return \"{y}Q{q}\".format(y=year, q=quarter)\n",
"\n",
"all_q[\"Quarter\"] = all_q[\"CreationDate\"].apply(fetch_quarter)"
]
},
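{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick way to convince ourselves that the quarter arithmetic is right is to run it on a few known dates. The sketch below restates the same formula in a standalone function (`quarter_label`, a hypothetical name) using only the standard library.\n",
"\n",
"```python\n",
"from datetime import datetime\n",
"\n",
"def quarter_label(moment):\n",
"    # Months 1-3 -> Q1, 4-6 -> Q2, 7-9 -> Q3, 10-12 -> Q4\n",
"    quarter = (moment.month - 1) // 3 + 1\n",
"    return '{y}Q{q}'.format(y=str(moment.year)[-2:], q=quarter)\n",
"\n",
"for month in (1, 3, 4, 12):\n",
"    print(month, quarter_label(datetime(2019, month, 1)))\n",
"```\n",
"\n",
"Checking the boundary months (March, April, December) is enough to catch an off-by-one mistake in the integer division."
]
},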
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Id
\n",
"
CreationDate
\n",
"
Tags
\n",
"
DeepLearning
\n",
"
Quarter
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
45416
\n",
"
2019-02-12 00:36:29
\n",
"
[python, keras, tensorflow, cnn, probability]
\n",
"
1
\n",
"
19Q1
\n",
"
\n",
"
\n",
"
1
\n",
"
45418
\n",
"
2019-02-12 00:50:39
\n",
"
[neural-network]
\n",
"
1
\n",
"
19Q1
\n",
"
\n",
"
\n",
"
2
\n",
"
45422
\n",
"
2019-02-12 04:40:51
\n",
"
[python, ibm-watson, chatbot]
\n",
"
0
\n",
"
19Q1
\n",
"
\n",
"
\n",
"
3
\n",
"
45426
\n",
"
2019-02-12 04:51:49
\n",
"
[keras]
\n",
"
1
\n",
"
19Q1
\n",
"
\n",
"
\n",
"
4
\n",
"
45427
\n",
"
2019-02-12 05:08:24
\n",
"
[r, predictive-modeling, machine-learning-mode...
\n",
"
0
\n",
"
19Q1
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Id CreationDate \\\n",
"0 45416 2019-02-12 00:36:29 \n",
"1 45418 2019-02-12 00:50:39 \n",
"2 45422 2019-02-12 04:40:51 \n",
"3 45426 2019-02-12 04:51:49 \n",
"4 45427 2019-02-12 05:08:24 \n",
"\n",
" Tags DeepLearning Quarter \n",
"0 [python, keras, tensorflow, cnn, probability] 1 19Q1 \n",
"1 [neural-network] 1 19Q1 \n",
"2 [python, ibm-watson, chatbot] 0 19Q1 \n",
"3 [keras] 1 19Q1 \n",
"4 [r, predictive-modeling, machine-learning-mode... 0 19Q1 "
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_q.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the final stretch of this screen, we'll group by quarter and:\n",
"\n",
"* Count the number of deep learning questions.\n",
"* Count the total number of questions.\n",
"* Compute the ratio between the two numbers above."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"