
Popular Data Science Questions

Our goal in this project is to use Data Science Stack Exchange to determine what content a data science education company should create, based on interest by subject.

Stack Exchange

What kind of questions are welcome on this site?

On DSSE's help center's section on questions, we can read that we should:

  • Avoid subjective questions.
  • Ask practical questions about Data Science — there are adequate sites for theoretical questions.
  • Ask specific questions.
  • Make questions relevant to others.

Questions with these characteristics should be well suited to our goal.

In the help center we also learn that, in addition to the sites mentioned in the Learn section, there are two other relevant sites.

What, other than questions, does DSSE's home subdivide into?

On the home page we can see that we have four sections:

  • Questions — a list of all questions asked;

  • Tags — a list of tags (keywords or labels that categorize questions);

  • Users — a list of users;

  • Unanswered — a list of unanswered questions;

The tagging system used by Stack Exchange looks like just what we need to solve this problem, as it allows us to quantify how many questions are asked about each subject.

Something else we learn from exploring the help center is that Stack Exchange's sites are heavily moderated by the community; this gives us some confidence in using the tagging system to draw conclusions.

What information is available in each post?

Looking, just as an example, at this question, some of the information we see is:

  • For both questions and answers:
    • The post's score;
    • The post's title;
    • The post's author;
    • The post's body;
  • For questions only:
    • How many users have it on their "favorites" list;
    • The last time the question was active;
    • How many times the question was viewed;
    • Related questions;
    • The question's tags;

Stack Exchange Data Explorer

Perusing the table names, a few stand out as relevant for our goal:

  • Posts
  • PostTags
  • Tags
  • TagSynonyms

Running a few exploratory queries leads us to focus our efforts on the Posts table. For example, the Tags table looked very promising, as it tells us how many times each tag was used, but there's no way to tell from this alone whether the interest in these tags is recent or a thing of the past.

Id TagName Count ExcerptPostId WikiPostId
2 machine-learning 6919 4909 4908
46 python 3907 5523 5522
81 neural-network 2923 8885 8884
194 deep-learning 2786 8956 8955
77 classification 1899 4911 4910
324 keras 1736 9251 9250
128 scikit-learn 1303 5896 5895
321 tensorflow 1224 9183 9182
47 nlp 1162 147 146
24 r 1114 49 48

Getting the Data

To get the relevant data we run the following query.

SELECT Id, CreationDate,
       Score, ViewCount, Tags,
       AnswerCount, FavoriteCount
  FROM posts
 WHERE PostTypeId = 1 AND YEAR(CreationDate) = 2019;

Here's what the first few rows look like:

Id PostTypeId CreationDate Score ViewCount Tags AnswerCount FavoriteCount
44419 1 2019-01-23 09:21:13 1 21 <machine-learning><data-mining> 0
44420 1 2019-01-23 09:34:01 0 25 <machine-learning><regression><linear-regression><regularization> 0
44423 1 2019-01-23 09:58:41 2 1651 <python><time-series><forecast><forecasting> 0
44427 1 2019-01-23 10:57:09 0 55 <machine-learning><scikit-learn><pca> 1
44428 1 2019-01-23 11:02:15 0 19 <dataset><bigdata><data><speech-to-text> 0

Exploring the Data

We can read in the data while immediately making sure CreationDate will be stored as a datetime object:

# We import everything that we'll use

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
questions = pd.read_csv("2019_questions.csv", parse_dates=["CreationDate"])

Running questions.info() gives us a lot of useful information.

questions.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8839 entries, 0 to 8838
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Id             8839 non-null   int64         
 1   CreationDate   8839 non-null   datetime64[ns]
 2   Score          8839 non-null   int64         
 3   ViewCount      8839 non-null   int64         
 4   Tags           8839 non-null   object        
 5   AnswerCount    8839 non-null   int64         
 6   FavoriteCount  1407 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(4), object(1)
memory usage: 483.5+ KB

We see that only FavoriteCount has missing values. A missing value in this column probably means that the question is not present in any user's favorites list, so we can replace the missing values with zero.

The types seem adequate for every column; however, once we fill in the missing values in FavoriteCount, there is no reason to keep storing those values as floats.

Since the object dtype is a catch-all type, let's see what types the objects in questions["Tags"] are.

questions["Tags"].apply(lambda value: type(value)).unique()
array([<class 'str'>], dtype=object)

We see that every value in this column is a string. On Stack Exchange, each question can only have a maximum of five tags (source), so one way to deal with this column is to create five columns in questions called Tag1, Tag2, Tag3, Tag4, and Tag5 and populate the columns with the tags in each row.

However, since this doesn't help in relating tags from one question to another, we'll just keep them as a list.
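For reference, a rough sketch of that alternative (the split_tags name is just for illustration; it assumes the tags are still stored as the raw "<tag1><tag2>..." strings at this point) might look like:

# Sketch of the alternative we're not pursuing: one column per tag slot.
split_tags = (questions["Tags"]
              .str.replace("^<|>$", "", regex=True)
              .str.split("><", expand=True))
split_tags.columns = ["Tag" + str(i) for i in range(1, split_tags.shape[1] + 1)]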

Cleaning the Data

We'll begin by fixing FavoriteCount.

questions.fillna(value={"FavoriteCount": 0}, inplace=True)
questions["FavoriteCount"] = questions["FavoriteCount"].astype(int)
questions.dtypes
Id                        int64
CreationDate     datetime64[ns]
Score                     int64
ViewCount                 int64
Tags                     object
AnswerCount               int64
FavoriteCount             int64
dtype: object

Let's now modify Tags to make it easier to work with.

questions["Tags"] = questions["Tags"].str.replace("^<|>$", "").str.split("><")
questions.sample(3)
Id CreationDate Score ViewCount Tags AnswerCount FavoriteCount
511 56382 2019-07-25 15:00:20 0 34 [machine-learning, python, pandas, natural-lan... 0 0
2178 58312 2019-08-28 09:44:00 1 41 [neural-network, pytorch] 0 1
2536 58151 2019-08-25 01:01:29 0 37 [dataset, audio-recognition] 2 0

Most Used and Most Viewed

We'll begin by counting how many times each tag was used.

tag_count = dict()

for tags in questions["Tags"]:
    for tag in tags:
        if tag in tag_count:
            tag_count[tag] += 1
        else:
            tag_count[tag] = 1
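As a quick cross-check (a sketch, not part of the original solution), pandas can produce the same counts in a vectorized way:

# Explode the tag lists so each tag gets its own row, then count occurrences;
# the result should match the values in tag_count.
tag_count_check = questions["Tags"].explode().value_counts()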

For improved aesthetics, let's transform tag_count into a dataframe.

tag_count = pd.DataFrame.from_dict(tag_count, orient="index")
tag_count.rename(columns={0: "Count"}, inplace=True)
tag_count.head(10)
Count
machine-learning 2693
data-mining 217
regression 347
linear-regression 175
regularization 50
python 1814
time-series 466
forecast 34
forecasting 85
scikit-learn 540

Let's now sort this dataframe by Count and visualize the top 20 results.

most_used = tag_count.sort_values(by="Count").tail(20)
most_used
Count
machine-learning-model 224
statistics 234
clustering 257
predictive-modeling 265
r 268
dataset 340
regression 347
pandas 354
lstm 402
time-series 466
cnn 489
nlp 493
scikit-learn 540
tensorflow 584
classification 685
keras 935
neural-network 1055
deep-learning 1220
python 1814
machine-learning 2693

The threshold of 20 is somewhat arbitrary and we could experiment with others; however, the popularity of the tags declines rapidly, so looking at these tags should be enough for our goal. Let's visualize this data.

most_used.plot(kind="barh", figsize=(16,8))
(Figure: horizontal bar chart of the 20 most used tags.)

Some tags are very broad and are unlikely to be useful, e.g., python, dataset, and r. Before we investigate the tags a little deeper, let's repeat the same process for views.

This time we'll use pandas.DataFrame.iterrows().

tag_view_count = dict()

for index, row in questions.iterrows():
    for tag in row['Tags']:
        if tag in tag_view_count:
            tag_view_count[tag] += row['ViewCount']
        else:
            tag_view_count[tag] = row['ViewCount']
            
tag_view_count = pd.DataFrame.from_dict(tag_view_count, orient="index")
tag_view_count.rename(columns={0: "ViewCount"}, inplace=True)

most_viewed = tag_view_count.sort_values(by="ViewCount").tail(20)

most_viewed.plot(kind="barh", figsize=(16,8))
(Figure: horizontal bar chart of the 20 most viewed tags.)
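The view totals can be cross-checked in a similar vectorized way (again just a sketch, using DataFrame.explode):

# One row per (question, tag) pair, then sum ViewCount per tag;
# the result should match the values in tag_view_count.
view_check = (questions[["Tags", "ViewCount"]]
              .explode("Tags")
              .groupby("Tags")["ViewCount"]
              .sum())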

Let's see them side by side.

fig, axes = plt.subplots(nrows=1, ncols=2)
fig.set_size_inches((24, 10))
most_used.plot(kind="barh", ax=axes[0], subplots=True)
most_viewed.plot(kind="barh", ax=axes[1], subplots=True)
(Figure: side-by-side horizontal bar charts of the most used and the most viewed tags.)

in_used = pd.merge(most_used, most_viewed, how="left", left_index=True, right_index=True)
in_viewed = pd.merge(most_used, most_viewed, how="right", left_index=True, right_index=True)
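The two merges above line up the most used and the most viewed tags; a quick way (just a sketch) to see which tags appear in one top-20 list but not the other is:

# Tags that are among the most used but not the most viewed end up with a
# missing ViewCount after the left merge (and vice versa for Count).
print(in_used[in_used["ViewCount"].isnull()].index.tolist())
print(in_viewed[in_viewed["Count"].isnull()].index.tolist())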

Relations Between Tags

One way of gauging how pairs of tags are related to each other is to count how many times each pair appears together. Let's do this.

We'll begin by creating a list of all tags.

all_tags = list(tag_count.index)

We'll now create a dataframe in which both the rows and the columns represent tags. Something like this:

tag1 tag2 tag3
tag1
tag2
tag3

associations = pd.DataFrame(index=all_tags, columns=all_tags)
associations.iloc[0:4,0:4]
machine-learning data-mining regression linear-regression
machine-learning NaN NaN NaN NaN
data-mining NaN NaN NaN NaN
regression NaN NaN NaN NaN
linear-regression NaN NaN NaN NaN

We will now fill this dataframe with zeroes and then, for each list of tags in questions["Tags"], increment the cells for every pair of tags in that list by one. The end result will be a dataframe that tells us, for each pair of tags, how many times they were used together.

associations.fillna(0, inplace=True)

for tags in questions["Tags"]:
    associations.loc[tags, tags] += 1

This dataframe is quite large, so let's focus our attention on the most used tags. We'll add some colors to make it easier to talk about the dataframe. (At the time of this writing, GitHub's renderer does not display the colors; we suggest using this solution notebook together with JupyterLab.)

relations_most_used = associations.loc[most_used.index, most_used.index]

def style_cells(x):
    helper_df = pd.DataFrame('', index=x.index, columns=x.columns)
    helper_df.loc["time-series", "r"] = "background-color: yellow"
    helper_df.loc["r", "time-series"] = "background-color: yellow"
    for k in range(helper_df.shape[0]):
        helper_df.iloc[k,k] = "color: blue"
    
    return helper_df

relations_most_used.style.apply(style_cells, axis=None)
machine-learning-model statistics clustering predictive-modeling r dataset regression pandas lstm time-series cnn nlp scikit-learn tensorflow classification keras neural-network deep-learning python machine-learning
machine-learning-model 224 3 3 21 7 12 8 4 5 7 4 4 18 9 21 17 10 19 37 139
statistics 3 234 3 16 16 17 16 3 1 22 1 3 6 0 19 3 11 12 35 89
clustering 3 3 257 0 16 5 2 5 3 20 0 9 24 0 12 0 8 2 45 61
predictive-modeling 21 16 0 265 13 7 28 4 13 31 6 1 12 6 27 11 13 32 35 123
r 7 16 16 13 268 6 10 2 3 22 2 4 1 1 10 10 9 5 24 63
dataset 12 17 5 7 6 340 6 14 7 6 11 11 9 9 28 13 20 32 53 99
regression 8 16 2 28 10 6 347 6 11 24 6 2 37 9 34 31 42 21 59 119
pandas 4 3 5 4 2 14 6 354 7 19 1 3 37 3 3 3 1 1 244 62
lstm 5 1 3 13 3 7 11 7 402 87 24 19 2 43 20 133 69 103 61 71
time-series 7 22 20 31 22 6 24 19 87 466 8 0 12 9 25 51 33 44 105 131
cnn 4 1 0 6 2 11 6 1 24 8 489 7 0 57 20 116 118 160 62 124
nlp 4 3 9 1 4 11 2 3 19 0 7 493 12 11 35 23 24 72 71 113
scikit-learn 18 6 24 12 1 9 37 37 2 12 0 12 540 15 47 34 24 16 235 188
tensorflow 9 0 0 6 1 9 9 3 43 9 57 11 15 584 20 256 108 136 167 106
classification 21 19 12 27 10 28 34 3 20 25 20 35 47 20 685 58 65 59 98 259
keras 17 3 0 11 10 13 31 3 133 51 116 23 34 256 58 935 235 247 280 195
neural-network 10 11 8 13 9 20 42 1 69 33 118 24 24 108 65 235 1055 305 137 366
deep-learning 19 12 2 32 5 32 21 1 103 44 160 72 16 136 59 247 305 1220 160 429
python 37 35 45 35 24 53 59 244 61 105 62 71 235 167 98 280 137 160 1814 499
machine-learning 139 89 61 123 63 99 119 62 71 131 124 113 188 106 259 195 366 429 499 2693

The cells highlighted in yellow tell us that time-series was used together with r 22 times. The values in blue, on the diagonal, tell us how many times each tag was used. We saw earlier that machine-learning was used 2693 times, and we can confirm it in this dataframe.

It's hard for a human to see what is going on in this dataframe, so let's create a heatmap. But before we do, let's get rid of the values in blue (the diagonal), otherwise the color scale will be too skewed.

# Blank out the diagonal so it doesn't skew the heatmap's color scale.
for i in range(relations_most_used.shape[0]):
    relations_most_used.iloc[i, i] = float("nan")
plt.figure(figsize=(12,8))
sns.heatmap(relations_most_used, cmap="Greens", annot=False)
(Figure: heatmap of co-occurrence counts among the 20 most used tags.)

The most used tags also seem to have the strongest relationships, as shown by the dark concentration in the bottom right corner. However, this could simply be because each of these tags is used a lot, so they end up being used together a lot without necessarily having any strong relation between them.

A more intuitive manifestation of this phenomenon is the following. A lot of people buy bread, a lot of people buy toilet paper, so they end up being purchased together a lot, but purchasing one of them doesn't increase the chances of purchasing the other.

Another shortcoming of this attempt is that it only looks at relations between pairs of tags and not between multiple groups of tags. For example, it could be the case that when used together, dataset and scikit-learn have a "strong" relation to pandas, but each by itself doesn't.

So how do we attack both of these problems? There is a powerful data mining technique that allows us to handle this: association rules. Association rules allow us to analytically spot relations like "people who purchase milk also purchase eggs". Moreover, we can measure how strong these relations are on several fronts: how common the relation is, how strong it is, and how independent the components of the relationship are (toilet paper and bread are probably more independent than eggs and milk; you'll learn more about statistical independence in the next step).

We won't get into the details of it, as the technique is out of scope for this course, but it is a path worth investigating!
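For the curious, here is a minimal sketch of what that path could look like using the third-party mlxtend library (an assumption: mlxtend isn't used elsewhere in this project, and the support and lift thresholds below are purely illustrative):

from mlxtend.frequent_patterns import apriori, association_rules

# One-hot encode the tags: one row per question, one boolean column per tag.
tag_dummies = (pd.get_dummies(questions["Tags"].explode())
                 .groupby(level=0)
                 .max()
                 .astype(bool))

# Find tag sets appearing in at least 1% of questions, then derive rules
# ranked by lift (how far from independent the tags in a rule are).
frequent_tag_sets = apriori(tag_dummies, min_support=0.01, use_colnames=True)
rules = association_rules(frequent_tag_sets, metric="lift", min_threshold=1.2)
rules.sort_values(by="lift", ascending=False).head()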

Enter Domain Knowledge

Keras, scikit-learn, and TensorFlow are all Python machine learning libraries; Keras and TensorFlow in particular are used to build deep learning models (models based on neural networks).

Most of the top tags are intimately related to one central machine learning theme: deep learning. If we want to be very specific, we can suggest the creation of Python content that uses deep learning for classification problems (and other variations of this suggestion).

At a glance, someone with sufficient domain knowledge can tell that the most popular topic at the moment, as shown by our analysis, is deep learning.

Just a Fad?

Let's read in the file into a dataframe called all_q. We'll parse the dates at read-time.

all_q = pd.read_csv("all_questions.csv", parse_dates=["CreationDate"])

We can use the same technique as before to clean the tags column.

all_q["Tags"] = all_q["Tags"].str.replace("^<|>$", "").str.split("><")

Before deciding which questions should be classified as being deep learning questions, we should decide what tags are deep learning tags.

The definition we'll use for a deep learning tag is: a tag that belongs to the list ["lstm", "cnn", "scikit-learn", "tensorflow", "keras", "neural-network", "deep-learning"].

This list was obtained by looking at all the tags in most_used and seeing which ones have any relation to deep learning. You can use Google and read the tag descriptions to reach similar results.

We'll now create a function that assigns 1 to deep learning questions and 0 to the rest, and then apply it.

def class_deep_learning(tags):
    for tag in tags:
        if tag in ["lstm", "cnn", "scikit-learn", "tensorflow",
                   "keras", "neural-network", "deep-learning"]:
            return 1
    return 0
all_q["DeepLearning"] = all_q["Tags"].apply(class_deep_learning)
all_q.sample(5)
Id CreationDate Tags DeepLearning
15231 44675 2019-01-28 06:20:18 [model-selection] 0
440 55639 2019-07-14 11:45:43 [machine-learning, dataset, machine-learning-m... 0
11720 51523 2019-05-07 04:57:35 [neural-network, gradient-descent, batch-norma... 1
6262 27232 2018-01-30 09:53:38 [python, convergence] 0
19292 64930 2019-12-16 14:38:17 [neural-network, deep-learning, keras, convolu... 1

Looks good!

The data science technology landscape isn't so dynamic that it merits daily, weekly, or even monthly tracking. Let's track it quarterly.

Since we don't have all the data for the first quarter of 2020, we'll get rid of those dates:

all_q = all_q[all_q["CreationDate"].dt.year < 2020]

Let's create a column that identifies the quarter in which a question was asked.

def fetch_quarter(datetime):
    year = str(datetime.year)[-2:]
    quarter = str(((datetime.month-1) // 3) + 1)
    return "{y}Q{q}".format(y=year, q=quarter)

all_q["Quarter"] = all_q["CreationDate"].apply(fetch_quarter)
all_q.head()
Id CreationDate Tags DeepLearning Quarter
0 45416 2019-02-12 00:36:29 [python, keras, tensorflow, cnn, probability] 1 19Q1
1 45418 2019-02-12 00:50:39 [neural-network] 1 19Q1
2 45422 2019-02-12 04:40:51 [python, ibm-watson, chatbot] 0 19Q1
3 45426 2019-02-12 04:51:49 [keras] 1 19Q1
4 45427 2019-02-12 05:08:24 [r, predictive-modeling, machine-learning-mode... 0 19Q1
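As an aside, pandas can produce the same quarter labels without a custom helper (a sketch that relies on converting the timestamps to quarterly periods first):

# "2019Q1" becomes "19Q1"; equivalent to fetch_quarter above.
quarter_check = all_q["CreationDate"].dt.to_period("Q").astype(str).str[2:]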

For the final stretch of this analysis, we'll group by quarter and:

  • Count the number of deep learning questions.
  • Count the total number of questions.
  • Compute the ratio between the two numbers above.
quarterly = all_q.groupby('Quarter').agg({"DeepLearning": ['sum', 'size']})
quarterly.columns = ['DeepLearningQuestions', 'TotalQuestions']
quarterly["DeepLearningRate"] = quarterly["DeepLearningQuestions"]\
                                /quarterly["TotalQuestions"]
# The following is done to help with visualizations later.
quarterly.reset_index(inplace=True)
quarterly.sample(5)
Quarter DeepLearningQuestions TotalQuestions DeepLearningRate
17 18Q3 685 1512 0.453042
7 16Q1 110 516 0.213178
6 15Q4 66 382 0.172775
22 19Q4 809 2036 0.397348
9 16Q3 161 585 0.275214
ax1 = quarterly.plot(x="Quarter", y="DeepLearningRate",
                    kind="line", linestyle="-", marker="o", color="orange",
                    figsize=(24,12)
                    )

ax2 = quarterly.plot(x="Quarter", y="TotalQuestions",
                     kind="bar", ax=ax1, secondary_y=True, alpha=0.7, rot=45)

for idx, t in quarterly["TotalQuestions"].items():
    ax2.text(idx, t, str(t), ha="center", va="bottom")
xlims = ax1.get_xlim()

ax1.get_legend().remove()

handles1, labels1 = ax1.get_legend_handles_labels()
handles2, labels2 = ax2.get_legend_handles_labels()
ax1.legend(handles=handles1 + handles2,
           labels=labels1 + labels2,
           loc="upper left", prop={"size": 12})


for ax in (ax1, ax2):
    for where in ("top", "right"):
        ax.spines[where].set_visible(False)
        ax.tick_params(right=False, labelright=False)

It seems that interest in deep learning questions has grown rapidly since the start of DSSE and now looks to be plateauing. There is no evidence to suggest that interest in deep learning is decreasing, so we maintain our earlier recommendation of creating deep learning content.
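One way to back the "plateauing" read with a number (a sketch, not part of the original analysis) is to look at the quarter-over-quarter change in the deep learning rate:

# Small or shrinking differences in recent quarters would support a plateau.
rate_change = quarterly.set_index("Quarter")["DeepLearningRate"].diff()
rate_change.tail(8)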