|
@@ -147,9 +147,7 @@
|
|
|
"We have 7197 iOS apps in this data set, and the columns that seem interesting are: `'track_name'`, `'currency'`, `'price'`, `'rating_count_tot'`, `'rating_count_ver'`, and `'prime_genre'`. Not all column names are self-explanatory in this case, but details about each column can be found in the data set [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).\n",
|
|
|
"\n",
|
|
|
"\n",
|
|
|
- "## Data Cleaning\n",
|
|
|
- "\n",
|
|
|
- "### Deleting Wrong Data\n",
|
|
|
+ "## Deleting Wrong Data\n",
|
|
|
"\n",
|
|
|
"The Google Play data set has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) outlines an error for row 10472. Let's print this row and compare it against the header and another row that is correct."
|
|
|
]
|
|
@@ -212,8 +210,9 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "### Removing Duplicate App Entries\n",
|
|
|
+ "## Removing Duplicate Entries\n",
|
|
|
"\n",
|
|
|
+ "### Part One\n",
|
|
|
"If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:"
|
|
|
]
|
|
|
},
|
|
@@ -288,6 +287,8 @@
|
|
|
"- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app\n",
|
|
|
"- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)\n",
|
|
|
"\n",
|
|
|
+ "### Part Two\n",
|
|
|
+ "\n",
|
|
|
"Let's start by building the dictionary."
|
|
|
]
|
|
|
},
|
|
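The two steps listed in this hunk (build a dictionary of the highest review count per app, then use it to keep one entry per app) could be sketched roughly as below. This is an illustration under assumptions, not the notebook's own code: the function name `dedupe_by_max_reviews`, the column positions `name_idx` and `reviews_idx`, and the sample rows are all hypothetical.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the row with the highest review count.

    Column positions are assumptions; adjust them to the real data set.
    """
    # Step 1: map each unique app name to its highest number of reviews.
    reviews_max = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Step 2: build the clean data set, adding each app only once even if
    # two of its entries tie on the highest review count.
    clean = []
    already_added = set()
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean


# Hypothetical rows: [name, category, rating, reviews]
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Notes', 'TOOLS', '4.0', '120'],
]
deduped = dedupe_by_max_reviews(sample_rows)
```

The `already_added` set matters because comparing against the maximum alone would keep every entry that ties for the highest review count.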
@@ -410,6 +411,8 @@
|
|
|
"\n",
|
|
|
"## Removing Non-English Apps\n",
|
|
|
"\n",
|
|
|
+ "### Part One\n",
|
|
|
+ "\n",
|
|
|
"If you explore the data sets enough, you'll notice that the names of some of the apps suggest that they are not directed toward an English-speaking audience. Below, we see a couple of examples from both data sets:"
|
|
|
]
|
|
|
},
|
|
@@ -510,6 +513,8 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
+ "### Part Two\n",
|
|
|
+ "\n",
|
|
|
"To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:"
|
|
|
]
|
|
|
},
|
|
@@ -614,7 +619,7 @@
|
|
|
"source": [
|
|
|
"We can see that we're left with 9614 Android apps and 6183 iOS apps.\n",
|
|
|
"\n",
|
|
|
- "### Isolating the Free Apps\n",
|
|
|
+ "## Isolating the Free Apps\n",
|
|
|
"\n",
|
|
|
"As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets."
|
|
|
]
|
|
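Isolating the free apps amounts to keeping the rows whose price is zero. The sketch below is an assumption-laden illustration: the helper name `free_apps`, the `price_idx` parameter, and the idea that prices are stored as strings with an optional leading `$` are hypothetical and would need checking against the real columns.

```python
def free_apps(rows, price_idx):
    """Return only the rows whose price parses to zero.

    Assumes the price is a string such as '0', '0.0' or '$4.99'.
    """
    free = []
    for row in rows:
        price = float(row[price_idx].lstrip('$'))
        if price == 0:
            free.append(row)
    return free


# Hypothetical rows: [name, price]
priced_rows = [['App A', '0'], ['App B', '$4.99'], ['App C', '0.0']]
free_rows = free_apps(priced_rows, price_idx=1)
```

Parsing the price as a number rather than comparing strings lets the same helper handle both `'0'` and `'0.0'`.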
@@ -659,6 +664,8 @@
|
|
|
"\n",
|
|
|
"## Most Common Apps by Genre\n",
|
|
|
"\n",
|
|
|
+ "### Part One\n",
|
|
|
+ "\n",
|
|
|
"As we mentioned in the introduction, our aim is to find out the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.\n",
|
|
|
"\n",
|
|
|
"To minimize risks and overhead, our validation strategy for an app idea comprises three steps:\n",
|
|
@@ -671,6 +678,8 @@
|
|
|
"\n",
|
|
|
"Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.\n",
|
|
|
"\n",
|
|
|
+ "### Part Two\n",
|
|
|
+ "\n",
|
|
|
"We'll build two functions that we can use to analyze the frequency tables:\n",
|
|
|
"\n",
|
|
|
"- One function to generate frequency tables that show percentages\n",
|
|
@@ -719,6 +728,8 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
+ "### Part Three\n",
|
|
|
+ "\n",
|
|
|
"We start by examining the frequency table for the `prime_genre` column of the App Store data set."
|
|
|
]
|
|
|
},
|
|
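A frequency table that shows percentages, together with a helper that orders it from most to least common, might look like the sketch below. The function names `freq_table` and `sorted_table` and the sample rows are hypothetical and are not taken from the notebook.

```python
def freq_table(rows, index):
    """Map each value found in column `index` to its share of rows, in percent."""
    counts = {}
    for row in rows:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(rows)
    return {value: count / total * 100 for value, count in counts.items()}


def sorted_table(rows, index):
    """Return (value, percentage) pairs in descending order of percentage."""
    table = freq_table(rows, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rows: [name, genre]
genre_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Book'], ['d', 'Games']]
genre_table = sorted_table(genre_rows, 1)
for genre, percentage in genre_table:
    print(genre, ':', percentage)
```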
@@ -1118,11 +1129,11 @@
|
|
|
"cell_type": "markdown",
|
|
|
"metadata": {},
|
|
|
"source": [
|
|
|
- "However, this niche seems to show some potential. One thing we could do is take another very popular book and turn it into an app where we could add different features besides the raw version of the book: daily quotes from the book, an audio version of the book, quizes about the book, etc. On top of that, we could also embed a dictionary within the app, so that users don't need to exit our app to look up words in an external app.\n",
|
|
|
+ "However, this niche seems to show some potential. One thing we could do is take another very popular book and turn it into an app where we could add different features besides the raw version of the book: daily quotes from the book, an audio version of the book, quizzes about the book, etc. On top of that, we could also embed a dictionary within the app, so that users don't need to exit our app to look up words in an external app.\n",
|
|
|
"\n",
|
|
|
"This idea seems to fit well with the fact that the App Store is dominated by for-fun apps. This suggests that the market might be a bit saturated with for-fun apps, which in turn means that a practical app might have a better chance of standing out among the huge number of apps on the App Store.\n",
|
|
|
"\n",
|
|
|
- "Other genres that seem popular include weather, book, food and drink, or finance. The book genre seem to overlap a bit with the app idea we exposed above, but the other genres don't seem too interesting to us:\n",
|
|
|
+ "Other genres that seem popular include weather, book, food and drink, and finance. The book genre seems to overlap a bit with the app idea we described above, but the other genres don't seem too interesting to us:\n",
|
|
|
"\n",
|
|
|
"- Weather apps — people generally don't spend too much time in the app, and the chances of making a profit from in-app ads are low. Also, getting reliable live weather data may require us to connect our apps to non-free APIs.\n",
|
|
|
"\n",
|
|
@@ -1296,7 +1307,9 @@
|
|
|
],
|
|
|
"source": [
|
|
|
"for app in android_final:\n",
|
|
|
- "    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '100,000,000+'):\n",
|
|
|
+ "    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'\n",
|
|
|
+ "                                      or app[5] == '500,000,000+'\n",
|
|
|
+ "                                      or app[5] == '100,000,000+'):\n",
|
|
|
"        print(app[0], ':', app[5])"
|
|
|
]
|
|
|
},
|