{ "cells": [ { "cell_type": "markdown", "id": "ba71a55c", "metadata": { "id": "hi2CmTEDvGij" }, "source": [ "# Purpose of Notebook\n", "\n", "The purpose of this notebook is to offer an example solution to the guided project for the Sequential Models for Deep Learning course. Since the choice of model predictors is up to the learner, results can differ. Use this solution as a guide for how to structure your own answer." ] }, { "cell_type": "markdown", "id": "da963ff3", "metadata": { "id": "e65739c8" }, "source": [ "# Time-Series Forecasting on the S&P 500" ] }, { "cell_type": "markdown", "id": "33c20aa2", "metadata": { "id": "18150cb2" }, "source": [ "**Context**: We are working as traders on the S&P 500 futures desk. We have been tasked with building a model to better forecast how this index will move based on its behavior over the past several years. The better our forecast performs, the more effectively and lucratively our desk will be able to trade these futures." ] }, { "cell_type": "markdown", "id": "da06b221", "metadata": { "id": "2e64de5f" }, "source": [ "## 1. Introduction" ] }, { "cell_type": "markdown", "id": "599774a6", "metadata": { "id": "9091a3ff" }, "source": [ "The dataset we will be working with is from [Yahoo Finance via Kaggle](https://www.kaggle.com/datasets/arashnic/time-series-forecasting-with-yahoo-stock-price), and it contains S&P 500 Index prices from 2015 through 2020.\n", "\n", "Before we get into the data, let's set some random seed values to improve the reproducibility of the models we will build later on." ] }, { "cell_type": "code", "execution_count": 1, "id": "830f8d64", "metadata": { "id": "e101a4be" }, "outputs": [], "source": [ "# Imports\n", "import tensorflow as tf\n", "import numpy as np\n", "import random\n", "\n", "# Seed code\n", "np.random.seed(1)\n", "random.seed(1)\n", "tf.random.set_seed(1)" ] }, { "cell_type": "markdown", "id": "961d325b", "metadata": { "id": "240f8d2d" }, "source": [ "## 2. 
Data Wrangling and Exploration" ] }, { "cell_type": "markdown", "id": "9f69912d", "metadata": { "id": "395dc3c2" }, "source": [ "First, we will load in the data and inspect it to determine what steps will be required for cleaning and preprocessing." ] }, { "cell_type": "code", "execution_count": 2, "id": "1fae836e", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 206 }, "id": "f55e074c", "outputId": "f2091de4-28f5-4d49-c8c8-8420aee93d51" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Date High Low Open Close \\\n", "0 2015-11-23 2095.610107 2081.389893 2089.409912 2086.590088 \n", "1 2015-11-24 2094.120117 2070.290039 2084.419922 2089.139893 \n", "2 2015-11-25 2093.000000 2086.300049 2089.300049 2088.870117 \n", "3 2015-11-26 2093.000000 2086.300049 2089.300049 2088.870117 \n", "4 2015-11-27 2093.290039 2084.129883 2088.820068 2090.110107 \n", "\n", " Volume Adj Close \n", "0 3.587980e+09 2086.590088 \n", "1 3.884930e+09 2089.139893 \n", "2 2.852940e+09 2088.870117 \n", "3 2.852940e+09 2088.870117 \n", "4 1.466840e+09 2090.110107 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Import\n", "import pandas as pd\n", "\n", "# Load and inspect the data\n", "stock_data = pd.read_csv(\"yahoo_stock.csv\")\n", "stock_data.head()" ] }, { "cell_type": "markdown", "id": "a606ccb5", "metadata": { "id": "192e4514" }, "source": [ "We can see that the data contains seven columns: `Date`, `High`, `Low`, `Open`, `Close`, `Volume`, and `Adj Close`.\n", "\n", "We will want to set the index of the DataFrame to the `Date` column to prepare for time series forecasting, and decide what other column(s) to use for the forecast itself. For now, we are going to use only the `Adj Close` column, which is the closing price of the S&P 500 index, [adjusted for dividends](https://www.investopedia.com/articles/investing/091015/how-dividends-affect-stock-prices.asp). Based on this decision, we modify the DataFrame to drop the other columns.\n", "\n", "We should also ensure that the data is sorted by its `Date` column." ] }, { "cell_type": "code", "execution_count": 3, "id": "2993d21e", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 238 }, "id": "d4e04bd1", "outputId": "d9068a8e-5a59-4934-ba1b-f036db02772e" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Adj Close\n", "Date \n", "2015-11-23 2086.590088\n", "2015-11-24 2089.139893\n", "2015-11-25 2088.870117\n", "2015-11-26 2088.870117\n", "2015-11-27 2090.110107" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Select relevant columns, sort data, and set index\n", "stock_data = stock_data[[\"Date\", \"Adj Close\"]]\n", "stock_data = stock_data.sort_values(\"Date\")\n", "stock_data = stock_data.set_index(\"Date\")\n", "\n", "# Inspect the data\n", "stock_data.head()" ] }, { "cell_type": "markdown", "id": "aff03554", "metadata": { "id": "c0d02adc" }, "source": [ "We should also double-check that we don't have any missing or erroneous values in our dataset, and consider forward-filling or interpolating if necessary." ] }, { "cell_type": "code", "execution_count": 4, "id": "7502bff3", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ff57c079", "outputId": "199f8938-e72b-4d2a-f8fc-ddeab6256faa" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Info: \n", "\n", "Index: 1825 entries, 2015-11-23 to 2020-11-20\n", "Data columns (total 1 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 Adj Close 1825 non-null float64\n", "dtypes: float64(1)\n", "memory usage: 28.5+ KB\n", "\n", "Describe: \n", " Adj Close\n", "count 1825.000000\n", "mean 2647.856284\n", "std 407.301177\n", "min 1829.079956\n", "25% 2328.949951\n", "50% 2683.340088\n", "75% 2917.520020\n", "max 3626.909912\n", "\n", "Skew: \n", " Adj Close 0.081869\n", "dtype: float64\n" ] } ], "source": [ "# Check for missing or erroneous values\n", "print(\"Info: \")\n", "stock_data.info()\n", "print(\"\\nDescribe: \\n\", stock_data.describe())\n", "print(\"\\nSkew: \\n\", stock_data.skew())" ] }, { "cell_type": "markdown", "id": "a9418c22", "metadata": { "id": "0f794226" }, "source": [ "Great! No missing values, and everything seems to be within a reasonable range. 
The near-zero skew for `Adj Close` (about `0.08`) indicates a roughly symmetric distribution, and the minimum and maximum are both plausible index levels, so we don't see any obvious outliers to be concerned about.\n", "\n", "Before we begin preparing the data for modeling by scaling the variable to be forecasted (`Adj Close`) and splitting the dataset for training, validation, and testing, let's quickly visualize the data." ] }, { "cell_type": "code", "execution_count": 5, "id": "214130bc", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 274 }, "id": "062d8991", "outputId": "e5244674-9c69-49b3-9064-e6687ec13ba7" }, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 'Adjusted Close')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Imports\n", "import matplotlib.pyplot as plt\n", "import matplotlib.dates as mdates\n", "\n", "# Plot the data\n", "plt.plot(stock_data)\n", "\n", "# Add title and axis labels\n", "plt.title('S&P 500 Index')\n", "plt.xlabel('Date')\n", "plt.xticks(rotation=45)\n", "plt.gca().xaxis.set_major_locator(mdates.YearLocator())\n", "plt.gcf().autofmt_xdate()\n", "plt.ylabel('Adjusted Close')" ] }, { "cell_type": "markdown", "id": "731bb411", "metadata": { "id": "bdbb420b" }, "source": [ "The plot is looking great! We can see that we have data from about five years to work with. There are no gaps, and there are some visible dips and spikes.\n", "\n", "Clearly there is a pattern here (it's not just random noise), and we want to build a model that can predict that pattern. Before we can do that, we need to preprocess the data so we can build and train an RNN model. Let's move on to that data preprocessing." ] }, { "cell_type": "markdown", "id": "455fda72", "metadata": { "id": "743cc550" }, "source": [ "## 3. Data Preprocessing" ] }, { "cell_type": "markdown", "id": "17a6d799", "metadata": { "id": "8454185d" }, "source": [ "Before we can build and train an RNN model to make forecasts based on this data, we need to complete two preprocessing steps:\n", "- Split the data into `train`, `validation`, and `test` sets, using 50% of the data for training and 25% each for validation and testing.\n", "- Scale the data to between `0` and `1`, fitting the scaler to the training data and using the fitted scaler to scale all three datasets." 
] }, { "cell_type": "code", "execution_count": 6, "id": "776a16da", "metadata": { "id": "007b1366" }, "outputs": [], "source": [ "# Import\n", "from sklearn.preprocessing import MinMaxScaler\n", "\n", "# Split into train, validation, and test sets\n", "train_size = int(len(stock_data) * 0.5)\n", "validation_size = int(len(stock_data) * 0.25)\n", "train_df = stock_data.iloc[0:train_size, :]\n", "validation_df = stock_data.iloc[train_size:train_size + validation_size, :]\n", "test_df = stock_data.iloc[train_size + validation_size:len(stock_data), :]\n", "\n", "# Fit scaler\n", "scaler = MinMaxScaler()\n", "scaler.fit(train_df)\n", "\n", "# Scale data\n", "train = pd.DataFrame(scaler.transform(train_df), columns=['Adj Close'], index=train_df.index)\n", "validation = pd.DataFrame(scaler.transform(validation_df), columns=['Adj Close'], index=validation_df.index)\n", "test = pd.DataFrame(scaler.transform(test_df), columns=['Adj Close'], index=test_df.index)" ] }, { "cell_type": "markdown", "id": "e4886142", "metadata": { "id": "0dbb236a" }, "source": [ "We will also want to shape our data into fixed-length time windows and reshape it into NumPy arrays to prepare it for TensorFlow models. We will do this in a way that is repeatable and does not overwrite our `train`, `validation`, and `test` variables so that we have the freedom to modify this window size later." 
] }, { "cell_type": "code", "execution_count": 7, "id": "308a8e29", "metadata": { "id": "1959002c" }, "outputs": [], "source": [ "# Define a helper function to construct windowed datasets\n", "def create_dataset(dataset, window_size=1):\n", "    # Each sample is window_size consecutive values; its target is the value that follows\n", "    data_x, data_y = [], []\n", "    # Note: the extra - 1 leaves the final possible window unused; the plotting offsets later rely on this\n", "    for i in range(len(dataset) - window_size - 1):\n", "        window = dataset.iloc[i:(i + window_size), 0]\n", "        target = dataset.iloc[i + window_size, 0]\n", "        data_x.append(window)\n", "        data_y.append(target)\n", "    return np.array(data_x), np.array(data_y)\n", "\n", "# Set the desired window size\n", "window_size = 10\n", "\n", "# Construct train, validation, and test datasets\n", "X_train, y_train = create_dataset(train, window_size)\n", "X_validation, y_validation = create_dataset(validation, window_size)\n", "X_test, y_test = create_dataset(test, window_size)\n", "\n", "# Reshape to 3D (samples, time steps, features): one time step with window_size features\n", "X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))\n", "X_validation = np.reshape(X_validation, (X_validation.shape[0], 1, X_validation.shape[1]))\n", "X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))" ] }, { "cell_type": "markdown", "id": "0c065df2", "metadata": { "id": "14a5a3de" }, "source": [ "## 4. Build and Train a Basic RNN Model" ] }, { "cell_type": "markdown", "id": "72947bde", "metadata": { "id": "e230e893" }, "source": [ "Now comes the fun part! We've thoroughly prepared our data for modeling, and now we need to build a TensorFlow model to make forecasts. Let's start with a `SimpleRNN` model." 
] }, { "cell_type": "code", "execution_count": 8, "id": "4273d727-07c3-4782-b1dc-f0b85985f871", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " simple_rnn (SimpleRNN) (None, 10) 210 \n", " \n", " dense (Dense) (None, 10) 110 \n", " \n", " dense_1 (Dense) (None, 1) 11 \n", " \n", "=================================================================\n", "Total params: 331\n", "Trainable params: 331\n", "Non-trainable params: 0\n", "_________________________________________________________________\n" ] } ], "source": [ "# Imports\n", "from tensorflow import keras\n", "from tensorflow.keras import layers\n", "\n", "# Build the model\n", "model = tf.keras.Sequential()\n", "model.add(tf.keras.layers.SimpleRNN(10, input_shape=(1, window_size), activation='relu'))\n", "model.add(tf.keras.layers.Dense(10, activation='relu'))\n", "model.add(tf.keras.layers.Dense(1))\n", "model.compile(optimizer='adam', loss='mean_squared_error')\n", "model.summary()" ] }, { "cell_type": "markdown", "id": "a709fce4", "metadata": { "id": "ca885bb7" }, "source": [ "Great! And now that we've built the model, let's train it on our training dataset and evaluate its performance using the validation dataset. Note that we are _not_ using the `test` dataset yet because we want to have a clean, untouched testing dataset for our final model evaluation. We may have skipped this step in previous lessons for simplicity, but it is a best practice to set aside an untouched testing set during the model optimization process." 
] }, { "cell_type": "code", "execution_count": 9, "id": "5de4ac69", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "03c24e26", "outputId": "7e5c6c1f-e9e5-4f7a-ddb3-a1e9c6018295" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "29/29 [==============================] - 1s 2ms/step - loss: 0.5889\n", "14/14 [==============================] - 0s 1ms/step\n", "-92.97549395321612\n" ] } ], "source": [ "# Import\n", "from sklearn.metrics import r2_score\n", "\n", "# Train the model (Keras defaults to a single epoch)\n", "model.fit(X_train, y_train)\n", "\n", "# Make predictions and evaluate\n", "y_pred = model.predict(X_validation)\n", "print(r2_score(y_validation, y_pred))" ] }, { "cell_type": "markdown", "id": "75199ecf", "metadata": { "id": "475190db" }, "source": [ "Yikes! An R-Squared of about `-93` means the model performs far worse than simply predicting the mean of the validation targets, which is not surprising after a single training epoch. Let's see if we can improve upon this model." ] }, { "cell_type": "markdown", "id": "2e42a94c", "metadata": { "id": "0d2f80d5" }, "source": [ "## 5. Build and Train an LSTM Model" ] }, { "cell_type": "markdown", "id": "1b247614", "metadata": { "id": "c4211a13" }, "source": [ "Let's repeat the above steps for an LSTM model and see if we can improve performance." 
] }, { "cell_type": "code", "execution_count": 10, "id": "d51660b4", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6dc34b29", "outputId": "cf5e5f9d-cb9f-4c33-c163-73bebc393ce8" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_1\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " lstm (LSTM) (None, 10) 840 \n", " \n", " dense_2 (Dense) (None, 10) 110 \n", " \n", " dense_3 (Dense) (None, 1) 11 \n", " \n", "=================================================================\n", "Total params: 961\n", "Trainable params: 961\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "29/29 [==============================] - 2s 2ms/step - loss: 0.3164\n", "14/14 [==============================] - 0s 2ms/step\n", "-60.02999289943477\n" ] } ], "source": [ "# Build the model\n", "model = tf.keras.Sequential()\n", "model.add(tf.keras.layers.LSTM(10, input_shape=(1, window_size), activation='relu'))\n", "model.add(tf.keras.layers.Dense(10, activation='relu'))\n", "model.add(tf.keras.layers.Dense(1))\n", "model.compile(optimizer='adam', loss='mean_squared_error')\n", "model.summary()\n", "\n", "# Train the model\n", "model.fit(X_train, y_train)\n", "\n", "# Make predictions and evaluate\n", "y_pred = model.predict(X_validation)\n", "print(r2_score(y_validation, y_pred))" ] }, { "cell_type": "markdown", "id": "47c76a36", "metadata": { "id": "e82aa818" }, "source": [ "That helped a little (about `-60` versus `-93`), but the score is still far below zero. For now, let's keep the LSTM in place until we're ready to fully optimize the model, and we can decide then whether it's worth keeping. Let's move on to the next section to see what other techniques we can try to improve this model." 
] }, { "cell_type": "markdown", "id": "d77306da", "metadata": { "id": "d265d334" }, "source": [ "## 6. Add a Convolutional Layer" ] }, { "cell_type": "markdown", "id": "4ba0e666", "metadata": { "id": "c46ad12c" }, "source": [ "Next, we are going to branch out a bit from the basic RNN or LSTM model and try adding a convolutional layer to see if this improves the model performance." ] }, { "cell_type": "code", "execution_count": 11, "id": "5f417499", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9afe33e8", "outputId": "480766cd-3378-4dde-bcff-e93662070621" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_2\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " conv1d (Conv1D) (None, 1, 64) 704 \n", " \n", " max_pooling1d (MaxPooling1D (None, 1, 64) 0 \n", " ) \n", " \n", " lstm_1 (LSTM) (None, 10) 3000 \n", " \n", " dense_4 (Dense) (None, 10) 110 \n", " \n", " dense_5 (Dense) (None, 1) 11 \n", " \n", "=================================================================\n", "Total params: 3,825\n", "Trainable params: 3,825\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "29/29 [==============================] - 2s 2ms/step - loss: 0.2094\n", "14/14 [==============================] - 0s 2ms/step\n", "-23.86714366881604\n" ] } ], "source": [ "# Build the model\n", "model = tf.keras.Sequential()\n", "model.add(tf.keras.layers.Conv1D(64, 1, activation=\"relu\", input_shape=(1, window_size)))\n", "model.add(tf.keras.layers.MaxPooling1D(1))\n", "model.add(tf.keras.layers.LSTM(10, activation='relu'))\n", "model.add(tf.keras.layers.Dense(10, activation='relu'))\n", "model.add(tf.keras.layers.Dense(1))\n", "model.compile(optimizer='adam', loss='mean_squared_error')\n", "model.summary()\n", "\n", "# Train the model\n", 
"model.fit(X_train, y_train)\n", "\n", "# Make predictions and evaluate\n", "y_pred = model.predict(X_validation)\n", "print(r2_score(y_validation, y_pred))" ] }, { "cell_type": "markdown", "id": "a384b2aa", "metadata": { "id": "7c1aed13" }, "source": [ "These are still pretty terrible results. Let's go further and try modifying other model parameters to fully optimize this model, at which point we may or may not keep the convolutional layer." ] }, { "cell_type": "markdown", "id": "58aa96e7", "metadata": { "id": "38621317" }, "source": [ "## 7. Optimize the Model" ] }, { "cell_type": "markdown", "id": "e00a41a0", "metadata": { "id": "6475171f" }, "source": [ "Now we can go further to optimize this model by adding layers, changing the number of nodes in each layer, increasing the number of training epochs, and modifying the window size." ] }, { "cell_type": "code", "execution_count": 12, "id": "0d2da317", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "c43d8153", "outputId": "de7cb21f-2cef-4e2e-d0db-490251875dbc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model: \"sequential_3\"\n", "_________________________________________________________________\n", " Layer (type) Output Shape Param # \n", "=================================================================\n", " conv1d_1 (Conv1D) (None, 1, 128) 3328 \n", " \n", " max_pooling1d_1 (MaxPooling (None, 1, 128) 0 \n", " 1D) \n", " \n", " lstm_2 (LSTM) (None, 64) 49408 \n", " \n", " dense_6 (Dense) (None, 32) 2080 \n", " \n", " dense_7 (Dense) (None, 16) 528 \n", " \n", " dense_8 (Dense) (None, 1) 17 \n", " \n", "=================================================================\n", "Total params: 55,361\n", "Trainable params: 55,361\n", "Non-trainable params: 0\n", "_________________________________________________________________\n", "Epoch 1/35\n", "28/28 [==============================] - 2s 3ms/step - loss: 0.0612\n", "Epoch 2/35\n", "28/28 
[==============================] - 0s 4ms/step - loss: 0.0034\n", "Epoch 3/35\n", "28/28 [==============================] - 0s 4ms/step - loss: 0.0017\n", "Epoch 4/35\n", "28/28 [==============================] - 0s 4ms/step - loss: 0.0015\n", "Epoch 5/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 0.0014\n", "Epoch 6/35\n", "28/28 [==============================] - 0s 4ms/step - loss: 0.0012\n", "Epoch 7/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 9.5635e-04\n", "Epoch 8/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 7.9142e-04\n", "Epoch 9/35\n", "28/28 [==============================] - 0s 4ms/step - loss: 6.6951e-04\n", "Epoch 10/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 6.4106e-04\n", "Epoch 11/35\n", "28/28 [==============================] - 0s 4ms/step - loss: 5.3615e-04\n", "Epoch 12/35\n", "28/28 [==============================] - 0s 4ms/step - loss: 4.8200e-04\n", "Epoch 13/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 4.1122e-04\n", "Epoch 14/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.8065e-04\n", "Epoch 15/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.7033e-04\n", "Epoch 16/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.8822e-04\n", "Epoch 17/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.6025e-04\n", "Epoch 18/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.2716e-04\n", "Epoch 19/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.4214e-04\n", "Epoch 20/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 4.9653e-04\n", "Epoch 21/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.8275e-04\n", "Epoch 22/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 4.0230e-04\n", "Epoch 23/35\n", "28/28 [==============================] - 0s 3ms/step - 
loss: 2.9043e-04\n", "Epoch 24/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 4.2905e-04\n", "Epoch 25/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.6350e-04\n", "Epoch 26/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.2698e-04\n", "Epoch 27/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 2.6212e-04\n", "Epoch 28/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 2.8240e-04\n", "Epoch 29/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.0767e-04\n", "Epoch 30/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 3.0365e-04\n", "Epoch 31/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 2.6077e-04\n", "Epoch 32/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 2.4851e-04\n", "Epoch 33/35\n", "28/28 [==============================] - 0s 4ms/step - loss: 3.0939e-04\n", "Epoch 34/35\n", "28/28 [==============================] - 0s 3ms/step - loss: 2.7302e-04\n", "Epoch 35/35\n", "28/28 [==============================] - 0s 4ms/step - loss: 3.2615e-04\n", "14/14 [==============================] - 0s 2ms/step\n", "\n", "R-Squared value on validation set: 0.9448038342648208\n" ] } ], "source": [ "# Set the desired window size\n", "window_size = 25\n", "\n", "# Construct train, validation, and test datasets\n", "X_train, y_train = create_dataset(train, window_size)\n", "X_validation, y_validation = create_dataset(validation, window_size)\n", "X_test, y_test = create_dataset(test, window_size)\n", "\n", "# Reshape into NumPy arrays\n", "X_train = np.reshape(X_train, (X_train.shape[0], 1, X_train.shape[1]))\n", "X_validation = np.reshape(X_validation, (X_validation.shape[0], 1, X_validation.shape[1]))\n", "X_test = np.reshape(X_test, (X_test.shape[0], 1, X_test.shape[1]))\n", "\n", "# Build the model\n", "model = tf.keras.Sequential()\n", "model.add(tf.keras.layers.Conv1D(128, 1, 
activation=\"relu\", input_shape=(1, window_size)))\n", "model.add(tf.keras.layers.MaxPooling1D(1))\n", "model.add(tf.keras.layers.LSTM(64, activation='relu'))\n", "model.add(tf.keras.layers.Dense(32, activation='relu'))\n", "model.add(tf.keras.layers.Dense(16, activation='relu'))\n", "model.add(tf.keras.layers.Dense(1))\n", "model.compile(optimizer='adam', loss='mean_squared_error')\n", "model.summary()\n", "\n", "# Train the model\n", "model.fit(X_train, y_train, epochs=35)\n", "\n", "# Make predictions and evaluate\n", "y_pred = model.predict(X_validation)\n", "print(f\"\\nR-Squared value on validation set: {r2_score(y_validation, y_pred)}\")" ] }, { "cell_type": "markdown", "id": "9d5e554f", "metadata": { "id": "3e6c545a" }, "source": [ "Overall, this is starting to look really good! We are achieving an R-Squared value between `0.90` and `0.95` on the validation set, which is impressive. As a final performance check, we should now compute and visualize the performance on the testing set." ] }, { "cell_type": "markdown", "id": "657d94b7", "metadata": { "id": "9802018a" }, "source": [ "## 8. Evaluate Model Performance" ] }, { "cell_type": "markdown", "id": "300e297d", "metadata": { "id": "a83ac44f" }, "source": [ "Finally! We've settled on a model that we can be satisfied with, so let's use this model to make predictions on the testing set and compute our final R-Squared. While we're at it, since we'll need them for plotting, let's make predictions on all three sets: training, validation, and testing." 
] }, { "cell_type": "code", "execution_count": 13, "id": "90848e5b", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "dcdc7f91", "outputId": "59646323-152d-4165-b8d7-114aedfaa0db" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "28/28 [==============================] - 0s 2ms/step\n", "14/14 [==============================] - 0s 2ms/step\n", "14/14 [==============================] - 0s 2ms/step\n", "0.9959827878740425 --> Training Set\n", "0.9448038342648208 --> Validation Set\n", "0.9372843429943125 --> Test Set\n" ] } ], "source": [ "# Make predictions on all three sets\n", "train_pred = model.predict(X_train)\n", "validation_pred = model.predict(X_validation)\n", "test_pred = model.predict(X_test)\n", "\n", "print(r2_score(y_train, train_pred), \" --> Training Set\")\n", "print(r2_score(y_validation, validation_pred), \" --> Validation Set\")\n", "print(r2_score(y_test, test_pred), \" --> Test Set\")" ] }, { "cell_type": "markdown", "id": "11c5db85", "metadata": { "id": "c1562f53" }, "source": [ "Excellent! The R-Squared value from validation seems to have held up in testing, which is a great sign. Now it's time to visualize this performance. First, we'll need to undo the scaling and windowing preprocessing we've done." 
] }, { "cell_type": "code", "execution_count": 14, "id": "e3210c8c", "metadata": { "id": "b97cd8f1" }, "outputs": [], "source": [ "# Un-scale the predictions\n", "train_pred = scaler.inverse_transform(train_pred)\n", "validation_pred = scaler.inverse_transform(validation_pred)\n", "test_pred = scaler.inverse_transform(test_pred)\n", "\n", "# Un-window the training predictions\n", "plot_train_pred = np.empty((len(stock_data), 1))\n", "plot_train_pred[:] = np.nan\n", "plot_train_pred[window_size:len(train_pred) + window_size, :] = train_pred\n", "\n", "# Un-window the validation predictions\n", "plot_validation_pred = np.empty((len(stock_data), 1))\n", "plot_validation_pred[:] = np.nan\n", "plot_validation_pred[len(train_pred) + (window_size * 2) + 1:len(train_pred) + len(validation_pred) + (window_size * 2) + 1, :] = validation_pred\n", "\n", "# Un-window the test predictions\n", "plot_test_pred = np.empty((len(stock_data), 1))\n", "plot_test_pred[:] = np.nan\n", "plot_test_pred[len(train_pred) + len(validation_pred) + (window_size * 3) + 2:len(stock_data) - 1, :] = test_pred" ] }, { "cell_type": "markdown", "id": "4778425a", "metadata": { "id": "0eaebce1" }, "source": [ "Finally, let's plot the un-scaled and un-windowed data on top of the original `stock_data` dataset." ] }, { "cell_type": "code", "execution_count": 15, "id": "0ed569bf", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 257 }, "id": "b3667f57", "outputId": "f98e4bca-5a8b-4bb9-85c3-4fea7a1235b3" }, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Plot the original data\n", "plt.plot(stock_data, label=\"Original Data\")\n", "\n", "# Plot the predictions\n", "plt.plot(plot_train_pred, label=\"Training Set Predictions\")\n", "plt.plot(plot_validation_pred, label=\"Validation Set Predictions\")\n", "plt.plot(plot_test_pred, label=\"Test Set Predictions\")\n", "\n", "# Add title, axis labels, and a legend\n", "plt.title('S&P 500 Index Forecast')\n", "plt.xlabel('Date')\n", "plt.xticks(rotation=45)\n", "plt.gca().xaxis.set_major_locator(mdates.YearLocator())\n", "plt.gcf().autofmt_xdate()\n", "plt.ylabel('Adjusted Close')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "ca33d209", "metadata": { "id": "32ce7009" }, "source": [ "The visualization looks as great as the R-Squared suggested it might. The real test, of course, would be to make some predictions for the future and do some trades (or pretend to do some trades — \"paper trading\" as we call it), then see if we can make any money. Investment advice is way outside the scope of this course, and the stock market has a way of being a \"harsh teacher,\" so please be cautious and remember that this project was only meant to be fun and educational! If successfully predicting the market trends was \"this easy,\" everyone would do it." ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.6" } }, "nbformat": 4, "nbformat_minor": 5 }