
Linear Regression Using Pandas

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `RELIANCE - NSE Stock Data`\n",
    "\n",
    "The file contains RELIANCE - NSE Stock Data from 1-Jan-16 to 6-May-21.\n",
    "\n",
    "The data can be used to forecast future stock prices.\n",
    "\n",
    "It is time-series data from the National Stock Exchange of India.\n",
    "\n",
    "|| `Variable` | `Significance` |\n",
    "| ------------- |:-------------:|:-------------:|\n",
    "|1.|Symbol|Symbol of the listed stock on NSE|\n",
    "|2.|Series|To which series does the stock belong (Equity, Options Future)|\n",
    "|3.|Date|Date of the trade|\n",
    "|4.|Prev Close|Previous day closing value of the stock|\n",
    "|5.|Open Price|Current day opening price of the stock|\n",
    "|6.|High Price|Highest price touched by the stock in current day `(Target Variable)`|\n",
    "|7.|Low Price|Lowest price touched by the stock in current day|\n",
    "|8.|Last Price|The price at which the last trade occurred in current day|\n",
    "|9.|Close Price|Current day closing price of the stock|\n",
    "|10.|Average Price|Average price of the day|\n",
    "|11.|Total Traded Quantity|Total number of stocks traded in current day|\n",
    "|12.|Turnover||\n",
    "|13.|No. of Trades|Current day's total number of trades|\n",
    "|14.|Deliverable Quantity|Current day deliverable quantity to the traders|\n",
    "|15.|% Dly Qt to Traded Qty|`(Deliverable Quantity/Total Traded Quantity)*100`|"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data_path=\"./data/RILO - Copy.csv\"\n",
    "data=pd.read_csv(data_path)\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Renaming the columns to have snake_case naming style. (Just as a convention and for convenience)\n",
    "data.columns=[\"_\".join(column.lower().split()) for column in data.columns]\n",
    "data.columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using `.describe()` on an entire DataFrame we can get a summary of the distribution of continuous variables:\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Checking for null values\n",
    "data.isnull().sum()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### As shown above, we do not have any null values in our dataset. Now we can focus on feature selection and model building."
   ]
  },
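  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# By using the correlation method `.corr()` we can get the relationship between each continuous variable.\n",
    "# `numeric_only=True` (pandas 1.5+) skips non-numeric columns such as symbol, series and date.\n",
    "correlation=data.corr(numeric_only=True)\n",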
    "correlation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### `Matplotlib`\n",
    "\n",
    "Matplotlib is a visualization library in Python for 2D plots of arrays. Matplotlib is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack.\n",
    "\n",
    "One of the greatest benefits of visualization is that it allows us visual access to huge amounts of data in easily digestible visuals. Matplotlib consists of several plots like line, bar, scatter, histogram etc.\n",
    "\n",
    "Matplotlib comes with a wide variety of plots. Plots help us understand trends and patterns and make correlations. They are typically instruments for reasoning about quantitative information.\n",
    "\n",
    "### `Seaborn`\n",
    "\n",
    "Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using seaborn and matplotlib to have a better visualization of correlation\n",
    "import seaborn as sn\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.figure(figsize=(10,8))\n",
    "sn.heatmap(correlation,annot=True,linewidth=1,cmap='PuOr')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the above correlation matrix, we get a general idea of which variables can be treated as features to build our model. Let's list them,\n",
    "considering all the variables having `|corr|>=0.5`:\n",
    "\n",
    "- prev_close\n",
    "- no._of_trades\n",
    "- open_price\n",
    "- low_price\n",
    "- last_price\n",
    "- turnover\n",
    "- close_price\n",
    "- %_dly_qt_to_traded_qty\n",
    "- average_price\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "Now that we have a rough idea about our features, let's confirm their behaviour against the target variable using scatter plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.figure(figsize=(18,18))\n",
    "\n",
    "plt.subplot(3,3,1)\n",
    "plt.scatter(data.prev_close,data.high_price)\n",
    "plt.title('Relation with Previous Closing Price')\n",
    "\n",
    "plt.subplot(3,3,2)\n",
    "plt.scatter(data['no._of_trades'],data.high_price)\n",
    "plt.title('Relation with No. of trades')\n",
    "\n",
    "plt.subplot(3,3,3)\n",
    "plt.scatter(data.open_price,data.high_price)\n",
    "plt.title('Relation with Opening Price')\n",
    "\n",
    "plt.subplot(3,3,4)\n",
    "plt.scatter(data.low_price,data.high_price)\n",
    "plt.title('Relation with Low Price')\n",
    "\n",
    "plt.subplot(3,3,5)\n",
    "plt.scatter(data.last_price,data.high_price)\n",
    "plt.title('Relation with Last Price')\n",
    "\n",
    "plt.subplot(3,3,6)\n",
    "plt.scatter(data.turnover,data.high_price)\n",
    "plt.title('Relation with Turnover')\n",
    "\n",
    "plt.subplot(3,3,7)\n",
    "plt.scatter(data.close_price,data.high_price)\n",
    "plt.title('Relation with Closing Price')\n",
    "\n",
    "plt.subplot(3,3,8)\n",
    "plt.scatter(data['%_dly_qt_to_traded_qty'],data.high_price)\n",
    "plt.title('Relation with Deliverable quantity')\n",
    "\n",
    "plt.subplot(3,3,9)\n",
    "plt.scatter(data.average_price,data.high_price)\n",
    "plt.title('Relation with Average Price')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the visualization above, we can clearly choose the features for the linear-regression model. They are:\n",
    "\n",
    "- prev_close\n",
    "- ~~no._of_trades~~\n",
    "- open_price\n",
    "- low_price\n",
    "- last_price\n",
    "- ~~turnover~~\n",
    "- close_price\n",
    "- ~~%_dly_qt_to_traded_qty~~\n",
    "- average_price\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features=['prev_close','open_price','low_price','last_price','close_price','average_price']\n",
    "X=data[features]\n",
    "X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Target variable\n",
    "y=data.high_price\n",
    "y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# split data into training and validation data, for both features and target\n",
    "# The split is based on a random number generator. Supplying a numeric value to\n",
    "# the random_state argument guarantees we get the same split every time we\n",
    "# run this script.\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "train_X,val_X,train_y,val_y=train_test_split(X,y,test_size=0.2,random_state=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "# Define model\n",
    "model=LinearRegression()\n",
    "# Fit model\n",
    "model.fit(train_X,train_y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We use .score method to get an idea of quality of our model\n",
    "model.score(val_X,val_y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Validation\n",
    "There are many metrics for summarizing model quality, but we'll start with one called `Mean Absolute Error (also called MAE)`. Let's break down this metric starting with the last word, error.\n",
    "\n",
    "\n",
    "`error=actual-predicted`\n",
    "\n",
    "So, if a stock costs Rs. 4000 at some point in time and we predicted it would cost Rs. 3980, the error is Rs. 20.\n",
    "\n",
    "With the MAE metric, we take the absolute value of each error. This converts each error to a non-negative number. We then take the average of those absolute errors. This is our measure of model quality. In plain English:\n",
    "\n",
    "> On average, our predictions are off by about X.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.metrics import mean_absolute_error\n",
    "# Get predicted prices of stock on validation data\n",
    "pred_y=model.predict(val_X)\n",
    "mean_absolute_error(val_y,pred_y)"
   ]
  }
 ],
 "metadata": {
  "interpreter": {
   "hash": "e7370f93d1d0cde622a1f8e1c04877d8463912d04d973331ad4851f04de6915a"
  },
  "kernelspec": {
   "display_name": "Python 3.9.7 64-bit",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.7"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
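About this Algorithm

RELIANCE - NSE Stock Data

The file contains RELIANCE - NSE stock data from 1-Jan-16 to 6-May-21.

The data can be used to forecast future stock prices.

It is time-series data from the National Stock Exchange of India.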

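| | Variable | Significance |
| --- | --- | --- |
| 1. | Symbol | Symbol of the listed stock on NSE |
| 2. | Series | Series to which the stock belongs (Equity, Options Future) |
| 3. | Date | Date of the trade |
| 4. | Prev Close | Previous day's closing value of the stock |
| 5. | Open Price | Current day's opening price of the stock |
| 6. | High Price | Highest price touched by the stock in the current day (Target Variable) |
| 7. | Low Price | Lowest price touched by the stock in the current day |
| 8. | Last Price | Price at which the last trade occurred in the current day |
| 9. | Close Price | Current day's closing price of the stock |
| 10. | Average Price | Average price of the day |
| 11. | Total Traded Quantity | Total number of stocks traded in the current day |
| 12. | Turnover | |
| 13. | No. of Trades | Current day's total number of trades |
| 14. | Deliverable Quantity | Current day's deliverable quantity to the traders |
| 15. | % Dly Qt to Traded Qty | (Deliverable Quantity / Total Traded Quantity) * 100 |
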
import pandas as pd

data_path="./data/RILO - Copy.csv"
data=pd.read_csv(data_path)
data
# Renaming the columns to have snake_case naming style. (Just as a convention and for convenience)
data.columns=["_".join(column.lower().split()) for column in data.columns]
data.columns
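
To see what the renaming expression does, here is a small sketch applied to a few column names taken from the variable table. The subset is chosen only for illustration, and `raw_columns` and `snake_columns` are hypothetical names that do not appear in the notebook.

```python
# A few of the original column names from the variable table (illustrative subset).
raw_columns = ["Prev Close", "Open Price", "No. of Trades", "% Dly Qt to Traded Qty"]

# Lowercase each name, split it on whitespace, and rejoin the words with underscores.
snake_columns = ["_".join(column.lower().split()) for column in raw_columns]

print(snake_columns)
# ['prev_close', 'open_price', 'no._of_trades', '%_dly_qt_to_traded_qty']
```

Names such as `no._of_trades` and `%_dly_qt_to_traded_qty` are not valid Python identifiers, which is why those columns have to be accessed with bracket syntax (`data['no._of_trades']`) rather than attribute syntax.
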
# Using `.describe()` on an entire DataFrame we can get a summary of the distribution of continuous variables:
data.describe()
# Checking for null values
data.isnull().sum()

As shown above, we do not have any null values in our dataset. Now we can focus on feature selection and model building.
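
For comparison, the sketch below shows what `isnull().sum()` would report if a value were missing. The tiny `toy` frame and its prices are made up for illustration and are not part of the RELIANCE data.

```python
import pandas as pd

# Hypothetical three-row frame with one missing opening price.
toy = pd.DataFrame({
    "open_price": [1010.0, None, 1025.5],
    "high_price": [1020.0, 1031.0, 1040.0],
})

# isnull() marks each missing cell as True; sum() counts them per column.
null_counts = toy.isnull().sum()
print(null_counts)
```

A non-zero count for any column would mean the rows need to be dropped or imputed before fitting a model.
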

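# By using the correlation method `.corr()` we can get the relationship between each continuous variable.
# `numeric_only=True` (pandas 1.5+) skips non-numeric columns such as symbol, series and date.
correlation=data.corr(numeric_only=True)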
correlation

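Matplotlib

Matplotlib is a Python visualization library for 2D plots of arrays. It is a multi-platform data visualization library built on NumPy arrays and designed to work with the broader SciPy stack.

One of the greatest benefits of visualization is that it gives us visual access to huge amounts of data in an easily digestible form. Matplotlib offers several plot types, such as line, bar, scatter and histogram plots.

Plots help us understand trends and patterns and make correlations. They are typically instruments for reasoning about quantitative information.

Seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.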

# Using seaborn and matplotlib to have a better visualization of correlation
import seaborn as sn
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
sn.heatmap(correlation,annot=True,linewidth=1,cmap='PuOr')
plt.show()

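From the correlation matrix above, we get a general idea of which variables can be treated as features to build our model. Let's list them, considering all the variables with |corr|>=0.5 against the target: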

  • prev_close
  • no._of_trades
  • open_price
  • low_price
  • last_price
  • turnover
  • close_price
  • %_dly_qt_to_traded_qty
  • average_price
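
The list above was read off the heatmap by eye. The same threshold rule can also be applied programmatically. The following is only a sketch on made-up numbers: the `toy` frame, its `noise` column and the variable names are assumptions for illustration, not the notebook's data.

```python
import pandas as pd

# Hypothetical data: low_price tracks high_price closely, noise does not.
toy = pd.DataFrame({
    "high_price": [100.0, 102.0, 105.0, 103.0, 108.0, 110.0],
    "low_price": [97.0, 99.0, 101.0, 100.0, 104.0, 107.0],
    "noise": [5.0, 1.0, 4.0, 2.0, 3.0, 3.0],
})
target = "high_price"

# Correlation of every column with the target, excluding the target itself.
corr_with_target = toy.corr()[target].drop(target)

# Keep the columns whose absolute correlation meets the 0.5 threshold.
selected = corr_with_target[corr_with_target.abs() >= 0.5].index.tolist()
print(selected)
```

On the real dataset, the same expression with `data.corr(numeric_only=True)` in place of `toy.corr()` would produce the candidate list directly from the correlation matrix.
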

Now that we have a rough idea about our features, let's confirm their behaviour against the target variable using scatter plots.

plt.figure(figsize=(18,18))

plt.subplot(3,3,1)
plt.scatter(data.prev_close,data.high_price)
plt.title('Relation with Previous Closing Price')

plt.subplot(3,3,2)
plt.scatter(data['no._of_trades'],data.high_price)
plt.title('Relation with No. of trades')

plt.subplot(3,3,3)
plt.scatter(data.open_price,data.high_price)
plt.title('Relation with Opening Price')

plt.subplot(3,3,4)
plt.scatter(data.low_price,data.high_price)
plt.title('Relation with Low Price')

plt.subplot(3,3,5)
plt.scatter(data.last_price,data.high_price)
plt.title('Relation with Last Price')

plt.subplot(3,3,6)
plt.scatter(data.turnover,data.high_price)
plt.title('Relation with Turnover')

plt.subplot(3,3,7)
plt.scatter(data.close_price,data.high_price)
plt.title('Relation with Closing Price')

plt.subplot(3,3,8)
plt.scatter(data['%_dly_qt_to_traded_qty'],data.high_price)
plt.title('Relation with Deliverable quantity')

plt.subplot(3,3,9)
plt.scatter(data.average_price,data.high_price)
plt.title('Relation with Average Price')

plt.show()

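From the visualization above, we can clearly choose the features for the linear-regression model. They are listed below, with the dropped variables struck through:

  • prev_close
  • ~~no._of_trades~~
  • open_price
  • low_price
  • last_price
  • ~~turnover~~
  • close_price
  • ~~%_dly_qt_to_traded_qty~~
  • average_price
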
features=['prev_close','open_price','low_price','last_price','close_price','average_price']
X=data[features]
X
# Target variable
y=data.high_price
y
# split data into training and validation data, for both features and target
# The split is based on a random number generator. Supplying a numeric value to
# the random_state argument guarantees we get the same split every time we
# run this script.

from sklearn.model_selection import train_test_split
train_X,val_X,train_y,val_y=train_test_split(X,y,test_size=0.2,random_state=0)
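
The comment above says that a fixed `random_state` gives the same split on every run. A minimal sketch that checks this on synthetic arrays follows; `X_toy` and `y_toy` are stand-ins invented for illustration, not the stock data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Two independent calls with the same seed.
first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Every returned array (train_X, val_X, train_y, val_y) should match pairwise.
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))
val_rows = first[1].shape[0]  # number of validation rows

print(same_split, val_rows)
```

With `test_size=0.2`, 20% of the rows go to validation, so 2 of the 10 synthetic samples end up in `val_X`.
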
from sklearn.linear_model import LinearRegression
# Define model
model=LinearRegression()
# Fit model
model.fit(train_X,train_y)
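
To get a feel for what fitting does, the sketch below fits `LinearRegression` to synthetic data generated from a known linear rule and inspects the learned parameters. The rule `y = 2*x1 + 3*x2 + 5` and all names here are assumptions made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features and a target built from an exact linear rule.
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y_toy = 2 * X_toy[:, 0] + 3 * X_toy[:, 1] + 5

toy_model = LinearRegression().fit(X_toy, y_toy)

# coef_ holds one weight per feature; intercept_ is the constant term.
print(toy_model.coef_, toy_model.intercept_)
```

After fitting the notebook's `model`, the same two attributes (`model.coef_` and `model.intercept_`) show how much weight each of the six price features receives.
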
# We use .score method to get an idea of quality of our model
model.score(val_X,val_y)
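
For scikit-learn regressors such as `LinearRegression`, `.score` returns the coefficient of determination R². A minimal sketch of that quantity computed by hand follows, using made-up prices rather than values from the dataset.

```python
import numpy as np

# Hypothetical actual and predicted prices.
actual = np.array([4000.0, 4010.0, 3990.0, 4025.0])
predicted = np.array([3980.0, 4015.0, 3995.0, 4020.0])

# Residual sum of squares: how far predictions are from the actual values.
ss_res = np.sum((actual - predicted) ** 2)
# Total sum of squares: how far actual values are from their own mean.
ss_tot = np.sum((actual - actual.mean()) ** 2)

r_squared = 1 - ss_res / ss_tot
print(r_squared)
```

A value of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of the actual values.
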

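Model Validation

There are many metrics for summarizing model quality, but we'll start with one called Mean Absolute Error (MAE). Let's break this metric down, starting with the last word, error.

error = actual - predicted

So, if a stock costs Rs. 4000 at some point in time and we predicted it would cost Rs. 3980, the error is Rs. 20.

With the MAE metric, we take the absolute value of each error, which converts each error to a non-negative number. We then take the average of those absolute errors. This is our measure of model quality. In plain English:

On average, our predictions are off by about X.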
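
The definition can be worked through by hand on a few numbers. The prices below are invented for illustration. The first pair reuses the Rs. 4000 / Rs. 3980 example from the text.

```python
# Hypothetical actual and predicted prices (in Rs.).
actual = [4000.0, 4010.0, 3990.0, 4025.0]
predicted = [3980.0, 4015.0, 3995.0, 4020.0]

# error = actual - predicted, one value per prediction.
errors = [a - p for a, p in zip(actual, predicted)]

# Take absolute values so positive and negative errors do not cancel out.
absolute_errors = [abs(e) for e in errors]

# MAE is the mean of the absolute errors.
mae = sum(absolute_errors) / len(absolute_errors)

print(errors)  # [20.0, -5.0, -5.0, 5.0]
print(mae)     # 8.75
```

Here the predictions are off by Rs. 8.75 on average. `sklearn.metrics.mean_absolute_error` computes the same quantity.
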

from sklearn.metrics import mean_absolute_error
# Get predicted prices of stock on validation data
pred_y=model.predict(val_X)
mean_absolute_error(val_y,pred_y)