The Algorithms logo
算法
关于我们捐赠

电影推荐系统

H
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Recommendation Engine\n",
    "## Building a Movie Recommendation Engine using MovieLens dataset \n",
    "We will be using a MovieLens dataset. This dataset contains 100004 ratings across 9125 movies for 671 users. All selected users had at least rated 20 movies. \n",
    "We are going to build a recommendation engine which will suggest movies for a user which he hasn't watched yet based on the movies which he has already rated. We will be using k-nearest neighbour algorithm which we will implement from scratch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Movie file contains information like movie id, title, genre of movies and ratings file contains data like user id, movie id, rating and timestamp in which each line after header row represents one rating of one movie by one user."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Grumpier Old Men (1995)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId                               title\n",
       "0        1                    Toy Story (1995)\n",
       "1        2                      Jumanji (1995)\n",
       "2        3             Grumpier Old Men (1995)\n",
       "3        4            Waiting to Exhale (1995)\n",
       "4        5  Father of the Bride Part II (1995)"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movie_file = \"data\\movie_dataset\\movies.csv\"\n",
    "movie_data = pd.read_csv(movie_file, usecols = [0, 1])\n",
    "movie_data.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>userId</th>\n",
       "      <th>movieId</th>\n",
       "      <th>rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>31</td>\n",
       "      <td>2.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1029</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>1061</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1129</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>1172</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   userId  movieId  rating\n",
       "0       1       31     2.5\n",
       "1       1     1029     3.0\n",
       "2       1     1061     3.0\n",
       "3       1     1129     2.0\n",
       "4       1     1172     4.0"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ratings_file = \"data\\\\movie_dataset\\\\ratings.csv\"\n",
    "ratings_info = pd.read_csv(ratings_file, usecols = [0, 1, 2])\n",
    "ratings_info.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>userId</th>\n",
       "      <th>rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>7</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>9</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>13</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>15</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>19</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId             title  userId  rating\n",
       "0        1  Toy Story (1995)       7     3.0\n",
       "1        1  Toy Story (1995)       9     4.0\n",
       "2        1  Toy Story (1995)      13     5.0\n",
       "3        1  Toy Story (1995)      15     2.0\n",
       "4        1  Toy Story (1995)      19     3.0"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movie_info = pd.merge(movie_data, ratings_info, left_on = 'movieId', right_on = 'movieId')\n",
    "movie_info.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>userId</th>\n",
       "      <th>rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>7</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>9</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>13</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>15</td>\n",
       "      <td>2.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>19</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   movieId             title  userId  rating\n",
       "0        1  Toy Story (1995)       7     3.0\n",
       "1        1  Toy Story (1995)       9     4.0\n",
       "2        1  Toy Story (1995)      13     5.0\n",
       "3        1  Toy Story (1995)      15     2.0\n",
       "4        1  Toy Story (1995)      19     3.0"
      ]
     },
     "execution_count": 49,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movie_info.loc[0:10, ['userId']]\n",
    "movie_info[movie_info.title == \"Toy Story (1995)\"].head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movieId</th>\n",
       "      <th>title</th>\n",
       "      <th>userId</th>\n",
       "      <th>rating</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>246</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>671</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2111</th>\n",
       "      <td>36</td>\n",
       "      <td>Dead Man Walking (1995)</td>\n",
       "      <td>671</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2843</th>\n",
       "      <td>50</td>\n",
       "      <td>Usual Suspects, The (1995)</td>\n",
       "      <td>671</td>\n",
       "      <td>4.5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6715</th>\n",
       "      <td>230</td>\n",
       "      <td>Dolores Claiborne (1995)</td>\n",
       "      <td>671</td>\n",
       "      <td>4.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7809</th>\n",
       "      <td>260</td>\n",
       "      <td>Star Wars: Episode IV - A New Hope (1977)</td>\n",
       "      <td>671</td>\n",
       "      <td>5.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      movieId                                      title  userId  rating\n",
       "246         1                           Toy Story (1995)     671     5.0\n",
       "2111       36                    Dead Man Walking (1995)     671     4.0\n",
       "2843       50                 Usual Suspects, The (1995)     671     4.5\n",
       "6715      230                   Dolores Claiborne (1995)     671     4.0\n",
       "7809      260  Star Wars: Episode IV - A New Hope (1977)     671     5.0"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movie_info = pd.DataFrame.sort_values(movie_info, ['userId', 'movieId'], ascending = [0, 1])\n",
    "movie_info.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let us see the number of users and number of movies in our dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "671\n",
      "163949\n"
     ]
    }
   ],
   "source": [
    "num_users = max(movie_info.userId)\n",
    "num_movies = max(movie_info.movieId)\n",
    "print(num_users)\n",
    "print(num_movies)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "how many movies were rated by each user and the number of users rated each movie"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 52,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "547    2391\n",
       "564    1868\n",
       "624    1735\n",
       "15     1700\n",
       "73     1610\n",
       "Name: userId, dtype: int64"
      ]
     },
     "execution_count": 52,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movie_per_user = movie_info.userId.value_counts()\n",
    "movie_per_user.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 53,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Forrest Gump (1994)                          341\n",
       "Pulp Fiction (1994)                          324\n",
       "Shawshank Redemption, The (1994)             311\n",
       "Silence of the Lambs, The (1991)             304\n",
       "Star Wars: Episode IV - A New Hope (1977)    291\n",
       "Name: title, dtype: int64"
      ]
     },
     "execution_count": 53,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "users_per_movie = movie_info.title.value_counts()\n",
    "users_per_movie.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Function to find top N favourite movies of a user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 54,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Grease (1978)', 'Little Mermaid, The (1989)', 'Sound of Music, The (1965)']\n"
     ]
    }
   ],
   "source": [
    "def fav_movies(current_user, N):\n",
    "    # get rows corresponding to current user and then sort by rating in descending order\n",
    "    # pick top N rows of the dataframe\n",
    "    fav_movies = pd.DataFrame.sort_values(movie_info[movie_info.userId == current_user], \n",
    "                                          ['rating'], ascending = [0]) [:N]\n",
    "    # return list of titles\n",
    "    return list(fav_movies.title)\n",
    "\n",
    "print(fav_movies(5, 3))\n",
    "    \n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Lets build recommendation engine now\n",
    "\n",
    "- We will use a neighbour based collaborative filtering model. \n",
    "- The idea is to use k-nearest neighbour algorithm to find neighbours of a user\n",
    "-  We will use their ratings to predict ratings of a movie not already rated by a current user.\n",
    "\n",
    "We will represent movies watched by a user in a vector - the vector will have values for all the movies in our dataset.\n",
    "If a user hasn't rated a movie, it would be represented as NaN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 55,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>movieId</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "      <th>4</th>\n",
       "      <th>5</th>\n",
       "      <th>6</th>\n",
       "      <th>7</th>\n",
       "      <th>8</th>\n",
       "      <th>9</th>\n",
       "      <th>10</th>\n",
       "      <th>...</th>\n",
       "      <th>161084</th>\n",
       "      <th>161155</th>\n",
       "      <th>161594</th>\n",
       "      <th>161830</th>\n",
       "      <th>161918</th>\n",
       "      <th>161944</th>\n",
       "      <th>162376</th>\n",
       "      <th>162542</th>\n",
       "      <th>162672</th>\n",
       "      <th>163949</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>userId</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>4.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 9066 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "movieId  1       2       3       4       5       6       7       8       \\\n",
       "userId                                                                    \n",
       "1           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   \n",
       "2           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   \n",
       "3           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   \n",
       "4           NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN   \n",
       "5           NaN     NaN     4.0     NaN     NaN     NaN     NaN     NaN   \n",
       "\n",
       "movieId  9       10       ...    161084  161155  161594  161830  161918  \\\n",
       "userId                    ...                                             \n",
       "1           NaN     NaN   ...       NaN     NaN     NaN     NaN     NaN   \n",
       "2           NaN     4.0   ...       NaN     NaN     NaN     NaN     NaN   \n",
       "3           NaN     NaN   ...       NaN     NaN     NaN     NaN     NaN   \n",
       "4           NaN     4.0   ...       NaN     NaN     NaN     NaN     NaN   \n",
       "5           NaN     NaN   ...       NaN     NaN     NaN     NaN     NaN   \n",
       "\n",
       "movieId  161944  162376  162542  162672  163949  \n",
       "userId                                           \n",
       "1           NaN     NaN     NaN     NaN     NaN  \n",
       "2           NaN     NaN     NaN     NaN     NaN  \n",
       "3           NaN     NaN     NaN     NaN     NaN  \n",
       "4           NaN     NaN     NaN     NaN     NaN  \n",
       "5           NaN     NaN     NaN     NaN     NaN  \n",
       "\n",
       "[5 rows x 9066 columns]"
      ]
     },
     "execution_count": 55,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "user_movie_rating_matrix = pd.pivot_table(movie_info, values = 'rating', index=['userId'], columns=['movieId'])\n",
    "user_movie_rating_matrix.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we will find the similarity between 2 users by using correlation "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.spatial.distance import correlation\n",
    "import numpy as np\n",
    "def similarity(user1, user2):\n",
    "    # normalizing user1 rating i.e mean rating of user1 for any movie\n",
    "    # nanmean will return mean of an array after ignore NaN values \n",
    "    user1 = np.array(user1) - np.nanmean(user1) \n",
    "    user2 = np.array(user2) - np.nanmean(user2)\n",
    "    \n",
    "    # finding the similarity between 2 users\n",
    "    # finding subset of movies rated by both the users\n",
    "    common_movie_ids = [i for i in range(len(user1)) if user1[i] > 0 and user2[i] > 0]\n",
    "    if(len(common_movie_ids) == 0):\n",
    "        return 0\n",
    "    else:\n",
    "        user1 = np.array([user1[i] for i in common_movie_ids])\n",
    "        user2 = np.array([user2[i] for i in common_movie_ids])\n",
    "        return correlation(user1, user2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " We will now use the similarity function to find the nearest neighbour of a current user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "# nearest_neighbour_ratings function will find the k nearest neighbours of the current user and\n",
    "# then use their ratings to predict the current users ratings for other unrated movies \n",
    "\n",
    "def nearest_neighbour_ratings(current_user, K):\n",
    "    # Creating an empty matrix whose row index is userId and the value\n",
    "    # will be the similarity of that user to the current user\n",
    "    similarity_matrix = pd.DataFrame(index = user_movie_rating_matrix.index, \n",
    "                                    columns = ['similarity'])\n",
    "    for i in user_movie_rating_matrix.index:\n",
    "        # finding the similarity between user i and the current user and add it to the similarity matrix\n",
    "        similarity_matrix.loc[i] = similarity(user_movie_rating_matrix.loc[current_user],\n",
    "                                             user_movie_rating_matrix.loc[i])\n",
    "    # Sorting the similarity matrix in descending order\n",
    "    similarity_matrix = pd.DataFrame.sort_values(similarity_matrix,\n",
    "                                                ['similarity'], ascending= [0])\n",
    "    # now we will pick the top k nearest neighbou\n",
    "    nearest_neighbours = similarity_matrix[:K]\n",
    "\n",
    "    neighbour_movie_ratings = user_movie_rating_matrix.loc[nearest_neighbours.index]\n",
    "\n",
    "    # This is empty dataframe placeholder for predicting the rating of current user using neighbour movie ratings\n",
    "    predicted_movie_rating = pd.DataFrame(index = user_movie_rating_matrix.columns, columns = ['rating'])\n",
    "\n",
    "    # Iterating all movies for a current user\n",
    "    for i in user_movie_rating_matrix.columns:\n",
    "        # by default, make predicted rating as the average rating of the current user\n",
    "        predicted_rating = np.nanmean(user_movie_rating_matrix.loc[current_user])\n",
    "\n",
    "        for j in neighbour_movie_ratings.index:\n",
    "            # if user j has rated the ith movie\n",
    "            if(user_movie_rating_matrix.loc[j,i] > 0):\n",
    "                predicted_rating += ((user_movie_rating_matrix.loc[j,i] -np.nanmean(user_movie_rating_matrix.loc[j])) *\n",
    "                                                    nearest_neighbours.loc[j, 'similarity']) / nearest_neighbours['similarity'].sum()\n",
    "\n",
    "        predicted_movie_rating.loc[i, 'rating'] = predicted_rating\n",
    "\n",
    "    return predicted_movie_rating"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Predicting top N recommendations for a current user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 79,
   "metadata": {},
   "outputs": [],
   "source": [
    "def top_n_recommendations(current_user, N):\n",
    "    predicted_movie_rating = nearest_neighbour_ratings(current_user, 10)\n",
    "    movies_already_watched = list(user_movie_rating_matrix.loc[current_user]\n",
    "                                  .loc[user_movie_rating_matrix.loc[current_user] > 0].index)\n",
    "    \n",
    "    predicted_movie_rating = predicted_movie_rating.drop(movies_already_watched)\n",
    "    \n",
    "    top_n_recommendations = pd.DataFrame.sort_values(predicted_movie_rating, ['rating'], ascending=[0])[:N]\n",
    "    \n",
    "    top_n_recommendation_titles = movie_data.loc[movie_data.movieId.isin(top_n_recommendations.index)]\n",
    "\n",
    "    return list(top_n_recommendation_titles.title)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "finding out the recommendations for a user"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\Users\\erchh\\Anaconda3\\envs\\tensorflow\\lib\\site-packages\\scipy\\spatial\\distance.py:644: RuntimeWarning: invalid value encountered in double_scalars\n",
      "  dist = 1.0 - uv / np.sqrt(uu * vv)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "User's favorite movies are :  ['Shawshank Redemption, The (1994)', 'Father of the Bride Part II (1995)', 'Cast Away (2000)', 'Parent Trap, The (1998)', \"Ocean's Eleven (2001)\"] \n",
      "User's top recommendations are:  ['Godfather, The (1972)', 'Star Wars: Episode V - The Empire Strikes Back (1980)', 'Godfather: Part II, The (1974)']\n"
     ]
    }
   ],
   "source": [
    "current_user = 140\n",
    "print(\"User's favorite movies are : \", fav_movies(current_user, 5),\n",
    "      \"\\nUser's top recommendations are: \", top_n_recommendations(current_user, 3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "We have built a movie recommendation engine using k-nearest neighbour algorithm implemented from scratch. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
关于此算法

推荐引擎

使用 MovieLens 数据集构建电影推荐引擎

我们将使用 MovieLens 数据集。该数据集包含 671 个用户的 9125 部电影的 100004 条评分。所有选定的用户至少评价了 20 部电影。我们将构建一个推荐引擎,根据用户已经评价过的电影,为他推荐他尚未观看过的电影。我们将使用 K 近邻算法,并从头开始实现它。

import pandas as pd

电影文件包含电影 ID、标题、电影类型等信息,评分文件包含用户 ID、电影 ID、评分和时间戳等数据,其中标题行之后的每一行代表一个用户对一部电影的评分。

movie_file = "data\movie_dataset\movies.csv"
movie_data = pd.read_csv(movie_file, usecols = [0, 1])
movie_data.head()
movieId title
0 1 玩具总动员 (1995)
1 2 勇敢者的游戏 (1995)
2 3 风流老爸 (1995)
3 4 等待呼唤 (1995)
4 5 父女情深 2 (1995)
ratings_file = "data\\movie_dataset\\ratings.csv"
ratings_info = pd.read_csv(ratings_file, usecols = [0, 1, 2])
ratings_info.head()
userId movieId rating
0 1 31 2.5
1 1 1029 3.0
2 1 1061 3.0
3 1 1129 2.0
4 1 1172 4.0
movie_info = pd.merge(movie_data, ratings_info, left_on = 'movieId', right_on = 'movieId')
movie_info.head()
movieId title userId rating
0 1 玩具总动员 (1995) 7 3.0
1 1 玩具总动员 (1995) 9 4.0
2 1 玩具总动员 (1995) 13 5.0
3 1 玩具总动员 (1995) 15 2.0
4 1 玩具总动员 (1995) 19 3.0
movie_info.loc[0:10, ['userId']]
movie_info[movie_info.title == "Toy Story (1995)"].head()
movieId title userId rating
0 1 玩具总动员 (1995) 7 3.0
1 1 玩具总动员 (1995) 9 4.0
2 1 玩具总动员 (1995) 13 5.0
3 1 玩具总动员 (1995) 15 2.0
4 1 玩具总动员 (1995) 19 3.0
movie_info = pd.DataFrame.sort_values(movie_info, ['userId', 'movieId'], ascending = [0, 1])
movie_info.head()
movieId title userId rating
246 1 玩具总动员 (1995) 671 5.0
2111 36 死囚漫步 (1995) 671 4.0
2843 50 非常嫌疑犯 (1995) 671 4.5
6715 230 廊桥遗梦 (1995) 671 4.0
7809 260 星球大战:新的希望 (1977) 671 5.0

让我们看看数据集中有多少用户和多少电影

num_users = max(movie_info.userId)
num_movies = max(movie_info.movieId)
print(num_users)
print(num_movies)
671
163949

每个用户评价了多少部电影,以及有多少用户评价了每部电影

movie_per_user = movie_info.userId.value_counts()
movie_per_user.head()
547    2391
564    1868
624    1735
15     1700
73     1610
Name: userId, dtype: int64
users_per_movie = movie_info.title.value_counts()
users_per_movie.head()
Forrest Gump (1994)                          341
Pulp Fiction (1994)                          324
Shawshank Redemption, The (1994)             311
Silence of the Lambs, The (1991)             304
Star Wars: Episode IV - A New Hope (1977)    291
Name: title, dtype: int64

查找用户前 N 部最爱电影的函数

def fav_movies(current_user, N):
    # get rows corresponding to current user and then sort by rating in descending order
    # pick top N rows of the dataframe
    fav_movies = pd.DataFrame.sort_values(movie_info[movie_info.userId == current_user], 
                                          ['rating'], ascending = [0]) [:N]
    # return list of titles
    return list(fav_movies.title)

print(fav_movies(5, 3))
    
    
[&#x27;Grease (1978)&#x27;, &#x27;Little Mermaid, The (1989)&#x27;, &#x27;Sound of Music, The (1965)&#x27;]

现在让我们构建推荐引擎

  • 我们将使用基于邻居的协同过滤模型。
  • 其思想是使用 K 近邻算法来查找用户的邻居
  • 我们将使用他们的评分来预测当前用户尚未评价的电影的评分。

我们将用一个向量来表示用户观看的电影 - 该向量将包含我们数据集中所有电影的值。如果用户没有评价一部电影,它将用 NaN 表示。

user_movie_rating_matrix = pd.pivot_table(movie_info, values = 'rating', index=['userId'], columns=['movieId'])
user_movie_rating_matrix.head()
movieId 1 2 3 4 5 6 7 8 9 10 ... 161084 161155 161594 161830 161918 161944 162376 162542 162672 163949
userId
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN 4.0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN 4.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 行 × 9066 列

现在,我们将使用相关性来查找两个用户之间的相似度

from scipy.spatial.distance import correlation
import numpy as np
def similarity(user1, user2):
    # normalizing user1 rating i.e mean rating of user1 for any movie
    # nanmean will return mean of an array after ignore NaN values 
    user1 = np.array(user1) - np.nanmean(user1) 
    user2 = np.array(user2) - np.nanmean(user2)
    
    # finding the similarity between 2 users
    # finding subset of movies rated by both the users
    common_movie_ids = [i for i in range(len(user1)) if user1[i] &gt; 0 and user2[i] &gt; 0]
    if(len(common_movie_ids) == 0):
        return 0
    else:
        user1 = np.array([user1[i] for i in common_movie_ids])
        user2 = np.array([user2[i] for i in common_movie_ids])
        return correlation(user1, user2)

我们现在将使用相似度函数来查找当前用户的最近邻居

# nearest_neighbour_ratings function will find the k nearest neighbours of the current user and
# then use their ratings to predict the current users ratings for other unrated movies 

def nearest_neighbour_ratings(current_user, K):
    # Creating an empty matrix whose row index is userId and the value
    # will be the similarity of that user to the current user
    similarity_matrix = pd.DataFrame(index = user_movie_rating_matrix.index, 
                                    columns = ['similarity'])
    for i in user_movie_rating_matrix.index:
        # finding the similarity between user i and the current user and add it to the similarity matrix
        similarity_matrix.loc[i] = similarity(user_movie_rating_matrix.loc[current_user],
                                             user_movie_rating_matrix.loc[i])
    # Sorting the similarity matrix in descending order
    similarity_matrix = pd.DataFrame.sort_values(similarity_matrix,
                                                ['similarity'], ascending= [0])
    # now we will pick the top k nearest neighbou
    nearest_neighbours = similarity_matrix[:K]

    neighbour_movie_ratings = user_movie_rating_matrix.loc[nearest_neighbours.index]

    # This is empty dataframe placeholder for predicting the rating of current user using neighbour movie ratings
    predicted_movie_rating = pd.DataFrame(index = user_movie_rating_matrix.columns, columns = ['rating'])

    # Iterating all movies for a current user
    for i in user_movie_rating_matrix.columns:
        # by default, make predicted rating as the average rating of the current user
        predicted_rating = np.nanmean(user_movie_rating_matrix.loc[current_user])

        for j in neighbour_movie_ratings.index:
            # if user j has rated the ith movie
            if(user_movie_rating_matrix.loc[j,i] &gt; 0):
                predicted_rating += ((user_movie_rating_matrix.loc[j,i] -np.nanmean(user_movie_rating_matrix.loc[j])) *
                                                    nearest_neighbours.loc[j, 'similarity']) / nearest_neighbours['similarity'].sum()

        predicted_movie_rating.loc[i, 'rating'] = predicted_rating

    return predicted_movie_rating

预测当前用户的推荐前 N 名

def top_n_recommendations(current_user, N):
    predicted_movie_rating = nearest_neighbour_ratings(current_user, 10)
    movies_already_watched = list(user_movie_rating_matrix.loc[current_user]
                                  .loc[user_movie_rating_matrix.loc[current_user] &gt; 0].index)
    
    predicted_movie_rating = predicted_movie_rating.drop(movies_already_watched)
    
    top_n_recommendations = pd.DataFrame.sort_values(predicted_movie_rating, ['rating'], ascending=[0])[:N]
    
    top_n_recommendation_titles = movie_data.loc[movie_data.movieId.isin(top_n_recommendations.index)]

    return list(top_n_recommendation_titles.title)

查找用户的推荐

current_user = 140
print("User's favorite movies are : ", fav_movies(current_user, 5),
      "\nUser's top recommendations are: ", top_n_recommendations(current_user, 3))
C:\Users\erchh\Anaconda3\envs\tensorflow\lib\site-packages\scipy\spatial\distance.py:644: RuntimeWarning: invalid value encountered in double_scalars
  dist = 1.0 - uv / np.sqrt(uu * vv)
User&#x27;s favorite movies are :  [&#x27;Shawshank Redemption, The (1994)&#x27;, &#x27;Father of the Bride Part II (1995)&#x27;, &#x27;Cast Away (2000)&#x27;, &#x27;Parent Trap, The (1998)&#x27;, &quot;Ocean&#x27;s Eleven (2001)&quot;] 
User&#x27;s top recommendations are:  [&#x27;Godfather, The (1972)&#x27;, &#x27;Star Wars: Episode V - The Empire Strikes Back (1980)&#x27;, &#x27;Godfather: Part II, The (1974)&#x27;]

结论

我们使用从头开始实现的 K 近邻算法构建了一个电影推荐引擎。