Recommendation systems are a collection of algorithms that recommend items to users based on information gathered about those users. These systems have become ubiquitous and are commonly seen in online stores, movie databases, and job finders. In this blog post, we will explore content-based and collaborative filtering recommendation systems.

The dataset we’ll be working on has been acquired from GroupLens. It consists of 27 million ratings and 1.1 million tag applications applied to 58,000 movies by 280,000 users.

# import libraries
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# store the movie information into a pandas dataframe
movies_df = pd.read_csv('movies1.csv')

# store the ratings information into a pandas dataframe
ratings_df = pd.read_csv('ratings.csv')
movies_df.head()
movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy

Each movie has a unique ID, a title with its release year along with it (which may contain unicode characters) and several different genres in the same field.

# dimensions of the dataframes
print(movies_df.shape)
print(ratings_df.shape)
(58097, 3)
(27753444, 4)

Preprocessing the data

Let’s remove the year from the ‘title’ column and store it in a new ‘year’ column.

# use a regular expression to find a year stored between parentheses
# we specify the parentheses so we don't conflict with movies that have years in their titles
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

# remove the parentheses
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

# remove the years from the 'title' column (recent pandas versions need regex=True here)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

# apply the strip function to get rid of any trailing whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
movies_df.head()
movieId title genres year
0 1 Toy Story Adventure|Animation|Children|Comedy|Fantasy 1995
1 2 Jumanji Adventure|Children|Fantasy 1995
2 3 Grumpier Old Men Comedy|Romance 1995
3 4 Waiting to Exhale Comedy|Drama|Romance 1995
4 5 Father of the Bride Part II Comedy 1995
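
The two extraction steps can also be collapsed into one by capturing only the digits inside the parentheses. The snippet below is a minimal sketch on a made-up three-row frame (toy_df is a hypothetical name, not the MovieLens data), assuming a pandas version where str.replace accepts regex=True.

```python
import pandas as pd

# hypothetical toy data, only for illustration
toy_df = pd.DataFrame({
    'title': ['Toy Story (1995)', 'Jumanji (1995)', '2001: A Space Odyssey (1968)']
})

# capture only the four digits that sit inside parentheses
toy_df['year'] = toy_df['title'].str.extract(r'\((\d{4})\)', expand=False)

# remove the parenthesised year, then trim the leftover whitespace
toy_df['title'] = toy_df['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

print(toy_df)
```

The third row shows why the parentheses matter: the 2001 in the title is left alone and only 1968 is treated as the release year.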

Let’s also split the values in the ‘genres’ column into a ‘list of genres’ to simplify future use. Apply Python’s split string function on the genres column.

# every genre is separated by a |. So call the split function on |.
movies_df['genres'] = movies_df.genres.str.split('|')
movies_df.head()
movieId title genres year
0 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995
1 2 Jumanji [Adventure, Children, Fantasy] 1995
2 3 Grumpier Old Men [Comedy, Romance] 1995
3 4 Waiting to Exhale [Comedy, Drama, Romance] 1995
4 5 Father of the Bride Part II [Comedy] 1995

Since keeping genres in a list format isn’t optimal for the content-based recommendation technique, we will use one-hot encoding to convert the list of genres into a vector where each column corresponds to one possible value of the feature. This encoding is needed because the calculations that follow operate on numbers, not on categorical labels.
In this case, we store every different genre in its own column, containing either 1 or 0: 1 shows that a movie has that genre and 0 shows that it doesn’t. Let’s also store this dataframe in another variable so that the original movies_df stays unchanged.

# copy the movie dataframe into a new one
moviesWithGenres_df = movies_df.copy()

# for every row in the dataframe, iterate through the list of genres and place a 1 in the corresponding column
for index, row in movies_df.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
        
# fill in the NaN values with 0 to show that a movie doesn't have that column's genre
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()
movieId title genres year Adventure Animation Children Comedy Fantasy Romance ... Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
0 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995 1.0 1.0 1.0 1.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 Jumanji [Adventure, Children, Fantasy] 1995 1.0 0.0 1.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 Grumpier Old Men [Comedy, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4 Waiting to Exhale [Comedy, Drama, Romance] 1995 0.0 0.0 0.0 1.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 Father of the Bride Part II [Comedy] 1995 0.0 0.0 0.0 1.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 24 columns
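
On a table of this size the row-by-row loop can take a while. As a hedged alternative (not the approach used in this post), pandas can build the same kind of 0/1 columns in one call with str.get_dummies, applied to the original pipe-separated string rather than to the list. The frame below is a hypothetical toy example.

```python
import pandas as pd

# hypothetical toy data with genres still stored as 'A|B|C' strings
toy_df = pd.DataFrame({
    'title': ['Toy Story', 'Grumpier Old Men', 'Nixon'],
    'genres': ['Adventure|Animation|Comedy', 'Comedy|Romance', 'Drama'],
})

# one column per genre: 1 if the movie has it, 0 otherwise
genre_flags = toy_df['genres'].str.get_dummies(sep='|')

# attach the flag columns to the original frame
toy_with_genres = pd.concat([toy_df, genre_flags], axis=1)
print(toy_with_genres)
```

Unlike the loop, this produces integer columns in alphabetical order, so the column order differs from the table above.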

Now, let’s focus on the ratings dataframe.

ratings_df.head()
userId movieId rating timestamp
0 1 307 3.5 1256677221
1 1 481 3.5 1256677456
2 1 1091 1.5 1256677471
3 1 1257 4.5 1256677460
4 1 1449 4.5 1256677264

Every row in the ratings dataframe holds a userId, the movieId that user rated, the rating, and a timestamp showing when the rating was made. We won’t be needing the timestamp column, so let’s drop it.

ratings_df = ratings_df.drop(columns='timestamp')
ratings_df.head()
userId movieId rating
0 1 307 3.5
1 1 481 3.5
2 1 1091 1.5
3 1 1257 4.5
4 1 1449 4.5

Content-based recommendation system

This technique attempts to figure out what a user’s favorite aspects of an item are, and then recommends items that present those aspects. In our case, we’re going to try to figure out the input’s favorite genres from the movies and ratings given.

Advantages of content-based filtering:

  • It learns the user’s preferences.
  • It’s highly personalized for the user.

Disadvantages of content-based filtering:

  • It doesn’t take into account what others think of the item, so low-quality recommendations can occur.
  • Extracting data from the items is not always intuitive.
  • Determining which characteristics of the item the user likes or dislikes is not always obvious.

Create an input to recommend movies to.

userInput = [
    {'title':'Mission: Impossible - Fallout', 'rating':5},
    {'title':'Top Gun', 'rating':4.5},
    {'title':'Jerry Maguire', 'rating':3},
    {'title':'Vanilla Sky', 'rating':2.5},
    {'title':'Minority Report', 'rating':4},
]
inputMovies = pd.DataFrame(userInput)
inputMovies
title rating
0 Mission: Impossible - Fallout 5.0
1 Top Gun 4.5
2 Jerry Maguire 3.0
3 Vanilla Sky 2.5
4 Minority Report 4.0

Add movieId to input user.
Extract the input movie’s ID from the movies dataframe and add it to the input.

# filter the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

# merge it to get the movieId
inputMovies = pd.merge(inputId, inputMovies)

# drop information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns=['genres', 'year'])

# final input dataframe
inputMovies
movieId title rating
0 1101 Top Gun 4.5
1 1393 Jerry Maguire 3.0
2 4975 Vanilla Sky 2.5
3 5445 Minority Report 4.0
4 189333 Mission: Impossible - Fallout 5.0
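
The lookup above relies on pd.merge joining the two frames on the column they share, which is 'title'. The sketch below shows the same step on made-up data (catalog and rated are hypothetical names) with the join key spelled out explicitly.

```python
import pandas as pd

# hypothetical catalog and user input, only for illustration
catalog = pd.DataFrame({'movieId': [10, 20, 30], 'title': ['Alpha', 'Beta', 'Gamma']})
rated = pd.DataFrame({'title': ['Gamma', 'Alpha'], 'rating': [4.0, 2.5]})

# inner join on 'title': titles missing from either side are dropped
rated_with_ids = pd.merge(catalog, rated, on='title', how='inner')
print(rated_with_ids)
```

If a title appears more than once in the catalog, the merge returns one row per match, so it is worth checking that the result has the expected number of rows.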

We will learn the input’s preferences. So let’s get the subset of movies that the input has watched from the dataframe containing genres defined with binary values.

# filter out the movies from the input
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies
movieId title genres year Adventure Animation Children Comedy Fantasy Romance ... Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
1079 1101 Top Gun [Action, Romance] 1986 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1361 1393 Jerry Maguire [Drama, Romance] 1996 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4879 4975 Vanilla Sky [Mystery, Romance, Sci-Fi, Thriller] 2001 0.0 0.0 0.0 0.0 0.0 1.0 ... 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5348 5445 Minority Report [Action, Crime, Mystery, Sci-Fi, Thriller] 2002 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
56349 189333 Mission: Impossible - Fallout [Action, Adventure, Thriller] 2018 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 24 columns

We only need the actual genre table. Reset the index and drop the unnecessary columns.

# reset the index
userMovies = userMovies.reset_index(drop=True)

# drop unnecessary columns
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres', 'year'])
userGenreTable
Adventure Animation Children Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Now we learn the input’s preferences.
We turn each genre into a weight by multiplying the input’s ratings into the input’s genre table and then summing the resulting table by column. This is the dot product of the transposed genre table with the ratings.

inputMovies['rating']
0    4.5
1    3.0
2    2.5
3    4.0
4    5.0
Name: rating, dtype: float64
# dot product to get weights
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])

# the user profile
userProfile
Adventure              5.0
Animation              0.0
Children               0.0
Comedy                 0.0
Fantasy                0.0
Romance               10.0
Drama                  3.0
Action                13.5
Crime                  4.0
Thriller              11.5
Horror                 0.0
Mystery                6.5
Sci-Fi                 6.5
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64
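
To see where these numbers come from, the dot product can be reproduced by hand: each genre’s weight is the sum of the ratings of the input movies that carry that genre. The sketch below uses the ratings and genres from the tables above; the variable names are illustrative only.

```python
# (rating, genres) of the five input movies, copied from the tables above
rated_movies = [
    (4.5, ['Action', 'Romance']),                                 # Top Gun
    (3.0, ['Drama', 'Romance']),                                  # Jerry Maguire
    (2.5, ['Mystery', 'Romance', 'Sci-Fi', 'Thriller']),          # Vanilla Sky
    (4.0, ['Action', 'Crime', 'Mystery', 'Sci-Fi', 'Thriller']),  # Minority Report
    (5.0, ['Action', 'Adventure', 'Thriller']),                   # Mission: Impossible - Fallout
]

# add each movie's rating to every genre it belongs to
profile = {}
for rating, genres in rated_movies:
    for genre in genres:
        profile[genre] = profile.get(genre, 0.0) + rating

print(profile['Action'], profile['Romance'], profile['Thriller'])
```

Action, for example, is 4.5 + 4.0 + 5.0 = 13.5, which matches the user profile printed above.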

Now we have the weights for each of the user’s preferences. This is the User Profile. Using this, we can recommend movies that satisfy the user’s preferences.
Let’s start by extracting the genre table from the original dataframe.

# get the genre of every movie in our original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])

# drop unnecessary columns
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres', 'year'])
genreTable.head()
Adventure Animation Children Comedy Fantasy Romance Drama Action Crime Thriller Horror Mystery Sci-Fi IMAX Documentary War Musical Western Film-Noir (no genres listed)
movieId
1 1.0 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 1.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 1.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
genreTable.shape
(58097, 20)

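With the input’s profile and the complete list of movies and their genres in hand, we compute a score for every movie: the sum of the profile weights of its genres, divided by the sum of all profile weights. The movies with the highest scores are the ones that best satisfy the profile, and we want to recommend the top twenty of them.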

# multiply the genre by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1)) / (userProfile.sum())
recommendationTable_df.head()
movieId
1    0.083333
2    0.083333
3    0.166667
4    0.216667
5    0.000000
dtype: float64
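
The scores can be checked by hand against the user profile. The sketch below recomputes two of them in plain Python; the helper name score is hypothetical and the weights are copied from the profile printed earlier.

```python
# genre weights from the user profile shown earlier (zero-weight genres omitted)
user_profile = {
    'Adventure': 5.0, 'Romance': 10.0, 'Drama': 3.0, 'Action': 13.5,
    'Crime': 4.0, 'Thriller': 11.5, 'Mystery': 6.5, 'Sci-Fi': 6.5,
}
total_weight = sum(user_profile.values())

def score(genres):
    """Sum of the profile weights of a movie's genres, divided by the total weight."""
    return sum(user_profile.get(g, 0.0) for g in genres) / total_weight

toy_story = score(['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'])
waiting_to_exhale = score(['Comedy', 'Drama', 'Romance'])
print(round(toy_story, 6), round(waiting_to_exhale, 6))
```

Toy Story only shares Adventure with the profile, giving 5 / 60, while Waiting to Exhale picks up Drama and Romance, giving 13 / 60.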

Here is the recommendation table.

movies_df.loc[movies_df['movieId'].isin(recommendationTable_df.head(20).keys())]
movieId title genres year
0 1 Toy Story [Adventure, Animation, Children, Comedy, Fantasy] 1995
1 2 Jumanji [Adventure, Children, Fantasy] 1995
2 3 Grumpier Old Men [Comedy, Romance] 1995
3 4 Waiting to Exhale [Comedy, Drama, Romance] 1995
4 5 Father of the Bride Part II [Comedy] 1995
5 6 Heat [Action, Crime, Thriller] 1995
6 7 Sabrina [Comedy, Romance] 1995
7 8 Tom and Huck [Adventure, Children] 1995
8 9 Sudden Death [Action] 1995
9 10 GoldenEye [Action, Adventure, Thriller] 1995
10 11 American President, The [Comedy, Drama, Romance] 1995
11 12 Dracula: Dead and Loving It [Comedy, Horror] 1995
12 13 Balto [Adventure, Animation, Children] 1995
13 14 Nixon [Drama] 1995
14 15 Cutthroat Island [Action, Adventure, Romance] 1995
15 16 Casino [Crime, Drama] 1995
16 17 Sense and Sensibility [Drama, Romance] 1995
17 18 Four Rooms [Comedy] 1995
18 19 Ace Ventura: When Nature Calls [Comedy] 1995
19 20 Money Train [Action, Comedy, Crime, Drama, Thriller] 1995

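Note that recommendationTable_df was never sorted, so head(20) returns the first 20 movieIds in the table rather than the 20 highest-scoring movies. To get the actual top 20 for this content-based recommendation system, sort the scores in descending order, for example with sort_values(ascending=False), before calling head(20).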
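
Ranking by score requires sorting before taking the first rows. A minimal sketch, using only the five scores printed earlier (so the result is illustrative, not the real top of the full table):

```python
import pandas as pd

# the five scores printed earlier, indexed by movieId
scores = pd.Series({1: 0.083333, 2: 0.083333, 3: 0.166667, 4: 0.216667, 5: 0.0})

# sort by score first, then keep the best rows
top_two = scores.sort_values(ascending=False).head(2)
print(top_two)
```

The resulting index holds the movieIds in ranked order and can be passed to isin in the same way as before.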

Collaborative Filtering

This technique uses other users to recommend items to the input user. It attempts to find users whose preferences and opinions are similar to the input’s and then recommends items that they have liked. There are several methods of finding similar users; the one we will use here is based on the Pearson correlation coefficient.

The process for creating a user-based recommendation system is as follows:

  • Select a user along with the movies the user has watched.
  • Based on their movie ratings, find the top X neighbours.
  • For each neighbour, get the record of movies they have watched.
  • Calculate a similarity score using some formula.
  • Recommend the items with the highest score.

Advantages of collaborative filtering:

  • It takes other users’ ratings into consideration.
  • It doesn’t need to study or extract information from the recommended item.
  • It adapts to the user’s interests, which might change over time.

Disadvantages of collaborative filtering:

  • The approximation function can be slow.
  • There might be too few users to approximate from.
  • There might be privacy issues when trying to learn the user’s preferences.

Let’s create an input user to recommend movies to.

userInput = [
    {'title':'Mission: Impossible - Fallout', 'rating':5},
    {'title':'Top Gun', 'rating':4.5},
    {'title':'Jerry Maguire', 'rating':3},
    {'title':'Vanilla Sky', 'rating':2.5},
    {'title':'Minority Report', 'rating':4},
]
inputMovies = pd.DataFrame(userInput)
inputMovies
title rating
0 Mission: Impossible - Fallout 5.0
1 Top Gun 4.5
2 Jerry Maguire 3.0
3 Vanilla Sky 2.5
4 Minority Report 4.0
# filter the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]

# merge it to get the movieId
inputMovies = pd.merge(inputId, inputMovies)

# drop information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns=['genres', 'year'])

# final input dataframe
inputMovies
movieId title rating
0 1101 Top Gun 4.5
1 1393 Jerry Maguire 3.0
2 4975 Vanilla Sky 2.5
3 5445 Minority Report 4.0
4 189333 Mission: Impossible - Fallout 5.0

The users who have seen the same movies

Now, with the movie IDs in our input, we can get the subset of users that have watched and reviewed the movies in our input.

# keep only the ratings of movies that the input has watched
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()
userId movieId rating
214 4 1101 4.0
248 4 1393 2.5
586 4 4975 4.0
610 4 5445 4.5
935 8 1393 4.0

Group the rows by userId.

# groupby creates several sub-dataframes whose rows all share the same value in the column passed as the parameter
# passing the column name as a string (not a list) keeps each group key a plain userId
userSubsetGroup = userSubset.groupby('userId')

Let’s look at one of these users, the one with userId 4.

userSubsetGroup.get_group(4)
userId movieId rating
214 4 1101 4.0
248 4 1393 2.5
586 4 4975 4.0
610 4 5445 4.5

Let’s sort these groups so that the users who share the most movies with the input come first. Since we will only go through a limited number of users later on, this makes sure we keep the ones whose similarity is based on the most shared ratings.

userSubsetGroup = sorted(userSubsetGroup, key=lambda x: len(x[1]), reverse=True)

Now let’s look at the first three users.

userSubsetGroup[0:3]
[(214,
         userId  movieId  rating
  20548     214     1101     2.0
  20638     214     1393     3.0
  21122     214     4975     2.0
  21160     214     5445     4.0
  21933     214   189333     3.0),
 (6264,
          userId  movieId  rating
  616485    6264     1101     5.0
  616574    6264     1393     4.0
  617440    6264     4975     3.0
  617480    6264     5445     3.0
  618666    6264   189333     4.0),
 (19924,
           userId  movieId  rating
  1945179   19924     1101     3.5
  1945273   19924     1393     4.0
  1946065   19924     4975     2.0
  1946152   19924     5445     4.0
  1948193   19924   189333     3.5)]
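
Iterating over a GroupBy object yields pairs of group key and sub-dataframe, which is what makes the sorted call above work. The sketch below repeats the step on a hypothetical six-row ratings table so the ordering is easy to follow.

```python
import pandas as pd

# hypothetical ratings, only for illustration
toy_ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 2, 3],
    'movieId': [10, 20, 10, 20, 30, 10],
    'rating':  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

# each element is a (userId, sub-dataframe) pair
groups = toy_ratings.groupby('userId')

# users with the most rated movies come first
by_overlap = sorted(groups, key=lambda pair: len(pair[1]), reverse=True)

print([(user_id, len(frame)) for user_id, frame in by_overlap])
```

The result is a plain Python list, so slicing it keeps the users with the largest overlap.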

Next, we are going to compare users to our specified user and find the one that is most similar.
We’re going to find out how similar each user is to the input through the Pearson Correlation Coefficient. It is used to measure the strength of a linear association between two variables.
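
For reference, the coefficient can be written in the computational form used by the code further down. Here x_i are the input’s ratings, y_i are the other user’s ratings for the same movies, and n is the number of movies the two have in common. The symbols S_xx, S_yy and S_xy mirror the variable names Sxx, Syy and Sxy in that code.

```latex
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}},
\qquad
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n},
\qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n},
\qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
```

The value of r lies between -1 and 1. It is undefined when S_xx or S_yy is zero, that is, when one of the two users gave every shared movie the same rating.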

We will select a subset of users to iterate through. The limit is imposed because we don’t want to waste too much time going through every single user.

userSubsetGroup = userSubsetGroup[0:100]

Calculate the Pearson Correlation between the input user and the subset group, and store it in a dictionary, where the key is the userId and the value is the coefficient.

pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
pearsonCorrelationDict.items()
dict_items([(214, 0.23055616708169335), (6264, 0.518751375933811), (19924, 0.48424799847909467), (21962, 0.7190233885442843), (22361, 0.6163156344279349), (24518, -0.48424799847909017), (28244, -0.22258705026211378), (30387, 0.8339502495593619), (31727, -0.6163156344279349), (32728, -0.26413527189768593), (33550, 0.3774147062120368), (36202, 0.9510441892119876), (38778, 0.5906244232186185), (43227, -0.1968748077395395), (43264, -0.9021937088963177), (48109, 0.0), (50016, 0.04402254531627891), (59611, 0.24946109012559378), (62705, 0.5187513759338097), (63353, -0.4799585206127619), (64733, -0.8524929243380921), (69860, 0.43133109281375515), (70271, -0.08524929243380922), (71857, 0.7781270639007126), (72194, 0.24112141108520613), (75629, 0.6016946526766817), (77609, 0.48224282217041226), (80398, 0.7776587696250218), (81924, -0.32283199898606263), (93997, 0.7771889263740438), (94749, 0.0), (98561, -0.5619806572616304), (99014, -0.23055616708169688), (102101, 0.2516098041413576), (104322, -0.43133109281375515), (105397, 0.8859366348279278), (112491, -0.6469966392206334), (116632, 0.20173664619648324), (117053, 0.5917813771642448), (124357, 0.7635511351031528), (125365, 0.7009130258223497), (128610, 0.45109685444815883), (131687, -0.22874785549890708), (133546, 0.6163156344279386), (148144, 0.10783277320344019), (153921, 0.12056070554260306), (161582, -0.3616821166278092), (167427, 0.10783277320343922), (167835, 0.10783277320343156), (171745, -0.6995593008237843), (173280, -0.2516098041413576), (175811, 0.616315634427937), (184822, 0.07421560439929334), (186859, -0.6163156344279386), (187056, 0.8439249387982215), (189464, 0.2017366461964786), (194365, -0.05547950410915026), (195892, 0.17049858486761843), (199011, 0.6340294594746541), (205765, 0.6163156344279422), (209798, 0.836059669922064), (210651, -0.057639041770424365), (220709, 0.8364283610093444), (221882, -0.18485618263446638), (233580, 0.7009130258223497), (240712, 0.700913025822351), (242708, 
0.04876920665717847), (247867, -0.4528033232531783), (248019, 0.393749615479079), (261170, 0.518751375933811), (261224, 0.5114957546028552), (263973, -0.12888481555661682), (267699, 0.17049858486761843), (271364, 0.7043607250605002), (275841, -0.8364283610093444), (280868, 0.09843740386976975), (4, 0.4216370213557839), (56, 0.12909944487358055), (81, 0.2581988897471611), (147, 0.5502760564641688), (235, 0.0), (239, -0.7302967433402214), (313, -0.7302967433402214), (332, 0.848528137423857), (458, 0.0), (601, 0.6708203932499369), (605, 0.32071349029490925), (864, -0.31622776601683794), (930, -0.5163977794943222), (1073, -0.4242640687119285), (1153, -0.5262348115842176), (1191, 0.0), (1263, 0.0), (1312, 0.9621404708847278), (1367, -0.1414213562373095), (1419, -0.3651483716701107), (1440, 0.6708203932499369), (1513, 0.3651483716701107), (1519, -0.38138503569823695), (1523, -0.38138503569823695)])
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()
similarityIndex userId
0 0.230556 214
1 0.518751 6264
2 0.484248 19924
3 0.719023 21962
4 0.616316 22361
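
Any single coefficient can be verified by hand. The sketch below recomputes the value for user 214 from the ratings printed earlier. The function name pearson is hypothetical, and it uses the mean-centred form of the formula, which is algebraically equivalent to the sums used in the loop above.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equally long rating lists; 0.0 if either has no variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# ratings ordered by movieId: 1101, 1393, 4975, 5445, 189333
input_ratings = [4.5, 3.0, 2.5, 4.0, 5.0]
user_214 = [2.0, 3.0, 2.0, 4.0, 3.0]

r = pearson(input_ratings, user_214)
print(round(r, 6))
```

Both lists must be in the same movie order, which is why the loop sorts the input and each group by movieId before comparing them.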

The top x similar users to the input user

Let’s get the top 50 users that are most similar to the input.

topUsers = pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()
similarityIndex userId
93 0.962140 1312
11 0.951044 36202
35 0.885937 105397
83 0.848528 332
54 0.843925 187056

Now let’s start recommending movies to the input user.

Rating of selected users to all movies

We’re going to do this by taking the weighted average of the ratings of the movies using the Pearson Correlation as the weight. But to do this, we first need to get the movies watched by the users in our pearsonDF from the ratings dataframe, and then store their correlation in a new column called ‘similarityIndex’.

# merge two tables
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()
similarityIndex userId movieId rating
0 0.96214 1312 6 3.0
1 0.96214 1312 19 3.5
2 0.96214 1312 32 2.5
3 0.96214 1312 110 2.5
4 0.96214 1312 150 3.0

Now we multiply each movie rating by its weight (the similarity index), sum the weighted ratings per movie, and divide by the sum of the weights.
We can do this by multiplying two columns, grouping the dataframe by movieId, and then dividing two columns.
The result is a score for every candidate movie that reflects how the similar users rated it.

# multiply the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()
similarityIndex userId movieId rating weightedRating
0 0.96214 1312 6 3.0 2.886421
1 0.96214 1312 19 3.5 3.367492
2 0.96214 1312 32 2.5 2.405351
3 0.96214 1312 110 2.5 2.405351
4 0.96214 1312 150 3.0 2.886421
# group by movieId and sum the similarity and weighted rating columns
tempTopUsersRating = topUsersRating.groupby('movieId')[['similarityIndex','weightedRating']].sum()
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()
sum_similarityIndex sum_weightedRating
movieId
1 24.947499 100.721637
2 22.262128 70.826453
3 8.242517 25.223362
4 2.427828 6.840441
5 12.595882 33.904291
# create an empty dataframe
recommendation_df = pd.DataFrame()

# take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()
weighted average recommendation score movieId
movieId
1 4.037344 1
2 3.181477 2
3 3.060153 3
4 2.817515 4
5 2.691696 5
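
The weighted average for a single movie is simple to work through by hand. The sketch below uses three hypothetical neighbours, then rechecks the score of movieId 1 from the two sums printed above.

```python
# (similarity to the input user, rating given to one movie) for three hypothetical neighbours
neighbour_ratings = [(0.9, 4.0), (0.6, 3.0), (0.3, 5.0)]

sum_weights = sum(sim for sim, _ in neighbour_ratings)
sum_weighted = sum(sim * rating for sim, rating in neighbour_ratings)

# neighbours with a higher similarity pull the score towards their rating
score = sum_weighted / sum_weights
print(round(score, 4))

# movieId 1 from the tables above: sum_weightedRating / sum_similarityIndex
movie_1 = 100.721637 / 24.947499
print(round(movie_1, 4))
```

The plain average of the three toy ratings is 4.0, but the score comes out lower because the most similar neighbour gave 4.0 and the second most similar gave 3.0.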

Let’s sort this and see the top 20 movies that the algorithm recommended.

recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head()
weighted average recommendation score movieId
movieId
4863 5.0 4863
5641 5.0 5641
3777 5.0 3777
3205 5.0 3205
3847 5.0 3847
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(20)['movieId'].tolist())]
movieId title genres year
3118 3205 Black Sunday (La maschera del demonio) [Horror] 1960
3686 3777 Nekromantik [Comedy, Horror] 1987
3754 3847 Ilsa, She Wolf of the SS [Horror] 1974
3876 3970 Beyond, The (E tu vivrai nel terrore - L'aldilà) [Horror] 1981
4767 4863 Female Trouble [Comedy, Crime] 1975
5542 5641 Moderns, The [Drama] 1988
5681 5780 Polyester [Comedy] 1981
5810 5909 Visitor Q (Bizita Q) [Comedy, Drama, Horror] 2001
8549 26007 Unknown Soldier, The (Tuntematon sotilas) [Drama, War] 1955
12542 58425 Heima [Documentary] 2007
12713 59684 Lake of Fire [Documentary] 2006
16931 85181 Pooh's Grand Adventure: The Search for Christo... [Adventure, Animation, Children, Musical] 1997
20049 98198 OMG Oh My God! [Comedy, Drama] 2012
21237 102666 Ivan Vasilievich: Back to the Future (Ivan Vas... [Adventure, Comedy] 1973
22406 106561 Krrish 3 [Action, Adventure, Fantasy, Sci-Fi] 2013
46195 167248 Kedi [(no genres listed)] 2016
50636 176753 Bingo - The King of the Mornings [Comedy, Drama] 2017
51187 177951 Happy! [Fantasy] 2017
53314 182723 Cosmos: A Spacetime Odissey [(no genres listed)] NaN
54462 185227 Brief History of Disbelief [Documentary] 2004

These are the top 20 movies to recommend to the user based on a collaborative filtering recommendation system.
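
A row of perfect 5.0 scores at the top of the table may mean that some movies were rated by very few of the similar users, since a weighted average over a single rating is just that rating. One common safeguard, which is not part of the method used in this post, is to require a minimum number of raters per movie before ranking. The sketch below illustrates the idea on hypothetical data; the threshold min_raters and the column n_raters are assumptions.

```python
import pandas as pd

# hypothetical neighbour ratings: movie 7 was rated by a single neighbour
toy = pd.DataFrame({
    'movieId':         [7,   8,   8,   8],
    'similarityIndex': [0.5, 0.9, 0.8, 0.7],
    'rating':          [5.0, 4.5, 4.0, 4.0],
})
toy['weightedRating'] = toy['similarityIndex'] * toy['rating']

# per-movie sums plus the number of neighbours who rated the movie
per_movie = toy.groupby('movieId').agg(
    sum_similarityIndex=('similarityIndex', 'sum'),
    sum_weightedRating=('weightedRating', 'sum'),
    n_raters=('rating', 'count'),
)
per_movie['score'] = per_movie['sum_weightedRating'] / per_movie['sum_similarityIndex']

min_raters = 2  # assumed threshold, to be tuned on real data
ranked = per_movie[per_movie['n_raters'] >= min_raters].sort_values('score', ascending=False)
print(ranked)
```

Without the filter, movie 7 would rank first with a score of 5.0 on the strength of one rating.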