Investigate TMDB Movie Dataset

12 min readMay 25, 2020

What kind of movie should I invest in?

Business Understanding

Imagine an investor who would like to know more about the features that can affect the profit. We have the TMDB data to solve his problem. The following questions are the ones I am interested in tackling in the context of a movie investor:

What release season is the peak season that can bring the most profit?
Which movie production company is the most profitable company?
Which movie genre is the most profitable genre?
Which actor is the most profitable actor that can bring the most profit?
How well can we predict a movie’s average vote? What aspects correlate well to its average vote?
How well can we predict a movie’s adjusted profit? What aspects correlate well to its adjusted profit?

Data Understanding

I will use the TMDB data to get some insight into each of these questions. In the rest of this project, I will work along to answer the six questions above.

The dataset I use is the TMDB data. The link for this data is https://d17h27t6h515a5.cloudfront.net/topher/2017/October/59dd1c4c_tmdb-movies/tmdb-movies.csv. As the data will not be clean at first, I will wrangle and clean the data first.

Data wrangling

# Load the movie data 
df = pd.read_csv('tmdb-movies.csv')
pd.pandas.set_option('display.max_columns', None)print(df.head(3))print("The tmdb dataset have {} rows and {} columns".format(df.shape[0], df.shape[1]))

The TMDB dataset has 10866 rows and 21 columns and the top three rows are as follows:

After checking the datatype, null values, and 0 values. I have four data cleaning steps:

Data Cleaning

Drop the columns I do not need in the further data analysis;
Change the data type of the column “realease_date” into DateTime;
Drop the rows with 0 value;
Drop the duplicates.

# === 1.Drop the unrelated columns ===
to_drop =  ['id', 'imdb_id','homepage','tagline','keywords','overview']
df.drop(to_drop, inplace = True, axis = 1)df.isnull().sum()

As we are going to analyze how popularity, revenue, genres, and runtime affect the vote average and profit. These columns do not have any null values except the genres column have 23 null values. I do not think the 23 null values will affect my data analysis result. Therefore, I will do nothing with the null values.

# === 2.Change the data type ===df.release_date = pd.to_datetime(df.release_date)# === 3. Deal with the null values  ===
df.isnull().sum()

There are 1030 rows without the information of production_companies. I do not want to drop them because these rows also contain important information for other questions.
Therefore, I will fill null values in the cast, director, genres, and production_companies with “Unknown” to solve the null value problem.

df.fillna("unknow", inplace = True)

We still have 0 values in some continuous columns, I locate these rows and delete them.

# find the rows with 0 buget or 0 revenue or 0 runtime
zero = (df.budget == 0) | (df.revenue == 0)|(df.runtime == 0)
zero_rows = df.loc[zero]df = df.loc[~zero]

Right now the data frame only has 3854 rows and 15 columns. Our data is ready, let us start to answer the six questions:

1. What release season is the peak season that can bring the most profit?

I create a function peak_season() (Appendix 1)to plot the profit by weekday and month. Therefore, we can check the line charts of the profit in the last 10 years.

# input the calculator column and the time range
peak_season(‘profit’, “weekday”)

I assume that the customers would like to watch movies with the release date at weekends. Because they have enough time to spend on watching movies at weekends. However, the line chart shows that customers would like to watch movies with the release date on Wednesdays. I think that is because Wednesday is in the middle of a week, the customers get to know the movie releases on Wednesday. Then they spend two to three days to decide to watch the movie at weekends or not. That is why the Wednesday released movie can bring a lot of profits.

peak_season('profit', "month")

The line chart shows that June always the quarter with the highest adjusted revenue from 2006 to 2015. In December, there is another peak season. It makes sense because June is in the summertime, people would like to attend more entertainment activities. Then December is the holiday season and there are lots of Christmas Movies in December, so people would like to watch movies during this month.

2. Which movie production company is the most profitable company?

The list of production company column is as follows. We can see that each row the value is a combination of the production companies names separated by “|”. Therefore, I will use the pandas.Series.str.get_dummies to return a binary indicator matrix of the company, then append these dummy variables to my original data frame using pd.concat. I have created a function indicator_by_feature() (Appendix 2) to create dummy columns base on the feature column I choose(production companies, genres, cast), calculate the indicator(mean, sum) of the calculation field(profit, adjusted profit, average vote).

dummies_columns = basic_dataframe[feature_col].str.get_dummies(sep=’|’)
 df_final = pd.concat([basic_dataframe, dummies_columns], axis=1)

I wonder which company has the most total profit. Therefore I use the function indicator_by_feature().

df_com = df.copy()df_com_total_profit = indicator_by_feature(df_com,’production_companies’,’profit’,’sum’)# The top 10 production companies ranked by total profit
df_com_total_profit[:10]

df_com_total_adj_profit = indicator_by_feature(df_com,’production_companies’,’profit_adj’,’sum’)
# The top 10 production companies ranked by total adjusted profit
df_com_total_adj_profit[:10]

The top 10 production companies lists ranked by total profit and total adjusted profit have lots of overlapped companies, for example Warner Bros., Universal Pictures, Paramount Pictures, Twentieth Century Fox Film Corporation, Walt Disney Pictures, Columbia Pictures, New Line Cinema, and Amblin Entertainment.

3. Which movie genre is the most profitable genre?

As the genre column is also a column with a combination of genres in one cell. I will still use the function indicator_by_feature() to get the rank of the movie genre.

df_genre = df.copy()# average profit by each genredf_genre_profit = indicator_by_feature(df_genre,’genres’,’profit’,’avg’)

# average adjusted profit for each genre
df_genre_adj_profit = indicator_by_feature(df_genre,’genres’,’profit_adj’, ‘avg’)
df_genre_adj_profit

The movie genre that has the largest profit and adjusted profit is Animation. We can check the comparison of calculation fields by genre.

4. Which actor is the most profitable actor that can bring the most profit?

Let us check who is the most bankable star that brings the highest profit and adjusted profit. We still use the function indicator_by_feature() to get the answer.

df_actor = df.copy()# check the total profit by each actordf_actor_total_profit = indicator_by_feature(df_actor ,’cast’,’profit’,’sum’)
df_actor_total_profit

df_actor = df.copy()# check the total adjusted profit by each actordf_actor_total_profit = indicator_by_feature(df_actor ,’cast’,’profit_adj’,’sum’)df_actor_total_profit

The star that can bring the highest total profit and the highest total adjusted profit is Harrison Ford. The actor with the second-highest adjusted profit is Tom Cruise and the third one is Tom Hanks.

From the four questions above, we know that the movies that released in June and December or released on Wednesdays,
always earn a lot of money. Big production companies, animation genre, and big stars will always collect the larget proportion of benefits in this industry.
As a movie investor, I want to know how well can we predict a movie’s average vote, profit, and adjusted profit.

5. How well can we predict a movie’s average vote? What aspects correlate well to its average vote?

I want to apply the linear regression model to predict the average vote. Therefore, I need to do more cleaning of the data frame, for example, remove some columns I do not need, and turn some categorical columns into dummy variables.

df_pred = df.copy()# check the datatype 
df_pred.dtypes

The release year is also a DateTime, so I will drop this column at first. I choose two categorical columns to convert to dummy variables. The “director” and “cast” are also categorical columns but I did not choose them because I do not want so many dummy variables.

for col in [‘genres’,’production_companies’]:
 dummies_columns = df_[col].str.get_dummies(sep=’|’)
 df_ = pd.concat([df_, dummies_columns], axis=1)

I only want to use numeric columns:

# Only use the numerical columns 
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
df_model = df_.select_dtypes(include=numerics)

After all these data cleaning steps, let us check how many columns in df_model.

df_model.shapeOut[69]:(3854, 3296)

There are 3296 columns and only 3854 rows in this dataset. The number of rows is a little bigger than that of columns. Therefore, I will do the feature selection with the feature_selection package in sklearn.

Define the X and y at first:

X = df_model.drop('vote_average',1)
y = df_model['vote_average']# pearson’s correlation feature selection for numeric input and numeric output
 # Create and fit selector
 selector = SelectKBest(score_func=f_regression, k=10)
 selector.fit(X, y)

Then we can use the SelectKBest() selection method to find the 10 best features that we will use.

# pearson's correlation feature selection for numeric input and numeric output
# Create and fit selector
selector = SelectKBest(score_func=f_regression, k=10)
selector.fit(X, y)
# Get columns to keep and create new dataframe with those only
cols = selector.get_support(indices=True)
X_selected = X.iloc[:,cols]
print("The columns left are:")
print(X_selected.columns)Out[78]:The columns left are:
Index(['popularity', 'revenue', 'runtime', 'vote_count', 'revenue_adj',
       'profit', 'profit_adj', 'Comedy', 'Drama', 'Horror'],
      dtype='object')

Then let us use split the data into train and test set. Then apply the linear regression model and get back the evaluation indicators of the model.

X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=0)regressor = LinearRegression()  
    regressor.fit(X_train, y_train) #training the algorithmy_pred = regressor.predict(X_test)print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))  
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))  
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    print('R square:',metrics.r2_score(y_test, y_pred))Out[79]:Mean Absolute Error: 0.49745945025019167
Mean Squared Error: 0.4019416764711173
Root Mean Squared Error: 0.63398870374094
R square: 0.3499607819931152

R-Squared(0.349): only 35% of the variability in the vote average can be explained by the model.

6. How well can we predict a movie’s adj profit? What aspects correlate well to its adj profit?

To make the prediction process more productive, I create a function linear_regression()(Appendix 3) to get the evaluation indicators by inputting the target variable’s column name.

The evaluation indicators I get is :

Mean Absolute Error: 78720314.91636488
Mean Squared Error: 1.6824098974243844e+16
Root Mean Squared Error: 129707744.46517773
R square: 0.386260119091398

Evaluation

From the R-squared we know that only 35% of the variability in the vote average can be explained by the model. And only 39% of the variability in the adjusted profit can be explained by the model. In general, the higher the R-squared, the better the model fits the data.

But R-squared cannot determine whether the coefficient estimates and predictions are biased, which is why I assess the residual plots. Also, R-squared does not indicate whether a regression model is adequate. we may have a low R-squared value for a good model, or a high R-squared value for a model that does not fit the data.

Deployment

As a movie investor, from the questions above I know that:

The movie release on Wednesday will bring high profits. Also, July and December are the peak months for the movie industry. so I will choose the right season to release my movie.
The big production companies such as Warner Bros., Universal Pictures, Paramount Pictures, Twentieth Century Fox Film Corporation, Walt Disney Pictures, Columbia Pictures, New Line Cinema, and Amblin Entertainment are in the top 10 production companies lists ranked by total profit and total adjusted profit. To cooperate with the big companies, I will have a high probability to get a great profit.
The genre of Animation can bring the highest average profit. The second genre is Adventure and the third one is Fantasy. The animation is the genre that is popular so that I will think about investing in an animation movie.
The most bankable star that can generate the most average profit is Raquel Welch. Jason Miller has the most average adjusted profit. But the star that can bring the highest total profit and total adjusted profit is Harrison Ford. The bankable stars will attract people to go to see movies.
The vote average and profit can not be predicted well from the linear regression model. I think the reasons why the model does not fit very well are:
The TMDB dataset has 10866 rows and 21 columns. But 7011 rows have 0 values in the budget, revenue, or runtime columns. For further data analysis, I drop the 7011 rows. Removing the 70% of the rows of the dataset, there must be a lot of information missing.
The latest release year of the movies in this dataset is 2015, the results of the analysis can not track the latest changes. There are lots of movies releasing from 2016 to 2020. The movie industry has changed a lot in the past five years. For example, Netflix produces movies, and consumers may choose to watch movies on TV instead of going to the theater. Therefore, I will ask the data engineer more updated data.

More details in the https://github.com/cssamanda0104/Data-Scientist-Nanodegree-Project-1-Write-A-Data-Science-Blog-Post

Appendix:

peak_season()

def peak_season(calculation_col, time_range):
 
 “””
 a multiple line plot shows the latest 10 years’ sum of the calculation column by different time range
 
 input: 
 calculation_col: profit, profit_adj, vote_average
 time_range: weekday, month, quarter
 output:
 None
 
 “””
 
 if time_range == ‘weekday’:
 df_season[‘release_’+ time_range] = df_season[‘release_date’].dt.weekday
 if time_range == ‘month’:
 df_season[‘release_’+ time_range] = df_season[‘release_date’].dt.month
 if time_range == ‘quarter’:
 df_season[‘release’+ time_range] = df_season[‘release_date’].dt.quarter
 
 
 # find out the adjuested revenue for each quarter in each year
 df_timerange_revenue = pd.pivot_table(df_season, values= calculation_col, index=’release_’+ time_range,
 columns=[‘release_year’], aggfunc=np.sum).reset_index()
 # There are some time ranges do not have profit data in some years, I drop the year column when there are NA value
 df_timerange_revenue.dropna(axis = 1, inplace = True)
 # I only want to check the most recent 10 years of quartely profit.
 df_10_year_mr = pd.concat([df_timerange_revenue.iloc[:, 0], df_timerange_revenue.iloc[:, -10:]], axis=1)# Draw a line plot to check the seasonality of the profit curve in differnt time range.
 # style
 plt.style.use(‘seaborn-darkgrid’)
 # create a color palette
 palette = plt.get_cmap(‘Set1’)
 # multiple line plot
 num=0
 plt.figure(figsize=(20,12))
 for column in df_10_year_mr.drop(‘release_’+time_range , axis=1):
 num+=1
 plt.plot(df_10_year_mr[‘release_’+ time_range], df_10_year_mr[column], marker=’’, color=palette(num), linewidth=1, alpha=0.9, label=column)
 # Add legend
 plt.legend(loc=2, ncol=2, fontsize = 14)
 # Add titles
 plt.title(calculation_col + “ by “ + time_range + “ plot”, loc=’center’, fontsize=18, fontweight=0, color=’orange’)
 plt.xlabel(time_range, fontsize=14)
 plt.ylabel(calculation_col, fontsize=14)
 plt.xticks(fontsize=14)
 plt.yticks(fontsize=14);

2. indicator_by_feature()

def indicator_by_feature(basic_dataframe, feature_col, calculation_col, cal_method):
 
 “””
 create dummy columns base on the feature column, calculate the indicator, 
 create a dataframe to store the results.
 
 input:
 basic_dataframe — the basic dataframe which is a copy of df
 feature_col — “geners”, “production_companies”, or “cast”
 calculation_col — “profit” or “profit_adj”
 cal_method — “avg” or “sum”
 
 output:
 a dataframe contains the categories in the feature col and the average indicator
 
 “””
 
 # Use pandas.Series.str.get_dummies to return a binary indicator matrix of genre, 
 # Then append to your original dataframe using pd.concat.
 dummies_columns = basic_dataframe[feature_col].str.get_dummies(sep=’|’)
 df_final = pd.concat([basic_dataframe, dummies_columns], axis=1)
 
 # create a feature list to store all the types in the feature.
 feature_list = df_final.columns[17:].tolist()# create a blank list to store the average indicator
 indicator_list = []for feature in feature_list:
 cnt = df_final[feature].sum()
 total = df_final[df_final[feature] == 1][calculation_col].sum()
 if cal_method == “avg”:
 indicator = total/cnt
 else: 
 indicator = total
 indicator_list.append(indicator)# Create a dataframe to store the production company name and its average vote
 df_feature_indicator = pd.DataFrame(list(zip(feature_list, indicator_list)), 
 columns =[feature_col, cal_method + ‘_’ + calculation_col]) 
 
 # sort the dataframe by the calculation field
 df_feature_indicator.sort_values(by = cal_method + ‘_’+ calculation_col, \
 ascending = False, inplace = True)
 return df_feature_indicator

3. linear_regression()

def linear_regression(col):
 “””
 args: the column name we want to predict
 output : the parameters to test the model
 
 “””
 # get the X and y
 X = df_model.drop(col,1)
 y = df_model[col] 
 
 if col == ‘profit_adj’:
 X = df_model.drop([‘budget_adj’, ‘revenue_adj’,’profit’,’budget’, ‘revenue’],1)# pearson’s correlation feature selection for numeric input and numeric output
 # Create and fit selector
 selector = SelectKBest(score_func=f_regression, k=10)
 selector.fit(X, y)
 # Get columns to keep and create new dataframe with those only
 cols = selector.get_support(indices=True)
 X_selected = X.iloc[:,cols]
 print(“The columns left are:”)
 print(X_selected.columns)
 X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=0)regressor = LinearRegression() 
 regressor.fit(X_train, y_train) #training the algorithmy_pred = regressor.predict(X_test)print(‘Mean Absolute Error:’, metrics.mean_absolute_error(y_test, y_pred)) 
 print(‘Mean Squared Error:’, metrics.mean_squared_error(y_test, y_pred)) 
 print(‘Root Mean Squared Error:’, np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
 print(‘R square:’,metrics.r2_score(y_test, y_pred))