Case Study: Data Mining Goes to Hollywood

Altaf Safi

Data Modelling Analyst
Data Scientist
Data Analyst
Predicting the financial success of movies is crucial for Hollywood professionals because the movie industry is a highly competitive business, and investing in movies requires a significant amount of capital and investment. Without accurate predictions of a movie's potential box-office receipts, investors can be at risk of losing their investment if the movie ends up performing poorly. The challenge of forecasting box-office earnings for a specific movie has been considered difficult and daunting by many scholars and industry leaders due to the uncertainty of product demand, which makes it impossible to accurately gauge its performance until it is screened in a darkened theater and elicits a response from the audience. Predicting the financial success of a movie before it enters production can help investors and studios make better decisions about which movies to invest in and how much to invest, thereby minimizing the risk and maximizing the financial return. 
Employing data mining methods like neural networks, decision trees, and support vector machines can aid Hollywood professionals in scrutinizing and refining decision variables to meet their financial objectives. Additionally, having a reliable prediction model like the use of a variety of data mining techniques can help producers and studios make informed decisions on marketing and distribution strategies, casting choices, and production budgets, leading to more efficient and effective use of resources. In sum, when producers or other investors use data mining techniques to make accurate predictions, it will eventually help studios to optimize marketing strategies, release dates, and other decision variables to maximize the movie's success in the marketplace. 
Before the production process of a movie starts, data mining techniques can be employed to forecast its financial success by transforming the regression problem into a classification problem. Rather than predicting the exact point estimate of box-office earnings, the aim is to categorize the potential box-office receipts of the movie into one of nine classes, which range from “flop” to “blockbuster.” This results in a multinomial classification problem that aids in determining the classification of the movie. To achieve this, data is collected from various movie-related databases like ShowBiz and IMDB and consolidated into a single dataset that contains independent variables and their specifications, all of which are movies released from a certain time periods like 1998 to 2006 so it serves as an accurate prediction of recent era customer demand and responses. The dataset that is used for prediction models should contain a considerably high amount of movies as a sample size like 2,632 movies to give reliable results. Also, the dataset contains information on independent variables such as the MPAA Rating, Competition, Star value, Genre, Special Effects, Sequel, and Number of Screens.
Once a dataset is available, a variety of data mining methods, such as neural networks, decision trees, support vector machines, and ensembles, are applied to develop the prediction model. Researchers in this case used data from 1988 to 2005 as training data to build the prediction models and the data from 2006 as test data to evaluate and compare the models' prediction accuracy. The prediction model provides accurate results and can also be used to analyze and optimize decision variables to maximize financial returns. Lastly, to measure the performance of prediction models, there are a variety of performance measures like Count(Bingo), Count(1-Away), Accuracy(%Bingo), Accuracy (% 1-Away), and Standard Deviation to measure percent success and correct classification rate. In sum, The researchers claim that their prediction results are better than any reported in the published literature for this problem domain. 
I believe that Hollywood has historically relied on market research, intuition, and guesswork to predict the box-office receipts of a particular motion picture. Industry leaders and experts have acknowledged that predicting product demand in the movie business is difficult and uncertain, and that it is impossible to accurately forecast how a movie will perform until it is released in theaters. 
Without the help of data mining tools and techniques, it is likely that Hollywood would continue to rely on traditional methods for predicting box-office receipts, such as market research, tracking buzz and social media chatter, and relying on the fame power of actors and directors. These methods are not always reliable and may not provide accurate predictions of a movie's financial performance. 
The use of data mining tools and techniques, as demonstrated by Sharda and Delen, can provide more accurate predictions of box-office receipts by analyzing large amounts of data from various movie-related databases. Given the challenges associated with predicting box-office receipts, it is likely that Hollywood has relied on historical trends, industry expertise, and intuition to make decisions about which movies to produce and how much to invest in them. However, the use of data mining tools and techniques has provided a new approach for predicting the financial performance of movies and identifying patterns before they enter production. 
CART, Neural Net, and SVM are three types of predictive data mining models that can be used to forecast the box office receipts of a motion picture. 
CART stands for Classification and Regression Tree, which is a decision tree-based data mining algorithm used for both classification and regression problems. In predictive data mining, it is used to classify and predict the target variable based on a set of independent variables. In the context of predicting box-office receipts of a motion picture, CART can be used to build a decision tree to identify the most important independent variables that influence box-office performance. By recursively splitting the dataset based on the 
independent variables, CART can create a tree structure that maps the independent variables to the target variable (i.e., box-office receipts). CART can handle both
categorical and continuous independent variables and is relatively easy to interpret and visualize. In the case study, CART’s method had the highest standard deviation of 1.05. 
Additionally, Neural Net is a machine learning algorithm inspired by the structure and function of the human brain. In predictive data mining, it is used to build a model that can learn from the patterns in the training data and predict the target variable based on new input data. The model works by training the network using backpropagation, adjusting the weights between neurons to minimize the error between the predicted and actual values. In the context of predicting box-office receipts of a motion picture, Neural Net can be used to identify the nonlinear relationships between the independent variables and the target variable. It consists of multiple layers of interconnected neurons that process the input data and produce an output. Neural Net can handle both categorical and continuous independent variables and can learn from complex patterns in the data. It can be used for both classification and regression problems, usually in the case with complex relationships between different variables. 
Moreover, SVM stands for Support Vector Machine, which is a supervised learning algorithm used for classification and regression problems. In predictive data mining, it is used to build a model that can predict the target variable based on a set of independent variables. In the context of predicting box-office receipts of a motion picture, SVM can be used to identify the decision boundary that separates the different categories of box-office performance based on the independent variables. SVM aims to find the hyperplane that maximally separates the data points with the largest margin. It can handle both categorical and continuous independent variables and can deal with high-dimensional data. The model works by finding a hyperplane in the input space that maximizes the margin between the classes, for instance, the distance between the closest data points of each class. SVM is useful for problems with a small number of features and a large number of observations. 
In sum, in the context of predictive data mining for box office financial performance prediction, these algorithms were likely used to build models that could accurately classify movies based on their expected financial performance. The researchers used these models to predict whether a movie would be a flop or a blockbuster. In the context of predictive data mining for box-office receipts, these models can be used to classify movies based on their potential financial performance (For example, flop, blockbuster). By training the models on historical data and testing them on new data, they were able to evaluate the accuracy of the models and compare their performance to other published results in the field. These models could be used by movie studios and investors to make informed decisions about which movies to produce or invest in.
Source: 
https://www.analyticsvidhya.com/blog/2017/09/common-machine-learning-algorithms/ 
HSX.com, Hollywood Stock Exchange, is a website that allows users to trade digital stocks of movies, actors, and directors and make money. The website uses a market-based approach to predict the success of movies. The success of a movie is determined by how much money it earns at the box office during its first four weeks of release. Before the movie is released, users can buy and sell digital shares of the movie, which reflects their belief in its potential success. If the movie performs well, the value of the digital shares can go up, and if it performs poorly, the value goes down. In other words, users can buy and sell digital shares of movies before they are released, and the price of the shares is based on predictions of the movie's box office performance. HSX.com also collects data on movie-related news and events to inform their predictions. Moreover, the website also has a prediction algorithm that takes into account factors such as the movie's genre, cast, release date, and marketing budget to make predictions about its box office performance. 
Compared to the approach described in the case above, HSX.com's method is more of a collective intelligence approach, where the collective knowledge and opinions of the users and other factors such as social media, trailer views, and industry expert opinions are used to predict the success of the movie. Also, HSX.com appears to be more based on market speculation and user behavior rather than statistical analysis of historical data. This means it can be unclear whether this method has a high level of accuracy in predicting box office performance. On the other hand, Sharda and Delen's approach is more of a data-driven approach that uses various data mining techniques to predict a movie's success. 
Moreover, Sharda and Delen's method uses a range of data mining techniques to analyze historical data on a large number of movies, and the accuracy of their predictions was reported to be better than any reported in the published literature for this problem domain. Therefore, based on the information provided, Sharda and Delen's (the case given) use of data mining as a form of prediction method provides better accuracy in predicting the financial performance of a movie at the box office.
Partner With Altaf
View Services

More Projects by Altaf