Data Collection

We gathered basic features of movies released in the US from 2005 to 2017 from TMDb.
In addition, we scraped the number of trailer views to represent each movie's social media popularity before its release.

TMDb

We collected movie information from TMDb (The Movie Database), a community-built movie database that is open for use with an API key. TMDb is a useful source, but because it is non-profit and community-contributed, its records contain errors and missing values. The information we need from it falls into two parts, which are stored separately: basic information about each movie, and its credits. Using an API module, we can easily list all movies released in the US within a given time period. We used a Python script to collect the data set and stored the results in a .json file. There are 52973 items in the original set.
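The collection step described above can be sketched roughly as follows. This is a minimal illustration, not the script we actually ran: it assumes TMDb's `/discover/movie` endpoint with the `region` and `primary_release_date` filters, and the helper names (`discover_url`, `collect_movies`, `save_json`) are hypothetical. The `fetch` argument is injectable so the paging logic can be tried without a network connection or an API key.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_ROOT = "https://api.themoviedb.org/3"


def fetch_json(url):
    """Download one URL and decode the JSON body."""
    with urlopen(url) as resp:
        return json.load(resp)


def discover_url(api_key, start, end, page=1):
    """Build a discover query for US releases between two ISO dates."""
    params = {
        "api_key": api_key,
        "region": "US",
        "primary_release_date.gte": start,
        "primary_release_date.lte": end,
        "page": page,
    }
    return f"{API_ROOT}/discover/movie?{urlencode(params)}"


def collect_movies(api_key, start, end, fetch=fetch_json):
    """Walk every result page and return the movie records as a list."""
    movies = []
    page = 1
    while True:
        data = fetch(discover_url(api_key, start, end, page))
        movies.extend(data.get("results", []))
        # Stop once the last page reported by the API has been read.
        if page >= data.get("total_pages", 1):
            break
        page += 1
    return movies


def save_json(movies, path):
    """Store the collected records in a .json file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(movies, fh, ensure_ascii=False)
```

A call such as `collect_movies(key, "2005-01-01", "2017-12-31")` would return the basic records. Credits are stored separately, so they would need one further request per movie. The API may also cap the number of pages returned for a single query, in which case the date range would have to be split into shorter periods.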

Scatter matrix of each pair of features in the data set

YouTube Views

We collected data from social media in order to look for a relationship between social media attention and how movies are assessed. YouTube is a widely used platform, and the number of views of a movie's trailer gives a measure of the movie's popularity before release. We obtained the data from YouTube in four steps. First, we get the list of movie titles, which is easy to build by traversing the JSON file from TMDb. Second, we clean each title. Many titles are not in English and contain non-ASCII characters, which our Python script could not handle directly and which cannot be placed in a URL as they are, so we percent-encode them to produce a valid URL. Third, after assembling the URL, we send a request to the YouTube server and use a regular expression to find the first 20 videos related to the movie's trailer. We compare them and take the trailer with the most views; that number is returned as the movie's view count. Finally, we loop this function over all movies, collect the view counts of all trailers, and save the data to a CSV file. The figure below shows the distribution of views for all movies.
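The steps above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the pattern `"<number> views"` is an assumed stand-in for whatever markup the search results page actually contains, which changes over time. The network request itself is left out so that the parsing logic can be run on any saved HTML string.

```python
import csv
import re
from urllib.parse import quote_plus

SEARCH_URL = "https://www.youtube.com/results?search_query="

# Assumed pattern: a view count written like "1,234,567 views".
VIEWS_RE = re.compile(r"(\d[\d,]*) views")


def build_search_url(title):
    """Percent-encode the title (including non-ASCII characters) into a search URL."""
    return SEARCH_URL + quote_plus(f"{title} trailer")


def max_views(html, limit=20):
    """Return the largest view count among the first `limit` matches, or 0 if none."""
    counts = [int(m.replace(",", "")) for m in VIEWS_RE.findall(html)[:limit]]
    return max(counts, default=0)


def save_views(rows, path):
    """Write (title, views) pairs to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["title", "views"])
        writer.writerows(rows)
```

For example, `build_search_url("Amélie")` yields a URL ending in `Am%C3%A9lie+trailer`, which is the percent-encoded form of the title. Taking the maximum over the first 20 results is a heuristic: it assumes the most viewed result is the official trailer, which may not hold for every movie.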

  • youtube_bar

    This graph shows the distribution of trailer view counts across movies.

  • This scatter plot shows the relationship between budget and revenue for every movie. Click the graph to see the interactive version written in D3.

  • This scatter plot shows the relationship between budget and runtime for every movie. Click the graph to see the interactive version written in D3.