Movie records#

Cine

¹ Photo of the local cinema, taken from Zelai Arizti.

Feb, 2024

Data mining with API

Background#

For a little over a decade, I’ve been keeping a record of the movies I watch, both in theaters and at home. I jot down the title, director, release year, and rate them with stars based on how I felt about them. In reality, my rating system consists of just two numbers: I give 4 stars if I liked the movie and 5 if I loved it. If a movie didn’t evoke any special feelings, I leave the rating box blank. This dataset now has almost a thousand entries, and I’ve thought about analyzing it to answer some questions I have.

For instance, lately I’ve had the feeling that movies are getting longer. Is this true, or is it just my impression? To get the duration in minutes of each film and add it to my table, I decided to use the API of the online movie database TMDB (The Movie Database). Besides the runtime, the API also lets me complete my records with, for example, the ratings each movie has received, so I can compare my tastes with those of the community.

The data#

First I read my records from a CSV file:

	title	director	year	stars
0	Inland Empire	David Lynch	2006	4.0
1	La vida es un milagro	Emir Kusturica	2004	NaN
2	El arca rusa	Aleksandr Sokurov	2002	5.0
3	Biutiful	Alejandro Glez. Iñárritu	2010	NaN
4	Carretera perdida	David Lynch	1997	NaN
...	...	...	...	...
928	Anatomía de una caída	Justine Triet	2023	4.0
929	Fallen Leaves	Aki Kaurismäki	2023	4.0
930	Chinas	Arantxa Echevarría	2023	NaN
931	Una habitación con vistas	James Ivory	1985	NaN
932	Els encantats	Elena Trapé	2023	4.0

933 rows × 4 columns
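A minimal sketch of how such a table could be loaded with pandas. The name of the real file is not given here, so an inline sample with three rows from the table above stands in for it; the column names are the ones shown in the table.

```python
import io

import pandas as pd

# Inline sample standing in for the real CSV file (hypothetical content,
# built from three rows of the table above).
sample_csv = io.StringIO(
    "title,director,year,stars\n"
    "Inland Empire,David Lynch,2006,4\n"
    "La vida es un milagro,Emir Kusturica,2004,\n"
    "El arca rusa,Aleksandr Sokurov,2002,5\n"
)

# Empty star cells are read as NaN, which turns the column into floats.
films = pd.read_csv(sample_csv)
print(films)
```

With the real file, the `io.StringIO` object would simply be replaced by the path to the CSV.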

Next, I collect the information needed to complete the table. To keep the code simple, I use tmdbsimple, a Python library that wraps the TMDB API. It provides a function that searches for a movie by title. The titles in my table are often in Spanish, and the search returns the closest matches in the TMDB database. To narrow the results, I include the movie’s release year in the search.

Next, once the movie identifier is found, I access the movie record so I can get the following information:

tmdb_id: film TMDB identifier.
tmdb_title: title given in TMDB (normally the English one).
tmdb_genre: here I filter the response to distinguish documentaries from fiction films (labelled simply Film).
tmdb_lang: original language.
tmdb_runtime: duration of the film in minutes.
tmdb_vote_cnt: number of ratings.
tmdb_vote_avg: average rating.


import tmdbsimple as tmdb

# Init TMDB API parameters
tmdb.API_KEY = "api-key"  # placeholder: replace with your own TMDB API key
tmdb.REQUESTS_TIMEOUT = 5  # seconds, for both connect and read


# ****SEARCH MOVIE****

# Init Search method
search = tmdb.Search()

# Iterate through my records' title and year to find TMDB id
tmdb_ids = []
tmdb_titles = []

for row in range(len(films)):

    response = search.movie(query=films.loc[row, 'title'],
                            year=films.loc[row, 'year'])

    # I will take the most probable film found, first in the list.
    # An empty string marks a title with no match in TMDB.
    results = response['results']
    if results:
        tmdb_ids.append(results[0]['id'])
        tmdb_titles.append(results[0]['title'])
    else:
        tmdb_ids.append("")
        tmdb_titles.append("")


# Assign new columns to the dataframe    
films['tmdb_title'] = tmdb_titles
films['tmdb_id'] = tmdb_ids



# ****GET MOVIE INFO****

# Iterate through the films to fetch the required info
tmdb_genres = []
tmdb_langs = []
tmdb_runtimes = []
tmdb_vote_cnts = []
tmdb_vote_avgs = []

for row in range(len(films)):

    # Fetch the movie record once and reuse it for every field below
    # (each call to info() is a separate request to the API)
    try:
        info = tmdb.Movies(films.loc[row, 'tmdb_id']).info()
    except Exception:
        info = None  # film not found in TMDB, or the request failed

    # No record available: leave every field empty for this film
    if info is None:
        tmdb_genres.append("")
        tmdb_langs.append("")
        tmdb_runtimes.append("")
        tmdb_vote_cnts.append("")
        tmdb_vote_avgs.append("")
        continue

    # Get genres distinguishing Documentaries from Films
    genre_names = [genre['name'] for genre in info.get('genres', [])]
    tmdb_genres.append('Documentary' if 'Documentary' in genre_names else 'Film')

    # Get the language, the runtime, the rating count and the rating
    tmdb_langs.append(info.get('original_language', ""))
    tmdb_runtimes.append(info.get('runtime', ""))
    tmdb_vote_cnts.append(info.get('vote_count', ""))
    tmdb_vote_avgs.append(info.get('vote_average', ""))
    

# Assign new columns to the dataframe  
films['tmdb_genre'] = tmdb_genres
films['tmdb_lang'] = tmdb_langs
films['tmdb_runtime'] = tmdb_runtimes
films['tmdb_vote_cnt'] = tmdb_vote_cnts
films['tmdb_vote_avg'] = tmdb_vote_avgs

# Print the resulting dataframe
films

	title	director	year	stars	tmdb_title	tmdb_id	tmdb_genre	tmdb_lang	tmdb_runtime	tmdb_vote_cnt	tmdb_vote_avg
0	Inland Empire	David Lynch	2006	4.0	Inland Empire	1730	Film	en	180	1007	7.023
1	La vida es un milagro	Emir Kusturica	2004	NaN	Life Is a Miracle	20128	Film	sr	155	163	7.365
2	El arca rusa	Aleksandr Sokurov	2002	5.0	Russian Ark	16646	Film	ru	99	425	7.3
3	Biutiful	Alejandro Glez. Iñárritu	2010	NaN	Biutiful	45958	Film	es	148	1040	7.255
4	Carretera perdida	David Lynch	1997	NaN	Lost Highway	638	Film	en	134	2422	7.6
...	...	...	...	...	...	...	...	...	...	...	...
928	Anatomía de una caída	Justine Triet	2023	4.0	Anatomy of a Fall	915935	Film	fr	152	897	7.7
929	Fallen Leaves	Aki Kaurismäki	2023	4.0	Fallen Leaves	986280	Film	fi	81	269	7.258
930	Chinas	Arantxa Echevarría	2023	NaN	Chinas	1034387	Film	es	119	8	7.938
931	Una habitación con vistas	James Ivory	1985	NaN	A Room with a View	11257	Film	en	117	690	6.988
932	Els encantats	Elena Trapé	2023	4.0	The Enchanted	1022725	Film	ca	108	8	6.8

933 rows × 11 columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 933 entries, 0 to 932
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Unnamed: 0     933 non-null    int64  
 1   title          933 non-null    object 
 2   director       933 non-null    object 
 3   year           933 non-null    int64  
 4   stars          176 non-null    float64
 5   tmdb_title     927 non-null    object 
 6   tmdb_id        927 non-null    float64
 7   tmdb_genre     927 non-null    object 
 8   tmdb_lang      927 non-null    object 
 9   tmdb_runtime   927 non-null    float64
 10  tmdb_vote_cnt  927 non-null    float64
 11  tmdb_vote_avg  927 non-null    float64
dtypes: float64(5), int64(2), object(5)
memory usage: 87.6+ KB

Data validation#

Missing values in the tmdb_ columns indicate that the film was not found in TMDB, so I discard those rows altogether, since there is no information available for them. I also convert the whole-number columns (id, runtime, number of ratings) to integers and turn the genre, the language, and my star ratings into categories. For the stars, 4 becomes “liked,” 5 becomes “loved,” and missing values become “untagged.”
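The cleaning steps described above could look roughly like this. It is a sketch on a hypothetical four-row miniature of the table (the last row stands for a film with no TMDB match), not necessarily the exact code used for the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the merged table; the last row has no TMDB match.
films = pd.DataFrame({
    "title": ["Inland Empire", "El arca rusa", "Biutiful", "Unknown film"],
    "stars": [4.0, 5.0, np.nan, np.nan],
    "tmdb_id": [1730.0, 16646.0, 45958.0, np.nan],
    "tmdb_genre": ["Film", "Film", "Film", np.nan],
    "tmdb_lang": ["en", "ru", "es", np.nan],
    "tmdb_runtime": [180.0, 99.0, 148.0, np.nan],
    "tmdb_vote_cnt": [1007.0, 425.0, 1040.0, np.nan],
})

# Drop the rows that were not found in TMDB, then fix the column types:
# whole numbers become integers, genre and language become categories.
films = films.dropna(subset=["tmdb_id"]).astype({
    "tmdb_id": "int64",
    "tmdb_runtime": "int64",
    "tmdb_vote_cnt": "int64",
    "tmdb_genre": "category",
    "tmdb_lang": "category",
})

# 4 stars -> "liked", 5 stars -> "loved", missing -> "untagged".
films["stars"] = (
    films["stars"]
    .map({4.0: "liked", 5.0: "loved"})
    .fillna("untagged")
    .astype("category")
)

films.info()
```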

<class 'pandas.core.frame.DataFrame'>
Index: 927 entries, 0 to 932
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Unnamed: 0     927 non-null    int64   
 1   title          927 non-null    object  
 2   director       927 non-null    object  
 3   year           927 non-null    int64   
 4   stars          927 non-null    category
 5   tmdb_title     927 non-null    object  
 6   tmdb_id        927 non-null    int64   
 7   tmdb_genre     927 non-null    category
 8   tmdb_lang      927 non-null    category
 9   tmdb_runtime   927 non-null    int64   
 10  tmdb_vote_cnt  927 non-null    int64   
 11  tmdb_vote_avg  927 non-null    float64 
dtypes: category(3), float64(1), int64(5), object(3)
memory usage: 76.7+ KB

Duration of the films#

Let’s first analyze the duration of the movies. A histogram will provide us with their distribution.

../_images/561dcc644ffa77395abd923d5da00d5f8e5e0c69d09215049fe8f80bb58ab661.png

We see that there are about 8 movies with a duration of 0 minutes. This happens when the runtime field has not been filled in in the TMDB record, so we discard these movies before calculating the average duration.

Mean duration is --> 106 minutes
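As a sketch of this step, the zero runtimes can be filtered out with a boolean mask before taking the mean. The runtimes below are made-up values for illustration, so the printed number will not match the one above.

```python
import pandas as pd

# Hypothetical runtimes in minutes; 0 means the field is empty in TMDB.
films = pd.DataFrame({"tmdb_runtime": [180, 155, 0, 99, 0, 148]})

# Keep only the films with a known runtime before averaging.
with_runtime = films[films["tmdb_runtime"] > 0]
mean_runtime = with_runtime["tmdb_runtime"].mean()

print(f"Mean duration is --> {mean_runtime:.0f} minutes")
```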

Therefore, the average duration of a feature film is approximately one hour and three quarters, which matches the idea I had.

Out of curiosity, I extract the ten longest movies from those I have seen.

	director	year	tmdb_title	tmdb_runtime
801	Martin Scorsese	2019	The Irishman	209
655	Martin Scorsese	2005	No Direction Home: Bob Dylan	208
139	Akira Kurosawa	1954	Seven Samurai	207
511	Nuri Bilge Ceylan	2014	Winter Sleep	196
299	Robert B. Weide	2012	Woody Allen: A Documentary	192
607	Richard Attenborough	1982	Gandhi	191
180	Marcel Carné	1945	Children of Paradise	191
30	Luchino Visconti	1963	The Leopard	186
259	Andrei Tarkovsky	1966	Andrei Rublev	183
0	David Lynch	2006	Inland Empire	180
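A ranking like this one can be obtained with pandas' `nlargest`. The sketch below uses four films taken from the tables in this post and keeps the top three; for the list above the first argument would be 10.

```python
import pandas as pd

# Four films and their runtimes, taken from the tables in this post.
films = pd.DataFrame({
    "tmdb_title": ["The Irishman", "Seven Samurai", "Fallen Leaves", "Gandhi"],
    "tmdb_runtime": [209, 207, 81, 191],
})

# Rows with the largest runtimes, longest first.
longest = films.nlargest(3, "tmdb_runtime")
print(longest)
```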

Next, I plot the movies based on their duration, distinguishing between documentaries and fiction films on the graph. I include the yearly mean value on the graph as well.

../_images/4e22315bb7617d9ebc1517a92a66b6c7d9cfcd0dd3ce4be5f0a067b015026ef5.png

The graph shows that most of the movies I have seen are from recent years. Their duration does not seem to have increased. To look more closely, I narrow the graph to the last 20 years, from 2004 to 2024, and exclude the documentaries to see whether there is an upward trend in duration.

../_images/1487beee9371b9c2708d9ef5991d0837862f5560d2759f66370c60c2c9e1725c.png

I would say there hasn’t been a significant change in the duration of the movies, at least in the ones I’ve seen, so the notion that movies are getting longer seems to be just my impression and is not borne out by the data.
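The yearly mean drawn on the graphs, and a simple number to go with the visual check, could be computed along these lines. This is a sketch on made-up data: the straight-line fit with `np.polyfit` is an addition for illustration rather than part of the analysis above, and a slope on its own says nothing about statistical significance.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the table, only to make the sketch runnable.
films = pd.DataFrame({
    "year": [2003, 2006, 2006, 2010, 2019, 2019, 2023, 2023],
    "tmdb_genre": ["Film", "Film", "Documentary", "Film",
                   "Film", "Film", "Film", "Film"],
    "tmdb_runtime": [95, 180, 90, 148, 209, 132, 152, 81],
})

# Fiction films from 2004 to 2024 with a known runtime.
recent = films[
    films["year"].between(2004, 2024)
    & (films["tmdb_genre"] == "Film")
    & (films["tmdb_runtime"] > 0)
]

# Mean runtime per release year.
yearly_mean = recent.groupby("year")["tmdb_runtime"].mean()

# Least-squares line through the individual films;
# the slope is expressed in minutes per year.
slope, intercept = np.polyfit(recent["year"], recent["tmdb_runtime"], deg=1)

print(yearly_mean)
print(f"Trend: {slope:+.2f} minutes per year")
```

A slope close to zero would agree with the visual impression, while a clearly positive one would suggest that the films in the table are indeed getting longer.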

Rating analysis#

To ensure consistency in the movie ratings, I will only consider movies that have received a minimum of, for example, 100 ratings.

648 out of the 927 movies have more than 100 votes in TMDb
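A minimal sketch of the filter and of the two rankings that follow, using five rows taken from the tables in this post. The threshold is applied here as "at least 100 votes", which is an assumption about how the boundary case is handled.

```python
import pandas as pd

# A few rows taken from the tables in this post.
films = pd.DataFrame({
    "tmdb_title": ["Interstellar", "Parasite", "Chinas",
                   "The Enchanted", "Russian Ark"],
    "tmdb_vote_cnt": [33563, 17057, 8, 8, 425],
    "tmdb_vote_avg": [8.4, 8.5, 7.938, 6.8, 7.3],
})

MIN_VOTES = 100  # threshold chosen in the text

# Keep only the movies with enough votes for the average to be meaningful.
rated = films[films["tmdb_vote_cnt"] >= MIN_VOTES]
print(f"{len(rated)} out of the {len(films)} movies have at least {MIN_VOTES} votes")

# Rankings by number of ratings and by average rating.
most_voted = rated.nlargest(3, "tmdb_vote_cnt")
best_rated = rated.nlargest(3, "tmdb_vote_avg")
```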

These are the 10 movies that have received the largest number of ratings:

	director	year	tmdb_title	tmdb_vote_cnt	tmdb_vote_avg
468	Christopher Nolan	2014	Interstellar	33563	8.4
792	Todd Phillips	2019	Joker	24056	8.2
618	Pete Docter	2015	Inside Out	19984	7.9
818	Joon-ho Bong	2019	Parasite	17057	8.5
627	Denis Villeneuve	2016	Arrival	16927	7.6
919	Olivier Nakache	2011	The Intouchables	16499	8.3
539	Morten Tyldum	2014	The Imitation Game	16309	8.0
580	Damien Chazelle	2016	La La Land	15977	7.9
650	Christopher Nolan	2017	Dunkirk	15792	7.5
577	Hayao Miyazaki	2001	Spirited Away	15490	8.5

And these are the ones that have received the highest ratings:

	director	year	tmdb_title	tmdb_vote_cnt	tmdb_vote_avg
34	Sergio Leone	1966	The Good, the Bad and the Ugly	8062	8.5
818	Joon-ho Bong	2019	Parasite	17057	8.5
442	Giuseppe Tornatore	1988	Cinema Paradiso	4108	8.5
816	Isao Takahata	2003	Grave of the Fireflies	5087	8.5
139	Akira Kurosawa	1954	Seven Samurai	3391	8.5
577	Hayao Miyazaki	2001	Spirited Away	15490	8.5
489	Damien Chazelle	2014	Whiplash	14324	8.4
426	Milos Forman	1975	One Flew Over the Cuckoo's Nest	9921	8.4
409	Carlos Sorin	2008	Rear Window	6118	8.4
230	Fernando Meirelles	2002	City of God	6920	8.4

For a clearer visual reference, I place these movies into three different groups corresponding to my personal ratings of them.

../_images/c15588cad14a5898623bbb6093f58c332b6c5ccc192064568ec9cf408cdd601a.png

In the graph, I see that among the ones I loved, there are a couple of movies that also enjoyed much public favor. Which ones are they?

	director	year	tmdb_title	tmdb_vote_cnt	tmdb_vote_avg	stars
426	Milos Forman	1975	One Flew Over the Cuckoo's Nest	9921	8.4	loved
193	Billy Wilder	1960	The Apartment	2103	8.2	loved

Oh, definitely!

And which is the film that I loved but that got a low rating in TMDB?

	director	year	tmdb_title	tmdb_vote_cnt	tmdb_vote_avg	stars
658	Manuel Martín Cuenca	2017	The Motive	257	6.2	loved

And what about the movies that I didn’t label because they didn’t particularly strike me (or I forgot to label them), yet they have high ratings? I list the six that appear at the top of the graph in the untagged group.

	director	year	tmdb_title	tmdb_vote_cnt	tmdb_vote_avg	stars
818	Joon-ho Bong	2019	Parasite	17057	8.5	untagged
139	Akira Kurosawa	1954	Seven Samurai	3391	8.5	untagged
816	Isao Takahata	2003	Grave of the Fireflies	5087	8.5	untagged
577	Hayao Miyazaki	2001	Spirited Away	15490	8.5	untagged
34	Sergio Leone	1966	The Good, the Bad and the Ugly	8062	8.5	untagged
442	Giuseppe Tornatore	1988	Cinema Paradiso	4108	8.5	untagged

I definitely should have rated Parasite higher! (But by no means Cinema Paradiso.)

And what’s that movie that’s so low in the ratings?

	director	year	tmdb_title	tmdb_vote_cnt	tmdb_vote_avg	stars
142	Ed. Wood	1959	Plan 9 from Outer Space	509	4.2	untagged

It’s the famous Ed Wood movie, considered to be the worst in the history of cinema!

Original languages#

While I was at it, I thought it would be interesting to look at the original language of the movies I’ve watched over these years. Perhaps this information would tell me something about the type of cinema I prefer.

../_images/a8d117526af64ed5364657358ef44e86d205fdb7af532288c62817b348ef0e90.png

The top three languages were as expected, and so were the ones that follow. The presence of Korean and Danish on this list confirms the special interest I have in the cinema of these countries.
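The counts behind a chart like this can be obtained with `value_counts`. The sketch below uses only the language codes of the ten films visible in the table earlier in this post, so its ranking is illustrative and does not reproduce the chart.

```python
import pandas as pd

# Original-language codes of the ten films shown in the table above.
films = pd.DataFrame({
    "tmdb_lang": ["en", "sr", "ru", "es", "en", "fr", "fi", "es", "en", "ca"]
})

# Number of films per original language, most frequent first.
lang_counts = films["tmdb_lang"].value_counts()
print(lang_counts.head(3))
```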

Conclusion#

This little project allowed me to analyze the personal movie record I’ve been keeping for some years. I found it useful to be able to add further data on each movie automatically from TMDB, as it saves me the trouble of doing it manually, and my table has indeed become more comprehensive.

Regarding the questions I had in mind, the duration of the films does not seem to be increasing, at least not significantly, although my analysis has been purely visual.

As for the community ratings and whether they are in line with my tastes, well, it’s a bit of everything. One thing that is clear is that TMDB ratings don’t help much in identifying good movies, since the vast majority fall between 6 and 8 points. They do, however, help in recognizing big hits based on the number of ratings.