
IMDb Movie Title Parser in Bash

June 20, 2022


Originally published June 20, 2019 @ 7:13 pm

This is an update to the IMDb parser I wrote years back. From time to time IMDb makes small changes to their setup that break my script. This time they decided to start blocking curl, or so it would seem. Even using a fake user-agent string doesn’t help. But wget still works fine (shhh!).

The search syntax is simple: "Title (Year)". Here’s an example:

# imdb "Machete Maidens Unleashed (2010)"

Title:  Machete Maidens Unleashed!
Year:   2010
Rating: 7.4/10
Dir:    Mark Hartley
Cast:   Roger Corman, John Landis, Pete Tombs, Mark Holcomb
Plot:   A fast moving odyssey into the subterranean world of the rarely explored province of Filipino genre filmmaking.
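
The script splits that single argument into a release year and a URL-ready title. Here is a minimal sketch of the same idea, using a bash regex and parameter expansion in place of the sed/awk pipeline. The function name `split_query` and the variables `q_title`, `q_year` and `q_encoded` are made up for illustration and are not part of the script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original script): split "Title (Year)"
# into q_title, q_year and a query-string-safe q_encoded.
split_query() {
	local input="$*"
	# Title, whitespace, then a trailing four-digit year with optional parentheses.
	local re='^(.*[^[:space:]])[[:space:]]+\(?([0-9]{4})\)?[[:space:]]*$'
	q_title="${input}"
	q_year=""
	if [[ "${input}" =~ ${re} ]]
	then
		q_title="${BASH_REMATCH[1]}"
		q_year="${BASH_REMATCH[2]}"
	fi
	# Encode for a query string: spaces become "+", "&" becomes "%26".
	q_encoded="${q_title// /+}"
	q_encoded="${q_encoded//&/%26}"
}

split_query "Machete Maidens Unleashed (2010)"
echo "title=${q_title} year=${q_year} encoded=${q_encoded}"
```

Like the script, this treats any trailing four-digit number as the year, so a title that itself ends in a number should be given with its release year after it.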

As a fallback strategy, the script runs a Google search restricted to the imdb.com domain, i.e. something like this: site:imdb.com Machete Maidens Unleashed. It then grabs the first result and attempts to parse it. Sometimes Google’s search algorithm is better than IMDb’s. No great surprise here.
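
To illustrate the fallback, here is a rough sketch of pulling the first IMDb title link out of a results page. It reads HTML on standard input, so it can be tried without touching the network. The function name `first_imdb_url` and the sample markup are invented for the example; real Google result pages look different and change often.

```shell
#!/bin/bash
# Hypothetical helper: print the first imdb.com title URL found on stdin.
first_imdb_url() {
	grep -oE 'https?://(www\.)?imdb\.com/title/tt[0-9]+' | head -n 1
}

# Made-up markup standing in for a search results page.
sample='<a href="https://example.org/x">x</a>
<a href="/url?q=https://www.imdb.com/title/tt0000001/&amp;sa=U">First</a>
<a href="/url?q=https://www.imdb.com/title/tt0000002/&amp;sa=U">Second</a>'

printf '%s\n' "${sample}" | first_imdb_url
```

Matching on the `/title/tt…` path rather than on any imdb.com link skips name, list and news pages that may also show up in the results.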

The script is below; an up-to-date copy is in my GitHub repo.

#!/bin/bash
export WWW_HOME="www.google.com/"
if [ $# -eq 0 ]
then
	echo 'Usage: imdb "Movie Title (Year)"'
	exit 1
else
	y=$(echo "${@}" | sed -E 's/[()]//g' | awk '{print $NF}' | grep -oE "[0-9]{4}")
	t=$(echo "${@}" | sed -E 's/[()]//g' | sed -E 's/ [0-9]{4}$//g' | sed -r 's/  */\+/g;s/\&/%26/g;s/\++$//g' | sed 's/ /\%20/g')
fi

configure() {
	tmpfile="/tmp/imdb-mf_${RANDOM}.tmp"
	#LYNX="lynx -connect_timeout=10 --source"
	#LYNX="curl -m10 -k -s0"
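	# Wrap wget in a function: a command kept in a plain string is word-split on
	# expansion, which breaks the quoted user-agent into stray arguments.
	fetch_url() {
		wget --max-redirect=30 --no-check-certificate --timeout=1 --tries=5 --retry-connrefused -U "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" -qO- "$@"
	}
	LYNX="fetch_url"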
	base_url_imdb="https://www.imdb.com/search/title"
	base_url_google="https://www.google.com/search"
}

cleanup() {
	if [ -f "${tmpfile}" ]
	then
		/bin/rm -f "${tmpfile}"
	fi
}

get_imdb() {
	if [ ! -z "${y}" ]
	then
		l=$(${LYNX} "https://www.imdb.com/search/title?release_date=${y},${y}&title=${t}&title_type=feature" | grep -m1 -oP "(?<=id=\")[a-z]{2}[0-9]{4,}(?=\|imdb)")
		if [ -z "${l}" ]
		then
			l=$(${LYNX} "https://www.imdb.com/search/title?release_date=${y},${y}&title=${t}&title_type=tv" | grep -m1 -oP "(?<=id=\")[a-z]{2}[0-9]{4,}(?=\|imdb)")
		fi
		${LYNX} "https://www.imdb.com/title/${l}/" > ${tmpfile} 2> /dev/null
	else
		${LYNX} "https://www.google.com/search?q=site:imdb.com+%22${t}%22&btnI" > ${tmpfile} 2> /dev/null
	fi
}

parse_imdb() {
	year="$(grep -m 1 "\/year\/" "${tmpfile}" | grep -Eo "[0-9]{4}")"
	title="$(grep -m 1 "og:title" "${tmpfile}" | grep -Eo '\".*\"' | sed -e 's/"//g' | sed 's/ - IMDb//g' | sed -r 's/ \([0-9]{4}\)//g' | sed 's@/@ @g')"
	temp="$(grep "og:description" "${tmpfile}" | sed -e 's/content="/@/g' -e 's/" \/>/@/g' -e 's/\&quot;/\"/g' | awk -F'@' '{print $(NF-1)}')"
	# Quote ${temp} so the description is not subject to globbing or word splitting.
	director="$(echo "${temp}" | grep -oP "(?<=Directed by ).*?(?=\. With)")"
	cast="$(echo "${temp}" | grep -oP "(?<=\. With ).* ?(?=\. [A-Z0-9])" | sed -r 's/([A-Z]{1})\./@/g' | awk -F'.' '{print $1}' | sed -r 's/@/\./g')"
	plot="$(echo "${temp}" | sed -r "s/${cast}\. /@/g" | awk -F'@' '{print $NF}')"
	rating="$(grep -m 1 -oP "[0-9]\.?[0-9]?\<span class=\"ofTen\"\>/10" "${tmpfile}" | sed -r 's/<.*>//g')"
}

get_imdb2() {
	if [ -z "${year}" ]
	then
		m=$(echo "${l}" | sed 's/ [Aa]nd / \& /g')
		${LYNX} "https://www.imdb.com/title/${m}/" > ${tmpfile} 2> /dev/null
		parse_imdb
	fi
}

get_imdb3() {
	if [ -z "${year}" ]
	then
		#${LYNX} "https://www.google.com/search?q=site:imdb.com+%22${t}%22&btnI" > ${tmpfile} 2> /dev/null
		${LYNX} "$(${LYNX} "https://www.google.com/search?q=site:imdb.com+${t} (${y})&btnI" 2>/dev/null | grep -oE "(https?|ftps?)://[^\<\>\"\' ]+" | grep imdb | tail -1)" > ${tmpfile} 2> /dev/null
		parse_imdb
	fi
}

print_imdb() {
	if [ -z "${year}" ]
	then
		echo "Scraped the bottom of the pickle barrel but came up dry. Check the title and provide release year."
	else
		echo -e "Title:\t${title}"
		echo -e "Year:\t${year}"
		echo -e "Rating:\t${rating}"
		echo -e "Dir:\t${director}"
		echo -e "Cast:\t${cast}"
		echo -e "Plot:\t${plot}"
	fi
}

# RUNTIME
# ---------------------------

configure
cleanup
#get_imdb
#parse_imdb
#get_imdb2
get_imdb3
print_imdb
cleanup
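
The parse_imdb function above takes the director, cast and plot from the page's og:description text, which its regexes expect to read "Directed by … . With … . plot". Below is a small offline sketch of that split using a bash regex instead of grep -P. The helper name `split_description` and the `d_*` variables are hypothetical, and the sample text is assembled from the example output earlier in this post rather than captured from IMDb.

```shell
#!/bin/bash
# Sample description, assembled from the example output in this post
# (an assumption about the page format, not a captured IMDb response).
desc='Directed by Mark Hartley. With Roger Corman, John Landis, Pete Tombs, Mark Holcomb. A fast moving odyssey into the subterranean world of the rarely explored province of Filipino genre filmmaking.'

# Hypothetical helper: split the description into director, cast and plot.
split_description() {
	local text="$1"
	# Director runs up to ". With "; cast runs up to the next period; plot is the rest.
	local re='^Directed by (.*)\. With ([^.]*)\. (.*)$'
	d_director=""
	d_cast=""
	d_plot=""
	if [[ "${text}" =~ ${re} ]]
	then
		d_director="${BASH_REMATCH[1]}"
		d_cast="${BASH_REMATCH[2]}"
		d_plot="${BASH_REMATCH[3]}"
	fi
}

split_description "${desc}"
printf 'Dir:\t%s\nCast:\t%s\nPlot:\t%s\n' "${d_director}" "${d_cast}" "${d_plot}"
```

This simplified version cuts the cast list at the first period, so a name with an initial (say "Samuel L. Jackson") would be truncated; the script's sed/awk chain works around that by temporarily swapping initials for a placeholder.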

Igor

