Dataspace 8: Example: Movie data

Sidebar: Here’s a quick comparison of what I’m hoping to achieve in terms of syntax and readability, and an example of why I think it’s important to spend a fair bit of time thinking about syntax. Particularly about what’s not in the syntax, so that it isn’t there to get in the way.

SWI-Prolog’s SWISH has some wonderful example programs on the web; here, for example, is a simple movie database with nearly 3000 separate facts (taken from IMDb, I guess).

Here’s what the first movie in the list, American Beauty, looks like in Prolog notation:

movie(american_beauty, 1999).
director(american_beauty, sam_mendes).
actor(american_beauty, kevin_spacey, lester_burnham).
actress(american_beauty, annette_bening, carolyn_burnham).
actress(american_beauty, thora_birch, jane_burnham).
actor(american_beauty, wes_bentley, ricky_fitts).
actress(american_beauty, mena_suvari, angela_hayes).
actor(american_beauty, chris_cooper, col_frank_fitts_usmc).
actor(american_beauty, peter_gallagher, buddy_kane).
actress(american_beauty, allison_janney, barbara_fitts).
actor(american_beauty, scott_bakula, jim_olmeyer).
actor(american_beauty, sam_robards, jim_berkley).
actor(american_beauty, barry_del_sherman, brad_dupree).
actress(american_beauty, ara_celi, sale_house_woman_1).
actor(american_beauty, john_cho, sale_house_man_1).
actor(american_beauty, fort_atkinson, sale_house_man_2).
actress(american_beauty, sue_casey, sale_house_woman_2).
actor(american_beauty, kent_faulcon, sale_house_man_3).
actress(american_beauty, brenda_wehle, sale_house_woman_4).
actress(american_beauty, lisa_cloud, sale_house_woman_5).
actress(american_beauty, alison_faulk, spartanette_1).
actress(american_beauty, krista_goodsitt, spartanette_2).
actress(american_beauty, lily_houtkin, spartanette_3).
actress(american_beauty, carolina_lancaster, spartanette_4).
actress(american_beauty, romana_leah, spartanette_5).
actress(american_beauty, chekeshka_van_putten, spartanette_6).
actress(american_beauty, emily_zachary, spartanette_7).
actress(american_beauty, nancy_anderson, spartanette_8).
actress(american_beauty, reshma_gajjar, spartanette_9).
actress(american_beauty, stephanie_rizzo, spartanette_10).
actress(american_beauty, heather_joy_sher, playground_girl_1).
actress(american_beauty, chelsea_hertford, playground_girl_2).
actress(american_beauty, amber_smith, christy_kane).
actor(american_beauty, joel_mccrary, catering_boss).
actress(american_beauty, marissa_jaret_winokur, mr_smiley_s_counter_girl).
actor(american_beauty, dennis_anderson, mr_smiley_s_manager).
actor(american_beauty, matthew_kimbrough, firing_range_attendant).
actress(american_beauty, erin_cathryn_strubbe, young_jane_burnham).
actress(american_beauty, elaine_corral_kendall, newscaster).

Here’s what it would look like refactored to its simplest normal form in term-expressions. (We could just have a big /all block with every fact on its own line, as with Prolog, but that’s explicitly the representation semantics that I want the freedom to get away from.) I’ve reformatted the names as lists, because I can, and recased them, because I don’t want Prolog’s limitation of disallowing upper case.

(/movie (American Beauty) 1999 /all
	(director 	(Sam Mendes))
	(actor	/all 	((Kevin Spacey)		(Lester Burnham))
			((Wes Bentley)		(Ricky Fitts))
			((Chris Cooper) 	(Col Frank Fitts USMC))
			((Peter Gallagher)	(Buddy Kane))
			((Scott Bakula)		(Jim Olmeyer))
			((Sam Robards)		(Jim Berkley))
			((Barry del Sherman)	(Brad Dupree))
			((John Cho)		(Sale House Man 1))
			((Fort Atkinson)	(Sale House Man 2))
			((Kent Faulcon)		(Sale House Man 3))
			((Joel McCrary)		(Catering Boss))
			((Dennis Anderson)	(Mr Smiley's Manager))
			((Matthew Kimbrough)	(Firing Range Attendant))		
	)
	(actress /all	((Annette Bening)	(Carolyn Burnham))
			((Thora Birch)		(Jane Burnham))
			((Mena Suvari)		(Angela Hayes))
			((Allison Janney)	(Barbara Fitts))
			((Ara Celi)		(Sale House Woman 1))
			((Sue Casey)		(Sale House Woman 2))
			((Brenda Wehle)		(Sale House Woman 4))
			((Lisa Cloud)		(Sale House Woman 5))
			((Alison Faulk)		(Spartanette 1))
			((Krista Goodsitt)	(Spartanette 2))
			((Lily Houtkin)		(Spartanette 3))
			((Carolina Lancaster)	(Spartanette 4))
			((Romana Leah)		(Spartanette 5))
			((Chekeshka van Putten)	(Spartanette 6))
			((Emily Zachary)	(Spartanette 7))
			((Nancy Anderson)	(Spartanette 8))
			((Reshma Gajjar)	(Spartanette 9))
			((Stephanie Rizzo)	(Spartanette 10))
			((Heather Joy Sher)	(Playground Girl 1))
			((Chelsea Hertford)	(Playground Girl 2))
			((Amber Smith)		(Christy Kane))
			((Marissa Jaret Winokur)(Mr Smiley's Counter Girl))
			((Erin Cathryn Strubbe)	(Young Jane Burnham))
			((Elaine Corral Kendall)(Newscaster))			
	)
)

This is now one single fact. It takes about as many lines, but far fewer characters. All the duplicated data (the movie key and the predicate name repeated on every line) has been removed, making it much clearer, particularly if you can preserve tabs to allow columnar data to keep its shape. This is the sort of freedom I would like to see in low-level data serialisation syntaxes.

The structure is essentially a dictionary, but note that if an actor were playing multiple roles in a movie, we could record that; that wouldn’t fit dictionary semantics, where each key maps to exactly one value. In the case of a multiple-role actor, we would have a hard job naturally expressing this data in a JSON object. As it is, because I’ve chosen to represent names as lists (to make queries like ‘Who played a Spartanette in American Beauty?’ simple), this data couldn’t quite fit in JSON even in this naturally JSON-object-shaped form, because JSON can only use strings as keys.

Also, because I’ve put the movie name and the year both into the ‘header line’ of the /movie term, we’ve strictly speaking captured a slightly more specific set of facts than in the Prolog version: if (as often happens) there were two movies with the same name, we can now disambiguate them by year. These ‘actor’ and ‘actress’ facts are recorded against ‘(American Beauty) 1999’ and not just ‘(American Beauty)’. So just in this step of normalising the data, we’ve found and fixed a subtle bug in the Prolog representation.
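To make that bug concrete, here is a minimal sketch in plain Prolog. The 2030 film, `some_actor`, `some_role` and the `actor_in/3` predicate are all invented for illustration; none of them are in the SWISH database.

```prolog
% Sketch only: the 2030 entries are hypothetical, added to force a name clash.
movie(american_beauty, 1999).
movie(american_beauty, 2030).                        % hypothetical second film

% Flat form, as in the original data: cast facts are keyed by title alone,
% so the two films end up sharing one cast list.
actor(american_beauty, kevin_spacey, lester_burnham).
actor(american_beauty, some_actor, some_role).       % hypothetical, 2030 film

% Keyed form: use the compound term Title-Year as the key instead.
% actor_in/3 is a made-up predicate name for this sketch.
actor_in(american_beauty-1999, kevin_spacey, lester_burnham).
actor_in(american_beauty-2030, some_actor, some_role).   % hypothetical

% ?- movie(american_beauty, 1999), actor(american_beauty, A, _).
%    gives both kevin_spacey and some_actor: the year does not constrain the cast.
% ?- actor_in(american_beauty-1999, A, _).
%    gives only kevin_spacey.
```

So the fix is available in Prolog too, but only by choosing a compound key up front; the term-expression layout gets it as a side effect of putting both values in the header.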

We could, sort of, get something like this in Prolog by using lists (the exceptions being the commas, which add syntactic noise, the disallowed upper case characters, and the single quotes). Like the term-expression version, it would be one long fact, and every line that might have multiple options would have to be wrapped in a list of lists. It wouldn’t be idiomatic Prolog, but you could maybe get there. Or you could add a symbol like /all manually.
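As a rough sketch of what that list-based Prolog might look like (my own guess at a layout, with the cast cut down to a few entries from the data above, and `played/3` a hypothetical helper):

```prolog
:- use_module(library(lists)).   % member/2

% One fact per movie. Names are lists of atoms; capitalised atoms need quotes.
% Every field that may hold several values is wrapped in a list of lists.
movie(['American', 'Beauty'], 1999,
      [ director([['Sam', 'Mendes']]),
        actor([ [['Kevin', 'Spacey'], ['Lester', 'Burnham']],
                [['Wes', 'Bentley'],  ['Ricky', 'Fitts']] ]),
        actress([ [['Annette', 'Bening'],  ['Carolyn', 'Burnham']],
                  [['Alison', 'Faulk'],    ['Spartanette', 1]],
                  [['Krista', 'Goodsitt'], ['Spartanette', 2]] ])
      ]).

% played(?Title, ?Person, ?Role): search the nested cast lists of a movie.
played(Title, Person, Role) :-
    movie(Title, _Year, Fields),
    member(Field, Fields),
    ( Field = actor(Cast) ; Field = actress(Cast) ),
    member([Person, Role], Cast).

% ?- played(['American', 'Beauty'], Who, ['Spartanette' | _]).
%    succeeds twice: Who = ['Alison', 'Faulk'], then Who = ['Krista', 'Goodsitt'].
```

Because roles are lists, the Spartanette query is a plain head-of-list match rather than string surgery on an atom like spartanette_1. The cost is visible too: the commas and quotes roughly double the punctuation compared with the term-expression version.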

The next question is whether I can prove that we can build a Prolog inference engine that operates on data structured like this. I’ve been putting this off for a long time because writing a Prolog resolver from scratch is a bit nasty. If Prolog had a slightly cleaner list syntax, I would already just be running Prolog over lists.