Is history hidden in the words we read and write about the world around us?

A massive study of 150 years of digitized British newspapers, published today, finds that events can be pinpointed, and historical patterns better understood, by crunching big-data numbers from old newspaper stories.

“We have demonstrated that computational approaches can establish meaningful relationships between a given signal in large-scale textual corpora and verifiable historical moments,” said Tom Lansdall-Welfare, a computer scientist and one of the authors.

The University of Bristol team crunched the numbers on some 36 million articles, totaling 28.6 billion words, comprising about 14 percent of all the regional U.K. newspaper content from 1800 to 1950, as described in the Proceedings of the National Academy of Sciences.

Trends emerged from the words and their usage. “Britishness” emerged as a popular idea in the first part of the 20th century and spiked during the world wars. The turning point from steam to electrical power came in 1898, according to the analysis. Mentions of trains overtook those of horses in 1902, foreshadowing the future of transportation. The term “panic” spiked during the banking crises of 1826, 1847, 1857 and 1866. “Suffragette” and “suffrage” rocketed up in 1906 and remained intense until 1918, when British women got the vote. Culturally, actors, singers and dancers began to gain prominence over politicians in the 1890s and on into the turn of the 20th century. “Football” overtook “cricket” in mentions in the first decade of the 20th century.

“The research team showed that changes and continuities detected in newspaper content can reflect culture, biases in representation or actual real-world events,” said Nello Cristianini, professor of Artificial Intelligence at the University of Bristol, and leader of the team. “More detailed studies on the same data will be performed.”

Cristianini pointed Laboratory Equipment to a study that the same team, including the FindMyPast Newspaper archiving team from Scotland, published in PLOS One in November. He said the initial looks were “proof of concept,” and that many more forays into the data were to come.

“We will keep looking at U.S. historical news, plus we are now looking at crime in the Victorian U.K. and I just got ahold of some Italian newspapers,” the AI professor told Laboratory Equipment.

The Bristol studies are not the first of their kind. Some 5 million books in English published over the course of 200 years were scanned similarly for cultural clues, with results published in the journal Science in 2011. Critics of that study focused their ire on the methodology of counting words alone, without adding context or accounting for other variables.

Newspapers, the authors of the new study contend, were much better than books at tracking events in a timely, traceable way.

“We found that the impact of key events, such as coronations, conclaves, wars and epidemics, was much more obvious in our corpus, with peaks allowing us to identify specific years in which events occurred,” they write, adding that books are more “reflective in nature and less time-bound.”

But the human element is not included in the computerized analysis, they add.

“However, what cannot be automated is the understanding of the implications of these findings for people,” said Lansdall-Welfare, the computer scientist. “That will always be the realm of the humanities and social sciences, and never that of machines.”