Standard keyword searching compares word frequencies in one document with the frequencies in a standard corpus of text from many sources. If a word in the document occurs more frequently than average, it is considered important.
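As a rough illustration of that conventional approach, the sketch below scores each word by how much more often it turns up in a document than in a reference text. The function names, the add-one smoothing and the `min_count` cut-off are assumptions made for this example, not details of any particular search engine.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_ratio_keywords(document, reference, top_n=5, min_count=2):
    """Rank words by document frequency relative to a reference corpus.

    Hypothetical helper: a word scores highly when it is more common in
    the document than in the reference text.
    """
    doc_counts = Counter(tokenize(document))
    ref_counts = Counter(tokenize(reference))
    doc_total = sum(doc_counts.values())
    ref_total = sum(ref_counts.values())
    vocab_size = len(set(doc_counts) | set(ref_counts))

    scores = {}
    for word, count in doc_counts.items():
        if count < min_count:
            continue  # skip words too rare to judge
        doc_freq = count / doc_total
        # Add-one smoothing (an assumption) so unseen words do not divide by zero.
        ref_freq = (ref_counts[word] + 1) / (ref_total + vocab_size)
        scores[word] = doc_freq / ref_freq
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Note that the result depends entirely on which reference corpus is chosen, which is the dependence the new method aims to avoid.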
The new method gauges the importance of words in a document based on where they appear, rather than simply on how often they occur. "You should be able to detect an intrinsic property of a book without the need to compare it with different books," says Pedro Carpena, a physicist at the University of Malaga in Spain.
Carpena previously used mathematics from a field called random matrix theory to analyse quantum systems. He now says the same technique can be used to identify salient words in documents (Physical Review E, vol 79, p 035102).
Important words tend to be clustered together, Carpena says, while less important words appear more randomly distributed. This makes intuitive sense, he adds: as authors develop important ideas, they are likely to use relevant words many times in the same paragraph or page before moving on to other ideas. Less important words such as "and" and "but" tend to occur more evenly through the text.
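The clustering idea can be sketched with a much simpler statistic than the one in the paper. The example below is not Carpena's random matrix theory calculation; it is a stand-in that measures how uneven the gaps between successive occurrences of a word are, using the standard deviation of the gaps divided by their mean. The function names and the `min_count` threshold are assumptions for illustration.

```python
import re
from statistics import mean, pstdev


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def clustering_scores(text, min_count=3):
    """Score each word by how unevenly its occurrences are spaced.

    Hypothetical helper: a score near 0 means the word is spread evenly
    through the text, while a larger score means its occurrences bunch
    together with long stretches in between.
    """
    positions = {}
    for index, word in enumerate(tokenize(text)):
        positions.setdefault(word, []).append(index)

    scores = {}
    for word, where in positions.items():
        if len(where) < min_count:
            continue  # too few occurrences to measure spacing
        # Distances between each occurrence and the next one.
        gaps = [later - earlier for earlier, later in zip(where, where[1:])]
        scores[word] = pstdev(gaps) / mean(gaps)
    return scores
```

Sorting the words by this score, highest first, gives a crude keyword list that needs no comparison with other texts.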
The technique has worked well in Carpena's tests. Using random matrix theory to extract keywords from a book by Albert Einstein called Relativity: The special and general theory, he found "universe", "field", "gravitational", and "energy" among the top 10 results.
The method could even generate useful keywords when Carpena removed the spaces from a text document and asked the computer to identify significant letter combinations between 2 and 35 characters long. This suggests it might also work on more abstract data sets. Carpena and his colleagues are currently testing the idea on the human genome to see whether it can extract useful information about genes.
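A first step in such an experiment might look like the sketch below, which strips everything but letters from a text and records where each letter combination starts. This is only an assumed preprocessing step, with hypothetical names, and it does not decide which combinations are significant; the default lengths follow the 2 to 35 characters quoted above.

```python
import re
from collections import defaultdict


def substring_positions(text, min_len=2, max_len=35):
    """Map every letter combination in a space-free text to its start positions.

    Hypothetical helper: returns the stripped character stream and a
    dictionary from each substring to the list of offsets where it begins.
    """
    # Drop spaces, punctuation and digits so only a stream of letters remains.
    stream = re.sub(r"[^a-z]", "", text.lower())
    positions = defaultdict(list)
    for length in range(min_len, max_len + 1):
        for start in range(len(stream) - length + 1):
            positions[stream[start:start + length]].append(start)
    return stream, dict(positions)
```

The position lists could then be passed to whatever spacing statistic is being used to pick out clustered combinations.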
It's not clear whether the search method is actually superior to existing ones, says Oren Etzioni, a computer scientist at the University of Washington in Seattle. He points out that Carpena has yet to compare his results with existing methods.
"Often great discoveries are made when techniques from one discipline are tried in another. This is potentially very promising, but they're wading into a crowded field," Etzioni says.