Normalized Google distance

The Normalized Google Distance is a semantic similarity measure derived from the number of hits returned by the Google search engine for a given set of keywords. Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of Normalized Google Distance, while words with dissimilar meanings tend to be farther apart.
Specifically, the Normalized Google Distance (NGD) between two search terms x and y is
$$\operatorname{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}}$$
where N is the total number of web pages searched by Google multiplied by the average number of singleton search terms occurring on pages; f(x) and f(y) are the numbers of hits for search terms x and y, respectively; and f(x, y) is the number of web pages on which both x and y occur.
If NGD(x, y) = 0 then x and y are viewed as alike as possible, but if NGD(x, y) ≥ 1 then x and y are very different.
If the two search terms x and y never occur together on the same web page, but do occur separately, the NGD between them is infinite. If both terms always occur together, their NGD is zero.
Example: On 9 April 2013, googling for "Shakespeare" gave 130,000,000 hits; googling for "Macbeth" gave 26,000,000 hits; and googling for "Shakespeare Macbeth" gave 20,800,000 hits. The number of pages indexed by Google was estimated by the number of hits for the search term "the", which was 25,270,000,000. Assuming there are about 1,000 search terms on the average page, this gives N = 25,270,000,000,000. Hence, with base-2 logarithms,
$$\operatorname{NGD}(\text{Shakespeare}, \text{Macbeth}) = \frac{26.95 - 24.31}{44.52 - 24.63} \approx 0.13,$$
so "Shakespeare" and "Macbeth" are very much alike according to the relative semantics supplied by Google.
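The definition and the worked example can be checked with a few lines of Python. This is a minimal sketch rather than a reference implementation: the function name `ngd` and its parameter names are chosen here for illustration, the hit counts are the 9 April 2013 figures quoted in the example, and N is taken as 25,270,000,000 × 1,000 as the example assumes.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance from hit counts.

    fx, fy -- hits for the terms x and y on their own
    fxy    -- hits for pages containing both x and y
    n      -- the normalizing constant N
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if n <= min(fx, fy):
        raise ValueError("N must exceed the smaller hit count")
    if fxy <= 0:
        # The terms occur separately but never together.
        return math.inf
    log_fx, log_fy = math.log2(fx), math.log2(fy)
    numerator = max(log_fx, log_fy) - math.log2(fxy)
    denominator = math.log2(n) - min(log_fx, log_fy)
    return numerator / denominator


# Hit counts from the example; N = 25,270,000,000 pages * 1,000 terms per page.
shakespeare_macbeth = ngd(130_000_000, 26_000_000, 20_800_000, 25_270_000_000 * 1_000)
print(round(shakespeare_macbeth, 2))  # 0.13
```

The base of the logarithm does not affect the result, because the same base appears in the numerator and the denominator.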

Introduction

The Normalized Google Distance (NGD) is derived from the earlier Normalized Compression Distance (NCD).
Objects can be given literally, like the literal four-letter genome of a mouse
or the literal text of Macbeth by Shakespeare. The similarity of such objects is given by the NCD. For
simplicity, all meaning of the object is taken to be represented by the literal object itself.
Objects can also be given by name, like 'the four-letter genome of a mouse'
or 'the text of Macbeth by Shakespeare.' There are
also objects that cannot be given literally, but only by name,
and that acquire their meaning from their contexts in the background common
knowledge of humankind, like 'home' or 'red.' The similarity between names for objects is
given by the NGD.

Google Distribution and Google Code

The probabilities of Google search terms, conceived as
the page counts returned by Google divided by
the number of pages indexed by Google,
approximate the relative frequencies of those search terms
as actually used in society. Based on this premise,
the relations represented by the normalized Google distance
approximately capture the assumed true semantic
relations governing the search terms. The NGD uses the World Wide Web
as its corpus and Google as its search engine. Other text corpora,
such as Wikipedia, the King James version of the Bible, or the Oxford English Dictionary,
can be used together with appropriate search engines.
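The idea of replacing the Web by another corpus can be sketched on a toy collection of documents. Everything named below is an assumption made for illustration: the helper names, the six-sentence corpus, the whitespace tokenization, and the normalizer N = 12. The sketch also assumes that the code length of a term is G(x) = log2(1/g(x)), where g(x) = f(x)/N is the probability described above, and it uses the identity NGD(x, y) = (G(x, y) − min{G(x), G(y)}) / max{G(x), G(y)}, which is algebraically the same as the hit-count form because log N cancels in the numerator.

```python
import math


def code_lengths(docs, x, y, n):
    """Return (G(x), G(y), G(x, y)) in bits over a list of documents.

    n is the normalizer N; it is a free parameter here (an assumption).
    """
    token_sets = [set(doc.lower().split()) for doc in docs]
    fx = sum(x in tokens for tokens in token_sets)
    fy = sum(y in tokens for tokens in token_sets)
    fxy = sum(x in tokens and y in tokens for tokens in token_sets)

    def code(count):
        # G = log2(1 / g) with g = count / n; infinite if the event never occurs.
        return math.inf if count == 0 else math.log2(n / count)

    return code(fx), code(fy), code(fxy)


def ngd_from_codes(gx, gy, gxy):
    """NGD expressed through code lengths instead of raw hit counts."""
    if math.isinf(gx) or math.isinf(gy):
        raise ValueError("each term must occur in at least one document")
    if max(gx, gy) == 0:
        raise ValueError("N must exceed the smaller document count")
    return (gxy - min(gx, gy)) / max(gx, gy)


# Hypothetical toy corpus standing in for the Web.
docs = [
    "macbeth is a tragedy by shakespeare",
    "shakespeare wrote hamlet and macbeth",
    "shakespeare wrote many sonnets",
    "the mouse genome has been sequenced",
    "a mouse ran across the stage",
    "the genome of the mouse",
]
close = ngd_from_codes(*code_lengths(docs, "shakespeare", "macbeth", n=12))
far = ngd_from_codes(*code_lengths(docs, "shakespeare", "mouse", n=12))
print(close, far)
```

In this toy corpus "shakespeare" and "macbeth" share documents and come out close, while "shakespeare" and "mouse" never co-occur and are infinitely far apart.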

Properties

The following properties have been proved:

The NGD lies roughly between 0 and ∞. It can be slightly negative: for example, "red red" gives about 20% more Google hits on the World Wide Web than "red", so the count for the pair exceeds the count for the single term and the numerator of the NGD becomes negative. If NGD(x, y) ≥ 1, then x and y are viewed as very dissimilar.
The NGD is not a metric. As noted at the beginning, the NGD is zero for x and y that are not equal, provided x and y always occur together on the same web page. The NGD formula shows that it is symmetric. The triangle inequality is not satisfied by the NGD. However, these results are theoretical: it is hard to come up with practical examples on the World Wide Web using Google that violate the triangle inequality.
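The failure of the triangle inequality can be illustrated with purely hypothetical page counts (not measured on the Web). Suppose x and z each occur on 100 pages and never together, while y occurs on exactly those 200 pages. The symbols follow the definition of the NGD, and N is any normalizer larger than 200.

```latex
\begin{align*}
f(x) &= f(z) = 100, \quad f(y) = 200, \quad f(x,y) = f(y,z) = 100, \quad f(x,z) = 0, \\
\operatorname{NGD}(x,y) &= \operatorname{NGD}(y,z)
  = \frac{\log 200 - \log 100}{\log N - \log 100}
  = \frac{\log 2}{\log N - \log 100} < \infty, \\
\operatorname{NGD}(x,z) &= \infty > \operatorname{NGD}(x,y) + \operatorname{NGD}(y,z).
\end{align*}
```

Here NGD(x, z) is infinite because x and z occur separately but never on the same page, while both distances through y are finite.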

Applications

Applications to colors versus numbers, primes versus non-primes, and so on have been reported,
as well as a randomized massive experiment using WordNet categories. In the primes versus non-primes case
and the WordNet experiment, the NGD method is augmented with a support vector machine classifier.
The experiments consist of 25 positive examples and 25 negative ones. The WordNet experiment consisted of 100 random WordNet categories. The NGD method had a success rate of 87.25%: the mean
was 0.8725 and the standard deviation was 0.1169. These rates measure agreement with the WordNet categories, which represent the knowledge of the researchers with PhDs who entered them. It is rare to see agreement below 75%.
