Recently, I made a few discoveries while digging around on Google. Shall we draw conclusions from the following?
So here we have some root tokens at work for the search “CNG veh,” as evidenced by the truncated string matching that has taken place. We also see a partial phrase, “Natural Gas,” which is derived from the acronym “CNG” by way of the longer phrase “Compressed Natural Gas.” Google clearly groups discrete strings from the general English lexicon, but in exactly what capacity is difficult to determine. Where there are no exact string matches, or where the only exact matches sit on pages that Google deems low-quality, Google appears to have delved into not only the root pile but also a matrix of co-occurring strings that cross-match within the root pile, locating combinations of root-derived strings that coexist on pages and satisfy my less-than-likely search.
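To make the idea concrete, here is a toy sketch of truncated-token matching combined with page-level co-occurrence counts. It is only an illustration of the general technique, not a claim about how Google actually does it. The pages in `PAGES` are invented, and the helper names `expand_prefix`, `cooccurrence_counts`, and `related_terms` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Invented toy "pages" for illustration only; not real search results.
PAGES = [
    "compressed natural gas vehicles for fleet use",
    "cng natural gas vehicle conversion kits",
    "natural gas vehicle fueling stations",
    "compressed natural gas cng fueling",
    "computer competition results",
]


def expand_prefix(token, vocabulary):
    """Return the vocabulary words that begin with a (possibly truncated) token."""
    return sorted(word for word in vocabulary if word.startswith(token))


def cooccurrence_counts(pages):
    """Count how many pages each unordered pair of distinct words shares."""
    counts = Counter()
    for page in pages:
        words = sorted(set(page.split()))
        counts.update(combinations(words, 2))
    return counts


def related_terms(token, pages, top_n=2):
    """Return the words that share the most pages with the given token."""
    scores = Counter()
    for (a, b), n in cooccurrence_counts(pages).items():
        if a == token:
            scores[b] += n
        elif b == token:
            scores[a] += n
    return [word for word, _ in scores.most_common(top_n)]


vocabulary = {word for page in PAGES for word in page.split()}
print(expand_prefix("veh", vocabulary))
print(expand_prefix("comp", vocabulary))
print(related_terms("cng", PAGES))
```

In this toy corpus, “veh” expands to “vehicle” and “vehicles,” “comp” expands to “competition,” “compressed,” and “computer,” and “cng” co-occurs most often with “natural” and “gas.” A real engine would obviously weigh far more signals than raw page counts.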
Yahoo! and Bing return vastly different results. While Bing associates “CNG” with “Compressed Natural Gas” and highlights both terms within its results, Yahoo! does not seem to make the same association. Bing returns an exact match for “CNG veh” as its first result, and Yahoo! does so with its second. Both search engines return matches that tend to occur in non-HTML documents. Neither returns token or root matches. Yahoo! suggests that I “Also try: cng veh in” as a search. Bing makes no suggestions.
Google associates computers and competition with the “comp” in the query “comp nat gas,” but fails to associate the more obvious “compression” or “compressed.” Four of the first 10 results for “comp nat gas” actually contain the word “compression” without highlighting, which seems to indicate that, though the term is present, the association is not strong enough to merit confidence.
There are instances where associative confidence is present only in certain sections of the results page. Check out the bottom-of-page results for “cng veh,” where Google highlights the Honda NGV in its similar searches and twice fails to highlight “cng,” which was actually a string in my query!
My theory is that these are perhaps some of the common roots at play in the associative partial-word substitutions seen in the examples above. Is it possible that there is some more loosely constructed means of determining the substituted words, one which does not rely on a defined lexicon and a matrix of lexical roots? Is this the result of some substring or backtracking function working in tandem with co-occurrence probabilities? And is it even correct to refer to part of the phenomenon as LSI? Is LSI even a real component of Google’s organic search? Would a phrase-based method of indexing and association-making be a more fitting explanation? Or are these instances more akin to LDA? What does any of this mean anyway? Should we just refer to it broadly as semantic analysis or probabilistic semantics? Maybe associative semantics?
Is it possible that these associations and query permutations are derived from a small result set (perhaps 100 to 1000 documents), rather than from the entire index? And what role does lexical uniqueness play in triggering these sub-algorithmic determinations (as surely things are kicking in here that are not normally present for a more common search string)? Or are they?
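For what it’s worth, deriving associations from a small result set rather than the whole index resembles what information retrieval folks call pseudo-relevance feedback. Below is a minimal sketch of that idea, assuming a naive term-overlap ranking. The function name `expansion_terms`, its parameters, and the sample documents are all hypothetical, and nothing here is confirmed to be what Google does.

```python
from collections import Counter


def expansion_terms(query, documents, top_k=2, n_terms=2):
    """Suggest related terms using only the top_k documents for a query.

    Ranking is a naive count of shared terms, purely for illustration.
    """
    query_terms = set(query.lower().split())

    def overlap(doc):
        return len(query_terms & set(doc.lower().split()))

    # Keep only a small "result set" instead of scanning every document.
    top_docs = sorted(documents, key=overlap, reverse=True)[:top_k]

    # Count in how many of those top documents each non-query term appears.
    counts = Counter()
    for doc in top_docs:
        for word in set(doc.lower().split()):
            if word not in query_terms:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n_terms)]


# Invented sample documents for illustration only.
docs = [
    "cng natural gas vehicle",
    "cng natural gas fueling",
    "computer competition results",
]
print(expansion_terms("cng", docs))
```

With these sample documents, the terms suggested for “cng” are “natural” and “gas,” drawn from just the two top-ranked documents. The third document never enters the calculation, which is the whole point of working from a small result set.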
These “unexpected” variations within search results raise a few questions. I guess it’s time to look closely at some of Google’s patents.
And, if you’re interested, take a look at some of the recent research going on over at SEOMoz regarding the role of LDA in the ranking of search results.
Any qualified takes on these are welcome.