In 2009, to get some hands-on experience with data visualization and API manipulation, I decided to investigate the relationship that might exist between Google and Wikipedia.
Google wants to index the world’s knowledge, and Wikipedia is a freely available source. To what extent does Google favor pages from the Wikimedia Foundation in its search results?
I needed an entry point: a list of things to search for on Google, so that I could observe the results and pinpoint which ones led to Wikipedia. This approach had no real scientific ambition; it was just an exercise to play around with APIs. Still, so that it would at least yield somewhat meaningful results, I decided to take the list of all the English articles in Wikipedia and use their titles as search terms.
The English version of the encyclopedia had a little more than a million articles at that time. It took me about a month, using a regularly updated Python script, to "google" all these titles and record every lucky URL returned by the search engine.
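The original script is not reproduced here, but the step that decides whether a recorded URL points to Wikipedia can be sketched as follows. This is a minimal illustration using only the standard library; the function name `classify` and the three labels ("wikipedia", "other", "empty") are assumptions made for this example, not names from the 2009 script.

```python
from urllib.parse import urlparse


def classify(url):
    """Label a recorded first-result URL (hypothetical helper).

    Returns "empty" when no URL was recorded, "wikipedia" when the host
    is wikipedia.org or one of its subdomains, and "other" otherwise.
    """
    if not url:
        return "empty"
    host = urlparse(url).hostname or ""
    if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
        return "wikipedia"
    return "other"


# Made-up sample data, for illustration only.
samples = ["https://en.wikipedia.org/wiki/Python", "https://example.com/", ""]
labels = [classify(u) for u in samples]
```

Comparing the host name rather than searching the whole URL avoids counting pages on other sites that merely mention "wikipedia.org" in their path.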
I then composed an image in which each pixel represents one article / search term. A blue pixel means the search term led to a Wikipedia page as the first result. A grey pixel means it led to another website. The red pixels show empty search results. For this last category, does it mean Google had no answer, or was my script buggy at that moment? We’ll never know.
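The one-pixel-per-search-term idea can be sketched in a few lines. The version below is only an assumption about how such an image might be built: it writes a binary PPM file with the standard library, and the label names, the exact RGB values, and the white padding for an incomplete last row are all choices made for this example.

```python
import math

# Hypothetical labels mapped to the colors described above.
COLORS = {
    "wikipedia": (0, 0, 255),   # blue: first result is a Wikipedia page
    "other": (128, 128, 128),   # grey: first result is another website
    "empty": (255, 0, 0),       # red: no result recorded
}
PADDING = (255, 255, 255)       # white fill for an incomplete last row


def render_ppm(categories, width):
    """Return the bytes of a binary PPM image, one pixel per category."""
    height = math.ceil(len(categories) / width)
    pixels = bytearray()
    for i in range(width * height):
        color = COLORS[categories[i]] if i < len(categories) else PADDING
        pixels.extend(color)
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(pixels)


# Made-up sample: three search terms laid out on a 2-pixel-wide grid.
image = render_ppm(["wikipedia", "other", "empty"], width=2)
```

The returned bytes can be written to a `.ppm` file opened in binary mode. Because pixels are filled row by row in list order, neighbouring search terms end up next to each other in the image.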
Compiling all these results, I realized that Google replied directly with a Wikipedia page in 60% of the cases. Looking at the generated image, it is interesting to see that there are some big chunks of search terms that either link or don’t link at all to the encyclopedia.
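A percentage like this one is just the share of each outcome in the recorded results. As a rough sketch, assuming every search term has been given one of three hypothetical labels ("wikipedia", "other", "empty"), it could be computed like this; the sample data is made up and is not the 2009 data set.

```python
from collections import Counter


def shares(categories):
    """Return the fraction of search terms falling into each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.items()}


# Made-up sample: 3 of 5 first results point to Wikipedia.
sample = ["wikipedia", "wikipedia", "other", "wikipedia", "other"]
result = shares(sample)
```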
The second stage of this project should have been the creation of a visualization tool that would let a user explore the data set. But this step never reached maturity.