
Personal Evaluations of Search Engines:

Google, Yahoo! and MSN

Department of Computer Science
University of Illinois at Chicago

Comparison of search evaluation results from fall 2006 and fall 2007
See a new evaluation conducted in Spring 2011: Google vs. Bing vs. Blekko

Date of Evaluation: Fall, 2006

Abstract

In many ways, search engines have become the most important tool for information seeking. Because of their tremendous economic value, search engine companies constantly put major effort into improving their search results. Measuring search effectiveness is thus an important issue. Although many evaluations of search engines have been done in the past, they mainly use fixed sets of queries and have a panel of human judges assess the relevance of each returned page [1, 3, 4, 5, 8, 9]. The results are often measured by precision and recall, just as in information retrieval. However, this evaluation method is by no means ideal because relevance does not mean user satisfaction. User satisfaction can only be measured using queries from the user's daily information needs, based on his/her personal assessment of the utility of the returned results. An ideal evaluation is a personal evaluation. In this article, I describe and summarize the personal evaluations by 25 people of three major search engines: Google, Yahoo! and MSN search (which is now called Live). In terms of user satisfaction, the lead of Google over Yahoo! and MSN is huge. Although this may not be very surprising, the results do provide some valuable quantitative information.

1. Introduction

The motivation for this evaluation was mainly to satisfy my own curiosity: how different are the main search engines in terms of their search effectiveness? Over the years, I have heard people say both that they are very different and that they are quite similar. I decided to evaluate them myself. One day in summer 2006, I forced myself to use only MSN search, which I had not used before. At the beginning, I felt that it was not so bad, but after 1.5 weeks I had to give up and go back to Google. It was on navigational queries (see below) that MSN search was quite far behind Google. For example, I could not find the Web site of "Lianhe Zaobao", a Chinese newspaper in Singapore, when I used the query "Lian he". With the same query, the site came up top in Google. Yahoo! search could not find the site either. When I used "Lianhe" as the query term, all three search engines found the site and ranked it at the top at that time. Sadly, MSN (now Live) no longer does so even with "Lianhe": more precisely, the site was found but ranked only 5th when this article was written and put on the Web on May 27, 2007. Furthermore, I could not find several research papers that Google could find. However, my personal experience may not be generalizable. Thus, I decided to conduct a larger scale evaluation. As a professor, asking the students to help was a natural choice.

As we all know, due to the information redundancy on the Web, it is easy to find a huge number of relevant pages for almost any query. Thus page relevance is no longer a major issue. The usefulness/utility of each top-ranked page to each individual user becomes the key. The evaluation of usefulness can only be done based on queries derived from the user's personal information needs and his/her personal perception of the returned results. It was this belief that guided this evaluation. Such an evaluation can truly tell us why users choose one search engine over another. The results also show the weaknesses of each search engine and hence tell the search engine companies where they should focus their efforts in order to improve their search effectiveness.

2. Evaluation Setup

The evaluation was conducted in September 2006 with 25 students in a data mining and text mining class at the Department of Computer Science, University of Illinois at Chicago. Since they were all graduate students, the results given below reflect only this segment of the general population.

The students were split into three groups. The evaluation was done over a span of three weeks. The students were asked to use only one search engine in each week. Other search engines could be used only if the designated search engine was unable to give a satisfactory result. In order not to favor any search engine, and to keep positive or negative sentiments about one search engine from affecting the evaluation of the next, each group was assigned a different search engine in each week according to the following schedule:

Group	Week 1	Week 2	Week 3
Group 1	Google	MSN	Yahoo!
Group 2	Yahoo!	Google	MSN
Group 3	MSN	Yahoo!	Google
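The rotation in the schedule above has the shape of a 3x3 cyclic Latin square: each group uses every search engine exactly once, and each engine is used by exactly one group in any given week. The following is a minimal sketch of how such a schedule could be generated and checked. The function names build_schedule and is_latin_square are hypothetical and were not part of the original evaluation.

```python
# Illustrative sketch only: reproduces the three-week rotation as a cyclic Latin square.
ENGINES = ["Google", "Yahoo!", "MSN"]  # order in which groups 1..3 start in week 1


def build_schedule(engines):
    """Return schedule[group][week]; each group shifts one engine back per week."""
    n = len(engines)
    return [[engines[(group - week) % n] for week in range(n)] for group in range(n)]


def is_latin_square(schedule):
    """Check that every row (group) and every column (week) has no repeated engine."""
    n = len(schedule)
    rows_ok = all(len(set(row)) == n for row in schedule)
    cols_ok = all(len({schedule[g][w] for g in range(n)}) == n for w in range(n))
    return rows_ok and cols_ok


schedule = build_schedule(ENGINES)
for group_number, row in enumerate(schedule, start=1):
    print(f"Group {group_number}: " + " -> ".join(row))
print("Balanced rotation:", is_latin_square(schedule))
```

With only three groups, this rotation balances which engine is used in which week, but it does not cover all six possible orderings of the three engines.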

We did not have a user interface system that could hide the identities of the search engines so that students would not know which search engine they were using. Thus, the evaluation could be slightly affected by their pre-conceptions about each search engine. However, I explicitly told the students that they should be as fair as possible and not be affected by any pre-conceptions of which search engine is better, and that they should not factor the efficiency (or speed) of each system into their evaluation (efficiency is nevertheless an important issue; see Section 4).

As indicated above, our evaluation had no fixed queries. The students were asked to perform their daily searches as usual, based on their daily information needs, without any change. The only requirement was that they stick to the same search engine for the week and use another search engine only if the first one did not give good results. They were also asked to record two simple pieces of information for each query: the type of query and the level of personal satisfaction. No precision or rank positions of the search results were measured.

In the existing literature, two main types of queries have been identified [2, 7]: navigational queries and informational queries.

A navigational query is one that usually has only one satisfactory result, or there is a unique page that the user is looking for. For example, the user types "CNN" and expects to find the Web site of CNN.com, or types the name of a researcher and expects to find his/her homepage.

An informational query can have a few or many appropriate results, with varying degrees of relevance or utility and varying degrees of authority. Many times the user may need to read a few pages to get the complete information. For example, the user types the topic "search engine evaluation" and expects to be provided with pages related to the topic.

Note that the taxonomies of search queries proposed in [2, 7] include sub-categories at multiple levels. They also have another large category called transactional queries [2, 7]. However, there is still no general consensus on the classification. I used only the two most frequent types in order to keep the evaluation simple and less painful for the students (and to avoid confusing them).

I used only three levels of personal satisfaction: completely satisfied, partially satisfied and not satisfied (again for simplicity). Each student decided for him/herself the satisfaction level for the results of each query, without any given criteria.
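The recording scheme described above (one query type and one satisfaction level per query) can be sketched as a small data structure with a tally step. This is an illustration only: the class QueryRecord, the function tally and the sample entries are hypothetical, and the article does not say how the students actually kept their records.

```python
# Illustrative sketch only: a per-query record with the two fields the students logged.
from collections import Counter
from dataclasses import dataclass

QUERY_TYPES = ("navigational", "informational")
SATISFACTION_LEVELS = ("not satisfied", "partially satisfied", "completely satisfied")


@dataclass(frozen=True)
class QueryRecord:
    engine: str        # search engine designated for the week
    query_type: str    # one of QUERY_TYPES
    satisfaction: str  # one of SATISFACTION_LEVELS

    def __post_init__(self):
        # Reject labels outside the two query types and three satisfaction levels.
        if self.query_type not in QUERY_TYPES:
            raise ValueError(f"unknown query type: {self.query_type!r}")
        if self.satisfaction not in SATISFACTION_LEVELS:
            raise ValueError(f"unknown satisfaction level: {self.satisfaction!r}")


def tally(records):
    """Count records per (engine, query type, satisfaction level) cell."""
    return Counter((r.engine, r.query_type, r.satisfaction) for r in records)


# Made-up entries for illustration; these are NOT data from the evaluation.
sample_log = [
    QueryRecord("Google", "navigational", "completely satisfied"),
    QueryRecord("Google", "navigational", "completely satisfied"),
    QueryRecord("Google", "informational", "partially satisfied"),
    QueryRecord("MSN", "navigational", "not satisfied"),
]
counts = tally(sample_log)
for cell, n in sorted(counts.items()):
    print(cell, n)
```

Each cell of such a tally corresponds to one count in a results table organized by query type, search engine and satisfaction level.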

On the possible bias of the evaluation, I believe that for navigational queries, bias from students' pre-conceptions was unlikely because they knew exactly which pages they were looking for. For informational queries, some slight bias may exist, but it should have been minimized by my warning above. As we will see later, it is on navigational queries that Google has a huge lead over both Yahoo! and MSN.

3. Evaluation Results

Table 1 gives the results for navigational queries, and Table 2 gives the results for informational queries. In each table, columns 2, 3 and 4 are the results for the three search engines. Each cell in row 2 shows the number of not satisfied searches. The number within () in each cell is the percentage of that search engine's queries that falls into the cell. Likewise, rows 3 and 4 give the corresponding results for partially satisfied and completely satisfied queries. The final row in each column gives the total number of queries for that column.

Table 1. Results for navigational queries.

Navigational query	Google	Yahoo!	MSN
Not satisfied	15(4%)	37(14%)	49(19%)
Partially satisfied	39(11%)	60(23%)	70(27%)
Completely satisfied	303(85%)	166(63%)	141(54%)
Total	357	263	260

Table 2. Results for informational queries.

Informational query	Google	Yahoo!	MSN
Not satisfied	137(21%)	103(21%)	110(21%)
Partially satisfied	93(14%)	149(30%)	162(31%)
Completely satisfied	416(65%)	247(49%)	257(48%)
Total	646	499	529
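The percentages in the two tables can be recomputed from the raw counts as each cell's share of its column total. The sketch below uses the counts exactly as printed in Tables 1 and 2; the names TABLE_1, TABLE_2 and percentages are only illustrative. Plain rounding reproduces every printed figure except two "completely satisfied" cells in Table 2 (Google and MSN), where it differs by one point, presumably because the printed values were adjusted so that each column sums to 100%.

```python
# Illustrative sketch only: recompute column percentages from the printed counts.
# Each tuple is (not satisfied, partially satisfied, completely satisfied).
TABLE_1 = {  # navigational queries
    "Google": (15, 39, 303),
    "Yahoo!": (37, 60, 166),
    "MSN": (49, 70, 141),
}
TABLE_2 = {  # informational queries
    "Google": (137, 93, 416),
    "Yahoo!": (103, 149, 247),
    "MSN": (110, 162, 257),
}


def percentages(counts):
    """Return each count as a whole-number percentage of the column total."""
    total = sum(counts)
    return tuple(round(100 * c / total) for c in counts)


for label, table in (("Navigational", TABLE_1), ("Informational", TABLE_2)):
    print(label)
    for engine, counts in table.items():
        not_sat, partial, complete = percentages(counts)
        print(f"  {engine}: total={sum(counts)}, not satisfied={not_sat}%, "
              f"partially={partial}%, completely={complete}%")
```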

From Table 1, we observe that for navigational queries, Google is dramatically better than Yahoo! and MSN. MSN is the weakest. The students were unsatisfied with only 4% of the queries for Google, which is truly remarkable!

For informational queries, Google is also better than the other two. However, the difference is much smaller compared to that for navigational queries. Yahoo! and MSN performed similarly. The proportions of unsatisfactory results are about the same for all three search engines.

From the results in both tables, we can clearly see why Google dominates in the marketplace. Google's key strength is its ability to find the right page for almost every navigational query!

We noticed that students had more informational queries than navigational queries. This difference may simply be due to the fact that they were students and thus searched for information on more complex topics. For the general public, the situation may be reversed. We also note that there were more queries for Google than for Yahoo! and MSN. The students' explanation was that Google gave better results and thus encouraged more subsequent searches.

4. Before and After the Evaluation

Right before the evaluation, I asked the students which search engines they were using daily. Everyone answered Google. Only 4 students said they had tried Yahoo! search a few times before. No student had used MSN search. After the evaluation, students were asked two questions:

1. Has the evaluation changed your perceptions on the three search engines?

2. Would you consider switching to a different search engine?

The answer for question 1 was Yes for 80% of the students. They said that Yahoo! and MSN were not as bad as they thought before the evaluation. Two students even sent me emails saying that Yahoo! search was not that bad. For question 2, everyone said that they would still stick to Google as their primary search engine. 24% of the students said they might use Yahoo! from time to time. There was still no student willing to use MSN.

A final note is that students raised the issue that both Yahoo! and MSN were slower than Google, and they all agreed that speed definitely affects which search engines they would choose to use.

5. Conclusion

This article reported a group of 25 personal evaluations of three search engines: Google, Yahoo! and MSN (or Live). I believe that the results truly reflect the sentiments of users based on their personal information needs and assessments. One main shortcoming of this evaluation is that the population segment is narrow (graduate students of computer science). Future evaluations should involve people from different walks of life, which of course is more difficult to do.

Final thoughts: Google was remarkably better than Yahoo! and Live search in the Sept 2006 evaluation. However, Google may be reaching the limit of the current search paradigm. Further improvement by Google will take much more effort (possibly an exponential amount of effort for linear gain). It is thus time for both Yahoo! and Live search to catch up, which is easier to do. Based on the evaluation results, I believe that it is going to be very hard for either of them to overtake Google (unless Google makes bad decisions), but getting close to Google is very likely, which in my opinion will happen in the not-too-distant future.

Acknowledgements

I would like to thank the 25 students in my fall 2006 CS583 class for their participation in the evaluation. Xiaowen Ding helped analyze the evaluation results. Some discussions with Zijian Zheng (Microsoft) and Ramakrishnan Srikant (Google) helped improve the presentation.

References

1. J. Bar-Ilan, Methods for measuring search engine performance over time, Journal of the American Society for Information Science & Technology, 53(4), p.308-319, 2002.

2. A. Broder. A taxonomy of Web search. SIGIR Forum, 36(2), 2002.

3. H. Chu and M. Rosenthal. Search engines for the World Wide Web: a comparative study and evaluation methodology. Proceedings of the 59th annual meeting of the American Society for Information Science, 1996.

4. M. Gordon and P. Pathak, Finding information on the World Wide Web: the retrieval effectiveness of search engines, Information Processing and Management: an International Journal, 35(2), p.141-180, March 1999.

5. D. Hawking, N. Craswell, P. Bailey, K. Griffiths. Measuring search engine quality, Information Retrieval, 4(1), 2001.

6. L. T. Su, H. Chen, and X. Dong. Evaluation of Web-based search engines from the end-user's perspective: a pilot study. Proceedings of the 61st Annual Meeting of the American Society for Information Science, 1998.

7. D. E. Rose and D. Levinson. Understanding user goals in Web search. WWW-04, 2004.

8. M-C. Tang and Y. Sun. Evaluation of Web-based search engines using user-effort measures. Libres 13.2, 2003.

9. L. Vaughan. New measurements for search engine evaluation proposed and tested, Information Processing and Management: an International Journal, 40(4), 2004.