Group Assignment
Information Search and Retrieval, Graz University of Technology, WS 2012

Ranking: Ranking Algorithms and Search Engine Optimization

Group 10
Paul Kapellari, Technische Universität Graz, [email protected]
Daniel Krenn, Technische Universität Graz, [email protected]
Georg Kothmeier, Technische Universität Graz, [email protected]

Supervisor
Univ.-Doz. Dr.techn. Christian Gütl, Institute for Information Systems and Computer Media (IICM), Graz University of Technology, Austria, [email protected] and [email protected]

Index
Abstract / Zusammenfassung
1 Introduction
1.1 Problem description
1.2 Motivation
1.3 Structure
2 Ranking Algorithms and Strategies
2.1 In Degree
2.2 PageRank
2.3 HITS
2.4 SALSA
2.5 Summary about link based algorithms
2.6 Non link based approaches
3 Web Search Engines / Web Search Services
3.1 Google
3.2 Alexa
3.3 DMOZ
4 SEO
4.1 Techniques and Methods
4.2 Problems
5 New Trends
6 Conclusion
7 Appendix
7.1 References
7.2 List of figures

Abstract
With the vast amount of information available in computer systems, finding and retrieving information becomes more and more challenging for users. Information needs to be analyzed and catalogued in order to make it accessible. When searching for specific documents, results need to be ranked and ordered properly to offer users the highest possible quality of information. This paper discusses the process of "ranking", more specifically "ranking algorithms and search engine optimization (SEO)". After a short overview, it presents different ranking algorithms and strategies, and gives an insight into web search engines and services. Furthermore, some methods for search engine optimization and their problems are discussed. Finally, new trends in this field are introduced.

Zusammenfassung
Due to the enormous amount of information made available in computer systems, searching for and finding information has become increasingly difficult. To make information accessible, it is necessary to analyze it, catalogue it, and sort it by relevance in order to guarantee high quality for the user. This work deals with the topic of "ranking", more precisely with "ranking algorithms and search engine optimization". After a short introduction to the subject, various algorithms and the way they work are explained, and an overview of different web search engines and web search services is given. Furthermore, possibilities for search engine optimization and the problems connected with them are described. Finally, future trends concerning the topics above are discussed.

1 Introduction

1.1 Problem description
Nowadays a vast amount of information is available to everyone. The amount of accessible digital data in particular has grown enormously over the last years; many people speak of an information flood. At the same time, the need for information keeps growing. The reasons for information seeking differ widely: people perform research for educational reasons, for their jobs, but also out of personal interest (e.g. news, hobbies and so on). In this huge heap of information, the biggest challenge is to find the right resource, the one that provides the information you need. This is where ranking tries to help the user. If there are just a few search results, any user can easily distinguish important resources from unimportant ones.
This may be true for some small local databases, but on the web there are billions of resources, and nobody can rank all of these documents manually. A good ranking algorithm is therefore essential for every search engine. There are many different approaches (which we will discuss later, see section 2), but all try to accomplish the same goal: the most relevant resource should be displayed first and the most irrelevant at the end of the list. Some approaches are more successful than others. Because the web is very diverse and HTML is far from being semantic, there are many different attempts to find good resources and rank them in descending order of relevance. Right now (2012) it seems that Google has the best ranking strategy, which is why this paper discusses many ideas, algorithms and approaches from the Google universe.

So there are the search engine providers on the one side, who want to rank as well as possible, and on the other side there are webmasters, who want to get their websites into the top places of every search. From this, another discipline emerged, called Search Engine Optimization (SEO for short). Especially for people who run their business over their websites, it is indispensable to appear in the first positions of the search results. This is why they started to optimize their internet presence for search engines as well. SEO became an industry of its own: almost every marketing agency offers SEO, there are plenty of websites explaining SEO techniques, and conferences are held on the topic, also in Austria (see http://www.seokomm.at/; the event took place in Salzburg on 23 November 2012). Chapter 4 deals with SEO in detail and gives some examples. Because there are always people who try to benefit more than others, there are also a lot of problems around SEO. These people want to boost their rank by using unfair and dishonest methods, doorway pages and link farms just to name a few. These kinds of problems are explained in detail in chapter 4.2. So there is a constant competition between the search engines and the spammers, which is why ranking algorithms and strategies evolve over time. The new trend seems to be more personalized ranked results. Some applications do this so well that the user does not even recognize that he gets ranked results, or that he is searching at all; see Google Now on Android phones. There is a lot going on in this field: crowd ranking, folksonomies and social media are also getting more important for ranking. These new trends are presented in chapter 5.

1.2 Motivation
Since the internet began to grow and many people gained access to it, searching and ranking became more and more important. If you look at the big turnover of search services like Google, Yahoo and so on, you know that this is still a big topic in 2012. It does not seem as if small web developers can compete with the big companies in the searching and ranking area, so why this paper? The big ones do their own research. The target audience of this paper is not Google or Yahoo; it is you as an interested web developer. Only if you understand the concepts behind ranking can you optimize your product and get it placed where you want it to be. Ranking and SEO belong closely together, which is why this paper discusses both.

1.3 Structure
As mentioned before, this paper has two main parts, ranking and SEO. First there is a brief overview of ranking, followed by the more technical part, in which algorithms and approaches are discussed.
The SEO part is more practical and therefore needs the theoretical background provided in the previous chapters. Chapter 5 focuses on new trends, to give the reader an outlook on possible new developments.

2 Ranking algorithms and strategies
To clarify what ranking is, a short explanation at the beginning. Imagine you are searching for a document among several others. You will find a set of documents which could be relevant. Now this set should be ordered, ideally so that the first result is the most relevant one. This is what ranking tries to do: sorting search results in a meaningful way to help users find what they are searching for. There are plenty of different ranking algorithms and strategies out there. Which one is the best is hard to tell; it depends strongly on the context, the use case and the system to which the ranking algorithm should be applied. The more specialized a system is, the more advanced methods are possible. For specific information systems one can use complex machine learning approaches like Bayesian networks, neural networks and so on. The more diverse the information in a system becomes, the more general the approaches have to be. But for all of these systems it is essential to perform well in ranking: if users do not find what they are searching for, they will consider the system worthless. For this reason the ranking strategy is one of the main success factors for every information system. To illustrate how important ranking is, think of Google or Yahoo: without an outstanding ranking they could close their whole business. Because this paper deals with ranking in the World Wide Web, its main focus is on link based algorithms. These types of algorithms seem to perform very well in the context of the web and form the basis of many search strategies.

2.1 In Degree
All link based algorithms look at the web as a graph: documents are the vertices and links are the edges, and a user browsing through the net is doing a random walk on this graph. The idea for an algorithm based on the in degree of a vertex is inspired by citations in the academic world: a document A seems to be important if many other documents cite A. For scientific papers this largely holds, and ranking by the in degree of a vertex would be enough. But the web is very different from scientific citations. This was also mentioned in the paper "The PageRank Citation Ranking: Bringing Order to the Web" (see (Brin, Page, Motwani, & Winograd, 1999)). The authors note that the web is more diverse, especially in terms of quality and content: there is no quality assurance, everybody is able to publish content, and the type of content varies from texts about someone's hobbies over news to scientific publications. This is why they started to think about how to improve the idea of in degree, and they came up with PageRank. Some papers published since then show that you can approximate PageRank with in degree (see (Upstill, Craswell, & Hawking, 2003) or (Fortunato, Boguñá, Flammini, & Menczer, 2008)) and save a lot of computation time, but Google still relies on PageRank and is very successful, so this strategy still seems to be very good. A small sketch of plain in-degree ranking follows below.
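To make the in degree idea concrete, here is a minimal Python sketch; it is our own illustration, and the toy graph and page names are made up, not taken from the cited papers:

# In-degree ranking on a toy web graph: pages are vertices, links are edges.
from collections import defaultdict

links = {
    "A": ["C"],   # page A links to page C
    "B": ["C"],
    "C": [],
}

def in_degree_rank(links):
    indeg = defaultdict(int)
    for page, targets in links.items():
        indeg[page] += 0              # ensure every page appears in the result
        for target in targets:
            indeg[target] += 1        # one "citation" for each incoming link
    # pages with the highest in degree come first
    return sorted(indeg.items(), key=lambda kv: kv[1], reverse=True)

print(in_degree_rank(links))          # [('C', 2), ('A', 0), ('B', 0)]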
2.2 PageRank
The first idea for improving the in degree was to find a measure of how important an incoming link is: a link from a web page with high reputation should be worth more than a link from a very poor web page. To model this, two types of links were introduced: "back links" and "forward links".

Figure 1: Forward links and back links. Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)

The figure above shows a simple link structure with 3 web pages. A and B each have one "forward link" and no "back links", and C has 2 "back links" and no "forward link". For PageRank, a page is considered important if it has many "back links", so "back links" from these kinds of pages are very good for your own PageRank. To illustrate how this works in detail, see the figure below:

Figure 2: Snapshot of PageRank calculation. Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)

The first page achieves a PageRank of 100 with all its "back links". This page has two "forward links". To propagate a page's PageRank over its links, the page's own PageRank is simply divided by the number of its "forward links"; each of the first page's two links therefore carries 50. The second page gets its PageRank from page one and from the page with PageRank 9, which splits its rank over three "forward links" (3 each). The sum arriving over the "back links" is 50 + 3 = 53, so the PageRank is 53. In general, PageRank is defined by the following recursive formula (Brin, Page, Motwani, & Winograd, 1999):

R(u) = c · Σ_{v ∈ B_u} R(v) / N_v

where B_u is the set of pages linking to u, N_v is the number of forward links of page v, and c is a normalization constant. This formula has two big problems: dangling links and infinite loops.

2.2.1 Random surfer
A tricky problem is the possibility of an infinite loop. This happens if some pages are only interconnected with each other and this circle has only a single incoming "back link". The pages in this circle would propagate their PageRank among themselves over and over again, producing a totally wrong PageRank which does not reflect the real value. The next figure shows how such an infinite loop happens:

Figure 3: Infinite loop. Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)

To solve this problem, the model of a random surfer was introduced. This model fits a real world scenario better and also solves the mathematical problem. The random surfer can visit every page at random; he can switch pages by simply typing a new URL into the browser's address bar. It is also very unlikely that a random surfer gets lost in an infinite cycle: after he recognizes that he is in a loop, he will jump to another page. The summation formula from above therefore has to be redefined (Brin, Page, Motwani, & Winograd, 1999):

R'(u) = c · Σ_{v ∈ B_u} R'(v) / N_v + c · E(u)

The vector E describes the likelihood that a user jumps to another page. Different distributions can be chosen, which lead to different results; (Brin, Page, Motwani, & Winograd, 1999) suggest a uniform distribution over all web pages and adding a term α to adjust the weight of E.

2.2.2 Dangling links
Dangling links are links which point to pages that have no outgoing links. The problem is that it is not known where such pages should distribute their PageRank. It could be that these pages really do not link to others, or that you simply do not see their links because your sample of the web is too small. The latter is the more likely case, because nobody has a full representation of the World Wide Web. To solve this problem, (Brin, Page, Motwani, & Winograd, 1999) suggest removing them during computation and adding them back after the process has converged. They note that this changes the results slightly, but has no big effect.

2.2.3 Convergence
After solving the two main problems, the algorithm can be applied to real data. It is started and can be stopped when the difference between the last iteration and the current iteration is smaller than a predefined, very small value. A minimal sketch of this iteration is shown below.
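The following Python sketch implements the iteration just described. It is a simplified illustration under our own assumptions: a uniform jump vector E weighted by 1 − α, and the rank of dangling pages spread uniformly over all pages (a common variant; Brin et al. instead remove dangling links during computation, as described above). It is not Google's actual implementation:

# Power iteration for PageRank with a uniform random-surfer jump.
def pagerank(links, alpha=0.85, tol=1e-8):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    while True:
        # rank mass of dangling pages (no forward links) is spread uniformly
        dangling = sum(rank[p] for p in pages if not links[p])
        new = {}
        for p in pages:
            incoming = sum(rank[q] / len(links[q])
                           for q in pages if p in links[q])
            new[p] = (1 - alpha) / n + alpha * (incoming + dangling / n)
        # stop when the change between two iterations is below the threshold
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            return new
        rank = new

links = {"A": ["C"], "B": ["A", "C"], "C": ["A"]}
print(pagerank(links))    # A and C end up with the highest ranks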
The results of (Brin, Page, Motwani, & Winograd, 1999) show that the algorithm converges very well and is also usable for big data sets.

Figure 4: Convergence rate for half size and full size link database. Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)

As you can see, the algorithm already converges after 52 iterations, and the difference between the half size link database and the full size link database is not that big; the scaling factor is roughly linear in log(n).

2.2.4 Google Matrix
To compute the algorithm efficiently, it can be implemented as a matrix multiplication. For this purpose the Google matrix was introduced. With the knowledge of the previous chapters it is easy to understand this matrix.

Figure 5: Description of the Google Matrix. Source: slide from the lecture Web Science and Web Technology at TU Graz, presented by Markus Strohmaier

The matrix H is simply the transition matrix with the probabilities of reaching one page over a link from another. In its usual textbook form (cf. Langville & Meyer, 2012), the random jump is added as G = αH' + (1 − α) · (1/n) · ee^T, where H' is H with the rows of dangling pages replaced by uniform distributions. The PageRank algorithm can then be described as finding the stationary vector π of this matrix:

π^T = π^T · G

2.2.5 Actual development
PageRank is still the heart of Google's ranking tactic (Moskwa, 2011), but it seems to be getting more and more competitors. It is hard to tell how Google really ranks, because this is one of its biggest secrets, but apparently different factors play a role in ranking in addition to PageRank. Especially the Social Web, Google+ and many more services have changed the WWW dramatically, so it is obvious that more factors are considered.

2.3 HITS
The HITS (Hypertext Induced Topic Search) algorithm emerged at the same time as PageRank, even a little earlier. It is still an important approach, although it was not as successful as PageRank. The idea behind HITS is similar to PageRank, with two big differences. First, a search is executed and the set of found documents is used to calculate the ranking: in contrast to PageRank, you do not consider the whole web graph but only a certain sub graph. The second major difference is that there are two ranking scores, the hub rank and the authority rank. Kleinberg defines a hub as a page which links to many others, and an authority as a page with many incoming links. Some pages are also called universally popular; these pages have many incoming links and almost no outgoing links.

Figure 6: Hubs, authorities and universal populars. Source: Information Search and Retrieval at TU Graz, presented by Christian Gütl

As in PageRank, the hub and authority values are propagated from every page to the next. This leads to a pair of recursive formulas, a(p) = Σ_{q → p} h(q) and h(p) = Σ_{p → q} a(q), which are calculated iteratively with a normalization step after each round. One big challenge is to find the right sub graph. The ideal sub graph would be small and consist of very good hubs and authorities. Since this is not always the case, Kleinberg suggests looking at a bigger set than the one given by the search: pages which link into the initial set and pages which are linked from the initial set are also considered relevant.

Figure 7: Extending the root set to the base set. Source: (Authoritative Sources in a Hyperlinked Environment, 1999)

The figure from Kleinberg's paper illustrates the idea behind the extension. The initial set is called the root set and the expanded set the base set; often the base set is also referred to as the neighborhood graph. A small sketch of the hub/authority iteration follows below.
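As an illustration of the hub/authority propagation, here is a minimal Python sketch; the base set is a made-up toy graph, whereas a real implementation would work on the neighborhood graph obtained from a query as described above:

# HITS: alternate authority and hub updates, normalizing after each round.
import math

def hits(links, iterations=50):
    pages = list(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of the hub scores of all pages linking to p
        for p in pages:
            auth[p] = sum(hub[q] for q in pages if p in links[q])
        # hub score: sum of the authority scores of all pages p links to
        for p in pages:
            hub[p] = sum(auth[t] for t in links[p])
        # normalize so the values stay bounded
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in pages:
                scores[p] /= norm
    return hub, auth

links = {"A": ["C"], "B": ["A", "C"], "C": []}
hub, auth = hits(links)    # C is the strongest authority, B the strongest hub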
The HITS algorithm has many pros and cons (see Langville & Meyer, 2012, chapter 11.5). One advantage is its two ranking scores: because hub and authority values are distinguished, a user has the option to decide whether he wants to do a broader or a more specific search. For specific searches he will prefer authority pages, while for broad searches hub pages have advantages. The small subset for which the ranking has to be computed can also be an advantage, but it is a weakness as well: against spam, a smaller subset is not very resistant. The search for the sub set itself is also critical: how do you find the right sub graph? Building the neighborhood graph often leads to the inclusion of off topic pages. Many scientists wrote papers on how to solve these weaknesses, so HITS is also usable in real world scenarios. Monika Henzinger and Krishna Bharat (see (Improved Algorithms for Topic Distillation in a Hyperlinked Environment, 1999)) presented a solution for dealing with spamming. There are also many versions of HITS which make the algorithm query independent, by simply using the whole set of pages and not only a sub set. Longzhuang Li, Yi Shang and Wei Zhang presented "Improvement of HITS-based Algorithms on Web Documents" at the 11th International World Wide Web Conference. So there is a lot of research going on around HITS, and it is still used (www.ask.com), even though it has a very strong competitor named PageRank.

2.4 SALSA
SALSA was introduced in 2000 by Ronny Lempel and Shlomo Moran, so it was invented after PageRank and HITS. One idea was to combine the features of both. Like HITS, SALSA distinguishes between hubs and authorities, but the scores are calculated by a stochastic process in the form of a Markov chain; this is what it shares with PageRank. Like HITS, SALSA is query dependent and forms a neighborhood graph. This graph is then transformed into a bipartite graph, with the hubs on one side and the authorities on the other, and on this graph SALSA performs its random walk. Because of its stochastic nature, SALSA does not suffer from HITS's problem of deriving many off topic pages from the root page set. It is also more robust against spamming than HITS, although in this category PageRank is still better. Another advantage is the lower computation time due to the sub graph used, and as with HITS the user can choose between authority and hub results. The biggest drawback is again the query dependence; this should always be considered if someone wants to use a query dependent ranking algorithm. It is possible to fix this problem in the same way as for HITS, by calculating the scores for the whole graph. A very detailed description of SALSA can be found in (Langville & Meyer, 2012), chapter 12.

2.5 Summary about link based algorithms
Link based algorithms suit the structure of the web very well. Because the web is very heterogeneous, it is hard to apply specialized approaches there. Link based algorithms are also relatively easy to understand and fast to compute, and if designed intelligently, these approaches scale very well. One thing really differs between PageRank and the original HITS: PageRank computes a ranking over all pages, so you have a global ranking which can be used to rank the search results; PageRank is therefore not query dependent. HITS, in contrast, computes the ranking anew for every query, which is why HITS only yields a local ranking. The rest of the ideas are very similar.
For all readers who want to step deeper into these topics, the book (Langville & Meyer, 2012) is suggested. It contains a lot of examples and background information, including calculation examples with step by step walkthroughs of the whole algorithms.

2.6 Non link based approaches
The most common techniques used by search engines are link based approaches combined with influences from other factors. These other factors are not very clear, because every search engine keeps its real strategies a big secret. But besides the link based strategies there are some other approaches too, which we will introduce very briefly.

2.6.1 Rank aggregation
Because the search results of the different search engines differ considerably, a new idea came up which is called rank aggregation. A research study by Dogpile.com in collaboration with Queensland University of Technology and Pennsylvania State University (see (Different Engines, Different Results, 2007)) shows the differences in numbers. The following table has the details:

Figure 8: Unique results of a search engine in the top results. Source: (Different Engines, Different Results, 2007)

As you can see, the majority of the top search results differ from search engine to search engine. The rank aggregation approach tries to use these rankings and combine them into a new one; meta search engines use this concept (a small aggregation sketch follows at the end of this section). There is also a lot of research going on in this area, because it is not totally clear how to combine the different results in the best way.

2.6.2 Traffic Rank
Another approach is to rank web pages by their traffic. The simple and efficient idea is that the page with the most traffic is the most important one. For instance, Alexa computes a ranking based on the traffic occurring on a page; this ranking can be viewed at the following URL: http://www.alexa.com/topsites. There are different lists summarized by country, category etc. It is unfeasible to calculate the exact traffic of a web page, which is why Alexa invented its toolbar. This toolbar sends information about the user's surfing behavior to Alexa, and from this data Alexa computes a prediction of how high the traffic is. More about Alexa can be read in chapter 3.2.

2.6.3 Summary non link based approaches
Most of these techniques are very young, and there are several more out there than the two presented here. It is hard to say what the next big hit will be. Will link based ranking always be the best method, or will new strategies come up which succeed PageRank, HITS and co? Questions of this kind are addressed in chapter 5.
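To make the rank aggregation idea from section 2.6.1 concrete, here is a sketch of one simple and well known aggregation rule, the Borda count; whether a particular meta search engine uses exactly this rule is not specified in the cited study, and the result lists are made up:

# Borda count: each engine awards points by position, points are summed up.
def borda(rankings):
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for position, url in enumerate(ranking):
            # first place gets n points, second place n - 1, and so on
            scores[url] = scores.get(url, 0) + (n - position)
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["a.com", "b.com", "c.com"]
engine_b = ["b.com", "a.com", "d.com"]
print(borda([engine_a, engine_b]))    # ['a.com', 'b.com', 'c.com', 'd.com']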
3 Web search engines / Web search services
The following chapter gives an overview of one of the biggest web search engines, Google, and its measures to stay on top. Furthermore, other web search services like Alexa and DMOZ are introduced shortly.

3.1 Google
Today Google is the most powerful and most used web search engine in the world. Google answers more than one billion questions from people around the globe in 181 countries and 146 languages (www.google.com, 2012). It records the most traffic of all search sites and even has its own dictionary entry (Grappone & Couzin, 2011). But why is Google so powerful?

Figure 9: Basic information about Google. Source: (Grappone & Couzin, 2011)

Dana Blankenhorn from www.smartplanet.com said that the Google story is not about media or marketing, or young engineers: it is all about reducing the cost of doing business online, like the big online store Amazon (Blankenhorn, 2009). But this cannot be the only point. Besides web search, Google also offers different services like email, maps, a calendar, online document sharing, video, images and many more, which helps Google to be on everyone's lips (Grappone & Couzin, 2011). But they do not concentrate only on special services; they also work permanently on new search functions which guarantee that users easily find the requested information and stay on their site. The following enumeration shows some new search functions from Google:
- Instant Search
- Flight Search
- Handwrite Search for devices with touch screen
- Search by image
- Voice search
- Knowledge Graph
- Related Search
- Previews

Another reason why Google became so powerful is its ranking. Nobody in the public knows exactly how it works, but it is based on the Google PageRank, which is explained in chapter 2.2, and several other algorithms (Singhal, 2009). Amit Singhal wrote in the Google blog that three philosophies stand behind Google's ranking:
1. Best locally relevant results served globally.
2. Keep it simple.
3. No manual intervention.
So we do not know a lot about the ranking, but Google has said that there are about 200 factors which are definitely important. The following enumeration lists some of them:
- Domain: age, top level domain, sub or root domain, domain history, keyword in domain
- Server: geographical location, availability
- Architecture: URL and HTML structure, external CSS / JS, valid HTML code, cookies
- Content: language, amount of information, uniqueness, actuality, orthography
- Website: age, number of pages, XML sitemap, on page trust, style
- Keywords: in alt tags, in title, at the beginning of continuous text, in URL
- Outgoing links: number per domain / site
- Backlink profile: relevance and quality of linked websites
- Users: location, quantity

3.2 Alexa
Brewster Kahle and Bruce Gilliat founded Alexa in April 1996, named after the Library of Alexandria. The company crawls all publicly available websites to create a series of snapshots of the web. The data collected over time is used to create features and services like:
- Site Info: traffic ranks, search analytics, demographics
- Related Links: similar or relevant sites for the one the user currently views
Alexa gathers approximately 1.6 terabytes of web content per day; with each snapshot of the web it collects 4.5 billion pages from over 16 million sites.

Figure 10: Traffic rank of www.google.at. Source: (www.Alexa.com)

Alexa does not only crawl the internet, it also gathers web usage information. To get this information, it developed a toolbar for nearly every major browser. Every user who has this toolbar installed sends information about the web, how it is used and what is important and what is not, to the community, where it is processed for the services Alexa offers. (www.Alexa.com)

3.3 DMOZ
DMOZ, the Open Directory Project, is the largest human edited directory of the web, developed and maintained by a global community of volunteer editors. DMOZ has nearly 5.2 million sites and 97,000 editors for over 1 million categories. (dmoz - open directory project)

Figure 11: Homepage www.dmoz.org. Source: (dmoz - open directory project)

4 SEO
Why SEO? This chapter gives some insights into Search Engine Optimization (SEO) and clears up the question what SEO exactly is.
When speaking about SEO, or better the goal one wants to achieve by performing SEO, people would probably just think of improving a page's rank on a variety of search engines, to mention just Google for now. But in fact, the term Search Engine Optimization describes an entire set of activities which may be performed to increase the number of visitors who find a website through a particular search engine. These activities include not only techniques and methods applied to the HTML code of a website, but also to the text, i.e. the website's content itself. Before thinking about how to optimize a website for several search engines, one should answer the question "Which function does the website serve?", or better, "Does the website serve a function at all?" It is not uncommon that companies build a website just for the purpose of having a website. But even in that case, a site serves several functions, like an online store or product portfolio, a personal blog, some kind of news service, company information, or any other function one may think of. Becoming aware of a website's functions is the first step towards optimizing it, so that one is able to answer the second important question: "What do I want a visitor to do on my website?" Now the search engine optimization may begin. (Grappone & Couzin, 2011)

Before coming to the techniques and methods of SEO, this article first approaches the basics of search engines. Basically, their results for queries are divided into two types, the so-called "organic listings" and paid search ads (AdWords). Due to the fact that no SEO at all is necessary to list a website within the paid search ads, this article concentrates on the first type of results.

Figure 12: Comparison between the search engines Google and Yahoo showing results for the query "Skifahren". Results originating from paid search ads are shaded in color, respectively listed separately. Source: (www.google.com, 2012), (www.yahoo.de)

4.1 Techniques and Methods
The probably most important fact about SEO is that the text in a website's content plays the biggest role of all. Because web search is mostly text based, search engines will always consider how much text there is on a webpage, even if users search for multimedia content; how it is formatted and what the content says are crucial for the result. This simple fact has not changed since the beginning of search engines on the web. Every webpage contains both text visible and text invisible to the user. Invisible are, for example, alt tags of images or title tags of hyperlinks; not forgetting the meta tags of a webpage, search engines consider all these parts for the ranking of results, as will be explained soon.

Robots and Spiders
Agents, more specifically the so-called robots or spiders, continuously read the text of websites. In order to help those agents with their work, and by doing so optimize the website for search engines, there is the chance to communicate with the robots by using the just mentioned tags to include invisible text in the page. But the tags which mark up visible text are important to these robots as well, since they let them know which parts of the text are more important than others, as Figure 13 below shows. First, a short sketch illustrates which parts of a page such a robot might read.
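The following Python sketch is our own illustration, using only the standard library and a made-up HTML snippet; it collects exactly the kinds of text parts mentioned above, namely the title, the meta description and image alt texts:

# Extract title, meta description and image alt texts, as a robot might.
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.found = []                      # (where, text) pairs
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.found.append(("meta description", attrs.get("content", "")))
        elif tag == "img" and "alt" in attrs:
            self.found.append(("img alt", attrs["alt"]))
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.found.append(("title", data))

page = ('<html><head><title>Skifahren</title>'
        '<meta name="description" content="Skigebiete in der Steiermark">'
        '</head><body><img src="piste.jpg" alt="Piste"></body></html>')

parser = TagTextExtractor()
parser.feed(page)
print(parser.found)
# [('title', 'Skifahren'),
#  ('meta description', 'Skigebiete in der Steiermark'), ('img alt', 'Piste')]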
Figure 13: The first organic search result for "skifahren in der steiermark" on Google (left) and the corresponding website www.steiermark.at (right). Source: (www.google.com, 2012)

Google marks the text snippets which led to the result in bold print. In Figure 13 one can easily see that all parts of the search query show up in the web address, as well as in the HTML title tag and in a piece of text marked by a strong tag. The lesson is clear: to gain the best possible result for specific keywords on a specific page, they should appear both in the title tag and in the text on the page. Those two elements need to work together.

Myth Meta Tags
As a typical webpage contains so-called meta tags, which belong to the invisible text of a webpage, this paragraph puts some light on the importance of the meta description and keywords tags for a page's rank in search results. The meta description tag contains information which describes the website and can be displayed in the search results right beneath the link to a specific page.

<meta name="description" content="Aflenz Outdoorpark - der steirische Bewegungspark Aflenz Bürgeralm - das Naturschneeparadies">
<meta name="description" content="Eines der größten Skigebiete Österreichs im Herzen der Region Schladming-Dachstein und Austragungsort der Alpinen Ski WM 2013.">

Figure 14: Search results on Google showing the meta description right below the link to the website (left) and the corresponding HTML meta description of these pages (right). Words occurring both in the search query and in the meta description tag are printed bold in the result.

Figure 14 shows how big the influence of the meta description tag is on the search result and the information displayed to the user. On the other hand, there is also the so-called meta keywords tag in the HTML head. This gives the website owner an opportunity to simply enter a list of keywords without showing them in the visible text. But compared with the meta description tag, the keywords "carry little or no weight in search engine rankings" (Grappone & Couzin, 2011). This can easily be tested by entering all the keywords of a particular website into a search query: if these words do not occur in the rest of the website's text, the particular page will probably not be displayed at the top of the search results at all.

4.1.1 Google specifics
This chapter deals with Google specific optimizations. Some new optimizations are shown here, but Google also suggests using those that have already been explained before.

Keywords in URLs
Besides using unique title and description tags on every page, Google suggests an improvement of URLs in order to gain a better page ranking. Many content management systems use URLs with the id number of the respective article instead of words. This is not only hard for the user to work with, but also not good for the page rank, as the URL of a document is an important part of the search result. The best practice for optimizing a URL is to use several words in it which are relevant both for the site's content and its structure. It can also help if the structure of the menu, respectively the directory, is reflected by the URL. An example could be: "http://www.domain.com/stories/2012/keyword1-keyword2-...-.html". A small sketch of generating such URLs follows below.

Figure 15: Websites with "search engine friendly" URLs gain better search results. Source: (www.google.com, 2012)
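As a small illustration of how such "search engine friendly" URLs can be generated, here is a sketch that derives a URL slug from an article title; the domain and path scheme are made up for the example and are not a Google recommendation:

# Derive a keyword slug from a title: fold umlauts, keep letters and digits.
import re
import unicodedata

def slugify(title):
    ascii_title = (unicodedata.normalize("NFKD", title)
                   .encode("ascii", "ignore").decode())
    words = re.findall(r"[a-z0-9]+", ascii_title.lower())
    return "-".join(words)

url = ("http://www.example.com/stories/2012/"
       + slugify("Skifahren in der Steiermark") + ".html")
print(url)  # http://www.example.com/stories/2012/skifahren-in-der-steiermark.html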
A single URL for each webpage
The fact that Google puts a big value on the text in URLs could lead to the assumption that the same page could be linked under a bunch of different URLs in order to gain the best search result. But in truth, this leads to a technique called "duplicate content", which is strongly discouraged as search engine optimization and explicitly unwanted by Google; more on that in section 4.2. What can in fact be done in this matter is setting a 301 redirection to another webpage if it is absolutely necessary to have multiple URLs for the same target. This way, Google knows for a fact that there is only a single page containing this specific content.

Navigation on a website
The navigation, particularly the structure of the website, is very important, not only to visitors but also to search engines. Navigation should be planned in such a way that every page is accessible by following hyperlinks from the home page. In the end there should be as few different ways as possible to reach a specific page, in order to make navigation clearer to the user. Besides that, Google also suggests offering a breadcrumb navigation.

Sitemaps
Sitemaps: one for the user, one for the search engine. Sitemaps offer a way to present to the user every accessible page in a simple overview. But analyzing the user-specific sitemap is not always optimal for search engines, so Google suggests creating an XML based sitemap which is meant for the search engine only. A sketch of generating such a sitemap is shown below.
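A minimal sketch of writing such an XML sitemap follows; the URL list is made up, and only the required <loc> element of the sitemap protocol (http://www.sitemaps.org/) is used:

# Write a minimal sitemap.xml listing every page of the site.
urls = [
    "http://www.example.com/",
    "http://www.example.com/stories/2012/skifahren-in-der-steiermark.html",
]

with open("sitemap.xml", "w", encoding="utf-8") as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    f.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
    for url in urls:
        f.write("  <url><loc>%s</loc></url>\n" % url)
    f.write("</urlset>\n")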
4.2 Problems
SEO can become harmful if people try to push their rankings too far. Trying to achieve undeserved ranks is called spamming. If search engines discover sites to be spamming, even if they are not doing it on purpose, the site's rank may be downgraded or the site may even be banned.

4.2.1 Problems for search engines
Cloaking
The name cloaking describes a technique where a website relays robots to so-called doorway pages instead of showing them what the human visitors get to see. This way, pages can show significantly different content to search engine robots, which keeps the search engine from indexing the site correctly and from giving users accurate results for their search queries.

Duplicate Content
If the designer of a website produces duplicate content, in other words the same content twice or even several times, the search engine is no longer able to distinguish these pages, which renders the search result useless.

Keyword Stuffing
Adding important keywords to a page over and over again, not in a rational way but just to massively mention the same words, is penalized by search engines and can cause a bad ranking for the particular page.

Invisible Text
If there is text in the same color as the background it is on, so that it can be read not by a human user but only by a search engine, the page ranking is not accurate anymore, which causes the same bad effects as keyword stuffing or cloaking.

5 New trends
Over the last years the amount of information in the World Wide Web has increased strikingly. New innovations like social networks or smartphones are among the main reasons for this growth. But the more information is available, the more difficult the search becomes, and so some projects arose which use the new technology to improve searching. One project is "PeerSpective", a social network based web search. A group of scientists of the Max Planck Institute for Software Systems in Germany tries to integrate information from a social network into web search, because they think that social network links can be important for increasing the quality of search results (Mislove, Gummadi, & Druschel). So they built up their own network of ten people to share the downloaded and viewed content with one another. When a search was started, the query was sent to every user in the network and to Google. Every user had his own proxy, which executed the query on the locally indexed sites and sent the results to the sender. These results were displayed right next to the Google results, as shown in Figure 16.

Figure 16: Result of PeerSpective. Source: (Mislove, Gummadi, & Druschel)

To rank the results of the network, they used a Lucene text search engine ranking, multiplied it by the Google PageRank and added the scores of all users who had viewed the result. With this technique the search takes advantage of both the hyperlinks of the web and the social links of the network. (Mislove, Gummadi, & Druschel)

Social media sites have by now reached an enormous influence on page ranks. Postings on e.g. Facebook, Twitter or Google+ correlate strongly with search results regarding the specific websites these posts mention. It is therefore obvious that social media is quite helpful for gaining good search results. This has to do with establishing backlinks which lead users from the social platform back to a website. In the case of Facebook, if users "like" or "comment" on such a posting, or even share it with other users, the backlinks take care of the rest. Speaking of links, it is nowadays considered useful to have links which contain not just keywords but also stopwords, in order to guarantee a natural language hyperlink; search engines recognize and appreciate this kind of user friendliness. But one important point must not be ignored: having the keyword in the domain still puts the website on rank one, regardless of all other factors. Besides links, there is another important point regarding a website's content: having too many advertisements on a website can decrease the page rank, as it is considered spamming.

Figure 17: The Spearman rank correlation describes how the named factors influence search results. Source: http://www.searchmetrics.com/de/services/whitepaper/seo-ranking-faktoren-deutschland/

As we heard before, keywords in h1 and title tags are important to reach a good search result. But in the case of Google it is interesting to mention that keywords from the search query do not have to appear completely within the titles of the first search results; there, content matters most. Regardless of these facts about social media, having a website subscribed on Google+ obviously leads to better search results, for the moment: as one can see, a list of correlating results from Google+ will appear right below the first few search results.

Another project, called "Geooreka", tries to improve search results with geographical information. Geooreka is a web search engine integrated with a Geographic Information System (GIS) database, which allows searching web documents that refer to an area the user configures visually by means of a map (Buscaldi & Rosso, 2009). The architecture of Geooreka is quite simple, as you can see in Figure 18. The user selects an area and adds a search theme. Then all toponyms relevant for the chosen zoom level of the map are extracted, and web counts and common information are used to find the optimal results. To speed up the whole process, the web counts are calculated with the Google Web1T data, and the search for the theme and toponym combinations is processed by Yahoo!. (Buscaldi & Rosso, 2009)

Figure 18: Architecture of Geooreka. Source: (Buscaldi & Rosso, 2009)

Another trend in web searching is to show the answer directly, and not the site where you can find it.
Such engines are called answer engines; the most popular one, "Wolfram Alpha", was developed by Wolfram Research in 2009. The users of Wolfram Alpha submit their question via a text field. The engine then computes answers and relevant visualizations based on a big internal database and external data sources like Facebook, CrunchBase and Best Buy, working under the functional principle of cellular automata. The only disadvantage is that the user has to follow special rules for the input so that the engine can handle the query. (Wolfram Alpha - Wikipedia, the free encyclopedia, 2012)

Figure 19: Homepage www.wolframalpha.com. Source: (Wolfram Alpha)

6 Conclusion
The diversity of the web and the vast amount of information demand good search services. Besides searching itself, the ranking of results has become very important. The simple approaches of link based ranking strategies still form the basis of many search services, but it is totally clear that these factors are not the only ones; what else is considered for ranking is still one of the biggest secrets of Google and co. Many studies indicate that social signals are becoming more and more important, and "location based result ranking" is very trendy as well. Because of the many smartphones, this trend will surely increase over the next years. The ranking strategies permanently evolve and develop to fit the current requirements, and SEO practitioners have to do the same: they always need to adapt to new algorithms and approaches. The graphic by gShift Labs shows an estimation of what is important for a web presence to be found well.

Figure 20: Hierarchy of Web Presence Optimization. Source: http://searchenginewatch.com/article/2228256/10-SEO-Truths-of-2012-for-Agencies-In-House-Teams

As you can see, gShift defines a pyramid for successful SEO, in which the old and well known topics still form the base. On this base you can build the more advanced techniques and react to new trends. This seems reasonable, because ranking strategies do not change completely: best practice and good experience with different ranking factors will always be a part of ranking strategies. So you do not need to change your SEO strategy entirely just because there is some new hype out there on the web. It will be interesting to see what the future of search and ranking will be. There will definitely be more crowd based factors and more location aware factors, and, as applications like Google Now, Apple's Siri and Wolfram Alpha show, more and more semantic search tools will emerge. The future is hard to predict, because many people are thinking about creative ideas for ranking search results; with so many possible ways to define a rank, no one can say what the next big thing will be.

7 Appendix

7.1 References
Blankenhorn, D. (2009, October 9). Smartplanet. Retrieved December 9, 2012, from http://www.smartplanet.com/blog/thinking-tech/what-makes-google-powerful/1749
Brin, S., Page, L., Motwani, R., & Winograd, T. (1999). The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab.
Buscaldi, D., & Rosso, P. (2009). Geooreka: Enhancing Web Searches with Geographical Information. Valencia, Spain: Universidad Politécnica.
dmoz - open directory project. (n.d.). Retrieved December 9, 2012, from http://www.dmoz.org/
Dogpile.com, in collaboration with Queensland University of Technology and Pennsylvania State University. (2007). Different Engines, Different Results.
Fortunato, S., Boguñá, M., Flammini, A., & Menczer, F. (2008). Approximating PageRank from In-Degree. Springer-Verlag Berlin Heidelberg.
Grappone, J., & Couzin, G. (2011). Search Engine Optimization (SEO): An Hour a Day. John Wiley & Sons.
Henzinger, M., & Bharat, K. (1999). Improved Algorithms for Topic Distillation in a Hyperlinked Environment.
Kleinberg, J. M. (1999). Authoritative Sources in a Hyperlinked Environment.
Langville, A. N., & Meyer, C. D. (2012). Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press.
Mislove, A., Gummadi, K. P., & Druschel, P. (n.d.). Exploiting Social Networks for Internet Search. Max Planck Institute for Software Systems & Rice University.
Moskwa, S. (2011). Beyond PageRank: Graduating to actionable metrics. Google Webmaster Central.
Singhal, A. (2009, June 9). GoogleBlog. Retrieved December 9, 2012, from http://googleblog.blogspot.co.at/2008/07/introduction-to-google-ranking.html
Upstill, T., Craswell, N., & Hawking, D. (2003). Predicting Fame and Fortune: PageRank or Indegree? Department of Computer Science, CSIT Building, ANU Canberra.
Wolfram Alpha. (n.d.). Retrieved December 9, 2012, from http://www.wolframalpha.com
Wolfram Alpha - Wikipedia, the free encyclopedia. (2012, December 5). Retrieved December 8, 2012, from http://en.wikipedia.org/wiki/Wolfram_Alpha
www.Alexa.com. (n.d.). Retrieved December 9, 2012, from http://www.alexa.com/company/technology
www.google.com. (2012). Retrieved December 9, 2012, from http://www.google.com/competition/howgooglesearchworks.html
www.yahoo.de. (n.d.). Retrieved December 9, 2012, from http://www.yahoo.de

7.2 List of figures
Figure 1: Forward links and back links. Source: (The PageRank Citation Ranking: Bringing Order to the Web, 1999)
Figure 2: Snapshot of PageRank calculation
Figure 3: Infinite loop
Figure 4: Convergence rate for half size and full size link database
Figure 5: Description of the Google Matrix
Figure 6: Hubs, authorities and universal populars
Figure 7: Extending the root set to the base set
Figure 8: Unique results of a search engine in the top results
Figure 9: Basic information about Google
Figure 10: Traffic rank of www.google.at
Figure 11: Homepage www.dmoz.org
Figure 12: Comparison between the search engines Google and Yahoo showing results for the query "Skifahren". Results originating from paid search ads are shaded in color, respectively listed separately.
Figure 13: The first organic search result for "skifahren in der steiermark" on Google (left) and the corresponding website www.steiermark.at (right)
Figure 14: Search results on Google showing the meta description right below the link to the website (left) and the corresponding HTML meta description of these pages (right). Words occurring both in the search query and in the meta description tag are printed bold in the result.
Figure 15: Websites with "search engine friendly" URLs gain better search results
Figure 16: Result of PeerSpective
Figure 17: The Spearman rank correlation describes how the named factors influence search results
Figure 18: Architecture of Geooreka
Figure 19: Homepage www.wolframalpha.com
Figure 20: Hierarchy of Web Presence Optimization