Cybermetrics
Cybermetrics: Theory and practice
Isidro F. Aguillo
Version 2 (Nov’11)

Presentation: Isidro F. Aguillo
Current position: Head, Cybermetrics Lab, Spanish National Research Council (CSIC)
Background: MSc. Biology (Univ. Complutense, Madrid); MID (Univ. Carlos III, Madrid); DEA (Univ. Granada); Doctor Honoris Causa (Univ. Indonesia)
Research topics & other working activities
Rankings Portal: webometrics.info
Research projects: QEAVIS (e-humanities), MAVIR (multilingual Web), CARTO (R&D cartography), ICYTnet (Virtual Libraries)
EU funded projects: ACUMEN (indicators portfolio for individuals), OpenAIRE (EU central repository), WISER (cybermetrics), EICSTES (R&D web indicators), PEKING (knowledge management), IMPACT-INFO2000 (information society)
Founder and editor of the e-journal “Cybermetrics”
300 seminars and conferences in over 100 universities all over the world

Agenda
I. Descriptive Cybermetrics
II. Applied Webometrics
Methods and tools
Web indicators
Positioning in search engines
Optimising web contents
III. Usagemetrics
Log files and visits analysis
Popularity

MODULE 1: Descriptive Cybermetrics
Web Analysis
See also: Usability, Accessibility, Web Metrics

Definition
Cybermetrics is the discipline dedicated to the quantitative description of the contents and communication processes that take place in cyberspace.
Cyberspace is the set of contents accessible in electronic format.
The universal accessibility of the Internet suggests using this term as a synonym for the Internet of contents: basically, but not exclusively, the webspace.
Since cyberscientometrics is the most developed sub-field, for practical reasons it is referred to by the more general term Cybermetrics or the more specific term Webometrics.

Quantitative disciplines
Diagram labels: informetrics; bibliometrics; scientometrics; cyberscientometrics; webometrics; cybermetrics
Adapted from Björneborn

Relationships
Diagram labels: Science policy; Research management; Scientific documentation; Libraries; Services for research in economy; Sociology of science; applied; Librarianship and documentation; History of science; Scientometrics; basic; Informetrics; Life sciences; Webometrics; Mathematics/Physics; Other sciences/Humanities
www.ulb.ac.be/unica/docs/Sch-com-2004-pres-Glanzel.ppt

Advantages of the quantitative approach
Presence on the Web reflects the activities of an institution or individual more fully than traditional publications on paper
The Web reaches a greater audience than other traditional scientific communication media
In academia, professors, researchers and students put on the Web unpublished material, first drafts, preliminary versions of papers, course materials, presentation slides or databases
Scientific journals have a restricted distribution
The hypertext nature of the Web offers the possibility of discovering hidden patterns between different institutional sites
Academic sites link to other sites with a marked economic, industrial, cultural, political or social character

New application areas
Webometrics
Topology of hypertextual networks
Social networks
PageRank, HITS
Comparative analysis of search engines
Cyberscientometrics
Studies of electronic mail and forums
“Big Science” & Grid
Cybergeography and cyberdemography
New units: institutional Web sites
New indicators: Visibility; Popularity

Cybergeography, cyberdemography: data and sources
Internet Geography Project: www.zooknic.com
Cybergeography: www.cybergeography.org
Clickz Surveys: www.clickz.com/stats
Blog: www.internetworldstats.com/blog.htm
Demography and Geography of the Internet: www.sociosite.org/demography.php, www.sociosite.net/topics/webgeography.php
Internet Demographics Directory: internet-demographics.netfirms.com

Cyberdemography (I)
www.internetworldstats.com/stats.htm

Cyberdemography (II)

Cyberdemography (III)
www.internetworldstats.com/stats7.htm

Size of Internet: Infrastructures
Hosts
Lottor (World): www.isc.org/ds
RIPE (Europe): www.ripe.net/info/stats/hostcount/
Asia Web Watch: www.ciolek.com/Asia-Web-Watch/main-page.html
Servers
Netcraft: www.netcraft.com
Domains
World: www.norid.no/domenenavnbaser/domreg.html
Domain worldwide: www.domainworldwide.com, www.verisign.com/Resources/Naming_Services_Resources/Domain_Name_Industry_Brief/
Germany (and others): www.denic.de/en/domains/statistiken
Studies (outdated): www.zooknic.com

Internet evolution (Lottor)

Lottor
http://ftp.isc.org/www/survey/reports/2011/01/bynum.txt

Web servers
http://news.netcraft.com/archives/web_server_survey.html

Web contents
Webspace
Spireproject: 10,000 million (10/02), spireproject.com/art13.htm
Present day: 40+40.000 millions
Deposits
Archive: www.archive.org
Google Cache: www.google.com
Traffic
80% of browser sessions on the Web involve the use of a search engine or a directory.
Yahoo and, especially, Google are the most important intermediaries

Wayback Machine

Domains and countries
International domains (gTLD)
The problem with the gTLDs
First ones: .com, .org, .net, .int (.eu.int)
New ones: .biz, .info, .name, .aero, .coop, .museum, .eu, .cat
De facto: .cx, .tv, .cc
Special cases: .edu
Experiments
Google/Bing/Exalead
Filter operator “site:”
Problems with some ccTLDs
IP translators
IP Locator 1.41; AW IP Locator 1.8; IP Address Locator; Ip2location
www.atelierweb.com/iploc; www.geobytes.com/IpLocator.htm?GetLocation; www.ip2location.com/free.asp

Google: Languages and countries

Mentions

Academic Webspace
Sites
Institutional domains
OCLC Web Characterization (1998-2002): http://www.oclc.org/research/projects/archive/wcp/
Sites and institutional sites
Netcraft, October 2011: 500 million web sites
Active (50%) * (5-10 institutional sites per site) ~ 2,000 million institutional sites
Academic webspace
Academic subdomains (not every country)

Academic subdomains
ac.ae ac.at ac.bd ac.be ac.bw ac.by ac.ci ac.cn ac.cr ac.cy ac.fj ac.gg ac.gs ac.id ac.il ac.im ac.in ac.ir ac.je ac.jp ac.ke ac.kr ac.lk ac.lv ac.ma ac.mu ac.mz ac.nz ac.pa ac.pg ac.pl ac.ru ac.rw ac.se ac.sg ac.sz ac.th ac.tz ac.ug ac.uk ac.uz ac.vn ac.yu ac.za ac.zm ac.zw acad.bg
edu.al edu.am edu.ar edu.au edu.az edu.ba edu.bb edu.bh edu.bm edu.bn edu.bo edu.br edu.bs edu.bt edu.by edu.bz edu.ck edu.cn edu.co edu.cu edu.dm edu.do edu.dz edu.ec edu.ee edu.eg edu.gd edu.ge edu.gh edu.gr edu.gs edu.gt edu.gu edu.hk edu.hn edu.hu edu.jm edu.jo edu.kg edu.kh edu.kn edu.kw edu.ky edu.kz edu.lb edu.lc edu.li edu.lv edu.mk edu.mm edu.mn edu.mo edu.mp edu.mt edu.mx edu.my edu.na edu.nf edu.ng edu.ni edu.np edu.om edu.pa edu.pe edu.ph edu.pk edu.pl edu.pr edu.pt edu.py edu.qa edu.ru edu.sa edu.sg edu.sh edu.st edu.sv edu.to edu.tr edu.tt edu.tw edu.ua edu.uy edu.ve edu.vg edu.vn edu.ws edu.ye edu.yu edu.za edu.zm

Academic databases
Public Web
Google Scholar: scholar.google.com
Publish or Perish: www.harzing.com/pop.htm
Citations Gadget: code.google.com/p/citations-gadget/
MS Academic Search: academic.research.microsoft.com
Scirus: www.scirus.com
CiteSeerX: citeseerx.ist.psu.edu
Citebase: www.citebase.org
Paracite: paracite.eprints.org
DBLP: dblp.uni-trier.de
ScienceDirect: www.sciencedirect.com
(US) Science Gov: www.science.gov
In-extenso: www.in-extenso.org

Context
Diagram labels: Public Web; Private Web; Invisible Internet; Databases; Visible Web; Repositories; Electronic journals

Google Scholar

Scholar (II): Papers in university domains (January ’07)

Scholar: Publish or Perish

Google Scholar Citations (testing)

Microsoft Academic Search

MAS: Author entry

MAS: Institution entry

MAS: Comparing institutions

CiteSeerX

Rich files and media files
Rich files
Definition and types: Adobe Acrobat (pdf) and Postscript (ps); MS Office: Word (doc, rtf), Excel (xls), Powerpoint (ppt)
FilExt: www.filext.com
Size
Filter operators: filetype (Google, Live, Exalead)
Media files
Definition and types
Localization in search engines: Terms; Filter operators; Autonomous databases

Google (filetype)

Bing (filetype)

Images in search engines

Languages on the Net
Sources and studies
Users according to language: Global Reach, global-reach.biz/globstats/index.php3
Composition of the webspace
Experiments with search engines: Google; Yahoo!
Bing (ex-Live) Search; Ask (Teoma); Copernic

Users according to language
http://www.glreach.com/globstats/index.php3

Languages on the Net
Languages used to access Google: www.google.com/press/zeitgeist.html

Languages (Google): <lr> value
Language: Code
Arabic: lang_ar; Chinese (S): lang_zh-CN; Chinese (T): lang_zh-TW; Czech: lang_cs; Danish: lang_da; Dutch: lang_nl; English: lang_en; Estonian: lang_et; Finnish: lang_fi; French: lang_fr; German: lang_de; Greek: lang_el; Hebrew: lang_iw; Hungarian: lang_hu
Icelandic: lang_is; Italian: lang_it; Japanese: lang_ja; Korean: lang_ko; Latvian: lang_lv; Lithuanian: lang_lt; Norwegian: lang_no; Portuguese: lang_pt; Polish: lang_pl; Romanian: lang_ro; Russian: lang_ru; Spanish: lang_es; Swedish: lang_sv; Turkish: lang_tr

Countries (Google)
Country: Code
Andorra: AD; United Arab Emirates: AE; Afghanistan: AF; Antigua and Barbuda: AG; Anguilla: AI; Albania: AL; Armenia: AM; Netherlands Antilles: AN; Angola: AO; Antarctica: AQ; Argentina: AR; American Samoa: AS; Austria: AT; Australia: AU; Aruba: AW; Azerbaijan: AZ; Bosnia and Herzegovina: BA; Barbados: BB; Bangladesh: BD; Belgium: BE; Burkina Faso: BF; Bulgaria: BG; Bahrain: BH; Burundi: BI; Benin: BJ; Bermuda: BM; Brunei Darussalam: BN; Bolivia: BO; Brazil: BR; Bahamas: BS
Bhutan: BT; Bouvet Island: BV; Botswana: BW; Belarus: BY; Belize: BZ; Canada: CA; Cocos (Keeling) Islands: CC; Congo, DR: CD; Central African Republic: CF; Congo: CG; Switzerland: CH; Cote d'Ivoire: CI; Cook Islands: CK; Chile: CL; Cameroon: CM; China: CN; Colombia: CO; Costa Rica: CR; Cuba: CU; Cape Verde: CV; Christmas Island: CX; Cyprus: CY; Czech Republic: CZ; Germany: DE; Djibouti: DJ; Denmark: DK; Dominica: DM; Dominican Republic: DO; Algeria: DZ; Ecuador: EC
Estonia: EE; Egypt: EG; Western Sahara: EH; Eritrea: ER; Spain: ES; Ethiopia: ET; European Union: EU; Finland: FI; Fiji: FJ; Falkland Islands (Malvinas): FK; Micronesia, FS: FM; Faroe Islands: FO; France: FR; France, Metropolitan: FX; Gabon: GA; United Kingdom: UK; Grenada: GD; Georgia: GE; French Guiana: GF; Ghana: GH; Gibraltar: GI; Greenland: GL; Gambia: GM; Guinea: GN; Guadeloupe: GP; Equatorial Guinea: GQ; Greece: GR; South Georgia/South Sandwich I.: GS; Guatemala: GT; Guam: GU
Guinea-Bissau: GW; Guyana: GY; Hong Kong: HK; Heard and McDonald Islands: HM; Honduras: HN; Croatia (Hrvatska): HR; Haiti: HT; Hungary: HU; Indonesia: ID; Ireland: IE; Israel: IL; India: IN; British Indian Ocean Terr.: IO; Iraq: IQ; Iran: IR; Iceland: IS; Italy: IT; Jamaica: JM; Jordan: JO; Japan: JP; Kenya: KE; Kyrgyzstan: KG; Cambodia: KH; Kiribati: KI; Comoros: KM; Saint Kitts and Nevis: KN; Korea, DPR: KP; Korea, Republic of: KR; Kuwait: KW; Cayman Islands: KY
Kazakhstan: KZ; Lao PDR: LA; Lebanon: LB; Saint Lucia: LC; Liechtenstein: LI; Sri Lanka: LK; Liberia: LR; Lesotho: LS; Lithuania: LT; Luxembourg: LU; Latvia: LV; Libya: LY; Morocco: MA; Monaco: MC; Moldova: MD; Madagascar: MG; Marshall Islands: MH; Macedonia, FYR: MK; Mali: ML; Myanmar: MM; Mongolia: MN; Macau: MO; Northern Mariana Islands: MP; Martinique: MQ; Mauritania: MR; Montserrat: MS; Malta: MT; Mauritius: MU; Maldives: MV; Malawi: MW

Countries II (Google)
Country: Code
Mexico: MX; Malaysia: MY; Mozambique: MZ; Namibia: NA; New Caledonia: NC; Niger: NE; Norfolk Island: NF; Nigeria: NG; Nicaragua: NI; Netherlands: NL; Norway: NO; Nepal: NP; Nauru: NR; Niue: NU; New Zealand: NZ; Oman: OM; Panama: PA; Peru: PE; French Polynesia: PF; Papua New Guinea: PG; Philippines: PH; Pakistan: PK; Poland: PL; St. Pierre and Miquelon: PM; Pitcairn: PN; Puerto Rico: PR; Palestine: PS; Portugal: PT; Palau: PW; Paraguay: PY
Qatar: QA; Reunion: RE; Romania: RO; Russian Federation: RU; Rwanda: RW; Saudi Arabia: SA; Solomon Islands: SB; Seychelles: SC; Sudan: SD; Sweden: SE; Singapore: SG; St. Helena: SH; Slovenia: SI; Svalbard and Jan Mayen Is.: SJ; Slovakia (Slovak Republic): SK; Sierra Leone: SL; San Marino: SM; Senegal: SN; Somalia: SO; Suriname: SR; Sao Tome and Principe: ST; El Salvador: SV; Syria: SY; Swaziland: SZ; Turks and Caicos Islands: TC; Chad: TD; French Southern Territories: TF; Togo: TG; Thailand: TH; Tajikistan: TJ
Tokelau: TK; Turkmenistan: TM; Tunisia: TN; Tonga: TO; East Timor: TP; Turkey: TR; Trinidad and Tobago: TT; Tuvalu: TV; Taiwan: TW; Tanzania: TZ; Ukraine: UA; Uganda: UG; United States Minor Outlying I.: UM; United States: US; Uruguay: UY; Uzbekistan: UZ; Holy See (Vatican City State): VA; Saint Vincent and the Grenadines: VC; Venezuela: VE; Virgin Islands (British): VG; Virgin Islands (U.S.): VI; Vietnam: VN; Vanuatu: VU; Wallis and Futuna Islands: WF; Samoa: WS; Yemen: YE; Mayotte: YT; Yugoslavia: YU; South Africa: ZA; Zambia: ZM

Lists of universities
Braintrack: www.braintrack.com
Universities Worldwide: univ.cc
Galilei: www.galilei.com.ar
Webometrics Catalogue: www.webometrics.info/university_by_country_select.asp
HEIR: siu.no/heir
General Education Online: www.findaschool.org
International Colleges and Universities: www.4icu.org
Portal Tecnociencia: www.tecnociencia.es
Universia: www.universia.es
Canadian Universities: www.uwaterloo.ca/canu
U.S. Universities by State: www.utexas.edu/world/univ/state
Top American Research Universities: thecenter.ufl.edu
UK Higher Education Map: www.scit.wlv.ac.uk/ukinfo/uk.map.html
Times World Universities Rankings: www.thes.co.uk/worldrankings
German University Ranking: www.university-ranking.org
Academic Ranking of World Universities: ed.sjtu.edu.cn/ranking.htm
All Universities around the World: www.bulter.nl/universities
Ranking of China Universities: rank2005.netbig.com
Alphabetical Index of Japanese Universities: camp.ff.tku.ac.jp/TOOL-BOX/JapanUNIV

Personal agents (I)
Website extractors
AaronWebVacuum 2.9: www.surfwarelabs.com
JOC WebSpider 5.7: www.jocsoft.com
Teleport Pro 1.64: www.tenmax.com
Leech 4.3: www.aeria.com
WebCopier 5.4: www.maximumsoft.com
BlackWidow 6.28: www.softbytelabs.com
MemoWeb 4.0: www.goto.fr
Offline Commander 2.1: www.zylox.com
WebReaper 10: www.webreaper.net
Offline Explorer Pro 5.9: www.metaproducts.com
Website Extractor 10.0: www.asona.org
WebWhacker 5.0: www.bluesquirrel.com
WebZip 7.1: www.spidersoft.com
Website2PDF 1.0: www.spidersoft.com
Medusa 1.2: www.candego.com

Personal agents (II)
Link checkers
Alert LinkRunner 6.01: www.alertbookmarks.com/lr
HTML Link Validator 4.47: www.lithopssoft.com
HTML Validator Professional 11: www.htmlvalidator.com
Link Checker Pro 3.3: www.link-checker-pro.com
LinkScan Workstation 12.1: www.elsop.com
Web Link Validator 5.5: www.relsoftware.com/wlv
Xenu's Link Sleuth 1.3: home.snafu.de/tilman/xenulink.html

Personal agents (III)
HTML extractors
WebData Extractor 6.0: www.webextractor.com
Experiments
Site extraction with the offline browser Teleport Pro
Mapping of the extracted site with Xenu
Direct mapping of the site with Xenu
Link checking
Size of the site according to the search engines Google, Yahoo, Exalead, Ask, Gigablast

WebDataExtractor

Website extraction, checking and mapping

Cybermetrics of search engines
Search engines: Characteristics and problems
8 “different” big search engines: Google; Yahoo Search (now supplied by Bing); Bing (ex-Live) Search; Ask (ex-Teoma); Exalead; Wisenut; Gigablast; Alexa
Studies about search engines
Search Engine Showdown: searchengineshowdown.com
Search Engine Watch: searchenginewatch.com

Only seven (+one)?
Diagram: search engine databases (“Database”) and the portals they supply (“Site”) in 2003, 2004-2005 and 2006-2007. Names shown: Google, Netscape, Yahoo, AltaVista, AllTheWeb, Lycos, iWon, HotBot, MSN Search, Teoma, Ask Jeeves, Alexa, FAST, Inktomi, A9, Exalead, Wisenut, Gigablast, Live, Ask, HereUAre

Cybermetrics of search engines
Columns: Google; Bing (Live); Exalead; Ask; Gigablast
TLD: site:xx; site:xx; site:xx; site:xx; site:xx
Domain: site:aa.xx; site:aa.xx; site:aa.xx; site:aa.xx; site:aa.xx
Directory, Word in url, Link, Link domain (cells in slide order): site:aa.xx/bb site:aa.xx/bb site:aa.xx/bb NO inurl:xx NO NO inurl:xx url:xx inurl:xx inurl:xx link:aa.xx/b.htm NO link:www.aa.xx (NO) (NO) NO NO link:aaa.xx NO NO
File type: filetype:yy; filetype:yy; filetype:yy; filetype:yy; filetype:yy
Language: Advanced; Advanced; Advanced; Advanced; NO
Country: Advanced; (Advanced); Advanced; Advanced; NO

URL-mention

Outlinks

Quality, visibility and impact
Quantitative evaluation of institutional websites
The Google model
ToolBar installation (toolbar.google.com)
Page Rank: logarithmic scale
rankwhere.com/google-page-rank.php
www.rustybrick.com/pagerank-prediction.php
Components: visibility + weight
Visibility
Types of links: inlinks, outlinks, self-links, back-links
Calculation using search engines
Web impact (WebIF)
Link quality: link inspectors

Google Toolbar

RankWhere

PageRank Prediction

urltrends

Nutch

Popularity
Number of visits: difficult to obtain for comparative studies
Relative position
Popularity according to Alexa (www.alexa.com): only domains; worldwide coverage; some “absolute” values; temporal evolution; geographic biases (>> Asia)
Snapshot (snapshot.compete.com): only USA!!!
Ranking.com (www.ranking.com)
Traffic Estimate (www.trafficestimate.com)
Popularity according to Netcraft (toolbar.netcraft.com/site_report): institutional sites and variants; more restricted coverage; not comparable

Alexa

Limits of Alexa

Inequalities in Alexa
Position: % visits
Top 3: 23
Top 500: 45
Number 10: 5
Number 100: 0.1
Number 1,000: 0.06%
Number 10,000: 0.02%

Snapshot

Ranking.com

Netcraft

Working with links
Visibility
Inlinks (incoming links): Yahoo Site Explorer; Exalead: link: -site:
Outlinks (outgoing links) = Luminosity: link inspectors
Web impact
Definition of WebIF
Calculation = Visibility/size
Quality
Link checkers

Basic terminology
Diagram: link graph with nodes A, B, C, D, E, F, G, H
B has an outlink to C: ~ reference
B has an inlink from A: ~ citation
B has a selflink: ~ self-citation
E and F are reciprocally linked
A is transitively linked with H via B-D
A has a transversal link to G: short cut
Co-links
C and D are co-linked from B, i.e. shared inlinks: co-citation
B and E are co-linking to D, i.e. shared outlinks: bibliographic coupling

Cyberscientometrics
Development of R&D indicators in the Web
Units: Institutional site
Models: Co-sitation, social networks and theory of the “small world”
Small World: www.db.dk/lb/2002smallworld.pps
Indicators: Bibliometrics of e-journals and document repositories
CiteSeerX: citeseerx.ist.psu.edu
CiteBase: citebase.eprints.org/cgi-bin/search
Google Scholar: scholar.google.com
Arxiv: arxiv.org
Scirus: www.scirus.com
DBLP: dblp.uni-trier.de

Web indicators
Diagram labels: Scientometrics; Input; Output; R&D Indicators; Bibliometrics; Patentometrics; Web Indicators; Webometrics; Cybermetrics; Information Society Indicators

Building Indicators
Experiments
Codification: Institutional; Subject (UNESCO); Geographic (NUTS)
Indicators calculation
Visibility (sitations)
Visibility of the rich files
Visibility of articles in repositories
Visibility of electronic journals
Impact (WebIF)
Diversity
Co-citation

Composite indicators
Web Impact Factor (WebIF): Visibility (sitations) / Size (no. of pages)
Webometrics (Academic) Rank
Size: no. of webpages
No. of files: rich files (pdf, ppt, doc, ps)
No. of papers: Google Scholar; other bibliographic databases
Visibility: incoming external links; mentions
Popularity

Webometrics Ranking
www.webometrics.info

Size (number of pages)

Direct crawling

Other rankings
http://vcmike.blogspot.com/2006/01/ranking-colleges-using-google-and-oss.html

Other rankings: G-factor
http://www.universitymetrics.com/g-factor

Related (I)

Related (II)

MODULE 2: Applied Cybermetrics
Search Engine Optimization (SEO)
Web Positioning

Applied Cybermetrics
The aim is not only to publish on the Web, but to gain visibility
A search engine is used in 80% of web sessions
Getting a great number of visits (real audience close to the potential one)
Receiving external links
Being present in directories and portals
Web positioning is the key to increasing visibility
Quality influences the chances of getting a good position, but also...
The volume of information
The hypertext structure
The annotation of contents

Positioning
Presence measurements
Directory indexing
Pages actually indexed by a search engine / Total pages
Visibility measurements
Page Rank
Prominence by terms
Measurements of access and usage
Popularity: Absolute: number of visits; Relative: Alexa Ranking
Usage: Number of downloaded files; Average time per visit; Most frequent referral terms

PageRank Google

Problems
Design is irrelevant, or even counterproductive
Few indexable contents on the main page
Flash animations or Java applets that hinder the robots’ navigation
Invisible Internet
Databases and dynamic web pages cannot be indexed by search engines
Link quality
Continuous maintenance and updating of external and internal links is necessary
Rich files
Document files are handy for distributing information with added value
Formats: pdf, ppt, doc, ps

Tools
Webmasters World: tools.webmastersworld.org
SEO Encyclopedia: www.seopedia.info
Webmasters Tools: tools.devshed.com
SEO Online: www.seoonline.info
PageStrength: www.seomoz.org/tools/page-strength.php
Data Centers Tool: www.seocritique.com/datacentertool
SEO Tools: www.seochat.com/seo-tools
SEO Web Directory: www.seowebdirectory.com/SEO_Tools
SEO Company: www.seocompany.ca/tool/seo-tools.html
SEO ToolSet: www.webconfs.com

Criteria (Google)
Hypertext structure
Number of times that the search terms appear
Relative position of the search terms
Title and URL
Metadata
Headings
ALT tags and external anchors
Updating periodicity
Maturity: depth of the institutional sites
Visibility: PageRank
Neighborhood: external and internal links
Freshness (new contents)
Popularity: page visits
Local aspects (geographic, languages)

Criteria (Google)

Presence of terms in the URL
Very relevant
Preferably in the domain or subdomain
Recommended no longer than 30 characters
The order is important: http://better.good.xx/aceptable
Whole words, not truncated: http://lib.univ.edu versus http://library.university.edu (YES)
Independent terms/phrases (dash/underscore)
Universidad-Complutense = +Universidad +Complutense
Universidad_Complutense = “Universidad Complutense”

Agapea

Presence of terms in Title
Very relevant
Contents of the <TITLE> tag!!!
Keywords, not a title
The position is important: first words carefully selected
Long phrase, without stop words (~60 characters)
Don't repeat terms; bilingual option
Institutional identification, geographic localization
The contents of the <Hn> tags are also considered
The <H1> heading gives the title obtained
Move generic words (“Hello”, “Welcome”, “Page of”) to lower levels, <H2> or <H3>

Terms in Title

Metatags
They are not so important
Description
Up to 250 characters
Reusable tag for versions in other languages
The position is important: choose the first words wisely
Don’t repeat words
Keywords
Up to 20 terms
Terms SHOULD also appear in the text
Reusable tag for versions in other languages
The position is important: choose the first words wisely
Don’t repeat words
Description pre-cataloging
Use other tags: Dublin Core model (15 repeatable elements)

Generating META tags
Meta Builder 2: vancouver-webpages.com/META/mk-metas.html
Meta Tags Generator: www.meta-tags.us
MetaTags Generator: tools.webmastersworld.org/MetatagsGenerator.php
Meta Tag Generator: www.invision-graphics.com/meta-tag-generator.html
Meta Tag Generator: www.submitcorner.com/Tools/Meta
DC-Dot: www.ukoln.ac.uk/metadata/dcdot/

Keywords in text
Select them correctly
Study synonymy, variants, similar terms in other languages
Analyze usage in search engines
Density
Total: up to 25%
Individual: up to 5%
Position
Heading tags <Hn>
First paragraphs
Font modifying tags: Bold <B><strong>; Italic <I>; Font size
Promote the proximity of terms (where appropriate)

More about keywords
Alternative text (ALT)
Very important
Used to give meaning to images, graphs and banners
Specific treatment similar to title
Up to 250 characters
Anchor terms in the links
Use keywords
The pages that link to ours are very important
Also relevant for the internal navigational links

Google-bombing

Google Trends

Google Timeline & Map

Links to external pages
Link density
Average of links/page (incl. internal) ~ 20
Structure resource lists in hierarchical directories
Each category, one or more pages
Target pages
Link to good pages
Main page (whenever appropriate)
Pages with high PR
Updated pages
Local > .edu > .org > .info > .com
Check frequently that links are still active
Avoid links to link farms
Select the link text carefully (avoid “here”, “page”)

Characteristics of the institutional sites
Domain
Own
Subdomain: inherits PR from site root
Don’t change domain!!!
Medium-sized and big institutional sites
Preferably large
Updating
Frequently
Avoid acronyms, provide content
Local, .org, .info, .name versus .com
Increase number of pages (maintain new/old ratio)
Promote inlinks
Promote visits
Keep statistics

Characteristics of the pages
Size
Small or medium-sized, <100 k
Medium or big-sized
Updating
Frequent, but not too frequent
Change contents, not address
But 40-50 k can be a great volume of text
Structure groups of pages correctly through consecutive links (back-next)
Reduce restructuring to a minimum
Versions
In different pages
In other languages
In other formats (pdf, doc, ps, ppt, ...)

Barriers for robots
Links hidden, incomplete or without meaning
Graphs and entry banners without a link in text mode
Javascripts in navigational menus
With hidden links
With relative, incomplete links (without URL Base declaration)
Frames (but NOT always!!)
Orphan pages
Avoid redirections and aliases
Especially Flash files
The presence of ALT text is also important
Refresh tags
Institutional farms (site.es; site.com; site.org)
Dynamic pages
Reduce the length and complexity of the URLs: give them a meaning

Robot-friendly
File robots.txt
Don’t abuse “no index”
Map of the site (html and xml)
Internal navigational links
Just the necessary ones
Support submenus
Registration with referrers
At the search engines (not very important, it only speeds up indexing)
In directories (in Yahoo it increases visibility)
In supersites (trick: Wikipedia)
Fight against invisibility
Static pages

“Visible” Internet

Hacking strategies (to avoid)
Invisible texts
Pixel links
Link farms
Duplicate texts
Cloaking
Link buying
Visits buying
Different pages for the search engine than for the user
Hacking mirrors

Tools: Word density
Site Content Analyzer 2.2.15: www.sitecontentanalyzer.com
Good Keywords 2.0: www.goodkeywords.com
Keyword Density: www.keyworddensity.com
Keyw. Dens. & Prominence 1.2: www.ranks.nl/tools/spider.html
Keyword Density Analyzer: tool.motoricerca.info/keyword-density.phtml
KDAnalyzer Version 2.0: www.webjectives.com/keyword.htm
Google Adwords: adwords.google.com/select/KeywordSandbox
Keyword Density Analyzer 1.3: www.searchengineworld.com/cgi-bin/kwda.cgi
Keyword Investigator: www.keywordster.com/keyword-investigator.htm
GRKda: www.grsoftware.net/search_engines/software/grkda.html

Keyword Density & Prominence

Tools: Position
Accurate Monitor 2.5: www.cleverstat.com
Advanced Web Ranking 4.7: www.advancedwebranking.com
AgentWebRanking Pro 2.6: www.agentwebranking.com
IBP 9: www.axandra.com
Dynamic Web Ranking 7.0: www.dynamicwebrank.com
Link Popularity Analysis 2.0: www.link-popularity-analysis.com
Link Popularity Check 3.0: www.checkyourlinkpopularity.com
Link Survey 1.5: www.antssoft.com
RankSpy 1.3: www.searchutilities.com/rankspy
Trellian SEO Toolkit: www.trellian.com/seotoolkit
Web CEO 6.0: www.webceo.com

WebPosition

Advanced Web Ranking

Quality: Duplicates, broken links

Evolution and persistence
Volatility
Persistence
Changes in web pages tend to be minor or cosmetic
The frequency of change varies according to the domain
The magnitude of the change depends largely on the size
Large pages change more, and more frequently
research.microsoft.com/research/sv/sv-pubs/p97-fetterly/p97-fetterly.pdf

Generating Contents
Personal pages (also research groups or departments)
Institutional repositories
Papers, books and book chapters, dissertations, …
Multimedia repositories
Portal of journals
Access to full-text files (academic publications)
Local institutional journals
Super-sites
Added-value directories of (web) resources

Added value

Personal pages
Current situation
Few scholars have their own personal webpage, most of them with a limited amount of contents
Bad positioning practices, especially regarding the URL
Personal Branding
Increased impact (global audiences)
Efficient networking (peers and non-peers)
Complements your formal scholarly communication
Reflects the diversity of your activities (and of yourself)
Not only reactive but also proactive
It is easy, fast and cheap

A model
Page layout: Institutional Logo & Banner; Name of the group, department or faculty; Index (Papers; Conferences; Books; Teaching; Projects; Popular Science; Prizes; Hobbies; Press notes; Blog / Web 2.0; Statistics); CV (pdf); Photo; Contact info; General comments and presentation; News, relevant new info; Next conferences; Links; Updated 5-July-2012
http://johnclements.net/home
thebook.virtualknowledgestudio.nl/author/paul-wouters

MODULE 3: Usage metrics
Tracking and Analyzing Visits

Web Usage Mining
Definitions
Data mining: knowledge extraction from databases
Web mining: gathering and analysis of the visit patterns of a Web site
It is not about searching for or retrieving information about that site
Objectives: aspects to explore
Joining
Classification and clustering
Transversal patterns
Sequential patterns
Similarities
Visits: Web site analysis
Log files: definition and structure
Software for log analysis
Practices with WebTrends Analysis Suite (www.netiq.com)

Taxonomy of Web Mining
Diagram labels: Web Mining; Mining of Web contents; Mining based on agents; Search engines; Metasearchers; Personal agents; Mining of Web use; Database mining; Identification; Description; Analysis tools; Invisible Internet

Log files (logbook)
File that automatically records all data about the visits that a web site receives
IP address of the visitor
Visited URLs
Time of visit
Time dedicated to the visit
URL from which the visit came
Type of request
Type of response
Size of response (bytes)
Browser used
etc…
Apache web log
205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] "GET /~sophal/whole5.gif HTTP/1.0" 200 9609 "http://www.csua.berkeley.edu/~sophal/whole.html" "Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"
216.35.116.26 - - [29/Mar/2002:03:59:40 -0800] "GET /~alexlam/resume.html HTTP/1.0" 200 2674 "-" "Mozilla/5.0 (Slurp/cat; [email protected]; http://www.inktomi.com/slurp.html)"
202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] "GET /~tahir/indextop.html HTTP/1.1" 200 3510 "http://www.csua.berkeley.edu/~tahir/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

Utilities
Questions to answer
How has the information been used?
How frequently?
What is the most and the least popular (visited)?
Where do the visitors come from?
Where do they exit from?
Where do they spend more time?
How much time do they spend?
Which are the paths that visitors follow the most?
Who are the visitors?
Where do they come from?
How did they arrive?

Visits trackers
Google Analytics: www.google.com/analytics
Yahoo Web Analytics: web.analytics.yahoo.com
StatCounter: www.statcounter.com
ActiveMeter: www.activemeter.com
123Statmore: www.123stat.com
Counter Central: www.countercentral.com
Digits Web Counter: www.digits.com
Free Hit Counter: www.ritecounter.com
GoStats: www.gostats.com
MyWebStats: www.mywebstats.org
OneStat Free: www.onestatfree.com
OneStat: www.onestat.com
Opentracker: www.opentracker.net
ShinyStat: www.shinystat.com
TDstats: www.tdstats.com
TheCounter: www.thecounter.com
WebSTAT: www.webstat.com
What Counter: www.whatcounter.com

Google Analytics

Google Analytics (II)

Google Analytics (III)

StatCounter

Log file analysis software
10-Strike Log-Analyzer 1.53: www.10-strike.com
123LogAnalyzer 3.3: www.123loganalyzer.com
Log2Stats 1.5: www.bitstrike.com
AdvancedLogAnalyzer 2.1: www.abacre.com/ala/index.htm
Alterwind Log Analyzer 4.0: www.alterwind.com
Analog 6.0: www.analog.cx
Analyse Spider 3.01: www.analysespider.com
Deep Log Analyzer 4.0: www.deep-software.com
eWebLogAnalyzer 2.3: www.esoftys.com
FastStats Analyzer 4.1: www.mach5.com/products/analyzer
Nihuo Web Log Analyzer 4.07: www.nihuo.com
SawMill 8.5: www.sawmill.net
SmarterStats 6.5: www.smartertools.com
Surfstats 2011: www.surfstats.com
WebLogStorming 2.6: www.datalandsoftware.com/weblog
WebLogExpert 7.4: www.weblogexpert.com
WebTrends Analytics 10: www.webtrends.com

10-Strike Log Analyzer

123-Log Analyzer

SawMill

Exercises
Experiments
Funnel Web 5.0
Practices with log files
Total and disaggregated visits
Most popular pages and directories
Downloaded files
Points of entry and exit
Visitor demography
Entry referrals (origin, browser and search engine words used)

Configuring Funnel Web

Results

Referrals

Bibliography/Webliography
General Bibliography/Webliography: www.cindoc.csic.es/cybermetrics/links03.html
Björneborn, L. & Ingwersen, P. (2001). Perspectives of webometrics. Scientometrics, 50(1): 65-82. http://www.db.dk/lb/2001webometrics.pdf
van Raan, A. F. J. (2001). Bibliometrics and internet: Some observations and expectations. Scientometrics, 50(1): 59-63.
Bar-Ilan, J. (2001). Data collection methods on the Web for informetric purposes. A review and analysis. Scientometrics, 50(1): 7-32.
Björneborn, L. (2004). Small-world link structures across an academic web space: a library and information science approach. PhD dissertation. Royal School of Library and Information Science. xxxvi, 399 p. ISBN 877415-276-9. http://www.db.dk/lb/phd/phd-thesis.pdf
Jepsen, E.T.; Seiden, P.; Ingwersen, P.; Björneborn, L. & Borlund, P. (2005). Characteristics of scientific web publications: preliminary data gathering and analysis. Journal of the American Society for Information Science and Technology. Special Issue on Webometrics.
Björneborn, L. & Ingwersen, P. (2005). Towards a basic framework for webometrics. Journal of the American Society for Information Science and Technology. Special Issue on Webometrics.
Thelwall, M.; Vaughan, L. & Björneborn, L. (2005). Webometrics. Annual Review of Information Science and Technology, 39.
Ingwersen, P. & Björneborn, L. (2004). Methodological issues of webometric studies. In: Glänzel, W. et al. (eds.). Quantitative Science and Technology Research. Kluwer Academic Publishers.
The Statistical Cybermetrics Research Group. Wolverhampton University. http://cybermetrics.wlv.ac.uk
Alonso Berrocal, J.L.; Figuerola, C.G. & Zazo, A.F. (2004). Cibermetría: nuevas técnicas de estudio aplicables al Web. Ediciones Trea, Gijón. 207 pp.
Faba Perez, C.; Guerrero Bote, V. P. & Moya Anegón, F. (2004). Fundamentos y técnicas cibermétricas: modelos cuantitativos de análisis. Junta de Extremadura, Mérida. Serie Sociedad de la Información, no. 18. 216 pp.
Prime, C.; Bassecoulard, E. & Zitt, M. (2002). Co-citations and co-sitations: A cautionary view on an analogy. Scientometrics, 54(2): 291-308.
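
Worked sketch: reading the Apache web log of Module 3
The log entries shown in the "Log files" slide appear to follow the Apache "combined" layout (IP, time, request, response code, size, referring URL, browser). The Python sketch below is only an illustration of how such lines could be split into the fields listed on that slide and tallied for a few of the "Questions to answer"; the names LOG_PATTERN, parse_line and summarise are hypothetical and are not part of the course material or of any of the tools listed above.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format, as in the slide's sample lines):
# host ident user [time] "method url protocol" status size "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent; count it as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


def summarise(lines):
    """Tally visited URLs, referrers and bytes over the lines that parse."""
    entries = [e for e in map(parse_line, lines) if e is not None]
    return {
        "requests": len(entries),
        "bytes": sum(e["size"] for e in entries),
        "urls": Counter(e["url"] for e in entries),
        "referrers": Counter(
            e["referrer"] for e in entries if e["referrer"] != "-"
        ),
    }


# Two of the sample entries from the "Log files" slide.
SAMPLE = [
    '205.188.209.10 - - [29/Mar/2002:03:58:06 -0800] '
    '"GET /~sophal/whole5.gif HTTP/1.0" 200 9609 '
    '"http://www.csua.berkeley.edu/~sophal/whole.html" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; AOL 6.0; Windows 98; DigExt)"',
    '202.155.20.142 - - [29/Mar/2002:03:00:14 -0800] '
    '"GET /~tahir/indextop.html HTTP/1.1" 200 3510 '
    '"http://www.csua.berkeley.edu/~tahir/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
]

if __name__ == "__main__":
    report = summarise(SAMPLE)
    print(report["requests"], "requests,", report["bytes"], "bytes")
    for url, hits in report["urls"].most_common():
        print(hits, url)
```

Lines that do not match the assumed layout are skipped rather than guessed at, so a real log in a different format would need its own pattern. The dedicated log analysis packages listed in Module 3 cover far more (sessions, paths, visitor demography) than this sketch attempts.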