Distributed Data Management - Databases and Information Systems
Information Retrieval and Data Mining
Summer Semester 2015, TU Kaiserslautern
Prof. Dr.-Ing. Sebastian Michel
Databases and Information Systems Group (AG DBIS)
http://dbis.informatik.uni-kl.de/
Information Retrieval and Data Mining, SoSe 2015, S. Michel

Outlook and Announcement
• Outlook on this lecture and the next one:
  – Continuation of probabilistic models
  – Language models (= generative models)
• Then a new chapter: Link Analysis (PageRank etc.)
• Then Indexing and Querying, …, Data Mining, …
• Announcement for exercises:
  – The solution to each sheet will be uploaded after the exercise session. Access is restricted to the university network or via login:
    • Username: irdm
    • Password:

Ranking Proportional to Relevance Odds (cont'd)
…
• $p_t$ is the probability that term t appears in a relevant document
• $q_t$ is the probability that term t appears in an irrelevant document

Estimating $p_t$ and $q_t$ with a Training Sample
• We can estimate $p_t$ and $q_t$ from a training sample, obtained by evaluating the query q on a small sample of the corpus and asking the user for relevance feedback about the results
• Let
  – $N$ be the number of documents in our sample
  – $R$ be the number of relevant documents in our sample
  – $n_t$ be the number of documents in our sample that contain t
  – $r_t$ be the number of relevant documents in our sample that contain t
• We estimate
  $$p_t = \frac{r_t}{R}, \qquad q_t = \frac{n_t - r_t}{N - R}$$
  or, with Lidstone smoothing (λ = 0.5),
  $$p_t = \frac{r_t + 0.5}{R + 1}, \qquad q_t = \frac{n_t - r_t + 0.5}{N - R + 1}$$
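The estimation from a training sample can be sketched in a few lines of Python. This is only an illustration: the function name `estimate_pt_qt` and the sample counts are hypothetical, and the smoothed form assumes the usual Lidstone estimate for a binary event, $(r_t + \lambda) / (R + 2\lambda)$, which for λ = 0.5 adds 0.5 to the numerator and 1 to the denominator.

```python
def estimate_pt_qt(N, R, n_t, r_t, lam=0.0):
    """Return (p_t, q_t) estimated from relevance-feedback counts.

    N   : number of documents in the sample
    R   : number of relevant documents in the sample
    n_t : number of sample documents containing term t
    r_t : number of relevant sample documents containing term t
    lam : Lidstone parameter (0 = unsmoothed, 0.5 = the choice in the text)
    """
    # Two outcomes per document (term present / absent), hence 2 * lam.
    p_t = (r_t + lam) / (R + 2 * lam)
    q_t = (n_t - r_t + lam) / (N - R + 2 * lam)
    return p_t, q_t


# Hypothetical sample: 10 documents, 4 of them relevant;
# term t occurs in 5 documents, 3 of which are relevant.
p_raw, q_raw = estimate_pt_qt(N=10, R=4, n_t=5, r_t=3)
p_smooth, q_smooth = estimate_pt_qt(N=10, R=4, n_t=5, r_t=3, lam=0.5)
```

With λ > 0 neither estimate can become exactly 0 or 1, so a weight that takes logarithms of $p_t$, $q_t$ and their complements stays finite even for terms that never occur in the relevant part of the sample.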
Smoothing (with Uniform Prior)
More info on MLE for the binomial distribution: https://onlinecourses.science.psu.edu/stat504/node/28
• The probabilities $p_t$ and $q_t$ for term t are estimated by the maximum-likelihood estimate (MLE) for the binomial distribution
  – repeated coin tosses for term t in relevant documents ($p_t$)
  – repeated coin tosses for term t in irrelevant documents ($q_t$)
• Avoid overfitting to the training sample by smoothing the estimates
  – Laplace smoothing (based on Laplace's law of succession)
  – Lidstone smoothing (heuristic generalization with λ > 0)

Estimating $p_t$ and $q_t$ without a Training Sample
• When no training sample is available, we estimate $p_t$ and $q_t$ as
  $$p_t = 1 - p_t = 0.5, \qquad q_t = \frac{df_t}{|D|}$$
  – $p_t$ reflects that we have no information about relevant documents
  – $q_t$ follows from the assumption that # relevant documents ≪ # documents
• When we plug in these estimates of $p_t$ and $q_t$, we obtain
  $$\sum_{t \in q \cap d} \log \frac{|D| - df_t}{df_t}$$
  which can be seen as TF*IDF with binary term frequencies and logarithmically dampened inverse document frequencies

3. Okapi BM25
• Generalizes the term weight
  $$w_t = \log \frac{p_t \, (1 - q_t)}{q_t \, (1 - p_t)}$$
  into
  $$w_t = \log \frac{p_{tf} \, q_0}{q_{tf} \, p_0}$$
  where $p_i$ and $q_i$ denote the probability that the term occurs i times in a relevant or irrelevant document, respectively
• Postulates Poisson (or 2-Poisson-mixture) distributions for terms

Okapi BM25 (cont'd)
• Reduces the number of parameters that have to be learned and approximates the Poisson model by a similarly shaped function
• Finally leads to Okapi BM25 as a state-of-the-art retrieval model (with top-ranked results in TREC)
  $$w_{t,d} = \frac{(k_1 + 1) \, tf_{t,d}}{k_1 \left( (1 - b) + b \, \frac{|d|}{avdl} \right) + tf_{t,d}} \cdot \log \frac{|D| - df_t + 0.5}{df_t + 0.5}$$
  – $k_1$ controls the impact of term frequency (common choice $k_1 = 1.2$)
  – $b$ controls the impact of document length (common choice $b = 0.75$)
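To make the roles of $k_1$ and $b$ concrete, here is a minimal Python sketch of the BM25 term weight with the common parameter choices $k_1 = 1.2$ and $b = 0.75$. The function name, the argument names, and the example numbers are assumptions for illustration; the natural logarithm is used because the base is not specified.

```python
import math


def bm25_weight(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """BM25 weight of one term in one document.

    tf          : term frequency tf_{t,d}
    doc_len     : document length |d|
    avg_doc_len : average document length avdl
    df          : document frequency df_t
    num_docs    : collection size |D|
    """
    # Saturating term-frequency component with length normalization.
    length_norm = (1 - b) + b * doc_len / avg_doc_len
    tf_part = ((k1 + 1) * tf) / (k1 * length_norm + tf)
    # IDF-like component; negative as soon as df > num_docs / 2.
    idf_part = math.log((num_docs - df + 0.5) / (df + 0.5))
    return tf_part * idf_part


# Hypothetical collection of 1000 documents with average length 100.
rare = bm25_weight(tf=3, doc_len=100, avg_doc_len=100, df=10, num_docs=1000)
common = bm25_weight(tf=3, doc_len=100, avg_doc_len=100, df=600, num_docs=1000)
long_doc = bm25_weight(tf=3, doc_len=200, avg_doc_len=100, df=10, num_docs=1000)
```

In this sketch `rare` is positive, `common` is negative because the term occurs in more than half of the documents, and `long_doc` is smaller than `rare` because the same term frequency counts for less in a longer document.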
Okapi BM25 (Example)
• 3D plot of a simplified BM25 scoring function using $k_1 = 1.2$ as parameter (DF axis mirrored for better readability)
• Scores for $df_t > |D|/2$ are negative (is this a problem?)

Models Considering Term Dependences
• So far, the models assume that terms appear independently of one another.
• In contrast, the upcoming models
  – Tree Dependence Model
  – Bayesian Network Model
  do not assume independence of term appearances.
• They need the probabilities that a term, a term pair, a term triplet, … appears in relevant and in irrelevant documents, respectively.
• These probabilities are then used to compute the document score.

4. Tree Dependence Model
• Considering term correlations in documents (with binary random variables $X_i$) requires estimating an m-dimensional probability distribution
• Tree dependence model [van Rijsbergen 1979]
  – considers only 2-dimensional probabilities for term pairs (i, j)
  – estimates for each pair (i, j) the error made by the independence assumption
  – constructs a tree with terms as nodes and m-1 weighted edges connecting the highest-error term pairs
  – Assumption: each term depends on at most one other term

Two-Dimensional Term Correlations
• Kullback-Leibler divergence estimates the error of approximating f by g, where g assumes pairwise term independence
• Correlation coefficient for term pairs
• p-values of the χ² test of independence
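As a small illustration of the first measure, the following Python sketch computes the Kullback-Leibler divergence between the observed joint distribution f of two binary term indicators and the product g of its marginals. The function name is hypothetical; the four documents d1 = (1,1), d2 = (0,0), d3 = (1,1), d4 = (0,1) are the ones used in this chapter's Kullback-Leibler example, and logarithms are taken to base 2.

```python
import math
from collections import Counter


def kl_independence_error(docs):
    """KL divergence (base 2) between the joint distribution f of two
    binary term indicators and the product g of its marginals."""
    n = len(docs)
    joint = Counter(docs)                 # counts of (x1, x2) pairs
    x1 = Counter(d[0] for d in docs)      # marginal counts of X1
    x2 = Counter(d[1] for d in docs)      # marginal counts of X2
    error = 0.0
    # Only observed pairs contribute; unobserved pairs have f = 0 and add 0.
    for (a, b), count in joint.items():
        f = count / n
        g = (x1[a] / n) * (x2[b] / n)
        error += f * math.log2(f / g)
    return error


docs = [(1, 1), (0, 0), (1, 1), (0, 1)]
epsilon = kl_independence_error(docs)
print(round(epsilon, 3))  # 0.311
```

The larger this value for a term pair, the more is lost by treating the two terms as independent, which is why it can serve as an edge weight between terms.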
Kullback-Leibler Divergence (Example)
• Given are documents d1 = (1,1), d2 = (0,0), d3 = (1,1), d4 = (0,1)
• 2-dimensional probability distribution f:
  f(1,1) = P[X1 = 1, X2 = 1] = 2/4, f(0,0) = 1/4, f(0,1) = 1/4, f(1,0) = 0
• 1-dimensional marginal distributions g1 and g2:
  g1(1) = P[X1 = 1] = 2/4, g1(0) = 2/4
  g2(1) = P[X2 = 1] = 3/4, g2(0) = 1/4
• 2-dimensional probability distribution assuming independence:
  g(1,1) = g1(1) g2(1) = 3/8, g(0,0) = 1/8, g(0,1) = 3/8, g(1,0) = 1/8
• Approximation error ε (Kullback-Leibler divergence, logarithms to base 2):
  $$\varepsilon = \tfrac{2}{4} \log \tfrac{4}{3} + \tfrac{1}{4} \log 2 + \tfrac{1}{4} \log \tfrac{2}{3} + 0 \approx 0.311$$

Constructing the Term Dependence Tree
• Input: complete graph (V, E) with m nodes $X_i \in V$ and $m(m-1)/2$, i.e., $O(m^2)$, undirected edges (i, j) ∈ E with weights ε
• Output: spanning tree (V, E') with maximum total edge weight
• Algorithm:
  – Sort the edges in descending order of weight
  – E' = ∅
  – Repeat until |E'| = m-1:
    E' = E' ∪ {(i, j) ∈ E \ E' | (i, j) has maximal weight and E' remains acyclic}
• Example: [Figure: complete graph over the terms web, net, surf, swim with edge weights 0.9, 0.7, 0.5, 0.3, 0.1, 0.1, and the resulting spanning tree]

Estimation with Term Dependence Tree
• Given a term dependence tree (V = {X1, …, Xm}, E') with preorder-labeled nodes (i.e., X1 is the root) and assuming that Xi and Xj are independent for (i, j) ∉ E':
  $$P[X_1, \dots, X_m] = P[X_1] \prod_{(i,j) \in E'} P[X_j \mid X_i]$$
• Example: [Figure: term dependence tree over the terms web, net, surf, swim, rooted at web]

Summary: Tree Dependence Model
• Uses (n-1) term dependences for n terms
• Which ones to take? Those with the highest "error" made by the independence assumption
• Efficient to evaluate
• How to apply? The tree is learned for relevant and for irrelevant documents.
• Scoring then proceeds as in the binary independence model (BIM), except that each term's probability is conditioned on its "parent" term in the tree.
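The construction of the term dependence tree is a greedy maximum-spanning-tree computation in the style of Kruskal's algorithm. The Python sketch below follows the steps listed above (sort edges by descending weight, add an edge whenever the tree stays acyclic). Note the assumptions: the transcription preserves the terms web, net, surf, swim and the weights 0.9, 0.7, 0.5, 0.3, 0.1, 0.1, but not which weight belongs to which edge, so the assignment used here is hypothetical, as are the function and variable names.

```python
def max_spanning_tree(nodes, weighted_edges):
    """Greedy maximum spanning tree.

    weighted_edges: iterable of (weight, u, v) tuples for undirected edges.
    Returns the chosen edges as (u, v, weight) in the order they were added.
    """
    parent = {v: v for v in nodes}  # union-find forest for cycle detection

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(weighted_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # adding (u, v) keeps the tree acyclic
            parent[root_u] = root_v
            tree.append((u, v, weight))
        if len(tree) == len(nodes) - 1:
            break
    return tree


terms = ["web", "net", "surf", "swim"]
# Hypothetical assignment of the six weights to term pairs.
edges = [
    (0.9, "web", "net"),
    (0.7, "web", "surf"),
    (0.5, "net", "surf"),
    (0.3, "surf", "swim"),
    (0.1, "web", "swim"),
    (0.1, "net", "swim"),
]
tree = max_spanning_tree(terms, edges)
```

Under this assumed weight assignment the edge (net, surf) is skipped because it would close a cycle, and the tree consists of the three edges (web, net), (web, surf), and (surf, swim).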
5. Bayesian Networks
• A Bayesian network (BN) is a directed, acyclic graph (V, E) with
  – vertices V representing random variables
  – edges E representing dependencies
  – For a root R ∈ V, the BN captures the prior probability P[R = …]
  – For a vertex X ∈ V with parents parents(X) = {P1, …, Pk}, the BN captures the conditional probability P[X | P1, …, Pk]
  – The vertex X is conditionally independent of a non-parent, non-descendant node Y given its parents parents(X) = {P1, …, Pk}, i.e.:
    $$P[X \mid P_1, \dots, P_k, Y] = P[X \mid P_1, \dots, P_k]$$

Bayesian Networks (Example)
[Figure: Bayesian network over the variables Cloudy, Sprinklers, Rain, Wet]

Bayesian Networks (cont'd)
• We can determine any joint probability using the BN:
  $$P[X_1, \dots, X_n] = \prod_{i=1}^{n} P[X_i \mid parents(X_i)]$$

Some Math Recap
• Chain rule: we can rewrite
  $$P[X_1, \dots, X_n] = P[X_n \mid X_1, \dots, X_{n-1}] \cdot P[X_1, \dots, X_{n-1}]$$
  and repeating this leads to
  $$P[X_1, \dots, X_n] = \prod_{k=1}^{n} P[X_k \mid X_1, \dots, X_{k-1}]$$
• Example with 4 variables (copied from Wikipedia):
  $$P[X_4, X_3, X_2, X_1] = P[X_4 \mid X_3, X_2, X_1] \cdot P[X_3 \mid X_2, X_1] \cdot P[X_2 \mid X_1] \cdot P[X_1]$$

Bayesian Networks for IR
[Figure: Bayesian network with document nodes d1, …, dj, …, dN, term nodes t1, …, ti, …, tk, …, tM, and query node q]

Advanced Bayesian Networks for IR
• BNs are not widely adopted in IR due to challenges in parameter estimation, representation, efficiency, and practical effectiveness
[Figure: Bayesian network with document nodes d1, …, dj, …, dN, term nodes t1, …, ti, …, tk, …, tM, concept/topic nodes c1, …, cK, and query node q]

Summary of I.3
• Probabilistic IR as a family of (more) principled approaches relying on generative models of documents as bags of words
• Probabilistic ranking principle as the foundation, establishing that ranking documents by P[R | d, q] is optimal
• Binary independence model puts that principle into practice based on a multivariate Bernoulli model
• Smoothing to avoid overfitting to the training sample
• Okapi BM25 as a state-of-the-art retrieval model based on an approximation of a 2-Poisson mixture model

Additional Literature for I.3
• G. Salton, C. Buckley, and C. T. Yu: An Evaluation of Term Dependence Models in Information Retrieval, Proceedings of the 5th Annual ACM Conference on Research and Development in Information Retrieval (SIGIR), 1982
• F. Crestani, M. Lalmas, C. J. van Rijsbergen, and I. Campbell: "Is This Document Relevant? ... Probably": A Survey of Probabilistic Models in Information Retrieval, ACM Computing Surveys 30(4):528-552, 1998
• S. E. Robertson and K. Spärck Jones: Relevance Weighting of Search Terms, JASIS 27(3), 1976
• S. E. Robertson and S. Walker: Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval, SIGIR 1994
• T. Roelleke: Information Retrieval Models: Foundations and Relationships, Morgan & Claypool Publishers, 2013
• K. Spärck Jones, S. Walker, and S. E. Robertson: A Probabilistic Model of Information Retrieval: Development and Comparative Experiments, IP&M 36:779-840, 2000
• C. J. van Rijsbergen: Information Retrieval, University of Glasgow, 1979, http://www.dcs.gla.ac.uk/Keith/Preface.html

I.4 Statistical Language Models
1. Basics of Statistical Language Models
2. Query-Likelihood Approaches
3. Smoothing Methods
4. Novelty and Diversity
Based on MRS Chapter 12 and [Zhai 2008]

1. Basics of Statistical Language Models
• Statistical language models (LMs) are generative models of word sequences (or bags of words, sets of words, etc.)

Illustration: Finite Automata
• Consider the following finite automaton that outputs individual terms with certain probabilities
[Figure: automaton emitting dog: 0.5, cat: 0.4, hog: 0.1 (probability of generating a term); the values 0.9 and 0.1 annotate the transitions for continuing and stopping]

Language Models: Intuition
• How can you come up with a good query?
• Think of terms that would likely appear in a relevant document!
• Intuition behind language models: a document d is a good match for a query q if its model M is likely to generate the query (i.e., P[q | M] is high)
• Essentially, an LM defines a probability measure over the strings that can be generated from a model.

Example (from MRS Chapter 12)

  term      Model M1   Model M2
  the       0.2        0.15
  a         0.1        0.12
  frog      0.01       0.0002
  toad      0.01       0.0001
  said      0.03       0.03
  likes     0.02       0.04
  that      0.04       0.04
  dog       0.005      0.01
  cat       0.003      0.015
  monkey    0.001      0.002
  …         …          …

  s     frog     said   that   toad     likes   that   dog
  M1    0.01     0.03   0.04   0.01     0.02    0.04   0.005
  M2    0.0002   0.03   0.04   0.0001   0.04    0.04   0.01

  P(s|M1) = 4.8E-13 and P(s|M2) = 3.84E-16, i.e., P(s|M1) > P(s|M2)

Statistical Language Models: Application Examples
• Application examples:
  – Speech recognition, to select among multiple phonetically similar sentences ("get up at 8 o'clock" vs. "get a potato clock")
  – Statistical machine translation, to select among multiple candidate translations ("logical closing" vs. "logical reasoning")
  – Information retrieval, to rank documents in response to a query

Types of Language Models
• A unigram LM is based only on single words (unigrams), considers no context, and assumes independent generation of words
• A bigram LM conditions on the preceding term
• An n-gram LM conditions on the preceding (n-1) terms

Parameter Estimation
• The parameters (e.g., $P(t_i)$, $P(t_i \mid t_{i-1})$) of a language model θ are estimated from a sample of documents (or for each document), which are assumed to have been generated by θ
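A unigram language model is small enough to sketch directly. The Python snippet below is an illustration with hypothetical function names: it estimates unigram parameters as relative frequencies (the maximum-likelihood estimate) and reproduces the probabilities of the sentence s under the models M1 and M2 from the MRS example, using only the table entries that s needs.

```python
import math
from collections import Counter


def estimate_unigram(tokens):
    """Maximum-likelihood unigram model: relative term frequencies."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def sequence_probability(tokens, model):
    """Probability of a token sequence under a unigram model
    (terms are generated independently of one another)."""
    return math.prod(model[t] for t in tokens)


# Excerpt of the two example models (only the terms occurring in s).
M1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01,
      "likes": 0.02, "dog": 0.005}
M2 = {"frog": 0.0002, "said": 0.03, "that": 0.04, "toad": 0.0001,
      "likes": 0.04, "dog": 0.01}

s = "frog said that toad likes that dog".split()
p_m1 = sequence_probability(s, M1)   # about 4.8e-13
p_m2 = sequence_probability(s, M2)   # about 3.84e-16
```

Because the product of many small probabilities underflows quickly for longer texts, practical implementations usually sum log-probabilities instead; the ranking of models is the same.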
Example
• Example: unigram language models $\theta_{Sports}$ and $\theta_{Politics}$ estimated from documents about sports and politics

  θ_Sports              θ_Politics
  soccer   : 0.20       party     : 0.20
  goal     : 0.15       debate    : 0.20
  tennis   : 0.10       scandal   : 0.15
  player   : 0.05       election  : 0.05
  :                     :

  [Figure labels: each model is connected to a document sample by arrows marked "sample" and "generates"]

Probabilistic IR vs. Statistical Language Models
• "User finds document d relevant to query q"
• Probabilistic IR ranks according to relevance odds
• Statistical LMs rank according to query likelihood

2. Query-Likelihood Approaches
[Figure: query q compared against language models estimated from sample documents: $\theta_{d_1}$ (apple: 0.20, pie: 0.15, …) from sample d1 and $\theta_{d_2}$ (cake: 0.20, apple: 0.15, …) from sample d2]
• P(q|d) is the likelihood that the query was generated by the language model $\theta_d$ estimated from document d
• Intuition:
  – The user formulates query q by selecting words from a prototype document
  – Which document is "closest" to that prototype document?

Multi-Bernoulli LM
• Query q is seen as a set of terms and generated from document d by tossing a coin for every word from the vocabulary V:
  $$P(q \mid d) = \prod_{t \in q} P(t \mid d) \cdot \prod_{t \in V \setminus q} \bigl(1 - P(t \mid d)\bigr)$$
  with the term probabilities P(t|d) estimated from document d
• [Ponte and Croft '98] pioneered the use of LMs in IR

Multinomial LM
• Query q is seen as a bag of terms and generated from document d by drawing terms from the bag of terms corresponding to d:
  $$P(q \mid d) \propto \prod_{t \in q} P(t \mid d)^{tf(t, q)}$$
• The multinomial LM is more expressive than the multi-Bernoulli LM and therefore usually preferred

Multinomial LM (cont'd)
• The maximum-likelihood estimate for the parameters $P(t_i \mid d)$ is prone to overfitting and leads to
  – a bias in favor of short documents and against long documents
  – conjunctive query semantics, i.e., the query cannot be generated from the language models of documents that miss one of the query terms
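The conjunctive query semantics of the unsmoothed multinomial model is easy to demonstrate. The following Python sketch scores a query by the product of maximum-likelihood term probabilities tf(t, d) / |d|; the multinomial coefficient is dropped because it is the same for every document. The function name and the two toy documents are hypothetical.

```python
from collections import Counter


def mle_query_likelihood(query, doc):
    """Unsmoothed multinomial query likelihood, up to a query-dependent
    constant: product over query terms of tf(t, d) / |d|."""
    tf = Counter(doc)
    likelihood = 1.0
    for term in query:
        likelihood *= tf[term] / len(doc)
    return likelihood


# Hypothetical toy documents.
d1 = "apple pie apple recipe".split()
d2 = "cake recipe cake apple cake frosting".split()
query = ["apple", "pie"]

score_d1 = mle_query_likelihood(query, d1)  # (2/4) * (1/4) = 0.125
score_d2 = mle_query_likelihood(query, d2)  # 0.0, because "pie" is missing
```

Document d2 receives probability zero although it contains one of the two query terms, which is exactly the behavior that smoothing is meant to repair.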
3. Smoothing
• Smoothing methods avoid overfitting to the sample (often: one document) and are essential for LMs to work in practice
  – Laplace smoothing (cf. Chapter I.3)
  – Absolute discounting
  – Jelinek-Mercer smoothing
  – Dirichlet smoothing
  – Good-Turing smoothing
  – Katz's back-off model
  – …
• The choice of smoothing method and parameter setting is still mostly a "black art" (or empirical, i.e., based on training data)

Jelinek-Mercer Smoothing
• Uses a linear combination (mixture) of the document language model $\theta_d$ and the document-collection language model $\theta_D$, with document D as the concatenation of the entire document collection
• Parameter λ can be tuned by cross-validation with held-out data
  – divide the set of relevant (q, d) pairs into n partitions
  – build the LM on the pairs from n-1 partitions
  – choose λ to maximize precision (or recall or F1) on the held-out partition
  – iterate with a different choice of the nth partition and average
• Parameter λ can be made document- or term-dependent

Jelinek-Mercer Smoothing vs. TF*IDF
[Formula with its two components annotated as ~ tf and ~ idf]
• (Jelinek-Mercer) smoothing has an effect similar to IDF weighting
• Jelinek-Mercer smoothing leads to a TF*IDF-style model

Dirichlet-Prior Smoothing
• Uses Bayesian estimation with a conjugate Dirichlet prior instead of maximum-likelihood estimation
• Intuition: document d is extended by α terms generated by the document-collection language model
• Parameter α is usually set as a multiple of the average document length

Dirichlet Smoothing vs. Jelinek-Mercer Smoothing
• Dirichlet smoothing can be written as Jelinek-Mercer smoothing with a document-dependent λ
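The two smoothing methods discussed above, and the relationship between them, can be sketched as follows. This is an illustration under stated assumptions: λ is taken to be the weight of the document model (some texts attach it to the collection model instead), the Dirichlet estimate is written in its usual form (tf(t,d) + α · P(t|D)) / (|d| + α), and all function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def jelinek_mercer(term, doc_tf, doc_len, coll_tf, coll_len, lam):
    """Mixture of document model and collection model.
    Assumption: lam weights the document model."""
    return lam * doc_tf[term] / doc_len + (1 - lam) * coll_tf[term] / coll_len


def dirichlet(term, doc_tf, doc_len, coll_tf, coll_len, alpha):
    """Document extended by alpha pseudo-terms drawn from the collection model."""
    return (doc_tf[term] + alpha * coll_tf[term] / coll_len) / (doc_len + alpha)


# Hypothetical toy collection of two documents.
doc = "apple pie apple".split()
collection = doc + "cake apple cake".split()
doc_tf, coll_tf = Counter(doc), Counter(collection)

alpha = 2.0
lam_d = len(doc) / (len(doc) + alpha)  # document-dependent lambda

p_jm = jelinek_mercer("apple", doc_tf, len(doc), coll_tf, len(collection), lam_d)
p_dir = dirichlet("apple", doc_tf, len(doc), coll_tf, len(collection), alpha)
p_unseen = jelinek_mercer("cake", doc_tf, len(doc), coll_tf, len(collection), 0.5)
```

With λ = |d| / (|d| + α) the two estimates coincide (`p_jm` equals `p_dir` up to rounding), so longer documents automatically rely less on the collection model. The term "cake" does not occur in the document, yet `p_unseen` is positive, which removes the zero-probability problem of the maximum-likelihood estimate.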
4. Novelty & Diversity
• Retrieval models seen so far (e.g., TF*IDF, LMs) assume that the relevance of each document is independent of the other documents
• Problem: this is not a very realistic assumption in practice due to (near-)duplicate documents (e.g., articles about the same event)
• Objective: make sure that the user sees novel (i.e., non-redundant) information with every additional result inspected
• Queries are often ambiguous (e.g., jaguar) with multiple different information needs behind them (e.g., car, cat, OS)
• Objective: make sure that the user sees diverse results that cover many of the information needs possibly behind the query

Maximum Marginal Relevance (MMR)
• Intuition: the next result returned $d_i$ should be relevant to the query but also different from the already returned results $d_1, \dots, d_{i-1}$:
  $$\arg\max_{d_i} \; \lambda \cdot sim(q, d_i) - (1 - \lambda) \cdot \max_{j < i} sim(d_i, d_j)$$
  with tunable parameter λ and similarity measure sim(q, d)
• Usually implemented as re-ranking of the top-k query results
• Example:

  Initial Result        Final Result
  sim(q,d1) = 0.9       mmr(q,d1) = 0.45
  sim(q,d2) = 0.8       mmr(q,d3) = 0.35
  sim(q,d3) = 0.7       mmr(q,d5) = 0.25
  sim(q,d4) = 0.6       mmr(q,d2) = -0.10
  sim(q,d5) = 0.5       mmr(q,d4) = -0.20

• Full details: [Carbonell and Goldstein '98]

Summary of I.4
• Statistical language models are widely used in natural language applications other than IR
• Query-likelihood approaches see the query as a sample from the document LM
• Smoothing methods are absolutely essential to make LMs work in practice
• Brief discussion of one approach for novelty & diversity

Additional Literature for I.4
• D. Hiemstra: Using Language Models for Information Retrieval, Ph.D. Thesis, University of Twente, 2001
• M. Federico and N. Bertoldi: Statistical Cross-Language Information Retrieval using N-Best Query Translations, SIGIR 2001
• Z. Nie, Y. Ma, S. Shi, J.-R. Wen, and W.-Y. Ma: Web Object Retrieval, WWW 2007
• H. M. Peetz and M. de Rijke: Cognitive Temporal Document Priors, ECIR 2013
• J. M. Ponte and B. Croft: A Language Modeling Approach to Information Retrieval, SIGIR 1998
• C. Zhai and J. Lafferty: Model-based Feedback in the Language Modeling Approach for Information Retrieval, CIKM 2001
• C. Zhai: Statistical Language Models for Information Retrieval: A Critical Review, Foundations and Trends in Information Retrieval 2(3):137-213, 2008
• R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong: Diversifying Search Results, WSDM 2009
• J. G. Carbonell and J. Goldstein: The Use of MMR, Diversity-Based Reranking for Reordering