Sample Midterm Exam Questions: CS 689: Fall 2011
Question 1

You are given two sealed envelopes, one containing $x and the other containing $2x, where x is unknown. You are asked to select an envelope at random, open it, and then decide whether you would like to exchange it for the other envelope. Develop a justification for your decision based on reasoning with expected values. Is there an inherent paradox in your solution? Contrast decision-making with expected values to decision-making with likelihoods by formulating a likelihood-based solution to this problem (where θ is the unknown parameter, which is either $x or $2x).

Question 2

Let x1 = 0.1, x2 = 0.7, x3 = 1.4, x4 = 2.3 be a set of IID samples from a uniform distribution U[0, θ] for θ > 0. Draw (by hand) the shape of the likelihood function. Label the axes clearly. As the number of samples grows, what happens to the shape of the likelihood function?

Question 3

For each of the following statements, answer true or false, and provide a detailed justification for each answer.

• The variance of the sample mean is always higher than the variance of any individual observation.
• Rao-Blackwellization can only be applied to unbiased estimators.
• Every distribution P(X|θ) has a sufficient statistic.
• The likelihood function is always unimodal.
• The EM algorithm attempts to maximize the log-likelihood of the observed data.

Question 4

Consider the problem of linear regression, where the goal is to find the regression of some variable, such as your height y, on your father's height x. Imagine a cloud of points as the given data, where each point (xi, yi) represents your height vs. your father's height (one for each person). Let us define the standard deviation line as the line that goes through the sample average point (x̄, ȳ) with slope equal to sign(r)·(σy/σx), where sign(r) is the sign (positive or negative) of the correlation coefficient r, and σx and σy are the sample standard deviations of x and y.
Derive an expression for the regression line in terms of these quantities, and explain whether its slope is flatter or steeper than that of the standard deviation line. Draw a qualitative diagram (by hand) illustrating the two lines for some sample data points.

Question 5

This question concerns the use of EM to infer models from partially missing data. The National Crime Survey conducted by the US Bureau of the Census interviewed occupants of a certain apartment complex to determine if they had been victimized by crime during the preceding six-month period. Six months later, the occupants were interviewed again to determine if they had been the victims of a crime in the intervening months since the first interview. The data resulting from the survey is summarized in the following table.

                                 Second Interview
First Interview      Crime-Free    Victims    Nonrespondents
Crime-Free              392           55            33
Victims                  76           38             9
Nonrespondents           31            7           115

part a

What is the "missing" data here? Explain how many training instances in this data can be used by EM in parameter estimation and which instances have to be discarded. Justify your reasoning.

part b

Let xij be the complete data cell count for the (i, j)th entry of the table. Write an expression for the complete log-likelihood of the data, and derive the maximum likelihood estimator θ̂ij.

part c

To model this as a missing data problem, show how to represent each complete data cell count xij in terms of the observed data and the missing data.

part d

Specify the E-step of the EM algorithm. This requires specifying an expression for P(Xm | Xo, θt), the predictive distribution of the missing data given the observed data and an initial setting of the model parameter θ. You may find it helpful to use the following interesting property of multinomial distributions, whereby the conditional distribution P(x11 | x11 + x12, θ) is governed by the proportion θ11/(θ11 + θ12); i.e., the conditional distribution of a multinomial variable is also a multinomial.
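The E- and M-steps described above can be sketched numerically. The following is a minimal sketch under one plausible reading of the table, not the official solution: the 2×2 block of double respondents is fully observed, the margin-only counts (33, 9 and 31, 7) are allocated across cells in the E-step in proportion to the current θ (using the multinomial property from part d), and the 115 double nonrespondents are discarded. All variable names are illustrative.

```python
import numpy as np

# EM sketch for the 2x2 crime-survey table, assuming the 115
# doubly-missing respondents carry no information and are dropped.
obs = np.array([[392.0, 55.0], [76.0, 38.0]])  # fully observed cells
row_only = np.array([33.0, 9.0])   # second-interview response missing
col_only = np.array([31.0, 7.0])   # first-interview response missing

theta = np.full((2, 2), 0.25)      # initial guess of cell probabilities
for _ in range(50):
    # E-step: expected complete counts, allocating margin-only counts
    # across cells in proportion to the current theta.
    x = obs.copy()
    x += row_only[:, None] * theta / theta.sum(axis=1, keepdims=True)
    x += col_only[None, :] * theta / theta.sum(axis=0, keepdims=True)
    # M-step: maximum likelihood estimate for a multinomial.
    theta = x / x.sum()

print(np.round(theta, 4))
```

With a starting guess this converges in a handful of iterations, consistent with the "3-4 steps on a calculator" claim in part e.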
part e (5 points)

Give the complete EM algorithm. Given the small size of the problem, it is possible to compute the results using a calculator, and the EM procedure converges in 3-4 steps. Using a calculator, show the convergence of EM for this problem, starting with a random guess of the model parameter θ.

Question 6

[Figure: a factorial HMM with two parallel state chains, S(t−1,1) → S(t,1) → S(t+1,1) and S(t−1,2) → S(t,2) → S(t+1,2), and observations Y(t−1), Y(t), Y(t+1), each observation depending on both state variables at its time step.]

The figure above shows a factorial HMM, which can be viewed as a variant of a regular HMM. Here, each state at time t is made up of a vector of state variables S(t, i), where each variable is governed by a probability distribution that depends only on the value of the corresponding variable at the previous time. However, the observation Y(t) at time t can depend on all the state variables at that time.

part a

Analyze the conditional independence properties of a factorial HMM using the concept of d-separation. In particular, are the state variables at time t marginally independent? Are the state variables at time t conditionally independent given the observation Y(t) at time t? Are the state variables at time t conditionally independent of the past history of state variables, given the values of the variables at the previous time instant t − 1?

part b

Suppose there were M parallel chains (instead of 2 as shown in the figure), and each state variable takes on K values. Show how to convert a factorial HMM into a regular HMM, and give an expression for the complexity of the forward algorithm for the converted HMM in terms of T (the length of the sequence), K (the number of values each state variable takes on), and M (the number of chains).

part c

Is there a reason not to reduce a factorial HMM to a regular HMM? Is the E-step tractable for factorial HMMs (in either their original or converted form)?

Question 7

EM was primarily specified as a method of doing maximum likelihood estimation over incomplete data. Does EM assume the underlying samples are IID? Explain.
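The conversion asked for in Question 6, part b can be illustrated with a small numerical sketch: M chains of K values each collapse into a single chain with K^M composite states, after which the standard forward algorithm applies, at O(T·K^(2M)) cost per sequence. All parameters below are randomly generated placeholders, not part of the exam.

```python
import numpy as np

# Collapse a factorial HMM (M chains, K values each) into a regular
# HMM with S = K**M composite states, then run the forward algorithm.
rng = np.random.default_rng(0)
K, M, T = 2, 3, 5
S = K ** M                         # composite state space size

# Chains evolve independently, so the composite transition matrix is
# the Kronecker product of the per-chain transition matrices.
A = np.ones((1, 1))
for _ in range(M):
    Ai = rng.random((K, K))
    Ai /= Ai.sum(axis=1, keepdims=True)   # row-stochastic
    A = np.kron(A, Ai)

B = rng.random((S, 4))
B /= B.sum(axis=1, keepdims=True)  # emissions over 4 symbols
pi = np.full(S, 1.0 / S)           # uniform initial distribution
y = rng.integers(0, 4, size=T)     # a placeholder observation sequence

alpha = pi * B[:, y[0]]
for t in range(1, T):              # each step costs O(S^2) = O(K^(2M))
    alpha = (alpha @ A) * B[:, y[t]]
print(alpha.sum())                 # likelihood P(y) of the sequence
```

The exponential blow-up of S in M is exactly why part c asks whether this reduction is a good idea.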
We often want to do full Bayesian estimation, and not assume uniform priors. Specify a modified auxiliary function Q∗(θ|θt) for applying EM to obtain posterior distributions from incomplete data. Explain the modification needed, if any, to the E-step and the M-step.

Question 8

Give a general procedure for determining whether a (continuous) function is convex or concave. Consider the logistic function

f(x) = 1 / (1 + e^(−x))

Is the logistic function convex or concave? How about log f(x)? Compare the two loss functions we have studied in class: the absolute loss and the squared loss. Are both of these loss functions convex? What are the pros and cons of using convex loss functions?

Question 9

part a

Derive an expression that relates the Fisher information JX(θ) for an IID dataset of N instances X = x1, . . . , xN sampled from a univariate distribution P(x|θ) to the Fisher information Jx(θ) of each individual instance.

part b

Compute the Fisher information JX(θ) for an IID dataset X of N instances sampled from the Rayleigh distribution

P(x|θ) = 2θx e^(−θx²),   x ≥ 0

Question 10

A finite undirected graph G = (V, E) consists of a set of vertices V and a set of edges (u, v) ∈ E, where u, v ∈ V. This question pertains to the analysis of probability distributions over graphs. Denote by A the adjacency matrix of the graph G, where A(u, v) = 1 if and only if (u, v) ∈ E. Define the random walk matrix Pr over a graph G to be a |V| × |V| matrix where Pr(i, j) = 1/di if (i, j) ∈ E (and 0 otherwise), specifying the probability of a transition from vertex i to vertex j. Here, di denotes the degree of vertex i. Assume the graph is connected, so all vertices are reachable from every other vertex. Assume |V| = n.

part a

Is the random walk matrix symmetric? If you think it is not, give a counterexample. Either way, justify your answer rigorously.

part b

Is the adjacency matrix symmetric? Derive an expression for the random walk matrix in terms of the adjacency matrix.
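The constructions in parts a and b can be sketched concretely. A minimal example, using a 3-vertex path graph chosen here for illustration (not from the exam): the adjacency matrix is always symmetric for an undirected graph, while Pr = D⁻¹A fails to be symmetric as soon as two adjacent vertices have different degrees.

```python
import numpy as np

# Random walk matrix of a small path graph: Pr(i, j) = A(i, j) / d_i.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph: 0 - 1 - 2
d = A.sum(axis=1)                        # vertex degrees (1, 2, 1)
Pr = A / d[:, None]                      # Pr = D^-1 A, rows sum to 1

print(np.allclose(A, A.T))               # True: A is symmetric
print(np.allclose(Pr, Pr.T))             # False: Pr(0,1)=1 but Pr(1,0)=0.5
```

Here the asymmetry comes entirely from the degree normalization: vertex 0 must move to vertex 1, but vertex 1 splits its probability between two neighbors.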
part c

Let π define a distribution over V, so that Σ_{v∈V} π(v) = 1. Define the inner product ⟨f, g⟩π of any two functions f, g : V → R on the graph as follows:

⟨f, g⟩π = Σ_{i∈V} f(i) g(i) π(i)

Show that this definition of the inner product satisfies all the axioms defining an inner product.

part d

Starting from any initial distribution π0, derive an expression for the distribution at time step t > 0 resulting from a random walk on the graph of length t. Starting from any distribution π, does this process of doing a random walk over longer and longer time periods converge?
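The iteration in part d can be sketched numerically: the distribution after t steps is the row-vector update πt = π0·Prᵗ. The 3-cycle below is an illustrative choice, not from the exam; being non-bipartite, it converges to the stationary distribution π(v) ∝ d(v), whereas on a bipartite graph (such as the path graph from part a) the walk can oscillate forever, so convergence is not guaranteed in general.

```python
import numpy as np

# Distribution of a random walk after t steps: pi_t = pi_0 @ Pr^t.
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)   # 3-cycle (non-bipartite)
Pr = A / A.sum(axis=1, keepdims=True)    # random walk matrix D^-1 A

pi0 = np.array([1.0, 0.0, 0.0])          # start at vertex 0
pi_t = pi0 @ np.linalg.matrix_power(Pr, 20)
print(np.round(pi_t, 6))                 # approaches [1/3, 1/3, 1/3]
```

Since every vertex of the 3-cycle has degree 2, the stationary distribution π(v) ∝ d(v) is uniform, which matches the printed result.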