CS246 Midterm, Spring 2014 — Page: 1
UCLA
Computer Science Department
Spring 2014
Instructor: J. Cho
Student Name and ID:
CS246 Midterm: Closed Book, 1 Hour 30 Minutes
Attach extra pages as needed. Write your name and ID on the extra
pages. Please write neatly.
Exam Score:

Problem    Score
1          20
2          20
3          20
4          20
5          20
Total      100
Problem 1: 20 points
We want to identify the number of unique books available on Amazon. We consider a
book to be different from another if their ISBN numbers are different. (An ISBN number is a
sequence of 10 numeric digits that is assigned to a book.) We assume every book has an
ISBN number.
Describe an efficient way to estimate the number of unique books available on Amazon.
Note that Amazon provides search interfaces on various fields on books, including title,
author and ISBN. However, you are not allowed to download the entire Web site from
Amazon to measure the number of books and Amazon does not support “NOT ...” queries.
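
One possible approach to Problem 1, sketched under stated assumptions: draw ISBNs uniformly at random from the space of 10-digit strings, look each one up through the ISBN search field, and scale the hit rate by the size of the space. The `lookup` callable stands in for a real ISBN query and is hypothetical, as is the small simulated catalog (over a 6-digit space) used to exercise the estimator.

```python
import random

def estimate_unique_books(lookup, digits=10, n_samples=10_000, seed=0):
    """Estimate catalog size by sampling random ISBN strings.

    lookup: hypothetical callable that returns True if the ISBN string
            matches a book (a stand-in for an ISBN search query).
    """
    rng = random.Random(seed)
    space = 10 ** digits                 # number of possible digit strings
    hits = 0
    for _ in range(n_samples):
        isbn = str(rng.randrange(space)).zfill(digits)
        if lookup(isbn):
            hits += 1
    return hits / n_samples * space      # hit rate scaled to the whole space

# Simulated catalog over a 6-digit space (an assumption for the demo only).
_demo_rng = random.Random(42)
demo_catalog = {str(n).zfill(6) for n in _demo_rng.sample(range(10 ** 6), 20_000)}
demo_estimate = estimate_unique_books(
    demo_catalog.__contains__, digits=6, n_samples=50_000, seed=1
)
print(round(demo_estimate))
```

The number of hits is binomially distributed, so the relative error of the estimate shrinks roughly as one over the square root of the number of hits. A sparse catalog therefore needs many more queries for the same accuracy.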
Problem 2: 20 points
You are a Web administrator who is in charge of all pages in the set S given below. As far
as you know, Google uses the following TrustRank variation of PageRank:
$$TR(p_i) = 0.8 \sum_{p_j \in PARENT(p_i)} \frac{TR(p_j)}{c_j} + b_i,$$
where PARENT(p_i) is the set of all pages that have a link to p_i and c_j is the number of
outgoing links in p_j.
[Figure: link graph showing the set S and the outside pages p_1, p_2, p_3, p_4 that link into it]
None of your pages in S is trusted by Google (i.e., b_i = 0 for all p_i ∈ S). Fortunately, some
pages outside of S have a link to your pages as shown in the graph. You know that the
TrustRank scores of the pages outside of S are as follows:
TR(p_1) = 0, TR(p_2) = TR(p_3) = 0.06, TR(p_4) = 0.01
Your task is to make the minimum number of modifications to the link structure among
and/or from the pages in S so that (1) the TrustRank sum of all pages in S is maximized
and (2) all pages in S have non-zero TR(p_i) values.
1. Indicate such modifications directly to the graph below. In your modifications, you
are only allowed to add and/or delete link(s) originating from page(s) in S. To get
partial credit, you may want to briefly explain why your modifications are suggested.
(15 points)
[Figure: the same graph of S and the outside pages p_1, p_2, p_3, p_4, to be annotated with your modifications]
2. After your modification, what is the sum of all TrustRank scores of all pages in S? (5
points)
$$\sum_{p_i \in S} TR(p_i) =$$
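
The TrustRank recurrence in Problem 2 can be evaluated by simple fixed-point iteration. The sketch below does that for an arbitrary link graph. The function name `trustrank`, its parameters, and the three-page toy graph are all hypothetical. The toy graph is not the graph from the exam figure.

```python
def trustrank(links, bias, damping=0.8, iterations=100):
    """Iterate TR(p_i) = damping * sum(TR(p_j) / c_j over parents p_j) + b_i.

    links: dict mapping each page to the list of pages it links to
    bias:  dict mapping trusted pages to their b_i (missing pages get 0)
    """
    pages = set(links) | {q for outs in links.values() for q in outs} | set(bias)
    tr = {p: bias.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        new = {p: bias.get(p, 0.0) for p in pages}
        for p, outs in links.items():
            for q in outs:
                # c_j is the out-degree of the parent page p
                new[q] += damping * tr[p] / len(outs)
        tr = new
    return tr

# Hypothetical toy graph: a -> b, a -> c, b -> c; only page a is trusted.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": []}
toy_bias = {"a": 0.1}
toy_scores = trustrank(toy_links, toy_bias)
for page in sorted(toy_scores):
    print(page, round(toy_scores[page], 4))
```

In the toy graph, page b splits nothing and passes its whole damped score to c, while page a splits its score across two outgoing links. This is the effect of the 1/c_j factor in the formula.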
Problem 3: 20 points
Given the following 3×3 matrix A, we perform its singular value decomposition into Q_1 D Q_2^T:

$$A = \begin{pmatrix} \frac{6\sqrt{2}}{5} & -1 & \frac{9}{5\sqrt{2}} \\ \frac{6\sqrt{2}}{5} & 1 & \frac{9}{5\sqrt{2}} \\ -\frac{3}{5} & 0 & \frac{4}{5} \end{pmatrix} = Q_1 D Q_2^T,$$

where Q_1 and Q_2 are orthonormal matrices and D is a diagonal matrix.
1. Write down the three decomposed matrices in the following space (7 points):

Q_1 = [ ________ ]    D = [ ________ ]    Q_2^T = [ ________ ]

Please note that the third matrix is Q_2^T, not Q_2.
You may find the following information helpful in performing the decomposition:
A^T A is a symmetric matrix with three eigenvalues, 2, 1, and 9, whose corresponding
eigenvectors are (0, 1, 0), (−3/5, 0, 4/5), and (4/5, 0, 3/5), respectively. AA^T is
also a symmetric matrix with three eigenvalues, 1, 2, and 9, whose corresponding
eigenvectors are (0, 0, 1), (−1/√2, 1/√2, 0), and (1/√2, 1/√2, 0), respectively.
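
The eigenvalue information above can be checked numerically. The following is a minimal sketch in plain Python, with the entries of A as read from the problem statement. The helper names (`transpose`, `matmul`, `matvec`, `is_eigenpair`) are introduced here for illustration only.

```python
from math import sqrt, isclose

r2 = sqrt(2)
# Entries of A as read from the problem statement.
A = [
    [6 * r2 / 5, -1.0, 9 / (5 * r2)],
    [6 * r2 / 5,  1.0, 9 / (5 * r2)],
    [-3 / 5,      0.0, 4 / 5],
]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def is_eigenpair(M, lam, v, tol=1e-9):
    """True if M v equals lam * v component-wise, up to tol."""
    return all(isclose(a, lam * b, abs_tol=tol) for a, b in zip(matvec(M, v), v))

AtA = matmul(transpose(A), A)
AAt = matmul(A, transpose(A))

ata_pairs = [(2, [0, 1, 0]), (1, [-3 / 5, 0, 4 / 5]), (9, [4 / 5, 0, 3 / 5])]
aat_pairs = [(1, [0, 0, 1]), (2, [-1 / r2, 1 / r2, 0]), (9, [1 / r2, 1 / r2, 0])]

print(all(is_eigenpair(AtA, lam, v) for lam, v in ata_pairs))
print(all(is_eigenpair(AAt, lam, v) for lam, v in aat_pairs))
```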
2. Please describe the linear transformation that the matrix A represents as a (combination of) stretching and/or rotation. For stretching, make sure you specify the
stretching direction(s) and stretching factor. For rotation, describe how three orthonormal (basis) vectors transform. (For example, you may say that the matrix
corresponds to a rotation, where the three basis vectors (1, 0, 0), (0, 1/√2, 1/√2), and
(0, −1/√2, 1/√2) transform to (0, 1, 0), (3/5, 0, 4/5), and (4/5, 0, −3/5), respectively.)
(7 points)
3. Consider the multiplication of the matrix A with a vector X whose length is 1 (i.e.,
|X| = 1). What is the largest possible value of |AX|? What is the X that gives such
a |AX|? (6 points)
Problem 4: 20 points
Consider a collection of 5 documents, d1, d2, ..., d5. Each document contains 10 words. The
entire document collection contains three unique words, Microsoft, office, and chair. The
following diagram shows how many times each word appears in each document. For now
ignore the color of each circle in the diagram.
            d1         d2       d3      d4       d5
Microsoft   -          -        ••◦     •••••    •••••••
office      ◦•◦        ◦◦◦•◦    •◦◦•    ◦•◦••    •◦•
chair       ◦◦◦◦◦•◦    ◦◦◦◦◦    •◦◦     -        -
For example, the document d3 contains the word Microsoft three times, office four times,
and chair three times.
We have run the LDA algorithm and have assigned each word in the documents to one of
two topics, z1 and z2, through multiple iterations. The result so far is shown as a black (z1)
or white (z2) dot in the above diagram. For example, the first and the third occurrences of
office in d1 have been assigned to z2 (white dots), and the second occurrence of office in d1
has been assigned to z1 (a black dot). We are using α = 0 and β = 0 as the parameter values
of the LDA algorithm.
1. As the final step in our LDA iterations, we want to reassign the topic of the third
office of d5 to either z1 or z2 . In the previous iteration, the word was assigned to z1 (a
black dot), as shown in the diagram. What should be the probability that the word
is assigned to z1 in the current iteration? What about to z2? Write down the two
probabilities. (5 points)
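
A single reassignment step of this kind can be sketched as follows. The sketch assumes the standard collapsed Gibbs sampling update for LDA, in which the probability of topic k is proportional to (n_{d,k} + α)/(n_d + Kα) · (n_{k,w} + β)/(n_k + Vβ), with every count excluding the token being resampled. The function name, its parameters, and the toy counts are hypothetical and are not taken from the diagram.

```python
from fractions import Fraction

def gibbs_topic_probs(doc_topic, topic_word, topic_total, vocab_size, alpha=0, beta=0):
    """Return the normalized probability of each topic for one token.

    doc_topic[k]   : tokens in this document assigned to topic k
    topic_word[k]  : occurrences of this word assigned to topic k (all documents)
    topic_total[k] : all tokens assigned to topic k
    All counts must already exclude the token being resampled.
    alpha and beta must be integers or Fractions in this sketch.
    With beta = 0, every topic_total[k] must be non-zero.
    """
    k_topics = len(doc_topic)
    doc_len = sum(doc_topic)
    weights = []
    for k in range(k_topics):
        doc_part = Fraction(doc_topic[k] + alpha, doc_len + k_topics * alpha)
        word_part = Fraction(topic_word[k] + beta, topic_total[k] + vocab_size * beta)
        weights.append(doc_part * word_part)
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy counts for a two-topic model with a three-word vocabulary.
toy_probs = gibbs_topic_probs(
    doc_topic=[3, 1], topic_word=[2, 6], topic_total=[10, 12], vocab_size=3
)
print(toy_probs)
```

Using `Fraction` keeps the result exact, which makes it easy to compare against a hand calculation.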
2. Assume that the third office of d5 was assigned again to z1 in our last iteration.
Given the final assignment of topics, write down the estimated values of the following
probability vectors. (10 points)
Probability vector                                Values
⟨P(z1|d1), P(z2|d1)⟩
⟨P(z1|d3), P(z2|d3)⟩
⟨P(z1|d4), P(z2|d4)⟩
⟨P(Microsoft|z1), P(office|z1), P(chair|z1)⟩
⟨P(Microsoft|z2), P(office|z2), P(chair|z2)⟩
3. Given the query Microsoft, we decided to rank documents based on TF-IDF weighting
and the cosine similarity measure between the query and each document. Which
documents will have non-zero scores under this scheme? What will be their relative
ranking? For example, if d1, d3, and d4 have non-zero scores and d1 has a higher
similarity score than d4, and d4 has a higher score than d3, write your answer as
“d1 > d4 > d3”. (5 points)
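
The ranking scheme in this question can be sketched in a few lines. The sketch assumes tf is the raw term count and idf is log(N / df). Other TF-IDF variants exist and may give different scores. The function name and the four toy documents are hypothetical and unrelated to d1 through d5.

```python
import math
from collections import Counter

def tfidf_cosine_scores(docs, query):
    """Cosine similarity between the query and each document under TF-IDF.

    Assumes tf = raw count and idf = log(N / df); this is one common variant.
    """
    n = len(docs)
    counts = [Counter(d) for d in docs]
    df = Counter(w for c in counts for w in c)      # document frequency per word
    idf = {w: math.log(n / df[w]) for w in df}
    q = Counter(query)
    q_vec = {w: q[w] * idf.get(w, 0.0) for w in q}
    q_norm = math.sqrt(sum(v * v for v in q_vec.values()))
    scores = []
    for c in counts:
        d_vec = {w: c[w] * idf[w] for w in c}
        d_norm = math.sqrt(sum(v * v for v in d_vec.values()))
        dot = sum(q_vec[w] * d_vec.get(w, 0.0) for w in q_vec)
        scores.append(dot / (q_norm * d_norm) if q_norm and d_norm else 0.0)
    return scores

# Hypothetical toy collection.
toy_docs = [
    ["apple", "apple", "crust"],
    ["apple", "crust", "crust"],
    ["pie", "crust"],
    ["pie"],
]
toy_scores = tfidf_cosine_scores(toy_docs, ["apple"])
print([round(s, 3) for s in toy_scores])
```

Under this variant, a word that appears in every document has idf = log(1) = 0 and so contributes nothing to any document vector.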
Problem 5: 20 points
You want to simulate the arrival of incoming phone calls at your company’s call center.
You know that your call arrivals follow a Poisson process very closely. On average, your
call center gets 20 phone calls per hour. To perform simulation, you decided to discretize
time in the unit of one second and generate a call arrival event at the beginning of every
second with probability p.
1. Compute the value of the probability p to be used for this simulation. (10 points)
2. What is the probability that your call center does not get any phone call for x seconds?
(10 points)
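
The discretized simulation described in Problem 5 can be sketched as below. Two common choices for the per-second probability are shown, and which one is intended is an assumption: the linear approximation p ≈ λ·Δt, and the exact probability of at least one Poisson arrival in one second. All names in the code are hypothetical.

```python
import math
import random

CALLS_PER_HOUR = 20
RATE_PER_SECOND = CALLS_PER_HOUR / 3600      # lambda = 1/180 calls per second

p_linear = RATE_PER_SECOND                   # p ~ lambda * dt with dt = 1 second
p_exact = 1 - math.exp(-RATE_PER_SECOND)     # P(at least one arrival in 1 second)

def prob_no_call(x_seconds, p=p_linear):
    """Probability of no arrival event in x consecutive one-second slots."""
    return (1 - p) ** x_seconds

def simulate_calls(seconds, p=p_linear, seed=0):
    """Count arrival events when each second independently fires with probability p."""
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(seconds))

print(p_linear, p_exact)
print(prob_no_call(180), math.exp(-1))       # discrete value vs. continuous e^(-x/180)
print(simulate_calls(3600 * 100) / 100)      # average calls per hour over 100 hours
```

Because one second is short relative to the mean gap of 180 seconds between calls, the two choices of p differ only slightly, and (1 − p)^x stays close to the continuous-time value e^(−x/180).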