Eduard Kuric, prof. Mária Bieliková

Transcription

Eduard Kuric, prof. Mária Bieliková
Search in Source Code Based on Identifying Popular Fragments
Eduard Kuric, prof. Mária Bieliková
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
> Motivation
When programmers write new code, they are often
interested in nding de nitions of functions, existing,
working fragments, with the same or similar functionality and reusing as much of that code as possible.
They want easily understand how the functions are
used and see the sequence of function invocations in
order to understand how concepts are implemented.
> Goals
enable programmers to nd relevant functions to
query terms and their usages; and see associations
between concepts in the functions and query
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
> Searching for relevant functions given a query
> Processing source code repository
A1
2
A2
Source code
repository
Subdocuments
Indexes
Index
creator
Relevant
documents
Indexes
1
B1
Function names
3
7
5
3
B2
C
Programmer
Function graph
creator
Functional
dependencies
PageRank
Cosine similarity
5
4
Functional
dependencies
6
8
sc(Fn) = sim(dj, q) + sim(djk, q) + pr(Fn)
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
The function graph creator (B1) creates a directed graph
of functional dependencies (B2).
- nodes represent functions (names of functions)
- a directed edge between the function G and the function H
is created if the function H is invoked in the function G
Retrieving the relevant documents:
1. for a query (1), a list of relevant documents (code les) is retrieved (2);
2. similarity (3) between the query q and a relevant document dj is
calculated (4) using the cosine distance;
3. each retrieved relevant document is divided into subdocuments, where
each one contains only one de nition of a function with surrounded
comments (if any);
4. from each subdocument, terms are extracted from comments, function
name and identi ers; for the terms, TF/IDF weights are calculated; and
the cosine similarity (5) is calculated between each subdocument djk
and the query q (6).
The PageRank process (C) is run on the directed graph of functional dependencies, and it calculates a rank vector, in which
every element is a score for each function in the graph.
- the PageRank of a function is de ned recursively and depends
on how many functions call (invoke) it
- the rank value indicates an importance of a particular function
Ranking the relevant functions:
1. names of de ned functions are extracted from the relevant documents;
2. for each function name Fn, a nal score sc(Fn) is calculated as a sum of:
a. a similarity between the query q and the document dj, in which is
the function Fn de ned (4); a similarity between the query q and
the subdocument djk, in which is the function Fn de ned (6),
b. a PageRank score pr(Fn) for the Fn (7)(8).
The index creator (A1) creates document and term indexes (A2).
- based on the vector space model - each document is modelled
as a vector of terms, which occur in that document
- terms are extracted from comments, names of functions and
identi ers -- NLP techniques are applied
72
73
74
75
76
77
78
79
80
eMail:
web:
> Method evaluation
- project Vilcacora for interactive browsing of multimedia content
- the precision metric - the fraction of the top 10 ranked functions
relevant to a query; for each query a level of con dence was
assigned (completely/mostly irrelevant, mostly/highly relevant)
[email protected]
www.fiit.stuba.sk/~kuric
[email protected]
www.fiit.stuba.sk/~bielik
Examples of the results for 3 queries (0.74 mean precision for 15 queries)
query
precision
load, image, create,
texture, map object
0.7
image, transform, ip,
aspect, ratio
0.4
mask, texture,
grayscale
0.8