Why My App Got Deleted: Detection of Spam Mobile Apps

Suranga Seneviratne†*, Aruna Seneviratne†*, Dali Kaafar†, Anirban Mahanti†, Prasant Mohapatra‡
† NICTA; * School of EET, UNSW; ‡ Department of CS, University of California, Davis
ABSTRACT
The increased popularity of smartphones has attracted a large number of developers to build apps for the various smartphone platforms. As a result, app markets have also become populated with spam apps, which reduce users' quality of experience and increase the workload of app market operators. Operators resort to removing such apps upon user complaints, or to denying developers' publication approval requests, relying on continuous human effort to check app compliance with anti-spam policies. Apps can be “spammy” in multiple ways, including not having a specific functionality, having an unrelated app description or unrelated keywords, and being published several times as similar apps across diverse categories. Through a systematic crawl of a popular app market, and by identifying a set of removed apps, we propose a method to detect spam apps solely using the apps' metadata available at the time of publication. We first propose a methodology to manually label a sample of removed apps according to a set of checkpoint heuristics that reveal the reasons behind app removal. This analysis suggests that approximately 35% of the removed apps are very likely to be spam apps. We then map the identified heuristics to several quantifiable features and show how discriminative these features are for spam apps. Finally, we build an Adaptive Boost classifier for early identification of spam apps using only the metadata of the apps. Our classifier achieves an accuracy over 95%, with precision varying between 85%–95% and recall varying between 38%–98%. By applying the classifier to a set of apps present in the app market at the time of our crawl, we estimate that at least 2.7% of them are spam apps.
1. INTRODUCTION
Recent years have seen wide adoption of mobile apps, and the number of apps on offer is growing exponentially. As of mid-2014, Google Play Store and Apple App Store each hosted approximately 1.2 million apps [40, 52], with around 20,000 new apps being published each month
in both of these app markets. These numbers are expected
to significantly grow over the next few years [53, 40, 41].
Both Google and Apple have policies governing the publication of apps through their app markets. Google has an
explicit spam policy [32] that describes their main reasons
for considering an app to be “spammy”. These include: i)
apps that were automatically generated and published, and
as such do not have any specific functionality or a meaningful description; ii) multiple instances of the same app being
published to obtain increased visibility in the app market;
and iii) apps that make excessive use of unrelated keywords
to attract unintended searches. Overall, spam apps vitiate
the app market experience and its usefulness.
At present, Google and Apple take different approaches to
the spam detection problem. Google’s approach is reactive.
Google’s app approval process does not check an app against
its explicit spam app policy [46]; Google takes action only on the basis of customer complaints [50]. In general, such a crowd-sourced approach reveals spam apps only once an app becomes popular, which can lead to a considerable time lag
between app submission and detection of spam behaviour.
In contrast, Apple scrutinises the apps submitted for approval, presumably using a manual or semi-manual process,
to determine whether the submitted app conforms to Apple’s policies. Although this approach is likely to detect
spam apps before they are published, it lengthens the app
approval process. With the ever-increasing number of apps
being submitted daily for approval, the app market operators
need to be able to detect spam apps quickly and accurately.
In this paper, we propose a methodology for automated
detection of spam apps at the time of app submission that
can be used in both app market management models. Our
classifier utilises only features that can be derived from an
app’s metadata available during the publication approval
process. It does not require any human intervention such as
manual inspection of the metadata or manual testing of the
app. We validate our app classifier by applying it to a large dataset of apps collected between December 2013 and May 2014, crawling and identifying apps that were removed from the Google Play Store.
We make the following contributions in this paper:
• We develop a manual app classification methodology
based on a set of heuristic checkpoints that can be used
to identify reasons behind an app’s removal. Using this
methodology we found that approximately 35% of the
apps that were removed are spam apps; problematic
content and counterfeits were the other key reasons
for removal of apps.
• We present a mapping of our proposed spam checkpoints to one or more quantifiable features that can
be used to train a learning model. We provide a characterisation of these features highlighting differences
between spam and non-spam apps and indicate which
features are more discriminative.
• We build an Adaptive Boost classifier for early detection of spam apps and show that our classifier can
achieve an accuracy over 95% at a precision between
85%–95% and a recall between 38%–98%.
• We applied our classifier to over 180,000 apps available in Google Play Store and show that approximately
2.7% of them are potentially spam apps.
To the best of our knowledge, this is the first work to develop an early detection framework for identifying spam mobile apps and to demonstrate its efficacy using real-world
datasets. Our work complements prior research on detecting apps seeking over-permissions or apps containing malware [64, 23, 9, 63, 62, 49, 22].
The remainder of the paper is organised as follows. In Section 2 we discuss related work. In Section 3, we describe our
datasets and our methodology. Section 4 presents our manual classification process and the heuristic checkpoints, while
Section 5 introduces the process of mapping these checkpoints to machine learning features. We evaluate the performance of our classifier in Section 6. Section 7 discusses
the implications of our work and concludes the paper.
2. RELATED WORK
Machine learning and statistical analysis have been applied for a range of spam detection problems. However, to
the best of our knowledge, these approaches have not been
used to automatically identify spam apps in mobile app markets. In this section, we discuss related work in spam detection for web pages, SMS, and emails, and the detection of
malware apps and over-privileged apps.
Web spam refers to the publication of web pages that are
specifically designed to influence search engine results. Using a set of manually classified samples of web pages obtained from “MSN Search”, Ntoulas et al. [45] characterise
the web page features that can be used to classify a web
page as spam or not. The features include the top-level domain, the language of the web page, the number of words in the web page, the number of words in the page title, and the average word length. The authors show that a decision tree-based classifier
achieves up to 85% precision and 90% recall for spam while
keeping the respective values for non-spam around 98%.
Fetterly et al. [18] characterise the features that can potentially be used to identify web spam through statistical
analysis. The authors analyse features such as URL properties, host name resolutions, and linkage properties and find
that outliers within each of the properties considered are
spam. Castillo et al. [11] describe a cost-sensitive bagging
classifier with decision trees, which when used individually
with link-based features and content-based features can classify web pages into spam and non-spam. Gyöngyi et al. [24]
propose the use of the link structure in a limited set of manually identified non-spam web pages to iteratively find spam
and non-spam web pages, and show that a significant fraction of the web spam can be filtered using only a seed set
of less than 200 sites. Krishnan et al. [37] use a similar
approach. Erdélyi et al. [16] show that a computationally
inexpensive feature subset, such as the number of words in
a page, the number of words in a title, and the average word
length, is sufficient to detect web spam.
Detection of email spam has received considerable attention [7]. Various content related features of mail messages
such as email header, text in the email body, and graphical elements have been used together with machine learning
techniques, such as naive Bayesian [48, 54, 42, 39], support
vector machines [15, 3, 55, 6], and k nearest neighbour [2].
Non-content-related features such as the SMTP path [38] and the user's social network [47, 13] have also been used in spam email detection. Currently, spam email detection is ubiquitous and most email services provide spam filtering.
Spam has also been studied in the context of SMS [21, 14], product reviews [35, 34, 12], blog comments [43], and social media [60, 5]. Cormack et al. [14] show that, due to the limited
size of SMS messages, bag of words or word bigram based
spam classifiers do not perform well, and their performance
can be improved by expanding the set of features to include
orthogonal sparse word bigrams, and character bigrams and
trigrams. Jindal et al. [35] identify spam product reviews using review-centric features such as the number of feedback reports and textual features of the reviews, product-centric features such as price and sales rank, and reviewer-centric features such as the average rating given by the reviewer. Wang et al. [60] study the detection of spammers on Twitter.
More recently, malware mobile apps and over-privileged
apps (i.e., apps with over-permissions) have received attention [64, 23, 9, 63, 62, 49, 22]. Zhou et al. [64] propose DroidRanger, which uses permission-based behavioural
fingerprinting to detect new samples of known Android malware families and heuristics-based filtering to identify inherent behaviours of unknown malicious families. Gorla et
al. [22] regroup app descriptions using a Latent Dirichlet
Allocation based model and k-means clustering to identify
apps that have unexpected API usage characteristics.
3. METHODOLOGY

3.1 Dataset
We use apps collected in a previous study [57] as a seed
for collecting a new dataset. This initial seed contained
94,782 apps and was curated from the lists of apps obtained
from approximately 10,000 smartphone users. The user base
consisted of volunteers, Amazon mechanical turk users, and
users who published their lists of apps in social app discovery sites such as Appbrain (www.appbrain.com). We crawled each app’s Google
Play Store page to discover: i) functionally similar apps; ii)
other apps by the same developer. Then we collected metadata such as app name, app description, and app category
for all the apps we discovered including the seed set. The
complete list of metadata can be found in Appendix 9.15.
We discarded the apps with a non-English description using
the language detection API, “Detect Language” [1]. We refer
to this final set as the Observed Set - O.
One month after identifying the Observed Set,
we revisited Google Play Store to check the availability of
each app. The subset of apps that were unavailable at the
time of this second crawl is referred to as Crawl 1 - C1 .
This process was repeated two times, Crawl 2 - C2 and
Crawl 3 - C3 with at least one month gap between two
consecutive crawls. Figure 1 provides an overview of the
data collection process and Table 1 summarises the datasets
in use.
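The availability re-check in each crawl round is mechanical and can be scripted. The following is a minimal Python sketch of that step; it assumes a removed app's store page returns HTTP 404, and the function and error handling are illustrative rather than the exact crawler used in this study.

    import requests

    PLAY_URL = "https://play.google.com/store/apps/details?id={}"

    def removed_apps(package_names):
        """Return the subset of apps whose store page no longer resolves."""
        removed = []
        for pkg in package_names:
            try:
                resp = requests.get(PLAY_URL.format(pkg), timeout=10)
            except requests.RequestException:
                continue  # transient network error: re-check in the next crawl
            if resp.status_code == 404:  # page gone: app removed from the market
                removed.append(pkg)
        return removed

    # Example: crawl_1 = removed_apps(observed_set_package_names)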
Figure 1: Chronology of the data collection (initial seed and app discovery, followed by Crawls 1–3 with roughly four-week gaps, Dec'13 – May'14)

Table 1: Summary of the dataset
Set                Number of apps
Observed set (O)   232,906
Crawl 1 (C1)       6,566
Crawl 2 (C2)       9,184
Crawl 3 (C3)       18,897

Table 2: Key reasons for removal of apps
Spam: These apps often have characteristics such as an unrelated description, keyword misuse, and multiple instances of the same app. Section 4 presents details on spam app characteristics.
Unofficial content: Apps that provide unofficial interfaces to popular websites or services (e.g., an app providing an interface to a popular online shopping site without any official affiliation).
Copyrighted content: Apps illegally distributing copyrighted content.
Adult content: Apps with explicit sexual content.
Problematic content: Apps with illegal or otherwise problematic content, e.g., hate speech and drug-related content.
Android counterfeit: Apps pretending to be another popular app in the Google Play Store.
Other counterfeit: A counterfeit app for which the original app comes from a source other than Google Play Store (e.g., Apple App Store).
Developer deleted: Apps that were removed by the developer.
Developer banned: The developer's other apps were removed for various reasons and Google decided to ban the developer; thus all of the developer's apps were removed.
We identify temporary versus long-term removal of apps
by re-checking the status of apps deemed to have been
removed during an earlier crawl. For instance, all apps
in Crawl 1 were checked again during Crawl 2. We
found that only 85 (∼0.13%) apps identified as removed in
Crawl 1 reappeared in Crawl 2. Similarly, only 153 apps
(∼0.02%) identified as removed in Crawl 2 reappeared in
Crawl 3. These apps were not included in our analysis.
3.2 App Labelling Process
For a subset of the removed apps in our dataset, our goal
was to manually identify the reasons behind their removal.
As there can be multiple reasons for removal, identifying why a given app was removed is challenging and time consuming.
We identified factors that lead to removal of apps from the
market place by consulting numerous market reports [51,
26] as well as by examining the policies of the major app
markets [32, 30, 28, 29, 31, 25]. We identified nine key
reasons, which are summarised in Table 2. For each of
these reasons, we formulated a set of heuristic checkpoints
that can be used to manually label whether or not an app
is likely to be removed. Owing to space limitations, we do not provide the heuristic checkpoints for removal reasons other than spam; the full list of checkpoints for each reason
can be found in Appendix 9.2–9.8. Section 4 delves into the
checkpoints developed for identifying spam apps.
From Crawl 1, we took a random sample of 1500 apps and
asked three independent reviewers to identify the reasons
behind the removal of each app using the heuristic checkpoints that we developed as a guideline. These reviewers
were domain experts and had experience in developing and
publishing Android apps. If a reviewer could not conclusively determine the reasons behind a removal, she labeled
those apps as unknown. Moreover, for consistency, reviewers were asked to follow the same order of reasons for each
app (i.e., check for spam first, then for unofficial content,
then for copyrighted content followed by Android counterfeits and so on) and label the first reason they found as the
most plausible. The set of apps that were used is referred to as the Labelled Set (L).
Table 3: Reviewer agreement in labelling the reason for removal
Reason                  3 agree  2 agree  Total  Percentage
Spam                    292      259      551    36.73%
Unofficial content      65       127      192    12.80%
Developer deleted       68       56       124    8.27%
Android counterfeit     27       61       88     5.87%
Developer banned        24       54       78     5.20%
Copyrighted content     2        34       36     2.40%
Other counterfeit       11       23       34     2.27%
Adult content           8        4        12     0.80%
Problematic content     3        4        7      0.47%
Unknown                 101      127      228    15.20%
Sub total               601      749      1350   90.00%
Reviewer disagreement   NA       NA       150    10.00%
Total labelled          NA       NA       1500   100.00%
3.3 Agreement Among the Reviewers
We used majority voting to merge the experts' assessments and arrive at the reason behind each app removal in L. We decided not to crowdsource the labelling task to avoid issues with training and expertise.
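A minimal sketch of this merging step is shown below; it assumes each app carries exactly one removal-reason label per reviewer, and the function name is illustrative:

    from collections import Counter

    def merge_labels(reviewer_labels):
        """Majority vote over three reviewers' removal-reason labels.

        Returns the label at least two reviewers agree on, or None
        when all three reviewers disagree.
        """
        label, count = Counter(reviewer_labels).most_common(1)[0]
        return label if count >= 2 else None

    # merge_labels(["Spam", "Spam", "Unofficial content"]) -> "Spam"
    # merge_labels(["Spam", "Adult content", "Unknown"])   -> None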
Table 3 summarises reviewer agreement. For approximately 40% (601 out of 1500) of the apps in L, all three
reviewers reached a consensus by identifying the same reason, and for 90% (1350 out of 1500) of the apps a majority of
the reviewers agreed on the same reason. For the remaining
10% of apps, reviewers disagreed about the reasons.
Table 3 also shows the composition of L after majority
voting-based label merging. We observe that spam is the
main reason for app removal, accounting for approximately
37% of the removals, followed by unofficial content accounting for approximately 13% of the removals. Around 15% of
the apps were labelled as unknown.
Figure 2 shows the conditional probability of the third
reviewer’s reasoning, given that the other two reviewers are
in agreement. There is over 50% probability of the third reviewer’s judgement of an app being spam, when two reviewers already judged the app to be spam. Other reasons showing such high probability are developer deleted and adult
content apps.
Figure 2: Probability of a third reviewer's judgement when two reviewers already agreed on a reason

Our analysis through the manual labelling process shows that the main reason behind app removal is that the removed apps are spam. Furthermore, reviewer agreement was high (more than 50%) when manually labelling spam apps, indicating that spam apps stand out clearly when looking at removed apps.
4. MANUAL LABELLING OF SPAM APPS
This section introduces our heuristic checkpoints, which are used to manually label the spam apps. We also discuss the reviewers' assessments according to the checkpoints. Section 5 maps the defined checkpoints into automated features.

4.1 Spam Checkpoints
Table 4 presents the nine heuristic checkpoints used by the reviewers; these were derived from Google's spam policy [32]. While we are unaware of Apple's spam policy, we note that Apple's app review guideline [25] includes certain provisions that match our spam checkpoints. For example, term 2.20 of the Apple app review guideline, “Developers spamming the App Store with many versions of similar apps will be removed from the iOS Developer Program”, maps to our checkpoint S6. Similarly, term 3.3, “Apps with names, descriptions, screenshots, or previews not relevant to the content and functionality of the app will be rejected”, is similar to our checkpoint S2.
Reviewers were asked not to deem an app spam simply because it satisfies a single checkpoint, but to base their decision on matches with multiple checkpoints, according to their domain expertise. Note that Table 4 includes all checkpoints relevant to spam, including those that are considered only for manual labelling and not for automated labelling. In particular, checkpoints S8 and S9 are used only for manual labelling, as the features corresponding to these checkpoints are either not available at the time of app submission or require significant dynamic analysis for feature discovery. For instance, we do not use user reviews of the app (cf. checkpoint S8), as user reviews are not available prior to the app's approval. Similarly, checking for excessive advertising (cf. checkpoint S9) requires dynamic analysis by executing the app in a controlled environment, and was therefore also used only for manual labelling.
Table 4: Checkpoints used for labelling of spam apps
S1 (Description): Does the app description describe the app function clearly and concisely?
E.g., Signature Capture App (non-spam) - The description is clear on the functionality of the application:
"SignIt app allows a user to sign and take notes which can be saved and shared instantly in email or social media."
E.g., Manchester (spam) - The description contains details about Manchester United Football Club without any detail on the functionality of the app:
"Manchester United Football Club is an English professional football club, based in Old Trafford, Greater Manchester, that plays in the Premier League. ..... In 1998-99, the club won a continental treble of the Premier League, the FA Cup and the UEFA Champions League, an unprecedented feat for an English club."

S2 (Description): Does the app description contain too much detail, incoherent text, or text unrelated to an app description?
E.g., SpeedMoto (non-spam) - The description is clear and concise about the functionality and usage:
"SpeedMoto is a 3d moto racing game with simple control and excellent graphic effect. Just swap your phone to control moto direction. Tap the screen to accelerate the moto. In this game you can ride the motorcycle shuttle in outskirts, forest, snow mountain,bridge. More and More maps and motos will coming soon"
E.g., Ferrari Wallpapers HD (spam) - The description starts by presenting the app as a wallpaper, but then goes into details about Ferrari:
"*HD WALLPAPERS *EASY TO SAVE *EASY TO SET WALLPAPER *INCLUDES ADS FOR ESTABLISHING THIS APP FREE TO YOU THANKS FOR UNDERSTANDING AND DOWNLOADING =) Ferrari S.p.A. is an Italian sports car manufacturer based in Maranello, Italy. Founded by Enzo Ferrari in 1929, as Scuderia Ferrari, the company sponsored drivers and manufactured race cars before moving into production of street-legal ....."

S3 (Description): Does the app description contain a noticeable repetition of words or keywords?
E.g., English Chinese Dictionary (non-spam) - Keywords do not have excessive repetition:
"Keywords: ec, dict, translator, learn, translate, lesson, course, grammar, phrases, vocabulary, translation, dict"
E.g., Best live HD TV no ads (spam) - Excessive repetition of words:
"Keywords: live tv for free mobile tv tuner tv mobile android tv on line windows mobile tv verizon mobile tv tv streaming live tv for mobile phone mobile tv stream mobile tv phone mobile phone tv rogers mobile tv live mobile tv channels sony mobile tv free download mobile tv dstv mobile tv mobile tv....."

S4 (Description): Does the app description contain unrelated keywords or references?
E.g., FD Call Assistant Free (non-spam) - All the keywords are related to the fire department:
"Keywords: firefighter, fire department, emergency, police, ems, mapping, dispatch, 911"
E.g., Diamond Eggs (spam) - References to the popular games Bejeweled Blitz and Diamond Blast without any reason:
"Keywords : bejeweled, bejeweled blitz, gems twist, enjoy games, brain games, diamond, diamond blast, diamond cash, diamond gems,Eggs, jewels, jewels star"

S5 (Description): Does the app description contain excessive references to other applications from the same developer?
E.g., Kids Puzzles (non-spam) - The description does not contain references to the developer's other apps:
"This kids game has 55 puzzles. Easy to build puzzles. Shapes Animals Nature and more... With sound telling your child what the puzzle is. Will be adding new puzzles very soon. Keywords: kids, puzzle, puzzles, toddler, fun"
E.g., Diamond Snaker (spam) - Excessive references to the developer's other applications:
"If you like it, you can try our other apps (Money Exchange, Color Blocks, Chinese Chess Puzzel, PTT Web ....."

S6 (Description): Does the developer have multiple apps with approximately the same description?
E.g., the developer "Universal App" has 16 apps with the following description, where each time the term XXXX is replaced with some other term:
"Get XXXX live wallpaper on your devices! Download the free XXXX live wallpaper featuring amazing animation. Now with "Water Droplet", "Photo Cube", "3D Photo Gallery" effect! Touch or tap the screen to add water drops on your home screen! Touch the top right corner of the screen to customise the wallpaper <>. To Use: Home -> Menu -> Wallpaper -> Live Wallpaper -> XXXX 3D Live Wallpaper To develop more free great live wallpapers, we have implemented some ads in settings. Advertisement can support our develop more free great live wallpapers. This live wallpaper has been tested on latest devices such as Samsung Galaxy S3 and Galaxy Nexus. Please contact us if your device is not supported. Note: If your wallpaper resets to default after reboot, you will need put the app on phone instead of SD card."

S7 (Identifier): Does the app identifier make sense and have some relevance to the functionality of the application, or does it look auto generated?
E.g., Angry Birds Seasons & Candy Crush Saga (non-spam) - Identifiers give an idea about the app:
"com.rovio.angrybirdsseasons", "com.king.candycrushsaga"
E.g., Game of Thrones FREE Fan App & How To Draw Graffiti (spam) - Identifiers appear to be auto generated:
"com.a121828451851a959009786c1a.a10023846a", "com.a13106102865265262e503a24a.a13796080a"

S8 (Reviews): Do users complain about the app being spammy in reviews?
E.g., Yoga Trainer & Fitness & Google Play History Cleaner (spam) - Users complain about the app being spammy:
"Junk spam app Avoid", "More like a spam trojan! Download if you like, but this is straight garbage!!"

S9 (Adware): Do online APK checking tools flag the app as having excessive advertising?
E.g., Biceps & Triceps Workouts - "AVG threat labs" [27] gives a warning about the inclusion of malware causing excessive advertising.
4.2 Reviewer Agreement on Spam Checkpoints
Table 5: Checkpoint-wise reviewer agreement for spam

Table 5 shows how often the reviewers agreed upon each checkpoint. Checkpoints S2 and S6 led to the highest numbers of 3-reviewer agreements and 2-reviewer agreements. Checkpoints S1 and S3 have a moderate number of 3-reviewer agreements while having a high number of 2-reviewer agreements. The table also suggests that checkpoints S1, S2, S3 and S6 are the most dominant checkpoints.
The sum of the cases where 3 reviewers agreed, 2 reviewers agreed, and reviewers disagreed, shown in Table 5, is more than the total number of spam apps identified by merging the reviewer agreement (i.e., 551 apps), because one reviewer might mark an app as satisfying multiple checkpoints. For example, consider a scenario where reviewer 1 labels the app as satisfying checkpoints S1 and S2, reviewer 2 labels S1 and S2, and reviewer 3 labels only S1. This causes one instance of 3-reviewer agreement (for S1) and another of 2-reviewer agreement (for S2).
Out of the 551 spam apps, three reviewers agreed on at least one checkpoint for 210 apps (∼38%), and two reviewers agreed on at least one checkpoint for 296 apps (∼54%). For 45 apps (∼8%), the three reviewers gave different checkpoints (i.e., when the reviewers assess an app as spam, over 90% of the time they also agree on at least one checkpoint).

Figure 3: Probability of a third reviewer's judgement when two reviewers already agreed on a checkpoint

Figure 3 illustrates the conditional probability of the third reviewer's checkpoint selection given that two reviewers have already agreed on a checkpoint. We observe that for checkpoints S1, S2, S5, S6, S7 and S8, there is a high probability that the third reviewer indicates the same checkpoint when two of the reviewers already agree. There is, however, a noticeable anomaly for checkpoint S4, which reviewers appear to confuse with S3. This is due to the lower frequency of occurrence of checkpoint S4: there were no 3-reviewer agreements for S4, and only 3 cases where 2 reviewers agreed, as shown in Table 5. In those 3 cases the other reviewer marked checkpoint S3 in 2 cases, giving a high probability (∼67%) of S4 being confused with S3. Note that Other refers to all the checkpoints associated with reasons for app removal other than spam; the probability of Other being chosen is high because of the accumulated sum of probabilities of checkpoints associated with other reasons.
5. FEATURE MAPPING
The checkpoint heuristics for spam (cf. Table 4) need to be mapped into features that can be extracted and used for automatic spam detection during the app approval process. This section presents this mapping and a characterisation of these features with respect to spam and non-spam apps.
A spam app is an app labelled as spam in L. We assumed that the top apps with respect to the number of downloads, number of user reviews, and rating are quite likely to be non-spam. For example, according to this ranking, the number one ranked app was Facebook, number 100 was Evernote (https://evernote.com), number 500 was Tictoc (http://www.tictoc.net), and number 1000 was Saavn (http://www.saavn.com). Thus, for non-spam apps, we selected the top k times the number of labelled spam apps (551) from the set O, excluding all removed apps, after ranking them according to total number of downloads, total number of user reviews, and average rating. We varied k logarithmically between 1 and 32 (i.e., 1x, 2x, 4x, ..., 32x) to obtain six datasets of non-spam apps. At larger k values it is possible that spam apps are considered to be non-spam, as discussed later in this section.
As mentioned in Section 4.1, checkpoints S8 and S9 are not used to develop features, as we intend to enable automated spam detection at the time of developer submission.

5.1 Checkpoint S1 - Does the app description describe the app function clearly and concisely?
We automate this heuristic by identifying word-bigrams and word-trigrams that are used to describe a functionality and are popular among either spam or non-spam apps.
First, we manually read the descriptions of the top 100 non-spam apps and identified 100 word-bigrams and word-trigrams that describe app functionality, such as “this app”, “allows you”, and “app for”. The full list of these bigrams and trigrams can be found in Appendix 9.9. Figure 4 shows the cumulative distribution function (CDF) of the number of these bigrams and trigrams observed in the app descriptions of each dataset. There is a high probability of spam apps not having these features in their descriptions. For example, 50% of the spam apps had one or more occurrences of these bigrams and trigrams, whereas 80% of the top-1x apps had one or more. The difference in the distribution of these bigrams and trigrams between spam and non-spam apps decreases as more and more lower-ranked apps are considered in the non-spam category. This finding is consistent across several of the features we discuss in subsequent sections, and we believe it indicates that lower-ranked apps can potentially include spam apps that are as yet not identified by Google.
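This feature reduces to counting phrase occurrences. A sketch is shown below; the phrase set is an illustrative stand-in for the 100 manually identified bigrams and trigrams (listed in Appendix 9.9), not the actual list.

    import re

    # Illustrative stand-ins for the 100 manually identified phrases.
    FUNCTIONAL_NGRAMS = ("this app", "allows you", "app for")

    def functional_ngram_count(description):
        """Count occurrences of functionality-describing bigrams/trigrams."""
        text = " ".join(re.findall(r"[a-z']+", description.lower()))
        return sum(text.count(phrase) for phrase in FUNCTIONAL_NGRAMS)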
Second, as manual identification of bigrams and trigrams is not exhaustive, we also identified the word-bigrams and word-trigrams that appear in at least 10% of the spam apps and at least 10% of the non-spam apps. We used each such word-bigram and word-trigram as a feature, with its frequency of occurrence as the feature value. The full list of these bigrams and trigrams can be found in Appendix 9.11. Figure 6 visualises the top-50 bigrams (ranked according to information gain); the size of each bigram is proportional to the logarithm of its normalised information gain. Note that bigrams such as “live wallpaper”, “wallpaper on”, and “some ads” are mostly present in spam, whereas bigrams such as “follow us”, “to play”, and “on twitter” are popular in non-spam.
Figure 6: Top-50 bigrams differentiating Spam and Top-1x
5.2 Checkpoint S2 - Does the app description contain too much detail, incoherent text, or unrelated text?
We map this checkpoint to a set of writing-style (“stylometry”) features, anticipating that spam and non-spam apps might have different writing styles. We used the 16 features listed in Table 6 for this checkpoint; Appendix 9.12 provides further details on these features.

Table 6: Features associated with Checkpoint S2
1. Total number of characters in the description
2. Total number of words in the description
3. Total number of sentences in the description
4. Average word length
5. Average sentence length
6. Percentage of upper case characters
7. Percentage of punctuation characters
8. Percentage of numeric characters
9. Percentage of non-alphabet characters
10. Percentage of common English words [10]
11. Percentage of personal pronouns [10]
12. Percentage of emotional words [44]
13. Percentage of misspelled words [61]
14. Percentage of words with alphabetic and numeric characters
15. Automatic readability index (AR) [58]
16. Flesch readability score (FR) [19]

For characterisation, we select a subset of features using greedy forward feature selection in the wrapper method [36]. We use a decision tree classifier with maximum depth 10, and the feature subset is selected such that the classifier optimises the asymmetric F-measure $F_\beta = (1+\beta^2)\cdot\frac{\mathrm{precision}\cdot\mathrm{recall}}{\beta^2\cdot\mathrm{precision}+\mathrm{recall}}$, with $\beta = 0.5$. This metric was used since, in spam app detection, precision is more important than recall [59]. This process identified the total number of words in the description, the total number of sentences in the description, the percentage of numeric characters, the percentage of non-alphabet characters, and the percentage of common English words as the most discriminative features.
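A compact sketch of this selection procedure is given below, using scikit-learn; the cross-validated scoring and the specific API calls are our assumptions, since the paper specifies only the wrapper method, the depth-10 decision tree, and the F0.5 objective. X is assumed to be a pandas DataFrame holding the Table 6 features and y the spam/non-spam labels.

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import fbeta_score

    def forward_select(X, y, beta=0.5):
        """Greedy forward feature selection maximising F_beta (wrapper method)."""
        selected, best_score = [], 0.0
        improved = True
        while improved:
            improved, best_feature = False, None
            for feature in X.columns.difference(selected):
                clf = DecisionTreeClassifier(max_depth=10, random_state=0)
                pred = cross_val_predict(clf, X[selected + [feature]], y, cv=5)
                score = fbeta_score(y, pred, beta=beta)
                if score > best_score:
                    best_score, best_feature, improved = score, feature, True
            if improved:
                selected.append(best_feature)
        return selected, best_score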
Figure 4: CDF of the frequency of manually identified word-bigrams and word-trigrams
Figure 5: Example features associated with Checkpoint S2 ((a) word count; (b) percentage of common English words)

Figure 5a shows the CDF of the total number of words in the app description. Spam apps tend to have less wordy app descriptions than non-spam apps. For instance, nearly 30% of the spam apps have fewer than 100 words, whereas only approximately 10% of the top-1x and top-2x apps have fewer than 100 words. As we inject more and more apps of lower popularity, the difference diminishes. Figure 5b presents the CDF of the percentage of common English words [10] in the app description. It illustrates that spam apps typically use fewer common English words compared to non-spam apps.
5.3 Checkpoint S3 - Does the app description contain a noticeable repetition of words or keywords?
We quantify this checkpoint using the vocabulary richness metric, $VR = \frac{\text{Number of unique words}}{\text{Total number of words}}$. Figure 7 shows the CDF of the VR measure. We expected spam apps to have low VR due to repetition of keywords. However, we observe this only in a limited number of cases. According to Figure 7, if VR is less than 0.3, an app is only marginally more likely to be spam. Perhaps the most surprising finding is that apps with VR close to 1 are more likely to be spam: 10% of the spam apps had VR over 0.9, and none of the non-spam apps had such high VR values. When we manually checked these spam apps, we found that they had terse app descriptions. We showed earlier, under checkpoint S2, that apps with shorter descriptions are more likely to be spam.
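The VR computation itself is a one-liner; a minimal sketch:

    import re

    def vocabulary_richness(description):
        """VR = number of unique words / total number of words (0 if empty)."""
        words = re.findall(r"[a-z']+", description.lower())
        return len(set(words)) / len(words) if words else 0.0

    # A terse description scores close to 1, which is itself a spam
    # signal here: vocabulary_richness("best hd car wallpapers") == 1.0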
5.4 Checkpoint S4 - Does the app description contain unrelated keywords or references?
Inserting unrelated keywords so that an app appears in popular search results is a common spamming technique. Since the topics of apps can vary significantly, it is difficult to determine whether the terms included in a description are actually related to the functionality. In this study, we focused on a specific strategy developers might use: mentioning the names of popular applications in the hope of appearing in the search results for those applications. For each app, we calculated the number of mentions of the top-100 popular app names in the app's description.
Figure 8a shows the distribution of the number of times the names of top-100 apps appear in the description. Somewhat surprisingly, we found that only 20% of the spam apps had more than one mention of popular apps, whereas 40%–60% of the top-kx apps had more than a single mention. On investigation, we find that many top-kx apps have social networking interfaces and fan pages, and as a result they mention the names of popular social networking apps. This is evident in Figure 6 as well, where Twitter and Facebook are mentioned mostly in non-spam apps. To summarise, the presence of social sharing features or the availability of fan pages can be used to identify non-spam apps.
Next, we consider the sum of the tf-idf weights of the top-100 app names. This feature attempts to reduce the effect of highly popular apps. For example, Facebook can have a high frequency because of its sharing capability, and many apps might refer to it. However, if an app refers to another app that is not as popular as Facebook, such as a reference to popular games, it indicates a likelihood of the app being spam. To discount the effect of commonly occurring app names, we used the sum of tf-idf weights as a feature. For a given dataset, let $tf_{ik}$, $i = 1{:}100$, be the number of times the name $n_i$ of the $i$th app in the top-100 apps appears in the description $a_k$ of app $k$. We define the inverse document frequency of the $i$th app as $IDF_i = \log\left(\frac{N}{1+|\{k : n_i \in a_k\}|}\right)$, where $N$ is the number of apps in the dataset. The feature for app $k$ is then $tfidf(k) = \sum_{i=1}^{100} tf_{ik} \cdot IDF_i$.
The calculation above is dataset dependent. Nonetheless, as Figure 8b shows for the top-1x dataset, the results still indicate that if popular app names are found in the app description, the app tends to be non-spam rather than spam. We repeated the same analysis with the top-1000 app names and found that the results were similar.
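A sketch of this computation, following the definitions above, is shown below; matching an app name by naive substring search is a simplification of whatever name matching the study used.

    import math

    def name_tfidf(descriptions, top_names):
        """Sum of tf-idf weights of popular app names per app description.

        descriptions: dict mapping app id -> lowercased description text
        top_names:    iterable of lowercased popular app names (e.g. top-100)
        """
        N = len(descriptions)
        # Document frequency: in how many descriptions does each name occur?
        df = {n: sum(n in d for d in descriptions.values()) for n in top_names}
        idf = {n: math.log(N / (1 + df[n])) for n in top_names}
        return {app: sum(d.count(n) * idf[n] for n in top_names)
                for app, d in descriptions.items()}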
5.5 Checkpoint S5 - Does the app description contain excessive references to other applications from the same developer?
We use the number of times a developer's other app names appear in the description as the feature corresponding to this checkpoint. However, the cases marked by the reviewers as matching checkpoint S5 rarely satisfied this feature, because the descriptions contained links to the applications rather than the app names; only 10 spam apps satisfied this feature (cf. Table 5). We therefore do not use checkpoint S5 in our classifier.
5.6 Checkpoint S6 - Does the developer have multiple apps with approximately the same description?
For this checkpoint, for each app we considered the following features: (i) the total number of other apps the developer has; (ii) the total number of apps with an English language description, which can be used to measure description similarity; and (iii) the number of other apps from the same developer having a description cosine similarity (s) of over 60%, 70%, 80%, and 90%. Further details of these features can be found in Appendix 9.13. We observe that the features based on the similarity between app descriptions are the most discriminative. As examples, Figures 9a and 9b respectively show the CDF of the number of apps with s over 60% and s over 90% by the same developer.

Figure 7: Vocabulary richness
Figure 8: Mentioning popular app names ((a) frequency of top-100 app names; (b) tf-idf of top-100 app names, 1x dataset)
Figure 9: Similarity with developer's other apps ((a) over 60% similarity; (b) over 90% similarity)

Figure 9a shows that only about 10%–15% of the non-spam apps have more than 5 other apps from the same developer with over 60% description similarity. However, approximately 27% of the spam apps have more than 5 apps with over 60% description similarity. This difference becomes more evident when the number of apps from the same developer with over 90% description similarity is considered, indicating that spam apps tend to have multiple clones with similar app descriptions.
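A sketch of the similarity counting follows. The paper specifies cosine similarity over descriptions; representing descriptions as TF-IDF vectors is our choice, and the scikit-learn helpers and threshold value are illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def similar_sibling_counts(developer_descriptions, threshold=0.6):
        """For each app of one developer, count sibling apps whose
        description cosine similarity exceeds the given threshold."""
        vectors = TfidfVectorizer().fit_transform(developer_descriptions)
        sim = cosine_similarity(vectors)
        n = len(developer_descriptions)
        return [sum(1 for j in range(n) if j != i and sim[i, j] > threshold)
                for i in range(n)]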
Table 7: Features associated with Checkpoint S7
1. Number of characters
2. Number of words
3. Average word length
4. Percentage of non-letter characters to total characters
5. Percentage of upper case characters to total letter characters
6. Presence of parts of app name in appid
7. Percentage of bigrams with 1 non-letter to total bigrams
8. Percentage of bigrams with 2 non-letters to total bigrams
9. Percentage of bigrams with 1 or 2 non-letters to total bigrams
10. Percentage of trigrams with 1 non-letter to total trigrams
11. Percentage of trigrams with 2 non-letters to total trigrams
12. Percentage of trigrams with 3 non-letters to total trigrams
13. Percentage of trigrams with 1, 2 or 3 non-letters to total trigrams
5.7 Checkpoint S7 - Does the app identifier (appid) make sense and have some relevance to the functionality of the application, or does it appear to be auto generated?
Table 7 shows the features corresponding to this checkpoint; these are derived from the appid. Further details about these features can be found in Appendix 9.14. We applied the feature selection method described in Section 5.2 to identify the most discriminative features: the number of words, the average word length, and the percentage of bigrams with 2 non-letters.
In Figure 10a we show the CDF of the number of words in the appid. Spam apps tend to have more words compared to non-spam apps; for example, 15% of the spam apps had more than 5 words in the appid, whereas only 5% of the non-spam apps did. Figure 10b shows the CDF of the average word length in the appid. For 10% of the spam apps the average word length is higher than 10, while this holds for only 2%–3% of the non-spam apps.
Figure 10c shows the percentage of non-letter bigrams (e.g., “01”, “8 ”) among all character bigrams that can be generated from the appid. None of the non-spam apps had more than 20% non-letter bigrams in the appid, whereas about 5% of the spam apps did. Therefore, if an appid contains over 20% non-letter bigrams out of all possible bigrams that can be generated, the app is more likely to be spam than non-spam.

Figure 10: Example features associated with Checkpoint S7 ((a) number of words in appid; (b) average word length in appid; (c) percentage of bigrams with 2 non-letters)
5.8 Other Metadata
In addition to the features derived from the checkpoints, we added the metadata-related features listed in Table 8. Further details about these features can be found in Appendix 9.15.

Table 8: Features associated with other app metadata
1. App category
2. Price
3. Length of app name
4. Size of the app (KB)
5. Developer's website available
6. Developer's website reachable
7. Developer's email available
8. Privacy policy available

Figure 11 shows the app category distribution in the spam and top-1x datasets. We note that approximately 42% of the spam apps belong to the categories Personalisation and Entertainment. We found a negligibly small number of spam apps in the categories Communication, Photography, Racing, Social, Travel, and Weather. Moreover, for the categories Arcade, Tools, and Casual, the percentage of spam is significantly smaller than the percentage of non-spam. Qualitatively similar observations hold for the other top-kx sets.

Figure 11: App category distribution (spam vs. top-1x)

Figure 12a shows the CDF of the length of the app name. Spam apps tend to have longer names: 40% of the spam apps had more than 25 characters in the app name, whereas only 20% of the non-spam apps did. Figure 12b shows the CDF of the size of the app in kilobytes. Spam app sizes are concentrated in specific ranges. Very small apps are more likely to be non-spam: 30% of the top-kx apps were less than 100KB in size, whereas the corresponding percentage of spam apps is almost zero. Almost all spam apps were less than 30MB in size, whereas 10%–15% of the top-kx apps were larger than 30MB. The CDF of the spam apps shows a sharp rise between 1000KB and 3000KB, indicating that many spam apps fall in that size range.

Figure 12: Features associated with other app metadata ((a) length of app name in characters; (b) size in KB)

Table 9 shows the external-information-related features. For example, if a link to a developer website or a privacy policy is given, the app is more likely to be non-spam.

Table 9: Availability of developer's external information
Feature                    Spam   top-1x  top-2x  top-4x  top-8x  top-16x  top-32x
Website available          57%    93%     94%     93%     91%     89%      86%
Website reachable          93%    98%     97%     97%     96%     96%      95%
Email available            99%    84%     89%     91%     93%     94%      95%
Privacy policy available   9%     56%     50%     48%     38%     32%      26%

6. EVALUATION

6.1 Classifier Performance
We build a binary classifier based on the Adaptive Boost
algorithm [20] to detect whether or not a given app is spam.
We use the spam apps from L as positives and apps from
a top-kx set as negatives. Each app is represented by a
vector of the features studied in Section 5. The classifier
is trained using 80% of the data and the remaining 20% of
the data is used for testing. We repeat this experiment for
k = 1, 2, 4, · · · , 32.
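A minimal scikit-learn sketch of this setup is shown below; the paper specifies the algorithm, the weak-learner depth, the number of iterations, and the 80/20 split, while the library calls and random seed are our assumptions.

    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split

    def train_spam_classifier(X, y):
        """AdaBoost over depth-5 decision trees, 500 boosting rounds."""
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, stratify=y, random_state=0)
        clf = AdaBoostClassifier(
            # scikit-learn >= 1.2; older versions name this base_estimator
            estimator=DecisionTreeClassifier(max_depth=5),
            n_estimators=500)
        clf.fit(X_train, y_train)
        return clf, clf.score(X_test, y_test)  # accuracy on the held-out 20%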
Some of the features discussed in Section 5 depend on
the other apps in the dataset rather than on the metadata
of the individual apps. For such cases we calculate the features based only on the training set to avoid any information
propagation from the training to the testing set that would
artificially inflate the performance of the classifier. For example, when extracting the bigrams (and trigrams) that are
popular in at least 10% of the apps, only spam and non-spam apps from the training set are used. This vocabulary
is then used to generate feature values for individual apps
in both training and testing sets.
Decision trees [8] with a maximum depth of 5 are used as the weak classifiers in the Adaptive Boost classifier (the Adaptive Boost algorithm combines the outputs of a set of weak classifiers to build a strong classifier; interested readers may refer to [20]), and we performed 500 iterations of the algorithm. Table 10 summarises the results. Our classifiers, across the values of k, have precision over 85%, with recall varying between 38%–98%. Notably, when k is small (e.g., when the total number of non-spam apps is ≤ 2x the number of spam apps), the classifier achieves an accuracy as high as 95%.
Recall drops as we increase the number of negative examples, because larger k values imply the inclusion of lower-ranked apps as negative (non-spam) examples. Some of these lower-ranked apps, however, exhibit some spam features, and some may indeed be spam that has not yet been detected. As we have only a small number of spam apps in the training set, when unlabelled spam apps are added as non-spam, some spam-related features become non-spam features, because the number of apps satisfying those features becomes high among non-spam. As a result, recall drops when k increases.
As the number of negative examples increases, the classifier becomes more conservative and correctly identifies a relatively small portion of spam apps. Nonetheless, even at k = 32 we achieve a precision over 85%. This is particularly helpful in spam detection, as marking a non-spam app as spam can be more costly than missing a spam app.
Additionally, if the objective is to build an aggressive classifier that identifies as much spam as possible, so that app market operators can flag these apps and make a decision later, a classifier built using a smaller number of negative examples (i.e., k = 1 or k = 2) can be used. For example, in Table 11 we show the k = 2 classifier's performance on the higher-order datasets. As can be seen, this classifier identifies nearly 90% of the spam. However, the precision goes down as we traverse down the app ranking because of the increased number of false positives. Nonetheless, in reality some of these apps may actually be spam, and as a result may not truly be false positives.
Table 10: Classifier performance
k    Precision  Recall  Accuracy  F0.5
1    0.9310     0.9818  0.9545    0.9408
2    0.9533     0.9273  0.9606    0.9480
4    0.9126     0.8545  0.9545    0.9004
8    0.9405     0.7182  0.9636    0.8857
16   0.8833     0.4818  0.9658    0.7571
32   0.8571     0.3818  0.9793    0.6863
Table 11: Classifier performance of the k = 2 model
k    Precision  Recall  Accuracy  F0.5
4    0.8080     0.9182  0.9400    0.8279
8    0.5549     0.9182  0.9091    0.6026
16   0.2730     0.9182  0.8513    0.3176
32   0.1164     0.9182  0.7862    0.1410

6.2 Applying the Classifier in the Wild
To get an estimate of the volume of potential spam that
might be in Google Play Store, we used two of our classifiers to predict whether or not an app in our dataset was
spam. We selected a conservative classifier (k = 32) and
an aggressive classifier (k = 2) to obtain a lower and upper
bound.
We made this prediction only for the apps that were not used for training during the classifier evaluation phase, to avoid the classifier artificially reporting a higher amount of spam. Table 12 shows the results. C1, C2, and C3 are the three sets of removed apps identified as described in Section 3.1, and Others are the apps in O that were neither removed nor belong to the top-32x set. The Others apps thus represent average apps in Google Play Store: they were in the market for at least six months, and were still there when we stopped monitoring. According to the results, the more aggressive classifier (k = 2) predicted around 70% of the removed apps and 55% of the other apps to be spam. The conservative classifier (k = 32) predicted 6%–12% of the removed apps and approximately 2.7% of the other apps as spam; the latter can be considered a lower bound for the percentage of spam apps currently available in Google Play Store.
Table 12: Predictions on spam apps in Google Play Store
Dataset        Size     k = 2    k = 32
Crawl 1 (C1)   6,566    70.37%   12.89%
Crawl 2 (C2)   9,184    73.14%   6.57%
Crawl 3 (C3)   18,897   72.99%   6.49%
Others         180,627  54.59%   2.69%

7. CONCLUDING REMARKS
In this paper, we propose a classifier for automated detection of spam apps at the time of app submission. Our app
classifier utilises only those features that can be derived from
an app’s metadata available during the publication approval
process. It does not require any human intervention such as
manual inspection of the metadata or manual testing of the
app. We validate our app classifier by applying it to a large
dataset of apps collected between December 2013 and May
2014, by crawling and identifying apps that were removed
from the Google Play Store. Our results show that it is possible to automate the process of detecting spam apps solely
based on apps’ metadata available at the time of publication
and achieve both high precision and recall. In the remainder
of this section, we discuss two challenges that can be fruitful
avenues for future work.
The manual labelling challenge. In this work, we
used a relatively small set (1500) of labelled apps and used the top apps from the marketplace as proxies for non-spam apps. Our choices were dictated largely by the time-consuming nature of the labelling process. For this study, it took on average 20 working days for each of our reviewers to label all 1500 apps. Obviously, the performance of
the classifier can be improved by increasing the number of
labelled spam apps and further manually labelling non-spam
apps through an extension of our manual effort.
One approach is to rely on crowdsourcing and recruit app-savvy users as reviewers. This is not a trivial task, as it requires a major effort in providing individualised guidelines and interaction with the reviewers, and the need to deal with assessment inconsistencies due to variable technical expertise. Nonetheless, from an app market provider's perspective,
this is a plausible option to consider. A framework that can
quickly assess whether an app is spam when developers submit apps for approval will enable faster approvals whilst reducing the number of spam apps in the market.
Alternatively, hybrid schemes can be developed where
apps are flagged during the approval process and removed
based on a limited number of customer complaints. Another potential direction is to consider a mix of supervised
and unsupervised learning, known as semi-supervised learning [4]. The basic idea behind semi-supervised learning is
to use “unlabelled data” to improve the classification model
that was previously built using only labelled data. Semi-supervised learning has previously been successfully applied to real-time traffic classification problems [17].
The classification arms race. It is possible that spammers adapt to the spam app detection framework and change their strategies according to our selected features to avoid detection. The relevant questions in this context are: i) how frequently should the classifier be retrained? and ii) how can we detect when retraining is required? We believe that spam app developers will find it challenging, and will incur significant cost overheads, to adapt their apps to avoid the features that allow discriminating between spam and non-spam apps. For instance, to avoid similarity among the descriptions of multiple apps spanning different categories, the spammer has to edit the different descriptions and customise each app description to contain sufficient details and coherent text. An important direction for future work is to study the longevity of our classifier. Additionally, we plan to investigate how the process of identifying retraining requirements can be automated. Prior work on traffic classification suggests that automated identification of retraining points is possible using semi-supervised learning approaches [17].
8. REFERENCES
[1] Language Detection API. http://detectlanguage.com,
2013.
[2] I. Androutsopoulos, G. Paliouras, V. Karkaletsis,
G. Sakkis, C. D. Spyropoulos, and P. Stamatopoulos.
Learning to filter spam e-mail: A comparison of a
naive bayesian and a memory-based approach. arXiv
preprint cs/0009009, 2000.
[3] H. B. Aradhye, G. K. Myers, and J. A. Herson. Image
analysis for efficient categorization of image-based
spam e-mail. In Document Analysis and Recognition,
2005. Proceedings. Eighth International Conference
on, pages 914–918. IEEE, 2005.
[4] S. Basu, M. Bilenko, and R. J. Mooney. A
probabilistic framework for semi-supervised clustering.
In Proc. of KDD, KDD ’04, pages 59–68, 2004.
[5] F. Benevenuto, G. Magno, T. Rodrigues, and
V. Almeida. Detecting spammers on twitter. In
Collaboration, electronic messaging, anti-abuse and
spam conference (CEAS), volume 6, page 12, 2010.
[6] E. Blanzieri and A. Bryl. Evaluation of the highest
probability svm nearest neighbor classifier with
variable relative error cost. 2007.
[7] E. Blanzieri and A. Bryl. A survey of learning-based
techniques of email spam filtering. Artificial
Intelligence Review, 29(1):63–92, 2008.
[8] L. Breiman, J. Friedman, C. J. Stone, and R. A.
Olshen. Classification and regression trees. CRC press,
1984.
[9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani.
Crowdroid: behavior-based malware detection system
for android. In Proceedings of the 1st ACM workshop
on Security and privacy in smartphones and mobile
devices, pages 15–26. ACM, 2011.
[10] O. Canales, V. Monaco, T. Murphy, E. Zych,
J. Stewart, C. Tappert, A. Castro, O. Sotoye,
L. Torres, and G. Truley. A stylometry system for
authenticating students taking online tests. Proc.
Student-Faculty CSIS Research Day, 2011.
[11] C. Castillo, D. Donato, A. Gionis, V. Murdock, and
F. Silvestri. Know your neighbors: Web spam
detection using the web topology. In Proceedings of the
30th annual international ACM SIGIR conference on
Research and development in information retrieval,
pages 423–430. ACM, 2007.
[12] R. Chandy and H. Gu. Identifying spam in the ios app
store. In Proceedings of the 2nd Joint
WICOW/AIRWeb Workshop on Web Quality, pages
56–59. ACM, 2012.
[13] P.-A. Chirita, J. Diederich, and W. Nejdl. Mailrank:
using ranking for spam detection. In Proceedings of the
14th ACM international conference on Information
and knowledge management, pages 373–380. ACM,
2005.
[14] G. V. Cormack, J. M. Gómez Hidalgo, and E. P. Sánz.
Spam filtering for short messages. In Proceedings of
the sixteenth ACM conference on Conference on
information and knowledge management, pages
313–320. ACM, 2007.
[15] H. Drucker, S. Wu, and V. N. Vapnik. Support vector
machines for spam categorization. Neural Networks,
IEEE Transactions on, 10(5):1048–1054, 1999.
[16] M. Erdélyi, A. Garzó, and A. A. Benczúr. Web spam
classification: a few features worth more. In
Proceedings of the 2011 Joint WICOW/AIRWeb
Workshop on Web Quality, pages 27–34. ACM, 2011.
[17] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and
C. Williamson. Offline/realtime traffic classification
using semi-supervised learning. Perform. Eval.,
64(9-12):1194–1213, Oct. 2007.
[18] D. Fetterly, M. Manasse, and M. Najork. Spam, damn
spam, and statistics: Using statistical analysis to
locate spam web pages. In Proceedings of the 7th
International Workshop on the Web and Databases:
colocated with ACM SIGMOD/PODS 2004, pages 1–6.
ACM, 2004.
[19] R. Flesch. A new readability yardstick. Journal of
applied psychology, 32(3):221, 1948.
[20] Y. Freund, R. E. Schapire, et al. Experiments with a
new boosting algorithm. In ICML, volume 96, pages
148–156, 1996.
[21] J. M. Gómez Hidalgo, G. C. Bringas, E. P. Sánz, and
F. C. Garcı́a. Content based sms spam filtering. In
Proceedings of the 2006 ACM symposium on
Document engineering, pages 107–114. ACM, 2006.
[22] A. Gorla, I. Tavecchia, F. Gross, and A. Zeller.
Checking app behavior against app descriptions. In
ICSE, pages 1025–1035, 2014.
[23] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang.
Riskranker: scalable and accurate zero-day android
malware detection. In Proceedings of the 10th
international conference on Mobile systems,
applications, and services, pages 281–294. ACM, 2012.
[24] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen.
Combating web spam with trustrank. In Proceedings
of the Thirtieth international conference on Very large
data bases-Volume 30, pages 576–587. VLDB
Endowment, 2004.
[25] Apple. Inc. App store review guidelines.
https://developer.apple.com/app-store/review/
guidelines/, 2014.
[26] Apple. Inc. Common app rejections.
https://developer.apple.com/app-store/review/
rejections/, 2014.
[27] AVG Threat Labs. Inc. Website safety ratings and
reputation. http://www.avgthreatlabs.com/
website-safety-reports/app, 2014.
[28] Google. Inc. Ads. http://developer.android.com/
distribute/googleplay/policies/ads.html, 2014.
[29] Google. Inc. Google play developer program policies.
https://play.google.com/about/
developer-content-policy.html, 2014.
[30] Google. Inc. Intellectual property.
http://developer.android.com/distribute/
googleplay/policies/ip.html, 2014.
[31] Google. Inc. Rating your application content for
Google Play. https://support.google.com/
googleplay/android-developer/answer/188189,
2014.
[32] Google. Inc. Spam. http:
//developer.android.com/distribute/googleplay/,
2014.
[33] WiseGeek. Inc. Why do spam emails have a lot of
misspellings? http://www.wisegeek.com/
why-do-spam-emails-have-a-lot-of-misspellings.
htm, 2014.
[34] N. Jindal and B. Liu. Review spam detection. In
Proceedings of the 16th international conference on
World Wide Web, pages 1189–1190. ACM, 2007.
[35] N. Jindal and B. Liu. Opinion spam and analysis. In Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 219–230. ACM, 2008.
[36] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1):273–324, 1997.
[37] V. Krishnan and R. Raj. Web spam detection with anti-trust rank. In AIRWeb, volume 6, pages 37–40, 2006.
[38] B. Leiba, J. Ossher, V. Rajan, R. Segal, and M. N. Wegman. SMTP path analysis. In CEAS, 2005.
[39] K. Li and Z. Zhong. Fast statistical spam filter by approximate classifications. In ACM SIGMETRICS Performance Evaluation Review, volume 34, pages 347–358. ACM, 2006.
[40] AppBrain Inc. AppBrain stats. http://www.appbrain.com/stats, 2014.
[41] Statista Inc. Number of newly developed applications/games submitted for release to the iTunes App Store from January 2012 to September 2013. http://www.statista.com, 2013.
[42] V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes - which naive Bayes? In CEAS, pages 27–28, 2006.
[43] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In AIRWeb, volume 5, pages 1–6, 2005.
[44] A. Mukherjee and B. Liu. Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics, 2010.
[45] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web, pages 83–92. ACM, 2006.
[46] J. Oberheide and C. Miller. Dissecting the Android bouncer. SummerCon 2012, New York, 2012.
[47] P. Oscar and V. Roychowdbury. Leveraging social networks to fight spam. IEEE Computer, 38(4):61–68, 2005.
[48] P. Pantel, D. Lin, et al. SpamCop: A spam classification & organization program. In Proceedings of AAAI-98 Workshop on Learning for Text Categorization, pages 95–98, 1998.
[49] H. Peng, C. Gates, B. Sarma, N. Li, Y. Qi, R. Potharaju, C. Nita-Rotaru, and I. Molloy. Using probabilistic generative models for ranking risks of Android apps. In Proceedings of the 2012 ACM Conference on Computer and Communications Security, pages 241–252. ACM, 2012.
[50] S. Perez. Developer spams Google Play with ripoffs of well-known apps again. http://techcrunch.com, 2013.
[51] S. Perez. Nearly 60K low-quality apps booted from Google Play Store in February, points to increased spam-fighting. http://techcrunch.com, 2013.
[52] S. Perez. iTunes App Store now has 1.2 million apps, has seen 75 billion downloads to date. http://techcrunch.com, 2014.
[53] D. Rowinski. Apple iOS App Store adding 20,000 apps a month, hits 40 billion downloads. http://readwrite.com, 2013.
[54] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 Workshop, volume 62, pages 98–105, 1998.
[55] D. Sculley and G. M. Wachman. Relaxed online SVMs for spam filtering. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 415–422. ACM, 2007.
[56] S. Seneviratne, A. Seneviratne, D. Kaafar,
A. Mahanti, and P. Mohapatra. Why my app got
deleted: Detection of spam mobile apps. Technical
report, NICTA, Australia, November 2013.
[57] S. Seneviratne, A. Seneviratne, P. Mohapatra, and
A. Mahanti. Predicting user traits from a snapshot of
apps installed on a smartphone. ACM SIGMOBILE
Mobile Computing and Communications Review,
18(2):1–8, 2014.
[58] E. Smith and R. Senter. Automated readability index.
AMRL-TR. Aerospace Medical Research Laboratories
(6570th), page 1, 1967.
[59] I. Soboroff, I. Ounis, and J. Lin. Overview of the TREC-2012 microblog track. In Proceedings of the Twenty-First Text REtrieval Conference (TREC 2012), 2012.
[60] A. H. Wang. Don’t follow me: Spam detection in
twitter. In Security and Cryptography (SECRYPT),
Proceedings of the 2010 International Conference on,
pages 1–10. IEEE, 2010.
[61] Wikipedia. Wikipedia:lists of common misspellings.
http://en.wikipedia.org/wiki/, 2014.
[62] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P.
Wu. Droidmat: Android malware detection through
manifest and API calls tracing. In Information
Security (Asia JCIS), 2012 Seventh Asia Joint
Conference on, pages 62–69. IEEE, 2012.
[63] W. Zhou, Y. Zhou, X. Jiang, and P. Ning. Detecting
repackaged smartphone applications in third-party
android marketplaces. In Proceedings of the second
ACM conference on Data and Application Security
and Privacy, pages 317–326. ACM, 2012.
[64] Y. Zhou, Z. Wang, W. Zhou, and X. Jiang. Hey, you,
get off of my market: Detecting malicious apps in
official and alternative android markets. In NDSS,
2012.
9. APPENDIX
9.1 Collected Metadata
Table 13 shows all the metadata our crawler collected from Google Play Store in relation to apps.
9.2 Unofficial Content Apps
Table 14 shows the heuristics used to identify unofficial content apps. These were developed based on Google's content policy [29] and intellectual property policy [30].
9.3 Copyrighted Content Apps
Table 15 shows the heuristics used to identify copyrighted content apps. These were developed based on Google's content policy [29] and intellectual property policy [30].
9.4 Problematic Content Apps
Heuristics used to identify problematic or illegal content apps are shown in Table 16. These were developed based on Google's content policy [29].
9.5 Counterfeit Apps
Table 17 summarises the checkpoints used to identify Android Counterfeits and Other Counterfeits. This was more challenging, as the reviewers had to manually search the Google Play Store and the Google search engine with partial or full app names and descriptions. The retrieved top results (links/images) were then checked to see whether they lead to other apps (in Google Play Store or other app markets). The reviewers also verified whether those other apps possibly come from the same developer, as some developers publish apps in multiple app markets.
Table 14: Checklist used for manual labelling of unofficial content apps

Attribute | ID | Description and Examples

Description | U1 | Does the app description mention the app as an unofficial app?
E.g. The Verge Notifications - App description says the app is unofficial:
".... Unofficial The Verge Notifications application. Notifies you of new replies or recommendations to your The Verge comments. ...."
E.g. Flash Download [unofficial] - App description says the app is unofficial:
".... Unofficial Flash Install Supporter will able you to install Latest Flash Player from official site. ...."
U2 | Does the app description mention that it provides unofficial interfaces to popular websites or content?
E.g. Shortcut to Hotmail - App description says it connects to Hotmail. However, when the developer details are checked, it is not officially from Hotmail.
".... Access Hotmail (currently Outlook) from this convenient shortcut, from your smartphone or table. Do not waste time ...."
E.g. Pin Search For BBM - App description says it connects to Blackberry Messenger without any affiliation with Blackberry.
".... Pin Search BBM app is for all user Blackberry Messenger (BBM) , You can find here BBM pin and make new friends. Share BlackBerry PIN's world wide conveniently on iPhone, Android and BlackBerry phone devices which support the BBM app. Disclaimer :PLEASE NOTE , This app is not affiliated with Blackberry Limited ...."
U3 | Does the app description mention popular brand names without any affiliation? I.e., the app developer is different from the official app developer of that brand, and the official website does not mention an Android application.
E.g. Reebok Wall Ball - Unofficial use of the brand name Reebok.
".... Reebok Realflex Wall Ball. Flex your shoe to shoot medicine balls at targets on the wall. Every level ...."
U4 | When the app delivers content from other sources, does it mention that it has the right to do so?
E.g. Radio City Canada - The app description does not mention whether the app is unofficial or has the right to deliver this content. When the official Radio City website is checked, there is no reference to an Android app.
".... Listen to CBC Radio across Canada. Keywords: Radio, Canada, CBC, Radio-Canada, Radio, TV, MP3, webradios, webradio. ...."
Name | U5 | Does the app name mention the app being unofficial?
E.g. Flash Download [unofficial], American Express (UnOfficial), Khan Academy (Unofficial), Pitbull Unofficial Fan App - App names mention the app being unofficial.
Website | U6 | Does the developer's website mention the app being unofficial?
E.g. STC Collect Call - The app description does not mention whether the app is unofficial. However, the developer website mentions that the app is unofficial.
App description: "Collect Call This app makes it easier to dial the collect call. Just select or enter the number you wish to call, click on dial button and then select "collect call" from the menu. Below are information about STC collect call service. With the Collect Call service from STC, the call will be charged to the receiver once he or she is notified and accepts to pay the charges. *- No settings or activation required *- You can call even when you don't have credit *- Free and has no monthly subscription fees *- Available to all Jawal postpaid and prepaid customers (Sawa, Lana) *- Works only among STC customers and not with other operators *- Does not apply while roaming for more info please visit www.STC.com.sa"
Description at developer website: "is a service provided by Saudi telecom (STC). This application is an unofficial application which I have developed for this service. Please post your comments, inquires or wishes in English"
Reviews | U7 | Do users complain about the app being unofficial?
E.g. Khan Academy (Unofficial)
"Fantastic app Being the unofficial gives me mild concern.but overall really easy interface"
Descriptions and app names were also checked, as regular expressions, against our master dataset collected from the total crawl. Reviewers marked an app as counterfeit only if they could reliably identify the original app in Google Play Store or another app market, such as the Apple App Store or the Amazon App Store, and were sure that it does not come from the same developer.
9.6 Adult Content Apps
Adult content apps were identified using the heuristics shown in Table 18. These were developed based on Google's content policy [29].
9.7 Developer Deleted Apps
Table 19 shows the heuristics used to identify developer deleted applications. These were developed based on domain knowledge of the app market ecosystem and by observing samples from the removed apps dataset.
Table 15: Checklist used for manual labelling of copyrighted content apps

Attribute | ID | Description and Examples

Description | R1 | Does the app description indicate potential copyright infringement or trademark infringement (unauthorised use of registered trademarks)?
E.g. WP8 Surface Wallpaper HD - App description indicates potential copyright infringement:
".... Windows Phone 8 & Surface Tablet Wallpaper HD Official Original From Microsoft Device. ...."
E.g. GHOST IN THE SHELL SAC+2ndGIG - App description indicates potential copyright infringement:
".... This app includes all episodes of GHOST IN THE SHELL SAC VIDEO Please ENJOY!!!. ...."
R2 | Does the app provide copyrighted material at a price?
E.g. Dungeon Keeper 2 Soundboard - According to the app description, it provides sounds from a PC game, and the app is priced at AU$7.44. The developer has no affiliation with the original content owner.
"A lot of sounds from the old but good Dungeon Keeper 2 PC game"
E.g. Pin Search For BBM - According to the app description, it provides content from a movie and a book, and the app is priced at AU$1.49. The developer has no affiliation with the original content owner.
".... MOVIE AND BOOK BY CHUCK PALAHNIUK, THIS FIGHT CLUB SOUNDBOARD BRINGS YOU YOUR FAVORITE QUOTES FROM THE MOVIE RIGHT TO YOUR PHONE WITH JUST THE TOUCH OF A FINGER! ...."
R3 | Does the app description contain disclaimers related to copyrights?
E.g. Love Stories - App description contains a disclaimer on potential copyright infringements.
".... Smart Applications does not support plagiarism in any form. Author must note that the story should not be copied from anywhere; it should be an original piece of work. If found otherwise, we reserve the right to delete it. ...."
Reviews | R4 | Do the users mention copyright infringements?
E.g. Astro Soldier War Of The Stars - Users mention copyright infringements:
".... You stole copyrighted material from TimeGate Studios. ....", "..... Your so icon is artwork from Section 8, one of their video games ....."
E.g. Harlem Shake Creator Pro - Users mention copyright infringements:
"..... Want my money back!! Payed for this app so that my kids could save and share their video, then Facebook removed it on account of copyright infringement!! ....."
Table 16: Checklist used for manual labelling of problematic/illegal apps

Attribute | ID | Description and Examples

Description | P1 | Does the app description indicate any relationship to the following:
Violence and Bullying - Depictions of gratuitous violence are not allowed. Apps should not contain materials that threaten, harass or bully other users.
Hate Speech - Apps advocating against groups of people based on their race or ethnic origin, religion, disability, gender, age, veteran status, or sexual orientation/gender identity.
Illegal Activities - Apps for unlawful activities, such as the sale of prescription drugs without a prescription.
Gambling - Apps allowing content or services that facilitate online gambling, including but not limited to, online casinos, sports betting and lotteries, or games of skill that offer prizes of cash or other value.

Reviews | P2 | Do the users mention the app in relation to any of the points in P1?
These checks were carried out similarly to the previously mentioned counterfeit search in Section 9.5, by searching in Google Play Store, the Google search engine, and our master database.
9.8 Developer Banned Apps
Developer banned apps were identified by checking whether all of the developer's other apps were deleted, by querying our master database and searching in Google Play Store.
9.9 Top 100 Apps
Table 20 shows the top 100 non-spam apps we identified when we ranked the applications in descending order of total number of downloads, total number of user reviews, and average rating.
9.10 Manually Identified Bigrams and Trigrams
Table 21 shows the manually identified bigrams and trigrams from the top 100 app descriptions which are associated with describing a functionality of an app.
Table 17: Checklist used for labelling of counterfeit apps

Attribute | ID | Description and Examples

Name | C1 | Does the app name resemble another application existing in the Android market?
E.g. Fruit Sniper Free - air.com.oligarch.us.FrootSniperFree (Counterfeit) vs. Fruit Sniper Free - com.rongbong.fruitsniperfree (Original) - App name is exactly the same.
E.g. SmsVault - PSS.SmsVault (Counterfeit) vs. SMSVault Lite - com.macrosoft.android.smsvault.light (Original) - Similar app names.
E.g. Volume Booster - Amplifier - aa.volume.booster (Counterfeit) vs. volume booster - com.feaclipse.volume.booster (Original) - Similar app names.
Description | C2 | Does the app description resemble another application existing in the Android market?
E.g. Description of Volume Booster - Amplifier (Counterfeit):
"...the volumes will be restored to their previous values. Of course you can adjust the volumes with the phone's hardware volume keys or via menu or other applications, so no worries there. Just as a tip for you,...."
E.g. Description of volume booster (Original):
"...the volumes will be restored to their previous values. Of course you can adjust the volumes with the phone's hardware volume keys or via menu or other applications, so no worries there. Just as a tip for you,...."
C3 | Does the app description itself mention the app as a clone of another application?
E.g. Rain of Blocks (zeljkoa.games.rob) is a counterfeit of Tetris (com.ea.game.tetris2011_row):
"... Rain of Blocks is a Tetris clone that will take all your free time away ....."
Logo & Screenshots | C4 | Does the app logo or the screenshots resemble another application existing in the Android market?
Reviews | C5 | Do users complain about the app being counterfeit in reviews?
E.g. School tetris (me.mikegol.stetris) is a counterfeit of Tetris (com.ea.game.tetris2011_row) - Users mention the app being a counterfeit in reviews:
"Basic but not bad Controls need work but not bad have played a lot worse Terris clones", "Didn't work good. This game is a rip-off"
Table 18: Checklist used for manual labelling of adult content apps

Attribute | ID | Description and Examples

Description | A1 | Does the app description indicate potential inclusion of pornography?
Name | A2 | Does the app name indicate potential inclusion of pornography?
Content rating | A3 | Is the content rating of the app "High Maturity"? (This condition alone was not used to decide that an app is adult content, as there are many legitimate applications with a content rating of "High Maturity"; it was rather used as assistance for borderline cases.)
Screenshots & App icons | A4 | Do the provided screenshots or the icons of the app indicate potential inclusion of pornography?
Reviews | A5 | Do the users mention the inclusion of pornography?

9.11 Automatically Identified Bigrams and Trigrams
We identified bigrams and trigrams that are present in at least 10% of the instances in the spam and top-kx sets, in order to augment the manually identified bigrams and trigrams mentioned in Section 5.1. For example, in Table 22 we show the automatically identified bigrams and trigrams using the descriptions of spam and top-1x apps. Similarly, we identified popular bigrams and trigrams for the top-kx (k = 2, 4, 8, 16, 32) sets as well; due to space limitations we do not show them here. Before identifying the bigrams and trigrams, we performed standard text pre-processing: changing to lower case, removing punctuation, removing numbers, and normalising white space.
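This selection step can be reproduced with a document-frequency cut-off. The sketch below assumes scikit-learn's CountVectorizer (the paper does not name a tool); min_df=0.1 keeps only n-grams present in at least 10% of the descriptions, and the descriptions list is a hypothetical placeholder.

# Sketch of the >= 10% bigram/trigram selection; scikit-learn is an
# assumption here, any n-gram counter would do.
import re
from sklearn.feature_extraction.text import CountVectorizer

def preprocess(text):
    text = text.lower()                       # change to lower case
    text = re.sub(r"[^\w\s]", " ", text)      # remove punctuation
    text = re.sub(r"\d+", " ", text)          # remove numbers
    return re.sub(r"\s+", " ", text).strip()  # normalise white space

descriptions = ["Play games with your friends ...",
                "Free live wallpaper for your phone ..."]   # placeholder

vectorizer = CountVectorizer(preprocessor=preprocess,
                             ngram_range=(2, 3),  # bigrams and trigrams
                             min_df=0.1,          # in >= 10% of documents
                             binary=True)         # presence, not counts
vectorizer.fit(descriptions)
print(sorted(vectorizer.get_feature_names_out()))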
9.12 Stylometric Features
This section describes the writing-style related features we defined to map heuristic checkpoint S2, as described in Section 5.2. All of the features below were generated on text converted to lower case. We used the Natural Language Toolkit (NLTK)6 in Python to generate these features; a short sketch combining several of them is given at the end of this list.
1. Total number of characters in the description:
Total number of characters in the app description text
without the white spaces.
2. Total number of words in the description: The
description text was tokenised using white spaces and
each resulting element was considered as a word.
6 www.nltk.org
Table 19: Checklist used for manual labeling of developer deleted apps
Attribute
ID
D1
Description
Developer
ID
Description and Examples
Does the app description indicates app being deprecated/discontinued or replaced
by a new application.
E.g. Barrier Plus - App description indicates discontinuation
“.... This app will be ended on 12/25/2013 please consider to install ”SmartGate” thank
you for using this app ....”
E.g. Norton Utilities task killer - App description indicates discontinuation.
“.... Symantec will discontinue selling Norton Mobile Utilities as of Wednesday, January
22, 2014. For more information ....”
D2
Is there another app with the same app name by the same developer showing an
implicit indication of the discontinuation.
E.g. App DroidGuard(DroidGuard.main) was deleted from the market
“However currently there is another app DroidGuard(com. dynacolor. droidguard. main ) with
same description. This indicates developer has replaced the old app with a new application.
....”
E.g. Smart App Protector(App Lock+) (com.sp.protector)
“Developer deleted this application and released free version of the same
(com. sp. protector. free ) with a Premium membership option.’
Table 20: Top 100 apps
Facebook
Gmail
Google Play services
Candy Crush Saga
Torch Tiny Flashlight
GO Launcher EX
LINE
Pandora internet radio
Tango Messenger
Chrome Browser Google
Shazam
Dropbox
Flipboard Your News Magazine
ChatON
Google TexttoSpeech
Despicable Me
PicsArt - Photo Studio
WeChat
SuperBright LED Torch
MX Player
Talking Tom Cat Free
Ant Smasher
Mobile Security & Antivirus
Jetpack Joyride
TuneIn Radio
Barcode Scanner
Pool Billiards Pro
Kindle
Glow Hockey
DragonFlight for Kakao
DEER HUNTER 2014
Paradise Island
Badoo Meet New People
Evernote
Google Maps
Street View
Instagram
Subway Surfers
Viber
Pou
Twitter
Drag Racing
Fruit Ninja Free
Talking Tom Cat 2 Free
Google+
Google Play Music
Voice Search
Samsung Push Service
Samsung Link (AllShare Play)
Opera Mini browser for Android
Hill Climb Racing
GO SMS Pro
Angry Birds Rio
ZEDGE
Camera360 Ultimate
Netflix
eBay
Advanced Task Killer
GO Locker
Google Earth
SoundHound
3D Bowling
Google Play Games
Words With Friends Free
Waze Social GPS Maps & Traffic
Clash of Clans
Castle Clash
3. Total number of sentences in the description: We used the English "pickle" tokeniser provided with the NLTK library for this.
4. Average word length: Total number of characters
divided by the total number of words.
5. Average sentence length: Total number of words
divided by the total number of sentences.
6. Percentage of upper case characters: Percentage of upper case characters from the total alphabetical
characters.
7. Percentage of punctuation: Percentage of punctuation characters from the total number of characters. Characters counted as punctuation are shown in Table 26.
8. Percentage of numeric characters: Percentage of numeric characters from the total number of characters.
9. Percentage of non-alphabet characters: Percentage of non-alphabetic characters (e.g. numeric characters, punctuation) from the total number of characters.
YouTube
Google Search
WhatsApp Messenger
Angry Birds
Skype
Facebook Messenger
Temple Run
Temple Run 2
AntiVirus Security FREE
Google Translate
Adobe Reader
Hangouts
Google Play Books
Google Play Movies & TV
Google Play Newsstand
KakaoTalk Free Calls & Text
Brightest Flashlight Free
Angry Birds Seasons
Photo Grid Collage Maker
Angry Birds Space
Yahoo Mail
Top Security & Anti Virus FREE
Firefox Browser for Android
ColorNote Notepad Notes To do
Angry Birds Star Wars
Adobe AIR
Shoot Bubble Deluxe
Perfect Piano
HP Print Service Plugin
Dolphin Browser
Dr. Driving
Easy Battery Saver
GROUP PLAY
10. Percentage of common English words: We used the list of common English words provided by Canales et al. [10], which consists of three components: conjunctions, interrogatives, and prepositions. Some examples of each category are shown below, and the full list of words is shown in Table 23.
E.g. Conjunctions - after, although
Interrogatives - where, which
Prepositions - about, below
11. Percentage of personal pronouns: We used the list of personal pronouns provided by Canales et al. [10] under three categories: first person, second person, and third person. Some examples of each category are shown below, and the full list of words is shown in Table 24.
E.g. first person - I, me
second person - You, your
third person - He, her

Table 13: Collected metadata through the crawl

Metadata | Description
App id | Java package name of the app. This uniquely identifies an app in Google Play Store.
App name | Name of the app.
App category | Category of the app (out of 30 predefined categories in Google Play Store).
App description | App description (max limit 4,000 characters).
Price | Whether the app is free or paid, and the price of the app for paid apps.
Content Rating | Content rating defined by the developer (Everyone, High maturity, etc.).
Number of downloads | Number of downloads (in ranges, 500-1000 etc.).
Size | Size of the APK file in KB.
Version | Developer-defined version of the app.
Minimum Android version supported | Minimum Android version the app can be executed on.
Last updated date | Last date the developer updated the app.
Total reviews | Total number of users who reviewed the app.
Total 5 star ratings | Total number of users who gave a 5 star rating to the app.
Total 4 star ratings | Total number of users who gave a 4 star rating to the app.
Total 3 star ratings | Total number of users who gave a 3 star rating to the app.
Total 2 star ratings | Total number of users who gave a 2 star rating to the app.
Total 1 star ratings | Total number of users who gave a 1 star rating to the app.
Average rating | Average rating of the app.
Developer website | Link to the developer's website.
Developer email | Developer's email address.
Developer name | Name of the developer.
Developer ID | Unique identifier of the developer in Google Play Store.
Privacy policy | Link to the privacy policy.
User reviews | Default number of user review texts returned for an app page.
Similar apps | App ids of similar apps recommended by Google.
Developer's other apps | App ids of apps from the same developer.
12. Percentage of emotional words: We used the list of emotional words, such as excited, happy, and surprised, provided by Mukherjee et al. [44]; the full list of words is shown in Table 25.
13. Percentage of misspelled words: A common method used in spam emails is to deliberately misspell words so that messages can bypass keyword-based spam filters [33]. We used this feature to check whether the same happens in app spam, using the common misspellings list provided by Wikipedia [61].
14. Percentage of words containing alphabet and numeric characters: A common technique used in spam emails is the use of numbers to replace some alphabetic characters: numerical digits that look similar to certain letters are used to bypass keyword-based spam filters (e.g., "c00l" instead of "cool"). We used this feature to capture such behaviour.
15. Automatic readability index (AR) [58]:
Table 21: Manually identified bigrams and trigrams using the descriptions of top 100 apps which are associated with describing a functionality (size: 100)
play games
app for android
to find
is used to
way to
match your
app for
stay in touch
that allows
you have to
buy your
to use
keeps you
tap the
makes it easy
is a global
is a free
on the go
turns on
game from
on your phone
it helps
is a tool
lets you
support for
game of
to discover
this game
navigation app
your answer
is an easy
app which
train your
are available
you can
learn how
your phone
core functionality
is a smartphone
to play
this app
to enjoy
you to
is effortless
try to
action game
take a picture
follow what
brings together
discover more
more information
is an app
in your pocket
make your life
delivers the
full of features
is a simple
need your help
how to play
brings you
play with
makes mobile
drives you
game on
on android
game which
is the game
get notified
get your
to search
a simple
available for
key features
is available
take care of
can you beat
its easy
allows you
take advantage
save your
is the free
choose from
play as
learn more
face the challenges
your device
with android
offers essential
for android
it gives
enables you
at your fingertips
to learn
it brings
is a fun
strategy game
your way
application which
helps you
make your
The automatic readability index is a readability metric which can be used to measure the understandability of a text.

AR = (4.71 ∗ average word length) + (0.5 ∗ average sentence length) − 21.43
16. Flesch readability score (FR) [19]: Similar to AR, the Flesch readability score is also a measure of readability.

FR = 206.835 − 1.015 ∗ (total words / total sentences) − 84.6 ∗ (total syllables / total words)
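The following sketch pulls a few of the above features together using NLTK, as referenced in this section. The syllable counter is a rough vowel-group heuristic of our own; the paper does not state how syllables were counted.

# Sketch of selected stylometric features; NLTK is used as in the paper,
# but the syllable heuristic below is our own approximation.
import re
import nltk

nltk.download("punkt", quiet=True)  # models for the sentence tokeniser

def syllables(word):
    # Rough approximation: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def stylometric_features(description):
    text = description.lower()            # features use lower-cased text
    words = text.split()                   # white-space tokenisation
    sentences = nltk.sent_tokenize(text)   # NLTK sentence tokeniser
    chars = len(re.sub(r"\s", "", text))   # characters without white space
    avg_word = chars / max(1, len(words))
    avg_sent = len(words) / max(1, len(sentences))
    mixed = sum(1 for w in words
                if re.search(r"[a-z]", w) and re.search(r"\d", w))
    return {
        "num_chars": chars,
        "num_words": len(words),
        "num_sentences": len(sentences),
        # Feature 14: words mixing letters and digits (e.g. "c00l").
        "pct_alnum_words": 100.0 * mixed / max(1, len(words)),
        # Feature 15: automatic readability index.
        "AR": 4.71 * avg_word + 0.5 * avg_sent - 21.43,
        # Feature 16: Flesch readability score.
        "FR": 206.835 - 1.015 * avg_sent
              - 84.6 * sum(syllables(w) for w in words) / max(1, len(words)),
    }

print(stylometric_features("Get the c00l new game now. It is free!"))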
9.13 Description Similarity Related Features
This section describes the description similarity related features we defined to map heuristic checkpoint S6, as described in Section 5.6.
1. Total number of apps the developer has: The total number of apps the developer has published.
2. Total number of apps with an English language description which can be used to measure similarity: Total number of apps with an English language app description.
3. For a given app, number of apps having a cosine similarity higher than 60% from the same developer: We converted each app description to a vector space model after preprocessing the text using standard techniques: changing to lower case, removing punctuation, and tokenising using white spaces. Each description was then represented as a vector containing word frequencies, e.g. dj = (w1,j, w2,j, ...., wt,j), where w1,j is the frequency of word w1 in document dj. The cosine similarity between two documents di and dj is defined as

cos(di, dj) = (di · dj) / (||di|| ||dj||)

Then, for a given application, we counted how many apps from the same developer have a cosine similarity over 60% (a minimal sketch of this computation is given after this list).
Table 22: Automatically identified bigrams and trigrams using the descriptions of spam and top-1x apps (size: 218)
play games
ads in
advertisement can support
all the
and share
app on
as samsung galaxy
been tested on
contact us
default after reboot
device is
devices such as
download the
follow us
for free
free great
from your
google play
has been
have implemented some
how to
if your device
implemented some ads
in settings advertisement
in the
is not
it is
live wallpaper
live wallpapers
more free
need put
note if
now with
of the
on android
on latest
on phone instead
on your
over million
please contact
put the app
resets to
samsung galaxy s
screen touch
settings advertisement can
such as
tap the
tested on latest
the best
the most
the top
this app
this is
to add
to default after
to get
to the
to use home
use home
us if your
wallpaper has been
wallpaper on
wallpaper resets to
wallpapers we
want to
we have implemented
will need put
with the
you can
your device
your friends
your own
your wallpaper resets
you will
you can
ads in settings
after reboot
all your
and the
app on phone
as you
can be
contact us if
develop more
device is not
d live
download the free
for a
for the
free great live
galaxy s
great live
has been tested
home menu
if you
if your wallpaper
in a
instead of
in your
is not supported
latest devices
live wallpaper has
live wallpapers this
more free great
need put the
note if your
of sd
of the screen
one of
on latest devices
on the
on your home
phone instead
please contact us
reboot you
resets to default
s and
sd card
some ads
such as samsung
terms of
the app
the free
the screen
the wallpaper
this application
this live
to be
to develop
to make
touch the
to your
use home menu
us on
wallpaper live
wallpaper on your
wallpapers this
wallpapers we have
way to
will be
with a
with your
you have
your device is
your home
your phone
you to
you will need
get notified
advertisement can
after reboot you
and more
app is
as samsung
been tested
can support
default after
develop more free
devices such
d live wallpaper
easy to
for android
for your
from the
get the
great live wallpapers
have implemented
home screen
if your
implemented some
in settings
instead of sd
is a
is the
latest devices such
live wallpaper on
live wallpapers we
more than
need to
not supported
of sd card
of your
on facebook
on phone
on twitter
or tap
phone instead of
put the
reboot you will
samsung galaxy
screen to
settings advertisement
some ads in
support our
tested on
the app on
the game
the screen to
the world
this game
this live wallpaper
to default
to develop more
to play
to use
up to
us if
wallpaper has
wallpaper live wallpaper
wallpaper resets
wallpapers this live
wallpaper to
we have
will need
with friends
you are
your android
your favorite
your home screen
your wallpaper
you want
4. For a given app, number of apps having a cosine similarity higher than 70% from the same
developer: Same as above with 70% similarity.
5. For a given app, number of apps having a cosine similarity higher than 80% from the same
developer: Same as above with 80% similarity.
6. For a given app, number of apps having a cosine similarity higher than 90% from the same
developer: Same as above with 90% similarity.
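As referenced in feature 3, the following is a minimal sketch of the similarity counts under the stated preprocessing. scikit-learn is an assumption of this illustration, and dev_descriptions is a hypothetical set of one developer's app descriptions.

# Sketch of the per-developer description-similarity counts;
# scikit-learn is assumed, `dev_descriptions` is a placeholder.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

dev_descriptions = [
    "free live wallpaper for your phone",
    "free live wallpaper for your tablet",
    "a strategy game to play with friends",
]

# Term-frequency vector space model (lower-casing and punctuation
# stripping are close to CountVectorizer's defaults).
vectors = CountVectorizer().fit_transform(dev_descriptions)
sims = cosine_similarity(vectors)

for threshold in (0.6, 0.7, 0.8, 0.9):   # the four feature variants
    count = sum(1 for j in range(1, len(dev_descriptions))
                if sims[0, j] > threshold)
    print(f"apps over {threshold:.0%} similarity to app 0: {count}")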
9.14 Appid Related Features
Appid is the unique identifier of an app in Google Play Store. It follows the Java package name convention. Some examples of app ids are:
Table 23: Common English words
a
after
because
but
for
if
now
or
since
that
unless
whenever
wherever
yet
what
whose
why
about
after
amid
around
before
by
despite
except
following
in
like
of
onto
over
plus
save
through
towards
unlike
upon
with
an
although
before
either
how
neither
once
provided
so
though
until
where
whether
where
who
when
whether
above
against
among
as
behind
concerning
down
excepting
for
inside
minus
off
opposite
past
regarding
since
to
under
until
versus
within
the
and
both
even
however
nor
only
rather
then
till
when
whereas
while
which
whom
how
abroad
across
along
anti
at
but
considering
during
excluding
from
into
near
on
outside
per
round
than
toward
underneath
up
via
without
Table 24: Personal pronouns
i
my
ours
we
yours
he
herself
his
itself
theirs
themselves
me
myself
ourselves
you
yourself
her
him
it
she
them
they
mine
our
us
your
yourselves
hers
himself
its
their
themself
E.g.
com.google.android.apps.maps
com.facebook.katana
com.rovio.angrybirds
We used the following features to map the heuristic checkpoint S7 related to app ids (a short sketch computing several of these features follows this list).
1. Number of characters: Length of the app id in characters.
2. Number of words: Number of words in the app id, separated by the period symbol.
3. Average word length: Total number of characters divided by the total number of words.
4. Percentage of non-letter characters to total characters: Total number of non-alphabet characters to the total number of characters.
5. Percentage of upper case characters to total letter characters: Total number of upper case characters to the total number of alphabet characters.
6. Presence or absence of words in the app name matching parts of the appid: We checked whether or not parts of the app name appear in the app id by regular expression matching. The idea behind this is that automatically generated apps might not have meaningful app ids, whereas in practice it is customary for developers to give meaningful app ids, as shown above.
Table 25: Emotional words
aggressive
annoyed
cautious
depressed
discouraged
embarrassed
excited
frustrated
helpless
humiliated
innocent
lonely
optimistic
proud
relieved
shocked
surprised
undecided
abundance
admirable
amazing
bargain
best
breakthrough
brimming
clear
confidence
cuddly
delightful
ecstatic
enjoy
exotic
flair
genius
heavenly
impressive
luxurious
speed
superb
supreme
treasure
ultimate
zest
alienated
anxious
confused
determined
disgusted
enthusiastic
exhausted
guilty
hopeful
hurt
interested
mischievous
paranoid
puzzled
sad
shy
suspicious
withdrawn
ace
adore
appealing
beaming
better
breeze
charming
colorful
cool
dazzling
dynamic
efficient
enormous
expert
free
great
ideal
incredible
outstanding
splendid
sweet
terrific
ultra
unique
angry
careful
curious
disappointed
ecstatic
envious
frightened
happy
hostile
hysterical
jealous
miserable
peaceful
regretful
satisfied
sorry
thoughtful
absolutely
active
agree
attraction
beautiful
boost
brilliant
clean
compliment
courteous
delicious
easy
enhance
excellent
exquisite
generous
graceful
immaculate
inspire
royal
spectacular
sure
treat
unbeatable
wow
Table 26: Punctuation characters
!
$
’
*
:
=
@
_
‘
}
”
%
(
+
.
;
>
[
^
{
∼
#
&
)
,
/
<
?
\
]
|
7. Percentage of bigrams with 1 non-letter to total bigrams: Total bigrams with one non-alphabet
character to the total number of bigrams.
8. Percentage of bigrams with 2 non-letters to total bigrams: Total bigrams with two non-alphabet
characters to the total number of bigrams.
9. Percentage of bigrams with 1 non-letter or 2 non-letters to total bigrams: Total bigrams with at least one non-alphabet character to the total number of bigrams.
10. Percentage of trigrams with 1 non-letter to total trigrams: Total trigrams with one non-alphabet character to the total number of trigrams.
11. Percentage of trigrams with 2 non-letters to total trigrams: Total trigrams with two non-alphabet
characters to the total number of trigrams.
12. Percentage of trigrams with 3 non-letters to total trigrams: Total trigrams with three non-alphabet characters to the total number of trigrams.
13. Percentage of trigrams with 1, 2 or 3 non-letters
to total trigrams: Total trigrams with at least one
non-alphabet character to the total number of trigrams.
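A compact sketch of several of the above appid features follows; the helper names are ours, introduced only for illustration.

# Sketch of selected appid features; helper names are illustrative.
import re

def ngram_pct(appid, n, k):
    # Percentage of character n-grams containing exactly k non-letters.
    grams = [appid[i:i + n] for i in range(len(appid) - n + 1)]
    if not grams:
        return 0.0
    hits = sum(1 for g in grams
               if sum(1 for c in g if not c.isalpha()) == k)
    return 100.0 * hits / len(grams)

def appid_features(appid, app_name):
    words = appid.split(".")   # words separated by the period symbol
    return {
        "num_chars": len(appid),
        "num_words": len(words),
        "avg_word_len": len(appid) / len(words),
        "pct_non_letters": 100.0 * sum(1 for c in appid
                                       if not c.isalpha()) / len(appid),
        # Feature 6: do parts of the app name appear in the appid?
        "name_in_appid": any(re.search(re.escape(w), appid, re.IGNORECASE)
                             for w in app_name.split()),
        "bigram_1_nonletter_pct": ngram_pct(appid, 2, 1),
        "trigram_2_nonletters_pct": ngram_pct(appid, 3, 2),
    }

print(appid_features("com.rovio.angrybirds", "Angry Birds"))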
9.15 App Metadata Features
This section describes the metadata features we used that are not associated with a specific heuristic checkpoint.
1. App category: In Google Play Store there are 30 app categories, and the developer has to assign a single category to the app when publishing it.
2. Price: Price of the app in USD, with 0 for free apps.
3. Length of app name: Length of the app name in characters.
4. Size of the app in KB: Size of the app executable.
5. Developer’s website address available: Whether
or not the developer has given a website address.
6. Developer’s website address is reachable:
Whether or not the given web site address is reachable.
7. Developer’s email address is available: Whether
or not the developer has given an email address.
8. Privacy policy for the app is given: Whether or
not a privacy policy is available for the app.