Why My App Got Deleted Detection of Spam Mobile Apps
Transcription
Why My App Got Deleted Detection of Spam Mobile Apps
Why My App Got Deleted Detection of Spam Mobile Apps Suranga Seneviratne†? , Aruna Seneviratne†? , Dali Kaafar† , Anirban Mahanti† , Prasant Mohapatra‡ † NICTA; ? School of EET, UNSW; ‡ Department of CS, University of California, Davis ABSTRACT Increased popularity of smartphones has attracted a large number of developers to develop apps for various smartphone platforms. As a result, app markets are also populated with spam apps, which reduce the users’ quality of experience and increase the workload of app market operators. The latter resort to remove those apps, upon user complaints or to deny the developers’ publication approval requests by relying on continuous human efforts to check the app compliance with anti-spam policies. Apps can be “spammy” in multiple ways including not having a specific functionality, unrelated app description or unrelated keywords and publishing similar apps several times and across diverse categories. Through a systematic crawl of a popular app market and by identifying a set of removed apps, we propose a method to detect spam apps solely using apps’ metadata available at the time of publication. We first propose a methodology to manually label a sample of removed apps, according to a set of checkpoint heuristics to reveal the reasons behind apps removal. This analysis suggests that approximately 35% of the apps being removed are very likely to be spam apps. We then map the identified heuristics to several quantifiable features and show how distinguishing these features are for spam apps. Finally, we build an Adaptive Boost classifier for early identification of spam apps using only the metadata of the apps. Our classifier achieves an accuracy over 95% with precision varying between 85%–95% and recall varying between 38%–98%. By applying the classifier on a set of apps present at the app market at the time of our crawl, we estimate that at least 2.7% of them are spam apps. 1. INTRODUCTION Recent years has seen wide adoption of mobile apps, and the number apps that are being offered are growing exponentially. As of mid-2014, Google Play Store and Apple App Store, each hosted approximately 1.2 million apps [40, 52], with around 20,000 new apps being published each month Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00. in both of these app markets. These numbers are expected to significantly grow over the next few years [53, 40, 41]. Both Google and Apple have policies governing the publication of apps through their app markets. Google has an explicit spam policy [32] that describes their main reasons for considering an app to be “spammy”. These include : i) apps that were automatically generated and published, and as such do not have any specific functionality or a meaningful description; ii) multiple instances of the same app being published to obtain increased visibility in the app market; and iii) apps that make excessive use of unrelated keywords to attract unintended searches. Overall, spam apps vitiate the app market experience and its usefulness. At present, Google and Apple take different approaches to the spam detection problem. Google’s approach is reactive. Google’s app approval process does not check an app against its explicit spam app policy [46], and take action only on the basis of customer complaints [50]. In general, such a crowd sourced approach reveals spam apps only once the app becomes popular, which can lead to a considerable time lag between app submission and detection of spam behaviour. In contrast, Apple scrutinises the apps submitted for approval, presumably using a manual or semi-manual process, to determine whether the submitted app conforms to Apple’s policies. Although this approach is likely to detect spam apps before they are published, it lengthens the app approval process. With the ever increasing number of apps being submitted daily for approval, the app market operators need to be able to detect spam apps quickly and accurately. In this paper, we propose a methodology for automated detection of spam apps at the time of app submission that can be used in both app market management models. Our classifier utilises only features that can be derived from an app’s metadata available during the publication approval process. It does not require any human intervention such as manual inspection of the metadata or manual testing of the app. We validate our app classifier, by applying it to a large dataset of apps collected between December 2013 and May 2014, by crawling and identifying apps that were removed from the Google Play Store. We make the following contributions in this paper: • We develop a manual app classification methodology based on a set of heuristic checkpoints that can be used to identify reasons behind an app’s removal. Using this methodology we found that approximately 35% of the apps that were removed are spam apps; problematic content and counterfeits were the other key reasons for removal of apps. • We present a mapping of our proposed spam checkpoints to one or more quantifiable features that can be used to train a learning model. We provide a characterisation of these features highlighting differences between spam and non-spam apps and indicate which features are more discriminative. • We build an Adaptive Boost classifier for early detection of spam apps and show that our classifier can achieve an accuracy over 95% at a precision between 85%–95% and a recall between 38%–98%. • We applied our classifier to over 180,000 apps available in Google Play Store and show that approximately 2.7% of them are potentially spam apps. To the best of our knowledge, this is the first work to develop an early detection framework for identifying spam mobile apps and to demonstrate its efficacy using real world datasets. Our work complements prior research on detecting apps seeking over-permissions or apps containing malware [64, 23, 9, 63, 62, 49, 22]. The remainder of the paper is organised as follows. In Section 2 we discuss related work. In Section 3, we describe our datasets and our methodology. Section 4 presents our manual classification process and the heuristic checkpoints, while Section 5 introduces the process of mapping these checkpoints to machine learning features. We evaluate the performance of our classifier in Section 6. Section 7 discusses the implications of our work and concludes the paper. 2. RELATED WORK Machine learning and statistical analysis have been applied for a range of spam detection problems. However, to the best of our knowledge, these approaches have not been used to automatically identify spam apps in mobile app markets. In this section, we discuss related work in spam detection for web pages, SMS, and emails, and the detection of malware apps and over-privileged apps. Web spam refers to the publication of web pages that are specifically designed to influence search engine results. Using a set of manually classified samples of web pages obtained from “MSN Search”, Ntoulas et al. [45] characterise the web page features that can be used to classify a web page as spam or not. The features include top-level domain, language of the web page, number of words in the web page, number of words in the page title, and average word length etc. The authors show that a decision tree-based classifier achieves up to 85% precision and 90% recall for spam while keeping the respective values for non-spam around 98%. Fetterly et al. [18] characterise the features that can potentially be used to identify web spam through statistical analysis. The authors analyse features such as URL properties, host name resolutions, and linkage properties and find that outliers within each of the properties considered are spam. Castillo et al. [11] describe a cost-sensitive bagging classifier with decision trees, which when used individually with link-based features and content-based features can classify web pages into spam and non-spam. Gyöngyi et al. [24] propose the use of the link structure in a limited set of manually identified non-spam web pages to iteratively find spam and non-spam web pages, and show that a significant fraction of the web spam can be filtered using only a seed set of less than 200 sites. Krishnan et al. [37] use a similar approach. Erdélyi et al. [16] show that a computationally inexpensive feature subset, such as the number of words in a page, the number of words in a title, and the average word length, is sufficient to detect web spam. Detection of email spam has received considerable attention [7]. Various content related features of mail messages such as email header, text in the email body, and graphical elements have been used together with machine learning techniques, such as naive Bayesian [48, 54, 42, 39], support vector machines [15, 3, 55, 6], and k nearest neighbour [2]. Non-content related features such as SMTP path [38] and user’s social network [47, 13] has also been used in spam email detection. Currently, spam email detection is ubiquitous and most email services provide spam filtering. Spam has also been studied in the context of SMS [21, 14], product reviews [35, 34, 12], blog comments [43], social media [60, 5]. Cormack et al. [14] show that due to the limited size of SMS messages, bag of words or word bigram based spam classifiers do not perform well, and their performance can be improved by expanding the set of features to include orthogonal sparse word bigrams, and character bigrams and trigrams. Jindal et al. [35] identify spam product reviews using review centric features such as the number of feedback reports, textual features of the reviews, and product centric features such as price, sales rank as well as reviewer-centric features such as the average rating given by the reviewer. Wang et al. [60] study detection of spammers in Twitter. More recently, Malware mobile apps and over-privileged apps (i.e., apps with over-permissions) have received attention [64, 23, 9, 63, 62, 49, 22]. Zhou et al. [64] propose DroidRanger, which uses permission-based behavioural fingerprinting to detect new samples of known Android malware families and heuristics-based filtering to identify inherent behaviours of unknown malicious families. Gorla et al. [22] regroup app descriptions using a Latent Dirichlet Allocation based model and k-means clustering to identify apps that have unexpected API usage characteristics. 3. 3.1 METHODOLOGY Dataset We use apps collected in a previous study [57], as a seed for collecting a new dataset. This initial seed contained 94,782 apps and was curated from the lists of apps obtained from approximately 10,000 smartphone users. The user base consisted of volunteers, Amazon mechanical turk users, and users who published their lists of apps in social app discovery sites such as Appbrain 1 . We crawled each app’s Google Play Store page to discover: i) functionally similar apps; ii) other apps by the same developer. Then we collected metadata such as app name, app description, and app category for all the apps we discovered including the seed set. The complete list of metadata can be found in Appendix 9.15. We discarded the apps with a non-English description using the language detection API, “Detect Language” [1]. We refer to this final set as the Observed Set - O. After one month of identifying the Observed Set, apps, we revisited Google Play Store to check the availability of each app. The subset of apps that were unavailable at the time of this second crawl is referred to as Crawl 1 - C1 . This process was repeated two times, Crawl 2 - C2 and Crawl 3 - C3 with at least one month gap between two 1 www.appbrain.com consecutive crawls. Figure 1 provides an overview of the data collection process and Table 1 summarises the datasets in use. App discovery Crawl 1 Crawl 2 Observed Set Crawl 1 Crawl 2 Figure 1: Chronology of the data collection Table 1: Summary of the dataset Set Number of apps 232,906 6,566 9,184 18,897 Unofficial content Crawl 3 Dec’13 – May’14 Observed set (O) Crawl 1 (C1 ) Crawl 2 (C2 ) Crawl 3 (C3 ) Reason Spam Crawl 3 ~2 weeks 4 weeks ~2 weeks 4 weeks ~2 weeks 4 weeks ~2 weeks Initial Seed Table 2: Key reasons for removal of apps Copyrighted content Adult content Problematic content Android counterfeit Other counterfeit Developer deleted Developer banned Description These apps often have characteristics such as unrelated description, keyword misuse, and multiple instances of the same app. Section 4 presents details on spam app characteristics. Apps that provide unofficial interfaces to popular websites or services (e.g., an app providing an interface to a popular online shopping site without any official affiliation). Apps illegally distributing copyrighted content. Apps with explicit sexual content. Apps with illegal or problematic content. E.g., Hate speech and drug related. Apps pretending to be another popular app in the Google Play Store. A counterfeit app, for which the original app comes from a different source than Google Play Store (E.g., Apple App Store) Apps that were removed by the developer. Developer’s other apps were removed due to various reasons and Google decides to ban the developer. Thus all of his apps get removed. Table 3: Reviewer agreement in labelling reason for removal We identify temporary versus long-term removal of apps by re-checking the status of apps deemed to have been removed during an earlier crawl. For instance, all apps in Crawl 1 were checked again during Crawl 2. We found that only 85 (∼0.13%) apps identified as removed in Crawl 1 reappeared in Crawl 2. Similarly, only 153 apps (∼0.02%) identified as removed in Crawl 2 reappeared in Crawl 3. These apps were not included in our analysis. 3.2 App Labelling Process For a subset of the removed apps in our dataset, our goal was to manually identify the reasons behind their removal. As there can be multiple reasons for removal of apps, identification of the reasons behind the removal of an app is challenging and time consuming. We identified factors that lead to removal of apps from the market place by consulting numerous market reports [51, 26] as well as by examining the policies of the major app markets [32, 30, 28, 29, 31, 25]. We identified nine key reasons, which are summarised in Table 2. For each of these reasons, we formulated a set of heuristic checkpoints that can be used to manually label whether or not an app is likely to be removed. Owing to space limitations we do not provide the heuristic checkpoints for removed reasons except for spam. Full list of checkpoints for each reason can be found in Appendix 9.2–9.8. Section 4 delves into the checkpoints developed for identifying spam apps. From Crawl 1, we took a random sample of 1500 apps and asked three independent reviewers to identify the reasons behind the removal of each app using the heuristic checkpoints that we developed as a guideline. These reviewers were domain experts and had experience in developing and publishing Android apps. If a reviewer could not conclusively determine the reasons behind a removal, she labeled those apps as unknown. Moreover, for consistency, reviewers were asked to follow the same order of reasons for each app (i.e., check for spam first, then for unofficial content, then for copyrighted content followed by Android counterfeits and so on) and label the first reason they found as the most plausible. The set of apps that were used is referred as Labeled Set - L. Reason 3 agrees 2 agrees Total Percentage Spam Unofficial content Developer deleted Android counterfeit Developer banned Copyrighted content Other counterfeit Adult content Problematic content Unknown Sub total Reviewer disagreement Total Labeled 292 65 68 27 24 2 11 8 3 101 601 NA NA 259 127 56 61 54 34 23 4 4 127 749 NA NA 551 192 124 88 78 36 34 12 7 228 1350 150 1500 36.73% 12.80% 8.27% 5.87% 5.20% 2.40% 2.27% 0.80% 0.47% 15.20% 90.00% 10.00% 100.00% 3.3 Agreement Among the Reviewers We used majority voting to merge the results of the experts assessment, to arrive at the reason behind the app removal in L. We decided not to crowdsource the labelling task to avoid issues with training and expertise. Table 3 summarises reviewer agreement. For approximately 40% (601 out of 1500) of the apps in L, all three reviewers reached a consensus by identifying the same reason and for 90% (1350 out of 1500) of the apps majority of the reviewers agreed on the same reason. For the remaining 10% of apps, reviewers disagreed about the reasons. Table 3 also shows the composition of L after majority voting-based label merging. We observe that spam is the main reason for app removal, accounting for approximately 37% of the removals, followed by unofficial content accounting for approximately 13% of the removals. Around 15% of the apps were labelled as unknown. Figure 2 shows the conditional probability of the third reviewer’s reasoning, given that the other two reviewers are in agreement. There is over 50% probability of the third reviewer’s judgement of an app being spam, when two reviewers already judged the app to be spam. Other reasons showing such high probability are developer deleted and adult content apps. Our analysis through the manual labelling process shows that the main reason behind app removal is them being spam Value 0 0.1 0.2 0.3 0.4 0.5 0.6 Color Key on a reason Two reviewers agreeing 4.2 S7 S8 S9 89 115 11 11 3 26 3 15 The sum of the cases where 3 reviewers agreed, 2 reviewers agreed, and reviewers disagreed shown in Table 5, is more than the total number of spam apps identified by merging the reviewer agreement (i.e., 551 apps) because one reviewer might mark an app as satisfying multiple checkpoints. For example, consider a scenario where reviewer-1 labels the app as satisfying checkpoints S1 , S2 and reviewer-2 labels as S1 , S2 and reviewer-3 labels only as S1 . This will cause one scenario of 3 reviewer agreement (for S1 ) and another scenario of 2 reviewer agreement (for S2 ). Out of the 551 spam apps, three reviewers agreed on at least one checkpoint for 210 apps (∼ 38%), and two reviewers agreed on at least one checkpoint for 296 apps (∼ 54%). For 45 apps (∼ 8%), the three reviewers gave different checkpoints (i.e. when the reviewers assess an app as spam, over 90% of the time they also agreed on at least one checkpoint). 0.7 0.7 S9 0.6 S8 S7 S7 0.5 S2 S1 0.4 0.3 0.3 S5 0.4 S4 0.2 0.2 S3 Two reviewers agreeing on a checkpoint S5 S4 S6 S3 S2 S1 0.1 0.1 S6 Color Key 0.5 S8 0.6 Other Probability of third reviewer's reasoning Value S1 S2 S3 S4 S5 S6 S9 S7 Other 0 0.0 S9 S9 S8 S8 Two reviewers agreeing on a checkpoint S7 S7 S6 S6 S5 S5 S4 S4 S3 S3 Table 4 presents the nine heuristic checkpoints used by the reviewers and those were derived based on Google’s spam policy [32]. While we are unaware of Apple’s spam policy, we note that Apple’s app review guideline [25] includes certain provisions that match our spam checkpoints. For example, term 2.20 in the Apple app review guideline, “Developers spamming the App Store with many versions of similar apps will be removed from the iOS Developer Program” maps to our checkpoint S6 . Similarly, term 3.3 “Apps with names, descriptions, screenshots, or previews not relevant to the content and functionality of the app will be rejected” is similar to our checkpoint S2 . Reviewers were asked not to deem the apps as spam when they simply observe an app satisfying a single checkpoint, but to consider making their decision based on matches with multiple checkpoints according to their domain expertise. Note that Table 4 includes all checkpoint relevant to spam, including those that are considered only for manual labelling and not for automated labelling. In particular, checkpoints S8 and S9 are only used for manual labelling as features corresponding to these checkpoints are either not available at the time of app submission or require significant dynamic analysis for feature discovery. For instance, we do not use user “reviews” of the app (cf. checkpoint S8 ), as user reviews are not available prior to the app’s approval. Similarly, checking for excessive advertising (cf. checkpoint S9 ) was also used only for manual labelling, as it requires dynamic analysis by executing it in a controlled environment. S6 4 6 45 S2 Spam Checkpoints S5 0 3 S8 4.1 S4 20 75 S2 MANUAL LABELLING OF SPAM APPS This section introduces our heuristic checkpoints, which are used to manually label the spam apps. We also discuss the reviewers’ assessment according to the checkpoints. Section 5 maps the defined checkpoints into automated features. S3 63 81 S1 4. S2 22 52 S9 Figure 2: Probability of a third reviewer’s judgement when two reviewers already agreed on a reason S1 3 agreed 2 agreed Disagreed S1 0.3 0.2 0.1 0 Unknown Two reviewers agreeing on a reason Checkpoint Other 0.5 0.4 Value Color Key Two reviewers agreeing on a reason Spam 0.0 Problematic content Adult content Other counterfeit Copyrighted content Developer banned Developer deleted Android counterfeit Spam Unofficial content Unofficial.content Spam Unofficial content 0.1 gninosaer s'reweiver driht fo ytilibaborP 0.6 Unknown Problematic content Developer.deleted Spam Developer deleted Table 5: Checkpoint-wise reviewer agreement for spam Probability of third reviewer’s reasoning n w o nk n U tnetnoc.citamelborP tiefretnuoc.rehtO tnetnoc.tludA dennab.repoleveD tnetnoc.dethgirypoC deteled.repoleveD tiefretnuoc.diordnA mapS tnetnoc.laiciffonU Adult content Copyrighted content Developer banned Android counterfeit Developer deleted Unofficial content Spam Other counterfeit Android.counterfeit Probability of third reviewer’s reasoning Developer.banned Unofficial content 0.2 S1 Copyrighted.content Developer deleted Android counterfeit S2 Adult.content Other.counterfeit Android counterfeit Developer banned 0.3 S3 Unknown Problematic.content Developer banned S4 0.4 Copyrighted content S5 Other counterfeit Other counterfeit Copyrighted content S6 Adult content 0.5 S7 Adult content 0.6 Problematic content S8 Problematic content Unknown Table 5 shows how often the reviewers agreed upon each checkpoint. Checkpoints S2 and S6 led to the highest number of 3-reviewers agreements and 2-reviewer agreements. Checkpoints S1 and S3 have moderate number of 3-reviewer agreements while having a high number of 2-reviewer agreements. The table also suggests that checkpoints S1 , S2 , S3 and S6 are the most dominant checkpoints. S9 Probability of third reviewer's reasoning Unknown Reviewer Agreement on Spam Checkpoints Probability of third reviewer's reasoning apps. Furthermore, the reviewer agreement was high (more than 50%) when manually labelling spam apps indicating spam apps stands out clearly when looking at removed apps. Two reviewers agreeing on a checkpoint Color Key Figure 3: Probability of a third reviewer’s judgement when Value two reviewers already agreed on a checkpoint 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Figure 3 illustrates the conditional probability of the third reviewer’s checkpoint selection given two reviewers have already agreed on a checkpoint. We observe that for checkpoints S1 , S2 , S5 , S6 , S7 and S8 , there is a high probability that the third reviewer indicates the same checkpoint when two of the reviewers already agreed on a checkpoint. There is, however, a noticeable anomaly for checkpoint S4 , for which it seems that reviewers are getting confused with S3 . This is due to the lower frequency of occurrences of checkpoint S4 . There were no 3-reviewer agreements for S4 and there were only 3 cases where 2 reviewers agreed as shown in Table 5. In those 3 cases the other reviewer marked the checkpoint S3 in 2 cases giving a high proba- Table 4: Checkpoints used for labelling of spam apps Attribute ID S1 Description S2 S3 S4 S5 S6 Identifier S7 Reviews S8 Adware S9 Description and Examples Does the app description describe the app function clearly and concisely? E.g. Signature Capture App (Non - spam) - Description is clear on the functionality of the application “SignIt app allows a user to sign and take notes which can be saved and shared instantly in email or social media.” E.g. Manchester (Spam) - Description contains details about Manchester United Football Club without any detail on the functionality of the app. “Manchester United Football Club is an English professional football club, based in Old Trafford, Greater Manchester, that plays in the Premier League. ..... In 1998-99, the club won a continental treble of the Premier League, the FA Cup and the UEFA Champions League, an unprecedented feat for an English club.” Does the app description contain too much details / incoherent text / unrelated text for an app description? E.g. SpeedMoto (Non - spam) - Description is clear and concise about the functionality and usage. “SpeedMoto is a 3d moto racing game with simple control and excellent graphic effect. Just swap your phone to control moto direction. Tap the screen to accelerate the moto. In this game you can ride the motorcycle shuttle in outskirts, forest, snow mountain,bridge. More and More maps and motos will coming soon” E.g. Ferrari Wallpapers HD (Spam) - Description starts mentioning app as a wallpaper. However, then it goes into to details about Ferrari. “*HD WALLPAPERS *EASY TO SAVE *EASY TO SET WALLPAPER *INCLUDES ADS FOR ESTABLISHING THIS APP FREE TO YOU THANKS FOR UNDERSTANDING AND DOWNLOADING =) Ferrari S.p.A. is an Italian sports car manufacturer based in Maranello, Italy. Founded by Enzo Ferrari in 1929, as Scuderia Ferrari, the company sponsored drivers and manufactured race cars before moving into production of street-legal .....” Does the app description contain a noticeable repetition of words or keywords? E.g. English Chinese Dictionary - Keywords do not have excessive repetition. “Keywords: ec, dict, translator, learn, translate, lesson, course, grammar, phrases, vocabulary, translation, dict” E.g. Best live HD TV no ads (Spam) - Excessive repetition of words. “Keywords: live tv for free mobile tv tuner tv mobile android tv on line windows mobile tv verizon mobile tv tv streaming live tv for mobile phone mobile tv stream mobile tv phone mobile phone tv rogers mobile tv live mobile tv channels sony mobile tv free download mobile tv dstv mobile tv mobile tv.....” Does the app description contain unrelated keywords or references? E.g. FD Call Assistant Free (Non - spam) - All the keywords are related to the fire department. “Keywords: firefighter, fire department, emergency, police, ems, mapping, dispatch, 911” E.g. Diamond Eggs (Spam) - Reference to popular games Bejeweled Blitz and Diamond Blast without any reason. “Keywords : bejeweled, bejeweled blitz, gems twist, enjoy games, brain games, diamond, diamond blast, diamond cash, diamond gems,Eggs, jewels, jewels star” Does the app description contain excessive references for other applications from the same developer? E.g. Kids Puzzles (Non - spam) - Description does not contain references to developer’s other apps. “This kids game has 55 puzzles. Easy to build puzzles. Shapes Animals Nature and more... With sound telling your child what the puzzle is. Will be adding new puzzles very soon. Keywords: kids, puzzle, puzzles, toddler, fun” E.g. Diamond Snaker (Spam) - Excessive references to developer’s other applications. “If you like it, you can try our other apps (Money Exchange, Color Blocks, Chinese Chess Puzzel, PTT Web .....” Does the developer have multiple apps with approximately the same description? The developer “Universal App” have 16 apps having the following description, with each time XXXX term is replaced with some other term. “Get XXXX live wallpaper on your devices! Download the free XXXX live wallpaper featuring amazing animation. Now with ”Water Droplet”, ”Photo Cube”, ”3D Photo Gallery” effect! Touch or tap the screen to add water drops on your home screen! Touch the top right corner of the screen to customise the wallpaper <>. To Use: Home -> Menu -> Wallpaper -> Live Wallpaper -> XXXX 3D Live Wallpaper To develop more free great live wallpapers, we have implemented some ads in settings. Advertisement can support our develop more free great live wallpapers. This live wallpaper has been tested on latest devices such as Samsung Galaxy S3 and Galaxy Nexus. Please contact us if your device is not supported. Note: If your wallpaper resets to default after reboot, you will need put the app on phone instead of SD card. ” Does the app identifier make sense and have some relevance to the functionality of the application or does it look like auto generated? E.g. Angry Birds Seasons & Candy Crush Saga (Non - spam) - Identifier give an idea about the app. “com.rovio.angrybirdsseasons”, “com.king.candycrushsaga” E.g. Game of Thrones FREE Fan App & How To Draw Graffiti (Spam) - Identifiers appear to be auto generated. “com.a121828451851a959009786c1a.a10023846a”, “com.a13106102865265262e503a24a.a13796080a” Do users complain about app being spammy in reviews? E.g. Yoga Trainer & Fitness & Google Play History Cleaner (Spam) - Users complain about app being spammy. “Junk spam app Avoid”, “More like a spam trojan! Download if you like, but this is straight garbage!!” Do the online APK checking tools highlight app having excessive advertising? E.g. Biceps & Triceps Workouts “AVG threat labs” [27] gives a warning about inclusion of malware causing excessive advertising. word-trigram as features and the frequency of occurrence as the value of the feature. Full list of these bigrams and trigrams can be found in Appendix 9.11.Figure 6 visualises top50 bigrams (when ranked according to information gain); the size of the bigrams proportional to the logarithm of the normalised information gain. Note that bigrams such as “live wallpaper”, “wallpaper on”, and “some ads” are mostly present in spam whereas bigrams such as “follow us”, “to play” and “on twitter” are popular in non-spam. implemented.some settings.advertisement your.wallpaper phone.instead on.twitter 5.1 Checkpoint S1 - Does the app description describe the app function clearly and concisely? We automate this heuristic by identifying word-bigrams and word-trigrams that are used to describe a functionality and are popular among either spam or non-spam apps. First, we manually read the descriptions of top 100 nonspam apps and identified 100 word bigrams and word trigrams that describe app functionality, such as “this app’, “allows you”, and “app for”. Full list of these bigrams and trigrams found in Appendix 9.9. Figure 4 shows the cumulative distribution function (CDF) of the number of bigrams and trigrams observed in app descriptions in each dataset. There is a high probability of spam apps not having these features in their app descriptions. For example, 50% of the spam apps had one or more occurrence of these bigrams whereas 80% of the top 1x apps had one or more of these bi- and tri-grams. The difference in distribution of these bigrams and trigrams between spam and non-spam apps decreases when more and more lower ranked apps are considered in the non-spam category. This finding is consistent across some of the features we discuss in subsequent sections and we believe this indicates lower ranked apps can potentially include spams that are as yet not identified by Google. Second, as manual identification of bigrams and trigrams is not exhaustive, we identified the word-bigrams and wordtrigrams that appear in at least 10% of the spam and 10% of the non-spam apps. We used each such word-bigram and 2 https://evernote.com http://www.tictoc.net 4 http://www.saavn.com 3 latest.devices wallpaper.to wallpaper.on live.wallpaper reboot.you note.if.your on.latest can.support s.and of.sd this.live the.wallpaper screen.to on.facebook follow.us devices.such to.use.home wallpaper.has wallpapers.we advertisement.can develop.more have.implemented Non−Spam wallpapers.this tested.on to.default default.after on.phone free.great great.live resets.to use.home more.free wallpaper.resets support.our ads.in some.ads us.if.your been.tested live.wallpapers Checkpoint heuristics for spam (cf. Table 4) need to be mapped into features that can be extracted and used for automatic spam detection during the app approval process. This section presents this mapping and a characterisation of these features with respect to spam and non spam apps. A spam app is an app labelled as spam in L. We assumed that top apps with respect to the number of downloads, number of user reviews, and rating, are quite likely to be non spam. For example, according to this ranking, number one ranked app was Facebook and number 100 was Evernote 2 , number 500 was Tictoc 3 and number 1000 was Saavn 4 . Thus, for non spam apps, we selected top k times the number of labelled spam apps (551) from the set O, except all removed apps, after ranking them according to total number of downloads, total number of user reviews, and average rating. We varied k logarithmically between 1 and 32, (i.e., 1x, 2x, 4x,..., 32x) to obtain 6 datasets of non-spam apps. At larger k values it is possible that spam apps being considered to be non-spam, as discussed later in this section. As mentioned earlier in Section 4.1, checkpoints S8 and S9 are not used to develop features as we intend to enable automated spam detection at the time of developer submission. Total number of characters in the description Total number of words in the description Total number of sentences in the description Average word length Average sentence length Percentage of upper case characters Percentage of punctuations Percentages of numeric characters Percentage of non-alphabet characters Percentage of common English words [10] Percentage of personal pronouns [10] Percentage of emotional words [44] Percentage of misspelled words [61] Percentage of words with alphabet and numeric characters Automatic readability index (AR) [58] Flesch readability score (FR) [19] after.reboot need.put screen.touch FEATURE MAPPING Feature 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 as.samsung 5. Table 6: Features associated with Checkpoint S2 to.play bility (∼ 67%) for S4 getting confused with S3 . Note that Other refers to all the checkpoints associated with reasons for app removals other than spam; the probability of Others being chosen is high because of the accumulated sum of probabilities of checkpoints associated with other reasons. your.friends Spam Figure 6: Top-50 bigrams differentiating Spam and Top-1x 5.2 Checkpoint S2 - Does the app description contain too much details, incoherent text, or unrelated text? We use writing style related “stylometry” features to map this checkpoint to a set of features anticipating spam and non-spam apps might have a different writing style. We used the 16 features listed in Table 6 for this checkpoint and Appendix 9.12 provides further details on these features.. For characterisation, we select a set of features using the greedy forward feature selection in wrapper method [36]. We use a decision tree classifier with maximum depth 10 and the feature subset was selected such that the classifier optimise the performance metric asymmetric F-Measure, Fβ = (1 + β 2 ). (β 2precision.recall , with β = 0.5. This met.precision)+recall ric was used since in spam app detection, precision is more important than recall [59]. This process identified the Total number of words in the description, Total number of sentences in the description, Percentages of numeric characters, Percentage of non-alphabet characters, and Percentage of common English words as the most discriminative features. Figures 5a shows the CDF of the total number of words in the app description. Spam apps tend to have less wordy 0.8 0.6 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.4 0.2 0 0 2 4 6 8 10 Number of manually identified bigrams and trigrams 1 0.8 0.6 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.4 0.2 0 0 100 200 300 400 500 600 700 Word count (a) Figure 4: CDF of the frequency of manually identified word-bigrams and word-trigrams 1 0.8 0.6 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.4 0.2 0 0 0.15 0.3 0.45 0.6 Percentage of common English words (b) Figure 5: Example features associated with Checkpoint S2 app descriptions than non-spam apps. For instance, nearly 30% of the spam apps have less than 100 words whereas approximately only 10% top-1x apps and top-2x apps have less than 100 words. As we inject more and more apps of lower popularity the difference diminishes. Figure 5b presents the CDF of the percentage of common English words [10] in the app description. It illustrates that spam apps typically use fewer common English words compared to non-spam apps. 5.3 Checkpoint S3 - Does the app description contain a noticeable repetition of words or keywords? We quantify this checkpoint using the vocabulary richness of unique words (VR)= Number metric. Figure 7 shows the CDF Total number of words of the VR measure. We expected spam apps to have low VR due to repetition of keywords. However, we observe this only in a limited number of cases. According to Figure 7, if VR is less than 0.3, an app is only marginally more likely to be spam. Perhaps the most surprising finding is that the apps with VR close to 1 are more likely to be spam. 10% of the spam apps had VR over 0.9 and none of the non-spam apps had such high VR values. When we manually checked these spam apps we found that they had terse app descriptions. We showed earlier that apps with shorter description are more likely to be spam under checkpoint S2 . 5.4 Cumulative Probability Cumulative Probability Cumulative Probability 1 Checkpoint S4 - Does the app description contain unrelated keywords or references? Inserting unrelated keywords such that an app would appear in popular search results is a common spamming technique. Since the topics of the apps can vary significantly it is difficult to find when the terms included in the description are actually related to the functionality. In this study, we focussed on a specific strategy the developers might use, that is mentioning the names of the popular applications with the hope of appearing in the search results for these applications. For each app we calculated the number of mentioning of the top 100 popular app names in the app’s description. Figure 8a shows the distribution of the number of times the name of top-100 apps appear in the description. Somewhat surprisingly, we found that only 20% of the spam apps had more than one mention of popular apps, whereas 40%– 60% of the top-kx apps had more than a single mention of popular apps. On investigation, we find that many top-kx apps have social networking interfaces and fan pages and as a result they mention the names of popular social networking apps. This is evident in Figure 6 as well where Twitter or Facebook are mentioned mostly in non spam apps. To sum- marise, presence of social sharing features or availability of fan pages can be used to identify non spam apps. Next, we consider the sum of the tf-idf weights of the top100 app names. This feature attempts to reduce the effect of highly popular apps. For example, Facebook can have a high frequency because of its sharing capability and many apps might refer it. However, if an app refers to another app which is not as popular as Facebook, such as reference to popular games, it indicates a likelihood of it being a spam app. To discount the effect of the commonly available app names, we used the sum of tf-idf weights as a feature. For a given dataset let tfik where i = 1 : 100 be the number of times the app name of ith app (ni ) in top-100 apps appear in an app description of app k (ak ). Then we define IDFi for ith app in N ∀k. Then feature top-100 apps as, IDFi = log(1+|{n i ∈ak }|) P th tf − idf (k) for k app is, tf − idf (k) = i=100 tfik . IDFi . i=1 The calculation above is dataset dependent. Despite this, as Figures 8b shows, the CDF of top-1x dataset and the results still indicates that if popular app names are found in the app description, the app tends to be non spam rather than spam. We repeated the same with top-1000 app names and found that the results were similar. 5.5 Checkpoint S5 - Does the app description contain excessive references to other applications from the same developer? We use the number of times a developer’s other app names appear as the feature corresponding to this checkpoint. However, none of the cases marked by the reviewers as matching checkpoint S5 satisfied this feature. This was because the description contained links to the applications rather than the app names and only 10 spam apps satisfied this feature (cf. Table 5). We do not use checkpoint S5 in our classifier. 5.6 Checkpoint S6 - Does the developer have multiple apps with approximately the same description? For this checkpoint, for each app we considered the following features: (i) The total number of other apps the developer has, (ii) The total number of apps with an English language description which can be used to measure descriptions similarity and (iii) The number of other apps from the same developer having a description cosine similarity(s), of over 60%, 70%, 80% and 90%. Further details of these features can be found in Appendix 9.13. We observe that the features based on the similarity between app descriptions are the most discriminative. As examples, Figures 9a and 9b re- 0.8 0.6 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.4 0.2 0 0 0.2 0.4 0.6 0.8 Vocabulary richness 1 1 0.8 0.6 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.4 0.2 0 0 2 4 6 8 Frequency of top-100 app names appear Cumulative Probability Cumulative Probability Cumulative Probability 1 1 0.8 0.6 0.4 Spam Top-1x 0 5 10 15 20 25 30 tf-idf of top-100 app names (1x dataset) (a) Figure 7: Vocabulary Richness Figure 8: Mentioning popular app names 0.9 0.8 0.7 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0 20 40 60 80 100 120 140 Apps by developer with similarity over 60% (a) Over 60% similarity Cumulative Probability Cumulative Probability spectively show the CDF of number of apps with s over 60% and 90% by the same developer. Figure 9a shows that only about 10%–15% of the nonspam apps have more than 5 other apps from the same developer with over 60% of description similarity. However, approximately 27% of the spam apps have more than 5 apps with over 60% of description similarity. This difference becomes more evident when the number of apps from the same developer with over 90% description similarity is considered, indicating that spam apps tend to have multiple clones with similar app descriptions. 1 1 Table 7: Features associated with Checkpoint S7 Feature 1 2 3 4 5 6 7 8 9 10 11 12 13 Number of characters Number of words Average word length Percentage of of non-letter characters to total characters Percentage of upper case characters to total letter characters Presence of parts of app name in appid Percentage of bigrams with 1 non-letter to total bigrams Percentage of bigrams with 2 non-letters to total bigrams Percentage of bigrams with 1 or 2 non-letters to total bigrams Percentage of trigrams with 1 non-letter to total trigrams Percentage of trigrams with 2 non-letters to total trigrams Percentage of trigrams with 3 non-letter to total trigrams Percentage of trigrams with 1, 2 or 3 non-letters to total trigrams 0.9 0.8 0.7 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0 20 40 60 80 100 120 140 Apps by developer with similarity over 90% (b) Over 90% similarity Figure 9: Similarity with developer’s other apps 5.7 (b) Checkpoint S7 - Does the app identifier (appid) make sense and have some relevance to the functionality of the application or does it appear to be auto generated? Table 7 shows the features corresponding to this checkpoint; these are derived from the appid. Further details about these features can be found in Appendix 9.14. We applied the feature selection method noted in Section 5.2 to identify the most discriminative features (Number of words, Average word length, Percentage of bigrams with 2 nonletters). In Figure 10a we show the CDF of the number of words in the appid. The spam apps tend to have more words compared to non-spam apps. For example, 15% of the spam apps had more than 5 words in the appid where as only 5% of the non-spam had the same. Figure 10b shows the CDF of the average word length of a word in appid. For 10% of the spam apps the average word length is higher than 10 and it was so only for 2%–3% of the non-spam apps. Figure 10c shows the percentage of non-letter bigrams (e.g., “01”, “8 ”) among all character bigrams that can be generated from the appid. None of the non-spam apps had more 20% of non-letter bigrams in the appid, whereas about 5% of the spam apps had more than 20% of non-letter bigrams. Therefore, if an appid contains over 20% of nonletter bigrams out of all possible bigrams that can be generated, that app is more likely to be spam than non-spam. 5.8 Other Metadata In addition to the features derived from the checkpoints, we added the metadata related features listed on Table 8. Further details about these features can be found in Appendix 9.15. Figure 11 shows the app category distribution in the spam and the top-1x categories. We note that approximately 42% of the spam apps belong to the categories Personalisation and Entertainment. We found a negligibly small number of spam apps in the categories Communication, Photography, Racing, Social, Travel, and Weather. Moreover, for categories Arcade, Tools, and Casual, the percentage of spam is significantly less than the percentage of non-spam. Qualitatively similar observations hold for these other top-kx sets. Figure 12a shows the CDF of the length of the app name. Spam apps tend to have longer names. For example, 40% of the spam apps had more than 25 characters in the app name. Only 20% of the non-spam apps had more than 25 characters in their app names. Figures 12b shows the CDF of the size of the app in kilobytes. Spam apps tend to have a high probability of having a size in some specific ranges. For example if the size of the app is very low those apps are more likely to be non-spam. For example, 30% of the topkx apps were less than 100KB in size and the corresponding percentage of spam apps is almost zero. Almost all the spam 0.6 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.4 0.2 0 2 4 6 8 10 No of words in appid 12 1 Cumulative Probability Cumulative Probability Cumulative Probability 1 0.8 0.8 0.6 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.4 0.2 0 0 5 10 15 Average word length appid (a) 1 0.98 0.96 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.94 0.92 20 0.9 0 0.2 0.4 0.6 0.8 Percentage of 2 non-letters to total bigrams (b) (c) Table 8: Features associated with other app metadata 1 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.8 0.6 Cumulative Probability apps were having sizes less than 30MB whereas 10%–15% of the top-kx apps were more than 30MB in size. CDF of the spam app shows a sharp rise between 1000KB and 3000KB indicating there are more apps in that size range. Table 9 shows the external information related features. For example, if a link to a developer web site or a privacy policy is given the app is more likely to be non spam. Cumulative Probability Figure 10: Example features associated with Checkpoint S7 0.4 0.2 0 0 5 10 15 20 25 30 Length of app name (in characters) 1 0.8 0.6 Spam Top-1x Top-2x Top-4x Top-8x Top-16x Top-32x 0.4 0.2 0 0 5000 Size (KB) (a) Feature 1 2 3 4 App category Price Length of app name Size of the app (KB) 5 6 7 8 Developer’s website available Developer’s website reachable Developer’s email available Privacy policy available Website availa. Website reacha. Email availa. Priv. policy availa. Spam top-1x 15 10 5 Arcade Books Brain Business Cards Casual Comics Communication Education Entertainment Finance Health Libraries Lifestyle Media Medical Music News Personalisation Photography Productivity Racing Shopping Social Sports Sports Games Tools Transport Travel Weather Percentage of apps Figure 12: Features associated with other app metadata Table 9: Availability of developer’s external information 20 App Category Figure 11: App category 6.1 (b) Feature 25 6. 10000 15000 20000 25000 30000 EVALUATION Classifier Performance We build a binary classifier based on the Adaptive Boost algorithm [20] to detect whether or not a given app is spam. We use the spam apps from L as positives and apps from a top-kx set as negatives. Each app is represented by a vector of the features studied in Section 5. The classifier is trained using 80% of the data and the remaining 20% of the data is used for testing. We repeat this experiment for k = 1, 2, 4, · · · , 32. Some of the features discussed in Section 5 depend on the other apps in the dataset rather than on the metadata Spam top 1x top 2x top 4x top 8x top 16x top 32x 57% 93% 99% 9% 93% 98% 84% 56% 94% 97% 89% 50% 93% 97% 91% 48% 91% 96% 93% 38% 89% 96% 94% 32% 86% 95% 95% 26% of the individual apps. For such cases we calculate the features based only on the training set to avoid any information propagation from the training to the testing set that would artificially inflate the performance of the classifier. For example, when extracting the bigrams (and trigrams) that are popular in at least 10% of the apps, only spam and nonspam apps from the training set are used. This vocabulary is then used to generate feature values for individual apps in both training and testing sets. Decision trees [8] with a maximum depth of 5 are used as the weak classifiers in the Adaptive Boost classifier5 and we performed 500 iterations of the algorithm. Table 10 summarises the result. Our classifiers, while varying the value of k, have precision over 85% with recall varying between 38%–98%. Notably, when k is small (e.g., when the total number of non spam apps represents ≤ 2x the number of spam apps) the classifier achieves an accuracy as high as 95%. Recall drops when we increase the number of negative examples because larger k values implies inclusion of lower ranked apps as negative (non-spam) examples. Some of these lower ranked apps, however, exhibit some spam features and some may indeed be spam (not yet been detected as spam). As we have only a small number of spam apps 5 Adaptive Boost algorithm combines the output of set of weak classifiers to build a strong classifier. Interested readers may refer to [20]. in the training set, when unlabelled spam apps are added as non-spam, some spam related features become non-spam features as the number of app satisfying that feature is high in non-spam. As a result recall drops when k increases. As the number of negative examples increases the classifier becomes more conservative and correctly identifies a relatively small portion of spam apps. Nonetheless, even at k = 32 we achieve a precision over 85%. This is particularly helpful in spam detection, as marking a non-spam as spam can be more expensive than missing a spam. Additionally, if the objective is to build an aggressive classifier that identifies as much spam as possible so that app market operators can flag these apps to make a decision later, a classifier built using smaller number of negative examples (i.e., k = 1 or k = 2) can be used. For example, in Table 11 we show the k = 2 classifier’s performance in the higher order datasets. As can be seen, this classifier identifies nearly 90% of the spam. However, the precision goes down as we traverse down the app ranking because of increased number of false positives. Nonetheless, in reality some of these apps may actually be spam and as a result may not exactly be false positives. Table 10: Classifier Performance k Precision Recall Accuracy F0.5 1 2 4 8 16 32 0.9310 0.9533 0.9126 0.9405 0.8833 0.8571 0.9818 0.9273 0.8545 0.7182 0.4818 0.3818 0.9545 0.9606 0.9545 0.9636 0.9658 0.9793 0.9408 0.9480 0.9004 0.8857 0.7571 0.6863 Table 11: Classifier Performance: k = 2 model 6.2 k Precision Recall Accuracy F0.5 4 8 16 32 0.8080 0.5549 0.2730 0.1164 0.9182 0.9182 0.9182 0.9182 0.9400 0.9091 0.8513 0.7862 0.8279 0.6026 0.3176 0.1410 Applying the Classifier in the Wild To get an estimate of the volume of potential spam that might be in Google Play Store, we used two of our classifiers to predict whether or not an app in our dataset was spam. We selected a conservative classifier (k = 32) and an aggressive classifier (k = 2) to obtain a lower and upper bound. We did this prediction only for the apps which were not used for training during the classifier evaluation phase to avoid classifier artificially reporting higher number of spam. Table 12 shows the results. C1 , C2 , and C3 are three sets of removed apps we identified as described in Section 3.1 and Others are the apps in O which were neither removed nor belong up to top-32x. Thus Others apps represent the average apps in Google Play Store and we know that they were there in the market for at least for 6 months, and by time we stopped monitoring they were still there in the market. According to the results, more aggressive classifier (k = 2) predicted around 70% of the removed apps and 55% of the other apps to be spam. The conservative classifier (k = 32) predicted 6%–12% of the removed apps and approximately 2.7% of the other apps as spam; this can be considered as the lower bound for the percentage of spam apps currently available in Google Play Store. Table 12: Predictions on spam apps in Google Play Store 7. Dataset Size Crawl 1 (C1 ) Crawl 2 (C2 ) Crawl 3 (C3 ) Others 6,566 9,184 18,897 180,627 k=2 70.37 73.14 72.99 54.59 % % % % k = 32 12.89 % 6.57 % 6.49 % 2.69 % CONCLUDING REMARKS In this paper, we propose a classifier for automated detection of spam apps at the time of app submission. Our app classifier utilises only those features that can be derived from an app’s metadata available during the publication approval process. It does not require any human intervention such as manual inspection of the metadata or manual testing of the app. We validate our app classifier, by applying it to a large dataset of apps collected between December 2013 and May 2014, by crawling and identifying apps that were removed from the Google Play Store. Our results show that it is possible, to automate the process of detecting spam apps solely based on apps’ metadata available at the time of publication and achieve both high precision and recall. In the remainder of this section, we discuss two challenges that can be fruitful avenues for future work. The manual labelling challenge. In this work, we used a relatively small set (1500) of labelled spam apps and used the top apps from the market place as proxies for nonspam apps. Our choices were dictated largely by the time consuming nature of the labelling process. For this study, it took on average 20 working days, for each of our reviewers to label all the 1500 apps. Obviously, the performance of the classifier can be improved by increasing the number of labelled spam apps and further manually labelling non-spam apps through an extension of our manual effort. One approach is to rely on crowdsourcing and recruit appsavvy users as reviewers. This is not a trivial task as it requires a major effort of providing individualised guidelines and interactions with the reviewers, and need to deal with assessment inconsistencies due to variable technical expertise. Nonetheless, from an app market provider perspective, this is a plausible option to consider. A framework that can quickly assess whether an app is spam, when the developers submit apps for approval will enable faster approvals whilst reducing the number of spam apps in the market. Alternatively, hybrid schemes can be developed where apps are flagged during the approval process and removed based on a limited number of customer complaints. Another potential direction is to consider a mix of supervised and unsupervised learning, known as semi-supervised learning [4]. The basic idea behind semi-supervised learning is to use “unlabelled data” to improve the classification model that was previously being built using only labelled data. Semi-supervised learning has previously being successfully applied to real-time traffic classification problems [17]. The classification arms race. It is possible that spammers adapt to the spam app detection framework and change their strategies according to our selected features to avoid the detection. The relevant questions in this context are: i) how frequently the should the classifier be retrained? and ii) how to detect when retraining is required?. We believe that spam app developers will find it challenging and have significant cost overheads to adapt their apps to avoid features that allow discriminating between spam and non-spam apps. For instance, to avoid the similarity of descriptions of multi- ple apps spanning over different categories, the spammer has to consider editing the different descriptions of the apps and to customise each of the app description to contain sufficient details and coherent text, etc. An important direction for future work is to study the longevity of our classifier. Additionally, we plan to investigate how the process of identifying retraining requirements can be automated. Prior work on traffic classification suggests that automated identification of re-training points is possible using semi-supervised learning approaches [17]. 8. REFERENCES [1] Language Detection API. http://detectlanguage.com, 2013. [2] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. D. Spyropoulos, and P. Stamatopoulos. Learning to filter spam e-mail: A comparison of a naive bayesian and a memory-based approach. arXiv preprint cs/0009009, 2000. [3] H. B. Aradhye, G. K. Myers, and J. A. Herson. Image analysis for efficient categorization of image-based spam e-mail. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 914–918. IEEE, 2005. [4] S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In Proc. of KDD, KDD ’04, pages 59–68, 2004. [5] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. Detecting spammers on twitter. In Collaboration, electronic messaging, anti-abuse and spam conference (CEAS), volume 6, page 12, 2010. [6] E. Blanzieri and A. Bryl. Evaluation of the highest probability svm nearest neighbor classifier with variable relative error cost. 2007. [7] E. Blanzieri and A. Bryl. A survey of learning-based techniques of email spam filtering. Artificial Intelligence Review, 29(1):63–92, 2008. [8] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC press, 1984. [9] I. Burguera, U. Zurutuza, and S. Nadjm-Tehrani. Crowdroid: behavior-based malware detection system for android. In Proceedings of the 1st ACM workshop on Security and privacy in smartphones and mobile devices, pages 15–26. ACM, 2011. [10] O. Canales, V. Monaco, T. Murphy, E. Zych, J. Stewart, C. Tappert, A. Castro, O. Sotoye, L. Torres, and G. Truley. A stylometry system for authenticating students taking online tests. Proc. Student-Faculty CSIS Research Day, 2011. [11] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 423–430. ACM, 2007. [12] R. Chandy and H. Gu. Identifying spam in the ios app store. In Proceedings of the 2nd Joint WICOW/AIRWeb Workshop on Web Quality, pages 56–59. ACM, 2012. [13] P.-A. Chirita, J. Diederich, and W. Nejdl. Mailrank: using ranking for spam detection. In Proceedings of the 14th ACM international conference on Information and knowledge management, pages 373–380. ACM, 2005. [14] G. V. Cormack, J. M. Gómez Hidalgo, and E. P. Sánz. Spam filtering for short messages. In Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, pages 313–320. ACM, 2007. [15] H. Drucker, S. Wu, and V. N. Vapnik. Support vector machines for spam categorization. Neural Networks, IEEE Transactions on, 10(5):1048–1054, 1999. [16] M. Erdélyi, A. Garzó, and A. A. Benczúr. Web spam classification: a few features worth more. In Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, pages 27–34. ACM, 2011. [17] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson. Offline/realtime traffic classification using semi-supervised learning. Perform. Eval., 64(9-12):1194–1213, Oct. 2007. [18] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases: colocated with ACM SIGMOD/PODS 2004, pages 1–6. ACM, 2004. [19] R. Flesch. A new readability yardstick. Journal of applied psychology, 32(3):221, 1948. [20] Y. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96, pages 148–156, 1996. [21] J. M. Gómez Hidalgo, G. C. Bringas, E. P. Sánz, and F. C. Garcı́a. Content based sms spam filtering. In Proceedings of the 2006 ACM symposium on Document engineering, pages 107–114. ACM, 2006. [22] A. Gorla, I. Tavecchia, F. Gross, and A. Zeller. Checking app behavior against app descriptions. In ICSE, pages 1025–1035, 2014. [23] M. Grace, Y. Zhou, Q. Zhang, S. Zou, and X. Jiang. Riskranker: scalable and accurate zero-day android malware detection. In Proceedings of the 10th international conference on Mobile systems, applications, and services, pages 281–294. ACM, 2012. [24] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In Proceedings of the Thirtieth international conference on Very large data bases-Volume 30, pages 576–587. VLDB Endowment, 2004. [25] Apple. Inc. App store review guidelines. https://developer.apple.com/app-store/review/ guidelines/, 2014. [26] Apple. Inc. Common app rejections. https://developer.apple.com/app-store/review/ rejections/, 2014. [27] AVG Threat Labs. Inc. Website safety ratings and reputation. http://www.avgthreatlabs.com/ website-safety-reports/app, 2014. [28] Google. Inc. Ads. http://developer.android.com/ distribute/googleplay/policies/ads.html, 2014. [29] Google. Inc. Google play developer program policies. https://play.google.com/about/ developer-content-policy.html, 2014. [30] Google. Inc. Intellectual property. http://developer.android.com/distribute/ googleplay/policies/ip.html, 2014. [31] Google. Inc. Rating your application content for Google Play. https://support.google.com/ googleplay/android-developer/answer/188189, 2014. [32] Google. Inc. Spam. http: //developer.android.com/distribute/googleplay/, 2014. [33] WiseGeek. Inc. Why do spam emails have a lot of misspellings? http://www.wisegeek.com/ why-do-spam-emails-have-a-lot-of-misspellings. htm, 2014. [34] N. Jindal and B. Liu. Review spam detection. In Proceedings of the 16th international conference on World Wide Web, pages 1189–1190. ACM, 2007. [35] N. Jindal and B. Liu. Opinion spam and analysis. In [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] Proceedings of the 2008 International Conference on Web Search and Data Mining, pages 219–230. ACM, 2008. R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial intelligence, 97(1):273–324, 1997. V. Krishnan and R. Raj. Web spam detection with anti-trust rank. In AIRWeb, volume 6, pages 37–40, 2006. B. Leiba, J. Ossher, V. Rajan, R. Segal, and M. N. Wegman. Smtp path analysis. In CEAS, 2005. K. Li and Z. Zhong. Fast statistical spam filter by approximate classifications. In ACM SIGMETRICS Performance Evaluation Review, volume 34, pages 347–358. ACM, 2006. AppBrain Inc. AppBrain Stats. http://www.appbrain.com/stats, 2014. Statista Inc. Number of newly developed applications/games submitted for release to the iTunes App Store from January 2012 to September 2013. http://www.statista.com, 2013. V. Metsis, I. Androutsopoulos, and G. Paliouras. Spam filtering with naive Bayes-which naive Bayes? In CEAS, pages 27–28, 2006. G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In AIRWeb, volume 5, pages 1–6, 2005. A. Mukherjee and B. Liu. Improving gender classification of blog authors. In Proceedings of the 2010 conference on Empirical Methods in natural Language Processing, pages 207–217. Association for Computational Linguistics, 2010. A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th international conference on World Wide Web, pages 83–92. ACM, 2006. J. Oberheide and C. Miller. Dissecting the android bouncer. SummerCon2012, New York, 2012. P. Oscar and V. Roychowdbury. Leveraging social networks to fight spam. IEEE Computer, 38(4):61–68, 2005. P. Pantel, D. Lin, et al. Spamcop: A spam classification & organization program. In Proceedings of AAAI-98 Workshop on Learning for Text Categorization, pages 95–98, 1998. H. Peng, C. Gates, B. Sarma, N. Li, Y. Qi, R. Potharaju, C. Nita-Rotaru, and I. Molloy. Using probabilistic generative models for ranking risks of android apps. In Proceedings of the 2012 ACM conference on Computer and communications security, pages 241–252. ACM, 2012. S. Perez. Developer spams Google Play with ripoffs of well-known apps again. http://techcrunch.com, 2013. S. Perez. Nearly 60K low-quality apps booted from Google Play Store in February, points to increased spam-fighting. http://techcrunch.com, 2013. S. Perez. iTunes App Store now has 1.2 million apps, has seen 75 billion downloads to date. http://techcrunch.com, 2014. D. Rowinski. Apple iOS App Store adding 20,000 apps a month, hits 40 billion downloads. http://readwrite.com, 2013. M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. A Bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 workshop, volume 62, pages 98–105, 1998. D. Sculley and G. M. Wachman. Relaxed online svms for spam filtering. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 415–422. ACM, 2007. [56] S. Seneviratne, A. Seneviratne, D. Kaafar, A. Mahanti, and P. Mohapatra. Why my app got deleted: Detection of spam mobile apps. Technical report, NICTA, Australia, November 2013. [57] S. Seneviratne, A. Seneviratne, P. Mohapatra, and A. Mahanti. Predicting user traits from a snapshot of apps installed on a smartphone. ACM SIGMOBILE Mobile Computing and Communications Review, 18(2):1–8, 2014. [58] E. Smith and R. Senter. Automated readability index. AMRL-TR. Aerospace Medical Research Laboratories (6570th), page 1, 1967. [59] I. Soboroff, I. Ounis, J. Lin, and I. Soboroff. Overview of the trec-2012 microblog track. In Proceedings of the Twenty-First Text REtrieval Conference (TREC 2012), 2012. [60] A. H. Wang. Don’t follow me: Spam detection in twitter. In Security and Cryptography (SECRYPT), Proceedings of the 2010 International Conference on, pages 1–10. IEEE, 2010. [61] Wikipedia. Wikipedia:lists of common misspellings. http://en.wikipedia.org/wiki/, 2014. [62] D.-J. Wu, C.-H. Mao, T.-E. Wei, H.-M. Lee, and K.-P. Wu. Droidmat: Android malware detection through manifest and API calls tracing. In Information Security (Asia JCIS), 2012 Seventh Asia Joint Conference on, pages 62–69. IEEE, 2012. [63] W. Zhou, Y. Zhou, X. Jiang, and P. Ning. Detecting repackaged smartphone applications in third-party android marketplaces. In Proceedings of the second ACM conference on Data and Application Security and Privacy, pages 317–326. ACM, 2012. [64] Y. Zhou, Z. Wang, W. Zhou, and X. Jiang. Hey, you, get off of my market: Detecting malicious apps in official and alternative android markets. In NDSS, 2012. 9. 9.1 APPENDIX Collected Metadata Table 13 shows all the metadata our crawler collected from Google Play Store in related to apps. 9.2 Unofficial Content Apps Table 14 shows the heuristics used to identify unofficial content apps that are developed based on the Google’s content policy [29] and intellectual property policy [31]. 9.3 Copyrighted Content Apps Table 15 shows the heuristics used to identify copyrighted content apps that are developed based on the Google’s content policy [29] and intellectual property policy [31]. 9.4 Problematic Content Apps Heuristics used to identify problematic or illegal content apps are shown in Table 16. Those are developed based on the Google’s content policy [29]. 9.5 Counterfeit Apps Table 17 summarises checkpoints used to identify Android Counterfeits and Other Counterfeits. This was more challenging as the reviewers had to manually search in Google Play Store and Google Search Engine with parts or full app names and descriptions. Then the retrieved top results (links/images) were checked whether they lead to other apps (in Google Play Store or other App Markets). Also they verified whether the other apps are possibly from the same de- Table 14: Checklist used for manual labeling of unofficial content apps Attribute Description ID U1 Description and Examples Does the app description mention the app as an unofficial app? E.g. The Verge Notifications - App description says the app is unofficial “.... Unofficial The Verge Notifications application. Notifies you of new replies or recommendations to your The Verge comments. ....” E.g. Flash Download [unofficial] - App description says the app is unofficial “.... Unofficial Flash Install Supporter will able you to install Latest Flash Player from official site. . ....” U2 Does the app description mention it provide unofficial interfaces to popular websites or content? E.g. Shortcut to Hotmail - App description says it connects to Hotmail. However when developer details are checked it is not officially from Hotmail. “.... Access Hotmail (currently Outlook) from this convenient shortcut, from your smartphone or table. Do not waste time ....” E.g. Pin Search For BBM - App description says it connects to Blackberry messenger without any affiliation with Blackberry. “.... Pin Search BBM app is for all user Blackberry Messenger (BBM) , You can find here BBM pin and make new friends. Share BlackBerry PIN’s world wide conveniently on iPhone, Android and BlackBerry phone devices which support the BBM app. Disclaimer :PLEASE NOTE , This app is not affiliated with Blackberry Limited ....” U3 Does the app description mention popular brand names without any affiliation. i.e. App developer is different from the official app developer of that brand, official web site does not mention about an Android application. E.g. Reebok Wall Ball - Unofficial use of the brand name Reebok “.... Reebok Realflex Wall Ball. Flex your shoe to shoot medicine balls at targets on the wall. Every level ....” U4 When the app delivers contents from others sources does it mention it has the right to do so? E.g. Radio City Canada - The app description does not mention whether the app is unofficial or has the right to deliver this content. When the official web site of Radio City web site is checked there was no reference for an Android app. “.... Listen to CBC Radio across Canada. Keywords: Radio, Canada, CBC, Radio-Canada, Radio, TV, MP3, webradios, webradio. ....” Name U5 Does the app name mention app being unofficial? E.g. Flash Download [unofficial], American Express (UnOfficial), Khan Academy (Unofficial), Pitbull Unofficial Fan App - App names mention app being unofficial Website U6 Does the developer’s website mention app being unofficial? E.g. STC Collect Call - The app description does not mention does not mention whether the app is unofficial. However in the developer website it is mentioned that app is unofficial. App description “ Collect Call This app makes it easier to dial the collect call. Just select or enter the number you wish to call, click on dial button and then select ”collect call” from the menu. Below are information about STC collect call service. With the Collect Call service from STC, the call will be charged to the receiver once he or she is notified and accepts to pay the charges. *- No settings or activation required *- You can call even when you donÕt have credit *- Free and has no monthly subscription fees *- Available to all Jawal postpaid and prepaid customers (Sawa, Lana) *- Works only among STC customers and not with other operators *- Does not apply while roaming for more info please visit www.STC.com.sa” Description at developer web site “ is a service provided by Saudi telecom (STC). This application is an unofficial application which I have developed for this service. Please post your comments, inquires or wishes in English ” Reviews U7 Do users complain about app being unofficial? E.g. Khan Academy (Unofficial) “Fantastic app Being the unofficial gives me mild concern.but overall really easy interface’ veloper or not as some developers may publish apps in multiple app markets. Description and app names were check as regular expressions with our master dataset collected from the total crawl as well. Reviewers marked an app as counterfeit only if they could reliably conclude the original app in Google Play Store or any other app market such as Apple App Store and Amazon App Store and they were sure that it is not coming from the same developer. 9.6 Adult Content Apps Adult content apps were identified using the heuristics shown in Table 18. These were developed based on the Google’s content policy [29]. 9.7 Developer Deleted Apps Table 19 shows the heuristics used to identify developer deleted applications. These were developed based on the Table 15: Checklist used for manual labelling of copyrighted content apps Attribute ID R1 Description Description and Examples Does the app description indicate potential copyright infringement or a trademark infringement (unauthorized used of registered trademarks)? E.g. WP8 Surface Wallpaper HD - App description indicates potential copyright infringement. “.... Windows Phone 8 & Surface Tablet Wallpaper HD Official Original From Microsoft Device. ....” E.g. GHOST IN THE SHELL SAC+2ndGIG - App description indicates potential copyright infringement. “.... This app includes all episodes of GHOST IN THE SHELL SAC VIDEO Please ENJOY!!!. ....” R2 Does the app provide copyrighted material at a price? E.g. Dungeon Keeper 2 Soundboard - According to app description it provides sounds from a PC game and the app is priced at AU$7.44. The developer has no affiliation to original content owner. “A lot of sounds from the old but good Dungeon Keeper 2 PC game” E.g. Pin Search For BBM - According to app description it provides contents from a movie and a book and the app is priced at AU$1.49. The developer has no affiliation to original content owner. “.... MOVIE AND BOOK BY CHUCK PALAHNIUK, THIS FIGHT CLUB SOUNDBOARD BRINGS YOU YOUR FAVORITE QUOTES FROM THE MOVIE RIGHT TO YOUR PHONE WITH JUST THE TOUCH OF A FINGER! ....” R3 Does the app description contain disclaimers related to copyrights? E.g. Love Stories - App description contains a disclaimer on potential copyright infrigements. “.... Smart Applications does not support plagiarism in any form. Author must note that the story should not be copied from anywhere; it should be an original piece of work. If found otherwise, we reserve the right to delete it. ....” Reviews R4 Do the users mention about copyright infringements? E.g. Astro Soldier War Of The Stars - Users mention about copyright infringements. “.... You stole copyrighted material from TimeGate Studios. ....”, “..... Your so icon is artwork from Section 8, one of their video games .....” E.g. Harlem Shake Creator Pro - Users mention about copyright infringements. “..... Want my money back!! Payed for this app so that my kids could save and share their video, then Facebook removed it on account of copyright infringement!! .....” Table 16: Checklist used for manual labeling of problematic/illegal apps Attribute ID P1 Violence and Bullying - Depictions of gratuitous violence are not allowed. Apps should not contain materials that threaten, harass or bully other users. Hate Speech - Apps advocating against groups of people based on their race or ethnic origin, religion, disability, gender, age, veteran status, or sexual orientation/gender identity. Illegal Activities - Apps for unlawful activities, such as the sale of prescription drugs without a prescription. Gambling - Apps allowing content or services that facilitate online gambling, including but not limited to, online casinos, sports betting and lotteries, or games of skill that offer prizes of cash or other value. Illegal Activities - Apps for unlawful activities, such as the sale of prescription drugs without a prescription. Description Reviews Description and Examples Does the app description indicate any relationship to following P2 Do the users mention about the app related to the any points mentioned in P1 ? domain knowledge on the app market eco system and by observing some samples from the removed apps dataset. These checks were carried out similar to previously mentioned counterfeit search in Section 9.5 by searching in Google Play Store, Google Search Engine and our master database. 9.8 Developer Banned Apps Developer banned apps were identified by checking whether developer’s all other apps were deleted by querying our master database and searching in Google Play Store. 9.9 Top 100 Apps Table 20 shows the top 100 non spam apps we identified when we ranked the applications in the descending order of total number of downloads, total number of user reviews, and average rating. 9.10 Manually Identified Bigrams and Trigrams Table 22 shows the manually identified bigrams and trigrams from the top 100 app descriptions which are associated with describing a functionality of an app. Table 17: Checklist used for labelling of counterfeit apps Attribute Name ID C1 Description and Examples Does the app name resemble another application existing in the Android market? E.g. Fruit Sniper Free -air. com. oligarch. us. FrootSniperFree (Counterfeit) vs. Fruit Sniper Free - com. rongbong. fruitsniperfree (Original) - App name is exactly the same E.g. SmsVault - PSS. SmsVault (Counterfeit) vs. SMSVault Lite - com. macrosoft. android. smsvault. light (Original) - Similar app names E.g. Volume Booster - Amplifier - aa.volume.booster (Counterfeit) vs. volume booster com.feaclipse.volume.booster (Original) - Similar app names Description C2 Does the app description resemble another application existing in the Android market? E.g. Description of Volume Booster - Amplifier (Counterfeit) “...the volumes will be restored to their previous values. Of course you can adjust the volumes with the phone’s hardware volume keys or via menu or other applications, so no worries there. Just as a tip for you,....” E.g. Description of volume booster (Original) “...the volumes will be restored to their previous values. Of course you can adjust the volumes with the phone’s hardware volume keys or via menu or other applications, so no worries there. Just as a tip for you,....” C3 Does the app description itself mention the app as a clone of another application? E.g. Rain of Blocks (zeljkoa.games.rob) is a counterfeit of Tetris (com.ea.game.tetris2011_ row) “... Rain of Blocks is a Tetris clone that will take all your free time away .....” Logo & Screenshots C4 Does the app logo or the screenshots resemble another application existing in the Android market? Reviews C5 Do users complain about app being counterfeit in reviews? E.g. School tetris (me.mikegol.stetris) is a counterfeit of Tetris (com.ea.game.tetris2011\ _row) - Users mentions the app being a counterfeit in reviews. “Basic but not bad Controls need work but not bad have played a lot worse Terris clones”, “Didn’t work good. This game is a rip-off ” Table 18: Checklist used for manual labeling of adult content apps 9.11 Attribute Description Name Content rating ID A1 A2 A3 Description and Examples Does the app description indicate potential inclusion of pornography? Does the app name indicate potential inclusion of pornography? Does the content rating of the app is ”High Maturity”? Screenshots & App icons A4 Do provided screenshots or the icons of the app indicate potential inclusion of pornography? Reviews A5 Do the users mention about inclusion of pornography? This condition alone was not used to decide an app as Adult content as there are many legitimate applications with content rating as ”High Maturity”. It was rather used as an assistance for borderline cases. Automatically Identified Bigrams and Trigrams We identified bigrams and trigrams that a present in at least 10% of the instances in the sets spam and top-kx in order to augment the manually identified bigrams and trigrams mentioned in Section 5.1. For example, we in Table 22 we show the automatically identified bigrams and trigrams using the descriptions of spam and top-1x apps. Similarly we identified popular bigrams and trigrams for topk (k = 2, 4, 8, 16, 32) sets as well. Due to size limitations we do not show them here. We performed standard text pre-processing techniques:changing to lower case, removing punctuation, removing numbers, and normalise white spaces before identifying the bigrams and trigrams. 9.12 Stylometric Features This section describes the writing style related features we defined to map heuristic checkpoint S2 as described in Section 5.2. All of the below features were generated on text which are converted to lower case. We used Natural Language Toolkit (NLTK) 6 in Python to generate these features. 1. Total number of characters in the description: Total number of characters in the app description text without the white spaces. 2. Total number of words in the description: The description text was tokenised using white spaces and each resulting element was considered as a word. 6 www.nltk.org Table 19: Checklist used for manual labeling of developer deleted apps Attribute ID D1 Description Developer ID Description and Examples Does the app description indicates app being deprecated/discontinued or replaced by a new application. E.g. Barrier Plus - App description indicates discontinuation “.... This app will be ended on 12/25/2013 please consider to install ”SmartGate” thank you for using this app ....” E.g. Norton Utilities task killer - App description indicates discontinuation. “.... Symantec will discontinue selling Norton Mobile Utilities as of Wednesday, January 22, 2014. For more information ....” D2 Is there another app with the same app name by the same developer showing an implicit indication of the discontinuation. E.g. App DroidGuard(DroidGuard.main) was deleted from the market “However currently there is another app DroidGuard(com. dynacolor. droidguard. main ) with same description. This indicates developer has replaced the old app with a new application. ....” E.g. Smart App Protector(App Lock+) (com.sp.protector) “Developer deleted this application and released free version of the same (com. sp. protector. free ) with a Premium membership option.’ Table 20: Top 100 apps Facebook Gmail Google Play services Candy Crush Saga Torch Tiny Flashlight GO Launcher EX LINE Pandora internet radio Tango Messenger Chrome Browser Google Shazam Dropbox Flipboard Your News Magazine ChatON Google TexttoSpeech Despicable Me PicsArt - Photo Studio WeChat SuperBright LED Torch MX Player Talking Tom Cat Free Ant Smasher Mobile Security & Antivirus Jetpack Joyride TuneIn Radio Barcode Scanner Pool Billiards Pro Kindle Glow Hockey DragonFlight for Kakao DEER HUNTER 2014 Paradise Island Badoo Meet New People Evernote Google Maps Street View Instagram Subway Surfers Viber Pou Twitter Drag Racing Fruit Ninja Free Talking Tom Cat 2 Free Google+ Google Play Music Voice Search Samsung Push Service Samsung Link (AllShare Play) Opera Mini browser for Android Hill Climb Racing GO SMS Pro Angry Birds Rio ZEDGE Camera360 Ultimate Netflix eBay Advanced Task Killer GO Locker Google Earth SoundHound 3D Bowling Google Play Games Words With Friends Free Waze Social GPS Maps & Traffic Clash of Clans Castle Clash 3. Total number of sentences in the description: We used the English “pickle” tokeniser provided with NLTK library for this. 4. Average word length: Total number of characters divided by the total number of words. 5. Average sentence length: Total number of words divided by the total number of sentences. 6. Percentage of upper case characters: Percentage of upper case characters from the total alphabetical characters. 7. Percentage of punctuations: Percentage of punctuation characters from the total number of characters. Characters used as punctuations are shown in Table 26. 8. Percentages of numeric characters: Percentage of numeric characters from total number of characters. 9. Percentage of non-alphabet characters: Percentage of non-alphabetic characters(e.g. numeric charac- YouTube Google Search WhatsApp Messenger Angry Birds Skype Facebook Messenger Temple Run Temple Run 2 AntiVirus Security FREE Google Translate Adobe Reader Hangouts Google Play Books Google Play Movies & TV Google Play Newsstand KakaoTalk Free Calls & Text Brightest Flashlight Free Angry Birds Seasons Photo Grid Collage Maker Angry Birds Space Yahoo Mail Top Security & Anti Virus FREE Firefox Browser for Android ColorNote Notepad Notes To do Angry Birds Star Wars Adobe AIR Shoot Bubble Deluxe Perfect Piano HP Print Service Plugin Dolphin Browser DrḊriving Easy Battery Saver GROUP PLAY ters, punctuation) from total number of characters. 10. Percentage of common English words: We used the list of common English words provided by Canales et al. [10] which consists of three components: Conjunctions, Interrogatives, Prepositions.. Some examples of each category are shown below and the total list of words are shown in Table 23. E.g. Conjunctions - after, although Interrogatives - where, which Prepositions - about, below 11. Percentage of personal pronouns: We used the list of personal pronouns provided by Canales et al. [10] under three categories: first person, second person, and third person. Some examples of each category are shown below and the total list of words are shown in Table 24. E.g. first person - I, me Table 13: Collected metadata through the crawl Metadata Description App id Java package name of the app. This uniquely identifies an app in Google Play Store. Name of the app Category of the app (Out of 30 predefined categories in Google Play Store). App description (Max limit 4000 characters ) Whether app is free or paid and the price of app for paid apps. Content rating defined by the developer (Everyone, High maturity etc.) Number of downloads (in ranges 5001000 etc.) Size of the APK file in KB Developer defined version of the app. Minimum Android version the app can be executed on. Last date the developer updated the app. Total number of users reviewed the app. Total number of users gave 5 start rating for the app. Total number of users gave 4 start rating for the app. Total number of users gave 3 start rating for the app. Total number of users gave 2 start rating for the app. Total number of users gave 1 start rating for the app. Average rating of the app. Link to the developer’s website. Developer’s email address. Name of the developer. Unique identifier of the developer for Google Play Store. Link to the privacy policy. Default number of user review text returned for an app page. App ids of similar apps recommended by Google. App ids of apps from the same developer. App name App category App description Price Content Rating Number of downloads Size Version Minimum Android version supported Last updated date Total reviews Total 5 star ratings Total 4 star ratings Total 3 star ratings Total 2 star ratings Total 1 star ratings Average rating Developer website Developer email Developer name Developer ID Privacy policy User reviews Similar apps Developer’s other apps second person - You, your third person - He, her 12. Percentage of emotional words: We used the list of emotional words such as exited, happey, and suprised, provided by Mukherjee et al. [44] for this and the total list of words are shown in Table 25. 13. Percentage of misspelled words: A common method used in spam emails is to deliberately misspell words so that messages can bypass the keyword based spam filters [33]. We used this feature to check whether this is happening in app spam. We used the common misspellings list provided by Wikipedia [61]. 14. Percentage of words containing alphabet and numeric characters: Common technique used in spam emails is the use of numbers to replace some alphabet characters. Numerical digits which have similar appearances to some letters are used in order to bypass keyword based spam filters (E.g. Instead of “cool” using “c00l”). We tried to capture this by using this feature. 15. Automatic readability index (AR) [58]: Auto- Table 21: Manually identified bigrams and tr-grams using the descriptions of top 100 apps which are associated with describing a functionality (size: 100) play games app for android to find is used to way to match your app for stay in touch that allows you have to buy your to use keeps you tap the makes it easy is a global is a free on the go turns on game from on your phone it helps is a tool lets you support for game of to discover this game navigation app your answer is an easy app which train your are available you can learn how your phone core functionality is a smartphone to play this app to enjoy you to is effortless try to action game take a picture follow what brings together discover more more information is an app in your pocket make your life delivers the full of features is a simple need your help how to play brings you play with makes mobile drives you game on on android game which is the game get notified get your to search a simple available for key features is available take care of can you beat its easy allows you take advantage save your is the free choose from play as learn more face the challenges your device with android offers essential for android it gives enables you at your fingertips to learn it brings is a fun strategy game your way application which helps you make your matic readability index is a readability metric which can be used to measure the understandability of a text. AR = (4.71 ∗ average word length) + (0.5 ∗ average sentence length) − (21.43) 16. Flesch readability score (FR) [19]: Similar to AR, Flesch readability score is also a measurement of readability. total words F R = 206.835 − 1.015( total )− sentences total syllables 84.6( total words ) 9.13 Description Similarity Related Features This section describes the writing style related features we defined to map heuristic checkpoint S6 as described in Section 5.6. 1. Total number of apps the developer has: The total number of apps developer has published. 2. Total number of apps with english language description which can be used to measure similarity: Total number of apps with a English language app description. 3. For a given app, number of apps having a cosine similarity higher than 60% from the same developer: We converted the each app description to vector-space-model after preprocessing the tex descriptions using the standard preprocessing techniques: changing to lower case, removing punctuation, tokenise using whitespaces. Then each descprion was represented as vector containing frequency of words. E.g. dj = (w1,j , w2,j , ...., wt,j ) where w1,j is the frequency of word w1 in document dj . Then cosine similarity between two documents di and dj is defined as. cos(di , dj ) = di .dj ||di || ||dj || Then for given application we counted how many apps are there with over 60% cosine similarity from the same developer. Table 22: Automatically identified bigrams and trigrams using the descriptions of spam and top-1x apps (size: 218) play games ads in advertisement can support all the and share app on as samsung galaxy been tested on contact us default after reboot device is devices such as download the follow us for free free great from your google play has been have implemented some how to if your device implemented some ads in settings advertisement in the is not it is live wallpaper live wallpapers more free need put note if now with of the on android on latest on phone instead on your over million please contact put the app resets to samsung galaxy s screen touch settings advertisement can such as tap the tested on latest the best the most the top this app this is to add to default after to get to the to use home use home us if your wallpaper has been wallpaper on wallpaper resets to wallpapers we want to we have implemented will need put with the you can your device your friends your own your wallpaper resets you will you can ads in settings after reboot all your and the app on phone as you can be contact us if develop more device is not d live download the free for a for the free great live galaxy s great live has been tested home menu if you if your wallpaper in a instead of in your is not supported latest devices live wallpaper has live wallpapers this more free great need put the note if your of sd of the screen one of on latest devices on the on your home phone instead please contact us reboot you resets to default s and sd card some ads such as samsung terms of the app the free the screen the wallpaper this application this live to be to develop to make touch the to your use home menu us on wallpaper live wallpaper on your wallpapers this wallpapers we have way to will be with a with your you have your device is your home your phone you to you will need get notified advertisement can after reboot you and more app is as samsung been tested can support default after develop more free devices such d live wallpaper easy to for android for your from the get the great live wallpapers have implemented home screen if your implemented some in settings instead of sd is a is the latest devices such live wallpaper on live wallpapers we more than need to not supported of sd card of your on facebook on phone on twitter or tap phone instead of put the reboot you will samsung galaxy screen to settings advertisement some ads in support our tested on the app on the game the screen to the world this game this live wallpaper to default to develop more to play to use up to us if wallpaper has wallpaper live wallpaper wallpaper resets wallpapers this live wallpaper to we have will need with friends you are your android your favorite your home screen your wallpaper you want 4. For a given app, number of apps having a cosine similarity higher than 70% from the same developer: Same as above with 70% similarity. 5. For a given app, number of apps having a cosine similarity higher than 80% from the same developer: Same as above with 80% similarity. 6. For a given app, number of apps having a cosine similarity higher than 90% from the same developer: Same as above with 90% similarity. 9.14 Appid Related Features Appid is the unique identifier of an app in the Google Play Store. It follows the Java package name convention. Some example of app ids are, Table 23: Common English words a after because but for if now or since that unless whenever wherever yet what whose why about after amid around before by despite except following in like of onto over plus save through towards unlike upon with an although before either how neither once provided so though until where whether where who when whether above against among as behind concerning down excepting for inside minus off opposite past regarding since to under until versus within the and both even however nor only rather then till when whereas while which whom how abroad across along anti at but considering during excluding from into near on outside per round than toward underneath up via without Table 24: Personal pronouns i my ours we yours he herself his itself theirs themselves me myself ourselves you yourself her him it she them they mine our us your yourselves hers himself its their themself E.g. com. google. android. apps. maps com. facebook. katana com. rovio. angrybirds We used the following features to map the heuristic checkpoint S7 related to app id. 1. Number of characters: Length of the app id in characters 2. Number of words: Number of words in the app id which are separated by the period symbol. 3. Average word length: Total number characters to the total number of words. 4. Percentage of of non-letter characters to total characters: Total number of non-alphabet characters to the total number of characters. 5. Percentage of upper case characters to total letter characters: Total number of upper case characters to the total number alphabet characters. 6. Presence of absence of words in the app name matching parts of the appid : We checked whether or not parts of app name appear in the app id by regular expression match. The idea behind this is that automatically generates apps might not contain meaningful app ids where in practise it is customary for developers to give meaningful app ids as shown above. Table 25: Emotional words aggressive annoyed cautious depressed discouraged embarrassed excited frustrated helpless humiliated innocent lonely optimistic proud relieved shocked surprised undecided abundance admirable amazing bargain best breakthrough brimming clear confidence cuddly delightful ecstatic enjoy exotic flair genius heavenly impressive luxurious speed superb supreme treasure ultimate zest alienated anxious confused determined disgusted enthusiastic exhausted guilty hopeful hurt interested mischievous paranoid puzzled sad shy suspicious withdrawn ace adore appealing beaming better breeze charming colorful cool dazzling dynamic efficient enormous expert free great ideal incredible outstanding splendid sweet terrific ultra unique angry careful curious disappointed ecstatic envious frightened happy hostile hysterical jealous miserable peaceful regretful satisfied sorry thoughtful absolutely active agree attraction beautiful boost brilliant clean compliment courteous delicious easy enhance excellent exquisite generous graceful immaculate inspire royal spectacular sure treat unbeatable wow Table 26: Punctuation characters ! $ ’ * : = @ _ ‘ } ” % ( + . ; > [ ^ { ∼ # & ) , / < ? \ ] | 7. Percentage of bigrams with 1 non-letter to total bigrams: Total bigrams with one non-alphabet character to the total number of bigrams. 8. Percentage of bigrams with 2 non-letters to total bigrams: Total bigrams with two non-alphabet characters to the total number of bigrams. 9. Percentage of bigrams with 1 non-letter or 2 non-letters to total bigrams: Total bigrams with at least non-alphabet character to the total number of bigrams. 10. Percentage of trigrams with 1 non-letter to total trigrams: Total trigrams with one non-alphabet characters to the total number of trigrams. 11. Percentage of trigrams with 2 non-letters to total trigrams: Total trigrams with two non-alphabet characters to the total number of trigrams. 12. Percentage of trigrams with 3 non-letter to total trigrams: Total trigrams with three non-alphabet characters to the total number of trigrams. 13. Percentage of trigrams with 1, 2 or 3 non-letters to total trigrams: Total trigrams with at least one non-alphabet character to the total number of trigrams. 9.15 App Metadata Features This section describes the metadata we used in Section 9.15 which are not associated with a specific heuristic checkpoint. 1. App category: In Google Play Store there are 30 app categories and developer has to assign a single category to the app when they publish the app. 2. Price: Price of the app in USD with 0 for free apps. 3. Length of app name: characters. Length of the app name in 4. Size of the app in KB: Size of the app executable. 5. Developer’s website address available: Whether or not the developer has given a website address. 6. Developer’s website address is reachable: Whether or not the given web site address is reachable. 7. Developer’s email address is available: Whether or not the developer has given an email address. 8. Privacy policy for the app is given: Whether or not a privacy policy is available for the app.