Challenges and Big Opportuni es for Data Scien st at VCCORP
Transcription
Challenges and Big Opportuni es for Data Scien st at VCCORP
ChallengesandBig Opportuni3esforData Scien3statVCCORP HoangAnhTuan CTOAdmicro-VCCORP [email protected] 1 Content CompanyOverview BigDataatVCCORP Ourmainchallenges 2 COMPANYOVERVIEW 3 VCCORPMilestone 4 FOUNDERS Mr.VuongVuThangistheFounderandChairmanofVCCorp HissuccessindevelopingVCCtobetheleadingnew-mediacompanyinVietnambeganwhenhefoundedhisfirst onlinecommunity,TTVNOL,fromagaragestart-upin2000.AMerfoundingthisnetwork,hefoundedthefirstonline media&newsportalTintucvietnamin2002.In2005,hefoundedthefirstprivatecompanyinmobilevalueadded servicesandinternetcontent,whichformedtheiniRalfoundaRonofthecurrent-dayVCC. Asanaturalentrepreneur,ThanghasbeenthefirstmoverinalltheemerginginternetsectorsinVietnamincluding mediacontent,socialnetwork,mobilecontent&services,andecommerce.Naturally,VCCbecameaninnovaRve leaderofVietnam’sinternet&mediaindustry.Asthetechnologystrategist,hehasbeenthechiefarchitectof disrupRvetechnologiesinVietnaminthelast10yearsincludingCMS,keyportaltechnologies,andcloudcompuRngto nameafew. Mr.VuongVuThang Founder,Chairman Mr.NguyenTheTanistheCo-FounderandCEOofVCCorp Withunparalleledknow-howininternetmoneRzaRon,NguyenTheTan’sleadershipisdrivingthegrowthofVCC.His resultsareapparentasheoversaw100%annualgrowthinadverRsing,mobileservicesandecommercerevenueforeach ofhislast6yearsatVCC.Asastrategicvisionarywithintheindustry,NguyenTheTan’sworkhasmadeatremendous impactwithintheVietnamesemarketplace. BeforejoiningVCC,TanwasVice-DirectorofoneofthelargesttelcocompaniesinVietnam,Vie[elFixedTelecom.Before Vie[elTelecom,hewasthedivisiondirectorataleadingsoMwareandsystemintegraRoncompany,CMC.Therehebuilt ae-librarysoluRonspla]ormandexpandedthedistribuRonofCMC’ssoMwareandsoluRonstowardsbroadersegments Mr.NguyenTheTan withinthemarket. Co-Founder,CEO 5 VCCORPOverview Overview ü FirstmoverDNA ü 50%YoYGrowth ü 43Mwebaudience ü 38Mmobile audience ü 1,700employees Investors 6 VCCORPMARKETCOVERAGE 43Minternetuserreach(97.6%ofVNinternetpopulation) 38Mmobileuserreach(95%ofVNSmartphonepopulation) 10,000+online&mobileadvertisers 100,000+smallbusinessmerchants 12Me-marketplacevisitors&buyers LargestAdnetworkinVietnamwith 1000+publishers,including200+top- publishers, 30 ofthemareexclusive 22leadingproductswithpresencein20+verticals;14sitesareintop100 websitesinVietnam(news,finance,family,teenage,auto,high-tech,online advertising,B2CandC2C,contentconsumptionmobile) 7 BIGDATAATVCCORP 8 BigDatainVCCORP Inthe2007,BigDatawasappliedearlyinBaambooSearchEngine Since2009,BigDataplatformhavebeeninstalledforservingad systeminVCCORP Currently,BigDataplatformisbeingdevelopedandimprovedin majorareas. Advertisement DigitalContent Ecommerce Game Currentstaffs:100DataEngineers 9 ThechallengesinVCCORP BigDataskillsetsin-house Thelarge-scaledata Thehugeamountofspecificproblems,spreadingover manyareaswhichisrequiredcreativeproblemsolving,self-motivedperson Humanresourceisnotenough 10 SystemInfo 11 OURMAINCHALLENGES 12 Userbehaviors AdOptimization CoreNLPanditsapplication NewsDistribution RecommendationEngine VccorpAnalytic 13 USERBEHAVIOR 14 UserbehavioranalyRcs Threemainprojects: Demographic:gender,age Userprofile:behavior,interest Crossdevices:trackinguseronmultipledevices 15 Demographic-Userprofiling Detectuserprofileincludinggender,age,userinteresting(12-basedinterests),long term,shortterm.Basedondata Browsinghistory Keywordsearchhistory Timeusage,timeonsite Data: 43Musersinpc 38Musersinmobile 1Terabyteloggingdata Result–accuracy: Gender:82.5% Age:67.5% 16 Systemoverview Sparkstreaming– FilternewURL Actioninwebsite Updateurl content Updateuseractionforbatch processing Longterm andshort term OurLDAmodel SVM- predict user profile UserProfile 17 Demographic-Behavior 18 Benchmark Benchmarkingdata:43Musers,200Mdocuments,30000*10^6actions, 23otopics VCCORPcluster:20nodes,640cores,640GBram OurModel Oldclassificationmodel LDAwithSparkMLLIB Time:18h, Accuracy: Time:16h, Accuracy: Time:36h, Accuracy: Recall:92% Recall:92% Recall:91% Gender:82.5% Age:67.5% Gender:79.5% Age:63.4% Gender:75.1% Age:60.1% 19 CrossDevice 20 Crossdevice Weusedinformation: BothUser-IPandtimestampintheirdevices Websiteandcategorieshistory Userdemographicanduserinterest Result: Accuracy:60% Numberdetectedusers:11M 21 ADOPTIMIZATION 22 AdmicroOverview #1adnetworkinVietnam:cover38%marketshare 200+toppublishersinVietnam 10,000+advertisers 4Bpageviewspermonth 1,5Bimpressionsperday 22leadingadproducts 43Minternetuserreach(97.6%ofVNinternetpopulation) 38Mmobileuserreach(95%ofVNSmartphone population) 23 AdOpRmizaRon Theadvancedtechniqueswereimplemented: Personalization AudienceTargetingPlatform RealTimeBidding Retargeting ContextualTargeting SSP/DSP/DMP 24 PersonalizaRon Intraditionaladvertising,adsaredisplayedtoeveryonein thefixedlocation Bycontrast,personalizationtechniquewillchoosethebest fitadsforeachuser: 43Minternetuser 10.000ads 430Bestimatedoperationsforeachtime Usingmultipletechnologies: Highloadcapacitywebserver Optimizationalgorithms Estimateandpredictionalgorithms 25 AudienceTargeRngPla]orm Advertiserscantargettheirparticularaudience Subsetoftheaudiencecanbeprebuiltby“setoperators” ofuserproperties Location Demographic(gender,age,relationship…) Interest/Behavior Especially,anaudiencecanbemadeupfrom: Listofemail/phonenumber Automaticallyfindsimilaraudiences(look-alike) 26 RealTimeBidding Atransaction(sell/purchase)ofadimpressionsis immediatelyproceededwhenanaudiencetriggerthe adzones 27 RealTimeBidding Atransaction(sell/purchase)ofadimpressionsis immediatelyproceededwhenanaudiencetriggerthe adzones Challenges: 80msisthemaximumtimeofatransaction 1000sitesinVN 4.5billionrequest/day NumberofTransaction:$200,000/day 28 RetargeRng Retargetingisapowerfulbrandingandconversion optimizationtool Adswillfollowcustomersaftertheydotheshopping Adswillbedisplayedin AnyWebpages Multipledevices 29 ContextualTargeRng Contextualtargetinglooksat thecategoryorkeywordsofthecurrentpageaconsumer isviewingandthenservesthemadsthatarehighly relevanttothatcontent. Categorytarget Keywordtarget Weimplemented: Contentclassificationsystems(LDAP) Keywordindexandsearchenginesystem 30 SSP/DSP/DMP 31 NEWSDISTRIBUTION 32 NewsdistribuRon VCCORPpossessesmanylargeonlinenewspapersinVN WeareproudofbecomingthefirstcompanyinVietnam abletoimplementanautomaticmethodforpublishing news AnalyticSystem Autopublish 33 NewsdistribuRon Akeychallengeofnewswebsitesistohelpusersfind thearticlesthatareinterestingtoread Manytechniquesareappliedas: Real-timeengagementstatistic Personalization NLP Eventdetection Trendingdetection Breakingnewsdetection 34 35 36 CORENLP 37 CORENLP System Accuracy (%) (VCCorp) Others Speed(/s) Word Segmentation 98.8 97.0 (VLSp) 47,855 tokens POS tagging 94.5 93 (VLSp) ~50k tokens NER 87.0 85.0 (Baomoi.com) ~22k tokens Chunking 84.0 81.0 (VLSp) 800 tokens UniversalDependency Parser 72.0(UAS), 66.0 ( LAS) 68.28%(UAS),66.30% (LAS) 1200 sentences Co-reference resolution 57.0 N/A 106 docs VLSp:http://vlsp.hpda.vn:8080/demo/?page=resources 38 EntitylinkingApplication question AilàgiámđốcNgân hàngACB? [entity1,relation, entity2] … [entityN,relation, entityN] Knowledge base answer ÔngNguyễnVănHòa làgiámđốcNgânhàng ACB QueryandretrieveinformationfromKB 1 Input sentence Dependency tree "Ông Dũng là Thạc_sĩ ngành cơ_điện Trường Đại_hoc New_York (Mỹ) và Thạc_sĩ Quan_hệ quốc_tế Trường Đại_học Georgetown (Mỹ) . “ <Rawtext> Outputlinks [ÔngDũng,là,Thạc_sĩ] [ÔngDũng,là,Thạc_sĩngành cơ_điện] [ÔngDũng,là,Thạc_sĩQuan_hệ quốc_tế] … <CoNLLformat> <Entity1,relation,Entity2> 2 Input sentence Dependency tree "Trong năm 2015, lãi trước thuế của BIDV đạt 7.944 tỷ đồng (tăng 26,16 %), lãi sau thuế đạt 6.382 tỷ đồng (tăng hơn 28%), vốn_điều_lệ của BIDV tăng lên 34.187 tỷ đồng, tổng tài_sản đạt 850.748 tỷ đồng (tăng 30,8%). Outputlinks [lãitrướcthuếBIDV,đạt7.944tỷtrong, năm2015] [lãisauthuếBIDV,đạt6.381tỷđồngtrong, năm2015] [vốnđiềulệBIDV,tăng34.187tỷđồng trong,năm2015] [tổngtàisảnBIDV,đạt850.748tỷđồng trong,năm2015] … <Entity1,relation,Entity2> <Rawtext> <CoNLLformat> CORENLP-KnowledgeNetwork BuildingRelevantBrandusingDeepLearningand CoreNLP 42 CORENLP-NER 43 SenRmentAnalysis 44 SenRmentAnalysis Level:Doc,sentence,entity,aspectlevel Data:~1billionrecords,1TBprocessingdata Facebook:5Mpages,500kgroups News:500 Forums:200 Approach:UsingNLP+TopicModeling+DeepLearning Accuracy:~70% 45 SenRmentLexicon(Social) 46 AspectbasedsenRmentanalysis 47 RECOMMENDATION ENGINE 48 RecommendaRonEngine Buildingpurchaserecommendationsystemfore-commercesites Oursuggestionbasedoninformation PurchaseHistoryandweb-browserhistory Productandbuyersknowledge 49 RecommendaRonEngine Thealgorithmapplied: NER+DeepNeuralNetwork NetworkandProductknowledge Collaborativefiltering F-CTR:combinescollaborativealgorithmandproductknowledge 50 RecommendaRonEngine–deeplearning 51 RecommendaRonEngine 52 RecommendaRonEngine–CollaboraRveFiltering 53 RecommendaRonEngine–Ranking NER Knowledge History Recommender Rank ( knowledge, NER, history ) r ( i ) = ∑ λk rk (i ) k 54 RECPerformance Increase45%trafficfromtheRecommendEngineboxes 55 RecommendaRonEngine-News 56 VCCORPANALYTIC 57 VccorpAnalyRc DevelopingAnalyticToolforwebsites,showinggoodperformancein comparedwithGoogleAnalytic(GA)inVietnam Technologies: No-SQLselectedasdata-warehouseforlarge-scaledataanalysis Real-timeanalytic:Streaminglogging 58 VccorpAnalyRc-architecture 59 VccorpAnalyRc–Framework 60 VccorpAnalyRc Thealgorithmapplied: Samplingdata Abnormaldetection(removefaultclicks/sessions) Results:agoodcandidatetoreplaceGA,ensuringbothaccuracyand performance 61 SamplingData strata 2 size = N2 strata 1 size = N1 population size=N s1 RSWR s3 strata 3 size = N3 - N:Sizeofthesamplingdata - n:Totalstrata - 𝑁↓𝑖 :Sizeofthe𝑖↑𝑡ℎ strata s 2 s4 strata 4 size = N4 VccorpAnalyRc–abnormaldetecRon Regression 63 VccorpAnalyRc 64 OneMoreThing… 65 Thanks 66
Similar documents
Assessed Value Ad Valorem Tax Trend
Actual & Projected Change in Ad Valorem Assessed Value Entergy 1995-‐2025
More information