Panama Papers: Tools to Investigate Data
Transcription
Matthew Caruana Galizia & Mar Cabra
http://bit.ly/icijplatformseijc16

grave face
photo of the editor’s kids
no computer in sight
heaps of papers

The difference now is our tools and applications.

Four years ago...
260 GB
Nuix (to search documents locally)
Forum I (Fudforum, implemented by Sebastian Mondial)
Forum II (Vanilla, implemented by Chris Zubak-Skees)
Interdata (DTSearch, implemented by Duncan Campbell & Matt Fowler)
Offshore Leaks Database (done with La Nación’s data unit in Costa Rica)
The most popular product of the past years at ICIJ (and CPI)

Let’s build a stack!
controlling application
ocr engine
blacklight
file to text conversion
index
web server
operating system, operating system, operating system
Open source *first*

Who are our users?
Skills
Needs
The developer
Knows all about data (France)
The “Watergate-type reporter”
Investigated the President (Paraguay)

What are our needs?

Communicate

Search documents
3 million files x 10 seconds per file = 1 year
queue
35 machines extracting text from files
index
1 year ÷ 35 machines = 11 days
Scanned document:
Extracted text:

Discover beneficial owners
Visual is good (for reporting)

MAGIC!!
● I click on “dots” and I find stories!
● I discover stories thanks to fuzzy searching
● Find shortest path

Wow!
● Cypher queries
● Public widgets
● API
● https://offshoreleaks.icij.org with download in CSV and Neo4j

    MATCH (a:Officer),(b:Officer)
    WHERE a.name CONTAINS 'Smith' AND b.name CONTAINS 'Grant'
    MATCH p=allShortestPaths((a)-[:OFFICER_OF|:INTERMEDIARY_OF|:REGISTERED_ADDRESS*..10]-(b))
    RETURN p LIMIT 50

Next steps
entity name recognition

    From: Igor Czernecki
    Sent:
    To: Mossack Fonseca & Co. (Attorneys-at-Law)
    Cc: Saran Harris
    Subject: Re: Payment instruction on the basis of lease agreement

    Dear Mrs Rogers,
    I would like Dagar to write an invoice for;
    GEMINI HOLDING Sp. z o.o.
    3/7 Friedleina Street, 30-009 Krakow, Poland

datashare
[email protected] [email protected]

Thanks!
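The extraction-time figures on the "Search documents" slide (3 million files, 10 seconds per file, 35 machines) can be checked with a few lines of arithmetic. This is a minimal sketch, not the ICIJ pipeline: the function name `processing_days` is hypothetical, and it assumes the work splits evenly across machines with no queueing overhead. Under those assumptions the slide's "1 year" and "11 days" read as rounded-up estimates.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def processing_days(num_files, seconds_per_file, machines=1):
    """Estimate wall-clock days to process all files.

    Assumes every file takes the same time and the load is split
    perfectly across machines (an idealisation, not a measurement).
    """
    total_seconds = num_files * seconds_per_file
    return total_seconds / machines / SECONDS_PER_DAY


# Figures taken from the slide.
single = processing_days(3_000_000, 10)
parallel = processing_days(3_000_000, 10, machines=35)

print(f"1 machine:   {single:.0f} days")    # about 347 days, i.e. roughly a year
print(f"35 machines: {parallel:.1f} days")  # about 9.9 days
```

Dividing a full 365-day year by 35 gives about 10.4 days, which is presumably where the slide's rounded "11 days" comes from.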
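The shortest-path Cypher query shown on the slide hard-codes the two names ('Smith' and 'Grant'). A hedged sketch of how the same query might be reused for other name pairs is to pass the names as query parameters. The query text below is the slide's, with only the two literals replaced; the names `SHORTEST_PATH_QUERY` and `shortest_path_params` are hypothetical, and the sketch assumes a Neo4j version that accepts `$parameter` syntax and a locally loaded copy of the Offshore Leaks database.

```python
# The slide's query, with the two hard-coded names turned into parameters.
SHORTEST_PATH_QUERY = """
MATCH (a:Officer),(b:Officer)
WHERE a.name CONTAINS $name_a AND b.name CONTAINS $name_b
MATCH p=allShortestPaths((a)-[:OFFICER_OF|:INTERMEDIARY_OF|:REGISTERED_ADDRESS*..10]-(b))
RETURN p LIMIT 50
"""


def shortest_path_params(name_a, name_b):
    """Build the parameter dict for SHORTEST_PATH_QUERY.

    Rejects empty names, since CONTAINS '' would match every officer.
    """
    cleaned = [name_a.strip(), name_b.strip()]
    if not all(cleaned):
        raise ValueError("both names must be non-empty")
    return {"name_a": cleaned[0], "name_b": cleaned[1]}


params = shortest_path_params("Smith", "Grant")
print(params)

# Running it requires the third-party `neo4j` driver and a live database;
# the URI and credentials here are placeholders, not real settings:
#
#   from neo4j import GraphDatabase
#   driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
#   with driver.session() as session:
#       for record in session.run(SHORTEST_PATH_QUERY, params):
#           print(record["p"])
```

Passing names as parameters rather than formatting them into the query string avoids quoting problems with names that contain apostrophes.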