YaCy_CampusParty_20120825 print.key

Uncensorable, Untraceable Search Engines for Freedom of Information
Michael Christen, [email protected]
Campus Party 2012, Berlin
Abstract
Web search portals are vital decision tools that shape people's knowledge and
cultural values. Free content should be accessible with free search. Instead
of going through a centralized server that acts as a gatekeeper, keeps logs of
your searches and directs you to selected information, your own self-made
search engine can deliver information with no censorship and no tracking.
In this talk, search use cases such as project search, file search (with attached
downloader), faceted search with user-defined categories, social search and
peer-to-peer search are explained and demonstrated. You will become
familiar with search engine technology in general and with different software
modules that can be used to create amazing search portals with unusual
but useful functions in just a few minutes.
Uncensorable, Untraceable Search Engines for Freedom of Information
Talk at Campus Party 2012 Berlin - http://www.campus-party.eu/2012/
Michael Christen
[email protected], http://yacy.net
Human Rights
Knowledge is free
Access must be free
for everyone
Privacy is a human right
Human Rights statement from United Nations
UN World Summit on the Information Society 2003:
CHARTER OF CIVIL RIGHTS FOR A SUSTAINABLE KNOWLEDGE SOCIETY
(a) Knowledge is the heritage and the property of
humanity and is thus free.
(b) Access to knowledge must be free.
(c) Everyone has an unlimited right of access to the
documents of public and publicly controlled bodies.
(d) The right to privacy is a human right and is
essential for free and self-determined human
development in the knowledge society.
from: http://www.worldsummit2003.de/en/web/375.htm
Centralized Search Portals
can trace your behaviour
danger of censoring,
blocking, spamming
they own your data
Access to Information: the bridge between data and user
Free information can only be truly free if it can be
accessed with free search.
Free Data -> Search -> User
Free Data, among others:
• free software
• data under a Creative Commons license
• Open Access archives
Search, as it is today:
proprietary and centralized;
it traces you, and data can be
censored, blocked, removed or
spammed
User:
needs proprietary and
centralized software to
discover free content
Access to Information bridge between data and user
free information can only be truly free if it can be
accessed with free search
free Data
Ranking
Search
Ordering
User
Relevancy
Community
In a specific community people share the same relevancy criteria.
Ranking influences standards and opinions within a community!
Centralized Search Engines have a cultural impact on communities!
Your Own Search Engine
Independence
...from Centralized Search Portals: collect your own search
index and search in a special way as needed for the content.
Privacy
...you are the search engine operator: nobody can trace you!
Freedom
...of Information: no data access limits, no censoring, no
filtering, no user observation, no content spamming, your ranking
Requirements for a "homebrew" search engine
Examples
for use cases and possibilities.
Search Technology
Knowledge: Learn how the search engine components work.
Demo: A "Hello World" search engine is a good starting point
to hack.
Software Modules
Easy: Everyone must be able to install and operate the
software.
Available: The software must be free.
Hackable: APIs and transparency.
Examples for use cases and possibilities.
• your own search portal: projects + communities, share knowledge
• search for files (ftp/smb) ...with downloader?
• data protection & sanctuaries: persecuted content, torrents etc.
• topic-oriented (news-) feeds: federated search, your intelligence service
• distributed search: share your search index
• social search: share your search experience
Knowledge how search engine components work
• search server: web interface
• api: opensearch, gsa
• crawler: robots, balancer, queues
• search index: solr, schema, facets, ranking, moderation
• network interfaces: file, http, ftp, smb, oai-pmh
• parser: doc, pdf, xls, html, rss, zip, eml
• document cache
• monitoring: I/O, requests, Disk/RAM
• administration/steering
Knowledge how search engine components work
search server
crawler
api
search index
monitoring
network interfaces
document cache
parser
administration/steering
Knowledge how search engine components work
Easy
3-minute installation
just decompress and start
Available
all parts are free software
http://yacy.net
http://lucene.apache.org/solr/
Hackable
lots of APIs, many standards
Knowledge how search engine components work
Demo:
• curl -OL "http://archive.apache.org/dist/lucene/solr/3.6.1/apache-solr-3.6.1.tgz"
• tar xfz apache-solr-3.6.1.tgz
• cd apache-solr-3.6.1/example/
• java -jar start.jar
• open http://localhost:8983/solr/admin/
• curl 'http://localhost:8983/solr/update/json?commit=true' -H 'Content-type:application/json' -d '{"add":{"doc":{"id":"data1","title":"Hello World"}}}'
• curl 'http://localhost:8983/solr/update/json?commit=true' --data-binary @exampledocs/books.json -H 'Content-type:application/json'
• curl 'http://localhost:8983/solr/select/?q=*%3A*'
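The "Hello World" indexing step from this demo can also be scripted. The following is a minimal sketch in Python, assuming the Solr 3.6.1 example server from the demo is running on localhost:8983. The helper names build_add_command and post_update are illustrative and not part of Solr.

```python
import json
import urllib.request

# Update endpoint taken from the curl commands in the demo.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update/json?commit=true"


def build_add_command(doc_id, title):
    """Return the JSON body that adds one document (hypothetical helper)."""
    return json.dumps({"add": {"doc": {"id": doc_id, "title": title}}})


def post_update(body, url=SOLR_UPDATE_URL):
    """POST the JSON body to Solr. This needs a running server."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Only builds and prints the request body; call post_update(body)
    # once the example server is up.
    body = build_add_command("data1", "Hello World")
    print(body)
```

Building the body separately from sending it makes it possible to inspect the request before anything is written to the index.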
Knowledge how search engine components work
Demo:
• curl -OL "http://yacy.net/release/yacy_v1.04_20120709_9000.tar.gz"
• tar xfz yacy_v1.04_20120709_9000.tar.gz
• cd yacy
• ./startYACY.sh
• open http://localhost:8090
• solr search interface is at
http://localhost:8090/solr/select?q=*:*&start=0&rows=10
• start a web crawl at
http://localhost:8090/CrawlStartSite_p.html
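The Solr select URL shown in this demo takes a query plus paging parameters. A small sketch of how such URLs could be assembled in Python follows; the function name solr_select_url is hypothetical, and only the parameters q, start and rows from the slide are used.

```python
from urllib.parse import urlencode


def solr_select_url(query="*:*", start=0, rows=10,
                    base="http://localhost:8090/solr/select"):
    """Build a select URL for the Solr interface of a local YaCy peer.

    The base URL is the one from the slide. '*' and ':' are left
    unescaped so the result reads like the URL in the demo.
    """
    params = {"q": query, "start": start, "rows": rows}
    return base + "?" + urlencode(params, safe="*:")


if __name__ == "__main__":
    # First page, then the second page of ten results.
    print(solr_select_url())
    print(solr_select_url(start=10))
```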
your own
search portal
projects
+communities
share knowledge
Demo:
• Make a federated search portal for:
gnu.org, fsfe.org, campus-party.eu
• Add an FTP video archive from
ftp://dewy.fem.tu-ilmenau.de/CCC/
search engine
Create and Share
Project Steering
Discussion
Produce
Documents
Version Control
(micro)Blogging
Bugtracker
search for
files
(ftp/smb)
...with
downloader?
Demo:
• Choose "File Search" or open
http://localhost:8090/yacyinteractive.html
• After searching, click
"create a download script"
• Copy-paste the result to your terminal
data protection
& sanctuaries
persecuted
content
torrents etc.
Demo:
• Do an indexing of thepiratebay using the
sitemap provided by their robots.txt
• Use
http://localhost:8090/CrawlStartSite_p.html
and check the 'Sitemap URL' option.
topic-oriented
(news-) feeds
federated search
your intelligence
service
Demo:
• Feed YaCy with RSS feeds at
http://localhost:8090/Load_RSS_p.html
• Activate the scheduler to do this frequently
• Do a web search and add /date to the query
to order by date
• Change the page to RSS format by replacing
the html extension of the result page with
rss
• Read the search result page with your RSS
reader
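The URL rewriting described in this demo (append /date to the query, swap the html extension for rss) can be sketched in Python as follows. The function name to_feed_url is hypothetical; the query parameter name is taken from the yacysearch.rss?query=... URL used elsewhere in these slides.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def to_feed_url(result_url, extension="rss", order_by_date=True):
    """Turn a YaCy result page URL into a feed URL (illustrative sketch).

    Replaces the trailing 'html' of the path with the given extension
    and, if requested, appends the '/date' modifier to the query text.
    """
    parts = urlsplit(result_url)
    if not parts.path.endswith(".html"):
        raise ValueError("expected a result page URL ending in .html")
    path = parts.path[: -len("html")] + extension
    params = parse_qsl(parts.query, keep_blank_values=True)
    if order_by_date:
        params = [
            (key, value + " /date")
            if key == "query" and not value.endswith("/date")
            else (key, value)
            for key, value in params
        ]
    return urlunsplit(
        (parts.scheme, parts.netloc, path, urlencode(params), parts.fragment)
    )


if __name__ == "__main__":
    print(to_feed_url("http://localhost:8090/yacysearch.html?query=freedom"))
```

The resulting URL can be subscribed to in an RSS reader; passing extension="json" gives the JSON variant of the same result page.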
distributed
search
share
your search
index
YaCy has an integrated peer-to-peer protocol to connect to other
YaCy users.
But how can this scale? How are
peers connected?
Peer-to-Peer
Shared Search Index
distributed
search
share
your search
index
A Search Engine Cluster consists of
independent search engines in
the form of a search matrix.
Search Engine Cluster
[Diagram: a matrix of 15 "Search Engine" boxes]
vertical scaling: more performance
horizontal scaling: more documents
distributed
search
share
your search
index
We want to take the search
matrix out of the data center to
your home.
[Diagram: the matrix of 15 "Search Engine" boxes]
distributed
search
share
your search
index
The distributed search matrix in
your home is connected using a
peer-to-peer protocol.
[Diagram: a matrix of 15 "Search Engine Peer" boxes]
distributed
search
share
your search
index
The YaCy Search Engine Cluster
consists of independent search
engines, but they are connected
in an efficient way using a
distributed hash table.
[Diagram: peers arranged around a Distributed Hash Table (DHT)]
DHT-Store: Crawl the web, create a
web index, distribute
the index
DHT-Read: Search in a
Distributed Hash Table
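To make the store/read idea behind a distributed hash table concrete, here is a toy sketch in Python. It is not YaCy's actual partitioning scheme: the SHA-1 ring positions, the class ToyDHT and the peer names are all assumptions chosen for illustration, and everything runs in one process instead of over a network.

```python
import hashlib
from bisect import bisect_left


def ring_position(key):
    """Map a string to a position on a 2**32 ring (hypothetical choice: SHA-1)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class ToyDHT:
    """In-memory stand-in for a set of peers sharing one word index."""

    def __init__(self, peer_names):
        self.ring = sorted((ring_position(name), name) for name in peer_names)
        self.positions = [position for position, _ in self.ring]
        self.stores = {name: {} for name in peer_names}

    def responsible_peer(self, word):
        # The first peer at or after the word's position is responsible;
        # the modulo wraps around the end of the ring.
        index = bisect_left(self.positions, ring_position(word)) % len(self.ring)
        return self.ring[index][1]

    def store(self, word, url):
        """DHT-Store: hand an index entry to the responsible peer."""
        peer = self.responsible_peer(word)
        self.stores[peer].setdefault(word, set()).add(url)
        return peer

    def read(self, word):
        """DHT-Read: ask only the responsible peer for a word."""
        peer = self.responsible_peer(word)
        return sorted(self.stores[peer].get(word, set()))


if __name__ == "__main__":
    dht = ToyDHT(["peer-a", "peer-b", "peer-c"])
    dht.store("freedom", "http://www.gnu.org/")
    print(dht.read("freedom"))
```

Because every peer computes the same positions, a search has to contact only the peer responsible for a word rather than all peers, which is what lets such a network scale.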
distributed
search
Everyone can join the network.
Nobody can censor the search index.
share
your search
index
social search
Peer-to-Peer
share
Shared Search Experience
your search
experience
Peer-to-Peer
Shared Search Index
Knowledge how search engine components work
Demo:
• read http://seeks-project.info/wiki/index.php/Download#Download
• or just build seeks yourself:
> git clone git://seeks.git.sourceforge.net/gitroot/seeks/seeks
> cd seeks
> ./autogen.sh
> ./configure LDFLAGS="-Wl,--no-as-needed" --disable-opencv
> make
> cd src && ./seeks
• attach YaCy: use opensearch interface from
http://localhost:8090/yacysearch.rss?query=%query
• in seeks/src/plugins/websearch/websearch-config add the line
search-engine opensearch_rss http://localhost:8090/yacysearch.rss?query=%query yacy default
• set seeks as your web proxy at port 8250
• open your browser at http://s.s/websearch-hp
APIs in Search Interface - Opensearch, SRU
Facets
File Types, Protocols,
Domains, Authors,
user-generated ontologies
Every link is verified
before it is displayed: the content is loaded,
parsed and used to generate a search snippet.
Standards
SRU
APIs
Opensearch (search results with RSS), JSON, AJAX tools
Tools
search widget, ready-to-use code snippets to embed search everywhere
APIs in Search Interface - Opensearch
How to get Opensearch/JSON Search Results:
• do a normal web search in YaCy
• replace the 'html' extension of the result page URL with 'rss'
• for json, replace the 'html' extension with 'json'
> curl "http://localhost:8090/yacysearch.rss?query=foaf&maximumRecords=10"
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type='text/xsl' href='/yacysearch.xsl' version='1.0'?>
<rss version="2.0"
xmlns:yacy="http://www.yacy.net/"
xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<!-- very short example -->
<item>
<title>Friend of a Friend (FOAF) project</title>
<link>http://www.foaf-project.org/</link>
<pubDate>Fri, 23 May 2008 02:00:00 +0200</pubDate>
</item>
<item>
<title>FOAF - Wikipedia</title>
<link>http://de.wikipedia.org/wiki/FOAF</link>
<pubDate>Tue, 08 Jan 2008 01:00:00 +0100</pubDate>
</item>
<item>
<link>http://microformats.org/wiki/xfn-to-foaf</link>
<pubDate>Fri, 09 May 2008 02:00:00 +0200</pubDate>
</item>
</rss>
Opensearch Standard: http://www.opensearch.org
SRU Standard for Queries: http://www.loc.gov/standards/sru/specs/search-retrieve.html
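An Opensearch RSS result like the one on this slide can be read with any XML parser. The sketch below uses Python's standard library on a shortened copy of the slide's example; the function name parse_items is hypothetical, and a real YaCy response may contain more elements than this sample.

```python
import xml.etree.ElementTree as ET

# Shortened sample modelled on the slide; not a complete YaCy response.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
<item>
<title>Friend of a Friend (FOAF) project</title>
<link>http://www.foaf-project.org/</link>
<pubDate>Fri, 23 May 2008 02:00:00 +0200</pubDate>
</item>
<item>
<link>http://microformats.org/wiki/xfn-to-foaf</link>
<pubDate>Fri, 09 May 2008 02:00:00 +0200</pubDate>
</item>
</rss>"""


def parse_items(rss_bytes):
    """Return one dict per <item>; missing fields are None."""
    root = ET.fromstring(rss_bytes)
    return [
        {
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "pubDate": item.findtext("pubDate"),
        }
        for item in root.iter("item")
    ]


if __name__ == "__main__":
    for entry in parse_items(SAMPLE):
        print(entry["link"], "-", entry["title"])
```

Using root.iter("item") finds the items wherever they sit in the tree, so the same function also works when they are wrapped in a channel element.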
Search Interface Integration
How to integrate a YaCy
Search Portal:
Just copy-paste the code
snippet to your web page
source code.
The YaCy administration interface
offers more code snippets. An
example from
/ConfigSearchBox.html
looks like Code Snippet Example #2.
Code Snippet Example #1: a search window in an iframe
<iframe name="target2"
src="http://141.52.175.43:8080/yacysearch.html?display=2&resource=local"
width="100%" height="180"
frameborder="0" scrolling="auto" id="target2">
</iframe>
Code Snippet Example #2: a search box (points to new page)
<form method="get" accept-charset="UTF-8"
action="http://141.52.175.43:8080/yacysearch.html">
<div>
<div>MySearch</div>
<input type="text" name="query" value="" maxlength="80" />
<input type="hidden" name="verify" value="true" />
<input type="hidden" name="maximumRecords" value="10" />
<input type="hidden" name="meanCount" value="5" />
<input type="hidden" name="resource" value="local" />
<input type="hidden" name="urlmaskfilter" value=".*" />
<input type="hidden" name="prefermaskfilter" value="" />
<input type="hidden" name="display" value="2" />
<input type="hidden" name="nav" value="all" />
<input type="submit" name="Enter" value="Search" />
</div>
</form>
Your YaCy peer provides help pages with code snippets for easy integration!
APIs in Harvesting: Dublin Core Dump Import
YaCy can import standard
Dublin Core Metadata XML
files as input for indexing
<?xml version="1.0" encoding="utf-8"?>
<!-- YaCy surrogate using dublin core notion -->
<surrogates
xmlns:dc="http://purl.org/dc/elements/1.1/">
<record>
<dc:title><![CDATA[Alan Smithee]]></dc:title>
<dc:identifier>http://de.wikipedia.org/wiki/Alan_Smithee</dc:identifier>
<dc:description>
<![CDATA['''Alan Smithee''' ist ein Anagramm von „The Alias Men“.]]>
</dc:description>
<dc:language>de</dc:language>
<dc:date>2009-04-14T00:00:00Z</dc:date>
<!-- date is in ISO 8601 -->
</record>
</surrogates>
How to import Dublin Core Files:
just place the xml files into a hand-over directory
at DATA/SURROGATES/in/
Standards:
The Dublin Core XML File Standard:
http://dublincore.org/documents/dc-xml-guidelines/
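A surrogate file in the layout shown on this slide can be generated from your own records. The following Python sketch is one possible way to do that; the function name build_surrogate is hypothetical, and the text is XML-escaped rather than wrapped in CDATA sections, which a conforming XML parser treats as equivalent.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
# Make the serializer use the 'dc' prefix seen on the slide.
ET.register_namespace("dc", DC_NS)

# Field names taken from the slide's example record.
FIELDS = ("title", "identifier", "description", "language", "date")


def build_surrogate(records):
    """Serialize a list of dicts into a <surrogates> document (sketch)."""
    root = ET.Element("surrogates")
    for record in records:
        node = ET.SubElement(root, "record")
        for field in FIELDS:
            if field in record:
                child = ET.SubElement(node, "{%s}%s" % (DC_NS, field))
                child.text = record[field]
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="utf-8"?>\n' + body


if __name__ == "__main__":
    xml_text = build_surrogate([
        {
            "title": "Alan Smithee",
            "identifier": "http://de.wikipedia.org/wiki/Alan_Smithee",
            "language": "de",
            "date": "2009-04-14T00:00:00Z",  # ISO 8601, as on the slide
        }
    ])
    # Saving this text as an .xml file in DATA/SURROGATES/in/ hands it
    # to YaCy for import, as described on the slide.
    print(xml_text)
```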
Summary
1. Access to knowledge and the right to privacy are
human rights. Communities need their own ranking.
Centralized search engines are not sufficient to provide
these rights to everyone. We need decentralized systems.
2. We demonstrated search use cases that are unmatched
by current search portal providers.
Free content needs more appropriate search technology.
3. We explained how search technology works in general.
This was just the tip of the iceberg. There is a lot more to know.
4. We demonstrated search tools which are easy, available
and hackable: Solr, YaCy and Seeks.
For each tool you will find a short tutorial in these slides.
5. Please support the idea of free search and the projects.
Please help: test the software, ask questions, tell other
people and help hacking!
Thank You for Listening
QR-Code: vCard
Dipl. Inf. Michael Christen,
[email protected]
http://yacy.net
Download
http://yacy.net
http://latest.yacy.net
Documentation
http://wiki.yacy.net
http://yacy-kochbuch.de
Discussion
http://forum.yacy.de
Bugs
http://bugs.yacy.net
News
http://twitter.com/#!/yacy_search
http://blog.yacy.de
http://blog.yacy-kochbuch.de
Development
https://gitorious.org/yacy
all images are (CC0),
many are from http://openclipart.org