LWP::Simple - Purdue Genomics Wiki

22 – 26 February
Week 7
• Topics
○ Subroutines
○ Complex data structures
○ Internet agents / programming
• Reading
○ CPAN libwww-perl (LWP)
○ LWP cookbook
− http://search.cpan.org/~gaas/libwww-perl-6.04/lwpcook.pod
○ Perl Cookbook (available on Safari)
○ Review HTML Forms
− http://www.w3schools.com/html/html_forms.asp
Biol 59500-033 - Practical Biocomputing
Subroutines
Main program
• my $answer = multiply( $x, $y );
Subroutine
• # "times" is a Perl built-in function, so the subroutine needs a different name
  sub multiply {
      my ( $x, $y ) = @_;
      my $answer = $x * $y;
      return $answer;
  }
Complex Data Structures
# hash of hashes
# 1. gene information
my %gene = (
    At5g04870 => { gene  => "cpk1",
                   begin => 1416783,
                   end   => 1420338,
                   xsome => 5
                 },
    At1g18890 => { gene  => "cpk10",
                   begin => 6522764,
                   end   => 6525962,
                   xsome => 1
                 },
    At4g21940 => { gene  => "cpk15",
                   begin => 11640802,
                   end   => 11643762,
                   xsome => 4
                 },
);
# 1.1 print gene info sorted by systematic name (e.g., At5g04870 )
# 1.2 print gene info sorted by chromosome
# 1.3 print gene info sorted by gene length
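One possible approach to exercises 1.1 to 1.3, as a sketch rather than the course solution. The helper name gene_length and the output format are assumptions made for illustration; length is taken here as end - begin + 1.

```perl
use strict;
use warnings;

# Same gene table as above (hash of hashes)
my %gene = (
    At5g04870 => { gene => "cpk1",  begin => 1416783,  end => 1420338,  xsome => 5 },
    At1g18890 => { gene => "cpk10", begin => 6522764,  end => 6525962,  xsome => 1 },
    At4g21940 => { gene => "cpk15", begin => 11640802, end => 11643762, xsome => 4 },
);

# Hypothetical helper: gene length from the begin and end coordinates
sub gene_length {
    my ( $info ) = @_;
    return $info->{end} - $info->{begin} + 1;
}

# 1.1 systematic names in alphabetical order (keys of the outer hash)
my @by_name = sort keys %gene;

# 1.2 sorted by chromosome (numeric comparison of the inner xsome field)
my @by_xsome = sort { $gene{$a}{xsome} <=> $gene{$b}{xsome} } keys %gene;

# 1.3 sorted by gene length, shortest first
my @by_length = sort { gene_length( $gene{$a} ) <=> gene_length( $gene{$b} ) } keys %gene;

foreach my $id ( @by_length ) {
    my $g = $gene{$id};
    printf "%-10s %-6s chr%d %8d-%8d length=%d\n",
        $id, $g->{gene}, $g->{xsome}, $g->{begin}, $g->{end}, gene_length( $g );
}
```

The sort block only ever compares keys of the outer hash; the inner hash is looked up inside the comparison, which is the usual pattern for sorting a hash of hashes by a field.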
# 2. array of hashes. this corresponds to the info in a fasta file
my @sequence = (
    { name => "seqa",
      doc  => "sequence of gene a",
      seq  => "CGCATCGTATCCGATCGTAGCCTGCATCGTATGCTA" },
    { name => "wtfin",
      doc  => "no one knows what this gene does",
      seq  => "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" },
    { name => "lookase1",
      doc  => "related to ADHD",
      seq  => "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC" },
);
# 2.1 print the name of each sequence
# 2.2 print out info in alphabetical order of sequence name
# 2.3 print out information in sequence (seq) length order
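A sketch of exercises 2.1 to 2.3 for the array of hashes. Each array element is a hash reference, so inside the sort block $a and $b are references and fields are reached with ->. The output layout is an assumption.

```perl
use strict;
use warnings;

# Same sequence records as above (array of hashes)
my @sequence = (
    { name => "seqa",     doc => "sequence of gene a",
      seq  => "CGCATCGTATCCGATCGTAGCCTGCATCGTATGCTA" },
    { name => "wtfin",    doc => "no one knows what this gene does",
      seq  => "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" },
    { name => "lookase1", doc => "related to ADHD",
      seq  => "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC" },
);

# 2.1 names in the order they appear in the array
my @names = map { $_->{name} } @sequence;
print "@names\n";

# 2.2 records sorted alphabetically by name
my @by_name = sort { $a->{name} cmp $b->{name} } @sequence;

# 2.3 records sorted by sequence length, shortest first
my @by_length = sort { length( $a->{seq} ) <=> length( $b->{seq} ) } @sequence;

foreach my $s ( @by_length ) {
    printf "%-10s %3d  %s\n", $s->{name}, length( $s->{seq} ), $s->{doc};
}
```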
# hash of arrays
# 3. location of cities ( latitude, longitude )
my %location = (
    Montgomery   => [ 32.3615,  -86.2791 ],
    Little_Rock  => [ 34.736,   -92.3311 ],
    Phoenix      => [ 33.5284, -112.076  ],
    Sacramento   => [ 38.5556, -121.469  ],
    Denver       => [ 39.7263, -104.965  ],
    Hartford     => [ 41.7626,  -72.6886 ],
    Dover        => [ 39.1619,  -75.5268 ],
    Tallahassee  => [ 30.4518,  -84.2728 ],
    Atlanta      => [ 33.7595,  -84.4032 ],
    Des_Moines   => [ 41.5909,  -93.6209 ],
    Boise        => [ 43.6137, -116.238  ],
    Springfield  => [ 39.7833,  -89.6504 ],
    Indianapolis => [ 39.7909,  -86.1477 ],
    Topeka       => [ 39.0392,  -95.6895 ],
    Frankfort    => [ 38.1973,  -84.8631 ],
    Baton_Rouge  => [ 30.4581,  -91.1402 ],
);
# 3.1 print the locations of Phoenix, Topeka, and Atlanta
# 3.2 print all the cities west of Springfield
# 3.3 print the cities in east to west order
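A sketch of exercises 3.1 to 3.3 using a subset of the city table to keep the example short. It assumes "west" simply means a more negative longitude, which holds for these US cities.

```perl
use strict;
use warnings;

# Subset of the city table above: name => [ latitude, longitude ]
my %location = (
    Phoenix      => [ 33.5284, -112.076 ],
    Denver       => [ 39.7263, -104.965 ],
    Atlanta      => [ 33.7595,  -84.4032 ],
    Springfield  => [ 39.7833,  -89.6504 ],
    Indianapolis => [ 39.7909,  -86.1477 ],
    Topeka       => [ 39.0392,  -95.6895 ],
);

# 3.1 locations of selected cities; @{ ... } dereferences the inner array
foreach my $city ( qw( Phoenix Topeka Atlanta ) ) {
    my ( $lat, $lon ) = @{ $location{$city} };
    print "$city: latitude $lat, longitude $lon\n";
}

# 3.2 cities west of Springfield (longitude is element 1 of each array)
my $ref_lon = $location{Springfield}[1];
my @west    = grep { $location{$_}[1] < $ref_lon } sort keys %location;
print "west of Springfield: @west\n";

# 3.3 east to west order = descending longitude
my @east_to_west = sort { $location{$b}[1] <=> $location{$a}[1] } keys %location;
print "east to west: @east_to_west\n";
```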
# array of arrays
# 4. some Euclidean x, y, z coordinates
my @coord = (
    [ -66.838, -0.754, -25.764 ],
    [ -67.651, -1.371, -24.677 ],
    [ -67.424, -0.595, -23.384 ],
    [ -68.320,  0.089, -22.888 ],
    [ -67.234, -2.829, -24.467 ],
    [ -66.691, -3.521, -25.710 ],
    [ -67.718, -3.597, -26.826 ],
    [ -67.130, -4.281, -28.051 ],
    [ -68.089, -4.324, -29.188 ],
    [ -66.213, -0.711, -22.847 ],
    [ -65.842, -0.029, -21.614 ],
    [ -65.325,  1.368, -21.944 ],
    [ -64.130,  1.565, -22.165 ],
    [ -64.763, -0.831, -20.883 ]
);
# 4.1 sort by z coordinate
# 4.2 find the center of these coordinates ( ave_x, ave_y, ave_z )
# 4.3 find all coordinates within 2.0 of the center
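A sketch of exercises 4.1 to 4.3 on the first four coordinates from the table. The helper name distance is an assumption; it computes the ordinary Euclidean distance between two [ x, y, z ] array references.

```perl
use strict;
use warnings;

# First four coordinates from the table above: [ x, y, z ]
my @coord = (
    [ -66.838, -0.754, -25.764 ],
    [ -67.651, -1.371, -24.677 ],
    [ -67.424, -0.595, -23.384 ],
    [ -68.320,  0.089, -22.888 ],
);

# 4.1 sort by z coordinate (element 2 of each inner array)
my @by_z = sort { $a->[2] <=> $b->[2] } @coord;

# 4.2 center = average of each column
my @center = ( 0, 0, 0 );
foreach my $point ( @coord ) {
    $center[$_] += $point->[$_] for 0 .. 2;
}
my $n = scalar @coord;
$_ /= $n for @center;
printf "center: %.3f %.3f %.3f\n", @center;

# Hypothetical helper: Euclidean distance between two points
sub distance {
    my ( $p, $q ) = @_;
    my $sum = 0;
    $sum += ( $p->[$_] - $q->[$_] ) ** 2 for 0 .. 2;
    return sqrt $sum;
}

# 4.3 points within 2.0 of the center
my @near = grep { distance( $_, \@center ) <= 2.0 } @coord;
printf "%d of %d points are within 2.0 of the center\n", scalar @near, $n;
```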
# 5. hash of hashes of arrays of arrays
# this one is more complicated and only intended for those who feel the above
# is trivial. note that loc is an array of the beginning and ending positions of
# the gene on the chromosome, and exon is an array of arrays of the beginning
# and ending position of each exon; the exon coordinates are an offset from the
# beginning of the gene given in loc.
my %gene = (
    At5g04870 => { gene  => "cpk1",
                   loc   => [ 1416783, 1420338 ],
                   exon  => [ [ 1001, 1809 ],
                              [ 2171, 2314 ],
                              [ 2400, 2552 ] ],
                   xsome => 5
                 },
    At1g18890 => { gene  => "cpk10",
                   loc   => [ 6522764, 6525962 ],
                   exon  => [ [ 1001, 1298 ],
                              [ 1540, 2693 ] ],
                   xsome => 1
                 },
    At4g21940 => { gene  => "cpk15",
                   loc   => [ 11640802, 11643762 ],
                   exon  => [ [ 1001, 2379 ],
                              [ 2497, 2640 ],
                              [ 2736, 2888 ],
                              [ 3050, 3165 ],
                              [ 3321, 3488 ] ],
                   xsome => 4
                 },
);
# 5.1 list each gene and its exons in alphabetical order (by the "gene" key)
# 5.2 list the genes and their locations in order of the number of exons
# 5.3 list the genes and their locations in order of the longest exon in each gene
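A sketch of exercises 5.2 and 5.3 for the nested structure. The helper name longest_exon is an assumption, and exon length is taken as end - begin + 1.

```perl
use strict;
use warnings;

# Same nested gene table as above
my %gene = (
    At5g04870 => { gene => "cpk1",  loc => [ 1416783, 1420338 ], xsome => 5,
                   exon => [ [ 1001, 1809 ], [ 2171, 2314 ], [ 2400, 2552 ] ] },
    At1g18890 => { gene => "cpk10", loc => [ 6522764, 6525962 ], xsome => 1,
                   exon => [ [ 1001, 1298 ], [ 1540, 2693 ] ] },
    At4g21940 => { gene => "cpk15", loc => [ 11640802, 11643762 ], xsome => 4,
                   exon => [ [ 1001, 2379 ], [ 2497, 2640 ], [ 2736, 2888 ],
                             [ 3050, 3165 ], [ 3321, 3488 ] ] },
);

# Hypothetical helper: length of the longest exon of one gene record
sub longest_exon {
    my ( $info ) = @_;
    my $longest = 0;
    foreach my $exon ( @{ $info->{exon} } ) {
        my $length = $exon->[1] - $exon->[0] + 1;
        $longest = $length if $length > $longest;
    }
    return $longest;
}

# 5.2 order by number of exons; scalar @{ ... } counts the inner array
my @by_count = sort { scalar @{ $gene{$a}{exon} } <=> scalar @{ $gene{$b}{exon} } }
               keys %gene;

# 5.3 order by the longest exon in each gene
my @by_longest = sort { longest_exon( $gene{$a} ) <=> longest_exon( $gene{$b} ) }
                 keys %gene;

foreach my $id ( @by_longest ) {
    my ( $begin, $end ) = @{ $gene{$id}{loc} };
    printf "%-10s %-6s %d-%d exons=%d longest=%d\n", $id, $gene{$id}{gene},
        $begin, $end, scalar @{ $gene{$id}{exon} }, longest_exon( $gene{$id} );
}
```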
Internet Programming
• CPAN libwww-perl (LWP)
• LWP cookbook - http://search.cpan.org/~gaas/libwww-perl-6.04/lwpcook.pod
• Perl Cookbook (available on Safari)
Internet packages
• LWP::Simple
○ Simple fetching of web pages and "GET" method forms
• LWP::UserAgent
○ More complicated fetching of "POST" method forms, uses HTTP::Request and HTTP::Response
• HTTP::Request
○ Create HTTP formatted requests
• HTTP::Response
○ Parse HTTP formatted responses
• URI::URL
○ methods for handling URLs
• HTML
○ methods for handling HTML formatted files
wget
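• wget is available to fetch webpages on most unix systems
use strict;
my $url = "http://plantsp.genomics.purdue.edu";
# -O - sends the page to STDOUT so the backticks can capture it; -q suppresses progress messages
my $content = `wget -q -O - $url`;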
LWP Package
• Short for libwww-perl
• LWP::Simple
• get($url)
○ The get() function will fetch the document identified by the given URL and
return it. It returns undef if it fails. The $url argument can be either a simple
string or a reference to a URI object.
• head($url)
○ Get document headers. Returns the following 5 values if successful:
($content_type, $document_length, $modified_time, $expires, $server)
○ Returns an empty list if it fails. In scalar context returns TRUE if successful.
• getprint($url)
○ Get and print a document identified by a URL. The document is printed to
the selected default filehandle for output (normally STDOUT) as data is
received from the network. If the request fails, then the status code and
message are printed on STDERR. The return value is the HTTP response
code.
• getstore($url, $file)
○ Gets a document identified by a URL and stores it in the file. The return
value is the HTTP response code.
• mirror($url, $file)
LWP::Simple
• Getting a web page
• Most basic, little more than wget
use strict;
use LWP::Simple;
my $url = "http://plantsp.genomics.purdue.edu";
my $content = get( $url );
• What if something goes wrong?
LWP::Simple
• Checking for errors, better than wget
use strict;
use LWP::Simple;
my $url = "http://plantsp.genomics.purdue.edu";
my $content = get( $url );
unless ( defined $content ) {                # test for success
    die "unable to access $url\n\n";
}
• Inconvenient
○ Have to alter code each time
○ I get bored typing http://
LWP::Simple
• More useful with getopt
○ Doesn't hard code
○ supply http:// prefix
use strict;
use Getopt::Std;
use LWP::Simple;
my $option = {};
getopts( 'u:', $option );                    # u: means -u takes a value
my $url = "plantsp.genomics.purdue.edu";     # default URL
if ( $$option{u} ) {
    $url = $$option{u};
}
unless ( $url =~ /^http:\/\//i ) {           # add http:// prefix if missing
    $url = "http://".$url;
}
my $content = get( $url );
unless ( defined $content ) {                # test for success
    die "unable to access $url\n\n";
}
LWP::Simple
• LWP::Simple works well with REST–based web services
• NCBI E-utilities (eutils, http://www.ncbi.nlm.nih.gov/books/NBK25500/)
Provided by NCBI to
○ search databases (esearch)
○ download summaries (esummary)
○ download complete entries (efetch)
○ upload UIDs to NCBI server for later processing (epost)
○ query Entrez (egquery)
○ trace links in entries (elink)
○ examine database statistics and fields (einfo)
○ retrieve spelling suggestions (espell)
• Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/xxx.fcgi
LWP::Simple
• esearch
○ url: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
○ parameters:
− db - databases to search (pubmed, protein, nucleotide, genome, etc.)
− term - search term
− usehistory - y|n, store the results of search on server
○ example:
− http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer
− note: no spaces in term
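Because the term may not contain spaces, a query string has to be encoded before it is appended to the URL. The sketch below uses only core Perl; the helper names escape_term and esearch_url are assumptions for illustration and are not part of LWP or of the E-utilities.

```perl
use strict;
use warnings;

# base URL for NCBI eutil services (as given on the slides)
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Hypothetical helper: percent-encode characters that are not URL-safe,
# then replace spaces with "+" as in the example URL above
sub escape_term {
    my ( $term ) = @_;
    $term =~ s/([^A-Za-z0-9\-_.~ ])/sprintf( "%%%02X", ord $1 )/ge;
    $term =~ tr/ /+/;
    return $term;
}

# Hypothetical helper: build an esearch URL from database and search term
sub esearch_url {
    my ( $db, $term, $usehistory ) = @_;
    my $url = $BASE . "esearch.fcgi?db=$db&term=" . escape_term( $term );
    $url .= "&usehistory=y" if $usehistory;
    return $url;
}

print esearch_url( "pubmed", "science[journal] AND breast cancer" ), "\n";
```

The square brackets are percent-encoded here (%5B and %5D), which is a stricter encoding than the example URL on the slide; the resulting URL can be passed directly to get().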
LWP::Simple
• efetch
○ url: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
○ parameters:
− db - databases to search (pubmed, protein, nucleotide, genome, etc.)
− id - uid list (e.g., &id=15718680,157427902,119703751)
− rettype - retrieval type, varies with database: Abstract or MEDLINE from PubMed, or GenPept or FASTA from protein
− retmode - e.g., text, HTML or XML
− retstart - sequential index of the first record to be retrieved
− retmax - total number of records from the input set to be retrieved
− WebEnv - specifies the Web Environment that contains the UID list to be provided as input to EFetch
− query_key - specifies which of the UID lists attached to the given Web Environment will be used as input to EFetch
○ example:
− efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=15718680,157427902,119703751
LWP::Simple
• simple esearch script
#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to search an NCBI database
#
# Gribskov Admin     Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $database = "protein";
my $query    = "arsenite reductase AND arabidopsis";
my $search   = $BASE."esearch.fcgi?db=$database&term=$query";
print "searching $search...\n\n";
my $result = get $search;
print "$result\n";
exit 0;
LWP::Simple
searching http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=arsenite reductase AND arabidopsis...
<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN"
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
<Count>5</Count>
<RetMax>5</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>28868893</Id>
<Id>62286622</Id>
<Id>410092315</Id>
<Id>409760340</Id>
<Id>28852132</Id>
</IdList>
<TranslationSet>
<Translation>
<From>arsenite reductase</From>
<To>arsenite reductase[Protein Name] OR (arsenite[All Fields] AND reductase[All Fields])</To>
</Translation>
<Translation>
<From>arabidopsis</From>
<To>"Arabidopsis"[Organism] OR arabidopsis[All Fields]</To>
</Translation>
</TranslationSet>
<TranslationStack>
<TermSet>
<Term>arsenite reductase[Protein Name]</Term>
<Field>Protein Name</Field>
<Count>32</Count>
<Explode>N</Explode>
</TermSet>
<TermSet>
<Term>arsenite[All Fields]</Term>
<Field>All Fields</Field>
<Count>94001</Count>
</TermSet>
<TermSet>
<Term>reductase[All Fields]</Term>
</TermSet>
<OP>AND</OP>
<OP>GROUP</OP>
<OP>OR</OP>
<OP>GROUP</OP>
<TermSet>
<Term>"Arabidopsis"[Organism]</Term>
<Field>Organism</Field>
<Count>0</Count>
</TermSet>
<TermSet>
<Term>arabidopsis[All Fields]</Term>
</TermSet>
<OP>OR</OP>
<OP>GROUP</OP>
<OP>AND</OP>
</TranslationStack>
<QueryTranslation>(arsenite reductase[Protein Name] OR (arsenite[All Fields] AND reductase[All Fields])) AND ("Arabidopsis"[Organism] OR arabidopsis[All Fields])</QueryTranslation>
</eSearchResult>
LWP::Simple
#!/usr/bin/perl
################################################################################
#
#
# Gribskov Admin     Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
# ... $BASE, $database, $query, $search and $result as in the previous script ...
#print "$result\n";
# get IDs
my ( $ids ) = $result =~ /<IdList>(.*)<\/IdList>/s;
$ids =~ s/<\/?Id>//g;
my @idlist = split " ", $ids;
print "idlist:@idlist\n";
exit 0;
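The ID extraction can be tried without a network connection by running it on a saved copy of the esearch output. The sketch below uses an abbreviated copy of the XML shown earlier and a slightly different regular expression: a global match in list context returns every captured ID directly, so no cleanup step is needed. For anything beyond simple cases a real XML parser would be more robust.

```perl
use strict;
use warnings;

# Abbreviated copy of the esearch output shown earlier
my $result = <<'XML';
<eSearchResult>
<Count>5</Count>
<RetMax>5</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>28868893</Id>
<Id>62286622</Id>
<Id>410092315</Id>
<Id>409760340</Id>
<Id>28852132</Id>
</IdList>
</eSearchResult>
XML

# The first <Count> element is the total number of matches
my ( $count ) = $result =~ m{<Count>(\d+)</Count>};

# With /g in list context the match returns every captured ID
my @idlist = $result =~ m{<Id>(\d+)</Id>}g;
die "no IDs found\n" unless @idlist;

print "count: $count\n";
print "idlist: @idlist\n";

# comma separated list, ready for the id= parameter of efetch
my $idstring = join ",", @idlist;
print "idstring: $idstring\n";
```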
LWP::Simple
#!/usr/bin/perl
################################################################################
#
#
# Gribskov Admin     Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
# ... $BASE, $database and $query as in the first script ...
#my $search = $BASE."esearch.fcgi?db=$database&term=$query&usehistory=y";
# ... $search and $result as in the first script ...
#print "$result\n";
# get IDs
my ( $ids ) = $result =~ /<IdList>(.*)<\/IdList>/s;
$ids =~ s/<\/?Id>//g;
my @idlist = split " ", $ids;
print "idlist:@idlist\n";
# retrieve with efetch
my $idstring = join ",", @idlist;
my $fetch = $BASE."efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=$idstring";
print "fetch:$fetch\n";
my $sequence = get $fetch;
print $sequence;
LWP::Simple
>gi|28868893|ref|NP_791512.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
MTDLTLYHNPRCTKSRGALELLQARGLTPDIILYLETPPDAGTLHDLLGKLGISARQLLRTGEDDYKQLN
LADPSLSDEQLVAAMAAHPKLIERPILVAGNKAVIGRPPENILELLP
>gi|62286622|sp|Q8GY31.1|CDC25_ARATH RecName: Full=Dual specificity phosphatase Cdc25; AltName: Full=Arath;CDC25; AltName: Full=Arsenate reductase 2; AltName: Full=Sulfurtransferase 5; Short=AtStr5
MGRSIFSFFTKKKKMAMARSISYITSTQLLPLHRRPNIAIIDVRDEERNYDGHIAGSLHYASGSFDDKIS
HLVQNVKDKDTLVFHCALSQVRGPTCARRLVNYLDEKKEDTGIKNIMILERGFNGWEASGKPVCRCAEVP
CKGDCA
>gi|410092315|ref|ZP_11288844.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
MTDLTLYHNPRCTKSRGALELLQARGLSPDVVLYLETPPDAAQLRELLGKLGISARQLLRTGEDDYKQLN
LADASLSDEQLIAAMAAHPKLIERPILVVGDKAVIGRPPENVLELLP
>gi|409760340|gb|EKN45494.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
>gi|28852132|gb|AAO55207.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
LWP::Simple
• Large retrievals with Eutils
○ NCBI allows the results of a large query to be stored on their server and used in other queries using the usehistory=y parameter with esearch
○ multiple sets of sequences can then be retrieved in chunks using
− retstart - index of first sequence to retrieve
− retmax - number of sequences to retrieve
○ NCBI recommends setting retmax = 500 to avoid having an adverse impact on their services
LWP::Simple
• esearch
○ additional information with &usehistory=y
<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN"
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
<Count>5</Count>
<RetMax>5</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>NCID_1_3419506_165.112.9.24_5555_1362406574_334475218</WebEnv>
<IdList>
<Id>28868893</Id>
<Id>62286622</Id>
<Id>410092315</Id>
<Id>409760340</Id>
<Id>28852132</Id>
LWP::Simple
#!/usr/bin/perl
################################################################################
#
#
# Gribskov Admin     Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
# ... $BASE, $database and $query as in the first script;
#     $search built with &usehistory=y; $result = get $search ...
print "searching for $query ...\n\n";
print $result;
# get number of matches, WebEnv and query_key
my ( $webenv )    = $result =~ /<WebEnv>(\S+)<\/WebEnv>/s;
my ( $query_key ) = $result =~ /<QueryKey>(\d+)<\/QueryKey>/s;
my ( $matches )   = $result =~ /<Count>(\d+)<\/Count>/s;
print "WebEnv:$webenv\nquery_key:$query_key\nmatches: $matches\n";
my $retmax   = 2;
my $retstart = 0;
while ( $retstart < 6 ) {
    my $fetch = $BASE."efetch.fcgi?db=protein";
    $fetch .= "&retmode=text&rettype=fasta";
    $fetch .= "&retmax=$retmax&retstart=$retstart";
    $fetch .= "&WebEnv=$webenv&query_key=$query_key";
    print "start=$retstart\nquery:$fetch\n";
    my $sequence = get $fetch;               # retrieve this chunk
    print $sequence;
    $retstart += $retmax;
}
exit 0;
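The retstart/retmax arithmetic can be checked separately from the network code. The sketch below lists the chunks needed to cover a given number of matches; the helper name fetch_windows is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list of [ retstart, count ] pairs covering $matches records
sub fetch_windows {
    my ( $matches, $retmax ) = @_;
    die "retmax must be positive\n" unless $retmax > 0;
    my @windows;
    for ( my $retstart = 0; $retstart < $matches; $retstart += $retmax ) {
        # the last window may hold fewer than $retmax records
        my $remaining = $matches - $retstart;
        my $count     = $remaining < $retmax ? $remaining : $retmax;
        push @windows, [ $retstart, $count ];
    }
    return @windows;
}

# 5 matches retrieved 2 at a time, as in the script above
foreach my $w ( fetch_windows( 5, 2 ) ) {
    print "retstart=$w->[0] records=$w->[1]\n";
}
```

Looping until retstart reaches the number of matches, rather than a fixed limit, means the same loop works for any query.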
LWP::Simple
WebEnv:NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803
query_key:1
matches: 5
start=0
query:http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&retmax=2&retstart=0&WebEnv=NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803&query_key=1
>gi|28868893|ref|NP_791512.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
>gi|62286622|sp|Q8GY31.1|CDC25_ARATH RecName: Full=Dual specificity phosphatase Cdc25; AltName: Full=Arath;CDC25; AltName: Full=Arsenate reductase 2; AltName: Full=Sulfurtransferase 5; Short=AtStr5
MGRSIFSFFTKKKKMAMARSISYITSTQLLPLHRRPNIAIIDVRDEERNYDGHIAGSLHYASGSFDDKIS
HLVQNVKDKDTLVFHCALSQVRGPTCARRLVNYLDEKKEDTGIKNIMILERGFNGWEASGKPVCRCAEVP
CKGDCA
start=2
>gi|410092315|ref|ZP_11288844.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
>gi|409760340|gb|EKN45494.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
start=4
>gi|28852132|gb|AAO55207.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
LWP::Simple
• Drawbacks to eutils script
○ must change the program for every different search
○ no progress report while running
○ no error checking
• Enhancing flexibility with getopt
LWP::Simple
#!/usr/bin/perl
################################################################################
#
#
# Gribskov Admin     Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
my $DEFAULT_INTERVAL = 10000;
my $DEFAULT_DATABASE = 'protein';
my $USAGE = qq{eutil.pl <query_string>
    -h           this usage message
    -d <string>  NCBI database to search (default=$DEFAULT_DATABASE)
    -i <int>     interval for reporting progress (default=$DEFAULT_INTERVAL)
};
# command line options
my $option = {};
getopts( 'd:hi:', $option );
# help
if ( $$option{h} ) {
    print "$USAGE\n";
    exit 1;
}
my $database = $DEFAULT_DATABASE; if ( $$option{d} ) { $database = $$option{d}; }
my $interval = $DEFAULT_INTERVAL; if ( $$option{i} ) { $interval = $$option{i}; }
my $query    = $ARGV[0];
LWP::Simple
#------------------------------------------------------------------------------
# main program
#------------------------------------------------------------------------------
print STDERR "eutil.pl\n";
print STDERR "    report interval: $interval\n";
print STDERR "    database: $database\n";
print STDERR "    query: $query\n\n";
# ... $BASE, $search (with &usehistory=y) and $result as in the previous scripts ...
print "searching for $query ...\n\n";
# get number of matches, WebEnv and query_key
my ( $webenv )    = $result =~ /<WebEnv>(\S+)<\/WebEnv>/s;
my ( $query_key ) = $result =~ /<QueryKey>(\d+)<\/QueryKey>/s;
my ( $matches )   = $result =~ /<Count>(\d+)<\/Count>/s;
print STDERR "matches: $matches\nWebEnv:$webenv\nquery_key:$query_key\n\n";
my $retmax   = 500;
my $retstart = 0;
while ( $retstart < $matches ) {
    my $fetch = $BASE."efetch.fcgi?db=$database";    # use the database chosen with -d
    $fetch .= "&retmode=text&rettype=fasta";
    $fetch .= "&retmax=$retmax&retstart=$retstart";
    $fetch .= "&WebEnv=$webenv&query_key=$query_key";
    my $sequence = get $fetch;                       # retrieve this chunk
    print $sequence;
    $retstart += $retmax;
    unless ( $retstart % $interval ) { print STDERR "    $retstart sequences retrieved\n"; }
}
exit 0;
Running servers without using the web form
• LWP::UserAgent
• Blast @ PlantsP
○ Find homologous sequences using BLAST
http://xplantsp.genomics.purdue.edu/cgibin/blast_tmpl_soap.cgi?db=PlantsP
LWP::UserAgent
$ua->agent( $product_id )
• Get/set the product token that is used to identify the user agent on the network. The agent value is sent as the
"User-Agent" header in the requests. The default is the string returned by the _agent() method (see below).
• If the $product_id ends with space then the _agent() string is appended to it.
• The user agent string should be one or more simple product identifiers with an optional version number
separated by the "/" character. Examples are:
  $ua->agent('Checkbot/0.4 ' . $ua->_agent);
  $ua->agent('Checkbot/0.4 ');    # same as above
  $ua->agent('Mozilla/5.0');
  $ua->agent("");                 # don't identify
$ua->_agent
• Returns the default agent identifier. This is a string of the form "libwww-perl/#.##", where "#.##" is substituted
with the version number of this library.
$ua->from( $email_address )
• Get/set the e-mail address for the human user who controls the requesting user agent. The address should be
machine-usable, as defined in RFC 822. The from value is sent as the "From" header in the requests. Example:
  $ua->from('gaas@cpan.org');
• The default is to not send a "From" header. See the default_headers() method for the more general interface that
allows any header to be defaulted.
$ua->max_size( $bytes )
• Get/set the size limit for response content. The default is undef, which means that there is no limit. If the returned
response content is only partial, because the size limit was exceeded, then a "Client-Aborted" header will be
added to the response. The content might end up longer than max_size as we abort once appending a chunk of
data makes the length exceed the limit. The "Content-Length" header, if present, will indicate the length of the
full content and will normally not be the same as length($res->content).
$ua->timeout( $secs )
• Get/set the timeout value in seconds. The default timeout() value is 180 seconds, i.e. 3 minutes.
• The request is aborted if no activity on the connection to the server is observed for timeout seconds. This
means that the time it takes for the complete transaction and the request() method to actually return might be
longer.
LWP::UserAgent - REQUEST METHODS
$ua->get( $url , $field_name => $value, ... )
• This method will dispatch a GET request on the given $url. Further arguments can be given to initialize the
headers of the request. These are given as separate name/value pairs. The return value is a response object.
$ua->head( $url , $field_name => $value, ... )
• This method will dispatch a HEAD request on the given $url. Otherwise it works like the get() method described
above.
$ua->post( $url, \%form )
$ua->post( $url, \@form )
$ua->post( $url, \%form, $field_name => $value, ... )
• This method will dispatch a POST request on the given $url, with %form or @form providing the key/value pairs
for the fill-in form content. Additional headers and content options are the same as for the get() method.
$ua->request( $request, $content_file )
• This method will dispatch the given $request object. Normally this will be an instance of the HTTP::Request
class, but any object with a similar interface will do. The return value is a response object. See the
HTTP::Request manpage and the HTTP::Response manpage for a description of the interface provided by these
classes.
• The request() method will process redirects and authentication responses transparently. This means that it may
actually send several simple requests via the simple_request() method described below.
$ua->simple_request( $request )
• This method dispatches a single request and returns the response received.
Arguments are the same as for request() described above.
• The difference from request() is that simple_request() will not try to handle redirects or authentication
responses. The request() method will in fact invoke this method for each simple request it sends.
$ua->redirect_ok( $prospective_request, $response )
• This method is called by request() before it tries to follow a redirection to the request in $response. This should
return a TRUE value if this redirection is permissible. The $prospective_request will be the request to be sent if
this method returns TRUE.
• The base implementation will return FALSE unless the method is in the object's requests_redirectable list,
FALSE if the proposed redirection is to a "file://..." URL, and TRUE otherwise.
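A minimal sketch of how these methods fit together, assuming libwww-perl is installed. The sub name fetch_page, the agent string and the 30 second timeout are illustrative choices. The module is loaded with require inside the sub, so the file compiles even where LWP is missing and nothing touches the network until the sub is called.

```perl
use strict;
use warnings;

# Sketch: fetch a page with LWP::UserAgent and report HTTP errors
sub fetch_page {
    my ( $url ) = @_;
    require LWP::UserAgent;                 # loaded only when the sub is called

    my $ua = LWP::UserAgent->new;
    $ua->timeout( 30 );                     # give up after 30 s without activity
    $ua->agent( "Biocomputing-demo/0.1 " ); # trailing space: default agent string is appended

    my $response = $ua->get( $url );        # returns an HTTP::Response object
    unless ( $response->is_success ) {
        die "unable to access $url: ", $response->status_line, "\n";
    }
    return $response->decoded_content;
}

# usage (needs network access):
# print fetch_page( "http://plantsp.genomics.purdue.edu" );
```

Unlike LWP::Simple's get(), the response object keeps the status line, so a failure can be reported with the actual HTTP error.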
Internet programming
Form information
• chrome – web developer extension
• firefox
○ firebug
○ web developer
• safari
○ web inspector
○ firebug
LWP::UserAgent
• Blast search form
• Gets from user
○ DATALIB
○ SEQUENCE
• Defaults
○ PROGRAM
○ UNGAPPED_ALIGNMENT
○ FSET
○ EXPECT
○ DESCRIPTIONS
○ ALIGNMENTS
• Hidden
○ db
Finding Form variables in page source
• <FORM>
○ may be more than one
• <INPUT>
○ box
○ radio button
○ checkbox
• <SELECT>
○ pulldown menu
• <TEXTAREA>
○ a large box for text
BLAST@PlantsP
• DB
○ hidden -> "plantsp"
• PROGRAM
○ Select ->
<option value=blastp SELECTED> blastp (prot. vs prot.)
<option value=blastn > blastn (DNA vs DNA)
<option value=blastx> blastx (transl. DNA vs prot.)
<option value=tblastn> tblastn (prot. vs transl. DNA)
<option value=tblastx> tblastx (transl. DNA vs transl. DNA)
• DATALIB
○ Select ->
<OPTION VALUE=ap>----- Protein databases ----
<OPTION VALUE=tigr_osa5prot> Rice Proteins (TIGR release 5 - 01/24/2007)
<OPTION VALUE=physco_pro> Physcomitrella proteins (JGI v1.1 - March 2007)
<OPTION VALUE=selmo1_pro> Selaginella proteins (JGI v1.0 - March 2007 (released 10/31))
<OPTION VALUE=all_pro> All Plants(P+T+Ubq) Proteins (Purdue - 28 Jan 2008)
<OPTION VALUE=tair_ath8prot> Arabidopsis proteins (TAIR release 8 - 2008-05-16)
<OPTION VALUE=vp_090116> Viridiplantae proteins (All viridiplantae proteins - 01/16/2009)
<OPTION VALUE=PlantProteinDB> Plant Protein (Combined plant protein database - 2009-0313)
<OPTION VALUE=an>----- DNA databases ----
<OPTION VALUE=osa_indica> Indica (chinese) Rice Genomic Sequence (Yu et al. - 9/5/2002)
…
HTML Forms
<INPUT TYPE=checkbox NAME=UNGAPPED_ALIGNMENT VALUE=is_set>
Perform ungapped alignment <BR>
The query sequence is <INPUT TYPE=checkbox NAME=FSET VALUE=isset CHECKED>
filtered for low complexity regions by default. <BR>
<TEXTAREA NAME=SEQUENCE ROWS=6 COLS=80 VALUE=></TEXTAREA>
Expect Cutoff   <select name=EXPECT>
<option> 1e-100 <option> 1e-50 <option> 1e-25
<option> 1e-20 <option> 1e-15 <option> 1e-10
<option> 1e-5 <option> 1.0 <option selected> 10.0
<option> 100.0 <option> 500.0
<option> 1000.0 </select>  
Finding Form Fields by Element Info
LWP::UserAgent
• BLAST @ PlantsP
use strict;
use HTTP::Request::Common qw( POST );
use LWP::UserAgent;
my $site   = "http://plantsp.genomics.purdue.edu/";
my $target = "plantsp/cgi-bin/blast_basic.cgi";
my $seq = "
MAKNVMQLAILSTQRVVLLLWLLHAPAAADAALTTVAGCPSKCGDVDIPLPFGIGDHCAW
ESFDVVCNESFSPPRPHTGNIEIKEISVEAGEMRVYTPVADQCYNSSSTSAPGFGASLEL
TAPFLLAQSNEFTAIGCNTVAFLDGRNNGSYSTGCITTCGSVEAAAQNGEPCTGLGCCQV
PSIPPNLTTLHISWNDQGFLNFTPIGTPCSYAFVAQKDWYNFSRQDFGPVGSKDFITNST";
$seq =~ s/\s+//g;                     # remove line breaks and other whitespace
my $library = "pp_active_prots";
my $agent   = LWP::UserAgent->new();
# one key/value pair for each field of the BLAST form
my $request = POST $site.$target,
    [ DATALIB            => $library,
      SEQUENCE           => $seq,
      PROGRAM            => "blastp",
      UNGAPPED_ALIGNMENT => 1,
      FSET               => 1,
      EXPECT             => "10.0",   # quoted so it is sent as 10.0, matching the form option
      DESCRIPTIONS       => 10,
      ALIGNMENTS         => 10,
      db                 => "plantsp"
    ];
my $response = $agent->request( $request );
print $response->as_string;
ClustalW2 @ EBI
• Multiple sequence alignment
• http://www.ebi.ac.uk/Tools/msa/clustalw2/
ClustalW2 @ EBI
TmHmm 2.0
• Predict transmembrane helices
• http://www.cbs.dtu.dk/services/TMHMM/
TmHmm - Page info
Psort
• Prediction of sorting signals
• http://wolfpsort.seq.cbrc.jp/
PSort Output
• Formatted HTML
PSort Output
• HTML Source
• Requires "Screen Scraping"
Removing HTML Tags
• A stupid approach
( my $text = $html ) =~ s/<[^>]*>//g;    # copy $html, then remove the tags from the copy
• Fails for
○ <IMG SRC="foo.gif"
  ALT="A foo in its natural habitat">
○ <IMG SRC="foo.gif" ALT="A > B">
○ <script>if (a<b && a>c)</script>
○ etc...
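The failure is easy to reproduce. In the sketch below the sub name strip_tags_naive and the two test strings are assumptions for illustration: the pattern stops at the first ">" it sees, even when that character is inside a quoted attribute value.

```perl
use strict;
use warnings;

# The regular expression approach from the slide, wrapped in a sub
sub strip_tags_naive {
    my ( $html ) = @_;
    ( my $text = $html ) =~ s/<[^>]*>//g;
    return $text;
}

my $simple = '<B>bold</B> text';
my $tricky = '<IMG SRC="foo.gif" ALT="A > B">caption';

# works: the tags contain no ">" of their own
print "simple: [", strip_tags_naive( $simple ), "]\n";

# fails: the ">" inside ALT="A > B" ends the match too early,
# so part of the tag is left behind in the text
print "tricky: [", strip_tags_naive( $tricky ), "]\n";
```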
Parsing HTML
• Use HTML package to find or remove tags
• better, but complicated
use HTML::Parser;
my $tree = HTML::Parser->new(
    start_h => [ sub { print shift, "\n" }, "tag" ],
    text_h  => [ sub { print shift;
                       print "   ", shift, "\n" },
                 "line, dtext"
               ]
);
$tree->parse_file( "origins_life.htm" );
Parsing HTML
• Parser package uses "event handlers"
• event => [ subr, information ]
○ Event types:
− text
− start
− end
− declaration
− comment
− process
− start_document
− end_document
− default
• subr is a subroutine to process the information
○ for example a subroutine: sub{ print shift }
Parsing HTML
• information types
○ attr - returns a reference to a hash of attribute name/value pairs
○ @attr - basically the same as attr, but keys and values are returned as individual arguments and the original sequence is preserved
○ attrseq - returns a reference to an array of attribute names
○ column - returns the column number of the start of the event
○ dtext - returns the decoded text
○ event - returns the event name
○ length - returns the number of bytes of the source text
○ line - returns the line number of the start of the event
○ skipped_text - returns the concatenated text of all the events that have been skipped since the last time an event was reported
○ tagname - returns the element name
○ tokens - returns a reference to an array of token strings. The strings are exactly as they were found in the original text, no decoding or case changes are applied.
○ text - returns the source text (including markup element delimiters)
Checking Document Links
• HTML::LinkExtor
○ Get all the links in a document and possibly process each
○ $parser = HTML::LinkExtor->new( $function, $url );
○ $parser->links returns the list of links found and clears the stored list. Each element is an array reference with the type of link and the attribute-value pairs from the tag
○ $function is normally undef, but can be a reference to a function that you want to act on every link
<A HREF=http://www.perl.com/>Home</A>
<IMG SRC="images/big/jpg" LOWSRC="images/big-lowres.jpg">
• $parser->links returns
[
  [ a,   href => http://www.perl.com/ ],
  [ img, src => "images/big/jpg", lowsrc => "images/big-lowres.jpg" ]
]
Checking Document Links
• LinkExtor
use strict;
use HTML::LinkExtor;
use LWP::Simple qw( get head );
my $base_url = shift || die "usage $0 <start_url>\n";
my $content = get( $base_url );
die "unable to access $base_url\n" unless defined $content;
my $parser = HTML::LinkExtor->new();
$parser->parse( $content );
$parser->eof;                                 # signal the end of the document
my @links = $parser->links;
print "base URL: $base_url\n\n";
foreach my $linkref ( @links ) {
    my @linklist = @$linkref;                 # $linkref is a reference to a list
    my $type = shift @linklist;
    my ( $attr, $value ) = @linklist;
    # print "type: $type @linklist\n";
    # print "    attr: $attr    value:$value\n";
    if ( $value =~ /^(?:ftp|https?):/ ) {     # only absolute URLs are checked
        if ( head( $value ) ) {
            print "$value is OK\n";
        } else {
            print "$value is BAD\n";
        }
    }
}
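The script above looks only at the first attribute of each link, but a tag such as IMG can carry several. The sketch below walks every attribute/value pair; it uses hand-written sample data in the shape that $parser->links returns, so it runs without HTML::LinkExtor or a network connection.

```perl
use strict;
use warnings;

# Sample data in the shape returned by $parser->links:
# each element is [ tag, attribute => value, ... ]
my @links = (
    [ a   => href => "http://www.perl.com/" ],
    [ img => src  => "images/big/jpg", lowsrc => "images/big-lowres.jpg" ],
);

my @absolute;    # URLs that could be checked with head()
my @relative;    # URLs that first need the base URL of the page

foreach my $linkref ( @links ) {
    # first element is the tag name, the rest are attribute/value pairs
    my ( $type, %attr ) = @$linkref;
    foreach my $name ( sort keys %attr ) {
        my $value = $attr{$name};
        print "$type $name $value\n";
        if ( $value =~ m{^(?:ftp|https?)://}i ) {
            push @absolute, $value;
        } else {
            push @relative, $value;
        }
    }
}

print "absolute: @absolute\n";
print "relative: @relative\n";
```

Passing the page URL as the second argument to HTML::LinkExtor->new makes the parser return absolute URLs, so relative links like these could then be checked as well.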