LWP::Simple - Purdue Genomics Wiki

22 – 26 February
Week 7
• Topics
○ Subroutines
○ Complex data structures
○ Internet agents / programming
• Reading
○ CPAN libwww-perl (LWP)
○ LWP cookbook
− http://search.cpan.org/~gaas/libwww-perl-6.04/lwpcook.pod
○ Perl Cookbook (available on Safari)
○ Review HTML Forms
− http://www.w3schools.com/html/html_forms.asp
Biol 59500-033 - Practical Biocomputing
Subroutines
Main program
• my $answer = multiply( $x, $y );
Subroutine
• sub multiply {    # not "times": that name is a Perl built-in
    my ( $x, $y ) = @_;
    my $answer = $x * $y;
    return $answer;
}
Complex Data Structures
# hash of hashes
# 1. gene information
my %gene = (
    At5g04870 => {
        gene  => "cpk1",
        begin => 1416783,
        end   => 1420338,
        xsome => 5
    },
    At1g18890 => {
        gene  => "cpk10",
        begin => 6522764,
        end   => 6525962,
        xsome => 1
    },
    At4g21940 => {
        gene  => "cpk15",
        begin => 11640802,
        end   => 11643762,
        xsome => 4
    },
);
# 1.1 print gene info sorted by systematic name (e.g., At5g04870 )
# 1.2 print gene info sorted by chromosome
# 1.3 print gene info sorted by gene length
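A minimal sketch of one way to approach exercises 1.1 to 1.3, assuming the %gene hash above. The helper gene_length() and the array names are illustrative and not part of the course material.

```perl
use strict;
use warnings;

my %gene = (
    At5g04870 => { gene => "cpk1",  begin => 1416783,  end => 1420338,  xsome => 5 },
    At1g18890 => { gene => "cpk10", begin => 6522764,  end => 6525962,  xsome => 1 },
    At4g21940 => { gene => "cpk15", begin => 11640802, end => 11643762, xsome => 4 },
);

# gene length in bases, counting both end positions (hypothetical helper)
sub gene_length {
    my ( $info ) = @_;
    return $info->{end} - $info->{begin} + 1;
}

# each sort compares a value looked up through the outer hash key
my @by_name   = sort keys %gene;
my @by_xsome  = sort { $gene{$a}{xsome} <=> $gene{$b}{xsome} } keys %gene;
my @by_length = sort { gene_length( $gene{$a} ) <=> gene_length( $gene{$b} ) } keys %gene;

foreach my $id ( @by_length ) {
    printf "%-10s %-6s chr%d %d\n",
        $id, $gene{$id}{gene}, $gene{$id}{xsome}, gene_length( $gene{$id} );
}
```

The sort block never sees the inner hashes directly. It receives the keys in $a and $b and looks up whatever field it needs.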
Complex Data Structures
# 2. array of hashes. this corresponds to the info in a fasta file
my @sequence = (
    { name => "seqa",
      doc  => "sequence of gene a",
      seq  => "CGCATCGTATCCGATCGTAGCCTGCATCGTATGCTA" },
    { name => "wtfin",
      doc  => "no one knows what this gene does",
      seq  => "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" },
    { name => "lookase1",
      doc  => "related to ADHD",
      seq  => "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC" },
);
#2.1 print the name of each sequence
#2.2 print out info in alphabetical order of sequence name
#2.3 print out information in sequence (seq) length order
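A possible sketch for exercises 2.1 to 2.3. The sequences are shortened toy data (an assumption made to keep the listing small), and the array names are illustrative.

```perl
use strict;
use warnings;

# toy stand-in for @sequence; sequences are built with the x operator
my @sequence = (
    { name => "seqa",     doc => "sequence of gene a", seq => "CGCATCGTATCC" },
    { name => "wtfin",    doc => "unknown function",   seq => "N" x 8 },
    { name => "lookase1", doc => "related to ADHD",    seq => "C" x 16 },
);

# 2.1 each array element is a hash reference, so fields are read with ->
my @names = map { $_->{name} } @sequence;

# 2.2 and 2.3 sort the references themselves, comparing one field of each
my @by_name   = sort { $a->{name} cmp $b->{name} } @sequence;
my @by_length = sort { length( $a->{seq} ) <=> length( $b->{seq} ) } @sequence;

print "names: @names\n";
foreach my $entry ( @by_length ) {
    printf "%-10s %3d  %s\n", $entry->{name}, length( $entry->{seq} ), $entry->{doc};
}
```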
Complex Data Structures
# hash of arrays
# 3. location of cities ( latitude, longitude )
my %location = (
    Montgomery   => [ 32.3615,  -86.2791 ],
    Little_Rock  => [ 34.736,   -92.3311 ],
    Phoenix      => [ 33.5284, -112.076  ],
    Sacramento   => [ 38.5556, -121.469  ],
    Denver       => [ 39.7263, -104.965  ],
    Hartford     => [ 41.7626,  -72.6886 ],
    Dover        => [ 39.1619,  -75.5268 ],
    Tallahassee  => [ 30.4518,  -84.2728 ],
    Atlanta      => [ 33.7595,  -84.4032 ],
    Des_Moines   => [ 41.5909,  -93.6209 ],
    Boise        => [ 43.6137, -116.238  ],
    Springfield  => [ 39.7833,  -89.6504 ],
    Indianapolis => [ 39.7909,  -86.1477 ],
    Topeka       => [ 39.0392,  -95.6895 ],
    Frankfort    => [ 38.1973,  -84.8631 ],
    Baton_Rouge  => [ 30.4581,  -91.1402 ],
);
# 3.1 print the locations of Phoenix, Topeka, and Atlanta
# 3.2 print all the cities west of Springfield
# 3.3 print the cities in east to west order
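One way exercises 3.1 to 3.3 might be approached, sketched on a six-city subset of %location to keep it short. The variable names are illustrative.

```perl
use strict;
use warnings;

# subset of %location from the slide: city => [ latitude, longitude ]
my %location = (
    Phoenix     => [ 33.5284, -112.076 ],
    Denver      => [ 39.7263, -104.965 ],
    Dover       => [ 39.1619, -75.5268 ],
    Atlanta     => [ 33.7595, -84.4032 ],
    Springfield => [ 39.7833, -89.6504 ],
    Topeka      => [ 39.0392, -95.6895 ],
);

# 3.1 dereference the inner array to get latitude and longitude
foreach my $city ( qw( Phoenix Topeka Atlanta ) ) {
    my ( $lat, $long ) = @{ $location{$city} };
    print "$city: $lat, $long\n";
}

# 3.2 west means a smaller (more negative) longitude, stored at index 1
my $limit = $location{Springfield}[1];
my @west  = grep { $location{$_}[1] < $limit } sort keys %location;

# 3.3 east to west is descending longitude
my @east_to_west = sort { $location{$b}[1] <=> $location{$a}[1] } keys %location;

print "west of Springfield: @west\n";
print "east to west: @east_to_west\n";
```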
Complex Data Structures
# array of arrays
# 4. some Euclidean x, y, z coordinates
my @coord = (
    [ -66.838, -0.754, -25.764 ],
    [ -67.651, -1.371, -24.677 ],
    [ -67.424, -0.595, -23.384 ],
    [ -68.320,  0.089, -22.888 ],
    [ -67.234, -2.829, -24.467 ],
    [ -66.691, -3.521, -25.710 ],
    [ -67.718, -3.597, -26.826 ],
    [ -67.130, -4.281, -28.051 ],
    [ -68.089, -4.324, -29.188 ],
    [ -66.213, -0.711, -22.847 ],
    [ -65.842, -0.029, -21.614 ],
    [ -65.325,  1.368, -21.944 ],
    [ -64.130,  1.565, -22.165 ],
    [ -64.763, -0.831, -20.883 ]
);
# 4.1 sort by z coordinate
# 4.2 find the center of these coordinates ( ave_x, ave_y, ave_z )
# 4.3 find all coordinates within 2.0 of the center
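A sketch of exercises 4.1 to 4.3 on small made-up coordinates, chosen so the center is easy to check by hand. The toy data and the helper distance() are assumptions for illustration; the same code should apply to @coord above.

```perl
use strict;
use warnings;

# toy coordinates (assumed for illustration; substitute @coord from the slide)
my @coord = ( [ 1, 1, 1 ], [ 3, 3, 3 ], [ 2, 2, 2 ], [ 2, 2, 10 ], [ 2, 2, -6 ] );

# 4.1 sort by z coordinate (index 2 of each inner array)
my @by_z = sort { $a->[2] <=> $b->[2] } @coord;

# 4.2 center: sum each column, then divide by the number of points
my @center = ( 0, 0, 0 );
foreach my $point ( @coord ) {
    $center[$_] += $point->[$_] for 0 .. 2;
}
@center = map { $_ / scalar @coord } @center;

# Euclidean distance between two points given as array references
sub distance {
    my ( $p, $q ) = @_;
    my $sum = 0;
    $sum += ( $p->[$_] - $q->[$_] )**2 for 0 .. 2;
    return sqrt( $sum );
}

# 4.3 keep the points that lie within 2.0 of the center
my @near = grep { distance( $_, \@center ) <= 2.0 } @coord;

print "center: @center\n";
print "near: ", join( " ", map { "(@$_)" } @near ), "\n";
```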
Complex Data Structures
# 5. hash of hashes of arrays of arrays
# this one is more complicated and only intended for those who feel the above
# is trivial. note that loc is an array of the beginning and ending positions of
# the gene on the chromosome, and exon is an array of arrays of the beginning
# and ending position of each exon; the exon coordinates are an offset from the
# beginning of the gene given in loc.
my %gene = (
    At5g04870 => { gene  => "cpk1",
                   loc   => [ 1416783, 1420338 ],
                   exon  => [ [ 1001, 1809 ],
                              [ 2171, 2314 ],
                              [ 2400, 2552 ] ],
                   xsome => 5
                 },
    At1g18890 => { gene  => "cpk10",
                   loc   => [ 6522764, 6525962 ],
                   exon  => [ [ 1001, 1298 ],
                              [ 1540, 2693 ] ],
                   xsome => 1
                 },
    At4g21940 => { gene  => "cpk15",
                   loc   => [ 11640802, 11643762 ],
                   exon  => [ [ 1001, 2379 ],
                              [ 2497, 2640 ],
                              [ 2736, 2888 ],
                              [ 3050, 3165 ],
                              [ 3321, 3488 ] ],
                   xsome => 4
                 },
);
# 5.1 list each gene and its exons in alphabetical order (by the "gene" key)
# 5.2 list the genes and their locations in order of the number of exons
# 5.3 list the genes and their locations in order of the longest exon in each gene
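A rough sketch of exercises 5.2 and 5.3, assuming the %gene structure above. The helper longest_exon() is hypothetical and treats both exon coordinates as inclusive.

```perl
use strict;
use warnings;

my %gene = (
    At5g04870 => { gene => "cpk1",  loc => [ 1416783, 1420338 ], xsome => 5,
                   exon => [ [ 1001, 1809 ], [ 2171, 2314 ], [ 2400, 2552 ] ] },
    At1g18890 => { gene => "cpk10", loc => [ 6522764, 6525962 ], xsome => 1,
                   exon => [ [ 1001, 1298 ], [ 1540, 2693 ] ] },
    At4g21940 => { gene => "cpk15", loc => [ 11640802, 11643762 ], xsome => 4,
                   exon => [ [ 1001, 2379 ], [ 2497, 2640 ], [ 2736, 2888 ],
                             [ 3050, 3165 ], [ 3321, 3488 ] ] },
);

# length of the longest exon of one gene (hypothetical helper)
sub longest_exon {
    my ( $info ) = @_;
    my $max = 0;
    foreach my $exon ( @{ $info->{exon} } ) {
        my $len = $exon->[1] - $exon->[0] + 1;
        $max = $len if $len > $max;
    }
    return $max;
}

# 5.2 an array dereference in scalar context gives the number of exons
my @by_count = sort { scalar @{ $gene{$a}{exon} } <=> scalar @{ $gene{$b}{exon} } } keys %gene;

# 5.3 compare the genes by their longest exon
my @by_longest = sort { longest_exon( $gene{$a} ) <=> longest_exon( $gene{$b} ) } keys %gene;

foreach my $id ( @by_longest ) {
    my ( $begin, $end ) = @{ $gene{$id}{loc} };
    printf "%-6s %d-%d longest exon %d\n", $gene{$id}{gene}, $begin, $end, longest_exon( $gene{$id} );
}
```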
Internet Programming
• CPAN libwww-perl (LWP)
• LWP cookbook – http://search.cpan.org/~gaas/libwww-perl-6.04/lwpcook.pod
• Perl Cookbook (available on Safari)
Internet Programming
Internet packages
• LWP::Simple
○ Simple fetching of web pages and "GET" method forms
• LWP::UserAgent
○ More complicated fetching of "POST" method forms, uses HTTP::Request and HTTP::Response
• HTTP::Request
○ Create HTTP formatted requests
• HTTP::Response
○ Parse HTTP formatted responses
• URI::URL
○ methods for handling URLs
• HTML
○ methods for handling HTML formatted files
Internet Programming
wget
• wget is available to fetch webpages on most unix systems
use strict;
my $url = "http://plantsp.genomics.purdue.edu";
my $content = `wget -q -O - $url`;    # -O - sends the page to stdout so the backticks capture it
Internet Programming
LWP Package
• Short for libwww-perl
• LWP::Simple
• get($url)
○ The get() function will fetch the document identified by the given URL and
return it. It returns undef if it fails. The $url argument can be either a simple
string or a reference to a URI object.
• head($url)
○ Get document headers. Returns the following 5 values if successful:
($content_type, $document_length, $modified_time, $expires, $server)
○ Returns an empty list if it fails. In scalar context returns TRUE if successful.
• getprint($url)
○ Get and print a document identified by a URL. The document is printed to
the selected default filehandle for output (normally STDOUT) as data is
received from the network. If the request fails, then the status code and
message are printed on STDERR. The return value is the HTTP response
code.
• getstore($url, $file)
○ Gets a document identified by a URL and stores it in the file. The return
value is the HTTP response code.
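• mirror($url, $file)
○ Gets and stores a document identified by a URL, using If-Modified-Since so
that an unchanged document is not downloaded again. The return value is the
HTTP response code.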
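Because getstore() and mirror() return an HTTP response code rather than the content, the code has to be checked by the caller. The sketch below shows one way to do that with the is_success() and status_message() functions from HTTP::Status. The helper describe_status() and the default file name page.html are made up for illustration, and the network is only used when a URL is given on the command line.

```perl
use strict;
use warnings;
use LWP::Simple qw( getstore );
use HTTP::Status qw( is_success status_message );

# turn an HTTP response code into a one-line report (hypothetical helper)
sub describe_status {
    my ( $code ) = @_;
    my $verdict = is_success( $code ) ? "OK" : "FAILED";
    return "$verdict: $code " . status_message( $code );
}

# only touch the network when a URL is supplied: script.pl <url> [file]
if ( @ARGV ) {
    my ( $url, $file ) = @ARGV;
    $file = "page.html" unless defined $file;
    my $code = getstore( $url, $file );
    print describe_status( $code ), "\n";
}
```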
Internet Programming
LWP::Simple
• Getting a web page
• Most basic, little more than wget
use strict;
use LWP::Simple;
my $url = "http://plantsp.genomics.purdue.edu";
my $content = get ( $url );
• What if something goes wrong?
Internet Programming
LWP::Simple
• Getting a web page
• Checking for errors, better than wget
use strict;
use LWP::Simple;
my $url = "http://plantsp.genomics.purdue.edu";
my $content = get( $url );
unless ( defined $content ) {    # test for success
    die "unable to access $url\n\n";
}
• Inconvenient
○ Have to alter code each time
○ I get bored typing http://
Internet Programming
LWP::Simple
• Getting a web page
• More useful with getopt
○ Doesn't hard code the URL
○ Supplies the http:// prefix
use strict;
use Getopt::Std;
use LWP::Simple;
my $option = {};
getopts( 'u:', $option );    # the colon means -u takes a value
my $url = "http://plantsp.genomics.purdue.edu";    # default URL
if ( $$option{u} ) {
    $url = $$option{u};
}
unless ( $url =~ m{^https?://}i ) {    # add http:// prefix if missing
    $url = "http://".$url;
}
my $content = get( $url );
unless ( defined $content ) {    # test for success
    die "unable to access $url\n\n";
}
Internet Programming
LWP::Simple
• LWP::Simple works well with REST–based web services
• NCBI E-utilities (eutils, http://www.ncbi.nlm.nih.gov/books/NBK25500/), provided by NCBI to
○ search databases (esearch)
○ download summaries (esummary)
○ download complete entries (efetch)
○ upload UIDs to NCBI server for later processing (epost)
○ query Entrez (egquery)
○ trace links in entries (elink)
○ examine database statistics and fields (einfo)
○ retrieve spelling suggestions (espell)
• Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/xxx.fcgi
Internet Programming
LWP::Simple
• esearch
○ url: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
○ parameters:
− db – databases to search (pubmed, protein, nucleotide, genome, etc)
− term – search term
− usehistory – y|n, store the results of search on server
− http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer
− note: no spaces in term
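Since the term may not contain spaces, one option is to percent-encode it before building the URL. The sketch below does this with a plain regular expression and assumes an ASCII search term. The helpers url_escape() and esearch_url() are illustrative names, and %20 is used for a space where the slide uses +.

```perl
use strict;
use warnings;

my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# percent-encode everything except letters, digits and - . _ ~
sub url_escape {
    my ( $text ) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf( "%%%02X", ord( $1 ) )/ge;
    return $text;
}

# assemble an esearch URL from the three parameters listed above
sub esearch_url {
    my ( $db, $term, $usehistory ) = @_;
    my $url = $BASE . "esearch.fcgi?db=" . url_escape( $db ) . "&term=" . url_escape( $term );
    $url .= "&usehistory=y" if $usehistory;
    return $url;
}

print esearch_url( "pubmed", "science[journal] AND breast cancer" ), "\n";
```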
Internet Programming
LWP::Simple
• efetch
○ url: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
○ parameters:
− db – databases to search (pubmed, protein, nucleotide, genome, etc)
− id – uid list (e.g., &id=15718680,157427902,119703751)
− rettype – retrieval type, varies with database:
Abstract or MEDLINE from PubMed, or GenPept or FASTA from protein
− retmode – e.g., text, HTML or XML
− retstart – sequential index of the first record to be retrieved
− retmax – total number of records from the input set to be retrieved
− WebEnv – specifies the Web Environment that contains the UID list to be
provided as input to EFetch
− query_key – specifies which of the UID lists attached to the given Web
Environment will be used as input to EFetch
efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=15718680,157427902,119703751
Internet Programming
LWP::Simple
• simple esearch script
#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to retrieve sequences from pubmed
#
# Gribskov Admin    Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $database = "protein";
my $query    = "arsenite reductase AND arabidopsis";
my $search   = $BASE."esearch.fcgi?db=$database&term=$query";
print "searching $search...\n\n";
my $result = get $search;
print "$result\n";
exit 0;
Internet Programming
LWP::Simple
searching http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=arsenite reductase AND arabidopsis...
<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN"
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
<Count>5</Count>
<RetMax>5</RetMax>
<RetStart>0</RetStart>
<IdList>
<Id>28868893</Id>
<Id>62286622</Id>
<Id>410092315</Id>
<Id>409760340</Id>
<Id>28852132</Id>
</IdList>
<TranslationSet>
<Translation>
<From>arsenite reductase</From>
<To>arsenite reductase[Protein Name] OR (arsenite[All Fields] AND reductase[All Fields])</To>
</Translation>
<Translation>
<From>arabidopsis</From>
<To>"Arabidopsis"[Organism] OR arabidopsis[All Fields]</To>
</Translation>
</TranslationSet>
Internet Programming
LWP::Simple
<TranslationStack>
<TermSet>
<Term>arsenite reductase[Protein Name]</Term>
<Field>Protein Name</Field>
<Count>32</Count>
<Explode>N</Explode>
</TermSet>
<TermSet>
<Term>arsenite[All Fields]</Term>
<Field>All Fields</Field>
<Count>94001</Count>
<Explode>N</Explode>
</TermSet>
<TermSet>
<Term>reductase[All Fields]</Term>
<Field>All Fields</Field>
<Count>1646624</Count>
<Explode>N</Explode>
</TermSet>
<OP>AND</OP>
<OP>GROUP</OP>
<OP>OR</OP>
<OP>GROUP</OP>
<TermSet>
<Term>"Arabidopsis"[Organism]</Term>
<Field>Organism</Field>
<Count>0</Count>
<Explode>N</Explode>
</TermSet>
<TermSet>
<Term>arabidopsis[All Fields]</Term>
<Field>All Fields</Field>
<Count>1005396</Count>
<Explode>N</Explode>
</TermSet>
<OP>OR</OP>
<OP>GROUP</OP>
<OP>AND</OP>
</TranslationStack>
<QueryTranslation>(arsenite reductase[Protein Name] OR (arsenite[All Fields] AND reductase[All Fields])) AND ("Arabidopsis"[Organism]
OR arabidopsis[All Fields])
</QueryTranslation>
</eSearchResult>
Internet Programming
LWP::Simple
#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to retrieve sequences from pubmed
#
# Gribskov Admin    Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $database = "protein";
my $query    = "arsenite reductase AND arabidopsis";
my $search   = $BASE."esearch.fcgi?db=$database&term=$query";
print "searching $search...\n\n";
my $result = get $search;
#print "$result\n";
# get IDs
my ( $ids ) = $result =~ /<IdList>(.*)<\/IdList>/s;
$ids =~ s/<\/?Id>//g;
my @idlist = split " ", $ids;
print "idlist:@idlist\n";
exit 0;
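The ID extraction can be tried without a network connection. The sketch below runs it on an abbreviated copy of the esearch reply and, as a variation on the script above, collects each <Id> with a global match and returns an empty list when the reply or the IdList is missing. The function name extract_ids() is an assumption for illustration.

```perl
use strict;
use warnings;

# abbreviated copy of an esearch reply, embedded so no request is needed
my $result = <<'XML';
<eSearchResult>
<Count>3</Count>
<IdList>
<Id>28868893</Id>
<Id>62286622</Id>
<Id>410092315</Id>
</IdList>
</eSearchResult>
XML

# return the list of UIDs found inside <IdList>, or an empty list
sub extract_ids {
    my ( $xml ) = @_;
    return () unless defined $xml;    # get() returns undef on failure
    my ( $idlist ) = $xml =~ m{<IdList>(.*?)</IdList>}s;
    return () unless defined $idlist;
    my @ids = $idlist =~ m{<Id>(\d+)</Id>}g;
    return @ids;
}

my @idlist = extract_ids( $result );
print "idlist:@idlist\n";
```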
Internet Programming
LWP::Simple
#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to retrieve sequences from pubmed
#
# Gribskov Admin    Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $database = "protein";
my $query    = "arsenite reductase AND arabidopsis";
#my $search  = $BASE."esearch.fcgi?db=$database&term=$query&usehistory=y";
my $search   = $BASE."esearch.fcgi?db=$database&term=$query";
print "searching $search...\n\n";
my $result = get $search;
#print "$result\n";
# get IDs
my ( $ids ) = $result =~ /<IdList>(.*)<\/IdList>/s;
$ids =~ s/<\/?Id>//g;
my @idlist = split " ", $ids;
print "idlist:@idlist\n";
# retrieve with efetch
my $idstring = join ",", @idlist;
my $fetch = $BASE."efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=$idstring";
print "fetch:$fetch\n";
my $sequence = get $fetch;
print $sequence;
Internet Programming
LWP::Simple
>gi|28868893|ref|NP_791512.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
MTDLTLYHNPRCTKSRGALELLQARGLTPDIILYLETPPDAGTLHDLLGKLGISARQLLRTGEDDYKQLN
LADPSLSDEQLVAAMAAHPKLIERPILVAGNKAVIGRPPENILELLP
>gi|62286622|sp|Q8GY31.1|CDC25_ARATH RecName: Full=Dual specificity phosphatase Cdc25; AltName: Full=Arath;CDC25;
AltName: Full=Arsenate reductase 2; AltName: Full=Sulfurtransferase 5; Short=AtStr5
MGRSIFSFFTKKKKMAMARSISYITSTQLLPLHRRPNIAIIDVRDEERNYDGHIAGSLHYASGSFDDKIS
HLVQNVKDKDTLVFHCALSQVRGPTCARRLVNYLDEKKEDTGIKNIMILERGFNGWEASGKPVCRCAEVP
CKGDCA
>gi|410092315|ref|ZP_11288844.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
MTDLTLYHNPRCTKSRGALELLQARGLSPDVVLYLETPPDAAQLRELLGKLGISARQLLRTGEDDYKQLN
LADASLSDEQLIAAMAAHPKLIERPILVVGDKAVIGRPPENVLELLP
>gi|409760340|gb|EKN45494.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
MTDLTLYHNPRCTKSRGALELLQARGLSPDVVLYLETPPDAAQLRELLGKLGISARQLLRTGEDDYKQLN
LADASLSDEQLIAAMAAHPKLIERPILVVGDKAVIGRPPENVLELLP
>gi|28852132|gb|AAO55207.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
MTDLTLYHNPRCTKSRGALELLQARGLTPDIILYLETPPDAGTLHDLLGKLGISARQLLRTGEDDYKQLN
LADPSLSDEQLVAAMAAHPKLIERPILVAGNKAVIGRPPENILELLP
Internet Programming
LWP::Simple
• Large retrievals with Eutils
○ NCBI allows the results of a large query to be stored on their database
and used in other queries using the usehistory=y parameter with
esearch
○ multiple sets of sequences can then be retrieved in chunks using
− retstart – index of first sequence to retrieve
− retmax – number of sequences to retrieve
○ NCBI recommends setting retmax = 500 to avoid having an adverse
impact on their services
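The chunking arithmetic can be sketched on its own, without contacting NCBI. The function chunk_starts() is a hypothetical helper that lists the retstart value for each request, given the number of matches and retmax.

```perl
use strict;
use warnings;

# list the retstart value of every chunk needed to cover $matches records
sub chunk_starts {
    my ( $matches, $retmax ) = @_;
    my @starts;
    for ( my $retstart = 0; $retstart < $matches; $retstart += $retmax ) {
        push @starts, $retstart;
    }
    return @starts;
}

my $matches = 1200;
my $retmax  = 500;
foreach my $retstart ( chunk_starts( $matches, $retmax ) ) {
    # the last chunk may hold fewer than $retmax records
    my $remaining = $matches - $retstart;
    my $count     = $remaining < $retmax ? $remaining : $retmax;
    print "retstart=$retstart retmax=$retmax ($count records)\n";
}
```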
Internet Programming
LWP::Simple
• esearch
○ additional information with &usehistory=y
<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN"
"http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
<Count>5</Count>
<RetMax>5</RetMax>
<RetStart>0</RetStart>
<QueryKey>1</QueryKey>
<WebEnv>NCID_1_3419506_165.112.9.24_5555_1362406574_334475218</WebEnv>
<IdList>
<Id>28868893</Id>
<Id>62286622</Id>
<Id>410092315</Id>
<Id>409760340</Id>
<Id>28852132</Id>
Internet Programming
LWP::Simple
#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to retrieve sequences from pubmed
#
# Gribskov Admin    Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $database = "protein";
my $query    = "arsenite reductase AND arabidopsis";
my $search   = $BASE."esearch.fcgi?db=$database&term=$query&usehistory=y";
print "searching for $query ...\n\n";
my $result = get $search;
print $result;
# get number of matches, WebEnv and query_key
my ( $webenv )    = $result =~ /<WebEnv>(\S+)<\/WebEnv>/s;
my ( $query_key ) = $result =~ /<QueryKey>(\d+)<\/QueryKey>/s;
my ( $matches )   = $result =~ /<Count>(\d+)<\/Count>/s;
print "WebEnv:$webenv
query_key:$query_key
matches: $matches\n";
# retrieve with efetch
my $retmax   = 2;
my $retstart = 0;
while ( $retstart < $matches ) {    # stop once every match has been fetched
    my $fetch = $BASE."efetch.fcgi?db=protein";
    $fetch .= "&retmode=text&rettype=fasta";
    $fetch .= "&retmax=$retmax&retstart=$retstart";
    $fetch .= "&WebEnv=$webenv&query_key=$query_key";
    print "start=$retstart
query:$fetch\n";
    my $sequence = get $fetch;
    print $sequence;
    $retstart += $retmax;
}
exit 0;
Internet Programming
LWP::Simple
WebEnv:NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803
query_key:1
matches: 5
start=0
query:http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&retmax=2&retstart=0&WebEnv=NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803&query_key=1
>gi|28868893|ref|NP_791512.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
MTDLTLYHNPRCTKSRGALELLQARGLTPDIILYLETPPDAGTLHDLLGKLGISARQLLRTGEDDYKQLN
LADPSLSDEQLVAAMAAHPKLIERPILVAGNKAVIGRPPENILELLP
>gi|62286622|sp|Q8GY31.1|CDC25_ARATH RecName: Full=Dual specificity phosphatase Cdc25; AltName: Full=Arath;CDC25; AltName:
Full=Arsenate reductase 2; AltName: Full=Sulfurtransferase 5; Short=AtStr5
MGRSIFSFFTKKKKMAMARSISYITSTQLLPLHRRPNIAIIDVRDEERNYDGHIAGSLHYASGSFDDKIS
HLVQNVKDKDTLVFHCALSQVRGPTCARRLVNYLDEKKEDTGIKNIMILERGFNGWEASGKPVCRCAEVP
CKGDCA
start=2
query:http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&retmax=2&retstart=2&WebEnv=NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803&query_key=1
>gi|410092315|ref|ZP_11288844.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
MTDLTLYHNPRCTKSRGALELLQARGLSPDVVLYLETPPDAAQLRELLGKLGISARQLLRTGEDDYKQLN
LADASLSDEQLIAAMAAHPKLIERPILVVGDKAVIGRPPENVLELLP
>gi|409760340|gb|EKN45494.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
MTDLTLYHNPRCTKSRGALELLQARGLSPDVVLYLETPPDAAQLRELLGKLGISARQLLRTGEDDYKQLN
LADASLSDEQLIAAMAAHPKLIERPILVVGDKAVIGRPPENVLELLP
start=4
query:http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&retmax=2&retstart=4&WebEnv=NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803&query_key=1
>gi|28852132|gb|AAO55207.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
MTDLTLYHNPRCTKSRGALELLQARGLTPDIILYLETPPDAGTLHDLLGKLGISARQLLRTGEDDYKQLN
LADPSLSDEQLVAAMAAHPKLIERPILVAGNKAVIGRPPENILELLP
Internet Programming
LWP::Simple
• Drawbacks to eutils script
○ must change the program for every different search
○ no progress report while running
○ no error checking
• Enhancing flexibility with getopt
Internet Programming
LWP::Simple
#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to retrieve sequences from pubmed
#
# Gribskov Admin    Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;
# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $DEFAULT_INTERVAL = 10000;
my $DEFAULT_DATABASE = 'protein';
my $USAGE = qq{eutil.pl <query_string>
    -h    this usage message
    -d    <string>, NCBI database to search (default=$DEFAULT_DATABASE)
    -i    <int>, interval for reporting progress (default=$DEFAULT_INTERVAL)
};
# command line options
my $option = {};
getopts( 'd:hi:', $option );
# help
if ( $$option{h} ) {
    print "$USAGE\n";
    exit 1;
}
my $database = $DEFAULT_DATABASE; if ( $$option{d} ) { $database = $$option{d}; }
my $interval = $DEFAULT_INTERVAL; if ( $$option{i} ) { $interval = $$option{i}; }
my $query    = $ARGV[0];
Internet Programming
LWP::Simple
#------------------------------------------------------------------------------
# main program
#------------------------------------------------------------------------------
print STDERR "eutil.pl\n";
print STDERR "    report interval: $interval\n";
print STDERR "    database: $database\n";
print STDERR "    query: $query\n\n";
my $search = $BASE."esearch.fcgi?db=$database&term=$query&usehistory=y";
print STDERR "searching for $query ...\n\n";    # keep STDOUT for the sequences
my $result = get $search;
# get number of matches, WebEnv and query_key
my ( $webenv )    = $result =~ /<WebEnv>(\S+)<\/WebEnv>/s;
my ( $query_key ) = $result =~ /<QueryKey>(\d+)<\/QueryKey>/s;
my ( $matches )   = $result =~ /<Count>(\d+)<\/Count>/s;
print STDERR "matches: $matches
WebEnv:$webenv
query_key:$query_key\n\n";
# retrieve with efetch
my $retmax   = 500;
my $retstart = 0;
while ( $retstart < $matches ) {
    my $fetch = $BASE."efetch.fcgi?db=$database";    # fetch from the database chosen with -d
    $fetch .= "&retmode=text&rettype=fasta";
    $fetch .= "&retmax=$retmax&retstart=$retstart";
    $fetch .= "&WebEnv=$webenv&query_key=$query_key";
    my $sequence = get $fetch;
    print $sequence;
    $retstart += $retmax;
    unless ( $retstart % $interval ) { print STDERR "    $retstart sequences retrieved\n"; }
}
exit 0;
Internet Programming
Running servers without using the web form
• LWP::UserAgent
• Blast @ PlantsP
○ Find homologous sequences using BLAST
http://xplantsp.genomics.purdue.edu/cgibin/blast_tmpl_soap.cgi?db=PlantsP
Internet Programming
LWP::UserAgent
$ua->agent( $product_id )
• Get/set the product token that is used to identify the user agent on the network. The agent value is sent as the
"User-Agent" header in the requests. The default is the string returned by the _agent() method (see below).
• If the $product_id ends with space then the _agent() string is appended to it.
• The user agent string should be one or more simple product identifiers with an optional version number
separated by the "/" character. Examples are:
• $ua->agent('Checkbot/0.4 ' . $ua->_agent);
  $ua->agent('Checkbot/0.4 ');    # same as above
  $ua->agent('Mozilla/5.0');
  $ua->agent("");                 # don't identify
$ua->_agent
• Returns the default agent identifier. This is a string of the form "libwww-perl/#.##", where "#.##" is substituted
with the version number of this library.
$ua->from( $email_address )
• Get/set the e-mail address for the human user who controls the requesting user agent. The address should be
machine-usable, as defined in RFC 822. The from value is sent as the "From" header in the requests. Example:
• $ua->from('[email protected]');
• The default is to not send a "From" header. See the default_headers() method for the more general interface that
allows any header to be defaulted.
$ua->max_size( $bytes )
• Get/set the size limit for response content. The default is undef, which means that there is no limit. If the returned
response content is only partial, because the size limit was exceeded, then a "Client-Aborted" header will be
added to the response. The content might end up longer than max_size as we abort once appending a chunk of
data makes the length exceed the limit. The "Content-Length" header, if present, will indicate the length of the
full content and will normally not be the same as length($res->content).
$ua->timeout( $secs )
• Get/set the timeout value in seconds. The default timeout() value is 180 seconds, i.e. 3 minutes.
• The request is aborted if no activity on the connection to the server is observed for timeout seconds. This
means that the time it takes for the complete transaction and the request() method to actually return might be
longer.
Internet Programming
LWP::UserAgent - REQUEST METHODS
$ua->get( $url , $field_name => $value, ... )
• This method will dispatch a GET request on the given $url. Further arguments can be given to initialize the
headers of the request. These are given as separate name/value pairs. The return value is a response object.
$ua->head( $url , $field_name => $value, ... )
• This method will dispatch a HEAD request on the given $url. Otherwise it works like the get() method described
above.
$ua->post( $url, \%form )
$ua->post( $url, \@form )
$ua->post( $url, \%form, $field_name => $value, ... )
• This method will dispatch a POST request on the given $url, with %form or @form providing the key/value pairs
for the fill-in form content. Additional headers and content options are the same as for the get() method.
$ua->request( $request, $content_file )
• This method will dispatch the given $request object. Normally this will be an instance of the HTTP::Request
class, but any object with a similar interface will do. The return value is a response object. See the
HTTP::Request manpage and the HTTP::Response manpage for a description of the interface provided by these
classes.
• The request() method will process redirects and authentication responses transparently. This means that it may
actually send several simple requests via the simple_request() method described below.
$ua->simple_request( $request )
• This method dispatches a single request and returns the response received.
Arguments are the same as for request() described above.
• The difference from request() is that simple_request() will not try to handle redirects or authentication
responses. The request() method will in fact invoke this method for each simple request it sends.
$ua->redirect_ok( $prospective_request, $response )
• This method is called by request() before it tries to follow a redirection to the request in $response. This should
return a TRUE value if this redirection is permissible. The $prospective_request will be the request to be sent if
this method returns TRUE.
• The base implementation will return FALSE unless the method is in the object's requests_redirectable list,
FALSE if the proposed redirection is to a "file://..." URL, and TRUE otherwise.
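A small sketch tying the configuration and request methods together. The product name Biocomputing-demo/0.1, the example.com URL and the form fields are all made up for illustration. The request is built and inspected locally, and it is only sent when the script is run with --send.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw( POST );

my $ua = LWP::UserAgent->new;
$ua->agent( "Biocomputing-demo/0.1 " );    # trailing space: the default libwww-perl token is appended
$ua->timeout( 30 );                        # give up after 30 seconds without activity

# hypothetical form URL and fields, for illustration only
my $request = POST "http://www.example.com/cgi-bin/search.cgi",
    [ query => "cpk1", format => "text" ];

# inspect the request that would be sent
print $request->method, " ", $request->uri, "\n";
print $request->content, "\n";

# dispatch only when asked to, and check the response before using it
if ( @ARGV && $ARGV[0] eq "--send" ) {
    my $response = $ua->request( $request );
    if ( $response->is_success ) {
        print $response->decoded_content;
    } else {
        die "request failed: ", $response->status_line, "\n";
    }
}
```

Checking is_success() before reading the content separates a failed request from an empty page, which a bare get() cannot always do.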
Internet Programming
Running servers without using the web form
• LWP::UserAgent
• Blast @ PlantsP
○ Find homologous sequences using BLAST
http://xplantsp.genomics.purdue.edu/cgibin/blast_tmpl_soap.cgi?db=PlantsP
Internet programming
Form information
• chrome – web developer extension
• firefox
○ firebug
○ web developer
• safari
○ web inspector
○ firebug
Internet Programming
LWP::UserAgent
• Blast search form
• Gets from user
○ DATALIB
○ SEQUENCE
• Defaults
○ PROGRAM
○ UNGAPPED_ALIGNMENT
○ FSET
○ EXPECT
○ DESCRIPTIONS
○ ALIGNMENTS
• Hidden
○ db
Internet Programming
Finding Form variables in page source
• <FORM>
○ may be more than one
• <INPUT>
○ box
○ radio button
○ checkbox
• <SELECT>
○ pulldown menu
• <TEXTAREA>
○ a large box for text
Internet Programming
BLAST@PlantsP
• DB
○ hidden -> "plantsp"
• PROGRAM
○ Select ->
<option value=blastp SELECTED> blastp (prot. vs prot.)
<option value=blastn > blastn (DNA vs DNA)
<option value=blastx> blastx (transl. DNA vs prot.)
<option value=tblastn> tblastn (prot. vs transl. DNA)
<option value=tblastx> tblastx (transl. DNA vs transl. DNA)
• DATALIB
○ Select ->
<OPTION VALUE=ap>----- Protein databases ----
<OPTION VALUE=tigr_osa5prot>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Rice Proteins (TIGR release 5 - 01/24/2007)
<OPTION VALUE=physco_pro>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Physcomitrella proteins (JGI v1.1 - March 2007)
<OPTION VALUE=selmo1_pro>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Selaginella proteins (JGI v1.0 - March 2007 (released 10/31))
<OPTION VALUE=all_pro>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;All Plants(P+T+Ubq) Proteins (Purdue - 28 Jan 2008)
<OPTION VALUE=tair_ath8prot>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Arabidopsis proteins (TAIR release 8 - 2008-05-16)
<OPTION VALUE=vp_090116>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Viridiplantae proteins (All viridiplantae proteins - 01/16/2009)
<OPTION VALUE=PlantProteinDB>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Plant Protein (Combined plant protein database - 2009-0313)
<OPTION VALUE=an>----- DNA databases ----
<OPTION VALUE=osa_indica>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Indica (chinese) Rice Genomic Sequence (Yu et al. - 9/5/2002)
…
Internet Programming
HTML Forms
<INPUT TYPE=checkbox NAME=UNGAPPED_ALIGNMENT VALUE=is_set>
Perform ungapped alignment <BR>
The query sequence is <INPUT TYPE=checkbox NAME=FSET VALUE=isset CHECKED>
filtered for low complexity regions by default. <BR>
<TEXTAREA NAME=SEQUENCE ROWS=6 COLS=80 VALUE=></TEXTAREA>
Expect Cutoff &nbsp; <select name=EXPECT>
<option> 1e-100 <option> 1e-50 <option> 1e-25
<option> 1e-20 <option> 1e-15 <option> 1e-10
<option> 1e-5 <option> 1.0 <option selected> 10.0
<option> 100.0 <option> 500.0
<option> 1000.0 </select> &nbsp;
Internet Programming
Finding Form Fields by Element Info
Internet Programming
LWP::UserAgent
• BLAST @ PlantsP
use strict;
use HTTP::Request::Common qw( POST );
use LWP::UserAgent;
my $site = "http://plantsp.genomics.purdue.edu/";
my $target = "plantsp/cgi-bin/blast_basic.cgi";
my $seq = "
MAKNVMQLAILSTQRVVLLLWLLHAPAAADAALTTVAGCPSKCGDVDIPLPFGIGDHCAW
ESFDVVCNESFSPPRPHTGNIEIKEISVEAGEMRVYTPVADQCYNSSSTSAPGFGASLEL
TAPFLLAQSNEFTAIGCNTVAFLDGRNNGSYSTGCITTCGSVEAAAQNGEPCTGLGCCQV
PSIPPNLTTLHISWNDQGFLNFTPIGTPCSYAFVAQKDWYNFSRQDFGPVGSKDFITNST";
$seq =~ s/\s+//g;    # remove the newlines and spaces from the sequence
my $library = "pp_active_prots";
my $agent = LWP::UserAgent->new();
# one key/value pair per form field found in the page source
my $request = POST $site.$target,
    [ DATALIB            => $library,
      SEQUENCE           => $seq,
      PROGRAM            => "blastp",
      UNGAPPED_ALIGNMENT => 1,
      FSET               => 1,
      EXPECT             => "10.0",    # quoted so it is sent exactly as the form lists it
      DESCRIPTIONS       => 10,
      ALIGNMENTS         => 10,
      db                 => "plantsp"
    ];
my $response = $agent->request( $request );
print $response->as_string;
Internet Programming
ClustalW2 @ EBI
• Multiple sequence alignment
• http://www.ebi.ac.uk/Tools/msa/clustalw2/
Internet Programming
ClustalW2 @ EBI
Internet Programming
TmHmm 2.0
• Predict transmembrane helices
• http://www.cbs.dtu.dk/services/TMHMM/
Internet Programming
TmHmm - Page info
Internet Programming
Psort
• Prediction of sorting signals
• http://wolfpsort.seq.cbrc.jp/
Internet Programming
PSort Output
• Formatted HTML
Internet Programming
PSort Output
• HTML Source
• Requires "Screen Scraping"
Internet Programming
Removing HTML Tags
• A stupid approach
( my $text = $html ) =~ s/<[^>]*>//g;    # copy $html, then strip tags from the copy
• Fails for
○ <IMG SRC="foo.gif"
ALT="A foo in its natural habitat">
○ <IMG SRC="foo.gif" ALT="A > B">
○ <!-- <a comment> -->
○ <script>if (a<b && a>c)</script>
○ etc...
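The failures are easy to reproduce. The sketch below wraps the substitution in a function, here called strip_tags_naive() for illustration, and prints what is left of a few of the inputs listed above.

```perl
use strict;
use warnings;

# remove anything that looks like <...>; deliberately naive
sub strip_tags_naive {
    my ( $html ) = @_;
    ( my $text = $html ) =~ s/<[^>]*>//g;
    return $text;
}

my @cases = (
    '<b>bold</b> text',                    # works
    '<IMG SRC="foo.gif" ALT="A > B">',     # the > inside the attribute ends the match early
    '<!-- <a comment> -->',                # the comment is cut at its first >
    '<script>if (a<b && a>c)</script>',    # part of the script is taken for a tag
);

foreach my $html ( @cases ) {
    printf "%-36s => [%s]\n", $html, strip_tags_naive( $html );
}
```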
Internet Programming
Parsing HTML
• Use HTML package to find or remove tags
• better, but complicated
use HTML::Parser;
$tree = HTML::Parser->new(
    start_h => [ sub { print shift, "\n" }, "tag" ],
    text_h  => [ sub { print shift;
                       print "    ", shift, "\n" },
                 "line, dtext"
               ]
);
$tree->parse_file( "origins_life.htm" );
Internet Programming
Parsing HTML
• Parser package uses "event handlers"
• event => [ subr, information]
○ Event types:
− text
− start
− end
− declaration
− comment
− process
− start_document
− end_document
− default
• subr is a subroutine to process the information
○ for example a subroutine: sub{ print shift }
Internet Programming
Parsing HTML
• information types
○ attr - returns a reference to a hash of attribute name/value pairs
○ @attr - Basically the same as attr, but keys and values are returned as
individual arguments and the original sequence is preserved
○ attrseq - returns a reference to an array of attribute names
○ column - returns the column number of the start of the event
○ dtext - returns the decoded text
○ event - returns the event name
○ length - returns the number of bytes of the source text
○ line - returns the line number of the start of the event
○ skipped_text - returns the concatenated text of all the events that have
been skipped since the last time an event was reported
○ tagname - returns the element name
○ tokens - returns a reference to an array of token strings.
The strings are exactly as they were found in the original text, no
decoding or case changes are applied.
○ text - returns the source text (including markup element delimiters)
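As an illustration of a text handler with the dtext information type, the sketch below collects the decoded text of a small HTML string that contains the constructs which defeat the regular expression approach. The input string is made up, and it is assumed here that script and style content should be dropped, which is done with the parser's ignore_elements() method.

```perl
use strict;
use warnings;
use HTML::Parser;

# made-up input containing an entity, a comment, a script and a tricky attribute
my $html = '<p>A &gt; B</p><!-- <a comment> --><script>if (a<b && a>c)</script>'
         . '<IMG SRC="foo.gif" ALT="A > B">';

my @text;
my $parser = HTML::Parser->new(
    api_version   => 3,
    unbroken_text => 1,                                         # deliver each text run in one piece
    text_h        => [ sub { push @text, shift }, "dtext" ],    # handler receives the decoded text
);
$parser->ignore_elements( qw( script style ) );                 # skip these elements and their content
$parser->parse( $html );
$parser->eof;                                                   # signal end of input

my $text = join "", @text;
print "$text\n";
```

Only the text event has a handler, so tags and comments are parsed but produce no output.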
Internet Programming
Checking Document Links
• HTML::LinkExtor
○ Get all the links in a document and possibly process each
○ $parser = HTML::LinkExtor->new( $function, $url );
○ $parser->links clears and returns a list of links. Each element is an array
reference with the type of link and the attribute-value pairs from the tag
○ $function is normally undef, but can be a reference to a function that you
want to act on every link
<A HREF=http://www.perl.com/>Home</A>
<IMG SRC="images/big/jpg" LOWSRC="images/big-lowres.jpg">
• $parser->links returns
[
  [ a,   href   => http://www.perl.com/ ],
  [ img, src    => "images/big/jpg",
         lowsrc => "images/big-lowres.jpg" ]
]
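The structure returned by links can be explored on a string, with no network access. The sketch below parses the two example tags (with the href value quoted, an adjustment made here) and unpacks each array reference into a tag name and a hash of attributes.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

my $html = '<A HREF="http://www.perl.com/">Home</A>'
         . '<IMG SRC="images/big/jpg" LOWSRC="images/big-lowres.jpg">';

my $parser = HTML::LinkExtor->new;    # no callback: links are collected internally
$parser->parse( $html );
$parser->eof;

my @links = $parser->links;
foreach my $linkref ( @links ) {
    # first element is the tag name, the rest are attribute/value pairs
    my ( $tag, %attr ) = @$linkref;
    foreach my $name ( sort keys %attr ) {
        print "$tag $name => $attr{$name}\n";
    }
}
```

Unpacking into a hash covers tags such as img that carry more than one link attribute.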
Internet Programming
Checking Document Links
• LinkExtor
use strict;
use HTML::LinkExtor;
use LWP::Simple qw( get head );
my $base_url = shift || die "usage $0 <start_url>\n";
my $content = get( $base_url );
my $parser = HTML::LinkExtor->new();
$parser->parse( $content );
my @links = $parser->links;
print "base URL: $base_url\n\n";
foreach my $linkref ( @links ) {
    my @linklist = @$linkref;    # $linkref is a reference to a list
    my $type = shift @linklist;
    my ( $attr, $value ) = @linklist;
    # print "type: $type @linklist\n";
    # print "    attr: $attr    value:$value\n";
    if ( $value =~ m{^(?:ftp|https?)://}i ) {    # only check absolute links
        if ( head( $value ) ) {
            print "$value is OK\n";
        } else {
            print "$value is BAD\n";
        }
    }
}