Regular Expressions and Pattern Matching
Regular Expression (regex):
a separate language, allowing the construction of patterns.
used in most programming languages.
very powerful in Perl.
Pattern Match:
using regex to search data and look for a match.
Overview:
how to create regular expressions
how to use them to match and extract data
biological context
So Why Regex?
Parse files of data and information:
fasta
embl / genbank format
html (web-pages)
user input to programs
Check format
Find illegal characters (validation)
Search for sequence motifs
Simple Patterns
place regex between pair of forward slashes (/ /).
try:
#!/usr/bin/perl
while (<STDIN>) {
    if (/abc/) {
        print "1 >> $_";
    }
}
Run the script.
Type in something that contains abc:
abcfoobar
Type in something that doesn't:
fgh cba foobar
ab
c foobar
The input line is printed back only if abc is matched somewhere within it.
Simple Patterns (2)
Can also match strings from files.
genomes_desc.txt contains a few text lines containing information about
three genomes.
try:
#!/usr/bin/perl
open IN, "<genomes_desc.txt" or die "Cannot open genomes_desc.txt: $!";
while (<IN>) {
    if (/elegans/) {    #match lines with this regex
        print;          #print lines with match
    }
}
close IN;
Parses each line in turn.
Looks for elegans anywhere in line $_
Flexible matching
There are many characters with special meanings – metacharacters.
star (*) matches any number of instances
/ab*c/ => 'a' followed by zero or more 'b' followed by 'c'
=> abc or abbbbbbbc or ac
plus (+) matches at least one instance
/ab+c/ => 'a' followed by one or more 'b' followed by 'c'
=> abc or abbc or abbbbbbbbbbbbbbc NOT ac
question mark (?) matches zero or one instance
/ab?c/ => 'a' followed by 0 or 1 'b' followed by 'c'
=> abc or ac
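The three quantifiers above can be compared side by side with a minimal sketch like the one below. The helper name matches and the test strings are illustrative choices, not part of the original slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if $string matches $pattern, otherwise 0.
sub matches {
    my ($pattern, $string) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# Test strings chosen for illustration: zero, one and three 'b' characters.
for my $string ('ac', 'abc', 'abbbc') {
    printf "%-6s star:%d plus:%d question:%d\n",
        $string,
        matches(qr/ab*c/, $string),    # zero or more 'b'
        matches(qr/ab+c/, $string),    # one or more 'b'
        matches(qr/ab?c/, $string);    # zero or one 'b'
}
```

Only the star pattern should accept all three strings; the plus pattern rejects ac and the question mark pattern rejects abbbc.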
More General Quantifiers
Match a character a specific number or range of instances
{x} will match x number of instances.
/ab{3}c/ => abbbc
{x,y} will match between x and y instances.
/a{2,4}bc/ => aabc or aaabc or aaaabc
{x,} will match x+ instances.
/abc{3,}/ => abccc or abccccccccc or abc followed by any longer run of 'c'
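One point worth checking with the range quantifiers: a pattern matches anywhere in the string, so /a{2,4}bc/ still succeeds on a string with five 'a' characters, because the last four of them are followed by bc. The sketch below compares it with a version held in place by ^ and $ (the anchors described in the Anchors slides). The test strings are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative test strings with one, two, four and five leading 'a' characters.
my @strings = ('abc', 'aabc', 'aaaabc', 'aaaaabc');

my (%unanchored, %anchored);
for my $string (@strings) {
    # Succeeds if any stretch of 2 to 4 'a' is directly followed by 'bc'.
    $unanchored{$string} = $string =~ /a{2,4}bc/   ? 1 : 0;
    # Succeeds only if the whole string is 2 to 4 'a' followed by 'bc'.
    $anchored{$string}   = $string =~ /^a{2,4}bc$/ ? 1 : 0;
    printf "%-8s unanchored:%d anchored:%d\n",
        $string, $unanchored{$string}, $anchored{$string};
}
```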
More metacharacters
dot (.) matches any single character, including tab (\t) and space, but not newline (\n).
/a.*c/ => 'a' followed by any number of any characters followed by 'c'
Escaping
But I want to use these symbols in my regex!?!
To use * , + , ? or . as a literal character rather than a metacharacter,
'escape' it with a backslash.
/C\. elegans/ => C. elegans only
/C. elegans/ => Ca , Cb , C3 , C> , C. , etc...
The 'delimiter' of the regex, forward slash “/”, and the 'escape'
character, backslash “\”, are also metacharacters. These need to be
escaped if required in a regex.
Important when trying to match URLs and email addresses.
/joe\.bloggs\@darwin\.co\.uk/
/www\.envgen\.nox\.ac\.uk\/biolinux\.html/
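The effect of escaping the dot can be seen in a short sketch such as the following. The second test string, Cx elegans, is a made-up name used only to show what the unescaped dot lets through.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %result;
# 'Cx elegans' is an invented string for illustration.
for my $name ('C. elegans', 'Cx elegans') {
    $result{$name} = {
        escaped   => ($name =~ /C\. elegans/ ? 1 : 0),   # literal full stop
        unescaped => ($name =~ /C. elegans/  ? 1 : 0),   # any character after C
    };
    printf "%-12s escaped:%d unescaped:%d\n",
        $name, $result{$name}{escaped}, $result{$name}{unescaped};
}
```

The escaped pattern should accept only C. elegans, while the unescaped pattern accepts both strings.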
Using metacharacters.
The file nemaglobins.embl contains 21 EMBL database entries, each of which
contains a globin protein within its sequence.
try:
#!/usr/bin/perl
$count = 0;    #initialise the counter
open IN, "<nemaglobins.embl" or die "Cannot open nemaglobins.embl: $!";
while (<IN>) {
    if (/AC   .*/) {    #that's three spaces
        print;
        $count++;
    }
}
print "total=$count\n";
Grouping Patterns
Can group patterns in parentheses “()”.
Useful when coupled with quantifiers
/elegans+/ => eleganssssssssssssss
/(elegans)+/ => eleganselegans...elegans (elegans repeated 1, 2, ... n times)
/eleg(ans){4}/ => elegansansansans (ans repeated 4 times)
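To see what the quantifier applies to with and without parentheses, the sketch below prints the text each pattern actually matched. It uses $&, the special variable holding the matched text (it appears again in dna2rna.pl later in these slides). The test strings are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %matched;
# Illustrative strings: one with a repeated 's', one with the whole word repeated.
for my $string ('eleganssss', 'eleganselegans') {
    if ($string =~ /elegans+/) {      # '+' applies to the final 's' only
        $matched{$string}{letter} = $&;
    }
    if ($string =~ /(elegans)+/) {    # '+' applies to the whole group
        $matched{$string}{group} = $&;
    }
    printf "%-15s /elegans+/ matched '%s', /(elegans)+/ matched '%s'\n",
        $string, $matched{$string}{letter}, $matched{$string}{group};
}

# A fixed repeat count on a group: 'ans' exactly four times after 'eleg'.
my $four_ans = 'elegansansansans' =~ /eleg(ans){4}/ ? 1 : 0;
print "eleg(ans){4} matched: $four_ans\n";
```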
Alternatives
Want either this pattern or that pattern.
Two ways:
1.) the vertical bar '|' either the left side matches or the right side matches
/(human|mouse|rat)/ => any string with human or mouse or rat.
Combine with previous examples:
/Fugu( |\t)+rubripes/ matches if Fugu and rubripes are
separated by any mixture of spaces and tabs
2.) character class is a list of characters within '[]'. It will match any
single character within the class.
/[wxyz1234\t]/ => any of the nine.
a range can be specified with '-'
/[w-z1-4\t]/ => as above
to match a literal hyphen, place it first (or last) in the class, or escape it
/[-a-zA-Z]/ => any letter or a hyphen
negate a class by placing '^' as its first character
/[^z]/ => any character except z
/[^abc]/ => any character except a or b or c
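The two ways of writing alternatives are often interchangeable. As a minimal sketch, the Fugu example can be written with a vertical bar or with a character class, and both forms should agree on every input. The test strings are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative inputs: a single space, a mix of spaces and a tab, no separator.
my @names = ("Fugu rubripes", "Fugu \t  rubripes", "Fugurubripes");

my (@with_bar, @with_class);
for my $name (@names) {
    push @with_bar,   ($name =~ /Fugu( |\t)+rubripes/ ? 1 : 0);   # alternation
    push @with_class, ($name =~ /Fugu[ \t]+rubripes/  ? 1 : 0);   # character class
}

print "vertical bar:    @with_bar\n";
print "character class: @with_class\n";
```

For single characters the character class is usually the tidier choice; the vertical bar is needed when the alternatives are longer than one character, as in /(human|mouse|rat)/.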
Other Shortcuts
\d => any digit [0-9]
\w => any “word” character [A-Za-z0-9_]
\s => any white space [\t\n\r\f ]
\D => any character except a digit [^\d]
\W => any character except a “word” character [^\w]
\S => any character except a white space [^\s]
Can use any of these in conjunction with quantifiers,
/\s*/ => any amount of white space
Using alternatives to find a hydrophobic region...
try:
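open IN, "< nippo_sigpept.fsa" or die "Cannot open nippo_sigpept.fsa: $!";
$count = 0;
$match = 0;
while (<IN>) {
    if (/^>/) {         #a header line starts with '>'
        $count++;       #keep running total of sequence number
    }
    else {              #not a header
        #counts matching lines, so assumes each sequence is on one line
        if (/[VILMFWCA]{8,}/) {
            $match++;
        }
    }
}
print "Hydrophobic region found in $match sequences from $count\n";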
Could also have used /(V|I|L|M|F|W|C|A){8,}/
Binding Operator
Revisited?
So far matching against $_
The binding operator “=~” matches the pattern on the right against the string on
the left.
The m (match) operator in front of the pattern is usually added, but it is
optional when the delimiters are forward slashes.
$sumthing = 'Ascaris suum is a nematode';
if ($sumthing =~ m/suum.*nematode/) {
    print "this organism infects pigs!\n";
}
Anchors
/pattern/ will match anywhere in the string.
Use anchors to hold pattern to a point in the string.
caret “^” (shift 6) marks the beginning of string while dollar “$” marks end
of a string.
/^elegans/ => elegans only at start of string. Not C. elegans.
/Canis$/ => Canis only at end of string. Not Canis lupus.
/^\s*$/ => a blank line.
“$” also matches just before a trailing newline character “\n”.
N.B. compare use of “^” as an anchor with that in the character class.
Anchors (2)
Word Boundary
\b matches the start or end of a word.
/\bmus\b/ would match mus but not musculus
/la\b/ => Drosophila but not Plasmodium
/\btes/ => Comamonas testosteroni but not Pan troglodytes
A newline is not a word character, so \b still matches at the end of a word
that is followed by “\n”.
Be careful with full stops: they're characters too!
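The anchor and word boundary examples can be checked with a short sketch like this one. The patterns and most strings are taken from the slides above; C.elegans (no space after the full stop) and the Canis line with a trailing newline are illustrative additions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each pair is [pattern, test string].
my @checks = (
    [ qr/\bmus\b/,   'mus'        ],   # whole word
    [ qr/\bmus\b/,   'musculus'   ],   # 'mus' is followed by a word character
    [ qr/la\b/,      'Drosophila' ],   # 'la' ends the word
    [ qr/la\b/,      'Plasmodium' ],   # 'la' is inside the word
    [ qr/\belegans/, 'C.elegans'  ],   # the full stop is a non-word character
    [ qr/^elegans/,  'C. elegans' ],   # elegans is not at the start
    [ qr/Canis$/,    "Canis\n"    ],   # $ matches before the trailing newline
);

my @results;
for my $check (@checks) {
    my ($pattern, $string) = @$check;
    my $hit = $string =~ $pattern ? 1 : 0;
    push @results, $hit;
    (my $shown = $string) =~ s/\n/\\n/g;   # make the newline visible when printing
    printf "%-12s %s\n", $shown, ($hit ? 'match' : 'no match');
}
```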
Memory Variables
Able to extract sections of the pattern match and store in a variable.
Anything stored in parentheses “()” is written into a special variable.
The first instance is $1, the second $2, the fourth $4 and so on.
Extract from file:
Organism: Homo sapiens
...
Extract from Perl script:
while ($line = <IN>) {
    if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
        $genus = $1;      #stores Homo
        $species = $2;    #stores sapiens
    }
}
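Where the quantifier sits relative to the parentheses matters for what ends up in the memory variable. With (\w+) the whole word is captured; with (\w)+ the group is repeated and only its last repetition, a single character, is kept. A minimal sketch using the example line from the slide:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "Organism: Homo sapiens\n";

my ($genus, $species, $last_char_1, $last_char_2);

# Quantifier inside the parentheses: each group captures a whole word.
if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {
    ($genus, $species) = ($1, $2);
}

# Quantifier outside the parentheses: each group keeps only its final repetition.
if ($line =~ m/Organism:\s(\w)+\s(\w)+/) {
    ($last_char_1, $last_char_2) = ($1, $2);
}

print "inside:  $genus $species\n";
print "outside: $last_char_1 $last_char_2\n";
```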
Substitutions
Able to replace a pattern within a string with another string.
Use the “s” operator
s/abc/xyz/ => find abc and replace with xyz
By default only the first match is replaced.
Using 'g' modifier (global) will find and replace all instances.
$line = 'abccdcbabc';
$line =~ s/abc/xyz/g;
print $line;
#produces xyzcdcbxyz;
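For comparison, the sketch below runs the same substitution with and without the 'g' modifier. It also uses the fact that s/// returns the number of substitutions it made, which can be handy for counting matches. The variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $first_only = 'abccdcbabc';
my $all        = 'abccdcbabc';

# Without 'g': only the first abc is replaced.
$first_only =~ s/abc/xyz/;

# With 'g': every abc is replaced; the return value is the number replaced.
my $replaced = ($all =~ s/abc/xyz/g);

print "first only: $first_only\n";
print "global:     $all ($replaced substitutions)\n";
```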
1.) Run dna2rna.pl
2.) Now look at dna2rna.pl
dna2rna.pl
#!/usr/bin/perl
print "Enter DNA sequence\n";
while ($line = <STDIN>) {
    chomp $line;    #remove trailing \n
    if ($line =~ m/[^AGCT]/i) {    #case insensitivity inferred by 'i' modifier
        #'$&' is a special variable which stores what the
        #regular expression matched. Don't worry about it for now.
        print "your sequence contained an invalid nucleotide: $&\nPlease try again\n";
    }
    else {
        $line =~ s/t/u/g;    #replace all lower case 't'
        $line =~ s/T/U/g;    #replace all upper case 'T'
        print "The RNA sequence is:\n$line\n";
        print "Try again or ctrl C to quit\n";
    }
}
EMBL file revisited
Using shortcuts and anchors helps make the pattern more robust:
if (/AC   .*/) {    #that's three spaces
can be rewritten as:
if (/^AC\s{3}(.*)\n$/) {    #more certain to return what you want
    $accession = $1;        #now have info stored to use later.
}
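The difference between the two patterns shows up on a line that merely contains AC followed by three spaces. In the sketch below both input lines are made up for illustration (they are not taken from nemaglobins.embl); the anchored pattern should accept only the line that starts with AC.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented example lines: a genuine-looking AC line and a line that only mentions one.
my @lines = (
    "AC   AC00000;\n",
    "CC   related to AC   AC11111;\n",
);

my (@loose, @accessions);
for my $line (@lines) {
    # Original pattern: matches AC plus three spaces anywhere in the line.
    push @loose, ($line =~ /AC   .*/ ? 1 : 0);

    # Anchored pattern: AC must start the line; the rest is captured in $1.
    push @accessions, ($line =~ /^AC\s{3}(.*)\n$/ ? $1 : 'no match');
}

print "loose pattern:    @loose\n";
print "anchored capture: $accessions[0] | $accessions[1]\n";
```

Note that the captured text still includes the trailing semi-colon.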
Now It's Your Turn :o)
nemaglobins.embl contains entries for complete cds of nematode sequences.
For each entry print the ACcession, OrganiSm name and AGCT content of
the SeQuence.
Output should read:
Accession: AC00000 <tab> Species: Toxocara canis <newline>
A: 34 G: 65 C: 24 T: 75 <newline><newline>
Hints:
The lines of interest are AC, OS, and SQ.
Three regular expressions - one for each query.
Use a series of if and elsif statements to search for regular expressions.
Print when matched.
Bonus point - remove the semi-colon from the accession id.
Shout if you need help.