BioPHP Technical Reference
Written by Serge Gregorio
The BioPHP Classes
As of version 1.0, there are 7 classes in BioPHP. They
are (in alphabetical order):
1) Protein Class
2) RestEn Class
3) Seq Class
4) SeqAlign Class
5) SeqDB Class
6) SeqMatch Class
7) SubMatrix Class
class Protein
This class represents the end products of the genetic processes of transcription and
translation -- the proteins. While a protein's primary structure (its amino
acid sequence) is ably represented as a Seq object, its secondary and tertiary
structures are not. This is the main rationale for creating a separate Protein
class. This class is still under development.
id A string that uniquely identifies a protein.
name The long name used to refer to this protein.
Purpose : Returns the molecular weight of the protein object.
Syntax : array molwt()
Arguments: None (The sequence property of the current Protein object is implied.)
Returns : An array of the form: ( lower_molwt, upper_molwt )
Specifics: This is similar to the molwt() method of the Seq Class.
Purpose : Returns the length of the protein object, i.e. the number of amino acids in it.
Syntax : int seqlen()
Arguments: None (The sequence property of the current Protein object is implied.)
Returns : An integer representing the number of amino acids in the protein.
Specifics: This has similarities with the seqlen() method of the Seq Class.
class RestEn
The name RestEn is short for "restriction enzymes" or "restriction
endonucleases", which are enzymes that can "cut" a DNA strand
into two or more fragments along special sites called restriction sites. They
are an important tool in recombinant DNA technology.
name The short name of the restriction endonuclease following the accepted
naming convention (the first three letters represent the organism from
which the enzyme was first sourced or discovered, followed by a Roman
numeral, etc.) Examples are EcoRI from the Escherichia coli bacterium
and BamHI from Bacillus amyloliquefaciens.
pattern A string representing the restriction pattern recognized by the enzyme.
cutpos An integer representing the position within the restriction pattern
where the enzyme actually cuts the DNA strand. This could range from
0 to 1 less than the length of the restriction pattern.
length The number of symbols (or base pairs) in the restriction pattern.
RestEn() Constructor method for the RestEn object.
Purpose : Cuts the DNA sequence into fragments using the restriction enzyme object.
Syntax : array CutSeq(Seq object $seq, char $options)
Arguments:
$seq (mandatory) - the sequence to cut using the current restriction enzyme object.
$options (optional) - may be "N" or "O". If "N", the sequence is cut using the patpos() group
of methods (no overlapping patterns). If "O", the sequence is cut using the patposo() group
of methods (with overlapping patterns). If omitted, this defaults to "N".
Returns : An array of fragments (substrings of the parameter sequence $seq) of the form:
( fragment1, fragment2, ... )
See also : patpos(), patposo() methods of the Seq class.
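The cutting behaviour described above can be sketched in plain PHP. This is only an illustration and not the code behind CutSeq(): the function name cut_at_sites() is hypothetical, the pattern is treated as a literal string rather than a regular expression, only the non-overlapping ("N") case is shown, and cutpos is assumed to mean "cut just before the symbol at this index within the pattern".

```php
<?php
// Hypothetical helper (not BioPHP's CutSeq): cut $seq at every
// non-overlapping occurrence of the literal $pattern, $cutpos symbols
// into each match.
function cut_at_sites(string $seq, string $pattern, int $cutpos): array
{
    if ($pattern === '') {
        return [$seq];                       // nothing to search for
    }
    $fragments = [];
    $start  = 0;                             // start of the current fragment
    $offset = 0;                             // where the next search begins
    while (($pos = stripos($seq, $pattern, $offset)) !== false) {
        $cut = $pos + $cutpos;               // absolute position of the cut
        $fragments[] = substr($seq, $start, $cut - $start);
        $start  = $cut;
        $offset = $pos + strlen($pattern);   // skip the match: no overlaps
    }
    $fragments[] = substr($seq, $start);     // whatever follows the last cut
    return $fragments;
}

// EcoRI recognizes GAATTC and cuts between the G and the first A.
print_r(cut_at_sites("AAGAATTCTTGAATTCGG", "GAATTC", 1));
```

Joining the returned fragments in order reproduces the original sequence, which is a convenient sanity check.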
Purpose : A powerful method for searching our database of endonucleases for a particular
restriction enzyme exhibiting certain properties like pattern, cutting position,
and length, or combinations thereof.
Syntax : array FindRestEn(string $pattern, string/int $cutpos, string/int $plen)
Arguments:
$pattern (optional) - the pattern of the restriction enzyme we wish to look for. If
omitted, this is set to the blank string ("").
$cutpos (optional) - the cutting position of the restriction enzyme we wish to look for.
If omitted, this is set to the blank string ("").
$plen (optional) - the length of the restriction pattern of the enzyme we wish to look for.
If omitted, this is set to the blank string ("").
Returns :
A list of restriction enzymes that meet the criteria specified by the $pattern, $cutpos,
and $plen parameters. The list is an array of the form:
( rest_en1, rest_en2, ... )
Purpose : Determines the cutting position of the restriction enzyme object.
Syntax : int GetCutPos(string $RestEnName)
Arguments:
$RestEnName (optional) - the name of the restriction enzyme. If omitted, this defaults
to the name property of the RestEn object.
Returns : Returns the cutting position (an integer) of the restriction enzyme object.
Specifics: This method looks up the cutting position in a Restriction Enzyme database,
provided with BioPHP.
See also : GetLength(), GetPattern()
Purpose : Returns the length of the restriction pattern recognized by the enzyme.
Syntax : int GetLength(string $RestEnName)
Arguments:
$RestEnName (optional) - the name of the restriction enzyme. If omitted, $RestEnName
defaults to the name property of the RestEn object.
Returns : The length (integer) of the restriction pattern recognized by the enzyme.
Specifics: This method looks up the length in the Restriction Enzyme database of BioPHP.
See also : GetCutPos(), GetPattern().
Purpose : Returns the restriction pattern recognized by the restriction enzyme object.
Syntax : string GetPattern(string $RestEnName)
Arguments:
$RestEnName (optional) - the name of the restriction enzyme. If omitted, $RestEnName
defaults to the name property of the RestEn object.
Returns : The sequence pattern (string) recognized by the given restriction enzyme.
Specifics: This method looks up the pattern in the Restriction Enzyme database of BioPHP.
See also : GetCutPos(), GetLength().
class Seq
An instance of the Seq class represents a single sequence record in a SeqDB
database object. Usually, we instantiate this with a call to the fetch() method
of the SeqDB class.
For a more detailed description of the properties of this class, consult the
gbrel.txt file released by the NCBI. It describes the content and format of
the GenBank database.
id The primary sequence id which uniquely identifies a sequence record
in a SeqDB database.
strands This further describes the molecule type of a particular sequence.
May be one of these three values: SINGLE, DOUBLE, MIXED.
moltype The type of molecule which makes up a particular sequence. Typical
values are DNA, RNA, etc.
topology The shape of a particular sequence. May be LINEAR or CIRCULAR.
division A high-level classification of the sequence as to its
source, e.g. PRI if it came from a primate (such as an ape), BCT if it came
from a bacterium, etc.
date A date value associated with the sequence record. See gbrel.txt.
accession An alternative identifier for a sequence record much like the id
property.
sec_accession
Secondary accession numbers for the sequence record, returned
as an array.
keywords Words or phrases associated with the sequence record, aimed at facilitating
searches, for example, by topic of interest.
organism The creature from which the sequence was obtained or extracted. Typically
uses the Latin-based scientific name (Homo sapiens for humans, etc.).
sequence A string of symbols denoting the arrangement of "monomers"
(basic structural units) within the sequence. Different types of molecular
sequences use different alphabets. DNA sequences have only four letters in
their alphabet (A, T, G, and C), while protein sequences have 20 letters (G, A,
V, L, I, etc.).
seqlength The number of monomers making up the sequence. While numerically the same
as the value returned by the seqlen() method, using the seqlength property
may result in faster code execution in some cases.
reference Returns a multi-dimensional associative array of reference information
including AUTHOR, JOURNAL (CITATION), etc. To be described in greater detail
in BioPHP Documentation 2.0. For now, try doing a print_r() on it to learn
more.
features Returns a multi-dimensional associative array of information on portions
of the sequence that code for proteins and RNA molecules, etc. To be described
in greater detail in BioPHP Documentation 2.0. For now, try a print_r() on
it.
For all the other properties listed below, consult gbrel.txt:
version, definition, ncbi_gi_id, segment, segment_no, segment_count
Purpose : Translates an amino acid sequence into its equivalent "charge sequence".
Syntax : string charge(string $amino_seq)
Arguments:
$amino_seq (optional) - A string representing an amino acid sequence (e.g. GAVLIFYWKRH).
If omitted, this is set to the sequence property of the "calling" Seq object. If the
latter is not set either, the function returns the boolean value of FALSE.
Returns : A string where each amino acid "letter" is replaced by A (if amino acid is acidic),
C (if amino acid is basic), or N (if amino acid is neutral), e.g. ACNNCCNANCCNA.
Specifics: $amino_seq must be of format type 1 (a string of single-letter amino acid symbols)
and not of type 3 (a string of three-letter amino acid symbols).
Versions : In BioPHP version 1.0, $amino_seq is mandatory. In version 1.1, it is optional.
See also : chemgrp().
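As a rough illustration of the mapping described for charge(), here is a minimal sketch. It is not BioPHP's code: the function name charge_sequence() is hypothetical, and the classification used (D and E acidic; K, R, and H basic; everything else neutral) is the conventional one, assumed here rather than taken from the BioPHP source.

```php
<?php
// Hypothetical stand-in for charge(). Assumed classification:
// acidic = D, E; basic = K, R, H; all other symbols are neutral.
function charge_sequence(string $aminoSeq)
{
    if ($aminoSeq === '') {
        return false;                 // mirrors the documented FALSE result
    }
    $out = '';
    foreach (str_split(strtoupper($aminoSeq)) as $residue) {
        if (strpos('DE', $residue) !== false) {
            $out .= 'A';              // acidic
        } elseif (strpos('KRH', $residue) !== false) {
            $out .= 'C';              // basic
        } else {
            $out .= 'N';              // neutral
        }
    }
    return $out;
}

echo charge_sequence("GAVLIFYWKRH"), "\n";
```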
Purpose : Returns a string of symbols from an 8-letter alphabet: A, L, M, R, C, H, I, S.
Syntax : string chemgrp(string $amino_seq)
Arguments: $amino_seq (optional) - A string representing an amino acid chain (e.g. GAVLI).
If omitted, this is set to the sequence property of the "calling" Seq object. If the
latter is not set either, the function returns the boolean value of FALSE.
Returns : A string where each amino acid "letter" is replaced by one of the
following: A (acidic group), L (aliphatic group), M (amide group), R (aromatic group),
C (basic group), H (hydroxyl group), I (imino group), S (sulfur group).
Example : This is a sample output: ALMRLIISACHL
Specifics: $amino_seq must be of format type 1 (a string of single-letter amino acid symbols)
and not of type 3 (a string of three-letter amino acid symbols).
Versions : In BioPHP version 1.0, $amino_seq is mandatory. In version 1.1, it is optional.
See also : charge().
Purpose : Returns a string representing the genetic complement of a sequence.
Syntax : string complement(string $seq, string $moltype)
Arguments:
$seq (mandatory) - The string whose complement we want to obtain.
$moltype (optional) - The type of molecule we are dealing with. If omitted, $moltype
is set to the moltype property of the sequence object. If the moltype property is
not initialized, then $moltype is set to "DNA" by default.
Returns : A string which is the genetic complement of the input string.
Specifics: As of now, this method handles only two molecule types, DNA and RNA.
See also : revcomp(), moltype property of the Seq class.
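The idea behind complement(), and the related revcomp(), can be sketched with PHP's strtr() and strrev(). The function names below are hypothetical and the code is an assumption about the behaviour, not the library's implementation. It pairs A with T (or U for RNA) and G with C, and leaves any other symbol unchanged.

```php
<?php
// Hypothetical sketch: complement of a DNA or RNA string.
function complement_of(string $seq, string $moltype = "DNA"): string
{
    $seq = strtoupper($seq);
    if (strtoupper($moltype) === "RNA") {
        return strtr($seq, "AUGC", "UACG");   // A<->U, G<->C
    }
    return strtr($seq, "ATGC", "TACG");       // A<->T, G<->C
}

// Hypothetical sketch: complement every symbol, then reverse the result.
function revcomp_of(string $seq, string $moltype = "DNA"): string
{
    return strrev(complement_of($seq, $moltype));
}

echo complement_of("ATGC"), "\n";
echo revcomp_of("ATGC"), "\n";
```

strtr() with two strings of equal length translates every character in a single pass, so an A that has just become a T is not translated back again.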
Purpose : Counts the number of codons (triplets of nucleotide bases) in a sequence.
Syntax : int count_codons()
Arguments: None (Seq->sequence property is implied).
Returns : The number of codons within a sequence, expressed as a non-negative integer.
Specifics:
This method makes use of the value of the /codon_start qualifier inside the Features (CDS)
collection of properties. If this qualifier is missing, this method starts counting codons
from the very first "letter" in the sequence.
Purpose : Returns the expansion of a nucleic acid sequence, replacing special wildcard symbols
with the proper Perl regular expression.
Syntax : string expand_na(string $string)
Arguments: $string (mandatory) - the nucleic acid sequence to expand
Returns : An "expanded" string where special metacharacters are replaced by the appropriate
Perl regular expression. For example, an N or X is replaced by the dot (.)
metacharacter, an R is replaced by [AG], etc.
Example 1: Original sequence: ATGXCCRTT (1)
Expanded sequence: ATG.CC[AG]TT (2)
In line (2), the X symbol is replaced by the dot symbol, R is replaced by [AG].
The dot symbol (.) matches any of A, T, G, or C. R matches either A or G but
not C or T.
Example 2: Matching Sequences
ATGACCATT (3)
ATGTCCGTT (4)
Line (3) matches the expanded sequence because A matches the dot, and A matches [AG].
Line (4) matches the expanded sequence because T matches the dot, and G matches [AG].
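The two examples above can be reproduced with a small sketch. The function name expand_ambiguity() is hypothetical, and only four wildcard symbols are handled: N and X (any base) and R ([AG]) as described in this entry, plus Y ([CT]), which follows the usual nucleotide ambiguity codes and is an assumption about what BioPHP does.

```php
<?php
// Hypothetical sketch of expand_na(): replace wildcard symbols with the
// equivalent regular-expression fragment. Only N, X, R and Y are covered.
function expand_ambiguity(string $seq): string
{
    $map = [
        'N' => '.',     // any base
        'X' => '.',     // any base
        'R' => '[AG]',  // purine
        'Y' => '[CT]',  // pyrimidine (assumed mapping)
    ];
    // With an array argument, strtr() never re-scans text it has already
    // replaced, so the A and G inside "[AG]" are left alone.
    return strtr(strtoupper($seq), $map);
}

$expanded = expand_ambiguity("ATGXCCRTT");
echo $expanded, "\n";
var_dump(preg_match('/^' . $expanded . '$/', "ATGACCATT") === 1);
var_dump(preg_match('/^' . $expanded . '$/', "ATGTCCGTT") === 1);
```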
Purpose : Returns a three-dimensional associative array listing all mirror substrings contained
within a given sequence, and their location (expressed as a zero-based index number).
Syntax :
array find_mirror(string $haystack, int $pallen1, int $pallen2 = "", string $options = "E")
Arguments:
$haystack (optional) - the sequence which will be searched by the method for any occurrences
of mirrors. If omitted, this is set to the sequence property of the current Seq object.
$pallen1 (mandatory) - the length of the shortest mirror to look for.
$pallen2 (optional) - the length of the longest mirror to look for.
$options (optional) - may be "E" or "O" or "A". If "E" is passed, then the method only looks
for mirrors with even lengths. If "O" is passed, the method only looks for mirrors with odd
lengths. If "A" is passed, then method looks for all mirrors (odd and even lengths). If
omitted, this is set to "E" by default.
Returns : A three-dimensional associative array with the following format:
([len1] => ((mirror1, pos1), (mirror2, pos2)), [len2] => (...), ...)
Example : ( [3] => ( ("ATA", 3), ("GCG", 12) ), [4] => ( ("GAAG", 18) ) )
See also : find_palindrome().
Purpose : Returns a two-dimensional array containing palindromic substrings found in a sequence,
and their location, in terms of zero-based indices.
Syntax : array find_palindrome(string $haystack, int $seqlen = "", int $pallen = "")
Arguments:
$haystack (optional) - the sequence to be searched by the method for any genetic palindromes.
If omitted, this is set to the sequence property of the current Seq object.
$seqlen (optional) - the length of the palindromic substring within $haystack. If omitted,
the method searches for palindromes of whatever length.
$pallen (optional) - the length of one of two palindromic edges in a palindromic substring
within $haystack. If omitted, the method does not restrict its search to substrings with
an edge of a specified length.
Returns : A two-dimensional array of the form:
( (palindrome1, position1), (palindrome2, position2), ... )
Example : ( ("ATGttCAT", 2), ("ATGccccccCAT", 18) ), palindromic edges are shown in uppercase.
Specifics: While $seqlen and $pallen are optional, omitting both of them is not allowed.
See also : find_mirror().
Purpose : Returns a one-dimensional array enumerating each occurrence or instance of a given
pattern in a larger string or sequence. This returns the actual substring (that
matches the pattern) itself.
Syntax : array findpattern(string $pattern, char $options)
Arguments:
$pattern (mandatory) - the pattern to search for as a Perl regular expression (no enclosing
"/" symbol).
$options (optional) - if set to "I", pattern-matching will be case-insensitive. Passing
anything else would cause the pattern-matching to be case-sensitive. If not passed,
$options is set to "I" (case-insensitive).
Returns : The function returns a one-dimensional array of the form:
( substring1, substring2, substring3, ... )
Specifics:
1) By itself the findpattern() method does not give the location of substrings that match
the given pattern. You use the patpos() or the patposo() function to do this.
2) This method does not find "overlapping patterns", such as the following:
$seqobj->id = 1234;
$seqobj->sequence = "AGATACA";
$matches = $seqobj->findpattern("A.A", "I");
In the code above, $matches is ( "AGA", "ACA"), and does not include ATA because it
"overlaps" with AGA. Matching is done from left to right. If a pattern is found from
position index m to n, search for the next matching substring resumes at position index
n+1.
Examples :
1) Pattern is made up of literals (no metacharacters)
$seqobj->id = 1234;
$seqobj->sequence = "aagatcagac";
$seqobj->findpattern("AGA", "I");
// The above statement returns ( "AGA", "AGA").
2) Pattern uses the dot (.) metacharacter
The dot metacharacter matches any single character (in this case, A, T, G, or C).
$seqobj->id = 1234;
$seqobj->sequence = "aagcagtaggag";
$seqobj->findpattern(".AG", "I");
// The above statement returns ( "AAG", "CAG", "TAG", "GAG").
3) Pattern uses the [] metacharacter.
The [] metacharacter matches any one character listed inside the square brackets.
Thus, [AG] matches either an A or G.
$seqobj->id = 1234;
$seqobj->sequence = "gatgacgaggaa";
$seqobj->findpattern("GA[TC]", "I");
// The above statement returns ( "GAT", "GAC").
See also : patpos(), patposo(), patfreq().
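The left-to-right, non-overlapping matching described for findpattern() is the same behaviour PHP's preg_match_all() has, so the examples above can be checked with a short sketch. The function find_matches() is hypothetical and not part of BioPHP. It assumes the pattern contains no "/" character, and it upper-cases the matches in case-insensitive mode only to mirror the sample outputs shown above.

```php
<?php
// Hypothetical sketch of non-overlapping pattern search.
function find_matches(string $seq, string $pattern, string $options = "I"): array
{
    $insensitive = ($options === "I");
    $regex = "/" . $pattern . "/" . ($insensitive ? "i" : "");
    // preg_match_all() resumes right after the end of each match,
    // so overlapping occurrences are never reported.
    preg_match_all($regex, $seq, $m);
    return $insensitive ? array_map('strtoupper', $m[0]) : $m[0];
}

print_r(find_matches("AGATACA", "A.A"));
print_r(find_matches("aagcagtaggag", ".AG"));
```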
Purpose : Returns the sequence located between two palindromic halves of a palindromic string.
Take note that the "bridge", as I call it, is not necessarily a genetic mirror
or a palindrome.
Syntax : string get_bridge(string $string)
Arguments: $string (mandatory) - a palindromic or mirror sequence containing the bridge.
Returns : A string representing the bridge (as defined above).
Example :
In the sequence "ATGcacgtcCAT", the "cacgtc" is the bridge, while ATG and CAT
are the palindromic halves.
See also : is_mirror(), is_palindrome(), find_mirror(), find_palindrome().
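One way to picture the relationship between the palindromic halves and the bridge is the sketch below. It is an assumption, not the BioPHP algorithm: the hypothetical split_palindrome() grows the two edges inward for as long as the symbols pair up as DNA complements, and treats whatever is left in the middle as the bridge.

```php
<?php
// Hypothetical sketch: split a genetic palindrome into its first half,
// its bridge, and its second half (DNA alphabet assumed).
function split_palindrome(string $s): array
{
    $pairs = ['A' => 'T', 'T' => 'A', 'G' => 'C', 'C' => 'G'];
    $u = strtoupper($s);
    $n = strlen($u);
    $edge = 0;
    // Extend the edges while the next outer pair is complementary
    // and the two halves would still not overlap.
    while (2 * ($edge + 1) <= $n
        && isset($pairs[$u[$edge]])
        && $pairs[$u[$edge]] === $u[$n - 1 - $edge]) {
        $edge++;
    }
    return [
        'first'  => substr($s, 0, $edge),
        'bridge' => substr($s, $edge, $n - 2 * $edge),
        'second' => substr($s, $n - $edge),
    ];
}

print_r(split_palindrome("ATGcacgtcCAT"));
```

A sequence that is its own reverse complement, such as GAATTC, comes back with an empty bridge under this scheme.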
Purpose : Returns the n-th codon in a sequence, with numbering starting at 0.
Syntax : string getcodon(int $index, int $readframe = 0)
Arguments:
$index (mandatory) - the index number of the codon.
$readframe (optional) - the reading frame, which may be 0, 1, or 2 only. If omitted, this
is set to 0 by default.
$this->sequence (implied) - the sequence from which to extract the n-th codon.
Returns : The n-th codon in the sequence.
See also : count_codons().
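The arithmetic behind codon indexing is simple enough to sketch. The functions codon_at() and codon_count() are hypothetical. They take the sequence as an explicit argument instead of using $this->sequence, and codon_count() takes a reading frame where count_codons() would consult the /codon_start qualifier. Returning an empty string for an incomplete trailing codon is also an assumption.

```php
<?php
// Hypothetical sketch: the codon at zero-based $index in a reading frame.
function codon_at(string $seq, int $index, int $readframe = 0): string
{
    $codon = (string) substr($seq, $readframe + 3 * $index, 3);
    return strlen($codon) === 3 ? $codon : '';   // ignore partial codons
}

// Hypothetical sketch: number of complete codons in a reading frame.
function codon_count(string $seq, int $readframe = 0): int
{
    return intdiv(max(0, strlen($seq) - $readframe), 3);
}

echo codon_at("ATGGCCTAA", 1), "\n";
echo codon_count("ATGGCCTAA", 1), "\n";
```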
Purpose : Returns one of the two palindromic "halves" of a palindromic string.
Syntax : string halfstr(string $string, int $no)
Arguments:
$string (mandatory) - a palindromic sequence.
$no (mandatory) - pass 0 to get the first palindromic half, pass any other number (e.g. 1) to
get the second palindromic half.
Returns : A string representing either the first or the second palindromic half of the string.
Specifics: Later on, this will be modified so that $string will become optional. If omitted, the
implied argument is $this->sequence.
See also : get_bridge() for an example.
Purpose : Returns TRUE if the given sequence or string is a "genetic mirror" which is the same
as a "string palindrome", i.e., a sequence that "looks" the same when read backwards.
Syntax : boolean is_mirror(string $string)
Arguments:
$string (optional) - a sequence which we want to test if it is a mirror or not. If omitted,
$string is, by default, set to the sequence property of the Seq object from which we invoke
the method.
Returns : TRUE if the given string is a mirror, FALSE otherwise.
Specifics: The genetic mirror is the same as a palindromic string in traditional programming.
See also : is_palindrome(), find_mirror(), find_palindrome().
Purpose : Tests if a given sequence is a "genetic palindrome" (as opposed to a "string
palindrome"). A "genetic palindrome" is one where the ends of a sequence are
reverse complements of each other.
Syntax : boolean is_palindrome(string $string)
Arguments:
$string (optional) - a sequence which we want to test if it is a genetic palindrome or not.
If omitted, $string is, by default, set to the sequence property of the Seq object from
which we invoke the method.
Returns : TRUE if the given string is a genetic palindrome, FALSE otherwise.
Specifics: The genetic palindrome is not the same as a palindromic string in traditional
programming. It is any sequence having "edges" that are reverse complements of
each other.
See also : is_mirror(), find_mirror(), find_palindrome().
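The difference between a genetic mirror and a genetic palindrome is easiest to see side by side. The two functions below are hypothetical sketches, not BioPHP's implementations. In particular, has_revcomp_edges() takes an explicit edge length, which the documented is_palindrome() does not, and it assumes a DNA alphabet.

```php
<?php
// Hypothetical sketch: a mirror reads the same backwards.
function is_mirror_string(string $s): bool
{
    $u = strtoupper($s);
    return $u === strrev($u);
}

// Hypothetical sketch: the first and last $edgeLen symbols are reverse
// complements of each other (DNA alphabet assumed).
function has_revcomp_edges(string $s, int $edgeLen): bool
{
    $u = strtoupper($s);
    if ($edgeLen < 1 || 2 * $edgeLen > strlen($u)) {
        return false;
    }
    $head = substr($u, 0, $edgeLen);
    $tail = substr($u, -$edgeLen);
    return $head === strrev(strtr($tail, "ATGC", "TACG"));
}

var_dump(is_mirror_string("GAAG"));        // a mirror
var_dump(has_revcomp_edges("GAAG", 2));    // but not a genetic palindrome
var_dump(has_revcomp_edges("ATGttCAT", 3));
```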
Purpose : Computes the molecular weight of a particular sequence.
Syntax : array molwt()
Arguments: $this->sequence (implied) - the sequence whose molecular weight we wish to determine
Returns : An array with exactly two elements, of the form: ( lower_molwt, upper_molwt )
Specifics:
In general, the molecular weight of a chain is the arithmetic sum of the molecular weights
of the individual "links" (e.g. an amino or nucleic acid) in that chain (plus/minus some
minor adjustments). However, special metacharacters like X, R, and Y may be present in the
sequence which leads to ambiguities in the molecular weight of the entire sequence. These
special symbols are evaluated and replaced by the expand_na() method.
Thus, the method has been designed to always return two values - a lower value and an upper
value - even in cases when the two might be equal or the same.
See also : expand_na().
Purpose : Returns a one-dimensional associative array where each key is a substring matching the
given pattern, and each value is the frequency count of the substring within the larger
string.
Syntax : array patfreq(string $pattern, char $options)
Arguments:
$pattern (mandatory) - The pattern (a Perl regular expression without enclosing "/") to search for
and tally.
$options (optional) - If set to "I", pattern-matching and tallying will be case-insensitive. Passing
anything else would cause it to be case-sensitive. If not passed, $options is set to "I"
(case-insensitive).
Returns : The function returns an array of the form:
( substring1 => frequency1, substring2 => frequency2, ... )
Specifics:
1) patfreq() uses the findpattern() function to search for substrings that match the pattern.
2) Because of #1 above, patfreq() does not recognize and count "overlapping patterns".
Example :
$seqobj->id = 1234;
$seqobj->sequence = "agaataacatgacaataaca";
$seqobj->patfreq("A.A", "I");
// The above statement returns ( "AGA" => 1, "ATA" => 2, "ACA" => 3).
See also : findpattern(), patpos(), and patposo().
Purpose : Returns a two-dimensional associative array where each key is a substring matching a
given pattern, and each value is an array of positional indexes which indicate the
location of each occurrence of the substring (needle) in the larger string (haystack).
Syntax : array patpos(string $pattern, char $options)
Arguments:
$pattern (mandatory) - the pattern (a Perl regular expression without enclosing "/") to locate.
$options (optional) - If set to "I", pattern-matching will be case-insensitive. Passing anything
else would cause it to be case-sensitive. If omitted, $options is set to "I" (case-insensitive).
Returns : This method returns an array of the form:
( substring1 => (position1, position2, ... ), substring2 => (position1, position2, ... ), ... )
where substring is a substring within the sequence that matches the given $pattern, and
position is a zero-based index indicating the location of the substring within the larger
sequence. Thus, if substring is found at the very beginning of sequence, its position is
equal to zero (0).
Specifics: Like findpattern() and patfreq(), patpos() does not recognize "overlapping patterns".
Example :
$seqobj->sequence = "agaataacatgacaataaca";
$seqobj->patpos("A.A", "I");
// The above returns ( "AGA" => (0), "ATA" => (3, 14), "ACA" => (6, 11, 17) )
See also : findpattern(), patfreq(), and patposo().
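The return shapes of patpos() and patfreq() can be reproduced with preg_match_all() and its PREG_OFFSET_CAPTURE flag. This is a sketch under assumptions, not the library code: both function names are hypothetical, matching is always case-insensitive, keys are upper-cased to mirror the sample output, and the pattern must not contain "/".

```php
<?php
// Hypothetical sketch: map each matching substring to its zero-based
// positions, scanning left to right without overlaps.
function pattern_positions(string $seq, string $pattern): array
{
    preg_match_all("/" . $pattern . "/i", $seq, $m, PREG_OFFSET_CAPTURE);
    $positions = [];
    foreach ($m[0] as [$text, $offset]) {
        $positions[strtoupper($text)][] = $offset;
    }
    return $positions;
}

// Hypothetical sketch: the frequency of a substring is simply the number
// of positions recorded for it.
function pattern_frequencies(string $seq, string $pattern): array
{
    return array_map('count', pattern_positions($seq, $pattern));
}

print_r(pattern_positions("agaataacatgacaataaca", "A.A"));
print_r(pattern_frequencies("agaataacatgacaataaca", "A.A"));
```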
Purpose : Similar to patpos() except that this counts so-called "overlapping patterns".
Syntax : array patposo(string $pattern, char $options, int $cutpos)
Arguments:
$pattern (mandatory) - the pattern (a Perl regular expression without enclosing "/") to
locate.
$options (optional) - If set to "I", pattern-matching will be case-insensitive. Passing
anything else would cause it to be case-sensitive. If omitted, $options is set to "I"
(case-insensitive).
$cutpos (optional) - A non-negative integer specifying where search for the next pattern
will resume, relative to the current matching substring. If omitted, $cutpos is set to
the value of 1 by default.
Returns : This method returns a one-dimensional array of the form:
( position1, position2, position3, ... )
where position is a zero-based index indicating the location of the substring within the
larger sequence. Thus, if substring is found at the very beginning of sequence, its
position is equal to zero (0).
Unlike patpos(), the return value does not contain the actual substrings that match the
given pattern. However, code can be written to work around this problem.
Specifics: Unlike findpattern(), patfreq(), and patpos(), this method recognizes so-called
"overlapping patterns".
Example :
$seqobj->sequence = "agataca";
$seqobj->patposo("A.A", "I", 1);
// The above returns ( 0, 2, 4)
With the third argument set to 1, the search proceeds in this manner:
1) The result array is initialized to the blank or empty array or ( ).
2) "AGA" is checked if it matches the pattern "A.A"
3) Since it does, the position of its first character (the "A" before the "G") relative to
the larger sequence is added to our "result array". Our result array is now: ( 0 ).
4) Since the third argument is set to 1, the search for the next match resumes at the letter
"G" in "AGA", as illustrated below.
a gataca
0 123456
That is, the remaining substring "GATACA" will be searched for a substring matching the
pattern "A.A".
5) The next match is "ATA", which is found at position 2. 2 is added to the result array
which is now ( 0, 2).
6) The search resumes at 1 character to the right of the first "A" in "ATA", or at letter "T".
aga taca
012 3456
7) The remaining string "TACA" is searched for the pattern "A.A".
8) The next match is "ACA", which is found at position 4. 4 is added to the result array
which is now ( 0, 2, 4). The search ends and this array is returned by the function.
See also : findpattern(), patfreq(), and patpos().
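The step-by-step walk-through above amounts to repeating a search, each time resuming a fixed number of symbols after the start of the previous match. A minimal sketch follows. The function name overlapping_positions() is hypothetical, matching is always case-insensitive, and a step of 0 is raised to 1 here so that the loop is guaranteed to terminate, which is an assumption rather than documented behaviour.

```php
<?php
// Hypothetical sketch of an overlapping search: after a match at $pos,
// resume at $pos + $step rather than at the end of the match.
function overlapping_positions(string $seq, string $pattern, int $step = 1): array
{
    $step = max(1, $step);                 // avoid an endless loop
    $regex = "/" . $pattern . "/i";
    $positions = [];
    $offset = 0;
    $len = strlen($seq);
    while ($offset <= $len
        && preg_match($regex, $seq, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        $pos = $m[0][1];                   // zero-based start of this match
        $positions[] = $pos;
        $offset = $pos + $step;
    }
    return $positions;
}

print_r(overlapping_positions("agataca", "A.A", 1));
```

Passing a step equal to the pattern length (3 in this case) gives back the non-overlapping behaviour of patpos().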
Purpose : Gets the reverse complement of a genetic sequence.
Syntax : string revcomp(string $seq, string $moltype)
Arguments:
$seq (mandatory) - the sequence to "reverse, and then complement".
$moltype (mandatory) - the molecule type of the input sequence
Returns : The reverse complement of the argument string $seq.
See also : complement().
Purpose : Gets the length of the sequence, i.e., the number of symbols in the sequence
property.
Syntax : int seqlen()
Arguments: None
Returns : The number of "symbols" or "monomers" in an amino or nucleic acid chain,
expressed as a non-negative integer.
Specifics: The "implied" argument is the sequence property of the Seq object which invokes
the seqlen() method.
See also : seqlength property of the Seq class.
Purpose : Creates a new sequence object with a sequence that is a substring of another.
Syntax : Seq object subseq(int $start, int $count)
Arguments:
$start (mandatory) - the position in the original sequence from which we will begin extracting
the subsequence; the position is expressed as a zero-based index.
$count (mandatory) - the number of "letters" to include in the subsequence, starting from the
position specified by the $start parameter.
Returns : A new Seq object with the sequence property set to the subsequence specified by the
$start and $count parameters. All other properties of this new Seq object are NULL.
See also : trunc().
Purpose : Returns the frequency of a given symbol in the sequence property string. Note that you
can pass this a symbol argument which may not be part of the sequence's alphabet.
In this case, the method will simply return zero (0) value.
Syntax : int symfreq(char $symbol)
Arguments:
$symbol (mandatory) - the symbol whose frequency in a sequence we wish to determine.
$this->sequence (implied) - the sequence to search and (do a) tally.
Returns : The frequency (number of occurrences) of a particular symbol in a sequence string.
Purpose : Translates a particular DNA sequence into its protein product sequence, using
the given substitution matrix.
Syntax : string translate(int $readframe, int $format)
Arguments:
$readframe (optional)
The reading frame (0, 1, or 2) to be used in translating a nucleic sequence into a protein.
A value of 0 means that the first codon would start at the first "letter" in the sequence,
a value of 1 means that the first codon would start at the second "letter" in the sequence,
and so on. When omitted, this argument is set to reading frame 0 by default.
$format (optional)
This may be passed the value 1 or 3 and determines the format of the output string. Passing
1 would cause translate() to output a string made up of single-letter amino acid symbols strung
together without any space in between. Passing 3 would output a string made up of three-letter
amino acid symbols separated by a space. See Return Value section for examples.
When omitted, $format is set to 3 by default.
Returns :
When $format is passed a value of 1, the function returns a string of this format:
GAVLISNFYW
where each of G, A, V, and the other letters represent a single amino acid residue.
When $format is passed a value of 3, the function returns a string of this format:
Phe Leu Ser Tyr Cys STP
where each of Phe, Leu, and the other 3-letter "words" represent a single amino acid
residue.
Specifics:
Aside from the symbols for the 20 known amino acid residues, there are two other special
symbols, the STOP codon and the UNKNOWN codon.
In Bio theory, the STOP codon terminates the translation process. The symbol for STOP is
usually "*" (when $format is 1) and "STP" (when $format is 3).
In Bio theory, the UNKNOWN codon is substituted when a codon cannot be found in the
translation table (or substitution matrix) and therefore, cannot be translated. The symbol
for the UNKNOWN codon is "X" (when $format is 1) and "XXX" (when $format is 3).
See also : translate_codon().
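To make the two output formats and the special STOP and UNKNOWN symbols concrete, here is a minimal sketch. It is not BioPHP's translate(): the function translate_sketch() is hypothetical, it takes the sequence as an argument, and its lookup table is only a seven-codon excerpt of the standard genetic code, so most real codons would come out as UNKNOWN. It also keeps translating after a STOP codon, which is an assumption.

```php
<?php
// Hypothetical sketch of codon-by-codon translation.
function translate_sketch(string $dna, int $readframe = 0, int $format = 3): string
{
    // Excerpt of the standard genetic code: codon => [1-letter, 3-letter].
    $table = [
        'TTT' => ['F', 'Phe'], 'TTA' => ['L', 'Leu'], 'TCT' => ['S', 'Ser'],
        'TAT' => ['Y', 'Tyr'], 'TGT' => ['C', 'Cys'], 'ATG' => ['M', 'Met'],
        'TAA' => ['*', 'STP'],                      // STOP codon
    ];
    $dna = strtoupper($dna);
    $residues = [];
    for ($i = $readframe; $i + 3 <= strlen($dna); $i += 3) {
        $codon = substr($dna, $i, 3);
        $entry = $table[$codon] ?? ['X', 'XXX'];    // UNKNOWN codon
        $residues[] = ($format === 1) ? $entry[0] : $entry[1];
    }
    // Format 1 has no separators; format 3 separates words with a space.
    return implode($format === 1 ? '' : ' ', $residues);
}

echo translate_sketch("TTTTTATCTTATTGTTAA"), "\n";
echo translate_sketch("TTTTTATCTTATTGTTAA", 0, 1), "\n";
```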
Purpose : Translates a single codon into an amino acid.
Syntax : string translate_codon(string $codon, int $format)
Arguments:
$codon (mandatory) - A three-letter nucleic acid sequence (each letter can be A, U, G,
or C) which translates into a single amino acid residue.
$format (optional) - This may be passed the value 1 or 3 and determines the format of
the output string. When omitted, $format is set to 3 by default.
Returns :
When $format is passed a value of 1, the function returns a single letter. When $format
is passed a value of 3, the function returns a string of three letters. The return value
represents a single amino acid residue.
Specifics:
This is a helper function called by the translate() method. However, it can be accessed
by the programmer to translate a single codon into its amino acid equivalent.
The translation process uses a translation table (also called a substitution matrix), which
can be defined by using the SubMatrix Class. If not specified, the method uses a default
matrix for human beings.
See also : translate().
Purpose : Extracts a subset or substring from a particular sequence.
Syntax : string trunc(int $start, int $count)
Arguments:
$start (mandatory) - the position in the original sequence from which we will begin
extracting the subsequence; the position is expressed as a zero-based index.
$count (mandatory) - the number of "letters" to include in the subsequence, starting
from the position specified by the $start parameter.
The sequence property of the Seq object invoking the method is the "implied" third
argument. It is the string from which the substring will be extracted or truncated
from.
Returns : A substring of the original string as specified by $start and $count.
Specifics: Unlike the subseq() method, this method returns an ordinary string and not
a Seq object.
See also : subseq().
class SeqAlign
The class is used to manipulate sequence alignment data (usually stored in a file)
produced by third-party sequence alignment software. For now, only the FASTA format
is supported.
length The length of the longest sequence in the alignment set.
seq_count The number of sequences in the alignment set.
gap_count The total number of gaps ("-") in all sequences in the
alignment set.
seqset An array containing all the sequences in the alignment set.
seqptr Short for sequence pointer; this is an index number indicating which
sequence (in the alignment set) we are currently processing.
is_flush A boolean or logical value: TRUE if all the sequences in the alignment
have the same length, FALSE otherwise.
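The summary properties above can all be derived from the sequence strings alone. The sketch below works on a plain array of strings rather than on Seq objects, and the function name alignment_summary() is hypothetical. It only illustrates how the values relate to one another and is not how SeqAlign computes them.

```php
<?php
// Hypothetical sketch: derive length, seq_count, gap_count and is_flush
// from an array of aligned sequence strings.
function alignment_summary(array $seqs): array
{
    $lengths = array_map('strlen', $seqs);
    $gaps = array_map(function ($s) {
        return substr_count($s, '-');      // gaps in one sequence
    }, $seqs);
    return [
        'length'    => $lengths ? max($lengths) : 0,   // longest sequence
        'seq_count' => count($seqs),
        'gap_count' => array_sum($gaps),               // gaps in all sequences
        'is_flush'  => count(array_unique($lengths)) <= 1,
    ];
}

print_r(alignment_summary(["ATGC-TGA--CTGA", "TTGT-TAA--CCGT"]));
```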
Purpose : Constructor method for the SeqAlign class. This "fetches" all sequences into the
seqset property of the SeqAlign object.
Syntax : new SeqAlign(string $filename)
Arguments: $filename (mandatory) - the name of the FASTA alignment file.
Returns : A SeqAlign object.
Specifics: To access individual sequences in the SeqAlign object, use seqset property.
Purpose : Adds a sequence to the alignment set.
Syntax : int add_seq(Seq object $seq)
Arguments: $seq (mandatory) - a Seq object to be added to the alignment set.
Returns : The number of sequences in the alignment set after the call to add_seq().
Specifics: It is possible to build a new alignment set solely by using the add_seq() method.
Example :
// Create a blank alignment set.
$aln = new SeqAlign();
// Add the first sequence to our blank alignment set.
$seqobj = new Seq();
$seqobj->id = "1234";
$seqobj->sequence = "ATGC-TGA--CTGA";
$aln->add_seq($seqobj);
// Add the second sequence to our blank alignment set.
$seqobj = new Seq();
$seqobj->id = "1235";
$seqobj->sequence = "TTGT-TAA--CCGT";
$aln->add_seq($seqobj);
Take note that this method simply adds the sequence to the alignment set. It does not
perform any sequence alignment.
See also : del_seq().
Purpose : Gets the character at a given residue number in the specified sequence.
Syntax : char char_at_res(int $seqidx, int $res)
Arguments:
$seqidx (mandatory) - the index number of the desired sequence in the alignment set.
$res (mandatory) - the residue number of the character we wish to get or extract.
Returns : A single character representing an amino acid residue or a "gap".
Purpose : Converts a column number to a residue number in the sequence specified by
its index number.
Syntax : int col2res(int $seqidx, int $col)
Arguments:
$seqidx (mandatory) - index number of the desired sequence within the alignment set.
$col (mandatory) - the column number which we want to convert to a residue number.
Returns : An integer representing the residue number corresponding to the given column
number.
Specifics: This is the opposite of res2col().
Purpose : Constructs a consensus string for all the sequences in the alignment set.
Syntax : string consensus(int $threshold)
Arguments:
$threshold (optional) - a number from 0 to 100, indicating the percentage threshold below
which the unknown character "?" is used in a particular position or column in the
consensus string. If omitted, this is set to 100 by default.
Returns : The consensus string formed according to the given threshold.
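One possible reading of the threshold rule can be sketched as follows. This is not BioPHP's implementation: sketch_consensus() is a hypothetical name, the input is a plain array of strings rather than a SeqAlign object, all strings are assumed to have the same length, and gap characters are counted like any other symbol.

```php
<?php
// Hypothetical sketch of a threshold-based consensus; not BioPHP's actual code.
// Assumes $seqs is a non-empty array of equal-length strings.
function sketch_consensus(array $seqs, int $threshold = 100): string
{
    $consensus = "";
    $total = count($seqs);
    $length = strlen($seqs[0]);
    for ($col = 0; $col < $length; $col++) {
        // Count how often each character occurs in this column.
        $counts = [];
        foreach ($seqs as $seq) {
            $char = $seq[$col];
            $counts[$char] = ($counts[$char] ?? 0) + 1;
        }
        arsort($counts);                 // most frequent character first
        $top = (string) array_key_first($counts);
        $percent = 100 * $counts[$top] / $total;
        // Keep the character only if it reaches the threshold.
        $consensus .= ($percent >= $threshold) ? $top : "?";
    }
    return $consensus;
}
```

With the default threshold of 100, only columns in which every sequence agrees keep their character; lowering the threshold lets a majority character through.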
Purpose : Removes a sequence specified by its id from the alignment set.
Syntax : int del_seq(string $seqid)
Arguments: $seqid (mandatory) - the id of the sequence to be deleted from the alignment set.
Returns : The number of sequences in the alignment set after the call to del_seq().
See also : add_seq().
Purpose : Retrieves a specified sequence from the alignment set and returns it as a
Seq object.
Syntax : Seq object fetch($id)
Arguments:
$id (optional) - the id of the sequence we wish to retrieve from the alignment set.
If omitted, the current sequence (pointed to by the sequence pointer or seqptr
property), is fetched instead.
Returns : Seq object
Specifics: Once fetched, the sequence can use all Seq class properties and methods.
Purpose : Moves the sequence pointer to the first sequence in an alignment set.
Syntax : void first()
Arguments: None
Returns : None
See also : last(), next(), prev().
Purpose : Computes and returns the gap_count property of an alignment set or object.
Syntax : int get_gap_count()
Arguments: None
Returns : The number of "gap characters" in all the sequences in the alignment set.
Specifics: See also the gap_count property of the SeqAlign class.
Purpose : Computes and returns the is_flush property of the SeqAlign object.
Syntax : boolean get_is_flush()
Arguments: None
Returns : TRUE if all the sequences in the alignment set have the same length,
FALSE otherwise.
See also : the is_flush property of the SeqAlign class.
Purpose : Returns the length of a SeqAlign object.
Syntax : int get_length()
Arguments: None
Returns : An integer representing the length of the longest sequence in the alignment set.
Specifics: This is a sure-fire way to get the length of a SeqAlign object. However, it is
slower than simply accessing the length property.
See also : The length property of the SeqAlign object.
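The length, seq_count, gap_count, and is_flush values described above can all be derived from the sequence strings themselves. The following is a minimal sketch on a plain array of strings; sketch_alignment_stats() is a hypothetical helper, not part of BioPHP.

```php
<?php
// Hypothetical helper; not BioPHP's actual code.
// $seqs is a plain array of aligned sequence strings.
function sketch_alignment_stats(array $seqs): array
{
    $lengths = array_map('strlen', $seqs);
    $gaps = 0;
    foreach ($seqs as $seq) {
        $gaps += substr_count($seq, "-"); // count "-" gap characters
    }
    return [
        "length"    => $lengths ? max($lengths) : 0,
        "seq_count" => count($seqs),
        "gap_count" => $gaps,
        // Flush means every sequence has the same length.
        "is_flush"  => count(array_unique($lengths)) <= 1,
    ];
}

$stats = sketch_alignment_stats(["ATGC-TGA--CTGA", "TTGT-TAA--CCGT"]);
```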
Purpose : Moves the sequence pointer to the last sequence in an alignment set.
Syntax : void last()
Arguments: None
Returns : None
See also : first(), next(), prev().
Purpose : Moves the sequence pointer to the next sequence in the alignment set.
Syntax : void next()
Arguments: None
Returns : None
See also : first(), last(), prev().
Purpose : Moves the sequence pointer to the previous sequence in the alignment set.
Syntax : void prev()
Arguments: None
Returns : None
See also : first(), last(), next().
Purpose : Converts a residue number to a column number in the sequence specified by its
index number.
Syntax : int res2col(int $seqidx, int $res)
Arguments:
$seqidx (mandatory) - the index number of the desired sequence in the alignment set.
$res (mandatory) - the residue number we wish to convert into a column number.
Returns : An integer representing the column number corresponding to the given residue number.
Specifics: This is the opposite of col2res().
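The relationship between columns and residues can be sketched on a single aligned string. The numbering convention is an assumption here (zero-based columns and zero-based residue numbers, with "-" as the only gap character), since this reference does not state it. Both function names are hypothetical and neither uses BioPHP.

```php
<?php
// Hypothetical sketch; not BioPHP's actual code. Assumes zero-based column
// and residue numbers, and that "-" is the only gap character.

// Column index of the residue numbered $res, or -1 if there is none.
function sketch_res2col(string $aligned, int $res): int
{
    $seen = -1; // number of the last residue seen so far
    for ($col = 0; $col < strlen($aligned); $col++) {
        if ($aligned[$col] !== "-" && ++$seen === $res) {
            return $col;
        }
    }
    return -1;
}

// Residue number found at column $col, or -1 if that column is a gap
// or lies outside the sequence.
function sketch_col2res(string $aligned, int $col): int
{
    if ($col < 0 || $col >= strlen($aligned) || $aligned[$col] === "-") {
        return -1;
    }
    // Residues up to and including $col, minus one for zero-based numbering.
    $prefix = substr($aligned, 0, $col + 1);
    return strlen($prefix) - substr_count($prefix, "-") - 1;
}
```

Under these assumptions the two functions undo each other for any column that holds a residue.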
Purpose : Determines the index position of both variant and invariant residues according
to a given "percentage threshold" similar to that in the consensus() method.
Syntax : assoc array res_var(int $threshold)
Arguments:
$threshold (optional) - a number from 0 to 100, indicating the percentage threshold below
which the current index position is considered variant, and at or above which the current
index position is considered invariant. If omitted, this is set to 100 by default.
Returns : An associative array of the form:
( "INVARIANT" => ( 0, 2, 4, 6 ), "VARIANT" => ( 1, 3, 5 ) )
Specifics: See also consensus() method.
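The classification into variant and invariant positions might be sketched like this. It is an illustration only: sketch_res_var() is a hypothetical name, the input is a plain array of equal-length strings, and a position is treated as invariant when its most frequent character reaches the threshold percentage.

```php
<?php
// Hypothetical sketch; not BioPHP's actual code.
// Assumes $seqs is a non-empty array of equal-length strings.
function sketch_res_var(array $seqs, int $threshold = 100): array
{
    $result = ["INVARIANT" => [], "VARIANT" => []];
    $total = count($seqs);
    for ($col = 0; $col < strlen($seqs[0]); $col++) {
        // Gather this column's characters and count each distinct one.
        $column = array_map(fn($seq) => $seq[$col], $seqs);
        $percent = 100 * max(array_count_values($column)) / $total;
        $key = ($percent >= $threshold) ? "INVARIANT" : "VARIANT";
        $result[$key][] = $col;
    }
    return $result;
}
```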
Purpose : Creates a new alignment set from non-consecutive sequences found in another existing
alignment set.
Syntax : SeqAlign object select(int $index1, int $index2, ... )
Arguments:
$index1, $index2, ... (at least one index argument is required) - the index number(s) of the
sequence(s) in the current alignment set to include in the new SeqAlign object.
Returns : A SeqAlign object made up of specific sequences indicated by $index1, $index2, etc.
Specifics: The number of arguments is variable and is theoretically unlimited.
See also : subalign().
Purpose : Sorts the arrangement of sequences in an alignment set by name/id and then by start
number, in either ascending ("ASC") or descending ("DESC") order.
Syntax : array sort_alpha($option)
Arguments:
$option (optional) - accepts either "ASC" or "DESC" in whatever case (uppercase, lowercase,
mixed case). This determines the sort order of the alignment set.
Returns : An array of Seq objects belonging to an alignment set, sorted in either ascending or
descending order. Format is similar to the seqset property of the SeqAlign class.
Specifics: This permanently changes the seqset property of the SeqAlign object.
Purpose : Creates a new alignment set from a series of contiguous/consecutive sequences.
Syntax : SeqAlign object subalign(int $beg, int $end)
Arguments:
$beg (mandatory) - the index number of the first sequence to include in the new SeqAlign object.
$end (mandatory) - the index number of the last sequence to include in the new SeqAlign object.
Returns : A SeqAlign object consisting of consecutive sequences from another SeqAlign object.
See also : select().
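The difference between select() and subalign() can be shown on a plain PHP array. The sketch below is hypothetical and does not use BioPHP; it assumes zero-based index numbers and treats $end as inclusive, neither of which is stated in this reference.

```php
<?php
// Hypothetical sketch on plain arrays; not BioPHP's actual code.
// Assumes zero-based index numbers and an inclusive $end.

// Pick arbitrary (possibly non-consecutive) sequences by index.
function sketch_select(array $seqs, int ...$indexes): array
{
    $picked = [];
    foreach ($indexes as $i) {
        $picked[] = $seqs[$i];
    }
    return $picked;
}

// Pick the consecutive run of sequences from $beg through $end.
function sketch_subalign(array $seqs, int $beg, int $end): array
{
    return array_slice($seqs, $beg, $end - $beg + 1);
}
```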
Purpose : Gets a substring in between residue numbers from the specified sequence.
Syntax : string substr_bw_res(int $seqidx, int $res_beg, int $res_end)
Arguments:
$seqidx (mandatory) - the index number of the desired sequence in the alignment set.
$res_beg (mandatory) - the residue number of the character at the beginning of the
substring we wish to extract.
$res_end (optional) - the residue number of the character at the end of the substring
we wish to extract.
Returns : A substring within the specified sequence.
See also : char_at_res().
class SeqDB
The SeqDB class represents a set of sequence data stored in one or more physical
electronic files. Thus, we can create a SeqDB object to represent the entire
GenBank database or a subset thereof.
dbname The name of a SeqDB database object, which you assign when creating
it. This is also the name of the primary index and directory files of the
database.
data_fn The complete name (with extension) of the primary index file of the
database.
dir_fn The complete name (with extension) of the directory file of the database.
seqptr A variable which points to the current sequence record. When processing
a SeqDB object, BioPHP maintains a "record pointer" which can be moved
using the first(), last(), prev(), and next() methods. This is an integer
value which is 0 when at the first record in the database, 1 when at the
second record, and so on.
seqcount This is the number of sequence records in the SeqDB database object.
SeqDB() The constructor method for the SeqDB database class. This performs
various preliminary tasks such as creating the primary index and directory
files for the database, etc.
Purpose : Tests if the current line signals the start of a new sequence record
in a data file. In a Genbank file, the first line of a new entry begins
with the keyword "LOCUS". In a SwissProt file, the keyword is "ID".
Syntax : boolean at_entrystart(string $linestr, string $dbformat)
Arguments:
$linestr (mandatory) - a line from the datafile which we are currently processing.
$dbformat (mandatory) - identifies the database format to use in parsing the
data file. As of version 1.0, the only valid values are "GENBANK"
and "SWISSPROT".
Returns : TRUE if the current line signals the start of a new sequence
record, FALSE otherwise.
Specifics: The format of the data file is specified in the open() method.
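The keyword test described above could look roughly like the sketch below. It is not BioPHP's code: sketch_at_entrystart() is a hypothetical name, and the sketch assumes the keyword is the first whitespace-delimited token on the line. The sample lines used with it are made up for illustration.

```php
<?php
// Hypothetical sketch; not BioPHP's actual code.
function sketch_at_entrystart(string $linestr, string $dbformat): bool
{
    // Keyword that opens a record in each supported format.
    $keywords = ["GENBANK" => "LOCUS", "SWISSPROT" => "ID"];
    $format = strtoupper($dbformat);
    if (!isset($keywords[$format])) {
        return false; // unknown format
    }
    // The first whitespace-delimited token must equal the keyword.
    $tokens = preg_split('/\s+/', $linestr, 2);
    return $tokens[0] === $keywords[$format];
}
```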
Purpose : This "closes" the SeqDB database, after we're through working with it. This is
the opposite of the open() method.
Syntax : void close()
Arguments: None
Returns : None
See also : new keyword, open().
Purpose : Retrieves all data from the specified sequence record and returns them in the
form of a Seq object. This method invokes one of several parser methods. For
version 1.0, we normally do not access the parser method directly.
Syntax : Seq object fetch(string $seqid)
Arguments:
$seqid (optional) - The unique identifier of the sequence. If omitted, this is set to
the id of the "current sequence record" (as pointed to by the sequence pointer).
Returns : A Seq object
Purpose : This moves the record pointer to the first record in the SeqDB database.
Syntax : void first()
Arguments: None
Returns : None
See also : next(), prev(), last().
Purpose : Returns the ID of a GenBank or Swissprot sequence record.
Syntax : string get_entryid(array &$flines, string $linestr, string $dbformat)
Arguments:
&$flines (mandatory) - an array of lines from the sequence record. The method uses this only
when dealing with Swissprot sequences.
$linestr (mandatory) - the line containing the ID of the sequence record.
$dbformat (mandatory) - the specific database format of the sequence record.
Returns : The ID of the sequence record.
Purpose : Determines if a string found in the features section of a GenBank entry is
that of a "qualifier" (see gbrel.txt for its definition).
Syntax : boolean isa_qualifier(string $str)
Arguments: $str (mandatory) - the string to test if it is a GenBank feature qualifier or not.
Returns : TRUE if the string $str is a qualifier, FALSE otherwise.
Purpose : Moves the record pointer to the last record in the SeqDB database.
Syntax : void last()
Arguments: None
Returns : None
See also : first(), next(), prev().
Purpose : This moves the record pointer to the next record in the SeqDB database.
Syntax : void next()
Arguments: None
Returns : None
See also : first(), last(), prev().
Purpose : Opens the SeqDB object for processing.
Syntax : void open($dbname)
Arguments: $dbname (mandatory) - the name of the database (index file) you wish to open.
Returns : None
Specifics:
1) This method initializes the dbname, data_fn, dir_fn, and seqptr properties of the
SeqDB object.
2) Right now, this method issues a call to die() when the database cannot be opened.
See also : close().
Purpose : Parses one sequence entry or record in a Genbank file.
Syntax : Seq object parse_id($flines)
Arguments: $flines (mandatory) - an array of strings representing the lines of one sequence
entry in a Genbank database or file.
Returns : A Seq object corresponding to the sequence entry.
See also : parse_swissprot().
Purpose : Parses one sequence entry or record in a Swissprot sequence file.
Syntax : Seq object parse_swissprot($flines)
Arguments: $flines (mandatory) - an array of strings representing the lines of one sequence
entry in a Swissprot database or file.
Returns : A Seq object corresponding to the sequence entry.
See also : parse_id().
Purpose : This moves the record pointer to the previous record in the SeqDB database.
Syntax : void prev()
Arguments: None
Returns : None
Specifics: This method also updates the value of the seqptr property.
See also : first(), last(), next().
class SeqMatch
This class represents the results of performing sequence analysis and matching on two or more
sequences. While the basic matching algorithm is present, this class is largely still under
development.
Before anything else, a short foray into terminology is in order.
The terms "symbol", "character", and "letter" are synonymous, and represent a single element (e.g.
a nucleotide or amino acid residue) in a larger chain of such elements. Note that the use of the term
"letter" is still valid even if it's actually a dash ("-") or a symbol not in the Roman alphabet
of A to Z.
The words "string" and "sequence" will be used interchangeably to mean the same thing -- a series
of symbols representing a biological entity such as a nucleotide or amino acid chain. Sometimes,
the term "input string" is used to indicate that the string will serve as an input to a function
that may itself produce or return another string, which can be called an "output string" or a
"result string". The latter usually (but not always) has the same length as the input strings.
The term "corresponding characters" (or symbols or letters) refers to characters in two strings that
are found in the same location or have equal position indexes.
When two corresponding characters are identical, they constitute an "exact match". When two
corresponding characters are not identical but belong to the same group (as defined by a submatrix),
they constitute a "partial match". When two corresponding characters are not identical and do not
belong to the same group, they are a "mismatch" or a "non-match".
result The result string produced by the match() method. Example: "GAV++ Y+ R"
hamdist The Hamming Distance between two strings. This is simply the number of corresponding
characters that are not exact matches.
levdist The Levenshtein Distance between two strings. This is the minimum number of insertion,
deletion, or replacement operations that need to be performed on either input string to
make them identical. Note that insertion and deletion are symmetrical, i.e., inserting
a letter in one string has the same effect (of making two strings more identical) as
deleting a letter in the other string.
Purpose : Compares two symbols and reports whether they are an exact match, a partial match, or a mismatch.
Syntax : char compare_letter(char $let1, char $let2, array $matrix, char $equal,
char $partial = "+", char $nomatch = ".")
Arguments:
$let1 (mandatory) - the first amino acid residue symbol to match.
$let2 (mandatory) - the second amino acid residue symbol to match.
$matrix (optional) - the substitution matrix to use in matching. If omitted, the
default $chemgrp_matrix table is used.
$equal (optional) - the character symbol to return if $let1 and $let2 are exact matches.
If omitted, the symbol ($let1 or $let2) itself is returned.
$partial (optional) - the character symbol to return if $let1 and $let2 are partial
matches. If omitted, the "+" symbol is returned.
$nomatch (optional) - the character symbol to return if $let1 and $let2 are totally
mismatched. If omitted, a whitespace (" ") is returned.
Returns : A character symbol which indicates whether the two residues are an exact match, a
partial match, or a mismatch.
See also : partial_match(), match(), and SubMatrix class properties and methods.
Purpose : Computes the Hamming Distance between two strings or sequences.
Syntax : int hamdist(string/Seq object $seq1, string/Seq object $seq2)
Arguments:
$seq1 (mandatory) - the first string or sequence
$seq2 (mandatory) - the second string or sequence
Returns : The Hamming Distance between the two strings or sequences, defined to be the number of
"mismatching" characters found in corresponding positions in the two strings.
See also : levdist(), xlevdist().
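The definition above translates almost directly into code. The following sketch is hypothetical and independent of BioPHP; it works on plain strings only and assumes both strings have the same length, rejecting the input otherwise.

```php
<?php
// Hypothetical sketch; not BioPHP's actual code.
// Assumes both strings have the same length.
function sketch_hamdist(string $seq1, string $seq2): int
{
    if (strlen($seq1) !== strlen($seq2)) {
        throw new InvalidArgumentException("Strings must have equal length.");
    }
    $distance = 0;
    for ($i = 0; $i < strlen($seq1); $i++) {
        if ($seq1[$i] !== $seq2[$i]) {
            $distance++; // corresponding characters differ
        }
    }
    return $distance;
}
```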
Purpose : Computes the Levenshtein Distance between two strings or sequences.
Syntax :
int levdist(string/Seq object $seq1, string/Seq object $seq2, int $cost_ins, int $cost_rep,
int $cost_del)
Arguments:
$seq1 (mandatory) - the first string or sequence, with a length not exceeding 255 symbols.
$seq2 (mandatory) - the second string or sequence, with a length not exceeding 255 symbols.
$cost_ins (optional) - the cost or weight of an insertion operation. Set to 1 if omitted.
$cost_rep (optional) - the cost or weight of a replacement operation. Set to 1 if omitted.
$cost_del (optional) - the cost or weight of a deletion operation. Set to 1 if omitted.
Returns : The Levenshtein Distance between two strings, defined to be the number of insertion,
deletion, or replacement operations that must be performed on the strings before they
can become identical.
Specifics: This uses PHP's built-in levenshtein() function, which has a 255-character limit.
For longer strings, use the xlevdist() method. However, levdist() allows you to alter the
cost of insertions, deletions, and replacements, which xlevdist() does not.
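For reference, the underlying PHP function can be called directly, as in the short sketch below, which bypasses BioPHP entirely. Its optional cost arguments come in the same insertion, replacement, deletion order as in the levdist() syntax. Note that the length limit applies to PHP versions before 8.0, where the function returns -1 for longer strings.

```php
<?php
// Calls PHP's built-in levenshtein(); the cost arguments follow the same
// insertion, replacement, deletion order as levdist() above.
$plain    = levenshtein("kitten", "sitting");          // unit costs
$weighted = levenshtein("kitten", "sitting", 1, 2, 1); // replacements cost 2
```

Raising the replacement cost to 2 makes a replacement no cheaper than a deletion followed by an insertion, which is why the weighted distance is larger.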
Purpose : This is typically used to compare protein sequences but may be used on nucleic
acid sequences. It compares two sequences and returns a "result string" of
special symbols.
Syntax :
string match(string $str1, string $str2, array $matrix, char $equal, char $partial,
char $nomatch)
Arguments:
$str1 (mandatory) - the first of two sequences being compared.
$str2 (mandatory) - the second of two sequences being compared.
$matrix (optional) - an array specifying valid symbol substitution and equivalence rules.
Its format is similar to the rules property of a SubMatrix object.
$equal (optional) - the symbol to output if the symbol in the first sequence ($str1) is
exactly the same as the corresponding symbol in the second sequence ($str2).
$partial (optional) - the symbol to output if the symbol in the first sequence ($str1) is
equivalent but not identical to the corresponding symbol in the second sequence ($str2).
$nomatch (optional) - the symbol to output if the symbol in the first sequence ($str1) is
neither identical nor equivalent to the corresponding symbol in the second sequence ($str2).
If any of the last three arguments are omitted, the following symbols are used by default:
1) whitespace - no match
2) + sign - a partial match (the two amino acids belong to the same chemical group)
3) the original symbol itself - in case of an exact match.
Returns : A string which indicates where exact, partial and no matches occur between the first
and second sequences being compared.
Example :
<?php
$seqm_o = new SeqMatch();
$result = $seqm_o->match("GAVLIFYWKR", "GAVILGYFVR");
?>
In the above code, $result is "GAV++ Y+ R". How this came to be is explained below.
Position index    0123456789
1st input string  GAVLIFYWKR
2nd input string  GAVILGYFVR
Result string     GAV++ Y+ R
In positions 0, 1, 2, 6, and 9, the characters in the 1st and 2nd input strings are identical,
so we simply "copy" them onto the result string. In positions 3, 4, and 7, the characters are
not identical BUT they belong to the same chemical group (in the default submatrix), so "+"
is copied onto the result string. In positions 5 and 8, the characters are not identical
and DO NOT belong to the same chemical group, so a blank space " " is copied onto the
result string. Other bioinformatics software uses the vertical bar "|" to denote an exact
match, and ":" to indicate a partial match.
See also : SubMatrix Class, its properties and methods.
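The matching procedure walked through above can be sketched in a few lines of plain PHP. This is an illustration, not BioPHP's implementation: sketch_match() is a hypothetical name, and the grouping used as the default matrix is an assumption chosen so that the worked example comes out the same. It is not BioPHP's actual default $chemgrp_matrix.

```php
<?php
// Hypothetical sketch; not BioPHP's actual code. The grouping below is an
// illustrative assumption, not BioPHP's default $chemgrp_matrix.
function sketch_match(string $str1, string $str2, ?array $matrix = null,
                      ?string $equal = null, string $partial = "+",
                      string $nomatch = " "): string
{
    $matrix = $matrix ?? [
        ['G', 'A', 'V', 'L', 'I'],  // small or aliphatic side chains
        ['F', 'Y', 'W'],            // aromatic side chains
        ['K', 'R', 'H'],            // basic side chains
        ['D', 'E'],                 // acidic side chains
    ];
    $result = "";
    $length = min(strlen($str1), strlen($str2));
    for ($i = 0; $i < $length; $i++) {
        $a = $str1[$i];
        $b = $str2[$i];
        if ($a === $b) {
            $result .= $equal ?? $a;          // exact match
            continue;
        }
        $symbol = $nomatch;                    // assume a mismatch
        foreach ($matrix as $rule) {
            if (in_array($a, $rule, true) && in_array($b, $rule, true)) {
                $symbol = $partial;            // both in the same group
                break;
            }
        }
        $result .= $symbol;
    }
    return $result;
}

$result = sketch_match("GAVLIFYWKR", "GAVILGYFVR"); // "GAV++ Y+ R"
```

Passing "|", ":", and "." as the last three arguments would produce output in the style used by other bioinformatics software.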
Purpose : Determines if two symbols are partial matches or not. For example, two
amino acid residues are partial matches if they belong to the same
chemical group according to a substitution matrix.
Syntax : boolean partial_match(char $let1, char $let2, array $matrix)
Arguments:
$let1 (mandatory) - The first amino acid residue.
$let2 (mandatory) - The second amino acid residue.
$matrix (optional) - The substitution matrix to use for determining partial matches.
If omitted, the default $chemgrp_matrix table is used.
Returns : TRUE if the two symbols belong to the same chemical group, FALSE otherwise.
See also : SubMatrix class properties and methods.
Purpose : Computes the Levenshtein Distance between two strings.
Syntax : int xlevdist(string/Seq object $seq1, string/Seq object $seq2)
Arguments:
$seq1 (mandatory) - the first string or sequence, with a length not exceeding 1024 symbols.
$seq2 (mandatory) - the second string or sequence, with a length not exceeding 1024 symbols.
Returns : The Levenshtein Distance between two strings, as defined in levdist().
See also : levdist().
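A Levenshtein Distance routine that does not depend on PHP's built-in function can be written with the classic dynamic-programming table. The sketch below is one such version with unit costs; sketch_xlevdist() is a hypothetical name and this is not necessarily how BioPHP's xlevdist() is implemented.

```php
<?php
// Hypothetical sketch; not BioPHP's actual code. Classic dynamic-programming
// Levenshtein distance with unit costs, keeping only two rows in memory.
function sketch_xlevdist(string $seq1, string $seq2): int
{
    $len1 = strlen($seq1);
    $len2 = strlen($seq2);
    $prev = range(0, $len2); // distances from "" to each prefix of $seq2
    for ($i = 1; $i <= $len1; $i++) {
        $curr = [$i];        // distance from a prefix of $seq1 to ""
        for ($j = 1; $j <= $len2; $j++) {
            $cost = ($seq1[$i - 1] === $seq2[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,          // deletion
                $curr[$j - 1] + 1,      // insertion
                $prev[$j - 1] + $cost   // replacement or match
            );
        }
        $prev = $curr;
    }
    return $prev[$len2];
}
```

Keeping two rows instead of the full table limits memory use to the length of the second string, although running time still grows with the product of the two lengths.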
class SubMatrix
This class is a tool used to create customized look up tables or substitution matrices employed
by various methods such as the translate() method of the Seq object. This is necessary to enable
translation of proteins using a different (non-human) set of "genetic code" as observed in other
organisms.
A substitution matrix is defined to be a two-dimensional array of symbols. The outer array
is itself the collection of substitution rules. Each element of this array (inner array) is
made up of "equivalent or substitutable symbols", i.e., symbols which, if found in
corresponding positions in two sequence strings, would constitute a partial match.
Example: ( ('D', 'E'), ('K', 'R', 'H'), ('X') )
In the example above, our submatrix has three elements. The last element contains only one
symbol, X. For our purposes, we shall refer to the outer array as the substitution matrix
and to each of its elements as a substitution rule.
rules A two-dimensional array representing allowable or valid symbol substitutions.
Purpose : The constructor method of the SubMatrix class, which does nothing but initialize
the rules property to an empty array.
Purpose : Method for defining rules and adding them to a rule base (the rules property).
Syntax : void addrules(array $rule)
Arguments: $rule (mandatory) - an array containing equivalent or interchangeable symbols.
Returns : None
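To make the data structure concrete, the sketch below builds the example matrix shown earlier in this section, one rule at a time. SketchSubMatrix is a hypothetical stand-in written for illustration; it mirrors the rules property and the addrules() syntax given above but is not BioPHP's own class.

```php
<?php
// Hypothetical stand-in for the SubMatrix class; not BioPHP's actual code.
class SketchSubMatrix
{
    public $rules = []; // outer array: the list of substitution rules

    // Append one rule, i.e. one array of interchangeable symbols.
    public function addrules(array $rule): void
    {
        $this->rules[] = $rule;
    }
}

$submatrix = new SketchSubMatrix();
$submatrix->addrules(['D', 'E']);
$submatrix->addrules(['K', 'R', 'H']);
$submatrix->addrules(['X']);
```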