BioPHP: PHP for Biocomputing

Last updated: April 24, 2003


BioPHP Technical Reference
Written by Serge Gregorio

The BioPHP Classes

As of version 1.0, there are 7 classes in BioPHP. They are (in alphabetical order):

   1) Protein Class
   2) RestEn Class
   3) Seq Class
   4) SeqAlign Class
   5) SeqDB Class
   6) SeqMatch Class
   7) SubMatrix Class

class Protein

Description
  This class represents the end-products of the genetic processes of transcription
  and translation -- the proteins.  While a protein's primary structure (its amino
  acid sequence) is ably represented as a Seq object, its secondary and tertiary 
  structures are not. This is the main rationale for creating a separate Protein 
  class. This class is still under development.
Properties
  id          A string that uniquely identifies a protein.
  name        The long name used to refer to this protein.
Methods

molwt()
  Purpose  : Returns the molecular weight of the protein object.
  Syntax   : array molwt()
  Arguments: None (The sequence property of the current Protein object is implied.)
  Returns  :  An array of the form: ( lower_molwt, upper_molwt )
  Specifics: This is similar to the molwt() method of the Seq Class.
seqlen()
  Purpose  : Returns the length of the protein object, i.e. the number of amino acids in it.
  Syntax   : int seqlen()
  Arguments: None (The sequence property of the current Protein object is implied.)
  Returns  : An integer representing the number of amino acids in the protein.
  Specifics: This has similarities with the seqlen() method of the Seq Class.

class RestEn

Description
  RestEn is short for "restriction enzymes" or "restriction endonucleases",
  enzymes that "cut" a DNA strand into two or more fragments at special sites
  called restriction sites. They are an important tool in recombinant DNA
  technology.
Properties
  name      The short name of the restriction endonuclease, following the accepted
            naming convention (the first three letters represent the organism from
            which the enzyme was first sourced or discovered, followed by a Roman
            numeral, etc.).  Examples are EcoRI for the bacterium Escherichia coli
            and BamHI for Bacillus amyloliquefaciens.

  pattern   A string representing the restriction pattern recognized by the enzyme.

  cutpos    An integer representing the position within the restriction pattern
            where the enzyme actually cuts the DNA strand. This can range from
            0 to 1 less than the length of the restriction pattern.
			  
  length    The number of symbols (or base pairs) in the restriction pattern.
Methods

RestEn()
  Purpose  : Constructor method for the RestEn object.
CutSeq()
  Purpose  : Cuts the DNA sequence into fragments using the restriction enzyme object.
  Syntax   : array CutSeq(Seq object $seq, char $options)

  Arguments:     

    $seq (mandatory) - the sequence to cut using the current restriction enzyme object.
	 
    $options (optional) - may be "N" or "O".  If "N", the sequence is cut using the patpos() group 
      of methods (no overlapping patterns).  If "O", the sequence is cut using the patposo() group 
      of methods (with overlapping patterns). If omitted, this defaults to "N".

  Returns  : An array of fragments (substrings of the parameter sequence $seq) of the form:
  
    ( fragment1, fragment2, ... )

  See also : patpos(), patposo() methods of the Seq class.
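
  The cutting behaviour described above can be sketched in plain PHP. This is only a
  minimal illustration, not BioPHP's implementation: the function name
  sketch_cut_sequence() is hypothetical, the pattern is treated as a literal string
  (no wildcards), matches are non-overlapping (the "N" option), and $cutpos is assumed
  to count symbols from the start of the matched pattern.

```php
<?php
// Hypothetical helper (not part of BioPHP): cuts $sequence at every
// non-overlapping occurrence of the literal $pattern, $cutpos symbols
// into each match.
function sketch_cut_sequence(string $sequence, string $pattern, int $cutpos): array
{
    $plen = strlen($pattern);
    if ($plen === 0) {
        return [$sequence]; // nothing to search for, so nothing is cut
    }

    $fragments = [];
    $start = 0;   // index where the current fragment begins
    $offset = 0;  // index where the next search begins

    while (($pos = stripos($sequence, $pattern, $offset)) !== false) {
        $cut = $pos + $cutpos;                      // absolute index of the cut
        $fragments[] = substr($sequence, $start, $cut - $start);
        $start = $cut;
        $offset = $pos + $plen;                     // skip past the whole match
    }
    $fragments[] = substr($sequence, $start);       // remainder after the last cut

    return $fragments;
}

print_r(sketch_cut_sequence("AAGAATTCTTGAATTCGG", "GAATTC", 1));
```

  With the EcoRI pattern GAATTC and a cut position of 1, the call above yields the
  fragments AAG, AATTCTTG, and AATTCGG.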

FindRestEn()
  Purpose  : A powerful method for searching our database of endonucleases for a particular 
             restriction enzyme exhibiting certain properties like pattern, cutting position, 
             and length, or combinations thereof. 

  Syntax   : array FindRestEn(string $pattern, string/int $cutpos, string/int $plen)

  Arguments:

    $pattern (optional) - the pattern of the restriction enzyme we wish to look for.  If 
      omitted, this is set to the blank string ("").

    $cutpos (optional) - the cutting position of the restriction enzyme we wish to look for. 
      If omitted, this is set to the blank string ("").

    $plen (optional) - the length of the restriction pattern of the enzyme we wish to look
      for. If omitted, this is set to the blank string ("").

  Returns  : 

    A list of restriction enzymes that meet the criteria specified by the $pattern, $cutpos,
    and $plen parameters.  The list is an array of the form: 

      ( rest_en1, rest_en2, ... ) 
GetCutPos()
  Purpose  : Determines the cutting position of the restriction enzyme object.
  Syntax   : int GetCutPos(string $RestEnName)

  Arguments: 

    $RestEnName (optional) - the name of the restriction enzyme.  If omitted, this defaults
      to the name property of the RestEn object.

  Returns  : Returns the cutting position (an integer) of the restriction enzyme object.
  Specifics: This method looks up the cutting position in a Restriction Enzyme database, 
             provided with BioPHP.				 
				 
  See also : GetLength(), GetPattern()				 
GetLength()
  Purpose  : Returns the length of the restriction pattern recognized by the enzyme.
  Syntax   : int GetLength(string $RestEnName)

  Arguments:

    $RestEnName (optional) - the name of the restriction enzyme.  If omitted, $RestEnName 
      defaults to the name property of the RestEn object.

  Returns  : The length (integer) of the restriction pattern recognized by the enzyme.
  Specifics: This method looks up the length in the Restriction Enzyme database of BioPHP.
  
  See also : GetCutPos(), GetPattern().
GetPattern()
  Purpose  : Returns the restriction pattern recognized by the restriction enzyme object.
  Syntax   : string GetPattern(string $RestEnName)

  Arguments: 

    $RestEnName (optional) - the name of the restriction enzyme.  If omitted, $RestEnName
      defaults to the name property of the RestEn object.

  Returns  : The sequence pattern (string) recognized by the given restriction enzyme.
  Specifics: This method looks up the pattern in the Restriction Enzyme database of BioPHP.

  See also : GetCutPos(), GetLength().

class Seq

Description
  An instance of the Seq class represents a single sequence record in a SeqDB 
  database object. Usually, we instantiate this with a call to the fetch() method
  of the SeqDB class.
Properties
  For a more detailed description of the properties of this class, consult the 
  gbrel.txt file released by the NCBI. It describes the content and format of 
  the GenBank database.

  id             The primary sequence id which uniquely identifies a sequence record
                 in a SeqDB database.

  strands        Further describes the molecule type of a particular sequence.
                 May be one of these three values: SINGLE, DOUBLE, MIXED.

  moltype        The type of molecule which makes up a particular sequence. Typical
                 values are DNA, RNA, etc.

  topology       The shape of a particular sequence. May be LINEAR or CIRCULAR.

  division       A high-level classification of the sequence as to its source, e.g.
                 PRI if it came from a primate, BCT if it came from a bacterium, etc.

  date           A date value associated with the sequence record. See gbrel.txt.

  accession      An alternative identifier for a sequence record, much like the id
                 property.

  sec_accession  Secondary accession numbers for the sequence record, returned as
                 an array.

  keywords       Words or phrases associated with the sequence record, aimed at
                 facilitating searches, for example, by topic of interest.

  organism       The creature from which the sequence was obtained or extracted.
                 Typically uses the Latin-based scientific name (Homo sapiens for
                 humans, etc.).

  sequence       A string of symbols denoting the arrangement of "monomers" (basic
                 structural units) within the sequence. Different types of molecular
                 sequences use different alphabets. DNA sequences have only four
                 letters in their alphabet (A, T, G, and C) while protein sequences
                 have 20 letters (G, A, V, L, I, etc.).

  seqlength      The number of monomers making up the sequence. While numerically the
                 same as the value returned by the seqlen() method, using the
                 seqlength property may result in faster code execution in some cases.

  reference      A multi-dimensional associative array of reference information
                 including AUTHOR, JOURNAL (CITATION), etc. To be described in greater
                 detail in BioPHP Documentation 2.0. For now, try doing a print_r()
                 on it to learn more.

  features       A multi-dimensional associative array of information on portions of
                 the sequence that code for proteins and RNA molecules, etc. To be
                 described in greater detail in BioPHP Documentation 2.0. For now,
                 try a print_r() on it.

  For all the other properties listed below, consult gbrel.txt: version, definition,
  ncbi_gi_id, segment, segment_no, segment_count.
Methods

charge()
  Purpose  : Translates an amino acid sequence into its equivalent "charge sequence".
  Syntax   : string charge(string $amino_seq) 
  
  Arguments:

    $amino_seq (optional) - A string representing an amino acid sequence (e.g. GAVLIFYWKRH).
	   If omitted, this is set to the sequence property of the "calling" Seq object. If the 
		latter is not set either, the function returns the boolean value of FALSE.    

  Returns  : A string where each amino acid "letter" is replaced by A (if amino acid is acidic), 
             C (if amino acid is basic), or N (if amino acid is neutral), e.g. ACNNCCNANCCNA.

  Specifics: $amino_seq must be of format type 1 (a string of single-letter amino acid symbols) 
             and not of type 3 (a string of three-letter amino acid symbols).
				
  Versions : In BioPHP version 1.0, $amino_seq is mandatory.  In version 1.1, it is optional.
				 
  See also : chemgrp().				 
chemgrp()
  Purpose  : Translates an amino acid sequence into its equivalent "chemical group sequence",
             a string of symbols from an 8-letter alphabet: A, L, M, R, C, H, I, S.
  Syntax   : string chemgrp(string $amino_seq)
  
  Arguments: $amino_seq (optional) - A string representing an amino acid chain (e.g. GAVLI).
	   If omitted, this is set to the sequence property of the "calling" Seq object. If the 
		latter is not set either, the function returns the boolean value of FALSE.    
                
  Returns  : A string where each amino acid "letter" is replaced by one of the 
             following: A (acidic group), L (aliphatic group), M (amide group), R (aromatic group), 
             C (basic group), H (hydroxyl group), I (imino group), S (sulfur group).

  Example  : This is a sample output: ALMRLIISACHL

  Specifics: $amino_seq must be of format type 1 (a string of single-letter amino acid symbols)
             and not of type 3 (a string of three-letter amino acid symbols).

  Versions : In BioPHP version 1.0, $amino_seq is mandatory.  In version 1.1, it is optional.
  
  See also : charge().

complement()
  Purpose  : Returns a string representing the genetic complement of a sequence.  
  Syntax   : string complement(string $seq, string $moltype)
  Arguments: 
  
    $seq (mandatory) - The string whose complement we want to obtain.

    $moltype (optional) - The type of molecule we are dealing with.  If omitted, $moltype 
      is set to the moltype property of the sequence object.  If the moltype property is
      not initialized, then $moltype is set to "DNA" by default.

  Returns  : A string which is the genetic complement of the input string.
  Specifics: As of now, this method handles only two molecule types, DNA and RNA.
				 
  See also : revcomp(), moltype property of the Seq class.
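
  As a rough illustration of the complement rule, the sketch below pairs each base with
  its Watson-Crick partner. The function name sketch_complement() is hypothetical and
  the code is not taken from BioPHP; upper-casing the input and treating any value
  other than "RNA" as DNA are assumptions of this sketch.

```php
<?php
// Hypothetical stand-in for the behaviour described above; not BioPHP's code.
function sketch_complement(string $seq, string $moltype = "DNA"): string
{
    // Watson-Crick pairs; RNA uses U where DNA uses T.
    $pairs = (strtoupper($moltype) === "RNA")
        ? ["A" => "U", "U" => "A", "G" => "C", "C" => "G"]
        : ["A" => "T", "T" => "A", "G" => "C", "C" => "G"];

    // strtr() with an array never re-scans text it has already replaced,
    // so an A turned into T is not turned back into A.
    return strtr(strtoupper($seq), $pairs);
}

echo sketch_complement("ATGCCA"), "\n";        // TACGGT
echo sketch_complement("AUGCCA", "RNA"), "\n"; // UACGGU
```

  Reversing the result with strrev() would give the reverse complement that revcomp()
  is described as returning.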
count_codons()
  Purpose  : Counts the number of codons (triplets of nucleotide bases) in a sequence.
  Syntax   : int count_codons()
  Arguments: None (Seq->sequence property is implied).
  Returns  : The number of codons within a sequence, expressed as a non-negative integer.

  Specifics:

    This method makes use of the value of the /codon_start qualifier inside the Features (CDS) 
    collection of properties. If this qualifier is missing, this method starts counting codons 
    from the very first "letter" in the sequence.
expand_na()
  Purpose  : Returns the expansion of a nucleic acid sequence, replacing special wildcard symbols
             with the proper Perl regular expression.

  Syntax   : string expand_na(string $string) 

  Arguments: $string (mandatory) - the nucleic acid sequence to expand

  Returns  : An "expanded" string where special metacharacters are replaced by the appropriate
             Perl regular expression.  For example, an N or X is replaced by the dot (.)
             metacharacter, an R is replaced by [AG], etc.

  Example 1: Original sequence: ATGXCCRTT      (1)
             Expanded sequence: ATG.CC[AG]TT   (2)

             In line (2), the X symbol is replaced by the dot symbol, R is replaced by [AG].
             The dot symbol (.) matches any of A, T, G, or C.  R matches either A or G but
             not C or T.

  Example 2: Matching Sequences

             ATGACCATT (3) 
             ATGTCCGTT (4)

             Line (3) matches the expanded sequence because A matches the dot, and A matches [AG].
             Line (4) matches the expanded sequence because T matches the dot, and G matches [AG]. 
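
  The expansion and the two matches in Examples 1 and 2 can be reproduced with a short
  sketch. The function name sketch_expand_na() is hypothetical. Only the N, X, and R
  substitutions come from this reference; the other codes follow the standard IUPAC
  nucleotide ambiguity table, and whether BioPHP handles all of them is an assumption.

```php
<?php
// Hypothetical re-creation of the expansion described above (not BioPHP's code).
function sketch_expand_na(string $seq): string
{
    // N, X and R are documented above; the rest are standard IUPAC codes
    // and are assumed here for illustration.
    $codes = [
        "N" => ".",     "X" => ".",
        "R" => "[AG]",  "Y" => "[CT]",
        "S" => "[CG]",  "W" => "[AT]",
        "K" => "[GT]",  "M" => "[AC]",
        "B" => "[CGT]", "D" => "[AGT]",
        "H" => "[ACT]", "V" => "[ACG]",
    ];
    // strtr() does not re-scan replaced text, so the letters inside an
    // inserted character class are left alone.
    return strtr(strtoupper($seq), $codes);
}

$expanded = sketch_expand_na("ATGXCCRTT");
echo $expanded, "\n"; // ATG.CC[AG]TT

// Both sequences from Example 2 match the expanded pattern.
var_dump(preg_match("/^" . $expanded . "$/", "ATGACCATT") === 1); // bool(true)
var_dump(preg_match("/^" . $expanded . "$/", "ATGTCCGTT") === 1); // bool(true)
```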
find_mirror()
  Purpose  : Returns a three-dimensional associative array listing all mirror substrings contained 
             within a given sequence, and their location (expressed as a zero-based index number).

  Syntax   : 

    array find_mirror(string $haystack, int $pallen1, int $pallen2 = "", string $options = "E")

  Arguments: 

    $haystack (optional) - the sequence which will be searched by the method for any occurrences 
	   of mirrors. If omitted, this is set to the sequence property of the current Seq object.

    $pallen1 (mandatory) - the length of the shortest mirror to look for.

    $pallen2 (optional) - the length of the longest mirror to look for.

    $options (optional) - may be "E" or "O" or "A". If "E" is passed, then the method only looks
      for mirrors with even lengths. If "O" is passed, the method only looks for mirrors with odd
      lengths.  If "A" is passed, then method looks for all mirrors (odd and even lengths).  If
      omitted, this is set to "E" by default. 

  Returns  : A three-dimensional associative array with the following format:

    ([len1] => ((mirror1, pos1), (mirror2, pos2)), [len2] => (...), ...)

  Example  : ( [3] => ( ("ATA", 3), ("GCG", 12) ), [4] => ( ("GAAG", 18) ) )
  
  See also : find_palindrome().  
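
  One way to produce a result of this shape is a brute-force scan over every window of
  each requested length. The sketch below is an assumption-laden illustration, not
  BioPHP's algorithm: sketch_find_mirror() is a hypothetical name, both length bounds
  are required here, and nested or overlapping mirrors are all reported.

```php
<?php
// Hypothetical brute-force mirror search (not BioPHP's code).
// Returns [length => [[mirror, position], ...], ...] with zero-based positions.
function sketch_find_mirror(string $haystack, int $minlen, int $maxlen, string $options = "E"): array
{
    $result = [];
    $n = strlen($haystack);

    for ($len = $minlen; $len <= $maxlen; $len++) {
        $even = ($len % 2 === 0);
        if (($options === "E" && !$even) || ($options === "O" && $even)) {
            continue; // this length is excluded by the parity option
        }
        for ($pos = 0; $pos + $len <= $n; $pos++) {
            $sub = substr($haystack, $pos, $len);
            if ($sub === strrev($sub)) {      // reads the same backwards
                $result[$len][] = [$sub, $pos];
            }
        }
    }
    return $result;
}

// "A" asks for mirrors of both odd and even length.
print_r(sketch_find_mirror("GCATACGAAG", 3, 4, "A"));
```

  For the sequence GCATACGAAG this reports ATA at position 2 under length 3 and GAAG at
  position 6 under length 4.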
find_palindrome()
  Purpose  : Returns a two-dimensional array containing palindromic substrings found in a sequence, 
             and their location, in terms of zero-based indices.

  Syntax   : array find_palindrome(string $haystack, int $seqlen = "", int $pallen = "")

  Arguments:

    $haystack (optional) - the sequence to be searched by the method for any genetic palindromes. 
      If omitted, this is set to the sequence property of the current Seq object.

    $seqlen (optional) - the length of the palindromic substring within $haystack.  If omitted,
      the method searches for palindromes of any length.

    $pallen (optional) - the length of one of two palindromic edges in a palindromic substring 
      within $haystack. If omitted, the method does not restrict its search to substrings with 
      an edge of a specified length.

  Returns  : A two-dimensional array of the form:

    ( (palindrome1, position1), (palindrome2, position2), ... )

  Example  : ( ("ATGttCAT", 2), ("ATGccccccCAT", 18) ), palindromic edges are shown in uppercase.

  Specifics: While $seqlen and $pallen are optional, omitting both of them is not allowed. 
  
  See also : find_mirror().
findpattern()
  Purpose  : Returns a one-dimensional array enumerating each occurrence or instance of a given 
             pattern in a larger string or sequence.  This returns the actual substring (that 
             matches the pattern) itself.

  Syntax   : array findpattern(string $pattern, char $options) 

  Arguments:

   $pattern (mandatory) - the pattern to search for as a Perl regular expression (no enclosing 
	  "/" symbol).

   $options (optional) - if set to "I", pattern-matching will be case-insensitive.  Passing 
     anything else would cause the pattern-matching to be case-sensitive.  If not passed, 
     $options is set to "I" (case-insensitive).

  Returns  : The function returns a one-dimensional array of the form:

    ( substring1, substring2, substring3, ... )

  Specifics:

    1) By itself, the findpattern() method does not give the location of substrings that match
       the given pattern. Use the patpos() or the patposo() method for that.

    2) This method does not find "overlapping patterns", such as the following:

       $seqobj->id = 1234;
       $seqobj->sequence = "AGATACA";
       $matches = $seqobj->findpattern("A.A", "I");

       In the code above, $matches is ( "AGA", "ACA"), and does not include ATA because it 
       "overlaps" with AGA. Matching is done from left to right.  If a pattern is found from
       position index m to n, search for the next matching substring resumes at position index 
		 n+1. 

  Examples : 

    1) Pattern is made up of literals (no metacharacters)

       $seqobj->id = 1234;
       $seqobj->sequence = "aagatcagac";
       $seqobj->findpattern("AGA", "I");
       // The above statement returns ( "AGA", "AGA"). 

    2) Pattern uses the dot (.) metacharacter 

       The dot metacharacter matches any single character (in this case, A, T, G, or C).

       $seqobj->id = 1234;
       $seqobj->sequence = "aagcagtaggag";
       $seqobj->findpattern(".AG", "I");
       // The above statement returns ( "AAG", "CAG", "TAG", "GAG"). 

    3) Pattern uses the [] metacharacter.

       The [] metacharacter matches any one character listed inside the square brackets.	
       Thus, [AG] matches either an A or G. 

       $seqobj->id = 1234;
       $seqobj->sequence = "gatgacgaggaa";
       $seqobj->findpattern("GA[TC]", "I");
       // The above statement returns ( "GAT", "GAC"). 
		 
  See also : patpos(), patposo(), patfreq().
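
  The left-to-right, non-overlapping matching described in the Specifics corresponds to
  how PHP's preg_match_all() scans a subject string. The standalone sketch below shows
  that behaviour; sketch_findpattern() is a hypothetical name, the pattern is assumed
  not to contain the "/" delimiter, and upper-casing the matches only mirrors the sample
  outputs in this reference (whether BioPHP normalises case is an assumption).

```php
<?php
// Hypothetical illustration of non-overlapping pattern search (not BioPHP's code).
function sketch_findpattern(string $sequence, string $pattern, string $options = "I"): array
{
    $flags = ($options === "I") ? "i" : "";   // "I" means case-insensitive
    // preg_match_all() resumes after the end of each match, so matches
    // never overlap.
    preg_match_all("/" . $pattern . "/" . $flags, $sequence, $matches);
    return array_map("strtoupper", $matches[0]);
}

print_r(sketch_findpattern("AGATACA", "A.A"));          // AGA, ACA (ATA overlaps AGA)
print_r(sketch_findpattern("aagcagtaggag", ".AG"));     // AAG, CAG, TAG, GAG
print_r(sketch_findpattern("gatgacgaggaa", "GA[TC]"));  // GAT, GAC
```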
get_bridge()
  Purpose  : Returns the sequence located between two palindromic halves of a palindromic string. 
             Take note that the "bridge", as I call it, is not necessarily a genetic mirror 
             or a palindrome.

  Syntax   : string get_bridge(string $string)
  Arguments: $string (mandatory) - a palindromic or mirror sequence containing the bridge.
  Returns  :  A string representing the bridge (as defined above).

  Example  :

    In the sequence "ATGcacgtcCAT", the "cacgtc" is the bridge, while ATG and CAT 
    are the palindromic halves. 
	 
  See also : is_mirror(), is_palindrome(), find_mirror(), find_palindrome().	 
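
  The example above can be worked through in code. The sketch below splits a genetic
  palindrome into its first half, bridge, and second half by looking for the longest
  prefix whose reverse complement equals the suffix of the same length. Choosing the
  longest such edge is an assumption of this sketch, as are the hypothetical names
  sketch_revcomp_dna() and sketch_split_palindrome(); BioPHP may determine the halves
  differently.

```php
<?php
// Hypothetical helpers (not BioPHP's code). DNA alphabet only.
function sketch_revcomp_dna(string $seq): string
{
    // Complement each base, then reverse the string.
    return strrev(strtr(strtoupper($seq), "ATGC", "TACG"));
}

// Returns [first half, bridge, second half].
function sketch_split_palindrome(string $string): array
{
    $n = strlen($string);
    // Try the longest possible edge first and shrink until one fits.
    for ($k = intdiv($n, 2); $k >= 1; $k--) {
        $head = substr($string, 0, $k);
        $tail = substr($string, $n - $k);
        if (sketch_revcomp_dna($head) === strtoupper($tail)) {
            return [$head, substr($string, $k, $n - 2 * $k), $tail];
        }
    }
    return ["", $string, ""]; // no palindromic edges found
}

print_r(sketch_split_palindrome("ATGcacgtcCAT")); // ATG, cacgtc, CAT
```

  The first and third elements correspond to what halfstr() is described as returning
  for $no = 0 and $no = 1, and the middle element to the bridge.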
getcodon()
  Purpose  : Returns the n-th codon in a sequence, with numbering starting at 0.
  Syntax   : string getcodon(int $index, int $readframe = 0)

  Arguments: 

    $index (mandatory) - the index number of the codon.

    $readframe (optional) - the reading frame, which may be 0, 1, or 2 only.  If omitted, this
      is set to 0 by default.

    $this->sequence (implied) - the sequence from which to extract the n-th codon.

  Returns  : The n-th codon in the sequence.

  See also : count_codons().  
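
  The indexing can be illustrated with simple offset arithmetic: codon n starts at
  position readframe + 3n. This is a sketch under that assumption, with the hypothetical
  name sketch_getcodon() and the sequence passed explicitly rather than read from a Seq
  object.

```php
<?php
// Hypothetical illustration of codon indexing (not BioPHP's code).
function sketch_getcodon(string $sequence, int $index, int $readframe = 0): string
{
    // Skip $readframe symbols, then 3 symbols per preceding codon.
    // A trailing partial codon, if any, is returned as-is.
    return (string) substr($sequence, $readframe + 3 * $index, 3);
}

echo sketch_getcodon("ATGGCATTC", 0), "\n";    // ATG
echo sketch_getcodon("ATGGCATTC", 1), "\n";    // GCA
echo sketch_getcodon("ATGGCATTC", 1, 1), "\n"; // CAT
```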
halfstr()
  Purpose  : Returns one of the two palindromic "halves" of a palindromic string. 
  Syntax   : string halfstr(string $string, int $no)

  Arguments: 

    $string (mandatory) - a palindromic sequence.

    $no (mandatory) - pass 0 to get the first palindromic half, pass any other number (e.g. 1) to
      get the second palindromic half.

  Returns  : A string representing either the first or the second palindromic half of the string.

  Specifics: Later on, this will be modified so that $string will become optional. If omitted, the 
    implied argument is $this->sequence.
	 
  See also : get_bridge() for an example.	 
is_mirror()
  Purpose  : Returns TRUE if the given sequence or string is a "genetic mirror" which is the same 
             as a "string palindrome", i.e., a sequence that "looks" the same when read backwards.

  Syntax   : boolean is_mirror(string $string)

  Arguments: 

    $string (optional) - a sequence which we want to test if it is a mirror or not. If omitted,
      $string is, by default, set to the sequence property of the Seq object from which we invoke 
      the method.

  Returns  : TRUE if the given string is a mirror, FALSE otherwise.

  Specifics: The genetic mirror is the same as a palindromic string in traditional programming.

  See also : is_palindrome(), find_mirror(), find_palindrome().
is_palindrome()
  Purpose  : Tests if a given sequence is a "genetic palindrome" (as opposed to a "string 
             palindrome").  A "genetic palindrome" is one where the ends of a sequence are 
             reverse complements of each other. 

  Syntax   : boolean is_palindrome(string $string)

  Arguments:

    $string (optional) - a sequence which we want to test if it is a genetic palindrome or not.
      If omitted, $string is, by default, set to the sequence property of the Seq object from 
      which we invoke the method.

  Returns  : TRUE if the given string is a genetic palindrome, FALSE otherwise.

  Specifics: The genetic palindrome is not the same as a palindromic string in traditional 
             programming.  It is any sequence having "edges" that are reverse complements of
             each other.

  See also : is_mirror(), find_mirror(), find_palindrome().	 
molwt()
  Purpose  : Computes the molecular weight of a particular sequence. 
  Syntax   : array molwt()
  Arguments: $this->sequence (implied) - the sequence whose molecular weight we wish to determine
  Returns  : An array with exactly two elements, of the form: ( lower_molwt, upper_molwt )

  Specifics:

    In general, the molecular weight of a chain is the arithmetic sum of the molecular weights
    of the individual "links" (e.g. an amino or nucleic acid) in that chain (plus/minus some 
    minor adjustments).  However, special metacharacters like X, R, and Y may be present in the
    sequence which leads to ambiguities in the molecular weight of the entire sequence.  These
    special symbols are evaluated and replaced by the expand_na() method. 

    Thus, the method has been designed to always return two values - a lower value and an upper
    value - even in cases where the two are equal.
	 
  See also : expand_na().	 
patfreq()
  Purpose  : Returns a one-dimensional associative array where each key is a substring matching the 
             given pattern, and  each value is the frequency count of the substring within the larger 
             string.
  
  Syntax   : array patfreq(string $pattern, char $options) 

  Arguments: 

    $pattern (mandatory) - The pattern (a Perl regular expression without enclosing "/") to search for 
      and tally. 

    $options (optional) - If set to "I", pattern-matching and tallying will be case-insensitive.  Passing 
      anything else would cause it to be case-sensitive. If not passed, $options is set to "I" (case-
      insensitive).

  Returns  : The function returns an array of the form:

    ( substring1 => frequency1, substring2 => frequency2, ... ) 

  Specifics:

    1) patfreq() uses the findpattern() function to search for substrings that match the pattern. 
    2) Because of #1 above, patfreq() does not recognize and count "overlapping patterns".

  Example  : 

    $seqobj->id = 1234;
    $seqobj->sequence = "agaataacatgacaataaca";
    $seqobj->patfreq("A.A", "I");
    // The above statement returns ( "AGA" => 1, "ATA" => 2, "ACA" => 3). 

  See also : findpattern(), patpos(), and patposo().
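
  The tally in the example can be reproduced by counting the non-overlapping matches
  that a regular-expression scan returns. The sketch is standalone and hypothetical
  (sketch_patfreq() is not a BioPHP function); upper-casing the keys is an assumption
  made to mirror the sample output.

```php
<?php
// Hypothetical illustration of a pattern frequency tally (not BioPHP's code).
function sketch_patfreq(string $sequence, string $pattern, string $options = "I"): array
{
    $flags = ($options === "I") ? "i" : "";
    // Collect non-overlapping matches, left to right.
    preg_match_all("/" . $pattern . "/" . $flags, $sequence, $matches);
    // array_count_values() keeps keys in order of first occurrence.
    return array_count_values(array_map("strtoupper", $matches[0]));
}

print_r(sketch_patfreq("agaataacatgacaataaca", "A.A"));
// AGA => 1, ATA => 2, ACA => 3
```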
patpos()
  Purpose  : Returns a two-dimensional associative array where each key is a substring matching a 
             given pattern, and each value is an array of positional indexes which indicate the 
             location of each occurrence of the substring (needle) in the larger string (haystack).

  Syntax   : array patpos(string $pattern, char $options) 

  Arguments:

    $pattern (mandatory) - the pattern (a Perl regular expression without enclosing "/") to locate.

    $options (optional) - If set to "I", pattern-matching will be case-insensitive.  Passing anything 
      else would cause it to be case-sensitive. If omitted, $options is set to "I" (case-insensitive).

  Returns  : This method returns an array of the form:

    ( substring1 => (position1, position2, ... ), substring2 => (position1, position2, ... ), ... )

    where substring is a substring within the sequence that matches the given $pattern, and 

          position is a zero-based index indicating the location of the substring within the larger 
          sequence. Thus, if substring is found at the very beginning of sequence, its position is 
          equal to zero (0).

  Specifics: Like findpattern() and patfreq(), patpos() does not recognize "overlapping patterns".

  Example  :

    $seqobj->sequence = "agaataacatgacaataaca";
    $seqobj->patpos("A.A", "I");
    // The above returns ( "AGA" => (0), "ATA" => (3, 14), "ACA" => (6, 11, 17) )
  
  See also : findpattern(), patfreq(), and patposo().
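
  A result of this shape can be built from the offsets that preg_match_all() reports
  when called with the PREG_OFFSET_CAPTURE flag. The sketch below is a hypothetical
  standalone function, sketch_patpos(), and is not BioPHP's code; as in the sample
  output, keys are upper-cased, which is an assumption about BioPHP's behaviour.

```php
<?php
// Hypothetical illustration of grouping match positions by substring
// (not BioPHP's code).
function sketch_patpos(string $sequence, string $pattern, string $options = "I"): array
{
    $flags = ($options === "I") ? "i" : "";
    // With PREG_OFFSET_CAPTURE each match is a [substring, offset] pair.
    preg_match_all("/" . $pattern . "/" . $flags, $sequence, $matches, PREG_OFFSET_CAPTURE);

    $positions = [];
    foreach ($matches[0] as [$substring, $offset]) {
        $positions[strtoupper($substring)][] = $offset; // zero-based index
    }
    return $positions;
}

print_r(sketch_patpos("agaataacatgacaataaca", "A.A"));
// AGA => (0), ATA => (3, 14), ACA => (6, 11, 17)
```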
patposo()
  Purpose  : Similar to patpos() except that this counts so-called "overlapping patterns".
  Syntax   : array patposo(string $pattern, char $options, int $cutpos)

  Arguments:

    $pattern (mandatory) - the pattern (a Perl regular expression without enclosing "/") to
	   locate.

    $options (optional) - If set to "I", pattern-matching will be case-insensitive.  Passing
      anything else would cause it to be case-sensitive.  If omitted, $options is set to "I"
      (case-insensitive).

    $cutpos (optional) - A non-negative integer specifying where search for the next pattern
      will resume, relative to the current matching substring.  If omitted, $cutpos is set to 
      the value of 1 by default. 

  Returns  : This method returns a one-dimensional array of the form:

    ( position1, position2, position3, ... )

    where position is a zero-based index indicating the location of the substring within the
          larger sequence.  Thus, if substring is found at the very beginning of sequence, its
          position is equal to zero (0).

    Unlike patpos(), the return value does not contain the actual substrings that match the
    given pattern.  However, code can be written to work around this problem.

  Specifics: Unlike findpattern(), patfreq(), and patpos(), this method recognizes so-called 
             "overlapping patterns".

  Example  :

    $seqobj->sequence = "agataca";
    $seqobj->patposo("A.A", "I", 1);
    // The above returns ( 0, 2, 4) 

    With the third argument set to 1, the search proceeds in this manner: 

    1) The result array is initialized to the empty array, ( ).

    2) "AGA" is checked if it matches the pattern "A.A"

    3) Since it does, the position of its first character (the "A" before the "G") relative to 
       the larger sequence is added to our "result array". Our result array is now: ( 0 ).

    4) Since the third argument is set to 1, the search for the next match resumes at the letter 
       "G" in "AGA", as illustrated below. 

       a gataca
       0 123456

       That is, the remaining substring "GATACA" will be searched for a substring matching the 
       pattern "A.A".

    5) The next match is "ATA", which is found at position 2. 2 is added to the result array 
       which is now ( 0, 2).

    6) The search resumes at 1 character to the right of the first "A" in "ATA", or at letter "T". 

       aga taca
       012 3456

    7) The remaining string "TACA" is searched for the pattern "A.A". 

    8) The next match is "ACA", which is found at position 4. 4 is added to the result array
       which is now ( 0, 2, 4). The search ends and this array is returned by the function.
		 
  See also : findpattern(), patfreq(), and patpos().		 
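
  The eight steps above can be sketched as a loop that restarts the search a fixed
  number of symbols after the start of each match. The function sketch_patposo() is
  hypothetical and is not BioPHP's code. A $cutpos of 0 would never advance, so this
  sketch clamps the step to at least 1, which is its own choice rather than documented
  behaviour.

```php
<?php
// Hypothetical illustration of an overlapping pattern search (not BioPHP's code).
function sketch_patposo(string $sequence, string $pattern, string $options = "I", int $cutpos = 1): array
{
    $flags = ($options === "I") ? "i" : "";
    $regex = "/" . $pattern . "/" . $flags;
    $step = max(1, $cutpos);   // guard against a search that never advances
    $n = strlen($sequence);

    $positions = [];
    $offset = 0;
    while ($offset <= $n
        && preg_match($regex, $sequence, $m, PREG_OFFSET_CAPTURE, $offset) === 1) {
        $positions[] = $m[0][1];        // zero-based start of this match
        $offset = $m[0][1] + $step;     // resume $step symbols after that start
    }
    return $positions;
}

print_r(sketch_patposo("agataca", "A.A", "I", 1)); // 0, 2, 4
```

  Because the search resumes inside the previous match, the ATA at position 2 is found
  even though it overlaps AGA.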
revcomp()
  Purpose  : Gets the reverse complement of a genetic sequence.
  Syntax   : string revcomp(string $seq, string $moltype)

  Arguments:

    $seq (mandatory) - the sequence to "reverse, and then complement".
    $moltype (mandatory) - the molecule type of the input sequence 

  Returns  : The reverse complement of the argument string $seq.

  See also : complement().  

seqlen()
  Purpose  : Gets the length of the sequence, i.e., the number of symbols in the sequence 
             property. 
  Syntax   : int seqlen()
  Arguments: None
  Returns  : The number of "symbols" or "monomers" in an amino or nucleic acid chain,
             expressed as a non-negative integer.
  Specifics: The "implied" argument is the sequence property of the Seq object which invokes
             the seqlen() method.

  See also : seqlength property of the Seq class.
subseq()
  Purpose  : Creates a new sequence object with a sequence that is a substring of another.

  Syntax   : Seq object subseq(int $start, int $count)

  Arguments:

    $start (mandatory) - the position in the original sequence from which we will begin extracting 
      the subsequence; the position is expressed as a zero-based index. 

    $count (mandatory) - the number of "letters" to include in the subsequence, starting from the 
      position specified by the $start parameter. 

  Returns  : A new Seq object with the sequence property set to the subsequence specified by the 
             $start and $count parameters. All other properties of this new Seq object are NULL. 

  See also : trunc(). 
symfreq()
  Purpose  : Returns the frequency of a given symbol in the sequence property string. Note that you
             can pass this a symbol which may not be part of the sequence's alphabet.
             In this case, the method will simply return zero (0).

  Syntax   : int symfreq(char $symbol)

  Arguments:

    $symbol (mandatory) - the symbol whose frequency in a sequence we wish to determine.
    $this->sequence (implied) - the sequence to search and tally.

  Returns  : The frequency (number of occurrences) of a particular symbol in a sequence string.
translate()
  Purpose  : Translates a particular DNA sequence into its protein product sequence, using
             the given substitution matrix.

  Syntax   : string translate(int $readframe, int $format)

  Arguments:

    $readframe (optional)

    The reading frame (0, 1, or 2) to be used in translating a nucleic sequence into a protein.
    A value of 0 means that the first codon starts at the first "letter" in the sequence,
    a value of 1 means that the first codon starts at the second "letter" in the sequence,
    and so on.  When omitted, this argument is set to reading frame 0 by default.

    $format (optional)

    This may be passed the value 1 or 3 and determines the format of the output string.  Passing 
    1 would cause translate() to output a string made up of single-letter amino acid symbols strung 
    together without any space in between. Passing 3 would output a string made up of three-letter 
    amino acid symbols separated by a space.  See Return Value section for examples.

    When omitted, $format is set to 3 by default.

  Returns  :

    When $format is passed a value of 1, the function returns a string of this format:

      GAVLISNFYW 
 
    where each of G, A, V, and the other letters represent a single amino acid residue.

    When $format is passed a value of 3, the function returns a string of this format:

      Phe Leu Ser Tyr Cys STP

    where each of Phe, Leu, and the other 3-letter "words" represent a single amino acid 
    residue.

  Specifics:

    Aside from the symbols for the 20 standard amino acid residues, there are two other special 
    symbols, the STOP codon and the UNKNOWN codon.

    In biology, the STOP codon terminates the translation process.  The symbol for STOP is 
    usually "*" (when $format is 1) and "STP" (when $format is 3). 

    The UNKNOWN codon symbol is substituted when a codon cannot be found in the translation
    table (or substitution matrix) and therefore cannot be translated.  The symbol for the
    UNKNOWN codon is "X" (when $format is 1) and "XXX" (when $format is 3). 
	 
  See also : translate_codon().	 
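
  Example  : A rough stand-alone sketch of how the reading frame and the output format
             interact. It does not use BioPHP. The function name translate_sketch() and the
             six-entry codon table are assumptions made for illustration; a real table
             covers all 64 codons. The sketch keeps translating after a STOP codon, which
             may differ from what translate() does.

```php
<?php
// Partial codon table (standard genetic code, DNA alphabet), for illustration only.
// Each entry maps a codon to [one-letter symbol, three-letter symbol].
const CODON_TABLE = [
    'TTT' => ['F', 'Phe'], 'TTA' => ['L', 'Leu'], 'TCT' => ['S', 'Ser'],
    'TAT' => ['Y', 'Tyr'], 'TGT' => ['C', 'Cys'], 'TAA' => ['*', 'STP'],
];

// translate_sketch() is a hypothetical stand-in, not the BioPHP method.
function translate_sketch(string $dna, int $readframe = 0, int $format = 3): string
{
    $out = [];
    // Walk the sequence three letters at a time, starting at the reading frame offset.
    for ($i = $readframe; $i + 3 <= strlen($dna); $i += 3) {
        $codon = strtoupper(substr($dna, $i, 3));
        // Codons missing from the table become the UNKNOWN symbol.
        $pair = CODON_TABLE[$codon] ?? ['X', 'XXX'];
        $out[] = ($format === 1) ? $pair[0] : $pair[1];
    }
    // One-letter symbols are run together; three-letter symbols are space-separated.
    return implode($format === 1 ? '' : ' ', $out);
}

echo translate_sketch("TTTTTATCTTATTGTTAA"), "\n";        // Phe Leu Ser Tyr Cys STP
echo translate_sketch("TTTTTATCTTATTGTTAA", 0, 1), "\n";  // FLSYC*
echo translate_sketch("ATTTNNN", 1, 1), "\n";             // FX (frame 1 skips the leading A)
```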
translate_codon() code prev next top
  Purpose  : Translates a single codon into an amino acid. 
  Syntax   : string translate_codon(string $codon, int $format) 

  Arguments:

    $codon (mandatory) - A three-letter nucleic acid sequence (each letter can be A, U, G, 
      or C) which translates into a single amino acid residue.

    $format (optional) - This may be passed the value 1 or 3 and determines the format of 
      the output string. When omitted, $format is set to 3 by default.

  Returns  :

    When $format is passed a value of 1, the function returns a single letter.  When $format 
    is passed a value of 3, the function returns a string of three letters. The return value
    represents a single amino acid residue.

  Specifics:

    This is a helper function called by the translate() method.  However, it can be accessed 
    by the programmer to translate a single codon into its amino acid equivalent.

    The translation process uses a translation table (also called a substitution matrix), which
    can be defined by using the SubMatrix Class. If not specified, the method uses a default
    matrix for human beings. 
	 
  See also : translate().
trunc() code prev next top
  Purpose  : Extracts a subset or substring from a particular sequence.
  Syntax   : string trunc(int $start, int $count)

  Arguments:

    $start (mandatory) - the position in the original sequence from which we will begin
      extracting the subsequence; the position is expressed as a zero-based index. 

    $count (mandatory) - the number of "letters" to include in the subsequence, starting
      from the position specified by the $start parameter. 

    The sequence property of the Seq object invoking the method is the "implied" third
    argument.  It is the string from which the substring will be extracted.

  Returns  : A substring of the original string as specified by $start and $count.

  Specifics: Unlike the subseq() method, this method returns an ordinary string and not 
             a Seq object.
  
  See also : subseq().  

class SeqAlign

Description       top
  The class is used to manipulate sequence alignment data (usually stored in a file)
  produced by third-party sequence alignment software. For now, only the FASTA format 
  is supported.
Properties       top
  length    The length of the longest sequence in the alignment set.
			
  seq_count The number of sequences in the alignment set.

  gap_count The total number of gaps ("-") in all sequences in the 
            alignment set.

  seqset    An array containing all the sequences in the alignment set.

  seqptr    Short for sequence pointer; this is an index number indicating which 
            sequence (in the alignment set) we are currently processing.

  is_flush  A boolean or logical value: TRUE if all the sequences in the alignment
            have the same length, FALSE otherwise.
Methods       top

SeqAlign() code prev next top
  Purpose  : Constructor method for the SeqAlign class. This "fetches" all sequences into the 
             seqset property of the SeqAlign object. 
  Syntax   : new SeqAlign(string $filename) 
  Arguments: $filename (mandatory) - the name of the FASTA alignment file.
  Returns  : A SeqAlign object.
  Specifics: To access individual sequences in the SeqAlign object, use the seqset property.
add_seq() code prev next top
  Purpose  : Adds a sequence to the alignment set.
  Syntax   : int add_seq(Seq object $seq)
  Arguments: $seq (mandatory) - a Seq object to be added to the alignment set. 
  Returns  : The number of sequences in the alignment set after the call to add_seq().

  Specifics: It is possible to build a new alignment set solely by using the add_seq() method.

  Example  :

    // Create a blank alignment set.
    $aln = new SeqAlign();

    // Add the first sequence to our blank alignment set.
    $seqobj = new Seq();
    $seqobj->id = "1234";
    $seqobj->sequence = "ATGC-TGA--CTGA";
    $aln->add_seq($seqobj);

    // Add the second sequence to the alignment set.
    $seqobj = new Seq();
    $seqobj->id = "5678";
    $seqobj->sequence = "TTGT-TAA--CCGT";
    $aln->add_seq($seqobj);

    Take note that this method simply adds the sequence to the alignment set.  It does not 
    perform any sequence alignment. 

  See also : del_seq(). 
char_at_res() code prev next top
  Purpose  : Gets the character at a given residue number in the specified sequence.
  Syntax   : char char_at_res(int $seqidx, int $res) 

  Arguments:

    $seqidx (mandatory) - the index number of the desired sequence in the alignment set. 
    $res (mandatory) - the residue number of the character we wish to get or extract.

  Returns  : A single character representing an amino acid residue or a "gap".
col2res() code prev next top
  Purpose  : Converts a column number to a residue number in the sequence specified by
             its index number.
  Syntax   : int col2res(int $seqidx, int $col)

  Arguments: 

    $seqidx (mandatory) - index number of the desired sequence within the alignment set.
    $col (mandatory) - the column number which we want to convert to a residue number.

  Returns  : An integer representing the residue number corresponding to the given column
             number. 

  Specifics: This is the opposite of res2col().
consensus() code prev next top
  Purpose  : Constructs a consensus string for all the sequences in the alignment set. 
  Syntax   : string consensus(int $threshold)

  Arguments: 

    $threshold (optional) - a number between 0 and 100, indicating the percentage threshold below 
      which the unknown character "?" is used in a particular position or column of the 
      consensus string. If omitted, this is set to 100 by default. 

  Returns  : The consensus string formed according to the given threshold.
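
  Example  : A stand-alone sketch of one plausible reading of the threshold rule. It does
             not use BioPHP. The assumptions are that the sequences are plain strings of
             equal length and that a column keeps its most frequent symbol only when that
             symbol reaches the threshold percentage. consensus_sketch() is a hypothetical
             name.

```php
<?php
// consensus_sketch() is a hypothetical helper, not the BioPHP method.
// Assumption: all sequences are strings of equal length.
function consensus_sketch(array $seqs, int $threshold = 100): string
{
    $total = count($seqs);
    if ($total === 0) {
        return '';
    }
    $consensus = '';
    $length = strlen($seqs[0]);
    for ($col = 0; $col < $length; $col++) {
        // Collect the symbols found in this column and tally them.
        $column = [];
        foreach ($seqs as $seq) {
            $column[] = $seq[$col];
        }
        $tally = array_count_values($column);
        $best = max($tally);
        $symbol = (string) array_search($best, $tally);
        // Keep the symbol only if it reaches the percentage threshold.
        $percent = 100 * $best / $total;
        $consensus .= ($percent >= $threshold) ? $symbol : '?';
    }
    return $consensus;
}

$seqs = ["ATGC", "ATGA", "ATTA", "ACGA"];
echo consensus_sketch($seqs), "\n";      // A???
echo consensus_sketch($seqs, 75), "\n";  // ATGA
```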
del_seq() code prev next top
  Purpose  : Removes a sequence specified by its id from the alignment set.
  Syntax   : int del_seq(string $seqid)
  Arguments: $seqid (mandatory) - the id of the sequence to be deleted from the alignment set.
  Returns  : The number of sequences in the alignment set after the call to del_seq().

  See also : add_seq().
fetch() code prev next top
  Purpose  : Retrieves a specified sequence from the alignment set and returns it as a
             Seq object.
  Syntax   : Seq object fetch($id)

  Arguments: 

    $id (optional) - the id of the sequence we wish to retrieve from the alignment set.  
      If omitted, the current sequence (pointed to by the sequence pointer or seqptr
      property), is fetched instead.

  Returns  : Seq object
  Specifics: Once fetched, the sequence can use all Seq class properties and methods.
first() code prev next top
  Purpose  : Moves the sequence pointer to the first sequence in an alignment set. 
  Syntax   : void first()
  Arguments: None
  Returns  : None
  See also : last(), next(), prev(). 
get_gap_count() code prev next top
  Purpose  : Computes and returns the gap_count property of an alignment set or object.
  Syntax   : int get_gap_count()
  Arguments: None
  Returns  : The number of "gap characters" in all the sequences in the alignment set.
  Specifics: See also the gap_count property of the SeqAlign class.
get_is_flush() code prev next top
  Purpose  : Computes and returns the is_flush property of the SeqAlign object.
  Syntax   : boolean get_is_flush()
  Arguments: None
  Returns  : TRUE if all the sequences in the alignment set have the same length, 
             FALSE otherwise.
  See also : the is_flush property of the SeqAlign class.
get_length() code prev next top
  Purpose  : Returns the length of a SeqAlign object.
  Syntax   : int get_length()
  Arguments: None
  Returns  : An integer representing the length of the longest sequence in the alignment set.
  Specifics: A sure-fire way to get the length of a SeqAlign object. However, it is slower 
             than simply accessing the length property.
  See also : The length property of the SeqAlign object.
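
  Example  : A stand-alone sketch of how the length, seq_count, gap_count, and is_flush
             values could be derived from a list of aligned strings. It does not use
             BioPHP; alignment_stats() is a hypothetical helper, and treating an empty set
             as flush is an assumption.

```php
<?php
// alignment_stats() is a hypothetical helper working on plain strings.
function alignment_stats(array $seqs): array
{
    $lengths = array_map('strlen', $seqs);
    $gaps = 0;
    foreach ($seqs as $seq) {
        // A gap is represented by the "-" character.
        $gaps += substr_count($seq, '-');
    }
    return [
        'length'    => $lengths ? max($lengths) : 0,  // longest sequence
        'seq_count' => count($seqs),
        'gap_count' => $gaps,
        // Flush means every sequence has the same length.
        'is_flush'  => count(array_unique($lengths)) <= 1,
    ];
}

$stats = alignment_stats(["ATGC-TGA--CTGA", "TTGT-TAA--CCGT"]);
echo $stats['length'], " ", $stats['gap_count'], "\n";  // 14 6
```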
last() code prev next top
  Purpose  : Moves the sequence pointer to the last sequence in an alignment set.
  Syntax   : void last()
  Arguments: None
  Returns  : None
  See also : first(), next(), prev().  
next() code prev next top
  Purpose  : Moves the sequence pointer to the next sequence in the alignment set.
  Syntax   : void next()
  Arguments: None
  Returns  : None
  See also : first(), last(), prev().   
prev() code prev next top
  Purpose  : Moves the sequence pointer to the previous sequence in the alignment set.
  Syntax   : void prev()
  Arguments: None
  Returns  : None
  See also : first(), last(), next()
res2col() code prev next top
  Purpose  : Converts a residue number to a column number in the sequence specified by its 
             index number.
  Syntax   : int res2col(int $seqidx, int $res)

  Arguments:

    $seqidx (mandatory) - the index number of the desired sequence in the alignment set. 
    $res (mandatory) - the residue number we wish to convert into a column number.

  Returns  : An integer representing the column number corresponding to the given residue number.
  Specifics: This is the opposite of col2res().  
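
  Example  : A stand-alone sketch of the residue/column relationship for a single aligned
             string. It does not use BioPHP. The assumptions are zero-based numbering, gaps
             ("-") that do not count as residues, and a NULL result when no answer exists;
             the real methods may number positions differently. The function names are
             hypothetical.

```php
<?php
// Hypothetical helpers; zero-based columns and residues are assumed.

// Column index of the residue with the given number, or null if there is none.
function res2col_sketch(string $aligned, int $res): ?int
{
    $seen = -1;
    for ($col = 0; $col < strlen($aligned); $col++) {
        if ($aligned[$col] !== '-') {
            $seen++;                 // another residue (non-gap) passed
            if ($seen === $res) {
                return $col;
            }
        }
    }
    return null;
}

// Residue number found at the given column, or null for a gap or bad column.
function col2res_sketch(string $aligned, int $col): ?int
{
    if ($col < 0 || $col >= strlen($aligned) || $aligned[$col] === '-') {
        return null;
    }
    // Residue number = column minus the gaps that precede it.
    return $col - substr_count(substr($aligned, 0, $col), '-');
}

echo res2col_sketch("ATGC-TGA--CTGA", 7), "\n";   // 10
echo col2res_sketch("ATGC-TGA--CTGA", 10), "\n";  // 7
```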
res_var() code prev next top
  Purpose  : Determines the index position of both variant and invariant residues according 
             to a given "percentage threshold" similar to that in the consensus() method.
  Syntax   : assoc array res_var(int $threshold)

  Arguments: 


    $threshold (optional) - a number between 0 and 100, indicating the percentage threshold below 
      which the current index position is considered variant, and at or above which the current 
      index position is considered invariant. If omitted, this is set to 100 by default. 

  Returns  : An associative array of the form:

    ( "INVARIANT" => ( 0, 2, 4, 6 ), "VARIANT" => ( 1, 3, 5 ) )

  Specifics: See also consensus() method. 	 
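
  Example  : A stand-alone sketch of the variant/invariant classification. It does not use
             BioPHP. It assumes equal-length plain strings and measures each column by the
             share of its most frequent symbol; res_var_sketch() is a hypothetical name.

```php
<?php
// res_var_sketch() is a hypothetical helper, not the BioPHP method.
function res_var_sketch(array $seqs, int $threshold = 100): array
{
    $result = ['INVARIANT' => [], 'VARIANT' => []];
    $total = count($seqs);
    if ($total === 0) {
        return $result;
    }
    for ($col = 0; $col < strlen($seqs[0]); $col++) {
        // Gather the symbols in this column.
        $column = [];
        foreach ($seqs as $seq) {
            $column[] = $seq[$col];
        }
        // Share (in percent) of the most frequent symbol.
        $percent = 100 * max(array_count_values($column)) / $total;
        $key = ($percent >= $threshold) ? 'INVARIANT' : 'VARIANT';
        $result[$key][] = $col;
    }
    return $result;
}

$r = res_var_sketch(["ATGC", "ATGA", "TTGA"]);
echo implode(",", $r['INVARIANT']), " | ", implode(",", $r['VARIANT']), "\n";  // 1,2 | 0,3
```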
select() code prev next top
  Purpose  : Creates a new alignment set from non-consecutive sequences found in another existing 
             alignment set. 
  Syntax   : SeqAlign object select(int $index1, int $index2, ... )

  Arguments: 

    $index1, $index2, ... (at least one index argument is required) - the index number(s) of the 
      sequence(s) in the current alignment set to include in the new SeqAlign object.

  Returns  : A SeqAlign object made up of specific sequences indicated by $index1, $index2, etc.
  Specifics: The number of arguments is variable and is theoretically unlimited. 
  See also : subalign().
sort_alpha() code prev next top
  Purpose  : Sorts the arrangement of sequences in an alignment set by name/id and then by start 
             number, in either ascending ("ASC") or descending ("DESC") order.
  Syntax   : array sort_alpha($option)

  Arguments: 
  
    $option (optional) - accepts either "ASC" or "DESC" in whatever case (uppercase, lowercase, 
      mixed case). This determines the sort order of the alignment set. 

  Returns  : An array of Seq objects belonging to an alignment set, sorted in either ascending or 
             descending order. Format is similar to the seqset property of the SeqAlign class.

  Specifics: This permanently changes the seqset property of the SeqAlign object.
subalign() code prev next top
  Purpose  : Creates a new alignment set from a series of contiguous/consecutive sequences.
  Syntax   : SeqAlign object subalign(int $beg, int $end)

  Arguments:

    $beg (mandatory) - the index number of the first sequence to include in the new SeqAlign object. 
    $end (mandatory) - the index number of the last sequence to include in the new SeqAlign object.

  Returns  : A SeqAlign object consisting of consecutive sequences from another SeqAlign object.

  See also : select().
substr_bw_res() code prev next top
  Purpose  : Gets a substring in between residue numbers from the specified sequence.
  Syntax   : string substr_bw_res(int $seqidx, int $res_beg, int $res_end)

  Arguments: 

    $seqidx (mandatory) - the index number of the desired sequence in the alignment set. 

    $res_beg (mandatory) - the residue number of the character at the beginning of the 
      substring we wish to extract.

    $res_end (optional) - the residue number of the character at the end of the substring
      we wish to extract.

  Returns  : A substring within the specified sequence.
  
  See also : char_at_res().  

class SeqDB

Description       top
  The SeqDB class represents a set of sequence data stored in one or more physical 
  electronic files.  Thus, we can create a SeqDB object to represent the entire 
  GenBank database or a subset thereof.
Properties       top
  dbname    The name of a SeqDB database object, which you assign when creating 
            it. This is also the name of the primary index and directory files of the 
            database. 

  data_fn   The complete name (with extension) of the primary index file of the 
            database.
				  
  dir_fn    The complete name (with extension) of the directory file of the database.

  seqptr    A variable which points to the current sequence record. When processing 
            a SeqDB object, BioPHP maintains a "record pointer" which can be moved 
            using the first(), last(), prev(), and next() methods.  This is an integer
            value which is 0 when at the first record in the database, 1 when at the
            second record, and so on. 

  seqcount  This is the number of sequence records in the SeqDB database object.
Methods       top

SeqDB() code prev next top
  Purpose  : The constructor method for the SeqDB database class. This performs 
             various preliminary tasks such as creating the primary index and directory 
             files for the database.
at_entrystart() code prev next top
  Purpose  : Tests if the current line signals the start of a new sequence record
             in a data file. In a GenBank file, the first line of a new entry begins 
             with the keyword "LOCUS". In a SwissProt file, the keyword is "ID".

  Syntax   : boolean at_entrystart(string $linestr, string $dbformat)

  Arguments:

    $linestr (mandatory) - a line from the datafile which we are currently processing.

    $dbformat (mandatory) - identifies the database format to use in parsing the 
      data file. As of version 1.0, the only valid values are "GENBANK" 
      and "SWISSPROT".

  Returns  : TRUE if the current line signals the start of a new sequence 
             record, FALSE otherwise.

  Specifics: The format of the data file is specified in the open() method.
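
  Example  : A stand-alone sketch of the keyword test described above. It does not use
             BioPHP. at_entrystart_sketch() is a hypothetical name, the sample lines are
             made-up placeholders, and requiring whitespace after the keyword is an
             assumption.

```php
<?php
// at_entrystart_sketch() is a hypothetical helper, not the BioPHP method.
function at_entrystart_sketch(string $linestr, string $dbformat): bool
{
    // Keyword that opens a new record in each supported format.
    $keywords = ['GENBANK' => 'LOCUS', 'SWISSPROT' => 'ID'];
    $format = strtoupper($dbformat);
    if (!isset($keywords[$format])) {
        return false;  // unknown database format
    }
    // The keyword must open the line and be followed by whitespace (assumption).
    return preg_match('/^' . $keywords[$format] . '\s/', $linestr) === 1;
}

var_dump(at_entrystart_sketch("LOCUS       EXAMPLE01", "GENBANK"));    // bool(true)
var_dump(at_entrystart_sketch("ID   EXAMPLE_ENTRY", "SWISSPROT"));     // bool(true)
var_dump(at_entrystart_sketch("DEFINITION  placeholder", "GENBANK"));  // bool(false)
```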
close() code prev next top
  Purpose  : This "closes" the SeqDB database, after we're through working with it.  This is
             the opposite of the open() method.
  Syntax   : void close()
  Arguments: None
  Returns  : None

  See also : new keyword, open().
fetch() code prev next top
  Purpose  : Retrieves all data from the specified sequence record and returns them in the 
             form of a Seq object. This method invokes one of several parser methods.  For
             version 1.0, we normally do not access the parser method directly.
  Syntax   : Seq object fetch(string $seqid)

  Arguments: 

    $seqid (optional) - The unique identifier of the sequence. If omitted, this is set to 
      the id of the "current sequence record" (as pointed to by the sequence pointer). 

  Returns  : A Seq object 
first() code prev next top
  Purpose  : This moves the record pointer to the first record in the SeqDB database. 
  Syntax   : void first()
  Arguments: None 
  Returns  : None

  See also : next(), prev(), last().  
get_entryid() code prev next top
  Purpose  : Returns the ID of a GenBank or Swissprot sequence record.
  Syntax   : string get_entryid(array &$flines, string $linestr, string $dbformat)
  Arguments: 

    &$flines (mandatory) - an array of lines from the sequence record. The method uses this only 
      when dealing with Swissprot sequences.
    $linestr (mandatory) - the line containing the ID of the sequence record.
    $dbformat (mandatory) - the specific database format of the sequence record.

  Returns  : The ID of the sequence record.
isa_qualifier() code prev next top
  Purpose  : Determines if a string found in the features section of a GenBank entry is 
             that of a "qualifier" (see gbrel.txt for its definition).
  Syntax   : boolean isa_qualifier(string $str)
  Arguments: $str (mandatory) - the string to test if it is a GenBank feature qualifier or not.
  Returns  : TRUE if the string $str is a qualifier, FALSE otherwise.
last() code prev next top
  Purpose  : Moves the record pointer to the last record in the SeqDB database.
  Syntax   : void last()
  Arguments: None 
  Returns  : None
  
  See also : first(), next(), prev().  
next() code prev next top
  Purpose  : This moves the record pointer to the next record in the SeqDB database.
  Syntax   : void next()
  Arguments: None 
  Returns  : None
  See also : first(), last(), prev().
open() code prev next top
  Purpose  : Opens the SeqDB object for processing.
  Syntax   : void open($dbname)
  Arguments: $dbname (mandatory) - the name of the database (index file) you wish to open.
  Returns  : None
  Specifics: 

    1) This method initializes the dbname, data_fn, dir_fn, and seqptr properties of the 
       SeqDB object. 
    2) Right now, this method issues a call to die() when the database cannot be opened.

  See also : close().
parse_id() code prev next top
  Purpose  : Parses one sequence entry or record in a GenBank file.
  Syntax   : Seq object parse_id($flines)
  Arguments: $flines (mandatory) - an array of strings representing the lines of one 
             sequence entry in a GenBank database or file.
  Returns  : A Seq object corresponding to the sequence entry.
  
  See also : parse_swissprot().  
parse_swissprot() code prev next top
  Purpose  : Parses one sequence entry or record in a Swissprot sequence file.
  Syntax   : Seq object parse_swissprot($flines)
  Arguments: $flines (mandatory) - an array of strings representing the lines of one 
             sequence entry in a Swissprot database or file.
  Returns  : A Seq object corresponding to the sequence entry.

  See also : parse_id().  
prev() code prev next top
  Purpose  : This moves the record pointer to the previous record in the SeqDB database.
  Syntax   : void prev()
  Arguments: None 
  Returns  : None
  Specifics: This method also updates the value of the seqptr property. 

  See also : first(), last(), next(). 

class SeqMatch

Description       top
  This class represents the results of performing sequence analysis and matching on two or more 
  sequences.  While the basic matching algorithm is present, this class is largely still under 
  development. 
  
  Before anything else, a short foray in terminology is in order.
  
  The terms "symbol", "character", or "letter" are synonymous, and represent a single element (e.g.
  a nucleotide or amino residue) in a larger chain of such elements.  Note that the use of the term
  "letter" is still valid even if it's actually a dash ("-") or a symbol not in the Roman alphabet
  of A to Z.
  
  The words "string" and "sequence" will be used interchangeably to mean the same thing -- a series
  of symbols representing a biological entity such as a nucleotide or amino acid chain.  Sometimes,
  the term "input string" is used to indicate that the string will serve as an input to a function
  that may itself produce or return another string, which can be called an "output string" or a
  "result string".  This latter usually (but not always) has the same length as the input strings. 

  The term "corresponding characters" (or symbols or letters) refers to characters in two strings
  that are found in the same location or have equal position indexes.
  
  When two corresponding characters are identical, they constitute an "exact match".  When two
  corresponding characters are not identical but belong to the same group (as defined by a
  submatrix), they constitute a "partial match".  When two corresponding characters are not
  identical and do not belong to the same group, they are a "mismatch" or a "non-match".    
Properties       top
  result    The result string produced by the match() method.  Example: "GAV++ Y+ R" 
				  
  hamdist   The Hamming Distance between two strings.  This is simply the number of corresponding
            characters that are not exact matches.
				  
  levdist   The Levenshtein Distance between two strings.  This is the minimum number of insertion, 
            deletion, or replacement operations that need to be performed on either input string to
            make them identical.  Note that insertion and deletion are symmetrical, i.e., inserting
            a letter in one string has the same effect (of making two strings more identical) as 
            deleting a letter in the other string.
Methods       top

compare_letter() code prev next top
  Purpose  : Determines whether two symbols are exact, partial, or negative matches.
  Syntax   : char compare_letter(char $let1, char $let2, array $matrix, char $equal, 
             char $partial = "+", char $nomatch = ".")

  Arguments: 

    $let1 (mandatory) - the first amino acid residue symbol to match.

    $let2 (mandatory) - the second amino acid residue symbol to match.

    $matrix (optional) - the substitution matrix to use in matching.  If omitted, the 
      default $chemgrp_matrix table is used.

    $equal (optional) - the character symbol to return if $let1 and $let2 are exact matches. 
      If omitted, the symbol ($let1 or $let2) itself is returned.

    $partial (optional) - the character symbol to return if $let1 and $let2 are partial 
      matches. If omitted, the "+" symbol is returned.

    $nomatch (optional) - the character symbol to return if $let1 and $let2 are totally 
      mismatched. If omitted, a whitespace (" ") is returned.

  Returns  : A character symbol which indicates if the two residues are exact, partial, or 
             negative matches.

  See also : partial_match(), match(), and SubMatrix class properties and methods.
hamdist() code prev next top
  Purpose  : Computes the Hamming Distance between two strings or sequences.
  Syntax   : int hamdist(string/Seq object $seq1, string/Seq object $seq2)

  Arguments: 

    $seq1 (mandatory) - the first string or sequence 
    $seq2 (mandatory) - the second string or sequence

  Returns  : The Hamming Distance between the two strings or sequences, defined to be the number of 
             "mismatching" characters found in corresponding positions in the two strings.

  See also : levdist(), xlevdist().
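
  Example  : A stand-alone sketch of the Hamming Distance computation on plain strings. It
             does not use BioPHP. hamming_sketch() is a hypothetical name, and rejecting
             strings of unequal length is an assumption; the source does not say how
             hamdist() treats them.

```php
<?php
// hamming_sketch() is a hypothetical helper, not the BioPHP method.
function hamming_sketch(string $a, string $b): int
{
    // Assumption: the distance is only defined for strings of equal length.
    if (strlen($a) !== strlen($b)) {
        throw new InvalidArgumentException("Strings must have equal length.");
    }
    $distance = 0;
    for ($i = 0; $i < strlen($a); $i++) {
        // Count positions whose corresponding characters differ.
        if ($a[$i] !== $b[$i]) {
            $distance++;
        }
    }
    return $distance;
}

echo hamming_sketch("GAVLIFYWKR", "GAVILGYFVR"), "\n";  // 5 (positions 3, 4, 5, 7, 8)
```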
levdist() code prev next top
  Purpose  : Computes the Levenshtein Distance between two strings or sequences.
  Syntax   :
  
  int levdist(string/Seq object $seq1, string/Seq object $seq2, int $cost_ins, int $cost_rep, 
              int $cost_del) 

  Arguments:

    $seq1 (mandatory) - the first string or sequence, with a length not exceeding 255 symbols.
    $seq2 (mandatory) - the second string or sequence, with a length not exceeding 255 symbols.
    $cost_ins (optional) - the cost or weight of an insertion operation. Set to 1 if omitted.
    $cost_rep (optional) - the cost or weight of a replacement operation. Set to 1 if omitted.
    $cost_del (optional) - the cost or weight of a deletion operation. Set to 1 if omitted.

  Returns  : The Levenshtein Distance between two strings, defined to be the number of insertion, 
             deletion, or replacement operations that must be performed on the strings before they
             can become identical.

  Specifics: This uses PHP's built-in levenshtein() function, which has a 255-character limit.  
             For longer strings, use the xlevdist() method.  However, levdist() 
             allows you to alter the cost of insertions, deletions, and replacements, which
             xlevdist() does not. 
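
  Example  : A short sketch of the cost arguments, calling PHP's built-in levenshtein()
             directly rather than the levdist() method. The input strings are made up for
             illustration.

```php
<?php
// Plain calls to PHP's built-in levenshtein(). After the two strings, the
// arguments are: insertion cost, replacement cost, deletion cost.
$unit      = levenshtein("ATGC", "ATGGC");           // one insertion at unit cost
$costlyIns = levenshtein("ATGC", "ATGGC", 2, 1, 1);  // the same insertion now costs 2
$costlyRep = levenshtein("ATGC", "ATCC", 1, 3, 1);   // delete + insert (2) beats replace (3)

echo "$unit $costlyIns $costlyRep\n";  // 1 2 2
```

             Raising the replacement cost above the combined cost of one deletion and one
             insertion makes the function prefer the latter pair of operations.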
match() code prev next top
  Purpose  : This is typically used to compare protein sequences but may be used on nucleic
             acid sequences.  It compares two sequences and returns a "result string" of 
             special symbols. 

  Syntax   :
  
  string match(string $str1, string $str2, array $matrix, char $equal, char $partial,
               char $nomatch)

  Arguments:

    $str1 (mandatory) - the first of two sequences being compared.

    $str2 (mandatory) - the second of two sequences being compared. 

    $matrix (optional) - an array specifying valid symbol substitution and equivalence rules. 
      Its format is similar to the rules property of a SubMatrix object.

    $equal (optional) - the symbol to output if the symbol in the first sequence ($str1) is 
      exactly the same as the corresponding symbol in the second sequence ($str2).

    $partial (optional) - the symbol to output if the symbol in the first sequence ($str1) is 
      equivalent but not identical to the corresponding symbol in the second sequence ($str2). 

    $nomatch (optional) - the symbol to output if the symbol in the first sequence ($str1) is 
      neither identical nor equivalent to the corresponding symbol in the second sequence ($str2).

    If any of the last three arguments are omitted, the following symbols are used by default: 

      1) whitespace - no match
      2) + sign - a partial match (the two amino acids belong to the same chemical group)
      3) the original symbol itself - in case of an exact match.

  Returns  : A string which indicates where exact, partial and no matches occur between the first
             and second sequences being compared.

  Example  : 
  
    <?php
    $seqm_o = new SeqMatch();
    $result = $seqm_o->match("GAVLIFYWKR", "GAVILGYFVR");
    ?>

  In the above code, $result is "GAV++ Y+ R".  How this came to be is explained below. 
  
    Position index       0123456789
    1st input string     GAVLIFYWKR
    2nd input string     GAVILGYFVR 
    Result string        GAV++ Y+ R 
	 
  In positions 0, 1, 2, 6, and 9, the characters in the 1st and 2nd input strings are identical,
  so we simply "copy" them onto the result string.  In positions 3, 4, and 7, the characters are
  not identical BUT they belong to the same chemical group (in the default submatrix), so "+"
  is copied onto the result string.  In positions 5 and 8, the characters are not identical
  and DO NOT belong to the same chemical group, so a blank space " " is copied onto the
  result string.  Other bioinformatics software uses the vertical bar "|" to denote an exact
  match, and ":" to indicate a partial match. 

  See also : SubMatrix Class, its properties and methods.
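
  Example  : A stand-alone sketch of the position-by-position comparison walked through
             above. It does not use BioPHP. The chemical groups below are an assumption
             chosen so that the sketch reproduces the documented result; they are not the
             contents of the default submatrix. The names match_sketch() and same_group()
             are hypothetical.

```php
<?php
// Assumed chemical groups, for illustration only. They are NOT taken from
// BioPHP's default submatrix.
const GROUPS = [
    ['G', 'A', 'V', 'L', 'I'],  // aliphatic
    ['F', 'Y', 'W'],            // aromatic
    ['K', 'R', 'H'],            // basic
    ['D', 'E'],                 // acidic
];

// True if both symbols appear together in at least one rule of the matrix.
function same_group(string $a, string $b, array $matrix): bool
{
    foreach ($matrix as $rule) {
        if (in_array($a, $rule, true) && in_array($b, $rule, true)) {
            return true;
        }
    }
    return false;
}

// Hypothetical stand-in for the comparison: exact matches copy the symbol,
// partial matches emit $partial, everything else emits $nomatch.
function match_sketch(string $s1, string $s2, array $matrix = GROUPS,
                      string $partial = '+', string $nomatch = ' '): string
{
    $result = '';
    $n = min(strlen($s1), strlen($s2));
    for ($i = 0; $i < $n; $i++) {
        if ($s1[$i] === $s2[$i]) {
            $result .= $s1[$i];
        } elseif (same_group($s1[$i], $s2[$i], $matrix)) {
            $result .= $partial;
        } else {
            $result .= $nomatch;
        }
    }
    return $result;
}

echo match_sketch("GAVLIFYWKR", "GAVILGYFVR"), "\n";  // GAV++ Y+ R
```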
partial_match() code prev next top
  Purpose  : Determines if two symbols are partial matches or not. For example, two 
             amino acid residues are partial matches if they belong to the same 
             chemical group according to a substitution matrix.
  Syntax   : boolean partial_match(char $let1, char $let2, array $matrix)

  Arguments: 

    $let1 (mandatory) - The first amino acid residue.

    $let2 (mandatory) - The second amino acid residue.

    $matrix (optional) - The substitution matrix to use for determining partial matches. 
      If omitted, the default $chemgrp_matrix table is used.

  Returns  : TRUE if the two symbols belong to the same chemical group, FALSE otherwise. 
  
  See also : SubMatrix class properties and methods.
xlevdist() code prev next top
  Purpose  : Computes the Levenshtein Distance between two strings.
  Syntax   : int xlevdist(string/Seq object $seq1, string/Seq object $seq2)

  Arguments:

    $seq1 (mandatory) - the first string or sequence, with a length not exceeding 1024 symbols.
    $seq2 (mandatory) - the second string or sequence, with a length not exceeding 1024 symbols.

  Returns  : The Levenshtein Distance between two strings, as defined in levdist().

  See also : levdist().
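
  Example  : The source does not say how xlevdist() is implemented. As a point of
             reference, here is a stand-alone sketch of the textbook dynamic-programming
             algorithm with unit costs, which handles strings of any length. It does not
             use BioPHP, and levenshtein_dp() is a hypothetical name.

```php
<?php
// Textbook two-row dynamic-programming Levenshtein distance with unit costs.
// levenshtein_dp() is a hypothetical helper, not the BioPHP method.
function levenshtein_dp(string $a, string $b): int
{
    $m = strlen($a);
    $n = strlen($b);
    // $prev[$j] = distance between an empty prefix of $a and the first $j letters of $b.
    $prev = range(0, $n);
    for ($i = 1; $i <= $m; $i++) {
        $curr = [$i];  // deleting the first $i letters of $a
        for ($j = 1; $j <= $n; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // match or replacement
            );
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// Two 300-letter strings that differ only in their last letter.
echo levenshtein_dp(str_repeat("A", 300), str_repeat("A", 299) . "T"), "\n";  // 1
```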

class SubMatrix

Description       top
  This class is a tool used to create customized look up tables or substitution matrices employed 
  by various methods such as the translate() method of the Seq object. This is necessary to enable 
  translation of proteins using a different (non-human) set of "genetic code" as observed in other 
  organisms.
  
  A substitution matrix is defined to be a two-dimensional array of symbols.  The outer array
  is itself the collection of substitution rules. Each element of this array (inner array) is
  made up of "equivalent or substitutable symbols", i.e., symbols which, if found in
  corresponding positions in two sequence strings, would constitute a partial match.

  Example: ( ('D','E'), ('K', 'R', 'H'), ('X') )
  
  In the example above, our submatrix has three elements.  The last element contains only one 
  symbol, X.  For our purposes, we shall refer to the outer array as the substitution matrix 
  and to each of its elements as a substitution rule.  
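
  The same structure can be sketched with plain PHP arrays. The snippet below does not use
  the SubMatrix class; is_partial_match() is a hypothetical helper that shows how a matrix of
  rules might be consulted.

```php
<?php
// The example matrix from the text, built one rule at a time.
$rules = [];
$rules[] = ['D', 'E'];
$rules[] = ['K', 'R', 'H'];
$rules[] = ['X'];

// Hypothetical helper: two different symbols are a partial match when some
// rule contains both of them.
function is_partial_match(string $a, string $b, array $rules): bool
{
    if ($a === $b) {
        return false;  // identical symbols are an exact match, not a partial one
    }
    foreach ($rules as $rule) {
        if (in_array($a, $rule, true) && in_array($b, $rule, true)) {
            return true;
        }
    }
    return false;
}

var_dump(is_partial_match('K', 'H', $rules));  // bool(true)
var_dump(is_partial_match('D', 'K', $rules));  // bool(false)
```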
Properties       top
  rules       A two-dimensional array representing allowable or valid symbol substitutions.
Methods       top

SubMatrix() code prev next top
  Purpose  : The constructor method of the SubMatrix class, which does nothing but initialize 
             the rules property to an empty array.
addrule() code prev next top
  Purpose  : Method for defining rules and adding them to a rule base (the rules property).
  Syntax   : void addrule(array $rule)
  Arguments: $rule (mandatory) - an array containing equivalent or interchangeable symbols.	
  Returns  : None

Copyright © 2003 by Sergio Gregorio, Jr.
All rights reserved.