                                    dreg



Wiki

   The master copies of EMBOSS documentation are available at
   http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

   Please help by correcting and extending the Wiki pages.

Function

   Regular expression search of nucleotide sequence(s)

Description

   dreg searches one or more sequences with the supplied regular
   expression and writes a report file with the matches.

Usage

   Here is a sample session with dreg


% dreg
Regular expression search of nucleotide sequence(s)
Input nucleotide sequence(s): tembl:x13776
Regular expression pattern: ggtacc
Output report [x13776.dreg]:



Command line arguments

Regular expression search of nucleotide sequence(s)
Version: EMBOSS:6.4.0.0

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
  [-pattern]           regexp     Any regular expression pattern is accepted
  [-outfile]           report     [*.dreg] Output report file name (default
                                  -rformat seqtable)

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-pattern" associated qualifiers
   -pformat2           string     File format
   -pname2             string     Pattern base name

   "-outfile" associated qualifiers
   -rformat3           string     Report format
   -rname3             string     Base file name
   -rextension3        string     File name extension
   -rdirectory3        string     Output directory
   -raccshow3          boolean    Show accession number in the report
   -rdesshow3          boolean    Show description in the report
   -rscoreshow3        boolean    Show the score in the report
   -rstrandshow3       boolean    Show the nucleotide strand in the report
   -rusashow3          boolean    Show the full USA in the report
   -rmaxall3           integer    Maximum total hits to report
   -rmaxseq3           integer    Maximum hits to report for one sequence

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit


Input file format

   Any nucleic sequence.

  Input files for usage example

   'tembl:x13776' is a sequence entry in the example nucleic acid database
   'tembl'

  Database entry: tembl:x13776

ID   X13776; SV 1; linear; genomic DNA; STD; PRO; 2167 BP.
XX
AC   X13776; M43175;
XX
DT   19-APR-1989 (Rel. 19, Created)
DT   14-NOV-2006 (Rel. 89, Last updated, Version 24)
XX
DE   Pseudomonas aeruginosa amiC and amiR gene for aliphatic amidase regulation
XX
KW   aliphatic amidase regulator; amiC gene; amiR gene.
XX
OS   Pseudomonas aeruginosa
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales;
OC   Pseudomonadaceae; Pseudomonas.
XX
RN   [1]
RP   1167-2167
RA   Rice P.M.;
RT   ;
RL   Submitted (16-DEC-1988) to the EMBL/GenBank/DDBJ databases.
RL   Rice P.M., EMBL, Postfach 10-2209, Meyerhofstrasse 1, 6900 Heidelberg, FRG.
XX
RN   [2]
RP   1167-2167
RX   DOI; 10.1016/0014-5793(89)80249-2.
RX   PUBMED; 2495988.
RA   Lowe N., Rice P.M., Drew R.E.;
RT   "Nucleotide sequence of the aliphatic amidase regulator gene (amiR) of
RT   Pseudomonas aeruginosa";
RL   FEBS Lett. 246(1-2):39-43(1989).
XX
RN   [3]
RP   1-1292
RX   PUBMED; 1907262.
RA   Wilson S., Drew R.;
RT   "Cloning and DNA sequence of amiC, a new gene regulating expression of the
RT   Pseudomonas aeruginosa aliphatic amidase, and purification of the amiC
RT   product";
RL   J. Bacteriol. 173(16):4914-4921(1991).
XX
RN   [4]
RP   1-2167
RA   Rice P.M.;
RT   ;
RL   Submitted (04-SEP-1991) to the EMBL/GenBank/DDBJ databases.
RL   Rice P.M., EMBL, Postfach 10-2209, Meyerhofstrasse 1, 6900 Heidelberg, FRG.
XX
DR   GOA; Q51417.
DR   InterPro; IPR003211; AmiSUreI_transpt.
DR   UniProtKB/Swiss-Prot; Q51417; AMIS_PSEAE.


  [Part of this file has been deleted for brevity]

FT                   /replace=""
FT                   /note="ClaI fragment deleted in pSW36,  constitutive
FT                   phenotype"
FT   misc_feature    1
FT                   /note="last base of an XhoI site"
FT   misc_feature    648..653
FT                   /note="end of 658bp XhoI fragment, deletion in  pSW3 causes
FT                   constitutive expression of amiE"
FT   conflict        1281
FT                   /replace="g"
FT                   /citation=[3]
XX
SQ   Sequence 2167 BP; 363 A; 712 C; 730 G; 362 T; 0 other;
     ggtaccgctg gccgagcatc tgctcgatca ccaccagccg ggcgacggga actgcacgat        60
     ctacctggcg agcctggagc acgagcgggt tcgcttcgta cggcgctgag cgacagtcac       120
     aggagaggaa acggatggga tcgcaccagg agcggccgct gatcggcctg ctgttctccg       180
     aaaccggcgt caccgccgat atcgagcgct cgcacgcgta tggcgcattg ctcgcggtcg       240
     agcaactgaa ccgcgagggc ggcgtcggcg gtcgcccgat cgaaacgctg tcccaggacc       300
     ccggcggcga cccggaccgc tatcggctgt gcgccgagga cttcattcgc aaccgggggg       360
     tacggttcct cgtgggctgc tacatgtcgc acacgcgcaa ggcggtgatg ccggtggtcg       420
     agcgcgccga cgcgctgctc tgctacccga ccccctacga gggcttcgag tattcgccga       480
     acatcgtcta cggcggtccg gcgccgaacc agaacagtgc gccgctggcg gcgtacctga       540
     ttcgccacta cggcgagcgg gtggtgttca tcggctcgga ctacatctat ccgcgggaaa       600
     gcaaccatgt gatgcgccac ctgtatcgcc agcacggcgg cacggtgctc gaggaaatct       660
     acattccgct gtatccctcc gacgacgact tgcagcgcgc cgtcgagcgc atctaccagg       720
     cgcgcgccga cgtggtcttc tccaccgtgg tgggcaccgg caccgccgag ctgtatcgcg       780
     ccatcgcccg tcgctacggc gacggcaggc ggccgccgat cgccagcctg accaccagcg       840
     aggcggaggt ggcgaagatg gagagtgacg tggcagaggg gcaggtggtg gtcgcgcctt       900
     acttctccag catcgatacg cccgccagcc gggccttcgt ccaggcctgc catggtttct       960
     tcccggagaa cgcgaccatc accgcctggg ccgaggcggc ctactggcag accttgttgc      1020
     tcggccgcgc cgcgcaggcc gcaggcaact ggcgggtgga agacgtgcag cggcacctgt      1080
     acgacatcga catcgacgcg ccacaggggc cggtccgggt ggagcgccag aacaaccaca      1140
     gccgcctgtc ttcgcgcatc gcggaaatcg atgcgcgcgg cgtgttccag gtccgctggc      1200
     agtcgcccga accgattcgc cccgaccctt atgtcgtcgt gcataacctc gacgactggt      1260
     ccgccagcat gggcggggga ccgctcccat gagcgccaac tcgctgctcg gcagcctgcg      1320
     cgagttgcag gtgctggtcc tcaacccgcc gggggaggtc agcgacgccc tggtcttgca      1380
     gctgatccgc atcggttgtt cggtgcgcca gtgctggccg ccgccggaag ccttcgacgt      1440
     gccggtggac gtggtcttca ccagcatttt ccagaatggc caccacgacg agatcgctgc      1500
     gctgctcgcc gccgggactc cgcgcactac cctggtggcg ctggtggagt acgaaagccc      1560
     cgcggtgctc tcgcagatca tcgagctgga gtgccacggc gtgatcaccc agccgctcga      1620
     tgcccaccgg gtgctgcctg tgctggtatc ggcgcggcgc atcagcgagg aaatggcgaa      1680
     gctgaagcag aagaccgagc agctccagga ccgcatcgcc ggccaggccc ggatcaacca      1740
     ggccaaggtg ttgctgatgc agcgccatgg ctgggacgag cgcgaggcgc accagcacct      1800
     gtcgcgggaa gcgatgaagc ggcgcgagcc gatcctgaag atcgctcagg agttgctggg      1860
     aaacgagccg tccgcctgag cgatccgggc cgaccagaac aataacaaga ggggtatcgt      1920
     catcatgctg ggactggttc tgctgtacgt tggcgcggtg ctgtttctca atgccgtctg      1980
     gttgctgggc aagatcagcg gtcgggaggt ggcggtgatc aacttcctgg tcggcgtgct      2040
     gagcgcctgc gtcgcgttct acctgatctt ttccgcagca gccgggcagg gctcgctgaa      2100
     ggccggagcg ctgaccctgc tattcgcttt tacctatctg tgggtggccg ccaaccagtt      2160
     cctcgag                                                                2167
//

Output file format

   The output is a standard EMBOSS report file.

   The results can be output in one of several styles by using the
   command-line qualifier -rformat xxx, where 'xxx' is replaced by the
   name of the required format. The available format names are: embl,
   genbank, gff, pir, swiss, dasgff, debug, listfile, dbmotif, diffseq,
   draw, restrict, excel, feattable, motif, nametable, regions, seqtable,
   simple, srs, table, tagseq.

   See: http://emboss.sf.net/docs/themes/ReportFormats.html for further
   information on report formats.

   By default dreg writes a 'seqtable' report file.

  Output files for usage example

  File: x13776.dreg

########################################
# Program: dreg
# Rundate: Fri 15 Jul 2011 12:00:00
# Commandline: dreg
#    -sequence tembl:x13776
#    -pattern ggtacc
# Report_format: seqtable
# Report_file: x13776.dreg
########################################

#=======================================
#
# Sequence: X13776     from: 1   to: 2167
# HitCount: 1
#
# Pattern: ggtacc
#
#=======================================

  Start     End  Strand Pattern       Sequence
      1       6       + regex: GGTACC ggtacc

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_length: 2167
# Reported_sequences: 1
# Reported_hitcount: 1
#---------------------------------------
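
   The report above can be approximated in a few lines of Python. This is
   only a sketch of the idea (scan a sequence with a regular expression
   and list 1-based start and end positions), not dreg's implementation.
   Python's re module stands in for PCRE, the function name find_hits is
   hypothetical, and only the first 60 bases of X13776 are used.

```python
import re

def find_hits(sequence, pattern):
    """Return (start, end, matched_text) with 1-based, inclusive positions."""
    # Assumption: case is ignored, mirroring the upper-case conversion
    # that this document describes for pattern and sequence.
    regex = re.compile(pattern, re.IGNORECASE)
    return [(m.start() + 1, m.end(), m.group(0))
            for m in regex.finditer(sequence)]

# First 60 bases of the example entry tembl:x13776.
seq = ("ggtaccgctg" "gccgagcatc" "tgctcgatca"
       "ccaccagccg" "ggcgacggga" "actgcacgat")

for start, end, text in find_hits(seq, "ggtacc"):
    # Only the forward strand is searched in this sketch.
    print(f"{start:7d} {end:7d}       + {text}")
```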

Data files

   None.

Notes

   A regular expression is a way of specifying an ambiguous pattern to
   search for. Regular expressions are commonly used in some computer
   programming languages and may be more familiar to some users than to
   others.

   The following is a short guide to regular expressions in EMBOSS:

   ^
          use this at the start of a pattern to insist that the pattern
          can only match at the start of a sequence. (eg. '^AUG' matches a
          start codon at the start of the sequence)

   $
          use this at the end of a pattern to insist that the pattern can
          only match at the end of a sequence (eg. 'A+$' matches a poly-A
          sequence at the end of the sequence)

   ()
          groups a pattern. This is commonly used with '|' (eg.
          '(AUG)|(ATG)' matches either the RNA or DNA form of the
          initiation codon)

   |
          This is the OR operator to enable a match to be made to either
          one pattern OR another. There is no AND operator in this version
          of regular expressions.

   The following quantifier characters specify the number of times that
   the character before (in this case 'x') matches:

   x?
          matches 0 or 1 times (ie, '' or 'x')

   x*
          matches 0 or more times (ie, '' or 'x' or 'xx' or 'xxx', etc)

   x+
          matches 1 or more times (ie, 'x' or 'xx' or 'xxx', etc)

   x{min,max}
          Braces can enclose the specification of the minimum and maximum
          number of matches. A match of 'x' of between 3 and 6 times is:
          'x{3,6}'
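
   The quantifiers can be tried out with Python's re module, which is
   used here as a stand-in for PCRE. The sequence and the helper name
   whole are made up for illustration.

```python
import re

def whole(pattern, text):
    """True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# A made-up sequence with runs of 2, 4 and 8 'A's.
seq = "G" + "A" * 2 + "G" + "A" * 4 + "C" + "A" * 8 + "T"

# Runs of between 3 and 6 'A's. The run of 2 is too short, and the run
# of 8 yields one hit of 6 (the 2 left over are too few for another hit).
runs = re.findall(r"A{3,6}", seq)
print(runs)

print(whole("A?", ""), whole("A?", "AA"))    # 0 or 1 times
print(whole("A*", ""), whole("A*", "AAA"))   # 0 or more times
print(whole("A+", ""), whole("A+", "AAA"))   # 1 or more times
```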

   Quantifiers can follow any of the following types of character
   specification:

   x
          any character (ie 'A')

   \x
          the character after the backslash is used instead of its normal
          regular expression meaning. This is commonly used to turn off
          the special meaning of the characters '^$()|?*+[]-.'. It may be
          especially useful when searching for gap characters in a
          sequence (eg '\.' matches only a dot character '.')

   [xy]
          match one of the characters 'x' or 'y'. You may have one or more
          characters in this set.

   [x-z]
          match any one of the set of characters starting with 'x' and
          ending in 'z' in ASCII order (eg '[A-G]' matches any one of:
          'A', 'B', 'C', 'D', 'E', 'F', 'G')

   [^x-z]
          matches anything except any one of the group of characters in
          ASCII order (eg '[^A-G]' matches anything EXCEPT any one of:
          'A', 'B', 'C', 'D', 'E', 'F', 'G')

   .
          the dot character matches any single character (eg: 'A.G'
          matches 'AAG', 'AaG', 'AZG', 'A-G', 'A G', etc.)

   Combining some of these features gives the example:
'([AGC]+GGG)|(TTTGGG)'

   which matches one or more of any one of 'A' or 'G' or 'C' followed by
   three 'G's or it matches just 'TTTGGG'.
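
   As a rough check of how this pattern behaves, it can be run against a
   few short test strings. The sketch uses Python's re module rather than
   PCRE, and the helper name first_match is hypothetical.

```python
import re

PATTERN = re.compile(r"([AGC]+GGG)|(TTTGGG)")

def first_match(seq):
    """Return the text of the leftmost match, or None if there is none."""
    m = PATTERN.search(seq)
    return m.group(0) if m else None

print(first_match("ACGGG"))    # first alternative: 'AC' then 'GGG'
print(first_match("TTTGGG"))   # second alternative
print(first_match("TTGGGG"))   # first alternative, starting at the first G
print(first_match("TTTT"))     # no match
```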

   Regular expressions are case-sensitive. The pattern 'AAAA' will not
   match the sequence 'aaaa'. For this reason, both your pattern and the
   input sequences are converted to upper-case.
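
   The effect of case can be seen in a small Python sketch (Python's re
   module is assumed here in place of PCRE). Upper-casing the pattern as
   a plain string is only safe in this sketch because the pattern
   consists of letters alone.

```python
import re

pattern, sequence = "AAAA", "ggaaaatt"

# A case-sensitive search finds nothing in the lower-case sequence.
case_sensitive = re.search(pattern, sequence)

# After converting both to upper case the pattern is found.
folded = re.search(pattern.upper(), sequence.upper())

print(case_sensitive)
print(folded.span())
```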

The syntax in detail

   EMBOSS uses the publicly available PCRE code library to do regular
   expressions.

   The full documentation of the PCRE system can be seen at
   http://www.pcre.org/pcre.txt

   A condensed description of the syntax of PCRE follows, without features
   that are thought not to be required for searching for patterns in
   sequences (e.g. matching non-printing characters, atomic grouping,
   back-references, assertion, conditional sub-patterns, recursive
   patterns, subpatterns as subroutines, callouts). If you do not see a
   required function described below, please see the full description on
   the PCRE web site.

  PCRE REGULAR EXPRESSION DETAILS

   The syntax and semantics of the regular expressions supported by PCRE
   are described below. Regular expressions are also described in the Perl
   documentation and in a number of other books, some of which have
   copious examples. Jeffrey Friedl's "Mastering Regular Expressions",
   published by O'Reilly, covers them in great detail. The description
   here is intended as reference documentation.

   A regular expression is a pattern that is matched against a subject
   string from left to right. Most characters stand for themselves in a
   pattern, and match the corresponding characters in the subject. As a
   trivial example, the pattern

   The quick brown fox

   matches a portion of a subject string that is identical to itself. The
   power of regular expressions comes from the ability to include
   alternatives and repetitions in the pattern. These are encoded in the
   pattern by the use of meta-characters, which do not stand for
   themselves but instead are interpreted in some special way.

   There are two different sets of meta-characters: those that are
   recognized anywhere in the pattern except within square brackets, and
   those that are recognized in square brackets. Outside square brackets,
   the meta-characters are as follows:

       \      general escape character with several uses
       ^      assert start of string (or line, in multiline mode)
       $      assert end of string (or line, in multiline mode)
       .      match any character except newline (by default)
       [      start character class definition
       |      start of alternative branch
       (      start subpattern
       )      end subpattern
       ?      extends the meaning of (
              also 0 or 1 quantifier
              also quantifier minimizer
       *      0 or more quantifier
       +      1 or more quantifier
              also "possessive quantifier"
       {      start min/max quantifier


   Part of a pattern that is in square brackets is called a "character
   class". In a character class the only meta-characters are:

       \      general escape character
       ^      negate the class, but only if the first character
       -      indicates character range
       [      POSIX character class (only if followed by POSIX
                syntax)
       ]      terminates the character class

   The following sections describe the use of each of the meta-characters.

    BACKSLASH

   The backslash character has several uses. Firstly, if it is followed by
   a non-alphameric character, it takes away any special meaning that
   character may have. This use of backslash as an escape character
   applies both inside and outside character classes.

   For example, if you want to match a * character, you write \* in the
   pattern. This escaping action applies whether or not the following
   character would otherwise be interpreted as a meta-character, so it is
   always safe to precede a non-alphameric with backslash to specify that
   it stands for itself. In particular, if you want to match a backslash,
   you write \\.

   The third use of backslash (the second, encoding non-printing
   characters, is omitted from this condensed description) is for
   specifying generic character types:

       \d     any decimal digit
       \D     any character that is not a decimal digit
       \s     any whitespace character
       \S     any character that is not a whitespace character
       \w     any "word" character
       \W     any "non-word" character

   Each pair of escape sequences partitions the complete set of characters
   into two disjoint sets. Any given character matches one, and only one,
   of each pair.

   A "word" character is any letter or digit or the underscore character,
   that is, any character which can be part of a Perl "word". The
   definition of letters and digits is controlled by PCRE's character
   tables, and may vary if locale-specific matching is taking place (see
   "Locale support" in the pcreapi page). For example, in the "fr"
   (French) locale, some character codes greater than 128 are used for
   accented letters, and these are matched by \w.

   These character type sequences can appear both inside and outside
   character classes. They each match one character of the appropriate
   type. If the current matching point is at the end of the subject
   string, all of them fail, since there is no character to match.
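
   The character types can be explored with a short Python sketch. This
   assumes Python's re module as a stand-in for PCRE. The re.ASCII flag
   is passed so that only ASCII characters count as digits, whitespace or
   word characters, and the helper name classify is hypothetical.

```python
import re

TYPES = (("digit", r"\d"), ("space", r"\s"), ("word", r"\w"))

def classify(ch):
    """Report which of \\d, \\s and \\w match a single character."""
    return {name: re.fullmatch(pat, ch, re.ASCII) is not None
            for name, pat in TYPES}

for ch in ("7", "_", " ", "-"):
    print(repr(ch), classify(ch))

# Each lower/upper case pair splits the characters into two disjoint
# sets: exactly one of \d and \D matches any given character.
partitioned = all(
    (re.fullmatch(r"\d", c) is None) != (re.fullmatch(r"\D", c) is None)
    for c in "a7 _-"
)
print(partitioned)
```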

   The fourth use of backslash is for certain simple assertions. An
   assertion specifies a condition that has to be met at a particular
   point in a match, without consuming any characters from the subject
   string. The use of subpatterns for more complicated assertions is
   described below. The backslashed assertions are

       \b     matches at a word boundary
       \B     matches when not at a word boundary
       \A     matches at start of subject
       \Z     matches at end of subject or before newline at end
       \z     matches at end of subject
       \G     matches at first matching position in subject

   These assertions may not appear in character classes (but note that \b
   has a different meaning, namely the backspace character, inside a
   character class).

   A word boundary is a position in the subject string where the current
   character and the previous character do not both match \w or \W (i.e.
   one matches \w and the other matches \W), or the start or end of the
   string if the first or last character matches \w, respectively. The \A,
   \Z, and \z assertions differ from the traditional circumflex and dollar
   (described below) in that they only ever match at the very start and
   end of the subject string, whatever options are set. Thus, they are
   independent of multiline mode.
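
   A brief sketch of \b, \B and \A, assuming Python's re module in place
   of PCRE. The \Z and \z assertions are left out because Python defines
   \Z differently from the description above. The test strings are made
   up.

```python
import re

text = "cat concat cats cat"

# \b requires a word boundary on both sides, so "concat" and "cats"
# do not produce hits.
word_starts = [m.start() for m in re.finditer(r"\bcat\b", text)]

# \B requires the absence of a boundary: only the "cat" inside "concat".
inner_starts = [m.start() for m in re.finditer(r"\Bcat", text)]

# In multiline mode ^ matches after each newline, but \A does not.
subject = "ab\nab"
caret_hits = re.findall(r"^ab", subject, re.MULTILINE)
anchor_hits = re.findall(r"\Aab", subject, re.MULTILINE)

print(word_starts, inner_starts, len(caret_hits), len(anchor_hits))
```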

    CIRCUMFLEX AND DOLLAR

   Outside a character class, in the default matching mode, the circumflex
   character is an assertion which is true only if the current matching
   point is at the start of the subject string. Inside a character class,
   circumflex has an entirely different meaning (see below).

   Circumflex need not be the first character of the pattern if a number
   of alternatives are involved, but it should be the first thing in each
   alternative in which it appears if the pattern is ever to match that
   branch. If all possible alternatives start with a circumflex, that is,
   if the pattern is constrained to match only at the start of the
   subject, it is said to be an "anchored" pattern. (There are also other
   constructs that can cause a pattern to be anchored.)

   A dollar character is an assertion which is true only if the current
   matching point is at the end of the subject string, or immediately
   before a newline character that is the last character in the string (by
   default). Dollar need not be the last character of the pattern if a
   number of alternatives are involved, but it should be the last item in
   any branch in which it appears. Dollar has no special meaning in a
   character class.
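
   The two anchors can be illustrated with sequence-like strings. The
   sketch below assumes Python's re module, whose default handling of
   circumflex and dollar follows the description above. The helper names
   are hypothetical.

```python
import re

def starts_with_aug(seq):
    """True only if 'AUG' is at the very start of the string."""
    return re.search(r"^AUG", seq) is not None

def poly_a_tail(seq):
    """Return the run of 'A's at the end of the string, if any."""
    m = re.search(r"A+$", seq)
    return m.group(0) if m else None

print(starts_with_aug("AUGGCC"), starts_with_aug("CCAUG"))
print(poly_a_tail("GGCAAAA"), poly_a_tail("GGCAAAAG"))

# Dollar also matches immediately before a newline that ends the string.
print(poly_a_tail("GGCAAA\n"))
```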

    FULL STOP (PERIOD, DOT)

   Outside a character class, a dot in the pattern matches any one
   character in the subject, including a non-printing character, but not
   (by default) newline. The handling of dot is entirely independent of
   the handling of circumflex and dollar, the only relationship being that
   they both involve newline characters. Dot has no special meaning in a
   character class.

    SQUARE BRACKETS

   An opening square bracket introduces a character class, terminated by a
   closing square bracket. A closing square bracket on its own is not
   special. If a closing square bracket is required as a member of the
   class, it should be the first data character in the class (after an
   initial circumflex, if present) or escaped with a backslash.

   A character class matches a single character in the subject. A matched
   character must be in the set of characters defined by the class, unless
   the first character in the class definition is a circumflex, in which
   case the subject character must not be in the set defined by the class.
   If a circumflex is actually required as a member of the class, ensure
   it is not the first character, or escape it with a backslash.

   For example, the character class [aeiou] matches any lower case vowel,
   while [^aeiou] matches any character that is not a lower case vowel.
   Note that a circumflex is just a convenient notation for specifying the
   characters which are in the class by enumerating those that are not. It
   is not an assertion: it still consumes a character from the subject
   string, and fails if the current pointer is at the end of the string.

   When caseless matching is set, any letters in a class represent both
   their upper case and lower case versions, so for example, a caseless
   [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
   match "A", whereas a caseful version would. PCRE does not support the
   concept of case for characters with values greater than 255. A class
   such as [^a] will always match a newline.

   The minus (hyphen) character can be used to specify a range of
   characters in a character class. For example, [d-m] matches any letter
   between d and m, inclusive. If a minus character is required in a
   class, it must be escaped with a backslash or appear in a position
   where it cannot be interpreted as indicating a range, typically as the
   first or last character in the class.

   It is not possible to have the literal character "]" as the end
   character of a range. A pattern such as [W-]46] is interpreted as a
   class of two characters ("W" and "-") followed by a literal string
   "46]", so it would match "W46]" or "-46]". However, if the "]" is
   escaped with a backslash it is interpreted as the end of range, so
   [W-\]46] is interpreted as a single class containing a range followed
   by two separate characters. The octal or hexadecimal representation of
   "]" can also be used to end a range.

   The character types \d, \D, \s, \S, \w, and \W may also appear in a
   character class, and add the characters that they match to the class.
   For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
   conveniently be used with the upper case character types to specify a
   more restricted set of characters than the matching lower case type.
   For example, the class [^\W_] matches any letter or digit, but not
   underscore.

   All non-alphameric characters other than \, -, ^ (at the start) and the
   terminating ] are non-special in character classes, but it does no harm
   if they are escaped.
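
   The character class examples from this section can be checked one
   character at a time. The sketch assumes Python's re module, which
   agrees with PCRE on these particular classes. The helper name
   in_class is hypothetical.

```python
import re

def in_class(char_class, text):
    """True if the whole text is matched by the given pattern."""
    return re.fullmatch(char_class, text) is not None

print(in_class(r"[aeiou]", "e"), in_class(r"[^aeiou]", "e"))
print(in_class(r"[d-m]", "k"), in_class(r"[d-m]", "n"))

# A hexadecimal digit: \d adds 0-9 to the listed letters.
print(in_class(r"[\dABCDEF]", "9"), in_class(r"[\dABCDEF]", "G"))

# Any letter or digit, but not underscore.
print(in_class(r"[^\W_]", "a"), in_class(r"[^\W_]", "_"))

# [W-] is a two-character class followed by the literal string "46]".
print(in_class(r"[W-]46]", "W46]"), in_class(r"[W-]46]", "-46]"))

# With the "]" escaped, the class holds the range W to ] plus 4 and 6.
print(in_class(r"[W-\]46]", "Z"), in_class(r"[W-\]46]", "4"))
```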

    VERTICAL BAR

   Vertical bar characters are used to separate alternative patterns. For
   example, the pattern

   gilbert|sullivan

   matches either "gilbert" or "sullivan". Any number of alternatives may
   appear, and an empty alternative is permitted (matching the empty
   string). The matching process tries each alternative in turn, from left
   to right, and the first one that succeeds is used. If the alternatives
   are within a subpattern (defined below), "succeeds" means matching the
   rest of the main pattern as well as the alternative in the subpattern.
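
   The left-to-right order of alternatives matters when one alternative
   is a prefix of another. A minimal sketch, assuming Python's re module
   in place of PCRE:

```python
import re

# The first alternative that succeeds is used, even if a later one
# would give a longer match.
short_first = re.search(r"a|ab", "ab").group(0)
long_first = re.search(r"ab|a", "ab").group(0)

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# 'a' alone cannot be followed by 'c' here, so 'ab' is used instead.
inside_group = re.fullmatch(r"(a|ab)c", "abc").group(1)

print(short_first, long_first, inside_group)
```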

    INTERNAL OPTION SETTING

   The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
   PCRE_EXTENDED options can be changed from within the pattern by a
   sequence of Perl option letters enclosed between "(?" and ")". The
   option letters are

       i  for PCRE_CASELESS
       m  for PCRE_MULTILINE
       s  for PCRE_DOTALL
       x  for PCRE_EXTENDED

   For example, (?im) sets caseless, multiline matching. It is also
   possible to unset these options by preceding the letter with a hyphen,
   and a combined setting and unsetting such as (?im-sx), which sets
   PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
   PCRE_EXTENDED, is also permitted. If a letter appears both before and
   after the hyphen, the option is unset.

   When an option change occurs at top level (that is, not inside
   subpattern parentheses), the change applies to the remainder of the
   pattern that follows. If the change is placed right at the start of a
   pattern, PCRE extracts it into the global options (and it will
   therefore show up in data extracted by the pcre_fullinfo() function).

   An option change within a subpattern affects only that part of the
   current pattern that follows it, so

   (a(?i)b)c

   matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
   used). By this means, options can be made to have different settings in
   different parts of the pattern. Any changes made in one alternative do
   carry on into subsequent branches within the same subpattern. For
   example,

   (a(?i)b|c)

   matches "ab", "aB", "c", and "C", even though when matching "C" the
   first branch is abandoned before the option setting. This is because
   the effects of option settings happen at compile time. There would be
   some very weird behaviour otherwise.
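
   Python's re module, used here only as an approximation of PCRE,
   handles inline options slightly differently: an unscoped setting such
   as (?i) is accepted only at the very start of a pattern in recent
   Python versions, and a change part-way through a pattern is written in
   the scoped form (?i:...). Under that assumption, the closest Python
   equivalent of (a(?i)b)c is sketched below.

```python
import re

# Only the 'b' is matched caselessly; 'a' and 'c' stay case-sensitive.
scoped = re.compile(r"(a(?i:b))c")

def matches(text):
    return scoped.fullmatch(text) is not None

print(matches("abc"), matches("aBc"))   # both accepted
print(matches("Abc"), matches("aBC"))   # both rejected

# A setting at the start of the pattern applies to the whole pattern.
whole_pattern = re.compile(r"(?i)ggtacc")
print(whole_pattern.fullmatch("GGTACC") is not None)
```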

   The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
   in the same way as the Perl-compatible options by using the characters
   U and X respectively. The (?X) flag setting is special in that it must
   always occur earlier in the pattern than any of the additional features
   it turns on, even when it is at top level. It is best put at the start.

    SUBPATTERNS

   Subpatterns are delimited by parentheses (round brackets), which can be
   nested. Marking part of a pattern as a subpattern does two things:

   1. It localizes a set of alternatives. For example, the pattern

   cat(aract|erpillar|)

   matches one of the words "cat", "cataract", or "caterpillar". Without
   the parentheses, it would match "cataract", "erpillar" or the empty
   string.

   2. It sets up the subpattern as a capturing subpattern (as defined
   above). When the whole pattern matches, that portion of the subject
   string that matched the subpattern is passed back to the caller via the
   ovector argument of pcre_exec(). Opening parentheses are counted from
   left to right (starting from 1) to obtain the numbers of the capturing
   subpatterns.

   For example, if the string "the red king" is matched against the
   pattern

   the ((red|white) (king|queen))

   the captured substrings are "red king", "red", and "king", and are
   numbered 1, 2, and 3, respectively.

   The fact that plain parentheses fulfil two functions is not always
   helpful. There are often times when a grouping subpattern is required
   without a capturing requirement. If an opening parenthesis is followed
   by a question mark and a colon, the subpattern does not do any
   capturing, and is not counted when computing the number of any
   subsequent capturing subpatterns. For example, if the string "the white
   queen" is matched against the pattern

   the ((?:red|white) (king|queen))

   the captured substrings are "white queen" and "queen", and are numbered
   1 and 2. The maximum number of capturing subpatterns is 65535, and the
   maximum depth of nesting of all subpatterns, both capturing and
   non-capturing, is 200.
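
   The numbering of capturing and non-capturing subpatterns can be
   confirmed with a short Python sketch. Python's re module is assumed in
   place of PCRE, and its groups() method plays the role of the ovector
   mentioned above.

```python
import re

# Three capturing subpatterns, numbered by their opening parentheses.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())

# (?: ... ) groups without capturing, so only two substrings are kept.
mixed = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
print(mixed.groups())

# Parentheses also localize a set of alternatives.
words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
         if re.fullmatch(r"cat(aract|erpillar|)", w)]
print(words)
```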

   As a convenient shorthand, if any option settings are required at the
   start of a non-capturing subpattern, the option letters may appear
   between the "?" and the ":". Thus the two patterns

       (?i:saturday|sunday)
       (?:(?i)saturday|sunday)

   match exactly the same set of strings. Because alternative branches are
   tried from left to right, and options are not reset until the end of
   the subpattern is reached, an option setting in one branch does affect
   subsequent branches, so the above patterns match "SUNDAY" as well as
   "Saturday".

    REPETITION

   Repetition is specified by quantifiers, which can follow any of the
   following items:

       a literal data character
       the . meta-character
       the \C escape sequence
       escapes such as \d that match single characters
       a character class
       a back reference (see next section)
       a parenthesized subpattern (unless it is an assertion)

   The general repetition quantifier specifies a minimum and maximum
   number of permitted matches, by giving the two numbers in curly
   brackets (braces), separated by a comma. The numbers must be less than
   65536, and the first must be less than or equal to the second. For
   example:

   z{2,4}

   matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
   special character. If the second number is omitted, but the comma is
   present, there is no upper limit; if the second number and the comma
   are both omitted, the quantifier specifies an exact number of required
   matches. Thus

   [aeiou]{3,}

   matches at least 3 successive vowels, but may match many more, while

   \d{8}

   matches exactly 8 digits. An opening curly bracket that appears in a
   position where a quantifier is not allowed, or one that does not match
   the syntax of a quantifier, is taken as a literal character. For
   example, {,6} is not a quantifier, but a literal string of four
   characters.

   The quantifier {0} is permitted, causing the expression to behave as if
   the previous item and the quantifier were not present.

   For convenience (and historical compatibility) the three most common
   quantifiers have single-character abbreviations:

       *    is equivalent to {0,}
       +    is equivalent to {1,}
       ?    is equivalent to {0,1}
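
   The brace forms and their single-character abbreviations can be
   compared directly. This sketch assumes Python's re module in place of
   PCRE. The {,6} case is left out because Python reads it as {0,6}
   rather than as a literal string. The helper name same_matches is
   hypothetical.

```python
import re

def same_matches(pattern_a, pattern_b, text):
    """True if both patterns give the same list of matches in text."""
    return re.findall(pattern_a, text) == re.findall(pattern_b, text)

text = "z zz zzz zzzzz"

# Between 2 and 4 'z's: the run of five yields one hit of four.
print(re.findall(r"z{2,4}", text))

# The abbreviations behave like their brace equivalents.
print(same_matches(r"z*", r"z{0,}", text))
print(same_matches(r"z+", r"z{1,}", text))
print(same_matches(r"z?", r"z{0,1}", text))

# An open upper limit and an exact count.
print(re.search(r"[aeiou]{3,}", "queueing").group(0))
print(re.fullmatch(r"\d{8}", "20110715") is not None)
```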

   It is possible to construct infinite loops by following a subpattern
   that can match no characters with a quantifier that has no upper limit,
   for example:

   (a?)*

   Earlier versions of Perl and PCRE used to give an error at compile time
   for such patterns. However, because there are cases where this can be
   useful, such patterns are now accepted, but if any repetition of the
   subpattern does in fact match no characters, the loop is forcibly
   broken.
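
   A small sketch of this behaviour, assuming Python's re module (which
   has its own safeguard against empty-match loops and is used here only
   as a stand-in for PCRE): the match returns instead of looping forever.

```python
import re

# (a?)* can repeat an empty match indefinitely; the engine stops the loop.
m_some = re.match(r"(a?)*", "aab")   # consumes the two leading "a" characters
m_none = re.match(r"(a?)*", "bbb")   # succeeds with an empty match at offset 0

print(m_some.span(), m_none.span())
```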

   By default, the quantifiers are "greedy", that is, they match as much
   as possible (up to the maximum number of permitted times), without
   causing the rest of the pattern to fail. The classic example of where
   this gives problems is in trying to match comments in C programs. These
   appear between the sequences /* and */ and within the sequence,
   individual * and / characters may appear. An attempt to match C
   comments by applying the pattern

   /\*.*\*/

   to the string

       /* first comment */  not comment  /* second comment */

   fails, because it matches the entire string owing to the greediness of
   the .* item.

   However, if a quantifier is followed by a question mark, it ceases to
   be greedy, and instead matches the minimum number of times possible, so
   the pattern

   /\*.*?\*/

   does the right thing with the C comments. The meaning of the various
   quantifiers is not otherwise changed, just the preferred number of
   matches. Do not confuse this use of question mark with its use as a
   quantifier in its own right. Because it has two uses, it can sometimes
   appear doubled, as in

   \d??\d

   which matches one digit by preference, but can match two if that is the
   only way the rest of the pattern matches.
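
   The difference between greedy and non-greedy matching can be seen in
   a minimal sketch, assuming Python's re module as a stand-in for PCRE.
   It applies both comment patterns to the string above and then shows
   the doubled question mark in \d??\d.

```python
import re

text = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the string, so one oversized match results.
greedy = re.findall(r"/\*.*\*/", text)

# Non-greedy .*? stops at the first */ it can, giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers one digit, but takes two when the whole subject must match.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy, lazy, one_digit, two_digits)
```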

   If the PCRE_UNGREEDY option is set (an option which is not available in
   Perl), the quantifiers are not greedy by default, but individual ones
   can be made greedy by following them with a question mark. In other
   words, it inverts the default behaviour.

   When a parenthesized subpattern is quantified with a minimum repeat
   count that is greater than 1 or with a limited maximum, more memory is
   required for the compiled pattern, in proportion to the size of the
   minimum or maximum.

   If a pattern starts with .* or .{0,} and the PCRE_DOTALL option
   (equivalent to Perl's /s) is set, thus allowing the . to match
   newlines, the pattern is implicitly anchored, because whatever follows
   will be tried against every character position in the subject string,
   so there is no point in retrying the overall match at any position
   after the first. PCRE normally treats such a pattern as though it were
   preceded by \A.

   In cases where it is known that the subject string contains no
   newlines, it is worth setting PCRE_DOTALL in order to obtain this
   optimization, or alternatively using ^ to indicate anchoring
   explicitly.

   However, there is one situation where the optimization cannot be used.
   When .* is inside capturing parentheses that are the subject of a
   backreference elsewhere in the pattern, a match at the start may fail,
   and a later one succeed. Consider, for example:

   (.*)abc\1

   If the subject is "xyz123abc123" the match point is the fourth
   character. For this reason, such a pattern is not implicitly anchored.
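
   The match point in this example can be confirmed with a short sketch,
   assuming Python's re module (with its DOTALL flag playing the role of
   PCRE_DOTALL). Offsets in Python are zero-based, so the fourth
   character is offset 3.

```python
import re

subject = "xyz123abc123"

# Without a backreference, a leading .* can always match from the start.
plain_start = re.search(r".*abc", subject, re.DOTALL).start()

# With \1 referring to the group, attempts at offsets 0, 1 and 2 fail and
# the match succeeds only from offset 3, where the group captures "123".
m = re.search(r"(.*)abc\1", subject, re.DOTALL)
backref_start = m.start()
captured = m.group(1)

print(plain_start, backref_start, captured)
```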

   When a capturing subpattern is repeated, the value captured is the
   substring that matched the final iteration. For example, after

   (tweedle[dume]{3}\s*)+

   has matched "tweedledum tweedledee" the value of the captured substring
   is "tweedledee". However, if there are nested capturing subpatterns,
   the corresponding captured values may have been set in previous
   iterations. For example, after

   /(a|(b))+/

   matches "aba" the value of the second captured substring is "b".
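
   The captured values after repetition can be inspected with a minimal
   sketch, assuming Python's re module as a stand-in for PCRE and the
   subject "aba" for the nested pattern.

```python
import re

# The outer group is repeated; it keeps the text of its final iteration.
m_tweedle = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m_tweedle.group(1)

# The final iteration of (a|(b))+ on "aba" matches "a" via the first branch,
# yet the inner group still holds the "b" set in the previous iteration.
m_nested = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = m_nested.group(1), m_nested.group(2)

print(last_iteration, outer, inner)
```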

  PCRE PERFORMANCE

   Certain items that may appear in regular expression patterns are more
   efficient than others. It is more efficient to use a character class
   like [aeiou] than a set of alternatives such as (a|e|i|o|u). In
   general, the simplest construction that provides the required behaviour
   is usually the most efficient. Jeffrey Friedl's book contains a lot of
   discussion about optimizing regular expressions for efficient
   performance.

   When a pattern begins with .* not in parentheses, or in parentheses
   that are not the subject of a backreference, and the PCRE_DOTALL option
   is set, the pattern is implicitly anchored by PCRE, since it can match
   only at the start of a subject string. However, if PCRE_DOTALL is not
   set, PCRE cannot make this optimization, because the . meta-character
   does not then match a newline, and if the subject string contains
   newlines, the pattern may match from the character immediately
   following one of them instead of from the very start. For example, the
   pattern

   .*second

   matches the subject "first\nand second" (where \n stands for a newline
   character), with the match starting at the seventh character. In order
   to do this, PCRE has to retry the match starting after every newline in
   the subject.

   If you are using such a pattern with subject strings that do not
   contain newlines, the best performance is obtained by setting
   PCRE_DOTALL, or starting the pattern with ^.* to indicate explicit
   anchoring. That saves PCRE from having to scan along the subject
   looking for a newline to restart at.
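
   The effect of newlines on a leading .* can be seen in a short sketch,
   assuming Python's re module, whose DOTALL flag corresponds to
   PCRE_DOTALL. Offsets are zero-based, so the seventh character is
   offset 6.

```python
import re

subject = "first\nand second"

# Without DOTALL, . stops at the newline, so the match starts after it.
default_start = re.search(r".*second", subject).start()

# With DOTALL, . crosses the newline and the match starts at offset 0.
dotall_start = re.search(r".*second", subject, re.DOTALL).start()

# An explicit ^ anchors at the very start; here that makes the match fail,
# which is why it only suits subjects that contain no newlines.
anchored = re.search(r"^.*second", subject)

print(default_start, dotall_start, anchored)
```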

   Beware of patterns that contain nested indefinite repeats. These can
   take a long time to run when applied to a string that does not match.
   Consider the pattern fragment

   (a+)*

   This can match "aaaa" in 16 different ways, and this number increases
   very rapidly as the string gets longer. (The * repeat can match 0, 1,
   2, 3, or 4 times, and for each of those cases other than 0, the +
   repeats can match different numbers of times.) When the remainder of
   the pattern is such that the entire match is going to fail, PCRE has in
   principle to try every possible variation, and this can take an
   extremely long time. An optimization catches some of the more simple
   cases such as

   (a+)*b

   where a literal character follows. Before embarking on the standard
   matching procedure, PCRE checks that there is a "b" later in the
   subject string, and if there is not, it fails the match immediately.
   However, when there is no following literal this optimization cannot be
   used. You can see the difference by comparing the behaviour of

   (a+)*\d

   with the pattern above. The pattern above, (a+)*b, fails almost
   instantly when applied to a whole line of "a" characters, whereas
   (a+)*\d takes an appreciable time with strings longer than about 20
   characters.
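
   To give a feel for how fast the number of alternatives grows, the
   following sketch counts the distinct ways in which (a+)* can consume
   a prefix of a run of "a" characters: every choice of how many times
   the * repeats, and how many characters each + iteration takes. It is
   a plain counting exercise in Python, not a model of PCRE's internals,
   and the function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways_exact(n):
    """Ways to split exactly n characters into successive a+ iterations."""
    if n == 0:
        return 1  # the * repeat matches zero times
    # The first a+ iteration takes k characters, the rest are split recursively.
    return sum(ways_exact(n - k) for k in range(1, n + 1))


def ways_prefix(n):
    """Ways for (a+)* to match any prefix (including the empty one) of n a's."""
    return sum(ways_exact(m) for m in range(n + 1))


for length in (4, 10, 20):
    print(length, ways_prefix(length))
```

   Under this way of counting the total doubles with every extra
   character, which is why a failing match can become very slow once the
   subject reaches a few dozen characters.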

References

   None.

Warnings

   Regular expressions are case-sensitive. The pattern 'AAAA' will not
   match the sequence 'aaaa'. For this reason, both your pattern and the
   input sequences are converted to upper-case.

Diagnostic Error Messages

   None.

Exit status

   Always returns 0.

Known bugs

   None.

See also

   Program name   Description
   fuzznuc        Search for patterns in nucleotide sequences
   fuzztran       Search for patterns in protein sequences (translated)

Author(s)

   Peter Rice
   European Bioinformatics Institute, Wellcome Trust Genome Campus,
   Hinxton, Cambridge CB10 1SD, UK

   Please report all bugs to the EMBOSS bug team
   (emboss-bug (c) emboss.open-bio.org) not to the original author.

History

   Written (1999) - Peter Rice.

Target users

   This program is intended to be used by everyone and everything, from
   naive users to embedded scripts.

Comments

   None.
