org.biojavax.bio.seq.io
Class UniProtFormat

java.lang.Object
  extended by org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
      extended by org.biojavax.bio.seq.io.RichSequenceFormat.HeaderlessFormat
          extended by org.biojavax.bio.seq.io.UniProtFormat
All Implemented Interfaces:
SequenceFormat, RichSequenceFormat

public class UniProtFormat
extends RichSequenceFormat.HeaderlessFormat

Format reader and writer for UniProt files. This implementation reads and writes RichSequence objects. It is loosely based on code from the old, deprecated org.biojava.bio.seq.io.EMBLLikeFormat class. Since 1.7, the parser also reads the International Protein Index (IPI) pseudo-UniProt format.

Since:
1.5
Author:
Richard Holland, Mark Schreiber, George Waldon

Nested Class Summary
static class UniProtFormat.Terms
          Implements some UniProt-specific terms.
 
Nested classes/interfaces inherited from interface org.biojavax.bio.seq.io.RichSequenceFormat
RichSequenceFormat.BasicFormat, RichSequenceFormat.HeaderlessFormat
 
Field Summary
protected static java.lang.String ACCESSION_TAG
           
protected static java.lang.String AUTHORS_TAG
           
protected static java.lang.String COMMENT_TAG
           
protected static java.lang.String CONSORTIUM_TAG
           
protected static java.lang.String DATABASE_XREF_TAG
           
protected static java.lang.String DATE_TAG
           
protected static java.lang.String DEFINITION_TAG
           
protected static java.util.regex.Pattern dp_ipi
           
protected static java.util.regex.Pattern dp_uniprot
           
protected static java.lang.String END_SEQUENCE_TAG
           
protected static java.lang.String FEATURE_TAG
           
protected static java.util.regex.Pattern fp
           
protected static java.lang.String GENE_TAG
           
protected static java.util.regex.Pattern headerLine
           
protected static java.lang.String KEYWORDS_TAG
           
protected static java.lang.String LOCATION_TAG
           
protected static java.lang.String LOCUS_TAG
           
protected static java.util.regex.Pattern lp_ipi
           
protected static java.util.regex.Pattern lp_uniprot
           
protected static java.lang.String ORGANELLE_TAG
           
protected static java.lang.String ORGANISM_TAG
           
protected static java.lang.String PROTEIN_EXIST_TAG
           
protected static java.lang.String RC_LINE_TAG
           
protected static java.lang.String REFERENCE_TAG
           
protected static java.lang.String REFERENCE_XREF_TAG
           
protected static java.lang.String RP_LINE_TAG
           
protected static java.util.regex.Pattern rppat
           
protected static java.lang.String SOURCE_TAG
           
protected static java.lang.String START_SEQUENCE_TAG
           
protected static java.lang.String TAXON_TAG
           
protected static java.lang.String TITLE_TAG
           
static java.lang.String UNIPROT_FORMAT
          The name of this format
 
Constructor Summary
UniProtFormat()
           
 
Method Summary
 boolean canRead(java.io.BufferedInputStream stream)
          Check to see if a given stream is in our format.
 boolean canRead(java.io.File file)
          Check to see if a given file is in our format.
 java.lang.String getDefaultFormat()
          getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.
 SymbolTokenization guessSymbolTokenization(java.io.BufferedInputStream stream)
          On the assumption that the stream is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
 SymbolTokenization guessSymbolTokenization(java.io.File file)
          On the assumption that the file is readable by this format (not checked), attempt to guess which symbol tokenization we should use to read it.
 boolean readRichSequence(java.io.BufferedReader reader, SymbolTokenization symParser, RichSeqIOListener rlistener, Namespace ns)
          Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols.
 boolean readSequence(java.io.BufferedReader reader, SymbolTokenization symParser, SeqIOListener listener)
          Read a sequence and pass data on to a SeqIOListener.
 void writeSequence(Sequence seq, Namespace ns)
          Writes a sequence out to the outputstream given by beginWriting() using the default format of the implementing class.
 void writeSequence(Sequence seq, java.io.PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the default format.
 void writeSequence(Sequence seq, java.lang.String format, java.io.PrintStream os)
          writeSequence writes a sequence to the specified PrintStream, using the specified format.
 
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.HeaderlessFormat
beginWriting, finishWriting
 
Methods inherited from class org.biojavax.bio.seq.io.RichSequenceFormat.BasicFormat
getElideComments, getElideFeatures, getElideReferences, getElideSymbols, getLineWidth, getPrintStream, setElideComments, setElideFeatures, setElideReferences, setElideSymbols, setLineWidth, setPrintStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

UNIPROT_FORMAT

public static final java.lang.String UNIPROT_FORMAT
The name of this format

See Also:
Constant Field Values

LOCUS_TAG

protected static final java.lang.String LOCUS_TAG
See Also:
Constant Field Values

ACCESSION_TAG

protected static final java.lang.String ACCESSION_TAG
See Also:
Constant Field Values

DEFINITION_TAG

protected static final java.lang.String DEFINITION_TAG
See Also:
Constant Field Values

DATE_TAG

protected static final java.lang.String DATE_TAG
See Also:
Constant Field Values

SOURCE_TAG

protected static final java.lang.String SOURCE_TAG
See Also:
Constant Field Values

ORGANELLE_TAG

protected static final java.lang.String ORGANELLE_TAG
See Also:
Constant Field Values

ORGANISM_TAG

protected static final java.lang.String ORGANISM_TAG
See Also:
Constant Field Values

TAXON_TAG

protected static final java.lang.String TAXON_TAG
See Also:
Constant Field Values

GENE_TAG

protected static final java.lang.String GENE_TAG
See Also:
Constant Field Values

DATABASE_XREF_TAG

protected static final java.lang.String DATABASE_XREF_TAG
See Also:
Constant Field Values

PROTEIN_EXIST_TAG

protected static final java.lang.String PROTEIN_EXIST_TAG
See Also:
Constant Field Values

REFERENCE_TAG

protected static final java.lang.String REFERENCE_TAG
See Also:
Constant Field Values

RP_LINE_TAG

protected static final java.lang.String RP_LINE_TAG
See Also:
Constant Field Values

REFERENCE_XREF_TAG

protected static final java.lang.String REFERENCE_XREF_TAG
See Also:
Constant Field Values

AUTHORS_TAG

protected static final java.lang.String AUTHORS_TAG
See Also:
Constant Field Values

CONSORTIUM_TAG

protected static final java.lang.String CONSORTIUM_TAG
See Also:
Constant Field Values

TITLE_TAG

protected static final java.lang.String TITLE_TAG
See Also:
Constant Field Values

LOCATION_TAG

protected static final java.lang.String LOCATION_TAG
See Also:
Constant Field Values

RC_LINE_TAG

protected static final java.lang.String RC_LINE_TAG
See Also:
Constant Field Values

KEYWORDS_TAG

protected static final java.lang.String KEYWORDS_TAG
See Also:
Constant Field Values

COMMENT_TAG

protected static final java.lang.String COMMENT_TAG
See Also:
Constant Field Values

FEATURE_TAG

protected static final java.lang.String FEATURE_TAG
See Also:
Constant Field Values

START_SEQUENCE_TAG

protected static final java.lang.String START_SEQUENCE_TAG
See Also:
Constant Field Values

END_SEQUENCE_TAG

protected static final java.lang.String END_SEQUENCE_TAG
See Also:
Constant Field Values
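
The values of the tag constants above are not listed on this page. Assuming they hold the two-character line codes of the UniProt flat-file layout (an assumption, not confirmed here), a line can be split into its code and its data roughly as follows. The class UniProtLine is hypothetical and is not part of BioJava; it relies only on the flat-file convention that the code occupies the first two columns and the data starts at column 6.

```java
final class UniProtLine {
    final String code; // two-character line code, e.g. "ID" or "AC"
    final String data; // remainder of the line, starting at column 6

    UniProtLine(String code, String data) {
        this.code = code;
        this.data = data;
    }

    // Splits one flat-file line into its line code and its data.
    static UniProtLine parse(String line) {
        if (line.length() < 2) {
            throw new IllegalArgumentException("Line too short for a line code: " + line);
        }
        String code = line.substring(0, 2);
        // Data begins at column 6 (index 5); short lines such as "//" carry no data.
        String data = line.length() > 5 ? line.substring(5) : "";
        return new UniProtLine(code, data);
    }

    public static void main(String[] args) {
        UniProtLine parsed = parse("AC   P62894; P00006;");
        System.out.println(parsed.code + " -> " + parsed.data);
    }
}
```

Sequence data lines begin with blanks, so for those lines this sketch yields a code of two spaces; a real parser would treat that case separately.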

lp_uniprot

protected static final java.util.regex.Pattern lp_uniprot

lp_ipi

protected static final java.util.regex.Pattern lp_ipi

rppat

protected static final java.util.regex.Pattern rppat

dp_uniprot

protected static final java.util.regex.Pattern dp_uniprot

dp_ipi

protected static final java.util.regex.Pattern dp_ipi

fp

protected static final java.util.regex.Pattern fp

headerLine

protected static final java.util.regex.Pattern headerLine

Constructor Detail

UniProtFormat

public UniProtFormat()

Method Detail

canRead

public boolean canRead(java.io.File file)
                throws java.io.IOException
Checks whether the given file is in this format. Some formats can determine this from the filename, whilst others have to open the file and read it. A file is considered to be in UniProt format if its first line matches the UniProt pattern for the ID line.

Specified by:
canRead in interface RichSequenceFormat
Overrides:
canRead in class RichSequenceFormat.BasicFormat
Parameters:
file - the File to check.
Returns:
true if the file is readable by this format, false if not.
Throws:
java.io.IOException - in case the file is inaccessible.
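
A first-line check of this kind can be sketched with the standard library alone. The pattern below is a hypothetical approximation of a UniProt ID line (entry name, status, optional molecule type, length in AA); it is not the lp_uniprot pattern used by this class, and the class IdLineCheck is likewise illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

final class IdLineCheck {
    // Hypothetical pattern for illustration only; NOT the library's lp_uniprot.
    // Accepts e.g. "ID   CYC_BOVIN   Reviewed;   104 AA."
    // and the older "ID   CYC_BOVIN   STANDARD;   PRT;   104 AA."
    static final Pattern ID_LINE = Pattern.compile(
            "^ID\\s+\\S+\\s+[^;]+;\\s+(?:[^;]+;\\s+)?\\d+\\s+AA\\.\\s*$");

    // True if the line is non-null and matches the assumed ID-line pattern.
    static boolean matchesIdLine(String line) {
        return line != null && ID_LINE.matcher(line).matches();
    }

    // Opens the file, reads only its first line, and tests it.
    static boolean firstLineMatches(Path file) throws IOException {
        try (BufferedReader reader =
                     Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            return matchesIdLine(reader.readLine());
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("entry", ".dat");
        Files.write(tmp, "ID   CYC_BOVIN   Reviewed;   104 AA.\nAC   P62894;\n"
                .getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(firstLineMatches(tmp));
        Files.delete(tmp);
    }
}
```

Only the first line is read, so the cost of the check does not depend on the size of the file.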

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(java.io.File file)
                                           throws java.io.IOException
On the assumption that the file is readable by this format (not checked), attempts to guess which symbol tokenization should be used to read it. Formats that accept only one tokenization can return it without checking the file. Formats that accept multiple tokenizations decide for themselves how to choose. This implementation always returns a protein tokenizer.

Specified by:
guessSymbolTokenization in interface RichSequenceFormat
Overrides:
guessSymbolTokenization in class RichSequenceFormat.BasicFormat
Parameters:
file - the File whose symbol tokenization should be guessed.
Returns:
a SymbolTokenization to read the file with.
Throws:
java.io.IOException - if the file is unrecognisable or inaccessible.

canRead

public boolean canRead(java.io.BufferedInputStream stream)
                throws java.io.IOException
Checks whether the given stream is in this format. A stream is considered to be in UniProt format if its first line matches the UniProt pattern for the ID line.

Parameters:
stream - the BufferedInputStream to check.
Returns:
true if the stream is readable by this format, false if not.
Throws:
java.io.IOException - in case the stream is inaccessible.
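
When the input is a stream rather than a file, the first line has to be inspected without consuming it, so that a later parser still sees the whole entry. One way to do this, sketched here with the standard library only, is to mark the stream, read up to a fixed limit, and reset. The class StreamPeek, the PEEK_LIMIT value, and the simple "ID" prefix test are assumptions for illustration and are not taken from this class.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class StreamPeek {
    // Assumed upper bound on the number of bytes inspected for the first line.
    static final int PEEK_LIMIT = 2000;

    // Returns the first line of the stream and rewinds the stream afterwards.
    static String peekFirstLine(BufferedInputStream stream) throws IOException {
        stream.mark(PEEK_LIMIT);
        try {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            int bytesRead = 0;
            int b;
            // Never read more than PEEK_LIMIT bytes, so the mark stays valid.
            while (bytesRead < PEEK_LIMIT && (b = stream.read()) != -1 && b != '\n') {
                bytesRead++;
                if (b != '\r') {
                    line.write(b);
                }
            }
            return new String(line.toByteArray(), StandardCharsets.ISO_8859_1);
        } finally {
            stream.reset(); // put the stream back where it started
        }
    }

    // Simplified stand-in for a real ID-line pattern match.
    static boolean looksLikeUniProt(BufferedInputStream stream) throws IOException {
        return peekFirstLine(stream).startsWith("ID   ");
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "ID   CYC_BOVIN   Reviewed;   104 AA.\nAC   P62894;\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        System.out.println(looksLikeUniProt(in));
        System.out.println((char) in.read()); // the stream is still at its first byte
    }
}
```

Reading the bytes directly, rather than wrapping the stream in a reader, keeps the number of bytes consumed under the mark limit.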

guessSymbolTokenization

public SymbolTokenization guessSymbolTokenization(java.io.BufferedInputStream stream)
                                           throws java.io.IOException
On the assumption that the stream is readable by this format (not checked), attempts to guess which symbol tokenization should be used to read it. Formats that accept only one tokenization can return it without checking the stream. Formats that accept multiple tokenizations decide for themselves how to choose. This implementation always returns a protein tokenizer.

Parameters:
stream - the BufferedInputStream whose symbol tokenization should be guessed.
Returns:
a SymbolTokenization to read the stream with.
Throws:
java.io.IOException - if the stream is unrecognisable or inaccessible.

readSequence

public boolean readSequence(java.io.BufferedReader reader,
                            SymbolTokenization symParser,
                            SeqIOListener listener)
                     throws IllegalSymbolException,
                            java.io.IOException,
                            ParseException
Read a sequence and pass data on to a SeqIOListener.

Parameters:
reader - The stream of data to parse.
symParser - A SymbolTokenization defining a mapping from character data to Symbols.
listener - A listener to notify when data is extracted from the stream.
Returns:
a boolean indicating whether or not the stream contains any more sequences.
Throws:
IllegalSymbolException - if it is not possible to translate character data from the stream into valid BioJava symbols.
java.io.IOException - if an error occurs while reading from the stream.
ParseException - if the data cannot be parsed in this format.

readRichSequence

public boolean readRichSequence(java.io.BufferedReader reader,
                                SymbolTokenization symParser,
                                RichSeqIOListener rlistener,
                                Namespace ns)
                         throws IllegalSymbolException,
                                java.io.IOException,
                                ParseException
Reads a sequence from the given buffered reader using the given tokenizer to parse sequence symbols. Events are passed to the listener, and the namespace used for sequences read is the one given. If the namespace is null, then the default namespace for the parser is used, which may depend on individual implementations of this interface.

Parameters:
reader - the input source
symParser - the tokenizer which understands the sequence being read
rlistener - the listener to send sequence events to
ns - the namespace to read sequences into.
Returns:
true if there is more to read after this, false otherwise.
Throws:
IllegalSymbolException - if the tokenizer couldn't understand one of the sequence symbols in the file.
java.io.IOException - if there was a read error.
ParseException - if the data cannot be parsed in this format.
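
The boolean return value lends itself to a do/while loop: read one entry, then continue only while the reader reports more input. The following sketch shows that contract with the standard library only. RecordLoop and its methods are hypothetical and do not use BioJava; they assume that entries end with a "//" line, as in UniProt flat files.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class RecordLoop {
    // Reads the lines of one entry into 'record', stopping at the "//" terminator.
    // Returns true if non-blank input follows the entry, false otherwise.
    static boolean readRecord(BufferedReader reader, List<String> record) throws IOException {
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.startsWith("//")) {
                break;
            }
            record.add(line);
        }
        // Skip trailing whitespace and peek one character ahead.
        while (true) {
            reader.mark(1);
            int c = reader.read();
            if (c == -1) {
                return false; // end of input: nothing more to read
            }
            if (!Character.isWhitespace(c)) {
                reader.reset(); // un-read the first character of the next entry
                return true;
            }
        }
    }

    // Drives readRecord in the usual do/while pattern and collects all entries.
    static List<List<String>> readAll(String text) {
        List<List<String>> records = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            boolean more;
            do {
                List<String> record = new ArrayList<>();
                more = readRecord(reader, record);
                if (!record.isEmpty()) {
                    records.add(record);
                }
            } while (more);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }

    public static void main(String[] args) {
        String twoEntries = "ID   A\nSQ\n//\nID   B\n//\n";
        System.out.println(readAll(twoEntries).size());
    }
}
```

With a do/while loop the body runs at least once, so empty input has to be handled explicitly; here empty records are simply discarded.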

writeSequence

public void writeSequence(Sequence seq,
                          java.io.PrintStream os)
                   throws java.io.IOException
writeSequence writes a sequence to the specified PrintStream, using the default format.

Parameters:
seq - the sequence to write out.
os - the PrintStream to write to.
Throws:
java.io.IOException - if an error occurs while writing.

writeSequence

public void writeSequence(Sequence seq,
                          java.lang.String format,
                          java.io.PrintStream os)
                   throws java.io.IOException
writeSequence writes a sequence to the specified PrintStream, using the specified format.

Parameters:
seq - a Sequence to write out.
format - a String indicating which sub-format of those available from a particular SequenceFormat implementation to use when writing.
os - a PrintStream object.
Throws:
java.io.IOException - if an error occurs.

writeSequence

public void writeSequence(Sequence seq,
                          Namespace ns)
                   throws java.io.IOException
Writes a sequence to the output stream given by beginWriting(), using the default format of the implementing class. In general, if a namespace is given, sequences are written with that namespace; otherwise they are written with the default namespace of the implementing class (usually the namespace of the sequence itself). In this implementation the namespace is ignored, as UniProt has no concept of it. If the sequence passed to this method is not a RichSequence, the method attempts to convert it using RichSequence.Tools.enrich(). This conversion is not guaranteed to be perfect, so it is better to use RichSequence objects from the start.

Parameters:
seq - the sequence to write
ns - the namespace to write it with (ignored by this format)
Throws:
java.io.IOException - if the sequence could not be written.

getDefaultFormat

public java.lang.String getDefaultFormat()
getDefaultFormat returns the String identifier for the default sub-format written by a SequenceFormat implementation.

Returns:
a String.