GSOC 2010 - Adding RNA Support to Jalview: Rfam

Wednesday, August 18, 2010

Demonstration of GSoC Project

The features that I developed for the GSoC project will be available in the next Jalview release. You can also contact me and I can send you a jar file with the new features, or you can download the code from the Google Code hosting page for Jalview and see my post on how to set up Eclipse with Jalview. The code submitted to Google is the same as rev 32 on the Google Code hosting site.

Addendum: Jim posted a comment with a link to the webstart version.

You can fetch sequences from RFAM via the sequence fetcher. Go to "File > Fetch Sequence(s)..." and a dialog box will appear.

From the database drop-down menu, select RFAM (Full) for the "Full" alignment from RFAM or RFAM (Seed) for the "Seed" alignment. Seed alignments are the original alignments constructed to create a covariance model for searching databases. Full alignments are the result of a search using the covariance model against the sequence database. For this demonstration, I used RFAM (Seed), but the steps for RFAM (Full) are very similar.

Click the "Example" button to load an alignment name. Click the "OK" button to fetch the alignment.

A new dialog box will appear with the alignment. The RFAM sequence fetcher retrieves files in Stockholm format, so Jalview can interpret the secondary structure information in the file. The secondary structure information in the file is displayed in WUSS notation and helices that are determined from this information are displayed as blue arrows in the Annotation panel.

You can color the helices with the "By RNA helices" color option. Go to "Colour > By RNA helices."

The helices should now be colored.

To view the consensus logo in the Annotation panel, go to "View > Autocalculated Annotation > Show Consensus logo". The coloring of the logo will change based on which color scheme is selected.

Close up of Consensus logo when "By RNA helices" or "Purine/Pyrimidine" coloring is selected.

To change the color scheme to "Purine/Pyrimidine", go to "Colour > Purine/Pyrimidine".

Monday, August 9, 2010

Fetching sequences from Rfam

I added the ability to fetch sequences from Rfam. Jim recommended that I do some refactoring of the code to reuse methods to fetch sequences from Pfam, which is a database for protein families, rather than RNA families. Rfam was designed to be similar to Pfam, so it wasn't too hard to add the ability to fetch sequences from Rfam to Jalview.

Under the guidance of Jim, I created an Xfam class in the jalview.ws package, which the Rfam and Pfam classes extend. (Side note: there is an Xfam blog about new developments in the Rfam and Pfam databases.) Then I added RfamSeed and RfamFull classes, similar to the ones for Pfam. These contain the methods to fetch sequences from the "seed" and "full" alignments, respectively, which are available on Rfam in Stockholm format. I had to modify the names of some methods to keep things consistent. In SequenceFetcher.java (in package jalview.ws), I called addDBRefSourceImpl() for RfamSeed and RfamFull to enable calling Rfam sequence retrieval from the Jalview menu.
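The class layout described above can be sketched roughly as follows. This is only an illustration of the idea (a shared base class, with small subclasses that supply what differs); every class name, method name, and the example.org host below are made up for the sketch and are not the actual Jalview code or the real Rfam address.

```java
// Hypothetical sketch of the hierarchy; not Jalview's actual classes.
abstract class XfamSketch {
    /** URL text up to the point where the accession is appended (assumed shape). */
    protected abstract String urlPrefix();

    /** Name that would be shown in the database drop-down menu. */
    protected abstract String dbName();

    /** Shared by every subclass: only the prefix differs between databases. */
    String queryUrl(String accession) {
        return urlPrefix() + accession.trim().toUpperCase();
    }
}

abstract class RfamSketch extends XfamSketch {
    // Placeholder host: the real Rfam URL is not given in this post.
    static final String HOST = "http://rfam.example.org/";
}

class RfamSeedSketch extends RfamSketch {
    @Override
    protected String urlPrefix() { return HOST + "alignment?alnType=seed&acc="; }

    @Override
    protected String dbName() { return "RFAM (Seed)"; }
}

class RfamFullSketch extends RfamSketch {
    @Override
    protected String urlPrefix() { return HOST + "alignment?alnType=full&acc="; }

    @Override
    protected String dbName() { return "RFAM (Full)"; }
}

class XfamSketchDemo {
    public static void main(String[] args) {
        // Illustrative accession only.
        XfamSketch[] sources = { new RfamSeedSketch(), new RfamFullSketch() };
        for (XfamSketch s : sources) {
            System.out.println(s.dbName() + " -> " + s.queryUrl("RF00001"));
        }
    }
}
```

A Pfam branch would slot in the same way, with its own host and prefixes, which is the point of pulling the shared logic up into one parent.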

To fetch sequences from Rfam (and Pfam), Jalview accesses stable URLs that can be used to get the alignments. In the RfamSeed and RfamFull classes, part of the URL is hardcoded with the correct variables for the query string (I learned this while I was creating the classes; see the Wikipedia page on CGI and QUERY_STRING).

The variables for the Rfam website:

'acc': followed by "=" and the accession number will give you the corresponding family.
'id': followed by "=" and the ID name will give you the corresponding family.

'alnType': alignment type, can be 'seed' or 'full'

'nseLabels': toggle for species names, can be 0 or 1

'format': file format, can be 'stockholm', 'pfam', 'fasta' or 'fastau'.
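As a rough illustration of how the variables listed above combine into a query string, here is a minimal sketch. The base URL is a placeholder (the real Rfam address is not given in this post), the class and method names are hypothetical, and "RF00001" is used purely as an example accession. Which value of 'nseLabels' turns the species names on is not stated above, so the sketch just passes 0 or 1 through.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Jalview.
class RfamQuerySketch {
    // Placeholder endpoint, assumed for illustration only.
    static final String BASE = "http://rfam.example.org/family/alignment";
    static final List<String> FORMATS =
            Arrays.asList("stockholm", "pfam", "fasta", "fastau");

    static String build(String acc, String alnType, int nseLabels, String format) {
        if (!alnType.equals("seed") && !alnType.equals("full")) {
            throw new IllegalArgumentException("alnType must be seed or full");
        }
        if (nseLabels != 0 && nseLabels != 1) {
            throw new IllegalArgumentException("nseLabels must be 0 or 1");
        }
        if (!FORMATS.contains(format)) {
            throw new IllegalArgumentException("unknown format: " + format);
        }
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("acc", acc);
        params.put("alnType", alnType);
        params.put("nseLabels", String.valueOf(nseLabels));
        params.put("format", format);

        StringBuilder url = new StringBuilder(BASE);
        char sep = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            url.append(sep).append(e.getKey()).append('=').append(encode(e.getValue()));
            sep = '&';
        }
        return url.toString();
    }

    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(build("RF00001", "seed", 0, "stockholm"));
    }
}
```

To query by name instead of accession, the 'acc' entry would be swapped for 'id' in the same way.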

Monday, August 2, 2010

MAFFT - multiple sequence alignment

One of my coworkers let me know about MAFFT, a multiple sequence alignment program. He thought that it might be useful for my GSoC project. I found out that it uses a Jalview applet to display the results! Jalview is everywhere! Right now I know that Pfam, Rfam, Clustal, and MAFFT use Jalview applets to display their alignment results...

Addendum: I noticed right after I posted this that one of the alignments in the example file that Jalview has was generated by MAFFT! Shows how observant I am...

Wednesday, June 2, 2010

Parsing WUSS notation of RNA secondary structure annotation

A key part of this project is to parse the secondary structure line of Stockholm files so that it can be interpreted for coloring schemes. I have been adding mini-goals as appropriate. I will probably also need to add code to check that the sequence length and secondary structure length are the same, and that the numbers of opening and closing brackets match.
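The two checks mentioned above could look something like the sketch below. It assumes the consensus structure sits on "#=GC SS_cons" lines, as is standard in Stockholm files, and that an interleaved file may split that line into several chunks. The class and method names are hypothetical, and the tiny alignment is invented for the example.

```java
// Hypothetical helper; not Jalview's Stockholm parser.
class SsConsCheckSketch {
    static final String OPEN = "<([{";
    static final String CLOSE = ">)]}";

    /** Concatenates every "#=GC SS_cons" chunk found in the file text. */
    static String extractSsCons(String stockholm) {
        StringBuilder ss = new StringBuilder();
        for (String line : stockholm.split("\\r?\\n")) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length == 3 && fields[0].equals("#=GC")
                    && fields[1].equals("SS_cons")) {
                ss.append(fields[2]);
            }
        }
        return ss.toString();
    }

    /** Returns null when the line is consistent, else a description of the problem. */
    static String check(String alignedSeq, String ss) {
        if (alignedSeq.length() != ss.length()) {
            return "length mismatch: " + alignedSeq.length() + " vs " + ss.length();
        }
        int depth = 0; // opening brackets seen and not yet closed
        for (int i = 0; i < ss.length(); i++) {
            char c = ss.charAt(i);
            if (OPEN.indexOf(c) >= 0) {
                depth++;
            } else if (CLOSE.indexOf(c) >= 0 && --depth < 0) {
                return "unmatched closing bracket at column " + i;
            }
        }
        return depth == 0 ? null : depth + " unclosed opening bracket(s)";
    }

    public static void main(String[] args) {
        String file = "# STOCKHOLM 1.0\n"
                + "seq1          GGGAAAUCCC\n"
                + "#=GC SS_cons  <<<____>>>\n"
                + "//\n";
        String problem = check("GGGAAAUCCC", extractSsCons(file));
        System.out.println(problem == null ? "consistent" : problem);
    }
}
```

Tracking the running depth rather than only comparing totals also catches a closing bracket that appears before its partner, which a plain count would miss.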

WUSS notation is used in RNA Stockholm files to indicate secondary structure. WUSS notation can support more characters than I thought, but Rfam uses the simplified version that the covariance modeling program Infernal uses. The description of Rfam on the Janelia Farm page is:

Rfam is a collection of multiple sequence alignments and covariance models covering many common non-coding RNA families. The main use of Rfam is as a source of RNA multiple alignments with consensus secondary structure annotation in a consistent format. In conjunction with the Infernal software package, Rfam covariance models (CMs) can be used to search genomes or other DNA sequence databases for homologs to known structural RNA families.

WUSS notation uses <>, (), [], and {} to indicate base pairs and ':', ',', '_', '-', '.', and '~' for single stranded columns. Each type of symbol has a subtle meaning, but for Infernal the structure annotation line only needs to indicate which columns are base paired to each other. Thus, full WUSS notation is not necessary, and a simple minimal annotation uses <> to indicate base pairs and '.' for single stranded positions of the alignment.

In more detail taken from the Infernal user guide:

Base pairs: the different symbols indicate different depth
*<> for simple terminal stems
*() for "internal" helices enclosing a multifurcation of all terminal stems
*[] for internal helices enclosing a multifurcation that includes at least one annotated () stem already
*{} for all internal helices enclosing deeper multifurcations

Hairpin loops
*indicated by underscores '_'
*Simple stem loops example: <<<____>>>

Bulge, interior loops
*indicated by dashes '-'

Multifurcation loops
*indicated by commas ','
*example: <<<___>>>,,<<<__>>>

External residues completely outside structure
*indicated by colons ':'

Insertions
* '.' indicates insertions relative to a known structure
* '~' indicates that a local structural alignment left regions of target and query unaligned

Pseudoknots
* pairs of upper case/lower case letters
* example: <<<<_AAAA____>>>>aaaa

Things that I am thinking about:

-I need to interpret WUSS notation in a general way. It shouldn't be too difficult, but it is necessary since the same structure can be written in multiple ways. An example from the Infernal user guide: <<<<....>>>> and ((((____)))) and <(<(._._)>)> all indicate a four base stem with a four base loop.

-How should I store the secondary structure line so that it will be easily interpreted to implement coloring schemes?

Potentially I can store pairs of positions like how the disulfide bond positions are stored as annotations (Jim pointed this one out). I also need to keep in mind that bulges might exist, so I can't just interpret a run of the same type of bracket as part of the same stem. VARNA interprets bulges just fine, so I don't have to worry about that. An example of a complicated structure with a bulge:

<<<<......<<<< <<<<.....>>>>..>>>>......<<<<...>>>>....>>>>

-How can I make sure that there are 4 stems instead of 3? I can't simply scan through from left to right or eat away at both ends at the same time. It looks like the RALEE mode for Emacs handles bulges just fine, based on this example in the readme:


0123456789012345678901234
.<<<<<...>>.<<...>>..>>>.


   Column  1 pairs with 23
           2 pairs with 22
           3 pairs with 21
           4 pairs with 10
           5 pairs with  9
          12 pairs with 18
          13 pairs with 17

The image is from VARNA. Note the numbering in the image starts at 1 instead of 0.

RALEE is written in Emacs Lisp, so I need to look up some basics in Lisp before I can feel confident that I'm interpreting the code correctly! I think that this code will cut down on my thinking time, however.
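The column pairing in the RALEE readme example can be reproduced with a plain stack: push the column of every opening bracket, and pop when a closing bracket appears. The sketch below is not RALEE's or Jalview's code, just one way to do it, and the class and method names are hypothetical. It also counts helices under a strict assumption: a pair (i, j) extends the previous helix only when the previous pair was (i+1, j-1), so a bulge starts a new helix.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch only: not RALEE's or Jalview's code.
class WussPairsSketch {
    static final String OPEN = "<([{";
    static final String CLOSE = ">)]}";

    /** Returns {open, close} column pairs (0-based) in the order they close. */
    static List<int[]> pairs(String ss) {
        Deque<Integer> stack = new ArrayDeque<>();
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < ss.length(); i++) {
            char c = ss.charAt(i);
            if (OPEN.indexOf(c) >= 0) {
                stack.push(i);
            } else if (CLOSE.indexOf(c) >= 0) {
                if (stack.isEmpty()) {
                    throw new IllegalArgumentException("unmatched close at column " + i);
                }
                int open = stack.pop();
                // The popped bracket must be the same type as the closing one.
                if (OPEN.indexOf(ss.charAt(open)) != CLOSE.indexOf(c)) {
                    throw new IllegalArgumentException(
                            "bracket types differ at columns " + open + " and " + i);
                }
                out.add(new int[] { open, i });
            }
            // Every other character is treated as unpaired.
        }
        if (!stack.isEmpty()) {
            throw new IllegalArgumentException("unmatched open at column " + stack.peek());
        }
        return out;
    }

    /** Counts runs of directly stacked pairs; any gap (bulge, loop) starts a new run. */
    static int countHelices(List<int[]> pairs) {
        int helices = 0;
        int[] prev = null;
        for (int[] p : pairs) {
            boolean stacked = prev != null
                    && p[0] == prev[0] - 1 && p[1] == prev[1] + 1;
            if (!stacked) {
                helices++;
            }
            prev = p;
        }
        return helices;
    }

    public static void main(String[] args) {
        List<int[]> found = pairs(".<<<<<...>>.<<...>>..>>>.");
        for (int[] p : found) {
            System.out.println(p[0] + " pairs with " + p[1]);
        }
        System.out.println(countHelices(found) + " helices");
    }
}
```

Because the stack does the matching, a run of identical brackets never has to be treated as a single stem up front. Whether two runs separated by a bulge should be merged into one stem for coloring is a separate policy decision that this sketch does not make.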

To do
-check that secondary structure line and sequence are the same length. Does Jalview already do this?
-change all bracket types to () for VARNA (I just noticed that VARNA only likes (), not <> for base pairing! )
-convert all WUSS symbols to something simple, like how Jalview already does for protein secondary structure (simple helices and sheets)
-Need to figure out how to detect pseudoknots
-Add support for error checking when a user adds a base pair annotation. Make sure same number of column groups are selected
-How will colors cycle for different numbers of stems?
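For the bracket-conversion and simplification items in the list above, a minimal sketch might map every opening bracket to '(', every closing bracket to ')', and everything else to '.'. The names are hypothetical, and the sketch makes one loud assumption: pseudoknot letters are flattened to '.', so that information is lost. A separate check only reports whether any such letters are present.

```java
// Hypothetical helper; not Jalview's or VARNA's code.
class WussToDotBracketSketch {
    /** Rewrites a WUSS string using only '(', ')' and '.'. */
    static String normalize(String wuss) {
        StringBuilder sb = new StringBuilder(wuss.length());
        for (char c : wuss.toCharArray()) {
            if ("<([{".indexOf(c) >= 0) {
                sb.append('(');
            } else if (">)]}".indexOf(c) >= 0) {
                sb.append(')');
            } else {
                // Loops, bulges, insertions and pseudoknot letters all become '.'.
                sb.append('.');
            }
        }
        return sb.toString();
    }

    /** True if the line uses letters, which WUSS reserves for pseudoknot pairs. */
    static boolean hasPseudoknotLetters(String wuss) {
        for (char c : wuss.toCharArray()) {
            if (Character.isLetter(c)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The three spellings of the same stem-loop from the Infernal user guide.
        String[] spellings = { "<<<<....>>>>", "((((____))))", "<(<(._._)>)>" };
        for (String s : spellings) {
            System.out.println(s + " -> " + normalize(s));
        }
    }
}
```

All three spellings of the four base stem from the Infernal user guide collapse to the same string, which is what makes a general interpretation possible.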
