A web-app for characterizing proteins using molecular evolution and phylogeny

Example Analyses

Load an existing analysis to explore MolEvolvR results right away without inputting anything.

liasix

FASTA upload: Full analysis, for 6 bacterial proteins

5uNQ9l

FASTA upload: Full analysis, for 1 viral protein

7JtOTB

BLAST results upload, for 1 eukaryotic protein

v7omZ3

InterProScan results upload, for 1 viral protein


Overview and Features

MolEvolvR allows users to start with protein(s) of interest and perform the full analysis (1, 3, & 4 below), only protein characterization (1 & 3), only homology searches (1 & 4), or start with external outputs from BLAST or InterProScan for further analysis, summarization, and visualization (2, 3, & 4). MolEvolvR is interactive, queryable, and customizable.

Abstract

Studying proteins through the lens of evolution can reveal conserved features, lineage-specific variants, and their potential functions. MolEvolvR (https://jravilab.org/molevolvr) is a novel web-app enabling researchers to visualize the molecular evolution of their proteins of interest in a phylogenetic context across the tree of life, spanning all superkingdoms. The web-app accepts multiple input formats — protein/domain sequences, homologous proteins, or domain scans — and, using a general-purpose computational workflow, returns detailed homolog data and dynamic graphical summaries (e.g., phylogenetic trees, multiple sequence alignments, domain architectures, domain proximity networks, phyletic spreads, co-occurrence patterns across lineages). Thus, MolEvolvR is a powerful, easy-to-use web interface for computational protein characterization.

How to Cite

MolEvolvR: A web-app for characterizing proteins using molecular evolution and phylogeny
Jacob D Krol*, Joseph T Burke*, Samuel Z Chen*, Lo Sosinski*, Faisal S Alquaddoomi, Evan P Brenner, Ethan P Wolfe, Vincent P Rubinetti, Shaddai Amolitos, Kellen M Reason, John B Johnson, Janani Ravi. [*Co-primary]
bioRxiv 2022. doi: https://doi.org/10.1101/2022.02.18.461833 ; web-app: http://jravilab.org/molevolvr

Submit your sequence or preprocessed data here and use your retrieval code (custom URL) to view the results.


Analysis

Past Analyses

Jobs you have previously submitted on this device will appear here.

Results Summary

An overview of the protein analysis. To view the full results, explore the additional tabs (or click the buttons!)

Domain Architecture

Visualizations and summaries for protein domains.

Phylogeny

Visualizations for protein evolution

Data

Summary table of proteins including domain architectures, phylogeny, and homologs, when applicable.

Query Data

Input data, additional metadata, and preliminary analyses of query protein(s).

The data table provides a summary of the sequences submitted, or "queried", for analysis. The preview shown can be extended by using "Add/remove column(s)" to see info about other taxonomic classes as well as domain architecture codes from databases other than the default (Pfam).

Uploaded amino acid FASTA sequence(s).


A heatmap of submitted sequences and their respective taxonomic lineages.


Summary and visualizations of protein motifs/subunits (domains) and their configurations within the query protein(s) (domain architectures).

Homolog Data

Full set of homologs of query sequences, including their lineage and domain architecture info.

Domain Architecture

Visualizations and analyses of all query and homologous protein domains, structural or functional subunits, and their architectures.

Select a row to see which lineages the domain architecture is present in

Messy graph? Try re-arranging the vertices by clicking and dragging the vertices! Also try zooming in and out using your scroll wheel!

Phylogeny

Visualizations of phyletic patterns, sequence similarity, and evolution of related proteins.



Help and FAQ docs for MolEvolvR

This website is free and open to all users, no login required.

This help page shows how to use MolEvolvR to its fullest potential.

Coming Soon: Videos demonstrating what you can do with MolEvolvR, and how to set up custom analyses and navigate the app after loading your results.

The UI: Workflow and Usage

Proteins are the functional units of cellular processes. The goal of MolEvolvR is to characterize proteins by their sequence, structure, function, and phylogeny by using sequence similarity, domain architecture, lineages/phyletic spread, and more.

Published use cases/Testing

You can explore a sample set of phage shock proteins (PSP) (e.g., lia operon from Bacillus subtilis here), and the full set of PSP proteins here. We analyzed the homology, domain architectures, and phylogeny of these proteins (and their genomic contexts) to show their prevalence in other organisms and to detail how variations of this phage shock stress response system are present across many lineages.

We have applied the approach underlying MolEvolvR to study diverse systems, including:

  • Nutrient acquisition systems in Staphylococcus aureus [tcyABCP, gis-gt]
  • A novel phage defense system in Vibrio cholerae [Vch1]
  • Surface layer proteins in Bacillus anthracis [SLPs]
  • Helicase operators in bacteria [DciA]
  • Internalins in Listeria [InlP]

How to use MolEvolvR

You can provide a variety of protein inputs, including:

  1. Protein sequence(s) in FASTA format
  2. Protein accession number(s) in NCBI and/or UniProt format
  3. Protein multiple sequence alignment (MSA) in FASTA/Pearson format
  4. Protein BLAST output in .tsv format
  5. InterProScan output in .tsv format

With any of these inputs, proteins of interest will be analyzed to identify homologs, determine domain architectures, and delineate phyletic spreads. These analyses provide insights into the biological role(s) of the protein(s) of interest within organisms, as well as trace their evolution.

MolEvolvR can perform 4 types of analyses:

  1. Domain architecture, which allows identification of protein domains and exploration of domain interactions and domain co-occurrences
  2. Identification of homologs, which reveals patterns within and across species
  3. Phylogenetic analysis, which shows the phyletic spread of proteins across the tree of life, a multiple sequence alignment, and a phylogenetic tree
  4. Visualization and analysis of results from BLAST suite, InterProScan, and multiple sequence alignments

Enter data

Accession Numbers and FASTA (full analysis)

To begin, enter the amino acid FASTA sequence(s) or accession numbers of your protein(s) of interest into the Start Analysis tab. You can also upload a file containing multiple FASTA sequences (.fa, .faa, .fasta) or accession numbers (.csv). Up to 100 protein sequences per job are accepted. For analyses with more than 100 proteins, please contact us.
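
Before uploading, it can help to confirm that a file stays within the 100-sequence limit. The following is a minimal sketch in Python, not part of MolEvolvR; the function name `count_fasta_records` and the constant `MAX_SEQUENCES` are illustrative. It simply counts header lines that start with `>`.

```python
MAX_SEQUENCES = 100  # per-job limit stated above; the constant name is illustrative


def count_fasta_records(text: str) -> int:
    """Count FASTA records by counting lines that begin with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


example = """>OHS91782.1 16S rRNA pseudouridine(516) synthase [Staphylococcus aureus]
MRIDKFLANMGVGTRNEVKQLLKKGLVNVNEQVIKSPKTHIEPENDKITVRGELIEYIENVYIMLNKPKG
>sp|P01189|COLI_HUMAN Pro-opiomelanocortin OS=Homo sapiens OX=9606 GN=POMC PE=1 SV=2
MPRSCCSRSGALLLALLLQASMEVRGWCLESSQCQDLTTESNLLECIRACKPDLSAETPM
"""

n = count_fasta_records(example)
within_limit = n <= MAX_SEQUENCES
print(f"{n} sequence(s); within limit: {within_limit}")
```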

Multiple Accessions/FASTA of homologs

If you have a pre-existing set of homologous proteins, you can enter/upload the multiprotein FASTA or list of accession numbers. MolEvolvR can also use an MSA in FASTA/Pearson format generated through external programs such as Clustal Omega, ClustalW, Kalign, or MUSCLE.

Advanced Options

Advanced options allow you to customize your analysis.

Phylogenetic Analysis

Selecting Phylogenetic Analysis will analyze a set of known homologous proteins. Because this type of analysis already uses homologs, the homology search option will be disabled.

Homology Search

Selecting Homology Search will identify homologs (related proteins) for each input protein. This pairs well with domain architecture searches that can be obtained for all homologous hits for each query.

Domain Architecture

Selecting Domain Architecture will generate domain architecture (including sequence-structure motifs/domains such as Pfam, Hamap and SignalP, disorder predictions using MobiDBlite, and secondary structure/cellular localization using Phobius and Coils) for the query proteins provided. If selected alone, no other analysis will be performed.

Split Queries by Domain

This option allows you to search for proteins with homologous domains to those found within your query proteins. This allows for a broader search of remote homologs that would be missed by a standard whole-protein search. Phylogenetic searches, domain architecture, and characterization are then performed.

BLAST Parameters

For analyses that include a homology search, you can adjust parameters like database (default refseq), maximum hits (default 100), and E-value (default 0.00001).

Customizing your analysis

Fully characterize proteins of interest

You can start your analysis with a full list of accession numbers or FASTA files for protein(s) of interest. MolEvolvR gathers homologs of your input protein(s), and then performs domain architecture analysis on all homolog and query sequences. You have the option to perform only Phylogenetic Analysis or Domain Architecture if you don’t need both.

Analyze external data

You can start your analysis from uploaded NCBI BLAST or InterProScan results. Web-BLAST results allow you to determine homolog similarity, domain architecture, and/or phylogeny. InterProScan results can be used to summarize and visualize domains and (if accession numbers are provided) phylogeny.

BLAST outputs from the NCBI BLAST web-interface

BLAST is available through NCBI’s website.

You can start your analysis with data from a previous BLAST run. These data are run through BLASTClust to cluster similar sequences among the retrieved homologs. The Phylogenetic Analysis and Domain Architecture options are then applied.

To ensure compatibility with the MolEvolvR Start Analysis tab, follow these guidelines.

Step 1: Enter Accession Numbers/FASTA sequences and choose parameters

First, enter your Accession Number(s) or FASTA sequence(s) into the “Enter Query Sequence” box.

For the database parameter, we support either the non-redundant database (nr) or the reference sequence collection (refseq_proteins). The refseq_proteins dataset is a high quality, non-redundant subset of protein records curated by NCBI staff. Meanwhile, nr is a larger, non-redundant set that includes many more sequences but is not necessarily vetted for quality and accuracy. If you would like to further filter your results based on lineages (e.g., species, genus, family, kingdom), enter the name/taxID in the Organism field and toggle the box to include/exclude those results in your search.

Next, select which algorithm to run. If you don’t know details of your protein, ‘BLASTP’ is a great place to start. If you are interested in identifying remote homologs, we suggest using ‘PSI-BLAST’. If your protein has domains of interest, ‘DELTA-BLAST’ works very well.

Creating a job title for the run is optional and for your personal convenience.

Under the expandable “Algorithm parameters” section, the defaults for max target sequences are typically sufficient. The expect threshold, or E-value, is the number of matches expected by chance alone; hits with E-values greater than the threshold are filtered out. We suggest 1e-5 (1 × 10^-5, or 0.00001) for general searches. Double-check your parameters across the page, then click the BLAST button.

Summary of NCBI BLAST submission parameters

  1. Accession Number(s) or FASTA sequence(s)
  2. Database
     1. RefSeq. This database contains only NCBI-curated, high quality, non-redundant protein sequences.
     2. NR. This database contains a much larger pool of uncurated, variable quality, non-redundant protein sequences.

Step 2: Downloading BLAST results

Once your BLAST search is complete, at the end of the RID row towards the top of the page, there will be a Download All option with a dropdown menu to download results. Click on the Download All button and select the Hit Table (csv) option. You can directly upload these .csv result files to MolEvolvR. If the first column of the results .csv does not include accession numbers, you will also need to provide the query sequence(s) that you used to run BLAST as a second file (.fa, .faa, or .fasta format).

If you are performing a PSI-BLAST, you will have the option to run additional iterative searches upon each search’s completion. Further iterations will find more remote homologs, so it is recommended you run several iterations before downloading the Hit Table.

BLAST provides information in many formats for your protein homologs, which we encourage you to review. However, MolEvolvR requires the Hit Table (csv) for analysis.

Alternatively, you may upload command line BLAST results with these columns specified:

query acc.ver, subject acc.ver, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score, % positives
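
If you want to inspect a command-line BLAST table before uploading it, the sketch below reads tab-separated rows in the column order listed above and keeps only hits at or below an E-value cutoff. It is a hedged illustration in Python: the short column labels, the function name `read_hits`, and the two sample rows are made up for this example, and it assumes a tab-separated file with no header row.

```python
import csv
import io

# Short labels for the 13 columns listed above, in the same order.
# The label spellings are illustrative, not a MolEvolvR requirement.
COLUMNS = [
    "query_acc", "subject_acc", "pct_identity", "aln_length", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end",
    "evalue", "bit_score", "pct_positives",
]


def read_hits(handle, max_evalue=1e-5):
    """Yield hits as dicts, keeping rows with E-value <= max_evalue."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        if hit["evalue"] <= max_evalue:
            yield hit


# Two made-up rows, for illustration only.
sample = (
    "QUERY1\tSUBJ_A\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410\t99.0\n"
    "QUERY1\tSUBJ_B\t31.0\t80\t50\t2\t10\t90\t5\t85\t0.02\t35.4\t48.0\n"
)
kept = list(read_hits(io.StringIO(sample)))
print([h["subject_acc"] for h in kept])
```

Only the first sample row passes the default cutoff of 1e-5; the second, with an E-value of 0.02, is dropped.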

Check out the BLAST tutorials to learn more about BLAST.


InterProScan outputs from the Iprscan5 web-interface

InterProScan is available through EBI’s website.

If you have already identified your protein’s domains through InterProScan, you can upload the output to MolEvolvR for a customizable visual summary of the information.

To ensure compatibility with the MolEvolvR Start Analysis tab, follow these guidelines.

Step 1: Enter FASTA sequence

Input your protein’s FASTA sequence by copy/pasting into the box or uploading the FASTA file with the Choose file button. If the sequence is valid, InterProScan will display a green check mark in the bottom right corner of the input box. You can use the Advanced options dropdown to select specific databases. When finished, click the Search button to begin.

When the search completes (another green check mark will appear under Status), click on your job submission to view the output.

Step 2: Download InterProScan results & upload into MolEvolvR

Under the blue Export dropdown menu, download the results in the .tsv format. Upload the .tsv file to MolEvolvR for visualization and further analysis. If the first column of the results .tsv does not include accession numbers, you will also need to provide the query protein sequence that you used to run InterProScan as a second file (amino acid sequence in .fa, .faa, or .fasta format).

Check out the InterProScan tutorials to learn more about their algorithms and search parameters.


Retrieve analysis results

After submitting proteins to MolEvolvR, take a break! Runtime depends on server load and on the complexity of your submission (Full Analysis taking the most time), but you can expect this to take 10 minutes or more.

You will receive a six character alphanumeric analysis code after submission. We recommend saving this code before you close the app. You will need to enter it later on the Retrieve Results tab to view your results.
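
If you keep analysis codes in your own notes or scripts, a quick format check can catch copy/paste slips. This Python sketch assumes that a code is exactly six ASCII letters or digits, as described above; the function name is hypothetical, and the check says nothing about whether a code actually exists on the server.

```python
import re

# Assumption: codes are exactly six ASCII letters or digits.
CODE_PATTERN = re.compile(r"[A-Za-z0-9]{6}")


def looks_like_analysis_code(text: str) -> bool:
    """Return True if the text has the six-character alphanumeric shape."""
    return CODE_PATTERN.fullmatch(text.strip()) is not None


# The first four are the example analysis codes shown in the app.
examples = ["liasix", "5uNQ9l", "7JtOTB", "v7omZ3", "abc", "too-long-code"]
results = {code: looks_like_analysis_code(code) for code in examples}
print(results)
```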

Before submitting you can provide an email to receive a link to your analysis. The app will also save any analyses you’ve submitted on your current device (laptop, phone, etc.), and will list them under the Retrieve Results tab.


Results Summary

The Results Summary tab provides a high-level overview/snapshot of your analysis results. You can explore your analysis fully with the detailed results and visualizations under the other tabs, as follows.


Query Data

Data Table

The Data Table tab shows the processed input data in tabular form. The default view includes query name, the species and lineage in which it is found, and Pfam domain architecture, but the table can be customized with the Add/remove column(s) button. Columns can be filtered for particular species, lineages, percentages, etc., and the entire table can be searched with plain text or regex. The full data table can be downloaded in .csv format with the Download as csv button.

FASTA

All FASTA sequences for the query protein(s) are provided for ease of access.

Query Heatmap

A heatmap shows the occurrence of query protein(s) by taxonomic presence, which may be useful for multi-FASTA input of homologs.

Query Domain Architecture

A customizable domain architecture visualization shows the query protein(s) grouped by analysis or query. You can modify the domain plot by selecting the Analysis box and adding or removing results to display (e.g., Pfam, Phobius, Coils analyses).


Homolog Data

The Homolog Data table lists the best hits from all superkingdoms of life (queried across all refseq or nr genomes). As in Query Data > Data Table described above, tabular details are provided across all homologs, including genome, species, lineage, and domain architecture information. Many homology-specific columns, such as percent identity and cluster ID (BLAST parameters), are available under Add/remove column(s). The accession number for each homolog is linked to its corresponding NCBI protein page.


Domain Architecture

A protein’s domain architecture (DA) refers to the order of specific functional regions of a protein. Currently, MolEvolvR uses databases and prediction algorithms integrated with InterProScan to characterize the domain architecture of protein queries and their homologs. We summarize the data with a set of useful visualizations below. Results from Pfam, Phobius, Gene3D, SignalP_Gram_positive, SignalP_Gram_negative, MobiDBlite, Hamap, and Coils are available.
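
To make the idea of "order of functional regions" concrete, here is a small Python sketch that turns a set of domain hits into an ordered architecture string. The domain names and coordinates are invented placeholders, and the `+` separator is an assumption made for this example rather than a documented MolEvolvR format.

```python
from typing import List, Tuple

# (domain name, start, end) hits for one protein; values are invented
# placeholders for illustration only.
Hit = Tuple[str, int, int]


def domain_architecture(hits: List[Hit], sep: str = "+") -> str:
    """Join domain names in N- to C-terminal order (sorted by start)."""
    ordered = sorted(hits, key=lambda h: (h[1], h[2]))
    return sep.join(name for name, _, _ in ordered)


hits = [("DomainB", 120, 210), ("DomainA", 5, 95), ("DomainC", 230, 300)]
da = domain_architecture(hits)
print(da)  # DomainA+DomainB+DomainC
```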

Table

The table provides summary statistics on the domain architecture data across all homologs, with the top (most frequent) domain architectures by query protein (or across all queries) and the frequencies of occurrence and lineages in which they occur. Click each row to view the domain architecture spread across lineages. A popup demonstrates the ‘LineageCount’ by showing the frequencies of occurrence by individual lineage for the selected domain architecture.

Heatmap

A color gradient heatmap across the query protein(s) indicates the number of homologs identified within each lineage per domain architecture.

Rows: Predominant domain architectures. Columns: Key lineages from across the superkingdoms of life.
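
The counts behind such a heatmap can be pictured as a small matrix of domain architectures by lineages. The Python sketch below builds one from a list of (lineage, domain architecture) pairs; all names are invented placeholders and the code is only an illustration of the counting, not the app's implementation.

```python
from collections import Counter

# (lineage, domain architecture) for each homolog; invented placeholders.
homologs = [
    ("LineageX", "DomainA+DomainB"),
    ("LineageX", "DomainA+DomainB"),
    ("LineageY", "DomainA+DomainB"),
    ("LineageY", "DomainA"),
]

# Number of homologs for each (architecture, lineage) pair.
counts = Counter((da, lineage) for lineage, da in homologs)
rows = sorted({da for _, da in homologs})            # domain architectures
cols = sorted({lineage for lineage, _ in homologs})  # lineages

# Missing pairs count as 0 because Counter returns 0 for unseen keys.
matrix = [[counts[(da, lin)] for lin in cols] for da in rows]
for da, row in zip(rows, matrix):
    print(da, row)
```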

Network

A network visualization summarizes domain architectures across query protein(s) and their homologs. Nodes represent domains, and edges denote domain co-occurrence within a protein. The domains (nodes) that co-occur within a protein/domain architecture are connected (edges), and the size of nodes and thickness of edges are proportional to their relative occurrences across homologs (or query proteins).
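
As a rough illustration of how node sizes and edge thicknesses could be derived, the Python sketch below counts how often each domain occurs and how often each pair of domains co-occurs in the same architecture. It assumes architectures are written as `+`-separated domain names (an assumption for this example) and uses invented placeholder domains.

```python
from collections import Counter
from itertools import combinations

# One architecture string per protein; invented placeholders.
architectures = ["DomainA+DomainB", "DomainA+DomainB+DomainC", "DomainA"]

node_weight = Counter()  # domain -> number of proteins containing it
edge_weight = Counter()  # (domain, domain) -> number of proteins with both

for da in architectures:
    # Sort so each pair is always stored in the same order.
    domains = sorted(set(da.split("+")))
    node_weight.update(domains)
    edge_weight.update(combinations(domains, 2))

print(dict(node_weight))
print(dict(edge_weight))
```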

InterProScan Visualization

Each column of this visualization is organized by the database the domains were obtained from. The rows represent select query protein(s) and/or homologs (if a homology search was performed) with the lineage added to the front of the accession number. You can select specific proteins with the dropdown box, and choose to group rows by analysis or query. The visualization can also be updated by toggling available database options under Analysis, or by adjusting the Total Cutoff Count slider.

UpSet plot

An UpSet plot is a helpful summary visualization that shows the frequencies of domains and domain architectures across all homologs. It shows the distribution of constituent domains underlying all homologs in a histogram (to the left). The combination matrix displays the various combinations of domains present across the domain architectures. The adjoining second histogram (on top) shows the frequency of occurrence of the indicated domain architectures (combinations).
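
The two histograms can be thought of as two tallies over the same data. In the Python sketch below, each architecture is reduced to the set of domains it contains, so domain order is ignored; that simplification, the `+` separator, and the placeholder domain names are all assumptions made for illustration.

```python
from collections import Counter

# One architecture string per homolog; invented placeholders.
architectures = ["DomainA+DomainB", "DomainB+DomainA", "DomainA", "DomainA+DomainC"]

# Treat each architecture as an unordered combination of domains.
domain_sets = [frozenset(da.split("+")) for da in architectures]

# Left histogram: how many homologs contain each individual domain.
domain_counts = Counter(d for s in domain_sets for d in s)

# Top histogram: how many homologs share each combination of domains.
combo_counts = Counter(domain_sets)

print(dict(domain_counts))
for combo, n in combo_counts.items():
    print(sorted(combo), n)
```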


Phylogeny

Phylogenetic analysis of proteins provides key insights into their development and evolution. The conservation of certain portions through lineages or across domains of life could indicate the importance of the protein in certain biological processes.

Sunburst

An interactive sunburst plot shows the phyletic spread of the query protein (selected with the Protein dropdown) across life. Hovering over each section of the plot displays the lineage. The depth of displayed taxonomic levels can be adjusted with Number of Levels to add more detail to the sunburst plot.
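
One way to picture the effect of Number of Levels is to truncate each homolog's lineage to a given depth and count the results. The Python sketch below assumes lineages are written as `>`-separated strings, which is an assumption for this example; the function name `counts_at_depth` is hypothetical.

```python
from collections import Counter

# Example lineages, written with '>' between taxonomic levels (assumed format).
lineages = [
    "Bacteria>Firmicutes>Bacilli",
    "Bacteria>Firmicutes>Clostridia",
    "Bacteria>Proteobacteria>Gammaproteobacteria",
    "Archaea>Euryarchaeota>Halobacteria",
]


def counts_at_depth(lineages, levels, sep=">"):
    """Count lineages after truncating each one to the first `levels` levels."""
    return Counter(sep.join(lin.split(sep)[:levels]) for lin in lineages)


level1 = counts_at_depth(lineages, 1)
level2 = counts_at_depth(lineages, 2)
print(dict(level1))
print(dict(level2))
```

Increasing the depth splits each sector into finer ones while the total count stays the same.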

Tree

This visualization is constructed from a multiple sequence alignment of representative homologs. Tree leaves are labeled by lineage, species (three-letter abbreviation), and accession numbers.

You can adjust the tree generation in two important ways: by choosing whether homologs are reduced to representative sequences (e.g., by lineage, species, or domain architecture), and by choosing the multiple sequence alignment (MSA) algorithm (including Clustal Omega, ClustalW, and MUSCLE). The size of the tree can be altered by selecting the desired number of sequences. To the right of the tree is a visualization of the multiple sequence alignment, colored by amino acid and showing overall conservation of sequence and structure of the homologs used in tree construction.

MSA

You can customize and download a multiple sequence alignment as a .pdf file, including a user-specified number of representative sequences among the homologs.


Explore your results

Filters

Data tables can be filtered via global or column-specific search boxes and controls. These filters are applied to all results in the analysis. For example, if on the Homolog Data tab you filter the domain architectures down to just a few specific ones, all visualizations and summaries generated in the Domain architecture and Phylogeny tabs will also be filtered.

Columns are searched appropriately based on the data they contain. For example, the AccNum column is text searchable, while PcPositive provides sliders to specify a range of values.

Regex

Table-wide search boxes support JavaScript-flavored regular expressions. This can be used to make advanced searches, e.g. Staphylococcus\saureus|Klebsiella\spneumoniae (search for staphylococcus OR klebsiella.)
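
To try a pattern before typing it into a search box, you can test it offline. The sketch below uses Python's `re` module, which treats `\s` and `|` the same way as JavaScript for this particular pattern. It matches case-sensitively; whether the app's search box ignores case is not stated here, so treat that as an open assumption.

```python
import re

# Same pattern as in the example above: either species name,
# with any single whitespace character between genus and species.
pattern = re.compile(r"Staphylococcus\saureus|Klebsiella\spneumoniae")

species = [
    "Staphylococcus aureus",
    "Klebsiella pneumoniae",
    "Bacillus subtilis",
    "Staphylococcus epidermidis",
]
matches = [name for name in species if pattern.search(name)]
print(matches)
```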


Compatibility

This web-app is regularly tested on the following:

  • Google Chrome, Mozilla Firefox, Apple Safari
  • Windows, MacOS, iOS, Android
  • Desktop, tablet, phone/mobile

We only use standardized and widely-supported HTML, CSS, and JavaScript features, so any other modern, standard-compliant browser such as Opera or Microsoft Edge should also work, even if not explicitly tested.

The following are NOT supported, and may result in unexpected look or behavior:

  • Microsoft Internet Explorer.
  • Smart watches, or any device with a screen width < ~250px.
  • Browsers without JavaScript enabled (interactive features won’t work).

If you encounter a bug, please let us know!


Dependencies

Tools

R, InterProScan, BLAST+, edirect, FastTree, MUSCLE, Phobius, TMHMM, HMMER

Data

NCBI Taxonomy, NCBI GenBank/RefSeq; BLAST RefSeq, NR DB; InterPro

R packages

ape, biomartr, cowplot, d3r, DT, gganimate, gggenes, ggraph, ggsci, ggthemes, ggtree, ggvis, gh, gridExtra, heatmap3, heatmaply, htmlwidgets, httr, igraph, knitr, latexpdf, pdftools, phangorn, phylogram, phylotools, phytools, plotly, rentrez, reutils, rmarkdown, seqinr, seqRFLP, shiny, shinydashboard, sunburstR, tidytext, tidytree, tidyverse, tinytex, UpSetR, viridis, visNetwork, wordcloud, wordcloud2

Tutorials

Coming Soon.

We will provide video tutorials covering: how to load your data (accession numbers, FASTA file, web-BLAST results, web-InterProScan results), how to run your analyses, how to load your analysis after it has been processed, how to navigate the web-app, and how to download publication-ready figures and data!


How to Cite

If you have used our web-app to generate any results for your publication or presentations, please cite us as follows:

MolEvolvR: a web-app for characterizing proteins using molecular evolution and phylogeny. Jacob D Krol*, Joseph T Burke*, Samuel Z Chen*, Lo Sosinski*, Faisal S Alquaddoomi, Evan P Brenner, Ethan P Wolfe, Vince P Rubinetti, Shaddai Amolitos, Kellen M Reason, John B Johnston, Janani Ravi. [*Co-primary] bioRxiv 2022.02.18.461833; doi: https://doi.org/10.1101/2022.02.18.461833; web-app: http://jravilab.org/molevolvr

Meet the Team

More from JRaviLab

Contact

Questions? Email us at janani.ravi@cuanschutz.edu.

Funding

We would like to thank our funding sources: Endowed Research Funds from the College of Veterinary Medicine, Michigan State University, NSF-funded BEACON funding support, and the University of Colorado Anschutz start-up funds awarded to JR; NSF-funded REU-ACRES summer scholarship to SZC; NIH NIAID U01AI176414 to JR; NIH NLM T15LM009451 to EPB.

Q: Will I receive an email when the job is done?

Yes, if you supplied an (optional) email on the submission page, then an email will be sent to confirm when a job is ready.

Q: How to paste/upload protein sequences?

Acceptable formats

NCBI FASTA

>OHS91782.1 16S rRNA pseudouridine(516) synthase [Staphylococcus aureus]
MRIDKFLANMGVGTRNEVKQLLKKGLVNVNEQVIKSPKTHIEPENDKITVRGELIEYIENVYIMLNKPKG
YISATEDHHSKTVIDLIPEYQHLNIFPVGRLDKDTEGLLLITNDGDFNHELMSPNKHVSKKYEVISANPI
TEDDIQAFKEGVTLTDGKVKPAILTYIDNQTSHVTIYEGKYHQVKRMFHSIQNEVLHLRRIKIADLELDS
NLDSGEYRLLTENDFDKLNYK

UniProt FASTA

>sp|P01189|COLI_HUMAN Pro-opiomelanocortin OS=Homo sapiens OX=9606 GN=POMC PE=1 SV=2
MPRSCCSRSGALLLALLLQASMEVRGWCLESSQCQDLTTESNLLECIRACKPDLSAETPM
FPGNGDEQPLTENPRKYVMGHFRWDRFGRRNSSSSGSSGAGQKREDVSAGEDCGPLPEGG
PEPRSDGAKPGPREGKRSYSMEHFRWGKPVGKKRRPVKVYPNGAEDESAEAFPLEFKREL
TGQRLREGDGPDGPADDGAGAQADLEHSLLVAAEKKDEGPYRMEHFRWGSPPKDKRYGGF
MTSEKSQTPLVTLFKNAIIKNAYKKGE

Custom FASTA header (not recommended)

>SEQUENCE154 UNKNOWN 
MPRSCCSRSGALLLALLLQASMEVRGWCLESSQCQDLTTESNLLECIRACKPDLSAETPM
FPG

The application uses NCBI or UniProt accessions to get taxonomy info from query proteins. Therefore, it is recommended to include valid protein accession numbers in the header when possible.
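
To see which accession a header would yield, the Python sketch below extracts a best-guess accession from the three header styles shown above. It is a hypothetical helper for checking your own files, not MolEvolvR's parser: it takes the second `|`-separated field for UniProt-style headers beginning with `sp` or `tr`, and otherwise the first whitespace-delimited token.

```python
def accession_from_header(header: str) -> str:
    """Return a best-guess accession from a FASTA header line."""
    tokens = header.lstrip(">").split()
    if not tokens:
        return ""  # empty header, nothing to extract
    first_token = tokens[0]
    parts = first_token.split("|")
    # UniProt style: db|ACCESSION|ENTRY_NAME, where db is 'sp' or 'tr'.
    if len(parts) >= 3 and parts[0] in ("sp", "tr"):
        return parts[1]
    # NCBI style (and custom headers): the first token is the identifier.
    return first_token


headers = [
    ">OHS91782.1 16S rRNA pseudouridine(516) synthase [Staphylococcus aureus]",
    ">sp|P01189|COLI_HUMAN Pro-opiomelanocortin OS=Homo sapiens OX=9606 GN=POMC PE=1 SV=2",
    ">SEQUENCE154 UNKNOWN ",
]
accessions = [accession_from_header(h) for h in headers]
print(accessions)
```

The custom header yields `SEQUENCE154`, which is not a valid NCBI or UniProt accession, so no taxonomy could be looked up from it.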

Common mistakes

No header lines (missing > header delimiter)

MRIDKFLANMGVGTRNEVKQLLKKGLVNVNEQVIKSPKTHIEPENDKITVRGELIEYIENVYIMLNKPKG

MPRSCCSRSGALLLALLLQASMEVRGWCLESSQCQDLTTESNLLECIRACKPDLSAETPM

Duplicate headers/accnums

>GCF_000013425.1
MVPEEKGSITLSKEAAIIFAIAKFKPFKNRIKNNPQKTNPFLKLHENKKS
>GCF_000013425.1
MKQKKSKNIFWVFSILAVVFLVLFSFAVGASNVPMMILTFILLVATFGIGFTTKKKYRENDWL
>protein
MKLTLMKFFVGGFAVLLSYIVSVTLPWKEFGGIFATFPAVFLVSMFITGMQYGDKVAVHVSRGAVFGMTGVLVCILVTWM
MLHMTHMWLISIVVGFLSWFISAVCIFEAVEFIAQKRLEKHSWKAGKSNSK
>protein
MVKRTYQPNKRKHSKVHGFRKRMSTKNGRKVLARRRRKGRKVLSA
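
Both mistakes can be caught with a quick local check before submitting. The following Python sketch is a hypothetical helper, not part of MolEvolvR: it reports sequence data that appears before any `>` header and header identifiers that occur more than once.

```python
from collections import Counter


def check_fasta(text: str):
    """Return a list of human-readable problems found in FASTA text."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if lines and not lines[0].startswith(">"):
        problems.append("sequence data before the first '>' header")
    # The identifier is the first whitespace-delimited token after '>'.
    ids = [ln[1:].split()[0] for ln in lines if ln.startswith(">") and ln[1:].split()]
    for seq_id, n in Counter(ids).items():
        if n > 1:
            problems.append(f"duplicate header: {seq_id} ({n} times)")
    return problems


no_header = "MRIDKFLANMGVGTRNEVKQLLKKGLVNVNEQVIKSPKTHIEPENDKITVRGELIEYIENVYIMLNKPKG\n"
duplicates = (
    ">GCF_000013425.1\nMVPEEKGSITLSKEAAIIFAIAKFKPFKNRIKNNPQKTNPFLKLHENKKS\n"
    ">GCF_000013425.1\nMKQKKSKNIFWVFSILAVVFLVLFSFAVGASNVPMMILTFILLVATFGIGFTTKKKYRENDWL\n"
    ">protein\nMVKRTYQPNKRKHSKVHGFRKRMSTKNGRKVLARRRRKGRKVLSA\n"
    ">protein\nMVKRTYQPNKRKHSKVHGFRKRMSTKNGRKVLARRRRKGRKVLSA\n"
)
print(check_fasta(no_header))
print(check_fasta(duplicates))
```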

Q: Is my job still running? Did it complete?

Upon submission, a URL to retrieve the results will be displayed. The link provides job progress info and, once finished, the results.

Recommendations:

  • Bookmark link
  • Supply an optional email to receive the link

Q: How long will my submission take to process? When can I expect my results?

Key factors of job duration:

  1. Number of sequences submitted

  2. Number of homologs to search for each sequence (Advanced Options>Maximum Hits)

  3. Length & complexity of sequences

Example runtimes
Submission type      Job code  Description                Runtime (minutes)  Homology search  Database  Maximum hits  E-value
FASTA                FkzSp3    staph-ref-proteome[1-500]  15982.0            TRUE             refseq    500           1e-05
BLAST-output         iYY6hD    human-K-channel            43.0               TRUE             refseq    500           1e-05
MSA                  HCeUWr    bacillus                   206.0              TRUE             refseq    5000          1e-05
interproscan-output  O0Anqd    phage-shock-protein        0.6                FALSE            NA        NA            NA
interproscan-output  A3vRqg    phage-shock-protein        59.0               TRUE             refseq    1000          1e-05