Select a row to see which lineages the domain architecture is present in
Load an existing analysis to explore MolEvolvR results right away without inputting anything.
The MolEvolvR web-app integrates molecular evolution and phylogenetic protein characterization under a single computational platform. MolEvolvR allows users to perform protein characterization (1+3), homology searches (1+4), or combine the two (1+3+4), starting with either protein(s) of interest or with external outputs from BLAST or InterProScan for further analysis, summarization, and visualization (2+3+4). MolEvolvR is interactive, queryable, user-friendly, and customizable.
MolEvolvR: A web-app for characterizing proteins using molecular evolution and phylogeny
Joseph T Burke*, Samuel Z Chen*, Lo Sosinski*, John B Johnson, Janani Ravi. [*Co-primary]
bioRxiv 2022. doi: https://doi.org/10.1101/2022.02.18.461833; web-app: http://jravilab.org/molevolvr
Submit your sequence or preprocessed data here and use your retrieval code (custom URL) to view the results.
A preview of the various analyses performed on the submitted data. To view the full results, explore the additional tabs linked below.
Input data, additional metadata, and preliminary analyses of query protein(s).
The data table provides a summary of the sequences submitted, or "queried", for analysis. The preview shown can be extended by using "Add/remove column(s)" to see information about other taxonomic classes as well as domain architecture codes from databases other than the default (Pfam).
Uploaded amino acid FASTA sequence(s).
A heatmap of submitted sequences and their respective taxonomic lineages.
Summary and visualizations of protein motifs/subunits (domains) and their configurations within the query protein(s) (domain architectures).
Full set of homologs of query sequences, including their lineage and domain architecture info.
Visualizations and analyses of all query and homologous protein domains, structural or functional subunits, and their architectures.
Visualizations of phyletic patterns, sequence similarity, and evolution of related proteins.
This website is free and open to all users and there is no login requirement.
The help documentation outlined here provides instructions on how to use MolEvolvR to its fullest potential. The video tutorials (coming soon) will demonstrate how to set up custom analyses and navigate the web-app after retrieving your results.
Characterization of proteins is crucial to understanding the molecular basis of fundamental cellular processes. We’ve created a web-app that performs analysis of proteins by their sequence, structure, function, and phylogeny. The generated visualizations and tables characterize the aforementioned key features between homologous proteins. This allows researchers to discover more about their proteins of interest on the scale of sequence similarity, domain architecture, and lineages.
For example, the phage shock proteins (PSP) were analyzed using the workflow present in this web-application. The homology, domain architectures, and phylogeny of these proteins (and their genomic contexts) were characterized, showing their prevalence in other organisms and detailing how variations of this phage shock stress response system are present across many lineages.
To demonstrate its broad applicability, we have applied the approach underlying MolEvolvR to study several systems, including proteins/operons in zoonotic pathogens, e.g., nutrient acquisition systems in Staphylococcus aureus [tcyABCP, gis-ggt], novel phage defense system in Vibrio cholerae [Vch1], surface layer proteins in Bacillus anthracis [SLPs], helicase operators in bacteria [DciA], and internalins in Listeria [InlP].
Users can enter or upload amino acid FASTA sequences or accession numbers for proteins of interest to identify homologs, determine domain architectures, and delineate phyletic spreads. These analyses provide insight into the role of the protein(s) of interest within organisms and detail how they have evolved. Together, they give an overview of the importance of the protein(s) to a particular biological process, or to the survival of the organisms themselves.
MolEvolvR’s functionality comprises four kinds of analyses:
To begin the full analysis of your protein(s), enter the amino acid FASTA sequences (.fa, .fasta, .faa) or accession numbers associated with your proteins of interest into the “Start Analysis” tab. You can also upload a file containing the sequences or accession numbers. Up to 1000 protein sequences can be uploaded. For analyses with more than 1000 proteins, contact janani.ravi@cuanschutz.edu.
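As a pre-submission check, the following minimal Python sketch counts the records in a FASTA file and compares the count against the 1000-sequence limit mentioned above. It is not part of MolEvolvR; the function names (parse_fasta, within_upload_limit) and the example sequences are hypothetical and used only for illustration.

```python
MAX_SEQUENCES = 1000  # upload limit stated in the help text above


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may wrap across rows
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def within_upload_limit(records, limit=MAX_SEQUENCES):
    """True when there is at least one record and no more than `limit`."""
    return 0 < len(records) <= limit


# Made-up example input with two short records.
example = """>protA hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>protB another protein
MSLLTEVETY
"""

records = parse_fasta(example)
print(len(records), within_upload_limit(records))
```

Running a check like this before uploading avoids submitting a file that exceeds the limit.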
Alternatively, if the user already has a list of homologous proteins, they may enter the multi-protein FASTA or accession numbers. MolEvolvR can also use multiple sequence alignments (MSAs) generated through external programs such as Clustal Omega, ClustalW, Kalign, or MUSCLE. If the user prefers a different alignment algorithm, they may enter pre-aligned FASTA sequences obtained using the algorithm of their choice.
Advanced options allow the user to tweak the analysis performed based on both the data they bring and the desired analysis to be performed.
Select this option to run phylogenetic analyses on the query protein(s) provided, if they are homologs of each other. When selected, no other analysis will be performed unless explicitly selected. If the starting set of homologs does not have additional metadata or domain architectures, selecting the domain architecture option alongside may be helpful.
Selecting homology search will generate an extensive list of protein homologs (related proteins) based on the starting protein(s) of interest. This option will only run a homology search, unless otherwise specified. It pairs well with domain architecture analysis, which can be run on each of the thousands of homologs, resulting in further functional characterization of the protein family.
This option generates domain architectures (including sequence-structure motifs/domains such as Pfam, TIGRFam, SignalP, TM and disorder predictions, and cellular localization) for the query proteins provided. If selected alone, no other analysis will be performed unless explicitly selected.
This option allows you to search for homologous proteins related to the specific domains found within your query proteins. It allows for a broader search that discovers remote homologs, which would otherwise not be detected by a standard search. Phylogenetic analyses, domain architecture, and characterization are then performed.
For submission types that include a homology search in the analysis, this section can be used to tune the homology search parameters: database, maximum hits, and e-value.
To begin the homology search, enter or upload protein FASTA sequence(s) or accession number(s).
If an accession number is given, the web-app will search for the corresponding FASTA sequence. The FASTA sequence is then run through DELTA-BLAST, a variation of BLASTP that uses PSSMs, but first searches pre-constructed PSSMs and the CDD. Once the BLAST homology search completes, MolEvolvR clusters the resulting homolog sequences with BLASTClust and adds metadata on lineage and domain architecture (when Domain Architecture analyses are selected).
MolEvolvR allows for customization of the analyses on your proteins of interest. Several different approaches can be taken to fit your needs.
The full analysis allows the user to begin with accession numbers or FASTA files for protein(s) of interest. It then compiles a comprehensive set of homologs, which can then be used to determine evolution, phylogeny, and domain architectures of all homologs. Users have the option to perform only phylogenetic analysis or domain architecture if both are not required.
Additionally, users may load results from NCBI BLAST or InterProScan and begin the analysis at that stage. Web-BLAST results allow the user to determine homolog similarity, domain architecture, and/or phylogeny, whereas uploaded InterProScan results allow for domain characterization and, if desired, phylogeny.
Users may also enter the workflow with data obtained from a previous BLAST run. This data can be run through BLASTClust to cluster similar sequences among the retrieved homologs. The phylogenetic analysis and domain architecture components can then be applied.
MolEvolvR allows you to input BLAST results that have been run externally on the BLAST web-server for your protein(s) of interest. To help us help you, below are instructions and useful parameters to modify prior to setting up your BLAST runs. Additionally, we’ll help you with identifying the right format to download these results in to ensure compatibility with our web-app.
Here are some instructions on uploading BLASTP results to our “Upload” tab:
First, enter your Accession Number(s) or FASTA sequence(s) into the “Enter Query Sequence” box.
For the database parameter, we support either the non-redundant database (nr) or the reference sequence collection (refseq_proteins). RefSeq contains protein sequences from human, mouse, and prokaryotes, restricted to the RefSeq Select set of proteins. Furthermore, RefSeq Select includes one representative protein per protein-coding gene for human and mouse, and RefSeq proteins annotated on reference and representative genomes for prokaryotes. Meanwhile, nr houses all non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding environmental samples from WGS projects. If you would like to further filter your results based on lineages (e.g., species, genus, family, kingdom), enter the name/taxID in the Organism field and toggle the box to include or exclude those results in your search.
Next, select which algorithm to run. ‘BLASTP’ is great for a standard protein BLAST, if you know very little about your protein. If you are interested in identifying remote homologs, we suggest performing an iterative search using PSI-BLAST. DELTA-BLAST works very well if your protein has domains of interest. Creating a job title for the run is optional and for your personal use.
Choose the maximum number of sequences that will be aligned. Our recommendation is 5,000 total target sequences, or hits, to ensure maximum inclusion of homologs. Next, go down to the PSI/PHI/DELTA-BLAST box at the bottom and choose your threshold value, or e-value. An e-value is the expected number of matches by pure random chance, and it is used to filter out hits with a value higher than the chosen threshold. We recommend using 1e-5 (also written as 1x10^-5 or 0.00001). Double check your parameters in the box at the bottom of the page, then click the BLAST button.
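To make the thresholding concrete, here is a minimal Python sketch of an e-value filter: hits whose e-value is higher than the cutoff are dropped, and the rest are kept. The hit names and e-values are made up for illustration, and the helper passes_cutoff is hypothetical rather than part of BLAST or MolEvolvR.

```python
E_VALUE_CUTOFF = 1e-5  # recommended threshold; the same number as 0.00001


def passes_cutoff(evalue, cutoff=E_VALUE_CUTOFF):
    """Keep a hit when its e-value is not higher than the cutoff."""
    return evalue <= cutoff


# Made-up (hit name, e-value) pairs for illustration only.
hits = [("hitA", 3e-40), ("hitB", 1e-5), ("hitC", 0.002)]

kept = [name for name, evalue in hits if passes_cutoff(evalue)]
print(kept)
```

A smaller cutoff is stricter: fewer hits pass, but those that do are less likely to be chance matches.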
Once the BLAST algorithm has finished running, there will be an option on the upper left of the screen to download results. Click on the “Download” button and select the “Hit Table (text)” option. Once these results are downloaded, you can directly upload these text files to the MolEvolvR web-app.
NCBI has updated the BLAST site to allow a description table of your protein homologs to be downloaded, which we encourage you to look at. However, our app requires the Hit Table for the analysis to run correctly and completely.
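Before uploading, it can help to confirm that the downloaded file looks like a hit table. The Python sketch below assumes the text Hit Table is tab-delimited, uses lines beginning with "#" for comments, and starts each row with the twelve standard tabular BLAST columns (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score). That layout is an assumption to check against your own file; the function read_hit_table and the sample rows are hypothetical.

```python
# Assumed order of the first 12 columns (standard tabular BLAST layout).
COLUMNS = [
    "query", "subject", "pct_identity", "aln_length", "mismatches",
    "gap_opens", "q_start", "q_end", "s_start", "s_end", "evalue", "bitscore",
]


def read_hit_table(text):
    """Parse tab-delimited hit rows into dicts, skipping '#' comment lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment/header line
        fields = line.split("\t")
        if len(fields) < len(COLUMNS):
            continue  # not a complete hit row under the assumed layout
        row = dict(zip(COLUMNS, fields))  # extra trailing columns are ignored
        row["pct_identity"] = float(row["pct_identity"])
        row["evalue"] = float(row["evalue"])
        row["bitscore"] = float(row["bitscore"])
        rows.append(row)
    return rows


# Made-up sample content for illustration only.
sample = "\n".join([
    "# comment line",
    "\t".join(["Q1", "S1", "98.5", "200", "3", "0",
               "1", "200", "1", "200", "2e-120", "410"]),
    "\t".join(["Q1", "S2", "35.0", "180", "100", "5",
               "10", "190", "4", "180", "0.5", "32.1"]),
])

hits = read_hit_table(sample)
print(len(hits), hits[0]["subject"], hits[0]["evalue"])
```

If a file yields zero rows under this check, it is likely a different download format (for example, the description table) rather than the Hit Table.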