Install key programs
-
BLAST+ executables (NCBI commandline tool; https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download)
-
Python distribution v3.x, recommend Anaconda (https://www.anaconda.com/distribution/)
- Python scripts will help automate many tasks below
-
CD-HIT (commandline tool; https://github.com/weizhongli/cdhit/wiki); install is easy if Anaconda has been installed earlier
conda install -c bioconda cd-hit
-
Sequence alignment program, recommend Clustal Omega (commandline tool; http://www.clustal.org/omega/) or MAFFT (commandline https://mafft.cbrc.jp/alignment/software/)
-
Alignment viewer, recommend AliView (http://ormbunkar.se/aliview/)
-
Phylogenetic tree inference program, recommend FastTree (commandline tool; http://www.microbesonline.org/fasttree/)
-
Phylogenetic tree viewer, recommend FigTree (well-designed; http://tree.bio.ed.ac.uk/software/figtree/) or Archaeopteryx (for large trees; https://sites.google.com/site/cmzmasek/home/software/archaeopteryx)
Below, instructions assume a UNIX like environment; MacOS is one and so is Linux, but Windows is not. Cygwin is a suite of tools that installs under Windows to provide a lot of the same command-like functionality.
Decide on a namespace/reference database, e.g. UniProt
- Source for identifiers from, key annotations, etc. Every sequence will (ideally) have an identifier from this source.
Collect key proteins: two tiers
-
Tier 1: “Guide” entries, proteins that are required in the final data set
Collect as FASTA or name list, e.g.
collect_tier1.fa
; if identifiers or sequence are not known we will find them using a BLAST search from our reference database below. -
Tier-2: POI entries (proteins of interest), prioritise inclusion, but if redundant, choose one
Collect as FASTA or name list, e.g.
collect_tier2.fa
; like above, identifiers/sequences will be retrieved using a BLAST search -
Devise a strategy with which a root can be placed in the eventual phylogenetic tree and amend Tier-1 and Tier-2 lists; this might involve having examplars in Tier-1 and Tier-2 that can work as outgroups or anchor points (by reference to published phylogenetic tree of the same family)
Download all members of protein family
- In UniProt, decide on a family identifier, e.g. “DapA”, and remove questionable entries, e.g. sequence fragments
- Execute search and download as FASTA, e.g. go to https://www.uniprot.org type in query
family:"dapa family" fragment:no
, download as FASTA, save asdb-100.fa
(“100” because it is 100% of the family.)
Create BLAST database
- makeblastdb (program part of BLASTx suite) on db-100
- For example,
makeblastdb -dbtype prot -in db-100.fa -out db-100
- For example,
Create Tier-1 and Tier-2 files
- Use the BLAST database db-100 to re-generate Tier-1 and Tier-2 files, so they are associated with identifiers and sequences from reference database
- For example,
blastp -db db-100 -outfmt 3 -num_descriptions 1 -num_alignments 0 -num_threads 5 -query collect_tier1.fa -out tier-1.txt -evalue 1e-100
; this will find the sequence with the highest similarity to that given, provided it is statistically supported at E=1e-100 - Extract sequence identifiers from BLAST report, e.g.
grep -e "^[st][pr]" tier-1.txt | cut -d' ' -f1 > tier-1.tab
; from this createtier-1.fa
- For example,
Constrain representation of protein family using key proteins
-
Use program such as CD-HIT to filter sequences at different levels of redundancy allowed, e.g. 99%, 95%, 90%, 50% (we call these db-99, db-95, etc)
- For example,
cd-hit -i db-100.fa -c 0.99 -T 5 -o db-99 -d 0
generates a filedb-99.clstr
; We do not use the FASTA file created by CD-HIT - For each sequence similarity “cluster” identified (in
db-99.clstr
), keep all Tier-1 entries, when no Tier-1 entries are in the cluster, keep exactly one Tier-2 entry if available; only when no Tier-1 or Tier-2 entries are in the cluster, pick an entry randomly [this step needs to be automated]; from this, createdb-99.fa
,db-95.fa
, etc
- For example,
-
Make BLAST databases for each redundancy level, db-99, db-95, etc
makeblastdb -dbtype prot -in db-99.fa -out db-99
Create datasets
- Search Tier-1 entries in db-100 including all homologs closer than a given E-value (e.g. 1e-100)
- For example,
blastp -db db-100 -outfmt 3 -num_descriptions 20000 -num_alignments 0 -num_threads 5 -query tier-1.fa -out dataset-99.txt -evalue 1e-100
- Extract sequence identifiers from BLAST report, e.g.
grep -e "^[st][pr]" dataset-1.txt | cut -d' ' -f1 > dataset-1.tab
; from this createdataset-1.fa
- Mark Tier-1 and Tier-2 entries in dataset, with distinctive labels etc so that they can be searched for later
- For example,
- Try different combinations of redundancy levels (db-99, db-95, …) and E-value thresholds (1e-100, 1e-75, 1e-50, 1e-10, etc) to construct alternative datasets
- Redundancy level will control how dense the overall sequence composition will be
- E-value threshold will control how closely we sample around the Tier-1 entries
Perform sequence alignment
-
Use multiple sequence alignment program to establish homologous positions between all entries in a dataset
- For example,
clustalo -i dataset-1.fa -o dataset-1.aln --output-order=tree-order --threads=5 --force
- For example,
-
Inspect alignment to evaluate quality
- For example, use AliView, zoom out so the whole alignment can be viewed
-
Check for sole-sequence insertions (and deletions), and remove if incorrectly aligned, and disruptive of the the overall alignment; re-run alignment once cleaned
Perform phylogenetic tree inference
- Use tree inference program to create an unrooted phylogenetic tree
- For example,
FastTree -nosupport -out dataset-1_unrooted.nwk dataset-1.fa
- For example,
- Inspect tree to place a root and to evaluate quality
- For example, use FigTree or Archaeopteryx
- Search for markers of Tier-1 and Tier-2 entries to find appropriate place for root, placed at branch point in unrooted tree