The goal of this tutorial is to explain how to install genomes from NCBI in your RSAT server.
This protocol assumes that you have a working instance of the RSAT server, and that you are familar with the Unix terminal.
We illustrate the installation procedure with a bacterial genome, since NCBI is the main source of prokaryote genomes for RSAT (http://prokaryotes.rsat.eu/). The same procedure can however apply to other taxa, as far as they are available on the NCBI FTP server (https://ftp.ncbi.nlm.nih.gov/genomes/).
Genomes are downloaded from the refseq repository: https://ftp.ncbi.nlm.nih.gov/genomes/refseq
The first step to perform with the command
ìnstall-organisms
is to get the list of organisms available
for a given taxonomic group.
The NCBI FTP genome site is organised by “main” taxonomic groups:
The option -task available
returns the complete list of
available species for the specified group.
The command below gets the list of all available bacteria and stores
them in a text file. It can easily be adapted to another group by
changing the environment variable GROUP
.
## Get curret date to use as suffix for output files
export TODAY=`date +%Y-%m-%d`
## Choose a taxonomic group of interest
export GROUP=bacteria
## Get the list of bacterial species available on the NCBI FTP site
install-organism -v 1 -group ${GROUP} -task available
This will download the assembly summmary file from NCBI, for the selected taxonomic group. Beware, for Bacteria this file is pretty large (153Mb on April 2024). However, this file is downloaded only if the version on the NCBI server is more recent than the local version. After the first download, subsequent calls of the install-organisms command will check if a new version is present at the NCBI site, and if not it will use the local copy.
The output of install-organisms indicates the local path of the assembly summary file.
; Assembly summary file $RSAT/downloads/refseq/bacteria/assembly_summary.txt
; Assembly summary columns $RSAT/downloads/refseq/bacteria/assembly_summary_colums.tsv
; Available assemblies 350727
## Count the number of species.
## Note: grep is used to filter out the comment lines.
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt
grep -v '^#' ${ASSEMBLY_SUMMARY} | wc -l
We can of course do the same for other taxa, for example Archaea, Fungi or Mammalian.
export NCBI_GROUPS='archaea bacteria fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other viral'
echo "NCBI_GROUPS ${NCBI_GROUPS}"
for ncbi_group in $NCBI_GROUPS; do \
echo "Getting list of NCBI species for ${ncbi_group}" ; \
install-organism -v 1 -group ${ncbi_group} -task available
done
## Approx count of species per taxa
## (each file starts with 15 comment lines, which are be substracted from each count)
wc -l $RSAT/downloads/refseq/*/assembly_summary.txt \
| sort -nr \
| awk '{print "|\t"$1-15"\t| "$2"\t|"}'
Note: for some species, several strains have been sequenced. The number of genomes available at NCBI thus exceeds by far the number of distinct species.
Here is the result on July 31, 2024
Number of species | Taxonomic group |
---|---|
367675 | /Users/jvanheld/packages/rsat/downloads/refseq/bacteria/assembly_summary.txt |
14975 | /Users/jvanheld/packages/rsat/downloads/refseq/viral/assembly_summary.txt |
2171 | /Users/jvanheld/packages/rsat/downloads/refseq/archaea/assembly_summary.txt |
588 | /Users/jvanheld/packages/rsat/downloads/refseq/fungi/assembly_summary.txt |
404 | /Users/jvanheld/packages/rsat/downloads/refseq/invertebrate/assembly_summary.txt |
399 | /Users/jvanheld/packages/rsat/downloads/refseq/vertebrate_other/assembly_summary.txt |
220 | /Users/jvanheld/packages/rsat/downloads/refseq/vertebrate_mammalian/assembly_summary.txt |
171 | /Users/jvanheld/packages/rsat/downloads/refseq/plant/assembly_summary.txt |
83 | /Users/jvanheld/packages/rsat/downloads/refseq/protozoa/assembly_summary.txt |
386806 | total |
Notes
The number of available assemblies exceeds by far the number of species, because some species are represented by different strains, in particular for bacteria (some bacterial species have several thousands of strains sequenced).
Some of these genomes are only partly sequenced, or poorly assembled, and the level of annotation is quite variable. NCBI subdivided these available genomes in 4 categories depending on the level of completion of the sequencing project.
In order to select a reasonable number of genomes to install, we can filter the assemblies based on different criteria.
The option -species
enables selecting a subset of
assemblies corresponding to a given species of a given genus. The
argulent -species
should be followed by a string
concatenating the Genus (with a leading uppercase) and species (in
lowercase) separated by an underscore character (_
), with
exactly the same spelling and cases as on the NCBI FTP refseq site. Note
that special characters (][)(/
etc) should be replaced by
an underscore.
## List the assemblies for a given species, e.g. Rhizobium etli
install-organism -v 1 -group ${GROUP} \
-species Rhizobium_etli \
-task list -list_format org
Beware, for some species of interest, the number of assemblies can be quite large. This is for example the case for Escherchia coli.
## List the assemblies for Escherichia coli)
install-organism -v 1 -group ${GROUP} \
-species Escherichia_coli \
-task list -list_format org \
-o assemblies_Escherichia_coli.txt
## Count the number of assemblies
grep -v '^;' assemblies_Escherichia_coli.txt | grep -v '^#' | wc -l
## Display the first rows of this file
head -n 20 assemblies_Escherichia_coli.txt
For the species Escherichia_coli, we get 34510 assemblies (05/05/2024).
We generally don’t need to install all these assemblies. We should thus use additional information in order to narrow down the number of genomes to install. To this purpose, the next sections explain how to select genomes labeled either as reference or representative, and completely sequenced.
The assembly summary includes a field refseq_category
,
which is quite convenient
install-organism -v 1 -group ${GROUP} -task list \
-list_format org \
-filter refseq_category "reference genome"
This will print out the list of assemblies labeled as “reference genome” in the column “refseq_category” of the bacterial genome assembly file.
As of 05/05/2024, no more than 15 bacterial genomes are labeled “reference”, among the 350,729 available assemblies.
The option -list_format enables to export the result in 3 alternative formats.
org a list of organism names (default output, as shown above)
table a table with the different names and identifiers of the selected assemblies
install-organism -v 1 -group ${GROUP} -task list \
-list_format table \
-filter refseq_category "reference genome" \
-o reference_genomes.tsv
We can combine several filtering criteria to select a more reasonable number of assemblies. For example, we could select the assemblies labeled as “representative genome” in the column refseq_category, and “Complete Genome” in the colum “”
install-organism -v 1 -group ${GROUP} -task list \
-list_format table \
-filter refseq_category "representative genome" \
-filter assembly_level "Complete Genome" \
-o representative_complete.tsv
This returns 5014 assemblies (24/04/2024), with only one assembly per species, and belonging to 1555 distinct genus. The number of assemblies per genus can be displayed with the following command;
We will now download the genome of a selected assembly, and install it on the RSAT server. As an illustration, we will install the reference strain K-12 of Escherichia coli.
## Select a species and assembly of interest
## (can be adapted to your needs)
grep Escherichia reference_genomes.tsv | grep K-12
grep Escherichia reference_genomes.tsv | grep K-12 | cut -f 8
export ASSEMBLY=GCF_000005845.2
## List the selected assemblies for a given species
## (e.g. Escherichia_coli)
install-organism -v 1 -group ${GROUP} \
-filter assembly_accession ${ASSEMBLY} \
-task list
## Download the selected assemblies
install-organism -v 1 -group ${GROUP} \
-filter assembly_accession ${ASSEMBLY} \
-task download
Note: downloaded genomes will be installed in a
folder $RSAT/downloads/refseq
on your computer. Each strain
is stored in a specific subfolder built from the group, genus, species,
and assembly.
For example, the reference assembly for Escherichia coli is stored in
downloads/refseq/bacteria//Escherichia/Escherichia_coli_str._K-12_substr._MG1655/GCF_000005845.2_ASM584v2/
Parsing consists in extracting the information from a text-formatted file, in order to organise it in a computable structure.
For each parsed genome, the program creates a folder whose name is the concatenation of the genus, species, strain accession and assembly name, separated by undescore characters.
$RSAT/public_html/data/[Genus]_[species]_[assembly_accession]_[asm_name]
For Escherichia coli K12 reference strain, this gives:
Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2
From now on we will directly refer to this concatenated name as the
Organism name (generally denoted by the option
-org
in RSAT commands.
export ORG=Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2
## Check the parsing result
ls -l ${RSAT}/public_html/data/genomes/${ORG}/genome/
## Measure the disk space occupied by this genome (note: this will be increased in the subsequent steps)
du -sm ${RSAT}/public_html/data/genomes/${ORG}/genome/
This gives the number of megabases occupied by the genome folder.
In order to make the parsed genome available for RSAT, you need to run the configuration task.
Several more tasks need to be performed before we can consider this organism as “installed”.
The simplest way to run it is to select the “default” tasks.
Task | Meaning |
---|---|
default | run all the default tasks useful for the installation of a genome (this is a subset of the tasks specified below) |
start_stop | compute trinucleotide frequencies in all the start and stop codons (validation of the correspondance between genome sequences and gene coordinates) |
allup | Retrieve all upstream sequences in order to compute the background models (oligonucleotide and dyad frequencies) |
seq_len_distrib | Compute the distribution of upstream sequences |
genome_segments | SPlit the genome into segments corresponding to different genomic region types: genic, intergenic, … |
upstream_freq | Compute oligonucleotide or/and dyad frequencies in upstream sequences (must be combined with tasks oligos and/or dyads) |
genome_freq | Compute oligonucleotide or/and dyad frequencies in the whole genome (must be combined with tasks oligos and/or dyads) |
protein_freq | Compute amino acid, di- and tri-peptide frequencies in all protein sequences |
protein_len | Compute distribution of protein lengths |
oligos | Compute oligonucleotide frequencies (size 1 to 8) in a given type of sequences. Must be combined with stasks upstream or genomic |
dyads | Idem for dyads, i.e. spaced pairs of trinucleotides |
fasta_genome | convert genome sequence in fasta in order to make it usable by bedtools |
chrom_sizes | compute chromosome sizes from genomic sequences |
index_bedtools | index genome to enable its ultra-fast processing by bedtools |
uninstall | uninstall a genome, i.e. stop displaying it as available. Beware,
this does not erase the genome folder, for this you can use the option
-task erase |
erase | Erase the whole folder of your organism in
$RSAT/public_html/data |
## Retrieve all the start codon sequences
retrieve-seq -all -org ${ORG} -from 0 -to 2 -o start_seq.fasta
## Check the result
head -n 20 start_seq.fasta
## Check that most genes have ATG as start codon
oligo-analysis -v 1 -i start_seq.fasta -l 3 -1str -return occ,freq -sort | more
You can do the similar test with the stop codon frequencies. Note
that nothing forces you to store the stop codong sequences in a local
file, you can use the pipe symbol |
to concatenate two
commands.
## Compute all trinucleotide frequencies in the stop codons
retrieve-seq -all -org ${ORG} \
-type downstream -from -1 -to -3 -o stop_seq.fasta
## Check the beginning of the result file
head -20 stop_seq.fasta
## Check that most genes have ATG as start codon
oligo-analysis -v 1 -i stop_seq.fasta -l 3 -1str \
-return occ,freq -sort | more
The assembly summary file provided by NCBI for each taxonomic group gives detailed indications of the sequencing status for each assembly.
This table contains many columns with valuable information for genome management. The column contain is indicated in a separate file.
Genome assemblies are classigied in four levels (PMCID: PMC4702866).
We can use this assembly summary to select only those genomes with the a completion level that we consider appropriate.
For example, for the RSAT prokaryote server (http://prokaryote.rsat.eu), we restrict the installation to genomes by combining the following criteria :
The field refseq_category (column 5) is either reference genome or representative genome.
The field assembly_level (column 12) is Complete genome
Since ewe downloaded the assembly summary file above, we can count the number of genomes of different types in these columns.
## Choose a taxonomic group of interest
export GROUP=bacteria
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt
## Count the number of assemblies for each value of the refseq_category
grep -v '^#' ${ASSEMBLY_SUMMARY} | cut -f 5,12 | sort | uniq -c | sort -n
As of May 5, 2024, the number of assemblies in these categories is the following:
We saw above that we can apply two filters to select the 15 genomes
labeled as reference genome in the field
refseq_category
and Complete Genome in
assembly_level
.
We will first list the selected organisms in order to make sure that the filter works properly (and avoid the risk of unwillingly installing 350,000 assemblies).
export GROUP=bacteria
## Select the reference genomes with complete genome
install-organism -v 1 -group ${GROUP} \
-filter refseq_category "reference genome" \
-filter assembly_level "Complete Genome" \
-task list -list_format org
This should give a list of 15 genomes. We will then specity the tasks required to install each of these genomes. Note that this can take some time, since each genome has to be downloaded, parsed, and processed in order to be ready for use on the RSAT server. In particular, the computation of oligonucleotide and dyads in all the non-coding upstream sequences can take some time.
export GROUP=bacteria
## Select the representative genomes with complete genome
install-organism -v 1 -group ${GROUP} \
-filter refseq_category "representative genome" \
-filter assembly_level "Complete Genome" \
-task list -list_format org \
-o ${GROUP}_representative_complete.txt
## Count the number of representative genomes
grep -v '^;' ${GROUP}_representative_complete.txt \
| grep -v '^#' \
| wc -l
Depending on the group of interest, the number of representative genomes can amount to a few tens, or several thousands. In the lattre case, yous should better see the next section (Managing installation tasks with a task scheduler).
We can install multiple species in a single command by specifying a file containing a list of species names (one species name per row).
Beware:
This command will download thousands of genomes on your hard drive, which will undoubtedly occupy some disj space.
The download can take a considerable time depending on the number of selected species and bandwidth of your computer.
## File with the list of species to install
SPECIES_TO_INSTALL=${GROUP}_species_to_install.txt
## Get the two first words of the 8th column as Genus and species names
cut -f 8 ${REPRESENTATIVE_COMPLETE} \
| sed 's/ /_/g' \
> ${SPECIES_TO_INSTALL}
## Count the number of species to instal
wc -l ${SPECIES_TO_INSTALL}
## Download all the selected species
install-organism -v 2 -group ${GROUP} \
-species_file ${SPECIES_TO_INSTALL} -task download
## Check the list of downloaded organisms
## Note: organism = Genus + species + strain
install-organism -v 2 -group ${GROUP} \
-species_file ${SPECIES_TO_INSTALL} -task list
Since the number of genomes to install can be important, we will run
install-organism
with the option -batch
, in
order to parallelise the installation steps via a job scheduler. This
job scheduler must have been previously defined in RSAT configuration
files.
If your RSAT server has been configured adequately, you can add the
option -batch
to the commands above in order to send the
installation tasks to a job scheduler.
ORG_TO_INSTALL=${GROUP}_organisms_to_install.txt
## Parse all organisms
install-organism -v 2 -group ${GROUP} \
-species_file ${SPECIES_TO_INSTALL} -task list,parse \
-batch
## Get the list of organisms from the list of parsed species
install-organism -v 2 -group ${GROUP} \
-species_file ${SPECIES_TO_INSTALL} -task list \
> ${ORG_TO_INSTALL}
## Count the number of species and organisms to install
SPECIES_DO_INSTALL_NB=`grep -v '^;' ${SPECIES_TO_INSTALL} | wc -l | awk '{print $1}'`
ORG_DO_INSTALL_NB=`grep -v '^;' ${ORG_TO_INSTALL} | wc -l | awk '{print $1}'`
echo "To install: ${ORG_DO_INSTALL_NB} organisms, ${SPECIES_DO_INSTALL_NB} distinct species"
## First retrieve all upstream sequences (cannot be done in batch mode)
install-organism -v 2 \
-org_file ${ORG_TO_INSTALL} -task config,allup;
## Send the other installation tasks to the job scheduler
install-organism -v 2 -org_file ${ORG_TO_INSTALL} -task start_stop,seq_len_distrib,genome_segments,upstream_freq,genome_freq,protein_freq,protein_len,oligos,dyads,fasta_genome,fasta_genome_rm,chrom_sizes,index_bedtools -batch
TO BE WRITTEN