The goal of this tutorial is to explain how to install genomes from NCBI in your RSAT server.
This protocol assumes that you have a working instance of the RSAT server, and that you are familiar with the Unix terminal.
We illustrate the installation procedure with a bacterial genome, since NCBI is the main source of prokaryote genomes for RSAT (http://prokaryotes.rsat.eu/). The same procedure also applies to other taxa, as long as they are available on the NCBI FTP server (https://ftp.ncbi.nlm.nih.gov/genomes/).
Genomes are downloaded from the refseq repository: https://ftp.ncbi.nlm.nih.gov/genomes/refseq
The first step to perform with the command install-organism is to get the list of organisms available for a given taxonomic group.
The NCBI FTP genome site is organised by “main” taxonomic groups:
The option -task available returns the complete list of available species for the specified group.
The command below gets the list of all available bacteria and stores them in a text file. It can easily be adapted to another group by changing the environment variable GROUP.
## Get the current date to use as suffix for output files
export TODAY=`date +%Y-%m-%d`
## Choose a taxonomic group of interest
export GROUP=bacteria
## Get the list of bacterial species available on the NCBI FTP site
install-organism -v 1 -group ${GROUP} -task available
This will download the assembly summary file from NCBI for the selected taxonomic group. Beware: for Bacteria this file is quite large (153 MB in April 2024). However, the file is downloaded only if the version on the NCBI server is more recent than the local copy. After the first download, subsequent calls to install-organism check whether a new version is present at the NCBI site and, if not, use the local copy.
The output of install-organism indicates the local path of the assembly summary file.
; Assembly summary file $RSAT/downloads/refseq/bacteria/assembly_summary.txt
; Assembly summary columns $RSAT/downloads/refseq/bacteria/assembly_summary_colums.tsv
; Available assemblies 350727
## Count the number of assemblies (one row per assembly).
## Note: grep is used to filter out the comment lines.
export ASSEMBLY_SUMMARY=${RSAT}/downloads/refseq/${GROUP}/assembly_summary.txt
grep -v '^#' ${ASSEMBLY_SUMMARY} | wc -l
We can of course do the same for other taxa, for example Archaea, Fungi or Mammals.
export NCBI_GROUPS='archaea bacteria fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other viral'
echo "NCBI_GROUPS ${NCBI_GROUPS}"
for ncbi_group in $NCBI_GROUPS; do \
echo "Getting list of NCBI species for ${ncbi_group}" ; \
install-organism -v 1 -group ${ncbi_group} -task available
done
## Approximate count of assemblies per taxonomic group
## (each file starts with 15 comment lines, which are subtracted from each count)
wc -l $RSAT/downloads/refseq/*/assembly_summary.txt \
| sort -nr \
| awk '{print "|\t"$1-15"\t| "$2"\t|"}'
Note: for some species, several strains have been sequenced. The number of genomes available at NCBI thus exceeds by far the number of distinct species.
Here is the result on Feb 2nd, 2025
Number of assemblies | Assembly summary file (one per taxonomic group) |
---|---|
405360 | /opt/rsat-tools/rsat_bacteria/downloads/refseq/bacteria/assembly_summary.txt |
14984 | /opt/rsat-tools/rsat_bacteria/downloads/refseq/viral/assembly_summary.txt |
2438 | /opt/rsat-tools/rsat_bacteria/downloads/refseq/archaea/assembly_summary.txt |
630 | /opt/rsat-tools/rsat_bacteria/downloads/refseq/fungi/assembly_summary.txt |
432 | /opt/rsat-tools/rsat_bacteria/downloads/refseq/vertebrate_other/assembly_summary.txt |
428 | /opt/rsat-tools/rsat_bacteria/downloads/refseq/invertebrate/assembly_summary.txt |
228 | /opt/rsat-tools/rsat_bacteria/downloads/refseq/vertebrate_mammalian/assembly_summary.txt |
173 | /opt/rsat-tools/rsat_bacteria/downloads/refseq/plant/assembly_summary.txt |
108 | /opt/rsat-tools/rsat_bacteria/downloads/refseq/protozoa/assembly_summary.txt |
424901 | total |
Notes
The number of available assemblies exceeds by far the number of species, because some species are represented by different strains, in particular for bacteria (some bacterial species have several thousands of strains sequenced).
Some of these genomes are only partly sequenced or poorly assembled, and the level of annotation is quite variable. NCBI subdivides the available genomes into four categories, depending on the level of completion of the sequencing project.
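The difference between assemblies and species can be checked directly on the assembly summary. The sketch below builds a small mock file (hypothetical rows, laid out like assembly_summary.txt) and counts distinct values of the species taxonomic identifier. It assumes that this identifier is stored in column 7 (species_taxid), which should be verified against the column description file before applying it to the real data.

```shell
## Build a tiny mock summary (hypothetical rows, tab-separated)
MOCK_SUMMARY=$(mktemp)
{
  printf '# mock comment line\n'
  printf 'GCF_A.1\t-\t-\t-\treference genome\t511145\t562\tEscherichia coli K-12\t-\t-\tlatest\tComplete Genome\n'
  printf 'GCF_B.1\t-\t-\t-\tna\t83334\t562\tEscherichia coli O157\t-\t-\tlatest\tContig\n'
  printf 'GCF_C.1\t-\t-\t-\treference genome\t29449\t29449\tRhizobium etli\t-\t-\tlatest\tComplete Genome\n'
} > "${MOCK_SUMMARY}"

## Number of assemblies: one per non-comment row
N_ASSEMBLIES=$(grep -v '^#' "${MOCK_SUMMARY}" | wc -l | awk '{print $1}')

## Number of species: distinct values in column 7 (assumed to be species_taxid)
N_SPECIES=$(grep -v '^#' "${MOCK_SUMMARY}" | cut -f 7 | sort -u | wc -l | awk '{print $1}')

echo "${N_ASSEMBLIES} assemblies, ${N_SPECIES} species"
rm -f "${MOCK_SUMMARY}"
```

To apply it to the real data, replace the mock file by the path of the assembly summary reported by install-organism.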
In order to select a reasonable number of genomes to install, we can filter the assemblies based on different criteria.
The option -species selects a subset of assemblies corresponding to a given species of a given genus. The argument -species should be followed by a string concatenating the genus (with a leading uppercase) and the species (in lowercase), separated by an underscore character (_), with exactly the same spelling and case as on the NCBI FTP refseq site. Note that special characters (][)(/ etc.) should be replaced by an underscore.
## List the assemblies for a given species, e.g. Rhizobium etli
install-organism -v 1 -group ${GROUP} \
-species Rhizobium_etli \
-task list -list_format org
Beware, for some species of interest, the number of assemblies can be quite large. This is for example the case for Escherichia coli.
## List the assemblies for Escherichia coli
install-organism -v 1 -group ${GROUP} \
-species Escherichia_coli \
-task list -list_format org \
-o assemblies_Escherichia_coli.txt
## Count the number of assemblies
grep -v '^;' assemblies_Escherichia_coli.txt | grep -v '^#' | wc -l
## Display the first rows of this file
head -n 20 assemblies_Escherichia_coli.txt
For the species Escherichia_coli, we get 38518 assemblies (2024-02-02).
We generally don’t need to install all these assemblies. We should thus use additional information in order to narrow down the number of genomes to install. To this purpose, the next sections explain how to select genomes labeled either as reference or representative, and completely sequenced.
The assembly summary includes a field refseq_category, which is quite convenient for selecting the reference genomes.
## Get the list of organisms
install-organism -v 1 -group ${GROUP} -task list \
-list_format org \
-filter refseq_category "reference genome" \
-o reference_genomes_organisms.txt
## Display the list of organisms
less reference_genomes_organisms.txt
## Count the number of lines in the list of organisms
wc -l reference_genomes_organisms.txt
The commands above list the assemblies labeled “reference genome” in the column “refseq_category” of the assembly summary file. This collection was built by “selecting the ‘best’ genome assembly for each species”.
On Feb 2, 2025, the reference collection contains 20,578 of the 405,360 available bacterial assemblies.
The assembly summary file also contains a column indicating the sequencing status for each assembly.
This table contains many columns with valuable information for genome management. The content of each column is described in a separate file.
Genome assemblies are classified in four levels of chromosome assembly: Complete Genome, Chromosome, Scaffold and Contig (PMCID: PMC4702866).
We can use this assembly summary to select only those genomes with a completion level that we consider appropriate. The commands below count the number of genomes per type of chromosome assembly.
## Choose a taxonomic group of interest
export GROUP=bacteria
export ASSEMBLY_SUMMARY=${RSAT}/downloads/refseq/${GROUP}/assembly_summary.txt
## Count the number of assemblies for each value of assembly_level (column 12)
grep -v '^#' ${ASSEMBLY_SUMMARY} | cut -f 12 | sort | uniq -c | sort -n
We can now compute the number of assemblies for each combination of annotations for the two criteria: refseq_category (column 5 of assembly_summary.txt) and assembly_level (column 12).
For the RSAT prokaryote server (http://prokaryote.rsat.eu), we restrict the installation to genomes selected on both of these criteria.
## Choose a taxonomic group of interest
export GROUP=bacteria
export ASSEMBLY_SUMMARY=${RSAT}/downloads/refseq/${GROUP}/assembly_summary.txt
## Count the number of assemblies for each combination of refseq_category and assembly_level
cut -f 5,12 ${ASSEMBLY_SUMMARY} | sort | uniq -c | sort -n \
| perl -pe 's/(\d+) /$1\t/' \
| awk -F'\t' '{print "| "$1"\t| "$3"\t| "$2" |"}'
As of Feb 2, 2025, the number of assemblies in these categories is the following:
Nb | assembly_level | refseq_category |
---|---|---|
424 | Chromosome | reference genome |
5511 | Chromosome | na |
5734 | Complete Genome | reference genome |
5952 | Scaffold | reference genome |
8468 | Contig | reference genome |
41143 | Complete Genome | na |
125680 | Scaffold | na |
212461 | Contig | na |
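As a cross-check that does not depend on install-organism, a combination of the two columns can also be selected with awk. The sketch below runs on a small mock file with hypothetical rows. It relies only on the column positions given above (refseq_category in column 5, assembly_level in column 12) and prints the first column of each selected row, which is assumed here to hold the assembly accession.

```shell
## Build a tiny mock summary (hypothetical rows, tab-separated)
MOCK_SUMMARY=$(mktemp)
{
  printf '# mock comment line\n'
  printf 'GCF_A.1\t-\t-\t-\treference genome\t-\t-\tSpecies one\t-\t-\t-\tComplete Genome\n'
  printf 'GCF_B.1\t-\t-\t-\treference genome\t-\t-\tSpecies two\t-\t-\t-\tScaffold\n'
  printf 'GCF_C.1\t-\t-\t-\tna\t-\t-\tSpecies one\t-\t-\t-\tComplete Genome\n'
  printf 'GCF_D.1\t-\t-\t-\treference genome\t-\t-\tSpecies three\t-\t-\t-\tComplete Genome\n'
} > "${MOCK_SUMMARY}"

## Keep the rows that are both "reference genome" and "Complete Genome"
SELECTED=$(awk -F'\t' '!/^#/ && $5 == "reference genome" && $12 == "Complete Genome" {print $1}' "${MOCK_SUMMARY}")
N_SELECTED=$(printf '%s\n' "${SELECTED}" | wc -l | awk '{print $1}')

echo "${SELECTED}"
echo "${N_SELECTED} assemblies selected"
rm -f "${MOCK_SUMMARY}"
```

On the real assembly summary, the number of selected rows should match the corresponding count in the table above.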
For the RSAT bacterial server, we select the assemblies that fit both of the following criteria: the refseq_category is "reference genome", and the assembly_level is "Complete Genome".
We will list the selected organisms in order to make sure that the filter works properly, with 3 alternative options for the output format:
- -list_format org exports the organism name. These are built by concatenating the genus, species and assembly, separated by underscore characters.
- -list_format table exports a subset of the table assembly_summary.txt, restricted to the assemblies passing the filters.
- -list_format bash exports a bash script with all the commands for installing the selected genomes. Beware: depending on the group of interest, the number of reference genomes can amount to a few tens, or several thousands. In the latter case, you should better see the section Managing installation tasks with a task scheduler.

export GROUP=bacteria
export REFERENCE_COMPLETE=${GROUP}_reference_complete
## Select the reference genomes with complete genome
install-organism -v 1 -group ${GROUP} \
-filter refseq_category "reference genome" \
-filter assembly_level "Complete Genome" \
-task list -list_format org \
-o ${REFERENCE_COMPLETE}.txt
## Count the number of complete reference genomes
grep -v '^;' ${REFERENCE_COMPLETE}.txt \
| grep -v '^#' \
| wc -l
## Export the metadata table for the reference genomes with complete genome
install-organism -v 1 -group ${GROUP} \
-filter refseq_category "reference genome" \
-filter assembly_level "Complete Genome" \
-task list -list_format table \
-o ${REFERENCE_COMPLETE}.tsv
## Export a bash file with the installation tasks
install-organism -v 1 -group ${GROUP} -task list \
-list_format bash \
-filter refseq_category "reference genome" \
-filter assembly_level "Complete Genome" \
-o ${REFERENCE_COMPLETE}.bash
We will now download the genome of a selected assembly, and install it on the RSAT server. As an illustration, we will install the reference strain K-12 of Escherichia coli.
## Select a species and assembly of interest
## (can be adapted to your needs)
grep Escherichia ${REFERENCE_COMPLETE}.tsv | grep K-12
grep Escherichia ${REFERENCE_COMPLETE}.tsv | grep K-12 | cut -f 8
export ASSEMBLY=GCF_000005845.2
## List the selected assembly, filtered on its accession
## (here the Escherichia coli K-12 reference strain)
install-organism -v 1 -group ${GROUP} \
-filter assembly_accession ${ASSEMBLY} \
-task list
## Download the selected assemblies
install-organism -v 1 -group ${GROUP} \
-filter assembly_accession ${ASSEMBLY} \
-task download
Note: downloaded genomes are stored in the folder $RSAT/downloads/refseq on your computer. Each strain is stored in a specific subfolder built from the group, genus, species, and assembly.
For example, the reference assembly for Escherichia coli is stored in
downloads/refseq/bacteria/Escherichia/Escherichia_coli_str._K-12_substr._MG1655/GCF_000005845.2_ASM584v2/
Parsing consists in extracting the information from a text-formatted file, in order to organise it in a computable structure.
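A minimal sketch of a parsing call for the assembly downloaded above. It assumes that the parse task (used further down in this tutorial as -task list,parse) accepts the same -group and -filter options as the download task. This combination is an assumption and should be checked against the help message of install-organism. The command is printed first, and executed only if install-organism is found on the PATH.

```shell
## Assembly selected in the download step
export GROUP=bacteria
export ASSEMBLY=GCF_000005845.2

## Assumption: -task parse can be combined with -filter assembly_accession
PARSE_CMD="install-organism -v 1 -group ${GROUP} -filter assembly_accession ${ASSEMBLY} -task parse"
echo "${PARSE_CMD}"

## Run the command only on a machine where RSAT is installed
if command -v install-organism > /dev/null 2>&1; then
  ${PARSE_CMD}
fi
```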
For each parsed genome, the program creates a folder whose name is the concatenation of the genus, species, assembly accession and assembly name, separated by underscore characters.
$RSAT/public_html/data/genomes/[Genus]_[species]_[assembly_accession]_[asm_name]
For Escherichia coli K12 reference strain, this gives:
Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2
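The naming rule can be sketched with a small shell function. This is only an illustration, not the code used by install-organism: the function name make_org_name is hypothetical, and the set of replaced characters (spaces and ][)(/ ) is an assumption based on the special characters mentioned earlier in this tutorial.

```shell
## Hypothetical helper: build an organism name from three fields
## $1 = organism name, $2 = assembly accession, $3 = assembly name
make_org_name() {
  ## Join the fields with underscores, then replace spaces and ][)(/ by underscores
  printf '%s_%s_%s\n' "$1" "$2" "$3" | sed 's|[][)(/ ]|_|g'
}

ORG_NAME=$(make_org_name "Escherichia coli str. K-12 substr. MG1655" GCF_000005845.2 ASM584v2)
echo "${ORG_NAME}"
```

For the Escherichia coli K-12 fields, this reproduces the folder name shown above.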
From now on we will directly refer to this concatenated name as the Organism name (generally denoted by the option -org in RSAT commands).
export ORG=Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2
## Check the parsing result
ls -l ${RSAT}/public_html/data/genomes/${ORG}/genome/
## Measure the disk space occupied by this genome (note: this will be increased in the subsequent steps)
du -sm ${RSAT}/public_html/data/genomes/${ORG}/genome/
This gives the number of megabytes occupied by the genome folder.
In order to make the parsed genome available for RSAT, you need to run the configuration task.
Several more tasks need to be performed before we can consider this organism as “installed”.
The simplest way to run it is to select the “default” tasks.
Task | Meaning |
---|---|
default | run all the default tasks useful for the installation of a genome (this is a subset of the tasks specified below) |
start_stop | compute trinucleotide frequencies in all the start and stop codons (validation of the correspondence between genome sequences and gene coordinates) |
allup | Retrieve all upstream sequences in order to compute the background models (oligonucleotide and dyad frequencies) |
seq_len_distrib | Compute the distribution of upstream sequences |
genome_segments | Split the genome into segments corresponding to different genomic region types: genic, intergenic, … |
upstream_freq | Compute oligonucleotide or/and dyad frequencies in upstream sequences (must be combined with tasks oligos and/or dyads) |
genome_freq | Compute oligonucleotide or/and dyad frequencies in the whole genome (must be combined with tasks oligos and/or dyads) |
protein_freq | Compute amino acid, di- and tri-peptide frequencies in all protein sequences |
protein_len | Compute distribution of protein lengths |
oligos | Compute oligonucleotide frequencies (size 1 to 8) in a given type of sequences. Must be combined with tasks upstream_freq or genome_freq |
dyads | Idem for dyads, i.e. spaced pairs of trinucleotides |
fasta_genome | convert genome sequence in fasta in order to make it usable by bedtools |
chrom_sizes | compute chromosome sizes from genomic sequences |
index_bedtools | index genome to enable its ultra-fast processing by bedtools |
uninstall | uninstall a genome, i.e. stop displaying it as available. Beware, this does not erase the genome folder, for this you can use the option -task erase |
erase | Erase the whole folder of your organism in $RSAT/public_html/data |
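A minimal sketch of a call running the default tasks on the parsed genome. It assumes that install-organism accepts the organism name through the option -org, as other RSAT commands do; this should be checked against the help message of the program. The command is printed first, and executed only if install-organism is found on the PATH.

```shell
## Organism name, as defined above
export ORG=Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2

## Assumption: the organism is passed with -org, and "default" is given as a task name
DEFAULT_CMD="install-organism -v 1 -org ${ORG} -task default"
echo "${DEFAULT_CMD}"

## Run the command only on a machine where RSAT is installed
if command -v install-organism > /dev/null 2>&1; then
  ${DEFAULT_CMD}
fi
```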
## Retrieve all the start codon sequences
retrieve-seq -all -org ${ORG} -from 0 -to 2 -o start_seq.fasta
## Check the result
head -n 20 start_seq.fasta
## Check that most genes have ATG as start codon
oligo-analysis -v 1 -i start_seq.fasta -l 3 -1str -return occ,freq -sort | more
You can do a similar test with the stop codon frequencies. Note that nothing forces you to store the stop codon sequences in a local file: you can use the pipe symbol | to chain two commands.
## Retrieve all the stop codon sequences
retrieve-seq -all -org ${ORG} \
-type downstream -from -1 -to -3 -o stop_seq.fasta
## Check the beginning of the result file
head -20 stop_seq.fasta
## Check that most genes end with one of the stop codons (TAA, TAG, TGA)
oligo-analysis -v 1 -i stop_seq.fasta -l 3 -1str \
-return occ,freq -sort | more
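The codon counts can also be cross-checked with standard Unix tools, independently of oligo-analysis. The sketch below uses a small mock fasta file with hypothetical sequences. It assumes that each sequence is a single trinucleotide written on one line, which is what is expected for start or stop codons but should be verified on the real file.

```shell
## Build a tiny mock fasta file of start codons (hypothetical sequences)
MOCK_FASTA=$(mktemp)
printf '>gene1\nATG\n>gene2\nATG\n>gene3\nGTG\n>gene4\natg\n' > "${MOCK_FASTA}"

## Drop the header lines, convert to uppercase, count each codon,
## and sort by decreasing number of occurrences
CODON_COUNTS=$(grep -v '^>' "${MOCK_FASTA}" \
  | tr 'a-z' 'A-Z' \
  | sort | uniq -c | sort -k1,1nr \
  | awk '{print $2"\t"$1}')
echo "${CODON_COUNTS}"

## Most frequent codon
TOP_CODON=$(echo "${CODON_COUNTS}" | head -n 1 | cut -f 1)
echo "Most frequent codon: ${TOP_CODON}"
rm -f "${MOCK_FASTA}"
```

Replacing the mock file by start_seq.fasta or stop_seq.fasta gives a quick second opinion on the oligo-analysis result.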
We can install multiple species in a single command by specifying a file containing a list of species names (one species name per row).
Beware:
This command will download thousands of genomes to your hard drive. Before proceeding, we recommend estimating the required disk space by installing a few genomes and extrapolating to the number of genomes to be installed.
The download can take a considerable time depending on the number of selected species and bandwidth of your computer.
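The extrapolation can be sketched as a simple mean multiplied by the number of genomes. In the example below, the helper estimate_total_mb is hypothetical and the three sizes (40, 50 and 60 megabytes) are invented values. In practice they would be measured with du -sm on a few installed genomes. The number of genomes (5734) is the count of complete reference genomes reported in the table above.

```shell
## Hypothetical helper: estimate the total disk space in megabytes
## $1 = number of genomes to install; remaining arguments = sizes (MB) of sample genomes
estimate_total_mb() {
  n_genomes=$1
  shift
  ## Mean size of the samples, multiplied by the number of genomes
  printf '%s\n' "$@" \
    | awk -v n="${n_genomes}" '{ sum += $1 } END { printf "%d\n", (sum / NR) * n }'
}

ESTIMATE_MB=$(estimate_total_mb 5734 40 50 60)
echo "Estimated disk space: ${ESTIMATE_MB} MB"
```

This is a rough estimate only, since genome sizes vary between species and the installation tasks add files to each genome folder.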
## File with the list of species to install
export ORG_TO_INSTALL=${REFERENCE_COMPLETE}.txt
## Count the number of species to install
wc -l ${ORG_TO_INSTALL}
## Download all the selected species
install-organism -v 2 -group ${GROUP} \
-org_file ${ORG_TO_INSTALL} -task download
## Check the list of downloaded organisms
## Note: organism = Genus + species + strain
install-organism -v 2 -group ${GROUP} \
-species_file ${ORG_TO_INSTALL} -task list
Since the number of genomes to install can be large, we will run install-organism with the option -batch, in order to parallelise the installation steps via a job scheduler. This job scheduler must have been previously defined in the RSAT configuration files.
If your RSAT server has been configured adequately, you can add the option -batch to the commands above in order to send the installation tasks to a job scheduler.
ORG_TO_INSTALL=${GROUP}_organisms_to_install.txt
## Parse all organisms
install-organism -v 2 -group ${GROUP} \
-species_file ${ORG_TO_INSTALL} -task list,parse \
-batch
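## Get the list of organisms from the list of parsed species
## (write to a temporary file first: redirecting directly to the input file would empty it before it is read)
install-organism -v 2 -group ${GROUP} \
-species_file ${ORG_TO_INSTALL} -task list \
> ${ORG_TO_INSTALL}.tmp \
&& mv ${ORG_TO_INSTALL}.tmp ${ORG_TO_INSTALL}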
## Count the number of species and organisms to install
SPECIES_DO_INSTALL_NB=`grep -v '^;' ${ORG_TO_INSTALL} | wc -l | awk '{print $1}'`
ORG_DO_INSTALL_NB=`grep -v '^;' ${ORG_TO_INSTALL} | wc -l | awk '{print $1}'`
echo "To install: ${ORG_DO_INSTALL_NB} organisms, ${SPECIES_DO_INSTALL_NB} distinct species"
## First retrieve all upstream sequences (cannot be done in batch mode)
install-organism -v 2 \
-org_file ${ORG_TO_INSTALL} -task config,allup;
## Send the other installation tasks to the job scheduler
install-organism -v 2 -org_file ${ORG_TO_INSTALL} -task start_stop,seq_len_distrib,genome_segments,upstream_freq,genome_freq,protein_freq,protein_len,oligos,dyads,fasta_genome,fasta_genome_rm,chrom_sizes,index_bedtools -batch
TO BE WRITTEN