1 Introduction

The goal of this tutorial is to explain how to install genomes from NCBI in your RSAT server.

2 Prerequisite

This protocol assumes that you have a working instance of the RSAT server, and that you are familar with the Unix terminal.

3 Protocol

We illustrate the installation procedure with a bacterial genome, since NCBI is the main source of prokaryote genomes for RSAT (http://prokaryotes.rsat.eu/). The same procedure can however apply to other taxa, as far as they are available on the NCBI FTP server (https://ftp.ncbi.nlm.nih.gov/genomes/).

3.1 Data source

Genomes are downloaded from the refseq repository: https://ftp.ncbi.nlm.nih.gov/genomes/refseq

3.2 Getting the list of available species

The first step to perform with the command ìnstall-organisms is to get the list of organisms available for a given taxonomic group.

The NCBI FTP genome site is organised by “main” taxonomic groups:

The option -task available returns the complete list of available species for the specified group.

The command below gets the list of all available bacteria and stores them in a text file. It can easily be adapted to another group by changing the environment variable GROUP.

## Get curret date to use as suffix for output files
export TODAY=`date +%Y-%m-%d`

## Choose a taxonomic group of interest
export GROUP=bacteria

## Get the list of bacterial species available on the NCBI FTP site
install-organism -v 1 -group ${GROUP} -task available

This will download the assembly summmary file from NCBI, for the selected taxonomic group. Beware, for Bacteria this file is pretty large (153Mb on April 2024). However, this file is downloaded only if the version on the NCBI server is more recent than the local version. After the first download, subsequent calls of the install-organisms command will check if a new version is present at the NCBI site, and if not it will use the local copy.

The output of install-organisms indicates the local path of the assembly summary file.

;    Assembly summary file      $RSAT/downloads/refseq/bacteria/assembly_summary.txt
;    Assembly summary columns   $RSAT/downloads/refseq/bacteria/assembly_summary_colums.tsv
;    Available assemblies         350727
## Count the number of species.
## Note: grep is used to filter out the comment lines.
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt
grep -v '^#' ${ASSEMBLY_SUMMARY} | wc -l

3.2.1 Getting lists of available species for other groups

We can of course do the same for other taxa, for example Archaea, Fungi or Mammalian.

export NCBI_GROUPS='archaea bacteria fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other viral'

echo "NCBI_GROUPS ${NCBI_GROUPS}"
for ncbi_group in $NCBI_GROUPS; do \
  echo "Getting list of NCBI species for ${ncbi_group}" ; \
  install-organism -v 1 -group ${ncbi_group} -task available
done


## Approx count of species per taxa
## (each file starts with 15 comment lines, which are be substracted from each count)
wc -l $RSAT/downloads/refseq/*/assembly_summary.txt \
   | sort -nr \
   | awk '{print "|\t"$1-15"\t| "$2"\t|"}' 

Note: for some species, several strains have been sequenced. The number of genomes available at NCBI thus exceeds by far the number of distinct species.

3.2.2 Number of species per group

Here is the result on JanFeb 2nd, 2025

Number of species Taxonomic group
405360 /opt/rsat-tools/rsat_bacteria/downloads/refseq/bacteria/assembly_summary.txt
14984 /opt/rsat-tools/rsat_bacteria/downloads/refseq/viral/assembly_summary.txt
2438 /opt/rsat-tools/rsat_bacteria/downloads/refseq/archaea/assembly_summary.txt
630 /opt/rsat-tools/rsat_bacteria/downloads/refseq/fungi/assembly_summary.txt
432 /opt/rsat-tools/rsat_bacteria/downloads/refseq/vertebrate_other/assembly_summary.txt
428 /opt/rsat-tools/rsat_bacteria/downloads/refseq/invertebrate/assembly_summary.txt
228 /opt/rsat-tools/rsat_bacteria/downloads/refseq/vertebrate_mammalian/assembly_summary.txt
173 /opt/rsat-tools/rsat_bacteria/downloads/refseq/plant/assembly_summary.txt
108 /opt/rsat-tools/rsat_bacteria/downloads/refseq/protozoa/assembly_summary.txt
424901 total

Notes

  • The number of available assemblies exceeds by far the number of species, because some species are represented by different strains, in particular for bacteria (some bacterial species have several thousands of strains sequenced).

  • Some of these genomes are only partly sequenced, or poorly assembled, and the level of annotation is quite variable. NCBI subdivided these available genomes in 4 categories depending on the level of completion of the sequencing project.

3.3 Filtering assemblies

In order to select a reasonable number of genomes to install, we can filter the assemblies based on different criteria.

3.3.1 Listing the available assemblies for a species of interest

The option -species enables selecting a subset of assemblies corresponding to a given species of a given genus. The argulent -species should be followed by a string concatenating the Genus (with a leading uppercase) and species (in lowercase) separated by an underscore character (_), with exactly the same spelling and cases as on the NCBI FTP refseq site. Note that special characters (][)(/ etc) should be replaced by an underscore.

## List the assemblies for a given species, e.g. Rhizobium etli
install-organism -v 1 -group ${GROUP} \
  -species Rhizobium_etli \
  -task list -list_format org

Beware, for some species of interest, the number of assemblies can be quite large. This is for example the case for Escherchia coli.

## List the assemblies for Escherichia coli)
install-organism -v 1 -group ${GROUP} \
  -species Escherichia_coli \
  -task list -list_format org \
  -o assemblies_Escherichia_coli.txt

## Count the number of assemblies
grep -v '^;' assemblies_Escherichia_coli.txt | grep -v '^#' | wc -l

## Display the first rows of this file
head -n 20 assemblies_Escherichia_coli.txt

For the species Escherichia_coli, we get 38518 assemblies (2024-02-02).

We generally don’t need to install all these assemblies. We should thus use additional information in order to narrow down the number of genomes to install. To this purpose, the next sections explain how to select genomes labeled either as reference or representative, and completely sequenced.

3.3.2 Selecting reference genomes

The assembly summary includes a field refseq_category, which is quite convenient

## Get the list of organisms
install-organism -v 1 -group ${GROUP}  -task list \
  -list_format org \
  -filter refseq_category "reference genome" \
  -o reference_genomes_organisms.txt

## Display the list of organisms
less reference_genomes_organisms.txt

## Display the list of organisms
wc -l reference_genomes_organisms.txt

The commands above list the assemblies labeled “reference genome” in the column “refseq_category” of the assembly summary file. This collection was built by “selecting the ‘best’ genome assembly for each species”.

On Feb 2, 2025, the reference collection contains 20,578 among the 424,901 available bacterial genomes.

3.3.3 Selecting reference organisms with complete genome

The assembly summary file also contains a column indicating the sequencing status for each assembly.

This table contains many columns with valuable information for genome management. The column contain is indicated in a separate file.

Genome assemblies are classified in four levels of chromosome assembly (PMCID: PMC4702866).

  1. Contig: Assemblies that include only contigs.
  2. Scaffold: Includes both scaffolds and contigs.
  3. Chromosome: Includes chromosome or linkage groups, plus scaffolds and contigs.
  4. Complete genome: Assemblies for which all molecules are fully sequenced.

We can use this assembly summary to select only those genomes with the a completion level that we consider appropriate. The commands below count the number of genomes per type of chromosome assembly.

## Choose a taxonomic group of interest
export GROUP=bacteria
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt

## Count the number of assemblies for each value of the refseq_category
grep -v '^#' ${ASSEMBLY_SUMMARY} | cut -f 12 | sort | uniq -c | sort -n

We can now compute the number of assemblies for each combination of annotations for the two criteria: refseq_category (column 5 of assembly_summary.txt) and assembly_level (column 12).

For the RSAT prokaryote server (http://prokaryote.rsat.eu), we restrict the installation to genomes

## Choose a taxonomic group of interest
export GROUP=bacteria
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt

## Count the number of assemblies for each value of the refseq_category
cut -f 5,12 ${ASSEMBLY_SUMMARY} | sort | uniq -c | sort -n \
  | perl -pe 's/(\d+) /$1\t/' \
  | awk -F'\t' '{print "| "$1"\t| "$3"\t| "$2" |"}'

As of Feb 2, 2025, the number of assemblies in these categories is the following:

Nb assembly_level refseq_category
424 Chromosome reference genome
5511 Chromosome na
5734 Complete Genome reference genome
5952 Scaffold reference genome
8468 Contig reference genome
41143 Complete Genome na
125680 Scaffold na
212461 Contig na

For the RSAT bacterial server, we select the assemblies that fit both of the following criteria :

  • The field refseq_category (column 5) is reference genome.
  • The field assembly_level (column 12) is Complete genome.

We will list the selected organisms in order to make sure that the filter works properly, with 3 alternative options for the output format:

  • -list_format org exports the organism name. These are built by concatenating the genus, species and assembly, separated by underscore characters.
  • -list_format table exports a subset of the table assembly_summary.txt, restricted to the assemblies passing the filters
  • -list_format bash exports a bash script with all the commands for installing the selected genomes. Beware: depending on the group of interest, the number of reference genomes can amount to a few tens, or several thousands. In the latter case, you should better see the section Managing installation tasks with a task scheduler.
export GROUP=bacteria
export REFERENCE_COMPLETE=${GROUP}_reference_complete

## Select the reference genomes with complete genome
install-organism -v 1 -group ${GROUP} \
  -filter refseq_category "reference genome" \
  -filter assembly_level "Complete Genome" \
  -task list -list_format org \
  -o ${REFERENCE_COMPLETE}.txt

## Count the number of complete reference genomes
grep -v '^;' ${REFERENCE_COMPLETE}.txt \
  | grep -v '^#' \
  | wc -l

## Export the metadata table for the reference genomes with complete genome
install-organism -v 1 -group ${GROUP} \
  -filter refseq_category "reference genome" \
  -filter assembly_level "Complete Genome" \
  -task list -list_format table \
  -o ${REFERENCE_COMPLETE}.tsv

## Export a bash file with the installation tasks
install-organism -v 1 -group ${GROUP}  -task list \
  -list_format bash \
  -filter refseq_category "reference genome" \
  -filter assembly_level "Complete Genome" \
  -o ${REFERENCE_COMPLETE}.bash

3.4 Installing one assembly for a given species and strain of interest

3.4.1 Downloading a selected assembly

We will now download the genome of a selected assembly, and install it on the RSAT server. As an illustration, we will install the reference strain K-12 of Escherichia coli.

## Select a species and assembly of interest 
## (can be adapted to your needs)
grep Escherichia reference_genomes.tsv | grep K-12
grep Escherichia reference_genomes.tsv | grep K-12 | cut -f 8

export ASSEMBLY=GCF_000005845.2


## List the selected assemblies for a given species 
## (e.g. Escherichia_coli)  
install-organism -v 1 -group ${GROUP} \
  -filter assembly_accession ${ASSEMBLY} \
  -task list

## Download the selected assemblies 
install-organism -v 1 -group ${GROUP} \
  -filter assembly_accession ${ASSEMBLY} \
  -task download
  

Note: downloaded genomes will be installed in a folder $RSAT/downloads/refseq on your computer. Each strain is stored in a specific subfolder built from the group, genus, species, and assembly.

For example, the reference assembly for Escherichia coli is stored in

downloads/refseq/bacteria//Escherichia/Escherichia_coli_str._K-12_substr._MG1655/GCF_000005845.2_ASM584v2/

ls -1 downloads/refseq/bacteria/Escherichia/Escherichia_coli_str._K-12_substr._MG1655/GCF_000005845.2_ASM584v2

3.4.2 Parsing a selected genome from NCBI

Parsing consists in extracting the information from a text-formatted file, in order to organise it in a computable structure.

## Parse a given strain of a given genome 
## (e.g. strain GCF_000016185.1_ASM1618v1 of Rhodospirillum centenum) 
install-organism -v 2 -group ${GROUP} \
  -filter assembly_accession ${ASSEMBLY} \
  -task parse

3.4.3 Organism names

For each parsed genome, the program creates a folder whose name is the concatenation of the genus, species, strain accession and assembly name, separated by undescore characters.

$RSAT/public_html/data/[Genus]_[species]_[assembly_accession]_[asm_name]

For Escherichia coli K12 reference strain, this gives:

Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2

From now on we will directly refer to this concatenated name as the Organism name (generally denoted by the option -org in RSAT commands.


export ORG=Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2

## Check the parsing result
ls -l ${RSAT}/public_html/data/genomes/${ORG}/genome/

## Measure the disk space occupied by this genome (note: this will be increased in the subsequent steps)
du -sm ${RSAT}/public_html/data/genomes/${ORG}/genome/

This gives the number of megabases occupied by the genome folder.

3.4.4 Adding the parsed genome to the list of supported genomes

In order to make the parsed genome available for RSAT, you need to run the configuration task.

## Configure the organism for RSAT
install-organism -v 1 -org ${ORG} -source NCBI -task config


## Check that the organism has wel been declared
supported-organisms -return ID,source,last_update | grep ${ORG}

3.4.5 Finalizing the installation

Several more tasks need to be performed before we can consider this organism as “installed”.

The simplest way to run it is to select the “default” tasks.

## Run the installation task (retrieve upstream sequences compute background oligo and dyad frequencies, index fasta genome for bedtols, ...)
install-organism -v 1 -org ${ORG} -task default

3.4.6 Detailed list of supported installation tasks (default and others)

Task Meaning
default run all the default tasks useful for the installation of a genome (this is a subset of the tasks specified below)
start_stop compute trinucleotide frequencies in all the start and stop codons (validation of the correspondance between genome sequences and gene coordinates)
allup Retrieve all upstream sequences in order to compute the background models (oligonucleotide and dyad frequencies)
seq_len_distrib Compute the distribution of upstream sequences
genome_segments SPlit the genome into segments corresponding to different genomic region types: genic, intergenic, …
upstream_freq Compute oligonucleotide or/and dyad frequencies in upstream sequences (must be combined with tasks oligos and/or dyads)
genome_freq Compute oligonucleotide or/and dyad frequencies in the whole genome (must be combined with tasks oligos and/or dyads)
protein_freq Compute amino acid, di- and tri-peptide frequencies in all protein sequences
protein_len Compute distribution of protein lengths
oligos Compute oligonucleotide frequencies (size 1 to 8) in a given type of sequences. Must be combined with stasks upstream or genomic
dyads Idem for dyads, i.e. spaced pairs of trinucleotides
fasta_genome convert genome sequence in fasta in order to make it usable by bedtools
chrom_sizes compute chromosome sizes from genomic sequences
index_bedtools index genome to enable its ultra-fast processing by bedtools
uninstall uninstall a genome, i.e. stop displaying it as available. Beware, this does not erase the genome folder, for this you can use the option -task erase
erase Erase the whole folder of your organism in $RSAT/public_html/data

3.4.7 Checking that the organism is well supported

## Get all the supported organisms
supported-organisms | grep ${ORG}

3.4.8 Checking the correctness of gene annotations by computing start/stop codon frequencies

## Retrieve all the start codon sequences
retrieve-seq -all -org ${ORG} -from 0 -to 2 -o start_seq.fasta

## Check the result
head -n 20 start_seq.fasta

## Check that most genes have ATG as start codon
oligo-analysis -v 1 -i start_seq.fasta  -l 3 -1str -return occ,freq -sort | more

You can do the similar test with the stop codon frequencies. Note that nothing forces you to store the stop codong sequences in a local file, you can use the pipe symbol | to concatenate two commands.

## Compute all trinucleotide frequencies in the stop codons
retrieve-seq -all -org ${ORG} \
   -type downstream -from -1 -to -3 -o stop_seq.fasta

## Check the beginning of the result file
head -20 stop_seq.fasta

## Check that most genes have ATG as start codon
oligo-analysis -v 1 -i stop_seq.fasta  -l 3 -1str \
   -return occ,freq -sort | more

3.5 Installing multiple genomes


3.6 Installing many genomes

We can install multiple species in a single command by specifying a file containing a list of species names (one species name per row).

3.6.1 Download

Beware:

  • This command will download thousands of genomes on your hard drive. Before processing, we recommend to estimate the required size by installing a few genomes and extrapolating for the number of genomes to be installed.

  • The download can take a considerable time depending on the number of selected species and bandwidth of your computer.

3.6.2 Downloading genome files from NCBI

## File with the list of species to install
export ORG_TO_INSTALL=${REFERENCE_COMPLETE}.txt

## Count the number of species to instal
wc -l ${ORG_TO_INSTALL}

## Download all the selected species
install-organism -v 2 -group ${GROUP} \
   -org_file ${ORG_TO_INSTALL} -task download

## Check the list of downloaded organisms 
## Note: organism = Genus + species + strain
install-organism -v 2 -group ${GROUP} \
   -species_file ${ORG_TO_INSTALL} -task list

3.6.3 Parsing


## Parse all organisms
install-organism -v 2 -group ${GROUP} \
   -species_file ${ORG_TO_INSTALL} -task list,parse

3.6.4 Parsing genomes from NCBI files

Since the number of genomes to install can be important, we will run install-organism with the option -batch, in order to parallelise the installation steps via a job scheduler. This job scheduler must have been previously defined in RSAT configuration files.

install-organism -v 2 -group ${GROUP} \
   -species_file ${ORG_TO_INSTALL} -task parse -batch

make -f ${RSAT}/makefiles/server.mk  watch_jobs

3.6.5 Default installation tasks


## Run default installation tasks
## NOTE: if you need to install hundreds of genomes, this can take a while. ## In such case, it is worth testing the option -batch (see below)
install-organism -v 2 -group ${GROUP} \
   -species_file ${ORG_TO_INSTALL} -task config,list,default

3.6.6 Managing installation tasks with a job scheduler

If your RSAT server has been configured adequately, you can add the option -batch to the commands above in order to send the installation tasks to a job scheduler.

ORG_TO_INSTALL=${GROUP}_organisms_to_install.txt

## Parse all organisms
install-organism -v 2 -group ${GROUP} \
   -species_file ${ORG_TO_INSTALL} -task list,parse \
  -batch


## Get the list of organisms from the list of parsed species
install-organism -v 2 -group ${GROUP} \
  -species_file ${ORG_TO_INSTALL} -task list \
  > ${ORG_TO_INSTALL}

## Count the number of species and organisms to install
SPECIES_DO_INSTALL_NB=`grep -v '^;' ${ORG_TO_INSTALL} | wc -l  | awk '{print $1}'`
ORG_DO_INSTALL_NB=`grep -v '^;' ${ORG_TO_INSTALL} | wc -l | awk '{print $1}'`
echo "To install: ${ORG_DO_INSTALL_NB} organisms,  ${SPECIES_DO_INSTALL_NB} distinct species"

## First retrieve all upstream sequences (cannot be done in batch mode)
install-organism -v 2  \
   -org_file ${ORG_TO_INSTALL} -task config,allup; 
   
## Send the other installation tasks to the job scheduler
install-organism -v 2 -org_file ${ORG_TO_INSTALL}  -task  start_stop,seq_len_distrib,genome_segments,upstream_freq,genome_freq,protein_freq,protein_len,oligos,dyads,fasta_genome,fasta_genome_rm,chrom_sizes,index_bedtools -batch

3.7 Check the installation

TO BE WRITTEN