1 Introduction

The goal of this tutorial is to explain how to install genomes from NCBI in your RSAT server.

2 Prerequisite

This protocol assumes that you have a working instance of the RSAT server, and that you are familar with the Unix terminal.

3 Protocol

We illustrate the installation procedure with a bacterial genome, since NCBI is the main source of prokaryote genomes for RSAT (http://prokaryotes.rsat.eu/). The same procedure can however apply to other taxa, as far as they are available on the NCBI FTP server (https://ftp.ncbi.nlm.nih.gov/genomes/).

3.1 Data source

Genomes are downloaded from the refseq repository: https://ftp.ncbi.nlm.nih.gov/genomes/refseq

3.2 Getting the list of available species

The first step to perform with the command ìnstall-organisms is to get the list of organisms available for a given taxonomic group.

The NCBI FTP genome site is organised by “main” taxonomic groups:

The option -task available returns the complete list of available species for the specified group.

The command below gets the list of all available bacteria and stores them in a text file. It can easily be adapted to another group by changing the environment variable GROUP.

## Get curret date to use as suffix for output files
export TODAY=`date +%Y-%m-%d`

## Choose a taxonomic group of interest
export GROUP=bacteria

## Get the list of bacterial species available on the NCBI FTP site
install-organism -v 1 -group ${GROUP} -task available

This will download the assembly summmary file from NCBI, for the selected taxonomic group. Beware, for Bacteria this file is pretty large (153Mb on April 2024). However, this file is downloaded only if the version on the NCBI server is more recent than the local version. After the first download, subsequent calls of the install-organisms command will check if a new version is present at the NCBI site, and if not it will use the local copy.

The output of install-organisms indicates the local path of the assembly summary file.

;    Assembly summary file      $RSAT/downloads/refseq/bacteria/assembly_summary.txt
;    Assembly summary columns   $RSAT/downloads/refseq/bacteria/assembly_summary_colums.tsv
;    Available assemblies         350727

## Count the number of species.
## Note: grep is used to filter out the comment lines.
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt
grep -v '^#' ${ASSEMBLY_SUMMARY} | wc -l

3.2.1 Getting lists of available species for other groups

We can of course do the same for other taxa, for example Archaea, Fungi or Mammalian.

export NCBI_GROUPS='archaea bacteria fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other viral'

echo "NCBI_GROUPS ${NCBI_GROUPS}"
for ncbi_group in $NCBI_GROUPS; do \
  echo "Getting list of NCBI species for ${ncbi_group}" ; \
  install-organism -v 1 -group ${ncbi_group} -task available
done


## Approx count of species per taxa
## (each file starts with 15 comment lines, which are be substracted from each count)
wc -l $RSAT/downloads/refseq/*/assembly_summary.txt \
   | sort -nr \
   | awk '{print "|\t"$1-15"\t| "$2"\t|"}'

Note: for some species, several strains have been sequenced. The number of genomes available at NCBI thus exceeds by far the number of distinct species.

3.2.2 Number of species per group

Here is the result on JanFeb 2nd, 2025

Number of species	Taxonomic group
405360	/opt/rsat-tools/rsat_bacteria/downloads/refseq/bacteria/assembly_summary.txt
14984	/opt/rsat-tools/rsat_bacteria/downloads/refseq/viral/assembly_summary.txt
2438	/opt/rsat-tools/rsat_bacteria/downloads/refseq/archaea/assembly_summary.txt
630	/opt/rsat-tools/rsat_bacteria/downloads/refseq/fungi/assembly_summary.txt
432	/opt/rsat-tools/rsat_bacteria/downloads/refseq/vertebrate_other/assembly_summary.txt
428	/opt/rsat-tools/rsat_bacteria/downloads/refseq/invertebrate/assembly_summary.txt
228	/opt/rsat-tools/rsat_bacteria/downloads/refseq/vertebrate_mammalian/assembly_summary.txt
173	/opt/rsat-tools/rsat_bacteria/downloads/refseq/plant/assembly_summary.txt
108	/opt/rsat-tools/rsat_bacteria/downloads/refseq/protozoa/assembly_summary.txt
424901	total

Notes

The number of available assemblies exceeds by far the number of species, because some species are represented by different strains, in particular for bacteria (some bacterial species have several thousands of strains sequenced).
Some of these genomes are only partly sequenced, or poorly assembled, and the level of annotation is quite variable. NCBI subdivided these available genomes in 4 categories depending on the level of completion of the sequencing project.

3.3 Filtering assemblies

In order to select a reasonable number of genomes to install, we can filter the assemblies based on different criteria.

3.3.1 Listing the available assemblies for a species of interest

The option -species enables selecting a subset of assemblies corresponding to a given species of a given genus. The argulent -species should be followed by a string concatenating the Genus (with a leading uppercase) and species (in lowercase) separated by an underscore character (_), with exactly the same spelling and cases as on the NCBI FTP refseq site. Note that special characters (][)(/ etc) should be replaced by an underscore.

## List the assemblies for a given species, e.g. Rhizobium etli
install-organism -v 1 -group ${GROUP} \
  -species Rhizobium_etli \
  -task list -list_format org

Beware, for some species of interest, the number of assemblies can be quite large. This is for example the case for Escherchia coli.

## List the assemblies for Escherichia coli)
install-organism -v 1 -group ${GROUP} \
  -species Escherichia_coli \
  -task list -list_format org \
  -o assemblies_Escherichia_coli.txt

## Count the number of assemblies
grep -v '^;' assemblies_Escherichia_coli.txt | grep -v '^#' | wc -l

## Display the first rows of this file
head -n 20 assemblies_Escherichia_coli.txt

For the species Escherichia_coli, we get 38518 assemblies (2024-02-02).

We generally don’t need to install all these assemblies. We should thus use additional information in order to narrow down the number of genomes to install. To this purpose, the next sections explain how to select genomes labeled either as reference or representative, and completely sequenced.

3.3.2 Selecting reference genomes

The assembly summary includes a field refseq_category, which is quite convenient

## Get the list of organisms
install-organism -v 1 -group ${GROUP}  -task list \
  -list_format org \
  -filter refseq_category "reference genome" \
  -o reference_genomes_organisms.txt

## Display the list of organisms
less reference_genomes_organisms.txt

## Display the list of organisms
wc -l reference_genomes_organisms.txt

The commands above list the assemblies labeled “reference genome” in the column “refseq_category” of the assembly summary file. This collection was built by “selecting the ‘best’ genome assembly for each species”.

On Feb 2, 2025, the reference collection contains 20,578 among the 424,901 available bacterial genomes.

3.3.3 Selecting reference organisms with complete genome

The assembly summary file also contains a column indicating the sequencing status for each assembly.

This table contains many columns with valuable information for genome management. The column contain is indicated in a separate file.

https://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt

Genome assemblies are classified in four levels of chromosome assembly (PMCID: PMC4702866).

Contig: Assemblies that include only contigs.
Scaffold: Includes both scaffolds and contigs.
Chromosome: Includes chromosome or linkage groups, plus scaffolds and contigs.
Complete genome: Assemblies for which all molecules are fully sequenced.

We can use this assembly summary to select only those genomes with the a completion level that we consider appropriate. The commands below count the number of genomes per type of chromosome assembly.

## Choose a taxonomic group of interest
export GROUP=bacteria
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt

## Count the number of assemblies for each value of the refseq_category
grep -v '^#' ${ASSEMBLY_SUMMARY} | cut -f 12 | sort | uniq -c | sort -n

We can now compute the number of assemblies for each combination of annotations for the two criteria: refseq_category (column 5 of assembly_summary.txt) and assembly_level (column 12).

For the RSAT prokaryote server (http://prokaryote.rsat.eu), we restrict the installation to genomes

## Choose a taxonomic group of interest
export GROUP=bacteria
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt

## Count the number of assemblies for each value of the refseq_category
cut -f 5,12 ${ASSEMBLY_SUMMARY} | sort | uniq -c | sort -n \
  | perl -pe 's/(\d+) /$1\t/' \
  | awk -F'\t' '{print "| "$1"\t| "$3"\t| "$2" |"}'

As of Feb 2, 2025, the number of assemblies in these categories is the following:

Nb	assembly_level	refseq_category
424	Chromosome	reference genome
5511	Chromosome	na
5734	Complete Genome	reference genome
5952	Scaffold	reference genome
8468	Contig	reference genome
41143	Complete Genome	na
125680	Scaffold	na
212461	Contig	na

For the RSAT bacterial server, we select the assemblies that fit both of the following criteria :

The field refseq_category (column 5) is reference genome.
The field assembly_level (column 12) is Complete genome.

We will list the selected organisms in order to make sure that the filter works properly, with 3 alternative options for the output format:

-list_format org exports the organism name. These are built by concatenating the genus, species and assembly, separated by underscore characters.
-list_format table exports a subset of the table assembly_summary.txt, restricted to the assemblies passing the filters
-list_format bash exports a bash script with all the commands for installing the selected genomes. Beware: depending on the group of interest, the number of reference genomes can amount to a few tens, or several thousands. In the latter case, you should better see the section Managing installation tasks with a task scheduler.

export GROUP=bacteria
export REFERENCE_COMPLETE=${GROUP}_reference_complete

## Select the reference genomes with complete genome
install-organism -v 1 -group ${GROUP} \
  -filter refseq_category "reference genome" \
  -filter assembly_level "Complete Genome" \
  -task list -list_format org \
  -o ${REFERENCE_COMPLETE}.txt

## Count the number of complete reference genomes
grep -v '^;' ${REFERENCE_COMPLETE}.txt \
  | grep -v '^#' \
  | wc -l

## Export the metadata table for the reference genomes with complete genome
install-organism -v 1 -group ${GROUP} \
  -filter refseq_category "reference genome" \
  -filter assembly_level "Complete Genome" \
  -task list -list_format table \
  -o ${REFERENCE_COMPLETE}.tsv

## Export a bash file with the installation tasks
install-organism -v 1 -group ${GROUP}  -task list \
  -list_format bash \
  -filter refseq_category "reference genome" \
  -filter assembly_level "Complete Genome" \
  -o ${REFERENCE_COMPLETE}.bash

3.4 Installing one assembly for a given species and strain of interest

3.4.1 Downloading a selected assembly

We will now download the genome of a selected assembly, and install it on the RSAT server. As an illustration, we will install the reference strain K-12 of Escherichia coli.

## Select a species and assembly of interest 
## (can be adapted to your needs)
grep Escherichia reference_genomes.tsv | grep K-12
grep Escherichia reference_genomes.tsv | grep K-12 | cut -f 8

export ASSEMBLY=GCF_000005845.2


## List the selected assemblies for a given species 
## (e.g. Escherichia_coli)  
install-organism -v 1 -group ${GROUP} \
  -filter assembly_accession ${ASSEMBLY} \
  -task list

## Download the selected assemblies 
install-organism -v 1 -group ${GROUP} \
  -filter assembly_accession ${ASSEMBLY} \
  -task download

Note: downloaded genomes will be installed in a folder $RSAT/downloads/refseq on your computer. Each strain is stored in a specific subfolder built from the group, genus, species, and assembly.

For example, the reference assembly for Escherichia coli is stored in

downloads/refseq/bacteria//Escherichia/Escherichia_coli_str._K-12_substr._MG1655/GCF_000005845.2_ASM584v2/

ls -1 downloads/refseq/bacteria/Escherichia/Escherichia_coli_str._K-12_substr._MG1655/GCF_000005845.2_ASM584v2

3.4.2 Parsing a selected genome from NCBI

Parsing consists in extracting the information from a text-formatted file, in order to organise it in a computable structure.

## Parse a given strain of a given genome 
## (e.g. strain GCF_000016185.1_ASM1618v1 of Rhodospirillum centenum) 
install-organism -v 2 -group ${GROUP} \
  -filter assembly_accession ${ASSEMBLY} \
  -task parse

3.4.3 Organism names

For each parsed genome, the program creates a folder whose name is the concatenation of the genus, species, strain accession and assembly name, separated by undescore characters.

$RSAT/public_html/data/[Genus]_[species]_[assembly_accession]_[asm_name]

For Escherichia coli K12 reference strain, this gives:

Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2

From now on we will directly refer to this concatenated name as the Organism name (generally denoted by the option -org in RSAT commands.


export ORG=Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2

## Check the parsing result
ls -l ${RSAT}/public_html/data/genomes/${ORG}/genome/

## Measure the disk space occupied by this genome (note: this will be increased in the subsequent steps)
du -sm ${RSAT}/public_html/data/genomes/${ORG}/genome/

This gives the number of megabases occupied by the genome folder.

3.4.4 Adding the parsed genome to the list of supported genomes

In order to make the parsed genome available for RSAT, you need to run the configuration task.

## Configure the organism for RSAT
install-organism -v 1 -org ${ORG} -source NCBI -task config


## Check that the organism has wel been declared
supported-organisms -return ID,source,last_update | grep ${ORG}

3.4.5 Finalizing the installation

Several more tasks need to be performed before we can consider this organism as “installed”.

The simplest way to run it is to select the “default” tasks.

## Run the installation task (retrieve upstream sequences compute background oligo and dyad frequencies, index fasta genome for bedtols, ...)
install-organism -v 1 -org ${ORG} -task default

3.4.6 Detailed list of supported installation tasks (default and others)

Task	Meaning
default	run all the default tasks useful for the installation of a genome (this is a subset of the tasks specified below)
start_stop	compute trinucleotide frequencies in all the start and stop codons (validation of the correspondance between genome sequences and gene coordinates)
allup	Retrieve all upstream sequences in order to compute the background models (oligonucleotide and dyad frequencies)
seq_len_distrib	Compute the distribution of upstream sequences
genome_segments	SPlit the genome into segments corresponding to different genomic region types: genic, intergenic, …
upstream_freq	Compute oligonucleotide or/and dyad frequencies in upstream sequences (must be combined with tasks oligos and/or dyads)
genome_freq	Compute oligonucleotide or/and dyad frequencies in the whole genome (must be combined with tasks oligos and/or dyads)
protein_freq	Compute amino acid, di- and tri-peptide frequencies in all protein sequences
protein_len	Compute distribution of protein lengths
oligos	Compute oligonucleotide frequencies (size 1 to 8) in a given type of sequences. Must be combined with stasks upstream or genomic
dyads	Idem for dyads, i.e. spaced pairs of trinucleotides
fasta_genome	convert genome sequence in fasta in order to make it usable by bedtools
chrom_sizes	compute chromosome sizes from genomic sequences
index_bedtools	index genome to enable its ultra-fast processing by bedtools
uninstall	uninstall a genome, i.e. stop displaying it as available. Beware, this does not erase the genome folder, for this you can use the option `-task erase`
erase	Erase the whole folder of your organism in `$RSAT/public_html/data`

3.4.7 Checking that the organism is well supported

## Get all the supported organisms
supported-organisms | grep ${ORG}

3.4.8 Checking the correctness of gene annotations by computing start/stop codon frequencies

## Retrieve all the start codon sequences
retrieve-seq -all -org ${ORG} -from 0 -to 2 -o start_seq.fasta

## Check the result
head -n 20 start_seq.fasta

## Check that most genes have ATG as start codon
oligo-analysis -v 1 -i start_seq.fasta  -l 3 -1str -return occ,freq -sort | more

You can do the similar test with the stop codon frequencies. Note that nothing forces you to store the stop codong sequences in a local file, you can use the pipe symbol | to concatenate two commands.

## Compute all trinucleotide frequencies in the stop codons
retrieve-seq -all -org ${ORG} \
   -type downstream -from -1 -to -3 -o stop_seq.fasta

## Check the beginning of the result file
head -20 stop_seq.fasta

## Check that most genes have ATG as start codon
oligo-analysis -v 1 -i stop_seq.fasta  -l 3 -1str \
   -return occ,freq -sort | more

3.5 Installing multiple genomes

3.6 Installing many genomes

We can install multiple species in a single command by specifying a file containing a list of species names (one species name per row).

3.6.1 Download

Beware:

This command will download thousands of genomes on your hard drive. Before processing, we recommend to estimate the required size by installing a few genomes and extrapolating for the number of genomes to be installed.
The download can take a considerable time depending on the number of selected species and bandwidth of your computer.

3.6.2 Downloading genome files from NCBI

## File with the list of species to install
export ORG_TO_INSTALL=${REFERENCE_COMPLETE}.txt

## Count the number of species to instal
grep -v '^#' ${ORG_TO_INSTALL} | grep -v ';' | wc -l

## Download all the selected species
install-organism -v 2 -group ${GROUP} \
   -org_file ${ORG_TO_INSTALL} -task download

## Check the list of downloaded organisms 
## Note: organism = Genus + species + assembly + version
install-organism -v 2 -group ${GROUP} \
  -filter refseq_category "reference genome" \
  -filter assembly_level "Complete Genome" \
  -task list

## Before installing, count the number of organisms that would be installed
## MAKE SURE THIS NUMBER IS REASONABLE
install-organism -v 0 -group ${GROUP} \
   -filter refseq_category "reference genome" \
   -filter assembly_level "Complete Genome" \
   -task list | grep -v '^#' | wc -l

3.6.3 Parsing


## Parse all the selected organisms
install-organism -v 0 -group ${GROUP} \
   -filter refseq_category "reference genome" \
   -filter assembly_level "Complete Genome" \
   -task list,parse

3.6.4 Parsing genomes from NCBI files

Since the number of genomes to install can be important, we recommand to run install-organism with the option -batch, in order to parallelise the installation steps via a job scheduler. This job scheduler must have been previously defined in RSAT configuration files. This will be explained in the next section.

However, if you don’t dispose of a job scheduler, and if the number of genomes is not too high, you can still run the parsing and installation in sequential mode with the following command.

## Parallelize the parsing of all the selected organisms on a cluster
## BEWARE: this requires to have a job scheduler running on the server, 
## and to configure properly its parameters in RSAT configuration (script configure_rsat.pl)
install-organism -v 0 -group ${GROUP} \
   -filter refseq_category "reference genome" \
   -filter assembly_level "Complete Genome" \
   -task list,parse

3.6.5 Default installation tasks

Same comment as before: we recommend to run the following command with the -batch option, which requires a properly configured job scheduler. Run the following command only if you don’t dispose of a job scheduler and the number of genomes is not too high.


## Run default installation tasks
## NOTE: if you need to install hundreds of genomes, this can take a while. ## In such case, it is worth testing the option -batch (see below)
install-organism -v 2 -group ${GROUP} \
   -filter refseq_category "reference genome" \
   -filter assembly_level "Complete Genome" \
   -task config,list,default

3.6.6 Managing installation tasks with a job scheduler

If your RSAT server has been configured adequately, you can add the option -batch to the commands above in order to send the installation tasks to a job scheduler.


## Get the list of organisms from the list of parsed species
install-organism -v 2 -group ${GROUP} \
    -filter refseq_category "reference genome" \
    -filter assembly_level "Complete Genome" \
    -task list \
    > ${ORG_TO_INSTALL}


## Count the number of species and organisms to install
grep -v '^;' ${ORG_TO_INSTALL} | grep -v '^#' | wc -l  | awk '{print $1}'


## Parallelize the parsing of all the selected organisms on a cluster
## BEWARE: this requires to have a job scheduler running on the server, 
## and to configure properly its parameters in RSAT configuration (script configure_rsat.pl)
install-organism -v 2 -group ${GROUP} \
   -filter refseq_category "reference genome" \
   -filter assembly_level "Complete Genome" \
   -task list,parse -batch

## Check the job scheduler, for example with the command qstat
# make -f ${RSAT}/makefiles/server.mk  watch_jobs

## Update RSAT configuration so that the newly parsed organisms can be used 
## Note: this cannot be done in batch mode, but the configuration is fast
install-organism -v 2 -group ${GROUP} \
   -filter refseq_category "reference genome" \
   -filter assembly_level "Complete Genome" \
   -task config

## Retrieve all upstream sequences (cannot be done in batch mode)
install-organism -v 2 -group ${GROUP} \
   -filter refseq_category "reference genome" \
   -filter assembly_level "Complete Genome" \
   -task allup -batch

   
## Send the other installation tasks to the job scheduler
install-organism -v 2 -group ${GROUP} \
   -filter refseq_category "reference genome" \
   -filter assembly_level "Complete Genome" \
   -task  start_stop,seq_len_distrib,genome_segments,upstream_freq,genome_freq,protein_freq,protein_len,oligos,dyads,fasta_genome,fasta_genome_rm,chrom_sizes,index_bedtools \
   -batch

3.7 Check the installation

TO BE WRITTEN

Install organisms from NCBI

Jacques van Helden

2025-05-17