1 Introduction

The goal of this tutorial is to explain how to install genomes from NCBI in your RSAT server.

2 Prerequisite

This protocol assumes that you have a working instance of the RSAT server, and that you are familar with the Unix terminal.

3 Protocol

We illustrate the installation procedure with a bacterial genome, since NCBI is the main source of prokaryote genomes for RSAT (http://prokaryotes.rsat.eu/). The same procedure can however apply to other taxa, as far as they are available on the NCBI FTP server (https://ftp.ncbi.nlm.nih.gov/genomes/).

3.1 Data source

Genomes are downloaded from the refseq repository: https://ftp.ncbi.nlm.nih.gov/genomes/refseq

3.2 Getting the list of available species

The first step to perform with the command ìnstall-organisms is to get the list of organisms available for a given taxonomic group.

The NCBI FTP genome site is organised by “main” taxonomic groups:

The option -task available returns the complete list of available species for the specified group.

The command below gets the list of all available bacteria and stores them in a text file. It can easily be adapted to another group by changing the environment variable GROUP.

## Get curret date to use as suffix for output files
export TODAY=`date +%Y-%m-%d`

## Choose a taxonomic group of interest
export GROUP=bacteria

## Get the list of bacterial species available on the NCBI FTP site
install-organism -v 1 -group ${GROUP} -task available

This will download the assembly summmary file from NCBI, for the selected taxonomic group. Beware, for Bacteria this file is pretty large (153Mb on April 2024). However, this file is downloaded only if the version on the NCBI server is more recent than the local version. After the first download, subsequent calls of the install-organisms command will check if a new version is present at the NCBI site, and if not it will use the local copy.

The output of install-organisms indicates the local path of the assembly summary file.

;    Assembly summary file      $RSAT/downloads/refseq/bacteria/assembly_summary.txt
;    Assembly summary columns   $RSAT/downloads/refseq/bacteria/assembly_summary_colums.tsv
;    Available assemblies         350727
## Count the number of species.
## Note: grep is used to filter out the comment lines.
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt
grep -v '^#' ${ASSEMBLY_SUMMARY} | wc -l

3.2.1 Getting lists of available species for other groups

We can of course do the same for other taxa, for example Archaea, Fungi or Mammalian.

export NCBI_GROUPS='archaea bacteria fungi invertebrate plant protozoa vertebrate_mammalian vertebrate_other viral'

echo "NCBI_GROUPS ${NCBI_GROUPS}"
for ncbi_group in $NCBI_GROUPS; do \
  echo "Getting list of NCBI species for ${ncbi_group}" ; \
  install-organism -v 1 -group ${ncbi_group} -task available
done


## Approx count of species per taxa
## (each file starts with 15 comment lines, which are be substracted from each count)
wc -l $RSAT/downloads/refseq/*/assembly_summary.txt \
   | sort -nr \
   | awk '{print "|\t"$1-15"\t| "$2"\t|"}' 

Note: for some species, several strains have been sequenced. The number of genomes available at NCBI thus exceeds by far the number of distinct species.

3.2.2 Number of species per group

Here is the result on July 31, 2024

Number of species Taxonomic group
367675 /Users/jvanheld/packages/rsat/downloads/refseq/bacteria/assembly_summary.txt
14975 /Users/jvanheld/packages/rsat/downloads/refseq/viral/assembly_summary.txt
2171 /Users/jvanheld/packages/rsat/downloads/refseq/archaea/assembly_summary.txt
588 /Users/jvanheld/packages/rsat/downloads/refseq/fungi/assembly_summary.txt
404 /Users/jvanheld/packages/rsat/downloads/refseq/invertebrate/assembly_summary.txt
399 /Users/jvanheld/packages/rsat/downloads/refseq/vertebrate_other/assembly_summary.txt
220 /Users/jvanheld/packages/rsat/downloads/refseq/vertebrate_mammalian/assembly_summary.txt
171 /Users/jvanheld/packages/rsat/downloads/refseq/plant/assembly_summary.txt
83 /Users/jvanheld/packages/rsat/downloads/refseq/protozoa/assembly_summary.txt
386806 total

Notes

  • The number of available assemblies exceeds by far the number of species, because some species are represented by different strains, in particular for bacteria (some bacterial species have several thousands of strains sequenced).

  • Some of these genomes are only partly sequenced, or poorly assembled, and the level of annotation is quite variable. NCBI subdivided these available genomes in 4 categories depending on the level of completion of the sequencing project.

3.3 Filtering assemblies

In order to select a reasonable number of genomes to install, we can filter the assemblies based on different criteria.

3.3.1 Listing the available assemblies for a species of interest

The option -species enables selecting a subset of assemblies corresponding to a given species of a given genus. The argulent -species should be followed by a string concatenating the Genus (with a leading uppercase) and species (in lowercase) separated by an underscore character (_), with exactly the same spelling and cases as on the NCBI FTP refseq site. Note that special characters (][)(/ etc) should be replaced by an underscore.

## List the assemblies for a given species, e.g. Rhizobium etli
install-organism -v 1 -group ${GROUP} \
  -species Rhizobium_etli \
  -task list -list_format org

Beware, for some species of interest, the number of assemblies can be quite large. This is for example the case for Escherchia coli.

## List the assemblies for Escherichia coli)
install-organism -v 1 -group ${GROUP} \
  -species Escherichia_coli \
  -task list -list_format org \
  -o assemblies_Escherichia_coli.txt

## Count the number of assemblies
grep -v '^;' assemblies_Escherichia_coli.txt | grep -v '^#' | wc -l

## Display the first rows of this file
head -n 20 assemblies_Escherichia_coli.txt

For the species Escherichia_coli, we get 34510 assemblies (05/05/2024).

We generally don’t need to install all these assemblies. We should thus use additional information in order to narrow down the number of genomes to install. To this purpose, the next sections explain how to select genomes labeled either as reference or representative, and completely sequenced.

3.3.2 Selecting reference genomes

The assembly summary includes a field refseq_category, which is quite convenient

install-organism -v 1 -group ${GROUP}  -task list \
  -list_format org \
  -filter refseq_category "reference genome"

This will print out the list of assemblies labeled as “reference genome” in the column “refseq_category” of the bacterial genome assembly file.

As of 05/05/2024, no more than 15 bacterial genomes are labeled “reference”, among the 350,729 available assemblies.

The option -list_format enables to export the result in 3 alternative formats.

  • org a list of organism names (default output, as shown above)

  • table a table with the different names and identifiers of the selected assemblies

install-organism -v 1 -group ${GROUP}  -task list \
  -list_format table \
  -filter  refseq_category "reference genome"  \
  -o reference_genomes.tsv
  • bash a bash script with the commands to install each of the selected assemblies
install-organism -v 1 -group ${GROUP}  -task list \
  -list_format bash \
  -filter  refseq_category "reference genome" \
  -o reference_genomes.bash

3.3.3 Listing the available assemblies - representative and complete genomes

We can combine several filtering criteria to select a more reasonable number of assemblies. For example, we could select the assemblies labeled as “representative genome” in the column refseq_category, and “Complete Genome” in the colum “”

install-organism -v 1 -group ${GROUP}  -task list \
  -list_format table \
  -filter refseq_category "representative genome" \
  -filter assembly_level "Complete Genome" \
  -o representative_complete.tsv

This returns 5014 assemblies (24/04/2024), with only one assembly per species, and belonging to 1555 distinct genus. The number of assemblies per genus can be displayed with the following command;

grep -v '^;' representative_complete.tsv \
   | grep -v '^#' \
   | cut -f 6 | sort  \
   | uniq -c | sort -nr | less

3.4 Installing one assembly for a given species and strain of interest

3.4.1 Downloading a selected assembly

We will now download the genome of a selected assembly, and install it on the RSAT server. As an illustration, we will install the reference strain K-12 of Escherichia coli.

## Select a species and assembly of interest 
## (can be adapted to your needs)
grep Escherichia reference_genomes.tsv | grep K-12
grep Escherichia reference_genomes.tsv | grep K-12 | cut -f 8

export ASSEMBLY=GCF_000005845.2


## List the selected assemblies for a given species 
## (e.g. Escherichia_coli)  
install-organism -v 1 -group ${GROUP} \
  -filter assembly_accession ${ASSEMBLY} \
  -task list

## Download the selected assemblies 
install-organism -v 1 -group ${GROUP} \
  -filter assembly_accession ${ASSEMBLY} \
  -task download
  

Note: downloaded genomes will be installed in a folder $RSAT/downloads/refseq on your computer. Each strain is stored in a specific subfolder built from the group, genus, species, and assembly.

For example, the reference assembly for Escherichia coli is stored in

downloads/refseq/bacteria//Escherichia/Escherichia_coli_str._K-12_substr._MG1655/GCF_000005845.2_ASM584v2/

ls -1 downloads/refseq/bacteria/Escherichia/Escherichia_coli_str._K-12_substr._MG1655/GCF_000005845.2_ASM584v2

3.4.2 Parsing a selected genome from NCBI

Parsing consists in extracting the information from a text-formatted file, in order to organise it in a computable structure.

## Parse a given strain of a given genome 
## (e.g. strain GCF_000016185.1_ASM1618v1 of Rhodospirillum centenum) 
install-organism -v 2 -group ${GROUP} \
  -filter assembly_accession ${ASSEMBLY} \
  -task parse

3.4.3 Organism names

For each parsed genome, the program creates a folder whose name is the concatenation of the genus, species, strain accession and assembly name, separated by undescore characters.

$RSAT/public_html/data/[Genus]_[species]_[assembly_accession]_[asm_name]

For Escherichia coli K12 reference strain, this gives:

Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2

From now on we will directly refer to this concatenated name as the Organism name (generally denoted by the option -org in RSAT commands.


export ORG=Escherichia_coli_str._K-12_substr._MG1655_GCF_000005845.2_ASM584v2

## Check the parsing result
ls -l ${RSAT}/public_html/data/genomes/${ORG}/genome/

## Measure the disk space occupied by this genome (note: this will be increased in the subsequent steps)
du -sm ${RSAT}/public_html/data/genomes/${ORG}/genome/

This gives the number of megabases occupied by the genome folder.

3.4.4 Adding the parsed genome to the list of supported genomes

In order to make the parsed genome available for RSAT, you need to run the configuration task.

## Configure the organism for RSAT
install-organism -v 1 -org ${ORG} -source NCBI -task config


## Check that the organism has wel been declared
supported-organisms -return ID,source,last_update | grep ${ORG}

3.4.5 Finalizing the installation

Several more tasks need to be performed before we can consider this organism as “installed”.

The simplest way to run it is to select the “default” tasks.

## Run the installation task (retrieve upstream sequences compute background oligo and dyad frequencies, index fasta genome for bedtols, ...)
install-organism -v 1 -org ${ORG} -task default

3.4.6 Detailed list of supported installation tasks (default and others)

Task Meaning
default run all the default tasks useful for the installation of a genome (this is a subset of the tasks specified below)
start_stop compute trinucleotide frequencies in all the start and stop codons (validation of the correspondance between genome sequences and gene coordinates)
allup Retrieve all upstream sequences in order to compute the background models (oligonucleotide and dyad frequencies)
seq_len_distrib Compute the distribution of upstream sequences
genome_segments SPlit the genome into segments corresponding to different genomic region types: genic, intergenic, …
upstream_freq Compute oligonucleotide or/and dyad frequencies in upstream sequences (must be combined with tasks oligos and/or dyads)
genome_freq Compute oligonucleotide or/and dyad frequencies in the whole genome (must be combined with tasks oligos and/or dyads)
protein_freq Compute amino acid, di- and tri-peptide frequencies in all protein sequences
protein_len Compute distribution of protein lengths
oligos Compute oligonucleotide frequencies (size 1 to 8) in a given type of sequences. Must be combined with stasks upstream or genomic
dyads Idem for dyads, i.e. spaced pairs of trinucleotides
fasta_genome convert genome sequence in fasta in order to make it usable by bedtools
chrom_sizes compute chromosome sizes from genomic sequences
index_bedtools index genome to enable its ultra-fast processing by bedtools
uninstall uninstall a genome, i.e. stop displaying it as available. Beware, this does not erase the genome folder, for this you can use the option -task erase
erase Erase the whole folder of your organism in $RSAT/public_html/data

3.4.7 Checking that the organism is well supported

## Get all the supported organisms
supported-organisms | grep ${ORG}

3.4.8 Checking the correctness of gene annotations by computing start/stop codon frequencies

## Retrieve all the start codon sequences
retrieve-seq -all -org ${ORG} -from 0 -to 2 -o start_seq.fasta

## Check the result
head -n 20 start_seq.fasta

## Check that most genes have ATG as start codon
oligo-analysis -v 1 -i start_seq.fasta  -l 3 -1str -return occ,freq -sort | more

You can do the similar test with the stop codon frequencies. Note that nothing forces you to store the stop codong sequences in a local file, you can use the pipe symbol | to concatenate two commands.

## Compute all trinucleotide frequencies in the stop codons
retrieve-seq -all -org ${ORG} \
   -type downstream -from -1 -to -3 -o stop_seq.fasta

## Check the beginning of the result file
head -20 stop_seq.fasta

## Check that most genes have ATG as start codon
oligo-analysis -v 1 -i stop_seq.fasta  -l 3 -1str \
   -return occ,freq -sort | more

3.5 Installing multiple genomes

3.5.1 Selecting reference organisms with complete genome

The assembly summary file provided by NCBI for each taxonomic group gives detailed indications of the sequencing status for each assembly.

This table contains many columns with valuable information for genome management. The column contain is indicated in a separate file.

Genome assemblies are classigied in four levels (PMCID: PMC4702866).

  1. Contig: Assemblies that include only contigs.
  2. Scaffold: Includes both scaffolds and contigs.
  3. Chromosome: Includes chromosome or linkage groups, plus scaffolds and contigs.
  4. Complete genome: Assemblies for which all molecules are fully sequenced.

We can use this assembly summary to select only those genomes with the a completion level that we consider appropriate.

For example, for the RSAT prokaryote server (http://prokaryote.rsat.eu), we restrict the installation to genomes by combining the following criteria :

  • The field refseq_category (column 5) is either reference genome or representative genome.

  • The field assembly_level (column 12) is Complete genome

Since ewe downloaded the assembly summary file above, we can count the number of genomes of different types in these columns.

## Choose a taxonomic group of interest
export GROUP=bacteria
export ASSEMBLY_SUMMARY=downloads/refseq/${GROUP}/assembly_summary.txt

## Count the number of assemblies for each value of the refseq_category
grep -v '^#' ${ASSEMBLY_SUMMARY} | cut -f 5,12 | sort | uniq -c | sort -n

As of May 5, 2024, the number of assemblies in these categories is the following:

     15 reference genome        Complete Genome
    470 representative genome   Chromosome
   4818 na                      Chromosome
   5027 representative genome   Complete Genome
   5195 representative genome   Scaffold
   7940 representative genome   Contig
  35394 na                      Complete Genome
 105250 na                      Scaffold
 186618 na                      Contig

3.5.2 Installing the reference genomes with complete genome

We saw above that we can apply two filters to select the 15 genomes labeled as reference genome in the field refseq_category and Complete Genome in assembly_level.

We will first list the selected organisms in order to make sure that the filter works properly (and avoid the risk of unwillingly installing 350,000 assemblies).

export GROUP=bacteria

## Select the reference genomes with complete genome
install-organism -v 1 -group ${GROUP} \
  -filter refseq_category "reference genome" \
  -filter assembly_level "Complete Genome" \
  -task list -list_format org

This should give a list of 15 genomes. We will then specity the tasks required to install each of these genomes. Note that this can take some time, since each genome has to be downloaded, parsed, and processed in order to be ready for use on the RSAT server. In particular, the computation of oligonucleotide and dyads in all the non-coding upstream sequences can take some time.

## Install the reference genomes with complete genome
install-organism -v 1 -group ${GROUP} \
  -filter refseq_category "reference genome" \
  -filter assembly_level "Complete Genome" \
  -task download,parse,config,default

3.5.3 Installing the representative genomes with complete genome

export GROUP=bacteria

## Select the representative genomes with complete genome
install-organism -v 1 -group ${GROUP} \
  -filter refseq_category "representative genome" \
  -filter assembly_level "Complete Genome" \
  -task list -list_format org \
  -o ${GROUP}_representative_complete.txt

## Count the number of representative genomes
grep -v '^;' ${GROUP}_representative_complete.txt \
  | grep -v '^#' \
  | wc -l

Depending on the group of interest, the number of representative genomes can amount to a few tens, or several thousands. In the lattre case, yous should better see the next section (Managing installation tasks with a task scheduler).


3.6 Installing many genomes

We can install multiple species in a single command by specifying a file containing a list of species names (one species name per row).

3.6.1 Download

Beware:

  • This command will download thousands of genomes on your hard drive, which will undoubtedly occupy some disj space.

  • The download can take a considerable time depending on the number of selected species and bandwidth of your computer.

3.6.2 Downloading genome files from NCBI

## File with the list of species to install
SPECIES_TO_INSTALL=${GROUP}_species_to_install.txt

## Get the two first words of the 8th column as Genus and species names
cut -f 8 ${REPRESENTATIVE_COMPLETE} \
  | sed 's/ /_/g' \
   > ${SPECIES_TO_INSTALL}

## Count the number of species to instal
wc -l ${SPECIES_TO_INSTALL}

## Download all the selected species
install-organism -v 2 -group ${GROUP} \
   -species_file ${SPECIES_TO_INSTALL} -task download
   

## Check the list of downloaded organisms 
## Note: organism = Genus + species + strain
install-organism -v 2 -group ${GROUP} \
   -species_file ${SPECIES_TO_INSTALL} -task list

3.6.3 Parsing


## Parse all organisms
install-organism -v 2 -group ${GROUP} \
   -species_file ${SPECIES_TO_INSTALL} -task list,parse

3.6.4 Parsing genomes from NCBI files

Since the number of genomes to install can be important, we will run install-organism with the option -batch, in order to parallelise the installation steps via a job scheduler. This job scheduler must have been previously defined in RSAT configuration files.

install-organism -v 2 -group ${GROUP} \
   -species_file ${SPECIES_TO_INSTALL} -task parse -batch

make -f ${RSAT}/makefiles/server.mk  watch_jobs

3.6.5 Default installation tasks


## Run default installation tasks
## NOTE: if you need to install hundreds of genomes, this can take a while. ## In such case, it is worth testing the option -batch (see below)
install-organism -v 2 -group ${GROUP} \
   -species_file ${SPECIES_TO_INSTALL} -task config,list,default

3.6.6 Managing installation tasks with a job scheduler

If your RSAT server has been configured adequately, you can add the option -batch to the commands above in order to send the installation tasks to a job scheduler.

ORG_TO_INSTALL=${GROUP}_organisms_to_install.txt

## Parse all organisms
install-organism -v 2 -group ${GROUP} \
   -species_file ${SPECIES_TO_INSTALL} -task list,parse \
  -batch


## Get the list of organisms from the list of parsed species
install-organism -v 2 -group ${GROUP} \
  -species_file ${SPECIES_TO_INSTALL} -task list \
  > ${ORG_TO_INSTALL}

## Count the number of species and organisms to install
SPECIES_DO_INSTALL_NB=`grep -v '^;' ${SPECIES_TO_INSTALL} | wc -l  | awk '{print $1}'`
ORG_DO_INSTALL_NB=`grep -v '^;' ${ORG_TO_INSTALL} | wc -l | awk '{print $1}'`
echo "To install: ${ORG_DO_INSTALL_NB} organisms,  ${SPECIES_DO_INSTALL_NB} distinct species"

## First retrieve all upstream sequences (cannot be done in batch mode)
install-organism -v 2  \
   -org_file ${ORG_TO_INSTALL} -task config,allup; 
   
## Send the other installation tasks to the job scheduler
install-organism -v 2 -org_file ${ORG_TO_INSTALL}  -task  start_stop,seq_len_distrib,genome_segments,upstream_freq,genome_freq,protein_freq,protein_len,oligos,dyads,fasta_genome,fasta_genome_rm,chrom_sizes,index_bedtools -batch

3.7 Check the installation

TO BE WRITTEN