1 RSAT tools to install genomes from Ensembl

The Ensembl browser (http://ensembl.org) contains mostly genomes of Vertebrate organisms.

RSAT includes a series of programs to download and install genomes from Ensembl:

  1. install-ensembl-genome is a wrapper that automates the download (genome sequences, features, variations) and configuration tasks.

  2. download-ensembl-genome downloads the genomic sequences and converts them to the raw format required by RSAT.

  3. download-ensembl-features downloads tab-delimited text files describing genomic features (transcripts, CDS, genes, …).

  4. download-ensembl-variations downloads tab-delimited text files describing genomic variations (polymorphism).

2 Installing genomes from Ensembl

The program install-ensembl-genome manages all the required steps to download and install a genome (sequence, features, and optionally variations) from Ensembl to RSAT.

It performs the following tasks:

  1. install-ensembl-genome -available_species returns the list of species available on the Ensembl server, together with their availability status for the 3 data types (genome sequence, features, variations). When this option is used, the program does not install any genome.

  2. The option install-ensembl-genome -task genome -org [Selected_organism] runs the program download-ensembl-genome to download the complete genomic sequence of a given organism from the Ensembl Web site, and formats it according to RSAT requirements (conversion from the original fasta sequence file to one file per chromosome, in raw format).

  3. The option install-ensembl-genome -task features -org [Selected_organism] runs download-ensembl-features to download the positions and descriptions of genomic features (genes, CDS, mRNAs, …).

  4. If the option -task variations is activated, install-ensembl-genome -org [Selected_organism] runs download-ensembl-variations to download the description of genomic variations (polymorphism). Note that variations are supported only for a subset of genomes.

  5. install-ensembl-genome -org [Selected_organism] -task config updates RSAT configuration files to make the newly installed genome available.

  6. install-ensembl-genome -org [Selected_organism] -task install runs the additional tasks required to have a fully functional genome on the local site: computing genomic statistics (intergenic sizes, …) and background models (oligonucleotide and dyad frequencies).

The detailed description of the program and the list of options can be obtained with the option -help.

## Get the description of the program + all options
install-ensembl-genome -help
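
The tasks listed above can also be run one at a time, which makes it easier to see which step fails. The following is a minimal sketch, assuming that the options -org and -task behave as described in the list above. The helper run_task and the variables ORG and DRY_RUN are illustrative names introduced here; they are not part of RSAT.

```shell
#!/usr/bin/env bash
# Sketch: run the install-ensembl-genome tasks one by one.
# ORG, DRY_RUN and run_task are hypothetical names used for illustration only.
ORG="${ORG:-Saccharomyces_cerevisiae}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run_task() {
    local task="$1"
    local cmd="install-ensembl-genome -v 1 -org ${ORG} -task ${task}"
    if [ "${DRY_RUN}" = "1" ]; then
        # Print the command without running it
        echo "${cmd}"
    else
        # Run the command and report a failure on stderr
        ${cmd} || { echo "Task ${task} failed" >&2; return 1; }
    fi
}

# Same order as in the task list: sequences, features, configuration, installation
for task in genome features config install; do
    run_task "${task}" || break
done
```

With DRY_RUN=1 (the default in this sketch) the commands are only printed, so they can be checked before setting DRY_RUN=0 for a real run.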

3 Getting the list of available genomes

Before installing a genome, it is generally a good idea to know which genomes are available. For this, use the option -available_species.

export TODAY=`date '+%Y-%m-%d'`

## Retrieve the list of supported species on EnsEMBL
install-ensembl-genome -v 1  -available_species \
    -o available_species_ensembl_${TODAY}.tsv
    
## Read the result file
more available_species_ensembl_${TODAY}.tsv

## Count the number of available genomes (using grep -v to discard comment lines)
grep -v '^;' available_species_ensembl_${TODAY}.tsv | wc -l
## Note: on August 6, 2018, this returns 117 organism names

4 Availability of polymorphic variations for genomes in Ensembl

Beware: inter-individual variations are available for only a subset of the genomes in Ensembl. The option -available_species indicates, for each species, which data types are available (genome, features, variations). The programs that analyse regulatory variations (variation-info, convert-variations, retrieve-variation-seq, variation-scan) therefore work only for genomes documented with variations.

## select genomes for which variations are available in Ensembl
awk -F'\t' '$2 ~ "variations"' available_species_ensembl_${TODAY}.tsv | grep -v '^;'
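
The same filter can be turned into a yes/no test for one species, for example before launching a long download. This is only a sketch: it reuses the assumption of the awk command above (the second tab-separated column mentions "variations" when they are available) and further assumes that the species name appears somewhere on the same line. The function name has_variations is introduced here for illustration.

```shell
#!/usr/bin/env bash
# has_variations FILE SPECIES
# Hypothetical helper: succeeds (exit status 0) if a non-comment line of FILE
# mentions SPECIES and its second column contains "variations".
# The match is a case-insensitive substring test, so a partial name
# can match more than one species.
has_variations() {
    local file="$1" species="$2"
    awk -F'\t' -v sp="${species}" '
        /^;/ { next }                      # skip comment lines
        $2 ~ "variations" && index(tolower($0), tolower(sp)) { found = 1 }
        END { exit(found ? 0 : 1) }
    ' "${file}"
}

# Usage (file name as produced by the commands above):
# if has_variations available_species_ensembl_${TODAY}.tsv Homo_sapiens; then
#     echo "variations available"
# fi
```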

5 Installing a genome from Ensembl

We can now download and install the complete genomic sequence for the species of our choice. For the sake of space and time economy, we will use a small genome for this manual: the budding yeast Saccharomyces cerevisiae.

Beware: some installation steps take a lot of time. For large genomes (e.g. Vertebrate organisms), the full installation can take several hours. This should in principle not be a big issue, since installing a genome is not a daily task, but it is worth knowing that the whole process requires a continuous connection for several hours.

## Install the genome sequences for a selected organism
install-ensembl-genome -v 2 -species Saccharomyces_cerevisiae

This command automatically runs all the installation tasks described above, except the installation of variations (see the section Downloading variations).

6 Downloading variations

The program download-ensembl-variations downloads variations from the Ensembl Web site and installs them on the local site.

This program relies on , which must be installed beforehand on your computer.

## Retrieve the list of supported species in the EnsEMBL variation database
download-ensembl-variations -v 1  -available_species -o species_with_variations_ensembl.tsv

## Check the content of the result file
more species_with_variations_ensembl.tsv

## Count genomes with variations available at Ensembl
grep -v '^;' species_with_variations_ensembl.tsv | wc -l
## Result : 23 on August 6, 2018

Note: as an alternative to download-ensembl-variations, we could have used the command install-ensembl-genome with the option -task variations.

We can now download all the variations available for the yeast.

## Download all variations for a selected organism on your server
download-ensembl-variations -v 1 -species Saccharomyces_cerevisiae

Variation files are stored in a specific subfolder for the specified organism.

## Check the content of the variation directory for the yeast
make -f makefiles/variation-scan_demo.mk \
     SPECIES=Saccharomyces_cerevisiae ASSEMBLY=R64-1-1 \
     variation_stats

This command will indicate the location of the variation directory on your server, and count the number of lines for each variation file (there is one separate file per chromosome or contig).
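
Once the location of the variation directory is known, the per-file line counts can also be obtained directly. A minimal sketch, assuming that the directory contains uncompressed text files; the function name count_variation_lines and the variable VAR_DIR are hypothetical, and the actual path is the one reported by the make target above.

```shell
#!/usr/bin/env bash
# count_variation_lines DIR
# Hypothetical helper: print "<line count><TAB><file name>" for each regular file in DIR.
# Assumes uncompressed text files.
count_variation_lines() {
    local dir="$1" f
    for f in "${dir}"/*; do
        if [ -f "${f}" ]; then
            printf '%s\t%s\n' "$(wc -l < "${f}" | tr -d ' ')" "$(basename "${f}")"
        fi
    done
}

# Usage: set VAR_DIR to the directory reported by the make target above
# count_variation_lines "${VAR_DIR}"
```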


7 Installing the Human genome and human polymorphic variations

In the examples above we intentionally installed the smallest genome available at Ensembl, in order to obtain the results in a reasonable time and with reasonable disk usage.

Installing Metazoan genomes takes much more disk space and significantly more time.

We summarize below the commands to download and install the Human genome, its annotations, and its variations from Ensembl. For this purpose, we customize the options of install-ensembl-genome in order to:

  • download the genome sequence,
  • download genome annotations (features),
  • download variations,
  • update RSAT configuration in order to enable the downloaded genome,
  • install the genome on RSAT (compute oligo and dyad frequencies + some other tasks)

Beware: the following command will use ~11.7 GB of disk space for the genome + ~11.5 GB for the variations.


## Install Human genome from Ensembl, including polymorphic variations
install-ensembl-genome -v 2 -db ensembl -org Homo_sapiens -task genome,features,config,install,variations

## Check the disk space occupied by the different folders
du -sm data/genomes/Homo_sapiens_GRCh38/*

## Result (2018-08-06) in Megabytes per folder
# 11701   data/genomes/Homo_sapiens_GRCh38/genome
# 18      data/genomes/Homo_sapiens_GRCh38/oligo-frequencies
# 11544   data/genomes/Homo_sapiens_GRCh38/variations
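
Given these sizes, it can be worth checking the free disk space before launching the installation. The following sketch uses the standard df command; the function name enough_space is introduced here for illustration, and the threshold is simply the sum of the folder sizes listed above (11701 + 18 + 11544 MB, measured in 2018), so current releases may need more.

```shell
#!/usr/bin/env bash
# enough_space DIR REQUIRED_MB
# Hypothetical helper: succeeds if the file system holding DIR has at least
# REQUIRED_MB megabytes available.
enough_space() {
    local dir="$1" required_mb="$2"
    local avail_kb
    # df -P gives one line per file system; column 4 is the available space in KB (-k)
    avail_kb=$(df -Pk "${dir}" | awk 'NR == 2 { print $4 }')
    [ "$((avail_kb / 1024))" -ge "${required_mb}" ]
}

# Sum of the folder sizes reported above (2018 figures, in MB)
REQUIRED_MB=23263

if enough_space . "${REQUIRED_MB}"; then
    echo "Enough free space for the Human genome and variations"
else
    echo "Less than ${REQUIRED_MB} MB available in the current directory"
fi
```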

8 Installing a genome from Ensembl Genomes

For all non-vertebrate organisms (Protists, Fungi, Plants, Bacteria, Metazoa), please see Installing genomes from Ensembl Genomes.