The Ensembl browser (http://ensembl.org) contains mostly genomes of Vertebrate organisms.
RSAT includes a series of programs to download and install genomes from Ensembl:
install-ensembl-genome is a wrapper that automates the download (genome sequences, features, variations) and configuration tasks.
download-ensembl-genome downloads the genomic sequences and converts them to the raw format required by RSAT.
download-ensembl-features downloads tab-delimited text files describing genomic features (transcripts, CDS, genes, …).
download-ensembl-variations downloads tab-delimited text files describing genomic variations (polymorphisms).
The program install-ensembl-genome manages all the required steps to download and install a genome (sequence, features, and optionally variations) from Ensembl to RSAT.
It performs the following tasks:
install-ensembl-genome -available_species returns the list of species available on the Ensembl server, together with their availability status for the 3 data types (genome sequence, features, variations). When this option is used, the program does not install any genome.
The command install-ensembl-genome -task genome -org [Selected_organism] runs the program download-ensembl-genome to download the complete genomic sequence of a given organism from the Ensembl Web site, and formats it according to RSAT requirements (conversion from the original fasta sequence file to one file per chromosome, in raw format).
The command install-ensembl-genome -task features -org [Selected_organism] runs download-ensembl-features to download the positions and descriptions of genomic features (genes, CDS, mRNAs, …).
With the option -task variations, install-ensembl-genome -org [Selected_organism] runs download-ensembl-variations to download the description of genomic variations (polymorphisms). Note that variations are supported only for a subset of genomes.
install-ensembl-genome -org [Selected_organism] -task config updates the RSAT configuration files to make the newly installed genome available.
install-ensembl-genome -org [Selected_organism] -task install runs the additional tasks required to have a fully functional genome on the local site: computing genomic statistics (intergenic sizes, …) and background models (oligonucleotide and dyad frequencies).
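The tasks listed above can be combined in a single call. The following is a minimal sketch, assuming that -task accepts a comma-separated list (as in the Human genome command later in this section). The helper build_install_cmd is hypothetical and not part of RSAT; it only prints the command line, so that it can be checked before being run.

```shell
#!/bin/sh
# Sketch: assemble an install-ensembl-genome command line from a list of tasks.
# build_install_cmd is a hypothetical helper, not an RSAT program.
# Assumption: -task takes a comma-separated list of task names.
build_install_cmd() {
    org=$1
    shift
    # Join the remaining arguments (task names) with commas
    tasks=$(printf '%s,' "$@")
    tasks=${tasks%,}
    printf 'install-ensembl-genome -v 1 -org %s -task %s\n' "$org" "$tasks"
}

# Example: sequence, features and configuration, without variations
build_install_cmd Saccharomyces_cerevisiae genome features config
```

Printing the command rather than executing it keeps the sketch safe to try on a machine where RSAT is not installed.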
The detailed description of the program and the list of options can be obtained with the option .
Before installing a genome, it is generally a good idea to know which genomes are available. For this, use the option -available_species.
export TODAY=`date '+%Y-%m-%d'`
## Retrieve the list of supported species on EnsEMBL
install-ensembl-genome -v 1 -available_species \
-o available_species_ensembl_${TODAY}.tsv
## Read the result file
more available_species_ensembl_${TODAY}.tsv
## Count the number of available genomes (using grep -v to discard comment lines)
grep -v '^;' available_species_ensembl_${TODAY}.tsv | wc -l
## Note: on August 6, 2018, this returns 117 organism names
Beware: inter-individual variations are available for only a subset of the genomes available in Ensembl. The option -available_species indicates, for each species, the availability of each data type (genome, features, variations). The programs that analyse regulatory variations (variation-info, convert-variations, retrieve-variation-seq, variation-scan) work only for the genomes documented with variations.
We can now download and install the complete genomic sequence for the species of our choice. To save space and time, we will use a small genome for this manual: the budding yeast Saccharomyces cerevisiae.
Beware: some installation steps take a long time. For large genomes (e.g. Vertebrate organisms), the full installation can take several hours. This should in principle not be a big issue, since installing a genome is not a daily task, but it is worth knowing that the whole process requires a continuous connection for several hours.
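Since the installation may run for hours, one possible precaution is to detach the command from the terminal so that it keeps running if the session is closed. The sketch below uses the standard nohup utility. The helper run_detached and the log file name are hypothetical, and a harmless echo stands in for the actual installation command.

```shell
#!/bin/sh
# Sketch: start a long-running command so that it survives a closed terminal.
# run_detached is a hypothetical helper; nohup is the standard utility.
run_detached() {
    log_file=$1
    shift
    # Send both output streams to the log file and run in the background
    nohup "$@" > "$log_file" 2>&1 &
    DETACHED_PID=$!
}

# Stand-in command for the demonstration. For a real installation, replace
# the echo with the install-ensembl-genome command line of your choice.
log_dir=$(mktemp -d)
run_detached "$log_dir/install_demo.log" echo "installation finished"

# Wait for the background job (only needed in this demonstration)
wait "$DETACHED_PID"
cat "$log_dir/install_demo.log"
```

In real use the wait would be omitted, and progress could be followed by reading the log file from time to time.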
## Install the genome sequences for a selected organism
install-ensembl-genome -v 2 -species Saccharomyces_cerevisiae
This command will automatically run all the installation tasks described above, except the installation of variations (see Section 1.3).
The program download-ensembl-variations downloads variations from the Ensembl Web site and installs them on the local site.
This program relies on , which must be installed beforehand on your computer.
## Retrieve the list of supported species in the EnsEMBL variation database
download-ensembl-variations -v 1 -available_species -o species_with_variations_ensembl.tsv
## Check the content of the result file
more species_with_variations_ensembl.tsv
## Count genomes with variations available at Ensembl
grep -v '^;' species_with_variations_ensembl.tsv | wc -l
## Result : 23 on August 6, 2018
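Before launching a download, it may be convenient to check from a script whether a given species appears in the list. The following is a minimal sketch under two assumptions: comment lines start with ';' (as in the grep command above), and the species name appears as a whole word on its line. The helper species_is_listed is hypothetical, and the demonstration file is a mock, not real Ensembl output.

```shell
#!/bin/sh
# Sketch: test whether a species is listed in an -available_species result file.
# species_is_listed is a hypothetical helper, not an RSAT program.
# Assumptions: comment lines start with ';' and the species name appears
# as a whole word somewhere on a non-comment line.
species_is_listed() {
    species=$1
    list_file=$2
    grep -v '^;' "$list_file" | grep -q -w -F -- "$species"
}

# Demonstration on a small mock file (made-up content)
demo_file=$(mktemp)
printf '; mock header line\nHomo_sapiens\nSaccharomyces_cerevisiae\n' > "$demo_file"

if species_is_listed Saccharomyces_cerevisiae "$demo_file"; then
    echo "Saccharomyces_cerevisiae: listed"
fi
if ! species_is_listed Danio_rerio "$demo_file"; then
    echo "Danio_rerio: not listed"
fi
rm -f "$demo_file"
```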
Note: as an alternative to download-ensembl-variations, we could have used the command install-ensembl-genome with the option -task variations.
We can now download all the variations available for the yeast.
## Download all variations for a selected organism on your server
download-ensembl-variations -v 1 -species Saccharomyces_cerevisiae
Variation files are stored in a specific subfolder for the specified organism.
## Check the content of the variation directory for the yeast
make -f makefiles/variation-scan_demo.mk \
SPECIES=Saccharomyces_cerevisiae ASSEMBLY=R64-1-1 \
variation_stats
This command will indicate the location of the variation directory on your server, and count the number of lines for each variation file (there is one separate file per chromosome or contig).
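When the make target is not at hand, a similar per-file line count can be obtained with standard tools, once the variation directory is known. The sketch below assumes only that the directory contains one plain file per chromosome or contig. The helper count_variation_lines and the mock file names are hypothetical.

```shell
#!/bin/sh
# Sketch: count the lines of each file in a variation directory.
# count_variation_lines is a hypothetical helper, not an RSAT program.
count_variation_lines() {
    var_dir=$1
    for f in "$var_dir"/*; do
        # Skip subdirectories and the unexpanded pattern of an empty directory
        [ -f "$f" ] || continue
        # Arithmetic expansion strips the padding some wc versions add
        n=$(($(wc -l < "$f")))
        printf '%s\t%s\n' "$(basename "$f")" "$n"
    done
}

# Demonstration on a mock directory (made-up file names and content)
demo_dir=$(mktemp -d)
printf 'line1\nline2\n' > "$demo_dir/mock_chrom_A.tsv"
printf 'line1\n' > "$demo_dir/mock_chrom_B.tsv"
count_variation_lines "$demo_dir"
rm -rf "$demo_dir"
```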
In the examples above we intentionally installed the smallest genome available at Ensembl, in order to obtain the results in a reasonable time and with reasonable disk space usage.
Installing Metazoa genomes takes much more disk space and significantly more time.
We summarize hereafter the commands to download and install the Human genome, its annotations, and the variations from Ensembl. For this purpose, we customize the options of install-ensembl-genome.
Beware: the following command will use ~11.7 GB of disk space for the genome + 11.5 GB for the variations.
## Install Human genome from Ensembl, including polymorphic variations
install-ensembl-genome -v 2 -db ensembl -org Homo_sapiens -task genome,features,config,install,variations
## Check the disk space occupied by the different folders
du -sm data/genomes/Homo_sapiens_GRCh38/*
## Result (2018-08-06) in Megabytes per folder
# 11701 data/genomes/Homo_sapiens_GRCh38/genome
# 18 data/genomes/Homo_sapiens_GRCh38/oligo-frequencies
# 11544 data/genomes/Homo_sapiens_GRCh38/variations
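Given the sizes reported above, it may be worth checking the free disk space before starting such an installation. The following is a minimal sketch using the standard df and awk utilities. The helper has_free_space is hypothetical, and the required size is simply the sum of the three folder sizes reported above for the 2018 release, so it should be treated as a rough lower bound.

```shell
#!/bin/sh
# Sketch: check that a file system has enough free space before installing.
# has_free_space is a hypothetical helper, not an RSAT program.
# Usage: has_free_space DIR REQUIRED_MB
has_free_space() {
    dir=$1
    required_mb=$2
    # df -P prints one line per file system; -k reports 1024-byte blocks.
    # Assumption: the file system name contains no spaces.
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$((avail_kb / 1024))" -ge "$required_mb" ]
}

# Sum of the folder sizes (in megabytes) reported for the Human genome
required=$((11701 + 18 + 11544))
if has_free_space . "$required"; then
    echo "Enough free space for about ${required} MB"
else
    echo "Less than ${required} MB available in the current file system"
fi
```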
For all non-vertebrate organisms (Protists, Fungi, Plants, Bacteria, Metazoa), please see Installing genomes from Ensembl Genomes.