The Ensembl browser (http://ensembl.org) contains mostly genomes of Vertebrate organisms.
RSAT includes a series of programs to download and install genomes from Ensembl:
install-ensembl-genome is a wrapper that automates
the download (genome sequences, features, variations) and
configuration tasks.
download-ensembl-genome downloads the genomic
sequences and converts them to the raw format required by RSAT.
download-ensembl-features downloads tab-delimited
text files describing genomic features (transcripts, CDS, genes,
…).
download-ensembl-variations downloads tab-delimited
text files describing genomic variations (polymorphism).
The program install-ensembl-genome manages all the
required steps to download and install a genome (sequence, features, and
optionally variations) from Ensembl to RSAT.
It performs the following tasks:
install-ensembl-genome -available_species returns
the list of species available on the Ensembl server, together with their
availability status for the 3 data types (genome sequence, features,
variations). When this option is called, the program does not install any
genome.
The option
install-ensembl-genome -task genome -org [Selected_organism]
runs the program download-ensembl-genome to download the
complete genomic sequence of a given organism from the Ensembl Web site, and formats it according
to RSAT requirements (conversion from the original fasta
sequence file to one file per chromosome, in raw format).
The option
install-ensembl-genome -task features -org [Selected_organism]
runs download-ensembl-features to download the positions
and descriptions of genomic features (genes, CDS, mRNAs, …).
If the option -task variations is activated,
install-ensembl-genome -org [Selected_organism] runs
download-ensembl-variations to download the description of
genomic variations (polymorphism). Note that variations are supported
only for a subset of genomes.
install-ensembl-genome -org [Selected_organism] -task config
updates RSAT configuration files to make the newly installed genome
available.
install-ensembl-genome -org [Selected_organism] -task install
runs the additional tasks required to have a fully functional genome on
the local site: computing genomic statistics (intergenic sizes, …) and
background models (oligonucleotide and dyad frequencies).
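The tasks listed above can also be chained by hand. The block below is a minimal sketch, assuming the -org, -task and -v options behave as described in this section. The function install_stepwise and the variables INSTALLER and DRY_RUN are hypothetical helpers introduced for illustration; they are not part of RSAT.

```shell
#!/usr/bin/env bash
# Sketch: run the installation tasks one at a time, stopping at the first failure.
# INSTALLER and DRY_RUN are hypothetical helper variables, not RSAT options.
INSTALLER="${INSTALLER:-install-ensembl-genome}"

install_stepwise() {
    local org="$1"; shift
    local task
    for task in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Dry run: only print the command that would be executed
            echo "$INSTALLER -v 1 -org $org -task $task"
        else
            "$INSTALLER" -v 1 -org "$org" -task "$task" || {
                echo "Task '$task' failed for $org" >&2
                return 1
            }
        fi
    done
}

# Example (dry run): print the commands for four tasks, in order
# DRY_RUN=1 install_stepwise Saccharomyces_cerevisiae genome features config install
```

Running the tasks separately makes it easier to see which step failed and to resume from that step, at the cost of a few more lines than a single call with a comma-separated task list.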
The detailed description of the program and the list of options can be obtained with the option .
Before installing a genome, it is generally a good idea to know which genomes are available. For this, use the option -available_species.
export TODAY=`date '+%Y-%m-%d'`
## Retrieve the list of supported species on EnsEMBL
install-ensembl-genome -v 1 -available_species \
-o available_species_ensembl_${TODAY}.tsv
## Read the result file
more available_species_ensembl_${TODAY}.tsv
## Count the number of available genomes (using grep -v to discard comment lines)
grep -v '^;' available_species_ensembl_${TODAY}.tsv | wc -l
## Note: on August 6, 2018, this returns 117 organism names
Beware: inter-individual variations are
available for only a subset of the genomes in Ensembl. The
option -available_species indicates, for each species, the
availability of each data type (genome, features, variations). The programs that
analyse regulatory variations (variation-info,
convert-variations, retrieve-variation-seq,
variation-scan) therefore work only for genomes documented
with variations.
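Before starting an installation, it can be convenient to check that a species name appears in the list. The following is a minimal sketch, assuming only what the commands above show: the file is tab-delimited and its comment lines start with ';'. Since the column layout is not described here, the hypothetical helper species_listed searches every column, and it ignores case because the capitalisation used in the list is not known.

```shell
# Sketch: succeed if SPECIES appears as a full field in a tab-delimited list.
# species_listed is a hypothetical helper, not an RSAT command.
species_listed() {
    local file="$1" species="$2"
    awk -F'\t' -v sp="$species" '
        /^;/ { next }    # skip comment lines
        {
            for (i = 1; i <= NF; i++)
                if (tolower($i) == tolower(sp)) { found = 1; exit }
        }
        END { exit !found }
    ' "$file"
}

# Example:
# species_listed available_species_ensembl_${TODAY}.tsv Saccharomyces_cerevisiae \
#     && echo "listed" || echo "not listed"
```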
We can now download and install the complete genomic sequence for the species of our choice. To save space and time, we will use a small genome for this manual: the budding yeast Saccharomyces cerevisiae.
Beware: some installation steps take a lot of time. For large genomes (e.g. Vertebrate organisms), the full installation can take several hours. This should in principle not be a big issue, since installing a genome is not a daily task, but it is worth knowing that the whole process requires a continuous connection for several hours.
## Install the genome sequences for a selected organism
install-ensembl-genome -v 2 -species Saccharomyces_cerevisiae
This command will automatically run all the installation tasks described above, except the installation of variations (see Section 1.3).
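Because an installation can run for hours, one option is to detach it from the terminal so that closing the session (for instance a dropped SSH connection to the machine running RSAT) does not kill it. The sketch below uses the standard nohup utility; run_detached is a hypothetical helper, not part of RSAT. It does not protect against an interruption of the network link to the Ensembl server itself.

```shell
# Sketch: run a long command detached from the terminal,
# sending stdout and stderr to a log file.
# run_detached is a hypothetical helper, not an RSAT command.
run_detached() {
    local log="$1"; shift
    nohup "$@" > "$log" 2>&1 &
    echo "Started PID $! (log: $log)"
}

# Example:
# run_detached install_yeast.log \
#     install-ensembl-genome -v 2 -species Saccharomyces_cerevisiae
# tail -f install_yeast.log    # follow the progress
```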
The program download-ensembl-variations downloads variations from the Ensembl Web site and installs them on the local site.
This program relies on , which must be installed beforehand on your computer.
## Retrieve the list of supported species in the EnsEMBL variation database
download-ensembl-variations -v 1 -available_species -o species_with_variations_ensembl.tsv
## Check the content of the result file
more species_with_variations_ensembl.tsv
## Count genomes with variations available at Ensembl
grep -v '^;' species_with_variations_ensembl.tsv | wc -l
## Result : 23 on August 6, 2018
Note: as an alternative to download-ensembl-variations,
we could have used the command install-ensembl-genome with
the option -task variations.
We can now download all the variations available for the yeast.
## Download all variations for a selected organism on your server
download-ensembl-variations -v 1 -species Saccharomyces_cerevisiae
Variation files are stored in a specific subfolder for the specified organism.
## Check the content of the variation directory for the yeast
make -f makefiles/variation-scan_demo.mk \
SPECIES=Saccharomyces_cerevisiae ASSEMBLY=R64-1-1 \
variation_stats
This command will indicate the location of the variation directory on your server, and count the number of lines for each variation file (there is one separate file per chromosome or contig).
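The same per-file line count can be reproduced by hand once the variation directory is known. This is a minimal sketch, assuming the files are uncompressed text. The helper lines_per_file is hypothetical, and the directory to pass is the one reported by the make target; no path is assumed here.

```shell
# Sketch: print "<line count><TAB><file name>" for each regular file in a directory.
# lines_per_file is a hypothetical helper, not an RSAT command.
lines_per_file() {
    local dir="$1" f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue    # skip sub-directories and unmatched globs
        printf '%s\t%s\n' "$(wc -l < "$f" | tr -d ' ')" "$(basename "$f")"
    done
}

# Example (replace the argument with the directory reported by variation_stats):
# lines_per_file path/to/variation_directory
```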
In the examples above we deliberately installed the smallest genome available at Ensembl, in order to obtain the results in a reasonable time and with a reasonable disk space occupancy.
The installation of Metazoa occupies much more space, and takes a significantly longer time.
We summarize below the commands to download and install the Human
genome, its annotations, and the variations from Ensembl. For this
purpose, we customize the options of
install-ensembl-genome in order to:
Beware: the following command will use ~11.7 GB of disk space for the genome + 11.5 GB for the variations.
## Install Human genome from Ensembl, including polymorphic variations
install-ensembl-genome -v 2 -db ensembl -org Homo_sapiens -task genome,features,config,install,variations
## Check the disk space occupied by the different folders
du -sm data/genomes/Homo_sapiens_GRCh38/*
## Result (2018-08-06) in Megabytes per folder
# 11701 data/genomes/Homo_sapiens_GRCh38/genome
# 18 data/genomes/Homo_sapiens_GRCh38/oligo-frequencies
# 11544 data/genomes/Homo_sapiens_GRCh38/variations
For all non-Vertebrate organisms (Protists, Fungi, Plants, Bacteria, Metazoa), please check Installing genomes from Ensembl Genomes.
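Given the disk usage reported above for the Human genome, it may be worth checking the free space before launching a large installation. The sketch below relies on the POSIX output format of df. The helper has_free_space is hypothetical, and the threshold in the example is simply the sum of the three folder sizes listed above (11701 + 18 + 11544 = 23263 MB); sizes will differ for other releases.

```shell
# Sketch: succeed if the file system holding DIR has at least REQUIRED_MB megabytes free.
# has_free_space is a hypothetical helper, not an RSAT command.
has_free_space() {
    local dir="$1" required_mb="$2" avail_kb
    # df -Pk: POSIX output format, sizes in 1024-byte blocks;
    # the 4th column of the 2nd line is the available space
    avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    [ "$((avail_kb / 1024))" -ge "$required_mb" ]
}

# Example:
# has_free_space data/genomes 23263 || echo "Not enough free disk space" >&2
```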