1 Introduction

This document explains how to install genomes and annotations from Ensembl Genomes using mostly the FTP site. Gene Ontology terms are optional and are obtained from BioMart instead.

Note that while Ensembl covers Vertebrates, Ensembl Genomes (EG, also known as Non Vertebrates, NV) includes the other divisions (Protists, Fungi, Plants, Bacteria, Metazoa). These instructions have mostly been tested with Ensembl Plants.

This protocol uses the Makefile ensemblgenomes_FTP_client.mk, which can also be used to install arbitrary genomes from other sources from FASTA and GTF files, as explained in Installing organisms from FASTA and GTF files.

This document does not use the following scripts, which are documented in Installing genomes and variations from Ensembl:

  • install-ensembl-genome,
  • download-ensembl-genome,
  • download-ensembl-features,
  • download-ensembl-variations.

2 Installing genome sequences and annotations from Ensembl Genomes

2.1 Check FTP site URL

The current FTP site is at ftp://ftp.ensemblgenomes.org ; should it change in the future, or its folder structure, ensemblgenomes_FTP_client.mk will be updated accordingly.

2.2 Find out current Ensembl Genomes release

Visit the Web site of your division of interest, such as http://plants.ensembl.org, and check the release statement at the bottom. Alternatively you can use the REST info/eg_version endpoint

Ensembl Genome releases are integers and have an offset of 53 with Ensembl releases. For instance, Ensembl release 99 corresponds to Ensembl Genomes 46.

export EGRELEASE=46

2.3 Get supported species

We shall download the current list of supported genomes in the respective division (GROUP). This can be done by running the following command:

cd $RSAT
export EGDIVISION=Plants

make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE organisms

# in the example, this will create $RSAT/data/ensemblgenomes/plants/release-46/species_EnsemblPlants.txt

Once this is done then you can install genomes from that release.

3 Install all species

You can install all genomes with these commands:

cd $RSAT

# download FASTA and GTF files
nohup make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE download_all_species

# parse input files, extract genomic features and compute oligo frequencies
nohup make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE install_all_species

# check upstream sequences can be retrieved
nohup make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE check_all_species

The newly installed species will be added to $RSAT/data/supported_organisms.tab and should be listed with the following command-line:

supported-organisms

This will take a very long time, many days for a complete Ensembl Genomes release. The most time-consuming task is the calculation of genomic frequencies of oligos and dyads. Check the table for the installation time of selected plant genomes:

organisms genome assembly length (Mb) installation time (minutes)
Arabidopsis_thaliana.TAIR10.46 119.7 144
Brachypodium_distachyon.Brachypodium_distachyon_v3.0.46 271.2 233
Oryza_sativa.IRGSP-1.0.46 375.0 290
Beta_vulgaris.RefBeet-1.2.2.46 566.2 320
Glycine_max.Glycine_max_v2.1.46 978.5 660
Zea_mays.Zm-B73-REFERENCE-NAM-5.0.51 2182.1 1069
Helianthus_annuus.HanXRQr1.0.46 3027.8 1689
Hordeum_vulgare.IBSCv2.46 4834.4 3574
Triticum_aestivum.IWGSC.46 14547.3 10458

4 Install selected species

In case you want to install a single genome you can do that by:

cd $RSAT
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE SPECIES=oryza_longistaminata download_one_species
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE SPECIES=oryza_longistaminata install_one_species

You can also install selected genomes from older releases:

cd $RSAT
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=42 organisms
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=42 SPECIES=oryza_sativa download_one_species
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=42 SPECIES=oryza_sativa install_one_species

As indicated earlier, the newly installed species are added to $RSAT/data/supported_organisms.tab and should appear in the list produced by:

supported-organisms

5 Other optional tasks

5.1 Compute genome stats report

This step generates a report of descriptive stats of genomes currently in your system, such as http://rsat.eead.csic.es/plants/data/stats :

make -f makefiles/ensemblgenomes_FTP_client.mk calc_stats

5.2 Install variation data

If variation data is available for your species of interest you can download it with:

cd $RSAT
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE SPECIES=oryza_sativa variations_one_species

Note: this will update file $RSAT/data/supported_organisms.tab

5.3 Download and install Ensembl Compara homologies (optional)

Script get-orthologs-compara can be used to retrieve homologues (orthologues by default) precomputed at Ensembl Compara. In order to use it you must first install Compara in your system, which you can do with:

cd $RSAT
export ENSRELEASE=99

make -f makefiles/ensemblgenomes_FTP_client.mk RELEASE=$EGRELEASE ENSEMBL_RELEASE=$ENSRELEASE GROUP=$EGDIVISION download_compara
make -f makefiles/ensemblgenomes_FTP_client.mk RELEASE=$EGRELEASE ENSEMBL_RELEASE=$ENSRELEASE GROUP=$EGDIVISION parse_compara_match
make -f makefiles/ensemblgenomes_FTP_client.mk RELEASE=$EGRELEASE ENSEMBL_RELEASE=$ENSRELEASE GROUP=$EGDIVISION install_compara

All going well you can check the species with installed homologies with:

get-orthologs-compara -supported_organisms

5.4 Install Gene Ontology terms from BioMart

These operations have been tested in Ensembl Plants, but should work also with Metazoa and Fungi. They require the installation of optional software, which can be done as follows:

sudo bash
cd $RSAT
make -f makefiles/install_software.mk install_biomart-perl check_biomart-perl

To complete the configuration we shall get the current registry for the division of interest. For instance, for Plants the registry is at http://plants.ensembl.org/biomart/martservice?type=registry

The XML content of the registry must be copied and pasted into file $RSAT/ext_lib/biomart-perl/conf/martURLLocation.xml . For instance, for Plants release 46 it looked like this:

<?xml version="1.0" encoding="UTF-8"?>

<MartRegistry>
  <MartURLLocation database="plants_mart_46" default="" displayName="Ensembl Plants Genes 46" host="plants.ensembl.org" includeDatasets="" martUser="" name="plants_mart" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="1" />
  <MartURLLocation database="plants_snp_mart_46" default="" displayName="Ensembl Plants Variations 46" host="plants.ensembl.org" includeDatasets="" martUser="" name="plants_variations" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="1" />
  <MartURLLocation database="plants_sequence_mart_46" default="" displayName="Ensembl Plants Sequences 46" host="plants.ensembl.org" includeDatasets="" martUser="" name="plants_sequences" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="0" />
  <MartURLLocation database="plants_genomic_features_mart_46" default="" displayName="Ensembl Plants Genomic Features 46" host="plants.ensembl.org" includeDatasets="" martUser="" name="plants_genomic_features" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="0" />
  <MartURLLocation database="ontology_mart_99" default="" displayName="Ontology Mart 99" host="plants.ensembl.org" includeDatasets="" martUser="" name="ontology" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="0" />
</MartRegistry>

Once this is done we shall update the registry cache with:

download-ensembl-go-annotations-biomart -reg

This will take a few minutes, but then we can do:

make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE download_go

And for each species for which we want GO terms we can now do:

make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE SPECIES=oryza_sativa download_go_annotations