This document explains how to install genomes and annotations from Ensembl Genomes using mostly the FTP site. Gene Ontology terms are optional and are obtained from BioMart instead.
Note that while Ensembl covers Vertebrates, Ensembl Genomes (EG, also known as Non Vertebrates, NV) includes the other divisions (Protists, Fungi, Plants, Bacteria, Metazoa). These instructions have mostly been tested with Ensembl Plants.
This protocol uses the Makefile ensemblgenomes_FTP_client.mk, which can also be used to install arbitrary genomes from other sources from FASTA and GTF files, as explained in Installing organisms from FASTA and GTF files.
This document does not use the following scripts, which are documented in Installing genomes and variations from Ensembl:
install-ensembl-genome
,download-ensembl-genome
,download-ensembl-features
,download-ensembl-variations
.The current FTP site is at ftp://ftp.ensemblgenomes.org ; should it change in the future, or its folder structure, ensemblgenomes_FTP_client.mk will be updated accordingly.
Visit the Web site of your division of interest, such as http://plants.ensembl.org, and check the release statement at the bottom. Alternatively you can use the REST info/eg_version endpoint
Ensembl Genome releases are integers and have an offset of 53 with Ensembl releases. For instance, Ensembl release 99 corresponds to Ensembl Genomes 46.
We shall download the current list of supported genomes in the respective division (GROUP). This can be done by running the following command:
cd $RSAT
export EGDIVISION=Plants
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE organisms
# in the example, this will create $RSAT/data/ensemblgenomes/plants/release-46/species_EnsemblPlants.txt
Once this is done then you can install genomes from that release.
You can install all genomes with these commands:
cd $RSAT
# download FASTA and GTF files
nohup make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE download_all_species
# parse input files, extract genomic features and compute oligo frequencies
nohup make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE install_all_species
# check upstream sequences can be retrieved
nohup make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE check_all_species
The newly installed species will be added to $RSAT/data/supported_organisms.tab and should be listed with the following command-line:
This will take a very long time, many days for a complete Ensembl Genomes release. The most time-consuming task is the calculation of genomic frequencies of oligos and dyads. Check the table for the installation time of selected plant genomes:
organisms | genome assembly length (Mb) | installation time (minutes) |
---|---|---|
Arabidopsis_thaliana.TAIR10.46 | 119.7 | 144 |
Brachypodium_distachyon.Brachypodium_distachyon_v3.0.46 | 271.2 | 233 |
Oryza_sativa.IRGSP-1.0.46 | 375.0 | 290 |
Beta_vulgaris.RefBeet-1.2.2.46 | 566.2 | 320 |
Glycine_max.Glycine_max_v2.1.46 | 978.5 | 660 |
Zea_mays.Zm-B73-REFERENCE-NAM-5.0.51 | 2182.1 | 1069 |
Helianthus_annuus.HanXRQr1.0.46 | 3027.8 | 1689 |
Hordeum_vulgare.IBSCv2.46 | 4834.4 | 3574 |
Triticum_aestivum.IWGSC.46 | 14547.3 | 10458 |
In case you want to install a single genome you can do that by:
cd $RSAT
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE SPECIES=oryza_longistaminata download_one_species
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE SPECIES=oryza_longistaminata install_one_species
You can also install selected genomes from older releases:
cd $RSAT
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=42 organisms
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=42 SPECIES=oryza_sativa download_one_species
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=42 SPECIES=oryza_sativa install_one_species
As indicated earlier, the newly installed species are added to $RSAT/data/supported_organisms.tab and should appear in the list produced by:
This step generates a report of descriptive stats of genomes currently in your system, such as http://rsat.eead.csic.es/plants/data/stats :
If variation data is available for your species of interest you can download it with:
cd $RSAT
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE SPECIES=oryza_sativa variations_one_species
Note: this will update file $RSAT/data/supported_organisms.tab
Script get-orthologs-compara
can be used to retrieve
homologues (orthologues by default) precomputed at Ensembl Compara. In
order to use it you must first install Compara in your system, which you
can do with:
cd $RSAT
export ENSRELEASE=99
make -f makefiles/ensemblgenomes_FTP_client.mk RELEASE=$EGRELEASE ENSEMBL_RELEASE=$ENSRELEASE GROUP=$EGDIVISION download_compara
make -f makefiles/ensemblgenomes_FTP_client.mk RELEASE=$EGRELEASE ENSEMBL_RELEASE=$ENSRELEASE GROUP=$EGDIVISION parse_compara_match
make -f makefiles/ensemblgenomes_FTP_client.mk RELEASE=$EGRELEASE ENSEMBL_RELEASE=$ENSRELEASE GROUP=$EGDIVISION install_compara
All going well you can check the species with installed homologies with:
These operations have been tested in Ensembl Plants, but should work also with Metazoa and Fungi. They require the installation of optional software, which can be done as follows:
To complete the configuration we shall get the current registry for the division of interest. For instance, for Plants the registry is at http://plants.ensembl.org/biomart/martservice?type=registry
The XML content of the registry must be copied and pasted into file $RSAT/ext_lib/biomart-perl/conf/martURLLocation.xml . For instance, for Plants release 46 it looked like this:
<?xml version="1.0" encoding="UTF-8"?>
<MartRegistry>
<MartURLLocation database="plants_mart_46" default="" displayName="Ensembl Plants Genes 46" host="plants.ensembl.org" includeDatasets="" martUser="" name="plants_mart" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="1" />
<MartURLLocation database="plants_snp_mart_46" default="" displayName="Ensembl Plants Variations 46" host="plants.ensembl.org" includeDatasets="" martUser="" name="plants_variations" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="1" />
<MartURLLocation database="plants_sequence_mart_46" default="" displayName="Ensembl Plants Sequences 46" host="plants.ensembl.org" includeDatasets="" martUser="" name="plants_sequences" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="0" />
<MartURLLocation database="plants_genomic_features_mart_46" default="" displayName="Ensembl Plants Genomic Features 46" host="plants.ensembl.org" includeDatasets="" martUser="" name="plants_genomic_features" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="0" />
<MartURLLocation database="ontology_mart_99" default="" displayName="Ontology Mart 99" host="plants.ensembl.org" includeDatasets="" martUser="" name="ontology" path="/biomart/martservice" port="80" serverVirtualSchema="plants_mart" visible="0" />
</MartRegistry>
Once this is done we shall update the registry cache with:
This will take a few minutes, but then we can do:
And for each species for which we want GO terms we can now do:
make -f makefiles/ensemblgenomes_FTP_client.mk GROUP=$EGDIVISION RELEASE=$EGRELEASE SPECIES=oryza_sativa download_go_annotations