1 Genome-wise comparison of protein sequences

In this section, we explain how to use the program genome-blast, which runs sequence similarity search with BLAST in order to detect significant similarities between all the proteins of a set of genomes.

This operation can take time, and the result tables occupy a considerable amount of space on the hard disk. For this reason, the distribution does thus not include the complete comparison of all genomes against all other ones, but is restricted to some model genomes (Escherichia_coli_GCF_000005845.2_ASM584v2 versus all Gammaproteobacteria, Saccharomyces cerevisiae against all Fungi, …).

Depending on your organism of interest, you might wish to perform additional comparisons for your own purpose. In this section, we explain how to compute the similiraty tables between a query organism (e.g. Escherichia_coli_GCF_000005845.2_ASM584v2) and a reference taxon (e.g. all Gammaproteobacteria).

In order to install the tables of similarities between gene products in , you need writing permissions in the directory $RSAT/public_html/data. If this is not the case, ask your system administrator to do it for you.

1.1 Running genome-blast between two genomes

As a first test, we will use to compare all the gene products (proteins) of a query organism (e.g. Escherichia_coli_GCF_000005845.2_ASM584v2) against all the gene products of a reference organism (e.g. Pseudomonas_aeruginosa_GCF_000006765.1_ASM676v1).

This protocol assumes that the two organisms are already installed on your site, as explained in the installation guide.

We will perform in two steps.

  1. Use the program (which is part of the distribution) to create a BLAST-formatted structure (the “database”) with all proteins of the reference organism ().

  2. Use the program (part of the distribution) to detect similarities between each protein of the query organism () and the reference organism.

1.1.0.1 Formatting the BLAST database

This DB formatting step is very efficient, it should be completed in a few seconds.

The result is found in the data directory containing . A new directory has been created, which contains the BLAST-formatted database with all the proteins of the reference organism.

ls -ltr $RSAT/data/genomes/Bacillus_subtilis/blastdb

These are binary files, that you should in principle not open as such.

1.1.0.2 Searching similarities

The program compares all the sequences of an input set against all the sequences of a database (the one we just created above). The program generates the appropriate command to find the BLAST database directory, and query it with the proteins of the query organism.

This task takes a bit less that one minute for (because we chose a very small genomes), and can take around 10 minutes for medium-sized bacterial genomes ( 4,000 genes).

Note that the command is written in the verbosity message. If you have specific reasons to customize this command, you can adapt it to apply different parameters.

1.1.0.3 Searching reciprocal similarities

One classical orthology criterion (which is not perfect but has practical advantages) is to select the bidirectional best hits as candidate orthologs.

For this, we need to run the reciprocal blast, i.e. using as query organism, and as reference organism.

Note that you can run the two BLAST commands ( and ) in a single shot, by specifying multiple tasks for .

We can now perform a quick test: select the bidirectional best hit () for the gene .

1.1.1 Applying genome-blast between a genome and a taxon

Generally, we want to compare a query organism to all the organisms of a given taxon (the ). This can be done with the option .

As an example, we will BLAST all the proteins of against all the proteins of each species of .

We can also use the option to activate the reciprocal search: BLAST all gene products of each bacteria of the reference taxon against those of the query organism .

We can now retrieve the orthologs of a gene (e.g. ) in all Mollicutes.

1.2 Getting putative homologs, orthologs and paralogs

In this section, I will explain how to use the program . This program takes as input one or several query genes belonging to a given organism (the ), and return the genes whose product (peptidic sequence) show significant similarities with the products of the query genes. The primary usage of is thus to return lists of similar genes, not specialy orthologs. Additional criteria can be imposed to infer orthology. In particular, one of the most common criterion is to select . This can be achieved by imposing the rank 1 with the option .

We will illustrate the concept by retrieving the genes whose product is similar to the protein LexA of , in all the Enterobacteriales. We will then refine the query to extract putative orthologs.

1.2.1 Getting genes by similarities

The result file is a list of all the Enterobacteriales genes whose product shows some similarity with the LexA protein from E.coli K12.

   ...
   #ref_id ref_org query
   Sde_1787        Saccharophagus_degradans_2-40   b4043
   CPS_0237        Colwellia_psychrerythraea_34H   b4043
   CPS_2683        Colwellia_psychrerythraea_34H   b4043
   CPS_1635        Colwellia_psychrerythraea_34H   b4043
   IL0262  Idiomarina_loihiensis_L2TR      b4043
   ...
   c5014   Escherichia_coli_CFT073 b4043
   c3190   Escherichia_coli_CFT073 b4043
   b4043   Escherichia_coli_K_12_substr__MG1655_uid57779    b4043
   ...

Each similarity is reported by the ID of the gene, the organism to which is belong, and the ID of the query gene. In this case, the third column contains the same ID on all lines: b4043, which is the ID of the gene lexA in . It seems thus poorly informative, but this column becomes useful when several queries are submitted simultaneously.

1.2.2 Obtaining information on the BLAST hits

The program allows to return additional information on the hits. The list of supported return fields is obtained by calling the command with the option . For example, we can ask to return the percentage of identity, the alignment length, the E-value and the rank of each hit.

Which gives the following result:

   ...
   #ref_id ref_org query   ident   ali_len e_value rank
   Sde_1787        Saccharophagus_degradans_2-40   b4043   65.33   199     1e-68   1
   CPS_0237        Colwellia_psychrerythraea_34H   b4043   65.69   204     6e-75   1
   CPS_2683        Colwellia_psychrerythraea_34H   b4043   33.94   109     1e-10   2
   CPS_1635        Colwellia_psychrerythraea_34H   b4043   34.12   85      1e-06   3
   IL0262  Idiomarina_loihiensis_L2TR      b4043   66.83   202     1e-75   1
   ...
   c5014   Escherichia_coli_CFT073 b4043   100.00  202     2e-111  1
   c3190   Escherichia_coli_CFT073 b4043   43.33   90      2e-14   2
   b4043   Escherichia_coli_K_12_substr__MG1655_uid57779    b4043   100.00  202     2e-111  1
   ...

Not surprisingly, the answer includes the self-match of lexA (ID b4043) in , with 100% of identify.

1.2.3 Selecting bidirectional best hits

We can see that the output contains several matches per genome. For instance, there are 3 matches in . If we assume that these similarities reflect homologies, the result contains thus a combination of paralogs and orthologs.

The simplest criterion to select ortholog is that of . We can select BBH by imposing an upper threshold on the rank, with the option .

The result has now been reduced to admit at most one hit per genome.

   ...
   #ref_id ref_org query   ident   ali_len e_value rank
   Sde_1787        Saccharophagus_degradans_2-40   b4043   65.33   199     1e-68   1
   CPS_0237        Colwellia_psychrerythraea_34H   b4043   65.69   204     6e-75   1
   IL0262  Idiomarina_loihiensis_L2TR      b4043   66.83   202     1e-75   1
   ...
   c5014   Escherichia_coli_CFT073 b4043   100.00  202     2e-111  1
   b4043   Escherichia_coli_K_12_substr__MG1655_uid57779    b4043   100.00  202     2e-111  1
   ...

1.2.4 Selecting hits with more stringent criteria

It is well known that the sole criterion of BBH is not sufficient to infer orthology between two genes. In particular, there is a risk to obtain irrelevant matches, due to partial matches between a protein and some spurious domains. To avoid this, we can add a constraint on the percentage of identity (min 30%), and on the alignment length (min 50 aa). These limits are somewhat arbitrary, we use them to illustrate the principe, and leave to each user the responsibility to choose the criteria that she/he considers as relevant. Finally, we will use a more stringent threshold on E-value than the default one, by imposing an upper threshold of 1e-10.

We can now combine the constrains above with the criterion of BBH.

As expected, the number of selected hits is reduced by adding these constraints. In Sept 2006, we obtained the following number of hits for lexA in Enterobacteriales.

  • 122 hits without any constraint;

  • 107 hits with contrains on ident,ali_len and e_value;

  • 69 hits with the constraint of BBH;

  • 69 hits with the combined constraint of BBH, at least 30% identity and an alignment over more than 50 aminoacids, and an E-value <= 1.e-10.

Actually, in the particular case of , the BBH constraint already filtered out the spurious matches, but inother cases they can be useful.

1.2.5 Restricting the number of reference organisms

The decrease of sequencing cost impulsed the multiplication of genome sequencing projects. In 2015, some bacterial species are represented by several hundreds of strains in the EnsemblGenomes database (e.g. ). A conseuence is that the comparative genomics analyses can become quite heavy. A more importance drawback is that some taxonomic branches are over-represented relative to other ones.

To reduce this problem, we added an option to the taxonomic tools.

1.3 Retrieving sequences for multiple organisms

The program can be used to retrieve sequences for a group of genes belonging to different organisms.This program takes as input a file with two columns. Each row of this file specifies one query gene.

  1. The first column contains the name or identifier of the gene (exactly as for the single-genome program ).

  2. The second column indicates the organism to which the gne belongs.

The output of can thus directly be used as input for .

1.5 Phylogenetic profiles

The notion of was introduced by Pellegrini et al. (1999). They identified putative orthologs for all the genes of in all the complete genomes available at that time, and built a table with one row per gene, one column per genome. Each cell of this table indicates if an ortholog of the considered gene (row) has been identified in the considered genome (column). Pellegrini et al. (1999) showed that genes having similar phylogenetic profiles are generally involved in common biological processes. The analysis of phylogenetic profiles is thus a powerful way to identify functional grouping in completely sequenced genomes.

The program can be used to obtain the phylogenetic profiles. The principle is to submit the complete list of protein-coding genes of the query organism. We process in two steps :

  1. With , we can identify the putative ortholgos for all the genes of the query organism, using the criterion of . This generate a large table with one row per pair of putative orthologs.

  2. We then use to convert the ortholog table into profiles (one row per gene, one column per genome).

We will illustrate this by calculating the phylogenetic profiles of all the genes from across all the Fungi. We use a level of verbosity of 2, in order to get information about the progress of the calculations.

The resulting table indicates the identifier of the ortholog genes. The option was used to specify that the string <NA> should be used to indicate the absence of putative orhtolog.

Another option would be to obtain a “quantitative” profile, where each cell indicates the E-value of the match between the two orthologs. This can be done by specifying a different score column with the option of .

1.6 Detecting pairs of genes with similar phylogenetic profiles

In the previous section, we generated tables indicating the phylogenetic profiles of each gene from . This table contains one row per gene, and one column per fungal genome.

We will now use the program to compare each gene profile to each other, to select the pairs of genes with significantly similar profiles. The problem is of course to choose our criterion of similarity between two gene profiles.

1.6.1 Comparing binary profiles with

For the binary profiles, the most relevant statistics is the .

In the previous commands, we set the verbosity to 2, in order to keep track the progress of the task. Actually, the processing can take a few minuts, it is probably the good moment for a coffee break.

1.6.2 Comparing binary profiles with

Another way to compare the phylogenetic profiles is to directly analyze with the table of orthology (previously obtained from ).

This is just another way of considering the same problem: in order to compare genes \(A\) and \(B\), we will consider as a first class (\(Q\)) the set of genomes in which gene \(A\) is present, and as a second class (\(R\)) the set of genomes in which gene \(B\) is present. We will then calculate the intersection between these two classes, and assess the significance of this intersection, given the total number of genomes.

Thus, will calculate the hypergeometric statistics, exactly in the same way as .