ECCB 2014 T01

Introduction

Goal: Being able to perform analysis similar to those in Session 1 using command-line tools only, installed locally or through the usage of a distance virtual machine.
The advantages are :

automatation of the tasks into analysis pipeline that can be used for several datasets
access to additional tools of the RSAT suite that have no web interface
running parallel analysis for example on a cluster
possibility to customize the tools by installing e.g. custom genomes
easy combination with additional command-line tools

Getting started with command-line tools

Goal:Reproduce the web-based analysis of session 1 using command-line tools.

Notice: In the following instructions, the commands that need to be typed in the terminal are indicated like this. After each command, the Return key needs to be pressed.

1 - Technicalities

Command-line tools can be accessed in different ways:

by installing the RSAT suite on a server using the instructions provided on this web page; or
using a virtual machine provided by the IFB which deploys a fully functional instance of the RSAT tools : check the instructions on this document; or
using a virtual machine on your own computer (the image needs to be downloaded from the USB stick): check the instructions here

2 - Exercice : fetching sequences from command line

Open a terminal; type fetch-sequences. The result should be an error message indicating that no genome information has been provided. However, this means that the RSAT tools are installed and can be accessed through the terminal.
Examine the options of the fetch-sequence tool by running fetch-sequences -h

The long list of options is typical for all RSAT command-line tools. This is the sign that these tools can be tuned using many parameters, but also implies that they are more complex to use. Many of these options have default values, meaning that predefined values will be used if no information for these options are provided by the user (optional parameters). However, some parameters have no default values and need to be provided by the user (as the genome information previously)

Try to identify the parameters which are compulsory.
Run the fetch-sequence tool using the bed file for the dog genome peaks. Make sure to use the -header_format galaxy in order to include the genomic coordinates of the sequence in the fasta header !

You should now have a file in your working directory containing the fasta sequences for the dog peaks.

Advanced motif search in ChIP-seq peaks

Goals:

Use a position-weight matrix (PWM) to locate possible transcription factor binding sites in ChIP-seq peaks (a.k.a. "motif-scanning").
Visualize the hits within the peak sequences using a genome browser.

1 - Inspecting PWMs

Classical ChIP-seq peaks are much longer than the binding site of the transcription factor that was immuno-precipitated. Hence, it is important to try to identify the most likely position of the transcription factor binding site (TFBS) inside the peaks.
In order to do this, we will scan the ChIP-seq peaks for CEPBa using two distinct CEPBa position-weight matrices (PWM).

Download the CEPBa PWMs in TRANSFAC format
1. MA0102.1
2. MA0102.3

Looking at these two files, can you see obvious differences between these PWMs ? How were these matrices built ? We can build the logos for these to inspect the matrices graphically

Run the command convert-matrix -from transfac -to transfac -return logo -i [pwm] where [pwm] represents the path to the PWM file in transfac format; navigate to the logos folder and visualize the PWM logos.

2 - Scanning sequences

We will now scan the peak regions using the two PWMs. We will restrict the analysis to the peaks of one of the smaller chromosomes (chr38, remember, dog !) to reduce computational time.

Fetch the sequences corresponding to the peaks on chromosome 38 using fetch-sequences.
Scan the resulting fasta file using the matrix-scan tool using the following command (values in brackets should be replaced!):
matrix-scan -quick -i [fasta] -m [pwm] -matrix_format tf -return pval -bginput -markov 3 -origin start -o [outfile] -v 2 -uth pval 1e-3

Exercice: check the meaning of the parameters used in the previous command using the help file of matrix-scan.

Hint choose a name for the output file that contains information about the parameters used and their values ! This is very usefull if we run this command again with a different set of parameters, in order to keep track of the output files.

Open the resulting file to understand the format and the information provided by the output.

How many peaks did we have ? How many predicted TFBS do we have ?

3 - Exporting the results to the UCSC browser

Goal: We want to visualize the PWM hits in the UCSC genome browser. More precisely, we want to display:

the localization of the ChIP-seq peaks
the localization of the PWM hits inside these peaks, for the two PWMs used.

You might have noticed that the coordinates of the PWM hits in the output file of matrix-scan (columns start and end) are relative coordinates, i.e. they are given relative to the start of the sequence that was scanned (first column). We first need to convert these relative coordinates to absolute or genomic coordinates. For this, we first need to extract the coordinates of the peaks sequences (basically the initial bed file, but in a slightly different format).

Run the command :
convert-features -from galaxy_seq -to bed -i [fasta] > [coord]
where [fasta] represents the fasta file of fetched sequences for chromosome 38, and [coord] is a new file in bed format containing the coordinated of all peaks.

Remember that we used the option -header_format galaxy in the fetch-sequences command ?
Convert the output of matrix-scan into a plain bed format using
convert-features -from ft -to bed -coord [coord] -i [outfile] -o [TFBS bed file]
Open the UCSC genome browser site ; select the dog genome (beware : canFam2)
click on the "Add custom tracks" button, and upload the bed file containing the peak coordinates, and the two bed files containig the TFBS inside the peaks obtained using the two PWMs.
Click on "go to the genome browser" (right-hand side) once you have uploaded the bed files, and navigate/zoom out to visulize some peak regions and the TFBS inside them.

Questions :

Do you see differences in the results obtained with the different PWMs ?
Can you identify peaks that have no TFBS inside ? Do you have an explanation for this ?

Testing the tools with negative controls

Goal: Test whether the tools are able to return negative results (i.e. no result at all) when the input dataset contains no signal. We will learn how to generate negative control datasets and test how the motif discovery tools behave on these datasets. This is a crucial test to validate the reliability of the results on real datasets !

1 - generate artificial random sequences

There are several ways to create random sequences to generate a negative control.

The simplest way would be to generate sequences from an equiprobable model, but this is a very unrealistic model since no biological sequence follows this model.
An alternative is to generate random sequences according to a Markov model (i.e. taking into account dependencies between nucleotides, cf. Session 1)
a more realistic control dataset would be to extract random genomic fragement from the genome of interest according to a template file (in order to retrieve the same number of sequences of equal size)

Run the following command
random-seq -i [template bed file] -template_format bed -bg upstream -markov 2 -org Canis_familiaris_EnsEMBL -o [output file]
Do not forget to check the help of this tool to understand the parameters !
Run peak-motifs on the output file
peak-motifs -i [previous output file] -title random-seq -markov auto -disco oligos,positions -nmotifs 5 -minol 6 -maxol 6 -2str -origin center -task purge,seqlen,composition,disco,merge_words,collect_motifs,synthesis -prefix peakmot_randomseq -noov -outdir random-seq-mmo-2 -v 2
Note that we chose a high verbosity (-v 2) which allows to follow the different steps the program is going through. You can however choose a lower verbosity (-v 0 or -v 1)

In order to visualize the results in html format, we need to fetch resulting folder (random-seq-mmo-2) locally, so that you can open the html files in a local browser. You can do this either using filezilla or scp, by indicating the login (root on the IFB cloud or vmusers of the local virtual machines)) and IP address of your virtual machine.

Navigate to the output folder just created and open the html file

Question : Do you find any over-represented motifs ? Check the different algorithm used (olig-analysis, position-analysis, ...).

2 - generate random genomic fragments

We now turn to a different negative control, i.e. extracting random genomic fragments from the genome of interest using as template the bed file of the true sequences.

Check the list of supported organisms using
supported-organisms Check the exact name of the Canis familiaris organims in the output.
Extract random genomic fragments from the dog genome using
random-genome-fragments -i [template bed file] -template_format bed -org Canis_familiaris_EnsEMBL -return seq -o [output file]
Run the previous peak-motifs command on this new file (Modify the name of the output dir and the title !)

Question : Do you find any over-represented motifs ? Check the different algorithm used (olig-analysis, position-analysis). Why do you see a different behavior depending on the algorithms ?

Slides:

Presentation

Dataset:

Schmidt,Wilson and Ballester Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science, 328(5981), 1036–1040. 2010 Article [PDF]
peaks:bed file

Session 2 : Using command-line tools

Introduction

Getting started with command-line tools

1 - Technicalities

2 - Exercice : fetching sequences from command line

Advanced motif search in ChIP-seq peaks

1 - Inspecting PWMs

2 - Scanning sequences

3 - Exporting the results to the UCSC browser

Testing the tools with negative controls

1 - generate artificial random sequences

2 - generate random genomic fragments

Slides:

Dataset:

Suggested Reading: