Introduction

Goal
The aim is to :
  • Get familiar with motif analysis of ChIP-seq data.
  • Learn de novo motif discovery methods.
  • Become familiar with using RSAT via the website.
In practice :
  • Motif discovery with peak-motifs.
  • Advanced parameter settings
  • .
  • visualisation of the putative TFBS
  • Motif enrichment with matrix-quality.

Retrieving sequences from your peaks

Goal: Given a set of peaks from a ChIP-seq experiment in a bed format, retrieve the sequences corresponding to those coordinates from the genome in fasta format.

1 - Example dataset1: CEBPa binding regions in dog liver

Schmidt, Wilson and Ballester published a ChIP-seq experiment on liver tissue to identify binding regions for the transcription factor CEBPa (PMID:20378774) in five different species (human, mouse, dog, short-tailed opossum and chicken). This data set is publicly available through arrayexpess (E-TABM-722).
As done by the authors, CEBPa binding regions (peaks) were called by running SWEMBL with parameter R=0.05, on merged reads from two biological replicates and their corresponding input controls. For this tutorial we will analyze CEBPa binding pattern in dog, peaks can be downloaded from here.

2 - Fetch sequences from a bed file

  1. In a web browser window open the RSAT web page
  2. In the menu (left side) click on the NGS-ChIP-seq drop down menu, and select the tool: fetch-sequences from UCSC.
  3. Select the genome of interest, in this case: canFam2.
  4. There are several options to input a bed file: Paste the coordinates, input from a URL, and upload the file from your computer. In this case, to prevent traffic between the teaching room and internet, we will favor using the URL option. Right click on this link and "copy link location" to get the URL of the peak coordinates (BED) file. Alternatively, save the file on your computer and use the "upload" method.
  5. Introduce your email address to receive a mail once the job is done.
  6. Click on Go once the form is complete.
  7. fetch_seq_SC1
  8. When the job is finished you will receive a link to the fasta file containing the sequences corresponding to the coordinates in the bed file.
Check point: Did you recover all sequences in the bed file?

Anticipated Results
  1. Fasta file.
  2. Log.
Selecting the correct genome version: Genomes are constantly updated with the improvement of sequencing technologies, alignment tools and annotations. Always verify you are selecting the correct version.

Results: At this point, you should have the URL link to the fasta file.

Discovering motifs from peak sequences

Goal:Discover binding motifs or patterns from a fasta file containing ChIP-seq determined binding regions of a transcription factor.

1 - Getting to know peak-motifs

peak-motifs is a computational pipeline that discovers motifs in peak sequences, compares them with databases, exports putative binding sites for visualization in the UCSC genome browser and generates an extensive report.
The following articles describe peak-motifs and its usage:
  1. Thomas-Chollier, M., Herrmann, C., Defrance, M., Sand, O., Thieffry, D. and van Helden, J. (2011). RSAT peak-motifs: motif analysis in full-size ChIP-seq datasets Nucleic Acids Research doi:10.1093/nar/gkr1104, 9. [Paper]
  2. Thomas-Chollier M, Darbo E, Herrmann C, Defrance M, Thieffry D, van Helden J. (2012). A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat Protoc 7(8): 1551-1568. [Paper]
In this section we will get familiar with this tool and its general usage. Its basic usage requires as input:
  1. Title for the analysis. For the studied dataset, use CEBPa_ChIP-seq_in_dog_liver
  2. A set of peak sequences in fasta format. Sequences can be pasted in the available box, input from a URL, and uploading a file from your computer. Alternatively, the sequences can come directly from another RSAT program (like fetch-sequences), as detailed just below.
  3. Introduce your email address to receive a mail once the job is done.
fetch_seq_SC2
Passing results from one tool to the next one: As a suite of tools, RSAT is designed to pass the output from one tool as input into related tools. E.g: From the output display in fetch-sequences, you can directly send the sequences to peak-motifs. fetch_seq_SC3
The peak-motifs output is formed by the following parts:
  1. Sequence Composition:The distribution of sequence lengths provides a useful way to detect outlier peaks (i.e., exceptionally long peaks that may ‘dilute’ the motif signal) or irregular length distributions resulting from problems during the peak-calling procedure. Nucleotide and dinucleotide compositions are computed and displayed in the form of heat maps and positional profiles
  2. PMSC3
  3. Motif Discovery:The workflow combines four word-based pattern-discovery algorithms that rely on two complementary criteria (overrepresentation and positional bias) to detect exceptional words (oligonucleotides) and spaced pairs of words (dyads). Significant words are used as seeds to build probabilistic description of motifs (position-specific scoring matrices), indicating residue variability at each position of the motif. Motif discovery will be done only using oligonucleotides detection by default.
  4. PMSC3
  5. Motif comparisons: Discovered motifs are compared with one or several public databases of annotated motifs to predict associated transcription factors. Comparison results are displayed as multiple motif alignments to highlight matches with several annotated motifs (e.g., factors belonging to the same family, composite motifs bound by protein complexes). Motif comparison is perfomred against vertebrate transcription factors binding motifs in JASPAR database.
  6. Binding site predictions:Sequences are scanned with the discovered motifs to locate binding sites, and their positioning within peaks is analyzed (coverage, positional distribution along peaks).
  7. PMSC3
Understanding peak-motifs results
  1. Do you have any concerns regarding peak compotition?
  2. Are there any significant motifs discovered?
  3. Were you expecting these results?
Anticipated Results:peak-motifs results

2 - Fine-tuning peak-motifs parameters

Several parameters can be tunned in peak-motifs in order to obtain better results.
  1. Reduce peak sequences: In our previous results it is possible to observe that most of the discovered motifs lay in the middle parts of the peaks. In order to focus our anaysis to this section of the peaks we can use this option and reduce the sequence length.
  2. PMSC4
  3. Motif-discovery parameters: The choice of motif discovery algorithms markedly affects the result. It is recommended to combine the analysis of overrepresentation (oligo-analysis) and positional bias (position-analysis). Other available analysis are based on: spaced pairs (dyad-analysis) and locally overrepresented words (local-word).
  4. PMSC5
  5. Motif Comparison: There are several databases that contain binding motifs available. Users can also add their own collections, in this case we will add as well the motif reported by Schmidt,Wilson and Ballester based on a CEBPa ChIP-seq done in mouse [motif]
  6. PMSC6
  7. Locate motifs: Locating discovered motifs in peaks can be useful to detect potitional bias, once an intersting motif is found it becomes important to locate the site in the genomic context, there are options available in peak-motifs that facilitate this task.
  8. PMSC7
Results and parameters
  1. Try different combinations of parameters. How would you improve these results?
  2. How different is the discovered dog CEBPa motif in comparison to the mouse reported motif?
Anticipated Results

3 - Using a sequence set as control

peak-motifs can take as input a second set of sequences to be used as a control. For example, if there is a set of peaks produced by a ChIP-seq experiment on a mutant that does not contain the transcription factor, this peak set can be used as a control for motif discovery.
Ballester, et al, in their most resent work (eLife, in press) classified the CEBPa peaks in: Peaks belonging to a Cluster of Regulatory Modules (CRM, several TFs binding together) and Singletons (only CEBPa). Using Singleton peaks as control for the CRM peaks it is possible to discover the CEBPA co-factors.
  1. CRMs:bed file, fasta file
  2. Singletons:bed file , fasta file
Results
  1. Now there are two sequence composition results, one for the query sample and one for the control. The query is composed by peaks in the CRM category, the control are Singletons, as expected intuitively the size of the sequences are different.
  2. The CEBPa motif is no longer found since it was over-represented in the control data set, now is possible to observe enrichment for other transcription factors like HNF4a which is a known important liver transcription factor likely to bind together with CEBPa
Anticipated Results:peak-motifs control, CRMs vs Singletons.

Visualizing the sites in the context of genome annotations

Goal: Visualize the predicted binding sites with the discovered motifs in genomic context.

1 - UCSC browser

Visualization of ChIP-seq data in the genome context can be very useful; it can be used to empirically assess quality and to identify interesting genomic regions.

USCS browser contains several annotations and data sets (mostly for human and model organisms) that can be visualized together with user specified samples.

VSC1

Users can create and share personalized sessions with their data.

VSC2

You can find a session we prepared containing the dog data set here

2 - Load predicted binding sites into UCSC browser

To visualize our binding sites predictions we need to:
  1. Dowload the bed file with the coordinates for the predicted sites from the peak-motifs output
  2. VSC3

  3. In UCSC browser select: My Data / Custom Tracks / add custom tracks.
  4. Select the bed file and click on submit
  5. This task can take time, don't close the window!!
  6. Once is loaded you will see one track per motif in the table
  7. Now go back to genome browser

3 - Interpretation

Sites in perspective
  1. Do all peaks have sites?
  2. Did you expect this results?
  3. Do you have any interesting findings?
Anticipated Results: UCSC Session.

ChIP-seq in bacteria

Goal: Apply the knowledge acquire today in a second data set.

1 - FNR ChIP-seq in E. coli K12

Myers, et al. recently published a paper where they characterized through ChIP-seq the binding profile for the transcription factor FNR (Paper) in Escherichia coli K12MG1655 .

Data was processed in the following way:

  1. Raw reads were downloaded from GEO database. ID:GSE41195.
  2. Reads were aligned to E.coli K12 MG1655 genome, version NC_000913.2 using bowtie.
  3. Peaks were called using MACS with parameters: --gsize 4639675 --name "macs14" --bw 400 --keep-dup 1 --bdg --single-profile --diag.
  4. The peak set is here.

2 - Get the genome sequence

We will require to have the fasta file for the NC_000913.2 version of the E.coli K12 MG1655 genome for the following step.

  1. We will download this file from NCBI.
  2. In the "Send to" menu, select file and then specify fasta format.
  3. FNRSC1

3 - Fetching peak sequences with Galaxy

Galaxy provides access through the web to useful tools for NGS analysis. We will use the tool Extract Genomic DNA to get the peak regions from the fasta file of the E. coli K12 MG1655 genome. This tool is specially useful for genomes that are not supported by resources like UCSC genome browser.

  1. Go to Galaxy:https://usegalaxy.org/, and login. In case you don't have a user please create one, is fast and it might be useful in the future.
  2. Now we will upload the genome (fasta format) and peaks (bed format) files into Galaxy through the tool Upload File under the "Get Data" menu. Select the corresponding type of data and select "unspecified" genome (the E. coli K12 genome availablable in Galaxy does not correspond to the verision we are using).
  3. FNRSC2

  4. The tool "Extract Genomic DNA" can be found under the Fetch Sequences dropdown menu.
  5. Select the bed and fasta files saved in the history as inputs and execute the job.
  6. FNRSC3

  7. Once the job is finished the result will appear in the history.
Fetching sequences for new genomes
  1. Do you know other options to fetch peak sequences for genomes that are new or not supported?
  2. You can also do this using the command line.
    $ bedtools getfasta -fi Escherichia_coli_K_12_MG1655.fasta -bed macs14_peaks.bed -fo macs14_peaks.fa 
Anticipated Results: fasta file.

4 - Motif Discovery

Now that we have the peak sequences we can do motif discovery.

You now know what to do!

  1. Go to the RSAT web page.
  2. Fill the form and input the peak sequences in fasta format.
  3. Select the desired options.
  4. Go!
Tunning parameters
  1. Which parameters did you use?
  2. Did you select any specific set of motifs to compare with? Why?
  3. Which algorithm gave you the expected motif?
  4. Try comparing the discovered motifs with motifs from binding interfaces of proteins in footprintDB, these motifs are inferred from a collection of protein-DNA complexes.
Anticipated Results: [Default parameters] [Tunned parameters]

5 - Focusing motif discovery on the summits

The last analysis didn't show the expected results. Probably because the peaks were to long and this can difficult the search. We will now center the search around +/- 50 base pairs of the reported summits.
  1. Go back to the galaxy server.
  2. Upload the summits
  3. We will use the tool Compute under the Text Manipulation menu. We will use this tool twice, one to calculate the start-50 bps and end +50 bps.
  4. FNRSC3

  5. With the tool Cut under the Text Manipulation menu we will select the first,sixth and seventh columns to create a new bed file.
  6. FNRSC4

  7. As done before, use the bed file to obtain the sequences from the genome data file.
Signal in the summit
  1. Now the results show the expected FNR motif.
Anticipated Results: [summit fasta file] [Summit peak-motifs]

Measure the enrichment of your peak for expected motifs

Goal: To identify weather there is enrichment for a set of specific motifs in a collection of peaks.

1 - Motif enrichment in RSAT

matrix-quality is a tool that highlights the enrichment of binding sites in sequence sets obtained from high-throughput ChIP-chip, and ChIP–seq and experiments, to assess enrichment it combines information from theoretical and empirical score distributions. The tool can be found under the Matrix Tools menu in the RSAT web. The following paper describes in detail the algorithm behind this tool.
  • Medina-Rivera, A., Abreu-Goodger, C., Thomas-Chollier, M., Salgado, H., Collado-Vides, J., & Van Helden, J. (2011). Theoretical and empirical quality assessment of transcription factor-binding motifs. Nucleic Acids Research, 39(3), 808–824. [Paper]

2 - Enrichment of liver Transcription factors binding sites in ChIP-seq peak sequences

In a recent paper, Ballester, et al. in press, characterized the binding profile for other three relevant transcription factors in liver: OCT1 (HNF6), FOXA1 and HNF4A. The matrices reported in this paper for mouse can be found here. We will use matrix-qualt to assess enrichment for the four TFs (CEBPa, OCT1, HNF4A and FOXA1).
  1. Define a title for the job. We will use the title:CEBPa motif enrichment in ChIP-seq
  2. Paste the motifs to be used and select transfac format.
  3. Input the peak sequences using the URL.
  4. MQSC1

  5. Permutations are used as negative control, select 2.
  6. Enter your email.
  7. Go!
  8. MQSC1

3 - Understanding enrichment graphs

First we will analyse the enrichment for CEBPa reported motif in the collection of peaks from the ChIP-seq in dog liver.
  1. Decreasing cumulative distribution function (dCDF).
  2. MQ3

  3. Decreasing cumulative distribution function (dCDF) in logy scale. The logarithm scale facilitates observing differences in high scores.
  4. MQ4

  5. Receiver Operating Characteristic (ROC) curves.
  6. MQ5

Is the CEBPa motif enriched?

Anticipated Results: In comparison with the theoretical score distribution with the empirical one using the CEBPA motif shows enrichment for high score values. High scores are more likely to be related to biologically relevant sites.
  • matrix-quality result.
  • Background model

    Goal: Understand the relevance of background model selection and how to create one.

    1 - Creating a background model

    The tool creat-background-model can be used to create a costumized background model from the a set of sequences.

    1. Input the peak sequences in fasta format.
    2. Select the markov order to be used.
    3. Specify an email.

    VSC1