Introduction

Goal
The aim is to:
  • get familiar with remote programmatic access using SOAP Web services
  • use a client script in Python to query the RSAT Web Services (WS)
In practice:
  • extract sequences corresponding to peaks (fetch-sequences)
  • scan the peak sequences with a discovered motif (matrix-scan)

Install the Python modules for SOAP

Goal: prepare the Python environment to access remote SOAP Web services

1 - On Mac/Linux

We will first install pip as follows:
sudo easy_install pip
Alternative: Linux users may prefer
apt-get install python-pip

Now we will install suds via pip:
sudo pip install suds
Alternative: Linux users may prefer
apt-get install python-suds

2 - On Windows

  1. Install pip with these instructions. Note from Stack Overflow: pip.exe will be placed inside your Python installation's Scripts folder, which is likely not on your path (fix that by running C:\PythonXX\Tools\Scripts\win_add2path.py). Download get-pip.py, being careful to save it as a .py file rather than .txt.
  2. Now we will install suds via pip:
     pip install suds
At this point, you are ready to access remote SOAP Web services, including the RSAT WS.
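
To confirm that the installation worked, you can check that Python is able to import the module. The snippet below is only a minimal sketch: the helper name suds_available is chosen here for illustration and is not part of suds itself.

```python
def suds_available():
    """Return True if the suds module can be imported, False otherwise."""
    try:
        import suds  # imported only to test that it is installed
    except ImportError:
        return False
    return True


if suds_available():
    print("suds is installed: ready to query SOAP Web services")
else:
    print("suds is missing: run 'pip install suds' first")
```

If the second message is printed, repeat the installation step for your platform before going further.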

Retrieving sequences from your peaks

Goal: given a set of peaks from a ChIP-seq experiment in BED format, retrieve the sequences corresponding to those coordinates from the genome, in FASTA format.

1 - Example dataset1: CEBPa binding regions in dog liver

Schmidt, Wilson and Ballester published a ChIP-seq experiment on liver tissue to identify binding regions for the transcription factor CEBPa (PMID:20378774) in five different species (human, mouse, dog, short-tailed opossum and chicken). This data set is publicly available through ArrayExpress (E-TABM-722).
As done by the authors, CEBPa binding regions (peaks) were called with SWEMBL (parameter R=0.05), using the merged reads from two biological replicates and their corresponding input controls. For this tutorial we will analyze the CEBPa binding pattern in dog; the peaks can be downloaded from here.
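
Before sending the peaks to the server, it can help to check that the downloaded file really is in BED format: one region per line, with the chromosome, the start (0-based) and the end in the first three columns. The sketch below parses such lines. The function name parse_bed and the two sample lines are invented for illustration; they are not taken from the CEBPa data set.

```python
def parse_bed(lines):
    """Return a list of (chrom, start, end) tuples from BED-formatted lines."""
    regions = []
    for line in lines:
        line = line.strip()
        # Skip empty lines and the optional header/comment lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        # The first three BED columns are mandatory: chrom, start, end
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


# Invented example lines (NOT real peaks)
sample = ["chr1\t1000\t1250\tpeak_1", "chr2\t500\t900\tpeak_2"]
regions = parse_bed(sample)
total = sum(end - start for _, start, end in regions)
print("%d regions, %d bp in total" % (len(regions), total))
```

To apply it to the real peaks, pass an open file handle instead of the sample list, for example parse_bed(open(path)) with path adapted to where you saved the file.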

2 - Fetch sequences from a bed file

  1. Save this Python script fetch-sequences_soap.py on your computer
  2. Run the script a first time to get the help by typing
    python fetch-sequences_soap.py -h

    Which files do you need, according to the help of the program?

  3. Now run the script to retrieve the sequences corresponding to the CEBPa peak coordinates. Save the peak file on your machine and adapt the file path in the command as needed for your computer. The genome is the dog CanFam2 assembly.
    python fetch-sequences_soap.py -b do61+do79_cfam_CEBPA_liver.bed.SWEMBL.3.3.bed -g canFam2

  4. You should obtain the results from fetch-sequences. Note that this analysis did not run on your computer, but used the computing power of the remote RSAT server.
At this point, you're able to query an RSAT WS, and programmatically extract sequences corresponding to a given set of genomic coordinates.
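
A quick way to check the retrieved sequences is to count the FASTA records and their lengths, and compare them with the number and size of the peaks you submitted. The sketch below does this on a small invented example; the function name fasta_lengths is hypothetical, and it assumes you saved the fetch-sequences output to a file that you then read line by line.

```python
def fasta_lengths(lines):
    """Return a dict mapping each sequence ID to its sequence length."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Header line: the ID is the first word after '>'
            words = line[1:].split()
            current = words[0] if words else ""
            lengths[current] = 0
        elif line and current is not None:
            # Sequence line: add its length to the current record
            lengths[current] += len(line)
    return lengths


# Invented example records (NOT real peak sequences)
sample = [">seq_a some description", "ACGT", "AC", ">seq_b", "GGG"]
lengths = fasta_lengths(sample)
print("%d sequences, %d bp in total" % (len(lengths), sum(lengths.values())))
```

On real data, replace the sample list with an open file handle on your saved FASTA file.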

Scanning peak sequences with a discovered motif

Goal: given peak sequences in FASTA format and a motif represented by a matrix in TRANSFAC format, obtain the locations of putative transcription factor binding sites (TFBS).

1 - Run matrix-scan as a Web service

  1. Save this Python script matrix_scan_soap.py on your computer
  2. Run the script a first time to get the help by typing
    python matrix_scan_soap.py -h

    Which files do you need, according to the help of the program?

  3. Now run the script to scan the CEBPa peak sequences with the first discovered motif (session1, oligos_6nt_mkv4_m1). Adapt the file paths in the command as needed for your computer.
    python matrix_scan_soap.py -s 1_do61+do79_cfam_CEBPA_liver.SWEMBL.3.3_peaks.fasta -m peak-motifs_oligos_6nt_mkv4_m1.tf -u 0.0001 > cebpa_oligos_6nt_mkv4_m1_10-4.ft

  4. You should obtain the results from matrix-scan within the file cebpa_oligos_6nt_mkv4_m1_10-4.ft. Note that this analysis did not run on your computer, but used the computing power of the remote RSAT server.
At this point, you should know what a Web service is and be able to run a pre-made client script in Python.
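
To get a first overview of the scan results, you can count how many features were reported for each peak sequence. The sketch below makes several assumptions that you should check against your own .ft file: that it is tab-delimited, that comment and header lines start with ';' or '#', and that the first column holds the sequence identifier. The function name count_sites and the sample rows are invented for illustration.

```python
from collections import Counter


def count_sites(lines):
    """Count the rows reported for each sequence ID (first column)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: ';' and '#' mark comment and header lines
        if not line.strip() or line.startswith((";", "#")):
            continue
        seq_id = line.split("\t")[0]
        counts[seq_id] += 1
    return counts


# Invented example rows (NOT real matrix-scan output)
sample = [
    "; comment line",
    "#seq_id\tft_type\tft_name",
    "peak_1\tsite\tm1",
    "peak_1\tsite\tm1",
    "peak_2\tsite\tm1",
]
counts = count_sites(sample)
for seq_id, n in sorted(counts.items()):
    print("%s\t%d" % (seq_id, n))
```

On the real results, pass an open file handle on cebpa_oligos_6nt_mkv4_m1_10-4.ft instead of the sample list.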