Welcome to WGBSSuite Pages.
We have developed WGBSSuite to allow anyone to assess which WGBS analysis package is the best fit for their data. We provide a tool that analyses your data so that data of the same type can be simulated. The simulated data can then be used to benchmark existing analysis methods, either automatically (BSmooth, MethylKit and MethylSig) or independently on any existing package. Doing this will help you identify the best package to use for your downstream analysis, and hence increase the overall quality of your analysis.
The simulator works in three stages: (A) First, the locations of the CpGs are simulated using a two-state hidden Markov model. (B) Next, the methylation status of each CpG is simulated using a modulated hidden Markov model, which ensures that CpG sites that are close together are more likely to behave in a coordinated way. (C) Finally, the methylation profile is calculated, producing a simulated number of methylated and de-methylated reads at each CpG. The statistical framework for this simulation can be switched between binomial and negative-binomial, depending on which distribution you think better fits your data.
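The three stages above can be illustrated with a small, self-contained Python sketch. It is not the WGBSSuite implementation: the function names, the state labels, the transition probability, the gap means and the simple distance-based coupling between neighbouring sites are all assumptions made for illustration. Only the binomial read model is shown, with a fixed coverage per site as a simplification. The default probabilities and coverage are borrowed from the example command further down this page.

```python
import math
import random


def simulate_cpg_locations(n_sites, rng, stay_prob=0.9, mean_gap=(20.0, 200.0)):
    """(A) Two-state chain: state 0 = CpG-dense, state 1 = CpG-sparse (assumed labels).

    Each hidden state emits an exponentially distributed gap to the next CpG.
    """
    state, position, locations = 0, 0, []
    for _ in range(n_sites):
        if rng.random() > stay_prob:      # occasionally switch hidden state
            state = 1 - state
        gap = 1 + int(rng.expovariate(1.0 / mean_gap[state]))
        position += gap
        locations.append(position)
    return locations


def simulate_methylation_status(locations, rng, length_scale=100.0):
    """(B) Distance-modulated chain: nearby CpGs tend to share a status."""
    status = [rng.random() < 0.5]
    for prev, cur in zip(locations, locations[1:]):
        # Chance of copying the neighbour's status decays from 1 towards 0.5
        # as the distance between the two sites grows.
        keep = 0.5 + 0.5 * math.exp(-(cur - prev) / length_scale)
        status.append(status[-1] if rng.random() < keep else not status[-1])
    return status


def simulate_read_counts(status, rng, p_meth=0.9203, p_demeth=0.076, coverage=29):
    """(C) Binomial read counts per CpG as (methylated, de-methylated)."""
    counts = []
    for methylated in status:
        p = p_meth if methylated else p_demeth
        n_meth = sum(rng.random() < p for _ in range(coverage))
        counts.append((n_meth, coverage - n_meth))
    return counts


rng = random.Random(1)                    # fixed seed for reproducibility
locations = simulate_cpg_locations(1000, rng)
status = simulate_methylation_status(locations, rng)
counts = simulate_read_counts(status, rng)
```

Swapping the binomial draw in stage (C) for a negative-binomial draw would be the corresponding change for over-dispersed data.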
The benchmarking currently uses MethylKit, BSmooth, MethylSig and the Fisher Exact Test. We will be working to incorporate more in the future; however, it is very easy to simulate data and use it to test any analysis software of your choice. The benchmarking that we have carried out so far shows that performance is very context dependent, with packages performing very differently depending on coverage and on the distribution assumption. As an example, the following is an ROC analysis based on binomially (D) and negative-binomially (E) distributed methylation counts.
Install R: You will need to install the correct version of R for your operating system. To do this, visit the R mirror that is closest to your location from: http://www.r-project.org/
Install BSmooth: Download and install the code for bsseq from here: http://rafalab.jhsph.edu/bsmooth/
Install MethylSig: Download and install the code from here: http://sartorlab.ccmb.med.umich.edu/node/17
Install MethylKit: Download and install the code from here: https://code.google.com/p/methylkit/
Installing the software
Clone the code into a directory for which you have read/write permissions, e.g. /home/USERNAME/bin/WGBSSuite/
git clone https://github.com/SystemsGeneticsSG/WGBSSuite.git
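If this directory is not already in your PATH then you can add it using the following command (shown here for a bash-like shell and the example directory above):
export PATH=$PATH:/home/USERNAME/bin/WGBSSuite/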
You will need to have your data in the correct format (see below), then follow these steps:
Analysing your data: From the command line simply type:
This will run the analysis script in interactive mode. For more details on how to run this script, read the advanced guide below. The results of this script will be written to a folder that you select at runtime, and include graphs and statistics about the dataset that can be used to parametrise the simulation. The result will be a set of parameters for the simulation as well as a summary document of the input data, e.g. analysis_of_real_data.pdf. This document can be used to visualise the properties of a dataset, for example the distribution of read counts or of the methylated proportion.
Simulating data: From the command line simply type:
Rscript simulate_WGBS.R interactive
This will run the simulation script in interactive mode. For more details on how to run this script, read the advanced guide below. The results of this script will be written to a folder that you select at runtime, and include graphs and the raw data, which can be used for the benchmarking step or within R to test any WGBS software.
Benchmarking the tools: From the command line simply type:
Rscript benchmark_WGBS.R interactive
This will run the benchmark script in interactive mode. For more details on how to run this script, read the advanced guide below. One of the options you will be required to enter is the filename of the data you simulated in the previous step. This will produce an ROC, AUC and runtime plot for the dataset.
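To make the ROC and AUC outputs concrete, here is a minimal Python sketch of how such a curve could be computed from the simulated differentially-methylated flag and the per-CpG p-values reported by a package. The function names and the toy values are hypothetical, tied p-values are not treated specially, and this is not the code used by benchmark_WGBS.R.

```python
def roc_points(truth, p_values):
    """Return (false positive rate, true positive rate) pairs.

    truth    -- 1 if the CpG was simulated as differentially methylated, else 0
    p_values -- p-value reported for each CpG; smaller means stronger evidence
    """
    n_pos = sum(truth)
    n_neg = len(truth) - n_pos
    ranked = sorted(zip(p_values, truth))   # most significant sites first
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_dm in ranked:
        if is_dm:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example: six CpGs, three of them truly differentially methylated.
truth = [1, 1, 0, 1, 0, 0]
p_values = [0.001, 0.02, 0.03, 0.2, 0.5, 0.9]
curve = roc_points(truth, p_values)
area = auc(curve)   # 8/9 for this toy input
```

An AUC close to 1 means the package ranks the truly differentially methylated sites ahead of the others; a value near 0.5 is no better than chance.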
Simulating and benchmarking the data in non-interactive mode: This runs the simulation and benchmarking in a loop and produces an averaged ROC, AUC and runtime analysis. This approach should be used to get the most accurate idea of the performance of a package. The result will be a set of simulated data, all stored in /tmp/myWGBSanalysis, together with the individual and the averaged analysis plots.
Rscript simulate_WGBS.R multi 5000 0.9203 0.076 0.1 0.1 29 29 3 2 0.1 0.5 0.019,0.002 /tmp/myWGBSanalysis binomial 10
- Arg1: multi – This indicates that you wish to perform repeats of the same settings in order to produce an ROC/AUC analysis.
- Arg2: 5000 – This is the number of CpG sites to simulate
- Arg3: 0.9203 – This is the probability of success in the methylated region. This can be approximated using the analysis script as explained above.
- Arg4: 0.076 – This is the probability of success in the de-methylated region. This can be approximated using the analysis script as explained above.
- Arg5: 0.1 – This is the size of the error in the methylated region. This can be approximated using the analysis script as explained above.
- Arg6: 0.1 – This is the size of the error in the de-methylated region. This can be approximated using the analysis script as explained above.
- Arg7: 29 – This is the average number of reads in the methylated region. This can be approximated using the analysis script as explained above.
- Arg8: 29 – This is the average number of reads in the de-methylated region. This can be approximated using the analysis script as explained above.
- Arg9: 3 – This is the number of replicates to be simulated.
- Arg10: 2 – This is the number of samples to be simulated.
- Arg11: 0.1 – This is the phase difference in the differentially methylated regions.
- Arg12: 0.5 – This is the balance of hypo/hyper methylation.
- Arg13: 0.019,0.002 – This is the matrix of values that describes the exponential decay functions that define the distances between CpG sites.
- Arg14: /tmp/myWGBSanalysis – This is the location and prefix of the files that will be written out.
- Arg15: binomial – This can be either binomial or truncated and defines the distribution of methylated reads at each CpG.
- Arg16: 10 – The number of repeats to run for the ROC analysis.
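Because the sixteen arguments are positional, it is easy to misplace one. The Python sketch below keeps them in a named mapping and assembles the command line in the documented order. The key names are hypothetical labels chosen here for readability; simulate_WGBS.R itself only sees the positional values. Values are kept as strings so that numbers such as 0.9203 are passed through unchanged.

```python
import shlex

# Hypothetical descriptive names mapped to the documented values (Arg1..Arg16).
SIMULATION_ARGS = {
    "mode": "multi",
    "n_cpg_sites": "5000",
    "prob_methylated": "0.9203",
    "prob_demethylated": "0.076",
    "error_methylated": "0.1",
    "error_demethylated": "0.1",
    "mean_reads_methylated": "29",
    "mean_reads_demethylated": "29",
    "n_replicates": "3",
    "n_samples": "2",
    "phase_difference": "0.1",
    "hypo_hyper_balance": "0.5",
    "cpg_distance_decay": "0.019,0.002",
    "output_prefix": "/tmp/myWGBSanalysis",
    "distribution": "binomial",
    "n_repeats": "10",
}


def build_command(args, script="simulate_WGBS.R"):
    """Return the argument vector for Rscript, preserving insertion order."""
    return ["Rscript", script, *args.values()]


command = build_command(SIMULATION_ARGS)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` from the directory that contains the script.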
Using the simulated data on other packages: The following command will create a file that can be used to test any WGBS package:
Rscript simulate_WGBS.R 5000 0.9203 0.076 0.1 0.1 29 29 3 2 0.1 0.5 0.019,0.002 /tmp/myWGBSanalysis binomial
Once this has executed you will find the simulated data in the file “/tmp/myWGBSanalysis_5000_1_0.65_0.2_0.35_0.8_0.9_0.1_0.1_0.9_29_29_0.9203_0.076_0.1_0.1_0_0.1_0.5_.txt”. The data in this file has the following format:
- V1: location in base pairs
- V2: differentially methylated flag
These are followed by a block of 4 columns for each replicate, as follows:
- Vn: Number of methylated reads
- Vn+1: Number of de-methylated reads
- Vn+2: Total number of reads
- Vn+3: Proportion of methylated vs de-methylated reads
This information is sufficient to test any differential methylation software.
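As an illustration of this layout, the following Python sketch splits each row into its location, flag and per-replicate blocks. It assumes the file is whitespace-delimited, has no header row and encodes the flag as 0/1; these assumptions should be checked against a real output file. The function name and the toy rows are hypothetical.

```python
import io


def parse_simulated_rows(handle):
    """Yield (location, is_dm, replicates) for each CpG row.

    replicates is a list of dicts, one per block of four columns.
    """
    for line in handle:
        fields = line.split()
        if not fields:
            continue                        # skip blank lines
        location = int(float(fields[0]))    # V1: location in base pairs
        is_dm = int(float(fields[1]))       # V2: differentially methylated flag
        values = [float(x) for x in fields[2:]]
        if len(values) % 4 != 0:
            raise ValueError("expected blocks of 4 columns per replicate")
        replicates = [
            {
                "methylated": values[i],
                "demethylated": values[i + 1],
                "total": values[i + 2],
                "proportion": values[i + 3],
            }
            for i in range(0, len(values), 4)
        ]
        yield location, is_dm, replicates


# Toy rows with two replicate blocks each (made-up numbers, not real output).
example = io.StringIO(
    "105 0 27 2 29 0.931 25 4 29 0.862\n"
    "131 1 3 26 29 0.103 26 3 29 0.897\n"
)
rows = list(parse_simulated_rows(example))
```

For a real file, replace the `io.StringIO` object with `open(path)` on the simulated output.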
Authors and Contributors
Support or Contact
Please email me at firstname.lastname@example.org