Commit 51306200 authored by Pradat Yoann

[README] format and remote git url

parent 823a9d8c
......@@ -7,14 +7,22 @@ The tool is divided in 3 steps
## 1. What does the tool do?
VEP annotates variants with information from multiple external databases and can be configured to meet many specific
needs. For more details, see [VEP's options
page](https://www.ensembl.org/info/docs/tools/vep/script/vep_options.html). However, VEP does not extract information
such as the number of reads or the somatic status from the VCF file. vcf2maf is supposed to perform these tasks but
failed to do so on many example VCFs and does not always extract all the relevant information. For this reason, manual
parsing was implemented in Python and tested on TCGA VCF files from the legacy archive portal (see examples) and on
other VCFs. See the table for exhaustive details.
### 1.1. Manual parsing
Manual parsing relies on tags specified by the user to extract relevant information such as genotype (GT), somatic
status (SS), quality and filter info (QUAL, INFO) and, most importantly, read information (AD, DP, FA, DP4, TAR, TIR).
The parser has been tested on VCF files produced by
- Mutect v.1 (TCGA GA SNV) and Strelka (TCGA GA Indel)
- sets of callers VarScanSomatic-Strelka-Sniper-Samtools (TCGA HS SNP), GatkSomaticIndel-Pindel-Strelka-VarScanSomatic
(TCGA HS Indel).
- Mutect v.1.1.7 with no header
- Strelka v.2.9.2
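
The tag-based extraction described above can be illustrated with a minimal sketch. This is not the repository's parser: the function names `parse_sample_fields` and `extract_read_counts` are hypothetical, and the sketch assumes a standard tab-separated VCF data line whose sample columns carry `GT`, `AD` and `DP` tags. Callers that report reads through other tags (FA, DP4, TAR, TIR) would need their own handling.

```python
def parse_sample_fields(format_col, sample_col):
    """Map FORMAT keys (e.g. GT:AD:DP) to the values of one sample column."""
    keys = format_col.split(":")
    values = sample_col.split(":")
    return dict(zip(keys, values))


def extract_read_counts(record_line, sample_index=0):
    """Return genotype, depth and ref/alt read counts from one VCF data line.

    Hypothetical helper: assumes the AD tag lists the reference count first,
    followed by the alternate counts.
    """
    fields = record_line.rstrip("\n").split("\t")
    # Columns 0-7 are fixed, column 8 is FORMAT, samples start at column 9.
    sample = parse_sample_fields(fields[8], fields[9 + sample_index])

    result = {"GT": sample.get("GT"), "DP": None, "ref_reads": None, "alt_reads": None}
    if sample.get("DP", ".") != ".":
        result["DP"] = int(sample["DP"])
    if sample.get("AD", ".") != ".":
        counts = [int(x) for x in sample["AD"].split(",") if x != "."]
        if len(counts) >= 2:
            result["ref_reads"], result["alt_reads"] = counts[0], counts[1]
    return result


# Toy record, made up for illustration only.
line = "1\t12345\t.\tA\tT\t50\tPASS\tSOMATIC\tGT:AD:DP\t0/1:12,8:20"
info = extract_read_counts(line)
print(info)
```

Looking tags up by name in the FORMAT column, rather than by position, is what lets one parser cope with callers that order or omit tags differently.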
......@@ -36,7 +44,6 @@ vcf2maf also runs VEP internally but performs extra work to build some annotatio
- *HGVS_c*
- *HGVS_Short*
- *all_effects*
that are not available from VEP's output.
### 1.3. VEP
......@@ -48,7 +55,7 @@ Run the VEP annotator on the VCF file from a specific set of options. The option
Clone the github repository using
```
git clone --recurse-submodules https://github.com/Durzot/BT_variant_annotator
git clone --recurse-submodules [path/to/this/repo]
```
or, if you forgot the `--recurse-submodules` option, run `git submodule update --init` after cloning.
......@@ -61,11 +68,20 @@ Go to the ensembl-vep/ folder in tools/ and run
perl INSTALL.pl
```
The Perl script INSTALL.pl may fail with errors such as missing dependencies. For instance, the error
`Bio::Root::Version is not installed` may be solved by running `sudo cpanm Bio::Root::Version`. You may have to install
more than one such library. Refer to the GitHub repository for details.
The Perl installation script offers to install cache files (the most efficient way to use VEP) and FASTA files (used to
retrieve sequence data for HGVS notations) into `$HOME/.vep`. You may also install plugins for additional analyses.
Download the cache files for the Homo sapiens genome 100_GRCh37 (or newer). The total download size is about 12 GB, so a
stable and fast connection is required. The FASTA file Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz requires about
3.0 GB of storage once uncompressed (**DO NOT FORGET** to uncompress this file or VEP will fail).
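
The decompression step can be done with any gzip tool. As a minimal sketch in Python, using only the standard library: the helper name `gunzip_file` is hypothetical, and the path in the usage comment is a placeholder, since the exact location of the FASTA file under `$HOME/.vep` depends on your installation.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(gz_path, keep_original=True):
    """Decompress `name.gz` next to itself as `name` and return the new path."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        raise ValueError(f"expected a .gz file, got {gz_path.name}")

    out_path = gz_path.with_suffix("")  # drops the trailing ".gz" only
    # Stream the data in chunks so the multi-GB FASTA is never held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    if not keep_original:
        gz_path.unlink()
    return out_path


# Usage (placeholder path, adjust to where INSTALL.pl stored the file):
# gunzip_file("/path/to/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz")
```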
As detailed on this [page](https://m.ensembl.org/info/docs/tools/vep/script/vep_custom.html), the latest Ensembl release
does not always use the latest release of other databases. If you want the latest release of, say, ClinVar, you should
download all the relevant files and specify the paths to these files, along with the field names, when calling the vep
command. For instance, you can get the latest ClinVar VCF files from their FTP site with
```
# Compressed VCF file
......@@ -82,7 +98,8 @@ and then add the following to the vep command
### 2.2 Install vcf2maf
VEP is required by vcf2maf but you also need the commands from `samtools` and `htslib` available at
[http://www.htslib.org/download/](http://www.htslib.org/download/). Do the following
```
cd samtools-1.x
......@@ -101,12 +118,16 @@ ln -s /where/to/install/bin/tabix /usr/local/bin
ln -s /where/to/install/bin/bgzip /usr/local/bin
```
You may replace `/usr/local/bin` with whichever directory you usually keep binaries in. Finish the installation of
vcf2maf by following the instructions given in the [GitHub repository](https://github.com/mskcc/vcf2maf).
### 2.3 Example
The main function for annotating a VCF is `run_annotator` in the `main` module. See `run_example_tcga_GA.py` or
`run_example_tcga_HS` for examples of how to run the tool, and the `run_annotator` documentation for more details about
the options.
## 3. References
McLaren, W., Gil, L., Hunt, S.E. et al. The Ensembl Variant Effect Predictor. Genome Biol 17, 122 (2016).
[https://doi.org/10.1186/s13059-016-0974-4](https://doi.org/10.1186/s13059-016-0974-4).