Chapter 1 Generate additional GVCF
This chapter explains how to generate a GVCF for every additionnal isolate to include in the final variant matrix.
Required softwares:
- bwa-mem2 (bwa mem can be used, but is slower)
- samtools
- gatk
Commands were successfully run with bwa-mem2 v2.2.1, samtools v1.16.1 and gatk v4.5.0.0.
1.1 Short reads mapping
Short reads mapping have to be done with bwa-mem2 mem
(or bwa mem
) default parameters. The output of bwa can be piped with samtools sort
to output a sorted bam. The -T
option allows to give a specific name to temporary files created for the sorting, which allows to sort multiple bam in parallel in the same directory without any name conflict.
# Create reference index
bwa-mem2 index reference.fasta
# Run mapping
bwa-mem2 mem -t 20 reference.fasta IlluminaReads_1.fq.gz IlluminaReads_2.fq.gz | samtools sort -o Mapping.bam -T reads.Prefix.tmp
Once the BAM file is generated, one can add ReadGroups (required for some downstream analyses) with gatk AddOrReplaceReadGroups
.
# Add read groups
gatk AddOrReplaceReadGroups -I Mapping.bam -O Mapping.RD.bam --RGID SampleName --RGLB SampleName --RGPL ILLUMINA --RGPU SampleName --RGSM SampleName
Finally, the bam file has to be indexed with samtools index
.
1.2 Variant calling
HaplotypeCaller is used to call variants and generate a GVCF file from the bam file.
# Index reference genome
samtools faidx reference.fasta
gatk CreateSequenceDictionary -R reference.fasta
# Index bam file
samtools index Mapping.RD.bam
# Run HaplotypeCaller
gatk HaplotypeCaller -R reference.fasta -I Mapping.RD.bam -O Prefix.g.vcf.gz --emit-ref-confidence GVCF
For sample with really high sequencing depth, variant calling can be performed on each chromosome and subsequently merged into a single GVCF file.
# Run HaplotypeCaller per chromosome
gatk HaplotypeCaller -R reference.fasta -I Mapping.RD.bam -O Prefix.chromosome1.g.vcf.gz --emit-ref-confidence GVCF -L chromosome1
gatk HaplotypeCaller -R reference.fasta -I Mapping.RD.bam -O Prefix.chromosome2.g.vcf.gz --emit-ref-confidence GVCF -L chromosome2
gatk HaplotypeCaller -R reference.fasta -I Mapping.RD.bam -O Prefix.chromosome3.g.vcf.gz --emit-ref-confidence GVCF -L chromosome3
...
gatk MergeVcfs -I Prefix.chromosome1.g.vcf.gz -I Prefix.chromosome2.g.vcf.gz -I Prefix.chromosome3.g.vcf.gz ... -O Prefix.g.vcf.gz
Finally, tab-index all newly generated GVCF.