Chapter 1 Generate additional GVCF

This chapter explains how to generate a GVCF for every additionnal isolate to include in the final variant matrix.

Required softwares:

  • bwa-mem2 (bwa mem can be used, but is slower)
  • samtools
  • gatk

Commands were successfully run with bwa-mem2 v2.2.1, samtools v1.16.1 and gatk v4.5.0.0.

1.1 Short reads mapping

Short reads mapping have to be done with bwa-mem2 mem (or bwa mem) default parameters. The output of bwa can be piped with samtools sort to output a sorted bam. The -T option allows to give a specific name to temporary files created for the sorting, which allows to sort multiple bam in parallel in the same directory without any name conflict.

# Create reference index
bwa-mem2 index reference.fasta

# Run mapping
bwa-mem2 mem -t 20 reference.fasta IlluminaReads_1.fq.gz IlluminaReads_2.fq.gz | samtools sort -o Mapping.bam -T reads.Prefix.tmp

Once the BAM file is generated, one can add ReadGroups (required for some downstream analyses) with gatk AddOrReplaceReadGroups.

# Add read groups
gatk AddOrReplaceReadGroups -I Mapping.bam -O Mapping.RD.bam --RGID SampleName --RGLB SampleName --RGPL ILLUMINA --RGPU SampleName --RGSM SampleName

Finally, the bam file has to be indexed with samtools index.

# Index BAM
samtools index Mapping.RD.bam

1.2 Variant calling

HaplotypeCaller is used to call variants and generate a GVCF file from the bam file.

# Index reference genome
samtools faidx reference.fasta
gatk CreateSequenceDictionary -R reference.fasta

# Index bam file
samtools index Mapping.RD.bam

# Run HaplotypeCaller
gatk HaplotypeCaller -R reference.fasta -I Mapping.RD.bam -O Prefix.g.vcf.gz --emit-ref-confidence GVCF

For sample with really high sequencing depth, variant calling can be performed on each chromosome and subsequently merged into a single GVCF file.

# Run HaplotypeCaller per chromosome
gatk HaplotypeCaller -R reference.fasta -I Mapping.RD.bam -O Prefix.chromosome1.g.vcf.gz --emit-ref-confidence GVCF -L chromosome1
gatk HaplotypeCaller -R reference.fasta -I Mapping.RD.bam -O Prefix.chromosome2.g.vcf.gz --emit-ref-confidence GVCF -L chromosome2
gatk HaplotypeCaller -R reference.fasta -I Mapping.RD.bam -O Prefix.chromosome3.g.vcf.gz --emit-ref-confidence GVCF -L chromosome3
...

gatk MergeVcfs -I Prefix.chromosome1.g.vcf.gz -I Prefix.chromosome2.g.vcf.gz -I Prefix.chromosome3.g.vcf.gz ... -O Prefix.g.vcf.gz

Finally, tab-index all newly generated GVCF.

# Index GVCF
gatk IndexFeatureFile -I Prefix.g.vcf.gz