Chapter 3 Variant filtering
This chapter contains bcftools commands to filter multi-sample VCF files to obtain high-quality SNPs and InDels.
Required software:
- bcftools
Commands were successfully run with bcftools v1.16.1.
3.1 Quality filtering
Quality filtering is the first step to process. It includes depth filtering, genotype quality or any other filters based on quality metrics (QUAL, allele balance, etc). Loci that do not pass the thresholds must be set to missing. The allele count for each loci must then be updated bcftools +fill-tags
.
# Depth filtering 10X
bcftools +setGT $PREFIX.vcf.gz -- -t q -n . -e 'FMT/DP>=10' | \
bcftools +fill-tags | \
bcftools view -Oz -o $PREFIX.DP10.vcf.gz
# Genotype quality filtering 20
bcftools +setGT $PREFIX.DP10.vcf.gz -- -t q -n . -e 'FMT/GQ>=20' | \
bcftools +fill-tags | \
bcftools view -Oz -o $PREFIX.DP10.GQ20.vcf.gz
3.2 Missing genotype filtering
Multiple filters can be applied on missing genotypes. First we can filter out samples having less than 20% of informed loci. +fill-tags
must be used since the number of samples changes.
# Get for each sample the number of loci with missing genotype
bcftools stats -s - $PREFIX.DP10.GQ20.vcf.gz | grep -E ^PSC | cut -f3,14 > $PREFIX.DP10.GQ20.imiss
# Get the total number of sites in the VCF
nSites=$(bcftools +counts $PREFIX.DP10.GQ20.vcf.gz | grep "Number of sites" | rev | cut -d " " -f 1 | rev)
# List samples with less than 20% missing genotypes
awk -v nSites=$nSites '{if ($2 / nSites <= 0.2) print $1}' $PREFIX.DP10.GQ20.imiss > Samples.Mind20
# Filter samples
bcftools view --samples-file Samples.Mind20 $PREFIX.DP10.GQ20.vcf.gz | \
bcftools +fill-tags | \
bcftools view -Oz -o $PREFIX.DP10.GQ20.Mind20.vcf.gz
Then we can filter loci that are not informed for enough samples (ex: filter out loci that are informed for less than 99% of the samples).
3.3 Filter on excess of heterozygosity
Regions duplicated in some samples but in a single copy in the reference genome will often result in excess of heterozygosity. These regions can therefore be filtered with the ExcHet field, which tests for excess of heterozygosity and is calculated with fill-tags. This filter is usefull to avoid artificial increase in diversity caused by duplicated regions.
3.4 Split SNPs and InDels
In raw VCFs, SNPs and InDels can be hard to discriminate because they can be located at the same loci (called mixed loci). Some raw VCFs also contains MNVs (= Multiple Nucleotide Variants), that can be broken down in multiple consecutive SNPs. This makes the dicrimination between SNPs and InDels very complex. Since those mixed loci are hard to analyse, we filter them out. The raw VCF will be split in 2 VCF: one containing only strict SNPs (no InDel present at these loci in the population), one containing only strict InDels (no SNP present at these loci in the population). Non variant positions, referred to as REF
type will be kept.
For SNPs, all sites with *
as genotype (= position is overlapping a deletion) or not SNP
or REF
as type will be excluded with the command 'ALT="*" || (type!="snp" && type!="ref")'
. Same for InDels, except *
genotypes are kept: '(type!="indel" && type!="ref")'
.
# Keep only SNPs
bcftools view -e 'ALT="*" || (type!="snp" && type!="ref")' $PREFIX.DP10.GQ20.Mind20.99pNonMiss.ExcHet99.vcf.gz -Oz -o $PREFIX.DP10.GQ20.Mind20.99pNonMiss.ExcHet99.SNPs.vcf.gz
# Keep only InDels
bcftools view -e '(type!="indel" && type!="ref")' $PREFIX.DP10.GQ20.Mind20.99pNonMiss.ExcHet99.vcf.gz -Oz -o $PREFIX.DP10.GQ20.Mind20.99pNonMiss.ExcHet99.InDels.vcf.gz