Evaluates and depicts results from plink --missing (missing genotype rates per individual) and plink --het (heterozygosity rates per individual) via run_check_heterozygosity and run_check_missingness or externally conducted check.) Non-systematic failures in genotyping and outlying heterozygosity (hz) rates per individual are often proxies for DNA sample quality. Larger than expected heterozygosity can indicate possible DNA contamination. The mean heterozygosity in PLINK is computed as hz_mean = (N-O)/N, where N: number of non-missing genotypes and O:observed number of homozygous genotypes for a given individual. Mean heterozygosity can differ between populations and SNP genotyping panels. Within a population and genotyping panel, a reduced heterozygosity rate can indicate inbreeding - these individuals will then be returned by check_relatedness as individuals that fail the relatedness filters. evaluate_check_het_and_miss creates a scatter plot with the individuals' missingness rates on x-axis and their heterozygosity rates on the y-axis.

evaluate_check_het_and_miss(
  qcdir,
  name,
  imissTh = 0.03,
  hetTh = 3,
  label_fail = TRUE,
  highlight_samples = NULL,
  highlight_type = c("text", "label", "color", "shape"),
  highlight_text_size = 3,
  highlight_color = "#c51b8a",
  highlight_shape = 17,
  legend_text_size = 5,
  legend_title_size = 7,
  axis_text_size = 5,
  axis_title_size = 7,
  title_size = 9,
  highlight_legend = FALSE,
  interactive = FALSE
)

Arguments

qcdir

[character] path/to/directory/with/QC/results containing name.imiss and name.het results as returned by plink --missing and plink --het.

name

[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam, name.het and name.imiss.

imissTh

[double] Threshold for acceptable missing genotype rate in any individual; has to be proportion between (0,1)

hetTh

[double] Threshold for acceptable deviation from mean heterozygosity in any individual. Expressed as multiples of standard deviation of heterozygosity (het), i.e. individuals outside mean(het) +/- hetTh*sd(het) will be returned as failing heterozygosity check; has to be larger than 0.

label_fail

[logical] Set TRUE, to add fail IDs as text labels in scatter plot.

highlight_samples

[character vector] Vector of sample IIDs to highlight in the plot (p_het_imiss); all highlight_samples IIDs have to be present in the IIDs of the name.fam file.

highlight_type

[character] Type of sample highlight, labeling by IID ("text"/"label") and/or highlighting data points in different "color" and/or "shape". "text" and "label" use ggrepel for minimal overlap of text labels ("text) or label boxes ("label"). Only one of "text" and "label" can be specified.Text/Label size can be specified with highlight_text_size, highlight color with highlight_color, or highlight shape with highlight_shape.

highlight_text_size

[integer] Text/Label size for samples specified to be highlighted (highlight_samples) by "text" or "label" (highlight_type).

highlight_color

[character] Color for samples specified to be highlighted (highlight_samples) by "color" (highlight_type).

highlight_shape

[integer] Shape for samples specified to be highlighted (highlight_samples) by "shape" (highlight_type). Possible shapes and their encoding can be found at: https://ggplot2.tidyverse.org/articles/ggplot2-specs.html#sec:shape-spec

legend_text_size

[integer] Size for legend text.

legend_title_size

[integer] Size for legend title.

axis_text_size

[integer] Size for axis text.

axis_title_size

[integer] Size for axis title.

title_size

[integer] Size for plot title.

highlight_legend

[logical] Should a separate legend for the highlighted samples be provided; only relevant for highlight_type == "color" or highlight_type == "shape".

interactive

[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding/graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_het_imiss) via ggplot2::ggsave(p=p_het_imiss , other_arguments) or pdf(outfile) print(p_het_imiss) dev.off().

Value

named [list] with i) fail_imiss dataframe containing FID (Family ID), IID (Within-family ID), MISS_PHENO (Phenotype missing? (Y/N)), N_MISS (Number of missing genotype call(s), not including obligatory missings), N_GENO ( Number of potentially valid call(s)), F_MISS (Missing call rate) of individuals failing missing genotype check and ii) fail_het dataframe containing FID (Family ID), IID (Within-family ID), O(HOM) (Observed number of homozygotes), E(HOM) (Expected number of homozygotes), N(NM) (Number of non-missing autosomal genotypes), F (Method-of-moments F coefficient estimate) of individuals failing outlying heterozygosity check; iii) p_het_imiss, a ggplot2-object 'containing' a scatter plot with the samples' missingness rates on x-axis and their heterozygosity rates on the y-axis, which can be shown by print(p_het_imiss) and iv) plot_data, a data.frame with the data visualised in p_het_imiss (iii).

Details

All, run_check_heterozygosity, run_check_missingness and evaluate_check_het_and_miss can simply be invoked by check_het_and_miss.

For details on the output data.frame fail_imiss and fail_het, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#imiss and https://www.cog-genomics.org/plink/1.9/formats#het

Examples

qcdir <- system.file("extdata", package="plinkQC") name <- "data" if (FALSE) { fail_het_miss <- evaluate_check_het_and_miss(qcdir=qcdir, name=name, interactive=FALSE) #' # highlight samples highlight_samples <- read.table(system.file("extdata", "keep_individuals", package="plinkQC")) fail_het_miss <- evaluate_check_het_and_miss(qcdir=qcdir, name=name, interactive=FALSE, highlight_samples = highlight_samples[,2], highlight_type = c("text", "color")) }