perMarkerQC checks the markers in the PLINK dataset for their missingness rates across samples, their deviation from Hardy-Weinberg equilibrium (HWE) and their minor allele frequencies (MAF). By default, it assumes that the IDs of individuals that failed perIndividualQC have been written to qcdir/name.fail.IDs and removes these individuals when computing missingness rates, HWE p-values and MAF. If qcdir/name.fail.IDs does not exist, a message is written to stdout and the analyses continue for all samples in the name.fam/name.bed/name.bim dataset. The function plots i) SNP missingness rates (stratified by minor allele frequency) as histograms, ii) p-values of the HWE exact test (for all p-values and for low p-values only) as histograms and iii) the minor allele frequency distribution as a histogram.

perMarkerQC(
  indir,
  qcdir = indir,
  name,
  do.check_snp_missingness = TRUE,
  lmissTh = 0.01,
  do.check_hwe = TRUE,
  hweTh = 1e-05,
  do.check_maf = TRUE,
  macTh = 20,
  mafTh = NULL,
  interactive = FALSE,
  verbose = TRUE,
  keep_individuals = NULL,
  remove_individuals = NULL,
  exclude_markers = NULL,
  extract_markers = NULL,
  legend_text_size = 5,
  legend_title_size = 7,
  axis_text_size = 5,
  axis_title_size = 7,
  title_size = 9,
  subplot_label_size = 9,
  path2plink = NULL,
  showPlinkOutput = TRUE
)

Arguments

indir

[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed and name.fam.

qcdir

[character] /path/to/directory where results will be written to. If perIndividualQC was conducted, this directory should be the same as the qcdir specified in perIndividualQC, i.e. it contains name.fail.IDs with the IIDs of individuals that failed QC. The user needs write permission for qcdir. By default, qcdir=indir.

name

[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam.

do.check_snp_missingness

[logical] If TRUE, run check_snp_missingness.

lmissTh

[double] Threshold for acceptable variant missing rate across samples.

do.check_hwe

[logical] If TRUE, run check_hwe.

hweTh

[double] Significance threshold for deviation from HWE.

do.check_maf

[logical] If TRUE, run check_maf.

macTh

[double] Threshold for minor allele count cut-off. If both mafTh and macTh are specified, macTh is used (macTh = mafTh*2*NrSamples).

mafTh

[double] Threshold for minor allele frequency cut-off.

interactive

[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding or a graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_marker) via ggplot2::ggsave(plot=p_marker, other_arguments) or pdf(outfile); print(p_marker); dev.off().

verbose

[logical] If TRUE, progress info is printed to standard out.

keep_individuals

[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals.

remove_individuals

[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals.

exclude_markers

[character] Path to file with markers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers.

extract_markers

[character] Path to file with markers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers.

legend_text_size

[integer] Size for legend text.

legend_title_size

[integer] Size for legend title.

axis_text_size

[integer] Size for axis text.

axis_title_size

[integer] Size for axis title.

title_size

[integer] Size for plot title.

subplot_label_size

[integer] Size of the subplot labeling.

path2plink

[character] Absolute path to the PLINK executable (https://www.cog-genomics.org/plink/1.9/), i.e. plink should be accessible as path2plink -h. The full name of the executable should be specified: for Windows, this means path/plink.exe; for Unix platforms, path/plink. If not provided, it is assumed that PATH is set up so that PLINK will be found by exec('plink').

showPlinkOutput

[logical] If TRUE, plink log and error messages are printed to standard out.
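
The relation between the macTh and mafTh arguments described above (macTh = mafTh*2*NrSamples) can be checked with a small sketch. The helper names maf_to_mac and mac_to_maf and the sample size of 1000 are assumptions made for illustration; they are not part of plinkQC.

```r
# Hypothetical helpers (not part of plinkQC): convert between a minor allele
# frequency threshold and the corresponding minor allele count, following the
# relation macTh = mafTh * 2 * NrSamples from the macTh argument description.
maf_to_mac <- function(mafTh, n_samples) {
  mafTh * 2 * n_samples
}

mac_to_maf <- function(macTh, n_samples) {
  macTh / (2 * n_samples)
}

# Assumed example: 1000 diploid samples and a MAF threshold of 1%
n_samples <- 1000
mac <- maf_to_mac(0.01, n_samples)
maf <- mac_to_maf(mac, n_samples)
print(c(mac = mac, maf = maf))
```

Because the count scales with the number of samples, the same macTh corresponds to a stricter frequency cut-off in a small cohort than in a large one.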

Value

Named [list] with i) fail_list, a named [list] with 1. SNP_missingness, containing the SNP IDs [vector] failing the missingness threshold lmissTh, 2. hwe, containing the SNP IDs [vector] failing the HWE exact test threshold hweTh and 3. maf, containing the SNP IDs [vector] failing the MAF threshold mafTh/MAC threshold macTh, and ii) p_markerQC, a ggplot2 object containing a sub-paneled plot with the QC plots of check_snp_missingness, check_hwe and check_maf, which can be shown by print(p_markerQC). List entries are NULL if the corresponding check was not run.
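
As a sketch of how the returned list might be used, the snippet below builds a mock object with the structure described here and collects the failing SNP IDs. The SNP IDs and the NULL placeholder for the plot are made up for illustration; a real call to perMarkerQC requires PLINK, and the exact element types should be checked against the actual output.

```r
# Mock return value following the documented structure; all SNP IDs are
# invented and the plot entry is left as a NULL placeholder.
fail_markers <- list(
  fail_list = list(
    SNP_missingness = c("rs1", "rs2"),
    hwe = c("rs2", "rs3"),
    maf = NULL  # NULL as documented when a check was not run
  ),
  p_markerQC = NULL  # would be a ggplot2 object in a real run
)

# Number of failing markers per check (a NULL entry counts as 0)
n_failed <- lengths(fail_markers$fail_list)

# Unique set of markers failing at least one check
all_failed <- unique(unlist(fail_markers$fail_list, use.names = FALSE))

# Write one ID per line, the layout described for exclude_markers
exclude_file <- file.path(tempdir(), "exclude_markers.txt")
writeLines(all_failed, exclude_file)
```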

Details

perMarkerQC wraps around the marker QC functions check_snp_missingness, check_hwe and check_maf. For details on the parameters and outputs, see the documentation of these functions.

Examples

indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'

# the following code is not run on package build, as the path2plink on the
# user system is not known.
# All quality control checks
if (FALSE) {
# run on all markers and individuals
fail_markers <- perMarkerQC(indir=indir, qcdir=qcdir, name=name,
                            interactive=FALSE, verbose=TRUE,
                            path2plink=path2plink)

# run on subset of individuals and markers
keep_individuals_file <- system.file("extdata", "keep_individuals",
                                     package="plinkQC")
extract_markers_file <- system.file("extdata", "extract_markers",
                                    package="plinkQC")
fail_markers <- perMarkerQC(qcdir=qcdir, indir=indir, name=name,
                            interactive=FALSE, verbose=TRUE,
                            path2plink=path2plink,
                            keep_individuals=keep_individuals_file,
                            extract_markers=extract_markers_file)
}
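
For running on a subset of individuals, a custom keep_individuals file could be created as in the following minimal sketch. The family and individual IDs and the file name are made up for illustration; the layout follows the description of the keep_individuals argument (family IDs in the first column, within-family IDs in the second, space/tab-delimited).

```r
# Hypothetical sample IDs; replace with FID/IID pairs from your own .fam file
ids <- data.frame(
  FID = c("fam1", "fam2"),
  IID = c("ind1", "ind2"),
  stringsAsFactors = FALSE
)

# Tab-delimited, no header, no quotes
keep_file <- file.path(tempdir(), "my_keep_individuals.txt")
write.table(ids, keep_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Inspect the written file
readLines(keep_file)
```

The resulting path can then be passed as keep_individuals=keep_file; a remove_individuals file has the same layout.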