Runs and evaluates results from plink --missing --freq. It calculates the rates of missing genotype calls and the allele frequencies for all variants in the individuals that passed the perIndividualQC. The SNP missingness rates (stratified by minor allele frequency) are depicted as histograms.

check_snp_missingness(
  indir,
  name,
  qcdir = indir,
  lmissTh = 0.01,
  interactive = FALSE,
  path2plink = NULL,
  verbose = FALSE,
  showPlinkOutput = TRUE,
  keep_individuals = NULL,
  remove_individuals = NULL,
  exclude_markers = NULL,
  extract_markers = NULL,
  legend_text_size = 5,
  legend_title_size = 7,
  axis_text_size = 5,
  axis_title_size = 7,
  title_size = 9
)

Arguments

indir

[character] /path/to/directory containing the basic PLINK data files name.bim, name.bed and name.fam.

name

[character] Prefix of PLINK files, i.e. name.bed, name.bim, name.fam.

qcdir

[character] /path/to/directory where results will be written to. If perIndividualQC was conducted, this directory should be the same as qcdir specified in perIndividualQC, i.e. it contains name.fail.IDs with IIDs of individuals that failed QC. The user needs write permission for qcdir. By default, qcdir=indir.

lmissTh

[double] Threshold for acceptable variant missing rate across samples.

interactive

[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding or a graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_lmiss) via ggplot2::ggsave(plot=p_lmiss, other_arguments) or pdf(outfile); print(p_lmiss); dev.off().

path2plink

[character] Absolute path to the PLINK executable (https://www.cog-genomics.org/plink/1.9/), i.e. plink should be accessible as path2plink -h. The full name of the executable should be specified: on Windows this means path/plink.exe, on Unix platforms path/plink. If not provided, it is assumed that PATH is set up correctly and PLINK will be found by exec('plink').

verbose

[logical] If TRUE, progress information is printed to standard out; in particular, the PLINK log is displayed.

showPlinkOutput

[logical] If TRUE, plink log and error messages are printed to standard out.

keep_individuals

[character] Path to file with individuals to be retained in the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples not listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals.

remove_individuals

[character] Path to file with individuals to be removed from the analysis. The file has to be a space/tab-delimited text file with family IDs in the first column and within-family IDs in the second column. All samples listed in this file will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#indiv. Default: NULL, i.e. no filtering on individuals.
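
The file layout expected by keep_individuals and remove_individuals can be produced with base R. The following is a minimal sketch; the sample IDs and the file name are hypothetical and only serve as an illustration.

```r
# Hypothetical sample IDs; replace with FID/IID pairs from your own .fam file.
samples <- data.frame(
  FID = c("fam1", "fam2", "fam3"),
  IID = c("ind1", "ind2", "ind3"),
  stringsAsFactors = FALSE
)

# Write a space-delimited text file without header, row names or quotes:
# family IDs in the first column, within-family IDs in the second column.
keep_individuals_file <- file.path(tempdir(), "keep_individuals.txt")
write.table(samples, file = keep_individuals_file,
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Read the file back to check its layout.
readLines(keep_individuals_file)
```

The resulting path can then be passed as keep_individuals (or, for a list of samples to drop, as remove_individuals).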

exclude_markers

[character] Path to file with markers to be removed from the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All listed variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers.

extract_markers

[character] Path to file with markers to be included in the analysis. The file has to be a text file with a list of variant IDs (usually one per line, but it's okay for them to just be separated by spaces). All unlisted variants will be removed from the current analysis. See https://www.cog-genomics.org/plink/1.9/filter#snp. Default: NULL, i.e. no filtering on markers.
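
A marker list for exclude_markers or extract_markers is a plain text file of variant IDs. As a minimal sketch, with hypothetical variant IDs and a hypothetical file name, such a file can be written with one ID per line:

```r
# Hypothetical variant IDs; replace with IDs from your own .bim file.
marker_ids <- c("rs0001", "rs0002", "rs0003")

# One variant ID per line.
extract_markers_file <- file.path(tempdir(), "extract_markers.txt")
writeLines(marker_ids, extract_markers_file)

# Read the file back to check its content.
readLines(extract_markers_file)
```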

legend_text_size

[integer] Size for legend text.

legend_title_size

[integer] Size for legend title.

axis_text_size

[integer] Size for axis text.

axis_title_size

[integer] Size for axis title.

title_size

[integer] Size for plot title.

Value

Named list with i) fail_missingness, a [data.frame] with the columns CHR (chromosome code), SNP (variant identifier), CLST (cluster identifier; only present with --within/--family), N_MISS (number of missing genotype calls, not counting obligatory missings), N_CLST (cluster size, not including nonmales on chrY; only present with --within/--family), N_GENO (number of potentially valid calls) and F_MISS (missing call rate) for all SNPs failing the lmissTh, and ii) p_lmiss, a ggplot2 object containing the SNP missingness histogram, which can be shown by print(p_lmiss).
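
To illustrate how the fail_missingness element might be inspected, the sketch below uses a mocked return value with made-up numbers. It is not output of check_snp_missingness, and a real call additionally returns p_lmiss.

```r
# Mocked return value for illustration only (values are invented).
fail_snp_missingness <- list(
  fail_missingness = data.frame(
    CHR = c(1, 1, 2),
    SNP = c("rs0001", "rs0002", "rs0003"),
    N_MISS = c(5, 12, 8),
    N_GENO = c(200, 200, 200),
    F_MISS = c(0.025, 0.06, 0.04),
    stringsAsFactors = FALSE
  )
)

failing <- fail_snp_missingness$fail_missingness

# Sort failing SNPs by missing call rate, worst first.
worst_first <- failing[order(failing$F_MISS, decreasing = TRUE), ]
print(worst_first)

# Count failing SNPs per chromosome.
fails_per_chr <- table(failing$CHR)
print(fails_per_chr)
```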

Details

check_snp_missingness uses plink --remove name.fail.IDs --missing --freq to calculate the rates of missing genotype calls and the allele frequencies per SNP in the individuals that passed the perIndividualQC. It does so without generating a new dataset; the failing IDs are simply excluded when the statistics are calculated.
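
The PLINK call described above can be sketched as an argument vector. The directory paths are placeholders, and the use of --bfile and --out as well as the exact arguments assembled inside the package are assumptions; only --remove, --missing and --freq are stated in this documentation.

```r
name <- "data"
indir <- "/path/to/indir"   # placeholder directory with name.bed/.bim/.fam
qcdir <- "/path/to/qcdir"   # placeholder directory with name.fail.IDs

# Assumed layout of the call: input prefix, IDs to remove, statistics, output prefix.
plink_args <- c(
  "--bfile", file.path(indir, name),
  "--remove", file.path(qcdir, paste0(name, ".fail.IDs")),
  "--missing",
  "--freq",
  "--out", file.path(qcdir, name)
)

# Show the command line; it is deliberately not executed here.
cat("plink", plink_args, "\n")
# Not run: system2("plink", args = plink_args)
```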

For details on the output data.frame fail_missingness, check the original description on the PLINK output format page: https://www.cog-genomics.org/plink/1.9/formats#lmiss.
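
As a rough illustration of how the threshold relates to the .lmiss columns, the sketch below parses a small, made-up table in the .lmiss layout and keeps the variants whose F_MISS exceeds lmissTh. The data are invented, and the strict comparison (>) is an assumption about how the threshold is applied.

```r
# Made-up content in the layout of a PLINK .lmiss file (no --within/--family).
lmiss_text <- "
 CHR     SNP   N_MISS   N_GENO   F_MISS
   1  rs0001        0      200    0.000
   1  rs0002        1      200    0.005
   2  rs0003        4      200    0.020
   2  rs0004       30      200    0.150
"
lmiss <- read.table(text = lmiss_text, header = TRUE, stringsAsFactors = FALSE)

# Default threshold from the function signature.
lmissTh <- 0.01

# Variants with a missing call rate above the threshold (assumed strict).
fail_missingness <- lmiss[lmiss$F_MISS > lmissTh, ]
print(fail_missingness)
```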

Examples

indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data"
path2plink <- '/path/to/plink'

# the following code is not run on package build, as the path2plink on the
# user system is not known.
if (FALSE) {
  # run on all individuals and markers
  fail_snp_missingness <- check_snp_missingness(qcdir=qcdir, indir=indir,
                                                name=name, interactive=FALSE,
                                                verbose=TRUE,
                                                path2plink=path2plink)

  # run on subset of individuals and markers
  keep_individuals_file <- system.file("extdata", "keep_individuals",
                                       package="plinkQC")
  extract_markers_file <- system.file("extdata", "extract_markers",
                                      package="plinkQC")
  fail_snp_missingness <- check_snp_missingness(qcdir=qcdir, indir=indir,
                                                name=name, interactive=FALSE,
                                                verbose=TRUE,
                                                path2plink=path2plink,
                                                keep_individuals=keep_individuals_file,
                                                extract_markers=extract_markers_file)
}