
Predicts the ancestry of input samples using plink2. The output of run_ancestry_prediction is passed to a random forest classifier that predicts the genomic ancestry of each sample across six continental groups: AFR, AMR, EAS, EUR, CSA, and MID. The function requires hg38 genomic data with variant identifiers in the format 1:12345[hg38].
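Because the function depends on the 1:12345[hg38] identifier format, it can help to check identifiers before running the prediction. The following is a minimal sketch in base R, not part of plinkQC: the variable names and the example identifiers are hypothetical, and the pattern assumes only that an identifier is a chromosome label, a colon, a numeric position, and the literal suffix [hg38].

```r
# Hypothetical variant identifiers; only the first matches the required format.
variant_ids <- c("1:12345[hg38]", "rs123", "1:12345")

# Chromosome label, colon, numeric position, literal "[hg38]" suffix.
id_pattern <- "^[^:]+:[0-9]+\\[hg38\\]$"
is_valid <- grepl(id_pattern, variant_ids)

# Report identifiers that would need to be renamed before prediction.
invalid_ids <- variant_ids[!is_valid]
if (length(invalid_ids) > 0) {
  message("Identifiers not in chr:pos[hg38] format: ",
          paste(invalid_ids, collapse = ", "))
}
```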

Usage

evaluate_ancestry_prediction(
  qcdir,
  name,
  verbose = FALSE,
  interactive = FALSE,
  excludeAncestry = NULL,
  legend_text_size = 5,
  legend_title_size = 7,
  axis_text_size = 5,
  axis_title_size = 7,
  title_size = 9,
  showPlinkOutput = TRUE,
  legend_position = "right"
)

Arguments

qcdir

[character] /path/to/directory where name.sscore, as returned by plink2 --score, is located.

name

[character] Prefix of the .sscore file, i.e. name.sscore.

verbose

[logical] If TRUE, progress info is printed to standard out.

interactive

[logical] Should plots be shown interactively? When choosing this option, make sure you have X-forwarding or a graphical interface available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_ancestry) via ggplot2::ggsave(plot=p_ancestry, other_arguments) or via pdf(outfile); print(p_ancestry); dev.off().
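The two non-interactive saving routes can be sketched as follows. The plot below is a toy stand-in for p_ancestry built from made-up data, and the output file names are hypothetical. Note that ggsave takes the plot through its plot argument.

```r
library(ggplot2)

# Toy stand-in for the returned p_ancestry object (hypothetical data).
toy <- data.frame(sample = c("S1", "S2"), probability = c(0.9, 0.7))
p_ancestry <- ggplot(toy, aes(x = sample, y = probability)) + geom_col()

# Option 1: ggsave, passing the plot object explicitly.
out_ggsave <- file.path(tempdir(), "ancestry_ggsave.pdf")
ggsave(filename = out_ggsave, plot = p_ancestry, width = 6, height = 4)

# Option 2: open a pdf device, print the plot, then close the device.
out_device <- file.path(tempdir(), "ancestry_device.pdf")
pdf(out_device, width = 6, height = 4)
print(p_ancestry)
dev.off()
```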

excludeAncestry

[character] Ancestries to be excluded (if any). Options are: Africa, America, Central_South_Asia, East_Asia, Europe, and Middle_East. Strings must be spelled exactly as shown.
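Since the strings must match exactly, a quick check before calling the function can catch misspellings. This is a minimal sketch in base R and not plinkQC's own validation; the requested values are hypothetical.

```r
# Option names as listed in the argument description.
allowed <- c("Africa", "America", "Central_South_Asia",
             "East_Asia", "Europe", "Middle_East")

# Hypothetical user input; "middle_east" has the wrong capitalisation.
requested <- c("Europe", "middle_east")

# Flag anything that does not match an allowed option exactly.
misspelled <- setdiff(requested, allowed)
if (length(misspelled) > 0) {
  message("Not a recognised ancestry option: ",
          paste(misspelled, collapse = ", "))
}

# Keep only the exact matches to pass on as excludeAncestry.
excludeAncestry <- intersect(requested, allowed)
```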

legend_text_size

[integer] Size for legend text.

legend_title_size

[integer] Size for legend title.

axis_text_size

[integer] Size for axis text.

axis_title_size

[integer] Size for axis title.

title_size

[integer] Size for plot title.

showPlinkOutput

[logical] If TRUE, plink log and error messages are printed to standard out.

legend_position

[character] Legend position for the plot.

Value

Three data frames and a plot of the ancestry probabilities. prediction_prob contains the sample IDs and the ancestry probabilities from the model. prediction_majority contains the sample IDs and, for each sample, the highest ancestry probability from the model. exclude_ancestry contains the IDs of samples whose ancestry is to be excluded. p_ancestry contains a bar plot of the ancestry probabilities.
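The relationship between prediction_prob and prediction_majority can be illustrated with a toy table. This is a sketch only: the column names (IID, ancestry, probability) and all values are assumptions made for illustration and may not match the columns plinkQC actually returns.

```r
# Hypothetical probability table; column names and values are assumptions.
groups <- c("AFR", "AMR", "EAS", "EUR", "CSA", "MID")
prediction_prob <- data.frame(
  IID = c("S1", "S2", "S3"),
  AFR = c(0.90, 0.05, 0.10),
  AMR = c(0.02, 0.05, 0.10),
  EAS = c(0.02, 0.80, 0.10),
  EUR = c(0.02, 0.05, 0.55),
  CSA = c(0.02, 0.03, 0.10),
  MID = c(0.02, 0.02, 0.05)
)

# For each sample, find the group with the highest probability.
prob_matrix <- as.matrix(prediction_prob[, groups])
top <- max.col(prob_matrix, ties.method = "first")

# One row per sample: ID, top group, and that group's probability.
prediction_majority <- data.frame(
  IID = prediction_prob$IID,
  ancestry = groups[top],
  probability = prob_matrix[cbind(seq_len(nrow(prob_matrix)), top)]
)
```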

Examples

indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data.hg38"
path2plink2 <- '/path/to/plink2'
path2load_mat <- '/path/to/load_mat/merged_chrs.postQC.train.pca'
if (FALSE) { # \dontrun{
# the following code is not run on package build, as the path2plink2 on the
# user system is not known.
superpop_classification(indir=indir, qcdir=qcdir, name=name,
                        path2plink2=path2plink2, path2load_mat=path2load_mat)
} # }
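A direct call to evaluate_ancestry_prediction itself might look like the sketch below, which uses only arguments from the Usage section. It assumes that name.sscore has already been written to qcdir by a previous plink2 --score step. The guard and the choice of excludeAncestry are illustrative, so the call is skipped when the package or the file is missing.

```r
qcdir <- tempdir()
name <- "data.hg38"

# The function expects <qcdir>/<name>.sscore to exist already.
sscore_file <- file.path(qcdir, paste0(name, ".sscore"))

if (requireNamespace("plinkQC", quietly = TRUE) && file.exists(sscore_file)) {
  ancestry <- plinkQC::evaluate_ancestry_prediction(
    qcdir = qcdir,
    name = name,
    verbose = TRUE,
    interactive = FALSE,
    excludeAncestry = "Europe"  # illustrative choice
  )
  # Inspect the top-level structure of the returned object.
  str(ancestry, max.level = 1)
} else {
  message("Skipping: plinkQC or ", sscore_file, " is not available.")
}
```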