Predicts the ancestry of input samples using plink2. Uses the output of
run_ancestry_prediction as input to a random forest classifier
that predicts the genomic ancestry of samples within six continental groups:
AFR, AMR, EAS, EUR, CSA, and MID. The function requires genomic data on build hg38 with variant
identifiers in the format 1:12345[hg38].
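Because the function depends on the identifier format, it can help to check variant IDs before running the prediction. The following is a minimal sketch, not part of plinkQC: the helper name is_hg38_variant_id is hypothetical, and the accepted chromosome labels (1-22, X, Y, MT) are an assumption that may need adjusting for your data.

```r
# Hypothetical helper (not part of plinkQC): TRUE for IDs that follow the
# chromosome:position[hg38] pattern, e.g. "1:12345[hg38]".
is_hg38_variant_id <- function(ids) {
  # Assumed chromosome labels: 1-22, X, Y, MT; position: digits only;
  # the suffix must be the literal string "[hg38]".
  grepl("^([1-9]|1[0-9]|2[0-2]|X|Y|MT):[0-9]+\\[hg38\\]$", ids)
}

ids <- c("1:12345[hg38]", "rs12345", "chr1:12345", "X:500[hg38]")
is_hg38_variant_id(ids)
```

IDs that fail the check (rsIDs, or positions without the build suffix) would need to be renamed before use.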
Usage
evaluate_ancestry_prediction(
qcdir,
name,
verbose = FALSE,
interactive = FALSE,
excludeAncestry = NULL,
legend_text_size = 5,
legend_title_size = 7,
axis_text_size = 5,
axis_title_size = 7,
title_size = 9,
showPlinkOutput = TRUE,
legend_position = "right"
)
Arguments
- qcdir
[character] /path/to/directory where name.sscore, as returned by plink2 --score, is located.
- name
[character] Prefix of the .sscore file, i.e. name in name.sscore.
- verbose
[logical] If TRUE, progress info is printed to standard out.
- interactive
[logical] Should plots be shown interactively? If TRUE, make sure X-forwarding or a graphical interface is available for interactive plotting. Alternatively, set interactive=FALSE and save the returned plot object (p_ancestry) via ggplot2::ggsave(plot=p_ancestry, other_arguments) or pdf(outfile); print(p_ancestry); dev.off().
- excludeAncestry
[character] Ancestries to be excluded (if any). Options are: Africa, America, Central_South_Asia, East_Asia, Europe, and Middle_East. Strings must be spelled exactly as shown.
- legend_text_size
[integer] Size for legend text.
- legend_title_size
[integer] Size for legend title.
- axis_text_size
[integer] Size for axis text.
- axis_title_size
[integer] Size for axis title.
- title_size
[integer] Size for plot title.
- showPlinkOutput
[logical] If TRUE, plink log and error messages are printed to standard out.
- legend_position
[character] Legend position for the plot.
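Since excludeAncestry labels must be spelled exactly as listed, a small check before the call can catch typos early. This is a sketch under stated assumptions: check_exclude_ancestry is a hypothetical helper, not a plinkQC function, and the label set is copied from the argument description above.

```r
# Allowed labels, copied from the excludeAncestry description.
valid_ancestries <- c("Africa", "America", "Central_South_Asia",
                      "East_Asia", "Europe", "Middle_East")

# Hypothetical helper (not part of plinkQC): stop on misspelled labels.
check_exclude_ancestry <- function(excludeAncestry) {
  # NULL means "exclude nothing", which is always valid.
  if (is.null(excludeAncestry)) return(invisible(NULL))
  unknown <- setdiff(excludeAncestry, valid_ancestries)
  if (length(unknown) > 0) {
    stop("Unknown ancestry label(s): ", paste(unknown, collapse = ", "))
  }
  invisible(excludeAncestry)
}

check_exclude_ancestry(c("Europe", "East_Asia"))  # passes silently
# check_exclude_ancestry("europe")                # would stop: matching is case-sensitive
```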
Value
Three data frames and a visualization of the ancestry probabilities. prediction_prob contains the sample IDs and the ancestry probabilities from the model. prediction_majority contains the sample IDs and each sample's greatest ancestry probability from the model. exclude_ancestry contains the IDs of samples with ancestries to be excluded. p_ancestry is a bar plot of the ancestry probabilities.
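To illustrate how a per-sample majority call relates to a table of probabilities, the sketch below picks the highest-probability group in each row. It uses toy values, and the column names (IID, ancestry, probability) are assumptions for illustration; the actual layout of prediction_prob and prediction_majority may differ.

```r
# Toy probabilities for illustration only; column names are assumed.
prediction_prob <- data.frame(
  IID = c("sample1", "sample2"),
  AFR = c(0.90, 0.05),
  EUR = c(0.06, 0.15),
  EAS = c(0.04, 0.80),
  stringsAsFactors = FALSE
)

prob_cols <- setdiff(names(prediction_prob), "IID")
probs <- as.matrix(prediction_prob[prob_cols])

# Column index of the largest probability in each row.
best <- max.col(probs, ties.method = "first")

prediction_majority <- data.frame(
  IID = prediction_prob$IID,
  ancestry = prob_cols[best],
  probability = probs[cbind(seq_along(best), best)],
  stringsAsFactors = FALSE
)
prediction_majority
```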
Examples
indir <- system.file("extdata", package="plinkQC")
qcdir <- tempdir()
name <- "data.hg38"
path2plink2 <- '/path/to/plink'
path2load_mat <- '/path/to/load_mat/merged_chrs.postQC.train.pca'
if (FALSE) { # \dontrun{
# the following code is not run on package build, as the path2plink2 on the
# user system is not known.
superpop_classification(indir=indir, qcdir=qcdir, name=name,
                        path2plink2 = path2plink2, path2load_mat = path2load_mat)
} # }
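A call that follows the Usage signature above might look like the sketch below. It is guarded so that it only runs when plinkQC is installed and the .sscore file exists. Accessing the plot as res$p_ancestry assumes the return value is a named list, and the output file name ancestry.pdf is a placeholder.

```r
# Sketch only: qcdir and name are placeholders for your own data.
qcdir <- tempdir()
name <- "data.hg38"
sscore_file <- file.path(qcdir, paste0(name, ".sscore"))

# Run only when plinkQC is installed and the plink2 --score output exists.
if (requireNamespace("plinkQC", quietly = TRUE) && file.exists(sscore_file)) {
  res <- plinkQC::evaluate_ancestry_prediction(
    qcdir = qcdir,
    name = name,
    interactive = FALSE,
    excludeAncestry = c("Europe")
  )
  # Assumption: the result is a named list holding the plot as p_ancestry.
  ggplot2::ggsave(file.path(qcdir, "ancestry.pdf"), plot = res$p_ancestry)
}
```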