R/utils.R
relatednessFilter.Rd
relatednessFilter takes a data.frame with pair-wise relatedness
measures of samples and returns pairs of individual IDs that are related as
well as a list of suggested individual IDs to remove.
relatednessFilter finds pairs of samples whose relatedness estimate
is larger than the specified relatednessTh. Subsequently, for pairs of
individuals that do not have additional relatives in the dataset, the
individual with the worse otherCriterionMeasure (if provided) or arbitrarily
individual 1 of that pair is selected and returned as the individual failing
the relatedness check. For more complex family structures, the unrelated
individuals per family are selected (e.g. in a simple case of a
parents-offspring trio, the offspring will be marked as fail, while the
parents will be kept in the analysis). Selection is achieved by constructing
subgraphs of clusters of individuals that are related.
relatednessFilter then finds the maximum independent set of vertices
in the subgraphs of related individuals. If all individuals in a cluster are
related (i.e. all maximum independent sets are 0), one individual of that
cluster will be kept and all others listed as failIDs.
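The selection logic described above can be illustrated on a single parents-offspring trio. The following is a minimal base R sketch, not the package's implementation: the data values, the threshold of 0.1875 and the helper `largest_independent_set` are assumptions made for illustration, and the helper uses a brute-force search that is only practical for small clusters.

```r
# Toy pair-wise relatedness table for one parents-offspring trio.
# Column names follow the defaults in the usage; the values are made up.
relatedness <- data.frame(
  IID1   = c("father", "mother", "father"),
  IID2   = c("child",  "child",  "mother"),
  PI_HAT = c(0.5,      0.5,      0.01),
  stringsAsFactors = FALSE
)
relatednessTh <- 0.1875  # assumed threshold for this illustration

# Step 1: keep only pairs whose estimate exceeds the threshold.
related <- relatedness[relatedness$PI_HAT > relatednessTh, ]

# Step 2: symmetric adjacency matrix over all individuals in related pairs.
ids <- sort(unique(c(related$IID1, related$IID2)))
adj <- matrix(FALSE, length(ids), length(ids), dimnames = list(ids, ids))
adj[cbind(related$IID1, related$IID2)] <- TRUE
adj[cbind(related$IID2, related$IID1)] <- TRUE

# Step 3 (hypothetical helper): brute-force search for a largest set of
# individuals in which no two members are connected by an edge.
largest_independent_set <- function(adj) {
  n <- nrow(adj)
  best <- character(0)
  for (mask in seq_len(2^n - 1)) {
    # Decode the subset encoded by the bits of `mask`.
    members <- (mask %/% 2^(seq_len(n) - 1)) %% 2 == 1
    independent <- !any(adj[members, members, drop = FALSE])
    if (independent && sum(members) > length(best)) {
      best <- rownames(adj)[members]
    }
  }
  best
}

keep <- largest_independent_set(adj)
fail <- setdiff(ids, keep)
```

In this toy case the two parents are unrelated to each other, so `keep` contains both parents and `fail` contains only the offspring, matching the trio example in the description.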
relatednessFilter(
  relatedness,
  otherCriterion = NULL,
  relatednessTh,
  otherCriterionTh = NULL,
  otherCriterionThDirection = c("gt", "ge", "lt", "le", "eq"),
  relatednessIID1 = "IID1",
  relatednessIID2 = "IID2",
  relatednessFID1 = NULL,
  relatednessFID2 = NULL,
  relatednessRelatedness = "PI_HAT",
  otherCriterionIID = "IID",
  otherCriterionMeasure = NULL,
  verbose = FALSE
)
relatedness | [data.frame] containing pair-wise relatedness estimates (in column [relatednessRelatedness]) for individual 1 (in column [relatednessIID1]) and individual 2 (in column [relatednessIID2]). Columns relatednessIID1, relatednessIID2 and relatednessRelatedness have to be present, while additional columns such as family IDs can be present. Default column names correspond to column names in output of plink --genome (https://www.cog-genomics.org/plink/1.9/ibd). All original columns for pairs that fail relatednessTh will be returned in relatednessFails. |
---|---|
otherCriterion | [data.frame] containing a QC measure (in column [otherCriterionMeasure]) per individual (in column [otherCriterionIID]). otherCriterionMeasure and otherCriterionIID have to be present, while additional columns such as family IDs can be present. IIDs in relatednessIID1 have to be present in otherCriterionIID. |
relatednessTh | [double] Threshold for filtering related individuals. Individuals whose pair-wise relatedness estimates are greater than this threshold are considered related. |
otherCriterionTh | [double] Threshold for filtering individuals based on otherCriterionMeasure. If related individuals fail this threshold they will automatically be excluded. |
otherCriterionThDirection | [character] Used to determine the direction for failing the otherCriterionTh. If 'gt', individuals whose otherCriterionMeasure > otherCriterionTh will automatically be excluded. For pairs of individuals that have no other related samples in the cohort: if both otherCriterionMeasure < otherCriterionTh, the individual with the larger otherCriterionMeasure will be excluded. |
relatednessIID1 | [character] Column name of column containing the IDs of the first individual. |
relatednessIID2 | [character] Column name of column containing the IDs of the second individual. |
relatednessFID1 | [character, optional] Column name of column containing the family IDs of the first individual; if only relatednessFID1 but not relatednessFID2 is provided, or neither is provided even though present in relatedness, FIDs will not be returned. |
relatednessFID2 | [character, optional] Column name of column containing the family IDs of the second individual; if only relatednessFID2 but not relatednessFID1 is provided, or neither is provided even though present in relatedness, FIDs will not be returned. |
relatednessRelatedness | [character] Column name of column containing the relatedness estimate. |
otherCriterionIID | [character] Column name of column containing the individual IDs. |
otherCriterionMeasure | [character] Column name of the column containing the measure of the otherCriterion (for instance SNP missingness rate). |
verbose | [logical] If TRUE, progress info is printed to standard out. |
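To make the interplay of otherCriterionTh and otherCriterionThDirection more concrete, here is a small base R sketch. It is an illustration under stated assumptions rather than the package's code: the helper `fails_threshold`, the column name `F_MISS` and all values are hypothetical, and the mapping of "ge", "lt", "le" and "eq" to comparison operators is inferred from their names (only 'gt' is spelled out in the argument description).

```r
# Hypothetical helper: TRUE where `measure` fails `threshold` for the
# given direction code.
fails_threshold <- function(measure, threshold,
                            direction = c("gt", "ge", "lt", "le", "eq")) {
  direction <- match.arg(direction)
  switch(direction,
         gt = measure >  threshold,
         ge = measure >= threshold,
         lt = measure <  threshold,
         le = measure <= threshold,
         eq = measure == threshold)
}

# Made-up per-individual SNP missingness rates.
otherCriterion <- data.frame(
  IID    = c("A", "B", "C"),
  F_MISS = c(0.01, 0.04, 0.002),
  stringsAsFactors = FALSE
)

# With direction "gt", individuals above the threshold fail outright.
otherCriterion$fail <- fails_threshold(otherCriterion$F_MISS, 0.03, "gt")

# For a related pair with no other relatives where both pass the threshold,
# the description says the individual with the larger measure is excluded.
pair <- c("A", "C")
pair_measure <- otherCriterion$F_MISS[match(pair, otherCriterion$IID)]
excluded <- pair[which.max(pair_measure)]
```

Here only individual B fails the threshold directly, and within the pair A and C the individual A is excluded because its missingness rate is higher.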
named [list] with i) relatednessFails, a [data.frame] containing the data.frame relatedness after filtering for pairs of individuals in relatednessIID1 and relatednessIID2, that fail the relatedness QC; the data.frame is reordered with the fail individuals in column 1 and their related individuals in column 2 and ii) failIDs, a [data.frame] with the [IID]s (and [FID]s if provided) of the individuals that fail the relatednessTh.