Remove related individuals while keeping maximum number of individuals

relatednessFilter takes a data.frame with pair-wise relatedness measures of samples and returns the pairs of individual IDs that are related, as well as a list of suggested individual IDs to remove. relatednessFilter first finds pairs of samples whose relatedness estimate is larger than the specified relatednessTh. For pairs of individuals that do not have additional relatives in the dataset, the individual with the worse otherCriterionMeasure (if provided), or otherwise arbitrarily individual 1 of that pair, is selected and returned as the individual failing the relatedness check.

For more complex family structures, the unrelated individuals per family are selected (e.g. in the simple case of a parents-offspring trio, the offspring will be marked as fail, while the parents will be kept in the analysis). Selection is achieved by constructing subgraphs of clusters of related individuals; relatednessFilter then finds the maximum independent set of vertices in each of these subgraphs. If all individuals are related (i.e. all maximum independent sets are 0), one individual of that cluster will be kept and all others listed as failIDs.
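The trio case described above can be sketched in base R. This is only an illustration of the maximum-independent-set idea, not the package's implementation: the toy data, the threshold value of 0.1875 and the helper names is_independent and largest_independent_set are assumptions made for this example, and the brute-force search is only feasible for very small clusters.

```r
# Hypothetical toy data: one parents-offspring trio (IDs are illustrative).
pairs <- data.frame(
  IID1 = c("father", "mother"),
  IID2 = c("child", "child"),
  PI_HAT = c(0.5, 0.5),
  stringsAsFactors = FALSE
)
relatednessTh <- 0.1875  # assumed threshold, chosen for this example only

# Keep only pairs whose relatedness estimate exceeds the threshold.
related <- pairs[pairs$PI_HAT > relatednessTh, ]
ids <- sort(unique(c(related$IID1, related$IID2)))

# A set of IDs is independent if no related pair lies entirely inside it.
is_independent <- function(subset_ids, edges) {
  !any(edges$IID1 %in% subset_ids & edges$IID2 %in% subset_ids)
}

# Brute force: try subsets from largest to smallest and return the first
# independent one. A single individual is always independent, so a fully
# related cluster keeps exactly one individual.
largest_independent_set <- function(ids, edges) {
  if (length(ids) == 0) return(character(0))
  for (k in seq(length(ids), 1)) {
    for (candidate in combn(ids, k, simplify = FALSE)) {
      if (is_independent(candidate, edges)) return(candidate)
    }
  }
  character(0)
}

keepIDs <- largest_independent_set(ids, related)
failIDs <- setdiff(ids, keepIDs)
```

In this sketch the two parents form the largest set without a related pair, so the offspring is the only ID left in failIDs, matching the trio behaviour described above.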

Usage

relatednessFilter(
  relatedness,
  otherCriterion = NULL,
  relatednessTh,
  otherCriterionTh = NULL,
  otherCriterionThDirection = c("gt", "ge", "lt", "le", "eq"),
  relatednessIID1 = "IID1",
  relatednessIID2 = "IID2",
  relatednessFID1 = NULL,
  relatednessFID2 = NULL,
  relatednessRelatedness = "PI_HAT",
  otherCriterionIID = "IID",
  otherCriterionMeasure = NULL,
  verbose = FALSE
)

Arguments

relatedness: [data.frame] containing pair-wise relatedness estimates (in column [relatednessRelatedness]) for individual 1 (in column [relatednessIID1]) and individual 2 (in column [relatednessIID2]). Columns relatednessIID1, relatednessIID2 and relatednessRelatedness have to be present, while additional columns such as family IDs can be present. Default column names correspond to column names in the output of plink --genome (https://www.cog-genomics.org/plink/1.9/ibd). All original columns for pairs failing relatednessTh will be returned in relatednessFails.
otherCriterion: [data.frame] containing a QC measure (in column [otherCriterionMeasure]) per individual (in column [otherCriterionIID]). otherCriterionMeasure and otherCriterionIID have to be present, while additional columns such as family IDs can be present. IIDs in relatednessIID1 have to be present in otherCriterionIID.
relatednessTh: [double] Threshold for filtering related individuals. Individuals whose pair-wise relatedness estimates are greater than this threshold are considered related.
otherCriterionTh: [double] Threshold for filtering individuals based on otherCriterionMeasure. If related individuals fail this threshold they will automatically be excluded.
otherCriterionThDirection: [character] Used to determine the direction for failing the otherCriterionTh. If 'gt', individuals whose otherCriterionMeasure > otherCriterionTh will automatically be excluded. For pairs of individuals that have no other related samples in the cohort: if both otherCriterionMeasure < otherCriterionTh, the individual with the larger otherCriterionMeasure will be excluded.
relatednessIID1: [character] Column name of column containing the IDs of the first individual.
relatednessIID2: [character] Column name of column containing the IDs of the second individual.
relatednessFID1: [character, optional] Column name of column containing the family IDs of the first individual; if only relatednessFID1 but not relatednessFID2 is provided, or neither is provided even though present in relatedness, FIDs will not be returned.
relatednessFID2: [character, optional] Column name of column containing the family IDs of the second individual; if only relatednessFID2 but not relatednessFID1 is provided, or neither is provided even though present in relatedness, FIDs will not be returned.
relatednessRelatedness: [character] Column name of column containing the relatedness estimate.
otherCriterionIID: [character] Column name of column containing the individual IDs.
otherCriterionMeasure: [character] Column name of the column containing the measure of the otherCriterion (for instance SNP missingness rate).
verbose: [logical] If TRUE, progress info is printed to standard out.
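
The interplay of otherCriterionTh and otherCriterionThDirection for an isolated pair can be sketched as follows. This is a simplified illustration, not the package's code: the helpers failsThreshold and pickFail and the missingness values are hypothetical. Only the 'gt' case is spelled out in the argument description; the mappings for 'ge', 'lt', 'le' and 'eq' are assumed by analogy, and the tie-break shown covers the 'gt' direction only.

```r
# Map each direction code to a comparison. Only "gt" (measure > threshold)
# is stated in the documentation; the other four mappings are assumptions.
failsThreshold <- function(measure, threshold,
                           direction = c("gt", "ge", "lt", "le", "eq")) {
  direction <- match.arg(direction)
  compare <- switch(direction,
    gt = `>`, ge = `>=`, lt = `<`, le = `<=`, eq = `==`
  )
  compare(measure, threshold)
}

# For one related pair with no other relatives (direction "gt" only):
# individuals failing the threshold are excluded outright; if neither
# fails, the one with the larger measure is excluded.
pickFail <- function(qc, threshold) {
  failing <- qc$IID[failsThreshold(qc$measure, threshold, "gt")]
  if (length(failing) > 0) return(failing)
  qc$IID[which.max(qc$measure)]
}

# Hypothetical SNP missingness rates for two related pairs.
pairA <- data.frame(IID = c("ind1", "ind2"), measure = c(0.01, 0.04),
                    stringsAsFactors = FALSE)
pairB <- data.frame(IID = c("ind3", "ind4"), measure = c(0.02, 0.01),
                    stringsAsFactors = FALSE)
otherCriterionTh <- 0.03

failA <- pickFail(pairA, otherCriterionTh)  # ind2 exceeds the threshold
failB <- pickFail(pairB, otherCriterionTh)  # neither exceeds it; ind3 is larger
```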

Value

Named [list] with i) relatednessFails, a [data.frame] containing the rows of the data.frame relatedness for those pairs of individuals (in relatednessIID1 and relatednessIID2) that fail the relatedness QC, reordered so that the failing individuals are in column 1 and their related individuals in column 2, and ii) failIDs, a [data.frame] with the [IID]s (and [FID]s if provided) of the individuals that fail the relatednessTh.
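
A typical next step is to drop the failIDs from a sample table. The sketch below uses a hand-built stand-in with the documented structure rather than real output, so that it runs without the package; the sample IDs and the samples table are hypothetical.

```r
# With the package installed, the result would come from a call such as
#   res <- relatednessFilter(relatedness = pairs, relatednessTh = 0.1875)
# Here a hand-built list mimicking the documented structure is used instead.
res <- list(
  relatednessFails = data.frame(
    IID1 = c("child", "child"),    # failing individual in column 1
    IID2 = c("father", "mother"),  # their related individuals in column 2
    PI_HAT = c(0.5, 0.5),
    stringsAsFactors = FALSE
  ),
  failIDs = data.frame(IID = "child", stringsAsFactors = FALSE)
)

# Hypothetical sample table for the full cohort.
samples <- data.frame(
  IID = c("father", "mother", "child", "unrelated"),
  stringsAsFactors = FALSE
)

# Keep only individuals that are not listed in failIDs.
keep <- samples[!samples$IID %in% res$failIDs$IID, , drop = FALSE]
```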