Module `epiclass.utils.shap.analyze_shaps_kfold`

For every fold of a cross-validation, take precomputed SHAP values and analyze them.

The following analyses are performed:

- Writing BED files of the most frequent features among the N highest SHAP values (by absolute value). E.g., for N=100, the 100 highest absolute SHAP values are taken for each sample, the selected features are counted, and features present in X% of the samples are written to a BED file. This is done for each class separately, using that class's output SHAP matrix. (There are SHAP values for every feature, for each class, for each sample, hence one matrix per class.)
  - For all samples of the classifier class
  - Subsampled by assay
  - Subsampled by cell type and assay
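The selection step described above can be sketched roughly as follows. This is a minimal illustration and not the module's actual code: the function name `frequent_top_features` is hypothetical, and it assumes that "present in X% of the samples" means a feature is in the top N of at least X% of the samples.

```python
import numpy as np


def frequent_top_features(
    shap_matrix: np.ndarray, top_n: int, min_percentile: float
) -> np.ndarray:
    """Hypothetical helper: indices of features that are frequently in the top N.

    shap_matrix has shape (n_samples, n_features) and holds the SHAP values
    of ONE output class.
    """
    n_samples, n_features = shap_matrix.shape
    abs_vals = np.abs(shap_matrix)
    # Indices of the top_n largest absolute values in each row
    # (the order inside the top_n does not matter here).
    top_idx = np.argpartition(abs_vals, -top_n, axis=1)[:, -top_n:]
    # Number of samples in which each feature made the top_n.
    counts = np.bincount(top_idx.ravel(), minlength=n_features)
    # Assumption: keep features found in at least min_percentile % of samples.
    return np.flatnonzero(counts >= n_samples * min_percentile / 100)


# Toy matrix: 3 samples, 4 features.
example = np.array(
    [
        [0.9, -0.1, 0.05, -0.8],
        [0.7, 0.2, -0.9, 0.1],
        [-0.6, 0.05, 0.1, 0.5],
    ]
)
selected = frequent_top_features(example, top_n=2, min_percentile=60)
```

In the toy matrix, feature 0 is in the top 2 of all three samples and feature 3 in two of them, so both pass a 60% threshold, while feature 2 (one sample) does not.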

Functions

def analyze_shap_fold(extract_shap_values_and_info_output: Tuple[np.ndarray, List[str], List[Tuple[str, str]]], output_folder: Path, metadata: Metadata, label_category: str, chromsizes: List[Tuple[str, int]], resolution: int, overwrite: bool, top_N_required: int = 100, min_percentile: float = 80, copy_metadata: bool = True) ‑> Dict[str, Dict[~KT, ~VT]]

Analyzes SHAP values from one training fold and writes BED files of the most frequent important features.

Args

extract_shap_values_and_info_output : Tuple[np.ndarray, List[str], List[Tuple[str, str]]]: Output of extract_shap_values_and_info.
output_folder : Path: The directory to write the results to.
metadata : Metadata: Metadata object containing label information.
label_category : str: The category of labels to consider.
chromsizes : List[Tuple[str, int]]: List with chromosome names and sizes.
resolution : int: The resolution for binning.
overwrite : bool: Whether to overwrite existing files.
top_N_required : int, optional: The number of top SHAP values/features to consider per sample. Defaults to 100.
min_percentile : float, optional: The percentile value for feature frequency selection (0 < x < 100). Defaults to 80.

Returns

Dict[str, Dict]: A dictionary containing important features for each analyzed class label.

The keys are class labels, and values are dictionaries of important features for different percentiles.
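To illustrate how `chromsizes` and `resolution` could relate feature indices to BED intervals, here is a hedged sketch. It assumes that features are consecutive, non-overlapping bins of `resolution` base pairs, laid out chromosome by chromosome in the order given by `chromsizes`, with the last bin of each chromosome truncated at the chromosome end. The helper name `bins_to_bed_records` is hypothetical and the real layout may differ.

```python
from typing import Iterable, List, Tuple


def bins_to_bed_records(
    bin_indices: Iterable[int],
    chromsizes: List[Tuple[str, int]],
    resolution: int,
) -> List[Tuple[str, int, int]]:
    """Hypothetical helper: map genome-wide bin indices to (chrom, start, end)."""
    # Precompute the half-open range of global bin indices for each chromosome.
    ranges = []
    first_bin = 0
    for chrom, size in chromsizes:
        n_bins = -(-size // resolution)  # ceiling division
        ranges.append((chrom, size, first_bin, first_bin + n_bins))
        first_bin += n_bins

    records = []
    for idx in sorted(bin_indices):
        for chrom, size, first, last in ranges:
            if first <= idx < last:
                start = (idx - first) * resolution
                end = min(start + resolution, size)  # truncate the last bin
                records.append((chrom, start, end))
                break
        else:
            raise IndexError(f"bin index {idx} is outside the genome")
    return records


# Toy genome: chr1 has 3 bins (0-2), chr2 has 2 bins (3-4).
toy_chromsizes = [("chr1", 250), ("chr2", 120)]
toy_records = bins_to_bed_records([4, 0, 2], toy_chromsizes, resolution=100)
```

Each record could then be written as one tab-separated BED line, since BED uses the same 0-based, half-open coordinates.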

def analyze_single_fold(split_folder: Path, metadata: Metadata, chromsizes: List[Tuple[str, int]], label_category: str, resolution: int, top_N_required: int, min_percentile: float, overwrite: bool) ‑> None

Analyze SHAP values for a single fold.

def analyze_subsamplings(shap_folder: Path, output_folder: Path, metadata: Metadata, chromsizes: List[Tuple[str, int]], label_category: str, subsample_categories: List[List[str]], resolution: int, overwrite: bool, top_N_required: int = 100, min_percentile: float = 80) ‑> None

Analyzes SHAP values for given subsampling category combinations for an individual split.

Args

shap_folder : Path: Path to the folder containing SHAP values.
output_folder : Path: Path to the folder where analysis results will be stored.
metadata : Metadata: Metadata object containing sample information.
chromsizes : List[Tuple[str, int]]: Chromosome sizes information.
label_category : str: Primary category for labels.
subsample_categories : List[List[str]]: List of category combinations for subsampling.
resolution : int: Resolution of the sample binning.
overwrite : bool: Flag to overwrite existing data.
top_N_required : int, optional: Number of top features required. Defaults to 100.
min_percentile : float, optional: Minimum percentile for filtering over samples (0 < val < 100). Defaults to 80.
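As a rough illustration of what a "category combination" in `subsample_categories` could mean, the sketch below groups samples by the values they take for each combination. It uses plain dictionaries rather than the real `Metadata` class, and the helper `group_samples` and the category names `assay` and `cell_type` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_samples(
    samples: Dict[str, Dict[str, str]], categories: List[str]
) -> Dict[Tuple[str, ...], List[str]]:
    """Hypothetical helper: group sample IDs by their values for `categories`."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for sample_id, labels in samples.items():
        key = tuple(labels[cat] for cat in categories)
        groups[key].append(sample_id)
    return dict(groups)


# Toy metadata: sample ID -> labels (names are illustrative only).
toy_samples = {
    "s1": {"assay": "h3k27ac", "cell_type": "T cell"},
    "s2": {"assay": "h3k27ac", "cell_type": "B cell"},
    "s3": {"assay": "rna_seq", "cell_type": "T cell"},
}
subsample_categories = [["assay"], ["assay", "cell_type"]]

# One grouping per category combination; each group would be analyzed separately.
groupings = [group_samples(toy_samples, combo) for combo in subsample_categories]
```

Under this reading, `[["assay"], ["assay", "cell_type"]]` would produce one analysis per assay, then one per (assay, cell type) pair.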

def get_resolution_from_path(path: Path) ‑> int

Extract resolution from path.

def main()

Main function.

def parse_arguments() ‑> argparse.Namespace

Argument parser for command line.