Module epiclass.utils.shap.analyze_shaps_kfold

For each fold (split folder) of a cross-validation, take precomputed SHAP values and analyze them.

The following analyses are performed:

- Writing BED files of the features that occur most frequently among the N highest SHAP values (by absolute value). For example, with N=100, the 100 highest absolute SHAP values are taken for each sample, the corresponding features are counted, and features that are present in X% of the samples are written to a BED file. This is done for each class separately, using the SHAP matrix of the output class. (There are SHAP values for every feature, for every class, for every sample, hence multiple matrices, one per class.)
  - For all samples of the classifier class
  - Subsampled by assay
  - Subsampled by cell type and assay
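The selection step described above can be sketched as follows. This is a minimal illustration and not the module's actual implementation: the function name `frequent_top_features` is hypothetical, the input is assumed to be a single class's SHAP matrix of shape (samples, features), and "present in X% of the samples" is assumed to mean "in at least X% of them".

```python
from collections import Counter
from typing import List

import numpy as np


def frequent_top_features(
    shap_matrix: np.ndarray, top_n: int, min_percentile: float
) -> List[int]:
    """Return the feature indices that fall in the top-N absolute SHAP
    values of at least `min_percentile` percent of the samples.

    Hypothetical helper: assumes one row per sample, one column per
    feature, and top_n >= 1.
    """
    n_samples, n_features = shap_matrix.shape
    top_n = min(top_n, n_features)

    # For each sample (row), indices of its top_n largest |SHAP| values.
    # The order inside the top-N set does not matter for counting.
    top_idx = np.argpartition(np.abs(shap_matrix), -top_n, axis=1)[:, -top_n:]

    # Count in how many samples each feature made it into the top N.
    counts = Counter(top_idx.ravel().tolist())

    # Assumption: the threshold is inclusive (>=).
    min_count = n_samples * min_percentile / 100
    return sorted(idx for idx, count in counts.items() if count >= min_count)
```

In the real analysis this would be repeated once per class matrix, and once per subsample of the samples.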

Functions

def analyze_shap_fold(extract_shap_values_and_info_output: Tuple[np.ndarray, List[str], List[Tuple[str, str]]], output_folder: Path, metadata: Metadata, label_category: str, chromsizes: List[Tuple[str, int]], resolution: int, overwrite: bool, top_N_required: int = 100, min_percentile: float = 80, copy_metadata: bool = True) ‑> Dict[str, Dict[~KT, ~VT]]

Analyzes SHAP values from one training fold and writes BED files of the most frequent important features.

Args

extract_shap_values_and_info_output : Tuple[np.ndarray, List[str], List[Tuple[str, str]]]
Output of extract_shap_values_and_info.
output_folder : Path
The directory to write the results to.
metadata : Metadata
Metadata object containing label information.
label_category : str
The category of labels to consider.
chromsizes : List[Tuple[str, int]]
List with chromosome names and sizes.
resolution : int
The resolution for binning.
overwrite : bool
Whether to overwrite existing files.
top_N_required : int, optional
The number of top SHAP values/features to consider per sample. Defaults to 100.
min_percentile : float, optional
The percentile value for feature frequency selection (0 < x < 100). Defaults to 80.

Returns

Dict[str, Dict]
A dictionary containing important features for each analyzed class label.

The keys are class labels, and values are dictionaries of important features for different percentiles.
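To illustrate how `chromsizes` and `resolution` relate feature indices to BED records, here is a hedged sketch. It assumes (this is not confirmed by the documentation above) that features are fixed-size genome bins laid out chromosome by chromosome, in the order of `chromsizes`, with the last bin of each chromosome truncated to the chromosome length. The function name `bins_to_bed` is hypothetical.

```python
from bisect import bisect_right
from typing import Iterable, List, Tuple


def bins_to_bed(
    bin_indices: Iterable[int],
    chromsizes: List[Tuple[str, int]],
    resolution: int,
) -> List[Tuple[str, int, int]]:
    """Convert global bin indices to (chrom, start, end) BED records.

    Assumption: bin 0 is the first bin of the first chromosome, and each
    chromosome contributes ceil(size / resolution) consecutive bins.
    """
    first_bins = []  # global index of the first bin of each chromosome
    offset = 0
    for _, size in chromsizes:
        first_bins.append(offset)
        offset += -(-size // resolution)  # ceiling division
    total_bins = offset

    records = []
    for idx in sorted(bin_indices):
        if not 0 <= idx < total_bins:
            raise IndexError(f"bin index {idx} is outside the genome")
        # Last chromosome whose first bin is <= idx.
        pos = bisect_right(first_bins, idx) - 1
        chrom, size = chromsizes[pos]
        start = (idx - first_bins[pos]) * resolution
        end = min(start + resolution, size)
        records.append((chrom, start, end))
    return records
```

Each record can then be written as a tab-separated line, e.g. `"\t".join(map(str, record))`.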

def analyze_single_fold(split_folder: Path, metadata: Metadata, chromsizes: List[Tuple[str, int]], label_category: str, resolution: int, top_N_required: int, min_percentile: float, overwrite: bool) ‑> None

Analyze SHAP values for a single fold.

def analyze_subsamplings(shap_folder: Path, output_folder: Path, metadata: Metadata, chromsizes: List[Tuple[str, int]], label_category: str, subsample_categories: List[List[str]], resolution: int, overwrite: bool, top_N_required: int = 100, min_percentile: float = 80) ‑> None

Analyzes SHAP values for given subsampling category combinations for an individual split.
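As a rough illustration of what a "subsampling category combination" means, the sketch below groups sample IDs by the values they take for a list of categories. It uses a plain dictionary in place of the `Metadata` object, whose interface is not shown here; the function name `group_samples` and the category names `assay` and `cell_type` are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_samples(
    metadata: Dict[str, Dict[str, str]], categories: List[str]
) -> Dict[Tuple[str, ...], List[str]]:
    """Group sample IDs by their values for the given categories.

    `metadata` maps a sample ID to its {category: value} dictionary
    (a stand-in for the real Metadata object).
    """
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for sample_id, info in metadata.items():
        key = tuple(info[category] for category in categories)
        groups[key].append(sample_id)
    return dict(groups)


# Hypothetical metadata and category combinations.
example_metadata = {
    "s1": {"assay": "h3k27ac", "cell_type": "t_cell"},
    "s2": {"assay": "h3k27ac", "cell_type": "b_cell"},
    "s3": {"assay": "rna_seq", "cell_type": "t_cell"},
}
subsample_categories = [["assay"], ["assay", "cell_type"]]

for combination in subsample_categories:
    # Each group would be analyzed separately, like a full fold.
    print(combination, group_samples(example_metadata, combination))
```

Under this reading, each inner list of `subsample_categories` yields one set of groups, and the SHAP analysis is rerun on the samples of each group.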

Args

shap_folder : Path
Path to the folder containing SHAP values.
output_folder : Path
Path to the folder where analysis results will be stored.
metadata : Metadata
Metadata object containing sample information.
chromsizes : List[Tuple[str, int]]
Chromosome names and sizes.
label_category : str
Primary category for labels.
subsample_categories : List[List[str]]
List of category combinations for subsampling.
resolution : int
Resolution of the sample binning.
overwrite : bool
Flag to overwrite existing data.
top_N_required : int
Number of top features required. Defaults to 100.
min_percentile : float
Minimum percentile for filtering over samples (0 < val < 100). Defaults to 80.

def get_resolution_from_path(path: Path) ‑> int

Extract resolution from path.
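The documentation does not state how the resolution is encoded in the path. One plausible convention, assumed here purely for illustration, is a token such as `100kb` or `1mb` in one of the path components. The function `resolution_from_path_sketch` below is hypothetical and is not the module's `get_resolution_from_path`.

```python
import re
from pathlib import Path

# Assumed unit suffixes and their multipliers in base pairs.
_UNITS = {"bp": 1, "kb": 1_000, "mb": 1_000_000}

# A number followed by a unit, not glued to other letters or digits.
_RESOLUTION_RE = re.compile(
    r"(?<![A-Za-z0-9])(\d+)(bp|kb|mb)(?![A-Za-z0-9])", flags=re.IGNORECASE
)


def resolution_from_path_sketch(path: Path) -> int:
    """Return the first resolution token found in `path`, in base pairs.

    Raises ValueError if no component contains such a token.
    """
    for part in path.parts:
        match = _RESOLUTION_RE.search(part)
        if match:
            return int(match.group(1)) * _UNITS[match.group(2).lower()]
    raise ValueError(f"No resolution token found in {path}")
```

With this convention, a hypothetical path like `Path("/data/hg38_100kb_all_none/split0/shap")` would give 100000.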

def main()

Main function.

def parse_arguments() ‑> argparse.Namespace

Argument parser for command line.