Module epiclass.utils.shap.prep_shap_run

Functions to prepare background and evaluation data for SHAP analysis.

Functions

def evaluate_background_ratios(category: str, metadata: Metadata, training_md5s: Iterable[str], n_samples_list: List[int], verbose: bool = True) ‑> Tuple[List[str], int]

Compares the ratio of each class of the specified category in the background data against the same ratio in the training data. The comparison is repeated for each sampling size in n_samples_list.

The function identifies trios of (track_type, assay, cell_type) in the datasets. For each sampling size in n_samples_list, it calculates the absolute difference in class ratios between the training and background data. The background data with the minimal ratio difference is selected for use in SHAP analysis.

Args:
- category (str): The category of class to analyze.
- metadata (Metadata): The metadata object containing dataset information.
- training_md5s (Iterable[str]): md5 hashes representing the training datasets.
- n_samples_list (List[int]): List of sampling sizes to evaluate.
- verbose (bool, optional): If True, prints detailed logs during processing. Defaults to True.

Returns:
- List[str]: List of md5 hashes representing the best background datasets based on minimal ratio difference.
- int: Sampling size used to select the best background datasets.
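
The ratio comparison described above can be illustrated with a minimal sketch. It is not the module's implementation: it works on plain label lists instead of a Metadata object, the helper names (class_ratios, total_ratio_difference, pick_best_sampling_size) are hypothetical, and summing the per-class absolute differences is an assumption, since the docstring does not say how the differences are aggregated.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def class_ratios(labels: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of samples belonging to each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def total_ratio_difference(
    training_labels: Iterable[str], background_labels: Iterable[str]
) -> float:
    """Sum of absolute per-class ratio differences (assumed aggregation)."""
    train = class_ratios(training_labels)
    background = class_ratios(background_labels)
    # A class missing from one side counts as ratio 0.0 on that side.
    classes = set(train) | set(background)
    return sum(abs(train.get(c, 0.0) - background.get(c, 0.0)) for c in classes)


def pick_best_sampling_size(
    training_labels: List[str], candidates: Dict[int, List[str]]
) -> Tuple[int, float]:
    """Pick the sampling size whose background labels best match training.

    `candidates` maps a sampling size to the class labels of the background
    datasets selected with that size (hypothetical input shape).
    """
    diffs = {
        n_samples: total_ratio_difference(training_labels, labels)
        for n_samples, labels in candidates.items()
    }
    best = min(diffs, key=diffs.get)
    return best, diffs[best]


training = ["a", "a", "b", "b"]
candidates = {1: ["a"], 2: ["a", "b"]}
best_n, best_diff = pick_best_sampling_size(training, candidates)
```

Here the background drawn with sampling size 2 reproduces the 50/50 class split of the training labels, so it is preferred over the size-1 background that contains a single class.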

def main()

Selects background data for SHAP analysis for each training fold.

def parse_arguments() ‑> argparse.Namespace

Parses the command-line arguments.

def select_datasets(metadata: Metadata, n=3, seed=42) ‑> List[str]

Selects a random subset of datasets for each unique trio of (track_type, assay, cell_type) found in the metadata. It samples up to 'n' datasets per trio; a trio with fewer than 'n' datasets contributes all of its datasets.

Args:
- metadata (Metadata): The metadata object containing dataset information.
- n (int, optional): The number of datasets to sample for each unique trio. Defaults to 3.
- seed (int, optional): The seed to use for random sampling. Defaults to 42.

Returns:
- List[str]: A list of md5 hashes representing the randomly selected datasets.

Note:
- If a trio has fewer than 'n' datasets, all datasets for that trio are included.
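
The per-trio sampling can be sketched as follows. This is an illustration under stated assumptions, not the module's code: the hypothetical sample_per_trio function takes a plain dict mapping each md5 to its (track_type, assay, cell_type) trio instead of a Metadata object, and sorting before sampling is an added choice to keep the result reproducible for a given seed.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Trio = Tuple[str, str, str]  # (track_type, assay, cell_type)


def sample_per_trio(records: Dict[str, Trio], n: int = 3, seed: int = 42) -> List[str]:
    """Sample up to `n` md5s for each unique trio found in `records`."""
    # Group md5s by their trio.
    groups: Dict[Trio, List[str]] = defaultdict(list)
    for md5, trio in records.items():
        groups[trio].append(md5)

    rng = random.Random(seed)  # local generator, leaves global state untouched
    selected: List[str] = []
    for trio in sorted(groups):
        md5s = sorted(groups[trio])
        # A trio with fewer than n datasets contributes all of them.
        selected.extend(rng.sample(md5s, min(n, len(md5s))))
    return selected


# Made-up md5 placeholders and trio values, for illustration only.
records = {
    "md5_a1": ("raw", "h3k27ac", "t_cell"),
    "md5_a2": ("raw", "h3k27ac", "t_cell"),
    "md5_a3": ("raw", "h3k27ac", "t_cell"),
    "md5_a4": ("raw", "h3k27ac", "t_cell"),
    "md5_a5": ("raw", "h3k27ac", "t_cell"),
    "md5_b1": ("pval", "rna_seq", "b_cell"),
    "md5_b2": ("pval", "rna_seq", "b_cell"),
}
selected = sample_per_trio(records, n=3, seed=42)
```

With these records, the first trio has five datasets and contributes three, while the second has only two and contributes both, so five md5s are returned in total.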