Module epiclass.utils.shap.prep_shap_run
Functions to prepare background and evaluation data for SHAP analysis.
Functions
def evaluate_background_ratios(category: str, metadata: Metadata, training_md5s: Iterable[str], n_samples_list: List[int], verbose: bool = True) ‑> Tuple[List[str], int]
Evaluates how closely the class ratios of the specified category in the background data match those in the training data. The comparison is repeated for each sampling size in n_samples_list.
The function identifies trios of (track_type, assay, cell_type) in the datasets. For each sampling size in n_samples_list, it calculates the absolute difference in class ratios between the training and background data. The background data with the smallest ratio difference is selected for use in SHAP analysis.
Args:
- category (str): The category of class to analyze.
- metadata (Metadata): The metadata object containing dataset information.
- training_md5s (Iterable[str]): md5 hashes representing the training datasets.
- n_samples_list (List[int]): List of sampling sizes to evaluate.
- verbose (bool, optional): If True, prints detailed logs during processing. Defaults to True.
Returns:
- List[str]: List of md5 hashes representing the best background datasets based on minimal ratio difference.
- int: Sampling size used to select the best background datasets.
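The selection loop described above can be sketched roughly as follows. This is a minimal illustration, not the module's implementation: it replaces the Metadata object with a plain dict of per-dataset records, assumes the background is sampled from the training datasets, and assumes the per-class absolute differences are summed into one score. The names `sample_per_trio`, `class_ratios` and `pick_background` are hypothetical.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical stand-in for Metadata: md5 -> {"track_type": ..., "assay": ..., ...}
Records = Dict[str, Dict[str, str]]


def sample_per_trio(records: Records, n: int, seed: int = 42) -> List[str]:
    """Sample up to n md5s for each (track_type, assay, cell_type) trio."""
    rng = random.Random(seed)
    trios = defaultdict(list)
    for md5 in sorted(records):  # sorted for reproducible sampling
        rec = records[md5]
        trios[(rec["track_type"], rec["assay"], rec["cell_type"])].append(md5)
    selected: List[str] = []
    for trio in sorted(trios):
        md5s = trios[trio]
        selected.extend(rng.sample(md5s, min(n, len(md5s))))
    return selected


def class_ratios(records: Records, md5s: Iterable[str], category: str) -> Dict[str, float]:
    """Fraction of datasets belonging to each class of the given category."""
    counts = Counter(records[md5][category] for md5 in md5s)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def pick_background(
    records: Records,
    training_md5s: Iterable[str],
    category: str,
    n_samples_list: List[int],
    seed: int = 42,
) -> Tuple[List[str], int]:
    """Return the candidate background (and its n) closest to the training ratios."""
    training = sorted(set(training_md5s))
    if not training or not n_samples_list:
        raise ValueError("training_md5s and n_samples_list must be non-empty")
    train_records = {md5: records[md5] for md5 in training}
    train_ratios = class_ratios(records, training, category)

    best_md5s: List[str] = []
    best_n = n_samples_list[0]
    best_diff = float("inf")
    for n in n_samples_list:
        # Assumption: the background is drawn from the training datasets.
        background = sample_per_trio(train_records, n, seed)
        bg_ratios = class_ratios(records, background, category)
        labels = set(train_ratios) | set(bg_ratios)
        # Assumption: per-class absolute differences are summed into one score.
        diff = sum(abs(train_ratios.get(l, 0.0) - bg_ratios.get(l, 0.0)) for l in labels)
        if diff < best_diff:  # strict comparison keeps the first n on ties
            best_md5s, best_n, best_diff = background, n, diff
    return best_md5s, best_n
```

In this sketch, a small n flattens the class distribution because every trio contributes roughly the same number of datasets, while a larger n lets well-populated trios contribute more and so tends to track the training ratios more closely.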
def main()
Selects background data for SHAP analysis for each training fold.
def parse_arguments() ‑> argparse.Namespace
Parses the command-line arguments.
def select_datasets(metadata: Metadata, n=3, seed=42) ‑> List[str]
Selects a random subset of datasets for each unique trio of (track_type, assay, cell_type) found in the metadata. It samples 'n' datasets for each trio when that many are available, and fewer when a trio has fewer than 'n' datasets.
Args:
- metadata (Metadata): The metadata object containing dataset information.
- n (int, optional): The number of datasets to sample for each unique trio. Defaults to 3.
- seed (int, optional): The seed to use for random sampling. Defaults to 42.
Returns:
- List[str]: A list of md5 hashes representing the randomly selected datasets.
Note:
- If a trio has fewer than 'n' datasets, all datasets for that trio are included.
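The per-trio sampling rule can be illustrated with a short sketch. It is an approximation under stated assumptions rather than the module's code: a plain dict of records stands in for the Metadata object, and the helper name `select_per_trio` is hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List

# Hypothetical stand-in for Metadata: md5 -> {"track_type": ..., "assay": ..., "cell_type": ...}
Records = Dict[str, Dict[str, str]]


def select_per_trio(records: Records, n: int = 3, seed: int = 42) -> List[str]:
    """Randomly pick up to n md5s for each (track_type, assay, cell_type) trio."""
    rng = random.Random(seed)  # local generator, so global random state is untouched

    # Group md5s by trio; iterating in sorted order keeps results reproducible.
    trios = defaultdict(list)
    for md5 in sorted(records):
        rec = records[md5]
        trios[(rec["track_type"], rec["assay"], rec["cell_type"])].append(md5)

    selected: List[str] = []
    for trio in sorted(trios):
        md5s = trios[trio]
        # A trio with fewer than n datasets contributes all of them.
        selected.extend(rng.sample(md5s, min(n, len(md5s))))
    return selected


if __name__ == "__main__":
    demo = {f"h3k27ac_{i}": {"track_type": "raw", "assay": "h3k27ac", "cell_type": "t_cell"}
            for i in range(5)}
    demo.update({f"input_{i}": {"track_type": "raw", "assay": "input", "cell_type": "t_cell"}
                 for i in range(2)})
    print(select_per_trio(demo, n=3))
```

With five datasets in one trio and two in the other, the call with n=3 returns five md5s: three sampled from the first trio and both datasets from the second.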