Module `epiclass.utils.shap.subset_features_handling`

Module that defines functions useful for handling data subsets and their 'important' features. Used on k-fold SHAP analysis results.

Functions

def aggregate_feature_sets(filtered_class_features: Dict[str, Dict[str, Set[int]]], verbose: bool) ‑> Dict[str, Set[int]]

Aggregates and organizes feature sets from various sources based on output classes and subsamplings.

Create union of feature sets.
- For a given subsampling, take the union of features for each output class.
- For a given output class, take the union of features for each assay.
- Create global union (diagonal in intersection matrix) feature set.

Args

filtered_class_features : Dict[str, Dict[str, Set[int]]]: A dictionary containing the features that meet the minimum count threshold for each output class of each subsampling.
verbose : bool: Whether to print verbose output.

Returns

Dict[str, Set[int]]: A dictionary containing the aggregated feature sets.
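
The three unions described above can be sketched as follows. This is a minimal illustration and not the module's implementation: the function name `union_sketch` and the output key names (`subsampling:...`, `class:...`, `global`) are hypothetical, and each subsampling key is assumed to correspond to one assay.

```python
from typing import Dict, Set


def union_sketch(
    filtered: Dict[str, Dict[str, Set[int]]]
) -> Dict[str, Set[int]]:
    """Hypothetical sketch of the per-subsampling, per-class and global unions."""
    result: Dict[str, Set[int]] = {}
    global_union: Set[int] = set()
    for subsampling, class_features in filtered.items():
        # Union over all output classes within one subsampling.
        sub_union: Set[int] = set().union(*class_features.values())
        result[f"subsampling:{subsampling}"] = sub_union
        global_union |= sub_union
        for output_class, features in class_features.items():
            # Union over all subsamplings for one output class.
            # (ASSUMPTION: key naming scheme is illustrative only.)
            result.setdefault(f"class:{output_class}", set()).update(features)
    result["global"] = global_union
    return result
```

The sketch builds new sets rather than reusing the input sets, so the input dictionary is left unmodified.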

def collect_all_features_from_feature_count_file(path: str | Path, n: int = 8)

Collect all features from feature count file that are present in at least n splits.

Returns

List[int]: A sorted list of all features in the feature count file that are present in at least n splits.

def collect_features_from_feature_count_file(path: str | Path, n: int = 8)

Collect features from feature count file that are present in at least n splits.

Returns

selected_features (Dict[str, List[int]]): A dictionary where keys are classifier output classes and values are lists of features.
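
The "present in at least n splits" selection can be illustrated with a short sketch. The layout of the feature count file is not documented here, so the code below assumes a JSON object mapping each output class to a list of `[feature, count]` pairs; that layout and the names `select_features_sketch` and `all_features_sketch` are assumptions, not the module's actual code.

```python
import json
from pathlib import Path
from typing import Dict, List, Union


def select_features_sketch(path: Union[str, Path], n: int = 8) -> Dict[str, List[int]]:
    """Hypothetical sketch: per class, keep features counted in at least n splits."""
    # ASSUMPTION: the file maps each output class to [feature, count] pairs.
    with open(path, "r", encoding="utf-8") as handle:
        counts = json.load(handle)
    return {
        output_class: sorted(int(feature) for feature, count in pairs if count >= n)
        for output_class, pairs in counts.items()
    }


def all_features_sketch(path: Union[str, Path], n: int = 8) -> List[int]:
    """Hypothetical sketch: sorted union of the per-class selections."""
    selected = select_features_sketch(path, n)
    return sorted(set().union(*selected.values()))
```

With the default `n = 8`, a feature has to appear in at least 8 splits to be selected; raising `n` makes the selection stricter.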

def exclude_track_subsamplings(all_features_counts: Dict[str, Dict]) ‑> Dict[str, Dict[~KT, ~VT]]

Exclude track subsamplings from all_features_counts, except for (m)rna-seq.

def filter_feature_sets(all_features_counts: Dict[str, Dict], minimum_count: int = 8) ‑> Dict[str, Dict[str, Set[int]]]

Return a dictionary containing the features that meet the minimum count threshold for each output class of each subsampling.

Args

all_features_counts : Dict[str, Dict]: A dictionary containing the feature counts for each subsampling/folder.
minimum_count : int, optional: The minimum count threshold for features to be included. Defaults to 8.

Returns

Dict[str, Dict[str, Set[int]]]: A dictionary containing the features that meet the minimum count threshold for each output class of each subsampling.
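
A minimal sketch of this kind of threshold filtering is shown below. The inner structure of `all_features_counts` is not specified in this documentation, so the sketch assumes each output class maps to an iterable of `(feature, count)` pairs; that structure and the name `filter_sketch` are hypothetical.

```python
from typing import Any, Dict, Set


def filter_sketch(
    all_counts: Dict[str, Dict[str, Any]], minimum_count: int = 8
) -> Dict[str, Dict[str, Set[int]]]:
    """Hypothetical sketch: keep features whose count reaches minimum_count."""
    # ASSUMPTION: all_counts[subsampling][output_class] is an iterable
    # of (feature, count) pairs.
    return {
        subsampling: {
            output_class: {
                int(feature) for feature, count in pairs if count >= minimum_count
            }
            for output_class, pairs in class_counts.items()
        }
        for subsampling, class_counts in all_counts.items()
    }
```

An output class with no feature reaching the threshold is kept with an empty set in this sketch, so the nesting of the result mirrors the input.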

def flatten_feature_sets(feature_sets: Dict[str, Dict[str, Set[int]]]) ‑> Dict[str, Set[int]]

Flatten the nested feature sets in feature_sets into a single-level dictionary, merging the outer and inner key names.

Args

feature_sets : Dict[str, Dict[str, Set[int]]]: A dictionary containing the feature sets with new key names.

Returns

Dict[str, Set[int]]: A dictionary containing the flattened feature sets.
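
Flattening a two-level dictionary by merging key names can be sketched as below. How the module actually joins the key names is not stated, so the underscore separator and the name `flatten_sketch` are assumptions made for illustration.

```python
from typing import Dict, Set


def flatten_sketch(
    feature_sets: Dict[str, Dict[str, Set[int]]], sep: str = "_"
) -> Dict[str, Set[int]]:
    """Hypothetical sketch: merge outer and inner keys into one flat key."""
    # ASSUMPTION: keys are joined with `sep`; the real separator may differ.
    return {
        f"{outer}{sep}{inner}": set(features)
        for outer, inner_sets in feature_sets.items()
        for inner, features in inner_sets.items()
    }
```

Each set is copied, so later changes to the flattened result do not affect the nested input.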

def process_all_subsamplings(jsons_parent_folder: Path, aggregate: bool, minimum_count: int, verbose: bool) ‑> Dict[str, Set[int]]

Process all subsamplings and aggregate feature sets (see aggregate_feature_sets).

Args

jsons_parent_folder : Path: The path to the folders containing feature_count.json files.
aggregate : bool: Whether to aggregate feature sets.
minimum_count : int: The minimum count threshold for features to be included.
verbose : bool: Whether to print verbose output.

Returns

Dict[str, Set[int]]: A dictionary containing the desired feature sets.

def read_all_feature_sets(jsons_parent_folder: Path) ‑> Dict[str, Dict[~KT, ~VT]]

Return a dictionary containing the important feature sets for each subsampling/folder.

Args

jsons_parent_folder : Path: The path to the folders containing feature_count.json files.

Returns

Dict[str, Dict]: A dictionary containing the important feature fold counts for each subsampling/folder.
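
Reading one feature_count.json per subsampling folder could look roughly like the sketch below. It assumes the JSON files sit in direct child folders of `jsons_parent_folder` and that the folder name is used as the subsampling key; both points, and the name `read_counts_sketch`, are assumptions rather than documented behaviour.

```python
import json
from pathlib import Path
from typing import Any, Dict


def read_counts_sketch(parent: Path) -> Dict[str, Dict[str, Any]]:
    """Hypothetical sketch: load feature_count.json from each child folder."""
    results: Dict[str, Dict[str, Any]] = {}
    # ASSUMPTION: one feature_count.json per direct child folder.
    for json_path in sorted(parent.glob("*/feature_count.json")):
        with open(json_path, "r", encoding="utf-8") as handle:
            # The folder name serves as the subsampling key.
            results[json_path.parent.name] = json.load(handle)
    return results
```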