Module epiclass.utils.shap.subset_features_handling
Module that defines functions useful for handling data subsets and their 'important' features. Used on k-fold SHAP analysis results.
Functions
def aggregate_feature_sets(filtered_class_features: Dict[str, Dict[str, Set[int]]], verbose: bool) ‑> Dict[str, Set[int]]
-
Aggregates and organizes feature sets from various sources based on output classes and subsamplings.
Create unions of feature sets:
- For a given subsampling, take the union of features for each output class.
- For a given output class, take the union of features for each assay.
- Create the global union (diagonal in intersection matrix) feature set.
Args
filtered_class_features
:Dict[str, Dict[str, Set[int]]]
- A dictionary containing the features that meet the minimum count threshold for each output class of each subsampling.
verbose
:bool
- Whether to print verbose output.
Returns
Dict[str, Set[int]]
- A dictionary containing the aggregated feature sets.
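The three unions described above can be illustrated with plain set operations. This is a minimal sketch, not the module's implementation: the helper names (union_per_subsampling, union_per_class, global_union) and the example keys are hypothetical, and the key names of the dictionary actually returned by aggregate_feature_sets are not documented here.

```python
from typing import Dict, Set

# Hypothetical input shaped like filtered_class_features:
# {subsampling: {output_class: set of feature indices}}
NestedSets = Dict[str, Dict[str, Set[int]]]


def union_per_subsampling(filtered: NestedSets) -> Dict[str, Set[int]]:
    """For each subsampling, union the features of all its output classes."""
    return {
        subsampling: set().union(*class_sets.values())
        for subsampling, class_sets in filtered.items()
    }


def union_per_class(filtered: NestedSets) -> Dict[str, Set[int]]:
    """For each output class, union its features across all subsamplings."""
    merged: Dict[str, Set[int]] = {}
    for class_sets in filtered.values():
        for class_name, features in class_sets.items():
            merged.setdefault(class_name, set()).update(features)
    return merged


def global_union(filtered: NestedSets) -> Set[int]:
    """Union of every feature set, regardless of subsampling or class."""
    all_features: Set[int] = set()
    for class_sets in filtered.values():
        for features in class_sets.values():
            all_features |= features
    return all_features


# Made-up example data (keys and indices are illustrative only).
example: NestedSets = {
    "assay_A": {"class_1": {1, 2}, "class_2": {2, 3}},
    "assay_B": {"class_1": {3, 4}},
}
per_subsampling = union_per_subsampling(example)
per_class = union_per_class(example)
everything = global_union(example)
```

The sketch builds new sets throughout, so the input dictionary is left unmodified.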
def collect_all_features_from_feature_count_file(path: str | Path, n: int = 8)
-
Collect all features from feature count file that are present in at least n splits.
Returns
List[int]
- A sorted list of all features in the feature count file that are present in at least n splits.
def collect_features_from_feature_count_file(path: str | Path, n: int = 8)
-
Collect features from feature count file that are present in at least n splits.
Returns
Dict[str, List[int]]
- A dictionary (selected_features) where keys are classifier output classes and values are lists of features.
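The two collection functions above can be approximated as follows. This is a sketch under a loud assumption: the layout of the feature count file is not specified in this page, so the code assumes a JSON object of the form {output_class: {feature_index: split_count}}. The function names ending in _sketch are hypothetical and are not part of epiclass.

```python
import json
from pathlib import Path
from typing import Dict, List, Union


def collect_features_sketch(path: Union[str, Path], n: int = 8) -> Dict[str, List[int]]:
    """Per output class, keep the features counted in at least n splits."""
    # ASSUMPTION: JSON shaped like {class_name: {feature_index: split_count}}.
    with open(path, "r", encoding="utf-8") as handle:
        counts = json.load(handle)
    return {
        class_name: sorted(
            int(feature)  # JSON object keys are always strings
            for feature, count in feature_counts.items()
            if count >= n
        )
        for class_name, feature_counts in counts.items()
    }


def collect_all_features_sketch(path: Union[str, Path], n: int = 8) -> List[int]:
    """Sorted union, over all classes, of the features kept above."""
    per_class = collect_features_sketch(path, n)
    return sorted(set().union(*per_class.values()))
```

Because JSON object keys are strings, the sketch converts feature indices to int before sorting, so that 3 sorts before 10.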
def exclude_track_subsamplings(all_features_counts: Dict[str, Dict]) ‑> Dict[str, Dict[~KT, ~VT]]
-
Exclude track subsamplings from all_features_counts, except for (m)rna-seq.
def filter_feature_sets(all_features_counts: Dict[str, Dict], minimum_count: int = 8) ‑> Dict[str, Dict[str, Set[int]]]
-
Return a dictionary containing the features that meet the minimum count threshold for each output class of each subsampling.
Args
all_features_counts
:Dict[str, Dict]
- A dictionary containing the feature counts for each subsampling/folder.
minimum_count
:int, optional
- The minimum count threshold for features to be included. Defaults to 8.
Returns
Dict[str, Dict[str, Set[int]]]
- A dictionary containing the features that meet the minimum count threshold for each
output class of each subsampling.
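The thresholding step can be sketched as a nested comprehension. This is an illustration only: it assumes all_features_counts has the shape {subsampling: {output_class: {feature_index: count}}}, which this page does not confirm, and filter_feature_sets_sketch is a hypothetical name.

```python
from typing import Dict, Mapping, Set


def filter_feature_sets_sketch(
    all_features_counts: Mapping[str, Mapping[str, Mapping[str, int]]],
    minimum_count: int = 8,
) -> Dict[str, Dict[str, Set[int]]]:
    """Keep, per subsampling and output class, features with count >= minimum_count."""
    # ASSUMPTION: innermost mapping is {feature_index: count}.
    return {
        subsampling: {
            class_name: {
                int(feature)
                for feature, count in counts.items()
                if count >= minimum_count
            }
            for class_name, counts in class_counts.items()
        }
        for subsampling, class_counts in all_features_counts.items()
    }


# Made-up counts for illustration.
example_counts = {
    "assay_A": {"class_1": {"5": 10, "6": 7}, "class_2": {"6": 8}},
    "assay_B": {"class_1": {"9": 3}},
}
filtered_example = filter_feature_sets_sketch(example_counts)
```

In this sketch a class whose features all fall below the threshold keeps its key with an empty set; whether the real function keeps or drops such keys is not stated in this page.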
def flatten_feature_sets(feature_sets: Dict[str, Dict[str, Set[int]]]) ‑> Dict[str, Set[int]]
-
Flatten the feature sets from feature_sets into a single layer/depth. Merges the key names.
Args
feature_sets
:Dict[str, Dict[str, Set[int]]]
- A nested dictionary containing the feature sets to flatten.
Returns
Dict[str, Set[int]]
- A dictionary containing the flattened feature sets.
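Flattening a two-level dictionary while merging key names might look like the sketch below. How the real function joins the outer and inner keys is not documented here; the underscore separator and the name flatten_feature_sets_sketch are assumptions made for illustration.

```python
from typing import Dict, Set


def flatten_feature_sets_sketch(
    feature_sets: Dict[str, Dict[str, Set[int]]],
    sep: str = "_",  # ASSUMPTION: the actual separator is not documented
) -> Dict[str, Set[int]]:
    """Turn {outer: {inner: features}} into {outer + sep + inner: features}."""
    return {
        f"{outer}{sep}{inner}": set(features)  # copy so the input is not shared
        for outer, inner_sets in feature_sets.items()
        for inner, features in inner_sets.items()
    }


nested_example = {
    "assay_A": {"class_1": {1, 2}, "class_2": {3}},
    "assay_B": {"class_1": {4}},
}
flat_example = flatten_feature_sets_sketch(nested_example)
```

If two different (outer, inner) pairs produce the same merged key, the later one silently overwrites the earlier one in this sketch, so a separator that cannot appear in the key names is the safer choice.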
def process_all_subsamplings(jsons_parent_folder: Path, aggregate: bool, minimum_count: int, verbose: bool) ‑> Dict[str, Set[int]]
-
Process all subsamplings and aggregate feature sets (see aggregate_feature_sets).
Args
jsons_parent_folder
:Path
- The path to the folders containing feature_count.json files.
aggregate
:bool
- Whether to aggregate feature sets.
minimum_count
:int
- The minimum count threshold for features to be included.
verbose
:bool
- Whether to print verbose output.
Returns
Dict[str, Set[int]]
- A dictionary containing the desired feature sets.
def read_all_feature_sets(jsons_parent_folder: Path) ‑> Dict[str, Dict[~KT, ~VT]]
-
Return a dictionary containing the important feature sets for each subsampling/folder.
Args
jsons_parent_folder
:Path
- The path to the folders containing feature_count.json files.
Returns
Dict[str, Dict]
- A dictionary containing the important feature fold counts for each subsampling/folder.