Module epiclass.utils.winsorize_hdf5
Module: winsorize_hdf5
This module provides utility functions for winsorizing HDF5 files. The main function processes a list of HDF5 files. For each file, it concatenates all chromosome datasets into one array, winsorizes that global array, and splits the result back into individual chromosome datasets. The modified HDF5 files are saved in the specified output directory.
Note: This module requires the h5py and scipy libraries, and the epiclass package for additional utilities.
Usage: The main function processes HDF5 files based on the provided command line arguments. It expects the following arguments:
- hdf5_list (a file with HDF5 filenames)
- output_dir (the directory where the modified HDF5 files will be created)
- n_jobs (optional; the number of parallel jobs to run)
Running the module as a script executes the main function.
Example: $ python winsorize_hdf5.py hdf5_list.txt output_directory -n 4
Author: Joanny Raby
Functions
def main() ‑> None
-
Main function. Parses the command line arguments and processes the listed HDF5 files.
def parse_arguments() ‑> argparse.Namespace
-
Parses the command line arguments.
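A minimal sketch of a parser accepting the three documented arguments, not the module's actual code. The `argv` parameter, the long option name `--n_jobs`, the default of 1, and the help strings are assumptions made for illustration; only the positional names and the `-n` flag come from the usage text above.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_arguments(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Hypothetical parser mirroring the arguments named in the module docstring."""
    parser = argparse.ArgumentParser(description="Winsorize HDF5 files.")
    parser.add_argument("hdf5_list", type=Path, help="File listing HDF5 filenames.")
    parser.add_argument("output_dir", type=Path, help="Directory for the modified files.")
    # "-n" appears in the usage example; the long name and default are assumptions.
    parser.add_argument("-n", "--n_jobs", type=int, default=1, help="Number of parallel jobs.")
    return parser.parse_args(argv)


# Same arguments as the documented command line example.
args = parse_arguments(["hdf5_list.txt", "output_directory", "-n", "4"])
print(args.hdf5_list, args.output_dir, args.n_jobs)
```

Passing `argv` explicitly makes the parser easy to exercise without touching `sys.argv`; with `argv=None`, argparse falls back to the real command line.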
def process_file(og_hdf5_path: Path, output_dir: Path, limits: Tuple[float, float]) ‑> None
-
Processes an hdf5 file by Winsorizing the dataset within specified limits.
The function performs the following steps:
1. Concatenates all chromosome datasets from the hdf5 file into a single numpy array.
2. Applies Winsorization to the combined numpy array within the provided limits.
3. Splits the Winsorized array back into chromosome datasets and updates the hdf5 file.
Args
og_hdf5_path
:Path
- The path to the original hdf5 file.
output_dir
:Path
- The directory to save the processed hdf5 file.
limits
:tuple
- A tuple of two floats representing the lower and upper limits for Winsorization.
Returns
None. The function writes the processed data to a new hdf5 file in the specified output directory.
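The three steps of process_file can be sketched on in-memory arrays as follows. This is an illustration under stated assumptions, not the module's implementation: a plain dict stands in for the HDF5 file, the helper name `winsorize_chromosomes` and the keys `chr1`/`chr2` are hypothetical, and `scipy.stats.mstats.winsorize` is assumed as the winsorization routine because the module lists scipy as a requirement. Reading and writing the file with h5py is left out.

```python
from typing import Dict, Tuple

import numpy as np
from scipy.stats.mstats import winsorize


def winsorize_chromosomes(
    chrom_data: Dict[str, np.ndarray], limits: Tuple[float, float]
) -> Dict[str, np.ndarray]:
    """Winsorize all chromosome arrays jointly, then split them back (hypothetical helper)."""
    names = list(chrom_data)
    # Step 1: concatenate every chromosome into one global array.
    combined = np.concatenate([chrom_data[name] for name in names])
    # Step 2: winsorize globally, so the clipping values come from the whole file.
    clipped = np.ma.getdata(winsorize(combined, limits=limits))
    # Step 3: split at the cumulative chromosome lengths to restore each dataset.
    boundaries = np.cumsum([len(chrom_data[name]) for name in names])[:-1]
    return dict(zip(names, np.split(clipped, boundaries)))


# Toy data with one extreme value in the second chromosome.
data = {
    "chr1": np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
    "chr2": np.array([6.0, 7.0, 8.0, 9.0, 1000.0]),
}
result = winsorize_chromosomes(data, limits=(0, 0.1))
```

Winsorizing the concatenated array rather than each chromosome separately means an extreme value is judged against the whole file, while the split keeps every chromosome at its original length.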
def winsorize_dataset(array: np.ndarray, limits: Tuple[float, float] = (0, 0.01)) ‑> numpy.ndarray
-
Winsorizes a dataset.
Args
array
- The vector to be winsorized.
limits
- Pair of (lower limit, upper limit) for the winsorization. Defaults to (0, 0.01).
Returns
The winsorized dataset.
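A hedged sketch of what such a function could look like, assuming it wraps `scipy.stats.mstats.winsorize` (the source only states that scipy is required). Under that assumption, the limits are fractions of the data to clip at each tail, so the default (0, 0.01) leaves the low end untouched and caps roughly the top 1% of values.

```python
from typing import Tuple

import numpy as np
from scipy.stats.mstats import winsorize


def winsorize_dataset(array: np.ndarray, limits: Tuple[float, float] = (0, 0.01)) -> np.ndarray:
    """Sketch only: clip the tails of `array` using scipy's winsorize."""
    # winsorize returns a masked array; getdata extracts the plain ndarray.
    return np.ma.getdata(winsorize(array, limits=limits))


values = np.arange(1000, dtype=float)
clipped = winsorize_dataset(values)
# The largest values are replaced by a lower cap; the low end is untouched.
print(values.max(), clipped.max(), clipped.min())
```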