Module epiclass.utils.clean_hdf5
Module: clean_hdf5
This module cleans genomic data stored in HDF5 files by applying a blacklist filter: values in blacklisted genomic intervals are set to zero. It reads a BED (Browser Extensible Data) file containing these intervals, preprocesses them, converts them to the positions to treat at the given bin resolution, and applies the changes to the HDF5 datasets.
The module primarily consists of the following functions:
load_bed(path: Path | str) -> BEDIntervals:
Loads the contents of a BED file.
preprocess_bed(bed: BEDIntervals) -> Dict[str, BEDIntervals]:
Preprocesses the BED intervals.
get_positions_to_treat(blacklist_chrom_intervals: Dict[str, BEDIntervals], bin_resolution: int) -> Dict[str, List[int]]:
Generates the positions to treat based on the bin resolution and blacklist intervals.
process_file(og_hdf5_path: Path, positions_to_treat: Dict[str, List[int]], output_dir: Path):
Cleans one HDF5 file and saves a copy with blacklisted regions set to zero. This function assumes the positions to treat contain chromosome vector indices for the right resolution.
main() -> None:
The main driver function that organizes the entire process of reading, filtering and writing the HDF5 files.
This module also defines the following type aliases:
BEDInterval = Tuple[str, int, int]: Represents a single genomic interval from a BED file.
BEDIntervals = List[BEDInterval]: Represents a list of genomic intervals from a BED file.
Dependencies: This module requires the h5py library for HDF5 file operations, and the epiclass package for argument parsing and directory checking utilities.
Usage: Invoke the main function to process a list of HDF5 files according to command-line arguments:
hdf5_list: A file containing a list of HDF5 filenames.
bed_filter: A path to a BED file containing the blacklist positions.
output_dir: The directory where the modified HDF5 files will be saved.
n_jobs (optional): The number of parallel jobs to run.
Example: $ python clean_hdf5.py hdf5_list.txt blacklist.bed output_directory -n 4
Author: Joanny Raby
Functions
def check_file(og_hdf5_path: Path, positions_to_treat: Dict[str, List[int]]) ‑> bool
-
Checks one HDF5 file to ensure that all designated positions are zero.
Assumes the positions to treat contain chromosome vector indices for the right resolution.
Returns: True if all positions are zero, False otherwise.
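The check can be illustrated without touching an HDF5 file. The following is a minimal sketch, not the module's implementation: `all_positions_zero` is a hypothetical name, the chromosome vectors are plain in-memory sequences standing in for HDF5 datasets, and skipping chromosomes that are absent from the data is an assumption.

```python
from typing import Dict, List, Sequence


def all_positions_zero(
    chrom_vectors: Dict[str, Sequence[float]],
    positions_to_treat: Dict[str, List[int]],
) -> bool:
    """Return True if every designated position holds zero (hypothetical helper)."""
    for chrom, positions in positions_to_treat.items():
        vector = chrom_vectors.get(chrom)
        if vector is None:
            # Assumption: chromosomes missing from the data are ignored.
            continue
        for pos in positions:
            # Out-of-range indices are ignored rather than treated as errors.
            if 0 <= pos < len(vector) and vector[pos] != 0:
                return False
    return True
```

With a vector such as `{"chr1": [0.0, 0.0, 3.5]}`, checking positions `[0, 1]` passes, while checking position `2` fails.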
def get_positions_to_treat(blacklist_chrom_intervals: Dict[str, BEDIntervals], bin_resolution: int) ‑> Dict[str, List[int]]
-
Generate the positions to treat based on the bin resolution and blacklist intervals.
Args
blacklist_chrom_intervals
- A dictionary mapping chromosomes to sorted intervals.
bin_resolution
- The bin resolution (bin size, in base pairs).
Returns
A dictionary mapping chromosomes to a list of positions to be treated.
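One plausible way to turn intervals into bin indices is sketched below. This is an illustration under stated assumptions, not the module's code: `positions_to_treat_sketch` is a hypothetical name, intervals are treated as half-open `[start, end)` as in the BED convention, and a bin is selected as soon as an interval touches it.

```python
from typing import Dict, List, Tuple

BEDInterval = Tuple[str, int, int]
BEDIntervals = List[BEDInterval]


def positions_to_treat_sketch(
    blacklist_chrom_intervals: Dict[str, BEDIntervals],
    bin_resolution: int,
) -> Dict[str, List[int]]:
    """Map each chromosome to the sorted bin indices touched by its intervals."""
    positions: Dict[str, List[int]] = {}
    for chrom, intervals in blacklist_chrom_intervals.items():
        bins = set()
        for _, start, end in intervals:
            if end <= start:
                continue  # skip empty or malformed intervals
            first_bin = start // bin_resolution
            # Assumption: end is exclusive, so the last covered base is end - 1.
            last_bin = (end - 1) // bin_resolution
            bins.update(range(first_bin, last_bin + 1))
        positions[chrom] = sorted(bins)
    return positions
```

For example, at a resolution of 100, the intervals `(50, 150)` and `(120, 310)` on the same chromosome together touch bins 0 through 3. Using a set removes the duplicate bins produced by overlapping intervals.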
def load_bed(path: Path | str)
-
Loads the contents of a BED file.
Args
path
- The path to the BED file.
Returns
A list of tuples representing the BED intervals (chromosome, start, end).
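A minimal sketch of such a loader follows, assuming a whitespace-delimited BED file whose first three columns are chromosome, start and end. `load_bed_sketch` is a hypothetical name, and skipping comment, `track` and `browser` lines is an assumption about the input rather than documented behaviour of `load_bed`.

```python
from pathlib import Path
from typing import List, Tuple, Union

BEDInterval = Tuple[str, int, int]
BEDIntervals = List[BEDInterval]


def load_bed_sketch(path: Union[Path, str]) -> BEDIntervals:
    """Read (chromosome, start, end) tuples from the first three BED columns."""
    intervals: BEDIntervals = []
    with open(path, "r", encoding="utf-8") as bed_file:
        for line in bed_file:
            line = line.strip()
            # Assumption: blank lines and header/comment lines carry no intervals.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            intervals.append((chrom, int(start), int(end)))
    return intervals
```

Any columns after the third (name, score, strand, and so on) are ignored in this sketch.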
def main() ‑> None
-
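Main driver function. Reads the command-line arguments, loads the BED blacklist, computes the positions to treat, and writes a cleaned copy of each listed HDF5 file to the output directory.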
def parse_arguments() ‑> argparse.Namespace
-
Parses the command-line arguments.
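The command-line interface described in the module usage could be declared roughly as follows. This is a sketch using plain `argparse`: `parse_arguments_sketch` is a hypothetical name, the default of one job is an assumption, and the actual function may rely on epiclass helpers (for directory checking, for instance) that are not reproduced here. Only the three positional arguments and the `-n` option come from the usage text.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_arguments_sketch(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the arguments listed in the module usage (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Set blacklisted regions of HDF5 files to zero."
    )
    parser.add_argument(
        "hdf5_list", type=Path, help="A file containing a list of HDF5 filenames."
    )
    parser.add_argument(
        "bed_filter", type=Path, help="A BED file containing the blacklist positions."
    )
    parser.add_argument(
        "output_dir", type=Path, help="Directory for the modified HDF5 files."
    )
    # Assumption: the default number of jobs is 1; the source does not state it.
    parser.add_argument(
        "-n", dest="n_jobs", type=int, default=1, help="Number of parallel jobs."
    )
    return parser.parse_args(argv)
```

Passing `argv=None` makes `argparse` read `sys.argv`, while passing an explicit list keeps the parser easy to exercise in tests.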
def preprocess_bed(bed: BEDIntervals) ‑> Dict[str, List[Tuple[str, int, int]]]
-
Preprocesses the BED intervals.
Args
bed
- A list of tuples representing the BED intervals (chromosome, start, end).
Returns
A dictionary mapping chromosomes to sorted intervals.
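The grouping and sorting step can be sketched as below. `preprocess_bed_sketch` is a hypothetical name, and sorting by start and then end is an assumption about what "sorted" means here. The sketch does not merge overlapping intervals, and whether the real function does is not stated.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BEDInterval = Tuple[str, int, int]
BEDIntervals = List[BEDInterval]


def preprocess_bed_sketch(bed: BEDIntervals) -> Dict[str, BEDIntervals]:
    """Group intervals by chromosome and sort each group by (start, end)."""
    by_chrom: Dict[str, BEDIntervals] = defaultdict(list)
    for interval in bed:
        by_chrom[interval[0]].append(interval)
    return {
        chrom: sorted(intervals, key=lambda iv: (iv[1], iv[2]))
        for chrom, intervals in by_chrom.items()
    }
```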
def print_h5py_structure(file: str) ‑> None
-
Prints the structure of an HDF5 file.
Args
file
- The path to the HDF5 file.
def process_file(og_hdf5_path: Path, positions_to_treat: Dict[str, List[int]], output_dir: Path)
-
Cleans one HDF5 file: saves a copy in which the bins that touch blacklisted regions are set to 0.
Assumes the positions to treat contain chromosome vector indices for the right resolution.
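The zeroing step itself can be shown on in-memory data. The sketch below is not the module's implementation: `zero_positions_sketch` is a hypothetical name, plain lists stand in for the per-chromosome HDF5 datasets, and ignoring unknown chromosomes and out-of-range indices is an assumption. Reading and writing the actual files with h5py is left out.

```python
from typing import Dict, List, Sequence


def zero_positions_sketch(
    chrom_vectors: Dict[str, Sequence[float]],
    positions_to_treat: Dict[str, List[int]],
) -> Dict[str, List[float]]:
    """Return a copy of the vectors with the designated positions set to 0."""
    # Copy first so that the original data is left untouched.
    cleaned = {chrom: list(values) for chrom, values in chrom_vectors.items()}
    for chrom, positions in positions_to_treat.items():
        if chrom not in cleaned:
            continue  # assumption: chromosomes absent from the data are skipped
        vector = cleaned[chrom]
        for pos in positions:
            if 0 <= pos < len(vector):
                vector[pos] = 0.0
    return cleaned
```

Working on a copy mirrors the documented behaviour of saving a cleaned copy rather than modifying the original file.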