Module epiclass.utils.clean_hdf5

Module: clean_hdf5

This module cleans genomic data stored in HDF5 files by applying a blacklist filter: values in blacklisted genomic intervals are set to zero. It reads a BED (Browser Extensible Data) file containing these intervals, preprocesses them, determines which positions must be treated at a given bin resolution, and applies the changes to the HDF5 datasets.

The module primarily consists of the following functions:

load_bed(path: Path | str) -> BEDIntervals:
Loads the contents of a BED file.

preprocess_bed(bed: BEDIntervals) -> Dict[str, BEDIntervals]:
Preprocesses the BED intervals.

get_positions_to_treat(blacklist_chrom_intervals: Dict[str, BEDIntervals], bin_resolution: int) -> Dict[str, List[int]]:
Generates the positions to treat based on the bin resolution and blacklist intervals.

process_file(og_hdf5_path: Path, positions_to_treat: Dict[str, List[int]], output_dir: Path):
Cleans one HDF5 file and saves a copy with blacklisted regions set to zero. This function assumes the positions to treat contain chromosome vector indices for the right resolution.

main() -> None:
The main driver function that organizes the entire process of reading, filtering and writing the HDF5 files.

This module also defines the following type aliases:

BEDInterval = Tuple[str, int, int]: Represents a single genomic interval from a BED file.
BEDIntervals = List[BEDInterval]: Represents a list of genomic intervals from a BED file.

Dependencies: This module requires the h5py library for HDF5 file operations, and the epiclass package for argument parsing and directory checking utilities.

Usage: Invoke the main function to process a list of HDF5 files according to command-line arguments:

hdf5_list: A file containing a list of HDF5 filenames.
bed_filter: A path to a BED file containing the blacklist positions.
output_dir: The directory where the modified HDF5 files will be saved.
n_jobs (optional): The number of parallel jobs to run.

Example: $ python clean_hdf5.py hdf5_list.txt blacklist.bed output_directory -n 4

Author: Joanny Raby

Functions

def check_file(og_hdf5_path: Path, positions_to_treat: Dict[str, List[int]]) ‑> bool

Checks one HDF5 file to ensure that all designated positions are zero.

Assumes the positions to treat contain chromosome vector indices for the right resolution.

Returns: True if all positions are zero, False otherwise.
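
The check described above can be illustrated without touching HDF5 at all. The following is a minimal sketch, not the module's implementation: plain Python lists stand in for the per-chromosome datasets, and the name positions_are_zero is hypothetical. Skipping chromosomes that are absent from the data and ignoring out-of-range indices are assumptions made for the example.

```python
from typing import Dict, List, Sequence


def positions_are_zero(
    chrom_values: Dict[str, Sequence[float]],
    positions_to_treat: Dict[str, List[int]],
) -> bool:
    """Return True if every designated position holds zero (hypothetical helper)."""
    for chrom, positions in positions_to_treat.items():
        values = chrom_values.get(chrom)
        if values is None:
            # Assumption: chromosomes missing from the data are skipped.
            continue
        # Assumption: indices past the end of the vector are ignored.
        if any(values[i] != 0 for i in positions if i < len(values)):
            return False
    return True
```

With real files, the lists would be replaced by the chromosome datasets read through h5py.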

def get_positions_to_treat(blacklist_chrom_intervals: Dict[str, BEDIntervals], bin_resolution: int) ‑> Dict[str, List[int]]

Generate the positions to treat based on the bin resolution and blacklist intervals.

Args

blacklist_chrom_intervals
A dictionary mapping chromosomes to sorted intervals.
bin_resolution
The bin resolution.

Returns

A dictionary mapping chromosomes to a list of positions to be treated.
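
One way to turn intervals into bin indices is sketched below. It is an illustration under stated assumptions rather than the module's actual code: intervals are taken to be 0-based and half-open, as in the BED format, bin i is taken to cover [i * bin_resolution, (i + 1) * bin_resolution), and every bin that overlaps an interval is selected. The name get_positions_sketch is hypothetical.

```python
from typing import Dict, List, Tuple

BEDInterval = Tuple[str, int, int]
BEDIntervals = List[BEDInterval]


def get_positions_sketch(
    blacklist_chrom_intervals: Dict[str, BEDIntervals],
    bin_resolution: int,
) -> Dict[str, List[int]]:
    """Map each chromosome to the sorted bin indices that overlap an interval."""
    positions: Dict[str, List[int]] = {}
    for chrom, intervals in blacklist_chrom_intervals.items():
        bins = set()
        for _, start, end in intervals:
            if end <= start:
                continue  # skip empty or malformed intervals
            first_bin = start // bin_resolution
            # end is exclusive, so the last covered base is end - 1
            last_bin = (end - 1) // bin_resolution
            bins.update(range(first_bin, last_bin + 1))
        positions[chrom] = sorted(bins)
    return positions
```

For example, at a resolution of 100, the interval (150, 320) overlaps bins 1, 2 and 3.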

def load_bed(path: Path | str)

Loads the contents of a BED file.

Args

path
The path to the BED file.

Returns

A list of tuples representing the BED intervals (chromosome, start, end).
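
A minimal sketch of such a loader follows. It is not the module's implementation: it assumes only the first three whitespace-separated columns matter and that blank lines and "#", "track" or "browser" header lines can be skipped. The name load_bed_sketch is hypothetical.

```python
from pathlib import Path
from typing import List, Tuple, Union

BEDInterval = Tuple[str, int, int]
BEDIntervals = List[BEDInterval]


def load_bed_sketch(path: Union[Path, str]) -> BEDIntervals:
    """Read (chromosome, start, end) from the first three columns of a BED file."""
    intervals: BEDIntervals = []
    with open(path, "r", encoding="utf-8") as bed_file:
        for line in bed_file:
            line = line.strip()
            # Assumption: blank lines and header lines carry no intervals.
            if not line or line.startswith(("#", "track", "browser")):
                continue
            chrom, start, end = line.split()[:3]
            intervals.append((chrom, int(start), int(end)))
    return intervals
```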

def main() ‑> None

Main driver function. Organizes the entire process of reading, filtering and writing the HDF5 files.

def parse_arguments() ‑> argparse.Namespace

Parses the command-line arguments.
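
The command-line interface listed under Usage could be declared roughly as follows. This sketch uses plain argparse, whereas the module relies on epiclass utilities for argument parsing and directory checking. The long option name --n_jobs, its default of 1, and the function name build_parser_sketch are assumptions; only the positional arguments and the -n flag appear in the documented example.

```python
import argparse
from pathlib import Path


def build_parser_sketch() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(
        description="Set blacklisted regions to zero in HDF5 files."
    )
    parser.add_argument(
        "hdf5_list", type=Path, help="A file containing a list of HDF5 filenames."
    )
    parser.add_argument(
        "bed_filter", type=Path, help="A BED file containing the blacklist positions."
    )
    parser.add_argument(
        "output_dir", type=Path, help="Directory where modified HDF5 files are saved."
    )
    # Assumption: the long name and the default value are not given in the source.
    parser.add_argument(
        "-n", "--n_jobs", type=int, default=1, help="Number of parallel jobs to run."
    )
    return parser
```

Calling build_parser_sketch().parse_args() with the arguments from the documented example yields a namespace whose n_jobs attribute is 4.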

def preprocess_bed(bed: BEDIntervals) ‑> Dict[str, List[Tuple[str, int, int]]]

Preprocesses the BED intervals.

Args

bed
A list of tuples representing the BED intervals (chromosome, start, end).

Returns

A dictionary mapping chromosomes to sorted intervals.
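
The documented return value suggests a group-then-sort step, which might look like the sketch below. Whether the actual function does anything more (for example, merging overlapping intervals) is not stated, so this example only groups by chromosome and sorts by start, then end. The name preprocess_bed_sketch is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BEDInterval = Tuple[str, int, int]
BEDIntervals = List[BEDInterval]


def preprocess_bed_sketch(bed: BEDIntervals) -> Dict[str, BEDIntervals]:
    """Group intervals by chromosome and sort each group by (start, end)."""
    grouped: Dict[str, BEDIntervals] = defaultdict(list)
    for chrom, start, end in bed:
        grouped[chrom].append((chrom, start, end))
    return {
        chrom: sorted(intervals, key=lambda interval: (interval[1], interval[2]))
        for chrom, intervals in grouped.items()
    }
```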

def print_h5py_structure(file: str) ‑> None

Prints the structure of an HDF5 file.

Args

file
The path to the HDF5 file.

def process_file(og_hdf5_path: Path, positions_to_treat: Dict[str, List[int]], output_dir: Path)

Cleans one HDF5 file. Saves a copy in which the regions that touch blacklisted regions are set to 0.

Assumes the positions to treat contain chromosome vector indices for the right resolution.
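
The core zeroing step can be sketched on in-memory data, leaving file handling aside. This is an illustration only: plain Python lists stand in for the chromosome datasets, the original data is left untouched to mirror the "save a copy" behaviour, and the name zero_positions is hypothetical. Skipping unknown chromosomes and out-of-range indices are assumptions.

```python
from typing import Dict, List, Sequence


def zero_positions(
    chrom_values: Dict[str, Sequence[float]],
    positions_to_treat: Dict[str, List[int]],
) -> Dict[str, List[float]]:
    """Return a copy of the data with the designated positions set to 0."""
    # Copy each vector so the input is never modified.
    cleaned = {chrom: list(values) for chrom, values in chrom_values.items()}
    for chrom, positions in positions_to_treat.items():
        if chrom not in cleaned:
            continue  # assumption: chromosomes absent from the data are skipped
        values = cleaned[chrom]
        for index in positions:
            if index < len(values):  # assumption: out-of-range bins are ignored
                values[index] = 0.0
    return cleaned
```

In the real workflow the copy would be written to output_dir as a new HDF5 file rather than returned.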