Module epiclass.utils.modify_metadata

Functions to perform more complex operations on the metadata.

Functions

def add_fake_epiatlas_metadata(metadata: Metadata) ‑> None

Add uuid and track_type info to non-EpiAtlas metadata. The uuid is set to the md5sum, and the track_type is set to raw.
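
As an illustration only, the behaviour described above can be sketched on plain dictionaries. The function name `add_fake_epiatlas_fields` and the mapping of md5sum to a dataset dict are assumptions made for this sketch; the real function operates on a `Metadata` object whose interface is not shown here.

```python
from typing import Dict


def add_fake_epiatlas_fields(datasets: Dict[str, dict]) -> None:
    """Hypothetical stand-in: fill 'uuid' and 'track_type' in place.

    `datasets` is assumed to map each md5sum to its dataset dict.
    """
    for md5sum, dset in datasets.items():
        dset["uuid"] = md5sum       # the uuid is the md5sum itself
        dset["track_type"] = "raw"  # every track is labelled as raw
```

Because the uuid equals the md5sum, each dataset ends up being its own single-track "experiment".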

def add_formated_date(metadata: Metadata) ‑> None

Add an 'upload_date_2' category, in YYYY-MM format, to the datasets.
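
A minimal sketch of the date conversion, under stated assumptions: the source category is assumed to be named 'upload_date' and to hold 'YYYY-MM-DD' strings, neither of which is confirmed by this page. The function name and the dict-based layout are hypothetical.

```python
from datetime import datetime
from typing import Dict


def add_formatted_date_sketch(datasets: Dict[str, dict]) -> None:
    """Hypothetical stand-in: add 'upload_date_2' in YYYY-MM format, in place."""
    for dset in datasets.values():
        raw = dset.get("upload_date")  # assumed source category
        if raw is None:
            continue  # nothing to convert for this dataset
        # Assumed input format: 'YYYY-MM-DD'.
        parsed = datetime.strptime(raw, "%Y-%m-%d")
        dset["upload_date_2"] = parsed.strftime("%Y-%m")
```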

def add_random_group(metadata: Metadata, seed=42, n_split=23) ‑> str

Add a 'random_seed{seed}_{n_split}splits' category made of n_split random groups assigned per uuid (different tracks of the same uuid share the same value).

Return the name of the new category.
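
The per-uuid grouping can be sketched as follows. This is not the module's implementation: the function name, the dict-based layout, and the use of an independent uniform draw per uuid are assumptions (the real function may, for example, balance group sizes).

```python
import random
from typing import Dict


def add_random_group_sketch(
    datasets: Dict[str, dict], seed: int = 42, n_split: int = 23
) -> str:
    """Hypothetical stand-in: assign one random group per uuid, in place."""
    name = f"random_seed{seed}_{n_split}splits"
    rng = random.Random(seed)  # local generator, so the result is reproducible

    # Draw one group per uuid; sorting makes the assignment order deterministic.
    uuids = sorted({dset["uuid"] for dset in datasets.values()})
    group_of = {uuid: rng.randrange(n_split) for uuid in uuids}

    # Every track inherits the group of its uuid.
    for dset in datasets.values():
        dset[name] = group_of[dset["uuid"]]
    return name
```

Drawing at the uuid level rather than the track level keeps all tracks of one experiment in the same group, which avoids splitting them across folds.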

def filter_by_pairs(my_metadata: ~Meta, assay_cat: str = 'assay', cat2: str = 'cell_type', nb_pairs: int = 5, min_per_pair: int = 10, use_uuid: bool = True) ‑> ~Meta

Return filtered metadata that keeps only the cat2 classes satisfying pairing conditions with assay_cat.

1) Remove (assay_cat, cat2) pairs that have fewer than 'min_per_pair' signals.
2) Only keep cat2 classes that still have at least 'nb_pairs' different non-empty pairings.
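
The two filtering steps can be sketched on plain dictionaries. The function name and data layout are hypothetical, and the `use_uuid` option is deliberately left out because its exact effect is not described here; this sketch counts every dataset as one signal.

```python
from collections import Counter, defaultdict
from typing import Dict


def filter_by_pairs_sketch(
    datasets: Dict[str, dict],
    assay_cat: str = "assay",
    cat2: str = "cell_type",
    nb_pairs: int = 5,
    min_per_pair: int = 10,
) -> Dict[str, dict]:
    """Hypothetical stand-in for the two-step pair filter."""
    # Step 1: keep only (assay_cat, cat2) pairs with enough signals.
    pair_counts = Counter((d[assay_cat], d[cat2]) for d in datasets.values())
    kept_pairs = {pair for pair, n in pair_counts.items() if n >= min_per_pair}

    # Step 2: keep only cat2 labels that still appear in enough pairs.
    assays_per_label = defaultdict(set)
    for assay, label in kept_pairs:
        assays_per_label[label].add(assay)
    kept_labels = {
        label for label, assays in assays_per_label.items()
        if len(assays) >= nb_pairs
    }

    return {
        key: d for key, d in datasets.items()
        if (d[assay_cat], d[cat2]) in kept_pairs and d[cat2] in kept_labels
    }
```

The order matters: a label can hold many pairs before step 1 and too few after it, so the pair count in step 2 is taken on the surviving pairs only.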

def five_cell_types_selection(my_metadata: Metadata)

Return a filtered metadata with 5 major cell_types and certain assays.

def fix_roadmap(metadata: Metadata)

Merge info from 'data_generating_centre' category.

Convert 'NIH Roadmap Epigenomics' to 'Roadmap'.

def keep_major_assays_2019(my_metadata)

Combine the rna_seq and polr2a class pairs in the assay category. Written for the 2019-11 release.

def keep_major_cell_types(my_metadata: Metadata)

Remove datasets that do not belong to a cell_type with at least 10 signals in two assays. Those assays must also have at least two cell_types.

def keep_major_cell_types_2019(my_metadata)

Select 20 cell types in the major assays signal subset. A cell type needs to have at least 10 signals in one assay. The selection choices were made outside of the code. Written for the 2019-11 release.

def keep_major_cell_types_alt(my_metadata: Metadata)

Return filtered metadata with certain assays. Datasets that do not belong to a cell_type with at least 10 signals are removed.

def merge_pair_end_info(metadata: Metadata)

Merge info from 'paired' and 'pair_end_mode' categories.

Convert FALSE/TRUE to 'single_end' and 'paired_end', respectively.
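
A minimal sketch of the merge, assuming the merged value is stored under 'pair_end_mode' and that 'pair_end_mode' takes precedence over 'paired' when both are present. Both points, along with the function name and dict layout, are assumptions made for illustration.

```python
from typing import Dict

# Mapping stated in the docstring: FALSE -> single_end, TRUE -> paired_end.
BOOL_TO_MODE = {"FALSE": "single_end", "TRUE": "paired_end"}


def merge_pair_end_info_sketch(datasets: Dict[str, dict]) -> None:
    """Hypothetical stand-in: merge 'paired' into 'pair_end_mode', in place."""
    for dset in datasets.values():
        raw = dset.get("pair_end_mode", dset.get("paired"))
        if raw is None:
            continue  # neither category is present
        # Normalize boolean-like values; other labels are kept unchanged.
        dset["pair_end_mode"] = BOOL_TO_MODE.get(str(raw).upper(), raw)
```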

def special_case(my_metadata)

Return filtered metadata with only rna_seq examples, plus 3 thyroid examples (for model construction).

Made to evaluate an already trained model, works with min_class_size=3 and oversample=False.

def special_case_2(my_metadata)

Return filtered metadata with 2 examples removed from every assay/cell_type pair, and with all mrna_seq removed.