scooby.data

Submodules

Attributes

Classes

Functions

_sparse_to_coverage_rna(m, seq_coord, strand, ...)

Converts a sparse RNA expression matrix to a dense coverage vector.

_sparse_to_coverage_atac(m, seq_coord, custom_read_length)

Converts a sparse ATAC-seq insertion matrix to a dense coverage vector.

Package Contents

scooby.data.min_value
scooby.data.max_value
scooby.data._sparse_to_coverage_rna(m, seq_coord, strand, custom_read_length)

Converts a sparse RNA expression matrix to a dense coverage vector.

This function processes a sparse matrix representing RNA expression counts and converts it into a dense coverage vector, accounting for the strand information.

Args:

m (scipy.sparse.csr_matrix): Sparse matrix of RNA expression counts.
seq_coord (tuple): Tuple containing genomic coordinates and sequence information.
strand (str): Strand of the gene (‘plus’ or ‘minus’).
custom_read_length (int): Maximum read length (default: 90, for Illumina sequencing).

Returns:

torch.Tensor: Dense coverage vector for RNA expression.
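
To make the idea of a strand-aware sparse-to-dense conversion concrete, here is a minimal, dependency-free sketch. It rests on assumptions not stated in this reference: each nonzero entry is taken to be a count of reads starting at that position, and every read is taken to cover `read_length` bases in the direction of its strand. scooby's actual encoding and extension rule may differ. `sparse_starts_to_coverage` is a hypothetical helper, not part of `scooby.data`; a plain dict stands in for one sparse row and a list for the returned tensor.

```python
def sparse_starts_to_coverage(starts, length, strand, read_length=90):
    """Expand sparse read-start counts into a dense coverage list.

    starts: dict mapping position -> read count (stand-in for a sparse row).
    length: number of positions in the genomic window.
    strand: 'plus' extends reads rightwards, 'minus' leftwards (an assumption).
    """
    if strand not in ("plus", "minus"):
        raise ValueError("strand must be 'plus' or 'minus'")
    # Difference array: +count where a read begins, -count just past its end.
    diff = [0] * (length + 1)
    for pos, count in starts.items():
        if strand == "plus":
            lo, hi = pos, pos + read_length
        else:
            lo, hi = pos - read_length + 1, pos + 1
        lo, hi = max(lo, 0), min(hi, length)  # clip the read to the window
        if lo < hi:
            diff[lo] += count
            diff[hi] -= count
    # A prefix sum turns the boundary marks into per-base coverage.
    coverage, running = [], 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage


cov_plus = sparse_starts_to_coverage({2: 1, 4: 2}, length=10, strand="plus", read_length=3)
cov_minus = sparse_starts_to_coverage({2: 1, 4: 2}, length=10, strand="minus", read_length=3)
```

The difference-array approach keeps the cost proportional to the number of nonzero entries plus the window length, rather than to the number of entries times the read length.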

scooby.data._sparse_to_coverage_atac(m, seq_coord, custom_read_length)

Converts a sparse ATAC-seq insertion matrix to a dense coverage vector.

Args:

m (scipy.sparse.csr_matrix): Sparse matrix of ATAC-seq insertion counts.
seq_coord (tuple): Tuple containing genomic coordinates and sequence information.
custom_read_length (int): Maximum read length (default: 90, for Illumina sequencing).

Returns:

torch.Tensor: Dense coverage vector for ATAC-seq insertions.

class scooby.data.onTheFlyDataset(adata_plus, adata_minus, embedding, ds, clip_soft, neighbors: scipy.sparse.csr_matrix | None = None, cell_sample_size: int = 32, get_targets: bool = True, random_cells: bool = True, cells_to_run: numpy.ndarray | None = None, cell_weights: numpy.ndarray | None = None, custom_read_length: int = 90)

Bases: torch.utils.data.Dataset

clip_soft
cell_weights = None
cells_to_run = None
embedding
neighbors
get_targets = True
random_cells = True
genome_ds
cell_sample_size = 32
adata_plus
adata_minus
custom_read_length = 90
__len__()
_get_neighbors_for_cell(bar_code_id)
_process_cells(adata, cells, seq_coord, strand, custom_read_length)

Processes RNA expression data for the given cells and sequence coordinates.

This function extracts RNA expression counts from the AnnData object, converts them to dense coverage vectors, applies normalization, and returns the processed profiles.

Args:

adata (anndata.AnnData): AnnData object containing RNA expression data.
cells (list): List of cell indices.
seq_coord (tuple): Tuple containing genomic coordinates and sequence information.
strand (str): Strand of the gene (‘plus’ or ‘minus’).
custom_read_length (int): Maximum read length (default: 90, for Illumina sequencing).

Returns:

torch.Tensor: Processed RNA expression profiles for the given cells.

_load_pseudobulk(neighbors, seq_coord)

Loads and processes pseudobulk RNA expression profiles for the given cells.

Args:

neighbors (list): List of cell indices.
seq_coord (tuple): Tuple containing genomic coordinates and sequence information.

Returns:

Tuple[torch.Tensor, torch.Tensor]: Processed pseudobulk RNA expression profiles for the plus and minus strands.

_reinit_fasta_reader()
__getitem__(idx)
class scooby.data.onTheFlyPseudobulkDataset(cell_types, ds, base_path, seqlevelstyle='UCSC', clip_soft=384, custom_read_length=90)

Bases: torch.utils.data.Dataset

cell_types
genome_ds
base_path
seqlevelstyle = 'UCSC'
clip_soft = 384
custom_read_length = 90
__len__()
_process_paths(paths, seq_coord)

Processes BigWig files to extract and normalize coverage values.

This function opens BigWig files, extracts coverage values for the specified genomic interval, and applies normalization and clipping.

Args:

paths (list): List of paths to BigWig files.
seq_coord (pd.Series): Pandas Series containing genomic interval information.

Returns:

torch.Tensor: Processed and normalized coverage values for the given interval.
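
This reference does not spell out the clipping step. As an illustration only, the sketch below shows one common soft-clipping scheme in which values above a threshold are compressed by a square root instead of being cut off. Whether scooby applies exactly this transform is an assumption. `soft_clip` is a hypothetical helper, and a plain list stands in for the coverage tensor.

```python
import math


def soft_clip(values, clip_soft=384.0):
    """Compress values above clip_soft: x -> clip_soft + sqrt(x - clip_soft).

    Values at or below the threshold pass through unchanged.
    """
    return [
        clip_soft + math.sqrt(v - clip_soft) if v > clip_soft else v
        for v in values
    ]


clipped = soft_clip([0.0, 100.0, 384.0, 400.0, 1384.0])
```

Unlike a hard clip, this mapping stays monotonic above the threshold, so very high coverage peaks remain distinguishable while contributing less to the loss.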

_load_pseudobulk(neighbors, seq_coord)

Loads and processes pseudobulk profiles for the specified cell types.

Args:

neighbors (list): List of cell type names.
seq_coord (pd.Series): Pandas Series containing genomic interval information.

Returns:

torch.Tensor: Concatenated pseudobulk profiles for the given cell types.

_reinit_fasta_reader()

Re-initializes the FastaInterval reader.

This is necessary because pyfaidx and torch multiprocessing are not compatible.
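
The general pattern behind such a re-initialization can be sketched without pyfaidx or torch: a file-backed reader drops any handle inherited from the parent process and opens its own. `LazyReader` and its methods are hypothetical names for illustration, not scooby's implementation. With a `torch.utils.data.DataLoader`, a method like this would typically be called once per worker, for example from a `worker_init_fn`.

```python
import os
import tempfile


class LazyReader:
    """Holds a file handle that is re-opened when used from a new process."""

    def __init__(self, path):
        self.path = path
        self._handle = None
        self._pid = None

    def _reinit_reader(self):
        # Drop any inherited handle and open a fresh one owned by this process.
        if self._handle is not None:
            self._handle.close()
        self._handle = open(self.path, "r")
        self._pid = os.getpid()

    def read(self):
        # Re-open if never opened, or if we are now in a different process.
        if self._handle is None or self._pid != os.getpid():
            self._reinit_reader()
        self._handle.seek(0)
        return self._handle.read()

    def close(self):
        if self._handle is not None:
            self._handle.close()
            self._handle = None


# Usage: write a tiny FASTA-like file and read it back through the lazy reader.
with tempfile.NamedTemporaryFile("w", suffix=".fa", delete=False) as tmp:
    tmp.write(">chr1\nACGT\n")
reader = LazyReader(tmp.name)
text = reader.read()
reader.close()
os.remove(tmp.name)
```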

__getitem__(idx)

Gets the item at the given index.

Args:

idx (int): The index of the item.

Returns:

Tuple[torch.Tensor, torch.Tensor, torch.Tensor]: The input sequences, reverse complemented sequences, and the target pseudobulk profiles.

class scooby.data.onTheFlyMultiomeDataset(adatas: dict, embedding: pandas.DataFrame, ds: enformer_pytorch.data.GenomeIntervalDataset, clip_soft, neighbors: scipy.sparse.csr_matrix | None = None, cell_sample_size: int = 32, get_targets: bool = True, random_cells: bool = True, cells_to_run: numpy.ndarray | None = None, cell_weights: numpy.ndarray | None = None, normalize_atac: bool = False, custom_read_length: int = 90)

Bases: torch.utils.data.Dataset

clip_soft
cell_weights = None
cells_to_run = None
embedding
neighbors
get_targets = True
random_cells = True
genome_ds
cell_sample_size = 32
adatas
normalize_atac = False
custom_read_length = 90
__len__()
_get_neighbors_for_cell(bar_code_id)
_process_rna(adata, cell_indices, seq_coord, strand, custom_read_length)

Processes RNA expression data for the given cells and sequence coordinates.

This function extracts RNA expression counts from the AnnData object, converts them to dense coverage vectors, applies normalization, and returns the processed profiles.

Args:

adata (anndata.AnnData): AnnData object containing RNA expression data.
cell_indices (list): List of cell indices.
seq_coord (tuple): Tuple containing genomic coordinates and sequence information.
strand (str): Strand of the gene (‘plus’ or ‘minus’).
custom_read_length (int): Maximum read length (default: 90, for Illumina sequencing).

Returns:

torch.Tensor: Processed RNA expression profiles for the given cells.

_process_atac(adata, cell_indices, seq_coord, custom_read_length)

Processes ATAC-seq data for the given cells and sequence coordinates.

This function extracts ATAC-seq insertion counts from the AnnData object, converts them to dense coverage vectors, applies normalization (if specified), and returns the processed profiles.

Args:

adata (anndata.AnnData): AnnData object containing ATAC-seq insertion data.
cell_indices (list): List of cell indices.
seq_coord (tuple): Tuple containing genomic coordinates and sequence information.
custom_read_length (int): Maximum read length (default: 90, for Illumina sequencing).

Returns:

torch.Tensor: Processed ATAC-seq profiles for the given cells.

_load_pseudobulk(neighbors, seq_coord)

Loads and processes pseudobulk profiles for RNA and ATAC-seq data.

Args:

neighbors (list): List of cell indices to aggregate into a pseudobulk profile.
seq_coord (tuple): Tuple containing genomic coordinates and sequence information.

Returns:

torch.Tensor: Concatenated pseudobulk profiles for RNA and ATAC-seq data.
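
As a rough illustration of aggregating cells into a pseudobulk and concatenating modalities, the sketch below sums per-cell coverage over the selected cells and stacks the results as one track per modality. The choice of summation, the track order (RNA plus, RNA minus, ATAC), and the helper name `pseudobulk` are all assumptions made for this example. Nested lists stand in for tensors.

```python
def pseudobulk(profiles, cell_indices):
    """Sum per-cell coverage vectors (rows of `profiles`) for the given cells."""
    selected = [profiles[i] for i in cell_indices]
    # zip(*selected) walks the window position by position across cells.
    return [sum(column) for column in zip(*selected)]


# Toy data: 3 cells x 3 positions per modality.
rna_plus = [[1, 0, 2], [0, 1, 0], [3, 0, 0]]
rna_minus = [[0, 0, 1], [5, 5, 5], [0, 2, 0]]
atac = [[1, 1, 0], [0, 0, 0], [0, 1, 1]]

neighbors = [0, 2]  # cells to aggregate

# One track per modality, concatenated in an assumed fixed order.
tracks = [pseudobulk(m, neighbors) for m in (rna_plus, rna_minus, atac)]
```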

_reinit_fasta_reader()

Re-initializes the FastaInterval reader.

This is necessary because pyfaidx and torch multiprocessing can have compatibility issues.

__getitem__(idx)

Retrieves data for a given genomic interval.

This function retrieves the DNA sequence, processes RNA and ATAC-seq data for the selected cells and their neighbors, and returns the input sequence, reverse complement, target profiles (if get_targets is True), and cell embeddings.

Args:

idx (int): Index of the genomic interval in the GenomeIntervalDataset.

Returns:

Tuple: A tuple containing the input sequence, reverse complement sequence, target profiles (optional), and cell embeddings.