Complete reference for all functions and classes in the Metagene package.
Load genomic sites from various file formats.
metagene.load_sites(
input_file_name: str,
with_header: bool = False,
meta_col_index: List[int] = None
) -> PyRanges
Parameters:
input_file_name
: Path to input file (BED, TSV, or compressed formats)
with_header
: Whether the file has a header row
meta_col_index
: Column indices for [chr, start, end, strand] or [chr, pos, strand]
Returns:
Example:
# Load BED file without header
sites = metagene.load_sites("sites.bed", meta_col_index=[0, 1, 2, 5])
# Load TSV with header
sites = metagene.load_sites("sites.tsv", with_header=True, meta_col_index=[0, 1, 3])
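The two accepted shapes of meta_col_index can be illustrated with a small stand-alone sketch. This is not the package's parser: the helper name `pick_interval` is hypothetical, and treating a single `pos` column as a 1-based coordinate that becomes a 0-based, half-open interval is an assumption made purely for illustration.

```python
from typing import List, Sequence, Tuple


def pick_interval(
    fields: Sequence[str], meta_col_index: List[int]
) -> Tuple[str, int, int, str]:
    """Return (chrom, start, end, strand) from one already-split row.

    Hypothetical helper for illustration; not part of metagene.
    """
    if len(meta_col_index) == 4:
        # [chr, start, end, strand]: coordinates are taken as given.
        c, s, e, st = meta_col_index
        return fields[c], int(fields[s]), int(fields[e]), fields[st]
    if len(meta_col_index) == 3:
        # [chr, pos, strand]. ASSUMPTION: pos is 1-based, so the
        # equivalent 0-based half-open interval is [pos - 1, pos).
        c, p, st = meta_col_index
        pos = int(fields[p])
        return fields[c], pos - 1, pos, fields[st]
    raise ValueError("meta_col_index must have 3 or 4 entries")


# A six-column BED row and a four-column TSV row describing the same base.
bed_row = "chr1\t100\t101\tsite1\t0\t+".split("\t")
tsv_row = "chr1\t101\tgeneA\t-".split("\t")

print(pick_interval(bed_row, [0, 1, 2, 5]))
print(pick_interval(tsv_row, [0, 1, 3]))
```

Whether the package itself applies this 1-based conversion is not stated in this reference, so check the coordinates of a few loaded sites before relying on it.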
Load built-in reference annotations or list available references.
metagene.load_reference(species: Optional[str] = None) -> Union[PyRanges, dict]
Parameters:
species
: Species name (e.g., “GRCh38”, “GRCm39”) or None to list available references
Returns:
Example:
# List available references
available = metagene.load_reference()
# Load specific reference
reference = metagene.load_reference("GRCh38")
Load custom GTF/GFF file.
metagene.load_gtf(gtf_file: str) -> PyRanges
Parameters:
gtf_file
: Path to GTF or GFF file
Returns:
Map genomic sites to transcript coordinates.
metagene.map_to_transcripts(
input_sites: pr.PyRanges,
exon_ref: pr.PyRanges
) -> pl.DataFrame
Parameters:
input_sites
: PyRanges with genomic sites (output from load_sites())
exon_ref
: PyRanges with gene/transcript annotations (output from load_reference() or load_gtf())
Returns:
Example:
# Load data
sites = metagene.load_sites("sites.tsv", with_header=True, meta_col_index=[0, 1, 2])
reference = metagene.load_reference("GRCh38")
# Map to transcripts
annotated = metagene.map_to_transcripts(sites, reference)
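To make the idea of transcript coordinates concrete, here is a minimal sketch of the general technique: walk the exons in transcription order and count spliced bases until the site is reached. It is a generic illustration under stated assumptions (0-based, end-exclusive exons; the function name `genomic_to_transcript` is hypothetical) and does not describe how map_to_transcripts() is implemented.

```python
from typing import List, Optional, Tuple

Exon = Tuple[int, int]  # (start, end), 0-based, end exclusive


def genomic_to_transcript(pos: int, exons: List[Exon], strand: str) -> Optional[int]:
    """Map a 0-based genomic position to a 0-based offset in the spliced transcript.

    Returns None when the position lies in an intron or outside the transcript.
    """
    # Transcription order: ascending exons on '+', descending on '-'.
    ordered = sorted(exons, reverse=(strand == "-"))
    offset = 0  # number of exonic bases already passed
    for start, end in ordered:
        if start <= pos < end:
            if strand == "+":
                return offset + (pos - start)
            # On '-', the last genomic base of the exon is transcribed first.
            return offset + (end - 1 - pos)
        offset += end - start
    return None


exons = [(100, 200), (300, 350)]
print(genomic_to_transcript(150, exons, "+"))
print(genomic_to_transcript(310, exons, "+"))
print(genomic_to_transcript(310, exons, "-"))
print(genomic_to_transcript(250, exons, "+"))
```

A site that falls in an intron has no transcript coordinate, which is why the sketch returns None for position 250.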
Normalize transcript positions to relative feature positions (0-1 scale).
metagene.normalize_positions(
annotated_sites: pl.DataFrame,
split_strategy: str = "median",
bin_number: int = 100,
weight_col_index: list[int] | None = None
) -> tuple[pl.DataFrame, dict, tuple]
Parameters:
annotated_sites
: DataFrame from map_to_transcripts() with transcript coordinates
split_strategy
: Strategy for calculating gene region splits (“median” or other)
bin_number
: Number of bins for discretizing the normalized positions (default: 100)
weight_col_index
: Optional list of column indices for weighting
Returns:
gene_bins
: DataFrame with normalized position bins and counts
gene_stats
: Dictionary with statistics for each feature type
gene_splits
: Tuple of (5’UTR ratio, CDS ratio, 3’UTR ratio)
Example:
gene_bins, gene_stats, gene_splits = metagene.normalize_positions(
annotated_sites,
split_strategy="median",
bin_number=100
)
print(f"5'UTR: {gene_splits[0]:.3f}, CDS: {gene_splits[1]:.3f}, 3'UTR: {gene_splits[2]:.3f}")
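The relationship between region splits, the 0-1 scale, and bins can be sketched in plain Python. The code below is one plausible reading of a "median" split strategy, not the package's actual algorithm: the region lengths are made-up numbers, and the helpers `median_splits`, `normalize`, and `to_bin` are hypothetical names.

```python
from statistics import median
from typing import List, Sequence, Tuple

# Made-up per-transcript region lengths in bases: (5'UTR, CDS, 3'UTR).
region_lengths = [(80, 500, 300), (100, 600, 250), (120, 900, 400)]


def median_splits(lengths: List[Tuple[int, int, int]]) -> Tuple[float, ...]:
    """Convert the median length of each region into ratios that sum to 1."""
    meds = [median(col) for col in zip(*lengths)]
    total = sum(meds)
    return tuple(m / total for m in meds)


def normalize(region: int, offset: int, region_len: int, splits: Sequence[float]) -> float:
    """Place a site on the 0-1 axis.

    region: 0 = 5'UTR, 1 = CDS, 2 = 3'UTR.
    offset: 0-based position of the site inside that region.
    """
    left_edge = sum(splits[:region])  # where this region starts on the axis
    return left_edge + (offset / region_len) * splits[region]


def to_bin(x: float, bin_number: int = 100) -> int:
    """Discretize a normalized position; x == 1.0 is clamped into the last bin."""
    return min(int(x * bin_number), bin_number - 1)


splits = median_splits(region_lengths)
x = normalize(region=1, offset=303, region_len=600, splits=splits)
print(splits, round(x, 3), to_bin(x))
```

Scaling each region independently keeps the 5'UTR/CDS and CDS/3'UTR boundaries at the same x positions for every transcript, which is what makes profiles from genes of different lengths comparable.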
Display summary statistics for the analysis.
metagene.show_summary_stats(data: PyRanges) -> None
Parameters:
data
: PyRanges with analysis results
Generate a metagene profile plot.
metagene.plot_profile(
gene_bins: pl.DataFrame,
gene_splits: tuple[float, float, float],
output_path: str,
figsize: tuple[int, int] = (10, 5)
) -> None
Parameters:
gene_bins
: DataFrame with binned data from normalize_positions()
gene_splits
: Tuple of gene region ratios from normalize_positions()
output_path
: Path for output image file
figsize
: Figure size (width, height) in inches
Example:
gene_bins, gene_stats, gene_splits = metagene.normalize_positions(annotated_sites)
metagene.plot_profile(gene_bins, gene_splits, "metagene_plot.png")
alpha
: Line transparency
Create detailed metagene profile with customization options.
metagene.plot_profile(
data: PyRanges,
output_path: str,
**kwargs
) -> None
Generate binned statistics plot.
metagene.plot_binned_statistics(
data: PyRanges,
output_path: str,
**kwargs
) -> None
Annotate sites with overlapping genomic features.
metagene.annotate_with_features(
sites: PyRanges,
features: PyRanges
) -> PyRanges
Calculate statistics for binned data.
metagene.calculate_bin_statistics(
data: PyRanges,
bins: int = 100
) -> PyRanges
metagene [OPTIONS]
Option | Type | Description |
---|---|---|
-i, --input | PATH | Input file path |
-o, --output | PATH | Output file path |
-r, --reference | TEXT | Built-in reference (e.g., GRCh38) |
-g, --gtf | PATH | Custom GTF file |
-p, --output-figure | PATH | Output plot file |
--region | CHOICE | Region to analyze (all/5utr/cds/3utr) |
--bins | INTEGER | Number of bins (default: 100) |
--with-header | FLAG | Input file has header |
-m, --meta-columns | TEXT | Column indices for coordinates |
--list | FLAG | List available references |
--download | TEXT | Download reference |
# Basic analysis
metagene -i sites.bed -o results.tsv -r GRCh38
# With custom parameters
metagene -i sites.tsv --with-header -m "1,2,3" -r GRCm39 --bins 200
# List and download references
metagene --list
metagene --download GRCh38
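When the command-line tool is driven from a script, it can help to assemble the argument list in one place. The sketch below uses only flags listed in the option table; the wrapper `build_metagene_command` is a hypothetical helper, not something the package provides, and the commented-out execution step assumes the `metagene` executable is on PATH.

```python
import shlex
from typing import List, Optional


def build_metagene_command(
    input_path: str,
    output_path: str,
    reference: str,
    bins: Optional[int] = None,
    with_header: bool = False,
) -> List[str]:
    """Assemble an argument list from flags documented in the option table."""
    cmd = ["metagene", "-i", input_path, "-o", output_path, "-r", reference]
    if bins is not None:
        cmd += ["--bins", str(bins)]
    if with_header:
        cmd.append("--with-header")
    return cmd


cmd = build_metagene_command(
    "sites.tsv", "results.tsv", "GRCm39", bins=200, with_header=True
)
print(shlex.join(cmd))

# To actually run it (requires the metagene CLI to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single shell string avoids quoting problems with file names that contain spaces.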
The package uses PyRanges objects to represent genomic intervals and annotations. Key columns include:
Chromosome
: Chromosome name
Start
: Start position (0-based)
End
: End position (exclusive)
Strand
: Strand (‘+’ or ‘-’)
Additional columns may be present depending on the analysis step:
transcript_id
: Transcript identifier
gene_id
: Gene identifier
normalized_position
: Position normalized to gene structure
bin
: Bin number for aggregated data
The package provides informative error messages for common issues:
All functions include comprehensive type hints for better IDE support and code clarity. Import types as needed:
from typing import List, Optional, Union
import pyranges as pr
import polars as pl  # needed for the pl.DataFrame annotations used above