metagene

Getting Started

This guide covers installing the metagene package and running a first analysis from Python or the command line.

Table of contents

  1. Installation
    1. Requirements
    2. Install from PyPI
    3. Install from source
  2. Basic Usage
    1. Python API
      1. Load Your Data
      2. Load Reference Annotations
      3. Run Analysis
      4. Generate Plots
    2. Command Line Interface
      1. Basic Analysis
      2. Available Options
  3. Input Data Formats
    1. Genomic Sites File
      1. BED Format (0-based)
      2. TSV Format (1-based with header)
      3. TSV Format (0-based without header)
    2. Column Specification
  4. Built-in References
    1. Available Species
    2. Using Built-in References
  5. Examples
    1. Example 1: RNA Modification Sites
    2. Example 2: Protein Binding Sites
    3. Example 3: Custom GTF File
  6. Next Steps

Installation

Requirements

Install from PyPI

pip install metagene

Install from source

git clone https://github.com/y9c/metagene.git
cd metagene
pip install -e .

Basic Usage

Python API

1. Load Your Data

import metagene

# Load genomic sites from file
sites = metagene.load_sites(
    "sites.tsv", 
    with_header=True,
    meta_col_index=[0, 1, 2]  # chromosome, position, strand
)
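
To try the call above without real data, a small input file can be generated first. The sketch below writes two sites in the headered, 1-based TSV layout described under Input Data Formats. The helper name `write_example_sites` is hypothetical and not part of the package, and the rows are just the example values from this guide.

```python
import csv

# Example rows copied from the "TSV Format (1-based with header)" section
ROWS = [
    ("chromosome", "position", "strand", "score"),
    ("chr1", 1001, "+", 100),
    ("chr1", 2001, "-", 150),
]


def write_example_sites(path):
    """Hypothetical helper: write a tiny tab-separated sites file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(ROWS)


write_example_sites("sites.tsv")
```

The resulting sites.tsv has chromosome, position, and strand in its first three columns, matching meta_col_index=[0, 1, 2].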

2. Load Reference Annotations

# Use built-in reference (auto-downloads if needed)
reference = metagene.load_reference("GRCh38")

# Or load custom GTF file
reference = metagene.load_gtf("custom.gtf")

3. Run Analysis

# Map sites to transcripts
results = metagene.map_to_transcripts(sites, reference)

# Normalize positions relative to gene structure
gene_bins, gene_stats, gene_splits = metagene.normalize_positions(results, region="all")

# Show summary statistics
print(f"Gene splits - 5'UTR: {gene_splits[0]:.3f}, CDS: {gene_splits[1]:.3f}, 3'UTR: {gene_splits[2]:.3f}")
print(f"Gene statistics - 5'UTR: {gene_stats['5UTR']}, CDS: {gene_stats['CDS']}, 3'UTR: {gene_stats['3UTR']}")
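
To make "normalizing positions relative to gene structure" concrete, here is a minimal sketch of one common approach. A position along a transcript is expressed as a fraction of the region it falls in, then offset so that the 5'UTR, CDS, and 3'UTR each occupy one unit. This is an illustration only: the function `normalize_position` and the [0, 3) scale are assumptions, not the package's implementation or output format.

```python
def normalize_position(pos, utr5_len, cds_len, utr3_len):
    """Map a 0-based transcript position to a coordinate in [0, 3).

    Hypothetical helper for illustration; not part of the metagene API.
    [0, 1) is the 5'UTR, [1, 2) the CDS, [2, 3) the 3'UTR.
    """
    total = utr5_len + cds_len + utr3_len
    if not 0 <= pos < total:
        raise ValueError(f"position {pos} outside transcript of length {total}")
    if pos < utr5_len:
        return pos / utr5_len
    pos -= utr5_len
    if pos < cds_len:
        return 1 + pos / cds_len
    pos -= cds_len
    return 2 + pos / utr3_len


# A transcript with a 100 nt 5'UTR, 300 nt CDS and 200 nt 3'UTR
print(normalize_position(50, 100, 300, 200))   # 0.5
print(normalize_position(250, 100, 300, 200))  # 1.5
print(normalize_position(500, 100, 300, 200))  # 2.5
```

Scaling each region separately lets transcripts of very different lengths be compared on one axis.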

4. Generate Plots

# Create metagene plot
metagene.plot_profile(
    gene_bins, 
    gene_splits,
    output_path="metagene_plot.png"
)
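
Conceptually, a metagene profile is a histogram of normalized site positions. The sketch below counts sites per bin, assuming the positions have already been scaled to a fixed range (here [0, 3), an assumption made for illustration). The function `bin_counts` is hypothetical and is not how the package necessarily computes `gene_bins`.

```python
def bin_counts(normalized, n_bins=10, lo=0.0, hi=3.0):
    """Hypothetical helper: count values falling in equal-width bins over [lo, hi)."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in normalized:
        if lo <= x < hi:
            # min() guards against floating-point rounding at the upper edge
            index = min(int((x - lo) / width), n_bins - 1)
            counts[index] += 1
    return counts


positions = [0.5, 1.5, 1.6, 2.5, 2.9]
print(bin_counts(positions, n_bins=3))  # [1, 2, 2]
```

More bins give finer resolution but noisier counts when few sites are available.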

Command Line Interface

Basic Analysis

# Run complete analysis
metagene -i sites.tsv -o results.tsv -r GRCh38 -p plot.png --with-header

Available Options

# View all options
metagene --help

# List available built-in references
metagene --list

# Download specific reference
metagene --download GRCm39

# Custom analysis parameters
metagene -i sites.bed -o results.tsv -g custom.gtf --bins 200 --region cds

Input Data Formats

Genomic Sites File

Your input file should contain genomic coordinates. Supported formats:

BED Format (0-based)

chr1    1000    1001    .    100    +
chr1    2000    2001    .    150    -

TSV Format (1-based with header)

chromosome    position    strand    score
chr1          1001        +         100
chr1          2001        -         150

TSV Format (0-based without header)

chr1    1000    1001    +    100
chr1    2000    2001    -    150
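
The formats above differ in coordinate convention. BED intervals are 0-based and half-open, while the headered TSV example uses 1-based positions, so the BED interval 1000-1001 and the TSV position 1001 describe the same base. A minimal sketch of the conversion follows; the helper names are hypothetical and not part of the package.

```python
def bed_to_one_based(start, end):
    """Convert a 0-based, half-open BED interval to 1-based inclusive coordinates."""
    if end <= start:
        raise ValueError("BED interval must have end > start")
    return start + 1, end


def one_based_to_bed(pos):
    """Convert a single 1-based position to a 0-based, half-open BED interval."""
    if pos < 1:
        raise ValueError("1-based positions start at 1")
    return pos - 1, pos


print(bed_to_one_based(1000, 1001))  # (1001, 1001)
print(one_based_to_bed(2001))        # (2000, 2001)
```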

Column Specification

Use the --meta-columns option (written as -m in the examples below) to specify which columns contain your coordinates:

# For BED-like format (chr, start, end, strand)
metagene -i data.tsv -m "1,2,3,4"

# For position-only format (chr, pos, strand)
metagene -i data.tsv -m "1,2,3"
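
The command-line examples use "1,2,3" for chromosome, position, and strand, while the Python example earlier passes meta_col_index=[0, 1, 2] for the same columns. This suggests the command-line numbers are 1-based and the Python indices 0-based, though that is an inference from this guide rather than something verified against the package. Under that assumption, a hypothetical helper for translating between the two could look like this:

```python
def parse_meta_columns(spec, one_based=True):
    """Hypothetical helper: turn a string like "1,2,3" into 0-based column indices.

    Assumes the command-line numbering is 1-based; pass one_based=False
    if the numbers are already 0-based.
    """
    indices = []
    for field in spec.split(","):
        value = int(field.strip())
        index = value - 1 if one_based else value
        if index < 0:
            raise ValueError(f"column {value} is out of range")
        indices.append(index)
    return indices


print(parse_meta_columns("1,2,3"))    # [0, 1, 2]
print(parse_meta_columns("1,2,3,4"))  # [0, 1, 2, 3]
```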

Built-in References

The package provides built-in references for common model organisms, downloaded automatically when first needed:

Available Species

Species        Reference    Description
Human          GRCh38       Genome Reference Consortium Human Build 38
Mouse          GRCm39       Genome Reference Consortium Mouse Build 39
Mouse          GRCm38       Genome Reference Consortium Mouse Build 38
Arabidopsis    TAIR10       The Arabidopsis Information Resource
Rice           IRGSP-1.0    International Rice Genome Sequencing Project

Using Built-in References

import metagene

# List available references
available = metagene.load_reference()
print(available)

# Load specific reference (auto-downloads)
reference = metagene.load_reference("GRCh38")

# Command line
metagene --list
metagene --download GRCh38

Examples

Example 1: RNA Modification Sites

import metagene

# Load m6A modification sites
sites = metagene.load_sites("m6a_sites.bed")

# Use human reference
reference = metagene.load_reference("GRCh38")

# Map sites to transcripts, then normalize within the 3' UTR only
results = metagene.map_to_transcripts(sites, reference)
gene_bins, gene_stats, gene_splits = metagene.normalize_positions(results, region="3utr")

# Create the metagene plot
metagene.plot_profile(
    gene_bins,
    gene_splits,
    output_path="m6a_metagene.png"
)

Example 2: Protein Binding Sites

# Command line analysis of ChIP-seq peaks
metagene -i chip_peaks.bed \
         -o binding_analysis.tsv \
         -r GRCm39 \
         -p binding_plot.png \
         --region cds \
         --bins 100

Example 3: Custom GTF File

import metagene

# Load sites and custom annotation
sites = metagene.load_sites("sites.tsv", with_header=True)
reference = metagene.load_gtf("custom_annotation.gtf")

# Run analysis
results = metagene.map_to_transcripts(sites, reference)
gene_bins, gene_stats, gene_splits = metagene.normalize_positions(results)

# Generate plot
metagene.plot_profile(gene_bins, gene_splits, "custom_analysis.png")

Next Steps