Skip to content

API Reference

This section provides the complete API documentation for pyfgs, parsed directly from the type stubs.

Main Classes

The core ab initio gene prediction engine.

Parameters:

Name Type Description Default
model Model

The sequencing error model to use.

required
whole_genome bool

If False, the HMM permits internal frameshifts (insertions/deletions) typical of sequencing errors or pseudogenes. If True, strictly enforces contiguous reading frames. Defaults to True if Model.Complete is used, otherwise False.

None

Examples:

>>> from Bio import SeqIO
>>> import pyfgs
>>> record = SeqIO.read("genome.fasta", "fasta")
>>>
>>> # Keep whole_genome=False to allow pseudogene/frameshift detection
>>> finder = pyfgs.GeneFinder(pyfgs.Model.Complete, whole_genome=False)
>>> genes = finder.find_genes(record.seq._data)

find_genes(sequence)

Predicts open reading frames in a given DNA sequence.

This method releases the Python GIL, allowing for safe, lock-free multi-threading across multiple CPU cores.

Parameters:

Name Type Description Default
sequence bytes

The raw nucleotide sequence.

required

Returns:

Type Description
List[Gene]

List[Gene]: A list of predicted Gene objects.

The available sequencing error models for the Hidden Markov Model.

The model alters the transition probabilities to account for expected sequencing error rates, making it more or less forgiving of frameshifts.

Attributes:

Name Type Description
Illumina1 Model

Illumina reads with ~0.1% error rate.

Illumina5 Model

Illumina reads with ~0.5% error rate.

Illumina10 Model

Illumina reads with ~1% error rate.

Sanger5 Model

Sanger reads with ~0.5% error rate.

Sanger10 Model

Sanger reads with ~1% error rate.

Pyro454_5 Model

454 pyrosequencing reads with ~0.5% error rate.

Pyro454_10 Model

454 pyrosequencing reads with ~1% error rate.

Pyro454_30 Model

454 pyrosequencing reads with ~3% error rate.

Complete Model

Complete genomic sequences without expected sequencing errors.

Examples:

>>> import pyfgs
>>> model = pyfgs.Model.Complete

Result Classes

Represents a single predicted Open Reading Frame (ORF).

Attributes:

Name Type Description
start int

The 0-based, inclusive start coordinate.

end int

The 0-based, exclusive end coordinate.

strand int

The strand of the feature (1 for forward, -1 for reverse).

frame int

The reading frame.

score float

The log-probability score of the HMM prediction.

insertions List[int]

1-based global coordinates of predicted insertions.

deletions List[int]

1-based global coordinates of predicted deletions.

mutations(sequence)

Extracts structural variant objects for any predicted frameshifts.

Parameters:

Name Type Description Default
sequence bytes

The raw parent contig sequence, used to determine VCF anchored alleles.

required

Returns:

Type Description
List[Mutation]

List[Mutation]: A list of structured mutation objects.

Examples:

>>> seq_bytes = str(record.seq).encode()
>>> for gene in genes: 
...     for mut in gene.mutations(seq_bytes):
...         print(f"Frameshift at codon {mut.codon_idx}: {mut.annotation}")

sequence()

Retrieves the raw nucleotide sequence of the predicted gene.

If the gene is on the reverse strand, the sequence is automatically reverse-complemented. If the gene contains frameshifts, the returned sequence represents the conceptual (corrected) HMM path.

Returns:

Name Type Description
bytes bytes

The DNA sequence.

translation()

Translates the predicted gene into an amino acid sequence.

Alternative start codons (e.g., GTG, TTG) are automatically translated to Methionine (M) if the model was initialized with whole_genome=True.

Returns:

Name Type Description
bytes bytes

The amino acid sequence.

Represents a frameshift mutation (insertion or deletion) detected by the HMM.

Attributes:

Name Type Description
pos int

The 0-based index of the mutation in the global assembly. (Note: This is mathematically identical to the 1-based VCF anchor position).

mut_type str

Either 'ins' (extra base in assembly) or 'del' (missing base).

ref_allele str

The reference allele from the raw assembly.

alt_allele str

The conceptually corrected allele determined by the model.

codon_idx int

The 1-based codon index where the reading frame breaks.

annotation str

A Snippy-style text annotation (e.g., 'AA:X->fs|DNA:ATC->AT').

I/O

File Readers

Bases: Iterator[Tuple[bytes, bytes]]

A fast, memory-efficient FASTA parser implemented in Rust.

Parameters:

Name Type Description Default
path str

The file path to the FASTA file.

required

Yields:

Type Description

Tuple[bytes, bytes]: A tuple containing the header and sequence as raw bytes.

Examples:

>>> from pyfgs import FastaReader
>>> reader = FastaReader("genome.fna")
>>> for header, sequence in reader:
...     print(f"ID: {header.decode()} | Length: {len(sequence)}")

Bases: Iterator[Tuple[bytes, bytes, bytes]]

A fast, memory-efficient FASTQ parser implemented in Rust.

Parameters:

Name Type Description Default
path str

The file path to the FASTQ file.

required

Yields:

Type Description

Tuple[bytes, bytes, bytes]: A tuple containing the header, sequence,

and Phred quality string as raw bytes.

File Writers

A high-performance streaming context manager for writing Extended BED files.

Outputs a BED6+1 format, where the 7th column contains a VCF-style INFO string detailing any frameshifts present in the feature.

Parameters:

Name Type Description Default
output_path str

The destination file path.

required

write_record(genes, header, sequence)

Writes the BED intervals for a single contig to the buffer.

Parameters:

Name Type Description Default
genes List[Gene]

The list of predicted Gene objects.

required
header str

The chromosome/contig ID.

required
sequence bytes

The raw parent nucleotide sequence.

required

A high-performance streaming context manager for writing VCF v4.2 files.

Automatically translates structural frameshifts detected by the HMM into anchored VCF variants with SnpEff-compliant ANN fields.

Parameters:

Name Type Description Default
output_path str

The destination file path.

required

Examples:

>>> with pyfgs.VcfWriter("variants.vcf") as vcf:
...     vcf.write_record(genes, record.id, str(record.seq).encode())

write_record(genes, header, sequence)

Writes the frameshift variants for a single contig to the VCF buffer.

Parameters:

Name Type Description Default
genes List[Gene]

The list of predicted Gene objects.

required
header str

The chromosome/contig ID (used for the #CHROM column).

required
sequence bytes

The raw parent nucleotide sequence.

required

A high-performance streaming context manager for writing INSDC-compliant GFF3.

Automatically shifts 0-based coordinates to 1-based fully-closed coordinates. Genes containing frameshifts are flagged as pseudogene=unknown to ensure compliance with downstream translation tools (like Prokka or Bakta).

Parameters:

Name Type Description Default
output_path str

The destination file path.

required

write_record(genes, header, sequence)

Writes the GFF3 annotations for a single contig to the buffer.

Parameters:

Name Type Description Default
genes List[Gene]

The list of predicted Gene objects.

required
header str

The chromosome/contig ID.

required
sequence bytes

The raw parent nucleotide sequence.

required

A high-performance streaming context manager for writing nucleotide FASTA files.

Outputs raw, non-wrapped byte streams for maximum parsing speed by downstream tools.

Parameters:

Name Type Description Default
output_path str

The destination file path.

required

write_record(genes, header)

Writes the conceptual nucleotide sequences for a single contig.

Parameters:

Name Type Description Default
genes List[Gene]

The list of predicted Gene objects.

required
header str

The chromosome/contig ID.

required

A high-performance streaming context manager for writing amino acid FASTA files.

Outputs raw, non-wrapped byte streams of the translated proteins.

Parameters:

Name Type Description Default
output_path str

The destination file path.

required

write_record(genes, header)

Writes the translated amino acid sequences for a single contig.

Parameters:

Name Type Description Default
genes List[Gene]

The list of predicted Gene objects.

required
header str

The chromosome/contig ID.

required