API Reference

This section provides the complete API documentation for pyfgs, parsed directly from the type stubs.

Main Classes

The core ab initio gene prediction engine.

Parameters:

Name	Type	Description	Default
`model`	`Model`	The sequencing error model to use.	required
`whole_genome`	`bool`	If False, the HMM permits internal frameshifts (insertions/deletions) typical of sequencing errors or pseudogenes. If True, strictly enforces contiguous reading frames. Defaults to True if `Model.Complete` is used, otherwise False.	`None`

Examples:

>>> from Bio import SeqIO
>>> import pyfgs
>>> record = SeqIO.read("genome.fasta", "fasta")
>>>
>>> # Keep whole_genome=False to allow pseudogene/frameshift detection
>>> finder = pyfgs.GeneFinder(pyfgs.Model.Complete, whole_genome=False)
>>> genes = finder.find_genes(record.seq._data)

`find_genes(sequence)`

Predicts open reading frames in a given DNA sequence.

This method releases the Python GIL, allowing for safe, lock-free multi-threading across multiple CPU cores.

Parameters:

Name	Type	Description	Default
`sequence`	`bytes`	The raw nucleotide sequence.	required

Returns:

Type	Description
`List[Gene]`	List[Gene]: A list of predicted Gene objects.

The available sequencing error models for the Hidden Markov Model.

The model alters the transition probabilities to account for expected sequencing error rates, making it more or less forgiving of frameshifts.

Attributes:

Name	Type	Description
`Illumina1`	`Model`	Illumina reads with ~0.1% error rate.
`Illumina5`	`Model`	Illumina reads with ~0.5% error rate.
`Illumina10`	`Model`	Illumina reads with ~1% error rate.
`Sanger5`	`Model`	Sanger reads with ~0.5% error rate.
`Sanger10`	`Model`	Sanger reads with ~1% error rate.
`Pyro454_5`	`Model`	454 pyrosequencing reads with ~0.5% error rate.
`Pyro454_10`	`Model`	454 pyrosequencing reads with ~1% error rate.
`Pyro454_30`	`Model`	454 pyrosequencing reads with ~3% error rate.
`Complete`	`Model`	Complete genomic sequences without expected sequencing errors.

Examples:

>>> import pyfgs
>>> model = pyfgs.Model.Complete

Result Classes

Represents a single predicted Open Reading Frame (ORF).

Attributes:

Name	Type	Description
`start`	`int`	The 0-based, inclusive start coordinate.
`end`	`int`	The 0-based, exclusive end coordinate.
`strand`	`int`	The strand of the feature (1 for forward, -1 for reverse).
`frame`	`int`	The reading frame.
`score`	`float`	The log-probability score of the HMM prediction.
`insertions`	`List[int]`	1-based global coordinates of predicted insertions.
`deletions`	`List[int]`	1-based global coordinates of predicted deletions.

`mutations(sequence)`

Extracts structural variant objects for any predicted frameshifts.

Parameters:

Name	Type	Description	Default
`sequence`	`bytes`	The raw parent contig sequence, used to determine VCF anchored alleles.	required

Returns:

Type	Description
`List[Mutation]`	List[Mutation]: A list of structured mutation objects.

Examples:

>>> seq_bytes = str(record.seq).encode()
>>> for gene in genes: 
...     for mut in gene.mutations(seq_bytes):
...         print(f"Frameshift at codon {mut.codon_idx}: {mut.annotation}")

`sequence()`

Retrieves the raw nucleotide sequence of the predicted gene.

If the gene is on the reverse strand, the sequence is automatically reverse-complemented. If the gene contains frameshifts, the returned sequence represents the conceptual (corrected) HMM path.

Returns:

Name	Type	Description
`bytes`	`bytes`	The DNA sequence.

`translation()`

Translates the predicted gene into an amino acid sequence.

Alternative start codons (e.g., GTG, TTG) are automatically translated to Methionine (M) if the model was initialized with whole_genome=True.

Returns:

Name	Type	Description
`bytes`	`bytes`	The amino acid sequence.

Represents a frameshift mutation (insertion or deletion) detected by the HMM.

Attributes:

Name	Type	Description
`pos`	`int`	The 0-based index of the mutation in the global assembly. (Note: This is mathematically identical to the 1-based VCF anchor position).
`mut_type`	`str`	Either 'ins' (extra base in assembly) or 'del' (missing base).
`ref_allele`	`str`	The reference allele from the raw assembly.
`alt_allele`	`str`	The conceptually corrected allele determined by the model.
`codon_idx`	`int`	The 1-based codon index where the reading frame breaks.
`annotation`	`str`	A Snippy-style text annotation (e.g., 'AA:X->fs\|DNA:ATC->AT').

I/O

File Readers

Bases: Iterator[Tuple[bytes, bytes]]

A fast, memory-efficient FASTA parser implemented in Rust.

Parameters:

Name	Type	Description	Default
`path`	`str`	The file path to the FASTA file.	required

Yields:

Type	Description
	Tuple[bytes, bytes]: A tuple containing the header and sequence as raw bytes.

Examples:

>>> from pyfgs import FastaReader
>>> reader = FastaReader("genome.fna")
>>> for header, sequence in reader:
...     print(f"ID: {header.decode()} | Length: {len(sequence)}")

Bases: Iterator[Tuple[bytes, bytes, bytes]]

A fast, memory-efficient FASTQ parser implemented in Rust.

Parameters:

Name	Type	Description	Default
`path`	`str`	The file path to the FASTQ file.	required

Yields:

Type	Description
	Tuple[bytes, bytes, bytes]: A tuple containing the header, sequence,
	and Phred quality string as raw bytes.

File Writers

A high-performance streaming context manager for writing Extended BED files.

Outputs a BED6+1 format, where the 7th column contains a VCF-style INFO string detailing any frameshifts present in the feature.

Parameters:

Name	Type	Description	Default
`output_path`	`str`	The destination file path.	required

`write_record(genes, header, sequence)`

Writes the BED intervals for a single contig to the buffer.

Parameters:

Name	Type	Description	Default
`genes`	`List[Gene]`	The list of predicted Gene objects.	required
`header`	`str`	The chromosome/contig ID.	required
`sequence`	`bytes`	The raw parent nucleotide sequence.	required

A high-performance streaming context manager for writing VCF v4.2 files.

Automatically translates structural frameshifts detected by the HMM into anchored VCF variants with SnpEff-compliant ANN fields.

Parameters:

Name	Type	Description	Default
`output_path`	`str`	The destination file path.	required

Examples:

>>> with pyfgs.VcfWriter("variants.vcf") as vcf:
...     vcf.write_record(genes, record.id, str(record.seq).encode())

`write_record(genes, header, sequence)`

Writes the frameshift variants for a single contig to the VCF buffer.

Parameters:

Name	Type	Description	Default
`genes`	`List[Gene]`	The list of predicted Gene objects.	required
`header`	`str`	The chromosome/contig ID (used for the #CHROM column).	required
`sequence`	`bytes`	The raw parent nucleotide sequence.	required

A high-performance streaming context manager for writing INSDC-compliant GFF3.

Automatically shifts 0-based coordinates to 1-based fully-closed coordinates. Genes containing frameshifts are flagged as pseudogene=unknown to ensure compliance with downstream translation tools (like Prokka or Bakta).

Parameters:

Name	Type	Description	Default
`output_path`	`str`	The destination file path.	required

`write_record(genes, header, sequence)`

Writes the GFF3 annotations for a single contig to the buffer.

Parameters:

Name	Type	Description	Default
`genes`	`List[Gene]`	The list of predicted Gene objects.	required
`header`	`str`	The chromosome/contig ID.	required
`sequence`	`bytes`	The raw parent nucleotide sequence.	required

A high-performance streaming context manager for writing nucleotide FASTA files.

Outputs raw, non-wrapped byte streams for maximum parsing speed by downstream tools.

Parameters:

Name	Type	Description	Default
`output_path`	`str`	The destination file path.	required

`write_record(genes, header)`

Writes the conceptual nucleotide sequences for a single contig.

Parameters:

Name	Type	Description	Default
`genes`	`List[Gene]`	The list of predicted Gene objects.	required
`header`	`str`	The chromosome/contig ID.	required

A high-performance streaming context manager for writing amino acid FASTA files.

Outputs raw, non-wrapped byte streams of the translated proteins.

Parameters:

Name	Type	Description	Default
`output_path`	`str`	The destination file path.	required

`write_record(genes, header)`

Writes the translated amino acid sequences for a single contig.

Parameters:

Name	Type	Description	Default
`genes`	`List[Gene]`	The list of predicted Gene objects.	required
`header`	`str`	The chromosome/contig ID.	required