API Reference
This section provides the complete API documentation for pyfgs, parsed directly from the type stubs.
Main Classes
The core ab initio gene prediction engine.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model | Model | The sequencing error model to use. | required |
| whole_genome | bool | If False, the HMM permits internal frameshifts (insertions/deletions) typical of sequencing errors or pseudogenes. If True, strictly enforces contiguous reading frames. Defaults to True if | None |
Examples:
>>> from Bio import SeqIO
>>> import pyfgs
>>> record = SeqIO.read("genome.fasta", "fasta")
>>>
>>> # Keep whole_genome=False to allow pseudogene/frameshift detection
>>> finder = pyfgs.GeneFinder(pyfgs.Model.Complete, whole_genome=False)
>>> # bytes(record.seq) uses Biopython's public API instead of the private _data attribute
>>> genes = finder.find_genes(bytes(record.seq))
find_genes(sequence)
Predicts open reading frames in a given DNA sequence.
This method releases the Python GIL, allowing for safe, lock-free multi-threading across multiple CPU cores.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | bytes | The raw nucleotide sequence. | required |
Returns:
| Type | Description |
|---|---|
| List[Gene] | A list of predicted Gene objects. |
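Because find_genes releases the GIL, independent contigs can be dispatched from an ordinary thread pool. The sketch below shows one way to arrange that. So that it runs without pyfgs installed, it uses `find_genes_stub`, a hypothetical stand-in that only reports ATG offsets; `predict_all` is likewise an illustrative helper and not part of the library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def predict_all(
    predict: Callable[[bytes], List],
    contigs: Dict[str, bytes],
    workers: int = 4,
) -> Dict[str, List]:
    """Run `predict` on every contig in a thread pool, keyed by contig ID."""
    names = list(contigs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with `names`.
        results = list(pool.map(predict, (contigs[n] for n in names)))
    return dict(zip(names, results))


def find_genes_stub(sequence: bytes) -> List[int]:
    """Hypothetical stand-in for finder.find_genes: offsets of every ATG."""
    hits = []
    pos = sequence.find(b"ATG")
    while pos != -1:
        hits.append(pos)
        pos = sequence.find(b"ATG", pos + 1)
    return hits


contigs = {"contig_1": b"ATGAAATGA", "contig_2": b"CCCATGCC"}
predictions = predict_all(find_genes_stub, contigs)
```

In real use, the bound method `finder.find_genes` would presumably be passed in place of the stub, with one shared GeneFinder instance or one per thread depending on what the library guarantees.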
The available sequencing error models for the Hidden Markov Model.
The model alters the transition probabilities to account for expected sequencing error rates, making it more or less forgiving of frameshifts.
Attributes:
| Name | Type | Description |
|---|---|---|
| Illumina1 | Model | Illumina reads with ~0.1% error rate. |
| Illumina5 | Model | Illumina reads with ~0.5% error rate. |
| Illumina10 | Model | Illumina reads with ~1% error rate. |
| Sanger5 | Model | Sanger reads with ~0.5% error rate. |
| Sanger10 | Model | Sanger reads with ~1% error rate. |
| Pyro454_5 | Model | 454 pyrosequencing reads with ~0.5% error rate. |
| Pyro454_10 | Model | 454 pyrosequencing reads with ~1% error rate. |
| Pyro454_30 | Model | 454 pyrosequencing reads with ~3% error rate. |
| Complete | Model | Complete genomic sequences without expected sequencing errors. |
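One way to choose among these attributes is to match the platform and expected error rate against the values listed in the table. The helper below is a hypothetical sketch: `pick_model_name`, the platform keys, and the nearest-rate rule are assumptions made for illustration, and it returns only the attribute name as a string so that it runs without pyfgs.

```python
from typing import Dict, Optional

# Approximate error rates (percent) copied from the attribute table.
ERROR_RATES: Dict[str, Dict[str, float]] = {
    "illumina": {"Illumina1": 0.1, "Illumina5": 0.5, "Illumina10": 1.0},
    "sanger": {"Sanger5": 0.5, "Sanger10": 1.0},
    "454": {"Pyro454_5": 0.5, "Pyro454_10": 1.0, "Pyro454_30": 3.0},
}


def pick_model_name(platform: Optional[str], error_pct: float = 0.0) -> str:
    """Return the Model attribute name closest to the expected error rate."""
    if platform is None:
        # Assembled genomes: no sequencing errors expected.
        return "Complete"
    candidates = ERROR_RATES[platform.lower()]
    return min(candidates, key=lambda name: abs(candidates[name] - error_pct))
```

With pyfgs installed, `getattr(pyfgs.Model, pick_model_name("illumina", 0.4))` should then resolve to the corresponding member.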
Result Classes
Represents a single predicted Open Reading Frame (ORF).
Attributes:
| Name | Type | Description |
|---|---|---|
| start | int | The 0-based, inclusive start coordinate. |
| end | int | The 0-based, exclusive end coordinate. |
| strand | int | The strand of the feature (1 for forward, -1 for reverse). |
| frame | int | The reading frame. |
| score | float | The log-probability score of the HMM prediction. |
| insertions | List[int] | 1-based global coordinates of predicted insertions. |
| deletions | List[int] | 1-based global coordinates of predicted deletions. |
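The two coordinate conventions in this table are easy to mix up, so a short illustration may help. The values below are hypothetical stand-ins for Gene.start, Gene.end, and one entry of Gene.insertions; no pyfgs objects are involved.

```python
contig = b"CCATGAAATAGGG"

# Hypothetical values standing in for Gene.start and Gene.end.
start, end = 2, 11

# Half-open coordinates map directly onto a Python slice.
gene_dna = contig[start:end]
gene_length = end - start

# insertions/deletions are 1-based, so subtract 1 before indexing.
insertion_coord = 5  # hypothetical entry from Gene.insertions
base_at_insertion = contig[insertion_coord - 1 : insertion_coord]
```

Keeping start and end half-open means the feature length is simply `end - start`, with no off-by-one adjustment.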
mutations(sequence)
Extracts structural variant objects for any predicted frameshifts.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequence | bytes | The raw parent contig sequence, used to determine VCF anchored alleles. | required |
Returns:
| Type | Description |
|---|---|
| List[Mutation] | A list of structured mutation objects. |
sequence()
Retrieves the raw nucleotide sequence of the predicted gene.
If the gene is on the reverse strand, the sequence is automatically reverse-complemented. If the gene contains frameshifts, the returned sequence represents the conceptual (corrected) HMM path.
Returns:
| Type | Description |
|---|---|
| bytes | The DNA sequence. |
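For a reverse-strand gene, the returned bytes correspond to the reverse complement of the parent interval. The pure-Python sketch below illustrates only that step; it does not reproduce the frameshift correction described above, and both function names are illustrative rather than part of pyfgs.

```python
# Translation table mapping each base to its complement (case preserved).
COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def reverse_complement(dna: bytes) -> bytes:
    """Complement every base, then reverse the result."""
    return dna.translate(COMPLEMENT)[::-1]


def reverse_strand_slice(contig: bytes, start: int, end: int) -> bytes:
    """Read a 0-based, half-open interval as seen from the reverse strand."""
    return reverse_complement(contig[start:end])
```

Bytes outside the translation table, such as N, pass through unchanged, which happens to be the right behaviour for that particular ambiguity code.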
translation()
Translates the predicted gene into an amino acid sequence.
Alternative start codons (e.g., GTG, TTG) are automatically translated to Methionine (M) if the GeneFinder was initialized with whole_genome=True.
Returns:
| Type | Description |
|---|---|
| bytes | The amino acid sequence. |
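The start-codon rule can be sketched in plain Python. This is a hedged illustration that assumes the standard genetic code and only the GTG and TTG alternative starts named above; the `translate` function and its `alt_start_as_met` flag are hypothetical, and whether pyfgs includes the stop symbol in its output is not stated here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
ALT_STARTS = {"GTG", "TTG"}


def translate(dna: bytes, alt_start_as_met: bool = True) -> bytes:
    """Translate in frame 0; optionally read an alternative start as M."""
    seq = dna.decode("ascii").upper()
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i : i + 3]
        if i == 0 and alt_start_as_met and codon in ALT_STARTS:
            protein.append("M")
        else:
            # Codons containing ambiguity codes fall back to X.
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein).encode("ascii")
```

Only the first codon is rewritten, so an internal GTG still translates to valine.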
Represents a frameshift mutation (insertion or deletion) detected by the HMM.
Attributes:
| Name | Type | Description |
|---|---|---|
| pos | int | The 0-based index of the mutation in the global assembly. (Note: This is mathematically identical to the 1-based VCF anchor position). |
| mut_type | str | Either 'ins' (extra base in assembly) or 'del' (missing base). |
| ref_allele | str | The reference allele from the raw assembly. |
| alt_allele | str | The conceptually corrected allele determined by the model. |
| codon_idx | int | The 1-based codon index where the reading frame breaks. |
| annotation | str | A Snippy-style text annotation (e.g., 'AA:X->fs|DNA:ATC->AT'). |
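The note on pos follows from the VCF convention of anchoring an indel on the preceding base: a base at 0-based index i is preceded by the base at 0-based index i - 1, which is 1-based position i. The sketch below works through the 'ins' case (an extra base in the assembly) under that convention. `anchor_extra_base` is a hypothetical helper, not the library's implementation, and how pyfgs builds alleles for 'del' events is not covered.

```python
from typing import Tuple


def anchor_extra_base(contig: bytes, index: int) -> Tuple[int, str, str]:
    """Build (POS, REF, ALT) for removing the base at 0-based `index`.

    VCF anchors an indel on the preceding base. That base sits at 0-based
    index - 1, which is 1-based position `index`, so POS equals `index`.
    """
    if index < 1:
        raise ValueError("this sketch needs a base before the indel")
    anchor = contig[index - 1 : index].decode("ascii")
    extra = contig[index : index + 1].decode("ascii")
    return index, anchor + extra, anchor
```

For the contig `ATCGG` and an extra base at index 2, this yields POS 2, REF `TC`, ALT `T`.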
I/O
File Readers
Bases: Iterator[Tuple[bytes, bytes]]
A fast, memory-efficient FASTA parser implemented in Rust.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | The file path to the FASTA file. | required |
Yields:
| Type | Description |
|---|---|
| Tuple[bytes, bytes] | A tuple containing the header and sequence as raw bytes. |
Examples:
>>> from pyfgs import FastaReader
>>> reader = FastaReader("genome.fna")
>>> for header, sequence in reader:
... print(f"ID: {header.decode()} | Length: {len(sequence)}")
Bases: Iterator[Tuple[bytes, bytes, bytes]]
A fast, memory-efficient FASTQ parser implemented in Rust.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str | The file path to the FASTQ file. | required |
Yields:
| Type | Description |
|---|---|
| Tuple[bytes, bytes, bytes] | A tuple containing the header, sequence, and Phred quality string as raw bytes. |
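Since the quality string arrives as raw bytes, it can be decoded without an intermediate str. The sketch below assumes the common Phred+33 encoding, which this reference does not state explicitly; both helper names are illustrative.

```python
from typing import List


def phred_scores(quality: bytes, offset: int = 33) -> List[int]:
    """Decode a quality string into integer Phred scores."""
    # Iterating over bytes yields ints, so no ord() call is needed.
    return [q - offset for q in quality]


def mean_error_probability(quality: bytes) -> float:
    """Average per-base error probability, using P = 10 ** (-Q / 10)."""
    scores = phred_scores(quality)
    return sum(10 ** (-q / 10) for q in scores) / len(scores)
```

A mean error probability computed this way could in principle guide which sequencing error model to pick for a read set.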
File Writers
A high-performance streaming context manager for writing Extended BED files.
Outputs a BED6+1 format, where the 7th column contains a VCF-style INFO string detailing any frameshifts present in the feature.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str | The destination file path. | required |
write_record(genes, header, sequence)
Writes the BED intervals for a single contig to the buffer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| genes | List[Gene] | The list of predicted Gene objects. | required |
| header | str | The chromosome/contig ID. | required |
| sequence | bytes | The raw parent nucleotide sequence. | required |
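To make the BED6+1 layout concrete, the sketch below formats a single row: the six standard BED columns followed by one extra INFO-style column. It is only an illustration. The feature name, the score formatting, and the `INS=5` key are invented for the example and may differ from what the writer actually emits.

```python
def bed_line(
    chrom: str,
    start: int,
    end: int,
    name: str,
    score: float,
    strand: int,
    info: str = ".",
) -> str:
    """Format one tab-separated BED6+1 row (no trailing newline)."""
    # BED uses 0-based, half-open intervals, so start/end pass through as-is.
    strand_symbol = "+" if strand == 1 else "-"
    fields = [chrom, str(start), str(end), name, f"{score:.2f}", strand_symbol, info]
    return "\t".join(fields)


row = bed_line("contig_1", 2, 11, "gene_1", -12.5, 1, "INS=5")
```

Because Gene coordinates are already 0-based and half-open, no shifting is needed for BED output.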
A high-performance streaming context manager for writing VCF v4.2 files.
Automatically translates structural frameshifts detected by the HMM into
anchored VCF variants with SnpEff-compliant ANN fields.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str | The destination file path. | required |
Examples:
>>> with pyfgs.VcfWriter("variants.vcf") as vcf:
... vcf.write_record(genes, record.id, str(record.seq).encode())
write_record(genes, header, sequence)
Writes the frameshift variants for a single contig to the VCF buffer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| genes | List[Gene] | The list of predicted Gene objects. | required |
| header | str | The chromosome/contig ID (used for the #CHROM column). | required |
| sequence | bytes | The raw parent nucleotide sequence. | required |
A high-performance streaming context manager for writing INSDC-compliant GFF3.
Automatically shifts 0-based coordinates to 1-based fully-closed coordinates.
Genes containing frameshifts are flagged as pseudogene=unknown to ensure
compliance with downstream translation tools (like Prokka or Bakta).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str | The destination file path. | required |
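The coordinate shift mentioned above amounts to adding 1 to the start and leaving the end untouched, because an exclusive 0-based end equals the inclusive 1-based end. The sketch below applies that rule to one nine-column GFF3 row; `gff3_line`, the `sketch` source value, and the fixed `CDS` type and phase are placeholders rather than the writer's actual output.

```python
def gff3_line(seqid: str, start: int, end: int, strand: int, gene_id: str) -> str:
    """Format a 0-based, half-open interval as a 1-based, closed GFF3 row."""
    gff_start = start + 1  # shift the inclusive start to 1-based
    gff_end = end  # exclusive 0-based end == inclusive 1-based end
    strand_symbol = "+" if strand == 1 else "-"
    columns = [
        seqid,
        "sketch",  # source column: placeholder, not the writer's value
        "CDS",
        str(gff_start),
        str(gff_end),
        ".",  # score omitted in this sketch
        strand_symbol,
        "0",  # phase
        f"ID={gene_id}",
    ]
    return "\t".join(columns)


line = gff3_line("contig_1", 2, 11, -1, "gene_1")
```

The feature length is preserved by the conversion: `11 - 2` in half-open form equals `11 - 3 + 1` in closed form.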
write_record(genes, header, sequence)
Writes the GFF3 annotations for a single contig to the buffer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| genes | List[Gene] | The list of predicted Gene objects. | required |
| header | str | The chromosome/contig ID. | required |
| sequence | bytes | The raw parent nucleotide sequence. | required |
A high-performance streaming context manager for writing nucleotide FASTA files.
Outputs raw, non-wrapped byte streams for maximum parsing speed by downstream tools.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str | The destination file path. | required |
write_record(genes, header)
Writes the conceptual nucleotide sequences for a single contig.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| genes | List[Gene] | The list of predicted Gene objects. | required |
| header | str | The chromosome/contig ID. | required |
A high-performance streaming context manager for writing amino acid FASTA files.
Outputs raw, non-wrapped byte streams of the translated proteins.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| output_path | str | The destination file path. | required |
write_record(genes, header)
Writes the translated amino acid sequences for a single contig.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| genes | List[Gene] | The list of predicted Gene objects. | required |
| header | str | The chromosome/contig ID. | required |