`eris` 🧬🧞‍♀🔮️

Uncovering IS-mediated discord in bacterial genomes

Introduction 🌐

eris is a Python package for finding IS elements in bacterial genomes and quantifying their effect on other genes. IS elements are known to move, disrupt and even promote genes and whilst there are many tools to find IS elements in genomes, few attempt to report the resulting effects.

Like many bioinformatics tools eris is designed to work from the command-line, but is built on top of a robust API with few dependencies, and can be easily installed and incorporated into other programs, scripts and pipelines.

Installation ⚙️

Requires 🧰

python >=3.9
minimap2 >=2.18
pyrodigal >=3.5.0 (for ORF prediction only)

NOTE: eris is not yet on PyPI or Bioconda, please install from source until it is released

From source:

# First clone the repo
git clone https://github.com/tomdstanton/eris.git && cd eris
# Then install with pip
pip install .  # -e for editable, developers only!
# or install with pixi
pixi install

NOTE: For Pyrodigal, you should install the orf or dev environments with pip or pixi

# First clone the repo
git clone https://github.com/tomdstanton/eris.git && cd eris
# Then install with pip
pip install .[orf]  # -e for editable, developers only!
# or install with pixi
pixi install -e orf

Usage 🧑‍💻

The information below explains how to use the eris CLI. For API usage, please refer to the reference documentation.

Scan 🔍

Quickstart

eris scan *.{fasta,gfa,gb} > results.tsv

Arguments

usage: eris scan <genome> <genome...> [options]

========================|> eris |>========================
             Scan for IS in bacterial genomes             

Inputs:

  Note, input file(s) may be compressed
  Note, Genome(s) in FASTA/GFA format can paired up with GFA/BED
  annotation files with the same prefix.

  <genome>              Genome(s) in FASTA, GFA or Genbank format;
                        reads from stdin by default.
  -a [ ...], --annotations [ ...]
                        Optional genome annotations in GFF3/BED format;
                        These will be matched up to input genomes (FASTA/GFA) 
                        with corresponding filenames

Outputs:

  Note, text outputs accept "-" or "stdout" for stdout
  If a directory is passed, individual files will be written per input genome

  --tsv []              Path to output tabular results (default: stdout)
  --ffn []              Path to output Feature DNA sequences in FASTA format;
                        defaults to "./[genome]_eris_results.ffn" when passed without arguments.
  --faa []              Path to output Feature Amino acid sequences in FASTA format;
                        defaults to "./[genome]_eris_results.faa" when passed without arguments.
  --no-tsv-header       Suppress header in TSV output

Other options:

  --progress            Show progress bar
  -v, --version         Show version number and exit
  -h, --help            Show this help message and exit

For more help, visit: eris.readthedocs.io

The algorithm

Given a bacterial genome as an assembly (FASTA), assembly-graph (GFA) or annotation file (Genbank), the scan pipeline will align IS element nucleotide sequences from the ISFinder database against the assembly contigs using minimap2.
These alignments are then sorted by their target contig, and culled such that each region aligned contains the highest scoring query.
Each Element alignment is then considered to be a "mobile-element" Feature, and added to the list of Features on the respective contig.
If the genome is from a sequence file (FASTA/GFA), ORFs are predicted with Pyrodigal and CDS Features are added to each contig.
The genome is then converted into a Feature graph, whereby Features on each contig, sorted by their respective start coordinates, are connected to their flanking Features; and if the contig is connected to other contigs (GFA input), Features on the termini of connected contigs are also connected to each other.
Promoters are searched for in each Element Feature using a regular expression.
For each Element Feature, the Breadth-first search (BFS) algorithm traverses the Feature graph to find CDS that either overlap (part of the element) or flank the element.
The relative effect of the Element on each flanking CDS Feature is predicted.

Performance

eris scan is very fast, especially when providing annotations or once the pyrodigal.GeneFinder instance has been trained (this occurs on the first input genome); it should only take <1 second per assembly 🚀

Outputs

The main output of eris scan is the TSV tabular result which are written to stdout by default. Sequence information can be written to separate files via the respective CLI flags.

TSV tabular output

The TSV tabular output reports one line per Feature of interest, which can either be the Element itself from the resulting alignments, the CDS inside the Element, or the CDS flanking the Element. If the context is the Element, information about the element from ISFinder will be reported. If the context is a CDS, information about the ORF/translation will be reported.

The TSV columns are as follows:

Genome: The name of the input genome.
Feature: The unique identifier of the Feature in question.
Type: The Feature type, currently only CDS are supported if annotations are provided; Elements are annotated as "mobile_element" and promoters are annotated as "regulatory".
Contig: The name of the contig the Element is on.
Start: The start coordinate (0-based) of the Feature.
End: The end coordinate of the Feature.
Strand: The strand of the Feature (1 or -1).
Partial: Whether the Feature overlaps with the start or end the contig.
Element: The unique identifier of the Element from the current context.
Element_distance: Signifies the distance of the Feature from the Element from the current context.
Element_location: Signifies the relative location of the Element from the current context.
Element_strand: Signifies the relative strand of the Element from the current context.
Element_effect: Signifies the relative effect of the Element from the current context on the CDS.
Percent_identity: The percent identity of the Element.
Percent_coverage: The percent coverage of the Element.
Name: The name of the Element if known (from ISFinder).
Family: The family of the Element if known (from ISFinder).
Group: The group of the Element if known (from ISFinder).
Synonyms: The synonyms of the Element if known (from ISFinder).
Origin: The origin of the Element if known (from ISFinder).
IR: The relative location of the inverted repeat of the Element if known (from ISFinder).
DR: The relative location of the direct repeat of the Element if known (from ISFinder).

Example output

This is a single Element context in TSV format from a K. pneumoniae genome, supplied from a GFA and BED file. You can see that the Element (0db94675-30ef-4b28-a212-f26cbad7409e) contains one CDS, one promoter and is of the IS1380 family. On the same strand downstream is one CDS, annotated as a CTX-M-15 gene, known to confer ESBL resistance. As the Element contains a promoter and is on the same strand downstream, and in close proximity (48bp), we predict this gene is being upregulated.

Genome	Feature	Type	Contig	Start	End	Strand	Partial	Element	Element_distance	Element_location	Element_strand	Element_effect	Percent_identity	Percent_coverage	Name	Family	Group	Synonyms	Origin	IR	DR
ERR4920392	KNDCPA_05188	CDS	65	834	1710	-1	FALSE	0db94675-30ef-4b28-a212-f26cbad7409e	48bp	downstream	same strand	upregulated	-	-	extended-spectrum class A beta-lactamase CTX-M-15	-	-	-	-	-	-
ERR4920392	0db94675-30ef-4b28-a212-f26cbad7409e	mobile_element	65	1758	2157	-1	TRUE	0db94675-30ef-4b28-a212-f26cbad7409e	-	-	-	-	100	24.0942029	ISEc9	IS1380		None	Escherichia coli	13/22	NA
ERR4920392	KNDCPA_05189	CDS	65	1862	1985	-1	FALSE	0db94675-30ef-4b28-a212-f26cbad7409e	-	inside	same strand	-	-	-	MobQ family relaxase	-	-	-	-	-	-
ERR4920392	0db94675-30ef-4b28-a212-f26cbad7409e_promoter_1	regulatory	65	2053	2083	-1	FALSE	0db94675-30ef-4b28-a212-f26cbad7409e	-	inside	-	-	-	-	-	-	-	-	-	-	-

Pan 🦘

eris pan quantifies the effect of IS-mediated events in pan-genome graphs.

Coming soon!

Map 🗺️

Re-implementation of ISMapper, with reference-based and reference-free options.