Tutorials and Explanations

Below is a user-friendly summary of how the different visualizations (Heat Map, Bubble Map, Pie Chart, Time Series, and Phylogenetic Tree) are used to explore genome data. You will also find how these visualizations are prepared in the backend, from coordinate cleaning and text classification to phylogenetic tree generation.

1. Overview of the Five Visualizations
2. Data Preprocessing
3. Bringing It All Together
4. Creating a signature using Sourmash

1. Overview of the Five Visualizations

1.1 Pie Chart

What It Shows: A breakdown of genome-sequenced isolates by “isolation source category,” such as “Animal associated,” “Human associated,” and “Soil associated.”

How It Works:

All isolates are grouped into categories (e.g., “Animal associated,” “Plant associated,” or “Wastewater associated,” etc.).
Each category is assigned a distinct color and slice size to represent its share of the total samples.
Hovering over a slice often displays the category name, number of isolates, and percentage of the total.

Why It's Useful:

Offers a quick sense of the most common sources (hosts/environments) for the genome-sequenced isolates that belong to the same LINgroup.
Helpful for researchers to infer the ecological niche of a group of genome-sequenced isolates.

Points to Note:

If a slice is hovered over, you may see it expand outward for emphasis.
Labels might be placed around the outside with connecting lines to each slice.
A “No Data Available” overlay appears if there are no records to categorize.

1.2 Time Series

What It Shows: How the number of collected genomes changes over time—typically by year or month.

How It Works:

The x-axis represents a timeline (years or months), and the y-axis represents the count of genomes collected at each timepoint.
A line or area chart visually depicts increases or decreases in sampling.
Some implementations provide a slider to filter the date range:
- Wide range (10+ years) → annual data.
- Narrow range (<10 years) → monthly data for finer detail.
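The range-dependent granularity above can be sketched as follows. This is a minimal illustration, not the site's actual code: the function name `bucket_counts`, the `annual_threshold_years` parameter, and the "YYYY" / "YYYY-MM" bucket labels are assumptions made for the example.

```python
from collections import Counter
from datetime import date

def bucket_counts(dates, start, end, annual_threshold_years=10):
    """Count genomes per year (wide range) or per month (narrow range)."""
    # Assumption: the range width is measured in whole calendar years.
    annual = (end.year - start.year) >= annual_threshold_years
    selected = [d for d in dates if start <= d <= end]
    if annual:
        keys = [f"{d.year}" for d in selected]
    else:
        keys = [f"{d.year}-{d.month:02d}" for d in selected]
    # Sort the buckets chronologically so they can be plotted left to right.
    return dict(sorted(Counter(keys).items()))

dates = [date(2019, 3, 4), date(2019, 3, 20), date(2019, 7, 1), date(2021, 1, 15)]
narrow = bucket_counts(dates, date(2019, 1, 1), date(2021, 12, 31))
wide = bucket_counts(dates, date(2010, 1, 1), date(2021, 12, 31))
print(narrow)  # {'2019-03': 2, '2019-07': 1, '2021-01': 1}
print(wide)    # {'2019': 3, '2021': 1}
```

An empty result corresponds to the case where a “No Data Available” overlay would be shown.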

Why It's Useful:

Highlights how collection of genome-sequenced isolates of a group has changed over time.
Reveals patterns indicative of putative outbreaks. For example, a recent surge in highly similar genome-sequenced isolates over a short time frame could indicate an emerging disease outbreak.

Points to Note:

Hovering over a point on the line typically shows an exact count and date.
The chart auto-scales the vertical axis to the data in the selected range.
A “No Data Available” overlay may appear if no samples fall within a chosen time frame.
If a sampling point does not indicate the month of collection, its date is set to the 1st day of that year.
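The handling of incomplete dates can be sketched like this. The accepted formats ("YYYY", "YYYY-MM", "YYYY-MM-DD") and the function name `parse_collection_date` are assumptions for illustration; defaulting a missing day to the 1st of the month is also an assumption, since the text above only describes the missing-month case.

```python
from datetime import date

def parse_collection_date(raw):
    """Parse 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'; return None if unparseable."""
    parts = raw.strip().split("-")
    if not 1 <= len(parts) <= 3 or not all(p.isdigit() for p in parts):
        return None
    year = int(parts[0])
    # A missing month falls back to January, a missing day to the 1st.
    month = int(parts[1]) if len(parts) > 1 else 1
    day = int(parts[2]) if len(parts) > 2 else 1
    try:
        return date(year, month, day)
    except ValueError:  # e.g. month 13 or day 32
        return None

print(parse_collection_date("2018"))     # 2018-01-01
print(parse_collection_date("2018-06"))  # 2018-06-01
print(parse_collection_date("unknown"))  # None
```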

1.3 Heat Map

What It Shows: A choropleth map indicating the distribution of genome-sequenced isolates by country. Darker shades represent higher counts of isolates.

How It Works:

Each country is drawn on a map using a geographic projection.
The color intensity is driven by the number of genome-sequenced isolates collected from that country.
Hovering over a country typically reveals a tooltip showing the country name and the exact number of reported isolates.

Why It's Useful:

Quickly identifies geographic “hotspots” where many samples originate.
Allows comparisons between countries at a glance.

Points to Note:

You can usually zoom/pan if the feature is enabled.
If there is no data, you will see a “No Data Available” overlay.
Because some countries may have drastically larger numbers of samples than others, the color scale may be adjusted (e.g., using a cube root) so differences remain visually meaningful.
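To see why a cube-root scale helps, here is a small sketch that maps per-country counts to a color intensity between 0 and 1. The function name and the normalization by the largest value are assumptions for this example; the actual color mapping on the site may differ.

```python
def cube_root_intensity(counts):
    """Map per-country isolate counts to 0..1 intensities on a cube-root scale."""
    if not counts:
        return {}
    max_root = max(counts.values()) ** (1 / 3)
    if max_root == 0:
        return {country: 0.0 for country in counts}
    return {country: (n ** (1 / 3)) / max_root for country, n in counts.items()}

counts = {"USA": 8000, "France": 1000, "Kenya": 8}
intensity = cube_root_intensity(counts)
for country, value in intensity.items():
    print(country, round(value, 3))
```

On a linear scale, Kenya (8 isolates) would get an intensity of 0.001 next to the USA (8000 isolates) and be practically invisible. On the cube-root scale it gets about 0.1, so it remains distinguishable from countries with no data.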

1.4 Bubble Map

What It Shows: Circles plotted at specific coordinates (latitude/longitude) to indicate the precise location of collection sites (if available).

How It Works:

Each genome entry is represented by a circle placed on the map at its coordinates.
The circle's size corresponds to how many genomes were collected at that location.
Circle color may indicate the certainty or precision of the coordinate (e.g., exact location vs. country center).

Why It's Useful:

Pinpoints exact spots (for those isolates with precise GPS data) rather than just a country-wide view.
Lets you see clusters of sampling sites or single outliers in remote regions.

Points to Note:

Hovering over a bubble shows the number of genomes and their latitude/longitude.
You can zoom and pan to inspect dense clusters more easily.
If valid coordinates are missing for many samples, large circles may appear in the “center” of a country as a fallback (if available).
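The grouping of genomes into bubbles can be sketched as below. This is only an illustration: rounding coordinates to four decimals to decide that two genomes share a location, and scaling the radius with the square root of the count (so that the circle area grows with the count), are assumptions rather than documented behavior.

```python
import math
from collections import Counter

def bubble_sizes(records, precision=4):
    """Count how many genomes share each rounded (latitude, longitude) pair."""
    return dict(Counter(
        (round(lat, precision), round(lon, precision)) for lat, lon in records
    ))

def bubble_radius(count, base_radius=3.0):
    """Radius in pixels; area is proportional to the number of genomes."""
    return base_radius * math.sqrt(count)

records = [(37.2296, -80.4139), (37.2296, -80.4139), (48.8566, 2.3522)]
sizes = bubble_sizes(records)
for (lat, lon), count in sizes.items():
    print(lat, lon, count, round(bubble_radius(count), 2))
```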

1.5 Phylogenetic Tree

What It Shows:
A phylogenetic tree based on pairwise genome similarities among selected genomes. The resulting tree (in Newick format) can be displayed in either a traditional left-to-right layout or a radial layout.

How It Works:

Signature Files & K-mer Size
We use .sig (sourmash signature) files with a k-mer size of k = 21 to generate the tree.
Genome Count Constraints
You must select at least 2 genomes (to form a valid tree) and no more than 200 (to avoid excessively large computations).
Neighbor-Joining from Jaccard Similarities
Each genome's signature is used to compute pairwise Jaccard similarity scores with every other genome. Each similarity is converted to a distance (Distance = 1 - Similarity), and the resulting distance matrix feeds into a standard Neighbor-Joining algorithm (via the Biopython package) to build the final tree.
Job Output Caching
The resulting tree is saved in the system and reused the next time a LINgroup search is run on the same group.
User Interactions
On the web interface, you can choose a radial or left-to-right tree layout. Hovering or clicking on nodes highlights branches or clades. Buttons let you expand/compress the tree, download it, or switch between “simple tree” (showing genome IDs) and “metadata tree” (showing labels enriched with metadata).
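The distance computation described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: it compares exact k-mer sets, whereas sourmash signatures are MinHash sketches of those sets, and the function names are hypothetical. The toy example uses k = 3 so the numbers are easy to follow; the site uses k = 21.

```python
def kmers(sequence, k=21):
    """Return the set of all k-length substrings of a DNA sequence."""
    seq = sequence.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def distance_matrix(genomes, k=21, min_genomes=2, max_genomes=200):
    """Return (names, lower-triangular matrix) with distance = 1 - similarity."""
    if not min_genomes <= len(genomes) <= max_genomes:
        raise ValueError(f"select between {min_genomes} and {max_genomes} genomes")
    names = list(genomes)
    sets = [kmers(genomes[name], k) for name in names]
    matrix = []
    for i in range(len(names)):
        # Row i holds distances to genomes 0..i-1, then 0.0 on the diagonal.
        row = [1.0 - jaccard(sets[i], sets[j]) for j in range(i)]
        row.append(0.0)
        matrix.append(row)
    return names, matrix

genomes = {"g1": "ACGTAC", "g2": "ACGTTT", "g3": "TTTTTT"}
names, matrix = distance_matrix(genomes, k=3)
for name, row in zip(names, matrix):
    print(name, [round(d, 3) for d in row])
```

Here g1 and g2 share 2 of 6 distinct 3-mers, giving a similarity of 1/3 and a distance of 2/3. The lower-triangular layout with zeros on the diagonal matches what Biopython's `DistanceMatrix` class in `Bio.Phylo.TreeConstruction` accepts, so `names` and `matrix` could be passed to it and then to a Neighbor-Joining tree constructor.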

Why It's Useful:

Reveals evolutionary or taxonomic relationships, indicating how closely related different genomes are.
Helps researchers see whether genomic similarities align with specific regions or isolation sources (e.g., do certain clades correlate with a particular environment?).

Points to Note:

Leaf Labeling: You can toggle between raw IDs or enriched labels (e.g., organism name, strain, isolation date).
Layouts: Switch between radial and linear views based on personal or research preference.

2. Data Preprocessing

Accurate and meaningful visualizations require standardized and reliable data. Below is a summary of the key steps taken to ensure the charts represent the underlying genome information correctly.

2.1 Initial Genome Metadata Extraction from NCBI

Data originates from NCBI, which provides genomic metadata (including “Isolation Host,” “Isolation Source,” and “GPS Coordinates”). The raw tables can be quite inconsistent: some columns might be well-formatted (Country), while others are free text.

2.2 Preprocessing for World Maps (location data)

To ensure each sample has valid latitude/longitude coordinates and that country names are consistent, we:

Parsed various coordinate formats to convert them into a standardized (latitude, longitude) pair known as the “Decimal Degrees” format.
Applied a fallback “country center” if coordinate parsing failed.
Assigned an “accuracy” code reflecting the precision of the coordinate data—exact location, regional approximation, or country-level center.
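The three steps above can be sketched as follows. This is a minimal illustration, not the actual pipeline: it handles only two input formats ("37.23 N 80.41 W" and "48.8566, 2.3522"), the `COUNTRY_CENTERS` table holds rough placeholder values, and the accuracy labels "exact" and "country_center" are hypothetical names. The regional-approximation level mentioned above is left out for brevity.

```python
import re

# Hypothetical lookup table with approximate centers, for illustration only.
COUNTRY_CENTERS = {"USA": (39.8, -98.6), "France": (46.2, 2.2)}

_HEMI = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])[,\s]+(\d+(?:\.\d+)?)\s*([EW])\s*$", re.I
)
_PLAIN = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*[,\s]\s*(-?\d+(?:\.\d+)?)\s*$")

def to_decimal_degrees(raw):
    """Convert a coordinate string to (lat, lon) in decimal degrees, or None."""
    match = _HEMI.match(raw)
    if match:
        # South and West hemispheres become negative values.
        lat = float(match.group(1)) * (-1 if match.group(2).upper() == "S" else 1)
        lon = float(match.group(3)) * (-1 if match.group(4).upper() == "W" else 1)
    else:
        match = _PLAIN.match(raw)
        if not match:
            return None
        lat, lon = float(match.group(1)), float(match.group(2))
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return lat, lon
    return None

def locate(raw_coords, country):
    """Return (lat, lon, accuracy), falling back to the country center."""
    parsed = to_decimal_degrees(raw_coords or "")
    if parsed is not None:
        return parsed + ("exact",)
    center = COUNTRY_CENTERS.get(country)
    if center is not None:
        return center + ("country_center",)
    return None

print(locate("37.23 N 80.41 W", "USA"))  # (37.23, -80.41, 'exact')
print(locate("missing", "France"))       # (46.2, 2.2, 'country_center')
```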

2.3 Preprocessing for the Pie Chart (Isolation Source Categories)

We standardized free-text isolation descriptions (e.g., “human stool,” “soil sample,” “rhizosphere”) by:

Classifying with a Large Language Model
We used Google's “gemini-1.5-flash” to classify thousands of records into about 13 high-level categories (e.g., “Human associated,” “Soil associated”).
Generating the Pie Chart
After each sample was “tagged,” the pie chart script counted how many samples fell into each category. Each category’s slice in the pie chart reflects its percentage of the total samples.
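The counting step can be sketched as below, assuming each sample has already been tagged with one category by the classification step. The function name `category_shares` and the rounding to one decimal place are choices made for this example.

```python
from collections import Counter

def category_shares(tags):
    """Return {category: (count, percent of total)}, largest category first."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {
        category: (n, round(100 * n / total, 1))
        for category, n in counts.most_common()
    }

tags = ["Human associated"] * 3 + ["Soil associated"]
shares = category_shares(tags)
print(shares)  # {'Human associated': (3, 75.0), 'Soil associated': (1, 25.0)}
```

An empty list of tags yields an empty dictionary, which corresponds to the “No Data Available” case.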

3. Bringing It All Together

Heat Map and Bubble Map both rely on the cleaned latitude/longitude data and corrected country names. Pie Chart uses the newly classified “isolation source category” to display the proportion of samples from different environments or hosts. Time Series leverages date fields (e.g., “Date of isolation”) to show patterns in sampling over time. Phylogenetic Tree visualizes evolutionary relationships, with optional metadata-enriched leaf labels.

With these preprocessing steps in place (cleaning coordinates, standardizing country names, categorizing isolation sources, and labeling tree leaves), each chart or tree presents its view of the genomic landscape more accurately. A researcher (or any interested user) can then explore:

Which environments or hosts dominate (Pie Chart),
When sampling peaks or dips occur (Time Series),
Where the samples originate (Heat / Bubble Maps), and
How samples are evolutionarily related (Phylogenetic Tree).

This comprehensive approach makes complex genomic metadata more accessible and interpretable for all audiences.

4. Creating a signature using Sourmash

Follow the steps at https://sourmash.readthedocs.io/en/latest/tutorial-install.html to install Sourmash. Once it is installed, run the command:

sourmash sketch dna -p scaled=1000,k=21 -p scaled=1000,k=51 ralstonia_example.fna -o ralstonia_example.sig

You can now submit the ralstonia_example.sig file for identification on genomeRxiv through the signature identification page: https://genomerxiv.cs.vt.edu/signature
