Below is a user-friendly summary of how the different visualizations (Heat Map, Bubble Map, Pie Chart, Time Series, and Phylogenetic Tree) are used to explore genome data. You will also find how these visualizations are prepared in the backend, from coordinate cleaning and text classification to phylogenetic tree generation.
What It Shows (Pie Chart): A breakdown of genome-sequenced isolates by “isolation source category,” such as “Animal associated,” “Human associated,” or “Soil associated.”
How It Works:
Why It's Useful:
Points to Note:
What It Shows (Time Series): How the number of collected genomes changes over time, typically by year or month.
How It Works:
Why It's Useful:
Points to Note:
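As a rough illustration of the Time Series aggregation, the sketch below counts records per year from free-text date values. The helper names (`extract_year`, `genomes_per_year`) and the sample dates are hypothetical, and the assumption that dates arrive in mixed formats is ours; the actual backend logic may differ.

```python
import re
from collections import Counter

# Hypothetical pattern: any standalone four-digit year from 1800 to 2099.
_YEAR = re.compile(r"\b(?:18|19|20)\d{2}\b")


def extract_year(raw):
    """Return the first plausible year in a date string, or None."""
    if not raw:
        return None
    match = _YEAR.search(raw)
    return int(match.group(0)) if match else None


def genomes_per_year(dates):
    """Count records per year, skipping values with no recognizable year."""
    years = (extract_year(d) for d in dates)
    counts = Counter(y for y in years if y is not None)
    return dict(sorted(counts.items()))


# Made-up sample values in several formats, including unusable ones.
sample_dates = ["2015-06-12", "2015", "2017-03", "missing", "", "12-Jun-2017"]
counts = genomes_per_year(sample_dates)
print(counts)
```

Records without a usable date are dropped here rather than guessed, so the plotted totals can be lower than the number of genomes in the selection.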
What It Shows (Heat Map): A choropleth map showing the distribution of genome-sequenced isolates by country. Darker shades represent higher isolate counts.
How It Works:
Why It's Useful:
Points to Note:
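To make the "darker shades represent higher counts" idea concrete, here is a minimal sketch that counts isolates per country and scales each count to a 0 to 1 intensity. The function name and the linear scaling are assumptions for illustration only; the real map may use a different color scale.

```python
from collections import Counter


def country_intensities(countries):
    """Map each country to (count, intensity), with intensity = count / max count."""
    counts = Counter(c for c in countries if c)  # ignore empty country values
    if not counts:
        return {}
    top = max(counts.values())
    return {name: (n, n / top) for name, n in counts.items()}


# Made-up records; an empty string stands for a missing country.
records = ["Brazil", "Brazil", "Kenya", "", "Brazil", "Kenya", "Japan"]
shades = country_intensities(records)
for name, (n, intensity) in shades.items():
    print(f"{name}: {n} isolates, intensity {intensity:.2f}")
```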
What It Shows (Bubble Map): Circles plotted at specific latitude/longitude coordinates to mark the precise location of each collection site, where coordinates are available.
How It Works:
Why It's Useful:
Points to Note:
What It Shows (Phylogenetic Tree): A phylogenetic tree based on pairwise genome similarities among selected genomes. The resulting tree (in Newick format) can be displayed in either a traditional left-to-right layout or a radial layout.
How It Works:
Why It's Useful:
Points to Note:
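The text does not say which algorithm turns pairwise similarities into a tree. As one possible illustration, the sketch below uses UPGMA (average-linkage clustering) on distances defined as 1 minus similarity and emits a Newick string. Both the choice of UPGMA and the distance definition are assumptions, and the function name and toy matrix are hypothetical.

```python
def upgma_newick(names, similarity):
    """Cluster by UPGMA on d = 1 - similarity and return a Newick string.

    Leaf names are used as-is, so they should not contain Newick
    punctuation such as parentheses, commas, colons, or semicolons.
    """
    # Active clusters: id -> (newick subtree, leaf count, node height).
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by the pair of ids.
    dist = {
        frozenset((i, j)): 1.0 - similarity[i][j]
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)  # closest pair of clusters
        a, b = sorted(pair)
        height = dist[pair] / 2.0
        sub_a, n_a, h_a = clusters.pop(a)
        sub_b, n_b, h_b = clusters.pop(b)
        subtree = f"({sub_a}:{height - h_a:.3f},{sub_b}:{height - h_b:.3f})"
        # Distance to every remaining cluster is the size-weighted mean.
        merged = {
            frozenset((next_id, c)): (
                n_a * dist[frozenset((a, c))] + n_b * dist[frozenset((b, c))]
            ) / (n_a + n_b)
            for c in clusters
        }
        # Drop distances that involve the two clusters just merged.
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        dist.update(merged)
        clusters[next_id] = (subtree, n_a + n_b, height)
        next_id += 1
    newick, _, _ = next(iter(clusters.values()))
    return newick + ";"


# Toy similarity matrix: A and B are close, C is more distant.
names = ["A", "B", "C"]
similarity = [
    [1.0, 0.9, 0.6],
    [0.9, 1.0, 0.6],
    [0.6, 0.6, 1.0],
]
tree = upgma_newick(names, similarity)
print(tree)
```

The same Newick string can be rendered in either a left-to-right or a radial layout, since the layout is a display choice and not part of the tree itself.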
Accurate and meaningful visualizations require standardized and reliable data. Below is a summary of the key steps taken to ensure the charts represent the underlying genome information correctly.
Data originates from NCBI, which provides genomic metadata (including “Isolation Host,” “Isolation Source,” and “GPS Coordinates”). The raw tables can be quite inconsistent: some columns, such as “Country,” may be well formatted, while others are free text.
To ensure each sample has valid latitude/longitude coordinates and that country names are consistent, we:
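A minimal sketch of what coordinate validation and country-name normalization of this kind might look like is shown below. It assumes coordinates arrive as text such as "38.98 N 77.11 W" and country values such as "USA: Virginia"; the regular expression, the alias table, and both function names are hypothetical and are not taken from the actual backend.

```python
import re

# Assumed input shape: "<degrees> N|S <degrees> E|W", optionally comma-separated.
_LAT_LON = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([NS])[\s,]+(\d+(?:\.\d+)?)\s*([EW])\s*$",
    re.IGNORECASE,
)

# Hypothetical alias table; a real one would be much longer.
_COUNTRY_ALIASES = {
    "usa": "United States",
    "united states of america": "United States",
    "uk": "United Kingdom",
}


def parse_lat_lon(raw):
    """Return (lat, lon) in signed decimal degrees, or None if invalid."""
    if not raw:
        return None
    m = _LAT_LON.match(raw)
    if m is None:
        return None
    lat = float(m.group(1)) * (-1 if m.group(2).upper() == "S" else 1)
    lon = float(m.group(3)) * (-1 if m.group(4).upper() == "W" else 1)
    # Reject values outside the valid geographic range.
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return None
    return lat, lon


def normalize_country(raw):
    """Keep the part before any ':' and map known aliases to one spelling."""
    if not raw:
        return None
    name = raw.split(":", 1)[0].strip()
    if not name:
        return None
    return _COUNTRY_ALIASES.get(name.lower(), name)


print(parse_lat_lon("38.98 N 77.11 W"))
print(parse_lat_lon("not collected"))
print(normalize_country("USA: Virginia"))
```

Returning `None` for anything unparseable keeps invalid points off the maps instead of plotting them at a guessed location.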
We standardized free-text isolation descriptions (e.g., “human stool,” “soil sample,” “rhizosphere”) by:
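For intuition, the sketch below shows a simple keyword-based stand-in for this kind of standardization, followed by the per-category proportions a pie chart would display. The keyword table, the "Unclassified" fallback label, and the function name are hypothetical; only the three category names come from the text, and the real classification may rely on a more sophisticated text classifier.

```python
import re
from collections import Counter

# Hypothetical keyword table; first matching category wins.
_KEYWORDS = {
    "Human associated": {"human", "patient", "sputum"},
    "Animal associated": {"cattle", "chicken", "pig", "bovine"},
    "Soil associated": {"soil", "rhizosphere"},
}


def classify_source(description):
    """Assign a category by whole-word keyword match, else 'Unclassified'."""
    tokens = set(re.findall(r"[a-z]+", (description or "").lower()))
    for category, keywords in _KEYWORDS.items():
        if tokens & keywords:
            return category
    return "Unclassified"


descriptions = ["human stool", "soil sample", "rhizosphere", "Chicken cecum", "river water"]
labels = [classify_source(d) for d in descriptions]

# Share of each category, as a pie chart would show it.
totals = Counter(labels)
proportions = {cat: n / len(labels) for cat, n in totals.items()}
print(proportions)
```

Matching on whole words rather than substrings avoids false hits such as "pig" inside "pigment".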
Heat Map and Bubble Map both rely on the cleaned latitude/longitude data and corrected country names. Pie Chart uses the newly classified “isolation source category” to display the proportion of samples from different environments or hosts. Time Series leverages date fields (e.g., “Date of isolation”) to show patterns in sampling over time. Phylogenetic Tree visualizes evolutionary relationships, with optional metadata-enriched leaf labels.
With these preprocessing steps in place (cleaning coordinates, standardizing country names, categorizing isolation sources, and labeling tree leaves), each chart or tree presents the underlying genome information more accurately. A researcher (or any interested user) can then explore:
This comprehensive approach makes complex genomic metadata more accessible and interpretable for all audiences.
Follow the steps at https://sourmash.readthedocs.io/en/latest/tutorial-install.html to install Sourmash. Once it is installed, run the command:
sourmash sketch dna -p scaled=1000,k=21 -p scaled=1000,k=51 ralstonia_example.fna -o ralstonia_example.sig
Now you can use the ralstonia_example.sig file to identify your genome on genomeRxiv, using the signature identification page at https://genomerxiv.cs.vt.edu/signature
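For intuition about what the k and scaled parameters in the command above control, here is a conceptual sketch of scaled k-mer sketching: every k-mer is hashed, and only hashes falling in the lowest 1/scaled fraction of the hash space are kept. This is an illustration only. It uses a stand-in hash from Python's standard library, so its output is not compatible with Sourmash signature files, and all function names are hypothetical.

```python
import hashlib

MAX_HASH = 2**64  # size of the 64-bit hash space used in this sketch
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq[::-1].translate(_COMPLEMENT)


def scaled_sketch(sequence, k=21, scaled=1000):
    """Return the set of k-mer hashes below MAX_HASH / scaled."""
    threshold = MAX_HASH // scaled
    kept = set()
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # skip k-mers containing ambiguous bases
        # Use the smaller of the k-mer and its reverse complement so that
        # both strands of the same sequence give the same sketch.
        canonical = min(kmer, revcomp(kmer))
        digest = hashlib.blake2b(canonical.encode(), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        if value < threshold:
            kept.add(value)
    return kept


demo = "ACGTTGCAAGGCTTAACCGGATCGATCGTACGATCG"
full = scaled_sketch(demo, k=5, scaled=1)    # scaled=1 keeps every k-mer hash
small = scaled_sketch(demo, k=5, scaled=4)   # keeps roughly a quarter on average
print(len(full), len(small))
```

A larger scaled value gives a smaller sketch, and a larger k makes matches more specific, which is presumably why the command builds sketches at both k=21 and k=51.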