4.5 Project: Taxonomy Profiling

4.5.1 Purpose

In this activity, we will characterize the microbiome of a soil sample from the BioDIGS Project, a nationwide initiative that aims to unearth soil biodiversity through collaborative genomic research and education. We will use the Galaxy platform to perform quality control (QC), sequence quality filtering, taxonomy profiling, and visualization of a metagenomics dataset generated by Nanopore long-read sequencing technology.


4.5.2 Learning Objectives

Use tools on the Galaxy platform to:

- Dig into a real metagenomics sample from the field.
- Import data and perform QC and quality filtering of your soil metagenome with the NanoPlot and fastp tools.
- Run the taxonomy profiling workflow to classify and visualize the taxonomy of a soil sample.

Throughout this activity you will also compare soil and gut metagenomes.

4.5.3 Introduction

The BioDIGS (BioDiversity and Informatics for Genomics Scholars) Project is a nationwide initiative involving students, researchers, and educators across more than 40 research and teaching institutions. Participants collect samples, analyze data, and interpret results to understand the relationships between the soil microbiome, environment, and health.

The BioDIGS Project sample used in this activity was sequenced on the long-read Oxford Nanopore Technologies PromethION instrument. A 6.1 GB subset of the data - BioDIGS_PromethION-Mo1_1_subset - is used in this activity.

Note that the total time for a Galaxy step to complete depends on multiple factors, such as the input file size, the length of the queue when many other people are analyzing data, the complexity of the job itself, and errors. The table below lists the minimum time each step will take for this assignment. Be sure to start early: when Galaxy is busy, each step can take 2 to 10 times longer to complete.

Table of approximate minimum times for a job to be completed on Galaxy using specified tools.

Import Data	NanoPlot	fastp	Taxonomy workflow
10 min	10 min	15 min	30 min

Note, you can save time by:

- Submitting multiple jobs that use the same input at the same time (NanoPlot and fastp).
- Submitting a downstream job, such as the taxonomy workflow, as soon as its input (the fastp output) appears in your history, even before the fastp job is complete.
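
To plan your time, the minimum times in the table can be combined into a rough total. The sketch below is only an illustration: it assumes that NanoPlot and fastp run side by side, that the taxonomy workflow still has to wait for fastp to finish, and that a single slowdown factor (a hypothetical parameter) applies to every step.

```python
# Minimum times in minutes, copied from the table above.
MIN_MINUTES = {"import": 10, "nanoplot": 10, "fastp": 15, "taxonomy": 30}


def estimate_minutes(slowdown=1.0):
    """Rough wall-clock estimate when NanoPlot and fastp run in parallel."""
    # Both QC jobs use the same input, so only the slower one adds to the total.
    parallel_qc = max(MIN_MINUTES["nanoplot"], MIN_MINUTES["fastp"])
    total = MIN_MINUTES["import"] + parallel_qc + MIN_MINUTES["taxonomy"]
    return total * slowdown


# Best case, then the 2x and 10x slowdowns mentioned in the note above.
for factor in (1, 2, 10):
    print(f"slowdown x{factor}: about {estimate_minutes(factor):.0f} min")
```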

4.5.4 Activity 1 – Data import and QC

Estimated time: 50 min

4.5.4.1 Part 1 - Import data and run NanoPlot

4.5.4.2 Instructions

Import the dataset into Galaxy.
1. Open the nanopore soil subset from the public history BioDIGS_PromethION-M01_1_subset.
2. Click on Import this history, select Copy only the active, non-deleted datasets, and then click Copy History.
3. Confirm that BioDIGS_PromethION-MO1_1_subset exists in your history by clicking on the Home button at the top left.
Run the NanoPlot tool in Galaxy to assess sequence quality using the default settings.
1. Click on the Tools icon. Then, in the search bar, enter ‘NanoPlot’ and select the NanoPlot tool.
2. Under files, browse to select your BioDIGS_PromethION-MO1_1_subset fastq dataset.
3. Click on Run Tool and wait ~10 minutes as the NanoPlot job is scheduled, run, and completed.
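
To see where summary values such as mean read length and mean quality come from, here is a minimal Python sketch that computes similar statistics from FASTQ records. It is not NanoPlot's code: the function names are made up for this example, the quality encoding is assumed to be Phred+33, and the per-read quality is averaged through error probabilities, which is our understanding of how NanoPlot reports it. The toy reads are invented.

```python
import math


def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string (assumed Phred+33) into integer scores."""
    return [ord(ch) - offset for ch in quality_line]


def mean_read_quality(scores):
    """Average the per-base error probabilities, then convert back to Phred."""
    mean_error = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)


def summarize_fastq(lines, cutoff=20):
    """Summarize 4-line FASTQ records: header, sequence, '+', quality."""
    records = [line.strip() for line in lines if line.strip()]
    lengths, quals = [], []
    for i in range(0, len(records) - 3, 4):
        seq, qual_line = records[i + 1], records[i + 3]
        lengths.append(len(seq))
        quals.append(mean_read_quality(phred_scores(qual_line)))
    if not lengths:
        raise ValueError("no FASTQ records found")
    n = len(lengths)
    return {
        "reads": n,
        "total_bases": sum(lengths),
        "mean_read_length": sum(lengths) / n,
        "mean_qual": sum(quals) / n,
        "pct_reads_above_cutoff": 100 * sum(q > cutoff for q in quals) / n,
    }


# Two invented reads: 'I' encodes Q40 and '+' encodes Q10.
toy = ["@read1", "ACGTACGT", "+", "IIIIIIII",
       "@read2", "ACGT", "+", "++++"]
stats = summarize_fastq(toy)
print(stats)
```

Dividing total_bases by 1e9 gives the dataset size in gigabases.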

4.5.4.3 Questions

1. How big is your dataset (how many gigabases)?

2. What are the following quality characteristics of your dataset?
Read mean length (mean read length):
Read Mean quality (mean_qual):
Proportion of reads with quality > Q20 (Reads > Q20):

3. Compare the soil NanoPlot results with the gut NanoPlot results.

	soil (this activity)	Zymo gut standard (taxonomy profiling pre-lab)
mean read length:
mean_qual:
Reads with > Q10 bases (%):

4. Which dataset has better sequence quality, the Zymo-gut-standard (taxonomy profiling pre-lab) or the Nanopore-soil subset (this activity)? Why?

4.5.4.4 Part 2 - Quality filtering with fastp

4.5.4.5 Instructions

Filtering raw sequencing reads is a common step to improve data quality before downstream analysis. Nanopore datasets often include reads with relatively low base-call quality - for example, bases below a Phred quality score of Q10 - which should match what you observed in your NanoPlot results above.

Given this, we will apply a stricter quality filter using a Phred threshold of Q20. This means that we will retain only reads in which most bases meet the Q20 criterion (fastp discards a read when too large a fraction of its bases falls below the threshold), and discard lower-quality reads to improve the reliability of subsequent analysis.

NOTE: Unlike Nanopore reads, PacBio HiFi (High Fidelity) reads are already at Q20 or above, so quality filtering of PacBio HiFi is typically not required.
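
A Phred score Q corresponds to a base-call error probability of 10^(-Q/10), so Q10 means roughly 1 error in 10 bases and Q20 roughly 1 error in 100. The short sketch below prints this conversion for a few thresholds; the function names are chosen for this example only.

```python
import math


def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Phred score for a given base-call error probability p."""
    return -10 * math.log10(p)


for q in (10, 15, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.2f}%")
```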

Run the fastp tool to filter out low-quality reads. By default, fastp treats a base as qualified if its Phred quality score is at least 15 (Q15). To raise this threshold to a Phred quality score of 20 (Q20), adjust the tool parameters:
1. Click on the Tools icon. Then, in the search bar, enter fastp and select the fastp tool.
2. Under Filter Options, in the empty Qualified quality phred box, enter 20 (indicating Q20).
3. Click on Run Tool and wait ~10-15 minutes as the fastp job is scheduled, run, and completed.
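
The filtering rule can be sketched in a few lines of Python. This is not fastp's code. It follows our reading of the fastp documentation, in which the qualified quality value is a per-base threshold and a read is discarded when the percentage of bases below it exceeds a limit; the 40% limit used here is assumed to be the default, and the reads are invented.

```python
def passes_quality_filter(scores, qualified_phred=20, unqualified_percent_limit=40):
    """Keep a read unless too many of its bases fall below the threshold.

    scores: list of per-base Phred scores for one read.
    unqualified_percent_limit: assumed default of 40 percent.
    """
    if not scores:
        return False
    unqualified = sum(q < qualified_phred for q in scores)
    return 100 * unqualified / len(scores) <= unqualified_percent_limit


# Invented reads: per-base Phred scores.
toy_reads = {
    "good": [30] * 10,              # 0% of bases below Q20
    "mixed": [25] * 6 + [10] * 4,   # 40% of bases below Q20
    "poor": [25] * 5 + [10] * 5,    # 50% of bases below Q20
}
kept = [name for name, scores in toy_reads.items() if passes_quality_filter(scores)]
print(kept)
```

Raising qualified_phred from 15 to 20 makes more bases count as unqualified, so more reads cross the limit and are removed.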

4.5.4.6 Questions

1. Compare your dataset before and after filtering using the fastp HTML report output.

	Before	After
Mean Length
total reads
Q20 bases (%):

2. How many reads were removed by filtering?

4.5.5 Activity 2 – Taxonomy Profiling

Estimated time: 50 min (~35 min computing)

4.5.5.1 Part 1 - Run Taxonomy Profiling Workflow

4.5.5.2 Instructions

Run the ‘Taxonomy Profiling’ workflow on your fastp-filtered data from Activity 1 and view the results.
1. Open the taxonomy-profiling public workflow https://usegalaxy.org/u/cutsort/w/taxonomy-profiling.
2. Click on Run.
3. Click on the ‘...’ tab and browse to select your fastp-filtered fastq dataset, “fastp on data1:Read 1 output”.
4. Under kraken_database select Prebuilt Refseq indexes: PlusPF(Standard plus protozoa and fungi)(Version:2022-06-07 - Downloaded: 2022-09-04T165121Z).
5. Click Run Workflow.
6. Wait ~30 minutes as the Kraken 2, KrakenTools, and Krona jobs are scheduled, run, and completed.
Click on the Display icon (eyeball) next to the output file with converted_kraken_report. Explore the metagenomic diversity of the soil sample by performing the taxonomy profiling spreadsheet activity you did during week 1.
1. Click on converted_kraken_report, find the download button and download the report.
2. Change the extension of your taxonomy file from .tabular to .tsv.
3. Upload your taxonomy .tsv file to Google Drive and open it with Google Sheets.
4. Create a header row and enter the following column information.
  - Col A = Counts.
  - Cols B-H correspond to taxonomic ranks k(Kingdom), p(Phylum), c(Class), o(Order), f(Family), g(Genus) and s(Species).
  - Each row corresponds to a different taxon.
5. Evaluate what proportion of data was taxonomically classified.
  - Insert a new column A; we will use this temporary column for calculations, so you can name this column “Calculations”.
  - In e.g. cell A2, calculate the sum of all reads observed in the soil sample.
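
The same calculation can be done outside a spreadsheet. The sketch below assumes the report is tab-separated with the read count in the first column and that unclassified reads appear in a single row labelled "Unclassified"; check your own file, since this label is an assumption. The three rows are invented.

```python
import csv
import io


def load_counts(handle):
    """Return (count, [rank labels]) tuples from a tab-separated report."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or not fields[0].strip().isdigit():
            continue  # skip blank lines and any header row
        labels = [f.strip() for f in fields[1:] if f.strip()]
        rows.append((int(fields[0]), labels))
    return rows


def classification_summary(rows, unclassified_label="unclassified"):
    """Return total reads, % unclassified, and % classified."""
    total = sum(count for count, _ in rows)
    unclassified = sum(
        count for count, labels in rows
        if labels and labels[0].lower() == unclassified_label
    )
    pct_unclassified = 100 * unclassified / total
    return total, pct_unclassified, 100 - pct_unclassified


# Invented stand-in for the downloaded .tsv file.
toy_report = io.StringIO(
    "50\tUnclassified\n"
    "30\tk_Bacteria\tp_Proteobacteria\n"
    "20\tk_Bacteria\tp_Actinobacteria\n"
)
total, pct_unclassified, pct_classified = classification_summary(load_counts(toy_report))
print(total, pct_unclassified, pct_classified)
```

To use a real file, replace toy_report with an open file handle, for example open("report.tsv", newline=""), where the file name is a placeholder.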

4.5.5.3 Questions

1. How many total read counts are there?

2. What percentage of reads are unclassified?

3. What percentage of reads are classified?

Identify the most abundant taxa (those at >0.1%).
1. Remember, soil is one of the most diverse microbial environments, with many more microbial species than the gut. Therefore, even the most abundant species can account for only a small fraction of the reads.
2. Select columns B through I.
3. In the Data menu, select “Sort range by column B (Z to A)”.
4. Insert a new column C.
  - We will use this temporary column for calculations.
  - You can name this column “% abundance”.
5. In the new column C, calculate % abundance for each row.
  - Divide each count value by the total number of reads and multiply by 100.
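
The sort and % abundance steps above can be sketched in Python as follows. The function name and the taxa are invented for illustration; the 0.1% cutoff is the one used in this activity, and the total includes every row, as in the spreadsheet.

```python
def abundant_taxa(rows, threshold_pct=0.1):
    """Sort (count, name) rows by count and keep those above the threshold."""
    total = sum(count for count, _ in rows)
    ranked = sorted(rows, key=lambda row: row[0], reverse=True)
    result = []
    for count, name in ranked:
        pct = 100 * count / total  # % abundance = count / total reads * 100
        if pct > threshold_pct:
            result.append((name, count, pct))
    return result


# Invented counts that add up to 10,000 reads.
toy_rows = [
    (995, "taxon B"),
    (5000, "Unclassified"),
    (5, "rare taxon"),
    (4000, "taxon A"),
]
for name, count, pct in abundant_taxa(toy_rows):
    print(f"{name}\t{count}\t{pct:.2f}%")
```

In this toy table the rare taxon sits at 0.05% and is therefore left out.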

4A. How many ‘abundant’ taxa (at > 0.1%) do you observe?

4B. What are the taxonomic ranks of the most abundant taxa?

4C. What is the most abundant eukaryote observed and its read count?

4D. What is the most abundant archaeon observed and its read count?

4E. What is the most abundant virus observed and its read count?

4.5.5.4 Part 2 - Analyze Kraken 2 results

4.5.5.5 Instructions

Click on the Display icon (eyeball) next to the output file named kraken2_with_pluspf_database_output_report. This output report is an extended version of the converted_kraken_report. The output contains 6 columns. Selected columns are described below:
- Column 1: Percentage (%) of a given taxon
- Column 2: # of reads per given taxon
- Column 4: A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Note that in this extended file, some rank codes have numbers associated with them; ignore this aspect of the document for the moment.
- Column 6: Identified taxa/scientific name.
Of note: The benefit of kraken2_with_pluspf_database_output_report is that it summarizes converted_kraken_report and calculates summary percentages for taxonomic ranks. For example, your converted_kraken_report has hundreds of lines for phylum Proteobacteria, while kraken2_with_pluspf_database_output_report has 1 line summarizing the percent abundance of all Proteobacteria.
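
If you prefer to read this report programmatically, a minimal sketch is shown below. It assumes the standard tab-separated, six-column layout described above, with the taxon name in the last column padded by leading spaces. The function names and all the numbers are invented for illustration.

```python
def parse_kraken_report(lines):
    """Parse six-column, tab-separated report lines into dictionaries."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip lines that do not have all six columns
        records.append({
            "percent": float(fields[0]),    # column 1: % of reads for the taxon
            "clade_reads": int(fields[1]),  # column 2: number of reads for the taxon
            "rank": fields[3].strip(),      # column 4: rank code
            "name": fields[5].strip(),      # column 6: scientific name
        })
    return records


def top_taxon(records, rank_code):
    """Return the record with the most reads at the given rank code, if any."""
    matches = [r for r in records if r["rank"] == rank_code]
    return max(matches, key=lambda r: r["clade_reads"]) if matches else None


# Invented report lines, not real results.
toy_lines = [
    " 40.00\t400\t400\tU\t0\tunclassified",
    " 60.00\t600\t10\tR\t1\troot",
    " 55.00\t550\t5\tD\t2\t    Bacteria",
    " 30.00\t300\t20\tP\t1224\t      Proteobacteria",
    " 15.00\t150\t15\tP\t201174\t      Actinobacteria",
]
records = parse_kraken_report(toy_lines)
top_phylum = top_taxon(records, "P")
print(top_phylum["name"], top_phylum["percent"])
```

Calling top_taxon(records, "C") would return the most abundant class in the same way.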

4.5.5.6 Questions

1. What is the percentage of Unclassified taxa? Does it match your calculations in Activity 2 - Part 1?

2. What percentage of bacteria is Proteobacteria, the most abundant phylum observed?

3. What is the most abundant class observed and at what percentage?

4.5.5.7 Part 3 - Krona Pie Chart

4.5.5.8 Instructions

The Krona pie chart is one of the outputs of the taxonomy workflow. It is an interactive visualization for exploring the composition of metagenomes.

1. View the Krona results: click on the Display icon (eyeball) next to the output file named krona_pie_chart.
2. Double-click on the Bacteria kingdom (k_Bacteria) to explore further.
3. Answer the questions below.

4.5.5.9 Questions

1. What are the 2 main phyla you observe?

2. Which of the two phyla appears to be more diverse?

Use the Krona and/or Kraken 2 outputs to compare the taxonomy of this soil sample to the gut taxonomy results from your taxonomy pre-lab (Zymo-gut-standard, the ZymoBIOMICS® Gut Microbiome Standard).

3A. Fill out the comparison table below.

	Nanopore soil pilot	Zymo gut standard
What are 2 most abundant phyla
What are 2 most abundant species
% Classified taxa
% Unclassified taxa

3B. Discuss taxonomy diversity between soil and gut, providing 3 points:
1.
2.
3.

4.5.6 Grading Criteria

Download as Microsoft Word (.docx) and upload to Canvas.

4.5.7 Footnotes

Resources

Google Doc
Species composition in the Gut Microbiome Standard dataset: ZymoBIOMICS® Gut Microbiome Standard

Contributions and Affiliations

Valeriya Gaysinskaya, Johns Hopkins University
Frederick Tan, Johns Hopkins University

Last Revised: January 2025