Searching for anthrax in the New York City subway metagenome.


In February 2015 Chris Mason and his team published an in-depth analysis of metagenomic data (environmental shotgun DNA sequence) from samples isolated from public surfaces in the New York City (NYC) subway system. Along with a ton of really interesting findings, the authors claimed to have detected DNA from the bacterial biothreat pathogens Bacillus anthracis (which causes anthrax) and Yersinia pestis (causes plague) in some of the samples. This predictably led to a huge interest from the press and scientists on social media. The authors followed up with an re-analysis of the data on, where they showed some results that suggested the tools that they were using for species identification overcalled anthrax and plague.

B. anthracis is a Gram-positive bacterium that forms tough spores as part of its lifecycle. The 5.2 M basepair (Mb) main chromosome is very similar to those of other bacteria in species informally called the ‘Bacillus cereus group’ (including B. cereus, B. thuringiensis and B. mycoides). Bacillus cereus group strains in general are commonly found in soil but B. anthracis itself is very rare and generally associated with livestock grazing sites with a past history of anthrax.

What sets B. anthracis apart from close relatives is the presence of two plasmids: pXO1 (181kb), which carries the lethal toxin genes and pXO2 (94kb), which includes genes for a protective capsule. Without one of these plasmids, B. anthracis is considered attenuated in virulence and unable to cause classic anthrax. Other B. cereus group bacteria can have plasmids very similar to pXO1 and pXO2 but missing the important virulence genes. Rarely, other B. cereus group carry pXO1 and appear to cause anthrax-like disease. Its a confusing situation, not helped by the current overly-narrow species definitions. This recent review gives more information.

The NYC subway metagenome study raised very timely questions about using unbiased DNA sequencing for pathogen detection. We were interested in this dataset as soon as the publication appeared and started looking deeper into why the analysis software gave false positive results and indeed what exactly was found in the subway samples. We decided to wrap up the results of our preliminary analysis and put it on this site. This report focuses on the results for B. anthracis but we also did some preliminary work on Y.pestis and may follow up on this later.


  1. Accessing metagenome data and controls: Where we obtained the data and how we constructed artificial controls by mixing recent whole genome shotgun data from pathogens and near-neighbors with NYC subway metagenome data.
  2. Mapping metagenome data to plasmids (and chromosomes): Looking at the patterns of sequence coverage over the key virulence associated plasmids, pXO1, pXO2 (and pMT of Y. pestis) in metagenome samples and controls.
  3. Species identification with Kraken : Kraken is a popular k-mer based software for read identification. We showed that Kraken was a sensitive way to find B. anthracis when it was present in low abundance but the method also produced a small number of false positive reads on near-neighbor B. cereus sequence control data.
  4. Custom SNP (single nucleotide polymorphism) assays for B. anthracis core genome: We identified 31-mer words that corresponded to SNPs in the core genome of B. anthracis that were not found in close relatives. This gave a rapid specific test for B. anthracis. However, we still detected two potential positive SNPs in one of the NYC subway samples.

Summary of Results

Our control experiments showed that direct mapping of sequences reads from the metagenome was a sensitive way to detect plasmids, Plasmid reads were present when as little as 0.01x B. anthracis genome equivalents were added. However, some B. cereus reads mapped to pXO2 because the strain we used probably contained a closely related mobile element. What differentiated B. anthracis from the control was that coverage was evenly distributed across the plasmids. We found that the NYC samples had evidence of quite deep coverage of B. cereus-like chromosome and plasmids, which was not an unexpected result. The pattern of coverage of the plasmids qualitatively resembled B. cereus rather than B. anthracis. There were some reads in sample SRR1748708 that mapped across the part of the pXO1 plasmid containing the lethal toxin gene cluster but none of the reads ultimately mapped to B. anthracis using BLAST (Table 4).

We showed that Kraken was a sensitive tool for detection of B. anthracis down to at least 0.01X genome coverage. However, the B. cereus control also gave us false positive B. anthracis reads, albeit at a lower incidence than the true positive. On the NYC subway samples we obtained a reasonably large number of B. anthracis specific calls but this could be explained by the fact that the sample also contained coverage of B. cereus group organisms.

Our novel assay of 1,793 31-mers specific to the B. anthracis core genome was sensitive at 0.01X B. anthracis genome coverage and did not give false positives on the B. cereus control at up to 5x coverage. Surprisingly, we obtained 4 positive matches on one of the NYC subway samples(Table 13). There are a number of possible reasons for this:

  1. B. anthracis is truly present in this sample at about 0.01x genome coverage along with a much greater number of B. cereus group organisms. At this level of coverage we would expect to see some hits to the lethal toxin gene in pXO1, which we did. However, the reads that mapped inside the lethal toxin gene were found by BLASTN to be false positives. It is possible that the 0.01x genome coverage the B. anthracis pXO1 reads stochastically missed the toxin gene and we were seeing false positive hits from the other B. cereus group strains present. Alternatively, we detected a B. anthracis variant that had lost its plasmids.
  2. The SNPs are present in a B. cereus strain, very closely related to B. anthracis that has not been sequenced (if its sequence were available in a public database, it would have resulted in the SNPs being filtered out of consideration).
  3. The SNPs were present in a B. cereus strain whose ancestor had at some point undergone homologous recombination in these areas of the chromosome with an ancestor of B. anthracis.
  4. The sequences were introduced through laboratory cross-contamination.
  5. The SNPs were an artifact of random sequencing error. On this last point (thanks to a helpful comment from Paul Keim) we found that B. anthracis SNPs k-mers appear when there are high levels of B. cereus background (Table 11). The approximate number of false positives will be equal to, Number of SNPs tested * Coverage * error rate / 3 (because there is 1 in 3 chance of a random error giving the SNP). In this case the numbers are approximately, > 1800 SNPs * 15x coverage * 0.001 / 3 = 9 false posives: not orders of magnitude different to the results seen in (Table 11).

At the present time, we believe the most likely explanation for the results is either/or option 2 and 5. There is no direct evidence for the lethal toxin. There is enough B. cereus DNA in the background to produce false positive results. However, the distribution of the 4 matches present over 2 SNPs (Table 13) could indicate that an unusual strain is present. Such a B. cereus strain would be a fraction of a mixture of B. cereus group strains present in the sample.

Deeper sequence coverage of this sample preferably combined with culture and PCR-based analysis would be helpful here.


We believe that the biggest message to take from this investigation is that there is no one-size-fits-all approach to bacterial species indentification in metagenome samples. There are several reasons for this. There is no consistently applied definition for the boundary that divides bacterial species based on DNA sequence identity. While the rule of thumb is that at approximately > 95% average sequence identity, strains fall into different species, the main chromsomes of some species (like B. cereus and B. anthracis) can be >99% similar to each other in terms of DNA sequence identity. Secondly, some species distinctions rely on the presence or absence of mobile elements like plasmids and phages (again B. cereus and B. anthracis are a great example), which are hard to model using a uniform approach like Kraken or Metaphlan. These genetic elements are often modular in structure with very similar ‘backbones’ but with key genes inserted or deleted. Additionally, the generalist species indentification programs rely on databases of sequenced genomes, which are not uniform in their coverage of different species, or different lineages within species.

Individual species have unique features that can be used to extract information from metagenomic sequence. In this case, if you were to have an organism-specific detection algorthim for B. cereus you would need to account for the presence of the plasmids with at least 2 times the coverage of the chromsome. You would also expect eveness of coverage across reads mapped to a pXO1 reference to be a strong indication for the presence of B. anthracis.

Even if you have developed a sophisticated algorithm you will still need to use judgement in interpreting results. In the case of the B. cereus group, we dont know what proportion of the DNA came from spores and what proportion was in the more fragile vegative state and how this might vary in other environments. There is also still much that we dont know about bacteria in the environment, especialy the conditions under which they exchange DNA. Can we say that its not natural to find evidence of a small number of “anthracis specific” SNPs on a B. cereus chromsome? This could be a rare, as yet unsequenced lineage, or because of recent recombination, or even convergent evolution. Judgement is especially important where the amount of data supporting a putative pathogen is small. Obviously orthologous validation (culture, PCR etc) is desirable if the sample is available.

Finally, when searching for pathogens, negative results should be also treated carefully. The limit of detection for a pathogen will be affected by the amount of DNA sequence generated and the complexity of the microbial community.


1. Accessing metagenome data and controls

Accessing metagenome data

On February 10th 2015, we used prefetch, from SRA Toolkit (v 2.4.4), to download each of the 1,572 runs associated with SRA study SRP051511. We then converted each run from SRA format to FASTQ format using fastq-dump, again from SRA Toolkit (v 2.4.4). Each FASTQ file was also converted to FASTA format using fastq_to_fasta from FASTX Toolkit (v For each SRA run we used the information contained in SRP051511_info.txt, acquired from SRP051511’s RunInfo Table, to associate sample names used in the study with their corresponding SRA run accessions.

We created the python script,, to parse DataTable5-metaphlan-metadata_v19.txt. From this we were able to determine which NYC samples contained reads associated with Bacillus antracis and the Yersinia genus (not shown). We then used another python script, in order to associate the NYC sample names with their corresponding SRA run accession (Table 1).

Creating metagenomic controls for B. anthracis and a close relative, B. cereus

As a control for these analysis we randomly selected NYC sample SRR1749070 which is B anthracis free and added different amounts of known B. anthracis or B. cereus sequences to it. Using these controls we produced results we would expect to see if B. anthracis was present in the metagenomic sequences. We used B. anthracis (DRR014739) (strain H29: isolated in Zambia, unpublished data from Division of Infection and Immunity, Research Center for Zoonosis Control, Hokkaido Univerisity) and B. cereus (SRR642775) (strain VD143, unpublished data from the Broad Institute) sequencing projects as our controls. The reads from the B. anthracis project were trimmed from 300bp (Illumina MiSeq) down to the first 100bp using fastq_trimmer (v0.0.13.2) from FASTX Toolkit. This was necessary to make the reads more similar to the NYC sample, SRR1749070, which was 100bp Illumina HiSeq 2000 reads. This step was not necessary for the B. cereus control because it too was 100bp Illumina HiSeq 2000 reads.

Each control was then randomly subsampled using seqtk (commit 43ff625a3211b51f301cb356a34fb8d1e593d50a) to 0.1x, 0.25x, 0.5x, 1x, and 5x coverage. B. anthracis was additionally subsampled to 0.01x, 0.05x and 0.10x coverage and B. cereus to 10x, 50x and 100x coverage. Coverage was estimated using the total size of the B. anthracis genome, the pXO1 plasmid and the pXO2 plasmids. For B. cereus we used only the total size of the B. cereus genome. These subsampled coverages were then added to the sequences of the B. anthracis free NYC sample SRR1749070. This produced multiple samples for each control, one with no control sequences and those with the different levels of coverage. These samples were then subjected to the same mapping pipeline as the B. anthracis positive samples mentioned above. This process of downloading and preparing the controls for analysis was automated using the following scripts (B. anthracis) and (B. cereus).

2. Mapping metagenome data to plasmids (and chromosomes)

Mapping against pXO1 and pXO2 using BWA

We used BWA (v 0.7.5a-r405) to map reads from each of the NYC and control samples against reference pXO1 and pXO2 plasmids as well as reference B. anthracis and B. cereus completed genomes. We used the following references: pXO1 reference CP009540, pXO2 reference NC_007323, B. anthracis reference CP009541, and B. cereus reference NC_003909. The aligned reads in SAM format were converted to sorted BAM and indexed using samtools (v 1.1). The per base coverage was extracted using genomeCoverageBed from bedtools (v2.16.2). Coverage across the plasmids and chromosomes were plotted for mulitple sliding windows with the Rscript plot-coverage.R. Mapped reads were extracted and saved in both FASTQ and FASTA format using bam2fastq (v1.1.0) and fastq_to_fasta from FASTX Toolkit v

For pXO1, we focused on coverage of the virulence genes (cya, lef, pagA and pagR). To visualize this, an alternate plot was created which included subplots of coverage of these genes using the RScript plot-pxo1-anthrax-toxin-coverage.R. The reads that mapped to each gene were also extracted and blasted (blastn v2.2.30+) against the NT database (built on Feb 9, 2015). For each gene, a count of the organism names of which the top five hits of each read belonged to was recorded.

The analysis of the plasmids was automated using the following scripts (NYC samples) and (control samples). Analysis of the completed chromosomes was automated using the scripts (NYC samples) and (control samples). Summaries of the results of mapping samples against pXO1, pXO2 and chromosomes were generated using the following scripts:, and

pXO1 Results

For each sample we calculated the summary statistics of coverage across the complete pXO1 plasmid (Table 2). NYC sample SRR174708 has the best average coverage (2x), but each of the NYC samples have a median coverage of 0x. These results are nearly the opposite of the B. anthracis control. Even at an estimated coverage of 0.25x, the mean coverage of pXO1 is 1.23x (median 1x). The mean coverage only improves as the estimated coverage of B. anthracis reads increases. The B. cereus control includes low levels of pXO1 overage.

Figure 1 offers a visual representation in the difference in coverage across the pXO1 plasmid. SRR1748708 was selected because it has the greatest mean coverage of the NYC samples. These three plots depict very different stories. In the B. cereus control (Figure 1C), although there were few reads that mapped, there exists regions of pXO1 that are similar to B. cereus. However, SRR1748708 (Figure 1A) did have significant coverage across the pXO1 plasmid. Comparing SRR1748708 and the B. anthracis control (Figure 1B), the similar mean coverage are very different from one another. SRR1748708 has regions of high coverage and regions of low coverage. The B. anthracis control, on the other hand, has a very similar coverage across the complete pXO1 plasmid. This suggests SRR1748708 may contain a plasmid backbone that is shared among Bacillus species.

The pXO1 plasmid has a mean coverage of 2x in SRR1748708; were the genes associated with the anthrax toxin covered? We next mapped coverage of reads to the cya, lef, pagA and pagR genes for SRR1748708 (Figure 2) and the B. anthracis 0.5x (Figure 3) and 5x (Figure 4) controls. For each of these plots, the top plot is coverage across the whole pXO1 plasmid. The B. anthracis controls clearly showed a number of reads mapping to each of the genes. The total number of reads mapped to each gene for each sample was extracted (Table 3). At 0.5x a total of 197 reads mapped to these genes. At 5x the coverage the total number of reads mapped jumped to 1,817. clearly evident. Looking at SRR1748708, the genes are essentially devoid of mapped reads but reads are mapped to these genes. Although only 72 reads mapped to these genes it is necessary to investigate these reads further.

The next step was to determine the likely origin of the mapped reads in the each of the genes. In order to do so, we extracted these reads and saved them in FASTA format. We then BLASTed those reads against a local NT database. We chose to only retain the top five hits for each read. Our goal was not to determine the exact organism of origin, but instead to get an idea of what these reads are mapping to (Table 4). There was a clear difference between the results of the NYC sample SRR1748708 (Table 4A) and the B. anthracis control (Table 4B). The reads from the B. anthracis controls most often return B. anthracis (~85%) as the top hit followed by B. cereus (~14.6%). The NYC samples did not have a top hit to B. anthracis. The most common was B. thuringiensis followed by various organisms including humans and other Bacillus species. It is important to state that in this case the total hit counts are not important. What is important is the difference between the NYC samples and the control B. anthracis.

pXO2 Results

Similar to pXO1, we calculated the summary statistics of coverage across the complete pXO2 plasmid for each sample (Table 5). We observed similar patterns of coverage. Again NYC sample SRR174708 has the best mean coverage (4.5x), but each of the NYC samples again have a median coverage of 0x. The B. anthracis 0.5x control has a mean coverage of 1.34x (median 1x). Again the mean coverage improves as the estimated coverage of B. anthracis reads increases. Finally, as before, the B. cereus control contained low levels of pXO2 coverage.

Using methods described previously with pXO1, Figure 5 offers a visual representation in the difference in coverages across the pXO2 plasmid. Again, SRR1748708 was selected because it has the greatest mean coverage of the NYC samples.

There was clearly a peak in each of the plots above, which is likely a repeat region within the Bacillus species. SRR1748708 (Figure 5A) mostly maps to this region. As with pXO1, the B. anthracis control (Figure 5B) maintains similar coverage across the complete pXO2 plasmid. Unlike pXO1, there seems to be a number of regions in pXO2 that are very similar to B. cereus as depicted by the plot of the B. cereus control (Figure 5C).

Completed B. anthracis and B. cereus Chromosome Results

In order to demonstrate how similar the chromosomes of B. anthracis and B. cereus are to one another, we also mapped reads to completed chromosomes. Summary statistics of mapped reads can be found (Table 6, Table 7). Unlike the pXO1 and pXO2 plasmids each of the NYC samples have a mean coverage greater than 1x for both B. anthracis and B. cereus. The difference in the mean coverage between B. anthracis and B. cereus for each NYC sample is negligble (SRR1748707:0.0555, SRR1749083:0.1124, and SRR1748708:0.5252). Again focusing on SRR1748708, because it has the greatest mean coverage against both B. anthracis and B. cereus, Figure 6 demonstrates the coverage across each chromosome.

Using 10kb sliding windows (5kb overlap), the two plots above show consistent coverage across both the B. anthracis (Figure 6A) and B. cereus (Figure 6B) chromosomes. The discrepency in peaks of coverage can be explained by the location of the origin of replication in each of the references. The other two samples, SRR1748707 and SRR1749083, show similar patterns of consistent coverage. At first glance, these plots could be mistakenly taken as evidence for the presence of both B. anthracis and B. cereus in the NYC samples. Recall, the genomes of organisms within the B. cereus group are very similar to one another. Just how similar?

Figure 7 demonstrates cross mappings of reads from B. anthracis and B. cereus to one another. These two plots show B. cereus reads consistenly mapping to the B. anthracis chromosome (Figure 7A) and likewise B. anthracis reads consistenly mapping to the B. cereus chromosome (Figure 7B). The similarity between the two chromosomes would likely cause difficulties distiguishing between the two organisms based on sequence homology alone.

3. Species identification with Kraken

We used Kraken (v. 0.10.4-beta) and its pre-built Bacteria database to identify B. anthracis reads within the B. anthracis positive NYC samples and our controls (described above). We also used B. anthracis, B cereus and B. thurigiensis sequencing projects as controls. We used the python script to extract the proportion of reads covered by B.anthracis, B. cereus group and other bacteria species respectively in the different samples.

B. anthracis reads were detected by Kraken within each of the NYC samples (Table 8). The results from Kraken Control Results suggests that using Kracken we could detect the presence of minute amounts of sequence reads from the organism.

The results from our B. anthracis, B cereus and B. thurigiensis sequencing project controls can be found at the following link: Anthrax WGS search. See Figure 8 for a visual representation of the species distribution of each Kraken run.

4. Custom SNP assay for the B. anthracis core genome

As a first pass, we made a list of SNPs specific to only the B. anthracis core genome. These SNPs were not not found in other B. cereus. We then extracted SNPs and flanking regions as 31-mers, with the SNP based occupying 16th position.

Identifying B. anthracis specific SNPs from the whole genome alignment*

ProgressiveMAUVE was used to perform whole genome alignment of 10 B. cereus group genomes that were selected based on previously published whole genome phylogeny in order to capture the maximum diversity. ProgressiveMAUVE was performed using script The B. cereus strain ATCC 10987 was used as the reference genome for SNP calling (see below). The 10 genomes (Table 9) and the progressiveMAUVE output can be accessed here.

The MAUVE alignment was loaded into MAUVE Version 2.4.0 in order to visualize the alignment and to export the SNPs using the “export SNP” option. The nucleotide positions of the reference genome (BCE.fasta, ATCC 10987) as well as the corresponding SNP position on the B. anthracis genome, along with the 10 nucleotide pattern at the position were extracted. In this nucleotide pattern, the first nuleotide will be from the reference genome and the 3rd nucleotide will be from the B. anthracis genome. Anthrax-specific SNPs (i.e SNP nucloetide positions found in only the Anthrax genome) were identified using the script, where a number (corresponding to the position of the nucleotide variant) was assigned. Finally, those nucleotide positions that have only the number 3 assigned (position of the B. anthracis genome on the alignment) were extracted along with the corresponding nucleotide positions at the reference genome as well as at the B. anthracis genome Anthrax_specific-SNPPattern.txt. In total, 9,538 Anthrax-specific SNPs were identified.

Generating all the B. cereus 31-mers to downselect against the 9,538 Anthrax-specific SNPs

Using the query:

txid86661[Organism:exp] NOT txid1392[Organism:exp] AND (biomol_genomic[PROP] AND refseq[filter])

we downloaded 6,948 B. cereus group (excluding Bacillus anthracis) sequences in FASTA format from NCBI Refseq to downselect against the 9,538 Anthrax-specific SNPs identified in the first pass.

First, we extracted all the 31-mers in and around all the 9,538 Anthrax-specific SNP positions from the B. anthracis Ames ancestor genome so that the 16th nucleotide in the 31-mer is the SNP at that specific position. This was done using the script

We then generated all the 31-mers present in all the 6,948 B. cereus group (excluding Bacillus anthracis) sequences using JELLYFISH by the script The resultant JELLYFISH database of all the B. cereus 31-mers were queried by the 9,538 Anthrax-specific 31-mers using the script All the 31-mers with zero-counts against the B. cereus database were outputed. A total of 1,793 31-mers that had zero-counts were retained and used in all the further downstream analysis.

All the 31-mers were generated using the script The B. anthracis specific 1,793 31-mers were queried against the 31-mer database of all the control and real samples using and the 31-mers that matched the B. anthracis specific 1,793 31-mers were extracted along with their counts in each sample. Table 10 lists the total number of 31-mers found (ie each individual SNP multiplied by the number of occurences). In the control sample that did not contain any B. anthracis reads (0x coverage), none of the anthrax-specific 31-mers were matched, as expected. At 5x coverage 65% of the Anthrax-specific 31-mers were matched. In contrast, we obtained matches to the SNP samples in the B. cereus controls only when the genome coverage was 50x or above (Table 11). These mathces are due to rare random sequencing error causing a B. cereus k-mer to match the B. anthracis sequence. This results showed that the 1,793 31-mers were specific for B. anthracis but that sequence error could generate false positive results at high coverage of B. cereus.

Of the NYC samples believed to be potentialy positive for B. anthracis only sample SRR1748708 contained matches to B. anthracis-specific 31-mers with a frequency count of 2 for each of the 2 31-mers (Table 12) and Table 13. This number of matches would potentially indicate a positive result if B. anthracis present at ~ 0.01X coverage Table 10, or the false positives that would be expected if the coverage of the B. cereus background was ~15-20 fold or greater(Table 11). In these simplistic scenarios, it would be would be unlikely that each SNP would be present on two different reads (muchmore likely to be 4 separate SNPs discovered). The explanation for the presence of these SNPs may be more complex than either the presence of classic B. anthracis or false positives due to high background of B. cereus. At the level of data avaialble here, its hard to say more at this time.


Project Data Structure

        - Supplementary information from study and information about the study on SRA
        - Reference genomes and plasmids used for mapping
        - Mapping results of the B. anthracis and Yersinia samples against references
        - All custom scripts used in these analyses

In House
        - Symbolic links to programs built in src folder
        - Logged output from scripts
        - Each of the 1572 samples in FASTQ and FASTA format
        - Symbolic links to FASTQ of samples containing B. anthracis and Yersinia
        - Each of the 1572 samples in SRA format
        - Programs downloaded and built for this project