NIST Data Publication:
Supporting data for "DNA polymerase characteristics influence noise levels in
sequencing of short tandem repeats"
Version 1.0.0
DOI: https://doi.org/10.18434/mds2-4088

Authors:
  Tova Lindh
    Lund University, Division of Biotechnology and Applied Microbiology
    Department of Process and Life Science Engineering
  Maja Sidstedt
    National Forensic Center
    Swedish Police Authority
  Kevin M. Kiesler
    National Institute of Standards and Technology
    Biomolecular Measurement Division
  Peter M. Vallone
    National Institute of Standards and Technology
    Biomolecular Measurement Division
  Johannes Hedman
    Lund University, Division of Biotechnology and Applied Microbiology
    Department of Process and Life Science Engineering

Contact:
  Kevin Kiesler
    kevin.kiesler@nist.gov

Description:

Polymerase chain reaction (PCR) applications including sequencing rely on
accurate thermostable DNA polymerases. Polymerization errors may hinder the
detection of low-level DNA variants such as mutations in clinical samples or DNA
from minor contributors in crime scene traces. Short Tandem Repeat (STR) markers
are particularly affected by artefacts. Apart from the regular random base
substitutions, the repeated structure of STRs makes them prone to formation of
stutter products. However, the mechanisms leading to stutter formation have not
yet been fully elucidated. Here, we applied an STR assay based on Unique
Molecular Identifiers (UMIs) to study the effects of DNA polymerases with
different characteristics on the amplicon yield as well as the formation of PCR
errors. The application of UMIs made it possible to study the impact on error
formation of applying genomic DNA (mimicking the early PCR cycles) or amplicons
(later cycles) as template. The levels of base substitutions were clearly
connected to the fidelity of the DNA polymerases, which in turn was coupled with
having an integrated 3’ to 5’ exonuclease domain. Stutter formation, on the other
hand, was not as directly associated with fidelity, as two high-fidelity
polymerases showed quite different levels of stutter. DNA-binding domains
generally improve processivity, which could lower the incidence of stutter.
However, this was not evident in the present study, as a polymerase with a
DNA-binding domain gave the highest stutter levels. Overall, the degree of
polymerase stuttering is likely due to several different DNA polymerase
characteristics. Identifying a DNA polymerase that provides low levels of
stutters and base substitutions may enable the detection of low-level variants
such as DNA from minor contributors in mixed forensic traces.


--------------
Data Use Notes
--------------

This data is publicly available according to the NIST statements of
copyright, fair use and licensing; see
https://www.nist.gov/director/copyright-fair-use-and-licensing-statements-srd-data-and-software

You may cite the use of this data as follows:
Lindh, Tova, Sidstedt, Maja, Kiesler, Kevin M., Vallone, Peter M., Hedman,
Johannes (2026), Supporting data for "DNA polymerase characteristics influence
noise levels in sequencing of short tandem repeats", Version 1.0.0, National
Institute of Standards and Technology, https://doi.org/10.18434/mds2-4088
(Accessed: [give download date])

-------------
Data Overview
-------------
The repository contains the files produced by the sequencing instrument, in FASTQ format, which constitute all data from the experiments performed.

File naming structure:
"Number"_"Polymerase used in barcoding PCR"_"Polymerase used in adaptor PCR"_"DNA sample"_"Input amount"_"replicate"

Where:
"Number" is the sample barcode ID (1 through 88) for each individual library preparation
"Polymerase used in barcoding PCR" is the enzyme used in the first steps of PCR to introduce the Unique Molecular Identifier (UMI) sequence tag
"Polymerase used in adaptor PCR" is the enzyme used in the second phase of amplification to generate high-concentration PCR products with sequencing adaptors at the 5' and 3' ends
"DNA sample" is the name of the DNA template used, which may be: 2800M, NIST SRM 2391d Component C, or one of ten single-source samples used as a testbed (SS1 through SS10)
"Input amount" is the quantity of DNA (in ng) used as template for the PCR amplification
"replicate" is the number corresponding to replicate library preparations for the sample (either 1 or 2)
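
The naming structure above can be split into its fields programmatically. The following Python sketch is illustrative only and is not part of the published data set. It assumes that fields are separated by underscores, that polymerase names contain no underscores, that the file name ends in the replicate number followed by one of the listed extensions (any additional instrument suffix would need to be removed first), and that "Input amount" is best kept as text because its exact formatting is not specified here. The function name parse_library_name, the class LibraryName, and the example file name are hypothetical.

```python
from typing import NamedTuple


class LibraryName(NamedTuple):
    """Fields of one file name, following the naming structure in this README."""
    number: int                 # sample barcode ID, 1 through 88
    barcoding_polymerase: str   # polymerase used in barcoding PCR
    adaptor_polymerase: str     # polymerase used in adaptor PCR
    dna_sample: str             # e.g. 2800M or SS1 through SS10
    input_amount: str           # kept as text; exact formatting is an assumption
    replicate: int              # 1 or 2


# Assumed extensions; adjust to match the files actually downloaded.
_SUFFIXES = (".fastq.gz", ".fq.gz", ".fastq", ".fq")


def parse_library_name(file_name: str) -> LibraryName:
    """Split a file name into its six fields (hypothetical helper)."""
    stem = file_name
    for suffix in _SUFFIXES:
        if stem.lower().endswith(suffix):
            stem = stem[: -len(suffix)]
            break

    fields = stem.split("_")
    if len(fields) < 6:
        raise ValueError(f"expected at least 6 fields, got {len(fields)}: {file_name!r}")

    number = int(fields[0])
    replicate = int(fields[-1])
    if not 1 <= number <= 88:
        raise ValueError(f"sample number out of range 1-88: {number}")
    if replicate not in (1, 2):
        raise ValueError(f"replicate must be 1 or 2: {replicate}")

    # Anything between the third field and the input amount is treated as the
    # DNA sample name, so a sample name containing underscores is kept whole.
    dna_sample = "_".join(fields[3:-2])
    return LibraryName(number, fields[1], fields[2], dna_sample, fields[-2], replicate)


if __name__ == "__main__":
    # Hypothetical file name; PolA and PolB are placeholder polymerase names.
    print(parse_library_name("12_PolA_PolB_2800M_0.5_1.fastq.gz"))
```

Reading the first three fields from the left and the last two from the right keeps the parser tolerant of underscores inside the DNA sample name, which is the field most likely to vary in form.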

---------------
Version History
---------------

1.0.0 (this version)
  initial release