CaeNDR | FAQ

How do I cite CaeNDR?

Please use the citation below.

CaeNDR, the Caenorhabditis Natural Diversity Resource

Timothy A Crombie, Ryan McKeown, Nicolas D Moya, Kathryn S Evans, Samuel J Widmayer, Vincent LaGrassa, Natalie Roman, Orzu Tursunova, Gaotian Zhang, Sophia B Gibson, Claire M Buchanan, Nicole M Roberto, Rodolfo Vieira, Robyn E Tanny, Erik C Andersen

(2023 Oct 19) Nucleic Acids Research

[ Article on Nucleic Acids Research | DOI | Pubmed ]

Or use this BibTeX entry:


@article{10.1093/nar/gkad887,
    author   = {Crombie, Timothy A and McKeown, Ryan and Moya, Nicolas D and Evans, Kathryn S and Widmayer, Samuel J and LaGrassa, Vincent and Roman, Natalie and Tursunova, Orzu and Zhang, Gaotian and Gibson, Sophia B and Buchanan, Claire M and Roberto, Nicole M and Vieira, Rodolfo and Tanny, Robyn E and Andersen, Erik C},
    title    = "{CaeNDR, the Caenorhabditis Natural Diversity Resource}",
    journal  = {Nucleic Acids Research},
    volume   = {52},
    number   = {D1},
    pages    = {D850-D858},
    year     = {2023},
    month    = {10},
    abstract = "{Studies of model organisms have provided important insights into how natural genetic differences shape trait variation. These discoveries are driven by the growing availability of genomes and the expansive experimental toolkits afforded to researchers using these species. For example, Caenorhabditis elegans is increasingly being used to identify and measure the effects of natural genetic variants on traits using quantitative genetics. Since 2016, the C. elegans Natural Diversity Resource (CeNDR) has facilitated many of these studies by providing an archive of wild strains, genome-wide sequence and variant data for each strain, and a genome-wide association (GWA) mapping portal for the C. elegans community. Here, we present an updated platform, the Caenorhabditis Natural Diversity Resource (CaeNDR), that enables quantitative genetics and genomics studies across the three Caenorhabditis species: C. elegans, C. briggsae and C. tropicalis. The CaeNDR platform hosts several databases that are continually updated by the addition of new strains, whole-genome sequence data and annotated variants. Additionally, CaeNDR provides new interactive tools to explore natural variation and enable GWA mappings. All CaeNDR data and tools are accessible through a freely available web portal located at caendr.org.}",
    issn     = {0305-1048},
    doi      = {10.1093/nar/gkad887},
    url      = {https://doi.org/10.1093/nar/gkad887},
    eprint   = {https://academic.oup.com/nar/article-pdf/52/D1/D850/55039590/gkad887.pdf},
}

What are hyper-divergent regions? How should I use variants that fall within these regions?

Hyper-divergent regions are genomic intervals that contain sequences not found in the N2 reference strain. They were identified by high levels of variation and low coverage from read alignments. For a fuller description, please read this paper. We highly recommend that you use the genome browser to view the BAM files for strains of interest. We also released a genomic view track that shows where we have classified divergent regions. If your region of interest overlaps with a hyper-divergent region, we recommend treating any variants in it as preliminary. Long-read sequencing is required to identify the actual genomic sequences in these regions.
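If you have a list of variant positions and a list of hyper-divergent intervals, a quick overlap check can help flag which variants to treat as preliminary. The following is a minimal sketch, not part of CaeNDR: the function names, the `(chrom, start, end)` tuple layout, and the 1-based inclusive coordinates are all assumptions made for illustration, and the example coordinates are invented.

```python
from bisect import bisect_right


def build_index(regions):
    """Group (chrom, start, end) intervals by chromosome, then sort and merge them.

    Coordinates are assumed to be 1-based and inclusive (an assumption of this
    sketch; adjust if your interval file uses a different convention).
    """
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for chrom, intervals in index.items():
        intervals.sort()
        merged = []
        for start, end in intervals:
            if merged and start <= merged[-1][1]:
                # Overlaps the previous interval: extend it.
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        index[chrom] = merged
    return index


def in_divergent_region(index, chrom, pos):
    """Return True if pos falls inside any merged interval on chrom."""
    intervals = index.get(chrom, [])
    # Find the last interval whose start is <= pos.
    i = bisect_right(intervals, (pos, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos <= intervals[i][1]


# Invented example intervals and variant positions.
regions = [("II", 100, 200), ("II", 150, 300), ("V", 5000, 6000)]
variants = [("II", 250), ("II", 301), ("V", 5000), ("X", 10)]

index = build_index(regions)
flags = [in_divergent_region(index, chrom, pos) for chrom, pos in variants]
for (chrom, pos), flagged in zip(variants, flags):
    status = "preliminary (hyper-divergent)" if flagged else "outside divergent regions"
    print(f"{chrom}:{pos}\t{status}")
```

Variants flagged this way are still worth inspecting in the genome browser alongside the BAM files, as recommended above.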

How much confidence do we have in the indel variants?

GATK calls indel variants (1-50 bp) and short structural variants. Variant calling at these sites was not optimized and was run with default parameters. These variants should be considered preliminary until confirmed by PCR or long-read sequencing.
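To pick out the indels in your own variant list so they can be set aside for confirmation, one option is to compare the lengths of the reference and alternate alleles. This is a minimal sketch under stated assumptions: it expects plain VCF-style REF and ALT allele strings for a single alternate allele, it does not handle symbolic alleles such as `<DEL>`, and the function names are hypothetical.

```python
def classify_variant(ref, alt):
    """Classify a biallelic variant from its REF and ALT allele strings."""
    if len(ref) == len(alt):
        # Same length: a single-base change or a multi-base substitution.
        return "SNV" if len(ref) == 1 else "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def needs_confirmation(ref, alt):
    """Indels are treated as preliminary until confirmed experimentally."""
    return classify_variant(ref, alt) in {"insertion", "deletion"}


# Invented example alleles.
examples = [("A", "G"), ("A", "ATT"), ("GCA", "G"), ("AT", "GC")]
for ref, alt in examples:
    print(ref, alt, classify_variant(ref, alt), needs_confirmation(ref, alt))
```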

How were the filter thresholds determined?

Optimal filter thresholds would faithfully separate real variant sites from non-variant sites. However, we had no way to know which variant sites were true or false from the experimental data alone. Therefore, we created simulated data with a "truth set" of variants artificially inserted into a BAM file, so that we knew the precise positions of the true variants. After variant calling with the simulated BAM file, we examined the various quality metrics and asked which thresholds of these metrics would best separate real variants from incorrectly called variants. We chose filter thresholds to maximize the true positive rate and precision while minimizing the false positive rate. These filter thresholds were then used to process the wild isolate data.

See our filter optimization report for further details.
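The general idea of scoring candidate thresholds against a truth set can be sketched as follows. This is an illustration only, not the procedure from the filter optimization report: the single "score" metric, the keep-if-at-or-above rule, the use of F1 to combine precision and true positive rate, and all of the numbers are assumptions. The false positive rate here is computed as the fraction of incorrect calls that pass the filter.

```python
def evaluate(calls, n_truth, threshold):
    """Score one threshold against simulated calls.

    calls   : list of (score, is_true) pairs, one per called variant
    n_truth : number of variants inserted into the simulated data
    """
    kept = [is_true for score, is_true in calls if score >= threshold]
    tp = sum(kept)                      # true variants that pass the filter
    fp = len(kept) - tp                 # incorrect calls that pass the filter
    n_false = sum(1 for _, is_true in calls if not is_true)
    return {
        "threshold": threshold,
        "tpr": tp / n_truth if n_truth else 0.0,
        "precision": tp / len(kept) if kept else 0.0,
        "fpr": fp / n_false if n_false else 0.0,
    }


def best_threshold(calls, n_truth, candidates):
    """Pick the candidate with the highest F1 (a hypothetical selection rule)."""
    def f1(metrics):
        denom = metrics["precision"] + metrics["tpr"]
        return 2 * metrics["precision"] * metrics["tpr"] / denom if denom else 0.0

    return max((evaluate(calls, n_truth, t) for t in candidates), key=f1)


# Invented calls: (quality score, whether the call matches the truth set).
calls = [(35, True), (30, True), (28, True), (22, False),
         (20, True), (12, False), (8, False), (5, False)]
n_truth = 5  # one inserted variant was never called

best = best_threshold(calls, n_truth, candidates=[5, 10, 20, 25, 30])
print(best)
```

In practice several quality metrics are filtered at once, and some may need an upper rather than a lower bound, so a real evaluation would repeat this kind of comparison per metric.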

How are strains grouped by isotype?

In 2012, we published genome-wide variant data from reduced representation sequencing of approximately 10% of the C. elegans genome (RAD-seq). Using these data, we grouped strains into isotypes. We also found many strains that were mislabeled as wild isolates but were instead N2 derivatives, recombinants from laboratory experiments, or mutagenesis screen isolates (detailed in Strain Issues). These strains were not characterized further. For each isotype, we chose one strain to be the isotype reference strain. This strain can be ordered through CaeNDR here.

After 2012, with advances in genome sequencing, we transitioned to whole-genome short-read sequencing. All isotype reference strains were resequenced using whole-genome sequencing. The other strains within an isotype were not, so we rely on the RAD-seq variant data to assign these strains to isotypes.