Join Us

5 Must-Have Features in a DNA Methylation Detection Kits

Author: XMtongxue

May. 26, 2025

DNA Methylation Detection Kit | BioChain Institute Inc.

JavaScript seems to be disabled in your browser. For the best experience on our site, be sure to turn on Javascript in your browser.

For more information, please visit our website.

DNA 5-methylcytosine detection and methylation phasing using ...

The ccsmeth algorithm for 5mCpG detection

Recurrent neural network (RNN) and attention mechanism are widely used artificial neural networks in natural language processing39,40. Both RNN and attention mechanism have been applied in base modification detection from nanopore long reads16,17,41. Here, we propose ccsmeth, a deep-learning method that is composed of bidirectional GRU42 and Bahdanau attention43 networks, to detect CpG methylation from PacBio CCS reads. ccsmeth is designed to predict methylation states of CpGs at both read level and site level. For a targeted CpG, ccsmeth first predicts the methylation probability (or binary methylation state) of the CpG in a read (i.e., at single-molecule resolution, read level), and then summarizes the read-level methylation states to get its methylation frequency (site level) in the targeted genome (Fig. 1, Methods). During the generation of a CCS read, the IPD and PW values of each base in forward and reverse complement strands of the CCS read are averaged from corresponding subreads (Fig. 1a). To predict the read-level methylation state of a CpG, ccsmeth extracts a 21-mer sequence context that includes the CpG itself in the center, with the kinetics information (the averaged IPD, the averaged PW and the number of covered subreads) of each base. Since CpG methylation are mostly symmetric in human44, ccsmeth constructs two feature matrixes from the forward and reverse complement strand for a symmetric CpG pair (Fig. 1b). After processing the feature matrixes, ccsmeth outputs a read-level methylation probability Pr (\({P}_{r}\in [0,\,1]\)).

Before calling methylation at the site level, the CCS reads should be aligned to the reference genome. In ccsmeth, we provide two modes to infer the site-level methylation frequency of CpGs: count mode and model mode (Methods, Supplementary Fig. 1). In count mode, based on read-level methylation probabilities, binary methylation state (0 as unmethylated, 1 as methylated) of a CpG in per read is set by a probability cutoff (0.5 as default). Then the methylation frequency is calculated by counting the number of reads where the CpG is predicted as methylated, and the total number of reads mapped to the CpG. In the model mode of ccsmeth, we leverage the read-level methylation probabilities of neighboring CpGs to increase the confidence of the site-level methylation detection in a way similar to pb-CpG-tools. Specifically, for a targeted CpG, the read-level methylation probabilities of the CpG and its 10 adjacent CpGs, together with the distance (in base pair) of all 11 CpGs to the targeted CpG are organized into a feature matrix. The feature matrix is first input into a BIGRU layer to capture the forward and reverse flow of interactions between adjacent CpGs. Then an attention layer is applied to optimize the weights of each adjacent CpGs, which allows the model to focus on the most relevant interactions. A methylation probability Ps (\({P}_{s}\in [0,\,1]\)) is finally outputted as the methylation frequency of the targeted CpG (Fig. 1c, Supplementary Fig. 1).

ccsmeth accurately detects CpG methylation at single-molecule resolution

To evaluate ccsmeth at read level (i.e., at single-molecule resolution), we first use three groups of CCS datasets (M01&W01, M02&W02, M03&W03) sequenced using M.SssI-treated and PCR-treated human DNA on different versions of PacBio sequencers24. In M.SssI-treated DNA, the CpG methyltransferase M.SssI methylates all CpGs, while PCR-treated DNA which is prepared via whole genome amplification (WGA) contains nearly no methylated bases24. As shown in Supplementary Table 1, M01-03 are M.SssI-treated DNA samples, and W01-03 are PCR-treated DNA samples. For each group of datasets, we randomly select 50% methylated reads and unmethylated reads for model training. The remaining 50% methylated and unmethylated reads are used for testing. We use the same reads to train and test the HK model24. primrose does not provide the interface for training, so we exclude it for comparison on these three datasets. As shown in Fig. 2a, ccsmeth outperforms the HK model on all three datasets. ccsmeth achieves accuracies of 0., 0., and 0. on M01&W01, M02&W02, and M03&W03, respectively. The accuracies of ccsmeth are 5.4%, 4.4%, and 3.7% higher than HK model on the three datasets, respectively. ccsmeth achieves either around or above 0.95 AUCs, which are 3.3%, 3.4%, and 2.6% higher than HK model on the three datasets, respectively.

The read lengths of the three CCS datasets from Tse et al.24 are all less than 10Kb, while CCS reads used in practice are usually in 10–25 Kb23. Therefore, we further use the long (≥ 10 Kb) CCS reads of three human samples (NA, HG002, and SD_P1) for read-level evaluation (Methods, Supplementary Table 2): The CCS reads of NA are sequenced using PCR-treated and M.SssI-treated DNA of NA with 10 Kb insert size; The CCS reads of HG002 native DNA sequenced using three different insert sizes (15Kb, 20Kb, 24Kb) were taken from Baid et al.26 and Human Pangenome Reference Consortium23; The CCS reads of SD_P1 are sequenced using 15Kb DNA insert size. The mean subread depths of the CCS reads in these datasets range from 7.6× to 14.1× (Supplementary Table 1). We train the read-level model of ccsmeth using NA CCS reads aligned to autosomes of the reference genome and one SMRT cell of HG002 CCS reads (Methods). The CCS reads of NA aligned to chrX, 6 SMRT cells of HG002 CCS reads and the SD_P1 CCS reads are used for testing (Methods). We run primrose with its built-in model on the same testing data for comparison. As shown in Fig. 2b, ccsmeth gets 0.-0. accuracies, 0.–0. sensitivities, 0.–0. specificities, and 0.-0. AUCs, which are higher than those of primrose on all five datasets. Especially, ccsmeth gets much higher accuracies and specificities on HG002 CCS reads of 15 Kb, 20 Kb, and 24 Kb insert sizes: >4% higher accuracies and >7% higher specificities than those of primrose. To compare ccsmeth with the HK model, we subsampled 100 K ZMW reads from the datasets of NA and three HG002 insert sizes (15 Kb, 20 Kb, 24 Kb) since HK model is extensively time-consuming on large datasets. The results show that HK model achieves similar accuracies with primrose, especially on the HG002 datasets. Accuracies of both HK model and primrose are lower than that of ccsmeth (Supplementary Fig. 2). Besides evaluating ccsmeth genome widely, we also evaluate ccsmeth in specific genomic contexts and regions to explore whether the performance of ccsmeth is correlated with any genomic features (Supplementary Note 1). As shown in Supplementary Fig. 3, ccsmeth outperforms primrose in all tested regions. The results also show that ccsmeth tends to have higher accuracies in regions with high CpG densities, but has relative lower accuracies in intergenic regions, CpG shores, CpG shelves, and some repetitive regions.

The read-level accuracy of ccsmeth can be further improved by filtering out ambiguous calls. As shown in Supplementary Fig. 4a, by filtering out the calls with methylation probability close to 0.5, the incorrect calls of ccsmeth can be reduced. We define p = |PrPr| to filter out the ambiguous calls, where Pr is the methylation probability, and Pr is the unmethylated probability defined as 1 - Pr. Like modbam2bed45, we set p to 0.33 for testing. Supplementary Fig. 4b shows that when p is set to 0.33, the accuracies of ccsmeth improve by 3.5-4.2% with 8.9–12.8% of calls being discarded.

ccsmeth accurately detects CpG methylation at genome-wide site level

We use the CCS reads of HG002, SD_P1, and CHM13 to evaluate ccsmeth on 5mCpG detection at the site level (Methods, Supplementary Table 1 and 2). 2 SMRT cells of HG002 CCS reads are used to train the site-level model of ccsmeth; 6 SMRT cells of HG002 CCS reads (two for each of three insert sizes 15Kb, 20Kb, 24Kb), 2 SMRT cells of SD_P1 15Kb reads, and 2 SMRT cells of CHM13 20Kb reads are used for testing. There are 25.6×, 17.0×, 28.4×, 19.6×, and 16.5× average genome coverage of HG002 (15Kb, 20Kb, 24Kb), SD_P1 and CHM13 CCS reads used for testing in total, respectively. We downloaded BS-seq and nanopore R9.4.1 sequencing data of the three human samples as the benchmark (Supplementary Table 3). When evaluating ccsmeth, we subsample reads of the five datasets under certain coverages and compare the site-level results of ccsmeth and primrose with the results of BS-seq and nanopore sequencing (Fig. 3a–d). We repeat the subsampling of each coverage 5 times and get averaged values of metrics for comparison. The results show that the model mode of ccsmeth and primrose/pb-CpG-tools achieve higher Pearson correlations with BS-seq and nanopore sequencing than the count mode of ccsmeth and primrose/pb-CpG-tools do. Meanwhile, by setting p to 0.33 to filter out ambiguous calls, the Pearson correlations of the count mode of ccsmeth with BS-seq and nanopore sequencing improve by ~1-2% (Fig. 3a–d). On all tested datasets, ccsmeth achieves higher correlations than primrose/pb-CpG-tools does in both modes, especially under low coverages. For example, using 10× HG002 15 Kb, 20 Kb, and 24 Kb CCS reads, ccsmeth in model mode obtains 0., 0., and 0. correlations with BS-seq, while primrose/pb-CpG-tools in model mode only obtains 0., 0., and 0. correlations, respectively. (Fig. 3a). ccsmeth in model mode also obtains 0., 0., and 0. correlations with nanopore sequencing when using 10× HG002 15 Kb, 20 Kb, and 24 Kb CCS reads, which are 4.3%, 5.5%, and 4.9% higher than those obtained by primrose/pb-CpG-tools in model mode, respectively (Fig. 3b). On all tested datasets, ccsmeth also gets lower root mean square errors (RMSEs) than primrose/pb-CpG-tools gets in most cases (Supplementary Tables 5–12).

The model mode of ccsmeth can also be applied to the read-level results of primrose. As shown in Supplementary Tables 5–12, primrose with ccsmeth in model mode gets higher correlations and lower RMSEs with both BS-seq and nanopore sequencing than primrose with pb-CpG-tools in count mode gets. Especially, under low coverages (<15×) of HG002 and CHM13 datasets, primrose with ccsmeth in model mode outperforms primrose with pb-CpG-tools in model mode (Supplementary Tables 5–10 and 12). These results further demonstrate the effectiveness of the site-level model of ccsmeth.

We further test ccsmeth using the total CCS reads of HG002 15 Kb, 20 Kb, 24 Kb, SD_P1, and CHM13. Using the reads of HG002 15 Kb, 20 Kb, and 24 Kb datasets, ccsmeth gets 0., 0., and 0. correlations with BS-seq, and gets 0., 0., and 0. correlations with nanopore sequencing, respectively (Fig. 3e). When combining the total 71.0× CCS reads of HG002, ccsmeth achieves 0. and 0. correlations with BS-seq and nanopore sequencing, respectively (Supplementary Fig. 5, Supplementary Data 1). ccsmeth gets 0. correlation with BS-seq using the total SD_P1 reads, and gets 0. correlation with nanopore sequencing using total CHM13 reads (Fig. 3f, g, Supplementary Fig. 6). The results of ccsmeth on the three HG002 datasets are also highly correlated with each other (correlations>0.), which show the reproducibility of ccsmeth (Fig. 3e). We further use the 71.0× HG002 CCS reads to explore of which CpG contexts are predicted more accurately by the model mode in terms of methylation frequencies (Supplementary Note 2 and Supplementary Fig. 7). We classify the CpGs into two groups Gm and Gc. Gm contains CpGs whose methylation frequencies are more accurately predicted by model mode, while Gc contains CpGs whose methylation frequencies are more accurately predicted by count mode. We find that the CpGs in Gm tend to have either very low (<0.2) or high (>0.8) methylation frequencies.

Haplotype-aware methylation calling and ASM detection using PacBio CCS data

Following ccsmeth, we further develop a Nextflow38 pipeline called ccsmethphase for haplotype-aware methylation calling and ASM detection using only PacBio CCS data (Fig. 4a, Methods). In this pipeline, ccsmeth is used to call methylation states. Clair333 is used to call single nucleotide variants (SNVs). The SNVs called by Clair3 are then phased by WhatsHap46 to generate haplotypes. DSS47 is used to detect differentially methylated regions (DMRs) between two haplotypes.

We evaluate ccsmethphase with the total 71.0× HG002 CCS reads (Supplementary Table 2). We also use the HG002 BS-seq data (Supplementary Fig. 8, Supplementary Note 3) and nanopore data (Supplementary Fig. 9, Supplementary Note 4) to phase CpG methylations for comparison. First, we investigate the haplotype-aware methylation status of known imprinted regions using PacBio CCS data. We get 204 known imprinted intervals from Akbari et al.48, in which there are 102 well-characterized imprinted intervals49,50,51,52,53 (Methods). We compare the methylation difference of each imprinted interval between the two haplotypes of HG002 (Methods). As shown in Fig. 4b, the well-characterized imprinted intervals have large methylation differences between the two haplotypes (median = 0.53), while 22.1% of other known imprinted intervals also show large (>0.5) methylation differences. The methylation differences of known imprinted intervals got from CCS data are highly consistent with those got from BS-seq and nanopore data: Pearson correlations are 0. and 0., respectively (Supplementary Fig. 10 and 11). We examine the known imprinted intervals on SD_P1 CCS data and get consistent results (Supplementary Fig. 12a).

We then assess ccsmethphase on ASM detection using the HG002 sequencing data. Using the CCS reads, ccsmethphase generates 14,390 DMRs. Using the BS-seq and nanopore reads with corresponding pipelines, and 16,250 DMRs are generated, respectively. 81.4% DMRs generated using BS-seq reads are closely next to the genomic locations of the CCS-generated DMRs (distance<10 kb), and 70.8% of the DMRs overlap with the CCS-generated DMRs (Fig. 4c). Among the DMRs generated using nanopore reads, 68.8% DMRs are closely next to the genomic locations of the CCS-generated DMRs, and 51.7% DMRs overlap with the CCS-generated DMRs (Fig. 4d). Most of the CCS-generated DMRs are also closely next to the genomic locations of the DMRs generated using BS-seq and nanopore data in HG002 (Supplementary Fig. 13). From the SD_P1 CCS reads, ccsmethphase generates 8,183 DMRs. In both HG002 and SD, most of the known imprinted intervals are either overlapped with or near the CCS-generated DMRs (Fig. 4e, Supplementary Fig. 12b and 14), which also shows the ability of ccsmethphase on ASM detection. We also assess ccsmethphase on ASM detection using the CCS data of a Chinese family trio, in which HN_FA is the father, HN_MO is the mother, and HN_S1 is the son. The results show that ccsmethphase not only can detect known imprinted intervals but also reveals the patterns of parental imprinting correctly (Supplementary Note 6, Supplementary Figs. 15–18).

We further compare genome-wide site-level methylation frequencies of CpGs at two haplotypes detected by PacBio CCS data with those by BS-seq and nanopore data. To accomplish this, we would expect consistent haplotype assignment (i.e., all maternal SNVs are assigned to one haplotype, and all paternal SNVs are assigned to another haplotype). However, because of the uneven reads coverage across the reference genome, we can only generate discrete haplotype blocks when using reads of a single sample to phase SNVs34. Therefore, we use the phased SNVs generated by Illumina trio data of HG002 to phase the CCS and nanopore reads. We compare the methylation frequencies of the phased CpGs predicted using CCS reads with those using BS-seq and nanopore reads. PacBio CCS gets >0.93 correlations with BS-seq and nanopore sequencing in both maternal and paternal haplotypes (Fig. 4f, g, Supplementary Table 13). This result further demonstrates that ccsmethphase can accurately detect haplotype-aware methylation in the human genome using CCS reads.

Assessment of PacBio CCS for methylation detection and phasing in repetitive genomic regions

With longer reads, PacBio CCS is expected to profile methylation of more CpGs in the human genome than short-read sequencing technologies do. Using T2T-CHM (T2T: Telomere-to-Telomere) as the reference genome, we first assess the number of CpGs covered by HG002 CCS reads, especially in repetitive genomic regions: repetitive genomic elements annotated by RepeatMasker54,55, segmental duplications (SDs)56, and peri/centromeric satellites (cenSats)57. We also assess the total HG002 BS-seq and nanopore reads for comparison (Supplementary Table 3). As shown in Fig. 5a, using 15× coverage of CCS reads, 32.85 M (96.9%) of human CpGs are covered, of which there are 31.33 M CpGs covered by at least 5 reads. The CpGs covered by 15× CCS reads are more than the CpGs covered by 117.5× Illumina BS-seq reads. When using all 71.0× testing CCS reads, 32.74 M (96.6%) CpGs in the human genome are covered by at least 5 mapped reads, which are almost the same as the number of CpGs covered by 65.8× nanopore reads (Fig. 5a). In RepeatMasker repeats, SDs, and cenSats, PacBio CCS detects methylation states of 96.8%, 88.4%, and 85.3% CpGs, respectively. Compared to BS-seq, methylation states of 10.4%, 33.6%, and 34.7% CpGs in RepeatMasker repeats, SDs, and cenSats can only be detected by using CCS, respectively (Fig. 5b). In non-RepeatMasker regions, PacBio CCS detects methylation states of 96.1% CpGs, which are 8.9% more than the CpGs detected by BS-seq (Supplementary Fig. 19a).

The HG002 CCS reads are shorter than the HG002 nanopore reads (mean read length: 18,797 bp vs. 21,933 bp). However, the number of CpGs phased by PacBio CCS (i.e., CpGs covered by at least 5 phased CCS reads) is not significantly less than the number of CpGs phased by nanopore sequencing: 26.97 M vs. 27.66 M (Fig. 5c). Both PacBio CCS and nanopore sequencing phase much more CpGs than BS-seq. with limited read length, BS-seq can only phase 6.71 M of human CpGs. PacBio CCS phases 85.4%, 60.2%, and 46.5% of the CpGs in RepeatMasker repeats, SDs, and cenSats, which are 63.8%, 45.9%, and 35.7% more than the CpGs phased by BS-seq, respectively (Fig. 5d). In non-RepeatMasker regions, PacBio CCS phases 64.4% more CpGs than BS-seq does (Supplementary Fig. 19b). Notably, PacBio CCS phases slightly more CpGs than nanopore sequencing in cenSats, which may indicate that highly accurate CCS reads are more suitable for SNV detection and methylation phasing across peri/centromeric regions than nanopore R9.4.1 reads are.

The methylation frequencies of CpGs predicted using PacBio CCS in repetitive genomic regions are highly correlated with those predicted using BS-seq and nanopore sequencing. In the RepeatMasker repeats, SDs, and cenSats of HG002, PacBio CCS gets 0., 0., and 0. correlations with BS-seq, and gets 0., 0., and 0. correlations with nanopore sequencing, respectively (Supplementary Table 16). For the haplotype-aware methylation detection in the repetitive genomic regions of HG002, PacBio CCS also gets >0.89 and >0.90 correlations with BS-seq and nanopore sequencing, respectively (Supplementary Table 17). In summary, with ccsmeth and ccsmethphase, PacBio CCS can be a comprehensive and accurate technology for 5mCpG detection and methylation phasing in repetitive genomic regions.

Ethical statement and sample collection

This study is compliant with the “Guidance of the Ministry of Science and Technology (MOST) of China for the Review and Approval of Human Genetic Resources”. The genome sequencing of the Chinese sample SD_P1 and the Chinese family trio (HN_FA, HN_MO, HN_S1) was approved by the Research Ethics Committee in the School of Life Sciences, Central South University (No. -1-6). We selected the Chinese samples from the Chinese autism spectrum disorder cohort61 with no specific sex or age requirements. All the participants signed the informed consent before sample collection. The genomic DNA for PacBio sequencing and bisulfite sequencing was extracted from the peripheral blood of each sample. For the experiment of Zebrafish sample, all animal protocols were reviewed and approved by the Animal Care and Use Committee at Zhongshan Ophthalmic Center, Sun Yat-sen University.

PacBio CCS data of human

We sequenced a SMRT cell CCS data of NA (GM cell line from Coriell Institute). ~11 μg genomic DNA was extracted using QIAGEN MagAttract HMW DNA Kit (QIAGEN, Cat# ). The extracted DNA of NA was amplified via whole genome amplification. Half of the amplified DNA was then treated with the CpG methyltransferase M.SssI. Before library preparation, the genomic DNA was sheared to ~20 Kb on a MegaRuptor3 (Diagenode). Libraries of the M.SssI-treated DNA and the other half amplified DNA were prepared in 10 Kb insert size with the Express Template Prep kit 2.0 (PacBio, No. 100-938-900), and were then barcoded to sequence on a PacBio Sequel II sequencer with Sequel II sequencing kit 2.0 (PacBio, No. 101-826-100). We sequenced the native genomic DNA of four Chinese samples using the same procedure on the Sequel II system. We got 2 SMRT cells of CCS reads (19.6× mean genome coverage in total) in 15Kb insert size for SD_P1. We also got 2 SMRT cells of CCS reads in 15 Kb insert size for the HN trio samples. There are 21.8×, 22.4×, and 21.1× reads for HN_FA, HN_MO, and HN_S1, respectively.

We downloaded three CCS raw subreads of a human sample from Tse et al.24: M01 and W01; M02 and W02; M03 and W03. Each of the three datasets contains two groups of reads: the methylated reads sequenced using M.SssI-treated DNA (M01, M02, and M03) and the unmethylated reads sequenced using amplified DNA (W01, W02, and W03). The three datasets were sequenced on PacBio sequencers with Sequel I sequencing kit 3.0, Sequel II sequencing kit 1.0, and Sequel II sequencing kit 2.0, respectively.

We downloaded 9 SMRT cells of HG002 CCS raw subreads in 15 Kb, 20 Kb, and 24 Kb insert sizes from Baid et al.26 and Human Pangenome Reference Consortium23. We also got 2 SMRT cells of CHM13 CCS raw subreads in 20 Kb insert size from Nurk et al.27. CCS subreads in all datasets were processed to generate CCS reads using pbccs (v6.4.0, https://github.com/PacificBiosciences/ccs). Details of all CCS data are provided in Supplementary Table 1.

Data partition of the PacBio CCS data of human

For each dataset of M.SssI-treated and PCR-treated human DNA from Tse et al.24, we randomly select 50% methylated reads and 50% unmethylated reads for model training, while the remaining 50% methylated and unmethylated reads are used for evaluation at read level.

To evaluate ccsmeth using datasets of long CCS reads, we train the read-level model of ccsmeth using NA CCS reads aligned to autosomes and a SMRT cell of HG002 CCS reads. We use another 2 SMRT cells of HG002 reads for site-level model training. The CCS reads of NA aligned to chrX and chrM, the left 6 SMRT cells of HG002 CCS reads, the 2 SMRT cells of CHM13 CCS reads, and the CCS reads of the HD family trio are used for evaluation (Supplementary Table 2).

Illumina and nanopore data of human

We sequenced the SD_P1 sample using BS-seq. The extracted genomic DNA (≥ 1 μg) was first sheared by Covaris and purified to 200–350 bp. The sheared DNA was then end-repaired and ligated to methylated adapters. The adapter-ligated DNA was bisulfite-converted with EZ DNA Methylation-Gold Kit (Zymo Research, Cat# D) and then PCR-amplified. Qubit® 2.0 Fluorometer (Invitrogen) was used to quantify the DNA fragments of the library. Finally, the library was sequenced on a NovaSeq sequencer (Illumina). In total, we got 15.7× coverage of 2 × 150 bp paired reads.

Goto TSKT to know more.

We downloaded BS-seq and nanopore R9.4.1 data of HG002 from ONT Open Datasets (https://labs.epi2me.io/dataindex/). There are 117.5× coverage of 2 × 150 bp paired reads of BS-seq, and 9.5 million (65.8× coverage) nanopore reads with a mean length of 21,933 bp for HG002. We also got Illumina whole-genome sequencing (WGS) trio data of HG002 from GIAB62, in which there are 63.1×, 55.7×, and 67.9× coverage of 2 × 250 bp paired reads for HG002, HG003, and HG004, respectively. We downloaded 6.7 million (41.8× coverage) nanopore R9.4.1 reads of CHM13 from Nurk et al.27. The mean length of the CHM13 nanopore reads is 19,891 bp.

We used Bismark63 (v0.23.1) to process all BS-seq data. For the nanopore data, we basecalled the Fast5 files (raw reads) using Guppy (version 4.2.2+effbaf8). Then we used DeepSignal2 (v0.1.2, https://github.com/PengNi/deepsignal2), an improved version of DeepSignal17, to call methylation from the nanopore reads. Details of all Illumina and nanopore data used in this study are provided in Supplementary Table 3.

Reference genome and annotations

We used CHM13 v2.027 as the human reference genome to process all sequencing data. The gene annotations were downloaded from the GitHub repository marbl/CHM. The annotations of repetitive genomic elements (RepeatMasker)54,55, segmental duplications56, peri/centromeric satellites57, and CpG islands were downloaded from corresponding tracks of UCSC Genome Browser (T2T CHM13v2.0/hs1)64.

We got 205 known imprinted intervals of human from Akbari et al.48, which were generated from five previous studies49,50,51,52,53. Of these intervals, there were 102 “well-characterized” intervals that were reported by at least two studies49. By using UCSC LiftOver65, GRCh38 coordinates of 204 intervals (102 “well-characterized” and 102 “other” intervals) were successfully converted to CHM13 coordinates.

PacBio CCS and BS-seq data of Zebrafish

We sequenced the Zebrafish sample using PacBio CCS and BS-seq with the same procedure for sequencing the human samples. The genomic DNA of Zebrafish was extracted from the muscular tissues of the Zebrafish adults (TU wild-type line, male and female), which were provided by China Zebrafish Resource Center. >10 μg and >1 μg DNA was used for PacBio CCS and BS-seq, respectively. In total, we got 23.3× CCS reads and 29.5× BS-seq reads.

Methylation calling of ccsmeth at read level

To call CpG methylation at read level, ccsmeth needs CCS reads with kinetics information in BAM format, which can be generated from raw subreads by pbccs with “--hifi-kinetics” option. The process of ccsmeth to call methylation at reads level is as follows (Fig. 1):

  1. (1)

    Feature extraction. Each CCS read with kinetics information contains IPD and PW values for bases in forward and reverse complement strands of the read, which are averaged from corresponding bases in subreads. Before extracting features for CpGs in a CCS read, we first normalize the IPD and PW values of each strand in the read using Z-score normalization. Then for a CpG in the forward strand of the CCS read, we extract a 21-mer sequence context surrounding the CpG. Finally, the averaged IPD and PW values, the number of covered subreads of each base in the 21-mer, together with the 21-mer nucleotide sequence form a 4 × 21 feature matrix. We also construct a feature matrix for the symmetric CpG in the reverse complement strand using the same way.

  2. (2)

    Methylation state prediction. We use the two feature matrixes to predict a single methylation state for the symmetric CpG pair. Each of the two matrixes is fed into a deep neural network, which contains three bidirectional Gated Recurrent Unit (BiGRU) layers42 and one Bahdanau attention layer43 (Fig. 1b, Supplementary Note 5). Each BiGRU layer has a hidden size of 256. The outputs from the two attention layers are processed by a full connection layer and then by a Softmax layer. Finally, a methylation probability Pr is outputted. A binary methylation state of the CpG is also set based on Pr: if Pr >0.5, the CpG is predicted as methylated, otherwise is predicted as unmethylated.

Methylation calling of ccsmeth at site level

We used the following steps to call CpG methylation at site level:

  1. (1)

    Alignment. CCS reads should be aligned to the reference genome before site-level methylation calling. We use pbmm2 (v1.9.0, https://github.com/PacificBiosciences/pbmm2), a modified version of minimap266 for PacBio native data formats, to align all the CCS reads used in this study.

  2. (2)

    Methylation calling in count mode. In count mode, based on the binary methylation states of CpGs in the mapped reads, the methylation frequency of a CpG is calculated as the number of reads where the CpG is called methylated divided by the total number of reads mapped to the CpG (Supplementary Fig. 1).

  3. (3)

    Methylation calling in model mode. In model mode, we used the information of neighboring CpGs to predict site-level methylation frequencies in a way similar to pb-CpG-tools. For a targeted CpG, the read-level methylation probabilities of the CpG and each of its 10 adjacent CpGs are first summarized in a histogram with 20 discretized bins separately. Then each histogram is normalized by its L2-norm67 value. The distances of 11 CpGs to the targeted CpG have also been calculated. 11 normalized histograms and the 11 distance values are organized into a 21 × 11 feature matrix, which is then fed into a BiGRU layer and an attention layer (Fig. 1c). The BiGRU layer has a hidden size of 32. At last, ccsmeth outputs a methylation probability Ps (\({P}_{s}\in [0,\,1]\)) as the methylation frequency of the targeted CpG.

Model training of ccsmeth

  1. (1)

    Training of the read-level model. For the datasets of M.SssI-treated and amplified DNA, we extract positive (methylated) and negative (methylated) samples from reads of M.SssI-treated and PCR-treated DNA, respectively. For CCS data of native DNA, based on the results of BS-seq, we take CpGs which are covered with at least 5 reads and have 100% methylation frequency as high-confidence methylated sites. CpGs that have at least 5 mapped reads and zero methylation frequency are selected as high-confidence unmethylated sites. Then we extract positive and negative samples from the reads which are mapped to the high-confidence methylated and unmethylated sites, respectively.

    To train the model of ccsmeth for read-level 5mCpG detection, we split the total training samples at a ratio of 99:1 as the training dataset and the validation dataset. The model parameters are learned on the training dataset by minimizing the loss calculated by cross-entropy (Supplementary Note 5) with a batch size of 512 and an initial learning rate of 0.001. The learning rate is adopted by Adam optimizer68 and decays by a factor of 0.1 after every epoch. The parameter betas in Adam optimizer are set to (0.9, 0.999). We use two strategies to prevent overfitting. First, we add a dropout layer in each of the GRU layers and the fully connected layer. We set the dropout probability to 0.5 at each dropout layer. Second, we use early stopping69 during training. We set at least 10 epochs and at most 50 epochs for each training. The model parameters with the current best performance on the validation dataset are saved in every epoch. During epochs 10 to 50, if the best performance of the current epoch decreases, we stop the training process.

  2. (2)

    Training of the site-level model. The training of the site-level model of ccsmeth is treated as a regression problem. From the results of BS-seq, we select CpGs with at least 10× coverage, and use the methylation frequencies of the CpGs as training targets. We then generate the read-level methylation probabilities calculated by ccsmeth for each targeted CpG as features. The targeted CpGs, alongside the features, are then split at a ratio of 99:1 as the training dataset and the validation dataset. Finally, we use the same training process of the read-level model to train the site-level model but with a different loss function: during the training of the site-level model, we use the mean squared error (MSE) (Supplementary Note 5) instead of cross-entropy to calculate the loss.

Evaluation of ccsmeth

  1. (1)

    Evaluation at read level. For the controlled methylation datasets, we extract positive and negative samples from the reads of PCR-treated and M.SssI-treated DNA, respectively. For the CCS reads of HG002 and SD_P1, we first select high-confidence methylated and unmethylated sites from the results of BS-seq. Then we extract positive and negative samples of the selected sites from CCS reads. We calculate accuracy, sensitivity, specificity, and Area Under the Curve (AUC) based on the prediction of randomly selected 100,000 positive samples and 100,000 negative samples. We repeat the subsampling 5 times for each evaluation. (2) Evaluation at site level. We evaluate ccsmeth at the genome-wide site level by comparing per-site methylation frequencies predicted by ccsmeth with the results of BS-seq and nanopore sequencing. Using the methylation frequencies of CpGs, Pearson correlation (r), the coefficient of determination (r2), Spearman correlation (ρ), and root mean square error (RMSE) are calculated. For each comparison, we only compare CpGs that have at least 5× coverage in results of both PacBio CCS and BS-seq (or nanopore sequencing). To evaluate the methylation frequencies predicted by ccsmeth under different coverage of CCS reads, we randomly subsample the CCS reads using rasusa70 (v0.7.0).

    ccsmeth is implemented using Python3 and PyTorch (version 1.11.0). We evaluate ccsmeth, HK model, primrose (version 1.3.0), and pb-CpG-tools (v1.1.0) on the same testing datasets. The source code of the HK model was taken from Tse et al.24 under the CUHK software license. We also compared the runtime and peak memory of the main steps of ccsmeth, HK model, and primrose (Supplementary Note 7, Supplementary Fig. 22, and Supplementary Tables 18–19).

ccsmethphase for methylation phasing and ASM detection

In the ccsmethphase pipeline (Fig. 4a), we use ccsmeth to call read-level methylation from PacBio CCS reads. Then the CCS reads are aligned to the reference genome by using pbmm2 (v1.9.0). We use Clair333 (v0.1-r11 minor 2) with the “hifi” model to call variants and only keep the “PASS” SNVs (i.e., high-quality SNVs). We use WhatsHap46 (version 1.4) to phase the “PASS” SNVs, and then to phase the reads (i.e., tag the reads by the haplotypes). The methylation frequencies of CpGs in each haplotype are then inferred by ccsmeth. At last, we use DSS47 (version 2.44.0) with default parameters to perform differential methylation analysis (DMA). By taking methylation frequencies of CpGs in two haplotypes as input, DSS calls differentially methylated regions (DMRs) using Wald tests47. We only consider regions generated by DSS with p-value < 0.001 and |methylation difference | >0.2 as significant DMRs.

ccsmethphase is implemented by integrating ccsmeth and other necessary tools using Nextflow (version 22.04.5.). We evaluated the runtime of the ccsmethphase pipeline on an HPC cluster. Details of the evaluation are shown in Supplementary Note 7, Supplementary Table 20, and Supplementary Fig. 23.

Methylation difference of known imprinted intervals between two haplotypes

We compare the haplotype-level methylation difference of each known imprinted interval calculated by ccsmethphase with the results of BS-seq and nanopore sequencing (Supplementary Note 3 and 4). For each interval, we first split the CpGs in the reads which are mapped to the interval into two groups (Ghp1 and Ghp2) according to the haplotype tags of the reads. Then we calculate the methylation level of the interval in each haplotype (Mhp1 and Mhp2) as the fraction of CpGs that are predicted as methylated in the corresponding group (Eq. (1)). At last, we calculate the methylation difference of the interval between two haplotypes (Eq. (2)). Note that we only calculate the methylation difference of intervals that have at least 5 CpGs covered by reads in both haplotypes.

$${M}_{{hp}1}=\frac{{{{{{\rm{No}}}}}}. \, {{{{{\rm{of}}}}}}\; {{{{{\rm{methylated}}}}}}\; {{{{{\rm{CpGs}}}}}}\; {{{{{\rm{in}}}}}}\,{G}_{{hp}1}}{{{{{{\rm{No}}}}}}.\;{{{{{\rm{of}}}}}}\; {{{{{\rm{total}}}}}}\; {{{{{\rm{CpGs}}}}}}\; {{{{{\rm{in}}}}}}\,{G}_{{hp}1}},\, {M}_{{hp}2}=\frac{{{{{{\rm{No}}}}}}.\, {{{{{\rm{of}}}}}}\; {{{{{\rm{methylated}}}}}}\; {{{{{\mathrm{CpGs}}}}}}\; {{{{{\rm{in}}}}}} \, {G}_{{hp}2}} {{{{{{\rm{No}}}}}}. \; {{{{{\rm{of}}}}}}\; {{{{{\rm{total}}}}}}\; {{{{{\rm{CpGs}}}}}}\; {{{{{\rm{in}}}}}}\,{G}_{{hp}2}}$$ (1) $${M}_{{diff}}={{{{{\rm{|}}}}}}{M}_{{hp}1}-{M}_{{hp}2}{{{{{\rm{|}}}}}}$$ (2)

Statistics and reproducibility

This study obtained 5 human samples (NA, SD_P1, HN_FA, HN_MO, HN_S1) for generating PacBio CCS data. We also used publicly available datasets of samples M01&W01, M02&W02, M03&W03, HG002, and CHM13. The training and evaluation process of the proposed method followed standard practices for separating training, validation, and testing datasets. During evaluation, we performed orthogonal validation for samples HG002, CHM13, and SD_P1 by using nanopore sequencing and bisulfite sequencing. We also compared the results of sequencing data from different HG002 SMRT cells to validate the reproducibility. No statistical method was used to predetermine sample size. CCS reads that have less than 3 full-length subreads, and “Fail” Nanopore sequencing reads (mean Q-score≤9) were not used in the study. The experiments were not randomized. The Investigators were not blinded to allocation during experiments and outcome assessment. All the statistical details for training and evaluation can be found in the figure legends, Methods section, and Supplementary information.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Contact us to discuss your requirements of DNA Methylation Detection Kits. Our experienced sales team can help you identify the options that best suit your needs.

24

0

Comments

0/2000

All Comments (0)

Guest Posts

If you are interested in sending in a Guest Blogger Submission,welcome to write for us!

Your Name: (required)

Your Email: (required)

Subject:

Your Message: (required)

0/2000