Short-read sequencing has enabled the de novo assembly of a few individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays, and generate a de novo assembly of 2.93Gb (contig N50: 8.3Mb, scaffold N50: 22.0Mb, including 39.3Mb N-bases), together with 206Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8Mb of HX1-specific sequences, including 4.1Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.
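For readers unfamiliar with the N50 statistic quoted above (contig N50: 8.3Mb, scaffold N50: 22.0Mb), the sketch below shows the usual definition: the length L such that sequences of length L or longer contain at least half of the total assembled bases. The function name `n50` and the toy lengths are illustrative only and are not taken from the HX1 pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the sequence at which the running total,
    taken from longest to shortest, first reaches half of all bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    # Only reached when no lengths were supplied.
    raise ValueError("lengths must be non-empty")


# Toy example (made-up lengths, not HX1 contigs).
print(n50([2, 3, 4, 5, 6]))
```

In practice the lengths would be read from a FASTA index or an assembly statistics report rather than typed in by hand.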
Material | Platform | # Cells | # Reads | Bases | Coverage | Mean length | N50 length |
---|---|---|---|---|---|---|---|
DNA | Illumina HiSeq X | - | 2.8 billion reads | 428.8G | 143X | 151 bp | 151 bp |
DNA | PacBio SMRT cell | 377 cells | 44.2M reads | 309.0G | 103X | 7.0kb | 12.1kb |
DNA | BioNano IrysChip | 12 cells | 1.169M molecules (>150kb) | 302.8G | 101X | 259.0kb | 224.7kb |
RNA | PacBio SMRT cell | 50 cells (1-2kb, 2-3kb, 3-5kb, 5kb+) | 2.721M error-corrected reads | 5.8G | - | 2.1kb | 2.7kb |
RNA | Illumina HiSeq 2500 | NA | 48.9M reads | 4.4G | - | 90 bp | 90 bp |
DNA | Oxford Nanopore | 20 cells | 4.8M reads | 91G | 30X | 17.1kb | 25.8kb |
DNA | 10x Genomics linked-read | - | 861M reads | 130G | 43X | 151 bp | 151 bp |
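The Coverage column is consistent with dividing the total bases by a genome size of roughly 3.0 Gb. That genome size is an assumption inferred from the numbers in the table, not something stated in the source. A minimal sketch of the arithmetic:

```python
# Assumed haploid genome size in Gb. This value is NOT given in the source;
# it is chosen because it reproduces the Coverage column of the table.
ASSUMED_GENOME_SIZE_GB = 3.0


def fold_coverage(total_bases_gb, genome_size_gb=ASSUMED_GENOME_SIZE_GB):
    """Return fold coverage as total sequenced bases over genome size."""
    return total_bases_gb / genome_size_gb


# Total bases (in Gb) for the DNA rows of the table.
dna_datasets = {
    "Illumina HiSeq X": 428.8,
    "PacBio SMRT cell": 309.0,
    "BioNano IrysChip": 302.8,
    "Oxford Nanopore": 91.0,
    "10x Genomics linked-read": 130.0,
}

for platform, bases_gb in dna_datasets.items():
    print(f"{platform}: {round(fold_coverage(bases_gb))}X")
```

The RNA rows have no coverage value because transcriptome depth is not meaningfully expressed relative to genome size.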
Raw data can be accessed from SRA.
All PacBio data XML files
BioNano raw data (4 cells) (8 cells)
Illumina Omni 2.5M array project data
Processed IsoSeq data (see README)
New in 2019: bisulfite sequencing data is available at SRA
New in 2019: Nanopore sequencing data is available at SRA
New in 2019: 10X Genomics linked-read whole-genome sequencing data is available at SRA
Polished HX1 primary contigs
Polished HX1 primary contigs + associated contigs
Scaffolds from polished HX1 primary contigs
Scaffolds from polished HX1 primary contigs + associated contigs
Non-GRCh38 sequences in HX1
Non-GRCh38, non-YH2.0 sequences in HX1
CNV and SV calls from BioNano, SMRT long-read, Illumina short-read, and microarray data from the 2016 paper.
Whole-genome alignment (hg38 full analysis + decoy vs hx1f4full_3rdfixedv2) by MUMmer3
MD5SUM
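Downloaded files can be checked against the MD5SUM file before use. The sketch below assumes the file follows the common `md5sum` layout (one `<hex digest>  <file name>` pair per line); that layout, the helper names, and the example paths are assumptions, not details confirmed by the source.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum(text):
    """Parse md5sum-style text into a {file name: digest} mapping.

    Assumes lines of the form '<hex digest>  <file name>'.
    """
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary-mode entries with a leading '*'.
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def verify(md5sum_path, directory="."):
    """Return {file name: True/False}; missing files count as failures."""
    expected = parse_md5sum(Path(md5sum_path).read_text())
    results = {}
    for name, checksum in expected.items():
        target = Path(directory) / name
        results[name] = target.is_file() and md5_of_file(target) == checksum
    return results


if __name__ == "__main__":
    # Hypothetical paths: adjust to wherever the files were downloaded.
    if Path("MD5SUM").is_file():
        for name, ok in verify("MD5SUM", directory=".").items():
            print(f"{name}: {'OK' if ok else 'FAILED'}")
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte assembly and alignment files.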
Kai Wang: kaichop@gmail.com