Short-read sequencing has enabled the de novo assembly of a few individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays, and generate a de novo assembly of 2.93Gb (contig N50: 8.3Mb, scaffold N50: 22.0Mb, including 39.3Mb N-bases), together with 206Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8Mb of HX1-specific sequences, including 4.1Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.
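The contig and scaffold N50 figures quoted above follow the usual definition: the length of the sequence at which the running total, taken from longest to shortest, first reaches half of the total assembly length. The following is a minimal sketch of that calculation, not the pipeline used for HX1; the function name `n50` and the toy contig lengths are assumptions made for illustration.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest; the N50 is the length
    of the sequence at which the running total first reaches at least
    half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Hypothetical toy contig lengths (in Mb), not HX1 data.
toy_contigs = [8, 6, 5, 3, 2, 1]
print(n50(toy_contigs))  # prints 6
```

In the toy example the total is 25, so the threshold is 12.5; the two longest contigs sum to 14, which makes the second one (6) the N50.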
|Illumina HiSeq X
|2.8 billion reads
|PacBio SMRT cell
|1.169M molecules (>150kb)
|PacBio SMRT cell
|50 cells (1-2kb, 2-3kb, 3-5kb, 5kb+)
|2.721M error-corrected reads
|Illumina HiSeq 2500
|10x Genomics linked-read
Raw data can be accessed from SRA.
All PacBio data XML files
BioNano raw data (4 cells) (8 cells)
Illumina Omni 2.5M array project data
Processed IsoSeq data (see README)
New in 2019: bisulfite sequencing data is available at SRA
New in 2019: Nanopore sequencing data is available at SRA
New in 2019: 10x Genomics linked-read whole-genome sequencing data is available at SRA
non-GRCh38 sequences in HX1
non-GRCh38 non-YH2.0 sequences in HX1
CNV and SV calls from BioNano, SMRT long-read, Illumina short-read, and microarray data from the 2016 paper.
Whole-genome alignment (hg38 full analysis + decoy vs hx1f4full_3rdfixedv2) by MUMmer3
Kai Wang: firstname.lastname@example.org