HuaXia1 (HX1) Project

Background

Short-read sequencing has enabled the de novo assembly of a few individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays, and generate a de novo assembly of 2.93Gb (contig N50: 8.3Mb, scaffold N50: 22.0Mb, including 39.3Mb N-bases), together with 206Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8Mb of HX1-specific sequences, including 4.1Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.
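The contig and scaffold N50 values quoted above follow the usual definition: the length L such that sequences of length L or longer contain at least half of all assembled bases. A minimal sketch of that calculation is given below. The function name and the toy input lengths are illustrative assumptions, not part of the HX1 pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the sequence at which the running total,
    taken from longest to shortest, first reaches half of all bases.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("n50() needs at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare doubled running sum to avoid fractional halves.
        if running * 2 >= total:
            return length


# Toy example (hypothetical contig lengths, not HX1 data):
# total = 20, and 6 + 5 = 11 is the first running sum >= 10.
print(n50([2, 3, 4, 5, 6]))  # prints 5
```

Applied to the contig lengths of an assembly FASTA file, the same function would yield the contig N50; applied to scaffold lengths, the scaffold N50.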

Summary of data sets

Material	Platform	# Cells	# Reads	Bases	Coverage	Mean length	N50 length
DNA	Illumina HiSeq X	-	2.8 billion reads	428.8G	143X	151bp	151bp
DNA	PacBio SMRT cell	377 cells	44.2M reads	309.0G	103X	7.0kb	12.1kb
DNA	BioNano IrysChip	12 cells	1.169M molecules (>150kb)	302.8G	101X	259.0kb	224.7kb
RNA	PacBio SMRT cell	50 cells (1-2kb, 2-3kb, 3-5kb, 5kb+)	2.721M error-corrected reads	5.8G	-	2.1kb	2.7kb
RNA	Illumina HiSeq 2500	-	48.9M reads	4.4G	-	90bp	90bp
DNA	Oxford Nanopore	20 cells	4.8M reads	91G	30X	17.1kb	25.8kb
DNA	10x Genomics linked-read	-	861M reads	130G	43X	151bp	151bp
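The Coverage column is the total number of sequenced bases divided by the genome size. As a rough check, the sketch below recomputes it from the Bases column. The 3.0 Gb genome size is an assumption chosen because it reproduces the reported values; the page itself does not state which size was used.

```python
# Assumed haploid human genome size in gigabases (not stated in the source).
GENOME_SIZE_GB = 3.0


def coverage(bases_gb, genome_size_gb=GENOME_SIZE_GB):
    """Return fold coverage given sequenced bases and genome size (both in Gb)."""
    return bases_gb / genome_size_gb


# (bases in Gb, reported coverage) for the DNA rows of the table above.
dna_datasets = {
    "Illumina HiSeq X": (428.8, 143),
    "PacBio SMRT cell": (309.0, 103),
    "BioNano IrysChip": (302.8, 101),
    "Oxford Nanopore": (91.0, 30),
    "10x Genomics linked-read": (130.0, 43),
}

for platform, (bases_gb, reported) in dna_datasets.items():
    estimate = coverage(bases_gb)
    print(f"{platform}: estimated {estimate:.1f}X, reported {reported}X")
```

Under this assumption each estimate rounds to the reported coverage. The RNA rows have no coverage value because transcriptome depth is not measured against genome size.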

Download

NCBI download

Raw data can be accessed from SRA.

Other raw data

All PacBio data XML files
BioNano raw data (4 cells) (8 cells)
Illumina Omni 2.5M array project data
Processed IsoSeq data (see README)
New in 2019: bisulfite sequencing data is available at SRA
New in 2019: Nanopore sequencing data is available at SRA
New in 2019: 10X Genomics linked-read whole-genome sequencing data is available at SRA

Assemblies

polished HX1 primary contigs
polished HX1 primary contigs + associated contigs
scaffolds from polished HX1 primary contigs
scaffolds from polished HX1 primary contigs + associated contigs

Other analysis results

non-GRCh38 sequences in HX1
non-GRCh38 non-YH2.0 sequences in HX1
CNV and SV calls on BioNano, SMRT long-read, Illumina short-read, and microarray data from the 2016 paper
Whole genome alignment (hg38 full analysis + decoy vs hx1f4full_3rdfixedv2) by MUMmer3
MD5SUM
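Downloaded files can be checked against the MD5SUM listing before use. The sketch below assumes the listing uses the common md5sum layout (one "<hex digest>  <file name>" pair per line) and that the files sit in the same directory as the listing; neither detail is confirmed by this page, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest_path):
    """Check every file named in an md5sum-style manifest.

    Returns a dict mapping file name to True (digest matches) or
    False (file missing or digest differs).
    """
    manifest = Path(manifest_path)
    results = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        expected, name = line.split(maxsplit=1)
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        target = manifest.parent / name
        results[name] = target.is_file() and md5_of(target) == expected.lower()
    return results


# Hypothetical usage, assuming the listing was saved as "MD5SUM":
# for name, ok in verify_manifest("MD5SUM").items():
#     print("OK    " if ok else "FAILED", name)
```

Reading in chunks keeps memory use flat, which matters for multi-gigabyte assembly and alignment files.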

Contact

Kai Wang: kaichop@gmail.com