HuaXia1 (HX1) Project

Background

Short-read sequencing has enabled the de novo assembly of a few individual human genomes, but with inherent limitations in characterizing repeat elements. Here we sequence a Chinese individual HX1 by single-molecule real-time (SMRT) long-read sequencing, construct a physical map by NanoChannel arrays, and generate a de novo assembly of 2.93Gb (contig N50: 8.3Mb, scaffold N50: 22.0Mb, including 39.3Mb N-bases), together with 206Mb of alternative haplotypes. The assembly fully or partially fills 274 (28.4%) N-gaps in the reference genome GRCh38. Comparison to GRCh38 reveals 12.8Mb of HX1-specific sequences, including 4.1Mb that are not present in previously reported Asian genomes. Furthermore, long-read sequencing of the transcriptome reveals novel spliced genes that are not annotated in GENCODE and are missed by short-read RNA-Seq. Our results imply that improved characterization of genome functional variation may require the use of a range of genomic technologies on diverse human populations.

Summary of data sets

Material | Platform | # Cells | # Reads | Bases | Coverage | Mean length | N50 length
DNA | Illumina HiSeq X | - | 2.8 billion reads | 428.8G | 143X | 151bp | 151bp
DNA | PacBio SMRT cell | 377 cells | 44.2M reads | 309.0G | 103X | 7.0Kb | 12.1Kb
DNA | BioNano IrysChip | 12 cells | 1.169M molecules (>150Kb) | 302.8G | 101X | 259.0Kb | 224.7Kb
RNA | PacBio SMRT cell | 50 cells (1-2Kb, 2-3Kb, 3-5Kb, 5Kb+) | 2.721M error-corrected reads | 5.8G | - | 2.1Kb | 2.7Kb
RNA | Illumina HiSeq 2500 | NA | 48.9M reads | 4.4G | - | 90bp | 90bp
DNA | Oxford Nanopore | 20 cells | 4.8M reads | 91G | 30X | 17.1Kb | 25.8Kb
DNA | 10x Genomics linked-read | - | 861M reads | 130G | 43X | 151bp | 151bp
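
The Coverage column follows from the Bases column: fold coverage is the number of sequenced bases divided by the genome size. The sketch below reproduces the DNA coverage values under the assumption of a 3.0 Gb haploid genome. This page does not state the genome size that was used, so `GENOME_SIZE_GB` is an illustrative choice that happens to match the reported figures after rounding.

```python
# Assumption: a 3.0 Gb haploid genome size; the page does not state the value used.
GENOME_SIZE_GB = 3.0


def coverage(bases_gb, genome_size_gb=GENOME_SIZE_GB):
    """Return fold coverage: sequenced bases divided by genome size."""
    return bases_gb / genome_size_gb


# Bases (in Gb) for the DNA data sets, copied from the summary table.
dna_bases_gb = {
    "Illumina HiSeq X": 428.8,
    "PacBio SMRT cell": 309.0,
    "BioNano IrysChip": 302.8,
    "Oxford Nanopore": 91.0,
    "10x Genomics linked-read": 130.0,
}

for platform, bases in dna_bases_gb.items():
    # Round to the nearest whole fold, as in the table.
    print(f"{platform}: {coverage(bases):.0f}X")
```

With this assumed genome size the loop prints 143X, 103X, 101X, 30X and 43X, matching the table. No coverage is listed for the RNA data sets, since fold coverage of the genome is not meaningful for transcriptome reads.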

Download

NCBI download

Raw data can be accessed from SRA.

Other raw data

All PacBio data XML files
BioNano raw data (4 cells) (8 cells)
Illumina Omni 2.5M array project data
Processed IsoSeq data (see README)
New in 2019: bisulfite sequencing data is available at SRA
New in 2019: Nanopore sequencing data is available at SRA
New in 2019: 10X Genomics linked-read whole-genome sequencing data is available at SRA

Assemblies

polished HX1 primary contigs
polished HX1 primary contigs + associated contigs
scaffolds from polished HX1 primary contigs
scaffolds from polished HX1 primary contigs + associated contigs
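
The contig and scaffold N50 values quoted in the Background can be recomputed from any of the assembly files above. The following is a minimal sketch, assuming the assemblies are distributed as standard FASTA; the helper names `fasta_lengths` and `n50` are illustrative and are not part of the project's own pipeline.

```python
def fasta_lengths(lines):
    """Yield the length of each sequence in an iterable of FASTA lines."""
    length = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the largest length L such that sequences of length >= L
    together contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty input


# Toy example with three contigs of 10, 6 and 4 bases.
toy = [">c1", "ACGTACGTAC", ">c2", "ACGTAC", ">c3", "ACGT"]
print(n50(fasta_lengths(toy)))  # prints 10
```

For a real file, pass an open file handle instead of the toy list, for example `n50(fasta_lengths(open(path)))`, where `path` is the local copy of a downloaded assembly. Note that this counts every character in a sequence line, so N-bases inside scaffolds contribute to the scaffold lengths.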

Other analysis results

non-GRCh38 sequences in HX1
non-GRCh38 non-YH2.0 sequences in HX1
CNV and SV calls on BioNano, SMRT long read, Illumina short read, and microarray data from the 2016 paper.
Whole genome alignment (hg38 full analysis + decoy vs hx1f4full_3rdfixedv2) by MUMmer3
MD5SUM
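
Because several of these files are large, it is worth checking each download against the MD5SUM file. The sketch below assumes the checksum file uses the usual `md5sum` layout (one `<hex digest>  <file name>` pair per line); that layout is an assumption, not something stated on this page, and the function names are illustrative.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5sum(text):
    """Parse md5sum-style lines into a {file name: digest} dict.

    Assumes each line looks like "<hex digest>  <file name>".
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # md5sum marks binary mode with a leading "*" on the name.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(path, name, expected):
    """Return True if the file at `path` matches the digest listed for `name`."""
    return name in expected and expected[name] == md5_of_file(path)


# Usage (file names are hypothetical placeholders):
#   expected = parse_md5sum(open("MD5SUM").read())
#   print(verify("downloaded_assembly.fa.gz", "downloaded_assembly.fa.gz", expected))
```

Reading in chunks keeps memory use flat regardless of file size, which matters for multi-gigabyte assemblies and alignments.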

Contact

Kai Wang: kaichop@gmail.com