Scallop genome reveals molecular adaptations to semi-sessile life and neurotoxins (Nature Communications)
Bivalve molluscs are descendants of an early-Cambrian lineage superbly adapted to benthic filter feeding. Adaptations in form and behavior are well recognized, but the underlying molecular mechanisms are largely unknown. Here, we investigate the genome, various transcriptomes, and proteomes of the scallop Chlamys farreri, a semi-sessile bivalve with a well-developed adductor muscle, sophisticated eyes, and remarkable neurotoxin resistance. The scallop’s large striated muscle is energy-dynamic but not fully differentiated from smooth muscle. Its eyes are supported by highly diverse, intronless opsins expanded by retroposition for broadened spectral sensitivity. Rapid byssal secretion is enabled by a specialized foot and multiple proteins including expanded tyrosinases. The scallop uses the hepatopancreas to accumulate neurotoxins and the kidney to transform them into high-toxicity forms through expanded sulfotransferases, probably as deterrence against predation, while it achieves neurotoxin resistance through point mutations in sodium channels. These findings suggest that expansion and mutation of those genes may have profound effects on the scallop’s phenotype and adaptation.
Identification of a neural crest stem cell niche by Spatial Genomic Analysis (Nature Communications)
The neural crest is an embryonic population of multipotent stem cells that form numerous defining features of vertebrates. Due to a lack of reliable techniques to perform transcriptional profiling in intact tissues, it remains controversial whether the neural crest is a heterogeneous or homogeneous population. By coupling multiplex single-molecule fluorescence in situ hybridization with machine-learning-based cell segmentation, we examine expression of 35 genes at single-cell resolution in vivo. Unbiased hierarchical clustering reveals five spatially distinct subpopulations within the chick dorsal neural tube. Here we identify a neural crest stem cell niche that centers around the dorsal midline with high expression of neural crest genes, pluripotency factors, and lineage markers. Interestingly, neural and neural crest stem cells express distinct pluripotency signatures. This Spatial Genomic Analysis toolkit provides a straightforward approach to study quantitative multiplex gene expression in numerous biological systems, while offering insights into gene regulatory networks via synexpression analysis.
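As a rough illustration of the clustering step mentioned above, the following minimal sketch groups cells by their per-gene expression vectors using agglomerative (bottom-up) hierarchical clustering. The abstract does not state the linkage rule or distance metric, so average linkage and Euclidean distance are assumptions here, and the function name `agglomerate` and the six toy cells with three genes each are hypothetical, not data from the study.

```python
from itertools import combinations
from math import dist


def agglomerate(points, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage, for illustration).

    points: list of equal-length numeric tuples (one expression vector per cell).
    Returns a sorted list of clusters, each a sorted list of cell indices.
    """
    # Start with every cell in its own cluster.
    clusters = [[i] for i in range(len(points))]

    def avg_distance(a, b):
        # Mean Euclidean distance over all cross-cluster cell pairs.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    # Repeatedly merge the two closest clusters until n_clusters remain.
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: avg_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(sorted(c) for c in clusters)


# Hypothetical transcript counts for 3 genes in 6 cells (toy data only).
cells = [(9, 8, 1), (10, 9, 0), (1, 2, 9), (0, 1, 10), (5, 5, 5), (4, 6, 5)]
print(agglomerate(cells, 3))  # [[0, 1], [2, 3], [4, 5]]
```

In the study itself the input would be counts for 35 genes per segmented cell, and the resulting clusters would then be mapped back to cell positions in the tissue; the sketch only shows the grouping logic.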
Inference of differentiation time for single cell transcriptomes using cell population reference data (Nature Communications)
Single-cell RNA sequencing (scRNA-seq) is a powerful method for dissecting intercellular heterogeneity during development. Conventional trajectory analysis provides only a pseudotime of development, and often discards cell-cycle events as confounding factors. Here, using matched cell population RNA-seq (cpRNA-seq) as a reference, we developed an “iCpSc” package for integrative analysis of cpRNA-seq and scRNA-seq data. By generating a computational model for reference “biological differentiation time” using cell population data and applying it to single-cell data, we associated cell-cycle checkpoints with the internal molecular timer of single cells in an unbiased manner. Through inferring a network flow from cpRNA-seq to scRNA-seq data, we predicted a role of M phase in controlling the speed of neural differentiation of mouse embryonic stem cells, and validated it through gene knockout (KO) experiments. By linking temporally matched cpRNA-seq and scRNA-seq data, our approach provides an effective and unbiased way to identify developmental trajectories and timing-related regulatory events.
Functional mapping and annotation of genetic associations with FUMA (Nature Communications)
A main challenge in genome-wide association studies (GWAS) is to pinpoint possible causal variants. Results from GWAS typically do not directly translate into causal variants because the majority of hits are in non-coding or intergenic regions, and the presence of linkage disequilibrium leads to effects being statistically spread out across multiple variants. Post-GWAS annotation facilitates the selection of most likely causal variant(s). Multiple resources are available for post-GWAS annotation, yet these can be time consuming and do not provide integrated visual aids for data interpretation. We, therefore, develop FUMA: an integrative web-based platform using information from multiple biological resources to facilitate functional annotation of GWAS results, gene prioritization and interactive visualization. FUMA accommodates positional, expression quantitative trait loci (eQTL) and chromatin interaction mappings, and provides gene-based, pathway and tissue enrichment results. FUMA results directly aid in generating hypotheses that are testable in functional experiments aimed at proving causal relations.
Contributions of Zea mays subspecies mexicana haplotypes to modern maize (Nature Communications)
Maize was domesticated from lowland teosinte (Zea mays ssp. parviglumis), but the contribution of highland teosinte (Zea mays ssp. mexicana, hereafter mexicana) to modern maize is not clear. Here, two genomes for Mo17 (a modern maize inbred) and mexicana are assembled using a meta-assembly strategy after sequencing of 10 lines derived from a maize-teosinte cross. Comparative analyses reveal a high level of diversity between Mo17, B73, and mexicana, including three Mb-size structural rearrangements. The maize spontaneous mutation rate is estimated to be 2.17 × 10^−8 to 3.87 × 10^−8 per site per generation with a nonrandom distribution across the genome. A higher deleterious mutation rate is observed in the pericentromeric regions, and might be caused by differences in recombination frequency. Over 10% of the maize genome shows evidence of introgression from the mexicana genome, suggesting that mexicana contributed to maize adaptation and improvement. Our data offer a rich resource for constructing the pan-genome of Zea mays and genetic improvement of modern maize varieties.
Accurate assembly of transcripts through phase-preserving graph decomposition (Nature Biotechnology)
We introduce Scallop, an accurate reference-based transcript assembler that improves reconstruction of multi-exon and lowly expressed transcripts. Scallop preserves long-range phasing paths extracted from reads, while producing a parsimonious set of transcripts and minimizing coverage deviation. On 10 human RNA-seq samples, Scallop produces 34.5% and 36.3% more correct multi-exon transcripts than StringTie and TransComb, and respectively identifies 67.5% and 52.3% more lowly expressed transcripts. Scallop achieves higher sensitivity and precision than previous approaches over a wide range of coverage thresholds.
Landscape and evolution of tissue-specific alternative polyadenylation across Drosophila species (Genome Biology)
Drosophila melanogaster has one of the best-described transcriptomes of any multicellular organism. Nevertheless, the paucity of 3′-sequencing data in this species precludes comprehensive assessment of alternative polyadenylation (APA), which is subject to broad tissue-specific control.
Here, we generate deep 3′-sequencing data from 23 developmental stages, tissues, and cell lines of D. melanogaster, yielding a comprehensive atlas of ~62,000 polyadenylated ends. These data broadly extend the annotated transcriptome, identify ~40,000 novel 3′ termini, and reveal that two-thirds of Drosophila genes are subject to APA. Furthermore, we dramatically expand the number of genes known to be subject to tissue-specific APA, such as 3′ untranslated region (UTR) lengthening in head and 3′ UTR shortening in testis, and characterize new tissue and developmental 3′ UTR patterns. Our thorough 3′ UTR annotations permit reassessment of post-transcriptional regulatory networks, via conserved miRNA and RNA binding protein sites. To evaluate the evolutionary conservation and divergence of APA patterns, we generate developmental and tissue-specific 3′-seq libraries from Drosophila yakuba and Drosophila virilis. We document broadly analogous tissue-specific APA trends in these species, but also observe significant alterations in 3′ end usage across orthologs. We exploit the population of functionally evolving poly(A) sites to gain clear evidence that evolutionary divergence in core polyadenylation signal (PAS) and downstream sequence element (DSE) motifs drives broad alterations in 3′ UTR isoform expression across the Drosophila phylogeny.
These data provide a critical resource for the Drosophila community and offer many insights into the complex control of alternative tissue-specific 3′ UTR formation and its consequences for post-transcriptional regulatory networks.
Intron retention enhances gene regulatory complexity in vertebrates (Genome Biology)
While intron retention (IR) is now widely accepted as an important mechanism of mammalian gene expression control, it remains the least studied form of alternative splicing. To delineate conserved features of IR, we performed an exhaustive phylogenetic analysis in a highly purified and functionally defined cell type comprising neutrophilic granulocytes from five vertebrate species spanning 430 million years of evolution.
Our RNA-sequencing-based analysis suggests that IR increases gene regulatory complexity, which is indicated by a strong anti-correlation between the number of genes affected by IR and the number of protein-coding genes in the genome of individual species. Our results confirm that IR affects many orthologous or functionally related genes in granulocytes. Further analysis uncovers new and unanticipated conserved characteristics of intron-retaining transcripts. We find that intron-retaining genes are transcriptionally co-regulated from bidirectional promoters. Intron-retaining genes have significantly longer 3′ UTR sequences, with a corresponding increase in microRNA binding sites, some of which include highly conserved sequence motifs. This suggests that intron-retaining genes are highly regulated post-transcriptionally.
Our study provides unique insights concerning the role of IR as a robust and evolutionarily conserved mechanism of gene expression regulation. Our findings enhance our understanding of gene regulatory complexity by adding another contributor to evolutionary adaptation.
Jointly aligning a group of DNA reads improves accuracy of identifying large deletions (Nucleic Acids Research)
Performing sequence alignment to identify structural variants, such as large deletions, from genome sequencing data is a fundamental task, but current methods are far from perfect. The current practice is to independently align each DNA read to a reference genome. We show that the propensity of genomic rearrangements to accumulate in repeat-rich regions imposes severe ambiguities in these alignments, and consequently on the variant calls; with current read lengths, this affects more than one third of known large deletions in the C. Venter genome. We present a method to jointly align reads to a genome, whereby alignment ambiguity of one read can be disambiguated by other reads. We show this leads to a significant improvement in the accuracy of identifying large deletions (≥20 bases), while imposing minimal computational overhead and maintaining an overall running time that is on par with current tools. A software implementation is available as an open-source Python program called JRA at https://bitbucket.org/jointreadalignment/jra-src.
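To make the alignment ambiguity in repeats concrete, here is a toy sketch that lists every position at which a deletion could be placed in a reference so that the result matches an observed sequence exactly. It is only an illustration of why a single read cannot pin down the breakpoint inside a repeat; it is not the JRA algorithm, the function name `equivalent_deletion_starts` is hypothetical, and the sequences are invented (with a 2-base deletion for readability rather than the ≥20 bases considered in the paper).

```python
def equivalent_deletion_starts(ref, sample):
    """Return all 0-based start positions i such that deleting
    len(ref) - len(sample) bases from ref at i reproduces sample exactly."""
    k = len(ref) - len(sample)
    if k <= 0:
        raise ValueError("sample must be shorter than ref")
    return [
        i
        for i in range(len(ref) - k + 1)
        if ref[:i] + ref[i + k:] == sample
    ]


# Toy tandem repeat: four CA units in the reference, three in the sample.
repeat_ref = "TTG" + "CA" * 4 + "GTC"
repeat_sample = "TTG" + "CA" * 3 + "GTC"
print(equivalent_deletion_starts(repeat_ref, repeat_sample))
# [3, 4, 5, 6, 7, 8, 9] -> seven equally good placements

# Toy non-repetitive context: the same-size deletion has a single placement.
print(equivalent_deletion_starts("ACGTTGCA", "ACGGCA"))
# [3]
```

Inside the repeat, an aligner that scores each read on its own has no basis for preferring one of the seven placements, so different reads covering the same deletion can end up with different reported breakpoints. That inconsistency is the kind of ambiguity that considering the reads together is meant to resolve.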
MethBank 3.0: a database of DNA methylomes across a variety of species (Nucleic Acids Research)
MethBank (http://bigd.big.ac.cn/methbank) is a database that integrates high-quality DNA methylomes across a variety of species and provides an interactive browser for visualization of methylation data. Here, we present an updated implementation of MethBank (version 3.0) that incorporates more DNA methylomes from multiple species, adds enhanced functionality for data annotation, and offers friendlier web interfaces for data presentation, search and visualization. MethBank 3.0 features large-scale integration of high-quality methylomes, involving 34 consensus reference methylomes derived from a large number of human samples, 336 single-base resolution methylomes from different developmental stages and/or tissues of five plants, and 18 single-base resolution methylomes from gametes and early embryos at multiple stages of two animals. Additionally, the improved data annotation enables systematic identification of methylation sites closely associated with age, sites with constant methylation levels across different ages, differentially methylated promoters, age-specific differentially methylated cytosines/regions, and methylated CpG islands. Moreover, MethBank provides tools to estimate human methylation age online and to identify differentially methylated promoters. Taken together, MethBank is upgraded with significant improvements over the previous version, which should be of great help in deciphering DNA methylation regulatory mechanisms in epigenetic studies.
mirDIP 4.1—integrative database of human microRNA target predictions (Nucleic Acids Research)
MicroRNAs are important regulators of gene expression, acting by binding to the transcripts of the genes they regulate. Even with modern high-throughput technologies, it is laborious and expensive to detect all possible microRNA targets. For this reason, several computational microRNA–target prediction tools have been developed, each with its own strengths and limitations. Integration of different tools has been a successful approach to minimize the shortcomings of individual databases. Here, we present mirDIP v4.1, providing nearly 152 million human microRNA–target predictions, which were collected across 30 different resources. We also introduce an integrative score, which was statistically inferred from the obtained predictions, and was assigned to each unique microRNA–target interaction to provide a unified measure of confidence. We demonstrate that integrating predictions across multiple resources does not accumulate prediction bias toward biological processes or pathways. mirDIP v4.1 is freely available at http://ophid.utoronto.ca/mirDIP/.