How Transcription Factor Models Predict Universal Features of Regulatory DNA
Understanding how genomes evolve and the full scope of their function is a central research topic in molecular biology and genetics. Most functional DNA regulates gene expression rather than directly coding for proteins, with over 5% of the human genome predicted to be regulatory sequence. These regions include promoters and enhancers, which initiate transcription, the first step in generating proteins. Transcription factor (TF) binding sites mark these regulatory regions. TFs do not bind in isolation; instead, they bind together in one of two modes: in complexes called “enhanceosomes”, where the spacing and orientation of the TFs matter, or according to a billboard model, where spacing and orientation do not matter and only the collection of TFs does. However, precisely how TFs collaborate to recognize regulatory sequences across the genome is still unknown. The global properties that define regulatory regions are also poorly characterized and require further research, which aligns with the research aims of the Hughes lab.
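The difference between the two binding modes can be illustrated with a minimal Python sketch. It is only a toy illustration, not the method used in the study: the binding-site strings, the exact-match rule, and the function names are all hypothetical assumptions chosen for clarity.

```python
# Toy contrast between the billboard and enhanceosome modes.
# SITES are hypothetical binding-site strings, not motifs from the paper.
SITES = ["GATAAG", "TTGCAT"]

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))

def billboard_match(seq, sites):
    """True if every site occurs anywhere in seq, on either strand.
    Order, spacing, and orientation are ignored."""
    return all(s in seq or revcomp(s) in seq for s in sites)

def enhanceosome_match(seq, sites, gap):
    """True if the sites occur in the given order and orientation,
    separated by exactly `gap` bases."""
    template_len = sum(len(s) for s in sites) + gap * (len(sites) - 1)
    for start in range(len(seq) - template_len + 1):
        pos = start
        for s in sites:
            if seq[pos:pos + len(s)] != s:
                break
            pos += len(s) + gap
        else:
            return True  # every site matched at its fixed offset
    return False

# Same two sites, arranged with fixed spacing versus rearranged.
ordered = "GATAAG" + "AA" + "TTGCAT"
rearranged = "TTGCAT" + "CCCCC" + revcomp("GATAAG")

print(billboard_match(ordered, SITES), enhanceosome_match(ordered, SITES, gap=2))
print(billboard_match(rearranged, SITES), enhanceosome_match(rearranged, SITES, gap=2))
```

In this sketch, the rearranged sequence still satisfies the billboard rule because both sites are present, but it fails the enhanceosome rule because their order, spacing, and orientation have changed.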
Zain Patel, a MoGen graduate student currently enrolled in the Computational Biology in Molecular Genetics Ph.D. track and supervised by MoGen interim chair and PI Dr. Timothy Hughes, systematically tested whether four different computational models of regulatory elements, which represent different aspects of the two modes described above, are consistent with global features of regulatory sequences. These features are: 1) short and unique/diverse sequences, 2) 1-2% of the genome being active regulatory sequence in any one cell type, and 3) a high turnover or mutation rate compared to genes. Each of the four models classifies and differentiates regulatory sequences. Patel and Hughes developed two of them themselves; the other two were described in previous publications.
The two models developed and trained by the Hughes lab were the logistic regression (LR) model and the multimeric motif model. Both identify regulatory elements by taking advantage of the fact that TFs preferentially bind specific sequence patterns called motifs. A motif can be monomeric (a sequence containing the binding site of one TF) or dimeric or multimeric (a sequence containing binding sites for two or more TFs, with specific spacing between them). The group trained the LR model to identify open chromatin, meaning regions of the chromosome not wrapped around histone proteins. These regions are accessible to the DNase I enzyme, which cuts unwound DNA, and are called DNase I hypersensitive sites (DHS sites). They serve as markers of transcriptional or gene expression activity and, by proxy, of regulatory sites. In any specific cell type, the model distinguishes DHS sites from closed chromatin (DNA wound around histones), termed non-DHS sites. The multimeric model, on the other hand, represents the enhanceosome model of regulatory elements: it derives multimeric motifs computationally from experimentally determined dimeric motifs.
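The general idea behind a motif-based logistic regression classifier can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Hughes lab implementation: the two consensus motifs, the match-fraction scoring, the toy "open" and "closed" sequences, and the training settings are all hypothetical.

```python
import math

# Hypothetical consensus motifs for two made-up TFs (not from the paper).
MOTIFS = {"TF_A": "GATAAG", "TF_B": "CACGTG"}

def best_match(seq, consensus):
    """Best fraction of consensus positions matched by any window of seq."""
    k = len(consensus)
    best = 0
    for i in range(len(seq) - k + 1):
        hits = sum(a == b for a, b in zip(seq[i:i + k], consensus))
        best = max(best, hits)
    return best / k

def featurize(seq):
    """One feature per motif: its best match score in the sequence."""
    return [best_match(seq, m) for m in MOTIFS.values()]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(seq, w, b):
    """Predicted probability that seq is an open (DHS-like) region."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, featurize(seq))))

# Toy training data: label 1 = open chromatin, label 0 = closed chromatin.
OPEN = ["TTGATAAGCC", "ACGATAAGTA", "GATAAGTTTT"]
CLOSED = ["TTCACGTGAA", "AACACGTGTT", "CACGTGCCCC"]
X = [featurize(s) for s in OPEN + CLOSED]
y = [1] * len(OPEN) + [0] * len(CLOSED)

weights, bias = train(X, y)
for name, w in zip(MOTIFS, weights):
    print(name, round(w, 2))
```

In this toy setup, the motif found in the open sequences receives a positive weight and the motif found in the closed sequences receives a negative weight. A real model would use many more motifs, quantitative motif scores, and genome-scale training data.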
Overall, the study found that all four models, despite their differing parameters, reproduced the general global properties of regulatory sequences. “These properties were not too obvious before, but this paper makes it apparent that they are expected elements of regulatory elements”, Patel noted. Hughes added, “For example, there is an extensive literature that empirically describes rapid evolutionary turnover of regulatory DNA, which to many of us – including me – was initially very surprising. But, in fact, every model we tested predicts that this should occur. We probably should have expected it in advance”. The analysis also implies that multiple TFs, four on average, are needed to specify a regulatory site in the genome. It further suggests that a class of transcription factors called master regulators or pioneer factors reduces the number of TFs needed at a given regulatory site, which decreases complexity and makes the site easier to evolve through mutation. In addition, the sequences spanned by motifs for 1-6 TFs are typically less than 100 base pairs long. Moreover, positive weights in the LR model corresponded to TFs with known biological roles in the cell type the model was trained on, while negative weights corresponded to proteins that repress gene activity.
Pinpointing the remaining regulatory sites is left for future work, as hundreds of TFs still don’t have a defined motif. “There are 1639 genes that code for specific transcription factors, and we know the motifs for around 1200 of them,” Patel notes, “thus, we still don’t know the motifs of around 400 of them.” Mapping all these sites will give a more complete picture of how the genome and gene regulation work. Such research will also enable more accurate predictions of the impact of mutations that fall within established motifs, and it will provide future targets for CRISPR-Cas9 research. Patel also mentions that research into regulatory elements is valuable for understanding animal development, which depends heavily on proper gene regulation and expression, and diseases involving dysfunctional gene regulation. Additionally, insight from this field benefits evolutionary biology, especially for understanding how new species emerge and for explaining the differences between the genomes and gene expression of different species. Overall, this publication demonstrates the power of computational methods in modelling biological processes such as gene expression and regulation, and it sheds light on the biology of transcription factors and genome function. “I’ve always been a proponent of observation, but after this, I have to acknowledge that you can also learn things from models”, concludes Hughes. “Maybe next we’ll get into hypothesis testing”.
A big thank you to Zain Patel and Dr. Hughes for their help!
Read the article below:
Patel, Z.M., Hughes, T.R. Global properties of regulatory sequences are predicted by transcription factor recognition mechanisms. Genome Biol 22, 285 (2021). https://doi.org/10.1186/s13059-021-02503-y