An appropriate response of a biological system to a changing environment is a fundamental requirement for homeostasis in every living organism. The human genome encodes about 23,000 protein-coding genes, but not all of them are expressed under any given condition. Tissue-specific and condition-specific genes are presumed to be tightly regulated by the set of transcription factors (TFs) that are activated, out of approximately 2,600 TFs, in each mammalian cell type.
Gene expression is precisely regulated by combinatorial transcriptional machinery composed of transcription factors, chromatin-remodeling complexes, and recruited general transcription factors, including RNA polymerase II. This machinery responds to signaling pathways that are activated by cellular receptors sensing changes in cellular conditions. The binding of transcription factors to the promoters or enhancers they regulate is the initial step in transcriptional regulation. However, how a cell regulates gene expression under various conditions remains poorly understood. In this thesis, I present systematic methods for modeling the transcriptional regulatory mechanism of a given cell type-specific and condition-specific gene by integrating DNA motif detection with gene expression information.
To identify hidden regulatory codes in the genome, I address the modeling of transcriptional regulatory modules through transcription factor binding site (TFBS) prediction in two parts. The first part concerns finding direct target genes of signal transducer and activator of transcription 3 (STAT3) on the basis of its DNA sequence-dependent mode of regulation. I show that reconstructing the position weight matrices (PWMs) for STAT3 motifs and applying background correction with five background promoter sequence models both improve the statistical and probabilistic prediction of true positive STAT3 TFBSs. In addition, STAT TFBSs predicted from the reconstructed PWMs for the STAT family overlapped extensively with the genomic loci of known functional STAT3 TFBSs and were located in evolutionarily conserved regions of homologous proximal promoter sequences across five mammalian species. After incorporating these observed features of STAT3 TFBSs into the STAT-Finder program, I validated that the program predicts functional STAT3 TFBSs with improved performance (fewer false positives and more true positives). As an experimental validation of the prediction model, I identified eight novel STAT3 target genes among genes that are highly and commonly expressed in multiple cancer cells. The second part concerns whether the features of STAT3 TFBSs can serve as general criteria for extending the prediction to other TFs. I demonstrate the successful expansion of the STAT-Finder program into the TFBS-Screener program, which covers 368 available TFs, with supporting evidence from 51 TF ChIP-Seq (chromatin immunoprecipitation with massively parallel DNA sequencing) datasets used as a gold standard. Furthermore, the evaluated TFBS predictions from multiple ChIP-Seq datasets for 26 TFs can reveal cis-regulatory motif modules that may function coordinately at TF-bound ChIP loci in a given condition or cell type, without any prior assumptions.
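As a rough illustration of the PWM-based TFBS prediction with background correction discussed above, the following minimal sketch builds a log-odds PWM from aligned sites and scans a sequence with it. It is not the STAT-Finder implementation: the function names, the pseudocount, the toy sites, the test sequence, and the single GC-rich background model are all assumptions made for this example, and the five background promoter models of the thesis are not reproduced here.

```python
import math

BASES = "ACGT"

def build_pwm(sites, background, pseudocount=0.5):
    """Build a log-odds PWM from aligned, equal-length sites.

    Each entry is log2(observed base frequency / background frequency),
    so the choice of background model directly changes every score.
    """
    width = len(sites[0])
    pwm = []
    for pos in range(width):
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background[base])
                    for base in BASES})
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return a list of (offset, score)."""
    width = len(pwm)
    return [(i, sum(pwm[j][sequence[i + j]] for j in range(width)))
            for i in range(len(sequence) - width + 1)]

# Toy aligned sites (hypothetical, for illustration only).
toy_sites = ["TTCCAGGAA", "TTCCCGGAA", "TTCTGGGAA", "TTCCTGGAA"]

# One assumed GC-rich promoter background (hypothetical frequencies).
gc_rich_background = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}

pwm = build_pwm(toy_sites, gc_rich_background)
hits = scan("ACGTTTCCAGGAAGC", pwm)
best_offset, best_score = max(hits, key=lambda hit: hit[1])
```

In practice a score threshold, chosen against known sites and background sequences, would decide which windows are reported as candidate TFBSs.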
To interpret cellular context under various conditions, I discuss the integration of heterogeneous microarray data on cell type-specific or condition-specific gene expression and its application to inferring cellular context. The first part concerns the selection of significant genes that are specifically expressed in a given cell type from heterogeneous microarray data. I found that using the maximum intensity value among the normalized intensities of clustered Probe Set IDs with high-quality annotation for a target gene on Affymetrix platforms improved the efficiency of integration across different human and mouse microarray platforms. I also validated this integration method by successfully detecting known marker gene sets for embryonic stem cells, T cells, and B cells in heterogeneous microarray data covering various hematopoietic lineage cell types. The second and final part concerns the application of heterogeneous microarrays and TFBS prediction to screening putative transcription factors. I found that most known regulatory TFs for a given target gene were clearly discriminated from other TFs by a static calculation of the expression correlation between each TF and the target gene in the integrated microarrays across various conditions. Using IL6 as a model target gene for experimental validation, I also identified IRF7 as a novel TF that contributes to high levels of IL6 mRNA under combined stimulation with poly(I:C), a synthetic analog of viral dsRNA, and IL1-beta.
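The two steps described above, collapsing probe sets to one value per gene and ranking TFs by their expression correlation with a target gene, can be sketched as follows. This is a minimal sketch under stated assumptions, not the method as implemented in the thesis: the maximum is taken per sample across a gene's probe sets, Pearson correlation stands in for the correlation measure, and all probe IDs, gene labels, and intensity values are hypothetical.

```python
import math
from collections import defaultdict

def collapse_by_max(probe_rows, probe_to_gene):
    """Per gene, keep the per-sample maximum across its annotated probe sets."""
    rows_by_gene = defaultdict(list)
    for probe_id, values in probe_rows.items():
        gene = probe_to_gene.get(probe_id)
        if gene is not None:  # skip probe sets without annotation
            rows_by_gene[gene].append(values)
    return {gene: [max(column) for column in zip(*rows)]
            for gene, rows in rows_by_gene.items()}

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 for a constant vector."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def rank_tfs(expr, target, tfs):
    """Rank candidate TFs by correlation with the target gene, highest first."""
    scores = {tf: pearson(expr[tf], expr[target]) for tf in tfs if tf in expr}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical normalized intensities over four samples.
probe_rows = {
    "p1": [2.0, 4.0, 6.0, 8.0],
    "p2": [1.0, 5.0, 5.0, 9.0],
    "p3": [1.0, 2.0, 3.0, 4.0],
    "p4": [4.0, 3.0, 2.0, 1.0],
}
probe_to_gene = {"p1": "TARGET", "p2": "TARGET", "p3": "TF_A", "p4": "TF_B"}

expr = collapse_by_max(probe_rows, probe_to_gene)
ranking = rank_tfs(expr, "TARGET", ["TF_A", "TF_B"])
```

With these toy values, TF_A is ranked above TF_B because its profile rises with the target across samples while that of TF_B falls.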
In summary, I propose flexible, systematic modeling methods that can identify putative transcription factors acting at the initial step of the transcriptional regulatory mechanism for a given target gene under a given condition.