Discovering Combinatorial Gene Interactions in High-Dimensional Data
Over the past decade or so, new biotechnologies have led to an explosion in the availability of high-dimensional genomic data (microarrays, SNP data, RNA-Seq...). The dimension and noise levels of such data make it necessary to rely on machine learning techniques and statistical models to extract meaningful signal and narrow down the field for further experimental research. In this presentation, I will give a very general overview of some of the statistical techniques commonly used to treat high-dimensional data, followed by a more detailed illustration of the technique we developed to identify combinatorial interaction effects in such data.
A crucial aspect of machine learning on high-dimensional data is the concept of sparsity: how many of the input variables find their way into the model. Regularisation techniques (the addition of a tailored penalisation component to the objective) make it possible to enforce certain properties of the statistical model (small size, elimination of collinear variables…). Another crucial aspect is the fitting of complex statistical models that cannot be solved analytically, which usually requires optimising a non-linear objective function (e.g. to maximise likelihood or minimise empirical error). While relatively simple to apply, both techniques require some understanding of the underlying statistical assumptions and information-theoretic implications in order to obtain satisfactory results.
After presenting separate examples of these techniques and their use in common bioinformatics problems, I will introduce our recent work, which combines them with data-mining (itemset mining) and fractional programming techniques to fit complex statistical models over (non-linear) combinations of heterogeneous input variables. This makes it possible, for example, to identify sets of genes (up- or down-regulated) that drive complex phenotypes or clinical observations.
In particular, this work was successfully applied to a combination of cDNA microarray and gene mutation copy number data paired with (right-censored) survival data, to identify interactions (and potential synthetic lethals) playing a role in neuroblastoma and breast cancer.
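To make the link between regularisation and sparsity concrete, here is a minimal sketch of an L1-penalised (lasso) linear regression fitted by coordinate descent. It is not the method presented in the talk: it uses a plain squared-error objective rather than a survival model, the function names `soft_threshold` and `lasso_coordinate_descent` are hypothetical, and the data are synthetic.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero; anything within the threshold becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, penalty, n_iter=200):
    """Approximately minimise (1/2n) * ||y - X b||^2 + penalty * ||b||_1
    by cyclic coordinate descent (illustrative, unoptimised)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X b, since b starts at zero
    col_sq = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Covariance of column j with the residual that ignores variable j
            rho = sum(
                X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)
            ) / n
            new_beta = soft_threshold(rho, penalty) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


# Synthetic data (an assumption for illustration): 100 samples, 20 variables,
# of which only variables 0 and 3 influence the response.
rng = random.Random(0)
n_samples, n_features = 100, 20
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
y = [2.0 * row[0] - 3.0 * row[3] + rng.gauss(0.0, 0.1) for row in X]

sparse_beta = lasso_coordinate_descent(X, y, penalty=0.5)
support = [j for j, b in enumerate(sparse_beta) if b != 0.0]
print("variables kept by the penalised model:", support)
```

The soft-thresholding step is what sets coefficients exactly to zero, so the penalty weight directly controls how many input variables enter the model. Setting `penalty=0` reduces the procedure to unpenalised least squares, which will generally keep every variable.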
References:
D. duVerle, I. Takeuchi, Y. Murakami-Tonami, K. Kadomatsu, K. Tsuda. Discovering combinatorial interactions in survival data. Bioinformatics, 29(23):3053–3059, 2013.
R. Tibshirani. The lasso method for variable selection in the Cox model. Statistics in Medicine, 16(4):385–395, 1997.
D. Ghosh. Penalized discriminant methods for the classification of tumors from gene expression data. Biometrics, 59(4):992–1000, 2003.
T. Uno, T. Asai, Y. Uchida, H. Arimura. An efficient algorithm for enumerating closed patterns in transaction databases. In Discovery Science, pages 57–59. Springer, 2004.