Tissue heterogeneity is prevalent in gene expression studies

Gregor Sturm, Markus List, Jitao David Zhang

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Lack of reproducibility in gene expression studies is a serious issue being actively addressed by the biomedical research community. Besides established factors such as batch effects and incorrect sample annotations, we recently reported tissue heterogeneity, a consequence of unintended profiling of cells of other origins than the tissue of interest, as a source of variance. Although tissue heterogeneity exacerbates irreproducibility, its prevalence in gene expression data remains unknown. Here, we systematically analyse 2 667 publicly available gene expression datasets covering 76 576 samples. Using two independent data compendia and a reproducible, open-source software pipeline, we find a prevalence of tissue heterogeneity in gene expression data that affects between 1 and 40% of the samples, depending on the tissue type. We discover both cases of severe heterogeneity, which may be caused by mistakes in annotation or sample handling, and cases of moderate heterogeneity, which are likely caused by tissue infiltration or sample contamination. Our analysis establishes tissue heterogeneity as a widespread phenomenon in publicly available gene expression datasets, which constitutes an important source of variance that should not be ignored. Consequently, we advocate the application of quality-control methods such as BioQC to detect tissue heterogeneity prior to mining or analysing gene expression data.

Original language: English
Article number: lqab077
Journal: NAR Genomics and Bioinformatics
Volume: 3
Issue number: 3
State: Published - 1 Sep 2021
Externally published: Yes