TY - GEN
T1 - Assessing the significance of data mining results on graphs with feature vectors
AU - Günnemann, Stephan
AU - Dao, Phuong
AU - Jamali, Mohsen
AU - Ester, Martin
PY - 2012
Y1 - 2012
N2 - Assessing the significance of data mining results is an important step in the knowledge discovery process. While results might appear interesting at first glance, they can often be explained by already known characteristics of the data. Randomization is an established technique for significance testing, and methods to assess data mining results on vector data or network data have been proposed. In many applications, however, both sources are simultaneously given. Since these sources are rarely independent of each other but highly correlated, naively applying existing randomization methods to each source separately is questionable. In this work, we present a method to assess the significance of mining results on graphs with binary feature vectors. We propose a novel null model that preserves correlation information between both sources. Our randomization exploits adaptive Metropolis sampling and interweaves attribute randomization and graph randomization steps. In thorough experiments, we demonstrate the application of our technique. Our results indicate that while simultaneously using both sources is beneficial, often one source of information is dominant for determining the mining results.
AB - Assessing the significance of data mining results is an important step in the knowledge discovery process. While results might appear interesting at first glance, they can often be explained by already known characteristics of the data. Randomization is an established technique for significance testing, and methods to assess data mining results on vector data or network data have been proposed. In many applications, however, both sources are simultaneously given. Since these sources are rarely independent of each other but highly correlated, naively applying existing randomization methods to each source separately is questionable. In this work, we present a method to assess the significance of mining results on graphs with binary feature vectors. We propose a novel null model that preserves correlation information between both sources. Our randomization exploits adaptive Metropolis sampling and interweaves attribute randomization and graph randomization steps. In thorough experiments, we demonstrate the application of our technique. Our results indicate that while simultaneously using both sources is beneficial, often one source of information is dominant for determining the mining results.
UR - http://www.scopus.com/inward/record.url?scp=84874088593&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2012.70
DO - 10.1109/ICDM.2012.70
M3 - Conference contribution
AN - SCOPUS:84874088593
SN - 9780769549057
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 270
EP - 279
BT - Proceedings - 12th IEEE International Conference on Data Mining, ICDM 2012
T2 - 12th IEEE International Conference on Data Mining, ICDM 2012
Y2 - 10 December 2012 through 13 December 2012
ER -