TY - GEN
T1 - Learning better while sending less: Communication-efficient online semi-supervised learning in client-server settings
T2 - IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015
AU - Xiao, Han
AU - Lin, Shou-De
AU - Yeh, Mi-Yen
AU - Gibbons, Phillip B.
AU - Eckert, Claudia
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/12/2
Y1 - 2015/12/2
AB - We consider a novel distributed learning problem: A server receives potentially unlimited data from clients in a sequential manner, but only a small initial fraction of these data are labeled. Because communication bandwidth is expensive, each client is limited to sending the server only a small (high-priority) fraction of the unlabeled data it generates, and the server is limited in the number of prioritization hints it sends back to the client. The goal is for the server to learn a good model of all the client data from the labeled and unlabeled data it receives. This setting arises frequently in real-world applications and combines characteristics of online, semi-supervised, and active learning. Previous approaches, however, are not designed for the client-server setting and hold no promise of reducing communication costs. We present a novel framework for solving this learning problem in an effective and communication-efficient manner. On the server side, our solution combines two diverse learners working collaboratively, yet in distinct roles, on the partially labeled data stream: a compact, online graph-based semi-supervised learner predicts labels for the unlabeled data arriving from the clients, and samples from this model serve as ongoing training for a linear classifier. On the client side, our solution prioritizes data based on an active-learning metric that favors instances that are close to the classifier's decision hyperplane yet far from each other. To reduce communication, the server sends the classifier's weight vector to the client only periodically. Experimental results on real-world data sets show that this combination of techniques outperforms other approaches and, in particular, often outperforms communication-expensive approaches that send all the data to the server.
KW - big data
KW - distributed system
KW - online learning
KW - semi-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=84962822374&partnerID=8YFLogxK
U2 - 10.1109/DSAA.2015.7344833
DO - 10.1109/DSAA.2015.7344833
M3 - Conference contribution
AN - SCOPUS:84962822374
T3 - Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015
BT - Proceedings of the 2015 IEEE International Conference on Data Science and Advanced Analytics, DSAA 2015
A2 - Pasi, Gabriella
A2 - Kwok, James
A2 - Zaiane, Osmar
A2 - Gallinari, Patrick
A2 - Gaussier, Eric
A2 - Cao, Longbing
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 19 October 2015 through 21 October 2015
ER -