TY - GEN
T1 - WikiWho
T2 - 23rd International Conference on World Wide Web, WWW 2014
AU - Flöck, Fabian
AU - Acosta, Maribel
PY - 2014/4/7
Y1 - 2014/4/7
N2 - Revisioned text content is present in numerous collaboration platforms on the Web, most notably Wikis. To track authorship of text tokens in such systems has many potential applications; the identification of main authors for licensing reasons or tracing collaborative writing patterns over time, to name some. In this context, two main challenges arise. First, it is critical for such an authorship tracking system to be precise in its attributions, to be reliable for further processing. Second, it has to run efficiently even on very large datasets, such as Wikipedia. As a solution, we propose a graph-based model to represent revisioned content and an algorithm over this model that tackles both issues effectively. We describe the optimal implementation and design choices when tuning it to a Wiki environment. We further present a gold standard of 240 tokens from English Wikipedia articles annotated with their origin. This gold standard was created manually and confirmed by multiple independent users of a crowdsourcing platform. It is the first gold standard of this kind and quality and our solution achieves an average of 95% precision on this data set. We also perform a first-ever precision evaluation of the state-of-the-art algorithm for the task, exceeding it by over 10% on average. Our approach outperforms the execution time of the state-of-the-art by one order of magnitude, as we demonstrate on a sample of over 240 English-Wikipedia articles. We argue that the increased size of an optional materialization of our results by about 10% compared to the baseline is a favorable trade-off, given the large advantage in runtime performance. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
AB - Revisioned text content is present in numerous collaboration platforms on the Web, most notably Wikis. To track authorship of text tokens in such systems has many potential applications; the identification of main authors for licensing reasons or tracing collaborative writing patterns over time, to name some. In this context, two main challenges arise. First, it is critical for such an authorship tracking system to be precise in its attributions, to be reliable for further processing. Second, it has to run efficiently even on very large datasets, such as Wikipedia. As a solution, we propose a graph-based model to represent revisioned content and an algorithm over this model that tackles both issues effectively. We describe the optimal implementation and design choices when tuning it to a Wiki environment. We further present a gold standard of 240 tokens from English Wikipedia articles annotated with their origin. This gold standard was created manually and confirmed by multiple independent users of a crowdsourcing platform. It is the first gold standard of this kind and quality and our solution achieves an average of 95% precision on this data set. We also perform a first-ever precision evaluation of the state-of-the-art algorithm for the task, exceeding it by over 10% on average. Our approach outperforms the execution time of the state-of-the-art by one order of magnitude, as we demonstrate on a sample of over 240 English-Wikipedia articles. We argue that the increased size of an optional materialization of our results by about 10% compared to the baseline is a favorable trade-off, given the large advantage in runtime performance. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
KW - Authorship
KW - Collaborative writing
KW - Community-driven content creation
KW - Content modeling
KW - Online collaboration
KW - Version control
KW - Wikipedia
UR - http://www.scopus.com/inward/record.url?scp=84909594577&partnerID=8YFLogxK
U2 - 10.1145/2566486.2568026
DO - 10.1145/2566486.2568026
M3 - Conference contribution
AN - SCOPUS:84909594577
T3 - WWW 2014 - Proceedings of the 23rd International Conference on World Wide Web
SP - 843
EP - 853
BT - WWW 2014 - Proceedings of the 23rd International Conference on World Wide Web
PB - Association for Computing Machinery
Y2 - 7 April 2014 through 11 April 2014
ER -