TY - JOUR
T1 - Tampering with Twitter’s Sample API
AU - Pfeffer, Jürgen
AU - Mayer, Katja
AU - Morstatter, Fred
N1 - Publisher Copyright: © 2018, The Author(s).
PY - 2018/12/1
Y1 - 2018/12/1
AB - Social media data are widely analyzed in computational social science. Twitter, one of the largest social media platforms, is used in research, journalism, business, and government to analyze human behavior at scale. Twitter offers data via three different Application Programming Interfaces (APIs). One of these, Twitter’s Sample API, provides a freely available 1% and a costly 10% sample of all Tweets. These data are supposedly random samples of all platform activity. However, we demonstrate that, due to the nature of Twitter’s sampling mechanism, it is possible to deliberately influence these samples, the extent and content of any topic, and consequently to manipulate the analyses of researchers, journalists, and market and political analysts who trust these data sources. Our analysis also reveals that technical artifacts can accidentally skew Twitter’s samples. These samples should therefore not be regarded as random. Our findings illustrate the critical limitations and general issues of big data sampling, especially in the context of proprietary data and undisclosed details about data handling.
KW - Experiments
KW - Manipulation
KW - Sampling
KW - Twitter Data
UR - http://www.scopus.com/inward/record.url?scp=85058932808&partnerID=8YFLogxK
U2 - 10.1140/epjds/s13688-018-0178-0
DO - 10.1140/epjds/s13688-018-0178-0
M3 - Article
AN - SCOPUS:85058932808
SN - 2193-1127
VL - 7
JO - EPJ Data Science
JF - EPJ Data Science
IS - 1
M1 - 50
ER -