ByteDance, the company behind the popular video-sharing app TikTok, is facing scrutiny for its aggressive data collection practices. The company has launched a powerful web scraper named ‘Bytespider’, which is collecting data at a significantly higher rate than the scrapers run by other major tech firms.
Research by Kasada, a bot management company, and Dark Visitors, a group that monitors scraper bots, confirmed Bytespider’s activity. According to Kasada CEO Sam Crowther, Bytespider collects data 25 times faster than GPTBot, the crawler OpenAI uses for ChatGPT, and 3,000 times faster than Anthropic’s ClaudeBot. This aggressive data collection continues despite the looming threat of a U.S. ban on TikTok over national security concerns. President Joe Biden has signed legislation requiring ByteDance to sell TikTok or face a ban, citing these concerns.
Adding to the controversy, Bytespider is disregarding robots.txt, a file that site owners publish to tell crawlers which parts of a website they should not access. Compliance with robots.txt is voluntary, and ignoring it further highlights ByteDance’s aggressive approach to data collection.
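For readers unfamiliar with how robots.txt works, the following is a minimal sketch of the check a well-behaved crawler would be expected to run before fetching a page. It uses Python's standard `urllib.robotparser` module. The robots.txt rules, the `example.com` URLs, the `OtherBot` agent name, and the helper names `build_parser` and `is_allowed` are all hypothetical and chosen purely for illustration; they do not describe any real site or how Bytespider itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only: it blocks
# Bytespider from the whole site and every other crawler from /private/.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    parser = build_parser(EXAMPLE_ROBOTS_TXT)
    checks = [
        ("Bytespider", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/articles/1"),
        ("OtherBot", "https://example.com/private/report"),
    ]
    for agent, url in checks:
        verdict = "allowed" if is_allowed(parser, agent, url) else "disallowed"
        print(f"{agent} -> {url}: {verdict}")
```

Nothing in the protocol enforces the result of this check. A crawler that skips it, or ignores a "disallowed" answer, can still request the page, which is why site owners who want to block a scraper in practice often turn to bot management tools.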
The increased web scraping is linked to ByteDance’s efforts to develop a new large language model (LLM) to improve TikTok’s search capabilities. A recent update to TikTok’s search function allows real-time keyword searches for ads, potentially enhancing ad visibility. ByteDance has yet to respond to inquiries regarding Bytespider.
This aggressive web scraping by ByteDance follows a trend among major tech companies. In June, both OpenAI and Anthropic were accused of ignoring web scraping rules and bypassing the robots.txt protocol to gather data for AI model training. This practice sparked controversy, highlighting the tension between AI development and data privacy. In August, NVIDIA faced scrutiny for scraping videos from platforms like YouTube to train its AI models. This revelation raised concerns about content creators’ rights and the ethical implications of using publicly available data without explicit consent. Similarly, in September, Microsoft-owned LinkedIn was criticized for using user data for AI training without first updating its terms of service, particularly affecting users in the U.S.
The aggressive data collection practices of ByteDance and other tech companies highlight the growing tension between AI development and data privacy. As AI technology continues to evolve, concerns about ethical data usage and user consent are becoming increasingly important.