Join us at ICWSM on Tuesday, May 26, 2026, from 8 am to 12 pm in Room 6 at the USC Institute for Information Sciences (NOT the conference location) for a hands-on tutorial on best practices for TikTok data collection. Participants will learn about the limitations of TikTok data collection identified in current research, how to navigate these limitations in computational social science research, and how to collect TikTok data using three different tools.
This tutorial is organized by Gayoung Jeon, Cameron Moy, and Deen Freelon of the Politics, Identity, and Communication Lab (PICL) at the University of Pennsylvania Annenberg School for Communication.
Abstract
This 4-hour hands-on tutorial provides researchers with practical tools and frameworks for TikTok data collection in computational social science. Recent work that systematically tests three TikTok data collection techniques (currently under review at the 2026 ICWSM Conference) reveals that the choice of data collection method dramatically alters research results. Participants in our tutorial will learn how to use two web-scraping tools (Pyktok and Apify) as well as the official TikTok Research API. The tutorial will explore best practices for collecting data from three endpoints (Users, Hashtags, and Comments) using strategies identified through stress testing that 1) reduce algorithmic selection bias in data collection; 2) substitute or fill in missing data by combining multiple tools for a more complete dataset; and 3) improve collection efficiency by balancing resources and dataset size (including strategies to minimize resource waste). Lastly, we introduce a checklist for reporting data collection procedures and results to increase the transparency, replicability, and generalizability of TikTok research. By the end of the tutorial, researchers will be equipped with actionable methods for obtaining high-quality TikTok datasets and with decision-making criteria for optimizing collection parameters to answer empirical research questions about TikTok.
Tutorial Schedule
| Session | Presenter | Duration |
|---|---|---|
| 1. Introduction & Overview | Cameron Moy | 30 mins |
| 2. Official Research API Data Collection | Gayoung Jeon | 30 mins |
| 3. Pyktok Data Collection | Cameron Moy | 30 mins |
| 4. Apify Data Collection | Gayoung Jeon | 30 mins |
| Break | | 15 mins |
| 5. Combining Tools | Gayoung Jeon | 45 mins |
| 6. Methods & Limitations Reporting | Cameron Moy | 30 mins |
| 7. Closing Thoughts and Reflections | Cameron Moy | 30 mins |
Recommended Prerequisites
The tutorial welcomes participants of all backgrounds. Programming experience is not required, but familiarity with Python is recommended. To make the most of the workshop, we also recommend (but do not require) that participants 1) apply for access to TikTok's official Research API at least one month before the tutorial and 2) sign up for an Apify account, which offers $5 worth of free credits.
File Downloads
Organizers
Gayoung Jeon
Gayoung Jeon is a doctoral student at the University of Pennsylvania Annenberg School for Communication. Her research combines computational psycholinguistics and artificial intelligence (AI) to examine how AI technologies influence cognitive processes and scientific research. She currently studies the factors driving LLM misalignment that result in the generation of anti-democratic content. Her work has been published in IJOC, JITP, and Visual Communication (VisCom).
Cameron Moy
Cameron Moy is a doctoral student at the University of Pennsylvania Annenberg School for Communication. His research interests include social media, marginalized communities, and data access. Currently, he employs web scraping techniques to monitor algorithmic changes on TikTok amid changing US ownership. His work has appeared at top ACM venues, including CHI, FAccT, and DIS.
Deen Freelon
Deen Freelon (PhD) is the Allan Randall Freelon Sr. Professor and a Presidential Professor at the Annenberg School for Communication, where he directs the Politics, Identities, and Communication Lab (PICL). A widely recognized expert on digital politics and computational social science, he has authored or coauthored over 60 book chapters, funded reports, and articles in journals such as Nature, Science, and the Proceedings of the National Academy of Sciences. He was one of the first communication researchers to apply computational methods to social media data and has developed eight open-source research software packages.
Freelon is the main author of Pyktok, an open-source Python module for collecting video, text, and metadata from TikTok. Pyktok is designed primarily to serve scientific research, enabling data collection from hashtags, user profiles, comments, and "You may like" videos (also called related videos). The tool has been ported to R as "traktok," demonstrating its broad impact across the computational social science community. He also developed ReCal, a free online intercoder reliability service that has been running continuously since 2008 and has been used by tens of thousands of researchers worldwide.

