Conversational Analysis made easy using PRAW, PSAW and ConvoKit
This article is a continuation of my previous article on Unveiling Conversational Insights; check it out here:
A large chunk of research in Natural Language Processing (NLP) has been devoted to studying isolated posts and comments, which often lack the surrounding context necessary to fully comprehend the intended meaning and accurately interpret the nuances of a conversation.
However, since conversations form the very basis of human discourse, a growing number of researchers have turned to analyzing conversational data, which can accelerate progress in the field.
In this post, I will give a bird’s-eye view of the Python libraries that can be used to collect publicly available Reddit posts from any public subreddit, including their entire comment threads, and preprocess them so they can be stored as a ConvoKit corpus.
Reddit is a social media platform structured in sub-forums, or subreddits, each focused on a given topic. These subreddits are a rich source of conversational data and have been extensively utilized by researchers specializing in the field of Natural Language Understanding (NLU).
Here are several reasons why Reddit is advantageous for studying conversations and conducting NLP research:
- Diverse and Active Userbase: Reddit has a vast and diverse userbase, with millions of active users participating in discussions across numerous communities (subreddits). This diversity leads to a wide range of topics, perspectives, and language usage, providing ample opportunities to study various conversational patterns and linguistic phenomena.
- Subreddit Communities: Reddit’s subreddit system allows users to form communities around specific interests, creating focused conversational contexts. This feature enables researchers to study conversations within distinct domains, analyzing specialized language use, jargon, and community-specific norms. It also allows for comparative studies across different subreddits.
- Long and Threaded Conversations: Reddit conversations often involve multi-turn interactions with threaded discussions. This structure provides contextual information and enables the study of conversational coherence, discourse markers, conversational strategies, and other aspects of dialogue. Researchers can explore how conversations unfold over time and analyze the patterns of interaction within threads.
PRAW
PRAW is a Python package that simplifies the interaction with the Reddit API. It provides a convenient way to authenticate, make API requests, and access Reddit data.
With its object-oriented approach, developers can easily navigate through subreddits, submissions, and comments, and perform actions such as searching, posting, and retrieving content. It is highly customizable, supporting various authentication methods and allowing developers to tailor their interactions with the Reddit API.
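As a minimal sketch of this workflow, the snippet below authenticates and walks the comment trees of a few submissions. The credentials, user agent, and subreddit name are placeholders you would replace with your own (an app can be registered at reddit.com/prefs/apps):

```python
import praw

# Placeholders: substitute the credentials of your registered Reddit app
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder
    client_secret="YOUR_CLIENT_SECRET",  # placeholder
    user_agent="conversation-scraper by u/YOUR_USERNAME",  # placeholder
)

# Fetch the 5 hottest submissions from a public subreddit
for submission in reddit.subreddit("AskReddit").hot(limit=5):
    print(submission.title, submission.score)

    # Resolve "load more comments" placeholders, then flatten the comment tree
    submission.comments.replace_more(limit=0)
    for comment in submission.comments.list():
        print("\t", comment.author, ":", comment.body[:80])
```

Note that `replace_more(limit=0)` issues extra API requests to expand the full thread, which is exactly where the rate limits discussed below start to bite.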
However, it has several limitations, such as:
- Rate Limit: Reddit has strict API rate limits, and PRAW enforces these limits to prevent abuse. This means that there is a maximum number of requests you can make within a certain time period. If you exceed these limits, you may receive errors or be temporarily blocked from making further requests.
- Data Depth: PRAW provides access to the most recent posts and comments but may not retrieve historical data beyond a certain limit. If you require extensive historical data or want to access data from deleted posts or comments, you may face limitations with PRAW.
- Data Completeness: Due to rate limiting and other factors, PRAW may not be able to retrieve all available data from a subreddit or thread. There might be gaps or missing information in the extracted data.
- Performance and Scalability: When dealing with large amounts of data or heavy traffic, PRAW’s performance may be impacted. It might not be the most efficient solution for extracting data at scale or processing high-volume requests.
PSAW
PSAW (the Pushshift API Wrapper), on the other hand, offers some advantages over PRAW and ways around certain of its limitations when it comes to extracting Reddit data (a minimal usage sketch follows the list below):
- Extended Historical Data: Pushshift maintains an extensive archive of Reddit data, including posts and comments dating back several years. Unlike PRAW, which primarily focuses on recent data, Pushshift allows you to access and retrieve historical data beyond Reddit’s API limitations.
- Bulk Data Retrieval: Pushshift provides the ability to extract large amounts of data in bulk. Instead of making individual requests for each post or comment, Pushshift allows you to retrieve data in batches, which can significantly improve efficiency when dealing with large-scale data extraction.
- Greater Search Flexibility: Pushshift offers more advanced search capabilities compared to Reddit’s native API. It allows you to perform complex queries, filter by specific criteria (such as author, subreddit, time range, etc.), and extract data based on custom search parameters. This provides more control and precision in extracting the desired data.
- Reduced Rate Limiting Constraints: While Pushshift has its own rate limits, they are generally more generous compared to Reddit’s API limits. This allows for faster data extraction and reduces the chances of hitting rate limits, enabling you to retrieve larger amounts of data within a shorter time frame.
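Here is a minimal sketch following psaw’s documented usage; the subreddit, date, field list, and limit are arbitrary values chosen for illustration:

```python
import datetime as dt
from psaw import PushshiftAPI

api = PushshiftAPI()

# Epoch timestamp marking the start of the search window (arbitrary example date)
start_epoch = int(dt.datetime(2020, 1, 1).timestamp())

# Pull up to 500 r/AskReddit submissions posted after that date,
# keeping only the fields we care about
submissions = list(api.search_submissions(
    after=start_epoch,
    subreddit="AskReddit",
    filter=["id", "author", "title", "num_comments"],
    limit=500,
))

print(f"Retrieved {len(submissions)} submissions")
```

A matching `search_comments` generator works the same way, which makes it straightforward to reassemble full threads from historical data.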
ConvoKit
There exist some challenges associated with NLP in conversations, such as their temporal nature, the involvement of multiple speakers and text units, and the significance of speaker and utterance sequence. Dealing with these complexities on your own can quickly become a headache. Hence, a group of researchers at Cornell University created a Python module for conversational analysis called ConvoKit.
ConvoKit is a valuable framework designed to convert raw conversational data into a more manageable and analyzable format, facilitating easier manipulation and sharing with others. It offers a range of pre-implemented linguistic analyses, including context-independent linguistic coordination and politeness strategies. Additionally, ConvoKit provides access to various conversational corpora, such as an extensive Reddit collection spanning roughly 900K subreddits and the talk pages of Wikipedia editors, which can be downloaded and thoroughly analyzed, as in the sketch below.
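For instance, a minimal sketch of downloading one of ConvoKit’s hosted corpora (here, its small Reddit sample) looks like this:

```python
from convokit import Corpus, download

# Download ConvoKit's small sample of Reddit conversations and load it
corpus = Corpus(filename=download("reddit-corpus-small"))

# Prints counts of speakers, utterances, and conversations
corpus.print_summary_stats()
```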
The framework has two fundamental concepts: a corpus and a transformer. A corpus is a collection of conversations, and we can apply transformers to those conversations.
Every corpus has three main elements: speakers, conversations, and utterances. Speakers are the participants, and the things they say are called utterances.
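To make these elements concrete, here is a minimal sketch that assembles a tiny corpus by hand from two hypothetical speakers and a two-utterance exchange (all names and texts are invented for illustration):

```python
from convokit import Corpus, Speaker, Utterance

# Two hypothetical speakers
alice = Speaker(id="alice")
bob = Speaker(id="bob")

# A two-utterance exchange; by convention, an utterance's conversation_id
# is the id of the root utterance of its thread
utterances = [
    Utterance(id="u0", speaker=alice, conversation_id="u0", reply_to=None,
              text="What is the best way to get started with NLP?"),
    Utterance(id="u1", speaker=bob, conversation_id="u0", reply_to="u0",
              text="Start by collecting a good corpus of conversations."),
]

corpus = Corpus(utterances=utterances)
print(corpus.get_utterance("u1").speaker.id)  # -> bob
```

The `reply_to` links are what let ConvoKit reconstruct the threaded structure of a conversation, which maps naturally onto Reddit comment trees.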
Transformers are functions that take a corpus as input and return a modified corpus after applying some computation. In the upcoming article, I will practically demonstrate some of the transformers offered as part of this framework, such as linguistic coordination and politeness strategies; a small preview follows.
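As a preview, here is a minimal sketch of the transformer pattern, reusing the hand-built `corpus` from the sketch above. PolitenessStrategies needs dependency parses, so TextParser runs first (this assumes spaCy’s English model is installed):

```python
from convokit import TextParser, PolitenessStrategies

# TextParser adds the dependency parses that PolitenessStrategies relies on
parser = TextParser()
corpus = parser.transform(corpus)

ps = PolitenessStrategies()
corpus = ps.transform(corpus)

# Each utterance now carries its politeness annotations in metadata
print(corpus.get_utterance("u1").meta["politeness_strategies"])
```

Because every transformer consumes and returns a corpus, they can be chained one after another, which is what makes the framework composable.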
This article provided rudimentary information on PRAW and PSAW, which have been extensively used for engaging with live and historical Reddit conversations. It also introduced ConvoKit, in case you plan to work with dialogue-based data and haven’t used it before.
You can also use official APIs or wrappers to scrape your own data from any of your favorite sources, such as Twitter, Quora, or Sina Weibo, and then store and analyze it using the ConvoKit platform.
Thereafter, you can use this data for sentiment analysis, topic modeling, user profiling, network analysis, community engagement, or any related use case.
In the next article, we will utilize these libraries to develop a Reddit conversation scraper. So stay tuned 🙂