PushshiftAPI not returning all comments

I'm using the following code to acquire the comments for a given Reddit post. We only want the top/first-level comments, but that filter isn't implemented yet because we can't even get this basic code to return what we expect:

import pandas as pd
import datetime as dt
from pmaw import PushshiftAPI

comments = pd.DataFrame()
api = PushshiftAPI()
subreddit = "Conservative"
limit = 100000

# ids are loaded from another df in original code, but list of 3 here for simplicity
ids = ['ly98ob', 'lxku9i', 'lxzjv5']

# main loop
for id in ids:
    # get comments for this post using the link_id parameter
    new_comments = api.search_comments(subreddit=subreddit, link_id=id)
    # TROUBLE IS HERE^^-----------------------^^ returns only ~26 comments
    new_comments = pd.DataFrame(new_comments)

    # add new comments to the comments dataframe
    comments = pd.concat([comments, new_comments], sort=False, ignore_index=True)

# some additional prints and save to csv is also in the code
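
For the top-level filter mentioned above (not implemented in the snippet), one option, once the comments do come back, is to keep only rows whose parent_id points at the submission itself: on Reddit, a first-level comment's parent_id carries the t3_ (link) prefix, while replies carry t1_. A minimal sketch of that filter, assuming the returned records include a parent_id field:

# keep only top/first-level comments: their parent_id refers to the submission (t3_ prefix),
# whereas replies point at another comment (t1_ prefix)
top_level = comments[comments['parent_id'].str.startswith('t3_')]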

I checked out the solution from this Reddit Pushshift post, but even the direct API call https://api.pushshift.io/reddit/comment/search/?link_id=ly98ob does not return more than 25 comments.

I would expect api.search_comments(...) to return many more comments than the ~26 we get now. Is there anything (obvious) I'm missing, or an error in the code, that is preventing me from scraping all the comments for a given post ID?
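
One way to sanity-check the raw endpoint is to query it directly with requests and count the records it returns. A minimal sketch, with the caveat that the size parameter and the small default page size (around 25 results) are assumptions about Pushshift's paging behaviour, which has changed over time:

import requests

# ask the raw Pushshift comment endpoint for one submission's comments;
# without 'size' it returns only a small default page of results
resp = requests.get(
    "https://api.pushshift.io/reddit/comment/search/",
    params={"link_id": "ly98ob", "size": 100},
)
resp.raise_for_status()
print(len(resp.json()["data"]))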



Solution 1:

The search_comments and search_submission_comment_ids methods are unable to return any comments after Nov 26th, 2021, for some reason. Until that's resolved, here's a quick workaround I've implemented for my own use that blends PMAW (to get submissions by date) with PRAW (to get the comments for those submissions):

import datetime as dt
import pandas as pd
import praw
from pmaw import PushshiftAPI

# PRAW instance (fill in your own credentials) plus a PMAW client that uses it to enrich results
reddit = praw.Reddit(client_id='CLIENT_ID', client_secret='CLIENT_SECRET', user_agent='comment-scraper')
api_praw = PushshiftAPI(praw=reddit)

# example date window (epoch seconds) and subreddit
subreddit = "Conservative"
after = int(dt.datetime(2021, 3, 1).timestamp())
before = int(dt.datetime(2021, 3, 8).timestamp())

submissions = api_praw.search_submissions(subreddit=subreddit, before=before, after=after, limit=10)
sub_list = [sub for sub in submissions]
try:
    # columns of interest: ['subreddit', 'title', 'selftext', 'author', 'score', 'created_utc', 'id', 'num_comments', 'permalink', 'upvote_ratio']
    sub_df = pd.DataFrame(sub_list)
    sub_df['permalink'] = 'www.reddit.com' + sub_df['permalink']

    sub_ids = list(sub_df['id'])
    # fetch the full comment tree for each submission via PRAW
    comment_list = []
    for sub_id in sub_ids:
        submission = reddit.submission(sub_id)
        submission.comments.replace_more(limit=None)  # expand "load more comments" stubs
        for comment in submission.comments.list():
            comment_list.append(comment.__dict__)

    comments_df = pd.DataFrame(comment_list)
except KeyError:
    # an empty search result yields a DataFrame without a 'permalink' column
    comments_df = pd.DataFrame()
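
Since the question only needs top/first-level comments, one possible tweak to the loop above (a sketch, not part of the original workaround) is to iterate the comment forest itself instead of flattening it with .list(): after replace_more, iterating submission.comments yields only the top-level comments.

# collect only first-level comments: iterating the CommentForest directly
# (rather than calling .list()) yields just the top-level comments
top_level_comments = []
for sub_id in sub_ids:
    submission = reddit.submission(sub_id)
    submission.comments.replace_more(limit=None)  # fully expand "more comments" stubs
    for comment in submission.comments:
        top_level_comments.append(comment.__dict__)

top_level_df = pd.DataFrame(top_level_comments)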

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
