If no tweets in period X, continue to period Y with next_token

I am using Tweepy to extract tweets from the Twitter API v2. Note: I have Academic Research access.

My code loops over five lists: start_date, end_date, search_query, date, and username. More details below.

My code collects up to 20 tweets for each day in each list. However, if there are no tweets for a specific day, the code enters an infinite loop in which it keeps looking for a next_token without success.

If no tweets are found for a specific day, the code should move on to the next day, i.e. the next elements of start_date and end_date. How can that be done?

My code extracts tweets in the window between corresponding elements of start_date and end_date, i.e. 24 hours at a time; the two lists are iterated in parallel. Each of start_date and end_date contains three lists with 11 elements.

start_date = [['2022-02-06T00:00:00.000Z', '2022-02-07T00:00:00.000Z', '2022-02-08T00:00:00.000Z', '2022-02-09T00:00:00.000Z', '2022-02-10T00:00:00.000Z', '2022-02-11T00:00:00.000Z', '2022-02-12T00:00:00.000Z', '2022-02-13T00:00:00.000Z', '2022-02-14T00:00:00.000Z', '2022-02-15T00:00:00.000Z', '2022-02-16T00:00:00.000Z'], ['2022-01-28T00:00:00.000Z', '2022-01-29T00:00:00.000Z', '2022-01-30T00:00:00.000Z', '2022-01-31T00:00:00.000Z', '2022-02-01T00:00:00.000Z', '2022-02-02T00:00:00.000Z', '2022-02-03T00:00:00.000Z', '2022-02-04T00:00:00.000Z', '2022-02-05T00:00:00.000Z', '2022-02-06T00:00:00.000Z', '2022-02-07T00:00:00.000Z'], ['2022-01-28T00:00:00.000Z', '2022-01-29T00:00:00.000Z', '2022-01-30T00:00:00.000Z', '2022-01-31T00:00:00.000Z', '2022-02-01T00:00:00.000Z', '2022-02-02T00:00:00.000Z', '2022-02-03T00:00:00.000Z', '2022-02-04T00:00:00.000Z', '2022-02-05T00:00:00.000Z', '2022-02-06T00:00:00.000Z', '2022-02-07T00:00:00.000Z']]
end_date = [['2022-02-06T23:59:59.000Z', '2022-02-07T23:59:59.000Z', '2022-02-08T23:59:59.000Z', '2022-02-09T23:59:59.000Z', '2022-02-10T23:59:59.000Z', '2022-02-11T23:59:59.000Z', '2022-02-12T23:59:59.000Z', '2022-02-13T23:59:59.000Z', '2022-02-14T23:59:59.000Z', '2022-02-15T23:59:59.000Z', '2022-02-16T23:59:59.000Z'], ['2022-01-28T23:59:59.000Z', '2022-01-29T23:59:59.000Z', '2022-01-30T23:59:59.000Z', '2022-01-31T23:59:59.000Z', '2022-02-01T23:59:59.000Z', '2022-02-02T23:59:59.000Z', '2022-02-03T23:59:59.000Z', '2022-02-04T23:59:59.000Z', '2022-02-05T23:59:59.000Z', '2022-02-06T23:59:59.000Z', '2022-02-07T23:59:59.000Z'], ['2022-01-28T23:59:59.000Z', '2022-01-29T23:59:59.000Z', '2022-01-30T23:59:59.000Z', '2022-01-31T23:59:59.000Z', '2022-02-01T23:59:59.000Z', '2022-02-02T23:59:59.000Z', '2022-02-03T23:59:59.000Z', '2022-02-04T23:59:59.000Z', '2022-02-05T23:59:59.000Z', '2022-02-06T23:59:59.000Z', '2022-02-07T23:59:59.000Z']]
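
For reference, windows of this shape can also be generated programmatically instead of typed out by hand; below is a minimal sketch, where the start date and the 11-day length are just the example values from the first inner lists above.

from datetime import datetime, timedelta

def day_windows(first_day, n_days):
    # Build parallel lists of ISO-8601 start/end timestamps, one 24-hour window per day
    starts, ends = [], []
    day = datetime.strptime(first_day, "%Y-%m-%d")
    for _ in range(n_days):
        starts.append(day.strftime("%Y-%m-%dT00:00:00.000Z"))
        ends.append(day.strftime("%Y-%m-%dT23:59:59.000Z"))
        day += timedelta(days=1)
    return starts, ends

# e.g. the first inner lists above: 11 consecutive days starting 2022-02-06
starts, ends = day_windows("2022-02-06", 11)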

For each inner list in start_date and end_date there is a corresponding search_query. The two lists date and username are used to name the CSV files containing the extracted tweets.

search_query = ['(@brikeilarcnn OR "Brianna Keilar") -is:retweet', '(@brianstelter OR "Brian Stelter") -is:retweet', '(@Acosta OR "Jim Acosta") -is:retweet']
username = ['@brikeilarcnn', '@brianstelter', '@Acosta']
date = ['2022-02-11', '2022-02-02', '2022-02-02']

Apart from the lists above, my code below relies on three helper functions: create_url, connect_to_endpoint, and append_to_csv. All three work as intended, but I can provide their code if it is needed to answer my question.
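
For context, the two request helpers roughly follow the standard Twitter API v2 sample code for the full-archive search endpoint; this is a simplified sketch rather than my exact code.

import requests

def create_url(keyword, start_time, end_time, max_results):
    # Full-archive search endpoint (requires Academic Research access)
    search_url = "https://api.twitter.com/2/tweets/search/all"
    query_params = {
        "query": keyword,
        "start_time": start_time,
        "end_time": end_time,
        "max_results": max_results,
        "tweet.fields": "author_id,created_at",
    }
    return (search_url, query_params)

def connect_to_endpoint(url, headers, params, next_token=None):
    if next_token is not None:
        params["next_token"] = next_token  # pagination token from the previous response
    response = requests.get(url, headers=headers, params=params)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()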

import csv
from time import sleep

total_tweets = 0  # running total across all queries and days
# headers, max_results and the three helper functions are defined earlier in the script

for suffixes_1, suffixes_2, name, day, user_handle in zip(start_date, end_date, search_query,
                                                           date, username):
    for s1, s2 in zip(suffixes_1, suffixes_2):

        # Inputs
        count = 0  # Counting tweets per time period/journalist
        max_count = 20  # Max tweets per time period/journalist
        flag = True
        next_token = None

        # create csv files named after date
        csvFile = open(day + "_" + user_handle + ".csv", "a", newline="", encoding='utf-8')
        csvWriter = csv.writer(csvFile)

        # create headers for the four variables: author_id, created_at, id, and tweet
        csvWriter.writerow(
            ['author_id', 'created_at', 'id', 'tweet'])
        csvFile.close()

        # create url for tweet extraction based on for loop:
        # loop over queries (name), start_date (s1) and end_date (s2)
        url = create_url(name, s1, s2, max_results)
        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
        result_count = json_response['meta']['result_count']
        # Check if flag is true
        while flag:

            # Check if max_count reached
            if count >= max_count:
                break
            print("-------------------")
            print("Token: ", next_token) # The line that is # continuously printed when next_token = None

            if 'next_token' in json_response['meta']:
                #  Save the token to use for next call
                next_token = json_response['meta']['next_token']
                print("Next Token: ", next_token) 
                if result_count is not None and result_count > 0 and next_token is not None:
                    print("Start Date: ", s1, "Name of journalist:", user_handle)
                    append_to_csv(json_response, day + "_" + user_handle + ".csv")
                    count += result_count
                    total_tweets += result_count
                    print("Total # of Tweets added: ", total_tweets)
                    print("-------------------")

                    sleep(5) # sleep for 5 sec. to avoid flooding 

            # If no next token exists
            else:
                if result_count is not None and result_count > 0:
                    print("-------------------")
                    print("Start Date: ", s1, "Name of journalist:", user_handle)
                    append_to_csv(json_response, day + "_" + user_handle + ".csv")
                    count += result_count
                    total_tweets += result_count
                    print("Total # of Tweets added: ", total_tweets)
                    print("-------------------")

                    sleep(5) # sleep for 5 sec. to avoid flooding 

                    # Since this is the final request, turn flag to false to move to the next time period.
                    flag = False
                    next_token = None

                    sleep(5) # sleep for 5 sec. to avoid flooding 

print("Total number of results: ", total_tweets)  

Below is the output when the code reaches a combination of search_query (name), s1, and s2 that matches 0 tweets. The console keeps printing the following because no next_token can be found. Is it possible to skip ahead to the next combination of search_query, s1, and s2?

-------------------
Token:  None
-------------------
Token:  None
-------------------
Token:  None
-------------------
Token:  None
-------------------
Token:  None
-------------------
Token:  None
-------------------
Token:  None
-------------------
Token:  None
-------------------
Token:  None
-------------------

Quick fix: Remove any search_query, i.e. journalist, for which there are days with no tweets causing Token: None. This 'solves' the issue but would likely bias the dataset.
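
A less drastic alternative would be to end the inner while loop as soon as a response comes back without a next_token, which covers the empty-day case, so the outer loops simply move on to the next s1/s2. Below is a rough, untested sketch of that inner loop, reusing the variables and helpers from the code above.

# Paginate one (search_query, day) combination; stop cleanly when the day is exhausted or empty
next_token = None
count = 0
while count < max_count:
    url = create_url(name, s1, s2, max_results)
    json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
    result_count = json_response['meta']['result_count']

    if result_count > 0:
        append_to_csv(json_response, day + "_" + user_handle + ".csv")
        count += result_count
        total_tweets += result_count
        sleep(5)  # stay well clear of the rate limit

    # No next_token in the response (including an empty day) means this s1/s2 is done
    next_token = json_response['meta'].get('next_token')
    if next_token is None:
        break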

NOTE: My real data contains 138 search_query entries; for simplicity, I have reduced them to three here.


