If no tweets in period X, continue to period Y with next_token
I am using Tweepy to extract tweets from the Twitter API v2. Note: I have Academic Research access.
My code loops in parallel over five lists: start_date, end_date, search_query, date, and username (described in more detail below).
My code collects up to 20 tweets for each day in each list. However, if there are no tweets for a specific day, the code goes into an infinite loop in which it keeps trying to find the next_token without success.
If no tweets are found for a specific day, the code should instead move on to the subsequent day, i.e. the next elements of start_date and end_date. How can that be done?
My code extracts tweets posted between corresponding elements of start_date and end_date, i.e. within 24-hour windows. The two lists run in a parallel loop. Each of start_date and end_date contains three lists with 11 elements.
start_date = [['2022-02-06T00:00:00.000Z', '2022-02-07T00:00:00.000Z', '2022-02-08T00:00:00.000Z', '2022-02-09T00:00:00.000Z', '2022-02-10T00:00:00.000Z', '2022-02-11T00:00:00.000Z', '2022-02-12T00:00:00.000Z', '2022-02-13T00:00:00.000Z', '2022-02-14T00:00:00.000Z', '2022-02-15T00:00:00.000Z', '2022-02-16T00:00:00.000Z'], ['2022-01-28T00:00:00.000Z', '2022-01-29T00:00:00.000Z', '2022-01-30T00:00:00.000Z', '2022-01-31T00:00:00.000Z', '2022-02-01T00:00:00.000Z', '2022-02-02T00:00:00.000Z', '2022-02-03T00:00:00.000Z', '2022-02-04T00:00:00.000Z', '2022-02-05T00:00:00.000Z', '2022-02-06T00:00:00.000Z', '2022-02-07T00:00:00.000Z'], ['2022-01-28T00:00:00.000Z', '2022-01-29T00:00:00.000Z', '2022-01-30T00:00:00.000Z', '2022-01-31T00:00:00.000Z', '2022-02-01T00:00:00.000Z', '2022-02-02T00:00:00.000Z', '2022-02-03T00:00:00.000Z', '2022-02-04T00:00:00.000Z', '2022-02-05T00:00:00.000Z', '2022-02-06T00:00:00.000Z', '2022-02-07T00:00:00.000Z']]
end_date = [['2022-02-06T23:59:59.000Z', '2022-02-07T23:59:59.000Z', '2022-02-08T23:59:59.000Z', '2022-02-09T23:59:59.000Z', '2022-02-10T23:59:59.000Z', '2022-02-11T23:59:59.000Z', '2022-02-12T23:59:59.000Z', '2022-02-13T23:59:59.000Z', '2022-02-14T23:59:59.000Z', '2022-02-15T23:59:59.000Z', '2022-02-16T23:59:59.000Z'], ['2022-01-28T23:59:59.000Z', '2022-01-29T23:59:59.000Z', '2022-01-30T23:59:59.000Z', '2022-01-31T23:59:59.000Z', '2022-02-01T23:59:59.000Z', '2022-02-02T23:59:59.000Z', '2022-02-03T23:59:59.000Z', '2022-02-04T23:59:59.000Z', '2022-02-05T23:59:59.000Z', '2022-02-06T23:59:59.000Z', '2022-02-07T23:59:59.000Z'], ['2022-01-28T23:59:59.000Z', '2022-01-29T23:59:59.000Z', '2022-01-30T23:59:59.000Z', '2022-01-31T23:59:59.000Z', '2022-02-01T23:59:59.000Z', '2022-02-02T23:59:59.000Z', '2022-02-03T23:59:59.000Z', '2022-02-04T23:59:59.000Z', '2022-02-05T23:59:59.000Z', '2022-02-06T23:59:59.000Z', '2022-02-07T23:59:59.000Z']]
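For reference, 24-hour windows like these can be generated with datetime instead of being written out by hand. The sketch below is only an illustration of how the lists above are structured, not the code I actually used:

```python
from datetime import datetime, timedelta

def make_windows(first_day, n_days):
    """Build parallel lists of ISO-8601 start/end timestamps, one 24-hour window per day."""
    starts, ends = [], []
    day = datetime.strptime(first_day, "%Y-%m-%d")
    for _ in range(n_days):
        starts.append(day.strftime("%Y-%m-%dT00:00:00.000Z"))
        ends.append(day.strftime("%Y-%m-%dT23:59:59.000Z"))
        day += timedelta(days=1)
    return starts, ends

# Example: reproduces the first sub-lists of start_date / end_date above.
s, e = make_windows("2022-02-06", 11)
```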
For each sub-list in start_date and end_date there is a corresponding search_query. The two lists date and username are used to name the CSV files containing the extracted tweets.
search_query = ['(@brikeilarcnn OR "Brianna Keilar") -is:retweet', '(@brianstelter OR "Brian Stelter") -is:retweet', '(@Acosta OR "Jim Acosta") -is:retweet']
username = ['@brikeilarcnn', '@brianstelter', '@Acosta']
date = ['2022-02-11', '2022-02-02', '2022-02-02']
Apart from the lists above, my code below relies on three important helper functions: create_url, connect_to_endpoint, and append_to_csv. All three work as intended; if their code is needed to answer my question, I can provide it.
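Roughly, the helpers follow the pattern of Twitter's v2 sample code. The following is a simplified sketch of them, not my exact implementation (the endpoint URL and the requested fields are assumptions here):

```python
import requests

search_url = "https://api.twitter.com/2/tweets/search/all"  # full-archive search (Academic access)

def create_url(keyword, start_time, end_time, max_results=20):
    # Return the endpoint URL and the query parameters as a pair,
    # matching the url[0] / url[1] usage in the main loop below.
    query_params = {'query': keyword,
                    'start_time': start_time,
                    'end_time': end_time,
                    'max_results': max_results,
                    'tweet.fields': 'author_id,created_at'}
    return (search_url, query_params)

def connect_to_endpoint(url, headers, params, next_token=None):
    # Add the pagination token (if any) and return the parsed JSON response.
    params['next_token'] = next_token  # requests drops parameters whose value is None
    response = requests.get(url, headers=headers, params=params)
    if response.status_code != 200:
        raise Exception(response.status_code, response.text)
    return response.json()
```

My main extraction loop is below: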
```python
import csv
from time import sleep

# headers (with the bearer token), max_results, total_tweets, create_url,
# connect_to_endpoint and append_to_csv are defined earlier and omitted here.

for suffixes_1, suffixes_2, name, day, user_handle in zip(start_date, end_date, search_query,
                                                          date, username):
    for s1, s2 in zip(suffixes_1, suffixes_2):
        # Inputs
        count = 0       # Counting tweets per time period/journalist
        max_count = 20  # Max tweets per time period/journalist
        flag = True
        next_token = None

        # Create a CSV file named after the date and the journalist's handle
        csvFile = open(day + "_" + user_handle + ".csv", "a", newline="", encoding='utf-8')
        csvWriter = csv.writer(csvFile)
        # Create headers for the four variables: author_id, created_at, id, and tweet
        csvWriter.writerow(['author_id', 'created_at', 'id', 'tweet'])
        csvFile.close()

        # Create the URL for tweet extraction from the current combination of
        # query (name), start date (s1) and end date (s2)
        url = create_url(name, s1, s2, max_results)
        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
        result_count = json_response['meta']['result_count']

        # Check if flag is true
        while flag:
            # Check if max_count reached
            if count >= max_count:
                break
            print("-------------------")
            print("Token: ", next_token)  # the line that is printed continuously when next_token stays None
            if 'next_token' in json_response['meta']:
                # Save the token to use for the next call
                next_token = json_response['meta']['next_token']
                print("Next Token: ", next_token)
                if result_count is not None and result_count > 0 and next_token is not None:
                    print("Start Date: ", s1, "Name of journalist:", user_handle)
                    append_to_csv(json_response, day + "_" + user_handle + ".csv")
                    count += result_count
                    total_tweets += result_count
                    print("Total # of Tweets added: ", total_tweets)
                    print("-------------------")
                    sleep(5)  # sleep for 5 sec. to avoid flooding
                # If no next token exists
                else:
                    if result_count is not None and result_count > 0:
                        print("-------------------")
                        print("Start Date: ", s1, "Name of journalist:", user_handle)
                        append_to_csv(json_response, day + "_" + user_handle + ".csv")
                        count += result_count
                        total_tweets += result_count
                        print("Total # of Tweets added: ", total_tweets)
                        print("-------------------")
                        sleep(5)  # sleep for 5 sec. to avoid flooding
                    # Since this is the final request, turn flag to false to move to the next time period.
                    flag = False
                    next_token = None
            sleep(5)  # sleep for 5 sec. to avoid flooding

print("Total number of results: ", total_tweets)
```
Below is the output produced when the code reaches a search_query (name), start date (s1), and end date (s2) that match 0 tweets. The console keeps printing the following because it cannot find a next_token. Is it possible to skip to the next combination of search_query, s1, and s2?
```
-------------------
Token: None
-------------------
Token: None
-------------------
Token: None
-------------------
Token: None
-------------------
Token: None
-------------------
Token: None
-------------------
Token: None
-------------------
Token: None
-------------------
Token: None
-------------------
```
Quick fix: remove any search_query, i.e. any journalist, for which there are days with no tweets that cause Token: None. This will 'solve' the issue, though it may bias the dataset.
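What I would prefer is a guard like the one sketched below: check result_count right after the first request and move on to the next day instead of looping. This is an untested sketch that reuses headers, max_results, total_tweets and the helper functions above; I am not sure it is the right way to integrate this into my loop:

```python
from time import sleep

# Untested sketch: skip a day as soon as the response reports zero results.
for suffixes_1, suffixes_2, name, day, user_handle in zip(start_date, end_date,
                                                          search_query, date, username):
    for s1, s2 in zip(suffixes_1, suffixes_2):
        count = 0
        max_count = 20
        next_token = None

        url = create_url(name, s1, s2, max_results)
        json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
        result_count = json_response['meta']['result_count']

        if result_count == 0:
            # Nothing posted in this 24-hour window: move on to the next day.
            print("No tweets for", s1, "to", s2, "-", user_handle)
            continue

        # Tweets exist: append them and paginate until max_count is reached.
        while count < max_count and result_count > 0:
            append_to_csv(json_response, day + "_" + user_handle + ".csv")
            count += result_count
            total_tweets += result_count
            next_token = json_response['meta'].get('next_token')
            if next_token is None:
                break  # last page for this window
            sleep(5)  # avoid flooding the API
            json_response = connect_to_endpoint(url[0], headers, url[1], next_token)
            result_count = json_response['meta']['result_count']
```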
NOTE: my real data contains 138 search queries. However, for simplicity, I have reduced it to three here.
Source: Stack Overflow (content licensed under CC BY-SA 3.0).