'Reduce API call from gspread library

I'm currently trying to get data using gspread API from a drive folder containing about 50 excel files, each containing about 10 sheets (about 500 sheets in total).

I want to get 3 specific columns for all files in all sheets and append it into a dataframe.

I got this code working :

scope = ["https://spreadsheets.google.com/feeds",'https://www.googleapis.com/auth/spreadsheets',"https://www.googleapis.com/auth/drive.file","https://www.googleapis.com/auth/drive"]
creds = ServiceAccountCredentials.from_json_keyfile_name("gspread/service_account.json",scope)
client = gs.authorize(creds)

file_list = client.list_spreadsheet_files()
file_list = list(map(itemgetter('name'),file_list))

for files in file_list:
        file = client.open(files)
        worksheet_list = file.worksheets()
        print(files)
        for sheet in worksheet_list[1:]:
            print(sheet)
            set_with_dataframe(sheet, df)
            df = get_as_dataframe(sheet, parse_dates=True, usecols=[5, 7, 9], skiprows=1, header=None)
            df.drop(df.index[10:100], axis=0, inplace=True)
            print('Ajout de : ',df)
            df_final = df_final.append(df)
df_final.fillna('', inplace=True)
df_final.reset_index(drop=True, inplace=True)
print(df_final)

The thing is I always get an error 429 (too much API calls) and I can't figured out how to reduced the number of calls. Even when using get_all_values() or get_all_records() functions it has to loops through all sheets.

Even when making a time.sleep(30) for each files.

If I'm not wrong I have about 22 calls for each files of 10 sheets (file = client.open(files) and worksheet_list = file.worksheets()) and 2 for each sheets ( set_with_dataframe() and get_as_dataframe()

I could make a time.sleep() for each worksheets instead but it would take very long (500+ sheets).

I could also change the pause time depending on the number of sheets (some files have more).

My questions are : 1) Is there function that could get all data from all sheets instead of having to loop through all sheets? 2) If not, is there a solution to reduce the calls without having to use pauses...?

If yes, it would be a solution and would reduce drastically the calls

Thanks in advance and nice evening, Alex



Solution 1:[1]

to answer your questions:

  1. There will be soon in gspread a method they returns all the data from all the sheets. See this issue here

  2. As of today the above solution is not available so in order to pull all data from a spreadsheet you can build your own request and use gspread to handle that request for you. Have a look at the library source code in the Worksheet class in the method that return some values (like get_range or get or range). You can easily build your request and request the full spreadsheet values.

  3. on the side, if you wish to reduce API calls you can do so by opening your spreadsheet files using their ID instead of listing all the files then looping over them. You will reduce the initial API call to the Drive API to list the files. The method list_spreadsheet_files is costly because the response is paginated, so it will make as many calls to get all the pages (i don't know the size of a page). Then the method open is costly too because it makes a new call to list_spreadsheet_files then loops over the result to open the right file using it's ID. I would definitely start on that part of your code to reduce API calls.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Dharman