'Google spreadsheet to Pandas dataframe via Pydrive without download

How do I read the content of a Google spreadsheet into a Pandas dataframe without downloading the file?


I think gspread or df2gspread may be good shots, but I've been working with pydrive so far and got close to the solution.

With Pydrive I managed to get the export link of my spreadsheet, either as .csv or .xlsx file. After the authentication process, this looks like


    gauth = GoogleAuth()
    gauth.LocalWebserverAuth()
    drive = GoogleDrive(gauth)
    
    # choose whether to export csv or xlsx
    data_type = 'csv'
    
    # get list of files in folder as dictionaries
    file_list = drive.ListFile({'q': "'my-folder-ID' in parents and 
    trashed=false"}).GetList()
    
    export_key = 'exportLinks'
    
    excel_key = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
    csv_key = 'text/csv'
    
    if data_type == 'excel':
        urls = [ file[export_key][excel_key] for file in file_list ]
    
    elif data_type == 'csv':
        urls = [ file[export_key][csv_key] for file in file_list ]

The type of url I get for xlsx is

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=xlsx

and similarly for csv

https://docs.google.com/spreadsheets/export?id=my-id&exportFormat=csv

Now, if I click on these links (or visit them with webbrowser.open(url)), I download the file, that I can then normally read into a Pandas dataframe with pandas.read_excel() or pandas.read_csv(), as described here.

How can I skip the download, and directly read the file into a dataframe from these links?

I tried several solutions:

    pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

Interestingly these numbers (1, 6, 2) do not depend on the number of rows and columns in my spreadsheet, hinting that the script is trying to read not what it is intended to.

  • The analogue pd.read_excel(url) gives
    ValueError: Excel file format cannot be determined, you must specify an engine manually.

and specifying e.g. engine = 'openpyxl' gives

zipfile.BadZipFile: File is not a zip file
  • BytesIO solution looked promising, but

    r = requests.get(url)
    data = r.content
    df = pd.read_csv(BytesIO(data))

still gives


    pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 6, saw 2

If I print(data) I get hundreds of lines of html code


    b'\n<!DOCTYPE html>\n<html lang="de">\n  <head>\n  <meta charset="utf-8">\n  <meta content="width=300, initial-scale=1" name="viewport">\n 
    ...
    ...
     </script>\n  </body>\n</html>\n'



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source