How to send an HTML string from Cloud Functions to BigQuery using Pub/Sub and Dataflow?

Overview

Setup: I have a Google Cloud Function which is triggered by a message to a Pub/Sub topic. I have set up a Cloud Scheduler job to send a message to this Pub/Sub topic every minute. My Cloud Function is triggered and publishes a message to another Pub/Sub topic. I have a Dataflow job set up which streams data from this second Pub/Sub topic to BigQuery.

The problem: if the message my Cloud Function publishes contains a simple string, it makes it into BigQuery. If the message is the HTML code of a scraped website, the result does not show up in BigQuery, and I don't know where it gets lost.


Detailed Walkthrough

The trigger

My function, function-3, is triggered by messages on the topic called simple:

[screenshot: Cloud Function trigger configuration]

Here is my Cloud Scheduler job:

[screenshot: Cloud Scheduler job]

It sends a string to the simple topic every minute (as indicated by the * * * * * cron schedule).


The function

The source of my function has two files: main.py and requirements.txt.

main.py fetches https://www.bbc.com/ and gets its HTML source as a string using requests and BeautifulSoup (bs4). Then it publishes two strings to the topic scrape: "publish_this" and the string version of the BBC website's HTML source. Code for main.py:

def hello_pubsub(event, context):

    import re
    import json
    import base64
    import requests
    import bs4 as bs
    from google.cloud import pubsub_v1

    def publish(message):
        project_id = "adventdalen"
        topic_name = "scrape"
        publisher = pubsub_v1.PublisherClient()
        topic_path = publisher.topic_path(project_id, topic_name)
        future = publisher.publish(
            topic_path, data=message.encode('utf-8')
        )


    url = "https://www.bbc.com/"
    page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'})
    bspage = bs.BeautifulSoup(page.text, 'html.parser')
    bspage = str(bspage)

    publish(json.dumps({"html": "publish_this"}))     # <- this makes it into BigQuery
    publish(json.dumps({"html": re.escape(bspage)}))  # <- this does not
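Both payloads are valid JSON, but it is worth seeing locally what the second `publish` actually puts on the wire: `re.escape` backslash-escapes regex metacharacters and whitespace, so the escaped payload is noticeably larger than the raw HTML and littered with literal backslashes. A quick local sketch (using a small hypothetical HTML snippet standing in for the scraped BBC source):

```python
import json
import re

# Hypothetical stand-in for the scraped BBC HTML.
html = '<html>\n  <body class="top">News &amp; sport</body>\n</html>'

plain = json.dumps({"html": html})
escaped = json.dumps({"html": re.escape(html)})

# re.escape inserts a literal backslash before whitespace and
# regex metacharacters, inflating the payload considerably.
print(len(plain), len(escaped))

# The backslashes survive the JSON round trip: the decoded string
# is the escaped HTML, not the original page source.
print("\\" in json.loads(escaped)["html"])
```

This at least shows the message leaving the function is well-formed JSON either way, which points the investigation downstream, toward how Dataflow parses it.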

requirements.txt is:

# Function dependencies, for example:
# package>=version
google-cloud-pubsub
requests
bs4

The Dataflow job

I have a Dataflow job called ps-to-bq-scrape:

[screenshot: Dataflow job ps-to-bq-scrape]

The target of this job is the BigQuery table adventdalen:scrape.scrape (highlighted in the screenshot above, on the right), with scrape as the inputTopic (four rows above the highlight).
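For context: the Google-provided Pub/Sub-to-BigQuery streaming template parses each incoming message as a single UTF-8 JSON object whose keys match the destination table's column names; messages that fail that parse are not written to the main table (the template routes such failures to a dead-letter side table). A rough local sketch of that expectation, assuming the table has a single STRING column named html:

```python
import json

def looks_ingestible(payload: bytes, columns=frozenset({"html"})) -> bool:
    """Rough local check mimicking what the Pub/Sub-to-BigQuery template
    expects: one UTF-8 JSON object per message, with keys drawn from the
    table's column names (assumed here to be a single column, "html")."""
    try:
        obj = json.loads(payload.decode("utf-8"))
    except (ValueError, UnicodeDecodeError):
        return False
    return isinstance(obj, dict) and set(obj) <= columns

print(looks_ingestible(b'{"html": "publish_this"}'))  # a well-formed row
print(looks_ingestible(b'<html>not json</html>'))     # rejected by the parse
```

If a message fails this parse in the real pipeline, it should show up in the template's error-records table (by default named after the output table with an _error_records suffix, if I recall the template defaults correctly), which is a good first place to look for the missing rows.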

The BigQuery table

In the BigQuery table, I expect to find rows equal to "publish_this" and rows containing the HTML source of the BBC website. Instead, I find this:

[screenshot: BigQuery table preview showing only publish_this rows]

Only the publish_this rows appear. To make sure I am not deceiving myself by looking only at the "Preview", I query for every row not equal to publish_this:

[screenshot: query results, empty]

and I get no results. The BBC source code got lost somewhere.


Question

I believe something is wrong with main.py above. How do I modify main.py so that not only the text "publish_this" but also the source HTML makes it into a BigQuery row?

(It would also be useful to know if something is wrong with the setup rather than with main.py; I believe this is unlikely, though, and that the issue can be solved by fixing main.py.)



Solution 1:[1]

One possibility is that the URL is not being parsed correctly into a proper string. I would recommend using a URI-parsing library and then converting the result to a string.

You can use the yarl library to parse complex URIs.

You can see this example:

>>> from yarl import URL 
>>> url = URL('https://www.python.org/~guido?arg=1#frag') 
>>> url 
URL('https://www.python.org/~guido?arg=1#frag')

You can also use the urllib.parse.unquote() function, which decodes percent-encoded data to UTF-8 bytes and then to text.

You can see this example:

>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
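To see that unquote is lossless, note that it is the inverse of urllib.parse.quote, so a round trip through percent-encoding preserves arbitrary text (a small sketch using the same Cyrillic sample):

```python
from urllib.parse import quote, unquote

s = "правовая защита"  # Cyrillic sample, as in the example above
enc = quote(s)         # percent-encodes the UTF-8 bytes into plain ASCII
assert enc.isascii()
assert unquote(enc) == s  # decoding restores the original text exactly
print(enc)
```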

If parsing turns out not to be the problem, could you share the logs of the Dataflow job, so we can see whether there is a problem with the input or the output?

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Jose Gutierrez Paliza