How to send HTML string from Cloud Functions to BigQuery using Pub/Sub and Dataflow?
Overview
Setup: I have a Google Cloud Function, which is triggered by a message to a Pub/Sub topic. I have set up a Cloud Scheduler job to send a message to this Pub/Sub topic every minute. My cloud function is triggered and sends a message to another Pub/Sub topic. I have a Dataflow job set up, which streams data from this second Pub/Sub topic to BigQuery.
The problem: if the message my cloud function publishes contains a simple string, it makes it into BigQuery. If the message is the HTML code of a scraped website, the result does not show up in BigQuery, and I don't know where it gets lost.
Detailed Walkthrough
The trigger
My function, function-3, is triggered by messages published to the topic called simple:
Here is my Cloud Scheduler job:
It sends a string to the simple topic every minute (as indicated by the * * * * * cron schedule).
The function
The source of my function has two files: main.py and requirements.txt.
main.py fetches https://www.bbc.com/ and gets its HTML source as a string using requests and BeautifulSoup (imported as bs). Then it publishes two strings to the topic scrape: "publish_this" and the string version of the BBC website HTML source. Code for main.py:
def hello_pubsub(event, context):
    import re
    import json
    import base64
    import requests
    import bs4 as bs
    from google.cloud import pubsub_v1

    def publish(message):
        project_id = "adventdalen"
        topic_name = "scrape"
        publisher = pubsub_v1.PublisherClient()
        topic_path = publisher.topic_path(project_id, topic_name)
        future = publisher.publish(
            topic_path, data=message.encode('utf-8')
        )

    url = "https://www.bbc.com/"
    page = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'})
    bspage = bs.BeautifulSoup(page.text, 'html.parser')
    bspage = str(bspage)
    publish(json.dumps({"html": "publish_this"}))     # <- this makes it into BigQuery
    publish(json.dumps({"html": re.escape(bspage)}))  # <- this does not.
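One detail worth noticing in the code above: re.escape is designed for building regular-expression patterns, not for sanitizing payloads, so it inserts backslashes before regex metacharacters (such as the dots in every URL) and silently alters the HTML. json.dumps alone already produces a valid, fully escaped JSON string. A minimal sketch of the difference (the sample HTML here is hypothetical, not taken from the question):

```python
import json
import re

html = '<a href="https://www.bbc.com/">BBC</a>'

# re.escape backslash-escapes regex metacharacters such as '.',
# which mangles the payload before it ever leaves the function.
escaped = re.escape(html)
print('\\' in escaped)  # prints True: backslashes were inserted

# json.dumps alone already produces a valid JSON string, quoting
# included, so no extra escaping is needed before publishing.
payload = json.dumps({"html": html})
print(json.loads(payload)["html"] == html)  # prints True: round-trips cleanly
```

If the downstream Dataflow template parses each message as JSON, the re.escape-mangled value will still parse, but the stored HTML will no longer match the original source.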
requirements.txt is:
# Function dependencies, for example:
# package>=version
google-cloud-pubsub
requests
bs4
The Dataflow job
I have a Dataflow job called ps-to-bq-scrape:
The target of this job is the BigQuery table adventdalen:scrape.scrape (highlighted in the screenshot above, on the right), with scrape as the inputTopic (four rows above the highlight).
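For reference, a streaming job like this is typically launched from the Google-provided Pub/Sub-to-BigQuery Dataflow template. A hedged sketch of the launch command, assuming the project, topic, and table names from the question and a placeholder region (not verified against the actual setup):

```shell
# Launch the Google-provided streaming template that reads JSON messages
# from a Pub/Sub topic and writes each one as a row into a BigQuery table.
gcloud dataflow jobs run ps-to-bq-scrape \
  --gcs-location gs://dataflow-templates/latest/PubSub_to_BigQuery \
  --region us-central1 \
  --parameters \
inputTopic=projects/adventdalen/topics/scrape,\
outputTableSpec=adventdalen:scrape.scrape
```

Note that this template expects every message body to be a single JSON object whose keys match the BigQuery schema; messages that fail to parse are diverted to an error (dead-letter) table rather than appearing in the target table, which is one place a "lost" message can end up.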
The BigQuery table
In the BigQuery table, I expect to have rows equal to "publish_this" and rows containing the BBC website HTML source. Instead, I find this:
Only the publish_this rows appear. To make sure I am not deceiving myself by looking only at the "Preview", I query for every row not equal to publish_this:
and I get no results. The BBC source code got lost somewhere.
Question
Something is wrong with main.py above, I believe. How do I modify main.py so that not only the text "publish_this" but also the source HTML makes it into a BigQuery row?
(It would also be useful to know if something is wrong with the setup rather than with main.py - I believe this is unlikely, though, and that the issue can be solved by fixing main.py.)
Solution 1:[1]
One thing that may be happening is that the URL content is not being converted to a proper string. I would recommend using a library for URI parsing and then converting the result to a string.
You can use the yarl library to parse complex URIs.
You can see this example:
>>> from yarl import URL
>>> url = URL('https://www.python.org/~guido?arg=1#frag')
>>> url
URL('https://www.python.org/~guido?arg=1#frag')
You can also use urllib.parse.unquote(); this function decodes percent-encoded data to UTF-8 bytes and then to text.
You can see this example:
>>> from urllib.parse import unquote
>>> url = 'example.com?title=%D0%BF%D1%80%D0%B0%D0%B2%D0%BE%D0%B2%D0%B0%D1%8F+%D0%B7%D0%B0%D1%89%D0%B8%D1%82%D0%B0'
>>> unquote(url)
'example.com?title=правовая+защита'
In case the parsing does not fix it, could you share the logs of the Dataflow job, so we can see whether there is a problem with the input or the output?
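Before digging into the Dataflow logs, it can also help to replay locally what the template does with each message: decode the bytes and parse them as JSON, and check the message stays under Pub/Sub's 10 MB limit. This sketch simulates that round trip with a hypothetical scraped payload (no GCP calls involved; the two helper functions are illustrative names, not part of any library):

```python
import json
import re

def encode_like_the_function(html: str) -> bytes:
    # Mirrors main.py: wrap in JSON, apply re.escape, encode to UTF-8.
    return json.dumps({"html": re.escape(html)}).encode("utf-8")

def parse_like_the_template(data: bytes) -> dict:
    # The Pub/Sub-to-BigQuery template expects each message to be one
    # JSON object whose keys match the BigQuery table schema.
    return json.loads(data.decode("utf-8"))

html = '<html><body><a href="https://www.bbc.com/">BBC</a></body></html>'
data = encode_like_the_function(html)

print(len(data) < 10_000_000)  # prints True: within Pub/Sub's 10 MB limit
row = parse_like_the_template(data)
print(row["html"] == html)  # prints False: re.escape altered the payload
```

If the JSON parses locally and the size is within limits, the next place to look is the template's error (dead-letter) table and the job's worker logs.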
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Jose Gutierrez Paliza |





