In this tutorial, we'll learn how to use Google Cloud Dataflow to
- implement the MinimalWordCount example, and then
- modify the example to write the results to BigQuery by creating our own schema.
Before running the examples, set up your environment:

- Install pip (version >= 7.0.0).
- (Optional, but recommended) Set up a Python virtual environment:
  - Install pyenv-virtualenv (instructions for macOS).
  - Create and activate an environment:

    ```
    pyenv virtualenv gcp_env
    pyenv activate gcp_env
    ```

- Install BigQuery and the Apache Beam Python SDK (a sketch of the expected `requirements.txt` contents follows this list):

  ```
  pip install -r requirements.txt
  ```

- Set up your project as instructed under the 'Before you begin' section. If you haven't already, you may sign up for the free GCP trial credit.
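The repository's `requirements.txt` is authoritative; at minimum it needs Beam's GCP extras and the BigQuery client library, so its contents are roughly as follows (an assumption, not the file verbatim):

```
apache-beam[gcp]
google-cloud-bigquery
```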
First, let's test our code locally on our laptop:

```
python DirectRunner.py
```

Output: a text file whose name starts with `wordcount_output-*`. You may see the following warning:

```
No handlers could be found for logger "oauth2client.contrib.multistore_file"
```

This is a harmless warning and you may ignore it.
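For orientation, the core of a minimal word-count pipeline like the one `DirectRunner.py` runs looks roughly like this. The input path is Beam's public sample file; the details here are a sketch, not necessarily the repository's exact code:

```python
import re

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A minimal word-count pipeline in the spirit of Beam's MinimalWordCount.
# With no runner specified, Beam falls back to the local DirectRunner.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
     | 'Split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
     | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
     | 'Count' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda word_count: '%s: %d' % word_count)
     | 'Write' >> beam.io.WriteToText('wordcount_output'))
```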
Now that the code works locally, let's test it on the cloud. Edit `DataflowRunner.py` with your values for `your-project-ID` and `your-bucket-name`. You don't need to create the `your-bucket-name/staging` and `your-bucket-name/temp` sub-directories; the script creates them automatically.

```
python DataflowRunner.py
```

It may take a few minutes to generate the output files in `your-bucket-name`. You may monitor this process in your project on the cloud; the job name is `word-count-job`.

Output: go to `your-bucket-name`. You'll see two folders named `staging/` and `temp/`, and three text files whose names start with `wordcount_output-*`. These text files contain all the unique words in the input file and their frequencies.
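For reference, the edits you just made correspond to standard Beam pipeline options. A sketch of the configuration a Dataflow run needs (the option names are real Beam options; the script's actual structure may differ, and `region` is an assumption):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Dataflow needs a project, a job name, and Cloud Storage locations for
# staged code and temporary files; the runner creates staging/ and temp/
# under your bucket if they don't already exist.
options = PipelineOptions(
    runner='DataflowRunner',
    project='your-project-ID',
    job_name='word-count-job',
    staging_location='gs://your-bucket-name/staging',
    temp_location='gs://your-bucket-name/temp',
    region='us-central1',  # assumption: use whichever region suits you
)
```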
Again, let's first test our code locally. Edit `DirectRunnerBQ.py` with your value for `your-project-ID`.

```
python DirectRunnerBQ.py
```

Output: prints ten table values on the shell.
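The BQ variants swap the text sink for a BigQuery sink with an explicit schema, which is the "creating our own schema" part of this tutorial. A sketch of the idea (the dataset name matches this tutorial's `wordcount_dataset`; the table and field names are assumptions, not necessarily what the script uses):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     # Stand-in for the real word counts computed by the pipeline.
     | 'Create' >> beam.Create([('king', 311), ('lear', 230)])
     # BigQuery rows are plain dicts keyed by column name.
     | 'ToRow' >> beam.Map(lambda kv: {'word': kv[0], 'frequency': kv[1]})
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
           'your-project-ID:wordcount_dataset.word_counts',
           schema='word:STRING,frequency:INTEGER',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```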
Now, let's run the script on the cloud. Edit `DataflowRunnerBQ.py` with your values for `your-project-ID` and `your-bucket-name`.

```
python DataflowRunnerBQ.py
```

Output: prints ten table values on the shell. Again, you may monitor the Dataflow process; the job name is `word-count-bq-job`.
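Printing ten table values back is just a BigQuery read; here is a sketch with the google-cloud-bigquery client (the table and column names follow the assumptions in the write sketch above):

```python
from google.cloud import bigquery

# Read ten rows back from the output table.
client = bigquery.Client(project='your-project-ID')
query = """
    SELECT word, frequency
    FROM `your-project-ID.wordcount_dataset.word_counts`
    LIMIT 10
"""
for row in client.query(query).result():
    print(row['word'], row['frequency'])
```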
To avoid incurring charges to your account:

- Delete your BigQuery dataset. The dataset is named `wordcount_dataset` (a Python sketch for this follows the list).
- Delete your buckets.
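If you'd rather clean up the dataset from Python than from the console, a sketch with the BigQuery client (`delete_contents` also removes the tables inside the dataset):

```python
from google.cloud import bigquery

# Remove the tutorial dataset and every table in it.
client = bigquery.Client(project='your-project-ID')
client.delete_dataset('wordcount_dataset', delete_contents=True, not_found_ok=True)
```

Buckets are easiest to remove from the Cloud Storage browser in the console.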
If you created a Python virtual environment, deactivate it:

```
pyenv deactivate
```
- Apache Beam GitHub repo
- Google Cloud Platform BigQuery GitHub repo