Project Summary

In this tutorial, we'll learn how to use Google Cloud Dataflow to

  1. implement the MinimalWordCount example, and then
  2. modify the example to write the results to BigQuery by creating our own schema.

Requirements

  1. Install pip (version >= 7.0.0).
  2. (Optional, but recommended) Set up a Python virtual environment:

    pyenv virtualenv gcp_env
    pyenv activate gcp_env

  3. Install the BigQuery client library and the Apache Beam Python SDK (a plausible requirements.txt is sketched after this list):

    pip install -r requirements.txt

  4. Set up your project as instructed under the 'Before you begin' section. If you haven't already, you may sign up for the free GCP trial credit.
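
The repository's requirements.txt pins the exact dependencies; a minimal sketch of what it plausibly contains (an assumption, not the repo's actual file):

    # Hypothetical requirements.txt -- verify against the repo's actual file.
    apache-beam[gcp]
    google-cloud-bigquery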

Run Scripts

1. (Basic) MinimalWordCount Example

  • First, let's test our code locally.

    python DirectRunner.py

    Output: A text file whose name starts with wordcount_output-*.

    You may see the following warning:

    No handlers could be found for logger "oauth2client.contrib.multistore_file".

    This warning is harmless and you may ignore it.
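
    For reference, the core of DirectRunner.py is plausibly a pipeline like the minimal sketch below (the repo's actual file may differ; the input path is the public Shakespeare sample used by the canonical MinimalWordCount example, and reading it requires GCP credentials):

    # Minimal sketch, not the repo's exact code.
    import re

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    # DirectRunner executes the pipeline locally, in-process.
    options = PipelineOptions(runner='DirectRunner')

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText(
               'gs://dataflow-samples/shakespeare/kinglear.txt')
         | 'Split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
         | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
         | 'CountPerWord' >> beam.CombinePerKey(sum)
         | 'Format' >> beam.Map(lambda kv: '%s: %d' % kv)
         # Produces shards named like wordcount_output-00000-of-00003.
         | 'Write' >> beam.io.WriteToText('wordcount_output'))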

  • Now that the code works locally, let's test it on the cloud.

    Edit DataflowRunner.py with your values for your-project-ID and your-bucket-name. You don't need to create the your-bucket-name/staging and your-bucket-name/temp sub-directories; the script automatically creates them.

    python DataflowRunner.py

    It may take a few minutes for the output files to appear in your-bucket-name. You can monitor the job's progress on the Dataflow page of the GCP Console; the job name is word-count-job.

    Output: Go to your-bucket-name. You'll see two folders named staging/ and temp/, and three text files whose names start with wordcount_output-*. These text files contain all the unique words in the input file and their frequencies.
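
    The only substantive difference from DirectRunner.py is plausibly the set of pipeline options; a hypothetical sketch (the option names are standard Beam options, the values are the placeholders used above):

    # Minimal sketch of the Dataflow pipeline options, not the repo's exact code.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner='DataflowRunner',  # execute on the Cloud Dataflow service
        project='your-project-ID',
        job_name='word-count-job',
        staging_location='gs://your-bucket-name/staging',  # created on demand
        temp_location='gs://your-bucket-name/temp')        # created on demand

    With these options, the Write step would also target gs://your-bucket-name/wordcount_output rather than a local path.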

2. (Modified) Write MinimalWordCount Output to BigQuery

  • Again, let's first test our code locally.

    Edit DirectRunnerBQ.py with your value for your-project-ID.

    python DirectRunnerBQ.py

    Output: Prints ten rows of the result table to the shell.
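
    For reference, DirectRunnerBQ.py plausibly keeps the read/split/count stages from the earlier sketch and swaps the tail for a BigQuery sink with our own schema; a hypothetical sketch (the dataset and table names are placeholders, not the repo's actual identifiers):

    # Minimal sketch, not the repo's exact code.
    import re

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(runner='DirectRunner', project='your-project-ID')

    with beam.Pipeline(options=options) as p:
        (p
         | 'Read' >> beam.io.ReadFromText(
               'gs://dataflow-samples/shakespeare/kinglear.txt')
         | 'Split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
         | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
         | 'CountPerWord' >> beam.CombinePerKey(sum)
         # Rows must be dicts whose keys match the schema below.
         | 'ToRow' >> beam.Map(lambda kv: {'word': kv[0], 'frequency': kv[1]})
         | 'WriteToBQ' >> beam.io.WriteToBigQuery(
               table='wordcount_table',      # hypothetical name
               dataset='wordcount_dataset',  # hypothetical name
               project='your-project-ID',
               schema='word:STRING, frequency:INTEGER',  # our own schema
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
               write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))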

  • Now, let's run the script on the cloud.

    Edit DataflowRunnerBQ.py with your values for your-project-ID and your-bucket-name.

    python DataflowRunnerBQ.py

    Output: Prints ten rows of the result table to the shell.

    Again, you can monitor the job on the Dataflow page of the GCP Console; this time the job name is word-count-bq-job.
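
    A hypothetical sketch of how the scripts might print those ten rows with the google-cloud-bigquery client (the dataset and table names are the placeholders from the sketch above):

    # Minimal sketch, not the repo's exact code.
    from google.cloud import bigquery

    client = bigquery.Client(project='your-project-ID')
    query = ('SELECT word, frequency '
             'FROM wordcount_dataset.wordcount_table '
             'ORDER BY frequency DESC LIMIT 10')
    # Iterating the query job blocks until it finishes and yields result rows.
    for row in client.query(query):
        print(row.word, row.frequency)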

Clean-up

To avoid incurring charges to your account, delete the cloud resources this tutorial created: the staging/, temp/, and output files in your-bucket-name, and the BigQuery dataset.
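
One way to do this from the command line (the dataset name here is the hypothetical placeholder used in the sketches above; double-check the names before deleting):

    gsutil rm -r gs://your-bucket-name
    bq rm -r -f your-project-ID:wordcount_dataset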

If you created a Python virtual environment, deactivate it:

    pyenv deactivate
