In this tutorial, we'll learn how to use Google Cloud Dataflow to
- implement the MinimalWordCount example, and then
- modify the example to write the results to BigQuery by creating our own schema.
Before running the examples, set up your environment:

- Install pip (version >= 7.0.0).
- (Optional, but recommended) Set up a Python virtual environment:
  - Install pyenv-virtualenv (instructions for macOS).
  - Create and activate an environment:

    ```
    pyenv virtualenv gcp_env
    pyenv activate gcp_env
    ```

- Install BigQuery and the Apache Beam Python SDK (a sketch of the expected `requirements.txt` contents follows this list):

  ```
  pip install -r requirements.txt
  ```

- Set up your project as instructed under the 'Before you begin' section. If you haven't already, you may sign up for the free GCP trial credit.
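The repository's `requirements.txt` is authoritative; at minimum it needs Beam's GCP extras and the BigQuery client library, so its contents are roughly as follows (an assumption, not the file verbatim):

```
apache-beam[gcp]
google-cloud-bigquery
```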
First, let's test our code locally on our laptop:

```
python DirectRunner.py
```

Output: a text file whose name starts with `wordcount_output-*`. You may see the following warning:

```
No handlers could be found for logger "oauth2client.contrib.multistore_file"
```

This is a harmless warning and you may ignore it.
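For orientation, the core of a minimal word-count pipeline like the one `DirectRunner.py` runs looks roughly like this. The input path is Beam's public sample file; the details here are a sketch, not necessarily the repository's exact code:

```python
import re

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# A minimal word-count pipeline in the spirit of Beam's MinimalWordCount.
# With no runner specified, Beam falls back to the local DirectRunner.
with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://dataflow-samples/shakespeare/kinglear.txt')
     | 'Split' >> beam.FlatMap(lambda line: re.findall(r"[A-Za-z']+", line))
     | 'PairWithOne' >> beam.Map(lambda word: (word, 1))
     | 'Count' >> beam.CombinePerKey(sum)
     | 'Format' >> beam.Map(lambda word_count: '%s: %d' % word_count)
     | 'Write' >> beam.io.WriteToText('wordcount_output'))
```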
Now that the code works locally, let's test it on the cloud. Edit `DataflowRunner.py` with your values for `your-project-ID` and `your-bucket-name`. You don't need to create the `your-bucket-name/staging` and `your-bucket-name/temp` sub-directories; the script creates them automatically.

```
python DataflowRunner.py
```

It may take a few minutes to generate the output files in `your-bucket-name`. You may monitor this process in your project on the cloud; the job name is `word-count-job`.

Output: go to `your-bucket-name`. You'll see two folders named `staging/` and `temp/`, and three text files whose names start with `wordcount_output-*`. These text files contain all the unique words in the input file and their frequencies.
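For reference, the edits you just made correspond to standard Beam pipeline options. A sketch of the configuration a Dataflow run needs (the option names are real Beam options; the script's actual structure may differ, and `region` is an assumption):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Dataflow needs a project, a job name, and Cloud Storage locations for
# staged code and temporary files; the runner creates staging/ and temp/
# under your bucket if they don't already exist.
options = PipelineOptions(
    runner='DataflowRunner',
    project='your-project-ID',
    job_name='word-count-job',
    staging_location='gs://your-bucket-name/staging',
    temp_location='gs://your-bucket-name/temp',
    region='us-central1',  # assumption: use whichever region suits you
)
```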
Again, let's first test our code locally. Edit `DirectRunnerBQ.py` with your value for `your-project-ID`.

```
python DirectRunnerBQ.py
```

Output: prints ten table values on the shell.
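The BQ variants swap the text sink for a BigQuery sink with an explicit schema, which is the "creating our own schema" part of this tutorial. A sketch of the idea (the dataset name matches this tutorial's `wordcount_dataset`; the table and field names are assumptions, not necessarily what the script uses):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     # Stand-in for the real word counts computed by the pipeline.
     | 'Create' >> beam.Create([('king', 311), ('lear', 230)])
     # BigQuery rows are plain dicts keyed by column name.
     | 'ToRow' >> beam.Map(lambda kv: {'word': kv[0], 'frequency': kv[1]})
     | 'WriteBQ' >> beam.io.WriteToBigQuery(
           'your-project-ID:wordcount_dataset.word_counts',
           schema='word:STRING,frequency:INTEGER',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))
```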
Now, let's run the script on the cloud. Edit `DataflowRunnerBQ.py` with your values for `your-project-ID` and `your-bucket-name`.

```
python DataflowRunnerBQ.py
```

Output: prints ten table values on the shell. Again, you may monitor the Dataflow process; the job name is `word-count-bq-job`.
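Printing ten table values back is just a BigQuery read; here is a sketch with the google-cloud-bigquery client (the table and column names follow the assumptions in the write sketch above):

```python
from google.cloud import bigquery

# Read ten rows back from the output table.
client = bigquery.Client(project='your-project-ID')
query = """
    SELECT word, frequency
    FROM `your-project-ID.wordcount_dataset.word_counts`
    LIMIT 10
"""
for row in client.query(query).result():
    print(row['word'], row['frequency'])
```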
To avoid incurring charges to your account:

- Delete your BigQuery dataset. The dataset is named `wordcount_dataset` (a Python sketch for this follows the list).
- Delete your buckets.
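If you'd rather clean up the dataset from Python than from the console, a sketch with the BigQuery client (`delete_contents` also removes the tables inside the dataset):

```python
from google.cloud import bigquery

# Remove the tutorial dataset and every table in it.
client = bigquery.Client(project='your-project-ID')
client.delete_dataset('wordcount_dataset', delete_contents=True, not_found_ok=True)
```

Buckets are easiest to remove from the Cloud Storage browser in the console.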
If you created a Python virtual environment, deactivate it:

```
pyenv deactivate
```
- Apache Beam GitHub repo
- Google Cloud Platform BigQuery GitHub repo