
adding support for partitioned s3 source #4

Open · wants to merge 2 commits into master
Conversation

abhishekd0907 (Collaborator) commented:

Background

The S3-SQS source doesn't support reading partition columns from the S3 bucket. As a result, the dataset formed using the S3-SQS source doesn't contain the partition columns, leading to issue #2.

How this PR Handles the Problem

With the new changes, the user can specify partition columns in the schema with `isPartitioned` set to `true` in the column metadata.

Example:

import org.apache.spark.sql.types._
import org.apache.spark.sql.types.MetadataBuilder

val metaData = (new MetadataBuilder).putString("isPartitioned", "true").build()

val partitionedSchema = new StructType().add(StructField("col1", IntegerType, true, metaData))
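
The flag can be read back from the schema in the same way the source can presumably detect partition columns. A minimal check (the filtering logic inside the connector itself may differ):

```scala
// List the fields flagged as partition columns (illustrative check, not part of the connector API)
partitionedSchema.fields.filter(_.metadata.contains("isPartitioned")).map(_.name)
// Array(col1)
```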

Also, the user needs to specify the `basePath` in the options if the schema contains partition columns. Specifying partition columns without specifying the `basePath` will throw an error.

Example:

For a file at `s3://bucket/basedDir/part1=10/part2=20/file.json`, the basePath is `s3://bucket/basedDir/`.
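
For context, here is a rough end-to-end sketch combining the two requirements. The `s3-sqs` format name, the column names, and the `spark` session in scope are assumptions for illustration; only the `basePath` option and the `isPartitioned` metadata flag come from this PR:

```scala
import org.apache.spark.sql.types._

// Schema with a regular data column plus partition columns carrying the isPartitioned flag.
val partitionMeta = new MetadataBuilder().putString("isPartitioned", "true").build()
val fullSchema = new StructType()
  .add(StructField("value", StringType, true))                 // data column
  .add(StructField("part1", IntegerType, true, partitionMeta)) // partition column
  .add(StructField("part2", IntegerType, true, partitionMeta)) // partition column

// basePath tells the source where the partition directories begin, so that
// part1=10/part2=20 can be recovered from each notified file's path.
val df = spark.readStream
  .format("s3-sqs")                            // format name assumed; other connector options omitted
  .schema(fullSchema)
  .option("basePath", "s3://bucket/basedDir/") // required whenever the schema has partition columns
  .load()
```

Leaving out the `basePath` option in this setup would, per the description above, cause the source to throw an error.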

## Using a Partitioned S3 Bucket

In case your S3 bucket is partitioned, your schema must contain both data columns and partition columns. Moreover, partition columns need to have `isPartitioned` set to `true` in their metadata.
@itsvikramagr (Collaborator) commented on Aug 20, 2020:

Is this a constraint that this module is adding, or is it normally expected that partition columns will have `isPartitioned` in their metadata?
