Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43860: [Go][Parquet] Handle the error correctly #43861

Merged
merged 3 commits into from
Aug 28, 2024

Conversation

bigsheeper
Copy link
Contributor

@bigsheeper bigsheeper commented Aug 28, 2024

Rationale for this change

Fixes: #43860

What changes are included in this PR?

Return error correctly

Are these changes tested?

Yes

Are there any user-facing changes?

Nope

Signed-off-by: bigsheeper <[email protected]>
@bigsheeper bigsheeper requested a review from zeroshade as a code owner August 28, 2024 03:36
Copy link

⚠️ GitHub issue #43860 has been automatically assigned in GitHub to PR creator.

@mapleFU
Copy link
Member

mapleFU commented Aug 28, 2024

Are these changes tested?
Yes

Would you mind add a test for this? I guess parquet-testing/bad-data might contain some bad file?

@bigsheeper
Copy link
Contributor Author

Are these changes tested?
Yes

Would you mind add a test for this? I guess parquet-testing/bad-data might contain some bad file?

@mapleFU Sorry, could you clarify what you mean by "parquet-testing/bad-file"? Can you provide a link or more context?

@mapleFU
Copy link
Member

mapleFU commented Aug 28, 2024

You can try to follow[1]. And parquet-testing uses this repo [2]. It includes some bad data-files in "bad_data", would you mind check can this being tested using these files?

[1]

dir := os.Getenv("PARQUET_TEST_DATA")
if dir == "" {
t.Skip("no path supplied with PARQUET_TEST_DATA")
}
assert.DirExists(t, dir)

[2] https://github.com/apache/parquet-testing

@bigsheeper
Copy link
Contributor Author

You can try to follow[1]. And parquet-testing uses this repo [2]. It includes some bad data-files in "bad_data", would you mind check can this being tested using these files?

[1]

dir := os.Getenv("PARQUET_TEST_DATA")
if dir == "" {
t.Skip("no path supplied with PARQUET_TEST_DATA")
}
assert.DirExists(t, dir)

[2] https://github.com/apache/parquet-testing

@mapleFU It seems that the PARQUET_TEST_DATA environment variable only leads to the files from the data directory and does not include bad files.

@mapleFU
Copy link
Member

mapleFU commented Aug 28, 2024

Sorry, would you mind try something like below?

std::string get_bad_data_dir() {
  // PARQUET_TEST_DATA should point to ARROW_HOME/cpp/submodules/parquet-testing/data
  // so need to reach one folder up to access the "bad_data" folder.
  std::string data_dir(get_data_dir());
  std::stringstream ss;
  ss << data_dir << "/../bad_data";
  return ss.str();
}

If cannot touch data, can we handle write a test ourselve using go?

@bigsheeper
Copy link
Contributor Author

bigsheeper commented Aug 28, 2024

I agree. The root cause of this issue is that the error was not thrown. Maybe we can just use one mock error in ut? If that's ok, I'll add a ut for this change.

@mapleFU
Copy link
Member

mapleFU commented Aug 28, 2024

Both is ok for me. I'll be glad to see a go ut, if bad_data can reproduce it would be convient. I'm ok for both

@joellubi joellubi self-requested a review August 28, 2024 09:40
Signed-off-by: bigsheeper <[email protected]>
@bigsheeper
Copy link
Contributor Author

Both is ok for me. I'll be glad to see a go ut, if bad_data can reproduce it would be convient. I'm ok for both

@mapleFU I attempted to read a bad file, but no error was returned. In my case, the io.reader reported an error (specifically, a timeout connecting to S3: Get "http://x.x.x.x:9000/a-bucket//tmp/test_2292796098596731149.parquet": dial tcp x.x.x.x:9000: connect: connection refused), but the Parquet reader did not propagate this error. I'v simulate this scenario in the ut. Please help to review. :)

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Aug 28, 2024
@mapleFU
Copy link
Member

mapleFU commented Aug 28, 2024

Dial a s3 address in parquet ut would be a disaster and be unstable, I don't suggest that

Mocking might be ok here

@bigsheeper
Copy link
Contributor Author

bigsheeper commented Aug 28, 2024

Dial a s3 address in parquet ut would be a disaster and be unstable, I don't suggest that

Mocking might be ok here

I didn’t dial an S3 address in Parquet ut; I only mock an error when io.Reader calls ReadAt.
We need to ensure any error is thrown and handled properly, otherwise users might face situations where they read 0 rows of Parquet data, which would be disastrous for them.

Signed-off-by: bigsheeper <[email protected]>
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Aug 28, 2024
@mapleFU
Copy link
Member

mapleFU commented Aug 28, 2024

I didn’t dial an S3 address in Parquet ut; I only mock an error when io.Reader calls ReadAt.
We need to ensure any error is thrown and handled properly, otherwise users might face situations where they read 0 rows of Parquet data, which would be disastrous for them.

I see. A lot of module under Parquet-go might need handle error gracefully. Maybe some fuzzing script to generate bad file or inject bad io-state would help finding them, sigh...

@bigsheeper
Copy link
Contributor Author

bigsheeper commented Aug 28, 2024

If you're interested in the issue we've encountered, see: milvus-io/milvus#35662 (comment) :)
image

@zeroshade
Copy link
Member

@mapleFU Several areas of the Go parquet lib rely on using panics with a top-level recover to return errors. I agree we should definitely try to find any spots where we're missing proper handling.

Go has support for fuzz testing built in https://go.dev/doc/security/fuzz/ and I've been meaning to get around to adding that to the parquet lib but haven't had the time. It would be great for us to add that

Copy link
Member

@zeroshade zeroshade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks!

If there are any other areas we can identify that have incorrect error handling we should try to find them, test, and update them.

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Aug 28, 2024
@mapleFU
Copy link
Member

mapleFU commented Aug 28, 2024

LGTM

@mapleFU
Copy link
Member

mapleFU commented Aug 28, 2024

@bigsheeper FYI #43607 these are similiar problem

Copy link
Member

@joellubi joellubi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix!

@zeroshade zeroshade merged commit 9d40a6a into apache:main Aug 28, 2024
27 checks passed
@zeroshade zeroshade removed the awaiting merge Awaiting merge label Aug 28, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 9d40a6a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 9 possible false positives for unstable benchmarks that are known to sometimes produce them.

bigsheeper added a commit to bigsheeper/arrow that referenced this pull request Aug 29, 2024
czs007 added a commit to milvus-io/arrow that referenced this pull request Aug 29, 2024
mapleFU pushed a commit to mapleFU/arrow that referenced this pull request Sep 3, 2024
### Rationale for this change
Fixes: apache#43860

### What changes are included in this PR?
Return error correctly

### Are these changes tested?
Yes

### Are there any user-facing changes?
Nope

* GitHub Issue: apache#43860

Authored-by: bigsheeper <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
khwilson pushed a commit to khwilson/arrow that referenced this pull request Sep 14, 2024
### Rationale for this change
Fixes: apache#43860

### What changes are included in this PR?
Return error correctly

### Are these changes tested?
Yes

### Are there any user-facing changes?
Nope

* GitHub Issue: apache#43860

Authored-by: bigsheeper <[email protected]>
Signed-off-by: Matt Topol <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Go] Error was not handled
4 participants