[Improve][Zeta] Disable hdfs filesystem cache of checkpoint #6718

Merged
merged 1 commit into from
Apr 29, 2024

LeonYoah
Contributor

…cache function is disabled by default.

Purpose of this pull request

Does this PR introduce any user-facing change?

When using hadoop-aws-3.1.4.jar and aws-java-sdk-bundle-1.11.271.jar to connect to HDFS or S3 file systems, FileSystem objects are cached by default. In multithreaded scenarios, FileSystem objects are often closed, which also closes the underlying connection pool. If such objects are then taken from the cache, unexpected exceptions can result.
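
The failure mode described above can be illustrated with a minimal, self-contained sketch. It does not use Hadoop: `Handle`, `HandleFactory`, and `secondCallerSurvives` are hypothetical stand-ins invented for this illustration, and the `disableCache` flag only mimics the idea behind Hadoop's `fs.<scheme>.impl.disable.cache` property. The example runs in a single thread to stay deterministic; with several threads the same sharing simply becomes harder to predict.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SharedHandleCacheDemo {

    /** Hypothetical stand-in for a file system client that owns a connection pool. */
    static final class Handle implements AutoCloseable {
        private volatile boolean closed = false;

        String read(String path) {
            if (closed) {
                throw new IllegalStateException("handle already closed");
            }
            return "data from " + path;
        }

        @Override
        public void close() {
            // Closing the handle also "closes" its connection pool.
            closed = true;
        }
    }

    /** Hypothetical factory: with caching on, all callers for one URI share one Handle. */
    static final class HandleFactory {
        private final Map<String, Handle> cache = new ConcurrentHashMap<>();
        private final boolean disableCache;

        HandleFactory(boolean disableCache) {
            this.disableCache = disableCache;
        }

        Handle get(String uri) {
            if (disableCache) {
                // Every caller gets a private instance it can close safely.
                return new Handle();
            }
            return cache.computeIfAbsent(uri, key -> new Handle());
        }
    }

    /** Returns true if a second caller can still read after the first caller closes its handle. */
    static boolean secondCallerSurvives(boolean disableCache) {
        HandleFactory factory = new HandleFactory(disableCache);
        Handle first = factory.get("hdfs://example");
        Handle second = factory.get("hdfs://example");
        first.close();
        try {
            second.read("/checkpoint/1");
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("cache enabled:  " + secondCallerSurvives(false));
        System.out.println("cache disabled: " + secondCallerSurvives(true));
    }
}
```

With the cache enabled, both callers receive the same object, so one caller's `close()` breaks the other. With the cache disabled, each caller owns its instance, at the cost of creating more instances.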

How was this patch tested?

no

Check list

@LeonYoah
Contributor Author

I don't have an OSS environment, so I can't run the OSS unit tests:
image

@hailin0 hailin0 changed the title from "[Feature][CheckPoint-stroage]:Added Disable cache configuration. The cache function is disabled by default." to "[Improve][Zeta] Disable hdfs filesystem cache of checkpoint" on Apr 17, 2024
@@ -186,3 +186,40 @@ seatunnel:

```

### Enable cache

When storage:type is hdfs, cache is disabled by default. If you want to enable it, set `disable.cache: false`
Member

Could you share what risks will be caused if cache is turned on?

Member

> Could you share what risks will be caused if cache is turned on?

I think that in SeaTunnel's scenario, it should never be turned on.

Contributor Author

> Could you share what risks will be caused if cache is turned on?

You can take a look at bug #6678, which I reported against the [S3 connector]: using the cache causes occasional task failures. You can also refer to two related bug reports, https://issues.apache.org/jira/browse/HADOOP-15819 and aws/aws-sdk-java#2337, for hadoop-aws and aws-sdk-java. They describe in detail the problems caused by turning on the cache in a multithreaded environment.

Member

Makes sense to me.

@Hisoka-X
Member

Could you add a test case to cover this bug?

@LeonYoah
Contributor Author

> Could you add a test case to cover this bug?

Currently, there is no known connection-closure bug with the HDFS cache, so this change is a precaution.

Contributor

@dailai dailai left a comment

LGTM

@hailin0 hailin0 merged commit 53c8957 into apache:dev Apr 29, 2024
12 checks passed
chaorongzhi pushed a commit to chaorongzhi/seatunnel that referenced this pull request Aug 21, 2024
5 participants