metadata-ingestion: Update great-expectations dependency from 0.15 to 0.16 #8115

Open
vrld opened this issue May 24, 2023 · 18 comments
Labels
accepted An Issue that is confirmed as a bug by the DataHub Maintainers.

Comments

@vrld

vrld commented May 24, 2023

Currently, DataHub depends on great-expectations <= 0.15.50, which is no longer actively maintained. The latest version is 0.16.13, which adds Fluent Datasources that make GX much more user friendly.

However, the new releases remove deprecated code that is used by DataHub, e.g., SQLAlchemyDataset/Datasource in the data profiler and probably some data-asset related stuff in the GX action.

Please update the dependency to 0.16 so that our users can use the new GX version with the datahub action.

@github-actions

This issue is stale because it has been open for 30 days with no activity. If you believe this is still an issue on the latest DataHub release please leave a comment with the version that you tested it with. If this is a question/discussion please head to https://slack.datahubproject.io. For feature requests please use https://feature-requests.datahubproject.io

@github-actions github-actions bot added the stale label Jun 25, 2023
@github-actions

This issue was closed because it has been inactive for 30 days since being marked as stale.

@github-actions github-actions bot closed this as not planned Jul 26, 2023
@jelledv

jelledv commented Oct 12, 2023

Any update on this?

@hsheth2 hsheth2 reopened this Nov 20, 2023
@hsheth2 hsheth2 added the accepted and stale labels and removed the stale label Nov 20, 2023
@hsheth2
Collaborator

hsheth2 commented Nov 20, 2023

This issue is on our radar, but unfortunately isn't a simple fix because of the level of customization and patching we've done in our existing GX-based data profilers. We've had some conversations with the GX team around what it would take to get this done, and are working to scope it accordingly.

@DSchmidtDev
Contributor

any updates on this issue? I mean GX is at 0.18 in the meantime :)

@mateocolina

mateocolina commented Apr 3, 2024

Any updates on this? they are about to move to 1.x.x :)

@KulykDmytro
Contributor

KulykDmytro commented Apr 19, 2024

To make datahub work with recent airflow, GE needs to be bumped to at least 0.16.8.

Currently it clashes on urllib3: older GE versions pin it to 1.26, while airflow is pinned to 2.x.

The same applies to botocore on Python 3.10+.

Thus, great-expectations (>=0.15.12,<0.15.50) requires urllib3 (>=1.25.4,<1.27)
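
A minimal sketch of why no single urllib3 version can satisfy both sides. This is a simplified range check, not pip's or Poetry's actual resolver; the helper names are hypothetical, and "airflow pinned to 2.x" is assumed here to mean urllib3 >= 2.0 with no upper bound.

```python
from __future__ import annotations


def parse(version: str) -> tuple[int, ...]:
    """Turn '1.25.4' into (1, 25, 4). Handles plain numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def ranges_overlap(lo_a: str, hi_a: str | None, lo_b: str, hi_b: str | None) -> bool:
    """True if [lo_a, hi_a) and [lo_b, hi_b) share at least one version.

    An upper bound of None means "no upper bound".
    """
    lower = max(parse(lo_a), parse(lo_b))
    uppers = [parse(h) for h in (hi_a, hi_b) if h is not None]
    return not uppers or lower < min(uppers)


# urllib3 range required by great-expectations 0.15.x (quoted above)
GX_URLLIB3 = ("1.25.4", "1.27")
# Assumption: stand-in for "airflow pinned to 2.x"
AIRFLOW_URLLIB3 = ("2.0", None)

print(ranges_overlap(*GX_URLLIB3, *AIRFLOW_URLLIB3))  # False
```

Because the two ranges are disjoint, a resolver has nothing to pick, which is why both packages cannot be installed into the same environment.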

@VladShuvalov

Are there any loose timelines around when this can be resolved?

@cburroughs
Contributor

I'm sorry, I know these sorts of "me too" comments are rarely of much help. I wanted to highlight that great-expectations at the pinned version has a variety of upper bounds constraints: https://raw.githubusercontent.com/great-expectations/great_expectations/0.15.50/requirements.txt

altair>=4.0.0,<4.2.1
pydantic>=1.10.4,<2.0
urllib3>=1.25.4,<1.27

And at least for us the problem isn't so much that "great expectations is old" but that being on the lower side of these transitive dependencies -- like the pydantic v1-v2 transitions -- has ever increasing opportunity costs. (In our particular transitive set pydantic <2 is also keeping us on pandas<2, which adds further to the expense.)

I know this doesn't change anything about the difficulty of migration, but I hope it clarifies the "cost" somewhat when this issue is next triaged.
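
For anyone who wants to audit their own environment for this kind of cap, here is a rough sketch using only the standard library. The function names are hypothetical, and the string check is deliberately naive (it only looks for `<` in the version part of each requirement), so treat the output as a starting point rather than a full analysis.

```python
from importlib import metadata


def upper_bounded(requirements):
    """Return requirement strings whose version part contains '<' or '<='."""
    result = []
    for req in requirements:
        # Drop environment markers such as '; python_version < "3.8"'
        version_part = req.split(";", 1)[0]
        if "<" in version_part:
            result.append(req)
    return result


def installed_upper_bounds(dist_name):
    """Upper-bounded requirements of an installed distribution, or [] if absent."""
    try:
        reqs = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []
    return upper_bounded(reqs)


# The three lines quoted above from the 0.15.50 requirements.txt
quoted = ["altair>=4.0.0,<4.2.1", "pydantic>=1.10.4,<2.0", "urllib3>=1.25.4,<1.27"]
print(upper_bounded(quoted))

# Empty list if great-expectations is not installed in this environment
print(installed_upper_bounds("great-expectations"))
```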

@am2222

am2222 commented Jun 4, 2024

Any updates on this? The latest version of datahub_action for GX also needs to get updated to reflect the latest changes. It is a one line change tho.

@shirshanka
Contributor

Just want to clarify which of these issues people are trying to solve:

  1. Use datahub_action with latest GX
  2. Install datahub ingestion sources inside one big venv (e.g. airflow)

@am2222

am2222 commented Jun 29, 2024

@shirshanka To get the datahub action working with the latest version of GX, I only had to modify a couple of lines of code to fix the class constructor. The bigger issue is that if airflow is installed with the datahub plugin, we cannot use the latest version of GX in our DAGs because of the version conflict.

@cburroughs
Contributor

Install datahub ingestion sources inside one big venv (e.g. airflow)

This one. We use a monorepo and minimizing the number of transitive dependency sets we are juggling maximizes the usefulness of said monorepo.

@jskrzypek

@shirshanka It looks like the changes that introduced pydantic v2 support in great-expectations will be easy to backport to 0.15.50. If I do that, would datahub consider using them as a springboard to support pydantic v2 for plugins?

@jskrzypek

If anyone wants it, I pushed it up to my fork, and here's the diff from 0.15.50. I am going to try patching datahub on a fork to consume this version of great expectations, and see if that works for us.

@hsheth2
Collaborator

hsheth2 commented Aug 23, 2024

We've done some work on our end in #11096. The main outcome of that is the GX validation action now lives in the acryl-datahub-gx-plugin package (published here https://pypi.org/project/acryl-datahub-gx-plugin/) instead of acryl-datahub[great-expectations], and supports newer versions of GX in addition to 0.15.x. It's currently an rc, pending a bit of manual testing we want to do.

For ingestion (e.g. snowflake/bigquery/redshift/other sql sources), we still depend on GX 0.15.50 for profiling, and that remains a particularly tricky dependency to loosen given the extent of the monkey-patching we've done to improve query efficiency.

If you're only using the Python SDKs, you can usually install acryl-datahub on its own, or add a limited set of plugins (e.g. acryl-datahub[sql-parser]), and so avoid our pin on GX.

We recommend not installing full ingestion sources into your main environment (e.g. avoid having a dependency on acryl-datahub[snowflake]), and recommend either using UI-based ingestion or isolating the programmatic ingestion pipelines using venvs. For Airflow, we have an example using the PythonVirtualenvOperator in our docs.
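
Outside of Airflow, the same isolation idea can be sketched with the standard library's venv module. This is an illustration under assumptions, not the example from the docs: the function names, the `.venv-ingest` directory, and `recipe.yml` are hypothetical, and it assumes the `datahub ingest -c <recipe>` CLI entry point is installed alongside the chosen plugin.

```python
import os
import subprocess
import sys
from pathlib import Path


def isolated_ingest_commands(venv_dir, plugin, recipe):
    """Build the commands that create a venv, install one source, and run a recipe."""
    venv = Path(venv_dir)
    # Executables live in Scripts/ on Windows and bin/ elsewhere
    bin_dir = venv / ("Scripts" if os.name == "nt" else "bin")
    return [
        [sys.executable, "-m", "venv", str(venv)],
        [str(bin_dir / "python"), "-m", "pip", "install", f"acryl-datahub[{plugin}]"],
        [str(bin_dir / "datahub"), "ingest", "-c", str(recipe)],
    ]


def run_isolated_ingest(venv_dir, plugin, recipe):
    """Run each command in order, stopping at the first failure."""
    for cmd in isolated_ingest_commands(venv_dir, plugin, recipe):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the plan only; call run_isolated_ingest(...) to actually execute it
    for cmd in isolated_ingest_commands(".venv-ingest", "snowflake", "recipe.yml"):
        print(" ".join(cmd))
```

The GX 0.15.50 pin then only constrains the throwaway venv, while the main environment stays free to use newer urllib3 and pydantic versions.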

However, I recognize that this isn't a full fix yet, and so I'll be leaving this issue open for now.

@jskrzypek those improvements sound great - we'd definitely be open to using the forked GX version that supports pydantic v2. The core acryl-datahub SDK already supports both pydantic 1 and 2, but many of our sources still require v1 because of the GX dependency.

@jskrzypek

@hsheth2 cool! Please feel free to just take over my fork of GX if you want. It shouldn't require much ongoing maintenance, but I don't really have the time or bandwidth to keep up with it.

I am not sure if GX would consider adopting it themselves, but I imagine a request to do so would be better received if it came from a project like datahub, since our company doesn't use GX directly.

@svdimchenko
Contributor

svdimchenko commented Jan 22, 2025

It is confusing to still be pinned, in 2025, to pydantic v1 and to a GE release from Mar 10, 2023. Can the GE dependency be removed from the sql-common requirements and marked as optional? (We don't actually use the profiling data.)

"great-expectations>=0.15.12, <=0.15.50",

I'm struggling to install both acryl-datahub[postgres] and tableauserverclient; the resolver fails with the following error:

 And because tableauserverclient==0.36 depends on urllib3>=2.2.2 and only tableauserverclient<=0.36 is available, we can conclude that great-expectations>=0.15.12,<=0.15.50 and
      tableauserverclient>=0.36 are incompatible.
