Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add EnsureHashAlgorithm #346

Closed
mih opened this issue May 9, 2023 · 7 comments
Closed

Add EnsureHashAlgorithm #346

mih opened this issue May 9, 2023 · 7 comments
Assignees

Comments

@mih
Copy link
Member

mih commented May 9, 2023

There is detection logic in MultiHash that could/should become exposed in this more standard fashion. For commands that support it (download, ls-file-collection) it would also make sense to implement EnsureMultiHash

@mslw
Copy link
Collaborator

mslw commented May 11, 2023

MultiHash uses hashlib, and hashlib provides constant attributes hashlib.algorithms_guaranteed (doc) and hashlib.algorithms_available (both sets) - seems like a perfect fit.

@mslw mslw self-assigned this May 12, 2023
@mslw
Copy link
Collaborator

mslw commented May 12, 2023

TIL from hashlib docs:

hashlib.algorithms_available: A set containing the names of the hash algorithms that are available in the running Python interpreter. These names will be recognized when passed to new() [but have no named constructors in hashlib -- mslw]

hashlib.new Is a generic constructor that takes the string name of the desired algorithm as its first parameter. It also exists to allow access to the above listed hashes as well as any other algorithms that your OpenSSL library may offer. The named constructors are much faster than new() and should be preferred.

So with the way MultiHash currently works (hr = getattr(hashlib, h.lower(), None)), we need to check against algorithms_guaranteed.

We could also extend MultiHash to use the additional ones through new() if not found in attributes. However, we probably care much less about the non-guaranteed algorithms. This is what's guaranteed and additionally available on my laptop:

>>> import hashlib
>>> hashlib.algorithms_guaranteed
{'shake_256', 'sha224', 'sha3_512', 'md5', 'sha3_256', 'sha3_224', 'shake_128', 'sha3_384', 'sha512', 'sha384', 'blake2s', 'blake2b', 'sha1', 'sha256'}
>>> hashlib.algorithms_available - hashlib.algorithms_guaranteed  # set difference
{'sm3', 'md4', 'ripemd160', 'sha512_256', 'whirlpool', 'md5-sha1', 'sha512_224'}

And example usage:

>>> hasattr(hashlib, "ripemd160")
False
>>> h = hashlib.new("ripemd160")   # must go through new
>>> h.update(b"Nobody inspects the spammish repetition")
>>> h.hexdigest()
'cc4a5ce1b3df48aec5d22d1f16b894a0b894eccc'

I will probably use algorithms_guaranteed for now.

@mslw
Copy link
Collaborator

mslw commented May 15, 2023

Trying to build EnsureHashAlgorithm as a naive EnsureChoice derivation (see draft PR above) and testing on ls-file-collection, I noticed that hash names are already coerced to a list (by joint_constraints if I'm reading correctly) when they reach param_constraints:

❱ datalad ls-file-collection --hash md5 directory foo
[ERROR  ] 1 command parameter constraint violation
hash=['md5']
  is not one of ({'sha3_224', 'sha3_384', 'shake_128', 'sha3_256', 'sha1', 'sha256', 'shake_256', 'sha3_512', 'sha512', 'sha224', 'md5', 'blake2b', 'blake2s', 'sha384'},)

I just wanted to confirm whether this is an intended behavior, and ask about the correct way to implement the constraint - should it handle a list, or handle a single item, but be somehow wrapped inside parameter_constraints?

@mslw
Copy link
Collaborator

mslw commented May 16, 2023

I was doubly wrong in my comment above:

  • When coming from CLI, hash is a list, because it is a parameter with --append (argparse behavior). The order of validation is fine (first param_constraints, then joint_constraints). When coming from CLI it can be a string.
  • I also took too much of a shortcut with EnsureChoice - the error message shows that my constraint compared againist the entire set, not set elements.

I think I've got it now - parameter constraint should expect a string, but the command should get EnsureHashAlgorithm | EnsureIterableOf(EnsureHashAlgorithm) 😄

@mslw
Copy link
Collaborator

mslw commented May 16, 2023

EnsureHashAlgorithm | EnsureListOf(EnsureHashAlgorithm) pases valid inputs correctly, but the combination is not perfect when it comes to error messaging:

  • hash="abc" will fail EnsureHashAlgorithm (bad name) -> go into EnsureListOf which will iterate letter by letter (!) -> "a" will fail EnsureHashAlgorithm again; reported as:
hash='abc'
  does not match any of 2 alternatives
    - is not one of ('sha3_256', 'shake_256', 'sha3_384', 'md5', 'shake_128', 'sha384', 'sha3_224', 'blake2s', 'sha1', 'blake2b', 'sha224', 'sha512', 'sha256', 'sha3_512')
    - is not one of ('sha3_256', 'shake_256', 'sha3_384', 'md5', 'shake_128', 'sha384', 'sha3_224', 'blake2s', 'sha1', 'blake2b', 'sha224', 'sha512', 'sha256', 'sha3_512')
  • hash=["md5", "abc"] will fail EnsureHashAlgorithm (is a list) -> go into EnsureListOf -> "md5" will pass EnsureHashAlgorithm -> "abc" will fail EnsureHashAlgorithm; reported as:
hash=['md5', 'abc']
  does not match any of 2 alternatives
    - is not one of ('sha3_256', 'shake_256', 'sha3_384', 'md5', 'shake_128', 'sha384', 'sha3_224', 'blake2s', 'sha1', 'blake2b', 'sha224', 'sha512', 'sha256', 'sha3_512')
    - is not one of ('sha3_256', 'shake_256', 'sha3_384', 'md5', 'shake_128', 'sha384', 'sha3_224', 'blake2s', 'sha1', 'blake2b', 'sha224', 'sha512', 'sha256', 'sha3_512')

Maybe an EnsureMultiHash to handle str or list is in fact the way (or at least custom error messaging for current use).

Sorry for spamming the issue, I am working on it in small chunks and keeping notes here.

@mih
Copy link
Member Author

mih commented May 22, 2023

@mslw

EnsureHashAlgorithm() | EnsureListOf(EnsureHashAlgorithm) passes valid inputs correctly, but the combination is not perfect when it comes to error messaging

The approach is good. That the messaging is suboptimal exposes two issues:

  1. the self-description of EnsureChoice is not good
  2. the error handling of EnsureIterableOf is not good enough

The following diff sketches a solution to both issues. I have not checked what would break when applied in general, but in your PR this changes the error message to

❯ datalad ls-file-collection --hash abc directory .
[ERROR  ] 1 command parameter constraint violation
hash=['abc']
  does not match any of 2 alternatives
    - is not one of ('md5', 'shake_256', 'sha512', 'sha3_224', 'sha3_384', 'sha256', 'blake2b', 'sha384', 'sha3_512', 'sha1', 'sha224', 'shake_128', 'blake2s', 'sha3_256')
    - list item is not one of {'md5', 'shake_256', 'sha512', 'sha3_224', 'sha3_384', 'sha256', 'blake2b', 'sha384', 'sha3_512', 'sha1', 'sha224', 'shake_128', 'blake2s', 'sha3_256'} 
diff --git a/datalad_next/constraints/basic.py b/datalad_next/constraints/basic.py
index e60cbae..4a47a11 100644
--- a/datalad_next/constraints/basic.py
+++ b/datalad_next/constraints/basic.py
@@ -275,6 +275,9 @@ class EnsureChoice(Constraint):
     def short_description(self):
         return '{%s}' % ', '.join([repr(c) for c in self._allowed])
 
+    def __str__(self):
+        return f"one of {self.short_description()}"
+
 
 class EnsureKeyChoice(EnsureChoice):
     """Ensure value under a key in an input is in a set of possible values"""
diff --git a/datalad_next/constraints/compound.py b/datalad_next/constraints/compound.py
index 99fe8d6..bb0d87f 100644
--- a/datalad_next/constraints/compound.py
+++ b/datalad_next/constraints/compound.py
@@ -77,10 +77,12 @@ class EnsureIterableOf(Constraint):
             iter = self._iter_type(
                 self._item_constraint(i) for i in value
             )
-        except TypeError as e:
+        except (ConstraintError, TypeError) as e:
             self.raise_for(
                 value,
-                "cannot coerce to target (item) type",
+                "{itertype} item is not {itype}",
+                itertype=self._iter_type.__name__,
+                itype=self._item_constraint,
                 __caused_by__=e,
             )
         if self._min_len is not None or self._max_len is not None:

@mih
Copy link
Member Author

mih commented Oct 27, 2023

This was added in #492

@mih mih closed this as completed Oct 27, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants