Support for parameter dependencies #1

Open
kno10 opened this issue Jan 18, 2014 · 1 comment

Comments

kno10 commented Jan 18, 2014

I'm looking for a good tool to benchmark ELKI ( http://elki.dbs.ifi.lmu.de/ ) clustering performance across parameters.

The problem is that the parameters aren't as nicely uniform as in your examples, and they have strong interdependencies.

The most interesting parameter is obviously the clustering algorithm. Say I'm looking only at k-means and DBSCAN for this example (there are many more in ELKI, which is why I could use benchmarking tool support).

  • k-means has the key parameters "kmeans.k" (the number of clusters) and the initialization method. Randomized initialization methods will also have a seed parameter, to fix the random seed.
  • for DBSCAN, the key parameters are the distance function, the radius epsilon (which depends a lot on the distance function), and minPts, which interacts with the radius: a larger radius will need a larger minPts.

The big challenge here is the dependencies between the parameters. The simplest one is that the "k" parameter only exists for k-means, whereas for DBSCAN one needs to choose the distance function, minPts and epsilon. But then, there are also k-means initialization heuristics that have parameters of their own, such as the random seed...

Will 3X be able to handle such complex cases?

Owner

netj commented Jan 20, 2014

Thanks for your input.

Unfortunately, 3X does not yet support such nested/dependent parameters explicitly. However, I think there is a relatively simple way to emulate them for now without losing much of the tool's functionality. By defining all dependent parameters at the top level without any special structure, and assigning a special value (e.g., null or undef) to all irrelevant parameters, you can achieve an effect similar to having dependent parameters.

In your example of benchmarking ELKI, you could define an additional null value for all dependent input parameters:

  • algorithm
    • kmeans
    • DBSCAN
    • ...
  • k
    • null
    • 3
    • 4
    • ...
  • distance
    • null
    • Euclidean
    • ...
  • ...

Because all input parameter values will be available to your program as environment variables, you can easily pick out the values of the relevant dependent parameters based on the value of algorithm (e.g., the value of k when algorithm=kmeans). Of course, since 3X won't rule out invalid input combinations (e.g., algorithm=DBSCAN k=3 distance=Euclidean or algorithm=kmeans k=null distance=null), you will need to take extra care when using some features, such as generating the full combination (cross product) of parameter values for planning runs. However, many features will still work well with such a flattened parameter space, such as charting an output metric across algorithms, or charting the effect of a dependent parameter of a particular algorithm.
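As a rough illustration of the idea above, a wrapper script could read the flattened parameters from its environment and keep only the ones that matter for the chosen algorithm. This is a minimal sketch, not part of 3X: the variable names (`algorithm`, `k`, `distance`, `epsilon`, `minPts`), the `"null"` sentinel string, and the `RELEVANT` table are all assumptions made for the example.

```python
import os

# Assumed sentinel meaning "not applicable"; the thread suggests null or undef.
NULL = "null"

# Hypothetical mapping from algorithm to the flattened parameters it uses.
RELEVANT = {
    "kmeans": ["k"],
    "DBSCAN": ["distance", "epsilon", "minPts"],
}


def relevant_params(env):
    """Return only the parameters that apply to the selected algorithm.

    `env` is any mapping of names to string values, e.g. os.environ.
    """
    algorithm = env["algorithm"]
    params = {"algorithm": algorithm}
    for name in RELEVANT[algorithm]:
        value = env.get(name, NULL)
        if value == NULL:
            # A required dependent parameter was left at the sentinel value.
            raise ValueError(f"{name} must be set when algorithm={algorithm}")
        params[name] = value
    return params


if __name__ == "__main__":
    # Only runs when the script is started with the parameters in its environment.
    if "algorithm" in os.environ:
        print(relevant_params(os.environ))
```

Parameters that belong to another algorithm are simply ignored here, so a run such as algorithm=kmeans k=3 distance=Euclidean would still execute; rejecting those runs needs a separate check.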

Our initial plan related to this issue was to provide a way to define general, user-defined constraints over the input parameter space, so that 3X can automatically rule out invalid cases. However, I now see that supporting dependent (or hierarchical) parameters can be more intuitive, and has a lot of use cases in the data mining and machine learning domains.
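Until something like that exists, the same filtering can be approximated outside the tool. The sketch below builds the full cross product of a flattened parameter space and keeps only the combinations that satisfy a simple constraint. It is an illustration under stated assumptions rather than 3X functionality: the `SPACE` values mirror the example list in this comment, and `OWNED_BY`, `is_valid`, and `valid_combinations` are hypothetical names.

```python
from itertools import product

# Assumed sentinel meaning "not applicable".
NULL = "null"

# Flattened parameter space, mirroring the example list in this thread.
SPACE = {
    "algorithm": ["kmeans", "DBSCAN"],
    "k": [NULL, "3", "4"],
    "distance": [NULL, "Euclidean"],
}

# Hypothetical constraint table: each dependent parameter and the algorithm that owns it.
OWNED_BY = {"k": "kmeans", "distance": "DBSCAN"}


def is_valid(combo):
    """A dependent parameter must be set exactly when its owning algorithm is selected."""
    for name, owner in OWNED_BY.items():
        is_set = combo[name] != NULL
        if is_set != (combo["algorithm"] == owner):
            return False
    return True


def valid_combinations(space):
    """Yield every combination from the cross product that passes is_valid."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        combo = dict(zip(names, values))
        if is_valid(combo):
            yield combo


if __name__ == "__main__":
    for combo in valid_combinations(SPACE):
        print(combo)
```

With this space, the 12 raw combinations reduce to 3 valid ones, and both invalid examples mentioned above (algorithm=DBSCAN k=3 distance=Euclidean and algorithm=kmeans k=null distance=null) are rejected.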

I will keep this issue open to collect more concrete ideas until we make 3X handle dependent parameters natively.
