GSoC_2018_project_detox
...continuing from last year, featuring Michele, who was a student last year, as a mentor.
This project is the third of its kind: attempting to clean the internals of Shogun and replace obsolete code and concepts with more modern counterparts. This year, we want to focus on data representations and the linear algebra API.
- Viktor (github: vigsterkr, IRC: wiking)
- Michele (github: micmn, IRC: micmn)
- Pan (github: oxphos, IRC: OXPHOS)
Medium to difficult
You need to know
- C++
- In particular, type systems and type safety (i.e. experience with a type-safe language such as C++ or Java)
- Linear algebra in computers, Eigen3, Shogun's `linalg`
- Machine learning code basics
- Design patterns and software engineering principles
For every sub-project:
- Write down a list of classes/methods/concepts that will need change (there are comments below)
- Think (and discuss) how every sub-project's problems could be solved efficiently
- Write down pseudo-code of what the API should look like
- Write down pseudo-code of what the internals would look like
- Draft a minimal prototype of how you want to implement your change
- Work on a one-by-one basis
Here are some sub-projects. We are open to more:
NOTE: A GSoC project will address multiple (or ideally all) of those topics.
We want to modernize Shogun's main data representation, `CFeatures`.
In order to make thread-safety easier, we plan to make the interface immutable. This means making all methods `const`.
Everything that changes the state or content of a features object will have to create a copy first, and return a new instance (that might share the underlying memory).
The first step is to come up with a list of all non-`const` methods in features and decide which ones can be made `const` easily.
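The copy-on-change idea above can be sketched as follows. This is a minimal illustration under assumptions, not Shogun's actual `CFeatures`: the class `ImmutableFeatures` and its methods `scaled` and `shares_data_with` are hypothetical names, and the data is reduced to a flat vector of doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a features class (not Shogun's CFeatures).
class ImmutableFeatures
{
public:
    explicit ImmutableFeatures(std::vector<double> data)
        : m_data(std::make_shared<const std::vector<double>>(std::move(data))),
          m_scale(1.0)
    {
    }

    // Read-only accessors are const and never modify shared state.
    std::size_t size() const { return m_data->size(); }
    double get(std::size_t i) const { return m_scale * (*m_data)[i]; }

    // A "changing" operation is const as well: it returns a new instance
    // that shares the underlying buffer and only records the change.
    ImmutableFeatures scaled(double factor) const
    {
        ImmutableFeatures result(*this); // copies the shared_ptr, not the data
        result.m_scale *= factor;
        return result;
    }

    // True if both instances point at the same underlying buffer.
    bool shares_data_with(const ImmutableFeatures& other) const
    {
        return m_data == other.m_data;
    }

private:
    std::shared_ptr<const std::vector<double>> m_data;
    double m_scale;
};
```

Because every public method is `const` and the buffer is held as `const`, several threads can read the same instance without locking in this sketch.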
We want to support slicing data points (not components), i.e. generating views, behind a clean and simple API. This will replace the existing subsets, see `void CFeatures::add_subset(SGVector<index_t> subset)`, `CSubsetStack`, etc.
We envision an API along the lines of
features.view(indices) # returns new instance of features that points to the same data
There already is a WIP pull request here.
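To make the view idea concrete, here is a small sketch, assuming each data point is a single double and using a hypothetical class `FeaturesView` (the WIP pull request may look quite different). A view stores only an index list plus a shared pointer to the original data, so creating one copies no data, and views of views compose.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch; names are illustrative, not Shogun's API.
class FeaturesView
{
public:
    explicit FeaturesView(std::vector<double> data)
        : m_data(std::make_shared<const std::vector<double>>(std::move(data)))
    {
        // Initially the view covers every data point in order.
        for (std::size_t i = 0; i < m_data->size(); ++i)
            m_indices.push_back(i);
    }

    std::size_t num_vectors() const { return m_indices.size(); }
    double get(std::size_t i) const { return (*m_data)[m_indices.at(i)]; }

    // Returns a new instance pointing at the same data. The indices are
    // relative to this view, so a view of a view resolves correctly.
    FeaturesView view(const std::vector<std::size_t>& indices) const
    {
        FeaturesView result(*this); // shares m_data
        result.m_indices.clear();
        for (std::size_t idx : indices)
            result.m_indices.push_back(m_indices.at(idx));
        return result;
    }

    // Exposed only so the sharing can be checked in this example.
    const void* data_ptr() const { return m_data.get(); }

private:
    std::shared_ptr<const std::vector<double>> m_data;
    std::vector<std::size_t> m_indices;
};
```

The original object is left untouched by `view`, which is what removes the need for the push/pop style of a subset stack in this sketch.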
Once immutable features and views are done, we would really like to see a multithreaded version of cross-validation implemented, using shared memory for the features (no cloning), see here.
Something along the lines of (could be more elegant, but this is to illustrate the purpose)
#pragma omp parallel for
for (auto i in range(num_folds))
{
inds_train, inds_test = splitting->fold(i) # generate train/test indices
auto feats_train = features->view(inds_train) # returns training view instance (thread-safe, no data copying)
auto feats_test = features->view(inds_test) # same for the validation set
# ... same for labels
fold_machine = machine->clone() # clones learning machine instance (without data), cheap
fold_machine->fit(feats_train, labels_train) # non-const call, but on this thread's own instance, so thread-safe
result[i] = evaluation(fold_machine->predict(feats_test), labels_test)
}
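A compilable toy version of the loop above might look like the following. Everything here is an assumption made for illustration: `MeanMachine` is a made-up learner that predicts the mean training label, folds are assigned by index modulo `num_folds`, and the shared read-only `labels` vector stands in for both features and labels. If OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy learner (hypothetical): predicts the mean of the training labels.
struct MeanMachine
{
    double mean = 0.0;

    void fit(const std::vector<double>& labels,
             const std::vector<std::size_t>& inds)
    {
        double sum = 0.0;
        for (std::size_t i : inds)
            sum += labels[i];
        mean = inds.empty() ? 0.0 : sum / inds.size();
    }

    double predict() const { return mean; }
};

// Returns the mean squared error of each fold.
std::vector<double> cross_validate(const std::vector<double>& labels,
                                   int num_folds)
{
    std::vector<double> result(num_folds, 0.0);
    const MeanMachine prototype{};

    #pragma omp parallel for
    for (int fold = 0; fold < num_folds; ++fold)
    {
        // Index lists play the role of views: the data itself is shared.
        std::vector<std::size_t> train, test;
        for (std::size_t j = 0; j < labels.size(); ++j)
            (static_cast<int>(j % num_folds) == fold ? test : train).push_back(j);

        MeanMachine machine = prototype; // cheap per-iteration copy
        machine.fit(labels, train);      // mutates only the local copy

        double sq_err = 0.0;
        for (std::size_t j : test)
        {
            const double diff = machine.predict() - labels[j];
            sq_err += diff * diff;
        }
        result[fold] = test.empty() ? 0.0 : sq_err / test.size(); // one slot per fold
    }
    return result;
}
```

Each iteration writes to its own slot of `result` and only reads the shared data, so no synchronization is needed in this sketch.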
We would like to move towards an un-templated features type, and leave it to algorithms to check types at runtime (this needs to be done only once per algorithm). The old classes can stay as specializations, but the public API should not expose them, or should expose only a small set of them.
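One possible shape for such a runtime check is sketched below. The names (`FeatureType`, `TypeTag`, `DenseFeatures`, `mean`) are hypothetical and chosen for this example only. The public entry point takes the un-templated base class, inspects the type tag once, and then dispatches to a templated implementation.

```cpp
#include <cassert>
#include <stdexcept>
#include <utility>
#include <vector>

enum class FeatureType { Float32, Float64 };

// Un-templated public type (hypothetical).
class Features
{
public:
    virtual ~Features() = default;
    virtual FeatureType type() const = 0;
};

// Maps a C++ type to its runtime tag.
template <typename T> struct TypeTag;
template <> struct TypeTag<float> { static constexpr FeatureType value = FeatureType::Float32; };
template <> struct TypeTag<double> { static constexpr FeatureType value = FeatureType::Float64; };

// Templated specialization, kept out of the public API.
template <typename T>
class DenseFeatures : public Features
{
public:
    explicit DenseFeatures(std::vector<T> v) : values(std::move(v)) {}
    FeatureType type() const override { return TypeTag<T>::value; }
    std::vector<T> values;
};

// Typed implementation of the algorithm.
template <typename T>
double mean_impl(const DenseFeatures<T>& f)
{
    double sum = 0.0;
    for (T v : f.values)
        sum += static_cast<double>(v);
    return f.values.empty() ? 0.0 : sum / f.values.size();
}

// Public algorithm: checks the type once, then runs typed code.
double mean(const Features& f)
{
    switch (f.type())
    {
    case FeatureType::Float32:
        return mean_impl(static_cast<const DenseFeatures<float>&>(f));
    case FeatureType::Float64:
        return mean_impl(static_cast<const DenseFeatures<double>&>(f));
    }
    throw std::invalid_argument("unsupported feature type");
}
```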
Next, features should be easily constructed from files, streams, matrices, etc. For that, we would like to have a global factory method that can do exactly that, i.e.
* `Features features(Matrix)`
* `Features features(SparseMatrix)`
* `Features features(FileStream)`
* `Features features(ArrowBuffer)`
* `Features features(Strings)`
The resulting features would transparently load/read/stream the underlying data and all behave the same.
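As a rough sketch of such a factory, the overload set below builds features from a matrix, a list of strings, or a text stream. All types here are assumptions for illustration: `Matrix` and `Strings` are plain standard-library aliases rather than Shogun types, the stream format (one whitespace-separated data point per line) is invented, and sparse and Arrow inputs are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <memory>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical common interface; all factories return this type.
class Features
{
public:
    virtual ~Features() = default;
    virtual std::size_t num_vectors() const = 0;
};

using Matrix = std::vector<std::vector<double>>; // one row per data point
using Strings = std::vector<std::string>;

class DenseFeatures : public Features
{
public:
    explicit DenseFeatures(Matrix rows) : m_rows(std::move(rows)) {}
    std::size_t num_vectors() const override { return m_rows.size(); }
private:
    Matrix m_rows;
};

class StringFeatures : public Features
{
public:
    explicit StringFeatures(Strings s) : m_strings(std::move(s)) {}
    std::size_t num_vectors() const override { return m_strings.size(); }
private:
    Strings m_strings;
};

// Global factory: one name, overloaded on the data source.
std::shared_ptr<const Features> features(Matrix m)
{
    return std::make_shared<DenseFeatures>(std::move(m));
}

std::shared_ptr<const Features> features(Strings s)
{
    return std::make_shared<StringFeatures>(std::move(s));
}

// Reads one whitespace-separated data point per line (assumed format).
std::shared_ptr<const Features> features(std::istream& in)
{
    Matrix rows;
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        std::vector<double> row;
        double value;
        while (fields >> value)
            row.push_back(value);
        if (!row.empty())
            rows.push_back(std::move(row));
    }
    return features(std::move(rows));
}
```

Callers only ever see `std::shared_ptr<const Features>`, so code written against the result does not depend on where the data came from.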
Features should not offer direct access to the underlying memory, e.g. the feature vector for `CDenseFeatures`.
The reason is that direct access requires knowing the word size (float32, float64) of the underlying data, which would complicate the algorithm code (algorithms should be independent of the word size).
As a consequence, we would like to remove all methods that return vectors/matrices/etc.
This is a long list of changes, and we need to start by collecting all cases, and discussing which ones to change first.
Instead, we would like to perform computation over features using an API based on iterators. We have already made a few transitions in this direction, for example see Perceptron. Have a look here for inspiration.
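The sketch below illustrates the general idea of iterating over data points and asking each one to perform an operation, rather than reading its memory. It is not the iterator API already used in Shogun's Perceptron; `FeatureVector`, `DenseVector`, `FeatureSet` and `project` are hypothetical names for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// A data point that can only be used through operations (hypothetical).
class FeatureVector
{
public:
    virtual ~FeatureVector() = default;
    virtual double dot(const std::vector<double>& w) const = 0;
};

// The word size T is known only inside this class.
template <typename T>
class DenseVector : public FeatureVector
{
public:
    explicit DenseVector(std::vector<T> v) : m_values(std::move(v)) {}

    double dot(const std::vector<double>& w) const override
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < m_values.size() && i < w.size(); ++i)
            sum += static_cast<double>(m_values[i]) * w[i];
        return sum;
    }

private:
    std::vector<T> m_values;
};

using FeatureSet = std::vector<std::unique_ptr<FeatureVector>>;

// Algorithm code: iterates over data points, never touches raw memory,
// and is therefore independent of float32 vs. float64 storage.
std::vector<double> project(const FeatureSet& set, const std::vector<double>& w)
{
    std::vector<double> out;
    for (const auto& vec : set)
        out.push_back(vec->dot(w));
    return out;
}
```

The per-element virtual call is a cost of this particular sketch; a real design could dispatch once per feature container instead.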
Picking up this
Many parts of Shogun are split into `.cpp` and `.h`, which makes compilation and development much easier: changes in the implementation of a low-level data structure do not cause the whole project to be recompiled.
There are many cases, however, where this is not done (especially in templated code).
This part of the project can serve as a nice initial contribution.
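For templated code, one common way to get the same split is explicit template instantiation: the header only declares the template, and the source file defines it and instantiates it for a fixed set of types. The example below shows both halves in one listing, with comments marking where the file boundary would be; the function `sum_values` and the file names are made up for illustration.

```cpp
#include <cassert>
#include <vector>

// ---- sum_values.h (hypothetical header: declarations only) ----
template <typename T>
T sum_values(const std::vector<T>& values);

// Tell including code not to instantiate these; the .cpp provides them.
extern template float sum_values<float>(const std::vector<float>&);
extern template double sum_values<double>(const std::vector<double>&);

// ---- sum_values.cpp (hypothetical source: definition + instantiations) ----
template <typename T>
T sum_values(const std::vector<T>& values)
{
    T total = T();
    for (const T& v : values)
        total += v;
    return total;
}

// Explicit instantiations: the only types the library supports here.
template float sum_values<float>(const std::vector<float>&);
template double sum_values<double>(const std::vector<double>&);
```

With this layout, editing the body of `sum_values` only recompiles the one source file. The trade-off is that users cannot instantiate the template for types that are not listed.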
More topics that one could work on include: serialization, smart pointers, using std:: instead of our own data structures, and more. Let us know if something in particular is of interest to you. We might also change things around while the project is running ;)
Cleaning up the internal APIs of Shogun will give you a lot of exposure to advanced software concepts, and you can be sure to learn a lot about API design, algorithms, and good practices in software development. The project will make it much easier to develop clean code within Shogun, and as such make the project more attractive for scientists to implement their work in.