python-linear-regression

Python linear-regression

Continuous target

Python implementation of multivariate linear regression. It supports both continuous and categorical (nominal) covariates and implicitly drops null values from the data. Both single-node and distributed modes return JSON with a structure such as:

{
    "agegroup_50-59y": {
        "coef": 3.2571304466,
        "p_values": 0.7387901953,
        "std_err": 9.5993224941,
        "t_values": 0.3393083677
    },
    "intercept": {
        "coef": 1042.2837545842,
        "p_values": 0.0,
        "std_err": 45.1479998776,
        "t_values": 23.0859342033
    },
    ...
}
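As an illustration of how this output might be consumed, the sketch below parses the sample above with the standard `json` module and lists the terms whose p-value falls below a threshold. The helper name `significant_terms` and the 0.05 cutoff are assumptions made for this example; they are not part of the package.

```python
import json

# Sample output from the example above, written as strict JSON
# (the elided "..." entries are left out).
raw = """
{
    "agegroup_50-59y": {"coef": 3.2571304466, "p_values": 0.7387901953,
                        "std_err": 9.5993224941, "t_values": 0.3393083677},
    "intercept": {"coef": 1042.2837545842, "p_values": 0.0,
                  "std_err": 45.1479998776, "t_values": 23.0859342033}
}
"""


def significant_terms(result, alpha=0.05):
    """Return the names of terms whose p-value is below alpha (hypothetical helper)."""
    return sorted(name for name, stats in result.items()
                  if stats["p_values"] < alpha)


result = json.loads(raw)
significant = significant_terms(result)

# Each t-value is the coefficient divided by its standard error.
recomputed_t = {name: stats["coef"] / stats["std_err"]
                for name, stats in result.items()}

print(significant)
print(recomputed_t)
```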

Categorical target

Multinomial logistic regression, implemented as a log-linear model by fitting one logistic regression per class against all the others (one-vs-rest). Only single-node mode is supported; for distributed mode, use SGD regression instead.

The output is JSON in which each category has its own set of coefficients:

{
    "AD": {
        "agegroup_50-59y": {
            "coef": 3.2571304466,
            "p_values": 0.7387901953,
            "std_err": 9.5993224941,
            "t_values": 0.3393083677
        },
        "intercept": {
            "coef": 1042.2837545842,
            "p_values": 0.0,
            "std_err": 45.1479998776,
            "t_values": 23.0859342033
        },
        ...
    },
    "CN": ...
}
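The one-class-versus-the-others idea can be sketched as follows. This is not the package's code: it fits each binary logistic regression with a plain NumPy Newton iteration instead of statsmodels, and it derives p-values from a normal approximation of the Wald statistic (stored under the `t_values` key to mirror the structure above). The third label `MCI`, the covariate `score` and the simulated data are assumptions made for illustration.

```python
import math
import numpy as np


def fit_logit(X, y, max_iter=50, tol=1e-8):
    """Binary logistic regression via Newton's method.

    Returns the coefficients and their standard errors
    (square roots of the diagonal of the inverse Hessian).
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the Hessian at the final estimate for the standard errors.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(hessian)))
    return beta, std_err


def one_vs_rest(X, labels, names):
    """Fit one 'class versus the others' model per category."""
    result = {}
    for cls in np.unique(labels):
        y = (labels == cls).astype(float)
        beta, std_err = fit_logit(X, y)
        result[str(cls)] = {
            name: {
                "coef": float(b),
                # Two-sided p-value from the normal approximation.
                "p_values": math.erfc(abs(b / se) / math.sqrt(2.0)),
                "std_err": float(se),
                "t_values": float(b / se),
            }
            for name, b, se in zip(names, beta, std_err)
        }
    return result


# Simulated data: three overlapping classes along one hypothetical covariate.
rng = np.random.default_rng(0)
classes = np.array(["AD", "CN", "MCI"])
centers = np.array([-1.5, 1.5, 0.0])
idx = rng.integers(0, 3, size=300)
labels = classes[idx]
score = centers[idx] + rng.normal(size=300)
X = np.column_stack([np.ones(300), score])

result = one_vs_rest(X, labels, ["intercept", "score"])
print(sorted(result))
```

Each top-level key holds the coefficients of the model that separates that category from all the others, which matches the nesting shown above.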

Single-node mode

Regression coefficients and statistics are calculated using the statsmodels package.

Usage

docker run python-linear-regression compute

Distributed mode

Aggregate mode pools the local betas and XtX matrices from each node, builds the normal equations from these blocks, and solves them for the aggregated betas (see original R implementation). The resulting betas are identical to those from single-node mode. Standard errors, t-statistics and p-values, however, are estimated from the local standard errors and may differ from the single-node values, because residuals are not available in the aggregation step and the residual standard error therefore cannot be computed. Computing it exactly would require propagating the aggregated betas back to the nodes, recalculating the standard error there, and performing one more aggregation step.
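The pooling step can be sketched in NumPy as follows. Each node contributes its local betas and its XtX block; because XtX_k times beta_k equals Xt_k y_k, summing these products recovers the right-hand side of the global normal equations. The function names, the dictionary layout and the simulated data are assumptions made for this example, not the package's API. The last part sketches the extra round described above (sending the aggregated betas back to the nodes to collect residual sums of squares), which the text says is needed for exact standard errors.

```python
import numpy as np


def intermediate(X, y):
    """Per-node step: local betas plus the X'X block (hypothetical layout)."""
    XtX = X.T @ X
    beta = np.linalg.solve(XtX, X.T @ y)
    return {"beta": beta, "XtX": XtX}


def aggregate(local_results):
    """Pool the blocks and solve the global normal equations."""
    XtX_total = sum(r["XtX"] for r in local_results)
    # X_k'y_k is recovered from each node as (X_k'X_k) @ beta_k.
    Xty_total = sum(r["XtX"] @ r["beta"] for r in local_results)
    return np.linalg.solve(XtX_total, Xty_total), XtX_total


# Simulate three nodes of different sizes.
rng = np.random.default_rng(0)
true_beta = np.array([2.0, -1.0, 0.5])
nodes = []
for n in (40, 60, 80):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ true_beta + rng.normal(scale=0.3, size=n)
    nodes.append((X, y))

beta_agg, XtX_total = aggregate([intermediate(X, y) for X, y in nodes])

# Reference: ordinary least squares on the pooled data (single-node case).
X_all = np.vstack([X for X, _ in nodes])
y_all = np.concatenate([y for _, y in nodes])
beta_single = np.linalg.lstsq(X_all, y_all, rcond=None)[0]

# Extra round: each node reports its residual sum of squares for beta_agg,
# which is enough to compute the exact residual standard error.
rss = sum(float(np.sum((y - X @ beta_agg) ** 2)) for X, y in nodes)
n_total = sum(len(y) for _, y in nodes)
sigma2 = rss / (n_total - len(beta_agg))
std_err_agg = np.sqrt(sigma2 * np.diag(np.linalg.inv(XtX_total)))

# Single-node standard errors for comparison.
resid = y_all - X_all @ beta_single
sigma2_single = float(resid @ resid) / (n_total - X_all.shape[1])
std_err_single = np.sqrt(sigma2_single * np.diag(np.linalg.inv(X_all.T @ X_all)))

print(np.allclose(beta_agg, beta_single))
print(np.allclose(std_err_agg, std_err_single))
```

Only the small XtX blocks, betas and scalar residual sums leave the nodes in this sketch; the row-level data stays local.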

Usage

Distributed mode runs in two steps:

  1. compute --mode intermediate
  2. compute --mode aggregate --job-ids 1 2 3

Intermediate mode returns the same output as single-node mode, and aggregate mode combines these outputs into a single estimate.

Build (for contributors)

Run: ./build.sh

Test (for contributors)

Run: ./tests/test.sh

Publish (for contributors)

Run: ./publish.sh