
Python linear-regression

Continuous target

Python implementation of multivariate linear regression. It supports both numerical and categorical variables and implicitly drops null values from the data. Both single-node and distributed modes return JSON with a structure such as

{
    'agegroup_50-59y': {
        'coef': 3.2571304466,
        'p_values': 0.7387901953,
        'std_err': 9.5993224941,
        't_values': 0.3393083677
    },
    'intercept': {
        'coef': 1042.2837545842,
        'p_values': 0.0,
        'std_err': 45.1479998776,
        't_values': 23.0859342033
    },
    ...
}

Categorical target

Multinomial logistic regression implemented as a log-linear model by fitting one logistic regression per class, each class versus the others (one-vs-rest). Only single-node mode is supported; for distributed mode, use SGD regression.

The output is JSON in which each category has its own coefficients:

{
  'AD': {
    'agegroup_50-59y': {
        'coef': 3.2571304466,
        'p_values': 0.7387901953,
        'std_err': 9.5993224941,
        't_values': 0.3393083677
    },
    'intercept': {
        'coef': 1042.2837545842,
        'p_values': 0.0,
        'std_err': 45.1479998776,
        't_values': 23.0859342033
    },
    ...
  },
  'CN': ...
}

Single-node mode

Regression coefficients and statistics are calculated using the statsmodels package.

Usage

docker run python-linear-regression compute

Distributed mode

Aggregate mode pools the local betas and XtX matrices, constructs the normal equations from these blocks, and solves them for the aggregated betas (see original R implementation). The aggregated betas are identical to those from single-node mode. Standard errors, t-statistics and p-values, however, are estimated from the local standard errors and might differ from the single-node case. This is because residuals are not available in the aggregation step, so the standard error of the residuals cannot be computed there. Computing it would require propagating the aggregated betas back to the nodes, recalculating the standard error there, and performing one more aggregation step.
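
The pooling of betas can be illustrated with a small NumPy sketch. It relies on the fact that each node's normal equations give XtX_k · beta_k = Xty_k, so summing the blocks over nodes recovers the pooled normal equations. The function names `local_fit` and `aggregate` are hypothetical, and the sketch covers only the betas, not the standard-error estimates.

```python
import numpy as np


def local_fit(X, y):
    """Intermediate step on one node: return the local beta and XtX block."""
    xtx = X.T @ X
    beta = np.linalg.solve(xtx, X.T @ y)
    return beta, xtx


def aggregate(local_results):
    """Combine (beta, XtX) pairs from all nodes into one beta estimate."""
    # Sum of XtX_k is the pooled XtX; XtX_k @ beta_k recovers each Xty_k.
    xtx_total = sum(xtx for _, xtx in local_results)
    xty_total = sum(xtx @ beta for beta, xtx in local_results)
    return np.linalg.solve(xtx_total, xty_total)


# Synthetic demo: three nodes holding different rows of the same model.
rng = np.random.default_rng(0)
true_beta = np.array([2.0, -1.0, 0.5])
nodes = []
for n_rows in (40, 60, 80):
    X = np.column_stack([np.ones(n_rows), rng.normal(size=(n_rows, 2))])
    y = X @ true_beta + rng.normal(scale=0.1, size=n_rows)
    nodes.append((X, y))

aggregated = aggregate([local_fit(X, y) for X, y in nodes])

# Reference: a single fit on all rows pooled together.
X_all = np.vstack([X for X, _ in nodes])
y_all = np.concatenate([y for _, y in nodes])
pooled = np.linalg.lstsq(X_all, y_all, rcond=None)[0]
```

Here `aggregated` and `pooled` agree up to floating-point error, which matches the statement that the aggregated betas equal the single-node betas.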

Usage

Distributed mode has two modes:

compute --mode intermediate
compute --mode aggregate --job-ids 1 2 3

Intermediate mode returns the same output as single-node mode, and aggregate mode combines these outputs into a single estimate.

Build (for contributors)

Run: ./build.sh

Test (for contributors)

Run: ./tests/test.sh

Publish (for contributors)

Run: ./publish.sh
