
runtraining.sh

Chris Churas edited this page Oct 15, 2018 · 11 revisions

This script runs CDeep3M training to generate what is known as a trained model. CDeep3M actually trains three separate models (1fm, 3fm, and 5fm), which are described below.

This script is a wrapper that invokes CreateTrainJob.m and run_all_train.sh.

NOTE: If multiple GPUs are available, this script will run the training in parallel.

Example:

runtraining.sh --numiterations 1000 ~/augtrain ~/model

Usage:

usage: runtraining.sh [-h] [--1fmonly] [--numiterations NUMITERATIONS]
                              [--gpu GPU] [--base_lr BASE_LR] [--power POWER] 
                              [--momentum MOMENTUM] 
                              [--weight_decay WEIGHT_DECAY] 
                              [--average_loss AVERAGE_LOSS] 
                              [--lr_policy POLICY] [--iter_size ITER_SIZE] 
                              [--snapshot_interval SNAPSHOT_INTERVAL]
                              [--validation_dir VALIDATION_DIR]
                              [--additerations NUMITERATIONS]
                              [--retrain TRAINOUTDIR]
                              augtrainimages trainoutdir

              Version: 1.6.0

              Trains Deep3M model using caffe with training data
              passed into script. 

              For further information about parameters below please see: 
              https://github.com/BVLC/caffe/wiki/Solver-Prototxt

    
positional arguments:
  augtrainimages       Augmented training data from PreprocessTrainingData.m
  trainoutdir          Desired output directory

optional arguments:
  -h, --help           show this help message and exit
  --1fmonly            Only train 1fm model
  --gpu                Which GPU to use, can be a number ie 0 or 1 or
                       all to use all GPUs (default all)
  --base_learn         Base learning rate (default 1e-02)
  --power              Used in poly and sigmoid lr_policies. (default 0.8)
  --momentum           Indicates how much of the previous weight will be 
                       retained in the new calculation. (default 0.9)
  --weight_decay       Factor of (regularization) penalization of large
                       weights (default 0.0005)
  --average_loss       Number of iterations to use to average loss
                       (default 16)
  --lr_policy          Learning rate policy (default poly)
  --iter_size          Accumulate gradients across batches through the 
                       iter_size solver field. (default 8)
  --snapshot_interval  How often caffe should output a model and solverstate.
                       (default 2000)
  --numiterations      Number of training iterations to run (default 30000)
  --validation_dir     Augmented validation data
  --retrain            Continue training trained models from train directory
                       passed in here, writing results to trainoutdir
  --additerations      If --retrain is set, this value is added to the
                       latest iteration model file found in the 
                       <retrain dir>/1fm/trainedmodel directory. For example,
                       if the latest iteration found in 
                       <retrain>/1fm/trainedmodel is 10000 and 
                       --additerations is set to 500 then training will
                       run to 10500 iterations. (default 2000)
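
To make the interplay of the base learning rate, --power, and --numiterations more concrete, the sketch below evaluates Caffe's `poly` learning rate policy, lr = base_lr * (1 - iter / max_iter) ^ power, with the defaults listed above. The function name `poly_lr` is hypothetical and only for illustration; it is not part of CDeep3M or Caffe.

```python
def poly_lr(iteration, base_lr=1e-2, power=0.8, max_iter=30000):
    """Learning rate under Caffe's 'poly' policy at a given iteration.

    Defaults mirror the help text above: base learning rate 1e-02,
    power 0.8, and 30000 training iterations.
    """
    return base_lr * (1.0 - iteration / max_iter) ** power


# Print the rate at the start, the midpoint, and the end of training.
for it in (0, 15000, 30000):
    print(it, poly_lr(it))
```

The rate starts at the base learning rate and decays to zero at the final iteration; a larger --power makes the early decay steeper.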

This script will create a new directory, denoted as trainoutdir in usage above, which will be structured as follows:

Tree view of directory showing only base files and directories

├── 1fm
│   ├── log
│   ├── trainedmodel
├── 3fm
│   ├── log
│   ├── trainedmodel
├── 5fm
│   ├── log
│   ├── trainedmodel
├── parallel.jobs
├── readme.txt
├── valid_file.txt
└── train_file.txt
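
As a quick sanity check after a run, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of CDeep3M: the helper name `missing_dirs` is hypothetical, and it only tests for the sub-directories shown in the tree.

```python
from pathlib import Path

# Directory names taken from the tree view above.
MODELS = ("1fm", "3fm", "5fm")
SUBDIRS = ("log", "trainedmodel")


def missing_dirs(trainoutdir, models=MODELS):
    """Return the expected sub-directories that are absent from trainoutdir."""
    root = Path(trainoutdir).expanduser()
    return [
        str(Path(model) / sub)
        for model in models
        for sub in SUBDIRS
        if not (root / model / sub).is_dir()
    ]
```

Calling `missing_dirs("~/model")` returns an empty list when every expected directory exists. The `models` argument can be narrowed, for example to `("1fm",)`, when only that model is of interest.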

1fm, 3fm, 5fm

These directories contain the trained models. Each one has an identical structure, shown here with actual files:

├── #fm
│   ├── deploy.prototxt
│   ├── label_class_selection.prototxt
│   ├── log
│   │   ├── caffe.bin.INFO
│   │   ├── caffe.bin.ip-XXX.ubuntu.log.INFO.XXXX
│   │   └── out.log
│   ├── solver.prototxt
│   ├── trainedmodel
│   │   ├── 1fm_classifer_iter_###.caffemodel
│   │   └── 1fm_classifer_iter_###.solverstate
│   ├── train_file.txt
│   ├── train_val.prototxt
│   └── valid_file.txt

The actual trained model resides under #fm/trainedmodel in the .caffemodel file.

The other file, .solverstate, is needed to resume training but is not needed for prediction.

The ### in the .caffemodel and .solverstate file names denotes the iteration at which the file was written.

As caffe trains, it outputs multiple .caffemodel files, so #fm/trainedmodel can contain several files from different iterations.
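
Because the iteration is encoded in the file name, the most recent model can be found by parsing the names. The sketch below also reproduces the --additerations arithmetic from the usage text (latest iteration plus the added iterations). It is an illustration only and does not claim to be how runtraining.sh does this internally; the function names `latest_iteration` and `retrain_target` are hypothetical.

```python
import re
from pathlib import Path

# Matches names such as 1fm_classifer_iter_10000.caffemodel
SNAPSHOT_RE = re.compile(r"_iter_(\d+)\.caffemodel$")


def latest_iteration(trainedmodel_dir):
    """Return the highest iteration among the .caffemodel files, or None."""
    iterations = []
    for path in Path(trainedmodel_dir).iterdir():
        match = SNAPSHOT_RE.search(path.name)
        if match:
            iterations.append(int(match.group(1)))
    return max(iterations, default=None)


def retrain_target(trainedmodel_dir, additerations=2000):
    """Iteration at which retraining would stop: latest + additerations."""
    latest = latest_iteration(trainedmodel_dir)
    if latest is None:
        raise FileNotFoundError("no .caffemodel files in %s" % trainedmodel_dir)
    return latest + additerations
```

For a directory whose newest file is at iteration 10000, `retrain_target(directory, 500)` gives 10500, matching the example in the --additerations description.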
