jitter matrix processing direction has large cpu/memory cache contention and overhead #151

Open · diablodale opened this issue May 28, 2020 · 0 comments
The min matrix operator reads the calculation-direction variable 2,075,759 times more than necessary on each matrix calculation. It only needs to read it once per matrix calculation.

When the matrix is processed, a call is made to the subclass's calc_cell() method for each cell. The order in which the cells are iterated will be one of the options provided [by m_direction].
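For context, a matrix operator in min-api looks roughly like the sketch below (based on the min-devkit examples such as min.jit.clamp; the class name and the identity pass-through are illustrative only, not the code under discussion). The framework walks the cells and invokes calc_cell() for each one, in the order selected by m_direction.

```cpp
#include "c74_min.h"
using namespace c74::min;

class my_matrix_op : public object<my_matrix_op>, public matrix_operator<> {
public:
    // Invoked once per cell; the framework chooses the iteration order
    // (forward / reverse / bidirectional) from m_direction.
    template<class matrix_type, size_t planecount>
    cell<matrix_type, planecount> calc_cell(cell<matrix_type, planecount> input, const matrix_info& info, matrix_coord& position) {
        cell<matrix_type, planecount> output;
        for (auto plane = 0; plane < info.planecount(); ++plane)
            output[plane] = input[plane];    // identity pass-through, just for the sketch
        return output;
    }
};

MIN_EXTERNAL(my_matrix_op);
```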

For example, in the min-api's jit_calculate_vector (excerpted):

```cpp
if (self->m_min_object.direction() != matrix_operator_base::iteration_direction::forward) {
    ip = ip_last;
    op = op_last;
    for (auto j = n - 1; j >= 0; --j) {
        matrix_coord position(j, i);
        if (self->m_min_object.direction() == matrix_operator_base::iteration_direction::bidirectional) {
            const std::array<U, 1> tmp = {{*op}};
```

1. This variable and its access are not thread-safe.
2. If the value changes in the middle of an ndim-parallelized operation, the effects are indeterminate.
3. To get the direction value, the CPU must dereference a pointer and read the variable (2 + up to framewidth) times in every jit_calculate_vector section (up to one section per row) of every ndim-parallelized block, for every matrix frame.

As an example, with the current min-api codebase, it is possible for a single frame of an HD color (rgba) image to access m_direction the following number of times:

(2 + framewidth of 1920 = 1922 potential accesses per jit_calculate_vector section) × (1080 HD rows) = 2,075,760 accesses

That's very poor. 😞 It is a single value that only needs to be read from the class variable once per frame.

And worse, access to this class member variable is spread across multiple CPU caches. Since it is a read/write variable reached through a pointer dereference, and could be changed by any thread running on any core of any CPU, the cache line is constantly thrashed, making access slow.

There are multiple possible improvements, and combinations of them can be used.

1. Make direction compile-time. Only a subset of externals need to process the cells of a matrix in a calculation order that changes dynamically at runtime.
2. Set direction on the class. For the subset of externals that need a dynamically changing calculation order, and further the subset of customers of those externals that choose a non-standard direction, let them set it as an argument on the max object or as a read-only attribute. Then it is a const set at class construction.
3. Copy direction from the class's member variable into a const local parameter/struct per ndim section/thread. This means the direction cannot change during the scope of a single matrix_calc, and since the value is copied into a param/struct for each ndim section, it gets cache locality on the CPU running that ndim thread (see the sketch below).
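A minimal sketch of option 3, assuming a jit_calculate_vector-style loop shaped like the excerpt above (the struct, function, and variable names are illustrative, not the actual min-api code):

```cpp
// Sketch of improvement 3: read the shared, mutable direction member exactly once
// per ndim section, copy it into a const local, and let the per-row/per-cell loops
// use only that local copy. Illustrative names; not the actual min-api code.
#include <cstddef>
#include <vector>

enum class iteration_direction { forward, reverse, bidirectional };

struct fake_operator {
    iteration_direction m_direction { iteration_direction::forward };   // could be changed by another thread
    iteration_direction direction() const { return m_direction; }
};

// One "ndim section": a band of rows handed to a worker thread.
void process_section(const fake_operator& op_obj, std::vector<float>& cells,
                     std::size_t width, std::size_t row_begin, std::size_t row_end) {
    // Single read of the shared member for this entire section; it now lives on
    // this worker's stack/register and cannot change mid-section.
    const iteration_direction dir = op_obj.direction();

    for (std::size_t row = row_begin; row < row_end; ++row) {
        float* p = cells.data() + row * width;
        if (dir != iteration_direction::forward) {
            for (std::size_t j = width; j-- > 0;)
                p[j] += 1.0f;    // stand-in for calc_cell(); consults local `dir`, never op_obj.direction()
        }
        else {
            for (std::size_t j = 0; j < width; ++j)
                p[j] += 1.0f;    // stand-in for calc_cell()
        }
    }
}
```

With this shape, an HD rgba frame reads the direction once per ndim section (or once per frame, if the copy is taken at matrix_calc time) instead of 2,075,760 times, and the copy is a thread-local value rather than a contended, shared cache line.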