= CLVisc: a (3+1)D viscous hydrodynamic program parallelized on GPU using OpenCL =
The program is used to simulate the evolution of strongly coupled quark gluon plasma produced in relativistic heavy ion collisions.
Please cite the following paper if you use CLVisc in publications or reuse part of its code:
@article{Pang:2018zzo,
author = "Pang, Long-Gang and Petersen, Hannah and Wang, Xin-Nian",
title = "{Pseudorapidity distribution and decorrelation of
anisotropic flow within CLVisc hydrodynamics}",
year = "2018",
eprint = "1802.04449",
archivePrefix = "arXiv",
primaryClass = "nucl-th",
SLACcitation = "%%CITATION = ARXIV:1802.04449;%%"
}
Copyright (C) 2018, Long-Gang Pang
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
==Installation==
First of all, get CLVisc using:
git clone https://gitlab.com/snowhitiger/PyVisc.git
1. Install OpenCL
(1) For MacBook Pro, OpenCL is supported by default, skip this step.
(2) For Linux using Nvidia GPU, install CUDA -- Shipped with OpenCL. url: https://developer.nvidia.com/cuda-downloads
(3) For Linux using AMD GPU, install AMD APP SDK from http://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/
(4) For super cluster with GPUs, ask the IT-help people for the OpenCL/Cuda support.
2. Download and install the latest Anaconda from https://www.continuum.io/downloads
Important: please choose Python 2.7 (although most of the code works well with Python 3.*).
Notice: if you use Python 2.7 from another source, please also install matplotlib, h5py, pandas, and sympy.
These four packages ship with Anaconda by default.
3. Install PyOpenCL
`conda install -c conda-forge pyopencl`
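After this step, a quick sanity check (a minimal sketch, not part of CLVisc; the import is guarded so it also runs where PyOpenCL is missing) can confirm that PyOpenCL sees your OpenCL platforms and devices:

```python
# List the OpenCL platforms and devices visible to PyOpenCL.
def list_opencl_devices():
    try:
        import pyopencl as cl
    except ImportError:
        return []  # PyOpenCL (or an OpenCL driver) is not installed
    devices = []
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            devices.append((platform.name, device.name))
    return devices

if __name__ == "__main__":
    found = list_opencl_devices()
    if not found:
        print("No OpenCL devices found -- check the driver/SDK installation.")
    for platform_name, device_name in found:
        print("%s :: %s" % (platform_name, device_name))
```

If your GPU shows up in this list, PyOpenCL is ready for the hydro runs below.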
At this point you can run ideal.py and visc.py in the pyvisc/ directory to run one ideal and one viscous hydro event;
the hydrodynamic evolution will write the evolution history and the freeze-out hyper-surface to the result/
directory. To calculate smooth particle spectra or sample hadrons from the hyper-surface, one needs to
additionally install *cmake* and *gsl*.
4. Install cmake
(1) For MacBook,
Run in the Terminal app:
`ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null`
and press the *enter/return* key; wait for the command to finish (this installs Homebrew).
Then run:
`brew install cmake`
Done! You can now use cmake.
(2) For Linux, install cmake with your distribution's package manager, e.g. `sudo apt-get install cmake`.
5. Install gsl library
(1) For MacBook, `brew install gsl`
(2) For Linux, install gsl with your distribution's package manager, e.g. `sudo apt-get install libgsl-dev`.
6. Event-by-event hydro using trento initial condition
(1) Install trento
cd 3rdparty/trento_with_participant_plane/
mkdir build
cd build
cmake ..
make
make install
(2) Compile the MC sampling spectra calculation subroutines
cd sampler/
mkdir build
cd build
cmake ..
make
(3) Compile the smooth spectra calculation subroutines
cd CLSmoothSpec/
mkdir build
cd build
cmake ..
make
Notice: this step will fail on macOS versions newer than 10.8 because Apple deprecated some OpenCL functions.
Please use this subroutine on a GPU cluster/Linux machine.
The program will be updated in the future to account for these macOS changes.
(4) In PyVisc/pyvisc/, modify the output path in ebe_trento.py and run
#python ebe_trento.py collision_sys centrality gpu_id num_of_events
python ebe_trento.py auau200 0_5 0 100
python ebe_trento.py pbpb2760 20_30 0 100
python ebe_trento.py pbpb5020 30_40 0 100
7. If you see the error `No module named 'mako'`, install Mako using
pip install --user Mako
8. Modify cache_dir in cache.py if the cluster does not have a /tmp directory:
anaconda2/lib/python2.7/site-packages/pyopencl-2016.2-py2.7-linux-x86_64.egg/pyopencl/cache.py
322 def _create_built_program_from_source_cached(ctx, src, options_bytes,
323 devices, cache_dir, include_path):
324 from os.path import join
325
326 if cache_dir is None:
327 import appdirs
328 #cache_dir = join(appdirs.user_cache_dir("pyopencl", "pyopencl"),
329 # "pyopencl-compiler-cache-v2-py%s" % (
330 # ".".join(str(i) for i in sys.version_info),))
331 cache_dir = '/lustre/nyx/hyihp/lpang/tmp/'
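As an alternative to patching cache.py, on Linux it may be enough to redirect the cache root through the environment before pyopencl is first imported, since appdirs.user_cache_dir honors XDG_CACHE_HOME (a sketch; whether this works depends on your appdirs/pyopencl versions):

```python
import os

# Redirect the per-user cache root to a writable directory *before*
# pyopencl is imported; appdirs.user_cache_dir honors XDG_CACHE_HOME on
# Linux, so the PyOpenCL compiler cache lands under this path instead of
# the default ~/.cache.  The path is the same cluster scratch directory
# used in the cache.py patch above.
os.environ["XDG_CACHE_HOME"] = "/lustre/nyx/hyihp/lpang/tmp"

# import pyopencl   # must happen only after the variable is set
```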
==Examples==
1. cd pyvisc
python ideal.py
2. cd pyvisc
python visc.py
Notice: visc.py has large GPU memory demands and can only run on GPUs with more than 5 GB of memory.
3. cd pyvisc
modify ebe_trento.py to run event-by-event hydrodynamics with the Trento initial condition
==The BSZ dependence==
With a 385*385*115 lattice, the running time per step is:
BSZ            8      16     32     64     128
Ideal (s)      0.37   0.218  0.178  0.155  0.157
Visc (s), GPU  3.12   1.65   1.17   1.01   1.17
Visc (s), CPU  6.64   6.45   6.63   7.0    7.58
==The importance of coalesced reads from global memory==
Here I used NX=NY=NZ=201 for a test; in principle the time costs for visc_src_alongx,
visc_src_alongy, and visc_src_alongz should not differ. However, the line profiler,
run with {{{kernprof -l -v visc.py}}},
gives 41.9% vs 38.4% vs 6.9% of the run time for the x, y, and z directions.
Why is the difference so big? It is explained by the order of the data in global memory,
where we use:
{{{
for (int i = 0; i < NX; i++)
    for (int j = 0; j < NY; j++)
        for (int k = 0; k < NZ; k++) {
            pimn[i*NY*NZ + j*NZ + k] = some_number;
        }
}}}
The data is contiguous along the z direction, which makes reading from
global memory to local memory much faster in the z direction than in x or y,
because contiguous accesses can be coalesced.
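The same layout can be checked with NumPy, whose default C (row-major) order matches the for(i, j, k) loop above; the grid sizes here are illustrative only:

```python
import numpy as np

# Small illustrative grid; the real lattice takes NX, NY, NZ from the config.
NX, NY, NZ = 4, 5, 6
pimn = np.zeros((NX, NY, NZ))

# In C (row-major) order the stride of the last (z) axis is one element,
# so neighboring z cells sit next to each other in memory while
# neighboring x cells are NY*NZ elements apart.
itemsize = pimn.itemsize
assert pimn.strides == (NY * NZ * itemsize, NZ * itemsize, itemsize)

# The flat offset i*NY*NZ + j*NZ + k matches the kernel's indexing.
i, j, k = 2, 3, 4
pimn.reshape(-1)[i * NY * NZ + j * NZ + k] = 1.0
assert pimn[i, j, k] == 1.0
```

This is why the z-direction kernel, whose neighboring work-items read consecutive cells, gets coalesced memory transactions.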
Total time: 19.2475 s
File: visc.py
Function: IS_stepUpdate at line 165
Line # Hits Time Per Hit % Time Line Contents
==============================================================
165 @profile
166 def IS_stepUpdate(self, step):
167 #print "ideal update finished"
168 52 152 2.9 0.0 NX, NY, NZ, BSZ = self.cfg.NX, self.cfg.NY, self.cfg.NZ, self.cfg.BSZ
169
170 52 143954 2768.3 0.7 self.kernel_IS.visc_src_christoffel(self.queue, (NX*NY*NZ,), None,
171 52 134 2.6 0.0 self.d_IS_src, self.d_pi[step], self.ideal.d_ev[step],
172 52 581359 11180.0 3.0 self.ideal.tau, np.int32(step)).wait()
173
174 52 159298 3063.4 0.8 self.kernel_IS.visc_src_alongx(self.queue, (BSZ, NY, NZ), (BSZ, 1, 1),
175 52 143 2.8 0.0 self.d_IS_src, self.d_udx, self.d_pi[step], self.ideal.d_ev[step],
176 52 8055724 154917.8 41.9 self.eos_table, self.ideal.tau).wait()
177
178 #print "udx along x"
179
180 51 156991 3078.3 0.8 self.kernel_IS.visc_src_alongy(self.queue, (NX, BSZ, NZ), (1, BSZ, 1),
181 51 151 3.0 0.0 self.d_IS_src, self.d_udy, self.d_pi[step], self.ideal.d_ev[step],
182 51 7381515 144735.6 38.4 self.eos_table, self.ideal.tau).wait()
183
184 #print "udy along y"
185 51 157382 3085.9 0.8 self.kernel_IS.visc_src_alongz(self.queue, (NX, NY, BSZ), (1, 1, BSZ),
186 51 137 2.7 0.0 self.d_IS_src, self.d_udz, self.d_pi[step], self.ideal.d_ev[step],
187 51 1329880 26076.1 6.9 self.eos_table, self.ideal.tau).wait()
188
189 #print "udz along z"
190 51 302246 5926.4 1.6 self.kernel_IS.update_pimn(self.queue, (NX*NY*NZ,), None,
191 51 141 2.8 0.0 self.d_pi[3-step], self.d_goodcell, self.d_pi[1], self.d_pi[step],
192 51 82 1.6 0.0 self.ideal.d_ev[1], self.ideal.d_ev[2], self.d_udiff,
193 51 94 1.8 0.0 self.d_udx, self.d_udy, self.d_udz, self.d_IS_src,
194 51 978116 19178.7 5.1 self.eos_table, self.ideal.tau, np.int32(step)
195 ).wait()
==Usage of vloadn to speed up global data access==
In kernel_visc.cl, one needs to load (pitt, pitx, pity, pitz) and (pixt, pixx, pixy, pixz) in src_alongx,
needs to load (pitt, pitx, pity, pitz) and (piyt, piyx, piyy, piyz) in src_alongy,
needs to load (pitt, pitx, pity, pitz) and (pizt, pizx, pizy, pizz) in src_alongz;
Since the data are stored in for(i, j, k) order, so loading data along z is faster than along y.
However, self.kernel_visc.kt_src_alongx is much faster than loading data along y.
This may be caused by continues address for pixx, pixy, pizx.
I tried to use vload4 but it does not speed up the code, which means the compiler already did the optimization.
Total time: 8.60907 s
File: visc.py
Function: visc_stepUpdate at line 124
Line # Hits Time Per Hit % Time Line Contents
==============================================================
124 @profile
125 def visc_stepUpdate(self, step):
126 ''' Do step update in kernel with KT algorithm for visc evolution
127 Args:
128 gpu_ev_old: self.d_ev[1] for the 1st step,
129 self.d_ev[2] for the 2nd step
130 step: the 1st or the 2nd step in runge-kutta
131 '''
132 # upadte d_Src by KT time splitting, along=1,2,3 for 'x','y','z'
133 # input: gpu_ev_old, tau, size, along_axis
134 # output: self.d_Src
135 108 324 3.0 0.0 NX, NY, NZ, BSZ = self.cfg.NX, self.cfg.NY, self.cfg.NZ, self.cfg.BSZ
136 108 334097 3093.5 3.9 self.kernel_visc.kt_src_christoffel(self.queue, (NX*NY*NZ, ), None,
137 108 278 2.6 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
138 108 162 1.5 0.0 self.d_pi[step], self.eos_table,
139 108 854615 7913.1 9.9 self.ideal.tau, np.int32(step)
140 ).wait()
141
142 108 296185 2742.5 3.4 self.kernel_visc.kt_src_alongx(self.queue, (BSZ, NY, NZ), (BSZ, 1, 1),
143 108 275 2.5 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
144 108 166 1.5 0.0 self.d_pi[step], self.eos_table,
145 108 1313623 12163.2 15.3 self.ideal.tau).wait()
146
147 108 296962 2749.6 3.4 self.kernel_visc.kt_src_alongy(self.queue, (NX, BSZ, NZ), (1, BSZ, 1),
148 108 251 2.3 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
149 108 167 1.5 0.0 self.d_pi[step], self.eos_table,
150 108 2435409 22550.1 28.3 self.ideal.tau).wait()
151
152 108 296962 2749.6 3.4 self.kernel_visc.kt_src_alongz(self.queue, (NX, NY, BSZ), (1, 1, BSZ),
153 108 261 2.4 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
154 108 180 1.7 0.0 self.d_pi[step], self.eos_table,
155 108 1093978 10129.4 12.7 self.ideal.tau).wait()
156
157 # if step=1, T0m' = T0m + d_Src*dt, update d_ev[2]
158 # if step=2, T0m = T0m + 0.5*dt*d_Src, update d_ev[1]
159 # Notice that d_Src=f(t,x) at step1 and
160 # d_Src=(f(t,x)+f(t+dt, x(t+dt))) at step2
161 # output: d_ev[] where need_update=2 for step 1 and 1 for step 2
162 108 409861 3795.0 4.8 self.kernel_visc.update_ev(self.queue, (NX*NY*NZ, ), None,
163 108 277 2.6 0.0 self.ideal.d_ev[3-step], self.ideal.d_ev[1],
164 108 169 1.6 0.0 self.d_pi[0], self.d_pi[3-step],
165 108 147 1.4 0.0 self.ideal.d_Src,
166 108 1274723 11803.0 14.8 self.eos_table, self.ideal.tau, np.int32(step)).wait()
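The predictor-corrector pattern spelled out in the comments above (step 1: T0m' = T0m + dt*d_Src; step 2: T0m = T0m + 0.5*dt*(f(t) + f(t+dt))) is a Heun-type second-order Runge-Kutta scheme. Stripped of the hydrodynamics, it can be sketched as follows (a toy scalar version; f is a hypothetical source function):

```python
import math

def heun_step(f, t, y, dt):
    """One Heun (2nd-order Runge-Kutta) step, mirroring the two-pass
    update in visc_stepUpdate."""
    # step 1 (predictor, like updating d_ev[2]): y' = y + dt * f(t, y)
    y_pred = y + dt * f(t, y)
    # step 2 (corrector, like updating d_ev[1]):
    # y = y + 0.5 * dt * (f(t, y) + f(t + dt, y'))
    return y + 0.5 * dt * (f(t, y) + f(t + dt, y_pred))

# Toy check against dy/dt = -y, whose exact solution is exp(-t).
y, t, dt = 1.0, 0.0, 0.01
for _ in range(100):
    y = heun_step(lambda t, y: -y, t, y, dt)
    t += dt
assert abs(y - math.exp(-1.0)) < 1e-4
```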