= CLVisc: a (3+1)D viscous hydrodynamic program parallelized on GPU using OpenCL =
The program is used to simulate the evolution of strongly coupled quark gluon plasma produced in relativistic heavy ion collisions.
Please cite the following paper if you use CLVisc in publications or reuse part of its code:
@article{Pang:2018zzo,
author = "Pang, Long-Gang and Petersen, Hannah and Wang, Xin-Nian",
title = "{Pseudorapidity distribution and decorrelation of
anisotropic flow within CLVisc hydrodynamics}",
year = "2018",
eprint = "1802.04449",
archivePrefix = "arXiv",
primaryClass = "nucl-th",
SLACcitation = "%%CITATION = ARXIV:1802.04449;%%"
}
Copyright (C) 2018, Long-Gang Pang
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
==Installation==
First of all, get CLVisc using:
git clone https://gitlab.com/snowhitiger/PyVisc.git
1. Install OpenCL
(1) For MacBook Pro, OpenCL is supported by default, skip this step.
(2) For Linux using Nvidia GPU, install CUDA -- Shipped with OpenCL. url: https://developer.nvidia.com/cuda-downloads
(3) For Linux using AMD GPU, install AMD APP SDK from http://developer.amd.com/amd-accelerated-parallel-processing-app-sdk/
(4) For super cluster with GPUs, ask the IT-help people for the OpenCL/Cuda support.
2. Download and install the latest Anaconda from https://www.continuum.io/downloads
Important: please choose Python 2.7 (although most of the code works well with Python 3.*).
Notice: if you use Python 2.7 from another source, please also install matplotlib, h5py, pandas, and sympy.
These four packages ship with Anaconda by default.
3. Install PyOpenCL
`conda install -c conda-forge pyopencl`
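After this step, a quick sanity check (a minimal sketch, not part of CLVisc; the import is guarded so it also runs where PyOpenCL is missing) can confirm that PyOpenCL sees your OpenCL platforms and devices:

```python
# List the OpenCL platforms and devices visible to PyOpenCL.
def list_opencl_devices():
    try:
        import pyopencl as cl
    except ImportError:
        return []  # PyOpenCL (or an OpenCL driver) is not installed
    devices = []
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            devices.append((platform.name, device.name))
    return devices

if __name__ == "__main__":
    found = list_opencl_devices()
    if not found:
        print("No OpenCL devices found -- check the driver/SDK installation.")
    for platform_name, device_name in found:
        print("%s :: %s" % (platform_name, device_name))
```

If your GPU shows up in this list, PyOpenCL is ready for the hydro runs below.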
At this point you can run ideal.py and visc.py in the pyvisc/ directory to run one ideal and one viscous hydro event;
the hydrodynamic evolution will write the evolution history and the freeze-out hyper-surface to the result/
directory. To calculate smooth particle spectra or sample hadrons from the hyper-surface, one needs to
additionally install *cmake* and *gsl*.
4. Install cmake
(1) For MacBook,
Run in the Terminal app:
`ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" < /dev/null 2> /dev/null`
and press the *enter/return* key; wait for the command to finish (this installs Homebrew).
Then run:
`brew install cmake`
Done! You can now use cmake.
(2) For Linux, install cmake with your distribution's package manager, e.g. `sudo apt-get install cmake`.
5. Install gsl library
(1) For MacBook, `brew install gsl`
(2) For Linux, install gsl with your distribution's package manager, e.g. `sudo apt-get install libgsl-dev`.
6. Event-by-event hydro using trento initial condition
(1) Install trento
cd 3rdparty/trento_with_participant_plane/
mkdir build
cd build
cmake ..
make
make install
(2) Compile the MC sampling spectra calculation subroutines
cd sampler/
mkdir build
cd build
cmake ..
make
(3) Compile the smooth spectra calculation subroutines
cd CLSmoothSpec/
mkdir build
cd build
cmake ..
make
Notice: this step will fail on macOS versions newer than 10.8 because Apple deprecated some OpenCL functions.
Please use this subroutine on a GPU cluster/Linux machine.
The program will be updated in the future to account for these macOS changes.
(4) In PyVisc/pyvisc/, modify the output path in ebe_trento.py and run
#python ebe_trento.py collision_sys centrality gpu_id num_of_events
python ebe_trento.py auau200 0_5 0 100
python ebe_trento.py pbpb2760 20_30 0 100
python ebe_trento.py pbpb5020 30_40 0 100
7. If you see the error `No module named 'mako'`, install Mako using
pip install --user Mako
8. Modify cache_dir in cache.py if the cluster does not have a /tmp directory:
anaconda2/lib/python2.7/site-packages/pyopencl-2016.2-py2.7-linux-x86_64.egg/pyopencl/cache.py
322 def _create_built_program_from_source_cached(ctx, src, options_bytes,
323 devices, cache_dir, include_path):
324 from os.path import join
325
326 if cache_dir is None:
327 import appdirs
328 #cache_dir = join(appdirs.user_cache_dir("pyopencl", "pyopencl"),
329 # "pyopencl-compiler-cache-v2-py%s" % (
330 # ".".join(str(i) for i in sys.version_info),))
331 cache_dir = '/lustre/nyx/hyihp/lpang/tmp/'
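As an alternative to patching cache.py, on Linux it may be enough to redirect the cache root through the environment before pyopencl is first imported, since appdirs.user_cache_dir honors XDG_CACHE_HOME (a sketch; whether this works depends on your appdirs/pyopencl versions):

```python
import os

# Redirect the per-user cache root to a writable directory *before*
# pyopencl is imported; appdirs.user_cache_dir honors XDG_CACHE_HOME on
# Linux, so the PyOpenCL compiler cache lands under this path instead of
# the default ~/.cache.  The path is the same cluster scratch directory
# used in the cache.py patch above.
os.environ["XDG_CACHE_HOME"] = "/lustre/nyx/hyihp/lpang/tmp"

# import pyopencl   # must happen only after the variable is set
```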
==Examples==
1. cd pyvisc
python ideal.py
2. cd pyvisc
python visc.py
Notice: visc.py has large GPU memory demands and can only run on GPUs with more than 5 GB of memory.
3. cd pyvisc
modify ebe_trento.py to run event-by-event hydrodynamics with the Trento initial condition
==The BSZ dependence==
With a 385*385*115 lattice, the running time per step is:
BSZ            8      16     32     64     128
Ideal (s)      0.37   0.218  0.178  0.155  0.157
Visc (s), GPU  3.12   1.65   1.17   1.01   1.17
Visc (s), CPU  6.64   6.45   6.63   7.0    7.58
==The importance of coalesced reads from global memory==
Here I used NX=NY=NZ=201 for a test; in principle the time costs for visc_src_alongx,
visc_src_alongy, and visc_src_alongz should not differ. However, the line profiler,
run with {{{kernprof -l -v visc.py}}},
gives 41.9% vs 38.4% vs 6.9% of the run time for the x, y, and z directions.
Why is the difference so big? It is explained by the order of the data in global memory,
where we use:
{{{
for (int i = 0; i < NX; i++)
    for (int j = 0; j < NY; j++)
        for (int k = 0; k < NZ; k++) {
            pimn[i*NY*NZ + j*NZ + k] = some_number;
        }
}}}
The data is contiguous along the z direction, which makes reading from
global memory to local memory much faster in the z direction than in x or y,
because contiguous accesses can be coalesced.
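The same layout can be checked with NumPy, whose default C (row-major) order matches the for(i, j, k) loop above; the grid sizes here are illustrative only:

```python
import numpy as np

# Small illustrative grid; the real lattice takes NX, NY, NZ from the config.
NX, NY, NZ = 4, 5, 6
pimn = np.zeros((NX, NY, NZ))

# In C (row-major) order the stride of the last (z) axis is one element,
# so neighboring z cells sit next to each other in memory while
# neighboring x cells are NY*NZ elements apart.
itemsize = pimn.itemsize
assert pimn.strides == (NY * NZ * itemsize, NZ * itemsize, itemsize)

# The flat offset i*NY*NZ + j*NZ + k matches the kernel's indexing.
i, j, k = 2, 3, 4
pimn.reshape(-1)[i * NY * NZ + j * NZ + k] = 1.0
assert pimn[i, j, k] == 1.0
```

This is why the z-direction kernel, whose neighboring work-items read consecutive cells, gets coalesced memory transactions.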
Total time: 19.2475 s
File: visc.py
Function: IS_stepUpdate at line 165
Line # Hits Time Per Hit % Time Line Contents
==============================================================
165 @profile
166 def IS_stepUpdate(self, step):
167 #print "ideal update finished"
168 52 152 2.9 0.0 NX, NY, NZ, BSZ = self.cfg.NX, self.cfg.NY, self.cfg.NZ, self.cfg.BSZ
169
170 52 143954 2768.3 0.7 self.kernel_IS.visc_src_christoffel(self.queue, (NX*NY*NZ,), None,
171 52 134 2.6 0.0 self.d_IS_src, self.d_pi[step], self.ideal.d_ev[step],
172 52 581359 11180.0 3.0 self.ideal.tau, np.int32(step)).wait()
173
174 52 159298 3063.4 0.8 self.kernel_IS.visc_src_alongx(self.queue, (BSZ, NY, NZ), (BSZ, 1, 1),
175 52 143 2.8 0.0 self.d_IS_src, self.d_udx, self.d_pi[step], self.ideal.d_ev[step],
176 52 8055724 154917.8 41.9 self.eos_table, self.ideal.tau).wait()
177
178 #print "udx along x"
179
180 51 156991 3078.3 0.8 self.kernel_IS.visc_src_alongy(self.queue, (NX, BSZ, NZ), (1, BSZ, 1),
181 51 151 3.0 0.0 self.d_IS_src, self.d_udy, self.d_pi[step], self.ideal.d_ev[step],
182 51 7381515 144735.6 38.4 self.eos_table, self.ideal.tau).wait()
183
184 #print "udy along y"
185 51 157382 3085.9 0.8 self.kernel_IS.visc_src_alongz(self.queue, (NX, NY, BSZ), (1, 1, BSZ),
186 51 137 2.7 0.0 self.d_IS_src, self.d_udz, self.d_pi[step], self.ideal.d_ev[step],
187 51 1329880 26076.1 6.9 self.eos_table, self.ideal.tau).wait()
188
189 #print "udz along z"
190 51 302246 5926.4 1.6 self.kernel_IS.update_pimn(self.queue, (NX*NY*NZ,), None,
191 51 141 2.8 0.0 self.d_pi[3-step], self.d_goodcell, self.d_pi[1], self.d_pi[step],
192 51 82 1.6 0.0 self.ideal.d_ev[1], self.ideal.d_ev[2], self.d_udiff,
193 51 94 1.8 0.0 self.d_udx, self.d_udy, self.d_udz, self.d_IS_src,
194 51 978116 19178.7 5.1 self.eos_table, self.ideal.tau, np.int32(step)
195 ).wait()
==Usage of vloadn to speed up global data access==
In kernel_visc.cl, one needs to load (pitt, pitx, pity, pitz) and (pixt, pixx, pixy, pixz) in src_alongx,
needs to load (pitt, pitx, pity, pitz) and (piyt, piyx, piyy, piyz) in src_alongy,
needs to load (pitt, pitx, pity, pitz) and (pizt, pizx, pizy, pizz) in src_alongz;
Since the data are stored in for(i, j, k) order, so loading data along z is faster than along y.
However, self.kernel_visc.kt_src_alongx is much faster than loading data along y.
This may be caused by continues address for pixx, pixy, pizx.
I tried to use vload4 but it does not speed up the code, which means the compiler already did the optimization.
Total time: 8.60907 s
File: visc.py
Function: visc_stepUpdate at line 124
Line # Hits Time Per Hit % Time Line Contents
==============================================================
124 @profile
125 def visc_stepUpdate(self, step):
126 ''' Do step update in kernel with KT algorithm for visc evolution
127 Args:
128 gpu_ev_old: self.d_ev[1] for the 1st step,
129 self.d_ev[2] for the 2nd step
130 step: the 1st or the 2nd step in runge-kutta
131 '''
132 # upadte d_Src by KT time splitting, along=1,2,3 for 'x','y','z'
133 # input: gpu_ev_old, tau, size, along_axis
134 # output: self.d_Src
135 108 324 3.0 0.0 NX, NY, NZ, BSZ = self.cfg.NX, self.cfg.NY, self.cfg.NZ, self.cfg.BSZ
136 108 334097 3093.5 3.9 self.kernel_visc.kt_src_christoffel(self.queue, (NX*NY*NZ, ), None,
137 108 278 2.6 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
138 108 162 1.5 0.0 self.d_pi[step], self.eos_table,
139 108 854615 7913.1 9.9 self.ideal.tau, np.int32(step)
140 ).wait()
141
142 108 296185 2742.5 3.4 self.kernel_visc.kt_src_alongx(self.queue, (BSZ, NY, NZ), (BSZ, 1, 1),
143 108 275 2.5 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
144 108 166 1.5 0.0 self.d_pi[step], self.eos_table,
145 108 1313623 12163.2 15.3 self.ideal.tau).wait()
146
147 108 296962 2749.6 3.4 self.kernel_visc.kt_src_alongy(self.queue, (NX, BSZ, NZ), (1, BSZ, 1),
148 108 251 2.3 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
149 108 167 1.5 0.0 self.d_pi[step], self.eos_table,
150 108 2435409 22550.1 28.3 self.ideal.tau).wait()
151
152 108 296962 2749.6 3.4 self.kernel_visc.kt_src_alongz(self.queue, (NX, NY, BSZ), (1, 1, BSZ),
153 108 261 2.4 0.0 self.ideal.d_Src, self.ideal.d_ev[step],
154 108 180 1.7 0.0 self.d_pi[step], self.eos_table,
155 108 1093978 10129.4 12.7 self.ideal.tau).wait()
156
157 # if step=1, T0m' = T0m + d_Src*dt, update d_ev[2]
158 # if step=2, T0m = T0m + 0.5*dt*d_Src, update d_ev[1]
159 # Notice that d_Src=f(t,x) at step1 and
160 # d_Src=(f(t,x)+f(t+dt, x(t+dt))) at step2
161 # output: d_ev[] where need_update=2 for step 1 and 1 for step 2
162 108 409861 3795.0 4.8 self.kernel_visc.update_ev(self.queue, (NX*NY*NZ, ), None,
163 108 277 2.6 0.0 self.ideal.d_ev[3-step], self.ideal.d_ev[1],
164 108 169 1.6 0.0 self.d_pi[0], self.d_pi[3-step],
165 108 147 1.4 0.0 self.ideal.d_Src,
166 108 1274723 11803.0 14.8 self.eos_table, self.ideal.tau, np.int32(step)).wait()
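The predictor-corrector pattern spelled out in the comments above (step 1: T0m' = T0m + dt*d_Src; step 2: T0m = T0m + 0.5*dt*(f(t) + f(t+dt))) is a Heun-type second-order Runge-Kutta scheme. Stripped of the hydrodynamics, it can be sketched as follows (a toy scalar version; f is a hypothetical source function):

```python
import math

def heun_step(f, t, y, dt):
    """One Heun (2nd-order Runge-Kutta) step, mirroring the two-pass
    update in visc_stepUpdate."""
    # step 1 (predictor, like updating d_ev[2]): y' = y + dt * f(t, y)
    y_pred = y + dt * f(t, y)
    # step 2 (corrector, like updating d_ev[1]):
    # y = y + 0.5 * dt * (f(t, y) + f(t + dt, y'))
    return y + 0.5 * dt * (f(t, y) + f(t + dt, y_pred))

# Toy check against dy/dt = -y, whose exact solution is exp(-t).
y, t, dt = 1.0, 0.0, 0.01
for _ in range(100):
    y = heun_step(lambda t, y: -y, t, y, dt)
    t += dt
assert abs(y - math.exp(-1.0)) < 1e-4
```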