% 2.2.3 User Scripting
% -------------------
% TMP scheduler-specific section
The final part of the job script involves the commands that will be executed by the job.
This section should include all necessary commands to set up and run the tasks
your script is designed to perform. You can use any Linux command in this section,
ranging from a simple executable call to a complex loop iterating through multiple commands.\\
\noindent \textbf{Best Practice}: prefix any compute-heavy step with \tool{srun}.
This gives you better insight into the execution of your job.\\
\noindent Each software program may have its own execution framework; it is the responsibility of the script's author (i.e., you)
to review the software's documentation and understand its requirements.
Your script should be written to clearly specify the location of input and output files and the degree of parallelism needed.\\
\noindent Jobs that involve multiple interactions with data input and output files should make use of \api{TMPDIR},
a scheduler-provided workspace nearly 1~TB in size.
\api{TMPDIR} is created on the local disk of the compute node at the start of a job, offering faster I/O operations
compared to shared storage (provided over NFS).
A sample job script using \api{TMPDIR} is available at \texttt{/home/n/nul-uge/templateTMPDIR.sh}.
The job is instructed to:
\begin{enumerate}
\item change to \api{\$TMPDIR};
\item make the new directory \texttt{input};
\item copy data from \texttt{\$SLURM\_SUBMIT\_DIR/references/} to \texttt{input/}
(\api{\$SLURM\_SUBMIT\_DIR} is the directory from which the job was submitted);
\item make the new directory \texttt{results};
\item execute the program (which takes input from \texttt{\$TMPDIR/input/} and writes
output to \texttt{\$TMPDIR/results/});
\item copy the final results to an existing directory, \texttt{processed}, located in the
submission directory.
\end{enumerate}
% TODO: verify:
\api{TMPDIR} only exists for the duration of the job, though,
so it is very important to copy relevant results from it at job's end.
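
\noindent
As a rough illustration of the workflow just described, the following bash sketch stages data into \api{TMPDIR}, runs a program there, and copies the results back.
It is not the contents of \texttt{templateTMPDIR.sh}: the job name, the memory request, and the program \texttt{my\_program} are hypothetical placeholders, and the \texttt{processed} directory is assumed to exist already in the submission directory.
\begin{verbatim}
#!/encs/bin/bash
#SBATCH -J tmpdir-sketch     # hypothetical job name
#SBATCH --mem=1G             # adjust to your job's needs

# Work on the node-local scratch space provided by the scheduler
cd "$TMPDIR" || exit 1
mkdir input results

# Stage input data from the submission directory
cp -r "$SLURM_SUBMIT_DIR"/references/. input/

# Run the (hypothetical) program as a job step
srun "$SLURM_SUBMIT_DIR"/my_program input/ results/

# TMPDIR disappears when the job ends, so save the results first
cp -r results/. "$SLURM_SUBMIT_DIR"/processed/
\end{verbatim}
\noindent
Copying the results back is the last command on purpose: if the job is killed earlier (e.g., for exceeding its time limit), nothing in \api{TMPDIR} is preserved.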
% -------------- 2.3 Sample Job Script ----------------------
% -----------------------------------------------------------
\subsection{Sample Job Script}
\label{sect:sample-job-script}
Here is a basic job script, \file{tcsh.sh}, shown in \xf{fig:tcsh.sh}.
You can copy it from our \href{https://github.com/NAG-DevOps/speed-hpc}{GitHub repository}.
\begin{figure}[htpb]
\lstinputlisting[language=csh,frame=single,basicstyle=\ttfamily]{tcsh.sh}
\caption{Source code for \file{tcsh.sh}}
\label{fig:tcsh.sh}
\end{figure}
\noindent
The first line is the shell declaration (also known as a shebang) and sets the shell to \emph{tcsh}.
The lines that begin with \texttt{\#SBATCH} are directives for the scheduler.
\begin{itemize}
\item \option{-J} (or \option{--job-name}) sets \emph{tcsh-test} as the job name.
%\item \texttt{--chdir} tells the scheduler to execute the job from the current working directory
\item \option{--mem=1GB} requests and assigns 1GB of memory to the job.
Jobs require the \option{--mem} option to be set either in the script
or on the command line; \textbf{if it's missing, job submission will be rejected.}
\end{itemize}
\noindent The script then:
\begin{enumerate}
\item Sleeps on a node for 30 seconds.
\item Uses the \tool{module} command to load the \texttt{gurobi/8.1.0} environment.
\item Prints the list of loaded modules into a file.
\end{enumerate}
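
\noindent
For readers who prefer bash, the same steps might look roughly as follows.
This is a sketch reconstructed from the description above, not the contents of \file{tcsh.sh}; the job name \texttt{bash-test} is made up, and the sketch assumes that the \tool{module} command is available in non-interactive bash shells on the compute nodes.
\begin{verbatim}
#!/encs/bin/bash
#SBATCH -J bash-test      # job name (hypothetical)
#SBATCH --mem=1G          # mandatory memory request

sleep 30                  # stay on the node long enough to show up in squeue
module load gurobi/8.1.0  # load the Gurobi environment
module list               # the listing ends up in the job's output file
\end{verbatim}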
\noindent
The scheduler command, \tool{sbatch}, is used to submit (non-interactive) jobs.
From an ssh session on ``speed-submit'', submit this job with
\begin{verbatim}
sbatch ./tcsh.sh
\end{verbatim}
\noindent
You will see \texttt{Submitted batch job 2654}, where $2654$ is the assigned job ID.
The commands \tool{squeue} and \tool{sinfo} can be used
to look at the status of the cluster:
%\texttt{squeue -l} and \texttt{sinfo -la}.
\small
\begin{verbatim}
[serguei@speed-submit src] % squeue -l
Thu Oct 19 11:38:54 2023
JOBID PARTITION     NAME     USER    STATE       TIME  TIME_LIMI NODES NODELIST(REASON)
 2641        ps interact   b_user  RUNNING   19:16:09 1-00:00:00     1 speed-07
 2652        ps interact   a_user  RUNNING      41:40 1-00:00:00     1 speed-07
 2654        ps tcsh-tes  serguei  RUNNING       0:01 7-00:00:00     1 speed-07
[serguei@speed-submit src] % sinfo
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
ps*          up 7-00:00:00    14  drain speed-[08-10,12,15-16,20-22,30-32,35-36]
ps*          up 7-00:00:00     1    mix speed-07
ps*          up 7-00:00:00     7   idle speed-[11,19,23-24,29,33-34]
pg           up 1-00:00:00     1  drain speed-17
pg           up 1-00:00:00     3   idle speed-[05,25,27]
pt           up 7-00:00:00     7   idle speed-[37-43]
pa           up 7-00:00:00     4   idle speed-[01,03,25,27]
\end{verbatim}
\normalsize
\noindent
\textbf{Remember} that you only have 30 seconds before the job is essentially over, so
if you do not see a similar output, either adjust the sleep time in the
script, or execute the \tool{squeue} statement more quickly. The \tool{squeue}
output listed above shows that your job 2654 is running on node \texttt{speed-07},
and its time limit is 7 days, etc.\\
% TODO
%, that it
%was started at 16:39:30 on 12/03/2018, and that it is a single-core job (the
%default).
Once the job finishes, there will be a new file in the directory that the job
was started from, named \texttt{slurm-<job id>.out}; in this example
the file is \file{slurm-2654.out}. This file contains the
standard output (and standard error, if any) of the job in question. If you
look at the contents of your newly created file, you will see that it
contains the output of the \texttt{module list} command.
Important information is often written to this file.
%
%Congratulations on your first job!
% -------------- 2.4 Common Job Management Commands Summary ---
% -------------------------------------------------------------
\subsection{Common Job Management Commands Summary}
\label{sect:job-management-commands}
Here is a summary of useful job management commands for handling various aspects of
job submission and monitoring on the Speed cluster:
\begin{itemize}
\item Submitting a job:
\small
\begin{verbatim}
sbatch -A <ACCOUNT> -t <MINUTES> --mem=<MEMORY> -p <PARTITION> ./<myscript>.sh
\end{verbatim}
\normalsize
\item Checking your job(s) status:
\small
\begin{verbatim}
squeue -u <ENCSusername>
\end{verbatim}
\normalsize
\item Displaying cluster status:
\small
\begin{verbatim}
squeue
\end{verbatim}
\normalsize
\begin{itemize}
\item Use \option{-A} to filter by account (e.g., \texttt{-A vidpro}, \texttt{-A aits}),
\item Use \option{-p} to filter by partition (e.g., \texttt{-p ps}, \texttt{-p pg}, \texttt{-p pt}), etc.
\end{itemize}
\item Displaying job information:
\small
\begin{verbatim}
squeue --job <job-ID>
\end{verbatim}
\normalsize
\item Displaying individual job steps: (to see which step failed if you used \tool{srun})
\small
\begin{verbatim}
squeue -las
\end{verbatim}
\normalsize
\item Monitoring job and cluster status: (view \tool{sinfo} and watch the queue for your job(s))
\small
\begin{verbatim}
watch -n 1 "sinfo -Nel -pps,pt,pg,pa && squeue -la"
\end{verbatim}
\normalsize
\item Canceling a job:
\small
\begin{verbatim}
scancel <job-ID>
\end{verbatim}
\normalsize
\item Holding a job:
\small
\begin{verbatim}
scontrol hold <job-ID>
\end{verbatim}
\normalsize
\item Releasing a job:
\small
\begin{verbatim}
scontrol release <job-ID>
\end{verbatim}
\normalsize
\item Getting job statistics: (including useful metrics like ``MaxVMSize'')
\small
\begin{verbatim}
sacct -j <job-ID>
\end{verbatim}
\normalsize
\api{MaxVMSize} is one of the more useful stats that you can elect to display
as a format option.
\small
\begin{verbatim}
% sacct -j 2654
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2654          tcsh-test         ps     speed1          1  COMPLETED      0:0
2654.batch        batch                speed1          1  COMPLETED      0:0
2654.extern      extern                speed1          1  COMPLETED      0:0
% sacct -j 2654 -o jobid,user,account,MaxVMSize,Reason%10,TRESUsageOutMax%30
JobID             User    Account  MaxVMSize     Reason        TRESUsageOutMax
------------ --------- ---------- ---------- ---------- ----------------------
2654           serguei     speed1                  None
2654.batch                 speed1    296840K             energy=0,fs/disk=1975
2654.extern                speed1    296312K              energy=0,fs/disk=343
\end{verbatim}
\normalsize
See \texttt{man sacct} or \texttt{sacct -e} for details of the available formatting options.
You can define your preferred default format in the \api{SACCT\_FORMAT} environment variable
in your \texttt{.cshrc} or \texttt{.bashrc} files.
\item Displaying job efficiency: (including CPU and memory utilization)
\small
\begin{verbatim}
seff <job-ID>
\end{verbatim}
\normalsize
Run it only on completed jobs; on \texttt{RUNNING} jobs the
efficiency statistics may be misleading. If you define the following
directive in your batch script, your GCS ENCS email address will receive an email
with \tool{seff}'s output when your job is finished.
\small
\begin{verbatim}
#SBATCH --mail-type=ALL
\end{verbatim}
\normalsize
Output example:
\small
\begin{verbatim}
Job ID: XXXXX
Cluster: speed
User/Group: user1/user1
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:04:29
CPU Efficiency: 0.35% of 21:32:20 core-walltime
Job Wall-clock time: 05:23:05
Memory Utilized: 2.90 GB
Memory Efficiency: 2.90% of 100.00 GB
\end{verbatim}
\normalsize
\end{itemize}
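
\noindent
The commands above can be combined into a small submit-and-inspect sequence.
The sketch below assumes a bash shell (it relies on the \texttt{\$(...)} command substitution) and reuses \file{tcsh.sh} from the previous section; the \option{--parsable} option makes \tool{sbatch} print only the job ID (followed by the cluster name in multi-cluster setups), and the chosen \tool{sacct} fields are merely one reasonable selection.
\small
\begin{verbatim}
# Submit and capture the job ID
jobid=$(sbatch --parsable ./tcsh.sh)

# Check the job while it is pending or running
squeue --job "$jobid"

# After the job has finished: accounting data and efficiency
sacct -j "$jobid" -o jobid,state,elapsed,MaxVMSize
seff "$jobid"
\end{verbatim}
\normalsize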
% -------------- 2.5 Advanced sbatch Options ------------------
% -------------------------------------------------------------
\subsection{Advanced \tool{sbatch} Options}
\label{sect:submit-options}
\label{sect:qsub-options}
In addition to the basic sbatch options presented earlier,
there are several advanced options that are generally useful:
\begin{itemize}
\item E-mail notifications:
\begin{verbatim}
--mail-type=<TYPE>
\end{verbatim}
Requests the scheduler to send an email when the job changes state.
\texttt{<TYPE>} can be \texttt{ALL}, \texttt{BEGIN}, \texttt{END}, or \texttt{FAIL}.
By default, mail is sent to
%
\begin{verbatim}
<ENCSusername>@encs.concordia.ca
\end{verbatim}
%
which you can consult via \url{webmail.encs.concordia.ca} (use the VPN when off campus),
unless a different address is supplied
(see \option{--mail-user}).
The report sent when a job ends includes job
runtime, as well as the maximum memory value hit (\api{maxvmem}).
\begin{verbatim}
--mail-user=<email-address>
\end{verbatim}
Specifies a different email address for notifications rather than the default.
\item Export environment variables used by the script:
\begin{verbatim}
--export=ALL
--export=NONE
--export=VARIABLES
\end{verbatim}
\item Job runtime:
\begin{verbatim}
-t <MINUTES> or DAYS-HH:MM:SS
\end{verbatim}
Sets the job runtime, given either in minutes or as \texttt{DAYS-HH:MM:SS}. Note that a single number
is interpreted as \emph{minutes}, not hours. The requested runtime should not exceed
the default maximums of 24 hours for interactive jobs and 7 days for batch jobs.
\item Job Dependencies:
\begin{verbatim}
--depend=<state:job-ID>
\end{verbatim}
Runs the job only once the specified job \verb|<job-ID>| has reached the given state
(e.g., \texttt{afterok} for successful completion). This is useful for creating job chains where
subsequent jobs depend on the completion of previous ones.
\end{itemize}
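
\noindent
As an illustration of how these options combine, the following bash sketch chains two jobs and requests an email for the second one.
The scripts \texttt{preprocess.sh} and \texttt{analyze.sh}, the memory sizes, and the 2-hour limit are hypothetical; \option{--dependency} is the option's full name as listed in \texttt{man sbatch}, and \option{--parsable} makes \tool{sbatch} print only the job ID.
\begin{verbatim}
# Submit the first job and remember its ID
first=$(sbatch --parsable --mem=1G ./preprocess.sh)

# Start the second job only if the first one completed successfully,
# and send a notification when it ends or fails
sbatch --dependency=afterok:"$first" --mail-type=END,FAIL \
       -t 0-02:00:00 --mem=1G ./analyze.sh
\end{verbatim}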
\noindent \textbf{Note:} \tool{sbatch} options can also be specified on the job-submission
command line, where they \emph{override} the corresponding script options (if present). The
syntax is
\begin{verbatim}
sbatch [options] PATHTOSCRIPT
\end{verbatim}
but, unlike in the script, the options are specified without the leading \verb+#SBATCH+,
e.g.:
\begin{verbatim}
sbatch -J sub-test --chdir=./ --mem=1G ./tcsh.sh
\end{verbatim}
% -------------- 2.6 Array Jobs -------------------------------
% -------------------------------------------------------------
\subsection{Array Jobs}
\label{sect:array-jobs}
Array jobs are those that start a batch job or a parallel job multiple times.
Each iteration of the job array is called a task and receives a unique job ID.
Array jobs are particularly useful for running a large number of similar tasks with slight variations.\\
\noindent
To submit an array job (only supported for batch jobs), use the \option{--array} option of the \tool{sbatch}
command as follows:
\begin{verbatim}
sbatch --array=n[-m[:s]] <batch_script>
\end{verbatim}
\noindent \textbf{where}
\begin{itemize}
\item
\texttt{n}: indicates the start-id.
\item
\texttt{m}: indicates the max-id.
\item
\texttt{s}: indicates the step size.
\end{itemize}
\noindent \textbf{Examples:}
\begin{itemize}
\item Submit a job with 1 task where the task-id is 10.
\begin{verbatim}
sbatch --array=10 array.sh
\end{verbatim}
\item Submit a job with 10 tasks numbered consecutively from 1 to 10.
\begin{verbatim}
sbatch --array=1-10 array.sh
\end{verbatim}
\item Submit a job with 5 tasks numbered with a step size of 3 (task IDs 3, 6, 9, 12, 15).
\begin{verbatim}
sbatch --array=3-15:3 array.sh
\end{verbatim}
\item Submit a job with 50000 elements, where \%a maps to the task-id between 1 and 50K.
\begin{verbatim}
sbatch --array=1-50000 -N1 -i my_in_%a -o my_out_%a array.sh
\end{verbatim}
\end{itemize}
\noindent \textbf{Output files for Array Jobs:}\\
The default output and error files are named \texttt{slurm-<job\_id>\_<task\_id>.out}.
%
This means that Speed creates an output and an error-file for each task
generated by the array-job, as well as one for the super-ordinate array-job.
To alter this behavior use the \option{-o} and \option{-e} options of \tool{sbatch}.\\
For more details about array job options, please review the manual page for
\tool{sbatch} by executing the following at the command line on \tool{speed-submit}:
\texttt{man sbatch}.
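
\noindent
The examples above refer to a script named \texttt{array.sh} without showing it.
A minimal bash sketch of what such a script could look like is given below; the job name, the program \texttt{my\_program}, and the \texttt{input\_<N>.dat} file naming are hypothetical.
It relies on the \api{SLURM\_ARRAY\_TASK\_ID} and \api{SLURM\_ARRAY\_JOB\_ID} environment variables that the scheduler sets for each task, and on the \texttt{\%A} (array job ID) and \texttt{\%a} (task ID) file name patterns.
\begin{verbatim}
#!/encs/bin/bash
#SBATCH -J array-sketch          # hypothetical job name
#SBATCH --mem=1G
#SBATCH --array=1-10             # ten tasks, IDs 1..10
#SBATCH -o array-%A_%a.out       # %A = array job ID, %a = task ID

# Each task sees its own index in SLURM_ARRAY_TASK_ID
echo "Task ${SLURM_ARRAY_TASK_ID} of array job ${SLURM_ARRAY_JOB_ID}"

# Use the index to pick this task's input file (hypothetical names)
srun ./my_program "input_${SLURM_ARRAY_TASK_ID}.dat"
\end{verbatim}
\noindent
With \option{--array} set inside the script, a plain \texttt{sbatch array.sh} is enough; an \option{--array} value given on the command line would override it.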
% -------------- 2.7 Requesting Multiple Cores ----------------
% -------------------------------------------------------------
\subsection{Requesting Multiple Cores (i.e., Multithreading Jobs)}
\label{sect:multicore-jobs}
For jobs that can take advantage of multiple machine cores, you can
request up to 32 cores (per job) in your script using the following options:
\begin{verbatim}
#SBATCH -n <#cores for processes>
#SBATCH -n 1
#SBATCH -c <#cores for threads of a single process>
\end{verbatim}
\noindent Both \tool{sbatch} and \tool{salloc} support \option{-n} on the command line.
Set it explicitly, either in the script or on the command line, whenever you need more than one core,
because the default is $n=1$.\\
\noindent \textbf{Important Considerations}:
\begin{itemize}
\item Do not request more cores than you think will be useful,
as larger-core jobs are more difficult to schedule.
\item If you are running a program that scales out to the maximum single-machine
core count available, please request 32 cores to avoid node
oversubscription (i.e., overloading the CPUs).
\end{itemize}
\noindent \textbf{Note:} \option{--ntasks} or \option{--ntasks-per-node}
(\option{-n}) refers to processes (usually the ones run with \tool{srun}).
\option{--cpus-per-task} (\option{-c}) corresponds to threads per process.\\
\noindent Some programs consider them equivalent, while others do not. For example,
Fluent uses \option{--ntasks-per-node=8} and \option{--cpus-per-task=1},
whereas others may set \option{--cpus-per-task=8} and \option{--ntasks-per-node=1}.
If one of these is not 1, some applications need to be configured to use \texttt{n * c} total cores.\\
\noindent The core count associated with a job appears under
``AllocCPUS'' in the \texttt{sacct -j <job-id>} output.
\small
\begin{verbatim}
[serguei@speed-submit src] % squeue -l
Thu Oct 19 20:32:32 2023
JOBID PARTITION     NAME     USER    STATE       TIME  TIME_LIMI NODES NODELIST(REASON)
 2652        ps interact   a_user  RUNNING    9:35:18 1-00:00:00     1 speed-07
[serguei@speed-submit src] % sacct -j 2652
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2652         interacti+         ps     speed1         20    RUNNING      0:0
2652.intera+ interacti+                speed1         20    RUNNING      0:0
2652.extern      extern                speed1         20    RUNNING      0:0
2652.0       gydra_pmi+                speed1         20  COMPLETED      0:0
2652.1       gydra_pmi+                speed1         20  COMPLETED      0:0
2652.2       gydra_pmi+                speed1         20     FAILED      7:0
2652.3       gydra_pmi+                speed1         20     FAILED      7:0
2652.4       gydra_pmi+                speed1         20  COMPLETED      0:0
2652.5       gydra_pmi+                speed1         20  COMPLETED      0:0
2652.6       gydra_pmi+                speed1         20  COMPLETED      0:0
2652.7       gydra_pmi+                speed1         20  COMPLETED      0:0
\end{verbatim}
\normalsize
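
\noindent
To make the distinction between \option{-n} and \option{-c} concrete, here is a bash sketch of a single-process, multithreaded job.
It assumes an OpenMP-style program (the name \texttt{my\_threaded\_program} is hypothetical) that honours \api{OMP\_NUM\_THREADS}; the job name and memory request are placeholders as well.
The scheduler sets \api{SLURM\_CPUS\_PER\_TASK} when \option{-c} is given.
\begin{verbatim}
#!/encs/bin/bash
#SBATCH -J threads-sketch   # hypothetical job name
#SBATCH --mem=4G
#SBATCH -n 1                # one process
#SBATCH -c 8                # eight cores for its threads

# Match the thread count to the cores granted by the scheduler
export OMP_NUM_THREADS="$SLURM_CPUS_PER_TASK"

# Pass -c explicitly so the job step gets all the allocated cores
srun -c "$SLURM_CPUS_PER_TASK" ./my_threaded_program
\end{verbatim}
\noindent
For a program that starts several processes instead of threads, the roles would be swapped: a larger \option{-n} and \texttt{-c 1}.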
% -------------- 2.8 Interactive Jobs -------------------------
% -------------------------------------------------------------
\subsection{Interactive Jobs}
\label{sect:interactive-jobs}
Interactive job sessions allow you to interact with the system in real-time.
These sessions are particularly useful for tasks such as testing, debugging, optimizing code,
setting up environments, and other preparatory work before submitting batch jobs.
% 2.8.1 Command Line
% -------------------
\subsubsection{Command Line}
\label{sect:command-line}
To request an interactive job session, use the \texttt{salloc} command with appropriate options.
This is similar to submitting a batch job but allows you to run shell commands interactively
within the allocated resources. For example:
\begin{verbatim}
salloc -J interactive-test --mem=1G -p ps -n 8
\end{verbatim}
Within the allocated \tool{salloc} session, you can run shell commands as usual.
It is recommended to use \tool{srun} for compute-intensive steps within \tool{salloc}.
If you need a quick, short job just to compile something on a GPU node,
you can use an interactive \tool{srun} directly. For example, a 1-hour allocation:\\
\noindent \textbf{For tcsh}:
\begin{verbatim}
srun --pty -n 8 -p pg --gpus=1 --mem=1G -t 60 /encs/bin/tcsh
\end{verbatim}
\noindent \textbf{For bash}:
\begin{verbatim}
srun --pty -n 8 -p pg --gpus=1 --mem=1G -t 60 /encs/bin/bash
\end{verbatim}
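
\noindent
A typical \tool{salloc} session might then proceed as sketched below.
The three commands are typed one after another at the prompt (they are not meant to be saved as a script), and \texttt{hostname} merely stands in for a real compute-intensive step.
\begin{verbatim}
salloc -J interactive-test --mem=1G -p ps -n 8   # request the allocation
srun hostname   # job step: prints the node name once per allocated task
exit            # leave the session and release the resources
\end{verbatim}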
% 2.8.2 Graphical Applications
% -------------------
\subsubsection{Graphical Applications}
\label{sect:graphical-applications}
To run graphical UI applications (e.g., MATLAB, Abaqus CAE, IDEs like PyCharm, VSCode, Eclipse, etc.) on Speed,
you need to enable X11 forwarding from your client machine to Speed and then to the compute node.
To do so, follow these steps:
\begin{enumerate}
\item Run an X server on your client machine:
\begin{itemize}
\item \textbf{Windows:} Use MobaXterm with X turned on, or Xming + PuTTY with X11 forwarding, or XOrg under Cygwin
\item \textbf{macOS:} Use XQuartz with its \tool{xterm} and \texttt{ssh -X}
\item \textbf{Linux:} Use \texttt{ssh -X speed.encs.concordia.ca}
\end{itemize}
For more details, see \href{https://www.concordia.ca/ginacody/aits/support/faq/xserver.html}{How do I remotely launch X(Graphical) applications?}
\item Verify that X11 forwarding is enabled by printing the \api{DISPLAY} variable:
\begin{verbatim}
echo $DISPLAY
\end{verbatim}
\item Start an interactive session with X11 forwarding enabled (use the \option{--x11} option with \tool{salloc} or \tool{srun}), for example:
\begin{verbatim}
salloc -p ps --x11=first --mem=4G -t 0-06:00
\end{verbatim}
\item Once landed on a compute node, verify \api{DISPLAY} again.
\item Set the \api{XDG\_RUNTIME\_DIR} variable to a directory in your \tool{speed-scratch} space:
\begin{verbatim}
mkdir -p /speed-scratch/$USER/run-dir
setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir
\end{verbatim}
\item Launch your graphical application:
\begin{verbatim}
module load matlab/R2023a/default
matlab
\end{verbatim}
\end{enumerate}
\noindent
\textbf{Note:} with X11 forwarding, the graphical rendering happens on
your client machine! That is, you are not using GPUs on Speed to render
graphics; instead, all graphical information is forwarded from Speed to
your desktop or laptop over X11, which in turn renders it using its
own graphics card. Thus, for GPU rendering jobs, either keep them
non-interactive or use VirtualGL.\\
\noindent
Here's an example of starting PyCharm (see \xf{fig:pycharm}).
\textbf{Note:} If using VSCode, it's currently only supported with the \tool{--no-sandbox} option.\\
\noindent \textbf{TCSH version:}
\small
\begin{verbatim}
ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too)
[speed-submit] [/home/c/carlos] > echo $DISPLAY
localhost:14.0
[speed-submit] [/home/c/carlos] > cd /speed-scratch/$USER
[speed-submit] [/speed-scratch/carlos] > echo $DISPLAY
localhost:13.0
[speed-submit] [/speed-scratch/carlos] > salloc -pps --x11=first --mem=4Gb -t 0-06:00
[speed-07] [/speed-scratch/carlos] > echo $DISPLAY
localhost:42.0
[speed-07] [/speed-scratch/carlos] > hostname
speed-07.encs.concordia.ca
[speed-07] [/speed-scratch/carlos] > setenv XDG_RUNTIME_DIR /speed-scratch/$USER/run-dir
[speed-07] [/speed-scratch/carlos] > /speed-scratch/nag-public/bin/pycharm.sh
\end{verbatim}
\normalsize
\noindent \textbf{BASH version:}
\small
\begin{verbatim}
bash-3.2$ ssh -X speed (XQuartz xterm, PuTTY or MobaXterm have X11 forwarding too)
serguei@speed's password:
[serguei@speed-submit ~] % echo $DISPLAY
localhost:14.0
[serguei@speed-submit ~] % salloc -p ps --x11=first --mem=4Gb -t 0-06:00
bash-4.4$ echo $DISPLAY
localhost:77.0
bash-4.4$ hostname
speed-01.encs.concordia.ca
bash-4.4$ export XDG_RUNTIME_DIR=/speed-scratch/$USER/run-dir
bash-4.4$ /speed-scratch/nag-public/bin/pycharm.sh
\end{verbatim}
\normalsize
\begin{figure}[htpb]
\includegraphics[width=\columnwidth]{images/pycharm}
\caption{Launching PyCharm on a Speed Node}
\label{fig:pycharm}
\end{figure}
% -----------------------------------------------------------------------------
\subsubsection{Jupyter Notebooks}
\label{sect:jupyter}
% 2.8.3 Jupyter Notebooks in Singularity
% -------------------
%\subsubsection{Jupyter Notebook in Singularity}
\paragraph{Jupyter Notebook in Singularity}
\label{sect:jupyter-singularity}
To run Jupyter Notebooks using Singularity (for more on Singularity, see \xs{sect:singularity-containers}), follow these steps:
\begin{enumerate}
% X11 is not really needed for Jupyter since we tunnel and use a browser
%\item Connect to Speed with X11 forwarding enabled:
\item Connect to Speed, e.g. interactively, using \tool{salloc}
%\item Use the \option{--x11} with \tool{salloc} or \tool{srun} as described in the above example
\item Load Singularity module
\verb+module load singularity/3.10.4/default+
\item
Execute this Singularity command on a single line, or save it in a shell script
(available \href{https://github.com/NAG-DevOps/speed-hpc/blob/master/src/jupyter.sh}{from our GitHub})
that you can invoke easily.
\scriptsize
\begin{verbatim}
srun singularity exec -B $PWD\:/speed-pwd,/speed-scratch/$USER\:/my-speed-scratch,/nettemp \
--env SHELL=/bin/bash --nv /speed-scratch/nag-public/openiss-cuda-conda-jupyter.sif \
/bin/bash -c '/opt/conda/bin/jupyter notebook --no-browser --notebook-dir=/speed-pwd \
--ip="*" --port=8888 --allow-root'
\end{verbatim}
\normalsize
\item
In a new terminal window, create an \tool{ssh} tunnel between your computer and the node (\texttt{speed-XX}) where Jupyter is
running (using \texttt{speed-submit} as a ``jump server'', see, e.g., in PuTTY, in \xf{fig:putty1} and \xf{fig:putty2})
\small
\begin{verbatim}
ssh -L 8888:speed-XX:8888 <ENCS-username>@speed-submit.encs.concordia.ca
\end{verbatim}
\normalsize
Keep the tunnel open for as long as you are using Jupyter.
\item
Open a browser, copy the Jupyter URL with its token (printed in the terminal where Jupyter was started),
and paste it into the browser's address bar.
In our case, the URL is:
\small
\begin{verbatim}
http://localhost:8888/?token=5a52e6c0c7dfc111008a803e5303371ed0462d3d547ac3fb
\end{verbatim}
\normalsize
\item Access the Jupyter Notebook interface in your browser.
\end{enumerate}
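
\noindent
The tunnel step can be parameterized so that it is easy to reuse for other nodes and ports.
The bash sketch below is run on your own computer, not on Speed; the variable names are arbitrary, the values are placeholders to be replaced, and the \option{-N} option of \tool{ssh} (which opens the tunnel without starting a remote shell) is an optional addition not used in the command above.
\small
\begin{verbatim}
NODE="speed-XX"                  # node where Jupyter runs (see hostname)
PORT=8888                        # port shown in Jupyter's startup URL
ENCS_USER="your-encs-username"   # placeholder

# Forward local port $PORT to the same port on the compute node
ssh -N -L "${PORT}:${NODE}:${PORT}" \
    "${ENCS_USER}@speed-submit.encs.concordia.ca"
\end{verbatim}
\normalsize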
\begin{figure}[htbp]
\centering
\fbox{\includegraphics{images/putty1}}
\caption{SSH tunnel configuration 1}
\label{fig:putty1}
\end{figure}
\begin{figure}[htbp]
\centering
\fbox{\includegraphics{images/putty2}}
\caption{SSH tunnel configuration 2}
\label{fig:putty2}
\end{figure}
\begin{figure}[htbp]
\centering
\fbox{\includegraphics[width=1.00\textwidth]{images/jupyter.png}}
\caption{Jupyter running on a Speed node}
\label{fig:jupyter}
\end{figure}
\noindent
Another sample is the OpenISS-derived containers with Conda and Jupyter,
see \xs{sect:openiss-examples} for details.
% 2.8.4 JupyterLab in Conda and Pytorch
% -------------------
%\subsubsection{JupyterLab in Conda and Pytorch}
\paragraph{JupyterLab in Conda and Pytorch}
\label{sect:jupyterlabs}
For setting up Jupyter Labs with Conda and Pytorch, follow these steps:
\begin{itemize}
\item Environment preparation: (only once, takes some time to run to install all required dependencies)
\begin{enumerate}
\item Navigate to your speed-scratch directory:
\begin{verbatim}
cd /speed-scratch/$USER
\end{verbatim}
\item Create a Jupyter (name of your choice) directory and \tool{cd} into it:
\begin{verbatim}
mkdir -p Jupyter
cd Jupyter
\end{verbatim}
\item Start an interactive session:
\begin{verbatim}
salloc --mem=50G --gpus=1 -p pg (or -p pt)
\end{verbatim}
\item
Set \tool{conda} environment variables, and install \tool{jupyterlab} and \tool{pytorch},
as shown in \xf{fig:firsttime.sh} from our GitHub.
\begin{figure}[htpb]
\tiny
\lstinputlisting[language=csh,frame=single,basicstyle=\ttfamily]{../src/jupyterlabs/firsttime.sh}
\normalsize
\caption{Source code for \texttt{firsttime.sh}}
\label{fig:firsttime.sh}
\end{figure}
\end{enumerate}
\item
Execution of Jupyter Labs from \textbf{speed-submit} (repeat this every time you want to run Jupyter Labs):
\begin{enumerate}
\item Start an interactive session:
\begin{verbatim}
salloc --mem=50G --gpus=1 -p pg (or -p pt)
\end{verbatim}
\item
Activate your \tool{conda} environment and run Jupyter Labs, as shown in
\xf{fig:run.sh} (also available on our GitHub).
\begin{figure}[htpb]
\scriptsize
\lstinputlisting[language=csh,frame=single,basicstyle=\ttfamily]{../src/jupyterlabs/run.sh}
\normalsize
\caption{Source code for \texttt{run.sh}}
\label{fig:run.sh}
\end{figure}
\item
Verify which port the system has assigned to your Jupyter Lab instance by examining the URL
\texttt{http://localhost:XXXX/lab?token=} in your terminal as described
previously.
\item
In a new terminal window, create an \tool{ssh} tunnel similar to Jupyter
in Singularity, see \xs{sect:jupyter-singularity}.
\item
Open a browser, copy your Jupyter URL with its token, and paste it into the browser's address bar.
\end{enumerate}
\end{itemize}
% 2.8.5 JupyterLab + Pytorch in Python venv
% -------------------
%\subsubsection{JupyterLab + Pytorch in Python venv}
\paragraph{JupyterLab + Pytorch in Python venv}
\label{sect:jupyterlabs-venv}
This is an example of Jupyter Labs running in a Python Virtual environment (\texttt{venv}), with Pytorch on Speed.\\
\noindent
\textbf{Note:} Use of Python virtual environments is preferred over Conda at Alliance Canada clusters.
If you prefer to make jobs that are more compatible between Speed and Alliance clusters, use Python
\texttt{venv}s. See \url{https://docs.alliancecan.ca/wiki/Anaconda/en}
and \url{https://docs.alliancecan.ca/wiki/JupyterNotebook}.
\begin{itemize}
\item
Environment preparation: for the FIRST time only:
\begin{enumerate}
\item
Go to your speed-scratch directory: \texttt{cd /speed-scratch/\$USER}
\item
Open an interactive session: \texttt{salloc --mem=50G --gpus=1 --constraint=el9}
\item
Create a Python \texttt{venv} and install \tool{jupyterlab}+\tool{pytorch}
\scriptsize
\begin{verbatim}
module load python/3.11.5/default
setenv TMPDIR /speed-scratch/$USER/tmp
setenv TMP /speed-scratch/$USER/tmp
setenv PIP_CACHE_DIR /speed-scratch/$USER/tmp/cache
python -m venv /speed-scratch/$USER/tmp/jupyter-venv
source /speed-scratch/$USER/tmp/jupyter-venv/bin/activate.csh
pip install jupyterlab
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
exit
\end{verbatim}
\normalsize
\end{enumerate}
\item
Running Jupyter Labs, from \textbf{speed-submit}:
\begin{enumerate}
\item
Open an interactive session with \texttt{salloc --mem=50G --gpus=1 --constraint=el9}, then activate the environment and start Jupyter Lab:
\scriptsize
\begin{verbatim}
cd /speed-scratch/$USER
module load python/3.11.5/default
setenv PIP_CACHE_DIR /speed-scratch/$USER/tmp/cache
source /speed-scratch/$USER/tmp/jupyter-venv/bin/activate.csh
jupyter lab --no-browser --notebook-dir=$PWD --ip="0.0.0.0" --port=8888 --port-retries=50
\end{verbatim}
\normalsize
\item
Verify which port the system has assigned to Jupyter:\\
\texttt{http://localhost:XXXX/lab?token=}
\item
SSH Tunnel creation: similar to Jupyter in Singularity, see \xs{sect:jupyter-singularity}
\item
Open a browser and type: \texttt {localhost:XXXX} (using the port assigned)
\end{enumerate}
\end{itemize}
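
\noindent
The commands above use \tool{tcsh} syntax (\texttt{setenv}, \texttt{activate.csh}).
If your interactive session runs bash instead, the environment preparation could be written roughly as follows; this is a direct translation that assumes the same module and paths, not a separately tested recipe.
\scriptsize
\begin{verbatim}
module load python/3.11.5/default
export TMPDIR=/speed-scratch/$USER/tmp
export TMP=/speed-scratch/$USER/tmp
export PIP_CACHE_DIR=/speed-scratch/$USER/tmp/cache
python -m venv /speed-scratch/$USER/tmp/jupyter-venv
source /speed-scratch/$USER/tmp/jupyter-venv/bin/activate
pip install jupyterlab
pip3 install torch torchvision torchaudio \
    --index-url https://download.pytorch.org/whl/cu118
\end{verbatim}
\normalsize
\noindent
For later runs, the same substitutions apply: \texttt{export} replaces \texttt{setenv}, and \texttt{bin/activate} replaces \texttt{bin/activate.csh}.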
% 2.8.6 Visual Studio Code
% -------------------
\subsubsection{Visual Studio Code}
\label{sect:vscode}
This is an example of running VScode. It is similar to Jupyter notebooks, but
it doesn't use containers. \textbf{Note:} this is a Web-based version; a local
(workstation) to remote (speed-node) client-server version exists too, but it is for advanced users
and is out of scope here (so no support, use it at your own risk).
\begin{itemize}
\item
Environment preparation: for the FIRST time:
\begin{enumerate}
\item
Go to your speed-scratch directory: \texttt{cd /speed-scratch/\$USER}
\item
Create a vscode directory: \texttt{mkdir vscode}
\item
Go to vscode: \texttt{cd vscode}
\item
Create home and projects: \texttt{mkdir \{home,projects\}}
\item
Create this directory: \texttt{mkdir -p /speed-scratch/\$USER/run-user}
\end{enumerate}
\item
Running VScode
\begin{enumerate}
\item
Go to your vscode directory: \texttt{cd /speed-scratch/\$USER/vscode}
\item
Open interactive session: \texttt{salloc --mem=10Gb --constraint=el9}
\item
Set environment variable:\\\texttt{setenv XDG\_RUNTIME\_DIR /speed-scratch/\$USER/run-user}
\item
Run VScode, change the port if needed.
\scriptsize
\begin{verbatim}
/speed-scratch/nag-public/code-server-4.22.1/bin/code-server --user-data-dir=$PWD\/projects \
--config=$PWD\/home/.config/code-server/config.yaml --bind-addr="0.0.0.0:8080" $PWD\/projects
\end{verbatim}
\normalsize
\item
SSH Tunnel creation: similar to Jupyter, see \xs{sect:jupyter-singularity}
\item
Open a browser and type: \texttt{localhost:8080}
\item
If the browser asks for a password, consult:
\begin{verbatim}
cat /speed-scratch/$USER/vscode/home/.config/code-server/config.yaml
\end{verbatim}
\end{enumerate}
\end{itemize}
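
\noindent
For bash users, the steps above can be condensed as sketched below.
This is a translation of the listed commands under the assumption that the same paths are used; \texttt{export} replaces \texttt{setenv}, and quoting \api{\$PWD} makes the tcsh-specific \verb+\/+ escapes unnecessary.
\scriptsize
\begin{verbatim}
# One-time setup
mkdir -p /speed-scratch/$USER/vscode/{home,projects}
mkdir -p /speed-scratch/$USER/run-user

# Every session, from within an salloc allocation
cd /speed-scratch/$USER/vscode
export XDG_RUNTIME_DIR=/speed-scratch/$USER/run-user
/speed-scratch/nag-public/code-server-4.22.1/bin/code-server \
    --user-data-dir="$PWD/projects" \
    --config="$PWD/home/.config/code-server/config.yaml" \
    --bind-addr="0.0.0.0:8080" "$PWD/projects"
\end{verbatim}
\normalsize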
\begin{figure}[htbp]
\centering
\fbox{\includegraphics[width=1.00\textwidth]{images/vscode.png}}
\caption{VScode running on a Speed node}
\label{fig:vscode}
\end{figure}