Job scripts
-----------

The following script can be used as a template to execute some bash commands
for a serial or parallel program. It is just a template that contains the most
commonly used flags. For working examples, see the
:ref:`job script examples <octopus_jobs_examples>`.

**template job script**

.. literalinclude:: sample_job.sh
    :linenos:
    :language: bash

**Flags Description**

- ``#SBATCH --job-name=my_job_name``: Set the name of the job. This name will
  appear, e.g., when the command ``squeue`` is executed to query the queued or
  running jobs.
- ``#SBATCH --account=abc123``: Specify the ID of the project. This number
  should correspond to the project ID of the service request. Jobs without
  this flag will be rejected.
- ``#SBATCH --partition=normal``: The name of the partition (a.k.a. queue) to
  which the job will be submitted.
- ``#SBATCH --nodes=2``: The number of nodes that will be reserved for the
  job.
- ``#SBATCH --ntasks-per-node=8``: The number of tasks (e.g. MPI tasks) that
  will be reserved per node.
- ``#SBATCH --cpus-per-task=2``: The number of cores per task to be reserved
  for the job (e.g. the number of OpenMP threads per MPI task). The total
  number of cores reserved for the job is the product of the values of the
  flags ``--nodes``, ``--ntasks-per-node`` and ``--cpus-per-task``; see the
  sketch after this list.
- ``#SBATCH --mem=32000``: The amount of memory per node in MB that will be
  reserved for the job. Jobs that do not specify this flag will be rejected.
- ``#SBATCH --time=1-00:00:00``: The time limit of the job. When the limit is
  reached, the job is killed by the scheduler. Jobs that do not specify this
  flag will be rejected.
- ``#SBATCH --mail-type=ALL``: Receive email notifications for all stages of
  a job, e.g. when the job starts and terminates.
- ``#SBATCH --mail-user=abc123@aub.edu.lb``: The email address to which the
  job notification emails are sent.
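As an illustration of how these resource flags combine, the following minimal
sketch (assuming a hypothetical hybrid MPI/OpenMP executable
``my_hybrid_app``; adapt the names and values to your own program) reserves
2 nodes x 8 tasks x 2 cores = 32 cores in total and passes the per-task core
count to OpenMP:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=hybrid_example
    #SBATCH --account=abc123
    #SBATCH --partition=normal
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=8
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=32000
    #SBATCH --time=1-00:00:00

    # use the cores reserved for each task as OpenMP threads
    export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

    # srun starts 8 MPI tasks on each of the 2 nodes (16 tasks in total),
    # each task running 2 OpenMP threads
    srun ./my_hybrid_app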
Job scripts examples
^^^^^^^^^^^^^^^^^^^^

.. reference the scientific applications section of this guide

.. _octopus_jobs_examples:

Below is a list of job scripts that can be used to run serial or parallel
jobs. For more details please refer to the :ref:`scientific computing section `.

.. note:: All the jobs below are just basic working examples and can be
   copied and modified to suit the user's needs. Make sure, though, to change
   the account to the one that you are using.

.. note:: Every application is different and might need special flags to run
   correctly. Please consult the documentation of the application that you
   are using to make sure that you are using the correct flags. You may also
   email it.helpdesk@aub.edu.lb for advice.

serial - single core job
""""""""""""""""""""""""

.. literalinclude:: serial_single_core_job.sh
    :linenos:
    :language: bash

single node smp job
"""""""""""""""""""

.. literalinclude:: smp_single_node_job.sh
    :linenos:
    :language: bash

parallel multi-host job
"""""""""""""""""""""""

.. literalinclude:: parallel_multi_host_job.sh
    :linenos:
    :language: bash

single host GPU job
"""""""""""""""""""

.. literalinclude:: single_host_gpu_job.sh
    :linenos:
    :language: bash

multi-host GPU job
""""""""""""""""""

.. literalinclude:: multi_host_gpu_job.sh
    :linenos:
    :language: bash

Batch job submission and monitoring procedure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- submit the job script using SLURM

  .. code-block:: bash

      $ sbatch my_job_script.sh

  This will submit the job to the queueing system. The job may run
  immediately or remain pending until the requested resources become
  available.

- check the status of the job

  .. code-block:: bash

      $ squeue -a

- After the job is dispatched for execution (starts running), monitor the
  output by checking the ``.out`` file.

For more information on using SLURM, please consult the ``man`` pages:

.. code-block:: bash

    $ man sbatch

Jobs time limits and checkpoints
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _octopus_jobs_checkpoints_resume:

In order to ensure fair usage of the resources and the partitions (queues),
different partitions have different time limits. The maximum time limit for
jobs is 3 days. Partitions also have different priorities that are necessary
for fair usage; for example, short jobs have a higher priority than long
jobs. When a job reaches the time limit that is specified in the job script
or the time limit of the partition, it is automatically killed and removed
from the queue. It is the responsibility of the user to set the job
parameters based on the requirements of the job and the available resources.
In all the examples below, it is the responsibility of the user to manage
writing the checkpoint file and loading it.

Resubmit a job automatically using job arrays
=============================================

In the following example, a job array (``#SBATCH --array=1-30%1``, where the
``%1`` ensures that only one job of the array runs at a time) is used to
indicate that the job should be run as a chain of 30 jobs back to back. Using
this flow, a job can be run for arbitrarily long periods; in this case, for
the sake of demonstration, the job runs for 30 days using individual jobs
that run for 1 day each. When the first job finishes, a checkpoint file
``foo.chkp`` is written to the disk and the execution of the next job starts,
where ``foo.chkp`` is read, the program state is restored and the execution
resumes.

.. code-block:: bash

    #!/bin/bash

    #SBATCH --job-name=my_job_name
    #SBATCH --account=abc123

    ## specify the required resources
    #SBATCH --partition=normal
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=12000
    #SBATCH --time=1-00:00:00
    #SBATCH --array=1-30%1

    ## load some modules
    module load python

    # start executing the program
    MY_CHECKPOINT_FILE=foo.chkp
    if [ ! -f "${MY_CHECKPOINT_FILE}" ]; then
        # the checkpoint file does not exist, train from scratch
        python train_model_from_scratch.py
    else
        # the checkpoint file exists, read it and continue training
        python train_model_from_scratch.py --use-checkpoint=${MY_CHECKPOINT_FILE}
    fi

Each job in the job array will have its own ``.out`` file suffixed with the
job array index, e.g. ``my_slurm_30.out``.
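The checkpoint format and its contents are application specific. As a toy
illustration of the write/restore responsibility described above, the sketch
below (assuming a hypothetical counter stored in ``counter.chkp``) shows the
general pattern that the body of each job in the chain follows:

.. code-block:: bash

    # restore the state if a previous job of the chain saved it
    CHECKPOINT=counter.chkp
    if [ -f "${CHECKPOINT}" ]; then
        # resume from where the previous job stopped
        COUNT=$(cat "${CHECKPOINT}")
    else
        # first job of the chain: start from scratch
        COUNT=0
    fi

    # do one unit of work (a real program would run here instead)
    COUNT=$((COUNT + 1))
    echo "chain step ${COUNT}"

    # save the state so that the next job of the chain can resume
    echo "${COUNT}" > "${CHECKPOINT}"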
Resubmit a job automatically using job dependencies
===================================================

The main difference between using job dependencies and a job array is that
with dependencies the job will be resubmitted an unlimited number of times,
until the user decides to cancel the automatic resubmission.

.. warning:: It is important to include a wait time of a few minutes (e.g.
   5 min) so that the scheduler will not be overloaded by the recursive
   resubmission of jobs in case something goes wrong.

In the template job script below, when the job is submitted, an ``sbatch``
command submits the dependency from within the job. The simulation/program
resume procedure is the same as that of using job arrays, i.e. if a
checkpoint exists, run the program from the checkpoint; otherwise run the
program and create the checkpoint.

.. code-block:: bash

    #!/bin/bash

    #SBATCH --job-name=my_job_name
    #SBATCH --account=abc123

    ## specify the required resources
    #SBATCH --partition=normal
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=8
    #SBATCH --cpus-per-task=2
    #SBATCH --mem=12000
    #SBATCH --time=0-01:00:00

    ## submit the dependency that will start after the current job finishes
    sbatch --dependency=afterok:${SLURM_JOBID} job.sh
    # wait a few minutes so that a failing job cannot flood the scheduler
    sleep 300

    # start executing the program
    MY_CHECKPOINT_FILE=foo.chkp
    if [ ! -f "${MY_CHECKPOINT_FILE}" ]; then
        # the checkpoint file does not exist, train from scratch
        python train_model_from_scratch.py
    else
        # the checkpoint file exists, read it and continue training
        python train_model_from_scratch.py --use-checkpoint=${MY_CHECKPOINT_FILE}
    fi
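Since the chain resubmits itself indefinitely, it must be stopped explicitly
by the user. A minimal sketch using standard SLURM commands (the job ID below
is a placeholder):

.. code-block:: bash

    # list your running and pending jobs; the dependent job appears as
    # pending with the reason (Dependency)
    $ squeue -u $USER

    # cancel the pending dependent job by its job ID to break the chain
    $ scancel 123457

    # or cancel all jobs, running and pending, that have the given job name
    $ scancel --name=my_job_name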