Introduction.

Twickel is a computing environment that consists of a compute cluster and several compute servers.
Most resources are controlled by means of a torque/MAUI scheduling and execution system.
This means that most resources have to be requested.

Resources.

The available computational resources are:
  • A group of 4 machines that use VirtualBox to run virtual machines. For example, the SHARE project is hosted on warmelo.
  • A group of 10 compute nodes with dual Intel E5335 CPUs and 24GB RAM, connected with Gigabit Ethernet.
  • A group of 10 compute nodes with dual Intel E5520 CPUs and 24GB RAM, connected with DDR InfiniBand.
  • Two compute servers for exclusive access:
    • big3: dual Intel X5550, 144GB RAM.
    • big4: dual Intel X5550, 72GB RAM.
  • One compute server for experimentation with multi-core algorithms:
    • big5: contact Alfons Laarman.
  • Two head nodes:
    • twickel: dual Xeon with 4GB RAM, can be used for compilation and job submission.
    • weldam: dual X5365 with 64GB RAM, can be used for compilation and job submission as well as for doing development work.
The available storage resources are:
  • A total of 2TB of semi-reliable shared storage. (An effort will be made to keep the data safe, but no backups are made.)
  • A single 6TB distributed scratch file system. (If something goes wrong with this file system, it will be reformatted.)
  • Every machine has a limited local scratch file system.

A picture and a complete listing can be found here.

How to use the cluster resources.

Getting an account.

First, you need to get a working EWI account. Then
  • Employees of DACS, SE or FMT can request an account via the ICTS help desk. (Mentioning the names Enno Oosterhuis and Twickel reduces the chance of the request being forwarded to the wrong person.)
  • Others (including students) need to have an employee request an account on their behalf.

Subscribe to the mailing list.

A mailing list for cluster users has been created on the UTwente list server: TWICKEL-USERS.

Logging in and setting up.

Once you have an account, you can log in to twickel.ewi.utwente.nl and/or weldam.ewi.utwente.nl. (Weldam appears to be unavailable at the moment.)
Twickel is a real head node: it's meant for compilation and job submission but not for running applications.
Weldam is a mix between a compute server and a head node: you can use it both for running applications
and for submitting jobs.
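
For example, logging in from a Unix-like machine can be done with ssh (yourusername is a placeholder for your EWI account name):

ssh yourusername@twickel.ewi.utwente.nl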

Software

The machines originally ran openSUSE 11.2; we are migrating the cluster to Scientific Linux.
Twickel, and the compute nodes that it controls, already run Scientific Linux;
Weldam, and the compute nodes that it controls, are still being migrated.
For more information see the Hardware Inventory.

Additional software has been installed in the
directory /software. To get access to that software you need to add the following line to your .bashrc:

export PATH=/software/ewi/bin:$PATH
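
The change takes effect in new shells; to apply it in your current shell, you can source the file:

source ~/.bashrc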

The FMT group also maintains a directory of software, which uses environment modules.
Currently two FMT software directories are available: a current one, which occasionally gets updated with new software, and a legacy one, which is no longer extended.
It is not advisable to use the two together.

There is a third FMT software directory that contains software installed by the Jenkins continuous integration server; it can be used together with the current FMT directory of software.

Current FMT directory of Software

To get access to this software, you also need to add the following lines to your .bashrc:

export MOD_BASE=/software/fmtv2
. $MOD_BASE/bin/mod_setup.sh
module load cadp
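
As a quick check that the modules system is set up correctly, you can list the modules that are currently loaded:

module list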

Software installed via the Jenkins continuous integration server

This software is installed in /software/fmt-jenkins.
It is probably best to use it together with the current FMT software.

To use this software together with the current FMT software, such that software in fmt-jenkins takes precedence over software in fmtv2, put the following in your .bashrc (or a similar file), or enter it at the bash prompt, instead of the lines given in the previous section.

export MOD_PATH=/software/fmt-jenkins:/software/fmtv2
. /software/fmtv2/bin/mod_setup.sh
module load cadp mcrl2

Now, when you use module load to load a package like mcrl2 or ltsmin without specifying the version, you will get the latest version from the fmt-jenkins tree.
With module avail you can see all available packages.
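
For example, a session could look as follows (the ltsmin version number shown is hypothetical; use module avail to see what is actually installed):

module avail                  # list all available packages and versions
module load mcrl2             # load the latest available version of mcrl2
module load ltsmin/2.0        # load a specific version (hypothetical example)
module list                   # show which modules are currently loaded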

In general, to tell mod_setup.sh that you want to access software from multiple trees, set the environment variable MOD_PATH to a colon-separated list of those trees before mod_setup.sh is sourced.
(Right now there are only two such trees: /software/fmtv2 and /software/fmt-jenkins.)
When MOD_PATH is set, it is not necessary to set MOD_BASE as long as you only want to use the installed software; if you want to install software in /software/fmtv2 or /software/fmt-jenkins, you do have to set MOD_BASE to indicate where the software must be installed.
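
A minimal sketch combining the two, for the case where you also want to install software into /software/fmtv2:

export MOD_PATH=/software/fmt-jenkins:/software/fmtv2   # trees to take software from
export MOD_BASE=/software/fmtv2                         # only needed when installing software
. /software/fmtv2/bin/mod_setup.sh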

Legacy FMT directory of Software

There is still an older version of the FMT software directory.
To access it, add the following lines to your .bashrc:

export MOD_HOME=/software/fmt
export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH:$MOD_HOME/pkg/tcltk-8.5.6/lib
if [ -f /etc/profile.d/modules.sh ] ; then
 . /etc/profile.d/modules.sh
else
 . $MOD_HOME/Modules/3.2.6/init/bash
fi
module use $MOD_HOME/modules
module load cadp

Batch scheduler

To get access to the compute nodes and the other compute servers, you need to
use the torque/MAUI batch scheduler.

Job submission

A job can be submitted to the torque/MAUI batch scheduler by means of the tool qsub.
We continue with a very brief discussion of qsub, mostly focussing on the resource syntax for twickel.

There are several important options. The -W option allows passing information to MAUI.
For example, to tell MAUI that the job needs exclusive access to a node:

-W x=NACCESSPOLICY:SINGLEJOB

Most important of all is the -l option, which specifies the resources needed.
A few examples (a complete command line is sketched after the list):

  • To get one processor on an E5335 use -l nodes=1:E5335
  • To get 4 processors on a single E5335 use -l nodes=1:ppn=4:E5335
  • To get 8 processors each on 2 E5520 machines use -l nodes=2:ppn=8:E5520
  • To run a batch job on big3 use -l nodes=big3 (doesn't work yet: use ssh)
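
For instance, a complete submission that combines the resource syntax with the exclusive-access option could look like this (myjob.pbs is a hypothetical script name):

qsub -l nodes=1:ppn=4:E5335 -W x=NACCESSPOLICY:SINGLEJOB myjob.pbs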

Normally, qsub takes a script as an argument and runs that script.
When you need interactive access to, e.g., big4, type

qsub -I -l nodes=big4 -W x=NACCESSPOLICY:SINGLEJOB

And after waiting for currently running jobs to complete, the entire machine will be yours.
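
Inside such an interactive session you work on the node directly; a session could look like this (my_program is a hypothetical application):

hostname        # should print the name of big4
./my_program    # run your application on the node
exit            # ends the interactive job and releases the node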

Instead of passing options to qsub on the command line, you can also put them in the script.
For example, if file1.pbs contains

#PBS -N the-name-of-the-job
#PBS -l nodes=1:E5335
#PBS -W x=NACCESSPOLICY:SINGLEJOB

hostname
date
sleep 60
date

and file2.pbs contains

hostname
date
sleep 60
date

Then the commands

qsub file1.pbs
qsub -N the-name-of-the-job -l nodes=1:E5335 -W x=NACCESSPOLICY:SINGLEJOB file2.pbs

behave exactly the same. Note that command-line options override the options in the script.
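
For example, the following overrides the node type requested in file1.pbs from the command line, while keeping the other options from the script:

qsub -l nodes=1:E5520 file1.pbs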

Status of your job

You can get information about the status of your job with the command qstat.

Other commands that give information about the status of the queues and cluster nodes:

  • pbsnodes
  • showq

Removing your job

If you want to remove your job, use

qdel jobId

where jobId is the job id that was returned when you submitted the job with qsub; it is also shown in the output of qstat (in the leftmost column).
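
For example, if qsub reported a job id of the form 1234.twickel (the number and exact server suffix shown here are made up), the job can be removed with:

qdel 1234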

Command overview

(Overview taken from http://clusterinfo.physik.hu-berlin.de/)

Node info

pbsnodes -a                           # show status of all nodes
pbsnodes -a nodeNN                    # show status of specified node
pbsnodes -l                           # list inactive nodes
pbsnodelist                           # list status of all nodes (one per line)

Queue info

qstat -Q                              # show all queues
qstat -Q queue                        # show status of specified queue
qstat -f -Q queue                     # show full info for specified queue
qstat -q                              # show all queues (alternative format)
qstat -q queue                        # show status of specified queue (alt.)

Job submission and monitoring

qsub jobscript                        # submit to default queue
qsub -q queue jobscript               # submit to specified queue
qsub -l nodes=4:ppn=2 jobscript       # request 4x2 processors
qsub -l nodes=nodeNN jobscript        # run on specified node
qsub -l cput=HH:MM:SS jobscript       # limit on CPU time (serial job)
qsub -l walltime=HH:MM:SS jobscript   # limit on wallclock time (parallel job)
qdel job_no                           # delete job (with job_no from qstat)
qstat -a                              # show all jobs
qstat -a queue                        # show all jobs in specified queue
qstat -f job_no                       # show full info for specified job
qstat -n                              # show all jobs and the nodes they occupy

MPI

Executing a job involving multiple compute nodes communicating through MPI
requires allocating the appropriate number of resources through qsub and
executing the mpirun command with appropriate arguments, including the
program to be run.

Suppose, e.g., we would like to execute the lps2lts-mpi tool with a file
called model.lps on ten compute nodes using two processors per node. This
can be achieved by submitting a script with the following contents to the
batch scheduler:

#PBS -N the-name-of-the-job
#PBS -l nodes=10:ppn=2:E5335
#PBS -W x=NACCESSPOLICY:SINGLEJOB

mpirun -mca btl tcp,self lps2lts-mpi model.lps
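
Assuming this script is saved as, say, mpi-job.pbs (a hypothetical name), it is submitted like any other batch job:

qsub mpi-job.pbs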

More information on the mpirun command and the arguments that can be passed
to it can be found in the mpirun manual page.