INFINITY
https://lcc.ncbr.muni.cz/whitezone/development/infinity/
Petr Kulhánek 1,2
[email protected]
1 CEITEC – Central European Institute of Technology, Masaryk University, Kamenice 5, 62500 Brno, Czech Republic
2 National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kotlářská 2, 611 37 Brno, Czech Republic
Infinity overview, 18th October 2013
-1-
Contents
• Docs & Contacts: how to report a problem
• Terminology: cluster topologies
• Clusters: NCBR/CEITEC clusters, MetaCentrum, CERIT-SC, IT4I
• AMS – Advanced Module System: command overviews, sites, modules, personal sites, big brother
• ABS – Advanced Batch System: command overviews, resources, job submission, job monitoring
Infinity overview, 18th October 2013
-2-
Docs & Contacts
Infinity overview, 18th October 2013
-3-
Docs & Contacts
Infinity web pages:
https://lcc.ncbr.muni.cz/whitezone/development/infinity/
Support contacts:
[email protected]
([email protected])
• bug reports, requests for software compilation, etc.
• only administrators are notified
Mailing list:
[email protected]
https://lcc.ncbr.muni.cz/bluezone/mailman/cgi-bin/listinfo/infinity
• general announcements, software news, general discussion, etc.
• all members of the list are notified
Infinity overview, 18th October 2013
-4-
How to report a problem
Do not assume that we already know about the problem; report it and include:
• the report sent to the correct e-mail address ([email protected])
• where and when the problem occurred (name of the computer)
• the path to the job directory
• make all files readable to everyone (chmod -R a+r,a-X *)
• a brief summary of the problem
[email protected]
all e-mails will be silently ignored
Infinity overview, 18th October 2013
-5-
Terminology
Infinity overview, 18th October 2013
-6-
Terminology
[Diagram: a Cluster consists of a User Interface (UI, frontend) and a set of Computational Nodes, also called Worker Nodes (WN).]
Infinity overview, 18th October 2013
-7-
Terminology
[Diagram repeated: a Cluster = User Interface (UI, frontend) + Computational/Worker Nodes (WN).]
Infinity overview, 18th October 2013
-8-
Terminology, independent clusters
[Diagram: two independent clusters, each with its own User Interface (UI, frontend), Batch Server and Worker Nodes (WN).]
Examples: wolf.ncbr.muni.cz, sokar.ncbr.muni.cz
Independent clusters: each has its own batch system for job management.
Infinity overview, 18th October 2013
-9-
Terminology, grids
[Diagram: several clusters sharing a common Batch Server; each cluster has its own User Interface (UI, frontend) and Worker Nodes (WN).]
Examples: skirit.ics.muni.cz, perian.ncbr.muni.cz
The clusters use a common infrastructure with one batch system for job management.
Infinity overview, 18th October 2013
-10-
Terminology, grids
[Diagram: the same grid with shared storage nodes (SN); the clusters share one Batch Server, and each has its own User Interface (UI, frontend) and Worker Nodes (WN).]
Examples: skirit.ics.muni.cz, perian.ncbr.muni.cz
The clusters use a common infrastructure with one batch system for job management.
Infinity overview, 18th October 2013
-11-
Terminology, large grids
[Diagram: a large grid composed of several cluster groups; each group has its own Batch Server, User Interface (UI, frontend), Worker Nodes (WN) and storage nodes (SN).]
Infinity overview, 18th October 2013
-12-
Terminology, large grids
[Diagram: the same large grid with its resource providers. The CESNET clusters (skirit, perian, gram, hildor, ...) and the CERIT-SC clusters (zegox, zewura, ...) each form a group with its own Batch Server, User Interface (UI, frontend), Worker Nodes (WN) and storage nodes (SN).]
Infinity overview, 18th October 2013
-13-
Terminology, large grids
[Diagram: the same large grid viewed as Infinity sites. The CESNET clusters (skirit, perian, gram, hildor, ...) form the site metacentrum; the CERIT-SC clusters (zegox, zewura, ...) form the site cerit-sc. Each site has its own Batch Server, User Interface (UI, frontend), Worker Nodes (WN) and storage nodes (SN).]
Infinity overview, 18th October 2013
-14-
Terminology, large grids
[Diagram: detail of the site cerit-sc with its Batch Server, User Interface (UI, frontend), Worker Nodes (WN), storage nodes (SN) and the zegox and zewura clusters.]
Infinity overview, 18th October 2013
-15-
Clusters
Infinity overview, 18th October 2013
-16-
NCBR/CEITEC Clusters
Two main clusters:
 WOLF (1.18, 2.34, 24+1 PC, frontend: wolf.ncbr.muni.cz)
 SOKAR [0.18, 3x(24CPU, 72GB), 4x(64CPU, 256GB)]
• Infinity activation is not necessary.
• Job submission requires passwordless ssh connection among cluster nodes.
Infinity overview, 18th October 2013
-17-
Passwordless connection within cluster
1. Create the public/private ssh keys:
Use empty passphrase!
[kulhanek@wolf01 ~]$ cd .ssh
[kulhanek@wolf01 .ssh]$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/kulhanek/.ssh/id_rsa):
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/kulhanek/.ssh/id_rsa.
Your public key has been saved in /home/kulhanek/.ssh/id_rsa.pub.
The key fingerprint is:
e9:07:0b:fc:17:23:b3:c5:1a:8a:0c:1a:98:8f:fe:28 [email protected]
2. Put the public key to the list of authorized keys:
[kulhanek@wolf01 .ssh]$ cat id_rsa.pub >> authorized_keys
3. Test passwordless connection to another node:
[kulhanek@wolf01 .ssh]$ ssh wolf02
If there are problems with the ssh-agent, try
$ ssh-add -D
or log in to the GUI session again.
Infinity overview, 18th October 2013
-18-
MetaCentrum/CERIT-SC
MetaCentrum and CERIT-SC
• national grid environment
• OS Debian
• ca. 8500 CPU cores
• CEITEC/NCBR own resources: ca. 850 CPU cores
• total storage capacity of 1000 TB
• ca. 10 TB per user
http://www.metacentrum.cz/
http://www.cerit-sc.cz/
Free access can be provided to members of any Czech university.
Infinity activation is required. Procedure is described here:
https://lcc.ncbr.muni.cz/whitezone/development/infinity/wiki/index.php/How_to_activate_Infinity
Infinity overview, 18th October 2013
-19-
Kerberos & Support
Access to main services and passwordless authentication among cluster nodes is
maintained via Kerberos tickets.
Main commands:
• kinit
• klist
• kdestroy
A ticket is valid for 10 hours. It is created automatically by ssh when logging in to any node
using a password (it is not created when ssh keys are used for authentication). It can be
created or recreated by the kinit command.
Expired tickets can lead to very spurious behavior.
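A minimal ticket-handling session might look like this (an illustrative sketch using the standard Kerberos commands above; the exact output depends on the site):
$ klist      # list cached tickets and their expiration times
$ kinit      # obtain or renew a ticket (prompts for the password)
$ kdestroy   # discard all cached tickets when finished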
Support (via the Best Practical RT system):
[email protected] (only system-related issues)
[email protected] (all problems related to Infinity should be sent to this address)
Infinity overview, 18th October 2013
-20-
IT4I
http://www.it4i.cz/
Access is granted on the basis of successful proposals (6-month duration).
Calls are opened twice a year.
Small cluster: ca. 3000 CPU cores, InfiniBand, Phi and GPU accelerators
Big cluster: full operation in 2015
Infinity overview, 18th October 2013
-21-
AMS
Advanced Module System
(Software management)
https://lcc.ncbr.muni.cz/whitezone/development/infinity/wiki/index.php/How_to_activate_Infinity
Infinity overview, 18th October 2013
-22-
Command Overview
Software management:
• site: switching between computational resources
• module: activation/deactivation of software
• ams-config: configuration of software modules
• ams-host: information about the computational node/frontend
• ams-user: information about the logged-in user
• ams-setenv: prepare a fake environment for given computational resources
• ams-autosite: name of the default site for a given computational node/frontend
• ams-root: where the AMS is installed
Use command -h to list all command options.
Infinity overview, 18th October 2013
-23-
AMS
Sites
Infinity overview, 18th October 2013
-24-
Sites
A site is an encapsulation of computational resources and software packages.
On independent clusters, there is usually only one site available. On larger grids, there
might be more sites available on each frontend and/or worker node.
Available sites are listed by the site command:
[kulhanek@perian ~]$ site
>>> AVAILABLE SITES >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[metacentrum] cerit-sc

[kulhanek@sokar ~]$ site
>>> AVAILABLE SITES >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[sokar]
Infinity overview, 18th October 2013
-25-
Sites
A site is an encapsulation of computational resources and software packages.
On independent clusters, there is usually only one site available. On larger grids, there
might be more sites available on each frontend.
Available sites are listed by the site command (without arguments):
[kulhanek@perian ~]$ site
>>> AVAILABLE SITES >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[metacentrum] cerit-sc

[kulhanek@sokar ~]$ site
>>> AVAILABLE SITES >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[sokar]
The active site is shown in square brackets.
Infinity overview, 18th October 2013
-26-
Login to UI/WN
[kulhanek@pes ~]$ ssh sokar.ncbr.muni.cz
[email protected]'s password:
Welcome to Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-37-generic x86_64)
.....
*** Welcome to sokar site ***
# ==============================================================================
# Site name          : sokar (-active-)
# Site ID            : {SOKAR:9848596a-17d1-47e2-9fce-b666fc0e5a36}
#
# ~~~ User identification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# User name          : kulhanek
# User groups        : compchem,lcc,pmflib
#
# ~~~ Host info ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Full host name     : sokar.ncbr.muni.cz
# Host arch tokens   : i686,noarch,x86_64
# Num of host CPUs   : 16
# Host SMP CPU model : Intel(R) Xeon(R) CPU E5530 @ 2.40GHz [Total memory: 24104 MB]
#
# ~~~ Site documentation and support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Documentation      : https://lcc.ncbr.muni.cz/whitezone/development/infinity/
# Support e-mail     : [email protected] [issue tracking system]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[kulhanek@sokar ~]$
Infinity overview, 18th October 2013
-27-
What site is activated?
If multiple sites are allowed on a UI/WN, the default site is determined by the ams-autosite
command when logging in from outside of the site.
[Diagram: ssh from sokar (no site) to perian activates the metacentrum site; ssh from sokar (no site) to zuphux activates the cerit-sc site.]
Infinity overview, 18th October 2013
-28-
What site is activated?
If multiple sites are allowed on a UI/WN, the default site is determined by the ams-autosite
command when logging in from outside of the site.
[Diagram: ssh from sokar (no site) to perian activates the metacentrum site; ssh from sokar (no site) to zuphux activates the cerit-sc site.]
The site is preserved if it is allowed on the remote computer.
[Diagram: ssh from perian to zuphux preserves the metacentrum site; ssh from zuphux to perian preserves the cerit-sc site.]
Infinity overview, 18th October 2013
-29-
Remote login within the site
[kulhanek@perian ~]$ ssh skirit
>>> INFINITY environment will be propagated to the remote computer ...
...
*** Welcome to metacentrum site ***
# ==============================================================================
# Site name          : metacentrum (-active-)
# Site ID            : {METACETRUM:276b1c6d-4aca-4b8c-b517-be1f66a85ebe}
#
# ~~~ User identification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# User name          : kulhanek
# User groups        : compchem,pmflib
#
# ~~~ Host info ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Full host name     : skirit.ics.muni.cz
# Host arch tokens   : i686,noarch,x86_64
# Num of host CPUs   : 4
# Host SMP CPU model : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz [Total memory: 3650 MB]
#
# ~~~ Site documentation and support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Documentation      : https://lcc.ncbr.muni.cz/whitezone/development/infinity/
# Support e-mail     : [email protected] [issue tracking system]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Active modules were restored ...
>>> Current directory was restored ...
[kulhanek@skirit ~]$
Infinity overview, 18th October 2013
-30-
Remote login within the site
[kulhanek@perian ~]$ ssh skirit
>>> INFINITY environment will be propagated to the remote computer ...
...
...
>>> Active modules were restored ...
>>> Current directory was restored ...
[kulhanek@skirit ~]$
Remote login (interactive) within the site preserves:
• the active site
• active modules
• current working directory (if it is possible)
The feature can be disabled by any option given to the ssh command:
[kulhanek@perian ~]$ ssh -x skirit
...
[kulhanek@skirit ~]$
Infinity overview, 18th October 2013
-31-
Changing the active site
The active site can be changed by the site command:
[kulhanek@perian ~]$ site activate cerit-sc
*** Welcome to cerit-sc site ***
# ==============================================================================
# Site name          : cerit-sc (-active-)
# Site ID            : {CERIT-SC:5d1cc70a-efdf-4017-b446-9a050a016f61}
#
# ~~~ User identification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# User name          : kulhanek
# User groups        : compchem,pmflib
#
# ~~~ Host info ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Full host name     : perian.ncbr.muni.cz
# Host arch tokens   : i686,noarch,x86_64
# Num of host CPUs   : 4
# Host SMP CPU model : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz [Total memory: 5500 MB]
#
# ~~~ Site documentation and support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Documentation      : https://lcc.ncbr.muni.cz/whitezone/development/infinity/
# Support e-mail     : [email protected] [issue tracking system]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[kulhanek@perian ~]$
Infinity overview, 18th October 2013
-32-
Info about the active site / a site
The information about the active or any other site can be shown by the site command:
[kulhanek@perian ~]$ site info metacentrum
*** Welcome to metacentrum site ***
# ==============================================================================
# Site name          : metacentrum (-not active-)
#
# ~~~ User identification ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# User name          : kulhanek
# User groups        : compchem,pmflib
#
# ~~~ Site documentation and support ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
# Documentation      : https://lcc.ncbr.muni.cz/whitezone/development/infinity/
# Support e-mail     : [email protected] [issue tracking system]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[kulhanek@perian ~]$
If the name of a site is not provided, the active site is shown.
Note: the host information is not available if the site is not active.
More detailed information can be shown by the disp action of the site command:
[kulhanek@perian ~]$ site disp metacentrum
Infinity overview, 18th October 2013
-33-
AMS
Modules
Infinity overview, 18th October 2013
-34-
Modules
Available software on the active site is listed by the module command:
[kulhanek@skirit ~]$ module
AVAILABLE MODULES                                  (software packages are categorized)
--- GPU Enabled Software ----------------------------------------------------------------------
pmemd-cuda
--- Molecular Mechanics and Dynamics ----------------------------------------------------------
amber  ambertools  cicada  espresso  pmemd-cuda  sander-pmf  ...
--- Quantum Mechanics and Dynamics ------------------------------------------------------------
adf  cpmd  dalton  gaussview  multiwfn  qmutil  ...
--- Docking and Virtual Screening -------------------------------------------------------------
autodock  autodock-vina  cheminfo  dock  mgltools  xscore
--- Bioinformatics ----------------------------------------------------------------------------
blast  blast+  cd-hit  clustalw  copasi  fasta  modeller  muscle  rate4site
--- Conversion and Analysis -------------------------------------------------------------------
3dna  cats  inchi  openbabel  symmol  ...
...
Note: the list can contain module versions; this feature is configurable by the
ams-config command.
Infinity overview, 18th October 2013
-35-
Modules via iSoftRepo
https://lcc.ncbr.muni.cz/whitezone/development/infinity/isoftrepo/fcgi-bin/isoftrepo.fcgi
It lists information about sites and modules. Description, typical usage, ACL (access control
list), versions and the default version are listed if available.
Infinity overview, 18th October 2013
-36-
Module build
A module build is identified as:
name:version:architecture:mode
• name: module name (name of the software package)
• version: module version
• architecture: the CPU/GPU architecture for which the build is compiled
• mode: the parallel execution mode for which the build is compiled
All module versions are listed by:
$ module versions <module_name>
All module builds are listed by:
$ module builds <module_name>
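For illustration, querying the amber builds might produce a listing like this (a sketch based on the build names used later in this overview; the real listing format and available builds are site-dependent):
$ module builds amber
amber:12.0:x86_64:single
amber:12.0:x86_64:para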
Infinity overview, 18th October 2013
-37-
sander/pmemd
The sander/pmemd programs are applications from the AMBER package. They perform
molecular dynamics. Detailed information can be found at http://ambermd.org
#!/bin/bash
# activate the module with the sander/pmemd applications
module add amber:12.0
# execute the sander program
sander -O -i prod.in -p topology.parm7 -c input.rst7
Job script:
• only essential logic is present
• in most cases, the script is the same for the sequential and parallel runs of the
same applications
• data are referenced relative to the job directory
Infinity overview, 18th October 2013
-38-
sander – single/parallel execution
The only difference between sequential and parallel execution is in the resource
specification during psubmit. The input data and the job script are the same!
$ psubmit short test_sander ncpus=1        (ncpus=1 can be omitted)
$ psubmit short test_sander ncpus=2
[Figure: the *.stdout files from the two runs show the module build activated on the computational node:
 sequential run: Module build: amber:12.0:x86_64:single
 parallel run:   Module build: amber:12.0:x86_64:para]
Infinity overview, 18th October 2013
-39-
Module build, architectures
Architecture   Target
noarch         application requires only a shell environment, it should run everywhere
i686           application requires a 32-bit environment
x86_64         application requires a 64-bit environment
gpu            application requires a GPU
cuda           application requires the NVIDIA CUDA environment
ib             InfiniBand is supported
Host architecture tokens are listed via site command or explicitly via the ams-host
command.
The optimal module architecture must match all host architecture tokens. In ambiguous
cases, the build with the highest architecture score is used (for example, x86_64 has a higher
score than i686).
Infinity overview, 18th October 2013
-40-
Module build, modes
Mode     Target
single   application can utilize only one CPU
node     some parts of the application can run in parallel on a single computational node
smp      some parts of the application can run in parallel on a single computational node
para     some parts of the application can run in parallel on several computational nodes
noarch   this build cannot be activated
The optimal mode is determined by the available resources (CPUs, GPUs). The resources
can be specified via the psubmit or ams-setenv command.
The ams-setenv command should be used only for testing purposes! Its usage is limited to a
single computational node!
Infinity overview, 18th October 2013
-41-
Default build
In most cases, the module has a default build in the following form:
name:default:auto:auto
• default: use the default version (in most cases, the latest one)
• auto (architecture): determine the best build for the given host
• auto (mode): determine the best mode according to the requested computational resources
Note: not all applications have a default build set. In that case, you must specify the module
version or even the whole build.
Infinity overview, 18th October 2013
-42-
Activate module
The application is activated by the module command:
[kulhanek@skirit ~]$ module add amber
# Module specification: amber (add action)
# =============================================================
Requested CPUs     : 1      Requested GPUs   : 0
Num of host CPUs   : 4      Num of host GPUs : 0
Requested nodes    : 1
Host arch tokens   : i686,noarch,x86_64
Host SMP CPU model : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz [Total memory: 3650 MB]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Exported module    : amber:12.0
Module build       : amber:12.0:x86_64:single
Infinity overview, 18th October 2013
-43-
Activate module
The application is activated by the module command:
[kulhanek@skirit ~]$ module add amber
# Module specification: amber (add action)
# =============================================================
Requested CPUs     : 1      Requested GPUs   : 0       (summary of available resources)
Num of host CPUs   : 4      Num of host GPUs : 0
Requested nodes    : 1
Host arch tokens   : i686,noarch,x86_64
Host SMP CPU model : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz [Total memory: 3650 MB]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Exported module    : amber:12.0
Module build       : amber:12.0:x86_64:single             (determined and activated module build)
The build resolution can be shown by the disp action of the module command:
[kulhanek@skirit ~]$ module disp amber
Infinity overview, 18th October 2013
-44-
Activate module
The module activation does not run the application! It only changes the shell environment
in such a way that the application is in the PATH.
[kulhanek@wolf ~]$ module add vmd                         (activate the vmd module)
# Module specification: vmd (add action)
# =============================================================
INFO: additional module povray is required, loading ...
      Loaded module : povray:3.6:i686:single
Requested CPUs     : 1      Requested GPUs   : 0
Num of host CPUs   : 4      Num of host GPUs : 0
Requested nodes    : 1
Host arch tokens   : i686,noarch,x86_64
Host SMP CPU model : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz [Total memory: 3650 MB]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Exported module    : vmd:1.9.1
Module build       : vmd:1.9.1:x86_64:single
[kulhanek@wolf ~]$ vmd                                     (run the vmd application)
Infinity overview, 18th October 2013
-45-
Activate module, recommendations
• provide only the module name, or the module name and its version
• it is recommended to provide an explicit module version in computational scripts (the
default version might change from time to time)
[kulhanek@skirit ~]$ module add amber:11.1
# Module specification: amber:11.1 (add action)
# =============================================================
INFO: Module is active, reactivating ...
      Unload module : amber:11.1:x86_64:single
Requested CPUs     : 1      Requested GPUs   : 0
Num of host CPUs   : 4      Num of host GPUs : 0
Requested nodes    : 1
Host arch tokens   : i686,noarch,x86_64
Host SMP CPU model : Intel(R) Xeon(R) CPU 5160 @ 3.00GHz [Total memory: 3650 MB]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Exported module    : amber:11.1
Module build       : amber:11.1:x86_64:single
Infinity overview, 18th October 2013
-46-
Other operations
List active modules:
[kulhanek@skirit ~]$ module active
ACTIVE MODULES
compat-ia32:4.0:i686:single
dynutil-new:4.0.4241:noarch:single
compat-amd64:5.0:x86_64:single
povray:3.6:i686:single
heimdal:meta:i686:single
torque:meta:i686:single
amber:11.1:x86_64:single
mc:4.8.7:x86_64:single
List exported modules:
[kulhanek@skirit ~]$ module exported
EXPORTED MODULES
abs:2.0.4216  amber:11.1  dynutil-new:4.0.4241  povray:3.6
Note: Exported modules contain only user-activated modules, without the architecture and
mode parts. They are passed to jobs by the psubmit command.
Infinity overview, 18th October 2013
-47-
Deactivate modules
To deactivate a module, use the remove action of the module command:
[kulhanek@skirit ~]$ module remove vmd
# Module name: vmd (remove action)
# =============================================================
Module build : vmd:1.9.1:x86_64:single
Note: The remove action does not remove modules that were activated together with the given
module due to dependencies (in the case of the vmd module, the povray module is
not removed).
All modules can be removed by the purge action:
[kulhanek@skirit ~]$ module purge
Note: Use this only for testing purposes! The action removes all modules including the
system ones.
Infinity overview, 18th October 2013
-48-
How to activate modules automatically
[kulhanek@skirit ~]$ ams-config
*** AMS Configuration Centre ***
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
------------------------------------------------------------
Main menu
------------------------------------------------------------
1   - configure visualization (colors, delimiters, etc.)
2   - configure auto-restored modules
------------------------------------------------------------
i   - site info
s   - save changes
p   - print this menu once again
q/r - quit program
Type menu item and press enter:
Infinity overview, 18th October 2013
-49-
Modules by MetaCentrum VO
You can access modules provided by the MetaCentrum VO using the metamodule
command. The command is available only on the metacentrum and cerit-sc sites.
[kulhanek@skirit ~]$ metamodule avail                 (to list modules, the avail action must be provided)
--------- /packages/run/modules-2.0/modulefiles_torque ---------
lam-7.1.4  lam-7.1.4-intel  mpich-p4  mpich-p4-intel  mpich-shmem  mpich-shmem-intel
mpich2  mpich2-intel  mpiexec-0.84  openmpi-1.6-intel  openmpi-pgi
--------- /packages/run/modules-2.0/modulefiles_opensuse ---------
adf2007  amber-12  amber-12-pgi  demon-2.3  demon-2.3-shmem  g03  intelcdk-9  intelcomp-11
jdk-1.4.2  lucida  molden  molpro  mopac2009  mvs  namd-2.7b1  nfs4acl  ofed-1.3-mvapich2
pgicdk-6.0  pgicdk-8.0  pgiwks-6.1  python-2.5  turbomole-5.10-huger  turbomole-5.6
turbomole-6.0  turbomole-6.4
....
Note: Extra care must be taken when using metamodules, especially when parallel
execution is required.
Infinity overview, 18th October 2013
-50-
ACL, Access Control List
Access to almost all modules is granted to everyone. In certain cases, access might be
limited by an ACL. The list of ACL rules is available only via iSoftService. For example, the
usage of the sander-pmf module is limited to users belonging to the pmflib group.
Note: ACL rules can be defined at the level of modules or module builds.
Infinity overview, 18th October 2013
-51-
User groups for AMS subsystem
[kulhanek@wolf ~]$ ams-user
User name           : kulhanek (uid: 18773)
Primary group name  : lcc (gid: 2001)
Site ID             : {WOLF:669663ca-cb1c-4d0a-8393-13bb8f7a90da}
Configuration realm : default
===================================================================
>>> default ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Priority    : 1
Groups      : compchem
>>> posix ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Priority    : 2
All groups  : kulhanek,rmarek,compchem,lcc
User groups : lcc
>>> groups ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Priority    : 3
All groups  : pmflib,kulhanek
User groups : pmflib
===================================================================
>>> final
Groups      : compchem,lcc,pmflib
(These are the final AMS user groups. They are not necessarily related to the unix system groups!)
Infinity overview, 18th October 2013
-52-
Personal sites & Big brother
It is possible to install the AMS on your personal computer and thus benefit from the uniform
environment provided by the Infinity system.
The installation procedure and prerequisites are available in the Infinity wiki:
https://lcc.ncbr.muni.cz/whitezone/development/infinity/wiki/index.php/Documentation
Note: Activation of modules is monitored (user name, build name, host name, date,
resources and scope).
Infinity overview, 18th October 2013
-53-
ABS
Advanced Batch System
Infinity overview, 18th October 2013
-54-
Command Overview
• pqueues: list available queues
• pnodes: list available computational nodes
• pqstat: list jobs in the batch system or in a given queue
• pjobs: list jobs of the logged-in or another user
• infinity-env: job shell
• psubmit: submit a job to the batch system
• pinfo: print information about the job
• pgo: change the current directory to the job input directory or log in to the computational node
• psync: synchronize the computational node with the job input directory
• paliases: define aliases
• pkill: terminate a job
• pkillall: terminate all jobs (they can be filtered)
• pcollections: manage job collections
• presubmit: resubmit the job to the batch system
• pstatus: print short job status
• abs-config: configure the ABS subsystem
• infinity-ijobs-prepare, infinity-ijobs-copy-into, infinity-ijobs-launch, infinity-ijobs-finalize: support for internal jobs
Use command -h to list all command options.
Infinity overview, 18th October 2013
-55-
Job
A job must fulfill the following conditions:
• each job must be executed in a separate directory (the job input directory)
• all job data must be present in the job input directory
• job directories must not be nested
• job execution is controlled by a script or by an input file (for autodetected jobs)
• the job script must be written in the bash shell language
• absolute paths should not be used; all paths should be relative to the job input directory
• the directory cannot contain pjob??? directories or files
[Diagram: separate job directories, e.g. /home/kulhanek/job1 and /home/kulhanek/job2, one job per directory.]
Infinity overview, 18th October 2013
-56-
Job, cont.
The situation with two jobs in a single directory is detected by Infinity.
[kulhanek@skirit 01.get_hostname]$ psubmit short get_hostname
>>> List of jobs from info files ...
# ST  Job ID   Job Title     Queue  NCPUs NGPUs NNods Last change/Duration
# --  -------  ------------  -----  ----- ----- ----- --------------------
  F   72998    get_hostname  long       1     0     1 2013-10-11 09:52:28
ERROR: Infinity runtime files were detected in the job input directory!
The presence of runtime files indicates that another job
has been started in this directory. Multiple job submission
within the same directory is not permitted by the Infinity system.
If you really want to submit the job, you have to remove runtime
files of previous one. Please, be very sure that previous job has
been already terminated otherwise undefined behaviour can occur!
Type following to remove runtime files:
rm -f *.info *.infex *.infout *.stdout *.nodes *.gpus *.infkey ___JOB_IS_RUNNING___
Infinity overview, 18th October 2013
-57-
Job Script
A job script can only be written in the bash language. The interpreter can be specified directly as the
bash interpreter or as the special infinity-env interpreter. In the latter case, the script is
protected from undesired execution that might lead to job data corruption or loss.
#!/bin/bash
# job script

#!/usr/bin/env infinity-env
# job script
Infinity overview, 18th October 2013
-58-
Job Script
A job script can only be written in the bash language. The interpreter can be specified directly as the
bash interpreter or as the special infinity-env interpreter. In the latter case, the script is
protected from undesired execution that might lead to job data corruption or loss.
#!/bin/bash
# job script

#!/usr/bin/env infinity-env
# job script
[kulhanek@skirit 01.get_hostname]$ ./testme
ERROR: This script can be run as an Infinity job only!
The script is protected by the infinity-env command,
which permits the script execution only via
psubmit commands.
[kulhanek@skirit 01.get_hostname]$
Infinity overview, 18th October 2013
-59-
Job Script, other shells
A job script that acts as a wrapper:
#!/usr/bin/env infinity-env
# activate all required modules here
module add amber
module add cats
# execute script in different shell
./my_tcsh_script
my_tcsh_script
#!/usr/bin/tcsh
....
....
Infinity overview, 18th October 2013
-60-
Job submission, name restrictions
The job path can contain the following characters:
a-z A-Z 0-9 _+-.#/
The job script name can contain the following characters:
a-z A-Z 0-9 _+-.#
Infinity overview, 18th October 2013
-61-
Job submission
A Job is submitted to the batch system by the psubmit command:
psubmit destination job [resources] [syncmode]
destination:
• queue_name
• node_name@queue_name
• alias_name
• node_name@alias_name
job:
• job script name
• name of the job input file (for autodetected jobs)
resources:
• define the required resources; if not specified, default resources are used (in most cases, ncpus=1)
syncmode:
• determines the data transfer mode between the job input directory and the working directory on the computational node; by default, the "sync" mode is used
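For illustration, a submission combining all optional parts might look like this (a sketch; the queue, script name and resources are examples only):
$ psubmit short test_sander ncpus=4,maxcpuspernode=4 sync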
Infinity overview, 18th October 2013
-62-
Job submission, cont.
[kulhanek@skirit 01.get_hostname]$ psubmit short testme
Job name         : testme
Job title        : testme (Job type: generic)
Job directory    : skirit.ics.muni.cz:/auto/home/kulhanek/Tests/01.get_hostname
Job project      : -none- (Collection: -none-)
Site name        : metacentrum (Torque server: arien.ics.muni.cz)
Job key          : d70c5f4b-8cd8-42b7-bd90-47a2efad4fe3
========================================================
Req destination  : short
Req resources    : -none-
Req sync mode    : -none-
----------------------------------------
Alias            : -none-
Queue            : short
Default resources: maxcpuspernode=8
----------------------------------------
Number of CPUs   : 1
Number of GPUs   : 0
Max CPUs / node  : 8
Number of nodes  : 1
Resources        : nodes=1:ppn=1
Sync mode        : sync
----------------------------------------
Start after      : -not defined-
Exported modules : abs:2.0.4216
Excluded files   : -none-
========================================================
Do you want to submit the job to the Torque server (YES/NO)?
>
Read carefully whether the job submission specification is correct, then confirm it; confirmation is required by default.
Infinity overview, 18th October 2013
-63-
Job submission, cont.
Confirmation of job submission can be disabled:
1) using the -y option of the psubmit command
$ psubmit -y short get_hostname
2) temporarily in the terminal/script by the pconfirmsubmit command
$ pconfirmsubmit NO
Confirmation of job submission is temporarily changed !
Confirm submit setup value: NO
Infinity overview, 18th October 2013
-64-
Job submission, cont.
Confirmation of job submission can be disabled:
3) permanently via the abs-config command
[kulhanek@skirit 01.get_hostname]$ abs-config
*** ABS Configuration Centre ***
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Main menu
------------------------------------------------------------
1   - configure aliases
2   - configure confirmation of job submission, e-mail alerts, etc.
------------------------------------------------------------
i   - site info
s   - save changes
....
Infinity overview, 18th October 2013
-65-
Available queues
[kulhanek@skirit 01.get_hostname]$ pqueues
#
# Site name     : metacentrum
# Torque server : arien.ics.muni.cz
#
# Name            Pri     T     Q     R     O   Max UMax CMax       MaxWall Mod Required property
# --------------- --- ----- ----- ----- ----- ----- ---- ---- ------------- --- -----------------
  MetaSeminar       0     0     0     0     0     0    0    0   0d 00:00:00  SE q_metaseminar
  backfill         20     2     0     0     2  2000 1000   32   1d 00:00:00  SE q_backfill
  debian6          50     0     0     0     0  1000   50    0   1d 00:00:00  SE q_debian6
  debian6_long     51     1     0     0     1  1000   50    0  30d 00:00:00  SE q_debian6_long
  default          --> normal,short (routing queue)
  gpu              65   100     0    16    84    20    0    0   1d 00:00:00  SE q_gpu
  gpu_long         55    16     0     9     7    20    0    0   7d 00:00:00  SE q_gpu_long
  long             62   931   477   285   169  1000   70    0  30d 00:00:00  SE q_long
  ncbr_long        70    16     3     8     5    20    5    0  30d 00:00:00  SE q_ncbr_long
  ncbr_medium      65    72     0    18    54  1000   32    0   5d 00:00:00  SE q_ncbr_medium
  ncbr_single      64   228     0   112   116  1000  200    1   2d 00:00:00  SE q_ncbr_single
  normal           50  1220   411   194   615  1000  100    0   1d 00:00:00  SE q_normal
  orca             70     0     0     0     0   120   80   64  30d 00:00:00  SE orca
  orca16g          71     0     0     0     0   120   80   64  30d 00:00:00  SE orca16g
  preemptible      61     7     0     7     0     0  400   32  30d 00:00:00  SE q_preemptible
  privileged       65    25     0    19     6  1000   50   32  30d 00:00:00  SE q_privileged
  short            60   169     0     1   168  1000  250    0   0d 02:00:00  SE q_short
#
# Legend:
#   Pri - Priority, T - Total, Q - Queued, R - Running jobs
#   O - Other (completed, exiting, hold) jobs
#   Max - Max running jobs, UMax - Max user running jobs
#   CMax - Max CPUs per job, MaxWall - Max wall time per job
#   Mod - Started/(-)Stopped : Enabled/(-)Disabled
By default, only queues available to the logged-in user are shown.
Infinity overview, 18th October 2013
-66-
Available nodes
[kulhanek@skirit 01.get_hostname]$ pnodes -g perian-2
#
# Site name     : metacentrum
# Torque server : arien.ics.muni.cz
#
# Group : perian-2
# -----------------------------------------------------------------------------------
# Common properties for the cluster group:
# brno cl_perian debian debian50 em64t home_perian linux ncbr nfs4 per q_ncbr_long
# q_ncbr_medium q_ncbr_single x86 x86_64 xeon
# -----------------------------------------------------------------------------------
# Node Name                    CPUs Free Status               Extra properties
# ---------------------------- ---- ---- -------------------- -----------------------
  perian1-2.ncbr.muni.cz          8    8 free                 nodecpus8,per1,quadcore
  perian2-2.ncbr.muni.cz          8    0 job-exclusive        nodecpus8,per1,quadcore
  perian5-2.ncbr.muni.cz          8    0 job-exclusive        nodecpus8,per1,quadcore
  perian6-2.ncbr.muni.cz          8    8 free                 nodecpus8,per1,quadcore
  perian7-2.ncbr.muni.cz          8    8 free                 nodecpus8,per1,quadcore
  perian8-2.ncbr.muni.cz          8    7 free                 nodecpus8,per1,quadcore
  perian12-2.ncbr.muni.cz         8    1 free                 nodecpus8,per2,quadcore
  perian13-2.ncbr.muni.cz         8    0 job-exclusive        nodecpus8,per2,quadcore
  perian14-2.ncbr.muni.cz         8    0 job-exclusive        nodecpus8,per2,quadcore
# -----------------------------------------------------------------------------------
# All node properties:
# brno cl_perian debian debian50 em64t home_perian hyperthreading linux ncbr nfs4
# nodecpus12 nodecpus8 per per1 per2 per3 per4 q_ncbr_long q_ncbr_medium q_ncbr_single
# quadcore sixcore x86 x86_64 xeon
# -----------------------------------------------------------------------------------
# Total number of CPUs : 484
# Free CPUs            : 120
Infinity overview, 18th October 2013
-67-
Resources
Resource token   Avail  Meaning
ncpus            NMI    number of requested CPUs (exception: it can be specified as a plain number)
ngpus            NM     number of requested GPUs
props            NM     required properties of computational nodes
mem              NMI    required memory
vmem             M      required virtual memory
scratch          M      required scratch size
scratch_type     NMI    determines the scratch type
maxcpuspernode   NMI    maximum number of CPUs per computational node
                        (the number of nodes is determined from ncpus and maxcpuspernode)
walltime         MI     required walltime for job execution
umask            NMI    umask used for new files and directories created on computational nodes

Avail: N = NCBR clusters, M = MetaCentrum&CERIT-SC, I = IT4I
Infinity overview, 18th October 2013
-68-
Resources, cont.
Resource token   Avail  Meaning
account          I      account name (related to a project)
place            I      determines the placing of job chunks
cpu_freq         I      requested processor frequency
Infinity overview, 18th October 2013
-69-
Resource specification
Resources are specified as a comma-separated list, for example:
8,props=cl_gram
(ncpus=8,props=cl_perian)
Note: Not all resource tokens are available on all sites!
Infinity overview, 18th October 2013
-70-
Resources, properties
Single property specification:
props=cl_perian
  select nodes that have the cl_perian property
Property combination:
props=brno#infiniband
  select nodes that have the brno AND infiniband properties
Property exclusion:
props=^cl_gram
  select any node except nodes with the cl_gram property
Node exclusion:
props=^full.node.name
  (only on the metacentrum and cerit-sc sites)
props=cl_gram:^gram2.zcu.cz
  select any node with the cl_gram property except the gram2.zcu.cz node
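For illustration, the property syntax can be combined with other resource tokens in a psubmit call (a sketch; the queue and script name are examples, not a prescribed workflow):
$ psubmit normal job_script ncpus=8,props=brno#infiniband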
Infinity overview, 18th October 2013
-71-
Synchronization modes
The synchronization mode determines how job data are transferred between the job
input directory and the working directory on the computational node:
Supported modes: sync, nosync, jobdir
Infinity overview, 18th October 2013
-72-
Synchronization modes, sync
sync: Data are copied from the job input directory to the working directory on the
computational node. The working directory is created on the scratch of the computational
node. After the job is finished, all data from the working directory are copied back to the
job input directory. Finally, the working directory is removed if the data transfer was
successful.
[Diagram: rsync transfers data between /job/input/dir on the User Interface (UI, frontend) and
/scratch/job_id/ on the Worker Node (WN); the scratch location is determined by the scratch_type resource token.]
Note: sync is the default synchronization mode.
Infinity overview, 18th October 2013
-73-
Synchronization modes, nosync
nosync: Data are copied from the job input directory to the working directory on the
computational node. The working directory is created on the scratch of the computational
node. After the job is finished, the data are kept on the computational node in the working
directory.
[Diagram: rsync transfers data only from /job/input/dir on the User Interface (UI, frontend) to
/scratch/job_id/ on the Worker Node (WN); the scratch location is determined by the scratch_type resource token.]
Note: this mode should be used only in very special cases!
Infinity overview, 18th October 2013
-74-
Synchronization modes, jobdir
jobdir: Job data must be on a shared volume, which is accessible on both the UI and the WN.
No data are transferred.
[Diagram: /job/input/dir resides on a shared volume visible from both the User Interface (UI, frontend) and the Worker Node (WN).]
Note: required by turbomole if it is executed in parallel on more than one computational node.
Infinity overview, 18th October 2013
-75-
Resources, scratch_type
Value    Avail  Meaning
local    NM     /scratch/$USER/$INF_JOB_ID/main
local    I      /lscratch/$INF_JOB_ID/main
shared   I      /scratch/$USER/$INF_JOB_ID/main
shmem    NMI    /dev/shm/$USER/$INF_JOB_ID
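For illustration, the scratch type can be requested together with other resources (a sketch; the queue and script name are hypothetical):
$ psubmit short job_script ncpus=1,scratch_type=shmem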
Infinity overview, 18th October 2013
-76-
Resources, walltime
The walltime token is used to specify the maximum execution time of a job. Time can be
specified in two ways:
walltime=hhh:mm:ss
walltime=Nu, where u is one of: s (seconds), m (minutes), h (hours), d (days), w (weeks)
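For illustration, both forms can be used in a resource list (a sketch; the queue and script name are hypothetical):
$ psubmit long md_prod ncpus=8,walltime=2d
$ psubmit long md_prod ncpus=8,walltime=48:00:00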
-77-
Job monitoring, pinfo
The job progress can be monitored using the pinfo command invoked in the input job
directory or in the working directory on the computational node.
[kulhanek@skirit 01.get_hostname]$ pinfo
Job name         : testme
Job ID           : 2309124.arien.ics.muni.cz
Job title        : testme (Job type: generic)
Job directory    : skirit.ics.muni.cz:/auto/home/kulhanek/Tests/01.get_hostname
Job project      : -none- (Collection: -none-)
Site name        : metacentrum (Torque server: arien.ics.muni.cz)
Job key          : d647eadd-1c4c-4864-8018-e820dc51666d
========================================================
Req destination  : short
Req resources    : -none-
Req sync mode    : -none-
----------------------------------------
Alias            : -none-
Queue            : short
Default resources: maxcpuspernode=8
----------------------------------------
Number of CPUs   : 1
Number of GPUs   : 0
Max CPUs / node  : 8
Number of nodes  : 1
Resources        : nodes=1:ppn=1
Sync mode        : sync
----------------------------------------
Start after      : -not defined-
Exported modules : abs:2.0.4216
Excluded files   : -none-
========================================================               (job submission summary)
Main node        : tarkil10-1.cesnet.cz
Working directory: /scratch/kulhanek/2309124.arien.ics.muni.cz
----------------------------------------
CPU 001          : tarkil10-1.cesnet.cz
========================================================
Job was submitted on 2013-04-03 18:38:54
 and was queued for 0d 00:00:32
Job was started on 2013-04-03 18:39:26
 and is running for 0d 00:00:11                                        (current job status)
Infinity overview, 18th October 2013
-78-
Job monitoring, pgo
The pgo command can be used to change directory among the current directory, the job
input directory and/or the job working directory on the computational node.
[Diagram: pgo <job_id> moves from /any/directory to the job input directory on the User Interface (UI, frontend); pgo without an argument moves from the job input directory to the working directory on the Computational/Worker Node (WN).]
Infinity overview, 18th October 2013
-79-
Job monitoring, pgo
Use JobID to move from any directory to the job input directory:
[kulhanek@skirit ~]$ pgo 2308394
# ST  Job ID   User      Job Title  Queue  NCPUs NGPUs NNods Last change/Duration
# --  -------  --------  ---------  -----  ----- ----- ----- --------------------
  R   2308394  kulhanek  testme     short      1     0     1 0d 00:00:15
    > /auto/home/kulhanek/Tests/01.get_hostname
      tarkil10-1.cesnet.cz

INFO: The current directory was set to:
      /auto/home/kulhanek/Tests/01.get_hostname
[kulhanek@skirit 01.get_hostname]$
Infinity overview, 18th October 2013
-80-
Job monitoring, pgo II
Use pgo in the job input directory to move to the computational node
[kulhanek@skirit 01.get_hostname]$ pgo
Job name         : testme
Job ID           : 2308394.arien.ics.muni.cz
Job title        : testme (Job type: generic)
Job directory    : skirit.ics.muni.cz:/auto/home/kulhanek/Tests/01.get_hostname     (job input directory on UI)
Site name        : metacentrum (Torque server: arien.ics.muni.cz)
Job key          : 12ea4a7b-a6b7-431f-861c-1b26eaf27350
========================================================
Req destination  : short
....
Sync mode        : sync
----------------------------------------
Start after      : -not defined-
Exported modules : abs:2.0.4216
Excluded files   : -none-
========================================================
Main node        : tarkil10-1.cesnet.cz
Working directory: /scratch/kulhanek/2308394.arien.ics.muni.cz                      (working directory on WN)
----------------------------------------
CPU 001          : tarkil10-1.cesnet.cz
========================================================
Job was submitted on 2013-04-03 15:22:58
 and was queued for 0d 00:00:08
Job was started on 2013-04-03 15:23:06
 and is running for 0d 00:03:11

>>> Satisfying job for pgo action ...
# ST  Job ID   Job Title  Queue  NCPUs NGPUs NNods Last change/Duration
# --  -------  ---------  -----  ----- ----- ----- --------------------
  R   2308394  testme     short      1     0     1 0d 00:03:11
      tarkil10-1.cesnet.cz

> Site and job exported modules were recovered.
[kulhanek@tarkil10-1 2308394.arien.ics.muni.cz]$
Infinity overview, 18th October 2013
-81-
Control files
Several control files are created in the job input directory during job submission, in the
course of job execution and at its termination:
• *.info: job status file (XML file)
• *.infex: script executed by the batch system (wrapper)
• *.infout: standard output from the execution of the *.infex script; it is necessary to analyze it if the job did not terminate successfully
• *.nodes: list of the computational nodes allocated for the job
• *.gpus: list of the GPU cards allocated for the job
• *.key: unique ID of the job
• *.stdout: standard output from the job script
Note: It is not wise to delete these files if the job is still running.
Infinity overview, 18th October 2013
-82-
pinfo command, other functions
[kulhanek@perian 02.prod-a]$ pinfo -c -r
# ST  Job ID   Job Title        Queue  NCPUs NGPUs NNods Last change/Duration
# --  -------  ---------------  -----  ----- ----- ----- --------------------
  ER  2007744  precycleJob#101  gpu        1     1     1
  F   2009982  precycleJob#104  gpu        1     1     1 2013-01-26 07:52:33
  F   2010788  precycleJob#106  gpu        1     1     1 2013-01-26 18:44:23
....
Final statistics
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<
Number of all jobs       =   100
Number of prepared jobs  =     0    0.00 %
Number of submitted jobs =     0    0.00 %
Number of running jobs   =     0    0.00 %
Number of finished jobs  =    98   98.00 %
Number of other jobs     =     2    2.00 %

state     min           max           total          number  averaged
-------   -----------   -----------   ------------   ------  -----------
queued    0d 00:00:02   0d 04:04:44    0d 10:04:56       98   0d 00:06:10
running   0d 02:22:41   0d 10:55:56   11d 07:23:39       98   0d 02:46:09

Total CPU time = 11d 07:23:39
New features:
• -r: recursive mode (gather information from all info files in the current directory and all subdirectories)
• -c: compact mode (job info on a single line)
• -l: print the job comment in compact mode
• -p: print the job path in compact mode
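For example, the options can be combined to get a compact recursive overview that also shows the job paths (illustrative):
$ pinfo -c -r -p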
Infinity overview, 18th October 2013
-83-
pqstat/pjobs commands
Job/batch server monitoring:
• pqstat: list jobs in the batch system or in a given queue
• pjobs: list jobs of the logged-in or another user
Interesting options:
• -c: print completed jobs (by default they are not shown)
• -l: print the job comment
• -p: print the job path
• -f: print completed jobs ordered by time of termination (pjobs)
• -s: filter jobs
[kulhanek@perian 02.prod-a]$ pjobs -p -l
#
# Site name     : metacentrum
# Torque server : arien.ics.muni.cz
#
# ST  Job ID   User      Job Title  Queue      NCPUs NGPUs NNods Last change/Duration
# --  -------  --------  ---------  ---------  ----- ----- ----- --------------------
  R   2230607  kulhanek  run_abf    ncbr_long      8     0     1 17d 13:45:42
    > /auto/smaug1.nfs4/home/kulhanek/01.Projects/55.Zora/alpha/03.dimer/04.water/03.abf
      perian31-2.ncbr.muni.cz
[kulhanek@perian 02.prod-a]$
The last line below each job is the job comment: for queued jobs it shows the status provided by the batch system, for running jobs the name of the computational node.
Infinity overview, 18th October 2013
Terminate running/queued jobs
The queued or running job can be prematurely terminated by the pkill command:
[kulhanek@perian normal]$ pkill
Job name         : test_normal
Job ID           : 2311306.arien.ics.muni.cz
Job title        : test_normal (Job type: generic)
Job directory    : perian.ncbr.muni.cz:/home/kulhanek/Tests/normal
Job project      : -none- (Collection: -none-)
Site name        : metacentrum (Torque server: arien.ics.muni.cz)
Job key          : 4bc17745-75e7-4b55-a97d-ef40fe83ba93
========================================================
Req destination  : short
Req resources    : -none-
Req sync mode    : -none-
----------------------------------------
Alias            : -none-
Queue            : short
Default resources: maxcpuspernode=8
----------------------------------------
Number of CPUs   : 1
Number of GPUs   : 0
Max CPUs / node  : 8
Number of nodes  : 1
Resources        : nodes=1:ppn=1
Sync mode        : sync
----------------------------------------
Start after      : -not defined-
Exported modules : abs:2.0.4216|gaussian:09.C1
Excluded files   : -none-
========================================================
Main node        : tarkil10-1.cesnet.cz
Working directory: /scratch/kulhanek/2311306.arien.ics.muni.cz
----------------------------------------
CPU 001          : tarkil10-1.cesnet.cz
========================================================
Job was submitted on 2013-04-04 14:26:53
 and was queued for 0d 00:02:55
Job was started on 2013-04-04 14:29:48
 and is running for 0d 00:01:16

>>> Satisfying job(s) for pkill action ...
# ST  Job ID   Job Title    Queue  NCPUs NGPUs NNods Last change/Duration
# --  -------  -----------  -----  ----- ----- ----- --------------------
  R   2311306  test_normal  short
    > /home/kulhanek/Tests/normal
      tarkil10-1.cesnet.cz

Do you want to kill listed jobs (YES/NO)?
> YES
Listed jobs were killed!

The job is terminated and all data are kept in the working directory on the computational node!
You have to clean up the data yourself manually!
Infinity overview, 18th October 2013
-85-
Terminate running/queued jobs II
The job can be killed softly with the -s option.
[kulhanek@perian normal]$ pkill -s
...
Do you want to softly kill listed job (YES/NO)?
> YES
Sending TERM signal to tarkil10-1.cesnet.cz:/scratch/kulhanek/2311321.arien.ics.muni.cz ...
>>> Process ID: 32720
[kulhanek@perian normal]$ pinfo
....
========================================================
Main node        : tarkil10-1.cesnet.cz
Working directory: /scratch/kulhanek/2311321.arien.ics.muni.cz
Job exit code    : 143
----------------------------------------
CPU 001          : tarkil10-1.cesnet.cz
========================================================
Job was submitted on 2013-04-04 14:35:18
 and was queued for 0d 00:00:08
Job was started on 2013-04-04 14:35:26
 and was running for 0d 00:01:59
Job was finished on 2013-04-04 14:37:25

The job script is terminated, but the job itself finishes as usual. It means that in the sync
synchronization mode the data are copied back to the UI and the working directory on the WN
is removed.
Infinity overview, 18th October 2013
-86-
Terminate running/queued jobs III
All running or queued jobs can be killed by the pkillall command. The job list can be filtered by the -s option.
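A minimal usage sketch (illustrative; pkillall is expected to ask for the same YES/NO confirmation as pkill, which is an assumption here):
[kulhanek@perian ~]$ pkillall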
Infinity overview, 18th October 2013
-87-
Synchronize jobs
Intermediate data of running jobs can be copied between the job working and input directories by the psync command. It can be called either from the job input directory or from the working directory.
psync <file1> [file2] ...
psync --all
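For example, to synchronize a single intermediate file of a running job (prod.out is a hypothetical file name):
[kulhanek@perian normal]$ psync prod.out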
Infinity overview, 18th October 2013
-88-
pstatus, short job status
The pstatus command prints short status information about the job. It can be executed
without any argument in the job input or working directory. If there is more than one info
file then the status of the job with the largest JobID is printed.
Printed abbreviation   Reason
P                      job is prepared for submission (used in conjunction with collections)
Q                      job is queued or held in the batch system
R                      job is running
F                      job is finished
K                      job was killed by the pkill command
IN                     job is in an inconsistent state, e.g. the info file shows the job in the
                       running state but the batch system shows it as finished
UN                     no info file in the current directory
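A minimal usage sketch (the exact output layout is an assumption; the abbreviations come from the table above):
[kulhanek@perian normal]$ pstatus
R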
Infinity overview, 18th October 2013
-89-
Aliases
Aliases are shortcuts for resource specifications. They can be defined by the abs-config or paliases programs. When paliases is run without an argument, it prints the defined aliases:
[kulhanek@perian test]$ paliases
#
# Site name     : metacentrum
# Torque server : arien.ics.muni.cz
#
#   Name           Destination      Sync Mode    Resources
# -------------- ---------------- ------------ ---------------------
U   gpu            gpu              sync         props=cl_gram,ngpus=1
Notes:
• first column: U = user defined, S = system defined
• Name: an alias might have the same name as any queue
• Destination: [node@]queue
• Resources: any resource token permitted on the site
Infinity overview, 18th October 2013
-90-
Aliases, define new alias
1) using abs-config, recommended
[kulhanek@skirit 01.get_hostname]$ abs-config
          *** ABS Configuration Centre ***
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Main menu
------------------------------------------------------------
 1 - configure aliases
 2 - configure confirmation of job submission, e-mail alerts, etc.
------------------------------------------------------------
 i - site info
 s - save changes
....
2) using paliases, for experienced users
[kulhanek@perian test]$ paliases add al1 normal sync props=brno
[kulhanek@perian test]$
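The new alias should then appear in the paliases listing (the layout below is reconstructed from the previous slide and is illustrative only):
[kulhanek@perian test]$ paliases
...
U   al1            normal           sync         props=brno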
Infinity overview, 18th October 2013
-91-
Aliases, resource resolution
Resources are resolved with increasing priority:
  site default resources : maxcpuspernode=8
  alias resources        : maxcpuspernode=64,props=linux,ncpus=32
  psubmit resources      : ncpus=16,mem=30gb

Final resources: ncpus=16,maxcpuspernode=64,props=linux,mem=30gb
Infinity overview, 18th October 2013
-92-
Aliases, resource resolution
Resources are resolved with increasing priority; each higher-priority source overwrites the tokens it redefines and extends the set with new ones:
  site default resources : maxcpuspernode=8
        | overwrites / extends
  alias resources        : maxcpuspernode=64,props=linux,ncpus=32
        | overwrites / extends
  psubmit resources      : ncpus=16,mem=30gb

Final resources: ncpus=16,maxcpuspernode=64,props=linux,mem=30gb
Infinity overview, 18th October 2013
-93-
Execution of applications
Infinity overview, 18th October 2013
-94-
sander/pmemd
The sander/pmemd programs are applications from the AMBER package; they perform molecular dynamics simulations. Detailed information can be found at http://ambermd.org
#!/bin/bash
# activate module with sander/pmemd
# application
module add amber:12.0
# execute the sander program
sander -O -i prod.in -p topology.parm7 -c input.rst7
Job script:
• only essential logic is present
• in most cases, the script is the same for the sequential and parallel runs of the same application
• data are referenced relative to the job directory
Infinity overview, 18th October 2013
-95-
sander – single/parallel execution
The only difference between sequential and parallel execution is in the resource
specification during psubmit. The input data and the job script are the same!
$ psubmit short test_sander ncpus=1      (for a sequential run, ncpus=1 can be omitted)
$ psubmit short test_sander ncpus=2

The *.stdout file on the computational node shows which module build was used:
  sequential run: Module build: amber:12.0:x86_64:single
  parallel run:   Module build: amber:12.0:x86_64:para
Infinity overview, 18th October 2013
-96-
gaussian, manual script preparation
The gaussian package contains tools for quantum chemical calculations. A detailed description can be found at http://www.gaussian.com
#!/bin/bash
# activate gaussian module
module add gaussian:09.C1
# execute g09
g09 input
The input file input.com must contain the number of CPUs requested for parallel execution (this number MUST be consistent with the resource specification given to the psubmit command):
%NProcShared=4
$ psubmit short test_gaussian ncpus=4
Infinity overview, 18th October 2013
-97-
gaussian, autodetection
The ABS subsystem is able to recognize the gaussian job type. The job script is created automatically and the input file is automatically updated according to the requested resources.
$ module add gaussian
$ psubmit short input.com ncpus=4
input.com is the gaussian input file (it must have the .com extension); it is NOT a job script!
Autodetection:
• the job script is created automatically with the correct gaussian binary name (g98, g03, g09)
• %NProcShared is added or updated in the input file
• a check is made that only a single node is requested (parallel execution is limited to a single node)
[kulhanek@perian test]$ psubmit short input.com
Job name      : input
Job title     : input (Job type: gaussian)
Job directory : perian.ncbr.muni.cz:/home/kulhanek/Tests/test
Job project   : -none- (Collection: -none-)
Site name     : metacentrum (Torque server: arien.ics.muni.cz)
Job key       : 384e3be5-9dac-405e-b235-74609ae4c486
========================================================
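A minimal sketch of the input-file update done by the autodetection: after, e.g., psubmit short input.com ncpus=4, the top of input.com would contain the updated directive (the route line shown is illustrative only):
%NProcShared=4
#P B3LYP/6-31G* Opt
...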
Infinity overview, 18th October 2013
-99-
gaussian – single/parallel execution
The only difference between sequential and parallel execution is in the resource
specification during psubmit. The input data are the same!
$ psubmit short input.com ncpus=1      (for a sequential run, ncpus=1 can be omitted)
$ psubmit short input.com ncpus=4
Infinity overview, 18th October 2013
-100-
precycle – restartable MD
The aim of precycle is to split a long molecular dynamics job into smaller chunks that can then be run more efficiently in queues with shorter execution walltimes, such as normal or backfill.
1) activate !!!dynutil-new!!! module
[kulhanek@perian normal]$ module add dynutil-new:
# Module specification: dynutil-new: (add action)
# =============================================================
  Requested CPUs     :   1    Requested GPUs     :   0
  Num of host CPUs   :   4    Num of host GPUs   :   0
  Requested nodes    :   1
  Host arch tokens   : i686,noarch,x86_64
  Host SMP CPU model : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz [Total memory: 5500 MB]
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Exported module    : dynutil-new:4.0.4241
  Module build       : dynutil-new:4.0.4241:noarch:single
2) get precycle input file template
[kulhanek@perian ~]$ precycle-prep
All neccessary files for precycle were copied to working directory.
[kulhanek@perian ~]$
Infinity overview, 18th October 2013
-101-
precycle – restartable MD, II
3) update the precycle script
....
# input topology ---------------------------------------------------------
# file name without path, this file has to be present in working directory
export PRECYCLE_TOP=""

# input coordinates ------------------------------------------------------
# file name without path, this file has to be present in working directory
# this file is used only for the first production run
export PRECYCLE_CRD=""

# control file for MD, it has to be compatible with the used MD program ---
# file name without path, this file has to be present in working directory
export PRECYCLE_CONTROL="prod.in"

# transform control file (YES/NO) ----------------------------------------
# if YES then the RANDOM key is substituted by a random key
export PRECYCLE_TRANSFORM_CONTROL="YES"

# index of first production stage ----------------------------------------
export PRECYCLE_START="1"

# index of final production stage ----------------------------------------
export PRECYCLE_STOP=""
....
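As an illustration, the key variables filled in for the job summarized on the next slide might look like this (values taken from that job summary; they are an example only, not defaults):
export PRECYCLE_TOP="1DC1-DNA_fix_sol_joined.parm7"
export PRECYCLE_CRD="relax10.rst7"
export PRECYCLE_CONTROL="prod.in"
export PRECYCLE_STOP="200"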
Infinity overview, 18th October 2013
-102-
precycle – restartable MD, III
4) submit the job to batch system via psubmit command
• the presence of input files is checked
• the job is started from the last available restart file
[kulhanek@perian precycle]$ psubmit normal precycleJob 8
# ------------------------------------------------
# precycle job summary
# ------------------------------------------------
# Job script name     : precycleJob
# System topology     : 1DC1-DNA_fix_sol_joined.parm7
# Initial coordinates : relax10.rst7
# Control file        : prod.in
# Compress trajectory : -no-
# Name format         : prod%03d
# Storage directory   : storage
# Internal cycles     :   1
# Starting stage      :   1
# Final stage         : 200
# Current stage       : 198 (found restart: storage/prod198.crd)
# MD engine module    : amber:12.0
# MD engine program   : pmemd

Job name      : precycleJob
Job title     : precycleJob#198 (Job type: precycle)
Job directory : perian.ncbr.muni.cz:/home/kulhanek/Tests/precycle
Job project   : -none- (Collection: -none-)
Site name     : metacentrum (Torque server: arien.ics.muni.cz)
Job key       : 2a57df2f-9119-4f59-b113-b3e37415502
Infinity overview, 18th October 2013
-103-
precycle – restartable MD, IV
5) in the case of recoverable failure, simply resubmit the job to the batch system
# WARNING: this type of job supports an automatic job restart of crashed job!!!
#          in the case of failure, please just resubmit a job into queue system
#          without any modification of this script
[kulhanek@perian precycle]$ psubmit normal precycleJob 8
Employing GPU (available in the metacentrum site)
1) change the MD module and MD core used in precycleJob
# program to perform MD --------------------------------------------------
export MD_CORE="pmemd.cuda"
export MD_MODULE="pmemd-cuda:12.1"
2) request GPU resources during job submission
[kulhanek@perian precycle]$ psubmit gpu precycleJob ngpus=1,props=cl_gram
Infinity overview, 18th October 2013
-104-