Platforms and Platform Configuration

This section describes key aspects of the platforms to which the IPS has been ported, the locations on each platform that are relevant to the IPS, and the platform configuration settings, both in general and as they apply to the platforms described below.

Important Note - while this documentation is intended to remain up to date, it may not always reflect the current status of the machines. If you run into problems, check that the information below is accurate against the machine's website. If you are still having problems, contact the framework developers.

Ported Platforms

Each subsection contains information about the platform in question. If you are porting the IPS to a new platform, these are the items you will need to know, and the files and directories you will need to create, in order to port the IPS. You will also need a platform configuration file (described below). Available queue names are listed, with the most common ones in bold.

The platforms below fall into the following categories:

  • general production machines - large machines on which the majority of runs (particularly production runs) are made.
  • experimental systems - production or shared machines used by a subset of SWIM members for specific research projects. It may also be difficult for others to obtain accounts on these systems.
  • formerly used systems - machines to which the IPS was ported but on which we no longer have time, that have been retired by their hosting site, or that are no longer in wide use.
  • single user systems - laptop or desktop machines for testing small problems.

General Production

Franklin

Franklin is a Cray XT4 managed by NERSC.

  • Account: You must have an account at NERSC and be added to the SWIM project’s group (m876) to log on and access the set of physics binaries in the PHYS_BIN.
  • Logging on - ssh franklin.nersc.gov -l <username>
  • Architecture - 9,572 nodes, 4 cores per node, 8 GB memory per node
  • Environment:
    • OS - Cray Linux Environment (CLE)
    • Batch scheduler/Resource Manager - PBS, Moab
    • Queues - debug, regular, low, premium, interactive, xfer, iotask, special
    • Parallel Launcher (e.g., mpirun) - aprun
    • Node Allocation policy - exclusive node allocation
  • Project directory - /project/projectdirs/m876/
  • Data Tree - /project/projectdirs/m876/data/
  • Physics Binaries - /project/projectdirs/m876/phys-bin/phys/
  • WWW Root - /project/projectdirs/m876/www/<username>
  • WWW Base URL - http://portal.nersc.gov/project/m876/<username>

Hopper

Hopper is a Cray XE6 managed by NERSC.

  • Account: You must have an account at NERSC and be added to the SWIM project’s group (m876) to log on and access the set of physics binaries in the PHYS_BIN.
  • Logging on - ssh hopper.nersc.gov -l <username>
  • Architecture - 6,384 nodes, 24 cores per node, 32 GB memory per node
  • Environment:
    • OS - Cray Linux Environment (CLE)
    • Batch scheduler/Resource Manager - PBS, Moab
    • Queues - debug, regular, low, premium, interactive
    • Parallel Launcher (e.g., mpirun) - aprun
    • Node Allocation policy - exclusive node allocation
  • Project directory - /project/projectdirs/m876/
  • Data Tree - /project/projectdirs/m876/data/
  • Physics Binaries - /project/projectdirs/m876/phys-bin/phys/
  • WWW Root - /project/projectdirs/m876/www/<username>
  • WWW Base URL - http://portal.nersc.gov/project/m876/<username>

Stix

Stix is an SMP hosted at PPPL.

  • Account: You must have an account at PPPL to access their Beowulf systems.
  • Logging on:
    1. Log on to the PPPL vpn (https://vpn.pppl.gov)
    2. ssh <username>@portal.pppl.gov
    3. ssh portalr5
  • Architecture - 80 cores, 440 GB memory
  • Environment:
    • OS - linux
    • Batch scheduler/Resource Manager - PBS (Torque), Moab
    • Queues - smpq (this is how you specify that you want to run your job on stix)
    • Parallel Launcher (e.g., mpirun) - mpiexec (MPICH2)
    • Node Allocation policy - node sharing allowed (whole machine looks like one node)
  • Project directory - /p/swim1/
  • Data Tree - /p/swim1/data/
  • Physics Binaries - /p/swim1/phys/
  • WWW Root - /p/swim/w3_html/<username>
  • WWW Base URL - http://w3.pppl.gov/swim/<username>

Experimental Systems

Swim

Swim is an SMP hosted by the fusion theory group at ORNL.

  • Account: You must have an account at ORNL and be given an account on the machine.
  • Logging on - ssh swim.ornl.gov -l <username>
  • Architecture - ? cores, ? GB memory
  • Environment:
    • OS - linux
    • Batch scheduler/Resource Manager - None
    • Parallel Launcher (e.g., mpirun) - mpirun (OpenMPI)
    • Node Allocation policy - node sharing allowed (whole machine looks like one node)
  • Project directory - None
  • Data Tree - None
  • Physics Binaries - None
  • WWW Root - None
  • WWW Base URL - None

Pacman

Pacman is a Linux cluster hosted at ARSC.

  • Account: You must have an account to log on and use the system.
  • Logging on - ?
  • Architecture:
    • 88 nodes, 16 cores per node, 64 GB per node
    • 44 nodes, 12 cores per node, 32 GB per node
  • Environment:
    • OS - Red Hat Linux 5.6
    • Batch scheduler/Resource Manager - Torque (PBS), Moab
    • Queues - debug, standard, standard_12, standard_16, bigmem, gpu, background, shared, transfer
    • Parallel Launcher (e.g., mpirun) - mpirun (OpenMPI?)
    • Node Allocation policy - node sharing allowed
  • Project directory - ?
  • Data Tree - ?
  • Physics Binaries - ?
  • WWW Root - ?
  • WWW Base URL - ?

Iter

Iter is a Linux cluster (?) hosted at ???.

  • Account: You must have an account to log on and use the system.
  • Logging on - ?
  • Architecture - ? nodes, ? cores per node, ? GB memory per node
  • Environment:
    • OS - linux
    • Batch scheduler/Resource Manager - ?
    • Queues - ?
    • Parallel Launcher (e.g., mpirun) - mpiexec (MPICH2)
    • Node Allocation policy - node sharing allowed
  • Project directory - /project/projectdirs/m876/
  • Data Tree - /project/projectdirs/m876/data/
  • Physics Binaries - /project/projectdirs/m876/phys-bin/phys/
  • WWW Root - ?
  • WWW Base URL - ?

Odin

Odin is a Linux cluster hosted at Indiana University.

  • Account: You must have an account to log on and use the system.
  • Logging on - ssh odin.cs.indiana.edu -l <username>
  • Architecture - 128 nodes, 4 cores per node, ? GB memory per node
  • Environment:
    • OS - GNU/Linux
    • Batch scheduler/Resource Manager - Slurm, Maui
    • Queues - there is only one queue, and it does not need to be specified in the batch script
    • Parallel Launcher (e.g., mpirun) - mpirun (OpenMPI)
    • Node Allocation policy - node sharing allowed
  • Project directory - None
  • Data Tree - None
  • Physics Binaries - None
  • WWW Root - None
  • WWW Base URL - None

Sif

Sif is a Linux cluster hosted at Indiana University.

  • Account: You must have an account to log on and use the system.
  • Logging on - ssh sif.cs.indiana.edu -l <username>
  • Architecture - 8 nodes, 8 cores per node, ? GB memory per node
  • Environment:
    • OS - GNU/Linux
    • Batch scheduler/Resource Manager - Slurm, Maui
    • Queues - there is only one queue, and it does not need to be specified in the batch script
    • Parallel Launcher (e.g., mpirun) - mpirun (OpenMPI)
    • Node Allocation policy - node sharing allowed
  • Project directory - None
  • Data Tree - None
  • Physics Binaries - None
  • WWW Root - None
  • WWW Base URL - None

Retired/Formerly Used Systems

Viz/Mhd

Viz and Mhd are SMP machines hosted at PPPL. These systems no longer appear to be online.

  • Account: You must have an account at PPPL to access their Beowulf systems.
  • Logging on:
    1. Log on to the PPPL vpn (https://vpn.pppl.gov)
    2. ssh <username>@portal.pppl.gov
  • Architecture - ? cores, ? GB memory
  • Environment:
    • OS - linux
    • Batch scheduler/Resource Manager - PBS (Torque), Moab
    • Parallel Launcher (e.g., mpirun) - mpiexec (MPICH2)
    • Node Allocation policy - node sharing allowed (whole machine looks like one node)
  • Project directory - /p/swim1/
  • Data Tree - /p/swim1/data/
  • Physics Binaries - /p/swim1/phys/
  • WWW Root - /p/swim/w3_html/<username>
  • WWW Base URL - http://w3.pppl.gov/swim/<username>

Pingo

Pingo was a Cray XT5 hosted at ARSC.

  • Account: You must have an account to log on and use the system.
  • Logging on - ?
  • Architecture - 432 nodes, 8 cores per node, ? memory per node
  • Environment:
    • OS - ?
    • Batch scheduler/Resource Manager - ?
    • Parallel Launcher (e.g., mpirun) - aprun
    • Node Allocation policy - exclusive node allocation
  • Project directory - ?
  • Data Tree - ?
  • Physics Binaries - ?
  • WWW Root - ?
  • WWW Base URL - ?

Jaguar

Jaguar is a Cray XT5 managed by OLCF.

  • Account: You must have an account for the OLCF and be added to the SWIM project group for accounting and file-sharing purposes, if we have time on this machine.
  • Logging on - ssh jaguar.ornl.gov -l <username>
  • Architecture - 18,688 nodes, 12 cores per node, 16 GB memory per node
  • Environment:
    • OS - Cray Linux Environment (CLE)
    • Batch scheduler/Resource Manager - PBS, Moab
    • Queues - debug, production
    • Parallel Launcher (e.g., mpirun) - aprun
    • Node Allocation policy - exclusive node allocation
  • Project directory - ?
  • Data Tree - ?
  • Physics Binaries - ?
  • WWW Root - ?
  • WWW Base URL - ?

Single User Systems

The IPS can be run on your laptop or desktop. Many of the items above are not present or relevant in a laptop/desktop environment. See the next section for sample platform configuration settings.

Platform Configuration File

The platform configuration file contains platform-specific information that the framework needs. Typically it does not need to be changed from one user to another or from one run to another (except for manual specification of allocation resources). For most of the platforms above, you will find platform configuration files of the form ips/<machine name>.conf. It is not likely that you will need to change this file, but it is described here for users working on experimental machines, for manual specification of resources, and for users who need to port the IPS to a new machine. The platform configuration file for Franklin is shown below.

HOST = franklin
MPIRUN = aprun
PHYS_BIN_ROOT = /project/projectdirs/m876/phys-bin/phys/
DATA_TREE_ROOT = /project/projectdirs/m876/data
DATA_ROOT = /project/projectdirs/m876/data/
PORTAL_URL = http://swim.gat.com:8080/monitor
RUNID_URL = http://swim.gat.com:4040/runid.esp

#######################################
# resource detection method
#######################################
NODE_DETECTION = checkjob # checkjob | qstat | pbs_env | slurm_env

#######################################
# manual allocation description
#######################################
TOTAL_PROCS = 16
NODES = 4
PROCS_PER_NODE = 4

#######################################
# node topology description
#######################################
CORES_PER_NODE = 4
SOCKETS_PER_NODE = 1

#######################################
# framework setting for node allocation
#######################################
# MUST ADHERE TO THE PLATFORM'S CAPABILITIES
#   * EXCLUSIVE : only one task per node
#   * SHARED : multiple tasks may share a node
# For single node jobs, this can be overridden allowing multiple
# tasks per node.
NODE_ALLOCATION_MODE = EXCLUSIVE # SHARED | EXCLUSIVE
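
For reference, the platform file is selected when the framework is launched. The sketch below shows one plausible invocation from a Franklin batch script; the script name and option names (ips.py, --config, --platform, --log) are assumptions here and may differ between IPS versions, so check the framework's command-line help before relying on them.

# Sketch only: launching the framework with an explicit platform
# configuration file. Script and option names are assumptions; verify
# them against your IPS version's command-line help.
ips.py --config=simulation.config --platform=ips/franklin.conf --log=sim.log

The individual settings in the platform configuration file are described below.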

HOST
    Name of the platform. Used by the portal.
MPIRUN
    Command used to launch parallel applications. Used by the task manager to launch parallel tasks on compute nodes. If you would like to launch a task directly without the parallel launcher (say, on an SMP-style machine or a workstation), set this to “eval”: it tells the task manager to launch the task directly as <binary> <args>.
*_ROOT
    Locations of data and binaries. Used by the configuration file and components to run the tasks of the simulation.
*_URL
    Portal URLs. Used to connect to and communicate with the portal.
NODE_DETECTION
    Method used to detect the number of nodes and processes in the allocation. If the value is “manual”, the manual allocation description is used. If nothing is specified, all of the methods are attempted and the first one to succeed is used. Note that if allocation detection fails, the framework will abort, killing the job. See Porting the IPS for more information [4].
TOTAL_PROCS
    Number of processes in the allocation [3].
NODES
    Number of nodes in the allocation [3].
PROCS_PER_NODE
    Number of processes per node (ppn) for the framework [2].
CORES_PER_NODE
    Number of cores per node [1].
SOCKETS_PER_NODE
    Number of sockets per node [1].
NODE_ALLOCATION_MODE
    ‘EXCLUSIVE’ for one task per node, and ‘SHARED’ if more than one task can share a node [1]. Simulations, components, and tasks can set their node usage allocation policies in the configuration file and on task launch.
[1] This value should not change unless the machine is upgraded to a different architecture or implements different allocation policies.
[2] Used in manual allocation detection; it will override any detected ppn value (if it is smaller than the machine's maximum ppn).
[3] Only used if manual allocation is specified, or if no detection mechanism is specified and none of the other mechanisms succeeds. It is the user's responsibility to make sure this value makes sense.
[4] The porting documentation is currently under construction. Use the Python script ips/framework/utils/test_resource_parsing.py to determine which automatic parsing method works for the platform in question. If none of them works, use the manual settings and contact the framework developers about developing a method to automatically detect the allocation.
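
As footnote [4] suggests, the easiest way to choose a NODE_DETECTION value is to run the parsing test script from within a batch allocation on the target machine. The sketch below assumes the script can simply be invoked with the system Python and no arguments; its exact options and output may differ, so inspect the script if in doubt.

# Run inside an interactive or scripted batch job so that the scheduler's
# environment variables and commands (checkjob, qstat, etc.) are visible.
# Assumes the script runs with no arguments; check the script itself.
python ips/framework/utils/test_resource_parsing.py

Whichever method the script reports as working can then be used as the NODE_DETECTION setting in the platform configuration file.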

Due to recent changes in the framework's resource management, some platforms may not have platform configuration files in the repository. Below is a list of those that are in the repository and work with the current framework.

  • franklin
  • hopper
  • odin
  • sif
  • stix [5]
  • swim [5]

In addition to these files, there is ips/workstation.conf, a sample platform configuration file for a workstation. It assumes that the workstation:

  • does not have a batch scheduler or resource manager
  • may have multiple cores and sockets
  • does not have portal access
  • will manually specify the allocation

HOST = workstation
MPIRUN = mpirun # eval
PHYS_BIN_ROOT = /home/<username>/phys-bin
DATA_TREE_ROOT = /home/<username>/swim_data
DATA_ROOT = /home/<username>/swim_data
#PORTAL_URL = http://swim.gat.com:8080/monitor
#RUNID_URL = http://swim.gat.com:4040/runid.esp

#######################################
# resource detection method
#######################################
NODE_DETECTION = manual # checkjob | qstat | pbs_env | slurm_env | manual

#######################################
# manual allocation description
#######################################
TOTAL_PROCS = 4
NODES = 1
PROCS_PER_NODE = 4

#######################################
# node topology description
#######################################
CORES_PER_NODE = 4
SOCKETS_PER_NODE = 1

#######################################
# framework setting for node allocation
#######################################
# MUST ADHERE TO THE PLATFORM'S CAPABILITIES
#   * EXCLUSIVE : only one task per node
#   * SHARED : multiple tasks may share a node
# For single node jobs, this can be overridden allowing multiple
# tasks per node.
NODE_ALLOCATION_MODE = SHARED # SHARED | EXCLUSIVE
[5] These files need to be updated to match the “allocation” size each time. Alternatively, you can use the command line to specify the number of nodes and processes per node.
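
For example, if a Stix batch job is given 8 processes on the single shared node, the manual allocation section of stix.conf would be edited to match before the run. This is only a sketch; the actual values must correspond to whatever allocation you request.

#######################################
# manual allocation description
#######################################
# Example values only - set these to match the allocation you requested.
TOTAL_PROCS = 8
NODES = 1
PROCS_PER_NODE = 8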