Developing against the Framework Application Programming Interface

IPS Services API

The IPS framework contains a set of managers that perform services for the components. A component uses the services API to access them, which hides the complexity of the framework implementation. Below are descriptions of the individual function calls, grouped by type. To call any of these functions in a component, replace ServicesProxy with self.services. The services object is passed to the component by the framework when the component is created.

Component Invocation

Component invocation in the IPS means one component is calling another component’s function. This API provides a mechanism to invoke methods on components through the framework. There are blocking and non-blocking versions, where the non-blocking versions require a second function to check the status of the call. Note that wait_call has an optional argument (block) that changes when and what it returns.

ServicesProxy.call(component_id, method_name, *args)

Invoke method method_name on component component_id with optional arguments *args. Return result from invoking the method.

ServicesProxy.call_nonblocking(component_id, method_name, *args)

Invoke method method_name on component component_id with optional arguments *args. Return call_id.

ServicesProxy.wait_call(call_id, block=True)

If block is True, return when the call has completed with the return code from the call. If block is False, raise ipsExceptions.IncompleteCallException if the call has not completed, and return the return value if it has.

ServicesProxy.wait_call_list(call_id_list, block=True)

Check the status of each of the calls in call_id_list. If block is True, return when all calls are finished. If block is False, raise ipsExceptions.IncompleteCallException if any of the calls have not completed, otherwise return. The return value is a dictionary of call_ids and return values.
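The non-blocking pattern described above can be sketched as follows. This is only an illustration: FakeServices and the local IncompleteCallException are hypothetical stand-ins that mimic the documented signatures so the example runs without the framework, and poll_until_done is a helper name invented here. In a real component the calls would go through self.services and the exception would come from ipsExceptions.

```python
class IncompleteCallException(Exception):
    """Stand-in for ipsExceptions.IncompleteCallException (assumption)."""


class FakeServices:
    """Hypothetical stand-in mimicking the documented call signatures."""

    def __init__(self):
        self._results = {}   # call_id -> value the call will return
        self._pending = {}   # call_id -> polls left before "completion"
        self._next_id = 0

    def call_nonblocking(self, component_id, method_name, *args):
        call_id = self._next_id
        self._next_id += 1
        self._results[call_id] = (component_id, method_name, args)
        self._pending[call_id] = 2  # pretend the call needs two polls to finish
        return call_id

    def wait_call(self, call_id, block=True):
        if block:
            self._pending[call_id] = 0
        elif self._pending[call_id] > 0:
            self._pending[call_id] -= 1
            raise IncompleteCallException(call_id)
        return self._results[call_id]


def poll_until_done(services, call_id, max_polls=10):
    """Poll a non-blocking call; fall back to a blocking wait at the end."""
    for polls in range(1, max_polls + 1):
        try:
            return services.wait_call(call_id, block=False), polls
        except IncompleteCallException:
            pass  # a real component would do other useful work here
    return services.wait_call(call_id, block=True), max_polls
```

The helper returns both the result and the number of polls it took, which makes it easy to see that wait_call with block=False raises until the call has completed.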

Task Launch

The task launch interface allows components to launch and manage the execution of (parallel) executables. Similar to the component invocation interface, the behavior of launch_task and the wait_task variants is controlled using the block keyword argument and different interfaces to wait_task.

ServicesProxy.launch_task(nproc, working_dir, binary, *args, **keywords)

Launch binary in working_dir on nproc processes. *args are any arguments to be passed to the binary on the command line. **keywords are any keyword arguments used by the framework to manage how the binary is launched. Keywords may be the following:

  • task_ppn : the processes per node value for this task
  • block : specifies that this task will block (or raise an exception) if not enough resources are available to run immediately. If True, the task will be retried until it runs. If False, an exception is raised indicating that there are not enough resources, but it is possible to eventually run. (default = True)
  • tag : identifier for the portal. May be used to group related tasks.
  • logfile : file name for stdout (and stderr) to be redirected to for this task. By default stderr is redirected to stdout, and stdout is not redirected.
  • whole_nodes : if True, the task will be given exclusive access to any nodes it is assigned. If False, the task may be assigned nodes that other tasks are using or may use.
  • whole_sockets : if True, the task will be given exclusive access to any sockets of nodes it is assigned. If False, the task may be assigned sockets that other tasks are using or may use.

Return task_id if successful. May raise exceptions related to opening the logfile, being unable to obtain enough resources to launch the task (ipsExceptions.InsufficientResourcesException), bad task launch request (ipsExceptions.ResourceRequestMismatchException, ipsExceptions.BadResourceRequestException) or problems executing the command. These exceptions may be used to retry launching the task as appropriate.

Note

This is a nonblocking function. Users must use a version of ServicesProxy.wait_task() to get the result.
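As a rough sketch of how the exceptions mentioned above might be used to retry a launch, the example below calls launch_task with block=False and tries again when resources are unavailable. FakeServices, the local InsufficientResourcesException, the helper launch_with_retry, and the max_attempts parameter are all hypothetical and exist only to make the example self-contained. A real component would use self.services and ipsExceptions.InsufficientResourcesException.

```python
class InsufficientResourcesException(Exception):
    """Stand-in for ipsExceptions.InsufficientResourcesException (assumption)."""


class FakeServices:
    """Hypothetical stand-in: resources become free after a few attempts."""

    def __init__(self, busy_attempts):
        self.busy_attempts = busy_attempts
        self.launched = []

    def launch_task(self, nproc, working_dir, binary, *args, **keywords):
        # With block=False, refuse the launch while resources are "busy".
        if not keywords.get("block", True) and self.busy_attempts > 0:
            self.busy_attempts -= 1
            raise InsufficientResourcesException(binary)
        self.launched.append((nproc, working_dir, binary, args))
        return len(self.launched)  # task_id

    def wait_task(self, task_id):
        return 0  # pretend the task finished with return value 0


def launch_with_retry(services, nproc, working_dir, binary, *args, max_attempts=5):
    """Try a non-blocking launch up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            task_id = services.launch_task(
                nproc, working_dir, binary, *args, block=False
            )
            return task_id, attempt
        except InsufficientResourcesException:
            continue  # a real component might sleep or do other work first
    raise InsufficientResourcesException(binary)
```

After a successful launch, the component still has to call one of the wait_task variants with the returned task_id to obtain the task's return value.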

ServicesProxy.wait_task(task_id)

Check the status of task task_id. Return the return value of the task when finished successfully. Raise exceptions if the task is not found, or if there are problems finalizing the task.

ServicesProxy.wait_task_nonblocking(task_id)

Check the status of task task_id. If it has finished, the return value is populated with the actual value, otherwise None is returned. A KeyError exception may be raised if the task is not found.

ServicesProxy.wait_tasklist(task_id_list, block=True)

Check the status of a list of tasks. If block is True, return a dictionary of return values when all tasks have completed. If block is False, return a dictionary containing entries for each completed task. Note that the dictionary may be empty. Raise KeyError exception if task_id not found.

ServicesProxy.kill_task(task_id)

Kill launched task task_id. Return if successful. Raises exceptions if the task or process cannot be found or killed successfully.

ServicesProxy.kill_all_tasks()

Kill all tasks associated with this component.

The task pool interface is designed for running a group of tasks that are independent of each other and can run concurrently. The services manage the execution of the tasks efficiently for the component. Users must first create an empty task pool, then add tasks to it. The tasks are submitted as a group and checked on as a group. This interface is basically a wrapper around the interface above for convenience.

ServicesProxy.create_task_pool(task_pool_name)

Create an empty pool of tasks with the name task_pool_name. Raise exception if duplicate name.

ServicesProxy.add_task(task_pool_name, task_name, nproc, working_dir, binary, *args, **keywords)

Add task task_name to task pool task_pool_name. Remaining arguments are the same as in ServicesProxy.launch_task().

ServicesProxy.submit_tasks(task_pool_name, block=True)

Launch all unfinished tasks in task pool task_pool_name. If block is True, return when all tasks have been launched. If block is False, return when all tasks that can be launched immediately have been launched. Return number of tasks submitted.

ServicesProxy.get_finished_tasks(task_pool_name)

Return dictionary of finished tasks and return values in task pool task_pool_name. Raise exception if no active or finished tasks.

ServicesProxy.remove_task_pool(task_pool_name)

Kill all running tasks, clean up all finished tasks, and delete task pool.
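The create, add, submit, collect, and remove sequence described above can be sketched as follows. FakeServices is a hypothetical stand-in that finishes exactly one task per call to get_finished_tasks, and the exception types it raises, the helper run_pool, and the task names are assumptions made for illustration. A real component would make the same calls on self.services.

```python
class FakeServices:
    """Hypothetical stand-in mimicking the documented task pool calls."""

    def __init__(self):
        self.pools = {}

    def create_task_pool(self, task_pool_name):
        if task_pool_name in self.pools:
            raise ValueError("duplicate task pool: " + task_pool_name)
        self.pools[task_pool_name] = {"queued": [], "active": []}

    def add_task(self, task_pool_name, task_name, nproc, working_dir,
                 binary, *args, **keywords):
        self.pools[task_pool_name]["queued"].append(task_name)

    def submit_tasks(self, task_pool_name, block=True):
        pool = self.pools[task_pool_name]
        count = len(pool["queued"])
        pool["active"].extend(pool["queued"])
        pool["queued"] = []
        return count

    def get_finished_tasks(self, task_pool_name):
        pool = self.pools[task_pool_name]
        if not pool["active"]:
            raise RuntimeError("no active or finished tasks")
        # Pretend exactly one task finishes per call, with return value 0.
        return {pool["active"].pop(0): 0}

    def remove_task_pool(self, task_pool_name):
        del self.pools[task_pool_name]


def run_pool(services, pool_name, binaries, working_dir):
    """Run one single-process task per binary and collect return values."""
    services.create_task_pool(pool_name)
    for i, binary in enumerate(binaries):
        services.add_task(pool_name, "task_%d" % i, 1, working_dir, binary)
    submitted = services.submit_tasks(pool_name)
    results = {}
    while len(results) < submitted:
        results.update(services.get_finished_tasks(pool_name))
    services.remove_task_pool(pool_name)
    return results
```

The loop stops as soon as every submitted task has been collected, so get_finished_tasks is never called on a pool with no active or finished tasks.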

Miscellaneous

The following services do not fit neatly into any of the other categories, but are important to the execution of the simulation.

ServicesProxy.get_working_dir()

Return the working directory of the calling component.

The structure of the working directory is defined using the configuration parameters CLASS, SUB_CLASS, and NAME of the component configuration section. The structure of the working directory is:

${SIM_ROOT}/work/$CLASS_${SUB_CLASS}_$NAME_<instance_num>
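To illustrate the naming scheme, the sketch below builds a path with the same structure from its parts. The function component_work_dir is hypothetical and is not part of the services API; components should obtain the real path from get_working_dir().

```python
import posixpath


def component_work_dir(sim_root, comp_class, sub_class, name, instance_num):
    """Build ${SIM_ROOT}/work/$CLASS_${SUB_CLASS}_$NAME_<instance_num>."""
    dir_name = "%s_%s_%s_%s" % (comp_class, sub_class, name, instance_num)
    return posixpath.join(sim_root, "work", dir_name)
```

For example, a component with CLASS = drivers, SUB_CLASS = ssfoley, and NAME = branch_test_driver, running as instance 1, would map to a directory named drivers_ssfoley_branch_test_driver_1 under ${SIM_ROOT}/work.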

ServicesProxy.update_time_stamp(new_time_stamp=-1)

Update time stamp on portal.

ServicesProxy.send_portal_event(event_type='COMPONENT_EVENT', event_comment='')

Send event to web portal.

Data Management

The data management services are used by the components to manage the data needed and produced by each step, and by the driver to manage the overall simulation data. There are methods for component-local files and simulation-global files, as well as for moving files from a replay component. Fault tolerance services are presented in another section.

Staging of local (non-shared) files:

ServicesProxy.stage_input_files(input_file_list)

Copy component input files to the component working directory (as obtained via a call to ServicesProxy.get_working_dir()). Input files are assumed to be originally located in the directory variable INPUT_DIR in the component configuration section.

ServicesProxy.stage_output_files(timeStamp, file_list, keep_old_files=True)

Copy associated component output files (from the working directory) to the component simulation results directory. Output files are prefixed with the configuration parameter OUTPUT_PREFIX. The simulation results directory has the format:

${SIM_ROOT}/simulation_results/<timeStamp>/components/$CLASS_${SUB_CLASS}_$NAME_${SEQ_NUM}

Additionally, plasma state files are archived for debugging purposes:

${SIM_ROOT}/history/plasma_state/<file_name>_$CLASS_${SUB_CLASS}_$NAME_<timeStamp>

Copying errors are not fatal (exception raised).

Staging of global (plasma state) files:

ServicesProxy.stage_plasma_state()

Copy current plasma state to work directory.

ServicesProxy.update_plasma_state(plasma_state_files=None)

Copy local (updated) plasma state to global state. If no plasma state files are specified, the component configuration specification is used. Raise exceptions upon copy errors.

ServicesProxy.merge_current_plasma_state(partial_state_file, logfile=None)

Merge partial plasma state with global state. Partial plasma state contains only the values that the component contributes to the simulation. Raise exceptions on bad merge. Optional logfile will capture stdout from merge.

Staging of replay files:

ServicesProxy.stage_replay_output_files(timeStamp)

Copy output files from the replay component to current sim for physics time timeStamp. Return location of new local copies.

ServicesProxy.stage_replay_plasma_files(timeStamp)

Copy plasma state files from the replay component to current sim for physics time timeStamp. Return location of new local copies.

Configuration Parameter Access

These methods access information from the simulation configuration file.

ServicesProxy.get_port(port_name)

Return a reference to the component implementing port port_name.

ServicesProxy.get_config_param(param)

Return the value of the configuration parameter param. Raise exception if not found.

ServicesProxy.set_config_param(param, value, target_sim_name=None)

Set configuration parameter param to value. Raise exceptions if the parameter cannot be changed or if there are problems setting the value.

ServicesProxy.get_time_loop()

Return the list of times as specified in the configuration file.

Logging

The following logging methods can be used to write logging messages to the simulation log file. It is strongly recommended that these methods are used as opposed to print statements. The logging capability adds a timestamp and identifies the component that generated the message. The syntax for logging is a simple string or formatted string:

self.services.info('beginning step')
self.services.warning('unable to open log file %s for task %d, will use stdout instead',
                      logfile, task_id)

There is no need to include information about the component in the message as the IPS logging interface includes a time stamp and information about what component sent the message:

2011-06-13 14:17:48,118 drivers_ssfoley_branch_test_driver_1 DEBUG    __initialize__(): <branch_testing.branch_test_driver object at 0xb600d0>  branch_testing_hopper@branch_test_driver@1
2011-06-13 14:17:48,125 drivers_ssfoley_branch_test_driver_1 DEBUG    Working directory /scratch/scratchdirs/ssfoley/rm_dev/branch_testing_hopper/work/drivers_ssfoley_branch_test_driver_1 does not exist - will attempt creation
2011-06-13 14:17:48,129 drivers_ssfoley_branch_test_driver_1 DEBUG    Running - CompID =  branch_testing_hopper@branch_test_driver@1
2011-06-13 14:17:48,130 drivers_ssfoley_branch_test_driver_1 DEBUG    _init_event_service(): self.counter = 0 - <branch_testing.branch_test_driver object at 0xb600d0>
2011-06-13 14:17:51,934 drivers_ssfoley_branch_test_driver_1 INFO     ('Received Message ',)
2011-06-13 14:17:51,934 drivers_ssfoley_branch_test_driver_1 DEBUG    Calling method init args = (0,)
2011-06-13 14:17:51,938 drivers_ssfoley_branch_test_driver_1 INFO     ('Received Message ',)
2011-06-13 14:17:51,938 drivers_ssfoley_branch_test_driver_1 DEBUG    Calling method step args = (0,)
2011-06-13 14:17:51,939 drivers_ssfoley_branch_test_driver_1 DEBUG    _invoke_service(): init_task  (48, 'hw', 0, True, True, True)
2011-06-13 14:17:51,939 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|0)
2011-06-13 14:17:51,952 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|0), response = <messages.ServiceResponseMessage object at 0xb60ad0>
2011-06-13 14:17:51,954 drivers_ssfoley_branch_test_driver_1 DEBUG    Launching command : aprun -n 48 -N 24 -L 1087,1084 hw
2011-06-13 14:17:51,961 drivers_ssfoley_branch_test_driver_1 DEBUG    _invoke_service(): getTopic  ('_IPS_MONITOR',)
2011-06-13 14:17:51,962 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|1)
2011-06-13 14:17:51,972 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|1), response = <messages.ServiceResponseMessage object at 0xb60b90>
2011-06-13 14:17:51,972 drivers_ssfoley_branch_test_driver_1 DEBUG    _invoke_service(): sendEvent  ('_IPS_MONITOR', 'PORTAL_EVENT', {'sim_name': 'branch_testing_hopper', 'portal_data': {'comment': 'task_id = 1 , Tag = None , Target = aprun -n 48 -N 24 -L 1087,1084 hw ', 'code': 'drivers_ssfoley_branch_test_driver', 'ok': 'True', 'eventtype': 'IPS_LAUNCH_TASK', 'state': 'Running', 'walltime': '4.72'}})
2011-06-13 14:17:51,973 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|2)
2011-06-13 14:17:51,984 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|2), response = <messages.ServiceResponseMessage object at 0xb60d10>
2011-06-13 14:17:51,987 drivers_ssfoley_branch_test_driver_1 DEBUG    _invoke_service(): getTopic  ('_IPS_MONITOR',)
2011-06-13 14:17:51,988 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|3)
2011-06-13 14:17:52,000 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|3), response = <messages.ServiceResponseMessage object at 0xb60890>
2011-06-13 14:17:52,000 drivers_ssfoley_branch_test_driver_1 DEBUG    _invoke_service(): sendEvent  ('_IPS_MONITOR', 'PORTAL_EVENT', {'sim_name': 'branch_testing_hopper', 'portal_data': {'comment': 'task_id = 1  elapsed time = 0.00 S', 'code': 'drivers_ssfoley_branch_test_driver', 'ok': 'True', 'eventtype': 'IPS_TASK_END', 'state': 'Running', 'walltime': '4.75'}})
2011-06-13 14:17:52,000 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|4)
2011-06-13 14:17:52,012 drivers_ssfoley_branch_test_driver_1 DEBUG    _get_service_response(REQUEST|branch_testing_hopper@branch_test_driver@1|FRAMEWORK@Framework@0|4), response = <messages.ServiceResponseMessage object at 0xb60a90>
2011-06-13 14:17:52,012 drivers_ssfoley_branch_test_driver_1 DEBUG    _invoke_service(): finish_task  (1L, 1)

The table below describes the levels of logging available and when to use each one. These levels are also used to determine what messages are produced in the log file. The default level is WARNING, thus you will see WARNING, ERROR and CRITICAL messages in the log file.

Level     When it’s used
DEBUG     Detailed information, typically of interest only when diagnosing problems.
INFO      Confirmation that things are working as expected.
WARNING   An indication that something unexpected happened, or indicative of some problem in the near future (e.g. “disk space low”). The software is still working as expected.
ERROR     Due to a more serious problem, the software has not been able to perform some function.
CRITICAL  A serious error, indicating that the program itself may be unable to continue running.

For more information about the logging module and how to use it, see Logging Tutorial.
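The effect of the logging level can be illustrated with Python's standard logging module alone. This is a sketch that does not use the IPS services: the logger name and the format string are assumptions chosen to resemble the log excerpt above, and the messages are written to an in-memory stream instead of the simulation log file.

```python
import io
import logging

# Capture log output in memory so the result can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)-8s %(message)s")
)

logger = logging.getLogger("drivers_example_driver_1")  # hypothetical name
logger.addHandler(handler)
logger.propagate = False
logger.setLevel(logging.WARNING)  # same default level as described above

logger.info("beginning step")  # below WARNING, so it is filtered out
logger.warning(
    "unable to open log file %s for task %d, will use stdout instead",
    "task.log", 7,
)

lines = stream.getvalue().splitlines()
```

With the level set to WARNING, only the warning message reaches the stream; the informational message is dropped before formatting.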

ServicesProxy.log(*args)

Wrapper for ServicesProxy.info().

ServicesProxy.debug(*args)

Produce debugging message in simulation log file. Raise exception for bad formatting.

ServicesProxy.info(*args)

Produce informational message in simulation log file. Raise exception for bad formatting.

ServicesProxy.warning(*args)

Produce warning message in simulation log file. Raise exception for bad formatting.

ServicesProxy.error(*args)

Produce error message in simulation log file. Raise exception for bad formatting.

ServicesProxy.exception(*args)

Produce exception message in simulation log file. Raise exception for bad formatting.

ServicesProxy.critical(*args)

Produce critical message in simulation log file. Raise exception for bad formatting.

Fault Tolerance

The IPS provides services to checkpoint and restart a coupled simulation, using the checkpoint and restart methods of each component together with certain settings in the configuration file. The driver can call checkpoint_components, which will invoke the checkpoint method on each component associated with the simulation. The component’s checkpoint method uses save_restart_files to save files needed by the component to restart from the same point in the simulation. When the simulation is in restart mode, the restart method of the component is called to initialize the component, instead of the init method. The component’s restart method uses the get_restart_files method to stage in inputs for continuing the simulation.

ServicesProxy.save_restart_files(timeStamp, file_list)

Copy files needed for component restart to the restart directory:

${SIM_ROOT}/restart/$timestamp/components/$CLASS_${SUB_CLASS}_$NAME

Copying errors are not fatal (exception raised).

ServicesProxy.checkpoint_components(comp_id_list, time_stamp, Force=False, Protect=False)

Selectively checkpoint components in comp_id_list based on the configuration section CHECKPOINT. If Force is True, the checkpoint will be taken even if the conditions for taking the checkpoint are not met. If Protect is True, then the data from the checkpoint is protected from clean up. Force and Protect are optional and default to False.

The CHECKPOINT_MODE option determines whether the components’ checkpoint methods are invoked.

Possible MODE options are:

WALLTIME_REGULAR:
checkpoints are saved upon invocation of the service call checkpoint_components(), when a time interval greater than, or equal to, the value of the configuration parameter WALLTIME_INTERVAL has passed since the last checkpoint. A checkpoint is assumed to have happened (but not actually stored) when the simulation starts. Calls to checkpoint_components() before WALLTIME_INTERVAL seconds have passed since the last successful checkpoint result in a NOOP.
WALLTIME_EXPLICIT:
checkpoints are saved when the simulation wall clock time exceeds one of the (ordered) list of time values (in seconds) specified in the variable WALLTIME_VALUES. Let [t_0, t_1, ..., t_n] be the list of wall clock time values specified in the configuration parameter WALLTIME_VALUES. Then checkpoint(T) = True if T >= t_j, for some j in [0,n] and there is no other time T_1, with T > T_1 >= t_j such that checkpoint(T_1) = True. If the test fails, the call results in a NOOP.
PHYSTIME_REGULAR:
checkpoints are saved at regularly spaced “physics time” intervals, specified in the configuration parameter PHYSTIME_INTERVAL. Let PHYSTIME_INTERVAL = PTI, and the physics time stamp argument in the call to checkpoint_components() be pts_i, with i = 0, 1, 2, ... Then checkpoint(pts_i) = True if pts_i >= n * PTI, for some n in 1, 2, 3, ... and pts_i - pts_prev >= PTI, where checkpoint(pts_prev) = True and pts_prev = max(pts_0, pts_1, ..., pts_(i-1)). If the test fails, the call results in a NOOP.
PHYSTIME_EXPLICIT:
checkpoints are saved when the physics time equals or exceeds one of the (ordered) list of physics time values (in seconds) specified in the variable PHYSTIME_VALUES. Let [pt_0, pt_1, ..., pt_n] be the list of physics time values specified in the configuration parameter PHYSTIME_VALUES. Then checkpoint(pt) = True if pt >= pt_j, for some j in [0,n] and there is no other physics time pt_k, with pt > pt_k >= pt_j such that checkpoint(pt_k) = True. If the test fails, the call results in a NOOP.
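The rules for the modes above can be written down as small predicates. The sketch below is one reading of the text, not the framework's implementation: all function names are invented here, the two EXPLICIT modes share one predicate because their rules have the same form, and for PHYSTIME_REGULAR it is assumed that the first checkpoint is due once the physics time reaches PTI.

```python
def walltime_regular_due(now, last_checkpoint, interval):
    """WALLTIME_REGULAR: at least `interval` seconds since the last checkpoint.

    `last_checkpoint` starts at the simulation start time, since a checkpoint
    is assumed to have happened when the simulation starts.
    """
    return now - last_checkpoint >= interval


def explicit_due(now, last_checkpoint, values):
    """WALLTIME_EXPLICIT / PHYSTIME_EXPLICIT.

    Due if some listed value t_j has been reached (t_j <= now) and no
    earlier checkpoint was taken at or after t_j.
    """
    return any(
        t <= now and (last_checkpoint is None or last_checkpoint < t)
        for t in values
    )


def phystime_regular_due(pts, last_checkpoint, interval):
    """PHYSTIME_REGULAR (assumption: first checkpoint once pts >= interval)."""
    if pts < interval:
        return False
    return last_checkpoint is None or pts - last_checkpoint >= interval


def simulate(times, is_due, last=None):
    """Return the subset of `times` at which a checkpoint would be taken."""
    taken = []
    for t in times:
        if is_due(t, last):
            taken.append(t)
            last = t
    return taken
```

For example, with listed values [10, 20, 30] and calls to checkpoint_components() at times 5, 12, 15, 25, 26, and 40, the explicit rule selects the calls at 12, 25, and 40. Every other call is a NOOP.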

The configuration parameter NUM_CHECKPOINT controls how many checkpoints to keep on disk. Checkpoints are deleted in a FIFO manner, based on their creation time. Possible values of NUM_CHECKPOINT are:

  • NUM_CHECKPOINT = n, with n > 0 –> Keep the most recent n checkpoints
  • NUM_CHECKPOINT = 0 –> No checkpoints are made/kept (except when Force = True)
  • NUM_CHECKPOINT < 0 –> Keep ALL checkpoints

Checkpoints are saved in the directory ${SIM_ROOT}/restart
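The retention rule for NUM_CHECKPOINT can be sketched as a function over a list of checkpoints ordered from oldest to newest. The function name checkpoints_to_keep is hypothetical, and the sketch deliberately ignores the Force and Protect arguments of checkpoint_components().

```python
def checkpoints_to_keep(checkpoints, num_checkpoint):
    """Return the checkpoints retained under NUM_CHECKPOINT.

    `checkpoints` must be ordered by creation time, oldest first.
    """
    if num_checkpoint < 0:
        return list(checkpoints)            # keep ALL checkpoints
    if num_checkpoint == 0:
        return []                           # no checkpoints are kept
    return list(checkpoints[-num_checkpoint:])  # most recent n, FIFO deletion
```

With four checkpoints on disk and NUM_CHECKPOINT = 2, the two oldest are deleted and the two most recent remain.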

ServicesProxy.get_restart_files(restart_root, timeStamp, file_list)

Copy files needed for component restart from the restart directory:

<restart_root>/restart/<timeStamp>/components/$CLASS_${SUB_CLASS}_$NAME_${SEQ_NUM}

to the component’s work directory.

Copying errors are not fatal (exception raised).

Event Service

The event service interface is used to implement the web portal connection, as well as for components to communicate asynchronously. See the Advanced Features documentation for details on how to use this interface for component communication.

ServicesProxy.publish(topicName, eventName, eventBody)

Publish event consisting of eventName and eventBody to topic topicName to the IPS event service.

ServicesProxy.subscribe(topicName, callback)

Subscribe to topic topicName on the IPS event service and register callback as the method to be invoked when an event is published to that topic.

ServicesProxy.unsubscribe(topicName)

Remove subscription to topic topicName.

ServicesProxy.process_events()

Poll for events on subscribed topics.
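The interplay of publish, subscribe, and process_events can be sketched with a small in-memory stand-in. FakeEventService is hypothetical, and the arguments passed to the callback (topic name, event name, event body) are an assumption made for this example; consult the Advanced Features documentation for the actual callback interface. The point illustrated is that callbacks run only when the subscriber polls with process_events().

```python
class FakeEventService:
    """Hypothetical in-memory stand-in for the documented event calls."""

    def __init__(self):
        self._callbacks = {}  # topicName -> callback
        self._queue = []      # events published but not yet delivered

    def subscribe(self, topicName, callback):
        self._callbacks[topicName] = callback

    def unsubscribe(self, topicName):
        self._callbacks.pop(topicName, None)

    def publish(self, topicName, eventName, eventBody):
        self._queue.append((topicName, eventName, eventBody))

    def process_events(self):
        # Deliver queued events to whichever callbacks are subscribed now.
        pending, self._queue = self._queue, []
        for topicName, eventName, eventBody in pending:
            callback = self._callbacks.get(topicName)
            if callback is not None:
                # Callback argument list is an assumption for this sketch.
                callback(topicName, eventName, eventBody)


received = []
events = FakeEventService()
events.subscribe("status", lambda topic, name, body: received.append((name, body)))
events.publish("status", "step_done", {"time": 0})
before_poll = list(received)  # nothing delivered until the subscriber polls
events.process_events()
after_poll = list(received)
```

In this sketch the published event sits in the queue until process_events() is called, which mirrors the polling model described above.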