Raincode Automatic Restart Extension


Version 4.2.490.0

1. Raincode Automatic Restart Extension

The Raincode Automatic Restart Extension is an add-on to the Raincode Runtime Library that checkpoints batch programs so that, after a crash, they can be restarted from the most recent checkpoint rather than from the beginning of execution.

During execution of the batch program, checkpoints are taken either at regular, configurable intervals or at points determined by the batch program itself. If the batch program uses a database, a commit is performed as part of the checkpointing process. The other data that can be checkpointed are the program static data (currently only COBOL Working Storage) and the current position of open files and of selected open database cursors.

Checkpoint data other than the database commit are saved in files. A crashed batch program can be restarted from the most recent database commit; the associated checkpoint files are used to restore the program static data and to reposition files and database cursors to their state at the moment of that commit.

It is the application programmer’s responsibility to write the batch program in such a way that the latest commit and the associated static data and file and cursor positions are enough to continue program execution.

2. Enabling the Restart Functionality

The Raincode Automatic Restart Extension is enabled through the Raincode Runtime Library plug-in mechanism, by passing the following rclrun command line option:

-Plugin=RainCodeRestart
Like any rclrun option, this option can be passed through the environment variable RCLRUNARGS (see section rclrun command line options).

As the batch program is expected to be started by JCL, the environment variable RCLRUNARGS may be set in the JCL catalog configuration file as follows:

  <environmentVariables>
    <envvar name="RCLRUNARGS">-Plugin=RainCodeRestart</envvar>
  </environmentVariables>
If the batch program uses a database, the database may also be configured in the JCL catalog configuration file. For more details, refer to the JCL User Guide.

3. Configuring the Restart Functionality

The behaviour of the Raincode Automatic Restart Extension is driven by environment variables and by a configuration file, both described in the following subsections.

3.1. Environment Variables

JCL automatically sets some environment variables, and others are to be configured manually:

  • Set up by JCL Submit

    • RC_JOB_SYSOUT_DIR: the JCL SYSOUT Folder, containing logs and checkpoint files

    • RC_JOB_NAME: the JCL job name

    • RC_STEP_ID: the JCL step starting the batch program

    • RC_JOB_RESTARTED_SYSOUT_DIR: the JCL SYSOUT folder of the restarted JCL step, set up by JCL submit when restarting

  • Set up manually

    • RC_RESTART_CONFIG: mandatory, a path to the Restart configuration file. For more details, see the section, Restart Configuration File

    • RC_RESTART_PGM_CONFIG_DIR: optional, a folder containing the program specific configuration file. For more details, see the section, Configuration Merging

    • RC_RESTART_SHIFT: optional, defaults to 1. For more details, see the section, Checkpoint pacing options

These environment variables, e.g. RC_RESTART_CONFIG, may be set in the JCL catalog configuration file as follows:

  <environmentVariables>
    <envvar name="RC_RESTART_CONFIG" >configuration.xml</envvar>
  </environmentVariables>
For more details, refer to the JCL User Guide.
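
As an illustration of the rules above, the following Python sketch collects the restart-related variables from an environment mapping and applies the documented default for RC_RESTART_SHIFT. The function name read_restart_env and the error raised when RC_RESTART_CONFIG is missing are assumptions made for this example; they are not part of the product.

```python
import os
from typing import Mapping, Optional

# Variables set by JCL submit (RC_JOB_RESTARTED_SYSOUT_DIR only when restarting).
JCL_VARS = ("RC_JOB_SYSOUT_DIR", "RC_JOB_NAME", "RC_STEP_ID",
            "RC_JOB_RESTARTED_SYSOUT_DIR")


def read_restart_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the restart-related variables from an environment mapping."""
    env = os.environ if env is None else env
    if not env.get("RC_RESTART_CONFIG"):
        # Hypothetical error handling: the guide only says the variable is mandatory.
        raise ValueError("RC_RESTART_CONFIG is mandatory")
    settings = {name: env.get(name) for name in JCL_VARS}
    settings["RC_RESTART_CONFIG"] = env["RC_RESTART_CONFIG"]
    settings["RC_RESTART_PGM_CONFIG_DIR"] = env.get("RC_RESTART_PGM_CONFIG_DIR")
    # Documented default: shift 1 when RC_RESTART_SHIFT is not set.
    settings["RC_RESTART_SHIFT"] = env.get("RC_RESTART_SHIFT", "1")
    return settings
```

Variables that are not set are returned as None, which mirrors the "not defined" entries in the initialization log.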

3.2. Configuration File

The configuration file is an XML file consisting of three distinct sections:

  • Processing options

  • Checkpoint pacing options

  • Cursor repositioning options

3.2.1. Processing options

A Processing option has the following attributes:

  • ProgName: name of the batch program to which this processing option applies; wild-cards are allowed

  • JobName: same for the JCL job name

  • StepName: same for the JCL step

  • AutomaticRestart: enable/disable Raincode Automatic Restart Extension

  • AutomaticCheckpoints: enable/disable automatic checkpointing

  • AutomaticCursorPositioning: enable/disable cursor repositioning options

  • RestoreWorkingStorage: specifies when working storage must be restored when restarting a batch program from a checkpoint; currently, only PGMSTART (program start) is supported.

If automatic checkpoints are enabled, the following attributes are used:

  • TriggerCount: number of records to process between two consecutive checkpoints.

  • PaceCheckpoints: enable a checkpoint pacing option

If PaceCheckpoints is enabled, the attribute PacingClassName selects the checkpoint pacing option to be used.

If automatic checkpoints are enabled, the following elements are used:

  • File DDName: the name of the file against which the number of processed records is counted

  • Cursor Name: the name of the cursor against which the number of processed records is counted

  • Cursor Package: the name of the program containing the cursor definition

The File (DDName) and Cursor (Name/Package) elements are mutually exclusive: records are counted against either a file or a cursor, not both.

Example

  <ProcessingOptions>
    <ProcessingOption AutomaticRestart="true" AutomaticCheckpoints="true" AutomaticCursorPositioning="true" PaceCheckpoints="false" PacingClassName="PACECLAS" TriggerCount="5" RestoreWorkingStorage="PGMSTART" ProgName="ZZ4CRCO2" JobName="ZZJD.*" StepName=".*">
      <Cursor Package="ZZ4CRCO2" Name="ZZDCA01" />
    </ProcessingOption>
    <ProcessingOption AutomaticRestart="true" AutomaticCheckpoints="true" AutomaticCursorPositioning="false" PaceCheckpoints="false" PacingClassName="PACECLAS" TriggerCount="5" RestoreWorkingStorage="PGMSTART" ProgName="ZZ4CRCO8" JobName="ZZJD.*" StepName=".*">
      <File DDName="S1DQDATA" />
    </ProcessingOption>
  </ProcessingOptions>
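
The following Python sketch shows one way the selection of a processing option could work on a shortened copy of the example above. It rests on two assumptions that the guide does not state: that the name patterns are regular expressions matched against the whole name (the sample values ZZJD.* and .* look like regular expressions), and that the first matching option in document order is used. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

SAMPLE = """
<ProcessingOptions>
  <ProcessingOption AutomaticRestart="true" AutomaticCheckpoints="true"
      TriggerCount="5" ProgName="ZZ4CRCO2" JobName="ZZJD.*" StepName=".*">
    <Cursor Package="ZZ4CRCO2" Name="ZZDCA01" />
  </ProcessingOption>
  <ProcessingOption AutomaticRestart="true" AutomaticCheckpoints="true"
      TriggerCount="5" ProgName="ZZ4CRCO8" JobName="ZZJD.*" StepName=".*">
    <File DDName="S1DQDATA" />
  </ProcessingOption>
</ProcessingOptions>
"""


def select_option(root: ET.Element, prog: str, job: str, step: str) -> Optional[ET.Element]:
    """Return the first ProcessingOption whose three patterns match, else None."""
    for option in root.iter("ProcessingOption"):
        patterns = (option.get("ProgName", ""), option.get("JobName", ""),
                    option.get("StepName", ""))
        # Assumption: each pattern is a regular expression for the full name.
        if all(re.fullmatch(p, name) for p, name in zip(patterns, (prog, job, step))):
            return option
    return None


def counting_target(option: ET.Element) -> str:
    """Describe what records are counted against; File and Cursor are exclusive."""
    file_el, cursor_el = option.find("File"), option.find("Cursor")
    if (file_el is None) == (cursor_el is None):
        raise ValueError("exactly one of File or Cursor must be present")
    if file_el is not None:
        return "file " + file_el.get("DDName")
    return "cursor " + cursor_el.get("Package") + "." + cursor_el.get("Name")


root = ET.fromstring(SAMPLE)
chosen = select_option(root, "ZZ4CRCO2", "ZZJDRCO2", "STEP020")
print(counting_target(chosen))
```

With the program, job and step names taken from the log examples in this guide, the first option is selected and records are counted against cursor ZZDCA01.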

3.2.2. Checkpoint pacing options

A checkpoint pacing option has the following attributes:

  • PacingClassName: the name of the option, to be selected by the corresponding processing option

  • A list of CheckpointPacingOptionShift elements, each with:

    • Shift, to be selected by environment variable RC_RESTART_SHIFT. (See section Environment Variables)

    • CheckpointCount = N or an integer value

    • CheckpointElapsedTime = N or a time interval of the form mmmm:ss

Pacing applies to automatic checkpointing when checkpoint pacing is enabled. Whenever the number of records processed since the beginning of execution or since the last checkpoint reaches TriggerCount, the checkpoint count and elapsed time of the selected pacing class and shift are checked. If either has been reached, a checkpoint is taken; otherwise, the check is repeated after TriggerCount more records have been processed.

Example

  <CheckpointPacingOptions>
    <CheckpointPacingOption PacingClassName="PACECLAS">
      <CheckpointPacingOptionShifts>
        <CheckpointPacingOptionShift Shift="1" CheckpointCount="N" CheckpointElapsedTime="0000:05" />
        <CheckpointPacingOptionShift Shift="2" CheckpointCount="N" CheckpointElapsedTime="0000:50" />
      </CheckpointPacingOptionShifts>
    </CheckpointPacingOption>
    <CheckpointPacingOption PacingClassName="CL00005M">
      <CheckpointPacingOptionShifts>
        <CheckpointPacingOptionShift Shift="1" CheckpointCount="N" CheckpointElapsedTime="0005:00" />
      </CheckpointPacingOptionShifts>
    </CheckpointPacingOption>
  </CheckpointPacingOptions>
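
The pacing decision described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the value N is read as "criterion not used", and the guide does not say what exactly CheckpointCount is compared against, so the sketch takes a generic count supplied by the caller. The names PacingShift, parse_limit and parse_elapsed are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


def parse_limit(value: str) -> Optional[int]:
    """Parse CheckpointCount: 'N' is read here as 'criterion not used'."""
    return None if value == "N" else int(value)


def parse_elapsed(value: str) -> Optional[int]:
    """Parse CheckpointElapsedTime ('N' or mmmm:ss) into seconds."""
    if value == "N":
        return None
    minutes, seconds = value.split(":")
    return int(minutes) * 60 + int(seconds)


@dataclass
class PacingShift:
    count_limit: Optional[int]      # from CheckpointCount
    elapsed_limit: Optional[int]    # from CheckpointElapsedTime, in seconds

    def checkpoint_due(self, count: int, elapsed_seconds: float) -> bool:
        """Evaluated each time TriggerCount more records have been processed."""
        by_count = self.count_limit is not None and count >= self.count_limit
        by_time = self.elapsed_limit is not None and elapsed_seconds >= self.elapsed_limit
        return by_count or by_time


# Shift 1 of pacing class PACECLAS in the example above.
shift1 = PacingShift(parse_limit("N"), parse_elapsed("0000:05"))
print(shift1.checkpoint_due(count=10, elapsed_seconds=2.0))   # False: 5 s not reached
print(shift1.checkpoint_due(count=15, elapsed_seconds=6.5))   # True: 5 s elapsed
```

Under this reading, shift 1 of PACECLAS takes a checkpoint at the first trigger check that occurs at least 5 seconds after the previous checkpoint.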

3.2.3. Cursor repositioning options

A cursor repositioning option has the following attributes:

  • CursorName: the name of the cursor to be repositioned

  • CursorPackage: the program containing the definition of the cursor to be repositioned

  • RepositioningMethod: can be COUNT or COLUMN

If RepositioningMethod is COUNT, the cursor is repositioned, in a way similar to file repositioning, to the position it had at the time of the last checkpoint before the crash.

If RepositioningMethod is COLUMN, the repositioning option also contains a list of column names. In that case, the values of those columns are also saved in the checkpoint file, and repositioning is performed by moving the cursor to the first row whose column values match those saved in the last checkpoint before the crash.

Example

  <CursorRepositioningOptions>
    <CursorRepositioningOption CursorPackage="ZZ4CRCO8" CursorName="ZZDCA01" RepositioningMethod="COLUMN">
      <Columns>
        <Column Name="A01_NUMBER" />
        <Column Name="A01_BRANCH" />
      </Columns>
    </CursorRepositioningOption>
    <CursorRepositioningOption CursorPackage="OTHER" CursorName="OTHER" RepositioningMethod="COUNT" />
  </CursorRepositioningOptions>
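
The difference between the two repositioning methods can be illustrated with a small Python sketch that works on an in-memory list of rows instead of a real database cursor. The function names and the sample rows are hypothetical, and the sketch does not claim to reproduce how the extension fetches rows; it only contrasts "skip a number of rows" with "find the first row with the saved column values".

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, object]


def reposition_by_count(rows: Sequence[Row], position: int) -> int:
    """COUNT: return the index reached after skipping `position` rows."""
    return min(position, len(rows))


def reposition_by_column(rows: Sequence[Row], saved: Row) -> Optional[int]:
    """COLUMN: return the index of the first row matching the saved values."""
    for index, row in enumerate(rows):
        if all(row.get(name) == value for name, value in saved.items()):
            return index
    return None  # no matching row; the product's reaction is not shown here


rows: List[Row] = [
    {"A01_NUMBER": 10, "A01_BRANCH": 1},
    {"A01_NUMBER": 20, "A01_BRANCH": 1},
    {"A01_NUMBER": 20, "A01_BRANCH": 2},
]
print(reposition_by_count(rows, 2))
print(reposition_by_column(rows, {"A01_NUMBER": 20, "A01_BRANCH": 2}))
```

COUNT depends on the cursor returning rows in the same order on restart, whereas COLUMN depends on the chosen columns identifying a row; which one fits is a property of the query and is for the application programmer to judge.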

3.3. Configuration Merging

It is possible to customize the Raincode Automatic Restart Extension for every batch program by providing a program specific configuration file as follows:

  • The environment variable RC_RESTART_PGM_CONFIG_DIR (see section Environment Variables) must be the path of a folder containing a file named <batch program name>.arc.xml

  • That file has the same format as the configuration file and contains the program specific configuration

  • If that program specific file is found:

    • If it contains processing options, they replace the configuration processing options

    • If it contains checkpoint pacing options, they replace the configuration checkpoint pacing options

    • If it contains cursor repositioning options, they replace the configuration cursor repositioning options

    • The resulting configuration file is written to the folder identified by environment variable SYSOUT, so that it can be inspected
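
The replacement rules above operate on whole sections, which the following Python sketch makes explicit. The root element name Configuration is a hypothetical placeholder (the guide does not show the root element of the configuration file), and merge_config is an illustrative function, not the product's implementation.

```python
import xml.etree.ElementTree as ET

SECTIONS = ("ProcessingOptions", "CheckpointPacingOptions",
            "CursorRepositioningOptions")


def merge_config(base: ET.Element, program: ET.Element) -> ET.Element:
    """Replace each section of `base` that `program` also defines."""
    for tag in SECTIONS:
        override = program.find(tag)
        if override is None:
            continue  # section absent from the program file: keep the base one
        existing = base.find(tag)
        if existing is not None:
            base.remove(existing)
        base.append(override)
    return base


# "Configuration" is a hypothetical root element name used for this sketch.
base = ET.fromstring(
    "<Configuration>"
    "<ProcessingOptions><ProcessingOption ProgName='.*' TriggerCount='5'/></ProcessingOptions>"
    "<CursorRepositioningOptions/>"
    "</Configuration>")
program = ET.fromstring(
    "<Configuration>"
    "<ProcessingOptions><ProcessingOption ProgName='ZZ4CRCO2' TriggerCount='50'/></ProcessingOptions>"
    "</Configuration>")

merged = merge_config(base, program)
print(merged.find("ProcessingOptions/ProcessingOption").get("TriggerCount"))
```

Because a section is replaced as a whole, a program-specific file that defines processing options must repeat every processing option it still needs; individual options are not merged.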

4. Processing

4.1. General Considerations

Logging and error handling apply to all Raincode Automatic Restart Extension functions: initialization, checkpointing, and repositioning.

4.1.1. Logging

  • All logging is done through the logger defined in the Raincode Runtime execution context.

  • The logging level is INFO except for:

    • File read/write and cursor fetch, which is logged at TRACE logging level to avoid cluttering the log file

    • The absence of a checkpoint file, which is logged at WARNING logging level

    • Processing errors that are logged at ERROR logging level

4.1.2. Error Handling

All processing errors are fatal and cause an abort of the batch program being executed. An error code is logged (see Error codes) as well as an error message. If an exception causes the error, the exception is also logged.

4.2. Initialization

The XML configuration file is read. If a program specific configuration file is found, it is merged into the configuration. See section, Configuration Merging.

Next, the processing option is selected based on the program name, job name, and step ID (see RC_JOB_NAME and RC_STEP_ID in section Environment Variables).

If no processing option is found, execution is aborted.

If the Automatic Restart processing option is enabled, then:

  • If the Automatic checkpoints processing option is enabled, then:

    • Check that only one of the options DDName or Cursor Name/Package is set; otherwise, abort

    • If the checkpoint pacing processing option is enabled, select the corresponding checkpoint pacing option; if none is found, abort

All the applicable options are logged.

Example

 [INFO] [RainCodeRestart]: RC_RESTART_CONFIG: configuration.xml
 [INFO] [RainCodeRestart]: RC_JOB_SYSOUT_DIR: C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000001.ZZJDRCO2
 [INFO] [RainCodeRestart]: RC_RESTART_PGM_CONFIG_DIR not defined
 [INFO] [RainCodeRestart]: RC_JOB_NAME: ZZJDRCO2
 [INFO] [RainCodeRestart]: RC_STEP_ID: STEP020
 [INFO] [RainCodeRestart]: RC_JOB_RESTARTED_SYSOUT_DIR not defined
 [INFO] [RainCodeRestart]: Automatic Restart: True, Automatic Checkpoints: True, Automatic Cursor Repositioning: True, Restore Working Storage: PGMSTART
 [INFO] [RainCodeRestart]: Trigger Count: 5, Pace Checkpoints: False, Cursor package: ZZ4CRCO2, Name: ZZDCA01
 [INFO] [RainCodeRestart]: Cursor repositioning package: ZZ4CRCO2, Name: ZZDCA01 Method: COLUMN, Columns: A01_NUMBER A01_BRANCH
 [INFO] [RainCodeRestart]: Cursor repositioning package: OTHER, Name: OTHER Method: COUNT

4.3. ARCSYSIN

If the JCL step starting the batch program defines a file ARCSYSIN containing a line TRMBEFORCKP=n, execution aborts just before the nth checkpoint is emitted.

This allows generating a program crash in a reproducible way.

Example

[INFO] [RainCodeRestart]: TerminateBeforeCheckpoint 15

[ERROR] [RainCodeRestart.Checkpoint]: Abort 9202:  TerminateBeforeCheckpoint 15 reached, aborting...
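
As a rough illustration, reading the TRMBEFORCKP setting and deciding when to abort could look like the Python sketch below. The guide only documents the TRMBEFORCKP=n line itself; the tolerance for surrounding whitespace, the handling of other lines, and both function names are assumptions of this example.

```python
from typing import Iterable, Optional


def terminate_before_checkpoint(lines: Iterable[str]) -> Optional[int]:
    """Return n from the first 'TRMBEFORCKP=n' line, or None if absent."""
    for line in lines:
        key, sep, value = line.strip().partition("=")
        if sep and key == "TRMBEFORCKP":
            return int(value)
    return None


def must_abort(limit: Optional[int], next_checkpoint_number: int) -> bool:
    """True when the checkpoint about to be emitted is the nth one."""
    return limit is not None and next_checkpoint_number == limit


limit = terminate_before_checkpoint(["TRMBEFORCKP=15"])
print(limit, must_abort(limit, 14), must_abort(limit, 15))
```

With the value 15 from the log example, the check stays false for the first 14 checkpoints and becomes true just before the 15th.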

4.4. Checkpointing

4.4.1. Checkpoint file creation

Checkpointing can be triggered by the batch program or by enabling automatic checkpointing.

In the case of checkpointing triggered by the batch program, the checkpoint is always triggered by a database commit:

  • Just before the commit, a snapshot of the program static memory is written to a file, and the current checkpoint is written to a temporary file

  • Just after the commit, the temporary file is renamed; the most recent renamed file thus corresponds to the most recent commit

In the case of automatic checkpointing, there are two possibilities:

  • The batch program has a database connection: in that case, automatic checkpointing triggers a commit and the processing is the same as above

  • The batch program has no database connection: in that case, automatic checkpointing directly triggers the writing of the static memory snapshot and the checkpoint file

There are two versions of the checkpoint file and working storage snapshot file: the most recent one and the previous one. All those files are in the folder designated by the environment variable RC_JOB_SYSOUT_DIR (See section Environment Variables).

Example

 [INFO] [RainCodeRestart.AutomaticCheckpoint]: Count 75
 [INFO] [RainCodeRestart]: Pre Commit
 [INFO] [RainCodeRestart.WorkingStorage]: Save C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000001.ZZJDRCO2\checkpoint.0.WS size 2371
 [INFO] [RainCodeRestart.Checkpoint]: Cursor ZZ4CRCO2 ZZDCA01 position 75
 [INFO] [RainCodeRestart.Checkpoint]: File SYSPRINT position 0
 [INFO] [RainCodeRestart.Checkpoint]: File S1DQDATA position 6000
 [INFO] [RainCodeRestart.Checkpoint]: Save checkpoint C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000001.ZZJDRCO2\checkpoint.0.xmltmp version 14
 [INFO] [RainCodeRestart]: Post Commit No error
 [INFO] [RainCodeRestart.Checkpoint]: Rename checkpoint from C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000001.ZZJDRCO2\checkpoint.0.xmltmp to C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000001.ZZJDRCO2\checkpoint.0.xml
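
The pre-commit and post-commit steps shown in this log follow a write-then-rename pattern, sketched below in Python. The sketch is a simplification: it uses a single fixed file name taken from the log (checkpoint.0.xml and checkpoint.0.xmltmp), ignores the working storage snapshot and the rotation between the most recent and previous versions, and replaces the database commit with a caller-supplied function. The function name checkpoint_with_commit is hypothetical.

```python
import os
import tempfile
from typing import Callable


def checkpoint_with_commit(folder: str, content: str, commit: Callable[[], None]) -> str:
    """Write the checkpoint to a temporary file, commit, then rename it."""
    final_path = os.path.join(folder, "checkpoint.0.xml")
    temp_path = final_path + "tmp"          # checkpoint.0.xmltmp, as in the log
    with open(temp_path, "w", encoding="utf-8") as handle:
        handle.write(content)               # pre-commit: checkpoint data on disk
    commit()                                # if this raises, the old .xml stays in place
    os.replace(temp_path, final_path)       # post-commit: publish the new checkpoint
    return final_path


with tempfile.TemporaryDirectory() as folder:
    path = checkpoint_with_commit(folder, "<Checkpoint />", commit=lambda: None)
    print(os.path.basename(path), os.path.exists(path + "tmp"))
```

The point of the ordering is that a renamed checkpoint file never describes a state that was not committed: if the commit fails, only the temporary file has changed.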

4.4.2. Checkpoint file content

The checkpoint file contains:

  • The file name of the current working storage snapshot file

  • The position of the cursors defined in the cursor repositioning options; if the repositioning method is COLUMN, it also contains the cursor column values

  • The position of the files

Example

<?xml version="1.0" encoding="utf-8"?>
<Checkpoint WorkingStorage="checkpoint.0.WS">
  <Cursors>
    <Cursor Open="true" Module="ZZ4CRCO2" Name="ZZDCA01" Position="75">
      <Columns>
        <Column Name="A01_NUMBER">
         <Value>AAEAAAD/////AQAAAAAAAAAEAQAAAAxTeXN0ZW0uSW50MzIBAAAAB21fdmFsdWUACEsAAAAL</Value>
        </Column>
        <Column Name="A01_BRANCH">
         <Value>AAEAAAD/////AQAAAAAAAAAEAQAAAAxTeXN0ZW0uSW50MzIBAAAAB21fdmFsdWUACEsAAAAL</Value>
        </Column>
      </Columns>
    </Cursor>
  </Cursors>
  <Files>
    <File Open="true" Name="SYSPRINT" Position="0" Count="0" />
    <File Open="false" Name="E1DQDATA" Position="80" Count="1" />
    <File Open="true" Name="S1DQDATA" Position="6000" Count="75" />
  </Files>
</Checkpoint>
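
A checkpoint file such as the one above can be inspected with a few lines of Python, for example when diagnosing a failed restart. The sketch below uses a shortened copy of the example. It assumes that the Value elements hold Base64 text, which is how the sample values look; it decodes them to raw bytes only and does not interpret their content.

```python
import base64
import xml.etree.ElementTree as ET

CHECKPOINT = """<Checkpoint WorkingStorage="checkpoint.0.WS">
  <Cursors>
    <Cursor Open="true" Module="ZZ4CRCO2" Name="ZZDCA01" Position="75">
      <Columns>
        <Column Name="A01_NUMBER">
          <Value>AAEAAAD/////AQAAAAAAAAAEAQAAAAxTeXN0ZW0uSW50MzIBAAAAB21fdmFsdWUACEsAAAAL</Value>
        </Column>
      </Columns>
    </Cursor>
  </Cursors>
  <Files>
    <File Open="true" Name="SYSPRINT" Position="0" Count="0" />
    <File Open="false" Name="E1DQDATA" Position="80" Count="1" />
    <File Open="true" Name="S1DQDATA" Position="6000" Count="75" />
  </Files>
</Checkpoint>"""

root = ET.fromstring(CHECKPOINT)

# Files that were open at checkpoint time, with their saved positions.
open_files = {f.get("Name"): int(f.get("Position"))
              for f in root.iter("File") if f.get("Open") == "true"}

# Cursor positions keyed by (module, name).
cursors = {(c.get("Module"), c.get("Name")): int(c.get("Position"))
           for c in root.iter("Cursor")}

# Column values decoded to raw bytes only; their inner format is not interpreted.
raw_values = {col.get("Name"): base64.b64decode(col.findtext("Value").strip())
              for col in root.iter("Column")}

print(root.get("WorkingStorage"), open_files, cursors)
```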

4.4.3. Static data

The following Working Storage item marks the end of the static data that must be saved:

01 FILLER PIC X(30) VALUE '** CHKP AREA END FOR AR/CTL **'
VOLATILE.

The size of the saved static data is logged:


 [INFO] [RainCodeRestart]: Working Storage ZZ4CRCO2 address/size RainCodeLegacyRuntime.Core.MemoryArea:266888/4139
 [INFO] [RainCodeRestart.WorkingStorage]: Address/size Full RainCodeLegacyRuntime.Core.MemoryArea:266888/4139 Trimmed RainCodeLegacyRuntime.Core.MemoryArea:266888/2371

4.5. Repositioning

When JCL is launched in restart mode, it sets the environment variable RC_JOB_RESTARTED_SYSOUT_DIR (See section Environment Variables) to the SYSOUT folder of the restarted step.

When activated, the Raincode Automatic Restart Extension searches that folder for the most recent checkpoint file.

If no checkpoint file is found, a warning message is logged, and the batch program executes from the beginning.

If a checkpoint file is found, it is copied, together with the corresponding working storage snapshot file, to the folder designated by the environment variable RC_JOB_SYSOUT_DIR (see section Environment Variables), and all open files and cursors recorded in the checkpoint file are set to repositioning mode.

On program start, working storage is restored from the working storage snapshot file if the processing option RestoreWorkingStorage is PGMSTART; other options are not yet implemented.

Before a file is opened, its open mode is changed to extend if it was output (write).

After opening a file or cursor, the extension repositions it and exits repositioning mode. Files that are not read-only are truncated to their checkpointed position.

Example


 [INFO] [RainCodeRestart]: RC_RESTART_CONFIG: configuration.xml
 [INFO] [RainCodeRestart]: RC_JOB_SYSOUT_DIR: C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000002.ZZJDRCO2
 [INFO] [RainCodeRestart]: RC_RESTART_PGM_CONFIG_DIR not defined
 [INFO] [RainCodeRestart]: RC_JOB_NAME: ZZJDRCO2
 [INFO] [RainCodeRestart]: RC_STEP_ID: STEP020
 [INFO] [RainCodeRestart]: RC_JOB_RESTARTED_SYSOUT_DIR: C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000001.ZZJDRCO2
 [INFO] [RainCodeRestart]: Automatic Restart: True, Automatic Checkpoints: True, Automatic Cursor Repositioning: True, Restore Working Storage: PGMSTART
 [INFO] [RainCodeRestart]: Trigger Count: 5, Pace Checkpoints: False, Cursor package: ZZ4CRCO2, Name: ZZDCA01
 [INFO] [RainCodeRestart]: Cursor repositioning package: ZZ4CRCO2, Name: ZZDCA01 Method: COLUMN, Columns: A01_NUMBER A01_BRANCH
 [INFO] [RainCodeRestart]: Cursor repositioning package: OTHER, Name: OTHER Method: COUNT
 [INFO] [RainCodeRestart.Checkpoint]: Restore checkpoint C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000001.ZZJDRCO2\checkpoint.0.xml
 [INFO] [RainCodeRestart.Checkpoint]: Copy C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000001.ZZJDRCO2\checkpoint.0.WS to C:\Users\jeaneric\AppData\Local\Temp\test_infra_89388\fiodi5re.phl\SYSOUT\JOB0000000002.ZZJDRCO2\checkpoint.0.WS
 [INFO] [RainCodeRestart.Checkpoint]: Reposition cursor ZZ4CRCO2 ZZDCA01
 [INFO] [RainCodeRestart.Checkpoint]: Reposition file SYSPRINT
 [INFO] [RainCodeRestart.Checkpoint]: Reposition file S1DQDATA
 [INFO] [RainCodeRestart.Checkpoint]: Entering reposition mode
 [INFO] [RainCodeRestart.Checkpoint]: Cursor ZZ4CRCO2 ZZDCA01 position 75
 [INFO] [RainCodeRestart.Checkpoint]: File SYSPRINT position 0
 [INFO] [RainCodeRestart.Checkpoint]: File S1DQDATA position 6000
 [INFO] [RainCodeRestart]: File pre open S1DQDATA 00
 [INFO] [RainCodeRestart]: File S1DQDATA change from Output to Extend in restart mode status 00
 [INFO] [RainCodeRestart]: File post open S1DQDATA 00
 [INFO] [RainCodeRestart.Checkpoint]: File S1DQDATA reposition 6000
 [INFO] [RainCodeRestart]: File post close E1DQDATA 00
 [INFO] [RainCodeRestart]: Cursor ZZ4CRCO2 ZZDCA01 open No error
 [INFO] [RainCodeRestart.Checkpoint]: Cursor ZZ4CRCO2 ZZDCA01 reposition 75
 [INFO] [RainCodeRestart.Checkpoint]: Exiting reposition mode
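
The effect of reopening an output file in extend mode and truncating it to its checkpointed position can be reproduced with ordinary file operations, as in the Python sketch below. It assumes that the position is a byte offset, which is consistent with the sample checkpoint (position 6000 for 75 records, position 80 for 1 record) but is not stated in this guide. The function name reposition_output_file is hypothetical.

```python
import os
import tempfile


def reposition_output_file(path: str, position: int):
    """Reopen an output file for a restart and truncate it to `position`."""
    # "r+b" keeps the existing content, unlike "wb", which would empty the file.
    handle = open(path, "r+b")
    handle.truncate(position)     # drop what was written after the checkpoint
    handle.seek(position)         # subsequent writes continue from here
    return handle


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "S1DQDATA")
    with open(path, "wb") as out:
        out.write(b"x" * 6400)    # 6000 bytes checkpointed + 400 written before the crash
    with reposition_output_file(path, 6000) as out:
        out.write(b"y" * 80)
    print(os.path.getsize(path))
```

Truncation matters because records written between the last checkpoint and the crash would otherwise appear twice once the restarted program writes them again.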

Appendix A: Useful information

A.1. Error codes

  • Configuration = 9200

  • CannotReadArcsysin

  • TerminateBeforeCheckpoint

  • ColumnNotFound

  • InvalidFileOpenMode

  • AutomaticCheckpointExpected

  • CursorRepositioningFailed

  • FileRepositioningFailed

  • CannotRetrieveWorkingStorageSize

  • CannotWriteWorkingStorage

  • CannotDeleteWorkingStorage

  • CannotRenameWorkingStorage

  • CannotCopyWorkingStorage

  • CannotReadWorkingStorage

  • CannotWriteCheckpoint

  • CannotDeleteCheckpoint

  • CannotRenameCheckpoint

  • CannotReadCheckpoint
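
Only the first code, 9200, is given explicitly in this list. The Python sketch below assumes that the remaining codes are numbered consecutively in the order listed. This is an assumption, although it agrees with the ARCSYSIN log example, where TerminateBeforeCheckpoint is reported as 9202. The enumeration name RestartError is hypothetical and is only meant as a lookup aid when reading logs.

```python
from enum import IntEnum

_NAMES = [
    "Configuration", "CannotReadArcsysin", "TerminateBeforeCheckpoint",
    "ColumnNotFound", "InvalidFileOpenMode", "AutomaticCheckpointExpected",
    "CursorRepositioningFailed", "FileRepositioningFailed",
    "CannotRetrieveWorkingStorageSize", "CannotWriteWorkingStorage",
    "CannotDeleteWorkingStorage", "CannotRenameWorkingStorage",
    "CannotCopyWorkingStorage", "CannotReadWorkingStorage",
    "CannotWriteCheckpoint", "CannotDeleteCheckpoint",
    "CannotRenameCheckpoint", "CannotReadCheckpoint",
]

# Assumption: codes are consecutive, starting at the documented 9200.
RestartError = IntEnum("RestartError", _NAMES, start=9200)

print(int(RestartError.TerminateBeforeCheckpoint))
```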