IBM i: Preemptive problem solving using the watch-for-event function

If you’re responsible for technical support on your IBM i, picture the following scenario: It’s 2 a.m. and the night-shift operator has called to say that jobs are going into message-wait status. What do you do?

A. Tell them to try to figure the problem out themselves and go back to sleep.
B. Tell them to forget about it until the morning and go back to sleep.
C. Investigate the problem yourself, even though you’re not fully awake and are under pressure to get the end-of-night routine completed on time.
D. You’ve already taken Chilli-IT’s out-of-hours support; they’re busy fixing it and you were never bothered in the first place.

If you answered A or B, you may be in the wrong line of work. If you answered C, that’s fine, but late nights and working under pressure are not ideal. If you answered D, that’s another step forward in making sure that your systems run smoothly.

What happens when the problem is fixed and everything is back to normal?

Well, you now know what to do the next time it happens, but ideally you will want to preempt it happening again. At Chilli we have a monitoring suite of software constantly working, collecting server information and providing problem alerts and warnings. We like to preempt potential problems. Why? Because it allows us to address issues sooner and avoid being under pressure when problems become urgent. A good example of this is an unusual increase in disk space utilisation. Knowing that your disk space utilisation is growing before it grows too large gives you plenty of time to investigate. Tracking disk space utilisation is a standard check for most IT departments, but what about more unusual examples of preemptive problem solving? Let’s look at one that uses a watch-for-event function.

We had a situation where a number of critical jobs were stopped dead in their tracks after receiving message CPA7090, reason code 4. This message indicated that entries could not be written to a journal receiver because the sequence number for the last written entry had reached its limit, in this case 9,999,999,999.

Many customers configure their journals so that the management of receivers is left to the system, i.e. the system automatically generates and attaches new journal receivers when required. The system is also capable of resetting the sequence number to 1 when necessary. However, there are some situations where the sequence number reset is not possible. According to IBM documentation, “You can reset the sequence number only when all changes are forced to auxiliary storage for all journaled objects and commitment control is not active for the journal.” The same documentation also states, “If you use system change-journal management support (MNGRCV(*SYSTEM)) for the journal, the system attempts to change the journal and reset the sequence number one time.” It appears, then, that if the system cannot reset the sequence number, jobs go into message-wait status and stop with message CPA7090, reason code 4.

The fix in this case was to use command CHGJRN JRN(<journal>) JRNRCV(*GEN) SEQOPT(*RESET). This generated a new journal receiver, attached the new receiver to the journal and reset the sequence number of the entries back to 1. The retry option was taken for the jobs in message-wait status and everything was back on track (but running later than usual due to the delay whilst the problem was investigated and fixed). We were in luck because none of the conditions that prevent CHGJRN from resetting the sequence number were present.

Chilli’s preemptive solution to the problem was to change journal receivers and reset the sequence number manually when the sequence number was approaching the limit, but long before it reached it. This would give us time to reset the sequence number at a better time, during normal working hours, and also allow several attempts should the change initially fail. This would stop jobs failing when writing to the journal receiver. The only thing left was to find a way to know when the journal sequence number was nearing its limit without resorting to manually checking the relevant information. This is where a watch-for-event function came in.

A watch-for-event function allows you to set up an automatic lookout for messages that appear on a message queue, in a job log or in the history log (it can also look for entries in the licensed internal code log and product activity log). When the message appears, the function calls a user-written exit program to handle processing. In our situation we set up a watch-for-event function to look out for message CPF7019 in the history log. CPF7019 is sent whenever a new journal receiver is attached to a journal and the sequence number is not reset.

A watch-for-event function can be set up using the STRWCH command and in our situation looks like the following:

STRWCH SSNID(JOURNALSEQ) WCHPGM(QGPL/CIWPGM001C) WCHMSG(CPF7019) WCHMSGQ((*HSTLOG))

SSNID is just a unique session identifier for watches on the system. WCHPGM is the user-written exit program that receives control when the message is encountered. WCHMSG is the message identifier. WCHMSGQ is the location to watch for the message, in this case the history log.
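
To make the parameter mapping concrete, here is a minimal sketch in Python that assembles the same STRWCH command string from its four parts. The function name build_strwch and its argument names are hypothetical, chosen only for illustration. It only formats text and does not run anything on the system.

```python
def build_strwch(session_id, program_lib, program, message_id,
                 message_queue="*HSTLOG"):
    """Assemble a STRWCH command string from the four parameters described above.

    Illustrative helper only: it formats text and performs no validation.
    """
    return (
        f"STRWCH SSNID({session_id}) "
        f"WCHPGM({program_lib}/{program}) "
        f"WCHMSG({message_id}) "
        f"WCHMSGQ(({message_queue}))"
    )


if __name__ == "__main__":
    # Reproduces the command used in this article.
    print(build_strwch("JOURNALSEQ", "QGPL", "CIWPGM001C", "CPF7019"))
```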

The WRKWCH command can be used to work with watch-for-event functions configured on the system.

The user-written exit program receives information about the event via parameters. In our particular situation the program determines the first sequence number in the new journal receiver and, if it exceeds 9,000,000,000, sends a message to the system operator message queue, QSYSOPR. Reviewing the system operator message queue is standard practice for IBM i technical support, so this message is the cue to change the journal receiver and reset the sequence number manually, long before it becomes an urgent problem. In our case we take the alerting one step further: we have Chilli’s monitors running, and one of them looks for particular messages on the system operator message queue. The journal sequence message is picked up by that monitor and raised as an alert on the monitor dashboard.
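
The decision the exit program makes can be sketched in a few lines. The following Python is not the exit program itself (which runs on IBM i and is not shown in this article). It is a minimal illustration of the threshold check, using the limit and threshold quoted above. All function names and the wording of the alert text are assumptions made for the example.

```python
# Figures taken from the article; all names below are illustrative only.
SEQUENCE_LIMIT = 9_999_999_999   # sequence number at which CPA7090 RC4 occurred
ALERT_THRESHOLD = 9_000_000_000  # alert when the first sequence number exceeds this


def remaining_entries(first_sequence):
    """Number of entries that can still be written before the limit is reached."""
    return max(SEQUENCE_LIMIT - first_sequence, 0)


def needs_reset_alert(first_sequence):
    """True when the new receiver's first sequence number exceeds the threshold."""
    return first_sequence > ALERT_THRESHOLD


def alert_text(journal, first_sequence):
    """Return hypothetical alert wording, or None when no alert is needed."""
    if not needs_reset_alert(first_sequence):
        return None
    return (
        f"Journal {journal}: first sequence number {first_sequence} exceeds "
        f"{ALERT_THRESHOLD}; {remaining_entries(first_sequence)} entries remain "
        f"before the limit. Change the receiver and reset the sequence number."
    )


if __name__ == "__main__":
    # MYJRN is a made-up journal name used only for this demonstration.
    print(alert_text("MYJRN", 9_250_000_000))
```

With a threshold of 9,000,000,000 the alert fires while just under a billion sequence numbers are still available, which is what leaves room to schedule the reset during working hours.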

The watch-for-event function is a great way to automate checking for, and reacting to, problems on your systems, and it can even preempt potential problems. It’s another tool available to help keep your IBM i running smoothly. Make use of it today.