Mar 19, 2015  02:03 PM | Pierre Bellec
RE: Jobs keep on running
Dear Chaoyi,

This problem is unfortunately fairly common. PSOM expects that a submitted job will terminate cleanly. If one of the execution nodes runs out of memory, or is manually turned off, PSOM will wait forever for the job to terminate. We'll add a mechanism in the next release so that PSOM checks the health of running jobs rather than assuming they will complete. For now, if a cluster is unstable, the only solution is to manually remove PIPE.lock from the logs folder and restart the pipeline. It may be worth investigating why the jobs are dying and whether there is a remedy, because having to restart a pipeline manually many times is very annoying.
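
As a rough sketch, the manual lock removal could look like this in MATLAB/Octave. The logs path is a placeholder (I am assuming it is the folder you passed as opt.path_logs), and the restart call is left commented out because it depends on your own pipeline and opt variables.

```matlab
% Placeholder path: replace with the logs folder of your own pipeline
% (assumed here to be the folder you passed as opt.path_logs).
path_logs = fullfile(tempdir, 'psom_logs_example');
file_lock = fullfile(path_logs, 'PIPE.lock');

if exist(file_lock, 'file') == 2
    delete(file_lock);   % remove the lock left behind by the stuck run
    fprintf('Removed %s\n', file_lock);
else
    fprintf('No lock file at %s\n', file_lock);
end

% Then restart the pipeline with the same call you used before, e.g.
% psom_run_pipeline(pipeline, opt);
```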

Regarding the lack of error messages, have a look in the logs folder. There may be files named after the job, such as job1.log, job1.eqsub, job1.oqsub, etc. Those are plain text files and may contain informative error messages.
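
If it helps, here is a small sketch that prints every file named after a given job. The logs path and the job name 'job1' are placeholders, and it assumes the files sit directly in the logs folder as described above.

```matlab
% Placeholders: adapt the logs folder and the job name to your pipeline.
path_logs = fullfile(tempdir, 'psom_logs_example');
job_name  = 'job1';

% Find job1.log, job1.eqsub, job1.oqsub, ... in the logs folder.
list_files = dir(fullfile(path_logs, [job_name '.*']));
nb_shown = 0;
for num_f = 1:numel(list_files)
    if list_files(num_f).isdir
        continue   % skip folders, only plain files are of interest
    end
    file_name = fullfile(path_logs, list_files(num_f).name);
    fprintf('\n===== %s =====\n', file_name);
    disp(fileread(file_name));   % these are plain text files
    nb_shown = nb_shown + 1;
end
if nb_shown == 0
    fprintf('No files found for %s in %s\n', job_name, path_logs);
end
```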

Now, here are two possible sources of the problem, with suggested fixes.

(1) is the easiest to fix. There may be a walltime limit on your submission system, i.e. jobs get automatically killed after X hours. All you need to do is use opt.qsub_options and add the appropriate option to extend the walltime. This will look like
opt.qsub_options = '-l walltime=03:00:00';
but the exact syntax depends on the type of scheduler you are using, so you will need to check its documentation.
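
To illustrate how the option may differ between schedulers, here is a hedged sketch. The 'scheduler' variable is just a label I made up for the example, the Grid Engine form (h_rt) is my assumption of the usual equivalent, and the three-hour value is arbitrary. Please confirm the right resource name with your cluster documentation.

```matlab
% Hypothetical label for the example: set it to match your cluster.
scheduler = 'torque';

opt = struct();
opt.mode = 'qsub';   % submit jobs through qsub
switch scheduler
    case 'torque'
        % Torque/PBS style, as in the line above
        opt.qsub_options = '-l walltime=03:00:00';
    case 'sge'
        % Grid Engine style (assumption: hard run-time limit h_rt)
        opt.qsub_options = '-l h_rt=03:00:00';
    otherwise
        error('Unknown scheduler %s: check its qsub documentation', scheduler);
end
fprintf('qsub options: %s\n', opt.qsub_options);

% The opt structure is then passed to the pipeline manager, e.g.
% psom_run_pipeline(pipeline, opt);
```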

(2) If you are using a qsub system, the .eqsub and .oqsub files may be missing, and that would be the cause of the problem: PSOM waits for these files to be generated. I have seen clusters where a few of the eqsub/oqsub files were not generated, seemingly at random. That was eventually fixed by system upgrades, but I never narrowed down the origin of the problem. If this is what is happening, please get in touch with the system administrator of the server.
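
A quick way to check for case (2) is to list the jobs that have no .eqsub or .oqsub file, as sketched here. The logs path and the job names are placeholders. Note that a job that is still genuinely running may not have these files yet, so only jobs that the scheduler reports as finished are suspicious.

```matlab
% Placeholders: adapt the logs folder and the list of job names.
path_logs = fullfile(tempdir, 'psom_logs_example');
list_jobs = {'job1', 'job2'};

list_missing = {};
for num_j = 1:numel(list_jobs)
    job_name = list_jobs{num_j};
    has_eqsub = exist(fullfile(path_logs, [job_name '.eqsub']), 'file') == 2;
    has_oqsub = exist(fullfile(path_logs, [job_name '.oqsub']), 'file') == 2;
    if ~(has_eqsub && has_oqsub)
        list_missing{end+1} = job_name;   % at least one qsub file is absent
        fprintf('%s: eqsub present = %d, oqsub present = %d\n', ...
                job_name, has_eqsub, has_oqsub);
    end
end
fprintf('%d job(s) with missing eqsub/oqsub files\n', numel(list_missing));
```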

I hope that helps,

Pierre
