[NorESM-users] Potential workaround for bad performance for jobs on preproc queue of Betzy HPC
Oskar Andreas Landgren
oskaral at met.no
Thu May 23 17:25:18 CEST 2024
Hi all,
I previously sent this to the NorESM group at MET, but Ada suggested I
should also send it here. Please ignore this if it is not relevant to you.
On the Norwegian Betzy HPC, my jobs in the preproc queue have been very
slow during the past two months or so, sometimes taking 10-20 times longer
than they normally do, running out of walltime and getting killed. The
slowdown seems to occur mainly when a job runs on the b5201 node, so I am
currently running with
--qos=preproc --exclude=b5201
in all my preproc queue jobs, and it seems to work really well again. (I
have reported this back to Sigma2 too.)
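As a rough sketch, a job script using the two flags above could look like
the following. Only --qos=preproc and --exclude=b5201 come from this
message; the walltime and task count are placeholders, and your site will
most likely also require an --account line with your own project. The
script also prints the node name (Slurm sets SLURMD_NODENAME inside a job)
so that slow runs can later be matched to a node.

```shell
#!/bin/bash
# Minimal sketch of a preproc job script; values other than the
# --qos and --exclude flags are placeholders, not recommendations.
#SBATCH --qos=preproc
#SBATCH --exclude=b5201
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Record which node the job landed on. SLURMD_NODENAME is set by Slurm
# inside a job; fall back to the host name when run outside Slurm.
node="${SLURMD_NODENAME:-$(uname -n)}"
echo "Running on node: ${node}"

# ... the actual preprocessing commands would go here ...
```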
Previously I restricted my jobs to the preproc-1 node (--qos=preproc -w
preproc-1), but that node is not always available, so excluding b5201 is
more flexible: it leaves the other preproc queue nodes usable, and
currently b5202-b5206 are available and seem to work well for me.
The low performance is more likely a filesystem (Lustre) issue than a CPU
one, as CPU usage is low and doesn't seem to be the bottleneck at all, but
I don't know why it works on some nodes but not all.
On some days b5202 is also a bit slow, and then I exclude both, i.e.
--exclude=b5201,b5202. This may indicate heavy load on these specific
nodes, but I haven't been able to find any pattern just by looking at the
number of active jobs on the respective nodes.
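If the set of slow nodes changes from day to day, it may be convenient to
keep it in one variable and build the flag from it. The helper below is a
hypothetical convenience function, not part of Slurm; it only joins node
names with commas into the --exclude form shown above.

```shell
# Nodes currently considered slow; edit as needed.
slow_nodes="b5201 b5202"

# Hypothetical helper: print an --exclude flag for the given node names,
# or nothing at all when no names are given.
exclude_flag() {
    [ "$#" -eq 0 ] && return 0
    local IFS=,              # "$*" joins the arguments with the first IFS character
    echo "--exclude=$*"
}

# Intentionally unquoted so the list splits into one argument per node.
exclude_flag $slow_nodes
```

The output could then be passed on the command line, for example
sbatch --qos=preproc $(exclude_flag $slow_nodes) job.sh, where job.sh
stands in for your own script.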
I guess the most important take-home message is to pay attention to which
node is used when you experience low performance. If the slowdown
regularly coincides with a particular node, try excluding that node as
described above.
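One possible way to spot such a pattern is to compare run times per node
from the Slurm accounting records. The sketch below assumes single-node
jobs and sacct output in the form NodeList|ElapsedRaw (elapsed time in
seconds); the function name is made up for this example, and the start
date in the usage comment is a placeholder.

```shell
# Hypothetical helper: read "node|elapsed_seconds" lines on stdin and
# print "node job_count mean_elapsed_seconds", sorted by node name.
mean_elapsed_by_node() {
    awk -F'|' '
        # Skip lines without a numeric elapsed time.
        NF >= 2 && $2 ~ /^[0-9]+$/ { sum[$1] += $2; n[$1]++ }
        END {
            for (node in sum)
                printf "%s %d %d\n", node, n[node], sum[node] / n[node]
        }' | sort
}

# Possible usage on the cluster (not run here):
#   sacct -u "$USER" -X -n -P --format=NodeList,ElapsedRaw -S 2024-04-01 \
#       | mean_elapsed_by_node
```

A node whose mean elapsed time is far above the others for comparable jobs
would be a candidate for --exclude. Jobs of very different sizes will
distort the mean, so this is only a first indication.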
Let me know if you also experience issues and have found any other
workarounds.
Best wishes,
Oskar