Tuesday, April 8, 2014

Maintaining the Clusters

Quick Description: Often we will need to update or fix our clusters. To do this, we need to shut the computer down properly before modifying it and restarting it.

The Point: Take a cluster down. Put it back online.

Prerequisites: SSH access, for example through PuTTY.

Notes:
  1. Stopping a Job
    • To properly stop a job, copy over the STOPCAR file:
      • To stop after an ionic step, send the command "cp ~/STOPCAR ."
      • To stop after an electronic step, send the command "cp ~/STOPCAR.abort ." This is pretty much the same as the "qdel" command (as in, this method isn't as nice).
  2. Taking down a cluster
    • Become a super user by using the command "su" (password required). If you later want to stop being a super user, just use "exit".
    • SSH into the computer that you want using "ssh <name of cluster>". Make sure you go into the cluster you want, because you don't want to accidentally take down vasp1, since it controls everything.
    • Enter "shutdown now" - this will shut the cluster down and then kick you back out into vasp1.
    • To check, you can use "pbsnodes -a", which shows every node, and "pbsnodes -l", which shows all the nodes that are down. Alternatively, you can SSH into the cluster you just took down; if it refuses you, then it's down. Some computers will let you back on because they restart automatically; you will need to physically turn those off.
  3. Physically turning off a cluster
    • Cluster is located in the Pierce Annex bottom floor lab. The computers are labeled and organized in power strips in the back. You can use the power button or just the power cord. (Note: vasp4, vasp5, and vasp6 are slow.)
    • Sweet. Now you can change ram or put in a new hard drive or whatever you want.
    • Turn on with the power switch again. (Except for vasp3 which doesn't have one... so just use the power cord.)
  4. Restarting a cluster
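
The two STOPCAR files from step 1 differ by a single VASP tag. Here is a minimal sketch, assuming ~/STOPCAR holds "LSTOP = .TRUE." and ~/STOPCAR.abort holds "LABORT = .TRUE." (the notes above don't show what is actually in those files). The function name write_stopcar is made up for illustration:

```shell
#!/bin/sh
# write_stopcar MODE [DIR]
# Hypothetical helper: writes a STOPCAR into DIR (the job's working directory,
# default: the current directory).
#   MODE=ionic      -> LSTOP  = .TRUE.  (VASP stops at the next ionic step)
#   MODE=electronic -> LABORT = .TRUE.  (VASP stops at the next electronic step)
write_stopcar() {
    mode=$1
    dir=${2:-.}
    case "$mode" in
        ionic)      tag="LSTOP = .TRUE." ;;
        electronic) tag="LABORT = .TRUE." ;;
        *)
            echo "usage: write_stopcar ionic|electronic [dir]" >&2
            return 1
            ;;
    esac
    printf '%s\n' "$tag" > "$dir/STOPCAR"
}
```

From inside the job directory, "write_stopcar ionic" should then do the same thing as "cp ~/STOPCAR ." under that assumption.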
Other notes....kind of disorganized still...

NERSC:

useful command for changing from dos CRLF to linux LF (ASCII):
dos2unix [filename] <--(Use only when you use a Windows machine to create INCAR, POSCAR, …)

checking text format: (to check and make sure dos2unix worked).
file [filename]

using vi to see special characters:
vi [filename]
:set list
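
If you want to check a whole batch of input files for DOS line endings without opening each one in vi, something like the sketch below might help. The function names has_crlf and strip_cr are made up here, and strip_cr is only a stand-in for dos2unix on machines where it isn't installed:

```shell
#!/bin/sh
# has_crlf FILE: succeeds (exit 0) if FILE contains a carriage return,
# i.e. it probably still has DOS line endings.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_cr FILE: print FILE with every carriage return removed.
strip_cr() {
    tr -d '\r' < "$1"
}
```

Possible usage: has_crlf INCAR && echo "INCAR needs dos2unix"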


Torque

Commands for the External Hard drive (which is currently dead):
mount --bind olddir newdir
mount -o uid=username,gid=groupname /dev/sdc /path/to/mount

uid=bartels,gid=users

mount -o remount /home/bartels/sharedData/share2

Related to any NFS mount (what you would type in options on directories to export):
ro,root_squash,sync,no_subtree_check


Check OpenSUSE installation instructions
How to know how much swap space there is:
>ssh vasp(# you wish to view)
>top (to view the swap)
>q (to exit top)
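
If you'd rather not read the number off of top, the same swap figures are in /proc/meminfo on Linux. A small sketch (swap_kb is a made-up name) that prints total and free swap in kB:

```shell
#!/bin/sh
# swap_kb [FILE]: print "total_kB free_kB" from a meminfo-style file.
# Defaults to /proc/meminfo on the machine you are logged into;
# pass "-" to read from standard input instead.
swap_kb() {
    awk '/^SwapTotal:/ {t = $2} /^SwapFree:/ {f = $2} END {print t, f}' "${1:-/proc/meminfo}"
}
```

Possible usage from vasp1 (vasp3 is just an example node): ssh vasp3 cat /proc/meminfo | swap_kb -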

How to give a node more swap space than it currently has (if possible):
umount -a
fuser -m  /home   (then get rid of those processes using kill)
try: umount /home
      umount -a
-in yast, go to harddrives
add partition

-outside yast
mkswap /dev/sda3  (or /dev/sda4 or whatever it is) ... might not be necessary if partition was added via yast
swapon /dev/sda3
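
To confirm that swapon actually took, you can list the active swap devices from /proc/swaps. A minimal sketch (active_swap is a made-up name; it assumes the usual Linux layout with one header line followed by one line per device):

```shell
#!/bin/sh
# active_swap [FILE]: print the device or file name of every active swap area.
# Skips the header line of a /proc/swaps-style file.
active_swap() {
    awk 'NR > 1 {print $1}' "${1:-/proc/swaps}"
}
```

If the partition you just added (for example /dev/sda3) doesn't show up in the output, the swapon step didn't work.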


df [display hd statistics]
df -h [human readable]
(The commands above, probably df -h in particular, will show you the info that appears at the bottom of the Servlet test page. They should show what extra drives are mounted as well as the local hard drives.)

To check the cpu info on any server:
cat /proc/cpuinfo [determine freq etc.]
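
The cpuinfo output is long, so here is a sketch that pulls out just the processor count and model, which is handy when deciding what ppn value a node can take. The names cpu_count and cpu_model are made up, and the field names assume the x86 layout of /proc/cpuinfo:

```shell
#!/bin/sh
# cpu_count [FILE]: number of logical processors in a cpuinfo-style file.
cpu_count() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# cpu_model [FILE]: model name of the first processor listed.
cpu_model() {
    sed -n 's/^model name[[:space:]]*:[[:space:]]*//p' "${1:-/proc/cpuinfo}" | head -n 1
}
```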

If you restart vasp1 and things won’t work do this:
start NFS server on vasp1 through YAST [if not already started -- yast2]

pbs_mom [on each compute node]
pbs_server
[pbs_sched is no longer used; the scheduler is now maui]
maui
disown -a
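
The start-up order above could be wrapped in a script roughly like this. It is only a sketch: the node list is a placeholder, the names run and restart_queue are made up, and it prints the commands instead of running them unless you set DRY_RUN=0 (as su on vasp1). The disown step is left out since it only applies to background jobs started from an interactive shell.

```shell
#!/bin/sh
# Placeholder node list; replace with the real compute node names.
NODES="vasp2 vasp3"

# run CMD...: print the command when DRY_RUN is 1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# restart_queue: start the daemons in the order given in the notes.
restart_queue() {
    for n in $NODES; do
        run ssh "$n" pbs_mom   # one MOM on each compute node
    done
    run pbs_server             # Torque server
    run maui                   # scheduler
}
```

Running restart_queue with the defaults just lists what would be executed, which is a cheap way to double check the node names first.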

If something is wrong with the scheduling system:
(probably have to be su to do this)
momctl -s [shutdown mom]
qterm [shutdown pbs]
schedctl -k [shutdown maui]

To take nodes on and offline (must be su):
pbsnodes -o nodename (takes a node offline)
pbsnodes -c nodename (put the node back online)
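
After taking nodes on or offline it is worth checking what the server thinks. The sketch below filters "pbsnodes -l" style output (node name in the first column, state in the second; that layout is an assumption about your Torque version) down to just the node names. down_nodes is a made-up name:

```shell
#!/bin/sh
# down_nodes: read "pbsnodes -l" style output on standard input and print
# the names of nodes whose state column mentions "down" or "offline".
down_nodes() {
    awk '$2 ~ /down|offline/ {print $1}'
}
```

Possible usage: pbsnodes -l | down_nodes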

Runs one job across two machines with 8 processors each (doesn’t always work):
qsub -l nodes=2:ppn=8

If you wanted your job to run on one specific machine:
qsub -l nodes=vasp3:ppn=24 ~/g24vasp

Runs one job on the 2-processor machines (three nodes with the proc2 property):
qsub -l nodes=3:ppn=2:proc2 ~/g2vasp
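
All three qsub examples above use the same pattern for the -l value: nodes=<count or node name>:ppn=<processors per node>, optionally followed by a node property. A small sketch that builds that string (node_spec is a made-up helper, not part of Torque):

```shell
#!/bin/sh
# node_spec NODES PPN [PROPERTY]: build the resource string for "qsub -l".
#   NODES    a node count (2) or a specific node name (vasp3)
#   PPN      processors per node
#   PROPERTY optional node property such as proc2
node_spec() {
    spec="nodes=$1:ppn=$2"
    if [ -n "${3:-}" ]; then
        spec="$spec:$3"
    fi
    echo "$spec"
}
```

Possible usage: qsub -l "$(node_spec 3 2 proc2)" ~/g2vasp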

Probably kills everything:
momctl -c 'all'

Shows everything that has happened to a job (note: the job must be currently running and you must be su):
tracejob <jobid>

Code to suspend and resume a job (not sure if it works):
qsig -s suspend <jobid> [as su]
qsig -s resume <jobid>

See install nodes manual:
dynamically update nodes info:
  qmgr
  qmgr -c "set node vasp4 properties += dual"

How to kill a job:
qdel <jobid>

MAUI commands: (google them)
maui
mjobctl -s <jobid>
schedctl -k [shutdown maui]
runjob <jobid>
runjob -s <jobid> [this seems to be the best way to suspend]
setspri 10 <jobid>
sethold -b <jobid>   [can do more than 1 job]
releasehold -b <jobid>

Most likely the computer normally mounts the drive at /media/New Volume, but we wanted to mount it under /home/… so everyone could see it:
mount --bind /media/New\ Volume /home/bartels/sharedData/share2




