Center for High Performance Computing (CHPC)

Frequently Asked Questions

  • How do I log in?
    • You'll need to install an SSH client if you don't already have one. Then SSH in, specifying your username on the supercomputer, to either login1.chpc.wustl.edu or login2.chpc.wustl.edu, e.g. 'ssh username@login1.chpc.wustl.edu'.
  • How do I change my password?
    • You can use the command 'yppasswd' to change your password on either of the login nodes.
  • How can I stay informed about system changes?
    • There's a mailing list where I announce changes to the system:
      http://management.wustl.edu/mailman/listinfo/announce
      I highly recommend that all users subscribe to this mailing list.
    • Besides the mailing list, there are various message-of-the-day notes that users will see when they log in. Users are advised to pay attention to these messages.
  • What sort of jobs are appropriate for running on the login nodes?
    • Since the login nodes are shared by many users, refrain from taking up too many resources there. If a job will run for more than an hour or use more than 4 cores, it should be run on the compute nodes via a batch job.
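      A minimal batch script for such a job might look like the sketch below (the job name, resource requests, and program name are placeholders; adjust them to your needs):

      ```shell
      #!/bin/bash
      # Request 4 cores on one node for 2 hours (placeholder values).
      #PBS -N myjob
      #PBS -l nodes=1:ppn=4
      #PBS -l walltime=02:00:00

      # Run from the directory the job was submitted from.
      cd "$PBS_O_WORKDIR"
      ./my_program
      ```

      Submit it from a login node with 'qsub myjob.sh'.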
  • Why isn't my job running?
    • The best way to track down the problem is to run the command 'checkjob -v JOBID' and send the output to Malcolm (mtobias@wustl.edu).
  • It's taking forever to transfer my files.  How can I transfer them faster?
    • You should first measure your actual transfer rate and then consider what is reasonable. On a 1 Gb/s network you can theoretically get ~120 MB/s, but if you are sharing the network with other users, going through firewalls, etc., you are likely to get only a fraction of that. If you're getting 30-40 MB/s, that's probably realistic.
      • Yeah, but I've got lots of data and this will take forever!  Plus, I'm on a fast network!
        • There are always options. SSH is typically limited by the CPU cost of encryption and of the checksums used for file integrity, so you can usually get faster transfers by requesting cheaper algorithms: scp -o MACs=hmac-md5-96 -o Ciphers=arcfour /tmp/foo user@system:/tmp  (Note: recent OpenSSH releases have removed these legacy algorithms; if the command is rejected, run 'ssh -Q cipher' and 'ssh -Q mac' to see what your client supports.)
  • I'm getting messages saying 'Disk quota exceeded'. What's going on?
    • On the home directories, we maintain a 5GB soft quota and a 6GB hard quota. To see how much space you're using, cd into your home directory ('cd ~') and run 'du --si -s'. You can exceed the soft limit for a short amount of time, but you cannot exceed the hard limit. Delete enough data to get back under the 5GB quota, and consider using the scratch disk for temporary storage of large files.
    • If your disk usage is under quota, you may want to check the number of files/directories (i.e. inodes) you're using with 'find . | wc -l'.  The scratch disk has a quota of 1,000,000 inodes.  Depending on the number of files you have and the load of the disk servers, this may take a while to complete.
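      To see how that count works, the same find-based tally can be tried on a small throwaway tree (the paths below are created purely for the demonstration):

      ```shell
      # Count inodes (files + directories) the same way the quota does.
      dir=$(mktemp -d)
      mkdir "$dir/sub"
      touch "$dir/a" "$dir/b" "$dir/sub/c"
      find "$dir" | wc -l    # prints 5: the top directory, sub, and three files
      rm -rf "$dir"
      ```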
  • Why are my #PBS options being ignored?
    • Make sure that your #PBS options come before any commands in your shell script. Once the queuing system encounters a command, it stops looking for #PBS options, so any that follow are ignored.
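      As a sketch, in the script below the second resource request is silently ignored because a command has already appeared:

      ```shell
      #!/bin/bash
      # These directives are read: they come before the first command.
      #PBS -N example
      #PBS -l walltime=00:10:00

      echo "job starting"    # first real command: directive parsing stops here
      #PBS -l nodes=2        # too late -- the queuing system ignores this
      ```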
  • How can I get email notifications about my job?
    • By default, the queuing system tries to send email to your account on the login node, so the first thing you should do is set up the login nodes to forward that mail to your desired email account. This is straightforward: create a file in your home directory called .forward that contains your desired email address. Then, in your batch scripts, you can use the PBS flag '#PBS -m be': the b option sends email when the job begins running, the e option when the job ends. You can use either or both options.
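      For example (user@wustl.edu is a placeholder for your own address):

      ```shell
      # One-time setup on a login node: forward local mail.
      echo "user@wustl.edu" > ~/.forward

      # Then in each batch script: mail at (b)egin and (e)nd of the job.
      #PBS -m be
      ```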
  • Why am I not getting the job output files from the queuing system?
    • I'm getting error messages like:
      Unable to copy file /var/spool/torque/spool/329467.mgt.OU to me@login001:
      
      This is usually a symptom of the queuing system not being able to copy STDOUT/STDERR from the local disk on the compute nodes back to the login nodes. To do this, the queuing system requires SSH to be set up with a passwordless key pair. This should have been configured when your account was created, but it can be broken in a couple of ways:
    1. Check the permissions of your home directory:
      ls -l /home | grep mtobias
      drwxr-xr-x 53 mtobias admin 8192 Jun 9 08:00 mtobias

      You need to make sure your home directory (and the underlying .ssh subdirectory) are not writable by the group or by other users: SSH deliberately ignores the contents of the .ssh subdirectory if another user would be able to modify your authorized_keys file. If you need to fix the permissions of your home directory (or the underlying .ssh subdirectory), you can use the chmod command:
      chmod 755 /home/mtobias/
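      A sketch of the full set of permission fixes; tightening .ssh itself, beyond the chmod 755 shown above, is standard SSH practice rather than anything CHPC-specific:

      ```shell
      # Remove group/other write so SSH will trust the key files.
      chmod 755 ~
      chmod 700 ~/.ssh
      chmod 600 ~/.ssh/authorized_keys
      ```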
    2. Check whether your SSH key-pair exists:
      cd ~/.ssh/
      ls
      authorized_keys id_rsa id_rsa.pub known_hosts

      You should see your private key (id_rsa), your public key (id_rsa.pub) and a copy of your public key in the authorized_keys file.

      If you need to restore your SSH key-pair, you can use the following procedure:

      ssh-keygen    (accept all of the defaults, leaving the passphrase empty so the key works without a password prompt)
      cd ~/.ssh
      cat id_rsa.pub > authorized_keys

      If everything is working correctly, you should be able to SSH between the two login nodes without being prompted for a password.
  • Why am I having trouble running my Java application on the compute nodes?
    • For several reasons (notably poor performance and the many different implementations), we don't officially support Java on the compute nodes. However, many users have been able to run Java successfully on their own. The most common method is to install a private copy of Java in their home directories, then either modify the PATH environment variable or use the absolute path to that java in their batch scripts.
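      A sketch of the PATH approach for a batch script, assuming the JDK was unpacked to ~/jdk (a hypothetical path; substitute wherever you installed it):

      ```shell
      # Put the private JDK's bin directory first on PATH so its java
      # shadows any system copy.
      export PATH="$HOME/jdk/bin:$PATH"
      java -version
      ```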