#

Cluster Computing

Cannon Architecture

Description 

Researchers can take advantage of the scale of the cluster by setting up workflows to split many different tasks into large batches, which are scheduled across the cluster at the same time.  Most clusters are made from commodity hardware, so do note that the performance of an individual computation may not be any faster than a local new workstation/laptop with new CPU and flash (SSD) storage; cluster computing is about scale.  Doing computations at scale allows a researcher to test many different variables at once, thereby shorter time to outcomes, and also provides the ability to ask larger, more complex problems (I.e. larger data sets, longer simulation time, more degrees of freedom, etc.). 

Key Features and Benefits  

Definitions:  

Access/Login Node: General access to the cluster is provided via ssh (secure shell, SHA-256 or better encryption) to a set of login servers, which is the staging point for submitting jobs.  Most of the time these are on the public network with multi-factor authentication.  These hosts are not designed for compute intensive tasks as many users (100s) are logged in simultaneously.  As well, graphical or web-based login may be provided (e.g. Open OnDemand, NoMachine, or Science Gateway like Galaxy).  

Jobs: Discrete units of computational tasks are sent to a job scheduler, normally defined within a job script. 

Compute Nodes: In a cluster, these servers are dedicated to performing computations, as opposed to storing data or databases.  They normally contain a larger number of cores and memory in comparison to a workstation/laptop, and have a high-performance network (10 Gbps or more) connecting them together.  They may also have specialized hardware to off-load computations like Graphical Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs). 

Batch Processing/Queuing System: In order to maximize the utilization of the cluster hardware, jobs are submitted in batches to the queuing system (aka “the scheduler”).  This system manages the state of all the resources on the compute nodes, schedules jobs based upon requirements provided by the user, and tracks usage.  There are a set of commands that users use to submit jobs, cancel jobs, query current state of jobs, and query the historical usage of jobs. 

Scheduling Algorithm:  A scheduler is used to disperse jobs on the cluster in some fair and meaningful way when there is contention for resources. There are two main scheduling algorithms that most schedulers use.  FIFO, which is first in first out, which is first come first serve, and is good for smaller, more homogenous user communities.  Fairshare, is designed to give each grouping (whether lab group or individual user) equal access to resources over time.  Therefore, this uses the historical usage to gauge how much resources to allocate.  In Fairshare, each user has an allocated amount of shares of the whole cluster, which they use when they run jobs and are based upon the amount of resources they use for a given time.  Their shares accumulate back over time. 

Job Status:  Jobs go through various states during the course of using the queuing system.  Upon submission, they are pending until allocated resources to run, then running, then completed.  If the job did not complete successfully it would have a failed state, or it could have been terminated by user or admin and have the state canceled.  A job in the suspended state is paused, whereby the process is stopped, and the memory remains allocated, and this job could be resumed.  These states are displayed when a user queries the status of a job. 

Job Priority: Each job is assigned a priority in the queue to order them in line for next dispatch.  A job with high priority would be at the top of the list. The priority value normally is made up from the sum of values that include Fairshare, size of job (cores, memory, time limit), queue, and age of pending job. 

Queues/Partitions: These are a logical grouping of computing resources that could be grouped for use of similar hardware or to separate accesses.  A job must select a queue to submit to (or the default will be automatically selected).  Queues can be set up to allow for compute nodes to be shared among multiple jobs or dedicated to single jobs in an exclusive manner.  Also, queues can have a relative priority for jobs to be scheduled from. 

Resources:  Each queuing system requires that you specify resources for each job (or it uses the defaults).  These basic requirements are the number of processing cores, amount of memory, and time/duration of the job. 

Parallel Processing: There are three basic modes of doing computations in parallel. 1) Tightly coupled, where a collection of processes on different nodes are working in tandem on the same task passing information through the network usually via MPI (message passing interface) with RDMA (remote direct memory access) to each other's process memory segment. This relies on a low-latency network and run-time libraries to be added to the code. 2) Threaded, where a collection of processes on a single node are sharing the same memory space via OpenMP (or parallel in python or doParallel in R). 3) Loosely coupled, where the tasks are largely independent and controlled by an external script to submit tasks in a loop and sometimes analyze the output and resubmit the next iteration.  In this sense, it is pleasantly parallel, as it doesn’t require updating the code with MPI or OpenMP library calls, but just orchestrates the tasks remotely. 

Interactive Sessions:  Not all computations need be batched with job scripts, a user can submit a job that allows direct access to a compute node.  This is great for compiling code, debugging workflows, or running short calculations interactively.  As well, most clusters provide an interface that allows for a Graphical User Interface to be displayed over the network (I.e. X11 forwarding, VNC, Web socket, etc.). Interactive sessions are not meant to be long running, as the interactive nature would require input from the user and if the user isn’t present, then the resources will go unused for some time, which defeats the purpose of batch scheduling.  

 

Service Expectations and Limits: 

Cluster-computing is not intended to be a larger version of a single machine. Scaling computations to take advantage of a cluster may require changes to how the computation is configured and executed.  As cluster-computing is a shared resource, many factors affect performance including other computations on the same compute nodes competing for memory, storage, and network. In addition, due to the nature of research that is being performed, it isn’t always well understood by the end user how scaling out their computations on a cluster will differ from performing them on a single device in terms of resource requirements (cores, memory, time, data storage), as these are managed in a different way on a single device.  It is always best for researchers to test smaller data sets and batches of tasks before submitting hundreds or thousands. 

At FASRC, availability, uptime, and backup schedule is provided as best effort with staff that do not have rotating 24/7/365 shifts.   

 

Available to: 

All PIs with an active FASRC account from any Harvard School.  For more information see Account Requests

 

Service manager and Owner:  

Service Manager: Raminder Singh, Associate Director of Data Science and Research Facilitation

Service Owner: Scott Yockel, Director of FAS Research Computing 

All service requests should be sent via email to rchelp@rc.fas.harvard.edu  

 

Offerings (Tiers of Service) 

Cluster Name: Cannon 

Queuing System: SLURM 

Shared:  Resources are provided by FASRC and are distributed evenly amongst the various lab groups. 

  • Access: All PI groups. 
  • Scheduling: Fairshare, see https://www.rc.fas.harvard.edu/fairshare/ for more details. 
  • Queue/Partition name: shared, general, bigmem, test 
  • Features: 7-day maximum time limit, non-exclusive nodes, multi-node parallel 
  • Cost: $0.0184 per CPU Hour, $0.0661 per GPU Hour.  See Billing FAQ for more details.

Dedicated: Resources are provided by Centers, Departments, or PIs 

  • Access: Restricted PI groups from Center/Department/Lab. 
  • Scheduling: Fairshare, see https://www.rc.fas.harvard.edu/fairshare/ for more details. 
  • Queue/Partition name: pi_name or center name (e.g. huce) 
  • Features: unlimited maximum time limit, non-exclusive nodes, multi-node parallel 
  • Cost: $0.0029 per CPU Hour.  See Billing FAQ for more details.

Backfill: Users have the ability to utilize unused parts of the dedicated queues. 

  • Access: All PI groups. 
  • Scheduling: Fairshare, see https://www.rc.fas.harvard.edu/fairshare/ for more details. 
  • Queue/Partition name: serial_requeue, gpu_requeue 
  • Features: 3-day time limit, preemption (requeue), non-exclusive nodes, 8 cores max per job, single node 
  • Cost: $0.0029 per CPU Hour.  See Billing FAQ for more details.