Agenda
The 2011 SLURM User Group Meeting will be held on September 22 and 23 in Phoenix, Arizona, and will be hosted by Bull. On September 22 there will be two parallel tracks of tutorials meeting in separate rooms: one for users and the other for system administrators. There will be a series of technical presentations on September 23. The schedule and abstracts are shown below.
Schedule
September 22: User Tutorials.
Time | Theme | Speaker | Title |
---|---|---|---|
08:30 - 09:00 | Registration | ||
09:00 - 10:30 | User Tutorial #1 | Don Albert and Rod Schultz (Bull) | SLURM: Beginners Usage |
10:30 - 11:00 | Coffee break | ||
11:00 - 12:30 | User Tutorial #2 | Bill Brophy, Rod Schultz, Yiannis Georgiou (Bull) | SLURM: Advanced Usage |
12:30 - 14:00 | Lunch at conference center | ||
14:00 - 15:30 | User Tutorial #3 | Martin Perry and Yiannis Georgiou (Bull) | Resource Management for multicore/multi-threaded usage |
15:30 - 16:00 | Coffee break | ||
16:00 - 17:00 | Question and Answer | Danny Auble and Morris Jette (SchedMD) | Get your questions answered by the developers |
September 22: System Administrator Tutorials.
Time | Theme | Speaker | Title |
---|---|---|---|
08:30 - 09:00 | Registration | ||
09:00 - 10:30 | Admin Tutorial #1 | David Egolf and Bill Brophy (Bull) | SLURM High Availability |
10:30 - 11:00 | Coffee break | ||
11:00 - 12:30 | Admin Tutorial #2 | Dan Rusak (Bull) | Power Management / sview |
12:30 - 14:00 | Lunch at conference center | ||
14:00 - 15:30 | Admin Tutorial #3 | Don Albert and Rod Schultz (Bull) | Accounting, limits and Priorities configurations |
15:30 - 16:00 | Coffee break | ||
16:00 - 17:30 | Admin Tutorial #4 | Matthieu Hautreux (CEA), Yiannis Georgiou and Martin Perry (Bull) | Scalability, Scheduling and Task placement |
September 23: Technical Session
Time | Theme | Speaker | Title |
---|---|---|---|
08:30 - 09:00 | Registration | ||
09:00 - 10:30 | Welcome | ||
09:00 - 10:30 | Keynote | TBD | TBD |
09:00 - 10:30 | Session #1 | Matthieu Hautreux (CEA) | SLURM at CEA |
09:00 - 10:30 | Session #2 | Don Lipari (LLNL) | LLNL site report |
10:30 - 11:00 | Coffee break | ||
11:00 - 12:30 | Session #3 | Alejandro Lucero Palau (BSC) | SLURM Simulator |
11:00 - 12:30 | Session #4 | Danny Auble (SchedMD) | SLURM operation on IBM BlueGene/Q |
11:00 - 12:30 | Session #5 | Morris Jette (SchedMD) | SLURM operation on Cray XT and XE |
12:30 - 14:00 | Lunch at conference center | ||
14:00 - 15:30 | Session #6 | Don Lipari (LLNL) | Proposed Design for Enhanced Enterprise-wide Scheduling |
14:00 - 15:30 | Session #7 | Morris Jette (SchedMD) | Proposed Design for Job Step Management in User Space |
14:00 - 15:30 | Session #8 | Danny Auble and Morris Jette (SchedMD) | SLURM Version 2.3 and plans for future releases |
15:30 - 16:00 | Coffee break | ||
16:00 - 17:30 | Open discussion, feature requests, etc. |
Abstracts
User Tutorial #1
SLURM Beginners Usage
Don Albert and Rod Schultz (Bull)
- Simple use of commands (submission/monitoring/result collection)
- Reservations
- Use of accounting and reporting
- Scheduling techniques for shorter response time (setting of walltime for backfill, etc.)
User Tutorial #2
SLURM Advanced Usage
Bill Brophy, Rod Schultz, Yiannis Georgiou (Bull)
- MPI jobs
- Checkpoint/Restart (BLCR or application level)
- Preemption / Gang Scheduling Usage
- Dynamic allocations (growing/shrinking)
- Grace Time Delay with Preemption
User Tutorial #3
Resource Management for multicore/multi-threaded usage
Martin Perry and Yiannis Georgiou (Bull)
- CPU allocation
- CPU/tasks distribution
- Task binding
- Internals of the allocation procedures
Administrator Tutorial #1
SLURM High Availability
David Egolf and Bill Brophy (Bull)
- How to set up the High Availability SLURM
- Event logging with striggers
Administrator Tutorial #2
Power Management / sview
Dan Rusak (Bull)
- Power Management configuration
- sview presentation
Administrator Tutorial #3
Accounting, limits and Priorities configurations
Don Albert and Rod Schultz (Bull)
- Accounting with slurmdbd configuration
- Multifactor job priorities with examples considering all different factors
- QOS configuration
- Fairsharing setting
Administrator Tutorial #4
Scalability, Scheduling and Task placement
Matthieu Hautreux (CEA), Yiannis Georgiou and Martin Perry (Bull)
- High Throughput Computing
- Topology constraints config
- Generic Resources and GPUs config
- Task Placement with Cgroups
Keynote
Speaker and content to be determined.
Session #1
CEA Site report
Matthieu Hautreux (CEA)
Evolutions and feedback from Tera100. SLURM on Curie, the second PRACE Tier-0 system, which is planned for installation by the end of the year in a new facility hosted at CEA. Curie will be a 1.6 Petaflop system from Bull.
Session #2
LLNL site report
Don Lipari (LLNL)
Don Lipari will provide an overview of the batch scheduling systems in use at LLNL and of how they are managed.
Session #3
SLURM Simulator
Alejandro Lucero Palau (BSC)
Batch scheduling for high performance cluster installations has two main goals: 1) to keep the whole machine working at full capacity at all times, and 2) to respect priorities, preventing lower priority jobs from jeopardizing higher priority ones. Usually, batch schedulers allow different policies, each with several variables to be tuned. Other features such as special job requests, reservations or job preemption increase the complexity of achieving a fine-tuned algorithm. A local decision for a specific job can change the full scheduling for a large number of jobs, and what seems logical in the short term could make no sense for a long trace measured in weeks or months. Although it is possible to extract algorithms from batch scheduling software to simulate large job traces, this is not the ideal approach, since scheduling is not an isolated part of this type of tool and replicating the same environment requires significant effort plus a high maintenance cost. We present a method for obtaining a special mode of operation for real, production-ready scheduling software, SLURM, in which we can simulate execution of real job traces to evaluate the impact of scheduling policies and policy tuning.
Session #4
SLURM Operation on IBM BlueGene/Q
Danny Auble (SchedMD)
SLURM version 2.3 supports IBM BlueGene/Q. This presentation will report on the design and operation of SLURM with respect to BlueGene/Q systems.
Session #5
SLURM Operation on Cray XT and XE systems
Morris Jette (SchedMD)
SLURM version 2.3 supports Cray XT and XE systems running over Cray's ALPS (Application Level Placement Scheduler) resource manager. This presentation will discuss the design and operation of SLURM with respect to Cray systems.
Session #6
Proposed Design for Enhanced Enterprise-wide Scheduling
Don Lipari (LLNL)
SLURM currently supports the ability to submit jobs and query their status between computers at a site; however, the current design has some limitations. When a job is submitted with several possible computers usable for its execution, the job is routed to the computer on which it is expected to start earliest. After changes in the workload or system failures, moving the job to another computer could result in faster initiation, but that is currently impossible. SLURM is also unable to support dependencies between jobs executing on different computers. The design of a SLURM meta-scheduler with enhanced enterprise-wide scheduling capabilities will be presented.
Session #7
Proposed Design for Job Step Management in User Space
Morris Jette (SchedMD)
SLURM currently creates and manages job steps using SLURM's control daemon, slurmctld. Since some user jobs create thousands of job steps, the management of those job steps accounts for most of slurmctld's work. It is possible to move job step management from slurmctld into user space to improve SLURM scalability and performance. A possible implementation of this will be presented.
Session #8
Contents of SLURM Version 2.3 and plans for future releases
Danny Auble and Morris Jette (SchedMD)
An overview of the changes in SLURM Version 2.3 will be presented along with current plans for future releases.