Date |
Meeting |
Material covered |
Relevant reading |
9/11 |
Lecture 1 |
Course overview; intro to types of problems |
slides |
9/13 |
Lecture 2 |
Discuss P & G video; how to read research; research process;
basic topologies, routing, and allocation |
|
9/15 |
Lecture 3 |
Discuss Dongarra video and "Multi-toroidal interconnects..."
(aridor04) |
slides |
9/18 |
Lecture 4 |
Discuss "Technology-driven, highly-scalable Dragonfly topology"
(kim08; Adam and Alex) and "Processor allocation on CPlant..."
(leung02a; me) |
Dragonfly slides;
CPlant slides |
9/20 |
Lecture 5 |
Discuss "Comparing global link arrangements..." (hastings15; Mercy) and
recent work on the difference between them (Ridham and Pedro) |
comparing slides;
recent work slides |
9/22 |
Lecture 6 |
Discuss "Utilization, predictability, workloads, ..." (mualem01;
Saverio and Tim)
and "Backfilling with guarantees granted ..." (lindsay11; Lizzy and Oscar) |
Utilization slides;
backfilling slides |
9/25 |
No class |
9/27 |
Lecture 7 |
Discuss "Universal networks for hardware-efficient supercomputing"
(leiserson85; Eddie and James) and
"Slim Fly: ..." (besta14; Thy) |
universal slides;
Slim Fly slides |
9/29 |
Lecture 8 |
Discuss "A case for random shortcut topologies..." (koibuchi12;
An and Rosie) and some related work (me and Hassan) |
Random topology slides;
related work slides |
10/2 |
Lecture 10 |
Discuss "Entering the petaflop era: ..." (barker08; Jessie and
Carlos) and "MapReduce: ..." (dean04; Khue and Izn) |
entering slides;
MapReduce slides |
10/4 |
Lecture 11 |
Discuss "Variations of conservative backfilling to improve
fairness" (rajbhandary13; Lizzy and Oscar) and "Asynchronous execution of heterogeneous tasks in
ML-driven HPC workflows" (pascuzzi23; me) |
variations slides;
asynchronous slides |
10/6 |
Lecture 12 |
Discuss "AI-Job scheduling on systems with renewable power
sources" (nileshwar22; Jessie and Thy) and "Re-making the movie-making machine" (vanns22; Tim) |
AI-Job slides;
Re-making slides |
10/9 |
Lecture 13 |
Discuss "HammingMesh: A Network Topology for Large-Scale Deep
Learning" (hoefler22; Mercy) and "Encoding for Reinforcement
Learning Driven Scheduling" (li22; Alex and Adam) |
HammingMesh slides;
encoding slides |
10/11 |
Lecture 14 |
Discuss "Noise in the Clouds: Influence of Network Performance
Variability on Application Scalability" (desensi22; James and
Eddie) and "Chic-Sched: a HPC Placement-Group Scheduler on
Hierarchical Topologies with Constraints" (schares23; Saverio) |
noise slides;
chic-sched slides |
10/13 |
Lecture 15 |
Discuss "DRAS: Deep Reinforcement Learning for Cluster
Scheduling in High Performance Computing" (fan22; An and Rosie) and "The case of
performance variability on Dragonfly-based systems" (bhatele20; Carlos) |
DRAS slides;
variability slides |
10/16 |
Lecture 16 |
Discuss "Resource utilization aware job scheduling to mitigate
performance variability" (nichols22; Ridham and Pedro) and "Spark:
Cluster Computing with Working Sets" (zaharia10; Khue and Izn) |
resource slides |
10/18 |
Fall Institute Day |
10/20 |
No class |
10/23 |
No class |
10/25 |
Lecture 17 |
Using PReMAS (meet in lab) |
|
10/27 |
Lecture 18 |
Modifying PReMAS (meet in lab) |
|
10/30 |
Lecture 19 |
Discuss "Improving Valiant routing for Slim Fly networks" (han17;
me) and "Neural termination analysis" (giacobbe22; Tim L.) |
improving slides |
11/1 |
Lecture 20 |
Discuss "Sparse Hamming Graph: A Customizable Network-on-Chip
Topology" (iff23; Jessie and Mercy) and "Preemptive Parallel Job
Scheduling for Heterogeneous Systems Supporting Urgent Computing"
(agung21; Lizzy and Oscar) |
sparse slides;
preemptive slides |
11/3 |
Lecture 21 |
Discuss "A High-Performance Design, Implementation, Deployment,
and Evaluation of The Slim Fly Network" (blach23; Eddie) and
"CoTrain: Efficient Scheduling for Large-Model Training upon GPU and
CPU in Parallel" (li23; Saverio and Carlos) |
Slim Fly slides;
cotrain slides |
11/6 |
Lecture 22 |
Discuss "ElastiSim: A Batch-System Simulator for Malleable
Workloads" (ozden22; James and Khue) and "NCC: Neighbor-aware
Congestion Control based on Reinforcement Learning for Datacenter
Networks" (wang22; Alex and Adam) |
ElastiSim slides |
11/8 |
Lecture 23 |
Discuss "FatPaths: Routing in Supercomputers and Data Centers when
Shortest Paths Fall Short" (besta20; An and Thy) and "Analyzing and
Adjusting User Runtime Estimates to Improve Job Scheduling on the
Blue Gene/P" (tang10; Hassan and Izn) |
|
11/10 |
Lecture 24 |
Discuss "High-Radix On-chip Networks with Low-Radix Routers"
(jain14b; Ridham and Pedro) and "Topology-custom UGAL routing on
dragonfly" (rahman19; Rosie and Hassan) |
high-radix slides;
UGAL slides |
11/13 |
No class |
11/15 |
Reading Day |
11/16 |
Reading Day |