Lectures

| Date | Meeting | Material covered | Relevant reading |
| 9/11 | Lecture 1 | Course overview; intro to types of problems | slides |
| 9/13 | Lecture 2 | Discussing P & G video; how to read research; research process; basic topologies, routing, and allocation | |
| 9/15 | Lecture 3 | Discuss Dongarra video and "Multi-toroidal interconnects..." (aridor04) | slides |
| 9/18 | Lecture 4 | Discuss "Technology-driven, highly-scalable Dragonfly topology" (kim08; Adam and Alex) and "Processor allocation on CPlant..." (leung02a; me) | Dragonfly slides; CPlant slides |
| 9/20 | Lecture 5 | Discuss "Comparing global link arrangements..." (hastings15; Mercy) and recent work on the difference between them (Ridham and Pedro) | comparing slides; recent work slides |
| 9/22 | Lecture 6 | Discuss "Utilization, predictability, workloads, ..." (mualem01; Saverio and Tim) and "Backfilling with guarantees granted ..." (lindsay11; Lizzy and Oscar) | Utilization slides; backfilling slides |
| 9/25 | No class | | |
| 9/27 | Lecture 7 | Discuss "Universal networks for hardware-efficient supercomputing" (leiserson85; Eddie and James) and "Slim Fly: ..." (besta14; Thy) | universal slides; Slim Fly slides |
| 9/29 | Lecture 8 | Discuss "A case for random shortcut topologies..." (koibuchi12; An and Rosie) and some related work (me and Hassan) | Random topology slides; related work slides |
| 10/2 | Lecture 10 | Discuss "Entering the petaflop era: ..." (barker08; Jessie and Carlos) and "MapReduce: ..." (dean04; Khue and Izn) | entering slides; MapReduce slides |
| 10/4 | Lecture 11 | Discuss "Variations of conservative backfilling to improve fairness" (rajbhandary13; Lizzy and Oscar) and "Asynchronous execution of heterogeneous tasks in ML-driven HPC workflows" (pascuzzi23; me) | variations slides; asynchronous slides |
| 10/6 | Lecture 12 | Discuss "AI-Job scheduling on systems with renewable power sources" (nileshwar22; Jessie and Thy) and "Re-making the movie-making machine" (vanns22; Tim) | AI-Job slides; Re-making slides |
| 10/9 | Lecture 13 | Discuss "HammingMesh: A Network Topology for Large-Scale Deep Learning" (hoefler22; Mercy) and "Encoding for Reinforcement Learning Driven Scheduling" (li22; Alex and Adam) | HammingMesh slides; encoding slides |
| 10/11 | Lecture 14 | Discuss "Noise in the Clouds: Influence of Network Performance Variability on Application Scalability" (desensi22; James and Eddie) and "Chic-Sched: a HPC Placement-Group Scheduler on Hierarchical Topologies with Constraints" (schares23; Saverio) | noise slides; chic-sched slides |
| 10/13 | Lecture 15 | Discuss "DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing" (fan22; An and Rosie) and "The case of performance variability on Dragonfly-based systems" (bhatele20; Carlos) | DRAS slides; variability slides |
| 10/16 | Lecture 16 | Discuss "Resource utilization aware job scheduling to mitigate performance variability" (nichols22; Ridham and Pedro) and "Spark: Cluster Computing with Working Sets" (zaharia10; Khue and Izn) | resource slides |
| 10/18 | Fall Institute Day | | |
| 10/20 | No class | | |
| 10/23 | No class | | |
| 10/25 | Lecture 17 | Using PReMAS (meet in lab) | |
| 10/27 | Lecture 18 | Modifying PReMAS (meet in lab) | |
| 10/30 | Lecture 19 | Discuss "Improving Valiant routing for Slim Fly networks" (han17; me) and "Neural termination analysis" (giacobbe22; Tim L.) | improving slides |
| 11/1 | Lecture 20 | Discuss "Sparse Hamming Graph: A Customizable Network-on-Chip Topology" (iff23; Jessie and Mercy) and "Preemptive Parallel Job Scheduling for Heterogeneous Systems Supporting Urgent Computing" (agung21; Lizzy and Oscar) | sparse slides; preemptive slides |
| 11/3 | Lecture 21 | Discuss "A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network" (blach23; Eddie) and "CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel" (li23; Saverio and Carlos) | slimfly slides; cotrain slides |
| 11/6 | Lecture 22 | Discuss "ElastiSim: A Batch-System Simulator for Malleable Workloads" (ozden22; James and Khue) and "NCC: Neighbor-aware Congestion Control based on Reinforcement Learning for Datacenter Networks" (wang22; Alex and Adam) | ElastiSim slides |
| 11/8 | Lecture 23 | Discuss "FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short" (besta20; An and Thy) and "Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on the Blue Gene/P" (tang10; Hassan and Izn) | |
| 11/10 | Lecture 24 | Discuss "High-Radix On-chip Networks with Low-Radix Routers" (jain14b; Ridham and Pedro) and "Topology-custom UGAL routing on dragonfly" (rahman19; Rosie and Hassan) | high-radix slides; UGAL slides |
| 11/13 | No class | | |
| 11/15 | Reading Day | | |
| 11/16 | Reading Day | | |