| Date | Meeting | Material covered | Relevant reading |
| 9/11 | Lecture 1 | Course overview; intro to types of problems | slides |
| 9/13 | Lecture 2 | Discuss P & G video; how to read research; research process; basic topologies, routing, and allocation | |
| 9/15 | Lecture 3 | Discuss Dongarra video and "Multi-toroidal interconnects..." (aridor04) | slides |
| 9/18 | Lecture 4 | Discuss "Technology-driven, highly-scalable Dragonfly topology" (kim08; Adam and Alex) and "Processor allocation on CPlant..." (leung02a; me) | Dragonfly slides; CPlant slides |
| 9/20 | Lecture 5 | Discuss "Comparing global link arrangements..." (hastings15; Mercy) and recent work on the difference between them (Ridham and Pedro) | comparing slides; recent work slides |
| 9/22 | Lecture 6 | Discuss "Utilization, predictability, workloads, ..." (mualem01; Saverio and Tim) and "Backfilling with guarantees granted ..." (lindsay11; Lizzy and Oscar) | Utilization slides; backfilling slides |
| 9/25 | No class | ||
| 9/27 | Lecture 7 | Discuss "Universal networks for hardware-efficient supercomputing" (leiserson85; Eddie and James) and "Slim Fly: ..." (besta14; Thy) | universal slides; Slim Fly slides |
| 9/29 | Lecture 8 | Discuss "A case for random shortcut topologies..." (koibuchi12; An and Rosie) and some related work (me and Hassan) | Random topology slides; related work slides |
| 10/2 | Lecture 10 | Discuss "Entering the petaflop era: ..." (barker08; Jessie and Carlos) and "MapReduce: ..." (dean04; Khue and Izn) | entering slides; MapReduce slides |
| 10/4 | Lecture 11 | Discuss "Variations of conservative backfilling to improve fairness" (rajbhandary13; Lizzy and Oscar) and "Asynchronous execution of heterogeneous tasks in ML-driven HPC workflows" (pascuzzi23; me) | variations slides; asynchronous slides |
| 10/6 | Lecture 12 | Discuss "AI-Job scheduling on systems with renewable power sources" (nileshwar22; Jessie and Thy) and "Re-making the movie-making machine" (vanns22; Tim) | AI-Job slides; Re-making slides |
| 10/9 | Lecture 13 | Discuss "HammingMesh: A Network Topology for Large-Scale Deep Learning" (hoefler22; Mercy) and "Encoding for Reinforcement Learning Driven Scheduling" (li22; Alex and Adam) | HammingMesh slides; encoding slides |
| 10/11 | Lecture 14 | Discuss "Noise in the Clouds: Influence of Network Performance Variability on Application Scalability" (desensi22; James and Eddie) and "Chic-Sched: a HPC Placement-Group Scheduler on Hierarchical Topologies with Constraints" (schares23; Saverio) | noise slides; chic-sched slides |
| 10/13 | Lecture 15 | Discuss "DRAS: Deep Reinforcement Learning for Cluster Scheduling in High Performance Computing" (fan22; An and Rosie) and "The case of performance variability on Dragonfly-based systems" (bhatele20; Carlos) | DRAS slides; variability slides |
| 10/16 | Lecture 16 | Discuss "Resource utilization aware job scheduling to mitigate performance variability" (nichols22; Ridham and Pedro) and "Spark: Cluster Computing with Working Sets" (zaharia10; Khue and Izn) | resource slides |
| 10/18 | Fall Institute Day | ||
| 10/20 | No class | ||
| 10/23 | No class | ||
| 10/25 | Lecture 17 | Using PReMAS (meet in lab) | |
| 10/27 | Lecture 18 | Modifying PReMAS (meet in lab) | |
| 10/30 | Lecture 19 | Discuss "Improving Valiant routing for Slim Fly networks" (han17; me) and "Neural termination analysis" (giacobbe22; Tim L.) | improving slides |
| 11/1 | Lecture 20 | Discuss "Sparse Hamming Graph: A Customizable Network-on-Chip Topology" (iff23; Jessie and Mercy) and "Preemptive Parallel Job Scheduling for Heterogeneous Systems Supporting Urgent Computing" (agung21; Lizzy and Oscar) | sparse slides; preemptive slides |
| 11/3 | Lecture 21 | Discuss "A High-Performance Design, Implementation, Deployment, and Evaluation of The Slim Fly Network" (blach23; Eddie) and "CoTrain: Efficient Scheduling for Large-Model Training upon GPU and CPU in Parallel" (li23; Saverio and Carlos) | slimfly slides; cotrain slides |
| 11/6 | Lecture 22 | Discuss "ElastiSim: A Batch-System Simulator for Malleable Workloads" (ozden22; James and Khue) and "NCC: Neighbor-aware Congestion Control based on Reinforcement Learning for Datacenter Networks" (wang22; Alex and Adam) | ElastiSim slides |
| 11/8 | Lecture 23 | Discuss "FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short" (besta20; An and Thy) and "Analyzing and Adjusting User Runtime Estimates to Improve Job Scheduling on the Blue Gene/P" (tang10; Hassan and Izn) | |
| 11/10 | Lecture 24 | Discuss "High-Radix On-chip Networks with Low-Radix Routers" (jain14b; Ridham and Pedro) and "Topology-custom UGAL routing on dragonfly" (rahman19; Rosie and Hassan) | high-radix slides; UGAL slides |
| 11/13 | No class | ||
| 11/15 | Reading Day | ||
| 11/16 | Reading Day | ||