Monthly Archives: March 2013

CUDA Programming Courses

GPU/CUDA Programming for High Performance Computing
(in Mandarin, Spring 2013)
Total number of lectures: 18 (3 hours per week)
Programming Assignments: 4

This course is concerned with programming GPUs for general-purpose high performance computing (not for graphics). GPUs have evolved from supporting graphics to providing a computing engine for high performance computing. The Tianhe‐1A, ranked the world’s fastest computer system in November 2010, achieves its performance (2.507 petaflops) through the use of about 7,000 GPUs. Many clusters and computer systems are being designed to incorporate GPUs into their compute nodes to achieve orders of magnitude speed improvements. In this course, we will learn how to program such systems. The platform can be either a Windows or a Linux system: we will learn how to use Windows systems in a departmental computing lab that have GPUs and the appropriate software installed, and also a departmental Linux server that has a high performance 100‐core GPU installed. Tentative topics will include:

–History of GPUs leading to their use and design for HPC
–Introduction to the GPU programming model and CUDA, host and device memories
–Basic CUDA program structure, kernel calls, threads, blocks, grid, thread addressing, predefined variables, example code: vector and matrix addition, matrix multiplication
–Using Windows and Linux environments to compile and execute simple CUDA programs.
–Measuring execution time
–Host synchronization
–Routines called from device.
–Incorporating graphical output.
–Global barrier synchronization.
–Coalesced global memory access
–Shared memory and constant memory usage
–Critical sections and atomics. Example use: counter and histogram programs
–CUDA streams
–Pinned memory, zero copy memory, multiple GPUs, portable pinned memory
–Optimizing performance using knowledge of warps and other characteristics of GPUs, overlapping computations, effects of control flow
–Parallel algorithms suitable for GPUs, parallel sorting
–Building complex applications, debugging tools
–Hybrid programming incorporating OpenMP and/or MPI with CUDA, GPU clusters, distributed clusters, …
–Possible advanced materials: texture memory, using GPU also for graphics
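
As an illustration of the "basic CUDA program structure" topic above (kernel calls, threads, blocks, grid, thread addressing, predefined variables), here is a minimal vector addition sketch. It is not taken from the course materials: the names `vecAdd`, `vecAddMaxError` and `CHECK`, the block size of 256, and the test data are all assumptions chosen for illustration.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    // Global index built from the predefined variables.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: the last block may be only partly used
        c[i] = a[i] + b[i];
}

// Hypothetical driver: adds two n-element vectors on the GPU and returns
// the largest absolute difference from the same sum computed on the host.
float vecAddMaxError(int n)
{
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    std::vector<float> h_a(n), h_b(n), h_c(n);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f * i; h_b[i] = 2.0f * i; }

    // Allocate device (GPU) memory and copy the inputs from host memory.
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    CHECK(cudaMalloc((void **)&d_a, bytes));
    CHECK(cudaMalloc((void **)&d_b, bytes));
    CHECK(cudaMalloc((void **)&d_c, bytes));
    CHECK(cudaMemcpy(d_a, h_a.data(), bytes, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b.data(), bytes, cudaMemcpyHostToDevice));

    // Launch a 1-D grid with enough blocks to cover all n elements.
    int threadsPerBlock = 256;  // assumed block size
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);
    CHECK(cudaGetLastError());

    // Copy the result back; this blocking copy also waits for the kernel.
    CHECK(cudaMemcpy(h_c.data(), d_c, bytes, cudaMemcpyDeviceToHost));
    CHECK(cudaFree(d_a));
    CHECK(cudaFree(d_b));
    CHECK(cudaFree(d_c));

    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i)
        maxErr = std::fmax(maxErr, std::fabs(h_c[i] - (h_a[i] + h_b[i])));
    return maxErr;
}

int main()
{
    printf("max error = %g\n", vecAddMaxError(1 << 20));
    return 0;
}
```

Assuming the file is saved as `vecadd.cu` (a hypothetical name), it can be compiled with `nvcc vecadd.cu -o vecadd` on either Windows or Linux with the CUDA toolkit installed.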

Parallel Programming
(in English, Autumn 2013)
Total number of lectures: 18 (3 hours per week)
Number of assignments: 2

This course is planned and developed for graduate students. As multicore CPUs and many-core GPUs become ever more popular, parallel computing platforms are increasingly easy to find. This course intends to cover multicore CPU and CUDA architectures, and will introduce OpenMP, MPI, CUDA and OpenCL with examples. Students will be given opportunities to acquire hands-on programming experience. NVIDIA CUDA and OpenCL will be used to learn GPU programming on NVIDIA and ATI GPUs, and OpenMP and MPI to explore the computational power of multicore CPU clusters. Tentative topics will include:

–Study multicore CPU and GPU architectures
–Study network topologies
–Learn how to write parallel programs using OpenMP, MPI, OpenCL and CUDA
–Study the issues that influence the speedup and efficiency of parallel programs
–Study some parallel algorithms, such as sorting, image processing, graph algorithms, and numerical computation
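
For the speedup and efficiency topic above, the syllabus does not say which definitions the course uses. The usual textbook ones are assumed here, with $T_1$ the run time on one processor, $T_p$ the run time on $p$ processors, and $f$ the fraction of the work that is inherently serial:

```latex
S(p) = \frac{T_1}{T_p}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
S(p) \le \frac{1}{f + \frac{1 - f}{p}} \quad \text{(Amdahl's law)}
```

For example, with $f = 0.1$ and $p = 8$ the bound gives $S(8) \le 1 / (0.1 + 0.9/8) \approx 4.71$, that is, an efficiency of about $0.59$.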

1) Barry Wilkinson and Michael Allen, “Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers”, 2nd Edition, Prentice Hall
2) Michael J. Quinn, “Parallel Programming in C with MPI and OpenMP”, McGraw-Hill
3) Jason Sanders and Edward Kandrot, “CUDA by Example: An Introduction to General-Purpose GPU Programming”, Addison-Wesley Professional, 2010
4) David B. Kirk and Wen‐mei W. Hwu, “Programming Massively Parallel Processors: A Hands‐on Approach”, Morgan Kaufmann, 2010
5) Wen‐mei W. Hwu (Editor in Chief), “GPU Computing Gems Emerald Edition”, Morgan Kaufmann, 2011