Topic
My activities at RRZE and HPC in general, CV
Thursday, 24 September 2009
PCI express pinned Host Memory
Johannes Habich, 08:40 in CUDA
Retesting my benchmarks with the current CUDA 2.3 release, I finally incorporated new features such as pinned host memory allocation. According to the specs, this improves host-to-device transfers and vice versa.
Because of the special allocation, the arrays stay at the same location in memory, are not swapped out, and are more readily available for DMA transfers. Otherwise, most data is first copied to a pinned memory buffer and only then to the ordinarily allocated memory space. With pinned allocation, this detour is omitted.
The performance plot shows that pinned memory now delivers up to 5.9 GB/s on the fastest currently available PCIe x16 Gen 2 interface, which has a peak transfer rate of 8 GB/s. This corresponds to just under 74% of peak performance with almost no optimization applied. In contrast, optimizations such as blocked data transfer, which proved to increase performance some time ago [PCIe revisited], no longer have any positive effect.
Using only the blocked optimization without pinned memory is still better than an unblocked transfer from unpinned memory, but it reaches only about 4.5 GB/s to the device, which corresponds to 56% of peak.
Reading from the device is far worse, at only 2.3 GB/s.
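The percentages quoted above are simply the measured rate divided by the 8 GB/s peak. A minimal Python sketch of that arithmetic follows; the rates are the ones from this post, while the dictionary labels and the function name are only illustrative.

```python
# Peak and measured rates are the figures quoted in the post (GB/s).
PEAK_GBPS = 8.0  # PCIe x16 Gen 2, one direction

measured_gbps = {
    "pinned, host to device": 5.9,
    "blocked, unpinned, host to device": 4.5,
    "read from device": 2.3,
}

def fraction_of_peak(rate_gbps, peak_gbps=PEAK_GBPS):
    """Return the achieved share of the peak rate in percent."""
    return 100.0 * rate_gbps / peak_gbps

efficiency = {name: fraction_of_peak(rate) for name, rate in measured_gbps.items()}

for name, percent in efficiency.items():
    print(f"{name}: {percent:.1f}% of peak")
```

Seen this way, the read-back rate is below a third of what the link could deliver in theory.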
Thursday, 23 July 2009
Cuda 2.3 released
Johannes Habich, 09:06 in CUDA
NVIDIA just released CUDA version 2.3 with the corresponding driver.
F22 @RRZE has already been updated to support this version.
Wednesday, 8 July 2009
Cuda Machines @ RRZE
Johannes Habich, 08:23 in CUDA
The CUDA test systems currently available @ RRZE are:
lightning: (available with upgraded hardware)
Ubuntu 8.04 x86_64
2x quad-core Intel Clovertown (2.33 GHz), 4 MB L2 per 2 cores
GeForce 8800 Ultra (768 MB) (G80 core)
CUDA driver version: 180.22
CUDA toolkit: 2.0
f22: (last update 29.09.09)
Ubuntu 8.04 x86_64
2x quad-core Intel Xeon L5420 (2.5 GHz)
GeForce GTX 280 SC (1 GB) (GT200 core)
Current: CUDA driver version: 190.29 (CUDA 2.3), with OpenCL support!
Before: CUDA driver version: 190.16 (CUDA 2.3)
CUDA toolkit: 2.3
Tuesday, 7 July 2009
Cuda Tutorial @ RRZE
Johannes Habich, 17:10 in CUDA
Currently we have two test systems with different NVIDIA GPUs running inside the test cluster environment.
- Please apply for an HPC account at RRZE (ask your local administrator).
- You get access to one of the machines either by submitting a job script or by requesting an interactive shell, e.g.: qsub -I -lnodes=f22:ppn=8,walltime=01:00:00
- Note that interactive sessions are limited to one hour, but they are the recommended way to try things out in the beginning.
- The module system now supplies you with various compiler and CUDA versions, e.g. module load cuda/2.2 will give you CUDA version 2.2, 64-bit.
- The next thing you will want to try is compiling the SDK examples.
- To do so, download the SDK matching the CUDA version you want to use (please check whether that version is available here, too!) and extract it to some directory by running it.
- The CUDA path you have to specify (not the install path!) is /usr/local/cudaXX, where XX is the version and the architecture (e.g. -32).
- Then enter the directory you extracted to and type make. It should compile. If it doesn't, please look in /usr/local/cudaXX/bin/linux/release/. If you find executables in there and you can actually run them, then there is a mistake somewhere in your settings. If you are trying to compile in 32-bit mode, please contact us at hpc@rrze.uni-erlangen.de, because you will need further assistance.
- Assuming compilation went well (went well = no errors; we ignore the warnings here), you should have runnable SDK examples in /bin/linux/release/
- Now your basic CUDA environment is set up and ready for your own codes.
Monday, 15 December 2008
PCI express revisited
Johannes Habich, 11:42 in CUDA
Test results with the new generation, i.e. GT200-based cards and PCIe Generation 2.0 with doubled performance, show that naively implemented copies do not get any speedup.
Blocked copies, however, climb up to 4.5 GB/s when writing data to GPU memory.
Copying data back to the host is still relatively slow at 2 GB/s.

Link to first article
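The "doubled performance" of PCIe Generation 2.0 can be checked with a back-of-the-envelope calculation. The Python sketch below assumes a x16 link and uses the per-lane signalling rates (2.5 and 5 GT/s) and the 8b/10b encoding from the PCIe specifications; none of these figures come from the measurements in this post, and the function name is only illustrative.

```python
# Per-lane signalling rates in GT/s; values from the PCIe specifications,
# not from the measurements in the post.
GIGATRANSFERS_PER_LANE = {1: 2.5, 2: 5.0}
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b line code: 8 payload bits per 10 bits sent

def peak_gbytes_per_s(generation, lanes):
    """Theoretical one-way peak bandwidth in GB/s for a PCIe Gen 1 or Gen 2 link."""
    payload_gbits = GIGATRANSFERS_PER_LANE[generation] * ENCODING_EFFICIENCY * lanes
    return payload_gbits / 8  # bits -> bytes

gen1 = peak_gbytes_per_s(1, 16)
gen2 = peak_gbytes_per_s(2, 16)
print(f"x16 Gen 1: {gen1:.1f} GB/s, x16 Gen 2: {gen2:.1f} GB/s")
```

Under these assumptions, the 4.5 GB/s reached by blocked copies is a bit more than half of the theoretical Gen 2 peak.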
Sunday, 23 November 2008
Yeehhaa: NVIDIA GT200 rocks
Johannes Habich, 23:33 in CUDA
A GTX 280 graphics card, based on NVIDIA's new GT200 series, arrived at our computing center last Friday. The card was installed and set up right away; the first benchmark ran on Saturday, 22 November, and finished today.
Some preliminary figures show the great improvement of this new generation, as I expected from the data sheets. Soon I will post some verified results here, along with notes on the changes from the G80 generation to the current GT200 chip.
Wednesday, 10 September 2008
PCI express bandwidth measurements
Johannes Habich, 14:26 in CUDA
Benchmarking the PCI Express capabilities with CUDA, I stumbled across the weird behaviour that a 4 MB block seems to achieve the best sustainable bandwidth, at least when writing to the host.
However, transmitting more than 4 MB in 4 MB data packets (let's call it blocked copy) does leave a gap in performance.
Although performance recovers at the end, when almost the whole GPU memory is filled, the question is what causes it to drop to 2 GB/s in the first place.
Another interesting question is the jump in performance at 1e6 bytes, possibly caused by a switch in protocols.
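The blocked copy described above can be sketched as follows. This is only an illustration in plain Python, not the benchmark code: the slice assignment is a host-side stand-in for one device transfer call per block, "4 MB" is assumed to mean 4 * 1024 * 1024 bytes, and the names blocks and blocked_copy are made up for this example.

```python
BLOCK_BYTES = 4 * 1024 * 1024  # 4 MB block size (assumed to mean 4 MiB)

def blocks(total_bytes, block_bytes=BLOCK_BYTES):
    """Yield (offset, size) pairs that cover total_bytes in order."""
    offset = 0
    while offset < total_bytes:
        size = min(block_bytes, total_bytes - offset)
        yield offset, size
        offset += size

def blocked_copy(dst, src, block_bytes=BLOCK_BYTES):
    """Copy src into dst block by block.

    Each slice assignment stands in for one transfer call per block.
    """
    view_dst, view_src = memoryview(dst), memoryview(src)
    for offset, size in blocks(len(src), block_bytes):
        view_dst[offset:offset + size] = view_src[offset:offset + size]

src = bytes(range(256)) * 40960  # 10 MiB of test data
dst = bytearray(len(src))
blocked_copy(dst, src)
print(len(list(blocks(len(src)))), "blocks copied")
```

The last block is simply shorter when the total size is not a multiple of the block size.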
Tuesday, 2 September 2008
Towards Teraflops for Games
Johannes Habich, 12:04 in CUDA
With the release of the next generation of GPUs, graphics boards from NVIDIA and AMD (formerly ATI) now deliver performance on the order of one teraflop in single precision. NVIDIA nearly doubled both the number of processors and the memory bus width. The interesting research question now is how the sustainable performance of programs and algorithms scales with the new platform.
Until now I have not been able to test my own algorithms, the stream benchmarks and the lattice Boltzmann method (see my thesis for more details), on the new NVIDIA GPUs.
Double precision has also made its way into the GPU circuits, unfortunately with a huge performance loss, down to around a tenth of single-precision performance.
In contrast, current CPUs lose only about 50% of their performance, which follows naturally from the doubled computational work.
Here is a little demonstration of the key difference between CPU and GPU: NVISION
Monday, 1 September 2008
First Shot
Johannes Habich, 09:21 in CUDA
I'm currently with the HPC group @ RRZE, working on my master's thesis about HPC on graphics cards, focusing on benchmark kernels and flow solvers.
So any remarks or hints? Drop them here!
Thanks
/edit 01.07.08
Thesis finished :-)