Johannes Habich's Blog: CUDA

Topic

My activities at RRZE and HPC in general, CV

Recent posts

Thursday, 24 September 2009

PCI express pinned Host Memory

Retesting my benchmarks with the current release, CUDA 2.3, I finally incorporated newer features such as pinned host memory allocation. According to the specs, this improves transfers from host to device and vice versa.
Because of the special allocation, the arrays stay at the same location in memory, are never swapped out, and are more quickly available for DMA transfers. With ordinary allocation, most data takes a detour through a pinned memory buffer on its way to or from the ordinarily allocated memory. With pinned allocation this detour is omitted.

The performance plot shows that pinned memory now reaches up to 5.9 GB/s on the fastest currently available interface, PCIe x16 Gen 2, which has a peak transfer rate of 8 GB/s. This corresponds to 73% of peak performance with almost no optimization applied. In contrast, optimizations such as blocked data transfer, which proved to increase performance some time ago [PCIe revisited], no longer have any positive effect on performance.

Using only the blocked optimization without pinned memory is still better than doing an unblocked transfer from unpinned memory, but it reaches only about 4.5 GB/s to the device, which corresponds to 56% of peak.
Reading from the device is far worse, at only 2.3 GB/s.
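
The percentages above are plain ratios of measured rate to theoretical link rate. Here is a minimal sketch of that arithmetic in Python. It assumes the usual PCIe Gen 2 link parameters of 5 GT/s per lane with 8b/10b encoding; these parameters and the function names are illustrative and not taken from the benchmark code. With the rounded rates quoted above, the ratios come out at 73.75% and 56.25%.

```python
def pcie_peak_gb_per_s(lanes=16, gigatransfers_per_s=5.0,
                       payload_bits=8, line_bits=10):
    """Theoretical one-directional PCIe bandwidth in GB/s (1 GB = 1e9 bytes).

    The defaults are assumed PCIe Gen 2 values: 5 GT/s per lane and
    8b/10b encoding (8 payload bits for every 10 bits on the wire).
    """
    payload_gbit_per_lane = gigatransfers_per_s * payload_bits / line_bits
    return lanes * payload_gbit_per_lane / 8  # 8 bits per byte


def percent_of_peak(measured_gb_per_s, peak_gb_per_s):
    """Measured bandwidth as a percentage of the theoretical peak."""
    return 100.0 * measured_gb_per_s / peak_gb_per_s


peak = pcie_peak_gb_per_s()

# Rates quoted in the post, in GB/s.
measurements = {
    "pinned": 5.9,
    "blocked, unpinned, to device": 4.5,
    "unpinned, from device": 2.3,
}

for label, rate in measurements.items():
    print(f"{label}: {rate} GB/s = {percent_of_peak(rate, peak):.2f}% of peak")
```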


Thursday, 23 July 2009

Cuda 2.3 released

NVIDIA just released CUDA version 2.3 together with the corresponding driver.
f22 @ RRZE has already been updated to support this version.

Wednesday, 8 July 2009

Cuda Machines @ RRZE

The CUDA test systems currently available @ RRZE are:


lightning: (available with upgraded hardware)
Ubuntu 8.04 x86_64
2x Quadcore Intel Clovertown (2.33 GHz), 4 MB L2 per 2 cores,
GeForce 8800 Ultra (768 MB) (G80 core)
Cuda Driver Version: 180.22
Cuda Toolkit: 2.0

f22: (Last Update 29.09.09)
Ubuntu 8.04 x86_64
2x Quadcore Intel Xeon L5420 (2.5 GHz)
GeForce GTX 280 SC (1 GB) (GT200 Core)
Current: Cuda Driver Version: 190.29 (Cuda 2.3), now with OpenCL support!
Before: Cuda Driver Version: 190.16 (Cuda 2.3)
Cuda Toolkit: 2.3

Tuesday, 7 July 2009

Cuda Tutorial @ RRZE

Currently we have two test systems with different NVIDIA GPUs inside the test cluster environment.
  • Please apply for an HPC account at RRZE (ask your local administrator).
  • You get access to one of the machines either by submitting a job script or by requesting an interactive shell, e.g.:
    • qsub -I -lnodes=f22:ppn=8,walltime=01:00:00
  • Note that interactive sessions are limited to one hour; they are nevertheless the recommended way to try things out in the beginning.
  • The module system supplies you with various compiler and CUDA versions, e.g.:
    • module load cuda/2.2 gives you CUDA version 2.2, 64-bit.
  • The next thing you will want to try is compiling the SDK examples.
    • To do so, download the SDK matching the CUDA version you want to use (please check whether that version is available, too!) and extract it to some directory by running it.
    • The CUDA path you have to specify (not the install path!) is /usr/local/cudaXX, where XX is the version and the architecture (e.g. -32).
    • Then enter the directory you extracted to and type make. It should compile; if it doesn't, please look in /usr/local/cudaXX/bin/linux/release/. If you find executables there and you can actually run them, then there is a mistake somewhere in your settings. If you are trying to compile in 32-bit mode, please contact us at hpc@rrze.uni-erlangen.de, because you will need further assistance.
  • Assuming compilation went well (went well = no errors; we neglect the warnings here), you should have runnable SDK examples in bin/linux/release/ inside the directory you extracted to.
  • Now your basic CUDA environment is set up and ready for your own codes.

Monday, 15 December 2008

PCI express revisited

Test results with the new generation, i.e. a GT200-based card on PCIe Generation 2.0 with doubled peak performance, show that naively implemented copies do not get any speedup.
Blocked copies, however, climb up to 4.5 GB/s when writing data to GPU memory.
Copying data back to the host is still relatively slow at 2 GB/s.

PCIe bandwidth measurements: 8800 GTX vs. GTX 280



Link to first article

Sunday, 23 November 2008

Yeehhaa: NVIDIA GT200 rocks

A specimen of the new GT200-based NVIDIA GTX 280 graphics card arrived at our computing center last Friday. The card was installed and set up right away; the first benchmark started on Saturday, 22 November, and finished today.

Some preliminary figures show the great improvement of this new generation, as I expected from the data sheets. Soon I will post some verified results here, along with some notes on the changes from the G80 generation to the current GT200 chip.

Wednesday, 10 September 2008

PCI express bandwidth measurements

While benchmarking the PCI Express capabilities with CUDA, I stumbled across the weird behaviour that a 4 MB block seems to achieve the best sustainable bandwidth, at least when writing to the host.
However, transmitting more than 4 MB in 4 MB data packets (let's call it blocked copy) does leave a gap in performance.
Although the performance is regained towards the end, when almost the whole GPU memory is filled, the question is what causes the performance to drop to 2 GB/s in the first place.
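
The blocked copy described above simply splits one large transfer into fixed-size pieces. Here is a minimal sketch of the splitting logic in Python, assuming a block size of 4 * 1024 * 1024 bytes. The names `blocked_ranges`, `blocked_copy` and `copy_block` are hypothetical; the default `copy_block` is only a host-side stand-in, and in a real benchmark it would issue one host-to-device copy per block.

```python
BLOCK_BYTES = 4 * 1024 * 1024  # assumed 4 MB block size


def blocked_ranges(total_bytes, block_bytes=BLOCK_BYTES):
    """Yield (offset, length) pairs covering total_bytes block by block."""
    offset = 0
    while offset < total_bytes:
        length = min(block_bytes, total_bytes - offset)  # last block may be short
        yield offset, length
        offset += length


def blocked_copy(src, dst, copy_block=None, block_bytes=BLOCK_BYTES):
    """Copy src into dst one block at a time; return the number of blocks."""
    if copy_block is None:
        # Stand-in for a real transfer call: a plain slice copy on the host.
        def copy_block(dst, src, offset, length):
            dst[offset:offset + length] = src[offset:offset + length]

    blocks = 0
    for offset, length in blocked_ranges(len(src), block_bytes):
        copy_block(dst, src, offset, length)
        blocks += 1
    return blocks


src = bytes(range(256)) * (10 * 4096)  # 10 MiB of test data
dst = bytearray(len(src))
blocks = blocked_copy(src, dst)
print(f"copied {len(src)} bytes in {blocks} blocks")
```

Passing a different `copy_block` callable is what would turn this into an actual measurement; timing each call separately would show whether individual blocks or the transfer as a whole are responsible for a drop in bandwidth.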

Another interesting question is the jump in performance at 1e6 bytes, possibly caused by a switch in protocols.

Performance of PCI Express transfers to an NVIDIA G80 8800 GTX card

Tuesday, 2 September 2008

Towards Teraflops for Games

With the release of the next generation of GPUs, graphics boards from NVIDIA and AMD (formerly ATI) now deliver performance on the order of one teraflop in single precision. NVIDIA nearly doubled both the number of processors and the memory bus width. The interesting question for research is now how the sustainable performance of programs and algorithms scales on the new platform.
So far I have not been able to test my own codes, the stream benchmarks and the lattice Boltzmann method (see my thesis for more details), on the new NVIDIA GPUs.

Double precision has also made its way into the GPU circuits, unfortunately with a huge performance loss, down to around a tenth of single precision performance.
In contrast, current CPUs lose only about 50% of their performance, which follows directly from the doubled computational work.
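
One way to read the factor of two on the CPU is through the SIMD width: a register of fixed size holds half as many 8-byte doubles as 4-byte floats. Here is a small sketch of that count in Python, assuming 128-bit (SSE-style) registers. The register width is an assumption for illustration and is not stated in the post.

```python
import struct

REGISTER_BITS = 128  # assumed SIMD register width


def values_per_register(format_char, register_bits=REGISTER_BITS):
    """How many values of the given struct format fit into one register."""
    return register_bits // (8 * struct.calcsize(format_char))


single = values_per_register("f")  # 4-byte single precision
double = values_per_register("d")  # 8-byte double precision

print(f"single precision values per register: {single}")
print(f"double precision values per register: {double}")
print(f"double/single throughput ratio: {double / single}")
```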

Here is a little demonstration of the key difference between CPU and GPU: NVISION

Monday, 1 September 2008

First Shot

I'm currently with the HPC group @ RRZE, working on my master's thesis about HPC on graphics cards, covering benchmark kernels and flow solvers.

So any remarks or hints? Drop them here!

Thanks

/edit 01.07.08

Thesis finished :-)