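This post mainly follows an English-language article. Its text is quoted below as written, since it is easy to follow.
First, my setup:
1. IDE: Visual Studio 2012, C#, .NET Framework 4.5, installed to the default path;
2. GPU: NVIDIA GeForce GT 755M (a mobile GPU for laptops); CUDA Toolkit: cuda_6.5.14_windows_general_64, installed to the default path.
3. managedCUDA version and download link: managedCUDA. Author: kunzmi, version 15. All rights belong to the original author, and many thanks to kunzmi.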
——————————————————————————————————————————————————————————————
Configuring and using managedCUDA with C# on .NET Framework 4.5
I. About managedCuda
managedCuda provides intuitive access to the CUDA driver API for any .NET language. It is roughly an equivalent of the runtime API (a comfortable wrapper around the driver API for C/C++), but written entirely in C# for .NET. In contrast to the runtime API, managedCuda takes a different approach to representing CUDA specifics: it is object oriented. In general you will find a C# class for each CUDA handle in the driver API. For example, instead of a CUcontext handle, managedCuda provides a CudaContext class. This design gives intuitive and simple access to all API calls by providing the corresponding methods on each class. A good example of this wrapping approach is a device variable. In the original CUDA driver API these are plain C pointers. In managedCuda they are represented by the class Cuda[Pitched]DeviceVariable<T>, a generic class that allows type-safe, object-oriented access to the CUDA driver API. Because a CudaDeviceVariable instance knows its wrapped data type, array size, dimensions and, where applicable, its memory alignment pitch, a simple call to CopyToHost(hostArray) is enough. The user does not have to supply all the C-style function arguments; this is done automatically. Furthermore, managedCuda throws specific exceptions when something goes wrong, i.e. you don't need to check API call return values; you only need to catch CudaException like any other exception.
But still, as a developer using managedCuda you need to know Cuda. You must know how to use contexts, set kernel launch grid configurations etc.
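As a small illustration of the "kernel launch grid configuration" mentioned above, here is a minimal sketch of the usual block-count calculation. The class and method names (LaunchConfig, GridSize) are hypothetical helpers for this post, not part of managedCuda.

```csharp
using System;

// Hypothetical helper, not part of managedCuda.
static class LaunchConfig
{
    // Returns the number of blocks needed so that
    // blocks * threadsPerBlock >= elementCount.
    public static int GridSize(int elementCount, int threadsPerBlock)
    {
        if (elementCount < 0)
            throw new ArgumentOutOfRangeException("elementCount");
        if (threadsPerBlock <= 0)
            throw new ArgumentOutOfRangeException("threadsPerBlock");

        // Ceiling division, written so it cannot overflow near int.MaxValue.
        int fullBlocks = elementCount / threadsPerBlock;
        int remainder = elementCount % threadsPerBlock;
        return remainder == 0 ? fullBlocks : fullBlocks + 1;
    }
}
```

Because the last block may be only partly filled, the kernel itself still has to check its global index against the element count before touching memory.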
Below I briefly describe the main classes used to implement a fully functional CUDA application in C#:
The CudaContext class: This is one of the three main classes and represents a CUDA context. From CUDA 4.0 on, the CUDA API demands (at least) one context per process per device. So for each device you want to use, you need to create a CudaContext instance. The different constructors let you define several properties, e.g. the deviceID to use. Like nearly all managedCuda classes, CudaContext implements IDisposable, and the wrapped CUDA context is valid until Dispose() is called. CudaContext also defines a number of static methods for retrieving general information about the available CUDA devices. Important for multi-threaded applications: in order to use any CUDA object related to a context, you must activate the CudaContext by calling its SetCurrent() method from the current thread. This holds for every thread switch. (See the CUDA programming guide for more information.)
CudaKernel: CUDA kernels are loaded from cubin or ptx files. You can load a kernel using the LoadKernel…() methods of a CudaContext, either from a byte array representation of the kernel file (e.g. an embedded resource) or by specifying the name of the file in which the kernel is stored. You also need the kernel name as defined in the source *.cu file. The LoadKernel methods return a CudaKernel object bound to the given context. CudaKernel does not implement IDisposable, as kernels are destroyed automatically as soon as the corresponding context is destroyed.
CudaDeviceVariable and its variations: A CudaDeviceVariable object represents allocated memory on the device. The class knows the exact memory layout (array length, array dimensions, memory pitch, etc.). As the class is generic, it also knows its element type and type size. All this dramatically simplifies data copying, because no size parameters are needed. Only the source or destination array must be given (either a regular C# host array or another device variable). Device memory is freed as soon as the CudaDeviceVariable object is disposed.
With these three main classes one can create an entire CUDA-accelerated application in C# in very few lines of code.
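To make the interplay of the three classes concrete, here is a minimal host-side sketch modelled on the vector addition sample in the managedCuda documentation. Several things are assumptions for illustration: device 0 is used, a file vectorAdd_x64.ptx lies next to the executable, and it exports an extern "C" kernel named VecAdd with the signature shown in the comment. Running it requires a CUDA-capable GPU and a reference to the managedCuda assembly.

```csharp
using System;
using ManagedCuda;

class VectorAddDemo
{
    static void Main()
    {
        const int N = 50000;
        const int threadsPerBlock = 256;

        try
        {
            // Assumption: device 0. Dispose() destroys the wrapped context.
            using (CudaContext ctx = new CudaContext(0))
            {
                // Assumed kernel in vectorAdd_x64.ptx:
                // extern "C" __global__ void VecAdd(const float* A, const float* B, float* C, int N)
                CudaKernel kernel = ctx.LoadKernel("vectorAdd_x64.ptx", "VecAdd");
                kernel.BlockDimensions = threadsPerBlock;
                kernel.GridDimensions = (N + threadsPerBlock - 1) / threadsPerBlock;

                float[] hA = new float[N];
                float[] hB = new float[N];
                for (int i = 0; i < N; i++) { hA[i] = i; hB[i] = 2 * i; }

                // Assigning a host array allocates device memory and copies the data.
                using (CudaDeviceVariable<float> dA = hA)
                using (CudaDeviceVariable<float> dB = hB)
                using (CudaDeviceVariable<float> dC = new CudaDeviceVariable<float>(N))
                {
                    kernel.Run(dA.DevicePointer, dB.DevicePointer, dC.DevicePointer, N);

                    // No size arguments needed: the device variable knows its length.
                    float[] hC = new float[N];
                    dC.CopyToHost(hC);
                    Console.WriteLine("hC[10] = {0}", hC[10]);
                }
            }
        }
        catch (CudaException ex)
        {
            // Failed API calls surface as exceptions instead of return codes.
            Console.Error.WriteLine("CUDA error: " + ex.Message);
        }
    }
}
```

The using blocks are a design choice rather than a requirement: they guarantee that device memory is released before the context that owns it is destroyed.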
Other managedCuda classes:
CudaPagelockedHostMemory: In order to use asynchronous copy methods (host to device or device to host), the host array must be allocated as pinned (page-locked) memory. To achieve this, CudaPagelockedHostMemory[2D,3D] allocates the memory using CUDA's cuMemHostAlloc. To simplify per-element access, the class provides an indexer to get or set single values. When working with large datasets, be aware that every per-element access crosses the managed/unmanaged boundary and must be marshaled, so access is not really fast. For large amounts of data, copying a managed array to the unmanaged memory in one block is faster.
CudaPagelockedHostMemory_[Type]: As the previous approach using generics and marshaling was not satisfactory in terms of speed, and direct pointer arithmetic with generics is not possible in C#, I tried something new, which I would call "templates with C#" using T4: a T4 template creates all possible variants such as 'float', 'int4', etc., which then access memory directly via pointers. The performance achieved with this approach is close to that of native arrays. If you want to use CudaPagelockedHostMemory with your own data types, simply copy the tt-file to your project and modify the list of types to process (but be aware of the license: managedCUDA is LGPL!).
CudaManagedMemory_[Type]: Using the same approach as for page-locked memory, CudaManagedMemory gives .NET access to the full feature set of managed memory, which was introduced with CUDA 6.0.
CudaRegisteredHostMemory: In C++, registered host memory is normally allocated memory that becomes usable for asynchronous copies once it is registered. In the .NET world this doesn't work as expected: although CudaRegisteredHostMemory is part of managedCuda, it shouldn't be used. Use CudaPagelockedHostMemory instead.
CudaArray[1D,2D,3D]: Represents a CUarray. Either you specify an already existing CUarray as the storage location, e.g. from graphics interop, or a new CUarray is created internally. The inner CUarray is freed on disposal only if it was allocated by the constructor.
CudaTextureFoo: Represents a CUDA texture reference. The device memory to bind this texture to can either be created internally by the constructor or passed as an argument. The memory is freed on disposal only if it was allocated by the constructor.
GraphicsInterop: Several graphics interop resource classes exist, one for every graphics API (DirectX or OpenGL). All these resources must be registered and can be mapped to cuda variables, cuda textures or cuda arrays, depending on their type. For efficient mapping, all resources can be grouped in a CudaGraphicsInteropResourceCollection, so that one single Map() call is enough to finish the task. Have a look at the sample applications to see how to use the collection.
II. Additional libraries:
- CudaFFT: Managed access to cufft*.dll
- CudaRand: Managed access to curand*.dll
- CudaSparse: Managed access to cusparse*.dll
- CudaBlas: Managed access to cublas*.dll
- CudaSolve: Managed access to cusolver*.dll
- NPP: Managed access to npp*.dll
- NVRTC: Managed access to nvrtc*.dll
All libraries have in common that they compile either to 32 or 64 bit in order to handle different wrapped dll names for 32 or 64 bit. They include a basic representation called *NativeMethods to call directly the API functions and wrap handles with C# classes.
CudaBitmapSource is a simple attempt to use CUDA device memory as a BitmapSource in WPF. It is more a proof of concept than a ready-to-use library; in particular, the fact that BitmapSource is a sealed class makes a proper implementation difficult. If you have ideas for improvements or a better design, please let me know ;-)
III. How To: Set up a C# CUDA project using Visual Studio 2010 (Solution 1):
(My Visual Studio is a German edition, some "translated" menu entries might therefore differ slightly from the original English menu entries.)
You need: Microsoft Visual Studio 201x, Nvidia Cuda Toolkit 7.0, Nvidia Parallel Nsight 4.0 for debugging and of course managedCuda. (Note: this post uses CUDA 6.5, so read every 7.0 below as 6.5.)
- Create a normal C# project (ConsoleApplication, Library, WinForms, WPF, etc.). Here I chose a C# console application.
Steps: open the VS IDE > File > New > Project > Visual C# > Console Application, enter "vectorAdd" as the name, and click OK.
- Add a new CudaRuntime 6.5.0 project to the same solution.
Steps: in Solution Explorer, right-click the solution "vectorAdd" and choose Add > New Project > NVIDIA > CUDA 6.5 > CUDA 6.5 Runtime, enter "vectorAddKernel" as the name, and click OK. You may rename the automatically created CUDA source file kernel.cu in the new vectorAddKernel project to vectorAdd.cu.
- Delete the CUDA sample code. To enable proper IntelliSense functionality you need to include the following header files in your *.cu file (from the toolkit include folder):
#include <cuda.h>
#include <device_launch_parameters.h>
#include <texture_fetch_functions.h>
#include <builtin_types.h>
#include <vector_functions.h>
#include "float.h"
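So that the IDE can find these .h files, add the library and header paths: right-click the project "vectorAddKernel" > Properties > Configuration Properties > VC++ Directories, and set the following:
Include Directories: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include
Library Directories: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\lib\x64
Alternatively, you can set environment variables once and for all, so that you do not have to add the library and include directories to every project. To do so:
After the toolkit is installed, two new environment variables, CUDA_PATH and CUDA_PATH_V6_5, appear in the system. Add the following environment variables as well:
CUDA_SDK_PATH = C:\ProgramData\NVIDIA Corporation\CUDA Samples\v6.5
CUDA_LIB_PATH = %CUDA_PATH%\lib\x64
CUDA_BIN_PATH = %CUDA_PATH%\bin
CUDA_SDK_BIN_PATH = %CUDA_SDK_PATH%\bin\x64
CUDA_SDK_LIB_PATH = %CUDA_SDK_PATH%\common\lib\x64
Then append the following to the end of the system PATH variable:
;%CUDA_LIB_PATH%;%CUDA_BIN_PATH%;%CUDA_SDK_LIB_PATH%;%CUDA_SDK_BIN_PATH%;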
- Also add the following defines:
#define _SIZE_T_DEFINED
#ifndef __CUDACC__
#define __CUDACC__
#endif
#ifndef __cplusplus
#define __cplusplus
#endif
- Write your kernel code in an extern "C" {} scope:
//Includes for IntelliSense
#define _SIZE_T_DEFINED
#ifndef __CUDACC__
#define __CUDACC__
#endif
#ifndef __cplusplus
#define __cplusplus
#endif
#include <cuda.h>
#include <device_launch_parameters.h>
#include <texture_fetch_functions.h>
#include "float.h"
#include <builtin_types.h>
#include <vector_functions.h>
// Texture reference
texture<float2, 2> texref;
extern "C"
{
    // kernel code
    __global__ void kernel(/* parameters */)
    {
    }
}
- You can also omit extern "C" in order to use templated kernels. But then kernel names get mangled ("_Z18GMMReductionKernelILi4ELb1EEviPfiPK6uchar4iPhiiiPj" instead of "GMMReductionKernel"). To look up the right mangled name, open the compiled ptx file with a text editor. To load such a kernel you need the full mangled name.
- Change the following project properties of the CudaRuntime 7.0 project:
General:
* Output directory: Set it to the source file directory of the C# project, i.e. vectorAdd\vectorAdd. The first vectorAdd is the solution name, the second is the default name of the C# console application project.
* Application type: Utility. This avoids a call to the Visual C++ compiler; no C++ output will be created.
CUDA C/C++:
* Compiler Output: $(OutDir)%(FileName)_x64.ptx or .cubin. Note: the _x64 suffix must be spelled out here, otherwise the build fails. To build for a 32-bit platform, set the compiler output to $(OutDir)%(FileName)_x86.ptx or .cubin instead.
* NVCC Compilation Type: "Generate .ptx file (-ptx)" or "Generate .cubin file (-cubin)" respectively. This must match the extension chosen in the previous setting.
* Target Machine Platform: 64-bit (--machine 64).
You need to set these properties for all possible targets and configurations (x86/x64, Debug/Release). To handle mixed mode platform kernels, give a different kernel name for x86 and x64, for example $(OutDir)%(FileName)_x86.ptx and $(OutDir)%(FileName)_x64.ptx.
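On the C# side, the _x86/_x64 naming convention above means the host program has to pick the kernel image that matches its own bitness at run time. A minimal sketch follows. KernelImage and FileNameFor are hypothetical names for this post, and the .ptx extension is assumed (use .cubin if that is what you generate).

```csharp
using System;

// Hypothetical helper, not part of managedCuda.
static class KernelImage
{
    // Picks the image matching the bitness of the current process,
    // e.g. "vectorAdd" -> "vectorAdd_x64.ptx" in a 64-bit process.
    public static string FileNameFor(string baseName)
    {
        return FileNameFor(baseName, Environment.Is64BitProcess);
    }

    // Overload with explicit bitness, so the mapping can be checked on any machine.
    public static string FileNameFor(string baseName, bool is64BitProcess)
    {
        if (string.IsNullOrEmpty(baseName))
            throw new ArgumentException("baseName must not be empty", "baseName");

        return baseName + (is64BitProcess ? "_x64" : "_x86") + ".ptx";
    }
}
```

The decision is based on the process rather than the operating system, because an AnyCPU project with "Prefer 32-bit" enabled runs as a 32-bit process even on 64-bit Windows.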
- Delete the post build event: We don’t need the CUDA runtime libraries copied.
- Build the CUDA project once for each platform. The build needs the CUDA build customization enabled: right-click the project "vectorAddKernel" > Build Customizations > check CUDA (.targets, .props), and click OK.
- In the C# project, add the newly built kernel files from the C# project source directory to the project.
- Set the file properties either to "Embedded Resource" (access the files as a stream or byte[] when loading kernel images), or set "Copy to Output Directory" to "Copy always" and load the kernel image from file.
Note: in this walkthrough that means adding the vectorAdd_x64.ptx file generated in the previous step to the vectorAdd project (right-click the project "vectorAdd" > Add > Existing Item, select vectorAdd_x64.ptx and add it), and then setting its Build Action to "Embedded Resource" so that the kernel in it can be read through a stream (right-click the file "vectorAdd_x64.ptx" > Properties > Build Action > Embedded Resource). Alternatively, set Copy to Output Directory > Copy always.
- Add a reference to the managedCuda assembly.
IV. How To: Set up a C# CUDA project using Visual Studio 2010 (Solution 2, from Brian Jimdar)
Using pre-build events:
In the project properties-page of your C# project, add the following pre-build event:
call "%VS100COMNTOOLS%vsvars32.bat"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_11 -m 64 -o "$(ProjectDir)PTX\%%~na_64.ptx" "$(ProjectDir)Kernels\%%~na.cu"
for /f %%a IN ('dir /b "$(ProjectDir)Kernels\*.cu"') do nvcc -ptx -arch sm_11 -m 32 -o "$(ProjectDir)PTX\%%~na.ptx" "$(ProjectDir)Kernels\%%~na.cu"
This builds an x86 and an x64 version of each .cu file in the .\Kernels directory and writes them to the .\PTX directory.
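V. Common problems and fixes
1. Assembly.GetManifestResourceStream always returns null.
When running or debugging the code, Assembly.GetManifestResourceStream kept returning null.
The file was clearly there, but it turned out I had only included it in the project, whereas Assembly.GetManifestResourceStream looks up resources embedded in the assembly. The fix is to right-click the file and set Properties > Build Action > Embedded Resource.
Also: when calling Assembly.GetManifestResourceStream(type, name), the namespace of type must match the namespace of the resource that name refers to (the resource's namespace is in fact determined by the folder path of the file within the project).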
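When the lookup fails, it helps to see the names the assembly actually contains. The following is a minimal sketch of a lookup that avoids the namespace guesswork by matching on the file name and listing all embedded resource names on failure. EmbeddedKernel and Open are hypothetical names for this post, and only standard .NET reflection calls are used.

```csharp
using System;
using System.IO;
using System.Reflection;

// Hypothetical helper, not part of managedCuda.
static class EmbeddedKernel
{
    // Opens the embedded resource named fileName, with or without a
    // namespace/folder prefix such as "vectorAdd.Kernels.".
    public static Stream Open(Assembly assembly, string fileName)
    {
        string[] names = assembly.GetManifestResourceNames();
        foreach (string name in names)
        {
            bool match = name.Equals(fileName, StringComparison.OrdinalIgnoreCase)
                || name.EndsWith("." + fileName, StringComparison.OrdinalIgnoreCase);
            if (match)
                return assembly.GetManifestResourceStream(name);
        }

        // Listing what is embedded makes a wrong Build Action or prefix obvious.
        string available = names.Length == 0 ? "(none)" : string.Join(", ", names);
        throw new FileNotFoundException(
            "No embedded resource named '" + fileName + "'. Embedded resources: " + available);
    }
}
```

A typical call would be EmbeddedKernel.Open(Assembly.GetExecutingAssembly(), "vectorAdd_x64.ptx"). As far as I know, the resulting stream can then be passed to CudaContext.LoadKernelPTX together with the kernel name, but check the overloads in your managedCuda version.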
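2. Exception: System.BadImageFormatException, the assembly XXX could not be loaded.
This is usually caused by a mismatch between the target platform of the application and that of one of its dependencies. Building all projects for the same target platform (x86, x64 or AnyCPU) usually resolves it. DLLs whose x86/x64 platform, or Debug/Release configuration, does not match the rest of the solution are especially prone to trigger it.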
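3. How to tell in C# whether you are running as 32-bit or 64-bit.
There are many ways. For example, the following code checks the pointer size of the current process:
if (System.IntPtr.Size == 4)
    MessageBox.Show("32-bit process");
else if (System.IntPtr.Size == 8)
    MessageBox.Show("64-bit process");
Note that IntPtr.Size reflects the bitness of the running process rather than of the operating system. If your operating system is 64-bit Windows 7 and IntPtr.Size is still 4, the C# project is set to "Prefer 32-bit". To turn that off: right-click the project "vectorAdd" > Properties > Build > uncheck "Prefer 32-bit".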
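To see process and operating system bitness side by side, .NET Framework 4.0 and later expose both directly on System.Environment. A minimal sketch follows. Bitness and Describe are hypothetical names for this post.

```csharp
using System;

// Hypothetical helper for illustration.
static class Bitness
{
    // Reports the bitness of the current process and of the operating system.
    public static string Describe()
    {
        return string.Format(
            "process: {0}-bit, operating system: {1}-bit",
            Environment.Is64BitProcess ? 64 : 32,
            Environment.Is64BitOperatingSystem ? 64 : 32);
    }
}
```

If the two values differ, the program is running as a 32-bit process on a 64-bit system, which is exactly the "Prefer 32-bit" situation described above. In that case the x86 kernel image and 32-bit DLLs are the ones that must be loaded.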