Kernel launches with Circle

Peruse the source yourself: kernel.cu

Mark Harris put together Six Ways to SAXPY, comparing the linear algebra SAXPY call implemented in six different GPU frontends. This document advances a seventh: the SAXPY launch with Circle.

How does Circle differ from NVIDIA's nvcc compiler?

  1. Circle is single pass. The __CUDA_ARCH__ macro is defined once and reflects the most recent device architecture targeted. It cannot be used to gate between host and device code, and cannot be used to target specific architectures. Its only use in Circle is to enable function definitions in the CUDA Toolkit headers that require it. Instead of relying on __CUDA_ARCH__, Circle programmers have access to an nvvm_arch_t named enumeration, with an enumerator sm_XX for each GPU architecture targeted at the command line. The implicitly-defined variable __nvvm_arch reflects the device module currently being targeted. Use if codegen to execute a branch at code-generation time (that is, still at compile time, but after parsing and semantic analysis have run). There are two major benefits to the single-pass design: faster compile times and simpler code.
  2. Circle doesn't require tags. __host__ and __device__ tags have similar meanings in Circle as they do in nvcc, but you aren't obligated to tag functions just to use them from kernels. The saxpy helper function below is untagged, allowing it to be called from host or device code. This is especially useful for using big existing libraries, as no porting should be required. Thrust provides tagged versions of complicated STL classes like tuple and array. Circle lets you use those types directly.
  3. Circle lets you metaprogram in C++. Delivering tuning parameters to kernels has been tricky with CUDA, requiring an unholy union of templates and macros. With Circle, you can use the data structure of your choosing to hold kernel parameters, and query it the same way from host and device code. For the host code, we make a normal runtime query. For the device code, we query from a @meta context, which invokes the interpreter to make a compile-time copy of the underlying data structure and pulls the values in as constexpr objects. We do this once for each targeted device architecture, then execute the branch whose tunings correspond to the architecture being built by the code generator. No macros or template metaprogramming are required.

Circle's metaprogramming facilities give us the flexibility to parameterize operations based on GPU architecture, data type, problem size and input distribution, and so on. We can use the same constructs to access the kernel parameters from device and host code, even though the data is constexpr on the device side!

#include <cuda.h>
#include <cstdio>
#include <map>

struct details_t {
  int nt;     // number of threads per CTA.
  int vt;     // number of values per thread.

  // Go nuts and put all sorts of other options here. Any data type will
  // work.
};

// For each SM version we care about, provide specific tunings for our 
// kernel. The options need to be accessible to both the host at runtime
// (so that we can size the launches and understand how to call the kernel)
// and the device at compile time (so we get efficient code generation).

// We'll use the same std::map for these uses. The @meta context and integrated
// interpreter will allow us to manipulate a compile-time copy of the map.
// This is much easier than C++ template metaprogramming, which doesn't allow
// for random-access into standard containers.

// You can load from CSV or JSON in your build directory. Or you can load from
// a database or scrape tweets or anything you like. The Circle interpreter
// makes importing data very easy.
const std::map<int, details_t> kernel_config {
  { 35, { 256, 3  } },
  { 52, { 128, 7  } },
  { 61, { 128, 11 } },
  { 70, { 256, 8  } }
};

// For convenience, make a templated version of our function. We don't have
// to give it a __device__ tag, because Circle allows you to call untagged
// functions from kernels and other device code. This is especially convenient
// in that we can use std::array, std::tuple and the like directly from the
// STL, and not rely on their equivalents in thrust.
template<int nt, int vt, typename type_t>
void saxpy(type_t a, const type_t* x, type_t* y, size_t index, size_t count) {
  // First load the values in x and y. Perform a bunch of loads before
  // doing any arithmetic. This lets us saturate the memory bandwidth of the
  // device, since we aren't getting blocked by arithmetic data dependencies.
  // @meta for makes this a true unrolled loop--the contained statements are
  // actually injected through the frontend once for each iteration.
  type_t x2[vt], y2[vt];
  @meta for(int i = 0; i < vt; ++i) {
    if(index + i * nt < count) {
      x2[i] = x[index + i * nt];
      y2[i] = y[index + i * nt];
    }
  }

  // Perform arithmetic in-place.
  @meta for(int i = 0; i < vt; ++i)
    y2[i] += a * x2[i];

  // Write all y values back out.
  @meta for(int i = 0; i < vt; ++i) {
    if(index + i * nt < count)
      y[index + i * nt] = y2[i];
  }
}

template<typename type_t>
__global__ void kernel(type_t a, const type_t* x, type_t* y, size_t count) {

  // Loop over each enumerator for the -sm_XX architectures specified at 
  // the command line. These are not necessarily the ones specified in the
  // kernel_config map!
  @meta for enum(auto sm : nvvm_arch_t) {

    // __nvvm_arch is set by Circle's code generator to reflect the currently-
    // targeted NVVM/NVPTX module. If sm is the currently-targeted 
    // architecture in the backend, this branch will get emitted as LLVM IR
    // and compiled down to PTX. The other iterations in the loop will be 
    // skipped over, leaving us with one call to saxpy per target architecture.
    if codegen(__nvvm_arch == sm) {
      
      // Read the details from the std::map at compile time. This is not the
      // same instance of the kernel_config map that is read from main.
      // This kernel_config object is lazily created by the interpreter when
      // we try to access it in an expression in a @meta statement. It's 
      // initialized with the same std::initializer_list ctor, so it has the
      // same value as the ordinary runtime object accessed by main.

      // Use lower_bound search to find the best fit.
      @meta auto it = kernel_config.lower_bound((int)sm);
      static_assert(it != kernel_config.end(), 
        "requested SM version has no kernel details!");

      // Call the saxpy helper function with the parameters from details.
      // Since we loaded details in a @meta statement, its members are
      // effectively constexpr, and we can use them to specialize templates, 
      // size arrays, and so on.

      // We don't have to specialize the helper function over the kernel
      // tuning parameters. If we preferred, we could specialize over 
      // the nvvm_arch_t sm constant and access the kernel_config data
      // structure directly from saxpy.
      @meta details_t details = it->second;

      size_t index = blockIdx.x * details.nt * details.vt + threadIdx.x;
      saxpy<details.nt, details.vt>(a, x, y, index, count);
    }
  }
}

int main(int argc, char** argv) {

  size_t count = 1000000;

  // Assemble the SM version at runtime based on the GPU plugged into your
  // machine.
  int major, minor;
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, 0);
  cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, 0);
  int sm = 10 * major + minor;

  // Retrieve the kernel config.
  auto it = kernel_config.lower_bound(sm);
  if(it == kernel_config.end()) {
    fprintf(stderr, "requested SM version has no kernel details!\n");
    return 1;
  }
  details_t details = it->second;

  // Compute the number of blocks. This is a runtime operation.
  int nv = details.nt * details.vt;
  int num_blocks = (count + nv - 1) / nv;

  float* x, *y;
  cudaMalloc((void**)&x, sizeof(float) * count);
  cudaMalloc((void**)&y, sizeof(float) * count);

  // Launch the kernel. Note we aren't passing the kernel any architecture-
  // specific template arguments. It picks up the architecture from the
  // __nvvm_arch codegen variable during code generation.
  kernel<<<num_blocks, details.nt>>>(3.14f, x, y, count);

  cudaFree(x);
  cudaFree(y);
        
  return 0;
}

To compile the code, use -cuda-path to point at the CUDA Toolkit installation, specify each architecture you're targeting, and link against libcudart.so. Circle makes a single frontend pass, binds all the PTX and cubin data into a single fatbin, and links your code. The chevron launch works just like nvcc's.

$ circle -cuda-path /usr/local/cuda-10.0 -sm_35 -sm_52 -sm_61 -sm_70 kernel.cu -lcudart
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelIfEvT_PKS0_PS0_m' for 'sm_35'
ptxas info    : Function properties for _Z6kernelIfEvT_PKS0_PS0_m
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 13 registers, 352 bytes cmem[0]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelIfEvT_PKS0_PS0_m' for 'sm_52'
ptxas info    : Function properties for _Z6kernelIfEvT_PKS0_PS0_m
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 23 registers, 352 bytes cmem[0]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelIfEvT_PKS0_PS0_m' for 'sm_61'
ptxas info    : Function properties for _Z6kernelIfEvT_PKS0_PS0_m
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 37 registers, 352 bytes cmem[0]
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6kernelIfEvT_PKS0_PS0_m' for 'sm_70'
ptxas info    : Function properties for _Z6kernelIfEvT_PKS0_PS0_m
    0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 30 registers, 384 bytes cmem[0]

We can use cuobjdump to print the SASS linked into the executable. Our register blocking has produced code where most of the loads are issued first, then the compute is performed, then the stores are made. This is the preferred structure for GPU kernels, which exploit fine-grained multithreading to hide memory latency.

$ cuobjdump -sass kernel
...
  code for sm_70
    Function : _Z6kernelIfEvT_PKS0_PS0_m
  .headerflags    @"EF_CUDA_SM70 EF_CUDA_PTX_SM(EF_CUDA_SM70)"
        /*0000*/              @!PT SHFL.IDX PT, RZ, RZ, RZ, RZ ;                             /* 0x000000fffffff389 */
                                                                                             /* 0x000fe200000e00ff */
        /*0010*/                   IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ;                   /* 0x00000a00ff017624 */
                                                                                             /* 0x000fd000078e00ff */
        /*0020*/                   S2R R22, SR_CTAID.X ;                                     /* 0x0000000000167919 */
                                                                                             /* 0x000e220000002500 */
        /*0030*/                   S2R R3, SR_TID.X ;                                        /* 0x0000000000037919 */
                                                                                             /* 0x000e240000002100 */
        /*0040*/                   IMAD R22, R22, 0x800, R3 ;                                /* 0x0000080016167824 */
                                                                                             /* 0x001fe400078e0203 */
        /*0050*/                   IMAD.MOV.U32 R3, RZ, RZ, 0x4 ;                            /* 0x00000004ff037424 */
                                                                                             /* 0x000fc600078e00ff */
        /*0060*/                   ISETP.GE.U32.AND P6, PT, R22.reuse, c[0x0][0x178], PT ;   /* 0x00005e0016007a0c */
                                                                                             /* 0x040fe40003fc6070 */
        /*0070*/                   SHF.R.S32.HI R21, RZ, 0x1f, R22 ;                         /* 0x0000001fff157819 */
                                                                                             /* 0x000fe20000011416 */
        /*0080*/                   IMAD.WIDE R4, R22, R3, c[0x0][0x168] ;                    /* 0x00005a0016047625 */
                                                                                             /* 0x000fc600078e0203 */
        /*0090*/                   ISETP.GE.U32.AND.EX P6, PT, R21, c[0x0][0x17c], PT, P6 ;  /* 0x00005f0015007a0c */
                                                                                             /* 0x000fe20003fc6160 */
        /*00a0*/                   IMAD.WIDE R2, R22, R3, c[0x0][0x170] ;                    /* 0x00005c0016027625 */
                                                                                             /* 0x000fd600078e0203 */
        /*00b0*/              @!P6 LDG.E.SYS R19, [R4] ;                                     /* 0x000000000413e381 */
                                                                                             /* 0x000ea800001ee900 */
        /*00c0*/              @!P6 LDG.E.SYS R20, [R2] ;                                     /* 0x000000000214e381 */
                                                                                             /* 0x000ea200001ee900 */
        /*00d0*/                   IADD3 R24, P3, R22.reuse, 0x100, RZ ;                     /* 0x0000010016187810 */
                                                                                             /* 0x040fe40007f7e0ff */
        /*00e0*/                   IADD3 R23, P2, R22, 0x200, RZ ;                           /* 0x0000020016177810 */
                                                                                             /* 0x000fe40007f5e0ff */
        /*00f0*/                   ISETP.GE.U32.AND P5, PT, R24, c[0x0][0x178], PT ;         /* 0x00005e0018007a0c */
                                                                                             /* 0x000fe20003fa6070 */
        /*0100*/                   IMAD.X R26, RZ, RZ, R21, P3 ;                             /* 0x000000ffff1a7224 */
                                                                                             /* 0x000fe200018e0615 */
        /*0110*/                   IADD3 R24, P0, R22, 0x300, RZ ;                           /* 0x0000030016187810 */
                                                                                             /* 0x000fc40007f1e0ff */
        /*0120*/                   ISETP.GE.U32.AND P4, PT, R23, c[0x0][0x178], PT ;         /* 0x00005e0017007a0c */
                                                                                             /* 0x000fe40003f86070 */
        /*0130*/                   ISETP.GE.U32.AND P3, PT, R24, c[0x0][0x178], PT ;         /* 0x00005e0018007a0c */
                                                                                             /* 0x000fe20003f66070 */
        /*0140*/                   IMAD.X R24, RZ, RZ, R21.reuse, P2 ;                       /* 0x000000ffff187224 */
                                                                                             /* 0x100fe200010e0615 */
        /*0150*/                   IADD3 R23, P1, R22, 0x400, RZ ;                           /* 0x0000040016177810 */
                                                                                             /* 0x000fe20007f3e0ff */
        /*0160*/                   IMAD.X R27, RZ, RZ, R21.reuse, P0 ;                       /* 0x000000ffff1b7224 */
                                                                                             /* 0x100fe200000e0615 */
        /*0170*/                   ISETP.GE.U32.AND.EX P5, PT, R26, c[0x0][0x17c], PT, P5 ;  /* 0x00005f001a007a0c */
                                                                                             /* 0x000fe40003fa6150 */
        /*0180*/                   IADD3 R26, P0, R22, 0x500, RZ ;                           /* 0x00000500161a7810 */
                                                                                             /* 0x000fe40007f1e0ff */
        /*0190*/                   ISETP.GE.U32.AND P2, PT, R23, c[0x0][0x178], PT ;         /* 0x00005e0017007a0c */
                                                                                             /* 0x000fe20003f46070 */
        /*01a0*/                   IMAD.X R23, RZ, RZ, R21.reuse, P1 ;                       /* 0x000000ffff177224 */
                                                                                             /* 0x100fe200008e0615 */
        /*01b0*/                   ISETP.GE.U32.AND.EX P4, PT, R24, c[0x0][0x17c], PT, P4 ;  /* 0x00005f0018007a0c */
                                                                                             /* 0x000fe20003f86140 */
        /*01c0*/                   IMAD.X R24, RZ, RZ, R21, P0 ;                             /* 0x000000ffff187224 */
                                                                                             /* 0x000fe200000e0615 */
        /*01d0*/                   ISETP.GE.U32.AND P1, PT, R26, c[0x0][0x178], PT ;         /* 0x00005e001a007a0c */
                                                                                             /* 0x000fc40003f26070 */
        /*01e0*/                   IADD3 R26, P0, R22, 0x600, RZ ;                           /* 0x00000600161a7810 */
                                                                                             /* 0x000fe40007f1e0ff */
        /*01f0*/                   ISETP.GE.U32.AND.EX P3, PT, R27, c[0x0][0x17c], PT, P3 ;  /* 0x00005f001b007a0c */
                                                                                             /* 0x000fc60003f66130 */
        /*0200*/                   IMAD.X R27, RZ, RZ, R21.reuse, P0 ;                       /* 0x000000ffff1b7224 */
                                                                                             /* 0x100fe200000e0615 */
        /*0210*/                   IADD3 R22, P0, R22, 0x700, RZ ;                           /* 0x0000070016167810 */
                                                                                             /* 0x000fe40007f1e0ff */
        /*0220*/                   P2R R25, PR, RZ, 0x40 ;                                   /* 0x00000040ff197803 */
                                                                                             /* 0x000fe40000000000 */
        /*0230*/                   ISETP.GE.U32.AND P6, PT, R22, c[0x0][0x178], PT ;         /* 0x00005e0016007a0c */
                                                                                             /* 0x000fe20003fc6070 */
        /*0240*/                   IMAD.X R21, RZ, RZ, R21, P0 ;                             /* 0x000000ffff157224 */
                                                                                             /* 0x000fca00000e0615 */
        /*0250*/                   ISETP.GE.U32.AND.EX P6, PT, R21, c[0x0][0x17c], PT, P6 ;  /* 0x00005f0015007a0c */
                                                                                             /* 0x000fc80003fc6160 */
        /*0260*/                   P2R R21, PR, RZ, 0x40 ;                                   /* 0x00000040ff157803 */
                                                                                             /* 0x000fe40000000000 */
        /*0270*/                   ISETP.NE.AND P6, PT, R25, RZ, PT ;                        /* 0x000000ff1900720c */
                                                                                             /* 0x000fe40003fc5270 */
        /*0280*/                   ISETP.GE.U32.AND P0, PT, R26, c[0x0][0x178], PT ;         /* 0x00005e001a007a0c */
                                                                                             /* 0x000fe40003f06070 */
        /*0290*/                   ISETP.GE.U32.AND.EX P2, PT, R23, c[0x0][0x17c], PT, P2 ;  /* 0x00005f0017007a0c */
                                                                                             /* 0x000fe40003f46120 */
        /*02a0*/                   ISETP.GE.U32.AND.EX P1, PT, R24, c[0x0][0x17c], PT, P1 ;  /* 0x00005f0018007a0c */
                                                                                             /* 0x000fe40003f26110 */
        /*02b0*/                   ISETP.GE.U32.AND.EX P0, PT, R27, c[0x0][0x17c], PT, P0 ;  /* 0x00005f001b007a0c */
                                                                                             /* 0x000fe20003f06100 */
        /*02c0*/              @!P5 LDG.E.SYS R17, [R4+0x400] ;                               /* 0x000400000411d381 */
                                                                                             /* 0x000ee800001ee900 */
        /*02d0*/              @!P4 LDG.E.SYS R16, [R4+0x800] ;                               /* 0x000800000410c381 */
                                                                                             /* 0x000f2800001ee900 */
        /*02e0*/              @!P3 LDG.E.SYS R14, [R4+0xc00] ;                               /* 0x000c0000040eb381 */
                                                                                             /* 0x000f6800001ee900 */
        /*02f0*/              @!P2 LDG.E.SYS R12, [R4+0x1000] ;                              /* 0x00100000040ca381 */
                                                                                             /* 0x000ee800001ee900 */
        /*0300*/              @!P1 LDG.E.SYS R10, [R4+0x1400] ;                              /* 0x00140000040a9381 */
                                                                                             /* 0x000ee800001ee900 */
        /*0310*/              @!P0 LDG.E.SYS R8, [R4+0x1800] ;                               /* 0x0018000004088381 */
                                                                                             /* 0x000ee800001ee900 */
        /*0320*/              @!P5 LDG.E.SYS R18, [R2+0x400] ;                               /* 0x000400000212d381 */
                                                                                             /* 0x000ee800001ee900 */
        /*0330*/              @!P4 LDG.E.SYS R15, [R2+0x800] ;                               /* 0x00080000020fc381 */
                                                                                             /* 0x000f2800001ee900 */
        /*0340*/              @!P3 LDG.E.SYS R13, [R2+0xc00] ;                               /* 0x000c0000020db381 */
                                                                                             /* 0x000f6800001ee900 */
        /*0350*/              @!P2 LDG.E.SYS R11, [R2+0x1000] ;                              /* 0x00100000020ba381 */
                                                                                             /* 0x000f2800001ee900 */
        /*0360*/              @!P1 LDG.E.SYS R9, [R2+0x1400] ;                               /* 0x0014000002099381 */
                                                                                             /* 0x000f2800001ee900 */
        /*0370*/              @!P0 LDG.E.SYS R7, [R2+0x1800] ;                               /* 0x0018000002078381 */
                                                                                             /* 0x000f2200001ee900 */
        /*0380*/              @!P6 FFMA R19, R19, c[0x0][0x160], R20 ;                       /* 0x000058001313ea23 */
                                                                                             /* 0x004fe20000000014 */
        /*0390*/                   P2R R20, PR, RZ, 0x40 ;                                   /* 0x00000040ff147803 */
                                                                                             /* 0x000fc40000000000 */
        /*03a0*/                   ISETP.NE.AND P6, PT, R21, RZ, PT ;                        /* 0x000000ff1500720c */
                                                                                             /* 0x000fd80003fc5270 */
        /*03b0*/              @!P6 LDG.E.SYS R0, [R4+0x1c00] ;                               /* 0x001c00000400e381 */
                                                                                             /* 0x00002200001ee900 */
        /*03c0*/                   ISETP.NE.AND P6, PT, R20, RZ, PT ;                        /* 0x000000ff1400720c */
                                                                                             /* 0x000fd80003fc5270 */
        /*03d0*/              @!P6 STG.E.SYS [R2], R19 ;                                     /* 0x000000130200e386 */
                                                                                             /* 0x0001e2000010e900 */
        /*03e0*/                   ISETP.NE.AND P6, PT, R21, RZ, PT ;                        /* 0x000000ff1500720c */
                                                                                             /* 0x000fd80003fc5270 */
        /*03f0*/              @!P6 LDG.E.SYS R6, [R2+0x1c00] ;                               /* 0x001c00000206e381 */
                                                                                             /* 0x00002200001ee900 */
        /*0400*/              @!P5 FFMA R17, R17, c[0x0][0x160], R18 ;                       /* 0x000058001111da23 */
                                                                                             /* 0x008fd00000000012 */
        /*0410*/              @!P5 STG.E.SYS [R2+0x400], R17 ;                               /* 0x000400110200d386 */
                                                                                             /* 0x0001e2000010e900 */
        /*0420*/              @!P4 FFMA R15, R16, c[0x0][0x160], R15 ;                       /* 0x00005800100fca23 */
                                                                                             /* 0x010fe4000000000f */
        /*0430*/              @!P3 FFMA R13, R14, c[0x0][0x160], R13 ;                       /* 0x000058000e0dba23 */
                                                                                             /* 0x020fe4000000000d */
        /*0440*/              @!P2 FFMA R11, R12, c[0x0][0x160], R11 ;                       /* 0x000058000c0baa23 */
                                                                                             /* 0x000fe4000000000b */
        /*0450*/              @!P1 FFMA R9, R10, c[0x0][0x160], R9 ;                         /* 0x000058000a099a23 */
                                                                                             /* 0x000fe40000000009 */
        /*0460*/              @!P0 FFMA R7, R8, c[0x0][0x160], R7 ;                          /* 0x0000580008078a23 */
                                                                                             /* 0x000fe20000000007 */
        /*0470*/              @!P4 STG.E.SYS [R2+0x800], R15 ;                               /* 0x0008000f0200c386 */
                                                                                             /* 0x0001e8000010e900 */
        /*0480*/              @!P3 STG.E.SYS [R2+0xc00], R13 ;                               /* 0x000c000d0200b386 */
                                                                                             /* 0x0001e8000010e900 */
        /*0490*/              @!P2 STG.E.SYS [R2+0x1000], R11 ;                              /* 0x0010000b0200a386 */
                                                                                             /* 0x0001e8000010e900 */
        /*04a0*/              @!P1 STG.E.SYS [R2+0x1400], R9 ;                               /* 0x0014000902009386 */
                                                                                             /* 0x0001e8000010e900 */
        /*04b0*/              @!P0 STG.E.SYS [R2+0x1800], R7 ;                               /* 0x0018000702008386 */
                                                                                             /* 0x0001e2000010e900 */
        /*04c0*/               @P6 EXIT ;                                                    /* 0x000000000000694d */
                                                                                             /* 0x000fea0003800000 */
        /*04d0*/                   FFMA R0, R0, c[0x0][0x160], R6 ;                          /* 0x0000580000007a23 */
                                                                                             /* 0x001fd00000000006 */
        /*04e0*/                   STG.E.SYS [R2+0x1c00], R0 ;                               /* 0x001c000002007386 */
                                                                                             /* 0x000fe2000010e900 */
        /*04f0*/                   EXIT ;                                                    /* 0x000000000000794d */
                                                                                             /* 0x000fea0003800000 */
        /*0500*/                   BRA 0x500;                                                /* 0xfffffff000007947 */
                                                                                             /* 0x000fc0000383ffff */
        /*0510*/                   NOP;                                                      /* 0x0000000000007918 */
                                                                                             /* 0x000fc00000000000 */
        /*0520*/                   NOP;                                                      /* 0x0000000000007918 */
                                                                                             /* 0x000fc00000000000 */
        /*0530*/                   NOP;                                                      /* 0x0000000000007918 */
                                                                                             /* 0x000fc00000000000 */
        /*0540*/                   NOP;                                                      /* 0x0000000000007918 */
                                                                                             /* 0x000fc00000000000 */
        /*0550*/                   NOP;                                                      /* 0x0000000000007918 */
                                                                                             /* 0x000fc00000000000 */
        /*0560*/                   NOP;                                                      /* 0x0000000000007918 */
                                                                                             /* 0x000fc00000000000 */
        /*0570*/                   NOP;                                                      /* 0x0000000000007918 */
                                                                                             /* 0x000fc00000000000 */
    ....................................

Update 13 Aug 2020

NVC++ is making the transition to single-pass PTX compilation. The stdpar::current_platform() builtin function yields a value corresponding to the architecture of the function definition being generated. In essence, the compiler becomes multi-pass once a stdpar::current_platform() invocation is encountered: the frontend parses the kernel once for each specified target architecture, creating a family of function definitions. This multi-pass behavior implies some constraints to prevent ODR calamities. Consider a function with a deduced return type:

template<typename type_t>
auto func(type_t x) {
  if constexpr(stdpar::current_platform() <= stdpar::sm_35) {
    return 100;

  } else {
    return 3.14;
  }
}

void bar() {
  // Instantiate the func<int> definition immediately to deduce the return
  // type.
  auto x = func(5);
}

Because the current_platform call forks the parse, there are multiple definitions of func<int>, each with a different deduced return type. The frontend has no context to distinguish between them, so it chooses one arbitrarily, instantiates it, gets the deduced type, then establishes that return type as a constraint on the other definitions.

Allowing multiple definitions per device architecture imposes downstream constraints that are hard to reason about and hard to code for. The worst part of the Circle source code is the attachment of multiple function definitions to each function declaration, which I added only to support host- and device-tagged overloads. I ended up disabling PTX support in the compiler because __CUDA_ARCH__ abuse in Thrust was too much to deal with, but the code paths remain, and those multiple function definition slots are a continual burden. Implicit forks bring this pain, plus a lot more. It's a trap.

I never considered implicit forks, because compile-time side effects are a core feature of the Circle language. Multiple passes over a function body would produce multiple (and unwanted) side effects. Consider:

void func() {
  @meta printf("Before current_platform query.\n");
  if constexpr(stdpar::current_platform() < stdpar::sm_35) {
    foo();
  } else {
    bar();
  }
  @meta printf("After current_platform query.\n");
}

There's no compile-time control flow here, so everyone expects one "before" and one "after" message at the terminal during translation. However, by forking the parser when a current_platform builtin is encountered, we'd get one "before" message but an "after" message for each architecture targeted. For terminal output this is just a curiosity, but forking a function that, say, performs file i/o would result in ill-formed programs. Standard C++ doesn't currently allow compile-time side effects (although they are achievable within the Standard), but it's inevitable that it will. Transient dynamic memory allocations are already allowed in constexpr contexts, but they're fairly limited, and non-transient allocations will be added due to popular demand in the future. Same with file i/o. These sorts of advanced metaprogramming features are exactly what you'd want for generating GPU kernels, so it's strange to make your kernel-targeting logic incompatible with them.

Circle uses a single pass everywhere, so adding a PTX extension to the compiler only required injecting new implicit declarations (for textures and surfaces, and the threadIdx and blockIdx variables), chevron launch syntax, and the emission of a separate LLVM module for each device target in the backend. I spent about three weeks on this and was able to compile all the Toolkit samples, including nbody and fluidsGL. It went quickly because the only change to the core language was the codegen bit.

codegen is a semantic category that signifies an expression is constant during code generation. All constexpr expressions are implicitly codegen, but the reverse is not true. When the compiler starts up, an nvvm_arch_t enum is defined with one enumerator per architecture specified on the command line.

$ circle -cuda-path /usr/local/cuda-10.0 -sm_35 -sm_52 -sm_61 -sm_70 kernel.cu -lcudart

Generates the enum:

enum class nvvm_arch_t {
  sm_35 = 35,
  sm_52 = 52,
  sm_61 = 61,
  sm_70 = 70,
};

There's an implicit codegen declaration:

@codegen extern const nvvm_arch_t __nvvm_arch;

The object is undefined while the frontend executes. When the AST is traversed during code generation and lowered to LLVM IR, __nvvm_arch is substituted with the enumerator corresponding to the current architecture. The predicate expressions of if-codegen statements are evaluated at this point, and non-taken branches are pruned.

If you only use if-codegen statements, there's no difference in convenience between Circle and NVC++, but Circle still makes only a single frontend pass over all source. The difference lies in how all architectures are visited: Circle encourages the programmer to use reflection to visit every enumerator of nvvm_arch_t, then enter an if-codegen guard to prevent a quadratic PTX explosion, and emit target-specific code in that branch. NVC++ performs an implicit frontend fork to create a separate version of the function for each architecture.

NVC++ could, and should, provide reflective dispatch as the means of implementing a switch over all architectures. This doesn't require general-purpose reflection. It can be implemented with compiler intrinsics, which would have the advantage of brevity over Circle's method, without NVC++'s implicit fork and multiple definitions.

Consider the Circle reflective dispatch:

template<typename func_t, typename... args_t>
auto sm_dispatch(func_t& f, args_t&&... args) {
  @meta for enum(nvvm_arch_t arch : nvvm_arch_t) {
    // __nvvm_arch is a codegen variable. arch is a constexpr expression. 
    // This block will be pruned during codegen on arch mismatch.
    if codegen(arch == __nvvm_arch) {
      return f.template operator()<arch>(std::forward<args_t>(args)...);
    }
  }
}

Since C++ doesn't support passing overload sets to functions, func_t has to be a function object with a templated call operator. We can also define an sm_dispatch that passes tags and is compatible with generic lambdas.

How to tag

I believe a collection of inheritance-organized tags is the best way to subscribe a function to a set of architectures with common functionality. C++ requires-clauses don't work here, because they provide no mechanism for resolving ambiguity during overload resolution: either the concept is satisfied or it isn't. But multiple candidates can be ranked by the inheritance relationships of their parameters: if one candidate has a parameter of class type D that inherits from B, it's a better candidate than one with a parameter of class type B.

#include <cstdio>

// Define arch tags using inheritance.
template<int ver> struct sm_t;
template<> struct sm_t<20> { };
template<> struct sm_t<30> : sm_t<20> { };
template<> struct sm_t<35> : sm_t<30> { };
template<> struct sm_t<52> : sm_t<35> { };
template<> struct sm_t<61> : sm_t<52> { };

typedef sm_t<20> sm_20_t;
typedef sm_t<30> sm_30_t;
typedef sm_t<35> sm_35_t;
typedef sm_t<52> sm_52_t;
typedef sm_t<61> sm_61_t;

constexpr sm_20_t sm_20;
constexpr sm_30_t sm_30;
constexpr sm_35_t sm_35;
constexpr sm_52_t sm_52;
constexpr sm_61_t sm_61; 
// Etc.

template<nvvm_arch_t arch>
using sm_arch_t = sm_t<(int)arch>;

template<nvvm_arch_t arch>
constexpr sm_arch_t<arch> sm_arch;

void func(double x, sm_30_t) { puts("sm_30"); }
void func(double x, sm_52_t) { puts("sm_52"); }

int main() {
  // Ill-formed. No viable candidates.
  // func(1.0, sm_20);

  // matches sm_30 exactly.
  func(1.0, sm_30);

  // Only matches sm_30.
  func(1.0, sm_35);

  // matches sm_52 exactly.
  func(1.0, sm_52);

  // Matches both sm_30 and sm_52.
  // sm_52 is the better fit, because it's more derived.
  func(1.0, sm_61);
}

void func_dispatch(double x) {
  @meta for enum(auto arch : nvvm_arch_t) {
    // Visit each arch specified at the command line.

    if codegen(__nvvm_arch == arch) {
      // if-codegen prevents generation of arch-mismatched code.

      // Call a tagged function. Overload resolution uses the same ranking
      // principles as above to find the best match.
      return func(x, sm_arch<arch>);
    }
  }
}

How to dispatch

I used reflective dispatch to port moderngpu from nvcc to Circle's single-pass frontend in a couple of hours. All the __CUDA_ARCH__ mess went away, and I had a lot more flexibility in associating tuning parameters with architectures.

We can also define a reflective dispatch as a macro that supports overload sets (i.e., takes a function name rather than a function object):

#define SM_DISPATCH(name, ...)                     \
  @meta for enum(nvvm_arch_t arch : nvvm_arch_t) { \
    if codegen(arch == __nvvm_arch) {              \
      name<arch>(__VA_ARGS__);                     \
    }                                              \
  }

The path for NVC++: provide sm-dispatch as builtin operators. Builtins address two things lacking in Standard C++: reflection and overload-set parameters. This approach is fast to compile, discourages ODR creativity, and avoids the own-goal of implicitly forked parsing.

inline int warp_scan(sm_20_t, float* data, int lane_mask) {
  // A generic warp scan.
}
inline int warp_scan(sm_35_t, float* data, int lane_mask) {
  // 35+
}
inline int warp_scan(sm_61_t, float* data, int lane_mask) {
  // 61+
}

template<int NT>
__global__ void kernel() {
  __shared__ float shared[NT];
  int lane_mask = blah;
  ...
  int x = __sm_dispatch_tag(warp_scan, shared, lane_mask);
}

Provide a family of these builtins.

Also provide a form that expands its operand statement once for each targeted architecture:

__sm_arch_all(printf("%d\n", (int)__nvvm_arch));
__sm_arch_all(return my_weird_template<args1, args2, __nvvm_arch>(x, y, z));

This gets compiled as:

if codegen(__nvvm_arch == nvvm_arch_t::sm_35)
  printf("%d\n", (int)nvvm_arch_t::sm_35);
else if codegen(__nvvm_arch == nvvm_arch_t::sm_52)
  printf("%d\n", (int)nvvm_arch_t::sm_52);
else if codegen(__nvvm_arch == nvvm_arch_t::sm_61)
  printf("%d\n", (int)nvvm_arch_t::sm_61);
else if codegen(__nvvm_arch == nvvm_arch_t::sm_70)
  printf("%d\n", (int)nvvm_arch_t::sm_70);

and

if codegen(__nvvm_arch == nvvm_arch_t::sm_35)
  return my_weird_template<args1, args2, nvvm_arch_t::sm_35>(x, y, z);
else if codegen(__nvvm_arch == nvvm_arch_t::sm_52)
  return my_weird_template<args1, args2, nvvm_arch_t::sm_52>(x, y, z);
else if codegen(__nvvm_arch == nvvm_arch_t::sm_61)
  return my_weird_template<args1, args2, nvvm_arch_t::sm_61>(x, y, z);
else if codegen(__nvvm_arch == nvvm_arch_t::sm_70)
  return my_weird_template<args1, args2, nvvm_arch_t::sm_70>(x, y, z);

Inside this intrinsic, references to the __nvvm_arch codegen variable are resolved at frontend time by setting it to each nvvm_arch_t enumerator in sequence. This is your compile-time for loop, without needing a dedicated language feature.

As a CUDA algorithms developer (check out the Thrust set functions; those bizarre balanced-path things are mine) and as someone who wrote an entire C++ compiler from scratch that already supports both PTX and the entire SPIR-V stack, I believe a religiously single-pass compiler with intrinsics to address the lack of reflection and overload-set parameters in C++ is the dream solution.