- fix a bug when compiling code with arch < sm_75
- add tv::TensorView capture support in nvrtc inliner
- add better error support for cumm nvrtc
- fix a bug in CummNVRTCModule: flag order must be preserved
- fix a small bug in tv::Tensor::empty.
- fix a small bug in nvrtc tuple.
- fix a small bug in nvrtc
- fix a compile problem in msvc
- fix unsupported arch in cuda 12.0
- fix compile problem
- fix some compile problems in cpu-only builds
- change version to rebuild due to pypi server problem
- Add cuda 12.0
- Add int8 inference for sparse conv
- Fix some problems in cuda 12.0
- Fix bug in ConvProblem introduced in 0.3.6
- Add int64 support for TensorGeneric
- Add flags for H100 and RTX 4090
- fix nvrtc launch problem when smem size is large
- fix nvrtc constant variable parse problem
- Change gemm/conv main function to split version
- Fix a problem in CompileInfo
- Change nlohmann json to 3.11.2
- Fix build problem in cuda 10.2
- Fix some bugs related to nvrtc
- Fix cpu build problem
- Add Ampere support: faster fp16, faster tf32 and much faster int8 kernels on Ampere GPUs.
- Add nvrtc support for conv kernel.
- drop python 3.6 support.
- BREAKING CHANGE: change dtype enum value for some important reason.
- Fix missing sm37 in supported arch
- add sm37 for cu102.
- add compile info (cuda arch) for better error information.
- Fix a small bug that incorrectly limited arch of simt to sm52.
- add cpu support for CUDAKernelTimer.
- add non-contiguous support for tv::Tensor.
- add tsl hash map, refine cuda hash impl.
- raise an error instead of exiting the program when a cuda error occurs.
- gemm kernels now use strides, which enables gemm with non-contiguous tensors
- Fix bugs in gemm kernel when using non-contiguous operands.
- Fix bugs for implicit gemm
- add support for python 3.6, but cudasim doesn't support python 3.6.
- add profile tool for all gemm and conv kernels.
- Fix some bugs in implicit gemm
- add implicit gemm algorithm for all kinds of convolution with kernel volume <= 32. This algorithm is very fast with float16.
- add cuda 11.3 build
- remove python 3.6 support
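
The stride-based gemm entry above can be illustrated with a minimal pure-Python sketch. This is not cumm's kernel code or API: `strided_gemm`, its parameters, and the sample buffers are hypothetical names chosen for illustration. It only shows the general idea that indexing through explicit strides lets a matrix multiply read a non-contiguous view without copying it first.

```python
def strided_gemm(a, a_strides, b, b_strides, m, n, k):
    """Multiply an (m, k) view of `a` by a (k, n) view of `b`.

    `a` and `b` are flat lists. Element (i, j) of a view lives at
    index i * strides[0] + j * strides[1], with strides counted in
    elements (hypothetical convention for this sketch).
    """
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):
                a_val = a[i * a_strides[0] + p * a_strides[1]]
                b_val = b[p * b_strides[0] + j * b_strides[1]]
                acc += a_val * b_val
            c[i][j] = acc
    return c


# A 2x4 row-major buffer; taking every second column gives the
# non-contiguous 2x2 view [[1, 2], [3, 4]] with strides (4, 2).
a_buf = [1, 0, 2, 0,
         3, 0, 4, 0]

# A contiguous 2x2 buffer [[5, 6], [7, 8]] with strides (2, 1).
b_buf = [5, 6, 7, 8]

result = strided_gemm(a_buf, (4, 2), b_buf, (2, 1), 2, 2, 2)

# Swapping the strides of b reads the same buffer as its transpose,
# again without copying any data.
result_bt = strided_gemm(a_buf, (4, 2), b_buf, (1, 2), 2, 2, 2)
```

Passing strides instead of assuming a packed layout is what allows sliced or transposed operands to be consumed directly; a real GPU kernel would apply the same index arithmetic per thread rather than in nested Python loops.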