rocBLAS Extension

Level-1 Extension functions support the ILP64 API. For more information on these _64 functions, refer to section ILP64 Interface.

rocblas_axpy_ex + batched, strided_batched

rocblas_status rocblas_axpy_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, const void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_datatype execution_type)

BLAS EX API

axpy_ex computes constant alpha multiplied by vector x, plus vector y.

y := alpha * x + y

Currently supported datatypes are as follows:

alpha_type	x_type	y_type	execution_type
bf16_r	bf16_r	bf16_r	f32_r
f32_r	bf16_r	bf16_r	f32_r
f16_r	f16_r	f16_r	f16_r
f16_r	f16_r	f16_r	f32_r
f32_r	f16_r	f16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f64_c	f64_c	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in x and y.
alpha – [in] device pointer or host pointer to specify the scalar alpha.
alpha_type – [in] [rocblas_datatype] specifies the datatype of alpha.
x – [in] device pointer storing vector x.
x_type – [in] [rocblas_datatype] specifies the datatype of vector x.
incx – [in] [rocblas_int] specifies the increment for the elements of x.
y – [inout] device pointer storing vector y.
y_type – [in] [rocblas_datatype] specifies the datatype of vector y.
incy – [in] [rocblas_int] specifies the increment for the elements of y.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.
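
The split between storage types and execution_type can be pictured with a plain host-side sketch. This is not rocBLAS code: axpy_reference is a hypothetical helper, the Storage template parameter stands in for x_type and y_type, Compute stands in for execution_type, and positive increments are assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side reference for y := alpha * x + y (not a rocBLAS call).
// Storage models x_type/y_type; Compute models execution_type.
template <typename Storage, typename Compute>
void axpy_reference(std::int64_t n, Compute alpha,
                    const Storage* x, std::int64_t incx,
                    Storage* y, std::int64_t incy)
{
    if(n <= 0)
        return; // empty vectors: y is left untouched
    assert(incx > 0 && incy > 0); // assumption of this sketch only

    for(std::int64_t i = 0; i < n; ++i)
    {
        // Promote both operands to the compute type before the arithmetic.
        const Compute xi = static_cast<Compute>(x[i * incx]);
        const Compute yi = static_cast<Compute>(y[i * incy]);
        // Round back to the storage type only when writing the result.
        y[i * incy] = static_cast<Storage>(alpha * xi + yi);
    }
}
```

Calling axpy_reference<float, double>(...) keeps x and y in single precision while the multiply-add runs in double precision, which loosely mirrors a row of the table where execution_type is wider than x_type and y_type.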

rocblas_status rocblas_axpy_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, const void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_int batch_count, rocblas_datatype execution_type)

BLAS EX API

axpy_batched_ex computes constant alpha multiplied by vector x, plus vector y over a set of batched vectors.

y := alpha * x + y

Currently supported datatypes are as follows:

alpha_type	x_type	y_type	execution_type
bf16_r	bf16_r	bf16_r	f32_r
f32_r	bf16_r	bf16_r	f32_r
f16_r	f16_r	f16_r	f16_r
f16_r	f16_r	f16_r	f32_r
f32_r	f16_r	f16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f64_c	f64_c	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in each x_i and y_i.
alpha – [in] device pointer or host pointer to specify the scalar alpha.
alpha_type – [in] [rocblas_datatype] specifies the datatype of alpha.
x – [in] device array of device pointers storing each vector x_i.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i.
y – [inout] device array of device pointers storing each vector y_i.
y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i.
incy – [in] [rocblas_int] specifies the increment for the elements of each y_i.
batch_count – [in] [rocblas_int] number of instances in the batch.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

rocblas_status rocblas_axpy_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stridex, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_datatype execution_type)

BLAS EX API

axpy_strided_batched_ex computes constant alpha multiplied by vector x, plus vector y over a set of strided batched vectors.

y := alpha * x + y

Currently supported datatypes are as follows:

alpha_type	x_type	y_type	execution_type
bf16_r	bf16_r	bf16_r	f32_r
f32_r	bf16_r	bf16_r	f32_r
f16_r	f16_r	f16_r	f16_r
f16_r	f16_r	f16_r	f32_r
f32_r	f16_r	f16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f64_c	f64_c	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in each x_i and y_i.
alpha – [in] device pointer or host pointer to specify the scalar alpha.
alpha_type – [in] [rocblas_datatype] specifies the datatype of alpha.
x – [in] device pointer to the first vector x_1.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i.
stridex – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_(i+1)). There are no restrictions placed on stridex. However, ensure that stridex is of appropriate size. For a typical case this means stridex >= n * incx.
y – [inout] device pointer to the first vector y_1.
y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i.
incy – [in] [rocblas_int] specifies the increment for the elements of each y_i.
stridey – [in] [rocblas_stride] stride from the start of one vector (y_i) to the next one (y_(i+1)). There are no restrictions placed on stridey. However, ensure that stridey is of appropriate size. For a typical case this means stridey >= n * incy.
batch_count – [in] [rocblas_int] number of instances in the batch.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.
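
The strided batched addressing described by stridex and stridey can be sketched on the host as follows. This is an illustrative reference only, not rocBLAS code: axpy_strided_batched_reference is a hypothetical name, all data is float, increments are assumed positive, and batch indices are 0-based here whereas the text numbers vectors from x_1.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side reference: y_b := alpha * x_b + y_b for every batch b.
// Vector b of x starts at x + b * stridex; vector b of y starts at y + b * stridey.
void axpy_strided_batched_reference(std::int64_t n, float alpha,
                                    const float* x, std::int64_t incx, std::int64_t stridex,
                                    float* y, std::int64_t incy, std::int64_t stridey,
                                    std::int64_t batch_count)
{
    if(n <= 0 || batch_count <= 0)
        return; // nothing to update
    assert(incx > 0 && incy > 0); // assumption of this sketch only

    for(std::int64_t b = 0; b < batch_count; ++b)
    {
        const float* xb = x + b * stridex; // start of vector b in x
        float*       yb = y + b * stridey; // start of vector b in y
        for(std::int64_t i = 0; i < n; ++i)
            yb[i * incy] = alpha * xb[i * incx] + yb[i * incy];
    }
}
```

With n = 2 and incy = 1, a stridey of 3 leaves one untouched padding element between consecutive y vectors, which is the "typical case" stridey >= n * incy from the parameter description.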

axpy_ex, axpy_batched_ex, and axpy_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_dot_ex + batched, strided_batched

rocblas_status rocblas_dot_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)

BLAS EX API

dot_ex performs the dot product of vectors x and y.

result = x * y;

dotc_ex performs the dot product of the conjugate of complex vector x and complex vector y

result = conjugate (x) * y;

Currently supported datatypes are as follows:

x_type	y_type	result_type	execution_type
f16_r	f16_r	f16_r	f16_r
f16_r	f16_r	f16_r	f32_r
bf16_r	bf16_r	bf16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f64_c	f64_c	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in x and y.
x – [in] device pointer storing vector x.
x_type – [in] [rocblas_datatype] specifies the datatype of vector x.
incx – [in] [rocblas_int] specifies the increment for the elements of x.
y – [in] device pointer storing vector y.
y_type – [in] [rocblas_datatype] specifies the datatype of vector y.
incy – [in] [rocblas_int] specifies the increment for the elements of y.
result – [inout] device pointer or host pointer to store the dot product. The result is 0.0 if n <= 0.
result_type – [in] [rocblas_datatype] specifies the datatype of the result.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.
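
As a rough host-side illustration of the dot product and of accumulating in a wider execution type, consider the sketch below. It is not rocBLAS code: dot_reference is a hypothetical helper, Storage models x_type, y_type, and result_type, Compute models execution_type, and positive increments are assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side reference for result = sum_i x[i] * y[i] (not a rocBLAS call).
template <typename Storage, typename Compute>
Storage dot_reference(std::int64_t n,
                      const Storage* x, std::int64_t incx,
                      const Storage* y, std::int64_t incy)
{
    Compute acc = Compute(0); // accumulator lives in the compute type
    if(n <= 0)
        return static_cast<Storage>(acc); // matches "result is 0.0 if n <= 0"
    assert(incx > 0 && incy > 0); // assumption of this sketch only

    for(std::int64_t i = 0; i < n; ++i)
        acc += static_cast<Compute>(x[i * incx]) * static_cast<Compute>(y[i * incy]);

    return static_cast<Storage>(acc); // single rounding back to the result type
}
```

Accumulating in Compute and rounding once at the end is the design point the f16_r/f32_r and bf16_r/f32_r rows of the table suggest; the sketch uses float and double only because standard C++ has no built-in half types.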

rocblas_status rocblas_dot_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)

BLAS EX API

dot_batched_ex performs a batch of dot products of vectors x and y.

result_i = x_i * y_i;

dotc_batched_ex performs a batch of dot products of the conjugate of complex vector x and complex vector y

result_i = conjugate (x_i) * y_i;

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors, for i = 1, …, batch_count

Currently supported datatypes are as follows:

x_type	y_type	result_type	execution_type
f16_r	f16_r	f16_r	f16_r
f16_r	f16_r	f16_r	f32_r
bf16_r	bf16_r	bf16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f64_c	f64_c	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in each x_i and y_i.
x – [in] device array of device pointers storing each vector x_i.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i.
y – [in] device array of device pointers storing each vector y_i.
y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i.
incy – [in] [rocblas_int] specifies the increment for the elements of each y_i.
batch_count – [in] [rocblas_int] number of instances in the batch.
result – [inout] device array or host array of batch_count size to store the dot product of each batch instance. Each element is 0.0 if n <= 0.
result_type – [in] [rocblas_datatype] specifies the datatype of the result.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.
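
The batched layout, where x and y are arrays of pointers and result holds one value per instance, might be pictured with this host-side sketch. It is not rocBLAS code: dot_batched_reference is a hypothetical name, everything lives in host memory as float, increments are assumed positive, and batch indices are 0-based.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side reference: result[b] = sum_i x[b][i] * y[b][i].
// x and y are arrays of batch_count pointers, one pointer per vector.
void dot_batched_reference(std::int64_t n,
                           const float* const* x, std::int64_t incx,
                           const float* const* y, std::int64_t incy,
                           std::int64_t batch_count,
                           float* result)
{
    for(std::int64_t b = 0; b < batch_count; ++b)
    {
        float acc = 0.0f; // stays 0.0 when n <= 0
        if(n > 0)
            assert(incx > 0 && incy > 0); // assumption of this sketch only
        for(std::int64_t i = 0; i < n; ++i)
            acc += x[b][i * incx] * y[b][i * incy];
        result[b] = acc; // one contiguous result per batch instance
    }
}
```

Unlike the strided variant, the vectors here need not sit at regular offsets from one another; each instance is located through its own pointer.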

rocblas_status rocblas_dot_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)

BLAS EX API

dot_strided_batched_ex performs a batch of dot products of vectors x and y.

result_i = x_i * y_i;

dotc_strided_batched_ex performs a batch of dot products of the conjugate of complex vector x and complex vector y

result_i = conjugate (x_i) * y_i;

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors, for i = 1, …, batch_count

Currently supported datatypes are as follows:

x_type	y_type	result_type	execution_type
f16_r	f16_r	f16_r	f16_r
f16_r	f16_r	f16_r	f32_r
bf16_r	bf16_r	bf16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f64_c	f64_c	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in each x_i and y_i.
x – [in] device pointer to the first vector (x_1) in the batch.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i.
stride_x – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_(i+1)).
y – [in] device pointer to the first vector (y_1) in the batch.
y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i.
incy – [in] [rocblas_int] specifies the increment for the elements of each y_i.
stride_y – [in] [rocblas_stride] stride from the start of one vector (y_i) to the next one (y_(i+1)).
batch_count – [in] [rocblas_int] number of instances in the batch.
result – [inout] device array or host array of batch_count size to store the dot product of each batch instance. Each element is 0.0 if n <= 0.
result_type – [in] [rocblas_datatype] specifies the datatype of the result.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

dot_ex, dot_batched_ex, and dot_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_dotc_ex + batched, strided_batched

rocblas_status rocblas_dotc_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)

BLAS EX API

dot_ex performs the dot product of vectors x and y.

result = x * y;

dotc_ex performs the dot product of the conjugate of complex vector x and complex vector y

result = conjugate (x) * y;

Currently supported datatypes are as follows:

x_type	y_type	result_type	execution_type
f16_r	f16_r	f16_r	f16_r
f16_r	f16_r	f16_r	f32_r
bf16_r	bf16_r	bf16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f64_c	f64_c	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in x and y.
x – [in] device pointer storing vector x.
x_type – [in] [rocblas_datatype] specifies the datatype of vector x.
incx – [in] [rocblas_int] specifies the increment for the elements of x.
y – [in] device pointer storing vector y.
y_type – [in] [rocblas_datatype] specifies the datatype of vector y.
incy – [in] [rocblas_int] specifies the increment for the elements of y.
result – [inout] device pointer or host pointer to store the dot product. The result is 0.0 if n <= 0.
result_type – [in] [rocblas_datatype] specifies the datatype of the result.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.
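
The difference between the plain and the conjugated dot product only shows up for complex data. The host-side sketch below contrasts the two; it is not rocBLAS code, the names dotu_reference and dotc_reference are hypothetical, and std::complex<float> is used as a stand-in for f32_c with positive increments assumed.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>

using cfloat = std::complex<float>; // stand-in for f32_c in this sketch

// Hypothetical reference for result = x * y (no conjugation).
cfloat dotu_reference(std::int64_t n, const cfloat* x, std::int64_t incx,
                      const cfloat* y, std::int64_t incy)
{
    cfloat acc(0.0f, 0.0f);
    if(n > 0)
        assert(incx > 0 && incy > 0); // assumption of this sketch only
    for(std::int64_t i = 0; i < n; ++i)
        acc += x[i * incx] * y[i * incy];
    return acc;
}

// Hypothetical reference for result = conjugate(x) * y.
cfloat dotc_reference(std::int64_t n, const cfloat* x, std::int64_t incx,
                      const cfloat* y, std::int64_t incy)
{
    cfloat acc(0.0f, 0.0f);
    if(n > 0)
        assert(incx > 0 && incy > 0); // assumption of this sketch only
    for(std::int64_t i = 0; i < n; ++i)
        acc += std::conj(x[i * incx]) * y[i * incy]; // only x is conjugated
    return acc;
}
```

For x = (1+2i, i) and y = (3+4i, 2), the unconjugated product is -5+12i while the conjugated one is 11-4i, so the two routines are not interchangeable for complex inputs. For real inputs the conjugate is a no-op and both give the same value.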

rocblas_status rocblas_dotc_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)

BLAS EX API

dot_batched_ex performs a batch of dot products of vectors x and y.

result_i = x_i * y_i;

dotc_batched_ex performs a batch of dot products of the conjugate of complex vector x and complex vector y

result_i = conjugate (x_i) * y_i;

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors, for i = 1, …, batch_count

Currently supported datatypes are as follows:

x_type	y_type	result_type	execution_type
f16_r	f16_r	f16_r	f16_r
f16_r	f16_r	f16_r	f32_r
bf16_r	bf16_r	bf16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f64_c	f64_c	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in each x_i and y_i.
x – [in] device array of device pointers storing each vector x_i.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i.
y – [in] device array of device pointers storing each vector y_i.
y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i.
incy – [in] [rocblas_int] specifies the increment for the elements of each y_i.
batch_count – [in] [rocblas_int] number of instances in the batch.
result – [inout] device array or host array of batch_count size to store the dot product of each batch instance. Each element is 0.0 if n <= 0.
result_type – [in] [rocblas_datatype] specifies the datatype of the result.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

rocblas_status rocblas_dotc_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, const void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count, void *result, rocblas_datatype result_type, rocblas_datatype execution_type)

BLAS EX API

dot_strided_batched_ex performs a batch of dot products of vectors x and y.

result_i = x_i * y_i;

dotc_strided_batched_ex performs a batch of dot products of the conjugate of complex vector x and complex vector y

result_i = conjugate (x_i) * y_i;

where (x_i, y_i) is the i-th instance of the batch. x_i and y_i are vectors, for i = 1, …, batch_count

Currently supported datatypes are as follows:

x_type	y_type	result_type	execution_type
f16_r	f16_r	f16_r	f16_r
f16_r	f16_r	f16_r	f32_r
bf16_r	bf16_r	bf16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f64_c	f64_c	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in each x_i and y_i.
x – [in] device pointer to the first vector (x_1) in the batch.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i.
stride_x – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_(i+1)).
y – [in] device pointer to the first vector (y_1) in the batch.
y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i.
incy – [in] [rocblas_int] specifies the increment for the elements of each y_i.
stride_y – [in] [rocblas_stride] stride from the start of one vector (y_i) to the next one (y_(i+1)).
batch_count – [in] [rocblas_int] number of instances in the batch.
result – [inout] device array or host array of batch_count size to store the dot product of each batch instance. Each element is 0.0 if n <= 0.
result_type – [in] [rocblas_datatype] specifies the datatype of the result.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

dotc_ex, dotc_batched_ex, and dotc_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_nrm2_ex + batched, strided_batched

rocblas_status rocblas_nrm2_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, void *results, rocblas_datatype result_type, rocblas_datatype execution_type)

BLAS EX API

nrm2_ex computes the Euclidean norm of a real or complex vector.

result := sqrt( x'*x ) for real vectors
result := sqrt( x**H*x ) for complex vectors

Currently supported datatypes are as follows:

x_type	result_type	execution_type
bf16_r	bf16_r	f32_r
f16_r	f16_r	f32_r
f32_r	f32_r	f32_r
f64_r	f64_r	f64_r
f32_c	f32_r	f32_r
f64_c	f64_r	f64_r

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in x.
x – [in] device pointer storing vector x.
x_type – [in] [rocblas_datatype] specifies the datatype of the vector x.
incx – [in] [rocblas_int] specifies the increment for the elements of x.
results – [inout] device pointer or host pointer to store the nrm2 result. The result is 0.0 if n <= 0 or incx <= 0.
result_type – [in] [rocblas_datatype] specifies the datatype of the result.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.
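
A minimal host-side sketch of the two formulas follows. It is not rocBLAS code: nrm2_real_reference and nrm2_complex_reference are hypothetical names, and the sum of squares is accumulated naively, whereas a production implementation may rescale to avoid overflow or underflow.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstdint>

// Hypothetical reference for result := sqrt(x' * x) on a real vector.
template <typename T>
T nrm2_real_reference(std::int64_t n, const T* x, std::int64_t incx)
{
    if(n <= 0 || incx <= 0)
        return T(0); // matches "0.0 if n <= 0 or incx <= 0"
    assert(x != nullptr);
    T sum = T(0);
    for(std::int64_t i = 0; i < n; ++i)
        sum += x[i * incx] * x[i * incx];
    return std::sqrt(sum);
}

// Hypothetical reference for result := sqrt(x**H * x) on a complex vector.
// The result is real even though the input is complex.
template <typename T>
T nrm2_complex_reference(std::int64_t n, const std::complex<T>* x, std::int64_t incx)
{
    if(n <= 0 || incx <= 0)
        return T(0);
    assert(x != nullptr);
    T sum = T(0);
    for(std::int64_t i = 0; i < n; ++i)
        sum += std::norm(x[i * incx]); // std::norm gives |z|^2 = re^2 + im^2
    return std::sqrt(sum);
}
```

The complex overload returning a real T corresponds to the table rows where x_type is f32_c or f64_c and the result is f32_r or f64_r.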

rocblas_status rocblas_nrm2_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_int batch_count, void *results, rocblas_datatype result_type, rocblas_datatype execution_type)

BLAS EX API

nrm2_batched_ex computes the Euclidean norm over a batch of real or complex vectors.

result_i := sqrt( x_i'*x_i ) for real vectors x_i, for i = 1, ..., batch_count
result_i := sqrt( x_i**H*x_i ) for complex vectors x_i, for i = 1, ..., batch_count

Currently supported datatypes are as follows:

x_type	result_type	execution_type
bf16_r	bf16_r	f32_r
f16_r	f16_r	f32_r
f32_r	f32_r	f32_r
f64_r	f64_r	f64_r
f32_c	f32_r	f32_r
f64_c	f64_r	f64_r

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] number of elements in each x_i.
x – [in] device array of device pointers storing each vector x_i.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.
batch_count – [in] [rocblas_int] number of instances in the batch.
results – [out] device pointer or host pointer to array of batch_count size for nrm2 results. Each element is 0.0 if n <= 0 or incx <= 0.
result_type – [in] [rocblas_datatype] specifies the datatype of the result.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

rocblas_status rocblas_nrm2_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count, void *results, rocblas_datatype result_type, rocblas_datatype execution_type)

BLAS EX API

nrm2_strided_batched_ex computes the Euclidean norm over a batch of real or complex vectors.

result_i := sqrt( x_i'*x_i ) for real vectors x_i, for i = 1, ..., batch_count
result_i := sqrt( x_i**H*x_i ) for complex vectors x_i, for i = 1, ..., batch_count

Currently supported datatypes are as follows:

x_type	result_type	execution_type
bf16_r	bf16_r	f32_r
f16_r	f16_r	f32_r
f32_r	f32_r	f32_r
f64_r	f64_r	f64_r
f32_c	f32_r	f32_r
f64_c	f64_r	f64_r

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] number of elements in each x_i.
x – [in] device pointer to the first vector x_1.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.
stride_x – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_(i+1)). There are no restrictions placed on stride_x. However, ensure that stride_x is of appropriate size. For a typical case this means stride_x >= n * incx.
batch_count – [in] [rocblas_int] number of instances in the batch.
results – [out] device pointer or host pointer to array for storing contiguous batch_count results. Each element is 0.0 if n <= 0 or incx <= 0.
result_type – [in] [rocblas_datatype] specifies the datatype of the result.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

nrm2_ex, nrm2_batched_ex, and nrm2_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_rot_ex + batched, strided_batched

rocblas_status rocblas_rot_ex(rocblas_handle handle, rocblas_int n, void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, const void *c, const void *s, rocblas_datatype cs_type, rocblas_datatype execution_type)

BLAS EX API

rot_ex applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to vectors x and y. Scalars c and s may be stored in either host or device memory. Location is specified by calling rocblas_set_pointer_mode.

In the case where cs_type is real:

x := c * x + s * y
y := c * y - s * x

In the case where cs_type is complex, the imaginary part of c is ignored:

x := real(c) * x + s * y
y := real(c) * y - conj(s) * x

Currently supported datatypes are as follows:

x_type	y_type	cs_type	execution_type
bf16_r	bf16_r	bf16_r	f32_r
f16_r	f16_r	f16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f32_c	f32_c	f32_r	f32_c
f64_c	f64_c	f64_c	f64_c
f64_c	f64_c	f64_r	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] number of elements in the x and y vectors.
x – [inout] device pointer storing vector x.
x_type – [in] [rocblas_datatype] specifies the datatype of vector x.
incx – [in] [rocblas_int] specifies the increment between elements of x.
y – [inout] device pointer storing vector y.
y_type – [in] [rocblas_datatype] specifies the datatype of vector y.
incy – [in] [rocblas_int] specifies the increment between elements of y.
c – [in] device pointer or host pointer storing scalar cosine component of the rotation matrix.
s – [in] device pointer or host pointer storing scalar sine component of the rotation matrix.
cs_type – [in] [rocblas_datatype] specifies the datatype of c and s.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.
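
For the real case, the two update formulas are applied as a pair: both right-hand sides use the values of x and y from before the rotation, as in the standard BLAS rot routine. A host-side sketch, not rocBLAS code, with the hypothetical name rot_reference, float data, and positive increments assumed:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side reference for the real Givens rotation:
//   x := c * x + s * y
//   y := c * y - s * x   (using the x from before the update)
void rot_reference(std::int64_t n,
                   float* x, std::int64_t incx,
                   float* y, std::int64_t incy,
                   float c, float s)
{
    if(n <= 0)
        return; // nothing to rotate
    assert(incx > 0 && incy > 0); // assumption of this sketch only

    for(std::int64_t i = 0; i < n; ++i)
    {
        // Read both inputs first so the second formula sees the old x.
        const float xi = x[i * incx];
        const float yi = y[i * incy];
        x[i * incx] = c * xi + s * yi;
        y[i * incy] = c * yi - s * xi;
    }
}
```

With c = 0 and s = 1 (a quarter turn), x takes the old values of y and y takes the negated old values of x, which makes the "old x" requirement easy to check by hand.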

rocblas_status rocblas_rot_batched_ex(rocblas_handle handle, rocblas_int n, void *x, rocblas_datatype x_type, rocblas_int incx, void *y, rocblas_datatype y_type, rocblas_int incy, const void *c, const void *s, rocblas_datatype cs_type, rocblas_int batch_count, rocblas_datatype execution_type)

BLAS EX API

rot_batched_ex applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to batched vectors x_i and y_i, for i = 1, …, batch_count. Scalars c and s may be stored in either host or device memory. Location is specified by calling rocblas_set_pointer_mode.

In the case where cs_type is real:

x := c * x + s * y
y := c * y - s * x

In the case where cs_type is complex, the imaginary part of c is ignored:

x := real(c) * x + s * y
y := real(c) * y - conj(s) * x

Currently supported datatypes are as follows:

x_type	y_type	cs_type	execution_type
bf16_r	bf16_r	bf16_r	f32_r
f16_r	f16_r	f16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f32_c	f32_c	f32_r	f32_c
f64_c	f64_c	f64_c	f64_c
f64_c	f64_c	f64_r	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] number of elements in each of the x_i and y_i vectors.
x – [inout] device array of device pointers storing each vector x_i.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment between elements of each x_i.
y – [inout] device array of device pointers storing each vector y_i.
y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i.
incy – [in] [rocblas_int] specifies the increment between elements of each y_i.
c – [in] device pointer or host pointer to scalar cosine component of the rotation matrix.
s – [in] device pointer or host pointer to scalar sine component of the rotation matrix.
cs_type – [in] [rocblas_datatype] specifies the datatype of c and s.
batch_count – [in] [rocblas_int] number of instances in the batch, that is, the number of x_i and y_i vectors.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

rocblas_status rocblas_rot_strided_batched_ex(rocblas_handle handle, rocblas_int n, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stride_x, void *y, rocblas_datatype y_type, rocblas_int incy, rocblas_stride stride_y, const void *c, const void *s, rocblas_datatype cs_type, rocblas_int batch_count, rocblas_datatype execution_type)

BLAS EX API

rot_strided_batched_ex applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to strided batched vectors x_i and y_i, for i = 1, …, batch_count. Scalars c and s may be stored in either host or device memory. Location is specified by calling rocblas_set_pointer_mode.

In the case where cs_type is real:

x := c * x + s * y
y := c * y - s * x

In the case where cs_type is complex, the imaginary part of c is ignored:

x := real(c) * x + s * y
y := real(c) * y - conj(s) * x

Currently supported datatypes are as follows:

x_type	y_type	cs_type	execution_type
bf16_r	bf16_r	bf16_r	f32_r
f16_r	f16_r	f16_r	f32_r
f32_r	f32_r	f32_r	f32_r
f64_r	f64_r	f64_r	f64_r
f32_c	f32_c	f32_c	f32_c
f32_c	f32_c	f32_r	f32_c
f64_c	f64_c	f64_c	f64_c
f64_c	f64_c	f64_r	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] number of elements in each of the x_i and y_i vectors.
x – [inout] device pointer to the first vector x_1.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment between elements of each x_i.
stride_x – [in] [rocblas_stride] specifies the increment from the beginning of x_i to the beginning of x_(i+1)
y – [inout] device pointer to the first vector y_1.
y_type – [in] [rocblas_datatype] specifies the datatype of each vector y_i.
incy – [in] [rocblas_int] specifies the increment between elements of each y_i.
stride_y – [in] [rocblas_stride] specifies the increment from the beginning of y_i to the beginning of y_(i+1)
c – [in] device pointer or host pointer to scalar cosine component of the rotation matrix.
s – [in] device pointer or host pointer to scalar sine component of the rotation matrix.
cs_type – [in] [rocblas_datatype] specifies the datatype of c and s.
batch_count – [in] [rocblas_int] number of instances in the batch, that is, the number of x_i and y_i vectors.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

rot_ex, rot_batched_ex, and rot_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_scal_ex + batched, strided_batched

rocblas_status rocblas_scal_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_datatype execution_type)

BLAS EX API

scal_ex scales each element of vector x with scalar alpha.

x := alpha * x

Currently supported datatypes are as follows:

alpha_type	x_type	execution_type
f32_r	bf16_r	f32_r
bf16_r	bf16_r	f32_r
f16_r	f16_r	f16_r
f16_r	f16_r	f32_r
f32_r	f16_r	f32_r
f32_r	f32_r	f32_r
f64_r	f64_r	f64_r
f32_c	f32_c	f32_c
f64_c	f64_c	f64_c
f32_r	f32_c	f32_c
f64_r	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in x.
alpha – [in] device pointer or host pointer for the scalar alpha.
alpha_type – [in] [rocblas_datatype] specifies the datatype of alpha.
x – [inout] device pointer storing vector x.
x_type – [in] [rocblas_datatype] specifies the datatype of vector x.
incx – [in] [rocblas_int] specifies the increment for the elements of x.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.
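
The table allows alpha and x to have different types, for example a real alpha scaling a complex x. The host-side sketch below illustrates that idea; it is not rocBLAS code, scal_reference is a hypothetical name, and positive increments are assumed.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>

// Hypothetical host-side reference for x := alpha * x (not a rocBLAS call).
// Alpha and X may differ, e.g. Alpha = float with X = std::complex<float>.
template <typename Alpha, typename X>
void scal_reference(std::int64_t n, Alpha alpha, X* x, std::int64_t incx)
{
    if(n <= 0)
        return; // nothing to scale
    assert(incx > 0); // assumption of this sketch only

    for(std::int64_t i = 0; i < n; ++i)
        x[i * incx] = alpha * x[i * incx]; // touches every incx-th element only
}
```

Scaling a std::complex<float> vector by a float alpha multiplies real and imaginary parts by the same factor, which loosely corresponds to the f32_r / f32_c / f32_c row of the table.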

rocblas_status rocblas_scal_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_int batch_count, rocblas_datatype execution_type)

BLAS EX API

scal_batched_ex scales each element of each vector x_i with scalar alpha.

x_i := alpha * x_i

Currently supported datatypes are as follows:

alpha_type	x_type	execution_type
f32_r	bf16_r	f32_r
bf16_r	bf16_r	f32_r
f16_r	f16_r	f16_r
f16_r	f16_r	f32_r
f32_r	f16_r	f32_r
f32_r	f32_r	f32_r
f64_r	f64_r	f64_r
f32_c	f32_c	f32_c
f64_c	f64_c	f64_c
f32_r	f32_c	f32_c
f64_r	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in each x_i.
alpha – [in] device pointer or host pointer for the scalar alpha.
alpha_type – [in] [rocblas_datatype] specifies the datatype of alpha.
x – [inout] device array of device pointers storing each vector x_i.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i.
batch_count – [in] [rocblas_int] number of instances in the batch.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

rocblas_status rocblas_scal_strided_batched_ex(rocblas_handle handle, rocblas_int n, const void *alpha, rocblas_datatype alpha_type, void *x, rocblas_datatype x_type, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_datatype execution_type)

BLAS EX API

scal_strided_batched_ex scales each element of each vector x_i with scalar alpha over a set of strided batched vectors.

x_i := alpha * x_i

Currently supported datatypes are as follows:

alpha_type	x_type	execution_type
f32_r	bf16_r	f32_r
bf16_r	bf16_r	f32_r
f16_r	f16_r	f16_r
f16_r	f16_r	f32_r
f32_r	f16_r	f32_r
f32_r	f32_r	f32_r
f64_r	f64_r	f64_r
f32_c	f32_c	f32_c
f64_c	f64_c	f64_c
f32_r	f32_c	f32_c
f64_r	f64_c	f64_c

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
n – [in] [rocblas_int] the number of elements in each x_i.
alpha – [in] device pointer or host pointer for the scalar alpha.
alpha_type – [in] [rocblas_datatype] specifies the datatype of alpha.
x – [inout] device pointer to the first vector x_1.
x_type – [in] [rocblas_datatype] specifies the datatype of each vector x_i.
incx – [in] [rocblas_int] specifies the increment for the elements of each x_i.
stridex – [in] [rocblas_stride] stride from the start of one vector (x_i) to the next one (x_(i+1)). There are no restrictions placed on stridex. However, ensure that stridex is of appropriate size. For a typical case this means stridex >= n * incx.
batch_count – [in] [rocblas_int] number of instances in the batch.
execution_type – [in] [rocblas_datatype] specifies the datatype of computation.

scal_ex, scal_batched_ex, and scal_strided_batched_ex functions support the _64 interface. Refer to section ILP64 Interface.

rocblas_gemm_ex + batched, strided_batched#

rocblas_status rocblas_gemm_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)

BLAS EX API

gemm_ex performs one of the matrix-matrix operations:

D = alpha*op( A )*op( B ) + beta*C,

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are matrices, with op( A ) an m by k matrix, op( B ) a k by n matrix and C and D are m by n matrices. C and D may point to the same matrix if their parameters are identical.

Supported types are as follows:

rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type
rocblas_datatype_f16_r = a_type = b_type; rocblas_datatype_f32_r = c_type = d_type = compute_type
rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type
rocblas_datatype_bf16_r = a_type = b_type; rocblas_datatype_f32_r = c_type = d_type = compute_type
rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type
rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type

Although not widespread, some gemm kernels used by gemm_ex may use atomic operations. See Atomic Operations in the API Reference Guide for more information.

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
k – [in] [rocblas_int] matrix dimension k.
alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
a – [in] [void *] device pointer storing matrix A.
a_type – [in] [rocblas_datatype] specifies the datatype of matrix A.
lda – [in] [rocblas_int] specifies the leading dimension of A.
b – [in] [void *] device pointer storing matrix B.
b_type – [in] [rocblas_datatype] specifies the datatype of matrix B.
ldb – [in] [rocblas_int] specifies the leading dimension of B.
beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
c – [in] [void *] device pointer storing matrix C.
c_type – [in] [rocblas_datatype] specifies the datatype of matrix C.
ldc – [in] [rocblas_int] specifies the leading dimension of C.
d – [out] [void *] device pointer storing matrix D. If d and c pointers are to the same matrix then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.
d_type – [in] [rocblas_datatype] specifies the datatype of matrix D.
ldd – [in] [rocblas_int] specifies the leading dimension of D.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.
algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type.
solution_index – [in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. In previous releases this parameter was unused and the default solution was always used.
flags – [in] [uint32_t] optional gemm flags.
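
As an illustration of the operation and of the mixed-type rows in the table above (for example i8 inputs with i32 output and compute type), the following CPU reference sketches D = alpha*op( A )*op( B ) + beta*C for column-major storage. The names gemm_ex_ref and op_elem are hypothetical helpers, not rocBLAS functions. The sketch handles real types only, so op( X ) = X**H is not covered.

```cpp
#include <cassert>
#include <cstdint>

// Element (row, col) of op(X), where X is column-major with leading dimension ld.
template <typename T>
T op_elem(const T* X, int ld, bool trans, int row, int col)
{
    const std::int64_t r = trans ? col : row;
    const std::int64_t c = trans ? row : col;
    return X[r + c * ld];
}

// CPU reference sketch: Tin is a_type/b_type, Tout is c_type/d_type,
// Tc is compute_type. Real types only (no conjugate transpose).
template <typename Tin, typename Tout, typename Tc>
void gemm_ex_ref(bool transA, bool transB, int m, int n, int k, Tc alpha,
                 const Tin* A, int lda, const Tin* B, int ldb, Tc beta,
                 const Tout* C, int ldc, Tout* D, int ldd)
{
    for(int j = 0; j < n; ++j)
    {
        for(int i = 0; i < m; ++i)
        {
            Tc acc = 0; // accumulate in the compute type
            for(int l = 0; l < k; ++l)
                acc += static_cast<Tc>(op_elem(A, lda, transA, i, l))
                       * static_cast<Tc>(op_elem(B, ldb, transB, l, j));
            const Tc c_val = static_cast<Tc>(C[i + static_cast<std::int64_t>(j) * ldc]);
            D[i + static_cast<std::int64_t>(j) * ldd] = static_cast<Tout>(alpha * acc + beta * c_val);
        }
    }
}
```

Accumulating in Tc rather than Tin is the point of the separate compute_type: products of 8-bit inputs would overflow an 8-bit accumulator almost immediately.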

rocblas_status rocblas_gemm_batched_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, const void *b, rocblas_datatype b_type, rocblas_int ldb, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)

BLAS EX API

gemm_batched_ex performs one of the batched matrix-matrix operations:

D_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, ..., batch_count

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are batched pointers to matrices, with op( A ) an m by k by batch_count batched matrix, op( B ) a k by n by batch_count batched matrix and C and D are m by n by batch_count batched matrices. C and D may point to the same matrices if their parameters are identical.

The batched matrices are an array of pointers to matrices. The number of pointers to matrices is batch_count.

Supported types are as follows:

rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type
rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type
rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type
rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
k – [in] [rocblas_int] matrix dimension k.
alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
a – [in] [void *] device pointer storing array of pointers to each matrix A_i.
a_type – [in] [rocblas_datatype] specifies the datatype of each matrix A_i.
lda – [in] [rocblas_int] specifies the leading dimension of each A_i.
b – [in] [void *] device pointer storing array of pointers to each matrix B_i.
b_type – [in] [rocblas_datatype] specifies the datatype of each matrix B_i.
ldb – [in] [rocblas_int] specifies the leading dimension of each B_i.
beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
c – [in] [void *] device array of device pointers to each matrix C_i.
c_type – [in] [rocblas_datatype] specifies the datatype of each matrix C_i.
ldc – [in] [rocblas_int] specifies the leading dimension of each C_i.
d – [out] [void *] device array of device pointers to each matrix D_i. If d and c are the same array of matrix pointers then d_type must equal c_type and ldd must equal ldc or the respective invalid status will be returned.
d_type – [in] [rocblas_datatype] specifies the datatype of each matrix D_i.
ldd – [in] [rocblas_int] specifies the leading dimension of each D_i.
batch_count – [in] [rocblas_int] number of gemm operations in the batch.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.
algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type.
solution_index – [in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. In previous releases this parameter was unused and the default solution was always used.
flags – [in] [uint32_t] optional gemm flags.

rocblas_status rocblas_gemm_strided_batched_ex(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, rocblas_int k, const void *alpha, const void *a, rocblas_datatype a_type, rocblas_int lda, rocblas_stride stride_a, const void *b, rocblas_datatype b_type, rocblas_int ldb, rocblas_stride stride_b, const void *beta, const void *c, rocblas_datatype c_type, rocblas_int ldc, rocblas_stride stride_c, void *d, rocblas_datatype d_type, rocblas_int ldd, rocblas_stride stride_d, rocblas_int batch_count, rocblas_datatype compute_type, rocblas_gemm_algo algo, int32_t solution_index, uint32_t flags)

BLAS EX API

gemm_strided_batched_ex performs one of the strided_batched matrix-matrix operations:

D_i = alpha*op(A_i)*op(B_i) + beta*C_i, for i = 1, ..., batch_count

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B, C, and D are strided_batched matrices, with op( A ) an m by k by batch_count strided_batched matrix, op( B ) a k by n by batch_count strided_batched matrix and C and D are m by n by batch_count strided_batched matrices. C and D may point to the same matrices if their parameters are identical.

The strided_batched matrices are multiple matrices separated by a constant stride. The number of matrices is batch_count.

Supported types are as follows:

rocblas_datatype_f64_r = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f32_r = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f16_r = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type
rocblas_datatype_bf16_r = a_type = b_type = c_type = d_type; rocblas_datatype_f32_r = compute_type
rocblas_datatype_i8_r = a_type = b_type; rocblas_datatype_i32_r = c_type = d_type = compute_type
rocblas_datatype_f32_c = a_type = b_type = c_type = d_type = compute_type
rocblas_datatype_f64_c = a_type = b_type = c_type = d_type = compute_type

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
k – [in] [rocblas_int] matrix dimension k.
alpha – [in] [const void *] device pointer or host pointer specifying the scalar alpha. Same datatype as compute_type.
a – [in] [void *] device pointer pointing to first matrix A_1.
a_type – [in] [rocblas_datatype] specifies the datatype of each matrix A_i.
lda – [in] [rocblas_int] specifies the leading dimension of each A_i.
stride_a – [in] [rocblas_stride] specifies stride from start of one A_i matrix to the next A_(i + 1).
b – [in] [void *] device pointer pointing to first matrix B_1.
b_type – [in] [rocblas_datatype] specifies the datatype of each matrix B_i.
ldb – [in] [rocblas_int] specifies the leading dimension of each B_i.
stride_b – [in] [rocblas_stride] specifies stride from start of one B_i matrix to the next B_(i + 1).
beta – [in] [const void *] device pointer or host pointer specifying the scalar beta. Same datatype as compute_type.
c – [in] [void *] device pointer pointing to first matrix C_1.
c_type – [in] [rocblas_datatype] specifies the datatype of each matrix C_i.
ldc – [in] [rocblas_int] specifies the leading dimension of each C_i.
stride_c – [in] [rocblas_stride] specifies stride from start of one C_i matrix to the next C_(i + 1).
d – [out] [void *] device pointer pointing to first matrix D_1. If d and c pointers are to the same matrix then d_type must equal c_type, ldd must equal ldc, and stride_d must equal stride_c, or the respective invalid status will be returned.
d_type – [in] [rocblas_datatype] specifies the datatype of each matrix D_i.
ldd – [in] [rocblas_int] specifies the leading dimension of each D_i.
stride_d – [in] [rocblas_stride] specifies stride from start of one D_i matrix to the next D_(i + 1).
batch_count – [in] [rocblas_int] number of gemm operations in the batch.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.
algo – [in] [rocblas_gemm_algo] enumerant specifying the algorithm type.
solution_index – [in] [int32_t] if algo is rocblas_gemm_algo_solution_index, this controls which solution is used. When algo is not rocblas_gemm_algo_solution_index, or if solution_index <= 0, the default solution is used. In previous releases this parameter was unused and the default solution was always used.
flags – [in] [uint32_t] optional gemm flags.
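
The batched and strided_batched layouts differ only in how each matrix is located: through an array of pointers, or through a base pointer plus a constant stride. The sketch below shows the relationship by building a pointer array from a strided layout. make_batch_pointers and min_matrix_stride are hypothetical helper names, and ld * cols is only a typical non-overlapping stride, not a requirement stated by the library. For the batched API the resulting array would still have to be copied to device memory, since the a, b, c and d arguments are device arrays of device pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Typical smallest stride (in elements) that keeps consecutive column-major
// matrices with `cols` stored columns and leading dimension `ld` from overlapping.
inline std::int64_t min_matrix_stride(int ld, int cols)
{
    return static_cast<std::int64_t>(ld) * cols;
}

// Pointer to matrix i of a strided batch is first + i * stride.
template <typename T>
std::vector<T*> make_batch_pointers(T* first, std::int64_t stride, int batch_count)
{
    assert(batch_count >= 0);
    std::vector<T*> ptrs(static_cast<std::size_t>(batch_count));
    for(int i = 0; i < batch_count; ++i)
        ptrs[static_cast<std::size_t>(i)] = first + i * stride;
    return ptrs;
}
```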

rocblas_trsm_ex + batched, strided_batched#

rocblas_status rocblas_trsm_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, void *B, rocblas_int ldb, const void *invA, rocblas_int invA_size, rocblas_datatype compute_type)

BLAS EX API

trsm_ex solves:

op(A)*X = alpha*B or X*op(A) = alpha*B,

where alpha is a scalar, X and B are m by n matrices, A is a triangular matrix and op(A) is one of

op( A ) = A   or   op( A ) = A^T   or   op( A ) = A^H.

The matrix X is overwritten on B.

This function gives the user the ability to reuse the invA matrix between runs. If invA == NULL, rocblas_trsm_ex will automatically calculate invA on every run.

Setting up invA: The accepted invA matrix consists of the packed 128x128 inverses of the diagonal blocks of matrix A, followed by any smaller diagonal block that remains. To set up invA it is recommended that rocblas_trtri_batched be used with matrix A as the input.

Device memory of size 128 x k should be allocated for invA ahead of time, where k is m when rocblas_side_left and is n when rocblas_side_right. The actual number of elements in invA should be passed as invA_size.

To begin, rocblas_trtri_batched must be called on the full 128x128-sized diagonal blocks of matrix A. Below are the restricted parameters:

n = 128
ldinvA = 128
stride_invA = 128x128
batch_count = k / 128

Then any remaining block may be added:

n = k % 128
invA = invA + stride_invA * previous_batch_count
ldinvA = 128
batch_count = 1
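
The two rocblas_trtri_batched calls described above depend only on k. The following sketch computes the values listed for each call, plus the 128 x k allocation size. InvAPlan and plan_inva are hypothetical names introduced for illustration; no rocBLAS call is made here.

```cpp
#include <cassert>
#include <cstdint>

struct InvAPlan
{
    int          full_blocks;      // batch_count for the first trtri_batched call: k / 128
    int          remainder_n;      // n for the second call: k % 128 (0 means no second call)
    std::int64_t remainder_offset; // element offset of the remainder block: stride_invA * full_blocks
    std::int64_t inva_size;        // elements to allocate and pass as invA_size: 128 * k
};

// k is m for rocblas_side_left and n for rocblas_side_right.
inline InvAPlan plan_inva(int k)
{
    assert(k >= 0);
    constexpr int          block       = 128;
    constexpr std::int64_t stride_invA = static_cast<std::int64_t>(block) * block;

    InvAPlan plan;
    plan.full_blocks      = k / block;
    plan.remainder_n      = k % block;
    plan.remainder_offset = stride_invA * plan.full_blocks;
    plan.inva_size        = static_cast<std::int64_t>(block) * k;
    return plan;
}
```

For k = 300 this gives two full 128x128 blocks, a trailing 44x44 block starting at element offset 32768, and an allocation of 38400 elements.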

Although not widespread, some gemm kernels used by trsm_ex may use atomic operations. See Atomic Operations in the API Reference Guide for more information.

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
side – [in] [rocblas_side]
- rocblas_side_left: op(A)*X = alpha*B
- rocblas_side_right: X*op(A) = alpha*B
uplo – [in] [rocblas_fill]
- rocblas_fill_upper: A is an upper triangular matrix.
- rocblas_fill_lower: A is a lower triangular matrix.
transA – [in] [rocblas_operation]
- rocblas_operation_none: op(A) = A.
- rocblas_operation_transpose: op(A) = A^T
- rocblas_operation_conjugate_transpose: op(A) = A^H
diag – [in] [rocblas_diagonal]
- rocblas_diagonal_unit: A is assumed to be unit triangular.
- rocblas_diagonal_non_unit: A is not assumed to be unit triangular.
m – [in] [rocblas_int] m specifies the number of rows of B. m >= 0.
n – [in] [rocblas_int] n specifies the number of columns of B. n >= 0.
alpha – [in] [void *] device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced, and B need not be set before entry.
A – [in] [void *] device pointer storing matrix A, of dimension ( lda, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. Only the upper/lower triangular part is accessed.

lda – [in] [rocblas_int] lda specifies the first dimension of A.

if side = rocblas_side_left,  lda >= max( 1, m ),
if side = rocblas_side_right, lda >= max( 1, n ).

B – [inout] [void *] device pointer storing matrix B. B is of dimension ( ldb, n ). Before entry, the leading m by n part of the array B must contain the right-hand side matrix B, and on exit is overwritten by the solution matrix X.
ldb – [in] [rocblas_int] ldb specifies the first dimension of B. ldb >= max( 1, m ).
invA – [in] [void *] device pointer storing the inverse diagonal blocks of A. invA is of dimension ( ld_invA, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. ld_invA must be equal to 128.
invA_size – [in] [rocblas_int] invA_size specifies the number of elements of device memory in invA.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.
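
To make the in-place semantics concrete, the following CPU reference sketches one of the cases: rocblas_side_left with rocblas_fill_lower and op(A) = A, solved by forward substitution. It is an illustration only. trsm_left_lower_ref is a hypothetical name, the sketch does not use invA, and it does not reflect how rocBLAS implements the solve.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// CPU reference sketch: solve A * X = alpha * B for lower-triangular A
// (column-major, leading dimension lda). X overwrites B column by column.
template <typename T>
void trsm_left_lower_ref(bool unit_diag, int m, int n, T alpha,
                         const T* A, int lda, T* B, int ldb)
{
    assert(lda >= std::max(1, m) && ldb >= std::max(1, m));
    for(int j = 0; j < n; ++j)
    {
        T* b = B + static_cast<std::int64_t>(j) * ldb;
        for(int i = 0; i < m; ++i)
        {
            T sum = alpha * b[i];
            // Entries b[0..i-1] already hold the solved values X(0..i-1, j).
            for(int l = 0; l < i; ++l)
                sum -= A[i + static_cast<std::int64_t>(l) * lda] * b[l];
            // With a unit diagonal the stored diagonal of A is not read.
            b[i] = unit_diag ? sum : sum / A[i + static_cast<std::int64_t>(i) * lda];
        }
    }
}
```

Only the lower triangle of A is read, which corresponds to the note that only the upper/lower triangular part is accessed.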

rocblas_status rocblas_trsm_batched_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, void *B, rocblas_int ldb, rocblas_int batch_count, const void *invA, rocblas_int invA_size, rocblas_datatype compute_type)

BLAS EX API

trsm_batched_ex solves:

op(A_i)*X_i = alpha*B_i or X_i*op(A_i) = alpha*B_i,

for i = 1, …, batch_count; and where alpha is a scalar, X and B are arrays of m by n matrices, A is an array of triangular matrices and each op(A_i) is one of

op( A_i ) = A_i   or   op( A_i ) = A_i^T   or   op( A_i ) = A_i^H.

Each matrix X_i is overwritten on B_i.

This function gives the user the ability to reuse the invA matrix between runs. If invA == NULL, rocblas_trsm_batched_ex will automatically calculate each invA_i on every run.

Setting up invA: Each accepted invA_i matrix consists of the packed 128x128 inverses of the diagonal blocks of matrix A_i, followed by any smaller diagonal block that remains. To set up each invA_i it is recommended that rocblas_trtri_batched be used with matrix A_i as the input. invA is an array of pointers of batch_count length holding each invA_i.

Device memory of size 128 x k should be allocated for each invA_i ahead of time, where k is m when rocblas_side_left and is n when rocblas_side_right. The actual number of elements in each invA_i should be passed as invA_size.

To begin, rocblas_trtri_batched must be called on the full 128x128-sized diagonal blocks of each matrix A_i. Below are the restricted parameters:

n = 128
ldinvA = 128
stride_invA = 128x128
batch_count = k / 128

Then any remaining block may be added:

n = k % 128
invA = invA + stride_invA * previous_batch_count
ldinvA = 128
batch_count = 1

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
side – [in] [rocblas_side]
- rocblas_side_left: op(A)*X = alpha*B
- rocblas_side_right: X*op(A) = alpha*B
uplo – [in] [rocblas_fill]
- rocblas_fill_upper: each A_i is an upper triangular matrix.
- rocblas_fill_lower: each A_i is a lower triangular matrix.
transA – [in] [rocblas_operation]
- rocblas_operation_none: op(A) = A.
- rocblas_operation_transpose: op(A) = A^T
- rocblas_operation_conjugate_transpose: op(A) = A^H
diag – [in] [rocblas_diagonal]
- rocblas_diagonal_unit: each A_i is assumed to be unit triangular.
- rocblas_diagonal_non_unit: each A_i is not assumed to be unit triangular.
m – [in] [rocblas_int] m specifies the number of rows of each B_i. m >= 0.
n – [in] [rocblas_int] n specifies the number of columns of each B_i. n >= 0.
alpha – [in] [void *] device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced, and B need not be set before entry.
A – [in] [void *] device array of device pointers storing each matrix A_i. Each A_i is of dimension ( lda, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. Only the upper/lower triangular part is accessed.

lda – [in] [rocblas_int] lda specifies the first dimension of each A_i.

if side = rocblas_side_left,  lda >= max( 1, m ),
if side = rocblas_side_right, lda >= max( 1, n ).

B – [inout] [void *] device array of device pointers storing each matrix B_i. Each B_i is of dimension ( ldb, n ). Before entry, the leading m by n part of the array B_i must contain the right-hand side matrix B_i, and on exit is overwritten by the solution matrix X_i.
ldb – [in] [rocblas_int] ldb specifies the first dimension of each B_i. ldb >= max( 1, m ).
batch_count – [in] [rocblas_int] number of trsm operations in the batch.
invA – [in] [void *] device array of device pointers storing the inverse diagonal blocks of each A_i. each invA_i is of dimension ( ld_invA, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. ld_invA must be equal to 128.
invA_size – [in] [rocblas_int] invA_size specifies the number of elements of device memory in each invA_i.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.

rocblas_status rocblas_trsm_strided_batched_ex(rocblas_handle handle, rocblas_side side, rocblas_fill uplo, rocblas_operation transA, rocblas_diagonal diag, rocblas_int m, rocblas_int n, const void *alpha, const void *A, rocblas_int lda, rocblas_stride stride_A, void *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_int batch_count, const void *invA, rocblas_int invA_size, rocblas_stride stride_invA, rocblas_datatype compute_type)

BLAS EX API

trsm_strided_batched_ex solves:

op(A_i)*X_i = alpha*B_i or X_i*op(A_i) = alpha*B_i,

for i = 1, …, batch_count; and where alpha is a scalar, X and B are strided batched m by n matrices, A is a strided batched triangular matrix and op(A_i) is one of

op( A_i ) = A_i   or   op( A_i ) = A_i^T   or   op( A_i ) = A_i^H.

Each matrix X_i is overwritten on B_i.

This function gives the user the ability to reuse each invA_i matrix between runs. If invA == NULL, rocblas_trsm_strided_batched_ex will automatically calculate each invA_i on every run.

Setting up invA: Each accepted invA_i matrix consists of the packed 128x128 inverses of the diagonal blocks of matrix A_i, followed by any smaller diagonal block that remains. To set up invA_i it is recommended that rocblas_trtri_batched be used with matrix A_i as the input. invA is a contiguous piece of memory holding each invA_i.

To begin, rocblas_trtri_batched must be called on the full 128x128-sized diagonal blocks of each matrix A_i. Below are the restricted parameters:

n = 128
ldinvA = 128
stride_invA = 128x128
batch_count = k / 128

Then any remaining block may be added:

n = k % 128
invA = invA + stride_invA * previous_batch_count
ldinvA = 128
batch_count = 1

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
side – [in] [rocblas_side]
- rocblas_side_left: op(A)*X = alpha*B
- rocblas_side_right: X*op(A) = alpha*B
uplo – [in] [rocblas_fill]
- rocblas_fill_upper: each A_i is an upper triangular matrix.
- rocblas_fill_lower: each A_i is a lower triangular matrix.
transA – [in] [rocblas_operation]
- rocblas_operation_none: op(A) = A.
- rocblas_operation_transpose: op(A) = A^T
- rocblas_operation_conjugate_transpose: op(A) = A^H
diag – [in] [rocblas_diagonal]
- rocblas_diagonal_unit: each A_i is assumed to be unit triangular.
- rocblas_diagonal_non_unit: each A_i is not assumed to be unit triangular.
m – [in] [rocblas_int] m specifies the number of rows of each B_i. m >= 0.
n – [in] [rocblas_int] n specifies the number of columns of each B_i. n >= 0.
alpha – [in] [void *] device pointer or host pointer specifying the scalar alpha. When alpha is &zero then A is not referenced, and B need not be set before entry.
A – [in] [void *] device pointer pointing to the first matrix A_1. Each A_i is of dimension ( lda, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. Only the upper/lower triangular part is accessed.

lda – [in] [rocblas_int] lda specifies the first dimension of each A_i.

if side = rocblas_side_left,  lda >= max( 1, m ),
if side = rocblas_side_right, lda >= max( 1, n ).

stride_A – [in] [rocblas_stride] The stride between each A matrix.
B – [inout] [void *] device pointer pointing to first matrix B_1. Each B_i is of dimension ( ldb, n ). Before entry, the leading m by n part of each array B_i must contain the right-hand side matrix B_i, and on exit is overwritten by the solution matrix X_i.
ldb – [in] [rocblas_int] ldb specifies the first dimension of each B_i. ldb >= max( 1, m ).
stride_B – [in] [rocblas_stride] The stride between each B_i matrix.
batch_count – [in] [rocblas_int] number of trsm operations in the batch.
invA – [in] [void *] device pointer storing the inverse diagonal blocks of each A_i. invA points to the first invA_1. each invA_i is of dimension ( ld_invA, k ), where k is m when rocblas_side_left and is n when rocblas_side_right. ld_invA must be equal to 128.
invA_size – [in] [rocblas_int] invA_size specifies the number of elements of device memory in each invA_i.
stride_invA – [in] [rocblas_stride] The stride between each invA matrix.
compute_type – [in] [rocblas_datatype] specifies the datatype of computation.

rocblas_Xgeam + batched, strided_batched#

rocblas_status rocblas_sgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *beta, const float *B, rocblas_int ldb, float *C, rocblas_int ldc)

rocblas_status rocblas_dgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *beta, const double *B, rocblas_int ldb, double *C, rocblas_int ldc)

rocblas_status rocblas_cgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *beta, const rocblas_float_complex *B, rocblas_int ldb, rocblas_float_complex *C, rocblas_int ldc)

rocblas_status rocblas_zgeam(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *beta, const rocblas_double_complex *B, rocblas_int ldb, rocblas_double_complex *C, rocblas_int ldc)

BLAS Level 3 API

geam performs one of the matrix-matrix operations:

C = alpha*op( A ) + beta*op( B ),

where op( X ) is one of

op( X ) = X      or
op( X ) = X**T   or
op( X ) = X**H,

alpha and beta are scalars, and A, B and C are matrices, with
op( A ) an m by n matrix, op( B ) an m by n matrix, and C an m by n matrix.

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
alpha – [in] device pointer or host pointer specifying the scalar alpha.
A – [in] device pointer storing matrix A.
lda – [in] [rocblas_int] specifies the leading dimension of A.
beta – [in] device pointer or host pointer specifying the scalar beta.
B – [in] device pointer storing matrix B.
ldb – [in] [rocblas_int] specifies the leading dimension of B.
C – [inout] device pointer storing matrix C.
ldc – [in] [rocblas_int] specifies the leading dimension of C.
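
A common use of geam is an out-of-place transpose, obtained with alpha = 1, beta = 0 and transA = rocblas_operation_transpose. The CPU reference below sketches the operation for real types and column-major storage. geam_ref is a hypothetical name, the booleans stand in for the rocblas_operation arguments, and the sketch assumes C does not alias A or B.

```cpp
#include <cassert>
#include <cstdint>

// CPU reference sketch: C = alpha*op(A) + beta*op(B), with C an m by n
// column-major matrix. Real types only; C must not overlap A or B here.
template <typename T>
void geam_ref(bool transA, bool transB, int m, int n, T alpha, const T* A, int lda,
              T beta, const T* B, int ldb, T* C, int ldc)
{
    for(int j = 0; j < n; ++j)
    {
        for(int i = 0; i < m; ++i)
        {
            const std::int64_t row = i, col = j;
            // op(X)(i, j) is X(j, i) when transposed, X(i, j) otherwise.
            const T a = transA ? A[col + row * lda] : A[row + col * lda];
            const T b = transB ? B[col + row * ldb] : B[row + col * ldb];
            C[row + col * ldc] = alpha * a + beta * b;
        }
    }
}
```

Transposing a 2 by 3 matrix A into a 3 by 2 matrix C then corresponds to m = 3, n = 2, lda = 2 and ldc = 3.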

rocblas_status rocblas_sgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *const A[], rocblas_int lda, const float *beta, const float *const B[], rocblas_int ldb, float *const C[], rocblas_int ldc, rocblas_int batch_count)

rocblas_status rocblas_dgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *const A[], rocblas_int lda, const double *beta, const double *const B[], rocblas_int ldb, double *const C[], rocblas_int ldc, rocblas_int batch_count)

rocblas_status rocblas_cgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *beta, const rocblas_float_complex *const B[], rocblas_int ldb, rocblas_float_complex *const C[], rocblas_int ldc, rocblas_int batch_count)

rocblas_status rocblas_zgeam_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *beta, const rocblas_double_complex *const B[], rocblas_int ldb, rocblas_double_complex *const C[], rocblas_int ldc, rocblas_int batch_count)

BLAS Level 3 API

geam_batched performs one of the batched matrix-matrix operations:

C_i = alpha*op( A_i ) + beta*op( B_i )  for i = 0, 1, ... batch_count - 1,

where alpha and beta are scalars, and op(A_i), op(B_i) and C_i are m by n matrices
and op( X ) is one of

op( X ) = X      or
op( X ) = X**T

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
alpha – [in] device pointer or host pointer specifying the scalar alpha.
A – [in] device array of device pointers storing each matrix A_i on the GPU. Each A_i is of dimension ( lda, k ), where k is m when transA == rocblas_operation_none and is n when transA == rocblas_operation_transpose.
lda – [in] [rocblas_int] specifies the leading dimension of A.
beta – [in] device pointer or host pointer specifying the scalar beta.
B – [in] device array of device pointers storing each matrix B_i on the GPU. Each B_i is of dimension ( ldb, k ), where k is m when transB == rocblas_operation_none and is n when transB == rocblas_operation_transpose.
ldb – [in] [rocblas_int] specifies the leading dimension of B.
C – [inout] device array of device pointers storing each matrix C_i on the GPU. Each C_i is of dimension ( ldc, n ).
ldc – [in] [rocblas_int] specifies the leading dimension of C.
batch_count – [in] [rocblas_int] number of instances i in the batch.

rocblas_status rocblas_sgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, rocblas_stride stride_A, const float *beta, const float *B, rocblas_int ldb, rocblas_stride stride_B, float *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)

rocblas_status rocblas_dgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, rocblas_stride stride_A, const double *beta, const double *B, rocblas_int ldb, rocblas_stride stride_B, double *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)

rocblas_status rocblas_cgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_float_complex *beta, const rocblas_float_complex *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_float_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)

rocblas_status rocblas_zgeam_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_operation transB, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_double_complex *beta, const rocblas_double_complex *B, rocblas_int ldb, rocblas_stride stride_B, rocblas_double_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)

BLAS Level 3 API

geam_strided_batched performs one of the batched matrix-matrix operations:

C_i = alpha*op( A_i ) + beta*op( B_i )  for i = 0, 1, ... batch_count - 1,

where alpha and beta are scalars, and op(A_i), op(B_i) and C_i are m by n matrices
and op( X ) is one of

op( X ) = X      or
op( X ) = X**T

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
transA – [in] [rocblas_operation] specifies the form of op( A ).
transB – [in] [rocblas_operation] specifies the form of op( B ).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
alpha – [in] device pointer or host pointer specifying the scalar alpha.
A – [in] device pointer to the first matrix A_0 on the GPU. Each A_i is of dimension ( lda, k ), where k is n when transA == rocblas_operation_none and is m when transA == rocblas_operation_transpose.
lda – [in] [rocblas_int] specifies the leading dimension of A.
stride_A – [in] [rocblas_stride] stride from the start of one matrix (A_i) to the start of the next one (A_i+1).
beta – [in] device pointer or host pointer specifying the scalar beta.
B – [in] device pointer to the first matrix B_0 on the GPU. Each B_i is of dimension ( ldb, k ), where k is n when transB == rocblas_operation_none and is m when transB == rocblas_operation_transpose.
ldb – [in] [rocblas_int] specifies the leading dimension of B.
stride_B – [in] [rocblas_stride] stride from the start of one matrix (B_i) to the start of the next one (B_i+1).
C – [inout] device pointer to the first matrix C_0 on the GPU. Each C_i is of dimension ( ldc, n ).
ldc – [in] [rocblas_int] specifies the leading dimension of C.
stride_C – [in] [rocblas_stride] stride from the start of one matrix (C_i) to the start of the next one (C_i+1).
batch_count – [in] [rocblas_int] number of instances i in the batch.
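The operation above can be sketched as a plain CPU reference loop, which may help when checking results from the GPU routine. This is only an illustration under stated assumptions, not the library's implementation: the names ref_operation, ref_op_elem and ref_sgeam_strided_batched are hypothetical, storage is assumed to be column-major, alpha and beta are passed by value rather than by pointer, and only the real-valued op( X ) = X and op( X ) = X**T cases are covered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for rocblas_operation (real-valued cases only). */
typedef enum { REF_OP_NONE, REF_OP_TRANSPOSE } ref_operation;

/* Element (i, j) of op(X) for a column-major matrix X with leading dimension ld. */
static float ref_op_elem(const float *X, int ld, ref_operation trans, int i, int j)
{
    return (trans == REF_OP_NONE) ? X[i + (ptrdiff_t)j * ld]
                                  : X[j + (ptrdiff_t)i * ld];
}

/* CPU reference sketch: C_i = alpha*op(A_i) + beta*op(B_i) for each batch instance. */
void ref_sgeam_strided_batched(ref_operation transA, ref_operation transB,
                               int m, int n,
                               float alpha, const float *A, int lda, ptrdiff_t stride_A,
                               float beta, const float *B, int ldb, ptrdiff_t stride_B,
                               float *C, int ldc, ptrdiff_t stride_C,
                               int batch_count)
{
    assert(m >= 0 && n >= 0 && batch_count >= 0 && ldc >= m);
    for (int b = 0; b < batch_count; ++b) {
        /* Instance b starts b strides past the base pointer. */
        const float *Ab = A + (ptrdiff_t)b * stride_A;
        const float *Bb = B + (ptrdiff_t)b * stride_B;
        float *Cb = C + (ptrdiff_t)b * stride_C;
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < m; ++i)
                Cb[i + (ptrdiff_t)j * ldc] =
                    alpha * ref_op_elem(Ab, lda, transA, i, j)
                  + beta * ref_op_elem(Bb, ldb, transB, i, j);
    }
}
```

With transA set to the transpose case, element (i, j) of op( A_i ) is read from row j and column i of A_i, which is why A_i then needs m columns rather than n.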

rocblas_Xdgmm + batched, strided_batched#

rocblas_status rocblas_sdgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const float *A, rocblas_int lda, const float *x, rocblas_int incx, float *C, rocblas_int ldc)

rocblas_status rocblas_ddgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const double *A, rocblas_int lda, const double *x, rocblas_int incx, double *C, rocblas_int ldc)

rocblas_status rocblas_cdgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *C, rocblas_int ldc)

rocblas_status rocblas_zdgmm(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *C, rocblas_int ldc)

BLAS Level 3 API

dgmm performs one of the matrix-matrix operations:

C = A * diag(x) if side == rocblas_side_right
C = diag(x) * A if side == rocblas_side_left

where C and A are m by n matrices, diag( x ) is a diagonal matrix,
and x is a vector of dimension n if side == rocblas_side_right and of dimension m
if side == rocblas_side_left.

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
side – [in] [rocblas_side] specifies the side of diag(x).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
A – [in] device pointer storing matrix A.
lda – [in] [rocblas_int] specifies the leading dimension of A.
x – [in] device pointer storing vector x.
incx – [in] [rocblas_int] specifies the increment between values of x.
C – [inout] device pointer storing matrix C.
ldc – [in] [rocblas_int] specifies the leading dimension of C.
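The effect of side can be illustrated with a small CPU reference: a right-side product scales column j of A by x[j], while a left-side product scales row i of A by x[i]. The sketch below is an assumption-laden illustration rather than the library's code: ref_side and ref_sdgmm are hypothetical names, storage is assumed to be column-major, and only positive incx is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for rocblas_side. */
typedef enum { REF_SIDE_LEFT, REF_SIDE_RIGHT } ref_side;

/* CPU reference sketch: C = A * diag(x) (right) or C = diag(x) * A (left). */
void ref_sdgmm(ref_side side, int m, int n,
               const float *A, int lda,
               const float *x, int incx,
               float *C, int ldc)
{
    /* Assumption of this sketch: positive increment only. */
    assert(m >= 0 && n >= 0 && lda >= m && ldc >= m && incx > 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* Right side: one scale factor per column. Left side: one per row. */
            float d = (side == REF_SIDE_RIGHT) ? x[(ptrdiff_t)j * incx]
                                               : x[(ptrdiff_t)i * incx];
            C[i + (ptrdiff_t)j * ldc] = A[i + (ptrdiff_t)j * lda] * d;
        }
    }
}
```

Because diag( x ) is never formed explicitly, the work is one multiplication per element of A.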

rocblas_status rocblas_sdgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const float *const A[], rocblas_int lda, const float *const x[], rocblas_int incx, float *const C[], rocblas_int ldc, rocblas_int batch_count)

rocblas_status rocblas_ddgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const double *const A[], rocblas_int lda, const double *const x[], rocblas_int incx, double *const C[], rocblas_int ldc, rocblas_int batch_count)

rocblas_status rocblas_cdgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const C[], rocblas_int ldc, rocblas_int batch_count)

rocblas_status rocblas_zdgmm_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const C[], rocblas_int ldc, rocblas_int batch_count)

BLAS Level 3 API

dgmm_batched performs one of the batched matrix-matrix operations:

C_i = A_i * diag(x_i) for i = 0, 1, ... batch_count-1 if side == rocblas_side_right
C_i = diag(x_i) * A_i for i = 0, 1, ... batch_count-1 if side == rocblas_side_left,

where C_i and A_i are m by n matrices, diag(x_i) is a diagonal matrix,
and x_i is a vector of dimension n if side == rocblas_side_right and of dimension m
if side == rocblas_side_left.

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
side – [in] [rocblas_side] specifies the side of diag(x).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
A – [in] device array of device pointers storing each matrix A_i on the GPU. Each A_i is of dimension ( lda, n ).
lda – [in] [rocblas_int] specifies the leading dimension of A_i.
x – [in] device array of device pointers storing each vector x_i on the GPU. Each x_i is of dimension n if side == rocblas_side_right and dimension m if side == rocblas_side_left.
incx – [in] [rocblas_int] specifies the increment between values of x_i.
C – [inout] device array of device pointers storing each matrix C_i on the GPU. Each C_i is of dimension ( ldc, n ).
ldc – [in] [rocblas_int] specifies the leading dimension of C_i.
batch_count – [in] [rocblas_int] number of instances in the batch.
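In the batched variant, A, x and C are arrays of pointers with one entry per instance. A CPU sketch of that calling convention follows. The names ref_side and ref_ddgmm_batched are hypothetical, the pointer arrays live in host memory here (the real routine expects device arrays of device pointers), storage is assumed to be column-major, and only positive incx is handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for rocblas_side. */
typedef enum { REF_SIDE_LEFT, REF_SIDE_RIGHT } ref_side;

/* CPU reference sketch: applies dgmm to each (A[b], x[b], C[b]) triple in turn. */
void ref_ddgmm_batched(ref_side side, int m, int n,
                       const double *const A[], int lda,
                       const double *const x[], int incx,
                       double *const C[], int ldc,
                       int batch_count)
{
    /* Assumption of this sketch: positive increment only. */
    assert(m >= 0 && n >= 0 && lda >= m && ldc >= m && incx > 0);
    for (int b = 0; b < batch_count; ++b) {
        /* Each instance is reached through its own pointer; no stride is involved. */
        const double *Ab = A[b];
        const double *xb = x[b];
        double *Cb = C[b];
        for (int j = 0; j < n; ++j) {
            for (int i = 0; i < m; ++i) {
                double d = (side == REF_SIDE_RIGHT) ? xb[(ptrdiff_t)j * incx]
                                                    : xb[(ptrdiff_t)i * incx];
                Cb[i + (ptrdiff_t)j * ldc] = Ab[i + (ptrdiff_t)j * lda] * d;
            }
        }
    }
}
```

The instances do not need to be contiguous or equally spaced in memory, since each one is addressed independently.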

rocblas_status rocblas_sdgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const float *A, rocblas_int lda, rocblas_stride stride_A, const float *x, rocblas_int incx, rocblas_stride stride_x, float *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)

rocblas_status rocblas_ddgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const double *A, rocblas_int lda, rocblas_stride stride_A, const double *x, rocblas_int incx, rocblas_stride stride_x, double *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)

rocblas_status rocblas_cdgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_float_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)

rocblas_status rocblas_zdgmm_strided_batched(rocblas_handle handle, rocblas_side side, rocblas_int m, rocblas_int n, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_double_complex *C, rocblas_int ldc, rocblas_stride stride_C, rocblas_int batch_count)

BLAS Level 3 API

dgmm_strided_batched performs one of the batched matrix-matrix operations:

C_i = A_i * diag(x_i)   if side == rocblas_side_right   for i = 0, 1, ... batch_count-1
C_i = diag(x_i) * A_i   if side == rocblas_side_left    for i = 0, 1, ... batch_count-1,

where C_i and A_i are m by n matrices, diag(x_i) is a diagonal matrix,
and x_i is a vector of dimension n if side == rocblas_side_right and of dimension m
if side == rocblas_side_left.

Parameters:

handle – [in] [rocblas_handle] handle to the rocblas library context queue.
side – [in] [rocblas_side] specifies the side of diag(x).
m – [in] [rocblas_int] matrix dimension m.
n – [in] [rocblas_int] matrix dimension n.
A – [in] device pointer to the first matrix A_0 on the GPU. Each A_i is of dimension ( lda, n ).
lda – [in] [rocblas_int] specifies the leading dimension of A.
stride_A – [in] [rocblas_stride] stride from the start of one matrix (A_i) to the start of the next one (A_i+1).
x – [in] device pointer to the first vector x_0 on the GPU. Each x_i is of dimension n if side == rocblas_side_right and dimension m if side == rocblas_side_left.
incx – [in] [rocblas_int] specifies the increment between values of x.
stride_x – [in] [rocblas_stride] stride from the start of one vector (x_i) to the start of the next one (x_i+1).
C – [inout] device pointer to the first matrix C_0 on the GPU. Each C_i is of dimension ( ldc, n ).
ldc – [in] [rocblas_int] specifies the leading dimension of C.
stride_C – [in] [rocblas_stride] stride from the start of one matrix (C_i) to the start of the next one (C_i+1).
batch_count – [in] [rocblas_int] number of instances i in the batch.
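The stride parameters mean that instance i starts at the base pointer plus i times the stride, measured in elements. The helpers below sketch that addressing and one common way to choose strides so that consecutive instances are packed without overlapping. All three function names are hypothetical, and the packed strides are an illustrative choice rather than a requirement stated in this reference.

```c
#include <assert.h>
#include <stddef.h>

/* Fill ptrs[i] = base + i*stride: the start of instance i in a strided batch. */
void ref_strided_to_pointers(const float *base, ptrdiff_t stride,
                             int batch_count, const float **ptrs)
{
    for (int i = 0; i < batch_count; ++i)
        ptrs[i] = base + (ptrdiff_t)i * stride;
}

/* Elements spanned by a column-major matrix with leading dimension ld and n columns. */
ptrdiff_t ref_packed_matrix_stride(int ld, int n)
{
    assert(ld > 0 && n >= 0);
    return (ptrdiff_t)ld * n;
}

/* Elements spanned by a vector of len entries spaced incx apart (incx > 0 assumed). */
ptrdiff_t ref_packed_vector_stride(int len, int incx)
{
    assert(len > 0 && incx > 0);
    return (ptrdiff_t)(len - 1) * incx + 1;
}
```

Filling a pointer array this way shows how a strided batch relates to the array-of-pointers form used by the batched variant: the strided layout is the special case in which the pointers are equally spaced.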
