2. API Reference Guide

2.1. Introduction

rocBLAS is the AMD library for Basic Linear Algebra Subprograms (BLAS) on the ROCm platform . It is implemented in the HIP programming language and optimized for AMD GPUs.

The aim of rocBLAS is to provide:

  • Functionality similar to Legacy BLAS, adapted to run on GPUs

  • High-performance robust implementation

rocBLAS is written in C++17 and HIP. It uses the AMD ROCm runtime to run on GPU devices.

The rocBLAS API is a thin C99 API using the Hourglass Pattern. It contains:

  • [Level1], [Level2], and [Level3] BLAS functions, with batched and strided_batched versions

  • Extensions to Legacy BLAS, including functions for mixed precision

  • Auxiliary functions

  • Device Memory functions

Note

  • The official rocBLAS API is the C99 API defined in rocblas.h. Therefore the use of any other public symbols is discouraged. All other C/C++ interfaces may not follow a deprecation model and so can change without warning from one release to the next.

  • rocBLAS array storage format is column major and one based. This is to maintain compatibility with the Legacy BLAS code, which is written in Fortran.

  • rocBLAS calls the AMD library Tensile for Level 3 BLAS matrix multiplication.

2.2. rocBLAS API and Legacy BLAS Functions

rocBLAS is initialized by calling rocblas_create_handle, and it is terminated by calling rocblas_destroy_handle. The rocblas_handle is persistent, and it contains:

  • HIP stream

  • Temporary device work space

  • Mode for enabling or disabling logging (default is logging disabled)

rocBLAS functions run on the host, and they call HIP to launch rocBLAS kernels that run on the device in a HIP stream. The kernels are asynchronous unless:

  • The function returns a scalar result from device to host

  • Temporary device memory is allocated

In both cases above, the launch can be made asynchronous by:

  • Use rocblas_pointer_mode_device to keep the scalar result on the device. Note that it is only the following Level1 BLAS functions that return a scalar result: Xdot, Xdotu, Xnrm2, Xasum, iXamax, iXamin.

  • Use the provided device memory functions to allocate device memory that persists in the handle. Note that most rocBLAS functions do not allocate temporary device memory.

Before calling a rocBLAS function, arrays must be copied to the device. Integer scalars like m, n, k are stored on the host. Floating point scalars like alpha and beta can be on host or device.

Error handling is by returning a rocblas_status. Functions conform to the Legacy BLAS argument checking.

2.2.1. Rules for Obtaining rocBLAS API from Legacy BLAS

  1. The Legacy BLAS routine name is changed to lowercase and prefixed by rocblas_. For example: Legacy BLAS routine SSCAL, scales a vector by a constant, is converted to rocblas_sscal.

  2. A first argument rocblas_handle handle is added to all rocBLAS functions.

  3. Input arguments are declared with the const modifier.

  4. Character arguments are replaced with enumerated types defined in rocblas_types.h. They are passed by value on the host.

  5. Array arguments are passed by reference on the device.

  6. Scalar arguments are passed by value on the host with the following exceptions. See the section Pointer Mode for more information on these exceptions:

  • Scalar values alpha and beta are passed by reference on either the host or the device.

  • Where Legacy BLAS functions have return values, the return value is instead added as the last function argument. It is returned by reference on either the host or the device. This applies to the following functions: xDOT, xDOTU, xNRM2, xASUM, IxAMAX, IxAMIN.

  1. The return value of all functions is rocblas_status, defined in rocblas_types.h. It is used to check for errors.

2.2.2. Example Code

Below is a simple example code for calling function rocblas_sscal:

#include <iostream>
#include <vector>
#include "hip/hip_runtime_api.h"
#include "rocblas.h"

using namespace std;

int main()
{
    rocblas_int n = 10240;
    float alpha = 10.0;

    vector<float> hx(n);
    vector<float> hz(n);
    float* dx;

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    // allocate memory on device
    hipMalloc(&dx, n * sizeof(float));

    // Initial Data on CPU,
    srand(1);
    for( int i = 0; i < n; ++i )
    {
        hx[i] = rand() % 10 + 1;  //generate a integer number between [1, 10]
    }

    // copy array from host memory to device memory
    hipMemcpy(dx, hx.data(), sizeof(float) * n, hipMemcpyHostToDevice);

    // call rocBLAS function
    rocblas_status status = rocblas_sscal(handle, n, &alpha, dx, 1);

    // check status for errors
    if(status == rocblas_status_success)
    {
        cout << "status == rocblas_status_success" << endl;
    }
    else
    {
        cout << "rocblas failure: status = " << status << endl;
    }

    // copy output from device memory to host memory
    hipMemcpy(hx.data(), dx, sizeof(float) * n, hipMemcpyDeviceToHost);

    hipFree(dx);
    rocblas_destroy_handle(handle);
    return 0;
}

2.2.3. LP64 Interface

The rocBLAS library is LP64, so rocblas_int arguments are 32 bit and rocblas_long arguments are 64 bit.

2.2.4. Column-major Storage and 1 Based Indexing

rocBLAS uses column-major storage for 2D arrays, and 1-based indexing for the functions xMAX and xMIN. This is the same as Legacy BLAS and cuBLAS.

If you need row-major and 0-based indexing (used in C language arrays), download the file cblas.tgz from the Netlib Repository. Look at the CBLAS functions that provide a thin interface to Legacy BLAS. They convert from row-major, 0 based, to column-major, 1 based. This is done by swapping the order of function arguments. It is not necessary to transpose matrices.

2.2.5. Pointer Mode

The auxiliary functions rocblas_set_pointer and rocblas_get_pointer are used to set and get the value of the state variable rocblas_pointer_mode. This variable is stored in rocblas_handle. If rocblas_pointer_mode == rocblas_pointer_mode_host, then scalar parameters must be allocated on the host. If rocblas_pointer_mode == rocblas_pointer_mode_device, then scalar parameters must be allocated on the device.

There are two types of scalar parameter:

  • Scaling parameters like alpha and beta used in functions like axpy, gemv, gemm 2

  • Scalar results from functions amax, amin, asum, dot, nrm2

For scalar parameters like alpha and beta when rocblas_pointer_mode == rocblas_pointer_mode_host, they can be allocated on the host heap or stack. The kernel launch is asynchronous, and if they are on the heap, they can be freed after the return from the kernel launch. When rocblas_pointer_mode == rocblas_pointer_mode_device they must not be changed till the kernel completes.

For scalar results, when rocblas_pointer_mode == rocblas_pointer_mode_host, then the function blocks the CPU till the GPU has copied the result back to the host. When rocblas_pointer_mode == rocblas_pointer_mode_device the function will return after the asynchronous launch. Similarly to vector and matrix results, the scalar result is only available when the kernel has completed execution.

2.2.6. Asynchronous API

rocBLAS functions will be asynchronous unless:

  • The function needs to allocate device memory

  • The function returns a scalar result from GPU to CPU

The order of operations in the asynchronous functions is as in the figure below. The argument checking, calculation of process grid, and kernel launch take very little time. The asynchronous kernel running on the GPU does not block the CPU. After the kernel launch, the CPU keeps processing the next instructions.

code blocks in asynch function call

Order of operations in asynchronous functions

The above order of operations will change if there is logging or the function is synchronous. Logging requires system calls, and the program must wait for them to complete before executing the next instruction. See the Logging section for more information.

Note

The default is no logging.

If the cpu needs to allocate device memory, it must wait till this is complete before executing the next instruction. See the Device Memory Allocation section for more information.

Note

Memory can be preallocated. This will make the function asynchronous, as it removes the need for the function to allocate memory.

The following functions copy a scalar result from GPU to CPU if rocblas_pointer_mode == rocblas_pointer_mode_host: asum, dot, max, min, nrm2.

This makes the function synchronous, as the program must wait for the copy before executing the next instruction. See the section on Pointer Mode for more information.

Note

Set rocblas_pointer_mode to rocblas_pointer_mode_device makes the function asynchronous by keeping the result on the GPU.

The order of operations with logging, device memory allocation, and return of a scalar result is as in the figure below:

code blocks in synchronous function call

Code blocks in synchronous function call

2.2.7. Complex Number Data Types

Data types for rocBLAS complex numbers in the API are a special case. For C compiler users, gcc, and other non-hipcc compiler users, these types are exposed as a struct with x and y components and identical memory layout to std::complex for float and double precision. Internally a templated C++ class is defined, but it should be considered deprecated for external use. For simplified usage with Hipified code there is an option to interpret the API as using hipFloatComplex and hipDoubleComplex types (i.e. typedef hipFloatComplex rocblas_float_complex). This is provided for users to avoid casting when using the hip complex types in their code. As the memory layout is consistent across all three types, it is safe to cast arguments to API calls between the 3 types: hipFloatComplex, std::complex<float>, and rocblas_float_complex, as well as for the double precision variants. To expose the API as using the hip defined complex types, user can use either a compiler define or inlined #define ROCM_MATHLIBS_API_USE_HIP_COMPLEX before including the header file <rocblas.h>. Thus the API is compatible with both forms, but recompilation is required to avoid casting if switching to pass in the hip complex types. Most device memory pointers are passed with void* types to hip utility functions (e.g. hipMemcpy), so uploading memory from std::complex arrays or hipFloatComplex arrays requires no changes regardless of complex data type API choice.

2.2.8. MI100 (gfx908) Considerations

On nodes with the MI100 (gfx908), MFMA (Matrix-Fused-Multiply-Add) instructions are available to substantially speed up matrix operations. This hardware feature is used in all gemm and gemm-based functions in rocBLAS with 32-bit or shorter base datatypes with an associated 32-bit compute_type (f32_r, i32_r, or f32_c as appropriate).

Specifically, rocBLAS takes advantage of MI100’s MFMA instructions for three real base types f16_r, bf16_r, and f32_r with compute_type f32_r, one integral base type i8_r with compute_type i32_r, and one complex base type f32_c with compute_type f32_c. In summary, all GEMM APIs and APIs for GEMM-based functions using these five base types and their associated compute_type (explicit or implicit) take advantage of MI100’s MFMA instructions.

Note

The use of MI100’s MFMA instructions is automatic. There is no user control for on/off.

Not all problem sizes may select MFMA-based kernels; additional tuning may be needed to get good performance.

2.2.9. MI200 (gfx90a) Considerations

On nodes with the MI200 (gfx90a), MFMA_F64 instructions are available to substantially speed up double precision matrix operations. This hardware feature is used in all GEMM and GEMM-based functions in rocBLAS with 64-bit floating-point datatype, namely DGEMM, ZGEMM, DTRSM, ZTRSM, DTRMM, ZTRMM, DSYRKX, and ZSYRKX.

The MI200 MFMA_F16, MFMA_BF16 and MFMA_BF16_1K instructions flush subnormal input/output data (“denorms”) to zero. It is observed that certain use cases utilizing the HPA (High Precision Accumulate) HGEMM kernels where a_type=b_type=c_type=d_type=f16_r and compute_type=f32_r do not tolerate the MI200’s flush-denorms-to-zero behavior well due to F16’s limited exponent range. An alternate implementation of the HPA HGEMM kernel utilizing the MFMA_BF16_1K instruction is provided which, takes advantage of BF16’s much larger exponent range, albeit with reduced accuracy. To select the alternate implementation of HPA HGEMM with the gemm_ex/gemm_strided_batched_ex functions, for the flags argument, use the enum value of rocblas_gemm_flags_fp16_alt_impl.

Note

The use of MI200’s MFMA instructions (including MFMA_F64) is automatic. There is no user control for on/off.

Not all problem sizes may select MFMA-based kernels; additional tuning may be needed to get good performance.

2.3. Deprecations by version

2.3.1. Announced in rocBLAS 2.45

2.3.1.1. Replace is_complex by rocblas_is_complex

From rocBLAS 3.0 the trait is_complex will be replaces by rocblas_is_complex

2.3.1.2. Replace truncate with rocblas_truncate

From rocBLAS 3.0 enum truncate_t and the value truncate will be removed and replaced by rocblas_truncate_t and rocblas_truncate, respectively. The new enum rocblas_truncate_t and the value rocblas_truncate could be used from this ROCm release for an easy transition.

2.3.2. Announced in rocBLAS 2.46

2.3.2.1. Remove ability for hipBLAS to set rocblas_int8_type_for_hipblas

From rocBLAS 3.0 remove enum rocblas_int8_type_for_hipblas and the functions rocblas_get_int8_type_for_hipblas and rocblas_set_int8_type_for_hipblas. These are used by hipBLAS to select either int8_t or packed_int8x4 datatype. In hipBLAS the option to use packed_int8x4 will be removed, only int8_t will be available.

References:

Level1
    1. Lawson, R. J. Hanson, D. Kincaid, and F. T. Krogh, Basic Linear Algebra Subprograms for FORTRAN usage, ACM Trans. Math. Soft., 5 (1979), pp. 308–323.

Level2
    1. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, An extended set of FORTRAN Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., 14 (1988), pp. 1–17

Level3
    1. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson, Algorithm 656: An extended set of FORTRAN Basic Linear Algebra Subprograms, ACM Trans. Math. Soft., 14 (1988), pp. 18–32

3. Using rocBLAS API

This section describes how to use the rocBLAS library API.

3.1. rocBLAS Datatypes

3.1.1. rocblas_handle

typedef struct _rocblas_handle *rocblas_handle

rocblas_handle is a structure holding the rocblas library context. It must be initialized using rocblas_create_handle(), and the returned handle must be passed to all subsequent library function calls. It should be destroyed at the end using rocblas_destroy_handle().

3.1.2. rocblas_int

typedef int32_t rocblas_int

To specify whether int32 is used for LP64 or int64 is used for ILP64.

3.1.3. rocblas_stride

typedef int64_t rocblas_stride

Stride between matrices or vectors in strided_batched functions.

3.1.4. rocblas_half

struct rocblas_half

Structure definition for rocblas_half.

3.1.5. rocblas_bfloat16

struct rocblas_bfloat16

Struct to represent a 16 bit Brain floating-point number.

3.1.6. rocblas_float_complex

struct rocblas_float_complex

Struct to represent a complex number with single precision real and imaginary parts.

3.1.7. rocblas_double_complex

struct rocblas_double_complex

Struct to represent a complex number with double precision real and imaginary parts.

3.2. rocBLAS Enumeration

Enumeration constants have numbering that is consistent with CBLAS, ACML, most standard C BLAS libraries

3.2.1. rocblas_operation

enum rocblas_operation

Used to specify whether the matrix is to be transposed or not.

Parameter constants. numbering is consistent with CBLAS, ACML and most standard C BLAS libraries

Values:

enumerator rocblas_operation_none

Operate with the matrix.

enumerator rocblas_operation_transpose

Operate with the transpose of the matrix.

enumerator rocblas_operation_conjugate_transpose

Operate with the conjugate transpose of the matrix.

3.2.2. rocblas_fill

enum rocblas_fill

Used by the Hermitian, symmetric and triangular matrix routines to specify whether the upper, or lower triangle is being referenced.

Values:

enumerator rocblas_fill_upper

Upper triangle.

enumerator rocblas_fill_lower

Lower triangle.

enumerator rocblas_fill_full

3.2.3. rocblas_diagonal

enum rocblas_diagonal

It is used by the triangular matrix routines to specify whether the matrix is unit triangular.

Values:

enumerator rocblas_diagonal_non_unit

Non-unit triangular.

enumerator rocblas_diagonal_unit

Unit triangular.

3.2.4. rocblas_side

enum rocblas_side

Indicates the side matrix A is located relative to matrix B during multiplication.

Values:

enumerator rocblas_side_left

Multiply general matrix by symmetric, Hermitian, or triangular matrix on the left.

enumerator rocblas_side_right

Multiply general matrix by symmetric, Hermitian, or triangular matrix on the right.

enumerator rocblas_side_both

3.2.5. rocblas_status

enum rocblas_status

rocblas status codes definition

Values:

enumerator rocblas_status_success

Success

enumerator rocblas_status_invalid_handle

Handle not initialized, invalid or null

enumerator rocblas_status_not_implemented

Function is not implemented

enumerator rocblas_status_invalid_pointer

Invalid pointer argument

enumerator rocblas_status_invalid_size

Invalid size argument

enumerator rocblas_status_memory_error

Failed internal memory allocation, copy or dealloc

enumerator rocblas_status_internal_error

Other internal library failure

enumerator rocblas_status_perf_degraded

Performance degraded due to low device memory

enumerator rocblas_status_size_query_mismatch

Unmatched start/stop size query

enumerator rocblas_status_size_increased

Queried device memory size increased

enumerator rocblas_status_size_unchanged

Queried device memory size unchanged

enumerator rocblas_status_invalid_value

Passed argument not valid

enumerator rocblas_status_continue

Nothing preventing function to proceed

enumerator rocblas_status_check_numerics_fail

Will be set if the vector/matrix has a NaN/Infinity/denormal value

3.2.6. rocblas_datatype

enum rocblas_datatype

Indicates the precision width of data stored in a blas type.

Parameter constants. Numbering continues into next free decimal range but not shared with other BLAS libraries

Values:

enumerator rocblas_datatype_f16_r

16-bit floating point, real

enumerator rocblas_datatype_f32_r

32-bit floating point, real

enumerator rocblas_datatype_f64_r

64-bit floating point, real

enumerator rocblas_datatype_f16_c

16-bit floating point, complex

enumerator rocblas_datatype_f32_c

32-bit floating point, complex

enumerator rocblas_datatype_f64_c

64-bit floating point, complex

enumerator rocblas_datatype_i8_r

8-bit signed integer, real

enumerator rocblas_datatype_u8_r

8-bit unsigned integer, real

enumerator rocblas_datatype_i32_r

32-bit signed integer, real

enumerator rocblas_datatype_u32_r

32-bit unsigned integer, real

enumerator rocblas_datatype_i8_c

8-bit signed integer, complex

enumerator rocblas_datatype_u8_c

8-bit unsigned integer, complex

enumerator rocblas_datatype_i32_c

32-bit signed integer, complex

enumerator rocblas_datatype_u32_c

32-bit unsigned integer, complex

enumerator rocblas_datatype_bf16_r

16-bit bfloat, real

enumerator rocblas_datatype_bf16_c

16-bit bfloat, complex

enumerator rocblas_datatype_invalid

Invalid datatype value, do not use

3.2.7. rocblas_pointer_mode

enum rocblas_pointer_mode

Indicates if scalar pointers are on host or device. This is used for scalars alpha and beta and for scalar function return values.

Values:

enumerator rocblas_pointer_mode_host

Scalar values affected by this variable are located on the host.

enumerator rocblas_pointer_mode_device

Scalar values affected by this variable are located on the device.

3.2.8. rocblas_atomics_mode

enum rocblas_atomics_mode

Indicates if atomics operations are allowed. Not allowing atomic operations may generally improve determinism and repeatability of results at a cost of performance.

Values:

enumerator rocblas_atomics_not_allowed

Algorithms will refrain from atomics where applicable.

enumerator rocblas_atomics_allowed

Algorithms will take advantage of atomics where applicable.

3.2.9. rocblas_layer_mode

enum rocblas_layer_mode

Indicates if layer is active with bitmask.

Values:

enumerator rocblas_layer_mode_none

No logging will take place.

enumerator rocblas_layer_mode_log_trace

A line containing the function name and value of arguments passed will be printed with each rocBLAS function call.

enumerator rocblas_layer_mode_log_bench

Outputs a line each time a rocBLAS function is called, this line can be used with rocblas-bench to make the same call again.

enumerator rocblas_layer_mode_log_profile

Outputs a YAML description of each rocBLAS function called, along with its arguments and number of times it was called.

3.2.10. rocblas_gemm_algo

enum rocblas_gemm_algo

Indicates if layer is active with bitmask.

Values:

enumerator rocblas_gemm_algo_standard

3.2.11. rocblas_gemm_flags

enum rocblas_gemm_flags

Control flags passed into gemm algorithms invoked by Tensile Host.

Values:

enumerator rocblas_gemm_flags_none

Default empty flags.

enumerator rocblas_gemm_flags_pack_int8x4

Before ROCm 4.2, this flags is not implemented and rocblas uses packed-Int8x4 by default. After ROCm 4.2, set flag is neccesary if we want packed-Int8x4. Default (0x0) uses unpacked.

enumerator rocblas_gemm_flags_use_cu_efficiency

Select the gemm problem with the highest efficiency per compute unit used. Useful for running multiple smaller problems simultaneously. This takes precedence over the performance metric set in rocblas_handle and currently only works for gemm_*_ex problems.

enumerator rocblas_gemm_flags_fp16_alt_impl

Select an alternate implementation for the MI200 FP16 HPA (High Precision Accumulate) GEMM kernel utilizing the BF16 matrix instructions with reduced accuracy in cases where computation cannot tolerate the FP16 matrix instructions flushing subnormal FP16 input/output data to zero. See the “MI200 (gfx90a) Considerations” section for more details.

3.3. rocBLAS Helper functions

3.3.1. Auxiliary Functions

rocblas_status rocblas_create_handle(rocblas_handle *handle)

Create handle.

rocblas_status rocblas_destroy_handle(rocblas_handle handle)

Destroy handle.

rocblas_status rocblas_set_stream(rocblas_handle handle, hipStream_t stream)

Set stream for handle.

rocblas_status rocblas_get_stream(rocblas_handle handle, hipStream_t *stream)

Get stream [0] from handle.

rocblas_status rocblas_set_pointer_mode(rocblas_handle handle, rocblas_pointer_mode pointer_mode)

Set rocblas_pointer_mode.

rocblas_status rocblas_get_pointer_mode(rocblas_handle handle, rocblas_pointer_mode *pointer_mode)

Get rocblas_pointer_mode.

rocblas_status rocblas_set_atomics_mode(rocblas_handle handle, rocblas_atomics_mode atomics_mode)

Set rocblas_atomics_mode.

rocblas_status rocblas_get_atomics_mode(rocblas_handle handle, rocblas_atomics_mode *atomics_mode)

Get rocblas_atomics_mode.

rocblas_status rocblas_query_int8_layout_flag(rocblas_handle handle, rocblas_gemm_flags *flag)

Query the preferable supported int8 input layout for gemm.

Indicates the supported int8 input layout for gemm according to the device. If the device supports packed-int8x4 (1) only, output flag is rocblas_gemm_flags_pack_int8x4 and users must bitwise-or your flag with rocblas_gemm_flags_pack_int8x4. If output flag is rocblas_gemm_flags_none (0), then unpacked int8 is preferable and suggested.

Parameters
  • handle[in] [rocblas_handle] the handle of device

  • flag[out] pointer to rocblas_gemm_flags

rocblas_pointer_mode rocblas_pointer_to_mode(void *ptr)

Indicates whether the pointer is on the host or device.

rocblas_status rocblas_set_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)

Copy vector from host to device.

Parameters
  • n[in] [rocblas_int] number of elements in the vector

  • elem_size[in] [rocblas_int] number of bytes per element in the matrix

  • x[in] pointer to vector on the host

  • incx[in] [rocblas_int] specifies the increment for the elements of the vector

  • y[out] pointer to vector on the device

  • incy[in] [rocblas_int] specifies the increment for the elements of the vector

rocblas_status rocblas_get_vector(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy)

Copy vector from device to host.

Parameters
  • n[in] [rocblas_int] number of elements in the vector

  • elem_size[in] [rocblas_int] number of bytes per element in the matrix

  • x[in] pointer to vector on the device

  • incx[in] [rocblas_int] specifies the increment for the elements of the vector

  • y[out] pointer to vector on the host

  • incy[in] [rocblas_int] specifies the increment for the elements of the vector

rocblas_status rocblas_set_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)

Copy matrix from host to device.

Parameters
  • rows[in] [rocblas_int] number of rows in matrices

  • cols[in] [rocblas_int] number of columns in matrices

  • elem_size[in] [rocblas_int] number of bytes per element in the matrix

  • a[in] pointer to matrix on the host

  • lda[in] [rocblas_int] specifies the leading dimension of A, lda >= rows

  • b[out] pointer to matrix on the GPU

  • ldb[in] [rocblas_int] specifies the leading dimension of B, ldb >= rows

rocblas_status rocblas_get_matrix(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb)

Copy matrix from device to host.

Parameters
  • rows[in] [rocblas_int] number of rows in matrices

  • cols[in] [rocblas_int] number of columns in matrices

  • elem_size[in] [rocblas_int] number of bytes per element in the matrix

  • a[in] pointer to matrix on the GPU

  • lda[in] [rocblas_int] specifies the leading dimension of A, lda >= rows

  • b[out] pointer to matrix on the host

  • ldb[in] [rocblas_int] specifies the leading dimension of B, ldb >= rows

rocblas_status rocblas_set_vector_async(rocblas_int n, rocblas_int elem_size, const void *x, rocblas_int incx, void *y, rocblas_int incy, hipStream_t stream)

Asynchronously copy vector from host to device.

rocblas_set_vector_async copies a vector from pinned host memory to device memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • n[in] [rocblas_int] number of elements in the vector

  • elem_size[in] [rocblas_int] number of bytes per element in the matrix

  • x[in] pointer to vector on the host

  • incx[in] [rocblas_int] specifies the increment for the elements of the vector

  • y[out] pointer to vector on the device

  • incy[in] [rocblas_int] specifies the increment for the elements of the vector

  • stream[in] specifies the stream into which this transfer request is queued

rocblas_status rocblas_set_matrix_async(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb, hipStream_t stream)

Asynchronously copy matrix from host to device.

rocblas_set_matrix_async copies a matrix from pinned host memory to device memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • rows[in] [rocblas_int] number of rows in matrices

  • cols[in] [rocblas_int] number of columns in matrices

  • elem_size[in] [rocblas_int] number of bytes per element in the matrix

  • a[in] pointer to matrix on the host

  • lda[in] [rocblas_int] specifies the leading dimension of A, lda >= rows

  • b[out] pointer to matrix on the GPU

  • ldb[in] [rocblas_int] specifies the leading dimension of B, ldb >= rows

  • stream[in] specifies the stream into which this transfer request is queued

rocblas_status rocblas_get_matrix_async(rocblas_int rows, rocblas_int cols, rocblas_int elem_size, const void *a, rocblas_int lda, void *b, rocblas_int ldb, hipStream_t stream)

asynchronously copy matrix from device to host

rocblas_get_matrix_async copies a matrix from device memory to pinned host memory asynchronously. Memory on the host must be allocated with hipHostMalloc or the transfer will be synchronous.

Parameters
  • rows[in] [rocblas_int] number of rows in matrices

  • cols[in] [rocblas_int] number of columns in matrices

  • elem_size[in] [rocblas_int] number of bytes per element in the matrix

  • a[in] pointer to matrix on the GPU

  • lda[in] [rocblas_int] specifies the leading dimension of A, lda >= rows

  • b[out] pointer to matrix on the host

  • ldb[in] [rocblas_int] specifies the leading dimension of B, ldb >= rows

  • stream[in] specifies the stream into which this transfer request is queued

void rocblas_initialize(void)

Initialize rocBLAS on the current HIP device, to avoid costly startup time at the first call on that device.

Calling rocblas_initialize() allows upfront initialization including device specific kernel setup. Otherwise this function is automatically called on the first function call that requires these initializations (mainly GEMM).

const char *rocblas_status_to_string(rocblas_status status)

BLAS Auxiliary API

rocblas_status_to_string

Returns string representing rocblas_status value

Parameters

status[in] [rocblas_status] rocBLAS status to convert to string

3.3.2. Device Memory Allocation Functions

rocblas_status rocblas_start_device_memory_size_query(rocblas_handle handle)

Indicates that subsequent rocBLAS kernel calls should collect the optimal device memory size in bytes for their given kernel arguments and keep track of the maximum. Each kernel call can reuse temporary device memory on the same stream so the maximum is collected. Returns rocblas_status_size_query_mismatch if another size query is already in progress; returns rocblas_status_success otherwise

Parameters

handle[in] rocblas handle

rocblas_status rocblas_stop_device_memory_size_query(rocblas_handle handle, size_t *size)

Stops collecting optimal device memory size information. Returns rocblas_status_size_query_mismatch if a collection is not underway; rocblas_status_invalid_handle if handle is nullptr; rocblas_status_invalid_pointer if size is nullptr; rocblas_status_success otherwise

Parameters
  • handle[in] rocblas handle

  • size[out] maximum of the optimal sizes collected

rocblas_status rocblas_get_device_memory_size(rocblas_handle handle, size_t *size)

Gets the current device memory size for the handle. Returns rocblas_status_invalid_handle if handle is nullptr; rocblas_status_invalid_pointer if size is nullptr; rocblas_status_success otherwise

Parameters
  • handle[in] rocblas handle

  • size[out] current device memory size for the handle

rocblas_status rocblas_set_device_memory_size(rocblas_handle handle, size_t size)

Changes the size of allocated device memory at runtime.

Any previously allocated device memory managed by the handle is freed.

If size > 0 sets the device memory size to the specified size (in bytes). If size == 0, frees the memory allocated so far, and lets rocBLAS manage device memory in the future, expanding it when necessary. Returns rocblas_status_invalid_handle if handle is nullptr; rocblas_status_invalid_pointer if size is nullptr; rocblas_status_success otherwise

Parameters
  • handle[in] rocblas handle

  • size[in] size of allocated device memory

rocblas_status rocblas_set_workspace(rocblas_handle handle, void *addr, size_t size)

Sets the device workspace for the handle to use.

Any previously allocated device memory managed by the handle is freed.

Returns rocblas_status_invalid_handle if handle is nullptr; rocblas_status_success otherwise

Parameters
  • handle[in] rocblas handle

  • addr[in] address of workspace memory

  • size[in] size of workspace memory

bool rocblas_is_managing_device_memory(rocblas_handle handle)

Returns true when device memory in handle is managed by rocBLAS

Parameters

handle[in] rocblas handle

bool rocblas_is_user_managing_device_memory(rocblas_handle handle)

Returns true when device memory in handle is managed by the user

Parameters

handle[in] rocblas handle

For more detailed informationt, refer to sections Device Memory Allocation in rocBLAS and Device Memory Allocation.

3.3.3. Build Information Functions

rocblas_status rocblas_get_version_string_size(size_t *len)

Queries the minimum buffer size for a successful call to rocblas_get_version_string.

Parameters

len[out] pointer to size_t for storing the length

rocblas_status rocblas_get_version_string(char *buf, size_t len)

Loads char* buf with the rocblas library version. size_t len is the maximum length of char* buf.

Parameters
  • buf[inout] pointer to buffer for version string

  • len[in] length of buf

3.4. rocBLAS Level-1 functions

3.4.1. rocblas_iXamax + batched, strided_batched

rocblas_status rocblas_isamax(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_idamax(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_icamax(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_izamax(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_int *result)

BLAS Level 1 API

amax finds the first index of the element of maximum magnitude of a vector x.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of y.

  • result[inout] device pointer or host pointer to store the amax index. return is 0.0 if n, incx<=0.

rocblas_status rocblas_isamax_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_idamax_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_icamax_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_izamax_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)

BLAS Level 1 API

amax_batched finds the first index of the element of maximum magnitude of each vector x_i in a batch, for i = 1, …, batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each vector x_i.

  • x[in] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • batch_count[in] [rocblas_int] number of instances in the batch. Must be > 0.

  • result[out] device or host array of pointers of batch_count size for results. return is 0 if n, incx<=0.

rocblas_status rocblas_isamax_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_idamax_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_icamax_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_izamax_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)

BLAS Level 1 API

amax_strided_batched finds the first index of the element of maximum magnitude of each vector x_i in a batch, for i = 1, …, batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each vector x_i.

  • x[in] device pointer to the first vector x_1.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • stridex[in] [rocblas_stride] specifies the pointer increment between one x_i and the next x_(i + 1).

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • result[out] device or host pointer for storing contiguous batch_count results. return is 0 if n <= 0, incx<=0.

3.4.2. rocblas_iXamin + batched, strided_batched

rocblas_status rocblas_isamin(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_idamin(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_icamin(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_int *result)
rocblas_status rocblas_izamin(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_int *result)

BLAS Level 1 API

amin finds the first index of the element of minimum magnitude of a vector x.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of y.

  • result[inout] device pointer or host pointer to store the amin index. return is 0.0 if n, incx<=0.

rocblas_status rocblas_isamin_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_idamin_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_icamin_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_izamin_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count, rocblas_int *result)

BLAS Level 1 API

amin_batched finds the first index of the element of minimum magnitude of each vector x_i in a batch, for i = 1, …, batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each vector x_i.

  • x[in] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • batch_count[in] [rocblas_int] number of instances in the batch. Must be > 0.

  • result[out] device or host pointers to array of batch_count size for results. return is 0 if n, incx<=0.

rocblas_status rocblas_isamin_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_idamin_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_icamin_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)
rocblas_status rocblas_izamin_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, rocblas_int *result)

BLAS Level 1 API

amin_strided_batched finds the first index of the element of minimum magnitude of each vector x_i in a batch, for i = 1, …, batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each vector x_i.

  • x[in] device pointer to the first vector x_1.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • stridex[in] [rocblas_stride] specifies the pointer increment between one x_i and the next x_(i + 1).

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • result[out] device or host pointer to array for storing contiguous batch_count results. return is 0 if n <= 0, incx<=0.

3.4.3. rocblas_Xasum + batched, strided_batched

rocblas_status rocblas_sasum(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)
rocblas_status rocblas_dasum(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)
rocblas_status rocblas_scasum(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, float *result)
rocblas_status rocblas_dzasum(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, double *result)

BLAS Level 1 API

asum computes the sum of the magnitudes of elements of a real vector x, or the sum of magnitudes of the real and imaginary parts of elements if x is a complex vector.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x and y.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of x. incx must be > 0.

  • result[inout] device pointer or host pointer to store the asum product. return is 0.0 if n <= 0.

rocblas_status rocblas_sasum_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, rocblas_int batch_count, float *results)
rocblas_status rocblas_dasum_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, rocblas_int batch_count, double *results)
rocblas_status rocblas_scasum_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count, float *results)
rocblas_status rocblas_dzasum_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count, double *results)

BLAS Level 1 API

asum_batched computes the sum of the magnitudes of the elements in a batch of real vectors x_i, or the sum of magnitudes of the real and imaginary parts of elements if x_i is a complex vector, for i = 1, …, batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each vector x_i.

  • x[in] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • results[out] device array or host array of batch_count size for results. return is 0.0 if n, incx<=0.

  • batch_count[in] [rocblas_int] number of instances in the batch.

rocblas_status rocblas_sasum_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, float *results)
rocblas_status rocblas_dasum_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, double *results)
rocblas_status rocblas_scasum_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, float *results)
rocblas_status rocblas_dzasum_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, double *results)

BLAS Level 1 API

asum_strided_batched computes the sum of the magnitudes of elements of a real vectors x_i, or the sum of magnitudes of the real and imaginary parts of elements if x_i is a complex vector, for i = 1, …, batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each vector x_i.

  • x[in] device pointer to the first vector x_1.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • stridex[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1). There are no restrictions placed on stride_x. However, ensure that stride_x is of appropriate size. For a typical case this means stride_x >= n * incx.

  • results[out] device pointer or host pointer to array for storing contiguous batch_count results. return is 0.0 if n, incx<=0.

  • batch_count[in] [rocblas_int] number of instances in the batch.

3.4.4. rocblas_Xaxpy + batched, strided_batched

rocblas_status rocblas_saxpy(rocblas_handle handle, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, float *y, rocblas_int incy)
rocblas_status rocblas_daxpy(rocblas_handle handle, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, double *y, rocblas_int incy)
rocblas_status rocblas_haxpy(rocblas_handle handle, rocblas_int n, const rocblas_half *alpha, const rocblas_half *x, rocblas_int incx, rocblas_half *y, rocblas_int incy)
rocblas_status rocblas_caxpy(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zaxpy(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy)

BLAS Level 1 API

axpy computes constant alpha multiplied by vector x, plus vector y:

y := alpha * x + y

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x and y.

  • alpha[in] device pointer or host pointer to specify the scalar alpha.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

  • y[out] device pointer storing vector y.

  • incy[inout] [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_saxpy_batched(rocblas_handle handle, rocblas_int n, const float *alpha, const float *const x[], rocblas_int incx, float *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_daxpy_batched(rocblas_handle handle, rocblas_int n, const double *alpha, const double *const x[], rocblas_int incx, double *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_haxpy_batched(rocblas_handle handle, rocblas_int n, const rocblas_half *alpha, const rocblas_half *const x[], rocblas_int incx, rocblas_half *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_caxpy_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_zaxpy_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 1 API

axpy_batched compute y := alpha * x + y over a set of batched vectors.

Parameters
  • handle[in] rocblas_handle handle to the rocblas library context queue.

  • n[in] rocblas_int

  • alpha[in] specifies the scalar alpha.

  • x[in] pointer storing vector x on the GPU.

  • incx[in] rocblas_int specifies the increment for the elements of x.

  • y[out] pointer storing vector y on the GPU.

  • incy[inout] rocblas_int specifies the increment for the elements of y.

  • batch_count[in] rocblas_int number of instances in the batch.

rocblas_status rocblas_saxpy_strided_batched(rocblas_handle handle, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, rocblas_stride stridex, float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_daxpy_strided_batched(rocblas_handle handle, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, rocblas_stride stridex, double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_haxpy_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_half *alpha, const rocblas_half *x, rocblas_int incx, rocblas_stride stridex, rocblas_half *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_caxpy_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_zaxpy_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)

BLAS Level 1 API

axpy_strided_batched compute y := alpha * x + y over a set of strided batched vectors.

Parameters
  • handle[in] rocblas_handle handle to the rocblas library context queue.

  • n[in] rocblas_int.

  • alpha[in] specifies the scalar alpha.

  • x[in] pointer storing vector x on the GPU.

  • incx[in] rocblas_int specifies the increment for the elements of x.

  • stridex[in] rocblas_stride specifies the increment between vectors of x.

  • y[out] pointer storing vector y on the GPU.

  • incy[inout] rocblas_int specifies the increment for the elements of y.

  • stridey[in] rocblas_stride specifies the increment between vectors of y.

  • batch_count[in] rocblas_int number of instances in the batch.

3.4.5. rocblas_Xcopy + batched, strided_batched

rocblas_status rocblas_scopy(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *y, rocblas_int incy)
rocblas_status rocblas_dcopy(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *y, rocblas_int incy)
rocblas_status rocblas_ccopy(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zcopy(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy)

BLAS Level 1 API

copy copies each element x[i] into y[i], for i = 1 , … , n:

y := x

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x to be copied to y.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

  • y[out] device pointer storing vector y.

  • incy[in] [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_scopy_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, float *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_dcopy_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, double *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_ccopy_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_zcopy_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 1 API

copy_batched copies each element x_i[j] into y_i[j], for j = 1 , … , n; i = 1 , … , batch_count:

y_i := x_i,
where (x_i, y_i) is the i-th instance of the batch.
x_i and y_i are vectors.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i to be copied to y_i.

  • x[in] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each vector x_i.

  • y[out] device array of device pointers storing each vector y_i.

  • incy[in] [rocblas_int] specifies the increment for the elements of each vector y_i.

  • batch_count[in] [rocblas_int] number of instances in the batch.

rocblas_status rocblas_scopy_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_dcopy_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_ccopy_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_zcopy_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)

BLAS Level 1 API

copy_strided_batched copies each element x_i[j] into y_i[j], for j = 1 , … , n; i = 1 , … , batch_count:

y_i := x_i,
where (x_i, y_i) is the i-th instance of the batch.
x_i and y_i are vectors.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i to be copied to y_i.

  • x[in] device pointer to the first vector (x_1) in the batch.

  • incx[in] [rocblas_int] specifies the increments for the elements of vectors x_i.

  • stridex[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1). There are no restrictions placed on stride_x. However, the user should take care to ensure that stride_x is of appropriate size. For a typical case, this means stride_x >= n * incx.

  • y[out] device pointer to the first vector (y_1) in the batch.

  • incy[in] [rocblas_int] specifies the increment for the elements of vectors y_i.

  • stridey[in] [rocblas_stride] stride from the start of one vector (y_i) and the next one (y_i+1). There are no restrictions placed on stride_y, However, ensure that stride_y is of appropriate size, for a typical case this means stride_y >= n * incy. stridey should be non zero.

  • batch_count[in] [rocblas_int] number of instances in the batch.

3.4.6. rocblas_Xdot + batched, strided_batched

rocblas_status rocblas_sdot(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *result)
rocblas_status rocblas_ddot(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *result)
rocblas_status rocblas_hdot(rocblas_handle handle, rocblas_int n, const rocblas_half *x, rocblas_int incx, const rocblas_half *y, rocblas_int incy, rocblas_half *result)
rocblas_status rocblas_bfdot(rocblas_handle handle, rocblas_int n, const rocblas_bfloat16 *x, rocblas_int incx, const rocblas_bfloat16 *y, rocblas_int incy, rocblas_bfloat16 *result)
rocblas_status rocblas_cdotu(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *y, rocblas_int incy, rocblas_float_complex *result)
rocblas_status rocblas_cdotc(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *y, rocblas_int incy, rocblas_float_complex *result)
rocblas_status rocblas_zdotu(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *y, rocblas_int incy, rocblas_double_complex *result)
rocblas_status rocblas_zdotc(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *y, rocblas_int incy, rocblas_double_complex *result)

BLAS Level 1 API

dot(u) performs the dot product of vectors x and y:

result = x * y;

dotc performs the dot product of the conjugate of complex vector x and complex vector y.

result = conjugate (x) * y;

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x and y.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of y.

  • y[in] device pointer storing vector y.

  • incy[in] [rocblas_int] specifies the increment for the elements of y.

  • result[inout] device pointer or host pointer to store the dot product. return is 0.0 if n <= 0.

rocblas_status rocblas_sdot_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, const float *const y[], rocblas_int incy, rocblas_int batch_count, float *result)
rocblas_status rocblas_ddot_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, const double *const y[], rocblas_int incy, rocblas_int batch_count, double *result)
rocblas_status rocblas_hdot_batched(rocblas_handle handle, rocblas_int n, const rocblas_half *const x[], rocblas_int incx, const rocblas_half *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_half *result)
rocblas_status rocblas_bfdot_batched(rocblas_handle handle, rocblas_int n, const rocblas_bfloat16 *const x[], rocblas_int incx, const rocblas_bfloat16 *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_bfloat16 *result)
rocblas_status rocblas_cdotu_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_float_complex *result)
rocblas_status rocblas_cdotc_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_float_complex *result)
rocblas_status rocblas_zdotu_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_double_complex *result)
rocblas_status rocblas_zdotc_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count, rocblas_double_complex *result)

BLAS Level 1 API

dot_batched(u) performs a batch of dot products of vectors x and y:

result_i = x_i * y_i;

dotc_batched performs a batch of dot products of the conjugate of complex vector x and complex vector y

result_i = conjugate (x_i) * y_i;
where (x_i, y_i) is the i-th instance of the batch.
x_i and y_i are vectors, for i = 1, ..., batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i and y_i.

  • x[in] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i.

  • y[in] device array of device pointers storing each vector y_i.

  • incy[in] [rocblas_int] specifies the increment for the elements of each y_i.

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • result[inout] device array or host array of batch_count size to store the dot products of each batch. return 0.0 for each element if n <= 0.

rocblas_status rocblas_sdot_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, const float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, float *result)
rocblas_status rocblas_ddot_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, const double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, double *result)
rocblas_status rocblas_hdot_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_half *x, rocblas_int incx, rocblas_stride stridex, const rocblas_half *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_half *result)
rocblas_status rocblas_bfdot_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_bfloat16 *x, rocblas_int incx, rocblas_stride stridex, const rocblas_bfloat16 *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_bfloat16 *result)
rocblas_status rocblas_cdotu_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_float_complex *result)
rocblas_status rocblas_cdotc_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_float_complex *result)
rocblas_status rocblas_zdotu_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_double_complex *result)
rocblas_status rocblas_zdotc_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count, rocblas_double_complex *result)

BLAS Level 1 API

dot_strided_batched(u) performs a batch of dot products of vectors x and y:

result_i = x_i * y_i;

dotc_strided_batched performs a batch of dot products of the conjugate of complex vector x and complex vector y

result_i = conjugate (x_i) * y_i;
where (x_i, y_i) is the i-th instance of the batch.
x_i and y_i are vectors, for i = 1, ..., batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i and y_i.

  • x[in] device pointer to the first vector (x_1) in the batch.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i.

  • stridex[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1).

  • y[in] device pointer to the first vector (y_1) in the batch.

  • incy[in] [rocblas_int] specifies the increment for the elements of each y_i.

  • stridey[in] [rocblas_stride] stride from the start of one vector (y_i) and the next one (y_i+1).

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • result[inout] device array or host array of batch_count size to store the dot products of each batch. return 0.0 for each element if n <= 0.

3.4.7. rocblas_Xnrm2 + batched, strided_batched

rocblas_status rocblas_snrm2(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, float *result)
rocblas_status rocblas_dnrm2(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, double *result)
rocblas_status rocblas_scnrm2(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, float *result)
rocblas_status rocblas_dznrm2(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, double *result)

BLAS Level 1 API

nrm2 computes the euclidean norm of a real or complex vector:

result := sqrt( x'*x ) for real vectors
result := sqrt( x**H*x ) for complex vectors

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of y.

  • result[inout] device pointer or host pointer to store the nrm2 product. return is 0.0 if n, incx<=0.

rocblas_status rocblas_snrm2_batched(rocblas_handle handle, rocblas_int n, const float *const x[], rocblas_int incx, rocblas_int batch_count, float *results)
rocblas_status rocblas_dnrm2_batched(rocblas_handle handle, rocblas_int n, const double *const x[], rocblas_int incx, rocblas_int batch_count, double *results)
rocblas_status rocblas_scnrm2_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count, float *results)
rocblas_status rocblas_dznrm2_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count, double *results)

BLAS Level 1 API

nrm2_batched computes the euclidean norm over a batch of real or complex vectors:

result := sqrt( x_i'*x_i ) for real vectors x, for i = 1, ..., batch_count
result := sqrt( x_i**H*x_i ) for complex vectors x, for i = 1, ..., batch_count

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each x_i.

  • x[in] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • results[out] device pointer or host pointer to array of batch_count size for nrm2 results. return is 0.0 for each element if n <= 0, incx<=0.

rocblas_status rocblas_snrm2_strided_batched(rocblas_handle handle, rocblas_int n, const float *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, float *results)
rocblas_status rocblas_dnrm2_strided_batched(rocblas_handle handle, rocblas_int n, const double *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, double *results)
rocblas_status rocblas_scnrm2_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, float *results)
rocblas_status rocblas_dznrm2_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_int batch_count, double *results)

BLAS Level 1 API

nrm2_strided_batched computes the euclidean norm over a batch of real or complex vectors:

result := sqrt( x_i'*x_i ) for real vectors x, for i = 1, ..., batch_count
result := sqrt( x_i**H*x_i ) for complex vectors, for i = 1, ..., batch_count

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each x_i.

  • x[in] device pointer to the first vector x_1.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i. incx must be > 0.

  • stridex[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1). There are no restrictions placed on stride_x. However, ensure that stride_x is of appropriate size. For a typical case this means stride_x >= n * incx.

  • batch_count[in] [rocblas_int] number of instances in the batch.

  • results[out] device pointer or host pointer to array for storing contiguous batch_count results. return is 0.0 for each element if n <= 0, incx<=0.

3.4.8. rocblas_Xrot + batched, strided_batched

rocblas_status rocblas_srot(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy, const float *c, const float *s)
rocblas_status rocblas_drot(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy, const double *c, const double *s)
rocblas_status rocblas_crot(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy, const float *c, const rocblas_float_complex *s)
rocblas_status rocblas_csrot(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy, const float *c, const float *s)
rocblas_status rocblas_zrot(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy, const double *c, const rocblas_double_complex *s)
rocblas_status rocblas_zdrot(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy, const double *c, const double *s)

BLAS Level 1 API

rot applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to vectors x and y. Scalars c and s may be stored in either host or device memory. Location is specified by calling rocblas_set_pointer_mode.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in the x and y vectors.

  • x[inout] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment between elements of x.

  • y[inout] device pointer storing vector y.

  • incy[in] [rocblas_int] specifies the increment between elements of y.

  • c[in] device pointer or host pointer storing scalar cosine component of the rotation matrix.

  • s[in] device pointer or host pointer storing scalar sine component of the rotation matrix.

rocblas_status rocblas_srot_batched(rocblas_handle handle, rocblas_int n, float *const x[], rocblas_int incx, float *const y[], rocblas_int incy, const float *c, const float *s, rocblas_int batch_count)
rocblas_status rocblas_drot_batched(rocblas_handle handle, rocblas_int n, double *const x[], rocblas_int incx, double *const y[], rocblas_int incy, const double *c, const double *s, rocblas_int batch_count)
rocblas_status rocblas_crot_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const y[], rocblas_int incy, const float *c, const rocblas_float_complex *s, rocblas_int batch_count)
rocblas_status rocblas_csrot_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const y[], rocblas_int incy, const float *c, const float *s, rocblas_int batch_count)
rocblas_status rocblas_zrot_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const y[], rocblas_int incy, const double *c, const rocblas_double_complex *s, rocblas_int batch_count)
rocblas_status rocblas_zdrot_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const y[], rocblas_int incy, const double *c, const double *s, rocblas_int batch_count)

BLAS Level 1 API

rot_batched applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to batched vectors x_i and y_i, for i = 1, …, batch_count. Scalars c and s may be stored in either host or device memory. Location is specified by calling rocblas_set_pointer_mode.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each x_i and y_i vectors.

  • x[inout] device array of deivce pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment between elements of each x_i.

  • y[inout] device array of device pointers storing each vector y_i.

  • incy[in] [rocblas_int] specifies the increment between elements of each y_i.

  • c[in] device pointer or host pointer to scalar cosine component of the rotation matrix.

  • s[in] device pointer or host pointer to scalar sine component of the rotation matrix.

  • batch_count[in] [rocblas_int] the number of x and y arrays, i.e. the number of batches.

rocblas_status rocblas_srot_strided_batched(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, rocblas_stride stride_x, float *y, rocblas_int incy, rocblas_stride stride_y, const float *c, const float *s, rocblas_int batch_count)
rocblas_status rocblas_drot_strided_batched(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, rocblas_stride stride_x, double *y, rocblas_int incy, rocblas_stride stride_y, const double *c, const double *s, rocblas_int batch_count)
rocblas_status rocblas_crot_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stride_y, const float *c, const rocblas_float_complex *s, rocblas_int batch_count)
rocblas_status rocblas_csrot_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stride_y, const float *c, const float *s, rocblas_int batch_count)
rocblas_status rocblas_zrot_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stride_y, const double *c, const rocblas_double_complex *s, rocblas_int batch_count)
rocblas_status rocblas_zdrot_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stride_y, const double *c, const double *s, rocblas_int batch_count)

BLAS Level 1 API

rot_strided_batched applies the Givens rotation matrix defined by c=cos(alpha) and s=sin(alpha) to strided batched vectors x_i and y_i, for i = 1, …, batch_count. Scalars c and s may be stored in either host or device memory, location is specified by calling rocblas_set_pointer_mode.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in each x_i and y_i vectors.

  • x[inout] device pointer to the first vector x_1.

  • incx[in] [rocblas_int] specifies the increment between elements of each x_i.

  • stride_x[in] [rocblas_stride] specifies the increment from the beginning of x_i to the beginning of x_(i+1).

  • y[inout] device pointer to the first vector y_1.

  • incy[in] [rocblas_int] specifies the increment between elements of each y_i.

  • stride_y[in] [rocblas_stride] specifies the increment from the beginning of y_i to the beginning of y_(i+1)

  • c[in] device pointer or host pointer to scalar cosine component of the rotation matrix.

  • s[in] device pointer or host pointer to scalar sine component of the rotation matrix.

  • batch_count[in] [rocblas_int] the number of x and y arrays, i.e. the number of batches.

3.4.9. rocblas_Xrotg + batched, strided_batched

rocblas_status rocblas_srotg(rocblas_handle handle, float *a, float *b, float *c, float *s)
rocblas_status rocblas_drotg(rocblas_handle handle, double *a, double *b, double *c, double *s)
rocblas_status rocblas_crotg(rocblas_handle handle, rocblas_float_complex *a, rocblas_float_complex *b, float *c, rocblas_float_complex *s)
rocblas_status rocblas_zrotg(rocblas_handle handle, rocblas_double_complex *a, rocblas_double_complex *b, double *c, rocblas_double_complex *s)

BLAS Level 1 API

rotg creates the Givens rotation matrix for the vector (a b). Scalars c and s and arrays a and b may be stored in either host or device memory, location is specified by calling rocblas_set_pointer_mode:

  • If the pointer mode is set to rocblas_pointer_mode_host, then this function blocks the CPU until the GPU has finished and the results are available in host memory.

  • If the pointer mode is set to rocblas_pointer_mode_device, then this function returns immediately and synchronization is required to read the results.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • a[inout] device pointer or host pointer to input vector element, overwritten with r.

  • b[inout] device pointer or host pointer to input vector element, overwritten with z.

  • c[inout] device pointer or host pointer to cosine element of Givens rotation.

  • s[inout] device pointer or host pointer sine element of Givens rotation.

rocblas_status rocblas_srotg_batched(rocblas_handle handle, float *const a[], float *const b[], float *const c[], float *const s[], rocblas_int batch_count)
rocblas_status rocblas_drotg_batched(rocblas_handle handle, double *const a[], double *const b[], double *const c[], double *const s[], rocblas_int batch_count)
rocblas_status rocblas_crotg_batched(rocblas_handle handle, rocblas_float_complex *const a[], rocblas_float_complex *const b[], float *const c[], rocblas_float_complex *const s[], rocblas_int batch_count)
rocblas_status rocblas_zrotg_batched(rocblas_handle handle, rocblas_double_complex *const a[], rocblas_double_complex *const b[], double *const c[], rocblas_double_complex *const s[], rocblas_int batch_count)

BLAS Level 1 API

rotg_batched creates the Givens rotation matrix for the batched vectors (a_i b_i), for i = 1, …, batch_count. a, b, c, and s may be stored in either host or device memory, location is specified by calling rocblas_set_pointer_mode:

  • If the pointer mode is set to rocblas_pointer_mode_host, then this function blocks the CPU until the GPU has finished and the results are available in host memory.

  • If the pointer mode is set to rocblas_pointer_mode_device, then this function returns immediately and synchronization is required to read the results.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • a[inout] device array of device pointers storing each single input vector element a_i, overwritten with r_i.

  • b[inout] device array of device pointers storing each single input vector element b_i, overwritten with z_i.

  • c[inout] device array of device pointers storing each cosine element of Givens rotation for the batch.

  • s[inout] device array of device pointers storing each sine element of Givens rotation for the batch.

  • batch_count[in] [rocblas_int] number of batches (length of arrays a, b, c, and s).

rocblas_status rocblas_srotg_strided_batched(rocblas_handle handle, float *a, rocblas_stride stride_a, float *b, rocblas_stride stride_b, float *c, rocblas_stride stride_c, float *s, rocblas_stride stride_s, rocblas_int batch_count)
rocblas_status rocblas_drotg_strided_batched(rocblas_handle handle, double *a, rocblas_stride stride_a, double *b, rocblas_stride stride_b, double *c, rocblas_stride stride_c, double *s, rocblas_stride stride_s, rocblas_int batch_count)
rocblas_status rocblas_crotg_strided_batched(rocblas_handle handle, rocblas_float_complex *a, rocblas_stride stride_a, rocblas_float_complex *b, rocblas_stride stride_b, float *c, rocblas_stride stride_c, rocblas_float_complex *s, rocblas_stride stride_s, rocblas_int batch_count)
rocblas_status rocblas_zrotg_strided_batched(rocblas_handle handle, rocblas_double_complex *a, rocblas_stride stride_a, rocblas_double_complex *b, rocblas_stride stride_b, double *c, rocblas_stride stride_c, rocblas_double_complex *s, rocblas_stride stride_s, rocblas_int batch_count)

BLAS Level 1 API

rotg_strided_batched creates the Givens rotation matrix for the strided batched vectors (a_i b_i), for i = 1, …, batch_count. a, b, c, and s may be stored in either host or device memory, location is specified by calling rocblas_set_pointer_mode:

  • If the pointer mode is set to rocblas_pointer_mode_host, then this function blocks the CPU until the GPU has finished and the results are available in host memory.

  • If the pointer mode is set to rocblas_pointer_mode_device, then this function returns immediately and synchronization is required to read the results.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • a[inout] device strided_batched pointer or host strided_batched pointer to first single input vector element a_1, overwritten with r.

  • stride_a[in] [rocblas_stride] distance between elements of a in batch (distance between a_i and a_(i + 1)).

  • b[inout] device strided_batched pointer or host strided_batched pointer to first single input vector element b_1, overwritten with z.

  • stride_b[in] [rocblas_stride] distance between elements of b in batch (distance between b_i and b_(i + 1)).

  • c[inout] device strided_batched pointer or host strided_batched pointer to first cosine element of Givens rotations c_1.

  • stride_c[in] [rocblas_stride] distance between elements of c in batch (distance between c_i and c_(i + 1)).

  • s[inout] device strided_batched pointer or host strided_batched pointer to sine element of Givens rotations s_1.

  • stride_s[in] [rocblas_stride] distance between elements of s in batch (distance between s_i and s_(i + 1)).

  • batch_count[in] [rocblas_int] number of batches (length of arrays a, b, c, and s).

3.4.10. rocblas_Xrotm + batched, strided_batched

rocblas_status rocblas_srotm(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy, const float *param)
rocblas_status rocblas_drotm(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy, const double *param)

BLAS Level 1 API

rotm applies the modified Givens rotation matrix defined by param to vectors x and y.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in the x and y vectors.

  • x[inout] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment between elements of x.

  • y[inout] device pointer storing vector y.

  • incy[in] [rocblas_int] specifies the increment between elements of y.

  • param[in] device vector or host vector of 5 elements defining the rotation.

    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    
    The flag parameter defines the form of H:
    
    flag = -1 => H = ( H11 H12 H21 H22 )
    flag =  0 => H = ( 1.0 H12 H21 1.0 )
    flag =  1 => H = ( H11 1.0 -1.0 H22 )
    flag = -2 => H = ( 1.0 0.0 0.0 1.0 )
    
    param may be stored in either host or device memory,
    location is specified by calling rocblas_set_pointer_mode.
    

rocblas_status rocblas_srotm_batched(rocblas_handle handle, rocblas_int n, float *const x[], rocblas_int incx, float *const y[], rocblas_int incy, const float *const param[], rocblas_int batch_count)
rocblas_status rocblas_drotm_batched(rocblas_handle handle, rocblas_int n, double *const x[], rocblas_int incx, double *const y[], rocblas_int incy, const double *const param[], rocblas_int batch_count)

BLAS Level 1 API

rotm_batched applies the modified Givens rotation matrix defined by param_i to batched vectors x_i and y_i, for i = 1, …, batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in the x and y vectors.

  • x[inout] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment between elements of each x_i.

  • y[inout] device array of device pointers storing each vector y_1.

  • incy[in] [rocblas_int] specifies the increment between elements of each y_i.

  • param[in] device array of device vectors of 5 elements defining the rotation.

    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    
    The flag parameter defines the form of H:
    
    flag = -1 => H = ( H11 H12 H21 H22 )
    flag =  0 => H = ( 1.0 H12 H21 1.0 )
    flag =  1 => H = ( H11 1.0 -1.0 H22 )
    flag = -2 => H = ( 1.0 0.0 0.0 1.0 )
    
    param may ONLY be stored on the device for the batched version of this function.
    

  • batch_count[in] [rocblas_int] the number of x and y arrays, i.e. the number of batches.

rocblas_status rocblas_srotm_strided_batched(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, rocblas_stride stride_x, float *y, rocblas_int incy, rocblas_stride stride_y, const float *param, rocblas_stride stride_param, rocblas_int batch_count)
rocblas_status rocblas_drotm_strided_batched(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, rocblas_stride stride_x, double *y, rocblas_int incy, rocblas_stride stride_y, const double *param, rocblas_stride stride_param, rocblas_int batch_count)

BLAS Level 1 API

rotm_strided_batched applies the modified Givens rotation matrix defined by param_i to strided batched vectors x_i and y_i, for i = 1, …, batch_count

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] number of elements in the x and y vectors.

  • x[inout] device pointer pointing to first strided batched vector x_1.

  • incx[in] [rocblas_int] specifies the increment between elements of each x_i.

  • stride_x[in] [rocblas_stride] specifies the increment between the beginning of x_i and x_(i + 1)

  • y[inout] device pointer pointing to first strided batched vector y_1.

  • incy[in] [rocblas_int] specifies the increment between elements of each y_i.

  • stride_y[in] [rocblas_stride] specifies the increment between the beginning of y_i and y_(i + 1).

  • param[in] device pointer pointing to first array of 5 elements defining the rotation (param_1).

    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    
    The flag parameter defines the form of H:
    
    flag = -1 => H = ( H11 H12 H21 H22 )
    flag =  0 => H = ( 1.0 H12 H21 1.0 )
    flag =  1 => H = ( H11 1.0 -1.0 H22 )
    flag = -2 => H = ( 1.0 0.0 0.0 1.0 )
    
    param may ONLY be stored on the device for the strided_batched
    version of this function.
    

  • stride_param[in] [rocblas_stride] specifies the increment between the beginning of param_i and param_(i + 1).

  • batch_count[in] [rocblas_int] the number of x and y arrays, i.e. the number of batches.

3.4.11. rocblas_Xrotmg + batched, strided_batched

rocblas_status rocblas_srotmg(rocblas_handle handle, float *d1, float *d2, float *x1, const float *y1, float *param)
rocblas_status rocblas_drotmg(rocblas_handle handle, double *d1, double *d2, double *x1, const double *y1, double *param)

BLAS Level 1 API

rotmg creates the modified Givens rotation matrix for the vector (d1 * x1, d2 * y1). Parameters may be stored in either host or device memory. Location is specified by calling rocblas_set_pointer_mode:

  • If the pointer mode is set to rocblas_pointer_mode_host, then this function blocks the CPU until the GPU has finished and the results are available in host memory.

  • If the pointer mode is set to rocblas_pointer_mode_device, then this function returns immediately and synchronization is required to read the results.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • d1[inout] device pointer or host pointer to input scalar that is overwritten.

  • d2[inout] device pointer or host pointer to input scalar that is overwritten.

  • x1[inout] device pointer or host pointer to input scalar that is overwritten.

  • y1[in] device pointer or host pointer to input scalar.

  • param[out] device vector or host vector of five elements defining the rotation.

    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    
    The flag parameter defines the form of H:
    
    flag = -1 => H = ( H11 H12 H21 H22 )
    flag =  0 => H = ( 1.0 H12 H21 1.0 )
    flag =  1 => H = ( H11 1.0 -1.0 H22 )
    flag = -2 => H = ( 1.0 0.0 0.0 1.0 )
    
    param may be stored in either host or device memory.
    Location is specified by calling rocblas_set_pointer_mode.
    

rocblas_status rocblas_srotmg_batched(rocblas_handle handle, float *const d1[], float *const d2[], float *const x1[], const float *const y1[], float *const param[], rocblas_int batch_count)
rocblas_status rocblas_drotmg_batched(rocblas_handle handle, double *const d1[], double *const d2[], double *const x1[], const double *const y1[], double *const param[], rocblas_int batch_count)

BLAS Level 1 API

rotmg_batched creates the modified Givens rotation matrix for the batched vectors (d1_i * x1_i, d2_i * y1_i), for i = 1, …, batch_count. Parameters may be stored in either host or device memory. Location is specified by calling rocblas_set_pointer_mode:

  • If the pointer mode is set to rocblas_pointer_mode_host, then this function blocks the CPU until the GPU has finished and the results are available in host memory.

  • If the pointer mode is set to rocblas_pointer_mode_device, then this function returns immediately and synchronization is required to read the results.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • d1[inout] device batched array or host batched array of input scalars that is overwritten.

  • d2[inout] device batched array or host batched array of input scalars that is overwritten.

  • x1[inout] device batched array or host batched array of input scalars that is overwritten.

  • y1[in] device batched array or host batched array of input scalars.

  • param[out] device batched array or host batched array of vectors of 5 elements defining the rotation.

    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    
    The flag parameter defines the form of H:
    
    flag = -1 => H = ( H11 H12 H21 H22 )
    flag =  0 => H = ( 1.0 H12 H21 1.0 )
    flag =  1 => H = ( H11 1.0 -1.0 H22 )
    flag = -2 => H = ( 1.0 0.0 0.0 1.0 )
    
    param may be stored in either host or device memory.
    Location is specified by calling rocblas_set_pointer_mode.
    

  • batch_count[in] [rocblas_int] the number of instances in the batch.

rocblas_status rocblas_srotmg_strided_batched(rocblas_handle handle, float *d1, rocblas_stride stride_d1, float *d2, rocblas_stride stride_d2, float *x1, rocblas_stride stride_x1, const float *y1, rocblas_stride stride_y1, float *param, rocblas_stride stride_param, rocblas_int batch_count)
rocblas_status rocblas_drotmg_strided_batched(rocblas_handle handle, double *d1, rocblas_stride stride_d1, double *d2, rocblas_stride stride_d2, double *x1, rocblas_stride stride_x1, const double *y1, rocblas_stride stride_y1, double *param, rocblas_stride stride_param, rocblas_int batch_count)

BLAS Level 1 API

rotmg_strided_batched creates the modified Givens rotation matrix for the strided batched vectors (d1_i * x1_i, d2_i * y1_i), for i = 1, …, batch_count. Parameters may be stored in either host or device memory. Location is specified by calling rocblas_set_pointer_mode:

  • If the pointer mode is set to rocblas_pointer_mode_host, then this function blocks the CPU until the GPU has finished and the results are available in host memory.

  • If the pointer mode is set to rocblas_pointer_mode_device, then this function returns immediately and synchronization is required to read the results.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • d1[inout] device strided_batched array or host strided_batched array of input scalars that is overwritten.

  • stride_d1[in] [rocblas_stride] specifies the increment between the beginning of d1_i and d1_(i+1).

  • d2[inout] device strided_batched array or host strided_batched array of input scalars that is overwritten.

  • stride_d2[in] [rocblas_stride] specifies the increment between the beginning of d2_i and d2_(i+1).

  • x1[inout] device strided_batched array or host strided_batched array of input scalars that is overwritten.

  • stride_x1[in] [rocblas_stride] specifies the increment between the beginning of x1_i and x1_(i+1).

  • y1[in] device strided_batched array or host strided_batched array of input scalars.

  • stride_y1[in] [rocblas_stride] specifies the increment between the beginning of y1_i and y1_(i+1).

  • param[out] device strided_batched array or host strided_batched array of vectors of 5 elements defining the rotation.

    param[0] = flag
    param[1] = H11
    param[2] = H21
    param[3] = H12
    param[4] = H22
    The flag parameter defines the form of H:
    
    flag = -1 => H = ( H11 H12 H21 H22 )
    flag =  0 => H = ( 1.0 H12 H21 1.0 )
    flag =  1 => H = ( H11 1.0 -1.0 H22 )
    flag = -2 => H = ( 1.0 0.0 0.0 1.0 )
    
    param may be stored in either host or device memory.
    Location is specified by calling rocblas_set_pointer_mode.
    

  • stride_param[in] [rocblas_stride] specifies the increment between the beginning of param_i and param_(i + 1).

  • batch_count[in] [rocblas_int] the number of instances in the batch.

3.4.12. rocblas_Xscal + batched, strided_batched

rocblas_status rocblas_sscal(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx)
rocblas_status rocblas_dscal(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx)
rocblas_status rocblas_cscal(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, rocblas_float_complex *x, rocblas_int incx)
rocblas_status rocblas_zscal(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, rocblas_double_complex *x, rocblas_int incx)
rocblas_status rocblas_csscal(rocblas_handle handle, rocblas_int n, const float *alpha, rocblas_float_complex *x, rocblas_int incx)
rocblas_status rocblas_zdscal(rocblas_handle handle, rocblas_int n, const double *alpha, rocblas_double_complex *x, rocblas_int incx)

BLAS Level 1 API

scal scales each element of vector x with scalar alpha:

x := alpha * x

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x.

  • alpha[in] device pointer or host pointer for the scalar alpha.

  • x[inout] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

rocblas_status rocblas_sscal_batched(rocblas_handle handle, rocblas_int n, const float *alpha, float *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_dscal_batched(rocblas_handle handle, rocblas_int n, const double *alpha, double *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_cscal_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_zscal_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_csscal_batched(rocblas_handle handle, rocblas_int n, const float *alpha, rocblas_float_complex *const x[], rocblas_int incx, rocblas_int batch_count)
rocblas_status rocblas_zdscal_batched(rocblas_handle handle, rocblas_int n, const double *alpha, rocblas_double_complex *const x[], rocblas_int incx, rocblas_int batch_count)

BLAS Level 1 API

scal_batched scales each element of vector x_i with scalar alpha, for i = 1, … , batch_count:

x_i := alpha * x_i,
where (x_i) is the i-th instance of the batch.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i.

  • alpha[in] host pointer or device pointer for the scalar alpha.

  • x[inout] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i.

  • batch_count[in] [rocblas_int] specifies the number of batches in x.

rocblas_status rocblas_sscal_strided_batched(rocblas_handle handle, rocblas_int n, const float *alpha, float *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_dscal_strided_batched(rocblas_handle handle, rocblas_int n, const double *alpha, double *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_cscal_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_float_complex *alpha, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_zscal_strided_batched(rocblas_handle handle, rocblas_int n, const rocblas_double_complex *alpha, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_csscal_strided_batched(rocblas_handle handle, rocblas_int n, const float *alpha, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)
rocblas_status rocblas_zdscal_strided_batched(rocblas_handle handle, rocblas_int n, const double *alpha, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, rocblas_int batch_count)

BLAS Level 1 API

scal_strided_batched scales each element of vector x_i with scalar alpha, for i = 1, … , batch_count:

x_i := alpha * x_i,
where (x_i) is the i-th instance of the batch.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i.

  • alpha[in] host pointer or device pointer for the scalar alpha.

  • x[inout] device pointer to the first vector (x_1) in the batch.

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

  • stride_x[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1). There are no restrictions placed on stride_x. However, ensure that stride_x is of appropriate size, for a typical case this means stride_x >= n * incx.

  • batch_count[in] [rocblas_int] specifies the number of batches in x.

3.4.13. rocblas_Xswap + batched, strided_batched

rocblas_status rocblas_sswap(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, float *y, rocblas_int incy)
rocblas_status rocblas_dswap(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, double *y, rocblas_int incy)
rocblas_status rocblas_cswap(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zswap(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_double_complex *y, rocblas_int incy)

BLAS Level 1 API

swap interchanges vectors x and y:

y := x;
x := y

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in x and y.

  • x[inout] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

  • y[inout] device pointer storing vector y.

  • incy[in] [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_sswap_batched(rocblas_handle handle, rocblas_int n, float *const x[], rocblas_int incx, float *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_dswap_batched(rocblas_handle handle, rocblas_int n, double *const x[], rocblas_int incx, double *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_cswap_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *const x[], rocblas_int incx, rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_zswap_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *const x[], rocblas_int incx, rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 1 API

swap_batched interchanges vectors x_i and y_i, for i = 1 , … , batch_count:

y_i := x_i;
x_i := y_i

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i and y_i.

  • x[inout] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i.

  • y[inout] device array of device pointers storing each vector y_i.

  • incy[in] [rocblas_int] specifies the increment for the elements of each y_i.

  • batch_count[in] [rocblas_int] number of instances in the batch.

rocblas_status rocblas_sswap_strided_batched(rocblas_handle handle, rocblas_int n, float *x, rocblas_int incx, rocblas_stride stridex, float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_dswap_strided_batched(rocblas_handle handle, rocblas_int n, double *x, rocblas_int incx, rocblas_stride stridex, double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_cswap_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_zswap_strided_batched(rocblas_handle handle, rocblas_int n, rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)

BLAS Level 1 API

swap_strided_batched interchanges vectors x_i and y_i, for i = 1 , … , batch_count:

y_i := x_i;
x_i := y_i

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • n[in] [rocblas_int] the number of elements in each x_i and y_i.

  • x[inout] device pointer to the first vector x_1.

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

  • stridex[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1). There are no restrictions placed on stride_x. However, ensure that stride_x is of appropriate size. For a typical case this means stride_x >= n * incx.

  • y[inout] device pointer to the first vector y_1.

  • incy[in] [rocblas_int] specifies the increment for the elements of y.

  • stridey[in] [rocblas_stride] stride from the start of one vector (y_i) and the next one (y_i+1). There are no restrictions placed on stride_x. However, ensure that stride_y is of appropriate size. For a typical case this means stride_y >= n * incy. stridey should be non zero.

  • batch_count[in] [rocblas_int] number of instances in the batch.

3.5. rocBLAS Level-2 functions

3.5.1. rocblas_Xgbmv + batched, strided_batched

rocblas_status rocblas_sgbmv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const float *alpha, const float *A, rocblas_int lda, const float *x, rocblas_int incx, const float *beta, float *y, rocblas_int incy)
rocblas_status rocblas_dgbmv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const double *alpha, const double *A, rocblas_int lda, const double *x, rocblas_int incx, const double *beta, double *y, rocblas_int incy)
rocblas_status rocblas_cgbmv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *beta, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zgbmv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *beta, rocblas_double_complex *y, rocblas_int incy)

BLAS Level 2 API

gbmv performs one of the matrix-vector operations:

y := alpha*A*x    + beta*y,   or
y := alpha*A**T*x + beta*y,   or
y := alpha*A**H*x + beta*y,
where alpha and beta are scalars, x and y are vectors and A is an
m by n banded matrix with kl sub-diagonals and ku super-diagonals.

Note that the empty elements which do not correspond to data will not be referenced.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • trans[in] [rocblas_operation] indicates whether matrix A is tranposed (conjugated) or not.

  • m[in] [rocblas_int] number of rows of matrix A.

  • n[in] [rocblas_int] number of columns of matrix A.

  • kl[in] [rocblas_int] number of sub-diagonals of A.

  • ku[in] [rocblas_int] number of super-diagonals of A.

  • alpha[in] device pointer or host pointer to scalar alpha.

  • A[in] device pointer storing banded matrix A. Leading (kl + ku + 1) by n part of the matrix contains the coefficients of the banded matrix. The leading diagonal resides in row (ku + 1) with the first super-diagonal above on the RHS of row ku. The first sub-diagonal resides below on the LHS of row ku + 2. This propagates up and down across sub/super-diagonals.

      Ex: (m = n = 7; ku = 2, kl = 2)
      1 2 3 0 0 0 0             0 0 3 3 3 3 3
      4 1 2 3 0 0 0             0 2 2 2 2 2 2
      5 4 1 2 3 0 0    ---->    1 1 1 1 1 1 1
      0 5 4 1 2 3 0             4 4 4 4 4 4 0
      0 0 5 4 1 2 0             5 5 5 5 5 0 0
      0 0 0 5 4 1 2             0 0 0 0 0 0 0
      0 0 0 0 5 4 1             0 0 0 0 0 0 0
    

  • lda[in] [rocblas_int] specifies the leading dimension of A. Must be >= (kl + ku + 1).

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

  • beta[in] device pointer or host pointer to scalar beta.

  • y[inout] device pointer storing vector y.

  • incy[in] [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_sgbmv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const float *alpha, const float *const A[], rocblas_int lda, const float *const x[], rocblas_int incx, const float *beta, float *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_dgbmv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const double *alpha, const double *const A[], rocblas_int lda, const double *const x[], rocblas_int incx, const double *beta, double *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_cgbmv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *beta, rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_zgbmv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *beta, rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 2 API

gbmv_batched performs one of the matrix-vector operations:

y_i := alpha*A_i*x_i    + beta*y_i,   or
y_i := alpha*A_i**T*x_i + beta*y_i,   or
y_i := alpha*A_i**H*x_i + beta*y_i,
where (A_i, x_i, y_i) is the i-th instance of the batch.
alpha and beta are scalars, x_i and y_i are vectors and A_i is an
m by n banded matrix with kl sub-diagonals and ku super-diagonals,
for i = 1, ..., batch_count.

Note that the empty elements which do not correspond to data will not be referenced.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • trans[in] [rocblas_operation] indicates whether matrix A is tranposed (conjugated) or not.

  • m[in] [rocblas_int] number of rows of each matrix A_i.

  • n[in] [rocblas_int] number of columns of each matrix A_i.

  • kl[in] [rocblas_int] number of sub-diagonals of each A_i.

  • ku[in] [rocblas_int] number of super-diagonals of each A_i.

  • alpha[in] device pointer or host pointer to scalar alpha.

  • A[in] device array of device pointers storing each banded matrix A_i. Leading (kl + ku + 1) by n part of the matrix contains the coefficients of the banded matrix. The leading diagonal resides in row (ku + 1) with the first super-diagonal above on the RHS of row ku. The first sub-diagonal resides below on the LHS of row ku + 2. This propagates up and down across sub/super-diagonals.

      Ex: (m = n = 7; ku = 2, kl = 2)
      1 2 3 0 0 0 0             0 0 3 3 3 3 3
      4 1 2 3 0 0 0             0 2 2 2 2 2 2
      5 4 1 2 3 0 0    ---->    1 1 1 1 1 1 1
      0 5 4 1 2 3 0             4 4 4 4 4 4 0
      0 0 5 4 1 2 0             5 5 5 5 5 0 0
      0 0 0 5 4 1 2             0 0 0 0 0 0 0
      0 0 0 0 5 4 1             0 0 0 0 0 0 0
    

  • lda[in] [rocblas_int] specifies the leading dimension of each A_i. Must be >= (kl + ku + 1)

  • x[in] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each x_i.

  • beta[in] device pointer or host pointer to scalar beta.

  • y[inout] device array of device pointers storing each vector y_i.

  • incy[in] [rocblas_int] specifies the increment for the elements of each y_i.

  • batch_count[in] [rocblas_int] specifies the number of instances in the batch.

rocblas_status rocblas_sgbmv_strided_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const float *alpha, const float *A, rocblas_int lda, rocblas_stride stride_A, const float *x, rocblas_int incx, rocblas_stride stride_x, const float *beta, float *y, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count)
rocblas_status rocblas_dgbmv_strided_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const double *alpha, const double *A, rocblas_int lda, rocblas_stride stride_A, const double *x, rocblas_int incx, rocblas_stride stride_x, const double *beta, double *y, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count)
rocblas_status rocblas_cgbmv_strided_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stride_x, const rocblas_float_complex *beta, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count)
rocblas_status rocblas_zgbmv_strided_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, rocblas_int kl, rocblas_int ku, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride stride_A, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stride_x, const rocblas_double_complex *beta, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stride_y, rocblas_int batch_count)

BLAS Level 2 API

gbmv_strided_batched performs one of the matrix-vector operations:

y_i := alpha*A_i*x_i    + beta*y_i,   or
y_i := alpha*A_i**T*x_i + beta*y_i,   or
y_i := alpha*A_i**H*x_i + beta*y_i,
where (A_i, x_i, y_i) is the i-th instance of the batch.
alpha and beta are scalars, x_i and y_i are vectors and A_i is an
m by n banded matrix with kl sub-diagonals and ku super-diagonals,
for i = 1, ..., batch_count.

Note that the empty elements which do not correspond to data will not be referenced.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • trans[in] [rocblas_operation] indicates whether matrix A is tranposed (conjugated) or not.

  • m[in] [rocblas_int] number of rows of matrix A.

  • n[in] [rocblas_int] number of columns of matrix A.

  • kl[in] [rocblas_int] number of sub-diagonals of A.

  • ku[in] [rocblas_int] number of super-diagonals of A.

  • alpha[in] device pointer or host pointer to scalar alpha.

  • A[in] device pointer to first banded matrix (A_1). Leading (kl + ku + 1) by n part of the matrix contains the coefficients of the banded matrix. The leading diagonal resides in row (ku + 1) with the first super-diagonal above on the RHS of row ku. The first sub-diagonal resides below on the LHS of row ku + 2. This propagates up and down across sub/super-diagonals.

      Ex: (m = n = 7; ku = 2, kl = 2)
      1 2 3 0 0 0 0             0 0 3 3 3 3 3
      4 1 2 3 0 0 0             0 2 2 2 2 2 2
      5 4 1 2 3 0 0    ---->    1 1 1 1 1 1 1
      0 5 4 1 2 3 0             4 4 4 4 4 4 0
      0 0 5 4 1 2 0             5 5 5 5 5 0 0
      0 0 0 5 4 1 2             0 0 0 0 0 0 0
      0 0 0 0 5 4 1             0 0 0 0 0 0 0
    

  • lda[in] [rocblas_int] specifies the leading dimension of A. Must be >= (kl + ku + 1).

  • stride_A[in] [rocblas_stride] stride from the start of one matrix (A_i) and the next one (A_i+1).

  • x[in] device pointer to first vector (x_1).

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

  • stride_x[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1).

  • beta[in] device pointer or host pointer to scalar beta.

  • y[inout] device pointer to first vector (y_1).

  • incy[in] [rocblas_int] specifies the increment for the elements of y.

  • stride_y[in] [rocblas_stride] stride from the start of one vector (y_i) and the next one (x_i+1).

  • batch_count[in] [rocblas_int] specifies the number of instances in the batch.

3.5.2. rocblas_Xgemv + batched, strided_batched

rocblas_status rocblas_sgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, const float *x, rocblas_int incx, const float *beta, float *y, rocblas_int incy)
rocblas_status rocblas_dgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, const double *x, rocblas_int incx, const double *beta, double *y, rocblas_int incy)
rocblas_status rocblas_cgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *beta, rocblas_float_complex *y, rocblas_int incy)
rocblas_status rocblas_zgemv(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *beta, rocblas_double_complex *y, rocblas_int incy)

BLAS Level 2 API

gemv performs one of the matrix-vector operations:

y := alpha*A*x    + beta*y,   or
y := alpha*A**T*x + beta*y,   or
y := alpha*A**H*x + beta*y,
where alpha and beta are scalars, x and y are vectors and A is an
m by n matrix.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • trans[in] [rocblas_operation] indicates whether matrix A is tranposed (conjugated) or not.

  • m[in] [rocblas_int] number of rows of matrix A.

  • n[in] [rocblas_int] number of columns of matrix A.

  • alpha[in] device pointer or host pointer to scalar alpha.

  • A[in] device pointer storing matrix A.

  • lda[in] [rocblas_int] specifies the leading dimension of A.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

  • beta[in] device pointer or host pointer to scalar beta.

  • y[inout] device pointer storing vector y.

  • incy[in] [rocblas_int] specifies the increment for the elements of y.

rocblas_status rocblas_sgemv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const float *alpha, const float *const A[], rocblas_int lda, const float *const x[], rocblas_int incx, const float *beta, float *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_dgemv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const double *alpha, const double *const A[], rocblas_int lda, const double *const x[], rocblas_int incx, const double *beta, double *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_cgemv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *const A[], rocblas_int lda, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *beta, rocblas_float_complex *const y[], rocblas_int incy, rocblas_int batch_count)
rocblas_status rocblas_zgemv_batched(rocblas_handle handle, rocblas_operation trans, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *const A[], rocblas_int lda, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *beta, rocblas_double_complex *const y[], rocblas_int incy, rocblas_int batch_count)

BLAS Level 2 API

gemv_batched performs a batch of matrix-vector operations:

y_i := alpha*A_i*x_i    + beta*y_i,   or
y_i := alpha*A_i**T*x_i + beta*y_i,   or
y_i := alpha*A_i**H*x_i + beta*y_i,
where (A_i, x_i, y_i) is the i-th instance of the batch.
alpha and beta are scalars, x_i and y_i are vectors and A_i is an
m by n matrix, for i = 1, ..., batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • trans[in] [rocblas_operation] indicates whether matrices A_i are tranposed (conjugated) or not.

  • m[in] [rocblas_int] number of rows of each matrix A_i.

  • n[in] [rocblas_int] number of columns of each matrix A_i.

  • alpha[in] device pointer or host pointer to scalar alpha.

  • A[in] device array of device pointers storing each matrix A_i.

  • lda[in] [rocblas_int] specifies the leading dimension of each matrix A_i.

  • x[in] device array of device pointers storing each vector x_i.

  • incx[in] [rocblas_int] specifies the increment for the elements of each vector x_i.

  • beta[in] device pointer or host pointer to scalar beta.

  • y[inout] device array of device pointers storing each vector y_i.

  • incy[in] [rocblas_int] specifies the increment for the elements of each vector y_i.

  • batch_count[in] [rocblas_int] number of instances in the batch.

rocblas_status rocblas_sgemv_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_int m, rocblas_int n, const float *alpha, const float *A, rocblas_int lda, rocblas_stride strideA, const float *x, rocblas_int incx, rocblas_stride stridex, const float *beta, float *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_dgemv_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_int m, rocblas_int n, const double *alpha, const double *A, rocblas_int lda, rocblas_stride strideA, const double *x, rocblas_int incx, rocblas_stride stridex, const double *beta, double *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_cgemv_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *A, rocblas_int lda, rocblas_stride strideA, const rocblas_float_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_float_complex *beta, rocblas_float_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)
rocblas_status rocblas_zgemv_strided_batched(rocblas_handle handle, rocblas_operation transA, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *A, rocblas_int lda, rocblas_stride strideA, const rocblas_double_complex *x, rocblas_int incx, rocblas_stride stridex, const rocblas_double_complex *beta, rocblas_double_complex *y, rocblas_int incy, rocblas_stride stridey, rocblas_int batch_count)

BLAS Level 2 API

gemv_strided_batched performs a batch of matrix-vector operations:

y_i := alpha*A_i*x_i    + beta*y_i,   or
y_i := alpha*A_i**T*x_i + beta*y_i,   or
y_i := alpha*A_i**H*x_i + beta*y_i,
where (A_i, x_i, y_i) is the i-th instance of the batch.
alpha and beta are scalars, x_i and y_i are vectors and A_i is an
m by n matrix, for i = 1, ..., batch_count.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • transA[in] [rocblas_operation] indicates whether matrices A_i are tranposed (conjugated) or not.

  • m[in] [rocblas_int] number of rows of matrices A_i.

  • n[in] [rocblas_int] number of columns of matrices A_i.

  • alpha[in] device pointer or host pointer to scalar alpha.

  • A[in] device pointer to the first matrix (A_1) in the batch.

  • lda[in] [rocblas_int] specifies the leading dimension of matrices A_i.

  • strideA[in] [rocblas_stride] stride from the start of one matrix (A_i) and the next one (A_i+1).

  • x[in] device pointer to the first vector (x_1) in the batch.

  • incx[in] [rocblas_int] specifies the increment for the elements of vectors x_i.

  • stridex[in] [rocblas_stride] stride from the start of one vector (x_i) and the next one (x_i+1). There are no restrictions placed on stride_x. However, ensure that stride_x is of appropriate size. When trans equals rocblas_operation_none this typically means stride_x >= n * incx, otherwise stride_x >= m * incx.

  • beta[in] device pointer or host pointer to scalar beta.

  • y[inout] device pointer to the first vector (y_1) in the batch.

  • incy[in] [rocblas_int] specifies the increment for the elements of vectors y_i.

  • stridey[in] [rocblas_stride] stride from the start of one vector (y_i) and the next one (y_i+1). There are no restrictions placed on stride_y. However, ensure that stride_y is of appropriate size. When trans equals rocblas_operation_none this typically means stride_y >= m * incy, otherwise stride_y >= n * incy. stridey should be non zero.

  • batch_count[in] [rocblas_int] number of instances in the batch.

3.5.3. rocblas_Xger + batched, strided_batched

rocblas_status rocblas_sger(rocblas_handle handle, rocblas_int m, rocblas_int n, const float *alpha, const float *x, rocblas_int incx, const float *y, rocblas_int incy, float *A, rocblas_int lda)
rocblas_status rocblas_dger(rocblas_handle handle, rocblas_int m, rocblas_int n, const double *alpha, const double *x, rocblas_int incx, const double *y, rocblas_int incy, double *A, rocblas_int lda)
rocblas_status rocblas_cgeru(rocblas_handle handle, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *y, rocblas_int incy, rocblas_float_complex *A, rocblas_int lda)
rocblas_status rocblas_zgeru(rocblas_handle handle, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *y, rocblas_int incy, rocblas_double_complex *A, rocblas_int lda)
rocblas_status rocblas_cgerc(rocblas_handle handle, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *x, rocblas_int incx, const rocblas_float_complex *y, rocblas_int incy, rocblas_float_complex *A, rocblas_int lda)
rocblas_status rocblas_zgerc(rocblas_handle handle, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *x, rocblas_int incx, const rocblas_double_complex *y, rocblas_int incy, rocblas_double_complex *A, rocblas_int lda)

BLAS Level 2 API

ger,geru,gerc performs the matrix-vector operations:

A := A + alpha*x*y**T , OR
A := A + alpha*x*y**H for gerc
where alpha is a scalar, x and y are vectors, and A is an
m by n matrix.

Parameters
  • handle[in] [rocblas_handle] handle to the rocblas library context queue.

  • m[in] [rocblas_int] the number of rows of the matrix A.

  • n[in] [rocblas_int] the number of columns of the matrix A.

  • alpha[in] device pointer or host pointer to scalar alpha.

  • x[in] device pointer storing vector x.

  • incx[in] [rocblas_int] specifies the increment for the elements of x.

  • y[in] device pointer storing vector y.

  • incy[in] [rocblas_int] specifies the increment for the elements of y.

  • A[inout] device pointer storing matrix A.

  • lda[in] [rocblas_int] specifies the leading dimension of A.

rocblas_status rocblas_sger_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const float *alpha, const float *const x[], rocblas_int incx, const float *const y[], rocblas_int incy, float *const A[], rocblas_int lda, rocblas_int batch_count)
rocblas_status rocblas_dger_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const double *alpha, const double *const x[], rocblas_int incx, const double *const y[], rocblas_int incy, double *const A[], rocblas_int lda, rocblas_int batch_count)
rocblas_status rocblas_cgeru_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *const y[], rocblas_int incy, rocblas_float_complex *const A[], rocblas_int lda, rocblas_int batch_count)
rocblas_status rocblas_zgeru_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *const y[], rocblas_int incy, rocblas_double_complex *const A[], rocblas_int lda, rocblas_int batch_count)
rocblas_status rocblas_cgerc_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const rocblas_float_complex *alpha, const rocblas_float_complex *const x[], rocblas_int incx, const rocblas_float_complex *const y[], rocblas_int incy, rocblas_float_complex *const A[], rocblas_int lda, rocblas_int batch_count)
rocblas_status rocblas_zgerc_batched(rocblas_handle handle, rocblas_int m, rocblas_int n, const rocblas_double_complex *alpha, const rocblas_double_complex *const x[], rocblas_int incx, const rocblas_double_complex *const y[], rocblas_int incy, rocblas_double_complex *const A[], rocblas_int lda, rocblas_int batch_count)

BLAS Level 2 API

ger_batched,geru_batched,gerc_batched perform a batch of the matrix-vector operations:

A := A + alpha*x*y**T , OR
A := A + alpha*x*y**H for gerc
where (A_i, x_i, y_i) is the i-th instance of the batch.
alpha is a scalar, x_i and y_i are vectors and A_i is an