Matrix Multiplication using 2D std::vector

This is my kernel and cgh for simple matrix multiplication of 2D std::vector.

void MatrixMulParallel(queue& q, 
	const std::vector<std::vector<double>>& a_host,
	const std::vector<std::vector<double>>& b_host,
	std::vector<std::vector<double>>& c_gpu) {
		1. Create buffers for array
		2. Create a command group containing kernel as a lambda
		3. Give access permissions inside the submission
	try {
		size_t M = a_host.size();
		size_t N = a_host[0].size();
		size_t P = b_host[0].size();
		buffer a(, range<2>{M, N});
		buffer b(, range<2>{N, P});
		buffer c(, range<2>{M, P});
		PROFILE_SCOPE("Starting Multiply on GPU");
		std::cout << "GPU::Multiplying A and B into C.\n";
		auto e = q.submit([&](handler& h) {

			auto A = a.get_access<access::mode::read>(h);
			auto B = b.get_access<access::mode::read>(h);
			auto C = c.get_access<access::mode::write>(h);

			h.parallel_for(range<2>{M, P}, [=](id<2> index) {
				// index[0] allows accessing ROW index, index[1] is column index
				int row = index[0];
				int col = index[1];
				auto sum = 0.0;
				for (int i = 0; i < N; i++)
					sum += A[row][i] * B[i][col]; // Error #1
				C[index] = sum; // Error #2
	catch (sycl::exception const& e) {
		std::cout << "An exception is caught while multiplying matrices.\n";

I get two errors, indicated along the lines:

  1. Error #1: invalid operands to binary expression ('const std::vector<double, std::allocator<double>>' and 'const std::vector<double, std::allocator<double>>')
  2. Error #2: no viable overloaded '='

Any help would be appreciated.

Hi @MasterSkepticista,

I believe the problem here is that function parameters are std::vector<std::vector<double>> which means data() is returning a std::vector<double> and therefore the buffer element types are deduced as std::vector<double>.

This explains the syntax errors you are seeing as the accessors are retrieving elements of std::vector<double> and being multiplied or assigned a double scalar value.

On a more general note, this approach to representing a 2-dimensional span of data is problematic when using SYCL, as the storage of a std::vector is heap-allocated in order to allow it to be resized, the overall structure is non-contiguous, which is a requirement for moving data to and from a device via a SYCL buffer.

You could try having a single std::vector<double> and performing the linearization from 2-d space in the host application manually. Alternatively, you could try using a std::vector<std::array<double, M>> since std::array is stack-allocated, however for this approach, the matrix width would have to be known statically.

I hope this helps.



Thanks Gordon,

A couple queries:

  1. I agree to use of contiguous data structure like array instead of vector. I was able to run the program with arrays, size statically known. When you say ‘problematic’, do you mean runtime problems or it refers to something that can not be solved to be syntactically correct?
  2. How do frameworks handle dynamically sized arrays, yet parallelize computations?
  3. As I’m also brushing up my C++ skills, can you provide any references or suggestions on how to implement MatMul using Vectors (or dynamically sized arrays)?


I finally deduced that using 2D vector is problematic for SYCL, as pointed out by @gordonbrown589. I used a 1D vector (flattened) and accessed using linear way of accessing flattened 2D array.

I posted an answer here:
I hope it helps some in future.

More generally, avoid non compact data structure when doing HPC. It is less friendly for the memory hierarchy than contiguous array elements and the initialization is complex. Use instead things similar to md_span and md_array (basically Fortran arrays on steroids :slight_smile: ).

Hi @MasterSkepticista,

I’m glad you were able to resolve this, and thanks for creating the StackOverflow question.

To answer your earlier questions, non-contiguous data is not supported by SYCL buffers because the SYCL runtime needs to be able to marshal the data between the host and device memory regions, and it cannot deduce the data layout within host memory if it’s non-contiguous.

SYCL encourages contiguous data as it’s generally much less efficient to copy data if it involves pointer indirections and multiple distinct copies, having fragmented can also make it difficult to achieve pinned memory and other such optimizations. Additionally, once in device memory such as on a GPU having data fragmented rather than contiguous can also interfere with common optimizations such as ensuring coalesced global memory access.

In terms of how best to represent these data structures in C++, I would recommend as you’ve done, using a std::vector where the multi-dimensionality is linearized into a single linear allocation. Unfortunately, C++ does not yet have a way to represent data in a multi-dimensional space, though this is in the works in the form of std::md_span, as @keryell1 mentioned.

I hope this helps.