Tiling kernel for MatMul in SYCL/oneAPI

Following an excellent blog post by Cedric Nugteren, I am trying to implement a similar kernel to accelerate matrix multiplication using SYCL instead of OpenCL. His tiled implementation is shown below.

    // Thread identifiers
    const int row = get_local_id(0); // Local row ID (max: TS)
    const int col = get_local_id(1); // Local col ID (max: TS)
    const int globalRow = TS*get_group_id(0) + row; // Row ID of C (0..M)
    const int globalCol = TS*get_group_id(1) + col; // Col ID of C (0..N)
 
    // Local memory to fit a tile of TS*TS elements of A and B
    __local float Asub[TS][TS];
    __local float Bsub[TS][TS];
 
    // Initialise the accumulation register
    float acc = 0.0f;
    
    // Loop over all tiles
    const int numTiles = K/TS;
    for (int t=0; t<numTiles; t++) {
 
        // Load one tile of A and B into local memory
        const int tiledRow = TS*t + row;
        const int tiledCol = TS*t + col;
        Asub[col][row] = A[tiledCol*M + globalRow];
        Bsub[col][row] = B[globalCol*K + tiledRow];
 
        // Synchronise to make sure the tile is loaded
        barrier(CLK_LOCAL_MEM_FENCE);
 
        // Perform the computation for a single tile
        for (int k=0; k<TS; k++) {
            acc += Asub[k][row] * Bsub[col][k];
        }
 
        // Synchronise before loading the next tile
        barrier(CLK_LOCAL_MEM_FENCE);
    }
 
    // Store the final result in C
    C[globalCol*M + globalRow] = acc;

How can I create __local buffers in SYCL? Is there any specific keyword for it? Also, what is the equivalent of barrier(...) to ensure the local tiles are fully loaded before the computation?

Currently, my naive matrix multiplication (similar to Cedric’s naive implementation) looks like this:

    size_t M = a_host.rows();
    size_t N = a_host.cols();
    size_t P = b_host.cols();
    buffer<double, 2> a(a_host.data(), range<2>{M, N});
    buffer<double, 2> b(b_host.data(), range<2>{N, P});
    buffer<double, 2> c(c_gpu.data(), range<2>{M, P});
    PROFILE_SCOPE("Starting Multiply on GPU");
    std::cout << "GPU::Multiplying A and B into C.\n";
    auto e = q.submit([&](handler& h) {
        auto A = a.get_access<access::mode::read>(h);
        auto B = b.get_access<access::mode::read>(h);
        auto C = c.get_access<access::mode::write>(h);
        // buffer.get_range() returns the dimensions (shape) of the buffer/array
        int num_cols_a = a.get_range()[1];

        h.parallel_for(range<2>{M, P}, [=](id<2> index) {
            // Thread identifiers
            int row = index[0];
            int col = index[1];
            auto sum = 0.0;
            // Compute result of ONE element of C
            for (int i = 0; i < num_cols_a; i++)
                sum += A[row][i] * B[i][col];
            C[index] = sum;
        });
    });
    e.wait();

I am new to this. Can you explain what changes I need to make to implement a tiled version of matmul?

First, for this kind of parallelism, where you manage work-groups and work-group allocations (the “local” memory), you need either a parallel_for taking an nd_range instead of a range, or the hierarchical data-parallel kernels, which look more like OpenMP programming (see p. 258 of the SYCL 2020 provisional specification) but unfortunately have some implementation problems today; the SYCL committee is working on solving them.
So, let’s focus only on the nd_range approach. You need to allocate the local memory by using a local accessor (SYCL 2020 provisional, 4.7.6.11) in the command group.
To synchronize the work-items inside a work-group, you need to use a work-group barrier.
See example on p. 256 of the provisional specification.
I guess you can find some information in some SYCL tutorials too.
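To make that concrete, here is a rough sketch (not a tested implementation; it uses the SYCL 2020 `local_accessor` alias, which replaces the older `accessor<..., access::target::local>` spelling, and assumes `TS`, `M`, `N` are defined as in your code):

```cpp
q.submit([&](sycl::handler& h) {
  // Equivalent of OpenCL __local: one tile per work-group
  sycl::local_accessor<float, 2> Asub{sycl::range<2>{TS, TS}, h};
  sycl::local_accessor<float, 2> Bsub{sycl::range<2>{TS, TS}, h};

  h.parallel_for(sycl::nd_range<2>{{M, N}, {TS, TS}},
                 [=](sycl::nd_item<2> item) {
    // ... load one tile of A and B into Asub/Bsub ...

    // Equivalent of barrier(CLK_LOCAL_MEM_FENCE):
    sycl::group_barrier(item.get_group());
    // (item.barrier(sycl::access::fence_space::local_space) in SYCL 1.2.1)

    // ... accumulate over the tile, barrier again, store to C ...
  });
});
```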

Keep us posted with your results about your “syclBlast” library!

Thanks @keryell1, that was quite helpful.

I’ve been able to get a boilerplate working for tile-based matmul:

    /* M = N = P */
    /* window size TS x TS */
    size_t TS = 4;

    /* Create buffers */
    buffer<T, 1> a(a_host.data(), range<1>{a_host.size()});
    buffer<T, 1> b(b_host.data(), range<1>{b_host.size()});
    buffer<T, 1> c(c_gpu.data(), range<1>{c_gpu.size()});

    auto e = q.submit([&](handler& h) {
        /* Create accessors */
        auto A = a.template get_access<access::mode::read>(h);
        auto B = b.template get_access<access::mode::read>(h);
        auto C = c.template get_access<access::mode::write>(h);

        /* Local accessor TILES: hyperfast cache */
        accessor<T, 2, access::mode::read_write, access::target::local> Asub(range<2>{TS, TS}, h);
        accessor<T, 2, access::mode::read_write, access::target::local> Bsub(range<2>{TS, TS}, h);

        /* Create kernel */
        sycl::stream out(1024, 256, h);
        h.parallel_for(nd_range<2>(range<2>(M, P), range<2>(TS, TS)), [=](nd_item<2> item) {
            /* row, col thread identifier for each tile */
            size_t row = item.get_local_id(0);
            size_t col = item.get_local_id(1);

            /* row, col thread identifier of C */
            size_t globalRow = TS * item.get_global_id(0) + row;
            size_t globalCol = TS * item.get_global_id(1) + col;

            auto acc = 0;

            /* loop over all tiles */
            const size_t num_tiles = P / TS;
            for (size_t t = 0; t < num_tiles; t++) {
                /* Load one tile of A and B into cache */
                const size_t tiledRow = TS * t + row;
                const size_t tiledCol = TS * t + col;
                Asub[row][col] = A[globalRow * M + tiledRow];
                Bsub[row][col] = B[tiledCol * N + globalCol];

                /* Barrier to sync the read-write */
                item.barrier(access::fence_space::local_space);

                /* Do the matmul between Asub and Bsub */
                for (size_t k = 0; k < TS; k++) {
                    acc += Asub[row][k] * Bsub[k][col];
                }

                /* Barrier to sync the read-write */
                item.barrier(access::fence_space::local_space);
                /* Write from cache to global memory */
                C[globalRow * M + globalCol] = acc;
            }

        });

    });
    e.wait();

But now I seem to have a problem finding the equivalent of get_group_id(0) in SYCL. One option I found was item.get_group().get_id(0), but that didn’t give me correct results either.

Also, one important detail: I am doing the matmul in row-major format, as opposed to the column-major format Cedric uses in his OpenCL implementation.

Any suggestions, what is going wrong here?

I have some issues understanding this:

size_t globalRow = TS * item.get_global_id(0) + row;
size_t globalCol = TS * item.get_global_id(1) + col;

What are you trying to compute here?
These are clearly out of bounds.

auto acc = 0;

means acc is an int… Do you mean T acc = 0; here?
And should it be inside the next loop instead?

Do you have a link to the blog you are looking at?
For such low-level code it should not be that different from the original, and I guess the blog explains the big picture, because the kernel-launching code is missing from your snippet…

Hi @keryell1,

I got it working, here’s where I went wrong:

  1. Fix #1: On your question about what I was computing here,

    size_t globalRow = TS * item.get_global_id(0) + row;
    size_t globalCol = TS * item.get_global_id(1) + col;
    

    These are intended to be the “global” thread identifiers that iterate over the dimensions of C (the result matrix).
    Here is the link I am trying to replicate tiling from: https://cnugteren.github.io/tutorial/pages/page4.html

    Cedric used get_group_id(0) in his OpenCL kernel, and after a bit of reading the provisional spec and trying various functions, I found that item.get_group().get_id(0) is the equivalent of the OpenCL call.

    So, it should look like:

    // correct way
    size_t globalRow = TS * item.get_group().get_id(0) + row;
    size_t globalCol = TS * item.get_group().get_id(1) + col;
    
  2. On auto acc = 0;, yes, I can replace it with T acc = 0;. It is the accumulator; it stores the result computed by each item in the tile, whatever the datatype of the vector. Thanks for pointing this out.

  3. Fix #2: I had a small bug in how the tiledRow and tiledCol variables were used for indexing.

    // before
    Asub[row][col] = A[globalRow * M + tiledRow];
    Bsub[row][col] = B[tiledCol * N + globalCol];

    // after
    Asub[row][col] = A[globalRow * M + tiledCol];
    Bsub[row][col] = B[tiledRow * N + globalCol];
    

I was able to get a noticeable speedup of ~2x compared to the naive version; I am very happy to have been able to do this! :smiley:
Thanks @keryell1!

I have further questions on the kernel launching part, I will put that in another post of this thread.
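As a sanity check of the row-major indexing in the final kernel, here is a small host-only C++ emulation (no SYCL; plain nested loops stand in for work-group and work-item ids, and the local-tile reads are expanded inline) that can be compared against a naive matmul:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only emulation of the tiled kernel for square row-major matrices
// (M = N = P, M divisible by TS). The outer four loops stand in for the
// work-group and work-item ids; the local tiles are expanded inline:
// Asub[row][k] == A[globalRow * M + TS * t + k] and
// Bsub[k][col] == B[(TS * t + k) * M + globalCol].
std::vector<double> tiled_matmul(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::size_t M, std::size_t TS) {
    std::vector<double> C(M * M, 0.0);
    const std::size_t num_tiles = M / TS;
    for (std::size_t g0 = 0; g0 < num_tiles; ++g0) {         // group id 0
        for (std::size_t g1 = 0; g1 < num_tiles; ++g1) {     // group id 1
            for (std::size_t row = 0; row < TS; ++row) {     // local id 0
                for (std::size_t col = 0; col < TS; ++col) { // local id 1
                    const std::size_t globalRow = TS * g0 + row;
                    const std::size_t globalCol = TS * g1 + col;
                    double acc = 0.0;
                    for (std::size_t t = 0; t < num_tiles; ++t)
                        for (std::size_t k = 0; k < TS; ++k)
                            acc += A[globalRow * M + TS * t + k]
                                 * B[(TS * t + k) * M + globalCol];
                    C[globalRow * M + globalCol] = acc;
                }
            }
        }
    }
    return C;
}

// Naive reference, same row-major layout.
std::vector<double> naive_matmul(const std::vector<double>& A,
                                 const std::vector<double>& B,
                                 std::size_t M) {
    std::vector<double> C(M * M, 0.0);
    for (std::size_t i = 0; i < M; ++i)
        for (std::size_t j = 0; j < M; ++j)
            for (std::size_t k = 0; k < M; ++k)
                C[i * M + j] += A[i * M + k] * B[k * M + j];
    return C;
}
```

Both functions add the same products in the same k order, so their results should match exactly, not just approximately.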

So this is the meat of the kernel:

    T acc = 0;
    /* loop over all tiles */
    const size_t num_tiles = P / TS;
    for (size_t t = 0; t < num_tiles; t++) {
        /* Load one tile of A and B into cache */
        const size_t tiledRow = TS * t + row;
        const size_t tiledCol = TS * t + col;
        Asub[row][col] = A[globalRow * M + tiledCol];
        Bsub[row][col] = B[tiledRow * N + globalCol];

        /* Barrier to sync the read-write */
        item.barrier(access::fence_space::local_space);

        /* Do the matmul between Asub and Bsub */
        for (size_t k = 0; k < TS; k++) {
            acc += Asub[row][k] * Bsub[k][col];
        }

        /* Barrier to sync the read-write */
        item.barrier(access::fence_space::local_space);
        // Option-1: write to C here
    }
    /* Write from cache to global memory */
    // Option-2: write to C outside the loop
    C[globalRow * M + globalCol] = acc; // Why do both work?

Now, my instinct says the final write of acc back to C should be inside the loop that iterates over the tiles (just after item.barrier(...)), so that the accumulated value is written back after each tile.

But, weirdly, the result is the same (and correct), regardless of whether I place it inside or outside the num-tiles for-loop.
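One way to see why both placements agree, sketched in plain C++ with no SYCL involved (a hypothetical `accumulate_tiles` helper that models a single work-item): the write inside the loop stores a running partial sum that the last tile's iteration overwrites with the complete sum, which is exactly the value the single write outside the loop stores.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulate one work-item's accumulation over a.size() / TS tiles.
// The write into c_inside mimics Option-1 (store after every tile);
// the returned value mimics Option-2 (single store after the loop).
double accumulate_tiles(const std::vector<double>& a,
                        const std::vector<double>& b,
                        std::size_t TS, double& c_inside) {
    double acc = 0.0;
    const std::size_t num_tiles = a.size() / TS;
    for (std::size_t t = 0; t < num_tiles; ++t) {
        for (std::size_t k = 0; k < TS; ++k)
            acc += a[TS * t + k] * b[TS * t + k];
        c_inside = acc;  // Option-1: partial sum, overwritten every tile
    }
    return acc;          // Option-2: complete sum
}
```

Either way, the last value stored is the full sum; Option-1 just performs redundant intermediate stores of partial results.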

Is it that everything inside a h.parallel_for is concurrently executed?

Any rule of thumb for reasoning about concurrent execution alongside sequential for-loops in a kernel? I fear I am lacking in-depth knowledge of work-groups/threads. It would be great if someone could give me a one-shot understanding of this.

Thanks