OpenCL 2.0 Device command queue keeps filling up and halting execution

I am utilizing OpenCL’s enqueue_kernel() function to enqueue kernels dynamically from the GPU to reduce unnecessary host interactions. Here is a simplified example of what I am trying to do in the kernels:

kernel void kernelA(args)
{
    //This kernel is the one that is enqueued from the host, with only one work item. This kernel
    //could be considered the "master" kernel that controls the logic of when to enqueue tasks
    //First, it checks if a condition is met, then it enqueues kernelB

    if (some condition)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelB(args);});
    }
    else
    {
        //do other things
    }
}

kernel void kernelB(args)
{
    //Do some stuff

    //Only enqueue the next kernel with the first work item. I do this because the things
    //occurring in kernelC rely on the things that kernelB does, so it must take place after kernelB is completed,
    //hence, the CLK_ENQUEUE_FLAGS_WAIT_KERNEL
    if (get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
    }
}

kernel void kernelC(args)
{
    //Do some stuff. This one in particular is one step in a sorting algorithm

    //This kernel will enqueue kernelD if a condition is met, otherwise it will
    //return to kernelA 
    if (get_global_id(0) == 0 && other requirements)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelD(args);});
    }
    else if (get_global_id(0) == 0)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
    }
}

kernel void kernelD(args)
{
    //Do some stuff

    //Finally, if some condition is met, enqueue kernelC again. What this will do is it will
    //bounce back and forth between kernelC and kernelD until the condition is
    //no longer met. If it isn't met, go back to kernelA
    if (some condition)
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
    }
    else
    {
        enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
    }
}

So that is the general flow of the program, and it works perfectly and does exactly as I intended it to do, in the exact order I intended it to do it in, except for one issue. In certain cases when the workload is very high, a random one of the enqueue_kernel()s will fail to enqueue and halt the program. This happens because the device queue is full, and it cannot fit another task into it. But I cannot for the life of me figure out why this is, even after extensive research.

I thought that once a task in the queue (a kernel for instance) is finished, it would free up that spot in the queue. So my queue should really only reach a max of like 1 or 2 tasks at a time. But this program will literally fill up the entire 262,144 byte size of the device command queue, and stop functioning.

I would greatly appreciate some potential insight as to why this is happening if anyone has any ideas. I am sort of stuck and cannot continue until I get past this issue.

Thank you in advance!

(BTW I am running on a Radeon RX 590 card, and am using the AMD APP SDK 3.0 to use with OpenCL 2.0)