Big Data with OpenCL

#1

Hi all, sorry if this question is silly. I’m new to this language and only know the basics, but I want to improve my skills.
I need to load and manage a very large CSV file. The file is roughly 50GB, so it’s impossible to load it all at once, because the GPU obviously has limited memory (8GB in my case).
So my question is: the GPU is fantastic at computing on data, but is there a way to load and manage a dataset this large efficiently?
What do you suggest?

Thanks for any answers.

PS: In case it matters, each row of the CSV file contains these fields: date, time, pressure, energy.

#2

The most likely answer will be “You can’t use OpenCL this way because of the PCI-E transfer speed limitation”, but you could try double buffering. Create two command queues: one runs your kernels, and the other copies the next chunk of data to the GPU. Provided you manage to run the computation and the transfers in parallel, and your kernel execution time is roughly the same as the PCI-E transfer time, this should work.
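The overlap idea can be sketched on the host side in plain Python. This is only an analogy, not OpenCL code: a loader thread stands in for the copy queue, the main thread stands in for the kernel queue, and `stream_chunks`, `process` and `depth` are hypothetical names chosen for the illustration.

```python
import queue
import threading

def stream_chunks(chunks, process, depth=2):
    """Process chunks in order while a background thread loads the next ones."""
    buffers = queue.Queue(maxsize=depth)  # bounded: at most `depth` chunks waiting
    done = object()                       # sentinel marking the end of the stream

    def loader():
        for chunk in chunks:
            buffers.put(chunk)            # blocks while the queue is full
        buffers.put(done)

    worker = threading.Thread(target=loader, daemon=True)
    worker.start()

    results = []
    while True:
        chunk = buffers.get()             # waits until a chunk is ready
        if chunk is done:
            break
        results.append(process(chunk))    # runs while the loader fetches ahead
    worker.join()
    return results

# Toy usage: three small "chunks", reduced with sum().
totals = stream_chunks(iter([[1, 2], [3, 4], [5]]), sum)
print(totals)
```

The bounded queue is what keeps memory use fixed: the loader can only run a couple of chunks ahead of the consumer, which mirrors having two device buffers that are filled and consumed alternately.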

#3

Actually, the GPU memory oversubscription might not be so bad, thanks to the fact that CSV is a very inefficient storage format.

Consider: a single-precision float has approximately 7 significant decimal digits and a decimal exponent that goes from -38 to +38. So written in text form, and accounting for the separator, a number will at worst look like this: “x.xxxxxxE-yy,”. That is 13 characters, which in typical text encodings take up 13 bytes. The equivalent float takes 4 bytes. This means that you can gain up to a factor of 3.25 in space just by parsing your data into binary floats in RAM. Of course, typical CSV export algorithms try to remove trailing zeros and exponents whenever feasible, so in practice you’ll gain a bit less.
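As a quick check of that arithmetic, here is a minimal sketch using Python's standard `struct` module. The sample value is made up for illustration; it just has the worst-case shape described above.

```python
import struct

# A made-up value written in the worst-case text shape "x.xxxxxxE-yy,".
value = 1.234567e-12
text_field = "1.234567E-12,"

# The same value stored as a 4-byte little-endian single-precision float.
binary_field = struct.pack("<f", value)

text_bytes = len(text_field.encode("ascii"))
binary_bytes = len(binary_field)
ratio = text_bytes / binary_bytes

print(text_bytes, binary_bytes, ratio)
```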

If your file uses double precision, another trick that you may use is to do your computation in single precision, if that is enough for your needs. Starting from the textual representation “x.xxxxxxxxxxxxxxxE-zzz,” (16 significant decimal digits, decimal exponent from -308 to +308), you can then go from up to 23 characters to 4 bytes, which is a best-case memory saving factor of 5.75. For a 50GB file, that would get your working set down to about 8.7GB. In practice the working set will be somewhat larger, because again most CSV export algorithms try to remove some redundant information, so the text is shorter than the worst case. But in any case, this is getting very close to what your GPU can handle.
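The estimate can be reproduced in a few lines. This sketch assumes, as this post does, that 50GB is the size of the CSV text and that every field has the worst-case width.

```python
# Worst-case text shape of a double-precision value, separator included.
DOUBLE_TEXT_FIELD = "x.xxxxxxxxxxxxxxxE-zzz,"
FLOAT32_BYTES = 4   # size of a single-precision float
CSV_SIZE_GB = 50    # assumed size of the CSV text

saving = len(DOUBLE_TEXT_FIELD) / FLOAT32_BYTES
working_set_gb = CSV_SIZE_GB / saving

print(f"{saving:.2f}x saving, about {working_set_gb:.1f} GB")
```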

Once you get there, you may want to consider whether you really need to run your computation on that full dataset. If dropping a fraction of it (say, half) is acceptable, you might very well get down to something that fits in your GPU’s memory and leaves you enough space for the actual computation.

To summarize:
[ul]
[li]Do not take CSV file sizes too literally, as CSV is a very inefficient data storage format. Measure your actual RAM footprint.[/li]
[li]If your data is double precision and your computation only needs single precision, you might actually get pretty close to your GPU’s memory capacity.[/li]
[li]If you can safely drop a fraction of your dataset, you can probably get into something that entirely fits in your GPU, and leaves enough space for the computation.[/li]
[li]Otherwise, you can reduce the impact of round-trips from system memory to GPU memory by overlapping data transfers and compute, as Salabar mentioned.[/li]
[/ul]

#4

By the way, if you are ready to add to your hardware requirements, some modern GPUs have extra tricks up their sleeves in order to manage large datasets:
[ul]
[li]16-bit half-precision is becoming increasingly popular (if that is enough for your computation) due to neural net applications.[/li]
[li]AMD’s Radeon Pro SSG is a GPU with an NVMe SSD cache, designed for huge (think terabyte-sized) datasets.[/li]
[li]Some implementations of shared virtual memory come with automatic paging from the GPU driver. That will often be less efficient than hand-crafted data transfer code, but is more convenient because you don’t need to manually move the data around.[/li]
[/ul]

#5

Oh, and looking again, you mention having dates and times as columns in your dataset. This might be another interesting target for memory savings. How much timing precision do you need, and how large a date/time interval do you need to describe?

To give you an example, if you need to measure time with millisecond precision during an experiment that lasts a month, this means that the number of unique timestamps that your dataset should be able to represent is 1000 x 3600 x 24 x 31 = 2678400000. This fits in an unsigned 32-bit integer (maximum 4294967295), though not in a signed one (maximum 2147483647). So you could replace both your date and your time columns with a single “timestamp” column of unsigned 32-bit integers representing the number of milliseconds elapsed since some epoch (time point), without any loss of precision.
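Here is a minimal sketch of that conversion. The epoch, the date and time formats, and the name `to_timestamp_ms` are assumptions made for the illustration, since the actual column formats were not given. Note that the count exceeds the signed 32-bit range, so the sketch packs the value as an unsigned integer.

```python
import struct
from datetime import datetime, timedelta

# Number of distinct millisecond timestamps in a 31-day month.
MS_PER_MONTH = 1000 * 3600 * 24 * 31
UINT32_MAX = 2**32 - 1
INT32_MAX = 2**31 - 1

# Hypothetical epoch and column formats; adjust them to the real file.
EPOCH = datetime(2020, 1, 1)
FORMAT = "%Y-%m-%d %H:%M:%S.%f"

def to_timestamp_ms(date_str, time_str, epoch=EPOCH):
    """Merge a date column and a time column into milliseconds since `epoch`."""
    moment = datetime.strptime(f"{date_str} {time_str}", FORMAT)
    return (moment - epoch) // timedelta(milliseconds=1)

# Last millisecond of a 31-day experiment that starts at the epoch.
last_ms = to_timestamp_ms("2020-01-31", "23:59:59.999")

# "<I" is a little-endian unsigned 32-bit integer: 4 bytes per timestamp.
packed = struct.pack("<I", last_ms)
print(last_ms, len(packed))
```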

#6

First of all, thanks for the answers. The 50GB is the size of the data in RAM, not the file size, but thanks for the tip anyway. I think this is the only forum where people know OpenCL and have enough experience, so I want to explain everything; maybe you can help me. I have been trying for a month to understand how to split the work and get it done.

I have the CSV file with the data, and I have a pool of cases that need to be evaluated against the data. The cases are generated randomly (genetic algorithm), and I need to test every case on all the data. So I start with the data pool and the pool of cases, and as a result I need a report of some variables for every case. (This is used to choose the best cases, do the crossover, and start over: a typical genetic algorithm.)

Every case needs a moving average of the pressure and of the energy, so I don’t know whether it is better to calculate these on the host and move everything to the device, or to calculate them on the device. Furthermore, once I have the moving averages of pressure and energy, I need to test every case on the data. In short, the moving averages are the triggers that start or end the test on the real data.

What do you suggest? I know it is very complex, or maybe I just see it as complex and it is simpler than I think. Please help me, because I’m going crazy.

PS: There aren’t just two moving averages. Every case has its own random moving average periods. For example, one case uses a period of 10 for the pressure and 5 for the energy, another uses 13 for the pressure and 55 for the energy, and so on.

#7

I forgot one important thing: all the data are floating-point values with 5 decimal places.

#8

Moving average is a linear-time algorithm. Unless you need to compute a few million of those per iteration, you are certain to lose any possible benefit to PCI-E transfers. Stick to ordinary multithreading. Heck, even then you’re probably going to be bottlenecked by the HDD. Use mmap, run a dumb double loop, and go make tea for 20 minutes. 50 GB is too big to do anything with in real time, but not so big that you cannot simply wait a little bit. I know GPGPU is exciting, but sometimes it just doesn’t help with anything :smiley:
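To illustrate the linear-time point, here is a minimal sketch of a simple moving average built on a running (prefix) sum, applied to a couple of cases with different periods. The prefix-sum approach, the tiny sample columns and the `cases` layout are assumptions made for the illustration, not something taken from the original data.

```python
from itertools import accumulate

def moving_average(values, period):
    """Simple moving average over each full window of `period` samples."""
    # prefix[i] holds the sum of the first i values, so any window sum
    # is a single subtraction: one pass over the data, whatever the period.
    prefix = [0.0, *accumulate(values)]
    return [(prefix[i] - prefix[i - period]) / period
            for i in range(period, len(prefix))]

# Tiny stand-ins for the real pressure and energy columns.
pressure = [1.0, 2.0, 3.0, 4.0, 5.0]
energy = [10.0, 20.0, 30.0, 40.0, 50.0]

# Hypothetical cases, each with its own pair of periods.
cases = [
    {"pressure_period": 2, "energy_period": 3},
    {"pressure_period": 4, "energy_period": 2},
]

# The "dumb double loop": every case scans the whole dataset.
for case in cases:
    case["pressure_ma"] = moving_average(pressure, case["pressure_period"])
    case["energy_ma"] = moving_average(energy, case["energy_period"])

print(cases[0]["pressure_ma"], cases[0]["energy_ma"])
```

On very long series, rounding error in the running sum can grow, so it is worth comparing a few windows against a direct sum before trusting the results.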