It's important to note that the datasets used are sparse, and that the key to this algorithm is better exploitation of that sparsity. The GPU-over-CPU advantage is much smaller if you need sparse operations, even with conventional algorithms.
"It should
be noted that these datasets are very sparse, e.g., Delicious
dataset has only 75 non-zeros on an average for input fea-
tures, and hence the advantage of GPU over CPU is not
always noticeable."
In other words, they got a good speedup on their problem, but it might not apply to your problem.
WaveNet, if I remember correctly, uses a 1-of-256 encoding for its input features, and a 1-of-256 encoding for its outputs.
It is extremely sparse.
If you look at language modeling, things there are even sparser - a typical neural language model has a 1-of-several-hundred-thousand encoding for the full vocabulary (for Russian, for example, it is in the range of 700K..1.2M words, and it is much worse for Finnish and German) and a 1-of-a-few-tens-of-thousands encoding for a byte-pair-encoded vocabulary (most languages have an encoding that reduces the token count to about 16K distinct tokens; see [1] for such an example). A small sketch below makes the scale concrete.
The image classification task also has sparsity at the output and, if you implement it as an RNN, sparsity at the input as well (1-of-256 encoding of pixel intensities).
Heck, you can engineer your features to be sparse if you want to.
I also think that this paper is an example of "if you do not compute, you do not have to pay for it", just like in the GNU grep case [2].
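To make the scale of that sparsity concrete, here is a minimal numpy sketch (the vocabulary size and token ids are made up for illustration): the dense one-hot view of a short byte-pair-encoded sentence is almost entirely zeros, while the sparse view is just a handful of integer indices.

    import numpy as np

    VOCAB_SIZE = 16_000                              # roughly the BPE vocabulary size mentioned above
    token_ids = np.array([17, 4093, 882, 15999, 3])  # a 5-token "sentence" (made-up ids)

    # Dense one-hot view: a 5 x 16_000 float matrix that is >99.9% zeros.
    dense = np.zeros((len(token_ids), VOCAB_SIZE), dtype=np.float32)
    dense[np.arange(len(token_ids)), token_ids] = 1.0
    print(dense.nbytes)       # 320000 bytes for five tokens

    # Sparse view: just the non-zero coordinates.
    print(token_ids.nbytes)   # 40 bytes for exactly the same information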
Embedding tables aren't hard on the GPU (they're only a lookup table), and the output softmax still requires you to do the full matrix multiply. The label may be sparse, but the computation is far from sparse.
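A rough numpy sketch of that asymmetry (the shapes here are invented for illustration): the input side is a cheap gather into the embedding table, while the output side is a dense matrix multiply over the whole vocabulary no matter how sparse the target label is.

    import numpy as np

    VOCAB, DIM, BATCH = 50_000, 512, 32
    emb_table = np.random.randn(VOCAB, DIM).astype(np.float32)  # input embedding table
    W_out = np.random.randn(DIM, VOCAB).astype(np.float32)      # output projection

    token_ids = np.random.randint(0, VOCAB, size=BATCH)

    # Input side: a pure lookup, touching only BATCH rows of the table.
    h = emb_table[token_ids]                        # (BATCH, DIM)

    # Output side: a full dense matmul over the entire vocabulary,
    # even though the target label is a single index per example.
    logits = h @ W_out                              # (BATCH, VOCAB)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)       # softmax over all 50K classes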
> The reverse is true, embeddings are both the performance and memory-footprint bottleneck of modern NN models.
They may be a bottleneck, but the alternative is worse -- you can't fit complex models with large vocabularies into GPU memory using sparse one-hot encodings.
Technically, the sparse one-hot encoding is the most efficient in terms of memory footprint. You simply store the non-zero coordinates.
The problem in practice for GPUs is that sparse vector/matrix operations are too inefficient.
The whole point of something like this paper is to skip the entire 'densification' step and deal with the sparse input directly as a sparse matrix. The LSH used in this paper improves on directly using SpMSpV, since that is also inefficient on CPUs, although to a lesser extent than on GPUs.
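A toy scipy/numpy sketch of the difference (not the paper's code, and much smaller than their layers): you can either densify the one-hot-style input and multiply against every row of the weight matrix, or keep the input as its non-zero coordinates and touch only the rows that matter.

    import numpy as np
    from scipy.sparse import csr_matrix

    IN_DIM, OUT_DIM = 100_000, 128
    W = np.random.randn(IN_DIM, OUT_DIM).astype(np.float32)

    # A "Delicious-like" input: ~75 non-zeros out of 100_000 features.
    nz_idx = np.sort(np.random.choice(IN_DIM, size=75, replace=False))
    nz_val = np.ones(75, dtype=np.float32)

    # Densified route: materialize the full vector, multiply against all of W.
    x_dense = np.zeros(IN_DIM, dtype=np.float32)
    x_dense[nz_idx] = nz_val
    y_dense = x_dense @ W                            # touches all 100_000 rows of W

    # Sparse route: keep only the non-zero coordinates and let the sparse
    # multiply skip the zero entries entirely.
    x_sparse = csr_matrix((nz_val, (np.zeros(75, dtype=np.int64), nz_idx)),
                          shape=(1, IN_DIM))
    y_sparse = (x_sparse @ W).ravel()                # touches only 75 rows of W

    assert np.allclose(y_dense, y_sparse, atol=1e-3)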
> If you look at language modeling, things there are even sparser - a typical neural language model has a 1-of-several-hundred-thousand encoding for the full vocabulary
Most real-world models don't use one-hot encodings of words -- they use embeddings instead, which are very dense vector representations of words. Outside of the fact that embeddings don't blow out GPU memory, they're also semantically encoded, so similar words cluster together.
First, you need to compute these embeddings at least once - so the sparsity is still there. Second, these embeddings may differ between tasks, and the accuracy you get from using them may differ too.
For example, the embeddings produced by the CBOW and skip-gram word2vec models are strikingly different in the cosine-similarity sense - the classes of words that come out as similar are not the same for CBOW and skip-gram.
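One way to see this for yourself, sketched with a hypothetical helper (the embedding matrices and word-to-index mapping are whatever your CBOW and skip-gram models produced, not anything defined above): query the nearest neighbours of the same word under each model's cosine similarity and compare the lists.

    import numpy as np

    def nearest_neighbors(emb, word2id, query, k=5):
        """Return the k nearest words to `query` under cosine similarity.

        `emb` is a (vocab, dim) embedding matrix, e.g. loaded from a CBOW
        or skip-gram word2vec model; `word2id` maps words to row indices.
        """
        id2word = {i: w for w, i in word2id.items()}
        v = emb[word2id[query]]
        sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-9)
        order = np.argsort(-sims)
        return [(id2word[i], float(sims[i])) for i in order if i != word2id[query]][:k]

    # Comparing the two models is then just:
    #   nearest_neighbors(cbow_emb, word2id, "bank")
    #   nearest_neighbors(sg_emb, word2id, "bank")
    # and the neighbour lists typically disagree noticeably.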
So you agree that the problem is fundamentally sparse? Embeddings are used to make sparse (e.g. categorical) data workable on GPUs, and real-world models are limited by how large they can make the embeddings and still fit in GPU memory. Embedding lookups are also a compute bottleneck.
Why aren't GPUs better at sparse matrix math? Generally, sparse operations are memory-bandwidth limited, but GPUs/TPUs still have much faster memory than CPUs and more memory bandwidth in general (roughly a factor of 4 or so between the latest CPUs and GPUs).
Sparse matrix math basically boils down to indirect array references: A[B[i]]. GPUs generally trade memory latency for bandwidth, relying on having a lot of other work available to hide that latency. But because there's no work between the first and second load, you can no longer hide the memory latency of the second load with extra work.
CPUs, by contrast, have a deep caching hierarchy that focuses on minimizing memory latency, so the second load doesn't take as long as it would on a GPU.
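A minimal sketch of that access pattern (toy sizes, plain numpy): the address of the second load depends on the value returned by the first, so there is no independent arithmetic to overlap with the memory stall.

    import numpy as np

    N = 10_000_000
    A = np.random.rand(N).astype(np.float32)    # the actual data
    B = np.random.randint(0, N, size=N)         # the indices, e.g. sparse coordinates

    # Streaming access: consecutive addresses, easy to prefetch and coalesce.
    dense_sum = A.sum()

    # Indirect access A[B[i]]: every element needs two dependent loads, and
    # the second one lands at an effectively random address.
    gathered_sum = A[B].sum()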
Yeah, on the GPU you ideally need adjacent threads to load consecutive memory locations to utilize the memory bandwidth properly. Random indexing blows this out of the water. I guess you could pre-process on the CPU, though, to pack the sparse stuff for better GPU efficiency.
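A rough sketch of what that CPU-side packing could look like (my own illustration, not anything from the paper): sort the gather indices so that neighbouring positions read neighbouring addresses, gather in that order, then scatter the results back to the original order.

    import numpy as np

    def packed_gather(A, idx):
        """Gather A[idx], but visit memory in sorted, coalescing-friendly order."""
        order = np.argsort(idx)          # pre-processing step (done once, on the CPU)
        sorted_vals = A[idx[order]]      # reads move through memory monotonically
        out = np.empty_like(sorted_vals)
        out[order] = sorted_vals         # undo the permutation
        return out

    A = np.random.rand(1_000_000).astype(np.float32)
    idx = np.random.randint(0, A.size, size=4096)
    assert np.array_equal(packed_gather(A, idx), A[idx])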
Another thing to note is that sparsity is being leveraged even to build more efficient hardware. A good example of this is the Cerebras wafer-scale chip that was announced recently. I'm assuming the author was unaware of developments on the hardware side of things.
"It should be noted that these datasets are very sparse, e.g., Delicious dataset has only 75 non-zeros on an average for input fea- tures, and hence the advantage of GPU over CPU is not always noticeable."
In other words, they got a good speedup on their problem, but it might not apply to your problem.