It's important to note that the datasets used are sparse, and that the key to this algorithm is better exploitation of that sparsity. The GPU-over-CPU advantage is much smaller if you need sparse operations, even with conventional algorithms.
"It should
be noted that these datasets are very sparse, e.g., Delicious
dataset has only 75 non-zeros on an average for input fea-
tures, and hence the advantage of GPU over CPU is not
always noticeable."
In other words, they got a good speedup on their problem, but it might not apply to your problem.
WaveNet, if I remember correctly, uses a 1-of-256 encoding for its input features, and a 1-of-256 encoding for its outputs.
It is extremely sparse.
If you look at language modeling, things there are even sparser - a typical neural language model has a 1-of-several-hundred-thousand encoding for the full vocabulary (for Russian, for example, it is in the range of 700K..1.2M words, and it is much worse for Finnish and German) and a 1-of-a-few-tens-of-thousands encoding for a byte-pair-encoded vocabulary (most languages have an encoding that reduces the token count to about 16K distinct tokens; see [1] for such an example). A small sketch below makes the scale concrete.
The image classification task also has sparsity at the output and, if you implement it as an RNN, sparsity at the input as well (1-of-256 encoding of pixel intensities).
Heck, you can engineer your features to be sparse if you want to.
I also think that this paper is an example of "if you do not compute, you do not have to pay for it", just like in the GNU grep case [2].
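To make the scale of that sparsity concrete, here is a minimal numpy sketch (the vocabulary size and token ids are made up for illustration): the dense one-hot view of a short byte-pair-encoded sentence is almost entirely zeros, while the sparse view is just a handful of integer indices.

    import numpy as np

    VOCAB_SIZE = 16_000                              # roughly the BPE vocabulary size mentioned above
    token_ids = np.array([17, 4093, 882, 15999, 3])  # a 5-token "sentence" (made-up ids)

    # Dense one-hot view: a 5 x 16_000 float matrix that is >99.9% zeros.
    dense = np.zeros((len(token_ids), VOCAB_SIZE), dtype=np.float32)
    dense[np.arange(len(token_ids)), token_ids] = 1.0
    print(dense.nbytes)       # 320000 bytes for five tokens

    # Sparse view: just the non-zero coordinates.
    print(token_ids.nbytes)   # 40 bytes for exactly the same information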
Embedding tables aren't hard on the GPU (they're only a lookup table), and the output softmax still requires you to do the full matrix multiply. The label may be sparse, but the computation is far from sparse.
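A rough numpy sketch of that asymmetry (the shapes here are invented for illustration): the input side is a cheap gather into the embedding table, while the output side is a dense matrix multiply over the whole vocabulary no matter how sparse the target label is.

    import numpy as np

    VOCAB, DIM, BATCH = 50_000, 512, 32
    emb_table = np.random.randn(VOCAB, DIM).astype(np.float32)  # input embedding table
    W_out = np.random.randn(DIM, VOCAB).astype(np.float32)      # output projection

    token_ids = np.random.randint(0, VOCAB, size=BATCH)

    # Input side: a pure lookup, touching only BATCH rows of the table.
    h = emb_table[token_ids]                        # (BATCH, DIM)

    # Output side: a full dense matmul over the entire vocabulary,
    # even though the target label is a single index per example.
    logits = h @ W_out                              # (BATCH, VOCAB)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)       # softmax over all 50K classes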
> The reverse is true, embeddings are both the performance and memory-footprint bottleneck of modern NN models.
They may be a bottleneck, but the alternative is worse -- you can't fit complex models with large vocabularies into GPU memory using sparse one-hot encodings.
Technically, the sparse one-hot encoding is the most efficient in terms of memory footprint. You simply store the non-zero coordinates.
The problem in practice for GPUs is that sparse vector/matrix operations are too inefficient.
The whole point of something like this paper is to skip the entire 'densification' step and deal with the sparse input directly as a sparse matrix. The LSH used in this paper improves on directly using SpMSpV, since that is also inefficient on CPUs, although to a lesser extent than on GPUs.
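A toy scipy/numpy sketch of the difference (not the paper's code, and much smaller than their layers): you can either densify the one-hot-style input and multiply against every row of the weight matrix, or keep the input as its non-zero coordinates and touch only the rows that matter.

    import numpy as np
    from scipy.sparse import csr_matrix

    IN_DIM, OUT_DIM = 100_000, 128
    W = np.random.randn(IN_DIM, OUT_DIM).astype(np.float32)

    # A "Delicious-like" input: ~75 non-zeros out of 100_000 features.
    nz_idx = np.sort(np.random.choice(IN_DIM, size=75, replace=False))
    nz_val = np.ones(75, dtype=np.float32)

    # Densified route: materialize the full vector, multiply against all of W.
    x_dense = np.zeros(IN_DIM, dtype=np.float32)
    x_dense[nz_idx] = nz_val
    y_dense = x_dense @ W                            # touches all 100_000 rows of W

    # Sparse route: keep only the non-zero coordinates and let the sparse
    # multiply skip the zero entries entirely.
    x_sparse = csr_matrix((nz_val, (np.zeros(75, dtype=np.int64), nz_idx)),
                          shape=(1, IN_DIM))
    y_sparse = (x_sparse @ W).ravel()                # touches only 75 rows of W

    assert np.allclose(y_dense, y_sparse, atol=1e-3)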
> If you look at language modeling, things there are even sparser - a typical neural language model has a 1-of-several-hundred-thousand encoding for the full vocabulary
Most real-world models don't use one-hot encodings of words -- they use embeddings instead, which are very dense vector representations of words. Outside of the fact that embeddings don't blow out GPU memory, they're also semantically encoded, so similar words cluster together.
First, you need to compute these embeddings at least once - so the sparsity is still there. Second, these embeddings may differ between tasks, and the accuracy you get from using them may differ too.
For example, the embeddings produced by the CBOW and skip-gram word2vec models are strikingly different in the cosine-similarity sense - the classes of words that come out as similar are not the same for CBOW and skip-gram.
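One way to see this for yourself, sketched with a hypothetical helper (the embedding matrices and word-to-index mapping are whatever your CBOW and skip-gram models produced, not anything defined above): query the nearest neighbours of the same word under each model's cosine similarity and compare the lists.

    import numpy as np

    def nearest_neighbors(emb, word2id, query, k=5):
        """Return the k nearest words to `query` under cosine similarity.

        `emb` is a (vocab, dim) embedding matrix, e.g. loaded from a CBOW
        or skip-gram word2vec model; `word2id` maps words to row indices.
        """
        id2word = {i: w for w, i in word2id.items()}
        v = emb[word2id[query]]
        sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-9)
        order = np.argsort(-sims)
        return [(id2word[i], float(sims[i])) for i in order if i != word2id[query]][:k]

    # Comparing the two models is then just:
    #   nearest_neighbors(cbow_emb, word2id, "bank")
    #   nearest_neighbors(sg_emb, word2id, "bank")
    # and the neighbour lists typically disagree noticeably.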
So you agree that the problem is fundamentally sparse? Embeddings are used to make sparse (e.g. categorical) data workable on GPUs, and real-world models are limited by how large they can make the embeddings and still fit in GPU memory. Embedding lookups are also a compute bottleneck.
Why aren't GPUs better at sparse matrix math? Generally, sparse operations are memory-bandwidth limited, but GPUs/TPUs still have much faster memory than CPUs and more memory bandwidth in general (roughly a factor of 4 or so between the latest CPUs and GPUs).
Sparse matrix math basically boils down to indirect array references: A[B[i]]. GPUs generally trade memory latency for bandwidth, relying on having a lot of other work available to hide that latency. But because there's no work between the first and second load, you can no longer hide the memory latency of the second load with extra work.
CPUs, by contrast, have a deep caching hierarchy that focuses on minimizing memory latency, so the second load doesn't take as long as it would on a GPU.
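A minimal sketch of that access pattern (toy sizes, plain numpy): the address of the second load depends on the value returned by the first, so there is no independent arithmetic to overlap with the memory stall.

    import numpy as np

    N = 10_000_000
    A = np.random.rand(N).astype(np.float32)    # the actual data
    B = np.random.randint(0, N, size=N)         # the indices, e.g. sparse coordinates

    # Streaming access: consecutive addresses, easy to prefetch and coalesce.
    dense_sum = A.sum()

    # Indirect access A[B[i]]: every element needs two dependent loads, and
    # the second one lands at an effectively random address.
    gathered_sum = A[B].sum()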
Yeah, on the GPU you ideally need adjacent threads to load consecutive memory locations to utilize the memory bandwidth properly. Random indexing blows this out of the water. I guess you could pre-process on the CPU, though, to pack the sparse stuff for better GPU efficiency.
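A rough sketch of what that CPU-side packing could look like (my own illustration, not anything from the paper): sort the gather indices so that neighbouring positions read neighbouring addresses, gather in that order, then scatter the results back to the original order.

    import numpy as np

    def packed_gather(A, idx):
        """Gather A[idx], but visit memory in sorted, coalescing-friendly order."""
        order = np.argsort(idx)          # pre-processing step (done once, on the CPU)
        sorted_vals = A[idx[order]]      # reads move through memory monotonically
        out = np.empty_like(sorted_vals)
        out[order] = sorted_vals         # undo the permutation
        return out

    A = np.random.rand(1_000_000).astype(np.float32)
    idx = np.random.randint(0, A.size, size=4096)
    assert np.array_equal(packed_gather(A, idx), A[idx])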
Another thing to note is that sparsity is being leveraged even to build more efficient hardware. A good example of this is the Cerebras wafer-scale chip that was announced recently. I'm assuming the author was unaware of developments on the hardware side of things.
"It should be noted that these datasets are very sparse, e.g., Delicious dataset has only 75 non-zeros on an average for input fea- tures, and hence the advantage of GPU over CPU is not always noticeable."
In other words, they got a good speedup on their problem, but it might not apply to your problem.