maxlam's comments

maxlam · on March 20, 2018

Training these quantized word vectors has to be done in full precision (so no memory gains during training the word vectors).

But when you save them to disk, every value is either -1/3 or +1/3, so one could encode the word vectors in binary. This can reduce memory usage at application time if you keep the word vectors in this compressed format (though you'd need to write a decode function in TensorFlow or PyTorch to take a sequence of bits corresponding to a word and convert it into a vector of -1/3s and +1/3s).
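
A minimal sketch of the encode/decode idea above, using NumPy rather than TensorFlow or PyTorch. The function names `pack_vector` and `unpack_vector` are made up for illustration and are not part of Word2Bits; the mapping of +1/3 to bit 1 and -1/3 to bit 0 is also just an assumption.

```python
import numpy as np

SCALE = 1.0 / 3.0  # the two stored values are -1/3 and +1/3

def pack_vector(vec):
    """Pack a vector of -1/3 / +1/3 values into bytes, one bit per dimension."""
    bits = (np.asarray(vec) > 0).astype(np.uint8)  # +1/3 -> 1, -1/3 -> 0
    return np.packbits(bits)  # pads the last byte with zero bits

def unpack_vector(packed, dim):
    """Decode packed bits back into a float32 vector of -1/3 / +1/3 values."""
    bits = np.unpackbits(packed)[:dim]  # drop the padding bits
    return (bits.astype(np.float32) * 2.0 - 1.0) * SCALE

# Hypothetical 10-dimensional word vector, for illustration only
vec = np.array([1, -1, -1, 1, 1, 1, -1, 1, -1, 1], dtype=np.float32) * SCALE
packed = pack_vector(vec)            # 10 dimensions fit in 2 bytes
restored = unpack_vector(packed, dim=vec.size)
```

Stored as float32 the same vector would take 40 bytes, so the packed form is roughly 32x smaller once the dimension is large enough that padding doesn't matter.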

opportune · on March 20, 2018

Oh interesting I see, so this is like a digital format mostly for sharing models between storage / over networks. I definitely think it would be possible (and useful!) to extend it to in-memory usage, though a C-function wrapper might be better than a native python function. Personally I'm often more frustrated by word2vec's size in memory than in storage so it might be used more in this manner. Would you mind if I submit a pull request?

maxlam · on March 20, 2018

Absolutely, please do!

maxlam · on March 20, 2018

Definitely tried to figure out if the dimensions mean anything -- as far as I can tell they don't really mean much :(

Maybestring · on March 20, 2018

If you want them to be meaningful without changing the model...

Couldn't you rotate the basis to minimize the distance between each basis vector and its nearest neighbor?

maxlam · on March 20, 2018

Haven't tried this but this is definitely a good idea for visualizing what's going on!

loxias · on March 20, 2018

One of the fascinating things about neural embeddings such as these is that the individual component dimensions have NO "real" semantic meaning to us humans. It's better to think of them as single points in a higher dimensional space.

(of course, with a clever network design you could probably FORCE "meaning" onto some components)

maxlam · on March 20, 2018

Interesting, definitely need to think about debiasing. Seems like it won't really work straight out of the box since it'd destroy the 1 bit-ness of the vectors.

Though if only a few of the vectors are de-biased then you can still save a lot of space since all the other vectors are still represented using 2 numbers (while the de-biased vectors are represented using the full range of 32 bit numbers).

rspeer · on March 20, 2018

There may be a way to build it into the loss function so that it happens before the quantization, right?

(and holy crap, look how fast the HN conservatives are getting to my comment)

hnuser1234 · on March 20, 2018

Literally everything is either Left or Right. Anyone who disagrees with you is definitely a sexist conservative. No other possibility. The center is a black hole.

maxlam · on March 20, 2018

Hm not sure, would need to think more about this -- definitely an interesting idea though!

rspeer · on March 20, 2018

Thanks for being willing to discuss it. And sorry if I'm complicating your task.

But any interesting release of NLP data has the potential to affect the way the field progresses, so take it as a compliment that I consider this an interesting release of NLP data. That's why I'm asking you to actively consider the downstream effects of word vectors and find out if you can make them better.

maxlam · on March 20, 2018

Thanks for the compliments! And thanks also for bringing up the debiasing! It's interesting to see what people care about and any conversation that leads to new ideas is a plus.

maxlam · on March 20, 2018

The idea is that you can kind of capture the "meaning" of a word with a sequence of numbers (a vector) -- and then you use these vectors for machine learning tasks to do cool stuff like answer questions!

Word2Vec is one of the algorithms to do this. Given a bunch of text (like Wikipedia) it turns words into vectors. These vectors have interesting properties like:

vector("man") - vector("woman") + vector("queen") = vector("king")

and

distance(vector("man"), vector("woman")) < distance(vector("man"), vector("cat"))

What Word2Bits does is make sure that the numbers that represent a word are limited to just 2 values (-.333 and +.333). This reduces the amount of storage the vectors take and surprisingly improves accuracy in some scenarios.
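
As a rough illustration of what "limited to just 2 values" means, here is a sketch of a 1-bit quantization function. The threshold at zero and the name `quantize_1bit` are assumptions made for this example, not something stated in this thread.

```python
import numpy as np

def quantize_1bit(x):
    """Map each value to +1/3 if it is non-negative, else -1/3 (assumed rule)."""
    x = np.asarray(x, dtype=np.float32)
    return np.where(x >= 0, 1.0 / 3.0, -1.0 / 3.0).astype(np.float32)

# Hypothetical full-precision parameters, for illustration only
full_precision = np.array([0.8, -0.02, 0.0, -1.5], dtype=np.float32)
quantized = quantize_1bit(full_precision)  # every entry is now -1/3 or +1/3
```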

If you're interested in learning more, check out http://colah.github.io/posts/2014-07-NLP-RNNs-Representation... which has a lot more details about representations in deep learning!

Sukotto · on March 20, 2018

Thank you. That really helped.

maxlam · on March 20, 2018

You're definitely right, the quantization function and its values definitely have an impact on performance.

For 1 bit I think I tried something like -1/+1, -.5/+.5, -.25/+.25, -.333/+.333, and something like -10/+10 (and I think a few more). It seemed -.333/+.333 worked the best while -10/+10 did the worst on the Google analogy task (getting like 0% right). All this was tuned on 100MB of Wikipedia data.

yorwba · on March 20, 2018

Have you considered doing gradient descent on the quantization steps? It looks to me like the model should be differentiable with respect to those values, so I'm not sure why you'd have to fix them to a constant.

maxlam · on March 20, 2018

Hm what do you mean? I'm not quite seeing how to differentiate with respect to the quantization steps.

yorwba · on March 21, 2018

Say you have a function f(q(x)) where q quantizes x into one of s_1, ..., s_n. Then if q(x) = s_i for a certain x, df/ds_i = df/dq and df/ds_j = 0 for all j != i.

That breaks down for values of x precisely at the boundary between steps, so I should have qualified "differentiable" with "almost everywhere".

It also occurs to me that this might interact strangely with the approximation dq/dx = 1, but since the quantization steps are globally shared, I think it should be stable anyway.

If the evaluation suite for your code doesn't require too much manual interaction, I might try and see for myself.
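
The gradient rule above can be sketched in NumPy roughly as follows. This is an illustration under assumptions, not code from Word2Bits: nearest-level quantization, the helper names, and the toy loss are all made up here. Because the levels are shared, each df/ds_i ends up as the sum of df/dq over every x that was mapped to s_i.

```python
import numpy as np

def quantize(x, levels):
    """Return the nearest-level index for each x, and the quantized values."""
    idx = np.argmin(np.abs(x[:, None] - levels[None, :]), axis=1)
    return idx, levels[idx]

def level_gradients(upstream_grad, idx, n_levels):
    """Accumulate df/dq into df/ds_i for the level i each x was mapped to."""
    grad = np.zeros(n_levels)
    np.add.at(grad, idx, upstream_grad)  # df/ds_j stays 0 for unused levels
    return grad

levels = np.array([-1.0 / 3.0, 1.0 / 3.0])  # s_1, s_2
x = np.array([-0.9, 0.2, 0.7, -0.1])        # hypothetical parameters
idx, q = quantize(x, levels)

# Toy loss f = sum(q**2), so df/dq = 2*q (hypothetical, for illustration only)
df_dq = 2.0 * q
df_ds = level_gradients(df_dq, idx, len(levels))

# One plain gradient descent step on the shared levels
learning_rate = 0.1
new_levels = levels - learning_rate * df_ds
```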

maxlam · on March 21, 2018

That's definitely an interesting idea -- it seems this would allow for boundaries that "change" along with the data (instead of having static boundaries as it is). Would be interested to know how that turns out!

maxlam · on March 20, 2018

This might be because "Artist" has an uppercase "A" -- I trained all the word vectors to be case sensitive so "Artist" is not the same as "artist" (which should be closer to "man" than "Artist")

NKCSS · on March 20, 2018

Why would you do that? If you look at something like LSA, the goal is to unify those, rather than distinguish. artist and Artist should be a (near) 100% match; what are you trying to do here?

maxlam · on March 20, 2018

Main reason I did it this way is because Facebook's DrQA (which I evaluate the vectors on for the SQuAD task) uses case-sensitive vectors.

It was a tough decision whether to train case-sensitive or case-insensitive vectors, and training case-insensitive vectors is a future task.

newman8r · on March 20, 2018

Makes sense. The plural vs singular probably impacts that one too.

maxlam · on March 20, 2018

Yeah, I should definitely put more detail in the writeup -- thanks for the feedback!

What's happening with figure 1a (epochs vs google accuracy) is that as you train for more epochs the full precision loss continues to decrease (dotted red line) but accuracy also starts decreasing (solid red line). This indicates overfitting (since you'd expect accuracy to increase if loss decreases). The blue lines (quantized training with 1 bit) do not show this which suggests that quantized training seems to act as a form of regularization.

Figure 1b is pretty similar, except on the x axis we have vector dimension. As you increase vector dimension, full precision loss decreases, yet after a certain point full precision accuracy decreases as well. I took this to mean that word2vec training was overfitting with respect to vector dimension.

maxlam · on March 20, 2018

Oops, thought you meant the graphs (with the dotted/solid lines) in the writeup.

If you're referring to the image under "Visualizing Quantized Word Vectors" then each row is a word vector (and there are only two colors since each parameter is either -1/3 or +1/3).

wodenokoto · on March 20, 2018

Thanks for the reply. Yes, I did mean the image under "Visualizing Quantized Word Vectors".

I did get that the colors indicated values of dimensions, but I suppose what I really meant is, what is the take-away message? To me, it just looks like noise. Is there a pattern I should look for and go "a-ha, I see"?

maxlam · on March 20, 2018

You can kind of see that words that are similar have similar looking vector values (that's why there are vertical stripes of yellow / black). But you're right in that most of it just looks like noise. I put the picture there mainly to show what the quantized vectors look like.

s0cket · on March 20, 2018

In Figure 2, have you noticed that every 25 words or so, there are off-pattern words? They are represented by visually differing lines (looks like a glitch, and it may be a glitch as this happens for all visualization charts).

What words correspond to these "glitches" for "mushroom" for example? In the case of "mushroom" there is a glitch line just below "earthstar".

Can you provide the full y-axis word vector for any of the visualization charts?

Nice work.

maxlam · on March 20, 2018

Here's the one for man:

['man', 'woman', 'boy', 'handsome', 'stranger', 'gentleman', 'young', 'drunkard', 'devil', 'lonely', 'lady', 'lad', 'drunken', 'beggar', 'kid', 'effeminate', 'brave', 'bearded', 'himself', 'dressed', 'loner', 'meek', 'sees', 'hustler', 'girl', 'coward', 'thief', 'wicked', 'person', 'balding', 'dashing', 'deranged', 'tramp', 'mysterious', 'him', 'pretends', 'lecherous', 'friend', 'shepherd', 'portly', 'bespectacled', 'jolly', 'thug', 'gangster', 'dapper', 'genius', 'slob', 'beast', 'hero', 'hoodlum', 'policeman', 'elderly', 'drunk', 'manly', 'mustachioed', 'ruffian', 'cop', 'burly', 'beard', 'fool', 'terrified', 'scarecrow', 'scruffy', 'lover', 'peddler', 'remembers', 'supposedly', 'gambler', 'bloke', 'bastard', 'acquaintance', 'mighty', 'playboy', 'unshaven', 'prostitute', 'pimp', 'mans', 'skinny', 'carefree', 'scoundrel', 'crook', 'obsessed', 'surly', 'fancies', 'accosted', 'foolish', 'jovial', 'cocky', 'shifty', 'loves', 'narrator', 'butler', 'dying', 'casually', 'waiter', 'evil', 'frightened', 'gigolo', 'conman', 'cunning']

(Edit: Actually this isn't quite right as it doesn't match the image. Many of these vectors actually have the same distance to "man" and the dict doesn't keep a deterministic order. What you can do is modify https://github.com/agnusmaximus/Word2Bits/blob/development/s... and run it on the 1 bit 400k vectors and see what it prints out. To run it do: `python w2bvisualize.py path_to_1bit800d400kvectors` and it should generate figures similar to those in the writeup.)
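
One way to get a reproducible ordering when many words tie on distance is to break ties on the word itself. The sketch below is not the visualization script's actual code; `rank_neighbors` and the distances are hypothetical.

```python
def rank_neighbors(distances):
    """Sort words by distance, breaking ties alphabetically for a stable order."""
    return sorted(distances, key=lambda word: (distances[word], word))

# Hypothetical distances to "man"; equal values are common with 1-bit vectors
distances = {"woman": 2.0, "lad": 4.0, "boy": 2.0, "kid": 4.0, "man": 0.0}
ranked = rank_neighbors(distances)
```

With this key the result no longer depends on the order in which the dict happens to yield its entries.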