Understanding Word2Vec With PySpark
Published:
Understanding Word2Vec with PySpark
Goal
I need to use word embeddings to study the evolution of hate speech across social media. I chose to explore Word2Vec in hopes of learning more about it and to begin to probe the field of Natural Language Processing.
Today we are going to look at how Word2Vec incorporates word embeddings to create a numeric vectors to represent meaning of words. This is an important part of natural language processing (NLP). The goal of NLP is to extract meaning from human language, often this is provided in the form of text. And this meaning can be found in many components of language.
Some components of language
- Pragmatics
- Semantics
- Syntax
- Morphology
- Phonology
Dataset
I am using a Gab.ai dataset of posts submitted to the social platform. Gab.ai prides itself on the values of “free speech” and a lack of censorship. As a result it has become known for attracting trolls, bots, and the socially maligned. Comments and posts made to this site are notorious for being extreme and hate laden. I have collected over 28 million posts and will use a 1 million post sample to train a skip-grams variant of the word2vec word embedding model. The goal is to identify the proximity of two related words in a vector space.
Distributional Semantic Models
Word embeddings are word representation algorithms used in an NLP. Word embeddings are a subclass of distributional semantic models because they rely on the distributional hypothesis. The distributional hypothesis, created by Zellig Harris in his 1956 paper “Distributional structure” , is assumption that words in the same context tend to proport similar meanings and occur near each other. And thus synonyms have similar representations in a collection of texts. Word embeddings are represented as vector values created as a result of a neural network.
Example Posts
data.body
"Probably because I see the faint hint of 'horns' holding that halo up... "
https://youtu.be/YMQRFT4bZuc
http://www.epochtimes.de/politik/europa/zahl-der-toten-nach-londoner-hochhausbrand-auf-79-gestiegen-2-a2146594.html
https://t.co/LTMBeXvHrC
"Ps 37:14 Die Gottlosen ziehen das Schwert aus und spannen ihren Bogen, daß sie fällen den Elenden und Armen und schlachten die Frommen.\nPs 37:15 Aber ihr Schwert wird in ihr Herz gehen, und ihr Bogen wird zerbrechen.\n\n"
At least 25 killed in airstrike on market in Yemen – reports\nhttps://www.rt.com/news/392838-saudi-yemen-market-airstrike/ #saudiarabia #yemen
Example after cleaning
['data', 'body']
['', 'probably', 'because', 'i', 'see', 'the', 'faint', 'hint', 'of', "'horns'", 'holding', 'that', 'halo', 'up', '', '', '', '', '']
['https', '//youtu', 'be/ymqrft4bzuc']
['http', '//www', 'epochtimes', 'de/politik/europa/zahl', 'der', 'toten', 'nach', 'londoner', 'hochhausbrand', 'auf', '79', 'gestiegen', '2', 'a2146594', 'html']
['https', '//t', 'co/ltmbexvhrc']
['', 'ps', '37', '14', 'die', 'gottlosen', 'ziehen', 'das', 'schwert', 'aus', 'und', 'spannen', 'ihren', 'bogen', '', 'daß', 'sie', 'fällen', 'den', 'elenden', 'und', 'armen', 'und', 'schlachten', 'die', 'frommen', '\\nps', '37', '15', 'aber', 'ihr', 'schwert', 'wird', 'in', 'ihr', 'herz', 'gehen', '', 'und', 'ihr', 'bogen', 'wird', 'zerbrechen', '\\n\\n', '']
['at', 'least', '25', 'killed', 'in', 'airstrike', 'on', 'market', 'in', 'yemen', '–', 'reports\\nhttps', '//www', 'rt', 'com/news/392838', 'saudi', 'yemen', 'market', 'airstrike/', '#saudiarabia', '#yemen']
Now we have to take this tokenized text and use this as our input text
Building a Vector
To use the distributional hypothesis to build a vector, we have to choose what words being near each other means to us. This value of “nearness” is known as a window. In the image below, taken from Chris McCormick’s tutorial, the target word is highlighted in blue, and the window shown around it as being two words away from the target. This means the window size is equal to 2.
Word pairs are created between the target word and all other words in the window which can extend forwards or backwards. The target is then moved to the next word and the process repeats. Some embedding models treat words to the left of the target word differently than words to the right. But for now we will treat them both equally.
Figure 2: A diagram of the skip-gram model for starting with the target word and trying to predict the context words which are the words in the window: This image was taken from Tomas Mikolov et al’s original paper: Distributed Representations of Words and Phrases
These word pairs become the training samples for the model. We will be using the key of this pair to create what is known as a one-hot vector. Currently they are in the form of (target word, context-word-in-window) and this will be used as the input for a simple 1 hidden layer neural network. In Figure 2 above, the target word is W(t). The projection of the input onto a hidden layer of which has a pre-determined number of neurons that was specified as a hyper parameter. A hyper parameter is the number of hidden layer neurons has a large effect on the accacury and speed of the model’s runtime and 300 is widely used in practice since it was used by word2vec’s creators. This simple neural network is known as a Restricted Boltzmann Machine (RBM).
Restricted Boltzmann Machines (RBMs)
In the image above, there are three columns that are known in discriptions of neural networks as layers. These diagrams show cause and effect between the layers of a neural network and are read from left-to-right. Each circle represents a neuron and is called a node. A node is where a calculation is preformed to determine if it will send a 0 or a 1 to a node in the next layer, which is to the right. This communication is known as firing and only happens in one direction, left-to-right.
In our restricted boltzman machine, nodes are not linked to, or communicate with, other nodes within the same layer. This restriction gives the RBM its name. And every node in the input layer is linked to each node in the hidden layer. The nodes/neurons in the input layer are considered to be different neurons in the hidden layer, hence why they are in different layers.
I stress this point because this is known as a bipartite graph. But not just any bipartide graph, a complete bipartite graph because these two layers are fully linked. Note, that some texts call this a symmetrical bipartite graph. Also it is important to notice in the graphic above how the hidden layer has fewer nodes than the input or output layers. This is an important quality of RBMs as a feature known as dimensionality reduction.
When a RBM is inalitized, four things are determined in advance and thus are hard-coded into the construction of the neural network. This things are known as hyper parameters.
- Number of nodes in the input layer
- Number of nodes in the hidden layer
- Number of nodes in the output layer
- The weights of the nodes in the hidden layer
With RBMs a special step happens when the hidden layer is created. Each node is randomlly assigned a weight. A wight is the power that node has on the nodes it is linked to in the next layer. This process of randomlly assigning weights is known as Stochastic Gradient Descent. It is called this because stocastic means “random” and these weights provide influence on the node in the next layer they are linked to.
The random assignment of the weights for word2vec is only done on the first round. The entire neural network is trained multiple times. Each one of these training rounds is called an Epoch. The weight matrix from the previous round is used as the initial weight table for the next round.
These weights are important to Word2Vec and like normal RBMs, Word2Vec randomly assigns weights. Word2Vec adjusts these weights over time while the neural network is fed our one-hot vector we created from our word pairs. This is known as training the neural network. Word2vec is similar to an autoencoder, as it encodes each word into a vector. But rather than training against the input words through reconstruction, as a restricted Boltzmann machine does, word2vec trains words against other words that are found in the context window of the input corpus.
Figure 4: This image was adapted from Chris McCormick’s Word2Vec Tutorial - The Skip-Gram Model tutorial
Figure 5: A diagram of how word2vec uses linear matrixes to train its neural network
Understanding Figure 5, the linear algebra of Word2Vec
Above I explained how the network is trained many times. The same input is fed into the network each time but the weight matrixes from the previous round is used as the inital weight matrix for the next round. In Figure 5, this is matrix W
.
So for our first round we will start with
h=xTW
Since x
is a one-hot encoded vector, h
is simply doing a lookup of the kth row of the weight matrix W
. Each row becomes a hidden layer for each word via the lookup trick provided by the one-hot encoding.
yc,j= W′Th
so essentally the output element is just the transpose of the weight matrix between the input and hidden layer and the hidden layer and the output layer. so
yc,j = W′TWTx
But there is one step missing from this equation above. The final output vector needs to be softmaxed. This takes the output layer (Y) and compresses the values into the range between 0 and 1. This allows for Y to act as a probability distribution for the input words (X). The equation for softmax is below:
We are using softmax since it continuous output provides a good use case for multiclass classification
I have trained using the method explained above. I used only 100 nodes in my hidden layer. The model would be more accurate if it was trained on a larger hiden layer. Often the size of this layer is 300 nodes.
Explaining skip-gram
Create list of predicted words for the target word ‘trump’
print(model_vectors['trump'])
[0.012244515, 0.05669574, 0.4243817, -0.13005282, 0.14591245, 0.15754509, 0.12144023, -0.043377433, -0.036471862, -0.17795071, -0.15042417, 0.4285602, -0.16748007, -0.09618644, 0.07635299, 0.021112783, -0.1097202, 0.16649377, 0.31761286, 0.2781521, 0.26321766, -0.35739362, -0.17595355, -0.28173986, 0.2220869, 0.421465, 0.12334027, 0.17061687, -0.16097873, 0.1101991, 0.39143816, -0.10224187, 0.19060156, 0.06379647, -0.055479944, 0.30508712, -0.33523571, -0.3099334, 0.16205992, 0.23172502, 0.12932838, -0.25712037, -0.24778262, -0.41348562, 0.10876833, -0.095286794, -0.12277438, 0.08167293, 0.2416396, -0.29519707, 0.07202256, -0.03740526, -0.08972215, -0.03250894, -0.21824007, 0.04827257, -0.009086915, 0.18352096, -0.10135367, -0.47981852, -0.06576853, 0.021472175, 0.023349164, 0.05336668, -0.37836334, 0.08596835, -0.08231194, -0.09812828, 0.0058923, -0.06080334, 0.15352124, 0.2911331, 0.15038647, 0.15921666, 0.13570379, 0.09163106, 0.0093092015, 0.0024938602, 0.16191821, -0.116921216, 0.37449756, -0.37325835, 0.17355393, 0.22919315, -0.22791475, -0.12990569, 0.15548478, -0.16302991, 0.09529176, -0.124482594, 0.01942392, -0.18610963, -0.43775123, -0.2965226, -0.07572919, 0.2682866, 0.15111415, -0.03312072, 0.023581406, -0.035215713] --- > Number of total posts processed: 1,000,000
Number of words in the model: 116,568
Number of features per word: 100
Now visualize the vector space using PCA and KMeans
Here I have to specify the number of clusters that Kmeans should use. A good approximation is to take the square root of half the number of words in the vocabulary list.
Size of the Word2vec matrix (words, features) is: (116568, 100)
Number of PCA clusters used: 241
The dimensions of the Word2Vec matrix: (116568, 100)
Find cosine simularity between each word in the W
matrix
Using cosine simularity we have the closeness of the word inauguration
with the word trump
.
nwords = 100
indexes = np.argpartition(dist,-(nwords+1))[-(nwords+1):]
di = []
for counter in range(nwords+1):
di.append(( words[indexes[counter]], dist[indexes[counter]], labels[indexes[counter]] ) )
print(di[2])
('inauguration/', 0.5486706985071742, 112)
The top 100 words that are simular according to Word2Vec
ranked_results = unsorted_result.iloc[::-1] # order results from closest to chosen word
print(ranked_results)
word similarity cluster
100 trump 1.000000 112
99 trump's 0.729306 112
98 elect 0.722010 112
97 potus 0.705011 112
96 donald 0.694213 112
95 trump’s 0.688237 112
94 president 0.687781 112
93 trumps 0.686771 112
92 clintonrussia 0.680471 35
91 trump/ 0.671885 112
90 unverified 0.666894 128
89 winner\r\nsearch 0.663887 118
88 trump? 0.660687 112
87 peotus 0.656724 35
86 html\n\ntrump 0.650845 35
85 remarks/ 0.643584 13
84 video/?utm_medium=email 0.641376 35
83 somers 0.638150 118
82 pence 0.636276 112
81 com/appalling 0.634154 118
80 tv/watch 0.633391 144
79 soci 0.632695 35
78 peegate 0.628730 35
77 obama 0.627831 93
76 shole 0.627038 221
75 ‘shithole’ 0.624290 13
74 feeley 0.623910 35
73 trump?\n 0.618699 35
72 netanyahu 0.617232 181
71 schumershutdown 0.616409 35
.. ... ... ...
29 'president 0.565560 35
28 com/2017/10/winning 0.564972 0
27 com/2018/01/president 0.564370 218
26 remark/ 0.564070 35
25 com/2017/06/mika 0.563695 17
24 djt 0.563693 112
23 sexist/ 0.563230 118
22 erin'strump® 0.562768 118
21 com/2018/01/13/pandoras 0.562444 112
20 com/2017/07/03/president 0.561896 17
19 office/2017/06/19/statement 0.560339 112
18 statements/remarks 0.559158 144
17 com/california/2017/01/07/dear 0.557056 0
16 \npresident 0.555016 112
15 office/2017/06/29/statement 0.553720 82
14 pledge/ 0.553681 118
13 com/donald 0.553214 83
12 trump\n\ntrump 0.553146 35
11 house/356849 0.552361 35
10 haiti\nhttp 0.551846 0
9 ‘resistance’ 0.551726 35
8 call/ 0.551359 35
7 com/2017/06/28/condoleezza 0.551023 187
6 j 0.550414 112
5 streep’s 0.550304 35
4 trump\nhttp 0.549403 218
3 statements/president 0.548829 17
2 inauguration/ 0.548671 112
1 nevertrumper 0.547763 13
0 trump\n\nhttp 0.547598 218
[101 rows x 3 columns]
Visualize Words using PCA
for i, word in enumerate(words):
ax.scatter(model[i, 0], model[i, 1], color='red', marker='o', edgecolors='black')
ax.text(model[i, 0], model[i, 1], model[i, 2], word)
plt.scatter(model[i, 0], model[i, 1], color='red', marker='o', edgecolors='black')
if(i > 50):
break
counter = 0
i = 0
plt.figure(figsize=(18, 16), dpi= 80, facecolor='w', edgecolor='k')
for counter in range(nwords+1):
word_lable = ranked_results.iloc[counter][0]
cosine_sim = ranked_results.iloc[counter][1]
assigned_cluster = ranked_results.iloc[counter][2]
plt.scatter(dist[indexes[counter]], labels[indexes[counter]], color='red', marker='o', edgecolors='black')
plt.annotate(word_lable, (cosine_sim, assigned_cluster))
if(i > 10):
break
plt.show()
References and Credits
[1] Jorge Castanon at https://github.com/castanan/w2v