Understanding Word2Vec With PySpark

13 minute read



Gabriel Fair

Goal

I need to use word embeddings to study the evolution of hate speech across social media. I chose to explore Word2Vec in hopes of learning more about it and to begin to probe the field of Natural Language Processing.

Today we are going to look at how Word2Vec uses word embeddings to create numeric vectors that represent the meaning of words. This is an important part of natural language processing (NLP). The goal of NLP is to extract meaning from human language, which is most often provided in the form of text, and that meaning can be found in many components of language.

Some components of language

  • Pragmatics
  • Semantics
  • Syntax
  • Morphology
  • Phonology

Dataset

I am using a Gab.ai dataset of posts submitted to the social platform. Gab.ai prides itself on the values of "free speech" and a lack of censorship. As a result it has become known for attracting trolls, bots, and the socially maligned, and comments and posts made to the site are notorious for being extreme and hate-laden. I have collected over 28 million posts and will use a 1 million post sample to train a skip-gram variant of the Word2Vec word embedding model. The goal is to capture the proximity of related words in a vector space.

Distributional Semantic Models

Word embeddings are word representation algorithms used in NLP. They are a subclass of distributional semantic models because they rely on the distributional hypothesis. The distributional hypothesis, introduced by Zellig Harris in his 1954 paper "Distributional Structure", is the assumption that words that occur in the same contexts tend to carry similar meanings, and thus synonyms end up with similar representations across a collection of texts. Word embeddings represent each word as a vector of values learned by a neural network.

Example Posts

data.body
"Probably because I see the faint hint of 'horns' holding that halo up... "
https://youtu.be/YMQRFT4bZuc
http://www.epochtimes.de/politik/europa/zahl-der-toten-nach-londoner-hochhausbrand-auf-79-gestiegen-2-a2146594.html
https://t.co/LTMBeXvHrC
"Ps 37:14 Die Gottlosen ziehen das Schwert aus und spannen ihren Bogen, daß sie fällen den Elenden und Armen und schlachten die Frommen.\nPs 37:15 Aber ihr Schwert wird in ihr Herz gehen, und ihr Bogen wird zerbrechen.\n\n"
At least 25 killed in airstrike on market in Yemen – reports\nhttps://www.rt.com/news/392838-saudi-yemen-market-airstrike/ #saudiarabia #yemen

Example after cleaning

['data', 'body']

['', 'probably', 'because', 'i', 'see', 'the', 'faint', 'hint', 'of', "'horns'", 'holding', 'that', 'halo', 'up', '', '', '', '', '']
['https', '//youtu', 'be/ymqrft4bzuc']
['http', '//www', 'epochtimes', 'de/politik/europa/zahl', 'der', 'toten', 'nach', 'londoner', 'hochhausbrand', 'auf', '79', 'gestiegen', '2', 'a2146594', 'html']
['https', '//t', 'co/ltmbexvhrc']
['', 'ps', '37', '14', 'die', 'gottlosen', 'ziehen', 'das', 'schwert', 'aus', 'und', 'spannen', 'ihren', 'bogen', '', 'daß', 'sie', 'fällen', 'den', 'elenden', 'und', 'armen', 'und', 'schlachten', 'die', 'frommen', '\\nps', '37', '15', 'aber', 'ihr', 'schwert', 'wird', 'in', 'ihr', 'herz', 'gehen', '', 'und', 'ihr', 'bogen', 'wird', 'zerbrechen', '\\n\\n', '']
['at', 'least', '25', 'killed', 'in', 'airstrike', 'on', 'market', 'in', 'yemen', '–', 'reports\\nhttps', '//www', 'rt', 'com/news/392838', 'saudi', 'yemen', 'market', 'airstrike/', '#saudiarabia', '#yemen']

Now we take this tokenized text and use it as the input to the model.
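The cleaning itself can be done directly in Spark. Below is a minimal sketch that assumes the posts sit in a DataFrame column named body, read from a hypothetical gab_posts.json file; the exact regular expression the author used for splitting is not shown, so the one here is only an approximation.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lower, split

spark = SparkSession.builder.appName('gab-word2vec').getOrCreate()
posts = spark.read.json('gab_posts.json')  # hypothetical input path

# Lowercase each post, then split on whitespace and common punctuation
tokenized = posts.select(
    split(lower(col('body')), r'[\s.,!?"]+').alias('tokens')
)
tokenized.show(5, truncate=False)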

Building a Vector

To use the distributional hypothesis to build a vector, we have to choose what it means for words to be "near" each other. This measure of nearness is known as a window. In the image below, taken from Chris McCormick's tutorial, the target word is highlighted in blue and the window extends two words on either side of it, which means the window size is equal to 2.

photo

Word pairs are created between the target word and every other word in the window, which extends both forwards and backwards. The target is then moved to the next word and the process repeats. Some embedding models treat words to the left of the target differently from words to the right, but for now we will treat both sides equally, as in the sketch below.
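To make the pairing concrete, here is a small, self-contained Python sketch of how (target, context) pairs could be generated with a window size of 2. The function name and example sentence are only illustrative; this is not the author's code.

def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs, looking `window` words to the left and right."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:  # skip pairing the target with itself
                pairs.append((target, tokens[j]))
    return pairs

print(skipgram_pairs(['at', 'least', '25', 'killed', 'in', 'airstrike']))
# e.g. ('25', 'at'), ('25', 'least'), ('25', 'killed'), ('25', 'in'), ...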

photo

Figure 2: A diagram of the skip-gram model, which starts with the target word and tries to predict the context words, i.e. the words in the window. This image was taken from Tomas Mikolov et al.'s paper "Distributed Representations of Words and Phrases and their Compositionality".

These word pairs become the training samples for the model. They currently take the form (target word, context word in window), and the target word of each pair is encoded as what is known as a one-hot vector, which becomes the input to a simple neural network with a single hidden layer. In Figure 2 above, the target word is w(t). The input is projected onto a hidden layer that has a pre-determined number of neurons, specified as a hyperparameter. The number of hidden-layer neurons has a large effect on the accuracy of the model and the speed of training; 300 is widely used in practice since that is what Word2Vec's creators used. This simple network structure closely resembles a Restricted Boltzmann Machine (RBM).
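A one-hot vector is just a vector as long as the vocabulary with a single 1 at the index of the word in question. A minimal sketch with a toy vocabulary (illustrative only, not the author's code):

import numpy as np

vocab = ['airstrike', 'at', 'in', 'killed', 'least', 'market']  # toy vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # All zeros except a single 1 at the word's vocabulary index
    x = np.zeros(len(vocab))
    x[word_to_index[word]] = 1.0
    return x

print(one_hot('killed'))  # [0. 0. 0. 1. 0. 0.]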

Restricted Boltzmann Machines (RBMs)

photo

In the image above, there are three columns, known in descriptions of neural networks as layers. These diagrams show cause and effect between the layers of a neural network and are read from left to right. Each circle represents a neuron and is called a node. A node is where a calculation is performed to determine whether it will send a 0 or a 1 to a node in the next layer, which is to the right. This communication is known as firing and only happens in one direction, left to right.

photo

In our Restricted Boltzmann Machine, nodes are not linked to, and do not communicate with, other nodes within the same layer. This restriction gives the RBM its name. Every node in the input layer is, however, linked to every node in the hidden layer. The neurons in the input layer are distinct from the neurons in the hidden layer, which is why they are drawn as separate layers.

I stress this point because this structure is known as a bipartite graph. But not just any bipartite graph: a complete bipartite graph, because the two layers are fully linked (some texts call this a symmetric bipartite graph). It is also important to notice in the graphic above how the hidden layer has fewer nodes than the input or output layers. This is an important quality of RBMs, because it is what provides the feature known as dimensionality reduction.

When an RBM is initialized, four things are determined in advance and thus are hard-coded into the construction of the neural network. These things are known as hyperparameters:

  • Number of nodes in the input layer
  • Number of nodes in the hidden layer
  • Number of nodes in the output layer
  • The initial weights on the links into the hidden layer

With RBMs, a special step happens when the hidden layer is created: every link into it is randomly assigned a weight. A weight is the amount of influence a node has on the nodes it is linked to in the next layer. These weights are then adjusted during training using Stochastic Gradient Descent, so called because "stochastic" means random: each update to the weights is computed from randomly sampled training examples rather than from the whole dataset at once. A toy example of such weight updates follows.
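Here is a tiny, self-contained illustration of gradient-descent weight updates on a toy squared-error loss. It is not Word2Vec's actual loss; in true stochastic gradient descent the gradient would be computed on randomly sampled training pairs.

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)              # randomly initialised weights
target = np.array([1.0, 2.0, 3.0])  # toy "ideal" weights
lr = 0.1                            # learning rate

for step in range(100):
    grad = 2 * (w - target)         # gradient of the squared error ||w - target||^2
    w -= lr * grad                  # gradient-descent update

print(w)                            # converges towards [1. 2. 3.]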

The random assignment of the weights in Word2Vec is only done on the first round. The entire neural network is trained multiple times, and each of these training rounds is called an epoch. The weight matrix from the previous round is used as the initial weight matrix for the next round.

These weights are important to Word2Vec and, like a normal RBM, Word2Vec assigns them randomly at the start. Word2Vec then adjusts these weights over time as the neural network is fed the one-hot vectors we created from our word pairs. This is known as training the neural network. Word2Vec is similar to an autoencoder in that it encodes each word into a vector, but rather than training against the input words through reconstruction, as a Restricted Boltzmann Machine does, Word2Vec trains words against the other words that appear in their context window in the input corpus.

photo

Figure 4: This image was adapted from Chris McCormick's tutorial "Word2Vec Tutorial - The Skip-Gram Model"

photo

Figure 5: A diagram of how Word2Vec uses matrix multiplications to train its neural network

Understanding Figure 5, the linear algebra of Word2Vec

Above I explained how the network is trained many times. The same input is fed into the network each time, but the weight matrix from the previous round is used as the initial weight matrix for the next round. In Figure 5, this is the matrix W.

So for our first round we will start with

$$ h = W^T x $$

Since x is a one-hot encoded vector, computing h is simply a lookup of the kth row of the weight matrix W, where k is the index of the input word. Each row of W therefore acts as the hidden-layer representation of one word, thanks to the lookup trick the one-hot encoding provides.

$$ y_{c,j} = {W'}^T h $$

So essentially the output is obtained by applying the transpose of the input-to-hidden weight matrix and then the transpose of the hidden-to-output weight matrix to the input:

$$ y_{c,j} = {W'}^T W^T x $$

But there is one step missing from the equation above: the final output vector needs to be passed through a softmax. Softmax takes the raw scores in the output layer (y) and compresses them into values between 0 and 1 that sum to 1, which lets the output act as a probability distribution over the vocabulary given the input word (x). For output unit j, the softmax is:

$$ y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} $$

where u_j is the jth raw output score and V is the size of the vocabulary.

We use softmax because its continuous output is exactly what multiclass classification needs: a probability for every word in the vocabulary.
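Putting these pieces together, here is a minimal NumPy sketch of one forward pass through the network described by the equations above. The vocabulary size, hidden size, and weight values are toy numbers, not the trained model.

import numpy as np

V, N = 6, 3                        # toy vocabulary size and hidden-layer size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))        # input -> hidden weights (one row per word)
W_prime = rng.normal(size=(N, V))  # hidden -> output weights

x = np.zeros(V); x[3] = 1.0        # one-hot vector for the word at index 3

h = W.T @ x                        # lookup: h is just row 3 of W
u = W_prime.T @ h                  # raw output scores, one per vocabulary word
y = np.exp(u) / np.exp(u).sum()    # softmax: scores -> probability distribution
print(y)                           # six probabilities that sum to 1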

I trained a model using the method explained above, with only 100 nodes in the hidden layer. The model would likely be more accurate if it were trained with a larger hidden layer; 300 nodes is a common choice.
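In Spark, the training step looks roughly like the sketch below, using Spark MLlib's Word2Vec and assuming the tokenized DataFrame from the cleaning step. The windowSize and minCount values here are assumptions, not the author's settings.

from pyspark.ml.feature import Word2Vec

word2vec = Word2Vec(vectorSize=100, windowSize=5, minCount=5,
                    inputCol="tokens", outputCol="features")
model = word2vec.fit(tokenized)  # learns one 100-dimensional vector per word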

Explaining skip-gram

Create a list of the words most related to the target word 'trump'. First, print its learned vector:

print(model_vectors['trump'])
[0.012244515, 0.05669574, 0.4243817, -0.13005282, 0.14591245, 0.15754509, 0.12144023, -0.043377433, -0.036471862, -0.17795071, -0.15042417, 0.4285602, -0.16748007, -0.09618644, 0.07635299, 0.021112783, -0.1097202, 0.16649377, 0.31761286, 0.2781521, 0.26321766, -0.35739362, -0.17595355, -0.28173986, 0.2220869, 0.421465, 0.12334027, 0.17061687, -0.16097873, 0.1101991, 0.39143816, -0.10224187, 0.19060156, 0.06379647, -0.055479944, 0.30508712, -0.33523571, -0.3099334, 0.16205992, 0.23172502, 0.12932838, -0.25712037, -0.24778262, -0.41348562, 0.10876833, -0.095286794, -0.12277438, 0.08167293, 0.2416396, -0.29519707, 0.07202256, -0.03740526, -0.08972215, -0.03250894, -0.21824007, 0.04827257, -0.009086915, 0.18352096, -0.10135367, -0.47981852, -0.06576853, 0.021472175, 0.023349164, 0.05336668, -0.37836334, 0.08596835, -0.08231194, -0.09812828, 0.0058923, -0.06080334, 0.15352124, 0.2911331, 0.15038647, 0.15921666, 0.13570379, 0.09163106, 0.0093092015, 0.0024938602, 0.16191821, -0.116921216, 0.37449756, -0.37325835, 0.17355393, 0.22919315, -0.22791475, -0.12990569, 0.15548478, -0.16302991, 0.09529176, -0.124482594, 0.01942392, -0.18610963, -0.43775123, -0.2965226, -0.07572919, 0.2682866, 0.15111415, -0.03312072, 0.023581406, -0.035215713]

Number of total posts processed: 1,000,000

Number of words in the model: 116,568

Number of features per word: 100
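One way to pull the learned vectors out of the Spark model into a plain Python dictionary like the model_vectors used above (a sketch, not the author's extraction code):

# Assumes the `model` from the training sketch; getVectors() returns a DataFrame
# with one row per word and columns "word" and "vector"
vectors_df = model.getVectors()
model_vectors = {row['word']: row['vector'].toArray() for row in vectors_df.collect()}
print(len(model_vectors), len(model_vectors['trump']))  # vocabulary size, 100 features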

Now visualize the vector space using PCA and KMeans

Here I have to specify the number of clusters that KMeans should use. A good approximation is to take the square root of half the number of words in the vocabulary (see the sketch after the output below).

Size of the Word2vec matrix (words, features) is: (116568, 100)

Number of PCA clusters used: 241

The dimensions of the Word2Vec matrix: (116568, 100)
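A sketch of the clustering and projection step, using scikit-learn for KMeans and PCA since the author's exact pipeline is not shown (it is likely based on Castanon's w2v code, credited below). With 116,568 words, int(sqrt(116568 / 2)) gives the 241 clusters reported above.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Build the (vocabulary, 100) matrix from the model_vectors dict collected earlier
words = list(model_vectors.keys())
W = np.array([model_vectors[w] for w in words])

n_clusters = int(np.sqrt(W.shape[0] / 2))      # square root of half the vocabulary
labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(W)
coords = PCA(n_components=2).fit_transform(W)  # 2-D coordinates for plotting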

Find the cosine similarity between each word in the W matrix and a chosen word

Using cosine similarity we can measure the closeness of, for example, the word "inauguration" to the word "trump". The code below relies on a precomputed array dist of similarities to 'trump', along with the words list and KMeans labels from above.
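A minimal sketch of how dist could be computed, reusing the W and words built in the clustering sketch (not the author's exact code):

import numpy as np

# Cosine similarity of every word vector against the vector for 'trump'
target = W[words.index('trump')]
dist = (W @ target) / (np.linalg.norm(W, axis=1) * np.linalg.norm(target))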

nwords = 100
# Indexes of the (nwords + 1) largest similarity scores; the word itself is included
indexes = np.argpartition(dist, -(nwords + 1))[-(nwords + 1):]
di = []
for counter in range(nwords + 1):
    # Each entry is (word, cosine similarity to 'trump', KMeans cluster id)
    di.append((words[indexes[counter]], dist[indexes[counter]], labels[indexes[counter]]))
print(di[2])
('inauguration/', 0.5486706985071742, 112)

The top 100 words that are similar according to Word2Vec
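The unsorted_result frame used below could be assembled from the di list along these lines; this is a sketch, as the author's exact construction is not shown.

import pandas as pd

unsorted_result = (pd.DataFrame(di, columns=['word', 'similarity', 'cluster'])
                   .sort_values('similarity')
                   .reset_index(drop=True))  # ascending similarity; reversed below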

ranked_results = unsorted_result.iloc[::-1]  # reverse so the closest words to 'trump' come first
print(ranked_results)
                               word  similarity  cluster
100                           trump    1.000000      112
99                          trump's    0.729306      112
98                            elect    0.722010      112
97                            potus    0.705011      112
96                           donald    0.694213      112
95                          trump’s    0.688237      112
94                        president    0.687781      112
93                           trumps    0.686771      112
92                    clintonrussia    0.680471       35
91                           trump/    0.671885      112
90                       unverified    0.666894      128
89                 winner\r\nsearch    0.663887      118
88                           trump?    0.660687      112
87                           peotus    0.656724       35
86                    html\n\ntrump    0.650845       35
85                         remarks/    0.643584       13
84          video/?utm_medium=email    0.641376       35
83                           somers    0.638150      118
82                            pence    0.636276      112
81                    com/appalling    0.634154      118
80                         tv/watch    0.633391      144
79                             soci    0.632695       35
78                          peegate    0.628730       35
77                            obama    0.627831       93
76                            shole    0.627038      221
75                       ‘shithole’    0.624290       13
74                           feeley    0.623910       35
73                         trump?\n    0.618699       35
72                        netanyahu    0.617232      181
71                  schumershutdown    0.616409       35
..                              ...         ...      ...
29                       'president    0.565560       35
28              com/2017/10/winning    0.564972        0
27            com/2018/01/president    0.564370      218
26                          remark/    0.564070       35
25                 com/2017/06/mika    0.563695       17
24                              djt    0.563693      112
23                          sexist/    0.563230      118
22                     erin'strump®    0.562768      118
21          com/2018/01/13/pandoras    0.562444      112
20         com/2017/07/03/president    0.561896       17
19      office/2017/06/19/statement    0.560339      112
18               statements/remarks    0.559158      144
17   com/california/2017/01/07/dear    0.557056        0
16                      \npresident    0.555016      112
15      office/2017/06/29/statement    0.553720       82
14                          pledge/    0.553681      118
13                       com/donald    0.553214       83
12                   trump\n\ntrump    0.553146       35
11                     house/356849    0.552361       35
10                      haiti\nhttp    0.551846        0
9                      ‘resistance’    0.551726       35
8                             call/    0.551359       35
7        com/2017/06/28/condoleezza    0.551023      187
6                                 j    0.550414      112
5                          streep’s    0.550304       35
4                       trump\nhttp    0.549403      218
3              statements/president    0.548829       17
2                     inauguration/    0.548671      112
1                      nevertrumper    0.547763       13
0                     trump\n\nhttp    0.547598      218

[101 rows x 3 columns]

Visualize Words using PCA

import matplotlib.pyplot as plt

# `model` here holds the 2-D PCA coordinates of each word; `words` is the vocabulary list
fig, ax = plt.subplots(figsize=(18, 16))
for i, word in enumerate(words):
    ax.scatter(model[i, 0], model[i, 1], color='red', marker='o', edgecolors='black')
    ax.text(model[i, 0], model[i, 1], word)
    if i > 50:  # only plot the first ~50 words to keep the figure readable
        break
plt.show()

png

# Plot the top neighbours of 'trump': cosine similarity (x) vs KMeans cluster id (y)
plt.figure(figsize=(18, 16), dpi=80, facecolor='w', edgecolor='k')
for counter in range(nwords + 1):
    row = ranked_results.iloc[counter]
    word_label = row['word']
    cosine_sim = row['similarity']
    assigned_cluster = row['cluster']

    plt.scatter(cosine_sim, assigned_cluster, color='red', marker='o', edgecolors='black')
    plt.annotate(word_label, (cosine_sim, assigned_cluster))
    if counter > 10:  # only label the first few neighbours to keep the plot readable
        break

plt.show()

png

References and Credits

[1] Jorge Castanon at https://github.com/castanan/w2v