cs224n L6: language model and RNN notes

春和景明 2020-04-11 19:39:17

Lecture outline:

1. Language modeling (LM) 2. Recurrent neural networks (RNN) and RNN-LM 3. RNN applications

Language modeling

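Definition of a language model: given a sequence of words x(1), x(2), ..., x(t), compute the probability distribution of the next word x(t+1), P(x(t+1)|x(t),...,x(1)), where x(t+1) can be any word in the vocabulary V. Equivalently, a language model can be seen as a system that assigns a probability to a given piece of text.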

Fig. 1 The probability of a text x(1), x(2), ..., x(T) as computed by a language model
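The figure is not reproduced in these notes. As a reminder, the probability it refers to can be written with the chain rule; this is the standard factorization, and the superscript notation is my own choice rather than a copy of the slide:

```latex
P\left(x^{(1)}, \dots, x^{(T)}\right) = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)}\right)
```

Each factor on the right is exactly the next-word distribution that the language model provides.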

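A) Pre-deep-learning method: n-gram LM. The core idea of an n-gram LM is to predict the next word from the frequencies of different n-grams counted in a text corpus. Model assumption: x(t+1) depends only on the n-1 words that precede it.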

Fig. 2 Counting n-gram frequencies in a large text corpus
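For reference, the count-based estimate behind this figure can be written as below. This is the usual n-gram approximation, written in my own notation; the numerator and denominator here are the two counts that the sparsity discussion refers to:

```latex
P\left(x^{(t+1)} \mid x^{(t)}, \dots, x^{(t-n+2)}\right) \approx
\frac{\mathrm{count}\left(x^{(t-n+2)}, \dots, x^{(t)}, x^{(t+1)}\right)}
     {\mathrm{count}\left(x^{(t-n+2)}, \dots, x^{(t)}\right)}
```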

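Problems with n-gram LMs:
1) Looking only at the previous n-1 words discards important context.
2) Sparsity problems: if the count in the numerator is 0, the predicted probability is 0, yet a phrase that has not appeared so far may still appear later. The fix is to add a small delta to the count of every word in the vocabulary, which is called smoothing. If the count in the denominator is 0, meaning the (n-1)-word context never appeared in the corpus, the fix is backoff: condition on the previous n-2 words instead. Increasing n makes both sparsity problems worse, so in practice n is usually no larger than 5.
3) Storage problem: as n grows, the number of n-grams to store grows too, and the model size grows with it.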
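The counting, smoothing and backoff ideas above can be sketched in a few lines of Python. This is a minimal illustration on a toy corpus, not the lecture's code: the function names, the toy sentence and the value of delta are all assumptions made for the example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count n-grams and the (n-1)-word contexts they start with."""
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter()
    for gram, c in ngrams.items():
        contexts[gram[:-1]] += c
    return ngrams, contexts

def next_word_prob(word, context, ngrams, contexts, vocab_size, delta=0.0):
    """count(context + word) / count(context), with optional add-delta smoothing."""
    context = tuple(context)
    numerator = ngrams[context + (word,)] + delta
    denominator = contexts[context] + delta * vocab_size
    if denominator == 0:
        return None  # unseen context: the caller should back off
    return numerator / denominator

def backoff_prob(word, context, models, vocab_size):
    """Unsmoothed estimate that drops the oldest context word until the context has been seen."""
    context = tuple(context)
    for k in range(len(context), -1, -1):
        ngrams, contexts = models[k + 1]
        shorter = context[len(context) - k:]
        p = next_word_prob(word, shorter, ngrams, contexts, vocab_size)
        if p is not None:
            return p
    return None

# Toy corpus (hypothetical), with unigram, bigram and trigram counts.
tokens = "the cat sat on the mat the cat ate".split()
vocab_size = len(set(tokens))
models = {n: count_ngrams(tokens, n) for n in (1, 2, 3)}
bigrams, bigram_contexts = models[2]

p_seen = next_word_prob("cat", ["the"], bigrams, bigram_contexts, vocab_size)
p_unseen = next_word_prob("sat", ["the"], bigrams, bigram_contexts, vocab_size)
p_smoothed = next_word_prob("sat", ["the"], bigrams, bigram_contexts, vocab_size, delta=0.1)
# The trigram context ("ate", "the") never occurs, so this falls back to the bigram context ("the",).
p_backoff = backoff_prob("cat", ["ate", "the"], models, vocab_size)
```

Here "the sat" never occurs, so the unsmoothed estimate is exactly zero (the numerator problem), while the smoothed one is small but positive. The last call shows the denominator problem being handled by shortening the context.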

B) A fixed-window neural LM

Fig. 3 The model is a window-based neural network, the same kind used earlier for the NER task

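Improvements: 1) no sparsity problem; 2) no storage problem (there is no need to store all observed n-grams). Remaining problems: 1) the model is still limited by the window size, and information outside the window is still lost; 2) enlarging the window also enlarges W and the model size (the number of parameters); 3) x(1) and x(2) are multiplied by completely different parts of W, even though the way each word is processed could be shared, so W contains a lot of redundancy, or in other words W is learned inefficiently.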
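To make the role of W concrete, here is a rough NumPy sketch of a fixed-window model. The architecture (concatenated embeddings, one tanh hidden layer, softmax output), the sizes and all parameter names are assumptions for illustration, since the figure is not reproduced in these notes.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fixed_window_lm(word_ids, E, W, b1, U, b2):
    """Predict the next-word distribution from a fixed window of word ids."""
    e = E[word_ids].reshape(-1)   # concatenate the window's embeddings
    h = np.tanh(W @ e + b1)       # hidden layer
    return softmax(U @ h + b2)    # distribution over the vocabulary

rng = np.random.default_rng(0)
V, d, hidden, window = 10, 4, 8, 3      # hypothetical sizes
E = rng.normal(size=(V, d))
W = rng.normal(size=(hidden, window * d))
b1 = np.zeros(hidden)
U = rng.normal(size=(V, hidden))
b2 = np.zeros(V)

y_hat = fixed_window_lm([1, 5, 2], E, W, b1, U, b2)
```

W has `window * d` columns, so every extra word in the window adds `d` new columns, and each window position is multiplied by its own slice of W. Those are remaining problems 2) and 3) above.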

C) Standard evaluation metric for language models. Perplexity: the inverse probability of the corpus according to the language model, normalized by the number of words. Lower perplexity is better.

Fig. 4 Perplexity is the exponential of the cross entropy loss J(theta) of the LM

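Perplexity equals the exponential of the LM's overall loss. So the training objective, minimizing J(theta), is the same as obtaining a low perplexity, which makes perplexity a reasonable way to evaluate a language model. LM is a benchmark task that helps us measure progress on understanding language. It is also a subcomponent of many NLP tasks, especially those involving generating text or estimating the probability of text, such as: predictive typing, speech/handwriting recognition, spelling/grammar correction, authorship identification, machine translation, summarization, dialogue.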
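The equivalence between the two views of perplexity can be checked numerically. The sketch below assumes we already have the probability the model assigned to each true next word; the numbers are made up for the example.

```python
import math

def perplexity(true_word_probs):
    """exp of the average negative log probability of the true words."""
    T = len(true_word_probs)
    J = -sum(math.log(p) for p in true_word_probs) / T   # cross-entropy loss
    return math.exp(J)

# Hypothetical probabilities assigned to the true next word at each step.
probs = [0.25, 0.5, 0.125, 0.25]

via_loss = perplexity(probs)
# Direct definition: inverse probability of the corpus, normalized by length.
via_inverse_prob = math.prod(1.0 / p for p in probs) ** (1.0 / len(probs))
```

Both values agree up to floating-point error. A model that always assigned probability 0.25 to the true word would have perplexity of about 4, as if it were choosing uniformly among four words at every step.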

Recurrent neural networks (RNN)

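A) Vanilla RNN. Basic idea: each hidden state(t) is computed from the previous hidden state(t-1) and the current input(t) by a linear combination followed by a nonlinearity. The linear combination involves two weight matrices, Wh and We, which are the same for every hidden state.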

Fig. 5 Apply the same weights Wh and We repeatedly

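Properties of RNNs: 1) they can process input of any length, and the model size does not grow with the input length; 2) there are as many hidden states as inputs, while the number of outputs is up to us; 3) the same weights are applied at every timestep, so every input is processed the same way and whatever the weights learn is shared across positions; 4) computation proceeds left to right and cannot be parallelized across timesteps, so it is slow; 5) in theory every hidden state carries information from all previous inputs, but in practice this is hard to achieve.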
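A minimal NumPy sketch of the recurrence may help here. The tanh nonlinearity, the bias term b, the zero initial state and the sizes are assumptions for illustration; only Wh and We come from the notes.

```python
import numpy as np

def rnn_forward(inputs, h0, Wh, We, b):
    """Apply the same Wh and We at every timestep; return all hidden states."""
    h = h0
    states = []
    for e_t in inputs:                       # strictly left to right
        h = np.tanh(Wh @ h + We @ e_t + b)   # depends on the previous h
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d_h, d_e = 4, 3                              # hypothetical sizes
Wh = rng.normal(scale=0.1, size=(d_h, d_h))
We = rng.normal(scale=0.1, size=(d_h, d_e))
b = np.zeros(d_h)
h0 = np.zeros(d_h)

short = rnn_forward(rng.normal(size=(3, d_e)), h0, Wh, We, b)
long = rnn_forward(rng.normal(size=(7, d_e)), h0, Wh, We, b)
```

The same parameters handle a 3-step and a 7-step input, and there is one hidden state per input. The loop cannot be parallelized across timesteps because each step needs the previous hidden state.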

B) RNN language model

Fig. 6 RNN-LM

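Training an RNN-LM: take a large text corpus x(1),...,x(t),...,x(T) and feed it into the RNN-LM, computing the output distribution y_hat(t) at every step given the words so far. The loss at step t is the cross entropy between the prediction y_hat(t) and the true value y(t), that is, the negative log probability that y_hat(t) assigns to the true next word; y(t) is the one-hot vector of the next word.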

Fig. 7 Loss function of RNN-LM

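Computing the loss and gradient over the whole corpus is too computationally expensive. In practice we compute the loss on a batch of one sentence (or a few sentences) and update the weights, in the spirit of SGD.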
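The per-sentence loss described above could be computed roughly as follows. This is a sketch under assumptions: the embedding matrix E, the output layer U and b2, the tanh nonlinearity, and the word ids are all hypothetical, and no gradient update is shown.

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)     # numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def rnn_lm_loss(word_ids, E, Wh, We, b1, U, b2):
    """Average cross-entropy of predicting each next word in one sentence."""
    inputs, targets = word_ids[:-1], word_ids[1:]
    h = np.zeros(Wh.shape[0])
    logits = []
    for w in inputs:
        h = np.tanh(Wh @ h + We @ E[w] + b1)
        logits.append(U @ h + b2)
    y_hat = softmax_rows(np.array(logits))
    # With a one-hot y(t), cross entropy reduces to -log y_hat(t)[true word].
    step_losses = -np.log(y_hat[np.arange(len(targets)), targets])
    return step_losses.mean(), step_losses

rng = np.random.default_rng(0)
V, d_e, d_h = 5, 3, 4                        # hypothetical sizes
E = rng.normal(size=(V, d_e))
Wh = rng.normal(scale=0.1, size=(d_h, d_h))
We = rng.normal(scale=0.1, size=(d_h, d_e))
b1 = np.zeros(d_h)
U = rng.normal(scale=0.1, size=(V, d_h))
b2 = np.zeros(V)

sentence = [0, 3, 1, 4, 2]                   # one sentence as word ids
J, step_losses = rnn_lm_loss(sentence, E, Wh, We, b1, U, b2)
```

A training loop would call this on one sentence (or a small batch) at a time, compute gradients, and update the weights. A useful sanity check is that an untrained model with a uniform output distribution has loss log(V).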

C) Backprop for RNNs

Wh is updated at every iteration of the algorithm, so we need the gradient of each per-step loss J(t) with respect to Wh.

Fig. 8 Wh at every timestep contributes to J(t); the gradient follows from the multivariable chain rule

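How is this gradient computed? Backpropagate over timesteps i=t,...,0, accumulating the gradient sum along the way. The per-timestep terms cannot be computed independently: the gradient at a later timestep must be available before the one at the preceding timestep can be computed, and the final gradient is the sum of all the terms.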
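The accumulation can be illustrated with a small NumPy sketch that is checked against finite differences. To keep it short it uses a toy loss J = 0.5 * ||h(t)||^2 on the final hidden state instead of cross entropy, a tanh nonlinearity and no bias; these choices, the sizes and the function names are assumptions for illustration. Since h(0) is a fixed initial state, the loop runs over the steps i = t, ..., 1 where Wh is actually used.

```python
import numpy as np

def forward(Wh, We, xs):
    hs = [np.zeros(Wh.shape[0])]                 # hs[0] is the initial state
    for x in xs:
        hs.append(np.tanh(Wh @ hs[-1] + We @ x))
    return hs

def loss(Wh, We, xs):
    return 0.5 * np.sum(forward(Wh, We, xs)[-1] ** 2)   # toy loss on h(t)

def bptt_grad_Wh(Wh, We, xs):
    hs = forward(Wh, We, xs)
    dWh = np.zeros_like(Wh)
    dh = hs[-1].copy()                    # dJ/dh(t) for the toy loss
    for i in range(len(xs), 0, -1):       # walk backwards through time
        dz = dh * (1.0 - hs[i] ** 2)      # back through tanh at step i
        dWh += np.outer(dz, hs[i - 1])    # contribution of Wh as used at step i
        dh = Wh.T @ dz                    # gradient handed to h(i-1)
    return dWh

rng = np.random.default_rng(0)
Wh = rng.normal(scale=0.5, size=(3, 3))
We = rng.normal(scale=0.5, size=(3, 2))
xs = rng.normal(size=(4, 2))

analytic = bptt_grad_Wh(Wh, We, xs)

# Finite-difference check of every entry of the gradient.
numeric = np.zeros_like(Wh)
eps = 1e-6
for r in range(Wh.shape[0]):
    for c in range(Wh.shape[1]):
        Wp, Wm = Wh.copy(), Wh.copy()
        Wp[r, c] += eps
        Wm[r, c] -= eps
        numeric[r, c] = (loss(Wp, We, xs) - loss(Wm, We, xs)) / (2 * eps)
max_error = np.abs(analytic - numeric).max()
```

The line `dh = Wh.T @ dz` is why the terms cannot be computed separately: the contribution at step i-1 needs the gradient that has already flowed back through step i.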

Fig. 9 This algorithm is called "backpropagation through time"

RNN applications

A) RNNs for tagging (POS or NER)

Fig. 10 Left: part-of-speech (POS) tagging; Right: sentiment classification

B) RNNs for sentence encoding and sentiment classification. To compute the sentence encoding, either use the final hidden state directly or take the element-wise max or mean of all hidden states.
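The three pooling options amount to one line each in NumPy. In the sketch below the hidden states are a made-up 3-step, 2-dimensional example, and the function name and `mode` argument are hypothetical.

```python
import numpy as np

def sentence_encoding(hidden_states, mode="final"):
    """Collapse a (timesteps, hidden_size) array into one sentence vector."""
    H = np.asarray(hidden_states)
    if mode == "final":
        return H[-1]               # last hidden state only
    if mode == "max":
        return H.max(axis=0)       # element-wise max over timesteps
    if mode == "mean":
        return H.mean(axis=0)      # element-wise mean over timesteps
    raise ValueError(f"unknown mode: {mode}")

H = np.array([[1.0, -2.0],
              [3.0,  0.0],
              [2.0,  4.0]])

final_enc = sentence_encoding(H, "final")
max_enc = sentence_encoding(H, "max")
mean_enc = sentence_encoding(H, "mean")
```

Whichever vector is chosen would then be fed to a classifier, for example to predict sentiment.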

C) RNNs for an encoder module (Q&A, MT)

Fig. 11 Question sentence encoder

D) RNN-LMs for generating text (speech recognition, MT, summarization)

Fig. 12 Conditional language model
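One common way to generate text with an RNN-LM is to sample a word from the output distribution and feed it back in as the next input. The sketch below illustrates that loop with random, untrained parameters; the parameter names, sizes and tanh nonlinearity are assumptions, and a conditional model would additionally condition the hidden state on some source input (for example audio or a source sentence), which is not shown.

```python
import numpy as np

def generate(start_id, steps, E, Wh, We, b1, U, b2, rng):
    """Sample `steps` word ids, feeding each sampled word back as input."""
    h = np.zeros(Wh.shape[0])
    word, out = start_id, [start_id]
    for _ in range(steps):
        h = np.tanh(Wh @ h + We @ E[word] + b1)
        logits = U @ h + b2
        p = np.exp(logits - logits.max())
        p /= p.sum()                              # softmax over the vocabulary
        word = int(rng.choice(len(p), p=p))       # sample the next word id
        out.append(word)
    return out

rng = np.random.default_rng(0)
V, d_e, d_h = 6, 3, 4                             # hypothetical sizes
E = rng.normal(size=(V, d_e))
Wh = rng.normal(scale=0.1, size=(d_h, d_h))
We = rng.normal(scale=0.1, size=(d_h, d_e))
b1 = np.zeros(d_h)
U = rng.normal(size=(V, d_h))
b2 = np.zeros(V)

sampled_ids = generate(0, 5, E, Wh, We, b1, U, b2, rng)
```

With trained weights and a mapping from ids back to words, `sampled_ids` would read as generated text; with the random weights used here it is only a list of arbitrary ids.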
