A Hands-On Guide to LDA

Motivation and idea behind LDA

Let's start with the figure below:

The generative process first picks a topic with a certain probability and then picks a token from that topic with a certain probability. The inference process goes the other way: given some documents and the tokens in them, without knowing which topic generated each token, we need to infer the probability that each document draws from each topic, as well as the probability that each topic emits each word.

This inference cannot be done arbitrarily; it has to follow a set of assumptions, and those assumptions are exactly LDA's generative process (which is why many introductions to LDA write the generative process out right at the beginning).
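
For reference, the generative process assumed by LDA, for a corpus of $M$ documents, $K$ topics, a vocabulary of size $V$, and $N_{m}$ words in document $m$, is the standard one:

$$
\begin{align*}
&\phi_{k} \sim \operatorname{Dir}(\beta), \quad k=1,\dots,K \\
&\theta_{m} \sim \operatorname{Dir}(\alpha), \quad m=1,\dots,M \\
&z_{mn} \sim \operatorname{Mult}(\theta_{m}), \quad w_{mn} \sim \operatorname{Mult}(\phi_{z_{mn}}), \quad n=1,\dots,N_{m}
\end{align*}
$$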

From this generative process we can see that the crucial step is estimating the parameters $\Theta$ and $\Phi$, i.e. the $\theta_{m}$ and $\phi_{k}$. Both can be estimated from samples of the topic-assignment vector $\textbf{z}$: if we can assign a topic to every word in a "reasonable" way (i.e. obtain a number of samples of $\textbf{z}$), then $\theta$ and $\phi$ can be inferred from them.

The problem therefore becomes: given the observed data $\textbf{w}$ and the prior hyperparameters $\alpha$ and $\beta$, how do we infer the distribution of $\textbf{z}$ (so that we can sample from it)? That is, we want:

$p(\textbf{z}|\textbf{w},\alpha,\beta)$,

By Bayes' rule, $p(\textbf{z}|\textbf{w},\alpha,\beta)=\frac{p(\textbf{z},\textbf{w}|\alpha,\beta)}{p(\textbf{w}|\alpha,\beta)}$. The denominator is hard to estimate because the space is extremely high-dimensional and sparse. Note, however, that to sample $\textbf{z}$ we only need the relative probabilities across topics, i.e. $p(\textbf{z}|\textbf{w},\alpha,\beta)=\frac{p(\textbf{z},\textbf{w}|\alpha,\beta)}{p(\textbf{w}|\alpha,\beta)} \propto p(\textbf{z},\textbf{w}|\alpha,\beta)$.

So the key is now to compute $p(\textbf{z},\textbf{w}|\alpha,\beta)$.

According to the generative process, with $\alpha$ and $\beta$ given, $\textbf{z}$ and $\textbf{w}$ are generated through the latent variables $\theta$ and $\phi$, so this conditional probability is obtained by integrating those two latent variables out (i.e. taking the marginal probability):

$p(\textbf{z},\textbf{w}|\alpha,\beta)=\int{\int{p(\textbf{z},\textbf{w},\theta,\phi|\alpha,\beta) \mathrm{d}\theta} \mathrm{d}\phi} = \int{\int{p(\phi|\beta)p(\theta|\alpha)p(\textbf{z}|\theta)p(\textbf{w}|\phi_{z})\mathrm{d}\theta}\mathrm{d}\phi} = \int{p(\textbf{z}|\theta)p(\theta|\alpha)\mathrm{d}\theta}\int{p(\textbf{w}|\phi_{z})p(\phi|\beta)\mathrm{d}\phi}$ (that is, $=p(\textbf{z}|\alpha)\,p(\textbf{w}|\textbf{z},\beta)$).

We derive these two factors separately.

Let $\theta_{mk}$ be the probability that document $m$ generates topic $k$, and $n_{mk}$ the number of times topic $k$ is assigned in document $m$ of the corpus.

First factor:

$$
\begin{align*}
p(\mathbf{z} \mid \alpha) &=
\int p(\mathbf{z} \mid \theta) p(\theta \mid \alpha) \mathrm{d} \theta \\
&=\int \prod_{m=1}^{M} \frac{1}{\mathrm{~B}(\alpha)} \prod_{k=1}^{K} \theta_{m k}^{n_{m k}+\alpha_{k}-1} \mathrm{~d} \theta \\
&=\prod_{m=1}^{M} \frac{1}{\mathrm{~B}(\alpha)} \int \prod_{k=1}^{K} \theta_{m k}^{n_{m k}+\alpha_{k}-1} \mathrm{~d} \theta \\
&=\prod_{m=1}^{M} \frac{\mathrm{B}\left(n_{m}+\alpha\right)}{\mathrm{B}(\alpha)} \\
\end{align*}
$$

(Here $p(\mathbf{z} \mid \theta) = \prod_{m=1}^{M} \prod_{k=1}^{K} \theta_{m k}^{n_{m k}}$, and by assumption $\theta_{m}$ follows a Dirichlet distribution with parameter $\alpha$, so its prior is $p(\theta \mid \alpha)=\frac{\Gamma\left(\sum_{i=1}^{K} \alpha_{i}\right)}{\prod_{i=1}^{K} \Gamma\left(\alpha_{i}\right)} \prod_{i=1}^{K} \theta_{i}^{\alpha_{i}-1}=\frac{1}{\mathrm{~B}(\alpha)} \prod_{i=1}^{K} \theta_{i}^{\alpha_{i}-1}=\operatorname{Dir}(\theta \mid \alpha)$; see [1] for the derivation.)
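
The integration step above is the standard Dirichlet integral: for any vector $x$ with positive components,

$$
\int \prod_{k=1}^{K} \theta_{k}^{x_{k}-1} \mathrm{~d} \theta=\mathrm{B}(x)=\frac{\prod_{k=1}^{K} \Gamma\left(x_{k}\right)}{\Gamma\left(\sum_{k=1}^{K} x_{k}\right)},
$$

where the integral is taken over the probability simplex. Applying it with $x=n_{m}+\alpha$ yields $\mathrm{B}(n_{m}+\alpha)$; the same identity with $x=n_{k}+\beta$ is used for the second factor below.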

Second factor:

$$
\begin{align*}
p(\mathbf{w} \mid \mathbf{z}, \beta) &=\int p(\mathbf{w} \mid \mathbf{z}, \varphi) p(\varphi \mid \beta) \mathrm{d} \varphi \\
&=\int \prod_{k=1}^{K} \frac{1}{\mathrm{~B}(\beta)} \prod_{v=1}^{V} \varphi_{k v}^{n_{k v}+\beta_{v}-1} \mathrm{~d} \varphi \\
&=\prod_{k=1}^{K} \frac{1}{\mathrm{~B}(\beta)} \int \prod_{v=1}^{V} \varphi_{k v}^{n_{k v}+\beta_{v}-1} \mathrm{~d} \varphi \\
&=\prod_{k=1}^{K} \frac{\mathrm{B}\left(n_{k}+\beta\right)}{\mathrm{B}(\beta)}
\end{align*}
$$

where $n_{k}=(n_{k1},n_{k2},\dots,n_{kV})$ and $n_{kv}$ is the number of times word $v$ is assigned to topic $k$ in the corpus. (Here $p(\mathbf{w} \mid \mathbf{z}, \phi) = \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{kv}^{n_{kv}}$; again, by assumption, $\phi_{k}$ follows a Dirichlet distribution with parameter $\beta$, so its prior is $p(\phi_{k} \mid \beta)=\frac{\Gamma\left(\sum_{v=1}^{V} \beta_{v}\right)}{\prod_{v=1}^{V} \Gamma\left(\beta_{v}\right)} \prod_{v=1}^{V} \phi_{kv}^{\beta_{v}-1}=\frac{1}{\mathrm{~B}(\beta)} \prod_{v=1}^{V} \phi_{kv}^{\beta_{v}-1}=\operatorname{Dir}(\phi_{k} \mid \beta)$.)

Putting the two together, $p(\textbf{z},\textbf{w}|\alpha,\beta)=\prod_{m}\frac{B(n_{m}+\alpha)}{B(\alpha)} \prod_{k}\frac{B(n_{k}+\beta)}{B(\beta)}$, and therefore $p(\textbf{z} |\textbf{w},\alpha,\beta) \propto \prod_{m}\frac{B(n_{m}+\alpha)}{B(\alpha)} \prod_{k}\frac{B(n_{k}+\beta)}{B(\beta)}$.

We can now sample $\textbf{z}$ from $p(\textbf{z}|\textbf{w},\alpha,\beta)$ using the expression above. Sampling it directly is possible but extremely inefficient in such a high-dimensional space, so we use Gibbs sampling instead and draw one coordinate at a time. This requires the conditional $p(z_{i}|\textbf{z}_{-i},\textbf{w},\alpha,\beta)$, and clearly:

$$
p(z_{i}|\textbf{z}_{-i},\textbf{w},\alpha,\beta)
=\frac{p(z_{i},\textbf{z}_{-i}|\textbf{w},\alpha,\beta)}{p(\textbf{z}_{-i}|\textbf{w},\alpha,\beta)}
=\frac{p(\textbf{z}|\textbf{w},\alpha,\beta)}{p(\textbf{z}_{-i}|\textbf{w},\alpha,\beta)}
$$

Denote the denominator $p(\textbf{z}_{-i}|\textbf{w},\alpha,\beta)$ by $Z_{i}$; it is the marginal probability with the topic at position $i$ excluded. Note that $Z_{i}$ does not depend on the value of $z_{i}$ (precisely because $z_{i}$ has been excluded), i.e. $Z_{z_{i}=t_{1}}=Z_{z_{i}=t_{2}}$ for $t_{1} \neq t_{2}$. Therefore

$p(z_{i}|\textbf{z}_{-i},\textbf{w},\alpha,\beta)=\frac{p(\textbf{z}|\textbf{w},\alpha,\beta)}{Z_{i}} \propto p(\textbf{z} | \textbf{w},\alpha,\beta) \propto \prod_{m}\frac{B(n_{m}+\alpha)}{B(\alpha)} \prod_{k}\frac{B(n_{k}+\beta)}{B(\beta)}$

That is, $p(z_{i}|\textbf{z}_{-i},\textbf{w},\alpha,\beta) \propto \frac{n_{kv}+\beta_{v}}{\sum_{v=1}^{V}(n_{kv}+\beta_{v})} \frac{n_{mk}+\alpha_{k}}{\sum_{k=1}^{K}(n_{mk}+\alpha_{k})}$, where $k$ is the candidate topic for $z_{i}$, $m$ and $v$ are the document and word at position $i$, and all counts exclude position $i$ (see [^2], p. 35 for this step).
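
To see where this comes from: as $z_{i}$ varies, only the factors for the current document $m$ and for the candidate topic $k$ change. Writing $n^{-i}$ for counts that exclude position $i$ and using $\Gamma(x+1)=x\,\Gamma(x)$,

$$
\begin{align*}
\frac{\mathrm{B}(n_{m}+\alpha)}{\mathrm{B}(n_{m}^{-i}+\alpha)}
&=\frac{\Gamma(n_{mk}^{-i}+\alpha_{k}+1)}{\Gamma(n_{mk}^{-i}+\alpha_{k})} \cdot
\frac{\Gamma\left(\sum_{k'=1}^{K}(n_{mk'}^{-i}+\alpha_{k'})\right)}{\Gamma\left(\sum_{k'=1}^{K}(n_{mk'}^{-i}+\alpha_{k'})+1\right)}
=\frac{n_{mk}^{-i}+\alpha_{k}}{\sum_{k'=1}^{K}(n_{mk'}^{-i}+\alpha_{k'})}, \\
\frac{\mathrm{B}(n_{k}+\beta)}{\mathrm{B}(n_{k}^{-i}+\beta)}
&=\frac{n_{kv}^{-i}+\beta_{v}}{\sum_{v'=1}^{V}(n_{kv'}^{-i}+\beta_{v'})}.
\end{align*}
$$

Since $\mathrm{B}(n_{m}^{-i}+\alpha)$ and $\mathrm{B}(n_{k}^{-i}+\beta)$ do not depend on the value of $z_{i}$, the product of these two ratios is exactly the conditional above.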

At this point, the Gibbs sampling algorithm for LDA practically writes itself:

(The following is adapted from Li Hang, 《统计学习方法》 (Statistical Learning Methods) [^1].)

  • Input: the word sequence of the corpus $\textbf{w}=\{\textbf{w}_{1},\textbf{w}_{2},\dots,\textbf{w}_{M}\}$.

  • Output: the topic sequence of the corpus $\textbf{z}=\{\textbf{z}_{1},\textbf{z}_{2},\dots,\textbf{z}_{M}\}$, together with estimates of the model parameters $\theta$ and $\phi$.

  • Hyperparameters: the number of topics $K$, and the Dirichlet parameters $\alpha$ and $\beta$.

  • Initialization:

    • the "document-topic" count matrix $C_{MK}$, whose element $C_{mk}$ is the number of times topic $k$ occurs in document $m$
    • the "topic-word" count matrix $P_{KV}$, whose element $P_{kv}$ is the number of times word $v$ occurs in topic $k$
    • the "document-topic sum" count vector $(C_{1},C_{2},\dots,C_{M})$, whose element $C_{m}$ is the total number of topic assignments (i.e. tokens) in document $m$
    • the "topic-word sum" count vector $(P_{1},P_{2},\dots,P_{K})$, whose element $P_{k}$ is the total number of tokens assigned to topic $k$

    All of the above are initialized to 0. Then:

    For every document $\textbf{w}_{m}$:

    For every word $w_{mn}$ in that document ($n=1,2,\dots,N_{m}$):

    Sample an initial topic uniformly at random: $z_{mn}\sim \mathrm{Mult}(\frac{1}{K},\dots,\frac{1}{K})$

    $C_{mk} += 1$, $C_{m}+=1$, $P_{kv}+=1$, $P_{k}+=1$, where $k=z_{mn}$ and $v$ is the vocabulary index of $w_{mn}$

  • Learning:

    For every document $\textbf{w}_{m}$:

    For every word $w_{mn}$ in that document:

    Let $v$ be the vocabulary index of the current word and $k$ the index of its currently assigned topic $z_{mn}$

    $C_{mk} -= 1$, $C_{m}-=1$, $P_{kv}-=1$, $P_{k}-=1$

    Sample a new topic from the conditional distribution:

    $p(z_{i}|\textbf{z}_{-i},\textbf{w},\alpha,\beta) \propto \frac{n_{kv}+\beta_{v}}{\sum_{v=1}^{V}(n_{kv}+\beta_{v})} \frac{n_{mk}+\alpha_{k}}{\sum_{k=1}^{K}(n_{mk}+\alpha_{k})}$

    Let the sampled topic be $k^{\prime}$ and set $z_{mn}=k^{\prime}$

    $C_{mk^{\prime}} += 1$, $C_{m}+=1$, $P_{k^{\prime}v}+=1$, $P_{k^{\prime}}+=1$

    Repeat the steps above until the burn-in period has passed and the chain has converged. This yields a number of samples of $\textbf{z}$, from which the parameters $\theta$ and $\phi$ are estimated; note that the estimates are the expectations of the corresponding posterior distributions.

The final estimates are $\theta_{mk}=\frac{n_{mk}+\alpha_{k}}{\sum_{k=1}^{K}(n_{mk}+\alpha_{k})}$ and $\phi_{kv}=\frac{n_{kv}+\beta_{v}}{\sum_{v=1}^{V}(n_{kv}+\beta_{v})}$.

This completes the algorithm.

When a new document arrives, keep $\phi$ fixed and update $\theta$ as above to obtain the topic distribution of the document.
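
For concreteness, here is a minimal sketch of this fold-in step, assuming a trained topic-word matrix phi of shape [K,V], a hyperparameter vector alpha, and a new document already mapped to vocabulary indices. The function name fold_in and its arguments are illustrative; the implementation later in this post instead folds new documents in via add_docs.

import numpy as np

def fold_in(doc_v, phi, alpha, n_iter=50):
    # doc_v: list of vocabulary indices of the new document
    # phi:   [K,V] trained topic-word distribution, kept fixed here
    # alpha: [K] Dirichlet hyperparameter
    K = phi.shape[0]
    z = np.random.choice(K, len(doc_v))              # random initial assignments
    n_k = np.bincount(z, minlength=K).astype(float)  # per-topic counts of this doc
    for _ in range(n_iter):
        for i, v in enumerate(doc_v):
            n_k[z[i]] -= 1                           # remove the current assignment
            p = phi[:, v] * (n_k + alpha)            # p(z_i=k) proportional to phi_kv * (n_mk + alpha_k)
            p /= p.sum()
            z[i] = np.random.choice(K, p=p)          # resample the topic of word i
            n_k[z[i]] += 1
    return (n_k + alpha) / (n_k + alpha).sum()       # the document's theta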

To estimate the parameter $\theta$, the definition of LDA gives

$p\left(\theta_{m} \mid \mathbf{z}_{m}, \alpha\right)=\frac{1}{Z_{\theta_{m}}} \prod_{n=1}^{N_{m}} p\left(z_{m n} \mid \theta_{m}\right) p\left(\theta_{m} \mid \alpha\right)=\operatorname{Dir}\left(\theta_{m} \mid n_{m}+\alpha\right)$, and by the properties of the Dirichlet distribution (see [2]) we obtain

$\theta_{mk}=\frac{n_{mk}+\alpha_{k}}{\sum_{k=1}^{K}(n_{mk}+\alpha_{k})}$,

To estimate the parameter $\phi$, we similarly have

$p\left(\varphi_{k} \mid \mathbf{w}, \mathbf{z}, \beta\right)=\frac{1}{Z_{\varphi_{k}}} \prod_{i=1}^{I} p\left(w_{i} \mid \varphi_{k}\right) p\left(\varphi_{k} \mid \beta\right)=\operatorname{Dir}\left(\varphi_{k} \mid n_{k}+\beta\right)$, where the product runs over the tokens assigned to topic $k$; again by the properties of the Dirichlet distribution we obtain

$\phi_{kv}=\frac{n_{kv}+\beta_{v}}{\sum_{v=1}^{V}(n_{kv}+\beta_{v})}$
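
Both estimators are simply prior-smoothed, row-normalized counts. A tiny numpy sketch with made-up counts (all numbers here are purely illustrative):

import numpy as np

C = np.array([[3., 1.],
              [0., 4.]])       # n_mk: document-topic counts, M=2 docs, K=2 topics
P = np.array([[2., 1., 1.],
              [0., 3., 2.]])   # n_kv: topic-word counts, V=3 words
alpha = np.full(2, 0.5)        # symmetric Dirichlet hyperparameters
beta = np.full(3, 0.1)

theta = (C + alpha) / (C + alpha).sum(axis=1, keepdims=True)  # theta_mk, each row sums to 1
phi = (P + beta) / (P + beta).sum(axis=1, keepdims=True)      # phi_kv, each row sums to 1
print(theta)
print(phi)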

With the algorithm above, writing the program is straightforward. Note that the implementation in this post is written to be easy to follow and makes no serious attempt at optimization. We use the TNEWS dataset as the example data.

First we tokenize the text, using HanLP as the segmenter, and remove special symbols and stop words; the stop-word list is hit_stopwords.

import re
import json
from pyhanlp import *

# Load the stop-word list (one word per line)
stopwords = set([line.strip('\n') for line in open('cn_stopwords.txt', 'r', encoding='utf-8').readlines()])

def tokenize(sent):
    # Segment with HanLP, then drop punctuation/digits, empty strings and stop words
    pat = re.compile(r'[0-9!"#$%&\'()*+,-./:;<=>?@—,。:★、¥…【】()《》?“”‘’!\[\\\]^_`{|}~\u3000]+')
    return [t.word for t in HanLP.segment(sent) if pat.search(t.word) is None and t.word.strip() != '' and not (t.word in stopwords)]

def load_docs(filename, n_samples=-1):
    # Read one JSON object per line and tokenize its 'sentence' field
    tokenized_docs = []
    with open(filename, 'r', encoding='utf-8') as rfp:
        lines = [line.strip('\n') for line in rfp.readlines()]
        lines = lines if n_samples == -1 else lines[:n_samples]
        for i, line in enumerate(lines):
            sent = json.loads(line)['sentence']
            tokenized_docs.append(tokenize(sent))
            if i < 100 and i % 10 == 0:
                print(tokenize(sent))
    return tokenized_docs

trn_tokenized_docs = load_docs('tnews_public/train.json', n_samples=1000)
dev_tokenized_docs = load_docs('tnews_public/dev.json', n_samples=1000)
tst_tokenized_docs = load_docs('tnews_public/test.json', n_samples=1000)  # TNEWS test split, used for inference below


'''
❯ python .\LDA.py
['上课时', '学生', '手机', '响', '个', '不停', '老师', '一怒之下', '把', '手机', '摔', '了', '家长', '拿', '发票', '让', '老师', '赔', '大家', '怎么', '
看待', '这种', '事']
['凌云', '研发', '的', '国产', '两', '轮', '电动车', '怎么样', '有', '什么', '惊喜']
['取名', '困难', '症', '患者', '皇马', '的', '贝尔', '第', '一个', '受害者', '就是', '他', '的', '儿子']
['葫芦', '都', '能', '做成', '什么', '乐器']
['中级会计', '考试', '每日', '一练']
['复仇者', '联盟', '中', '奇异', '博士', '为什么', '不', '用', '时间', '宝石', '跟', '灭', '霸', '谈判']
['拥抱', '编辑', '时代', '内容', '升级', '为', '产品', '海内外', '媒体', '如何', '规划', '下', '个', '十', '年']
['地球', '这', '是', '怎么', '了', '美国', '夏威夷', '群岛', '突发', '级', '地震', '游客', '紧急', '疏散']
['定安', '计划', '用', '三', '年', '时间', '修复', '全县', '处', '不', '可移动', '文物']
['军工', '已', '动真格', '中航', '科工', '占据', '着', '舞台', '正', '中央']
'''

Next we implement the LDA model. We would like it to expose the following interface:

lda_model = LDA(docs=tokenized_docs, K=20)

lda_model.train()

lda_model.add_docs(new_tokenized_docs)

tp_wd_dist = lda_model.topic_word_dist()
# returns a [K,V] matrix

lda_model.show_topic_words()

doc_tp_dist = lda_model.get_corpus_dist()
# returns an [M,K] matrix, where M = number of all docs fed in so far

doc_tp_dist = lda_model.batch_inference(new_tokenized_docs)
# returns an [M',K] matrix, where M' = number of docs in new_tokenized_docs

tp_dist = lda_model.inference(new_tokenized_doc)
# returns a length-K vector

First, the initialization. This part is straightforward; the purpose of each key step is noted in the comments:

import numpy as np

class LDA:
    def __init__(self, docs, K):
        # params: docs: tokenized sentence list, e.g. [['hello','world'],['nice','job'],...]
        # params: K: number of topics
        self.idx2token = []
        self.token2idx = {}
        self.M = len(docs)
        self.K = K
        # Build vocabulary
        for doc in docs:
            for wd in doc:
                if not (wd in self.idx2token):
                    self.idx2token.append(wd)
        for idx, wd in enumerate(self.idx2token):
            self.token2idx[wd] = idx
        self.V = len(self.idx2token)
        print(f'Vocabulary length: {self.V}')

        # Initialize count matrices and vectors
        self.beta = np.ones(self.V) / self.V    # symmetric Dirichlet prior over words
        self.alpha = np.ones(self.K) / self.K   # symmetric Dirichlet prior over topics
        self.matC = np.zeros((self.M, self.K))  # C_mk: document-topic counts
        self.matP = np.zeros((self.K, self.V))  # P_kv: topic-word counts
        self.vecC = np.zeros(self.M)            # C_m: tokens per document
        self.vecP = np.zeros(self.K)            # P_k: tokens per topic
        self.zs = []                            # topic assignment of every token
        self.k_ids = list(range(self.K))        # ids of topics
        self.v_docs = []                        # docs mapped to their vocabulary indices

        for m, doc in enumerate(docs):
            # sample initial topic ids for the doc;
            # np.random.choice samples with replacement and can produce arrays like [2,1,3,3]
            _zs = np.random.choice(self.k_ids, len(doc))
            self.zs.append(_zs)
            _idx = [self.token2idx[tk] for tk in doc]
            self.v_docs.append(_idx)
            for z, v in zip(_zs, _idx):
                self.matC[m, z] += 1
                self.matP[z, v] += 1
                self.vecP[z] += 1
            self.vecC[m] += len(doc)

Next, the training part of LDA:

def train(self, n_iter=100):
    for it in range(n_iter):
        print(f'Iteration {it} ...')
        for m, vdoc in enumerate(self.v_docs):
            for i, v in enumerate(vdoc):
                # remove the current assignment of word i from all counts
                z = self.zs[m][i]
                self.matC[m][z] -= 1
                self.vecC[m] -= 1
                self.matP[z][v] -= 1
                self.vecP[z] -= 1
                # full conditional p(z_i=k | z_{-i}, w), up to a constant
                fst_itm = lambda k: (self.matP[k][v]+self.beta[v])/(self.matP[k]+self.beta).sum()
                scd_itm = lambda k: (self.matC[m][k]+self.alpha[k])/(self.matC[m]+self.alpha).sum()
                _probs = np.array([fst_itm(k)*scd_itm(k) for k in self.k_ids])
                probs = _probs / _probs.sum()
                # resample a topic and add the new assignment back to the counts
                zp = np.random.choice(self.k_ids, p=probs)
                self.zs[m][i] = zp
                self.matC[m][zp] += 1
                self.vecC[m] += 1
                self.matP[zp][v] += 1
                self.vecP[zp] += 1
    # point estimates of theta and phi from the smoothed counts
    _theta = self.matC + self.alpha
    self.theta = _theta / _theta.sum(axis=1, keepdims=True)
    _phi = self.matP + self.beta
    self.phi = _phi / _phi.sum(axis=1, keepdims=True)

Suppose the model has already been trained and new training data arrives later. We add an add_docs interface so the model can keep training on top of what it has learned. There are two possible designs: one keeps the old vocabulary and only updates the document set, which is simpler and computationally cheaper; the other also extends the vocabulary, which costs more computation but reflects the new data more faithfully. The implementation below takes the second option (a sketch of the first is given after it).

def add_docs(self, docs, n_iter=100):
    old_v = self.V
    # Update vocabulary with unseen tokens
    for doc in docs:
        for tk in doc:
            if not (tk in self.idx2token):
                self.idx2token.append(tk)
                self.token2idx[tk] = len(self.idx2token)-1
    self.V = len(self.idx2token)
    self.beta = np.ones(self.V) / self.V # a convenient choice; other schemes might be better
    # Update idxes of docs: v_docs
    nv_docs = [[self.token2idx[tk] for tk in doc] for doc in docs]
    self.v_docs += nv_docs
    # Update shapes of matC, matP and the topic list zs
    suf_matC = np.zeros((len(docs), self.K))
    suf_matP = np.zeros((self.K, self.V-old_v))
    self.matC = np.concatenate([self.matC, suf_matC], axis=0) # [M+M',K]
    self.matP = np.concatenate([self.matP, suf_matP], axis=1) # [K,V+V']
    vecC = []
    for m, doc in enumerate(nv_docs):
        _zs = np.random.choice(self.k_ids, len(doc))
        self.zs.append(_zs)
        for v, z in zip(doc, _zs):
            self.matC[m+self.M][z] += 1
            self.matP[z][v] += 1
            self.vecP[z] += 1
        vecC.append(len(doc))
    self.vecC = np.append(self.vecC, vecC)
    self.M = len(self.v_docs)

    # train for some more iterations
    self.train(n_iter=n_iter)
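
For comparison, here is a minimal sketch of the first option mentioned above: keep the old vocabulary and simply skip tokens that are not in it. The method name add_docs_fixed_vocab is illustrative and not part of the class above; it mirrors add_docs but leaves the vocabulary, beta and matP shapes untouched:

def add_docs_fixed_vocab(self, docs, n_iter=100):
    # Map new docs to the existing vocabulary, dropping out-of-vocabulary tokens
    nv_docs = [[self.token2idx[tk] for tk in doc if tk in self.token2idx] for doc in docs]
    self.v_docs += nv_docs
    self.matC = np.concatenate([self.matC, np.zeros((len(docs), self.K))], axis=0)  # [M+M',K]
    vecC = []
    for m, doc in enumerate(nv_docs):
        _zs = np.random.choice(self.k_ids, len(doc))
        self.zs.append(_zs)
        for v, z in zip(doc, _zs):
            self.matC[m + self.M][z] += 1
            self.matP[z][v] += 1
            self.vecP[z] += 1
        vecC.append(len(doc))
    self.vecC = np.append(self.vecC, vecC)
    self.M = len(self.v_docs)
    self.train(n_iter=n_iter)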

For the inference interface, note that inference here essentially amounts to adding new data and iterating until convergence, so we can simply reuse add_docs. The remaining interfaces are auxiliary and easy to implement, so they are listed together:

def batch_inference(self, docs, n_iter=100):
    n_docs = len(docs)
    self.add_docs(docs, n_iter=n_iter)
    return self.get_corpus_dist()[-n_docs:]

def inference(self, doc, n_iter=100):
    bth = [doc]
    dist = self.batch_inference(bth, n_iter=n_iter)
    return dist.squeeze()

def get_corpus_dist(self):
    return self.theta

def topic_word_dist(self):
    return self.phi

def show_topic_words(self, n=20, show_weight=True):
    sorted_wghts = np.sort(-1 * self.phi, axis=1)[:, :n] * (-1)
    top_idx = (-1 * self.phi).argsort(axis=1)[:, :n]
    suf = lambda wght: f'*{wght:.07f}' if show_weight else ''
    topic_words = [[f"{self.idx2token[idx]}{suf(wght)}" for idx, wght in zip(idxes, wghts)] for idxes, wghts in zip(top_idx, sorted_wghts)]
    for tp_wd in topic_words:
        print(tp_wd)
    return topic_words

Let us apply the model to TNEWS with the number of iterations set to 100, and inspect the inferred document-topic distributions $\theta$ as well as the top-20 words of each topic.

if __name__ == '__main__':
    N_Iter = 100
    print('Train on trn data ...')
    lda_model = LDA(docs=trn_tokenized_docs, K=20)
    lda_model.train(n_iter=N_Iter)
    tp_wd_dist = lda_model.topic_word_dist() # returns [K,V] matrix
    print('tp_wd_dist:', tp_wd_dist)
    lda_model.show_topic_words(n=20)
    doc_tp_dist = lda_model.get_corpus_dist() # returns [M,K] matrix, where M = number of all docs fed in so far
    print('doc_tp_dist:', doc_tp_dist)

    print('='*40)
    print('Add Dev data')
    lda_model.add_docs(dev_tokenized_docs, n_iter=N_Iter)
    tp_wd_dist = lda_model.topic_word_dist() # returns [K,V] matrix
    print('tp_wd_dist:', tp_wd_dist)
    lda_model.show_topic_words(n=20)
    doc_tp_dist = lda_model.get_corpus_dist() # returns [M,K] matrix, where M = number of all docs fed in so far
    print('doc_tp_dist:', doc_tp_dist)


    print('='*40)
    print('Inference on Test data')
    doc_tp_dist = lda_model.batch_inference(tst_tokenized_docs, n_iter=N_Iter) # returns [M',K] matrix, where M' = number of test docs
    print('doc_tp_dist:', doc_tp_dist)
    new_tokenized_doc = tst_tokenized_docs[17]
    tp_dist = lda_model.inference(new_tokenized_doc, n_iter=N_Iter) # returns a length-K vector
    print('tp_dist:', tp_dist)


'''
>python LDA.py

Train on trn data ...
Vocabulary length: 4771

tp_wd_dist: [[6.29428422e-07 6.29428422e-07 6.29428422e-07 ... 6.29428422e-07
6.29428422e-07 6.29428422e-07]
[5.68020771e-07 5.68020771e-07 5.68020771e-07 ... 2.71059512e-03
5.68020771e-07 5.68020771e-07]
[5.64958665e-07 5.64958665e-07 4.58226674e-02 ... 5.64958665e-07
5.64958665e-07 5.64958665e-07]
...
[4.93175682e-07 4.93175682e-07 4.93175682e-07 ... 4.93175682e-07
4.93175682e-07 4.93175682e-07]
[5.22692431e-07 5.22692431e-07 5.22692431e-07 ... 5.22692431e-07
5.22692431e-07 5.22692431e-07]
[4.69954405e-07 4.69954405e-07 4.69954405e-07 ... 4.69954405e-07
4.69954405e-07 4.69954405e-07]]

['年*0.0360367', '美国*0.0240247', '创新*0.0150156', '房产*0.0150156', '影响*0.0150156', '能力*0.0120126', '文化*0.0120126', '一个*0.0120126', '评价*0.0120126', '飞*0.0090096', '国*0.0090096', '娶*0.0090096', 'Q*0.0090096', '辆*0.0090096', '计划*0.0090096', '世界杯*0.0090096', '基地*0.0090096', '装备*0.0090096', '到来*0.0090096', '人性*0.0090096']
['四*0.0298109', '媒体*0.0162607', '建设*0.0162607', '楼市*0.0135507', '时间*0.0135507', '小镇*0.0135507', '西方*0.0108407', '成功*0.0108407', '时刻*0.0108407', '比特币*0.0108407', '爱*0.0108407', '首届*0.0108407', '赚钱*0.0108407', '农民*0.0108407', '没*0.0108407', '大战*0.0081306', '机遇*0.0081306', '沪
*0.0081306', '信*0.0081306', '专家*0.0081306']
['中*0.0512135', '手机*0.0458227', '太*0.0269547', '詹姆斯*0.0188685', '送*0.0161731', '科技*0.0134777', '里*0.0134777', '问题*0.0134777', '已经*0.0134777', '不能*0.0134777', '家长*0.0107822', '万*0.0107822', '万元*0.0107822', '路*0.0107822', '学会*0.0107822', '项*0.0107822', '农业*0.0080868', '制造*0.0080868', '幸福*0.0080868', '思考*0.0080868']
['一个*0.0488894', '中国*0.0444449', '三*0.0288894', '旅游*0.0266671', '新*0.0244449', '更*0.0200005', '没有*0.0200005', '即将*0.0177782', '经济*0.0133338', '重要*0.0111116', '产品*0.0111116', '喜欢*0.0111116', '老*0.0111116', '实力*0.0111116', '司机*0.0111116', '曝光*0.0111116', '请*0.0111116', '产业*0.0088894', '品牌*0.0088894', '难道*0.0088894']
['两*0.0474045', '做*0.0316032', '款*0.0248312', '王者*0.0203165', '俄*0.0180592', '开*0.0158018', '地球*0.0135445', '推荐*0.0135445', '孩子*0.0135445',
'成*0.0112872', '投资*0.0112872', '值得*0.0112872', '要求*0.0112872', '谢娜*0.0090298', '区别*0.0090298', '首发*0.0090298', '重*0.0090298', '型*0.0090298', '见过*0.0090298', '先*0.0090298']
['现在*0.0473543', '历史*0.0222847', '房价*0.0167137', '儿子*0.0167137', '进*0.0167137', '举行*0.0167137', '元*0.0139282', '发*0.0111426', '梦*0.0111426', '航母*0.0111426', '联想*0.0083571', '逆袭*0.0083571', '周*0.0083571', '旗*0.0083571', '行情*0.0083571', '红毯*0.0083571', '再次*0.0083571', '收入*0.0083571', '现场*0.0083571', '风险*0.0083571']
['五*0.0294845', '到底*0.0196565', '网友*0.0171995', '核*0.0171995', '协议*0.0147425', '小米*0.0147425', '联*0.0147425', '地方*0.0147425', '微信*0.0122855', '亿*0.0122855', '伊*0.0122855', '|*0.0122855', '融资*0.0122855', '出现*0.0098285', '爆*0.0098285', '好玩*0.0098285', '阿里巴巴*0.0098285', '改变*0.0098285', '教*0.0098285', '申请*0.0098285']
['」*0.0329119', '第一*0.0303803', '特朗普*0.0303803', '岁*0.0253170', '成功*0.0151904', '复*0.0151904', '上市*0.0126588', '油*0.0126588', '跑*0.0126588', '称*0.0126588', '印度*0.0101271', '智能*0.0101271', '电*0.0101271', '接*0.0101271', '参加*0.0101271', '进口*0.0101271', '电子*0.0101271', '上映*0.0075955', '获得*0.0075955', '号*0.0075955']
['会*0.0410262', '说*0.0333339', '使用*0.0179493', '位*0.0179493', '买房*0.0128211', '月*0.0128211', '终于*0.0128211', '便宜*0.0128211', '价格*0.0128211', '卡*0.0128211', '低*0.0102569', '酒*0.0102569', '主播*0.0102569', '苹果*0.0102569', '腿*0.0076928', '游*0.0076928', '高校*0.0076928', '竟然*0.0076928', '晋级*0.0076928', '分*0.0076928']
['亿*0.0453339', '美*0.0373339', '点*0.0320006', '没*0.0186672', '事件*0.0160006', '事*0.0160006', '版*0.0133339', '发明*0.0133339', '走*0.0133339', '猪*0.0133339', '恒大*0.0106672', '手游*0.0106672', '怒*0.0106672', '角色*0.0106672', '取消*0.0106672', '无数*0.0106672', '导弹*0.0106672', '响*0.0106672', '穿*0.0080006', '海南*0.0080006']
['中国*0.0316032', '看待*0.0316032', '俄罗斯*0.0293458', '学生*0.0225738', '发展*0.0203165', '级*0.0203165', '里*0.0203165', '老师*0.0180592', '知道*0.0158018', '普京*0.0158018', '种*0.0158018', '是否*0.0158018', '生活*0.0158018', '轮*0.0135445', '公司*0.0135445', '古代*0.0135445', '这种*0.0135445', '片*0.0112872', '最好*0.0112872', '故事*0.0112872']
['荣耀*0.0243249', '比较*0.0243249', '教育*0.0216222', '首*0.0189195', '应该*0.0162168', '全球*0.0162168', '国产*0.0162168', 'G*0.0135141', '数据*0.0135141', '行业*0.0135141', '半*0.0108114', '一战*0.0108114', '反*0.0108114', '南昌*0.0108114', '偶遇*0.0081087', '公布*0.0081087', '大爷*0.0081087', '听说*0.0081087', '发声*0.0081087', '创*0.0081087']
['年*0.0725005', '十*0.0475005', '日本*0.0400005', '汽车*0.0150005', '项目*0.0150005', '银行*0.0125005', '选择*0.0125005', '实现*0.0125005', '晒*0.0125005', '地震*0.0125005', '女友*0.0100005', '国内*0.0100005', '山*0.0100005', '德国*0.0100005', '结婚*0.0100005', '微博*0.0100005', '张*0.0100005', '富士康*0.0075005', '史上*0.0075005', '互联网*0.0075005']
['会*0.0232564', 'A*0.0206724', '比赛*0.0155044', '原来*0.0129204', '腾讯*0.0129204', '发生*0.0129204', '济南*0.0129204', 'NBA*0.0129204', '活动*0.0129204', '方面*0.0103365', '直接*0.0103365', '水平*0.0103365', '没有*0.0103365', '·*0.0103365', '仅*0.0103365', '功能*0.0103365', '尴尬*0.0077525', '陕西*0.0077525', '令狐冲*0.0077525', '棋牌*0.0077525']
['上联*0.0362817', '下联*0.0362817', '城市*0.0294789', '三*0.0181411', '游客*0.0158735', '公里*0.0158735', '深圳*0.0158735', '勇士*0.0136059', '千*0.0136059', '火箭*0.0136059', '座*0.0136059', '上海*0.0113383', '泰山*0.0113383', '黄山*0.0113383', '造*0.0113383', '赵本山*0.0113383', '分钟*0.0113383', '正确
*0.0090708', '内容*0.0090708', '有人*0.0090708']
['美国*0.0388606', '伊朗*0.0310886', '车*0.0284980', '真的*0.0207259', '超*0.0155446', '名*0.0129539', '叙利亚*0.0129539', '小时*0.0129539', '最多*0.0129539', '空袭*0.0129539', '面临*0.0129539', '适合*0.0129539', '外*0.0103632', '升级*0.0103632', '进入*0.0103632', '鸡*0.0077726', '退出*0.0077726', '季*0.0077726', '大型*0.0077726', '计算机*0.0077726']
['买*0.0467039', '万*0.0439566', '国家*0.0247259', '中*0.0219786', '米*0.0192313', '吃*0.0164841', '出*0.0164841', '行*0.0137368', '知道*0.0137368', '最
后*0.0137368', '启动*0.0137368', '现身*0.0109896', '河南*0.0109896', '战争*0.0109896', '作品*0.0109896', '时*0.0109896', '抵达*0.0109896', '文明*0.0109896', '工作*0.0082423', '每月*0.0082423']
['中国*0.0635299', '世界*0.0447064', '高*0.0211770', '链*0.0188240', '成为*0.0188240', '原因*0.0188240', '前*0.0164711', '区块*0.0164711', '美元*0.0164711', '技术*0.0141181', '曾经*0.0141181', '今年*0.0117652', '平台*0.0117652', '认为*0.0094123', '太空*0.0094123', '唯一*0.0094123', '无法*0.0094123', '北京
*0.0094123', '解说*0.0094123', '排名*0.0094123']
['日*0.0374070', '游戏*0.0374070', '「*0.0349132', '钱*0.0224444', '以色列*0.0174569', '农村*0.0174569', '省*0.0149631', '建*0.0124694', '天*0.0124694',
'东西*0.0124694', '市场*0.0124694', '考*0.0099756', '泰国*0.0099756', '创业*0.0099756', '快*0.0099756', '车*0.0099756', '处理*0.0074818', '机器人*0.0074818', '届*0.0074818', '灯*0.0074818']
['次*0.0336328', '会*0.0291485', '联盟*0.0246641', '英雄*0.0246641', '玩*0.0201798', '需要*0.0201798', '股*0.0134534', '复仇者*0.0134534', '冠军*0.0134534', '战场*0.0112112', '回应*0.0112112', '很多*0.0089691', '获*0.0089691', '赛*0.0089691', '带*0.0089691', '看到*0.0089691', '贷款*0.0089691', '虎牙*0.0089691', '怕*0.0089691', '央视*0.0089691']

doc_tp_dist: [[0.00294118 0.00294118 0.47352941 ... 0.00294118 0.00294118 0.00294118]
[0.065625 0.003125 0.003125 ... 0.065625 0.003125 0.003125 ]
[0.00454545 0.00454545 0.00454545 ... 0.00454545 0.36818182 0.00454545]
...
[0.00714286 0.72142857 0.00714286 ... 0.00714286 0.00714286 0.00714286]
[0.00833333 0.00833333 0.00833333 ... 0.00833333 0.00833333 0.00833333]
[0.00625 0.00625 0.00625 ... 0.13125 0.00625 0.00625 ]]
'''

Looking at the results informally, some topics have top words that are clearly related, while others are harder to interpret. Quantitatively evaluating topic models is a large subject in its own right and is not covered here. If we switch the stop-word list, for example to baidu_stopwords, the results change considerably, which shows that the behaviour of a topic model depends heavily on the stop-word list; it should be chosen with the domain of the data in mind.

References