【NLP】Measuring Text Similarity with BERT
Compiled by | VK
Source | Towards Data Science

This article looks at sequence similarity with BERT.
A large part of NLP relies on similarity in high-dimensional spaces. Typically, an NLP solution takes some text and processes it to create a large vector/array representing that text.
This is high-dimensional magic.
Sentence similarity is one of the clearest examples of how powerful that high-dimensional magic can be.
The logic is this:
1. Take a sentence and convert it into a vector.
2. Take many other sentences and convert them into vectors.
3. Find the Euclidean distance or cosine similarity between them (a minimal sketch of this step follows below).
We now have a measure of semantic similarity between sentences!
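As a minimal sketch of that last step (the vectors here are made up; turning real sentences into vectors is what the rest of this article covers), cosine similarity is just the dot product of two vectors divided by the product of their norms:
```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical "sentence vectors" (real BERT vectors would have 768 dimensions).
vec_a = np.array([0.1, 0.3, 0.7])
vec_b = np.array([0.2, 0.25, 0.8])
print(cosine_sim(vec_a, vec_b))  # ~0.99 -> the two vectors point in very similar directions
```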
Of course, we want to understand what is happening in more detail, and implement it in Python! So, let's get started.
BERT
BERT, as we have already mentioned, is the MVP of NLP. A big part of this is due to BERT's ability to embed the meaning of words into densely packed vectors.
We call them dense vectors because every value within the vector has a value, and there is a reason for it being that value - in contrast to sparse vectors, such as one-hot encoded vectors, where the majority of values are 0.
BERT is great at creating these dense vectors, and each encoder layer outputs a set of them.

For BERT-base, this will be a vector containing 768 dimensions. Those 768 values hold our numerical representation of a single token, which we can use as a contextual word embedding.
We can convert these tensors into a semantic representation of the input sequence. We can then take our similarity metrics and calculate the similarity between different sequences.
The simplest and most commonly extracted tensor is the last hidden state.
Of course, this is a fairly large tensor, at 512x768; since we have 512 tokens, we need a single vector to apply our similarity measures to.
To do that, we need to convert the last hidden state tensor into a 768-dimensional vector.
Creating the Vectors
為了把最后一個隱藏態(tài)張量轉(zhuǎn)換成向量,我們使用了平均池運(yùn)算。
這512個token中的每一個都有各自的768個值。這個池操作將取所有token嵌入的平均值,并將它們壓縮到一個768向量空間中,從而創(chuàng)建一個“句子向量”。
我們不需要考慮填充token(我們不應(yīng)該包括它)。
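To make the masked mean concrete before the real code, here is a toy example with made-up shapes and values (the real tensors later are [4, 128, 768]); only the non-padding positions contribute to the average:
```python
import torch

# Toy "last hidden state": 1 sequence, 4 token positions, 3 dimensions
# (a real BERT-base tensor would be [batch, tokens, 768]).
hidden = torch.tensor([[[1., 2., 3.],
                        [3., 2., 1.],
                        [0., 0., 0.],    # padding position
                        [0., 0., 0.]]])  # padding position
# Attention mask: 1 for real tokens, 0 for padding.
mask = torch.tensor([[1., 1., 0., 0.]])

# Expand the mask to the embedding dimension, zero out padding, and
# divide by the number of real tokens rather than the full sequence length.
expanded = mask.unsqueeze(-1).expand(hidden.size())
mean_pooled = (hidden * expanded).sum(dim=1) / expanded.sum(dim=1).clamp(min=1e-9)
print(mean_pooled)  # tensor([[2., 2., 2.]]) - the mean over the two real tokens only
```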
The Code
That is the theory and logic - but how do we apply it in practice?
We will outline two approaches: an easy one and a slightly more involved one.
Easy - Sentence-Transformers
The easiest way to implement everything we have just covered is through the Sentence-Transformers library, which wraps most of this process into a few lines of code.
First, we install sentence-transformers with pip install sentence-transformers. This library uses HuggingFace Transformers behind the scenes, so we can find the sentence-transformers models here: https://huggingface.co/sentence-transformers
We will use the bert-base-nli-mean-tokens model, which implements the same logic we have discussed so far.
(It also uses 128 input tokens, rather than 512.)
Let's create some sentences, initialize our model, and encode the sentences:
Write a few sentences to encode (sentences 0 and 2 are both similar):
```python
sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]
```
Initialize our model:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
```
Encode the sentences:
```python
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape
```
```
(4, 768)
```
Great, we now have four sentence embeddings, each containing 768 dimensions.
Now what we do is take those embeddings and find the cosine similarity between them. So for sentence 0:
Three years later, the coffin was still full of Jello.
We can find the most similar sentences with:
```python
from sklearn.metrics.pairwise import cosine_similarity
```
Let's calculate the cosine similarity for sentence 0:
```python
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)
```
```
array([[0.33088642, 0.7218851 , 0.55473834]], dtype=float32)
```
These similarities translate to:
| Index | Sentence | Similarity |
|---|---|---|
| 1 | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309 |
| 2 | "The person box was packed with jelly many dozens of months later." | 0.7219 |
| 3 | "He found a leprechaun in his walnut shell." | 0.5547 |
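As a small extension (this is not in the original notebooks, just a convenience worth knowing), cosine_similarity also accepts a single matrix and returns every pairwise score at once, which saves comparing sentences one at a time:
```python
from sklearn.metrics.pairwise import cosine_similarity

# Pass the whole (4, 768) embedding matrix to get a (4, 4) pairwise similarity matrix.
sim_matrix = cosine_similarity(sentence_embeddings)
# sim_matrix[i, j] is the cosine similarity between sentences i and j;
# the diagonal is 1.0 because each sentence is compared with itself.
print(sim_matrix.round(4))
```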
Involved - Transformers and PyTorch
Before we get into the second approach, it is worth noting that it does the same thing as the first, only with a bit more work.
With this approach, we need to create the sentence embeddings ourselves. To do that, we perform the mean pooling operation.
https://youtu.be/jVPd7lEvjtg
Also, before the mean pooling operation, we need to create last_hidden_state, which we do as follows:
```python
from transformers import AutoTokenizer, AutoModel
import torch
```
First we initialize our model and tokenizer:
```python
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
```
Then we tokenize the sentences just as before:
```python
sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]

# Initialize a dictionary to store the tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # Encode each sentence and append to the dictionary
    new_tokens = tokenizer.encode_plus(sentence, max_length=128,
                                       truncation=True, padding='max_length',
                                       return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# Reformat the list of tensors into a single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])
```
We process these tokens through our model:
```python
outputs = model(**tokens)
outputs.keys()
```
```
odict_keys(['last_hidden_state', 'pooler_output'])
```
The dense vector representations of our text are contained within the outputs 'last_hidden_state' tensor, which we access like so:
```python
embeddings = outputs.last_hidden_state
embeddings
```
```
tensor([[[-0.0692, 0.6230, 0.0354, ..., 0.8033, 1.6314, 0.3281],
[ 0.0367, 0.6842, 0.1946, ..., 0.0848, 1.4747, -0.3008],
[-0.0121, 0.6543, -0.0727, ..., -0.0326, 1.7717, -0.6812],
...,
[ 0.1953, 1.1085, 0.3390, ..., 1.2826, 1.0114, -0.0728],
[ 0.0902, 1.0288, 0.3297, ..., 1.2940, 0.9865, -0.1113],
[ 0.1240, 0.9737, 0.3933, ..., 1.1359, 0.8768, -0.1043]],
[[-0.3212, 0.8251, 1.0554, ..., -0.1855, 0.1517, 0.3937],
[-0.7146, 1.0297, 1.1217, ..., 0.0331, 0.2382, -0.1563],
[-0.2352, 1.1353, 0.8594, ..., -0.4310, -0.0272, -0.2968],
...,
[-0.5400, 0.3236, 0.7839, ..., 0.0022, -0.2994, 0.2659],
[-0.5643, 0.3187, 0.9576, ..., 0.0342, -0.3030, 0.1878],
[-0.5172, 0.3599, 0.9336, ..., 0.0243, -0.2232, 0.1672]],
[[-0.7576, 0.8399, -0.3792, ..., 0.1271, 1.2514, 0.1365],
[-0.6591, 0.7613, -0.4662, ..., 0.2259, 1.1289, -0.3611],
[-0.9007, 0.6791, -0.3778, ..., 0.1142, 0.9080, -0.1830],
...,
[-0.2158, 0.5463, 0.3117, ..., 0.1802, 0.7169, -0.0672],
[-0.3092, 0.4833, 0.3021, ..., 0.2289, 0.6656, -0.0932],
[-0.2940, 0.4678, 0.3095, ..., 0.2782, 0.5144, -0.1021]],
[[-0.2362, 0.8551, -0.8040, ..., 0.6122, 0.3003, -0.1492],
[-0.0868, 0.9531, -0.6419, ..., 0.7867, 0.2960, -0.7350],
[-0.3016, 1.0148, -0.3380, ..., 0.8634, 0.0463, -0.3623],
...,
[-0.1090, 0.6320, -0.8433, ..., 0.7485, 0.1025, 0.0149],
[ 0.0072, 0.7347, -0.7689, ..., 0.6064, 0.1287, 0.0331],
[-0.1108, 0.7605, -0.4447, ..., 0.6719, 0.1059, -0.0034]]],
grad_fn=<NativeLayerNormBackward>)
```
```python
embeddings.shape
```
```
torch.Size([4, 128, 768])
```
After producing our dense vector embeddings, we need to perform a mean pooling operation to create a single vector encoding (the sentence embedding).
To do this mean pooling operation, we will need to multiply each value in our embeddings tensor by its respective attention_mask value, so that non-real tokens are ignored.
To perform this operation, we first resize our attention_mask tensor:
```python
attention_mask = tokens['attention_mask']
attention_mask.shape
```
```
torch.Size([4, 128])
```
```python
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape
```
```
torch.Size([4, 128, 768])
```
```python
mask
```
```
tensor([[[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]])
```
Each vector above represents the mask for a single token; each token now has a 768-dimensional vector representing its attention_mask state. We then multiply the two tensors:
```python
masked_embeddings = embeddings * mask
masked_embeddings.shape
```
```
torch.Size([4, 128, 768])
```
```python
masked_embeddings
```
```
tensor([[[-0.0692, 0.6230, 0.0354, ..., 0.8033, 1.6314, 0.3281],
[ 0.0367, 0.6842, 0.1946, ..., 0.0848, 1.4747, -0.3008],
[-0.0121, 0.6543, -0.0727, ..., -0.0326, 1.7717, -0.6812],
...,
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000]],
[[-0.3212, 0.8251, 1.0554, ..., -0.1855, 0.1517, 0.3937],
[-0.7146, 1.0297, 1.1217, ..., 0.0331, 0.2382, -0.1563],
[-0.2352, 1.1353, 0.8594, ..., -0.4310, -0.0272, -0.2968],
...,
[-0.0000, 0.0000, 0.0000, ..., 0.0000, -0.0000, 0.0000],
[-0.0000, 0.0000, 0.0000, ..., 0.0000, -0.0000, 0.0000],
[-0.0000, 0.0000, 0.0000, ..., 0.0000, -0.0000, 0.0000]],
[[-0.7576, 0.8399, -0.3792, ..., 0.1271, 1.2514, 0.1365],
[-0.6591, 0.7613, -0.4662, ..., 0.2259, 1.1289, -0.3611],
[-0.9007, 0.6791, -0.3778, ..., 0.1142, 0.9080, -0.1830],
...,
[-0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000],
[-0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000],
[-0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000]],
[[-0.2362, 0.8551, -0.8040, ..., 0.6122, 0.3003, -0.1492],
[-0.0868, 0.9531, -0.6419, ..., 0.7867, 0.2960, -0.7350],
[-0.3016, 1.0148, -0.3380, ..., 0.8634, 0.0463, -0.3623],
...,
[-0.0000, 0.0000, -0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, -0.0000, ..., 0.0000, 0.0000, 0.0000],
[-0.0000, 0.0000, -0.0000, ..., 0.0000, 0.0000, -0.0000]]],
grad_fn=<MulBackward0>)
```
Then we sum the remaining embeddings along axis 1:
```python
summed = torch.sum(masked_embeddings, 1)
summed.shape
```
```
torch.Size([4, 768])
```
Then we sum the mask along the same axis to count the real-token values at each position (clamping to a tiny minimum so we never divide by zero):
```python
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape
```
```
torch.Size([4, 768])
```
```python
summed_mask
```
```
tensor([[15., 15., 15., ..., 15., 15., 15.],
[22., 22., 22., ..., 22., 22., 22.],
[15., 15., 15., ..., 15., 15., 15.],
[14., 14., 14., ..., 14., 14., 14.]])
```
Finally, we calculate the mean:
```python
mean_pooled = summed / summed_mask
mean_pooled
```
```
tensor([[ 0.0745, 0.8637, 0.1795, ..., 0.7734, 1.7247, -0.1803],
[-0.3715, 0.9729, 1.0840, ..., -0.2552, -0.2759, 0.0358],
[-0.5030, 0.7950, -0.1240, ..., 0.1441, 0.9704, -0.1791],
[-0.2131, 1.0175, -0.8833, ..., 0.7371, 0.1947, -0.3011]],
grad_fn=<DivBackward0>)
```
Once we have our dense vectors, we can calculate the cosine similarity between each of them; this is the same logic we used before:
```python
from sklearn.metrics.pairwise import cosine_similarity
```
Let's calculate the cosine similarity for sentence 0:
```python
# Convert the PyTorch tensor to a numpy array
mean_pooled = mean_pooled.detach().numpy()

# Calculate cosine similarity
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)
```
```
array([[0.33088905, 0.7219259 , 0.55483633]], dtype=float32)
```
These similarities translate to:
| Index | Sentence | Similarity |
|---|---|---|
| 1 | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309 |
| 2 | "The person box was packed with jelly many dozens of months later." | 0.7219 |
| 3 | "He found a leprechaun in his walnut shell." | 0.5548 |
We get almost identical results; the only difference is that the cosine similarity for index 3 moved from 0.5547 to 0.5548, a tiny discrepancy.
That is everything on measuring the semantic similarity of sentences with BERT, implemented in two ways: with sentence-transformers, and with PyTorch and transformers.
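For convenience, the transformers/PyTorch steps above can be wrapped into one small helper. This is just a sketch that restates what we did in this article (the function name is ours, not from the notebooks):
```python
import torch
from transformers import AutoTokenizer, AutoModel

def embed_sentences(sentences, model_name='sentence-transformers/bert-base-nli-mean-tokens'):
    # Tokenize, run the model, then mean-pool the last hidden state over real tokens.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    tokens = tokenizer(sentences, max_length=128, truncation=True,
                       padding='max_length', return_tensors='pt')
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state               # [batch, 128, 768]
    mask = tokens['attention_mask'].unsqueeze(-1).expand(hidden.size()).float()
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)  # [batch, 768]
```
Calling embed_sentences(sentences) should return a (4, 768) tensor equivalent to the mean_pooled result above.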
Complete notebooks for both approaches: https://github.com/jamescalam/transformers/blob/main/course/similarity/04_sentence_transformers.ipynb and https://github.com/jamescalam/transformers/blob/main/course/similarity/03_calculating_similarity.ipynb.
Thanks for reading!
References
N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019), Proceedings of the 2019 Conference on Empirical Methods in NLP