【NLP】Measuring Text Similarity with BERT
Compiled by | VK
Source | Towards Data Science

This article looks at sequence similarity with BERT.
A large part of NLP relies on similarity in high-dimensional spaces. Typically, an NLP solution takes some text and processes it to create a large vector/array representing that text.
This is high-dimensional magic.
Sentence similarity is one of the clearest examples of how powerful that high-dimensional magic can be.
The logic is this:
1. Take a sentence and convert it into a vector.
2. Take many other sentences and convert them into vectors.
3. Find the Euclidean distance or cosine similarity between them (a minimal sketch of this step follows below).
We now have a measure of semantic similarity between sentences!
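As a minimal sketch of that last step (the vectors here are made up; turning real sentences into vectors is what the rest of this article covers), cosine similarity is just the dot product of two vectors divided by the product of their norms:
```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical "sentence vectors" (real BERT vectors would have 768 dimensions).
vec_a = np.array([0.1, 0.3, 0.7])
vec_b = np.array([0.2, 0.25, 0.8])
print(cosine_sim(vec_a, vec_b))  # ~0.99 -> the two vectors point in very similar directions
```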
Of course, we want to understand what is happening in more detail, and implement it in Python! So, let's get started.
BERT
BERT, as we have already mentioned, is the MVP of NLP. A big part of this is due to BERT's ability to embed the meaning of words into densely packed vectors.
We call them dense vectors because every value within the vector has a value, and there is a reason for it being that value - in contrast to sparse vectors, such as one-hot encoded vectors, where the majority of values are 0.
BERT is great at creating these dense vectors, and each encoder layer outputs a set of them.

For BERT-base, this will be a vector containing 768 dimensions. Those 768 values hold our numerical representation of a single token, which we can use as a contextual word embedding.
We can convert these tensors into a semantic representation of the input sequence. We can then take our similarity metrics and calculate the similarity between different sequences.
The simplest and most commonly extracted tensor is the last hidden state.
Of course, this is a fairly large tensor, at 512x768; since we have 512 tokens, we need a single vector to apply our similarity measures to.
To do that, we need to convert the last hidden state tensor into a 768-dimensional vector.
Creating the Vectors
為了把最后一個隱藏態(tài)張量轉(zhuǎn)換成向量,我們使用了平均池運(yùn)算。
這512個token中的每一個都有各自的768個值。這個池操作將取所有token嵌入的平均值,并將它們壓縮到一個768向量空間中,從而創(chuàng)建一個“句子向量”。
我們不需要考慮填充token(我們不應(yīng)該包括它)。
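To make the masked mean concrete before the real code, here is a toy example with made-up shapes and values (the real tensors later are [4, 128, 768]); only the non-padding positions contribute to the average:
```python
import torch

# Toy "last hidden state": 1 sequence, 4 token positions, 3 dimensions
# (a real BERT-base tensor would be [batch, tokens, 768]).
hidden = torch.tensor([[[1., 2., 3.],
                        [3., 2., 1.],
                        [0., 0., 0.],    # padding position
                        [0., 0., 0.]]])  # padding position
# Attention mask: 1 for real tokens, 0 for padding.
mask = torch.tensor([[1., 1., 0., 0.]])

# Expand the mask to the embedding dimension, zero out padding, and
# divide by the number of real tokens rather than the full sequence length.
expanded = mask.unsqueeze(-1).expand(hidden.size())
mean_pooled = (hidden * expanded).sum(dim=1) / expanded.sum(dim=1).clamp(min=1e-9)
print(mean_pooled)  # tensor([[2., 2., 2.]]) - the mean over the two real tokens only
```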
The Code
That is the theory and logic - but how do we apply it in practice?
We will outline two approaches: an easy one and a slightly more involved one.
Easy - Sentence-Transformers
The easiest way to implement everything we have just covered is through the Sentence-Transformers library, which wraps most of this process into a few lines of code.
First, we install sentence-transformers with pip install sentence-transformers. This library uses HuggingFace Transformers behind the scenes, so we can find the sentence-transformers models here: https://huggingface.co/sentence-transformers
We will use the bert-base-nli-mean-tokens model, which implements the same logic we have discussed so far.
(It also uses 128 input tokens, rather than 512.)
Let's create some sentences, initialize our model, and encode the sentences:
Write a few sentences to encode (sentences 0 and 2 are both similar):
```python
sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]
```
Initialize our model:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('bert-base-nli-mean-tokens')
```
Encode the sentences:
```python
sentence_embeddings = model.encode(sentences)
sentence_embeddings.shape
```
```
(4, 768)
```
Great, we now have four sentence embeddings, each containing 768 dimensions.
Now what we do is take those embeddings and find the cosine similarity between them. So for sentence 0:
Three years later, the coffin was still full of Jello.
We can find the most similar sentences with:
```python
from sklearn.metrics.pairwise import cosine_similarity
```
Let's calculate the cosine similarity for sentence 0:
```python
cosine_similarity(
    [sentence_embeddings[0]],
    sentence_embeddings[1:]
)
```
```
array([[0.33088642, 0.7218851 , 0.55473834]], dtype=float32)
```
These similarities translate to:
| Index | Sentence | Similarity |
|---|---|---|
| 1 | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309 |
| 2 | "The person box was packed with jelly many dozens of months later." | 0.7219 |
| 3 | "He found a leprechaun in his walnut shell." | 0.5547 |
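As a small extension (this is not in the original notebooks, just a convenience worth knowing), cosine_similarity also accepts a single matrix and returns every pairwise score at once, which saves comparing sentences one at a time:
```python
from sklearn.metrics.pairwise import cosine_similarity

# Pass the whole (4, 768) embedding matrix to get a (4, 4) pairwise similarity matrix.
sim_matrix = cosine_similarity(sentence_embeddings)
# sim_matrix[i, j] is the cosine similarity between sentences i and j;
# the diagonal is 1.0 because each sentence is compared with itself.
print(sim_matrix.round(4))
```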
Involved - Transformers and PyTorch
Before we get into the second approach, it is worth noting that it does the same thing as the first, only with a bit more work.
With this approach, we need to create the sentence embeddings ourselves. To do that, we perform the mean pooling operation.
https://youtu.be/jVPd7lEvjtg
Also, before the mean pooling operation, we need to create last_hidden_state, which we do as follows:
```python
from transformers import AutoTokenizer, AutoModel
import torch
```
First we initialize our model and tokenizer:
```python
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
model = AutoModel.from_pretrained('sentence-transformers/bert-base-nli-mean-tokens')
```
Then we tokenize the sentences just as before:
```python
sentences = [
    "Three years later, the coffin was still full of Jello.",
    "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go.",
    "The person box was packed with jelly many dozens of months later.",
    "He found a leprechaun in his walnut shell."
]

# Initialize a dictionary to store the tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # Encode each sentence and append to the dictionary
    new_tokens = tokenizer.encode_plus(sentence, max_length=128,
                                       truncation=True, padding='max_length',
                                       return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# Reformat the list of tensors into a single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])
```
We process these tokens through our model:
```python
outputs = model(**tokens)
outputs.keys()
```
```
odict_keys(['last_hidden_state', 'pooler_output'])
```
The dense vector representations of our text are contained within the outputs 'last_hidden_state' tensor, which we access like so:
```python
embeddings = outputs.last_hidden_state
embeddings
```
```
tensor([[[-0.0692, 0.6230, 0.0354, ..., 0.8033, 1.6314, 0.3281],
[ 0.0367, 0.6842, 0.1946, ..., 0.0848, 1.4747, -0.3008],
[-0.0121, 0.6543, -0.0727, ..., -0.0326, 1.7717, -0.6812],
...,
[ 0.1953, 1.1085, 0.3390, ..., 1.2826, 1.0114, -0.0728],
[ 0.0902, 1.0288, 0.3297, ..., 1.2940, 0.9865, -0.1113],
[ 0.1240, 0.9737, 0.3933, ..., 1.1359, 0.8768, -0.1043]],
[[-0.3212, 0.8251, 1.0554, ..., -0.1855, 0.1517, 0.3937],
[-0.7146, 1.0297, 1.1217, ..., 0.0331, 0.2382, -0.1563],
[-0.2352, 1.1353, 0.8594, ..., -0.4310, -0.0272, -0.2968],
...,
[-0.5400, 0.3236, 0.7839, ..., 0.0022, -0.2994, 0.2659],
[-0.5643, 0.3187, 0.9576, ..., 0.0342, -0.3030, 0.1878],
[-0.5172, 0.3599, 0.9336, ..., 0.0243, -0.2232, 0.1672]],
[[-0.7576, 0.8399, -0.3792, ..., 0.1271, 1.2514, 0.1365],
[-0.6591, 0.7613, -0.4662, ..., 0.2259, 1.1289, -0.3611],
[-0.9007, 0.6791, -0.3778, ..., 0.1142, 0.9080, -0.1830],
...,
[-0.2158, 0.5463, 0.3117, ..., 0.1802, 0.7169, -0.0672],
[-0.3092, 0.4833, 0.3021, ..., 0.2289, 0.6656, -0.0932],
[-0.2940, 0.4678, 0.3095, ..., 0.2782, 0.5144, -0.1021]],
[[-0.2362, 0.8551, -0.8040, ..., 0.6122, 0.3003, -0.1492],
[-0.0868, 0.9531, -0.6419, ..., 0.7867, 0.2960, -0.7350],
[-0.3016, 1.0148, -0.3380, ..., 0.8634, 0.0463, -0.3623],
...,
[-0.1090, 0.6320, -0.8433, ..., 0.7485, 0.1025, 0.0149],
[ 0.0072, 0.7347, -0.7689, ..., 0.6064, 0.1287, 0.0331],
[-0.1108, 0.7605, -0.4447, ..., 0.6719, 0.1059, -0.0034]]],
grad_fn=<NativeLayerNormBackward>)
```
```python
embeddings.shape
```
```
torch.Size([4, 128, 768])
```
After producing our dense vector embeddings, we need to perform a mean pooling operation to create a single vector encoding (the sentence embedding).
To do this mean pooling operation, we will need to multiply each value in our embeddings tensor by its respective attention_mask value, so that non-real tokens are ignored.
To perform this operation, we first resize our attention_mask tensor:
```python
attention_mask = tokens['attention_mask']
attention_mask.shape
```
```
torch.Size([4, 128])
```
```python
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
mask.shape
```
```
torch.Size([4, 128, 768])
```
```python
mask
```
```
tensor([[[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]],
[[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
[1., 1., 1., ..., 1., 1., 1.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]]]])
```
Each vector above represents the mask for a single token; each token now has a 768-dimensional vector representing its attention_mask state. We then multiply the two tensors:
```python
masked_embeddings = embeddings * mask
masked_embeddings.shape
```
```
torch.Size([4, 128, 768])
```
```python
masked_embeddings
```
```
tensor([[[-0.0692, 0.6230, 0.0354, ..., 0.8033, 1.6314, 0.3281],
[ 0.0367, 0.6842, 0.1946, ..., 0.0848, 1.4747, -0.3008],
[-0.0121, 0.6543, -0.0727, ..., -0.0326, 1.7717, -0.6812],
...,
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000],
[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000]],
[[-0.3212, 0.8251, 1.0554, ..., -0.1855, 0.1517, 0.3937],
[-0.7146, 1.0297, 1.1217, ..., 0.0331, 0.2382, -0.1563],
[-0.2352, 1.1353, 0.8594, ..., -0.4310, -0.0272, -0.2968],
...,
[-0.0000, 0.0000, 0.0000, ..., 0.0000, -0.0000, 0.0000],
[-0.0000, 0.0000, 0.0000, ..., 0.0000, -0.0000, 0.0000],
[-0.0000, 0.0000, 0.0000, ..., 0.0000, -0.0000, 0.0000]],
[[-0.7576, 0.8399, -0.3792, ..., 0.1271, 1.2514, 0.1365],
[-0.6591, 0.7613, -0.4662, ..., 0.2259, 1.1289, -0.3611],
[-0.9007, 0.6791, -0.3778, ..., 0.1142, 0.9080, -0.1830],
...,
[-0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000],
[-0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000],
[-0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, -0.0000]],
[[-0.2362, 0.8551, -0.8040, ..., 0.6122, 0.3003, -0.1492],
[-0.0868, 0.9531, -0.6419, ..., 0.7867, 0.2960, -0.7350],
[-0.3016, 1.0148, -0.3380, ..., 0.8634, 0.0463, -0.3623],
...,
[-0.0000, 0.0000, -0.0000, ..., 0.0000, 0.0000, 0.0000],
[ 0.0000, 0.0000, -0.0000, ..., 0.0000, 0.0000, 0.0000],
[-0.0000, 0.0000, -0.0000, ..., 0.0000, 0.0000, -0.0000]]],
grad_fn=<MulBackward0>)
```
Then we sum the remaining embeddings along axis 1:
```python
summed = torch.sum(masked_embeddings, 1)
summed.shape
```
```
torch.Size([4, 768])
```
Then we sum the mask along the same axis to count the real-token values at each position (clamping to a tiny minimum so we never divide by zero):
```python
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
summed_mask.shape
```
```
torch.Size([4, 768])
```
```python
summed_mask
```
```
tensor([[15., 15., 15., ..., 15., 15., 15.],
[22., 22., 22., ..., 22., 22., 22.],
[15., 15., 15., ..., 15., 15., 15.],
[14., 14., 14., ..., 14., 14., 14.]])
```
Finally, we calculate the mean:
```python
mean_pooled = summed / summed_mask
mean_pooled
```
```
tensor([[ 0.0745, 0.8637, 0.1795, ..., 0.7734, 1.7247, -0.1803],
[-0.3715, 0.9729, 1.0840, ..., -0.2552, -0.2759, 0.0358],
[-0.5030, 0.7950, -0.1240, ..., 0.1441, 0.9704, -0.1791],
[-0.2131, 1.0175, -0.8833, ..., 0.7371, 0.1947, -0.3011]],
grad_fn=<DivBackward0>)
```
Once we have our dense vectors, we can calculate the cosine similarity between each of them; this is the same logic we used before:
```python
from sklearn.metrics.pairwise import cosine_similarity
```
Let's calculate the cosine similarity for sentence 0:
```python
# Convert the PyTorch tensor to a numpy array
mean_pooled = mean_pooled.detach().numpy()

# Calculate cosine similarity
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)
```
```
array([[0.33088905, 0.7219259 , 0.55483633]], dtype=float32)
```
These similarities translate to:
| Index | Sentence | Similarity |
|---|---|---|
| 1 | "The fish dreamed of escaping the fishbowl and into the toilet where he saw his friend go." | 0.3309 |
| 2 | "The person box was packed with jelly many dozens of months later." | 0.7219 |
| 3 | "He found a leprechaun in his walnut shell." | 0.5548 |
We get almost identical results; the only difference is that the cosine similarity for index 3 moved from 0.5547 to 0.5548, a tiny discrepancy.
That is everything on measuring the semantic similarity of sentences with BERT, implemented in two ways: with sentence-transformers, and with PyTorch and transformers.
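For convenience, the transformers/PyTorch steps above can be wrapped into one small helper. This is just a sketch that restates what we did in this article (the function name is ours, not from the notebooks):
```python
import torch
from transformers import AutoTokenizer, AutoModel

def embed_sentences(sentences, model_name='sentence-transformers/bert-base-nli-mean-tokens'):
    # Tokenize, run the model, then mean-pool the last hidden state over real tokens.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    tokens = tokenizer(sentences, max_length=128, truncation=True,
                       padding='max_length', return_tensors='pt')
    with torch.no_grad():
        hidden = model(**tokens).last_hidden_state               # [batch, 128, 768]
    mask = tokens['attention_mask'].unsqueeze(-1).expand(hidden.size()).float()
    return (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)  # [batch, 768]
```
Calling embed_sentences(sentences) should return a (4, 768) tensor equivalent to the mean_pooled result above.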
Complete notebooks for both approaches: https://github.com/jamescalam/transformers/blob/main/course/similarity/04_sentence_transformers.ipynb and https://github.com/jamescalam/transformers/blob/main/course/similarity/03_calculating_similarity.ipynb.
Thanks for reading!
References
N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks (2019), Proceedings of the 2019 Conference on Empirical Methods in NLP