Overview

Understand the attention mechanism for image caption generation.
Implement the attention mechanism to generate captions in Python.

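Introduction

Attention is a sophisticated cognitive ability that humans possess. When people receive information, they can consciously select the main pieces of it and ignore the secondary ones.

This capacity for self-selection is called attention. An attention mechanism lets a neural network focus on a subset of its inputs in order to select specific features.

In recent years, neural networks have driven great progress in image captioning. Researchers are looking for more challenging applications of computer vision and sequence-to-sequence modeling systems, trying to describe the world in human terms. We previously looked at image captioning with the Merge architecture; today we explore a more sophisticated design for the same problem.

The attention mechanism has become a go-to method for practitioners in the deep learning community. It was originally designed in the context of neural machine translation with Seq2Seq models, but today we look at its implementation in image captioning.

Rather than compressing the entire image into a static representation, the attention mechanism lets salient features dynamically come to the forefront as needed. This is especially important when the image contains a lot of clutter.

Let's take an example to understand this better:

Our goal is to generate a caption such as "two white dogs are running on the snow". To do so, we will see how to implement a specific type of attention mechanism called Bahdanau attention, or local attention.

This way, we can see which parts of the image the model focuses on as it generates the caption. This implementation requires a strong background in deep learning.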

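1. Approach to the problem statement

2. Understanding the dataset

3. Implementation

3.1. Importing the required libraries

3.2. Data loading and preprocessing

3.3. Model definition

3.4. Model training

3.5. Greedy search and BLEU evaluation

4. What's next?

5. End notes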

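Approach to the problem statement

The encoder-decoder image captioning system encodes the image with a pre-trained convolutional neural network, which produces a hidden state. It then decodes this hidden state with an LSTM and generates the caption.

For each sequence element, the output of the previous elements is combined with the new sequence data and used as input. This gives the RNN a kind of memory, which can make the captions more informative and context-aware.

However, RNNs tend to be computationally expensive to train and evaluate, so in practice memory is limited to just a few elements. Attention models help with this by selecting the most relevant elements from the input image. With an attention mechanism, the image is first divided into n parts, and we compute a representation of each part. When the RNN generates a new word, the attention mechanism focuses on the relevant part of the image, so the decoder uses only specific parts of the image.

In Bahdanau, or local, attention, attention is placed on only a few source positions. Global attention attends to all source-side words for every target word, which makes it computationally expensive. To overcome this drawback, local attention chooses to focus on only a small subset of the encoder's hidden states for each target word.

Local attention first finds an alignment position, then computes the attention weights in a window to the left and right of that position, and finally weights the context vector. Its main advantage is the reduced cost of computing attention.

In this computation, local attention does not consider all the words on the source-language side. Instead, a prediction function predicts the source position to align with at the current decoding step, and only the words inside a context window around that position are considered.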
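The windowing idea described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name local_attention, the half_width parameter and the toy scores are hypothetical, and the alignment position is passed in directly, whereas in a real model it would come from a learned prediction function.

```python
import numpy as np

def local_attention(scores, values, center, half_width):
    """Attend only to positions inside [center - half_width, center + half_width]."""
    n = len(scores)
    lo = max(0, center - half_width)
    hi = min(n, center + half_width + 1)
    # softmax restricted to the window (shifted by the max for numerical stability)
    window_scores = scores[lo:hi]
    exp_scores = np.exp(window_scores - window_scores.max())
    window_weights = exp_scores / exp_scores.sum()
    # positions outside the window get exactly zero weight
    weights = np.zeros(n)
    weights[lo:hi] = window_weights
    # context vector = weighted sum of the values inside the window
    context = weights @ values
    return context, weights

scores = np.array([0.1, 2.0, 0.3, 1.5, 0.2, 0.9, 0.4, 1.1])  # toy alignment scores
values = np.eye(8)  # one-hot values, so the context vector equals the weights
context, weights = local_attention(scores, values, center=3, half_width=1)
print(np.round(weights, 3))
```

Only positions 2 to 4 receive non-zero weight here, so the cost of the weighted sum depends on the window size rather than on the full source length.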

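Design of Bahdanau attention

All hidden states of the encoder and the decoder are used to generate the context vector. The attention mechanism aligns the input and output sequences using an alignment score parameterized by a feed-forward network, which helps it attend to the most relevant information in the source sequence. The model predicts a target word based on the context vectors associated with the source positions and the previously generated target words.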
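To make the alignment score concrete, here is a minimal NumPy sketch of additive (Bahdanau-style) scoring for a single decoding step. The weight matrices are random stand-ins for learned parameters, and the sizes (49 locations, 256-dimensional features, a 512-dimensional decoder state, a 64-dimensional attention space) are assumptions chosen for illustration.

```python
import numpy as np

def additive_attention(features, hidden, W_feat, W_hid, v):
    """Score each location against the decoder state and pool the features.

    features: (locations, feat_dim) encoder outputs
    hidden:   (hid_dim,) previous decoder state
    W_feat:   (feat_dim, attn_dim), W_hid: (hid_dim, attn_dim), v: (attn_dim,)
    """
    # e_j = v^T tanh(W_feat h_j + W_hid s): one scalar score per location
    scores = np.tanh(features @ W_feat + hidden @ W_hid) @ v
    # softmax over locations (shifted by the max for numerical stability)
    exp_scores = np.exp(scores - scores.max())
    weights = exp_scores / exp_scores.sum()
    # context vector = attention-weighted sum of the location features
    context = weights @ features
    return context, weights

rng = np.random.default_rng(0)
features = rng.normal(size=(49, 256))      # 49 locations, 256-dim features
hidden = rng.normal(size=512)              # decoder state
W_feat = rng.normal(size=(256, 64)) * 0.1  # random stand-ins for learned weights
W_hid = rng.normal(size=(512, 64)) * 0.1
v = rng.normal(size=64)

context, weights = additive_attention(features, hidden, W_feat, W_hid, v)
print(context.shape, weights.shape)
```

The weights form a probability distribution over the locations, and the context vector has the same dimensionality as a single feature vector.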

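To evaluate generated captions against the original captions, we use an evaluation method called BLEU. It is the most widely used evaluation metric, and it analyzes how well the n-grams of the sentence being evaluated match those of the reference sentences.

In this article, the multiple images correspond to the multiple source-language sentences in translation. The advantage of BLEU is that it considers longer matches: its unit of granularity is the n-gram rather than the single word. Its drawback is that every matched n-gram is treated the same, whichever n-gram it is.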
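The n-gram matching at the heart of BLEU can be illustrated with a small pure-Python sketch of clipped (modified) n-gram precision. The helper names and the two example sentences are hypothetical; the full BLEU score additionally combines the precisions for several values of n with a geometric mean and applies a brevity penalty, which this sketch leaves out.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram counts at most as often
    as it appears in the reference that contains it most."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = "two white dogs are running on the snow".split()
candidate = "two white dogs are playing in the snow".split()

p1 = modified_precision(candidate, [reference], 1)  # 6 of 8 unigrams match
p2 = modified_precision(candidate, [reference], 2)  # 4 of 7 bigrams match
print(p1, p2)
```

Here the unigram precision is 6/8 and the bigram precision is 4/7: longer n-grams are harder to match, which is why BLEU rewards captions that get whole phrases right.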

I hope this gives you an idea of how we are approaching this problem statement. Let's dive into the implementation!

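Understanding the dataset

I used the Flickr8k dataset, in which each image is associated with five different captions describing the entities and events depicted in the image.

Flickr8k is small and can be trained easily on a low-end laptop or desktop using a CPU, which makes it a good starting dataset.

Our dataset is structured as follows:

Let's implement the attention mechanism for caption generation!

Step 1: Import the required libraries

Here we use TensorFlow to create and train the model. Most of the code is credited to the TensorFlow tutorials. If you want a GPU for training, you can use Google Colab or Kaggle notebooks.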

import string
import numpy as np
import pandas as pd
from numpy import array
from pickle import load
 
from PIL import Image
import pickle
from collections import Counter
import matplotlib.pyplot as plt
 
import sys, time, os, warnings
warnings.filterwarnings("ignore")
import re
 
import tensorflow as tf
from tqdm import tqdm
from nltk.translate.bleu_score import sentence_bleu
 
# Keras is imported through tf.keras so that it matches the tf.keras calls used below
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.utils import plot_model
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Dense, BatchNormalization
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import add
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
 
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

Step 2: Data loading and preprocessing

Define the image and caption paths, and check how many images the dataset contains.

image_path = "/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset"
dir_Flickr_text = "/content/gdrive/My Drive/FLICKR8K/Flickr8k_text/Flickr8k.token.txt"
jpgs = os.listdir(image_path)
 
print("Total Images in Dataset = {}".format(len(jpgs)))

The output is as follows:

We create a dataframe to store the image IDs and captions so that they are easy to work with.

file = open(dir_Flickr_text,'r')
text = file.read()
file.close()
 
datatxt = []
for line in text.split('\n'):
   col = line.split('\t')
   if len(col) == 1:
       continue
   w = col[0].split("#")
   datatxt.append(w + [col[1].lower()])
 
data = pd.DataFrame(datatxt,columns=["filename","index","caption"])
data = data.reindex(columns =['index','filename','caption'])
data = data[data.filename != '2258277193_586949ec62.jpg.1']
uni_filenames = np.unique(data.filename.values)
 
data.head()

The output is as follows:

Next, let's visualize a few images together with their 5 captions:

npic = 5
npix = 224
target_size = (npix,npix,3)
count = 1
 
fig = plt.figure(figsize=(10,20))
for jpgfnm in uni_filenames[10:14]:
   filename = image_path + '/' + jpgfnm
   captions = list(data["caption"].loc[data["filename"]==jpgfnm].values)
   image_load = load_img(filename, target_size=target_size)
   ax = fig.add_subplot(npic,2,count,xticks=[],yticks=[])
   ax.imshow(image_load)
   count += 1
 
   ax = fig.add_subplot(npic,2,count)
   plt.axis('off')
   ax.plot()
   ax.set_xlim(0,1)
   ax.set_ylim(0,len(captions))
   for i, caption in enumerate(captions):
       ax.text(0,i,caption,fontsize=20)
   count += 1
plt.show()

The output is as follows:

Next, let's see how large our current vocabulary is:

vocabulary = []
for txt in data.caption.values:
   vocabulary.extend(txt.split())
print('Vocabulary Size: %d' % len(set(vocabulary)))

The output is as follows:

Next, we perform some text cleaning, such as removing punctuation, single characters and numeric values:

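def remove_punctuation(text_original):
   # str.translate needs a translation table; this one deletes every punctuation character
   text_no_punctuation = text_original.translate(str.maketrans('', '', string.punctuation))
   return(text_no_punctuation)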
 
def remove_single_character(text):
   text_len_more_than1 = ""
   for word in text.split():
       if len(word) > 1:
           text_len_more_than1 += " " + word
   return(text_len_more_than1)
 
def remove_numeric(text):
   text_no_numeric = ""
   for word in text.split():
       isalpha = word.isalpha()
       if isalpha:
           text_no_numeric += " " + word
   return(text_no_numeric)
 
def text_clean(text_original):
   text = remove_punctuation(text_original)
   text = remove_single_character(text)
   text = remove_numeric(text)
   return(text)
 
# clean every caption in one pass; avoids chained assignment on a filtered dataframe
data["caption"] = data["caption"].apply(text_clean)

Now let's look at the vocabulary size after cleaning:

clean_vocabulary = []
for txt in data.caption.values:
   clean_vocabulary.extend(txt.split())
print('Clean Vocabulary Size: %d' % len(set(clean_vocabulary)))

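The output is as follows:

Next, we save all the captions and image paths in two lists so that we can load the images directly from the set of paths. We also add "<start>" and "<end>" tags to every caption so that the model can tell where each caption begins and ends.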

PATH = "/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset/"
all_captions = []
for caption  in data["caption"].astype(str):
   caption = '<start> ' + caption+ ' <end>'
   all_captions.append(caption)
 
all_captions[:10]

The output is as follows:

all_img_name_vector = []
for annot in data["filename"]:
   full_image_path = PATH + annot
   all_img_name_vector.append(full_image_path)
 
all_img_name_vector[:10]

The output is as follows:

Now you can see that we have 40455 image paths and captions.

print(f"len(all_img_name_vector) : {len(all_img_name_vector)}")
print(f"len(all_captions) : {len(all_captions)}")

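The output is as follows:

We keep only 40000 of them so that the data divides evenly into batches: with a batch size of 64, that gives exactly 625 batches. To do this, we define a function that limits the dataset to 40000 images and captions.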

def data_limiter(num,total_captions,all_img_name_vector):
 train_captions, img_name_vector = shuffle(total_captions,all_img_name_vector,random_state=1)
 train_captions = train_captions[:num]
 img_name_vector = img_name_vector[:num]
 return train_captions,img_name_vector
 
# all_captions is the caption list built above
train_captions,img_name_vector = data_limiter(40000,all_captions,all_img_name_vector)
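The batch arithmetic behind the choice of 40000 can be checked quickly. The snippet below is a small sanity check using only the figures quoted above (40455 pairs, 40000 kept, batch size 64); the variable names are illustrative.

```python
total_pairs = 40455   # image-caption pairs in the cleaned dataframe
kept_pairs = 40000    # size after data_limiter
batch_size = 64

leftover_before = total_pairs % batch_size  # pairs that would not fill a batch
batches_after = kept_pairs // batch_size    # full batches after limiting
leftover_after = kept_pairs % batch_size    # nothing left over
print(leftover_before, batches_after, leftover_after)  # prints: 7 625 0
```

Without the limit, the last batch would hold only 7 pairs; with it, every batch is full.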

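Step 3: Model definition

Let's define the image feature extraction model using VGG16. Keep in mind that we do not need to classify the images here; we only need to extract a feature vector for each image. We therefore drop the classification head (the fully connected layers and the softmax) by loading the model with include_top=False. All images must first be preprocessed to the same size, 224×224, before being fed to the model.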

def load_image(image_path):
   img = tf.io.read_file(image_path)
   img = tf.image.decode_jpeg(img, channels=3)
   img = tf.image.resize(img, (224, 224))
   img = preprocess_input(img)
   return img, image_path
 
image_model = tf.keras.applications.VGG16(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
 
image_features_extract_model.summary()

The output is as follows:

Next, let's map each image name to the function that loads the image:

encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(64)

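We extract the features and store them in their respective .npy files, and later pass these features through the encoder. A .npy file stores all the information needed to reconstruct the array on any machine, including dtype and shape information.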

%%time
for img, path in tqdm(image_dataset):
 batch_features = image_features_extract_model(img)
 batch_features = tf.reshape(batch_features,
                             (batch_features.shape[0], -1, batch_features.shape[3]))
 
 for bf, p in zip(batch_features, path):
   path_of_feature = p.numpy().decode("utf-8")
   np.save(path_of_feature, bf.numpy())
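The reshape and the .npy round trip used in the loop above can be tried in isolation. This sketch uses a random array as a stand-in for one image's VGG16 output (a 7×7 grid of 512-channel features) and a hypothetical file name in a temporary directory; it does not touch the real dataset.

```python
import os
import tempfile
import numpy as np

# Stand-in for one image's VGG16 output: a 7x7 grid of 512-channel features.
feature_map = np.random.rand(7, 7, 512).astype(np.float32)

# Flatten the spatial grid into 49 locations, as the loop above does per batch.
flat_features = feature_map.reshape(-1, feature_map.shape[-1])

with tempfile.TemporaryDirectory() as tmp_dir:
    image_path = os.path.join(tmp_dir, "example.jpg")  # hypothetical image path
    np.save(image_path, flat_features)                 # np.save appends ".npy"
    saved_files = os.listdir(tmp_dir)
    restored = np.load(image_path + ".npy")

print(saved_files, restored.shape, restored.dtype)
```

Because np.save appends the extension, the features for an image end up next to it as "<image name>.npy", which is why map_func later loads img_name + '.npy'.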

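Next, we tokenize the captions and build a vocabulary of all the unique words in the data. We also limit the vocabulary to the top 5000 words to save memory, and replace words outside the vocabulary with the token "<unk>".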

top_k = 5000
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=top_k,
                                                 oov_token="<unk>",
                                                 filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
 
tokenizer.fit_on_texts(train_captions)
# reserve index 0 for the padding token
tokenizer.word_index['<pad>'] = 0
tokenizer.index_word[0] = '<pad>'
 
train_seqs = tokenizer.texts_to_sequences(train_captions)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
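What the tokenizer and the padding step do can be mimicked in plain Python. This is a simplified sketch, not the Keras implementation: build_vocab and encode_and_pad are hypothetical helpers, the way ids are assigned is an assumption, and the two captions are made up.

```python
from collections import Counter

def build_vocab(captions, top_k):
    """Map the most frequent words to ids; 0 is padding, 1 is the <unk> token."""
    counts = Counter(word for caption in captions for word in caption.split())
    word_index = {"<pad>": 0, "<unk>": 1}
    # keep only the most frequent words so the vocabulary has top_k entries
    for word, _ in counts.most_common(top_k - 2):
        word_index[word] = len(word_index)
    return word_index

def encode_and_pad(captions, word_index):
    """Turn captions into id lists and right-pad them with 0 to equal length."""
    seqs = [[word_index.get(w, word_index["<unk>"]) for w in c.split()]
            for c in captions]
    max_len = max(len(s) for s in seqs)
    return [s + [0] * (max_len - len(s)) for s in seqs]

captions = [
    "<start> two dogs run <end>",
    "<start> a dog runs on the snow <end>",
]
word_index = build_vocab(captions, top_k=6)
padded = encode_and_pad(captions, word_index)
print(word_index)
print(padded)
```

Rare words collapse onto the "<unk>" id, and the shorter caption is padded with zeros at the end, matching padding='post' above.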

Let's look at the training captions and their tokenized vectors:

train_captions[:3]

The output is as follows:

train_seqs[:3]

The output is as follows:

Next, we can compute the maximum and minimum lengths of all the captions:

def calc_max_length(tensor):
   return max(len(t) for t in tensor)
max_length = calc_max_length(train_seqs)
 
def calc_min_length(tensor):
   return min(len(t) for t in tensor)
min_length = calc_min_length(train_seqs)
 
print('Max Length of any caption : Min Length of any caption = '+ str(max_length) +" : "+str(min_length))

The output is as follows:

Next, create the training and validation sets with an 80-20 split:

img_name_train, img_name_val, cap_train, cap_val = train_test_split(img_name_vector,cap_vector, test_size=0.2, random_state=0)

Define the training parameters:

BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = len(tokenizer.word_index) + 1
num_steps = len(img_name_train) // BATCH_SIZE
features_shape = 512
attention_features_shape = 49
 
 
 
def map_func(img_name, cap):
 img_tensor = np.load(img_name.decode('utf-8')+'.npy')
 return img_tensor, cap
dataset = tf.data.Dataset.from_tensor_slices((img_name_train, cap_train))
 
# Use map to load the numpy files in parallel
dataset = dataset.map(lambda item1, item2: tf.numpy_function(
        map_func, [item1, item2], [tf.float32, tf.int32]),
         num_parallel_calls=tf.data.experimental.AUTOTUNE)
 
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE)
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)

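Next, let's focus on defining the encoder-decoder architecture. The architecture defined in this article is similar to the one described in the paper "Show and Tell: A Neural Image Caption Generator".

The VGG-16 encoder is defined as follows: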

class VGG16_Encoder(tf.keras.Model):
   # This encoder passes the features through a Fully connected layer
   def __init__(self, embedding_dim):
       super(VGG16_Encoder, self).__init__()
       # shape after fc == (batch_size, 49, embedding_dim)
       self.fc = tf.keras.layers.Dense(embedding_dim)
       self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
 
   def call(self, x):
       #x= self.dropout(x)
       x = self.fc(x)
       x = tf.nn.relu(x)
       return x

We define the RNN according to whether a GPU or only a CPU is available:

def rnn_type(units):
   # tf.test.is_gpu_available() is deprecated; list the physical GPU devices instead
   if tf.config.list_physical_devices('GPU'):
       return tf.compat.v1.keras.layers.CuDNNLSTM(units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')
   else:
       return tf.keras.layers.GRU(units,
                                  return_sequences=True,
                                  return_state=True,
                                  recurrent_activation='sigmoid',
                                  recurrent_initializer='glorot_uniform')

Next, define the RNN decoder with Bahdanau attention:

'''The encoder output(i.e. 'features'), hidden state(initialized to 0)(i.e. 'hidden') and
the decoder input (which is the start token)(i.e. 'x') is passed to the decoder.'''
 
class Rnn_Local_Decoder(tf.keras.Model):
 def __init__(self, embedding_dim, units, vocab_size):
   super(Rnn_Local_Decoder, self).__init__()
   self.units = units
   self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
   self.gru = tf.keras.layers.GRU(self.units,
                                  return_sequences=True,
                                  return_state=True,
                                  recurrent_initializer='glorot_uniform')
  
   self.fc1 = tf.keras.layers.Dense(self.units)
 
   self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
   self.batchnormalization = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)
 
   self.fc2 = tf.keras.layers.Dense(vocab_size)
 
   # Implementing Attention Mechanism
   self.Uattn = tf.keras.layers.Dense(units)
   self.Wattn = tf.keras.layers.Dense(units)
   self.Vattn = tf.keras.layers.Dense(1)
 
 def call(self, x, features, hidden):
   # features shape ==> (64,49,256) ==> Output from ENCODER
   # hidden shape == (batch_size, hidden_size) ==>(64,512)
   # hidden_with_time_axis shape == (batch_size, 1, hidden_size) ==> (64,1,512)
 
   hidden_with_time_axis = tf.expand_dims(hidden, 1)
 
   # score shape == (64, 49, 1)
   # Attention Function
   '''e(ij) = f(s(t-1),h(j))'''
   ''' e(ij) = Vattn(T)*tanh(Uattn * h(j) + Wattn * s(t))'''
 
   score = self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)))
 
   # self.Uattn(features) : (64,49,512)
   # self.Wattn(hidden_with_time_axis) : (64,1,512)
   # tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis)) : (64,49,512)
   # self.Vattn(tf.nn.tanh(self.Uattn(features) + self.Wattn(hidden_with_time_axis))) : (64,49,1) ==> score
 
   # you get 1 at the last axis because you are applying score to self.Vattn
   # Then find Probability using Softmax
   '''attention_weights(alpha(ij)) = softmax(e(ij))'''
 
   attention_weights = tf.nn.softmax(score, axis=1)
 
   # attention_weights shape == (64, 49, 1)
   # Give weights to the different pixels in the image
   ''' C(t) = Summation(j=1 to T) (attention_weights * VGG-16 features) '''
 
   context_vector = attention_weights * features
   context_vector = tf.reduce_sum(context_vector, axis=1)
 
   # Context Vector(64,256) = AttentionWeights(64,49,1) * features(64,49,256)
   # context_vector shape after sum == (64, 256)
   # x shape after passing through embedding == (64, 1, 256)
 
   x = self.embedding(x)
   # x shape after concatenation == (64, 1,  512)
 
   x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
   # passing the concatenated vector to the GRU
 
   output, state = self.gru(x)
   # shape == (batch_size, max_length, hidden_size)
 
   x = self.fc1(output)
   # x shape == (batch_size * max_length, hidden_size)
 
   x = tf.reshape(x, (-1, x.shape[2]))
 
   # Adding Dropout and BatchNorm Layers
   x= self.dropout(x)
   x= self.batchnormalization(x)
 
   # output shape == (64 * 512)
   x = self.fc2(x)
 
   # shape : (64 * 8329(vocab))
   return x, state, attention_weights
 
 def reset_state(self, batch_size):
   return tf.zeros((batch_size, self.units))
 
 
encoder = VGG16_Encoder(embedding_dim)
decoder = Rnn_Local_Decoder(embedding_dim, units, vocab_size)

Next, we define the loss function and the optimizer:

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
   from_logits=True, reduction='none')
 
def loss_function(real, pred):
 mask = tf.math.logical_not(tf.math.equal(real, 0))
 loss_ = loss_object(real, pred)
 mask = tf.cast(mask, dtype=loss_.dtype)
 loss_ *= mask
 
 return tf.reduce_mean(loss_)
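The effect of the padding mask in loss_function can be seen on a tiny hand-made batch. This pure-Python sketch takes probabilities rather than logits (an assumption made to keep it short) and uses made-up numbers; like the function above, it averages over the whole batch, padded positions included.

```python
import math

def masked_cross_entropy(target_ids, probs, pad_id=0):
    """Average cross-entropy over a batch, with padded positions contributing 0."""
    losses = []
    for target, dist in zip(target_ids, probs):
        loss = -math.log(dist[target])
        # padded targets are masked out, exactly like multiplying by the mask
        losses.append(0.0 if target == pad_id else loss)
    # the mean is taken over every position, masked or not
    return sum(losses) / len(losses)

# One time step for a batch of three captions over a 4-word vocabulary.
target_ids = [2, 0, 3]  # the second caption has already ended (padding)
probs = [
    [0.1, 0.1, 0.7, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.2, 0.2, 0.5],
]
loss = masked_cross_entropy(target_ids, probs)
print(round(loss, 4))
```

Only the first and third captions contribute, so the result is (-ln 0.7 - ln 0.5) / 3; the padded position neither helps nor hurts the model.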

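Step 4: Model training

Next, let's define the training step. We use a technique called teacher forcing, which passes the target word as the next input to the decoder. This technique helps the model learn the correct sequence, or the correct statistical properties of the sequence, quickly.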

loss_plot = []
 
@tf.function
def train_step(img_tensor, target):
 loss = 0
 # initializing the hidden state for each batch
 # because the captions are not related from image to image
 
 hidden = decoder.reset_state(batch_size=target.shape[0])
 # use the actual batch size so that a smaller final batch also works
 dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)
 
 with tf.GradientTape() as tape:
     features = encoder(img_tensor)
     for i in range(1, target.shape[1]):
         # passing the features through the decoder
         predictions, hidden, _ = decoder(dec_input, features, hidden)
         loss += loss_function(target[:, i], predictions)
 
         # using teacher forcing
         dec_input = tf.expand_dims(target[:, i], 1)
 
 total_loss = (loss / int(target.shape[1]))
 trainable_variables = encoder.trainable_variables + decoder.trainable_variables
 gradients = tape.gradient(loss, trainable_variables)
 optimizer.apply_gradients(zip(gradients, trainable_variables))
 
 return loss, total_loss
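Teacher forcing boils down to a simple pairing: at every step the decoder receives the ground-truth previous word and is asked to predict the next one. The sketch below shows that pairing for one made-up caption; teacher_forcing_pairs is a hypothetical helper, not part of the training code above.

```python
def teacher_forcing_pairs(caption_tokens):
    """Pair each decoder input with the word it should predict next."""
    return list(zip(caption_tokens[:-1], caption_tokens[1:]))

caption = ["<start>", "two", "dogs", "run", "<end>"]
pairs = teacher_forcing_pairs(caption)
for decoder_input, expected_output in pairs:
    print(f"{decoder_input:>8} -> {expected_output}")
```

This corresponds to the loop in train_step, where target[:, i] is first the prediction target and then becomes the next decoder input, regardless of what the model actually predicted.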

Next, we train the model:

EPOCHS = 20
start_epoch = 0  # assumed: training starts from scratch, no checkpoint is restored
for epoch in range(start_epoch, EPOCHS):
   start = time.time()
   total_loss = 0
 
   for (batch, (img_tensor, target)) in enumerate(dataset):
       batch_loss, t_loss = train_step(img_tensor, target)
       total_loss += t_loss
 
       if batch % 100 == 0:
           print ('Epoch {} Batch {} Loss {:.4f}'.format(
             epoch + 1, batch, batch_loss.numpy() / int(target.shape[1])))
   # storing the epoch end loss value to plot later
   loss_plot.append(total_loss / num_steps)
 
   print ('Epoch {} Loss {:.6f}'.format(epoch + 1,
                                        total_loss/num_steps))
 
   print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

Let's plot the loss:

plt.plot(loss_plot)
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.title('Loss Plot')
plt.show()

The output is as follows:

Step 5: Greedy search and BLEU evaluation

Let's define the greedy method for generating captions:

def evaluate(image):
   attention_plot = np.zeros((max_length, attention_features_shape))
 
   hidden = decoder.reset_state(batch_size=1)
   temp_input = tf.expand_dims(load_image(image)[0], 0)
   img_tensor_val = image_features_extract_model(temp_input)
   img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
 
   features = encoder(img_tensor_val)
   dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
   result = []
 
   for i in range(max_length):
       predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
       attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()
       predicted_id = tf.argmax(predictions[0]).numpy()
       result.append(tokenizer.index_word[predicted_id])
 
       if tokenizer.index_word[predicted_id] == '<end>':
           return result, attention_plot
 
       dec_input = tf.expand_dims([predicted_id], 0)
   attention_plot = attention_plot[:len(result), :]
 
   return result, attention_plot
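Stripped of the model, greedy search is just a loop that always takes the most probable next word. The sketch below replaces the trained decoder with a hypothetical next-word table (toy_model) so that the control flow can be followed by hand; it does not use attention or images.

```python
def greedy_decode(next_word_probs, start="<start>", end="<end>", max_length=10):
    """Repeatedly pick the most probable next word until <end> or max_length."""
    word, result = start, []
    for _ in range(max_length):
        candidates = next_word_probs[word]
        word = max(candidates, key=candidates.get)  # argmax over the distribution
        if word == end:
            break
        result.append(word)
    return result

# Hypothetical next-word table standing in for the trained decoder.
toy_model = {
    "<start>": {"two": 0.6, "a": 0.4},
    "two": {"dogs": 0.7, "cats": 0.3},
    "a": {"dog": 0.9, "cat": 0.1},
    "dogs": {"run": 0.5, "play": 0.3, "<end>": 0.2},
    "run": {"<end>": 0.8, "fast": 0.2},
}
caption = greedy_decode(toy_model)
print(" ".join(caption))  # prints: two dogs run
```

Greedy search commits to one word per step and never revisits a choice, which keeps it fast but means it can miss a caption that is more probable overall.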

We also define a function that plots the attention map for each generated word, as seen in the introduction:

def plot_attention(image, result, attention_plot):
   temp_image = np.array(Image.open(image))
   fig = plt.figure(figsize=(10, 10))
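   len_result = len(result)
   # square grid with enough cells for every word (at least 2x2)
   grid_size = max(int(np.ceil(len_result / 2)), 2)
   for l in range(len_result):
       # the 49 attention weights correspond to VGG16's 7x7 feature grid
       temp_att = np.resize(attention_plot[l], (7, 7))
       ax = fig.add_subplot(grid_size, grid_size, l+1)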
       ax.set_title(result[l])
       img = ax.imshow(temp_image)
       ax.imshow(temp_att, cmap='gray', alpha=0.6, extent=img.get_extent())
 
   plt.tight_layout()
   plt.show()

Finally, let's generate a caption for the image from the beginning of the article and see what the attention mechanism focuses on as it generates each word:

# caption a specific image from the dataset
image = '/content/gdrive/My Drive/FLICKR8K/Flicker8k_Dataset/2319175397_3e586cfaf8.jpg'
real_caption = 'Two white dogs are playing in the snow'
 
result, attention_plot = evaluate(image)
 
# drop "<unk>" and "<end>" tokens from the prediction
result = [word for word in result if word not in ('<unk>', '<end>')]
result_final = ' '.join(result)
 
# sentence_bleu expects a list of tokenized references and a tokenized candidate;
# the reference is lowercased because the training captions were lowercased
reference = [real_caption.lower().split()]
candidate = result
 
score = sentence_bleu(reference, candidate)
print(f"BLEU score: {score*100}")
 
print ('Real Caption:', real_caption)
print ('Prediction Caption:', result_final)
plot_attention(image, result, attention_plot)

The output is as follows:

You can see that we were able to generate the same caption as the real caption. Let's try it on other images from the test set.

rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
start = time.time()
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)
 
first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]
 
# drop "<unk>" and "<end>" tokens from the prediction
result = [word for word in result if word not in ('<unk>', '<end>')]
result_final = ' '.join(result)
 
# tokenized reference and candidate, in the form sentence_bleu expects
reference = [real_caption.split()]
candidate = result
 
print ('Real Caption:', real_caption)
print ('Prediction Caption:', result_final)
 
plot_attention(image, result, attention_plot)
print(f"time took to Predict: {round(time.time()-start)} sec")
 
Image.open(img_name_val[rid])

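The output is as follows:

You can see that even though our caption is quite different from the real caption, it is still very accurate. It was able to recognize the woman's yellow shirt and her hands in her pockets.

Let's look at another one: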

rid = np.random.randint(0, len(img_name_val))
image = img_name_val[rid]
 
real_caption = ' '.join([tokenizer.index_word[i] for i in cap_val[rid] if i not in [0]])
result, attention_plot = evaluate(image)
 
# remove <start> and <end> from the real_caption
first = real_caption.split(' ', 1)[1]
real_caption = first.rsplit(' ', 1)[0]
 
# drop "<unk>" and "<end>" tokens from the prediction
result = [word for word in result if word not in ('<unk>', '<end>')]
result_final = ' '.join(result)
 
# drop "<unk>" from the reference caption as well
reference = [[word for word in real_caption.split() if word != '<unk>']]
candidate = result
 
score = sentence_bleu(reference, candidate)
print(f"BLEU score: {score*100}")
 
print ('Real Caption:', real_caption)
print ('Prediction Caption:', result_final)
 
plot_attention(image, result, attention_plot)

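Here we can see that our caption describes the image better than one of the real captions.

There you have it! We have successfully implemented an attention mechanism for generating image captions.

What's next?

Attention mechanisms have been used heavily in recent years, and this is only the beginning of more advanced systems. Things you can implement to improve the model:

- Use a larger dataset, in particular the MS COCO dataset or the Stock3M dataset, which is 26 times larger than MS COCO.
- Implement different attention mechanisms, such as adaptive attention with a visual sentinel, and semantic attention.
- Implement a Transformer-based model, which should perform much better than an LSTM.
- Use a better architecture for image feature extraction, such as Inception, Xception or EfficientNet.

End notes

This was an interesting look at the attention mechanism and how it applies to deep learning applications. A great deal of research has gone into attention mechanisms and into achieving state-of-the-art results. Do try some of my suggestions! Did you find this article helpful? Please share your valuable feedback in the comments section below.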

Author: 沂水寒城, CSDN blog expert. Research interests: machine learning, deep learning, NLP, CV

Blog: http://yishuihancheng.blog.csdn.net
