Source: 磐創AI

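Motivation

Plagiarism is a major problem in many industries, especially in academia. With the rise of the internet and open information, where anyone can reach any information on a given topic with a click, the phenomenon has become even worse.

Based on this observation, researchers have been trying to tackle the problem with different text analysis approaches. In this conceptual article, we will try to address two main limitations of plagiarism detection tools: (1) plagiarism by paraphrasing and (2) plagiarism by translation.

(1) Rephrased content can be hard for traditional tools to catch because they do not take into account synonyms and antonyms in the overall context.

(2) Content written in a language different from the original is also a major problem, faced even by the most advanced machine-learning-based tools, because the context is completely shifted into another language.

In this conceptual blog post, we will explain how to use Transformer-based models to tackle these two challenges in an innovative way.

First, we will walk you through the analytical approach, describing the whole workflow from data collection to performance analysis. Then we will dive into the scientific/technical implementation of the solution before showing the final results.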

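Problem statement

Imagine you want to build an academic content management platform. You may want to accept only articles that have not already been shared on your platform. In that case, your goal is to reject every new article whose similarity to an existing article exceeds a certain threshold.

To illustrate this scenario, we will use the CORD-19 dataset, an open research challenge dataset made freely available on Kaggle by the Allen Institute for AI.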

https://allenai.org/

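Analytical approach

Before going further with the analysis, let's clarify what we are trying to achieve here with the following question:

Question: Can we find in our database one or more documents that are similar (above a certain threshold) to a newly submitted document?

The workflow below highlights all the main steps required to answer this question.

Let's understand what is happening here.

After collecting the source data, we first preprocess the content and then use BERT to create a vector database.

Then, whenever a new document comes in, we check its language and run the plagiarism detection. More details are given later in the article.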

Scientific implementation

This section focuses on the technical implementation of each component of the analytical approach.

Data preprocessing

We are only interested in the abstract column of the source data and, for simplicity, we will use only 100 observations to speed up preprocessing.

import pandas as pd

def preprocess_data(data_path, sample_size):
    # Read the data from the given path
    data = pd.read_csv(data_path, low_memory=False)

    # Drop articles without an abstract
    data = data.dropna(subset=['abstract']).reset_index(drop=True)

    # Get "sample_size" random articles, keeping only the abstract column
    data = data.sample(sample_size)[['abstract']]

    return data

# Read data & preprocess it
data_path = "./data/cord19_source_data.csv"
source_data = preprocess_data(data_path, 100)

Below are five random observations from the source dataset.

Document vectorizer

The challenges observed in the introduction lead, respectively, to the choice of the following two Transformer-based models:

(1) A BERT model: to address the first limitation, because it provides a better contextual representation of textual information. For this, we will use the following functions:

create_vector_from_text: generates the vector representation of a single document.
create_vector_database: creates a database containing the corresponding vector for each document.

# Useful libraries
import numpy as np
import torch
# In recent Keras releases this helper is also exposed as keras.utils.pad_sequences
from keras.preprocessing.sequence import pad_sequences
from tqdm import tqdm
from transformers import BertTokenizer, AutoModelForSequenceClassification

# Load the BERT model
model_path = "bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(model_path, do_lower_case=True)

model = AutoModelForSequenceClassification.from_pretrained(
    model_path,
    output_attentions=False,
    output_hidden_states=True,
)

def create_vector_from_text(tokenizer, model, text, MAX_LEN=510):
    # Tokenize, add the special tokens, and truncate to MAX_LEN tokens
    input_ids = tokenizer.encode(
        text,
        add_special_tokens=True,
        max_length=MAX_LEN,
        truncation=True,
    )

    # Pad the sequence with zeros up to MAX_LEN
    results = pad_sequences([input_ids], maxlen=MAX_LEN, dtype="long",
                            truncating="post", padding="post")

    # Remove the outer list.
    input_ids = results[0]

    # Create attention masks (1 for real tokens, 0 for padding)
    attention_mask = [int(i > 0) for i in input_ids]

    # Convert to tensors.
    input_ids = torch.tensor(input_ids)
    attention_mask = torch.tensor(attention_mask)

    # Add an extra dimension for the "batch" (even though there is only one
    # input in this batch.)
    input_ids = input_ids.unsqueeze(0)
    attention_mask = attention_mask.unsqueeze(0)

    # Put the model in "evaluation" mode, meaning feed-forward operation.
    model.eval()

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        logits, encoded_layers = model(
            input_ids=input_ids,
            token_type_ids=None,
            attention_mask=attention_mask,
            return_dict=False,
        )

    layer_i = 12  # The last BERT layer before the classifier.
    batch_i = 0   # Only one input in the batch.
    token_i = 0   # The first token, corresponding to [CLS]

    # Extract the vector.
    vector = encoded_layers[layer_i][batch_i][token_i]

    # Move to the CPU and convert to numpy ndarray.
    vector = vector.detach().cpu().numpy()

    return vector


def create_vector_database(data):
    # The list of all the vectors
    vectors = []

    # Get overall text data
    source_data = data.abstract.values

    # Loop over all the abstracts and get the embeddings
    for text in tqdm(source_data):
        # Get the embedding
        vector = create_vector_from_text(tokenizer, model, text)

        # Add it to the list
        vectors.append(vector)

    data["vectors"] = vectors
    data["vectors"] = data["vectors"].apply(lambda emb: np.array(emb))
    data["vectors"] = data["vectors"].apply(lambda emb: emb.reshape(1, -1))

    return data

# Create the vector database
vector_database = create_vector_database(source_data)
vector_database.sample(5)

The last line of the code displays five random observations from the vector database, including the new vectors column.
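To make the padding step inside create_vector_from_text more concrete, here is a minimal pure-Python sketch of post-truncating and post-padding a list of token IDs to a fixed length, then deriving the attention mask from it. The helper name pad_and_mask and the token IDs are hypothetical and chosen only for illustration; the sketch assumes the padding ID is 0, as it is for bert-base-uncased.

```python
def pad_and_mask(token_ids, max_len, pad_id=0):
    """Post-truncate and post-pad token_ids to max_len, then build the attention mask."""
    # Keep at most max_len tokens (truncation from the end, i.e. "post").
    trimmed = list(token_ids[:max_len])
    # Append pad_id until the sequence reaches max_len (padding "post").
    padded = trimmed + [pad_id] * (max_len - len(trimmed))
    # Real tokens get 1 and padding positions get 0, mirroring int(i > 0).
    mask = [int(i > 0) for i in padded]
    return padded, mask


# Illustrative token IDs only (assumption: not taken from a real abstract).
ids, mask = pad_and_mask([101, 7592, 2088, 102], max_len=6)
print(ids)   # [101, 7592, 2088, 102, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The mask tells the model which positions to attend to, so the padding zeros do not influence the vector extracted for the document.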

(2) A machine translation Transformer model: used to translate the incoming document into English, because our source documents are in English. Translation is performed only when the document's language is one of the following five: German, French, Japanese, Greek, and Russian. Below is the helper function that implements this logic using the MarianMT model.

from langdetect import detect, DetectorFactory
from transformers import MarianMTModel, MarianTokenizer

# Make language detection deterministic
DetectorFactory.seed = 0

def translate_text(text, text_lang, target_lang='en'):
    # Get the name of the model
    model_name = f"Helsinki-NLP/opus-mt-{text_lang}-{target_lang}"

    # Get the tokenizer
    tokenizer = MarianTokenizer.from_pretrained(model_name)

    # Instantiate the model
    model = MarianMTModel.from_pretrained(model_name)

    # Translation of the text
    formated_text = ">>{}<< {}".format(text_lang, text)

    translation = model.generate(**tokenizer([formated_text], return_tensors="pt", padding=True))

    translated_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translation][0]

    return translated_text
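
The routing rule that decides whether a document is used as is, translated, or rejected can be sketched without loading any model. The function name route_document and its return labels are hypothetical. The five supported languages come from the text above, written here as ISO 639-1 codes, and the checkpoint naming pattern is the one used in translate_text.

```python
SUPPORTED_FOR_TRANSLATION = {"de", "fr", "ja", "el", "ru"}


def route_document(lang_code, target_lang="en"):
    """Return (action, model_name) for a detected language code."""
    if lang_code == target_lang:
        # Already in the language of the source documents: no translation.
        return "use_as_is", None
    if lang_code in SUPPORTED_FOR_TRANSLATION:
        # Build the checkpoint name following the pattern in translate_text.
        return "translate", f"Helsinki-NLP/opus-mt-{lang_code}-{target_lang}"
    # Any other language is outside the scope of this system.
    return "reject", None


for code in ["en", "de", "es"]:
    print(code, route_document(code))
```

Keeping the routing separate from the model call makes it cheap to check which documents would trigger a translation before any checkpoint is downloaded.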

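Plagiarism analyzer

There is plagiarism when the vector of the incoming document is similar, at a certain threshold level, to one of the vectors in the database.

But when are two vectors similar? → When they have the same magnitude and direction.

This definition requires our vectors to have the same magnitude, which can be a problem because the magnitude of a document vector can differ from one document to another, for instance with the document's length. Fortunately, there are several similarity measures that get around this problem. One of them is cosine similarity, which compares only the direction of two vectors, and it is the one we will use here.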
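
To see why cosine similarity is not affected by magnitude, recall that it is the dot product of two vectors divided by the product of their norms, so scaling either vector leaves the score unchanged. The sketch below uses plain Python and made-up three-dimensional vectors; the function name cosine is an illustrative choice and not part of the original code.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot(a, b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]     # same direction, ten times the magnitude
orthogonal = [3.0, 0.0, -1.0]   # dot product with v is 3 + 0 - 3 = 0

print(round(cosine(v, scaled), 6))      # 1.0
print(round(cosine(v, orthogonal), 6))  # 0.0
```

Two vectors pointing the same way score 1 regardless of their lengths, which is the property the plagiarism check relies on.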

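If you are interested in other approaches, you can refer to this excellent piece by James Briggs. He explains how each approach works and its advantages, and guides you through implementing them.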

https://www.pinecone.io/learn/semantic-search/

The plagiarism analysis is performed by the run_plagiarism_analysis function. We first check the document's language with the check_incoming_document function and translate it when necessary.

The final result is a dictionary with four main values:

similarity_score: the score between the incoming article and the most similar existing article in the database.
is_plagiarism: true if the similarity score equals or exceeds the threshold, false otherwise.
most_similar_article: the text of the most similar article.
article_submitted: the article that was submitted for approval.

from sklearn.metrics.pairwise import cosine_similarity

def process_document(text):
    """
    Create a vector for given text and adjust it for cosine similarity search
    """
    text_vect = create_vector_from_text(tokenizer, model, text)
    text_vect = np.array(text_vect)
    text_vect = text_vect.reshape(1, -1)

    return text_vect


def is_plagiarism(similarity_score, plagiarism_threshold):
    # Plagiarism is flagged when the score equals or exceeds the threshold
    return similarity_score >= plagiarism_threshold


def check_incoming_document(incoming_document):
    text_lang = detect(incoming_document)
    language_list = ['de', 'fr', 'el', 'ja', 'ru']

    final_result = ""

    if text_lang == 'en':
        final_result = incoming_document
    elif text_lang not in language_list:
        final_result = None
    else:
        # Translate into English
        final_result = translate_text(incoming_document, text_lang)

    return final_result


def run_plagiarism_analysis(query_text, data, plagiarism_threshold=0.8):
    top_N = 3

    # Check the language of the query/incoming text and translate if required.
    document_translation = check_incoming_document(query_text)

    if document_translation is None:
        print("Only the following languages are supported: English, French, Russian, German, Greek and Japanese")
        exit(-1)
    else:
        # Preprocess the document to get the required vector for similarity analysis
        query_vect = process_document(document_translation)

        # Run similarity search
        data["similarity"] = data["vectors"].apply(lambda x: cosine_similarity(query_vect, x))
        data["similarity"] = data["similarity"].apply(lambda x: x[0][0])

        # Keep the top_N most similar articles, best match first
        similar_articles = data.sort_values(by='similarity', ascending=False)[:top_N]
        # preprocess_data keeps only the abstract column, so no paper ID is selected here
        formated_result = similar_articles[["abstract", "similarity"]].reset_index(drop=True)

        similarity_score = formated_result.iloc[0]["similarity"]
        most_similar_article = formated_result.iloc[0]["abstract"]
        is_plagiarism_bool = is_plagiarism(similarity_score, plagiarism_threshold)

        plagiarism_decision = {
            'similarity_score': similarity_score,
            'is_plagiarism': is_plagiarism_bool,
            'most_similar_article': most_similar_article,
            'article_submitted': query_text,
        }

        return plagiarism_decision
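
The decision step at the end of run_plagiarism_analysis can be looked at in isolation from BERT and pandas. The sketch below is a simplified stand-in, not the function above: it takes precomputed similarity scores (made-up numbers here), picks the best match, and applies the "equals or exceeds the threshold" rule described earlier. The name decide and the sample abstracts are hypothetical.

```python
def decide(query_text, abstracts, scores, plagiarism_threshold=0.8):
    """Pick the most similar abstract and flag plagiarism at or above the threshold."""
    # Index of the highest similarity score.
    best = max(range(len(scores)), key=lambda i: scores[i])
    return {
        "similarity_score": scores[best],
        "is_plagiarism": scores[best] >= plagiarism_threshold,
        "most_similar_article": abstracts[best],
        "article_submitted": query_text,
    }


abstracts = ["abstract A", "abstract B", "abstract C"]
scores = [0.41, 0.97, 0.62]  # made-up cosine similarities for illustration

result = decide("incoming text", abstracts, scores)
print(result["is_plagiarism"], result["most_similar_article"])  # True abstract B
```

Because the comparison is inclusive, a score exactly equal to the threshold is treated as plagiarism, which matches the description of is_plagiarism in the list of returned values.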

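System experimentation

We have covered and implemented all the components of the workflow. Now it is time to test our system on three of the languages it accepts: English, French, and German.

Evaluation

Below are the abstracts of the articles we want to check for plagiarism.

English article

This article is actually an example taken from the source data.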

english_article_to_check = "The need for multidisciplinary research to address today's complex health and environmental challenges has never been greater. The One Health (OH) approach to research ensures that human, animal, and environmental health questions are evaluated in an integrated and holistic manner to provide a more comprehensive understanding of the problem and potential solutions than would be possible with siloed approaches. However, the OH approach is complex, and there is limited guidance available for investigators regarding the practical design and implementation of OH research. In this paper we provide a framework to guide researchers through conceptualizing and planning an OH study. We discuss key steps in designing an OH study, including conceptualization of hypotheses and study aims, identification of collaborators for a multi-disciplinary research team, study design options, data sources and collection methods, and analytical methods. We illustrate these concepts through the presentation of a case study of health impacts associated with land application of biosolids. Finally, we discuss opportunities for applying an OH approach to identify solutions to current global health issues, and the need for cross-disciplinary funding sources to foster an OH approach to research."

# Select an existing article from the database
new_incoming_text = source_data.iloc[0]['abstract']

# Run the plagiarism detection
analysis_result = run_plagiarism_analysis(new_incoming_text, vector_database, plagiarism_threshold=0.8)

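After running the system, we get a similarity score of 1, a 100% match with an existing article. This is expected, because we took exactly the same article from the database.

French article

This article is freely available from the French agriculture website.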

french_article_to_check = """Les Réseaux d’Innovation et de Transfert Agricole (RITA) ont été créés en 2011 pour mieux connecter la recherche et le développement agricole, intra et inter-DOM, avec un objectif d’accompagnement de la diversification des productions locales. Le CGAAER a été chargé d'analyser ce dispositif et de proposer des pistes d'action pour améliorer la chaine Recherche – Formation – Innovation – Développement – Transfert dans les outre-mer dans un contexte d'agriculture durable, au profit de l'accroissement de l'autonomie alimentaire."""

analysis_result = run_plagiarism_analysis(french_article_to_check, vector_database, plagiarism_threshold=0.8)
analysis_result

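In this case, there is no plagiarism because the similarity score is below the threshold.

German article

Suppose someone liked the fifth article in the database so much that they decided to translate it into German. Let's see how the system judges this article.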

german_article_to_check = """Derzeit ist eine Reihe strukturell und funktionell unterschiedlicher temperaturempfindlicher Elemente wie RNA-Thermometer bekannt, die eine Vielzahl biologischer Prozesse in Bakterien, einschließlich der Virulenz, steuern. Auf der Grundlage einer Computer- und thermodynamischen Analyse der vollständig sequenzierten Genome von 25 Salmonella enterica-Isolaten wurden ein Algorithmus und Kriterien für die Suche nach potenziellen RNA-Thermometern entwickelt. Er wird es ermöglichen, die Suche nach potentiellen Riboschaltern im Genom anderer gesellschaftlich wichtiger Krankheitserreger durchzuführen. Für S. enterica wurden neben dem bekannten 4U-RNA-Thermometer vier Hairpin-Loop-Strukturen identifiziert, die wahrscheinlich als weitere RNA-Thermometer fungieren. Sie erfüllen die notwendigen und hinreichenden Bedingungen für die Bildung von RNA-Thermometern und sind hochkonservative nichtkanonische Strukturen, da diese hochkonservativen Strukturen im Genom aller 25 Isolate von S. enterica gefunden wurden. Die Hairpins, die eine kreuzförmige Struktur in der supergewickelten pUC8-DNA bilden, wurden mit Hilfe der Rasterkraftmikroskopie sichtbar gemacht."""

analysis_result = run_plagiarism_analysis(german_article_to_check, vector_database, plagiarism_threshold=0.8)
analysis_result

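The similarity reaches 97%, and the model caught it! The result is very impressive. This article is definitely a plagiarized work.

Conclusion

Congratulations! You now have all the tools you need to build a more robust plagiarism detection system, using BERT and machine translation models combined with cosine similarity.

Thanks for reading!

Additional resources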

https://huggingface.co/docs/transformers/model_doc/marian

https://github.com/keitazoumana/Medium-Articles-Notebooks/blob/main/Plagiarism_detection.ipynb