1. 不引入ES,如何利用 MySQL 實現(xiàn)模糊匹配

        共 11042字,需瀏覽 23分鐘

         ·

        2024-08-02 07:40

        關(guān)注我們,設(shè)為星標,每天7:40不見不散,架構(gòu)路上與您共享

        回復(fù)架構(gòu)師獲取資源


        大家好,我是你們的朋友架構(gòu)君,一個會寫代碼吟詩的架構(gòu)師。


        1. 業(yè)務(wù)場景概述

        目標是實現(xiàn)一個公司的申請審批流程,整個業(yè)務(wù)流程涉及到兩種角色,分別為商務(wù)角色與管理員角色。整個流程如下圖所示:

        核心流程總結(jié)為一句話:商務(wù)角色申請?zhí)砑庸竞笥晒芾韱T進行審批。

        商務(wù)在添加公司時,可能為了方便,直接填寫公司簡稱,而公司全稱可能之前已經(jīng)被添加過了,為了防止添加重復(fù)的公司,所以管理員在針對公司信息審批之前,需要查看以往添加的公司信息里有無同一個公司。

        2. 實現(xiàn)思路

        以上是一個業(yè)務(wù)場景的大概介紹。從技術(shù)層面需要考慮實現(xiàn)的功能點:

        • ? 分詞

        • ? 與庫里已有數(shù)據(jù)進行匹配

        • ? 按照匹配度對結(jié)果進行排序

        分詞功能有現(xiàn)成的分詞器,所以整個需求的核心重點在于如何與數(shù)據(jù)庫中的數(shù)據(jù)匹配并按照匹配度排序。

        3. 模糊匹配技術(shù)選型

        • ? 方案一:引入ES

        • ? 方案二:利用MySQL實現(xiàn)

        本系統(tǒng)規(guī)模較小,單純?yōu)榱藢崿F(xiàn)這個功能引入ES成本較大,還要涉及到數(shù)據(jù)同步等問題,系統(tǒng)復(fù)雜性會提高,所以盡量使用MySQL已有的功能進行實現(xiàn)。

        MySQL提供了以下三種模糊搜索的方式:

        • like匹配: 要求模式串與整個目標字段完全匹配;

        • RegExp正則匹配: 要求目標字段包含模式串即可;

        • Fulltext全文索引: 在字段類型為CHAR、VARCHAR、TEXT的列上創(chuàng)建全文索引,執(zhí)行SQL進行查詢。

        針對于上述業(yè)務(wù)場景,對相關(guān)技術(shù)進行優(yōu)劣分析:

        • like匹配: 無法滿足需求,所以pass;

        • 全文索引: 可定制性差,不支持任意匹配查詢,pass;

        • 正則匹配: 可實現(xiàn)任意模式匹配,缺點在于執(zhí)行效率不如全文索引。

        針對于這個場景,記錄數(shù)目相對來說沒有那么多,所以對于效率稍低的結(jié)果可以接受,因此技術(shù)選型方面采用RegExp正則匹配來實現(xiàn)模糊匹配的需求。

        4. 實現(xiàn)效果展示

        5. 核心代碼

        整個邏輯基于 提取公司名稱關(guān)鍵信息 -->分詞 --> 匹配 三個核心步驟。

        5.1 提取公司關(guān)鍵信息

        對輸入的公司名稱去除無用信息,保留關(guān)鍵信息。這里的無用信息指的是地名,圓括號,以及集團,股份,有限等。

        匹配前處理公司名稱

        /**
         * 匹配前去除公司名稱的無意義信息
         * @param targetCompanyName
         * @return
         */

        privateStringformatCompanyName(String targetCompanyName){

        Stringregex="(?<province>[^省]+自治區(qū)|.*?省|.*?行政區(qū)|.*?市)"+
        "?(?<city>[^市]+自治州|.*?地區(qū)|.*?行政單位|.+盟|市轄區(qū)|.*?市|.*?縣)"+
        "?(?<county>[^(區(qū)|市|縣|旗|島)]+區(qū)|.*?市|.*?縣|.*?旗|.*?島)"+
        "?(?<village>.*)";
        Matchermatcher=Pattern.compile(regex).matcher(targetCompanyName);
        while(matcher.find()){
        Stringprovince= matcher.group("province");
                log.info("province:{}",province);
        if(StringUtils.isNotBlank(province)&& targetCompanyName.contains(province)){
                    targetCompanyName = targetCompanyName.replace(province,"");
        }
                log.info("處理完省份的公司名稱:{}",targetCompanyName);
        Stringcity= matcher.group("city");
                log.info("city:{}",city);
        if(StringUtils.isNotBlank(city)&& targetCompanyName.contains(city)){
                    targetCompanyName = targetCompanyName.replace(city,"");
        }
                log.info("處理完城市的公司名稱:{}",targetCompanyName);
        Stringcounty= matcher.group("county");
                log.info("county:{}",county);
        if(StringUtils.isNotBlank(county)&& targetCompanyName.contains(county)){
                    targetCompanyName = targetCompanyName.replace(county,"");
        }
                log.info("處理完區(qū)縣級的公司名稱:{}",targetCompanyName);
        }
        String[][] address =AddressUtil.ADDRESS;
        for(String[] city: address){
        for(String b : city ){
        if(targetCompanyName.contains(b)){
                        targetCompanyName = targetCompanyName.replace(b,"");
        }
        }
        }
            log.info("處理后的公司名稱:{}",targetCompanyName);
        return targetCompanyName;
        }

        地名工具類

        public classAddressUtil{
        publicstaticfinalString[][] ADDRESS ={
        {"北京"},
        {"天津"},
        {"安徽","安慶","蚌埠","亳州","巢湖","池州","滁州","阜陽","合肥","淮北","淮南","黃山","六安","馬鞍山","宿州","銅陵","蕪湖","宣城"},
        {"澳門"},
        {"香港"},
        {"福建","福州","龍巖","南平","寧德","莆田","泉州","廈門","漳州"},
        {"甘肅","白銀","定西","甘南藏族自治州","嘉峪關(guān)","金昌","酒泉","蘭州","臨夏回族自治州","隴南","平?jīng)?,"慶陽","天水","武威","張掖"},
        {"廣東","潮州","東莞","佛山","廣州","河源","惠州","江門","揭陽","茂名","梅州","清遠","汕頭","汕尾","韶關(guān)","深圳","陽江","云浮","湛江","肇慶","中山","珠海"},
        {"廣西","百色","北海","崇左","防城港","貴港","桂林","河池","賀州","來賓","柳州","南寧","欽州","梧州","玉林"},
        {"貴州","安順","畢節(jié)地區(qū)","貴陽","六盤水","黔東南苗族侗族自治州","黔南布依族苗族自治州","黔西南布依族苗族自治州","銅仁地區(qū)","遵義"},
        {"海南","???,"三亞","直轄縣級行政區(qū)劃"},
        {"河北","保定","滄州","承德","邯鄲","衡水","廊坊","秦皇島","石家莊","唐山","邢臺","張家口"},
        {"河南","安陽","鶴壁","焦作","開封","洛陽","漯河","南陽","平頂山","濮陽","三門峽","商丘","新鄉(xiāng)","信陽","許昌","鄭州","周口","駐馬店"},
        {"黑龍江","大慶","大興安嶺地區(qū)","哈爾濱","鶴崗","黑河","雞西","佳木斯","牡丹江","七臺河","齊齊哈爾","雙鴨山","綏化","伊春"},
        {"湖北","鄂州","恩施土家族苗族自治州","黃岡","黃石","荊門","荊州","十堰","隨州","武漢","咸寧","襄樊","孝感","宜昌"},
        {"湖南","長沙","常德","郴州","衡陽","懷化","婁底","邵陽","湘潭","湘西土家族苗族自治州","益陽","永州","岳陽","張家界","株洲"},
        {"吉林","白城","白山","長春","吉林","遼源","四平","松原","通化","延邊朝鮮族自治州"},
        {"江蘇","常州","淮安","連云港","南京","南通","蘇州","宿遷","泰州","無錫","徐州","鹽城","揚州","鎮(zhèn)江"},
        {"江西","撫州","贛州","吉安","景德鎮(zhèn)","九江","南昌","萍鄉(xiāng)","上饒","新余","宜春","鷹潭"},
        {"遼寧","鞍山","本溪","朝陽","大連","丹東","撫順","阜新","葫蘆島","錦州","遼陽","盤錦","沈陽","鐵嶺","營口"},
        {"內(nèi)蒙古","阿拉善盟","巴彥淖爾","包頭","赤峰","鄂爾多斯","呼和浩特","呼倫貝爾","通遼","烏海","烏蘭察布","錫林郭勒盟","興安盟"},
        {"寧夏回族","固原","石嘴山","吳忠","銀川","中衛(wèi)"},
        {"青海","果洛藏族自治州","海北藏族自治州","海東地區(qū)","海南藏族自治州","海西蒙古族藏族自治州","黃南藏族自治州","西寧","玉樹藏族自治州"},
        {"山東","濱州","德州","東營","菏澤","濟南","濟寧","萊蕪","聊城","臨沂","青島","日照","泰安","威海","濰坊","煙臺","棗莊","淄博"},
        {"山西","長治","大同","晉城","晉中","臨汾","呂梁","朔州","太原","忻州","陽泉","運城"},
        {"陜西","安康","寶雞","漢中","商洛","銅川","渭南","西安","咸陽","延安","榆林"},
        {"上海"},
        {"四川","阿壩藏族羌族自治州","巴中","成都","達州","德陽","甘孜藏族自治州","廣安","廣元","樂山","涼山彝族自治州","瀘州","眉山","綿陽","內(nèi)江","南充","攀枝花","遂寧","雅安","宜賓","資陽","自貢"},
        {"西藏","阿里地區(qū)","昌都地區(qū)","拉薩","林芝地區(qū)","那曲地區(qū)","日喀則地區(qū)","山南地區(qū)"},
        {"新疆維吾爾","阿克蘇地區(qū)","阿勒泰地區(qū)","巴音郭楞蒙古自治州","博爾塔拉蒙古自治州","昌吉回族自治州","哈密地區(qū)","和田地區(qū)","喀什地區(qū)","克拉瑪依","克孜勒蘇柯爾克孜自治州","塔城地區(qū)","吐魯番地區(qū)","烏魯木齊","伊犁哈薩克自治州","直轄縣級行政區(qū)劃"},
        {"云南","保山","楚雄彝族自治州","大理白族自治州","德宏傣族景頗族自治州","迪慶藏族自治州","紅河哈尼族彝族自治州","昆明","麗江","臨滄","怒江僳僳族自治州","普洱","曲靖","文山壯族苗族自治州","西雙版納傣族自治州","玉溪","昭通"},
        {"浙江","杭州","湖州","嘉興","金華","麗水","寧波","衢州","紹興","臺州","溫州","舟山"},
        {"重慶"},
        {"臺灣","臺北","高雄","基隆","臺中","臺南","新竹","嘉義"},
        };
        }

        5.2 分詞相關(guān)代碼

        pom文件:引入IK分詞器相關(guān)依賴

         <!-- ikAnalyzer 中文分詞器  -->
        <dependency>
        <groupId>com.janeluo</groupId>
        <artifactId>ikanalyzer</artifactId>
        <version>2012_u6</version>
        <exclusions>
        <exclusion>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-core</artifactId>
        </exclusion>
        <exclusion>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        </exclusion>
        <exclusion>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-analyzers-common</artifactId>
        </exclusion>
        </exclusions>
        </dependency>

        <!--  lucene-queryParser 查詢分析器模塊 -->
        <dependency>
        <groupId>org.apache.lucene</groupId>
        <artifactId>lucene-queryparser</artifactId>
        <version>7.3.0</version>
        </dependency>

        IKAnalyzerSupport類:用于配置分詞器

        @Slf4j
        publicclassIKAnalyzerSupport{
        /**
             * IK分詞
             * @param target
             * @return
             */

        publicstaticList<String>iKSegmenterToList(String target)throwsException{
        if(StringUtils.isEmpty(target)){
        returnnewArrayList();
        }
        List<String> result =newArrayList<>();
        StringReadersr=newStringReader(target);
        // false:關(guān)閉智能分詞 (對分詞的精度影響較大)
        IKSegmenterik=newIKSegmenter(sr,true);
        Lexeme lex;
        while((lex=ik.next())!=null){
        StringlexemeText= lex.getLexemeText();
                    result.add(lexemeText);
        }
        return result;
        }
        }

        ServiceImpl類:進行分詞處理

         /**
         * 對目標公司名稱進行分詞
         * @param targetCompanyName
         * @return
         */

        privateStringsplitWord(String targetCompanyName){
            log.info("對處理后端公司名稱進行分詞");

        List<String> splitWord =newArrayList<>();
        Stringresult= targetCompanyName;
        try{
                splitWord = iKSegmenterToList(targetCompanyName);
                result =  splitWord.stream().map(String::valueOf).distinct().collect(Collectors.joining("|"));
                log.info("分詞結(jié)果:{}",result);
        }catch(Exception e){
                log.error("分詞報錯:{}",e.getMessage());
        }
        return result;
        }

        5.3 匹配

        ServiceImpl類:匹配核心代碼

        public JsonResultmatchCompanyName(CompanyDTO companyDTO, String accessToken, String localIp){
        // 對公司名稱進行處理
        StringsourceCompanyName= companyDTO.getCompanyName();
        StringtargetCompanyName= sourceCompanyName;
            log.info("處理前公司名稱:{}",targetCompanyName);
        // 處理圓括號
            targetCompanyName = targetCompanyName.replaceAll("[(]|[)]|[(]|[)]","");
        // 處理公司相關(guān)關(guān)鍵詞
            targetCompanyName = targetCompanyName.replaceAll("[(集團|股份|有限|責任|分公司)]","");

        if(!targetCompanyName.contains("銀行")){
        // 去除行政區(qū)域
                targetCompanyName = formatCompanyName(targetCompanyName);
        }
        // 分詞
        StringsplitCompanyName= splitWord(targetCompanyName);
        //  匹配
        List<Company> matchedCompany = companyRepository.queryMatchCompanyName(splitCompanyName,targetCompanyName);

        List<String> result =newArrayList();
        for(Company companyInfo : matchedCompany){
                result.add(companyInfo.getCompanyName());
        if(companyDTO.getCompanyId().equals(companyInfo.getCompanyId())){
                    result.remove(companyInfo.getCompanyName());
        }
        }
        returnJsonResult.successResult(result);
        }

        Repository類:編寫SQL語句

        /**  
        * 模糊匹配公司名稱  
        @param companyNameRegex 分詞后的公司名稱
        @param companyName 分詞前的公司名稱  
        @return  
        */

        @Query(value = 
        "SELECT * FROM company WHERE isDeleted = '0' and companyName REGEXP ?1 
        ORDER BY length(REPLACE(companyName,?2,''))/length(companyName) ",
        nativeQuery = true)
          
        List<Company> queryMatchCompanyName(String companyNameRegex,String companyName);

        按照匹配度排序這個功能點,LENGTH(companyName)返回companyName的長度,LENGTH(REPLACE(companyName, ?2, ''))計算出companyName中關(guān)鍵詞出現(xiàn)的次數(shù)。通過這種方式,我們可以根據(jù)匹配程度進行排序,匹配次數(shù)越多的公司名稱排序越靠前。

        來源:juejin.cn/post/7340574992256466953


        到此文章就結(jié)束了。Java架構(gòu)師必看一個集公眾號、小程序、網(wǎng)站(3合1的文章平臺,給您架構(gòu)路上一臂之力)。如果今天的文章對你在進階架構(gòu)師的路上有新的啟發(fā)和進步,歡迎轉(zhuǎn)發(fā)給更多人。歡迎加入架構(gòu)師社區(qū)技術(shù)交流群,眾多大咖帶你進階架構(gòu)師,在后臺回復(fù)“加群”即可入群。



        這些年小編給你分享過的干貨


        0.ChatGPT 4o 國內(nèi)直接用 ?。。?/strong>

        1.idea2024.1.4永久激活碼(親測可用)

        2.優(yōu)質(zhì)ERP系統(tǒng)帶進銷存財務(wù)生產(chǎn)功能(附源碼)

        3.優(yōu)質(zhì)SpringBoot帶工作流管理項目(附源碼)

        4.最好用的OA系統(tǒng),拿來即用(附源碼)

        5.SBoot+Vue外賣系統(tǒng)前后端都有(附源碼

        6.SBoot+Vue可視化大屏拖拽項目(附源碼)


        轉(zhuǎn)發(fā)在看就是最大的支持??

        瀏覽 84
        點贊
        評論
        收藏
        分享

        手機掃一掃分享

        分享
        舉報
        評論
        圖片
        表情
        推薦
        點贊
        評論
        收藏
        分享

        手機掃一掃分享

        分享
        舉報
          
          

            1. 青娱乐亚洲精品 | 特级婬片A片AAA毛片咕噜咕噜 | 国产精品电影一区二区 | 免费观看操逼 | 欧美国产精品一区二区三区 |