Hive CTE | with as 隱藏的秘密!
CTE
?Hive with as 語(yǔ)句
Hive with 語(yǔ)句默認(rèn)是不把數(shù)據(jù)進(jìn)行物化的,相當(dāng)于視圖,定義了一個(gè)SQL片段,每次使用時(shí)候可以將該定義的SQL片段拿出來(lái)再被使用,該SQL片段可以理解為一個(gè)變量,主要用途簡(jiǎn)化SQL,讓SQL更簡(jiǎn)潔,替換子查詢(xún),方便定位問(wèn)題。該子句緊跟在SELECT或INSERT關(guān)鍵字之前,可以在Hive SELECT,INSERT,CREATE TABLE AS SELECT或CREATE VIEW AS SELECT語(yǔ)句中使用一個(gè)或多個(gè)CTE 。
with as 也叫做子查詢(xún)部分,首先定義一個(gè)sql片段,該sql片段會(huì)被整個(gè)sql語(yǔ)句所用到,為了讓sql語(yǔ)句的可讀性更高些,作為提供數(shù)據(jù)的部分,也常常用在union等集合操作中。
注意:hive with as不像oracle等數(shù)據(jù)庫(kù)會(huì)將數(shù)據(jù)緩存到內(nèi)存中,只是定義了一個(gè)SQL代碼片段,方便下次使用,使代碼更簡(jiǎn)潔優(yōu)美!??!? 其最大的好處就是適當(dāng)?shù)奶岣叽a可讀性,而且如果with子句在后面要多次使用到,這可以大大的簡(jiǎn)化SQL,但不一定提高執(zhí)行效率。
在高版本的SQL中,with語(yǔ)句進(jìn)行了物化,默認(rèn)是不開(kāi)啟的,這個(gè)參數(shù)為
hive.optimize.cte.materialize.threshold
該參數(shù)默認(rèn)情況下是-1,是關(guān)閉的,當(dāng)開(kāi)啟(大于0),比如設(shè)置為2,則如果with..as語(yǔ)句被引用2次及以上時(shí),會(huì)把with..as語(yǔ)句生成的table物化,從而做到with..as語(yǔ)句只執(zhí)行一次,來(lái)提高效率。
物化實(shí)際上是先相對(duì)于視圖而言的,視圖創(chuàng)建的是一個(gè)虛擬表,只是定義了一個(gè)SQL片段,并沒(méi)有實(shí)體表的創(chuàng)建,只是概念性的東西,那么物化就是將這種概念性的東西進(jìn)行實(shí)體化,如數(shù)據(jù)進(jìn)行緩存,存放在內(nèi)存,數(shù)據(jù)進(jìn)行落盤(pán)等,類(lèi)似于計(jì)算的中間結(jié)果進(jìn)行緩存或落到磁盤(pán),這樣每次計(jì)算的時(shí)候可以從該中間結(jié)果中取數(shù),這樣才可以達(dá)到一次分析,多次使用的目的。(這里我們可以類(lèi)比Spark RDD的緩存概念:RDD 緩存指的是將 RDD 以緩存的形式物化到內(nèi)存或磁盤(pán)的過(guò)程)
Hive中具體源碼如下:

從源碼看,在獲取元數(shù)據(jù)時(shí),會(huì)進(jìn)行判斷,判斷配置參數(shù)大于0且cte的引用次數(shù)超過(guò)配置的參數(shù)時(shí)候則開(kāi)啟cte的物化。
1 CTE 的使用
1 命令格式
WITHcte_name AS(cte_query)[,cte_name2 AS(cte_query2),……]
參數(shù)說(shuō)明
cte_name:CTE的名稱(chēng),不能與當(dāng)前WITH子句中的其他CTE的名稱(chēng)相同。查詢(xún)中任何使用到cte_name標(biāo)識(shí)符的地方,均指CTE。
cte_query:一個(gè)SELECT語(yǔ)句。它產(chǎn)生的結(jié)果集用于填充CTE。
2 示例
示例1:
INSERT OVERWRITE TABLE srcp PARTITION (p='abc')SELECT * FROM (SELECT a.key, b.valueFROM (SELECT * FROM src WHERE key IS NOT NULL ) aJOIN (SELECT * FROM src2 WHERE value > 0 ) bON a.key = b.key) cUNION ALLSELECT * FROM (SELECT a.key, b.valueFROM (SELECT * FROM src WHERE key IS NOT NULL ) aLEFT OUTER JOIN (SELECT * FROM src3 WHERE value > 0 ) bON a.key = b.key AND b.key IS NOT NULL)d;
頂層的UNION兩側(cè)各為一個(gè)JOIN,JOIN的左表是相同的查詢(xún)。通過(guò)寫(xiě)子查詢(xún)的方式,只能重復(fù)這段代碼。
使用CTE的方式重寫(xiě)以上語(yǔ)句。
witha as (select * from src where key is not null),b as (select * from src2 where value>0),c as (select * from src3 where value>0),d as (select a.key,b.value from a join b on a.key=b.key),e as (select a.key,c.value from a left outer join c on a.key=c.key and c.key is not null)insert overwrite table srcp partition (p='abc')select * from d union all select * from e;
重寫(xiě)后,a對(duì)應(yīng)的子查詢(xún)只需寫(xiě)一次,便可在后面進(jìn)行重用。您可以在CTE的WITH子句中指定多個(gè)子查詢(xún),像使用變量一樣在整個(gè)語(yǔ)句中反復(fù)重用。除重用外,不必反復(fù)嵌套。
示例2:
對(duì)于一些比較復(fù)雜的計(jì)算任務(wù),為了避免過(guò)多的JOIN,通常會(huì)先把一些需要提取的部分?jǐn)?shù)據(jù)使用臨時(shí)表或是CTE的形式在主要查詢(xún)區(qū)塊前進(jìn)行提取。
臨時(shí)表的作法:
CREATE TEMPORARY TABLE table_1 ASSELECTcolumnsFROM table A;CREATE TEMPORARY table_2 ASSELECTcolumnsFROM table B;SELECTtable_1.columns,table_2.columns,c.columnsFROM table C JOIN table_1JOIN table_2;
CTE的作法:
-- 注意Hive、Impala支持這種語(yǔ)法,低版本的MySQL不支持(高版本支持)WITH employee_by_title_count AS (SELECTt.name as job_title, COUNT(e.id) as amount_of_employeesFROM employees eJOIN job_titles t on e.job_title_id = t.idGROUP BY 1),salaries_by_title AS (SELECTname as job_title, salaryFROM job_titles)SELECT *FROM employee_by_title_count eJOIN salaries_by_title s ON s.job_title = e.job_title
可以看到TEMP TABLE和CTE WITH的用法其實(shí)非常類(lèi)似,目的都是為了讓你的Query更加一目了然且優(yōu)雅簡(jiǎn)潔。很多人習(xí)慣將所有的Query寫(xiě)在單一的區(qū)塊里面,用過(guò)多的JOIN或SUBQUERY,導(dǎo)致最后邏輯丟失且自己也搞不清楚寫(xiě)到哪里,適時(shí)的使用TEMP TABLE和CTE作為輔助,絕對(duì)是很加分的。
示例3:優(yōu)化子查詢(xún),方便維護(hù),代碼更簡(jiǎn)潔
with cte as (SELECT gw_id,sensor_id,stddev(temp) over(PARTITION by gw_id,sensor_id) as stddev_temp,stddev(humi) over(PARTITION by gw_id,sensor_id) as stddev_humi,avg(temp) over(PARTITION by gw_id,sensor_id) as avg_temp,avg(humi) over(PARTITION by gw_id,sensor_id) as avg_humiFROM phmdwdb.dwd_iot_phm_trackcir_envwhere from_unixtime(cast(substr(msg_time,1,10) AS BIGINT),'yyyy-MM-dd') = '2020-11-20')select gw_id,sensor_id,3*stddev_temp+avg_temp as temp_std_up,3*stddev_humi+avg_humi as humi_std_up,abs(3*stddev_temp-avg_temp) as temp_std_dn,abs(3*stddev_humi-avg_humi) as temp_std_dnfrom cte
+----------------------+----------------------+--------------------+--------------------+---------------------+--------------------+--+| gw_id | sensor_id | temp_std_up | humi_std_up | temp_std_dn | temp_std_dn |+----------------------+----------------------+--------------------+--------------------+---------------------+--------------------+--+| 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 || 1263740609944641536 | 1329016193440047104 | 22.53016020309899 | 76.27771107659711 | 16.851530077439218 | 64.75383829654604 |
示例4:CTE in Views, CTAS, and Insert Statements
-- insert examplecreate table s1 like src;with q1 as ( select key, value from src where key = '5')from q1insert overwrite table s1select *;-- ctas examplecreate table s2 aswith q1 as ( select key from src where key = '4')select * from q1;-- view examplecreate view v1 aswith q1 as ( select key from src where key = '5')select * from q1;select * from v1;-- view example, name collisioncreate view v1 aswith q1 as ( select key from src where key = '5')select * from q1;with q1 as ( select key from src where key = '4')select * from v1;
?示例5:CET做遞歸查詢(xún)
#!/bin/bashinput_para="hive"hive_home='/usr/idp/current/hive-client/bin';if [ "$1" != "" ];theninput_para=$1fi;option=`echo ${input_para} | awk -F '_' '{print $1}' | sed s/[[:space:]]//g`if [ "$option" = "beeline" ];thenhive_addr=`hadoop fs -cat /phm/JTTL_ETL_COMMON/etl_process.properties | grep hive_addr | awk -F '=' '{print $2}' | sed s/[[:space:]]//g`hive_url="${hive_addr}/phmdwdb"cd $hive_homebeeline -u $hive_url \-hivevar start_day='2020-01-01' \-hivevar end_day='2025-12-31' \-hivevar timeDimTable='phmdwdb.dim_phm_date' \-f /home/centos/phm/JINI_GLOBAL_CREATETABLE/hive_dim_phm_date_query.sqlfiif [ "$option" = "hive" ];thenhive -hivevar start_day='2018-01-01' \-hivevar end_day='2025-12-31' \-hivevar timeDimTable='phmdwdb.dim_phm_date' \-f /home/centos/phm/JINI_GLOBAL_CREATETABLE/hive_dim_phm_date_query.sqlfi具體遞歸查詢(xún)SQL如下:drop table ${timeDimTable};create table if not exists ${timeDimTable} aswith dates as (select date_add("${start_day}", a.pos) as dfrom (select posexplode(split(repeat("o", datediff("${end_day}", "${start_day}")), "o"))) a),hours as (select a.pos as hfrom (select posexplode(split(repeat("o", 23), "o"))) a)selectfrom_unixtime(unix_timestamp(cast(d as timestamp)) + (h * 3600)) as dtime, d as dday, weekofyear(d) as dweek, date_format(d, 'u') as ddayofweek, month(d) as dmonth, year(d) as dyear, h as dhourfrom datesjoin hourssort by dtime;
3 CTE作用 小結(jié)
(1)可以復(fù)用公共代碼塊,減少表的 讀取次數(shù),降低IO 提高性能。如優(yōu)化join,優(yōu)化union 語(yǔ)句,優(yōu)化子查詢(xún)。將公共語(yǔ)句提前到select語(yǔ)句之前,達(dá)到一次查詢(xún)(讀),多次使用,目的是減少讀的次數(shù)。注意hive必須開(kāi)啟CTE物化的參數(shù)才起作用,如果沒(méi)開(kāi)啟,表還是被重復(fù)讀取,達(dá)不到一次查詢(xún)多次使用的目的
(2)提高代碼的可讀性:使代碼更簡(jiǎn)潔,便于維護(hù)。如將子查詢(xún)抽出來(lái)以后,放到with語(yǔ)句,可方便定位,維護(hù)代碼,代碼的可讀性增強(qiáng)。
(3)做遞歸查詢(xún),進(jìn)行迭代計(jì)算。
———————————————
原文:https://blog.csdn.net/godlovedaniel/article/details/115480115
