1. <strong id="7actg"></strong>
    2. <table id="7actg"></table>

    3. <address id="7actg"></address>
      <address id="7actg"></address>
      1. <object id="7actg"><tt id="7actg"></tt></object>

        使用 SQL 窗口函數(shù)進行增長數(shù)據(jù)分析

        共 5897字,需瀏覽 12分鐘

         ·

        2020-09-05 22:25

        點擊上方數(shù)據(jù)管道”,選擇“置頂星標”公眾號

        干貨福利,第一時間送達


        本文轉(zhuǎn)自:大數(shù)據(jù)技術(shù)與數(shù)倉

        本文會從一個商務(wù)分析案例入手,說明SQL窗口函數(shù)的使用方式。通過本文的5個需求分析,可以看出SQL窗口函數(shù)的功能十分強大,不僅能夠使我們編寫的SQL邏輯更加清晰,而且在某種程度上可以簡化需求開發(fā)。

        數(shù)據(jù)準備

        本文主要分析只涉及一張訂單表orders,操作過程在Hive中完成,具體數(shù)據(jù)如下:

        --?建表
        CREATE?TABLE?orders(
        ????order_id?int,
        ????customer_id?string,
        ????city?string,
        ????add_time?string,
        ????amount?decimal(10,2));

        --?準備數(shù)據(jù)??????????????????????????????
        INSERT?INTO?orders?VALUES
        (1,"A","上海","2020-01-01?00:00:00.000000",200),
        (2,"B","上海","2020-01-05?00:00:00.000000",250),
        (3,"C","北京","2020-01-12?00:00:00.000000",200),
        (4,"A","上海","2020-02-04?00:00:00.000000",400),
        (5,"D","上海","2020-02-05?00:00:00.000000",250),
        (5,"D","上海","2020-02-05?12:00:00.000000",300),
        (6,"C","北京","2020-02-19?00:00:00.000000",300),
        (7,"A","上海","2020-03-01?00:00:00.000000",150),
        (8,"E","北京","2020-03-05?00:00:00.000000",500),
        (9,"F","上海","2020-03-09?00:00:00.000000",250),
        (10,"B","上海","2020-03-21?00:00:00.000000",600);

        需求1:收入增長

        在業(yè)務(wù)方面,第m1個月的收入增長計算如下:100 *(m1-m0)/ m0

        其中,m1是給定月份的收入,m0是上個月的收入。因此,從技術(shù)上講,我們需要找到每個月的收入,然后以某種方式將每個月的收入與上一個收入相關(guān)聯(lián),以便進行上述計算。計算當時如下:

        WITH
        monthly_revenue?as?(
        ????SELECT
        ????trunc(add_time,'MM')?as?month,
        ????sum(amount)?as?revenue
        ????FROM?orders
        ????GROUP?BY?1
        )
        ,prev_month_revenue?as?(
        ????SELECT?
        ????month,
        ????revenue,
        ????lag(revenue)?over?(order?by?month)?as?prev_month_revenue?--?上一月收入
        ????FROM?monthly_revenue
        )
        SELECT?
        ??month,
        ??revenue,
        ??prev_month_revenue,
        ??round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1)?as?revenue_growth
        FROM?prev_month_revenue
        ORDER?BY?1

        結(jié)果輸出

        monthrevenueprev_month_revenuerevenue_growth
        2020-01-01650NULLNULL
        2020-02-01125065092.3
        2020-03-011500125020

        我們還可以按照按城市分組進行統(tǒng)計,查看某個城市某個月份的收入增長情況

        WITH
        monthly_revenue?as?(
        ????SELECT
        ?????trunc(add_time,'MM')?as?month,
        ????city,
        ????sum(amount)?as?revenue
        ????FROM?orders
        ????GROUP?BY?1,2
        )
        ,prev_month_revenue?as?(
        ????SELECT?
        ????month,
        ????city,
        ????revenue,
        ????lag(revenue)?over?(partition?by?city?order?by?month)?as?prev_month_revenue
        ????FROM?monthly_revenue
        )
        SELECT?
        month,
        city,
        revenue,
        round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1)?as?revenue_growth
        FROM?prev_month_revenue
        ORDER?BY?2,1

        結(jié)果輸出

        monthcityrevenuerevenue_growth
        2020-01-01上海450NULL
        2020-02-01上海950111.1
        2020-03-01上海10005.3
        2020-01-01北京200NULL
        2020-02-01北京30050
        2020-03-01北京50066.7

        需求2:累計求和

        累計匯總,即當前元素和所有先前元素的總和,如下面的SQL:

        WITH
        monthly_revenue?as?(
        ????SELECT
        ????trunc(add_time,'MM')?as?month,
        ????sum(amount)?as?revenue
        ????FROM?orders
        ????GROUP?BY?1
        )
        SELECT?
        month,
        revenue,
        sum(revenue)?over?(order?by?month?rows?between?unbounded?preceding?and?current?row)?as?running_total
        FROM?monthly_revenue
        ORDER?BY?1

        結(jié)果輸出

        monthrevenuerunning_total
        2020-01-01650650
        2020-02-0112501900
        2020-03-0115003400

        我們還可以使用下面的組合方式進行分析,SQL如下:

        SELECT
        ???order_id,
        ???customer_id,
        ???city,
        ???add_time,
        ???amount,
        ???sum(amount)?over?()?as?amount_total,?--?所有數(shù)據(jù)求和
        ???sum(amount)?over?(order?by?order_id?rows?between?unbounded?preceding?and?current?row)?as?running_sum,?--?累計求和
        ???sum(amount)?over?(partition?by?customer_id?order?by?add_time?rows?between?unbounded????preceding?and?current?row)?as?running_sum_by_customer,?
        ???avg(amount)?over?(order?by?add_time?rows?between?5?preceding?and?current?row)?as??trailing_avg?--?滾動求平均
        FROM?orders
        ORDER?BY?1

        結(jié)果輸出

        order_idcustomer_idcityadd_timeamountamount_totalrunning_sumrunning_sum_by_customertrailing_avg
        1A上海2020-01-01 00:00:00.0000002003400200200200
        2B上海2020-01-05 00:00:00.0000002503400450250225
        3C北京2020-01-12 00:00:00.0000002003400650200216.666667
        4A上海2020-02-04 00:00:00.00000040034001050600262.5
        5D上海2020-02-05 00:00:00.00000025034001300250260
        5D上海2020-02-05 12:00:00.00000030034001600550266.666667
        6C北京2020-02-19 00:00:00.00000030034001900500283.333333
        7A上海2020-03-01 00:00:00.00000015034002050750266.666667
        8E北京2020-03-05 00:00:00.00000050034002550500316.666667
        9F上海2020-03-09 00:00:00.00000025034002800250291.666667
        10B上海2020-03-21 00:00:00.00000060034003400850

        需求3:處理重復數(shù)據(jù)

        從上面的數(shù)據(jù)可以看出,存在兩條重復的數(shù)據(jù)**(5,"D","上海","2020-02-05 00:00:00.000000",250), (5,"D","上海","2020-02-05 12:00:00.000000",300),**顯然需要對其進行清洗去重,保留最新的一條數(shù)據(jù),SQL如下:

        我們先進行分組排名,然后保留最新的那條數(shù)據(jù)即可:

        SELECT?*
        FROM?(
        ????SELECT?*,
        ????row_number()?over?(partition?by?order_id?order?by?add_time?desc)?as?rank
        ????FROM?orders
        )?t
        WHERE?rank=1

        結(jié)果輸出

        t.order_idt.customer_idt.cityt.add_timet.amountt.rank
        1A上海2020-01-01 00:00:00.0000002001
        2B上海2020-01-05 00:00:00.0000002501
        3C北京2020-01-12 00:00:00.0000002001
        4A上海2020-02-04 00:00:00.0000004001
        5D上海2020-02-05 12:00:00.0000003001
        6C北京2020-02-19 00:00:00.0000003001
        7A上海2020-03-01 00:00:00.0000001501
        8E北京2020-03-05 00:00:00.0000005001
        9F上海2020-03-09 00:00:00.0000002501
        10B上海2020-03-21 00:00:00.0000006001

        經(jīng)過上面的清洗過程,對數(shù)據(jù)進行了去重。重新計算上面的需求1,正確SQL腳本為:

        WITH
        orders_cleaned?as?(
        ????SELECT?*
        ????FROM?(
        ????????SELECT?*,
        ????????row_number()?over?(partition?by?order_id?order?by?add_time?desc)?as?rank
        ????????FROM?orders
        ????)t
        ????WHERE?rank=1
        )
        ,monthly_revenue?as?(
        ????SELECT
        ????trunc(add_time,'MM')?as?month,
        ????sum(amount)?as?revenue
        ????FROM?orders_cleaned
        ????GROUP?BY?1
        )
        ,prev_month_revenue?as?(
        ????SELECT?
        ????month,
        ????revenue,
        ????lag(revenue)?over?(order?by?month)?as?prev_month_revenue
        ????FROM?monthly_revenue
        )
        SELECT?
        month,
        revenue,
        round(100.0*(revenue-prev_month_revenue)/prev_month_revenue,1)?as?revenue_growth
        FROM?prev_month_revenue
        ORDER?BY?1

        結(jié)果輸出

        monthrevenuerevenue_growth
        2020-01-01650NULL
        2020-02-01100053.8
        2020-03-01150050

        將清洗后的數(shù)據(jù)創(chuàng)建成視圖,方便以后使用

        CREATE?VIEW?orders_cleaned?AS
        SELECT
        ????order_id,?
        ????customer_id,?
        ????city,?
        ????add_time,?
        ????amount
        FROM?(
        ????SELECT?*,
        ????row_number()?over?(partition?by?order_id?order?by?add_time?desc)?as?rank
        ????FROM?orders
        )t
        WHERE?rank=1

        需求4:分組取TopN

        分組取topN是最長見的SQL窗口函數(shù)使用場景,下面的SQL是計算每個月份的top2訂單金額,如下:

        WITH?orders_ranked?as?(
        ????SELECT
        ????trunc(add_time,'MM')?as?month,
        ????*,
        ????row_number()?over?(partition?by?trunc(add_time,'MM')?order?by?amount?desc,?add_time)?as?rank
        ????FROM?orders_cleaned
        )
        SELECT?
        ????month,
        ????order_id,
        ????customer_id,
        ????city,
        ????add_time,
        ????amount
        FROM?orders_ranked
        WHERE?rank?<=2
        ORDER?BY?1

        需求5:重復購買行為

        下面的SQL計算重復購買率:重復購買的人數(shù)/總?cè)藬?shù)*100%以及第一筆訂單金額與第二筆訂單金額之間的典型差額:avg(第二筆訂單金額/第一筆訂單金額)

        WITH?customer_orders?as?(
        ????SELECT?*,
        ????row_number()?over?(partition?by?customer_id?order?by?add_time)?as?customer_order_n,
        ????lag(amount)?over?(partition?by?customer_id?order?by?add_time)?as?prev_order_amount
        ????FROM?orders_cleaned
        )
        SELECT
        round(100.0*sum(case?when?customer_order_n=2?then?1?end)/count(distinct?customer_id),1)?as?repeat_purchases,--?重復購買率
        avg(case?when?customer_order_n=2?then?1.0*amount/prev_order_amount?end)?as?revenue_expansion?--?重復購買較上次購買差異,第一筆訂單金額與第二筆訂單金額之間的典型差額
        FROM?customer_orders

        結(jié)果輸出

        WITH結(jié)果輸出:

        orders_cleaned.order_idorders_cleaned.customer_idorders_cleaned.cityorders_cleaned.add_timeorders_cleaned.amountcustomer_order_nprev_order_amount
        1A上海2020-01-01 00:00:00.0000002001NULL
        4A上海2020-02-04 00:00:00.0000004002200
        7A上海2020-03-01 00:00:00.0000001503400
        2B上海2020-01-05 00:00:00.0000002501NULL
        10B上海2020-03-21 00:00:00.0000006002250
        3C北京2020-01-12 00:00:00.0000002001NULL
        6C北京2020-02-19 00:00:00.0000003002200
        5D上海2020-02-05 12:00:00.0000003001NULL
        8E北京2020-03-05 00:00:00.0000005001NULL
        9F上海2020-03-09 00:00:00.000000250

        最終結(jié)果輸出:

        repeat_purchasesrevenue_expansion
        501.9666666666666668

        總結(jié)

        本文主要分享了SQL窗口函數(shù)的基本使用方式以及使用場景,并結(jié)合了具體的分析案例。通過本文的分析案例,可以加深對SQL窗口函數(shù)的理解。

        瀏覽 76
        點贊
        評論
        收藏
        分享

        手機掃一掃分享

        分享
        舉報
        評論
        圖片
        表情
        推薦
        點贊
        評論
        收藏
        分享

        手機掃一掃分享

        分享
        舉報
        1. <strong id="7actg"></strong>
        2. <table id="7actg"></table>

        3. <address id="7actg"></address>
          <address id="7actg"></address>
          1. <object id="7actg"><tt id="7actg"></tt></object>
            成人 免费视频在线看一个小时 | 欧美二三四区 | 成年人免费性爱视频 | 色哟哟丨小泬丨国产专区 | 欧美性猛交XXX | 欧美丝袜足交 | 九九国产 | 女人扒开屁股爽桶30分钟的演员 | chinese野外求欢bbw 久久免费偷拍 | 各种少妇正面bbw撒尿视频 |