
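1 Background

I was asked to sync Hive data to ClickHouse. I assumed it would be a very simple task, because our data platform already integrates DataX, and the latest version of DataX supports a ClickHouse writer.

To my surprise, the sync was slow: roughly 4 million rows per hour. With as much data as the table holds, the job would have taken far too long. So I started a long round of research and eventually chose Waterdrop.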

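2 About Waterdrop

Waterdrop is a compute engine for massive data in production environments. It covers streaming, offline (batch), ETL, and aggregation workloads. InterestingLab is an open-source team whose core goal is to simplify and popularize big data processing for users. Its core project, Waterdrop, is a configuration-driven tool with zero development cost for large-scale streaming and batch processing, built on Spark and Flink. It is currently used in production by companies across several industries, including 360, Didi, Huawei, Weibo, Sina, Yidian Zixun, Yonghui Group, and Shuidichou.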

Project: https://github.com/InterestingLab/waterdrop

Documentation: https://interestinglab.github.io/waterdrop-docs/

Quick start: https://interestinglab.github.io/waterdrop-docs/#/zh-cn/v1/quick-start

Industry case studies: https://interestinglab.github.io/waterdrop-docs/#/zh-cn/v1/case_study/

Plugin development: https://interestinglab.github.io/waterdrop-docs/#/zh-cn/v1/developing-plugin

Design and implementation of Waterdrop: https://mp.weixin.qq.com/s/lYECVCYdKsfcL64xhWEqPg

3 Waterdrop architecture

3.1 input

3.2 filter

3.3 output

4 Installation and usage

4.1 Download

https://github.com/InterestingLab/waterdrop/releases

4.2 Extract

unzip waterdrop-1.4.2-with-spark.zip

4.3 Edit the configuration files (Hive --> ClickHouse)

waterdrop-env.sh

#!/usr/bin/env bash

# Home directory of spark distribution.

SPARK_HOME=/usr/local/spark-current/

test_df.conf

spark {
  spark.app.name = "hive-ck"
  spark.executor.instances = 8
  spark.executor.cores = 2
  spark.executor.memory = "2g"
  # Enables Hive support so Spark can read Hive tables
  spark.sql.catalogImplementation = "hive"
  spark.yarn.queue = "root.test"
}

input {
  hive {
    pre_sql = "select * from wedw_tmp.test_df"
    table_name = "test_df"
  }
}

filter {
}

output {
  clickhouse {
    # ClickHouse HTTP endpoint (host:port)
    host = "10.20.xxx.xxx:8123"
    database = "ck"
    # Socket timeout in milliseconds
    clickhouse.socket_timeout = 600000
    table = "test_df"
    username = "root"
    password = "123456"
    # Rows per insert batch
    bulk_size = 50000
    retry = 3
  }
}

4.4 Start Waterdrop to sync the data

/home/pgxl/liuzc/waterdrop-1.4.2/bin/start-waterdrop.sh --master yarn --deploy-mode client --config /home/pgxl/liuzc/waterdrop-1.4.2/config/test_df.conf

4.5 Speed

About 200 million rows in roughly one hour.
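
To put the two figures side by side, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes both rates are steady and refer to comparable data, which the rough numbers in this post do not guarantee. The function name `hours_to_sync` is made up for illustration.

```python
# Rough figures quoted in this post; observations, not benchmarks.
DATAX_ROWS_PER_HOUR = 4_000_000          # "roughly 4 million rows per hour"
WATERDROP_ROWS_PER_HOUR = 200_000_000    # "about 200 million rows in roughly one hour"


def hours_to_sync(total_rows: int, rows_per_hour: int) -> float:
    """Return the hours needed to copy total_rows at a steady rate."""
    return total_rows / rows_per_hour


total_rows = 200_000_000
datax_hours = hours_to_sync(total_rows, DATAX_ROWS_PER_HOUR)
waterdrop_hours = hours_to_sync(total_rows, WATERDROP_ROWS_PER_HOUR)
speedup = WATERDROP_ROWS_PER_HOUR / DATAX_ROWS_PER_HOUR

print(f"DataX: ~{datax_hours:.0f} h, Waterdrop: ~{waterdrop_hours:.0f} h, "
      f"speedup: ~{speedup:.0f}x")
# prints: DataX: ~50 h, Waterdrop: ~1 h, speedup: ~50x
```

Under these assumptions, the same 200 million rows would have taken about 50 hours at the DataX rate I observed.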

5 Problems you may run into

5.1 Too many parts (304). Merges are processing significantly slower than inserts

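Merges cannot keep up with inserts. One possible cause is that the data being written spans many partitions: every write then touches several partitions, which puts heavy pressure on merges. Reducing the write concurrency can help:

spark.executor.instances = 4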
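
To see why writes that span many partitions hurt, the sketch below counts new parts under a simplified model: each insert batch creates one part for every partition it touches. The data is hypothetical (200,000 rows spread over 30 partitions), and `parts_created` is an illustrative helper, not part of Waterdrop or ClickHouse. Grouping rows by partition key before writing is an extra idea shown only for comparison; the fix I actually applied was reducing concurrency.

```python
def parts_created(partition_keys, bulk_size):
    """Count new parts, assuming one part per partition touched by each batch."""
    total = 0
    for start in range(0, len(partition_keys), bulk_size):
        batch = partition_keys[start:start + bulk_size]
        total += len(set(batch))  # distinct partitions in this batch
    return total


# Hypothetical data: 200,000 rows interleaved across 30 partitions.
keys = [i % 30 for i in range(200_000)]

# Rows arrive interleaved, so every batch touches all 30 partitions.
mixed = parts_created(keys, bulk_size=50_000)

# Rows grouped by partition key, so each batch touches only a few partitions.
grouped = parts_created(sorted(keys), bulk_size=50_000)

print(mixed, grouped)
# prints: 120 33
```

In a real job the counts depend on how Spark splits the data and how many executors write at once, but the trend is the same: the more partitions each batch touches and the more writers run in parallel, the more parts ClickHouse has to merge.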

5.2 Read timed out

This is a timeout problem. Increase the socket timeout as appropriate:

clickhouse.socket_timeout=600000

5.3 Class not found

Check your Spark configuration.

--end--
