
Spark1.x VS Spark2.x

Spark-Benchmark

Benchmark data

| col_name | data_type |
|---|---|
| username | string |
| name | string |
| blood_group | string |
| company | string |
| birthdate | string |
| sex | string |
| job | string |
| ssn | string |
| mail | string |
mail                	string

Test method:

Fake contact records are written through Mobius in batches of 100k rows. For each cluster environment we record the write time, the merge time, and the time of common statistical queries.

Query time: average of 3 runs.

The machine writing the data and the cluster sit on the same network, so all data transfer stays on the internal network.
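The schema above is what a profile-style fake-data generator produces. The source does not name the generator, so the sketch below fakes one batch with only the Python standard library; the field values and the CSV output are illustrative assumptions, not the actual Mobius ingest path:

```python
import csv
import random
import string

FIELDS = ["username", "name", "blood_group", "company",
          "birthdate", "sex", "job", "ssn", "mail"]
BLOOD_GROUPS = ["A+", "A-", "B+", "B-", "AB+", "AB-", "O+", "O-"]

def fake_row(rng):
    """Build one fake contact record matching the benchmark schema."""
    user = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "username": user,
        "name": user.capitalize(),
        "blood_group": rng.choice(BLOOD_GROUPS),
        "company": f"company-{rng.randrange(1000)}",   # invented value shape
        "birthdate": f"{rng.randrange(1950, 2000)}-01-01",
        "sex": rng.choice(["M", "F"]),
        "job": f"job-{rng.randrange(100)}",
        "ssn": "".join(rng.choices(string.digits, k=9)),
        "mail": f"{user}@example.com",
    }

def write_batch(path, n=100_000, seed=0):
    """Write one batch of n rows as CSV, mirroring the 100k-per-batch loads."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for _ in range(n):
            writer.writerow(fake_row(rng))
```

In the real runs each such batch was pushed through Mobius rather than written to a local file.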

| Key statement | SQL |
|---|---|
| GroupBy | select blood_group from contact group by blood_group |
| Distinct | select distinct(blood_group) from contact |
| Count(Distinct) | select sex, count(distinct(username)) from contact group by sex |
| Where | select sex, count(distinct(username)) from contact where blood_group in ('A+', 'B-', 'AB+') group by sex |
| UDF | select count(merge(blood_group, sex)) from contact |
| CTAS 1 | create table contact2 as select * from contact |
| CTAS 2 | create table contact2 as select * from contact where sex='M' |
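The "average of 3 runs" timing rule can be sketched as a small harness around these queries. Spark is not assumed available here, so sqlite3 stands in for the SQL engine, and since the source never shows the body of the `merge` UDF, a trivial string concatenation is registered in its place; both are labeled assumptions:

```python
import sqlite3
import statistics
import time

def timed_avg(run, repeats=3):
    """Time `run` over `repeats` executions and return the mean seconds,
    mirroring the benchmark's average-of-3-runs rule."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

# sqlite3 stands in for Spark SQL; the real benchmark ran these
# statements against the contact table through Mobius.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contact (username TEXT, sex TEXT, blood_group TEXT)")
conn.executemany("INSERT INTO contact VALUES (?, ?, ?)",
                 [(f"u{i}", "MF"[i % 2], "A+") for i in range(1000)])

# The `merge` UDF body is not shown in the source; a plain concatenation
# is assumed here purely so the UDF query parses and runs.
conn.create_function("merge", 2, lambda a, b: f"{a}|{b}")

QUERIES = {
    "GroupBy": "SELECT blood_group FROM contact GROUP BY blood_group",
    "Count_Distinct": "SELECT sex, COUNT(DISTINCT username) FROM contact GROUP BY sex",
    "UDF": "SELECT COUNT(merge(blood_group, sex)) FROM contact",
}
for name, sql in QUERIES.items():
    secs = timed_avg(lambda s=sql: conn.execute(s).fetchall())
    print(f"{name}: {secs * 1000:.3f} ms")
```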

Test results

Offline cluster

Environment

Driver Mem: 8G, 50Core, Executor Mem: 20G, Node Count: 5

Data operation performance

| Rows | Table | File size | Write time | Merge | CTAS 1 | CTAS 2 |
|---|---|---|---|---|---|---|
| 10M | contact01 | 1.1G | 5min | 26s | 7.7s | 9.6s |
| 100M | contact | 11.1G | 83.3min | 4.3min | 1.3min | 1.2min |
| 1B | contact10 | 110.7G | 8.3h | 44min | 8.5min | 4.7min |

Query performance (Spark 1.3.1, 2015-05-18)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 1s | 2s | 2.1s | 6.6s | 5s | 4s |
| 100M | 4.6s | 9.2s | 9.2s | 43s | 36s | 29s |
| 1B | 1.2min | 1.4min | 1.2min | 7.6min | 5.6min | 2.6min |

Query performance (Spark 1.4.0, 2015-07-03)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.49s | 1.3s | 1.36s | 7.26s | 5.33s | 2.74s |
| 100M | 2.48s | 9.06s | 8.48s | 48.58s | 36.94s | 25.9s |
| 1B | 1.2min | 1.3min | 1.3min | 7.4min | 6min | 4.6min |

Query performance (Spark 1.4.1, 2015-07-17)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 1.81s | 1.45s | 1.19s | 6.66s | 3.75s | 2.63s |
| 100M | 5.88s | 9.01s | 9.65s | 49.06s | 38.42s | 17.62s |
| 1B | 1.26min | 1.4min | 1.4min | 7.6min | 5.9min | 2.8min |

Parquet format

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.48s | 0.81s | 0.83s | 5.49s | 3.2s | 2.13s |
| 100M | 0.63s | 1.26s | 1.41s | 34.52s | 15.32s | 2.82s |
| 1B | 2.48s | 7.06s | 8.21s | 315.84s | 142.91s | 20.31s |

Query performance (Spark 1.5.2, 2015-11-26)

Parquet

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.5s | 0.81s | 0.83s | 5.3s | 3.04s | 1.55s |
| 100M | 0.58s | 1.29s | 1.23s | 30.71s | 14.32s | 2.92s |
| 1B | 1.76s | 7.32s | 7.35s | 259.33s | 122.62s | 18.71s |

Tungsten disabled

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.63s | 0.99s | 0.77s | 6.02s | 4.17s | 1.88s |
| 100M | 0.67s | 1.35s | 1.34s | 12.12s | 6.56s | 9.52s |
| 1B | 2.63s | 8.03s | 7.69s | 63.57s | 35.58s | 42.77s |

Query performance (Spark 1.6.0, 2016-01-07)

Parquet

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.63s | 0.99s | 0.77s | 6.02s | 4.17s | 1.88s |
| 100M | 0.67s | 1.35s | 1.34s | 12.12s | 6.56s | 9.52s |
| 1B | 2.63s | 8.03s | 7.69s | 63.57s | 35.58s | 42.77s |

Query performance (Spark 1.6.1, 2016-05-19)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.34s | 0.6s | 0.59s | 3.32s | 2.17s | 1.54s |
| 100M | 0.54s | 1.14s | 1.16s | 8.48s | 6.66s | 3.51s |
| 1B | 2.77s | 7.85s | 7.87s | 73.42s | 42.96s | 29.31s |

Query performance (Spark 1.6.2, 2016-07-09)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.42s | 0.56s | 0.48s | 2.37s | 1.39s | 0.82s |
| 100M | 0.56s | 0.98s | 0.96s | 8.28s | 4.77s | 2.87s |
| 1B | 2.55s | 6.27s | 5.85s | 67.11s | 35.9s | 20.66s |

Query performance (Spark 1.6.2, 2016-09-27)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.4s | 0.57s | 0.43s | 10.17s | 4.83s | 2.94s |
| 100M | 0.39s | 0.89s | 0.91s | 31.53s | 13.6s | 10.57s |
| 1B | 1.83s | 4.62s | 4.81s | 67.9s | 28.55s | 64.32s |

Query performance (Spark 2.1.0, 2017-01-02)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.67s | 0.5s | 0.46s | 2.76s | 1.47s | 1.6s |
| 100M | 0.24s | 0.73s | 0.69s | 7.74s | 5.05s | 5.66s |
| 1B | 0.65s | 3.0s | 2.97s | 46.51s | 25.45s | 50.23s |

Query performance (Spark 2.2.0, 2017-07-09)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.21s | 0.33s | 0.33s | 1.88s | 1.08s | 1.7s |
| 100M | 0.22s | 0.71s | 0.6s | 6.84s | 4.31s | 5.56s |
| 1B | 0.46s | 3.43s | 3.29s | 50.38s | 25.55s | 43.93s |

Offline cluster

Environment

Driver Mem: 8G, 50Core, Executor Mem: 20G, Node Count: 5

Query performance (Spark 2.2.0, 2020-07-24)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 5.67s | 2.89s | 0.83s | 8.74s | 3.88s | 5.54s |
| 100M | 1.3s | 0.32s | 0.36s | 0.48s | 0.4s | 0.38s |
| 1B | 1.08s | 3.0s | 2.57s | 250.25s | 116.11s | 48.98s |

Query performance (Spark 2.3.0, 2020-07-24)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 3.02s | 1.22s | 0.45s | 8.22s | 3.19s | 2.89s |
| 100M | 0.23s | 0.2s | 0.2s | 0.27s | 0.26s | 0.3s |
| 1B | 0.59s | 3.54s | 3.4s | 270.14s | 111.44s | 49.32s |

Query performance (Spark 2.4.6, 2020-07-24)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.41s | 0.51s | 0.48s | 8.29s | 3.07s | 3.3s |
| 100M | 0.2s | 0.21s | 0.22s | 0.25s | 0.24s | 0.23s |
| 1B | 0.44s | 3.26s | 3.43s | 257.84s | 113.04s | 54.31s |

Online cluster (bdp-192)

Environment

Driver Mem: 25G, 216Core, Executor Mem: 18G, Node Count: 5

Query performance (Spark 2.3.0, 2020-08-04)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.54s | 0.49s | 0.35s | 1.32s | 0.97s | 1.01s |
| 100M | 1.02s | 0.98s | 0.76s | 3.65s | 2.42s | 3.71s |
| 1B | 3.53s | 2.9s | 2.79s | 17.37s | 11.25s | 27.25s |

Query performance (Spark 2.4.6, 2020-08-04)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.58s | 0.51s | 0.37s | 1.33s | 0.92s | 0.73s |
| 100M | 0.45s | 0.85s | 0.81s | 3.77s | 2.68s | 4.89s |
| 1B | 1.28s | 2.87s | 2.79s | 18.12s | 12.13s | 33.11s |

New cluster performance test
spark-extra.conf

spark.scheduler.mode FAIR
spark.executor.extraJavaOptions -Dtag=mobius.query.spi-wdb8 -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -XX:PermSize=256M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:G1HeapRegionSize=1m -Xloggc:./gc.log -verbose:gc
spark.memory.fraction 0.8
spark.memory.storageFraction 0.2
spark.speculation.quantile 0.92
spark.speculation.multiplier 2
spark.speculation.interval 200ms
spark.executor.cores 36
spark.driver.maxResultSize 2048m
spark.sql.shuffle.partitions 180
spark.speculation true
spark.kryoserializer.buffer.max 512m
spark.shuffle.consolidateFiles true
spark.sql.autoBroadcastJoinThreshold -1
spark.sql.mergeSchema.parallelize 2
spark.locality.wait 100
spark.cleaner.ttl 3600
spark.default.parallelism 180
spark.sql.adaptive.enabled true
spark.sql.adaptive.shuffle.targetPostShuffleInputSize 1024880
spark.sql.adaptive.minNumPostShufflePartitions 4
spark.ui.retainedJobs 1000
spark.ui.retainedStages 2000
spark.sql.crossJoin.enabled true

New machines, Spark 1.6.2: 5 nodes, 180 cores total (36 cores, 18G memory per node); driver: 10 cores, 25G memory (2016-09-30)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.23s | 0.31s | 0.31s | 1.11s | 0.82s | 1.21s |
| 100M | 0.99s | 1.05s | 0.87s | 4.94s | 3.25s | 1.89s |
| 1B | 1.2s | 2.93s | 3.36s | 22.51s | 12.39s | 8.27s |

New machines, Spark 2.2.0: 5 nodes, 180 cores total (36 cores, 18G memory per node); driver: 10 cores, 25G memory (2017-11-01)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.48s | 0.53s | 0.34s | 1.37s | 0.61s | 0.82s |
| 100M | 0.45s | 0.98s | 0.56s | 4.44s | 2.35s | 3.77s |
| 1B | 0.61s | 1.51s | 1.23s | 18.41s | 10.74s | 26.82s |

Online cluster

Environment

Driver Mem: 25G, 40Core, Executor Mem: 4G, Node Count: 13

Data operation performance

| Rows | Table | File size | Write time | Merge | CTAS 1 | CTAS 2 |
|---|---|---|---|---|---|---|
| 10M | contact01 | 1.1G | 5min | 27s | 7.2s | 4.4s |
| 100M | contact | 11.1G | 83.3min | 4min | 63s | 40s |
| 1B | contact10 | 110.7G | 8.3h | 54min | 10min | 5min |

Query performance (Spark 1.3.1, 13 nodes, 2015-05-18)

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 1.6s | 1.7s | 1.4s | 5s | 3.8s | 2.1s |
| 100M | 2.6s | 6.7s | 6.8s | 50s | 29s | 12s |
| 1B | 32s | 52s | 51s | 7.4min | 4.5min | 99s |

Query performance (Spark 1.3.1, 51 nodes, 2015-06-06)

Driver Mem: 25G, 150Core, Executor Mem: 6G, Node Count: 51

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.8s | 1.5s | 1.4s | 4.6s | 3.2s | 1.8s |
| 100M | 1.8s | 4s | 3.2s | 35s | 14s | 4.3s |
| 1B | 8s | 16s | 17s | 4.4min | 2.1min | 27s |

Query performance (Spark 1.6.1, 59 nodes, 2016-05-13)

Driver Mem: 25G, 150Core, Executor Mem: 6G

| Rows | Count | GroupBy | Distinct | Count_Distinct | Where | UDF |
|---|---|---|---|---|---|---|
| 10M | 0.62s | 0.75s | 0.78s | 2.5s | 1.88s | 0.92s |
| 100M | 0.66s | 1.52s | 1.32s | 6.2s | 4.85s | 3.57s |
| 1B | 2.5s | 3.64s | 3.26s | 17.75s | 9.67s | 19.15s |

A 500k-row sample table was created for testing: create table contact50w as SELECT * FROM contact01 TABLESAMPLE (500000 ROWS)

Spark concurrency test

Test SQL: select blood_group from contact group by blood_group

Run against the online cluster; Spark node count: 13.

Results below, time in ms:

| Concurrent queries | Min | Avg | Max |
|---|---|---|---|
| 1 | 7434 | 7499 | 8197 |
| 5 | 7557 | 10698 | 16514 |
| 10 | 18843 | 21328 | 37128 |
| 20 | 35113 | 39487 | 42407 |
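A harness for this kind of test can be sketched with a thread pool that fires N identical queries at once and reports min/avg/max latency. `run_query` below is a stand-in for the real GroupBy round trip to the cluster; the sleep is a placeholder, not an actual network call:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_query():
    """Stand-in for one GroupBy query round trip; the real test sent
    `select blood_group from contact group by blood_group` to the cluster."""
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder work in place of the real query
    return (time.perf_counter() - start) * 1000  # latency in ms

def concurrency_stats(concurrency):
    """Fire `concurrency` identical queries at once; report (min, avg, max) ms."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(lambda _: run_query(), range(concurrency)))
    return min(latencies), sum(latencies) / len(latencies), max(latencies)

for c in (1, 5, 10, 20):
    lo, avg, hi = concurrency_stats(c)
    print(f"concurrency={c}: min={lo:.1f}ms avg={avg:.1f}ms max={hi:.1f}ms")
```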

Real-time join query test

Method: ab -n 20 -c 1 [url] issues 20 query requests at concurrency 1; times are averaged.

Two tables are left-joined and SELECT a,SUM(b) FROM temp GROUP BY a LIMIT 10000 is executed, comparing query performance with and without materializing the joined table.

Row counts below refer to the row count after the join.

| Rows | Not materialized | Materialized |
|---|---|---|
| 1,000 | 450.216ms | 531.483ms |
| 200k | 1282.255ms | 674.580ms |
| 10M | 6869.201ms | 2634.941ms |
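The materialized-vs-not distinction can be illustrated in miniature. Here sqlite3 stands in for Spark SQL, and the table and column names (`t1`, `t2`, `a`, `b`, `k`) are invented for the sketch: a view re-runs the LEFT JOIN on every query, while CREATE TABLE AS pays the join cost once up front, which is why materialization wins as the joined data grows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (a TEXT, k INTEGER);
CREATE TABLE t2 (k INTEGER, b INTEGER);
INSERT INTO t1 VALUES ('x', 1), ('y', 2), ('y', 3);
INSERT INTO t2 VALUES (1, 10), (2, 20);
-- Not materialized: the LEFT JOIN re-runs every time the view is queried.
CREATE VIEW temp_view AS SELECT a, b FROM t1 LEFT JOIN t2 USING (k);
-- Materialized: the join result is written out once, as with CREATE TABLE AS.
CREATE TABLE temp_mat AS SELECT a, b FROM t1 LEFT JOIN t2 USING (k);
""")
agg = "SELECT a, SUM(b) FROM {} GROUP BY a LIMIT 10000"
print(conn.execute(agg.format("temp_view")).fetchall())
print(conn.execute(agg.format("temp_mat")).fetchall())
```

Both paths return the same answer; only the cost profile differs, which matches the crossover visible in the table above.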

BDP-Benchmark

Results obtained by running the bdp_benchmark.py script under tassadar's tools, which sends requests one after another to the mobius instance on bdp-192.

Spark 2.2.0

| Rows | LinkRelativeRatio | DateDistinct | Filter | Count | SumByDay | DayOnDayBasis |
|---|---|---|---|---|---|---|
| 10M | 1.7s | 0.75s | 1.17s | 0.43s | 1.05s | 1.19s |
| 100M | 3.03s | 2.36s | 2.81s | 0.41s | 2.41s | 2.54s |
| 1B | 17.74s | 15.76s | 16.96s | 1.79s | 16.89s | 17.17s |

Spark 1.6.2

| Rows | LinkRelativeRatio | DateDistinct | Filter | Count | SumByDay | DayOnDayBasis |
|---|---|---|---|---|---|---|
| 10M | 1.7s | 0.86s | 1.6s | 0.49s | 1.08s | 1.5s |
| 100M | 5.49s | 2.29s | 5.49s | 0.82s | 2.36s | 4.12s |
| 1B | 30.85s | 16.74s | 34.36s | 2.52s | 17.84s | 30.35s |