1. Requirement
Join table a with table b on two fields (b's ip column matched against both a.ip1 and a.ip2) and output the result.
Table a has about 2.4 billion rows.
Table b has about 300,000 rows.
2. Optimization result
After optimization, execution time dropped from several days to a few minutes.
3. Resources
Spark 1.4.1
200 cores, 600 GB RAM
4. Simplified code (before optimization)
sqlContext.sql("select name, ip1, ip2 from table_A where name is not null and (ip1 is not null or ip2 is not null) group by name, ip1, ip2").registerTempTable("a")

sqlContext.read.parquet("table_B").registerTempTable("b")

sqlContext.sql("""
  select ip, count(1) as cnt
  from (
    select bb.ip as ip, aa.name as name
    from (select * from b where ip != '') bb
    left join (select * from a) aa
      on (bb.ip = aa.ip2 or bb.ip = aa.ip1)
    group by bb.ip, aa.name
  ) t
  group by ip
""").write.json("result")
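The problematic shape is the OR in the join condition: the engine cannot use a single equi-join key, so Spark 1.x falls back to a nested-loop style join, which is catastrophic at 2.4B x 300K rows. A minimal sketch of the same query shape, using SQLite and invented sample rows (not the author's data or Spark job), just to make the logic concrete:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table a (name text, ip1 text, ip2 text)")
cur.execute("create table b (ip text)")
# Hypothetical sample rows for illustration only.
cur.executemany("insert into a values (?, ?, ?)",
                [("u1", "10.0.0.1", "10.0.0.2"),
                 ("u2", "10.0.0.3", "10.0.0.1")])
cur.executemany("insert into b values (?)", [("10.0.0.1",), ("10.0.0.9",)])

# Same shape as the "before" query: the OR means every b row must be
# checked against both a.ip1 and a.ip2, so no single equi-join key exists.
rows = cur.execute("""
    select ip, count(1) as cnt
    from (
      select bb.ip as ip, aa.name as name
      from (select * from b where ip != '') bb
      left join a aa on (bb.ip = aa.ip2 or bb.ip = aa.ip1)
      group by bb.ip, aa.name
    )
    group by ip
""").fetchall()
print(sorted(rows))  # 10.0.0.1 matches u1 and u2; 10.0.0.9 matches nothing
```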
5. Simplified code (after optimization)
Investigation traced the slowness to the OR in the join condition. The fix: register each of the two equality conditions as its own table, then UNION them into a single table (the union only concatenates the partitions of the two underlying RDDs, so it costs almost nothing), and run one equi-join against that table.
The code:
// Select the needed fields and cache them, since the table is read twice below
sqlContext.sql("CACHE TABLE all AS select name, ip1, ip2 from table_A where name is not null and (ip1 is not null or ip2 is not null) group by name, ip1, ip2")

sqlContext.sql("select name, ip1 as ip from all group by name, ip1").registerTempTable("temp1")
sqlContext.sql("select name, ip2 as ip from all group by name, ip2").registerTempTable("temp2")

sqlContext.sql("select name, ip from (select * from temp1 union all select * from temp2) u group by name, ip").registerTempTable("a")

sqlContext.read.parquet("table_B").registerTempTable("b")

sqlContext.sql("""
  select ip, count(1) as cnt
  from (
    select bb.ip as ip, aa.name as name
    from (select * from b where ip != '') bb
    left join (select * from a) aa on bb.ip = aa.ip
    group by bb.ip, aa.name
  ) t
  group by ip
""").write.json("result")
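The rewrite must return exactly the same answer as the OR-join, since flattening (name, ip1) and (name, ip2) into one (name, ip) table and equi-joining once covers the same match pairs. A self-contained sketch with SQLite and invented sample data (not the author's tables) that checks the two query shapes agree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table table_a (name text, ip1 text, ip2 text)")
cur.execute("create table b (ip text)")
# Hypothetical rows; u3 deliberately has ip1 = ip2 to exercise deduplication.
cur.executemany("insert into table_a values (?, ?, ?)",
                [("u1", "10.0.0.1", "10.0.0.2"),
                 ("u2", "10.0.0.3", "10.0.0.1"),
                 ("u3", "10.0.0.1", "10.0.0.1")])
cur.executemany("insert into b values (?)", [("10.0.0.1",), ("10.0.0.9",)])

# Original shape: OR in the join condition.
q_or = """
    select ip, count(1) as cnt from (
      select bb.ip as ip, aa.name as name
      from (select * from b where ip != '') bb
      left join table_a aa on (bb.ip = aa.ip2 or bb.ip = aa.ip1)
      group by bb.ip, aa.name
    ) group by ip
"""

# Rewrite: union the two key columns into one (name, ip) relation
# (UNION deduplicates, mirroring the union-all-plus-group-by above),
# then join on a single equality.
q_union = """
    select ip, count(1) as cnt from (
      select bb.ip as ip, aa.name as name
      from (select * from b where ip != '') bb
      left join (select name, ip1 as ip from table_a
                 union
                 select name, ip2 as ip from table_a) aa
        on bb.ip = aa.ip
      group by bb.ip, aa.name
    ) group by ip
"""

r_or = sorted(cur.execute(q_or).fetchall())
r_union = sorted(cur.execute(q_union).fetchall())
assert r_or == r_union
print(r_or)
```

The equi-join version lets the engine build a hash table on the single ip key, which is what turns days into minutes on the real workload.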