Is there a good way to use Flink/Spark to efficiently process, in parallel, a large number of compressed files of varying sizes?



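All the data is gzip-compressed, so none of it can be split, and each file has to be read by a single thread. Often the small files finish early, but the large files are very slow. Is there a good way to make gzip splittable?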

2 replies • 2020-03-12 09:33:25 +08:00

alya

Mar 11, 2020

Switch to snappy.

kex0916

Mar 12, 2020

You could first decompress the large files and put them on HDFS before running the computation, or you could try something like https://github.com/nielsbasjes/splittablegzip.
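
For the decompress-first option, here is a minimal sketch using only the Python standard library. The function name `decompress_gzip` and the paths are hypothetical, chosen for illustration. It streams the file in fixed-size buffers, so a large archive does not have to fit in memory. This is only one way to do the preprocessing step, not necessarily what the reply had in mind.

```python
import gzip
import shutil


def decompress_gzip(src_path: str, dst_path: str, buffer_size: int = 1024 * 1024) -> None:
    """Stream-decompress a gzip file into a plain file.

    src_path, dst_path and buffer_size are illustrative names, not part of
    any Flink/Spark API. Data is copied buffer by buffer, so memory use
    stays roughly at buffer_size regardless of the file size.
    """
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, buffer_size)


if __name__ == "__main__":
    # Hypothetical paths; replace with the real large file.
    decompress_gzip("big_input.log.gz", "big_input.log")
```

Once the plain file is uploaded (for example with `hdfs dfs -put`), an uncompressed text file can be split by block, so several tasks can read it in parallel. The trade-off is the extra storage and the one-off, single-threaded decompression pass.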