FileSpout: add support for gzip and bz2 files #1952
… leak Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Thanks for the PR. I see the need for it :) Two things worth addressing before merge:

1. bz2 concatenated streams are silently truncated.

To reproduce:

printf 'https://a.example\nhttps://b.example\n' | bzip2 > part1.bz2
printf 'https://c.example\nhttps://d.example\n' | bzip2 > part2.bz2
cat part1.bz2 part2.bz2 > seeds.bz2
bzip2 -dc seeds.bz2 # CLI reads all 4 URLs

The current code reads only the first stream. Note the asymmetry with the gzip path.

Suggested fix, enable concatenated stream support:

is = new BZip2CompressorInputStream(is, /* decompressConcatenated */ true);

Single-stream files are unaffected.

2. Missing BufferedInputStream on the compressed path.

Cheap fix, buffer the file before decompressing:

InputStream is = new BufferedInputStream(new FileInputStream(inputPath.toFile()));

Not a correctness issue, just I/O efficiency, but worth it given the large-seed-list motivation.
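For anyone who wants to see the concatenated-stream behaviour in a unit-test-sized form, here is a minimal sketch that uses only the JDK. It builds two gzip members, concatenates them the way `cat` does in the repro, and reads them back through a `BufferedInputStream`. The class and helper names (`ConcatenatedGzipDemo`, `gzip`, `concat`, `readLines`) are hypothetical, and using `java.util.zip.GZIPInputStream` is an assumption for illustration; the PR's actual gzip path is not shown here. The bz2 equivalent would swap in the `BZip2CompressorInputStream` constructor from the suggestion above, which needs Commons Compress on the classpath.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the PR.
class ConcatenatedGzipDemo {

    /** Compresses the text into one complete gzip member. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the gzip trailer has been written.
        return bos.toByteArray();
    }

    /** Byte-level concatenation, equivalent to `cat part1 part2`. */
    static byte[] concat(byte[] a, byte[] b) {
        byte[] out = new byte[a.length + b.length];
        System.arraycopy(a, 0, out, 0, a.length);
        System.arraycopy(b, 0, out, a.length, b.length);
        return out;
    }

    /** Decompresses the bytes and returns the decoded lines. */
    static List<String> readLines(byte[] compressed) {
        List<String> lines = new ArrayList<>();
        // Buffer the raw source before handing it to the decompressor.
        try (InputStream raw = new BufferedInputStream(new ByteArrayInputStream(compressed));
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(new GZIPInputStream(raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] seeds = concat(
                gzip("https://a.example\nhttps://b.example\n"),
                gzip("https://c.example\nhttps://d.example\n"));
        // Print every URL that survives decompression.
        readLines(seeds).forEach(System.out::println);
    }
}
```

A regression test for the bz2 fix could follow the same shape: write two compressed parts, concatenate them, and assert that the spout emits the URLs from both parts rather than only the first.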
Signed-off-by: Julien Nioche <julien@digitalpebble.com>
well spotted, thanks @rzo1
This makes it easier to ingest large lists of seeds.