FileSpout: add support for gzip and bz2 files#1952

Merged
jnioche merged 3 commits into main from FileSpoutCompressedFiles
Jun 16, 2026

@jnioche jnioche commented Jun 16, 2026

Contributor
  • fix javadoc and prevent possible leak

This makes it easier to ingest large lists of seeds.

jnioche added 2 commits June 15, 2026 18:23
… leak

Signed-off-by: Julien Nioche <julien@digitalpebble.com>
Signed-off-by: Julien Nioche <julien@digitalpebble.com>

rzo1 commented Jun 16, 2026

Contributor

Thanks for the PR. I see the need for it :)

Two things worth addressing before merge:

1. bz2 concatenated streams are silently truncated.

new BZip2CompressorInputStream(is) only reads the first stream in a multi-stream bzip2 file. Concatenated bzip2 files are valid and common, for example when seed lists are built with cat part1.bz2 part2.bz2 > seeds.bz2 or produced by parallel compressors like pbzip2/lbzip2. With the single-arg constructor, everything after the first stream is dropped silently (no exception, no warning), so the spout emits fewer seeds than the file contains.

To reproduce:

printf 'https://a.example\nhttps://b.example\n' | bzip2 > part1.bz2
printf 'https://c.example\nhttps://d.example\n' | bzip2 > part2.bz2
cat part1.bz2 part2.bz2 > seeds.bz2

bzip2 -dc seeds.bz2   # CLI reads all 4 URLs

The current code reads only a and b from that file.

Note the asymmetry with the gzip path: GZIPInputStream decodes concatenated members automatically, so .gz returns all 4 URLs while .bz2 returns 2 from equivalent input.

Suggested fix, enable concatenated stream support:

is = new BZip2CompressorInputStream(is, /* decompressConcatenated */ true);

Single-stream .bz2 files (the normal bzip2 file.txt case) are unaffected either way. The bug only affects multi-stream files, which are exactly what you tend to get when assembling the large seed lists this PR targets.
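
For reference, here is a minimal stdlib-only sketch of the behavior the .bz2 path should match once concatenated decoding is enabled. It builds the gzip equivalent of the repro above in memory and reads it back through GZIPInputStream. The class and method names (ConcatenatedGzipDemo, gzip, concat, readLines) are hypothetical and not part of the PR. The bzip2 side is deliberately left out because it needs Commons Compress on the classpath.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the PR.
public class ConcatenatedGzipDemo {

    /** Compresses the text into one standalone gzip member. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The gzip stream is closed here, so the trailer has been written.
        return bytes.toByteArray();
    }

    /** Byte-level equivalent of `cat part1.gz part2.gz > seeds.gz`. */
    static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = new byte[first.length + second.length];
        System.arraycopy(first, 0, joined, 0, first.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    /** Decompresses the data and returns one entry per line. */
    static List<String> readLines(byte[] gzipped) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipped)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] seeds = concat(
                gzip("https://a.example\nhttps://b.example\n"),
                gzip("https://c.example\nhttps://d.example\n"));
        // Prints every line recovered from both concatenated members.
        System.out.println(readLines(seeds));
    }
}
```

A regression test for the .bz2 path could follow the same shape: build two members, concatenate them, and assert that the number of lines read back equals the number written.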

2. Missing BufferedInputStream on the compressed path.

GZIPInputStream (512-byte default read buffer) and BZip2CompressorInputStream both issue many small reads directly against the raw FileInputStream, one syscall each. On the uncompressed path this is fine because InputStreamReader has its own ~8 KB buffer between the file and the reader, but on the compressed path the decompressor sits in front of that, so there is nothing absorbing the small reads. For large compressed seed files this is a lot of tiny reads.

Cheap fix, buffer the file before decompressing:

InputStream is = new BufferedInputStream(new FileInputStream(inputPath.toFile()));

Not a correctness issue, just I/O efficiency, but worth it given the large-seed-list motivation.
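
To make the difference visible, here is a rough sketch that counts how many read calls reach the underlying stream with and without the BufferedInputStream. It is an illustration under assumptions, not a benchmark: a ByteArrayInputStream stands in for the FileInputStream, each counted call stands in for a syscall, and the names (BufferedDecompressionDemo, CountingInputStream, gzipRandom, countReads) are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the PR.
public class BufferedDecompressionDemo {

    /** Counts every read call that reaches the wrapped stream. */
    static final class CountingInputStream extends FilterInputStream {
        long reads;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            reads++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            reads++;
            return super.read(b, off, len);
        }
    }

    /** Gzips pseudo-random bytes so the compressed payload stays large. */
    static byte[] gzipRandom(int size) {
        byte[] raw = new byte[size];
        new Random(42).nextBytes(raw);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Fully decompresses the data and returns the reads seen by the source. */
    static long countReads(byte[] gzipped, boolean buffered) {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(gzipped));
        InputStream raw = buffered ? new BufferedInputStream(source) : source;
        try (InputStream in = new GZIPInputStream(raw)) {
            byte[] sink = new byte[8192];
            while (in.read(sink) != -1) {
                // Discard the decompressed bytes; only the read count matters.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.reads;
    }

    public static void main(String[] args) {
        byte[] data = gzipRandom(64 * 1024);
        System.out.println("direct reads:   " + countReads(data, false));
        System.out.println("buffered reads: " + countReads(data, true));
    }
}
```

The buffered count should come out well below the direct one, since the decompressor's small reads are served from the in-memory buffer instead of the source.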

Signed-off-by: Julien Nioche <julien@digitalpebble.com>

jnioche commented Jun 16, 2026

Contributor Author

well spotted, thanks @rzo1

@jnioche jnioche added this to the 3.6.1 milestone Jun 16, 2026
@jnioche jnioche merged commit 6217723 into main Jun 16, 2026
2 checks passed
@jnioche jnioche deleted the FileSpoutCompressedFiles branch June 16, 2026 11:16

2 participants