FileSpout: add support for gzip and bz2 files by jnioche · Pull Request #1952 · apache/stormcrawler

jnioche · 2026-06-16T08:53:40Z

fix javadoc and prevent possible leak

This makes it easier to ingest large lists of seeds.

Signed-off-by: Julien Nioche <julien@digitalpebble.com>

rzo1 · 2026-06-16T09:04:56Z

Thanks for the PR. I see the need for it :)

Two things worth addressing before merge:

1. bz2 concatenated streams are silently truncated.

new BZip2CompressorInputStream(is) only reads the first stream in a multi-stream bzip2 file. Concatenated bzip2 files are valid and common, for example when seed lists are built with cat part1.bz2 part2.bz2 > seeds.bz2 or produced by parallel compressors like pbzip2/lbzip2. With the single-arg constructor, everything after the first stream is dropped silently (no exception, no warning), so the spout emits fewer seeds than the file contains.

To reproduce:

printf 'https://a.example\nhttps://b.example\n' | bzip2 > part1.bz2
printf 'https://c.example\nhttps://d.example\n' | bzip2 > part2.bz2
cat part1.bz2 part2.bz2 > seeds.bz2

bzip2 -dc seeds.bz2   # CLI reads all 4 URLs

The current code reads only a and b from that file.

Note the asymmetry with the gzip path: GZIPInputStream decodes concatenated members automatically, so .gz returns all 4 URLs while .bz2 returns 2 from equivalent input.
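For reference, the gzip half of that asymmetry can be checked with the JDK alone. The sketch below is only an illustration, not code from this PR: the class and helper names are made up, and it builds the two members in memory instead of using files. It joins two complete gzip members byte for byte (the equivalent of cat part1.gz part2.gz) and reads them back through a single java.util.zip.GZIPInputStream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of StormCrawler.
class ConcatenatedGzipDemo {

    /** Compresses the text as one complete gzip member. */
    public static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Byte-level equivalent of `cat part1.gz part2.gz`. */
    public static byte[] concat(byte[] first, byte[] second) {
        byte[] joined = new byte[first.length + second.length];
        System.arraycopy(first, 0, joined, 0, first.length);
        System.arraycopy(second, 0, joined, first.length, second.length);
        return joined;
    }

    /** Decompresses the bytes with one GZIPInputStream and returns every line. */
    public static List<String> readLines(byte[] compressed) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] seeds = concat(
                gzip("https://a.example\nhttps://b.example\n"),
                gzip("https://c.example\nhttps://d.example\n"));
        // Lines from both members are returned, not only the first two.
        System.out.println(readLines(seeds));
    }
}
```

The bzip2 side is not shown here because it needs Apache Commons Compress on the classpath.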

Suggested fix, enable concatenated stream support:

is = new BZip2CompressorInputStream(is, /* decompressConcatenated */ true);

Single-stream .bz2 files (the normal bzip2 file.txt case) are unaffected either way. The bug only affects multi-stream files, which are exactly what you tend to get when assembling the large seed lists this PR targets.

2. Missing BufferedInputStream on the compressed path.

GZIPInputStream (512-byte default read buffer) and BZip2CompressorInputStream both issue many small reads directly against the raw FileInputStream, one syscall each. On the uncompressed path this is fine because InputStreamReader has its own ~8 KB buffer between the file and the reader, but on the compressed path the decompressor sits in front of that, so there is nothing absorbing the small reads. For large compressed seed files this is a lot of tiny reads.

Cheap fix, buffer the file before decompressing:

InputStream is = new BufferedInputStream(new FileInputStream(inputPath.toFile()));

Not a correctness issue, just I/O efficiency, but worth it given the large-seed-list motivation.
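A rough way to see the effect, using only the JDK: the sketch below counts how many read calls reach the underlying stream while a gzip payload is drained, with and without a BufferedInputStream in between. Everything in it is illustrative and assumed rather than taken from the PR: the class names, the synthetic seed list, and the use of a ByteArrayInputStream as a stand-in for the FileInputStream (so the count is a proxy for syscalls, not a measurement of them).

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of StormCrawler.
class BufferedGzipReadDemo {

    /** Counts every read call that reaches the wrapped stream. */
    static class CountingInputStream extends FilterInputStream {
        int readCalls;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            readCalls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            readCalls++;
            return super.read(b, off, len);
        }
    }

    /** Builds a gzip-compressed synthetic seed list with the given number of URLs. */
    public static byte[] gzipSeedList(int urls) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
                for (int i = 0; i < urls; i++) {
                    String line = "https://example.org/page/" + i + "\n";
                    out.write(line.getBytes(StandardCharsets.UTF_8));
                }
            }
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Drains the gzip stream and returns how many read calls hit the source. */
    public static int countSourceReads(byte[] compressed, boolean buffered) {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(compressed));
        InputStream raw = buffered ? new BufferedInputStream(source) : source;
        try (InputStream in = new GZIPInputStream(raw)) {
            byte[] chunk = new byte[8192];
            while (in.read(chunk) != -1) {
                // Discard the decompressed data; only the read count matters.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.readCalls;
    }

    public static void main(String[] args) {
        byte[] compressed = gzipSeedList(50_000);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("reads without buffer: " + countSourceReads(compressed, false));
        System.out.println("reads with buffer:    " + countSourceReads(compressed, true));
    }
}
```

The exact counts depend on the payload, so none are quoted here; the point is only that the buffered variant should reach the source far less often.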


jnioche · 2026-06-16T10:44:36Z

well spotted, thanks @rzo1

jnioche added 2 commits June 15, 2026 18:23

Add support for gzip and bz2 files + fix javadoc and prevent possible…

2105724

… leak Signed-off-by: Julien Nioche <julien@digitalpebble.com>

comply to checkstyle

1358eac

Signed-off-by: Julien Nioche <julien@digitalpebble.com>

jnioche mentioned this pull request Jun 16, 2026

FileSpout to handle compressed input thegreenwebfoundation/carbontxt-crawler#2

Closed

buffered input + handle concats for bz2

e5a54e1

Signed-off-by: Julien Nioche <julien@digitalpebble.com>

rzo1 approved these changes Jun 16, 2026


jnioche added this to the 3.6.1 milestone Jun 16, 2026

jnioche added the enhancement and core labels Jun 16, 2026

jnioche merged commit 6217723 into main Jun 16, 2026
2 checks passed

jnioche deleted the FileSpoutCompressedFiles branch June 16, 2026 11:16

jnioche merged 3 commits into main from FileSpoutCompressedFiles


2 participants
