This post could also be called "Reading .gz.tmp files with Spark". At Socialmetrix we have several pipelines writing logs to AWS S3, and sometimes Apache Flume fails on the last phase, renaming the final archive from .gz.tmp to .gz. Those files therefore cannot be read through the SparkContext.textFile API. This post presents our workaround to process them.
The diagram below shows the sink part of our architecture:
Flume listens to an AMQP queue, dequeuing logs as soon as they arrive;
Every 10 minutes, Flume gzips the accumulated content and saves it to an S3 bucket;
For some reason, still unknown at this moment, some files don't end up with the final desired extension .gz; instead they are saved with .gz.tmp.
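For reference, a Flume HDFS-sink configuration along these lines produces this behavior (a sketch, not our production config; the agent, channel, and bucket names are hypothetical). Flume writes each in-flight file with an in-use suffix, `.tmp` by default, and renames it when the file is closed — that rename is the step that occasionally fails:

```properties
# Hypothetical sink section of a Flume agent config
agent.sinks.s3.type = hdfs
agent.sinks.s3.channel = memCh
agent.sinks.s3.hdfs.path = s3n://my-bucket/logs
agent.sinks.s3.hdfs.fileType = CompressedStream
agent.sinks.s3.hdfs.codeC = gzip
# roll a new file every 10 minutes, never by size or event count
agent.sinks.s3.hdfs.rollInterval = 600
agent.sinks.s3.hdfs.rollSize = 0
agent.sinks.s3.hdfs.rollCount = 0
# while open, the file is "<name>.gz.tmp"; it becomes "<name>.gz" on close
agent.sinks.s3.hdfs.inUseSuffix = .tmp
```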
If you try to read these files with Spark (or Hadoop), all you'll get is gibberish, because any unknown extension is treated as plain text.
The reason you can't read a .gz.tmp file is that Spark tries to match the file extension against the registered compression codecs, and no codec handles the extension .gz.tmp.
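You can see this extension matching with Hadoop's CompressionCodecFactory, the same lookup Hadoop's text input format performs (a small sketch; the file names are illustrative):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.compress.CompressionCodecFactory

object CodecLookup extends App {
  // The factory Hadoop uses to decide how to decode a file by its extension
  val factory = new CompressionCodecFactory(new Configuration())

  // ".gz" is claimed by GzipCodec
  println(factory.getCodec(new Path("events.gz")))     // a GzipCodec instance
  // no registered codec claims ".gz.tmp", so the bytes are read as plain text
  println(factory.getCodec(new Path("events.gz.tmp"))) // null
}
```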
With this in mind, the solution is really easy: all we had to do was extend GzipCodec and override the getDefaultExtension method.
Here is our codec, a GzipCodec subclass we called TmpGzipCodec:

```scala
import org.apache.hadoop.io.compress.GzipCodec

// A GzipCodec that claims ".gz.tmp" as its extension instead of ".gz"
class TmpGzipCodec extends GzipCodec {
  override def getDefaultExtension(): String = ".gz.tmp"
}
```
Now we just register this codec, setting Hadoop's io.compression.codecs property through the Spark configuration (the spark.hadoop. prefix forwards it to the underlying Hadoop Configuration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("read-gz-tmp")
  // Forwarded to Hadoop as io.compression.codecs; extension matching
  // now maps "*.gz.tmp" to our codec
  .set("spark.hadoop.io.compression.codecs", classOf[TmpGzipCodec].getName)

val sc = new SparkContext(conf)
// Path is illustrative -- point it at the stranded files
val logs = sc.textFile("s3n://my-bucket/logs/*.gz.tmp")
```
Now it is just a matter of bundling this codec with your project — in our case, sbt assembly — and running your code as usual. From the tests we ran in our environment, registering this codec does not affect Spark's default configuration, so we can still process the standard extensions, such as plain .gz, as before.
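Since the cluster already supplies Spark at runtime, the build only needs spark-core as a provided dependency. A minimal sbt setup might look like this (a sketch; the artifact name and versions are assumptions, not our actual build):

```scala
// project/plugins.sbt -- enable the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")

// build.sbt
name := "tmp-gzip-codec"
scalaVersion := "2.11.8"

// "provided": the cluster supplies Spark at runtime, so keep it out of the fat jar
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.2" % "provided"
```

Running `sbt assembly` then produces a single jar you can pass to spark-submit as usual.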