Hadoop is really bad at handling small files: the framework is heavy and was not designed to work with them.
We process social signals, which means tons of small JSON documents from the Twitter, Facebook, and Google Plus APIs. To improve overall performance, we mainly use two techniques:
Buffer all the information in an intermediate process and consolidate those small files into one bigger file. We use Apache Flume for this job, and the only drawback is that we lose real-time processing, which has to be solved in our speed layer. Someday I’ll post about it.
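To make the consolidation idea concrete, here is a minimal sketch in plain Java. It is not how Flume does it, and the class and method names (SmallFileConsolidator, consolidate) are made up for illustration. It assumes each small file holds one valid JSON document and that the output file lives outside the input directory.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only (not part of Flume or Hadoop).
class SmallFileConsolidator {

    // Merges every *.json file in inputDir into a single output file,
    // one JSON document per line. Returns the number of files merged.
    // Assumption: output is not located inside inputDir.
    static int consolidate(Path inputDir, Path output) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(inputDir, "*.json")) {
            for (Path p : stream) {
                files.add(p);
            }
        }
        Collections.sort(files); // deterministic order, sorted by path

        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path file : files) {
                String json = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                // Valid JSON never has raw line breaks inside string values,
                // so collapsing them keeps each document on one line.
                writer.write(json.replaceAll("\\s*[\\r\\n]+\\s*", " ").trim());
                writer.newLine();
            }
        }
        return files.size();
    }
}
```

The point of the one-document-per-line layout is that a single big file can still be split on line boundaries, so one map task reads thousands of records instead of Hadoop scheduling one task per tiny file.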
Improve JVM reuse by setting the variable mapred.job.reuse.jvm.num.tasks. It doesn’t do miracles, but it helps. This variable is set to 20 in Amazon Elastic MapReduce. A value of -1 means unlimited reuse within a single job, and 1 means the JVM is not reused (each task gets a fresh JVM).
You can start a new JVM for every task, which provides better task isolation, or you can share JVMs between tasks, providing lower framework overhead. If you are processing many small files, it makes sense to reuse the JVM many times to amortize the cost of start-up. However, if each task takes a long time or processes a large amount of data, then you might choose to not reuse the JVM to ensure all memory is freed for subsequent tasks.
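To get a feel for how much reuse can amortize start-up cost, here is a rough back-of-the-envelope model in Java. It is a deliberate simplification: it assumes tasks run one after another on a single slot, and the class name JvmReuseModel and all the numbers you feed it are hypothetical, not measurements from our cluster.

```java
// Hypothetical model, for illustration only (not a Hadoop API).
class JvmReuseModel {

    // Number of JVMs started to run `tasks` tasks sequentially on one slot.
    // reuse mirrors mapred.job.reuse.jvm.num.tasks: -1 means unlimited reuse,
    // 1 means a fresh JVM per task.
    static long jvmLaunches(long tasks, int reuse) {
        if (tasks <= 0) {
            return 0;
        }
        if (reuse == -1) {
            return 1;
        }
        if (reuse < 1) {
            throw new IllegalArgumentException("reuse must be -1 or >= 1");
        }
        return (tasks + reuse - 1) / reuse; // ceiling division
    }

    // Total milliseconds: start-up cost per launched JVM plus work per task.
    static long totalMillis(long tasks, int reuse, long startupMillis, long taskMillis) {
        return jvmLaunches(tasks, reuse) * startupMillis + tasks * taskMillis;
    }
}
```

With, say, 10,000 small-file tasks, a 1 second JVM start-up, and 200 ms of real work per task, calling totalMillis with reuse values of 1, 20, and -1 shows start-up dominating when nothing is reused and becoming a small fraction once JVMs are shared. In the classic mapred API, the property itself can be set on a job through JobConf.setNumTasksToExecutePerJvm.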