Compressing hippos really fast

Lee D. Rothstein l1ee057@veritech.com
Tue Mar 4 18:57:00 GMT 2008


Sounds like a case for data de-duplication. Google "data de-duplication"
for an array of vendors.
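
For anyone unsure what de-duplication means here, a minimal sketch in
Python. It assumes fixed-size 4096-byte chunks keyed by SHA-256; the
names dedupe, restore and CHUNK_SIZE are made up for illustration and
are not taken from any vendor's product.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size, chosen only for illustration


def dedupe(data, chunk_size=CHUNK_SIZE):
    """Split data into fixed-size chunks and keep each distinct chunk once.

    Returns (store, recipe): store maps a SHA-256 hex digest to the chunk
    bytes, recipe is the ordered list of digests needed to rebuild data.
    """
    store, recipe = {}, []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # first copy wins, repeats are skipped
        recipe.append(digest)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original bytes from the chunk store and the recipe."""
    return b"".join(store[digest] for digest in recipe)
```

Only the distinct chunks in store need to reach the backup medium; the
recipe is what lets you put the data back together.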

Phil Betts wrote:
> Corinna Vinschen wrote on Tuesday, March 04, 2008 3:43 PM:
>
>   
>> Hi,
>>
>>
>> does anybody know of a compression tool which is, above all, capable
>> of compressing really fast?  The compression ratio is only a mild
>> concern; it's more important that the tool does not act as a
>> bottleneck when compressing files which compress badly.
>> Unfortunately, the usual compression tools care more about a good
>> compression ratio than about good speed when streaming lots of data.
>>
>> Here are a couple of disks which are supposed to be backed up.  Right
>> now this is done using a script which creates tar.gz archives of all
>> disks.  Some of these disks are quite big and contain many files which
>> are already compressed.  It turns out that gzipping these disks is
>> *the* bottleneck when backing up.
>>
>> When not compressing, tar creates archives at 37MB/s.  When creating
>> tar.gz archives, the compression takes so much time that the speed
>> goes down to 6MB/s.  Using gzip --fast doesn't help much.  bzip is a
>> lot slower than gzip.
>>
>> So the question is, does anybody know of a compression tool which can
>> be used with tar and doesn't slow down the backup by a factor of 6?
>> It would be cool to have a tool which is as quick as the hardware
>> compression used in modern tape drives, but that's just dreaming...
>>
>>
>> May the hippos be with you,
>> Corinna
>>     
>
> I had this problem ages ago.  My solution was to run two backups:
> one uncompressed, including only files matching *.gz, *.t[bg]z, *.[zZ],
> *.bz2, *.zip etc., and one for the remainder, which was piped
> through gzip.
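
A rough sketch of that two-archive split in Python, using the glob
patterns from Phil's message. The function names and the two output
paths are hypothetical, and compresslevel=1 is just my guess at a
reasonable speed setting, not something Phil specified.

```python
import fnmatch
import tarfile

# Glob patterns copied from the message above; extend to taste.
COMPRESSED_PATTERNS = ["*.gz", "*.t[bg]z", "*.[zZ]", "*.bz2", "*.zip"]


def is_precompressed(name):
    """Return True if the file name matches one of the patterns."""
    return any(fnmatch.fnmatchcase(name, pat) for pat in COMPRESSED_PATTERNS)


def split_paths(paths):
    """Partition paths into (already compressed, everything else)."""
    packed, rest = [], []
    for path in paths:
        (packed if is_precompressed(path) else rest).append(path)
    return packed, rest


def write_backups(paths, plain_tar, gzip_tar):
    """Write a plain tar for packed files and a tar.gz for the remainder."""
    packed, rest = split_paths(paths)
    with tarfile.open(plain_tar, "w") as tar:  # no compression at all
        for path in packed:
            tar.add(path)
    with tarfile.open(gzip_tar, "w:gz", compresslevel=1) as tar:
        for path in rest:
            tar.add(path)
```

Matching on the file name alone is cheap but blunt: a compressed file
with an unusual suffix still ends up in the gzip pass.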
>
> Even a fast compression algorithm is just wasting time trying to
> compress previously compressed files.  And as most compressors work
> on some variant of Lempel-Ziv, if they're fed a mixture of
> compressible and incompressible data, the incompressible data
> flushes the dictionary, making the compression of the compressible
> part worse.
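
The first half of that point is easy to check for yourself. A small
sketch with Python's zlib module at its fastest level, assuming random
bytes are a fair stand-in for already compressed payloads. It does not
try to measure the dictionary-flushing effect, and deflate_ratio is a
name I made up.

```python
import os
import zlib


def deflate_ratio(data, level=1):
    """Return compressed size divided by original size (zlib, given level)."""
    return len(zlib.compress(data, level)) / len(data)


# Highly repetitive input versus random bytes of the same length.
text_like = b"May the hippos be with you.\n" * 40000
random_like = os.urandom(len(text_like))

print("repetitive:", round(deflate_ratio(text_like), 3))
print("random:    ", round(deflate_ratio(random_like), 3))
```

The repetitive input shrinks to a small fraction of its size, while
the random input comes out no smaller than it went in, so every CPU
cycle spent on it is wasted.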
>
> Phil
>
>   


