[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: bzip2 1.0.7 released



Hi Julian,

On Fri, 2019-06-28 at 08:08 +0200, Julian Seward wrote:
> I see the following, in the 1.0.6 sources:
> 
> * bzlib_private.h:#define BZ_MAX_SELECTORS (2 + (900000 / BZ_G_SIZE))
> 
> * compress.c has
>     AssertH( nSelectors < 32768 &&
>              nSelectors <= (2 + (900000 / BZ_G_SIZE)),
>              3003 );
> 
> * decompress.c has (well, wants to have)
>        if (nSelectors < 1 || nSelectors > BZ_MAX_SELECTORS)
> 
> Mark observes this:
> 
>  > So it looks like some implementations might add more selectors than
>  > necessary. For example lbzip2 seems to use a max of 18000 + 1 + 7.  Which
>  > might explain why our 18002 = 2 + (900000 / 50) isn't enough, and why my
>  > random increase of 5 seemed to work for the given file.
> 
> Seems plausible, except that (18000 + 1 + 7) - (2 + (900000 / 50)) == 6,
> not 5.  So we'd have to increase the number by 6, not 5.

:} off-by-one...

> IIUC (which I may not), bzip2-1.0.7 is internally self-consistent, in that it
> won't create more than 18002 selectors, and it also won't accept more than
> 18002 selectors.  But some lbzip2 implementations can create up to 18008
> selectors (per Mark's comments), and that is what is causing the problem here.

Yes. if I understand correctly, then 18002 is also the theoretical
limit of the number of selectors. Being 2 (a zero and end of block
marker) plus (900000 / BZ_G_SIZE), the maximum block size, because you
cannot have more symbols than fit in a block, divided by BZ_G_SIZE
(50), the minimum group of symbols per tree.

But some implementations might "round up" that number. And I see some
implementations do accept the theoretical limit of 32768 (being 2^15).
But that seems impossible, and is just an upper limit because 18002
doesn't fit in 14 bits, so you have to use 15 bits to encode the number
of selectors.

> I would propose the following fix for discussion:
> 
> * split BZ_MAX_SELECTORS into two different values:
> 
>    - BZ_MAX_SELECTORS_ENC = (2 + (900000 / BZ_G_SIZE)) [= 18002], the max
>      number of selectors that bzip2 can create ("encode")
> 
>    - BZ_MAX_SELECTORS_DEC = BZ_MAX_SELECTORS_ENC + 6 [= 18008], the max number
>      of selectors that we'll accept during decoding, and add a comment
>      explaining that the difference is due to buggy lbzip2/whatever creating
>      more than BZ_MAX_SELECTORS_ENC
> 
> * use BZ_MAX_SELECTORS_ENC to dimension the arrays in struct EState
> 
> * use BZ_MAX_SELECTORS_DEC to dimension the arrays in struct DState, and
>    for the new decompress.c range check
> 
> * change the compress.c assertion
> 
>     AssertH( nSelectors < 32768 &&
>              nSelectors <= (2 + (900000 / BZ_G_SIZE)),
>              3003 );
> 
>     to actually mention BZ_MAX_SELECTORS_ENC directly, instead of
>     (2 + (900000 / BZ_G_SIZE)), [which is embarrassingly lame]
> 
> It seems to me to be important to now split BZ_MAX_SELECTORS into these two
> parts so as to make it clear to everybody that we're accepting (decompressing)
> a slightly larger set of inputs than we create (a la that old saying about
> network protocol implementations), so as to tolerate other compressors.

That seems good. The attached patch does this and makes it possible to
decode the problematic bz2 file.

> That then leaves the questions:
> 
> * Why did it work before?  I suspect Federico's guess that "I think the len
>    arrays are still uninitialized and will get filled until later" is correct,
>    but I'd need to look more at that.

Yes, I think Federico is correct, the len array comes directly after
the selectorMtf array and is being filled in after the selectorMtf
array overflow has been written. It is also large enough (BZ_N_GROUPS *
BZ_MAX_ALPHA_SIZE = 6 * 258 = 1548 UChars) so would have caught a small
overflow. The extra selectors are also never really used, since they
were just padding beyond the theoretical max. So the file would decode
correctly.

> * How to verify any change?  For whatever fix we arrive at, I'd like to do
>    some test runs over big trees of files (several tens of GB) with both
>    compressor and decompressor compiled with asan (and/or ubsan), since I
>    believe at least asan can detect overruns of the statically sized arrays
>    within the EState and DState structs -- V can't.

I couldn't get gcc -fsanitize=address to find the overwrite in the
subtle case. But the original report was for a fuzzed file. With a
ridiculously large selector value it should be easy to show (even under
valgrind).

>    Per communication with Federico (which didn't go to Mark, unfortunately), I
>    have a C driver program that compresses and decompresses every file in a
>    tree and checks it is the same as the original.  But I don't know what the
>    copyright status of it is and so I'm not happy to redistribute it (alas).
>    We should try to replace it.
> 
>    In particular, the three test files in the tarball merely serve to verify
>    that the build didn't fail in some obvious way.  They are in no way a
>    comprehensive test set.

Yes, having a more comprehensive test set would be great.
Especially having file encoded with other bzip2 encoders.

Thanks,

Mark
From 8c99340e3cb4538b91bc508ebfe778f7de2174b1 Mon Sep 17 00:00:00 2001
From: Mark Wielaard <mark@klomp.org>
Date: Fri, 28 Jun 2019 12:38:41 +0200
Subject: [PATCH] Be more liberal in the number of selectors we accept when
 decoding.

As proposed by Julian Sewart:

* split BZ_MAX_SELECTORS into two different values:

  - BZ_MAX_SELECTORS_ENC = (2 + (900000 / BZ_G_SIZE)) [= 18002], the max
    number of selectors that bzip2 can create ("encode")

  - BZ_MAX_SELECTORS_DEC = BZ_MAX_SELECTORS_ENC + 6 [= 18008], the max
    number of selectors that we'll accept during decoding, and add a
    comment explaining that the difference is due to buggy lbzip2/whatever
    creating more than BZ_MAX_SELECTORS_ENC

* use BZ_MAX_SELECTORS_ENC to dimension the arrays in struct EState

* use BZ_MAX_SELECTORS_DEC to dimension the arrays in struct DState,
  and for the new decompress.c range check

* change the compress.c assertion

  AssertH( nSelectors < 32768 &&
           nSelectors <= (2 + (900000 / BZ_G_SIZE)),
           3003 );

  to actually mention BZ_MAX_SELECTORS_ENC directly, instead of
  (2 + (900000 / BZ_G_SIZE)), [which is embarrassingly lame]
---
 bzlib_private.h | 15 ++++++++++-----
 compress.c      |  2 +-
 decompress.c    |  2 +-
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/bzlib_private.h b/bzlib_private.h
index 7975552..6208ba1 100644
--- a/bzlib_private.h
+++ b/bzlib_private.h
@@ -122,7 +122,12 @@ extern void bz_internal_error ( int errcode );
 #define BZ_G_SIZE   50
 #define BZ_N_ITERS  4
 
-#define BZ_MAX_SELECTORS (2 + (900000 / BZ_G_SIZE))
+/* The max number of selectors that bzip2 can create ("encode") [= 18002] */
+#define BZ_MAX_SELECTORS_ENC (2 + (900000 / BZ_G_SIZE))
+/* The max number of selectors that bzip2 accept during decoding [= 18008]
+   This is larger than BZ_MAX_SELECTORS_ENC because some implementations,
+   might round up the number of selectors to a factor of 8. */
+#define BZ_MAX_SELECTORS_DEC (BZ_MAX_SELECTORS_ENC + 6)
 
 
 
@@ -253,8 +258,8 @@ typedef
       /* stuff for coding the MTF values */
       Int32    nMTF;
       Int32    mtfFreq    [BZ_MAX_ALPHA_SIZE];
-      UChar    selector   [BZ_MAX_SELECTORS];
-      UChar    selectorMtf[BZ_MAX_SELECTORS];
+      UChar    selector   [BZ_MAX_SELECTORS_ENC];
+      UChar    selectorMtf[BZ_MAX_SELECTORS_ENC];
 
       UChar    len     [BZ_N_GROUPS][BZ_MAX_ALPHA_SIZE];
       Int32    code    [BZ_N_GROUPS][BZ_MAX_ALPHA_SIZE];
@@ -399,8 +404,8 @@ typedef
       /* for decoding the MTF values */
       UChar    mtfa   [MTFA_SIZE];
       Int32    mtfbase[256 / MTFL_SIZE];
-      UChar    selector   [BZ_MAX_SELECTORS];
-      UChar    selectorMtf[BZ_MAX_SELECTORS];
+      UChar    selector   [BZ_MAX_SELECTORS_DEC];
+      UChar    selectorMtf[BZ_MAX_SELECTORS_DEC];
       UChar    len  [BZ_N_GROUPS][BZ_MAX_ALPHA_SIZE];
 
       Int32    limit  [BZ_N_GROUPS][BZ_MAX_ALPHA_SIZE];
diff --git a/compress.c b/compress.c
index 237620d..9b660f8 100644
--- a/compress.c
+++ b/compress.c
@@ -454,7 +454,7 @@ void sendMTFValues ( EState* s )
 
    AssertH( nGroups < 8, 3002 );
    AssertH( nSelectors < 32768 &&
-            nSelectors <= (2 + (900000 / BZ_G_SIZE)),
+            nSelectors <= BZ_MAX_SELECTORS_ENC,
             3003 );
 
 
diff --git a/decompress.c b/decompress.c
index 20ce493..d24a052 100644
--- a/decompress.c
+++ b/decompress.c
@@ -287,7 +287,7 @@ Int32 BZ2_decompress ( DState* s )
       GET_BITS(BZ_X_SELECTOR_1, nGroups, 3);
       if (nGroups < 2 || nGroups > BZ_N_GROUPS) RETURN(BZ_DATA_ERROR);
       GET_BITS(BZ_X_SELECTOR_2, nSelectors, 15);
-      if (nSelectors < 1 || nSelectors > BZ_MAX_SELECTORS) RETURN(BZ_DATA_ERROR);
+      if (nSelectors < 1 || nSelectors > BZ_MAX_SELECTORS_DEC) RETURN(BZ_DATA_ERROR);
       for (i = 0; i < nSelectors; i++) {
          j = 0;
          while (True) {
-- 
1.8.3.1