This is the mail archive of the guile@cygnus.com mailing list for the guile project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Outline for Guile Generational GC


I read your most excellent notes on Guile's generational gc.  I
started thinking about all the issues you raised and came up an
outline for a possible implementation.  I am not very familiar with
most of Guile so there may be some gaping holes in what follows.

Guile's non-copying gc mode prevents the traditional technique of
associating a generation with a heap segment.  Hence, generational
information must be recorded somewhere.  Adding it to the cell doesn't
seem feasible unless the tag structure is radically changed; even then
few bits would be available.  An obvious thought is to use a vector of
bytes (one for each cell) in parallel with each heap segment; subtracting
the base of the segment from a pointer and shifting right by log2 of
cell size gives an index into the vector.  However, finding the
segment base for an arbitrary pointer also means doing a lookup in a
segment address table which is somewhat clumsy, given that this must
be done for every pointer encountered in the mark phase.

I think the following scheme would make everything much less painless.
(The discussion assumes a 32-bit implementation; there is an obvious
64-bit analog.)  Consider a heap segment to be a collection of
super-cells each containing space for 8 contiguous cells.  The first
cell is not used as a cell but as an 8-byte vector, containing a byte
for each of the remaining cells (with the final byte open for other
use).  The byte corresponding to each cell (call it a Cell_Info) will
contain the generation number and some other flags discussed below.
The super-cells are aligned so that their addresses are a multiple of
0x40; this can always be done by wasting a few bytes in segments that
do not begin on a page boundary.

Now, given a pointer p to a cell we can derive a pointer to its
Cell_Info by:
  #define SCM_CELL_INFOP(p) ((p) & 0x3f)/8 + ((p) & -0x40)
  (some casting also required)
This code sequence should only take 4 to 6 instructions depending on
the processor architecture.

So what's in Cell_Info?  My thoughts go something like this:

struct Cell_Info {
   int generation:4;
   int car_newer:1;
   int car_older:1;
   int cdr_newer:1;
   int cdr_older:1;
   };

struct Super_Cell {
   Cell_Info cell_info[7];
   char any_out;
   Cell cells[7];
   };

The generation field is, of course, the cell's generation number.
Four bits is probably overkill; three might be enough, with the fourth
bit used for something else, like a pinned flag (because of reference
from the stack), but I'm not sure if this is useful to be kept
semi-permanently.  The car/cdr-newer/older flags are set by the write
barrier.  The newer or older flag is set when the pointer is to a
member of a younger or older generation respectively.  As an
optimization, we can set a flag in the any_out byte of the super-cell
to indicate (at least) one of the cells has a pointer outside its
generation.

The initial marking phase from the root set now has lots of
opportunities for tree pruning.  If a pointer goes outside the
generation being collected, it need not be followed for additional
marking.  Unfortunately, a second marking phase must scan the whole
heap for pointers into the generation being collected, but we can make
effective use of the Cell_Info bits to lighten the work load.  (The
normal copying collector wins big by not having to do this; this is
the price we have to pay for a non-copying collector.)

So the second mark phase looks something like:

  int n = generation_being_collected;
  Super_Cell *scp = initial_value;
  while (/* some condition */) {
    if (scp->any_out)
       for (i=0; i<7; ++i) {
         int sign = scp->cell_info[i].generation - n;
         if (sign < 0) {
            if (scp->cell_info[i].car_older) {
               SCM *the_car = SCM_CAR(scp->cells[i]);  /* this is safe! */
               if (SCM_CELL_INFOP(the_car)->generation == n)
                  mark(the_car);
               }
            if (scp->cell_info[i].cdr_older) {
               SCM *the_cdr = SCM_CDR(scp->cells[i]);
               if (SCM_CELL_INFOP(the_cdr)->generation == n)
                  mark(the_cdr);
               }
            }
          else if (sign > 0) {
            if (scp->cell_info[i].car_newer) {
               SCM *the_car = SCM_CAR(scp->cells[i]);
               if (SCM_CELL_INFOP(the_car)->generation == n)
                  mark(the_car);
               }
            if (scp->cell_info[i].cdr_newer) {
               SCM *the_cdr = SCM_CDR(scp->cells[i]);
               if (SCM_CELL_INFOP(the_cdr)->generation == n)
                  mark(the_cdr);
               }
          else
             ;   /* cell is in generation being collected; */
                 /* it's either already marked or will be marked by */
                 /* being pointed to by another generation or is garbage */
          }
       }
   ++scp;
   }

Finally, the sweep phase will do the usual work of scanning for cells
of the proper generation and unmarking them and either incrementing
their generation number or placing them on the free list.  Of course,
the any_out flags of any super cells with a newly collected cell must
be recomputed.

After thinking about the issues you raised in your notes, I think that
babies should be treated specially.  I would propose that they be
allocated in an "incubator" rather than in a normal generation heap
segment.  This incubator would not consist of super-cells but be
treated as a vector of cells.  Allocation would not require a free
list, but just testing for exhaustion of the area and incrementing a
pointer by 8.  The write barrier for babies would be handled by
keeping a list of pairs: the cell addresses that were stored into an
older generation cell and the address of that cell.  This way a full
secondary scan of the heap to find references to babies is not
required.  The final difference of the incubator is that live cells
that are not pinned are copied into the higher generation segments,
using the normal copying gc techniques and allocating from the free
list (but also updating older generation pointers from the write
barrier list).

If there are no pinned cells, the whole incubator is ready to be
reused as one contiguous allocation area; if there are n pinned cells,
then we have (at most) n+1 contiguous areas that can each be allocated
sequentially.  If we get too many pinned cells, we can just allocate a
new incubator, but I think that long-term pinned cells will be fairly
rare.

The advantages of the incubator are that we have a lot of locality
during allocation and during baby collection, which is the most
frequent gc activity.  Since the incubator will contain mostly garbage
(if the generational assumptions hold), the copying is minimal and we
can avoid building a free list for the incubator.  We may prematurely
copy some babies created at the end of the cycle that will soon become
garbage.  We also might have to allocate a new heap segment if the
normal free list is exhausted by the copying.  This affects the
croaking geezer problem by filling out free cells, but may end up
mixing generations too much.  As you point out, we can order the free
list as desired.  Identifying babies complicates the calculation of
the Cell_Info pointer (which doesn't exist for babies).  The secondary
mark phase described above doesn't need to worry about this as elders
never have their newer bits set when pointing to babies.  The root set
trace and the write barrier must do a range check against the
incubator bounds to see if a pointer is to a baby.

The write barrier by now is non-trivial, but it only needs to be dealt
with in set-car! and set-cdr! (and letrec?) which should be rare in
truly schemeful programming.  We need to determine the generations of
the cell being modified and the cell being referenced (if the pointer
is to a pair). If the referenced cell is a baby and the modified cell
is not a baby, we add the two cell addresses to the incubator list; if
both are non-babies and of different generations, we set the
car/cdr-newer/older flags as appropriate and the any-out flag.

IMHO the heap segment allocation strategy needs to be modified from
the current Guile for a generational gc.  It starts out by allocating
32k cells and then triples the heap on each further allocation (as
long the OS will agree).  While selecting the appropriate parameters
will require lots of tuning, I would think the growth should be more
linear.  We are gc-ing the incubator fairly frequently so the normal
heap segments should stay more in line with longer-term storage needs.

There is no particular reason that the incubator should be the same
size as a heap segment; it should be sized so that it can be gc-ed in
a human-imperceptible amount of time.  If we provide a gc-blocking
mode that continues collecting the incubator, and doesn't try to
collect older generations, but just allocates new heap segments as
necessary, we would have a servicable semi-incremental collector that
might satisfy MDJ's needs for smooth GUI performance.  This would only
work if there were eventually some pause or restart points in
applications when normal gc could be resumed or there were very few
non-baby pairs surviving.

I don't have a strong feeling about scheduling higher generation
collections; since we are not copying them, there is no area to fill up
to trigger a collection.  You suggested a frequency of 1/(n+1), which
seems reasonable.  Perhaps we need some sort of adaptive scheduling
which pays attention to the rates of change of the sizes of various
generations.

I see two scenarios that most long-running applications will have some
mixture of: 1) a substantial data base is built up during a startup
phase and then a steady state phase is entered and various numbers of
cells are allocated and most soon become garbage, with some minimal
growth of the long term data; and 2) the application operates in a
cyclical manner, accumulating data for a medium-long time period,
after which a great deal of garbage is suddenly produced and the cycle
repeats.  In the first scenario spending a lot of time in higher
generations is relatively wasteful, while it is essential in the
second.  Probing for which generations are changing size by running
gcs shouldn't affect long-running programs too much.  If they aren't
long-running it doesn't really matter.

Well, it seems no meaningful discussion of gc can be short.  I hope
some of these ideas fit in with your current thinking.  Sorry if my
presentation is too pedantic, but I have to work things out well
enough that they don't seem completely idiotic or wishful.

Dale Jordan.