This is the mail archive of the guile@sources.redhat.com mailing list for the Guile project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

String representation proposal


The following proposal has the probllem that it isn't thread safe.
Can anyone come up with an alternative which is thread-safe?

Implementation of shared substrings with fresh-copy semantics
=============================================================

Version: $Id: sharedstr.text,v 1.1 2000/08/26 20:55:21 mdj Exp $

Background
----------

In Guile, most string operations work on two other data types apart
from strings: shared substrings and read-only strings (which includes
symbols).  One of Guile's sub-goals is to be a scripting language in
which string management is important.  Read-only strings and shared
substrings were introduced in order to reduce overhead in string
manipulation.

We now want to simplify the Guile API by removing these two data
types, but keeping performance by allowing ordinary strings to share
storage.

The idea is to let operations like `symbol->string' and `substring'
return a pointer into the original string/symbol, thus avoiding the
need to copy the string.

Two of the problems which then arise are:

* If s2 is derived from s1, and therefore share storage with s1, a
  modification to either s1 or s2 will affect the other.

* Guile is supposed to interact closely with many UNIX libraries in
  which the NUL character is used to terminate strings.  Therefore
  Guile strings contain a NUL character at the end, in addition to the
  string length (the latter of which is used by Guile's string
  operations).

The solutions to these problems are to

* Copy a string with shared storage when it's modified.

* Copy a string with shared storage when it's being used as argument
  to a C library call.  (Copying implies inserting an ending NUL
  character.)

But this leads to memory management problems.  When is it OK to free
a character array which was allocated for a symbol or a string?

Abstract description of proposed solution
-----------------------------------------

Definitions

  STRING = <TYPETAG, LENGTH, CHARRECORDPTR, CHARPTR>

  SYMBOL = <TYPETAG, LENGTH, CHARRECORDPTR, CHARPTR>

  CHARRECORD = <PHASE, SHAREDFLAG, CHARS>

  PHASE = black | white

  SHAREDFLAG = private | shared

  CHARS is a character array

  CHARPTR points into it

Memory management

A string or symbol is initially allocated with its contents stored in
a character array in a character record.  The string/symbol header
contains a pointer to this record.  The initial value of the shared
flag in the character record is `private'.

The GC mark phases alternate between black and white---every second
phase is black, the rest are white.  This is used to distinguish
whether a character record has been encountered before:

During a black mark phase, when the GC encounters a string or symbol,
it changes the PHASE and SHAREDFLAG marks of the corresponding
character record according to the following table:

  <white, private> --> <black, private>   (white => unconditionally
  <white, shared>  --> <black, private>    set to <black, private>)
  <black, private> --> <black, shared>    (SHAREDFLAG changed)
  <black, shared>  --> <black, shared>    (no change)

The behaviour of a white phase is quivalent with the color names
switched.

The GC sweep phase frees any unmarked string or symbol header and
frees its character record either if it is marked with the "wrong"
color (not matching the color of the last mark phase) or if its
SHAREDFLAG is `private'.

Copy-on-write

An attempt at mutating string contents leads to copying if SHAREDFLAG
is `shared'.  Copying means making a copy of the character record and
mutating the CHARRECORDPTR and CHARPTR fields of the object header to
point to the copy.

Substring operation

When making a substring, a new string header is allocated, with new
contents for the LENGTH and CHARPTR fields.

Implementation details
----------------------

* We store the character record consecutively with the character
  array and lump the PHASE and SHAREDFLAG fields together into one
  byte containing an integer code for the four possible states of the
  PHASE and SHAREDFLAG fields.  Another way of viewing it is that
  these fields are represented as bits 1 and 0 in the "header" of the
  character array.  We let CHARRECORDPTR point to the first character
  position instead of on this header:

  CHARRECORDPTR
   |
   V
  FCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

  F = 0, 1, 2, 3
   
* We represent strings as the sub-types `simple-string' and
  `substring'.

* In a simple string, CHARRECORDPTR and CHARPTR are represented by a
  single pointer, so a `simple-string' is an ordinary heap cell with
  TYPETAG and LENGTH in the CAR and CHARPTR in the CDR.

* substring:s are represented as double cells, with TYPETAG and LENGTH
  in word 0, CHARRECORDPTR in word 1 and CHARPTR in word 2
  (alternatively, we could store an offset from CHARRECORDPTR).

Problems with this implementation
---------------------------------

* How do we make copy-on-write thread-safe?  Is there a different
  implementation which is efficient and thread-safe?

* If small substrings are frequently generated from large, temporary
  strings and the small substrings are kept in a data structure, the
  heap will still have to host the large original strings.  Should we
  simply accept this?

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]