This is the mail archive of the guile@cygnus.com mailing list for the guile project.


Re: improving read-line


> Date: 12 Dec 1997 06:52:11 -0000
> From: Gary Houston <ghouston@actrix.gen.nz>
> CC: guile@cygnus.com
> Sender: owner-guile@cygnus.com
> 
> | Date: Thu, 11 Dec 1997 17:40:57 -0500 (EST)
> | From: Tim Pierce <twp@skepsis.com>
> | 
> | That means that if you've rebound
> | scm-line-incrementors in order to change the behavior of read-line,
> | you'll have to call read-delimited instead.
> 
> Setting scm-line-incrementors doesn't seem good enough anyway: it
> wouldn't cope well with lines terminated by \r\n. fgets can, but only
> after recompiling on a system where that's the norm.

I know.  For that reason, I didn't think it was likely that anyone was
doing this.  But it's possible (someone may have set it to "\t" to
read tab-delimited records).
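
For anyone who does want tab-delimited records, here is a minimal
sketch of the read-delimited route.  It assumes a current Guile, where
read-delimited is exported by the (ice-9 rdelim) module; the helper
name read-tab-fields is hypothetical and only there for illustration.

```scheme
;; Assumption: a current Guile, where read-delimited lives in (ice-9 rdelim).
(use-modules (ice-9 rdelim))

;; Hypothetical helper: collect every tab-delimited field from PORT.
(define (read-tab-fields port)
  (let loop ((fields '()))
    ;; The default handle-delim is 'trim, so the tab itself is discarded.
    (let ((field (read-delimited "\t" port)))
      (if (eof-object? field)
          (reverse fields)
          (loop (cons field fields))))))

(define fields
  (call-with-input-string "alpha\tbeta\tgamma" read-tab-fields))
;; fields => ("alpha" "beta" "gamma")
```

Passing the delimiter set explicitly keeps the behavior local to the
call, rather than depending on a rebound global.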

> It does seem like something that should be configurable in some way.
> It's like a trivial case of the general external character encoding
> problem (support JIS, BIG5, ISO-10646 etc., too if it seems
> convenient).

Well, I'm not (presently) proposing that we change anything about
`%read-delimited!', so all of the functionality will still be there.

> | While I'm at it, I'd also like to get rid of the `split' and `peek'
> | arguments to read-line.  These are a pain in the neck to implement and
> | don't seem to be very useful if you're only reading newline-delimited
> | records.
> 
> You didn't suggest abolishing 'concat and 'trim.  Are the other two
> really such a pain in the neck to implement?

'concat is easy, because it's the default behavior of fgets.  Beyond
that, anything that involves removing the delimiter is a hassle, since
the last line of a file may not be newline-terminated.  In order to
decide whether the last character of the string must be removed, we
need to examine it to make sure it's indeed a delimiter.  (This is
where %read-delimited! has an advantage over fgets, because it stores
data in a buffer and returns the delimiter it found.)  In the case of
'trim, it's easy to throw it away, but 'split and 'peek want us to
hang on to the delimiter, and do various things depending on whether
it is an #<eof> object or a newline... it's nasty.

This is the code I wrote before deciding that the unwieldiness
probably outweighs the utility of the 'split or 'peek arguments.  Of
course, the problem may be more that I'm a naive Scheme programmer.

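(define (read-line . args)
  ;; Optional arguments: the port (default: current input port) and the
  ;; delimiter-handling mode (default: 'trim).
  (let* ((port          (if (null? args)
                            (current-input-port)
                            (car args)))
         (handle-delim  (if (> (length args) 1)
                            (cadr args)
                            'trim))
         ;; Like fgets, %read-line here keeps the trailing newline, if any.
         (line          (%read-line port)))
    (cond ((eof-object? line) (if (eq? handle-delim 'split)
                                  (cons line line)
                                  line))
          ((eq? handle-delim 'concat) line)
          (else
           (let* ((last-pos (1- (string-length line)))
                  (terminator (string-ref line last-pos))
                  (trunc-line line))
             ;; Strip the newline if there is one; otherwise the line ended
             ;; at end of file, so read the #<eof> object as the terminator.
             (if (char=? #\newline terminator)
                 (set! trunc-line (substring line 0 last-pos))
                 (set! terminator (read-char port)))
             (case handle-delim
               ((trim) trunc-line)
               ((split) (cons trunc-line terminator))
               ((peek) (begin
                         ;; Only a real character can be pushed back.
                         (if (char? terminator)
                             (unread-char terminator port))
                         trunc-line))
               (else
                (error "unexpected handle-delim value: " handle-delim))))))))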
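
To make the four delimiter modes discussed above concrete, here is a
small sketch of what each one returns, including the awkward case of a
last line with no trailing newline.  It assumes the read-line shipped
in a current Guile's (ice-9 rdelim) module, not the draft above; the
helper names first-line and last-line are hypothetical.

```scheme
;; Assumption: a current Guile, where read-line lives in (ice-9 rdelim).
(use-modules (ice-9 rdelim))

;; Hypothetical helper: read the first line of "one\ntwo" in the given mode.
(define (first-line handle-delim)
  (call-with-input-string "one\ntwo"
    (lambda (port) (read-line port handle-delim))))

;; Hypothetical helper: skip "one", then read the unterminated last line.
(define (last-line handle-delim)
  (call-with-input-string "one\ntwo"
    (lambda (port)
      (read-line port 'trim)
      (read-line port handle-delim))))

;; 'peek returns the line and leaves the delimiter in the port, so the
;; next read-char sees the newline.
(define peeked
  (call-with-input-string "one\ntwo"
    (lambda (port)
      (let* ((line (read-line port 'peek))
             (next (read-char port)))
        (cons line next)))))

;; (first-line 'trim)   => "one"
;; (first-line 'concat) => "one\n"
;; (first-line 'split)  => ("one" . #\newline)
;; peeked               => ("one" . #\newline)
;; (last-line 'concat)  => "two"
;; (last-line 'split)   => ("two" . #<eof>)
```

The last two results show why the unterminated final line is the
troublesome case: there is no character to append, return, or push
back, only the eof object.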

> Can you not just do an
> SCM_CUNGET after the fgets for the 'peek case?  The extra options
> don't seem essential, but they give compatibility with scsh's
> read-line.

Sigh.  I didn't know that all of this was inherited from scsh, and
don't want to break compatibility.

If Shivers really has us whupped on I/O performance, we should
probably try to figure out why, since our implementation is pretty
much derived from his.

> %read-line seems to have a bug when handling the NUL character:

Thanks for pointing this out.  If we decide to go with %read-line,
I'll see if it can be fixed.

> Perhaps it would be useful to optimise read-line!, which is often the
> one to use if you care about speed.  Scsh doesn't have it.  The SCM
> version would be faster than the current implementation, but I'm not
> sure by how much.

I'll look at this, too.  But I would really like read-line (the most
natural and obvious interface to use) to be a good, fast I/O
implementation.

thanks for the suggestions.

love, T.