This is the mail archive of the xsl-list@mulberrytech.com mailing list .



n level grouping, possibly a design pattern.


This is a retrospective view of a small project I undertook
which makes heavy use of multi-level grouping.

The overall requirement is to sort a flat data file of books, derived
from a bibliographic database marked up in MARC record format, into
 category
  subcategories
   by first letter of the author's surname
     by author beginning with this letter
       all other books by the same author.

So I suppose it's a 5 level grouping exercise.


The source data was a CSV dump of some database records.
After a little Perl-like processing, the input data appeared
as a series of records r with internal fields f. Some fields
have further divisions, each introduced by a leading dollar and a code,
e.g. $aXXX is the dollar a code with content XXX.

<r>
<f>024050</f>
<f>6390</f>
<f>Science fiction and fantasy</f>
<f></f>
<f>1</f>
<f>09:00</f>
<f></f>
<f>Sequel to : Over sea, under stone</f>
<f></f>
<f>$aCooper$hSusan</f>
<f>$aScience Fiction and Fantasy$xJunior$xLate Twentieth
Century$xFiction$xEnglish Literature</f>
<f>Gabriel Woolf</f>
<f>$aOnly four days to Christmas so why can't Will feel Happy? Things seem
out of key this year, as if someone is trying to tell him something in a
language he cannot understand. Animals show fear when he comes by, and what
makes that tramp take one look athim and scuttle off in such terror?</f>
<f></f>
<f>$aThe dark is rising.$eby Susan Cooper</f>
<f>$aHarmondsworth$bPuffin Books$c1976</f>
<f>$ar19761973$ben$cj$e0$f0$g0$jf$leng$tb$zSCI$s3</f>
<f>$aThe dark is rising$v2</f>
<f>503.00</f>
<f>$aSequel to : Over sea, under stone</f>
<f></f>
<f></f>
<f></f>
<f></f>
</r>

After many changes to the input format, I designed an interpretation
schema for the records, to

Select the wanted fields and dump the unwanted ones
Interpret the internal dollar codes
Mark up the data more correctly, by providing the
appropriate tag to be used for each field.

<def:structure>
  <bibno>k</bibno> <!--  Bib number-->
  <cat-num>k</cat-num> <!--  Talking Book number -->
  <TBcategory>k</TBcategory> <!--  TB category-->
  <comments>d</comments> <!--  No idea, tb related-->
  <num-cassettes>d</num-cassettes> <!--  No of tapes-->
  <play-time>k</play-time> <!--  Duration of play time-->
  <rdr-gender>k</rdr-gender> <!--  Gender of reader M/F-->
  <seq>k</seq> <!--  Sequel to: title -->
  <wng>k</wng> <!--  Bad lang etc-->
  <m100>k</m100> <!--  -->
  <m655>k</m655> <!-- subject category -->
  <readers>k</readers> <!--  List of narrators-->
  <m513>k</m513> <!-- blurb -->
  <m700>k</m700> <!-- author  -->
  <m245>k</m245> <!--  -->
  <m260>k</m260> <!-- publisher -->
  <m008>k</m008> <!-- for childrens cat -->
  <m440>k</m440> <!-- Series field -->
  <ex1t>k</ex1t>
  <ex1>k</ex1>
  <ex2t>k</ex2t>
  <ex2>k</ex2>
  <ex3t>k</ex3t>
  <ex3>k</ex3>
</def:structure>
  

This structure identifies the fields I want to keep (k) and those I want
to dump (d). Its children follow the sequence of fields in the input data,
so if processing needs change I can modify both the sequence and the
keep/dump flags. The element in the tenth position names the tag I want
to use for the tenth field.

Since the input data was around 12 MB, dumping the unwanted fields also
reduces the file size and permits use of Saxon's small processing model.


<xsl:template match="f">
  <xsl:variable name="posn" select="position()"/>
  <xsl:variable name="keep-dump">
    <xsl:value-of select="document('')//def:structure/*[$posn]"
      xmlns:def="http://rnib.org.uk/tb#"/>
  </xsl:variable>
  <xsl:variable name="tag"
    select="name(document('')//def:structure/*[$posn])"
    xmlns:def="http://rnib.org.uk/tb#"/>

  <xsl:if test="$keep-dump = 'k' ">
    <xsl:element name="{$tag}">
      <xsl:choose>
        <xsl:when test="contains(.,'$') ">
          <xsl:call-template name="dollarCodes">
            <xsl:with-param name="field" select="$tag"/>
          </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="."/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:element>
  </xsl:if>

</xsl:template>
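
The position() lookup above only lines up with def:structure if the f
template is applied to the f elements alone. A minimal sketch of a driving
template follows. The template itself is my assumption (it is not shown in
the original stylesheet), although the later stages do expect r elements
in the output.

```xml
<!-- Hypothetical driver for the f template; not part of the original post. -->
<xsl:template match="r">
  <r>
    <!-- select="f" keeps whitespace text nodes out of the current node
         list, so position() is 1 for the first field, 2 for the second,
         and so on, matching the order of def:structure's children -->
    <xsl:apply-templates select="f"/>
  </r>
</xsl:template>
```

A bare xsl:apply-templates with no select would also count the whitespace
text nodes between the f elements, unless they are removed with
xsl:strip-space, and the positions would then no longer match.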

 
The call to the dollarCodes template resolves the subfields based on the
field name, obtained from the <def:structure> </def:structure> contents,
using saxon:tokenize.
If the field name is unknown, this allows for exception processing. An
example is shown for processing one of the dollar coded fields. Each field
has a different interpretation.


<xsl:template name="m100">
    <xsl:for-each select="saxon:tokenize(.,'$')">
      <xsl:choose>
        <xsl:when test="starts-with(.,'a')">
          <surname><xsl:value-of select="substring(.,2)"/></surname>
        </xsl:when>
        <xsl:when test="starts-with(.,'h')">
          <f-name><xsl:value-of select="substring(.,2)"/> </f-name>
        </xsl:when>
        <xsl:when test="starts-with(.,'f') or starts-with(.,'e') ">
          <title>(<xsl:value-of select="substring(.,2)"/>)</title>
        </xsl:when>
        <!-- note: an 'e' code is already caught by the branch above -->
        <xsl:when test="starts-with(.,'k') or starts-with(.,'e') ">
          <f-names>(<xsl:value-of select="substring(.,2)"/>)</f-names>
        </xsl:when>
      </xsl:choose>
    </xsl:for-each>
</xsl:template>
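
The dollarCodes template itself is not shown above. Since XSLT 1.0 does
not allow a computed name on xsl:call-template, the dispatch on field name
has to be spelled out, and it might look roughly like the sketch below.
Only the names dollarCodes, field and m100 come from the post; the
handling of the unknown case is my guess at what "exception processing"
could mean.

```xml
<!-- Hypothetical dispatcher; only the m100 branch is backed by the post. -->
<xsl:template name="dollarCodes">
  <xsl:param name="field"/>
  <xsl:choose>
    <xsl:when test="$field = 'm100'">
      <!-- call-template keeps the current f element as context node -->
      <xsl:call-template name="m100"/>
    </xsl:when>
    <!-- one xsl:when per dollar-coded field would follow here -->
    <xsl:otherwise>
      <!-- unknown field: report it and pass the raw text through -->
      <xsl:message>No dollar-code template for <xsl:value-of select="$field"/></xsl:message>
      <xsl:value-of select="."/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
```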


This first stylesheet reduces the file to a manageable size and
makes further processing easier.

The second stylesheet further reduces the file size by selecting wanted
records, keyed off a separate XML file which provides the list of
records, in a simple document derived separately. The objective is
to select only those records with a cat-num content listed in the
external file.

For this particular example, concerned with a bibliographic data source,
the title and author are held in various fields with a given priority,
hence this information is sorted out at this stage into a single
field Ra, and the sources are dropped. Later processing needs an
alphabetic sort, hence the first letter of the author is included
in the output, in the form
<Ra>
  <lett>a</lett>
  <au>surname,firstname</au>
</Ra>
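
As a rough sketch, the Ra field could be built by a named template called
with the r element as the context node. The template name make-Ra and the
priority order (m100 before m700) are assumptions of mine, since the post
does not give the real priority rules. The fallback to z for books with no
author follows the convention mentioned later in this post.

```xml
<!-- Sketch only: make-Ra and the m100-before-m700 priority are assumptions. -->
<xsl:template name="make-Ra">
  <xsl:variable name="au">
    <xsl:choose>
      <xsl:when test="m100/surname">
        <xsl:value-of select="concat(normalize-space(m100/surname), ',',
                                     normalize-space(m100/f-name))"/>
      </xsl:when>
      <xsl:when test="normalize-space(m700)">
        <xsl:value-of select="normalize-space(m700)"/>
      </xsl:when>
    </xsl:choose>
  </xsl:variable>
  <Ra>
    <lett>
      <xsl:choose>
        <!-- string() is needed: a result tree fragment is always true -->
        <xsl:when test="string($au)">
          <xsl:value-of select="translate(substring($au, 1, 1),
                                'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                                'abcdefghijklmnopqrstuvwxyz')"/>
        </xsl:when>
        <!-- no author found: file the book under z -->
        <xsl:otherwise>z</xsl:otherwise>
      </xsl:choose>
    </lett>
    <au><xsl:value-of select="$au"/></au>
  </Ra>
</xsl:template>
```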


 <doc>
    <xsl:apply-templates 
   select="r[cat-num = document('inicat.xml')/doc/bk/n]"/>
</doc>

is the essence of this stylesheet. The control file holds the list of
required records in the form

<doc>
<bk><n>1234</n><ttl>.. </ttl></bk>
<bk><n>2345</n><ttl>.. </ttl></bk>
...
</doc>


This step is not needed if all records are to be processed,
but it adds the flexibility which I needed, since only a subset of records
was required.
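
Put together, a minimal version of this selection stylesheet might look
like the sketch below. It assumes the output of the first stage has a doc
root element with r children, which the post does not state, and it copies
the selected records unchanged rather than building the Ra field.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- every wanted catalogue number from the control file -->
  <xsl:variable name="wanted" select="document('inicat.xml')/doc/bk/n"/>

  <!-- assumption: the input root element is doc -->
  <xsl:template match="/doc">
    <doc>
      <!-- the = comparison is true when cat-num equals ANY n in the list -->
      <xsl:apply-templates select="r[cat-num = $wanted]"/>
    </doc>
  </xsl:template>

  <!-- copy each selected record as it stands -->
  <xsl:template match="r">
    <xsl:copy-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Holding the list in a variable keeps the predicate short. The comparison
is by string value, so a cat-num of 1234 matches only an n of exactly 1234.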

The first stage of grouping assigns each record to one of six groups, using
a single input field. This is further complicated by the fact that a single
group holds a small number of individual categories (sub-categories).
This stage of processing is also used to sort alphabetically.
In the example, the categories are held in the TBcategory field.
A further complication is that some records have no category, i.e.
the TBcategory field is empty. These form a final general category.

An example of selecting one group is shown below.


 <cat1>
     <adventure><hd>Adventure </hd>
       <xsl:for-each select='r[starts-with(TBcategory,   "Adven"    )]'>
         <xsl:sort select='Ra/lett'/>
         <xsl:sort select="Ra/au"/>
         <r>
           <xsl:copy-of select="*"/>
         </r>
       </xsl:for-each>
     </adventure>

     <war><hd>War stories </hd>
       <xsl:for-each select=' r[starts-with(TBcategory,   "War"     )] '>
         <xsl:sort select='Ra/lett'/>
         <xsl:sort select="Ra/au"/>
         <r>
           <xsl:copy-of select="*"/>
         </r>
       </xsl:for-each>
     </war>
.... 

   </cat1>
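
The records with an empty TBcategory can be picked up in the same way.
The following is a sketch only: the general element name and its heading
are placeholders of mine, and the fragment would sit inside whichever
group wrapper holds the final general category.

```xml
<!-- Hypothetical element name and heading; the original is not shown. -->
<general><hd>General </hd>
  <!-- normalize-space() is empty for an empty or missing TBcategory -->
  <xsl:for-each select="r[not(normalize-space(TBcategory))]">
    <xsl:sort select="Ra/lett"/>
    <xsl:sort select="Ra/au"/>
    <r>
      <xsl:copy-of select="*"/>
    </r>
  </xsl:for-each>
</general>
```

Note that this catches only empty categories. A record whose TBcategory
matches none of the starts-with() tests would still be dropped.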



The output of this stage is a grouped, sorted list ready for processing.
The final output is required in several media, in the form

category
  letter A
  author beginning with A
    All records by that author
  letter B
  author beginning with B
  ...

The final grouping addresses the books within each category:
  all authors within that group beginning with a given letter,
   all books within that group and letter.

So again it's a grouping problem with the surrounding alphabetic sort.

A further twist is that each major category is required to be
a separate document with a heading section introducing it,
a table of contents and an index sorted by author, irrespective
of which subcategory the authors are in.


Top level processing is all done on a pull basis, using moded templates
with the following form.

 <xsl:template match="/doc/*[contains(name(), 'cat') ]">
   <xsl:variable name="cat" select="position() div 2"/> <!-- first digit -->
   <h1 id="{generate-id()}">
     <xsl:value-of select="document('')//t:ttls/t[$cat]"/>
   </h1>
   <xsl:for-each select="*[r]"> <!-- for each subcat -->
   <h2 class="subhead"><xsl:value-of select="hd"/> </h2>
   <xsl:variable name="subcat" select="position()"/>
     <xsl:call-template name="alphabetical" >
       <xsl:with-param name="key1">
         <xsl:call-template name="getKey1">
           <xsl:with-param name="cat" select="$cat"/>
           <xsl:with-param name="subcat" select="$subcat"/>
         </xsl:call-template>
       </xsl:with-param>
      <xsl:with-param name="key2">
         <xsl:call-template name="getKey2">
           <xsl:with-param name="cat" select="$cat"/>
           <xsl:with-param name="subcat" select="$subcat"/>
         </xsl:call-template>
       </xsl:with-param>
     </xsl:call-template>
   </xsl:for-each>
  </xsl:template>

The selection of keys is done via an external file which lists
all keys for the various categories. It follows the sequence of
categories and is used to group by letter. Remember that the
records are already sorted.

<xsl:key name="cat1a-letters"
  match="cat1/adventure/r"
         use="Ra/lett" />
<xsl:key name="cat1a-authors"
  match="cat1/adventure/r"
         use="Ra/au" />

<xsl:key name="cat1w-letters"
  match="cat1/war/r"
         use="Ra/lett" />
<xsl:key name="cat1w-authors"
  match="cat1/war/r"
         use="Ra/au" />
...

This is used in two ways. The key name is abstracted automatically, based
on the category number, by the getKey templates shown below.
I am almost certain that this
pattern is usable elsewhere, equally certain that it is non-optimal for
this situation. However, it works.

 <xsl:template name="getKey1">
    <xsl:param name="cat" select="1"/>
    <xsl:param name="subcat" select="1"/>
     <xsl:value-of select="document('../print/catkeys.xsl')//xsl:key
    [contains(@name, string($cat))][($subcat * 2) - 1]/@name"/>
  </xsl:template>

And the second one,

<xsl:template name="getKey2">
   <xsl:param name="cat" select="1"/>
    <xsl:param name="subcat" select="1"/>
     <xsl:value-of select="document('../print/catkeys.xsl')//xsl:key
    [contains(@name, string($cat))][$subcat * 2]/@name"/>
  </xsl:template>


These two are identical except that one fetches the first of a pair,
the second fetches the second of the pair.
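
For example, with the four keys listed above, cat 1 and subcat 2 give the
predicate [3] in getKey1 (cat1w-letters) and [4] in getKey2
(cat1w-authors). Since the two templates differ only in that offset, they
could be folded into one. The sketch below is an untested variation, not
what I actually used; the template name getKey and the parameter which are
hypothetical.

```xml
<!-- Hypothetical merged form of getKey1 and getKey2. -->
<xsl:template name="getKey">
  <xsl:param name="cat" select="1"/>
  <xsl:param name="subcat" select="1"/>
  <!-- 1 = first key of the pair (letters), 2 = second (authors) -->
  <xsl:param name="which" select="1"/>
  <xsl:value-of select="document('../print/catkeys.xsl')//xsl:key
    [contains(@name, string($cat))][($subcat * 2) - 2 + $which]/@name"/>
</xsl:template>
```

The which parameter must be passed as a number, e.g. select="2", so that
the second predicate is treated as a position. As with the originals, the
contains() test relies on no other category number containing the same
digit, which holds for six groups.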

The alphabetic sort is based on a Jeni T suggestion, to loop
through the alphabet, which is passed as a parameter to a recursive
named template. Its outline is as below.

<xsl:template name="alphabetical">
  <xsl:param name="key1" select="'nocat'"/>
  <xsl:param name="key2" select="'nocat'"/>
  <xsl:param name="alphabet" select="'abcdefghijklmnopqrstuvwxyz'" />

  <xsl:if test="$alphabet != ''"><!--If not finished -->
    <xsl:variable name="letter"
                  select="substring($alphabet, 1, 1)" />
  
.....


      <xsl:call-template name="alphabetical">
      <xsl:with-param name="alphabet" select="substring($alphabet, 2)" />
        <xsl:with-param name="key1" select="$key1"/>
        <xsl:with-param name="key2" select="$key2"/>
      </xsl:call-template>
  </xsl:if>
</xsl:template>

The loop terminates after the last letter, when the xsl:if test fails.
For special processing, books without an author (it does happen)
are grouped under the letter z (another design weakness). It should
perhaps be some other character, but it suffices for this scheme.

The internal processing (again suggested by Jeni) is along the
following lines. This code fits within the ..... area of the
above template.

   <xsl:variable name="books"
                 select="key($key1,$letter)" />

   <xsl:choose>
     <xsl:when test="$books">
       <!-- when there are authors with this letter -->
       <xsl:apply-templates
         select="$books[generate-id(.) =
                 generate-id(key($key2,Ra/au) [1])]"  mode="first"/>
     </xsl:when>
     <xsl:otherwise>
       <!-- $l and $u: lower- and upper-case alphabets, defined elsewhere -->
       <h2>The letter <xsl:value-of select="translate($letter, $l,$u)"/> </h2>
       <p>There are no authors beginning with this letter</p>
     </xsl:otherwise>
   </xsl:choose>


The variable books uses the first key, picking all books within this
category whose author starts with this particular letter.
If there are any, the apply-templates selects, for each author
starting with this letter, the first book in this category by that
author, using the Muenchian technique.
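
For reference, here is how the outline and the internal processing fit
together in one template. This is an assembled sketch rather than the
production code. The top-level variables l and u are not shown anywhere in
this post, so their definitions here are an assumption about what
translate() was given.

```xml
<!-- Assumed definitions: $l and $u are used above but never shown. -->
<xsl:variable name="l" select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="u" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

<xsl:template name="alphabetical">
  <xsl:param name="key1" select="'nocat'"/>
  <xsl:param name="key2" select="'nocat'"/>
  <xsl:param name="alphabet" select="'abcdefghijklmnopqrstuvwxyz'"/>

  <xsl:if test="$alphabet != ''">
    <xsl:variable name="letter" select="substring($alphabet, 1, 1)"/>
    <!-- every book in this subcategory filed under $letter -->
    <xsl:variable name="books" select="key($key1, $letter)"/>
    <xsl:choose>
      <xsl:when test="$books">
        <!-- Muenchian step: keep a book only if it is the first one
             the author key returns for its own author -->
        <xsl:apply-templates
          select="$books[generate-id(.) =
                  generate-id(key($key2, Ra/au)[1])]"
          mode="first"/>
      </xsl:when>
      <xsl:otherwise>
        <h2>The letter <xsl:value-of select="translate($letter, $l, $u)"/></h2>
        <p>There are no authors beginning with this letter</p>
      </xsl:otherwise>
    </xsl:choose>
    <!-- recurse on the rest of the alphabet -->
    <xsl:call-template name="alphabetical">
      <xsl:with-param name="alphabet" select="substring($alphabet, 2)"/>
      <xsl:with-param name="key1" select="$key1"/>
      <xsl:with-param name="key2" select="$key2"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>
```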

Two further templates are provided.

The first processes the first record
  within a given category
    within a given letter
    first book by this author  (need to show the author's name)

The second template processes other records by the same author,
but omits the author heading, showing all other books by the same author.
It is called from the first template, using

 <xsl:apply-templates 
      select="following-sibling::r[Ra/au = $au ]" mode="others">
    </xsl:apply-templates>

The variable au holds the current author, hence all other books
by the same author in this category are processed.
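
In outline, the pair of templates might look like the sketch below. The
h3 and p markup and the choice of cat-num as the book detail are
placeholders of mine; the real templates output rather more.

```xml
<!-- Sketch: the h3/p markup and the use of cat-num are placeholders. -->
<xsl:template match="r" mode="first">
  <xsl:variable name="au" select="Ra/au"/>
  <!-- author heading, shown once per author -->
  <h3><xsl:value-of select="$au"/></h3>
  <!-- the first book itself, rendered like the others -->
  <xsl:apply-templates select="." mode="others"/>
  <!-- remaining books by the same author in this subcategory -->
  <xsl:apply-templates
    select="following-sibling::r[Ra/au = $au]" mode="others"/>
</xsl:template>

<xsl:template match="r" mode="others">
  <p><xsl:value-of select="cat-num"/></p>
</xsl:template>
```

The following-sibling axis is enough here because the key returns an
author's books in document order, so the first one always precedes the
rest within its subcategory element.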


I learned a lot from this little exercise.
I provide it in case it's of use to others.

Regards DaveP


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
