HTMLRSF

Introduction

Each of the Rigi source parsers generates as output a flat file consisting of a stream of either 3-tuples or 4-tuples mixed with 3-tuples. These tuples are said to be in Rigi Standard Format (RSF). The tuples declare objects parsed from the program (nodes) and describe the relationships between them (directional arcs). For example, an RSF file typically declares program functions and their call relationships.

The 4-tuple RSF variant carries as its fourth component the location in the source where either the node or arc is declared. This is in addition to such attributes in the RSF file as file and lineno which are often generated for nodes. The fourth component makes it possible to locate the node or arc reference in the source code, a feature that is necessary for marking up the source for web browsing.

Consider, for example, the following source code named example.c that is written for the Microsoft® Windows95® platform:

1  /* example.c */
2  #include "msvckw.h"
3  #include <stdio.h>
4
5  void Hello( char *name ) {
6          fprintf( stdout, "Hello %s\n", name );
7  }
8
9  void main() {
10         char *userName;
11         fputs( "Please enter your name: ", stdout );
12         fgets( userName, stdin );
13         Hello( userName );
14 }

When this is parsed using the command:

rigiparse -4 example.c > rsf

the following RSF is generated:

type    _iobuf  Data      \\msvc20\\include\\stdio.h,120,15
type    Hello   Function  example.c,5,26
call    Hello   fprintf   example.c,6,38
type    main    Function  example.c,9,13
call    main    fputs     example.c,11,30
call    main    fgets     example.c,12,37
call    main    Hello     example.c,13,25

The location information given in the fourth component of a tuple can be used to place an HTML anchor or bookmark. The command:

htmlrsf -a call -b type -p < rsf

results in the following HTML file named example.c_0001.html:

<html><head><body bgcolor=#ffffff><pre>/* example.c */
#include "msvckw.h"
#include &lt;stdio.h>

<a name=5></a>void Hello( char *name ) {
    fprintf( stdout, "Hello %s\n", name );
}

<a name=9></a>void main() {
    char *userName;
    fputs( "Please enter your name: ", stdout );
    fgets( userName, stdin );
    <a href=example.c_0001.html#5>Hello</a>( userName );
}
</pre></body></html>
and a new, 3-tuple version of the RSF file that contains a new set of node attributes, nodeurl, that can be used by rigiedit, the Rigi graph editor:
type     "Hello"  "Function"
nodeurl  "Hello"  "example.c_0001.html#5"
type     "_iobuf" "Data"
nodeurl  "_iobuf" "stdio.h_0001.html#120"
type     "main"   "Function"
nodeurl  "main"   "example.c_0001.html#9"
call     "Hello"  "fprintf"
call     "main"   "Hello"
call     "main"   "fgets"
call     "main"   "fputs"
The HTML file has anchors and bookmarks for all of the valid Functions and calls found in the RSF file. The RSF data indicate that there should be a bookmark for _iobuf on line 120 of stdio.h. In fact, htmlrsf does create a file named stdio.h_0001.html that contains such a bookmark. However, because _iobuf is a data type rather than a Function, there is no call arc that references it and, thus, there is no hyperlink to it.

Invoking htmlrsf

The htmlrsf program is run as a filter. It accepts a stream of 4-tuple RSF data from stdin and writes a stream of 3-tuple RSF to stdout. Errors are reported to stderr. The input stream describes the source in an application subsystem. The program uses the input data to locate the source files, to copy them to the current working directory, and to mark them up with HTML tags.

A typical command might be:

rigiparse -4 pgm.c | sortrsf -4 | htmlrsf -pxa call -b type > rsf

The following command line arguments are supported:

 -a
specifies those RSF keywords that should trigger the insertion of an anchor. If there are two or more keywords, the -a argument should be followed by a comma delimited list, e.g., -a call,fetch,store. In the example above, call is used in the C-language domain to indicate that one Function calls another. By specifying -a call, we have indicated that htmlrsf should establish an anchor at each location associated with a call tuple. An anchor will not be created unless it has a valid bookmark; htmlrsf will not permit broken links to be created. The RSF keywords are determined by the parser and the domain model for which it is coded.
 -b
specifies those RSF keywords that should trigger the insertion of a bookmark. If there are two or more keywords, the -b argument should be followed by a comma delimited list, e.g., -b procdef,filedef. In the example above, type is used in the C-language domain to specify the type of an object, e.g., Function (C-language procedure) or Data (data structure). By specifying -b type, we have indicated that htmlrsf should establish a bookmark at each location associated with a type tuple. A bookmark will be created regardless of whether or not an anchor references it.
 -h
displays a banner and a brief description of the command line options.
 -l
causes the source file(s) to be segmented. Every occurrence of a bookmark triggers a new segment. If -l is followed by a number, maxLines, the maximum number of lines in a segment is limited to maxLines. When the maximum number of lines is reached, a new segment is created. The segments of each source file are connected together by Up and Down hyperlinks.
 -p
writes the HTML file using the preformatted HTML tag ("<pre>"). In this mode, the "<" characters are converted to "&lt;" to prevent them from being interpreted as HTML markups.

If -p is not specified, htmlrsf converts all spaces to "&nbsp;" and all "<" characters to "&lt;". Furthermore, each line is terminated by "<br>". In a browser, the non-preformatted HTML text looks similar to the equivalent preformatted text.

 -t
writes each HTML file using the template file that is provided as an accompanying argument: -t templateFile. The template file may contain a header and a footer specified as follows:
header [=] { text }

footer [=] { text }

The header text (which may be multi-line) is written at the extreme top of each HTML file. The footer text is written at the extreme bottom of each HTML file. The left bracket, "{", may be embedded in the text if it is preceded by a backslash ("\").

A default template is provided by htmlrsf if either the -t argument is not specified or if either the header or footer is not given by the template file.

 -x
expands tab characters to spaces. If -x numSpaces is specified, the tab settings are set at intervals of numSpaces, otherwise the interval size defaults to four. If -x is not specified, there is no substitution of tab characters.
 -l
causes RSF attributes designated as loc to be concatenated. The loc attribute identifies the file and line number where a specific node, e.g., a given procedure, is defined in the source. Sometimes, when the RSF is derived from more than one compilation unit, the parser may calculate discrepant locations for a given globally visible node. In these cases, the different loc attributes can be concatenated and shown as a single composite loc attribute. Note that loc tuples will always be 3-tuples.
 
Example:
   procdef      Myproc          file1.c,10
   procdef      Myproc          file1.c,17
becomes
   procdef      Myproc          file1.c,10;file1.c,17
 -m
generates a multiarc if two or more arcs of different arc types have a common source node and a common destination node. This command line argument operates on 3-tuples only.
 
Example:
   fetch        Myproc          myvar
   store        Myproc          myvar
becomes
   multiarc     Myproc          myvar
 -4
causes 4-tuples to be preserved. If there are 4-tuples in the RSF file but this command line argument is unspecified, the fourth component of the tuple is removed. If htmlrsf is run with a -4 command line argument, 4-tuples are left unchanged.
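
The text substitutions described above for -p and -x can be illustrated with a small sketch. It is not the htmlrsf source; the function names expand_tabs and to_html_line are hypothetical, and the sketch mirrors only the substitutions the option descriptions mention ("<" to "&lt;", spaces to "&nbsp;", a trailing "<br>", and tab stops at a fixed interval).

```python
def expand_tabs(line, interval=4):
    """Expand tabs to spaces with tab stops every `interval` columns (-x)."""
    return line.expandtabs(interval)


def to_html_line(line, preformatted):
    """Convert one source line to HTML text.

    With preformatted=True (the -p case) only "<" is escaped.
    Otherwise spaces become "&nbsp;" and the line ends with "<br>".
    """
    text = line.replace("<", "&lt;")
    if preformatted:
        return text
    return text.replace(" ", "&nbsp;") + "<br>"


if __name__ == "__main__":
    print(to_html_line("#include <stdio.h>", preformatted=True))
    print(to_html_line("#include <stdio.h>", preformatted=False))
    print(to_html_line(expand_tabs("\tchar *userName;"), preformatted=False))
```

The "<" replacement is done before the space replacement so that the "&nbsp;" entities are never touched by a later step.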

htmlrsf writes the HTML text to files in the current working directory. The file names are generated by appending "_nnnn.html" to the name of the original source file and by replacing all directory separators (forward/backward slashes and colons) by underscore characters (e.g., /usr/include/stdio.h becomes _usr_include_stdio.h). The "nnnn" is the line number in the original source file that corresponds to the first line in the HTML text file. Thus, "example.c_0009.html" indicates that the first line of the HTML file corresponds to line nine of the source file "example.c".
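
The naming rule above can be sketched as follows. The function name html_file_name is hypothetical, and zero-padding the line number to four digits is an assumption inferred from the "nnnn" pattern and the file names shown in this document.

```python
import re


def html_file_name(source_path, first_line):
    """Build the HTML file name for a source file segment.

    Directory separators (forward slash, backslash, colon) are replaced
    by underscores, then "_nnnn.html" is appended, where nnnn is the
    source line number of the segment's first line (assumed 4 digits).
    """
    flattened = re.sub(r"[/\\:]", "_", source_path)
    return f"{flattened}_{first_line:04d}.html"


if __name__ == "__main__":
    print(html_file_name("/usr/include/stdio.h", 1))
    print(html_file_name("example.c", 9))
```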

The source files are located by following the path given in the fourth component of the RSF tuples. If a source file is not found, htmlrsf terminates after displaying an error message.

All tuples in the RSF input file (stdin) are written to the RSF output file (stdout). Comments in the RSF file (those lines in the file that begin with "#") are written to stderr but not to stdout.
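
As a rough illustration of the stream handling just described, the sketch below separates comments from tuples and optionally drops the fourth component. The function name filter_rsf is hypothetical, fields are assumed to contain no embedded whitespace, and the output formatting (tab-separated, unquoted) is a simplification rather than the exact format written by htmlrsf.

```python
def filter_rsf(lines, keep_fourth=False):
    """Split RSF input into output tuples and comment lines.

    Returns (tuples, comments). Lines beginning with "#" are collected
    as comments (these would go to stderr). Other non-blank lines are
    treated as tuples (these would go to stdout); the fourth component
    is removed unless keep_fourth is True (the -4 case).
    """
    tuples, comments = [], []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            comments.append(stripped)
            continue
        fields = stripped.split()
        if not keep_fourth:
            fields = fields[:3]
        tuples.append("\t".join(fields))
    return tuples, comments


if __name__ == "__main__":
    sample = ["# produced by a parser", "call main Hello example.c,13,25"]
    out, err = filter_rsf(sample)
    print(out)
    print(err)
```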

An example

Consider the C-language program described in the introduction. Suppose that the program is parsed using the following command.

rigiparse -4 example.c | sortrsf -4 | htmlrsf -lxa call -b type > rsf

Although execution of sortrsf is not strictly necessary, this filter program reduces the size of the input RSF file by removing duplicate tuples (which in this example do not exist).

Because the -l argument is specified, htmlrsf segments the example.c file into three HTML files: example.c_0001.html, example.c_0005.html, and example.c_0009.html. stdio.h is split into stdio.h_0001.html and stdio.h_0120.html as well.
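
The segmentation can be sketched as a computation of the source line at which each segment starts. This is an illustration under assumptions, not the htmlrsf algorithm: the function name segment_starts is hypothetical, and the line count of a segment is assumed to restart at every segment boundary, whether the boundary came from a bookmark or from maxLines.

```python
def segment_starts(total_lines, bookmark_lines, max_lines=0):
    """Return the first source line number of every segment.

    A new segment starts at line 1, at every bookmarked line, and,
    when max_lines is positive, whenever the current segment already
    holds max_lines lines.
    """
    bookmarks = set(bookmark_lines)
    starts = []
    count = 0  # lines already placed in the current segment
    for line in range(1, total_lines + 1):
        if line == 1 or line in bookmarks or (max_lines and count >= max_lines):
            starts.append(line)
            count = 0
        count += 1
    return starts


if __name__ == "__main__":
    # example.c has 14 lines with bookmarks on lines 5 and 9
    print(segment_starts(14, [5, 9]))
```

For example.c, the bookmarks on lines 5 and 9 yield segments starting at lines 1, 5, and 9, which matches the three file names listed above.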

Each tab is converted to spaces because -x is specified. Each space is represented as "&nbsp;" in the marked up source because the -p argument is not specified. The three HTML text files follow:

example.c_0001.html

  <html><head><body bgcolor=#ffffff><br>
  /*&nbsp;example.c&nbsp;*/<br>
  #include&nbsp;"msvckw.h"<br>
  #include&nbsp;&lt;stdio.h><br>
  <br>

  <a href=example.c_0005.html>Down</a></body></html><br>

example.c_0005.html

  <html><head><body bgcolor=#ffffff><a href=example.c_0001.html>Up</a>
  <br>
  void&nbsp;Hello(&nbsp;char&nbsp;*name&nbsp;)&nbsp;{<br>
  &nbsp;&nbsp;&nbsp;&nbsp;fprintf(&nbsp;stdout,&nbsp;"Hello&nbsp;%s\n",&nbsp;name&nbsp;);<br>
  }<br>
  <br>

  <a href=example.c_0009.html>Down</a></body></html><br>

example.c_0009.html

  <html><head><body bgcolor=#ffffff><a href=example.c_0005.html>Up</a>
  <br>
  void&nbsp;main()&nbsp;{<br>
  &nbsp;&nbsp;&nbsp;&nbsp;char&nbsp;*userName;<br>
  &nbsp;&nbsp;&nbsp;&nbsp;fputs(&nbsp;"Please&nbsp;enter&nbsp;your&nbsp;name:&nbsp;",&nbsp;stdout&nbsp;);<br>
  &nbsp;&nbsp;&nbsp;&nbsp;fgets(&nbsp;userName,&nbsp;stdin&nbsp;);<br>
  &nbsp;&nbsp;&nbsp;&nbsp;<a href=example.c_0005.html>Hello</a>(&nbsp;userName&nbsp;);<br>
  }<br>
  </body></html><br>

Each of the three files has a default header, <html><head><body bgcolor=#ffffff>, and a default footer, </body></html><br>. These can be replaced by user-specified headers and footers by specifying a template file. The three files also have Up and Down anchors to link the segments together.

Locating tags

A given component in the RSF tuple is used in conjunction with the file location to position an anchor around its target identifier. If the identifier is not found in the text, no anchor is inserted. Unlike anchors, bookmarks can be positioned without finding their target. However, htmlrsf verifies each anchor by searching for a corresponding bookmark identifier. For example, with:

type    Hello   Function  example.c,5,26
call    main    Hello     example.c,13,25
the "call" tuple specifies the anchor and the "type" tuple specifies the bookmark. Assuming that "Hello" can be found on line 13 of example.c, the anchor determined by the "call" tuple will be inserted because there is a bookmark similarly named "Hello".

It is possible to specify more than one type of tuple for anchors and bookmarks. There is, however, no means to group types of anchors with types of bookmarks. Any given anchor is resolved using a table of all possible bookmarks.

Anchors can be identified by a sequence of subnames separated by any punctuation characters except "$", "_", or "^". Whenever a compound name is detected, only the first subname in the sequence is used to match the bookmarks. To determine the position of the anchor in the text, the line of text is searched for each subname in the sequence starting at the left-most. The search ends when a match is found or the sequence is exhausted.
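
The compound-name rule can be sketched as follows. The names subnames and locate_anchor are hypothetical, and a plain substring search stands in for whatever matching htmlrsf actually performs on the line of text.

```python
import re
import string

# Every ASCII punctuation character except "$", "_" and "^" separates subnames.
_SEPARATORS = "".join(c for c in string.punctuation if c not in "$_^")
_SPLIT_RE = re.compile("[" + re.escape(_SEPARATORS) + "]+")


def subnames(identifier):
    """Split a compound identifier into its subnames."""
    return [part for part in _SPLIT_RE.split(identifier) if part]


def locate_anchor(identifier, line, bookmarks):
    """Return (subname, column) for the anchor, or None.

    Only the first subname is matched against the bookmark table.
    The line is then searched for each subname in turn, left-most
    first, stopping at the first one that is found.
    """
    parts = subnames(identifier)
    if not parts or parts[0] not in bookmarks:
        return None
    for part in parts:
        column = line.find(part)
        if column >= 0:
            return part, column
    return None


if __name__ == "__main__":
    print(subnames("rec.field"))
    print(locate_anchor("Hello", "    Hello( userName );", {"Hello"}))
```

Because the search falls back to later subnames, the tagged text may differ from the name that matched the bookmark, which is one way the inaccurate hyperlinks mentioned in this section can arise.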

It is possible to create inaccurate hyperlinks; the anchor search does not mediate between competing RSF tuples. Furthermore, because the search is conducted on only a portion of a name sequence, it is possible to tag the wrong identifier. As a safeguard, htmlrsf cannot generate nested anchors or bookmarks.

RSF Semantics

The semantics of tuples in RSF are determined by the domain in which their keywords are defined and for which the source parser was written. Consider the following arc:

call    main    Hello     example.c,13,25
The first tuple component is the keyword itself. The second component is the caller and the third component is the function that is being called. The fourth component is the location in the source file where the call occurred. The location consists of the name of the source file, example.c, the line number, 13, and the column number in the preprocessed source file (which is ignored). In this example a call arc behaves like an anchor in the HTML model; the third component is the target identifier for the anchor and the fourth component indicates the placement of the anchor.
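
The component layout just described can be made concrete with a short parsing sketch. The type and function names (Location, RsfTuple, parse_rsf_line) are hypothetical, and the sketch assumes that components are separated by whitespace and contain no embedded whitespace.

```python
from typing import NamedTuple, Optional


class Location(NamedTuple):
    file: str
    line: int
    column: int


class RsfTuple(NamedTuple):
    keyword: str
    second: str
    third: str
    location: Optional[Location]  # None for a 3-tuple


def parse_rsf_line(text):
    """Parse one RSF line into its components."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError("an RSF tuple needs at least 3 components")
    location = None
    if len(fields) >= 4:
        # Split from the right so that commas in the path are preserved.
        file, line, column = fields[3].rsplit(",", 2)
        location = Location(file, int(line), int(column))
    return RsfTuple(fields[0], fields[1], fields[2], location)


if __name__ == "__main__":
    print(parse_rsf_line("call    main    Hello     example.c,13,25"))
```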

An anchor is not inserted in the HTML text unless it has a corresponding bookmark. For the RSF data to be used in a manner consistent with the HTML model, there must be a tuple with the same identifier as the call arc that can be designated a bookmark. In this example, we must use type tuples to describe the bookmark locations.

type    Hello   Function  example.c,5,26
In this case, the second component of the tuple is the target identifier for the bookmark and the fourth component indicates its placement in the source.

type is an RSF built-in keyword. Tuples that are type tuples always have the same semantics with the second component naming the object, which is shown as a node in rigiedit. This presents three problems: 1) unlike other tuples, the third component is required to discriminate between different types of objects, 2) the typed object is identified by the second component, not the third, and 3) a type tuple is frequently treated by tools as a statement of existence and classification, not of object definition.

The rigiparse C-language parser generates a type 4-tuple with the location of the object definition. Other parsers may generate a 4-tuple that explicitly defines an object rather than overloading the type tuple:

filedef  example.c  Hello  example.c,5,26
In such a tuple the bookmark identifier is found in the third component, unlike in a type tuple, where it is found in the second. For this reason, type tuples are treated as being semantically different.

In summary, with the exception of type tuples, htmlrsf requires that the parser generate tuples for anchors and bookmarks such that the third component is the target identifier and the fourth component is the location of the respective anchor or bookmark. Furthermore, each anchor should have a corresponding bookmark.
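
The rule summarized above can be sketched as a two-step resolution: collect the bookmark identifiers, then keep only the anchors whose identifier has a bookmark. The function names target_identifier and resolve_anchors are hypothetical, and tuples are represented simply as sequences of strings.

```python
def target_identifier(fields):
    """Second component for type tuples, third component otherwise."""
    return fields[1] if fields[0] == "type" else fields[2]


def resolve_anchors(tuples, anchor_keywords, bookmark_keywords):
    """Return the anchor tuples that have a corresponding bookmark."""
    bookmarks = {
        target_identifier(t) for t in tuples if t[0] in bookmark_keywords
    }
    return [
        t
        for t in tuples
        if t[0] in anchor_keywords and target_identifier(t) in bookmarks
    ]


if __name__ == "__main__":
    rsf = [
        ("type", "Hello", "Function", "example.c,5,26"),
        ("call", "Hello", "fprintf", "example.c,6,38"),
        ("type", "main", "Function", "example.c,9,13"),
        ("call", "main", "Hello", "example.c,13,25"),
    ]
    print(resolve_anchors(rsf, {"call"}, {"type"}))
```

With the data from the introduction, only the call to Hello survives, because fprintf, fputs, and fgets have no type tuple to serve as a bookmark.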

Exceptions

The following are runtime warnings and errors that may be reported by htmlrsf to stderr.

Invalid argument <badArg>
One or more command line arguments have been entered incorrectly. <badArg> is the first invalid argument that was detected.
Insufficient memory
htmlrsf is unable to allocate sufficient memory for its tables. The RSF file is too large to process.
One or more anchor types must be specified
No RSF identifiers can be parsed from the command line following a -a command line argument. If identifiers were specified, then the command line syntax is in error.
One or more bookmark types must be specified
No RSF identifiers can be parsed from the command line following a -b command line argument. If identifiers were specified, then the command line syntax is in error.
-l must be followed by a non-negative number of lines
The number following the -l command line argument must be non-negative.
-t must be followed by a template file name
If -t is specified it must be followed by the valid name of a template file. If a name was specified, then the command line syntax is in error.
-x expands tabs to a maximum of 10 spaces
The number following the -x command line argument must lie in the range [1..10].
Write error on <outputFile>
An error occurred while htmlrsf was writing to the designated file.
Read error on <inputFile>
An error occurred while htmlrsf was reading from the designated file.
Open error on <file>
An error occurred while htmlrsf was opening the designated file.
Unexpected end of file on <inputFile>
An RSF tuple addresses a position beyond the end of file for the designated input file.
Command line argument "-a" must be specified
Command line arguments -a and -b must both be specified for htmlrsf to mark up the source files with anchors and bookmarks.
Command line argument "-b" must be specified
Command line arguments -a and -b must both be specified for htmlrsf to mark up the source files with anchors and bookmarks.
Template file (<templateFile>) not found
The template file specified by the -t command line argument is missing.
Error opening template file <templateFile>
An error occurred while htmlrsf was opening the designated file for input.
Error reading template file <templateFile>
An error occurred while htmlrsf was reading from the designated template file.
RSF tuple has fewer than 3 components at line <lineNum>
An RSF tuple is only a 1-tuple or a 2-tuple. Blank lines and comments (those lines starting with a '#' character) are ignored. The error occurs at <lineNum> in the input RSF file.
RSF tuple has greater than 4 components at line <lineNum>
An RSF tuple has more than four components. This is a warning message; the superfluous components are removed from the tuple and htmlrsf continues to execute. The warning occurs at <lineNum> in the input RSF file.
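
The two component-count checks that close the list above can be sketched as follows. The function name check_tuple is hypothetical; the message texts are taken from this section, while the choice of raising an exception for the error and returning the warning as a string is purely illustrative.

```python
def check_tuple(fields, line_num):
    """Validate the component count of one RSF tuple.

    Fewer than 3 components is an error. More than 4 components is a
    warning: the extra components are dropped and processing continues.
    Returns (fields, warning), where warning is None if all is well.
    """
    if len(fields) < 3:
        raise ValueError(
            f"RSF tuple has fewer than 3 components at line {line_num}"
        )
    warning = None
    if len(fields) > 4:
        warning = f"RSF tuple has greater than 4 components at line {line_num}"
        fields = fields[:4]
    return fields, warning


if __name__ == "__main__":
    print(check_tuple(["call", "main", "Hello", "example.c,13,25", "extra"], 7))
```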


30 June 1998   Maintainer: Johannes Martin jmartin@csr.uvic.ca