RX regular expressions tool

Introduction

The RX tool uses regular expressions to search for patterns in text and, optionally, to replace them with other strings. This is comparable to the regular expression objects of PERL, JavaScript, VBScript, C++, etc. The RX tool is used interactively by executing a small set of commands described below and summarized in RX.Tool. Alternatively, the module RXA.Mod, imported by RX, can be used for string manipulation by custom programs. In that way Oberon could be given a touch of PERL, a well-known and powerful string manipulation language; one could, for example, write software for scanning server logs (e.g. mail logs). See Watson.ShowDef RXA ~. The tool was authored by Andreas Margelisch in 1990.

The search and replace mechanism

The approach used by the RX tool for searching for a pattern in plain ASCII text, and optionally replacing it with another, is very similar to the approach used in Oberon text viewers. The menu bar of a text viewer features three buttons: [Search], [Rep] and [RepAll]. Behind the scenes, activating one of these buttons executes the corresponding Oberon command: TextDocs.Search, TextDocs.Replace or TextDocs.ReplaceAll, respectively. With that knowledge and some experience in using these simple string manipulation facilities, understanding the RX tool should be straightforward. Note that the search algorithm of RX operates on a text stretch up to and including the next carriage return, so a match never spans two lines. This is a frequent source of confusion for new users.

What remains is to understand what a regular expression is, namely the definition of a set of strings, and to gain some experience in writing syntactically correct expressions for a given job. Suggested reading for an introduction to regular expressions:

Mathematical Structures for Computer Science, Ed. 2 by Judith L. Gersting.
Regular Expressions are covered in Section 9.3, p. 399, Machines As Recognizers.

On the Web search for: "Learning regular expressions".

Using the RX tool with text gadgets

The description which follows assumes that the RX tool is put to work on plain ASCII text. Oberon text with embedded gadgets can also be processed, but some manipulations may produce unexpected results, as the software is not entirely fit to process gadgets. With this warning in mind, it is still quite possible to process such texts successfully. Keep in mind that a text stretch following a TextStyle gadget may span several lines on the screen, forming an entire paragraph that ends with a CR. To the user observing the screen, a pattern may well appear to overlap two displayed lines, but the text structure still complies with the basic rule; the CR is simply located further down, at the end of the paragraph.

A pattern for using the RX commands

The user may adopt either of two patterns in applying RX.

  1. Simple searching through a text of any length:
    RX.SetSearch
    RX.Search {RX.Search}

  2. Searching and replacing:
    RX.SetSearch
    RX.SetReplace
    RX.Search {RX.Search | RX.Replace} [RX.ReplaceAll]

After the search and replace patterns have been set, any number of search and replace operations may be carried out on different texts; in each text, the caret must be set first.

Setting the Search and the Replace patterns

RX.SetSearch ["\c"] RegExpr

Sets the search pattern which will apply to subsequent Search commands. The text stretch which matches the pattern defined by the regular expression must be entirely contained in a text stretch ending at the next carriage return. The regular expression itself also ends at the next carriage return; that is, it must be entirely contained in one line. A comment may therefore only be placed at the beginning of the line, ahead of the command, and not after it. The syntax of a regular expression, described in EBNF, is:
   RegExpr    = term { "|" term }. 
   term       = extdfactor { extdfactor }. 
   extdfactor = factor [ subexprid ]. 
   factor     = "(" RegExpr ")" | "[" RegExpr "]" | "{" RegExpr "}" 
                | ["~"] ( """ literal """ | shorthand ) 
                | """ literal { literal } """. 
   subexprid  = "X" digit. 
   shorthand  = "A" | "a" | "b" | "c" | "d" | "h" | "i" | "l" | "o" | "t" | "w".
Each shorthand token represents a predefined character class:
   A  : "A" - "Z" 
   a  : "a" - "z" 
   b  : "0" - "1" 
   c  : carriage return 
   d  : "0" - "9" 
   h  : "0" - "9" or "A" - "F" 
   i  : union of classes l and d 
   l  : union of classes A and a 
   o  : "0" - "7" 
   t  : tab 
   w  : tab, carriage return or blank
One can see from the definition of subexprid that an expression may contain a maximum of ten sub-expressions (X0 to X9). If the option \c is used, the matching of letters in literals becomes case insensitive.
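The shorthand classes listed above can be approximated with ordinary character classes. The following Python sketch is only an illustration: the names RX_SHORTHAND and matches_class are hypothetical and are not part of RX or RXA, and "\r" is used to stand for the carriage return that ends an Oberon text line.

```python
import re

# Hypothetical mapping from RX shorthand tokens to Python character classes.
# This mirrors the class table in the text; it is not taken from RXA.Mod.
RX_SHORTHAND = {
    "A": "[A-Z]",
    "a": "[a-z]",
    "b": "[01]",
    "c": "\\r",          # carriage return
    "d": "[0-9]",
    "h": "[0-9A-F]",
    "i": "[A-Za-z0-9]",  # union of classes l and d
    "l": "[A-Za-z]",     # union of classes A and a
    "o": "[0-7]",
    "t": "\\t",          # tab
    "w": "[\\t\\r ]",    # tab, carriage return or blank
}


def matches_class(token: str, ch: str) -> bool:
    """Return True if the single character ch belongs to the class named by token."""
    return re.fullmatch(RX_SHORTHAND[token], ch) is not None
```

For instance, matches_class("h", "F") holds while matches_class("h", "f") does not, because class h contains only the upper-case hexadecimal digits.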

Remark about the special characters

Each special character expected in a pattern must be specified explicitly. To find the parameter list in this line:   System.CopyFiles SYS:Example.txt => SYS:Example.Text ~ (* expanded parameter list *)

specify ("SYS" { i | w | "." | ":" | "=" | ">" } "~")

The crudest way to select an entire line is by specifying ({w | ~w}) or anything equivalent.
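To see which part of the sample line the parameter-list expression selects, it can be transcribed into a Python regular expression. This is an approximate translation, assuming that { } denotes zero or more repetitions as in EBNF; the name PARAM_LIST is made up for the illustration.

```python
import re

# Approximate translation of ("SYS" { i | w | "." | ":" | "=" | ">" } "~"):
# i -> letters and digits, w -> tab, carriage return or blank.
PARAM_LIST = re.compile(r"SYS[A-Za-z0-9\t\r .:=>]*~")

line = ("System.CopyFiles SYS:Example.txt => SYS:Example.Text ~ "
        "(* expanded parameter list *)")
found = PARAM_LIST.search(line)
print(found.group(0))  # SYS:Example.txt => SYS:Example.Text ~
```

Removing ":" or ">" from the class makes the search fail on this line, which is the point of the remark: no class covers the special characters implicitly.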

Remark about the meaning of "c"

According to the shorthand list, special characters are not included in any class. Remembering that each searched text stretch ends at the next CR, ~c stands for any character other than CR and (~c {~c}) stands for all the characters up to the next CR, i.e. a non-empty line. A non-empty line is equally specified by (~c {~c} c). A plain c may only appear at the end of an expression, as shown in the example. A double CR cannot be located with (c c).

RX.SetReplace {""" literal {literal} """ | subexprid | "t" | "c"}

Sets the replace pattern as a sequence of strings, sub-expressions, tabulator or carriage return characters.

Examples:

a) RX.SetSearch d[d]"."d[d]"."dddd
     This regular expression defines a date pattern. Hence, a subsequent RX.Search command will find a date, such as 5.9.2000, if one occurs after the position of the caret.
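For readers more familiar with other regular expression dialects, example a) can be sketched in Python as follows. The translation assumes that [ ] marks an optional part, as in EBNF; DATE, text and caret are illustrative names only.

```python
import re

# d[d]"."d[d]"."dddd : one or two digits, ".", one or two digits, ".", four digits
DATE = re.compile(r"[0-9][0-9]?\.[0-9][0-9]?\.[0-9]{4}")

text = "Released on 5.9.2000, revised on 24.11.2000."
caret = 0  # stands in for the caret position in the Oberon text

first = DATE.search(text, caret)
second = DATE.search(text, first.end())  # a repeated search continues after the match
print(first.group(0), second.group(0))   # 5.9.2000 24.11.2000
```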

b) RX.SetSearch ( { i | "."} ) X0 ".Mod"
     The search engine is directed to find names suffixed ".Mod". The name prefixes are to be collected in the sub-expression X0.

Generate pairs of filenames with extensions "Mod" and "Obj":   RX.SetReplace X0 ".Mod " X0 ".Obj"
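The role of the sub-expression X0 in example b) corresponds to a capture group. A rough Python equivalent, given as a sketch under the assumption that { } means zero or more repetitions, is shown below; MOD_NAME and mod_obj_pairs are hypothetical names, and the matching strategy of Python's engine may differ in detail from that of RX.

```python
import re

# ( { i | "." } ) X0 ".Mod" : the name prefix is captured as group 1 (X0 in RX)
MOD_NAME = re.compile(r"([A-Za-z0-9.]*)\.Mod")


def mod_obj_pairs(text: str) -> str:
    """Replace every name.Mod by the pair 'name.Mod name.Obj'."""
    return MOD_NAME.sub(r"\1.Mod \1.Obj", text)


print(mod_obj_pairs("RX.Mod RXA.Mod"))  # RX.Mod RX.Obj RXA.Mod RXA.Obj
```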

c) Direction to find and collect filenames suffixed ".Txt":   RX.SetSearch ( i {[i] "."} ) X0 ".Txt"

Generate a set of parameters for System.CopyFiles:   RX.SetReplace X0 ".Txt => " X0 ".Text"

d) Direction to find and collect non-empty lines:   RX.SetSearch (~c {~c}) X0

Indent non-empty lines by one tab:   RX.SetReplace t X0
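Example d) can be mimicked in Python as well. In this sketch "\n" takes the place of the carriage return used in Oberon texts, and the names NON_EMPTY_LINE and indent_non_empty are invented for the illustration.

```python
import re

# (~c {~c}) X0 : one or more characters that are not a line terminator
NON_EMPTY_LINE = re.compile(r"([^\n]+)")


def indent_non_empty(text: str) -> str:
    """Prefix every non-empty line with a tab, like RX.SetReplace t X0."""
    return NON_EMPTY_LINE.sub("\t\\1", text)


print(repr(indent_non_empty("first\n\nsecond\n")))  # '\tfirst\n\n\tsecond\n'
```

The empty line is left untouched because the pattern requires at least one character before the line terminator.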

e) Direction to find and collect all lines containing "next":   RX.SetSearch (~i {~i}) X0 "next" (~i {~i}) X1

Replace the occurrences of "next" by "prev":   RX.SetReplace X0 "prev " X1

Search and Replace functions

As said earlier, the commands provided in the RX module are similar to those found in a text viewer menu bar. Yet, the RX commands are far more powerful because the pattern is defined by a regular expression rather than by a constant string.

RX.Search

Searches for the pattern defined by the last RX.SetSearch command in the text where the caret is set. Searching begins at the position of the caret and proceeds to the end of the text. If the pattern is found, the matching text is selected and the caret is positioned right at its end. Otherwise, the caret disappears. The match must be entirely contained in a text stretch which does not contain a carriage return, except at the end. That means that a match may not overlap two lines of text. Repeated execution of this command selects successive occurrences of the search pattern up to the end of the text. There the search ends and does not "wrap around".
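The behaviour described above (start at the caret, move the caret to the end of each match, stop at the end of the text without wrapping around) can be modelled in a few lines of Python. This is a simplified model, not the RX implementation; rx_search and find_all_from are hypothetical names, and the confinement to one line is obtained here simply by using a pattern that cannot match "\r".

```python
import re
from typing import List, Optional, Tuple


def rx_search(text: str, pattern: re.Pattern, caret: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the first match at or after caret, or None."""
    m = pattern.search(text, caret)
    if m is None:
        return None  # in RX, the caret disappears
    return m.start(), m.end()


def find_all_from(text: str, pattern: re.Pattern, caret: int) -> List[Tuple[int, int]]:
    """Repeat rx_search until the end of the text; there is no wrap-around."""
    hits = []
    while True:
        hit = rx_search(text, pattern, caret)
        if hit is None or hit[1] == hit[0]:  # stop on failure or on an empty match
            return hits
        hits.append(hit)
        caret = hit[1]  # the caret is placed right after the match


digits = re.compile(r"[0-9]+")
print(find_all_from("a1 b22\rc333", digits, 0))  # [(1, 2), (4, 6), (8, 11)]
print(find_all_from("a1 b22\rc333", digits, 5))  # [(5, 6), (8, 11)]
```

Starting at position 5 skips the occurrence before the caret entirely, which matches the statement that the search does not wrap around.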

RX.Replace

Replaces the last search pattern found (located at the caret) by the replace pattern. Subsequently, the next occurrence of the search pattern is located in the text. Executing the command repeatedly searches and replaces all pattern occurrences to the end of the text. If the pattern located should not be replaced, execute Search instead to proceed to the next occurrence. To accelerate the process use ReplaceAll.

RX.ReplaceAll

Replaces all the occurrences of the search pattern by the replace pattern in a single action to the end of the text.
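As a rough model of ReplaceAll, the text before the caret is left alone and every occurrence after it is replaced in one pass. The Python sketch below assumes exactly that reading of the description; rx_replace_all is a hypothetical helper, not an RXA procedure.

```python
import re


def rx_replace_all(text: str, pattern: re.Pattern, replacement: str, caret: int) -> str:
    """Replace all matches located at or after caret; text before it is unchanged."""
    head, tail = text[:caret], text[caret:]
    return head + pattern.sub(replacement, tail)


print(rx_replace_all("next next next", re.compile("next"), "prev", 5))
# next prev prev
```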

RX.Grep (filename | "*") ["\" { "c" | "i" }] RegExpr

Searches the file specified by filename, or the marked text if "*" is given, for the pattern specified by the regular expression. A text viewer named "RX.Grep" is opened, displaying all the lines of the searched text that contain the pattern. The viewer uses the default font (usually Oberon10.Scn.Fnt); that is, the fonts, the colors and the offsets used in the searched text are not rendered. The search is case insensitive by default; if the option \c is specified, the search is case sensitive. If the option \i is specified, the lines that do not match are listed instead. The name grep comes from the Unix ed command g/re/p (globally search for a regular expression and print). Unlike Unix grep, which can search through several files given by a list of filenames or by wildcards, this command investigates a single file.

Examples:
     RX.Grep Puzzle15.Mod "PROCEDURE"
     RX.Grep * \c "PROCEDURE" (* to verify that PROCEDURE is well-spelled *)
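The effect of the two Grep options can be summarized with a small Python model. It follows the description above (case insensitive unless \c is given, non-matching lines listed when \i is given); rx_grep and its parameter names are assumptions made for this sketch, and "\n" stands in for the Oberon line terminator.

```python
import re
from typing import List


def rx_grep(text: str, pattern: str, case_sensitive: bool = False,
            invert: bool = False) -> List[str]:
    """Return the lines that match (or, with invert, do not match) the pattern."""
    flags = 0 if case_sensitive else re.IGNORECASE   # \c switches case sensitivity on
    compiled = re.compile(pattern, flags)
    result = []
    for line in text.split("\n"):
        matched = compiled.search(line) is not None
        if matched != invert:                         # \i lists the non-matching lines
            result.append(line)
    return result


source = "PROCEDURE Init;\nprocedure helper;\nVAR x: INTEGER;"
print(rx_grep(source, "PROCEDURE"))                       # both procedure lines
print(rx_grep(source, "PROCEDURE", case_sensitive=True))  # only the correctly spelled one
print(rx_grep(source, "PROCEDURE", invert=True))          # only the VAR line
```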

Debugging an expression in construction

If the desired patterns are not found in the text, follow this guideline:

24 Nov 2000 - Copyright © 2000 ETH Zürich. All rights reserved.
E-Mail: oberon@inf.ethz.ch
Homepage: http://www.ethoberon.ethz.ch/