afnix-txt(3)
NAME
txt - standard text processing module
STANDARD TEXT PROCESSING MODULE
The Standard Text Processing module is an original implementation of an object collection dedicated to text processing. Although text scanning is the most common operation performed in the field of text processing, the module also provides specialized objects to store and index text data. Text sorting and transliteration are also part of this module.
Scanning concepts
Text scanning is the ability to extract lexical elements or
lexemes from a stream. A scanner or lexical analyzer is the
principal object used to perform this task. A scanner is
built by adding special objects that act as pattern
matchers. When a pattern is matched, a special object called
a lexeme is returned.
Pattern object
A Pattern object is a special object that acts as a model for
the string to match. There are several ways to build a
pattern. The simplest way to build it is with a regular
expression. Another type of pattern is a balanced pattern.
In its first form, a pattern object can be created with a
regular expression object.
# create a pattern object
const pat (afnix:txt:Pattern "$d+")
In this example, the pattern object is built to detect integer objects.
pat:check "123" # true
pat:match "123" # 123
The check method returns true if the input string matches the pattern. The match method returns the string that matches the pattern. Since the pattern object can also operate with a stream object, the match method is appropriate to match a particular string. As usual, a predicate is available for the pattern object.
afnix:txt:pattern-p pat # true
Another form of pattern object is the balanced pattern. A balanced pattern is determined by a starting string and an ending string. There are two types of balanced patterns: the single balanced pattern and the recursive balanced pattern. The single balanced pattern is appropriate for lexical elements that are delimited by a character. For example, the classical C-string is a single balanced pattern with the double quote character.
# create a balanced pattern
const pat (afnix:txt:Pattern "ELEMENT" "<" ">")
pat:check "<xml>" # true
pat:match "<xml>" # xml
In the case of the C-string, the pattern might be more appropriately defined with an additional escape character. Such a character is used by the pattern matcher to grab characters that might otherwise be taken as part of the pattern definition.
# create a balanced pattern
const pat (afnix:txt:Pattern "STRING" "'" '\\')
pat:check "'hello'" # true
pat:match "'hello'" # "hello"
In this form, a balanced pattern with an escape character is created. The same string is used for both the starting and ending string. Another constructor that takes two strings can be used if the starting and ending strings are different. The last pattern form is the balanced recursive form. In this form, a starting and an ending string are used to delimit the pattern. However, in this mode, a recursive use of the starting and ending strings is allowed. In order to have an exact match, the number of starting strings must equal the number of ending strings. For example, the C-comment pattern can be viewed as a recursive balanced pattern.
# create a c-comment pattern
const pat (afnix:txt:Pattern "STRING" "/*" "*/")
Lexeme object
The Lexeme object is the object built by a scanner that
contains the matched string. A lexeme is therefore a tagged
string. A lexeme can also carry additional
information such as a source name and index.
# create an empty lexeme
const lexm (afnix:txt:Lexeme)
afnix:txt:lexeme-p lexm # true
The default lexeme is created without any value. A value can be set with the set-value method and retrieved with the get-value method.
lexm:set-value "hello"
lexm:get-value # hello
The set-tag and get-tag methods work the same way but operate with an integer. The source name and index are set and retrieved with analogous methods.
# check for the source
lexm:set-source "world"
lexm:get-source # world
# check for the source index
lexm:set-index 2000
lexm:get-index # 2000
Text scanning
Text scanning is the ability to extract lexical elements or
lexemes from an input stream. Generally, the lexemes are the
results of a matching operation which is defined by a
pattern object. As a result, the definition of a scanner
object is the object itself plus one or several pattern
objects.
Scanner construction
By default, a scanner is created without pattern objects.
The length method returns the number of pattern objects. As
usual, a predicate is associated with the scanner
object.
# the default scanner
const scan (afnix:txt:Scanner)
afnix:txt:scanner-p scan # true
# the length method
scan:length # 0
The scanner construction proceeds by adding pattern objects. Each pattern can be created independently, and later added to the scanner. For example, a scanner that reads real numbers, integers and strings can be defined as follows:
# create the scanner pattern
const REAL (afnix:txt:Pattern "REAL" [$d+.$d*])
const STRING (afnix:txt:Pattern "STRING" "\"" '\\')
const INTEGER (afnix:txt:Pattern "INTEGER" [$d+|"0x"$x+])
# add the pattern to the scanner
scanner:add INTEGER REAL STRING
The order in which the patterns are added defines the priority at which a token is recognized. The symbol name for each pattern is optional since functional programming permits the creation of patterns directly. This writing style makes the scanner definition easier to read.
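The effect of the insertion order can be illustrated with a small model. The sketch below is written in Python with the standard re module, not with the AFNIX API: the Scanner class, its add and check methods and the WORD pattern are hypothetical stand-ins, and the regular expressions use Python syntax rather than AFNIX regex syntax. It only assumes that patterns are tried in the order in which they were added and that the first one matching the whole string wins.

```python
import re


class Scanner:
    """Toy scanner: patterns are tried in the order they were added."""

    def __init__(self):
        self.patterns = []  # list of (name, compiled regex)

    def add(self, name, regex):
        self.patterns.append((name, re.compile(regex)))

    def check(self, text):
        """Return (name, text) for the first pattern matching the whole text."""
        for name, rx in self.patterns:
            if rx.fullmatch(text):
                return (name, text)
        return None  # no pattern recognizes the text


scanner = Scanner()
scanner.add("INTEGER", r"\d+|0x[0-9a-fA-F]+")
scanner.add("REAL", r"\d+\.\d*")
scanner.add("WORD", r"\w+")  # also matches digits, but has the lowest priority
```

With this ordering, the string "42" is tagged INTEGER even though WORD would match it as well. Adding WORD first would tag the same string as WORD.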
Using the scanner
Once constructed, the scanner can be used as is. A stream is
generally the best way to operate. If the scanner reaches
the end-of-stream or cannot recognize a lexeme, the nil
object is returned. With a loop, it is easy to get all
lexemes.
while (trans valid (is:valid-p)) {
# try to get the lexeme
trans lexm (scanner:scan is)
# check for nil lexeme and print the value
if (not (nil-p lexm)) (println (lexm:get-value))
# update the valid flag
valid:= (and (is:valid-p) (not (nil-p lexm)))
}
In this loop, it is necessary first to check for the end of the stream. This is done with the help of the special loop construct that initializes the valid symbol. As soon as the lexeme is built, it can be used. The lexeme holds the value as well as its tag.
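The same control flow can be modelled outside of AFNIX. The following Python sketch is an illustration under simplifying assumptions, not the module's implementation: it assumes that lexemes are separated by white space, the PATTERNS table and the scan and scan_all functions are hypothetical, and a lexeme is reduced to a (tag, value) pair. As in the loop above, scanning stops as soon as the end of the stream is reached or a chunk cannot be recognized.

```python
import io
import re

# hypothetical pattern table, tried in order
PATTERNS = [
    ("INTEGER", re.compile(r"\d+")),
    ("WORD", re.compile(r"[A-Za-z]+")),
]


def scan(stream):
    """Read one whitespace-delimited chunk and return (tag, value), or None."""
    ch = stream.read(1)
    while ch and ch.isspace():  # skip leading blanks
        ch = stream.read(1)
    buf = ""
    while ch and not ch.isspace():  # accumulate until a blank or end-of-stream
        buf += ch
        ch = stream.read(1)
    for tag, rx in PATTERNS:
        if rx.fullmatch(buf):
            return (tag, buf)
    return None  # end-of-stream or unrecognized chunk


def scan_all(stream):
    """Collect lexemes until scan returns None."""
    lexemes = []
    while True:
        lexm = scan(stream)
        if lexm is None:
            break
        lexemes.append(lexm)
    return lexemes


print(scan_all(io.StringIO("hello 123 world")))
```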
Text sorting
Sorting is one of the primary functions implemented inside the
text processing module. There are three sorting functions
available in the module.
Ascending and descending order sorting
The sort-ascent function operates with a vector object and
sorts the elements in ascending order. Any kind of objects
can be sorted as long as they support a comparison method.
The elements are sorted in place by using a quick sort
algorithm.
# create an unsorted vector
const v-i (Vector 7 5 3 4 1 8 0 9 2 6)
# sort the vector in place
afnix:txt:sort-ascent v-i
# print the vector
for (e) (v-i) (println e)
The sort-descent function is similar to the sort-ascent function except that the objects are sorted in descending order.
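For readers unfamiliar with in-place quick sort, here is a minimal Python sketch of the technique. It is not the module's source code: the quick_sort function and its descending flag are hypothetical, and the Lomuto partition scheme is one possible choice among several. Like the functions described above, it modifies the vector in place and returns nothing.

```python
def quick_sort(vec, descending=False):
    """Sort the list vec in place with a recursive quick sort."""

    def before(a, b):
        # ordering relation: ascending by default, descending on request
        return a > b if descending else a < b

    def sort(lo, hi):
        if lo >= hi:
            return
        pivot = vec[hi]  # Lomuto partition: last element is the pivot
        i = lo
        for j in range(lo, hi):
            if before(vec[j], pivot):
                vec[i], vec[j] = vec[j], vec[i]
                i += 1
        vec[i], vec[hi] = vec[hi], vec[i]  # put the pivot at its final place
        sort(lo, i - 1)
        sort(i + 1, hi)

    sort(0, len(vec) - 1)


v = [7, 5, 3, 4, 1, 8, 0, 9, 2, 6]
quick_sort(v)
print(v)
```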
Lexical sorting
The sort-lexical function operates with a vector object and
sorts the elements in ascending order using a lexicographic
ordering relation. Objects in the vector must be literal
objects or an exception is raised.
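The difference between a numeric and a lexicographic ordering is easy to see on a small example. The Python sketch below is only an illustration: the sort_lexical function and the LITERAL_TYPES tuple are hypothetical, and it assumes that the lexicographic relation compares the string representation of each element, which the text above does not state explicitly.

```python
# assumed stand-in for "literal objects"
LITERAL_TYPES = (bool, int, float, str)


def sort_lexical(vec):
    """Sort vec in place by comparing the string form of each element."""
    for obj in vec:
        if not isinstance(obj, LITERAL_TYPES):
            raise TypeError(f"non-literal object in vector: {obj!r}")
    vec.sort(key=str)


v = [10, 9, 2, 100]
sort_lexical(v)
print(v)  # [10, 100, 2, 9]
```

A numeric ascending sort of the same vector would give 2, 9, 10, 100, whereas the lexicographic order compares character by character, so "10" and "100" come before "2".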
Transliteration
Transliteration is the process of changing characters by
mapping one to another one. The transliteration process
operates with a character source and produces a target
character with the help of a mapping table. The
transliteration process is not necessarily reversible as
often indicated in the literature.
Literate object
The Literate object is a transliteration object that is
bound by default with the identity function mapping. As
usual, a predicate is associated with the object.
# create a transliterate object
const tl (afnix:txt:Literate)
# check the object
afnix:txt:literate-p tl # true
The transliteration process can also operate with an escape character in order to map a two-character sequence into a single character, as is usually found in programming languages.
# create a transliterate object by escape
const tl (afnix:txt:Literate '\\')
Transliteration configuration
The set-map method configures the transliteration mapping table
while the set-escape-map method configures the escape mapping table.
The mapping is done by setting the source character and the
target character. For instance, to map the
tabulation character to a white space, the mapping table is
set as follows:
tl:set-map '\t' ' '
The escape mapping table operates the same way. Note that the mapping algorithm first translates the input character, possibly yielding an escape character, and then the escape mapping takes place. Note also that the set-escape method can be used to set the escape character.
tl:set-map '\t' ' '
Transliteration process
The transliteration process is done either with a string or
an input stream. In the first case, the translate method
operates with a string and returns a translated string. On
the other hand, the read method returns a character when
operating with a stream.
# set the mapping characters
tl:set-map 'h' 'w'
tl:set-map 'e' 'o'
tl:set-map 'l' 'r'
tl:set-map 'o' 'd'
# translate a string
tl:translate "helo" # word
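The two-stage mapping described in this section can be modelled in a few lines. The Python sketch below is not the AFNIX implementation: the class only mirrors the names of the Literate methods, and its behavior rests on assumptions, namely that unmapped characters are left unchanged (identity mapping), that the escape test is applied to the character obtained after the regular mapping, and that a trailing escape character with nothing after it is emitted as is.

```python
class Literate:
    """Toy transliteration object with a regular and an escape mapping table."""

    def __init__(self, escape=None):
        self.escape = escape   # escape character, or None if unused
        self.map = {}          # regular mapping table (identity by default)
        self.escape_map = {}   # mapping applied to the character after an escape

    def set_map(self, src, dst):
        self.map[src] = dst

    def set_escape_map(self, src, dst):
        self.escape_map[src] = dst

    def translate(self, text):
        out = []
        i = 0
        while i < len(text):
            # first stage: regular mapping of the input character
            c = self.map.get(text[i], text[i])
            i += 1
            # second stage: an escape consumes the next character
            if self.escape is not None and c == self.escape and i < len(text):
                nxt = text[i]
                i += 1
                c = self.escape_map.get(nxt, nxt)
            out.append(c)
        return "".join(out)


tl = Literate()
for src, dst in zip("helo", "word"):
    tl.set_map(src, dst)
print(tl.translate("helo"))  # word
```

With an escape character, for example a backslash whose escape table maps n to a newline, the two-character sequence backslash n in the input yields a single newline character in the output.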
STANDARD TEXT PROCESSING REFERENCE
Pattern
The Pattern class is a pattern matching class based either
on regular expression or balanced string. In the regex mode,
the pattern is defined with a regex and a matching is said
to occur when a regex match is achieved. In the balanced
string mode, the pattern is defined with a start pattern and
end pattern strings. The balanced mode can be single or
recursive. Additionally, an escape character can be
associated with the class. A name and a tag are also bound to
the pattern object as a means to ease its integration within
a scanner.
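The single and recursive balanced modes, with or without an escape character, can be sketched as follows. This is a Python model written for illustration only, not the class implementation: the match_balanced function and its parameters are hypothetical, and it assumes that an exact match returns the text enclosed by the delimiters and that an escaped character is never interpreted as a delimiter.

```python
def match_balanced(text, start, end, recursive=False, escape=None):
    """Return the content enclosed by start/end, or None if text is not balanced."""
    if not text.startswith(start):
        return None
    pos = len(start)
    depth = 1  # number of start strings still waiting for an end string
    while pos < len(text):
        # an escaped character is skipped and never treated as a delimiter
        if escape is not None and text[pos] == escape and pos + 1 < len(text):
            pos += 2
            continue
        # the end string is tested first so that start == end also works
        if text.startswith(end, pos):
            depth -= 1
            if depth == 0:
                # exact match only if the closing string ends the text
                if pos + len(end) == len(text):
                    return text[len(start):pos]
                return None
            pos += len(end)
            continue
        # nested start strings are counted in recursive mode only
        if recursive and text.startswith(start, pos):
            depth += 1
            pos += len(start)
            continue
        pos += 1
    return None  # ran out of text before the pattern was closed


print(match_balanced("<xml>", "<", ">"))  # xml
```

In recursive mode the string "/* a /* b */ c */" is matched as a whole because the two start strings are closed by two end strings. In single mode the first end string closes the pattern, so the same input is not an exact match.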
Predicate
pattern-p
Inheritance
Object
Constructors
Pattern (none)
The Pattern constructor creates an empty pattern.
Pattern (String|Regex)
The Pattern constructor creates a pattern object associated
with a regular expression. The argument can be either a
string or a regular expression object. If the argument is a
string, it is converted into a regular expression
object.
Pattern (String String)
The Pattern constructor creates a balanced pattern. The
first argument is the start pattern string. The second
argument is the end balanced string.
Pattern (String String Character)
The Pattern constructor creates a balanced pattern with an
escape character. The first argument is the start pattern
string. The second argument is the end balanced string. The
third argument is the escape character.
Pattern (String String Boolean)
The Pattern constructor creates a recursive balanced
pattern. The first argument is the start pattern string. The
second argument is the end balanced string.
Constants
REGEX
The REGEX constant indicates that the pattern is a regular
expression.
BALANCED
The BALANCED constant indicates that the pattern is a
balanced pattern.
RECURSIVE
The RECURSIVE constant indicates that the pattern is a
recursive balanced pattern.
Methods
check -> Boolean (String)
The check method checks the pattern against the input
string. If the verification is successful, the method
returns true, false otherwise.
match -> String (String|InputStream)
The match method attempts to match an input string or an
input stream. If the matching occurs, the matching string is
returned. If the input is a string, the end of string is
used as an end condition. If the input stream is used, the
end of stream is used as an end condition.
set-tag -> none (Integer)
The set-tag method sets the pattern tag. The tag can be
further used inside a scanner.
get-tag -> Integer (none)
The get-tag method returns the pattern tag.
set-name -> none (String)
The set-name method sets the pattern name. The name is
a symbol identifier for that pattern.
get-name -> String (none)
The get-name method returns the pattern name.
set-regex -> none (String|Regex)
The set-regex method sets the pattern regex either with a
string or with a regex object. If the method is successfully
completed, the pattern type is switched to the REGEX
type.
set-escape -> none (Character)
The set-escape method sets the pattern escape character. The
escape character is used only in balanced mode.
get-escape -> Character (none)
The get-escape method returns the escape character.
set-balanced -> none (String|String String)
The set-balanced method sets the pattern balanced string.
With one argument, the same balanced string is used for
starting and ending. With two arguments, the first argument
is the starting string and the second is the ending
string.
Lexeme
The Lexeme class is a literal object that is designed to
hold a matching pattern. A lexeme consists of a string (i.e.
the lexeme value), a tag and optionally a source name (i.e.
file name) and a source index (line number).
Predicate
lexeme-p
Inheritance
Literal
Constructors
Lexeme (none)
The Lexeme constructor creates an empty lexeme.
Lexeme (String)
The Lexeme constructor creates a lexeme by value. The string
argument is the lexeme value.
Methods
set-tag -> none (Integer)
The set-tag method sets the lexeme tag. The tag can be
further used inside a scanner.
get-tag -> Integer (none)
The get-tag method returns the lexeme tag.
set-value -> none (String)
The set-value method sets the lexeme value. The lexeme value
is generally the result of a matching operation.
get-value -> String (none)
The get-value method returns the lexeme value.
set-index -> none (Integer)
The set-index method sets the lexeme source index. The
lexeme source index can be for instance the source line
number.
get-index -> Integer (none)
The get-index method returns the lexeme source index.
set-source -> none (String)
The set-source method sets the lexeme source name. The
lexeme source name can be for instance the source file
name.
get-source -> String (none)
The get-source method returns the lexeme source name.
Scanner
The Scanner class is a text scanner or lexical analyzer that
operates on an input stream and matches one or
several patterns. The scanner is built by adding patterns to
the scanner object. With an input stream, the scanner object
attempts to build a buffer that matches at least one pattern.
When such matching occurs, a lexeme is built. When building
a lexeme, the pattern tag is used to mark the lexeme.
Predicate
scanner-p
Inheritance
Object
Constructors
Scanner (none)
The Scanner constructor creates an empty scanner.
Methods
add -> none (Pattern*)
The add method adds 0 or more pattern objects to the
scanner. The priority of the pattern is determined by the
order in which the patterns are added.
length -> Integer (none)
The length method returns the number of pattern objects in
this scanner.
get -> Pattern (Integer)
The get method returns a pattern object by index.
check -> Lexeme (String)
The check method checks that a string is matched by the
scanner and returns the associated lexeme.
scan -> Lexeme (InputStream)
The scan method scans an input stream until a pattern is
matched. When a matching occurs, the associated lexeme is
returned.
Literate
The Literate class is a transliteration mapping class.
Transliteration is the process of changing characters by
mapping one to another one. The transliteration process
operates with a character source and produces a target
character with the help of a mapping table. This
transliteration object can also operate with an escape
table. In the presence of an escape character, an escape
mapping table is used instead of the regular one.
Predicate
literate-p
Inheritance
Object
Constructors
Literate (none)
The Literate constructor creates a default transliteration
object.
Literate (Character)
The Literate constructor creates a default transliteration
object with an escape character. The argument is the escape
character.
Methods
read -> Character (InputStream)
The read method reads a character from the input stream and
translates it with the help of the mapping table. A second
character might be consumed from the stream if the first
character is an escape character.
getu -> Character (InputStream)
The getu method reads a Unicode character from the input
stream and translates it with the help of the mapping table.
A second character might be consumed from the stream if the
first character is an escape character.
reset -> none (none)
The reset method resets all the mapping tables and installs a
default identity one.
set-map -> none (Character Character)
The set-map method sets the mapping table by using a source
and target character. The first character is the source
character. The second character is the target character.
get-map -> Character (Character)
The get-map method returns the mapping character by
character. The source character is the argument.
translate -> String (String)
The translate method translates a string by transliteration
and returns a new string.
set-escape -> none (Character)
The set-escape method sets the escape character.
get-escape -> Character (none)
The get-escape method returns the escape character.
set-escape-map -> none (Character Character)
The set-escape-map method sets the escape mapping table by
using a source and target character. The first character is
the source character. The second character is the target
character.
get-escape-map -> Character (Character)
The get-escape-map method returns the escape mapping
character by character. The source character is the
argument.
Functions
sort-ascent -> none (Vector)
The sort-ascent function sorts the vector argument in
ascending order. The vector is sorted in place.
sort-descent -> none (Vector)
The sort-descent function sorts the vector argument in
descending order. The vector is sorted in place.
sort-lexical -> none (Vector)
The sort-lexical function sorts the vector argument in
lexicographic order. The vector is sorted in place.