Dl - D-language

Introduction

What is Dl

Dl is a language for handling D-records. It is not a general-purpose programming language such as C, C++ or Java. Functionally, it is akin to awk: Dl handles D-records, while awk handles lines of text files. For example, you can add a new field to all records or to specific records of the input D-file, change field values, or delete fields.

Though Dl has almost all the functions of a programming language, it is not intended for writing large programs. Ded is an interpreter of Dl and is not as fast as a compiled language. If you need very complicated processing, it is recommended to use another tool, for example Perl or the C programming language.

Typical usage of Dl is:

Ded IF txtlang == jpn OR txtlang == kor OR txtlang == chi THEN area = ea FI input-file.d

This command adds the field "area" with value "ea" (or changes its value to "ea") in every record whose "txtlang" field value is "jpn", "kor" or "chi".

Some D-commands can be written in Dl. For example,

Dtie -t ":" a,b c

is the same as

Ded FIELD c = FIELD a . CONST ":" . FIELD b ";" FIELD a = FIELD b = "{" "}"

The conceptual difference between these two approaches is that a D-command represents a basic D-file operation, while Dl offers a general-purpose method for handling D-records. The practical difference is speed. D-commands are tuned for specific operations, with code hard-wired for each operation. Ded is an interpreter of Dl and executes the operations step by step, so it is slower than D-commands. It is recommended to use the specific D-command when a suitable one is provided.

Features of Dl

A Dl program is generally written as a series of command arguments. This is like the sed command of UNIX, except that the -e option is not used and you write the program directly as command arguments. Or, if you like, you can provide the Dl program from a text file. Details are described in the General Syntax section.

Dl has a highly simplified syntax. Unlike many languages, Dl has no statements, only expressions. Control structures like "if" or "while" are operators in Dl. Even ";" is an operator, similar to the "," operator of the C language. An expression is "evaluated", which means that the corresponding part of the Dl program is "executed".

In addition, Dl does not have subroutines or macros. These facts make it difficult to write large, complicated programs, which is not the main objective of Dl.

Any field of a D-record is repeatable. Consequently, any constant or variable of Dl is an array, and every operator of Dl is applied to arrays, each in its own way. For example, "+" (the addition operator) works differently depending on the number of elements in its operand values. Perl has array and scalar contexts to control operation semantics; Dl has only an array context for an operator.

Ded and Dselect

Two separate programs interpret Dl. Ded is the full processor of Dl, while Dselect restricts the operators that affect the output record.

In the case of Ded, the given Dl expressions are evaluated (i.e., the given Dl program is executed) for each input D-record. After the evaluation, the current record is written to the standard output if it has at least one field (i.e., if the current record has not been deleted). You may write the current record explicitly with OUTPUT, but even if you use OUTPUT, Ded will still write the current record after the evaluation. After writing the current record, Ded reads the next input record and starts a new cycle of evaluation, until it encounters end of file.

In the case of Dselect, the given Dl expression is evaluated for each input D-record, and if the result value is true (see the Boolean Evaluation section below), the input record is written to the standard output. Assignment to a field and the output operation are not allowed in a Dl expression given to the Dselect command, so the input record is never changed or duplicated in the output.
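
The two processing cycles can be pictured with a short Python sketch. This is only an illustration of the behaviour described above, not the implementation of Ded or Dselect. The record model (a dict from field name to a list of values) and the names ded, dselect and add_area are assumptions made for this example; add_area also assumes that the comparison is true when any value of "txtlang" matches.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, List[str]]  # assumed model: field name -> list of values


def ded(records: Iterable[Record], program: Callable[[Record], object]) -> Iterator[Record]:
    """Evaluate the program on each record, then emit the record if any field is left."""
    for rec in records:
        current = {name: list(values) for name, values in rec.items()}
        program(current)  # the program may add, change or delete fields
        current = {name: values for name, values in current.items() if values}
        if current:  # a record with no field counts as deleted
            yield current


def dselect(records: Iterable[Record], predicate: Callable[[Record], bool]) -> Iterator[Record]:
    """Emit each input record, unchanged, when the predicate is true for it."""
    for rec in records:
        if predicate(rec):
            yield rec


def add_area(rec: Record) -> None:
    """Rough counterpart of the 'typical usage' program in the Introduction."""
    if any(v in ("jpn", "kor", "chi") for v in rec.get("txtlang", [])):
        rec["area"] = ["ea"]


if __name__ == "__main__":
    data = [{"txtlang": ["jpn"]}, {"txtlang": ["eng"]}]
    print(list(ded(data, add_area)))
    print(list(dselect(data, lambda r: "jpn" in r.get("txtlang", []))))
```

Note that ded always emits the (possibly modified) record after the program has run, while dselect only decides whether the untouched input record is emitted.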

General Syntax

Words and quoting

The source program of Dl is taken from the command arguments or from source files given with the -f option. A Dl program is made of words. Dl operators, constants, field names, variable names and other Dl reserved words are given as words. Even a parenthesis is a word in Dl. A word is a string of arbitrary length made of any characters. The way a word is recognized differs slightly between command arguments and source file input.

When the program is given as command arguments, each command argument makes one Dl word. When the program is given from a text file, words are separated by white space. End of file is treated as a new line character. In addition to white space, only REVERSE SOLIDUS (\), QUOTATION MARK (") and APOSTROPHE (') have special meaning in the source file. The quoting mechanism with these characters, explained below, follows the UNIX Bourne shell specification.

A REVERSE SOLIDUS (\) at the end of a line is regarded as a line continuation mark unless it is placed in an APOSTROPHE (') quoted string. The continuation mark and the new line after it are omitted from the word. A REVERSE SOLIDUS at any other position is an escape character unless it is placed in an APOSTROPHE or QUOTATION MARK quoted string. The escape character itself is omitted, and the following character becomes a normal character that forms part of the word. The escape character is usually used to include spaces, QUOTATION MARK, APOSTROPHE or REVERSE SOLIDUS as part of a word.

QUOTATION MARK (") is another means of using special characters in a Dl source file. When the Dl parser encounters a QUOTATION MARK, it omits the mark from the word and includes the following characters in the word until it encounters another QUOTATION MARK. (The ending QUOTATION MARK is not included in the word.) There are two exceptions within QUOTATION MARKs. A REVERSE SOLIDUS followed by a QUOTATION MARK makes just one QUOTATION MARK; this is used to escape the ending QUOTATION MARK. The other exception is a REVERSE SOLIDUS followed by a new line character, which is a line continuation mark within QUOTATION MARKs: both the REVERSE SOLIDUS and the new line character are omitted from the current word. Any other REVERSE SOLIDUS between QUOTATION MARKs is treated as a normal character.

APOSTROPHE (') is also a quoting character. It is stronger than QUOTATION MARK. When the Dl parser encounters an APOSTROPHE, the following characters are included in the current word until it encounters the closing APOSTROPHE. Unlike QUOTATION MARK, there is no exception: even a REVERSE SOLIDUS or a new line character is treated as a normal character within APOSTROPHEs. To use an APOSTROPHE within a word, use the REVERSE SOLIDUS escape or the QUOTATION MARK quoting explained above.

Example of quoted words in Dl source files

Source text                     Resulting word
------------------------------  ------------------
one\ word                       one word
"one word"                      one word
'one word'                      one word
one" "word                      one word
o\ n" "e' '' '"w o ""r d"       o n e  w o r d
\o\n\e\ \w\o\r\d                one word
one\                            oneword
word
\\\"\'                          \"'
"\"\\'\""                       "\'"
"one\                           oneword
word"
'"\"'                           "\"
'one                            one
word'                           word
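
The quoting rules above can be sketched as a small word splitter. The following Python function is only an illustration, not the actual parser of Ded; the name dl_words is hypothetical. It assumes that, inside QUOTATION MARKs, a doubled REVERSE SOLIDUS yields a single one, as in the Bourne shell and as the quoted-word examples above suggest, although the prose does not state this explicitly. It also assumes that an unterminated quote is simply closed at end of input.

```python
def dl_words(source: str) -> list[str]:
    """Split Dl source-file text into words (illustrative sketch only)."""
    words: list[str] = []
    current: list[str] = []
    started = False  # True once the current word has begun, even if it is still empty
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if ch == "\\":
            # End of file is treated as a new line character.
            nxt = source[i + 1] if i + 1 < n else "\n"
            if nxt != "\n":
                current.append(nxt)  # the escaped character becomes a normal character
                started = True
            i += 2  # a backslash-newline pair is a line continuation and is dropped
        elif ch == '"':
            started = True
            i += 1
            while i < n and source[i] != '"':
                # Assumption: \" and \\ are escapes, and \<newline> is a continuation.
                if source[i] == "\\" and i + 1 < n and source[i + 1] in '"\\\n':
                    if source[i + 1] != "\n":
                        current.append(source[i + 1])
                    i += 2
                else:
                    current.append(source[i])
                    i += 1
            i += 1  # skip the closing QUOTATION MARK
        elif ch == "'":
            started = True
            i += 1
            while i < n and source[i] != "'":
                current.append(source[i])  # everything is literal inside APOSTROPHEs
                i += 1
            i += 1  # skip the closing APOSTROPHE
        elif ch.isspace():
            if started:
                words.append("".join(current))
                current, started = [], False
            i += 1
        else:
            current.append(ch)
            started = True
            i += 1
    if started:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    print(dl_words('lang == "jpn kor"'))  # ['lang', '==', 'jpn kor']
```

When the program is given as command arguments, no such splitting is needed, because each argument is already one word.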

Reserved words

The following words are reserved words in Dl.

! != !~ $& $' $. $n $? $` % && ( ) * ** + , - -- . .. / ; < <= <> = == =~ > >= @_ [ ] { || ABS AND ATAN AVG BY CAPS CAT CODESET CONST COS COUNT CURREC DIVIDEDBY DO DONE ELIF ELSE EPILOGUE EQ EXISTS EXIT EXP FI FIELD FIELDS FILENAME FNR FOR GE GREP GT IF IN INCL INT JOIN LE LENGTH LIKE LOCALE LOG LOG10 LT MATCH MATCHn MAX MIN MINUS MOD NE NOT NR NUM OR OUTPUT PLUS POSTMATCH POWER PREMATCH QX REC# S SG SIN SPLIT SPRINTF SQRT SSCANF STATIC STATUS STR SUBST SUBSTG SUBSTR SUM TAN THEN TIMES TOLOWER TOUPPER UNLIKE VAR WHILE

The following words are reserved only in limited situations:

} */

Note that all words in Dl are case sensitive (i.e., "IF" is a reserved word, but "if" and "If" are not).

Comment

A comment starts with the word /* and ends with the word */. Unlike comments in the C language, /* and */ must be separated from the other text by spaces; i.e., /*COMMENT*/ is not a comment (but a non-reserved word), while /* COMMENT */ is treated as a comment.
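
Because the comment delimiters are ordinary words, comment removal can be pictured as a filter over the word list. The Python sketch below is an illustration only; the name strip_comments is hypothetical, and it assumes that comments do not nest, which the text above does not specify.

```python
def strip_comments(words: list[str]) -> list[str]:
    """Drop every word from a '/*' word up to and including the next '*/' word."""
    kept: list[str] = []
    in_comment = False
    for word in words:
        if in_comment:
            if word == "*/":
                in_comment = False
        elif word == "/*":
            in_comment = True
        else:
            kept.append(word)  # '/*COMMENT*/' is one ordinary word and is kept
    return kept


if __name__ == "__main__":
    print(strip_comments(["a", "/*", "COMMENT", "*/", "==", "b"]))  # ['a', '==', 'b']
    print(strip_comments(["/*COMMENT*/", "b"]))                     # ['/*COMMENT*/', 'b']
```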

Tokens

The tokens of Dl are the field-name, constant, variable, static variable, special variable, operator, parenthesis and end-token. Each token must be given as a separate word, or as words led by a reserved keyword.

Grammar

The following is a simplified grammar of Dl.

program    ::= expression

expression ::= primary
             | unary-operator expression
             | expression binary-operator expression
             | expression {SUBST|S|SUBSTG|SG} expression BY expression
             | expression '[' expression ']'
             | IF expression THEN expression [ ELIF expression THEN expression ]... [ ELSE expression ] FI
             | WHILE expression DO expression DONE
             | FOR variable IN expression DO expression DONE

primary    ::= constant
             | field-name
             | variable
             | static-variable
             | special-variable
             | '(' expression ')'

Syntactical Components

Field Name
Constant
Variable and Static Variable
Special Variables
Parentheses
End Token

Field Name

Morphology

A word following the reserved word FIELD is a field name token. For example,

FIELD a

is a field-name token "a". Similarly,

FIELD FIELD

is a field-name token "FIELD". In this case the second word (FIELD) is not a reserved word but a field name, while the first word (FIELD) is a reserved word.

Omission of the reserved word FIELD

As a special case, a non-reserved word at the top of the program is regarded as a field-name token. A non-reserved word that follows any of the tokens below is also a field-name token by itself.

(
; , && || !
IF THEN ELIF ELSE WHILE DO
ABS AND ATAN AVG CAPS CAT COS COUNT EXISTS EXP INT LENGTH LOG LOG10 MAX MIN NOT OR SIN SQRT SUM TAN TOLOWER TOUPPER

In the next example, the second word "a" is a field-name token, because it is after EXISTS.

EXISTS a

Evaluation of Field-name token

A field-name token, when evaluated, yields the values of the fields with the given name in the current record. The current record may have two or more fields with the same name; in this case, the result of the evaluation is an array. When the current record has no such field, the value is the null value.

Numeric qualifier

By default, the value of a field-name token is a string. In some cases, for example in a comparison operation, you may want to evaluate it as a number. The numeric qualifier makes the field-name token evaluate to a numeric value. It consists of a COLON (:) and the letter "n" after the field name. For example

FIELD seq

is evaluated as a string, thus, "10" is smaller than "9". But,

FIELD seq:n

is evaluated as numeric, and "10" becomes greater than "9".
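
The difference between the two evaluations is the familiar one between string and numeric comparison. The Python sketch below only illustrates that difference; the record model and the helper field_values are assumptions made for this example, and how Dl converts non-numeric strings is not covered.

```python
def field_values(record: dict, name: str, numeric: bool = False) -> list:
    """Return the values of a field, as strings or (like the ':n' qualifier) as numbers."""
    values = record.get(name, [])
    return [float(v) for v in values] if numeric else list(values)


a = {"seq": ["10"]}
b = {"seq": ["9"]}

# Compared as strings, "10" sorts before "9" because "1" precedes "9".
string_less = field_values(a, "seq")[0] < field_values(b, "seq")[0]
# Compared as numbers, 10 is greater than 9.
numeric_less = field_values(a, "seq", numeric=True)[0] < field_values(b, "seq", numeric=True)[0]

if __name__ == "__main__":
    print(string_less, numeric_less)  # True False
```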

Constant

Morphology

There are two ways to denote constants. One is to use the reserved word CONST: the word following CONST is a constant token. To make an array value, repeat CONST and a value after the first constant.

CONST a
CONST a CONST b

The other way to denote a constant is to use BRACEs. The words following a LEFT BRACE ({) and before a RIGHT BRACE (}) form a constant token. Within these BRACEs, the reserved words of Dl lose their effect and become constants.

{ a }
{ a b }

Note that a LEFT BRACE is just a constant in BRACEs. Thus the next example:

{ { } }

causes syntax error at the fourth word, because the constant token ends at the third word.

There is no way to include a string consisting of a single RIGHT BRACE in a braced constant. Use

CONST }

for this purpose.

To make a null-value constant, use

{ }

There is no way to make a null-value constant with the CONST notation.

Omission of the reserved word CONST

As a special case, a non-reserved word that follows any of the tokens below is a constant by itself.

!= !~ % * ** + - . .. /
< <= <> = == =~ > >= [
BY DIVIDEDBY EQ GE GREP GT IN INCL JOIN
LE LIKE LT MINUS MOD NE NUM PLUS POWER
QX S SG SPLIT STR SUBST SUBSTG SUBSTR TIMES UNLIKE

For example, in the next case

FIELD 1 == 1

the second word '1' is a field name, and the fourth word '1' is a constant.

Evaluation of constant

Constants are evaluated as strings, as they are. When a numeric value is required, for example before or after '+', the Dl interpreter automatically converts the constant to a numeric value. There is no notation for writing a constant as an explicitly numeric value; the NUM operator converts a string to a numeric value.

Variable and Static Variable

Morphology

A variable token consists of the reserved word VAR and a following arbitrary word. Similarly, a static variable token consists of the reserved word STATIC and a following arbitrary word. The reserved word VAR may be omitted only after the FOR token. There is no way to omit the reserved word STATIC.

Lifetime

A variable or a static variable can hold a value. The difference is the lifetime. The lifetime of a variable is just one cycle of Dl execution; i.e., when a D-record is read from the input file, all variables are wiped out before the given program is evaluated. A static variable has the lifetime of a D-command execution; i.e., the value assigned to it is kept throughout the execution of the D-command. In other words, a static variable is a variable in the usual sense, and a variable is just for local use, such as a loop index. (The FOR operator takes a variable as its index.)
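
The two lifetimes can be pictured with the following Python sketch. It is an illustration only: the names run_cycles and count_program are hypothetical, and the two dicts merely stand in for the VAR and STATIC storage described above.

```python
def run_cycles(records, program):
    """Run the program once per record; 'static' persists, 'var' is reset every cycle."""
    static = {}  # lives for the whole command execution
    for rec in records:
        var = {}  # wiped out before each evaluation
        program(rec, var, static)
        yield rec


def count_program(rec, var, static):
    """Increment a counter in both storages and copy the results into the record."""
    var["n"] = var.get("n", 0) + 1
    static["n"] = static.get("n", 0) + 1
    rec["var_n"] = [str(var["n"])]
    rec["static_n"] = [str(static["n"])]


if __name__ == "__main__":
    out = list(run_cycles([{}, {}, {}], count_program))
    print([r["var_n"][0] for r in out])     # ['1', '1', '1']
    print([r["static_n"][0] for r in out])  # ['1', '2', '3']
```

The counter kept in var never grows past 1, because it is recreated for every record, while the counter kept in static accumulates across records.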

Scope of a variable or a static variable is always the whole program.

Evaluation

A variable or a static variable yields the value assigned to it. When a variable or a static variable not assigned a value is evaluated, it yields the null value (an array with no element).

Special Variables

Syntactically, a special variable consists of a reserved keyword. Semantically, special variables are like the predefined variables of Perl, and some of them are just like statements of a programming language.

When a special variable is evaluated, it yields a value related to the environment of program execution, or performs some function for the program execution. In this sense, "special variable" may be a misleading name, but they are classified into one category to keep the syntax simple.

These variables cannot be changed by the user, except for CURREC, which represents the whole current record and can be changed by an assignment operation.

Individual special variables are described in the Operators and related special variables section.

Parentheses

LEFT PARENTHESIS (() and RIGHT PARENTHESIS ()) are used as in usual languages, to make an expression a unit, changing the order of operation.

Note that CURLY BRACKETS ({ }) are for constant tokens and do not group expressions. Note also that SQUARE BRACKETS ([ ]) are a suffix operator.

End Token

The word '--' explicitly indicates the end of the program. Usually you do not need it, because the Dl parser inserts an end token automatically when it encounters a word that is not a valid token. You need an explicit end token only when your first input file name is the same as one of the reserved words of Dl. For example:

COUNT a LT 2 -- LT

In the above example, the third word LT is the Dl reserved word for the comparison operator "less than", while the sixth word LT is the input file name. In this case you need the end token, because without it the file name is interpreted as a reserved word, causing a syntax error. (Note that if the input file name were "lt" in lower case, you would not need the end token.)

Evaluation or Execution

As the Dl grammar has only expressions, the program is "evaluated", in other words "executed". In this sense, it is like LISP.

Evaluation of elements (field names, variables, etc.) is described in the Syntactical Components sections. When an expression including operators is evaluated, operations described in the Operators and related special variables sections are executed.

Boolean Evaluation

The result of a Dl operation may be an array of string or numeric values. When the result is evaluated as a boolean value, the evaluation follows these rules:

  1. When the value is simple (i.e. number of elements is 1),
    1. When the value is numeric,
      value 0 is FALSE, and non 0 value is TRUE.
    2. When the value is string,
      Null string is FALSE, and any other string is TRUE.
  2. When the value is null value (i.e. number of elements is 0),
    it is FALSE.
  3. When the value is not simple (i.e. number of elements is greater than 1),
    it is TRUE, regardless of the type or element values.

This may lead to surprising results. For example, a numeric value { 0 } is FALSE, while a string value { 0 } is TRUE (by rule 1). Nevertheless, it works well in most practical situations.

This boolean evaluation rule is generally applied in Dl, whenever an operand requires boolean values. For example logical AND operation requires the left hand operand to be evaluated as boolean, and the above rule is applied.
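
The three rules can be written down as a short Python function. This is a sketch of the rules as stated in this section, not code taken from the Dl interpreter; the name dl_truth is hypothetical, and a Dl value is modelled as a Python list whose elements are str (string) or int/float (numeric).

```python
def dl_truth(value: list) -> bool:
    """Boolean evaluation of a Dl value modelled as a list of str or numbers."""
    if len(value) == 0:
        return False  # rule 2: the null value is FALSE
    if len(value) > 1:
        return True  # rule 3: more than one element is always TRUE
    (item,) = value  # rule 1: a simple value
    if isinstance(item, str):
        return item != ""  # only the null string is FALSE
    return item != 0  # numeric: 0 is FALSE, anything else is TRUE


if __name__ == "__main__":
    print(dl_truth([0]))       # False (numeric 0)
    print(dl_truth(["0"]))     # True  (string "0")
    print(dl_truth([]))        # False (null value)
    print(dl_truth(["", ""]))  # True  (two elements)
```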

Examples

In the following examples, FIELD, CONST and VAR are omitted wherever possible. This is to demonstrate how the Dl parser works; it is not a recommendation. I would rather recommend not relying on these default interpretations, which can easily lead to mistakes.

Test if a field "lang" is "jpn":

lang == jpn

Test if a field "yr" is smaller than 2003 as a numeric value:

yr:n LT 2003

Assume each input D-record has just one field "l", whose value is one line of a text file. In this text file, paragraphs are separated by blank lines. The next example adds a field "P", which holds the paragraph number, to each D-record.

IF NR == 1 THEN STATIC P = 1 FI ;
IF l =~ "^ *$" THEN STATIC P = STATIC P + 1 FI ;
FIELD P = STATIC P
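
For readers more familiar with general-purpose languages, the logic of this program can be mirrored in Python roughly as follows. This is an illustrative sketch only: the function name number_paragraphs and the record model (a dict from field name to a list of values) are assumptions, and the local variable p plays the role of STATIC P.

```python
import re


def number_paragraphs(lines):
    """Return one record per line, each with the paragraph number in field 'P'."""
    p = None  # counterpart of STATIC P, kept across records
    records = []
    for nr, line in enumerate(lines, start=1):
        if nr == 1:  # IF NR == 1 THEN STATIC P = 1 FI
            p = 1
        if re.search(r"^ *$", line):  # a blank line starts the next paragraph
            p += 1
        records.append({"l": [line], "P": [str(p)]})
    return records


if __name__ == "__main__":
    for rec in number_paragraphs(["first", "still first", "", "second"]):
        print(rec["P"][0], repr(rec["l"][0]))
```

As in the Dl program, the blank line itself is counted as belonging to the paragraph that follows it.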

The next example works like Dtie -t / y,m,d ymd (under the condition that fields "y", "m" and "d" have the same number of elements).

ymd = FIELD y . CONST / . FIELD m . CONST / . FIELD d ;
y = FIELD m = FIELD d = { }
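
The effect of this program can be sketched in Python as below. The sketch assumes that the "." operator concatenates arrays of equal length element by element, which is an interpretation of the condition stated above rather than something this page defines; the function name tie and the dict-of-lists record model are likewise assumptions.

```python
def tie(record: dict, sources: list, target: str, sep: str) -> None:
    """Join the source fields element by element into the target field, then delete them."""
    columns = [record.get(name, []) for name in sources]
    record[target] = [sep.join(parts) for parts in zip(*columns)]
    for name in sources:
        record.pop(name, None)  # counterpart of assigning the null value { }


if __name__ == "__main__":
    rec = {"y": ["2003", "2004"], "m": ["01", "02"], "d": ["31", "29"]}
    tie(rec, ["y", "m", "d"], "ymd", "/")
    print(rec)  # {'ymd': ['2003/01/31', '2004/02/29']}
```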

The next example is similar to Perl's grep. It eliminates fields whose value is a null string.

FOR i IN COUNT FIELDS - 1 .. 0 DO
  IF FIELDS [ VAR i ] LIKE "^[^:]*:$" THEN
    FIELDS [ VAR i ] = { }
  FI
DONE
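
A rough Python counterpart of this loop is shown below. It assumes that FIELDS behaves like a list of "name:value" strings, which is inferred from the pattern used in the example, and that LIKE performs a regular expression match; the function name drop_empty_fields is hypothetical. The loop runs from the last index down to 0, as in the Dl program, so that deleting an element does not shift the elements still to be visited.

```python
import re


def drop_empty_fields(fields: list) -> list:
    """Remove every 'name:' entry, i.e. every field whose value is the null string."""
    fields = list(fields)  # work on a copy
    for i in range(len(fields) - 1, -1, -1):  # COUNT FIELDS - 1 .. 0
        if re.search(r"^[^:]*:$", fields[i]):
            del fields[i]
    return fields


if __name__ == "__main__":
    print(drop_empty_fields(["a:1", "b:", "c:x:y", "d:"]))  # ['a:1', 'c:x:y']
```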

Similar to the example above, but done another way, with an array suffix. It works like Dproj foo:

FOR i IN 0 .. COUNT FIELDS - 1 DO
  IF FIELDS [ VAR i ] =~ ^foo: THEN
    VAR f = VAR f , VAR i
  FI
DONE ;
FIELDS = FIELDS [ VAR f ]
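
The same two-step idea, first collecting the indices and then selecting with them, looks like this in Python. As before, this is only a sketch under the assumption that FIELDS is a list of "name:value" strings; the function name project is hypothetical.

```python
import re


def project(fields: list, name: str) -> list:
    """Keep only the fields with the given name, selecting them by index."""
    pattern = re.compile("^" + re.escape(name) + ":")
    picked = [i for i, field in enumerate(fields) if pattern.search(field)]  # VAR f
    return [fields[i] for i in picked]  # FIELDS [ VAR f ]


if __name__ == "__main__":
    print(project(["foo:1", "bar:2", "foo:3", "food:4"], "foo"))  # ['foo:1', 'foo:3']
```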

See Also

Dintro, Ded, Dselect

AUTHOR

MIYAZAWA Akira


miyazawa@nii.ac.jp
2003-2007