Brief Introduction to D / by Akira MIYAZAWA


What is D?

D is a tool for data processing. It is a set of command-line commands that perform file operations on D-files, such as selection, projection, sort, join, bundling and unbundling. D also provides tools for converting to and from other data formats such as CSV, XML and HTML. With DfromHtml in particular, D can be used as a scraping tool.


D-file

A D-file is a simple text file that follows a few conventions. Here is an example of a D-file:

tr:Romeo and Juliet / by Mr. William Shakespear
pub:London : Printed for J. Tonson, 1734
phys:84 p. ; 16 cm

tr:The merchant of Venice / by Mr. William Shakespear
pub:London : Printed for J. Tonson, 1734
phys:72 p. ; 17 cm

As shown in the example above, a record consists of one or more lines terminated by an empty line. Each line is a field, and the field name and the field value are separated by a colon (:).

This is the input record format of awk when RS="", or of perl when $/ = "". It is easy to create a D-file with your favorite editor, or to handle D-files with awk, perl or sed.
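The record convention described above can also be read by a few lines of general-purpose code. The following Python sketch is only an illustration, not part of D: the function name parse_dfile is hypothetical, and it assumes nothing beyond what is stated here (records end at an empty line, and the first colon on a line separates the field name from the value). The full rules of the format are in the Dintro manual and may cover cases this sketch ignores.

```python
def parse_dfile(text):
    """Split D-file text into records; each record is a list of (name, value) pairs."""
    records = []
    current = []
    for line in text.splitlines():
        if line == "":
            # An empty line terminates the current record.
            if current:
                records.append(current)
                current = []
            continue
        # Split at the first colon only, so values may themselves contain colons.
        name, _, value = line.partition(":")
        current.append((name, value))
    if current:
        # Accept a final record that lacks a trailing empty line (an assumption).
        records.append(current)
    return records


SAMPLE = """\
tr:Romeo and Juliet / by Mr. William Shakespear
pub:London : Printed for J. Tonson, 1734
phys:84 p. ; 16 cm

tr:The merchant of Venice / by Mr. William Shakespear
pub:London : Printed for J. Tonson, 1734
phys:72 p. ; 17 cm
"""

records = parse_dfile(SAMPLE)
print(len(records))
print(records[0][1])
```

A record is kept as a list of pairs rather than a dictionary so that a field name can occur more than once in the same record.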

For details of the D-file format, see the Dintro manual.


Data processing with D

D-commands provide basic file operations. For example, suppose we have a tab-separated input file as follows:

chapter1        conversion      selection       projection
chapter2        projection      bundling        unbundling

Then the following D-commands:

   DfromLine "chapter:+7,word:*" input-file | \
   Dunbundle word | \
   Dsort word:f,chapter:n | \
   Dbundle word:f | \
   Dorder word,chapter | \
   DtoLine -f "chapter:/\, /:Chapter%d"

produce the following result:

bundling        Chapter2
conversion      Chapter1
projection      Chapter1, Chapter2
selection       Chapter1
unbundling      Chapter2

As you can see, this result is a word index of the input file. For a more sophisticated printout, you can use DtoTex.
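To make the transformation concrete, here is a minimal Python sketch that produces the same index from the same input. It is not how the D-commands are implemented and does not model their options. It only mirrors the overall idea of splitting each line into one (word, chapter) pair per word, ordering the pairs, and then grouping the chapters under each word. The function name word_index is hypothetical, and the sketch assumes every label has the form "chapter" followed by a number and that words are ordered without regard to case.

```python
from collections import defaultdict

INPUT = (
    "chapter1\tconversion\tselection\tprojection\n"
    "chapter2\tprojection\tbundling\tunbundling\n"
)


def word_index(text):
    """Return index lines mapping each word to the chapters that contain it."""
    chapters_by_word = defaultdict(set)
    for line in text.splitlines():
        label, *words = line.split("\t")
        # Assumes every label is "chapter" followed by a number.
        number = int(label[len("chapter"):])
        for word in words:
            # One (word, chapter) pair per word on the line.
            chapters_by_word[word].add(number)
    lines = []
    # Order by word, ignoring case, then list each word's chapters numerically.
    for word in sorted(chapters_by_word, key=str.casefold):
        chapters = ", ".join(f"Chapter{n}" for n in sorted(chapters_by_word[word]))
        lines.append(f"{word}\t{chapters}")
    return lines


index_lines = word_index(INPUT)
print("\n".join(index_lines))
```

The D pipeline expresses the same steps as six short commands, and each intermediate result remains a D-file that can be inspected on its own.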


Some D-commands

To look into a D-file, use Dpr.

Use Dfd or Dfdp to get a report on D-files.

DfromCsv converts a CSV (comma-separated values) file into a D-file. DfromHtml extracts D-records from specified parts of an HTML file.

For those familiar with UNIX textutils: cut corresponds to Dproj, join to Djoin, sed to Ded, paste to Dpaste, sort to Dsort, grep to Dgrep or Dselect, head to Dhead, tail to Dtail, wc to Drc, and uniq to Dbundle "^". In some cases you may find that Dfreq can be used instead of uniq.

Dselect provides powerful methods for selection. Ded provides a general way to modify D-records using the Dl language.

See also the D-command listing in Dintro.


Why D?

Easy to use:

Many D-commands provide functions similar to those of UNIX textutils such as sort, cut or join. In textutils, however, you have to refer to a field by its position. In D you can use field names:

Dsort name infile

is much easier to remember and to use than

sort +1 -2 infile

Furthermore, with line-format data you have to keep documentation alongside the data; otherwise, in a couple of years, you will forget which field is which. With D-files you do not need such documentation, provided the field names are well chosen: the names themselves remind you what each field contains.
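The difference between positional and named access can be illustrated outside D as well. The Python sketch below is only an analogy under stated assumptions: the helper field and the sample records are hypothetical, and a record is modeled as a list of (name, value) pairs. Because the sort key is looked up by name, it does not matter where the field sits within each record.

```python
def field(record, name, default=""):
    """Return the first value of the named field in a list of (name, value) pairs."""
    for key, value in record:
        if key == name:
            return value
    return default


# Hypothetical sample data; the two records list their fields in different orders.
records = [
    [("id", "2"), ("name", "Tonson")],
    [("name", "Shakespear"), ("id", "1")],
]

# Sort by the field called "name", wherever it appears in each record.
by_name = sorted(records, key=lambda record: field(record, "name"))
print([field(record, "name") for record in by_name])
```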

Repeating field and repeating group handling:

Repeating fields and repeating groups are common in real-world data processing. Relational databases and spreadsheets cannot handle them in a natural way. D is good at handling these structures.

Even if you plan to use a relational database or a spreadsheet, D will be useful in the data preparation phase.
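As an illustration of why repeating fields sit awkwardly in a table, the Python sketch below takes one record in which a field occurs twice and expands it into one row per value, which is the shape a relational table or spreadsheet would need. The helper split_repeating, the field names and the sample values are all hypothetical, and the sketch does not claim to reproduce the behavior of any D-command.

```python
def split_repeating(record, name):
    """Return one copy of the record per value of the repeating field `name`."""
    # Fields other than the repeating one are shared by every output row.
    fixed = [(key, value) for key, value in record if key != name]
    return [fixed + [(key, value)] for key, value in record if key == name]


# Hypothetical record: the "author" field occurs twice.
book = [
    ("title", "Example title"),
    ("author", "First Author"),
    ("author", "Second Author"),
]

rows = split_repeating(book, "author")
for row in rows:
    print(row)
```

In the single record the title is stored once. In the flattened rows it is duplicated for every author, which is the kind of redundancy that a data preparation step has to manage.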

No definition:

To create D-files, you do not need the cumbersome data definition phase that a database management system requires. You can create or modify D-files as you like, simply following the natural shape of your data.

No limits:

D-files have no limits on data length, field name length or the number of fields in a record (except for the 32-bit limit).

Simple and open:

Because the D-file convention is so simple, virtually any system can handle it. There are no "hidden" files and no separate data such as a "registry" or "resources". A small number of environment variables control the behavior, but the use of such "hidden" parameters is kept to a minimum. Using programs other than D-commands while processing D-files causes no problems at all.

But ...

D does not handle XML-like structures

Structured elements of arbitrary depth are outside the D-file data model. D can handle two-level stem-leaf structures, which may be nested further with user-defined delimiters, but this is still not enough to handle fully general structures.

You can convert XML files into D-files by flattening the structure at a certain level. See DfromXml and DtoXml.

D does not provide GUI

It is possible to create a GUI (graphical user interface) for D-file operations, and it would help beginners understand what D-file operations are. However, considering the development workload and the fact that a GUI adds no new functionality to D-file operations themselves, D will remain, for the foreseeable future, a tool for those who are comfortable with command-line commands.

D does not handle binary data

D handles only character string data. Although the author has been nursing the idea of DfromBin and DtoBin for more than ten years, it is not mature yet. In any case, D will not handle binary data directly.


Update 2014-09-12