D is a tool for data processing. It is a set of command-line commands that perform file operations on D-files, such as selection, projection, sorting, joining, bundling and unbundling. D also provides tools for converting to and from other data formats such as CSV, XML and HTML. With DfromHtml in particular, D can be used as a scraping tool.
A D-file is a simple text file that follows a few conventions. Here is an example of a D-file:
tr:Romeo and Juliet / by Mr. William Shakespear
pub:London : Printed for J. Tonson, 1734
phys:84 p. ; 16 cm

tr:The merchant of Venice / by Mr. William Shakespear
pub:London : Printed for J. Tonson, 1734
phys:72 p. ; 17 cm

As the example shows, a record consists of one or more lines terminated by an empty line. Each line is a field, and the field name and the field value are separated by a colon (:).
This is the same record format that awk reads when RS="" and that perl reads when $/ = "" (paragraph mode). It is easy to create a D-file with your favorite editor, or to handle D-files with awk, perl or sed.
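Because the convention is so small, a reader for it fits in a few lines of any language. The following is a minimal Python sketch, not part of D itself. It assumes that records are separated by empty lines and that the first colon on a line separates the field name from the value, so that values may themselves contain colons. The function name parse_dfile is hypothetical.

```python
import re

def parse_dfile(text):
    """Split D-file text into records; each record is a list of (name, value) pairs."""
    records = []
    # Records are separated by one or more empty lines (paragraph mode).
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block:
            continue  # empty input yields no records
        fields = []
        for line in block.splitlines():
            # Assumption: only the first colon is the separator,
            # so a value such as "London : Printed ..." stays intact.
            name, _, value = line.partition(":")
            fields.append((name, value))
        records.append(fields)
    return records

sample = """\
tr:Romeo and Juliet / by Mr. William Shakespear
pub:London : Printed for J. Tonson, 1734
phys:84 p. ; 16 cm

tr:The merchant of Venice / by Mr. William Shakespear
pub:London : Printed for J. Tonson, 1734
phys:72 p. ; 17 cm
"""

records = parse_dfile(sample)
print(len(records))      # number of records
print(records[0][1])     # second field of the first record
```

Keeping each record as a list of pairs, rather than a dictionary, preserves field order and allows the same field name to occur more than once.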
For details of the D-file format, see the Dintro manual.
D-commands provide basic file operations. For example, suppose we have a tab-separated input file as follows:
chapter1 conversion selection projection
chapter2 projection bundling unbundling
Then the following D-commands:
produce the following result:
bundling Chapter2
conversion Chapter1
projection Chapter1, Chapter2
selection Chapter1
unbundling Chapter2
As you can see, this result is a word index of the input file. For more sophisticated printed output, you can use DtoTex.
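To make the transformation concrete, here is a minimal Python sketch that computes the same kind of word index from the tab-separated input. It is not the D-commands themselves and says nothing about how D implements the operation. The function name word_index is hypothetical, and the chapter labels are printed exactly as they appear in the input.

```python
from collections import defaultdict

def word_index(lines):
    """Map each word to the sorted list of chapters (first column) it appears in."""
    index = defaultdict(set)
    for line in lines:
        # The first tab-separated column is the chapter; the rest are words.
        chapter, *words = line.rstrip("\n").split("\t")
        for word in words:
            index[word].add(chapter)
    # Sort words alphabetically and the chapters within each word.
    return {word: sorted(chapters) for word, chapters in sorted(index.items())}

input_lines = [
    "chapter1\tconversion\tselection\tprojection",
    "chapter2\tprojection\tbundling\tunbundling",
]

index = word_index(input_lines)
for word, chapters in index.items():
    print(word, ", ".join(chapters))
```

In D terms, the step that inverts each line into one entry per word corresponds roughly to unbundling, and the step that gathers chapters under a shared word corresponds roughly to bundling.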
To inspect a D-file, use Dpr.
Use Dfd or Dfdp to get a report on D-files.
DfromCsv converts a CSV (comma-separated values) file into a D-file. DfromHtml extracts D-records from specified parts of an HTML file.
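As an illustration of what such a conversion involves, the following Python sketch turns CSV text into D-file text. It is not DfromCsv and does not reflect its options. It assumes that the first CSV row supplies the field names and that empty cells are simply omitted. The function name csv_to_dfile and the column names title and year are hypothetical.

```python
import csv
import io

def csv_to_dfile(csv_text):
    """Convert CSV text with a header row into D-file text (one record per row)."""
    reader = csv.DictReader(io.StringIO(csv_text, newline=""))
    blocks = []
    for row in reader:
        # One "name:value" line per non-empty cell; the header gives the names.
        lines = [f"{name}:{value}" for name, value in row.items() if name and value]
        blocks.append("\n".join(lines) + "\n")
    # An empty line between blocks separates the records.
    return "\n".join(blocks)

csv_text = "title,year\nRomeo and Juliet,1734\nThe merchant of Venice,1734\n"
dfile_text = csv_to_dfile(csv_text)
print(dfile_text, end="")
```

Omitting empty cells is a design choice that suits the D-file model, where a record only carries the fields it actually has.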
For those familiar with UNIX textutils: cut corresponds to Dproj, join to Djoin, sed to Ded, paste to Dpaste, sort to Dsort, grep to Dgrep or Dselect, head to Dhead, tail to Dtail, wc to Drc, and uniq to Dbundle "^". In some cases, Dfreq can be used instead of uniq.
Dselect provides powerful selection methods. Ded provides a general way to modify D-records using the Dl language.
See also the D-command listing in Dintro.
Many D-commands provide functions similar to those of UNIX textutils such as sort, cut and join. In textutils, however, you have to refer to a field by its position. In D you can use field names:
Dsort name infile
is much easier to remember and to use than
sort +1 -2 infile
Furthermore, with positional line-format data you have to keep separate documentation; otherwise you will forget which field is which within a couple of years. With D-files you need no such documentation as long as the field names are well chosen: the names themselves tell you what the content is.
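The practical effect of addressing fields by name can be sketched in Python. This is an illustration of the idea only, not of how Dsort works internally. The records reuse the tr and phys fields from the D-file example, with the field order deliberately different in the second record. The helper name field is hypothetical.

```python
def field(record, name, default=""):
    """Return the first value of the named field in a record, or a default."""
    for key, value in record:
        if key == name:
            return value
    return default

# Each record is a list of (name, value) pairs; field order may vary.
records = [
    [("tr", "Romeo and Juliet / by Mr. William Shakespear"),
     ("phys", "84 p. ; 16 cm")],
    [("phys", "72 p. ; 17 cm"),
     ("tr", "The merchant of Venice / by Mr. William Shakespear")],
]

# Sorting by name works regardless of where the field sits in each record.
by_phys = sorted(records, key=lambda record: field(record, "phys"))
for record in by_phys:
    print(field(record, "phys"), "|", field(record, "tr"))
```

A positional sort would need both records to list their fields in the same order, and the reader would need to know which column number holds the physical description.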
Repeating fields and repeating groups are common in real-world data processing. Relational databases and spreadsheets cannot handle them in a natural way. D is good at handling these structures.
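A repeating field is simply a field name that occurs more than once in a record. The Python sketch below shows one way to read such a record. It assumes the colon convention described earlier; the field names word and chapter and the function name group_fields are hypothetical.

```python
from collections import defaultdict

def group_fields(record_text):
    """Collect the values of one record by field name, keeping repeats in order."""
    grouped = defaultdict(list)
    for line in record_text.splitlines():
        # Assumption: the first colon separates the field name from the value.
        name, _, value = line.partition(":")
        grouped[name].append(value)
    return dict(grouped)

# One record in which the "chapter" field repeats.
record = "word:projection\nchapter:Chapter1\nchapter:Chapter2"

fields = group_fields(record)
print(fields["word"])
print(fields["chapter"])
```

In a single flat table the same record would need either one column per possible repetition or one row per chapter with the word duplicated, which is the awkwardness the paragraph above refers to.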
Even if you intend to use a relational database or a spreadsheet, D is useful in the data preparation phase.
To create D-files, you do not need the cumbersome data definition phase that a database management system requires. You can create or modify D-files as you like, simply following the natural shape of your data.
D-files have no limits on data length, field name length or the number of fields in a record (apart from the 32-bit limit).
Because the D-file convention is so simple, virtually any system can handle it. There are no "hidden" files and no separate data such as a "registry" or "resources". A small number of environment variables control the behavior, but the use of such "hidden" parameters is kept to a minimum. Using programs other than D-commands while processing D-files causes no problem at all.
Structured elements of arbitrary depth are outside the D-file data model. D can handle two-level stem-leaf structures, and these may be nested using user-defined delimiters, but this is still not enough to handle fully general structures.
You can convert XML files into D-files by flattening the structure at a certain level. See DfromXml and DtoXml.
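What flattening at a certain level means can be sketched in Python with the standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not the behavior of DfromXml. It assumes that one chosen element becomes one record, that each direct child becomes one field, and that anything nested deeper is reduced to its text. The element names books, book, place and printer and the function name xml_to_records are hypothetical.

```python
import xml.etree.ElementTree as ET

def xml_to_records(xml_text, record_tag):
    """Turn each <record_tag> element into a list of (name, value) pairs."""
    root = ET.fromstring(xml_text)
    records = []
    for element in root.iter(record_tag):
        fields = []
        for child in element:
            # itertext() gathers text from the child and all its descendants;
            # this is the point where deeper structure is flattened away.
            value = " ".join("".join(child.itertext()).split())
            fields.append((child.tag, value))
        records.append(fields)
    return records

xml_text = """\
<books>
  <book>
    <tr>Romeo and Juliet</tr>
    <pub><place>London</place> : <printer>Printed for J. Tonson</printer>, 1734</pub>
  </book>
</books>
"""

records = xml_to_records(xml_text, "book")
for name, value in records[0]:
    print(f"{name}:{value}")
```

The nested place and printer elements disappear from the output, which shows the information lost when a deeper structure is mapped onto the flatter D-file model.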
It would be possible to create a GUI (graphical user interface) for D-file operations, and it would help beginners understand what D-file operations are. However, given the development effort involved and the fact that a GUI adds no new functionality to D-file operations themselves, D will remain a tool for those who are comfortable with the command line for the foreseeable future.
D handles only character string data. Although the author has been nursing the idea of DfromBin and DtoBin for more than ten years, it is not yet mature. In any case, D will not handle binary data directly.