CONNECT Table Types - Data Files

Most of the tables processed by CONNECT are just plain DOS or UNIX data files, logically regarded as tables thanks to the description given when the table is created. This description comes from the CREATE TABLE statement. Depending on the application, these tables may already exist as files, and are then used as-is by CONNECT, or they can be physically created by CONNECT as a result of a CREATE TABLE ... SELECT ... and/or INSERT statement.

The file path/name is specified with the FILE_NAME option. If the path is relative, it is taken as relative to the database directory, that is, the directory containing the table's .FRM file.

Inward Tables

A special category of file-based CONNECT tables are "inward" tables. They are created when the file name is not specified in the CREATE TABLE statement. The default file name is tab_name.type, where tab_name is the table name and type is the table type in lowercase. An empty file is created in the current database directory at table creation time and is then populated by subsequent inserts.

These tables behave like tables created by other storage engines and, unlike other CONNECT tables, their files are erased when the table is dropped. Of course, to be usable they must not be read-only. Even though their usefulness is limited, they can be used for testing purposes or when the user does not have the FILE privilege.

Multiple File Tables

A multiple file table is a table physically contained in several files of the same type instead of just one. These files are processed sequentially during query execution and the result is the same as if all the data were in a single file. This is useful for processing files coming from different sources (such as crash log files) or created over different time periods (such as monthly bank reports) that must be treated as a single table. Note that operations on such tables are limited to sequential SELECT and UPDATE statements, and that multiple VEC tables are not supported by CONNECT. The list of files depends on the setting of the multiple option in the CREATE TABLE statement.

Multiple tables are specified with MULTIPLE=n, which can take three values:

0  Not a multiple table (the default). Can be used in an ALTER TABLE statement.
1  The table is made of files located in the same directory. The FILE_NAME option is a pattern, such as 'cash*.log', that matches all the desired files.
2  FILE_NAME is the name of a file containing the paths and names of all the files making up the table. This file can be produced with a DIR table.

The FILEID special column, described here, makes it possible to filter the file list or to do some grouping on the files that make up a multiple table.
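
For example, here is a minimal sketch of a multiple table gathering all the crash log files of one directory (the table, column, and file names are invented for illustration, and the FILEID column is declared with the SPECIAL column option described in the special columns documentation):

create table crashlogs (
  logline char(255) not null,
  fname char(100) special=FILEID)
engine=CONNECT table_type=DOS
file_name='crash*.log' multiple=1;

select fname, count(*) from crashlogs group by fname;

The last query groups the rows by the file they come from.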

File Mapping

For file-based tables of reasonable size, processing time can be greatly reduced under Windows and on some UNIX or Linux systems by using the "file mapping" technique, which consists of processing a file as if it were entirely in memory. Mapping is specified at table creation with the MAPPED=YES option. This does not apply to tables that are not handled by the system I/O functions (XML and INI).
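
For instance, a minimal sketch of a mapped table (the table, column, and file names are invented):

create table maplog (
  logdate char(10) not null,
  message char(70) not null)
engine=CONNECT table_type=FIX file_name='maplog.txt' mapped=YES;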

Big File Tables

Because all files are handled by the standard input/output functions of the operating system, their size is limited to 2GB, the maximum size these functions can handle. For some table types, CONNECT can deal with files that are larger than 2GB or likely to grow beyond this limit. These types are FIX, BIN and VEC. To tell CONNECT to use the input/output functions that can handle big files, specify the option huge=1 or huge=YES. Note, however, that CONNECT cannot perform random access on tables larger than 2GB.
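
For instance, a sketch of a FIX table expected to grow beyond 2GB (the table, column, and file names are invented):

create table bigfix (
  id char(10) not null,
  item char(30) not null,
  qty char(8) not null)
engine=CONNECT table_type=FIX file_name='bigfix.txt' huge=1;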

Compressed File Tables

CONNECT can create and process tables whose data file is compressed. The only supported compression format is the gzip format. Zip and zlib formats are not supported. The table types that can be compressed are DOS, FIX, BIN, CSV and FMT. This can save some disk space at the cost of a somewhat longer processing time.

Some restrictions apply to compressed tables:

  • Compressed tables are not indexable.
  • Update and partial delete are not supported.

Use the Boolean COMPRESS option to specify a compressed table.
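
For instance, a sketch of a compressed DOS table (the table, column, and file names are invented; the data file is expected to be in gzip format):

create table zlogs (
  logdate char(10) not null,
  message char(100) not null flag=11)
engine=CONNECT table_type=DOS file_name='logs.gz' compress=1;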

DOS and FIX Table Types

Tables of type DOS are based on text files in which each record represents a table row. Within a record, column fields are positioned at a fixed offset from the beginning of the record. Except sometimes for the last field, column fields also have a fixed length. If the last field has a varying length, the type of the table is DOS. For instance, given the file dept.dat formatted like:

0318 KINGSTON       70012 SALES       Bank/Insurance
0021 ARMONK         87777 CHQ         Corporate headquarter
0319 HARRISON       40567 SALES       Federal Administration
2452 POUGHKEEPSIE   31416 DEVELOPMENT Research & development

You can define a table based on it with:

create table department (
  number char(4) not null,
  location char(15) not null flag=5,
  director char(5) not null flag=20,
  function char(12) not null flag=26,
  name char(22) not null flag=38)
engine=CONNECT table_type=DOS file_name='dept.dat';

Here the flag column option represents the offset of the column inside the record. If the offset of a column is not specified, it defaults to the end of the preceding column (0 for the first column). The lrecl parameter, which represents the maximum size of a record, is by default calculated as the end of the rightmost column and can be left unspecified, except when some trailing information exists after the rightmost column.

Note: A special case is files with an encoding such as UTF-8 (for instance, specifying charset=UTF8) in which some characters may be represented by several bytes. Unlike the column size, which MariaDB interprets as a number of characters, the lrecl value is the record size in bytes, and the flag value is the offset of the field in the record in bytes. If the flag and/or lrecl values are not specified, they are calculated from the number of characters in the fields multiplied by the maximum size in bytes of a character in the corresponding character set. For UTF-8 this value is 3, which is usually far too much because very few characters need 3 bytes to be represented. When creating a new file, you are on the safe side if you simply double the maximum number of characters of a field to calculate the offset of the next field. Of course, for already existing files, the offsets must be specified according to what is actually in the file.
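
For example, here is a sketch of a new UTF-8 FIX table following the rule of thumb above: each char(10) field is given 20 bytes in the record, so the second field starts at byte offset 20 and the record length is 40 bytes plus the line ending (assumed here to be CRLF). The table and file names are invented.

create table names_utf8 (
  firstname char(10) not null,
  lastname char(10) not null flag=20)
engine=CONNECT table_type=FIX file_name='names.txt'
lrecl=42 charset=utf8;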

Although the field representation is always text in the table file, you can freely choose the corresponding column type: character, date, integer, or floating point, according to its contents.

Sometimes, as with the number column of the above department table, you can choose between a numeric and a character type. This changes how the column is handled internally (as characters, 0021 differs from 21; as a number it does not) as well as how it is displayed.

If the last field has a fixed length, the table should be declared with the type FIX. For instance, to create a table on the file boys.txt:

John      Boston      25/01/1986  02/06/2010
Henry     Boston      07/06/1987  01/04/2008
George    San Jose    10/08/1981  02/06/2010
Sam       Chicago     22/11/1979  10/10/2007
James     Dallas      13/05/1992  14/12/2009
Bill      Boston      11/09/1986  10/02/2008

You can for instance use the command:

create table boys (
  name char(12) not null,
  city char(12) not null,
  birth date not null date_format='DD/MM/YYYY',
  hired date not null date_format='DD/MM/YYYY' flag=36)
engine=CONNECT table_type=FIX file_name='boys.txt' lrecl=48;

Here some flag options were not specified because the fields have no intermediate space between them, except for the last column. The offsets are calculated by default by adding the field length to the offset of the preceding field. However, for formatted date columns, the width of the field in the file depends on the date format and cannot be calculated by default. For fixed files, the lrecl option is the physical length of the record, including the line ending character(s). It is calculated by adding to the end of the last field 2 bytes under Windows (CRLF) or 1 byte under UNIX. If the file is imported from another operating system, the ENDING option will have to be specified with the proper value.

For this table, the last offset and the record length must be specified anyway, because the date columns have a field width that comes from their format, which is not known to CONNECT. Do not forget to add the line ending length to the total length of the fields.

This table is displayed as:

name    city      birth       hired
John    Boston    1986-01-25  2010-06-02
Henry   Boston    1987-06-07  2008-04-01
George  San Jose  1981-08-10  2010-06-02
Sam     Chicago   1979-11-22  2007-10-10
James   Dallas    1992-05-13  2009-12-14
Bill    Boston    1986-09-11  2008-02-10

Whenever possible, the fixed format should be preferred to the varying one, because fixed tables are much faster to process than variable ones. Indeed, instead of being read or written record by record, FIX tables are processed in blocks of BLOCK_SIZE records, resulting in far fewer input/output operations. The block size defaults to 100 if not specified in the CREATE TABLE statement.
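
For instance, assuming the option can be changed afterwards like the other table options shown in this article, the block size of the boys table could be raised with:

alter table boys block_size=1000;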

Note 1: It is not mandatory to declare in the table all the fields existing in the source file. However, if some fields are ignored, the flag option of the following field and/or the lrecl option will have to be specified.

Note 2: Some files have an EOF marker (CTRL+Z, hex 1A) that can prevent the table from being recognized as fixed, because the file length is not a multiple of the fixed record size. To indicate this, use the EOF create option in the option list. For instance, if after creating the FIX table xtab on the file foo.dat, which you know has a fixed record size, you get, when you try to use it, a message such as:

File foo.dat is not fixed length, len=302587 lrecl=141

After checking that the default or specified LRECL value is correct, you can tell CONNECT to ignore the extra EOF character with:

alter table xtab option_list='eof=1';

Of course, you can specify this option directly in the Create statement. All this applies to some other table types, in particular to BIN tables.

Note 3: The width of the fields is the length specified in the column declaration. For instance for a column declared as:

number int(3) not null,

The field width in the file is 3 characters. This is the value used to calculate the offset of the next field if it is not specified. If this length is not specified, it defaults to the MySQL default type length.

Specifying the Field Format

Some files have a specific format for their numeric fields. For instance, the decimal point is absent and/or the fields are filled with leading zeros. To deal with such files, both for reading and writing, the format can be specified in the CREATE TABLE column definition. The syntax of the field format specification is:

Field_format='[Z][N][d]'

The optional parts of the format are:

Z  The field has leading zeros.
N  No decimal point exists in the file.
d  The number of decimals, defaulting to the column precision.

Let us see how it works in the following example. We define a table based on the file xfmt.txt having eight fields of 12 characters:

create table xfmt (
  col1 double(12,3) not null,
  col2 double(12,3) not null field_format='4',
  col3 double(12,2) not null field_format='N3',
  col4 double(12,3) not null field_format='Z',
  col5 double(12,3) not null field_format='Z3',
  col6 double(12,5) not null field_format='ZN5',
  col7 int(12) not null field_format='N3',
  col8 smallint(12) not null field_format='N3')
engine=CONNECT table_type=FIX file_name='xfmt.txt';

insert into xfmt values(4567.056,4567.056,4567.056,4567.056,-23456.8,
    3.14159,4567,4567);
select * from xfmt;

The first row is displayed as:

COL1      COL2      COL3     COL4      COL5        COL6     COL7  COL8
4567.056  4567.056  4567.06  4567.056  -23456.800  3.14159  4567  4567

The number of decimals displayed for all float columns is the column precision, the second argument of the column type option. Of course, integer columns have no decimals, although their formats specify some.

More interesting is the file layout. To see it let us define another table based on the same file but whose columns are all characters:

create table cfmt (
  col1 char(12) not null,
  col2 char(12) not null,
  col3 char(12) not null,
  col4 char(12) not null,
  col5 char(12) not null,
  col6 char(12) not null,
  col7 char(12) not null,
  col8 char(12) not null)
engine=CONNECT table_type=FIX file_name='xfmt.txt';
select * from cfmt;

The (transposed) display of the select command shows the file text layout for each field. A third column has been added below to comment on this result.

Column  Row 1         Comment (all numeric fields are written right justified)
COL1    4567.056      No format, the value was entered as is.
COL2    4567.0560     The format '4' forces to write 4 decimals.
COL3    4567060       N3: No decimal point. The last 3 digits are decimals. However, the second decimal was rounded because of the column precision.
COL4    00004567.056  Z: Leading zeros, 3 decimals (the column precision).
COL5    -0023456.800  Z3: (Minus sign) leading zeros, 3 decimals.
COL6    000000314159  ZN5: Leading zeros, no decimal point, 5 decimals.
COL7    4567000       N3: No decimal point. The last 3 digits are decimals.
COL8    4567000       Same. Any decimals would be ignored.

Note: For columns internally using double precision floating-point numbers, MariaDB limits the decimal precision of any calculation to the column precision. The declared column precision should be at least the number of decimals of the format to avoid a loss of decimals as it happened for col3 of the above example.

DBF Type

A table of type DBF is physically a dBASE III or IV formatted file (used by many products like dBASE, Xbase, FoxPro etc.). This format is similar to the FIX type format with in addition a prefix giving the characteristics of the file, describing in particular all the fields (columns) of the table.

Because DBF files have a header containing metadata about the file, in particular the column descriptions, it is possible to create a table based on an existing DBF file without giving the column description, for instance:

create table cust engine=CONNECT table_type=DBF file_name='cust.dbf';

To see what CONNECT has done, you can use the DESCRIBE or SHOW CREATE TABLE commands, and possibly modify some options with the ALTER TABLE command.

The case of deleted lines is handled in a specific way for DBF tables. Deleted lines are not removed from the file but are "soft deleted", meaning they are only marked as deleted. In particular, the line count stored in the file header is not adjusted when lines are soft deleted. This is why, if you execute these two commands on a DBF table named tabdbf:

select count(*) from tabdbf;
select count(*) from tabdbf where 1;

They can give a different result: the (fast) first one gives the number of physical lines in the file, while the second one gives the number of lines that are not (soft) deleted.

The commands UPDATE, INSERT, and DELETE can be used with DBF tables. The DELETE command marks the deleted lines as suppressed but keeps them in the file. The INSERT command, if it is used to populate a newly created table, constructs the file header before inserting new lines.

Note: For DBF tables, column name length is limited to 11 characters and field length to 256 bytes.

Reading Soft Deleted Lines of a DBF table

It is possible to read these lines by changing the read mode of the table. This is specified by an option READMODE that can take the values:

0  Standard mode. This is the default option.
1  Read all lines, including the soft deleted ones.
2  Read only the soft deleted lines.

For example, to read all lines of the tabdbf table, you can do:

alter table tabdbf option_list='Readmode=1';

To come back to normal mode, specify READMODE=0.

BIN Table Type

A table of type BIN is physically a binary file in which each row is a logical record of fixed length[1]. Within a record, column fields are at a fixed offset and of fixed length, as for FIX tables. What is specific to BIN tables is that numerical values are internally encoded using the native platform representation, so no conversion is needed to handle numerical values in expressions.

It is not required that the lines of a BIN file be separated by characters such as CR and/or LF but this is possible. In such event, the lrecl option must be specified accordingly.

Note: Unlike for the DOS and FIX types, the width of the fields is the length of their internal representation in the file. For instance for a column declared as:

number int(5) not null,

The field width in the file is 4 bytes, the size of a binary integer. This is the value used to calculate the offset of the next field if it is not specified. Therefore, if the next field is actually placed 5 bytes after this one, this declaration is not enough and the flag option will have to be used on the next field.

Type Conversion in BIN Tables

Here are the correspondences between the column type and field format provided by default:

Column type  File default format
Char(n)      Text of n characters.
Date         Integer (4 bytes)
Int(n)       Integer (4 bytes)
Smallint(n)  Short integer (2 bytes)
Bigint(n)    Large integer (8 bytes)
Double(n,d)  Double floating point (8 bytes)

However, the column type does not necessarily have to match the field format within the table file. In particular, this occurs for field formats corresponding to numeric types that are not handled by CONNECT. Indeed, BIN table files may internally contain tiny integers, short integers, or float numbers. Also, as in DOS or FIX tables, you may want to handle some character fields as numeric, or vice versa.

This is why it is possible to specify the field format when it does not correspond to the column type default using the field_format column option in the CREATE TABLE statement. Here are the available field formats for BIN tables:

Field_format  Internal representation
C             Character string (n bytes)
L or I        (Long) integer (4 bytes)
D             Double float (8 bytes)
S             Short integer (2 bytes)
B             Large (big) integer (8 bytes)
T             Tiny integer (1 byte)
F or R        Real or float (floating point number on 4 bytes)
X             Use the default field format for the column type

All field formats are one-character specifications[2]. 'X' is equivalent to not specifying the field format. For the 'C' character specification, n is the column width as specified with the column type. The number of bytes of the numeric fields corresponds to what it is on most platforms, but it could vary on some of them.

Here is an example of a BIN table, the file record layout is supposed to be:

NNNNCCCCCCCCCCIIIISSFFFFSS

Where N represents numeric characters, C any characters, I integer bytes, S short integer bytes, and F float number bytes. The IIII field contains a date in numeric format.

The table could be created by:

create table testbal (
  fig int(4) not null field_format='C',
  name char(10) not null,
  birth date not null,
  id char(5) not null field_format='S',
  salary double(9,2) not null default 0.00 field_format='F',
  dept int(4) not null field_format='S')
engine=CONNECT table_type=BIN block_size=5 file_name='Testbal.dat';

Because the field offsets and the file record length are calculated from the column internal formats, possibly modified by the field format, it is not necessary to specify them for a packed binary file without line endings. If line endings are desired, specify the lrecl option, adding the width of the line ending. The table can be filled by:

insert into testbal values
  (5500,'ARCHIBALD','1980-01-25','3789',4380.50,318),
  (123,'OLIVER','1953-08-10','23456',3400.68,2158),
  (3123,'FOO','2002-07-23','888',default,318);

Note that the types of the inserted values must match the column type, not the field format type.

The query:

select * from testbal;

Returns:

fig   name       birth       id     salary   dept
5500  ARCHIBALD  1980-01-25  3789   4380.50  318
123   OLIVER     1953-08-10  23456  3400.68  2158
3123  FOO        2002-07-23  888    0.00     318

Numeric fields alignment

In binary files, numeric fields and the record length are by default aligned on 8-byte boundaries to optimize performance with some processors. This can be modified in the OPTION_LIST with an "align" option ("packed" being equivalent to align=1).
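
For instance, a sketch of a packed BIN table with no alignment padding (the table, column, and file names are invented):

create table balbin (
  id int(8) not null field_format='S',
  amount double(12,2) not null field_format='D')
engine=CONNECT table_type=BIN file_name='balbin.dat'
option_list='align=1';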

VEC Table Type (Vector)

Tables of type VEC are binary files that, in some cases, can provide good performance on read-intensive query workloads. CONNECT organizes their data on disk as columns of values from the same attribute, as opposed to storing it as rows of tabular records. This organization means that when a query needs to access only a few columns of a particular table, only those columns need to be read from disk. Conversely, in a row-oriented table, all the values of each row are typically read from disk, wasting I/O bandwidth.

CONNECT provides two integral VEC formats, in which all the data of each column is adjacent. The block semi-vector format, in which records are grouped into blocks and column data is adjacent only within each block, is not supported by this version.

Integral Vector Format

In this true vertical format, the VEC files are made of all the data of the first column, followed by all the data of the second column, and so on. All of this can be in one physical file, or each column's data can be in a separate file. In the first case, the option MAX_ROWS=m, where m is an estimate of the maximum size (number of rows) of the table, must be specified so that new records can be inserted. This leaves an empty space after each column area in which new data can be inserted. In the second case, the "Split" option must be specified at table creation, and each column is stored in a file whose name is the table file name followed by the rank of the column. Inserting new lines can freely augment such a table.
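
Here is a sketch of both variants (the table, column, and file names are invented; Split is assumed here to be usable as a create option, as described above):

create table vbal (
  id int(8) not null,
  amount double(12,2) not null)
engine=CONNECT table_type=VEC file_name='vbal.vec' max_rows=100000;

create table vbal2 (
  id int(8) not null,
  amount double(12,2) not null)
engine=CONNECT table_type=VEC file_name='vbal2.vec' split=1;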

Semi-Vector Format (not supported in this version)

In this format the VCT files are organized in blocks, each containing the data of n rows, n being the Elements value. Within each block, data is organized column-wise: a vector of n values for the first column, followed by a vector of n values for the second column, and so on. When inserting new lines, new blocks can be created.

Differences between vector formats

These formats correspond to different needs. The integral vector format provides the best performance gain. It will be chosen when the speed of decisional queries must be optimized.

In the case of a single file, inserting new data is limited, but there is only one file to open and close. However, the size of the table cannot be calculated from the file size because of the possible unused space in the file. It must be kept in a header containing the maximum number of rows and the current number of valid rows in the table. To achieve this, specify the option Header=n when creating the table. If n=1 the header is placed at the beginning of the file, if n=2 it is a separate file with the type '.blk', and if n=3 the header is placed at the end of the file. This last value is provided because batch inserting is sometimes slower when the header is at the beginning of the file. If not specified, the header option defaults to 2 for this table type.

On the other hand, the "Split" format with separate files has none of these issues, and is a much safer solution when the table must be frequently inserted into or shared among several users.

The semi-vector format (VCT) can be used when the table must be in only one file but is built progressively by successive inserts without size limitation, at the cost of somewhat reduced performance. This format is mainly kept to be compatible with table files made with older versions of PlugDB.

In the integral vertical formats, the option BLOCK_SIZE=n is used for block reading and writing; however, to have a file made of blocks of equal size, the internal value of the MAX_ROWS=m option is increased if necessary to make it a multiple of n.

As for BIN tables, numeric values are stored using the platform's internal layout, the correspondence between column types and internal formats being the same as the defaults given above for BIN. However, field formats are not available for VEC tables.

Header Option

This applies to non-split VEC tables. Because the file size depends on the MAX_ROWS value, CONNECT cannot know how many valid records exist in the file. Depending on the value of the HEADER option, this information is stored in a header that can be placed at the beginning of the file, at the end of the file, or in a separate file called fn.blk. The valid values for the HEADER option are:

0  Defaults to 2 for standard tables and to 3 for inward tables.
1  The header is at the beginning of the file.
2  The header is in a separate file.
3  The header is at the end of the file.

The value 2 can be used when dealing with files created by another application with no header. The value 3 sometimes makes inserting into the file faster than when the header is at the beginning of the file.

CSV and FMT Table Types

Many source data files are formatted with variable length fields and records. The simplest such format, known as CSV (Comma Separated Values), has column fields separated by a separator character. By default, the separator is a comma, but it can be specified with the SEP_CHAR option as any character, for instance a semicolon.

If the CSV file first record is the list of column names, specifying the HEADER=1 option will skip the first record on reading. On writing, if the file is empty, the column names record is automatically written.

For instance, given the following people.csv file:

Name;birth;children
"Archibald";17/05/01;3
"Nabucho";12/08/03;2

You can create the corresponding table by:

create table people (
  name char(12) not null,
  birth date not null date_format='DD/MM/YY',
  children smallint(2) not null)
engine=CONNECT table_type=CSV file_name='people.csv'
header=1 sep_char=';' quoted=1;

For CSV tables, the flag column option is the rank of the column in the file, starting from 1 for the leftmost column. This makes it possible to display the columns in a different order than in the file and/or to define the table using only some of the columns of the CSV file. For instance:

create table people (
  name char(12) not null,
  children smallint(2) not null flag=3,
  birth date not null flag=2 date_format='DD/MM/YY')
engine=CONNECT table_type=CSV file_name='people.csv'
header=1 sep_char=';' quoted=1;

In this case the command:

select * from people;

will display the table as:

name       children  birth
Archibald  3         2001-05-17
Nabucho    2         2003-08-12

Many applications produce CSV files in which some fields are quoted, in particular because the field text contains the separator character. For such files, specify the QUOTED=n option to indicate the level of quoting and/or QCHAR=c to specify the quoting character, which is " by default. Quoting with single quotes must be specified as QCHAR=''''. On writing, fields will be quoted depending on the value of the quoting level, which is -1 by default, meaning no quoting:

0  The fields between quotes are read and the quotes discarded. On writing, fields will be quoted only if they contain the separator character or begin with the quoting character. If they contain the quoting character, it will be doubled.
1  Only text fields will be written between quotes, except null fields. This also includes the column names of a possible header.
2  All fields will be written between quotes, except null fields.
3  All fields will be written between quotes, including null fields.

Files written this way are successfully read by most applications including spreadsheets.

Note 1: If only the QCHAR option is specified, the QUOTED option will default to 1.

Note 2: For CSV tables whose separator is the tab character, specify sep_char='\t'.

Note 3: When creating a table on an existing CSV file, you can let CONNECT analyze the file and build the column descriptions itself. However, this is not an elaborate analysis of the file; for instance, DATE fields will not be recognized as such, but will be regarded as string fields.
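
For instance, the table could also have been created on the people.csv file above without any column list, letting CONNECT build the column descriptions (a sketch; the resulting types should then be checked with SHOW CREATE TABLE and adjusted with ALTER TABLE if needed):

create table people2
engine=CONNECT table_type=CSV file_name='people.csv'
header=1 sep_char=';' quoted=1;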

FMT type

FMT tables handle files of various formats that are an extension of the concept of CSV files. CONNECT supports these files provided that all lines have the same format and that all fields present in all records are recognizable (optional fields must have recognizable delimiters). These files are produced by specific applications, and CONNECT handles them in read-only mode.

FMT tables must be created like CSV tables, specifying their type as FMT. In addition, each column description must be given its format specification.

Column Format Specification of FMT tables

The input format for each column is specified as a FIELD_FORMAT option. A simple example is:

IP Char(15) not null field_format=' %n%s%n',

In the above example, the format for this (1st) field is ' %n%s%n'. Note that the blank character at the beginning of this format is significant. No trailing blank should be specified in the column formats.

The syntax and meaning of the column input format is the one of the C scanf function.

However, CONNECT uses the input format in a specific way. Instead of using it to store the input value directly in the column buffer, it uses it to delimit the substring of the input record that contains the corresponding column value. Retrieving this value is done later by the column functions, as for standard CSV files.

This is why all column formats are made of five components:

  1. An optional description of what is met and ignored before the column value.
  2. A marker of the beginning of the column value, written as %n.
  3. The format specification of the column value itself.
  4. A marker of the end of the column value, written as %n (or %m for optional fields).
  5. An optional description of what is met after the column value (not valid if %m was used).

For example, taking the file funny.txt:

12345,'BERTRAND',#200;5009.13
 56, 'POIROT-DELMOTTE' ,#4256 ;18009
345 ,'TRUCMUCHE' , #67; 19000.25

You can make a table fmtsample with 4 columns ID, NAME, DEPNO and SALARY, using the Create Table statement and column formats:

create table FMTSAMPLE (
  ID Integer(5) not null field_format=' %n%d%n',
  NAME Char(16) not null field_format=' , ''%n%[^'']%n''',
  DEPNO Integer(4) not null field_format=' , #%n%d%n',
  SALARY Double(12,2) not null field_format=' ; %n%f%n')
Engine=CONNECT table_type=FMT file_name='funny.txt';

Field 1 is an integer (%d) with possible leading blanks.

Field 2 is separated from field 1 by optional blanks, a comma, and other optional blanks and is between single quotes. The leading quote is included in component 1 of the column format, followed by the %n marker. The column value is specified as %[^'] meaning to keep any characters read until a quote is met. The ending marker (%n) is followed by the 5th component of the column format, the single quote that follows the column value.

Field 3, also separated by a comma, is a number preceded by a pound sign.

Field 4, separated by a semicolon possibly surrounded by blanks, is a number with an optional decimal point (%f).

This table will be displayed as:

ID     NAME             DEPNO  SALARY
12345  BERTRAND         200    5009.13
56     POIROT-DELMOTTE  4256   18009.00
345    TRUCMUCHE        67     19000.25

Optional Fields

To be recognized, a field normally must be at least one character long. For instance, a numeric field must have at least one digit and a character field cannot be empty. However, many existing files do not follow this rule.

Let us suppose for instance that the preceding example file could be:

12345,'BERTRAND',#200;5009.13
 56, 'POIROT-DELMOTTE' ,# ;18009
345 ,'' , #67; 19000.25

This would produce an error message such as "Bad format line x field y of FMTSAMPLE". To avoid this and accept these records, the corresponding fields must be specified as "optional". In the above example, fields 2 and 3 can have null values (in lines 3 and 2 respectively). To specify them as optional, their format must be terminated by %m (instead of the second %n). A statement such as this can do the table creation:

create table FMTAMPLE (
  ID Integer(5) not null field_format=' %n%d%n',
  NAME Char(16) not null field_format=' , ''%n%[^'']%m',
  DEPNO Integer(4) field_format=''' , #%n%d%m',
  SALARY Double(12,2) field_format=' ; %n%f%n')
Engine=CONNECT table_type=FMT file_name='funny.txt';

Note that, because the format must be terminated by %m with no additional characters, the skipping of the ending quote of field 2 was moved from the end of the second column format to the beginning of the third column format.

The table result is:

ID     NAME             DEPNO  SALARY
12345  BERTRAND         200    5,009.13
56     POIROT-DELMOTTE  NULL   18,009.00
345    NULL             67     19,000.25

Missing fields are replaced by null values if the column is nullable; if it is not, by blanks for character strings and by 0 for numeric fields.

Note 1: Because the formats are specified between quotes, quotes belonging to the formats must be doubled to avoid a CREATE TABLE statement syntax error.

Note 2: Characters separating columns can be included either in component 5 of the preceding column format or in component 1 of the succeeding column format, except for blanks, which should always be included in component 1 of the succeeding column format because trailing blanks on a line can sometimes be lost. This is also mandatory for optional fields.

Note 3: Because the format is mainly used to find the sub-string corresponding to a column value, the field specification does not necessarily match the column type. For instance supposing a table contains two integer columns, NBONE and NBTWO, the two lines describing these columns could be:

NBONE integer(5) not null field_format=' %n%d%n',
NBTWO integer(5) field_format=' %n%s%n',

The first one specifies a required integer field (%d); the second line describes a field that can be an integer, but can also be replaced by a "-" (or any other) character. Specifying the format for this column as a character field (%s) enables it to be recognized without error in all cases. Later on, this field is converted to an integer by the column read function, and a null or 0 value is generated for fields whose text is not numeric.

Bad record error processing

When no match is found for a column field, the process aborts with a message such as:

Bad format line 3 field 4 of funny.txt

This can mean either that a line of the input file is ill-formed or that the column format for this field has been wrongly specified. When you know that your file contains records that are ill-formatted and should be eliminated from normal processing, set the "maxerr" option of the CREATE TABLE statement, for instance:

Option_list='maxerr=100'

This indicates that no error message should be raised for the first 100 wrong lines. You can set maxerr to a number greater than the number of wrong lines in your files to ignore them all and get no errors.

Additionally, the "accept" option permits keeping those ill-formatted lines, with the bad field and all succeeding fields of the record nullified. If "accept" is specified without "maxerr", all ill-formatted lines are accepted.
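
For instance, a sketch combining both options (the values are arbitrary, and accept is assumed to take a Boolean value like the other Boolean options):

alter table FMTSAMPLE option_list='maxerr=100,accept=1';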

Note: This error processing also applies to CSV tables.

Fields containing a formatted Date

A special case is that of columns containing a formatted date. In this case, the following must be specified:

  1. The field recognition format, used to delimit the date in the input record.
  2. The date format, used to interpret the date.
  3. The field length option, if the date representation has a different width than the standard type size.

For example, let us suppose we have a web log source file containing records such as:

165.91.215.31 - - [17/Jul/2001:00:01:13 -0400] - "GET /usnews/home.htm HTTP/1.1" 302

The create table statement shall be like this:

create table WEBSAMP (
  IP char(15) not null field_format='%n%s%n',
  DATE datetime not null field_format=' - - [%n%s%n -0400]'
  date_format='DD/MMM/YYYY:hh:mm:ss' field_length=20,
  FILE char(128) not null field_format=' - "GET %n%s%n',
  HTTP double(4,2) not null field_format=' HTTP/%n%f%n"',
  NBONE int(5) not null field_format=' %n%d%n')
Engine=CONNECT table_type=FMT lrecl=400
file_name='e:\\data\\token\\Websamp.dat';

Note 1: Here, field_length=20 was necessary because the default size for datetime columns is only 19. The lrecl=400 was also specified because the actual file contains more information in each record, making the record size calculated by default too small.

Note 2: The file name could have been specified as 'e:/data/token/Websamp.dat'.

Note 3: FMT tables are currently read only.

XML Table Type

CONNECT supports tables represented by XML files. For these tables, the standard input/output functions of the operating system are not used; instead, the parsing and processing of the file is delegated to a specialized library. Currently two such libraries are supported: libxml2, a part of the GNOME framework (which, however, does not require GNOME), and, on Windows, MS-DOM (DOMDOC), the Microsoft standard support for XML documents.

DOMDOC is the default for the Windows version of CONNECT, and libxml2 is always used on other systems. On Windows, the choice can be made with the XMLSUP option in the CREATE TABLE option list, for instance by specifying option_list='xmlsup=libxml2'.

Creating XML tables

First of all, it must be understood that XML is a very general language used to encode data having any structure. In particular, the tag hierarchy in an XML file describes a tree structure of the data. For instance, consider the file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<BIBLIO SUBJECT="XML">
   <BOOK ISBN="9782212090819" LANG="fr" SUBJECT="applications">
      <AUTHOR>
         <FIRSTNAME>Jean-Christophe</FIRSTNAME>
         <LASTNAME>Bernadac</LASTNAME>
      </AUTHOR>
      <AUTHOR>
         <FIRSTNAME>François</FIRSTNAME>
         <LASTNAME>Knab</LASTNAME>
      </AUTHOR>
      <TITLE>Construire une application XML</TITLE>
      <PUBLISHER>
         <NAME>Eyrolles</NAME>
         <PLACE>Paris</PLACE>
      </PUBLISHER>
      <DATEPUB>1999</DATEPUB>
   </BOOK>
   <BOOK ISBN="9782840825685" LANG="fr" SUBJECT="applications">
      <AUTHOR>
         <FIRSTNAME>William J.</FIRSTNAME>
         <LASTNAME>Pardi</LASTNAME>
      </AUTHOR>
      <TRANSLATOR PREFIX="adapté de l'anglais par">
         <FIRSTNAME>James</FIRSTNAME>
         <LASTNAME>Guerin</LASTNAME>
      </TRANSLATOR>
      <TITLE>XML en Action</TITLE>
      <PUBLISHER>
         <NAME>Microsoft Press</NAME>
         <PLACE>Paris</PLACE>
      </PUBLISHER>
      <DATEPUB>1999</DATEPUB>
   </BOOK>
</BIBLIO>

It represents data having the structure:

                               <BIBLIO>
                        __________|_________
                       |                    |
            <BOOK:ISBN,LANG,SUBJECT>        |
         ______________|_______________     |
        |        |         |           |    |
     <AUTHOR> <TITLE> <PUBLISHER> <DATEPUB> |
    ____|____            ___|____           |
   |    |    |          |        |          |
<FIRST> | <LAST>     <NAME>   <PLACE>       |
        |                                   |
     <AUTHOR>                   <BOOK:ISBN,LANG,SUBJECT>
    ____|____         ______________________|__________________
   |         |       |            |         |        |         |
<FIRST>   <LAST>  <AUTHOR> <TRANSLATOR> <TITLE> <PUBLISHER> <DATEPUB>
                _____|_        ___|___            ___|____
               |       |      |       |          |        |
            <FIRST> <LAST> <FIRST> <LAST>     <NAME>   <PLACE>

This structure seems at first view far from being tabular. However, modern database management systems, including MySQL, implement something close to the relational model and work on tables that are structurally not hierarchical but tabular with rows and columns.

Nevertheless, CONNECT can do it. Of course, it cannot guess what you want to extract from the XML structure, but gives you the possibility to specify it when you create the table[3].

Let us take a first example. Suppose you want to make a table from the above document, displaying the node contents.

For this, you can define a table xsamptag as:

create table xsamptag (
  AUTHOR char(50),
  TITLE char(32),
  TRANSLATOR char(40),
  PUBLISHER char(40),
  DATEPUB int(4))
engine=CONNECT table_type=XML file_name='Xsample.xml';

It will be displayed as:

AUTHOR                    TITLE                           TRANSLATOR    PUBLISHER              DATEPUB
Jean-Christophe Bernadac  Construire une application XML                Eyrolles Paris         1999
William J. Pardi          XML en Action                   James Guerin  Microsoft Press Paris  1999

Let us try to understand what happened. By default, the column names correspond to tag names. Because this file is rather simple, CONNECT was able to default the top tag of the table to the root node <BIBLIO> of the file, and the row tags to the <BOOK> children of the table tag. In a more complex file, this would have to be specified, as we will see later. Note that we did not have to worry about sub-tags such as <FIRSTNAME> or <LASTNAME> because CONNECT automatically retrieves the entire text contained in a tag and its sub-tags[4].

Only the first author of the first book appears. This is because only the first occurrence of a column tag is retrieved, so that the result has a proper tabular structure. We will see later what we can do about that.

How can we retrieve the values specified by attributes? By using the Coltype table option to specify the default column type. The value '@' means that column names match attribute names. Therefore, we can retrieve them by creating a table such as:

create table xsampattr (
  ISBN char(15),
  LANG char(2),
  SUBJECT char(32))
engine=CONNECT table_type=XML file_name='Xsample.xml'
option_list='Coltype=@';

This table gives the following display:

ISBN           LANG  SUBJECT
9782212090819  fr    applications
9782840825685  fr    applications

Now to define a table that will give us all the previous information, we must specify the column type for each column. Because in the next statement the column type defaults to Node, the field_format column parameter was used to indicate which columns are attributes:

create table xsamp (
  ISBN char(15) field_format='@',
  LANG char(2) field_format='@',
  SUBJECT char(32) field_format='@',
  AUTHOR char(50),
  TITLE char(32),
  TRANSLATOR char(40),
  PUBLISHER char(40),
  DATEPUB int(4))
engine=CONNECT table_type=XML file_name='Xsample.xml'
tabname='BIBLIO' option_list='rownode=BOOK';

Once done, we can enter the query:

select subject, lang, title, author from xsamp;

This will return the result:

SUBJECT       LANG  TITLE                           AUTHOR
applications  fr    Construire une application XML  Jean-Christophe Bernadac
applications  fr    XML en Action                   William J. Pardi

Note that we have been lucky here: unlike SQL, XML is case sensitive, and the column names matched the node names only because the column names were given in upper case. Note also that the order of the columns in the table could have been different from the order in which the nodes appear in the XML file.

Using Xpath with XML tables

The field_format options we used above can be specified to locate more precisely where the information to retrieve is, and what it is, using an Xpath-like syntax. For instance:

create table xsampall (
  isbn char(15) field_format='@ISBN',
  language char(2) field_format='@LANG',
  subject char(32) field_format='@SUBJECT',
  authorfn char(20) field_format='AUTHOR/FIRSTNAME',
  authorln char(20) field_format='AUTHOR/LASTNAME',
  title char(32) field_format='TITLE',
  translated char(32) field_format='TRANSLATOR/@PREFIX',
  tranfn char(20) field_format='TRANSLATOR/FIRSTNAME',
  tranln char(20) field_format='TRANSLATOR/LASTNAME',
  publisher char(20) field_format='PUBLISHER/NAME',
  location char(20) field_format='PUBLISHER/PLACE',
  year int(4) field_format='DATEPUB')
engine=CONNECT table_type=XML file_name='Xsample.xml'
tabname='BIBLIO' option_list='rownode=BOOK,skipnull=1';

This very flexible column parameter serves several purposes:

  • To specify the tag name, or the attribute name if different from the column name.
  • To specify the type (tag or attribute) by a prefix of '@' for attributes.
  • To specify the path for sub-tags using the '/' character.

This path is always relative to the current context (the column top node) and cannot be specified as an absolute path from the document root, therefore a leading '/' cannot be used. The path cannot be variable in node names or depth, therefore using '//' is not allowed.

The query:

select isbn, title, translated, tranfn, tranln, location from
    xsampall where translated <> '';

Replies:

ISBN           TITLE          TRANSLATED               TRANFN  TRANLN  LOCATION
9782840825685  XML en Action  adapté de l'anglais par  James   Guerin  Paris

Direct access on XML tables

Direct access is available on XML tables. This means that XML tables can be sorted and used in joins, even in the one-side of the join.

However, building a permanent index is not implemented yet. It is not certain that this would be useful. Indeed, the Windows DOM implementation, when used to access these tables, first parses the whole file and constructs a node tree in memory. This is often the longest part of the process, so the use of an index or block optimization would not be of great value. Note also that this restricts XML files to a reasonable size. Anyway, when speed is important, this table type is not the best to use. Therefore, in these cases, it is probably better to convert the file to another type by inserting the XML table into another table of a type more appropriate for performance.
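
For example, here is a sketch of such a conversion of the xsamp table into a CSV table (the target table and file names are invented):

create table xsampcsv
engine=CONNECT table_type=CSV file_name='xsamp.csv' header=1
select * from xsamp;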

Write operations on XML tables

You can freely use the UPDATE, DELETE and INSERT commands with XML tables. However, you must understand that the format of the updated or inserted data follows the specifications of the table you created, not those of the original source file. For instance, let us suppose we insert a new book using the xsamp table (not the xsampall table) with the command:

insert into xsamp
  (isbn, lang, subject, author, title, publisher,datepub)
  values ('9782212090529','fr','général','Alain Michard',
         'XML, Langage et Applications','Eyrolles Paris',1998);

Then if we ask:

select subject, author, title, translator, publisher from xsamp;

Everything seems correct when we get the result:

SUBJECT       AUTHOR                    TITLE                           TRANSLATOR    PUBLISHER
applications  Jean-Christophe Bernadac  Construire une application XML                Eyrolles Paris
applications  William J. Pardi          XML en Action                   James Guerin  Microsoft Press Paris
général       Alain Michard             XML, Langage et Applications                  Eyrolles Paris

However if we enter the apparently equivalent query on the xsampall table, based on the same file:

select subject,
concat(authorfn, ' ', authorln) author , title,
concat(tranfn, ' ', tranln) translator,
concat(publisher, ' ', location) publisher from xsampall;

This will return an apparently wrong answer:

SUBJECT       AUTHOR                    TITLE                           TRANSLATOR    PUBLISHER
applications  Jean-Christophe Bernadac  Construire une application XML                Eyrolles Paris
applications  William J. Pardi          XML en Action                   James Guerin  Microsoft Press Paris
général                                 XML, Langage et Applications

What happened here? Simply, because we used the xsamp table to do the Insert, what has been inserted within the XML file had the structure described for xsamp:

   <BOOK ISBN="9782212090529" LANG="fr" SUBJECT="général">
      <AUTHOR>Alain Michard</AUTHOR>
      <TITLE>XML, Langage et Applications</TITLE>
      <TRANSLATOR></TRANSLATOR>
      <PUBLISHER>Eyrolles Paris</PUBLISHER>
      <DATEPUB>1998</DATEPUB>
   </BOOK>

CONNECT cannot "invent" sub-tags that are not part of the xsamp table. Because these sub-tags do not exist, the xsampall table cannot retrieve the information that should be attached to them. If we want to be able to query the XML file by all the defined tables, the correct way to insert a new book to the file is to use the xsampall table, the only one that addresses all the components of the original document:

delete from xsamp where isbn = '9782212090529';

insert into xsampall (isbn, language, subject, authorfn, authorln,
      title, publisher, location, year)
   values('9782212090529','fr','général','Alain','Michard',
      'XML, Langage et Applications','Eyrolles','Paris',1998);

Now the added book, in the XML file, will have the required structure:

   <BOOK ISBN="9782212090529" LANG="fr" SUBJECT="général"
      <AUTHOR>
         <FIRSTNAME>Alain</FIRSTNAME>
         <LASTNAME>Michard</LASTNAME>
      </AUTHOR>
      <TITLE>XML, Langage et Applications</TITLE>
      <PUBLISHER>
         <NAME>Eyrolles</NAME>
         <PLACE>Paris</PLACE>
      </PUBLISHER>
      <DATEPUB>1998</DATEPUB>
   </BOOK>

Note: We used a column list in the Insert statements, along with specifying ‘skipnull=1’ when creating the table, to avoid generating a <TRANSLATOR> node with sub-nodes, all containing null values (this works on Windows only).

Multiple Nodes in the XML Document

Let us come back to the above example XML file. We have seen that the author node can be "multiple" meaning that there can be more than one author of a book. What can we do to get the complete information fitting the relational model? CONNECT provides you with two possibilities, but restricted to only one such multiple node per table.

The first and most challenging one is to return as many rows as there are authors, the other columns being repeated as if we had made a join between the author column and the rest of the table. To achieve this, simply specify the "multiple" node name and the "expand" option when creating the table. For instance, we can create the xsamp2 table like this:

create table xsamp2 (
  ISBN char(15) field_format='@',
  LANG char(2) field_format='@',
  SUBJECT char(32) field_format='@',
  AUTHOR char(40),
  TITLE char(32),
  TRANSLATOR char(32),
  PUBLISHER char(32),
  DATEPUB int(4))
engine=CONNECT table_type=XML file_name='Xsample.xml'
tabname='BIBLIO'
option_list='rownode=BOOK,Expand=1,Mulnode=AUTHOR,Limit=2';

In this statement, the Limit option is a ceiling estimate of the average number of authors per book, which enables CONNECT to calculate the maximum size of the table. If it is not specified, your result is likely to be truncated by the internally recalculated table size value. Then you can enter a query such as:

select isbn, subject, author, title from xsamp2;

This will retrieve and display the following result:

ISBN           SUBJECT       AUTHOR                    TITLE
9782212090819  applications  Jean-Christophe Bernadac  Construire une application XML
9782212090819  applications  François Knab             Construire une application XML
9782840825685  applications  William J. Pardi          XML en Action
9782212090529  général       Alain Michard             XML, Langage et Applications

In this case, this is as if the table had four rows. However if we enter the query:

select isbn, subject, title, publisher from xsamp2;

This time the result will be:

ISBN           SUBJECT       TITLE                           PUBLISHER
9782212090819  applications  Construire une application XML  Eyrolles Paris
9782840825685  applications  XML en Action                   Microsoft Press Paris
9782212090529  général       XML, Langage et Applications    Eyrolles Paris

Because the author column does not appear in the query, the corresponding row was not expanded. This is somewhat strange because this would have been different if we had been working on a table of a different type. However, it is closer to the relational model for which there should not be two identical rows (tuples) in a table. Nevertheless, you should be aware of this somewhat erratic behavior. For instance:

select count(*) from xsamp2;                /* Replies 4 */
select count(author) from xsamp2;           /* Replies 4 */
select count(isbn) from xsamp2;             /* Replies 3 */
select isbn, subject, title, publisher from xsamp2 where author <> '';

This last query replies:

ISBN           SUBJECT       TITLE                           PUBLISHER
9782212090819  applications  Construire une application XML  Eyrolles Paris
9782212090819  applications  Construire une application XML  Eyrolles Paris
9782840825685  applications  XML en Action                   Microsoft Press Paris
9782212090529  général       XML, Langage et Applications    Eyrolles Paris

Even though the author column does not appear in the result, the corresponding rows were expanded because the multiple column was used in the WHERE clause.

Intermediate Multiple Node

The "multiple" node can be an intermediate node. If we want to do the same expanding with the xsampall table, there will be nothing more to do. The xsampall2 table can be created with:

create table xsampall2 (
  isbn char(15) field_format='@ISBN',
  language char(2) field_format='@LANG',
  subject char(32) field_format='@SUBJECT',
  authorfn char(20) field_format='AUTHOR/FIRSTNAME',
  authorln char(20) field_format='AUTHOR/LASTNAME',
  title char(32) field_format='TITLE',
  translated char(32) field_format='TRANSLATOR/@PREFIX',
  tranfn char(20) field_format='TRANSLATOR/FIRSTNAME',
  tranln char(20) field_format='TRANSLATOR/LASTNAME',
  publisher char(20) field_format='PUBLISHER/NAME',
  location char(20) field_format='PUBLISHER/PLACE',
  year int(4) field_format='DATEPUB')
engine=CONNECT table_type=XML file_name='Xsample.xml'
tabname='BIBLIO'
option_list='rownode=BOOK,Expand=1,Mulnode=AUTHOR,Limit=2';

The only difference is that the "multiple" node is an intermediate node in the path. The resulting table can be seen with a query such as:

select subject, language lang, title, authorfn first, authorln
    last, year from xsampall2;

This displays:

SUBJECT       LANG  TITLE                           FIRST            LAST      YEAR
applications  fr    Construire une application XML  Jean-Christophe  Bernadac  1999
applications  fr    Construire une application XML  François         Knab      1999
applications  fr    XML en Action                   William J.       Pardi     1999
général       fr    XML, Langage et Applications    Alain            Michard   1998

These composite tables, half array half tree, hold some surprises when updating, deleting from, or inserting into them. Insert just cannot generate this structure: if two rows are inserted differing only in the author, two book nodes will be generated in the XML file. Delete always deletes one book node and all its children nodes, even if it is specified against only one author. Update is more complicated:

update xsampall2 set authorfn = 'Simon' where authorln = 'Knab';
update xsampall2 set year = 2002 where authorln = 'Bernadac';
update xsampall2 set authorln = 'Mercier' where year = 2002;

After these three updates, the first one responding "Affected rows: 0" and the two others responding "Affected rows: 1", the last query answers:

SUBJECT       LANG  TITLE                           FIRST            LAST     YEAR
applications  fr    Construire une application XML  Jean-Christophe  Mercier  2002
applications  fr    Construire une application XML  François         Knab     2002
applications  fr    XML en Action                   William J.       Pardi    1999
général       fr    XML, Langage et Applications    Alain            Michard  1998

What must be understood here is that the Update modifies node values in the XML file, not cell values in the relational table. The first update did not work as expected: it was unable to retrieve the first name node of the second author and change it to the new value. The second update changed the year value of the book, and this shows in both expanded rows because there is only one DATEPUB node for that book. Because the third update applies to rows having a certain date value, the matching row was retrieved but not expanded, since no author data appeared in the WHERE clause; consequently only the first author name was updated.

Making a List of Multiple Values

Another way to see multiple values is to ask CONNECT to make a comma separated list of the multiple node values. This time, it can only be done if the "multiple" node is not intermediate. For example, we can modify the xsamp2 table definition by:

alter table xsamp2 option_list='rownode=BOOK,Mulnode=AUTHOR,Limit=3';

This time 'Expand' is not specified, and Limit gives the maximum number of items in the list. Now if we enter the query:

select isbn, subject, author "AUTHOR(S)", title from xsamp2;

We will get the resulting display:

ISBN           SUBJECT       AUTHOR(S)                                TITLE
9782212090819  applications  Jean-Christophe Bernadac, François Knab  Construire une application XML
9782840825685  applications  William J. Pardi                         XML en Action
9782212090529  général       Alain Michard                            XML, Langage et Applications

Note that updating the "multiple" column is not possible because CONNECT does not know which of the nodes to update.

This could not have been done with the xsampall2 table because the author node is intermediate in the path, and making two lists, one of first names and another one of last names would not make sense anyway.

What if a table contains several multiple nodes?

This can be handled by creating several tables on the same file, each containing only one multiple node and constructing the desired result using joins.
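For instance, assuming (hypothetically) that the BOOK nodes of Xsample.xml also contained a multiple TRANSLATOR node, a minimal sketch could define one expanded table per multiple node over the same file and join them on the ISBN attribute; the TRANSLATOR node, the table names, and the translator column are illustrative assumptions only:

# Sketch: one table per multiple node, both built on the same XML file.
create table xsamp_authors (
  isbn char(15) field_format='@ISBN',
  authorfn char(20) field_format='AUTHOR/FIRSTNAME',
  authorln char(20) field_format='AUTHOR/LASTNAME')
engine=CONNECT table_type=XML file_name='Xsample.xml'
tabname='BIBLIO' option_list='rownode=BOOK,Expand=1,Mulnode=AUTHOR,Limit=2';

create table xsamp_translators (
  isbn char(15) field_format='@ISBN',
  translator char(20) field_format='TRANSLATOR/LASTNAME')
engine=CONNECT table_type=XML file_name='Xsample.xml'
tabname='BIBLIO' option_list='rownode=BOOK,Expand=1,Mulnode=TRANSLATOR,Limit=2';

# The desired result is then constructed with a join on the common key.
select a.isbn, a.authorln, t.translator
from xsamp_authors a join xsamp_translators t on a.isbn = t.isbn;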

Support of HTML Tables

Most tables included in HTML documents cannot be processed by CONNECT because the HTML language is often not compatible with the syntax of XML. In particular, XML requires every opening tag to be matched by a closing tag, while this is sometimes optional in HTML. This is often the case for column tags.

However, you can meet tables that respect the XML syntax but have some of the features of HTML tables. For instance:

<?xml version="1.0"?>
<Beers>
  <table>
    <th><td>Name</td><td>Origin</td><td>Description</td></th>
    <tr>
      <td><brandName>Huntsman</brandName></td>
      <td><origin>Bath, UK</origin></td>
      <td><details>Wonderful hop, light alcohol</details></td>
    </tr>
    <tr>
      <td><brandName>Tuborg</brandName></td>
      <td><origin>Danmark</origin></td>
      <td><details>In small bottles</details></td>
    </tr>
  </table>
</Beers>

Here the different column tags are included in <td></td> tags as in HTML tables. You cannot just add this tag to the Xpath of the columns, because the search is done on the first occurrence of each tag, which would cause the search to fail for all columns except the first one. This case is handled by specifying the Colnode table option, which gives the name of these column tags, for example:

create table beers (
  `Name` char(16) field_format='brandName',
  `Origin` char(16) field_format='origin',
  `Description` char(32) field_format='details')
engine=CONNECT table_type=XML file_name='beers.xml'
tabname='table' option_list='rownode=tr,colnode=td';

The table will be displayed as:

Name      Origin    Description
Huntsman  Bath, UK  Wonderful hop, light alcohol
Tuborg    Danmark   In small bottles

However, you can deal with tables even closer to the HTML model. For example, the coffee.htm file:

<TABLE summary="This table charts the number of cups of coffee
                consumed by each senator, the type of coffee (decaf
                or regular), and whether taken with sugar.">
  <CAPTION>Cups of coffee consumed by each senator</CAPTION>
  <TR>
    <TH>Name</TH>
    <TH>Cups</TH>
    <TH>Type of Coffee</TH>
    <TH>Sugar?</TH>
  </TR>
  <TR>
    <TD>T. Sexton</TD>
    <TD>10</TD>
    <TD>Espresso</TD>
    <TD>No</TD>
  </TR>
  <TR>
    <TD>J. Dinnen</TD>
    <TD>5</TD>
    <TD>Decaf</TD>
    <TD>Yes</TD>
  </TR>
</TABLE>

Here the column values are directly represented by the TD tag text. You can declare them neither as tags nor as attributes. In addition, they are located not by their name but by their position within the row. Here is how to declare such a table to CONNECT:

create table coffee (
  `Name` char(16),
  `Cups` int(8),
  `Type` char(16),
  `Sugar` char(4))
engine=connect table_type=XML file_name='coffee.htm'
tabname='TABLE' header=1 option_list='Coltype=HTML';

You specify that columns are located by position by setting the Coltype option to 'HTML'. Each column position (0-based) is the value of the flag column parameter, which is set by default in sequence. Now we are able to display the table:

Name       Cups  Type      Sugar
T. Sexton  10    Espresso  No
J. Dinnen  5     Decaf     Yes

Note 1: We specified 'header=n' in the create statement to indicate that the first n rows of the table are not data rows and should be skipped.

Note 2: In this last example, we did not specify the node names using the Rownode and Colnode options because when Coltype is set to 'HTML' they default to 'Rownode=TR' and 'Colnode=TD'.

Note 3: The Coltype option is a word of which only the first character is significant. Recognized values are:

T(ag) or N(ode)           Column names match a tag name (the default).
A(ttribute) or @          Column names match an attribute name.
H(tml) or C(ol) or P(os)  Columns are retrieved by their position.
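Since each column position (0-based) is the value of the flag column parameter, that option can presumably also be set explicitly to map only some of the HTML columns. A minimal sketch under that assumption, reusing the coffee.htm file (the table name is illustrative):

# Hypothetical sketch: bind columns to the second and fourth TD of each TR
# by giving explicit 0-based positions with the flag column option.
create table coffee_short (
  `Cups` int(8) flag=1,
  `Sugar` char(4) flag=3)
engine=connect table_type=XML file_name='coffee.htm'
tabname='TABLE' header=1 option_list='Coltype=HTML';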

New file setting

Some create options are used only when creating a table on a new file, i.e. when inserting into a file that does not exist yet. When specified, the 'Header' option will create a header row with the names of the table columns. This is chiefly useful for HTML tables to be displayed in a web browser.

Some new list-options are used in this context:

Encoding   The encoding of the new document, defaulting to UTF-8.
Attribute  A list of 'attname=attvalue' separated by ';' to add to the table node.
HeadAttr   An attribute list to be added to the header row node.

Let us see, for instance, the following create statement:

create table handlers (
  handler char(64),
  version char(20),
  author char(64),
  description char(255),
  maturity char(12))
engine=CONNECT table_type=XML file_name='handlers.htm'
tabname='TABLE' header=yes
option_list='coltype=HTML,encoding=ISO-8859-1,
attribute=border=1;cellpadding=5,headattr=bgcolor=yellow';

Supposing the table file does not exist yet, the first insert into that table, for instance by the following statement:

insert into handlers select plugin_name, plugin_version,
  plugin_author, plugin_description, plugin_maturity from
  information_schema.plugins where plugin_type = 'DAEMON';

will generate the following file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Created by CONNECT Version 3.05.0005 August 17, 2012 -->
<TABLE border="1" cellpadding="5">
  <TR bgcolor="yellow">
    <TH>handler</TH>
    <TH>version</TH>
    <TH>author</TH>
    <TH>description</TH>
    <TH>maturity</TH>
  </TR>
  <TR>
    <TD>Maria</TD>
    <TD>1.5</TD>
    <TD>Monty Program Ab</TD>
    <TD>Compatibility aliases for the Aria engine</TD>
    <TD>Gamma</TD>
  </TR>
</TABLE>

This file can be used to display the table in a web browser (the encoding should be ISO-8859-x):

handler  version  author            description                                maturity
Maria    1.5      Monty Program Ab  Compatibility aliases for the Aria engine  Gamma

Note: The XML document encoding is generally specified in the XML header node and can be different from the DATA_CHARSET, which is always UTF-8 for XML tables. Therefore the table DATA_CHARSET character set should be left unspecified, or specified as UTF8. The Encoding specification is useful only for new XML files and is ignored for existing files whose encoding is already specified in the header node.

INI Table Type

The INI type is that of the "configuration" or "initialization" files often found on Windows machines. For instance, let us suppose you have a contact file contact.ini such as:

[BER]
name=Bertrand
forename=Olivier
address=21 rue Ferdinand Buisson
city=Issy-les-Mlx
zipcode=92130
tel=09.54.36.29.60
cell=06.70.06.04.16

[WEL]
name=Schmitt
forename=Bernard
hired=19/02/1985
address=64 tiergarten strasse
city=Berlin
zipcode=95013
tel=03.43.377.360

[UK1]
name=Smith
forename=Henry
hired=08/11/2003
address=143 Blum Rd.
city=London
zipcode=NW1 2BP

CONNECT lets you view it as a table in two different ways.

Column layout

The first way is to regard it as a table having one line per section, the columns being the keys you want to display. In this case, the create statement could be:

create table contact (
  contact char(16) flag=1,
  name char(20),
  forename char(32),
  hired date date_format='DD/MM/YYYY',
  address char(64),
  city char(20),
  zipcode char(8),
  tel char(16))
engine=CONNECT table_type=INI file_name='contact.ini';

The column that will contain the section name can have any name but must specify flag=1. All other columns must have the names of the keys we want to display (case-insensitive). The type can be character or numeric depending on the key value type, and the length is the maximum expected length of the key value. Once done, the statement:

select contact, name, hired, city, tel from contact;

will display the file in tabular format.

contact  name      hired       city          tel
BER      Bertrand  1970-01-01  Issy-les-Mlx  09.54.36.29.60
WEL      Schmitt   1985-02-19  Berlin        03.43.377.360
UK1      Smith     2003-11-08  London        NULL

Only the keys defined in the create statement are visible; keys that do not exist in a section are displayed as null or pseudo-null values (blank for character, 1/1/70 for dates, and 0 for numeric).
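Missing keys can therefore presumably be filtered on directly. A minimal sketch, which should list the UK1 contact since its section has no tel key (the empty-string test covers the pseudo-null case for character columns):

# Sketch: find sections where the tel key is absent.
select contact, name from contact where tel is null or tel = '';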

All relational operations can be applied to this table. The table (and the file) can be updated, inserted into, and conditionally deleted from. The only constraint is that, when inserting values, the section name must come first in the list of values.

Note: When inserting, if a section already exists, no new section is created; the new values are added to the existing section or replace its current values. Thus the following two commands are equivalent:

update contact set forename = 'Harry' where contact = 'UK1';
insert into contact (contact,forename) values('UK1','Harry');

Row layout

To be a good candidate for tabular representation, an INI file should generally have the same keys in all sections. In practice, many files commonly found on computers, such as the win.ini file of the Windows directory or the my.ini file, cannot be viewed that way because each section has different keys. In this case, a second way is to regard the file as a table having one row per section key, whose columns can be the section name, the key name, and the key value.

For instance, let us define the table:

create table xcont (
  section char(16) flag=1,
  keyname char(16) flag=2,
  value char(32))
engine=CONNECT table_type=INI file_name='contact.ini'
option_list='Layout=Row';

In this statement, the "Layout" option sets the display format: Column by default, or Row when the value does not begin with 'C'. The names of the three columns can be freely chosen. The Flag option gives the meaning of each column: specify flag=1 for the section name and flag=2 for the key name; otherwise, the column will contain the key value.

Once done, the command:

select * from xcont;

will display the following result:

section  keyname   value
BER      name      Bertrand
BER      forename  Olivier
BER      address   21 rue Ferdinand Buisson
BER      city      Issy-les-Mlx
BER      zipcode   92130
BER      tel       09.54.36.29.60
BER      cell      06.70.06.04.16
WEL      name      Schmitt
WEL      forename  Bernard
WEL      hired     19/02/1985
WEL      address   64 tiergarten strasse
WEL      city      Berlin
WEL      zipcode   95013
WEL      tel       03.43.377.360
UK1      name      Smith
UK1      forename  Henry
UK1      hired     08/11/2003
UK1      address   143 Blum Rd.
UK1      city      London
UK1      zipcode   NW1 2BP

Note: When processing an INI table, all section names are retrieved into a buffer of 2048 bytes. For a big file with many sections, this size can be increased using, for example:

option_list='seclen=4000';
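
For clarity, here is a sketch of how this option might be combined with the others in a full create statement, reusing the xcont row-layout definition above (the table name is illustrative):

# Sketch: row-layout INI table with an enlarged section-name buffer.
create table xcont2 (
  section char(16) flag=1,
  keyname char(16) flag=2,
  value char(32))
engine=CONNECT table_type=INI file_name='contact.ini'
option_list='Layout=Row,seclen=4000';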
  1. Sometimes it can be a physical record if LF or CRLF have been written in the file.
  2. It can be specified with more than one character, but only the first one is significant.
  3. CONNECT does not claim to be able to deal with any XML document. Besides, those that can usefully be processed for data analysis are likely to have a structure that can easily be transformed into a table.
  4. With libxml2, sub-tag text can be separated by zero or several blanks depending on the structure and indentation of the data file.
