Tutorial

This is an introduction to using Metadb. We assume familiarity with databases and the basics of SQL.

Getting started

Metadb provides an analytic database which can be used for SQL analytics and data science. The data contained in the analytic database originally come from another place: a data source such as a transaction processing database. Metadb updates the analytic database continuously based on changes in a data source.

Users connect to a Metadb server over the network using a database client such as DBeaver or psql.

Current tables

All tables generated by Metadb have at least these four metadata columns, with names that begin with two underscores:

  • __id is a surrogate key that identifies a row in the table.

  • __start is the date and time when the row of data was generated.

  • __source identifies the data source. The value of __source is the same for all rows that have been read from the same data source.

  • __origin is an optional identifier used to group related data from one or more data sources. For example, Metadb can combine data from similar source tables into a single table in the analytic database, and stores an identifier in __origin to record where the data came from.

Consider a table that stores categories of patrons in a library:

SELECT __id, __start, __source, __origin, id, groupname, description FROM patrongroup;
__id __start __source __origin id groupname description

8

2022-04-18 19:27:18-00

library

west

15

undergrad

Undergraduate Student

4

2022-04-17 17:42:25-00

library

east

10

graduate

Graduate Student

5

2022-04-17 17:52:53-00

library

east

9

faculty

Faculty Member

6

2022-04-17 17:52:53-00

library

east

12

staff

Staff Member

Note the distinction between the __id and id columns. The __id column is a key defined by Metadb to identify each row uniquely. The id column is defined by the data source; it may or may not be a key, and its name, content, or significance may change if schema changes are made in the data source.

In Metadb, a table like patrongroup is called a current table because it reflects the current state of each row in the data source, as of the last time the row was updated.

History tables

A current table is usually accompanied by a history table, which stores both the current state and all previous states provided by the data source. The name of a history table is the current table’s name with two underscores appended; for example, the history table for patrongroup would be patrongroup__.

History tables have the same metadata columns as current tables, plus two additional columns:

  • __end is the date and time when the row of data became no longer current, or 9999-12-31 00:00:00+00 if the row is still current.

  • __current is a Boolean value that indicates whether the row is current.

For instance:

SELECT __start, __end, __current, id, groupname, description FROM patrongroup__;
__start __end __current id groupname description

2022-04-17 21:42:25-00

2022-04-18 19:27:18-00

FALSE

15

undergrad

Student

2022-04-18 19:27:18-00

9999-12-31 00:00:00+00

TRUE

15

undergrad

Undergraduate Student

2022-04-17 21:42:25-00

9999-12-31 00:00:00+00

TRUE

10

graduate

Graduate Student

2022-04-17 21:52:53-00

9999-12-31 00:00:00+00

TRUE

9

faculty

Faculty Member

2022-04-17 21:52:53-00

9999-12-31 00:00:00+00

TRUE

12

staff

Staff Member

Here the "undergrad" group with id = 15 has two rows because it was modified on 2022-04-18 19:27:18-00, changing description from 'Student' to 'Undergraduate Student'.

A current table contains the same data as the corresponding history table where __current = TRUE. So for example, these two statements are equivalent:

SELECT id, groupname, description FROM patrongroup;
SELECT id, groupname, description FROM patrongroup__ WHERE __current = TRUE;

Transformed tables

Metadb applies transformations to some kinds of data, which results in additional tables being created. One category of data that are transformed in this way is JSON objects. Suppose that our patrongroup table had contained the groupname and description data in JSON fields, for example:

SELECT __id, __start, id, jsondata FROM patrongroup;
__id __start id jsondata

8

2022-04-18 19:27:18-00

15

{
    "groupname": "undergrad",
    "description": "Undergraduate Student"
}

4

2022-04-17 17:42:25-00

10

{
    "groupname": "graduate",
    "description": "Graduate Student"
}

5

2022-04-17 17:52:53-00

9

{
    "groupname": "faculty",
    "description": "Faculty Member"
}

6

2022-04-17 17:52:53-00

12

{
    "groupname": "staff",
    "description": "Staff Member"
}

Metadb would attempt to generate a table called patrongroup__t with JSON fields extracted into columns.

SELECT __id, __start, id, groupname, description FROM patrongroup__t;
__id __start id groupname description

2

2022-04-18 19:27:18-00

15

undergrad

Undergraduate Student

5

2022-04-17 17:42:25-00

10

graduate

Graduate Student

4

2022-04-17 17:52:53-00

9

faculty

Faculty Member

6

2022-04-17 17:52:53-00

12

staff

Staff Member

A corresponding history table called patrongroup__t__ is also created.

Workspaces

Metadb creates a schema name space for each user to serve as an individual workspace. The schema has the same name as the user name, and it can be used as an area within the database for creating tables. This can be useful for saving the results of queries or importing external data sets. For example:

CREATE TABLE tina.groups AS SELECT * FROM patrongroup;

If you are logged in as the user tina, then the tina schema will generally be the default schema when no schema is specified. In that case the following query, leaving out the schema name, would have the same effect as above:

CREATE TABLE groups AS SELECT * FROM patrongroup;

Stream processor

Data type conversion

When reading data from a data source, Metadb will automatically adjust column data types in a number of cases:

  • A data type in the source has changed in a way that cannot be applied safely by Metadb without more information.

  • A data type in the source has changed to a type that is not supported by Metadb.

  • A data type in the source can be converted safely to a type that is more performant.

This table summarizes the type conversions that are performed:

Data type conversions To numeric To uuid To jsonb To varchar

From boolean

From smallint

From integer

From bigint

From real

From double precision

From numeric

From date

From time

From time with timezone

From timestamp

From timestamp with timezone

From uuid

From json

From jsonb

From varchar

Command reference

Functions

System information

Name Return type Description

version()

varchar

Metadb version

Statements

CREATE SERVER

Define a new external data source

CREATE SERVER server_name DATA SOURCE data_source_type
    OPTIONS ( option 'value' [, ... ] )
Description

CREATE SERVER defines connection settings for an external data source.

Parameters

server_name

A unique name for the data source to be created.

data_source_type

The type of data source. The only type currently supported is kafka.

OPTIONS ( option 'value' [, …​ ] )

Connection settings and other configuration options for the data source.

Options for data source type "kafka"

brokers

Kafka bootstrap servers (comma-separated list).

security

Security protocol: 'ssl' or 'plaintext'. The default is 'ssl'.

topics

Regular expressions matching Kafka topics to read (comma-separated list).

group

Kafka consumer group ID.

schemapassfilter

Regular expressions matching schema names to accept (comma-separated list).

schemaprefix

Prefix to add to schema names.

enable

Boolean indicating if the data source is enabled. The default is TRUE.

Examples

Create a server example as a kafka data source:

CREATE SERVER example DATA SOURCE kafka OPTIONS (
    brokers 'kafka:29092',
    topics '^metadb_example[.].*',
    group 'metadb_example',
    schemapassfilter 'example_.+',
    schemaprefix 'example_',
    enable TRUE
);