User guide

This is an overview of using Metadb. We assume familiarity with databases and the basics of SQL.

Getting started

Metadb extends PostgreSQL with features to support analytics, such as streaming data sources, data model transforms, and historical data. The data contained in a Metadb database originally come from elsewhere: a data source, which could be, for example, a transaction-processing database or a sensor network. Metadb updates its database continuously based on state changes in external data sources.

Main tables

Tables generated by Metadb have at least these metadata columns, with names that begin with two underscores:

  • __id is a surrogate key that identifies a row in the table.

  • __start is the date and time when the row of data was generated.

  • __end is the date and time when the row of data became no longer current, or 9999-12-31 00:00:00+00 if the row is still current.

  • __current is a Boolean value that indicates whether the row is current.

  • __source identifies the data source. The value of __source is the same for all rows that have been read from the same data source.

  • __origin is an optional identifier used to group related data from one or more data sources. For example, Metadb can combine data from similar source tables into a single table in the analytic database, and stores an identifier in __origin to record where the data came from.

A main table in Metadb has two underscores at the end of its name (e.g., patrongroup__), and it contains both the current state and the history of all previous states provided by the data source. For instance, consider a main table that stores categories of patrons in a library:

SELECT __start, __end, __current, id, groupname, description FROM library.patrongroup__;
 __start                | __end                  | __current | id | groupname | description
------------------------+------------------------+-----------+----+-----------+-----------------------
 2022-04-17 21:42:25-00 | 2022-04-18 19:27:18-00 | FALSE     | 15 | undergrad | Student
 2022-04-18 19:27:18-00 | 9999-12-31 00:00:00+00 | TRUE      | 15 | undergrad | Undergraduate Student
 2022-04-17 21:42:25-00 | 9999-12-31 00:00:00+00 | TRUE      | 10 | graduate  | Graduate Student
 2022-04-17 21:52:53-00 | 9999-12-31 00:00:00+00 | TRUE      |  9 | faculty   | Faculty Member
 2022-04-17 21:52:53-00 | 9999-12-31 00:00:00+00 | TRUE      | 12 | staff     | Staff Member

In this example, the "undergrad" group with id = 15 has two rows because it was modified on 2022-04-18 19:27:18-00, changing description from 'Student' to 'Undergraduate Student'.
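
For example, to view the full change history of the "undergrad" group in this sample table, one might filter on id and sort by __start:

SELECT __start, __end, __current, groupname, description
    FROM library.patrongroup__
    WHERE id = 15
    ORDER BY __start;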

Current tables

It is often desirable to limit a query to retrieving only current records. This can be done by filtering on __current = TRUE, for example:

SELECT id, groupname, description FROM library.patrongroup__ WHERE __current;

For convenience, since this filter is so often used, Metadb provides access to a current table which contains only current records. Every main table has a corresponding current table, which shares the same name but without the two underscores at the end. For instance, the following query is equivalent to the example above:

SELECT id, groupname, description FROM library.patrongroup;

A current table reflects the current state of each row in the data source, as of the last time the row was updated.

Note

In a few database client applications, current tables do not appear by default in the list of tables. This is because main tables and current tables have a parent-child relationship, and database user interfaces may show the current table as subsumed under the main table.

To take another example:

SELECT __id, __start, __source, __origin, id, groupname, description FROM library.patrongroup;
 __id | __start                | __source | __origin | id | groupname | description
------+------------------------+----------+----------+----+-----------+-----------------------
    8 | 2022-04-18 19:27:18-00 | library  | west     | 15 | undergrad | Undergraduate Student
    4 | 2022-04-17 17:42:25-00 | library  | east     | 10 | graduate  | Graduate Student
    5 | 2022-04-17 17:52:53-00 | library  | east     |  9 | faculty   | Faculty Member
    6 | 2022-04-17 17:52:53-00 | library  | east     | 12 | staff     | Staff Member

There is a distinction between the __id and id columns. The __id column is a key defined by Metadb to identify each row uniquely; it is present in all tables generated by Metadb. In contrast, the id column only happens to be present in this sample table and has been provided by the data source; it may or may not be a key, and its name, content, or significance may change if schema changes are made in the data source.
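
Since id may not be a key, one quick way to check it for duplicate values in this sample table is:

SELECT id, count(*)
    FROM library.patrongroup
    GROUP BY id
    HAVING count(*) > 1;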

Transformed tables

Metadb applies transformations to some kinds of data, which results in additional tables being created. One category of data that are transformed in this way is JSON objects. Suppose that our patrongroup current table contains the groupname and description data in JSON fields, for example:

SELECT __id, __start, id, jsondata FROM library.patrongroup;
 __id | __start                | id | jsondata
------+------------------------+----+---------------------------------------------------------------------
    8 | 2022-04-18 19:27:18-00 | 15 | {"groupname": "undergrad", "description": "Undergraduate Student"}
    4 | 2022-04-17 17:42:25-00 | 10 | {"groupname": "graduate", "description": "Graduate Student"}
    5 | 2022-04-17 17:52:53-00 |  9 | {"groupname": "faculty", "description": "Faculty Member"}
    6 | 2022-04-17 17:52:53-00 | 12 | {"groupname": "staff", "description": "Staff Member"}

The JSON fields can be selected using SQL:

SELECT __id,
       __start,
       id,
       jsondata->>'groupname' AS groupname,
       jsondata->>'description' AS description
    FROM library.patrongroup;

However, Metadb also generates a transformed table called patrongroup__t. Transformed tables based on JSON objects contain JSON fields extracted into columns, which allows for simpler queries.

SELECT __id, __start, id, groupname, description FROM library.patrongroup__t;
 __id | __start                | id | groupname | description
------+------------------------+----+-----------+-----------------------
    2 | 2022-04-18 19:27:18-00 | 15 | undergrad | Undergraduate Student
    5 | 2022-04-17 17:42:25-00 | 10 | graduate  | Graduate Student
    4 | 2022-04-17 17:52:53-00 |  9 | faculty   | Faculty Member
    6 | 2022-04-17 17:52:53-00 | 12 | staff     | Staff Member

In the current version of Metadb, only top-level, scalar JSON fields are extracted into transformed tables.
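
Nested or non-scalar fields can still be queried from the original JSON column using PostgreSQL's JSON operators. A sketch, assuming a hypothetical nested object field named "address" (which would not be extracted):

SELECT __id,
       id,
       jsondata->'address'->>'city' AS city
    FROM library.patrongroup;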

Main tables are also transformed in the same way. In this case the main transformed table would be called patrongroup__t__.
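
For instance, the following query retrieves only current records from the main transformed table, which is equivalent to querying patrongroup__t:

SELECT id, groupname, description FROM library.patrongroup__t__ WHERE __current;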

Note that JSON data are treated as "schemaless," and fields are inferred from their presence in the data rather than read from a JSON schema. As a result, a column is only created from a JSON field if the field is present in at least one JSON record.

Comparing table types

To summarize the types of tables that we have covered:

 Table name       | Table type                | Description
------------------+---------------------------+------------------------------------------------------
 patrongroup__    | Main table                | Current and historical records
 patrongroup      | Current table             | Current records only
 patrongroup__t__ | Main transformed table    | Transformed versions of the records in patrongroup__
 patrongroup__t   | Current transformed table | Transformed versions of the records in patrongroup

In addition to these table types, some partition tables have names that begin with zzz___. It is recommended that these tables not be used directly in queries, and access to them is generally disabled.

Workspaces

In general, users do not have privileges to modify or create tables in the database. However, Metadb creates a schema for each user where the user does have these privileges, and this serves as an individual workspace. The schema has the same name as the user name. This can be useful as an area for saving the results of queries or importing external data sets.

When using one’s own workspace, it is not necessary to specify the schema name: if no schema is specified, the user’s own schema is assumed by default.

For example, the table westgroup created here by user flopsy will be placed in schema flopsy:

CREATE TABLE westgroup AS SELECT * FROM library.patrongroup WHERE __origin = 'west';
SELECT * FROM westgroup LIMIT 20;

By default, other users do not have privileges to access the table.

Metadb v1.0.6 To grant the user mopsy read-only access to the schema and table:

GRANT USAGE ON SCHEMA flopsy TO mopsy;

GRANT SELECT ON westgroup TO mopsy;

The mopsy user will have to include the schema name to access the table:

SELECT * FROM flopsy.westgroup LIMIT 20;

Database views

Database views are not supported and should not be created in a Metadb database.

Views can create dependencies that may block required schema changes during data updates. They also hide query complexity, which in analytical workloads can be significant.

Querying system information

Metadb version

Metadb v1.0.5 To show the current Metadb version:

SELECT mdbversion();

Data update times

Information about when certain tables were last updated is available in a table called metadb.table_update:

SELECT * FROM metadb.table_update ORDER BY schemaname, tablename;

System log

Metadb reports errors and other information in a logging table called metadb.log. For instance, to view all log messages from January 2023:

SELECT *
    FROM metadb.log
    WHERE log_time >= '2023-01-01' AND log_time < '2023-02-01'
    ORDER BY log_time;

Metadb v1.0.5 The mdblog() function is provided for convenience and returns ordered log messages, from the past 24 hours by default (or optionally from a specified interval):

SELECT * FROM mdblog();

Status of query process

When executing a long-running query, the ps() function can be used to check on the status of the query process from another session:

SELECT * FROM ps();

Reference

Stream processor

Data type conversion

When reading data from a data source, Metadb will automatically adjust column data types in a number of cases:

  • A data type in the source has changed in a way that cannot be applied safely by Metadb without more information.

  • A data type in the source has changed to a type that is not supported by Metadb.

  • A data type in the source can be converted safely to a type that is more performant.

The type adjustments can be summarized as follows: depending on the case, a column is converted to one of four target types (numeric, uuid, jsonb, or varchar), from source types that include boolean, smallint, integer, bigint, real, double precision, numeric, date, time, time with timezone, timestamp, timestamp with timezone, uuid, json, jsonb, and varchar.

Functions

System information

mdblog(interval)    Metadb v1.0.5
    Return type: table (log_time timestamptz(3), error_severity text, message text)
    Returns recent log messages during the specified interval, or during the past 24 hours if no interval is given.

mdbversion()    Metadb v1.0.5
    Return type: text
    Returns the Metadb version.

ps()
    Return type: table (dbname text, username text, state text, realtime text, query text)
    Returns the status and elapsed running time of current query processes.

Examples

Show log messages from the past hour, and from the past 24 hours:

SELECT * FROM mdblog(interval '1 hour');
SELECT * FROM mdblog();

Metadb v1.0.5 Show the current Metadb version:

SELECT mdbversion();

System tables

metadb.log

 Column name    | Column type    | Description
----------------+----------------+------------------------------------------------------------------------
 log_time       | timestamptz(3) | Timestamp when the log entry was written
 error_severity | text           | Logging level, from least to most severe: INFO, WARNING, ERROR, FATAL
 message        | text           | The log message

metadb.table_update

 Column name | Column type | Description
-------------+-------------+--------------------------------------------
 schemaname  | text        | Schema name of the updated table
 tablename   | text        | Table name of the updated table
 updated     | timestamptz | Timestamp when the table was last updated

External SQL directives

Metadb v1.0.5 Metadb allows scheduling external SQL files to run on a regular basis.

Each SQL statement should be separated from others by an empty line, and any tables created should not specify a schema name.

Comment lines beginning with --metadb: are used for additional directives; each directive should be on a separate line.

It is suggested that each SQL file begin with a --metadb:table directive, optionally followed by --metadb:require directives if needed, and then an empty line, for example:

--metadb:table library_patrons
--metadb:require library.patrongroup.id uuid
--metadb:require library.patrongroup.name text

DROP TABLE IF EXISTS library_patrons;

CREATE TABLE library_patrons AS
SELECT . . . ;

--metadb:require

Metadb v1.0.5 The --metadb:require directive asserts a dependency on a specific table and column in the database. Metadb will ensure that the table and column exist before it continues executing the SQL file. The required table can be a main table or a current table; both will be created if either is specified and does not exist.

The directive takes the form:

--metadb:require <schema>.<table>.<column> <columntype>

For example:

--metadb:require library.patrongroup.id uuid
--metadb:require library.patrongroup.name text

There is generally no need to list all dependencies. The intent of this directive is to reduce SQL errors by listing tables and columns that are likely or known to be missing in some instances.

--metadb:table

Metadb v1.0.5 The --metadb:table directive declares that the SQL file updates a specific table. This allows Metadb to report on the status of the table. The directive takes the form:

--metadb:table <table>

The specified table should not contain a schema name.

For example:

--metadb:table user_group

Statements

Metadb extends SQL with statements for configuring and administering the server. These statements are only available when connecting to the Metadb server (not the database).

ALTER DATA SOURCE

Change the configuration of a data source

ALTER DATA SOURCE source_name
    OPTIONS ( [ ADD | SET | DROP ] option ['value'] [, ... ] )
Description

ALTER DATA SOURCE changes connection settings for a data source.

Note

ALTER DATA SOURCE currently requires restarting the server before it will take effect.

Parameters

source_name

The name of an existing data source.

OPTIONS ( [ ADD | SET | DROP ] option ['value'] [, ... ] )

Connection settings and other configuration options for the data source.

Options

See CREATE DATA SOURCE

Examples

Change the consumer group:

ALTER DATA SOURCE sensor OPTIONS (SET consumergroup 'metadb_sensor_1');

AUTHORIZE

Enable access to tables generated from an external data source

AUTHORIZE SELECT
    ON ALL TABLES IN DATA SOURCE source_name
    TO role_specification
Description

The AUTHORIZE command grants access to tables. It differs from GRANT in that the authorization will also apply to tables created at a later time in the data source.

Note

AUTHORIZE currently requires restarting the server before it will take effect.

Parameters

source_name

The name of an existing data source.

role_specification

An existing role to be granted the authorization.

Examples
AUTHORIZE SELECT
    ON ALL TABLES IN DATA SOURCE sensor
    TO beatrice;

CREATE DATA ORIGIN

Define a new data origin

CREATE DATA ORIGIN origin_name
Description

CREATE DATA ORIGIN defines a new origin.

Note

CREATE DATA ORIGIN currently requires restarting the server before it will take effect.

Parameters

origin_name

A unique name for the data origin to be created.

Examples

Create a new origin test_origin:

CREATE DATA ORIGIN test_origin;

CREATE DATA SOURCE

Define a new external data source

CREATE DATA SOURCE source_name TYPE source_type
    OPTIONS ( option 'value' [, ... ] )
Description

CREATE DATA SOURCE defines connection settings for an external data source.

Parameters

source_name

A unique name for the data source to be created.

source_type

The type of data source. The only type currently supported is kafka.

OPTIONS ( option 'value' [, ... ] )

Connection settings and other configuration options for the data source.

Options for data source type "kafka"

brokers

Kafka bootstrap servers (comma-separated list).

security

Security protocol: 'ssl' or 'plaintext'. The default is 'ssl'.

topics

Regular expressions matching Kafka topics to read (comma-separated list).

consumergroup

Kafka consumer group ID.

schemapassfilter

Regular expressions matching schema names to accept (comma-separated list).

schemastopfilter

Regular expressions matching schema names to ignore (comma-separated list), evaluated after schemapassfilter.

trimschemaprefix

Prefix to remove from schema names.

addschemaprefix

Prefix to add to schema names.

module

Name of pre-defined configuration.

Examples

Create sensor as a kafka data source:

CREATE DATA SOURCE sensor TYPE kafka OPTIONS (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1.*',
    consumergroup 'metadb_sensor_1_1',
    addschemaprefix 'sensor_'
);

CREATE USER

CREATE USER is mostly equivalent to the PostgreSQL CREATE USER statement. However it also creates a schema having the same name as the user, if the schema does not exist, and grants the user privileges on the schema. The schema is intended as a workspace for the user.
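
For example, to create a user together with its workspace schema (the same form is used in "Configuring database users" below):

CREATE USER weg PASSWORD 'LZn2DCajcNHpGR3ZXWHD';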

DROP DATA SOURCE

Remove a data source configuration

DROP DATA SOURCE source_name
Description

DROP DATA SOURCE removes a data source configuration.

Parameters

source_name

The name of an existing data source.

Examples

Drop a data source sensor:

DROP DATA SOURCE sensor;

LIST

Show the value of a system variable

LIST name
Description

LIST shows the current setting of various system configurations and other variables.

Parameters

name

authorizations

Authorized users.

data_origins

Configured data origins.

data_sources

Configured data sources.

status

Current status of system components.

Examples
LIST status;

Server administration

System requirements

Hardware requirements

  • Metadb software:

    • CPU: 2 cores

    • Memory: 2 GB

    • Storage: 500 GB SSD

  • PostgreSQL database:

    • CPU: 4 cores

    • Memory: 32 GB

    • Storage: 2 TB SSD

Software requirements

  • Architecture: x86-64 (AMD64)

  • Operating system: Debian 11

  • PostgreSQL 14 or later

    • PostgreSQL 15 or later is recommended

    • AWS RDS PostgreSQL may optionally be used (with servers in the same zone/subnet); Aurora is not supported

  • Required to build from source:

    • Go 1.19 or later

    • goyacc (installation instructions below)

    • Ragel 6.10 or later

    • GCC C compiler 9.3.0 or later

PostgreSQL configuration

  • checkpoint_timeout: 3600

  • cpu_tuple_cost: 0.03

  • default_statistics_target: 1000

  • effective_io_concurrency: 200

  • idle_in_transaction_session_timeout: 3600000

  • idle_session_timeout: 604800000

  • maintenance_work_mem: 1000000

  • max_wal_size: 10240

  • shared_buffers: 1250000

  • statement_timeout: 3600000

  • work_mem: 350000
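
These settings can be placed in postgresql.conf or, as sketched below, applied with ALTER SYSTEM; the numeric values are interpreted in each parameter's base unit, and some parameters, such as shared_buffers, take effect only after a server restart:

ALTER SYSTEM SET checkpoint_timeout = 3600;
ALTER SYSTEM SET cpu_tuple_cost = 0.03;
ALTER SYSTEM SET default_statistics_target = 1000;
ALTER SYSTEM SET effective_io_concurrency = 200;
ALTER SYSTEM SET idle_in_transaction_session_timeout = 3600000;
ALTER SYSTEM SET idle_session_timeout = 604800000;
ALTER SYSTEM SET maintenance_work_mem = 1000000;
ALTER SYSTEM SET max_wal_size = 10240;
ALTER SYSTEM SET shared_buffers = 1250000;
ALTER SYSTEM SET statement_timeout = 3600000;
ALTER SYSTEM SET work_mem = 350000;

-- Reload the configuration; parameters that require a restart take effect on the next server restart.
SELECT pg_reload_conf();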

Building the software

First set the GOPATH environment variable to specify a path that can serve as the build workspace for Go, e.g.:

$ export GOPATH=$HOME/go

Then install goyacc:

$ go install golang.org/x/tools/cmd/goyacc@master

$ export PATH=$PATH:$GOPATH/bin

Then to build the server:

$ ./build.sh

The build.sh script creates a bin/ subdirectory and builds the metadb executable there:

$ ./bin/metadb help

It is suggested that a metadb user be created and the server run by that user, for example, in /home/metadb.

Server configuration

Metadb makes use of local storage in a data directory, which is created with the metadb init command. In this example we will create the data directory as data/:

metadb init -D data

This will also create a file metadb.conf in the data directory with the following structure:

[main]
host = host name of the PostgreSQL server
port = port number of the PostgreSQL server
database = database name, typically "metadb"
superuser = database superuser such as "postgres"
superuser_password = password of superuser
systemuser = database user that is the owner of the database
systemuser_password = password of systemuser
sslmode = should be set to "require" or stronger option

This file should be edited to add database connection parameters, for example:

[main]
host = a.b.c
port = 5432
database = metadb
superuser = postgres
superuser_password = zpreCaWS7S79dt82zgvD
systemuser = mdbadmin
systemuser_password = ZHivGie5juxGJZmTObHU
sslmode = require

Metadb will assume that the database, superuser, and systemuser defined here already exist; so they should be created before continuing.
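
A minimal sketch of creating these, using the names and passwords from the sample configuration above (the superuser postgres is assumed to exist already):

CREATE USER mdbadmin PASSWORD 'ZHivGie5juxGJZmTObHU';

CREATE DATABASE metadb OWNER mdbadmin;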

Backups

It is essential to make regular backups of Metadb and to test the backups.

In general, persistent data are stored in the database; the database is therefore the most important system to back up, and it should be backed up often.

The data directory contains the metadb.conf configuration file and is also used for temporary storage. The metadb.conf file should be backed up, or alternatively it should be possible to reconstruct the file’s contents.

Upgrading from a previous version

To upgrade from any previous version of Metadb, stop the server (if running), and then run the upgrade process in case changes are required:

metadb upgrade -D data

The upgrade process may, in some instances, take a significant amount of time to run. The database generally remains available to users during this period.

If no changes are needed, the process outputs:

metadb: "data" is up to date

Running the server

To start the server:

nohup metadb start -D data -l metadb.log &

The --memlimit option can be used to set a soft memory limit (in GiB) if needed, for example:

nohup metadb start -D data -l metadb.log --memlimit 2 &

The server listens on port 8440 by default, but this can be set using the --port option. The --debug option enables verbose logging.

To stop the server:

metadb stop -D data

Note that stopping or restarting the server may delay scheduled data updates or cause them to restart.

The server can be set up to run with systemd via a file such as /etc/systemd/system/metadb.service, for example:

[Unit]
Description=Metadb
After=network.target remote-fs.target

[Service]
Type=simple
User=metadb
ExecStart=/bin/bash -ce "exec /home/metadb/bin/metadb start -D /home/metadb/data -l /home/metadb/metadb.log"
Restart=on-abort

[Install]
WantedBy=multi-user.target

Then:

systemctl enable metadb

systemctl start metadb

Connecting to the server

The PostgreSQL terminal client, psql, can be used to connect to the Metadb server:

psql -h localhost -p 8440

Metadb v1.0.5 Then, for example, to show the software version:

SELECT mdbversion();

Configuring a Kafka data source

A data source is defined using the CREATE DATA SOURCE statement, for example:

CREATE DATA SOURCE sensor TYPE kafka OPTIONS (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1.*',
    consumergroup 'metadb_sensor_1_1',
    addschemaprefix 'sensor_',
    schemastopfilter 'admin'
);

Metadb currently supports reading Kafka messages in the format produced by the Debezium PostgreSQL connector for Kafka Connect. Configuration of Kafka, Kafka Connect, Debezium, and PostgreSQL logical decoding is beyond the scope of this documentation, but a few notes are included here.

Data flows in this direction:

  1. A source PostgreSQL database

  2. Kafka Connect/Debezium

  3. Kafka

  4. Metadb

  5. The analytics database

To allow capturing data changes in the source PostgreSQL database, logical decoding has to be enabled, in particular by setting wal_level = logical in postgresql.conf.
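
For example, logical decoding can be enabled in the source database with ALTER SYSTEM:

ALTER SYSTEM SET wal_level = 'logical';

-- After restarting PostgreSQL, confirm the setting:
SHOW wal_level;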

Next we create a connector configuration file for Kafka Connect:

{
    "name": "sensor-1-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.dbname": "sourcedb",
        "database.hostname": "example.host.name",
        "database.password": "eHrkGrZL8mMJOFgToqqL",
        "database.port": "5432",
        "database.server.name": "metadb_sensor_1",
        "database.user": "dbuser",
        "plugin.name": "pgoutput",
        "snapshot.mode": "exported",
        "tasks.max": "1",
        "truncate.handling.mode": "include",
        "publication.autocreate.mode", "filtered"
        "heartbeat.interval.ms": "30000",
        "heartbeat.action.query": "UPDATE admin.heartbeat set last_heartbeat_ts = now();"
    }
}

It is recommended to use the connector configuration settings heartbeat.interval.ms and heartbeat.action.query as above to avoid spikes in disk space consumption within the source database. (See the Debezium PostgreSQL connector documentation for more details.) The schemastopfilter option of the CREATE DATA SOURCE command is used to filter out the heartbeat table.

In the source database:

CREATE SCHEMA admin;

CREATE TABLE admin.heartbeat (last_heartbeat_ts timestamptz PRIMARY KEY);

INSERT INTO admin.heartbeat (last_heartbeat_ts) VALUES (now());

Then to create the connector:

curl -X POST -i -H "Accept: application/json" -H "Content-Type: application/json" \
     -d @connector.json https://kafka.connect.server/connectors

Note the 1 included in name and database.server.name in the connector configuration. This is suggested as a version number, which can be incremented if the data stream needs to be resynchronized with a new connector.

Metadb requires all streamed tables to have a primary key defined or a replica identity of FULL. Tables that do not meet this requirement should be filtered out in the Debezium PostgreSQL connector configuration by setting schema.exclude.list or table.exclude.list. Otherwise they will generate error messages in the Metadb log.
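
Alternatively, a replica identity of FULL can be set on a source table that lacks a primary key; a sketch, with a hypothetical table name:

ALTER TABLE sensordata.measurement REPLICA IDENTITY FULL;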

The replication slot disk usage must be monitored, because under certain error conditions it can grow too large and possibly fill up the disk. To show the disk usage (in the source database):

SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(),
    restart_lsn)) AS replication_slot_lag, active FROM pg_replication_slots;

It is also recommended to allocate plenty of extra disk space in the source database.

To drop the replication slot (in the source database) after deleting a connector:

SELECT pg_drop_replication_slot('debezium');

DROP PUBLICATION dbz_publication;

Resynchronizing a data source

If a Kafka data stream fails and cannot be resumed, it may be necessary to re-stream data to Metadb. For example, a source database may become unsynchronized with the analytic database, requiring a new snapshot of the source database to be streamed. Metadb can accept re-streamed data in order to resynchronize with the source, using the following procedure.

Note that during the resynchronization process, the analytics database continues to be available to users. However, streaming updates will be slower than usual, and there may temporarily be missing records (until they are re-streamed) or "extra" records (recently deleted in the source database).

  1. Update the topics and consumergroup configuration settings for the new data stream.

    ALTER DATA SOURCE sensor OPTIONS
        (SET topics '^metadb_sensor_2.*', SET consumergroup 'metadb_sensor_2_1');

    Do not restart the Metadb server but continue directly to Step 2.

  2. Stop the Metadb server and (before starting it again) "reset" the analytics database to mark all current data as old. This may take some time to run.

    metadb stop -D data
    
    metadb reset -D data --source sensor
  3. Start the Metadb server to begin streaming the data.

  4. Once the new data have finished (or nearly finished) re-streaming, stop the Metadb server, and "clean" the analytics database to remove any old data that have not been refreshed by the new data stream.

    metadb clean -D data --source sensor

    Note that the metadb server currently does not give an indication that it has finished re-streaming. (Having the server report that initial streaming or re-streaming has finished is a planned feature.)

    The precise timing of when "metadb clean" should be run is up to the administrator, but it must be run to remove redundant data and to complete the resynchronization process. The only disadvantage of running it too early is that data will appear to be missing until they are re-streamed.

  5. Start the server.

    Until a failed stream is re-streamed by following the process above, the analytic database may continue to be unsynchronized with the source.

Configuring database users

To create a new database user account:

CREATE USER weg PASSWORD 'LZn2DCajcNHpGR3ZXWHD';

In addition to creating the user, this also creates a schema with the same name as the user which can be used as a workspace.

We recommend user names of 3 to 8 characters in length.

To give the user access to all tables generated from a data source (including tables not yet created):

AUTHORIZE SELECT ON ALL TABLES IN DATA SOURCE sensor TO weg;
Note

AUTHORIZE currently requires restarting the server before it will take effect.

Administrative database changes

It is possible to make administrative-level changes directly in the underlying PostgreSQL database, such as providing additional tables for users. However, the following guidelines should be followed strictly to avoid disrupting the operation of Metadb:

  1. No changes should be made to any database objects created by Metadb. If it should become necessary to make changes to the database objects at the request of the Metadb maintainers, the server should be stopped first to prevent it from operating with an out-of-date cache. If changes are made to the database objects inadvertently, the server should be stopped as soon as possible and not started until the changes have been reversed.

  2. Changes generally should be made while logged in as a new database user (not a superuser) that is created specifically for that purpose.

  3. Any new database objects should be created in a new schema that will not coincide with a schema that may be created by Metadb. This can be ensured by always setting addschemaprefix in data source configurations and avoiding names with those prefixes when creating a new schema.

  4. Database views should not be created in the database.

External projects

FOLIO

Configuration

When creating a FOLIO data source, use the module 'folio' option, and set trimschemaprefix to remove the tenant from schema names and addschemaprefix to add a folio_ prefix to the schema names. For example:

CREATE DATA SOURCE folio TYPE kafka OPTIONS (
    module 'folio',
    trimschemaprefix 'tenantname_',
    addschemaprefix 'folio_',
    brokers 'kafka:29092',
    topics '^metadb_folio_1.*',
    consumergroup 'metadb_folio_1_1',
    schemastopfilter 'admin'
);

It is recommended to use a separate Kafka cluster, rather than the FOLIO Kafka instance, until one has experience with administration of Kafka.

In the Debezium PostgreSQL connector configuration, the following exclusions are suggested:

"schema.exclude.list": "public,.*_mod_login,.*_mod_pubsub,.*pubsub_config,supertenant_mod_.*,.*_mod_kb_ebsco_java,.*_mod_data_export_spring"
"table.exclude.list": ".*_mod_agreements.alternate_resource_name,.*_mod_service_interaction.dashboard_access,.*_mod_agreements.availability_constraint,.*_mod_agreements\\.package_description_url,.*_mod_agreements\\.content_type,.*_mod_agreements\\.entitlement_tag,.*_mod_agreements\\.erm_resource_tag,.*_mod_agreements\\.string_template,.*_mod_agreements\\.string_template_scopes,.*_mod_agreements\\.templated_url,.*_mod_oai_pmh\\.instances,.*_mod_remote_storage\\.original_locations,.*_mod_remote_storage\\.item_notes,.*app_setting,.*alternate_name,.*databasechangelog,.*databasechangeloglock,.*directory_entry_tag,.*license_document_attachment,.*license_supp_doc,.*license_tag,.*log_entry_additional_info,.*subscription_agreement_supp_doc,.*subscription_agreement_document_attachment,.*subscription_agreement_ext_lic_doc,.*subscription_agreement_tag,.*tenant_changelog,.*tenant_changelog_lock,.*marc_indexers.*,.*rmb_internal.*,.*rmb_job.*,.*_mod_agreements\\.match_key,.*system_changelog"

MARC transformation

Metadb v1.0.5 Metadb transforms MARC records from the tables marc_records_lb and records_lb in schema folio_source_record to a tabular form, which is stored in a new table, folio_source_record.marc__t. This transformation is usually updated every few hours. The time of the most recent update can be retrieved via SQL:

SELECT updated
    FROM metadb.table_update
    WHERE schemaname = 'folio_source_record' AND tablename = 'marc__t';

Derived tables

FOLIO derived tables are automatically updated once per day, usually at about 00:00:00 UTC by default.
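
The update times are recorded in metadb.table_update and can be queried there; a sketch, assuming hypothetically that the derived tables are created in a schema named folio_derived:

SELECT tablename, updated
    FROM metadb.table_update
    WHERE schemaname = 'folio_derived'
    ORDER BY updated;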

Documentation

FOLIO does not provide documentation for its internal data, on which Metadb tables are based, but it does have some data documentation for its "storage module" APIs, which are roughly equivalent. This is located at https://dev.folio.org/reference/api/. The name of most storage modules ends in -storage, but some modules use a different convention; for instance, the storage module for users is mod-users. (All module names begin with mod-.)

ReShare

Before defining a ReShare data source, create a data origin for each consortial tenant. For example:

CREATE DATA ORIGIN tenant1;

CREATE DATA ORIGIN tenant2;

CREATE DATA ORIGIN tenant3;
Note

CREATE DATA ORIGIN currently requires restarting the server before it will take effect.

Then use the module 'reshare' option when creating the data source, and set addschemaprefix to add a reshare_ prefix to the schema names:

CREATE DATA SOURCE reshare TYPE kafka OPTIONS (
    module 'reshare',
    addschemaprefix 'reshare_',
    brokers 'kafka:29092',
    topics '^metadb_reshare_1.*',
    consumergroup 'metadb_reshare_1_1',
    schemastopfilter 'admin'
);

Note that the order of commands is important: The initial set of data origins should be created before the data source is created so that schema names of incoming data will be processed correctly. Later, whenever a new consortial tenant is to be added, it should be defined in Metadb using CREATE DATA ORIGIN (and the server restarted) before the tenant is added to ReShare.

In the Debezium PostgreSQL connector configuration, it is suggested that credentials (.+mod_login), the public schema, the Okapi supertenant (supertenant_mod_.+), and mod-pubsub data (pubsub_config,.+_mod_pubsub) be excluded using the schema.exclude.list setting.

Derived tables

ReShare derived tables are automatically updated once per day, usually at about 00:00:00 UTC by default.