1. User guide
This is an overview of using Metadb. We assume familiarity with databases and the basics of SQL.
1.1. Getting started
Metadb extends PostgreSQL with features to support analytics such as streaming data sources, data model transforms, and historical data. The data contained in the Metadb database originally come from another place: a data source which could be, for example, a transaction-processing database or a sensor network. Metadb updates its database continuously based on state changes in external data sources.
1.2. Main tables
Tables generated by Metadb have at least these metadata columns, with names that begin with two underscores:
- __id is a surrogate key that identifies a row in the table.
- __start is the date and time when the row of data was generated.
- __end is the date and time when the row of data became no longer current, or 9999-12-31 00:00:00+00 if the row is still current.
- __current is a Boolean value that indicates whether the row is current.
- __origin is an optional identifier used to group related data from one or more data sources. For example, Metadb can combine data from similar source tables into a single table in the analytic database, and stores an identifier in __origin to record where the data came from.
A main table in Metadb has two underscores at the end of its name (e.g., patrongroup__), and it contains both the current state and the history of all previous states provided by the data source. For instance, consider a main table that stores categories of patrons in a library:
SELECT __start, __end, __current, id, groupname, description FROM library.patrongroup__;
__start | __end | __current | id | groupname | description
---|---|---|---|---|---
2021-08-16 14:02:33+00 | 2022-04-18 19:27:18+00 | f | 15 | undergrad | Student
2022-04-18 19:27:18+00 | 9999-12-31 00:00:00+00 | t | 15 | undergrad | Undergraduate Student
2021-08-16 14:02:33+00 | 9999-12-31 00:00:00+00 | t | 16 | graduate | Graduate Student
2021-08-16 14:02:33+00 | 9999-12-31 00:00:00+00 | t | 17 | faculty | Faculty Member
2021-08-16 14:02:33+00 | 9999-12-31 00:00:00+00 | t | 18 | staff | Staff Member
In this example, the "undergrad" group with id = 15 has two rows because it was modified on 2022-04-18 19:27:18+00, changing description from 'Student' to 'Undergraduate Student'.
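Because __start and __end bound each row's period of validity, a main table can also answer "as of" questions. As a sketch using the sample table above, the following retrieves the state of the data as it was at the start of 2022, using the same range idiom shown elsewhere in this guide:

```sql
SELECT id, groupname, description
    FROM library.patrongroup__
    WHERE __start <= '2022-01-01' AND '2022-01-01' < __end;
```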
1.3. Current tables
It is often desirable to limit a query to retrieving only current records. This can be done by filtering on __current = TRUE, for example:
SELECT id, groupname, description FROM library.patrongroup__ WHERE __current;
For convenience, since this filter is so often used, Metadb provides access to a current table which contains only current records. Every main table has a corresponding current table, which shares the same name but without the two underscores at the end. For instance, the following query is equivalent to the example above:
SELECT id, groupname, description FROM library.patrongroup;
A current table reflects the current state of each row in the data source, as of the last time the row was updated.
To take another example:
SELECT __id, __start, __origin, id, groupname, description FROM library.patrongroup;
__id | __start | __origin | id | groupname | description
---|---|---|---|---|---
1 | 2022-04-18 19:27:18+00 | west | 15 | undergrad | Undergraduate Student
2 | 2021-08-16 14:02:33+00 | west | 16 | graduate | Graduate Student
3 | 2021-08-16 14:02:33+00 | east | 17 | faculty | Faculty Member
4 | 2021-08-16 14:02:33+00 | east | 18 | staff | Staff Member
There is a distinction between the __id and id columns. The __id column is a key defined by Metadb to identify each row uniquely; it is present in all tables generated by Metadb. In contrast, the id column only happens to be present in this sample table and has been provided by the data source; it may or may not be a key, and its name, content, or significance may change if schema changes are made in the data source.
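Since __id is unique per row while id can recur across historical versions of the same record, counting rows per id in the main table shows how many versions of each record have been captured. A sketch based on the sample table above:

```sql
SELECT id, count(*) AS versions
    FROM library.patrongroup__
    GROUP BY id
    ORDER BY versions DESC;
```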
1.4. Transformed tables
Metadb applies transformations to some kinds of data, which results in additional tables being created. One category of data transformed in this way is JSON objects. Suppose that our patrongroup current table contains the groupname and description data in JSON fields, for example:
SELECT __id, __start, id, jsondata FROM library.patrongroup;
__id | __start | id | jsondata
---|---|---|---
1 | 2022-04-18 19:27:18+00 | 15 | { "groupname": "undergrad", "description": "Undergraduate Student" }
2 | 2021-08-16 14:02:33+00 | 16 | { "groupname": "graduate", "description": "Graduate Student" }
3 | 2021-08-16 14:02:33+00 | 17 | { "groupname": "faculty", "description": "Faculty Member" }
4 | 2021-08-16 14:02:33+00 | 18 | { "groupname": "staff", "description": "Staff Member" }
The JSON fields can be selected using SQL:
SELECT __id,
__start,
id,
jsonb_extract_path_text(jsondata, 'groupname') AS groupname,
jsonb_extract_path_text(jsondata, 'description') AS description
FROM library.patrongroup;
However, Metadb also generates a transformed table called patrongroup__t. Transformed tables based on JSON objects contain JSON fields extracted into columns, which allows for simpler queries.
SELECT __id, __start, id, groupname, description FROM library.patrongroup__t;
__id | __start | id | groupname | description
---|---|---|---|---
1 | 2022-04-18 19:27:18+00 | 15 | undergrad | Undergraduate Student
2 | 2021-08-16 14:02:33+00 | 16 | graduate | Graduate Student
3 | 2021-08-16 14:02:33+00 | 17 | faculty | Faculty Member
4 | 2021-08-16 14:02:33+00 | 18 | staff | Staff Member
In the current version of Metadb, only top-level, scalar JSON fields are extracted into transformed tables.
Main tables are also transformed in the same way. In this case the main transformed table would be called patrongroup__t__.
Note that JSON data are treated as "schemaless," and fields are inferred from their presence in the data rather than read from a JSON schema. As a result, a column is only created from a JSON field if the field is present in at least one JSON record.
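Fields that are not extracted, such as nested objects, can still be queried from the jsondata column of the current table. In this sketch, the metadata and createdDate names are hypothetical and serve only to illustrate the path syntax:

```sql
SELECT id,
       jsonb_extract_path_text(jsondata, 'metadata', 'createdDate') AS created_date
    FROM library.patrongroup;
```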
1.5. Comparing table types
To summarize the types of tables that we have covered:
Table name | Table type | Description
---|---|---
patrongroup__ | Main table | Current and historical records
patrongroup | Current table | Current records only
patrongroup__t__ | Main transformed table | Transformed versions of the records in patrongroup__
patrongroup__t | Current transformed table | Transformed versions of the records in patrongroup
In addition to these table types, some partition tables have names that begin with zzz___. It is recommended that these tables not be used directly in queries, and access to them is generally disabled.
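One way to see the effect of these table types is to compare row counts: the main table holds at least as many rows as the current table, because it also retains previous versions. For example:

```sql
SELECT (SELECT count(*) FROM library.patrongroup__) AS main_rows,
       (SELECT count(*) FROM library.patrongroup) AS current_rows;
```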
1.6. User workspaces
In general, users do not have privileges to modify or create tables in the database. However, Metadb creates a schema for each user where the user does have these privileges, and this serves as an individual workspace. The schema has the same name as the user name. This can be useful as an area for saving the results of queries or importing external data sets. For example, the user celia can create a table in the schema celia:
CREATE TABLE celia.westgroup AS SELECT * FROM library.patrongroup WHERE __origin = 'west';
SELECT * FROM celia.westgroup LIMIT 20;
Other users typically do not have privileges to access the table by default. To grant the user rosalind read-only access to the schema and table:

GRANT USAGE ON SCHEMA celia TO rosalind;

GRANT SELECT ON celia.westgroup TO rosalind;

The user rosalind can then access the table.
1.7. Working with data types
This section offers a few brief recommendations for working with common data types.
Text and varchar
The text data type is recommended for strings of characters. The varchar(n) type with length n can be used for older database clients that require the length to be specified.
Monetary data
For monetary amounts, numeric(19, 4) is usually a good choice. For exchange rates, numeric(19, 14) may be used.
Querying on a range of dates
There are various ways of querying within a range of dates, but the least error-prone is to use inequalities of the form start <= t < end, for example:
SELECT * FROM folio_circulation.loan__t WHERE '2023-01-01' <= loan_date AND loan_date < '2024-01-01';
1.8. Creating reports
An effective way to create a report is to package it as a database function. A database function can define a query and associated parameters. Users can then call the function, specifying a value for each parameter.
For example, suppose that the following query counts the number of loans in a library for each circulated item within a range of dates.
SELECT item_id, count(*) AS loan_count FROM folio_circulation.loan__t WHERE '2023-01-01' <= loan_date AND loan_date < '2024-01-01' GROUP BY item_id;
We can create a function to generalize this query. Instead of including the dates directly within the query, we will define them as parameters: start_date and end_date.
CREATE FUNCTION lisa.count_loans(
    start_date date DEFAULT '2000-01-01',
    end_date date DEFAULT '2050-01-01')
    RETURNS TABLE(
        item_id uuid,
        loan_count bigint)
    AS $$
SELECT item_id,
       count(*) AS loan_count
    FROM folio_circulation.loan__t
    WHERE start_date <= loan_date AND loan_date < end_date
    GROUP BY item_id
$$ LANGUAGE SQL STABLE PARALLEL SAFE;
Now the function can be called with different arguments to generate reports:
SELECT * FROM lisa.count_loans(start_date => '2022-01-01', end_date => '2023-01-01');

SELECT * FROM lisa.count_loans(start_date => '2023-01-01');
The user lisa who created this function can share it with other users in a similar way as described in the "User workspaces" section above.
GRANT USAGE ON SCHEMA lisa TO celia, rosalind;

GRANT EXECUTE ON FUNCTION lisa.count_loans TO celia, rosalind;
Note that the GRANT USAGE ON SCHEMA command does not have to be run again if it was done previously.
Defining shared functions in this way can be used together with a web-based database tool such as CloudBeaver to make reports available to a wider group of users.
1.9. Database views
Database views are not supported and should not be created in a Metadb database.
Views can create dependencies that may block required schema changes during data updates. They also hide query complexity, which in analytical workloads can be significant, behind what appears to be a simple table. It is recommended to use database functions instead, as described in the previous section.
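In place of a view, the same query can be packaged as a function in a user schema, which avoids creating dependencies on tables managed by Metadb. A minimal sketch of this approach, assuming id is an integer in the sample table:

```sql
CREATE FUNCTION celia.current_undergrads()
    RETURNS TABLE(
        id integer,
        description text)
    AS $$
SELECT id,
       description
    FROM library.patrongroup
    WHERE groupname = 'undergrad'
$$ LANGUAGE SQL STABLE;
```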
1.10. Querying system information
Metadb version
To show the current Metadb version:
SELECT mdbversion();
Data update times
Information about when certain tables were last updated is available in a table called metadb.table_update:
SELECT * FROM metadb.table_update ORDER BY schemaname, tablename;
System log
Metadb reports errors and other information in a logging table called metadb.log. For instance, to view all log messages from January 2023:
SELECT * FROM metadb.log WHERE '2023-01-01' <= log_time AND log_time < '2023-02-01' ORDER BY log_time;
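The error_severity column can be used to narrow this further, for example to show only error and fatal messages:

```sql
SELECT * FROM metadb.log
    WHERE error_severity IN ('ERROR', 'FATAL')
    ORDER BY log_time;
```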
The mdblog() function is provided for convenience and returns ordered log messages, from the past 24 hours by default (or optionally from a specified interval):
SELECT * FROM mdblog();
Status of query process
When executing a long-running query, the ps() function can be used to check on the status of the query process from another session:
SELECT * FROM ps();
2. Reference
2.1. Stream processor
2.1.1. Data type conversion
When reading data from a data source, Metadb will automatically adjust column data types in a number of cases:
- A data type in the source has changed in a way that cannot be applied safely by Metadb without more information.
- A data type in the source has changed to a type that is not supported by Metadb.
- A data type in the source can be converted safely to a type that is more performant.
This table summarizes the type adjustments that are performed:
Data type conversions | To numeric | To uuid | To jsonb | To text
---|---|---|---|---
From boolean | | | | ✅
From smallint | ✅ | | | ✅
From integer | ✅ | | | ✅
From bigint | ✅ | | | ✅
From real | ✅ | | | ✅
From double precision | ✅ | | | ✅
From numeric | | | | ✅
From date | | | | ✅
From time | | | | ✅
From time with timezone | | | | ✅
From timestamp | | | | ✅
From timestamp with timezone | | | | ✅
From uuid | | | | ✅
From json | | | ✅ | ✅
From jsonb | | | | ✅
From text/varchar | | ✅ | |
2.2. Functions
2.2.1. System information
Name | Return type | Description
---|---|---
mdblog(interval) | table ( log_time timestamptz(3), error_severity text, message text ) | Returns recent log messages during the specified interval, or during the past 24 hours if no interval is given
mdbversion() | text | Returns the Metadb version
ps() | table ( dbname text, username text, state text, realtime text, query text ) | Returns the status and elapsed running time of current query processes
Examples
Show log messages from the past hour, and from the past 24 hours:
SELECT * FROM mdblog('1 hour');
SELECT * FROM mdblog();
Show the current Metadb version:
SELECT mdbversion();
2.3. System tables
2.3.1. metadb.base_table
Metadb 1.2
The table metadb.base_table stores information about tables that are extracted from data sources or are transformed from those tables.
Column name | Column type | Description
---|---|---
schema_name | varchar(63) | The base table schema name
table_name | varchar(63) | The base table name
source_name | varchar(63) | Name of the data source the table is extracted from
transformed | boolean | True if the table is transformed from another table
parent_schema_name | varchar(63) | Schema name of the parent table, if this is a transformed table
parent_table_name | varchar(63) | Table name of the parent table, if this is a transformed table
2.3.2. metadb.log
The table metadb.log stores logging information for the system.

Column name | Column type | Description
---|---|---
log_time | timestamptz(3) | Timestamp when the log entry was written
error_severity | text | Logging level (from least to most severe): INFO, WARNING, ERROR, FATAL
message | text | The log message
2.3.3. metadb.table_update
Metadb 1.2
The table metadb.table_update stores information about the updating of certain tables.

Column name | Column type | Description
---|---|---
schemaname | varchar(63) | Schema name of the updated table
tablename | varchar(63) | Table name of the updated table
updated | timestamptz | Timestamp when the table was last updated successfully
elapsed_real_time | real | Wall-clock time in seconds of the last completed update process
2.4. External SQL directives
Metadb allows scheduling external SQL files to run on a regular basis. Each SQL statement should be separated from others by an empty line, and any tables created should not specify a schema name. Comment lines beginning with --metadb: are used for additional directives; each directive should be on a separate line.

It is suggested that each SQL file begin with a --metadb:table directive, optionally followed by --metadb:require directives if needed, and then an empty line, for example:
--metadb:table library_patrons
--metadb:require library.patrongroup.id uuid
--metadb:require library.patrongroup.name text

DROP TABLE IF EXISTS library_patrons;

CREATE TABLE library_patrons AS
SELECT . . . ;
2.4.1. --metadb:require
The --metadb:require directive asserts a dependency on a specific table and column in the database. Metadb will ensure that the table and column exist before it continues executing the SQL file. The required table can be a main table or a current table; both will be created if either is specified and does not exist.
The directive takes the form:
--metadb:require <schema>.<table>.<column> <columntype>
For example:
--metadb:require library.patrongroup.id uuid
--metadb:require library.patrongroup.name text
There is generally no need to list all dependencies. The intent of this directive is to reduce SQL errors by listing tables and columns that are likely or known to be missing in some instances.
2.4.2. --metadb:table
The --metadb:table directive declares that the SQL file updates a specific table. This allows Metadb to report on the status of the table. The directive takes the form:
--metadb:table <table>
The specified table should not contain a schema name.
For example:
--metadb:table user_group
2.5. Statements
Metadb extends SQL with statements for configuring and administering the server. These statements are only available when connecting to the Metadb server (not the database).
2.5.1. ALTER DATA SOURCE
Change the configuration of a data source
ALTER DATA SOURCE source_name
OPTIONS ( [ ADD | SET | DROP ] option ['value'] [, ... ] )
Description
ALTER DATA SOURCE changes connection settings for a data source.
Parameters
- source_name: The name of an existing data source.
- option: Connection settings and other configuration options for the data source.
Options
See CREATE DATA SOURCE
Examples
Change the consumer group:
ALTER DATA SOURCE sensor OPTIONS (SET consumergroup 'metadb_sensor_1');
2.5.2. AUTHORIZE
Enable access to tables generated from an external data source
AUTHORIZE SELECT
ON ALL TABLES IN DATA SOURCE source_name
TO role_specification
Description
The AUTHORIZE command grants access to tables. It differs from GRANT in that the authorization will also apply to tables created at a later time in the data source.
Parameters
- source_name: The name of an existing data source.
- role_specification: An existing role to be granted the authorization.
Examples
AUTHORIZE SELECT ON ALL TABLES IN DATA SOURCE sensor TO beatrice;
CREATE DATA ORIGIN
Define a new data origin
CREATE DATA ORIGIN origin_name
Description
CREATE DATA ORIGIN defines a new origin.
Parameters
- origin_name: A unique name for the data origin to be created.
Examples
Create a new origin test_origin:
CREATE DATA ORIGIN test_origin;
2.5.3. CREATE DATA SOURCE
Define a new external data source
CREATE DATA SOURCE source_name
TYPE source_type
OPTIONS ( option 'value' [, ... ] )
Description
CREATE DATA SOURCE defines connection settings for an external data source. The new data source starts out in synchronizing mode, which pauses periodic transforms and running external SQL. When the initial snapshot has finished streaming, the message "source snapshot complete (deadline exceeded)" will be written to the log. To complete the synchronization, the Metadb server should be stopped in order to run metadb endsync, and after the "endsync" has completed, the Metadb server can be started again.
Parameters
- source_name: A unique name for the data source to be created.
- source_type: The type of data source. The only type currently supported is kafka.
- option: Connection settings and other configuration options for the data source.
Options for data source type "kafka"
- brokers: Kafka bootstrap servers (comma-separated list).
- security: Security protocol.
- topics: Regular expressions matching Kafka topics to read (comma-separated list).
- consumergroup: Kafka consumer group ID.
- schemapassfilter: Regular expressions matching schema names to accept (comma-separated list).
- schemastopfilter: Regular expressions matching schema names to ignore (comma-separated list).
- tablestopfilter: Regular expressions matching table names to ignore (comma-separated list).
- trimschemaprefix: Prefix to remove from schema names.
- addschemaprefix: Prefix to add to schema names.
- module: Name of pre-defined configuration.
Examples
Create sensor as a kafka data source:
CREATE DATA SOURCE sensor TYPE kafka OPTIONS (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1\.',
    consumergroup 'metadb_sensor_1_1',
    addschemaprefix 'sensor_',
    tablestopfilter '^testing\.air_temp$,^testing\.air_temp_avg$'
);
2.5.4. CREATE USER
Define a new database user
CREATE USER user_name
[ WITH ] option 'value' [, ... ]
Description
CREATE USER defines a new database user that will be managed by Metadb. It also creates a schema having the same name as the user, if the schema does not exist, and grants the user privileges on the schema. The schema is intended as a workspace for the user.
Parameters
- user_name: The name of the new user.
- option: Configuration options for the new user.
Options
- PASSWORD: Sets the user's password. (Required)
- COMMENT: Stores a comment about the user, e.g. the user's real name. The comment can be viewed in psql using the \du+ command.
Examples
Create a user wegg:
CREATE USER wegg WITH PASSWORD 'LZn2DCajcNHpGR3ZXWHD', COMMENT 'Silas Wegg';
2.5.5. DROP DATA SOURCE
Remove a data source configuration
DROP DATA SOURCE source_name
Description
DROP DATA SOURCE removes a data source configuration.
Parameters
- source_name: The name of an existing data source.
Examples
Drop a data source sensor:
DROP DATA SOURCE sensor;
2.5.6. LIST
Show the value of a system variable
LIST name
Description
LIST shows the current setting of various system configurations and other variables.
Parameters
- authorizations: Authorized users.
- data_origins: Configured data origins.
- data_sources: Configured data sources.
- status: Current status of system components.
Examples
LIST status;
3. Server administration
3.1. System requirements
3.1.1. Hardware requirements
- Metadb software:
  - Architecture: x86-64 (AMD64)
  - CPU: 2 cores
  - Memory: 1 GB
  - Local storage: 500 GB SSD
- PostgreSQL database:
  - CPU: 4 cores
  - Memory: 32 GB
  - Storage: 2 TB SSD
3.1.2. Software requirements
- Operating system: Debian 12 or later
- PostgreSQL 15 or later
- Required to build from source:
  - Go 1.20 or later
  - goyacc (installation instructions below)
  - Ragel 6.10 or later
  - GCC C compiler 9.3.0 or later
3.1.3. PostgreSQL configuration
- checkpoint_timeout: 3600
- cpu_tuple_cost: 0.03
- default_statistics_target: 1000
- effective_io_concurrency: 200
- idle_in_transaction_session_timeout: 60000
- idle_session_timeout: 86400000
- maintenance_work_mem: 1000000
- max_wal_size: 10240
- shared_buffers: 1250000
- statement_timeout: 3600000
- work_mem: 350000
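These settings can be placed in postgresql.conf, or applied from a superuser session with ALTER SYSTEM followed by a configuration reload (note that shared_buffers requires a server restart to take effect). A sketch for two of the settings:

```sql
ALTER SYSTEM SET checkpoint_timeout = 3600;
ALTER SYSTEM SET work_mem = 350000;
SELECT pg_reload_conf();
```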
3.2. Building the software
First set the GOPATH environment variable to specify a path that can serve as the build workspace for Go, e.g.:
export GOPATH=$HOME/go
Then install goyacc:
go install golang.org/x/tools/cmd/goyacc@master
export PATH=$PATH:$GOPATH/bin
Then to build the server:
./build.sh
The build.sh script creates a bin/ subdirectory and builds the metadb executable there:
./bin/metadb help
It is suggested that a metadb user be created and the server run by that user, for example, in /home/metadb.
3.3. Server configuration
Metadb makes use of local storage in a data directory which is created using metadb with the init command. In this example we will create the data directory as data/:
metadb init -D data
This will also create a file metadb.conf in the data directory with the following structure:
[main]
host = host name of the PostgreSQL server
port = port number of the PostgreSQL server
database = database name
superuser = database superuser such as "postgres"
superuser_password = password of superuser
systemuser = database user that is the owner of the database
systemuser_password = password of systemuser
sslmode = should be set to "require" or stronger option
Metadb expects the database name to be metadb or to begin with metadb_; otherwise it logs a warning message.
This file should be edited to add database connection parameters, for example:
[main]
host = a.b.c
port = 5432
database = metadb
superuser = postgres
superuser_password = zpreCaWS7S79dt82zgvD
systemuser = mdbadmin
systemuser_password = ZHivGie5juxGJZmTObHU
sslmode = require
Metadb will assume that the database, superuser, and systemuser defined here already exist; so they should be created before continuing.
3.4. Backups
It is essential to make regular backups of Metadb and to test the backups.
In general, persistent data are stored in the database, and therefore the database is the most important system that should be backed up often.
The data directory contains the metadb.conf configuration file and is also used for temporary storage. The metadb.conf file should be backed up, or alternatively it should be possible to reconstruct the file's contents.
3.5. Upgrading from a previous version
To upgrade from any previous version of Metadb, stop the server (if running), and then run the upgrade process in case changes are required:
metadb upgrade -D data
The upgrade process may, in some instances, take a significant amount of time to run. The database generally remains available to users during this period.
If no changes are needed, the process outputs:
metadb: "data" is up to date
3.6. Running the server
To start the server:
nohup metadb start -D data -l metadb.log &
The --memlimit option can be used to set a soft memory limit (in GiB) if needed, for example:
nohup metadb start -D data -l metadb.log --memlimit 2 &
The server listens on port 8550 by default, but this can be set using the --port option. The --debug option enables verbose logging.
To stop the server:
metadb stop -D data
Note that stopping or restarting the server may delay scheduled data updates or cause them to restart.
The server can be set up to run with systemd via a file such as /etc/systemd/system/metadb.service, for example:
[Unit]
Description=Metadb
After=network.target remote-fs.target
[Service]
Type=simple
User=metadb
ExecStart=/bin/bash -ce "exec /home/metadb/bin/metadb start -D /home/metadb/data -l /home/metadb/metadb.log"
Restart=on-abort
[Install]
WantedBy=multi-user.target
Then:
systemctl enable metadb
systemctl start metadb
3.7. Connecting to the server
The PostgreSQL terminal-based client, psql, is used to connect to a Metadb server that runs on the same host and listens on a specified port:
psql -d metadb -h localhost -X -p <port>
For example:
psql -d metadb -h localhost -X -p 8550
See Reference > Statements for commands that can be issued via psql.
3.8. Configuring a Kafka data source
3.8.1. Overview
Metadb currently supports reading Kafka messages in the format produced by the Debezium PostgreSQL connector for Kafka Connect. Configuration of Kafka, Kafka Connect, Debezium, and PostgreSQL logical decoding is beyond the scope of this documentation, but a few notes are included here.
Data flow in this direction:

- A source PostgreSQL database
- Kafka Connect/Debezium
- Kafka
- Metadb
- The analytics database
3.8.2. Creating a connector
To allow capturing data changes in the source PostgreSQL database, logical decoding has to be enabled, in particular by setting wal_level = logical in postgresql.conf.
Next we create a connector configuration file for Kafka Connect:
{
    "name": "sensor-1-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.dbname": "sourcedb",
        "database.hostname": "example.host.name",
        "database.password": "eHrkGrZL8mMJOFgToqqL",
        "database.port": "5432",
        "database.server.name": "metadb_sensor_1",
        "database.user": "dbuser",
        "plugin.name": "pgoutput",
        "snapshot.mode": "exported",
        "tasks.max": "1",
        "truncate.handling.mode": "include",
        "publication.autocreate.mode": "filtered",
        "heartbeat.interval.ms": "30000",
        "heartbeat.action.query": "UPDATE admin.heartbeat set last_heartbeat = now();"
    }
}
It is recommended to use the connector configuration settings heartbeat.interval.ms and heartbeat.action.query as above to avoid spikes in disk space consumption within the source database. (See the Debezium PostgreSQL connector documentation for more details.) The schemastopfilter option of the CREATE DATA SOURCE command is used to filter out the heartbeat table.
In the source database:
CREATE SCHEMA admin;

CREATE TABLE admin.heartbeat (last_heartbeat timestamptz PRIMARY KEY);

INSERT INTO admin.heartbeat (last_heartbeat) VALUES (now());
Then to create the connector:
curl -X POST -i -H "Accept: application/json" -H "Content-Type: application/json" \ -d @connector.json https://kafka.connect.server/connectors
Note the 1 included in name and database.server.name in the connector configuration. This is suggested as a version number, which can be incremented if the data stream needs to be resynchronized with a new connector.
Metadb requires all streamed tables to have a primary key defined. Tables that do not meet this requirement should be filtered out in the Debezium PostgreSQL connector configuration by setting schema.exclude.list or table.exclude.list. Otherwise they will generate error messages in the Metadb log.
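Tables lacking a primary key can be listed in the source database before configuring the connector. This query is a sketch using the PostgreSQL system catalogs; it reports ordinary tables outside the system schemas that have no primary key constraint:

```sql
SELECT n.nspname AS schemaname, c.relname AS tablename
    FROM pg_class c
        JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE c.relkind = 'r'
        AND n.nspname NOT IN ('pg_catalog', 'information_schema')
        AND NOT EXISTS (SELECT 1 FROM pg_constraint p
                            WHERE p.conrelid = c.oid AND p.contype = 'p');
```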
3.8.3. Monitoring replication
The replication slot disk usage must be monitored, because under certain error conditions it can grow too large and possibly fill up the disk. To show the disk usage (in the source database):
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS replication_slot_lag,
       active
    FROM pg_replication_slots;
It is recommended to allocate plenty of extra disk space in the source database.
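For automated monitoring, the lag can be reported in bytes and slots exceeding a chosen threshold flagged. A sketch, using 50 GB as an arbitrary example threshold:

```sql
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
    FROM pg_replication_slots
    WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > 50000000000;
```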
3.8.4. Creating the data source
In Metadb, a data source is defined using the CREATE DATA SOURCE statement, for example:
CREATE DATA SOURCE sensor TYPE kafka OPTIONS (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1\.',
    consumergroup 'metadb_sensor_1_1',
    addschemaprefix 'sensor_',
    schemastopfilter 'admin'
);
3.8.5. Initial synchronization
Metadb 1.2
When a new data source is first configured using CREATE DATA SOURCE, Metadb automatically puts the database into synchronizing mode, just as if metadb sync had been run (see Server administration > Resynchronizing a data source). This has the effect of pausing periodic transforms and external SQL. When the initial snapshot has finished streaming, the message "source snapshot complete (deadline exceeded)" will be written to the log. Then, to complete this first synchronization, stop the Metadb server, and after that run metadb endsync:
metadb stop -D data
metadb endsync -D data --source sensor
Once "endsync" has finished running, start the Metadb server.
3.8.6. Deleting a connection
Sometimes a connection may have to be deleted and recreated (see Server administration > Resynchronizing a data source). After deleting a connection, the replication slot and publication in the source database should be dropped using:
SELECT pg_drop_replication_slot('debezium');

DROP PUBLICATION dbz_publication;
3.9. Resynchronizing a data source
If a Kafka data stream fails and cannot be resumed, it may be necessary to re-stream a complete snapshot of the data to Metadb. For example, a source database may become unsynchronized with the analytic database, requiring a new snapshot of the source database to be streamed. Metadb can accept re-streamed data in order to synchronize with the source, using the following procedure.
Note that during the synchronization process, the analytics database continues to be available to users. However, streaming updates will be slower than usual, and there temporarily may be missing records (until they are re-streamed) or "extra" records (recently deleted in the source database). Also, periodic transforms and external SQL are paused during synchronization.
1. Update the topics and consumergroup configuration settings for the new data stream:

   ALTER DATA SOURCE sensor OPTIONS (SET topics '^metadb_sensor_2\.', SET consumergroup 'metadb_sensor_2_1');

   Do not restart the Metadb server but continue directly to Step 2.

2. Stop the Metadb server and (before starting it again) run metadb sync. This may take some time to run.

   metadb stop -D data
   metadb sync -D data --source sensor

3. Start the Metadb server to begin streaming the data.

4. Once the new data have finished (or nearly finished) re-streaming, stop the Metadb server, and run metadb endsync to remove any old data that have not been refreshed by the new data stream.

   metadb endsync -D data --source sensor

   The timing of when "endsync" should be run is up to the administrator, but it must be run to complete the synchronization process. In most cases it will be more convenient for users if "endsync" is run too late (delaying removal of deleted records) rather than too early (removing records before they have been re-streamed).

   Metadb detects when snapshot data are no longer being received, and then writes "source snapshot complete (deadline exceeded)" to the log. This generally means it is a good time to run "endsync".

5. Start the server.
Until a failed stream is re-streamed by following the process above, the analytic database may continue to be unsynchronized with the source.
3.10. Creating database users
To create a new database user account:
CREATE USER wegg WITH PASSWORD 'LZn2DCajcNHpGR3ZXWHD', COMMENT 'Silas Wegg';
In addition to creating the user, this also creates a schema with the same name as the user. The schema is intended as a workspace for the user.
- Each user account should be for an individual user and not shared by more than one person.
- Prefer user names of 3 to 8 characters in length.
By default the user does not have access to data tables. To give the user access to all tables generated from a data source (including tables not yet created):
AUTHORIZE SELECT ON ALL TABLES IN DATA SOURCE sensor TO wegg;
3.11. Administrative database changes
It is possible to make administrative-level changes directly in the underlying PostgreSQL database, such as providing additional tables for users. However, the following guidelines should be followed strictly to avoid disrupting the operation of Metadb:
-
No changes should be made to any database objects created by Metadb. If it should become necessary to make changes to the database objects at the request of the Metadb maintainers, the server should be stopped first to prevent it from operating with an out-of-date cache. If changes are made to the database objects inadvertently, the server should be stopped as soon as possible and not started until the changes have been reversed.
-
Changes generally should be made while logged in as a new database user (not a superuser) that is created specifically for that purpose.
-
Any new database objects should be created in a new schema that will not coincide with a schema that may be created by Metadb. This can be ensured by always setting
addschemaprefix
in data source configurations and avoiding names with those prefixes when creating a new schema. -
Database views should not be created in the database.
4. External projects
4.1. FOLIO
4.1.1. MARC transform
Metadb transforms MARC records from the tables marc_records_lb and records_lb in schema folio_source_record to a tabular form, which is stored in a new table, folio_source_record.marc__t. Only records considered to be current are transformed, where current is defined as having state = 'ACTUAL' and an identifier present in 999 ff $i.
This transform updates the table folio_source_record.marc__t usually every few hours. The time of the most recent update can be retrieved from the table metadb.table_update:
SELECT updated FROM metadb.table_update WHERE schemaname = 'folio_source_record' AND tablename = 'marc__t';
The MARC transform stores partition tables in the schema marctab. Users can ignore this schema, as all data are accessible via folio_source_record.marc__t.
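As an illustration of querying the transformed records, a query along these lines might retrieve title fields. This is only a sketch: the column names instance_id, field, sf, and content are assumptions and should be checked against the actual marc__t table:

```sql
-- Hypothetical query: retrieve MARC 245 $a (title) values.
SELECT instance_id, content
    FROM folio_source_record.marc__t
    WHERE field = '245'
        AND sf = 'a';
```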
4.1.2. Derived tables
FOLIO "derived tables" are automatically updated once per day, by default at about 00:00:00 UTC.
Note that the derived tables are based on a periodic snapshot of data, and for this reason they are generally not up-to-date.
4.1.3. Data model
FOLIO does not provide documentation for its internal data model, which Metadb tables are based on, but it does have some data documentation for its "storage module" APIs, which are roughly equivalent. This is located at https://dev.folio.org/reference/api/. The name of most storage modules ends in -storage, but some modules use a different convention; for instance, the storage module for users is mod-users. (All module names begin with mod-.)
4.1.4. Migrating from LDP
This section contains notes related to migrating from LDP to Metadb.
4.1.4.1. Table names
Table names have changed and now are derived from FOLIO internal table names:
| LDP table | Metadb table |
|---|---|
| acquisitions_memberships | folio_orders.acquisitions_unit_membership |
| acquisitions_units | folio_orders.acquisitions_unit |
| audit_circulation_logs | folio_audit.circulation_logs |
| circulation_cancellation_reasons | folio_circulation.cancellation_reason |
| circulation_check_ins | folio_circulation.check_in |
| circulation_loan_history | folio_circulation.audit_loan |
| circulation_loan_policies | folio_circulation.loan_policy |
| circulation_loans | folio_circulation.loan |
| circulation_patron_action_sessions | folio_circulation.patron_action_session |
| circulation_patron_notice_policies | folio_circulation.patron_notice_policy |
| circulation_request_policies | folio_circulation.request_policy |
| circulation_request_preference | folio_circulation.user_request_preference |
| circulation_scheduled_notices | folio_circulation.scheduled_notice |
| circulation_staff_slips | folio_circulation.staff_slips |
| configuration_entries | folio_configuration.config_data |
| course_copyrightstatuses | folio_courses.coursereserves_copyrightstates |
| course_courselistings | folio_courses.coursereserves_courselistings |
| course_courses | folio_courses.coursereserves_courses |
| course_coursetypes | folio_courses.coursereserves_coursetypes |
| course_departments | folio_courses.coursereserves_departments |
| course_processingstatuses | folio_courses.coursereserves_processingstates |
| course_reserves | folio_courses.coursereserves_reserves |
| course_terms | folio_courses.coursereserves_terms |
| email_email | folio_email.email_statistics |
| feesfines_accounts | folio_feesfines.accounts |
| feesfines_comments | folio_feesfines.comments |
| feesfines_feefineactions | folio_feesfines.feefineactions |
| feesfines_feefines | folio_feesfines.feefines |
| feesfines_lost_item_fees_policies | folio_feesfines.lost_item_fee_policy |
| feesfines_overdue_fines_policies | folio_feesfines.overdue_fine_policy |
| feesfines_owners | folio_feesfines.owners |
| feesfines_payments | folio_feesfines.payments |
| feesfines_refunds | folio_feesfines.refunds |
| feesfines_waives | folio_feesfines.waives |
| finance_budgets | folio_finance.budget |
| finance_fiscal_years | folio_finance.fiscal_year |
| finance_fund_types | folio_finance.fund_type |
| finance_funds | folio_finance.fund |
| finance_group_fund_fiscal_years | folio_finance.group_fund_fiscal_year |
| finance_groups | folio_finance.groups |
| finance_ledgers | folio_finance.ledger |
| finance_transactions | folio_finance.transaction |
| inventory_alternative_title_types | folio_inventory.alternative_title_type |
| inventory_call_number_types | folio_inventory.call_number_type |
| inventory_campuses | folio_inventory.loccampus |
| inventory_classification_types | folio_inventory.classification_type |
| inventory_contributor_name_types | folio_inventory.contributor_name_type |
| inventory_contributor_types | folio_inventory.contributor_type |
| inventory_electronic_access_relationships | folio_inventory.electronic_access_relationship |
| inventory_holdings | folio_inventory.holdings_record |
| inventory_holdings_note_types | folio_inventory.holdings_note_type |
| inventory_holdings_sources | folio_inventory.holdings_records_source |
| inventory_holdings_types | folio_inventory.holdings_type |
| inventory_identifier_types | folio_inventory.identifier_type |
| inventory_ill_policies | folio_inventory.ill_policy |
| inventory_instance_formats | folio_inventory.instance_format |
| inventory_instance_note_types | folio_inventory.instance_note_type |
| inventory_instance_relationship_types | folio_inventory.instance_relationship_type |
| inventory_instance_relationships | folio_inventory.instance_relationship |
| inventory_instance_statuses | folio_inventory.instance_status |
| inventory_instance_types | folio_inventory.instance_type |
| inventory_instances | folio_inventory.instance |
| inventory_institutions | folio_inventory.locinstitution |
| inventory_item_damaged_statuses | folio_inventory.item_damaged_status |
| inventory_item_note_types | folio_inventory.item_note_type |
| inventory_items | folio_inventory.item |
| inventory_libraries | folio_inventory.loclibrary |
| inventory_loan_types | folio_inventory.loan_type |
| inventory_locations | folio_inventory.location |
| inventory_material_types | folio_inventory.material_type |
| inventory_modes_of_issuance | folio_inventory.mode_of_issuance |
| inventory_nature_of_content_terms | folio_inventory.nature_of_content_term |
| inventory_service_points | folio_inventory.service_point |
| inventory_service_points_users | folio_inventory.service_point_user |
| inventory_statistical_code_types | folio_inventory.statistical_code_type |
| inventory_statistical_codes | folio_inventory.statistical_code |
| invoice_invoices | folio_invoice.invoices |
| invoice_lines | folio_invoice.invoice_lines |
| invoice_voucher_lines | folio_invoice.voucher_lines |
| invoice_vouchers | folio_invoice.vouchers |
| organization_categories | folio_organizations.categories |
| organization_contacts | folio_organizations.contacts |
| organization_interfaces | folio_organizations.interfaces |
| organization_organizations | folio_organizations.organizations |
| po_alerts | folio_orders.alert |
| po_lines | folio_orders.po_line |
| po_order_templates | folio_orders.order_templates |
| po_pieces | folio_orders.pieces |
| po_purchase_orders | folio_orders.purchase_order |
| po_reporting_codes | folio_orders.reporting_code |
| srs_error | folio_source_record.error_records_lb |
| srs_marc | folio_source_record.marc_records_lb |
| srs_records | folio_source_record.records_lb |
| user_addresstypes | folio_users.addresstype |
| user_groups | folio_users.groups |
| user_proxiesfor | folio_users.proxyfor |
| user_users | folio_users.users |
4.1.4.2. Column names
The data column in LDP contains JSON objects. In Metadb this column appears as jsonb or in some cases content, matching the FOLIO internal column names.
4.1.4.3. Data types
In Metadb, UUIDs generally have the uuid
data type. If a UUID has the text
data type preserved from the source data, it should be cast using ::uuid
in
queries.
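For example, if a column such as instance_id had been preserved as text (a hypothetical case for illustration only), it could be cast in a join condition:

```sql
-- Hypothetical: holdings_record.instance_id preserved as text
-- is cast to uuid to join against instance.id.
SELECT h.id
    FROM folio_inventory.holdings_record AS h
        JOIN folio_inventory.instance AS i
            ON h.instance_id::uuid = i.id;
```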
Columns with the json data type in LDP have been changed to use the jsonb data type in Metadb.
4.1.4.4. JSON queries
Querying JSON in Metadb is very similar to querying it in LDP. For clarity we give a few examples below.
JSON source data
To select JSON data extracted from a FOLIO source, LDP supports:
SELECT data FROM user_groups;
In Metadb, this can be written as:
SELECT jsonb FROM folio_users.groups;
Or with easier-to-read formatting:
SELECT jsonb_pretty(jsonb) FROM folio_users.groups;
JSON fields: non-array data
For non-array JSON fields, extracting the data directly from JSON in LDP usually takes the form:
SELECT json_extract_path_text(data, 'group') FROM user_groups;
The form recommended for Metadb is:
SELECT jsonb_extract_path_text(jsonb, 'group') FROM folio_users.groups;
JSON fields: array data
To extract JSON arrays, the syntax for Metadb is similar to LDP. A lateral join can be used with the function jsonb_array_elements() to convert the elements of a JSON array to a set of rows, one row per array element.
For example, if the array elements are simple text strings:
CREATE TABLE instance_format_ids AS
SELECT id AS instance_id,
       instance_format_ids.jsonb #>> '{}' AS instance_format_id,
       instance_format_ids.ordinality
    FROM folio_inventory.instance
        CROSS JOIN LATERAL jsonb_array_elements(jsonb_extract_path(jsonb, 'instanceFormatIds'))
            WITH ORDINALITY AS instance_format_ids (jsonb);
If the array elements are JSON objects:
CREATE TABLE holdings_notes AS
SELECT id AS holdings_id,
       jsonb_extract_path_text(notes.jsonb, 'holdingsNoteTypeId')::uuid AS holdings_note_type_id,
       jsonb_extract_path_text(notes.jsonb, 'note') AS note,
       jsonb_extract_path_text(notes.jsonb, 'staffOnly')::boolean AS staff_only,
       notes.ordinality
    FROM folio_inventory.holdings_record
        CROSS JOIN LATERAL jsonb_array_elements(jsonb_extract_path(jsonb, 'notes'))
            WITH ORDINALITY AS notes (jsonb);
JSON fields as columns
LDP transforms simple, first-level JSON fields into columns, which can be queried as, for example:
SELECT id, "group", "desc" FROM user_groups;
The Metadb equivalent of this query is:
SELECT id, "group", "desc" FROM folio_users.groups__t;
Note that the double quotation marks are needed here only because group and desc are reserved words in SQL. Alternatively, they can be removed if the column names are prefixed with a table alias:
SELECT g.id, g.group, g.desc FROM folio_users.groups__t AS g;
Support for transforming subfields and arrays is planned in Metadb.
4.1.5. Configuring Metadb for FOLIO
When creating a FOLIO data source, use the module 'folio' option, and set trimschemaprefix to remove the tenant from schema names and addschemaprefix to add a folio_ prefix to the schema names. For example:
CREATE DATA SOURCE folio TYPE kafka OPTIONS (
    module 'folio',
    trimschemaprefix 'tenantname_',
    addschemaprefix 'folio_',
    brokers 'kafka:29092',
    topics '^metadb_folio_1\.',
    consumergroup 'metadb_folio_1_1',
    schemastopfilter 'admin'
);
It is recommended to use a separate Kafka cluster, rather than the FOLIO Kafka instance, until one has experience with administration of Kafka.
In the Debezium PostgreSQL connector configuration, the following exclusions are suggested:
"schema.exclude.list": "public,.*_mod_login,.*_mod_pubsub,.*pubsub_config,supertenant_mod_.*,.*_mod_kb_ebsco_java,.*_mod_data_export_spring"
"table.exclude.list": ".*__system,.*_mod_agreements.alternate_resource_name,.*_mod_service_interaction.dashboard_access,.*_mod_agreements.availability_constraint,.*_mod_agreements\\.package_description_url,.*_mod_agreements\\.content_type,.*_mod_agreements\\.entitlement_tag,.*_mod_agreements\\.erm_resource_tag,.*_mod_agreements\\.string_template,.*_mod_agreements\\.string_template_scopes,.*_mod_agreements\\.templated_url,.*_mod_oai_pmh\\.instances,.*_mod_remote_storage\\.original_locations,.*_mod_remote_storage\\.item_notes,.*app_setting,.*alternate_name,.*databasechangelog,.*databasechangeloglock,.*directory_entry_tag,.*license_document_attachment,.*license_supp_doc,.*license_tag,.*log_entry_additional_info,.*subscription_agreement_supp_doc,.*subscription_agreement_document_attachment,.*subscription_agreement_ext_lic_doc,.*subscription_agreement_tag,.*tenant_changelog,.*tenant_changelog_lock,.*marc_indexers.*,.*rmb_internal.*,.*rmb_job.*,.*_mod_agreements\\.match_key,.*system_changelog"
4.2. ReShare
4.2.1. Derived tables
ReShare "derived tables" are automatically updated once per day, by default at about 00:00:00 UTC.
Note that the derived tables are based on a periodic snapshot of data, and for this reason they are generally not up-to-date.
4.2.2. Configuring Metadb for ReShare
Before defining a ReShare data source, create a data origin for each consortial tenant. For example:
CREATE DATA ORIGIN tenant1;
CREATE DATA ORIGIN tenant2;
CREATE DATA ORIGIN tenant3;
Then use the module 'reshare' option when creating the data source, and set addschemaprefix to add a reshare_ prefix to the schema names:
CREATE DATA SOURCE reshare TYPE kafka OPTIONS (
    module 'reshare',
    addschemaprefix 'reshare_',
    brokers 'kafka:29092',
    topics '^metadb_reshare_1\.',
    consumergroup 'metadb_reshare_1_1',
    schemastopfilter 'admin'
);
Note that the order of commands is important: The initial set of data origins
should be created before the data source is created so that schema names of
incoming data will be processed correctly. Later, whenever a new consortial
tenant is to be added, it should be defined in Metadb using CREATE DATA
ORIGIN
(and the server restarted) before the tenant is added to ReShare.
In the Debezium PostgreSQL connector configuration, it is suggested that credentials (.+mod_login), the public schema, the Okapi supertenant (supertenant_mod_.+), and mod-pubsub data (pubsub_config,.+_mod_pubsub) be excluded using the schema.exclude.list setting.
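Adapting the FOLIO example given earlier, one possible form of the setting is sketched below; the exact patterns should be adjusted for the local tenant configuration:

```
"schema.exclude.list": "public,.*_mod_login,.*pubsub_config,supertenant_mod_.*,.*_mod_pubsub"
```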
4.3. MARC transform for LDP
Metadb 1.1
The MARC transform in Metadb can also be used with LDP (and LDLite). A command-line tool called marct is provided, which is a drop-in replacement for ldpmarc.
The system requirements are a subset of those for Metadb:
-
Local storage: 500 GB
-
Database storage: 500 GB
-
Operating system: Linux
-
PostgreSQL 15 or later
-
Go 1.20 or later
To build marct:
mkdir -p bin && go build -o bin ./cmd/marct
which creates a bin/ subdirectory and builds the marct executable there:
./bin/marct -h
In LDP, MARC data are read from the tables public.srs_marc and public.srs_records, and the transformed output is written to the table public.srs_marctab.
Typical usage is:
./bin/marct -D <datadir> -u <ldp_user>
where datadir is an LDP data directory containing ldpconf.json, and ldp_user is an LDP user to be granted SELECT privileges on the output table.
For example:
./bin/marct -D data -u ldp
Note that marct only grants privileges for a single user. If individual user accounts are configured for LDP, a shell script can be used to grant privileges to the users, for example:
users=/path/to/list/of/users.txt

for u in $(cat $users); do
    psql -c "GRANT SELECT ON public.srs_marctab TO $u ;"
done
The first time marct runs, it will perform a "full update" of all of the MARC records. In subsequent runs, it will attempt to use "incremental update" to update only records that have changed since the previous run, which can dramatically reduce the running time if the number of changes is small. However, if very many records have changed, an incremental update may take longer than a full update. If it appears that an incremental update will never finish, it should be canceled, and a full update should be run once before resuming incremental updates. This can be done with the -f command-line option, which disables incremental update and requires marct to do a full update.
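For example, to force a full update using the invocation shown earlier:

```
./bin/marct -D data -u ldp -f
```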