User guide
This is an overview of using Metadb. We assume familiarity with databases and the basics of SQL.
Getting started
Metadb extends PostgreSQL with features to support analytics such as streaming data sources, data model transforms, and historical data. The data contained in the Metadb database originally come from another place: a data source which could be, for example, a transaction-processing database or a sensor network. Metadb updates its database continuously based on state changes in external data sources.
Main tables
Tables generated by Metadb have at least these metadata columns, with names that begin with two underscores:
- __id is a surrogate key that identifies a row in the table.
- __start is the date and time when the row of data was generated.
- __end is the date and time when the row of data became no longer current, or 9999-12-31 00:00:00+00 if the row is still current.
- __current is a Boolean value that indicates whether the row is current.
- __source identifies the data source. The value of __source is the same for all rows that have been read from the same data source.
- __origin is an optional identifier used to group related data from one or more data sources. For example, Metadb can combine data from similar source tables into a single table in the analytic database, and stores an identifier in __origin to record where the data came from.
A main table in Metadb has two underscores at the end of its name (e.g., patrongroup__), and it contains both the current state and the history of all previous states provided by the data source. For instance, consider a main table that stores categories of patrons in a library:
SELECT __start, __end, __current, id, groupname, description FROM library.patrongroup__;
__start | __end | __current | id | groupname | description
---|---|---|---|---|---
In this example, the "undergrad" group with id = 15 has two rows because it was modified on 2022-04-18 19:27:18-00, changing description from 'Student' to 'Undergraduate Student'.
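Because __start and __end bracket the period during which each row was current, a main table can also be queried for the state of the data as of an earlier point in time. A sketch, using an arbitrary example date:
SELECT id, groupname, description
    FROM library.patrongroup__
    WHERE __start <= '2022-01-01' AND __end > '2022-01-01';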
Current tables
It is often desirable to limit a query to retrieving only current records. This can be done by filtering on __current = TRUE, for example:
SELECT id, groupname, description FROM library.patrongroup__ WHERE __current;
For convenience, since this filter is so often used, Metadb provides access to a current table which contains only current records. Every main table has a corresponding current table, which shares the same name but without the two underscores at the end. For instance, the following query is equivalent to the example above:
SELECT id, groupname, description FROM library.patrongroup;
A current table reflects the current state of each row in the data source, as of the last time the row was updated.
To take another example:
SELECT __id, __start, __source, __origin, id, groupname, description FROM library.patrongroup;
__id | __start | __source | __origin | id | groupname | description
---|---|---|---|---|---|---
There is a distinction between the __id and id columns. The __id column is a key defined by Metadb to identify each row uniquely; it is present in all tables generated by Metadb. In contrast, the id column only happens to be present in this sample table and has been provided by the data source; it may or may not be a key, and its name, content, or significance may change if schema changes are made in the data source.
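Since a source-provided column such as id may or may not be a key, it can be worth checking it for duplicate values before relying on it in joins. A minimal sketch:
SELECT id, count(*)
    FROM library.patrongroup
    GROUP BY id
    HAVING count(*) > 1;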
Transformed tables
Metadb applies transformations to some kinds of data, which results in additional tables being created. One category of data that are transformed in this way is JSON objects. Suppose that our patrongroup current table contains the groupname and description data in JSON fields, for example:
SELECT __id, __start, id, jsondata FROM library.patrongroup;
__id | __start | id | jsondata
---|---|---|---
 | | | { "groupname": "undergrad", "description": "Undergraduate Student" }
 | | | { "groupname": "graduate", "description": "Graduate Student" }
 | | | { "groupname": "faculty", "description": "Faculty Member" }
 | | | { "groupname": "staff", "description": "Staff Member" }
The JSON fields can be selected using SQL:
SELECT __id,
__start,
id,
jsondata->>'groupname' AS groupname,
jsondata->>'description' AS description
FROM library.patrongroup;
However, Metadb also generates a transformed table called patrongroup__t. Transformed tables based on JSON objects contain JSON fields extracted into columns, which allows for simpler queries.
SELECT __id, __start, id, groupname, description FROM library.patrongroup__t;
__id | __start | id | groupname | description
---|---|---|---|---
 | | | undergrad | Undergraduate Student
 | | | graduate | Graduate Student
 | | | faculty | Faculty Member
 | | | staff | Staff Member
In the current version of Metadb, only top-level, scalar JSON fields are extracted into transformed tables.
Main tables are also transformed in the same way. In this case the main transformed table would be called patrongroup__t__.
Note that JSON data are treated as "schemaless," and fields are inferred from their presence in the data rather than read from a JSON schema. As a result, a column is only created from a JSON field if the field is present in at least one JSON record.
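JSON fields that are not extracted, such as nested objects or arrays, can still be queried from the jsondata column of the current table using PostgreSQL's JSON operators. A sketch, where address and city are hypothetical nested fields used only for illustration:
SELECT id,
    jsondata->'address'->>'city' AS city  -- 'address' and 'city' are hypothetical fields
    FROM library.patrongroup;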
Comparing table types
To summarize the types of tables that we have covered:
Table name | Table type | Description
---|---|---
patrongroup__ | Main table | Current and historical records
patrongroup | Current table | Current records only
patrongroup__t__ | Main transformed table | Transformed versions of the records in patrongroup__
patrongroup__t | Current transformed table | Transformed versions of the records in patrongroup
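The difference between a main table and its current table can be seen by comparing row counts; the main table includes historical rows in addition to the current ones. For example:
SELECT count(*) FROM library.patrongroup__;
SELECT count(*) FROM library.patrongroup;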
In addition to these table types, some partition tables have names that begin with zzz___. It is recommended that these tables not be used directly in queries, and access to them is generally disabled.
Workspaces
In general, users do not have privileges to modify or create tables in the database. However, Metadb creates a schema for each user where the user does have these privileges, and this serves as an individual workspace. The schema has the same name as the user name. This can be useful as an area for saving the results of queries or importing external data sets.
When using one’s own workspace, it is not necessary to specify the schema name; it will be used by default if no schema is specified.
For example, this table westgroup created by user flopsy will be created in schema flopsy:
CREATE TABLE westgroup AS SELECT * FROM library.patrongroup WHERE __origin = 'west';
SELECT * FROM westgroup LIMIT 20;
By default, other users do not have privileges to access the table.
Metadb v1.0.6
To grant the user mopsy read-only access to the schema and table:
GRANT USAGE ON SCHEMA flopsy TO mopsy;
GRANT SELECT ON westgroup TO mopsy;
The mopsy user will have to include the schema name to access the table:
SELECT * FROM flopsy.westgroup LIMIT 20;
Database views
Database views are not supported and should not be created in a Metadb database.
Views can create dependencies that may block required schema changes during data updates. They also hide query complexity, which in analytical workloads can be significant.
Querying system information
Metadb version
Metadb v1.0.5
To show the current Metadb version:
SELECT mdbversion();
Data update times
Information about when certain tables were last updated is available in a table called metadb.table_update:
SELECT * FROM metadb.table_update ORDER BY schemaname, tablename;
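For example, the following query lists tables that have not been updated in the past day (the one-day interval is only illustrative):
SELECT schemaname, tablename, updated
    FROM metadb.table_update
    WHERE updated < now() - interval '1 day'
    ORDER BY updated;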
System log
Metadb reports errors and other information in a logging table called metadb.log. For instance, to view all log messages from January 2023:
SELECT * FROM metadb.log WHERE log_time >= '2023-01-01' AND log_time < '2023-02-01' ORDER BY log_time;
Metadb v1.0.5
The mdblog() function is provided for convenience and returns ordered log messages, from the past 24 hours by default (or optionally from a specified interval):
SELECT * FROM mdblog();
Status of query process
When executing a long-running query, the ps() function can be used to check on the status of the query process from another session:
SELECT * FROM ps();
Reference
Stream processor
Data type conversion
When reading data from a data source, Metadb will automatically adjust column data types in a number of cases:
- A data type in the source has changed in a way that cannot be applied safely by Metadb without more information.
- A data type in the source has changed to a type that is not supported by Metadb.
- A data type in the source can be converted safely to a type that is more performant.
This table summarizes the type adjustments that are performed:
Data type conversions | To numeric | To uuid | To jsonb | To varchar
---|---|---|---|---
From boolean | | | | ✅
From smallint | ✅ | | | ✅
From integer | ✅ | | | ✅
From bigint | ✅ | | | ✅
From real | ✅ | | | ✅
From double precision | ✅ | | | ✅
From numeric | | | | ✅
From date | | | | ✅
From time | | | | ✅
From time with timezone | | | | ✅
From timestamp | | | | ✅
From timestamp with timezone | | | | ✅
From uuid | | | | ✅
From json | | | ✅ | ✅
From jsonb | | | | ✅
From varchar | | ✅ | |
Functions
System information
Name | Return type | Description
---|---|---
mdblog(interval) | table ( log_time timestamptz(3), error_severity text, message text ) | Returns recent log messages during the specified interval, or during the past 24 hours if no interval is given
mdbversion() | text | Returns the Metadb version
ps() | table ( dbname text, username text, state text, realtime text, query text ) | Returns the status and elapsed running time of current query processes
Examples
Show log messages from the past hour, and from the past 24 hours:
SELECT * FROM mdblog(interval '1 hour');
SELECT * FROM mdblog();
Metadb v1.0.5
Show the current Metadb version:
SELECT mdbversion();
System tables
metadb.log
Column name | Column type | Description
---|---|---
log_time | timestamptz(3) | Timestamp when the log entry was written
error_severity | text | Logging level (from least to most severe): INFO, WARNING, ERROR, FATAL
message | text | The log message
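For example, to review only the more severe messages:
SELECT log_time, error_severity, message
    FROM metadb.log
    WHERE error_severity IN ('ERROR', 'FATAL')
    ORDER BY log_time DESC;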
metadb.table_update
Column name | Column type | Description
---|---|---
schemaname | text | Schema name of the updated table
tablename | text | Table name of the updated table
updated | timestamptz | Timestamp when the table was last updated
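For example, to look up when a single table was last updated, using the library.patrongroup table from earlier examples:
SELECT updated
    FROM metadb.table_update
    WHERE schemaname = 'library' AND tablename = 'patrongroup';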
External SQL directives
Metadb v1.0.5
Metadb allows scheduling external SQL files to run on a regular basis.
Each SQL statement should be separated from others by an empty line, and any tables created should not specify a schema name.
Comment lines beginning with --metadb: are used for additional directives; each directive should be on a separate line.
It is suggested that each SQL file begin with a --metadb:table directive, optionally followed by --metadb:require directives if needed, and then an empty line, for example:
--metadb:table library_patrons
--metadb:require library.patrongroup.id uuid
--metadb:require library.patrongroup.name text

DROP TABLE IF EXISTS library_patrons;

CREATE TABLE library_patrons AS
SELECT . . . ;
--metadb:require
Metadb v1.0.5
The --metadb:require directive asserts a dependency on a specific table and column in the database. Metadb will ensure that the table and column exist before it continues executing the SQL file. The required table can be a main table or a current table; both will be created if either is specified and does not exist.
The directive takes the form:
--metadb:require <schema>.<table>.<column> <columntype>
For example:
--metadb:require library.patrongroup.id uuid
--metadb:require library.patrongroup.name text
There is generally no need to list all dependencies. The intent of this directive is to reduce SQL errors by listing tables and columns that are likely or known to be missing in some instances.
--metadb:table
Metadb v1.0.5
The --metadb:table directive declares that the SQL file updates a specific table. This allows Metadb to report on the status of the table. The directive takes the form:
--metadb:table <table>
The specified table should not contain a schema name.
For example:
--metadb:table user_group
Statements
Metadb extends SQL with statements for configuring and administering the server. These statements are only available when connecting to the Metadb server (not the database).
ALTER DATA SOURCE
Change the configuration of a data source
ALTER DATA SOURCE source_name
OPTIONS ( [ ADD | SET | DROP ] option ['value'] [, ... ] )
Description
ALTER DATA SOURCE changes connection settings for a data source.
Parameters
source_name | The name of an existing data source.
option ['value'] | Connection settings and other configuration options for the data source.
Options
See CREATE DATA SOURCE
Examples
Change the consumer group:
ALTER DATA SOURCE sensor OPTIONS (SET consumergroup 'metadb_sensor_1');
AUTHORIZE
Enable access to tables generated from an external data source
AUTHORIZE SELECT
ON ALL TABLES IN DATA SOURCE source_name
TO role_specification
Description
The AUTHORIZE command grants access to tables. It differs from GRANT in that the authorization will also apply to tables created at a later time in the data source.
Parameters
source_name | The name of an existing data source.
role_specification | An existing role to be granted the authorization.
Examples
AUTHORIZE SELECT ON ALL TABLES IN DATA SOURCE sensor TO beatrice;
CREATE DATA ORIGIN
Define a new data origin
CREATE DATA ORIGIN origin_name
Description
CREATE DATA ORIGIN defines a new origin.
Parameters
origin_name | A unique name for the data origin to be created.
Examples
Create a new origin test_origin:
CREATE DATA ORIGIN test_origin;
CREATE DATA SOURCE
Define a new external data source
CREATE DATA SOURCE source_name
TYPE source_type
OPTIONS ( option 'value' [, ... ] )
Description
CREATE DATA SOURCE defines connection settings for an external data source.
Parameters
source_name | A unique name for the data source to be created.
source_type | The type of data source. The only type currently supported is kafka.
option 'value' | Connection settings and other configuration options for the data source.
Options for data source type "kafka"
brokers | Kafka bootstrap servers (comma-separated list).
security | Security protocol.
topics | Regular expressions matching Kafka topics to read (comma-separated list).
consumergroup | Kafka consumer group ID.
schemapassfilter | Regular expressions matching schema names to accept (comma-separated list).
schemastopfilter | Regular expressions matching schema names to ignore (comma-separated list), evaluated after schemapassfilter.
trimschemaprefix | Prefix to remove from schema names.
addschemaprefix | Prefix to add to schema names.
module | Name of pre-defined configuration.
Examples
Create sensor as a kafka data source:
CREATE DATA SOURCE sensor TYPE kafka OPTIONS ( brokers 'kafka:29092', topics '^metadb_sensor_1.*', consumergroup 'metadb_sensor_1_1', addschemaprefix 'sensor_' );
CREATE USER
CREATE USER is mostly equivalent to the PostgreSQL CREATE USER statement. However it also creates a schema having the same name as the user, if the schema does not exist, and grants the user privileges on the schema. The schema is intended as a workspace for the user.
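For example, to create a user together with its workspace schema (the user name is borrowed from an earlier example, and the password is only a placeholder):
CREATE USER beatrice PASSWORD 'examplepassword';  -- password shown is only a placeholder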
DROP DATA SOURCE
Remove a data source configuration
DROP DATA SOURCE source_name
Description
DROP DATA SOURCE removes a data source configuration.
Parameters
source_name | The name of an existing data source.
Examples
Drop a data source sensor:
DROP DATA SOURCE sensor;
LIST
Show the value of a system variable
LIST name
Description
LIST shows the current setting of various system configurations and other variables.
Parameters
authorizations | Authorized users.
data_origins | Configured data origins.
data_sources | Configured data sources.
status | Current status of system components.
Examples
LIST status;
Server administration
System requirements
Hardware requirements
- Metadb software:
  - CPU: 2 cores
  - Memory: 2 GB
  - Storage: 500 GB SSD
- PostgreSQL database:
  - CPU: 4 cores
  - Memory: 32 GB
  - Storage: 2 TB SSD
Software requirements
- Architecture: x86-64 (AMD64)
- Operating system: Debian 11
- PostgreSQL 14 or later
  - PostgreSQL 15 or later is recommended
  - AWS RDS PostgreSQL optionally may be used (with servers in the same zone/subnet); Aurora is not supported
- Required to build from source:
  - Go 1.19 or later
  - goyacc (installation instructions below)
  - Ragel 6.10 or later
  - GCC C compiler 9.3.0 or later
PostgreSQL configuration
- checkpoint_timeout: 3600
- cpu_tuple_cost: 0.03
- default_statistics_target: 1000
- effective_io_concurrency: 200
- idle_in_transaction_session_timeout: 3600000
- idle_session_timeout: 604800000
- maintenance_work_mem: 1000000
- max_wal_size: 10240
- shared_buffers: 1250000
- statement_timeout: 3600000
- work_mem: 350000
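These settings are normally placed in postgresql.conf; as a sketch, they can also be applied with ALTER SYSTEM using the raw values listed above (some of them, such as shared_buffers, take effect only after a server restart):
ALTER SYSTEM SET checkpoint_timeout = 3600;
ALTER SYSTEM SET cpu_tuple_cost = 0.03;
ALTER SYSTEM SET default_statistics_target = 1000;
ALTER SYSTEM SET effective_io_concurrency = 200;
ALTER SYSTEM SET idle_in_transaction_session_timeout = 3600000;
ALTER SYSTEM SET idle_session_timeout = 604800000;
ALTER SYSTEM SET maintenance_work_mem = 1000000;
ALTER SYSTEM SET max_wal_size = 10240;
ALTER SYSTEM SET shared_buffers = 1250000;
ALTER SYSTEM SET statement_timeout = 3600000;
ALTER SYSTEM SET work_mem = 350000;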
Building the software
First set the GOPATH environment variable to specify a path that can serve as the build workspace for Go, e.g.:
$ export GOPATH=$HOME/go
Then install goyacc:
go install golang.org/x/tools/cmd/goyacc@master
export PATH=$PATH:$GOPATH/bin
Then to build the server:
$ ./build.sh
The build.sh script creates a bin/ subdirectory and builds the metadb executable there:
$ ./bin/metadb help
It is suggested that a metadb user be created and the server run by that user, for example, in /home/metadb.
Server configuration
Metadb makes use of local storage in a data directory which is created using metadb with the init command. In this example we will create the data directory as data/:
metadb init -D data
This will also create a file metadb.conf in the data directory with the following structure:
[main]
host = host name of the PostgreSQL server
port = port number of the PostgreSQL server
database = database name, typically "metadb"
superuser = database superuser such as "postgres"
superuser_password = password of superuser
systemuser = database user that is the owner of the database
systemuser_password = password of systemuser
sslmode = should be set to "require" or stronger option
This file should be edited to add database connection parameters, for example:
[main]
host = a.b.c
port = 5432
database = metadb
superuser = postgres
superuser_password = zpreCaWS7S79dt82zgvD
systemuser = mdbadmin
systemuser_password = ZHivGie5juxGJZmTObHU
sslmode = require
Metadb will assume that the database, superuser, and systemuser defined here already exist; so they should be created before continuing.
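For example, a minimal sketch of creating the database and system user named in the sample configuration above, run with psql as the postgres superuser (the password is only the sample value from the configuration):
-- names and password taken from the sample metadb.conf above
CREATE USER mdbadmin PASSWORD 'ZHivGie5juxGJZmTObHU';
CREATE DATABASE metadb OWNER mdbadmin;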
Backups
It is essential to make regular backups of Metadb and to test the backups.
In general, persistent data are stored in the database, and therefore the database is the most important system that should be backed up often.
The data directory contains the metadb.conf configuration file and is also used for temporary storage. The metadb.conf file should be backed up, or alternatively it should be possible to reconstruct the file’s contents.
Upgrading from a previous version
To upgrade from any previous version of Metadb, stop the server (if running), and then run the upgrade process in case changes are required:
metadb upgrade -D data
The upgrade process may, in some instances, take a significant amount of time to run. The database generally remains available to users during this period.
If no changes are needed, the process outputs:
metadb: "data" is up to date
Running the server
To start the server:
nohup metadb start -D data -l metadb.log &
The --memlimit option can be used to set a soft memory limit (in GiB) if needed, for example:
nohup metadb start -D data -l metadb.log --memlimit 2 &
The server listens on port 8440 by default, but this can be set using the --port option. The --debug option enables verbose logging.
To stop the server:
metadb stop -D data
Note that stopping or restarting the server may delay scheduled data updates or cause them to restart.
The server can be set up to run with systemd via a file such as /etc/systemd/system/metadb.service, for example:
[Unit]
Description=Metadb
After=network.target remote-fs.target
[Service]
Type=simple
User=metadb
ExecStart=/bin/bash -ce "exec /home/metadb/bin/metadb start -D /home/metadb/data -l /home/metadb/metadb.log"
Restart=on-abort
[Install]
WantedBy=multi-user.target
Then:
systemctl enable metadb
systemctl start metadb
Connecting to the server
The PostgreSQL terminal client, psql, can be used to connect to the Metadb server:
psql -h localhost -p 8440
Metadb v1.0.5
Then, for example, to show the software version:
SELECT mdbversion();
Configuring a Kafka data source
A data source is defined using the CREATE DATA SOURCE statement, for example:
CREATE DATA SOURCE sensor TYPE kafka OPTIONS ( brokers 'kafka:29092', topics '^metadb_sensor_1.*', consumergroup 'metadb_sensor_1_1', addschemaprefix 'sensor_', schemastopfilter 'admin' );
Metadb currently supports reading Kafka messages in the format produced by the Debezium PostgreSQL connector for Kafka Connect. Configuration of Kafka, Kafka Connect, Debezium, and PostgreSQL logical decoding is beyond the scope of this documentation, but a few notes are included here.
Data flows in this direction:
- A source PostgreSQL database
- Kafka Connect/Debezium
- Kafka
- Metadb
- The analytics database
To allow capturing data changes in the source PostgreSQL database, logical decoding has to be enabled, in particular by setting wal_level = logical in postgresql.conf.
Next we create a connector configuration file for Kafka Connect:
{ "name": "sensor-1-connector", "config": { "connector.class": "io.debezium.connector.postgresql.PostgresConnector", "database.dbname": "sourcedb", "database.hostname": "example.host.name", "database.password": "eHrkGrZL8mMJOFgToqqL", "database.port": "5432", "database.server.name": "metadb_sensor_1", "database.user": "dbuser", "plugin.name": "pgoutput", "snapshot.mode": "exported", "tasks.max": "1", "truncate.handling.mode": "include", "publication.autocreate.mode", "filtered" "heartbeat.interval.ms": "30000", "heartbeat.action.query": "UPDATE admin.heartbeat set last_heartbeat_ts = now();" } }
It is recommended to use the connector configuration settings heartbeat.interval.ms and heartbeat.action.query as above to avoid spikes in disk space consumption within the source database. (See the Debezium PostgreSQL connector documentation for more details.) The schemastopfilter option of the CREATE DATA SOURCE command is used to filter out the heartbeat table.
In the source database:
CREATE SCHEMA admin;
CREATE TABLE admin.heartbeat (last_heartbeat_ts timestamptz PRIMARY KEY);
INSERT INTO admin.heartbeat (last_heartbeat_ts) VALUES (now());
Then to create the connector:
curl -X POST -i -H "Accept: application/json" -H "Content-Type: application/json" \
    -d @connector.json https://kafka.connect.server/connectors
Note the 1 included in name and database.server.name in the connector configuration. This is suggested as a version number, which can be incremented if the data stream needs to be resynchronized with a new connector.
Metadb requires all streamed tables to have a primary key defined or a replica identity of FULL. Tables that do not meet this requirement should be filtered out in the Debezium PostgreSQL connector configuration by setting schema.exclude.list or table.exclude.list. Otherwise they will generate error messages in the Metadb log.
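Alternatively, a table that has no primary key can be given a full replica identity in the source database so that it can be streamed; a sketch with a hypothetical table name:
ALTER TABLE myschema.mytable REPLICA IDENTITY FULL;  -- myschema.mytable is a hypothetical table name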
The replication slot disk usage must be monitored, because under certain error conditions it can grow too large and possibly fill up the disk. To show the disk usage (in the source database):
select slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) as replicationSlotLag, active from pg_replication_slots;
It is also recommended to allocate plenty of extra disk space in the source database.
To drop the replication slot (in the source database) after deleting a connector:
SELECT pg_drop_replication_slot('debezium');
DROP PUBLICATION dbz_publication;
Resynchronizing a data source
If a Kafka data stream fails and cannot be resumed, it may be necessary to re-stream data to Metadb. For example, a source database may become unsynchronized with the analytic database, requiring a new snapshot of the source database to be streamed. Metadb can accept re-streamed data in order to resynchronize with the source, using the following procedure.
Note that during the resynchronization process, the analytics database continues to be available to users. However, streaming updates will be slower than usual, and there temporarily may be missing records (until they are re-streamed) or "extra" records (recently deleted in the source database).
1. Update the topics and consumergroup configuration settings for the new data stream:

   ALTER DATA SOURCE sensor OPTIONS (SET topics '^metadb_sensor_2.*', SET consumergroup 'metadb_sensor_2_1');

   Do not restart the Metadb server but continue directly to Step 2.

2. Stop the Metadb server and (before starting it again) "reset" the analytics database to mark all current data as old. This may take some time to run.

   metadb stop -D data
   metadb reset -D data --source sensor

3. Start the Metadb server to begin streaming the data.

4. Once the new data have finished (or nearly finished) re-streaming, stop the Metadb server, and "clean" the analytics database to remove any old data that have not been refreshed by the new data stream.

   metadb clean -D data --source sensor

   Note that the Metadb server currently does not give an indication that it has finished re-streaming. (Having the server report that initial streaming or re-streaming has finished is a planned feature.)

   The precise timing of when "metadb clean" should be run is up to the administrator, but it must be run to remove redundant data and to complete the resynchronization process. The only disadvantage of running it too early is that data will appear to be missing until they are re-streamed.

5. Start the server.
Until a failed stream is re-streamed by following the process above, the analytic database may continue to be unsynchronized with the source.
Configuring database users
To create a new database user account:
CREATE USER weg PASSWORD 'LZn2DCajcNHpGR3ZXWHD';
In addition to creating the user, this also creates a schema with the same name as the user which can be used as a workspace.
We recommend user names of 3 to 8 characters in length.
To give the user access to all tables generated from a data source (including tables not yet created):
AUTHORIZE SELECT ON ALL TABLES IN DATA SOURCE sensor TO weg;
Administrative database changes
It is possible to make administrative-level changes directly in the underlying PostgreSQL database, such as providing additional tables for users. However, the following guidelines should be followed strictly to avoid disrupting the operation of Metadb:
- No changes should be made to any database objects created by Metadb. If it should become necessary to make changes to the database objects at the request of the Metadb maintainers, the server should be stopped first to prevent it from operating with an out-of-date cache. If changes are made to the database objects inadvertently, the server should be stopped as soon as possible and not started until the changes have been reversed.
- Changes generally should be made while logged in as a new database user (not a superuser) that is created specifically for that purpose.
- Any new database objects should be created in a new schema that will not coincide with a schema that may be created by Metadb. This can be ensured by always setting addschemaprefix in data source configurations and avoiding names with those prefixes when creating a new schema.
- Database views should not be created in the database.
External projects
FOLIO
Configuration
When creating a FOLIO data source, use the module 'folio' option, and set trimschemaprefix to remove the tenant from schema names and addschemaprefix to add a folio_ prefix to the schema names. For example:
CREATE DATA SOURCE folio TYPE kafka OPTIONS ( module 'folio', trimschemaprefix 'tenantname_', addschemaprefix 'folio_', brokers 'kafka:29092', topics '^metadb_folio_1.*', consumergroup 'metadb_folio_1_1', schemastopfilter 'admin' );
It is recommended to use a separate Kafka cluster, rather than the FOLIO Kafka instance, until one has experience with administration of Kafka.
In the Debezium PostgreSQL connector configuration, the following exclusions are suggested:
"schema.exclude.list": "public,.*_mod_login,.*_mod_pubsub,.*pubsub_config,supertenant_mod_.*,.*_mod_kb_ebsco_java,.*_mod_data_export_spring"
"table.exclude.list": ".*_mod_agreements.alternate_resource_name,.*_mod_service_interaction.dashboard_access,.*_mod_agreements.availability_constraint,.*_mod_agreements\\.package_description_url,.*_mod_agreements\\.content_type,.*_mod_agreements\\.entitlement_tag,.*_mod_agreements\\.erm_resource_tag,.*_mod_agreements\\.string_template,.*_mod_agreements\\.string_template_scopes,.*_mod_agreements\\.templated_url,.*_mod_oai_pmh\\.instances,.*_mod_remote_storage\\.original_locations,.*_mod_remote_storage\\.item_notes,.*app_setting,.*alternate_name,.*databasechangelog,.*databasechangeloglock,.*directory_entry_tag,.*license_document_attachment,.*license_supp_doc,.*license_tag,.*log_entry_additional_info,.*subscription_agreement_supp_doc,.*subscription_agreement_document_attachment,.*subscription_agreement_ext_lic_doc,.*subscription_agreement_tag,.*tenant_changelog,.*tenant_changelog_lock,.*marc_indexers.*,.*rmb_internal.*,.*rmb_job.*,.*_mod_agreements\\.match_key,.*system_changelog"
MARC transformation
Metadb v1.0.5
Metadb transforms MARC records from the tables marc_records_lb and records_lb in schema folio_source_record to a tabular form, which is stored in a new table, folio_source_record.marc__t. This transformation is usually updated every few hours. The time of the most recent update can be retrieved via SQL:
SELECT updated FROM metadb.table_update WHERE schemaname = 'folio_source_record' AND tablename = 'marc__t';
Derived tables
FOLIO derived tables are automatically updated once per day, usually at about 00:00:00 UTC by default.
Documentation
FOLIO does not provide documentation for its internal data, which Metadb tables are based on, but it does have some data documentation for its "storage module" APIs, which are roughly equivalent. This is located at https://dev.folio.org/reference/api/. The name of most storage modules ends in -storage, but some modules use a different convention; for instance, the storage module for users is mod-users. (All module names begin with mod-.)
ReShare
Before defining a ReShare data source, create a data origin for each consortial tenant. For example:
CREATE DATA ORIGIN tenant1;
CREATE DATA ORIGIN tenant2;
CREATE DATA ORIGIN tenant3;
Then use the module 'reshare' option when creating the data source, and set addschemaprefix to add a reshare_ prefix to the schema names:
CREATE DATA SOURCE reshare TYPE kafka OPTIONS ( module 'reshare', addschemaprefix 'reshare_', brokers 'kafka:29092', topics '^metadb_reshare_1.*', consumergroup 'metadb_reshare_1_1', schemastopfilter 'admin' );
Note that the order of commands is important: the initial set of data origins should be created before the data source is created so that schema names of incoming data will be processed correctly. Later, whenever a new consortial tenant is to be added, it should be defined in Metadb using CREATE DATA ORIGIN (and the server restarted) before the tenant is added to ReShare.
In the Debezium PostgreSQL connector configuration, it is suggested that credentials (.+mod_login), the public schema, the Okapi supertenant (supertenant_mod_.+), and mod-pubsub data (pubsub_config,.+_mod_pubsub) be excluded using the schema.exclude.list setting.
Derived tables
ReShare derived tables are automatically updated once per day, usually at about 00:00:00 UTC by default.