1. User guide
This is an overview of using Metadb. We assume familiarity with databases and the basics of SQL.
1.1. Getting started
Metadb extends PostgreSQL with features to support analytics such as streaming data sources, data model transforms, and historical data. The data contained in the Metadb database originally come from another place: a data source which could be, for example, a transaction-processing database or a sensor network. Metadb updates its database continuously based on state changes in an external data source. Also part of the design of Metadb is to support multiple data sources, although this is not yet fully implemented.
1.2. Main tables
Tables generated by Metadb have at least these metadata columns, with names that begin with two underscores:
- __id is a surrogate key that identifies a row in the table.
- __start is the date and time when the row of data was generated.
- __end is the date and time when the row of data became no longer current, or 9999-12-31 00:00:00+00 if the row is still current.
- __current is a Boolean value that indicates whether the row is current.
- __origin is an optional identifier used to group related data from one or more data sources. For example, Metadb can combine data from similar source tables into a single table in the analytic database, and in that case it stores an identifier in __origin to record where the data came from.
A main table in Metadb has two underscores at the end of its name (e.g., patrongroup__), and it contains both the current state and the history of all previous states provided by the data source. For instance, consider a main table that stores categories of patrons in a library:
SELECT __start, __end, __current, id, groupname, description FROM library.patrongroup__;
(Example output omitted.)
In this example, the "undergrad" group with id = 15 has two rows because it was modified on 2022-04-18 19:27:18-00, changing description from 'Student' to 'Undergraduate Student'.
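For example, to view the full history of a single patron group from the main table, using the sample data above:
SELECT __start, __end, __current, groupname, description
    FROM library.patrongroup__
    WHERE id = 15
    ORDER BY __start;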
1.3. Current tables
It is often desirable to limit a query to retrieving only current records. This can be done by filtering on __current = TRUE, for example:
SELECT id, groupname, description FROM library.patrongroup__ WHERE __current;
For convenience, since this filter is so often used, Metadb provides access to a current table which contains only current records. Every main table has a corresponding current table, which shares the same name but without the two underscores at the end. For instance, the following query is equivalent to the example above:
SELECT id, groupname, description FROM library.patrongroup;
A current table reflects the current state of each row in the data source, as of the last time the row was updated.
To take another example:
SELECT __id, __start, __origin, id, groupname, description FROM library.patrongroup;
(Example output omitted.)
There is a distinction between the __id and id columns. The __id column is a key defined by Metadb to identify each row uniquely; it is present in all tables generated by Metadb. In contrast, the id column only happens to be present in this sample table and has been provided by the data source; it may or may not be a key, and its name, content, or significance may change if schema changes are made in the data source.
Current tables are sometimes referred to as "base tables" because they match the original table names in the data source.
1.4. Transformed tables
Metadb can apply transformations to data, which results in additional tables being created. One category of data that can be transformed in this way is JSON objects. Suppose that our patrongroup current table contains the groupname and description data in JSON fields, for example:
SELECT __id, __start, id, jsondata FROM library.patrongroup;
__id | __start | id | jsondata |
---|---|---|---|
… | … | … | { "groupname": "undergrad", "description": "Undergraduate Student" } |
… | … | … | { "groupname": "graduate", "description": "Graduate Student" } |
… | … | … | { "groupname": "faculty", "description": "Faculty Member" } |
… | … | … | { "groupname": "staff", "description": "Staff Member" } |
The JSON fields can be selected using SQL:
SELECT __id,
__start,
id,
jsonb_extract_path_text(jsondata, 'groupname') AS groupname,
jsonb_extract_path_text(jsondata, 'description') AS description
FROM library.patrongroup;
Metadb can be configured to generate a transformed table, which in our example will be called patrongroup__t. The JSON transformation extracts JSON fields into columns, which allows for simpler queries.
To define the transformation, the Metadb system administrator can run this command:
CREATE DATA MAPPING FOR json FROM TABLE library.patrongroup__ COLUMN jsondata PATH '$' TO 't';
The new transformation will take effect when the JSON data are next updated. Then the query can be written as:
SELECT __id, __start, id, groupname, description FROM library.patrongroup__t;
(Example output omitted.)
The corresponding main table is also transformed. In this case the main transformed table would be called patrongroup__t__.
Note that JSON data are treated as "schemaless," and fields are inferred from their presence in the data rather than read from a JSON schema. As a result, a column is only created from a JSON field if the field is present in at least one JSON record. A new column may be added dynamically at any time when a record containing a new JSON field is streamed.
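For example, before relying on a transformed column, a query such as the following can check whether a field has appeared in the source JSON at all (a sketch against the sample table above; newfield is a hypothetical field name):
SELECT count(*)
    FROM library.patrongroup
    WHERE jsonb_extract_path_text(jsondata, 'newfield') IS NOT NULL;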
1.5. Comparing table types
To summarize the types of tables that we have covered:
Table name | Table type | Description |
---|---|---|
patrongroup__ | Main table | Current and historical records |
patrongroup | Current table (base table) | Current records only |
patrongroup__{X}__ | Main transformed table | Transformed versions of the records in patrongroup__ |
patrongroup__{X} | Current transformed table | Transformed versions of the records in patrongroup |
where {X} is some identifier configured by the Metadb system administrator. In the previous example, it was simply t. There can be more than one transformed table, with different identifiers.
In addition to these table types, some partition tables have names that begin with zzz___. It is recommended that these tables not be used directly in queries, and access to them is generally disabled.
1.6. User workspaces
In general, users do not have privileges to modify or create tables in the database. However, Metadb creates a schema for each user where the user does have these privileges, and this serves as an individual workspace. The schema has the same name as the user. This can be useful as an area for saving the results of queries or importing external data sets. For example, the user celia can create a table in the schema celia:
CREATE TABLE celia.westgroup AS SELECT * FROM library.patrongroup WHERE __origin = 'west';
SELECT * FROM celia.westgroup LIMIT 20;
By default, other users do not have privileges to access the table.
To grant the user rosalind
read-only access to the schema and table:
GRANT USAGE ON SCHEMA celia TO rosalind;
GRANT SELECT ON celia.westgroup TO rosalind;
The user rosalind
can then access the table.
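Access can later be withdrawn with the usual PostgreSQL commands, for example:
REVOKE SELECT ON celia.westgroup FROM rosalind;
REVOKE USAGE ON SCHEMA celia FROM rosalind;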
1.7. Creating reports
An effective way to create a report is to package it as a database function. A database function can define a query and associated parameters. Users can then call the function, specifying a value for each parameter.
For example, suppose that the following query counts the number of loans in a library for each circulated item within a range of dates.
SELECT item_id, count(*) AS loan_count
    FROM folio_circulation.loan__t
    WHERE '2023-01-01' <= loan_date AND loan_date < '2024-01-01'
    GROUP BY item_id;
We can create a function to generalize this query. Instead of including the dates directly within the query, we will define them as parameters: start_date and end_date.
CREATE FUNCTION lisa.count_loans(
    start_date date DEFAULT '2000-01-01',
    end_date date DEFAULT '2050-01-01')
    RETURNS TABLE(
        item_id uuid,
        loan_count integer)
    AS $$
SELECT item_id, count(*) AS loan_count
    FROM folio_circulation.loan__t
    WHERE start_date <= loan_date AND loan_date < end_date
    GROUP BY item_id
$$ LANGUAGE SQL STABLE PARALLEL SAFE;
Now the function can be called with different arguments to generate reports:
SELECT * FROM lisa.count_loans(start_date => '2022-01-01', end_date => '2023-01-01');

SELECT * FROM lisa.count_loans(start_date => '2023-01-01');
The user lisa
who created this function can share it with other
users in a similar way as described in the "User workspaces" section
above.
GRANT USAGE ON SCHEMA lisa TO celia, rosalind;
GRANT EXECUTE ON FUNCTION lisa.count_loans TO celia, rosalind;
Note that the GRANT USAGE ON SCHEMA
command does not have to be run
again if it was done previously.
Defining shared functions in this way can be used together with a web-based database tool such as CloudBeaver to make reports available to a wider group of users.
1.8. Database views
Database views are not supported and should not be created in a Metadb database.
Views can create dependencies that may block required schema changes during data updates. They also hide query complexity, which in analytical workloads can be significant, behind what appears to be a simple table. It is recommended to use database functions instead, as described in the previous section.
1.9. Querying system information
Metadb version
To show the current Metadb version:
SELECT mdbversion();
Data update times
Information about when certain tables were last updated is available
in a table called metadb.table_update
:
SELECT * FROM metadb.table_update ORDER BY schema_name, table_name;
System log
Metadb reports errors and other information in a logging table called metadb.log. For instance, to view all log messages from January 2023:
SELECT * FROM metadb.log WHERE '2023-01-01' <= log_time AND log_time < '2023-02-01' ORDER BY log_time;
The mdblog()
function is provided for convenience and returns
ordered log messages, from the past 24 hours by default (or optionally
from a specified interval):
SELECT * FROM mdblog();
Status of query process
When executing a long-running query, the ps()
function can be used
to check on the status of the query process from another session:
SELECT * FROM ps();
1.10. Working with data types
This section offers a few brief recommendations for working with common data types.
Text and varchar
The text
data type is recommended for strings of characters.
The varchar(n)
type with length n can be used for older database
clients that require the length to be specified.
Monetary data
For monetary amounts, numeric(19, 4)
is commonly used to avoid
round-off errors. For exchange rates, numeric(19, 14)
may be used.
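For example, a workspace table might declare a monetary column as follows (a sketch; the table and column names are only illustrative):
CREATE TABLE celia.invoice_totals (
    invoice_id uuid,
    total numeric(19, 4)
);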
Querying on a range of dates
There are various ways of querying within a range of dates, but the most reliable method is to use inequalities of the form start <= t < end, for example:
SELECT * FROM folio_circulation.loan__t WHERE '2023-01-01' <= loan_date AND loan_date < '2024-01-01';
2. Reference
2.1. Data type conversion
When reading data from a data source, Metadb will automatically adjust column data types in a number of cases:
-
A data type in the source has changed in a way that cannot be applied safely by Metadb without more information.
-
A data type in the source has changed to a type that is not supported by Metadb.
-
A data type in the source can be converted safely to a type that is more performant.
Each type conversion is applied in one of two ways: either by the stream processor converting one record at a time, or later by "inferring" types from previously updated data.
2.1.1. Record-at-a-time
This table summarizes conversions that are performed one record at a time by the stream processor:
Data type conversions | To numeric | To uuid | To jsonb | To text |
---|---|---|---|---|
From boolean | | | | ✅ |
From smallint | ✅ | | | ✅ |
From integer | ✅ | | | ✅ |
From bigint | ✅ | | | ✅ |
From real | ✅ | | | ✅ |
From double precision | ✅ | | | ✅ |
From numeric | | | | ✅ |
From date | | | | ✅ |
From time | | | | ✅ |
From time with timezone | | | | ✅ |
From timestamp | | | | ✅ |
From timestamp with timezone | | | | ✅ |
From uuid | | | | ✅ |
From json | | | ✅ | ✅ |
From jsonb | | | | ✅ |
2.1.2. Inferred from data
At present the only inferred type is uuid:
Data type conversions | To numeric | To uuid | To jsonb | To text |
---|---|---|---|---|
From text/varchar | | ✅ | | |
Types can also be set manually via the ALTER TABLE command.
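For example (see ALTER TABLE under Reference > Statements):
ALTER TABLE library.patron__ ALTER COLUMN patrongroup_id TYPE uuid;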
2.2. Functions
2.2.1. System information
Name | Return type | Description |
---|---|---|
mdblog(interval) | table ( log_time timestamptz(3), error_severity text, message text ) | Returns recent log messages during the specified interval, or during the past 24 hours if no interval is given |
mdbversion() | text | Returns the Metadb version |
ps() | table ( dbname text, username text, state text, realtime text, query text ) | Returns the status and elapsed running time of current query processes |
Examples
Show log messages from the past hour, and from the past 24 hours:
SELECT * FROM mdblog('1 hour');
SELECT * FROM mdblog();
Show the current Metadb version:
SELECT mdbversion();
2.3. System tables
2.3.1. metadb.base_table
The table metadb.base_table
stores information about tables that are
extracted from data sources or are transformed from those tables.
Column name | Column type | Description |
---|---|---|
| varchar(63) | The base table schema name |
| varchar(63) | The base table name |
| varchar(63) | Name of the data source the table is extracted from |
| boolean | True if the table is transformed from another table |
| varchar(63) | Schema name of the parent table, if this is a transformed table |
| varchar(63) | Table name of the parent table, if this is a transformed table |
2.3.2. metadb.log
The table metadb.log
stores logging information for the system.
Column name | Column type | Description |
---|---|---|
log_time | timestamptz(3) | Timestamp when the log entry was written |
error_severity | text | Logging level: (from least to most severe) INFO, WARNING, ERROR, FATAL |
message | text | The log message |
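For example, to count recent log messages by severity:
SELECT error_severity, count(*)
    FROM metadb.log
    WHERE log_time > now() - interval '7 days'
    GROUP BY error_severity;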
2.3.3. metadb.table_update
The table metadb.table_update
stores information about the updating
of certain tables.
Column name | Column type | Description |
---|---|---|
schema_name | varchar(63) | Schema name of the updated table |
table_name | varchar(63) | Table name of the updated table |
last_update | timestamptz | Timestamp when the table was last updated successfully |
| real | Wall-clock time in seconds of the last completed update process |
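For example, to list tables that have not been updated in the past day:
SELECT schema_name, table_name, last_update
    FROM metadb.table_update
    WHERE last_update < now() - interval '1 day'
    ORDER BY last_update;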
2.4. External SQL directives
Metadb allows scheduling external SQL files to run on a regular basis. At present this feature is enabled automatically when the "folio" or "reshare" module has been specified in the data source. Details such as the location of the SQL files and when they are run are currently hardcoded but will be configurable in the future.
Each SQL statement should be separated from others by an empty line, and any tables created should not specify a schema name.
Comment lines beginning with --metadb:
are used for special
directives; each directive should be on a separate line.
It is suggested that each SQL file begin with a --metadb:table directive, followed by an empty line, for example:
--metadb:table library_patrons

DROP TABLE IF EXISTS library_patrons;

CREATE TABLE library_patrons AS
SELECT
    . . . ;
2.4.1. --metadb:table
The --metadb:table
directive declares that the SQL file updates a
specific table. This allows Metadb to report on the status of the
update in the metadb.table_update
system table. The directive takes
the form:
--metadb:table <table>
The specified table should not contain a schema name.
For example:
--metadb:table user_group
2.5. Statements
Metadb extends SQL with statements for configuring and administering the server. These statements are only available when connecting to the Metadb server (not the PostgreSQL server for the underlying database).
2.5.1. ALTER DATA SOURCE
Change the configuration of a data source
ALTER DATA SOURCE source_name
OPTIONS ( [ ADD | SET | DROP ] option ['value'] [, ... ] )
Description
ALTER DATA SOURCE changes connection settings for a data source.
Parameters
Parameter | Description |
---|---|
source_name | The name of an existing data source. |
option ['value'] | Connection settings and other configuration options for the data source. |
Options
See CREATE DATA SOURCE
Examples
Change the consumer group:
ALTER DATA SOURCE sensor OPTIONS (SET consumergroup 'metadb_sensor_1');
2.5.2. ALTER TABLE
Change a table definition
ALTER TABLE table_name
ALTER COLUMN column_name
TYPE data_type
Description
ALTER TABLE changes the definition of a table that is extracted from a data source.
Parameters
Parameter | Description |
---|---|
table_name | Schema-qualified name of a main table. |
column_name | Name of a column to alter. |
data_type | The new data type of the column. The only type currently supported is uuid. |
Examples
Change a column type to uuid
:
ALTER TABLE library.patron__ ALTER COLUMN patrongroup_id TYPE uuid;
2.5.3. AUTHORIZE
Enable access to tables generated from an external data source
AUTHORIZE SELECT
ON ALL TABLES IN DATA SOURCE source_name
TO role_specification
Description
The AUTHORIZE command grants access to tables. It differs from GRANT in that the authorization will also apply to tables created at a later time in the data source.
Parameters
Parameter | Description |
---|---|
source_name | The name of an existing data source. |
role_specification | An existing role to be granted the authorization. |
Examples
AUTHORIZE SELECT ON ALL TABLES IN DATA SOURCE sensor TO beatrice;
2.5.4. CREATE DATA MAPPING
Metadb 1.4 Define a new mapping for data transformation
CREATE DATA MAPPING FOR mapping_type
FROM TABLE table_name
COLUMN column_name
PATH 'object_path'
TO 'target_identifier'
Description
CREATE DATA MAPPING defines mapping rules for data transformation. The output of a mapping is written to a new column or table based on the specified target identifier.
In JSON mapping, the specified path identifies a JSON object or array
to transform. For example, the path '$.a.b'
is used to refer to an
object or array named b
contained within an object or array named
a
. The path '$'
means the outermost enclosing object. Note that
an object or array will not be transformed unless all of its parents
are also transformed; for example, a mapping from path '$.a.b'
will
be applied only if mappings are also defined for both the paths
'$.a'
and '$'
within the same table and column.
Parameters
Parameter | Description |
---|---|
mapping_type | The type of data mapping. The only type currently supported is json. |
table_name | The table to transform. |
column_name | The column to transform. |
object_path | Path to a JSON object or array. |
target_identifier | A short, lowercase identifier to be used in naming the transformed data. It must be unique for the transformed column; in other words, no two paths can be mapped to the same target identifier. |
Examples
Create JSON mappings to transform an object at $.metadata
and an
array at $.tags.tagList
:
CREATE DATA MAPPING FOR json FROM TABLE library.inventory__ COLUMN jsondata PATH '$' TO 't';
CREATE DATA MAPPING FOR json FROM TABLE library.inventory__ COLUMN jsondata PATH '$.metadata' TO 'metadata';
CREATE DATA MAPPING FOR json FROM TABLE library.inventory__ COLUMN jsondata PATH '$.tags' TO 'tags';
CREATE DATA MAPPING FOR json FROM TABLE library.inventory__ COLUMN jsondata PATH '$.tags.tagList' TO 'taglist';
2.5.5. CREATE DATA ORIGIN
Define a new data origin
CREATE DATA ORIGIN origin_name
Description
CREATE DATA ORIGIN defines a new origin. An origin name can be stored
by Metadb in the __origin
column to tag individual records. The use
and meaning of origins are typically defined by the application or by
a Metadb module, but in general origins allow grouping data
independently of data sources. This can be useful because data
sources may be dictated by how the data are collected,
e.g. geographically in a sensor network.
Parameters
Parameter | Description |
---|---|
origin_name | A unique name for the data origin to be created. |
Examples
Create a new origin test_origin
:
CREATE DATA ORIGIN test_origin;
2.5.6. CREATE DATA SOURCE
Define a new external data source
CREATE DATA SOURCE source_name
TYPE source_type
OPTIONS ( option 'value' [, ... ] )
Description
CREATE DATA SOURCE defines connection settings for an external data source.
The new data source starts out in synchronizing mode, which pauses
periodic transforms and running external SQL. After no new snapshot
records have been received for a significant period of time, which
suggests that the initial snapshot has likely finished streaming, the
message "source snapshot complete (deadline exceeded)" will be written
to the log. To complete the synchronization, the Metadb server should
be stopped in order to run metadb endsync
, and after the "endsync"
has completed, the Metadb server can be started again.
Parameters
Parameter | Description |
---|---|
source_name | A unique name for the data source to be created. |
source_type | The type of data source. The only type currently supported is kafka. |
option 'value' | Connection settings and other configuration options for the data source. |
Options for data source type "kafka"
Option | Description |
---|---|
brokers | Kafka bootstrap servers (comma-separated list). |
security | Security protocol. |
topics | Regular expressions matching Kafka topics to read (comma-separated list). |
consumergroup | Kafka consumer group ID. |
schemapassfilter | Regular expressions matching schema names to accept (comma-separated list). |
schemastopfilter | Regular expressions matching schema names to ignore (comma-separated list). |
tablestopfilter | Regular expressions matching table names to ignore (comma-separated list). |
trimschemaprefix | Prefix to remove from schema names. |
addschemaprefix | Prefix to add to schema names. |
module | Name of pre-defined configuration. |
Examples
Create sensor
as a kafka
data source:
CREATE DATA SOURCE sensor TYPE kafka OPTIONS (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1\.',
    consumergroup 'metadb_sensor_1_1',
    addschemaprefix 'sensor_',
    tablestopfilter '^testing\.air_temp$,^testing\.air_temp_avg$'
);
2.5.7. CREATE USER
Define a new database user
CREATE USER user_name
WITH option 'value' [, ... ]
Description
CREATE USER defines a new database user that will be managed by Metadb. It also creates a schema having the same name as the user, if the schema does not exist, and grants the user privileges on the schema. The schema is intended as a workspace for the user.
Parameters
Parameter | Description |
---|---|
user_name | The name of the new user. |
option 'value' | Configuration options for the new user. |
Options
Option | Description |
---|---|
PASSWORD | Sets the user’s password. (Required) |
COMMENT | Stores a comment about the user, e.g. the user’s real name. The comment can be viewed in psql. |
Examples
Create a user wegg
:
CREATE USER wegg WITH PASSWORD 'LZn2DCajcNHpGR3ZXWHD', COMMENT 'Silas Wegg';
2.5.8. DROP DATA SOURCE
Remove a data source configuration
DROP DATA SOURCE source_name
Description
DROP DATA SOURCE removes a data source configuration.
Parameters
Parameter | Description |
---|---|
source_name | The name of an existing data source. |
Examples
Drop a data source sensor
:
DROP DATA SOURCE sensor;
2.5.9. LIST
Show the value of a system variable
LIST name
Description
LIST shows the current setting of various system configurations and other variables.
Parameters
Name | Description |
---|---|
authorizations | Authorized users. |
data_mappings | Configured data mappings. |
data_origins | Configured data origins. |
data_sources | Configured data sources. |
status | Current status of system components. |
Examples
LIST status;
3. Server administration
3.1. System requirements
These are minimum system requirements. They are based on the systems that Metadb has been tested with recently or is believed to work with.
3.1.1. Hardware requirements
- Metadb:
  - Architecture: x86-64 (AMD64)
  - CPU: 2 cores
  - Memory: 1 GB
  - Local storage: 500 GB SSD
- PostgreSQL:
  - CPU: 4 cores
  - Memory: 32 GB
  - Storage: 2 TB SSD
3.1.2. Software requirements
- Operating system: Debian Linux
- Database system: PostgreSQL 15 or later
- To build from source:
  - Go 1.23
  - goyacc (see installation instructions below)
  - Ragel 6
  - GCC C compiler 9 or later
3.1.3. PostgreSQL configuration
- autovacuum_analyze_scale_factor: 0.01
- autovacuum_max_workers: 1
- autovacuum_vacuum_cost_delay: 0
- autovacuum_vacuum_insert_scale_factor: 0.01
- autovacuum_vacuum_scale_factor: 0.01
- checkpoint_timeout: 3600
- cpu_tuple_cost: 0.03
- default_statistics_target: 1000
- effective_io_concurrency: 1
- idle_in_transaction_session_timeout: 60000
- idle_session_timeout: 86400000
- maintenance_work_mem: 1000000
- max_wal_size: 10240
- shared_buffers: 1250000
- statement_timeout: 3600000
- work_mem: 350000
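These settings can be placed in postgresql.conf or applied with ALTER SYSTEM in psql, for example (a sketch; some settings, such as shared_buffers, take effect only after a PostgreSQL restart):
ALTER SYSTEM SET checkpoint_timeout = '3600';
ALTER SYSTEM SET shared_buffers = '1250000';
ALTER SYSTEM SET work_mem = '350000';
SELECT pg_reload_conf();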
3.2. Building the software
It is suggested that a metadb
user be created and the server run by
that user, for example, in /home/metadb
.
To build Metadb, first set the GOPATH
environment variable to
specify a path that can serve as the build workspace for Go, e.g.:
export GOPATH=$HOME/go
Then install goyacc:
go install golang.org/x/tools/cmd/goyacc@master
export PATH=$PATH:$GOPATH/bin
Then to build the server:
./build
The build
script creates a bin/
subdirectory and builds the
metadb
executable there:
./bin/metadb help
In general running metadb
takes the form:
metadb <command> <arguments>
Some typical commands are:
- init initializes a new Metadb instance
- start starts the server
- stop shuts down the server
- upgrade upgrades a Metadb instance to the current version
- sync begins synchronization with a data source
- endsync ends synchronization and cleans up stale data
- version prints the Metadb version
For more information about a specific command:
metadb help <command>
3.3. Server configuration
Metadb makes use of local storage in a data directory which is
created using metadb
with the init
command. In this example we
will create the data directory as data/
:
metadb init -D data
This will also create a file metadb.conf
in the data directory with
the following structure:
[main]
host = host name of the PostgreSQL server
port = port number of the PostgreSQL server
database = database name
superuser = database superuser such as "postgres"
superuser_password = password of superuser
systemuser = database user that is the owner of the database
systemuser_password = password of systemuser
sslmode = should be set to "require" or stronger option
Metadb expects the database name to be metadb
or to begin with
metadb_
; otherwise it logs a warning message.
This file should be edited to add database connection parameters, for example:
[main]
host = a.b.c
port = 5432
database = metadb
superuser = postgres
superuser_password = zpreCaWS7S79dt82zgvD
systemuser = mdbadmin
systemuser_password = ZHivGie5juxGJZmTObHU
sslmode = require
Metadb will assume that the database, superuser, and systemuser defined here already exist; so they should be created before continuing.
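The database and system user can be created in advance with standard PostgreSQL commands, for example (a sketch using the names from the configuration above; substitute your own passwords):
CREATE USER mdbadmin PASSWORD 'ZHivGie5juxGJZmTObHU';
CREATE DATABASE metadb OWNER mdbadmin;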
3.4. Backups
It is essential to make regular backups of Metadb and to test the backups.
In general persistent data are stored in the database, and so the database should be backed up often.
The data directory contains the metadb.conf
configuration file and
is also used for temporary storage. The metadb.conf
file should be
backed up.
3.5. Upgrading from a previous version
To upgrade from any previous version of Metadb, stop the server (if running), and then run the upgrade process in case changes are required:
metadb upgrade -D data
The upgrade process may, in some instances, take a significant amount of time to run. The database generally remains available to users during this period.
If no changes are needed, the process outputs:
metadb: "data" is up to date
3.6. Running the server
To start the server:
nohup metadb start -D data -l metadb.log &
The --memlimit
option can be used to set a soft memory limit (in
GiB) if needed, for example:
nohup metadb start -D data -l metadb.log --memlimit 2 &
The server listens on port 8550 by default, but this can be set using
the --port
option. The --debug
option enables verbose logging.
To stop the server:
metadb stop -D data
Note that stopping or restarting the server may delay scheduled data updates or cause them to restart.
The server can be set up to run with systemd via a file such as
/etc/systemd/system/metadb.service
, for example:
[Unit]
Description=Metadb
After=network.target remote-fs.target
[Service]
Type=simple
User=metadb
ExecStart=/bin/bash -ce "exec /home/metadb/bin/metadb start -D /home/metadb/data -l /home/metadb/metadb.log"
Restart=on-abort
[Install]
WantedBy=multi-user.target
Then:
systemctl enable metadb
systemctl start metadb
3.7. Connecting to the server
The PostgreSQL terminal-based client, psql
, is used to connect to a
Metadb server that runs on the same host and listens on a specified
port:
psql -X -h localhost -d metadb -p <port>
For example:
psql -X -h localhost -d metadb -p 8550
See Reference > Statements for commands that can be issued via
psql
.
Note that the Metadb server is not a database system, but only
implements part of the PostgreSQL communication protocol sufficient to
allow psql
to be used as a client.
3.8. Configuring a Kafka data source
3.8.1. Overview
Metadb currently supports reading Kafka messages in the format produced by the Debezium PostgreSQL connector for Kafka Connect. Configuration of Kafka, Kafka Connect, Debezium, and PostgreSQL logical decoding is beyond the scope of this documentation, but a few notes are included here.
Data flows in this direction:
- A source PostgreSQL database
- Kafka Connect/Debezium
- Kafka
- Metadb
- The analytics database
3.8.2. Creating a connector
To allow capturing data changes in the source PostgreSQL database,
logical decoding has to be enabled, in particular by setting
wal_level = logical
in postgresql.conf
for the source database.
Note that timeout settings in the source database, such as idle_in_transaction_session_timeout, can cause the connector to fail if a timeout occurs while the connector is taking an initial snapshot of the database.
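One way to reduce this risk is to disable such timeouts for the connector's database user in the source database, for example (a sketch assuming the user dbuser from the configuration below):
ALTER ROLE dbuser SET idle_in_transaction_session_timeout = 0;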
Next we create a connector configuration file for Kafka Connect:
{
  "name": "sensor-1-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.dbname": "sourcedb",
    "database.hostname": "example.host.name",
    "database.password": "eHrkGrZL8mMJOFgToqqL",
    "database.port": "5432",
    "database.server.name": "metadb_sensor_1",
    "database.user": "dbuser",
    "plugin.name": "pgoutput",
    "snapshot.mode": "exported",
    "tasks.max": "1",
    "truncate.handling.mode": "include",
    "publication.autocreate.mode": "filtered",
    "heartbeat.interval.ms": "30000",
    "heartbeat.action.query": "UPDATE admin.heartbeat set last_heartbeat = now();"
  }
}
It is recommended to use the connector configuration settings
heartbeat.interval.ms
and heartbeat.action.query
as above to avoid
spikes in disk space consumption within the source database. (See the
Debezium PostgreSQL connector documentation for more details.) The
schemastopfilter
option of the CREATE DATA SOURCE
command is used
to filter out the heartbeat table.
In the source database:
CREATE SCHEMA admin;
CREATE TABLE admin.heartbeat (last_heartbeat timestamptz PRIMARY KEY);
INSERT INTO admin.heartbeat (last_heartbeat) VALUES (now());
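The connector's database user also needs privileges to run the heartbeat query, for example (again assuming the user dbuser):
GRANT USAGE ON SCHEMA admin TO dbuser;
GRANT UPDATE ON admin.heartbeat TO dbuser;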
Then to create the connector:
curl -X POST -i -H "Accept: application/json" -H "Content-Type: application/json" \
    -d @connector.json https://kafka.connect.server/connectors
Note the 1
included in name
and database.server.name
in the
connector configuration. This is suggested as a version number, which
can be incremented if the data stream needs to be resynchronized with
a new connector.
Metadb requires all streamed tables to have a primary key defined.
Tables that do not meet this requirement should be filtered out in the
Debezium PostgreSQL connector configuration by setting
schema.exclude.list
or table.exclude.list
. Otherwise they will
generate error messages in the Metadb log.
3.8.3. Monitoring replication
The replication slot disk usage must be monitored, because under certain error conditions it can grow too large and possibly fill up the disk. To show the disk usage (in the source database):
SELECT slot_name,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS replicationSlotLag,
    active
    FROM pg_replication_slots;
It is recommended to allocate plenty of extra disk space in the source database.
3.8.4. Creating the data source
In Metadb, a data source is defined using the CREATE DATA SOURCE
statement, for example:
CREATE DATA SOURCE sensor TYPE kafka OPTIONS (
    brokers 'kafka:29092',
    topics '^metadb_sensor_1\.',
    consumergroup 'metadb_sensor_1_1',
    addschemaprefix 'sensor_',
    schemastopfilter 'admin'
);
3.8.5. Initial synchronization
When a new data source is first configured using CREATE DATA SOURCE
,
Metadb automatically puts the database into synchronizing mode, just
as if metadb sync
had been run (see Server administration >
Resynchronizing a data source). This has the effect of pausing
periodic transforms and external SQL. When the initial snapshot has
finished streaming, the message "source snapshot complete (deadline
exceeded)" will be written to the log. Then, to complete this first
synchronization, stop the Metadb server, and after that run metadb
endsync
:
metadb stop -D data
metadb endsync -D data --source sensor
Once "endsync" has finished running, start the Metadb server.
3.8.6. Deleting a connection
Sometimes a connection may have to be deleted and recreated (see Server administration > Resynchronizing a data source). After deleting a connection, the replication slot and publication in the source database should be dropped using:
SELECT pg_drop_replication_slot('debezium');
DROP PUBLICATION dbz_publication;
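To confirm that the replication slot has been removed:
SELECT slot_name FROM pg_replication_slots;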
3.9. Resynchronizing a data source
If a Kafka data stream fails and cannot be resumed, it may be necessary to re-stream a complete snapshot of the data to Metadb. For example, a source database may become unsynchronized with the analytic database, requiring a new snapshot of the source database to be streamed. Metadb can accept re-streamed data in order to synchronize with the source, using the following procedure.
Note that during the synchronization process, the analytics database continues to be available to users. However, streaming updates will be slower than usual, and there temporarily may be missing records (until they are re-streamed) or "extra" records (recently deleted in the source database). Also, periodic transforms and external SQL are paused during synchronization.
1. Update the topics and consumergroup configuration settings for the new data stream.

   ALTER DATA SOURCE sensor OPTIONS (SET topics '^metadb_sensor_2\.', SET consumergroup 'metadb_sensor_2_1');

   Do not restart the Metadb server but continue directly to Step 2.

2. Stop the Metadb server and (before starting it again) run metadb sync. This may take some time to run.

   metadb stop -D data
   metadb sync -D data --source sensor

3. Start the Metadb server to begin streaming the data.

4. Once the new data have finished (or nearly finished) re-streaming, stop the Metadb server, and run metadb endsync to remove any old data that have not been refreshed by the new data stream.

   metadb endsync -D data --source sensor

   The timing of when "endsync" should be run is up to the administrator, but it must be run to complete the synchronization process. In most cases it will be more convenient for users if "endsync" is run too late (delaying removal of deleted records) rather than too early (removing records before they have been restreamed).

   Metadb detects when snapshot data are no longer being received, and then writes "source snapshot complete (deadline exceeded)" to the log. This generally means it is a good time to run "endsync".

   Metadb 1.4 The snapshot status is also available via the LIST status command.

5. Start the server.
Until a failed stream is re-streamed by following the process above, the analytic database may continue to be unsynchronized with the source.
3.10. Creating database users
To create a new database user account:
CREATE USER wegg WITH PASSWORD 'LZn2DCajcNHpGR3ZXWHD', COMMENT 'Silas Wegg';
In addition to creating the user, this also creates a schema with the same name as the user. The schema is intended as a workspace for the user.
- Each user account should be for an individual user and not shared by more than one person.
- Prefer user names of 3 to 8 characters in length.
By default the user does not have access to data tables. To give the user access to all tables generated from a data source (including tables not yet created):
AUTHORIZE SELECT ON ALL TABLES IN DATA SOURCE sensor TO wegg;
3.11. Administrative database changes
It is possible to make administrative-level changes directly in the underlying PostgreSQL database, such as providing additional tables for users. However, the following guidelines should be followed strictly to avoid disrupting the operation of Metadb:
- No changes should be made to any database objects created by Metadb. If it should become necessary to make changes to the database objects at the request of the Metadb maintainers, the server should be stopped first to prevent it from operating with an out-of-date cache. If changes are made to the database objects inadvertently, the server should be stopped as soon as possible and not started until the changes have been reversed.
- Changes generally should be made while logged in as a new database user (not a superuser) that is created specifically for that purpose.
- Any new database objects should be created in a new schema that will not coincide with a schema that may be created by Metadb. This can be ensured by always setting addschemaprefix in data source configurations and avoiding names with those prefixes when creating a new schema.
- Database views should not be created in the database.
4. External projects
Two open-source communities, FOLIO and ReShare, use Metadb to support analytics on their data. Metadb has specific support for these two types of data sources; it is referred to conceptually as the "folio" module and the "reshare" module. In the future it is intended that this support will be moved into a plug-in or other modular system, where it could be maintained and documented independently of the Metadb project.
The following sections document Metadb features that are specific to FOLIO and ReShare.
4.1. FOLIO
When configured for FOLIO, a Metadb instance typically manages data for a single FOLIO tenant. If multiple FOLIO tenants are to be supported, then a separate Metadb instance and database should be set up for each tenant.
4.1.1. MARC transform
Metadb transforms MARC records from the tables marc_records_lb
and
records_lb
in schema folio_source_record
to a tabular form which
is stored in a new table, folio_source_record.marc__t
.
Only MARC records considered to be "current" are transformed, where
current here is defined as having state
= 'ACTUAL'
and a valid
identifier is present in 999 ff $i
(field 999
, both indicators
f
, and subfield $i
).
This transform updates the table folio_source_record.marc__t
usually
every few hours or so. The time of the most recent update can be
retrieved from the table metadb.table_update
:
SELECT last_update FROM metadb.table_update WHERE schema_name = 'folio_source_record' AND table_name = 'marc__t';
The MARC transform stores partition tables in the schema marctab
.
Users can ignore this schema, as all data are accessible via
folio_source_record.marc__t
.
4.1.2. Derived tables
FOLIO "derived tables" are helper tables that provide commonly used table joins, etc. They are automatically updated once per day, usually at about 3:00 UTC by default.
Note that the derived tables are based on a periodic snapshot of data, and for this reason they are generally not up-to-date.
4.1.3. Data model
FOLIO modules do not in general provide documentation for their
internal data models, which Metadb tables are based on, but they do
have some documentation for
"storage module" WSAPIs which are roughly equivalent. The names of
most storage modules end in -storage
, but some modules use a
different convention; for instance, the storage module for users is
mod-users
. (All module names begin with mod-
.)
4.1.4. Migrating from LDP
This section contains notes related to migrating from LDP to Metadb.
4.1.4.1. Table names
Table names have changed and now are derived from FOLIO internal table names. Note that unlike LDP, Metadb tables are spread across multiple schemas, reflecting the naming structure of the source FOLIO database in which tables are grouped by FOLIO module.
LDP table | Metadb table |
---|---|
acquisition_method | folio_orders.acquisition_method |
acquisitions_memberships | folio_orders.acquisitions_unit_membership |
acquisitions_units | folio_orders.acquisitions_unit |
audit_circulation_logs | folio_audit.circulation_logs |
circulation_cancellation_reasons | folio_circulation.cancellation_reason |
circulation_check_ins | folio_circulation.check_in |
circulation_fixed_due_date_schedules | folio_circulation.fixed_due_date_schedule |
circulation_loan_history | folio_circulation.audit_loan |
circulation_loan_policies | folio_circulation.loan_policy |
circulation_loans | folio_circulation.loan |
circulation_patron_action_sessions | folio_circulation.patron_action_session |
circulation_patron_notice_policies | folio_circulation.patron_notice_policy |
circulation_request_policies | folio_circulation.request_policy |
circulation_request_preference | folio_circulation.user_request_preference |
circulation_requests | folio_circulation.request |
circulation_scheduled_notices | folio_circulation.scheduled_notice |
circulation_staff_slips | folio_circulation.staff_slips |
configuration_entries | folio_configuration.config_data |
course_copyrightstatuses | folio_courses.coursereserves_copyrightstates |
course_courselistings | folio_courses.coursereserves_courselistings |
course_courses | folio_courses.coursereserves_courses |
course_coursetypes | folio_courses.coursereserves_coursetypes |
course_departments | folio_courses.coursereserves_departments |
course_processingstatuses | folio_courses.coursereserves_processingstates |
course_reserves | folio_courses.coursereserves_reserves |
course_roles | folio_courses.coursereserves_roles |
course_terms | folio_courses.coursereserves_terms |
email_email | folio_email.email_statistics |
feesfines_accounts | folio_feesfines.accounts |
feesfines_comments | folio_feesfines.comments |
feesfines_feefineactions | folio_feesfines.feefineactions |
feesfines_feefines | folio_feesfines.feefines |
feesfines_lost_item_fees_policies | folio_feesfines.lost_item_fee_policy |
feesfines_manualblocks | folio_feesfines.manualblocks |
feesfines_overdue_fines_policies | folio_feesfines.overdue_fine_policy |
feesfines_owners | folio_feesfines.owners |
feesfines_payments | folio_feesfines.payments |
feesfines_refunds | folio_feesfines.refunds |
feesfines_transfer_criterias | folio_feesfines.transfer_criteria |
feesfines_transfers | folio_feesfines.transfers |
feesfines_waives | folio_feesfines.waives |
finance_budgets | folio_finance.budget |
finance_expense_classes | folio_finance.expense_class |
finance_fiscal_years | folio_finance.fiscal_year |
finance_fund_types | folio_finance.fund_type |
finance_funds | folio_finance.fund |
finance_group_fund_fiscal_years | folio_finance.group_fund_fiscal_year |
finance_groups | folio_finance.groups |
finance_ledgers | folio_finance.ledger |
finance_transactions | folio_finance.transaction |
inventory_alternative_title_types | folio_inventory.alternative_title_type |
inventory_bound_with_part | folio_inventory.bound_with_part |
inventory_call_number_types | folio_inventory.call_number_type |
inventory_campuses | folio_inventory.loccampus |
inventory_classification_types | folio_inventory.classification_type |
inventory_contributor_name_types | folio_inventory.contributor_name_type |
inventory_contributor_types | folio_inventory.contributor_type |
inventory_electronic_access_relationships | folio_inventory.electronic_access_relationship |
inventory_holdings | folio_inventory.holdings_record |
inventory_holdings_note_types | folio_inventory.holdings_note_type |
inventory_holdings_sources | folio_inventory.holdings_records_source |
inventory_holdings_types | folio_inventory.holdings_type |
inventory_identifier_types | folio_inventory.identifier_type |
inventory_ill_policies | folio_inventory.ill_policy |
inventory_instance_formats | folio_inventory.instance_format |
inventory_instance_note_types | folio_inventory.instance_note_type |
inventory_instance_relationship_types | folio_inventory.instance_relationship_type |
inventory_instance_relationships | folio_inventory.instance_relationship |
inventory_instance_statuses | folio_inventory.instance_status |
inventory_instance_types | folio_inventory.instance_type |
inventory_instances | folio_inventory.instance |
inventory_institutions | folio_inventory.locinstitution |
inventory_item_damaged_statuses | folio_inventory.item_damaged_status |
inventory_item_note_types | folio_inventory.item_note_type |
inventory_items | folio_inventory.item |
inventory_libraries | folio_inventory.loclibrary |
inventory_loan_types | folio_inventory.loan_type |
inventory_locations | folio_inventory.location |
inventory_material_types | folio_inventory.material_type |
inventory_modes_of_issuance | folio_inventory.mode_of_issuance |
inventory_nature_of_content_terms | folio_inventory.nature_of_content_term |
inventory_service_points | folio_inventory.service_point |
inventory_service_points_users | folio_inventory.service_point_user |
inventory_statistical_code_types | folio_inventory.statistical_code_type |
inventory_statistical_codes | folio_inventory.statistical_code |
invoice_invoices | folio_invoice.invoices |
invoice_lines | folio_invoice.invoice_lines |
invoice_voucher_lines | folio_invoice.voucher_lines |
invoice_vouchers | folio_invoice.vouchers |
notes | folio_notes.note |
organization_addresses | folio_organizations.addresses |
organization_categories | folio_organizations.categories |
organization_contacts | folio_organizations.contacts |
organization_emails | folio_organizations.emails |
organization_interfaces | folio_organizations.interfaces |
organization_organizations | folio_organizations.organizations |
organization_phone_numbers | folio_organizations.phone_numbers |
organization_urls | folio_organizations.urls |
patron_blocks_user_summary | folio_patron_blocks.user_summary |
perm_permissions | folio_permissions.permissions |
perm_users | folio_permissions.permissions_users |
po_alerts | folio_orders.alert |
po_lines | folio_orders.po_line |
po_order_invoice_relns | folio_orders.order_invoice_relationship |
po_order_templates | folio_orders.order_templates |
po_pieces | folio_orders.pieces |
po_purchase_orders | folio_orders.purchase_order |
po_receiving_history | (Not supported in Metadb) |
po_reporting_codes | folio_orders.reporting_code |
srs_error | folio_source_record.error_records_lb |
srs_marc | folio_source_record.marc_records_lb |
srs_marctab | folio_source_record.marc__t |
srs_records | folio_source_record.records_lb |
template_engine_template | folio_template_engine.template |
user_addresstypes | folio_users.addresstype |
user_departments | folio_users.departments |
user_groups | folio_users.groups |
user_proxiesfor | folio_users.proxyfor |
user_users | folio_users.users |
4.1.4.2. Column names
The data column in LDP contains JSON objects. In Metadb this column appears as jsonb or in some cases content, matching the FOLIO internal column names.
4.1.4.3. Data types
In Metadb, UUIDs generally have the uuid
data type. If a UUID has
the text
data type preserved from the source data, it should be cast
using ::uuid
in queries.
Columns with the json
data type in LDP have been changed to use the
jsonb
data type in Metadb.
4.1.4.4. JSON queries
Querying JSON is very similar with Metadb as compared to LDP. For clarity we give a few examples below.
JSON source data
To select JSON data extracted from a FOLIO source, LDP supports:
SELECT data FROM user_groups;
In Metadb, this can be written as:
SELECT jsonb FROM folio_users.groups;
Or with easier to read formatting:
SELECT jsonb_pretty(jsonb) FROM folio_users.groups;
JSON fields: non-array data
For non-array JSON fields, extracting the data directly from JSON in LDP usually takes the form:
SELECT json_extract_path_text(data, 'group') FROM user_groups;
The form recommended for Metadb is:
SELECT jsonb_extract_path_text(jsonb, 'group') FROM folio_users.groups;
JSON fields: array data
To extract JSON arrays, the syntax for Metadb is similar to LDP. A
lateral join can be used with the function jsonb_array_elements()
to
convert the elements of a JSON array to a set of rows, one row per
array element.
For example, if the array elements are simple text
strings:
CREATE TABLE instance_format_ids AS
SELECT id AS instance_id,
       instance_format_ids.jsonb #>> '{}' AS instance_format_id,
       instance_format_ids.ordinality
    FROM folio_inventory.instance
    CROSS JOIN LATERAL jsonb_array_elements(jsonb_extract_path(jsonb, 'instanceFormatIds'))
        WITH ORDINALITY AS instance_format_ids (jsonb);
If the array elements are JSON objects:
CREATE TABLE holdings_notes AS
SELECT id AS holdings_id,
       jsonb_extract_path_text(notes.jsonb, 'holdingsNoteTypeId')::uuid AS holdings_note_type_id,
       jsonb_extract_path_text(notes.jsonb, 'note') AS note,
       jsonb_extract_path_text(notes.jsonb, 'staffOnly')::boolean AS staff_only,
       notes.ordinality
    FROM folio_inventory.holdings_record
    CROSS JOIN LATERAL jsonb_array_elements(jsonb_extract_path(jsonb, 'notes'))
        WITH ORDINALITY AS notes (jsonb);
JSON fields as columns
LDP transforms simple, first-level JSON fields into columns, which can be queried as, for example:
SELECT id, "group", "desc" FROM user_groups;
The Metadb equivalent of this query is:
SELECT id, "group", "desc" FROM folio_users.groups__t;
Note that the double quotation marks are needed here only because
group
and desc
are reserved words in SQL. Alternatively, they can
be removed if the column names are prefixed with a table alias:
SELECT g.id, g.group, g.desc FROM folio_users.groups__t AS g;
Support for transforming subfields and arrays is planned in Metadb.
4.1.4.5. Migrating historical data from LDP
Metadb 1.3 Metadb can import legacy historical data from LDP. The Metadb server must be stopped while this process runs. As an example:
metadb migrate -D data --ldpconf ldpconf.json --source folio
The file ldpconf.json
is used to connect to the LDP database. The
output looks something like:
Begin migration process? y
metadb: migrating: folio_audit.circulation_logs__: reading history.audit_circulation_logs where (updated < 2023-06-28 10:31:35.0556 +0000 UTC)
metadb: migrating: folio_audit.circulation_logs__: 3544356 records written
metadb: migrating: folio_circulation.audit_loan__: reading history.circulation_loan_history where (updated < 2023-06-28 03:34:57.32423 +0000 UTC)
metadb: migrating: folio_circulation.audit_loan__: 2201724 records written
metadb: migrating: folio_circulation.cancellation_reason__: reading history.circulation_cancellation_reasons where (updated < 2023-06-28 03:34:59.911506 +0000 UTC)
metadb: migrating: folio_circulation.cancellation_reason__: 22 records written
metadb: migrating: folio_circulation.check_in__: reading history.circulation_check_ins where (updated < 2023-06-28 11:31:38.628637 +0000 UTC)
metadb: migrating: folio_circulation.check_in__: 1095442 records written
metadb: migrating: folio_circulation.fixed_due_date_schedule__: reading history.circulation_fixed_due_date_schedules where (updated < 2023-07-04 10:31:46.899899 +0000 UTC)
metadb: migrating: folio_circulation.fixed_due_date_schedule__: 34 records written
metadb: migrating: folio_circulation.loan__: reading history.circulation_loans where (updated < 2023-06-28 03:34:57.932582 +0000 UTC)
metadb: migrating: folio_circulation.loan__: 1600346 records written
# (etc.)
Note that only records that LDP updated before a specific time stamp will be imported. This is because for each LDP table and corresponding Metadb table there may be a range of times in which both LDP and Metadb contain historical data. In such cases, the Metadb data are preferred, and the import stops at the point after which the two data sets would otherwise overlap.
Also note that JSON data contained in the imported records are not transformed into columns.
Records imported using this process have their __origin
column set
to the value ldp
, which distinguishes them from other FOLIO data in
Metadb.
4.1.5. Configuring Metadb for FOLIO
When creating a FOLIO data source, use the module 'folio'
option,
and set trimschemaprefix
to remove the tenant from schema names and
addschemaprefix
to add a folio_
prefix to the schema names. For
example:
CREATE DATA SOURCE folio TYPE kafka OPTIONS (
    module 'folio',
    trimschemaprefix 'tenantname_',
    addschemaprefix 'folio_',
    brokers 'kafka:29092',
    topics '^metadb_folio_1\.',
    consumergroup 'metadb_folio_1_1',
    schemastopfilter 'admin'
);
Specifying module 'folio'
has multiple effects including how tenants
are handled, where to find the derived tables, and that MARC
transformation is to be performed. We trim the tenant-name prefix
from schema names because the Metadb database handles only a single
tenant. We add the folio_
prefix as a namespace to allow for other
(non-FOLIO) library data to be imported into the database as well, in
order to support cross-domain analytics.
Note that the CREATE DATA SOURCE
statement does not control which
tenant’s data are to be streamed. The tenant is typically selected in
the configuration of the Debezium connector.
It is recommended to use a separate Kafka cluster, rather than the FOLIO Kafka instance, until one has experience with administration of Kafka.
In the Debezium PostgreSQL connector configuration, the following exclusions are suggested:
"schema.exclude.list": "public,.*_mod_login,.*_mod_pubsub,.*pubsub_config,supertenant_mod_.*,.*_mod_kb_ebsco_java,.*_mod_data_export_spring"
"table.exclude.list": ".*__system,.*_mod_agreements.alternate_resource_name,.*_mod_service_interaction.dashboard_access,.*_mod_agreements.availability_constraint,.*_mod_agreements\\.package_description_url,.*_mod_agreements\\.content_type,.*_mod_agreements\\.entitlement_tag,.*_mod_agreements\\.erm_resource_tag,.*_mod_agreements\\.string_template,.*_mod_agreements\\.string_template_scopes,.*_mod_agreements\\.templated_url,.*_mod_oai_pmh\\.instances,.*_mod_remote_storage\\.original_locations,.*_mod_remote_storage\\.item_notes,.*app_setting,.*alternate_name,.*databasechangelog,.*databasechangeloglock,.*directory_entry_tag,.*license_document_attachment,.*license_supp_doc,.*license_tag,.*log_entry_additional_info,.*subscription_agreement_supp_doc,.*subscription_agreement_document_attachment,.*subscription_agreement_ext_lic_doc,.*subscription_agreement_tag,.*tenant_changelog,.*tenant_changelog_lock,.*marc_indexers.*,.*rmb_internal.*,.*rmb_job.*,.*_mod_agreements\\.match_key,.*system_changelog"
Tables can be excluded for various reasons. Most of the tables
filtered above are excluded because they are not of interest for
analytics (e.g. pubsub state), but data from some modules,
e.g. mod_login
, are omitted for security reasons. It is up to
individual libraries to tailor this exclusion list to their
requirements.
4.2. ReShare
When configured for ReShare, a Metadb instance typically manages data for a consortium using a single data source. ReShare "tenants" are not separated from each other in the usual sense; they each represent a consortium member and together form a single consortial tenant. For this reason, all of the ReShare tenants are streamed to a single Metadb instance, unlike FOLIO where a Metadb instance only handles a single FOLIO tenant.
The data from consortium members are combined into aggregated tables,
with the __origin
column set to the name of the ReShare tenant
(i.e. consortium member). The CREATE DATA ORIGIN
command defines
the set of known origins, and schema names in the source database are
assumed to begin with a prefix that identifies the origin (because
this is how FOLIO and ReShare store tenant data). For example, west_mod_rs.table1 has the prefix west, which is taken to be the origin if previously defined by CREATE DATA ORIGIN.
4.2.1. Derived tables
ReShare "derived tables" are automatically updated once per day, usually at about 3:00 UTC by default.
Note that the derived tables are based on a periodic snapshot of data, and for this reason they are generally not up-to-date.
4.2.2. Configuring Metadb for ReShare
Before defining a ReShare data source, create a data origin for each ReShare tenant (i.e. each member of the consortium). For example:
CREATE DATA ORIGIN tenant1;
CREATE DATA ORIGIN tenant2;
CREATE DATA ORIGIN tenant3;
Then use the module 'reshare'
option when creating the data source,
and set addschemaprefix
to add a reshare_
prefix to the schema
names:
CREATE DATA SOURCE reshare TYPE kafka OPTIONS (
    module 'reshare',
    addschemaprefix 'reshare_',
    brokers 'kafka:29092',
    topics '^metadb_reshare_1\.',
    consumergroup 'metadb_reshare_1_1',
    schemastopfilter 'admin'
);
Note that unlike with FOLIO, here we do not use trimschemaprefix
,
because the "reshare" module uses the tenant name in the prefix to
choose a configured data origin, as described above.
Note that the order of commands is important: The initial set of data
origins should be created before the data source is created so that
schema names of incoming data will be processed correctly. Later,
whenever a new consortial tenant is to be added, it should be defined
in Metadb using CREATE DATA ORIGIN
(and the server restarted) before
the tenant is added to ReShare.
In the Debezium PostgreSQL connector configuration, it is suggested that credentials (.+_mod_login), the public schema, the Okapi supertenant (supertenant_mod_.+), and mod-pubsub data (pubsub_config,.+_mod_pubsub) be excluded using the schema.exclude.list setting.
4.3. MARC transform for LDP/LDLite
The MARC transform (currently included in Metadb) can also be used
with LDP and LDLite. A command-line tool called marct
is provided
which is a drop-in replacement for ldpmarc
.
The system requirements are a subset of those for Metadb:
- Local storage: 500 GB
- Database storage: 500 GB
- Operating system: Linux
- PostgreSQL 15 or later
- Go 1.20 or later
To build marct
:
mkdir -p bin && go build -o bin ./cmd/marct
which creates a bin/
subdirectory and builds the marct
executable
there:
./bin/marct -h
In LDP, MARC data are read from the tables public.srs_marc
and
public.srs_records
, and the transformed output is written to the
table public.srs_marctab
.
Typical usage is:
./bin/marct -D <datadir> -u <ldp_user>
where datadir is an LDP data directory containing ldpconf.json, and ldp_user is an LDP user to be granted SELECT privileges on the output table.
For example:
./bin/marct -D data -u ldp
Note that marct
only grants privileges for a single user. If
individual user accounts are configured for LDP, a shell script can be
used to grant privileges to the users, for example:
users=/path/to/list/of/users.txt
for u in $( cat $users ); do
    psql -c "GRANT SELECT ON public.srs_marctab TO $u ;"
done
The first time marct
runs, it will perform a "full update" of all of
the MARC records. In subsequent runs, it will attempt to use
"incremental update" to update only records that have changed since
the previous run, which can dramatically reduce the running time if
the number of changes is small.
However, if very many records have changed, it is possible that
incremental update may take longer than full update. If it appears
that an incremental update will never finish, it should be canceled,
and a full update should be run once before resuming incremental
updates. This can be done by using the -f
command-line option,
which disables incremental update and requires marct
to do a full
update.