LRL files
Comdb2 configuration files have an .lrl extension for historic reasons. No configuration file is required. If no file is specified, the database will pick reasonable defaults. On startup, the database will look for the following files, and read them, in order:
$COMDB2_ROOT/etc/cdb2/config/comdb2.lrl
$COMDB2_ROOT/etc/cdb2/config/comdb2_local.lrl
$COMDB2_ROOT/etc/cdb2/config.d/*.lrl
dbname.lrl
(for your database’s name)
COMDB2_ROOT
is set at compile time (defaults to /opt/bb/
). Setting a $COMDB2_ROOT
environment variable overrides
the default value.
The intent behind multiple config files is to make deployment of settings easier. comdb2.lrl
is meant to contain settings
that are identical across many databases in your application. comdb2_local.lrl
is for settings that are local to this
machine. Files under config.d
are useful for organizing settings in multiple files (the ‘include’ directive in .lrl files
can also accomplish this).
The last file read is the database specific lrl file. This contains settings specific to a given database. Comdb2 will look for this file in one of these locations, in this order:
- Current directory
- In database directory if it’s specified with
--dir /path/to/db
on command line - In the default database location, if the database directory isn’t specified
- In the location specified with ‘–lrl /path/to/dbname.lrl` on command line
Lines beginning with ‘#’ are treated as comments and skipped. Other lines should consist of a directive and its value.
Database tunables.
Cache size
Cache size can be set with the the ‘cache’ tunable. It takes a cache size and a unit (kb, mb, gb). For example
cache 64 mb
This gets you 64 megabytes of cache. The database will print its cache size when starting up:
Cache:64MB Segments:1 Segment-size:64MB
The default cache size is 64MB. You can set policies on a machine by capping the min and max settings for cachesize (putting these in global comdb2.lrl or comdb2_local.lrl is a good idea. For example to set the minimum cache size at 8MB, and restrict the maximum at 1GB (say, for a shared machine), one can have:
cachekbmin 8192
cachekbmax 1048576
Maximum and minimum values are given in KB.
Thread Pools
There are various threadpools in Comdb2 responsible for specific work. They all take a common set of options.
Name | Function | Default size | Default max queue depth |
---|---|---|---|
appsockpool | Pool for application connections | Unlimited | 0 |
sqlenginepool | Pool for sql runner threads | 48 | 500 |
iopool | Cache flusher threads | 4 | 8000 |
The following options are available for configuring thread pool parameters:
Parameters | Function |
mint | Minimum number of threads to keep around. Threads above this value will exit after linger seconds. Raise this if the thread pool reports lots of thread creates. |
maxt | Maximum number of threads to keep around. Lower this you don’t get gains from additional concurrency for the specific subsystem. |
linger | How long threads above mint will hang around (in seconds) before exiting. |
maxagems | Max age of an item on a queue in ms. Items older than given will be dropped. |
exit_on_error | If set (argument is on ), database will exit if it can’t spin up new threads for this threadpool. |
dump_on_full | If set, argument is on ) will dump the current state of the threadpool when the queue is full |
maxq | Maximum queue depth. If maxt threads are active and none are available, items are enqueued. If the queue reaches this depth, requests to enqueue further are dropped. |
maxqover | Maximum queue override depth. Queued items below this limit won’t generate warnings. |
Examples:
To limit the number of concurrent SQL threads for this node to 12:
sqlenginepool maxt 12
To dump the SQL thread pool every time the queue is full:
sqlenginepool maxq dump_on_error on
Networks
There are a few references to various “networks” in the descriptions below. All machines specified by the
cluster nodes
tunable (or added dynamically) form connections. There is 2 dedicated threads for each machine
for reading and writing messages. There’s 3 “networks” formed for each database (so 3 sets of connections).
Each network has its own set of tunables. Messages are queued for each destination, and are flushed when
necessary. All nodes periodically send heartbeats. Lack of a heartbeat is treated like a network drop.
Network tunables for controlling heartbeat parameters, queue sizes, etc. are listed below. The following
sections describe each network’s uses.
Option | Default | Description |
netbufsz | 1048576 (1 MB) | Size of the network buffer (per node) for the replication network |
netbufsz_signal | 65536 | Size of the network buffer (per node) for the signal network |
heartbeat_check_time | 10 (seconds) | Consider an error if no heartbeat for this many seconds |
net_max_queue | 25000 | Maximum number of items to keep on replication network queue before dropping (per replicant) |
nax_max_mem | 0 (not set) | Maximum size (in MB) of items keep on replication network queue before dropping (per replicant) |
net_throttle_percent | 50 | Throttles write requests from replicants if the replication network queue is net_throttle_percent full |
net_max_queue_signal | 100 | Maximum number of items to keep on the signal network queue before dropping (per replicant) |
net_poll | 100 ms | Allow a connection to linger for this many ms before identifying itself. Connections that take longer are shut down. |
osql_net_poll | 100 ms | Like net_sql , but for the offload network (used by write transactions on replicants to send work to the master) |
net_portmux_register_interval | 600 ms | Check on this interval if our port is correctly registered with pmux for the replication net |
signal_net_portmux_register_interval | 600 ms | Like osql_net_poll for the signal network |
osql_net_portmux_register_interval | 600 ms | like net_portmux_register_interval |
osql_max_queue | 25000 | Like net_max_queue for offload net |
osql_bkoff_netsend | 100 ms | On a full offload net queue, attempt to wait this long before attempting to resend |
osql_bkoff_netsend_lmt | 300000 | Wait a total of this many ms attempting to send on the offload net |
toblock_net_throttle | not set | If set, will throttle writes on a full network queue |
no_toblock_net_throttle | Disables no_toblock_net_throttle | |
enque_flush_interval | 1000 | Try to flush network queue after this many writes for the replication net |
enque_flush_interval_signal | 1000 | Try to flush network queue after this many writes for the signal net |
enque_reorder_lookahead | 20 | When messages are sent out of order, peek at this many messages on the queue in attempt to reorder |
osql_heartbeat_send_time | 5 (sec) | Like heartbeat_send_time for the offload network |
osql_heartbeat_alert_time | 10 (sec) | Like heartbeat_check_time for the offload network |
net_explicit_flush_trace | not set | Produce a stack dump for long network flushes |
no_net_explicit_flush_trace | Turns off stack dumps for long network flushes | |
udp | set | Transaction acks are sent back to master via UDP. Since UDP is potentially lossy, replicants will inject the current LSN ack into their TCP channel to the master every 500 ms. On a lossy network, if you see lots of 500ms transactions, you may want to disable UDP. Such cases aren’t typical. |
noudp | Disables udp . |
Replication
The replication network handles replication-specific traffic between database nodes. This includes election messages, log messages (ie: normal replication), requests for re-transmits, etc. Steps 3 and 4 in Transaction Life cycle go over the replication network.
Offload
The offload net is used by replicants to send work to the master to apply. Replicants run SQL and assemble a set of record modifications to be applied. Steps 2 and 5 in Transaction Life cycle go over the offload net.
Signal
The signal net is reserved for future use.
Sync commands
These tunables control how often the database flushes its log to disk, and what nodes are required to acknowledge a log entry before control is returned to the user. All these tunables start with the ‘sync’ command, and the following word determine which option is being set. The default mode is to NOT flush logs on every transaction, and to wait for all nodes to acknowledge a transaction before returning success.
Option | Default | Description |
full | not set | Logs are flushed on every commit |
normal | set | Logs are NOT flushed on every commit (ie: you can lose transactions if all the nodes in the cluster crash). This is the default. This option also implies fully synchronous replication - ie: wait for all nodes to ack before returning success. |
rep_always_wait | not set | Wait for all nodes, including those that aren’t connected (not recommended) |
none | not set | Asynchronous replication, don’t wait for any replication acknowledgments before returning success. |
room | not set | Wait for acknowledgments only from nodes in the same availability zone/data center (see room affinity). |
log-sync-time | 10 (seconds) | Sets how often we flush the logs to disk. |
log-delete-now | set | Make all log files eligible for deletion when not needed |
log-delete-before | not set | Make all log files older than current time are eligible for deletion when not needed |
log-delete-age-set | Takes epoch time in seconds, log files older than the given time are eligible for deletion when not needed | |
log-delete | on | on enables log deletion, off disables it |
Allow/Disallow commands
These commands control machine class permissions. For every incoming connection, Comdb2 will try to read its class
from comdb2db once. allow
and disallow
can be used to set up policies for which classes of machines are allowed
to access databases. For example, you may have a blanket rule that production databases can only be accessed by
production machines, etc. Commands have this syntax:
allow|disallow writes|cluster|broadcast [from] class|machine
Class should be one of ‘dev’, ‘alpha’, ‘beta’, ‘prod’. Any other value is assumed to be a machine name.
writes
allows/disallows transactions that change the state of the database from the given source. cluster
allows/disallows forming clusters with machines from the given source. broadcast
allows/disallows consumer
connections from the given source.
Switches
Switches are simple on/off settings that generally control debugging or experimental features. They can
be enabled with the on
option, and disabled with off
. Unless debugging/developing Comdb2, toggling these
is generally not advised.
Option | Default | Description |
bad_lrl_fatal | off | Unrecognized lrl options are fatal errors |
t2t | off | New tag->tag conversion code |
fix_cstr | on | Fix validation of cstrings |
warn_cstr | on | Warn on validation of cstrings |
scpushlogs | on | Push to next log after a schema changes |
pfltverbose | on | Verbose errors in prefaulting code |
plannedsc | on | Use planned schema change by default |
pflt_readahead | on | Enable prefaulting of readahead operations |
pflt_toblock_lcl | on | Prefault toblock operations locally |
pflt_toblock_rep | on | Prefault toblock operations on replicants |
dflt_livesc | on | Use live schema change by default |
dflt_plansc | on | Use planned schema change by default |
consumer_rtcpu | on | Don’t send update broadcasts to machines that are marked offline |
clnt_sql_stats | on | Trace back fds to client machines |
sqlite3openserial | on | Serialize calls to sqlite3_open to prevent excess CPU |
thread_stats | on | Berkeley DB will keep stats on what its threads are doing |
lock_timing | on | Berkeley DB will keep stats on time spent waiting for locks |
qdump_atexit | on | Dump queue stats at exit |
memp_timing | off | Berkeley DB will keep stats on time spent in __memp_fget |
memp_pg_timing | on | Berkeley DB will keep stats on time spent in __memp_pg |
shalloc_timing | on | Berkeley DB will keep stats on time |
reset_queue_cursor_mode | on | Reset queue consumeer read cursor after each consume |
key_updates | on | Update non-dupe keys instead of delete/add |
emptystrnum | on | Empty strings don’t convert to numbers |
schemachange_perms | on | Check if schema change allowed from source machines |
verifylsn | on | Verify if LSN written before writing page |
allow_broken_datetimes | on | Allow broken datetimes |
verify_directio | off | Run expensive checks on directio calls |
parallel_sync | on | Run checkpoint/memptrickle code with parallel writes |
verify_dbreg | off | Periodically check if dbreg entries are correct |
verifycheckpoints | off | Highly paranoid checkpoint validity checks |
support_datetime_in_triggers | on | Enable support for datetime/interval types in triggers |
prefix_foreign_keys | on | Allow foreign key to be a prefix of your key |
superset_foreign_keys | on | Allow foreign key to be a superset of your key |
repverifyrecs | off | Verify every berkeley log record received |
enable_osql_logging | off | Log every osql packet received in a special file, per iq |
enable_osql_longreq_logging | off | Log untruncated osql strings |
check_sparse_files | off | When allocating a page, check that we aren’t creating a sparse file |
core_on_sparse_file | off | Generate a core if we catch berkeley creating a sparse file |
cache_sockpool_tran | on | Cache sent rows for an sql transaction, and survive a master swing |
sqlclient_use_random_readnode | off | Sql client will use random sql allocation by default |
check_sqlite_numeric_types | on | Report if our numeric conversion disagrees with SQLite’s |
use_fastseed_for_comdb2_seqno | off | Use fastseed instead of context for comdb2_seqno unique values |
disable_stable_for_ipu | on | For inplace update tables, disable stable find-next cursors |
debug_mpalloc_size | off | Alarm on suspicious allocation requests; |
disable_exit_on_thread_error | off | don’t exit on thread errors |
catch_response_on_retry | on | print trace when we try to send replies on a retry |
requeue_on_tran_dispatch | on | Requeue transactional statement if not enough threads |
check_wrong_db | on | Return error if connecting to wrong database |
dbglog_use_sockpool | on | Use sockpool for connections opened for dbglog |
debug_temp_tables | off | Debug temp tables |
check_sql_source | off | Check sql source |
flush_check_active_peer | on | Check if still have active connection when trying to flush |
private_blkseq | on | Keep a private blkseq |
use_blkseq | on | Enable blkseq |
track_queue_time | on | Track time sql requests spend on queue |
update_startlsn_printstep | off | Print steps walked in update_startlsn code |
locks_check_waiters | off | Light a flag if a lockid has waiters |
rowlocks_commit_on_waiters | off | Don’t commit a physical transaction unless there are lock waiters |
broadcast_check_rmtpol | on | Check rmtpol before sending triggers |
track_curtran_locks | off | Print curtran lockinfo |
print_deadlock_cycles | off | Print all deadlock cycles |
replicate_rowlocks | on | Replicate rowlocks |
gather_rowlocks_on_replicant | on | Replicant will gather rowlocks |
force_old_cursors | off | Replicant will use old cursors |
disable_rowlocks_logging | off | Don’t add logical logging for rowlocks |
disable_rowlocks | off | Follow rowlocks codepath but don’t lock |
random_rowlocks | off | Grab random, guaranteed non-conflicting rowlocks |
already_aborted_trace | off | Print trace when dd_abort skips an ‘already-aborted’ locker |
disable_update_shadows | off | stub out update shadows code |
verbose_toblock_backouts | off | print verbose toblock backout trace |
sql_release_locks_on_si_lockwait | on | Release sql locks from si if the rep thread is waiting |
sql_release_locks_on_emit_row_lockwait | off | Release sql locks when we are about to emit a row |
sql_release_locks_on_slow_reader | on | Release sql locks if a tcp write to the client blocks |
no_timeouts_on_release_locks | on | Disable client-timeouts if we’re releasing locks |
sql_release_locks_in_update_shadows | on | Release sql locks in update_shadows on lockwait |
release_locks_trace | off | Print trace if we release locks |
verbose_waiter_flag | off | Print trace setting the waiter flag in lock code |
dump_locks_on_repwait | off | Dump locks on repwaits |
dump_page_on_byteswap_error | off | fsnap a malformed page from byteswap |
dump_after_byteswap | off | dump page after byteswap |
rl_retry_on_deadlock | on | retry micro commit on deadlock |
disable_blob_check | off | return immediately in check_blob_buffers |
disable_new_si_overhead | off | return immediately in several new snapisol functions |
verify_all_pools | off | verify objects are returned to the correct pools |
print_blockp_stats | off | print thread-count in block processor |
allow_parallel_rep_on_pagesplit | on | allow parallel rep on pgsplit |
allow_parallel_rep_on_prefix | on | allow parallel rep on bam_prefix |
verbose_net | off | Net prints lots of messages |
only_match_on_commit | off | Only rep_verify_match on commit records |
check_page_in_recovery | off | verify that a page has or hasn’t gotten corrupt |
comptxn_inherit_locks | on | Compensating transactions inherit pagelocks |
rep_printlock | off | Print locks in rep commit |
accept_on_child_nets | off | listen on separate port for osql/signal nets |
disable_etc_services_lookup | off | When on, disables using /etc/services first to resolve ports |
rowlocks_deadlock_trace | off | Prints deadlock trace in phys.c |
sqllogger
commands
There’s a detailed SQL logging system in place. It allows fine-grained data collection on queries. This
can be controlled with the sqllogger
commands. Logs themselves are binary files that are dumped into
the database directory with the name $DBNAME.sqllog. When rolled, they before $DBNAME.sqllog.1,
$DBNAME.sqllog.2, etc.
sqllogger on | Enables sql logging | |
sqllogger rollat | not set | Automatically roll logs when they get to this size (in bytes) |
sqllogger keep | 2 | Keep this many copies of rolled logs |
sqllogger every | 1 | Log every Nth statement (1 means every), occasionally good for sampling queries |
sqllogger async | 1 | Log asynchronously - don’t write to file in the same thread as SQL query |
sqllogger sync | 1 | Log synchronously - write to file in the same thread as SQL query |
sqllogger asyncsize | 4194304 | For async logging, the max size of unflushed query data |
Incoherent nodes overview
One problem with synchronous replication is that commit time is dominated by the slowest node. What’s worse, a node that stops responding altogether can hold up replication forever. To avoid these problems, the master may choose to mark nodes incoherent. Incoherent nodes continue to operate as before with 2 exceptions:
- They will no longer service any requests - any new queries/transactions that come in are rejected and will go to another node.
- The master will no longer wait for replication acknowledgments from incoherent nodes.
There’s 2 ways for a node to be marked incoherent. The first catches outright broken nodes. If any nodes have acknowledged a transaction, and any other node takes longer than a configured threshold to respond, it’s considered broken and marked incoherent. The second option captures nodes that are statistically slower than other nodes. A node where the running average of acknowledgment times over the last minute twice as bad as the second bad node (withing some minimum threshold) is considered slow, and will become incoherent.
Incoherent nodes are self-healing. If a node marked incoherent catches up to the master, the master will mark it coherent again, will resume waiting for replication, and the node will be free to service requests. If a node was marked slow (second method, listed above), it will become coherent once its running average is back within the configured threshold.
The tunables in this section control the various tunables used to decide when nodes are to be considered incoherent
DB layer tunables:
Option | Default | Description |
incoherent_msg_freq | 3600 (seconds) | Drop a warning file if a node is incoherent (see incoherent nodes overview) for more information. |
incoherent_alarm_time | 120 (seconds) | Warn about incoherent nodes this often. |
max_incoherent_nodes | 1 | Number of incoherent nodes before an alarm triggers. |
net_lmt_upd_incoherent_nodes | 70 (%) | When this percent of a network queue is full, throttle messages to incoherent nodes. |
BDB layer tunables (see bdbattr tunables)
Option | Default | Description |
REPTIMEOUT_LAG | 50 (PERCENT) | Used in replication. Once a node has received our update, we will wait REPTIMEOUT_LAG% of the time that took for all other nodes. |
REPTIMEOUT_MINMS | 10000 (MSECS) | Even if the first node comes back quickly, wait at least this many ms to replicate to other nodes. |
REPTIMEOUT_MAXMS | 5 * 60 * 1000 (MSECS) | We should wait this long for one node to acknowledge replication. If after this time we have failed to replicate anywhere then the entire cluster is incoherent! |
REP_DEBUG_DELAY | 0 (MSECS) | For debugging - set an artificial replication delay |
TOOMANYSKIPPED | 2 (QUANTITY) | Call for election again and delay commits if more than this many nodes are incoherent |
SKIPDELAYBASE | 100 (MSECS) | Delay commits by at least this much if forced to delay by incoherent nodes |
REPMETHODMAXSLEEP | 300 (SECS) | Delay commits by at most this much if forced to delay by incoherent nodes |
sql_tranlevel_default
options
These options allow you to set the default SQL transaction level across the database. Any setting here can be overwritten by running ‘SET TRANSACTION …’ after connecting to the database. See Isolation Levels for available transaction levels.
sql_tranlevel_default default | Set to default isolation level |
sql_tranlevel_default recom | Set to READ COMMITTED isolation level |
sql_tranlevel_default snapshot | Set to SNAPSHOT isolation level |
sql_tranlevel_default serial | Set to SERIALIZABLE isolation level |
Query limit commands
Queries running in the database can be limited in cost. The database can optionally warn when a query reaches a certain cost, or prevent it from running further.
querylimit maxcost N | Set the maximum cost the database will allow for a query to N |
querylimit warn maxcost N | Warn if a query has a cost > N |
querylimit maxcost off | Turn off returning errors for queries over a cost limit |
querylimit warn maxcost off | Turn off warnings for queries over a cost limit |
Decimal rounding options
The rounding policy for rounding for columns of decimal types can be controlled with the decimal_rounding
option.
Available options are:
decimal_rounding DEC_ROUND_NONE | Don’t round |
decimal_rounding DEC_ROUND_CEILING | Round towards +infinity |
decimal_rounding DEC_ROUND_UP | Round away from 0 |
decimal_rounding DEC_ROUND_HALF_UP | 0.5 rounds up |
decimal_rounding DEC_ROUND_HALF_EVEN | 0.5 rounds to nearest even (default) |
decimal_rounding DEC_ROUND_HALF_DOWN | 0.5 rounds down |
decimal_rounding DEC_ROUND_DOWN | Round towards 0 (truncate) |
decimal_rounding DEC_ROUND_FLOOR | Round towards -infinity |
decimal_rounding DEC_ROUND_05UP | Round for reround |
berkattr
tunables
There’s a few low-level tunables that control options in BerkeleyDB subsystems present in Comdb2.
All options below are preceded with “berkattr” in the configuration file.
option | Default | Description |
iomap_enabled | 1 | Map file that tells comdb2ar to pause while we fsync |
flush_scan_dbs_first | 0 | Don’t hold bufpool mutex while opening files for flush |
skip_sync_if_direct | 1 | Don’t fsync files if directio enabled |
warn_on_replicant_log_write | 1 | Warn if replicant is writing to logs |
abort_on_replicant_log_write | 0 | Abort if replicant is writing to logs |
debug_deadlock_replicant_percent | 0 | Percent of replicant events getting deadlocks |
always_run_recovery | 1 | Replicant always runs recovery after rep_verify |
debug_addrem_dbregs | 0 | Generate debug records for addrems |
start_recovery_at_dbregs | 1 | Start recovery at dbregs |
apprec_track_lsn_ranges | 1 | During recovery track lsn ranges |
dbreg_errors_fatal | 0 | dbreg errors fatal |
warn_nondbreg_records | 0 | warn on non-dbreg records before checkpoint |
num_write_retries | 8 | number of times to retry writes on ENOSPC |
debug_enospc_chance | 0 | DEBUG %% random ENOSPC on writes |
check_zero_lsn_writes | 1 | Warn on writing pages with zero LSNs |
abort_zero_lsn_writes | 0 | Abort on writing pages with zero headers |
abort_zero_lsn_memp_put | 0 | Abort on memp_fput pages with zero headers |
preallocate_on_writes | 0 | Pre-allocate on writes |
preallocate_max | 256 * MEGABYTE | Pre-allocation size |
lsnerr_pgdump | 1 | Dump page on LSN errors |
lsnerr_pgdump_all | 0 | Dump page on LSN errors on all nodes |
max_backout_seconds | 0 | Refuse to roll back replicant past this many seconds |
max_backout_logs | 3 | Refuse to roll back replicant past this many log files |
log_applied_lsns | 0 | Log applied LSNs to log |
check_applied_lsns | 0 | Check transaction that its LSNs have been applied |
check_applied_lsns_fatal | 0 | Abort if check_applied_lsns fails |
check_applied_lsns_debug | 0 | Lots of verbose trace for debugging applied LSNs |
sgio_enabled | 0 | Do scatter gather I/O |
sgio_max | 10 * MEGABYTE | Max scatter gather I/O to do at one time |
btpf_enabled | 0 | Enables index pages read ahead |
btpf_wndw_min | 100 | Minimum number of pages read ahead |
btpf_wndw_max | 1000 | Maximum number of pages read ahead |
btpf_wndw_inc | 1 | Increment factor for the number of pages read ahead |
btpf_pg_gap | 0 | Min. number of records to the page limit before read ahead |
btpf_cu_gap | 5 | How close a cursor should be (pages) to the prefaulted limit before prefaulting again |
btpf_min_th | 1 | Preload pages only if the tree has height less than this parameter |
recovery_verify | 0 | After recovery, run a full pass to make sure everything is applied |
recovery_verify_fatal | 0 | Abort if recovery_verify is set, and fails. |
check_pwrites | 0 | Read page after direct pwrite, check that it matches |
check_pwrites_debug | 0 | Read page after direct pwrite, check that it matches |
cache_lc | 0 | Collect logs into LSN_COLLECTIONs as they come in |
cache_lc_max | 16 | Keep this many transactions around in LC cache |
cache_lc_debug | 0 | Lots of verbose messages out of LC cache system |
cache_lc_trace_evictions | 0 | Print a message at the point of eviction |
cache_lc_trace_misses | 0 | Print a message on cache miss |
cache_lc_check | 0 | Check LC cache system on every transaction |
cache_lc_memlimit | 2097152 | Limit total memory used by LC cache (0 = unlimited). |
cache_lc_memlimit_tran | 1048576 | Limit per transaction memory used by LC cache |
consolidate_dbreg_ranges | 1 | Combine adjacent dbreg ranges for same file |
max_latch | 200000 | Size of latch array |
max_latch_lockerid | 10000 | Size of latch lockerid array |
lockerid_node_step | 128 | Stepup for preallocated lids |
ilock_step | 2048 | Stepup for preallocated ilock-latches |
db_lock_lsn_step | 1024 | Stepup for preallocated db_lock_lsns |
blocking_latches | 0 | Block on latch rather than deadlock |
latch_max_wait | 5000 | Block at most this many microseconds before returning deadlock |
latch_poll_us | 1000 | Poll latch this many microseconds before retrying |
latch_max_poll | 5 | Poll latch this many times before returning deadlock |
latch_timed_mutex | 1 | Use a timed mutex |
log_cursor_cache | 0 | Cache log cursors |
recovery_processor_poll_interval_us | 1000 | Recovery processor wakes this often to check workers |
lsnerr_logflush | 1 | Flush log on lsn error |
tracked_locklist_init | 10 | Initial allocation count for tracked locks |
bdbattr
tunables
These tunables control the ‘BDB’ layer of the db. Its main responsibility is setting tunables that control lower-level functions like replication, various timeouts, etc. These are set with the ‘setattr’ tunable. The next word determines which option to set, and the following word determines its value.
Option | Default (type) | Description |
REPTIMEOUT | 20 (SECS) | Replication timeout. |
CHECKPOINTTIME | 60 (SECS) | Write a checkpoint at this interval. |
CHECKPOINTTIMEPOLL | 100 (MSECS) | Poll a random amount of time < this before writing a checkpoint (attempting to prevent multiple databases from checkpointing at the same exact time) |
NUMBERKDBCACHES | 0 (QUANTITY) | Split the cache into this many segments. |
FULLRECOVERY | 0 (BOOLEAN) | Instead of recovering from the last checkpoint, run recover from the start of the available logs. |
DTASTRIPE | 8 (QUANTITY) | Partition each table’s data into this many stripes. Note that this is ONLY settable at database creation time. |
COMMITDELAY | 0 (MSECS) | Add a delay after every commit. This is occasionally useful to throttle the transaction rate. |
REPSLEEP | 0 (MSECS) | Add a delay on replicants before completing processing a log record. |
MINREPTIMEOUT | 10000 (MSECS) | Wait at leas this long for a replication event before marking a node incoherent (see incoherent nodes overview) |
BLOBSTRIPE | 0 (BOOLEAN) | Settable at ONLY at database creation time. Create stripes for every variable length field (eg: blobs) |
SQLITE_SORTER_TEMPDIR_REQFREE | 6 (PERCENT) | Refuse to create a sorter for queries if less than this percent of disk space is available (and return an error to the application) |
MEMPTRICKLEPERCENT | 99 (PERCENT) | Try to keep at least this percentage of the buffer pool clean. Write pages periodically until that’s achieved. |
MEMPTRICKLEMSECS | 1000 (MSECS) | Pause for this many ms between runs of the cache flusher. |
LITTLE_ENDIAN_BTREES | 1 (BOOLEAN) | Enabling this sets byte ordering for pages to little endian |
COMMITDELAYMAX | 8 (QUANTITY) | Caps the max transaction delay time. COMMITDELAY won’t set above this value, unless set manually. |
LOWDISKTHRESHOLD | 95 (PERCENT) | Sets the low headroom threshold (percent of filesystem full) above which Comdb2 will start removing logs against set policy. |
SQLBULKSZ | 2097152 (BYTES) | For index/data scans, the database will retrieve data in bulk instead of singlestepping a cursor. This set the buffer size for the bulk retrieval. |
ZLIBLEVEL | 6 (QUANTITY) | If zlib compression is enabled, this determines the compression level. |
AUTODEADLOCKDETECT | 1 (BOOLEAN) | When enabled, deadlock detection will run on every lock conflict. When disabled, it’ll run periodically (every DEADLOCKDETECTMS ms) |
DEADLOCKDETECTMS | 100 (MSECS) | When automatic deadlock detection is disabled, run the deadlock detector this often. |
LOGSEGMENTS | 1 (QUANTITY) | Changing this can create multiple logfile segments. Multiple segments can allow the log to be written while other segments are being flushed. |
REPLIMIT | 256 * 1024 (BYTES) | Replication messages will be limited to this size |
REP_LONGREQ | 1 (SECS) | Warn if replication events are taking this long to process. |
TEMPTABLE_MEM_THRESHOLD | 512 (QUANTITY) | If in-memory temp tables contain more than this many entries, spill them to disk. |
TEMPTABLE_CACHESZ | 262144 (BYTES) | Cache size for temporary tables. Temp tables do not share the database’s main buffer pool. |
BULK_SQL_MODE | 1 (BOOLEAN) | Enable reading data in bulk when performing a scan (alternative is single-stepping a cursor) |
ROWLOCKS_PAGELOCK_OPTIMIZATION | 1 (BOOLEAN) | Upgrade rowlocks to pagelocks if possible on cursor traversals. |
LOGREGIONSZ | 1024*1024 (QUANTITY) | Size of the log region - this is used by BerkeleyDB to store information about open files and other things. <!– *> |
ELECTTIMEBASE | 50 (MSECS) | Master election timeout base value |
BULK_SQL_THRESHOLD | 2 (QUANTITY) | Use bulk retrieval of data on scan after this many next operations |
SQL_QUERY_IGNORE_NEWER_UPDATES | 0 (BOOLEAN) | In transaction modes below SNAPSHOT, skip records updated after the current transaction started. |
CHECK_LOCKER_LOCKS | 0 (BOOLEAN) | Sanity check locks at end of transaction |
DEADLOCK_MOST_WRITES | 0 (BOOLEAN) | If AUTODEADLOCKDETECT is off, prefer transaction with most write as deadlock victim |
DEADLOCK_WRITERS_WITH_LEAST_WRITES | 1 (BOOLEAN) | If AUTODEADLOCKDETECT is off, prefer transaction with least write as deadlock victim |
DEADLOCK_YOUNGEST_EVER | 0 (BOOLEAN) | If AUTODEADLOCKDETECT is off, prefer youngest transaction as deadlock victim |
DEADLOCK_LEAST_WRITES_EVER | 1 (BOOLEAN) | If AUTODEADLOCKDETECT is off, prefer transaction with least write as deadlock victim |
DISABLE_WRITER_PENALTY_DEADLOCK | 0 (BOOLEAN) | If set, won’t shrink max #writers on deadlock |
CHECKPOINTRAND | 30 (SECS) | Stagger scheduled checkpoints by this random amount within this many seconds (to prevent multiple databases from checkpointing at the same time) |
DEF_MIN_KEEP_LOGS | 5 (QUANTITY) | Keep at least this many logs around, even if they are eligible for deletion (unless disk space is very low) |
MIN_KEEP_LOGS_AGE | 0 (SECS) | Keep logs that are at least this old (0 - do not keep logs based on time) |
LOG_DEBUG_CTRACE_THRESHOLD | 20 (QUANTITY) | Limit trace about log file deletion to this many events |
DISABLE_UPDATE_STRIPE_CHANGE | 1 (BOOLEAN) | Enable to move records between stripes on an update. |
PAGE_ORDER_TABLESCAN | 1 (BOOLEAN) | Scan tables in order of pages, not in order of rowids (faster for non-sparse tables) |
TABLESCAN_CACHE_UTILIZATION | 20 (PERCENT) | Attempt to keep no more than this percentage of the buffer pool of table scans. |
INDEX_PRIORITY_BOOST | 1 (BOOLEAN) | Treat index pages as higher priority in the buffer pool. |
REP_PROCESSORS | 4 (QUANTITY) | Try to apply this many transactions in parallel in the replication stream |
REP_WORKERS | 16 (QUANTITY) | Size of worker pool for applying page changes on behalf of transactions (only has effect when REP_PROCESSORS is set |
REP_LSN_CHAINING | 0 (BOOLEAN) | If set, will force transactions on replicant to always release locks in LSN order. |
REP_MEMSIZE | 524288 (QUANTITY) | Maximum size for a local copy of log records for transaction processors on replicants. Larger transactions will read from the log directly. |
ELECT_DISABLE_NETSPLIT_PATCH | 0 (BOOLEAN) | When false - on a net split, the side with the master keeps it instead of downgrading. New masters still can’t be elected without a quorum. |
OSYNC | 0 (BOOLEAN) | Enable O_SYNC on writes. Reads will still use filesystem cache. |
ALLOW_OFFLINE_UPGRADES | 0 (BOOLEAN) | Allow machines marked offline to become master. |
MAX_VLOG_LSNS | 10000000 (QUANTITY) | Apply up to this many replication record trying to maintain a snapshot transaction. |
PAGE_EXTENT_SIZE | 0 (QUANTITY) | If set, allocate pages in blocks of this many (extents) |
DELAYED_OLDFILE_CLEANUP | 1 (BOOLEAN) | If set, don’t delete unused data/index files in the critical path of schema change - schedule them for deletion later. |
DISABLE_PAGEORDER_RECSZ_CHK | 0 (BOOLEAN) | If set, allow page-order table scans even for larger record sizes where they don’t necessarily lead to improvement. |
RECOVERY_PAGES | 0 (QUANTITY) | Disabled if set to 0. Othersize, number of pages to write in addition to writing datapages. This works around corner recovery cases on questionable filesystems. |
REP_DB_PAGESIZE | 0 (QUANTITY) | Page size for BerkeleyDB’s replication cache db. |
PAGEDEADLOCK_RETRIES | 500 (QUANTITY) | On a page deadlock, retry the page operation up to this many times. |
PAGEDEADLOCK_MAXPOLL | 5 (QUANTITY) | If retrying on deadlock (see PAGEDEADLOCK_RETRIES ), poll up to this many ms on each retry. |
ENABLE_TEMPTABLE_CLEAN_EXIT | 0 (BOOLEAN) | On exit, clean up temp tables (they are deleted on next startup regardless). |
MAX_SQL_IDLE_TIME | 3600 (QUANTITY) | Warn when an SQL connection remains idle for this long. |
SEQNUM_WAIT_INTERVAL | 500 (QUANTITY) | Wake up to check the state of the world this often while waiting for replication acks. |
SOSQL_MAX_COMMIT_WAIT_SEC | 600 (QUANTITY) | Wait for the master to commit a transaction for up to this long |
SOSQL_POKE_TIMEOUT_SEC | 12 (QUANTITY) | On replicants, when checking on master for transaction status, retry the check after this many seconds. |
SOSQL_POKE_FREQ_SEC | 5 (QUANTITY) | On replicants, check this often for transaction status. |
SQL_QUEUEING_DISABLE_TRACE | 0 (BOOLEAN) | Disable trace when SQL requests are starting to queue. |
SQL_QUEUEING_CRITICAL_TRACE | 100 (QUANTITY) | Produce trace when SQL request queue is this deep. |
NOMASTER_ALERT_SECONDS | 60 (QUANTITY) | Replicants will alarm there’s no master for this many seconds. |
PRINT_FLUSH_LOG_MSG | 0 (BOOLEAN) | Produce trace when flushing log files. |
MASTER_REJECT_REQUESTS | 1 (BOOLEAN) | Master will reject SQL requests - they’ll be routed to a replicant. The master can serve SQL requests, but it’s better to avoid it for better workload balancing. |
NEW_MASTER_DUMMY_ADD_DELAY | 5 (SECS) | Force a transaction after this delay, after becoming master. |
GENID_COMP_THRESHOLD | 60 (QUANTITY) | Try to compress rowids if the record data is smaller than this size. |
MASTER_REJECT_SQL_IGNORE_SANC | 0 (BOOLEAN) | If MASTER_REJECT_REQUESTS is set, reject if no other connected nodes are available. |
KEEP_REFERENCED_FILES | 1 (BOOLEAN) | Don’t remove any files that may still be referenced by the logs. |
DISABLE_PGORDER_MIN_NEXTS | 1000 (QUANTITY) | Don’t disable page order table scans for tables less than this many pages. |
DISABLE_PGORDER_THRESHOLD | 60 (PERCENT) | Disable page order table scans if skipping this percentage of pages on a scan |
HOSTILE_TAKEOVER_RETRIES | 0 (QUANTITY) | Attempt to take over mastership if the master machine is marked offline, and the current machine is online. |
Auto analyze options
These options control when the database kicks off analyze automatically. Auto-analyze
will be run by the database when the database does some fixed count of inserts/deletes. Various counters
can be queried with the stat autonalyze
command.
Option | Default (type) | Description |
AUTOANALYZE | 0 (BOOLEAN) | Set to enable auto-analyze |
AA_COUNT_UPD | 0 (BOOLEAN) | Also consider updates towards the count of operations |
MIN_AA_OPS | 100000 (QUANTITY) | Start analyze after this many operations |
CHK_AA_TIME | 180 (SECS) | Check whether we should start analyze this often |
MIN_AA_TIME | 7200 (SECS) | Don’t re-run auto-analyze if already ran within this many seconds |
AA_LLMETA_SAVE_FREQ | 1 (QUANTITY) | Persist change counters per table on every N’th iteration (called every CHK_AA_TIME seconds) |
AA_MIN_PERCENT | 20 (QUANTITY) | Percent change above which we kick off analyze |
AA_MIN_PERCENT_JITTER | 300 (QUANTITY) | Additional jitter factor for determining percent change. |
SQL planner tunables
Option | Default (type) | Description |
PLANNER_EFFORT | 1 (QUANTITY) | Planner effort (try harder) levels, default is 1 |
SHOW_COST_IN_LONGREQ | 1 (BOOLEAN) | Show query cost in the database long requests log (see logs.html#long-requests-log) |
PLANNER_SHOW_SCANSTATS | 0 (BOOLEAN) | After each query, display statistics about index/data paths taken. |
PLANNER_WARN_ON_DISCREPANCY | 0 (BOOLEAN) | After each query warn if the estimate and actual cost are significantly different |
Schema change tunables
Option | Default (type) | Description |
SC_RESTART_SEC | 60 (QUANTITY) | Delay restarting schema change for this many seconds after startup/new master election |
INDEXREBUILD_SAVE_EVERY_N | 10 (QUANTITY) | Save schema change state to every n-th row for index only rebuilds |
SC_DECREASE_THRDS_ON_DEADLOCK | 1 (BOOLEAN) | Decrease number of schema change threads on deadlock – way to have schema change backoff |
SC_CHECK_LOCKWAITS_SEC | 1 (QUANTITY) | Frequency of checking lockwaits during schemachange in seconds |
SC_USE_NUM_THREADS | 0 (QUANTITY) | Start up to this many threads for parallel rebuilding durin schema change. 0 means use one per dtastripe . Setting is capped at dtastripe . |
SC_NO_REBUILD_THR_SLEEP | 10 (QUANTITY) | Sleep this many microsec when conversion threads count is at max |
SC_FORCE_DELAY | 0 (BOOLEAN) | Force schemachange to delay after every record inserted – to have sc backoff |
UDP tunables
UDP_DROP_DELTA_THRESHOLD | 10 (QUANTITY) | Warn if delta of dropped packets exceeds this threshold |
UDP_DROP_WARN_PERCENT | 10 (PERCENT) | Warn only if percentage of dropped packets exceeds this |
UDP_DROP_WARN_TIME | 300 (SECS) | No more than one warning per UDP_DROP_WARN_TIME seconds |
UDP_AVERAGE_OVER_EPOCHS | 4 (QUANTITY) | Average over these many TCP epochs |
RAND_UDP_FAILS | 0 (QUANTITY) | For testing, rate of drop of UDP packets |
Rowlock tunables
MAX_ROWLOCKS_REPOSITION | 10 | Release a physical cursor an re-establish |
PHYSICAL_COMMIT_INTERVAL | 512 | Force a physical commit after this many physical operations |
ROWLOCKS_MICRO_COMMIT | 1 | Commit on every btree operation |
PHYSICAL_ACK_INTERVAL | 0 | For logical transactions, have the slave send an ‘ack’ after this many physical operations |
ACK_ON_REPLAG_THRESHOLD | 0 | Drop the undo interval to 1 after this many deadlocks |
Misc tunables
NET_INORDER_LOGPUTS | 1 | Attempt to order messages to ensure they go out in LSN order |
RCACHE_COUNT | 257 | Number of entries in root page cache |
RCACHE_PGSZ | 4096 | Size of pages in root page cache |
DISABLE_CACHING_STMT_WITH_FDB | 1 | Don’t cache query plans for statements with foreign table references |
Log configuration
DEBUG_LOG_DELETION | 0 | Enable to see when/why database log files are being deleted |
LOG_DELETE_LOW_HEADROOM_BREAKTIME | 10 | Try to delete logs this many times if the filesystem is getting full before giving up |
MIN_KEEP_LOGS_AGE_HWM | 0 | … |
LOGFILESIZE | 41943040 (BYTES) | Attempt to keep each log file around this size. |
LOGFILEDELTA | 4096 (BYTES) | Treat lsn differences less than this as “close enough” for the purpose of determining if a new node is in sync |
LOGMEMSIZE | 10485760 (BYTES) | Use this much memory for a log file in-memory buffer. |
LOGDELETELOWFILENUM | -1 (QUANTITY) | Set to set the lowest deleteable log file number. |
Replay detection tunables
Comdb2 databases allow the client APIs to replay a transaction if its outcome is uncertain (ie: client issues commit, but the database drops a connection, so uncertain whether it committed). This system is internally called “blkseq” (block sequence). Tunables to control it are below
PRIVATE_BLKSEQ_CACHESZ | 4194304 | Cache size of the blkseq table |
PRIVATE_BLKSEQ_MAXAGE | 20 | Maximum time in seconds to let “old” transactions live |
PRIVATE_BLKSEQ_STRIPES | 1 | Number of stripes for the blkseq table |
PRIVATE_BLKSEQ_ENABLED | 1 | Sets whether dupe detection is enabled |
PRIVATE_BLKSEQ_CLOSE_WARN_TIME | 100 | Warn when it takes longer than this many MS to roll a blkseq table |
DISABLE_SERVER_SOCKPOOL | 1 | Don’t get connections to other databases from sockpool. |
TIMEOUT_SERVER_SOCKPOOL | 10 | Timeout for getting a connection to another database from sockpool. |
Coherency options
Comdb2 will mark nodes that are non-responsive “incoherent”. Incoherent nodes will not respond to new queries and replication will not wait for them acknowledge events. Nodes stay incoherent until the are caught up to the rest of the cluster. The master node will send replicants a coherency lease to allow them to stay coherent. It’ll withhold the lease if a node is either not responding or is significantly slower than the other nodes.
COMMITDELAYBEHINDTHRESH | 1048576 (BYTES) | Call for election again and ask the master to delay commits if we’re further than this far behind on startup. |
GOOSE_REPLICATION_FOR_INCOHERENT_NODES | 0 (BOOLEAN) | Call for election for nodes affected by COMMITDELAYBEHINDTHRESH |
TRACK_REPLICATION_TIMES | 1 (BOOLEAN) | Track how long each replicant takes to ack all transactions. |
WARN_SLOW_REPLICANTS | 1 (BOOLEAN) | Warn if any replicant’s average response times over the last 10 seconds are significantly worse than the second worst replicant’s. |
MAKE_SLOW_REPLICANTS_INCOHERENT | 1 (BOOLEAN) | Make slow replicants incoherent. |
SLOWREP_INCOHERENT_FACTOR | 2 (QUANTITY) | Make replicants that are this many times worse than the second worst replicant incoherent. This is the threshold for WARN_SLOW_REPLICANTS and MAKE_SLOW_REPLICANTS_INCOHERENT . |
SLOWREP_INCOHERENT_MINTIME | 2 (MSECS) | Ignore replicantion events faster than this. |
SLOWREP_INACTIVE_TIMEOUT | 5000 (SECS) | If a “slow” replicant hasn’t responded in this long, mark him incoherent. |
TRACK_REPLICATION_TIMES_MAX_LSNS | 50 (QUANTITY) | Track replication times for up to this many transactions |
REMOVE_COMMITDELAY_ON_COHERENT_CLUSTER | 1 | Stop delaying commits when the all the nodes in the cluster are coherent. |
COHERENCY_LEASE | 200 | A coherency lease grants a replicant the right to be coherent for this many MS. |
ADDITIONAL_DEFERMS | 0 | Wait-fudge to ensure that a replicant has gone incoherent |
COHERENCY_LEASE_UDP | 1 | Use udp to issue leases |
LEASE_RENEW_INTERVAL | 50 | How often we renew leases |
DOWNGRADE_PENALTY | 10000 | Prevent upgrades for at least this many ms after a downgrade |
CATCHUP_WINDOW | 1000000 | Start waiting in waitforseqnum if replicant is within this many bytes of master |
CATCHUP_ON_COMMIT | 1 | Replicant to INCOHERENT_WAIT rather than INCOHERENT on commit if within CATCHUP_WINDOW |
ADD_RECORD_INTERVAL | 1 | Add a record every |
Misc. tunables
RLLIST_STEP | 10 | Reallocate rowlock lists in steps of this size. |
GENID48_WARN_THRESHOLD | 500000000 | Print a warning when there are only a few genids remaining */ |
DISABLE_SELECTVONLY_TRAN_NOP | 0 | Disable verifying rows selected via SELECTV if there’s no other actions done by the same transaction |
SC_VIA_DDL_ONLY | 0 | If DDL_ONLY is set, we don’t do checks needed for comdb2sc |
ASOF_THREAD_POLL_INTERVAL_MS | 500 | For how long should the BEGIN TRANSACTION AS OF thread sleep after draining its work queue |
ASOF_THREAD_DRAIN_LIMIT | 0 | How many entries at maximum should the BEGIN TRANSACTION AS OF thread drain per run |
REP_VERIFY_MAX_TIME | 300 | Maximum amount of time we allow a replicant to roll back its logs in an attempt to sync up to the master. |
REP_VERIFY_MIN_PROGRESS | 10485760 | Abort replicant if it doesn’t make this much progress while rolling back logs to sync up to master. |
REP_VERIFY_LIMIT_ENABLED | 1 | Enable aborting replicant if it doesn’t make sufficient progress while rolling back logs to sync up to master. |
TIMEPART_CHECK_SHARD_EXISTENCE | 0 | Check at startup/time partition creation that all shard files exist. |
Init time options
These options apply only at database creation time. Use -create -lrl /path/to/lrlfile.lrl
with the
pre-populated options to create the database
Option | Default | Description |
init_with_genid48 | Off | Use incrementing numbers for rowids instead of timestamps |
init_with_rowlocks | Off | Use rowlocks instead of pagelocks for short-term transaction locks |
init_with_rowlocks_master_only | Off | Alternate rowlocks scheme that only requires short-term rowlocks on the master |
init_with_ondisk_header | On | Add an extra header to records for versioning and compression information. This can be enabled on tables individually with ALTER TABLE . This controls the default for new tables. Prerequisite for other init_with_* options below. |
init_with_inplace_update | On | Do record inplace inplace instead of a delete/add. This is far more efficient. |
init_with_instant_schema_change | On | When possible (eg: when just adding fields) schema change will not rebuild the underlying tables. |
init_with_compr | On | Turns out compression on table data. |
init_with_compr_blobs | On | Turns out compression on blob/vutf8 columns. |
init_with_bthash | 0 (off) | Turn on a root page lookaside cache for tables |
dtastripe | 1 | Stripe each table across this many files (several physical files per table) |
blobstripe | 1 | Also stripe blob files (several physical files per blob field) |
table | Multiple table options can be added to an lrl file to add tables at database init time. Arguments are table name and path to .csc2 file. |
Runtime options
These options are toggle-able at runtime.
Option | Default | Description |
nullfkey | Constraints are enforced for all key values | Do not enforce foreign key constraints for null keys. |
fullrecovery | Off, recovery runs from previous checkpoint | Attempt to run database recovery from the beginning of available logs |
dir | $COMDB2_ROOT/var/cdb2/$DBNAME |
Database directory |
default_sql_mspace_kbsz | 1024 | Default size of memory regions owned by SQL threads, in KB |
directio | On | Bypass filesystem cache for page I/O |
osync | Off | Enables O_SYNC on data files (reads still go through FS cache) if directio isn’t set |
nonames | Off | Use database name for some environment files (older setting, should remain off) |
checksums | On | Checksum data pages. Turning this off is highly discouraged. |
commitdelaymax | 8 | Introduce a delay after each transaction before returning control to the application. Occasionally useful to allow replicants to catch up on startup with a very busy system. |
lock_conflict_trace | Off | Dump count of lock conflicts every second |
no_lock_conflict_trace | On | Turns off lock_conflict_trace |
blocksql_grace | 10 sec | Let block transactions run this long if db is exiting before being killed (and returning an error). |
gbl_exit_on_pthread_create_fail | 0 | If set, database will exit if thread pools aren’t able to create threads. |
enable_sql_stmt_caching | not set | Enable caching of query plans. If followed by “all” will cache all queries, including those without parameters. |
max_sqlcache_per_thread | 10 | Max number of plans to cache per sql thread (statement cache is per-thread, but see hints below) |
max_sqlcache_hints | 100 | Max number of “hinted” query plans to keep (global) - see cdb2_use_hints() |
max_lua_instructions | 10000 | Max lua opcodes to execute before we assume the stored procedure is looping and kill it |
iothreads | 0 | Number of threads to use for I/O prefaulting |
ioqueue | 0 | Max depth of the I/O prefaulting queue |
prefaulthelperthreads | 0 | Max number of prefault helper threads. |
osqlprefaultthreads | 0 | If set, send prefaulting hints to nodes. |
enable_prefault_udp | not set | Send lossy prefault requests to replicants |
disable_prefault_udp | Disable enable_prefault_udp |
|
sqlsortermem | 314572800 | maximum amount of memory to give the sqlite sorter |
cache | 64 mb | Database cache size, see cache size |
cachekb | see cache size | |
cachekbmin | see cache size | |
cachekbmax | see cache size | |
cluster nodes | List of nodes that comprise the cluster for this database. See setting up clusters | |
appsockslimit | 500 | Start warning on this many connections to the database |
maxappsockslimit | 1400 | Start dropping new connections on this many connections to the database |
maxsockcached | 500 | After this many connections, start requesting that further connections are no longer pooled. |
maxlockers | 256 | Initial size of the lockers table (there’s no current maximum) |
maxtxn | 128 | Maximum concurrent transactions. |
largepages | 0 | Enables large pages. |
maxosqltransfer | 50000 | Maximum number of records modifications allowed per transaction |
heartbeat_send_time | 5 (seconds) | Send heartbeats this often. |
sc_del_unused_files_threshold | ||
tablepenaltyincpercent | See BDB_ATTR_DISABLE_WRITER_PENALTY_DEADLOCK | |
maxwt | 8 | Maximum number of threads processing write requests |
maxq | 192 | Maximum queue depth for write requests |
nice | not set | If set, will call nice() with this value to set the database nice level |
sync | See sync command | |
allow | See permissioning commands | |
disallow | See permissioning commands | |
clrpol | See permissioning commands | |
setclass | See permissioning commands | |
sqlflush | not set | Force flushing the current record stream to client every specified number of records |
sbuftimeout | not set | Set a timeout on client connections, connections drop if they |
throttlesqloverlog | 5 (sec) | On a full queue of SQL requests, dump the current thread pool this often |
allow_lua_print | 0 | Enable to allow stored procedures to print trace on DB’s stdout |
allow_user_schema | 0 | Enable to allow per-user schemas |
resource | not set | Registers a file with the databases. Can be referred to from stored procedures. |
repchecksum | 0 | Enable to do additional check-summing of replication stream (log records in replication stream already have checksums) |
use_parallel_schema_change | 1 | Scan stripes for a table in parallel during schema change. |
use_planned_schema_change | 1 | Only change entities that need to change on a schema change. Disable to always rebuild all data files and indices for the changing table. |
enable_bulk_import | 0 | Enable API to quickly bring in tables from another database |
enable_bulk_import_different_tables | 0 | Enable API to bring in tables from another databases that are not present in the current database |
queuepoll | 0 | Occasionally wake up and poll consumer queues even when no events require it |
replicate_local | 0 | When enabled, record all database events to a comdb2_oplog table. This can be used to set clusters/instances that are fed data from a database cluster. Alternate ways of doing this are planned, so enabling this option should not be needed in the near future. |
enable_tagged_api | 0 | |
enable_snapshot_isolation | 0 | Enable to allow SNAPSHOT level transactions to run against the database |
enable_new_snapshot | 0 | Experimental Enable new SNAPSHOT implementation |
enable_new_snapshot_asof | 0 | Experimental Enable new BEGIN TRANSACTION AS OF implementation |
enable_new_snapshot_logging | 0 | Experimental Enable alternate logging scheme |
disable_new_snapshot | Disables alternate snapshot implementation | |
enable_serial_isolation | 0 | Enable to allow SERIALIZABLE level transactions to run against the database |
update_shadows_interval | 0 | Set to higher than 0 to update snaphots on every Nth operation (default is for every operation) |
enable_lowpri_snapisol | 0 | Give lower priority to locks acquired when updating snapshot state |
disable_lowpri_snapisol | ||
sqlrdtimeout | 10000 (ms) | Set timeout for reading from an SQL connection. |
sqlwrtimeout | 10000 (ms) | Set timeout for writing to an SQL connection. |
log_delete_now | 1 | Set log deletion policy to delete logs as soon as possible. |
log_delete_after_backup | 0 | Set log deletion policy to disable log deletion (can be set by backups, thought the default backups provided by copycomdb2 use a different mechanism) |
log_delete_before_startup | 0 | Set log deletion policy to disable logs older than database startup time. |
on/off | Enable/disable various switches - see switches | |
setattr | Change bdb tunables - see bdb tunables | |
reqldiffstat | 60 (sec) | Set how often the database will dump various usage statistics (each entry will include changes in the last interval) |
reqltruncate | 1 | Disable to always log full SQL queries in request logs (they are truncated by default to save space) |
appsockpool | See thread pools | |
sqlenginepool | See thread pools | |
round_robin_stripes | 0 | Alternate to which table stripe new records are written. The default is to keep stripe affinity by writer. |
no_round_robin_stripes | ||
chkpoint_alarm_time | 60 (sec) | Warn if checkpoints are taking more than this many seconds. |
report_deadlock_verbose | 0 | If set, dump the current thread’s stack for every deadlock. |
disable_pageorder_recsz_check | 0 | If set, allow page order table scans even for pages with overflows. |
enable_pageorder_recsz_check | Disables enable_pageorder_recsz_check | |
disable_overflow_page_trace | 1 | If set, warn when a page order table scan encounters an overflow page. |
enable_overflow_page_trace | If set, don’t warn when a page order table scan encounters an overflow page. | |
enable_selectv_range_check | not set | Experimental If set, SELECTV will send ranges for verification, not every touched record. |
rep_process_txn_trace | not set | If set, report processing time on replicant for all transactions |
no_rep_process_txn_trace | Unsets rep_process_txn_trace | |
ack_trace | not set | Every second, produce trace for ack messages |
no_ack_trace | Turns off ack trace | |
sql_tranlevel_default | Sets the default SQL transaction level for the database, see (SQL transaction levels)[#sql-transaction-levels) | |
sql_time_threshold | 5000 (ms) | Sets the threshold time in ms after which queries are reported as running a long time. |
nowatch | not set | Disable watchdog. Watchdog aborts the database if basic things like creating threads, allocating memory, etc. doesn’t work. |
page_latches | not set | Experimental If set, in rowlocks mode, will acquire fast latches on pages instead of full locks. |
disable_page_latches | Turns off page latches | |
replicant_latches | not set | Experimental Also acquire latches on replicants |
disable_replicant_latches | Turns off page latches on replicants | |
notimeout | not set | Turns off SQL timeouts |
setsqlattr | See (SQL tunables)[#sql-tunables] | |
master_swing_osql_verbose | not set | Produce verbose trace for SQL handlers detecting a master change |
debugthreads | off | If set to ‘on’ enables trace on thread events. |
dumpthreadonexit | off | If set to ‘on’ dump resources held by a thread on exit |
num_record_converts | 100 | During schema changes, pack this many records into a transaction. |
maxcolumns | 255 | Raise the maximum permitted number of columns per table. There’s a hard limit of 1024. |
enable_partial_indexes | not set | If set, allows partial index definitions in table schema. See partial indices |
disable_partial_indexes | Disables partial indices | |
querylimit | See query limit commands | |
maxretries | 500 | Maximum number of times a transactions will be retried on a deadlock |
deadlock_rep_retry_max | not set | If set, will reset the deadlock mode after this many deadlocks on the replicant while applying the log stream. |
enable_sparse_lockerid_map | set | If set, allocates a sparse map of lockers for deadlock resolution |
disable_sparse_lockerid_map | Disables enable_sparse_lockerid_map | |
enable_inplace_blobs | set | Don’t update the rowid of a blob entry on an update |
disable_inplace_blobs | Disables enable_inplace_blobs (needs enable_inplace_blob_optimization, and enable_osql_blob_optimization also enabled - which they are by default) | |
enable_inplace_blob_optimization | Enables inplace blob updates (blobs are updated in place in their b-tree when possible, not deleted/added) | |
disable_inplace_blob_optimization | Disables enable_inplace_blob_optimization | |
enable_osql_blob_optimization | set | Replicant tracks which columns are modified in a transaction to allow blob updates to be omitted if possible |
disable_osql_blob_optimization | Disables disable_osql_blob_optimization | |
pagedeadlock_maxpoll | 5 (ms) | Randomly poll for this many ms and retry a deadlocked component of a rowlocks transaction |
pagedeadlock_retries | 500 | Retry a deadlocked component of a rowlock transaction this many times before reporting deadlock for the transaction. |
enable_cache_internal_nodes | set | Btree internal nodes have a higher cache priority. |
disable_cache_internal_nodes | Disable enable_cache_internal_nodes | |
analyze_tbl_threads | 5 | Number of threads to go through generated samples when generating index statistics |
analyze_comp_threads | 10 | Number of thread to use when generating samples for computing index statistics |
analyze_comp_threshold | 104857600 | Index file size above which we’ll do sampling, rather than scan the entire index. |
print_syntax_err | not set | Trace all SQL with syntax errors. |
survive_n_master_swings | 600 | Have a node retry applying a transaction against a new master this many times before giving up. |
master_retry_poll_ms | 100 | Have a node wait this long after a master swing before retrying a transaction |
osql_verify_retry_max | 499 | Retry a transaction on a verify error this many times - see optimistic concurrency control |
osql_verify_ext_chk | 1 | For block transaction mode only - after this many verify errors, see if transaction is non-commitable - see default isolation level |
pageordertablescan | set | Table scans read the table in page order, not row order. |
tablescan_cache_utilization | 20 | Percent of cache to allow to be used for table scans. |
early | set | When set, replicants will ack a transaction as soon as they acquire locks - not that replication must succeed at that point, and reads on that node will either see the records or block. |
noearly | Disables early . With noearly replicant will wait to commit a transaction locally and release locks before acking. The only advantage is that subsequent reads won’t block on this transaction’s lock - the replication semantics don’t change. |
|
reallearly | not set | Ack as soon as a commit record is seen by the replicant (before it’s applied). This effectively makes replication asynchronous, so reads may not see the effects of a committed transaction yet. |
noreallearly | Disables reallyearly |
|
ctrace_rollat | 0 | Roll database debug trace file ($COMDB2_ROOT/var/log/cdb2/$dbname.trc.c ) at specified size. Set to 0 to never roll. |
ctrace_nlogs | 7 | When rolling trace files, keep this many. The older files will have incrementing number suffixes (.1, .2, etc.) |
ctrace_dbdir | not set | If set, debug trace files will go to the data directory instead of `$COMDB2_ROOT/var/log/cdb2/) |
disable_sql_dlmalloc | not set | If set, will use default system malloc for SQL state machines. By default, each thread running SQL gets a dedicated memory pool. |
decimal_rounding | DEC_ROUND_HALF_EVEN | See decimal rounding options |
mempget_timeout | 60 (seconds) | |
berkattr | See BerkeleyDB attributes | |
keycompr | Enable index compression (applies to newly allocated index pages, rebuild table to force for all pages, see REBUILD | |
nokeycompr | Disable index compression (applies to newly allocated index pages, just like keycompr ) |
|
crypto | See Authentication and Encryption | |
crc32c | set | Use crc32c (alternate faster implementation of CRC32, different checksums) for page checksums |
nocrc32c | Disables crc32c , fall back to CRC32 |
|
rcache | set | Keep a lookaside cache of root pages for b-trees |
norcache | Disables rcache |
|
sqllogger | See request logging | |
location | Sets up default file locations - see file locations | |
include | Include file given as argument. Named file will be processed before continuing processing the current file. | |
temptable_limit | 8192 | Set the maximum number of temporary tables the database can create |
disable_temptable_pool | Disables the pool of temp tables set by temptable_limit , temp tables are created as needed. |
|
enable_upgrade_ahead | not set | Occasionally update read records to the newest schema version (saves some processing when reading them later) |
disable_upgrade_ahead | Disables enable_upgrade_ahead |
|
do | At the end of processing config files, execute the rest of this line as an operational command, see operational Commands | |
memstat_autoreport_freq | 180 (sec) | Dump memory usage to trace files at this frequency |
blob_mem_mb | not set | Blob allocator - sets the max memory limit to allow for blob values (in MB). |
blobmem_sz_thresh_kb | not set | Sets the threshold (in kb) above which blobs are allocated by the blob allocator. |
logmsg | Controls the database logging level - accepts logging commands. |