How to resolve common issues
Below is a list of common issues with running validator and data nodes, and their solutions.
If you come across an issue that isn't addressed here, share it in the validator Discord channel or on the feedback board.
Some of the issues below will be addressed in future software versions, while others may be the result of misconfiguration.
Problem: Data node fails to start because it times out when fetching history for segment
You may see the following information in your error log.
INFO datanode.start.persistentPre start/node_pre.go:121 Auto Initialising Datanode From Network History
INFO datanode.start.persistentPre networkhistory/initialise.go:61 got most recent history segment {"segment": "from_height:136001 to_height:137000 history_segment_id:\"QmWtU5Xks8CewedyYtibLZ32tgSifoNozyQTLibYNmTmcV\" previous_history_segment_id:\"Qmc6uqPnLdMUTgLuvm5FmSGe3WarveJ9UvB1GvbEf6XsvT\"", "peer": "143.198.66.233:3007"}
INFO datanode.start.persistentPre start/node_pre.go:125 fetching history using as the first segment:{QmWtU5Xks8CewedyYtibLZ32tgSifoNozyQTLibYNmTmcV} and minimum blocks to fetch 1
INFO datanode.start.persistentPre networkhistory/initialise.go:81 fetching history for segment id:QmWtU5Xks8CewedyYtibLZ32tgSifoNozyQTLibYNmTmcV
failed to initialize datanode from network history: failed to fetch history blocks:failed to fetch history:could not write out the fetched history segment: context deadline exceeded
Solution: Restart data node
If the above issue happens, try to start the data node one more time.
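If you launch the data node from a script, the retry can be automated. The following is a minimal sketch, not part of the Vega tooling: `run_with_retries`, `start_command`, the attempt count and the delay are all illustrative assumptions, and the command you pass in is whatever you already use to start your data node.

```python
import subprocess
import time


def run_with_retries(start, attempts=3, delay_seconds=10.0, sleep=time.sleep):
    """Call start() until it returns exit code 0 or the attempts run out.

    start is any callable that runs the node (blocking while it runs) and
    returns its exit code. A non-zero code triggers another attempt.
    Returns the number of the attempt that exited cleanly, or None.
    """
    for attempt in range(1, attempts + 1):
        if start() == 0:
            return attempt
        if attempt < attempts:
            sleep(delay_seconds)  # pause before the next attempt
    return None


def start_command(argv):
    """Build a start() callable that runs argv and returns its exit code."""
    return lambda: subprocess.call(argv)
```

For example, `run_with_retries(start_command(argv))`, where `argv` is your own start command as a list of strings. If every attempt fails with the same timeout, a plain restart is not enough and the logs need a closer look.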
Problem: Data node fails to start up with the following panic
You may see the following information in your panic log.
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x44 pc=0x8791c4]
goroutine 144 [running]:
code.vegaprotocol.io/vega/cmd/vega/node.(*Command).Run.func1.1()
/home/runner/work/vega/vega/cmd/vega/node/node.go:137 +0x59
panic({0x2d7fa00, 0x56540b0})
/opt/hostedtoolcache/go/1.19.5/x64/src/runtime/panic.go:884 +0x212
github.com/cosmos/iavl.(*MutableTree).AvailableVersions(0x0)
/home/runner/go/pkg/mod/github.com/cosmos/iavl@v0.19.4/mutable_tree.go:93 +0x64
code.vegaprotocol.io/vega/core/snapshot.(*Engine).CheckLoaded(0xc001339860)
/home/runner/work/vega/vega/core/snapshot/engine.go:363 +0x59
code.vegaprotocol.io/vega/core/processor.(*App).Info(0xc001fe2840, {{0x76?, 0x63f14b18?}, 0xc003df6f40?, 0x7fffffffffffffff?})
/home/runner/work/vega/vega/core/processor/abci.go:534 +0x1e2
code.vegaprotocol.io/vega/core/blockchain/abci.(*App).Info(0xe?, {{0x330f60b?, 0x1f?}, 0xc003dda7e8?, 0x40dd7f?})
/home/runner/work/vega/vega/core/blockchain/abci/abci.go:27 +0x6b
code.vegaprotocol.io/vega/cmd/vega/node.(*appW).Info(0xc002dc0420?, {{0x330f60b?, 0xc00228dba0?}, 0x565f340?, 0xc003dda878?})
/home/runner/work/vega/vega/cmd/vega/node/app_wrapper.go:33 +0x59
github.com/tendermint/tendermint/abci/client.(*localClient).InfoSync(0xc003dc8600, {{0x330f60b?, 0xc003dda930?}, 0x4e3625?, 0x1?})
/home/runner/go/pkg/mod/github.com/informalsystems/tendermint@v0.34.25/abci/client/local_client.go:224 +0x107
github.com/tendermint/tendermint/proxy.(*appConnQuery).InfoSync(0x4e6e86?, {{0x330f60b?, 0xc003dda9f0?}, 0x4e6dc4?, 0xc0000b20c0?})
/home/runner/go/pkg/mod/github.com/informalsystems/tendermint@v0.34.25/proxy/app_conn.go:155 +0x29
github.com/tendermint/tendermint/consensus.(*Handshaker).Handshake(0xc004c4eda8, {0x3b862b0, 0xc0022f0b60})
/home/runner/go/pkg/mod/github.com/informalsystems/tendermint@v0.34.25/consensus/replay.go:244 +0x76
github.com/tendermint/tendermint/node.doHandshake({_, _}, {{{0xb, 0x1}, {0xc002373320, 0x7}}, {0xc00038f3c0, 0x1a}, 0x1, 0x6cf42, ...}, ...)
/home/runner/go/pkg/mod/github.com/informalsystems/tendermint@v0.34.25/node/node.go:329 +0x1b8
github.com/tendermint/tendermint/node.NewNode(0xc001fe1540, {0x3b56970, 0xc0003f23c0}, 0xc001ffef70, {0x3b3bd40, 0xc001fd56b0}, 0x0?, 0x0?, 0xc001ffefb0, {0x3b5c5b0, ...}, ...)
/home/runner/go/pkg/mod/github.com/informalsystems/tendermint@v0.34.25/node/node.go:777 +0x597
code.vegaprotocol.io/vega/core/blockchain/abci.NewTmNode({{0x0}, 0x1, 0x1, 0x0, 0x0, {0xc001e7c1f0, 0xa}, {{0x0}, {0xc001e1ae88, 0x15}}, ...}, ...)
/home/runner/work/vega/vega/core/blockchain/abci/tm_node.go:78 +0x6a5
code.vegaprotocol.io/vega/cmd/vega/node.(*Command).startABCI(0xc0019c3c00, {0x3b80100, 0xc001fcda40}, {0x7ffef0370e93, 0x1a}, {0x0?, 0x0?}, {0x0?, 0x0?})
/home/runner/work/vega/vega/cmd/vega/node/node.go:412 +0x1a7
code.vegaprotocol.io/vega/cmd/vega/node.(*Command).startBlockchain(0xc0019c3c00, {0x7ffef0370e93?, 0x0?}, {0x0, 0x0}, {0x0, 0x0})
/home/runner/work/vega/vega/cmd/vega/node/node.go:299 +0x39a
code.vegaprotocol.io/vega/cmd/vega/node.(*Command).Run.func1()
/home/runner/work/vega/vega/cmd/vega/node/node.go:140 +0x7e
created by code.vegaprotocol.io/vega/cmd/vega/node.(*Command).Run
/home/runner/work/vega/vega/cmd/vega/node/node.go:131 +0x6af
Solution: Check permissions
The above problem is caused by wrong file permissions for the snapshot database.
To verify: List all files in <vega-home>/state/node/snapshots/snapshot.db/ and check that all files are owned by the user that runs the data node service.
To fix: Change the ownership of all files in that directory to the user that runs the data node service, and start the service again.
Possible cause: You may have run the following command when connected as a different user, e.g. root:
# list all the local snapshots
sudo ./vega tools snapshot
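The ownership check described above can be scripted. This is a minimal sketch under stated assumptions: `files_not_owned_by` is a hypothetical helper, and you supply the snapshot database path and the numeric user ID of the account that runs your service.

```python
import os


def files_not_owned_by(directory, uid):
    """Return paths in directory (including itself) not owned by the numeric uid."""
    mismatched = []
    if os.lstat(directory).st_uid != uid:
        mismatched.append(directory)
    for root, dirs, files in os.walk(directory):
        for name in dirs + files:
            path = os.path.join(root, name)
            # lstat inspects the entry itself and does not follow symlinks
            if os.lstat(path).st_uid != uid:
                mismatched.append(path)
    return sorted(mismatched)
```

To look up the numeric ID for a user name, the standard library call `pwd.getpwnam("<service-user>").pw_uid` can be used, where `<service-user>` is your own service account. An empty result means every entry already belongs to that user.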
Problem: panic: state.AppHash does not match AppHash after replay. Got XXXXXXXXXX, expected YYYYYYYYYY.
After a protocol upgrade you may get the above error, for example:
core node/node.go:225 Vega shutdown complete {"version": "v0.72.10", "version-hash":
"26afd41a2fe4cb20f3fffeae0d4cfe523fc35614"}
Aug 23 14:01:50 vega-testnet visor[3605726]: panic: state.AppHash does not match AppHash after replay. Got
Aug 23 14:01:50 vega-testnet visor[3605726]: FA62EFCF5AAA70B26FB2D746B5BD7FC3709CBA2132A7597D796DD9DC16B65EE0, expected D8CDD4DC0B1133EFF36985B76A9D5FE4DF904933FC85F6B306A00E1357CEF94F.
Did you reset CometBFT without resetting your application's data?
Solution: Start from a snapshot before the protocol upgrade.
- List all of the local snapshots on your node:
vega tools snapshot --home <vega_home>
- Get the snapshot height for the snapshot before the protocol upgrade
- If you are using Visor, make sure that the directories for the new release exist in <vegavisor-home>
- Roll back a single Tendermint block:
vega tm rollback --home <tendermint_home>
- Start the previous binary version with the additional flag --snapshot.load-from-block-height=<selected-snapshot-height>, e.g.: ./vega start --snapshot.load-from-block-height=7885400 --home <vega-home> --tendermint-home <tendermint-home>. If you are using Visor, you need to add the above flag to the run-config.toml (e.g.: <vegavisor-home>/v0.71.6/run-config.toml).
- Wait for the upgrade. The new node may start automatically if you are using Visor, but it is not guaranteed.
- Remove the previously added flag (--snapshot.load-from-block-height).
- If your node did not restart automatically:
a. For Visor: Stop Visor, link the correct version to <vegavisor-home>/current, then start Visor again to restart your node.
b. For a non-Visor setup: Stop the node, then start the node with the new binary.
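Picking the snapshot height for the steps above is simple arithmetic once you have the list of local snapshot heights and the block height of the protocol upgrade. The sketch below is illustrative only: `snapshot_before` and `load_flag` are hypothetical helpers, the heights are typed in by hand rather than parsed from the tool output, and treating "before the upgrade" as strictly below the upgrade height is an assumption.

```python
def snapshot_before(snapshot_heights, upgrade_height):
    """Return the highest snapshot height strictly below upgrade_height, or None."""
    candidates = [h for h in snapshot_heights if h < upgrade_height]
    return max(candidates) if candidates else None


def load_flag(height):
    """Format the start flag named in the steps above for the chosen height."""
    return f"--snapshot.load-from-block-height={height}"
```

For example, with snapshots at 7885100, 7885400 and 7885700 and an upgrade at block 7885500 (made-up numbers apart from 7885400), `snapshot_before` returns 7885400. A result of `None` means no local snapshot predates the upgrade.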
Problem: Data Node is at block 0, while Core is processing blocks ok
When you request /statistics from your Data Node REST API, you will get:
- blockHeight and vegaTime in the response body: this is the information about core, and it looks ok
- X-Block-Height and X-Block-Timestamp response headers: this is the information about the data node, and you get X-Block-Height: 0
  - you might also get a non-zero value that is way behind the core block height and is not increasing (if it increases, then it is a different issue)
It means your core process is working ok, but it is not sending any data to the data node process.
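The symptom can be confirmed by sampling /statistics twice, some time apart, and comparing the two heights. The sketch below is a rough illustration, not an official client. `take_sample` and `data_node_is_stalled` are hypothetical helpers, and the nesting of blockHeight under a top-level "statistics" object is an assumption about the response body that you should check against your own node.

```python
import json
import urllib.request


def take_sample(base_url):
    """Return (core_height, data_node_height) from one /statistics request."""
    with urllib.request.urlopen(base_url + "/statistics") as resp:
        body = json.load(resp)
        node_height = int(resp.headers["X-Block-Height"])
    # Assumption: blockHeight sits under a top-level "statistics" object.
    core_height = int(body["statistics"]["blockHeight"])
    return core_height, node_height


def data_node_is_stalled(samples):
    """samples: list of (core_height, data_node_height) pairs, oldest first.

    Returns True when core advanced between the first and last sample
    but the data node height did not.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples taken some time apart")
    (first_core, first_node), (last_core, last_node) = samples[0], samples[-1]
    return last_core > first_core and not last_node > first_node
```

If the data node height is behind but still rising between samples, the function returns False, which matches the note above that an increasing value points to a different issue.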
Solution: Fix config and restart from remote network history
- Check your data node's broker IP and port in the data node config.toml:
[Broker]
# ...
[Broker.SocketConfig]
IP = "0.0.0.0" # 0.0.0.0 will serve on all network interfaces
Port = 3005 # make sure it matches core config
- Verify that core has the broker enabled and pointed to the correct IP and port in the core config.toml:
[Broker]
# ...
[Broker.Socket]
# ...
Address = "127.0.0.1" # your Data Node endpoint
Port = 3005 # your Data Node listening port
Enabled = true # send data to the Data Node
- If you didn't have to change your config, you have a different issue. If you changed your config (either fixed the port, or set Enabled to true), continue with the steps below.
- Stop your core and data node (or just vegavisor)
- Configure the data node to start from remote Network History. For this, modify the data node config.toml:
AutoInitialiseFromNetworkHistory = true
# ...
[SQLStore]
WipeOnStartup = true # IMPORTANT: remember to change this back to false once node is started
# ...
[NetworkHistory]
# ...
Enabled = true
WipeOnStartup = true # IMPORTANT: remember to change this back to false once node is started
# ...
[NetworkHistory.Store]
# ...
BootstrapPeers = ["<trusted-peer>"] # you need to put at least one trusted peer here
# ...
[NetworkHistory.Initialise]
# ...
TimeOut = "1h" # Make sure the timeout is not too small
- Make sure the core timeout is not too small. For that, modify the core config.toml:
[Broker]
# ...
[Broker.Socket]
DialTimeout = "1h" # Make sure the timeout is not too small
- If you are using Vega Visor, make sure the timeout is not too small. For that, modify the Visor config.toml:
# Try every 2 seconds, 1800 retries is 1h
maxNumberOfFirstConnectionRetries = 1800
- Start vegavisor or core + data node (order doesn't matter)
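Two of the checks in the steps above lend themselves to a quick script: whether the core broker settings match the data node listener, and how many Visor retries cover a given timeout at one try every 2 seconds. This is a minimal sketch with hypothetical helper names. Both configs are assumed to be already parsed into nested dictionaries that mirror the TOML tables shown above.

```python
def broker_mismatches(core_cfg, data_node_cfg):
    """Compare core [Broker.Socket] with data node [Broker.SocketConfig].

    Returns a list of human-readable problems; an empty list means the
    broker is enabled and both sides use the same port.
    """
    core = core_cfg["Broker"]["Socket"]
    node = data_node_cfg["Broker"]["SocketConfig"]
    problems = []
    if not core.get("Enabled", False):
        problems.append("core Broker.Socket.Enabled is not true")
    if core.get("Port") != node.get("Port"):
        problems.append(
            f"port mismatch: core sends to {core.get('Port')}, "
            f"data node listens on {node.get('Port')}"
        )
    return problems


def retries_for(timeout_seconds, retry_interval_seconds=2):
    """Number of retries needed to cover timeout_seconds, rounding up."""
    return -(-timeout_seconds // retry_interval_seconds)
```

`retries_for(3600)` gives 1800, the value used for a 1h timeout above. The address is deliberately not compared, because the data node may listen on 0.0.0.0 while core dials a specific IP.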
Problem: failed to copy data into hyper-table ...: ERROR: extra data after last expected column (SQLSTATE 22P04)
2023-10-13T17:29:22.607+0800 INFO datanode.networkHistory.service snapshot/service_load_snapshot.go:300 copying vega-mainnet-0011-22-18710202-18710501.zip into database
failed to initialize datanode from network history: failed to load history into the datanode: failed to load snapshot data:failed to load segments for database version 22: failed to load history segments: failed to load history segment {Network History Segment for Chain ID:vega-mainnet-0011 Height From:18710202 Height To:18710501}: failed to copy data into the database vega-mainnet-0011-22-18710202-18710501.zip : failed to copy data into table market_data: failed to copy history table data into database: failed to copy data into hyper-table market_data: ERROR: extra data after last expected column (SQLSTATE 22P04)
The error usually occurs when you load a specific history segment (e.g. the last one) into an incorrect version of Vega, for example an old version of the binary.
Vega adds, modifies, and removes SQL schemas during migrations between versions. You must load data from a specific segment into the version of the database that the segment was created for.
Solution: Fix the vega version
Check the Vega version with the command vega version and compare it with the version the segment was created for. To do this, you can use the /api/v2/snapshots endpoint (e.g. https://api0.vega.community/api/v2/snapshots) to check which version was deployed at the block you are interested in.
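To know which block range to look up, the segment file name from the error log can be split into its parts. The sketch below assumes the name layout is chain ID, database version, from-height and to-height, which is inferred from the log example above (vega-mainnet-0011-22-18710202-18710501.zip alongside "database version 22") and is not a documented format. `parse_segment_name` is a hypothetical helper.

```python
import re

# Assumed layout: <chain-id>-<db-version>-<from-height>-<to-height>.zip
SEGMENT_NAME = re.compile(
    r"^(?P<chain_id>.+)-(?P<db_version>\d+)-(?P<from_height>\d+)-(?P<to_height>\d+)\.zip$"
)


def parse_segment_name(file_name):
    """Split a history segment file name into its parts, or return None."""
    match = SEGMENT_NAME.match(file_name)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "chain_id": parts["chain_id"],
        "db_version": int(parts["db_version"]),
        "from_height": int(parts["from_height"]),
        "to_height": int(parts["to_height"]),
    }
```

The from and to heights give the block range to compare against the snapshot information returned by the endpoint above.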
Problem: My validator node is running with Visor, but it fails with auto-install.
First, you should check the logs from Visor. An example log is below:
Oct 24 11:33:09 moonrock visor[260678]: Error: failed to prepare next upgrade folder: failed to auto install folder "/home/vega/vega-visor/v0.73.0-rc.1" for release "v0.73.0-rc.1": missing required auto install vega asset definition in Visor config
Solution: Check the Visor config
The most frequent cause of a failed auto-install is an incorrect Visor configuration.
Visor works the following way during auto-install:
- Checks if auto-install is enabled
- Copies the folder and the run-config.toml from the previous version, that is, the version you are doing the protocol upgrade from
- Downloads the binary from the given repository; the binary name and repository are taken from the Visor config
- Starts the downloaded binary
The correct config for Visor should look similar to the following:
# Try every 2 seconds, 172800 retries is 96h
maxNumberOfFirstConnectionRetries = 172800
maxNumberOfRestarts = 3
restartsDelaySeconds = 5
stopDelaySeconds = 0
stopSignalTimeoutSeconds = 15
[upgradeFolders]
"vX.X.X" = "vX.X.X"
[autoInstall]
enabled = true
repositoryOwner = "vegaprotocol"
repository = "vega"
[autoInstall.asset]
name = "vega-linux-amd64.zip"
binaryName = "vega"
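A quick way to catch the "missing required auto install vega asset definition" error before an upgrade is to check that the auto-install keys from the sample config are present. The sketch below is an assumption-laden illustration: the key list is copied from the example above rather than from Visor's own validation rules, and both helpers are hypothetical. `load_toml` relies on `tomllib`, which is in the standard library from Python 3.11.

```python
# Keys taken from the sample config above, not from Visor's source.
REQUIRED_AUTO_INSTALL_KEYS = [
    ("autoInstall", "enabled"),
    ("autoInstall", "repositoryOwner"),
    ("autoInstall", "repository"),
    ("autoInstall", "asset", "name"),
    ("autoInstall", "asset", "binaryName"),
]


def missing_auto_install_keys(config):
    """Return dotted names of expected keys absent from the parsed config dict."""
    missing = []
    for path in REQUIRED_AUTO_INSTALL_KEYS:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


def load_toml(path):
    """Parse a TOML file into a dict (requires Python 3.11 or newer)."""
    import tomllib

    with open(path, "rb") as handle:
        return tomllib.load(handle)
```

Calling `missing_auto_install_keys(load_toml(path))` on your Visor config should return an empty list when the [autoInstall] and [autoInstall.asset] tables are filled in as shown.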
Problem: Data node ERROR: out of shared memory (SQLSTATE 53200)
The error means the data node sent the database a query that locks more items in a single transaction than your PostgreSQL config allows.
You can usually see a similar error in the data node logs:
vega data node stopped with error: failed to flush subscriber:flushing ledger: failed to copy "ledger" entries into database: ERROR: out of shared memory (SQLSTATE 53200)
Solution: Update the max_locks_per_transaction parameter in the PostgreSQL config
- Update your PostgreSQL config file, which is usually located at /etc/postgresql/14/main/postgresql.conf. Change max_locks_per_transaction to a bigger number, such as 256 or more.
- Restart your PostgreSQL server
- Start your data node.
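Before restarting, it can help to confirm what the config file actually sets. The sketch below reads the text of postgresql.conf and reports whether max_locks_per_transaction is below a chosen minimum. The helper names are hypothetical, the 256 threshold is the figure suggested above, and a missing or commented-out entry is treated as the PostgreSQL default of 64.

```python
import re


def read_setting(conf_text, name):
    """Return the last uncommented value of name in postgresql.conf text, or None."""
    pattern = re.compile(
        r"^[ \t]*" + re.escape(name) + r"(?:[ \t]*=[ \t]*|[ \t]+)([^#\s]+)",
        re.MULTILINE,
    )
    matches = pattern.findall(conf_text)
    # When a parameter appears more than once, the last entry takes effect.
    return matches[-1].strip("'") if matches else None


def locks_setting_too_low(conf_text, minimum=256):
    """True if max_locks_per_transaction is unset or below minimum."""
    value = read_setting(conf_text, "max_locks_per_transaction")
    effective = int(value) if value is not None else 64  # PostgreSQL default
    return effective < minimum
```

This only inspects the file you pass in. Settings overridden elsewhere, for example in an included file, are not covered by this sketch.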
Problem: CONSENSUS FAILURE at block 26439343
Your node may fail at block 26439343 with the following error: panic: cannot unregister order with potential sell + size change < 0
If the block is different, this is a different issue.
There was a mainnet bug that crashed the network.
Solution: Apply patches to the mainnet network
See the following procedure: Apply patches for the consensus failure at block 26439343.
Problem: CONSENSUS FAILURE at block 34669654
Your node may fail at block 34669654 with the following error: runtime error: invalid memory address or nil pointer dereference
If the block is different, this is a different issue.
There was a mainnet bug that crashed the network.
Solution: Apply patches to the mainnet network
See the following procedure: Apply patches for the consensus failure at block 26439343.
Problem: You encounter deadlocking issues with the Postgres database causing it to crash.
We have found during testing that for some scenarios, such as snapshot restores, Postgres can encounter deadlocks when refreshing the continuous aggregates. This leads to Postgres panics and causes the data node to crash.
Solution: Ensure you have the noatime flag set for your mounted drives.
Ubuntu defaults do not set the noatime flag for mounted drives. This can have a significant performance impact on the database and causes the deadlocking issues. While not a panacea for all deadlocking issues, it is a good first step to take.
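To see which mounts currently lack the flag, the mount table can be scanned for entries without noatime in their options. This is a minimal sketch: `mounts_without_noatime` is a hypothetical helper, it expects text in the /proc/mounts layout (device, mount point, filesystem type, options), and the default list of filesystem types to inspect is an assumption you may need to adjust.

```python
def mounts_without_noatime(mounts_text, fstypes=("ext4", "xfs", "btrfs")):
    """Return mount points of the given filesystem types that lack noatime."""
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fstype, options = fields[:4]
        if fstype in fstypes and "noatime" not in options.split(","):
            missing.append(mount_point)
    return missing
```

On Linux the input can be read with `open("/proc/mounts").read()`. Any mount point in the result that holds the Postgres data directory is a candidate for adding noatime to its mount options.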