Summary
After upgrading Dremio, startup may take longer than expected, and the logs might appear to be stuck in the same state, giving the impression that the process is hung.
Reported Issue
While the Dremio process might appear to be running, the logs may not be updating, which can create the impression that the process is hung.
Overview
This article addresses the scenario where Dremio takes longer to start due to Lucene reindexing.
Relevant Versions, Tools, and Integrations
All versions of Dremio
Steps to Resolve
The user may see Dremio "stuck" on a startup step, typically with the following as the last lines in the log:
2023-05-24 10:15:27,164 [main] INFO c.d.datastore.LocalKVStoreProvider - LocalKVStoreProvider is up
2023-05-24 10:15:27,293 [main] INFO c.d.datastore.LocalKVStoreProvider - Stopping LocalKVStoreProvider
The Dremio process will be running and consuming CPU, but users might think it is "hung" and mistakenly restart it, thereby prolonging the reindexing.
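Before reaching for a restart, it can help to confirm that the log really has stopped at the line quoted above. The following is a minimal sketch, not a Dremio-provided tool: the function name log_ends_at_kvstore_stop is hypothetical, and the log path in the usage comment is an assumption that depends on your installation.

```shell
# Return 0 if the last line of the given log file is the
# "Stopping LocalKVStoreProvider" message quoted above, non-zero otherwise.
# (Hypothetical helper; it only inspects the file and never touches Dremio.)
log_ends_at_kvstore_stop() {
    tail -n 1 "$1" | grep -q 'LocalKVStoreProvider - Stopping LocalKVStoreProvider'
}

# Usage (the path is an assumption; substitute your own server log):
#   if log_ends_at_kvstore_stop /var/log/dremio/server.log; then
#       echo "Log is paused at the LocalKVStoreProvider step; check thread activity before restarting."
#   fi
```

A match only shows that the symptom is present. It does not prove that reindexing is the cause, which is what the thread-level check is for.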
A useful way to determine whether this is the problem at hand is to use the SJK tool, available here: https://github.com/aragozin/jvm-tools/tree/master
The ttop subcommand of SJK correlates JVM threads with top output, allowing a user to determine in real time which JVM threads are busy. For the above scenario, here is an example of the ttop output:
$ sudo -u dremio java -jar ./sjk-plus-0.21.jar ttop -o CPU -n 50 -p 6959
...
[000076] user=99.00% sys= 0.05% alloc= 449kb/s - Lucene Merge Thread #0
...
Here we can see that the ttop command output shows the Lucene thread using CPU time. If this thread consistently appears in every iteration, then Dremio is spending time reindexing.
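To judge "consistently appears" without watching the terminal, the ttop output can be piped through a small counter. This is a rough sketch under stated assumptions: the function name count_lucene_merge_lines is hypothetical, the jar name and PID in the usage comment are the ones from the example above, and the use of timeout to stop sampling after 30 seconds is an illustrative choice.

```shell
# Count the lines on stdin that mention a Lucene merge thread.
# Prints the count; always exits 0, even when nothing matches
# (grep -c itself exits 1 on zero matches, hence the "|| true").
count_lucene_merge_lines() {
    grep -c 'Lucene Merge Thread' || true
}

# Usage (assumes the jar and PID from the example above; samples for about 30 seconds):
#   sudo -u dremio timeout 30 java -jar ./sjk-plus-0.21.jar ttop -o CPU -n 50 -p 6959 \
#       | count_lucene_merge_lines
```

A count that keeps growing across repeated runs suggests the merge thread is still busy. A count of zero suggests the slow startup has another cause.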
Common Challenges
Dremio uses Lucene to index data in its catalog. Under certain conditions, Dremio may need to reindex its Lucene data when it starts up. Often this is after an unexpected shutdown (e.g. a hardware reboot or a forced kill of the process). If Dremio detects that a reindex is required, it performs the reindex on startup.
While this is designed behaviour, it is not obvious from the logs and can take time, depending on the size of the catalog.
If the user needs Dremio to start faster and not perform reindexing, then the following JVM flag can be added to the dremio-env file:
-Ddremio.catalog.disable_reindex_on_crash=true
Note that if the indexing is skipped, this may affect searching the catalog in the UI (e.g. inaccurate results).
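As an illustration of where the flag might go, here is a sketch of a dremio-env fragment. It assumes that your installation passes extra server JVM options through the DREMIO_JAVA_SERVER_EXTRA_OPTS variable; check your own dremio-env for the variable it actually uses, and append to any existing value rather than overwriting it.

```shell
# dremio-env (fragment)
# Assumption: DREMIO_JAVA_SERVER_EXTRA_OPTS carries extra JVM options for the Dremio server.
# If the variable is already set in your file, add the flag to the existing value instead.
DREMIO_JAVA_SERVER_EXTRA_OPTS='-Ddremio.catalog.disable_reindex_on_crash=true'
```

The change takes effect on the next start of the Dremio process.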
Additional Resources
The indexing needs time to complete, but it does not always have to occur on startup. However, allowing it to complete on startup means that searches (e.g. looking for a dataset in the UI) will be accurate.
There is an internal JIRA (DX-66277) open to cover adding more log entries to show that indexing is being synced on shutdown and startup.
SJK tool - https://github.com/aragozin/jvm-tools/tree/master