Slow startup of Dremio caused by reindexing catalog data – Dremio Support

Summary

After upgrading Dremio, the upgrade process may take longer than expected, and logs might appear to be stuck in the same state, giving the impression that the process is hung.

Reported Issue

While the Dremio process might appear to be running, the logs may not be updating, which can create the impression that the process is hung.

Overview

This article addresses the scenario where Dremio takes longer to start due to Lucene reindexing.

Relevant Versions, Tools and Integrations

All versions of Dremio

Steps to Resolve

The user may see Dremio "stuck" on a startup step, typically with the following lines in the log:

2023-05-24 10:15:27,164 [main] INFO c.d.datastore.LocalKVStoreProvider - LocalKVStoreProvider is up
2023-05-24 10:15:27,293 [main] INFO c.d.datastore.LocalKVStoreProvider - Stopping LocalKVStoreProvider

The Dremio process will be running and consuming CPU, but users might think it is "hung" and mistakenly restart it, which only prolongs the reindexing.
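As a rough aid, the check below tests whether the last line of a log file is the "Stopping LocalKVStoreProvider" message quoted above. It is only a sketch: the function name is hypothetical, the log file path is whatever your installation uses, and a match on its own does not prove that a reindex is in progress.

```shell
# Hypothetical helper: succeeds (exit 0) when the last line of the given
# log file is the "Stopping LocalKVStoreProvider" message from this article.
last_line_is_kvstore_stop() {
  tail -n 1 "$1" | grep -q 'LocalKVStoreProvider - Stopping LocalKVStoreProvider'
}

# Example (the log path is an assumption, adjust to your installation):
#   last_line_is_kvstore_stop /var/log/dremio/server.log && echo "log is at the KV store step"
```

If the check succeeds and the Dremio process is still using CPU (for example, as reported by ps -o %cpu= -p <pid>), that is consistent with the reindexing scenario described here rather than a hung process.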

A useful way to determine whether this is the problem at hand is to use the SJK tool, available at https://github.com/aragozin/jvm-tools/tree/master

The ttop subcommand of SJK gives a top-like view of CPU usage per JVM thread, allowing a user to determine in real time which JVM threads are busy. For the above scenario, here is an example of the ttop output:

$ sudo -u dremio java -jar ./sjk-plus-0.21.jar ttop -o CPU -n 50 -p 6959
...
[000076] user=99.00% sys= 0.05% alloc=  449kb/s - Lucene Merge Thread #0
...

Here we can see that the ttop output shows the Lucene merge thread using CPU time. If this thread consistently appears in every iteration, then Dremio is spending its time reindexing.
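To make the Lucene thread easier to spot across many iterations, the ttop output can be filtered. The following is a minimal sketch that assumes the line format shown in the example above; the function name and the capture file name are hypothetical.

```shell
# Hypothetical filter: reads ttop output on stdin and prints the user CPU
# percentage for each "Lucene Merge Thread" line, one line per occurrence.
lucene_merge_cpu() {
  grep 'Lucene Merge Thread' |
    sed -E 's/.*user= *([0-9.]+)%.*- (Lucene Merge Thread #[0-9]+).*/\2 user=\1%/'
}

# Example: capture ttop output to a file for a minute or so, then filter it.
#   sudo -u dremio java -jar ./sjk-plus-0.21.jar ttop -o CPU -n 50 -p 6959 > ttop.out
#   lucene_merge_cpu < ttop.out
```

If the filtered output contains roughly one line per ttop iteration with a high user percentage, the Lucene merge is the dominant activity and the process should be left to finish.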

Common Challenges

Dremio uses Lucene to index data in its catalog. Under certain conditions, Dremio may need to reindex its Lucene data when it starts up. Often this happens after an unexpected shutdown (e.g. a hardware reboot or a forced kill of the process). If Dremio detects that a reindex is required, it is performed on startup.

While this is designed behaviour, it is not obvious from the logs and can take time, depending on the size of the catalog model.

If the user needs Dremio to start faster and not perform reindexing, the following JVM flag can be added to dremio-env:

-Ddremio.catalog.disable_reindex_on_crash=true

Note that skipping the indexing may have implications for searching the catalog in the UI (e.g. inaccurate results).
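As an illustration, the flag could be appended to the JVM options in dremio-env along the following lines. This sketch assumes that your dremio-env passes extra server JVM options through the DREMIO_JAVA_SERVER_EXTRA_OPTS variable; check the file in your own installation before applying it, and restart Dremio for the change to take effect.

```shell
# In dremio-env (assumption: extra server JVM options are read from
# DREMIO_JAVA_SERVER_EXTRA_OPTS). Appends the flag while keeping any
# options that are already set.
DREMIO_JAVA_SERVER_EXTRA_OPTS="${DREMIO_JAVA_SERVER_EXTRA_OPTS:-} -Ddremio.catalog.disable_reindex_on_crash=true"
```

Appending rather than overwriting preserves any JVM options that were configured previously.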

Additional Resources

The indexing needs time to complete but does not always have to occur on startup. However, allowing it to complete on startup means that searches (e.g. looking for a dataset in the UI) will be accurate.

There is an internal JIRA (DX-66277) open to cover adding more log entries to show that indexing is being synced on shutdown and startup.

SJK tool - https://github.com/aragozin/jvm-tools/tree/master