Indexing on low cardinality columns is degrading the performance
Hi,
I am experiencing performance issues with the following query in MariaDB 10.7.3:
query: SELECT GROUP_CONCAT(DISTINCT test_table1.fg_item) FROM test_table1 JOIN test_table2 ON (test_table1.component_item = test_table2.assy_item) OR (test_table1.assembly_item = test_table2.assy_item) WHERE test_table2.cmpnt_item = 'CMP_ITEM1' AND test_table1.fg_active_part = 'yes';
The fg_active_part column in test_table1 has low cardinality with values either 'yes' or 'no'. According to MariaDB documentation, indexing on low cardinality columns will be shunned(Reference https://mariadb.com/kb/en/building-the-best-index-for-a-given-select/#flags-and-low-cardinality).
When the fg_active_part column is indexed, the query performance degrades significantly and does take a lot of time. Conversely, when the fg_active_part column is not indexed, query performance improves and the execution is much faster.
Can someone explain why indexing the fg_active_part column leads to performance degradation rather than improving or maintaining the same performance? Are there any best practices or alternative indexing strategies to optimize this query?
Any help or insights into resolving this issue would be greatly appreciated. Thank you!
Thanks, Harinath
Answer Answered by Daniel Black in this comment.
First note that the 10.7 release series is at the end of life.
The 11.0 release series and newer, like the 11.4 LTS release series of MariaDB have improved optimizer functions capable of handling the OR criteria of your JOIN in a much better way estimating the cost better and likely to generate a better query plan.
Yes low cardinality should usually be shunned, except in the case where the dataset of the table can be retreived from just the index.
Using ANALYZE FORMAT=JSON {query} can help see where either r_ amount, the real, differers from the estimate without the r_ prefix.
Optimizer trace allows a trace of how the query plan was made.
Strongly recommend updating to the 11.4 version in a test environment and measuring how well or otherwise the query performs there. Include ANALYZE or optimizer trace if still getting suboptimal queries.