Why is retrieving the HBase region mapping from ZooKeeper much faster from Spark than from local JDBC?

I want to understand why querying HBase from a Spark job (cluster mode) is much faster than from a local JDBC client.

I know that data locality improves performance, but my question is about the region mapping retrieved from ZooKeeper.

I have a huge cluster with huge tables and Apache Phoenix on top. When I run a query from DBeaver (a simple query, only a WHERE with a couple of values for the PK), the first query takes almost 10 minutes because it is retrieving the region mapping from ZooKeeper; subsequent queries take milliseconds.
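To make the "first query slow, later queries fast" observation measurable, something like the following could be used. It is only a minimal sketch: it assumes a Python DB-API 2.0 style connection rather than the JDBC connection DBeaver actually uses, and `time_first_and_repeat_queries` and the `connect` factory are hypothetical names introduced here for illustration. It times the connection setup separately from each execution of the same query.

```python
import time
from typing import Any, Callable, List, Sequence, Tuple


def time_first_and_repeat_queries(
    connect: Callable[[], Any],
    sql: str,
    params: Sequence[Any] = (),
    repeats: int = 3,
) -> Tuple[float, List[float]]:
    """Return (connect_seconds, [seconds for each execution of sql]).

    `connect` is a hypothetical zero-argument factory that returns a
    DB-API 2.0 connection; it is an assumption, not part of any real driver.
    """
    start = time.perf_counter()
    conn = connect()
    connect_seconds = time.perf_counter() - start

    timings: List[float] = []
    try:
        for _ in range(repeats):
            cur = conn.cursor()
            start = time.perf_counter()
            cur.execute(sql, params)
            cur.fetchall()  # drain the result so the full round trip is timed
            timings.append(time.perf_counter() - start)
            cur.close()
    finally:
        conn.close()
    return connect_seconds, timings
```

If the first entry of the returned list is far larger than the rest, the cost is in whatever the client does once per connection (such as the region lookup described above) rather than in the query itself. The placeholder style inside `sql` depends on the driver, so it is left to the caller.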

When I run the query in a Spark job, it takes almost 1/2 minutes to do the whole batch of queries (region mapping plus queries with millions of keys). I would expect it to be faster, but not 10 times faster, because as far as I know each Spark job creates a new JVM on each executor, so it cannot already have the region mapping cached.

Can someone explain why this is happening, or what I am misunderstanding? Thanks.



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
