Optimality difference between beam search and Q-learning in a mountain road traversal problem

I am trying to design a custom problem statement, which I call the mountain road traversal problem. It goes as follows:

There is a mountain. N roads start from it. Each road has a certain reward (the number of restaurants available on that road). Every road splits further into more roads, and the same reward structure applies to each branching road. Roads split finitely, up to a certain depth (the depth can be large, e.g. 10).

Example diagram:

*(image not reproduced: the example road tree, with a reward marked on each road)*

The first three roads starting from the mountain are A1, A2, and A3. A1 splits further into A11, A12, and A13; A2 splits into A21 and A22; A3 splits into A31, A32, A33, and A34. Each road has its own reward.

Problem: I want to explore this tree optimally to maximize rewards. For example, in the diagram above the most reward-maximizing path is {A2, A22}. I would also like to get the top N reward-maximizing paths; for example, the top 3 would be ({A2, A22}, {A2, A21}, {A1, A13}). I want to find these paths with minimum exploration.
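For concreteness, here is how I represent the example tree in Python. The reward values below are not the real ones from my diagram; they are hypothetical numbers I picked only so that the behaviour described in this post holds (greedy is tempted by A1 first, yet the top-3 paths come out as {A2, A22}, {A2, A21}, {A1, A13}):

```python
# Hypothetical rewards, chosen only to reproduce the rankings described above;
# each entry maps a road to (reward, child roads).
TREE = {
    "A1":  (9, ["A11", "A12", "A13"]),
    "A2":  (8, ["A21", "A22"]),
    "A3":  (1, ["A31", "A32", "A33", "A34"]),
    "A11": (0, []), "A12": (1, []), "A13": (2, []),
    "A21": (4, []), "A22": (7, []),
    "A31": (0, []), "A32": (1, []), "A33": (0, []), "A34": (2, []),
}
ROOTS = ["A1", "A2", "A3"]   # the roads that start at the mountain
```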

What I have tried:

  1. Brute force: highly inefficient. It wastes time exploring roads like A3 and its branches, which do not yield anything (a minimal sketch of what I mean follows this list).
  2. Greedy: it considers road A1 first (and its subsequent branches), but we know road A2 is better, so this is also inefficient.
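This is roughly the brute-force version, using the toy `TREE` above: it enumerates every root-to-leaf path and sorts, so it does get the right answer, but only by visiting every road, including all of A3's branches.

```python
import heapq

def all_paths(tree, roots):
    """Exhaustively enumerate every root-to-leaf path with its total reward."""
    def walk(node, path, total):
        reward, children = tree[node]
        path, total = path + [node], total + reward
        if not children:                      # leaf road: a complete path
            yield total, path
        for child in children:
            yield from walk(child, path, total)
    for root in roots:
        yield from walk(root, [], 0)

def brute_force_top_n(tree, roots, n=3):
    """Top-n paths by total reward; explores the whole tree."""
    return heapq.nlargest(n, all_paths(tree, roots))

# brute_force_top_n(TREE, ROOTS, 3)
# -> [(15, ['A2', 'A22']), (12, ['A2', 'A21']), (11, ['A1', 'A13'])]
```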

Question: What is the optimality difference between beam search and Q-learning algorithm in this scenario?

My understanding:

Beam search: I can set the beam width according to the top N paths that I want. It is true that I may not get the actual optimal N paths in my final solution, but I believe this is the best step beyond naive greedy (correct me if I am wrong).
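To make the beam-search half of the question concrete, this is the kind of thing I have in mind (a rough sketch over the toy `TREE` above, not a polished implementation): keep only the best `beam_width` partial paths at each level, so branches like A3's get pruned early, at the cost of possibly missing some of the true top-N paths.

```python
import heapq

def beam_search_top_n(tree, roots, beam_width=3):
    """Keep only the best `beam_width` partial paths per level.
    Returns up to `beam_width` complete paths as (total_reward, path)."""
    beam = heapq.nlargest(beam_width, [(tree[r][0], [r]) for r in roots])
    finished = []
    while beam:
        candidates = []
        for total, path in beam:
            _, children = tree[path[-1]]
            if not children:                           # leaf road: path is complete
                finished.append((total, path))
            for child in children:
                candidates.append((total + tree[child][0], path + [child]))
        beam = heapq.nlargest(beam_width, candidates)  # prune to the beam width
    return heapq.nlargest(beam_width, finished)

# beam_search_top_n(TREE, ROOTS, beam_width=3) happens to return the true top 3
# here, but a width-1 beam would follow A1 first and miss {A2, A22} entirely.
```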

Q-learning: a Q-learning algorithm will try to explore all states and build a Q-table, but while doing so it also wastes time in roads like A3 and its branches. Is it the case that gradually reducing epsilon (the exploration factor) will make the algorithm rely more on its learnt experience and exploit the paths it has already seen?
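And this is the kind of tabular Q-learning I mean (again just a sketch on the toy `TREE`, with a linearly decaying epsilon): the state is the current road ("ROOT" for the mountain), an action is choosing a child road, and the reward is that road's restaurant count. Even with decaying epsilon it still has to sample A3's branches a few times before the Q-values tell it not to, and the final greedy rollout only yields the single best path rather than the top N.

```python
import random
from collections import defaultdict

def q_learning(tree, roots, episodes=2000, alpha=0.5, gamma=1.0,
               eps_start=1.0, eps_end=0.05):
    """Tabular Q-learning with a linearly decaying epsilon-greedy policy."""
    Q = defaultdict(float)                            # Q[(state, action)]

    def children_of(state):
        return roots if state == "ROOT" else tree[state][1]

    for ep in range(episodes):
        eps = eps_start + (eps_end - eps_start) * ep / max(1, episodes - 1)
        state = "ROOT"
        while children_of(state):                     # walk until a leaf road
            actions = children_of(state)
            if random.random() < eps:                 # explore
                action = random.choice(actions)
            else:                                     # exploit learnt Q-values
                action = max(actions, key=lambda a: Q[(state, a)])
            reward = tree[action][0]
            best_next = max((Q[(action, a)] for a in children_of(action)),
                            default=0.0)
            # Standard one-step Q-learning update.
            Q[(state, action)] += alpha * (reward + gamma * best_next
                                           - Q[(state, action)])
            state = action
    return Q

def greedy_rollout(Q, tree, roots):
    """Follow argmax-Q actions from the mountain down to a leaf."""
    path, state = [], "ROOT"
    while (roots if state == "ROOT" else tree[state][1]):
        actions = roots if state == "ROOT" else tree[state][1]
        state = max(actions, key=lambda a: Q[(state, a)])
        path.append(state)
    return path

# greedy_rollout(q_learning(TREE, ROOTS), TREE, ROOTS) -> ['A2', 'A22'] (usually)
```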

Moreover, would either of these two algorithms (or a better suggestion) carry over to the same problem with a different set of parameters (a different number of roads and different rewards)? In other words, would it become one-shot for the unseen set of parameters because we carry over our learnt experience?


