Is there a method to detect a common structure in HTML code?

I have a lot of HTML pages that are formatted differently, but the content that interests me is the same. For example:

page_1.html:

<div class = "block_person">
     <div class="persons"><span>Jules Rodrigez</span></div>
     <div class="contents"><h1>Jules Rodrigez is a programmer specialized in machine learning</h1></div>
</div>
<div class = "block_person">
     <div class="persons"><span>James Alfonso</span></div>
     <div class="contents"><h1>James  is a singer</h1></div>
</div>

page_2.html:

<div class="many_speakers" >
   <div class="speakers"><h1>Jules Rodrigez</h1></div>
   <div class="summary"><span>Jules Rodrigez is a programmer specialized in data science</span></div>
</div>
<div class="many_speakers" >
   <div class="speakers"><h1>Peka Yaya</h1></div>
   <div class="summary"><span>Peka is a professor</span></div>
</div>
<div class="many_speakers" >
   <div class="speakers"><h1>Cristiano dimaria</h1></div>
   <div class="summary"><span>Cristiano is a football player</span></div>
</div>

From an HTML page (page_1 or page_2), I want to get a list of objects like this:

From page_1.html:

[{"person":"Jules Rodrigez","content":"Jules Rodrigez is a programmer specialized in machine learning"},{"person":"James Alfonso","content":"James  is a singer"}]

The problem is that each page has its own structure. How can we detect that a block is repeated several times in an HTML page, and therefore contains the requested information? For example, in page_1.html the block that is repeated several times is:

<div class = "block_person">
     <div class="persons"><span>Jules Rodrigez</span></div>
     <div class="contents"><h1>Jules Rodrigez is a programmer specialized in machine learning</h1></div>
</div>
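
One possible way to approach this, shown only as a rough sketch: parse the page into a tree, give every element a "structural signature" built from its tag names alone (ignoring class names and text, since those differ between pages), and treat the signature that repeats most often as the block of interest. The names `Node`, `TreeBuilder`, and `extract_records` are hypothetical helpers, not part of any library, and mapping the first child to "person" and the second to "content" is an assumption that happens to fit both sample pages. It uses only the standard library's `html.parser` and assumes reasonably well-formed markup.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """Minimal element node: tag name, child elements, and direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""

    def signature(self):
        # Structure only: nested tag names; classes and text are ignored.
        inner = ",".join(child.signature() for child in self.children)
        return f"{self.tag}({inner})"

    def full_text(self):
        # Own text followed by the text of all descendants.
        parts = [self.text] + [child.full_text() for child in self.children]
        return " ".join(p.strip() for p in parts if p.strip())

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes tags are properly nested."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        self.current.text += data


def extract_records(html, fields=("person", "content")):
    """Heuristic: find the most frequently repeated multi-child block."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()

    # Only elements with at least two child elements can hold a record.
    candidates = [n for n in builder.root.walk() if len(n.children) >= 2]
    counts = Counter(n.signature() for n in candidates)
    repeated = {sig: c for sig, c in counts.items() if c >= 2}
    if not repeated:
        return []

    # Prefer the most frequent signature; break ties with the larger block.
    best = max(repeated, key=lambda sig: (repeated[sig], len(sig)))
    blocks = [n for n in candidates if n.signature() == best]

    # Assumption: child order matches the order of `fields`.
    return [
        dict(zip(fields, (child.full_text() for child in block.children)))
        for block in blocks
    ]
```

Calling `extract_records(html)` on the page_1.html markup above should return the two person/content records listed earlier, and the same call on page_2.html should return three records, because both pages share the shape "a container with two children" even though the tags and class names differ. On real pages, navigation menus and footers often repeat as well, so the tie-breaking rule would likely need tuning (for example, requiring a minimum amount of text per block).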


Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow
