Is there a method to detect a common structure in HTML code?
I have a lot of HTML pages that are formatted differently, but the content that interests me is the same. For example:
page_1.html:
<div class = "block_person">
<div class="persons"><span>Jules Rodrigez</span></div>
<div class="contents"><h1>Jules Rodrigez is a programmer specialized in machine learning</h1></div>
</div>
<div class = "block_person">
<div class="persons"><span>James Alfonso</span></div>
<div class="contents"><h1>James is a singer</h1></div>
</div>
page_2.html:
<div class="many_speakers" >
<div class="speakers"><h1>Jules Rodrigez</h1></div>
<div class="summary"><span>Jules Rodrigez is a programmer specialized in data science</span></div>
</div>
<div class="many_speakers" >
<div class="speakers"><h1>Peka Yaya</h1></div>
<div class="summary"><span>Peka is a professor</span></div>
</div>
<div class="many_speakers" >
<div class="speakers"><h1>Cristiano dimaria</h1></div>
<div class="summary"><span>Cristiano is a football player</span></div>
</div>
From an HTML page (page_1 or page_2), I want to get a list of objects like this:
From page_1.html:
[{"person":"Jules Rodrigez","content":"Jules Rodrigez is a programmer specialized in machine learning"},{"person":"James Alfonso","content":"James is a singer"}]
The problem is that each page is formatted with a different structure. How can we detect that a block is repeated several times in an HTML page, and therefore contains the requested information? For example, in page_1.html the block that is repeated several times is:
<div class = "block_person">
<div class="persons"><span>Jules Rodrigez</span></div>
<div class="contents"><h1>Jules Rodrigez is a programmer specialized in machine learning</h1></div>
</div>
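One possible way to approach this, sketched in Python with only the standard library: build a small element tree, give every element a structural fingerprint made of its tag name and its children's fingerprints, and treat the largest subtree whose fingerprint occurs at least twice as the repeated block. The names `Node`, `TreeBuilder`, `find_repeated_blocks` and `extract_records` are made up for this illustration. It also assumes reasonably well-formed markup, and that the first text inside a block is the person and the second is the content, which holds for both sample pages but may not hold elsewhere.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One HTML element; items holds child Nodes and text strings in document order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []

    def children(self):
        return [item for item in self.items if isinstance(item, Node)]

    def signature(self):
        # Fingerprint built from tag names only, so class names and text may differ.
        return (self.tag, tuple(child.signature() for child in self.children()))

    def size(self):
        # Number of elements in this subtree, including this one.
        return 1 + sum(child.size() for child in self.children())

    def walk(self):
        yield self
        for child in self.children():
            yield from child.walk()

    def texts(self):
        # All text chunks in this subtree, in document order.
        for item in self.items:
            if isinstance(item, Node):
                yield from item.texts()
            else:
                yield item


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes every non-void start tag has a matching end tag."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.items.append(text)


def find_repeated_blocks(html):
    """Return the elements sharing the largest fingerprint that occurs at least twice."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    nodes = list(builder.root.walk())
    counts = Counter(node.signature() for node in nodes)
    repeated = [node for node in nodes if counts[node.signature()] >= 2]
    if not repeated:
        return []
    best = max(repeated, key=lambda node: node.size()).signature()
    return [node for node in nodes if node.signature() == best]


def extract_records(html):
    """Assumption: first text in a block is the person, second is the content."""
    records = []
    for block in find_repeated_blocks(html):
        texts = list(block.texts())
        if len(texts) >= 2:
            records.append({"person": texts[0], "content": texts[1]})
    return records
```

Calling `extract_records` on the contents of page_1.html gives the two-item list shown above, and on page_2.html it gives three items, because the fingerprint ignores class names. On real pages other repeated structures (menus, footers, tables) can win the "largest repeated subtree" heuristic, so some extra filtering, for example requiring a minimum number of text chunks per block, would probably be needed.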
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
