preg_replace to remove a div from a string

I'm trying to remove an HTML element from a string.

I have the following preg_replace:

    $body = preg_replace('#<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">(.*?)</div>#', '', $body);

But the preg_replace doesn't seem to work.

Here is the full code:

    $html = new DOMDocument();
    @$html->loadHtmlFile($url);
    $xpath = new DOMXPath($html);
    $nodelist = $xpath->query('//*[@class="coincodex-content"]');
    $body = '';
    foreach ($nodelist as $n) {
        $body .= $html->saveHtml($n) . "\n";
    }

    $body = preg_replace('#<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">(.*?)</div>#', '', $body);
    

The current output is this:

    <div class="coincodex-content">
    hello this is content
    <div class="code-block code-block-12" style="margin: 8px 0; clear: both;">
    <div><center><span style="font-size:11px; color: gray;">TEST</span></center>
    <b>TEST</b><br><br></div></div>
    <div class="rp4wp-related-posts rp4wp-related-post">
        </ul></div><!-- AI CONTENT END 1 -->
    <div class="entry-tags" style="margin-bottom:15px; font-weight: bold; text-align:center;">Tags: <a href="#" rel="tag">test</a> <a href="#" rel="tag">#tag</a></div>
    </div>

And my desired output is:

    <div class="coincodex-content">
    hello this is content
    </div>

I'd really appreciate any help. I'm sure there is an easier way to achieve this; I'm just not sure why my current method isn't working. Thank you.



Solution 1:[1]

This is cheating a bit. The main problem with using regex to parse HTML is nested tags, which will drive you to madness. If you truly only need to keep the first <div> and the content that occurs before the second <div>, the code below will work.

    // U makes the quantifiers lazy; s lets "." match newlines.
    if (preg_match('#<div class="coincodex-content">(.*)<div.*$#Us', $body, $matches)) {
        $body = '<div class="coincodex-content">' . $matches[1] . '</div>';
    }

... since we're just extracting the content we need and inserting it into a wrapper whose format is static.
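
To see the extraction end to end, here is a minimal sketch that runs the same pattern on a shortened copy of the question's markup. The helper name keepLeadingContent() and the sample string are assumptions made for illustration; they are not part of the original answer. It only keeps the text before the first nested <div>, so it is not a general HTML cleaner.

```php
<?php
// Shortened sample modelled on the question's output (an assumption for this sketch).
$body = <<<HTML
<div class="coincodex-content">
hello this is content
<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">
<div><b>TEST</b><br><br></div></div>
<div class="entry-tags">Tags: <a href="#" rel="tag">test</a></div>
</div>
HTML;

// keepLeadingContent() is a hypothetical helper name.
function keepLeadingContent(string $body): string
{
    // U: quantifiers are lazy by default; s: "." also matches newlines.
    $pattern = '#<div class="coincodex-content">(.*)<div.*$#Us';
    if (preg_match($pattern, $body, $matches) !== 1) {
        return $body; // no match: return the input untouched
    }
    // Re-wrap the captured leading text in the static container markup.
    return '<div class="coincodex-content">' . $matches[1] . '</div>';
}

$result = keepLeadingContent($body);
echo $result, "\n";
```

Everything from the first nested <div> onward is dropped, so any text that follows the nested blocks inside the container would be lost as well.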


Solution 2:[2]

Regular expressions are unsuitable for modifying DOM elements. Your experiment shows that. The result is wrong and also invalid HTML.
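
The following sketch illustrates both failure modes on a shortened copy of the question's markup (the sample string is an assumption). Without the s modifier, "." does not match newlines, so the question's pattern matches nothing. With s added, the lazy (.*?) stops at the first </div> and leaves an unbalanced closing tag behind.

```php
<?php
// Shortened sample modelled on the question's output (an assumption for this sketch).
$body = <<<HTML
<div class="coincodex-content">
hello this is content
<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">
<div><b>TEST</b><br><br></div></div>
</div>
HTML;

// Opening tag exactly as written in the question's pattern.
$open = '<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">';

// 1) The question's pattern: "." stops at newlines, so nothing is replaced.
$withoutS = preg_replace('#' . $open . '(.*?)</div>#', '', $body);
var_dump($withoutS === $body);

// 2) With the s modifier the pattern matches, but the lazy (.*?) ends at the
//    FIRST </div>, so the closing tag of the removed block is left over.
$withS = preg_replace('#' . $open . '(.*?)</div>#s', '', $body);
echo $withS, "\n";

// Count tags to show the imbalance in the second result.
printf("opening: %d, closing: %d\n", substr_count($withS, '<div'), substr_count($withS, '</div>'));
```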

It is better to use DOM methods to solve the problem, as noted in the comment. DOM has a method, DOMNode::removeChild, which you can use to remove elements. To show how removeChild can be used, I chose simpler HTML.

    $html = <<<HTML
    <div>
    <div class="coincodex-content">
    hello this is content
      <div class="delete_this" style="margin: 8px 0; clear: both;">
        <div>
          <center><span style="font-size:11px; color: gray;">TEST</span></center>
          <b>TEST</b><br><br>
        </div>
      </div>
      <div class="preserved">
        Test2
      </div>
    </div>
    </div>
    HTML;

I collect the fragments into an array.

    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $nodelist = $xpath->query('//*[@class="coincodex-content"]');

    $fragment = [];
    foreach ($nodelist as $contentNode) {
        // The leading "." keeps the search inside $contentNode;
        // a bare "//" would search the whole document.
        $removeNodelist = $xpath->query('.//div[@class="delete_this"]', $contentNode);
        $item = $removeNodelist->item(0); // only the first match
        if ($item !== null) {
            $item->parentNode->removeChild($item);
        }
        $fragment[] = $doc->saveHTML($contentNode);
    }

The result in $fragment[0]:

    <div class="coincodex-content">
    hello this is content
      
      <div class="preserved">
        Test2
      </div>
    </div>

Try it yourself at 3v4l.org.
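
The question's element carries two class names ("code-block code-block-12"), which an exact @class comparison would only match if the full attribute string is given. Below is a hedged sketch of a helper that removes every descendant carrying a given class token. removeByClass() is a hypothetical name, the sample markup is shortened from the question, and the class names are assumed to be plain tokens without quotes.

```php
<?php
// removeByClass() is a hypothetical helper, not taken from the answer above.
function removeByClass(string $html, string $containerClass, array $classes): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $fragments = [];
    foreach ($xpath->query('//*[@class="' . $containerClass . '"]') as $container) {
        foreach ($classes as $class) {
            // Token match: true when $class is one of the space-separated class names.
            $expr = './/*[contains(concat(" ", normalize-space(@class), " "), " ' . $class . ' ")]';
            // Copy the matches to an array before removing anything from the tree.
            foreach (iterator_to_array($xpath->query($expr, $container)) as $node) {
                if ($node->parentNode !== null) {
                    $node->parentNode->removeChild($node);
                }
            }
        }
        $fragments[] = $doc->saveHTML($container);
    }
    return $fragments;
}

// Shortened sample modelled on the question's output (an assumption for this sketch).
$sample = <<<HTML
<div class="coincodex-content">
hello this is content
<div class="code-block code-block-12" style="margin: 8px 0; clear: both;">
<div><b>TEST</b><br><br></div></div>
<div class="entry-tags">Tags: <a href="#" rel="tag">test</a></div>
</div>
HTML;

$fragments = removeByClass($sample, 'coincodex-content', ['code-block-12', 'entry-tags']);
echo $fragments[0], "\n";
```

The whitespace that surrounded the removed elements stays in the output, as in the result shown above; trim or normalize it separately if that matters.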

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 FoulFoot
Solution 2