How to get first occurrence of src with HTML Agility Pack

Because the XML files I have are not well-formed, I'm using HTML Agility Pack. For example, I am parsing this feed: https://www.rioseo.com/feed/

I have an array of these elements (so the "src" values are always unique):

<content:encoded><![CDATA[<h2><a href="https://resources.rioseo.com/c/gbp-guide-for-hospit?x=0hTW-s"><img class="alignnone size-full wp-image-23086" src="https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero.jpg" alt="" width="1200" height="409" srcset="https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero-200x68.jpg 200w, https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero-300x102.jpg 300w, https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero-400x136.jpg 400w,

I want to get only the first image URL from the src attribute of each element, so my expected output should be an array of URLs:

{'https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero.jpg',
https://another.url.extracted.from.the.array.of.'content_encoded'}

I can output the whole img element from the 'content:encoded' node with:

var images = doc.DocumentNode.SelectNodes(".//*[name()='content:encoded']/img").ToArray();
foreach (var item in images)
{
    Console.WriteLine("image: " + item.OuterHtml);
}

Properties other than OuterHtml give me blank output.

I can also output every img from this string with:

var items = doc.DocumentNode.SelectNodes("//img[@src]").ToArray();
foreach (var image in items)
{
    Console.WriteLine("img: " + image.Attributes["src"].Value);
}

I know I have to extract the first occurrence of "https" from the img element. I've tried many XPath expressions, but I can't get it to work. My XPath is probably wrong, but I don't know how to fix it.

Any help would be greatly appreciated :), thanks!



Solution 1:[1]

I think I got it. With a regular expression I just do:

var items = doc.DocumentNode.SelectNodes(".//item").ToArray();
foreach (var item in items)
{
    // Capture the src value of the first <img> in this item (needs System.Text.RegularExpressions)
    string matchString = Regex.Match(item.OuterHtml, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
    Console.WriteLine("img: " + matchString);
}
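To see why this pattern returns only the first src, here is a minimal standalone sketch that applies the same expression to a plain string, with no HTML Agility Pack involved. The class name FirstImageSrc and the method Extract are made up for illustration, and adding RegexOptions.Singleline is my assumption so that an img tag spanning several lines still matches.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class FirstImageSrc
{
    // The lazy quantifiers (.+? and .*?) stop at the first <img ... src="...">.
    // "src=" cannot match inside "srcset=", so srcset URLs are not captured.
    private static readonly Regex ImgSrc = new Regex(
        "<img.+?src=[\"'](.+?)[\"'].*?>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the first src value in the fragment, or an empty string if none is found.
    public static string Extract(string html)
    {
        Match m = ImgSrc.Match(html);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Calling FirstImageSrc.Extract(item.OuterHtml) inside the loop would replace the inline Regex.Match call. One caveat: the pattern would also match an attribute whose name merely ends in "src" (for example data-src) if it appears before the real src.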

Solution 2:[2]

Your content:encoded sample is incomplete but I think this can be a solution:

var images = doc.DocumentNode.SelectNodes(".//*[name()='content:encoded']//img")
    .Select(item => item.GetAttributeValue("src", null))
    .Where(item => item != null)
    .ToList();
foreach (var url in images)
{
    Console.WriteLine("image: " + url);
}

The XPath is like yours, but with a double slash (//) before img because of the CDATA. Then I select the src attribute (or null if it does not exist) and filter out the null items (images without src, which I suppose you don't have, but it is a sanity check).
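If an item can contain several images and only the first one per item is wanted, a relative XPath with SelectSingleNode might be enough. The following is a rough sketch, assuming the img nodes end up as descendants of each item element in the tree that HTML Agility Pack builds. The class name FirstImagePerItem and the method Extract are made up for illustration.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

// Hypothetical helper, not part of the original answer.
public static class FirstImagePerItem
{
    // Returns the src of the first <img> under each <item>; items without one are skipped.
    public static List<string> Extract(string markup)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        var urls = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var items = doc.DocumentNode.SelectNodes("//item");
        if (items == null)
        {
            return urls;
        }

        foreach (var item in items)
        {
            // The leading "." keeps the search inside the current item;
            // SelectSingleNode returns the first match in document order, or null.
            var img = item.SelectSingleNode(".//img[@src]");
            if (img != null)
            {
                urls.Add(img.GetAttributeValue("src", string.Empty));
            }
        }

        return urls;
    }
}
```

Without the leading dot, "//img[@src]" would search from the document root, so every item would return the same first image of the whole feed.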

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Mi Yahn
Solution 2 Victor