How to get the first occurrence of src with HTML Agility Pack
Because the XML files I have are not well-formed, I'm using HTML Agility Pack. For example, I am parsing this feed: https://www.rioseo.com/feed/
I have an array of elements like this one (so the "src" values are always unique):
<content:encoded><![CDATA[<h2><a href="https://resources.rioseo.com/c/gbp-guide-for-hospit?x=0hTW-s"><img class="alignnone size-full wp-image-23086" src="https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero.jpg" alt="" width="1200" height="409" srcset="https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero-200x68.jpg 200w, https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero-300x102.jpg 300w, https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero-400x136.jpg 400w,
I want to get only the first image URL from the src attribute of each element, so my expected output is an array of URLs:
{'https://www.rioseo.com/wp-content/uploads/2022/04/Rio_eBook_GBP-Guide-for-Hospitality-Brands_April2022_Hero.jpg',
https://another.url.extracted.from.the.array.of.'content_encoded'}
I can output the whole img element from the 'content:encoded' node with:
var images = doc.DocumentNode.SelectNodes(".//*[name()='content:encoded']/img").ToArray();
foreach (var item in images)
{
    Console.WriteLine("image: " + item.OuterHtml);
}
Properties other than OuterHtml give me blank output.
I can also output every img from this string with:
var items = doc.DocumentNode.SelectNodes("//img[@src]").ToArray();
foreach (var image in items)
{
    Console.WriteLine("img: " + image.Attributes["src"].Value);
}
I know I have to extract the first occurrence of "https" from each img element. I've tried many XPath expressions, but I can't get it to work. My XPath is probably wrong, but I don't know how to fix it.
Any help would be much appreciated :), thanks!
Solution 1:[1]
I think I got it. With a regex, I just do:
// Requires: using System.Text.RegularExpressions;
var items = doc.DocumentNode.SelectNodes(".//item").ToArray();
foreach (var item in items)
{
    // Lazily match the first <img ...> tag in the item and capture its src value.
    string matchString = Regex.Match(item.OuterHtml, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
    Console.WriteLine("img: " + matchString);
}
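As a hedged variation on the regex approach, the sketch below requires whitespace directly before src, so the pattern cannot match srcset or data-src, and it stops capturing at the closing quote. It uses only System.Text.RegularExpressions. The class name FirstImageSrcDemo, the helper FirstImageSrc, and the example.com markup are made up for illustration; real feed items may need a different pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FirstImageSrcDemo
{
    // Matches the first <img ...> tag that has a src attribute and captures its value.
    // The \s before "src" keeps the pattern from matching "srcset" or "data-src".
    private static readonly Regex ImgSrc = new Regex(
        "<img\\b[^>]*?\\ssrc\\s*=\\s*[\"']([^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Returns the first src URL in the markup, or an empty string when there is none.
    public static string FirstImageSrc(string markup)
    {
        Match match = ImgSrc.Match(markup);
        return match.Success ? match.Groups[1].Value : string.Empty;
    }

    public static void Main()
    {
        // Hypothetical, shortened stand-in for the OuterHtml of one <item>.
        string item =
            "<content:encoded><![CDATA[<h2><a href=\"https://example.com/a\">" +
            "<img class=\"x\" src=\"https://example.com/hero.jpg\" " +
            "srcset=\"https://example.com/hero-200.jpg 200w\" /></a></h2>]]></content:encoded>";

        Console.WriteLine("img: " + FirstImageSrc(item));
    }
}
```

Because Regex.Match returns only the first match, calling the helper once per item yields one URL per item, which is what the question asks for.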
Solution 2:[2]
Your content:encoded sample is incomplete but I think this can be a solution:
var images = doc.DocumentNode.SelectNodes(".//*[name()='content:encoded']//img")
    .Select(item => item.GetAttributeValue("src", null))
    .Where(item => item != null)
    .ToList();
foreach (var url in images)
{
    Console.WriteLine("image: " + url);
}
The XPath is like yours, but with // before img (any descendant instead of a direct child) because of the CDATA section. I then select the src attribute of each image (or null if it does not exist) and filter out the null items. I assume you have no images without src, but the filter serves as a sanity check.
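Since the question asks for one URL per feed item, a per-item variant might look like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced and that the parser exposes the img elements under content:encoded, as the question reports. The names FirstImagePerItem and Extract and the example.com markup are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class FirstImagePerItem
{
    // Returns the src of the first <img> found under content:encoded in each <item>.
    public static List<string> Extract(HtmlDocument doc)
    {
        var urls = new List<string>();

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//item");
        if (items == null)
        {
            return urls;
        }

        foreach (HtmlNode item in items)
        {
            // SelectSingleNode yields the first matching node, or null if there is none.
            HtmlNode img = item.SelectSingleNode(".//*[name()='content:encoded']//img[@src]");
            if (img != null)
            {
                urls.Add(img.GetAttributeValue("src", string.Empty));
            }
        }

        return urls;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical, simplified markup standing in for the downloaded feed.
        doc.LoadHtml(
            "<item><content:encoded><div><img src=\"https://example.com/a.jpg\">" +
            "<img src=\"https://example.com/b.jpg\"></div></content:encoded></item>");

        foreach (string url in Extract(doc))
        {
            Console.WriteLine("image: " + url);
        }
    }
}
```

Items without any image are skipped rather than producing a null entry, and the null check on SelectNodes avoids an exception when the document contains no item elements.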
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Mi Yahn |
| Solution 2 | Victor |
