Using RegEx to extract data from an anchor tag
I have the following anchor tag in an HTML document, and I want to extract the link and the text from it:
<a href="https://www.catholicgallery.org/bible-drb/acts-9/">Acts 9:</a> 1-20
I have tried two different methods. The first calls TestRegEx with:
IEnumerable<Tuple<string, string, string>> tuple = TestRegEx(reading.readinghRef);
where TestRegEx is:
protected IEnumerable<Tuple<string, string, string>> TestRegEx(string html)
{
    Regex r = new Regex(@"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>\s(?<verses>.*?)");
    foreach (Match match in r.Matches(html))
        yield return new Tuple<string, string, string>(
            match.Groups["href"].Value, match.Groups["value"].Value, match.Groups["verses"].Value);
}
I have also tried:
Regex regex = new Regex(@"<a\shref=""(?<url>.*?)"">(?<text>.*?):</a>\s(?<verses>.*?)");
Match match = regex.Match(reading.readinghRef);
string text = match.Groups["text"].Value;
string[] textParts = text.Split(' ');
string verses = match.Groups["verses"].Value;
string book = "";
for (int i = 0; i < textParts.Length - 1; i++)
{
    if (book.Length > 0)
        book += " ";
    book += textParts[i];
}
string chapter = textParts[textParts.Length - 1];
Both succeed in getting the book and the URL, but fail to get the verses. Item 2 in the tuple is not yet parsed into book and chapter, but that is not the problem. The problem is that the verses at the end of the HTML string are not captured.
Solution 1:[1]
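The only problem with your first regex is the non-greedy group at the end:
(?<verses>.*?)
Because nothing follows it in the pattern, the non-greedy quantifier is satisfied by matching zero characters, so the group comes back empty. Replace it with the greedy version, and you'll get the verses: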
(?<verses>.*)
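To see the difference side by side, here is a minimal sketch that runs the question's first pattern against the sample tag twice, once with each ending. The class name VersesDemo and the helper Verses are made up for this illustration and are not part of the original code. The second attempt in the question ends with the same non-greedy group, so the same change should apply there.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VersesDemo
{
    // The sample input from the question.
    private const string Html =
        "<a href=\"https://www.catholicgallery.org/bible-drb/acts-9/\">Acts 9:</a> 1-20";

    // The question's first pattern, unchanged, up to (but not including) the final group.
    private const string Prefix =
        @"<a.*?href=(""|')(?<href>.*?)(""|').*?>(?<value>.*?)</a>\s";

    // Hypothetical helper: returns what the "verses" group captures
    // when the pattern ends with the given group.
    public static string Verses(string versesGroup)
    {
        Match match = Regex.Match(Html, Prefix + versesGroup);
        return match.Groups["verses"].Value;
    }

    public static void Main()
    {
        // Non-greedy: nothing follows the group, so it stops after zero characters.
        Console.WriteLine($"lazy:   '{Verses("(?<verses>.*?)")}'");

        // Greedy: consumes the rest of the line.
        Console.WriteLine($"greedy: '{Verses("(?<verses>.*)")}'");
    }
}
```

Without RegexOptions.Singleline, the dot does not match a newline, so the greedy group stops at the end of the line that contains the tag. If the input holds several tags on one line, a greedy ending would swallow everything after the first one, and a tighter group would be needed.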
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | ejkeep |