C# Split string by tag and create an iterable data structure

I have this string:

<item>
  <node1>Name</node1>
    <childNode1>Nickname</childNode1>
  <node2>Surname</node2>
</item>
<item>
  <node1>AnotherName</node1>
  <node2>AnotherSurname</node2>
</item>

I want to split this string by "item" and create a data structure from the text extracted from all nodes, for example:

{"Name", "Nickname", "Surname"}

{"Name", "Surname"}



Solution 1:[1]

If you install HtmlAgilityPack you can solve your task easily:

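using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public class Item
{
    public List<string> Properties { get; set; }

    public static List<Item> LoadItems(string text)
    {
        var items = new List<Item>();
        var doc = new HtmlDocument();
        doc.LoadHtml(text);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var docItems = doc.DocumentNode.SelectNodes("//item");
        if (docItems == null)
        {
            return items;
        }

        foreach (var docItem in docItems)
        {
            // Skip the whitespace between tags, which is parsed as text nodes.
            var list = docItem.ChildNodes
                .Where(n => n.NodeType != HtmlNodeType.Text)
                .Select(n => n.InnerText)
                .ToList();
            if (list.Count > 0)
            {
                items.Add(new Item { Properties = list });
            }
        }

        return items;
    }
}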

This class holds a list of properties ("Name", "Nickname", "Surname" for your first item) and a static LoadItems method that parses your text. It selects all "item" nodes and, for each one, collects the InnerText (the content) of its child nodes.

You can test your sample:

const string text = @"<item>
  <node1>Name</node1>
    <childNode1>Nickname</childNode1>
  <node2>Surname</node2>
</item>
<item>
  <node1>AnotherName</node1>
  <node2>AnotherSurname</node2>
</item>";

var allItems = Item.LoadItems(text);

Solution 2:[2]

Adding a root element, so that the string becomes well-formed XML, might be the better approach.

Another way might be to use this regexp: https://regex101.com/r/eKMjeu/1

The code from the generator (on that site) gives:

string pattern = @"<(\/?node[0-9]|\/?childNode[0-9])>*>|\n";
string substitution = ",";
        string input = @"<item>
  <node1>Name</node1>
    <childNode1>Nickname</childNode1>
  <node2>Surname</node2>
</item>
<item>
  <node1>AnotherName</node1>
  <node2>AnotherSurname</node2>
</item>";
RegexOptions options = RegexOptions.Multiline;

Regex regex = new Regex(pattern, options);
string result = regex.Replace(input, substitution);

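The value of result is then as follows (the line breaks that remain are the \r characters of Windows line endings, since every \n was replaced):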

<item>
,  ,Name,
,    ,Nickname,
,  ,Surname,
,</item>
,<item>
,  ,AnotherName,
,  ,AnotherSurname,
,</item>

That might make life a little bit easier.

You could add:

result = result.Replace('\r',' ');
result = result.Replace(@"</item>", Environment.NewLine);
result = "," + result.Replace(@"<item>","");

Which leaves you with:

, ,  ,Name, ,    ,Nickname, ,  ,Surname, , 
 , ,  ,AnotherName, ,  ,AnotherSurname, ,
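From that string, one list per item could be obtained by splitting on line breaks and then on commas, discarding the blank fields. The following is only a minimal sketch: the result string is written out literally (assuming Windows line endings), and the variable names items and fields are illustrative, not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The two lines left after the replacements above, written out literally
// (assumption: Environment.NewLine was "\r\n").
string result = ", ,  ,Name, ,    ,Nickname, ,  ,Surname, , \r\n , ,  ,AnotherName, ,  ,AnotherSurname, ,";

// One inner list per line; whitespace-only fields between the commas are dropped.
List<List<string>> items = result
    .Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
    .Select(line => line
        .Split(',')
        .Select(field => field.Trim())
        .Where(field => field.Length > 0)
        .ToList())
    .Where(fields => fields.Count > 0)
    .ToList();

foreach (List<string> fields in items)
{
    Console.WriteLine("{\"" + string.Join("\",\"", fields) + "\"}");
}
// prints:
// {"Name","Nickname","Surname"}
// {"AnotherName","AnotherSurname"}
```

Note that this breaks as soon as a node's text itself contains a comma or a line break.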

All in all, pretty messy...

The other solution, using XML, seems much nicer:

using System.Text.RegularExpressions;
using System.Xml.Linq;
using System.Xml.XPath;

string str = @"<item>
  <node1>Name</node1>
    <childNode1>Nickname</childNode1>
  <node2>Surname</node2>
</item>
<item>
  <node1>AnotherName</node1>
  <node2>AnotherSurname</node2>
</item>";

str = "<root>" + str + "</root>";

XDocument xml = XDocument.Parse(str);

foreach(XElement e in xml.Descendants("node1")) {
    XElement node2 = e.XPathSelectElement("../node2");
    System.Console.WriteLine("{\"" + e.Value + "\",\"" + node2.Value + "\"}");
}

Output:

{"Name","Surname"}
{"AnotherName","AnotherSurname"}

Of course, this second solution (currently) lacks error checking, skips childNode1, and should have started with xml.Descendants("item").
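A variant that does start from xml.Descendants("item") might look like the sketch below. It assumes that every leaf element under an item should contribute its text, which also picks up childNode1. The names items and fields are illustrative, and error handling for malformed input (XDocument.Parse throws an XmlException) is still left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

string str = @"<item>
  <node1>Name</node1>
    <childNode1>Nickname</childNode1>
  <node2>Surname</node2>
</item>
<item>
  <node1>AnotherName</node1>
  <node2>AnotherSurname</node2>
</item>";

// Wrap the fragments in a single root so the string is well-formed XML.
XDocument xml = XDocument.Parse("<root>" + str + "</root>");

// One inner list per <item>, holding the text of each leaf element in document order.
List<List<string>> items = xml
    .Descendants("item")
    .Select(item => item
        .Descendants()
        .Where(e => !e.HasElements)
        .Select(e => e.Value.Trim())
        .ToList())
    .ToList();

foreach (List<string> fields in items)
{
    Console.WriteLine("{\"" + string.Join("\",\"", fields) + "\"}");
}
// prints:
// {"Name","Nickname","Surname"}
// {"AnotherName","AnotherSurname"}
```

The result is a List<List<string>>, so it can be iterated item by item, which is what the question asks for.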

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source
Solution 1 Victor
Solution 2