'How do I optimize string searching for c++ in a for loop?

Is there anyway I can optimize this function to take less times for strings with half a million characters?

for (int i = start; i <= end; i++) {
        doc.body.push_back(html[i]);
        

        if (html[i] == '<') {
            if (html[i + 1] != '/') {
                Tag newTag;
                int start1 = html[i];
                int end1;

                for (int y = i; y <= end; y++) {
                    if (html[y] == '>') {
                        end1 = y;
                        break;
                    }
                }
                
                for (int y = start1 + 1; y <= end1; y++) {
                    if (html[y] == ' ' || html[y] == '>') {
                        break;
                    }
                    newTag.tagType.push_back(html[y]);
                }
                for (int y = start1; y <= end1; y++) {
                    newTag.openTag.push_back(html[y]);
                }
                if (newTag.tagType.find("area") != std::string::npos || newTag.tagType.find("base") != std::string::npos || newTag.tagType.find("br") != std::string::npos || newTag.tagType.find("col") != std::string::npos || newTag.tagType.find("command") != std::string::npos || newTag.tagType.find("embed") != std::string::npos || newTag.tagType.find("hr") != std::string::npos || newTag.tagType.find("img") != std::string::npos || newTag.tagType.find("input") != std::string::npos || newTag.tagType.find("keygen") != std::string::npos || newTag.tagType.find("link") != std::string::npos || newTag.tagType.find("meta") != std::string::npos || newTag.tagType.find("param") != std::string::npos || newTag.tagType.find("source") != std::string::npos || newTag.tagType.find("track") != std::string::npos || newTag.tagType.find("wbr") != std::string::npos) {
                    newTag.closeTag = "Null";
                }

                if (newTag.openTag.find("class=") != std::string::npos) {
                    int start1 = newTag.openTag.find("\"", newTag.openTag.find("class=")) + 1;
                    int end1 = newTag.openTag.find("\"", start1);

                    for (int y = start1; y < end1; y++) {
                        if (html[y] != ' ') {
                            newTag.tagClass.back().push_back(html[y]);
                        }
                        else {
                            newTag.tagClass.push_back("");
                        }
                    }
                }

                if (newTag.openTag.find("id=") != std::string::npos) {
                    int start1 = newTag.openTag.find("\"", newTag.openTag.find("id=")) + 1;
                    int end1 = newTag.openTag.find("\"", start1);

                    for (int y = start1; y < end1; y++) {
                        if (html[y] != ' ') {
                            newTag.tagID.back().push_back(html[y]);
                        }
                        else {
                            newTag.tagID.push_back("");
                        }
                    }

                }


                page.tags.push_back(newTag);
            }
            else {


                int end1;
                for (int y = i; y <= stringSize; y++) {
                    if (html[y] == '>') {
                        end1 = y;
                        break;
                    }
                }


                //gets everything within the closing tag
                std::string storeClose;
                for (int y = i; y <= end1; y++) {
                    storeClose.push_back(html[y]);
                }
                for (int y = page.tags.size() - 1; y >= 0; y--) {
                    if (storeClose.find(page.tags[y].tagType) != std::string::npos) {
                        page.tags[y].closeTag = storeClose;
                    }
                }

            }
        }
    }

I timed how long it took with the chrono library and it took 16 minutes for a string with the length of 300000 characters! This is supposed to be parsing a html document downloaded from the web and its mostly functional. For shorter pages its almost instant but as soon as I reach the higher number the time it takes its exponentially higher!

c++optimization html-parsing

Solution 1:^[1]

I answered already yesterday the same question.

But first. Please compile your source code with all warnings enabled. There are tons of compiler warnings that you must resolve.

Additionally: In my opinion you have a hard bug in line int start1 = html[i];. I guess this should be int start1 = i;

Then, because you did not create a minimum reproducable example, I stubbed your code to test it.

Then I copied the HTML source from this file from stackoverflow and copied it in a local file 4 times to get a file size of 1.9MByte.

Then I let it run on my machine in below 4 seconds (including reading the 2MByte) from disk.

I used the profiler to check, where the time is burned. Several rounds.

The timed is mainly burned in string.find, string assignment and vector assignments.

The reason is not that the find-function is slow, but your excessive and unnecessary use of it.

Here you must work.

Look at the lines:

if (newTag.tagType.find("area") != std::string::npos || newTag.tagType.find("base") != std::string::npos || newTag.tagType.find("br") != std::string::npos || newTag.tagType.find("col") != std::string::npos || newTag.tagType.find("command") != std::string::npos || newTag.tagType.find("embed") != std::string::npos || newTag.tagType.find("hr") != std::string::npos || newTag.tagType.find("img") != std::string::npos || newTag.tagType.find("input") != std::string::npos || newTag.tagType.find("keygen") != std::string::npos || newTag.tagType.find("link") != std::string::npos || newTag.tagType.find("meta") != std::string::npos || newTag.tagType.find("param") != std::string::npos || newTag.tagType.find("source") != std::string::npos || newTag.tagType.find("track") != std::string::npos || newTag.tagType.find("wbr") != std::string::npos) {
                    newTag.closeTag = "Null";
                }

That is of course very inefficient.

Anyway. Study you code and refactor it. The design is important.

Please see your stubbed code:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>

struct Doc {
    std::string body{};
};
struct Tag {
    std::vector<std::string> tagID{};
    std::string tagType{};
    std::string openTag{};
    std::string closeTag{};
    std::vector<std::string> tagClass{};
};
struct Page {
    std::vector<Tag> tags;
};

int main() {

    Doc doc{};
    Page page{};

    std::vector<char> html{};
    html.reserve(200000000);

    if (std::ifstream ifs("r:\\example.htm"); ifs) {
        
        std::copy(std::istreambuf_iterator<char>(ifs), {}, std::back_inserter(html));

        int start = 0;
        int end = static_cast<int>(html.size());

        for (int i = start; i <= end; i++) {
            doc.body.push_back(html[i]);


            if (html[i] == '<') {
                if (html[i + 1] != '/') {
                    Tag newTag;
                    int start1 = i;//html[i];
                    int end1;

                    for (int y = i; y <= end; y++) {
                        if (html[y] == '>') {
                            end1 = y;
                            break;
                        }
                    }

                    for (int y = start1 + 1; y <= end1; y++) {
                        if (html[y] == ' ' || html[y] == '>') {
                            break;
                        }
                        newTag.tagType.push_back(html[y]);
                    }
                    for (int y = start1; y <= end1; y++) {
                        newTag.openTag.push_back(html[y]);
                    }
                    if (newTag.tagType.find("area") != std::string::npos || newTag.tagType.find("base") != std::string::npos || newTag.tagType.find("br") != std::string::npos || newTag.tagType.find("col") != std::string::npos || newTag.tagType.find("command") != std::string::npos || newTag.tagType.find("embed") != std::string::npos || newTag.tagType.find("hr") != std::string::npos || newTag.tagType.find("img") != std::string::npos || newTag.tagType.find("input") != std::string::npos || newTag.tagType.find("keygen") != std::string::npos || newTag.tagType.find("link") != std::string::npos || newTag.tagType.find("meta") != std::string::npos || newTag.tagType.find("param") != std::string::npos || newTag.tagType.find("source") != std::string::npos || newTag.tagType.find("track") != std::string::npos || newTag.tagType.find("wbr") != std::string::npos) {
                        newTag.closeTag = "Null";
                    }
                    if (newTag.openTag.find("class=") != std::string::npos) {
                        int start1 = newTag.openTag.find("\"", newTag.openTag.find("class=")) + 1;
                        int end1 = newTag.openTag.find("\"", start1);

                        for (int y = start1; y < end1; y++) {
                            if (newTag.tagClass.empty()) newTag.tagClass.push_back({});
                            if (html[y] != ' ') {
                                if (newTag.tagClass.empty()) newTag.tagClass.push_back({});   // ***********************
                                newTag.tagClass.back().push_back(html[y]);
                            }
                            else {
                                newTag.tagClass.push_back("");
                            }
                        }
                    }
                    if (newTag.openTag.find("id=") != std::string::npos) {
                        int start1 = newTag.openTag.find("\"", newTag.openTag.find("id=")) + 1;
                        int end1 = newTag.openTag.find("\"", start1);
                        if (newTag.tagID.empty()) newTag.tagID.push_back({});
                        for (int y = start1; y < end1; y++) {
                            if (html[y] != ' ') {
                                newTag.tagID.back().push_back(html[y]);
                            }
                            else {
                                newTag.tagID.push_back("");
                            }
                        }

                    }
                    page.tags.push_back(newTag);
                }
                else {


                    int end1;
                    for (int y = i; y <= html.size(); y++) {
                        if (html[y] == '>') {
                            end1 = y;
                            break;
                        }
                    }


                    //gets everything within the closing tag
                    std::string storeClose;
                    for (int y = i; y <= end1; y++) {
                        storeClose.push_back(html[y]);
                    }
                    for (int y = page.tags.size() - 1; y >= 0; y--) {
                        if (storeClose.find(page.tags[y].tagType) != std::string::npos) {
                            page.tags[y].closeTag = storeClose;
                        }
                    }
                }
            }
        }
    }
}

And one output of the profile runs:

The test has been performed on a 12 years old Windows 7 machine.

Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution	Source
Solution 1	Armin Montigny

'How do I optimize string searching for c++ in a for loop?

Solution 1:[1]

Sources

Related Questions

Solution 1:^[1]