-
Notifications
You must be signed in to change notification settings - Fork 860
esi: replace _findOpeningTag with memchr #13173
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 2 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -27,6 +27,7 @@ | |
| #include <ts/ts.h> | ||
|
|
||
| #include <cctype> | ||
| #include <cstring> | ||
|
|
||
| using std::string; | ||
| using namespace EsiLib; | ||
|
|
@@ -182,68 +183,52 @@ EsiParser::_compareData(const string &data, size_t pos, const char *str, int str | |
| return PARTIAL_MATCH; | ||
| } | ||
|
|
||
| /** This implementation is optimized but not completely correct. If | ||
| * the opening tag were to have a repeating opening sequence ('<e<esi' | ||
| * or something like that), this will break. However that is not the | ||
| * case for the two opening tags we are looking for */ | ||
| /** Uses memchr to skip non-'<' bytes, then memcmp to verify each candidate | ||
| * anchor. Delegates scanning to the platform's optimized memchr | ||
| * implementation. Does not have the KMP-failure limitation of the original | ||
| * state-machine. */ | ||
| EsiParser::MATCH_TYPE | ||
| EsiParser::_findOpeningTag(const string &data, size_t start_pos, size_t &opening_tag_pos, bool &is_html_comment_node) const | ||
| { | ||
| size_t i_data = start_pos; | ||
| int i_esi = 0, i_html_comment = 0; | ||
|
|
||
| while (i_data < data.size()) { | ||
| if (data[i_data] == ESI_TAG_PREFIX[i_esi]) { | ||
| if (++i_esi == ESI_TAG_PREFIX_LEN) { | ||
| is_html_comment_node = false; | ||
| opening_tag_pos = i_data - i_esi + 1; | ||
| const char *const buf = data.data(); | ||
| const size_t total = data.size(); | ||
| const size_t esi_len = ESI_TAG_PREFIX_LEN; | ||
| const size_t hlen = HTML_COMMENT_NODE_INFO.tag_suffix_len; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do you really need this hlen local? The compiler will optimize it away, but I honestly would just like to see HTML_COMMENT_NODE_INFO.tag_suffix_len where you need this, specially since you memcmp from HTML_COMMENT_NODE_INFO.tag_suffix already (no shadow variable for the string). Same with the esi_len I think, neither are blockers.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. corrected in latest commit |
||
| size_t i = start_pos; | ||
|
|
||
| while (i < total) { | ||
| const char *p = static_cast<const char *>(memchr(buf + i, '<', total - i)); | ||
| if (!p) { | ||
| return NO_MATCH; | ||
| } | ||
| const size_t pos = static_cast<size_t>(p - buf); | ||
| const size_t avail = total - pos; | ||
|
|
||
| if (avail >= esi_len && memcmp(p, ESI_TAG_PREFIX, esi_len) == 0) { | ||
| is_html_comment_node = false; | ||
| opening_tag_pos = pos; | ||
| return COMPLETE_MATCH; | ||
| } | ||
| // hlen+1 bytes needed: hlen for the tag, 1 for the required trailing whitespace | ||
| if (avail > hlen && memcmp(p, HTML_COMMENT_NODE_INFO.tag_suffix, hlen) == 0) { | ||
| const char ch = p[hlen]; | ||
| if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') { | ||
| is_html_comment_node = true; | ||
| opening_tag_pos = pos; | ||
| return COMPLETE_MATCH; | ||
| } | ||
| } else { | ||
| if (i_esi) { | ||
| i_esi = 0; | ||
| --i_data; // we do this to reexamine the current char as target string might start from here | ||
| if (i_html_comment) { | ||
| --i_html_comment; // in case other target string has started matching, adjust it's index | ||
| } | ||
| } | ||
| } | ||
| // doing the exact same thing for the other target string | ||
| if (i_html_comment < HTML_COMMENT_NODE_INFO.tag_suffix_len && | ||
| data[i_data] == HTML_COMMENT_NODE_INFO.tag_suffix[i_html_comment]) { | ||
| if (++i_html_comment == HTML_COMMENT_NODE_INFO.tag_suffix_len && i_data + 1 < data.size()) { | ||
| char ch = data[i_data + 1]; //<!--esi must follow by a space char | ||
| if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') { | ||
| is_html_comment_node = true; | ||
| opening_tag_pos = i_data - i_html_comment + 1; | ||
| return COMPLETE_MATCH; | ||
| } | ||
| } | ||
| } else { | ||
| if (i_html_comment) { | ||
| i_html_comment = 0; | ||
| --i_data; // same comments from above applies | ||
| if (i_esi) { | ||
| --i_esi; | ||
| } | ||
| } | ||
| if (avail < esi_len && memcmp(p, ESI_TAG_PREFIX, avail) == 0) { | ||
| is_html_comment_node = false; | ||
| opening_tag_pos = pos; | ||
| return PARTIAL_MATCH; | ||
| } | ||
| ++i_data; | ||
| } | ||
| // partial matches; with the nature of our current opening tags, the | ||
| // only way we can have a partial match for both target strings is | ||
| // if the last char of the input string is '<' and that is not | ||
| // enough information to differentiate the tags; Anyway, the parser | ||
| // takes no action for a partial match | ||
| if (i_esi) { | ||
| is_html_comment_node = false; | ||
| opening_tag_pos = i_data - i_esi; | ||
| return PARTIAL_MATCH; | ||
| } | ||
| if (i_html_comment) { | ||
| is_html_comment_node = true; | ||
| opening_tag_pos = i_data - i_html_comment; | ||
| return PARTIAL_MATCH; | ||
| if (avail <= hlen && memcmp(p, HTML_COMMENT_NODE_INFO.tag_suffix, avail) == 0) { | ||
| is_html_comment_node = true; | ||
| opening_tag_pos = pos; | ||
| return PARTIAL_MATCH; | ||
| } | ||
| i = pos + 1; | ||
| } | ||
| return NO_MATCH; | ||
| } | ||
|
|
||
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One thought here, and I asked Claude to make an example: Did you consider pulling in and using libswoc TextView instead? It excels at things like "string parsing". Not a show stopper, and could be a future consideration.
I asked Claude to write a suggested use with TextView (not tested ...)
swoc::TextView view{data.data() + start_pos, data.size() - start_pos}; const swoc::TextView esi_prefix{ESI_TAG_PREFIX, ESI_TAG_PREFIX_LEN}; const swoc::TextView html_prefix{HTML_COMMENT_NODE_INFO.tag_suffix, size_t(HTML_COMMENT_NODE_INFO.tag_suffix_len)}; while (!view.empty()) { size_t pos = view.find('<'); if (pos == swoc::TextView::npos) return NO_MATCH; view.remove_prefix(pos); size_t abs_pos = data.size() - view.size(); if (view.starts_with(esi_prefix)) { /* COMPLETE_MATCH esi */ } if (view.size() > html_prefix.size() && view.starts_with(html_prefix)) { char ch = view[html_prefix.size()]; if (ch==' '||ch=='\t'||ch=='\r'||ch=='\n') { /* COMPLETE_MATCH html */ } } if (view.size() < esi_prefix.size() && esi_prefix.starts_with(view)) { /* PARTIAL */ } if (view.size() <= html_prefix.size() && html_prefix.starts_with(view)) { /* PARTIAL */ } view.remove_prefix(1); }There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was mostly unaware of what was offered in libswoc, so that's def why I didn't use it. I did some quick benchmarks with a synthetic comparison between this vs TextView and libswoc is as fast if not slightly faster sometimes. Its within margin of error, so Its more up to you guys if you want it re-written that way now, or something we should tackle later?