Which filter is more efficient?

Ash@wiki · 06.01.2022

Filter by bo:

#include <stdbool.h>
#include <string.h>

typedef bool (*FNC_FOUNDMAIL)(char *);

bool IsAlowedMailChar(char chr) {
  /*
    According to RFC 2822, the local-part of the email may use any of these ASCII characters:
        Uppercase and lowercase letters
        The digits 0 through 9
        The characters "! # $ % & ' * + - / = ? ^ _ ` { | } ~"
        The character "." provided that it is not the first or last character in the local-part.
    */
  // NOTE: only a subset of the characters listed above is accepted here; a space is never accepted.
  const char *SpecialChar = "*+-=^_`|~.";
  char c[2] = {chr, 0};
  return ((chr >= 'a') && (chr <= 'z')) || ((chr >= 'A') && (chr <= 'Z')) || ((chr >= '0') && (chr <= '9')) || (strpbrk(c, SpecialChar) != NULL);
}

void SearchEMail(char *html_source, FNC_FOUNDMAIL found) {
  const char *EMAIL_AT[] = {"@", "[at]", "@@"};

  if (!html_source || !found)
    return;

  char *runptr;
  const unsigned int len = sizeof(EMAIL_AT) / sizeof(char *);

  // Plain Search
  for (unsigned int idx = 0; idx < len; ++idx) {
    runptr = strstr(html_source, EMAIL_AT[idx]);
    while (runptr) {
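      // Start at the separator and walk left while the preceding character is allowed,
      // so an address at the very start of the buffer keeps its first character.
      char *begin = runptr;
      char *end = runptr + strlen(EMAIL_AT[idx]);

      while ((begin > html_source) && IsAlowedMailChar(begin[-1]))
        begin--;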

      while ((*end) && IsAlowedMailChar(*end))
        end++;

      if (*end)
        runptr = strstr(runptr + 1, EMAIL_AT[idx]);
      else
        runptr = NULL;

      char Backup = *end;
      *end = 0;
      bool bContinue = found(begin);
      *end = Backup;
      if (!bContinue)
        return;
    }
  }
}

Filter by aw:

void SearchEMail(const char* cDomain, char nextEmail[MAX_PATH])
{
    const char symbolCha[3][5] = { "@","@@","[AT]" };
    const char* notCh = "!§$%&/()=?`´+#~*_<>|:;^°}{³² ";
    const char* buffCh = cDomain;
    char cpyConst[1024] = { 0 };
    unsigned int eCount = 0;

    for (int i = 0; i < (int)(sizeof(symbolCha) / sizeof(symbolCha[0])); i++) // iterate over the 3 separators, not the 15 bytes of the array
    {
    startTheScan:
        if (const char* chUpaDow = strstr(buffCh, symbolCha[i]))
        {
            while (chUpaDow)
            {
                if (strcspn(chUpaDow, notCh))
                {
                    chUpaDow--;
                }
                else
                {
                    chUpaDow++;


                    for (int i = 0; i < strlen(chUpaDow); i++)
                    {
                        cpyConst[i] = (int)chUpaDow[i];
                        unsigned int pos = strlen(cpyConst);
                        chUpaDow++;
                        for (int i = 0; i < strlen(notCh); i++)
                        {

                            if (cpyConst[pos] == notCh[i])
                            {
                                pos--;
                                nextEmail[eCount++] = cpyConst[pos];
                                buffCh = chUpaDow;
                                goto startTheScan;
                            }
                        }
                    }
                }
            }
        }
    }
}

german · 08.01.2022

From just skimming it, I would assume that code 1 performs better. Code 2 is nested too deeply, and calling strlen repeatedly inside loops can really only slow you down. You will only know for sure if you run a code profiler over it, though. Maybe also just drop it into godbolt and look at the resulting assembly.
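To illustrate the strlen point, here is a minimal sketch. The function names are hypothetical and not taken from either filter. Whether a given compiler hoists the call on its own depends on what it can prove about the string, so this is something to measure rather than assume.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: counts '@' characters and evaluates strlen in the
// loop condition. Unless the compiler can prove the string is unchanged,
// the string may be rescanned on every iteration.
static size_t CountAtNaive(const char *s)
{
  size_t count = 0;
  for (size_t i = 0; i < strlen(s); i++)
    if (s[i] == '@')
      count++;
  return count;
}

// Same result, but the length is computed once before the loop.
static size_t CountAtHoisted(const char *s)
{
  size_t count = 0;
  const size_t len = strlen(s);
  for (size_t i = 0; i < len; i++)
    if (s[i] == '@')
      count++;
  return count;
}
```

Both functions return the same count; only the number of strlen calls differs.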
In general I would also advise you to take a look at an ASCII table. There are further contiguous character ranges, and you can extend your existing ones here and there. Anything you can test this way no longer has to be checked character by character in strpbrk or similar functions.
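A minimal sketch of what that could look like, assuming the accepted special characters are "*+-=^_`|~." as in the first filter and that the execution character set is ASCII. Both function names are hypothetical. In ASCII, ^ _ ` sit directly before a, * and + are neighbours, and so are - and ., so they can be folded into range comparisons.

```c
#include <stdbool.h>
#include <string.h>

// Hypothetical baseline: letter/digit ranges plus a strpbrk lookup
// for the special characters.
static bool IsAllowedWithStrpbrk(char chr)
{
  const char c[2] = {chr, 0};
  return (chr >= 'a' && chr <= 'z') ||
         (chr >= 'A' && chr <= 'Z') ||
         (chr >= '0' && chr <= '9') ||
         strpbrk(c, "*+-=^_`|~.") != NULL;
}

// Hypothetical range-only variant accepting the same characters (ASCII assumed).
static bool IsAllowedWithRanges(char chr)
{
  return (chr >= '^' && chr <= 'z') || // ^ _ ` a-z (codes 94..122)
         (chr >= 'A' && chr <= 'Z') ||
         (chr >= '0' && chr <= '9') ||
         (chr >= '-' && chr <= '.') || // - . (codes 45..46)
         (chr >= '*' && chr <= '+') || // * + (codes 42..43)
         chr == '=' || chr == '|' || chr == '~';
}
```

Whether the range variant is actually faster is still a question for the profiler; the sketch only shows that the same set can be expressed without a library call.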

german · 08.01.2022

I quickly put something together that I think performs reasonably well. Part of that is ignoring a few rules and ordering the checks by the length of the character ranges and by how likely the characters are to occur. The details are in the code comments ...

C:

// clang-format off

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

// valid character / range           | priority
// ----------------------------------+----------
// !                                 | 5
// #$%&'                             | 4
// *+                                | 5
// -                                 | 4
// /0123456789                       | 3
// =                                 | 5
// ?                                 | 5
// ABCDEFGHIJKLMNOPQRSTUVWXYZ        | 2
// ^_`abcdefghijklmnopqrstuvwxyz{|}~ | 1
static bool CheckLocalChar(const char ch)
{
  return ((ch >= '^' && ch <= '~') ||
          (ch >= 'A' && ch <= 'Z') ||
          (ch >= '/' && ch <= '9') ||
          (ch >= '#' && ch <= '\'') ||
          ch == '-' || ch == '!' || ch == '*' || ch == '+' || ch == '=' || ch == '?');
}

// valid character / range           | priority
// ----------------------------------+----------
// -                                 | 4
// 0123456789                        | 3
// ABCDEFGHIJKLMNOPQRSTUVWXYZ        | 2
// abcdefghijklmnopqrstuvwxyz        | 1
static bool CheckDomainChar(const char ch)
{
  return ((ch >= 'a' && ch <= 'z') ||
          (ch >= 'A' && ch <= 'Z') ||
          (ch >= '0' && ch <= '9') ||
          ch == '-');
}

static bool CheckIteratively(const char *begin, const char *const end, bool (*const pCharCheckFunc)(const char))
{
  for (bool previousIsDot = false; begin < end; ++begin)
  {
    if (*begin != '.') // current character is no dot
    {
      if (!pCharCheckFunc(*begin)) // invalid characters
        return false;

      previousIsDot = false;
    }
    else // current character is dot
    {
      if (previousIsDot) // consecutive dots
        return false;

      previousIsDot = true;
    }
  }

  return true;
}

// @ is special (separates Local and Domain parts, must appear exactly once)
// . is special (allowed neither at the beginning nor at the end of both Local and Domain, must not appear consecutively)
// - is special (allowed neither at the beginning nor at the end of Domain)
// The Local part must not be longer than 64 characters
// NOTE: E-mail addresses that contain IP addresses or UTF-8 sequences or quoted phrases are not taken into account, but are treated as invalid.
static bool CheckAddress(const char *const address)
{
  if (!address || *address == '\0' || *address == '.') // NULL pointer passed, zero-length string passed, dot at the beginning of the Local part
    return false;

  const char *const localEnd = strchr(address, '@'); // find @
  if (!localEnd || localEnd == address || localEnd[-1] == '.' || localEnd - address > 64) // no @, no Local part, Local part ends with dot, Local part too long
    return false;

  const char *const domain = localEnd + 1;
  if (*domain == '\0' || *domain == '.' || *domain == '-') // no Domain part, Domain part begins with dot, Domain part begins with hyphen
    return false;

  const char *const domainEnd = domain + strlen(domain);
  if (domainEnd[-1] == '.' || domainEnd[-1] == '-') // Domain part ends with dot, Domain part ends with hyphen
    return false;

  return (CheckIteratively(address, localEnd, CheckLocalChar) && CheckIteratively(domain, domainEnd, CheckDomainChar)); // valid characters?
}

int main(void)
{
  static const char *const addresses[] = {
    "Simple@Example.com",
    "very.common@example.com",
    "disposable.style.email.with+symbol@example.com",
    "mailhost!username@example.org",
    "user%example.com@example.org",
    "Abc.example.com", // no @
    "A@b@c@example.com", // more than one @
    "foo..bar@example.com", // consecutive dots
    "i_like_underscore@but_its_not_allowed_in_this_part.example.com", // underscores in Domain part
    "1234567890123456789012345678901234567890123456789012345678901234+x@example.com" // Local part too long
  };

  const size_t addressesLen = sizeof(addresses) / sizeof(*addresses);
  for (const char *const *pAddress = addresses, *const *const pEnd = pAddress + addressesLen; pAddress < pEnd; ++pAddress)
    printf("%s : %s\n", CheckAddress(*pAddress) ? "good" : "bad ", *pAddress);

  return 0;
}

// clang-format on

EDIT: Whoa! @JR Cologne Sorry about the clang-format off. I think we need to do something about the automatic formatter at some point. Pulling functions entirely onto a single line is ugly.

https://dev-community.de/threads/code-formatter-test.103/post-2541

JR Cologne · 09.01.2022

@german Yep, that obviously won't do. Sorry about that.

Here is the corresponding clang config: https://github.com/dev-community-de/code-formatter/blob/master/.clang-format

If you have a concrete suggestion for how the config should be adjusted, that would be great.
Then I could simply adopt it and/or you open a pull request.

Otherwise I would take a look at it myself at some point in the distant future.
I still have one or two bugs on the to-do list for the forum that haven't been tackled yet anyway.

german · 09.01.2022

By far the simplest solution is to set ColumnLimit: 0. That also respects the user's decisions to place line breaks deliberately (as I did in the comparisons, where each character range sits on its own line and the comparisons for the single characters on another). @Lowl3v3l won't like it ¯\_(ツ)_/¯

JR Cologne · 09.01.2022

OK, thanks for the suggestion.
That would be fine from my point of view.
If there are no objections, I'll integrate it that way.

Lowl3v3l · 10.01.2022

I'd like to add that C/C++ compilers and linkers have become so good at optimizing that, in general, it is hard to say without benchmarking and profiling which code yields the better result, or whether both even get optimized to the same result. For example, you often see "bit hacks" that gain nothing because the compiler turns multiplications and divisions into the same result as the bit hack anyway. That is also why people who try to optimize something quite often end up with worse code: they actively make things hard for the compiler.
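A minimal sketch of the kind of pair meant here, with hypothetical function names. For unsigned operands the arithmetic and the shift versions are defined to give identical results, so an optimizing compiler is free to emit the same instructions for both. Whether a particular compiler does so is best checked on godbolt rather than taken for granted.

```c
#include <stdint.h>

// Plain arithmetic: states the intent directly.
static uint32_t TimesEight(uint32_t x)
{
  return x * 8u;
}

static uint32_t HalfOf(uint32_t x)
{
  return x / 2u;
}

// "Bit hack" versions of the same operations (hypothetical names).
static uint32_t TimesEightShift(uint32_t x)
{
  return x << 3;
}

static uint32_t HalfOfShift(uint32_t x)
{
  return x >> 1;
}
```

Since the results are the same, the readable version costs nothing in correctness, and measuring decides whether it costs anything in speed.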

And the top priority for code is always that it can be read by humans.

@german: you're right, nice that my reputation precedes me like that

Regards

german · 10.01.2022

Lowl3v3l wrote:
@german: you're right, nice that my reputation precedes me like that

Alternatively, AllowShortFunctionsOnASingleLine: None
That still pulls together everything that somehow fits within the ColumnLimit, and deliberate line breaks are undone. Personally I think that's rubbish, but we won't see eye to eye on that point

Expecting forum users to deliberately work with a clang-format off in that case strikes me as unrealistic, considering who we need the automatic formatting in the forum for in the first place ...
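For reference, the two options discussed in this thread would look roughly like this as a .clang-format fragment. This is only a sketch: the rest of the forum's config is not shown here, and how these settings interact with it is an assumption that would need testing.

```yaml
# Option 1: no column limit; clang-format then largely keeps the
# line breaks the author placed within statements.
ColumnLimit: 0

# Option 2 (alternative): keep the existing column limit, but never
# put a function body on the same line as its signature.
# AllowShortFunctionsOnASingleLine: None
```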
