понедельник, 23 марта 2009 г.

C Sharp rewrite of Rob Pike's Beautiful Code is almost as fast as the code in the Brian Kernighan Essay

Created 2008-01-03 01:42

Hong Kong 1-3-2007 2:21 PM

Nonsense prevails, modesty fails
Grace and virtue turn into stupidity
While the calendar fades almost all barricades to a pale compromise
And our leaders have feasts on the backsides of beasts
They still think they're the gods of antiquity
If something you missed didn't even exist
It was just an ideal -- is it such a surprise?

What shall we do, what shall we do with all this useless beauty?
All this useless beauty

...a continuing discussion of Rob Pike's "regular expression" code in Beautiful Code, Ozam and Wilson (eds.), O'Reilly Media, Sebastopol CA 2007, p. 3.

I've implemented Rob Pike's algorithm in C Sharp, replacing its use of addresses with Strings to provide internationalization and above all safety as the code is changed. But I've preserved most of its features...including something I noted while doing the conversion: Pike's bad implementation of the compare for a character match (including the match of dot) in two places: in matchhere() and in matchstar(). This is something which should have been done with a function call or a macro in the text presented as any code, that is supposed to be Beautiful code.

I'm still doing validation of my comparision and plan to write a wrapper for both the C Sharp and C++/C executables to test them thoroughly for both comparative correctness and comparative performance. But preliminary results indicate that for the regular expression C.t, and a string consisting of 38 "d"s and Cat, both return the same results.

The Pike code takes on my contemporary machine about .6 seconds for 1000000 iterations. The C Sharp code takes about 2.7 seconds.

This is such a minimal gain in efficiency emerging after ONE MILLION iterations, that, in my view, Pike's code isn't Beautiful, and the use of C to point at "strings" is bad practice today as is the use of C, or C embedded within C++.

Brian, please don't praise Rob for taking only an hour. This culture of "doing it fast" cannot produce Beautiful or correct code, and it has resulted at many companies in the destruction of programmer morale, and their family life, as they are continually bullied to repeatedly do, and redo, bad code in inadequate languages, "fast" based on role models such as you and Pike are, deservedly enough.

In this culture, grace and virtue turn into stupidity most assuredly while the leaders have their feasts on the backside of beasts.

There are no bugs in Rob's code: bugs in the code presented here are my fault if they exist in fact.

The following consists of:

(1) My summary of the flaws in the Pike code in Beautiful Code (O'Reilly Media, 2007)

(2) The only file you need to create a C Sharp command line executable that uses strings to execute the C sharp version of the Pike code in Visual Studio C Sharp

(3) The only file you need to create a C++ command line executable that implements the Pike code, repeated from "elsethread"

FLAWS IN THE PIKE CODE

It doesn't implement a regular expression interpreter insofar as "regular expressions" have a formal, mathematical meaning: it makes no provision for a character which must occur at least once and it makes no provision for nesting as in (a+b)*, and, it returns the shortest conformant string. In a mathematical sense, the longest conformant string can be, possibly, transformed into the SET of conformant strings but the shortest conformant string is only one member. The code attains its micro-efficiency by not doing a thorough job.

It is idiomatic C, which uses a parameter passed by value as the changeable index into the string it points-at. This teaches bad practice unusable in other languages and unsafe in C from the standpoint of program maintenance.

While advertised as a string processor, it does not handle modern Unicode or double-byte strings containing international input.

Kernighan does express, in the text of the essay, the concern that the code, which does heavy recursion proportional to the string length, might run slowly or overrun stack limits.

But no consideration exists in the code or Brian's essay, that a Beautiful way exists to avoid most recursion and speed up the code. If the regular expression does not start with a string start carat or string end dollar sign, its first character that is not followed by an asterisk is a "handle" which which can be scanned-for in a nonrecursive loop or by a strcspn function.

Of course, starter code that searches the regular expression for a fixed handle (such as is c in cd*) might not be considered Beautiful in a minimalist sense.

I plan to implement this later in the C Sharp code.

The metacharacters carat, asterisk and dollar sign outside of expected positions are treated as ordinary data in an undocumented bug/feature, and no escape character is supported.

The matching of a specific character against a specific character in the regular expression is coded in two separate places in two separate procedures. This is very poor programming practice, and in itself makes this Ugly Code.

THE C SHARP CODE FILE
using System;
using System.Collections.Generic;
using System.Text;

namespace kernighanRegexCsharp
{
class Program
{

static void Main(string[] args)
{
const int REPCOUNT_BACKUP = 1000000;
const string REGEX_BACKUP = "C.t";
const string TESTSTRING_BACKUP = "ddddddddddddddddddddddddddddddddddddddCat";
System.Diagnostics.Stopwatch stopWatch = new System.Diagnostics.Stopwatch();
int repeats, reps;
int result = 0;
string regex;
string testString;
int satisfierIndex = 0;
int satisfierLength = 0;
int check = 0;
if (args.GetUpperBound(0) < 2)
{
System.Console.Write("My syntax is kernighanRegexCsharp [c]\nUse the c at the end of the command line to check results from repetition for identity to previous results\n");
System.Console.Write("I'll use my backup test values\n");
repeats = REPCOUNT_BACKUP;
regex = REGEX_BACKUP;
testString = TESTSTRING_BACKUP;
} else
{
repeats = System.Convert.ToInt32(args[0]);
regex = args[1];
testString = args[2];
if (args.GetUpperBound(0) >= 3)
check = (args[3].ToUpper() == "C") ? 1 : 0;
}
System.Console.Write("Applying the regular expression " +
"\"" + regex + "\" " +
"to the string " +
"\"" + testString + "\" " +
repeats.ToString() + " " +
"time(s)\n");
int result1 = -1;
int satisfierIndex1 = 0;
int satisfierLength1 = 0;
stopWatch.Start();
for (reps = 0; reps < repeats; reps++)
{
result = kernighanRegexCsharp.match(regex,
testString,
ref satisfierIndex,
ref satisfierLength);
if (check == 1 && repeats > 1)
if (result1 == -1)
{
assertx(result == 0 || result == 1);
result1 = result;
satisfierIndex1 = satisfierIndex;
satisfierLength1 = satisfierLength;
} else assertx(result == result1
&&
satisfierIndex == satisfierIndex1
&&
satisfierLength == satisfierLength1);
}
stopWatch.Stop();
double duration = (double)stopWatch.ElapsedMilliseconds / 1000;
if (reps > 0)
{
System.Console.WriteLine("Satisfier string was " +
(result == 0 ? "not " : "") +
"found");
if (result == 1)
{
System.Console.WriteLine("Zero-origin index of satisfier string: " +
satisfierIndex.ToString());
System.Console.WriteLine("Length of satisfier string: " +
satisfierLength.ToString());
}
System.Console.WriteLine("Time taken was about " +
duration.ToString() + " " +
"seconds");
if (check == 1 && reps > 1)
System.Console.WriteLine("\nThe first return value was checked to see if it was 1 or zero,\nand each set of results starting with result 2 was checked against\nthe first set of results to see if they matched: all of these tests succeeded\n");
}
return;
}
static void assertx(bool condition)
{
if (!condition)
{
throw new Exception("Assertion has failed\n");
}
}
}


public static class kernighanRegexCsharp
{
/* match: search for regexp anywhere in text */
public static int match(string regex, string text, ref int start, ref int length)
{
start = 0;
length = 0;
int textIndex = 0;
int regexIndex = 0;
if (regex[0] == '^')
{
regexIndex++;
if (matchhere(regex, ref regexIndex, text, ref textIndex) == 1)
{
start = 0;
length = textIndex;
return 1;
} else return 0;
}
int textIndex0;
do { /* must look even if string is empty */
textIndex0 = textIndex;
if (matchhere(regex, ref regexIndex, text, ref textIndex) == 1)
{
start = textIndex0;
length = textIndex - textIndex0;
return 1;
}
} while (textIndex++ < text.Length);
return 0;
}
/* matchhere: search for regexp at beginning of text */
private static int matchhere(string regex, ref int regexIndex, string text, ref int textIndex)
{
if (regexIndex >= regex.Length) return 1;
int index;
if ((index = regexIndex + 1) < regex.Length && regex[index] == '*')
{
regexIndex += 2;
return matchstar(regex[regexIndex], regex, ref regexIndex, text, ref textIndex);
}
if (regex[regexIndex] == '$' && index >= regex.Length)
return (textIndex >= text.Length) ? 1 : 0;
if (textIndex < text.Length && (regex[regexIndex] == '.' || regex[regexIndex] == text[textIndex]))
{
regexIndex += 1; textIndex += 1;
return matchhere(regex, ref regexIndex, text, ref textIndex);
}
return 0;
}

/* matchstar: search for c*regex at beginning of text */
static int matchstar(char c, string regex, ref int regexIndex, string text, ref int textIndex)
{
do { /* a * matches zero or more instance */
if (matchhere(regex, ref regexIndex, text, ref textIndex) == 1)
return 1;
} while (textIndex < text.Length && (text[textIndex++] == c || c == '.'));
return 0;
}
};



}

THE C++ CODE FILE
// kernighanRegexC.cpp : main project file.

#include "stdafx.h"
#include
#include
#include
#include
#include
#include
#include

using namespace std;


#define REPCOUNT_BACKUP ("1")
#define REGEX_BACKUP ("")
#define TESTSTRING_BACKUP ("dddCat")



using namespace System;

public value class kernighanRegexC
{
public:
/* match: search for regexp anywhere in text */
static int match(char *regexp, char *text, char **start, int *length)
{
*length = 0;
*start = text;
if (regexp[0] == '^')
{
return matchhere(regexp+1, &text)
?
(*length = text - *start, 1)
:
(*start = 0, 0);
}
do { /* must look even if string is empty */
*start = text;
if (matchhere(regexp, &text))
{
*length = text - *start;
return 1;
}
} while (*text++ != '\0');
*start = 0;
return 0;
}
private:
/* matchhere: search for regexp at beginning of text */
static int matchhere(char *regexp, char **textp)
{
if (regexp[0] == '\0') return 1;
if (regexp[1] == '*')
return matchstar(regexp[0], regexp+2, textp);
if (regexp[0] == '$' && regexp[1] == '\0')
return **textp == '\0';
if (**textp!='\0' && (regexp[0]=='.' || regexp[0]==**textp))
return matchhere(regexp+1, (++(*textp), textp));
return 0;
}

/* matchstar: search for c*regexp at beginning of text */
static int matchstar(int c, char *regexp, char **textp)
{
do { /* a * matches zero or more instance */
if (matchhere(regexp, textp))
return 1;
} while (**textp != '\0' && (*(*textp)++ == c || c == '.'));
return 0;
}
};

void assertx(int condition)
{
if (!condition)
{
printf("Assertion has failed\n");
abort();
}
}

int main(int argc, char *argv[] )
{
clock_t start;
int repeats, reps;
int result = 0;
char *regexAddr;
char *stringAddr;
char *satisfierAddr;
int satisfierLength;
int check = 0;
if (argc < 4)
{
printf("My syntax is kernighanRegexC [c]\nUse the c at the end of the command line to check results from repetition for identity to previous results\n");
printf("I'll use my backup test values\n");
repeats = atoi((REPCOUNT_BACKUP));
regexAddr = (REGEX_BACKUP);
stringAddr = (TESTSTRING_BACKUP);
} else
{
repeats = atoi(argv[1]);
regexAddr = (argv[2]);
stringAddr = (argv[3]);
if (argc > 4)
check = (argv[4][0] == 'c' && argv[4][1] == '\0');
}
printf( "Applying the regular expression '%s' to the string '%s' %i time(s)\n",
regexAddr, stringAddr, repeats);
start = clock();
int result1 = -1;
int satisfierIndex1;
int satisfierLength1;
for (reps = 0; reps < repeats; reps++)
{
result = kernighanRegexC::match(regexAddr,
stringAddr,
&satisfierAddr,
&satisfierLength);
if (check && repeats > 1)
if (result1 == -1)
{
assertx(result == 0 || result == 1);
result1 = result;
satisfierIndex1 = satisfierAddr - stringAddr;
satisfierLength1 = satisfierLength;
} else assertx(result == result1
&&
satisfierIndex1 == satisfierAddr - stringAddr
&&
satisfierLength == satisfierLength1);
}
double duration = (double)( clock() - start) / CLOCKS_PER_SEC;
if (reps > 0)
{
printf( "Satisfier string was %sfound\n", result==0 ? "not " : "");
if (result == 1)
{
printf("Zero-origin index of satisfier string: %i\n", satisfierAddr - stringAddr);
printf("Length of satisfier string: %i\n", satisfierLength );
printf("Starting address of test string: %x\n", stringAddr);
printf("Starting address of satisfier string: %x\n", satisfierAddr);
}
printf( "Time taken was about %2.8f seconds\n", duration );
if (check && reps > 1)
printf("\nThe first return value was checked to see if it was 1 or zero,\nand each set of results starting with result 2 was checked against\nthe first set of results to see if they matched: all of these tests succeeded\n");
}
return 0;
}

Комментариев нет:

Отправить комментарий