Re: C/C++ question about dynamic "static struct"

From:

ImpalerCore <jadill33@gmail.com>

Newsgroups:

comp.lang.c++,comp.lang.c

Date:

Sat, 27 Oct 2012 20:24:18 -0700 (PDT)

Message-ID:

<e97a379e-1ce0-43b3-b98b-db26c892471a@k20g2000vbj.googlegroups.com>

On Oct 27, 4:46 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:

ImpalerCore <jadil...@gmail.com> wrote innews:fd1e7b35-3d5d-40b4-91b5-92e=

6b1b66fc2@d17g2000vbv.googlegroups.com:

On Oct 27, 12:13 pm, Paavo Helde <myfirstn...@osa.pri.ee> wrote:

ImpalerCore <jadil...@gmail.com> wrote
innews:906d2617-80f6-408e-bcf2-2ad

c990a0...@p22g2000vby.googlegroups.com:

Agreed, but I also think the "RAII for memory" is also encapsulated
in 'c_levenshtein', unless I misunderstand what you're saying by
"encapsulation". By that I mean that the c_levenshtein just takes
tw

strings;

That's true, but the usability of this encapsulation is on a
different level.

Okay, can you enumerate what "levels of encapsulation" you associate
std::string and c_levenshtein? Are you saying 'class' is a higher
level of encapsulation than 'function'?

No, I talked about the level of "usability". I just wanted to say that
the frequency of anyone using c_levenshtein is several magnitudes less
than anyone using std::string (in C++), just because the c_levenshtein is
a much more specialized thing. So, if I introduce some optimization like
small string optimization in std::string, it has several magnitudes more
impact than introducing it in c_levenshtein (and introducing it in all
functions similar to c_levenshtein is a lot of work which can be avoided
by introducing it in std::string instead).

Okay, I understand your point now. But it's a little difficult on the
comp.lang.c side of the cross post to get access to std::string :-)

, but I am making use of std::string every
day (as well as my home-grown variant class which is also using
small- string optimization).

Are you trying to point out that 'std::string (high) > char* (low)'?

Yes, std::string is inevitably using char* so it is higher level. But on
the other hand, in C++ std::string would be itself pretty much lowest
level and anything using it would be yet higher level. The fact that the
C version of c_levenshtein is containing char[] is dragging it to the
same lower level where std::string resides, where it does not actually
belong.

If I understand you correctly, in C++ land, you are claiming that
algorithms like the levenshtein should be parameterized to work on
std::string. I'd agree with that statement.

But on the comp.lang.c side of things, you have char*, or you have
some to build some kind of struct string with an API to manipulate it.

Encapsulation means I can build my own abstractions, and then build
other stuff on top of them, ad infinitum. Using abstractions means
the upper level code is not concerned with lower-level details. In
contrast, your c_levenshtein() function is containing lower-level
code (like setting up pm_workspace) which has absolutely nothing to
do with the actual purpose

this function. I understand this is kind of inevitable in C.

I'm a bit confused. Evaluating the levenshtein distance in the
classical method requires a matrix that you have to get memory for
from somewhere. Even if you have 'int c_levenshtein( const string&
s1, const string& s2 )' or some levenshtein member function, you still
have to provide memory for the matrix; it's not innate to 's1' and
's2'. I agree that 'pm_workspace' is a kind of small string
optimization for malloc that is not directly related to the algorithm,
but you still need to get the memory from somewhere. Can you
pseudocode your own version of levenshtein using the std::string
framework, so I can better understand where you're getting the memory
for the matrix from, and classify what parts of the function are
"high" and "low" level?

There are no two levels "high" and "low", there is a potentially open-
ended hiearchy of levels. I am very oriented to the actual working code,
so for me the notions "lower" and "higher" just mean the order of loading
dynamic-link-libraries into the process space, assuming that each feature
is implemented in its own dynamic-link library.

And I said the technique just reminded me the short-string optimization
of C++, not that it would be applicable for this very function. Anyway,
here is the translation of the levenshtein function into C++:

\code
int c_levenshtein( const std::string& s1, const std:string& s2 )
{
  // note: try-catch is probably unneeded here, it is just to replicate
  // the original function way to report any errors
  // by returning a non-informative -1 error code.
  try {
    /* If one of the strings is empty "", the edit distance is equal
       to the length of the non-empty string. */
    if (s1.empty() || s2.empty()) {
      return s1.length() + s2.length();
    }
    int m = s1.length()+1;
    int n = s2.length()+1;
    std::vector<int> proximity_matrix(n*m);
    gc_compute_levenshtein_matrix(
      s1.c_str(), s2.c_str(), &m, &n, &proximity_matrix[0] );
    return proximity_matrix[m*n-1];
  } catch(...) {
    return -1;
  }}

\endcode

The intermediate array is using std::vector here instead of std::string,
and I have not heard any implementation of std::vector that is using any
kind of "small-string" optimization. So this C++ version probably
involves a dynamic memory allocation even in case of short input strings
and so probably runs slower than the C version in case of short input
strings.

Getting it faster would involve more work, and I would not be
convinced this is needed unless the profiler told me that. On the other
hand, if more work were needed, I could encapsulate it in a class and
just replace the name std::vector with my class name.

Just from my limited measurements for levenshtein, replacing c_malloc
with c_region for string pairs whose matrix fits into the buffer
reduces the execution time by about 15% on my system. A little faster
but not really enough to jump up and down about, since levenshtein has
such a limited scope. I actually just use malloc in my levenshtein
function in my library, since for my application, it's fast enough.

If you're basically saying C++ > C for encapsulation, I agree with
you.

Yeah, I guess this is mostly what I wanted to say.

. However, one can still build a C interface to a resizing string
to have a kind of std::string equivalent in functionality, but the
code won't look as pretty, especially to someone accustomed to class
based design. But that doesn't mean that it can't be done, and that
it wouldn't be useful for someone using C.

Sure, C is Turing complete so one can do anything in it. It just takes
more care and discipline.

Thanks for taking the time to give your perspective.

Best regards,
John D.