Re: Searching for byte sequence

From: "Kahlua" <kahlua@right.here>
Newsgroups: microsoft.public.vc.mfc
Date: Tue, 22 Apr 2008 21:19:46 GMT
Message-ID: <SDsPj.7710$NK1.3904@trndny05>

If you want, you can see sample files I am trying to read data from; I have
two posted on my website:
www.kahlus.com/1.ppf and www.kahlus.com/2.ppf
The first is a 6 MB file and the second is a 1 MB file.
Inside each file you can see the header info describing the binary data that
follows, in 4 sections.
I need to extract the data after each header.
ImageWidth x ImageHeight = number of bytes to extract after the ImageData
keyword.
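
The size rule above (ImageWidth x ImageHeight bytes after the ImageData
keyword) could be sketched roughly as follows. This is plain standard C++
rather than MFC, and it is only a sketch under assumptions: extractImage and
dataOffset are hypothetical names, and dataOffset is assumed to already point
at the first binary byte, since the exact separator after the keyword is not
stated here.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<unsigned char> Bytes;

// Copies width * height bytes from `file`, starting at `dataOffset`, into
// `out`. Returns false (leaving `out` untouched) if the product overflows or
// the file is too short to hold that many bytes after `dataOffset`.
// Assumption: `dataOffset` is the offset of the first binary byte.
bool extractImage(const Bytes& file, std::size_t dataOffset,
                  std::size_t width, std::size_t height, Bytes& out)
{
    // Guard the multiplication against overflow.
    if (width != 0 && height > std::numeric_limits<std::size_t>::max() / width)
        return false;
    const std::size_t count = width * height;

    // Make sure the whole block lies inside the file.
    if (dataOffset > file.size() || count > file.size() - dataOffset)
        return false;

    out.assign(file.begin() + dataOffset, file.begin() + dataOffset + count);
    return true;
}
```

Checking the count against the remaining file size before copying means a
damaged or truncated file is reported as a failure instead of reading past
the end of the buffer.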

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:e3hs0454tcnsmk7t893l74qus8tfjod1dh@4ax.com...

How do you know the length of the text? Is it variable-length and has to be
deduced from finding the NUL character? Or is there an explicit length? Is
the header one string or several strings (lines don't count; a string is
either NUL-terminated or not)? It's a small detail, but would help finding
things. Is it exactly 10,000 bytes each time? Or approximately?

Search the first string. Then skip to the second string and search it. And
so on. Find will stop searching when it hits a NUL terminator.
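
To illustrate the NUL-terminator point, here is a small sketch in standard
C++ (not MFC). It uses strstr to stand in for a C-string style search, and
std::search as one possible alternative that treats 0x00 as an ordinary
byte. The names findBytes, findCString and kNotFound are made up for the
example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

const std::size_t kNotFound = static_cast<std::size_t>(-1);

// Length-based search: looks at every byte in [from, buf.size()),
// so embedded 0x00 bytes do not end the search.
std::size_t findBytes(const std::vector<char>& buf, const std::string& needle,
                      std::size_t from = 0)
{
    if (needle.empty() || from > buf.size())
        return kNotFound;
    std::vector<char>::const_iterator it =
        std::search(buf.begin() + from, buf.end(), needle.begin(), needle.end());
    return it == buf.end() ? kNotFound
                           : static_cast<std::size_t>(it - buf.begin());
}

// C-string search for comparison: strstr stops at the first 0x00 byte.
// Precondition: buf contains at least one 0x00 byte, so it is a valid C string.
std::size_t findCString(const std::vector<char>& buf, const char* needle)
{
    const char* p = std::strstr(buf.data(), needle);
    return p ? static_cast<std::size_t>(p - buf.data()) : kNotFound;
}
```

With a buffer holding "hdr", a 0x00 byte, then "ImageData", findCString
does not see the keyword while findBytes does.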

Note that because of the huge ratio, I would be inclined to NOT read the
entire file; I would read a header and search it, then if nothing is found,
skip the 10K bytes to the next header, seek to that file position, read the
next header, and so on. The catch here is knowing how long a header is; in a
well-designed file format, a variable-length field will have an explicit
length preceding it.

Also, because you suggested the files can be very long, the chance of being
able to read the entire file in decreases with file length, whereas reading
segments of the file will work independent of the file length. The trick of
reading the entire file is reliable up to a few hundred megabytes at best,
then it becomes problematic as to whether it will work or not.
joe
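
A rough sketch of the read-a-header, skip-the-data, seek-to-the-next-header
idea, in standard C++ streams rather than CFile. It sidesteps the catch
mentioned above by assuming that both the header size and the data size are
fixed and known, which the real files may not satisfy. The function names
and the sizes used in demo are hypothetical.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Reads one header at a time and seeks past the binary block in between,
// so the binary data is never loaded. Returns the zero-based index of the
// first record whose header contains `needle`, or -1 if none does.
// Assumption: every header is exactly headerSize bytes and every data
// block is exactly dataSize bytes.
long findRecordWithKeyword(std::istream& in, std::size_t headerSize,
                           std::size_t dataSize, const std::string& needle)
{
    if (headerSize == 0)
        return -1;
    std::string header(headerSize, '\0');
    for (long record = 0; ; ++record) {
        in.read(&header[0], static_cast<std::streamsize>(headerSize));
        if (static_cast<std::size_t>(in.gcount()) != headerSize)
            return -1;                      // end of file or truncated header
        if (header.find(needle) != std::string::npos)
            return record;
        // Skip the binary block without reading it.
        in.seekg(static_cast<std::streamoff>(dataSize), std::ios_base::cur);
        if (!in)
            return -1;
    }
}

// Usage with an in-memory stream standing in for the file:
// two 8-byte headers, each followed by 4 bytes of binary data.
long demo(const std::string& needle)
{
    std::string file = "HEADER-A";
    file.append(4, '\0');
    file += "HEADER-B";
    file.append(4, '\xFF');
    std::istringstream in(file);
    return findRecordWithKeyword(in, 8, 4, needle);
}
```

Memory use stays at one header regardless of file length, which is the
point of reading in segments.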

On Tue, 22 Apr 2008 17:19:57 GMT, "Kahlua" <kahlua@right.here> wrote:

Hmmm,
The actual format of the file is roughly as follows:
Several lines of "text" header info followed by 10,000 bytes of binary data
(which might contain 0x00's) followed by several more lines of header info
and more binary data.
Each header contains text keywords followed by text values giving how many
binary bytes are to follow.
Thanks for the help,
Ed

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:sg6s04dchm0d9k14edlii190j4lsd0gpee@4ax.com...

If the text data does not resemble the binary data (that is, no binary data
could look like your text string) then the Find approach will work.
Otherwise, you need to parse the structure of the file. Unfortunately, Find
and its underlying CRT support do not have a notion of "limiting to a
range", that is, you can supply a start point, but not a stop point.

If there can be no ambiguity with text, I might consider something like the
following:

    int n = buffer.Find("text", offset);

then parse the file structure. If n gives me an offset in a string (that is,
presumably you know where the string data is as offset-and-length, so you
parse until the offset falls within an offset-and-length range), the match
is real. If instead you find that the "next string" you reach while parsing
the file structure is beyond n, then you had one of the ambiguities, so you
can set the offset value in the call to be the offset of the string you just
reached, and issue the Find again. This might be faster than alternatives if
the search string is short and you have a lot of strings in the file. If the
proportion of string-to-binary is low, that is, it is mostly binary, I'd
probably be inclined to parse the structure and check just the strings.
There's no one "easy" answer when you have a mix like this.
joe
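
The "parse the structure and check just the strings" variant could look
something like the sketch below. It is standard C++, not MFC, and swaps in
std::search because that takes both a start and a stop point. TextRange and
findInTextRanges are hypothetical names, and the list of text ranges is
assumed to come from whatever code parses the file structure.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// One text (header) region of the file, as offset-and-length.
struct TextRange {
    std::size_t offset;
    std::size_t length;
};

// Searches for `needle` only inside the given text ranges, in order.
// Returns the file offset of the first match, or std::string::npos.
// A match that sits in the binary data between ranges is never reported.
std::size_t findInTextRanges(const std::string& file,
                             const std::vector<TextRange>& ranges,
                             const std::string& needle)
{
    if (needle.empty())
        return std::string::npos;
    for (std::size_t i = 0; i < ranges.size(); ++i) {
        if (ranges[i].offset >= file.size())
            continue;                       // range lies outside the file
        const std::size_t len =
            std::min(ranges[i].length, file.size() - ranges[i].offset);
        std::string::const_iterator first = file.begin() + ranges[i].offset;
        std::string::const_iterator last = first + len;
        std::string::const_iterator hit =
            std::search(first, last, needle.begin(), needle.end());
        if (hit != last)
            return static_cast<std::size_t>(hit - file.begin());
    }
    return std::string::npos;
}
```

If the binary data happens to contain the bytes "KEY" and a later header
also contains "KEY", a plain search from the start reports the binary hit,
while this version reports only the one inside the header.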

On Tue, 22 Apr 2008 15:00:11 GMT, "Kahlua" <kahlua@right.here> wrote:

Thanks for all this useful information.
The files I am reading into buffer contain both text and binary data.
I need to search for a text string and move the position ahead to the binary
data that follows.
Then I need to extract a known number of bytes to another buffer for
processing.
After the binary data comes text again, which I need to search for a certain
string.
Thanks for the help.
Ed

"Joseph M. Newcomer" <newcomer@flounder.com> wrote in message
news:a5tr04hnp8p1144cakj3araaqaf9k1bl1d@4ax.com...

See below...
On Tue, 22 Apr 2008 13:24:32 GMT, "Kahlua" <kahlua@right.here> wrote:

I have a ListBox with a list of files in the c:\data\ folder with the
extension .dat
Now that I have the file copied into buffer, how do I search through the
buffer for a specific sequence of bytes?

Thanks,
Ed

void CMyApp::OnSelchangeList()
{
CString mess;
CString JobFile;
char cSelect[50];

****
TCHAR cSelect[MAX_PATH];
at the VERY least. Better still,

CString cSelect;
****

int Length;
int nSelect;
CByteArray buffer;
CFile in;

nSelect=SendDlgItemMessage(IDC_LIST, LB_GETCURSEL, 0, 0L);

****
Why such a crude and antiquated mechanism? Create a control variable for
your list and do
    nSelect = c_List.GetCurSel();
note how much easier it is!
****

DlgDirSelect((LPSTR) cSelect, IDC_LIST);

****
Note that DlgDirSelect makes the GetCurSel superfluous, but actually the
simplest thing to do is to write
    c_List.GetText(nSelect, cSelect);
which is a whole lot easier
****

Length=strlen(cSelect);
if (cSelect[Length-1]==0x2e)
cSelect[Length-1]=0;

****
I can't figure out what this is doing because I have no idea what the
purpose of it is. For example, what in the world is 0x2e? Perhaps you meant
to write
    if(cSelect[Length-1] == _T('.'))
?

If you are testing for a character, it is generally considered good
programming practice to use the character, and not its hex equivalent.

Also, using the obsolete 'char' data type is not good programming practice;
you should get the length by writing
    Length = _tcslen(cSelect);

But note that this is much more readily written if you have a CString:
    if(cSelect.Right(1) == _T("."))
        cSelect = cSelect.Left(cSelect.GetLength() - 1);
which is a lot easier to write and understand. Note that you don't need to
get the length as a separate variable.
****

JobFile = _T("c:\\data\\");

****
You are correctly using _T() here, but in a Unicode build the next line
would fail
****

JobFile += cSelect;
JobFile += ".dat";

****
So why did you use _T() in one literal but not in another?
****

mess = "Would you like to load ";
mess += cSelect;
mess += " as top ?";

****
This would be a lot easier to write as
CString mess;
mess.Format(_T("Would you like to load \"%s\" as top ?"), cSelect);

Note that you do not need to declare the variable at the top; you do not
need to declare it until it is actually needed. Better still, put that
string in the STRINGTABLE and load it, so you can localize
****

int a = MessageBox (mess, "Query", MB_ICONINFORMATION|MB_YESNO);

****
int a = AfxMessageBox(mess, MB_ICONQUESTION | MB_YESNO);

It is NOT an information prompt, it is a question prompt. Use AfxMessageBox,
which follows recommended best practice for the caption (uses the program
name). Use white space around binary operators to make them legible
****

if (a==IDNO) return;

****
It would be safer to say
    if(a != IDYES)
        return;

This tests for the actual meaningful value; note the whitespace around the
operator; note that it uses two lines, which makes it easier to debug.
****

if(!in.Open(JobFile, CFile::modeRead)){
   DWORD err = ::GetLastError();
   CString msg;
   msg.Format(_T("Error opening file: %lu"), err);   // DWORD is an unsigned long
   AfxMessageBox(msg);
   return;
}
// GetLength() returns a ULONGLONG; SetSize() takes an INT_PTR
buffer.SetSize((INT_PTR)in.GetLength());
if((INT_PTR)in.Read(buffer.GetData(), (UINT)buffer.GetSize()) !=
buffer.GetSize()){
   DWORD err = ::GetLastError();
   CString msg;
   msg.Format(_T("Error reading file: %lu"), err);   // DWORD is an unsigned long
   AfxMessageBox(msg);
   return;
}

****
Are you searching for text or a binary pattern not expressible as text? If
this is text, and is known to be 8-bit characters, always, one solution is
    CStringA buffer;
    LPSTR p = buffer.GetBuffer(in.GetLength());
    if((INT_PTR)in.Read(p, in.GetLength()) != in.GetLength())
        ... as above

    buffer.ReleaseBuffer(in.GetLength());
    int n = buffer.Find("abc");
    if(n < 0)
        ...not found
    else
        ...found

If you need to find all instances of an 8-bit character string, you would
have a loop, and the second parameter of Find would give the starting offset
for the next search.
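
A minimal sketch of such a loop, written with std::string::find as a
stand-in for Find (its second parameter plays the same starting-offset
role). findAll is a hypothetical helper name. Note that std::string is
length-counted, so unlike Find this version does not stop at a NUL.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the offset of every occurrence of `needle` in `buffer`,
// including overlapping ones.
std::vector<std::size_t> findAll(const std::string& buffer,
                                 const std::string& needle)
{
    std::vector<std::size_t> hits;
    if (needle.empty())
        return hits;
    std::size_t pos = buffer.find(needle);
    while (pos != std::string::npos) {
        hits.push_back(pos);
        // Resume one byte past the previous hit; use pos + needle.size()
        // instead if overlapping matches should be skipped.
        pos = buffer.find(needle, pos + 1);
    }
    return hits;
}
```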

However, if your file is in UTF-8 encoding, you would have to use the UTF-8
representation of the string (the most efficient means) or convert the file
to a Unicode representation (not efficient for large files, especially if
the string ends up not being found). If your file is potentially Unicode,
life gets a good deal more complex, but I don't want to get into that here
right now.
****

in.Close();
}

Joseph M. Newcomer [MVP]
email: newcomer@flounder.com
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm