Re: get some chars from a .txt file

From:

"James Kanze" <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++.moderated

Date:

17 Jan 2007 11:40:50 -0500

Message-ID:

<1169027836.050953.162390@11g2000cwr.googlegroups.com>

G wrote:

James Kanze wrote:

G wrote:

One first question: I assume that it is a text file which
corresponds to the conventions for text files under the system
you are using. (This is far from obvious, and I regularly have
to deal with Windows text files under Unix and vice versa.)
Also, that there really are newlines where the line breaks occur
above. (Also far from obvious: if I were designing the file
format, I don't think I'd put a single line break in the gene
sequence.)

Well, this text file must be the output of the gene-analysis program.
If there are no line breaks, the sequence may be hard to read :-)

Even with line breaks, it looks hard to read to me:-). Do
biologists actually read it, or is it only exploited by other
programs?

Also, the length of one line in a text file is limited.

Not on modern systems (unless you really want to read them with
an editor, maybe).

I need to get 7 characters before the 28271st site and 8 after it such
as 28988 and 34586.

"Site" the gene site, or site the position in the file? (Not
that this changes anything really.)

site the gene site, of course :-)

Just for the record, I think if I were doing this, I'd use
getline to extract the first line, then istringstream to parse
it (or boost::regex to parse it, and istringstream only for the
conversions). This leaves the file correctly positioned for the
gene sequence.

Sorry, but I haven't learned the STL and Boost yet. I will soon, after I
finish TC++PL. :-)
But there seems to be a lot to learn. :-(

I think maybe you're going at it wrong. You should pick up a
few basics of the STL very quickly, even before learning all of
the subtleties of the language. Things like std::vector, and
basic use of iterators. And std::string, getline() and the
general IO subsystem. (Any useful program will have to do IO,
even if it doesn't need some of the more exotic operators or
name lookup rules.)

Boost is a bit trickier, because you have to install it
separately from your compiler, but it's probably worth it.
There again, you don't try to learn everything at once. Regular
expressions are a very powerful tool, and boost::regex is very
easy to use for the simple cases, which is all you need here.
(On the other hand, of course, you do need to learn regular
expressions. Coming from a Unix background, where just about
every tool you use supports them, I find it hard to imagine
doing anything without them, but I continually hear rumors that
people are actually able to work on Windows based machines
without installing a Unix toolkit :-).)

At any rate, I would suggest that your code start with:

    // Needs <fstream> and <string>.
    std::ifstream source( filename ) ;
    if ( ! source ) {
        // Handle error: the file could not be opened...
    }
    std::string header ;
    if ( ! std::getline( source, header ) ) {
        // Handle error, file was empty...
    }
    // Parse header...

As I said, I rather like regular expressions for this sort of
parsing, and Boost has just the tool for it. If you're not
familiar with regular expressions, however, and don't want to
learn them just yet, then something like:

    std::istringstream headerStream( header ) ;

and read from it.
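
For the istringstream variant, a minimal sketch might look like the
following. The header layout is not given in this thread, so the sketch
assumes a purely hypothetical one: a name followed by a sequence length,
separated by white space. Header and parseHeader are illustrative names,
not part of any library.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical header layout (an assumption): "<name> <length>".
struct Header
{
    std::string name ;
    long        length ;
} ;

// Returns true and fills result only if the whole line matches the
// assumed layout; result is left untouched on failure.
bool
parseHeader( std::string const& line, Header& result )
{
    // std::getline has already removed the line terminator.
    assert( line.find( '\n' ) == std::string::npos ) ;

    std::istringstream headerStream( line ) ;
    Header tmp ;
    if ( ! ( headerStream >> tmp.name >> tmp.length ) ) {
        return false ;          // Missing or non-numeric field.
    }
    std::string extra ;
    if ( headerStream >> extra ) {
        return false ;          // Unexpected trailing text.
    }
    result = tmp ;
    return true ;
}
```

Parsing into a temporary first means a malformed header never leaves
the caller with a half-filled structure.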

The advantage of this technique is that whatever you do in
parsing the header, source is now correctly positioned for
reading the rest of the data.

If you have some good advice, I hope you will send me a mail:
gpfei9@gmail.com.
Others who are willing to help me, I'm eager for your mails too.
I'm glad to communicate with you all ! :-D

But this method seems to be inefficient.

Is it causing you performance problems? If not, it's by far the
best solution.

Yes, but not on my PC. It looks inefficient on my classmate's.

What does it mean: "looks inefficient"? You're dealing with
real data, on a real machine. You have real response times.
They are either acceptable, or not. If they're acceptable, then
the code is fine. If they're not, you have to do something
about it.

But I use Dev-C++ 4.9.9.5, and he uses VC++ 6.0.
Maybe that's the problem?

Or he's using a ten year old PC with a 100MHz clock, and you've
got a recent one with a 3 GHz clock. Or maybe he's feeding it a
much bigger data file.

As I said, my own solution would probably be to read the entire
sequence into an std::vector< char >, and then use that.
Something like:

    std::vector< char > data ;
    std::string line ;
    while ( std::getline( source, line ) ) {
        data.insert( data.end(), line.begin(), line.end() ) ;
    }
    if ( ! source.eof() ) {
        // Something went wrong while reading...
    }

If I could make any sort of reasonable guess as to the length
(or a reasonable maximum length), I would probably do
    data.reserve( maxLength ) ;
before the loop.
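
Once the sequence is in the vector, extracting the characters around a
site is just index arithmetic. The following is a minimal sketch, under
two assumptions that the thread does not confirm: sites are numbered
from 1, and the character at the site itself is part of the result.
siteWindow is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the `before` characters preceding the 1-based position
// `site`, the character at `site`, and the `after` characters
// following it. The window is clipped at both ends of the sequence.
std::string
siteWindow( std::vector< char > const& data,
            std::size_t                site,
            std::size_t                before,
            std::size_t                after )
{
    assert( site >= 1 && site <= data.size() ) ;

    std::size_t index = site - 1 ;          // 0-based index of the site
    std::size_t first = index >= before
        ? index - before
        : 0 ;
    std::size_t last  = data.size() - index > after
        ? index + after + 1                 // one past the last character
        : data.size() ;
    return std::string( data.begin() + first, data.begin() + last ) ;
}
```

With the data read as above, siteWindow( data, 28271, 7, 8 ) would give
the 16 characters around site 28271, provided the sequence is at least
that long.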

I rather suspect that this is also the fastest *portable*
solution. Memory mapping the file might be faster, but is not
portable, and it would be a lot more complex, because you'd have
to calculate the line breaks---one byte under Unix, two under
Windows---when mapping the sequence position to the file
position. (If you're doing a lot of random positioning, this
could end up costing more time than the time necessary for the
extra copying, above.)
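
To give an idea of the calculation involved, here is a minimal sketch.
It rests on an assumption the thread does not state: every line of the
sequence except possibly the last holds exactly the same number of
characters. fileOffset and its parameters are illustrative names.

```cpp
#include <cassert>
#include <cstddef>

// Maps a 0-based index in the sequence to a byte offset in the file.
// Assumes fixed-length sequence lines (an assumption, see above).
std::size_t
fileOffset( std::size_t sequenceIndex,
            std::size_t headerBytes,    // bytes before the first sequence
                                        // character, line terminator included
            std::size_t lineLength,     // sequence characters per line
            std::size_t newlineBytes )  // 1 under Unix, 2 under Windows
{
    assert( lineLength > 0 ) ;

    // Each complete line before the target adds one line terminator.
    std::size_t fullLines = sequenceIndex / lineLength ;
    return headerBytes + sequenceIndex + fullLines * newlineBytes ;
}
```

If the line lengths vary, no such formula exists, and you would have to
scan the file for the line breaks instead.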

It's also possible to save a copy by reading directly into the
std::vector. The code to do so, however, is considerably more
difficult to get right, and unless your sequences have lengths
measuring in the tens of millions of characters, or more, you
probably won't notice any difference in response time.

I'm not good at expressing myself. If some of the words that I used are not
correct, please forgive me!

If English is not your native language, you're forgiven.
Otherwise... being able to express yourself precisely is an
absolute prerequisite to good programming.

--
James Kanze (GABI Software) email:james.kanze@gmail.com
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
