Re: File Processing
On Sep 30, 9:35 pm, Victor Bazarov <v.Abaza...@comAcast.net> wrote:
I want to read and process and rewrite a very large disk based file
(>3Gbytes) as quickly as possible.
The processing effectively involves finding certain strings and replacing
them with other strings of equal length such that the file size is unaltered
(the file is uncompressed btw). I wondered if anyone could advise me of the
best way to do this and also of things to avoid. More specifically I was wondering:
-Is it best to open a single file for read-write access and overwrite the
changed bytes or would it be better to create a new file?
It is always a good idea to leave the old file intact, unless you
somehow can ensure that a single write operation will never fail and
that an incomplete set of find/replace operations is still OK. Ask in
any database development newsgroup.
This is generally true, but he said a "very large" file. I'd
have some hesitations about making a copy if the file size were,
say, 100 Gigabytes.
As always, you have to weigh the trade-offs. Making a copy is
certainly the safer solution, if you can afford the disk space.
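To make the copy variant concrete, here is a minimal sketch in standard C++, not a tuned implementation. It streams the source into a new file in large chunks, replaces equal-length strings on the way, and only then renames the new file over the old one. The function names, the ".tmp" suffix and the 1 MB default chunk size are my own assumptions for illustration. The only subtle part is carrying the last from.size()-1 bytes over to the next chunk, so that a match split across a chunk boundary is still found.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: copies `src` to `dst`, replacing each occurrence of
// `from` with the equal-length `to`. Returns false on an I/O error.
bool copy_with_replace(const std::string& src, const std::string& dst,
                       const std::string& from, const std::string& to,
                       std::size_t chunk = 1 << 20)
{
    assert(!from.empty() && from.size() == to.size());
    std::ifstream in(src, std::ios::binary);
    std::ofstream out(dst, std::ios::binary | std::ios::trunc);
    if (!in || !out) return false;

    const std::size_t keep = from.size() - 1;  // bytes that could begin a split match
    std::vector<char> buf(keep + std::max<std::size_t>(chunk, 1));
    std::size_t filled = 0;  // valid bytes currently in buf
    std::size_t pos = 0;     // offset in buf where the next search starts

    for (;;) {
        in.read(buf.data() + filled,
                static_cast<std::streamsize>(buf.size() - filled));
        filled += static_cast<std::size_t>(in.gcount());
        const bool last = !in;  // a short read means end of file (or an error)

        char* begin = buf.data();
        char* end = begin + filled;
        char* scan = begin + pos;
        for (;;) {
            char* hit = std::search(scan, end, from.begin(), from.end());
            if (hit == end) break;
            std::copy(to.begin(), to.end(), hit);
            scan = hit + from.size();  // never rescan replaced bytes
        }

        // Hold back the tail unless this is the final chunk.
        const std::size_t tail = last ? 0 : keep;
        const std::size_t flush = filled - tail;
        out.write(begin, static_cast<std::streamsize>(flush));
        if (!out) return false;
        if (last) break;

        std::copy(begin + flush, end, begin);  // move the tail to the front
        const std::size_t scanned = static_cast<std::size_t>(scan - begin);
        pos = std::max(scanned, flush) - flush;
        filled = tail;
    }
    out.close();
    return !in.bad() && !out.fail();
}

// Hypothetical wrapper: the original is only replaced once the copy is complete.
// std::rename over an existing file works on POSIX; other systems may refuse.
bool replace_via_copy(const std::string& path,
                      const std::string& from, const std::string& to)
{
    const std::string tmp = path + ".tmp";  // assumed temporary name
    if (!copy_with_replace(path, tmp, from, to)) {
        std::remove(tmp.c_str());
        return false;
    }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}
```

If anything fails before the rename, the original file is untouched, which is the whole point of paying for the second copy. The chunk size is the knob to experiment with.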
-Is there any point in buffering bytes in rather than
reading one byte at a time or does this just defeat the
buffering that's done by the OS anyway?
You'd have to experiment. C++ language does not define any
buffering AFA OS is concerned.
C++ does define buffering in iostreams. But the fastest
solution will almost certainly involve platform specific
requests. I'd probably start by using mmap on a Unix system, or
CreateFileMapping/MapViewOfFile under Windows. If performance
is really an issue, he'll probably have to experiment with
different solutions, but I'd be surprised if anything was
significantly faster than using a memory mapped file, modified in place.
But of course, as you pointed out above, this solution doesn't
provide transactional integrity. And it only works if the
process has enough available address space to map the file.
(Probably no problem on a 64 bit processor, but likely not the
case on a 32 bit one.)
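For the Unix case, a rough sketch of what I mean, using only open, fstat, mmap, msync and munmap. The function name and the error convention (-1 on failure) are made up for the example. It maps the whole file in one go, so it assumes a 64 bit process for a file of this size, and as said above it gives no protection if the process dies halfway through.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: replaces every occurrence of `from` with the
// equal-length `to` directly in the file at `path`.
// Returns the number of replacements, or -1 on error.
long replace_in_place_mmap(const char* path,
                           const std::string& from, const std::string& to)
{
    assert(!from.empty() && from.size() == to.size());

    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap rejects a zero length

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    // MAP_SHARED so that stores into the mapping reach the file itself.
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (p == MAP_FAILED) return -1;

    char* begin = static_cast<char*>(p);
    char* end = begin + size;
    long count = 0;
    for (char* hit = std::search(begin, end, from.begin(), from.end());
         hit != end;
         hit = std::search(hit + from.size(), end, from.begin(), from.end())) {
        std::copy(to.begin(), to.end(), hit);  // overwrite in the mapped pages
        ++count;
    }

    // Ask the system to write the dirty pages back before unmapping.
    const int rc = msync(begin, size, MS_SYNC);
    munmap(begin, size);
    return rc == 0 ? count : -1;
}
```

There is no explicit buffering and no read or write call in the loop; the system pages the file in and out as the search walks through it. On a 32 bit system one would have to map the file in windows and handle matches that straddle two windows, which this sketch does not attempt.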
-Would this benefit from multi-threading - read, process, write?
Unlikely. Processing will take so little time compared to the
I/O, and I/O is going to be the bottleneck anyway, so...
If he uses memory mapping, the system will take care of all of
the IO behind his back anyway. Otherwise, some sort of
asynchronous I/O can sometimes improve performance.
James Kanze (GABI Software) email:email@example.com
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34