Re: Is a binary search (divide and conquer) efficient enough for searching through 4.2 million recs?
In article
<d251ee28-4ea3-4439-890b-8be133917e9f@b34g2000pra.googlegroups.com>,
postmaster@postales-apasionadas.com wrote:
> I am hoping to get different opinions and advice on a problem that I
> have.
> I have a CSV file with approx. 4,200,000 lines which total over
> 135MB. The data is strictly used for looking up data; it is not going
> to be modified.
> The data will be sorted based on the first 2 columns. Here is an
> example of three lines:
> "3182420352","3182420479","22814"
> "3182420480","3182420735","153358"
> "3182420736","3182421503","48085"
Given that memory and speed are concerns, I noticed that if you are
keeping the data sorted, the begin_range column is redundant: for any
given row r, the range is [(r-1).end_range + 1, r.end_range].
Therefore, you don't need to store begin_range at all, which cuts the
in-memory table from three columns to two.
I would keep the data sorted in a vector:
struct Row
{
    unsigned long end_range;
    unsigned long row_id;

    friend bool operator<(Row const& lhs, Row const& rhs)
    { return lhs.end_range < rhs.end_range; }
};

std::vector<Row> data;
std::sort(data.begin(), data.end()); // if input not presorted
and use an algorithm like std::lower_bound to get the row_id.
You might still have to massage your data a little to match the range
concepts that the binary search algorithms are expecting, but it
shouldn't be much more work than this.
--
Nevin ":-)" Liber <mailto:nevin@eviloverlord.com> 773 961-1620
[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]