Re: Efficient Integer Comparison Algorithm
On 05.03.2008 16:49, JThangiah wrote:
Hi Robert,
Thank you so much for your detailed reply. I am so glad I found
some help.
I have the answers for your questions.
1. Nature of data:
I am doing this for linking two Bio-Informatics databases. One is my
local system which uses Oracle. The other database is external and I
only get the data in a file format.
If you want, you can look at the file here:
ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz
It is nearly 430MB and following are the fields of interest to me:
tax_id Integer
GeneID Integer
start_position_on_the_genomic_accession Integer
end_position_on_the_genomic_accession Integer
The corresponding data in my local database has the following fields:
Taxon Integer
EXTFEATUREID String
LEFTEND Integer
RIGHTEND Integer
If the leftend and rightend here is the same as the start_position and
end_position in the external database, then I know that both represent
the same Gene.
These are just positive numbers and only one pair of co-ordinates.
PreCondition:
For a particular genome(taxon), if I sort both set of records in
ascending order of start position, the indexes in both should be
identical.
I am not sure what you mean by "indexes". I certainly cannot see the
termin "index" mentioned above.
So, I cannot sort the entire DB that way. It should only be sorted for
one genome at a time.
I don't understand. You said, you wanted to match records based on
identical value pairs (start, end). You can (in a relational database
and also in Java code) easily sort by two criteria.
2. Database products:
Do you think there is a way to link my Oracle DB and the file input
directly to match the records?
Oracle can work with files, this is called "external tables". Please
have a look at the docs at http://tahiti.oracle.com e.g.
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref777
That's probably what I'd do if the format of your data can be easily
made available via Oracle. Note though that the file needs to be on the
server IIRC.
Since I have a key and set of values for each entry in both DB's,
should I use a Map instead of a List? By using a TreeMap, I can keep
the data sorted on addition and so I'm just left with the mapping of
coordinates in the corresponding indexes.
Since you only want to compare according to equality of your two values
a HashMap is more efficient.
Will this be faster or is there a better way to do it like using a
SortedArrayList?
Don't even think about using lists for this. That's too slow. You
probably should get yourself a book about data structures and algorithms
to get a solid foundation.
By "pair type", do you mean an object with two ints and a String value
as members?
.... and with properly implemented equals() and hashCode() methods, exactly.
Is HashSet faster than any other datastructure for this purpose? Do
you think this will be the best one for this?
Yes.
3. Volume of data:
It is not too large as I exaggerated and is only around 3,48000
records in my local DB. The external DB has much more data than this
Did you mean 348,000 records? That's easily handled in memory.
which I do not need to deal with since I am only looking for a match
on all records in my local DB.
Then it's easy: create your class (see above), read in your local DB,
store it in a HashSet or HashMap, then query the DB and for each row
create a temporary instance of the class and look it up in the map / set.
Alternative: query both databases ordered by both coordinates and
compare records individually. You can then decide which ResultSet you
need to step forward based on the current comparison.
Sorry, I could not understand your alternate solution. Is this
comparing done by first sorting by start positions and then comparing
corresponding index values?
I don't understand why you bring up these indexes. What indexes?
No, select from the DB with an ORDER BY with two columns. Same for the
local store. Then iterate through the ResultSet and your local data and
compare. If the first ordered column in the local store data is less
than in the DB data, increase it until they are same or greater. If
same do the same for the second column until a match is found or the end
of one of the two sources is reached.
Yes, certainly. You could, for example, partition your data set into
smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
number of threads which each processes one chunk. You only need to make
sure that you only have one thread that will generate the output XML.
I am not used to Concurrent programming till now. Do you know of any
sample code that I can refer?
http://www.amazon.com/Concurrent-Programming-Java-Principles-Patterns/dp/0201695812
http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601
But it's probably not worthwhile in this case.
Cheers
robert