Converting EBCDIC to Unicode
Dear all,
I wrote a program to convert an EBCDIC text file from an OS/400
environment to Unicode (UTF-16) on Windows XP.
Because the text file contains shareholder information in Persian
(Farsi), I had to find the mapping table for the Persian characters.
As you may know, unlike English, some Persian characters have one
form, some have two forms, and some have more than two: there are
initial, medial and final forms.
I found the code points using Character Map (one of the system
programs in Windows XP).
I would really like to hear your opinions, both general and specific.
If someone has already worked on this subject, even for another
language (such as Arabic), his or her advice would help a great deal.
1. Because EBCDIC is an 8-bit encoding and UTF-16 uses 16-bit code
units (Unicode code points themselves span 21 bits), I use an
ifstream object (narrow characters) for the input file and a
wofstream object (wide characters) for the output file.
2. I use the int() conversion to get the numeric value of each
character. I use this convention: if the value is zero or positive,
it should be an English letter or a digit; in other words, it isn't
Persian. If it is negative, it is Persian and I use my mapping:
// mapping.h
#include <map>

struct Mapping {
    std::map<int, int> Map; // signed char value -> Unicode code point
    Mapping();
    void FillMap();
    // Note: std::map::operator[] inserts a 0 entry when k is missing.
    int operator[](const int k) { return Map[k]; }
};

// mapping.cpp
#include "mapping.h"

Mapping::Mapping()
{
    FillMap();
}

void Mapping::FillMap()
{
    // fill map
    Map[-14] = 0xFEF4;  // ARABIC LETTER YEH MEDIAL FORM
    Map[-111] = 0xFE8B; // ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
    Map[-122] = 0xFE81; // ARABIC LETTER ALEF WITH MADDA ABOVE ISOLATED FORM
    // other map entries
}
LineConvertor is a class that reads one line and converts it to
Unicode:
// line_convertor.h
wstring LineConvertor::Replace(const string& s)
{
    wstring ws;
    for (string::size_type i = 0; i < s.size(); i++) {
        wchar_t w = s[i];
        if (int(s[i]) >= 0) ws.push_back(w); // not Persian: copy as is
        else { // so it should be a Persian character in the EBCDIC character set
            if (CP[int(s[i])] != 0) { // if the character is in the lookup table
                ws.push_back(wchar_t(CP[int(s[i])]));
            }
            else {
                // there is no entry in the Mapping data structure.
                // throw exception
            }
        }
    }
    return ws;
}
Is this a good way to build the mapping for all Persian characters?
What is the reverse of int()? I mean a function like chr(int) that
returns the character corresponding to an integer.
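On the int()/chr() point, a cast is the usual way back from an
integer to a character. Whether plain char is signed is
implementation-defined, so the negative values above depend on the
compiler; going through unsigned char gives 0..255 everywhere. Here is
a minimal sketch under that assumption. The names byte_value, chr and
lookup are hypothetical and not part of my program:

```cpp
#include <cassert>
#include <map>

// Hypothetical helper: numeric value of a byte, always in 0..255,
// whether or not plain char is signed on this compiler.
int byte_value(char c) { return static_cast<unsigned char>(c); }

// Hypothetical helper: the reverse direction, from 0..255 back to a char.
char chr(int n)
{
    assert(n >= 0 && n <= 255);
    return static_cast<char>(static_cast<unsigned char>(n));
}

// Hypothetical lookup: returns 0 for a missing key and, unlike
// std::map::operator[], never inserts into the table.
int lookup(const std::map<int, int>& table, int key)
{
    std::map<int, int>::const_iterator it = table.find(key);
    return it == table.end() ? 0 : it->second;
}
```

With this convention the key -14 would become 242 (0xF2), -111 would
become 145 (0x91) and -122 would become 134 (0x86).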
3. I traced my program in the debugger and the conversion itself
works fine. My main problem is this: when I write the Persian
characters to the wofstream (the output file), the file is empty.
Nothing is written to it.
In the following code, FileConvertor is a class whose Convert member
function converts the whole file; for each line, the LineConvertor
member converts that line:
// file_convertor.h
class FileConvertor {
    std::ifstream In;   // original file
    std::wofstream Out; // a file containing the converted records (Unicode)
    LineConvertor LC;
    // ...
public:
    void Convert();
};

// file_convertor.cpp
void FileConvertor::Convert()
{
    for (string s; getline(In, s); ++RecCount) {
        try {
            std::vector<std::wstring> V = LC.Convert();
            for (std::vector<std::wstring>::size_type i = 0; i < V.size(); i++) {
                Out << V[i] << L'\t'; // <-- no character is written to file
            }
            Out << L'\n';
        }
        // ... (catch clauses not shown)
    }
}
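One possible explanation, which I have not confirmed: a wofstream
converts wide characters to bytes through the codecvt facet of its
locale, and with the default "C" locale that conversion typically
fails for characters such as U+FEF4, which leaves the stream in a
failed state. The sketch below sidesteps the facet by producing the
UTF-16LE bytes by hand for a stream opened with std::ios::binary. The
names to_utf16le and write_line are hypothetical, and the code assumes
every wchar_t holds a single BMP code unit (no surrogate pairs):

```cpp
#include <cassert>
#include <ostream>
#include <string>

// Hypothetical helper: encode a wide string as UTF-16LE bytes.
// Assumes each element is one BMP code unit; surrogates are not handled.
std::string to_utf16le(const std::wstring& ws)
{
    std::string bytes;
    for (std::wstring::size_type i = 0; i < ws.size(); ++i) {
        unsigned long unit = static_cast<unsigned long>(ws[i]);
        assert(unit <= 0xFFFF);
        bytes.push_back(static_cast<char>(unit & 0xFF));        // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF)); // then high byte
    }
    return bytes;
}

// Hypothetical helper: write one converted line, followed by a newline,
// to a narrow stream that was opened in binary mode.
void write_line(std::ostream& out, const std::wstring& line)
{
    const std::string bytes = to_utf16le(line + L'\n');
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
}
```

A file written this way would normally start with the byte order mark,
for example by writing to_utf16le(L"\xFEFF") once before the first
line.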
4. Should I consider std::locale and std::locale::facet when writing
such applications (file conversion)? I want to extend my program to
convert Unicode to EBCDIC, EBCDIC to XML, and so on; I mean a generic
converter. How could I apply policy-based class design here?
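To make the policy idea concrete, here is a rough sketch of the shape
I have in mind: a decoder policy turns each input byte into a code
point and an encoder policy turns each code point into output bytes.
All names are hypothetical, the identity decoder only stands in for a
real EBCDIC table, and the UTF-16 encoder handles BMP code points
only:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical decoder policy: one input byte -> one code point.
// The identity mapping is a placeholder for a real EBCDIC table.
struct IdentityDecoder {
    static unsigned long decode(unsigned char byte) { return byte; }
};

// Hypothetical encoder policy: code point -> UTF-16LE bytes (BMP only).
struct Utf16LeEncoder {
    static void encode(unsigned long cp, std::string& out)
    {
        assert(cp <= 0xFFFF);
        out.push_back(static_cast<char>(cp & 0xFF));
        out.push_back(static_cast<char>((cp >> 8) & 0xFF));
    }
};

// Hypothetical encoder policy: code point -> XML decimal character reference.
struct XmlRefEncoder {
    static void encode(unsigned long cp, std::string& out)
    {
        std::ostringstream os;
        os << "&#" << cp << ';';
        out += os.str();
    }
};

// The converter only combines the two policies; adding a new target
// format means writing a new encoder, not touching this class.
template <class Decoder, class Encoder>
struct Converter {
    static std::string convert(const std::string& in)
    {
        std::string out;
        for (std::string::size_type i = 0; i < in.size(); ++i)
            Encoder::encode(Decoder::decode(static_cast<unsigned char>(in[i])), out);
        return out;
    }
};
```

For example, Converter<IdentityDecoder, XmlRefEncoder>::convert("A")
would produce "&#65;" on an ASCII host.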
5. How can I write a general program that can be ported to Linux with
minimum effort? I need some general guidelines.
Please shed some light.
Regards,
-- Saeed Amrollahi