Re: Converting EBCDIC to Unicode

From:

James Kanze <james.kanze@gmail.com>

Newsgroups:

comp.lang.c++

Date:

Tue, 28 Sep 2010 09:33:04 -0700 (PDT)

Message-ID:

<73e1c1c2-b2ee-401a-b804-7fe5c6e6f974@p26g2000yqb.googlegroups.com>

On Sep 28, 8:27 am, Saeed Amrollahi <amrollahi.sa...@gmail.com> wrote:

I wrote a program to convert an EBCDIC text file from an OS/400
environment to Unicode (UTF-16) on Windows XP. Because the
text file contains information about shareholders in Persian
(Farsi), I had to find the mapping table for the Persian
characters. As you may know, unlike in English, in Persian some
characters have one form, some have two forms, and some
have more than two forms. I mean there are
initial, medial and final forms.

And isolated, no?

But that's usually a problem for the rendering machine, not for
your program.

I found them using Character Map (one of the system programs in
Windows XP). I would really like to hear your general and specific
opinions. If someone has already worked on this subject, even in
other languages (like Arabic), his or her advice may help
a lot.
1. Because EBCDIC is an 8-bit encoding and Unicode (UTF-16)
is a 16-bit (or more precisely 21-bit) encoding, I use an
ifstream object (character file) for the input file and a
wofstream object (wide character file) for the output file.

That's the way it was designed to work. (Actually, it was
designed so that you imbue a Persian EBCDIC locale in a wifstream
when reading. If you can find such.)

2. I use the int() function to know the ordinal number behind the
characters.

In C++, all you have *is* the ordinal number. What you probably
do have to do is convert the input char to unsigned char.
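A minimal sketch of that conversion. The helper name byteValue is made up for illustration:

```cpp
// Hypothetical helper: returns the value of a byte as 0..255, whether or not
// plain char is signed on the platform.
inline int byteValue(char c)
{
    return static_cast<unsigned char>(c);
}
```

Where plain char is signed, int(c) for the EBCDIC byte 0x81 typically yields -127, while byteValue(c) yields 129 on every platform.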

I use the convention:
If the returned number is positive, it should be an English
letter or a digit, in other words it isn't Persian, and if it
is negative, it is Persian.

You can't count on that. The type char may be signed or
unsigned. Convert to unsigned char, then compare to 128.

Except that that doesn't work at all for EBCDIC, where 'a' is
0x81, and the Persian characters are probably scattered about in
the unused spaces. Or it uses some sort of shift-in/shift-out
scheme with two different encodings. Or IBM has given up on
EBCDIC for non-Latin scripts, and is using ISO 8859-6 or MS
Windows CP-1256 (although I'm not sure that either of these has
the extra characters needed for Persian).

and I use my Mapping:
// mapping.h
struct Mapping {
                std::map<int, int> Map;

                Mapping();
                void FillMap();
                int operator[](const int k) { return Map[k]; }
};

// mapping.cpp
Mapping::Mapping()
{
        FillMap();
}

void Mapping::FillMap()
{
        // fill map
        Map[-14] = 0xFEF4; // ARABIC LETTER YEH MEDIAL FORM
        Map[-111] = 0xFE8B; // ARABIC LETTER YEH WITH HAMZA ABOVE INITIAL FORM
        Map[-122] = 0xFE81; // ARABIC LETTER ALEF WITH MADDA ABOVE
        // other map entries
}

Why do things the hard way?

I'd use something like:

    static wchar_t const map[] =
    {
        0x0000, 0x0001, 0x0002, 0x0003, // 0x00-0x03
        // ...
        0x0061, 0x0062, 0x0063, 0x0064, // 0x81-0x84
        // ...
    };

This should be indexed with the input char, converted to
unsigned char. (I'd also write some quick program to generate
this table from some table you already have at hand.)
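A minimal sketch of the table lookup, assuming a single-byte encoding with no shift states. The names kUnmapped, EbcdicTable and toUnicode are invented for the example, the table is filled at startup rather than written out as a 256-entry initializer, and only a handful of entries are set; a real table would come from the code page definition.

```cpp
#include <cstddef>
#include <string>

namespace {

const wchar_t kUnmapped = 0xFFFD;   // U+FFFD REPLACEMENT CHARACTER marks gaps

// 256-entry table indexed by the EBCDIC byte value (hypothetical type).
struct EbcdicTable
{
    wchar_t map[256];

    EbcdicTable()
    {
        for (std::size_t i = 0; i < 256; ++i) {
            map[i] = kUnmapped;
        }
        map[0x40] = 0x0020;     // space
        map[0x81] = 0x0061;     // 'a'
        map[0x82] = 0x0062;     // 'b'
        map[0x83] = 0x0063;     // 'c'
        // Persian entry copied from the poster's table (his key -122 is
        // byte 0x86); not checked against any IBM code page.
        map[0x86] = 0xFE81;
    }
};

}   // namespace

// Converts a string of EBCDIC bytes to wide characters, one table lookup per byte.
std::wstring toUnicode(const std::string& in)
{
    static const EbcdicTable table;
    std::wstring out;
    out.reserve(in.size());
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        // The cast keeps the index in 0..255 even when char is signed.
        out.push_back(table.map[static_cast<unsigned char>(in[i])]);
    }
    return out;
}
```

Unmapped bytes come back as U+FFFD here; throwing an exception instead, as the original code intends, is an equally reasonable choice.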

LineConvertor is a class that reads one line and converts it to the
Unicode standard:

//line_convertor.h
wstring LineConvertor::Replace(const string& s)
{
        wstring ws;
        for (string::size_type i = 0; i < s.size(); i++) {

                wchar_t w = s[i];
                if (int(s[i]) >= 0) ws.push_back(w);
                else { // so it should be a Persian character in the EBCDIC character set
                        if (CP[int(s[i])] != 0) { // if the character is in lookup table
                                ws.push_back(wchar_t(CP[int(s[i])]));

                        }
                        else {
                              // there is no entry in the Mapping data structure.
                              // throw exception
                        }
                }
        }
        return ws;
}

A better solution would be to create a codecvt facet, and use it
directly in the istream.
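A rough sketch of what such a facet could look like, assuming C++11 or later, a single-byte input encoding, and a read-only use. The class name EbcdicCodecvt, the helper decodeWithFacet and the few table entries are made up for the example; a real facet needs the full 256-entry table.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Sketch of a read-only facet: each external byte becomes one wchar_t.
class EbcdicCodecvt : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit EbcdicCodecvt(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs)
    {
        for (int i = 0; i < 256; ++i) {
            table_[i] = 0xFFFD;         // unmapped bytes become U+FFFD
        }
        table_[0x40] = L' ';            // illustrative entries only
        table_[0x81] = L'a';
        table_[0x82] = L'b';
        table_[0x83] = L'c';
    }

protected:
    result do_in(state_type&,
                 const extern_type* from, const extern_type* fromEnd,
                 const extern_type*& fromNext,
                 intern_type* to, intern_type* toEnd,
                 intern_type*& toNext) const override
    {
        fromNext = from;
        toNext = to;
        while (fromNext != fromEnd && toNext != toEnd) {
            *toNext++ = table_[static_cast<unsigned char>(*fromNext++)];
        }
        return fromNext == fromEnd ? ok : partial;
    }

    result do_out(state_type&,
                  const intern_type* from, const intern_type*,
                  const intern_type*& fromNext,
                  extern_type* to, extern_type*,
                  extern_type*& toNext) const override
    {
        fromNext = from;                // writing is not supported in this sketch
        toNext = to;
        return error;
    }

    result do_unshift(state_type&, extern_type* to, extern_type*,
                      extern_type*& toNext) const override
    {
        toNext = to;                    // stateless encoding: nothing to emit
        return noconv;
    }

    int do_encoding() const noexcept override { return 1; }    // one byte per character
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 1; }

    int do_length(state_type&, const extern_type* from,
                  const extern_type* fromEnd, std::size_t max) const override
    {
        const std::size_t avail = static_cast<std::size_t>(fromEnd - from);
        return static_cast<int>(avail < max ? avail : max);
    }

private:
    wchar_t table_[256];
};

// Runs a byte string through the facet, much as a wifstream's filebuf would.
std::wstring decodeWithFacet(const std::string& bytes)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const std::locale loc(std::locale::classic(), new EbcdicCodecvt);
    const Facet& cvt = std::use_facet<Facet>(loc);

    std::wstring out(bytes.size(), L'\0');
    if (bytes.empty()) {
        return out;
    }
    std::mbstate_t state = std::mbstate_t();
    const char* fromNext = 0;
    wchar_t* toNext = 0;
    cvt.in(state, bytes.data(), bytes.data() + bytes.size(), fromNext,
           &out[0], &out[0] + out.size(), toNext);
    out.resize(static_cast<std::size_t>(toNext - &out[0]));
    return out;
}
```

With a file, the usual pattern is to imbue the stream before opening it: construct a std::wifstream, call imbue(std::locale(std::locale::classic(), new EbcdicCodecvt)), then open the file in binary mode. The locale takes ownership of the facet.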

Is this a good way to find the mapping for all Persian characters?

What is the reverse function of int()? I mean a function
chr(int) that returns the corresponding character of an
integer?

There is no "function" int(). Using int() this way is the same
as a static_cast<int>.
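To make that concrete, here is a small sketch. The names sameConversion and chr are hypothetical; chr is simply the cast in the other direction, and it assumes the ordinal is in 0..255.

```cpp
// int(c) is a functional-style cast; it performs the same conversion as
// static_cast<int>(c).
inline bool sameConversion(char c)
{
    return int(c) == static_cast<int>(c);
}

// Hypothetical "chr": turns an ordinal in 0..255 back into a char.
inline char chr(int ordinal)
{
    return static_cast<char>(static_cast<unsigned char>(ordinal));
}
```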

3. I traced my program using a debugger, and I see that my program
works fine. My main problem is: when I write the Persian
characters to the wostream file (output file), the file is empty.
There is nothing in the output file:

That sounds like a completely different problem. Without
complete, compilable code, and information concerning the
system you've compiled and run on, it's impossible to say. One
possible explanation, however, is that the locale imbued in the
output stream doesn't understand the Persian characters. The
first character which cannot be correctly transcoded will result
in an error (bad() returning true on the wostream).

Note that even a wofstream only writes bytes (char's). The
trick here is to imbue it with a locale which converts each
wchar_t into two bytes.
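One way to get such a locale is a small write-only facet. The sketch below assumes C++11 or later and little-endian UTF-16 output without a byte order mark; the names Utf16LeOut and encodeWithFacet are invented for the example. Where wchar_t is 32 bits, anything above U+FFFF is reported as an error rather than split into a surrogate pair.

```cpp
#include <cstddef>
#include <cwchar>
#include <locale>
#include <string>

// Sketch of a write-only facet: each wchar_t up to 0xFFFF becomes two bytes,
// low byte first (UTF-16LE). No byte order mark is written.
class Utf16LeOut : public std::codecvt<wchar_t, char, std::mbstate_t>
{
public:
    explicit Utf16LeOut(std::size_t refs = 0)
        : std::codecvt<wchar_t, char, std::mbstate_t>(refs) {}

protected:
    result do_out(state_type&,
                  const intern_type* from, const intern_type* fromEnd,
                  const intern_type*& fromNext,
                  extern_type* to, extern_type* toEnd,
                  extern_type*& toNext) const override
    {
        fromNext = from;
        toNext = to;
        while (fromNext != fromEnd) {
            const unsigned long c = static_cast<unsigned long>(*fromNext);
            if (c > 0xFFFF) {
                return error;           // would need a surrogate pair
            }
            if (toEnd - toNext < 2) {
                return partial;         // output buffer is full
            }
            *toNext++ = static_cast<char>(static_cast<unsigned char>(c & 0xFF));
            *toNext++ = static_cast<char>(static_cast<unsigned char>(c >> 8));
            ++fromNext;
        }
        return ok;
    }

    result do_in(state_type&,
                 const extern_type* from, const extern_type*,
                 const extern_type*& fromNext,
                 intern_type* to, intern_type*,
                 intern_type*& toNext) const override
    {
        fromNext = from;                // reading is not supported in this sketch
        toNext = to;
        return error;
    }

    result do_unshift(state_type&, extern_type* to, extern_type*,
                      extern_type*& toNext) const override
    {
        toNext = to;                    // stateless encoding: nothing to emit
        return noconv;
    }

    int do_encoding() const noexcept override { return 2; }    // two bytes per character
    bool do_always_noconv() const noexcept override { return false; }
    int do_max_length() const noexcept override { return 2; }

    int do_length(state_type&, const extern_type* from,
                  const extern_type* fromEnd, std::size_t max) const override
    {
        const std::size_t pairs = static_cast<std::size_t>(fromEnd - from) / 2;
        return static_cast<int>((pairs < max ? pairs : max) * 2);
    }
};

// Runs a wide string through the facet, much as a wofstream's filebuf would.
std::string encodeWithFacet(const std::wstring& text)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> Facet;
    const std::locale loc(std::locale::classic(), new Utf16LeOut);
    const Facet& cvt = std::use_facet<Facet>(loc);

    std::string out(text.size() * 2, '\0');
    if (text.empty()) {
        return out;
    }
    std::mbstate_t state = std::mbstate_t();
    const wchar_t* fromNext = 0;
    char* toNext = 0;
    cvt.out(state, text.data(), text.data() + text.size(), fromNext,
            &out[0], &out[0] + out.size(), toNext);
    out.resize(static_cast<std::size_t>(toNext - &out[0]));
    return out;
}
```

For a file, imbue the std::wofstream with std::locale(std::locale::classic(), new Utf16LeOut) before opening it in binary mode. The conversion only runs when the buffer is flushed, so check the stream state after a flush, not just after the << operations.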

In the following code, FileConvertor is a class with a Convert
member function that converts the whole file. For each line, the
member LineConvertor converts a line:
// file_convertor.h
class FileConvertor {
        std::ifstream In; // original file
        std::wofstream Out; // a file containing the converted records (Unicode)
        LineConvertor LC;
        // ...
public:
       void Convert();
};

// file_convertor.cpp
void FileConvertor::Convert()
{
        for (string s; getline(In, s); ++RecCount) {
                try {
                        std::vector<std::wstring> V = LC.Convert();
                        for (std::vector<std::wstring>::size_type i = 0; i < V.size(); i++) {
                                Out << V[i] << L'\t'; // <-- no character is written to file

                        }
                        Out << L'\n';
}

4. I don't know. Should I consider std::locale and
std::locale::facet when programming such applications (file conversion)?

You don't have a choice. If nothing else, you can use only
single byte streams, opened in binary mode, and imbued with the
"C" locale---these are transparent: the bytes you read are what
is on the disk, and the bytes you write are the bytes that end
up on the disk. In all other cases, the locale imbued in the
stream will get involved, or some other code translation will
take place in the stream.
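A minimal sketch of such transparent streams. The function names readBytes and writeBytes are made up for the example:

```cpp
#include <fstream>
#include <ios>
#include <iterator>
#include <locale>
#include <string>

// Reads a whole file as raw bytes: binary mode, "C" locale, no translation.
// Returns an empty string if the file cannot be opened.
std::string readBytes(const char* path)
{
    std::ifstream in;
    in.imbue(std::locale::classic());
    in.open(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Writes raw bytes to a file; returns false if opening, writing or closing failed.
bool writeBytes(const char* path, const std::string& bytes)
{
    std::ofstream out;
    out.imbue(std::locale::classic());
    out.open(path, std::ios::out | std::ios::binary | std::ios::trunc);
    out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    out.close();
    return !out.fail();
}
```

With streams like these, all of the EBCDIC to UTF-16 work happens in your own code between the read and the write, and the stream never touches the bytes.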

I want to extend my program to convert Unicode to EBCDIC,
EBCDIC to XML, ... I mean a generic converter.

You mean iconv. It already exists.

How do I apply policy class design?

Generally, I've heard policy used to refer to some sort of
template metaprogramming technique. Perhaps you mean the
strategy pattern.

5. How do I write a general program that takes minimum effort to
port to a Linux environment?

Well, if portability is a concern, avoid any locale but "C", and
avoid wchar_t.
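As a sketch of what that could look like in practice: keep code points in a plain integer type and produce the UTF-16 bytes yourself, so that neither wchar_t nor any locale is involved. The function names appendUnit and encodeUtf16Le are invented for the example, and little-endian output without a byte order mark is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Appends one 16-bit code unit, low byte first.
inline void appendUnit(std::string& out, unsigned long unit)
{
    out.push_back(static_cast<char>(static_cast<unsigned char>(unit & 0xFF)));
    out.push_back(static_cast<char>(static_cast<unsigned char>((unit >> 8) & 0xFF)));
}

// Encodes Unicode code points as UTF-16LE bytes. unsigned long is at least
// 32 bits on every conforming implementation, unlike wchar_t.
std::string encodeUtf16Le(const std::vector<unsigned long>& codePoints)
{
    std::string out;
    out.reserve(codePoints.size() * 2);
    for (std::size_t i = 0; i < codePoints.size(); ++i) {
        unsigned long cp = codePoints[i];
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            cp = 0xFFFD;                // not a valid scalar value: substitute U+FFFD
        }
        if (cp >= 0x10000) {
            cp -= 0x10000;              // split into a surrogate pair
            appendUnit(out, 0xD800 + (cp >> 10));
            appendUnit(out, 0xDC00 + (cp & 0x3FF));
        } else {
            appendUnit(out, cp);
        }
    }
    return out;
}
```

The resulting std::string can be written to a std::ofstream opened in binary mode, which behaves the same way under Windows and Linux.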

--
James Kanze