Wednesday, November 29, 2006

Conversion from ANSI to OEM

Why do I need to address this issue? I will actually answer you with a question: do you never used SQL Server?

Yes, I mean Microsoft SQL Server: wonderful DBMS, isn't it? Well... joking apart (!), if you ever used SQL Server you noticed that while ISQL (the interactive SQL client) uses a Windows charset (i.e. ANSI), OSQL (the DOS counterpart) uses a DOS charset (i.e. OEM). This fact is not irrelevant in some circumstances.

Normally, developers test SQL statements interactively through ISQL, then, as statements are ok, they save their work in .sql files. As these SQL statements are released, it's common to execute them through a batch script, so using OSQL. The switch from ISQL to OSQL causes these statements to introduce in database values different than the expected ones.

I'm italian and in my language accents are important. So, obviously, "città" (i.e. town) is different form "citt..." (a nonsense in italian). Why am I saying this? Because if you saved in ISQL a script like the following: INSERT INTO TABLE_1 VALUES ("città") and then you executed it with OSQL, SQL Server will have recorded the value "citt..." instead of "città". ...Welcome to SQL Server's idiosyncrasies.

The fact is accents are coded differently between ANSI charset and OEM charset. Here comes the bad surprise.

Probably Microsoft didn't pay enough attention to this issue: after all we italians are a little minority in the computer realm and accents are not so common in SQL scripts. But for some of us italians (and for me in particular) this point is really important. Therefore, I took the decision to develop a little Java utility that converts text files from ANSI charset to OEM charset and vice versa.

Doing this task in Java is pretty easy. In fact, if you want to read a file using ANSI charset, you simply need to define an InputStreamReader object with code page 1252, while if you want to read a file with OEM charset the code page must be 850. Similarly, for writing a file using ANSI charset, an OutputStreamReader object with code page 1252 must be instantiated.

And so, after spending a pleasant evening with NetBeans, I came out with CharsetConverter! Now, let me introduce it.

CharsetConverter is an open-source command-line utility that accepts these parameters: <conversion type> <file>|<directory of files>

  • <conversion type> can be: ansi_to_oem or oem_to_ansi

  • <file> is the fully qualified pathname of a file to convert

  • <directory of files> can be either a directory name or a directory name followed by a file extension, thus meaning all the files with that extension inside the given directory.
For instance:
  • CharsetConverter ansi_to_oem c:\mssql\scripts\mydml.sql
    calls CharsetConverter for translating from ANSI to OEM the file called mydml.sql inside directory C:\mssql\scripts
  • CharsetConverter oem_to_ansi c:\mssql\scripts sql
    invokes CharsetConverter for translating from OEM to ANSI all files with extension sql inside directory C:\mssql\scripts.
Here you can download it.