World Wide Characters 
It is a big old world, full of many varied characters.

There are familiar friends like A, B and C. There are also Chinese characters and Cyrillic characters. We might even want to use Klingon characters!

There are over 65536 different characters that a computer might have to handle. Every character has been assigned its own number, called a ‘Unicode code point’. You can see them all here: http://www.unicode.org/charts/charindex.html

A byte is a collection of 8 ones or zeros, the basic unit of computer memory. A byte can hold 256 different values, the largest being 255. In the old days when the world was smaller and simpler, about 20 or 30 years ago, this was sufficient to store A, B and C and their familiar friends. So each character was stored in one byte.

Nowadays we need to include the whole world and even the occasional Klingon. We need more space for our characters and their new friends.

Two bytes are called a word, and can be handled conveniently by most computers. A word can hold 65536 different values, which is almost enough for every character. So, in the system called UTF-16 ( because a word contains 16 ones or zeroes ), almost every character is stored in one word, and the few that do not fit are stored in two words ( a ‘surrogate pair’ ).
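
To make the two-word case concrete, here is a minimal sketch of how a code point above 0xFFFF is split into a surrogate pair. The constants come from the UTF-16 rules; the musical G clef character is just an example.

#include <stdio.h>

int main()
{
    // split a code point above 0xFFFF into a UTF-16 surrogate pair
    unsigned int codepoint = 0x1D11E;               // musical G clef, does not fit in one word
    unsigned int v = codepoint - 0x10000;           // 20 bits remain
    unsigned short high = 0xD800 + ( v >> 10 );     // top 10 bits
    unsigned short low  = 0xDC00 + ( v & 0x3FF );   // bottom 10 bits
    printf( "U+%X is stored as the two words 0x%04X 0x%04X\n", codepoint, high, low );
    return 0;
}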

The Microsoft Windows operating system uses UTF-16 to handle Unicode characters. Here is how a C or C++ program on Windows creates a UTF-16 encoded Unicode string:


const wchar_t * ws = L"Hello World";


There is a snag. Although computers use UTF-16 internally, they cannot communicate easily with each other using UTF-16. This is because, although every computer agrees on the order in which the ones and zeros of a byte should be arranged, they do not all agree on the order in which the bytes in a word should be arranged. In a reference to Jonathan Swift’s novel ‘Gulliver’s Travels’, where factions fought over which end an egg should be opened, the two ways of arranging the bytes in a word are called ‘Big Endian’ and ‘Little Endian’. When communicating with each other, computers use another standard called UTF-8, where each Unicode character is encoded by a series of bytes in a specified order which is the same whether the computer is ‘Big Endian’ or ‘Little Endian’.
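
Here is a small sketch of the difference, assuming a little-endian PC such as an ordinary Windows machine; the Chinese character is just an example.

#include <stdio.h>

int main()
{
    unsigned short utf16 = 0x4E2D;                  // a Chinese character in UTF-16
    unsigned char utf8[] = { 0xE4, 0xB8, 0xAD };    // the same character in UTF-8
    unsigned char * p = (unsigned char *)&utf16;
    // a little-endian machine prints "2D 4E", a big-endian machine prints "4E 2D"
    printf( "UTF-16 bytes in memory: %02X %02X\n", p[0], p[1] );
    // the UTF-8 bytes are the same on every machine
    printf( "UTF-8 bytes           : %02X %02X %02X\n", utf8[0], utf8[1], utf8[2] );
    return 0;
}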

When a computer program needs to communicate with another computer, perhaps by reading or writing a web page, it must constantly convert back and forward between UTF-8 and UTF-16 encoded character strings. The Windows API provides routines for doing this: WideCharToMultiByte() and MultiByteToWideChar(). However, they are a pain to use. Each conversion requires two calls to the routines, and you have to look after allocating/freeing memory and making sure the strings are correctly terminated. We need a wrapper!

Here is the interface to my wrapper:


#include <stdlib.h>     // malloc / free

/**
Conversion between UTF-8 and UTF-16 strings.

UTF-8 is used by web pages. It is a variable byte length encoding
of UNICODE characters which is independent of the byte order in a computer word.

UTF-16 is the native Windows UNICODE encoding.

The class stores two copies of the string, one in each encoding,
so should only exist briefly while conversion is done.

This is a wrapper for the WideCharToMultiByte and MultiByteToWideChar
Windows API functions.
*/
class cUTF
{
    wchar_t * myString16;   ///< string in UTF-16
    char * myString8;       ///< string in UTF-8
public:
    /// Construct from UTF-16
    cUTF( const wchar_t * ws );
    /// Construct from UTF-8
    cUTF( const char * s );
    /// get UTF-16 version
    const wchar_t * get16() { return myString16; }
    /// get UTF-8 version
    const char * get8() { return myString8; }
    /// free buffers
    ~cUTF() { free(myString8); free(myString16); }
};


Here is the code to implement this interface:


#include <windows.h>    // WideCharToMultiByte, MultiByteToWideChar
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/// Construct from UTF-16
cUTF::cUTF( const wchar_t * ws )
{
    // store copy of UTF-16
    myString16 = (wchar_t *) malloc( ( wcslen( ws ) + 1 ) * sizeof( wchar_t ) );
    wcscpy( myString16, ws );
    // How long will the UTF-8 string be?
    int len = WideCharToMultiByte(CP_UTF8, 0,
                                  ws, wcslen( ws ),
                                  NULL, 0, NULL, NULL );
    // allocate a buffer
    myString8 = (char *) malloc( len + 1 );
    // convert to UTF-8
    WideCharToMultiByte(CP_UTF8, 0,
                        ws, wcslen( ws ),
                        myString8, len, NULL, NULL);
    // null terminate
    *(myString8+len) = '\0';
}

/// Construct from UTF-8
cUTF::cUTF( const char * s )
{
    // store copy of UTF-8
    myString8 = (char *) malloc( strlen( s ) + 1 );
    strcpy( myString8, s );
    // How long will the UTF-16 string be?
    int len = MultiByteToWideChar(CP_UTF8, 0,
                                  s, strlen( s ),
                                  NULL, 0 );
    // allocate a buffer
    myString16 = (wchar_t *) malloc( ( len + 1 ) * sizeof( wchar_t ) );
    // convert to UTF-16
    MultiByteToWideChar(CP_UTF8, 0,
                        s, strlen( s ),
                        myString16, len);
    // null terminate
    *(myString16+len) = '\0';
}


And here is some code to test the wrapper:


// create a native unicode string with some Chinese characters
const wchar_t * unicode_string = L"String with some chinese characters \x751f\x4ea7\x8bbe\x7f6e ";

// convert to UTF8
cUTF utf( unicode_string );

// create a web page
FILE * fp = fopen("test_unicode.html","w");

// let browser know we are using UTF-8
fprintf(fp,"<head><meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"></head>\n");

// output the converted string
fprintf(fp, "After conversion using cUTF - %s<p>\n", utf.get8() );

fclose(fp);
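
The wrapper works just as well in the other direction. A brief sketch, assuming the UTF-8 text has already been read into a buffer; the bytes below are the UTF-8 encoding of two of the Chinese characters above.

// convert UTF-8 text ( perhaps read back from a web page ) to UTF-16
const char * utf8_text = "\xE7\x94\x9F\xE4\xBA\xA7";
cUTF utf2( utf8_text );
MessageBoxW( NULL, utf2.get16(), L"Round trip", MB_OK );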



Language Support 
Raven’s Point has clients all round the world, and sometimes my clients have customers all round the world. Although my clients all communicate with me in English, my clients’ customers much prefer to use the applications I deliver in their own language. Some of my clients’ customers are in China, which presents a particular challenge.

It is important to be able to switch the language the user sees quickly and easily. The requirement is that the user can select the preferred language while the program is running, and the entire user interface should instantly change to the new language without changing or interrupting anything that is going on. It is not satisfactory to stop the program and restart it, or run another version, to change the language displayed.

This week I added support for the German language, in addition to English and Chinese. This went very smoothly, once I had obtained the German translation. So, here is my recipe for multi-language support in a C++ program built with Microsoft Visual Studio.



Create a table which has every character string displayed by the user interface assigned to a number. Each language has its own base number and the translations of each string are assigned a unique number which has the same offset from the language base. For a program that supports English and German, I might choose that the English base number is 40000 and German is 70000. So the English string “Run” might be given the number 40131 and the German string "Geführt" the number 70131.

The numbers are arbitrary, but there are a couple of things to watch out for. The numbers 1000 and upwards are used by Microsoft Visual Studio for all sorts of purposes, so it is best to stay away from this area – starting at 40000 works fine. The language base numbers must be far enough apart that there is no chance that you will run out of room between them – a separation of 10000 should be enough.

The table of numbered strings is saved in a text file which looks like this

STRINGTABLE
BEGIN
40131 "Run"
END


STRINGTABLE
BEGIN
70131 "Geführt"
END

The text file containing the numbered strings table is a resource which is compiled by the resource compiler and linked to the rest of the program. However, it is maintained and edited with a text editor and must be protected from being changed by the Microsoft Visual Studio resource editor. Do this by naming the file language.rc and storing it in the res subfolder of the project directory. The resource compiler reaches the file through an #include in <project folder>/res/<projectname>.rc2
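
The include might look something like this; the exact relative path depends on how the resource include directories are set up in your project.

// in <project folder>/res/<projectname>.rc2, among the manually edited resources
#include "language.rc"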

The numbered string table is used by code like this

SetDlgItemText( IDC_RUN,
CString(MAKEINTRESOURCE( myLanguage + 131 ) ) );

The global variable myLanguage contains the base number of the currently selected language. This code must be called every time the GUI is redrawn and also each time the user changes the selected language.
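
It is convenient to gather all these calls into one helper which is called from OnInitDialog() and from the language selection handler. A minimal sketch, assuming a dialog class cMyDialog with a run button and a stop button; the extra control ID and offsets are made up for illustration.

// hypothetical helper - set every visible string from the numbered string table
void cMyDialog::SetLanguageStrings()
{
    SetDlgItemText( IDC_RUN,
        CString( MAKEINTRESOURCE( myLanguage + 131 ) ) );   // "Run" / "Geführt"
    SetDlgItemText( IDC_STOP,
        CString( MAKEINTRESOURCE( myLanguage + 132 ) ) );   // hypothetical offset 132
    SetWindowText(
        CString( MAKEINTRESOURCE( myLanguage + 1 ) ) );     // dialog title, hypothetical offset 1
}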

It is convenient for the user if, when the program starts, it remembers the language that was selected last time it was run. When the user changes the language, call this line

AfxGetApp()->WriteProfileInt(L"startup", L"language", myLanguage );

And when the program starts

myLanguage = AfxGetApp()->GetProfileInt(L"startup", L"language", 40000 );
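
Putting the two together, a language selection handler might look like this; the command handler name and the SetLanguageStrings() helper are the same illustrations as above.

// hypothetical handler for a "German" menu item
void cMyDialog::OnLanguageGerman()
{
    myLanguage = 70000;     // German base number
    AfxGetApp()->WriteProfileInt( L"startup", L"language", myLanguage );
    SetLanguageStrings();   // redraw every label in the new language
}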

There is a temptation to use defines to replace the string number offsets ( e.g. 131 ) with symbolic constants ( STR_RUN ). I recommend against doing this. It is just another table which must be maintained, and once there are more than a few dozen strings, maintenance becomes a pain. The numbered string table is self documenting and, if you are careful assigning the resource IDs ( IDC_RUN ) and use plenty of comments, the code will be self documenting, despite the sprinkling of mysterious numbers ( 131 ) throughout.




Now, we come to the support of Chinese and other East Asian languages. Out of the box, Windows will not even display East Asian characters. Here is a link to advice from Robert Y Eng on switching on this support.

The next problem is how to represent the Chinese characters. There are several alternatives here and many technical details. It is easy to get lost for many days in researching and evaluating the alternatives ( I did! ). I am simply going to describe what I do.

The Chinese character strings are represented by 16 bit Unicode numbers, using escaped hexadecimal. They look like this:

60131 L"\x8FD0\x884C"

This produces a couple of hieroglyphics which, I am assured, mean “Run” to anyone who can read them.

The advantage of this method is that you just have to add another language base number ( in my case 60000 ) for Chinese and immediately, magically the program displays Chinese characters in all the appropriate places on any computer with East Asian languages switched on. No new code is required.

The disadvantage of this method is that you probably will not receive the Chinese strings from the translator in this form. Since there are so many different ways to represent Chinese characters, this problem will probably arise no matter what scheme you choose. I have been doing this for less than a year, and already have received Chinese translations in several different formats which require some hacking about to decode. I cannot give details of all the different possibilities, but here is some general advice.

The first thing is to determine if the characters are being represented with fixed width 16 bit numbers. If they are, then you need to convert them into escaped hexadecimal ASCII character strings.
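
That conversion is mechanical. A minimal sketch, assuming the translation has already been read into a wide string:

#include <stdio.h>

// print a UTF-16 string as escaped hexadecimal, ready to paste into the string table
void PrintEscaped( const wchar_t * ws )
{
    printf( "L\"" );
    while( *ws )
        printf( "\\x%04X", (unsigned short)*ws++ );
    printf( "\"\n" );
}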

The other format that you will often see is variable width multibyte numbers, often called UTF-8. These need to be converted. Here is a straightforward manual procedure.
• Paste into notepad editor
• Clean up so that everything is as regular as possible
• Save as unicode big-endian
• Open in a hex editor
• Copy and paste the required code string into the string table file, escaping as you go.

Obviously, this procedure is only feasible for a small number of strings. If you need to automate this procedure, contact me.
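
For what it is worth, here is a rough sketch of how the UTF-8 case might be automated on Windows, using the same MultiByteToWideChar call as the earlier post; reading the file and cleaning up the input are left out.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// convert a UTF-8 buffer to escaped hexadecimal for the string table
void Utf8ToEscaped( const char * s )
{
    // how long will the UTF-16 string be
    int len = MultiByteToWideChar( CP_UTF8, 0, s, strlen( s ), NULL, 0 );
    wchar_t * ws = (wchar_t *)malloc( ( len + 1 ) * sizeof( wchar_t ) );
    MultiByteToWideChar( CP_UTF8, 0, s, strlen( s ), ws, len );
    // print each 16 bit value as \xXXXX
    printf( "L\"" );
    for( int i = 0; i < len; i++ )
        printf( "\\x%04X", (unsigned short)ws[i] );
    printf( "\"\n" );
    free( ws );
}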



Bug Tracking 
Bug tracking is vital for software quality.

Complexity is not needed. The ability to assign a unique index number to every bug and record the status ( open, test, closed ) as it changes is all that is necessary. The index number can then be used to track the bug through other tools: message threads, source management, regression tests. So a spreadsheet is just fine.

The challenge is to share. The obvious answer is an online bug tracking tool. There are many such available, all of which I have found too complex, to the point of absurdity. They try to substitute for every other tool with a bug-centric model, but their substitutes are awkward and inferior. The real problem is that they present every user with yet another learning curve.

So I use the excellent, simple shared message threading feature of BaseCamp Project Management. Each bug is assigned its own thread, along with its index number. The threads are assigned to a category ( Bugs – Open, Bugs – Test, Bugs – Closed ) as appropriate.

This works well. Clients use the message threads naturally. They read and post messages to threads without any learning curve. A little background management turns their activities into a bug tracking system.

There are a couple of problems managing this system. Finding the next unique index number for a new bug and moving bugs between the categories take more effort than such routine tasks should.

This week I decided to get serious and fix these problems. I wanted a command line tool that would tell me the next unique bug ID, and that would let me specify an existing bug ID and a new category for it.

I achieved this in a few hours. cURL is used to query the BaseCamp API. Tcl decodes the XML returned from BaseCamp, runs the command line interface, and generates the BaseCamp API calls.

This call lists the existing bugs, their status and the next ID to be used.

>tclsh bctid.tcl
1 open {TID1: test1}
Next TID 2

This call changes the status of Bug #1 to test

>tclsh bctid.tcl test 1
1 test {TID1: test1}
Next TID 2

Links

Basecamp

Basecamp API

cURL



