World Wide Characters 
It is a big old world, full of many varied characters.

There are familiar friends like A, B and C. There are also Chinese characters and Cyrillic characters. We might even want to use Klingon characters!

There are well over 65,536 different characters that a computer might have to handle. Every character has been assigned its own number, called a 'Unicode code point'. You can see them all here: http://www.unicode.org/charts/charindex.html

A byte is a collection of 8 ones and zeros, the basic unit of computer memory. A byte can hold 256 different values, so the largest number that can be stored in one is 255. In the old days when the world was smaller and simpler, about 20 or 30 years ago, this was sufficient to store A, B and C and their familiar friends. So each character was stored in one byte.

Nowadays we need to include the whole world and even the occasional Klingon. We need more space for our characters and their new friends.

Two bytes make a word, which can be handled conveniently by most computers. A word can hold 65,536 different values, which is almost enough for every character. So, in the system called UTF-16 (because a word contains 16 ones or zeros), almost every character is stored in one word, and the few that do not fit are stored in a pair of words called a 'surrogate pair'.
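For example, here is a small sketch of my own (assuming <stdio.h> and <wchar.h> are included, and a Windows compiler where wchar_t is a 16-bit word) showing a smiling face character that does not fit in a single word being stored as two:


// U+1F600 does not fit in one 16-bit word, so UTF-16 stores it
// as the surrogate pair 0xD83D 0xDE00
const wchar_t * smiley = L"\xD83D\xDE00";
printf( "words used: %d\n", (int) wcslen( smiley ) );   // prints: words used: 2
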

The Microsoft Windows operating system uses UTF-16 to handle Unicode characters. Here is how a C or C++ program on Windows creates a UTF-16 encoded Unicode string:


wchar_t * ws = L"Hello World";
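
On Windows, wchar_t is a 16-bit word, which is what makes the string above a sequence of UTF-16 words. A one-line check of my own (assuming <stdio.h> is included) confirms it:


printf( "sizeof(wchar_t) = %d\n", (int) sizeof( wchar_t ) );   // prints 2 on Windows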


There is a snag. Although Windows uses UTF-16 internally, computers cannot communicate easily with each other using UTF-16. This is because, although every computer agrees on the order in which the ones and zeros of a byte should be arranged, they do not all agree on the order in which the bytes of a word should be arranged. In a reference to Jonathan Swift's novel 'Gulliver's Travels', where characters fought over which end an egg should be opened from, the two ways of arranging the bytes in a word are called 'Big Endian' and 'Little Endian'. When communicating with each other, computers use another standard called UTF-8, where each Unicode character is encoded as a series of bytes in a specified order, which is the same whether the computer is 'Big Endian' or 'Little Endian'.
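
To make the byte-order point concrete, here is another small sketch of my own (again assuming <stdio.h>) that prints the UTF-8 bytes of one of the Chinese characters used later in this post. The sequence E7 94 9F comes out the same on every machine, 'Big Endian' or 'Little Endian':


// the character U+751F written as a UTF-8 string literal;
// it always occupies the same three bytes, in the same order
const char * utf8 = "\xE7\x94\x9F";
for ( const char * p = utf8; *p; p++ )
    printf( "%02X ", (unsigned char)*p );   // prints: E7 94 9F
printf( "\n" );
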

When a computer program needs to communicate with another computer, perhaps by reading or writing a web page, it must constantly convert back and forth between UTF-8 and UTF-16 encoded character strings. The Windows API provides functions for doing this: WideCharToMultiByte() and MultiByteToWideChar(). However, they are a pain to use. Each conversion requires two calls (one to find the required buffer size and one to do the conversion), and you have to look after allocating and freeing memory and making sure the strings are correctly terminated. We need a wrapper!

Here is the interface to my wrapper:


#include <windows.h>    // WideCharToMultiByte, MultiByteToWideChar
#include <cstdlib>      // malloc, free
#include <cstring>      // strlen, strcpy
#include <cwchar>       // wcslen, wcscpy

/**
    Conversion between UTF-8 and UTF-16 strings.

    UTF-8 is used by web pages. It is a variable byte length encoding
    of UNICODE characters which is independent of the byte order in a computer word.

    UTF-16 is the native Windows UNICODE encoding.

    The class stores two copies of the string, one in each encoding,
    so it should only exist briefly while the conversion is done.

    This is a wrapper for the WideCharToMultiByte and MultiByteToWideChar
    Windows API functions.
*/
class cUTF
{
    wchar_t * myString16;   ///< string in UTF-16
    char    * myString8;    ///< string in UTF-8
public:
    /// Construct from UTF-16
    cUTF( const wchar_t * ws );
    /// Construct from UTF-8
    cUTF( const char * s );
    /// get UTF-16 version
    const wchar_t * get16() { return myString16; }
    /// get UTF-8 version
    const char * get8() { return myString8; }
    /// free buffers
    ~cUTF() { free(myString8); free(myString16); }
};


Here is the code that implements this interface:


/// Construct from UTF-16
cUTF::cUTF( const wchar_t * ws )
{
    // store a copy of the UTF-16 string
    myString16 = (wchar_t *) malloc( ( wcslen( ws ) + 1 ) * sizeof( wchar_t ) );
    wcscpy( myString16, ws );

    // how long will the UTF-8 string be?
    int len = WideCharToMultiByte( CP_UTF8, 0,
                                   ws, (int) wcslen( ws ),
                                   NULL, 0, NULL, NULL );
    // allocate a buffer
    myString8 = (char *) malloc( len + 1 );

    // convert to UTF-8
    WideCharToMultiByte( CP_UTF8, 0,
                         ws, (int) wcslen( ws ),
                         myString8, len, NULL, NULL );
    // null terminate
    *(myString8 + len) = '\0';
}

/// Construct from UTF-8
cUTF::cUTF( const char * s )
{
    // store a copy of the UTF-8 string
    myString8 = (char *) malloc( strlen( s ) + 1 );
    strcpy( myString8, s );

    // how long will the UTF-16 string be?
    int len = MultiByteToWideChar( CP_UTF8, 0,
                                   s, (int) strlen( s ),
                                   NULL, 0 );
    // allocate a buffer
    myString16 = (wchar_t *) malloc( ( len + 1 ) * sizeof( wchar_t ) );

    // convert to UTF-16
    MultiByteToWideChar( CP_UTF8, 0,
                         s, (int) strlen( s ),
                         myString16, len );
    // null terminate
    *(myString16 + len) = L'\0';
}


And here is some code to test the wrapper:


// create a native Unicode string with some Chinese characters
const wchar_t * unicode_string = L"String with some chinese characters \x751f\x4ea7\x8bbe\x7f6e ";

// convert to UTF-8
cUTF utf( unicode_string );

// create a web page
FILE * fp = fopen( "test_unicode.html", "w" );

// let the browser know we are using UTF-8
fprintf( fp, "<head><meta http-equiv=\"Content-Type\" content=\"text/html;charset=UTF-8\"></head>\n" );

// output the converted string
fprintf( fp, "After conversion using cUTF - %s<p>\n", utf.get8() );

fclose( fp );
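

The wrapper works in the other direction too. As a rough sketch of my own (not part of the test above, and using MessageBoxW purely as an example of a wide-character Windows API), a UTF-8 string, perhaps read back from a web page, can be converted to UTF-16 like this:


// a UTF-8 string holding the first two Chinese characters from the test above
const char * utf8_string = "\xE7\x94\x9F\xE4\xBA\xA7";

// convert to UTF-16
cUTF utf16( utf8_string );

// the UTF-16 version can now be passed to a wide-character Windows API
MessageBoxW( NULL, utf16.get16(), L"cUTF test", MB_OK );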


