CityDesk 2.0-Documentation
Unicode and Character Set Issues
CityDesk can be used for editing text in any alphabet that your computer supports.
On Windows 98 and Windows Me:
- English language versions of Windows usually only support the Latin alphabet. "Latin" includes the Latin letters A-Z plus all the accented letters commonly used in popular Western European languages.
- National language versions of Windows support both Latin text and one other alphabet. For example, the Japanese version of Windows 98 supports English and Japanese.
On Windows NT, Windows 2000, and Windows XP:
- These versions of Windows support Unicode, which allows you to edit text in almost any alphabet or writing system of the world, as long as that alphabet is supported by Unicode and you have an appropriate font on your system.
Text in CityDesk articles and variables is stored internally in Unicode format. When HTML files are written out, CityDesk converts the text to UTF-8 format. UTF-8 is a way to encode Unicode which is understood by all modern web browsers, but you have to tell the web browser to expect UTF-8 so it knows how to decode it. This is done by putting the following tag in your HTML file immediately after the <head> tag:
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8">
It's important that this be the first tag after the <head> tag to avoid confusing web browsers. For example:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=UTF-8">
<title>Unicode and Character Set Issues</title>
</head>
You will notice that CityDesk inserts this for you by default in new templates.
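As a rough illustration of the write-out step described above, the following Python sketch builds a page whose first tag after <head> is the charset declaration, then encodes the whole page as UTF-8 bytes. This is not CityDesk's own code; `render_page` and its parameters are hypothetical names chosen for the example.

```python
import html

def render_page(title: str, body: str) -> bytes:
    """Build a small HTML page and return it encoded as UTF-8 bytes.

    Hypothetical helper for illustration only.
    """
    page = (
        '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n'
        "<html>\n"
        "<head>\n"
        # The charset declaration comes first, before any other tag in <head>.
        '<meta http-equiv="Content-Type"\n'
        'content="text/html; charset=UTF-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n"
        f"<body>{html.escape(body)}</body>\n"
        "</html>\n"
    )
    # The in-memory string is Unicode text; the bytes on disk are UTF-8.
    return page.encode("utf-8")

data = render_page("Unicode and Character Set Issues", "Grüße, 日本語")
```

The bytes in `data` are what would be written to disk, and the meta tag tells the browser to decode them as UTF-8.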
In rare circumstances you may need a different encoding than UTF-8. If you specify a different encoding in the meta tag, it will be honored by CityDesk when the file is written out. For example:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=Windows-1252">
<title>Unicode and Character Set Issues</title>
</head>
This causes the file to be written out using the "Western European" encoding optimized for Windows. Any characters which cannot be represented in that encoding, such as Japanese characters, will be written out as ?.
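The replacement behavior can be reproduced in a few lines of Python. This is only a sketch of the general idea, assuming Python's `cp1252` codec as a stand-in for Windows-1252; it does not show how CityDesk itself performs the conversion.

```python
text = "Café 日本語"

# Windows-1252 can represent "é" but has no Japanese characters.
# errors="replace" substitutes "?" for anything the encoding cannot hold.
encoded = text.encode("cp1252", errors="replace")
print(encoded)  # b'Caf\xe9 ???'

# Decoding shows what a reader would see: the Japanese text is lost for good.
print(encoded.decode("cp1252"))  # Café ???
```

Note that the substitution is one-way: once the file has been written, the original characters cannot be recovered from the question marks.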
HTML files and templates in CityDesk are treated a little bit differently: they are stored in CityDesk in UTF-8 format. When they are loaded into memory, for example when you edit them using the CityDesk built-in editor, they are converted to Unicode. They are then converted back to UTF-8 when you save them. If you edit an HTML file or a template from CityDesk using an external editor, you will be working on the UTF-8 version of the file.
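The load, edit, and save cycle described here amounts to a decode followed by an encode. A minimal Python sketch, assuming the stored form is UTF-8 bytes and the in-memory form is a Unicode string; `load_template` and `save_template` are hypothetical names, not CityDesk functions.

```python
def load_template(stored: bytes) -> str:
    """Decode the stored UTF-8 bytes into Unicode text for editing."""
    return stored.decode("utf-8")

def save_template(text: str) -> bytes:
    """Encode the edited Unicode text back to UTF-8 for storage."""
    return text.encode("utf-8")

stored = "<title>Ünïcödé</title>".encode("utf-8")   # what sits on disk
in_memory = load_template(stored)                    # what the editor works on
edited = in_memory.replace("Ünïcödé", "Unicode ✓")
stored_again = save_template(edited)                 # back to UTF-8 on save
```

An external editor would see the bytes in `stored` directly, so it needs to be told to open the file as UTF-8.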
What is UTF-8, exactly? OK, you don't have to know this to get CityDesk to work, but you may be wondering, so we'll try to explain it here.
In the olden days computers used 8 bits to store a letter. There are 256 possible combinations of 8 bits. That's enough for many languages, but not enough for Asian languages like Chinese which have thousands of different "letters." There were many different incompatible encoding schemes for jamming different alphabets into the same 256 combinations. The most common format, ASCII, defined what would happen in the first 128 combinations (the values 0 through 127), but it was only good enough for English.
To simplify the problem, a consortium of computer makers came up with an international standard called Unicode. Under Unicode as originally designed, you would use 16 bits to store a letter. That gives you room for 65,536 different letters, which was expected to be enough for just about every known alphabet all at once, making multilingual text on computers possible. (Unicode has since grown beyond 16 bits.)
The trouble is that all the people who spoke English were distraught that they would have to "waste" an extra 8 bits on each letter even when they were just writing in English. And besides, there were already a lot of existing computer systems that assumed 8 bits = 1 letter. So a scheme called UTF-8 was devised. In this scheme, all English language letters (and indeed, all characters below 128 from the old-fashioned ASCII character set) would be written out exactly the same way as before, using one byte each. Only non-English letters would be encoded using multiple bytes, between 2 and 4 under the current definition of UTF-8. This scheme is the most popular method of encoding Unicode on the Internet.
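To make the variable-length idea concrete, here is a small Python sketch that encodes a few sample characters and counts the bytes each one takes. The sample characters are arbitrary choices for illustration.

```python
samples = ["A", "é", "日", "😀"]

# Map each character to the number of bytes its UTF-8 encoding occupies.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    # Show the code point, the byte count, and the raw bytes in hex.
    print(f"U+{ord(ch):04X} -> {n} byte(s): {ch.encode('utf-8').hex(' ')}")

# A plain ASCII letter is byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii")
```

The English letter takes a single byte, identical to its ASCII form, while the accented letter, the Japanese character, and the emoji take 2, 3, and 4 bytes respectively.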
The details, of course, are somewhat more complicated than this, and in fact, this is a rather grotesque oversimplification, but we've probably already bored you to tears so we'll move on now.
©Copyright 2001-2003 Fog Creek Software, Inc. All Rights Reserved.