FAQ: Unicode

We keep hearing the word 'Unicode', and people are saying that it is better if a font is 'Unicode'. What is this 'Unicode' thing, and why is it better?

What is Unicode?
Unicode is a system of codes that computers use to store and display characters, like "a" or "ཡ" or "दी".
But it isn't a code you have to learn, like you learn HTML or PHP.
The computer uses this code.
How to use Unicode?
- use a Unicode font,
- and tell the computer you are using it.
Here are the details for how to do those things.
Why is it good to use Unicode?
With Unicode, all languages will display for everybody everywhere.
More info is here.

If you are interested in learning the theory, go to the How it works section!

All you really need to know

use a Unicode font,
and tell the computer you are using it.

Like this:

Font: When you type anything (web page, css file, php program), use a Unicode font. how?

If you are using English, you can use any font. The Injis have it easy! This is because English comes first in the character set, and the letters always have the same code numbers.

If you are using Tibetan, Chinese, etc., make sure the font is a Unicode font.

These are all Unicode Tibetan fonts: Monlam 3 BodYig, TCRC YoutsoWeb, MS Himalayan, Tibetan Machine Uni

These are not Unicode fonts: Monlam 1 and 2, TCRC Bod-Yig, Sambhota

More info in links
Editor: Use the Unicode character set in your text editor how?

Look around in your editor's menus and preferences, for something like "character set" or "encoding", and choose "UTF-8" or "Unicode".

Examples:

BlueFish
Set for all files: Edit menu → Preferences → Files → Encoding → Default character set → UTF-8
Set for current file: Document menu → Character Encoding → UTF-8

NotePad++
Set for all files: Settings menu → Preferences → New Document → Encoding → UTF-8 without BOM
Set for current file: Format menu → Encode in UTF-8

vim
Set for all files – encoding may already be set to utf-8 by default. Type :set all to see if it is. If not, in your .vimrc file, add the line: set encoding=utf-8

More info in links
Web page: When making web pages, always have the "meta charset" line. how?

Always have this line in the <head> of your page, right above the <title> tag:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

("view source" of this page to see this being used.)
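The font and the meta line only help if the file itself is saved as UTF-8. As a rough check, here is a small Python sketch (Python, the file name test.html, and the sample page are assumptions for illustration, not part of this site) that writes a page as UTF-8 and confirms the bytes read back correctly:

```python
import os
import tempfile

# A tiny page that declares UTF-8 and contains one Tibetan letter (U+0F61).
page = (
    "<html>\n<head>\n"
    '<meta http-equiv="content-type" content="text/html; charset=utf-8" />\n'
    "<title>Test</title>\n</head>\n"
    "<body>\u0f61</body>\n</html>\n"
)

# Save the page as UTF-8; newline="" keeps line endings exactly as written.
path = os.path.join(tempfile.mkdtemp(), "test.html")
with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(page)

# Read the raw bytes back and decode them the way a browser would,
# following the charset named in the meta line.
with open(path, "rb") as f:
    raw = f.read()

round_trip_ok = raw.decode("utf-8") == page
print(round_trip_ok)   # True
```

If the file had been saved in some other encoding, the decode step would fail or produce wrong letters, which is what a visitor's browser would show.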

More info in links
Database: If you are using a database, make sure it is using Unicode for connection and for collation. how?

Example: PHPMyAdmin
Server: localhost → Create new database →
Collation menu → scroll to last item in menu: utf8_unicode_ci
MySQL connection collation → Collation menu → scroll to last item in menu: utf8_unicode_ci

More info in links
CMS: If you are using a CMS, such as WordPress or Joomla, make sure it is using Unicode for its database. how?

Example: WordPress
In the wp-config.php file:
define('DB_CHARSET', 'utf8');
(This is the WordPress default value.)
More info at WordPress.org

For other CMSs, check the documentation for that CMS.

More info in links

Why is Unicode important?

If you make your document in Unicode (that is, type it in some Unicode font), then even if another person doesn't have the font you used, as long as they have any Unicode font in the same language, they can still read and use your file.
What happens if you are not using a Unicode font:
- Another person may not have the font you used.
- The non-Unicode fonts may disappear or stop being maintained.
- Then – nobody can read your files, your data!
Using Unicode is future-proofing your data. If you use Unicode fonts for all your documents, you are encoding your data in a format where the user's program (word processor, browser, etc.) does not have to guess what language it may be in, and the data is not dependent on arbitrary mappings of codes in the range 1-127 to symbols.
An additional benefit: when the computer knows what language it is dealing with, it can automatically make other smart choices, such as which font to use, which spell-checker to use, what grammar-checker to use, whether to write from right-to-left (Hebrew) or from left-to-right (Tibetan), etc.

It can be interesting — but not necessary! — to

Understand how it works

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

If we want to really understand Unicode, we need to have some idea of how a computer uses characters (letters, digits, and other symbols).

How does a computer recognize and display characters?

A computer does not know anything about "letters" or "symbols". All a computer understands is numbers. We come to believe that the computer knows about letters because it can write words on the screen, but the computer is really only translating codes (numbers) into drawings on the screen (the letters that you see).

For example, if the computer wants to show the word "cat" on the screen, it finds the codes for the characters "c", "a", and "t" in its memory and replaces them on the screen with the drawings that look like "c", "a", and "t".

What are these codes? Originally it was decided to use the numbers from 0-127 (that's 7 bits) to represent a different symbol each. So for example, the letters "c", "a", and "t" were given the codes 99, 97, and 116, respectively. (The capital letters "C", "A", and "T" have their own codes: 67, 65, and 84.)
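You can look these codes up yourself. Here is a minimal sketch in Python (an assumption for illustration; any language that exposes character codes would do), using the built-in ord and chr functions:

```python
# Look up the code number behind each letter of a word.
word = "cat"
codes = [ord(ch) for ch in word]           # letter -> number
print(codes)                               # [99, 97, 116]

# Capital letters have their own, different codes.
capital_codes = [ord(ch) for ch in "CAT"]
print(capital_codes)                       # [67, 65, 84]

# chr() goes the other way: number -> letter.
rebuilt = "".join(chr(n) for n in codes)
print(rebuilt)                             # cat
```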

That worked well in the beginning when the only language that computer scientists cared about was English. They didn't need more than 128 codes to show all the characters used in English: 26 for lower-case letters, 26 for upper-case letters, 10 for digits, and plenty left over for symbols and "control characters".

What happens when we go beyond English?

But English isn't the only language in the world! As time went on, it became evident that people wanted to use languages other than English on their computers. The range of codes was expanded to 0-255 (8 bits), and most of the European symbols were added. Today you will often see this wider range loosely called "ASCII" or "extended ASCII", although strictly speaking ASCII is only the first 128 codes.

Now that worked fine for European languages. But there are many more languages besides the European ones, with many, many more characters. 256 codes are not enough to give each of the world's symbols (characters) a unique number.

This is where Unicode steps in ... It has room for more than 1 MILLION (!) codes, so every symbol in every script can be given its own, unique number. (Technically speaking: Unicode numbers run from 0 to 0x10FFFF, which is 1,114,112 values, and a computer can store any of them in 32 bits or fewer.)

The Tibetan letters and symbols have been given the range 0xF00-0xFFF (hexadecimal representation) ... that is, there are 256 unique numbers set aside for Tibetan letters, digits, and other symbols.
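Here is a small Python sketch of what that range means in practice. Python and the helper name in_tibetan_block are assumptions for illustration only. The Tibetan letter ཡ is written as the escape \u0f61 so the example works whatever encoding your editor uses:

```python
import unicodedata

TIBETAN_FIRST = 0x0F00
TIBETAN_LAST = 0x0FFF

def in_tibetan_block(ch):
    """Return True if the character's code falls in the Tibetan range."""
    return TIBETAN_FIRST <= ord(ch) <= TIBETAN_LAST

ya = "\u0f61"                     # the letter ཡ
print(hex(ord(ya)))               # 0xf61
print(unicodedata.name(ya))       # the official Unicode name of this character
print(in_tibetan_block(ya))       # True
print(in_tibetan_block("a"))      # False

# How many numbers the range holds (not every one is assigned to a symbol).
block_size = TIBETAN_LAST - TIBETAN_FIRST + 1
print(block_size)                 # 256
```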

Why is it good that Tibetan (and other scripts) have their own numbers?

Remember that the computer stores the word "cat" as the numbers 99, 97, and 116. If you use a non-Unicode Tibetan font, the numbers 99, 97, and 116 would be drawn on the screen as some Tibetan characters and not as "c-a-t". Or, if I choose another language's non-Unicode font, the symbols drawn on the screen would be totally different again. These nonsense letters are what you see in a browser or other program when it does not have the font the document was made with.

This means that the user has to tell the computer what language the document is in, because the computer does not know whether the document is English, Tibetan, or Hindi ... it just draws the symbols from the selected font for the given numbers.
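The same problem can be reproduced with the old 8-bit "code pages", which reuse the same numbers for different scripts much as non-Unicode fonts do. A rough Python sketch, using the standard latin-1 and cp1251 codecs as stand-ins (they are not Tibetan fonts, just a convenient illustration):

```python
# Three stored numbers (bytes) with no information about the script.
raw = bytes([0xE0, 0xE1, 0xE2])

# The same numbers, drawn according to two different old code pages.
as_western = raw.decode("latin-1")    # Western European reading
as_cyrillic = raw.decode("cp1251")    # Cyrillic reading

print(as_western)     # àáâ
print(as_cyrillic)    # абв

# In Unicode, each of these letters has its own distinct number.
print([hex(ord(c)) for c in as_western])    # ['0xe0', '0xe1', '0xe2']
print([hex(ord(c)) for c in as_cyrillic])   # ['0x430', '0x431', '0x432']
```

The stored numbers never changed; only the guess about which table to use did. With Unicode no guessing is needed, because no two letters share a number.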

But if we use Unicode fonts exclusively, the letters "c-a-t" will always be drawn on the screen, because Tibetan, English, and Hindi characters each have their own codes. The computer can now tell that the text is in Latin letters (here, English), because it uses only the codes for those letters.

Know more

Some other questions

What is `ASCII`?

It is the set of binary representations of the standard roman characters (A-Z, a-z, etc.) that most computers and peripherals use.
Stands for American Standard Code for Information Interchange.

What is `hexadecimal`?

It is a numeric notation that uses 16 digits, 0-9 and A-F. That is, you start counting at 0, go up to 9, and then the next digit is A (ten), and the last digit is F (fifteen). It is widely used in the computer world because two hexadecimal digits describe exactly one byte.

If you make web pages, you have seen hexadecimal notation for color codes: like #006633 for the green headers on this page.
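Here is a short Python sketch of converting between hexadecimal and ordinary decimal numbers. Python and the helper name parse_color are assumptions for illustration only:

```python
# Hexadecimal text -> ordinary decimal number.
tibetan_start = int("F00", 16)
print(tibetan_start)           # 3840

# Decimal number -> hexadecimal text.
print(hex(255))                # 0xff

def parse_color(code):
    """Split a '#RRGGBB' color code into red, green, blue decimal values."""
    digits = code.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

print(parse_color("#006633"))  # (0, 102, 51)
```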

What is UTF-8?

UTF-8 is a way of storing Unicode numbers that keeps the old ASCII codes (0-127) unchanged: each one is stored in a single byte, exactly as before.
That is why UTF-8 allows old, non-Unicode, English-language programs and files to work in a Unicode environment without modification.
Unicode defines which symbol goes with which number; it does not dictate how the computer stores that number. UTF-8 is one storage method, and it uses from one to four bytes per character.
You can Google for more technical detail.
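To see UTF-8 at work, here is a minimal Python sketch (Python and the helper name utf8_bytes are assumptions for illustration) that shows how many bytes different characters take:

```python
def utf8_bytes(ch):
    """Return the UTF-8 bytes for one character, as hexadecimal text."""
    return ch.encode("utf-8").hex()

# An English letter, an accented European letter, and a Tibetan letter.
print(utf8_bytes("a"))         # 61      (1 byte, same as ASCII)
print(utf8_bytes("\u00e9"))    # c3a9    (2 bytes, the letter é)
print(utf8_bytes("\u0f61"))    # e0bda1  (3 bytes, the letter ཡ)

# Plain English text is byte-for-byte identical in ASCII and UTF-8.
same_as_ascii = "cat".encode("utf-8") == "cat".encode("ascii")
print(same_as_ascii)           # True
```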

Some useful links

General info
Using Unicode
Unicode fonts
Database and Unicode

Many thanks to all the below, who enabled me to come up with this, hopefully useful, summary:
- The many students who have allowed me to stumble through explanations of Unicode, which made things clearer to me, if not to them!
- Tashi Wangdue, for his informative posts on the DITG mail list
- Jonas Bonn, who kindly allowed adaptation of his 21 May 2005 post to the DITG mail list
All errors and misteaks are mine only!

All content not copyright by anyone else is copyright © 2003–2009 James Walker.
License for use is the GNU Free Documentation License. Find it:
here in the License directory
or at the Free Software Foundation, www.fsf.org