FAQ: Unicode

We keep hearing the word 'Unicode', and people are saying that it is better if a font is 'Unicode'. What is this 'Unicode' thing, and why is it better?

  1. What is Unicode?
    Unicode is a system of codes for representing characters, like "a" or "ཡ" or "दी".
    But it isn't a code you have to learn, like you learn HTML or PHP.
    The computer uses this code.
  2. How do I use Unicode? Here are the details for how to do it.
  3. Why is it good to use Unicode?
    With Unicode, text in any language can display correctly for everybody, everywhere, as long as a font containing those characters is available.
    More info is here.

If you are interested in learning the theory, go to the "Understand how it works" section!

All you really need to know

  1. use a Unicode font,
  2. and tell the computer you are using it.

Like this:
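One way to sketch the second step, assuming Python rather than whatever program you actually use: name the encoding explicitly whenever text is written or read. The file name `greeting.txt` and the sample text are made up for illustration.

```python
import os
import tempfile

# English, Tibetan, and Hindi characters in one string: "cat", ཡ, दी
text = "cat \u0f61 \u0926\u0940"

# Hypothetical file name, placed in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "greeting.txt")

# Tell the computer which encoding is in use when writing...
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# ...and again when reading the text back.
with open(path, encoding="utf-8") as f:
    restored = f.read()

print(restored == text)  # True
```

If the two `encoding` arguments disagree, the same stored numbers can come back as different, wrong characters.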

Why is Unicode important?

It can be interesting — but not necessary! — to

Understand how it works

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
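As a small illustration of "a unique number for every character", Python's built-in `ord` returns that number (the code point) for the sample characters used at the top of this page. This is only a sketch, and the variable names are arbitrary. Note that "दी" is written with two characters, so it has two numbers.

```python
# Sample characters: a, ཡ, and the two parts of दी (द and ी).
samples = ["a", "\u0f61", "\u0926", "\u0940"]

for ch in samples:
    # Show the number in decimal and in the usual U+XXXX hexadecimal form.
    print(ord(ch), f"U+{ord(ch):04X}")

codes = [ord(ch) for ch in samples]
print(codes)  # [97, 3937, 2342, 2368]
```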

If we want to really understand Unicode, we need to have some idea of how a computer handles characters (the letters, digits, and symbols that a font draws).

How does a computer recognize and display characters?

A computer does not know anything about "letters" or "symbols". All a computer understands is numbers. We come to believe that the computer knows about letters because it can write words on the screen, but the computer is really only translating codes (numbers) into drawings on the screen (the letters that you see).

For example, if the computer wants to show the word "cat" on the screen, it finds the codes for the characters "c", "a", and "t" in its memory and replaces them on the screen with the drawings that look like "c", "a", and "t".

What are these codes? Originally it was decided to use the numbers from 0 to 127 (that's 7 bits, or 128 values) to represent a different symbol each. So, for example, the lower-case letters "c", "a", and "t" were given the codes 99, 97, and 116, respectively. The upper-case letters "C", "A", and "T" have their own codes: 67, 65, and 84.
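The codes for a word can be checked directly. A minimal Python sketch using the built-in `ord` and `chr` functions:

```python
word = "cat"

# Look up the number stored for each letter.
codes = [ord(letter) for letter in word]
print(codes)  # [99, 97, 116]

# Upper-case letters are different characters with different codes.
upper_codes = [ord(letter) for letter in word.upper()]
print(upper_codes)  # [67, 65, 84]

# Going the other way: turn the numbers back into text.
rebuilt = "".join(chr(code) for code in codes)
print(rebuilt)  # cat
```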

That worked well in the beginning when the only language that computer scientists cared about was English. They didn't need more than 128 codes to show all the characters in the English language: 26 for lower-case letters, 26 for upper-case letters, 10 for digits, and plenty left over for punctuation, symbols, and "control characters".

What happens when we go beyond English?

But English isn't the only language in the world! As time went on, it became evident that people wanted to use languages other than English on their computers. The range of codes was expanded to 0-255 (8 bits), and most of the European symbols were added. Strictly speaking, ASCII covers only 0-127, but today you will often see this larger range loosely referred to as ASCII or "extended ASCII".

Now that worked fine for European languages. But there are many more languages besides the European ones, with many, many more characters! 256 codes are not enough to give each of the world's symbols (characters) a unique number.

This is where Unicode steps in ... It provides room for more than 1 MILLION (!) codes: 1,114,112 of them, numbered from 0 to 0x10FFFF. So each of the world's symbols can be given its own, unique number. (Technically speaking: a Unicode number needs at most 21 bits, although one storage format, UTF-32, uses a full 32 bits for every symbol.)
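However the numbers are stored, the valid range of Unicode codes runs from 0 to 0x10FFFF. This can be checked from Python, whose `chr` function accepts exactly that range. A rough sketch:

```python
import sys

# The highest code Unicode allows.
highest = sys.maxunicode
print(hex(highest))   # 0x10ffff
print(highest + 1)    # 1114112 possible codes in total

# Any number in range can be turned into a character...
last = chr(highest)

# ...but a number past the end of the range is rejected.
try:
    chr(highest + 1)
    out_of_range_ok = True
except ValueError:
    out_of_range_ok = False
print(out_of_range_ok)  # False
```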

The Tibetan letters and symbols have been given the range 0x0F00-0x0FFF (hexadecimal representation) ... that is, there are 256 unique numbers set aside for Tibetan letters, digits, and other symbols, though not every one of them is in use.
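The Tibetan range can be explored with a few lines of Python. This is a sketch only: the helper `is_tibetan` is a made-up name, and the count of assigned characters depends on the Unicode version bundled with your Python, so no figure is given for it here.

```python
import unicodedata

# The Tibetan range as given in the text above.
TIBETAN_START, TIBETAN_END = 0x0F00, 0x0FFF

block_size = TIBETAN_END - TIBETAN_START + 1
print(block_size)  # 256

def is_tibetan(ch):
    """Return True if the character's code falls inside the Tibetan range."""
    return TIBETAN_START <= ord(ch) <= TIBETAN_END

print(is_tibetan("\u0f61"))  # True  (the letter ཡ)
print(is_tibetan("a"))       # False

# Collect the positions in the range that actually have a character assigned.
assigned = [cp for cp in range(TIBETAN_START, TIBETAN_END + 1)
            if unicodedata.name(chr(cp), None) is not None]
print(len(assigned))
```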

Why is it good that Tibetan (and other scripts) have their own numbers?

Remember that the computer stores the word "cat" as the numbers 99, 97, and 116. If you use a non-Unicode Tibetan font, the numbers 99, 97, and 116 would be drawn on the screen as some Tibetan characters and not as "c-a-t". Or, if I choose another language's non-Unicode font, the symbols drawn on the screen would be totally different again. These nonsense letters are what you see in a browser or other program when it does not have or recognize the font.

This means that the user has to tell the computer which font (and so which language) the document uses, because the computer does not know whether the document is English, Tibetan, or Hindi ... it just draws the symbols from the selected font for the given numbers.
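Non-Unicode fonts are hard to demonstrate in a few lines, but the old 8-bit encodings show the same problem and serve here as an analogy. In this Python sketch, one stored number (233) turns into three different characters depending on which encoding the computer is told to use. The three encodings are chosen only as examples.

```python
# One stored number, interpreted under three different 8-bit encodings.
stored = bytes([0xE9])  # the single number 233

as_western = stored.decode("latin-1")     # Western European: é
as_cyrillic = stored.decode("cp1251")     # Cyrillic (Windows): й
as_greek = stored.decode("iso-8859-7")    # Greek: ι

# Print the Unicode number each interpretation ends up with.
for label, ch in [("latin-1", as_western),
                  ("cp1251", as_cyrillic),
                  ("iso-8859-7", as_greek)]:
    print(label, f"U+{ord(ch):04X}")
```

The stored number never changes; only the interpretation does, which is why the reader of the document has to know which one was intended.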

But if we use Unicode fonts exclusively, the letters "c-a-t" will always be drawn on the screen, because Tibetan, English, and Hindi characters each have their own, different codes ... so the computer can tell from the codes alone which script each character belongs to, whatever font is selected.
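Because each script has its own codes, a program can work out what kind of character it is looking at from the number alone. A minimal Python sketch using the standard `unicodedata` module; taking the first word of the official character name as the script is a rough shortcut used here for illustration, not a general rule.

```python
import unicodedata

# "cat" followed by a Tibetan letter (ཀ) and a Devanagari letter (द).
text = "cat" + "\u0f40" + "\u0926"

# Look up the official Unicode name of each character.
names = [unicodedata.name(ch) for ch in text]
for ch, name in zip(text, names):
    print(f"U+{ord(ch):04X}", name)

# Rough shortcut: the first word of the name identifies the script.
scripts = [name.split()[0] for name in names]
print(scripts)  # ['LATIN', 'LATIN', 'LATIN', 'TIBETAN', 'DEVANAGARI']
```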

Know more

Some other questions

What is ASCII?

It is the set of 7-bit codes (0-127) for the standard Roman characters (A-Z, a-z, etc.), digits, and punctuation that most computers and peripherals use.
The name stands for American Standard Code for Information Interchange.

What is hexadecimal?

It is a numeric notation that uses 16 digits, 0-9 and A-F. That is, you start counting at 0, go up to 9, then continue with A (ten) up to F (fifteen); the next number after F is written 10 (sixteen). It is one of the most common notations used in the computer world.

If you make web pages, you have seen hexadecimal notation for color codes: like #006633 for the green headers on this page.
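Converting between hexadecimal and ordinary decimal numbers takes one call each way in Python. The sketch below also splits the color code mentioned above into its red, green, and blue parts:

```python
# Hexadecimal text to an ordinary (decimal) number, and back again.
print(int("F", 16))    # 15
print(int("10", 16))   # 16
print(int("F00", 16))  # 3840
print(hex(3840))       # 0xf00

# A web color is three two-digit hexadecimal numbers: red, green, blue.
color = "#006633"
red, green, blue = (int(color[i:i + 2], 16) for i in (1, 3, 5))
print(red, green, blue)  # 0 102 51
```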

What is UTF-8?

UTF-8 allows old, non-Unicode, English-language programs to work in a Unicode environment without modification.
UTF-8 is a way of storing numbers that keeps the old ASCII codes (0-127) compatible with the new Unicode codes: those codes are stored as the same single byte as before, and all other codes are stored as two to four bytes.
Whereas Unicode defines which symbol goes with which number in the range 0 to 0x10FFFF (about 1.1 million), the computer is free to store that number in any way that it thinks is best.
You can Google for more technical detail.
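The difference between the Unicode number and the way it is stored can be seen with Python's `encode` method. A short sketch, using the Tibetan letter ཡ (U+0F61) as the non-English example:

```python
# Plain English text: UTF-8 bytes are identical to the old ASCII codes.
print("cat".encode("utf-8") == "cat".encode("ascii"))  # True
print(list("cat".encode("utf-8")))                     # [99, 97, 116]

# A Tibetan letter needs three bytes in UTF-8.
ya = "\u0f61"
utf8_bytes = ya.encode("utf-8")
print(list(utf8_bytes))              # [224, 189, 161]

# Same character, same number, different storage formats.
print(len(ya.encode("utf-16-be")))   # 2
print(len(ya.encode("utf-32-be")))   # 4

# Decoding restores the original number.
print(ord(utf8_bytes.decode("utf-8")) == 0x0F61)  # True
```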

Some useful links

General info
Using Unicode
Unicode fonts
Database and Unicode
Many thanks to all the below, who enabled me to come up with this, hopefully useful, summary.
The many students who have allowed me to stumble through explanations of Unicode — which made things clearer to me, if not to them!
Tashi Wangdue for his informative posts on the DITG mail list,
Jonas Bonn, who kindly allowed adaptation of his 21 May 2005 post to the DITG mail list
All errors and misteaks are mine only!