On this page:
All you really need to know
How to use Unicode
Why is Unicode important?
How it works: characters (fonts) beyond English
Know more
Some questions
Links to more info
We keep hearing the word 'Unicode', and people are saying that it is better if a font is 'Unicode'. What is this 'Unicode' thing, and why is it better?
It can be interesting, but not necessary, to learn the theory. If you are interested, go to the How it works section!

Using Unicode means future-proofing your data. If you use Unicode fonts for all your documents, you are encoding your data in a format where the user's program (word processor, browser, etc.) does not have to guess which language it is in, and the data does not depend on arbitrary mappings of codes in the range 0-127 to symbols.
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
If we really want to understand Unicode, we need to have some idea of how a computer uses characters ("letters" and "symbols").
A computer does not know anything about "letters" or "symbols". All a computer understands is numbers. We come to believe that the computer knows about letters because it can write words on the screen, but the computer is really only translating codes (numbers) into drawings on the screen (the letters that you see).
For example, if the computer wants to show the word "cat" on the screen, it finds the codes for the characters "c", "a", and "t" in its memory and replaces them on the screen with the drawings that look like "c", "a", and "t".
What are these codes? Originally it was decided to use the numbers from 0 to 127 (that's 7 bits) to represent one symbol each. So, for example, the letters "c", "a", and "t" were given the codes 99, 97, and 116, respectively.
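You can check this mapping yourself. Here is a quick Python sketch (Python's built-in ord() and chr() work directly with these character codes):

```python
# Each letter in "cat" is stored as a number (its character code).
word = "cat"
codes = [ord(ch) for ch in word]   # ord() gives the code for a character
print(codes)                       # [99, 97, 116]

# chr() goes the other way: numbers back to letters.
print("".join(chr(n) for n in codes))  # cat
```

So "cat" and the list of numbers 99, 97, 116 are two views of the same data; the "letters" only appear when something draws them.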
That worked well in the beginning, when the only language that computer scientists cared about was English. They didn't need more than 128 codes to show all the characters in the English language: 26 for lower-case letters, 26 for upper-case letters, 10 for digits, and plenty left over for punctuation, symbols, and "control characters".
But English isn't the only language in the world! As time went on, it became evident that people wanted to use languages other than English on their computers. The range of codes was expanded to 0-255 (8 bits), and most Western European symbols were added. Strictly speaking, only 0-127 is ASCII; today the full 0-255 range is usually called "extended ASCII".
Now that worked fine for European languages. But there are many more languages besides just the European ones, with many, many more characters. 256 codes are not enough to give each of the world's symbols (characters) a unique number.
This is where Unicode steps in ... It provides more than a million code points (1,114,112, to be exact), so each of over a million different symbols can be given its own, unique number. (Technically speaking: a Unicode code point needs up to 21 bits, and is commonly stored in 32-bit units.)
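As a quick illustration in Python, using one letter each from the Latin, Tibetan, and Devanagari scripts, every character gets exactly one number, no matter the language:

```python
# Every character has exactly one Unicode code point, whatever its script.
for ch in ["A", "\u0f40", "\u0915"]:   # Latin A, Tibetan KA, Devanagari KA
    print(ch, hex(ord(ch)))
# A 0x41
# ཀ 0xf40
# क 0x915
```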
The Tibetan letters and symbols have been given the range U+0F00 to U+0FFF (hexadecimal representation) ... that is, there are 256 unique numbers for representing Tibetan letters, digits, and other symbols.
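You can look inside this range with Python's standard unicodedata module, which knows the official name of every assigned code point:

```python
import unicodedata

# Tibetan letter KA lives in the Tibetan block (U+0F00 to U+0FFF).
ka = "\u0f40"
print(hex(ord(ka)))          # 0xf40
print(unicodedata.name(ka))  # TIBETAN LETTER KA
assert 0x0F00 <= ord(ka) <= 0x0FFF
```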
Remember that the computer stores the word "cat" as the numbers 99, 97, and 116. If you use a non-Unicode Tibetan font, the numbers 99, 97, and 116 are drawn on the screen as Tibetan characters, not as "c-a-t". And if you choose another language's non-Unicode font, the symbols drawn on the screen are totally different again. These nonsense letters are what you see in a browser or other program when it does not recognize the font.
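Here is a short Python sketch of this effect: the same bytes, read under a different mapping, come out as unrelated symbols. (Latin-1 stands in here for "some other single-byte mapping"; any wrong mapping produces similar nonsense.)

```python
# The bytes of a UTF-8 character, read under a different single-byte
# mapping, turn into unrelated symbols: the "nonsense letters" effect.
text = "\u0915"                  # Devanagari letter KA
data = text.encode("utf-8")      # b'\xe0\xa4\x95'
wrong = data.decode("latin-1")   # same bytes, wrong mapping
print(wrong)                     # unrelated characters, not KA
assert wrong != text
```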
This means that the user has to tell the computer what language the document is in, because the computer does not know whether the document is English, Tibetan, or Hindi ... it just draws the symbols from the selected font for the given numbers.
But if we use Unicode fonts exclusively, the letters "c-a-t" are always drawn on the screen, because Unicode uses different codes for Tibetan, English, and Hindi characters. The computer now knows that the document is English, because only the codes for English letters are being used.
Stands for American Standard Code for Information Interchange.
It is the set of binary representations of the standard Roman characters
(A-Z, a-z, etc.) that most computers and peripherals use.
It is a numeric notation with 16 possible digit values, 0-9 and A-F. That is, you start counting at 0, go up to 9, and then continue with A, B, C, D, E, and finally F. It is one of the most common notations used in the computer world.
If you make web pages, you have seen hexadecimal notation for color codes: like #006633 for the green headers on this page.
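For example, in Python you can pull the red, green, and blue values out of that #006633 color by reading each pair of hex digits as a number:

```python
# A hex color like #006633 is three two-digit hex numbers: red, green, blue.
color = "006633"
r, g, b = (int(color[i:i + 2], 16) for i in range(0, 6, 2))
print(r, g, b)   # 0 102 51
```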
UTF-8 allows old, non-Unicode, English-language programs to work in a Unicode environment without modification.
UTF-8 is a way of storing numbers that keeps the old ASCII codes (0-127) compatible with the new Unicode codes...
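A quick Python check of this compatibility: plain English text produces exactly the same bytes whether you encode it as ASCII or as UTF-8, so an old ASCII-only program can read the English parts of a UTF-8 file unchanged.

```python
# ASCII text is byte-for-byte identical in UTF-8.
ascii_bytes = "cat".encode("ascii")
utf8_bytes = "cat".encode("utf-8")
print(ascii_bytes == utf8_bytes)   # True
print(utf8_bytes)                  # b'cat'
```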
Whereas Unicode defines which symbol goes with which number (the code point), the computer is free to store that number in any way that it thinks is best.
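For example, in Python the same Tibetan letter takes a different number of bytes depending on which storage format (encoding) you choose, even though its code point never changes:

```python
# One code point, several storage formats (encodings).
ka = "\u0f40"   # Tibetan letter KA, code point U+0F40
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, len(ka.encode(enc)), "bytes")
# utf-8 3 bytes
# utf-16-be 2 bytes
# utf-32-be 4 bytes
```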
You can Google for more technical detail.
All content not copyright by anyone else is
copyright © 2003–2009 James Walker.
License for use is the GNU Free Documentation License. Find it:
here in the License directory
or at the Free Software Foundation, www.fsf.org