www.howardism.org
Babblings of an aging geek in love with the Absurd, his family, and his own hubris.... oh, and Lisp.

Unicode Characters

The Story of Chess

Once Upon a Time

Once upon a time, when Sessa invented chess, he showed this to the king.

unicode-chess1.jpg

Name Your Price

The king was impressed and offered to pay Sessa.

Sessa asked for 1 grain of wheat on the first square, 2 on the next, doubling on every square after that…

unicode-chess2.jpg

How Much?

The king’s treasurer told the king that would be too much.

18,446,744,073,709,551,615 grains

About 18 quintillion (18 million trillion)
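
The treasurer's arithmetic can be checked with a short sketch, assuming each of the 64 squares holds double the grains of the one before it. The variable names are only illustrative.

```python
# Grains on each of the 64 squares: 1, 2, 4, ... doubling each time.
grains_per_square = [2 ** square for square in range(64)]
total = sum(grains_per_square)

# The sum of a doubling series is one less than the next power of two.
assert total == 2 ** 64 - 1

print(f"{total:,}")  # 18,446,744,073,709,551,615
```

That total is also the largest number an unsigned 64-bit integer can hold.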

Once Upon a Time

My first computer could only display 128 characters.

  • A byte is 8 bits
  • 128 characters need only 7 bits
  • Extra bit was a parity check

Parity Check?

7 bits of data   Count of 1s   8 bits with parity
0000000          0             0000000 0
1010001          3             1010001 1
1101001          4             1101001 0
1111111          7             1111111 1
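
The table can be reproduced with a minimal sketch, assuming even parity (the extra bit makes the total number of 1s even) and a parity bit appended at the end. The function names are made up for illustration.

```python
def even_parity_bit(data_bits: str) -> int:
    """Return the bit that makes the total number of 1s even."""
    return data_bits.count("1") % 2


def with_parity(data_bits: str) -> str:
    """Append the parity bit to 7 data bits, giving 8 bits."""
    return data_bits + str(even_parity_bit(data_bits))


def looks_intact(byte_bits: str) -> bool:
    """A received byte passes the check when its count of 1s is even."""
    return byte_bits.count("1") % 2 == 0


for data in ["0000000", "1010001", "1101001", "1111111"]:
    print(data, data.count("1"), with_parity(data))

sent = with_parity("1010001")    # "10100011"
print(looks_intact(sent))        # True: nothing changed
print(looks_intact("00100011"))  # False: one flipped bit is caught
print(looks_intact("01100011"))  # True: two flipped bits slip through
```

The last line hints at the weakness of the scheme: any even number of flipped bits goes unnoticed.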

It got better…

  • Parity not too effective.
  • Switched to 8 bits
  • Now we get 256 characters.
  • 128 MORE characters…

What should we display?

  • European characters
    • ¿Cómo golpear la piñata?
    • Góðan daginn⁈
    • Or Greek? καλημέρα
  • Or Graphic symbols and line drawings:
    ┏━━━┱───┐
    ┃ ☻ ┃ ☺ │
    ┣━━━╉───┤
    ┃ ⚑ ┃ ⚐ │
    ┗━━━┹───┘
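
Different vendors and countries answered that question differently, so the same byte could mean different characters. A small sketch, using three 8-bit encodings that ship with Python as examples (the choice of byte and encodings is only for illustration):

```python
# One byte with the high bit set (value 225), outside the original 128.
raw = b"\xe1"

# Decode the same byte with a Western European, a Greek, and an
# old IBM PC character set.
decoded = {codec: raw.decode(codec) for codec in ["latin-1", "iso8859_7", "cp437"]}

for codec, character in decoded.items():
    print(codec, character)
```

Each encoding prints a different character for the identical byte, which is why a file had to travel with knowledge of the character set it was written in.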
    

What about:

  • Chinese: 早安
  • Japanese: おはよう
  • Thai: อรุณสวัสดิ์
  • Korean: 안녕하세요
  • Bengali: সুপ্রভাত
  • Nordic Runes: ᚠᚢᚦᚨᚱᚲ
  • Math symbols: ∞ ÷ √

Gets Worse

Some languages write from right-to-left.

  • Hebrew: בֹּקֶר טוֹב
  • Arabic: صباح الخير
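
Direction turns out to be a property of each character, not of the file. A hedged sketch using Python's standard `unicodedata` module, with the first letters of the two greetings above written as escapes so the code itself stays left-to-right:

```python
import unicodedata

samples = {
    "Latin a": "a",
    "Hebrew bet": "\u05d1",   # first letter of the Hebrew greeting
    "Arabic sad": "\u0635",   # first letter of the Arabic greeting
}

# bidirectional() returns the character's direction class:
# "L" is left-to-right, "R" and "AL" are right-to-left.
directions = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}

for name, direction in directions.items():
    print(name, direction)
```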

But…

  • 8 bits was a byte … a “character”
  • Everything would have to change:
    • Computer displays
    • Programming languages
    • All programs
    • All files and databases
  • English email would suddenly be larger!
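
How much larger? A rough sketch, assuming every character is simply widened to a fixed 2 or 4 bytes (the message is a made-up example):

```python
message = "Good morning"  # 12 plain English characters

# One byte per character, then fixed-width 2-byte and 4-byte forms.
sizes = {codec: len(message.encode(codec)) for codec in ["ascii", "utf-16-le", "utf-32-le"]}

for codec, size in sizes.items():
    print(codec, size, "bytes")
```

The same 12 characters take 12, 24, or 48 bytes, so a fixed-width scheme doubles or quadruples every English file.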

Could we do both?

A book report is just a series of ones and zeros, so how should we interpret them?

unicode.png
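
The ones and zeros carry no label saying how to read them. A small sketch, assuming a word from an earlier slide stored as UTF-8 and then read back two ways:

```python
# Store a word containing one non-English letter.
data = "piñata".encode("utf-8")
print(data)                      # b'pi\xc3\xb1ata'

# Read the same bytes back with two different interpretations.
as_utf8 = data.decode("utf-8")
as_latin1 = data.decode("latin-1")

print(as_utf8)    # piñata
print(as_latin1)  # piÃ±ata
```

Nothing in the bytes changed between the two prints; only the interpretation did.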

How Big is Big?

  • Use two bytes: 65,536 characters
  • Would almost work
    • Japanese: 3,000+
    • Traditional Chinese: 10,000+
    • Really needs more

But what about Ancient Egyptian?

Need a bit more… 4 bytes would give us 4,294,967,296
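
These counts are just powers of two, one doubling per bit. A quick sketch to check them:

```python
# Number of distinct values that fit in 1, 2, and 4 bytes.
capacity = {n_bytes: 2 ** (8 * n_bytes) for n_bytes in (1, 2, 4)}

for n_bytes, count in capacity.items():
    print(n_bytes, "byte(s):", f"{count:,}")
```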

We Solved It…Kinda

Unicode, stored as UTF-8, is pretty good.

  • Each character takes 1 to 4 bytes
  • Room for 70,000+ Chinese characters
  • The first 128 characters (7 bits) are the same as on old computers (ASCII)
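
A minimal sketch of the variable width, using one sample character per size. The samples are picked for illustration; the last is an Egyptian hieroglyph written as an escape.

```python
samples = ["A", "ñ", "早", "\U00013000"]

# Encode each character as UTF-8 and record how many bytes it needs.
lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    print(f"U+{ord(ch):04X}", len(encoded), "byte(s):", bits)

# Plain English text is byte-for-byte the same as it was in ASCII.
unchanged = "Hello".encode("utf-8") == "Hello".encode("ascii")
print(unchanged)  # True
```

In the printed bits, the single-byte character starts with 0, while every byte of a longer character starts with 1. That is how a reader can tell old-style text apart from the new multi-byte characters.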