Evan Martin (evan) wrote in lj_clients,

heads-up

LiveJournal will soon be using UTF-8 for journal data. This may affect you.

In addition, we're planning to add a version key to all protocol requests (to indicate which version of the protocol the client understands). After everything has worked for a few months, we'll disable access for clients that don't support the new version (which will begin at 1, I assume). This will affect you, though I don't foresee it happening for a few months.

Update: Corrections at the end.


New protocol version
So what does the new protocol version change?
Let's see:
- UTF-8 encoding of all strings.
And... yeah, that's it. Clients that send invalid UTF-8 data will have their requests refused.
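
If you want to catch bad data before the server refuses it, a client-side check can be sketched like this. It is only a minimal illustration in Python: the helper name is_valid_utf8 is made up for this example, and the server's actual validation may be stricter or different.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


# The same text encoded two ways: only the first is valid UTF-8.
print(is_valid_utf8("caf\u00e9".encode("utf-8")))    # True
print(is_valid_utf8("caf\u00e9".encode("latin-1")))  # False
```

Plain ASCII always passes this check, which is why most English-only posts should be unaffected.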

What is UTF-8?
If you don't know what UTF-8 is, now's a good time to learn. To summarize, it's a way to support any of the world's languages in one character stream.
There are introductions to its concepts everywhere.
The official one:
http://www.unicode.org/unicode/standard/principles.html
I also liked this one:
http://gnosis.cx/publish/programming/unicode_primer.txt
and the most commonly linked FAQ is here:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

Why we're using UTF-8
Right now, people post to their journals in a variety of encodings. For example, look at avva.
If you switch your browser to reading Cyrillic (Russian), which I did by picking
View->Encoding->Cyrillic->Cyrillic (Windows-1251)
you'll see his journal as it is intended to be read.
But what if he's on my friends list? My journal isn't in Cyrillic, and neither is my (hypothetical) Japanese friend's.
If everyone posted in UTF-8, this wouldn't be a problem.
(And it's pretty cool to see Japanese kana with Cyrillic on the same page-- it works on our test servers!)
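
To make that concrete, here is a rough Python sketch of converting text from a legacy encoding to UTF-8 and then mixing it with Japanese in one string. The function name to_utf8 is invented for this example, and it assumes you already know which encoding the original bytes were in (here Windows-1251, which Python calls "cp1251").

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding as UTF-8 (hypothetical helper)."""
    return raw.decode(source_encoding).encode("utf-8")


# A Russian greeting as a Windows-1251 client would send it today.
legacy = "Привет".encode("cp1251")
utf8 = to_utf8(legacy, "cp1251")

# Once everything is UTF-8, Cyrillic and kana can share one string.
mixed = utf8.decode("utf-8") + " こんにちは"
print(mixed)
print(len(legacy), len(utf8))  # 6 bytes in Windows-1251, 12 bytes in UTF-8
```

Note that the conversion only works if the source encoding is known; the bytes themselves don't say which charset they are in.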

In summary, what you need to know:
  • This change should not affect most English-based journals, as UTF-8 is designed to be an easy transition: plain ASCII text is already valid UTF-8. (UTF stands for Unicode Transformation Format.)
  • However, posting any non-ASCII characters in a legacy 8-bit encoding (which includes anything with accents like á, annoying symbols like ©, and generally anything wacky, like ß) will no longer work. Instead, detect this and tell your user to use HTML entities, which they should be using anyway, or somehow handle the appropriate conversions (? -- I don't know how this would work for a language like French...).
  • Because a UTF-8 string can take more bytes than the same number of characters in a single-byte encoding, we may be resizing some of the fields. For example, the possible length of a subject on a post might be increased. But this won't affect you, because you're not hardcoding any array sizes into your code, right?
    Right? :)
  • Most reasonable programming languages support UTF-8 already.
    • I know Perl, Python and Java do (though it appears Java only supports a subset?)
    • I wouldn't be surprised if Objective-C (used on Mac OS X) does, because OS X supports UTF-8.
    • But, please note that this list does not include C or C++! You'll have to find a library that supports UTF-8 for those languages:
      • I believe the Microsoft Windows APIs (such as MFC) have wrapped Unicode pretty well.
      • On the Unix side, I know the to-be-released GTK+ 2.0 supports it, and I'm pretty sure KDE already supports it.
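
Two of the points above (HTML entities and byte counts) can be sketched in a few lines of Python. Both helper names are invented for this example, and numeric entities are just one way to do the conversion; this is not a description of what the server does.

```python
def to_html_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric HTML entity."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def utf8_length(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


sample = "á © ß"
print(to_html_entities(sample))          # &#225; &#169; &#223;
print(len(sample), utf8_length(sample))  # 5 characters, 8 bytes
```

The second line of output is the reason not to hardcode buffer sizes: the character count and the byte count are no longer the same number.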


Finally, if there are any people watching this community who use LiveJournal with a non-ISO-8859-1 charset, please speak up! We'll all need help testing this sort of functionality.

Updates:
(20:40:58) brad: we won't ban old clients
(20:41:02) brad: we'll ban old clients posting 8 bit
(20:41:07) brad: so mart's old client will still work fine
(20:41:11) brad: if he posts ASCII
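
One way to read brad's correction, sketched in Python. This is my interpretation, not the server's code: the function names are made up, and sends_version_key simply stands for "this client sends the new version key".

```python
def is_plain_ascii(body: bytes) -> bool:
    """True if every byte is 7-bit ASCII (no 8-bit data)."""
    return all(byte < 0x80 for byte in body)


def is_valid_utf8(body: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        body.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def request_allowed(body: bytes, sends_version_key: bool) -> bool:
    """Hypothetical rule: new clients must send UTF-8, old clients only ASCII."""
    if sends_version_key:
        return is_valid_utf8(body)
    return is_plain_ascii(body)


print(request_allowed(b"hello world", False))                # True: old client, ASCII
print(request_allowed("caf\u00e9".encode("latin-1"), False))  # False: old client, 8-bit
print(request_allowed("caf\u00e9".encode("utf-8"), True))     # True: new client, UTF-8
```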