heads-up - LiveJournal Client Discussions

heads-up [Jan. 28th, 2002|07:52 pm]
LiveJournal Client Discussions
lj_clients
[evan]
LiveJournal will soon be using UTF-8 for journal data. This may affect you.

In addition to this, we're planning to add a version key to all protocol requests (to indicate which version of the protocol the client understands) and, after everything has worked for a few months, to disable access for clients that don't support the new version (which will begin at 1, I assume). This will affect you, though I don't foresee it happening for a few months.
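
As a rough sketch of what a versioned request could look like, the snippet below adds a version key to a form-encoded request body. The key name `ver`, the other field names, and the helper `build_request` are assumptions made up for illustration; the post does not fix any of them.

```python
from urllib.parse import urlencode

PROTOCOL_VERSION = 1  # the post assumes versions will begin at 1


def build_request(fields: dict) -> bytes:
    """Return a form-encoded body with a protocol version key added.

    The key name "ver" is hypothetical; use whatever the final protocol says.
    """
    params = dict(fields)
    params["ver"] = str(PROTOCOL_VERSION)
    # urlencode percent-encodes str values as UTF-8 by default,
    # so the resulting body is pure ASCII.
    return urlencode(params).encode("ascii")


# Hypothetical fields; "caf\u00e9" is "cafe" with an e-acute.
body = build_request({"mode": "postevent", "subject": "caf\u00e9"})
print(body)
```

Keeping the version in one constant means the client declares its protocol level in exactly one place.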

Update: Corrections at the end.


New protocol version
So what does the new protocol version change?
Let's see:
- UTF-8 encoding of all strings.
And... yeah, that's it. Clients that send invalid UTF-8 data will have their requests refused.
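
Since invalid UTF-8 will be refused, a client may want to check its outgoing bytes first. A minimal sketch, assuming the client already holds the data as bytes (the function name `is_valid_utf8` is made up here):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError:
        return False
    return True


russian = "\u043f\u0440\u0438\u0432\u0435\u0442"  # a Russian greeting in Cyrillic

print(is_valid_utf8(b"plain ASCII"))            # ASCII is always valid UTF-8
print(is_valid_utf8(russian.encode("utf-8")))   # properly encoded text
print(is_valid_utf8(russian.encode("cp1251")))  # legacy 8-bit bytes
```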

What is UTF-8?
If you don't know what UTF-8 is, now's a good time to learn. To summarize, it's a way to support any of the world's languages in one character stream.
There are introductions to its concepts everywhere.
The official one:
http://www.unicode.org/unicode/standard/principles.html
I also liked this one:
http://gnosis.cx/publish/programming/unicode_primer.txt
and the most commonly linked FAQ is here:
http://www.cl.cam.ac.uk/~mgk25/unicode.html

Why we're using UTF-8
Right now, people post to their journals in a variety of encodings. For example, look at avva.
If you switch your browser to reading Cyrillic (Russian), which I did by picking
View->Encoding->Cyrillic->Cyrillic (Windows-1251)
you'll see his journal as it is intended to be read.
But what if he's on my friends list? My journal isn't in Cyrillic, and neither is my (hypothetical) Japanese friend's.
If everyone posted in UTF-8, this wouldn't be a problem.
(And it's pretty cool to see Japanese kana with Cyrillic on the same page-- it works on our test servers!)
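
To make the problem concrete, here is a small sketch of what happens when Windows-1251 bytes are read as Latin-1, and how a single UTF-8 stream holds both scripts. The sample strings are invented for illustration.

```python
cyrillic = "\u043f\u0440\u0438\u0432\u0435\u0442"  # Russian greeting
kana = "\u3053\u3093\u306b\u3061\u306f"            # Japanese greeting in hiragana

# A Cyrillic post stored as Windows-1251, then read by a Latin-1 viewer:
legacy_bytes = cyrillic.encode("cp1251")
misread = legacy_bytes.decode("latin-1")  # accented Latin gibberish

# Windows-1251 has no room for kana at all:
try:
    kana.encode("cp1251")
    kana_fits_in_cp1251 = True
except UnicodeEncodeError:
    kana_fits_in_cp1251 = False

# In UTF-8, both scripts share one byte stream and round-trip cleanly:
page = (cyrillic + " " + kana).encode("utf-8")
roundtrip = page.decode("utf-8")
print(misread, kana_fits_in_cp1251, roundtrip == cyrillic + " " + kana)
```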

In summary, what you need to know:
  • This change should not affect most English-based journals, as UTF-8 is designed to be an easy transition: plain 7-bit ASCII text is already valid UTF-8. (UTF stands for Unicode Transformation Format.)
  • However, posting any non-standard characters (which includes anything with accents like á, annoying symbols like ©, and generally anything wacky, like ß) will no longer work. Instead, detect this and tell your user to use HTML entities, which they should be using anyway, or somehow handle the appropriate conversions (? -- I don't know how this would work for a language like French...).
  • Because UTF-8 strings have the potential to be more bytes than a similarly-sized (by character count) unencoded string, we may be resizing some of the fields-- for example, the possible length of a subject on a post might be increased. But this won't affect you, because you're not hardcoding any array sizes into your code, right?
    Right? :)
  • Most reasonable programming languages support UTF-8 already.
    • I know Perl, Python and Java do (though it appears Java only supports a subset?)
    • I wouldn't be surprised if Objective-C (used on Mac OS X) does, because OS X supports UTF-8.
    • But, please note that this list does not include C or C++! You'll have to find a library that supports UTF-8 for those languages:
      • I believe the Microsoft Windows APIs (such as MFC) have wrapped Unicode pretty well.
      • On the Unix side, I know the to-be-released GTK+ 2.0 supports it, and I'm pretty sure KDE already supports it.
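
On the point about field sizes: if a field is limited in bytes rather than characters, a client has to avoid cutting a multibyte character in half. A minimal sketch, assuming a byte limit (the limit value and the name `truncate_utf8` are hypothetical):

```python
def truncate_utf8(text: str, max_bytes: int) -> bytes:
    """Encode text as UTF-8, keeping at most max_bytes without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return data
    # Cut at the byte limit, then drop any incomplete trailing sequence.
    return data[:max_bytes].decode("utf-8", errors="ignore").encode("utf-8")


subject = "r\u00e9sum\u00e9"  # 6 characters, 8 bytes in UTF-8
clipped = truncate_utf8(subject, 7)
print(len(subject), len(subject.encode("utf-8")), clipped)
```

Cutting at byte 7 would land in the middle of the final accented letter, so the sketch drops that letter entirely rather than send half of it.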


Finally, if there are any people watching this community who use LiveJournal with a non-ISO-8859-1 charset, please speak up! We'll all need help testing this sort of functionality.

Updates:
(20:40:58) brad: we won't ban old clients
(20:41:02) brad: we'll ban old clients posting 8 bit
(20:41:07) brad: so mart's old client will still work fine
(20:41:11) brad: if he posts ASCII

Comments:
From: compwiz
2002-01-28 04:59 pm (UTC)
Is there a difference between Unicode & UTF-8? Or is this just a matter of a copyrighted term, like Firewire and IEEE1394?
From: evan
2002-01-28 05:05 pm (UTC)
Unicode is the character standard-- for example, Unicode character 0x30A1 is the small Katakana A.

UTF-8 is a method for encoding Unicode that's backwards compatible with ASCII. It uses a varying number of bytes per character and does sorta-weird hacks with the high bits (byte values above 127) to encode multibyte characters.

There are other Unicode encodings. For example, you could encode every character as a flat four bytes. But your filesizes would be huge. :P
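
For a feel of the size difference, this sketch compares UTF-8 with a flat four-bytes-per-character encoding (UTF-32) for a few sample characters. The helper name `encoded_sizes` is made up for illustration.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-32 byte count) for one character."""
    # "utf-32-be" is used so no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))


samples = {
    "ASCII letter A": "A",
    "e-acute (U+00E9)": "\u00e9",
    "small Katakana A (U+30A1)": "\u30a1",
}
for name, ch in samples.items():
    utf8_len, utf32_len = encoded_sizes(ch)
    print(f"{name}: UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")
```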
From: cryo
2002-01-28 05:11 pm (UTC)
If you're finally going to roll a rev on the protocol, then there should be 'other things' fixed, too. We all had a difficult time with the existing protocol and I'm sure there are numerous tiny things that can be done to make clients easier and more compact.

As for UTF-8 on the OSX client, the conversion should have minimal impact. I'll look into adding it this weekend.
From: evan
2002-01-28 05:22 pm (UTC)
Propose away!
What needs changing?
From: mart
2002-01-28 05:26 pm (UTC)

If it's impossible to support UTF-8 in a given implementation, would it be acceptable just to force users to only send 7-bit ASCII characters and specify that a given client is "English-only", and still comply with the new protocol version? I'm pretty sure this is technically valid, since 7-bit ASCII is already valid UTF-8, but if the protocol says no then it's not allowed anyway.

From: evan
2002-01-28 05:46 pm (UTC)
Yeah.
Updated the post after talking to Brad.
From: avva
2002-01-28 06:00 pm (UTC)
Err.. the thing is, you ought to use Cyrillic (Windows), not Cyrillic (KOI8-R) to view my journal ;)

Also, the suggestion to use HTML entities is not so hot IMHO. One of the benefits of using UTF-8 is that we don't *need* HTML entities for things like nonstandard characters anymore. They should just be specified by their UTF-8 codes.

Using HTML entities might be a not-so-wonderful cop-out strategy for a client where the author is sure that users will write mostly in English, with an occasional accented character now and then, and the author doesn't want to code UTF-8 support and up the protocol version, so he stays at version 0 and ensures 7bit by encoding the occasional accented characters as HTML entities.

But I would really hate it if people used HTML entities as a general-purpose solution to avoid coding UTF-8 support, and international users would use those clients to post bloated entries in ugly HTML-entitese.
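
For reference, the 7-bit fallback described above can be sketched with Python's built-in `xmlcharrefreplace` error handler, which turns each non-ASCII character into a numeric character reference. The function name `to_seven_bit` is made up here, and this is only one possible way to do it.

```python
def to_seven_bit(text: str) -> bytes:
    """Encode text as pure ASCII, replacing other characters with &#NNN; references."""
    return text.encode("ascii", errors="xmlcharrefreplace")


mostly_english = to_seven_bit("caf\u00e9")
russian = "\u043f\u0440\u0438\u0432\u0435\u0442"  # six Cyrillic letters
as_entities = to_seven_bit(russian)

print(mostly_english)
print(len(as_entities), "bytes as entities vs", len(russian.encode("utf-8")), "bytes as UTF-8")
```

The byte counts show the bloat: an occasional accent costs little, but text written entirely in a non-Latin script grows several times over.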
From: evan
2002-01-28 06:03 pm (UTC)
Ack... fixed the encoding thing.

What about entities like &copy;?
From: ex_snej373
2002-01-28 06:55 pm (UTC)
"posting any non-standard characters [...] will no longer work. Instead, detect this and tell your user to user HTML entities, which they should be using anyway, or somehow handle the appropriate conversions"

Uh, I don't understand. Why would non-ascii characters not work? They work fine in regular web pages. And what's the point of declaring that the protocol strings are encoded in UTF-8 if you can't send non-ascii characters in posts? Am I misreading something?

I would hate to use a client that made me type things like "&eacute;" instead of being able to type a real e-acute. I would hate this a dozen times more if I were posting in a language that actually used such characters. What about Russian or Japanese? Now I'm totally confused.

As for language support: Objective-C's somewhat-built-in NSString class is based on Unicode with full support for all popular encodings including UTF-8. Mac OS X clients using other languages or frameworks can use the CFString API, which is a procedural near-equivalent. Java has full Unicode and UTF-8 support on all platforms.
From: evan
2002-01-28 07:11 pm (UTC)
Er, right-- é (the character, not the HTML entity I'm using here to produce the character) is a valid UTF-8 character, so I think you'd use that. For a language like French or Russian you'll have a keymap already working (to type them in other programs) and Japanese is so complicated you don't even want to begin talking about it (good/expensive Japanese software actually recognizes the words around what you're typing and modifies the output accordingly-- it actually has to understand grammar!).

The only place I'm not certain about the use of entities is in cases like &copy;...
though it appears 0x00A9 is a copyright sign...

(See above for avva's comments.)
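
A quick sketch of the relationship between a named entity and the character it stands for, using Python's standard `html` module. The sample string is invented for illustration.

```python
import html

# Named entities stand for ordinary Unicode characters.
text = html.unescape("&copy; 2002 caf&eacute;")
first_code_point = ord(text[0])  # the copyright sign

# Once it is a real character, it encodes to UTF-8 like any other.
wire = text.encode("utf-8")
print(hex(first_code_point), wire)
```

Note that a blanket `html.unescape` would also turn `&lt;` into a literal `<`, so a real client would not want to apply it blindly to markup.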
From: ex_snej373
2002-01-28 06:56 pm (UTC)

Encodings of existing posts?

Oh also: when the new protocol is used to retrieve old posts, what encoding will LJ assume those posts to have? I would guess WinLatin1 (CP-1252) since that's what seems to work best today (it's what I assume in my OS X client.) Right?
From: evan
2002-01-28 07:08 pm (UTC)

Re: Encodings of existing posts?

I'm not sure. The details of it escape me, to be honest.

brad and avva are cooking up some strange heuristics and I think some of it will be user-specified...
From: thelovebug
2002-01-28 09:14 pm (UTC)
<whispers>PHP supports UTF-8 too! :-)

What about things like < and >? They would still have to be entered as &lt; and &gt;, yeah?

Oh heck, and what about & as &amp;? LOL
From: bradfitz
2002-01-28 10:27 pm (UTC)
HTML escaping has nothing to do with UTF-8.

Glad your PHP supports UTF-8.
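
To illustrate that the two steps are independent, here is a minimal sketch in Python: markup characters get escaped for HTML, and the result is then encoded to bytes, with neither step affecting the other. The sample string is made up.

```python
import html

raw = "1 < 2 & caf\u00e9"

# Step 1: HTML escaping only touches markup characters (&, <, >).
escaped = html.escape(raw, quote=False)

# Step 2: character encoding turns the whole string into bytes.
wire = escaped.encode("utf-8")
print(escaped, wire)
```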
From: sapphirecat
2002-01-29 02:32 am (UTC)

What about comments and old entries on talkread.bml?

How will old Latin1 comments work on talkread.bml after an old post is edited and stored in UTF-8? Could you have some sort of daemon convert comments to UTF-8 when an old post is edited and resubmitted? Otherwise, things like � could get really ugly...
From: avva
2002-01-29 05:15 am (UTC)

Re: What about comments and old entries on talkread.bml?

Every comment that is not UTF-8 and not pure ASCII has been so marked individually; when it is displayed it's converted to UTF-8 on-the-fly based on the user's default encoding field, or, if there's none, the 8bit characters will be replaced with question marks. There's no connection between a post and a comment in that respect; one of them can be translated to UTF-8 while the other remains 8bit non-ASCII.
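
The display rule described here could be sketched roughly as follows. This is not the server's actual code; the function name, the `is_unicode` flag, and the `default_encoding` parameter are all assumptions standing in for the per-comment mark and the user's default encoding field.

```python
from typing import Optional


def comment_for_display(raw: bytes, is_unicode: bool,
                        default_encoding: Optional[str]) -> str:
    """Return displayable text for a stored comment (hypothetical sketch)."""
    if is_unicode:
        # Already UTF-8 (pure ASCII also decodes fine here).
        return raw.decode("utf-8")
    if default_encoding:
        # Marked as legacy 8-bit: convert using the user's default encoding.
        return raw.decode(default_encoding, errors="replace")
    # No encoding known: keep ASCII, show "?" for every 8-bit byte.
    return "".join(chr(b) if b < 0x80 else "?" for b in raw)


old_comment = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")
print(comment_for_display(old_comment, False, "cp1251"))
print(comment_for_display(old_comment, False, None))
```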
From: visions
2002-01-29 07:20 am (UTC)
From: billemon
2002-01-29 11:50 am (UTC)
UTF-8 shouldn't be a problem for C/C++ because it's an encoding that's designed to be safe in an environment that uses null-terminated strings (less of an issue in C++). So most code should continue to work as before (as long as you don't start using any non-ASCII characters in bits of text that form parts of the LJ client protocol itself).
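
The null-termination claim can be checked directly. The sketch below, written in Python rather than C for brevity, verifies that every non-ASCII character encodes to bytes that are all 0x80 or higher, so a zero byte (or any other ASCII byte) never appears inside a multibyte sequence. The function name is made up.

```python
def utf8_bytes_all_high(code_point: int) -> bool:
    """True if every byte of the character's UTF-8 encoding is >= 0x80."""
    return min(chr(code_point).encode("utf-8")) >= 0x80


# Check every non-ASCII code point; surrogates are skipped because
# they cannot be encoded as UTF-8 at all.
nul_safe = all(
    utf8_bytes_all_high(cp)
    for cp in range(0x80, 0x110000)
    if not 0xD800 <= cp <= 0xDFFF
)

sample = "caf\u00e9 \u043f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")
print(nul_safe, b"\x00" in sample)
```

This is why byte-oriented C functions that stop at the first zero byte keep working on UTF-8 text, though they count bytes rather than characters.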