How to Download Entries [Mar. 8th, 2004|10:13 am]
LiveJournal Client Discussions


Okay. I've seen a bunch of posts in the past where people are talking about downloading entries, and they're totally going about it the wrong way. So I'm going to try to set the record straight: I'll outline the various ways of getting entries from the servers, the relative merits of each, and the one I recommend.

First off, support Unicode. If you write a client and release it at all, it will be used by people who need Unicode support. LiveJournal has a huge community of users who don't necessarily keep their journals in English. The Russian community is huge, for example, and their journals require Unicode to post and view entries. Java supports Unicode fairly natively; Delphi users will want to use the WideString type instead of plain old String; for other languages, you'll need to figure out how best to handle Unicode yourself.
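A minimal sketch of Unicode-safe request building in Python follows. The helper name encode_request and the sample field values are illustrative assumptions; the only point is that the form body is encoded as UTF-8 and ver is set to 1.

```python
from urllib.parse import urlencode, parse_qs

def encode_request(fields):
    """Encode request fields as a UTF-8 form body (hypothetical helper)."""
    # ver=1 declares that the client sends and expects Unicode text.
    params = {"ver": "1", **fields}
    return urlencode(params, encoding="utf-8").encode("ascii")

# Illustrative values only: a Russian subject line survives the round trip.
body = encode_request({"mode": "postevent", "subject": "Привет, мир"})
decoded = parse_qs(body.decode("ascii"), encoding="utf-8")
```

Decoding the body again, as the last line does, is a cheap way to check that nothing was mangled before it goes on the wire.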

In general, there are four methods of downloading entries with the getevents protocol mode: lastn, syncitems, one, and day. These four methods are specified in the selecttype variable of the getevents call. I will discuss each of these and when to use them.
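As a rough sketch, a client might build the getevents fields for each selecttype like this. The function name is made up, authentication fields are left out, and the extra field names (howmany, year, month, day, itemid, lastsync) are given as I recall them from the protocol documentation, so check them there before relying on this.

```python
# Extra fields each selecttype needs (assumed from the protocol docs).
# howmany is optional for lastn, so it is not listed as required.
REQUIRED = {
    "lastn": (),
    "day": ("year", "month", "day"),
    "one": ("itemid",),
    "syncitems": ("lastsync",),
}

def getevents_params(user, selecttype, **extra):
    """Build the form fields for a getevents call (auth fields omitted)."""
    if selecttype not in REQUIRED:
        raise ValueError(f"unknown selecttype: {selecttype!r}")
    missing = [name for name in REQUIRED[selecttype] if name not in extra]
    if missing:
        raise ValueError(f"{selecttype} needs: {', '.join(missing)}")
    params = {"mode": "getevents", "user": user, "ver": "1",
              "selecttype": selecttype}
    params.update({key: str(value) for key, value in extra.items()})
    return params

# "exampleuser" and the numbers are placeholders.
recent = getevents_params("exampleuser", "lastn", howmany=5)
single = getevents_params("exampleuser", "one", itemid=1234)
```

Rejecting an unknown selecttype on the client side keeps a typo from turning into a confusing server error.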

The lastn mode is most effectively used when you are providing the user a snapshot of their recent entries, when you simply want to get their most recently posted entry to verify that the entry you just posted went through, or when you want to let the user edit their most recent entry.

You should not use this mode to download an entire journal. I don't believe you can specify a huge number that would give you their entire journal (unless their journal was a few dozen entries only).

The day mode is useful for people who are writing calendars and want to get the entries for a day that the user has clicked on. It should be used in conjunction with the getdaycounts protocol mode: first figure out on which days the user has posted, then get the entries for a particular date.
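A small sketch of the calendar flow described above, assuming the flat interface (responses arrive as alternating key and value lines) and assuming getdaycounts returns one YYYY-MM-DD key per day with the post count as its value. The sample response text is made up.

```python
import re
from datetime import date

def parse_flat(text):
    """Parse a flat-protocol response: alternating key and value lines."""
    lines = text.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))

def days_with_entries(response):
    """Return {date: count} for every day that has at least one entry."""
    days = {}
    for key, value in response.items():
        match = re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", key)
        if match and int(value) > 0:
            days[date(*map(int, match.groups()))] = int(value)
    return days

# Made-up getdaycounts response for illustration.
sample = "success\nOK\n2004-03-07\n2\n2004-03-08\n1\n"
calendar = days_with_entries(parse_flat(sample))

# When the user clicks a day that has posts, ask for that day's entries.
clicked = date(2004, 3, 8)
if clicked in calendar:
    day_request = {"mode": "getevents", "selecttype": "day",
                   "year": str(clicked.year), "month": str(clicked.month),
                   "day": str(clicked.day)}
```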

This mode should never be used for enumerating someone's journal and downloading their entries. There is one quirk to this mode that causes me to say that: if, for some reason (non-Unicode client, for example), the server is unable to send you a particular entry, it will instead send you text indicating that the entry's subject and body are "(cannot be shown)". It doesn't TELL you it's done this, so you end up thinking that's the user's real entry and blow away whatever they had.
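One defensive measure is to refuse to store anything that looks like the substitute text. This is only a heuristic sketch: the placeholder string is the one quoted above, the function names are invented, and a user could in principle write that exact text themselves.

```python
PLACEHOLDER = "(cannot be shown)"  # text as quoted in the post; an assumption

def looks_like_placeholder(entry):
    """True if the subject or the body equals the server's substitute text."""
    subject = entry.get("subject", "").strip()
    body = entry.get("event", "").strip()
    return PLACEHOLDER in (subject, body)

def store_entry(store, itemid, fetched):
    """Save a fetched entry unless it looks like a placeholder."""
    if looks_like_placeholder(fetched):
        return False  # keep whatever was stored before
    store[itemid] = fetched
    return True

store = {7: {"subject": "Trip", "event": "Real text"}}
kept = store_entry(store, 7, {"subject": PLACEHOLDER, "event": PLACEHOLDER})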

When you want to download a handful of entries scattered about, you can use the one mode to get them. It's usually fairly safe to download an entry with this mode and then resubmit it to the server. Example: you use getdaycounts to show a calendar, then you use the day mode to show entries for that day, then you use this mode to get the real entry for editing.

If you are trying to download someone's entire journal, syncitems is the mode to use. It is the only way you can account for edits that the user has made to their entries without using your client. It is also the most efficient way of downloading entries, because the server will send you a whole bunch at a time (100 last I checked). This mode is used in conjunction with the appropriately titled syncitems client protocol mode.
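For reference, here is a sketch of reading a syncitems response, assuming the flat interface and the response keys sync_count, sync_total, sync_N_item, sync_N_action and sync_N_time. The sample response, including the non-entry item C-55, is made up.

```python
def parse_flat(text):
    """Parse a flat-protocol response: alternating key and value lines."""
    lines = text.splitlines()
    return dict(zip(lines[0::2], lines[1::2]))

def parse_syncitems(response):
    """Return (sync_total, [(item, action, time), ...])."""
    count = int(response.get("sync_count", 0))
    items = [(response[f"sync_{n}_item"],
              response[f"sync_{n}_action"],
              response[f"sync_{n}_time"]) for n in range(1, count + 1)]
    return int(response.get("sync_total", 0)), items

# Made-up response: one journal entry (L-) and one item of another type.
sample = "\n".join([
    "success", "OK",
    "sync_count", "2",
    "sync_total", "2",
    "sync_1_item", "L-1001",
    "sync_1_action", "create",
    "sync_1_time", "2004-03-01 10:00:00",
    "sync_2_item", "C-55",
    "sync_2_action", "create",
    "sync_2_time", "2004-03-02 11:30:00",
])
total, items = parse_syncitems(parse_flat(sample))
journal_items = [item for item in items if item[0].startswith("L-")]
```

sync_total is the number of items the server says are waiting to be synced, while sync_count is how many of them came back in this one response.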

NOW! It is time for an example of how to use this mode properly to download someone's entire journal. Alright, let's talk some pseudocode:

send client request "syncitems" with the "lastsync" variable not specified
get list of items back from request, save items into list for processing later
(sync_total is the total item count the server reports in the syncitems response)
while size_of_list < sync_total {
	find most recent time in list
	call "syncitems" again, but set "lastsync" to most recent time
	push result items onto list
}
iterate through list and remove items that don't start with "L-" (L means 'log' which is a journal entry)
create hash of journal itemids with data { downloaded => 0, time => whatever sync_X_time was }
while (any item in hash has downloaded == 0) {
	find the oldest "time" in this hash for items that have downloaded == 0
	decrement this time by one second :P
	mark THIS item as downloaded (so we don't use the same time twice and loop forever)
	send client request "getevents" with selecttype set to syncitems, lastsync set to oldest time minus 1 second
	mark each item you get back as downloaded in your hash
	put the entries you got into storage somewhere
}
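The pseudocode above can be sketched in Python roughly as follows. This is not the Perl code mentioned in this post. The call argument is a hypothetical function that sends one request and returns the parsed response as a dict, and make_fake_server is an in-memory stand-in so the sketch runs without a network. The events_N_* response keys are assumed from the flat protocol.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

def fetch_sync_list(call):
    """Phase 1: repeat syncitems until sync_total items have been collected."""
    items, total, lastsync = {}, None, None
    while total is None or len(items) < total:
        params = {"mode": "syncitems"}
        if lastsync is not None:
            params["lastsync"] = lastsync
        resp = call(params)
        if total is None:
            total = int(resp["sync_total"])  # total reported by the first call
        before = len(items)
        for n in range(1, int(resp["sync_count"]) + 1):
            items[resp[f"sync_{n}_item"]] = resp[f"sync_{n}_time"]
        if len(items) == before:
            break  # no new items: stop instead of resending the same lastsync
        lastsync = max(items.values())  # this timestamp format sorts as text
    return items

def download_entries(call, sync_items):
    """Phase 2: fetch entry bodies with getevents, selecttype=syncitems."""
    pending = {int(item[2:]): t for item, t in sync_items.items()
               if item.startswith("L-")}
    entries = {}
    while pending:
        oldest_id = min(pending, key=pending.get)
        # Popping marks this item as handled, so its time is never reused.
        oldest = datetime.strptime(pending.pop(oldest_id), FMT)
        lastsync = (oldest - timedelta(seconds=1)).strftime(FMT)
        resp = call({"mode": "getevents", "selecttype": "syncitems",
                     "lastsync": lastsync})
        for n in range(1, int(resp.get("events_count", 0)) + 1):
            itemid = int(resp[f"events_{n}_itemid"])
            entries[itemid] = resp[f"events_{n}_event"]
            pending.pop(itemid, None)
    return entries

def make_fake_server(journal, page=2):
    """Stand-in for the network: journal maps itemid -> (modified time, text)."""
    def call(params):
        after = params.get("lastsync", "")
        rows = sorted((t, i, text) for i, (t, text) in journal.items()
                      if t > after)
        batch = rows[:page]
        if params["mode"] == "syncitems":
            resp = {"sync_total": str(len(rows)), "sync_count": str(len(batch))}
            for n, (t, i, _) in enumerate(batch, 1):
                resp.update({f"sync_{n}_item": f"L-{i}",
                             f"sync_{n}_action": "create",
                             f"sync_{n}_time": t})
        else:
            resp = {"events_count": str(len(batch))}
            for n, (t, i, text) in enumerate(batch, 1):
                resp.update({f"events_{n}_itemid": str(i),
                             f"events_{n}_event": text})
        return resp
    return call

journal = {i: (f"2004-01-0{i} 10:00:00", f"entry {i}") for i in range(1, 6)}
call = make_fake_server(journal)
entries = download_entries(call, fetch_sync_list(call))
```

In a real client, call would post the fields to the server and parse the reply; the fake server here only pages two items at a time to exercise both loops.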

That's it. You will have to call syncitems and getevents a bunch of times each to get the data you need, but this isn't a problem if you do it smartly. Also note that the server keeps track of the times you use when you call getevents, and if you start specifying the same time repeatedly (infinite loop or something) then your client will be given an error message "Perhaps the client is broken?" or something like that.
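To catch a runaway loop before the server does, a client can keep track of the lastsync values it has already sent. A minimal sketch, with an invented class name; it is stricter than the server, since the post does not say how many repeats the server tolerates.

```python
class LastSyncGuard:
    """Refuse to send the same lastsync value twice in one download run."""

    def __init__(self):
        self._seen = set()

    def check(self, lastsync):
        """Return lastsync unchanged, or raise if it was already used."""
        if lastsync in self._seen:
            raise RuntimeError(
                f"lastsync {lastsync!r} already used; aborting to avoid a loop")
        self._seen.add(lastsync)
        return lastsync

guard = LastSyncGuard()
first = guard.check("2004-03-01 09:59:59")
```

Wrapping every lastsync value in guard.check() before sending it makes a bug surface as a local exception rather than as a server-side error.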

I make no warranty as to my pseudocode. I based it off of the Perl code I wrote that downloads entire journals and uploads them to other services, and I haven't had any problems with it. This is also the algorithm that is used in LochJournal's history code that I'm working on. And remember, set ver to 1 or you will have no end of trouble!

From: hythloday
2004-03-08 11:09 am (UTC)
Thanks for taking the time to write this.

As an addendum, Python can support Unicode natively with the right compile-time options - as far as I know only Gentoo and Red Hat package Python with these options.
From: evan
2004-03-08 11:22 am (UTC)
Here's a mostly-working implementation in C:


(Note that you may only use this code in GPLed code; if you're not willing to put your code under that license, don't look at it.)
From: benzado
2004-03-08 08:39 pm (UTC)
You should add that to the protocol documentation. This is useful but no one will see this entry a month from now.

(And getchallenge ought to be formally documented, too.)
From: benzado
2004-03-08 09:54 pm (UTC)
By "oldest" time, you mean "earliest", right?
From: marksmith
2004-03-09 09:25 am (UTC)
That's correct. Old and early are the same in this case.
From: perplexes
2004-03-09 09:44 am (UTC)
Lastn can be specified with up to 50 entries; however, the LiveJournal servers will only give you journal batches in increments of 50 entries. So if your journal's entry count mod 50 != 0, you can't use it that way.

I found that if you just ask for the default 'lastn', which is 20, it will return packs of 20, and a pack of fewer than 20 at the end. This is the mode I used.

However, I will agree that syncitems may be the method to use. The last time I tried it, it returned to me 100 entries that were in random order, and I wasn't cool with that. I'll try again.
From: marksmith
2004-03-09 09:51 am (UTC)
They're not random. It returns them in order of modification. So, if you go back and edit an entry from 1999, it will show up when you do a sync and specify a lastsync of 2003. This is the only way to account for edits that the user makes on the web site or with another client.

syncitems - returns list of events modified/created/deleted after lastsync time
getevents - selecttype=syncitems, returns the actual events
From: camdez
2004-03-23 10:29 pm (UTC)
Thanks for posting this. I think I follow for the most part, but what is the value of sync_total?