lj to email? screenscraping? - LiveJournal Client Discussions

lj to email? screenscraping? [Nov. 13th, 2002|06:08 am]
lj_clients

[gravitrue]
I want to participate in lj, but I can't deal with reading journals
on webpages; the load time, the top-down posting order,
and not being able to tell what I've looked at already all make it feel
like I'm trying to swim in oobleck. The web as a messaging
platform is just not as user-friendly, fast, or featureful as email
and usenet. As far as I've been able to discern, the lj client protocol is fine for posting entries but not designed for reading them.

What I really want is to have my friends' posts emailed to me,
and followup comments mailed under the same subject heading;
as far as I can tell, one would need to write a screenscraper to download
and chunk up the html in order to do this... am I right?
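
To make the "same subject heading" idea concrete, here is a minimal sketch of how fetched posts and comments could be turned into threaded mail once the data has been obtained somehow. It uses only Python's standard `email` package. The `lj-gateway.example` domain, the function names, and the Message-ID scheme are assumptions for illustration; nothing here is part of LiveJournal.

```python
from email.message import EmailMessage

# Hypothetical domain, used only to build unique Message-IDs.
DOMAIN = "lj-gateway.example"


def entry_message(journal, entry_id, subject, body, to_addr):
    """Build the mail for a journal entry; it starts a new thread."""
    msg = EmailMessage()
    msg["From"] = f"{journal} <{journal}@{DOMAIN}>"
    msg["To"] = to_addr
    msg["Subject"] = subject
    msg["Message-ID"] = f"<{journal}.{entry_id}@{DOMAIN}>"
    msg.set_content(body)
    return msg


def comment_message(entry_msg, commenter, comment_id, body):
    """Build the mail for a comment, threaded under its entry."""
    parent_id = str(entry_msg["Message-ID"])
    msg = EmailMessage()
    msg["From"] = f"{commenter} <{commenter}@{DOMAIN}>"
    msg["To"] = str(entry_msg["To"])
    msg["Subject"] = "Re: " + str(entry_msg["Subject"])
    msg["Message-ID"] = f"<comment.{comment_id}@{DOMAIN}>"
    # Mail readers group messages into threads using these two headers.
    msg["In-Reply-To"] = parent_id
    msg["References"] = parent_id
    msg.set_content(body)
    return msg
```

Most mail clients thread on `In-Reply-To` and `References` rather than on the subject text, so setting both keeps the followups grouped even if a subject is edited.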

Is there anyone who's written code to do any of this?
My ideal platform would be perl on linux.
I've come up with some pseudocode for what would need to happen...

As best I can tell, the view?type=month pages would be the easiest format to scrape; does the format change often?
Is it affected by user journal style?

I do realize the underlying pages linked to off of a month view would vary by user journal style; is there any way to get around having to parse each different journal style one wishes to read?

Should I be asking on lj_dev? Is there any chance a mail engine could
get built into the server anytime soon (say, the next six to eight months),
or would the whole idea be objectionable to the server dev folks on some
religious basis?

Comments:
From: benzado
2002-11-13 06:53 am (UTC)
If you want to scrape, the best bet is to make a custom style for your friends page that is easy to parse.

Although it won't solve all of your problems, you may want to look into RSS and a desktop aggregator program. Your LJ has an RSS feed at http://www.livejournal.com/users/gravitrue/rss .
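
As a rough illustration of what reading such a feed involves, the sketch below pulls the title, link, and description out of each item using Python's standard library. The sample feed text is made up and stands in for a downloaded feed; the elements a real LiveJournal feed carries may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a downloaded feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>example journal</title>
    <item>
      <title>first post</title>
      <link>http://www.example.com/users/someone/1234.html</link>
      <description>Entry text goes here.</description>
    </item>
    <item>
      <title>second post</title>
      <link>http://www.example.com/users/someone/1290.html</link>
      <description>More entry text.</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return a list of dicts, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


for entry in parse_items(SAMPLE_RSS):
    print(entry["title"], entry["link"])
```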

An NNTP to LiveJournal gateway would be very interesting, though...

From: dottey
2002-11-13 07:10 am (UTC)
I don't think the LJ admins really like any form of automated screenscraping... But I'm not an admin, so I could be wrong.

They would probably much rather someone write a protocol-based backend to be placed on the servers which would spit out only the necessary information.

The HTML output of the normal views is sorta bulky (and mostly wasted) when screenscraping. Though I guess it is true you could use the simplified RSS feed.

From: ayoub
2002-11-13 07:21 am (UTC)
I've been meaning to try to get the RSS feed parsed into an Exchange folder, but I haven't got round to it yet...

From: gravitrue
2002-11-13 07:32 am (UTC)

rss update on comments?

the rss feed doesn't indicate number of comments,
all it seems to give is a list of most recent posts.

This would let me check for a new post but not for comments on a post, whereas scraping the month view would give me both...
how is rss then an advantage, given that
I'd have to scrape something else for comments anyway?

am I missing something?

From: herbie
2002-11-13 07:46 am (UTC)

Re: rss update on comments?

Yeah... RSS doesn't hit the server like mad. You have to realize that LJ has a history of being hammered and having the servers slow down. A major part (most?) of that processing time goes to accessing the database, cooking up HTML based on styles, and serving it for your viewing pleasure. If you're going to throw all that out, why bother? Especially if you're going to hit it periodically with a bot. Even custom styles are a chore; I'm sure the RSS-ifying process is faster than an RSS style (I could be wrong - it could simply be an RSS style)... So, yeah - no advantage to you directly, except that if everybody scraped LJ, you would be lucky to get on yours.

From: gravitrue
2002-11-13 08:00 am (UTC)

Re: rss update on comments?

I do understand that rss is intended to decrease server load,
and yes, I like servers that aren't overloaded,
but if rss doesn't get me the data (data here being the number of
comments on a post as well as the existence of the post itself),
then I have to hit the server for the html anyway,
so what has the rss bought?

It's more load to ask for both html and rss
than for just html, and asking for just rss isn't an option
because it does not divulge number of comments on a post.
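
Whatever source ends up supplying the comment counts, the bookkeeping needed to tell new posts from posts with new comments is small. A minimal sketch, assuming each polling pass yields a dict mapping an entry id to its comment count; the function name and the snapshot shape are hypothetical:

```python
def diff_snapshots(previous, current):
    """Compare two {entry_id: comment_count} snapshots.

    Returns (new_entries, new_comments): a sorted list of entry ids
    that were not in `previous`, and a dict mapping each already-known
    entry id to the number of comments added since `previous`.
    """
    new_entries = sorted(eid for eid in current if eid not in previous)
    new_comments = {
        eid: count - previous[eid]
        for eid, count in current.items()
        if eid in previous and count > previous[eid]
    }
    return new_entries, new_comments


# Example: entry 103 is new, and entry 101 gained three comments.
previous = {101: 2, 102: 0}
current = {101: 5, 102: 0, 103: 1}
print(diff_snapshots(previous, current))
```

Only the entries reported here would need a further fetch, which keeps the number of full-page requests down.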


From: dottey
2002-11-13 08:23 am (UTC)

Re: rss update on comments?

No, you're right then. But I would still suggest you don't do scraping. I think there is protocol-mode stuff available to get the number of comments on a particular entry, as well as a listing of new entries on a friends page. I could be wrong about this.

If there isn't, then someone should make the protocol-mode backend.

Scraping is just a lousy workaround, and it wastes bandwidth for data/HTML that isn't even seen by the end-user.

From: gravitrue
2002-11-13 09:09 am (UTC)

Re: rss update on comments?

I agree that scraping is a hideous ugly kludge to be avoided
if at all possible. But... can it be avoided?

As far as I can tell from reading a few entries in this journal
(this one in particular), the protocol mode can't retrieve posts from
other folks' journals, only from a journal I have write access to.
It looks to me like scraping is the only way to get the
darned posts out... I would not at all mind being wrong on
this if folks have other suggestions...

having this functionality in the lj backend itself would be much nicer,
I just don't know if it is at all likely... I'm hoping one of the official-type
lj folks speaks up here; if not, I'll poke on lj_dev or
write customer service or something...

From: juliekate
2002-11-14 12:48 pm (UTC)

The detect music quest

Hey guys, please point me in the right direction if I'm in the wrong spot, but I am currently running a project at http://www.sourceforge.net/projects/detmusic to find a way to take vision's Windows LJ code (I think it's in C++) and get it translated into another language that is friendly to PHP.

I'm begging... is there anyone who can give me some tips? Feel free to use the forum on sourceforge. Is there a place I can ask developers to work on it? I asked visions but did not get a response.

Thank you so much!