LiveJournal Client Discussions

[Jan. 18th, 2004|11:08 am]
[mood | creative]

I've written a program (a hack, really) that takes an LJ username and returns a list of that user's friends' interests, sorted from most to least common. For example, 'publius_ovidius' returns something like:

        dancing 21
         movies 19
        reading 18
            sex 17
       portland 16
          music 14
         travel 14

However, there are a couple of problems with the program. First, it's highly dependent on the structure of the LJ HTML, which makes it fragile. Are there RSS feeds or something similar for the full user info page?

There are a few other "issues" with this script, but I don't want to keep working on it until I know how to play nicely with LJ.

use warnings;
use strict;
use WWW::Mechanize;
use URI::Escape;
use HTML::Entities 'decode_entities';
use HTML::TokeParser::Simple;
use Getopt::Long;

$|++; # so we can see the results as they are printed

GetOptions(
    'help|?'      => \&usage,
    'user=s'      => \my $user,
    'verbose=i'   => \my $VERBOSE,
    'syndicated'  => \my $INCLUDE_SYNDICATED,
    'communities' => \my $INCLUDE_COMMUNITIES,
);
$VERBOSE ||= 0; # default to quiet so the numeric comparisons below don't warn

sub usage {
    print <<"    END_USAGE";
    $0 counts how often each interest appears among the friends of
    a given user.
    $0 --user <username> [options]

    --help        Display this information and exit
    --?           Same as '--help'
    --user        Mandatory.  This is the user we will fetch friends for.
    --syndicated  Include syndicated "friends" (e.g., doonesbury)
    --communities Include communities
    --verbose     Takes an integer (0, 1, or 2).  If zero, will print nothing
                  but the interest list (this is the default).  1 and 2 will
                  print more and more information.  These are useful to let you
                  know the program has not "hung" if you're working with a
                  large list or over a slow connection.

    Example:

    $0 --user publius_ovidius --verbose 2 --communities

    That will calculate the common interests for friends of publius_ovidius,
    displaying verbose information and including community interests (but not
    syndicated feed interests).

    Note that arguments may be abbreviated to the first letter.  The above
    command may be written as:

    $0 -u publius_ovidius -v 2 -c
    END_USAGE
    exit;
}
# bad regexes.  Need to improve them!
use constant FRIENDS   => {
    regex    => qr{href='http://www\.livejournal\.com/users/[^/]+/friends'},
    label    => 'friends',
    instance => qr{^/userinfo\.bml\?user=(.*)$},
};
use constant INTERESTS => {
    regex    => qr{href='/interests\.bml'},
    label    => 'interests',
    instance => qr{^/interests\.bml\?int=(.*)$},
};

my $MECH = WWW::Mechanize->new;

$user ||= die "You must supply an LJ username";
print "Fetching user info for ($user) ...\n" if $VERBOSE;
my $html = get_user_info($user);

print "Fetching friends list for ($user)...\n" if $VERBOSE;
my $users = get_list($html,FRIENDS);

my $current = 1;
my $count   = @$users;

my %sections;
foreach my $user (@$users) {
    print "Fetching $user:  $current out of $count\n" if $VERBOSE;
    $current++; # advance the progress counter
    sleep 1; # be nice to their server
    print "Fetching user info for ($user) ...\n" if $VERBOSE > 1;
    my $html = get_user_info($user);
    next unless $html;
    print "Fetching interests for ($user) ...\n" if $VERBOSE > 1;
    my $interests = get_list($html, INTERESTS);
    foreach my $interest (@$interests) {
        $sections{$interest}++; # one more friend lists this interest
    }
}

my @results = 
    sort { $b->[1] <=> $a->[1] }
    map  { [$_, $sections{$_}] }
        keys %sections;

foreach my $interest (@results) {
    printf "%30s %d\n", @$interest;
}

sub get_list {
    my ($html,$section) = @_;
    my $parser = HTML::TokeParser::Simple->new(\$html);
    # skip ahead to the link that introduces this section
    while (my $token = $parser->get_token) {
        next unless $token->as_is =~ /$section->{regex}/;
        last;
    }
    $parser->get_tag('td'); # advance to first td tag
    my @sections;
    while (my $token = $parser->get_token) {
        last if $token->is_end_tag('td'); # we're at the end of the member table data element
        next unless $token->is_start_tag('a');
        if ($token->return_attr->{href} =~ $section->{instance}) {
            push @sections => decode_entities($1);
        }
    }
    printf("\t%d %s found\n", scalar @sections, $section->{label})
        if $VERBOSE > 1;
    return \@sections;
}

sub get_user_info {
    my $user = shift;
    my $info = sprintf "http://www.livejournal.com/userinfo.bml?user=%s&mode=full"
        => uri_escape($user);
    my $page = $MECH->get($info);
    my $html = $MECH->content;

    if ('Error' eq $MECH->title && $html =~ /Unknown user/) {
        # this isn't perfect, but it's reasonable since LJ does
        # not return error codes
        warn "User ($user) not found";
        return; # the caller skips this user when nothing is returned
    }
    if ($MECH->title =~ /Syndicated Account/ && ! $INCLUDE_SYNDICATED) {
        print "\tSkipping syndicated account ($user)\n" if $VERBOSE;
        return;
    }
    if ($MECH->title =~ /Community Info/ && ! $INCLUDE_COMMUNITIES) {
        print "\tSkipping community ($user)\n" if $VERBOSE;
        return;
    }
    return $html;
}
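
For anyone who only wants the counting step without the scraping, here is a minimal sketch of the tally-and-sort part on its own. The %interests_for data is made up for illustration, and the alphabetical tie-break is an addition that the script above does not have (there, interests with equal counts come out in arbitrary hash order).

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the scraped interest lists.
my %interests_for = (
    alice => [ 'dancing', 'movies', 'reading' ],
    bob   => [ 'movies',  'reading' ],
    carol => [ 'movies' ],
);

# Tally how many friends list each interest.
my %count;
for my $friend (keys %interests_for) {
    $count{$_}++ for @{ $interests_for{$friend} };
}

# Sort by descending count; break ties alphabetically so the
# order is stable (the tie-break is not in the original script).
my @results =
    sort { $b->[1] <=> $a->[1] or $a->[0] cmp $b->[0] }
    map  { [ $_, $count{$_} ] }
    keys %count;

printf "%30s %d\n", @$_ for @results;
```

With this sample data the list comes out as movies (3), reading (2), dancing (1).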

From: rjray
2004-01-18 04:12 pm (UTC)


(If you get an e-mail with my previous comment, ignore it. I was suggesting this community, because I thought this was an individual post, not a community one. That was kinda dense on my part.)

Have a look at:


In particular, section II. It isn't clear if there is a call for getting the userinfo data for an arbitrary account, though.

From: publius_ovidius
2004-01-18 07:52 pm (UTC)


I was looking through there and couldn't see anything related to "interests". That reduces me to parsing the HTML for the relevant data. Bummer. Still, I should play with this anyway and see if there's extra information being provided that I'm not aware of.

From: xb95
2004-01-22 10:29 am (UTC)


There is no way to get userinfo or interests yet. That's been tossed around and someone is working on it, I believe, but nothing final yet. :)

From: da_lj
2005-04-16 06:31 am (UTC)

bug fix?

I couldn't get the code to work as-is; I assume the user page HTML format changed. Thankfully, the fix is a simple one. Change the line:

last if $token->is_end_tag('td'); # we're at the end of the member table data element

into these four lines:

last if ($token->is_start_tag('td')
    && $token->return_attr->{colspan}
    && $token->return_attr->{colspan} == 2);
    # we're at the end of the member table data element

Hopefully this fix works for you?

Next up, subverting this hack to my own nefarious ends. :)

From: publius_ovidius
2005-04-16 02:42 pm (UTC)

Re: bug fix?

I had noticed that it was broken a while back. I just hadn't had the tuits to fix it. Thanks!

From: publius_ovidius
2005-04-16 02:47 pm (UTC)

Re: bug fix?

Oh, and I forgot to mention: if you're using a relatively recent version of HTML::TokeParser::Simple, you can simplify that code to:

    no warnings 'uninitialized';
    last if $token->is_start_tag('td') && 2 == $token->get_attr('colspan');
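
The pragma is there because a td tag without a colspan attribute yields undef, and comparing undef numerically warns under 'use warnings'. A rough sketch of the effect, using plain hashes as hypothetical stand-ins for the tag attributes (no parser involved, so this only illustrates the warning behaviour and not the module itself):

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the attributes of three <td> tags:
# no colspan, colspan=1, colspan=2.
my @cells = ( {}, { colspan => 1 }, { colspan => 2 } );

# Collect any warnings so we can see whether the comparison emitted one.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $stopped_at;
for my $i (0 .. $#cells) {
    # Scoped to this loop body only: a missing colspan is undef, which
    # compares as 0 and would otherwise trigger an "uninitialized" warning.
    no warnings 'uninitialized';
    if (2 == $cells[$i]{colspan}) {
        $stopped_at = $i;
        last;
    }
}

print "stopped at cell $stopped_at with ", scalar(@warnings), " warnings\n";
```

Keeping the pragma inside the smallest enclosing block means the rest of the program still gets the warning.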

From: da_lj
2005-04-16 05:29 pm (UTC)

Re: bug fix?


Looking into the "right" way to do this: the FOAF data looked promising, but LJ doesn't include RSS feeds as friends there. So if you want to do something with your whole friends list, RSS feeds included, you're stuck screen-scraping, so far as I can see.
