www2007 initial summary
posted by shuri on 2007-05-13 17:02:38
The WWW2007 conference is over. It was fun, and there were some really good presentations; I summarized and linked to some below. I am sure there are many more good papers that I missed during the conference, given how many presentations there were.
The Yahoo party was fun, and I won a Squeezebox music player :). The banquet was fun too, and I thought the food was good.
Banff is amazing: a small town of 6,700 people according to Wikipedia,
which is largely there for tourists. The main street is a long stretch of almost nothing but restaurants and gift shops. Lake Louise
is really close by, and everything is beautiful. Wildlife, snow, mountains, forests, all very beautiful.
One of the interesting things about the WWW conference is that it is so diverse: people from academia and industry all come here to look for good ideas. Furthermore, the Internet touches almost every field these days, so
the conference is just huge. A production of this scale is really difficult to pull off, and all in all I think it was a great success.
So, I hope to be back in Beijing, China in 2008.
Why We Search: Visualizing and Predicting User Behavior
posted by shuri on 2007-05-12 13:18:15
by Eitan Adar et al. is an interesting paper that tries to find correlations between topic event streams generated from blogs and news sites, and to use one stream to predict the other's shape. They use dynamic time warping to map individual segments of the curves, such as peaks, rises, falls, and runs.
They also explore various ways of visualizing topic behavior over time.
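To give a feel for the alignment step, here is a minimal sketch of classic dynamic time warping. It is my own illustration of the general technique, not the paper's implementation; the function name and the toy series are mine.

```python
def dtw_distance(a, b):
    """Minimal cumulative cost of aligning series a to series b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best alignment cost of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # each step may match one-to-one, stretch a, or stretch b
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    return cost[n][m]

# Two curves with the same peak shape, one shifted: DTW aligns them perfectly.
print(dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0]))
```

Because the warping can stretch either series, two curves that peak at different times can still be matched segment to segment, which is exactly what makes it useful for comparing topic streams that lead or lag each other.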
Learning to Detect Phishing Emails
posted by shuri on 2007-05-12 10:21:33
Very nice work
by Ian Fette et al. The first thing they do is define identifying phishing emails as a different problem from detecting regular spam. They then use a decision-tree-based classifier and a set of smart features to identify phishing attacks.
The features include: when the domains in the links were registered, links that use raw IP addresses, and a comparison of the domains of the links in the email to the domain of the "click here" style links.
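A rough sketch of what extracting URL-based features of this kind might look like; the feature names, the regexes, and the toy email are my own illustration, not the paper's code.

```python
import re

def phishing_features(html_body):
    """Extract simple link-based features from an email body (illustrative only)."""
    links = re.findall(r'href="([^"]+)"', html_body)
    # links that point at a raw IP address instead of a domain name
    ip_links = sum(1 for u in links
                   if re.match(r'https?://\d{1,3}(\.\d{1,3}){3}', u))
    # the set of distinct link domains; many different domains is suspicious
    domains = {re.sub(r'https?://([^/]+).*', r'\1', u) for u in links}
    return {
        "num_links": len(links),
        "num_ip_links": ip_links,
        "num_distinct_domains": len(domains),
    }

email = ('<a href="http://192.168.0.1/login">click here</a> '
         '<a href="http://example.com">site</a>')
print(phishing_features(email))
```

Feature vectors like this one would then be fed to the decision-tree classifier, alongside features that need external lookups, such as the registration age of each linked domain.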
Predicting Clicks: Estimating the Click-Through Rate for New Ads
posted by shuri on 2007-05-11 13:20:03
How to determine ad ordering if you do not have extensive click-through-rate probabilities? That is what this
paper does. They use machine learning, logistic regression, to predict the click-through-rate (CTR).
The basic model builds on previous work. The first thing they add is a notion of ad quality: landing-page quality and relevance. They further tried to improve the results by adding features, such as which key terms appear in the title and the text, and by using machine learning to learn quality.
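For readers unfamiliar with logistic regression, here is a bare-bones sketch of how a CTR prediction of this flavor works. The weights, bias, and feature choices are invented for illustration; the paper's actual features and training procedure are richer.

```python
import math

def predict_ctr(weights, bias, features):
    """Logistic regression: a weighted score squashed into (0, 1) by a sigmoid."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative features: [key term appears in title, landing-page quality, relevance]
weights = [1.2, 0.8, 1.5]  # learned from historical click data in practice
bias = -4.0                # most ads are not clicked, so the baseline is low
print(predict_ctr(weights, bias, [1.0, 0.6, 0.7]))
```

The output is directly interpretable as a probability, which is what makes logistic regression a natural fit for estimating a click-through rate before any clicks have been observed.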
Page-level Template Detection via Isotonic Smoothing
posted by shuri on 2007-05-10 10:00:26
about template detection; a short summary follows.
Previous work was site-based and two-phase. The limitations of this technique: pages may not be processed in site order, new sites may be a problem, and processing may be inefficient.
- obtain site-specific training data
- learn site-specific templates
- try to learn a global detector for "templateness"
Features they use include placement on the screen, background color, series of links that are likely to be part of the template, and average sentence size. They then use a classifier to separate the template parts of a page from the content.
In the results they show that shingling after template detection works better than shingling without it.
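For context on that last result, here is a minimal sketch of shingling, the standard near-duplicate detection step the paper runs after stripping templates. The tokenizer and shingle size are my choices, not the paper's.

```python
def shingles(text, k=4):
    """The set of all k-word windows in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(a, b, k=4):
    """Jaccard similarity of the two shingle sets."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# One changed word perturbs k surrounding shingles, lowering the similarity.
print(resemblance("the quick brown fox jumps over the lazy dog",
                  "the quick brown fox jumps over a lazy dog"))
```

The intuition behind the paper's result: shared navigation bars and footers inflate the shingle overlap of unrelated pages from the same site, so removing the template first makes the similarity score reflect the actual content.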
A Banff View
posted by shuri on 2007-05-09 10:30:17
Efficient Search Engine Measurements
posted by shuri on 2007-05-09 10:08:51
If you happened to miss the WWW2007 talk "Efficient Search Engine Measurements" by Ziv Bar-Yossef and Maxim Gurevich, you should go and read the paper.
The paper describes an efficient and accurate method of estimating various properties of a search engine, such as the size of its document collection, through the standard query interface alone. I will not do it justice if I try to describe the details, so go and read it.
Navigation-Aided Retrieval by Pandit and Olston
posted by shuri on 2007-05-09 09:49:02
The basic idea of this
work is to assume that the user of a search engine is willing to do some navigation to find what he is looking for.
The question then becomes not which document is the most relevant, but where we should "drop off" the user so that he is most likely to find what he is looking for. Cool.
Further, they highlight the paths that could lead the user to interesting pages.
For those not in WWW2007
posted by shuri on 2007-05-08 11:00:14
If you are not at WWW2007 and you still want to see a cool lecture, go here
and look for Prabhakar Raghavan. This excellent lecture covers both Yahoo Answers and advertisement auctions. Any optimization to advertisement auctions means big money, and that is why you should care.
WWW2007 worth a read
posted by shuri on 2007-05-08 10:54:40
While attending the Query Log Analysis session
of the WWW2007 conference, this work
caught my attention. The presentation describes a better model of search engine users and the way they click. For example, the user model takes into account whether the user considered a result and how attractive it was.
I am in WWW2007
posted by shuri on 2007-05-07 14:19:11
That is it. I am here in Banff, Canada, at the WWW2007 conference. I will be presenting my paper Do Not Crawl in the DUST,
about identifying different URLs with similar text. I am excited to see my Israeli colleagues, Maxim Gurevich and Ziv Bar-Yossef, who are also presenting a paper
about efficient search engine measurement.
I will try and update the web site with anything I find interesting.