Date: Wednesday, 11 April 2007, 6:30 PM
Location: SAP LABS, Building D, 3410 Hillview Avenue,
Palo Alto, CA
Cost: Free and open to all who wish to attend, but membership is
only $10/year.
Topic
The World Wide Web provides a wealth of data that can be harnessed to
help improve information retrieval and increase understanding of the relationships
between different entities. In many cases, we are interested in determining
how similar two entities may be to each other, where the entities may be
pieces of text or descriptions of some object. In this work, we examine
multiple instances of this problem, and show how they can be addressed
by harnessing data mining techniques applied to large web-based data sets.
Specifically, we examine the problems of determining the similarity of
short texts (even those that may not share any terms in common) and also
of learning similarity functions for semi-structured data to address tasks
such as record linkage between objects. While we present rather different
techniques for each problem, we show how measuring similarity between entities
in these domains has a direct application to the overarching goal of improving
information access for users of web-based systems.
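As a rough illustration of the short-text problem mentioned above, the sketch below compares two snippets that share no terms. It is not the speaker's method: the `CONTEXT` table is a hypothetical, hand-written stand-in for web data, and the function names are made up for this example. It only shows why plain term overlap fails on such pairs and how adding outside context can help.

```python
import math
from collections import Counter

# Hypothetical stand-in for web-derived context. A real system would
# gather this text from a large web-based data set; these snippets are
# invented purely for illustration.
CONTEXT = {
    "AI": [
        "artificial intelligence research",
        "machine learning and artificial intelligence",
    ],
    "artificial intelligence": [
        "artificial intelligence research lab",
        "machine learning systems",
    ],
}


def term_vector(text):
    """Bag-of-words term counts for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expanded_vector(text):
    """Term counts for the text plus its (assumed) context snippets."""
    vec = term_vector(text)
    for snippet in CONTEXT.get(text, []):
        vec.update(term_vector(snippet))
    return vec


# The two snippets share no terms, so direct comparison finds no similarity.
direct = cosine(term_vector("AI"), term_vector("artificial intelligence"))

# After expansion with context, the vectors overlap and similarity is positive.
expanded = cosine(expanded_vector("AI"), expanded_vector("artificial intelligence"))

print(f"direct:   {direct:.3f}")
print(f"expanded: {expanded:.3f}")
```

The choice of cosine similarity over raw term counts is an assumption made to keep the sketch short; weighting schemes and the source of context text would both matter in practice.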
Paper references:
- Bilenko, M., Basu, S., and Sahami, M. 2005. Adaptive product normalization: Using online learning for record linkage in comparison shopping. In Proc. of the 5th IEEE Int'l Conference on Data Mining (ICDM-05), p. 58-65.
- Spertus, E., Sahami, M., and Buyukkokten, O. 2005. Evaluating similarity measures: a large-scale study in the Orkut social network. In Proc. of the 11th ACM SIGKDD Int'l Conference on Knowledge Discovery and Data Mining (KDD-05), p. 678-684.
- Sahami, M., and Heilman, T. 2006. A web-based kernel function for measuring the similarity of short text snippets. In Proc. of the 15th Int'l World Wide Web Conference (WWW-06).
About the Speaker
Mehran Sahami is a Senior Research Scientist at Google. His research
interests include machine learning, data mining, and information retrieval
on the Web. Mehran was previously a Lecturer in the Computer Science
Department at Stanford University (where he received his PhD) and, prior
to Google, was involved in a number of commercial and research machine learning
projects at Epiphany, Xerox PARC and Microsoft Research. He has published
dozens of refereed technical papers, served on numerous conference program/organizing
committees and has several patents pending. This year he is serving as
Track Chair for the Industrial Practice and Experience track at WWW-07
and is Co-Chair of the Student Abstract and Poster program at AAAI-07.