Technical area exam: Bradley Rhodes

Question 1

Question: One model of personal information retrieval is three dimensional: described by filtering of content, enrichment of context, and self-expression. Examples from the Media Lab include FishWrap, PLUM, and SilverStringers. Discuss technologies such as Remembrance Agents in light of this model. Discuss specific examples from existing systems and extrapolate in order to describe what personal information retrieval might look like in 10 years, when the proliferation of these technologies is more widespread.

Remembrance Agents currently break down into the three dimensions of content filtering, context enrichment, and enabling self-expression as follows.

Content filtering

The metaphor of content filtering is based on the idea that there is a large amount of information already out there in the real world, and that a filter picks out the good parts. Often the metaphor is extended to talk about filtering information one would otherwise have to read themselves, like news or mailing list traffic. In this light, remembrance agents are the dual of information filtering: they give you more information than you would receive without the RA. On the other hand, some people have started using the RA on mailing lists they would like to read, but have never gotten around to. From their perspective, the RA performs filtering on those mailing lists, based on their current context.

I tend to break the information retrieval part of RAs into suggestion selection and suggestion filtering. Suggestion selection is finding the best suggestions given the user's current context. Suggestion filtering refers only to the decision of whether to show the chosen suggestion(s) at all. A different breakdown could consider both steps to be "content filtering," but from a practical standpoint it's easier to perform filtering on a single suggestion at a time without regard to any alternative suggestions. Once you need to trade off showing one or another, ranking algorithms are more appropriate.

In my current implementations, filtering is done by setting a threshold on the same relevance score used to determine what suggestion to show. This can be a problem, since I'm trying to perform both filtering and best-match selection with the same algorithm, which might be tuned better for one or the other but is hard to get right for both. Currently the system seems to work well enough for content selection, but tends to bias towards short queries for filtering.
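As a rough illustration of the select-then-threshold arrangement described above, here is a minimal sketch. It assumes a plain bag-of-words cosine similarity as the relevance score; the function names, the scoring method, and the threshold value are all hypothetical and are not taken from the actual RA implementation.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Turn text into a bag-of-words term-count vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest(context: str, documents: list, threshold: float = 0.3):
    """Suggestion selection followed by suggestion filtering.

    Selection: pick the document with the highest relevance score.
    Filtering: show it only if that same score clears the threshold.
    Returns (document, score), or None when nothing should be shown.
    """
    if not documents:
        return None
    query = vectorize(context)
    best_score, best_doc = max(
        ((cosine(query, vectorize(d)), d) for d in documents),
        key=lambda pair: pair[0],
    )
    if best_score < threshold:
        return None  # filtered out: not relevant enough to interrupt with
    return best_doc, best_score
```

Because one score drives both steps, retuning the scoring for better selection also shifts what passes the filter, which is the coupling described above. A separate filtering score would be one way to decouple them.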

Context enrichment

RAs are probably best described as context enrichment systems, although they are not as specific in this purpose as is PLUM. PLUM is very specific in relating a disaster article to a more personal context, namely the reader's own home-town population and geography. RAs also relate a current document or environment to a different context, but depending on the database that context can be fairly far ranging. For example, some parts of this exam answer are being related to the context of papers I've written about the RA in the past, some parts are being related to calls for papers from related conferences, and some are being related to the syllabus for a course Phil Agre recently taught on "contextual design."

As a side note, I'm not splitting hairs here as to what qualifies as a "remembrance agent." In actuality, I think PLUM qualifies as a type of RA, though one designed for a very specific purpose.

Personal expression

Of the three dimensions, current RAs are weakest in this area, though looking to the future this can change. Currently the only outlet for personal expression is in giving someone else your own database of notes, email, or data and allowing them to use it for their own remembrance agent. In a way, everyone who sends me email is expressing themselves when the RA brings up their mail as a suggestion. However, they don't have any say as to where or when those suggestions come up or how they are presented. They don't even necessarily know their text is being used by an RA, in the same way people who put up web-pages aren't necessarily thinking about how search engines will present summaries of their page.

In the future we should see more blurring of these three dimensions as systems have more elements of all of them. Another trend that is already happening and will continue is the further use of networks and of community knowledge in the place of AI techniques. My hope is that the predictors of "middle-ware" will finally be proven right, and many systems will coexist and interoperate, such that different back ends might combine to work with a common (possibly personalized) interface, but these sorts of things move more slowly than the technology itself. I'm reminded how zephyr started as a simple communications system but was quickly co-opted as a front-end for all sorts of agent-like applications, from notification of email to food being in the third-floor kitchen, all because it was "good enough for the job" and was easily modified.

Here's one possible scenario for an integrated system, and the current systems that express the features I'd like to see. I'll start with proactive information systems like the RA, then move to more general personal information retrieval.

First, RAs should become more like communications and personal annotations of information, extending into the dimension of aiding self-expression. Ever since Xanadu, and before, people have been talking about being able to provide meta-information about other people's text. For example, several years ago Terry Winograd at Stanford worked on systems where people in a work group could annotate sections of webpages. Annotations were marked with the author's own icon (usually a tiny gif of their picture). This worked well in a work group-sized environment, but could never work on the web with hundreds of thousands of annotations on popular sites. In these situations, I envision a remembrance-agent-like system where the source of data is the set of annotations published by other users about a particular section of a webpage. In this scenario the RA would no longer do text retrieval to determine whether a suggestion was relevant; it would assume relevance. Instead, the RA would act as a filter to decide which annotations are most important to show a given user at a given time. This filtering could be based on the reputation of the author, as Giorgos Zacharia is trying to do for e-commerce and Adriana Vivacqua is researching for finding reputable experts. Author reputation could also be based on whether the author is in the same electronic or real-world community as the reader, whether the author is a friend of a friend, or whether the author is from an automatically generated community such as those created by Yenta.

The filtering could also be based on the reputation of the annotation itself. This is especially important if we want the possibility of anonymous annotation, since author reputation would then not be available. Annotation reputation could be calculated using Automatic Collaborative Filtering, as is being done with the GroupLens system. The drawback of ACF is that there needs to be a way to bootstrap new annotations so they get seen enough to be rated, so ACF could only be used to filter annotations that are already well rated, with a different method used for new annotations. Annotations can also be filtered through a combination of user profiling and communityware techniques. For example, a personal profile can be used in conjunction with traffic history. This is similar to the WebWatcher system, where users specify their goal (e.g. "I'm looking for software agents papers") and links are then rated based on whether other people with similar goals followed that link.
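The bootstrapping problem can be made concrete with a small sketch. This is not ACF proper: a plain average of ratings stands in for a collaborative-filtering prediction, and the fallback score for sparsely rated annotations is simply supplied by the caller (it might come from author reputation, for instance). The dictionary fields, `min_ratings`, and the threshold are all hypothetical.

```python
from statistics import mean


def annotation_score(ratings, fallback_score, min_ratings=3):
    """Score an annotation from its ratings, or fall back if it is too new.

    The mean rating is a stand-in for an ACF prediction (an assumption of
    this sketch). Annotations with fewer than min_ratings ratings use the
    caller-supplied fallback score instead.
    """
    if len(ratings) >= min_ratings:
        return mean(ratings)
    return fallback_score


def visible_annotations(annotations, threshold=0.5, min_ratings=3):
    """Return the ids of annotations to show, best score first.

    Each annotation is a dict with hypothetical keys:
    "id", "ratings" (list of floats in [0, 1]), and "fallback" (float).
    """
    shown = []
    for ann in annotations:
        score = annotation_score(ann["ratings"], ann["fallback"], min_ratings)
        if score >= threshold:
            shown.append((score, ann["id"]))
    shown.sort(reverse=True)
    return [ann_id for _, ann_id in shown]
```

With this split, a new annotation can still be shown (and therefore rated) on the strength of its fallback score, after which its own ratings take over.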

Filtering could also be performed based on the path of links you followed to get where you were going, à la Footprints. For example, if you go from a page on engine technology to a car specification page, you might get annotations about that car's technical details. If you go to the same page from a consumer reports guide, you might find annotations about reliability and price/performance ratios. These annotations would be ranked by how close the path you took to the current page is to the path the author of the annotation took.
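One simple way to measure how close two paths are is the Jaccard overlap of the pages visited on the way to the current page. The sketch below assumes that measure; it ignores link order and is not meant to describe how Footprints works. The annotation fields and page names are made up for illustration.

```python
def path_similarity(reader_path, author_path):
    """Jaccard overlap between the sets of pages on two browsing paths.

    Returns a value in [0, 1]; order of visits is ignored (a simplifying
    assumption of this sketch).
    """
    reader_pages, author_pages = set(reader_path), set(author_path)
    union = reader_pages | author_pages
    if not union:
        return 0.0
    return len(reader_pages & author_pages) / len(union)


def rank_by_path(reader_path, annotations):
    """Order annotations so those written from a similar path come first.

    Each annotation is a dict with hypothetical keys "text" and "path",
    where "path" lists the pages the author visited before annotating.
    """
    return sorted(
        annotations,
        key=lambda ann: path_similarity(reader_path, ann["path"]),
        reverse=True,
    )
```

For a reader arriving from engine-technology pages, an annotation whose author came through similar pages would rank above one written by someone arriving from a consumer guide.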

When working in environments where public annotation is impossible (such as suggesting information related to works in progress in a word processor, real-time conversation, or pulling suggestions from more personalized data) there are still many extensions that can and should be applied to RAs. One is the integration of personal profiles into the algorithm for deciding relevant suggestions.

Personal profiles can really be divided into ranges of time. Most profiles are descriptions of a user that remain stable over a long period of time. For example, the interests used by FishWrap change only occasionally, and only when manually changed. At the extreme, an RA using only such a profile would become a personalized newspaper like FishWrap or the various news clipping services that email news based on user interest keywords.

In the middle range, user profiles can describe fleeting interests that change dynamically over the course of a few hours or days. For example, the Letizia system automatically creates and updates a user profile based on the text contained in webpages visited. Over the course of a few browsing sessions a user's profile gets updated with new interests.

The current RAs are at the other extreme, looking only at the paragraph of text or section of webpage currently being viewed. They have no memory of what has happened before. In future systems, all three time ranges of profiles should be used. The trick is getting the mix right. Too much long-term profile and the suggestions never change regardless of your context. Too much near-term and there's no personalization except for what comes out of your current context and the personalization of your source data.
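One way to picture "getting the mix right" is as a weighted blend of three term-weight profiles, one per time range. The sketch below assumes each profile is a simple term-count table and that the mix is a fixed set of weights; both are assumptions made for illustration, and the particular weights shown are arbitrary.

```python
from collections import Counter


def blend_profiles(long_term, mid_term, current, weights=(0.2, 0.3, 0.5)):
    """Blend three term-count profiles into one weighted query vector.

    Each profile is normalized to sum to 1 before weighting, so a long
    document in one time range cannot swamp the others by size alone.
    weights = (long-term, mid-term, current context); values are arbitrary.
    """
    blended = Counter()
    for profile, weight in zip((long_term, mid_term, current), weights):
        total = sum(profile.values())
        if total == 0:
            continue  # an empty profile contributes nothing
        for term, count in profile.items():
            blended[term] += weight * count / total
    return blended
```

Setting the weights to (1, 0, 0) reproduces the failure described above, where suggestions never change with context; (0, 0, 1) corresponds to an RA that looks only at the current text.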

Finally, profiling should be used not only to pick suggestions but also to determine whether and how to show that data. For example, Nitin Sawhney's Nomadic Radio system is working towards using a combination of a user's own learned tolerance to interruption (a long-term profile) with a detector to determine if they are in a conversation (a short-term profile) to decide whether to play a full message, a short description, just a simple alert, or no message at all when voicemail or audio news arrives. PLUM is another interesting mix of short and long-term profiles, as it uses the two profiles for different reasons. The short-term profile (the news story being read at the time) is used to pick the content of the suggestion, while the long-term profile (the user's home town) is used to pick the expression of that content. One can also imagine an RA that determines whether a user is driving or in a conversation, and automatically picks its output modality accordingly.
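The idea of combining a long-term and a short-term profile to choose a presentation level might be sketched as follows. This is purely illustrative and is not Nomadic Radio's algorithm: the numeric tolerance scale, the conversation penalty, and the cut-off values are all invented for the example.

```python
from enum import Enum


class Presentation(Enum):
    NONE = 0     # stay silent
    ALERT = 1    # just a simple alert
    SUMMARY = 2  # short description
    FULL = 3     # play the full message


def choose_presentation(interruption_tolerance, in_conversation,
                        conversation_penalty=0.5):
    """Pick how much of an incoming message to present.

    interruption_tolerance: long-term profile, assumed to lie in [0, 1].
    in_conversation: short-term profile from some detector (True/False).
    conversation_penalty: hypothetical fraction by which being in a
    conversation reduces the willingness to interrupt.
    """
    budget = interruption_tolerance
    if in_conversation:
        budget *= 1 - conversation_penalty
    # Arbitrary cut-offs mapping the remaining budget to a presentation level.
    if budget >= 0.75:
        return Presentation.FULL
    if budget >= 0.5:
        return Presentation.SUMMARY
    if budget >= 0.25:
        return Presentation.ALERT
    return Presentation.NONE
```

The same user thus hears a full message when idle but only an alert, or nothing, when the short-term profile says they are busy.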

So far this discussion has been all about proactive presentation of information, but in future systems proactive and interactive systems will naturally merge. For example, the entire idea of a file system should be (and is slowly being) replaced with a database where multiple views and queries are possible. Lifestreams is one cut at that problem, breaking views up by time. AltaVista also sells a search engine for your own files that will parse and index most popular Windows file formats. These sorts of search engines and viewing aids should be integrated with all kinds of other personal information systems, so it becomes trivial to switch from a proactive remembrance agent hit to an interactive search for other files relating to that hit. It should also be easy to specify links between pieces of data, as systems such as The Brain allow. Other systems will automatically log and collect personal information on the fly, automatically remembering where we go, what files we read, who we see, etc. This information can be used both for proactive information systems like the Wearable Remembrance Agent and for query-based systems like Forget-Me-Not.