Tetsushi Morita
NTT Corporation@@@NTT Cyber Solutions Laboratories
3-9-11 Midori-Cho Musashino-shi Tokyo, 180-8585
Japan
81-422-59-4840
morita.t@
lab.ntt.co.jp
Tetsuo Hidaka
NTT Corporation@@@NTT Cyber Solutions Laboratories
3-9-11 Midori-Cho Musashino-shi Tokyo, 180-8585
Japan
81-422-59-7150
hidaka.tetsuo@
lab.ntt.co.jp
Akimichi Tanaka
NTT Corporation@@@NTT Cyber Solutions Laboratories
3-9-11 Midori-Cho Musashino-shi Tokyo,
180-8585 Japan
81-422-59-4483
tanaka.akimichi@
lab.ntt.co.jp
Yasuhisa Kato
NTT Corporation@@@NTT Cyber Solutions Laboratories
3-9-11 Midori-Cho
Musashino-shi Tokyo, 180-8585 Japan
81-422-59-4420
kato.yasuhisa@
lab.ntt.co.jp
ABSTRACT
We propose
a system
for reminding
a user of information obtained through a web browsing experience.
The system
extracts
keywords
from the content
of the web page currently
being
viewed and retrieves the context
of past web browsing related
to the keywords.
We define the context
as a sequence of web browsing when
many web pages related
to the keyword
were viewed
intensively
because
we assume that a lot of information
connected
to the current
content
was obtained
in the sequence. The information
is not only what
pages
you viewed but also
how you found those pages and what
knowledge
you acquired
from them. Specifically,
when you browse web pages, this
system
automatically displays a list
of the contexts
judged
to be important
in relation
to the current
web page. If you select
the context,
details
of the context
are shown
graphically
with marks
indicating characteristic activities.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]:
Information Search and Retrieval – retrieval model,
Search
process
General Terms: Algorithms, Management, Design.
Keywords: Context, Information
Retrieval,
Userfs Behavior, History.
1.
INTRODUCTION
|
Copyright is held by the author/owner(s).
WWW 2007, May 8–12, 2007, Banff,
Alberta,
Canada.
ACM 978-1-59593-654-7/07/0005.
|
Have you ever
been frustrated at failing to rediscover useful web pages that you viewed
in the past? We forget and waste various
information
that we obtain through our own web browsing. Most of us have
retrieved
the same web page
more than
once. One report says that
the retrieval
rate of previously
seen web pages among all pages that people
view is 81% [3]. The information
that we obtain through an experience,
such as web browsing, is not limited
to the content
of web pages.
We seem to recognize
which web pages we viewed
in a session,
how the pages
were found,
and what knowledge
we acquired
from them.
We define
the information
as gobtained
informationh.
In this paper, we describe
a system
that aims
to remind
a user of previously
obtained
information
efficiently.
The bookmark function
of a web browser
and a desktop
search
system
are popular
to help us to retrieve previously
seen web pages by keyword
matching
[2]. Several
semantic desktop
search
systems
have been
proposed.
One of them helps a user
to retrieve
web pages
and e-mails according
to intimate
information
by adding metadata such as the
URL of a web page that
is visited subsequently
and the destination address
of an e-mail [1]. A previous
version
of our method
helps
users to retrieve web pages viewed
in the past by calculating
their
personal
importance by using log data
from personal
computers
[4]. These
desktop
search
systems
help a user
to find a web page
viewed in the past
efficiently.
However,
they do not seem
to be so good at reminding us of the various
kinds of information
that we acquired
simultaneously in the past web browsing
experience because we need to choose and visit many retrieved independent
web pages.
2.
PROPOSED SYSTEM
We focus on the context
in the past. The context
is a sequence of web browsing when
many web pages related
to the content of the web page currently
being
viewed were viewed
intensively
and a lot of actions
were performed.
We call the time
of this sequence
an gintensive
periodh.
We assume
that a lot of obtained information
is also contained
in the context. For example,
if a user is researching a product,
he first
finds the context related
to current web pages and then he acquires
a lot of obtained
information
such as the URLs of multiple
web pages
that were
visited at that
time. By chronologically
tracing
his activities,
such as which
web pages
he viewed in the context, he learns how
to find the pages and what knowledge he got from them.
2.1
Collecting action
logs
It is difficult
to force a user
to perform
the actions
required
to create
history
data such
as recording when and how he or she viewed a web page.
A logging
module collects the information about a computerfs
mouse, keyboard,
copying,
and printing
events
and window
conditions, the URLs of
visited pages,
source
files, thumbnails,
http headers,
text selected
by the user, and so on. It has an encryption
function
to protect
the userfs privacy.
2.2
Extracting keywords
of current
web page
Our system analyzes
the content
of the current
web page to extract
keywords
that represent
the web page. Its technique is very
simple.
An analyzer
of a browser
component
obtains
the current
content
C and characterizes
it by extracting
the most frequent terms.
A score
Si is then determined
for each term
ti¸C, where Si = (1 +ƒ¿0+cƒ¿n)R(ti) and ƒ¿i is a weighting
coefficient
that varies
heuristically with
the locality
of ti. That is, extracted
terms instanced
in anchor
text are assigned a higher weight than
those
in <body> text, but a lower one than those
in <title> or <h1>. This R(ti) is a variation of the TF-IDF algorithm, where the term frequency of ti is multiplied by the inverse
document frequency
of ti to approximate each
termfs importance. Next,
the top N ranked characterization
terms with
weights
where
stemming
and other
adjustments are applied are regarded as a set of representative
keywords
of the current
web page. We call
this set of keywords
the gcurrent
keywordsh.
2.3
Extracting past
contexts
We extract an intensive
period through
the following
steps. First, the degree of importance I of a random
time t to current keywords k is calculated
by eq. (1). We focus on an active period ap when
a web page is actively
shown
in a window. Ei is the weighting
factor
of action
category i such as the amount of active time, copying,
printing,
mouse
clicking,
keyboard
input,
and text selection. Fri is the number
of occurrences of i in ap. R is a relevance
ratio of a web page
in ap to k, which is given
the value of TF-IDF based on the set of all web pages logged by the logging
module.
Then, the average degree of importance AI of t to k is determined
from eq. (2). Here,
is a parameter
for averaging. If the average degree is not more
than parameter
, the operation related
to the current keywords is regarded as being discontinuous
at time t (Fig. 1). The values of
and
are decided
by heuristics.
In this way, the method
can regard
an intensive period ip as a past