You are how you e-mail: A new technique can tell people apart using only the timestamps in their Sent folders.
In the interactive, real-time world of Twitter, blogs and World of Warcraft,
timing is one of the most salient aspects of social behavior. Now,
researchers at Northwestern University and Yahoo Research in New York
show that they can distinguish and categorize people based solely on
the timestamps of their e-mails, paving the way for smarter
advertisements, spam filters and social networking sites.
“You can’t track everything an individual is doing at every hour of
the day,” said Dean Malmgren of Northwestern University, lead author of
the study posted May 11 on the pre-publication physics repository,
arXiv. “But this shows that with just a snapshot of what they’re doing
— knowing what time they send their e-mails — you can actually get
meaningful information.”
Of particular interest to Yahoo is a more effective way to catch
spammers. Between 80 and 90 percent of all e-mail in the world is spam.
Spam isn’t just obnoxious; it also uses up bandwidth, storage space and
time. In 2009, spam may cost $42 billion in the United States and $130 billion worldwide — and that doesn’t include the money scammed from gullible internet users like Citigroup.
Spam filters and spammers are engaged in a perpetual arms race, with
spammers constantly changing their domains and IP addresses and
disguising dirty words. But spammers have a major limitation: In order
to send their millions of e-mails, they need bots. If a temporal model
of e-mail behavior can distinguish between different people, it can
also distinguish people from nonpeople.
“Any novel way to identify spammers makes a huge contribution,” said
Jake Hofman of Yahoo Research. “Even if you just reduce it by a small
percent, that’s a big win.”
Malmgren and Hofman tested their model using data from two groups of
college students: European students from a few years ago, when home
internet access was rare, and American students when home internet
access was much more common. They focused on how frequently the
students were sending e-mails and when the e-mail sessions began and
ended.
Despite the dramatic chronological differences between these
students — at least in the e-mail world — Malmgren found they fell into
one of two categories: “day laborers,” who sent the bulk of their
e-mails during the working day, or “e-mailaholics,” who sent e-mails
from morning deep into the night.
“It was pretty amazing,” said Malmgren. “It didn’t have to be two categories. There could have been a continuum.”
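To make the two categories concrete, here is a minimal Python sketch that labels a sender using nothing but the hours at which e-mails were sent. It is not the researchers' model, which the article does not describe in detail. The function names, the 9-to-5 working day and the 80 percent cutoff are all assumptions chosen for illustration.

```python
from datetime import datetime

# Assumed values for illustration only; none of these come from the study.
WORK_START_HOUR = 9           # start of the "working day"
WORK_END_HOUR = 17            # end of the "working day" (exclusive)
DAY_LABORER_THRESHOLD = 0.8   # share of mail sent in working hours


def working_hours_fraction(timestamps):
    """Return the share of e-mails sent between the assumed working hours."""
    if not timestamps:
        raise ValueError("need at least one timestamp")
    in_hours = sum(
        1 for t in timestamps if WORK_START_HOUR <= t.hour < WORK_END_HOUR
    )
    return in_hours / len(timestamps)


def categorize(timestamps):
    """Label a sender with one of the two categories named in the article."""
    if working_hours_fraction(timestamps) >= DAY_LABORER_THRESHOLD:
        return "day laborer"
    return "e-mailaholic"


# Hypothetical Sent-folder timestamps for two made-up users.
office_user = [datetime(2009, 5, 11, h, 0) for h in (9, 10, 11, 14, 16)]
late_user = [datetime(2009, 5, 11, h, 0) for h in (8, 12, 15, 19, 22, 23)]

print(categorize(office_user))
print(categorize(late_user))
```

A hard cutoff like this forces every sender into one of two bins, so it cannot show whether real users cluster or fall along a continuum; answering that question would require fitting a model to actual data, as the study did.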
The researchers also found that e-mail behavior was stable within
individuals, with fewer than 20 percent of American students deviating
from their e-mailer categories over two years. This stability could
allow an e-mail service to recognize when an account is being
commandeered by a spambot, at which point it can alert the user or
freeze the account.
Hofman imagines numerous applications for analyzing time-related
aspects of internet usage, beyond e-mail, and says this ability to
robustly categorize people shows how powerful their model can be.
“This is just our toy demonstration,” he said. “There’s a lot of
temporal data from e-mails and website visits out there, but they
haven’t been leveraged for any meaningful analysis. The argument we’re
making here is that these data can be a surprisingly useful source of
information about individuals.”
Hofman says the technique could also allow websites to tailor their
services to individuals, as the activity pattern of website visits may
be indicative of a user’s taste.
“It might turn out that I should market Blackberries and iPhones to
users who visit sites more frequently, scattered throughout the day,
like you and me,” he said, “while I should market books and newspapers
to users with lighter usage patterns, like my dad. This could influence
what display or text ads I show these users when they’re on my site.”
A detailed description of activity patterns could also be useful for
heavily trafficked sites, like Twitter, which could optimize how their
servers allocate resources, and internet services that depend on
real-time interactions, like Aardvark.
Source: Wired