Eight years of email stats, pass 1

What’s the reality behind the ’email overload’ talk? Let’s look at some numbers… personal numbers.

To kick things off, I’ve got a huge email archive. I started emailing in the early ArpaNet days, around 1972, and haven’t stopped since. My archive has been extremely thorough for at least the past 12 years (and, in case you think I’m nuts for keeping all of these, my actual regret from a scientific/archive perspective is that I don’t have the earlier ones too!). Why? Let’s just say that one day I planned to do an analysis of it all… types of mails, social networks, the whole works. But things got a little out of hand…. (anyone lookin’ for some data, give me a shout… but first read on)…

Most of this ‘storage mania’ was triggered by a casual comment around 1992 or 1993 by Ron Baecker, of the University of Toronto, a longtime research colleague and acquaintance and someone whose work I have long admired and respected. Ron asked me, “given ultra-cheap storage and ultra-fast search, both clearly on their way, why would you ever need either to delete or indeed to accurately file/categorize your emails?”

OK, so as a little personal experiment, I decided to keep ’em, and to see what happened. The quick story is that migrating across machines, operating systems, and preferred email clients, plus being a bit cavalier about the whole thing, has meant that although all the emails are ‘there’ in various archive files, it takes a little work to get ’em all back in a harmonious form, that is with all headers intact and no duplicates (the main formats are Vax mails, Unix mails, Mac Eudora, PC Eudora, Outlook Express, and Outlook).

The longer story, with some data and preliminary analysis, begins like this:

Even though I haven’t had the time or motivation thus far to put in the harmonization work required to get all the data in one format and with duplicates eliminated, I nevertheless thought that a little ‘first pass’ set of totals (with my estimate of their accuracy) would be interesting, and maybe even provide a little coarse empirical support for Stowe’s “Just Say No To Email” campaign.

So I quickly eyeballed-and-tallied the most coherent of the archives, spanning eight years of emails, from January 1st 1997 to December 31st 2004. The totals are real enough, but the ‘eyeballing’ was needed to assess the approximate proportion of spam and duplication involved in the emails. A more detailed analysis later will enable me to estimate these more accurately. I’ve indicated my estimate of the margin for error in the third column, and my estimate of the percentage of spam received in the fourth (and I mean real spam: i.e. either ‘greedily-lookin-for-suckers’ or ‘low-down-mean-and-nasty spam’, not conference announcements – you know what I’m talkin’ about). For 2003, the spam figure is precise, because I filtered off such spam using SpamAssassin, and counted them! 2004 spam numbers are an extrapolation, but the totals are accurate, as explained below. Here goes:

TABLE 1: Eisenstadt’s 1997-2004 email totals

Year	Emails received	Est. Error	Est. Spam
1997	4320	20%	2%
1998	3996	20%	3%
1999	6821	10%	5%
2000	7580	5%	6%
2001	6125	5%	7%
2002	6497	5%	10%
2003	13092	1%	37.6%
2004	13889	1%	40%

2003 is the most accurate, because (unlike earlier years when I was changing clients and machines) I have all emails in one clean format and all spam preserved, auto-filtered by SpamAssassin into a folder that I look at only a few times a year, scanning rapidly for false rejections. Incidentally, that falsely rejected email rate appears to be roughly 1 in 5000: good enough for me! By 2004, although I kept all emails, I got fed up keeping the spam even for analysis purposes, and can’t even be bothered to scan it, so stuff auto-filtered by SpamAssassin is now deleted without my looking at it – so the column 4 ‘40% spam’ in the lower right hand corner is a well-educated approximation based on my observation of the ebb and flow of the size of my ‘deleted’ folder.

It’s interesting that before 2003, I found that I didn’t really need SpamAssassin – the numbers were annoying, but manageable, as the fourth column estimates show. As we go further back in time, I have less patience with the process of harmonizing the data, as I mentioned above, hence the ‘20% error’ estimate… in other words I believe, subjectively of course, that the totals for 1997 and 1998 could be off by roughly 20% either way. That’s the price I pay for doing a quick-and-dirty analysis right now. On the other hand, even with such an analysis, I find the totals illuminating.

What does it all mean?

The totals in Table 1 tell me that the subjective ‘quantum leap in spam’ in 2002/3 that led me to install SpamAssassin as a full-time companion is certainly corroborated by the numbers. There’s simply no other way to cope with the large volume of junk. But now (auto)strip away that nasty spam, and we’re still looking at some scary numbers. Let’s call the emails that are left over, after stripping away the nasty spam, “OK emails” (let’s face it, they are never going to be “GOOD emails”, right?). What we see then is an increase from 5-6K annual “OK emails” in the late nineties (15-ish daily) to 8-9K annual “OK emails” today (25-ish daily). A bright note in all this is that the numbers for 2004 are surprisingly steady compared with 2003, i.e. there’s no exponential growth, even though things are clearly getting ‘intense’.
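For anyone who wants to redo the “OK emails” sums, here is a minimal Python sketch. It simply multiplies each Table 1 total by one minus the estimated spam share; the names `TABLE_1`, `ok_emails` and `ok_by_year` are made up for illustration, and the results are point estimates that ignore the error margins in the third column.

```python
# Table 1 figures: year -> (emails received, estimated spam fraction)
TABLE_1 = {
    1997: (4320, 0.02),
    1998: (3996, 0.03),
    1999: (6821, 0.05),
    2000: (7580, 0.06),
    2001: (6125, 0.07),
    2002: (6497, 0.10),
    2003: (13092, 0.376),
    2004: (13889, 0.40),
}


def ok_emails(total, spam_fraction):
    """Emails left after removing the estimated spam share (rounded)."""
    return round(total * (1 - spam_fraction))


# Point estimate of "OK emails" for each year in Table 1
ok_by_year = {
    year: ok_emails(total, spam) for year, (total, spam) in TABLE_1.items()
}

for year, ok in ok_by_year.items():
    # Dividing by 365 ignores leap years; close enough for a rough tally
    print(f"{year}: {ok} OK emails, roughly {ok / 365:.0f} per day")
```

With the table’s own figures this gives 8169 for 2003 and 8333 for 2004, which is where the ‘steady’ observation comes from.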

25 emails daily (and there are many I know who have WAY more than this) is a lot to deal with, especially since the emails aren’t spread evenly throughout the week. To get to a 25-per-day average, you’re looking at more like 30-40 per working weekday, if you’re the kind of person who switches off at the weekend (ha!). If each email requires 3 minutes of thinking/response time (you’re lucky if you can average that), then you’ve got a guaranteed two hours straight down the tubes every day.
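The back-of-envelope sum can be written out as a small Python sketch. It assumes, purely for illustration, that a full week’s worth of email (seven days at the daily average) gets handled on the five working days; `weekday_load` is a made-up helper name.

```python
DAILY_AVERAGE = 25       # "OK emails" per calendar day, from the text
MINUTES_PER_EMAIL = 3    # thinking/response time per email, from the text


def weekday_load(daily_average, working_days=5):
    """Emails per working day if a 7-day week's mail is handled on weekdays only."""
    return daily_average * 7 / working_days


per_weekday = weekday_load(DAILY_AVERAGE)
hours_per_weekday = per_weekday * MINUTES_PER_EMAIL / 60

print(f"{per_weekday:.0f} emails per working day")
print(f"{hours_per_weekday:.2f} hours per working day")
```

Under that assumption the load is 35 emails and 1.75 hours per working day; at the top of the 30-40 range it reaches the full two hours.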

But wait a minute, “down the tubes” is incorrect: surely your emails involve key interactions, networking, brainstorming, appropriate drudgery and admin, in short what you get paid to do, right? Well, that’s not clear… and requires drilling down a bit deeper into the data.

Digging deeper: a work-week in depth

Table 2 shows a coarse categorization of all 286 emails I received during a Monday-Friday working week in January 2005 (10th-14th to be precise). I break them down into four groups, labelled simply A, B, C, D in the left hand column for ease of reference, along with the specific category label in the middle, and the total number of emails in each category shown in the third column. I also checked every email to see whether it involved some mundane scheduling/timetabling query/response (e.g. “Can you meet with Jones on 13th Feb at 10AM?”), on the hunch that such emails arrived a little too often for my liking. The fourth column shows the number of the emails in column 3 that involved such scheduling interactions (e.g. for row D, the group that includes KMi Management, of the 68 emails received in that group, fully 32 of them were scheduling-related).

TABLE 2: Main categories for 5 workdays of email in January 2005

Group	Summary	Number	Of those, number involving ‘scheduling’
A	Projects, papers, info requests	71	6
B	Blog and site comments and maintenance	73	3
C	Announcements, news, social, family	74	2
D	KMi Management, Gigs, Visitors, Invitations	68	32
TOTAL		286	43

The four main categories A, B, C, D of Table 2 are further subdivided in Table 3, this time preserving the A, B, C, D labels in the left-hand column for cross-referencing with Table 2, but breaking them down into finer categories as shown in the second column (in reality I did this breakdown first, and only later chunked them together to create Table 2, but thought it was easier to present this way).

TABLE 3: Further subdivisions of Table 2

Group	Category	Number
A	Funding bids, new project work requests	17
A	Alerts, requests, lab messages	21
A	Main project work, paper writing	33
B	Blog commentary and queries	40
B	Issues related to ‘popular KMi tools’ (BuddySpace, HitMaps etc)	33
C	Conference and seminar announcements	16
C	Semi-junk, news, domain renewals, etc	14
C	Family / social	30
C	From self and meta (system email bounces etc)	14
D	KMi Management	44
D	Visitors, gig arrangements, etc	24
TOTALS		286

Now what?

So there you have my finer-grained interactions ‘laid bare’. Allowing ZERO minutes of response time for some finer-grained categories (e.g. semi-junk, self/meta, which don’t require reading at all) and ONE to THREE minutes of response time for most categories, plus, say, TEN minutes of response time for an important research category such as ‘main project work, paper writing’, it is trivially easy to get to 2.5 hours per workday assuming a fairly ruthless, ‘one-touch’, knee-jerk email interaction regime. And worse if you deviate from the regime.
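Here is a rough Python sketch of that time budget using the Table 3 counts. The zero-minute and ten-minute categories are the ones named above; applying a single flat figure to every other category is an assumption made for illustration (the text only gives a range of one to three minutes), and `hours_per_workday` is a made-up helper.

```python
# Table 3 counts: (category, emails received in the five-day week)
TABLE_3 = [
    ("Funding bids, new project work requests", 17),
    ("Alerts, requests, lab messages", 21),
    ("Main project work, paper writing", 33),
    ("Blog commentary and queries", 40),
    ("Issues related to popular KMi tools", 33),
    ("Conference and seminar announcements", 16),
    ("Semi-junk, news, domain renewals, etc", 14),
    ("Family / social", 30),
    ("From self and meta", 14),
    ("KMi Management", 44),
    ("Visitors, gig arrangements, etc", 24),
]

# Categories that need no reading at all, and the one that gets ten minutes
ZERO_MINUTES = {"Semi-junk, news, domain renewals, etc", "From self and meta"}
TEN_MINUTES = {"Main project work, paper writing"}


def hours_per_workday(default_minutes, workdays=5):
    """Daily email hours, with a flat per-email time for the 'most categories' bucket."""
    total_minutes = 0
    for category, count in TABLE_3:
        if category in ZERO_MINUTES:
            minutes = 0
        elif category in TEN_MINUTES:
            minutes = 10
        else:
            minutes = default_minutes  # assumed flat value within the 1-3 range
        total_minutes += count * minutes
    return total_minutes / workdays / 60


for m in (1, 2, 3):
    print(f"{m} min for most categories: {hours_per_workday(m):.2f} hours per workday")
```

With two minutes as the middle of the range, this comes to about 2.6 hours per workday, in line with the 2.5-hour figure; the ends of the range give roughly 1.9 and 3.4 hours.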

Then there are other sources of workflow: blogs, aggregator summaries, phone calls (rare, but I still allow one or two), cell-phone, text message, instant messaging (my buddy list is very large, and most of them are work-related).

All of this paints a very very bad picture. Sure, if you’re “in the business” like we are, then that’s the price you pay. But the pace is quickening (I’ve just tallied what we already knew intuitively), and I have little faith or trust right now in intelligent agents being able to solve my overload problems. Just consider the proportion of emails listed above that are scheduling-related! 43 out of 286, that’s 15%! We already have a tool, Meetomatic, that would handle at least half of those, but of course not everyone uses it. And the other half of that subset tend to require awkward interactions and judgement calls that no delegated agent, human or artificial, can actually cope with.
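The 15% figure is easy to check from Table 2. A minimal Python sketch, with `TABLE_2` as a made-up name for the table’s counts:

```python
# Table 2: group -> (emails in group, of those involving scheduling)
TABLE_2 = {
    "A": (71, 6),
    "B": (73, 3),
    "C": (74, 2),
    "D": (68, 32),
}

total_emails = sum(count for count, _ in TABLE_2.values())
total_scheduling = sum(sched for _, sched in TABLE_2.values())
scheduling_share = total_scheduling / total_emails

print(f"{total_scheduling} of {total_emails} emails: {scheduling_share:.0%} scheduling-related")

# Share within each group, to see where the scheduling traffic concentrates
for group, (count, sched) in TABLE_2.items():
    print(f"Group {group}: {sched / count:.0%}")
```

The per-group breakdown shows how lopsided it is: nearly half of group D is scheduling, against single-digit percentages everywhere else.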

We’re entering an era in which something that Stowe has often written about is going to become an essential skill: “continuous partial attention.” I thought I was pretty good at it, but I am slowly-but-surely observing everyone around me slipping into a kind of cognitive quicksand, getting increasingly grumpy and stressed out, and I don’t like it.

As I was putting together this entry, I noticed that The New York Times has an article this week on email-overload and related attentional problems (free subscription required). The research described there is interesting, but falls into the trap I refer to above, of requiring agents that I personally would not trust to handle my attentional needs. Stanford University’s Donald Knuth opted out of email many years ago – what a visionary!
