It Takes a Village to Stop Spam

By: Seairth Jacobs
Status: Draft, Revision 2
Revision Date: 02/05/2004
Original Date: 01/30/2004

NOTE: If you have any comments, feel free to post them at http://www.seairth.com/blog/industry/80 or e-mail me.

Social Networking

Social networking, among other things, is a game of trust. Generally, within your social network, you put more trust in those who are near you than in those who are not. For instance, you are more likely to believe what a close friend says than what someone your friend merely knows says. However, who you believe (and how much) is not quite that simple. For instance, you may be more inclined to believe a close friend of your close friend than a close friend of an acquaintance. I know all of this is obvious. But what may be less obvious is that we can use this same approach to effectively stop spam.

Applied

The approach is simple...

Suppose you receive an e-mail (though this approach is not limited to e-mail) from "seairth@seairth.com". The e-mail falls into one of two categories: (1) it's from someone you personally know, or (2) it's from someone you don't know. In the first case, you accept the e-mail and there's nothing more to do. In the second case, however, you have a problem. How do you know whether to trust this e-mail?

Suppose that you were to ask each of the people you know "do you know/trust this address?". If one of those people answered "yes", then you could be fairly assured that the address is safe. On the other hand, suppose all those people answer "no". Further, they say "why don't you ask the people I know." In turn, you ask those people, following the same formula as you did when asking the people you knew directly. The only difference is that a "yes" answer would possibly hold less weight. As can be seen, this approach could be repeated until everyone was asked the question. If absolutely no one recognized the address, then you would treat the message as "not trusted".
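
As a minimal sketch of that idea, the following code walks outward ring by ring, asking each person's published list of trusted addresses. The contacts_of lookup and the data it returns are assumptions made for illustration, not part of the proposal itself:

    # Sketch only: contacts_of(person) is a hypothetical lookup returning the
    # set of addresses that person vouches for (e.g. read from a published list).
    def is_trusted(sender, my_contacts, contacts_of, max_hops=5):
        ring = set(my_contacts)          # hop 1: the people I trust directly
        seen = set(ring)
        for hop in range(1, max_hops + 1):
            if sender in ring:
                return True              # someone within `hop` hops vouches for the sender
            # move one hop outward: everyone vouched for by the ring just asked
            next_ring = set()
            for person in ring:
                next_ring |= contacts_of(person)
            ring = next_ring - seen
            seen |= ring
        return False                     # no one within max_hops recognized the address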

Heuristics

Of course, it's not realistic to keep asking everyone until you find a match. And, in fact, you wouldn't want to. The reality is that you wouldn't want to ask more than a fixed number of levels deep, the exact number depending on your particular preferences. For instance, suppose you set a threshold of five levels, meaning that if no one within five hops recognizes the address, the message is considered "not trusted". Others may limit themselves to only three or four hops, or may be more trusting and allow up to six or seven.

Note, however, that increasing the number of hops dramatically decreases performance and scalability. For instance, suppose you limited the search to a maximum depth of 4 hops, and also suppose that the average person listed 10 other people. Counting your own list as the first hop, this means you would have to visit roughly 10^3 (or 1,000) other people's lists to discover whether a person is known or not. Each additional hop multiplies that by ten, so a 6-hop depth would mean up to 100,000 visits just to discover that you do not know someone. If everyone were running on fast computers and networks, this might not be much of an issue. But under current circumstances, this "raw" search approach is simply not acceptable.
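
To make the growth concrete, here is a back-of-the-envelope calculation using the same counting as above (your own list is the first hop); the numbers are purely illustrative:

    def lists_to_visit(max_hops, avg_contacts=10):
        # Rough upper bound on the lists a fruitless search touches, counting
        # your own list as the first hop (as in the figures above).
        return avg_contacts ** (max_hops - 1)

    for hops in (4, 5, 6):
        print(hops, lists_to_visit(hops))    # 4 -> 1000, 5 -> 10000, 6 -> 100000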

To improve search times, at least two additional tactics can be employed. The first is simply a maximum limit on the number of nodes the client will search. For instance, a limit of 1,000 could be set, meaning that you ask no more than 1,000 people in total. The second tactic is to order the search by how many people each contact claims to know, always asking the best-connected people first. In other words, you are more likely to find the person you are looking for if you start with the people who themselves know the most people. The rules would work as follows (a rough sketch in code follows the list):

  1. Sort your list of known people based on the number of people they claim to know, the highest number first.
  2. Ask the first person in your list if he/she knows the person in question. If the answer is "no", get back that person's list of known people, sorted in the same order as your own.
  3. Merge the two lists such that people who know the same number of other people are sorted based on their distance (hop count) from you. If two or more people have the same count and are the same distance, then they can be sorted in any way desired, though the most likely approach will be a traditional queue.
  4. Repeat step 2 until either the person in question is found or other appropriate limits are hit (such as the 1000 max nodes mentioned above).
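
The sketch below illustrates these rules with a priority queue. The foaf_list lookup, the use of claimed-contact counts as hints, and the 1,000-node cap are all assumptions made for the example; the actual limits are up to each user.

    import heapq

    # Sketch only: foaf_list(person) is a hypothetical lookup returning the
    # (address, claimed_contact_count) pairs from that person's published list.
    # my_contacts is the same kind of pairs for the people I know directly.
    def search_by_popularity(sender, my_contacts, foaf_list, max_nodes=1000):
        order = 0        # FIFO tie-breaker (the "traditional queue" of step 3)
        heap = []        # entries: (-claimed_count, hops, order, person)
        for person, claimed in my_contacts:
            heapq.heappush(heap, (-claimed, 1, order, person))
            order += 1
        asked = set()
        while heap and len(asked) < max_nodes:
            _, hops, _, person = heapq.heappop(heap)
            if person in asked:
                continue
            asked.add(person)
            contacts = foaf_list(person)
            if any(addr == sender for addr, _ in contacts):
                return True, hops    # someone `hops` away vouches for the sender
            # merge this person's contacts into the queue: highest claimed
            # count first, then nearest hop distance, then arrival order
            for addr, claimed in contacts:
                if addr not in asked:
                    heapq.heappush(heap, (-claimed, hops + 1, order, addr))
                    order += 1
        return False, None           # not found within the search limits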

The number of hops between you and a person who recognizes the address is not the only way to determine the trustworthiness of a message. For instance, suppose you had multiple thresholds that worked as follows: the farther out a threshold reaches, the more people within it must recognize the address. An example (with purely illustrative numbers) might be: at least one person within two hops, at least three people within four hops, or at least ten people within six hops.

This way, a person can make up for a weak connection by being known by many people with stronger connections. A slightly more advanced approach would be to maintain a separate set of thresholds for each person you know directly. In other words, you might have a stricter threshold set for your co-worker and a less strict one for your spouse.
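
A small sketch of how such thresholds might be checked, using the illustrative numbers from the example above (the structure and values are assumptions, not part of the proposal itself):

    # Sketch only: each entry maps a hop distance to the minimum number of
    # people at or within that distance who must recognize the address.
    THRESHOLDS = {2: 1, 4: 3, 6: 10}

    def meets_thresholds(vouch_distances, thresholds=THRESHOLDS):
        # vouch_distances: hop distance of each person who recognized the address
        return any(
            sum(1 for d in vouch_distances if d <= max_hops) >= needed
            for max_hops, needed in thresholds.items()
        )

    print(meets_thresholds([2]))          # True: one voucher two hops away suffices
    print(meets_thresholds([5, 5, 5]))    # False: three distant vouchers are not enough

The per-person variant mentioned above would simply keep one such table for each direct contact.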

A Self-Correcting System

You may be saying to yourself, "yeah, but if a spammer can get just one person to vouch for him, he's in and we all get spammed." That's true. But the nice thing about this approach is that the system is potentially self-correcting. If the person vouching for the spammer doesn't stop vouching, then the people vouching for that person will likely stop vouching for him instead. This may happen at many different levels, since each person is free to decide how far they want to adjust their limits (including removing someone they previously vouched for). In the end, the person at fault will be effectively treated as a spammer himself. If this was an honest mistake, then it's likely that the person will be able to make amends with those who vouched for him, and they will start vouching for him again. For the same reason, it is unlikely that someone will willingly vouch for a spammer, since they know what the consequences will be.

Likewise, the spammer still cannot get around an individual person's preferences. Even if a spammer does manage to get "inside", if they are too many hops away (and/or not vouched for by enough people, which is likely), the message will still be flagged as "not trusted". And if the spammer did fall within a threshold (and you therefore got spammed), you would also have the chain of people that was followed to establish the spammer's trustworthiness, so you would know where the bad vouching happened.

Implementation

So, how do you implement this whole thing? There are two parts. First, the user maintains a whitelist. This list contains the addresses you trust directly, along with links to the networked files of those people, which you trust enough to query. Second, the user maintains a networked file of their own, which others can query when deciding whether to trust you. A popular format for such a file is FOAF.
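
In code, the per-user state might look something like this (a sketch only; the field names are made up):

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class TrustState:
        # addresses trusted directly, each mapped to that person's networked
        # (FOAF) file URL, or None if they don't publish one
        whitelist: Dict[str, Optional[str]] = field(default_factory=dict)
        # URL of my own published FOAF file, for others to query
        my_foaf_url: str = ""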

In the FOAF file, you list the people you know in one of two ways: either with just their address (or a hash of it, as discussed below), or with their address plus a "seeAlso" link pointing to that person's own FOAF file, so that a search can continue through them.

The whitelist would always include the entries listed in the FOAF file, but it may also include others that are not listed there, and it may omit the seeAlso link for some entries. To be clear, the FOAF file is only for others to access and use. These FOAF files are critical for this approach to work and should be published by everyone.

To facilitate one of the heuristics mentioned above, a new element would be added to the FOAF file for each person listed. This element would be called "knowsHint". It would contain the approximate number of people that the given person claims to know in their own FOAF file. The number is approximate because it is only as current as the last time you retrieved that FOAF file (and made the count). Even so, this hint is quite valuable when deciding which person's FOAF file to process first.
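
As a sketch, the hint could be computed by counting the foaf:knows statements in a contact's file. This example uses the rdflib library for illustration; "knowsHint" itself is the new element proposed here, not part of the standard FOAF vocabulary.

    from rdflib import Graph, Namespace

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")

    def knows_hint(foaf_url):
        # Approximate count of people this contact claims to know, as of the
        # moment their FOAF file is fetched and parsed.
        g = Graph()
        g.parse(foaf_url)
        return sum(1 for _ in g.triples((None, FOAF.knows, None)))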

Each person places their FOAF file on the Web, preferably maintained by an automated process. When the filter needs to validate an address, it should first check its local cache of addresses to see whether it came across that address in a prior search. If not found, it would start traversing the FOAF files (within the limitations of the thresholds), searching for the address. As it did so, it would make sense to update the local cache to avoid repeating future searches. The HTTP caching mechanism could be used to control the freshness of the local cache (just as a web browser does for browsed pages).
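
A minimal version of that cache-first lookup might look like the following; the fixed expiry is a stand-in for the HTTP caching behaviour described above and is purely an assumption:

    import time

    CACHE = {}                     # address -> (trusted?, timestamp)
    MAX_AGE = 24 * 3600            # stand-in for proper HTTP cache control

    def check_address(address, search):
        # search(address) would be the FOAF traversal sketched earlier
        hit = CACHE.get(address)
        if hit and time.time() - hit[1] < MAX_AGE:
            return hit[0]
        trusted = search(address)
        CACHE[address] = (trusted, time.time())
        return trusted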

Ideally, the client portion of this application would be built into the mail client directly, allowing direct use of the native address book and providing a seamless environment. However, it could also be implemented as a "proxy". One advantage of the proxy approach is that one could either run a local copy or use a centralized copy over the Web. The mail client would connect to the proxy, which would in turn connect to the mail server. As the proxy retrieved messages to pass on to the client, it would examine each message and add a flag that the mail client can filter on.
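
For example, the flag could simply be an extra header added to each message on its way through the proxy. The header name below is made up for illustration; any header the mail client can filter on would do.

    from email import message_from_string

    def flag_message(raw_message, is_trusted):
        # Add a filterable header before handing the message to the mail client.
        msg = message_from_string(raw_message)
        msg["X-Trust-Status"] = "trusted" if is_trusted else "not-trusted"
        return msg.as_string()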

Through a web interface, the user would be able to maintain their FOAF file. To facilitate this, the proxy could track the URLs of FOAF files seen in incoming messages. For instance, suppose a new header were defined called X-FOAF-URL, pointing to a FOAF URL associated with the sender of the message. If the proxy saw that header, it would record it for later reference. When the user decides to add someone to their FOAF file, that record would make it easier to fill in the "seeAlso" value (if the user chooses to provide one).
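
A sketch of that tracking step (X-FOAF-URL is the header proposed above, not an existing standard; the in-memory dictionary is just a stand-in for however the proxy remembers things):

    from email import message_from_string, utils

    SEEN_FOAF_URLS = {}            # sender address -> last FOAF URL seen

    def track_foaf_url(raw_message):
        msg = message_from_string(raw_message)
        url = msg.get("X-FOAF-URL")
        sender = utils.parseaddr(msg.get("From", ""))[1]
        if url and sender:
            SEEN_FOAF_URLS[sender] = url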

Odds and Ends

The following are a few odds and ends (in no particular order) that I felt should be mentioned but that didn't specifically fit anywhere above:

  1. Current anti-spam approaches can still be used in conjunction with this technique. For instance, even if a spammer managed to get several messages out, they could still be flagged by your mail client's existing spam controls. Of course, the belief is that there would be few spams left to block in this manner, but no system is perfect.
  2. In FOAF, some people give the SHA1 hash of an address instead of the address itself, reasoning that they don't want to get spammed. This makes perfect sense: you can test whether an address you already have matches the person, but you cannot recover the address directly (the hashing convention is sketched after this list). However, it may be (I really don't know on this one) that once this approach took off and was widely used, it would no longer be necessary to obscure the addresses. Yes, spammers could harvest the e-mail addresses, but they wouldn't be able to do much with them if they couldn't get past the defenses in the first place.
  3. This approach does not address the most obvious weakness: impersonating someone else. In other words, if a spammer sent me an e-mail claiming to be someone I know (or at least someone within my thresholds), nothing here would stop them. However, they are now impersonating someone, which most governments frown upon in a big way. Further, impersonation could be stopped if people started using PGP/GPG/etc. signatures. In short, this approach does not solve impersonation, but it is also not meant to.
  4. As noted at the very beginning, this approach could be used in a variety of ways, not just in e-mail. The same approach might be usable in instant messaging, blog comments, newsgroups, etc. Very little would change for each use except possibly what is considered the "address". Using the proxy approach mentioned above, it would be possible for a single proxy to handle multiple messaging protocols, and therefore share a single whitelist.
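
For reference, the hashing convention mentioned in item 2: FOAF's mbox_sha1sum is the SHA1 hex digest of the full mailto: URI rather than of the bare address. A minimal sketch:

    import hashlib

    def mbox_sha1sum(address):
        # FOAF convention: hash the mailto: URI, not just the address
        return hashlib.sha1(("mailto:" + address).encode("utf-8")).hexdigest()

    # An incoming address can be checked against a published hash without the
    # FOAF file ever revealing the address itself.
    print(mbox_sha1sum("seairth@seairth.com"))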