One truism in forensics is that every contact leaves evidence. That's true at a crime scene in the real world and also true on the Internet. You might think you can get away with posting an anonymous message somewhere, or even sending an anonymous e-mail via a bogus Web mail account, or perhaps writing anonymous malicious code. You can't. Not entirely. At this year's Black Hat Briefings in Las Vegas, Dr. Neal Krawetz, of Hacker Factor Solutions, demonstrated how he and others have started to use nonclassical digital forensics techniques. By analyzing the words used or the keyboard characters typed, he can tell a lot about these supposedly anonymous online authors.
Surf anonymously? Think again
Recently I reviewed Torpark, an Internet browser designed to disguise your originating IP address. Whenever you connect to the Internet, your ISP assigns you an IP address. This IP address tells sites such as Google what country you're in. This is important if you live in a country where free speech is tightly controlled. Torpark, which is based on Mozilla's Firefox 1.5, uses a worldwide network of encrypted routers to randomly choose a different IP address for you. So, when you launch Torpark, you're likely to see the Firefox-like browser default to Google Denmark instead of the Google U.S. screen.
Torpark is designed primarily to keep your online search requests from being censored or subpoenaed in the future by some court. But even if you were to use Torpark with malicious intent, to hide your IP address from law enforcement, the e-mails you send, the posts you make, even the code you upload still say a lot about you--probably more than you intended. As hard as it is to change or alter the whorls of your physical fingerprints, it is just as hard to alter the way we think about and use language or the way our fingers naturally hit the keyboard. Dr. Krawetz showed how information about gender, country of origin, handedness, and even whether or not you play the guitar can all be determined from sample text.
Vocabulary and gender
In his talk, Dr. Krawetz mentioned several studies of gendered keyword use: when certain words are weighted--with specific numerical values for male and different numerical values for female--the totals can indicate the gender of the author. It sounds too simple to be true, but research (including Gender, Genre, and Writing Style in Formal Written Texts by Shlomo Argamon, et al., and Sexed Texts by Charles McGrath) has shown that some words are more likely to be written by one gender or the other. In informal writing, men are more likely to write "some," "this," and "as," while women are more likely to write "actually," "everything," and "because." In formal writing, men write "around," "more," and "what," while women write "if," "with," and "where." By determining the point totals in a given document, Dr. Krawetz can predict the gender of the author.
Dr. Krawetz admits upfront that this method is only 60 to 70 percent accurate, but it is far better than guessing, which is only 50 percent accurate. He further cautions that text including citations from poetry, quotes from others, and even the influence of copy editors on the original can all skew the results. It is best to collect a large number of examples, then average the point totals.
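The weighted-keyword idea can be sketched in a few lines of Python. The words below are the informal-writing examples quoted above. The weights are placeholders invented for illustration, since the values used by the actual tools are not given here, and the function names score_text and guess_gender are hypothetical.

```python
import re

# Informal-writing keywords quoted in this column. The numeric weights are
# made-up placeholders, NOT the values used by any real tool.
MALE_WEIGHTS = {"some": 1, "this": 1, "as": 1}
FEMALE_WEIGHTS = {"actually": 1, "everything": 1, "because": 1}


def score_text(text):
    """Return (male_points, female_points) for one writing sample."""
    words = re.findall(r"[a-z']+", text.lower())
    male = sum(MALE_WEIGHTS.get(w, 0) for w in words)
    female = sum(FEMALE_WEIGHTS.get(w, 0) for w in words)
    return male, female


def guess_gender(samples):
    """Average the point totals over several samples, then compare them."""
    if not samples:
        return "unknown"
    scores = [score_text(s) for s in samples]
    avg_male = sum(m for m, _ in scores) / len(scores)
    avg_female = sum(f for _, f in scores) / len(scores)
    if avg_male == avg_female:
        return "unknown"
    return "male" if avg_male > avg_female else "female"


samples = [
    "Actually, everything went wrong because I was late.",
    "I left because everything was closed.",
]
print(guess_gender(samples))
```

Averaging over several samples, as recommended above, keeps a single quoted passage or heavily edited text from deciding the result on its own.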
Who are you--really?
The New Yorker cartoon states: On the Internet, nobody knows you're a dog. Well, Dr. Krawetz has applied his tests to several Web site examples. The most interesting case study involved soft-porn sites where he determined that a few of the "women" who write fantasies for men tested out to be men. So much for believing all those sultry blogs. Want to see for yourself? Dr. Krawetz's Gender Guesser is available online from Hacker Factor Solutions.
By analyzing text, Dr. Krawetz can also learn something of a person's nationality. For example, when analyzing multiple documents, patterns emerge in word choice, punctuation, and sentence length. Americans choose words from a small core vocabulary, while Europeans draw from a much larger vocabulary and use alternative spellings for some equivalent American English words. Australians are a hybrid, using a smaller core vocabulary but choosing European spellings. Shorter sentences with simple punctuation are more likely to be American, and longer sentences with complex punctuation are more likely to be European.
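As a rough illustration of the kind of features described above, the Python sketch below measures average sentence length, vocabulary size, and spelling choices. It only extracts numbers and does not attempt a verdict, because no thresholds are given here. The spelling list is a tiny sample chosen for illustration, and the function name style_features is hypothetical.

```python
import re

# A few well-known spelling pairs (American -> European/British). This is a
# tiny illustrative sample, not an exhaustive dictionary.
SPELLING_PAIRS = {
    "color": "colour",
    "center": "centre",
    "organize": "organise",
    "defense": "defence",
}


def style_features(text):
    """Extract simple features: sentence length, vocabulary size, spellings."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z]+", text.lower())
    european = set(SPELLING_PAIRS.values())
    return {
        "avg_sentence_words": len(words) / len(sentences) if sentences else 0.0,
        "distinct_words": len(set(words)),
        "american_spellings": sum(w in SPELLING_PAIRS for w in words),
        "european_spellings": sum(w in european for w in words),
    }


print(style_features("The colour is red. The centre is far."))
```

Features like these only become meaningful across many documents by the same author, which is why the analysis above relies on multiple samples.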
What randomness can teach us
Most of us don't publish in large online publications, but blogs and even online chats can also reveal a lot about who we are. Out of frustration, have you ever typed and posted some gibberish in an online post or chat? Like this: asdfasjfdj. Your dominant hand is likely to hit more adjacent keys faster than your other hand. In this example, assuming the author used a standard QWERTY keyboard, you might determine that the author--myself, in this case--is left-handed, which would be true. Research shows that the ratio of right-handed individuals to left-handed is roughly a 70/30 split worldwide.
Again some disclaimers: This method is also only 70 percent accurate and doesn't include such factors as whether the person is an online game player, has carpal tunnel syndrome, or is employed as a typist (or at a job where typing is required).
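Here is a minimal Python sketch of the handedness guess. It simplifies the idea to counting which hand struck more keys, and it assumes a standard QWERTY layout with conventional touch-typing hand assignments. The function name guess_handedness is hypothetical, and this is not Dr. Krawetz's actual method.

```python
# Letter keys typed by each hand under conventional QWERTY touch-typing.
LEFT_HAND = set("qwertasdfgzxcvb")
RIGHT_HAND = set("yuiophjklnm")


def guess_handedness(gibberish):
    """Guess the dominant hand as the one that struck more keys."""
    chars = gibberish.lower()
    left = sum(c in LEFT_HAND for c in chars)
    right = sum(c in RIGHT_HAND for c in chars)
    if left == right:
        return "unknown"
    return "left" if left > right else "right"


print(guess_handedness("asdfasjfdj"))  # left
```

For the sample string above, eight of the ten keystrokes fall under the left hand, which agrees with the left-handed reading.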
Drumming for fun
But here's where it really gets fun. Try drumming your fingers on the keyboard. Does the pattern radiate out from the center of the keyboard or in from the outside? Research shows that the Out to In ratio is also a 70/30 split. And if one hand is mostly In and the other is mostly Out, the result is asymmetrical drumming, which might mean the author is a musician (usually a piano or a stringed instrument player); when no clear pattern emerges from this drumming exercise, that can be seen as a possible sign of Attention Deficit Disorder. Also, if the letters are adjacent to each other, such as in my above example, we can infer that I'm using a QWERTY-based keyboard, and not a Dvorak keyboard. Dr. Krawetz can further tell whether a person is using an ergonomic setup or not by which rows of keys are used more.
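The drumming test can be approximated in Python as well. The sketch below looks only at the QWERTY home row, treats the gap between "g" and "h" as the center of the keyboard, and counts whether each hand steps toward the center or away from it. Those simplifications, and the function names drum_directions and is_asymmetrical, are assumptions made for illustration.

```python
# Home-row columns on a QWERTY keyboard, numbered left to right.
# The center of the keyboard lies between "g" (4) and "h" (5).
COLUMN = {k: i for i, k in enumerate("asdfghjkl;")}


def drum_directions(text):
    """Count inward and outward steps for each hand on the home row."""
    counts = {"left": {"in": 0, "out": 0}, "right": {"in": 0, "out": 0}}
    keys = [c for c in text.lower() if c in COLUMN]
    for prev, cur in zip(keys, keys[1:]):
        a, b = COLUMN[prev], COLUMN[cur]
        if a == b or (a <= 4) != (b <= 4):
            continue  # same key, or the step crosses from one hand to the other
        hand = "left" if a <= 4 else "right"
        # Left hand moves inward when the column rises; right hand when it falls.
        inward = b > a if hand == "left" else b < a
        counts[hand]["in" if inward else "out"] += 1
    return counts


def is_asymmetrical(counts):
    """True when one hand drums mostly in and the other mostly out."""
    def mostly_in(hand):
        return counts[hand]["in"] > counts[hand]["out"]

    def mostly_out(hand):
        return counts[hand]["out"] > counts[hand]["in"]

    return (mostly_in("left") and mostly_out("right")) or (
        mostly_out("left") and mostly_in("right")
    )


print(is_asymmetrical(drum_directions("asdfjkl;")))
```

In "asdfjkl;" the left hand rolls toward the center while the right hand rolls away from it, so the sample counts as asymmetrical drumming under these assumptions.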
All this might sound crazy, but check out Dr. Krawetz's slides from his Black Hat talk where he uses the above techniques in a few real-world hacking examples. With one case study, he attempts to identify "TheUntouchable," a member of DutchMafia, a phishing group that has since disbanded. In the end, Dr. Krawetz narrowed down the possible candidates for TheUntouchable from a list that included everyone on the planet to a single profile: a right-handed male, possibly a musician, a description that matches only 3.5 percent of the population. It wasn't a direct match, but it's better than nothing. Dr. Krawetz's Gender Guesser is available online from Hacker Factor Solutions, or try the Gender Genie by BookBlog.
You can run but...
Research such as this should help investigators zero in on online criminals. Virus writers have in the past enjoyed some anonymity--until they bragged about their exploits in online chats. Now investigators have a means of fingerprinting text and code.
Have you ever posted anything anonymously online? Comments on a column you've read, for example. Talk back to me.