Archive for Projects

Semantic Similarity in Social Media

We won the SemEval 2015 shared task on detecting semantic similarity in Twitter. The goal was to build a system capable of judging whether two texts express the same or very similar meaning. This task is complicated by the informal vocabulary and syntax employed online. Our novel approach generalized well to new topics, perhaps because our recurrent neural network is equipped with both string-matching metrics and pre-trained models of word meaning learned from large collections of online text.
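To give a feel for the two kinds of evidence mentioned above, here is a minimal sketch that computes a few string-matching scores and one word-vector score for a pair of texts. It is not our submitted system: there is no recurrent network here, the function names are made up for illustration, and the tiny TOY_VECTORS table is a hypothetical stand-in for vectors pre-trained on a large collection of online text.

```python
import math
import re

# Hypothetical toy word vectors (an assumption for illustration only).
# A real system would load vectors pre-trained on large amounts of text.
TOY_VECTORS = {
    "movie": [0.9, 0.1, 0.0],
    "film": [0.8, 0.2, 0.1],
    "great": [0.1, 0.9, 0.0],
    "awesome": [0.2, 0.8, 0.1],
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard(tokens_a, tokens_b):
    """Share of distinct tokens that the two texts have in common."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def char_ngrams(text, n=3):
    """Set of character n-grams over the normalized text."""
    normalized = " ".join(tokenize(text))
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def dice(set_a, set_b):
    """Dice coefficient between two sets."""
    if not set_a and not set_b:
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def mean_vector(tokens, vectors):
    """Average the vectors of known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_features(text_a, text_b, vectors=TOY_VECTORS):
    """Combine string-matching and word-vector scores for a text pair."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    return {
        "token_jaccard": jaccard(tokens_a, tokens_b),
        "char_trigram_dice": dice(char_ngrams(text_a), char_ngrams(text_b)),
        "embedding_cosine": cosine(
            mean_vector(tokens_a, vectors), mean_vector(tokens_b, vectors)
        ),
    }


if __name__ == "__main__":
    print(similarity_features("Great movie", "awesome film"))
```

For "Great movie" versus "awesome film" the two string-matching scores are zero because the texts share no words or character trigrams, while the word-vector score is high. That contrast is why it can help to give a model both kinds of signal.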

We describe our approach in this paper:

Discriminating Non-Native English with 350 Words

We were named co-winner of the Native Language Identification shared task at NAACL’s 2013 BEA-8 workshop! The task was to identify an author’s native language based on a short English essay. Our system was 83% accurate when reading, on average, 348 words of English and selecting a native language from the set of Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. We spent 3 weeks developing our submission, and our result was statistically tied for first place among the 66 submissions from 29 teams.

The paper is to be published in June 2013. More info at:

PHAT: Monitoring Obesity Trends with Social Data

Obesity accounts for 21% of US healthcare costs, even as billions of dollars are spent on prevention. We equip policy makers with tools to evaluate these investments, supplementing traditional surveillance by making real-time measurements of social media to infer behavior among Americans. We combine our predictions of gender, age, and location of social media users with surveys from the CDC, and identify social media predictors of real-world behavioral trends.

Pinocchio: Efficient Search for Information Campaigns

Social media isn’t “one man, one vote.” Deception is commonplace, meaning analyses of social data are often built on biased evidence. Unfortunately, the sheer volume of messaging on Twitter & Facebook allows adversaries to hide in plain sight. We eschew standard classification techniques and use probabilistic algorithms that uncover hidden groups of accounts engaged in misuse of the network. This novel approach has been used to discover spammers and participants in information campaigns who, in some cases, have been highly active on Twitter for several years.

Our Pinocchio software analyzes millions of authors to automate detection of deceptive identities. On Twitter, accounts we flag are 2x more likely to eventually be suspended for violating the Terms of Service. However, Twitter’s Trust & Safety team detects only 1/20th of our flagged accounts, which can make up >15% of traffic on a given topic.

More information is available here. Get in touch if you’d like to hear more.


Predicting Demographics on Twitter

“When you tweet — even if you tweet under a pseudonym — how much do you reveal about yourself? More than you realize, argues a new paper from researchers at the MITRE Corporation. The paper, “Discriminating Gender on Twitter,” which was presented at the Conference on Empirical Methods in Natural Language Processing, demonstrates that machines can often figure out a person’s gender on Twitter just by reading their tweets.”

More at:

Paper here: