Blog para proyectantes de Dani Gayo - Les Liaisons dangereuses
HOT!, Articles
Saturday, January 15, 2011, 01:06 AM
Update (January 18, 2011): A few bloggers published some posts on the topic. José López Ponce (in Spanish), Dorjival Silva (in Portuguese), Aymeric Pontier (in French), Dominique Desaunay (in French), and iriospark (in Italian).

Update (January 17, 2011): It also appeared in the print version of Le Monde (PDF).

Update (January 15, 2011): Le Monde (French newspaper) has featured this study.

More than a year ago I wrote in critical terms about MIT's Gaydar project, the aim of which was to determine whether a Facebook user is gay or not from his/her contacts. A paper reporting the results was later published and I cannot help but recommend it.

Needless to say, I found the study disturbing but, even more disturbingly, in a short time I was up to my neck in very similar work. My original purposes were two-fold: to find close communities in Twitter, and to find interesting users to recommend. Preliminary work was done and, at that point, I found that detailed demographic information about users could be very valuable (to test the goodness of the detected communities and to improve the recommendations). Facebook already knows that, but Twitter profiles are not that detailed.

Thus, I started working on inferring users' attributes from the known attributes of their neighbors. After all, I already knew the sex, age, and location for a number of users (see the Appendix of this paper). Could it be possible to determine those values for unlabeled users? Reluctantly, I also turned to other attributes which are generally considered sensitive: political and religious beliefs, sexual orientation, and race/ethnicity.

As you probably know, Facebook asks for all of them (except for race/ethnicity) and many people provide them. I'm not much of a Facebook user, so I don't care a lot about the information in my friends' profiles but, because of this study, I checked several Facebook profiles: a number of them include their political and religious beliefs and a few include what they are "interested in" (i.e. Facebook's euphemism for sexual orientation).

I was not really shocked nor worried about this but then I wondered: what if I had something to conceal (from my friends, my family, my employer, or, gasp!, my government)? Guilt-by-association algorithms are becoming quite popular, and the so-called "War on Terror" is driving western democracies to take previously unseen measures against their own citizens' privacy and liberties. As someone with a near 100% chance of being subjected to "random" security checks in airports (I still don't know why), I could not help being a little touched by the matter.

Thus, I prepared a labeled Twitter dataset by applying pattern-matching to user bios for all of the aforementioned attributes. Yes, even the sensitive ones, and yes, I felt a bit awkward.
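Purely as an illustration of the kind of pattern-matching I mean (the attribute names and regular expressions below are hypothetical, not the ones actually used in the study):

import re

# Hypothetical patterns, only to illustrate labeling users from the
# self-descriptions in their bios; the real study used its own lists.
BIO_PATTERNS = {
    "sex:female": re.compile(r"\b(mother|wife|girl|woman)\b", re.I),
    "sex:male": re.compile(r"\b(father|husband|guy|man)\b", re.I),
    "politics:conservative": re.compile(r"\b(conservative|republican)\b", re.I),
    "politics:liberal": re.compile(r"\b(liberal|progressive|democrat)\b", re.I),
    "religion:christian": re.compile(r"\b(christian|catholic)\b", re.I),
}

def labels_from_bio(bio):
    """Return every attribute label whose pattern matches the bio."""
    return {label for label, pattern in BIO_PATTERNS.items() if pattern.search(bio)}

print(labels_from_bio("Proud mother, wife and conservative."))
# {'sex:female', 'politics:conservative'}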

Then, I re-adapted the work I was doing in community detection/user recommendation to perform user profiling and applied the algorithm to my Twitter user graph (I mentioned it in the past: 1.8M English-speaking users) using 80% of the labeled users. The label assignments were then checked against the remaining 20% and against users appearing in the WeFollow user directory.
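The preprint linked below describes the actual profiling algorithm; as a naive sketch of the guilt-by-association idea it builds on, one could simply assign each unlabeled user the majority label among his/her already-labeled contacts and repeat for a few rounds:

from collections import Counter

def propagate_labels(neighbors, labels, iterations=3):
    """Naive guilt-by-association: give each unlabeled node the most
    common label among its neighbors, repeating a few rounds so that
    newly assigned labels can spread further. `neighbors` maps a user
    to a list of contacts; `labels` maps some users to a seed label."""
    labels = dict(labels)  # do not clobber the seed labels
    for _ in range(iterations):
        new_labels = {}
        for user, contacts in neighbors.items():
            if user in labels:
                continue
            votes = Counter(labels[c] for c in contacts if c in labels)
            if votes:
                new_labels[user] = votes.most_common(1)[0][0]
        if not new_labels:
            break
        labels.update(new_labels)
    return labels

# Toy graph: 'dave' has two contacts labeled 'liberal' and one 'conservative'.
graph = {"dave": ["ann", "bob", "carol"], "ann": ["dave"], "bob": ["dave"], "carol": ["dave"]}
seeds = {"ann": "liberal", "bob": "liberal", "carol": "conservative"}
print(propagate_labels(graph, seeds))  # dave -> 'liberal'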

The results? Rather shocking.

Detection of religious and political beliefs, sexual orientation, and race/ethnicity achieved above 95% precision. Sex and age achieved poorer results but, still, they were much more precise than a random classifier.

The implications are, IMHO, important (and a little scary): it means that simple algorithms can be used to label people, the most promising assignments can be manually checked and, hence, used to bootstrap the next iteration. Besides, we the users are doing most of the work by telling about ourselves; we are providing the labels, happily, for free, for anyone.

Maybe you are aware of this and don't mind discussing your beliefs and personal choices; fine. But by doing that, those of your friends and acquaintances who conceal such information are put at risk.

Thus, after visiting the dark side, I turned to studying active measures users can adopt to avoid privacy risks due to data mining. At this moment I only have an outline but, hopefully, it can be developed into a full-fledged prototype during 2011.

What's the moral of this? The old saying "You are known by the company you keep" is absolutely true, so don't tell anybody who your friends are.

By the way, should you be interested in the full details of the study you can check this preprint:

"All liaisons are dangerous when all your friends are known to us"
Online Social Networks (OSNs) are used by millions of users worldwide. Academically speaking, there is little doubt about the usefulness of demographic studies conducted on OSNs and, hence, methods to label unknown users from small labeled samples are very useful. However, from the general public's point of view, this can be a serious privacy concern. Thus, both topics are tackled in this paper: First, a new algorithm to perform user profiling in social networks is described, and its performance is reported and discussed. Secondly, the experiments --conducted on information usually considered sensitive-- reveal that by just publicizing one's contacts, privacy is put at risk and, thus, measures to minimize privacy leaks due to social graph data mining are outlined.


As always I'd be happy to hear your comments. Tweet me at @pfcdgayo.




Tuesday, January 11, 2011, 07:51 PM
This morning I discussed with a couple of friendsters (i.e. Twitter friends) different methods to determine the language (natural language, that is) in which a tweet is written. A pretty obvious thing to do is to rely on a dictionary: the larger the dictionary, the better.

However, which words should you use in case you want a really short dictionary? Those most likely to appear in any conceivable text. Words such as "the", "of", "with" in English, or "el", "un", "de" in Spanish would be a good choice. Such words are called stop words and there are several lists available for different languages (e.g. English, French, Spanish, etc.).
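As a minimal sketch of this dictionary-based approach (the tiny stop-word lists below are hand-picked for illustration; real lists contain a few hundred entries per language):

# Tiny, hand-picked stop-word lists just for illustration.
STOP_WORDS = {
    "english": {"the", "of", "and", "with", "to", "in", "is"},
    "spanish": {"el", "la", "un", "de", "y", "con", "en", "es"},
    "french": {"le", "la", "un", "de", "et", "avec", "en", "est"},
}

def guess_language(text):
    """Pick the language whose stop-word list covers most tokens."""
    tokens = text.lower().split()
    scores = {lang: sum(t in words for t in tokens)
              for lang, words in STOP_WORDS.items()}
    return max(scores, key=scores.get)

print(guess_language("the cat sat on the mat with a hat"))  # english
print(guess_language("el gato se sentó en la alfombra"))    # spanish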

Nevertheless, this dictionary-based method is not error free. For instance, "de" is not only a stop word for Spanish but also for Portuguese and Dutch.

Moreover, I argued that the method is especially problematic when trying to identify very short texts --especially if they are written using "imaginative" grammar and spelling. For instance, this real tweet is in English but, except for "amazing", none of its words would appear in a (sensible) dictionary:


@justinbieber omg Justin bieber ur amazing lol : )



Therefore, I argued that using n-grams could be a more robust approach and, besides, the model could be trained on different data once we are sure the tweets are actually written in a given language.

So, first of all, what's an n-gram? A subsequence of n successive characters extracted from a given text string. For example, in the previous tweet we'd find the following 3-grams:
@ju
jus
ust
sti

...

lol
ol 
l :
 : 
: )

N-grams can be obtained for texts of any length and, thus, the underlying idea is to collect a list of n-grams (ideally with their relative frequency or, even better, their use probability) from a collection of documents.

Ideally, the collection should be similar to the documents you are to identify; that is, if you are going to classify tweets you shouldn't train on Shakespeare's works. However, you will probably use whatever documents you can find (for this post I've used the text of "The Universal Declaration of Human Rights").
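A possible sketch of both steps --extracting character n-grams and turning a training text into a frequency profile-- trained here on a single sentence of the Declaration just for illustration:

from collections import Counter

def char_ngrams(text, n=4):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profile(corpus, n=4):
    """Map each n-gram in a training corpus to its relative frequency."""
    counts = Counter(char_ngrams(corpus, n))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}

# A real profile would use the whole document (or, better, a corpus of tweets).
english = build_profile("all human beings are born free and equal in dignity and rights")
print(sorted(english, key=english.get, reverse=True)[:5])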

Then, for any document you want to classify you just need to obtain a similar n-gram vector and compute its similarity to each language model (e.g. cosine, Jaccard, Dice, etc.).
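As a sketch of that similarity step (cosine over sparse n-gram frequency vectors; the toy profiles below are made up):

import math

def cosine(profile_a, profile_b):
    """Cosine similarity between two sparse n-gram frequency profiles."""
    common = set(profile_a) & set(profile_b)
    dot = sum(profile_a[g] * profile_b[g] for g in common)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy profiles; in practice these would come from build_profile above.
doc = {"the ": 0.4, "he d": 0.3, "dino": 0.3}
english = {"the ": 0.2, "and ": 0.2, "ing ": 0.1, "he d": 0.05}
spanish = {" el ": 0.2, "de l": 0.2, "ado ": 0.1}
print(cosine(doc, english) > cosine(doc, spanish))  # True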

Needless to say, when the document to classify is very short (such as a tweet) most of the n-grams appearing within the document are going to be unique and, thus, awkward results can be obtained. If you are performing language identification on such short texts it's much better to just count the number of the short text's n-grams that appear in each language model and choose the language with the largest coverage.
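A sketch of that coverage-based classifier, under the assumption that each language model is just the set of n-grams seen in its training data:

def char_ngrams(text, n=4):
    # Same helper as above.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def classify_short(text, language_models, n=4):
    """Pick the language whose model contains most of the text's n-grams.
    `language_models` maps language names to sets of known n-grams."""
    grams = char_ngrams(text, n)
    coverage = {lang: sum(g in model for g in grams)
                for lang, model in language_models.items()}
    return max(coverage, key=coverage.get), coverage

# Toy models built from a single training sentence each.
models = {
    "english": set(char_ngrams("when she awoke the dinosaur was still there")),
    "spanish": set(char_ngrams("cuando despertó el dinosaurio todavía estaba allí")),
}
print(classify_short("the dinosaur was still hungry", models))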

For instance, let's take the following short texts:

(German, 48 4-grams) Als er erwachte, war der Dinosaurier immer noch da.
(Galician, 44 4-grams) Cando espertou, o dinosauro aínda estaba alí.
(Spanish, 51 4-grams) Cuando despertó, el dinosaurio todavía estaba allí.
(Basque, 43 4-grams) Esnatu zenean, dinosauroa han zegoen oraindik.
(Catalan, 46 4-grams) Quan va despertar, el dinosaure encara era allà.
(English, 43 4-grams) When [s]he awoke, the dinosaur was still there.

Using the model I've built, each of the texts has the following significant intersections:

Als er erwachte, war der Dinosaurier immer noch da. => German, 27 common 4-grams
Cando espertou, o dinosauro aínda estaba alí. => Portuguese, 18 common 4-grams
Cando espertou, o dinosauro aínda estaba alí. => Galician, 17 common 4-grams
Cuando despertó, el dinosaurio todavía estaba allí. => Spanish, 21 common 4-grams
Cuando despertó, el dinosaurio todavía estaba allí. => Asturian, 20 common 4-grams
Esnatu zenean, dinosauroa han zegoen oraindik. => Basque, 17 common 4-grams
Quan va despertar, el dinosaure encara era allà. => Catalan, 21 common 4-grams
Quan va despertar, el dinosaure encara era allà. => Spanish, 20 common 4-grams
Quan va despertar, el dinosaure encara era allà. => Asturian, 20 common 4-grams
When [s]he awoke, the dinosaur was still there. => English, 15 common 4-grams

If we choose the language with the largest intersection, each text is classified as follows:
Als er erwachte, war der Dinosaurier immer noch da. => German, Correct!
Cando espertou, o dinosauro aínda estaba alí. => Portuguese, Incorrect, but a near miss
Cuando despertó, el dinosaurio todavía estaba allí. => Spanish, Correct!
Esnatu zenean, dinosauroa han zegoen oraindik. => Basque, Correct!
Quan va despertar, el dinosaure encara era allà. => Catalan, Correct!
When [s]he awoke, the dinosaur was still there. => English, Correct!

Another advantage of using n-gram models is that they decay gracefully.

For instance, classifying a short text written in Galician as Portuguese is rather acceptable.

Or let's take this text:

"Hrvatski jezik skupni je naziv za standardni jezik Hrvata, i za skup narjecja i govora kojima govore ili su nekada govorili Hrvati."


It's actually Croatian, but since I did not train my system on Croatian samples it's classified as Serbian which, again, is reasonable.

In addition to this (hopefully) explanatory post, I've developed a bit of source code. You can try the demo and download the source code and data files (it's PHP, so proceed at your discretion).

As usual, if you want to discuss something on this post, just tweet me at @pfcdgayo



HOT!
Friday, December 10, 2010, 11:15 AM
It has been a long time since my last entry in this blog: sorry, I've been pretty busy. In fact, part of that time was devoted to a particularly interesting idea that was co-developed with David Brenes, Diego Fernández, María Fernández, and Rodrigo García.

To make a long story short, the idea was to develop a physical metaphor for influence in Twitter and check its goodness. Sure, we are aware that influence is a rather elusive concept and that there already exist a number of ways to compute authority/centrality/clout/etc. in Twitter; nevertheless, there are some juicy novelties in our approach.

First of all, we have completely disregarded the user graph; i.e., all of the computation is performed by just using the tweets and, hence, it is amenable to real-time application. In that sense, our approach is quite different from PageRank or TunkRank.

Second, because this new method can be applied in (almost) real time, users' scores are truly dynamic and not just periodically recomputed.

The implications of this are clear: a picture of the evolution of a user's influence is much more informative than just a position within a ranking. Think, for instance, of the possibilities for brands, marketing campaigns, and public relations.

Regarding the implementation details: if you are thinking of citations, you are getting warmer. We use Twitter mentions, but in a rather novel way: we treat them as a "force" able to "move" users, while a user's number of followers is his "mass".

That way, and taking into account the unavoidable "friction", we can compute both "acceleration" and "velocity" for every user in the Twitter stream.

Therefore, in such a context, "velocity" is a proxy for a user's influence while "acceleration" provides a way to detect "trending users", i.e. those users who are gaining influence so fast that they must be involved in "something".
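The paper contains the actual formulation; just to make the metaphor concrete, a toy update rule could look like the following sketch, where the friction term, the time step, and the constants are my own assumptions and not necessarily those of the paper:

def update_user(velocity, followers, mentions, friction=0.1, dt=1.0):
    """Toy update for one time step: mentions act as a force, followers as
    mass, and a friction term proportional to the current velocity slows
    users down once they stop being mentioned. Returns (acceleration, velocity).
    All constants are illustrative only."""
    mass = max(followers, 1)  # avoid division by zero
    force = mentions - friction * velocity
    acceleration = force / mass
    velocity = velocity + acceleration * dt
    return acceleration, velocity

# A user with 1000 followers mentioned 50 times in this time window...
a, v = update_user(velocity=0.0, followers=1000, mentions=50)
print(round(a, 3), round(v, 3))  # ...gains a little velocity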

If you are a connoisseur of Twitter you are probably aware of the so-called "Velocity and Acceleration" model developed by Jason Harper at Organic.

No details about that model have been disclosed and, so, little can be said about its relation to our method. Nevertheless, given that --unlike our approach-- Organic's model is not focused on user influence, it may very well differ in many details from our model.

For the full description of the method you can read the paper "Retibus socialibus et legibus momenti -- On social networks and the laws of influence".

If you want a palatable (but thorough) introduction please check these slides.



You'll see it is a simple and elegant solution, and you are welcome to comment on anything you find interesting/intriguing about the model and, of course, we encourage you to implement it.

Oh, and don't forget to tweet about this if you like it! Besides, you can reach us at @pfcdgayo (that's me), @brenes, @littlemove (Diego), @mimiru (María), and @rochgs (Rodrigo).




Friday, June 18, 2010, 10:15 AM
Predicting elections from Twitter data is becoming a red hot topic of research.

In a previous post I discussed some recent (and serious) studies in the field. More recently, there has been some press coverage such as this article by New Scientist: "Blogs and tweets could predict the future". In addition to that, some reports claim to have (rather accurately) predicted elections in the United Kingdom and Belgium by merely counting the number of mentions each candidate had received on Twitter (1, 2, 3).

In that previous post I tried to argue why it is difficult. In this one I'll just try to provide a headline to bring some balance to the Force: "Why you cannot (always and consistently) predict elections from Twitter". Hence, by just analyzing the public Twitter stream with regard to one particular election:

(1) You have missed:
(2) You have taken into account:
And, finally,

(3) You have inferred votes using noisy and not-that-accurate methods.

So, depending on who is and who is not using Twitter, you can be close to or very far from predicting a given election's outcome.

What's the problem with this?

We're seeing lots of buzz about positive results but very little discussion of negative results, why they occur, and how they could be avoided.

In short, take it easy guys... It's a really awesome field of research but it simply cannot be that easy.



HOT!, Research
Friday, April 23, 2010, 08:46 AM
That's the title of a paper I've recently submitted to a journal. I didn't intend to talk about it because I cannot publish it as a preprint; however, some recent independent events have touched a nerve and I'd like to put my two cents on the topic.

The first "event": a tweet by @zephoria linking a post about Big Data and Social sciences. She hits the nail in the head with that post, I specially liked this part (mainly because of my paper):

[...] Big Data presents new opportunities for understanding social practice. Of course the next statement must begin with a “but.” And that “but” is simple: Just because you see traces of data doesn’t mean you always know the intention or cultural logic behind them. And just because you have a big N doesn’t mean that it’s representative or generalizable.


Amen to that!

Second "event": a tweet by @munmun10 about the bias towards publishing positive results. She links an interesting article by Ars Technica which describes a study on the infamous "file-drawer effect". Such a fancy name refers to researchers tendency to just report possitive results while not discussing negative results --which, of course, can be equally important.

OK, enough, I'll talk, I'll talk.

Why do these two unrelated tweets push me to urgently describe my own paper? Mainly because it deals with negative results, lessons learned from strong assumptions about exploiting Big Data, and it gives some warnings about different pitfalls one can find when doing Social Media research.

First of all, the abstract:

A warning against converting Twitter into the next Literary Digest. Daniel Gayo-Avello (2010). User-generated content has experienced vertiginous growth both in the diversity of applications and the volume of topics covered by the users. Content published in micro-blogging systems such as Twitter is thought to be feasibly data-mined in order to "take the pulse" of society. At this moment, plenty of positive experiences have been published, praising the goodness of relatively simple approaches to sampling, opinion mining, and sentiment analysis. In this paper I'd like to play devil's advocate by describing a careful study in which such simple approaches largely overestimate Obama's victory in the U.S. 2008 Presidential Elections. A thorough post-mortem of that study is conducted and several important lessons are extracted.


The study described in the paper had been in my drawer since mid-2009 because I thought it was unpublishable given the outcome of the research: my data predicted an Obama victory (good), but the margin was too big. And when I say too big I mean that Obama won Texas according to Twitter data (bad).

All of this reminded me of the (infamous) Literary Digest poll that was a total failure at predicting the outcome of the U.S. 1936 Presidential Elections. Thus, I simply assumed (in 2009) that using Twitter to predict elections in 2008 was like polling car owners in 1936 to predict who would be the next POTUS. Without further ado, I simply moved on.

Then, this year, three different papers appeared within a short time span, and all three are worth a careful reading.
So here I was: I had a complete report on how to predict a landslide victory that never happened, one that (1) was consistent with an independent study (the one by O'Connor et al.) and (2) seemed to reach the opposite conclusion of a third study (the one by Tumasjan et al.).

Of course, the problem here is overgeneralization. I mean, my study does not prove that elections cannot be predicted by mining Twitter; it proves, however, that I wasn't able to predict the U.S. 2008 Elections with my data and my sentiment analysis methods (by the way, I tested four different ones). Neither does the study by Tumasjan et al. prove that elections can be predicted; it demonstrates that it was possible to predict one particular election in one particular country.

Hence, I decided to write a paper dealing with (1) the need to publish negative results, (2) a post-mortem of a failed Social Media study analyzing the sources of bias and ways to correct them, and (3) some lessons and caveats for future research in the field.

So, I'll put here the lessons I extracted from this; I hope you find them useful or, at least, that you can give me some feedback on them:
  1. The Big Data fallacy. Social Media are extremely appealing because researchers can easily obtain large data collections to be mined. However, just being large does not make such collections statistically representative of the global population.
  2. Beware of naïve sentiment analysis. It is certainly possible that some applications can achieve reasonable results by merely accounting for topic frequency or using simple approaches to sentiment detection. However, noisy instruments should be avoided and one should carefully check whether s/he is using --maybe unknowingly-- a random classifier (see the sketch after this list).
  3. Be careful with demographic bias. Social Media users tend to be relatively young and, depending on the population of interest, this can introduce an important bias. To improve results it is imperative to know users' ages and try to correct the age bias in the data.
  4. What is essential is invisible to the eye. Non-responses can play a role even more important than the collected data. If the lack of information mostly affects just one group, the results can greatly depart from reality. Needless to say, estimating the degree of non-response and its nature is extremely difficult --if not impossible at all. Thus, we must be very conscious of this issue.
  5. (A few) Past positive results do not guarantee generalization. As researchers we must be aware of the file drawer effect and, hence, we should carefully evaluate positive reports before assuming the reported methods can be straightforwardly applied to any similar scenario with identical (positive) results.
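Regarding lesson 2, a quick sanity check is to compare whatever sentiment classifier you use against trivial baselines on the same labeled sample; a minimal sketch with made-up predictions, just to show the comparison:

import random
from collections import Counter

def accuracy(predictions, gold):
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical labeled sample and classifier output, only for illustration.
gold = ["pos", "pos", "neg", "pos", "neg", "pos", "pos", "neg", "pos", "pos"]
classifier = ["pos", "pos", "pos", "pos", "neg", "pos", "pos", "pos", "pos", "pos"]

random.seed(0)
random_baseline = [random.choice(["pos", "neg"]) for _ in gold]
majority_baseline = [Counter(gold).most_common(1)[0][0]] * len(gold)

print("classifier:", accuracy(classifier, gold))
print("random    :", accuracy(random_baseline, gold))
print("majority  :", accuracy(majority_baseline, gold))
# If the classifier barely beats these baselines, it is, in practice,
# a (biased) random classifier.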
Of course, if you are interested in the paper just e-mail me (dani AT uniovi DOT es) and I'll send you a copy.



