Blog para proyectantes de Dani Gayo - Affective lexicons in Spanish

Thursday, January 17, 2013, 06:31 PM
Warning: quick and dirty post.

Lexicon matching is by no means the most accurate way of performing sentiment analysis; however, it is one of the easiest ways of implementing a quick prototype.
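To make the idea concrete, here is a minimal sketch of lexicon matching: tokenize the text, look each word up in the lexicon, and average the scores of the words that match. The lexicon entries below are made-up values on a 1-9 scale, and the function name `mean_valence` is just an illustrative choice, not something from a real library.

```python
import re

# Hypothetical valence scores on a 1-9 scale (made-up values for illustration,
# not taken from any real lexicon).
TOY_LEXICON = {"happy": 8.0, "good": 7.0, "sad": 2.0, "terrible": 1.5}


def mean_valence(text, lexicon):
    """Return the mean score of the words in `text` found in `lexicon`.

    Returns None when no word of the text appears in the lexicon.
    """
    # Very naive tokenizer: lowercase runs of letters (including Spanish accents).
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    scores = [lexicon[t] for t in tokens if t in lexicon]
    if not scores:
        return None
    return sum(scores) / len(scores)


print(mean_valence("A good day makes me happy", TOY_LEXICON))
```

Words outside the lexicon are simply ignored, which is one reason this approach is crude: negation, irony and multi-word expressions all go unnoticed.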

Needless to say, we need a lexicon to do that and they tend to be scarce, mostly limited to the English language, and small.

I've just heard of the corpus by Warriner et al. with almost 14,000 English words and I've tried to prepare a quick "translation" into Spanish.

The method I've applied is extremely crude:
  1. Translate the list of words from English into Spanish.
  2. Translate (again) the list of Spanish words into English.
  3. Check that the original English word and the round-trip (English-to-Spanish-to-English) word are the same.
In addition to that there has been some manual checking but, as I say, it is pretty crude and is provided as is.
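The three steps above can be sketched as a round-trip filter. This is only an illustration: the two translation callables are stand-ins (here backed by tiny made-up dictionaries) for whatever machine translation service is actually used, and `round_trip_filter` is a hypothetical name.

```python
def round_trip_filter(words, en_to_es, es_to_en):
    """Keep English words whose Spanish translation translates back to them.

    `en_to_es` and `es_to_en` are callables returning a translation or None.
    Returns a dict mapping each surviving English word to its Spanish form.
    """
    kept = {}
    for word in words:
        spanish = en_to_es(word)          # step 1: English -> Spanish
        if spanish is None:
            continue
        back = es_to_en(spanish)          # step 2: Spanish -> English
        if back == word:                  # step 3: keep only exact round trips
            kept[word] = spanish
    return kept


# Toy dictionaries standing in for a real translation service (assumption).
EN_ES = {"house": "casa", "home": "casa", "dog": "perro"}
ES_EN = {"casa": "house", "perro": "dog"}

pairs = round_trip_filter(["house", "home", "dog", "xyz"], EN_ES.get, ES_EN.get)
print(pairs)  # {'house': 'casa', 'dog': 'perro'}
```

Note how "home" is dropped because "casa" translates back to "house". A filter like this discards near-synonyms and ambiguous words, so the resulting list is necessarily shorter than the original.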

So, here you are, the Warriner et al. corpus machine-translated into Spanish with a little more than 9,000 words. Enjoy!

As usual, you can find me on Twitter (@PFCdgayo) for any comment regarding this post.




Tuesday, November 27, 2012, 09:03 PM
A couple of weeks ago I had to deliver a lecture by video conference. I know there are plenty of services to do that but, since both parties were short of time, I wanted something easy that I had used before. After some trial and error I discarded most of the free services purported to do that and I chose Skype (yes, Skype). One problem was that I wanted my webcam to appear next to the slides, and that requires Skype premium.

To solve that, a student (David Moreno) tried to convince me to use Google+ Hangouts and I went through this thorough explanation on how to use Hangouts to stream a keynote.

Nevertheless, I found it rather uncomfortable to need two computers in order to share both the slides and the webcam and, besides, the quality of the image was IMHO better with Skype than with the Hangout.

On the other hand, using a Hangout On Air I would be able to stream the conference through YouTube, and anyone interested could attend the lecture without joining the Hangout. Moreover, the video would be available immediately after the broadcast.

Hence, I prepared my list of "requirements":

The solution was to prepare a webpage showing the "slides" (actually images produced from PowerPoint) and the webcam. To that end, the jQuery webcam plugin was invaluable, as was David's help.

You can check the result in this video (in Spanish) and you can find the current version of Screen Me! (HTML+CSS+JS) prepared by David on GitHub.





Wednesday, November 7, 2012, 09:22 AM
Now that polls have closed we are able to compute the MAE (Mean Absolute Error) of a sensible (albeit naïve) baseline for predicting these elections: that past results just occur again.

In this table you can find:
  1. Number of votes for Obama and McCain in 2008.
  2. Number of votes for Obama and Romney in 2012.
  3. The error made by predicting that Obama would obtain exactly the same percentage in 2012 as in 2008 in each state.
  4. The electoral votes obtained in 2008 and in 2012.
  5. The % of popular and electoral vote.
  6. The MAE (Mean Absolute Error)
Data for 2008 Elections was obtained from Wikipedia. Data for 2012 Elections was obtained from politico.com.

Please note that I do not fully understand most of the subtleties of the electoral college, so the number of electoral votes may not be accurate.

Nevertheless, this "groundhog-day" baseline is really good (for the US Presidential Elections): it only missed 2 out of 51 states with an impressive MAE of 2.75%.
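For readers who want to reproduce this kind of check, the sketch below computes the MAE of the "same as last time" baseline over per-state percentages and lists the states where it picks the wrong winner. The state names and figures are made up, `baseline_report` is a hypothetical name, and treating "more than 50%" as a win is a simplifying assumption that ignores third candidates.

```python
def baseline_report(prev_share, new_share):
    """Evaluate the prediction "each state repeats its previous result".

    Both arguments map a state to one candidate's vote share in percent.
    Returns (mae, missed): the mean absolute error in percentage points and
    the list of states where the baseline predicts the wrong winner.
    """
    states = sorted(prev_share)
    errors = [abs(prev_share[s] - new_share[s]) for s in states]
    mae = sum(errors) / len(errors)
    # Assumption: a share above 50% means the candidate wins the state.
    missed = [s for s in states if (prev_share[s] > 50) != (new_share[s] > 50)]
    return mae, missed


# Made-up shares for three fictional states.
previous = {"A": 60.0, "B": 51.0, "C": 40.0}
current = {"A": 58.0, "B": 49.0, "C": 43.0}

mae, missed = baseline_report(previous, current)
print(f"MAE = {mae:.2f} points, missed states: {missed}")
```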

So, in short, what would be a reasonable MAE for an algorithm to be credible (to me)? A 5% improvement over the baseline, i.e. MAE = 2.61%.

Of course, if your algorithm achieves a MAE below 2.48% (a 10% improvement) I would be impressed.

If your MAE is greater than 2.61%, I'm really sorry but your algorithm is useless :(
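The thresholds above are just the baseline MAE scaled down by the required improvement. A small sketch of that arithmetic follows; it uses `Decimal` to avoid floating-point surprises when rounding, and the verdict labels are my own shorthand for the three cases described in the text.

```python
from decimal import Decimal, ROUND_HALF_UP

BASELINE_MAE = Decimal("2.75")  # MAE of the "groundhog-day" baseline, in %


def target_mae(baseline, improvement_pct):
    """MAE that improves on `baseline` by `improvement_pct` percent (2 decimals)."""
    factor = (Decimal(100) - Decimal(improvement_pct)) / Decimal(100)
    return (baseline * factor).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def verdict(mae, baseline=BASELINE_MAE):
    """Classify a MAE (Decimal, in %) against the 5% and 10% improvement targets."""
    if mae < target_mae(baseline, 10):
        return "impressive"
    if mae <= target_mae(baseline, 5):
        return "credible"
    return "useless"


print(target_mae(BASELINE_MAE, 5))   # 2.61
print(target_mae(BASELINE_MAE, 10))  # 2.48
print(verdict(Decimal("2.70")))      # useless
```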

However, I think that any sensible prediction should take into account data from past elections and, therefore, it would be really difficult to tell the difference between just using the baseline and "icing" that historical data with some extra information from social media.

As usual, contact me on Twitter if you please: @PFCdgayo




Monday, November 5, 2012, 10:41 AM
A fellow researcher has just sent me his prediction for tomorrow's elections in the U.S. based on Twitter data.

According to his model Obama will get 50.71% of the popular vote but 67.35% of the electoral votes.

A number of states are key in this election and, according to the model of this researcher, Obama will win by a tight margin in all of them:
Personal comment: I believe that the results for Obama are slightly overestimated by this model and that those tight victories could very well be tight losses...

We'll see tomorrow anyway.

If any of you want to contact the original author please send me an e-mail or contact me on Twitter: @PFCdgayo.

Update (Nov. 6th): A brief description of the way in which such results were obtained can be found on arXiv.



Wednesday, June 27, 2012, 10:18 AM
In my previous post I was deliberately caustic on the matter of electoral prediction from Twitter data.

After writing that I decided to be a bit more constructive, try to see the glass half full instead of half empty, and improve the advice I had provided there (especially regarding baselines).

Hence, I've prepared a new paper where I conduct a meta-analysis of Twitter electoral predictions (note: only those made in scholarly papers) to reach some conclusions:

In addition to that, I suggest a conceptual scheme to characterize any Twitter-based prediction method. Such a scheme comprises the characteristics and sub-characteristics defining any predictive method using Twitter data:

Hence, what I claim is that (1) current research neither guarantees the purity of the dataset nor tries to debias or denoise it; and (2) performance is estimated using inappropriate baselines.

Finally, what would be an appropriate way to evaluate performance?

Certainly, MAE (Mean Absolute Error) is commonly applied, but this measure changes from election to election, so a baseline must be computed for each election and the MAE of the system compared against that of the baseline.

What would that baseline be?

I propose using the results of the immediately prior election as a prediction. That is, assuming the same results are to be obtained.

Certainly, this has issues: e.g., new parties running for election, or coalitions created or dismantled between elections. Still, it is simple and can provide an intuitive hint about how "hard" or "easy" an election is to predict.

Such a baseline was used to determine the performance of each prediction made to date.
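As a rough illustration of this evaluation, the sketch below scores both a system and the prior-election baseline against the actual results and reports the relative improvement. All figures and party names are invented, and counting a party absent from a prediction as 0% is my own assumption for handling new parties; it is not a rule taken from the paper.

```python
def mae(predicted, actual):
    """Mean absolute error, in percentage points, over the parties in `actual`.

    Assumption: a party missing from `predicted` (e.g. a new party that did
    not run in the previous election) is treated as a prediction of 0%.
    """
    errors = [abs(predicted.get(party, 0.0) - share) for party, share in actual.items()]
    return sum(errors) / len(errors)


# Invented vote shares (%) for fictional parties; "D" is a new party.
previous_election = {"A": 45.0, "B": 40.0, "C": 15.0}
actual_results = {"A": 42.0, "B": 41.0, "C": 10.0, "D": 7.0}
system_prediction = {"A": 43.0, "B": 40.0, "C": 11.0, "D": 6.0}

baseline_mae = mae(previous_election, actual_results)
system_mae = mae(system_prediction, actual_results)
improvement = (baseline_mae - system_mae) / baseline_mae

print(f"baseline MAE = {baseline_mae:.2f}, system MAE = {system_mae:.2f}")
print(f"relative improvement over baseline = {improvement:.0%}")
```

A large baseline MAE signals a "hard" election, where results shifted a lot, while a small one signals an "easy" election, where merely repeating the past is already difficult to beat.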

And that's all! You can find the paper on arXiv and you can send me your comments on Twitter (@PFCdgayo).



