Thursday, January 17, 2013, 06:31 PM
Lexicon matching is by no means the most accurate way of performing sentiment analysis; however, it is one of the easiest ways of implementing a quick prototype.
Needless to say, that requires a lexicon, and lexicons tend to be scarce, small, and mostly limited to the English language.
I've just heard of the corpus by Warriner et al., which contains almost 14,000 English words, and I've tried to prepare a quick "translation" into Spanish.
The method I've applied is extremely crude:
- Translate the list of words from English into Spanish.
- Translate (again) the list of Spanish words into English.
- Check that the original English word and the English-to-Spanish-to-English word are the same.
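The three steps above can be sketched roughly as follows. This is only an illustration: the post does not say which translation service was used, so the two translation functions are passed in as parameters, and the toy dictionaries (`EN_ES`, `ES_EN`) are hypothetical stand-ins. The sketch also assumes that words failing the check are simply discarded, and that the comparison ignores case and surrounding whitespace.

```python
from typing import Callable, Dict, Iterable


def round_trip_filter(
    words: Iterable[str],
    to_spanish: Callable[[str], str],
    to_english: Callable[[str], str],
) -> Dict[str, str]:
    """Return {english_word: spanish_word} for words that survive the round trip."""
    kept: Dict[str, str] = {}
    for word in words:
        spanish = to_spanish(word)   # step 1: English -> Spanish
        back = to_english(spanish)   # step 2: Spanish -> English
        # Step 3: keep the pair only when the back-translation matches the original.
        if back.strip().lower() == word.strip().lower():
            kept[word] = spanish
    return kept


# Toy dictionaries standing in for a real translation service (hypothetical data).
EN_ES = {"happy": "feliz", "sad": "triste", "home": "casa", "house": "casa"}
ES_EN = {"feliz": "happy", "triste": "sad", "casa": "house"}

lexicon_es = round_trip_filter(
    EN_ES,                      # iterating a dict yields its keys (the English words)
    lambda w: EN_ES[w],
    lambda w: ES_EN[w],
)
```

In this toy run "home" is dropped because it comes back as "house". A filter of this kind discards ambiguous words, which is consistent with the list shrinking considerably after the round trip.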
So, here you are, the Warriner et al. corpus machine-translated into Spanish with a little more than 9,000 words. Enjoy!
As usual, you can find me on Twitter (@PFCdgayo) for any comment regarding this post.

Tuesday, November 27, 2012, 09:03 PM
To solve that, a student (David Moreno) tried to convince me to use Google+ Hangouts and I went through this thorough explanation on how to use Hangouts to stream a keynote.
Nevertheless, I found it rather uncomfortable to need two computers in order to share both the slides and the webcam and, besides, the quality of the image was IMHO better with Skype than with the Hangout.
On the other hand, using a Hangout On Air I would be able to stream the conference through YouTube, and anyone interested in it could attend the lecture without joining the Hangout. Moreover, the video would be available immediately after the broadcast.
Hence, I prepared my list of "requirements":
- I wanted to share my screen using Skype and Google+ Hangouts.
- The screen should show both my slides and my webcam.
- I didn't want to pay for Skype premium.
You can check the result in this video (in Spanish) and you can find the current version of Screen Me! (HTML+CSS+JS), prepared by David, on GitHub.

Wednesday, November 7, 2012, 09:22 AM
In this table you can find:
- Number of votes for Obama and McCain in 2008.
- Number of votes for Obama and Romney in 2012.
- The error made by predicting that Obama would obtain exactly the same percentage in 2012 as in 2008 in each state.
- The electoral votes obtained in 2008 and in 2012.
- The percentage of the popular and electoral vote.
- The MAE (Mean Absolute Error).
Please note that I do not fully understand most of the subtleties of the electoral college, so the number of electoral votes may not be accurate.
Nevertheless, this "groundhog-day" baseline is really good (for the US Presidential Elections): it only missed 2 out of 51 states with an impressive MAE of 2.75%.
So, in short, what would be a reasonable MAE for an algorithm to be credible (to me)? A 5% improvement over the baseline, i.e. MAE = 2.61%.
Of course, if your algorithm is able to be below a MAE of 2.48% (a 10% improvement) I would be impressed.
If your MAE is greater than 2.61%, I'm really sorry but your algorithm is useless.
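For readers who want to reproduce this kind of check, here is a minimal sketch of the "groundhog-day" baseline and of the thresholds above. The state names and vote shares are made up for illustration (they are not the figures from the table), and the sketch assumes that a share above 50% means the candidate carried the state.

```python
from typing import Dict, List


def mean_absolute_error(predicted: Dict[str, float], actual: Dict[str, float]) -> float:
    """Average absolute difference, in percentage points, over the shared states."""
    states = predicted.keys() & actual.keys()
    return sum(abs(predicted[s] - actual[s]) for s in states) / len(states)


def missed_states(predicted: Dict[str, float], actual: Dict[str, float]) -> List[str]:
    """States where the predicted winner differs from the actual winner.

    Assumption: a share above 50% means the candidate carried the state.
    """
    return sorted(
        s for s in predicted.keys() & actual.keys()
        if (predicted[s] > 50.0) != (actual[s] > 50.0)
    )


def target_mae(baseline_mae: float, improvement: float) -> float:
    """MAE needed to beat the baseline by a relative improvement (0.05 means 5%)."""
    return baseline_mae * (1.0 - improvement)


# Hypothetical shares for one candidate in two consecutive elections.
previous_shares = {"State A": 54.0, "State B": 51.0, "State C": 49.0, "State D": 45.0}
current_shares = {"State A": 52.0, "State B": 49.5, "State C": 50.5, "State D": 42.0}

# The baseline simply reuses the previous shares as the prediction.
baseline_mae = mean_absolute_error(previous_shares, current_shares)
baseline_misses = missed_states(previous_shares, current_shares)

# Thresholds derived from the 2.75% baseline MAE quoted in the post.
credible = target_mae(2.75, 0.05)    # about 2.61
impressive = target_mae(2.75, 0.10)  # about 2.48
```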
However, I think that any sensible prediction should take into account data from past elections and, therefore, it would be really difficult to tell the difference between just using the baseline and "icing" that historical data with some extra information from social media.
As usual, contact me on Twitter if you please: @PFCdgayo

Monday, November 5, 2012, 10:41 AM
According to his model Obama will get 50.71% of the popular vote but 67.35% of the electoral votes.
A number of states are key in this election and, according to the model of this researcher, Obama will win by a tight margin in all of them:
- Colorado: Obama (53.33%)
- Florida: Obama (50.97%)
- New Hampshire: Obama (53.75%)
- Ohio: Obama (51.39%)
- Pennsylvania: Obama (54.08%)
- Virginia: Obama (52.41%)
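To see why such tight margins matter, here is a small sketch that converts the shares listed above into electoral votes. It is not the researcher's model: the electoral vote counts are the 2012 allocations for these six states, the rule that a share above 50% wins every elector in the state is an assumption (all six states award their electors winner-take-all), and the 1.5-point shift is an arbitrary illustration.

```python
from typing import Dict

# Shares for Obama as listed in the post.
MODEL_SHARES: Dict[str, float] = {
    "Colorado": 53.33,
    "Florida": 50.97,
    "New Hampshire": 53.75,
    "Ohio": 51.39,
    "Pennsylvania": 54.08,
    "Virginia": 52.41,
}

# Electoral votes per state in the 2012 election.
ELECTORAL_VOTES: Dict[str, int] = {
    "Colorado": 9,
    "Florida": 29,
    "New Hampshire": 4,
    "Ohio": 18,
    "Pennsylvania": 20,
    "Virginia": 13,
}


def electoral_votes_won(shares: Dict[str, float]) -> int:
    """Sum the electoral votes of the states where the share is above 50%.

    Assumption: winner-take-all, and a share above 50% means a win.
    """
    return sum(ELECTORAL_VOTES[state] for state, share in shares.items() if share > 50.0)


as_predicted = electoral_votes_won(MODEL_SHARES)

# Hypothetical scenario: the model overestimates every state by 1.5 points.
shifted = {state: share - 1.5 for state, share in MODEL_SHARES.items()}
if_overestimated = electoral_votes_won(shifted)
```

With the shares as listed, all 93 electoral votes of these six states go to Obama. With the hypothetical 1.5-point shift, Florida and Ohio flip and the count drops to 46.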
Personal comment: I believe that the results for Obama are slightly overestimated by this model and that those tight victories could very well be tight losses...
We'll see tomorrow anyway.
If any of you want to contact the original author, please send me an e-mail or contact me on Twitter: @PFCdgayo.
Update (Nov. 6th): A brief description of the way in which such results were obtained can be found at arXiv.

Articles
Wednesday, June 27, 2012, 10:18 AM
In my previous post I was deliberately caustic on the matter of electoral prediction from Twitter data.
After writing that I decided to be a bit more constructive, try to see the glass half full instead of half empty, and improve the advice I had provided there (especially regarding baselines).
Hence, I've prepared a new paper where I conduct a meta-analysis of Twitter electoral predictions (note: only those made in scholarly papers) to reach some conclusions:
- With regard to predictions based on raw counts:
  - They are too dependent on arbitrary decisions, such as which parties or candidates to consider or which period to collect data from.
  - Their performance is too unstable and strongly dependent on such parameterizations.
  - Considering the reported results as a whole, it seems plausible that positive results could have been due to chance or even to unintentional data dredging in post hoc analysis.
- With regard to predictions based on sentiment analysis:
  - The impact of sentiment analysis on Twitter-based predictions is unclear. The studies applying this technique are fewer than those counting tweets, and the picture they convey is confusing, to say the least.
  - However, taking into consideration that even naïve sentiment analysis seems to outperform a reasonable baseline, it is clear that further research is needed in that line.
- Both approaches share a number of weaknesses:
  - All of them are post hoc analyses.
  - Proposed baselines are too simplistic.
  - Sentiment analysis is applied with naïveté, since commonly used methods are only slightly better than random classifiers and fail to catch the subtleties of political discourse.
  - All of the tweets are assumed to be trustworthy when that is not the case.
  - Demographic bias is neglected even though it is well known that social media is not a random sample of the population.
  - Self-selection bias is also ignored, although it is well known that supporters are much more vocal and responsible for most of the content.

- Period and method of collection: i.e., the dates when tweets were collected, and the parameterization used to collect them.
- Data cleansing measures:
  - Purity: i.e., to guarantee that only tweets from prospective voters are used to make the prediction.
  - Debiasing: i.e., to guarantee that any demographic bias in the Twitter user base is removed.
  - Denoising: i.e., to remove tweets not dealing with voter opinions (e.g. spam or disinformation) or even users not corresponding to actual prospective voters (e.g. spammers, robots, or propagandists).
- Prediction method and its nature:
  - The method to infer voting intentions from tweets.
  - The nature of the inference: i.e., whether the method predicts individual votes or aggregated vote rates.
  - The nature of the prediction: i.e., whether the method predicts just a winner or vote rates for each candidate.
  - Granularity: i.e., the level at which the prediction is made (e.g. district, state, or national).
- Performance evaluation: i.e., the way in which the prediction is compared with the actual outcome of the election.
Finally, what would be an appropriate way to evaluate performance?
MAE (Mean Absolute Error) is commonly applied, but its value varies from election to election, so each election needs a baseline against which the MAE of the system can be compared.
What would that baseline be?
I propose using the results of the immediately prior election as the prediction, that is, assuming that the same results will be obtained again.
Certainly, this has issues: e.g., new parties running for election, or coalitions created or dismantled between elections. Still, it is simple and can provide an intuitive hint about how "hard" or "easy" an election is to predict.
Such a baseline was used to determine the performance of each prediction made to date.
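As a rough illustration of that comparison, the sketch below scores a prediction and the prior-election baseline against the same outcome. All party names and vote shares are invented, and treating a party that is missing from one side (e.g. a newly created party) as having a 0% share is my own simplifying assumption; it is one possible way of handling the issue mentioned above, not necessarily the one used in the paper.

```python
from typing import Dict


def vote_share_mae(predicted: Dict[str, float], actual: Dict[str, float]) -> float:
    """MAE in percentage points over every party present on either side.

    Assumption: a party missing from one side counts as a 0% share there.
    """
    parties = predicted.keys() | actual.keys()
    return sum(
        abs(predicted.get(p, 0.0) - actual.get(p, 0.0)) for p in parties
    ) / len(parties)


# Hypothetical vote shares; "Party D" did not exist in the previous election.
previous_election = {"Party A": 44.0, "Party B": 40.0, "Party C": 16.0}
actual_result = {"Party A": 40.0, "Party B": 38.0, "Party C": 12.0, "Party D": 10.0}
twitter_prediction = {"Party A": 42.0, "Party B": 35.0, "Party C": 14.0, "Party D": 9.0}

# Baseline: assume the previous results repeat themselves.
baseline_mae = vote_share_mae(previous_election, actual_result)
system_mae = vote_share_mae(twitter_prediction, actual_result)

# Relative improvement of the system over the baseline (positive means better).
improvement = (baseline_mae - system_mae) / baseline_mae
```

Reporting the relative improvement rather than the raw MAE makes results from different elections easier to compare, because the baseline absorbs how "hard" each election was.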
And that's all! You can find the paper on arXiv and you can send me your comments on Twitter (@PFCdgayo).