An Empirical Study of Skip-Gram Features and Regularization for Learning on Sentiment Analysis

Cheng Li, Bingyu Wang, Virgil Pavlu, Javed A. Aslam.

Proceedings of the 38th European Conference on Information Retrieval (ECIR), Padova, Italy, 2016.


Keywords

Sentiment analysis; Skip-grams; Feature selection; Regularization

Abstract

Deciding the overall sentiment of a user review is usually treated as a text classification problem. The simplest machine learning setup for text classification uses a unigram bag-of-words feature representation of documents, which has been shown to work well for tasks such as spam detection and topic classification. However, sentiment analysis is more complex and is not as easily captured with unigram (single-word) features. Bigram and trigram features capture some local context and short-distance negations, and thus outperform unigram bag-of-words features for sentiment analysis. Higher-order n-gram features, however, are often overly specific and sparse, so they increase model complexity and do not generalize well.

In this paper, we perform an empirical study of skip-gram features for large-scale sentiment analysis. We demonstrate that skip-grams can be used to improve sentiment analysis performance in a model-efficient and scalable manner via regularized logistic regression. The feature sparsity problem associated with higher-order n-grams can be alleviated by grouping similar n-grams into a single skip-gram: for example, “waste time” could match the n-gram variants “waste of time”, “waste my time”, “waste more time”, “waste too much time”, “waste a lot of time”, and so on. To promote model efficiency and prevent overfitting, we demonstrate the utility of logistic regression incorporating both L1 regularization (for feature selection) and L2 regularization (for weight distribution).
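
The grouping of n-gram variants under one skip-gram can be illustrated with a minimal Python sketch. It assumes a skip-gram matches when its words occur in order with at most `max_slop` extra words in between; the function name `matches_skip_gram` and the `max_slop` parameter are hypothetical choices made for this illustration, and the paper's exact matching rule may differ.

```python
from typing import Sequence


def matches_skip_gram(tokens: Sequence[str],
                      skip_gram: Sequence[str],
                      max_slop: int) -> bool:
    """Return True if the words of `skip_gram` occur in `tokens` in order,
    with at most `max_slop` other words in total between the first and
    last matched word. (Assumed matching rule, for illustration only.)"""
    n = len(skip_gram)
    if n == 0:
        return False
    for start, tok in enumerate(tokens):
        if tok != skip_gram[0]:
            continue
        # A valid match starting here must fit inside this window.
        window = tokens[start:start + n + max_slop]
        pos = 0
        for word in window:
            if word == skip_gram[pos]:
                pos += 1
                if pos == n:
                    return True
    return False


# The n-gram variants from the abstract, tokenized on whitespace.
variants = [
    "waste of time",
    "waste my time",
    "waste more time",
    "waste too much time",
    "waste a lot of time",
]
skip_gram = ("waste", "time")

# With a slop of 3, every variant maps to the same skip-gram feature.
all_match = all(matches_skip_gram(v.split(), skip_gram, max_slop=3)
                for v in variants)
```

With `max_slop=3`, all five variants collapse into the single feature “waste time”, whereas a plain n-gram representation would treat them as five separate, sparser features. A smaller slop matches fewer variants: with `max_slop=2`, “waste a lot of time” no longer matches because three words separate “waste” and “time”.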