Text Classification with Naive Bayes

What is text classification?

Text classification is the task of assigning a label or category to an entire text or document.

What is sentiment analysis?

Sentiment analysis is the extraction of sentiment, the positive or negative orientation that a writer expresses toward some object.

What is spam detection?

Spam detection is the binary classification task of categorizing an email as either spam or not-spam.

What is language id?

Language id is the task of identifying the language in which the text is written.

What is authorship attribution?

Authorship attribution is the task of determining a text’s author.

What is supervised machine learning?

The task of supervised machine learning is to take an input x and a fixed set of output classes Y = {y1, y2, …, yM} and return a predicted class y from Y. The classifier is learned from a data set of input observations, each paired with its ground-truth output, which serves as the supervision signal.

What is a probabilistic classifier?

A probabilistic classifier is a supervised machine learning classifier that additionally provides the probability of the observation being in each class.

What are generative classifiers?

Generative classifiers model how a class could generate the observed input: given an observation, they return the class most likely to have generated it. Naive Bayes is an example of a generative classifier.

What are discriminative classifiers?

Discriminative classifiers are those that learn what features from the input are most useful to discriminate between the different possible classes. Logistic regression is an example.

What is bag of words representation?

A bag-of-words representation treats a text document as an unordered set of words along with their frequencies. Position information is not retained.
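
As a minimal sketch of the idea, the snippet below builds word counts with Python's collections.Counter. The function name bag_of_words and the lowercasing and whitespace tokenization are simplifying assumptions for illustration, not part of the definition.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to word counts, discarding word order."""
    # Assumption: lowercase and split on whitespace; real tokenizers do more.
    return Counter(text.lower().split())

# Two documents with the same words in a different order
# receive the same representation.
bow_a = bag_of_words("the movie was great great fun")
bow_b = bag_of_words("great fun the movie was great")
print(bow_a == bow_b)
print(bow_a["great"])
```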

What is the formula and interpretation for naive Bayes after applying Bayes' theorem but before the naive Bayes assumptions?

Formula: c_hat = argmax(P(c|d), c in C) = argmax(P(d|c)*P(c), c in C), where c_hat is the predicted class, C is the set of all possible classes, and d is the observed input. The denominator P(d) from Bayes' theorem is dropped because it is the same for every class. Interpretation: first a class c is sampled from P(c), then the words are generated by sampling from P(d|c). We pick the class c with the highest product of prior and likelihood, which is the class with the highest posterior probability.

What are the prior probability and likelihood in `c_hat = argmax(P(d|c)*P(c), c in C)`?

The prior probability is P(c). The likelihood is P(d|c).

What is Maximum Likelihood Estimation?

Maximum likelihood estimation (MLE) is the method of choosing the point in parameter space that maximizes the likelihood function, that is, the parameter values under which the observed data is most probable.

What is the formula for naive Bayes with the naive Bayes assumption?

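c_NB = argmax(P(c)*prod(P(w_i|c), i in word positions), c in C) = argmax(log P(c) + sum(log P(w_i|c), i in word positions), c in C). Taking the log does not change the argmax because log is monotonic, and summing logs avoids numerical underflow.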
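
The log-space decision rule can be sketched as below. The function name predict and the probability tables are hypothetical values chosen for illustration; in practice they would be estimated from training data.

```python
import math

def predict(words, log_prior, log_likelihood):
    """Return the class maximizing log P(c) + sum_i log P(w_i|c), plus all scores."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in words:
            score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical, hand-picked probabilities (not estimated from any data set).
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_likelihood = {
    "pos": {"great": math.log(0.6), "boring": math.log(0.4)},
    "neg": {"great": math.log(0.2), "boring": math.log(0.8)},
}

label, scores = predict(["great", "great", "boring"], log_prior, log_likelihood)
print(label)
```

Here the "pos" score corresponds to 0.5 * 0.6 * 0.6 * 0.4 = 0.072 and the "neg" score to 0.5 * 0.2 * 0.2 * 0.8 = 0.016, so "pos" is chosen.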

What are linear classifiers?

Linear classifiers, like naive Bayes and logistic regression, are classifiers that use a linear combination of inputs to make a classification decision.

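What is P_hat(w_i|c) with Laplace (add-one) smoothing?

P_hat(w_i|c) = (count(w_i, c) + 1)/(sum(count(w, c), w in V) + |V|), where V is the vocabulary of all word types seen in training across all classes.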
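
A small sketch of the add-one estimate follows. The function name laplace_likelihoods and the toy training documents are assumptions made up for this example.

```python
from collections import Counter

def laplace_likelihoods(docs_by_class):
    """Estimate P(w|c) with add-one smoothing over a shared vocabulary."""
    vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
    probs = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        # Every vocabulary word gets one extra count, so no estimate is zero.
        probs[c] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
    return probs

# Toy training data, invented for illustration.
train = {
    "pos": [["great", "fun"], ["great", "plot"]],
    "neg": [["boring", "plot"]],
}
probs = laplace_likelihoods(train)
print(probs["pos"]["great"])   # (2 + 1) / (4 + 4)
print(probs["pos"]["boring"])  # (0 + 1) / (4 + 4)
```

Although "boring" never occurs in a "pos" document, its smoothed probability is nonzero, so a single unseen word cannot zero out the whole product.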

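How does naive Bayes handle unknown words and stop words?

It ignores them. Unknown words (words in the test data that never appeared in training) are removed from the test document and contribute no probability. Stop words (very frequent words such as "the" and "a") can optionally be removed from both training and test documents.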

What is binary NB?

Binary NB, or binary multinomial naive Bayes, is a variant of naive Bayes that clips the word counts in each document at 1.
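
The clipping step can be sketched as removing duplicate words within each document before counting. The function name binarized_counts and the toy documents are assumptions for illustration.

```python
from collections import Counter

def binarized_counts(docs):
    """Count words across documents, with each word counted at most once per document."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc))  # set() clips the per-document count at 1
    return counts

docs = [["great", "great", "fun"], ["great", "plot", "plot"]]
counts = binarized_counts(docs)
print(counts["great"], counts["plot"])
```

Note that a word can still have a total count above 1 when it appears in several documents; only the per-document count is clipped.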

How does binary NB deal with negation in sentiment analysis?

By prepending a prefix NOT_ to the words following each negation until the next punctuation mark.
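
A minimal sketch of this preprocessing step is shown below. The negation word list, the punctuation pattern, and the function name mark_negation are assumptions for illustration; real systems use richer lists and tokenizers.

```python
import re

NEGATIONS = {"not", "no", "never"}       # assumed, deliberately short list
PUNCT = re.compile(r"^[.,!?;:]$")        # assumed set of punctuation tokens

def mark_negation(tokens):
    """Prefix NOT_ to every token after a negation, up to the next punctuation mark."""
    out, negating = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negating = False             # punctuation ends the negated span
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
            if tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
                negating = True
    return out

marked = mark_negation("didn't like this movie , but I".split())
print(marked)
```

After this step, "like" and "NOT_like" are distinct vocabulary items, so the classifier can learn different class associations for each.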

How do we deal with insufficient labeled training data in NB-based sentiment analysis?

By using sentiment lexicons such as the General Inquirer, LIWC, the opinion lexicon, and the MPQA Subjectivity Lexicon, and maintaining two features: one that counts words found in the positive lexicon and one that counts words found in the negative lexicon.

List some features used for NB-based spam detection?

Mentions phrases like “one hundred percent guaranteed”, “urgent reply”, “online pharmaceutical”
Mentions large sums of money
HTML has a low ratio of text to image area
Email subject line is all capital letters
HTML has unbalanced “head” tags
Claims you can be removed from the list

What are some tricks for language id?

Character n-grams
Byte n-grams
Feature selection of n-grams
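
Character and byte n-gram features can be extracted along these lines. The function names char_ngrams and byte_ngrams are assumptions for this sketch, and UTF-8 is an assumed encoding.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def byte_ngrams(text, n):
    """Count overlapping byte n-grams of the UTF-8 encoding (assumed encoding)."""
    data = text.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))

grams = char_ngrams("banana", 2)
print(grams)

# "\u00e9" (e with acute accent) is one character but two bytes in UTF-8,
# so it yields no character bigram but exactly one byte bigram.
accent_bytes = byte_ngrams("\u00e9", 2)
print(accent_bytes)
```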

What are gold labels?

Gold labels are the labels annotated by humans.

What is a confusion matrix?

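A confusion matrix is a table that shows how well a system performs with respect to the gold labels, with the system outputs along the rows and the gold labels along the columns. Each cell counts the items with that combination of system output and gold label.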
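
One simple way to hold such a table is a counter keyed by (system output, gold label) pairs. The function name confusion_matrix and the toy labels are assumptions for illustration.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count items by (system output, gold label), i.e. (row, column)."""
    return Counter(zip(predicted, gold))

gold      = ["spam", "spam", "ham"]
predicted = ["spam", "ham",  "ham"]
cm = confusion_matrix(gold, predicted)

# Row "ham", column "spam": spam messages the system labeled as ham.
print(cm[("ham", "spam")])
```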

What is precision?

Precision is the percentage of system positives that are in fact positive.

What is recall?

Recall is the percentage of actual positives that the system labels as positive.
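
Precision and recall for one class can be computed from true positives, false positives, and false negatives, as in this sketch. The function name precision_recall and the toy labels are assumptions; returning 0.0 for an empty denominator is also a convention chosen here.

```python
def precision_recall(gold, predicted, positive):
    """Compute precision and recall for the class named by `positive`."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold      = ["spam", "spam", "ham",  "ham", "spam", "ham"]
predicted = ["spam", "ham",  "spam", "ham", "spam", "spam"]
precision, recall = precision_recall(gold, predicted, "spam")
print(precision, recall)
```

In this toy data the system labels four items as spam, two of them correctly (precision 2/4), and finds two of the three actual spam items (recall 2/3).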

What is the formula for F-measure?

F_beta = (beta^2 + 1)*P*R/(beta^2*P + R). beta < 1 favors precision and beta > 1 favors recall.
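
The formula translates directly into code. The function name f_beta and the example precision and recall values are assumptions for illustration.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: (beta^2 + 1) * P * R / (beta^2 * P + R)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # convention chosen here when both P and R are zero
    return (beta ** 2 + 1) * precision * recall / denom

# Example with low precision and high recall.
p, r = 0.5, 1.0
f1 = f_beta(p, r)             # balanced harmonic mean
f2 = f_beta(p, r, beta=2.0)   # weights recall more, so it moves toward r
f_half = f_beta(p, r, beta=0.5)  # weights precision more, so it moves toward p
print(f_half, f1, f2)
```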

What is macroaveraging?

Computing the performance for each class separately and then averaging over the classes.

What is microaveraging?

Adding the individual confusion matrices to create a pooled confusion matrix and computing precision and recall on that.
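
The difference between the two averages can be seen on precision with a sketch like the following. The function name macro_micro_precision and the per-class counts are invented for illustration.

```python
def macro_micro_precision(counts):
    """counts maps each class to (true positives, false positives)."""
    per_class = [tp / (tp + fp) for tp, fp in counts.values()]
    macro = sum(per_class) / len(per_class)      # average of per-class precision
    total_tp = sum(tp for tp, _ in counts.values())
    total_fp = sum(fp for _, fp in counts.values())
    micro = total_tp / (total_tp + total_fp)     # precision on pooled counts
    return macro, micro

# Hypothetical counts: a frequent class done well, a rare class done poorly.
counts = {"frequent": (8, 2), "rare": (1, 3)}
macro, micro = macro_micro_precision(counts)
print(macro, micro)
```

The macroaverage (0.8 and 0.25 averaged) treats both classes equally, while the microaverage (9 of 14 pooled positives) is dominated by the more frequent class.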

What is k-fold cross validation?

Partitioning the data into k disjoint sets called folds, using each fold in turn as the test set with the remaining folds as the training set, and averaging the accuracies from the k runs.
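
The procedure can be sketched with index lists. The function names k_fold_splits and cross_validate, and the stand-in scoring function, are assumptions for illustration; the strided fold assignment assumes the data has already been shuffled.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) for each of k disjoint folds."""
    # Assumption: items are already shuffled, so striding gives fair folds.
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

def cross_validate(n_items, k, evaluate):
    """Average evaluate(train, test) over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_items, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(10, 5))

# Stand-in for "train on `train`, return accuracy on `test`".
mean_score = cross_validate(10, 5, lambda train, test: len(train) / 10)
print(mean_score)
```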

What is the effect size of A over B on a test set x?

The effect size is the performance difference between A and B on x: the score of A on x minus the score of B on x.

What is the null hypothesis?

The null hypothesis is the hypothesis that the actual effect size is zero or less than zero, that is, that A is not really better than B.

What is a p-value?

The p-value is the probability of seeing an effect size as large as the observed one, or larger, if the null hypothesis were true.

When do we say a result is statistically significant?

When the p-value is below a threshold and thus the null hypothesis can be rejected.

What is the paired bootstrap test?

We repeatedly draw a large number of samples with replacement (bootstrap samples) from the original test set, evaluate both the intervention and the comparison on each sample, and use the distribution of performance differences across the samples to measure whether the intervention's advantage is statistically significant.
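
A sketch of one common formulation of the test is given below, using accuracy as the metric. The function name paired_bootstrap, the default of 10000 samples, and the toy correctness vectors are assumptions for illustration. The threshold of twice the observed difference follows a common textbook formulation, which accounts for the bootstrap samples being drawn from a test set on which the difference is already delta rather than zero.

```python
import random

def paired_bootstrap(correct_a, correct_b, b=10000, seed=0):
    """Estimate a p-value for 'A is better than B' on a shared test set.

    correct_a[i] and correct_b[i] are 1 if the system got item i right, else 0.
    """
    rng = random.Random(seed)
    n = len(correct_a)
    delta = (sum(correct_a) - sum(correct_b)) / n   # observed effect size
    s = 0
    for _ in range(b):
        # Resample n test items with replacement; both systems are scored
        # on the same resampled items, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        d = sum(correct_a[i] - correct_b[i] for i in idx) / n
        if d >= 2 * delta:
            s += 1
    return delta, s / b

# Toy per-item results, invented for illustration.
a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
b_results = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0]
delta, p_value = paired_bootstrap(a, b_results, b=2000)
print(delta, p_value)
```

A small p-value suggests the observed advantage of A is unlikely under the null hypothesis; with a test set this small the estimate is only illustrative.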

What are representational harms?

Representational harms are harms caused by a system that demeans a social group, for example by assigning unfavorable scores to input that contains words associated with that group (e.g., African American names were found to cause lower sentiment and more negative emotion scores).

Give an example of how an attempt to prevent societal harm can actually cause representational harm and censorship?

Toxicity detection could lead to censorship because the mere mention of minority identities, such as blind people or gay people, or simply the use of African American English, can be falsely flagged as toxic.

What is a model card?

A model card is documentation released with a model that describes it and makes clear whether biases were studied. For example, a model card can include:

training algorithm and parameters
training and evaluation data
preprocessing applied
intended use and users
model performance across demographics, groups and environments