asper brothers team
2 Nov 2021 13 min to read

Sentiment Analysis in Python – Example with Code based on Hotel Review Dataset

Sentiment analysis is the task of identifying the sentiment expressed in a text. Sentiment is understood very broadly here: it can be as simple as whether a text is positive or not, but it can also mean more nuanced emotions or attitudes of the author, such as anger, anxiety, or excitement. It’s even possible to train a model to detect sarcasm.

Today, most sentiment analysis systems use NLP and machine or deep learning (earlier systems relied on computational linguistics and text mining). This allows for a relatively straightforward implementation that needs only existing, labeled data as input and no hand-crafted linguistic knowledge.

Applications of sentiment analysis

From a business point of view, sentiment analysis offers a way of automatically monitoring “what people are saying” without having to read everything yourself. For example, it can be used for:

  • product review monitoring – monitoring which of your products receive a higher rate of positive comments
  • market research – discovering attitudes of internet users towards the research target
  • search engines/recommender systems – enhancing performance by better understanding what users meant by the content of a query

Data gathering

One of the most obvious places to look for data for your sentiment analysis system is social media platforms such as Facebook, Twitter, or Instagram. But these sites, like countless others, try to protect their users’ privacy and experience by blocking bot activity (e.g., with CAPTCHAs). Fortunately, they now also provide official APIs for accessing some of their content.

Implementation

As an example of sentiment analysis, we’ll analyze a TripAdvisor hotel review dataset using a simple LSTM neural network.

Setup

For our project, we’ll need the following packages installed:

  • pytorch with torchtext
  • sklearn
  • pandas
  • matplotlib

As for data, we need to download and unpack this dataset

When we have everything in place, we can start by importing everything we will need during implementation.

import torch
from torch import nn
from torch import optim
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader, Dataset

Additionally, we have to specify what device we will be training an LSTM on:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

I strongly recommend using CUDA. Otherwise, training will be very time-consuming. If you don’t have access to a GPU machine, consider using a free one on Google Colab or Kaggle.

And as the last setup step, we’ll specify global random seeds for reproducibility:

np.random.seed(42)
torch.manual_seed(42)  # also seed PyTorch (weight initialization, dropout, shuffling)

Finally, we will define a path to our dataset:

DATA_PATH = "./trip-advisor-hotel-reviews/tripadvisor_hotel_reviews.csv"

Loading and Inspecting Data

First things first, let’s load the data into a pandas DataFrame:

df = pd.read_csv(DATA_PATH)
df.head()

Table 1

For each review in our dataset, we are provided with a rating from 1 to 5 (1: very bad experience, 5: very good experience).

 

ratings histogram

 

In a real-world situation, this `Rating` column would be more than enough for us to train a review-scoring regression model. Instead, we’ll demonstrate how to work with less informative data. We’ll group the reviews into three categories:

  • negative: ratings below 4
  • neutral: ratings equal to 4
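  • positive: ratings equal to 5

neutral_range = {"low": 4, "high": 5}
df["Sentiment"] = "neutral"
# Assign through df.loc in one step; chained indexing such as
# df["Sentiment"].loc[...] = ... may write to a temporary copy.
df.loc[df["Rating"] < neutral_range["low"], "Sentiment"] = "negative"
df.loc[df["Rating"] >= neutral_range["high"], "Sentiment"] = "positive"
df.head()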
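
The same binning can be written as a plain function, which makes the boundary cases easy to check. This is only a sketch: `rating_to_sentiment` is a hypothetical helper that is not part of the original code, and it assumes integer ratings from 1 to 5.

```python
def rating_to_sentiment(rating, low=4, high=5):
    """Map a 1-5 rating to a sentiment label (hypothetical helper).

    Ratings below `low` are negative, ratings of at least `high` are
    positive, and everything in between is neutral.
    """
    if rating < low:
        return "negative"
    if rating >= high:
        return "positive"
    return "neutral"


labels = [rating_to_sentiment(r) for r in range(1, 6)]
print(labels)
# ['negative', 'negative', 'negative', 'neutral', 'positive']
```

With pandas, such a function could be applied through `df["Rating"].map(rating_to_sentiment)`, although a vectorized assignment is usually faster on large frames.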

table 2

 

sentiment histogram

 

Data split and Baseline

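As usual, we’ll split our data into train and validation subsets while ensuring that the resulting split is stratified. The held-out subset is named X_test / y_test in the code and serves as our validation set.

# Hold out 20% of the reviews; stratify keeps the class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    df["Review"], df["Sentiment"], test_size=0.2, stratify=df["Sentiment"]
)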
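
To see what stratification guarantees, here is a minimal pure-Python sketch of a stratified split. It is an illustration only, not how scikit-learn implements `stratify=`; the function name `stratified_split` and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with roughly equal class shares in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # per-class share of the hold-out set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# toy labels: 50% positive, 30% neutral, 20% negative
labels = ["positive"] * 50 + ["neutral"] * 30 + ["negative"] * 20
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in test_idx))
# Counter({'positive': 10, 'neutral': 6, 'negative': 4})
```

The hold-out set keeps the 50/30/20 proportions of the full data, which is what we want when the classes are imbalanced.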

To evaluate our model properly, we first have to establish a baseline. In our case (sentiment classification), we can build a simple model that picks a sentiment at random.

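class RandomBaseline:
    """Baseline that ignores the review text and guesses a sentiment at random."""

    def __init__(self):
        self.categories = {}

    def fit(self, data, target_col):
        # record each category together with its share of the training data
        cat_names = data[target_col].unique()
        agg = data.groupby(target_col).count()
        for n in cat_names:
            self.categories[n] = agg.loc[n].iloc[0] / len(data)

    def predict(self, data):
        # Draw uniformly from the known categories; pass
        # p=list(self.categories.values()) to weight the draw by class share instead.
        return np.random.choice(list(self.categories.keys()), len(data))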

Now we can “train” our random model.

rb = RandomBaseline()
rb.fit(df.loc[X_train.index], "Sentiment")  # X_train.index holds row labels, so select with .loc
pred = rb.predict(X_test)
accuracy_score(y_test, pred)
# 0.3273969260795316

As expected, the accuracy of randomly picking 1 of 3 categories is close to 33%.
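
The 33% figure can also be derived by hand. If a baseline guesses class i with probability q_i and the true share of that class is p_i, the expected accuracy is the sum of p_i · q_i over all classes. The sketch below evaluates this for three guessing strategies; the class shares are made-up numbers for illustration, not the shares in the hotel review dataset.

```python
def expected_accuracy(class_shares, guess_probs):
    """Expected accuracy of a guesser that ignores the input entirely."""
    return sum(p * q for p, q in zip(class_shares, guess_probs))


# hypothetical class shares (assumed for illustration only)
shares = [0.5, 0.3, 0.2]

uniform = expected_accuracy(shares, [1 / 3] * 3)  # every class equally likely
weighted = expected_accuracy(shares, shares)      # guess in proportion to class shares
majority = expected_accuracy(shares, [1, 0, 0])   # always guess the largest class

print(f"uniform:  {uniform:.3f}")
print(f"weighted: {weighted:.3f}")
print(f"majority: {majority:.3f}")
# uniform:  0.333
# weighted: 0.380
# majority: 0.500
```

A uniform guess scores 1/3 on three classes no matter how imbalanced they are, while always predicting the most frequent class is a stronger baseline to compare against.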

Data preparation

Before we start modeling, we have to transform the reviews into a form “understandable” to the neural network. We’ll do it by:
1. tokenizing each review – converting it to a list of words (tokens)
2. building a vocabulary – mapping each token to index

tokenization:

tokenizer = get_tokenizer('basic_english')
tokenizer("the place was nice")
# ['the', 'place', 'was', 'nice']

building a vocabulary:

def tokenized_review_iterator(reviews):
    for r in reviews:
        yield tokenizer(r)

vocab = build_vocab_from_iterator(tokenized_review_iterator(X_train), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
vocab(['the', 'place', 'was', 'nice'])
# [33, 31, 3826, 15]
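
Conceptually, a vocabulary is just a token-to-index dictionary with a fallback for unknown words. The following pure-Python sketch mimics that behaviour on a toy corpus. It is not torchtext’s implementation; `build_vocab` and `encode` are hypothetical helpers, and the whitespace tokenizer is a simplification of `basic_english`.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<unk>",)):
    """Map each token to an index: specials first, then tokens by descending frequency."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    # sort by frequency (descending), then alphabetically so the order is deterministic
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {token: idx for idx, token in enumerate(list(specials) + ordered)}


def encode(tokens, vocab, unk="<unk>"):
    """Replace tokens with indices, falling back to the <unk> index."""
    return [vocab.get(token, vocab[unk]) for token in tokens]


corpus = ["the place was nice", "the room was small", "nice staff"]
toy_vocab = build_vocab(review.lower().split() for review in corpus)

print(encode("the pool was nice".split(), toy_vocab))
# [2, 0, 3, 1]
```

The word "pool" never occurs in the toy corpus, so it is mapped to index 0, the `<unk>` token.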

Now we need to implement a custom PyTorch Dataset that will handle preparing and serving data during training and evaluation.

target_map = {
    "positive": 0,
    "neutral": 1,
    "negative": 2
}

text_pipeline = lambda x: vocab(tokenizer(x))
label_pipeline = lambda x: target_map[x]

class ReviewDataset(Dataset):

    def __init__(self, X, y, text_pipeline, label_pipeline):
        self.X = X
        self.y = y
        self.text_pipeline = text_pipeline
        self.label_pipeline = label_pipeline

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        text = torch.tensor(self.text_pipeline(self.X.iloc[idx]))
        length = torch.tensor(len(text))
        label = torch.tensor(self.label_pipeline(self.y.iloc[idx]))
        return {"text": text, "length": length, "labels": label}

train_dataset = ReviewDataset(X_train, y_train, text_pipeline, label_pipeline)
test_dataset = ReviewDataset(X_test, y_test, text_pipeline, label_pipeline)

Additionally, as our reviews don’t have equal lengths, we have to provide a function that pads the shorter sequences in each batch with zeros (index 0, which in our vocabulary is also `<unk>`).

def collate(batch):
    # pack_padded_sequence expects the longest sequence first
    batch.sort(key=lambda x: x["length"], reverse=True)
    text, lengths, labels = zip(*[d.values() for d in batch])
    # pad shorter reviews with zeros up to the longest review in the batch
    text = torch.nn.utils.rnn.pad_sequence(text, batch_first=True)
    lengths = torch.stack(lengths)
    labels = torch.stack(labels)
    return text, lengths, labels
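
To make the padding step concrete, the sketch below pads two toy sequences and then packs them the way the model will. The token indices are made up; only `pad_sequence` and `pack_padded_sequence` from `torch.nn.utils.rnn` are used.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# two already-encoded "reviews" of different lengths, longest first
sequences = [torch.tensor([5, 8, 2]), torch.tensor([7])]
lengths = torch.tensor([len(s) for s in sequences])

padded = pad_sequence(sequences, batch_first=True)  # pads with 0 by default
print(padded)
# tensor([[5, 8, 2],
#         [7, 0, 0]])

# packing keeps only the real tokens, so an LSTM never processes the padding
packed = pack_padded_sequence(padded, lengths, batch_first=True)
print(packed.data)         # tensor([5, 7, 8, 2])
print(packed.batch_sizes)  # tensor([2, 1, 1])
```

The packed data is ordered by time step: both sequences contribute a token at step 0, and only the longer one contributes at steps 1 and 2.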

Model implementation

As for our model, we’ll implement a simple LSTM with embeddings. It’ll consist of:

  • embedding layer – for training vector representations of the words in our vocabulary
  • LSTM layers – as our core feature extractor
  • dropout and batch normalization layers – for regularisation
  • dense layer – for mapping extracted features to predictions


class SentimentLSTM(nn.Module):

    def __init__(self, vocab_size, embed_dim, hidden_size, n_layers, num_class):
        super().__init__()
        # sparse=True makes the embedding gradients sparse, so the embedding
        # needs an optimizer that supports them (SparseAdam)
        self.embedding = nn.Embedding(vocab_size, embed_dim, sparse=True)
        self.lstm = nn.LSTM(embed_dim, hidden_size, num_layers=n_layers, batch_first=True)
        self.drop = nn.Dropout(0.5)
        self.batch_norm = nn.BatchNorm1d(n_layers * hidden_size)
        self.dense = nn.Linear(n_layers * hidden_size, num_class)

    def dense_parameters(self):
        # every parameter with dense gradients, i.e. everything except the embedding
        return (list(self.lstm.parameters()) + list(self.batch_norm.parameters())
                + list(self.dense.parameters()))

    def forward(self, encoded_text, lengths):
        batch_size = lengths.shape[0]
        embedded = self.embedding(encoded_text)
        # packing lets the LSTM skip padded positions; lengths must live on the CPU
        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, lengths.cpu(), batch_first=True)
        _, (hidden, cell) = self.lstm(packed_embedded)
        # hidden: (n_layers, batch, hidden_size) -> (batch, n_layers * hidden_size)
        hidden = hidden.permute(1, 0, 2).contiguous().view(batch_size, -1)
        hidden = self.drop(hidden)
        hidden = self.batch_norm(hidden)
        hidden = self.dense(hidden)
        return hidden
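
The reshaping of `hidden` in `forward` is easier to follow with concrete shapes. The sketch below runs a small, randomly initialized `nn.LSTM` on random input; the sizes are arbitrary values chosen for illustration.

```python
import torch
from torch import nn

batch_size, seq_len, embed_dim = 4, 7, 16
hidden_size, n_layers = 8, 3

lstm = nn.LSTM(embed_dim, hidden_size, num_layers=n_layers, batch_first=True)
embedded = torch.randn(batch_size, seq_len, embed_dim)

_, (hidden, cell) = lstm(embedded)
print(hidden.shape)  # torch.Size([3, 4, 8]): one final state per layer

# move the batch dimension first, then concatenate the per-layer states
features = hidden.permute(1, 0, 2).contiguous().view(batch_size, -1)
print(features.shape)  # torch.Size([4, 24]): n_layers * hidden_size features per review
```

This is why both `BatchNorm1d` and the final `Linear` layer take `n_layers * hidden_size` inputs.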

Training

It’s finally time to train the model. Below are some example training parameters (feel free to tune them), followed by the training loop itself.

# training parameters
n_epoch = 20
lr = 1e-4
batch_size = 512

# model parameters
embedding_dim = 256
hidden_size = 128
n_layers = 3

model = SentimentLSTM(len(vocab), embedding_dim, hidden_size, n_layers, 3)

losses = {"train": [], "validation": []}
accuracies = {"train": [], "validation": []}

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate)
# shuffling is only needed for training
validation_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=collate)

criterion = nn.CrossEntropyLoss()
optimizer_sparse = optim.SparseAdam(model.embedding.parameters(), lr=lr)
optimizer_dense = optim.Adam(model.dense_parameters(), lr=lr)

model = model.to(device)

for n in range(n_epoch):
    # training pass
    model.train()
    epoch_loss = []
    epoch_acc = []
    for encoded_text, lengths, labels in train_loader:
        optimizer_dense.zero_grad()
        optimizer_sparse.zero_grad()

        encoded_text, lengths, labels = encoded_text.to(device), lengths.to(device), labels.to(device)
        y_pred = model(encoded_text, lengths)
        loss = criterion(y_pred, labels)

        loss.backward()
        optimizer_sparse.step()
        optimizer_dense.step()

        epoch_loss.append(loss.item())
        acc = accuracy_score(labels.detach().cpu(), y_pred.argmax(1).detach().cpu())
        epoch_acc.append(acc)

    avg_loss = sum(epoch_loss) / len(epoch_loss)
    avg_acc = sum(epoch_acc) / len(epoch_acc)
    print(f"epoch:{n} train_loss: {avg_loss:.4f}; train_acc: {avg_acc:.4f}")
    losses["train"].append(avg_loss)
    accuracies["train"].append(avg_acc)

    # validation pass: no gradients, dropout off, batch norm uses running statistics
    model.eval()
    epoch_loss = []
    epoch_acc = []
    with torch.no_grad():
        for encoded_text, lengths, labels in validation_loader:
            encoded_text, lengths, labels = encoded_text.to(device), lengths.to(device), labels.to(device)
            y_pred = model(encoded_text, lengths)
            loss = criterion(y_pred, labels)

            epoch_loss.append(loss.item())
            acc = accuracy_score(labels.detach().cpu(), y_pred.argmax(1).detach().cpu())
            epoch_acc.append(acc)

    avg_loss = sum(epoch_loss) / len(epoch_loss)
    avg_acc = sum(epoch_acc) / len(epoch_acc)
    print(f"epoch:{n} validation_loss: {avg_loss:.4f}; validation_acc: {avg_acc:.4f}")
    losses["validation"].append(avg_loss)
    accuracies["validation"].append(avg_acc)
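
Note that the loop averages per-batch accuracies, so a smaller final batch counts as much as a full one. The difference is usually tiny, but an exact epoch accuracy can be obtained by weighting each batch by its size. A minimal sketch, with made-up batch results and a hypothetical helper `weighted_accuracy`:

```python
def weighted_accuracy(batch_accuracies, batch_sizes):
    """Epoch accuracy where every sample, not every batch, counts equally."""
    correct = sum(acc * size for acc, size in zip(batch_accuracies, batch_sizes))
    return correct / sum(batch_sizes)


# made-up example: two full batches of 512 and a final batch of 8
accs = [0.50, 0.75, 1.00]
sizes = [512, 512, 8]

print(f"mean of batch means: {sum(accs) / len(accs):.4f}")
print(f"sample-weighted:     {weighted_accuracy(accs, sizes):.4f}")
# mean of batch means: 0.7500
# sample-weighted:     0.6279
```

The example is deliberately extreme; with realistic batches the two numbers are much closer.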

Results

 

loss

 

accuracy

 

As the plots above show, we can pretty quickly train our LSTM to an accuracy of ~65%, which could still be improved with some hyperparameter tuning. With those results, we easily beat the baseline result of 33%.

table

Conclusion

Sentiment analysis can be a very useful tool for monitoring user responses. Its most significant advantage is that it lets you use direct user feedback with minimal human supervision while still scaling up easily.

 

