Sentiment analysis is the task of identifying the sentiment expressed in a text. Here, sentiment is understood very broadly. It can be as simple as whether a text is positive or not, but it can also cover more nuanced emotions or attitudes of the author, such as anger, anxiety, or excitement. It’s even possible to train a model to detect sarcasm.
Today, most sentiment analysis systems rely on NLP and machine or deep learning (earlier systems leaned on computational linguistics and text mining). This allows for a relatively straightforward implementation that needs only existing labeled data as input, with no hand-crafted linguistic knowledge.
Applications of Sentiment Analysis
From a business point of view, sentiment analysis presents a way of automatically monitoring “what people are saying” without having to read any of it yourself. For example, it can be used for:
product review monitoring – monitoring which of your products receive a higher rate of positive comments
market research – discovering attitudes of internet users toward the research target
search engines/recommender systems – enhancing performance by better understanding what users meant by the content of a query
Most tools use Python-based applications and libraries for this, but there are also other solutions.
“In implementing Sentiment Analysis in Python for startups, we closely collaborate with clients to align the tool with their specific needs. Utilizing Python’s libraries, we tailor the sentiment analysis to their industry’s requirements. Our approach involves transparent communication and iterative feedback, ensuring the solution effectively supports the client’s objectives and enhances their decision-making processes.”
Mike Jackowski, COO, ASPER BROTHERS
Data gathering
One of the most obvious places to look for data for your sentiment analysis system is social media platforms such as Facebook, Twitter, or Instagram. But these sites, like countless others, try to protect their users’ privacy and experience by blocking bot activity (e.g., with CAPTCHAs). Fortunately, nowadays, they also provide official APIs for accessing some of their content.
For our project, we’ll need the following packages installed:
pytorch with torchtext
scikit-learn (imported as sklearn)
pandas
matplotlib
As for data, we need to download and unpack this dataset
When we have everything in place, we can start by importing everything we will need during implementation.
import torch
from torch import nn
from torch import optim
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import DataLoader, Dataset
Additionally, we have to specify what device we will be training an LSTM on:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
I strongly recommend using CUDA. Otherwise, training will be very time-consuming. If you don’t have access to a GPU machine, consider using a free one on Google Colab or Kaggle.
And as the last setup step, we’ll specify a global random seed for reproducibility:
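The seeding code itself is not shown here, so the following is a minimal sketch of what such a step could look like. The `set_seed` helper and the value of `SEED` are assumptions made for illustration.

```python
import random

import numpy as np
import torch

SEED = 42  # hypothetical value; any fixed integer works


def set_seed(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch random number generators."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        # Seed every visible GPU as well
        torch.cuda.manual_seed_all(seed)


set_seed(SEED)
```

Note that a fixed seed makes runs repeatable on the same machine and library versions, but some GPU operations may still be non-deterministic.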
First things first, let’s load the data into a pandas DataFrame:
df = pd.read_csv(DATA_PATH)
df.head()
For each Review in our dataset, we are provided with a rating in the form of a number from 1 to 5 (1: very bad experience, 5: very good experience).
In a real-world situation, this `Rating` column would be more than enough for us to train the review scoring regression model. But instead of that, we’ll demonstrate how to work with less informative data. We’ll categorize reviews into three categories:
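The exact category boundaries are not listed here, so the sketch below assumes one plausible mapping: ratings 1 and 2 become negative, 3 becomes neutral, and 4 and 5 become positive. The `rating_to_sentiment` helper, the `Sentiment` column name, and the toy data are hypothetical; only the `Review` and `Rating` column names come from the text. The split at the end shows how `X_train`, `X_test`, `y_train`, and `y_test` could be produced.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def rating_to_sentiment(rating: int) -> int:
    """Map a 1-5 rating to 0 (negative), 1 (neutral), or 2 (positive).

    The thresholds are an assumption, not taken from the article.
    """
    if rating <= 2:
        return 0
    if rating == 3:
        return 1
    return 2


# Toy frame standing in for the real dataset
toy_df = pd.DataFrame({
    "Review": ["awful stay", "bad room", "it was ok", "fine", "great hotel", "loved it"],
    "Rating": [1, 2, 3, 3, 4, 5],
})
toy_df["Sentiment"] = toy_df["Rating"].apply(rating_to_sentiment)

# Stratify so each split keeps the same share of every category
X_train, X_test, y_train, y_test = train_test_split(
    toy_df["Review"],
    toy_df["Sentiment"],
    test_size=0.5,
    random_state=42,
    stratify=toy_df["Sentiment"],
)
```

On the real data you would apply the same mapping to `df["Rating"]` and pick a smaller `test_size`.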
To evaluate the model properly, we first have to establish a baseline. In our case (sentiment classification), we can build a simple model that picks a sentiment at random.
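class RandomBaseline:
    def __init__(self):
        self.categories = {}

    def fit(self, data, target_col):
        # Store the relative frequency of every category in the target column
        self.categories = data[target_col].value_counts(normalize=True).to_dict()

    def predict(self, data):
        # Draw one category per row, uniformly at random. To sample by class
        # frequency instead, pass p=list(self.categories.values()).
        return np.random.choice(list(self.categories.keys()), len(data))

rb = RandomBaseline()
# Assumption: y_train holds the training labels from the train/test split;
# the "sentiment" column name is only used inside this call
rb.fit(pd.DataFrame({"sentiment": y_train}), "sentiment")
pred = rb.predict(X_test)
accuracy_score(y_test, pred)
# 0.3273969260795316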
As expected, the accuracy of randomly picking 1 of 3 categories is close to 33%.
Data preparation
Before we start modeling, we have to transform the reviews into a form that is “understandable” for the neural network. We’ll do it by:
1. tokenizing each review – converting it to a list of words (tokens)
2. building a vocabulary – mapping each token to an index
tokenization:
tokenizer = get_tokenizer('basic_english')
tokenizer("the place was nice")
# ['the', 'place', 'was', 'nice']
building a vocabulary:
def tokenized_review_iterator(reviews):
    for r in reviews:
        yield tokenizer(r)
vocab = build_vocab_from_iterator(tokenized_review_iterator(X_train), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])
vocab(['the', 'place', 'was', 'nice'])
# [33, 31, 3826, 15]
Now we need to implement a custom PyTorch Dataset that will handle preparing and serving data during training and evaluation.
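The Dataset class is not shown here, so below is a minimal sketch of one that would fit the rest of the code. The `ReviewDataset` name and its constructor arguments are assumptions. The dict layout (keys `text`, `length`, `label`, in that order) is inferred from how the collate function in this article reads each sample.

```python
import torch
from torch.utils.data import Dataset


class ReviewDataset(Dataset):
    """Serves encoded reviews as dicts with "text", "length", and "label" keys."""

    def __init__(self, reviews, labels, tokenizer, vocab):
        self.reviews = list(reviews)
        self.labels = list(labels)
        self.tokenizer = tokenizer  # callable: str -> list of tokens
        self.vocab = vocab          # callable: list of tokens -> list of indices

    def __len__(self):
        return len(self.reviews)

    def __getitem__(self, idx):
        token_ids = self.vocab(self.tokenizer(self.reviews[idx]))
        # Key order matters when a collate function unpacks d.values() positionally
        return {
            "text": torch.tensor(token_ids, dtype=torch.long),
            "length": torch.tensor(len(token_ids), dtype=torch.long),
            "label": torch.tensor(self.labels[idx], dtype=torch.long),
        }
```

With the tokenizer and vocabulary built earlier, it could be instantiated as `ReviewDataset(X_train, y_train, tokenizer, vocab)`.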
Additionally, since reviews differ in length, we have to provide a function that pads the shorter sequences in each batch with blank tokens.
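def collate(batch):
    # Sort the batch by review length, longest first
    batch.sort(key=lambda x: x["length"], reverse=True)
    # Each sample is a dict whose values come in the order text, length, label
    text, lengths, labels = zip(*[d.values() for d in batch])
    # Pad shorter reviews with zeros up to the longest review in the batch
    text = torch.nn.utils.rnn.pad_sequence(text, batch_first=True)
    lengths = torch.stack(lengths)
    labels = torch.stack(labels)
    return text, lengths, labels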
Model implementation
As for our model, we’ll implement a simple LSTM with embeddings. It will consist of:
– embedding layer – for training vector representations of words in our vocabulary
– lstm layers – as our core feature extractor
– dropout and batch normalization layers – for regularisation
– dense layer – for mapping extracted features to predictions
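The model code is not shown here, so the following is a sketch of one way to combine the layers listed above. The `SentimentLSTM` name and all layer sizes are hypothetical. The sparse embedding is an inference from the training loop, which uses separate `optimizer_sparse` and `optimizer_dense` objects, and packing the padded batch is a design choice of this sketch.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence


class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embedding_dim=64, hidden_dim=128,
                 n_layers=2, n_classes=3, dropout=0.3):
        super().__init__()
        # sparse=True yields sparse gradients, which need a sparse-aware optimizer
        self.embedding = nn.Embedding(vocab_size, embedding_dim, sparse=True)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            batch_first=True, dropout=dropout)
        self.batch_norm = nn.BatchNorm1d(hidden_dim)
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim, n_classes)

    def forward(self, encoded_text, lengths):
        embedded = self.embedding(encoded_text)
        # Packing lets the LSTM skip padded positions. Lengths must be on the
        # CPU and sorted in decreasing order (the default enforce_sorted=True).
        packed = pack_padded_sequence(embedded, lengths.cpu(), batch_first=True)
        _, (hidden, _) = self.lstm(packed)
        features = hidden[-1]  # final hidden state of the top LSTM layer
        features = self.dropout(self.batch_norm(features))
        return self.fc(features)  # raw logits, one per category
```

A model could then be created with `model = SentimentLSTM(len(vocab)).to(device)`. Because the output is raw logits, it pairs with a loss such as `nn.CrossEntropyLoss`.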
It’s finally time to begin training. Below you can see some example training parameters (feel free to fine-tune them), followed by our training loop.
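The parameter values themselves are not shown here, so the numbers below are placeholders rather than the settings behind the reported results. The `build_optimizers` helper is hypothetical. It assumes the model's embedding layer was created with `sparse=True`, which would explain why the training loop steps two optimizers.

```python
import torch
from torch import nn, optim

# Example values only; none of them come from the original article
n_epoch = 10
batch_size = 64
learning_rate = 1e-3


def build_optimizers(model, lr):
    """Return (optimizer_dense, optimizer_sparse) for a model whose
    nn.Embedding layers were created with sparse=True."""
    sparse_params, dense_params = [], []
    for module in model.modules():
        # recurse=False so each parameter is visited exactly once
        params = list(module.parameters(recurse=False))
        if isinstance(module, nn.Embedding) and module.sparse:
            sparse_params.extend(params)
        else:
            dense_params.extend(params)
    optimizer_dense = optim.Adam(dense_params, lr=lr)
    optimizer_sparse = optim.SparseAdam(sparse_params, lr=lr)
    return optimizer_dense, optimizer_sparse


criterion = nn.CrossEntropyLoss()  # expects raw logits and integer class labels

# History containers that the training loop appends to once per epoch
losses = {"train": [], "validation": []}
accuracies = {"train": [], "validation": []}
```

Calling `optimizer_dense, optimizer_sparse = build_optimizers(model, learning_rate)` then provides the two optimizers that the loop steps after each batch.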
for n in range(n_epoch):
    epoch_loss = []
    epoch_acc = []
    for encoded_text, lengths, labels in train_loader:
        model = model.train()
        optimizer_dense.zero_grad()
        optimizer_sparse.zero_grad()
        encoded_text, lengths, labels = encoded_text.to(device), lengths.to(device), labels.to(device)
        y_pred = model(encoded_text, lengths)
        loss = criterion(y_pred, labels)
        loss.backward()
        optimizer_sparse.step()
        optimizer_dense.step()
        epoch_loss.append(loss.item())
        acc = accuracy_score(labels.detach().cpu(), y_pred.argmax(1).detach().cpu())
        epoch_acc.append(acc)
    avg_loss = (sum(epoch_loss) / len(epoch_loss))
    avg_acc = (sum(epoch_acc) / len(epoch_acc))
    print(f"epoch:{n} train_loss: {avg_loss:.4f}; train_acc: {avg_acc:.4f}")
    losses["train"].append(avg_loss)
    accuracies["train"].append(avg_acc)
    epoch_loss = []
    epoch_acc = []
    with torch.no_grad():
        for encoded_text, lengths, labels in validation_loader:
            model = model.eval()
            encoded_text, lengths, labels = encoded_text.to(device), lengths.to(device), labels.to(device)
            y_pred = model(encoded_text, lengths)
            loss = criterion(y_pred, labels)
            epoch_loss.append(loss.item())
            acc = accuracy_score(labels.detach().cpu(), y_pred.argmax(1).detach().cpu())
            epoch_acc.append(acc)
    avg_loss = (sum(epoch_loss) / len(epoch_loss))
    avg_acc = (sum(epoch_acc) / len(epoch_acc))
    print(f"epoch:{n} validation_loss: {avg_loss:.4f}; validation_acc: {avg_acc:.4f}")
    losses["validation"].append(avg_loss)
    accuracies["validation"].append(avg_acc)
Results
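The `losses` and `accuracies` dictionaries filled by the training loop can be turned into learning curves. The `plot_history` helper below is a hypothetical sketch of how such plots could be drawn with matplotlib, not necessarily how the original figures were made.

```python
import matplotlib.pyplot as plt


def plot_history(losses, accuracies):
    """Plot per-epoch loss and accuracy for the train and validation splits."""
    fig, (ax_loss, ax_acc) = plt.subplots(1, 2, figsize=(12, 4))
    for split in ("train", "validation"):
        ax_loss.plot(losses[split], label=split)
        ax_acc.plot(accuracies[split], label=split)
    ax_loss.set(title="Loss", xlabel="epoch", ylabel="loss")
    ax_acc.set(title="Accuracy", xlabel="epoch", ylabel="accuracy")
    ax_loss.legend()
    ax_acc.legend()
    fig.tight_layout()
    return fig
```

After training, calling `plot_history(losses, accuracies)` followed by `plt.show()` displays both panels side by side.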
As the plots above show, we can pretty quickly train our LSTM to an accuracy of about 65%, which could still be improved with some hyperparameter tuning. That comfortably beats the baseline result of 33%.
Conclusion
Sentiment analysis can be a very useful tool for monitoring user responses. Its most significant advantage is that it lets you use direct user feedback with minimal human supervision while still scaling up easily.