Bar Reviews#
Image credit: https://www.zumarraga.eus/es/
Feb, 2023
Web Scraping Sentiment Analysis
Background#
I wanted to do sentiment analysis on reviews written in Spanish. I couldn't use the TextBlob library, because that out-of-the-box package is trained on English text. Therefore, I had to build my own machine learning model to get a positive-or-negative-opinion classifier for Spanish.
Also, to train the model I needed enough reviews, both good and bad. A special place came to mind: a very nice bar-restaurant in my town that I personally appreciate very much but which receives polarised reviews from customers.
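Just to illustrate the limitation: TextBlob's default analyzer relies on an English lexicon, so Spanish text simply falls through as neutral. A minimal sketch (assuming TextBlob is installed; the two sentences are made-up examples):
# TextBlob only understands English out of the box
from textblob import TextBlob
print(TextBlob("The food was terrible").sentiment.polarity)     # strongly negative (close to -1)
print(TextBlob("La comida estaba malísima").sentiment.polarity)  # ~0.0: Spanish words are not in its lexicon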
# Import basic packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Auxiliaries
import time
import datetime
from IPython.display import Markdown
# Selenium
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
# Parsing
from bs4 import BeautifulSoup
# Sentiment analysis, Scikit-learn
from nltk import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report
Web scraping the data#
I had to scrape the reviews from the bar's page on Google Maps. The page is dynamic, so I had to write some code to operate on it sequentially, scrolling down and clicking buttons before fetching the information.
# Start the session
service = Service(executable_path="/driver") # chromedriver.exe location
driver = webdriver.Chrome(service=service)
# Navigate to the webpage
url = "https://www.google.com/maps/place/Bar-Restaurante+Bidebide+-+Zumarraga/@43.0916521,-2.3038457,15z/data=!4m7!3m6!1s0x0:0xbbdffd835d70d9e4!8m2!3d43.0916521!4d-2.3038457!9m1!1b1"
driver.get(url)
# Agree with cookies in Google consent page
driver.find_element(by=By.XPATH,
                    value='//*[@id="yDmH0d"]/c-wiz/div/div/div/div[2]/div[1]/div[3]/div[1]/div[1]/form[2]/div/div/button/span').click()
# Request page title
title = driver.title
print(f"Page title -> {title}")
# Find the total number of reviews
number_reviews = driver.find_element(by=By.XPATH,
                                     value='//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]/div[1]/div/div[2]/div[2]')
n_reviews = int(number_reviews.text.split(" ")[0])
# Find scrollable element
scrollable_div = driver.find_element(by=By.XPATH,
                                     value='//*[@id="QA0Szd"]/div/div/div[1]/div[2]/div/div[1]/div/div/div[2]')
# Scroll down as many times as necessary to load all reviews
for i in range(0, (round(n_reviews/10 - 1))):
    driver.execute_script('arguments[0].scrollTop = arguments[0].scrollHeight', scrollable_div)
    time.sleep(1)
# Click 'Más' buttons to reveal complete text of reviews
buttons = driver.find_elements(by=By.TAG_NAME, value="button")
for button in buttons:
    if button.text == "Más":
        button.click()
# Wait
time.sleep(5)
# Parse html page
soup = BeautifulSoup(driver.page_source)
# Init
names = []
dates = []
stars = []
texts = []
# Search for the data
for review in soup.find_all('div', attrs={'jstcache':'393'}):
    # Extract stars
    star = review.find('span', attrs={'class':'kvMYJc'})
    if star is None:
        stars.append('')
    else:
        star = star["aria-label"]
        stars.append(star)
    # Extract name
    name = review.find('span', attrs={'jstcache':'358'})
    if name is None:
        names.append('')
    else:
        name = name.get_text(strip=True)
        names.append(name)
    # Extract date
    date = review.find('span', attrs={'jstcache':'139'})
    if date is None:
        dates.append('')
    else:
        date = date.get_text(strip=True)
        dates.append(date)
    # Extract text
    text = review.find('span', attrs={'jstcache':'292'})
    if text is None:
        texts.append('')
    else:
        text = text.get_text(strip=True)
        texts.append(text)
# Store listed reviews in a pandas dataframe
antio = pd.DataFrame({"name": names, "date": dates, "stars": stars, "text": texts})
# End the browser session
driver.quit()
# Display dataframe
antio
Page title -> Bar-Restaurante Bidebide - Zumarraga - Google Maps
| | name | date | stars | text |
|---|---|---|---|---|
| 0 | Alberto Urtaza | Hace un año | 4 estrellas | Excelente trato y buena relación precio-calida... |
| 1 | Mikel Landa | Hace 5 meses | 5 estrellas | Muy buen menú con mucha variedad y buen produc... |
| 2 | Francisco Javier Martin Mateo | Hace 4 meses | 4 estrellas | Comida buenísima y buena atención, vistas mara... |
| 3 | VICTOR T O | Hace 7 meses | 5 estrellas | He comido el menú del día por 13 euros, y tien... |
| 4 | Mari Carmen Merino | Hace 3 meses | 4 estrellas | Restaurante grande con unos menús muy ricos y ... |
| ... | ... | ... | ... | ... |
| 454 | Aitor R | Hace 7 meses | 4 estrellas | |
| 455 | sheila perez | Hace 8 meses | 5 estrellas | |
| 456 | Alejandro Alban | Hace un año | 3 estrellas | |
| 457 | Cristhian Frutos | Hace 2 años | 3 estrellas | |
| 458 | JM I | Hace 2 años | 3 estrellas | |
459 rows × 4 columns
Data processing#
The acquired data needed some processing before proceeding with the analysis:
Extract the number of stars as an integer from the string-type stars column.
Convert date to the year in which the review was written.
Rearrange the pandas dataframe: desired columns, names, order, index.
Finally, fill in missing (NaN) values in text with empty strings.
# Convert stars column from string to number
antio["stars_n"] = [stars[1] for stars in antio["stars"]]
antio["stars_n"] = antio["stars_n"].astype(str).astype(int)
# Convert date to the year in which the review was written
years = []
for text in antio["date"]:
    # "un" (as in "Hace un año") counts as 1
    if text[5] == "u":
        number = "1"
    else:
        number = text[5:7]  # Number may take 2 characters
    # Calculate year from "months ago"
    actual_month = datetime.date.today().month
    actual_year = datetime.date.today().year
    if "mes" in text:
        if (actual_month - int(number)) < 0:
            year = actual_year - 1
        else:
            year = actual_year
    # Calculate year from "years ago"
    elif "año" in text:
        year = actual_year - int(number)
    # Append to list of years
    years.append(year)
antio["date_n"] = years
# Rearrange dataframe with columns name, date, stars, text
antio = antio.iloc[:, [0, 5, 4, 3]]
antio = antio.rename(columns={"date_n": "date", "stars_n": "stars"})
antio = antio.set_index("date")
# Replace missing values
antio = antio.fillna("")
# Show dataframe info
print(antio.info())
# Print head
antio.head()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 459 entries, 2022 to 2021
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 459 non-null object
1 stars 459 non-null int32
2 text 459 non-null object
dtypes: int32(1), object(2)
memory usage: 12.6+ KB
None
| date | name | stars | text |
|---|---|---|---|
| 2022 | Alberto Urtaza | 4 | Excelente trato y buena relación precio-calida... |
| 2022 | Mikel Landa | 5 | Muy buen menú con mucha variedad y buen produc... |
| 2022 | Francisco Javier Martin Mateo | 4 | Comida buenísima y buena atención, vistas mara... |
| 2022 | VICTOR T O | 5 | He comido el menú del día por 13 euros, y tien... |
| 2022 | Mari Carmen Merino | 4 | Restaurante grande con unos menús muy ricos y ... |
Data analysis#
Number of reviews per year#
# Count reviews per year
reviews_yearly = antio.groupby(antio.index)["name"].count()
# Plot
fig, ax = plt.subplots(figsize=(7.5, 5))
reviews_yearly.plot(ax=ax, kind="bar", color="teal")
ax.grid(axis="y")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=14, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("Reviews at Google Maps for Bidebide", size=15)
ax.set_xlabel("")
ax.set_ylabel("# of reviews", size=14)
sns.despine()
plt.show()

Distribution over the years of the number of reviews on Google Maps.
Stars’ distribution#
# Count stars
stars_count = antio["stars"].value_counts().to_frame()
# Define a dataframe with all possibilities: 1, 2, 3, 4, 5 stars
frame = pd.DataFrame(index=[1, 2, 3, 4, 5])
# Merge them to build a dataframe ready to plot
stars = frame.merge(stars_count, how="left",
                    left_on=frame.index, right_on=stars_count.index)
stars = stars.rename(columns={"key_0": "rate", "stars": "number"})
stars = stars.set_index("rate").fillna(0).astype(int)
# Plot
fig, ax = plt.subplots(figsize=(7.5, 5))
stars.plot(ax=ax, kind="bar", color="teal")
ax.grid(axis="y")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=16, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("Reviews rating", size=16)
ax.set_xlabel("Stars", size=15)
ax.set_ylabel("# of reviews", size=15)
ax.bar_label(ax.containers[0], size=14)
ax.legend().set_visible(False)
sns.despine()
plt.show()

Let's calculate the overall rating.
# Calculate the average rating
n_reviews = stars["number"].sum()
rating = 0
for stars, number in zip(stars.index, stars["number"]):
rating += stars * number
rating = rating / n_reviews
display(Markdown(f"General rating --> **{rating:.1f}**"))
General rating --> **3.9**
This should be the same rating that Google itself calculates and displays for the place.
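As a quick cross-check, the weighted average above is, by construction, just the mean of the stars column:
# Same value, computed directly as the mean of the stars column
print(f"Mean of 'stars' column -> {antio['stars'].mean():.1f}")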
Sentiment analysis#
Data labeling#
To build the model, we first have to label the reviews as:
0 : if the review is negative.
1 : if the review is positive.
Depending on the quantity, reading each text and assigning it the corresponding label can become tedious. Still, at this stage the task cannot be automated and needs to be done by a human (a simple job of the kind crowdsourced on platforms like Mechanical Turk).
For convenience, though, I will trust the customers and automate it: reviews with 4 or 5 stars are labeled positive, and reviews with 1, 2 or 3 stars are labeled negative.
# Copy to a new dataframe only reviews containing text
antiox = antio[antio["text"] != ""].copy()
# Print how many were selected
print(f"{len(antiox)} 'texted' reviews out of {len(antio)} were selected.")
# Create new column with labels according to number of stars
antiox["label"] = [0 if star < 4 else 1 for star in antiox["stars"]]
# Rearrange dataframe to contain only the label and text columns, the latter renamed "review"
antiox = antiox.reset_index().iloc[:, [4, 3]]
antiox = antiox.rename(columns={"text": "review"})
antiox
193 'texted' reviews out of 459 were selected.
| | label | review |
|---|---|---|
| 0 | 1 | Excelente trato y buena relación precio-calida... |
| 1 | 1 | Muy buen menú con mucha variedad y buen produc... |
| 2 | 1 | Comida buenísima y buena atención, vistas mara... |
| 3 | 1 | He comido el menú del día por 13 euros, y tien... |
| 4 | 1 | Restaurante grande con unos menús muy ricos y ... |
| ... | ... | ... |
| 188 | 1 | (Traducido por Google) Gran lugar\n\n(Original... |
| 189 | 1 | (Traducido por Google) Increíble vista\n\n(Ori... |
| 190 | 1 | (Traducido por Google) Hermoso menú y temperat... |
| 191 | 1 | (Traducido por Google) Gran lugar\n\n(Original... |
| 192 | 0 | (Traducido por Google) El restaurante\n\n(Orig... |
193 rows × 2 columns
# Count plot
fig, ax = plt.subplots(figsize=(7.5, 5))
antiox["label"].value_counts().sort_values().plot(ax=ax, kind="bar", color="teal")
ax.grid(axis="y")
ax.set_axisbelow(True)
ax.tick_params(axis='x', labelsize=16, rotation=0)
ax.tick_params(axis='y', labelsize=14)
ax.set_title("Polarity of texted reviews", size=16)
ax.set_xlabel("Label", size=15)
ax.set_ylabel("# of reviews", size=15)
ax.bar_label(ax.containers[0], size=14)
ax.legend().set_visible(False)
ax.annotate("\U0001F603", (1, 40), ha="center", size=30) # Smiling emoji
ax.annotate("\U0001F620", (0, 40), ha="center", size=30) # Angry emoji
sns.despine()
plt.show()

We can see that there is an imbalance in the binary target variable (the label) we want the model to predict. The proportion is roughly 1:2. Even though the imbalance is not extreme, it will affect the predictions, because the class of interest is the under-represented one: it is the negative reviews (the "0"s) that we would most like to detect automatically, so that they can be flagged and dealt with, yet there is less information available about them. I will address this issue while modelling.
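For reference, this is roughly what the "balanced" class weighting used later does under the hood: each class gets a weight of n_samples / (n_classes * class_count), so the under-represented negative class counts roughly twice as much during training. A small sketch, not part of the original pipeline:
# Sketch: explicit "balanced" class weights for the label column
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]),
                               y=antiox["label"])
print(dict(zip([0, 1], weights)))  # the minority class (0) gets the larger weight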
Bag-of-words#
Next, we have to transform our text data into numeric form. A machine learning model cannot work with text directly, only with numeric features created from it. A basic method is the bag-of-words (BOW): it builds a vocabulary of the words that appear in the documents and keeps track of how often each one occurs, discarding word order and grammar (as if all the words were thrown into a bag!).
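As a toy illustration (two made-up sentences, not part of the scraped data), this is what a bag-of-words representation looks like:
# Toy bag-of-words example with two short made-up sentences
from sklearn.feature_extraction.text import CountVectorizer
toy = ["la comida es muy buena", "la comida es muy mala, muy mala"]
toy_vect = CountVectorizer()
toy_bow = toy_vect.fit_transform(toy)
print(toy_vect.get_feature_names_out())  # vocabulary: ['buena' 'comida' 'es' 'la' 'mala' 'muy']
print(toy_bow.toarray())                 # per-sentence word counts; word order is gone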
# Get list of stopwords in Spanish from file
with open("data/stopwords-spanish.txt", encoding="utf-8") as file:
stopwords_spanish = file.read().splitlines()
# Build stopwords list for this project
stopwords = stopwords_spanish\
+ ["zumarraga", "zumárraga", "urretxu", "gipuzkoa", "google"]
# To ignore digits and only consider words of two or more letters
token_pattern=r'\b[^\d\W][^\d\W]+\b'
# Build and fit the vectorizer
vect = CountVectorizer(
    # Vocabulary size
    # max_features=,  # Pick the n most frequent words
    # max_df=,  # Ignore terms with higher than specified frequency
    # min_df=,  # Ignore terms with lower than specified frequency
    # Stopwords: do not consider these words
    stop_words=stopwords,
    # Consider this pattern (ignore digits...)
    token_pattern=token_pattern,
    # Capture some context
    ngram_range=(1, 2),  # Bigrams besides unigrams
)
vect.fit(antiox["review"])
# Transform the review column
X_review = vect.transform(antiox["review"])
# Create the bow representation
X_df = pd.DataFrame(X_review.toarray(), columns=vect.get_feature_names_out())
# # Take a look a the list of features
# print(X_df.columns.tolist())
# Show dataframe
X_df
| | abrir | abrir original | abrir ánimo | abrumado | abrumado lástima | abundante | abundante personal | acabar | acabar frerir | acabar gente | ... | zerbitzatzeko ondorioz | zona | zona acerco | zumaia | zumaia arantzazu | ánimo | ánimo seguiremos | ún | ún camarero | único |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 188 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 189 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 190 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 191 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 192 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
193 rows × 2320 columns
New features#
Having extra features besides the BOW usually results in a better model. We will add the number of words in each review as a simple measure of its length.
# Create individual word (tokens) list for reviews (list of lists)
word_tokens = [word_tokenize(review) for review in antiox["review"]]
# Iterate over the word_tokens list
len_tokens = []
for i in range(len(word_tokens)):
    len_tokens.append(len(word_tokens[i]))
# Create a new feature for the length of each review
X_df["n_tokens"] = len_tokens
Model fitting#
Predicting whether a review is good or bad is a binary classification task in machine learning. When splitting the data into train and test sets, since there is an imbalance in the target classes, it is best practice to use stratified sampling so that the split reflects the original proportion of labels.
I will choose a Logistic Regression model because it is a standard approach for binary classification. To address the imbalance issue, I will instantiate the model with the parameter class_weight
set to "balanced", so that it automatically adjusts the weights inversely proportional to the class frequencies.
# Define the vector of targets and matrix of features
y = antiox["label"]
X = X_df
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=123,
                                                    stratify=y)
# Scale X data, to help converge Logistic Regression model
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
# Train a logistic regression
log_reg = LogisticRegression(random_state=0,
                             class_weight="balanced").fit(X_train_std, y_train)
# Make predictions on the test set
y_predicted = log_reg.predict(X_test_std)
Model evaluation#
If, for example, we want to automatically detect bad reviews (to raise an alert and do something about them), it is important to catch as many of them as possible. In that case, I would choose "recall" to evaluate the model: the higher this metric, the more bad reviews are correctly identified as such.
That comes at the expense of "precision", though: higher sensitivity means more false positives (good reviews mistakenly flagged as bad), but in this case that is only a false alarm, not a real problem.
# Print the performance metrics
print('Confusion matrix: \n\n', confusion_matrix(y_test, y_predicted))
print('\nReport: \n', classification_report(y_test, y_predicted,
                                             target_names=["bad reviews ->", "good reviews ->"]))
Confusion matrix:
[[14 6]
[ 9 29]]
Report:
precision recall f1-score support
bad reviews -> 0.61 0.70 0.65 20
good reviews -> 0.83 0.76 0.79 38
accuracy 0.74 58
macro avg 0.72 0.73 0.72 58
weighted avg 0.75 0.74 0.75 58
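As a cross-check against the report above, the recall for bad reviews can be read straight off the confusion matrix: its first row holds the 20 actual bad reviews in the test set, 14 of which were flagged as bad and 6 missed.
# Recall and precision for the bad-review class, from the confusion matrix values above
caught, missed = 14, 6   # actual bad reviews: correctly flagged / missed
false_alarms = 9         # good reviews mistakenly flagged as bad
print(f"Recall (bad reviews)    -> {caught / (caught + missed):.2f}")        # 0.70
print(f"Precision (bad reviews) -> {caught / (caught + false_alarms):.2f}")  # 0.61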
# Get recall (sensitivity) of bad reviews
report = classification_report(y_test, y_predicted,
                               target_names=["bad reviews", "good reviews"],
                               output_dict=True)
display(Markdown(f'Recall (sensitivity) for bad reviews -> **{report["bad reviews"]["recall"]}**'))
Recall (sensitivity) for bad reviews -> 0.7
Conclusions#
In this project, reviews from a bar-restaurant were web-scraped from Google Maps, and a sentiment analysis model for texts was built.
However, a machine learning project like this one normally requires iterative testing with different algorithms (there are plenty of them in the natural language processing area) and hyperparameter tuning until the desired performance is achieved. That was not the aim of this little project, though, which was just meant as a first approach to the subject.
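As a sketch of what one such iteration could look like (hypothetical parameter values, scoring on the recall of the negative class; not something I actually ran for this post):
# Hypothetical next step: tune the regularisation strength, optimising recall of bad reviews
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
recall_bad = make_scorer(recall_score, pos_label=0)  # recall of the "bad review" class
param_grid = {"C": [0.01, 0.1, 1, 10]}               # hypothetical values to try
search = GridSearchCV(LogisticRegression(class_weight="balanced", max_iter=1000),
                      param_grid, scoring=recall_bad, cv=5)
search.fit(X_train_std, y_train)
print(search.best_params_)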
Also, and most importantly, a next step would be to look at the business problem and decide whether deploying this machine learning tool into production actually pays off.