An investigation of articles from the New York Times and the University of California Libraries, examining the emotional tone of blue carbon coverage and, specifically, how that tone shapes perceptions of environmental policies and government initiatives.
Sentiment analysis is a tool for assessing the mood of a piece of text. For example, we can use it to understand public perceptions of topics in environmental policy such as energy, climate, and conservation. In this post, we will access the Nexis Uni database through the UCSB library (https://www.library.ucsb.edu/research/db/211) and then choose one or more key search terms to define a set of articles.
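Before diving in, here is a minimal sketch of how lexicon-based sentiment scoring works, using the tidytext package on a made-up sentence (the sentence is illustrative only):

Code
library(dplyr)
library(tidytext)

# Tokenize a toy sentence and keep only the words that appear in the Bing lexicon
tibble(text = "Restoring mangroves is a great success despite funding problems") |>
  unnest_tokens(word, text) |>
  inner_join(get_sentiments("bing"), by = "word")
# "great" and "success" should match as positive, "problems" as negative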
In this lab, I will be exploring the terms ‘blue carbon’ and ‘mangrove’. I am particularly interested in how positive and negative sentiment relate to political and government initiatives in the environment.
Here is a link that takes you to the related articles in the University of California database.
Use your search term(s) along with appropriate filters to obtain and download a batch of at least 100 full-text search results (.docx).
Part 1: Load the data
Read your Nexis article document into RStudio.
Now we will use the full text of the articles for the analysis. First clean any artifacts of the data collection process (hint: this type of thing should be removed: “Apr 04, 2022 (Biofuels Digest: http://www.biofuelsdigest.com/ Delivered by Newstex”))
The text in this particular batch does not appear to be affected by any collection artifacts, so I'm going to leave the tibbles as they come.
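For completeness, here is a sketch of the loading step with the LexisNexisTools package; the file name is a placeholder for the batch you downloaded, and the commented-out cleaning pattern is only an illustration of how such an artifact could be stripped if it did appear:

Code
library(LexisNexisTools)
library(stringr)

# Read the Nexis Uni .docx export into an S4 object with @meta, @articles,
# and @paragraphs slots ("blue_carbon_articles.docx" is a placeholder name)
text_dat <- lnt_read("blue_carbon_articles.docx")

# If delivery artifacts like
# "Apr 04, 2022 (Biofuels Digest: http://www.biofuelsdigest.com/ Delivered by Newstex"
# had shown up, a pattern along these lines could remove them:
# text_dat@articles$Article <- str_remove(
#   text_dat@articles$Article,
#   "\\w{3} \\d{2}, \\d{4} \\(.*?Delivered by Newstex\\)?"
# )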
Part 2: Explore and Clean
Let's explore the data a bit and replicate the analyses presented in class (Lab 2).
Code
# The @ operator indexes slots of the S4 object returned by the Nexis reader
meta_df <- text_dat@meta
articles_df <- text_dat@articles
paragraphs_df <- text_dat@paragraphs

dat2 <- tibble(Date = meta_df$Date,
               Headline = meta_df$Headline,
               id = text_dat@articles$ID,
               text = text_dat@articles$Article)
Part 3: Score
Next, we will calculate the average sentiment score of the words in the filtered articles, using the `stop_words` dataset and the Bing sentiment lexicon.
Code
mean(sentiment_words$sent_num)
[1] 0.422933
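For reference, `sentiment_words` could be assembled from `dat2` along these lines (a sketch; the ±1 `sent_num` coding is an assumption consistent with the word-level average above):

Code
library(dplyr)
library(tidytext)

# One row per non-stop word that carries a Bing polarity tag
sentiment_words <- dat2 |>
  unnest_tokens(word, text) |>                        # tokenize article text
  anti_join(stop_words, by = "word") |>               # drop stop words
  inner_join(get_sentiments("bing"), by = "word") |>  # keep scored words only
  mutate(sent_num = if_else(sentiment == "positive", 1, -1))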
Part 4: Statistical Analysis
Now let's calculate the mean sentiment (by word polarity) across articles.
This analysis shows that the average sentiment score per article (9.63) is considerably higher than the per-word average (0.43), likely because each article's score aggregates the polarities of many words.
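Here is a sketch of how the article-level table used in the next part could be computed; the column layout (positive counts above zero, negative counts stored as negative values) is an assumption inferred from the plotting code that follows:

Code
# Net sentiment per article, plus signed positive/negative counts for plotting
sentiment_article <- sentiment_words |>
  group_by(id) |>
  summarise(positive = sum(sent_num == 1),     # count of positive words
            negative = -sum(sent_num == -1),   # negative count, plotted below zero
            sent_sum = sum(sent_num))          # net article score

mean(sentiment_article$sent_sum)  # article-level average (the 9.63 above)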
Part 5: Examination
Now, let’s examine the sentiment distribution of the articles with a plot that visualizes the proportion of positive and negative articles.
Code
ggplot(sentiment_article, aes(x = id)) +
  geom_col(aes(y = positive, fill = "Positive"), alpha = 0.8) +
  geom_col(aes(y = negative, fill = "Negative"), alpha = 0.8) +
  labs(title = "Sentiment Analysis",
       subtitle = "Positive vs. negative articles",
       caption = "Articles related words: Blue Carbon/ Mangrove",
       y = "Sentiment Score",
       x = "ID of the Article",
       fill = "Sentiment") +
  scale_fill_manual(values = c("darkorange", "darkblue"),
                    labels = c("Negative", "Positive"),
                    name = "Sentiment")
Part 6: Plots on emotions
Now, let's look at the NRC contribution-to-emotion-by-word plots, and identify and reclassify or remove at least one term that gives misleading results in this context.
Code
# Create a new data frame with the sentiments per word
word_sentiment <- get_sentiments("nrc") |>
  filter(word != "blue")  # Remove "blue": in "blue carbon" it is a color term, so its NRC tags are misleading
Joining with `by = join_by(word)`
Warning in inner_join(group_by(anti_join(text_words, stop_words, by = "word"), : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 8 of `x` matches multiple rows in `y`.
ℹ Row 9845 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Part 7: Plots over time
Plot the number of NRC emotion words as a percentage of all the emotion words used each day (aggregating text from articles published on the same day).
How does the distribution of emotion words change over time? Can you think of any reason this would be the case? Plot all the emotion words from the articles: what percentage does each emotion account for, and how does that change over time?
Fun Fact: The NRC lexicon was created by selecting 10,000 words from a thesaurus and asking a set of five questions about each word to reveal associated emotions and polarity. Over 50,000 questions were put to more than 2,000 people on Amazon’s Mechanical Turk website, resulting in a comprehensive word-emotion lexicon covering over 10,000 words. The Turkers were paid 4 cents for each properly answered set of questions.
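The chunks below all rely on a `words_time` table that tags each dated word occurrence with both its Bing polarity (`bing_num`) and its NRC emotion (`nrc`). A minimal sketch of how it could be assembled, assuming `text_words` carries a `date` column from the article metadata (the column names are inferred from the chunks that follow; the author's original hidden chunk triggered the many-to-many warning shown below, which the explicit `relationship` argument would silence):

Code
# One row per (date, word, emotion); a word tagged with k emotions yields k rows
words_time <- text_words |>
  anti_join(stop_words, by = "word") |>
  inner_join(get_sentiments("bing"), by = "word") |>
  rename(bing = sentiment) |>
  mutate(bing_num = if_else(bing == "positive", 1, -1)) |>
  inner_join(word_sentiment, by = "word", relationship = "many-to-many") |>
  rename(nrc = sentiment) |>
  select(date, word, bing_num, nrc)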
Warning in inner_join(mutate(rename(inner_join(anti_join(mutate(select(text_words, : Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 7 of `x` matches multiple rows in `y`.
ℹ Row 9845 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
"many-to-many"` to silence this warning.
Code
words_time |>
  filter(date >= as.Date("2021-01-01")) |>
  group_by(date) |>
  summarise(bing_num = sum(bing_num)) |>
  ggplot(aes(x = date, y = bing_num)) +
  geom_line(color = "#6C8EBF", linewidth = 1.2) +
  geom_hline(yintercept = 0, color = "gray10", linewidth = 0.5, alpha = 0.6) +
  geom_smooth(method = lm, se = FALSE, color = "#E47E72") +
  labs(x = "Date", y = "Sentiment score",
       title = "Sentiment Analysis over Time",
       subtitle = "Bing lexicon classification",
       caption = "Source: library.ucsb") +
  theme_classic() +
  theme(plot.title = element_text(face = "bold", size = 18),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 14),
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1)) +
  scale_x_date(date_labels = "%b %Y",
               breaks = seq(as.Date("2021-01-01"), as.Date("2023-04-16"), by = "1 month"))
`geom_smooth()` using formula = 'y ~ x'
The plot above shows the overall sentiment per day, based on the Bing lexicon classification. The score for each day is the sum of the scores of the individual words analyzed that day, and the resulting line shows the trend in sentiment over time. The x-axis covers January 1st, 2021 to April 16th, 2023, and the y-axis shows the sentiment score. The plot suggests fluctuations in sentiment over time, with both positive and negative stretches during the period analyzed.
Code
words_time |>
  filter(nrc %in% c("anticipation", "joy", "trust", "sadness",
                    "anger", "disgust", "fear", "surprise")) |>
  group_by(date, nrc) |>
  summarise(word_appearance = n()) |>
  ungroup() |>
  group_by(date) |>
  mutate(tot_appearance = sum(word_appearance)) |>
  mutate(proportion = word_appearance / tot_appearance) |>
  ggplot(aes(x = date, y = proportion, color = nrc)) +
  geom_line(alpha = 0.2) +
  geom_smooth(method = "loess", se = FALSE, span = 0.5, alpha = 0.4) +
  labs(x = "Date", y = "Sentiment Proportion", color = "NRC Sentiment",
       title = "Sentiment Proportion over Time",
       subtitle = "NRC word classification",
       caption = "Source: library.ucsb")
`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.
`geom_smooth()` using formula = 'y ~ x'
The plot above displays the distribution of NRC sentiments over time, enabling an examination of potential correlations. To enhance clarity, I removed the positive and negative categories and used the geom_smooth() function to create trend lines that are easier to interpret. This is particularly useful given the recent high influx of articles related to blue carbon, which can make the raw line graphs difficult to decipher; those lines are kept transparent (alpha = 0.2) in the background so that the trends remain visible. Two significant trends become apparent: surprise is consistently low, while trust and sadness are consistently high.
Below, you can see the total distribution of sentiments across our database.
Code
# Compute the proportion of each category
nrc_proportions <- words_time %>%
  count(nrc) %>%
  mutate(proportion = n / sum(n))

nrc_proportions %>%
  mutate(proportion = sprintf("%.2f%%", proportion * 100)) %>% # Multiply by 100 and format as a percentage
  kable(format = "html", align = "c") %>%
  kable_styling(full_width = FALSE, bootstrap_options = "striped",
                position = "center", font_size = 16)
| nrc          |   n | proportion |
|:-------------|----:|-----------:|
| anger        | 153 |      5.28% |
| anticipation | 235 |      8.11% |
| disgust      | 104 |      3.59% |
| fear         | 188 |      6.49% |
| joy          | 264 |      9.11% |
| negative     | 514 |     17.74% |
| positive     | 780 |     26.92% |
| sadness      | 241 |      8.32% |
| surprise     |  69 |      2.38% |
| trust        | 349 |     12.05% |
Code
words_time %>%
  group_by(date, nrc) %>%
  summarise(appearance = n()) %>%
  group_by(nrc) %>%
  mutate(cum_appearance = cumsum(appearance)) %>%
  ungroup() %>%
  arrange(desc(cum_appearance)) %>%
  ggplot(aes(x = date, y = cum_appearance, color = nrc)) +
  geom_line(alpha = 0.8) +
  scale_color_manual(values = colors) + # `colors` is a palette assumed to be defined in an earlier chunk
  labs(x = "Date", y = "Cumulative Appearances", color = "NRC Sentiment",
       title = "Sentiment Cumulative Analysis over Time",
       subtitle = "NRC classification",
       caption = "Source: library.ucsb")
`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.
In the graph above, I calculated a cumulative score for each sentiment to gain insight into the overall trends in the words and tone used in the articles. This plot also highlights the high influx of articles in recent years. As anticipated, the positive and negative sentiments have the highest counts, with the positive line surpassing the negative line by a significant margin. There are also noteworthy upswings in trust, joy, and sadness, which have seen increased usage in recent times. Conversely, surprise and disgust are used least frequently, making them the least common sentiments.
Citation
BibTeX citation:
@online{patrón2023,
author = {Patrón, Javier},
title = {Sentiment {Analysis} of {Blue} {Carbon} {Discourse}},
date = {2023-04-18},
url = {https://github.com/javipatron},
langid = {en}
}