I recently asked a question about entries that are omitted after a sentiment analysis. The tweets that I analyse don't always contain words that are in the lexicon. I would like to know which ones can't be scored, so I want to keep those rows even if zero words matched. In my previous question, the drop parameter was given as a solution. However, I think I might be doing it wrong or missing something. This is my first time working with these techniques.
The following function takes a data frame and returns a new one containing the number of positive and negative words along with the sentiment score.
The input (with one text deliberately in Dutch so it can't be scored):
id <- c(1, 2, 3)
date <- c("12-05-2021", "12-06-2021", "12-07-2021")
text <- c("Dit is tekst in het Nederlands", "I,m so happy that websites like this exsist", "This icecream tastes terrible. It made me upset")
df <- data.frame(id, date, text)
What I want as output is:
sentiment positive negative
        0        0        0
        2        2        0
       -2        0        2
But my function gives me something else:
library(tidyverse)  # for tibble, dplyr, tidyr
library(tidytext)   # for unnest_tokens() and get_sentiments()

sentimentAnalysis <- function(tweetData){
  sentimentDataframe <- data.frame()
  for(row in 1:nrow(tweetData)){
    tekst <- as.character(tweetData[row, "text"])
    positive <- 0
    negative <- 0
    tokens <- tibble(text = tekst) %>% unnest_tokens(word, text, drop = FALSE)
    sentiment <- tokens %>%
      inner_join(get_sentiments("bing")) %>%
      count(sentiment) %>%
      spread(sentiment, n, fill = 0) %>%
      mutate(sentiment = positive - negative)
    sentimentDataframe <- bind_rows(sentimentDataframe, sentiment)
  }
  sentimentDataframe[is.na(sentimentDataframe)] <- 0
  return(sentimentDataframe)
}
This still returns a data frame with the unscored texts missing. As you can see, the first text is omitted:
sentiment positive negative
        2        2        0
       -2        0        2
If the join returns no rows, you can return a tibble of all-zero values; we can use an if condition to check for this.
When a sentence has only positive or only negative sentiment, complete() creates a row for the opposite sentiment and assigns it the value 0. I also replaced spread() with pivot_wider(), since spread() is now superseded.
library(tidyverse)
library(tidytext)
map_df(df$text, ~ {
  tibble(text = .x) %>%
    unnest_tokens(word, text, drop = FALSE) %>%
    inner_join(get_sentiments("bing")) -> tmp

  if(nrow(tmp) == 0) tibble(sentiment = 0, positive = 0, negative = 0)
  else {
    tmp %>%
      count(sentiment) %>%
      complete(sentiment = c('positive', 'negative'), fill = list(n = 0)) %>%
      pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
      mutate(sentiment = positive - negative)
  }
}) -> res

res
res
#   sentiment positive negative
#       <dbl>    <dbl>    <dbl>
#1          0        0        0
#2          2        2        0
#3         -2        0        2
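To see why the complete() step matters, here is a minimal sketch using a hypothetical one-sentiment count table (not part of the original question):

library(tidyr)
# A sentence with only positive words produces a one-row count table:
only_pos <- tibble::tibble(sentiment = "positive", n = 2)
# Without complete(), pivot_wider() would yield only a "positive" column,
# and mutate(sentiment = positive - negative) would then fail.
only_pos %>%
  complete(sentiment = c("positive", "negative"), fill = list(n = 0)) %>%
  pivot_wider(names_from = sentiment, values_from = n)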
I have a fairly large dataset and I am running a for loop to remove one line per transect and calculate the frequency of each category. I am now trying to make it so that instead of one line per transect it removes a whole transect every iteration. Is it possible to do this?
Here is a sample dataset with the same columns I have:
Transect<- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)
Category<- c("S","S","S","C","T","S","SP","T", "C", "T","S","SP","T","S","C")
dat<- data.frame(Transect,Category)
The current code below removes one line per transect. How could I change it so that it randomly deletes a whole transect each iteration (i.e. in the first iteration all of transect 3 is removed, and in the second all of transect 1)?
for (q in 1:2) {
  for (i in 0:5) {
    #if (i>0)
    df <- dat %>%   # the sample data above is `dat`
      group_by(Transect) %>%
      sample_n(n() - i, replace = TRUE) %>%
      ungroup()
    c <- df %>%
      group_by(Category) %>%
      summarise(n = n()) %>%
      mutate(freq = n / sum(n),
             total = 55 - i)
    if (i == 0) {
      tot_1 = c
    } else {
      tot_1 = bind_rows(tot_1, c)
    }
  }
  tot_1$rep = q
  if (q == 1) {
    dftot = tot_1
  } else {
    dftot = bind_rows(dftot, tot_1)
  }
}
It seems your goal is to iteratively assess increasingly small subsamples of your data to gauge how much representation of the whole is lost. This code tries dropping a random 1, then 2, then 3... transects and reports the distribution of categories. The last few lines normalize count to fraction of total for easy comparison between iterations.
Note I used set.seed(), because the code would otherwise return a different result each time due to random sampling.
To break down this answer a bit:
It's important that Category is a factor so that table() won't drop any Category values that have no count in a particular iteration. Without that, the code would run to a point, but the row-binding operation within map_dfr() would fail.
First I just enumerate the numbers of Transect to leave out (0:3 in this example) using 0:length(unique(d$Transect)). I included 0 so that we can see what it looks like with the full dataset.
I used set_names() so that it becomes a named vector. This allows us to use .id inside map_dfr() so that we get an extra column which stores the value of the leaveout.
purrr::map_dfr() will iteratively apply a function over some list. In this case I piped in the list of leaveout values (which we just named) and the function we apply is given as an rlang-style lambda function which begins with ~ and operates on the argument .x.
Working from the inside of the filter operation, this function first randomly samples .x values of Transect to exclude, and then removes the rows with those values of Transect. Here we use %in% and negate the whole result with ! at the beginning.
Then we just use dplyr::pull() to take the Category column as a vector and run table() on it to tabulate the occurrence of each value.
The rest just calculates the total count for each iteration and then divides the values by that to get a fraction.
library(tidyverse)

d <- tibble(
  Transect = as.character(c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3)),
  Category = factor(c("S","S","S","C","T","S","SP","T","C","T","S","SP","T","S","C"))
)

set.seed(1)

0:length(unique(d$Transect)) %>%
  set_names() %>%
  map_dfr(~ d %>%
            filter(!Transect %in% sample(unique(d$Transect), size = .x)) %>%
            pull(Category) %>%
            table(),
          .id = "leaveout_transects") %>%
  rowwise() %>%
  mutate(total_count = sum(c_across(-1)), .after = 1) %>%
  mutate(across(-c(1:2), ~ .x / total_count))
#> # A tibble: 4 × 6
#> # Rowwise:
#>   leaveout_transects total_count C       S       SP        T
#>   <chr>                    <int> <table> <table> <table>   <table>
#> 1 0                           15 0.2     0.4     0.1333333 0.2666667
#> 2 1                           10 0.2     0.3     0.2000000 0.3000000
#> 3 2                            5 0.2     0.2     0.2000000 0.4000000
#> 4 3                            0 NaN     NaN     NaN       NaN
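As a minimal illustration of the factor point above (a toy vector, not part of the data):

# table() on a factor keeps zero-count levels, so every iteration
# returns the same set of columns for map_dfr() to row-bind:
table(factor(c("S", "T"), levels = c("C", "S", "SP", "T")))
#>  C  S SP  T
#>  0  1  0  1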
It would probably be more rigorous to simulate each leaveout condition multiple times and look at the distribution of performance you get at each value to assess what's likely to happen in the future with a given subsample.
Base R has the built-in function replicate(), which is great for this purpose. Here I'm just using the code above with replicate() and then reformatting the data a bit to graph it.
# use replicate to make many simulations
n_reps <- 20

replicate(
  n_reps,
  0:length(unique(d$Transect)) %>%
    set_names() %>%
    map_dfr(~ d %>%
              filter(!Transect %in% sample(unique(d$Transect), size = .x)) %>%
              pull(Category) %>%
              table(),
            .id = "leaveout_transects") %>%
    rowwise() %>%
    mutate(total_count = sum(c_across(-1)), .after = 1) %>%
    mutate(across(-c(1:2), ~ .x / total_count)) %>%
    select(3:6) %>%
    t() %>%
    cor() %>%
    .[, 1]
) %>%
  as_tibble(.name_repair = "unique") %>%
  mutate("leavout_transects" = factor(0:length(unique(d$Transect)))) %>%
  pivot_longer(-leavout_transects, values_to = "correlation") %>%
  select(-name) %>%
  ggplot(aes(leavout_transects, correlation)) +
  geom_boxplot()
Created on 2022-09-22 by the reprex package (v2.0.1)
I'm very much a novice, so apologies if my question is trivial.
I am trying to do sentiment analysis on some Twitter data I downloaded but am having trouble. I am trying to follow an example that creates a bar plot showing positive/negative sentiment counts. The code for the example is:
original_books %>%
  unnest_tokens(output = word, input = text) %>%
  inner_join(get_sentiments("bing")) %>%
  count(book, index, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n) %>%
  mutate(sent_score = positive - negative) %>%
  ggplot() +
  geom_col(aes(x = index, y = sent_score, fill = book),
           show.legend = F) +
  facet_wrap(~book, scales = "free_x")
Here is the code I have so far for my own analysis:
#twitter scraping
ref <- search_tweets(
  "#refugee", n = 18000, include_rts = FALSE, lang = "en"
)

data(stop_words)

new_stops <- tibble(word = c("https", "t.co", "1", "refugee", "#refugee", "amp", "refugees",
                             "day", "2022", "dont", "0", "2", "#refugees", "4", "2021"),
                    lexicon = "sabs")

full_stop <- stop_words %>%
  bind_rows(new_stops)  # bind_rows adds more rows (a way to merge data)
Now I want to make a bar graph similar to the one above but I get an error because I don't have a column called "index." I tried making one but it didn't work. Here is the code I am trying to use:
ref %>%
  unnest_tokens(word, text, token = "tweets") %>%
  anti_join(full_stop) %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, index, sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n) %>%
  mutate(sent_score = positive - negative) %>%
  ggplot() +  # plot the overall sentiment (pos - neg) versus index
  geom_col(aes(x = index, y = sent_score), show.legend = F)
Here is the error I get (screenshot omitted).
Any suggestions are really appreciated! Thank you
Contents of ref (screenshot omitted).
In the example, index just refers to a group of lines from the book, in order (i.e., 1, 2, 3...). It's a way to group the text -- you could think of it like a page, which would also be in numerical order. The text just needs to be split up into some kind of groups in order to compute the sentiment within each group.

Tweets are natural groups of words, and you want to compute the sentiment within a single tweet -- you don't need to split it up further. In the example, the figure has a bar for each "page" of the book; you'll have a bar for each tweet. You need to assign the tweets consecutive numbers because they don't have a natural order. I did that below using rowid_to_column(), and I named the new column "tweet". It just contains the row numbers of the tweets, so once the ref dataframe is split up by word, each word is still tied back to the original tweet it belonged to by that number.
Note that many tweets don't have enough words with an associated sentiment to even calculate their sentiment score, so I then re-assigned a consecutive number to those that did -- this one is called "index".
I also added the argument values_fill = 0 to the pivot_wider() line because tweets with only positive (or negative) sentiment were not getting included, because the other value was NA instead of 0.
Along the way there are a couple of places where I just stop and look at the data -- this is really helpful in understanding errors.
library(tidyverse)
library(rtweet)
library(tidytext)

#twitter scraping
ref <- search_tweets(
  "#refugee", n = 18000, include_rts = FALSE, lang = "en"
)

data(stop_words)

new_stops <- tibble(word = c("https", "t.co", "1", "refugee", "#refugee", "amp", "refugees",
                             "day", "2022", "dont", "0", "2", "#refugees", "4", "2021"),
                    lexicon = "sabs")

full_stop <- stop_words %>%
  bind_rows(new_stops)  # bind_rows adds more rows (a way to merge data)

ref_w_sentiments <- ref %>%
  rowid_to_column("tweet") %>%
  unnest_tokens(word, text, token = "tweets") %>%
  anti_join(full_stop) %>%
  inner_join(get_sentiments("bing"))

# look at what the data looks like
select(ref_w_sentiments, tweet, word, sentiment)
#> # A tibble: 811 × 3
#>   tweet word      sentiment
#>   <int> <chr>     <chr>
#> 1     2 helping   positive
#> 2     3 inspiring positive
#> 3     4 support   positive
ref_w_scores <- ref_w_sentiments %>%
  group_by(tweet) %>%
  count(sentiment) %>%
  pivot_wider(names_from = sentiment,
              values_from = n, values_fill = 0) %>%
  mutate(sent_score = positive - negative) %>%
  # not all tweets were scored, so create a new index
  rowid_to_column("index")
# look at the data again
ref_w_scores
#> # A tibble: 418 × 5
#> # Groups:   tweet [418]
#>   index tweet positive negative sent_score
#>   <int> <int>    <int>    <int>      <int>
#> 1     1     2        1        0          1
#> 2     2     3        1        0          1
#> 3     3     4        1        0          1
ggplot(ref_w_scores) +  # plot the overall sentiment (pos - neg) versus index
  geom_col(aes(x = index, y = sent_score), show.legend = F)
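A minimal sketch of the values_fill point, using hypothetical counts rather than real tweets:

library(tidyr)
d <- tibble::tibble(tweet     = c(1, 1, 2),
                    sentiment = c("positive", "negative", "positive"),
                    n         = c(2, 1, 3))
# Without values_fill, tweet 2 gets NA under "negative" ...
pivot_wider(d, names_from = sentiment, values_from = n)
# ... and with values_fill = 0 it gets 0, so positive - negative works:
pivot_wider(d, names_from = sentiment, values_from = n, values_fill = 0)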
I have the following problem: I have a dataframe with 3 columns: line number, date, and a single word. I am trying to perform text analysis on GitHub commit comments using the https://www.tidytextmining.com/ method. I would like to have my aggregate sentiment score on a quarterly basis rather than by the number of comments, which I did with count(index = line %/% 10, sentiment). Is there an easy way to count all my "sentiment scores" by quarter?
Many thanks for any suggestions.
single_word_with_date$date <- substr(single_word_with_date$date, 1, nchar(single_word_with_date$date) - 10)
single_word_with_date$date <- as.Date(single_word_with_date$date, format = "%Y-%m-%d")

comment_sentiments_with_date <- single_word_with_date %>%
  inner_join(get_sentiments("bing")) %>%
  count(index = date %/% month(date), sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(sentiment = positive - negative)
This is the dataframe: line is the comment number (e.g. the comment on line 2 had several words in it), date is a datetime, and word is a string.
> head(single_word_with_date)
line date word
1 1 2011-11-16 love
2 2 2012-04-13 random
2.1 2 2012-04-13 question
2.8 2 2012-04-13 answered
2.14 2 2012-04-13 darwin
2.19 2 2012-04-13 purpose
Try this:
library(tidytext)
library(dplyr)
library(tidyr)
single_word_with_date %>%
  inner_join(get_sentiments("bing")) %>%
  # count words by year-quarter (e.g. "2012-Q2") and sentiment; the count()
  # step creates the n column that pivot_wider() spreads
  count(quarter = paste(format(date, '%Y'), quarters(date), sep = '-'), sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative) -> result
result
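If you prefer, the same grouping can be written with lubridate; a sketch assuming the lubridate package is available (quarter() with with_year = TRUE returns values like 2012.2):

library(lubridate)
single_word_with_date %>%
  inner_join(get_sentiments("bing")) %>%
  count(quarter = quarter(date, with_year = TRUE), sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)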
I'm trying to count results from one dataset that I imported into R and display those counts in a separate dataset, created within R, with a row for each unique Player.
Here is what a simplified version of the dataset looks like with only the relevant columns:
Label <- c("Raul", "Raul", "Raul", "Eric", "Eric", "Eric", "Aaron", "Aaron", "Aaron")
Result <- c("s", "b", "fo", "s", "f", "b", "ss", "go", "s")
df2 <- data.frame(Label, Result)
My data was compiled in Excel and exported as a CSV with about 4000 more rows of similar results and about 45 unique "Labels", but this smaller example shows you what the df looks like. Here is an example of what I want to end up with (line breaks to keep the rows separate):
Raul, count(s), count(b), count(fo), etc
Eric, count(s), count(b), count(fo), etc
Aaron, count(s), count(b), count(fo), etc
So that each unique "Label" for the players is on the row and the columns are the count of each type of Result. It should give me 45 rows, one for each of the unique players in my dataset.
I've been able to get the unique Player Labels just fine by running this:
dfstat <- data.frame(unique(df2$Label))
The problem comes when I try to get the counts for each type of result. I've tried a variety of things, like:
dfstat <- dfstat %>%
  mutate(Strikes = count(subset(df2, Label = unique.df2.label & Result == "s")))

But I get this error: Error: Column `Strikes` is of unsupported class data.frame
And
df34$Strikes <- count(subset(df2, Label = unique.df2.label & Result == "s"))
Gives me this error: Error in `$<-.data.frame`(`*tmp*`, Strikes, value = list(n = 9L)) : replacement has 1 row, data has 3
I'm doing something similar to be a part of a Shiny App and got that to work no problem, but that's because I was able to subset for my input value of a single Player. But I'm having trouble with getting this count data for ALL the unique players in my dataset into another dataset within R.
I appreciate any help with this issue because I'd really rather not manually type in all my different count formulas for every unique player. Thank you!
You can use table to count the frequencies for each Player.
table(df2)
# Result
#Label b f fo go s ss
# Aaron 0 0 0 1 1 1
# Eric 1 1 0 0 1 0
# Raul 1 0 1 0 1 0
If there are other columns in the data you can specify the columns whose frequency you want to count.
table(df2$Label, df2$Result)
A tidyverse approach would be:
library(dplyr)
library(tidyr)
df2 %>%
  count(Label, Result) %>%
  pivot_wider(names_from = Result, values_from = n, values_fill = 0)
We could group by 'Label' and get the number of 's' elements by taking the sum of a logical expression:
library(dplyr)
df2 %>%
  group_by(Label) %>%
  summarise(n = sum(Result == 's'))
Or to get the frequency of both column elements
count(df2, Label, Result)
If we need all the combinations, then do a complete before getting the count
library(tidyr)
df2 %>%
  mutate(n = 1) %>%
  complete(Label, Result, fill = list(n = 0)) %>%
  group_by(Label, Result) %>%
  summarise(n = sum(n))
NOTE: count() expects a data.frame/tibble as input, so it won't work within mutate(), where it receives a vector as input.
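A quick sketch of that NOTE:

# This fails inside mutate() because count() receives the Result vector,
# not a data frame:
# df2 %>% mutate(n = count(Result))
# If a per-row count column is wanted, add_count() works instead:
df2 %>% add_count(Label, Result)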
You could do a tapply followed by an rbind making sure that stats that are missing are given a count of 0.
res <- tapply(df2$Result, df2$Label, function(x) {
  x <- table(x)
  x[setdiff(unique(df2$Result), names(x))] <- 0
  return(x[order(names(x))])
})
Then we can take this list of counts and rbind it
res <- do.call(rbind, res)
Your players will now be rownames
dfstat <- data.frame(label = row.names(res), res)
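For the sample df2 above this gives (spacing approximate):

dfstat
#       label b f fo go s ss
# Aaron Aaron 0 0  0  1 1  1
# Eric   Eric 1 1  0  0 1  0
# Raul   Raul 1 0  1  0 1  0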
I was wondering if there was a way to create multiple columns from a list in R using the mutate() function within a for loop.
Here is an example of what I mean:
The Problem:
I have a data frame df that has 2 columns: category and rating. I want to add a column for every element of df$category and in that column, I want a 1 if the category column matches the iterator.
library(dplyr)
df <- tibble(
  category = c("Art","Technology","Finance"),
  rating = c(100,95,50)
)
Doing it manually, I could do:
df <- df %>%
  mutate(art = ifelse(category == "Art", 1, 0))
However, what happens when I have 50 categories? (Which is close to what I have in my original problem. That would take a lot of time!)
What I tried:
category_names <- df$category

for(name in category_names){
  df <- df %>%
    mutate(name = ifelse(category == name, 1, 0))
}
Unfortunately, it doesn't seem to work.
I'd appreciate any light on the subject!
Full Code:
library(dplyr)

# Creates tibble
df <- tibble(
  category = c("Art","Technology","Finance"),
  rating = c(100,95,50)
)

# Showcases the operation I would like to loop over df
df <- df %>%
  mutate(art = ifelse(category == "Art", 1, 0))

# Creates a variable for clarity
category_names <- df$category

# For loop I tried
for(name in category_names){
  df <- df %>%
    mutate(name = ifelse(category == name, 1, 0))
}
I am aware that what I am essentially doing is a form of model.matrix(); however, before I found out about that function I was still perplexed why what I was doing before wasn't working.
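For reference, a minimal sketch of the model.matrix() route mentioned above (base R; the sub() call is just cosmetic renaming):

# "- 1" drops the intercept so every category gets its own 0/1 column
mm <- model.matrix(~ category - 1, data = df)
colnames(mm) <- sub("^category", "", colnames(mm))  # strip the "category" prefix
cbind(as.data.frame(df), mm)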
We can use pivot_wider after creating a sequence column
library(dplyr)
library(tidyr)
df %>%
  mutate(rn = row_number(), n = 1) %>%
  pivot_wider(names_from = category, values_from = n,
              values_fill = list(n = 0)) %>%
  select(-rn)
# A tibble: 3 x 4
#   rating   Art Technology Finance
#    <dbl> <dbl>      <dbl>   <dbl>
# 1    100     1          0       0
# 2     95     0          1       0
# 3     50     0          0       1
Or another option is map
library(purrr)
map_dfc(unique(df$category), ~ df %>%
          transmute(!! .x := +(category == .x))) %>%
  bind_cols(df, .)
# A tibble: 3 x 5
#   category   rating   Art Technology Finance
# * <chr>       <dbl> <int>      <int>   <int>
# 1 Art           100     1          0       0
# 2 Technology     95     0          1       0
# 3 Finance        50     0          0       1
If we need a for loop
for(name in category_names) df <- df %>% mutate(!! name := +(category == name))
Or in base R with table
cbind(df, as.data.frame.matrix(table(seq_len(nrow(df)), df$category)))
#     category rating Art Finance Technology
# 1        Art    100   1       0          0
# 2 Technology     95   0       0          1
# 3    Finance     50   0       1          0
Wanted to throw something in for anyone who stumbles across this question. The problem in the OP is that the "name" column name gets re-used during each iteration of the loop: you end up with only one new column, when you really wanted three (or 50).

I consistently find myself wanting to create multiple new columns within loops, and I recently found out that mutate() can now take "glue"-like inputs to do this. The following code now also solves the original question:
for(name in category_names){
  df <- df %>%
    mutate("{name}" := ifelse(category == name, 1, 0))
}
This is equivalent to akrun's answer using a for loop, but it doesn't involve the !! operator. Note that you still need the "walrus" := operator, and that the column name needs to be a string (I think since it's using "glue" in the background). I'm thinking some people might find this format easier to understand.
Reference: https://www.tidyverse.org/blog/2020/02/glue-strings-and-tidy-eval/