Wrong values in a column of a dataframe with R

I have had this problem for three days now, and I very much hope someone can help me find a solution.
To do a sentiment analysis of a text, I store in a dataframe a list of words and their positive and negative polarities:
word positive.polarity negative.polarity
1 interesting 1 0
2 boring 0 1
Then, for each of those words in the dataframe, I would like to know whether their context (the context is the set of 3 words preceding the word) contains a booster word or a negation word:
-booster_words <- c("more","enough", "a lot", "as", "so")
-negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")
I would like to create a new column positive.ponderate.polarity which contains the positive polarity value + 4 if there is both a booster and a negative word in the context, and the positive polarity value + 9 if there is only a booster word in the context (no negative word in the context).
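To make the intended rule concrete, here is a minimal sketch of the weighting for a single word and its context (the helper function `ponderate` and its arguments are illustrative, not part of my code):

```r
booster_words <- c("more", "enough", "a lot", "as", "so")
negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")

# Illustrative helper: weight one word's positive polarity by its context words
ponderate <- function(pos_polarity, context_words) {
  has_booster  <- any(context_words %in% booster_words)
  has_negation <- any(context_words %in% negative_words)
  if (has_booster && has_negation) {
    pos_polarity + 4   # booster AND negation in the context
  } else if (has_booster) {
    pos_polarity + 9   # booster only
  } else {
    pos_polarity       # context changes nothing
  }
}
```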
Here is the code :
calcPolarity <- function(sentiment_DF, sentences){
  booster_words <- c("more", "enough", "a lot", "as", "so")
  negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")
  reduce_words <- c("peu", "presque", "moins", "seulement")
  # pre-allocate the polarity result vector with size = number of sentences
  polarity <- rep.int(0, length(sentences))
  # loop per sentence
  for(i in 1:length(polarity)){
    sentence <- sentences[i]
    # separate each sentence in words using regular expression
    wordsOfASentence <- unlist(regmatches(sentence, gregexpr("[[:word:]]+", sentence, perl = TRUE)))
    # get the rows of sentiment_DF corresponding to the words in the sentence using match
    # N.B. if a word occurs twice, there will be two equal rows
    # (but I think it's correct since in this way you count its polarity twice)
    subDF <- sentiment_DF[match(wordsOfASentence, sentiment_DF$word, nomatch = 0), ]
    # Find (number) of matching word.
    wordOfInterest <- wordsOfASentence[which(wordsOfASentence %in% levels(sentiment_DF$word))] # No multigrepl, so working with duplicates instead. e.g. interesting
    regexOfInterest <- paste0("([^\\s]+\\s){0,3}", wordOfInterest, "(\\s[^\\s]+){0,3}")
    # extract a context of 3 words before the word in the dataframe
    context <- stringr::str_extract(sentence, regexOfInterest)
    names(context) <- wordOfInterest # Helps in forloop
    print(context)
    for(i in 1:length(context)){
      if(any(unlist(strsplit(context[i], " ")) %in% booster_words)){
        print(booster_words)
        if(any(unlist(strsplit(context[i], " ")) %in% negative_words)){
          subDF$positive.ponderate.polarity <- subDF$positive.polarity + 4
        } else {
          subDF$positive.ponderate.polarity <- subDF$positive.polarity + 9
        }
      }
    }
    # Debug option
    print(subDF)
    # calculate the total polarity of the sentence and store in the vector
    polarity[i] <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity)
  }
  return(polarity)
}
sentiment_DF <- data.frame(word = c('interesting', 'boring', 'pretty'),
                           positive.polarity = c(1, 0, 1),
                           negative.polarity = c(0, 1, 0))
sentences <- c("The course was interesting, but the professor was not so boring")
result <- calcPolarity(sentiment_DF, sentences)
When I run it with this sentence:
"The course was interesting, but the professor was not so boring"
I get this result:
word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting 1 0 5
2 boring 0 1 4
but this is not correct.
The correct result is:
word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting 1 0 1
2 boring 0 1 4
I don't know why I get the incorrect value.
Any idea to help me, please?
Thank you
EDIT:
For example, if I have this dataframe:
word positive.polarity negative.polarity positive.ponderate.polarity negative.ponderate.polarity
1 interesting 1 0 1 1
2 boring 0 1 4 2
The result should be: (1+4) - (1+2)

I have caught the error. In cases like this it is recommended to debug line by line, printing the initial variable, the result of each if statement, or an indicator of whether the if/else branch was taken.
Here your initial subDF$positive.polarity is a vector c(1,0) of length 2, which is the number of sentiment_DF words in the sentence: c("interesting", "boring").
When i=1, context="The course was interesting": there is no booster and no negative word, so subDF$positive.polarity is c(1,0) and subDF$positive.ponderate.polarity is NULL.
When i=2, context="was not so boring": there is a booster and a negative word, so subDF$positive.polarity is c(1,0) and you are adding 4 to both elements, when you want to add 4 only to the second element, the one corresponding to "boring". Because of this, subDF$positive.ponderate.polarity is c(5,4), which is what is returned.
The trick here is that the lengths of subDF$positive.polarity and subDF$positive.ponderate.polarity depend on the number of sentiment_DF words in the sentence. The corrected code and the debugging output are below. Here are the fixes:
A. Initialize so that the lengths are equal:
subDF$positive.ponderate.polarity <- subDF$positive.polarity
B. Use i to index, so the value is added only to the element corresponding to the current context element, not to all elements:
subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 4
subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 9
C. There is one thing that I did not fix, as I'm not sure how to treat it: what if the context is "course was so boring"? There is a booster and no negative word, so it passes to the else branch and 9 is added. Is this a positive.ponderate.polarity? Shouldn't it be a negative.ponderate.polarity?
calcPolarity(sentiment_DF, "The course was so boring")
word positive.polarity negative.polarity positive.ponderate.polarity
2 boring 0 1 9
D. Other cases check out:
calcPolarity(sentiment_DF, "The course was interesting, but the professor was not so boring")
word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting 1 0 1
2 boring 0 1 4
calcPolarity(sentiment_DF, "The course was so interesting")
word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting 1 0 10
Edited to correct the result of polarity as in the comment:
The output of polarity is c(0,5) because the original code is polarity[i] <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity). Since you have 2 context phrases, your i at the end of the inner loop is 2, so polarity[1] keeps its initial value 0 and the result of the sum is assigned to polarity[2], which is 5, leaving you with c(0,5). Instead, remove the [i]; it should be just polarity <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity)
Here is the corrected code:
calcPolarity <- function(sentiment_DF, sentences){
  booster_words <- c("more", "enough", "a lot", "as", "so")
  negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")
  reduce_words <- c("peu", "presque", "moins", "seulement")
  # pre-allocate the polarity result vector with size = number of sentences
  polarity <- rep.int(0, length(sentences))
  # loop per sentence
  for(i in 1:length(polarity)){
    sentence <- sentences[i]
    # separate each sentence in words using regular expression
    wordsOfASentence <- unlist(regmatches(sentence, gregexpr("[[:word:]]+", sentence, perl = TRUE)))
    # get the rows of sentiment_DF corresponding to the words in the sentence using match
    # N.B. if a word occurs twice, there will be two equal rows
    # (but I think it's correct since in this way you count its polarity twice)
    subDF <- sentiment_DF[match(wordsOfASentence, sentiment_DF$word, nomatch = 0), ]
    print(subDF)
    # Find (number) of matching word.
    wordOfInterest <- wordsOfASentence[which(wordsOfASentence %in% levels(sentiment_DF$word))] # No multigrepl, so working with duplicates instead. e.g. interesting
    regexOfInterest <- paste0("([^\\s]+\\s){0,3}", wordOfInterest, "(\\s[^\\s]+){0,3}")
    # extract a context of 3 words before the word in the dataframe
    context <- stringr::str_extract(sentence, regexOfInterest)
    names(context) <- wordOfInterest # Helps in forloop
    for(i in 1:length(context)){
      print(paste("i:", i))
      print(context)
      print("initial")
      print(subDF$positive.polarity)
      subDF$positive.ponderate.polarity <- subDF$positive.polarity
      print(subDF$positive.ponderate.polarity)
      if (any(unlist(strsplit(context[i], " ")) %in% booster_words)) {
        print(booster_words)
        print("if level 1")
        print(subDF$positive.polarity)
        if (any(unlist(strsplit(context[i], " ")) %in% negative_words)) {
          subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 4
          print("if level 2A")
          print(subDF$positive.ponderate.polarity)
        } else {
          print("if level 2B")
          subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 9
          print(subDF$positive.ponderate.polarity)
        }
        print("level 2 result")
        print(subDF$positive.ponderate.polarity)
      }
      print("level 1 result")
      print(subDF$positive.ponderate.polarity)
    }
  }
  # Debug option
  print(subDF)
  # calculate the total polarity of the sentence and store in the vector
  polarity <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity)
  return(polarity)
}
sentiment_DF <- data.frame(word = c('interesting', 'boring', 'pretty'),
                           positive.polarity = c(1, 0, 1),
                           negative.polarity = c(0, 1, 0))
calcPolarity(sentiment_DF, "The course was interesting, but the professor was not so boring")
calcPolarity(sentiment_DF, "The course was so interesting")
calcPolarity(sentiment_DF, "The course was so boring")
sentiment_DF <- data.frame(word=c('interesting','boring','pretty'),
positive.polarity=c(1,0,1),
negative.polarity=c(0,1,0))
calcPolarity(sentiment_DF, "The course was interesting, but the professor was not so boring")
calcPolarity(sentiment_DF, "The course was so interesting")
calcPolarity(sentiment_DF, "The course was so boring")

Related

Regex - filter with (1) hyphen or (2) end of sentence

I need support with RegEx filtering!
I have a list of keywords and many rows that should be checked.
In this example, the keyword "-book-" can be (1) in the middle of the sentence or (2) at the end, in which case the trailing hyphen is not present.
I need a RegEx expression which identifies "-book-" and "-book".
I don't want similar keywords like "-booking-" etc. to be identified.
library(dplyr)
keywords = c( "-album-", "-book-", "-castle-")
search_terms = paste(keywords, collapse ="|")
number = c(1:5)
sentences = c("the-best-album-in-shop", "this-book-is-fantastic", "that-is-the-best-book", "spacespacespace", "unwanted-sentence-with-booking")
data = data.frame(number, sentences)
output = data %>% filter(., grepl( search_terms, sentences) )
# Current output:
number sentences
1 1 the-best-album-in-shop
2 2 this-book-is-fantastic
# DESIRED output:
number sentences
1 1 the-best-album-in-shop
2 2 this-book-is-fantastic
3 3 that-is-the-best-book
You could also do:
subset(data, grepl(paste0(sprintf("%s?\\b",keywords),collapse = "|"), sentences))
number sentences
1 1 the-best-album-in-shop
2 2 this-book-is-fantastic
3 3 that-is-the-best-book
Note that this will only match -book- (1) in the middle of the sentence or (2) at the end, not at the beginning.
The -book- pattern matches the whole word book with a hyphen on the left and on the right.
To match the whole word book with a hyphen on the left or right, you need an alternation \bbook-|-book\b.
Thus, you can use
keywords = c( "-album-", "\\bbook-", "-book\\b", "-castle-" )
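Putting it together, a quick sketch with the adjusted keyword vector on the example data from the question:

```r
keywords <- c("-album-", "\\bbook-", "-book\\b", "-castle-")
number <- c(1:5)
sentences <- c("the-best-album-in-shop", "this-book-is-fantastic",
               "that-is-the-best-book", "spacespacespace",
               "unwanted-sentence-with-booking")
data <- data.frame(number, sentences)
# \bbook- matches "book-" after a hyphen; -book\b matches "-book" at a word end,
# so "-booking-" style words are still excluded
result <- subset(data, grepl(paste(keywords, collapse = "|"), sentences))
```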
Another solution you can take into account:
library(stringr)
data %>%
filter(str_detect(sentences, regex("-castle-|-album-|-book$|-book-\\w{1,}")))
# number sentences
# 1 1 the-best-album-in-shop
# 2 2 this-book-is-fantastic
# 3 3 that-is-the-best-book

R: Count the frequency of every unique character in a column

I have a data frame df which contains a column named strings. The values in this column are some sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop words:
["I", "don't", "you"]
How can I make another data frame which stores the total number of occurrences of each unique word (except stop words) in the column of the previous data frame?
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is that:
combine the strings in the column to a big string.
Make a list storing the unique words in the big string.
Make the df whose one column is the unique words.
Compute the frequency.
But this seems really inefficient and I don't know how to really code this.
First, you can create a vector of all words through str_split and then create a frequency table of the words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
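Run end to end on the rows shown in the question, this gives the expected counts (the df construction below is an assumption based on those rows; base strsplit and gsub are used here so the sketch has no package dependency):

```r
df <- data.frame(strings = c("I want to go to school, how about you?",
                             "I like you.", "I like you so much",
                             "I like you very much", "I don't like you"),
                 stringsAsFactors = FALSE)
stop_words <- c("I", "don't", "you")

# strip punctuation first so "you." and "you?" count as "you"
all_words <- unlist(strsplit(gsub("[.,?]", "", df$strings), " "))
word_list <- as.data.frame(table(all_words), stringsAsFactors = FALSE)
# omit all stop words from the frequency table
word_list <- word_list[!word_list$all_words %in% stop_words, ]
```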
One way is using tidytext. Here a book and the code
library("tidytext")
library("tidyverse")
#> df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
#> "I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
mutate(strings = as.character(strings)) %>%
unnest_tokens(word, strings) %>% # this tokenizes the strings and extracts the words
filter(!word %in% c("I", "i", "don't", "you")) %>%
count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
EDIT
All the tokens are transformed to lower case, so you should either include "i" in the stop words or add the argument to_lower = FALSE to unnest_tokens.
Assuming you have a mystring object and a vector of stopWords, you can do it like this:
# split text into a words vector
wordvector <- strsplit(mystring, " ")[[1]]
# remove stopwords from the vector
wordvector <- wordvector[!wordvector %in% stopWords]
At this point you can turn a frequency table() into a dataframe object:
frequency_df <- data.frame(table(wordvector))
Let me know if this helps.
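A corrected end-to-end run of this approach, with mystring and stopWords filled in from the question's example (the concatenated string below is an assumption):

```r
mystring <- "I want to go to school, how about you? I like you. I like you so much"
stopWords <- c("I", "don't", "you")

# split text into a words vector (punctuation stripped first)
wordvector <- strsplit(gsub("[.,?]", "", mystring), " ")[[1]]
# remove stopwords from the vector
wordvector <- wordvector[!wordvector %in% stopWords]
# turn the frequency table() into a dataframe
frequency_df <- data.frame(table(wordvector))
```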

Inserting random letters at random locations within a string

I am trying to make a little script to demonstrate how DNA sequences can evolve using a sentence as an example. I would like to repeatedly replace or insert letters or words into a string in R. I would like this to happen repeatedly so one can watch the string change over time. Finally I would like there to be a greater probability of letters changing than words changing.
So far I have defined a string and created lists of both letters and words and sample randomly from both these lists.
However, I do not know how to then modify the text with a set probability. For example, how do I make it so that there is a 50% chance of a letter in the text being replaced with a letter from my letter list, and, if this happens, that it occurs at a random location in the text?
I also want this process to occur X times so I can show the text changing over time. Any help or suggestions are greatly appreciated. My current incomplete code is below
#First I define the string
text <- c("This sentence is changing")
#Then make a vector of words from the string
word_list <- strsplit(text, " ")
word_list <- unlist(word_list)
#Also make a vector of letters from the string
letters_and_gaps <- substring(text, seq(1, nchar(text), 1), seq(1, nchar(text), 1))
letters_and_gaps <- unlist(letters_and_gaps)
#Now with probability 1 in 2 of it occurring, select a random character from letters_and_gaps:
sample(letters_and_gaps, 1)
#Then choose a random character in text and replace it with this randomly sampled character:
#Now with probability 1 in 10 of it occurring, select a random word from word_list
sample(word_list, 1)
#Then choose a random word in text and replace it with this randomly sampled word:
#Then print the updated text:
text
#Iteratively repeat this process X times
My goal is to ultimately put this in a Shiny app where one can select the probability of different events occurring (letter vs word replacement) and then watch how this influences how the text evolves.
Here is the beginning of an implementation. We just wrap your logic up in a function and use a for loop to apply it again and again. Here I put the output in a table and then display only unique rows (possibly excluding times where it mutated back to the same string as a previous iteration, but that is probably not significant) so you can see the changes happening. Note that because we are sampling from the words and characters of the previous sentence, and we are including spaces, new words can form when spaces are inserted, and the distribution will tend to become more uniform (if a character is common it will tend to be substituted more often).
library(tidyverse)
evolve_sentence <- function(sentence) {
chars <- str_split(sentence, "") %>% pluck(1)
if (runif(1) > 0.5) {
chars[sample(1:length(chars), 1)] <- sample(chars, 1)
}
sentence <- str_c(chars, collapse = "")
words <- str_split(sentence, " ") %>% pluck(1)
if (runif(1) > 0.9) {
words[sample(1:length(words), 1)] <- sample(words, 1)
}
sentence <- str_c(words, collapse = " ")
sentence
}
tbl_evolve <- tibble(iteration = 1:500, text = "This sentence is changing")
for (i in 2:500) {
tbl_evolve$text[i] <- evolve_sentence(tbl_evolve$text[i - 1])
}
tbl_evolve %>%
distinct(text, .keep_all = TRUE)
#> # A tibble: 204 x 2
#> iteration text
#> <int> <chr>
#> 1 1 This sentence is changing
#> 2 3 hhis sentence is changing
#> 3 4 hhis sentence is chasging
#> 4 6 hhis sestence is chasging
#> 5 10 hhi sestence is chasging
#> 6 12 hhi sesnence is chasging
#> 7 14 hhi sesnesce is chasging
#> 8 15 hhi se nesce is chasging
#> 9 18 hhi se nesceiis chasging
#> 10 20 hhi se nesceiis chasgihg
#> # … with 194 more rows
Created on 2019-04-17 by the reprex package (v0.2.1)

R - looking up strings and exclude based on other string

I could not find an answer for how to count words in a data frame column while excluding rows in which another word is found.
I have got below df:
words <- c("INSTANCE find", "LA LA LA", "instance during",
"instance", "instance", "instance", "find instance")
df <- data.frame(words)
df$words_count <- grepl("instance", df$words, ignore.case = T)
It counts all instances of "instance". I have been trying to exclude any row in which the word find is present as well.
I could add another grepl to look for "find" and exclude based on that, but I am trying to limit the number of lines in my code.
I'm sure there's a solution using a single regular expression, but you could do
df$words_count <- Reduce(`-`, lapply(c('instance', 'find'), grepl, df$words)) > 0
or
df$words_count <- Reduce(`&`, lapply(c('instance', '^((?!find).)*$'), grepl, df$words, perl = T, ignore.case = T))
This might be easier to read
library(tidyverse)
df$words_count <- c('instance', '^((?!find).)*$') %>%
lapply(grepl, df$words, perl = T, ignore.case = T) %>%
reduce(`&`)
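Since the answer above mentions a single regular expression: a negative lookahead can do both checks in one grepl call. A sketch (perl = TRUE is required for the lookahead):

```r
words <- c("INSTANCE find", "LA LA LA", "instance during",
           "instance", "instance", "instance", "find instance")
# TRUE only when "instance" occurs and "find" occurs nowhere in the string
res <- grepl("^(?!.*\\bfind\\b).*\\binstance\\b", words,
             perl = TRUE, ignore.case = TRUE)
```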
If all you need is the number of times "instance" appears in a string, negating all in that string if "find" is found anywhere:
df$counts <- sapply(gregexpr("\\binstance\\b", words, ignore.case=TRUE), function(a) length(a[a>0])) *
!grepl("\\bfind\\b", words, ignore.case=TRUE)
df
# words counts
# 1 INSTANCE find 0
# 2 LA LA LA 0
# 3 instance during 1
# 4 instance 1
# 5 instance 1
# 6 instance 1
# 7 find instance 0

Count 1st instance of keyword in list with no duplicate counts in R

I have a list of keywords:
library(stringr)
words <- as.character(c("decomposed", "no diagnosis","decomposition","autolysed","maggots", "poor body", "poor","not suitable", "not possible"))
I want to match these keywords against text in a data frame column (df$text) and count the number of times each keyword occurs, storing the counts in a different data.frame (matchdf):
matchdf<- data.frame(Keywords=words)
m_match<-sapply(1:length(words), function(x) sum(str_count(tolower(df$text),words[[x]])))
matchdf$matchs<-m_match
However, I've noticed that this method counts EACH occurrence of a keyword within a column, e.g.
"The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time"
would then return a count of 2. However, I only want to count the first instance of "decomposed" within a field.
I thought there would be a way to only count the first instance using str_count but there doesn't seem to be one.
The stringr package isn't strictly necessary in this example; grepl from base R will suffice. That said, use str_detect instead of grepl if you prefer the package function (as pointed out by @Chi-Pak in a comment).
library(stringr)
words <- c("decomposed", "no diagnosis","decomposition","autolysed","maggots",
"poor body", "poor","not suitable", "not possible")
df <- data.frame( text = "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time")
matchdf <- data.frame(Keywords = words, stringsAsFactors = FALSE)
# Base R grepl
matchdf$matches1 <- sapply(1:length(words), function(x) as.numeric(grepl(words[x], tolower(df$text))))
# Stringr function
matchdf$matches2 <- sapply(1:length(words), function(x) as.numeric(str_detect(tolower(df$text),words[[x]])))
matchdf
Result
Keywords matches1 matches2
1 decomposed 1 1
2 no diagnosis 0 0
3 decomposition 0 0
4 autolysed 0 0
5 maggots 0 0
6 poor body 0 0
7 poor 0 0
8 not suitable 0 0
9 not possible 0 0