I'm trying to count the number of times that some pre-specified words appear in a character column, Post.
This is what my dataset looks like:
Now, I want to count all green/sustainable words in each of the posts and add this number as an extra column.
I have manually created a lexicon where all green words have Polarity == 1 and non-green words have Polarity == 0.
How can I do this?
str_count() from stringr can help with this (and with a lot more string-based tasks, see this R4DS chapter).
library(stringr)
# Create a reproducible example
dat <- data.frame(Post = c(
"This is a sample post without any target words",
"Whilst this is green!",
"And this is eco-friendly",
"This is green AND eco-friendly!"))
lexicon <- data.frame(Word = c("green", "eco-friendly", "neutral"),
Polarity = c(1, 1, 0))
# Extract relevant words from lexicon
green_words <- lexicon$Word[lexicon$Polarity == 1]
# Create new variable
dat$n_green_words <- str_count(dat$Post, paste(green_words, collapse = "|"))
dat
Output:
#> Post n_green_words
#> 1 This is a sample post without any target words 0
#> 2 Whilst this is green! 1
#> 3 And this is eco-friendly 1
#> 4 This is green AND eco-friendly! 2
Created on 2022-07-15 by the reprex package (v2.0.1)
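One caveat with a plain alternation pattern is that "green" would also be counted inside longer words such as "greenhouse". Below is a minimal base-R sketch that adds word boundaries and ignores case. The helper name count_matches and the sample posts are made up for illustration, and lexicon entries containing regex metacharacters would need escaping first.

```r
# Hypothetical sample data, not taken from the question
posts <- c("This is green!",
           "A greenhouse is not eco-friendly?",
           "green, green and eco-friendly")
green_words <- c("green", "eco-friendly")

# Wrap the alternation in \b ... \b so only whole words match
pattern <- paste0("\\b(", paste(green_words, collapse = "|"), ")\\b")

# Base-R counterpart of str_count(): number of regex matches per string
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE)))
}

n_green <- count_matches(posts, pattern)
n_green
```

The same pattern string can be passed to str_count() if you prefer to stay with stringr.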
example <- data.frame(
file_name = c("some_file_name_first_2020.csv",
"some_file_name_second_and_third_2020.csv",
"some_file_name_4_2020_update.csv"),
a = 1:3
)
example
#> file_name a
#> 1 some_file_name_first_2020.csv 1
#> 2 some_file_name_second_and_third_2020.csv 2
#> 3 some_file_name_4_2020_update.csv 3
I have a dataframe that looks something like this example. The "some_file_name" part changes often and the unique identifier is usually in the middle and there can be suffixed information (sometimes) that is important to retain.
I would like to end up with the dataframe below. The approach I can think of is finding all common string "components" and removing them from each row.
desired
#> file_name a
#> 1 first 1
#> 2 second_and_third 2
#> 3 4_update 3
This works for the example shared; perhaps you can use it to build a more general solution:
#split the data on "_" or "."
list_data <- strsplit(example$file_name, '_|\\.')
#Get the words that occur only once
unique_words <- names(Filter(function(x) x==1, table(unlist(list_data))))
#Keep only unique_words and paste the string back.
sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))
#[1] "first" "second_and_third" "4_update"
However, this answer relies on the fact that you would have separators like "_" in the filenames to detect each "component".
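A variation on the same idea, closer to the "remove all common components" wording in the question: drop every token that appears in all file names, rather than keeping tokens that occur only once overall. This is only a sketch. The function name strip_common_tokens is made up, and it still assumes the components are separated by "_" and that the extension can be discarded.

```r
strip_common_tokens <- function(x, sep = "_") {
  # Drop the file extension (everything from the last dot onwards)
  stems <- sub("\\.[^.]*$", "", x)
  tokens <- strsplit(stems, sep, fixed = TRUE)
  # Tokens that are present in every single name
  common <- Reduce(intersect, tokens)
  # Keep the remaining tokens in their original order
  vapply(tokens,
         function(tk) paste(tk[!tk %in% common], collapse = sep),
         character(1))
}

file_name <- c("some_file_name_first_2020.csv",
               "some_file_name_second_and_third_2020.csv",
               "some_file_name_4_2020_update.csv")
cleaned <- strip_common_tokens(file_name)
cleaned
```

Unlike the once-only filter, this keeps a token that happens to repeat in some (but not all) of the names.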
I have a data frame df which contains a column named strings. The values in this column are some sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop words:
["I", "don't", "you"]
How can I make another data frame that stores the total number of occurrences of each unique word (except stop words) in that column of the previous data frame?
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is that:
combine the strings in the column to a big string.
Make a list storing the unique words in the big string.
Make the df whose one column is the unique words.
Compute the frequency.
But this seems really inefficient and I don't know how to really code this.
First, create a vector of all words with str_split, then build a frequency table of the words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
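Splitting on a single space leaves punctuation attached to the tokens ("school,", "you?", "you."), so those would not match the stop words or the expected keywords. A base-R sketch that extracts runs of letters and apostrophes instead is below. The df is rebuilt from the question's example, and the regex is an assumption about what should count as a word.

```r
df <- data.frame(
  id = 1:5,
  strings = c("I want to go to school, how about you?",
              "I like you.",
              "I like you so much",
              "I like you very much",
              "I don't like you"),
  stringsAsFactors = FALSE
)
stop_words <- c("I", "don't", "you")

# Pull out words as runs of letters/apostrophes, which drops punctuation
all_words <- unlist(regmatches(df$strings,
                               gregexpr("[[:alpha:]']+", df$strings)))

# Remove stop words, then tabulate what is left
kept <- all_words[!all_words %in% stop_words]
freq <- as.data.frame(table(keyword = kept),
                      responseName = "frequency",
                      stringsAsFactors = FALSE)
freq
```

The comparison with stop_words is case-sensitive here; lower-casing both sides with tolower() would make it case-insensitive.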
One way is to use tidytext. Here is a book and the code:
library("tidytext")
library("tidyverse")
df <- data.frame(id = 1:6, strings = c("I want to go to school", "how about you?",
"I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
mutate(strings = as.character(strings)) %>%
unnest_tokens(word, strings) %>% # this tokenizes the strings and extracts the words
filter(!word %in% c("I", "i", "don't", "you")) %>%
count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
EDIT
All the tokens are converted to lower case, so you either include "i" in the stop words or add the argument to_lower = FALSE to unnest_tokens.
Assuming you have a mystring object and a vector of stopWords, you can do it like this:
# split text into words vector
wordvector = strsplit(mystring, " ")[[1]]
# remove stopwords from the vector
wordvector = wordvector[!wordvector %in% stopWords]
At this point you can turn a frequency table() into a dataframe object:
frequency_df = data.frame(table(wordvector))
Let me know if this can help you.
After trying for an embarrassingly long time and extensive searches online, I come to you with a problem.
I am looking for a method to (non-randomly) shuffle a string to get a string which has the maximal ‘distance’ from the original one, while still containing the same set of characters.
My particular case is for short nucleotide sequences (4-8 nt long), as represented by these example sequences:
seq_1<-"ACTG"
seq_2<-"ATGTT"
seq_3<-"ACGTGCT"
For each sequence, I would like to get a scramble sequence which contains the same nucleobase count, but in a different order.
A favourable scramble sequence for seq_3 could be something like:
seq_3.scramble<-"CATGTGC"
where none of the sequence positions 1-7 has the same nucleobase as the original, but the overall nucleobase count is the same (A = 1, C = 2, G = 2, T = 2). Naturally, it would not always be possible to get a completely different string, but I would just flag those cases in the output.
I am not particularly interested in randomising the sequence and would prefer a method which makes these scramble sequences in a consistent manner.
Do you have any ideas?
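This is in Python, since I don't know R, but the basic solution is as follows:
import itertools

def calcDistance(originalString, newString):
    # Hamming distance: number of positions where the two sequences differ
    d = 0
    i = 0
    while i < len(originalString):
        if originalString[i] != newString[i]:
            d = d + 1
        i = i + 1
    return d

s = "ACTG"
d_max = 0
s_final = ""
# try every permutation and keep the one furthest from the original
for combo in itertools.permutations(s):
    if calcDistance(s, combo) > d_max:
        d_max = calcDistance(s, combo)
        s_final = "".join(combo)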
Give this a try. Rather than return a single string that fits your criteria, I return a data frame of all strings sorted by their string-distance score. The string-distance score is calculated using stringdist(..., ..., method = "hamming"), which counts the number of substitutions required to convert string A to string B.
seq_3<-"ACGTGCT"
myfun <- function(S) {
require(combinat)
require(dplyr)
require(stringdist)
vec <- unlist(strsplit(S, ""))
P <- sapply(permn(vec), function(i) paste(i, collapse=""))
Dist <- c(stringdist(S, P, method="hamming"))
df <- data.frame(seq = P, HD = Dist, stringsAsFactors = FALSE) %>%
distinct(seq, HD) %>%
arrange(desc(HD))
return(df)
}
library(combinat)
library(dplyr)
library(stringdist)
head(myfun(seq_3), 10)
# seq HD
# 1 TACGTGC 7
# 2 TACGCTG 7
# 3 CACGTTG 7
# 4 GACGTTC 7
# 5 CGACTTG 7
# 6 CGTACTG 7
# 7 TGCACTG 7
# 8 GTCACTG 7
# 9 GACCTTG 7
# 10 GATCCTG 7
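Enumerating every permutation grows factorially with sequence length, and the question asks for a consistent (non-random) result. One possible shortcut, offered as a sketch rather than a proven method: group the positions by nucleobase, then rotate the sorted bases by the size of the largest group, so that each position receives a base from a different group whenever that is possible. The names scramble_max and hamming are made up. On the three example sequences this gives distances of 4, 4 and 7, but it is worth checking against the brute-force table for your own data.

```r
# Deterministic scramble: same letters, shifted across letter groups
scramble_max <- function(s) {
  chars <- strsplit(s, "")[[1]]
  n <- length(chars)
  ord <- order(chars)            # positions grouped by letter (stable order)
  sorted <- chars[ord]
  shift <- max(table(chars))     # size of the largest letter group
  out <- character(n)
  # position ord[k] receives the letter `shift` places further along
  out[ord] <- sorted[(seq_len(n) + shift - 1) %% n + 1]
  paste(out, collapse = "")
}

# Number of positions at which two equal-length strings differ
hamming <- function(a, b) {
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

seqs <- c("ACTG", "ATGTT", "ACGTGCT")
scrambled <- vapply(seqs, scramble_max, character(1))
data.frame(seq = seqs,
           scramble = scrambled,
           HD = mapply(hamming, seqs, scrambled),
           row.names = NULL)
```

Sequences whose HD is smaller than their length (such as "ATGTT", where T fills more than half of the positions) are the ones to flag as not fully scrambled.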
I have had this problem for three days now, and I very much hope someone can help me find a solution:
To do a sentiment analysis of a text, I store in a dataframe a list of words and their positive and negative polarities:
word positive.polarity negative.polarity
1 interesting 1 0
2 boring 0 1
Then, for each of those words in the dataframe, I would like to know whether its context (the set of 3 words preceding the word) contains a booster word or a negation word:
-booster_words <- c("more","enough", "a lot", "as", "so")
-negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")
I would like to create a new column positive.ponderate.polarity which contains the positive polarity value + 4 if there is both a booster and a negative word in the context, and the positive polarity value + 9 if there is only a booster word in the context (no negative word).
Here is the code :
calcPolarity <- function(sentiment_DF,sentences){
booster_words <- c("more","enough", "a lot", "as", "so")
negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")
reduce_words <- c("peu", "presque", "moins", "seulement")
# pre-allocate the polarity result vector with size = number of sentences
polarity <- rep.int(0,length(sentences))
# loop per sentence
for(i in 1:length(polarity)){
sentence <- sentences[i]
# separate each sentence in words using regular expression
wordsOfASentence <- unlist(regmatches(sentence,gregexpr("[[:word:]]+",sentence,perl=TRUE)))
# get the rows of sentiment_DF corresponding to the words in the sentence using match
# N.B. if a word occurs twice, there will be two equal rows
# (but I think it's correct since in this way you count its polarity twice)
subDF <- sentiment_DF[match(wordsOfASentence,sentiment_DF$word,nomatch = 0),]
# Find (number) of matching word.
wordOfInterest <- wordsOfASentence[which(wordsOfASentence %in% levels(sentiment_DF$word))] # No multigrepl, so working with duplicates instead. eg interesting
regexOfInterest <- paste0("([^\\s]+\\s){0,3}", wordOfInterest, "(\\s[^\\s]+){0,3}")
# extract a context of 3 words before the word in the dataframe
context <- stringr::str_extract(sentence, regexOfInterest)
names(context) <- wordOfInterest # Helps in forloop
print(context)
for(i in 1:length(context)){
if(any(unlist(strsplit(context[i], " ")) %in% booster_words))
{
print(booster_words)
if(any(unlist(strsplit(context[i], " ")) %in% negative_words))
{
subDF$positive.ponderate.polarity <- subDF$positive.polarity + 4
}
else
{
subDF$positive.ponderate.polarity <- subDF$positive.polarity + 9
}
}
}
# Debug option
print(subDF)
# calculate the total polarity of the sentence and store in the vector
polarity[i] <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity)
}
return(polarity)
}
sentiment_DF <- data.frame(word=c('interesting','boring','pretty'),
positive.polarity=c(1,0,1),
negative.polarity=c(0,1,0))
sentences <- c("The course was interesting, but the professor was not so boring")
result <- calcPolarity(sentiment_DF,sentences)
When I run it with this sentence :
"The course was interesting, but the professor was not so boring"
I get this result :
word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting 1 0 5
2 boring 0 1 4
but this is not correct.
The correct result is :
word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting 1 0 1
2 boring 0 1 4
I don't know why I get an incorrect value.
Any idea please to help me?
Thank you
EDIT:
For example, if I have this dataframe:
word positive.polarity negative.polarity positive.ponderate.polarity negative.ponderate.polarity
1 interesting 1 0 1 1
2 boring 0 1 4 2
The result should be : (1+4) -(1+2)
I have caught the error. In cases like this it is recommended to debug line by line, and to print the initial variable, the result of each if statement or an indicator if the if else statement was processed.
Here your initial subDF$positive.polarity is a vector c(1,0) of length 2, which is the number of sentiment_DF words found in the sentence, c("interesting", "boring").
When i=1, context="The course was interesting", there are no booster and no negative words, so subDF$positive.polarity is c(1,0) and subDF$positive.ponderate.polarity is NULL.
When i=2, context="was not so boring", there is a booster and a negative word. subDF$positive.polarity is c(1,0), and you are adding 4 to both elements when you want to add 4 only to the second element, the one corresponding to "boring". Because of this, subDF$positive.ponderate.polarity is c(5,4), which is what is returned.
The trick here is that length of subDF$positive.polarity and subDF$positive.ponderate.polarity depends on the number of sentiment_DF words in the sentence. The corrected code and the debugging are below. Here are the fixes:
A. Initialize so that the lengths are equal
subDF$positive.ponderate.polarity <- subDF$positive.polarity
B. Use i to index so you are adding value only to the element corresponding to the current context element, and not all elements
subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 4
subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 9
C. There is one thing that I did not fix as I'm not sure how to treat it... what if the context is: "course was so boring"? There is a booster, and no negative words so it passes to else statement and 9 is added. Is this positive.ponderate.polarity? Wouldn't it be a negative.ponderate.polarity?
calcPolarity(sentiment_DF, "The course was so boring")
word positive.polarity negative.polarity positive.ponderate.polarity
2 boring 0 1 9
D. Other cases check out:
calcPolarity(sentiment_DF, "The course was interesting, but the professor was not so boring")
word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting 1 0 1
2 boring 0 1 4
calcPolarity(sentiment_DF, "The course was so interesting")
word positive.polarity negative.polarity positive.ponderate.polarity
1 interesting 1 0 10
Edited to correct result of polarity as in the comment:
The output of polarity is c(0,5) because the original code is: polarity[i] <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity). Since you have 2 context phrases, your i at the end is 2, so polarity[1] keeps its initial value 0 and the result of the sum is assigned to polarity[2], which is 5, leaving you with c(0,5). Instead, remove the [i]; it should be just polarity <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity)
Here is the corrected code:
calcPolarity <- function(sentiment_DF,sentences){
booster_words <- c("more","enough", "a lot", "as", "so")
negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas", "non plus", "sans")
reduce_words <- c("peu", "presque", "moins", "seulement")
# pre-allocate the polarity result vector with size = number of sentences
polarity <- rep.int(0,length(sentences))
# loop per sentence
for(i in 1:length(polarity)){
sentence <- sentences[i]
# separate each sentence in words using regular expression
wordsOfASentence <- unlist(regmatches(sentence,gregexpr("[[:word:]]+",sentence,perl=TRUE)))
# get the rows of sentiment_DF corresponding to the words in the sentence using match
# N.B. if a word occurs twice, there will be two equal rows
# (but I think it's correct since in this way you count its polarity twice)
subDF <- sentiment_DF[match(wordsOfASentence,sentiment_DF$word,nomatch = 0),]
print(subDF)
# Find (number) of matching word.
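wordOfInterest <- wordsOfASentence[which(wordsOfASentence %in% sentiment_DF$word)] # compare against the column itself: levels() is NULL when word is a character column (the default since R 4.0). No multigrepl, so working with duplicates instead. eg interesting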
regexOfInterest <- paste0("([^\\s]+\\s){0,3}", wordOfInterest, "(\\s[^\\s]+){0,3}")
# extract a context of 3 words before the word in the dataframe
context <- stringr::str_extract(sentence, regexOfInterest)
names(context) <- wordOfInterest # Helps in forloop
# initialize once, before the loop, so adjustments made for earlier words are not overwritten
subDF$positive.ponderate.polarity <- subDF$positive.polarity
for(i in seq_along(context)){
print(paste("i:", i))
print(context)
print("initial")
print(subDF$positive.polarity)
print(subDF$positive.ponderate.polarity)
if (any(unlist(strsplit(context[i], " ")) %in% booster_words)) {
print(booster_words)
length(booster_words)
print("if level 1")
print(subDF$positive.polarity)
if (any(unlist(strsplit(context[i], " ")) %in% negative_words)) {
subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 4
print("if level 2A")
print(subDF$positive.ponderate.polarity)
} else {
print("if level 2B")
subDF$positive.ponderate.polarity[i] <- subDF$positive.polarity[i] + 9
print(subDF$positive.ponderate.polarity)
}
print("level 2 result")
print(subDF$positive.ponderate.polarity)
}
print("level 1 result")
print(subDF$positive.ponderate.polarity)
}
}
# Debug option
print(subDF)
# calculate the total polarity of the sentence and store in the vector
polarity <- sum(subDF$positive.ponderate.polarity) - sum(subDF$negative.ponderate.polarity)
return(polarity)
}
sentiment_DF <- data.frame(word=c('interesting','boring','pretty'),
positive.polarity=c(1,0,1),
negative.polarity=c(0,1,0))
calcPolarity(sentiment_DF, "The course was interesting, but the professor was not so boring")
calcPolarity(sentiment_DF, "The course was so interesting")
calcPolarity(sentiment_DF, "The course was so boring")
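For comparison, here is a more compact sketch of the same weighting rule without the debugging output. It takes the context literally as the three words before the target word, whereas the regex above also captures up to three words after it. The function name weighted_polarity and the n_before parameter are made up, and multi-word entries such as "a lot" or "non plus" cannot match single tokens in this version.

```r
booster_words <- c("more", "enough", "a lot", "as", "so")
negative_words <- c("not", "rien", "ni", "aucun", "nul", "jamais", "pas",
                    "non plus", "sans")

weighted_polarity <- function(sentence, sentiment_DF, n_before = 3) {
  # Tokenize into alphabetic words; punctuation is dropped
  words <- unlist(regmatches(sentence, gregexpr("[[:alpha:]]+", sentence)))
  hits <- which(words %in% sentiment_DF$word)   # positions of lexicon words
  res <- sentiment_DF[match(words[hits], sentiment_DF$word), ]
  res$positive.ponderate.polarity <- res$positive.polarity
  for (k in seq_along(hits)) {
    # Context: up to n_before words immediately preceding the hit
    ctx <- utils::tail(words[seq_len(hits[k] - 1)], n_before)
    has_boost <- any(ctx %in% booster_words)
    has_neg <- any(ctx %in% negative_words)
    if (has_boost && has_neg) {
      res$positive.ponderate.polarity[k] <- res$positive.polarity[k] + 4
    } else if (has_boost) {
      res$positive.ponderate.polarity[k] <- res$positive.polarity[k] + 9
    }
  }
  res
}

sentiment_DF <- data.frame(word = c("interesting", "boring", "pretty"),
                           positive.polarity = c(1, 0, 1),
                           negative.polarity = c(0, 1, 0))
res1 <- weighted_polarity(
  "The course was interesting, but the professor was not so boring",
  sentiment_DF)
res2 <- weighted_polarity("The course was so interesting", sentiment_DF)
res1
res2
```

The open question from point C still applies: a boosted negative word such as "so boring" gets +9 on its positive score here as well.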