How to str_count each element of a data frame in R

I have a small problem printing and counting, one by one, how often each string in an array/data frame appears.
I have a data frame called Pos_1 that contains strings like these:
Pos_1 = (morning bliss great happy)
and another data frame called Pos_2 that contains strings like these:
Pos_2 = (morning great)
What I want to do is count how many times each string from Pos_1 appears in Pos_2.
I'm using str_count to count each string that appears:
for(h in 1:5)
Score=sum(str_count(Pos_2, Pos_1[h]))[1:length(Pos_1)]
The code above only returns the total over all string elements of Pos_1:
Text Score
morning 0
bliss 0
great 0
happy 0
The expected result of counting the elements of Pos_1 that match in Pos_2 with str_count is shown below.
I only need to produce the Score column:
Text Score
morning 1
bliss 0
great 1
happy 0
Is there any solution?

I think this does what you want:
library(stringr)
Score <- sapply(seq_along(unlist(Pos_1)), function(i)
sum(str_count(unlist(Pos_2), unlist(Pos_1)[i])))
You use unlist to convert your data frames of strings into vectors. Then you use sapply to iterate str_count over the elements of the unlisted Pos_1, getting a vector in return.
If each element of Pos_1 will appear no more than once in Pos_2, you don't need str_count and could just use:
Score <- +(unlist(Pos_1) %in% unlist(Pos_2))
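For readers without stringr, the same idea can be sketched in base R. Here count_fixed is a hypothetical helper (not part of any package) standing in for str_count, and the one-column data frames are assumed stand-ins for the asker's data:

```r
# Assumed one-column data frames standing in for the asker's Pos_1 and Pos_2
Pos_1 <- data.frame(word = c("morning", "bliss", "great", "happy"),
                    stringsAsFactors = FALSE)
Pos_2 <- data.frame(word = c("morning", "great"),
                    stringsAsFactors = FALSE)

# Flatten both data frames into plain character vectors
p1 <- unlist(Pos_1, use.names = FALSE)
p2 <- unlist(Pos_2, use.names = FALSE)

# Hypothetical helper: number of literal occurrences of `pattern`
# in each element of `x` (a base R stand-in for str_count)
count_fixed <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, fixed = TRUE)))
}

# One score per element of Pos_1: total occurrences across all of Pos_2
Score <- vapply(p1, function(w) sum(count_fixed(p2, w)), integer(1),
                USE.NAMES = FALSE)

result <- data.frame(Text = p1, Score = Score)
print(result)
```

Using fixed = TRUE treats each word literally, which avoids surprises if a word contains regex metacharacters.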

try this
library(stringr)
Pos_1 <- c("morning", "bliss", "great","happy")
Pos_2 <- c("morning", "great")
# str_count(string, pattern): search Pos_2 for each word x of Pos_1
df <- data.frame(Text = Pos_1, Score = unlist(lapply(Pos_1, function(x) sum(str_count(Pos_2, x)))))
df
output
Text Score
1 morning 1
2 bliss 0
3 great 1
4 happy 0

Related

How to search for words with asterisks and wildcards (e.g., exampl*) in R (word appearance in a data frame)

I wrote some code to count how often words appear in a data frame:
Items <- c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))){
item[i] <- Items[i]
count[i]<- sum(df_main$words == item[i])}
word_freq <- data.frame(cbind(item, count))
word_freq
However, the results are like this:
item count
1 decid* 0
2 head 1
3 heads 1
As you see, it does not correctly count for "decid*". The actual results I expect should be like this:
item count
1 decid* 2
2 head 1
3 heads 1
I think I need to change the item word (decid*) format, however, I could not figure it out. Any help is much appreciated!
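I think you want to treat decid* as a wildcard pattern. == looks for an exact match; you can use grepl to look for a pattern instead. Note that in a regular expression * is a quantifier, so the raw pattern decid* would also match "undecided"; glob2rx() translates the wildcard into an anchored regex first.
I have used sapply as an alternative to the for loop.
result <- stack(sapply(unique(df1$Items), function(x) {
if(grepl('*', x, fixed = TRUE)) sum(grepl(glob2rx(x), df_main$words))
else sum(x == df_main$words)
}))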
result
# values ind
#1 2 decid*
#2 1 head
#3 1 heads
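To see why the wildcard needs translating, here is a small sketch comparing the raw pattern with the output of utils::glob2rx() on the question's word list. The variable names raw_hits and glob_hits are only illustrative:

```r
words <- c("head", "heads", "decided", "decides", "top", "undecided")

# Raw regex: "decid*" means "deci" followed by zero or more "d",
# matched anywhere in the string, so "undecided" is a hit too
raw_hits <- grepl("decid*", words)

# glob2rx() turns the shell-style wildcard into a regex anchored
# at the start of the string
glob_hits <- grepl(utils::glob2rx("decid*"), words)

print(words[raw_hits])   # "decided" "decides" "undecided"
print(words[glob_hits])  # "decided" "decides"
```

The anchored version gives the count of 2 that the question expects, while the raw pattern would give 3.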
Using tidyverse
library(dplyr)
library(stringr)
df1 %>%
rowwise %>%
mutate(count =sum(str_detect(df_main$words,
str_c("\\b", str_replace(Items, fixed("*"), ".*" ), "\\b")))) %>%
ungroup
Output:
# A tibble: 3 × 2
Items count
<chr> <int>
1 decid* 2
2 head 1
3 heads 1
Perhaps as an alternative approach altogether: instead of creating a new data frame word_freq, why not create a new column in df_main (if that's your "main" data frame) that indicates whether each row matches any of your (apparently key) Items? That column will not actually contain counts, because each row of the input column words holds only a single word. So the question is not how many matches there are for each row but whether there is a match in the first place. That can be indicated by grepl in base R or str_detect in stringr.
EDIT:
Given the newly posted input data
Items <- c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)
and the OP's wish to have the matches in df_main, the solution might be this:
library(stringr)
df_main$Items_match <- +str_detect(df_main$words, str_c(Items, collapse = "|"))
Result:
df_main
words Items_match
1 head 1
2 heads 1
3 decided 1
4 decides 1
5 top 0
6 undecided 1

Search a word in a sentence and represent it as new feature

I am trying to identify the sentences that contain a particular word (e.g. "awesome") in a list of sentences in a data frame in R. If the word is present in a sentence, I want to add another column to that data frame holding 1 for present and 0 for not present.
Reviews: contains_awesome
Today is an awesome day. 1
The book is good. 0
Awesome weather 1
I tried for a particular review as:
grep("awesome", tolower(df$Reviews[1])) # returned output as 1
I want to apply this on each sentence in my dataframe to have corresponding value of 0 and 1 in "contains_awesome" column. Please guide, if i should run a for-loop here, but that might be expensive with huge dataset, how should i go for it? I am not very used to R syntax.
grepl is vectorized, so it can be applied directly to the whole column:
df$contains_awesome <- as.integer(grepl("awesome", df$Reviews, ignore.case = TRUE))
df$contains_awesome
#[1] 1 0 1
data
df <- structure(list(Reviews = c("Today is an awesome day.", "The book is good.",
"Awesome weather")), class = "data.frame", row.names = c(NA,
-3L))
grep returns index of matches
grep('awesome', df$Reviews, ignore.case = TRUE)
#[1] 1 3
Using grepl here is straightforward, since it returns output of the same length as the input, which makes it easy to add as a new column. But if you want to use grep, here are a couple of approaches.
df$contains_awesome <- +(with(df, seq_along(Reviews) %in%
grep('awesome', Reviews, ignore.case = TRUE)))
df
# Reviews contains_awesome
#1 Today is an awesome day. 1
#2 The book is good. 0
#3 Awesome weather 1
Or with match
df$contains_awesome <- +(!is.na(match(1:nrow(df),
grep('awesome', df$Reviews, ignore.case = TRUE))))
The + in the beginning converts logical values TRUE/FALSE to 1/0 respectively.
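As a quick check, the approaches above can be compared side by side. This sketch rebuilds the sample data and confirms that unary +, as.integer, and the grep-based index test give the same 0/1 vector; the names via_plus, via_int and via_grep are only illustrative:

```r
df <- data.frame(Reviews = c("Today is an awesome day.",
                             "The book is good.",
                             "Awesome weather"),
                 stringsAsFactors = FALSE)

# Logical vector, one element per row
hit <- grepl("awesome", df$Reviews, ignore.case = TRUE)

via_plus <- +hit                 # unary plus coerces logical to integer
via_int  <- as.integer(hit)      # explicit coercion
via_grep <- as.integer(seq_along(df$Reviews) %in%
                         grep("awesome", df$Reviews, ignore.case = TRUE))

print(via_int)  # 1 0 1
```

as.integer is the more explicit spelling; +hit is shorter but easier to misread.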

R: Update Column Based on Text Condition from Another Column

I would like to make a new column in my data frame by using a conditional statement that would say "If Column_y contains Column_x then 1 else 0"
For example:
Event Name Winner Loser New Column
1 James James,Bob John,Steve 1
1 Bob James,Bob John,Steve 1
1 John James,Bob John,Steve 0
1 Steve James,Bob John,Steve 0
I want to have New Column<- "If Winner contains Name then 1 else 0"
Keep in mind this is for 100,000 rows and probably 700 unique names. When I try things like
df$NewColumn<-ifelse(grepl(df$Name,df$Winner)==TRUE,1,0)
or variations I get the "pattern has a length > 1" error.
I think you just want to compare the Name column against the Winner column:
df$NewColumn <- ifelse(df$Name == df$Winner, 1, 0)
Note that because df$Name == df$Winner is actually a boolean expression, you might also be able to simplify to:
df$NewColumn <- df$Name == df$Winner
Exact string matching only works when Winner holds a single name. In your example Winner is a comma-separated list such as "James,Bob", so a "contains" check is needed.
Implementing the contains condition would be something like this:
library(dplyr)
library(purrr)
df = df %>%
dplyr::mutate(NewColumn = purrr::map2_dbl(.x=Winner,.y=Name,~ifelse(grepl(.y,.x),1,0)))
Adding an alternate solution with stringr:
library(stringr)
df = df %>%
dplyr::mutate(NewColumn=ifelse(str_detect(Winner,Name),1,0))
Let me know if this works.
P.S.: str_detect is faster.
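One caveat with pattern-based checks is that a name like "Bob" would also match inside "Bobby". A possible base R sketch that avoids this, assuming Winner is always a comma-separated list of names, is to split the list and test exact membership:

```r
# Sample data modelled on the question's table
df <- data.frame(
  Name   = c("James", "Bob", "John", "Steve"),
  Winner = rep("James,Bob", 4),
  stringsAsFactors = FALSE
)

# Split each Winner string into its individual names
winners <- strsplit(df$Winner, ",", fixed = TRUE)

# Row by row: is Name exactly one of the (whitespace-trimmed) winners?
df$NewColumn <- as.integer(
  mapply(function(name, team) name %in% trimws(team), df$Name, winners)
)

print(df)
```

Because %in% compares whole strings, regex metacharacters in names do not need escaping.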

Count 1st instance of keyword in list with no duplicate counts in R

I have a list of keywords:
library(stringr)
words <- as.character(c("decomposed", "no diagnosis","decomposition","autolysed","maggots", "poor body", "poor","not suitable", "not possible"))
I want to match these keywords to text in a data frame column (df$text) and count the number of times a keyword occurs in a different data.frame (matchdf):
matchdf<- data.frame(Keywords=words)
m_match<-sapply(1:length(words), function(x) sum(str_count(tolower(df$text),words[[x]])))
matchdf$matchs<-m_match
However, I've noticed that this method counts EACH occurrence of a keyword within a column. eg)
"The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time"
Would then return a count of 2. However, I only want to count the first instance of "decomposed" within a field.
I thought there would be a way to only count the first instance using str_count but there doesn't seem to be one.
stringr isn't strictly necessary in this example; grepl from base R will suffice. That said, use str_detect instead of grepl if you prefer the package function (as pointed out by @Chi-Pak in a comment).
library(stringr)
words <- c("decomposed", "no diagnosis","decomposition","autolysed","maggots",
"poor body", "poor","not suitable", "not possible")
df <- data.frame( text = "The sample was too decomposed to perform an analysis. The decomposed sample indicated that this animal was dead for a long time")
matchdf <- data.frame(Keywords = words, stringsAsFactors = FALSE)
# Base R grepl: sum() counts the rows of df$text that contain the keyword at least once
matchdf$matches1 <- sapply(seq_along(words), function(x) sum(grepl(words[x], tolower(df$text))))
# Stringr function: same logic with str_detect
matchdf$matches2 <- sapply(seq_along(words), function(x) sum(str_detect(tolower(df$text), words[[x]])))
matchdf
Result
Keywords matches1 matches2
1 decomposed 1 1
2 no diagnosis 0 0
3 decomposition 0 0
4 autolysed 0 0
5 maggots 0 0
6 poor body 0 0
7 poor 0 0
8 not suitable 0 0
9 not possible 0 0
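With more than one row of text, the distinction between counting occurrences and counting rows becomes visible. The following base R sketch uses made-up sample sentences (an assumption, not the asker's data) and a shortened keyword list:

```r
words <- c("decomposed", "maggots", "poor")

# Hypothetical sample text: row 1 mentions "decomposed" twice
df <- data.frame(text = c(
  "The sample was too decomposed. The decomposed sample was old.",
  "Maggots present; carcass decomposed.",
  "Sample in good condition."
), stringsAsFactors = FALSE)

txt <- tolower(df$text)

# grepl gives TRUE/FALSE per row, so sum() counts each row at most once
rows_with_keyword <- vapply(words,
                            function(w) sum(grepl(w, txt, fixed = TRUE)),
                            integer(1))

print(rows_with_keyword)
```

Here "decomposed" scores 2 (two rows contain it), even though it occurs three times in total.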

R - How to build a list a character pattern in data.frame column

In R, I have the following column in my data frame...
bodyweight
65lbs
72kgs
20kgs
30lbs
.
.
.
I want to convert it into a column with weight in numeric common unit (kgs).
Have managed to extract the numeric values from the column by using grep() to remove all non-numeric characters. However to convert lbs values in to kgs I need to have another column showing all cells where lbs is present. So the output would be something as follows...
bodyweight_lbs
1
0
0
1
...
How do I get this output?
When I use grep('lbs',data$bodyweight) it returns the indices of all lbs entries in the entire column rather than a per-row flag.
You could do like this,
> df <- data.frame(bodyweight=c("65lbs", "72kgs", "20kgs", "30lbs"))
> df$weight_lbs <- ifelse(grepl("lbs$", as.character(df$bodyweight)), 1, 0)
> df
bodyweight weight_lbs
1 65lbs 1
2 72kgs 0
3 20kgs 0
4 30lbs 1
grepl("lbs$", as.character(df$bodyweight)) returns TRUE only if the vector element ends with the substring lbs; otherwise it returns FALSE. Passing this to ifelse creates a new column called weight_lbs that holds 1 if the corresponding string in the bodyweight column ends with lbs and 0 otherwise.
OR
df <- data.frame(bodyweight=c("65lbs", "72kgs", "20kgs", "30lbs"))
df$weight_lbs <- as.numeric(grepl("lbs$", as.character(df$bodyweight)))
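Building on the flag, the conversion to a common unit that the question is ultimately after might look like the sketch below. It assumes every entry ends in either lbs or kgs; the column name weight_kg and the constant name LB_TO_KG are illustrative choices:

```r
df <- data.frame(bodyweight = c("65lbs", "72kgs", "20kgs", "30lbs"),
                 stringsAsFactors = FALSE)

# TRUE where the entry is given in pounds
is_lbs <- grepl("lbs$", df$bodyweight)

# Strip the unit suffix and parse the remaining number
value <- as.numeric(sub("(lbs|kgs)$", "", df$bodyweight))

# One pound is defined as exactly 0.45359237 kg
LB_TO_KG <- 0.45359237

# Convert only the pound entries; kg entries pass through unchanged
df$weight_kg <- ifelse(is_lbs, value * LB_TO_KG, value)

print(df)
```

Entries with any other suffix would come out as NA from as.numeric, which makes unexpected units easy to spot.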
