Remove first n words and take count - r

I have a data frame with a text column. I need to ignore the first 2 words of each string and then count how often each remaining string occurs in that column.
b <- data.frame(text = c("hello sunitha what can I do for you?",
"hi john what can I do for you?"))
Expected output for data frame 'b': after removing the first 2 words, the count of 'what can I do for you?' should be 2.

You can use gsub to remove the first two words and then tapply and count, i.e.
i1 <- gsub("^\\w*\\s*\\w*\\s*", "", b$text)
tapply(i1, i1, length)
#what can I do for you?
# 2
If you need to remove any range of words, we can amend i1 as follows,
i1 <- sapply(strsplit(as.character(b$text), ' '), function(i)paste(i[-c(2:4)], collapse = ' '))
tapply(i1, i1, length)
#hello I do for you? hi I do for you?
# 1 1
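To generalise the idea above to any number of leading words, here is a minimal base R sketch. The helper name drop_first_words is hypothetical (it is not part of the answer), and it assumes words are separated by whitespace.

```r
# Hypothetical helper: drop the first n whitespace-separated words of each string.
drop_first_words <- function(x, n = 2) {
  words <- strsplit(trimws(as.character(x)), "\\s+")
  vapply(words, function(w) {
    # keep only the words whose position is greater than n
    paste(w[seq_along(w) > n], collapse = " ")
  }, character(1))
}

text <- c("hello sunitha what can I do for you?",
          "hi john what can I do for you?")
rest <- drop_first_words(text, n = 2)
counts <- table(rest)  # one entry per distinct remaining string
counts
```

Indexing with seq_along(w) > n instead of w[-seq_len(n)] keeps the helper safe for n = 0, where negative indexing with an empty vector would drop every word.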

library(magrittr)  # provides %>%
b=data.frame(text=c("hello sunitha what can I do for you?","hi john what can I do for you?"),stringsAsFactors = FALSE)
b$processed = sapply(b$text, function(x) (strsplit(x," ")[[1]]%>%.[-c(1:2)])%>%paste0(.,collapse=" "))
b$count = sapply(b$processed, function(x) length(strsplit(x," ")[[1]]))
> b
  text                                  processed               count
1 hello sunitha what can I do for you?  what can I do for you?  6
2 hi john what can I do for you?        what can I do for you?  6
Are you looking for something like this? Watch out for stringsAsFactors = FALSE; otherwise your texts will be factors (the default before R 4.0.0) and harder to work with.

Related

Exact Matching text with dataframe column in r

I have a vector of words in R:
words = c("Awesome","Loss","Good","Bad")
And I have the following dataframe in R:
df <- data.frame(ID = c(1,2,3),
Response = c("Today is an awesome day",
"Yesterday was a bad day,but today it is good",
"I have losses today"))
What I want is for the words that exactly match in the Response column to be extracted and inserted into a new column in the data frame. The final output should look like this:
ID  Response                                      Match
1   Today is an awesome day                       Awesome
2   Yesterday was a bad day,but today it is good  Bad,Good
3   I have losses today                           NA
I used the following code:
# extract the list of matching words
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
# paste the matching words together
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))
But this returns partial matches (for example, "Loss" matches "losses"), not exact ones. Please help.
If you use anchors in your words vector, you can force a stricter match: ^ asserts that you're at the start of the string, $ that you're at the end of the string, so an anchored pattern only matches when the whole Response is that word. So:
words = c("Awesome","^Loss$","Good","Bad")
Then use your code:
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))
which gives:
> df
ID Response Words
1 1 Today is an awesome day Awesome
2 2 Yesterday was a bad day,but today it is good Good,Bad
3 3 I have losses today
To turn blanks to NA:
df$Words[df$Words == ""] <- NA
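As a small illustration of how anchors differ from word boundaries, here is a sketch on made-up responses (the responses vector below is an assumption, not the question's data). ^ and $ tie the pattern to the whole string, while \\b only requires a word boundary on each side.

```r
# Made-up responses for illustration only
responses <- c("loss", "I have a loss today", "I have losses today")

# Anchored: matches only when the entire string is "loss"
anchored <- grepl("^loss$", responses)

# Word boundaries: matches "loss" as a whole word anywhere in the string
bounded <- grepl("\\bloss\\b", responses)

anchored
bounded
```

The anchored pattern rejects "I have a loss today" even though it contains the exact word, which may or may not be what you want.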
We can use str_extract_all
library(stringr)
library(dplyr)
library(purrr)
df %>%
mutate(Words = map_chr(str_extract_all(Response, str_c("(?i)\\b(",
str_c(words, collapse="|"), ")\\b")), toString))
# ID Response Words
#1 1 Today is an awesome day awesome
#2 2 Yesterday was a bad day,but today it is good bad, good
#3 3 I have losses today
data
words <- c("Awesome","Loss","Good","Bad")
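If you would rather avoid the extra packages, roughly the same extraction can be sketched in base R with gregexpr() and regmatches(). The function name extract_words is hypothetical, and the sketch assumes the words contain no regex metacharacters.

```r
# Hypothetical base R counterpart of the str_extract_all() approach
extract_words <- function(text, words) {
  text <- as.character(text)
  # one alternation wrapped in word boundaries, e.g. \b(Awesome|Loss)\b
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  hits <- regmatches(text, gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE))
  out <- vapply(hits, function(h) paste(unique(h), collapse = ","), character(1))
  out[out == ""] <- NA_character_  # no match becomes NA
  out
}

responses <- c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today")
matched <- extract_words(responses, c("Awesome", "Loss", "Good", "Bad"))
matched
```

Like str_extract_all(), this returns the matches as they appear in the text (lowercase here) and in order of appearance, not in the order of the words vector.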
Change the function in the first sapply call to a two-line function. If the regex becomes "\\bword\\b", it matches the word only when it is surrounded by word boundaries.
x <- sapply(words, function(x) {
y <- paste0("\\b", x, "\\b")
grepl(tolower(y), tolower(df$Response))
})
Now run the second apply as posted in the question.
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))
df
# ID Response Words
#1 1 Today is an awesome day Awesome
#2 2 Yesterday was a bad day,but today it is good Good,Bad
#3 3 I have losses today
As for the NAs, I will use the function is.na<-.
is.na(df$Words) <- df$Words == ""
Data.
df <- read.table(text = "
ID Response
1 'Today is an awesome day'
2 'Yesterday was a bad day,but today it is good'
3 'I have losses today'
", header = TRUE)
words <- c("Awesome","Loss","Good","Bad")

Counting number of words between a predefined delimiter

What's the best way to count number of words between a predefined delimiter (in my case '/')?
Dataset:
df <- data.frame(v1 = c('A DOG//1//',
'CAT/WHITE///',
'A HORSE/BROWN & BLACK/2//',
'DOG////'))
Expected results are the following numbers..
2 (which are A DOG and 1)
2 (which are CAT and WHITE)
3 (A HORSE, BROWN & BLACK, 2)
1 (DOG)
Thank you!
strsplit at one or more slash ("/+") and count strings
lengths(strsplit(as.character(df$v1), "/+"))
#[1] 2 2 3 1
Assuming your data doesn't have cases where a string (a) begins with "/" or (b) doesn't end with "/", you can just count the runs of consecutive slashes to get the number of chunks between slashes. So the following works for the data you've provided.
stringr::str_count(df$v1, "/+")
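To see why those two assumptions matter, here is a small base R sketch on made-up strings that violate them. The helper names count_fields and count_slash_runs are hypothetical; count_slash_runs is a base R stand-in for str_count(x, "/+").

```r
# Count non-empty chunks between slashes
count_fields <- function(x) {
  parts <- strsplit(as.character(x), "/", fixed = TRUE)
  vapply(parts, function(p) sum(nzchar(p)), integer(1))
}

# Count runs of one or more slashes
count_slash_runs <- function(x) {
  x <- as.character(x)
  lengths(regmatches(x, gregexpr("/+", x)))
}

# Second string starts with "/", third does not end with "/"
x <- c("A DOG//1//", "/CAT/WHITE/", "CAT/WHITE")
fields <- count_fields(x)
runs <- count_slash_runs(x)
fields
runs
```

The two counts agree on the first string but diverge on the other two, so counting slash runs is only a shortcut when every string has the same shape as the sample data.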
Using stringr::str_split() and counting the number of nonblank strings...
df <- data.frame(v1 = c('A DOG//1//',
'CAT/WHITE///',
'A HORSE/BROWN & BLACK/2//',
'DOG////'))
sapply(stringr::str_split(df$v1, '/'), function(x) sum(x != ''))
#[1] 2 2 3 1

R Exact match strings in two columns

I have a data frame of the following form:
Column1 = c('Elephant,Starship Enterprise,Cat','Random word','Word','Some more words, Even more words')
Column2=c('Rat,Starship Enterprise,Elephant','Ocean','No','more')
d1 = data.frame(Column1,Column2)
What I want to do is to look for and count the exact match of words in column 1 and column 2. Each column can have multiple words separated by a comma.
For example, in row 1 there are two common words: a) Starship Enterprise and b) Elephant. However, in row 4, even though the word "more" appears in both columns, the exact strings (Some more words and Even more words) do not appear in Column2. The expected output would be something like this.
Any help will be appreciated.
Split columns on comma and count the intersection of words
mapply(function(x, y) length(intersect(x, y)),
strsplit(d1$Column1, ","), strsplit(d1$Column2, ","))
#[1] 2 0 0 0
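One thing to watch for: row 4 of Column1 has a space after the comma, so the second entry is " Even more words" with a leading space. A minimal sketch that trims whitespace before intersecting (the function name count_common and the second test row are assumptions for illustration):

```r
# Hypothetical helper: count entries shared by two comma-separated columns,
# ignoring whitespace around each entry
count_common <- function(a, b) {
  split_trim <- function(x) {
    lapply(strsplit(as.character(x), ",", fixed = TRUE), trimws)
  }
  mapply(function(p, q) length(intersect(p, q)),
         split_trim(a), split_trim(b))
}

col1 <- c("Elephant,Starship Enterprise,Cat", "Some more words, Even more words")
col2 <- c("Rat,Starship Enterprise,Elephant", "Even more words")
common <- count_common(col1, col2)
common
```

Without trimws(), the second row would count 0, because " Even more words" and "Even more words" are different strings. The as.character() call also keeps this working if the columns are factors.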
Or a tidyverse way
library(tidyverse)
d1 %>%
mutate(Common = map2_dbl(Column1, Column2, ~
length(intersect(str_split(.x, ",")[[1]], str_split(.y, ",")[[1]]))))
# Column1 Column2 Common
#1 Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant 2
#2 Random word Ocean 0
#3 Word No 0
#4 Some more words, Even more words more 0
We can do this with cSplit
library(splitstackshape)
library(data.table)
v1 <- cSplit(setDT(d1, keep.rownames = TRUE), 2:3, ",", "long")[,
length(intersect(na.omit(Column1), na.omit(Column2))), rn]$V1
d1[, Common := v1][, rn := NULL][]
# Column1 Column2 Common
#1: Elephant,Starship Enterprise,Cat Rat,Starship Enterprise,Elephant 2
#2: Random word Ocean 0
#3: Word No 0
#4: Some more words, Even more words more 0

How to split a sentence in two halves in R

I have a vector of strings, and I want each string to be cut roughly in half, at the nearest space.
For example, with the following data:
test <- data.frame(init = c("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk",
"qsdf",
"mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll",
"qsddddddddddddddddddddddddddddddd",
"qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"), stringsAsFactors = FALSE)
I want to get something like this :
first sec
1 qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk
2 qsdf
3 mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd
5 lmj mjjmjmjm lkj lmj mjjmjmjm lkj
Any solution that does not cut in halves but splits "so that the first part isn't longer than X characters" would also be great.
First, we split the strings by spaces.
a <- strsplit(test$init, " ")
Then we find the last element of each vector for which the cumulative sum of characters is lower than half the sum of all characters in the vector:
b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x)) <= sum(nchar(x))/2)))
Afterwards we combine the two halves, substituting NA if the vector was of length 1 (only one word).
combined <- Map(function(x, y){
  # test the number of words, not y: y is also 1 when the first word alone exceeds half the characters
  if(length(x) == 1){
    return(c(x, NA))
  }else{
    return(c(paste(x[1:y], collapse = " "), paste(x[(y+1):length(x)], collapse = " ")))
  }
}, a, b)
Finally, we rbind the combined strings and change the column names.
newdf <- do.call(rbind.data.frame, combined)
names(newdf) <- c("first", "second")
Result:
> newdf
  first                              second
1 qsdf mqsldkfop mqsdfmlk            lksdfp pqpdfm mqsdfmj mlk
2 qsdf                               <NA>
3 mp mlksdfm                         mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd  <NA>
5 qsdfmlk mlk                        mkljlmkjlmkjml lmj mjjmjmjm lkj
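For the variant asked about in the question ("so that the first part isn't longer than X characters"), a minimal base R sketch could look like this. The function name split_at_width is hypothetical, and it assumes words are separated by single spaces; a first word longer than the limit is kept whole rather than cut.

```r
# Hypothetical helper: split one string at the last space such that the
# first part has at most max_chars characters
split_at_width <- function(x, max_chars) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  if (length(words) <= 1) return(c(first = x, second = NA_character_))
  # length of the first k words once joined by single spaces
  joined_len <- cumsum(nchar(words)) + seq_along(words) - 1
  # number of leading words that fit; keep at least one word
  k <- max(1, sum(joined_len <= max_chars))
  if (k == length(words)) return(c(first = x, second = NA_character_))
  c(first  = paste(words[seq_len(k)], collapse = " "),
    second = paste(words[(k + 1):length(words)], collapse = " "))
}

res <- split_at_width("qsdf mqsldkfop mqsdfmlk lksdfp", max_chars = 15)
res
```

To apply it to a whole column, something like t(sapply(test$init, split_at_width, max_chars = 15)) gives a two-column character matrix.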
You can use the function nbreak from the package that I wrote:
devtools::install_github("igorkf/breaker")
library(tidyverse)
test <- data.frame(init = c("Phrase with four words", "That phrase has five words"), stringsAsFactors = F)
#This counts the numbers of words of each row:
nwords = str_count(test$init, " ") + 1
#This is the word position at which to break the line for each row:
break_here = ifelse(nwords %% 2 == 0, nwords/2, round(nwords/2) + 1)
test
# init
# 1 Phrase with four words
# 2 That phrase has five words
#the map2_chr is applying a function with two arguments,
#the string is "init" and the n is "break_here":
test %>%
mutate(init = map2_chr(init, break_here, ~breaker::nbreak(string = .x, n = .y, loop = F))) %>%
separate(init, c("first", "second"), sep = "\n")
#   first            second
# 1 Phrase with      four words
# 2 That phrase has  five words

R remove multiple text strings in data frame

New to R. I am looking to remove certain words from a data frame. Since there are multiple words, I would like to define this list of words as a string, and use gsub to remove. Then convert back to a dataframe and maintain same structure.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
a
id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"
I was thinking something like:
a2 <- apply(a, 1, gsub(wordstoremove, "", a)
but clearly this doesn't work. Afterwards I would convert the result back to a data frame.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))
# id text time username
# 1 1 ai and x 10 me
# 2 2 and computing 5 you
# 3 3 nothing 15 everyone
# 4 4 ibm privacy 0 know
(dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste(wordstoremove, collapse = '|'), '', x))))
# id text time username
# 1 1 and x 10 me
# 2 2 and 5 you
# 3 3 nothing 15 everyone
# 4 4 0 know
Another option using dplyr::mutate() and stringr::str_remove_all():
library(dplyr)
library(stringr)
dat <- dat %>%
mutate(text = str_remove_all(text, regex(str_c("\\b",wordstoremove, "\\b", collapse = '|'), ignore_case = T)))
Because lowercase 'ai' could easily be part of a longer word, the words to remove are wrapped in \\b so that they are not removed from the beginning, middle, or end of other words.
The search pattern is also wrapped with regex(pattern, ignore_case = T) in case some words are capitalized in the text string.
str_replace_all() could be used if you wanted to replace the words with something other than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
rawr's answer could be updated to:
dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = T)))
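Note that running sapply() over every column and converting back turns all columns, including id and time, into character. If you want to keep the column types, a base R sketch that only touches character columns and tidies leftover spaces could look like this (the function name remove_words is hypothetical, and it assumes the words contain no regex metacharacters):

```r
# Hypothetical helper: remove whole words from character columns only
remove_words <- function(df, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  is_text <- vapply(df, is.character, logical(1))
  df[is_text] <- lapply(df[is_text], function(col) {
    cleaned <- gsub(pattern, "", col, ignore.case = TRUE, perl = TRUE)
    # collapse repeated spaces left behind and trim the ends
    trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
  })
  df
}

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
dat <- data.frame(id = 1:4,
                  text = c("ai and x", "and computing", "nothing", "ibm privacy"),
                  time = c(10, 5, 15, 0),
                  username = c("me", "you", "everyone", "know"),
                  stringsAsFactors = FALSE)
res <- remove_words(dat, wordstoremove)
res
```

The numeric columns pass through unchanged, and a row whose text consisted only of removed words ends up as an empty string.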
