Exact Matching text with dataframe column in r

I have a vector of words in R:
words = c("Awesome","Loss","Good","Bad")
And I have the following dataframe in R:
df <- data.frame(ID = c(1,2,3),
                 Response = c("Today is an awesome day",
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"))
I want to extract the words that match exactly (as whole words) in the Response column and put them in a new column of the data frame. The final output should look like this:
ID  Response                                      Match
1   Today is an awesome day                       Awesome
2   Yesterday was a bad day,but today it is good  Bad,Good
3   I have losses today                           NA
I used the following code:
# extract the list of matching words
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
# paste the matching words together
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))
It returns partial matches as well (for example, "Loss" matches "losses"), not only exact ones. Please help.

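You can add anchors to the entries of your words vector. Note that ^ asserts the start of the string and $ the end of the string (not of a word), so "^Loss$" only matches a response that consists of the single word "loss". That rules out "losses" here, but it would also miss "loss" inside a longer sentence; the word boundary \\b is the more general tool for whole-word matching. So: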
words = c("Awesome","^Loss$","Good","Bad")
Then use your code:
x <- sapply(words, function(x) grepl(tolower(x), tolower(df$Response)))
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))
which gives:
> df
  ID                                     Response    Words
1  1                      Today is an awesome day  Awesome
2  2 Yesterday was a bad day,but today it is good Good,Bad
3  3                          I have losses today
To turn blanks to NA:
df$Words[df$Words == ""] <- NA
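As a minimal sketch of how string anchors differ from word boundaries (the responses vector below is made up for illustration and is not part of the question's data):

```r
# Hypothetical inputs, chosen only to contrast the two patterns
responses <- c("loss", "I have a loss today", "I have losses today")

# ^ and $ anchor the pattern to the start and end of the whole string
anchored <- grepl("^loss$", responses)

# \\b marks a word boundary, so "loss" is found as a whole word anywhere
bounded <- grepl("\\bloss\\b", responses)

print(anchored)
print(bounded)
```

The anchored pattern only accepts the first string, while the word-boundary pattern accepts the first two; neither matches "losses".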

We can use str_extract_all
library(stringr)
library(dplyr)
library(purrr)
df %>%
  mutate(Words = map_chr(str_extract_all(Response,
    str_c("(?i)\\b(", str_c(words, collapse = "|"), ")\\b")), toString))
# ID Response Words
#1 1 Today is an awesome day awesome
#2 2 Yesterday was a bad day,but today it is good bad, good
#3 3 I have losses today
data
words <- c("Awesome","Loss","Good","Bad")
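For readers who prefer to avoid extra packages, the same idea can probably be expressed in base R. This is a sketch under the assumption that the data are as in the question; it uses gregexpr() and regmatches() in place of str_extract_all(), and turns empty results into NA.

```r
words <- c("Awesome", "Loss", "Good", "Bad")
df <- data.frame(ID = c(1, 2, 3),
                 Response = c("Today is an awesome day",
                              "Yesterday was a bad day,but today it is good",
                              "I have losses today"),
                 stringsAsFactors = FALSE)

# One alternation pattern: (?i)\b(Awesome|Loss|Good|Bad)\b
pattern <- paste0("(?i)\\b(", paste(words, collapse = "|"), ")\\b")

# gregexpr() locates every match; regmatches() extracts the matched text
found <- regmatches(df$Response, gregexpr(pattern, df$Response, perl = TRUE))

# Collapse each row's matches into one string; no match gives "" and then NA
df$Words <- vapply(found, toString, character(1))
df$Words[df$Words == ""] <- NA
print(df)
```

As with str_extract_all(), the matches come back in the order and case in which they appear in the text, not as spelled in words.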

Change the function in the first *apply call to a two-line function. If the regex becomes "\\bword\\b", it matches the word only when it is surrounded by word boundaries.
x <- sapply(words, function(x) {
  y <- paste0("\\b", x, "\\b")
  grepl(tolower(y), tolower(df$Response))
})
Now run the second apply as posted in the question.
df$Words <- apply(x, 1, function(i) paste0(names(i)[i], collapse = ","))
df
# ID Response Words
#1 1 Today is an awesome day Awesome
#2 2 Yesterday was a bad day,but today it is good Good,Bad
#3 3 I have losses today
As for the NAs, I will use the replacement function is.na<-.
is.na(df$Words) <- df$Words == ""
Data.
df <- read.table(text = "
ID Response
1 'Today is an awesome day'
2 'Yesterday was a bad day,but today it is good'
3 'I have losses today'
", header = TRUE)
words <- c("Awesome","Loss","Good","Bad")
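The two steps above could be wrapped in a small helper. This is only a sketch: the name match_words is hypothetical, and it uses ignore.case = TRUE instead of tolower() so that the pattern itself is left untouched.

```r
# Hypothetical helper: for each response, list the words that occur as whole words
match_words <- function(responses, words) {
  # One logical column per word: TRUE where the word occurs between boundaries
  hits <- vapply(words, function(w) {
    grepl(paste0("\\b", w, "\\b"), responses, ignore.case = TRUE)
  }, logical(length(responses)))
  # Force a matrix even when there is a single response
  hits <- matrix(hits, nrow = length(responses))
  out <- apply(hits, 1, function(i) paste(words[i], collapse = ","))
  out[out == ""] <- NA
  out
}

words <- c("Awesome", "Loss", "Good", "Bad")
responses <- c("Today is an awesome day",
               "Yesterday was a bad day,but today it is good",
               "I have losses today")
result <- match_words(responses, words)
print(result)
```

The words are assumed to contain no regex metacharacters; anything like "C++" would need escaping first.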

Related

order a list of strings in r

The data I have include two variables: id and income (a list of characters)
id <- seq(1,6)
income <- c("2322;5125",
"0110;2012",
"2212;0912",
"1012;0145",
"1545;1102",
"1010;2028")
df <- data.frame(id, income)
df$income <- as.character(df$income)
I need to add a third column, income_order, which contains the values of column income sorted within each string.
NOTE: I still need to keep the leading zeros.

We could split the string on ";", sort and paste the string back.
df$income_order <- sapply(strsplit(df$income, ";"), function(x)
  paste(sort(x), collapse = ";"))
df
# id income income_order
#1 1 2322;5125 2322;5125
#2 2 0110;2012 0110;2012
#3 3 2212;0912 0912;2212
#4 4 1012;0145 0145;1012
#5 5 1545;1102 1102;1545
#6 6 1010;2028 1010;2028
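The approach above sorts the pieces as character strings, which works here because every piece has four digits, so alphabetical and numeric order coincide. A hedged sketch for the case where pieces could have different widths (the helper name sort_pieces and the second example value are assumptions, not part of the question):

```r
# Hypothetical helper: order the pieces numerically but keep the original
# strings, so leading zeros survive
sort_pieces <- function(x, sep = ";") {
  vapply(strsplit(x, sep, fixed = TRUE), function(p) {
    paste(p[order(as.numeric(p))], collapse = sep)
  }, character(1))
}

income <- c("2212;0912", "100;0020;3")
sorted <- sort_pieces(income)
print(sorted)
```

Because order() only supplies the positions, the strings themselves are never converted, which is why "0020" keeps its zeros.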

We can use gsubfn
library(gsubfn)
df$income_order <- gsubfn("(\\d+);(\\d+)", ~ paste(sort(c(x, y)), collapse=";"), df$income)
df$income_order
#[1] "2322;5125" "0110;2012" "0912;2212" "0145;1012" "1102;1545" "1010;2028"

Remove first n words and take count

I have a data frame with a text column. I need to ignore (or remove) the first 2 words of each entry and then count how often each remaining string occurs in that column.
b <- data.frame(text = c("hello sunitha what can I do for you?",
                         "hi john what can I do for you?"))
Expected output for data frame 'b': after removing the first 2 words, the count of 'what can I do for you?' should be 2.

You can use gsub to remove the first two words and then tapply and count, i.e.
i1 <- gsub("^\\w*\\s*\\w*\\s*", "", b$text)
tapply(i1, i1, length)
#what can I do for you?
# 2
If you need to remove any range of words, we can amend i1 as follows,
i1 <- sapply(strsplit(as.character(b$text), ' '), function(i)paste(i[-c(2:4)], collapse = ' '))
tapply(i1, i1, length)
#hello I do for you? hi I do for you?
# 1 1
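To drop an arbitrary number of leading words, the regex can be built from n. A minimal sketch, assuming words are separated by whitespace; the helper name remove_first_n is hypothetical:

```r
# Hypothetical helper: remove the first n whitespace-separated words
remove_first_n <- function(x, n) {
  # \\S+ is a run of non-space characters (a "word"), \\s+ the gap after it
  sub(paste0("^(\\S+\\s+){", n, "}"), "", x, perl = TRUE)
}

b <- data.frame(text = c("hello sunitha what can I do for you?",
                         "hi john what can I do for you?"),
                stringsAsFactors = FALSE)

rest <- remove_first_n(b$text, 2)
counts <- table(rest)
print(counts)
```

Unlike \\w, the \\S class also covers punctuation attached to a word, so "hi," would count as one word. A string with fewer than n words followed by whitespace is left unchanged.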

library(magrittr)  # provides the %>% pipe used below
b = data.frame(text = c("hello sunitha what can I do for you?", "hi john what can I do for you?"), stringsAsFactors = FALSE)
b$processed = sapply(b$text, function(x) (strsplit(x, " ")[[1]] %>% .[-c(1:2)]) %>% paste0(., collapse = " "))
b$count = sapply(b$processed, function(x) length(strsplit(x, " ")[[1]]))
> b
text processed count
1 hello sunitha what can I do for you? what can I do for you? 6
2 hi john what can I do for you? what can I do for you? 6
Are you looking for something like this? Watch out for stringsAsFactors = FALSE; without it (in R versions before 4.0.0) your texts will be factors, which are harder to work with.

R - looking up strings and exclude based on other string

I could not find out how to flag a word in a data frame column while excluding rows in which another word is also present.
I have the following df:
words <- c("INSTANCE find", "LA LA LA", "instance during",
           "instance", "instance", "instance", "find instance")
df <- data.frame(words)
df$words_count <- grepl("instance", df$words, ignore.case = TRUE)
This flags every row that contains "instance". I have been trying to exclude any row in which the word "find" is present as well.
I could add another grepl to look for "find" and exclude rows based on that, but I am trying to limit the number of lines in my code.

I'm sure there's a solution using a single regular expression, but you could do
df$words_count <- Reduce(`-`, lapply(c('instance', 'find'), grepl, df$words, ignore.case = TRUE)) > 0
or
df$words_count <- Reduce(`&`, lapply(c('instance', '^((?!find).)*$'), grepl, df$words, perl = TRUE, ignore.case = TRUE))
This might be easier to read
library(tidyverse)
df$words_count <- c('instance', '^((?!find).)*$') %>%
  lapply(grepl, df$words, perl = TRUE, ignore.case = TRUE) %>%
  reduce(`&`)
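A single regular expression is possible with a negative lookahead. The following is a sketch rather than a definitive answer; it assumes that "find" and "instance" should be treated as whole words, which is slightly stricter than the substring patterns above.

```r
words <- c("INSTANCE find", "LA LA LA", "instance during",
           "instance", "instance", "instance", "find instance")
df <- data.frame(words, stringsAsFactors = FALSE)

# (?!.*\\bfind\\b) rejects the string if "find" occurs anywhere in it;
# the remainder requires "instance" as a whole word
pattern <- "^(?!.*\\bfind\\b).*\\binstance\\b"
df$words_count <- grepl(pattern, df$words, perl = TRUE, ignore.case = TRUE)
print(df)
```

Lookaheads need perl = TRUE. Dropping the \\b around find would make the exclusion apply to substrings such as "finding" as well.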

If all you need is the number of times "instance" appears in a string, negating all in that string if "find" is found anywhere:
df$counts <- sapply(gregexpr("\\binstance\\b", words, ignore.case=TRUE), function(a) length(a[a>0])) *
!grepl("\\bfind\\b", words, ignore.case=TRUE)
df
# words counts
# 1 INSTANCE find 0
# 2 LA LA LA 0
# 3 instance during 1
# 4 instance 1
# 5 instance 1
# 6 instance 1
# 7 find instance 0

R: How to separate values and subtract from next row in a table?

Right now, I have a table as so:
Time Jack Kate
1 105~100 88~99
2 100~107 90~91
3 101~99 98~91
(etc)
I want to split each value on "~", take the difference of the first values of Jack and Kate in the current row, and subtract from it the difference of the second values of Jack and Kate in the next row. So the first result is (105-88)-(107-91), the next is (100-90)-(99-91), etc.
I have:
splt <- strsplit(x, split = "~", fixed = TRUE)
I tried using tapply, but I don't know how to refer to each row as the function progresses. I apologize for my lack of knowledge, but I'm not sure how to go about this or whether tapply is the right function here.
Cheers

You had a good start with strsplit. After that, you can use the fact that the indexing operator "[" is a function that you can apply to the lists to get all of the individual numbers from your original strings.
## Your sample data
df = read.table(text="Time Jack Kate
1 105~100 88~99
2 100~107 90~91
3 101~99 98~91",
header=TRUE,
stringsAsFactors=FALSE)
Jack1 = as.numeric(sapply(strsplit(df$Jack, "~"), "[", 1))
Jack2 = as.numeric(sapply(strsplit(df$Jack, "~"), "[", 2))
Kate1 = as.numeric(sapply(strsplit(df$Kate, "~"), "[", 1))
Kate2 = as.numeric(sapply(strsplit(df$Kate, "~"), "[", 2))
Jack1
[1] 105 100 101
Now you can just compute the differences that you wanted.
(Jack1 - Kate1)[-length(Jack1)] - (Jack2 - Kate2)[-1]
[1] 1 2
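The four near-identical sapply calls can be folded into one helper that returns both numbers per row. This is a sketch under the assumption that every value contains exactly one "~"; the name split_pair is hypothetical.

```r
df <- data.frame(Time = 1:3,
                 Jack = c("105~100", "100~107", "101~99"),
                 Kate = c("88~99", "90~91", "98~91"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split "a~b" strings into a two-column numeric matrix
split_pair <- function(x) {
  parts <- do.call(rbind, strsplit(x, "~", fixed = TRUE))
  matrix(as.numeric(parts), ncol = 2)
}

jack <- split_pair(df$Jack)
kate <- split_pair(df$Kate)

first_diff  <- jack[, 1] - kate[, 1]  # first values, per row
second_diff <- jack[, 2] - kate[, 2]  # second values, per row

# Current row's first difference minus the next row's second difference
result <- head(first_diff, -1) - tail(second_diff, -1)
print(result)
```

head(x, -1) drops the last element and tail(x, -1) drops the first, which lines each row up with the row after it.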

R remove multiple text strings in data frame

New to R. I am looking to remove certain words from a data frame. Since there are multiple words, I would like to define the list of words as a character vector and use gsub to remove them, then convert the result back to a data frame with the same structure.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
a
id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"
I was thinking something like:
a2 <- apply(a, 1, gsub(wordstoremove, "", a)
but clearly this doesn't work, even before converting back to a data frame.

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))
# id text time username
# 1 1 ai and x 10 me
# 2 2 and computing 5 you
# 3 3 nothing 15 everyone
# 4 4 ibm privacy 0 know
(dat1 <- as.data.frame(sapply(dat, function(x)
gsub(paste(wordstoremove, collapse = '|'), '', x))))
# id text time username
# 1 1 and x 10 me
# 2 2 and 5 you
# 3 3 nothing 15 everyone
# 4 4 0 know

Another option using dplyr::mutate() and stringr::str_remove_all():
library(dplyr)
library(stringr)
dat <- dat %>%
  mutate(text = str_remove_all(text, regex(str_c("\\b", wordstoremove, "\\b", collapse = '|'), ignore_case = TRUE)))
Because lowercase 'ai' could easily be part of a longer word, the words to remove are wrapped in \\b so that they are not removed from the beginning, middle, or end of other words.
The search pattern is also wrapped with regex(pattern, ignore_case = TRUE) in case some words are capitalized in the text string.
str_replace_all() could be used if you wanted to replace the words with something other than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
rawr's answer could be updated to:
dat1 <- as.data.frame(sapply(dat, function(x)
  gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = TRUE)))
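Removing words tends to leave stray spaces behind, and applying gsub to every column turns numeric columns into text. A minimal base R sketch that restricts the replacement to the text column and tidies the whitespace afterwards (the data are rebuilt here from the question so the example is self-contained):

```r
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
dat <- data.frame(id = 1:4,
                  text = c("ai and x", "and computing", "nothing", "ibm privacy"),
                  time = c(10, 5, 15, 0),
                  username = c("me", "you", "everyone", "know"),
                  stringsAsFactors = FALSE)

# Whole-word alternation: \b(ai|computing|...)\b
pattern <- paste0("\\b(", paste(wordstoremove, collapse = "|"), ")\\b")

# Remove the words from the text column only, so other columns keep their types
cleaned <- gsub(pattern, "", dat$text, ignore.case = TRUE)

# Collapse runs of whitespace and trim the ends
dat$text <- trimws(gsub("\\s+", " ", cleaned))
print(dat)
```

Rows whose text consisted only of removed words end up as empty strings; whether those should become NA depends on the use case.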
