R: remove multiple text strings in a data frame

New to R. I am looking to remove certain words from a data frame. Since there are multiple words, I would like to define this list of words as a character vector and use gsub to remove them, then convert back to a data frame that keeps the same structure.
wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
a
id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"
I was thinking something like:
a2 <- apply(a, 1, gsub(wordstoremove, "", a))
but clearly this doesn't work, even before converting back to a data frame.

wordstoremove <- c("ai", "computing", "ulitzer", "ibm", "privacy", "cognitive")
(dat <- read.table(header = TRUE, text = 'id text time username
1 "ai and x" 10 "me"
2 "and computing" 5 "you"
3 "nothing" 15 "everyone"
4 "ibm privacy" 0 "know"'))
# id text time username
# 1 1 ai and x 10 me
# 2 2 and computing 5 you
# 3 3 nothing 15 everyone
# 4 4 ibm privacy 0 know
(dat1 <- as.data.frame(sapply(dat, function(x)
  gsub(paste(wordstoremove, collapse = '|'), '', x))))
# id text time username
# 1 1 and x 10 me
# 2 2 and 5 you
# 3 3 nothing 15 everyone
# 4 4 0 know
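Note that sapply() runs gsub() over every column here, so id and time come back as character vectors once the result is rebuilt with as.data.frame(). If only the text column needs cleaning, a narrower variant (a sketch, not part of the original answer) keeps the other column types intact:
dat$text <- gsub(paste(wordstoremove, collapse = '|'), '', dat$text)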

Another option using dplyr::mutate() and stringr::str_remove_all():
library(dplyr)
library(stringr)
dat <- dat %>%
  mutate(text = str_remove_all(text,
    regex(str_c("\\b", wordstoremove, "\\b", collapse = '|'), ignore_case = T)))
Because lowercase 'ai' could easily be part of a longer word, the words to remove are bound with \\b so that they are not removed from the beginning, middle, or end of other words.
The search pattern is also wrapped with regex(pattern, ignore_case = T) in case some words are capitalized in the text string.
str_replace_all() could be used if you wanted to replace the words with something other than just removing them. str_remove_all() is just an alias for str_replace_all(string, pattern, '').
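For example, a sketch that swaps removal for a placeholder (the '[removed]' marker is purely illustrative):
str_replace_all(dat$text,
                regex(str_c("\\b", wordstoremove, "\\b", collapse = '|'), ignore_case = T),
                "[removed]")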
rawr's answer could be updated to:
dat1 <- as.data.frame(sapply(dat, function(x)
  gsub(paste0('\\b', wordstoremove, '\\b', collapse = '|'), '', x, ignore.case = T)))
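One small follow-up: deleting words leaves their surrounding spaces behind (e.g. "ai and x" becomes " and x"), so an optional cleanup pass may help (a sketch, not part of the original answer):
dat1$text <- trimws(gsub("\\s+", " ", dat1$text))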

Related

Match strings between two data frames and add missing strings to strings which are not complete in R

I have two data frames, each with 4 different columns in different orders. Data frame 1 has an ID column whose strings carry the complete and correct name. Data frame 2 has an ID column which is not complete (ID_not_complete). Data frame 1 has more rows and contains 100% of the strings from data frame 2. I would like to add the missing parts to the ID_not_complete column in data frame 2. As in my example, a string in data frame 2 can have multiple matches in data frame 1, but only one matches the exact length of the string:
rs1725 --> AX-42144793569__rs1725
rs1725 --> AX-42179369__rs1725074
The first option should be the correct one.
Data frame 1
ID<-c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725","AX-42749369__rs264930","AX-32893019__rs6114382")
ID<-as.data.frame(ID)
Data frame 2
ID_not_complete<-c("rs16896864","rs289621","rs10261724","rs1725074","rs1725")
ID_not_complete <-as.data.frame(ID_not_complete)
The output data frame 2 should look like:
ID_complete<-c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725")
ID_complete <-as.data.frame(ID_complete)
I think I need to use grep, but I really don't know how to do it for each value in a column.
The fuzzyjoin package allows for joining on regular expressions (patterns).
A first (flawed) approach is:
fuzzyjoin::regex_inner_join(ID, ID_not_complete, by = c(ID="ID_not_complete"))
# ID ID_not_complete
# 1 AX-35388475__rs16896864 rs16896864
# 2 AX-11425569__rs289621 rs289621
# 3 AX-11102771__rs10261724 rs10261724
# 4 AX-42179369__rs1725074 rs1725074
# 5 AX-42179369__rs1725074 rs1725
# 6 AX-42144793569__rs1725 rs1725
where rs1725 matches both rs1725 and rs1725074 (matching leading characters). I'll infer that you don't mean for this to happen, so a quick fix using some additional boundary-like patterns (also correcting for your data having spaces):
ID_not_complete$ptn <- paste0("(^|[\\s_])", ID_not_complete$ID_not_complete, "([\\s_]|$)")
fuzzyjoin::regex_inner_join(ID, ID_not_complete, by = c(ID="ptn"))
# ID ID_not_complete ptn
# 1 AX-35388475__rs16896864 rs16896864 (^|[\\s_])rs16896864([\\s_]|$)
# 2 AX-11425569__rs289621 rs289621 (^|[\\s_])rs289621([\\s_]|$)
# 3 AX-11102771__rs10261724 rs10261724 (^|[\\s_])rs10261724([\\s_]|$)
# 4 AX-42179369__rs1725074 rs1725074 (^|[\\s_])rs1725074([\\s_]|$)
# 5 AX-42144793569__rs1725 rs1725 (^|[\\s_])rs1725([\\s_]|$)
(Side note: I originally wanted and intended to use the regex word boundary \b in the pattern, but according to https://www.regular-expressions.info/shorthand.html, [A-Za-z0-9_] are considered "word characters", so the _ preceding the real IDs is not a boundary. If other users have a similar problem that does not involve underscores, then (^|[\\s_]) can be replaced completely with \\b, and similarly for the end pattern.)
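A quick standalone illustration of that boundary problem:
# no \b between "_" and "r", since both count as word characters
grepl("\\brs1725\\b", "AX-42144793569__rs1725", perl = TRUE)
# [1] FALSE
grepl("(^|[\\s_])rs1725([\\s_]|$)", "AX-42144793569__rs1725", perl = TRUE)
# [1] TRUE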
Edit
If all you need is to filter out from ID those that are not found otherwise, then perhaps just
paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$")
# [1] "_(rs16896864|rs289621|rs10261724|rs1725074|rs1725) *$"
grepl(paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$"), ID$ID)
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE
ID[grepl(paste0("_(", paste(ID_not_complete$ID_not_complete, collapse = "|"), ") *$"), ID$ID),,drop=FALSE]
# ID
# 1 AX-35388475__rs16896864
# 2 AX-11425569__rs289621
# 3 AX-11102771__rs10261724
# 4 AX-42179369__rs1725074
# 5 AX-42144793569__rs1725
or
gsub("\\s", "", gsub(".*_", "", ID$ID))
# [1] "rs16896864" "rs289621" "rs10261724" "rs1725074" "rs1725" "rs264930" "rs6114382"
ID[ gsub("\\s", "", gsub(".*_", "", ID$ID)) %in% ID_not_complete$ID_not_complete,,drop=FALSE]
# ID
# 1 AX-35388475__rs16896864
# 2 AX-11425569__rs289621
# 3 AX-11102771__rs10261724
# 4 AX-42179369__rs1725074
# 5 AX-42144793569__rs1725
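Equivalently (a variation on the %in% idea, ignoring the whitespace wrinkle noted above), the suffix can be extracted once and used for an exact merge():
# merge() sorts by the join column, so row order may differ from the original
merge(data.frame(ID = ID$ID, suffix = sub(".*_", "", ID$ID)),
      ID_not_complete, by.x = "suffix", by.y = "ID_not_complete")["ID"]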
EDIT:
Data:
ID <- c("AX-35388475__rs16896864","AX-11425569__rs289621","AX-11102771__rs10261724","AX-42179369__rs1725074","AX-42144793569__rs1725","AX-42749369__rs264930","AX-32893019__rs6114382")
df1 <- data.frame(
ID = ID
)
ID_not_complete<-c("rs16896864","rs289621","rs10261724","rs1725074","rs1725")
df2 <- data.frame(
ID_not_complete = ID_not_complete
)
Solution:
First define the patterns:
patt0 <- paste0(sub("([^_]+)__(.*)", "\\2", df1$ID), collapse = "|")  # suffixes in df1
patt1 <- paste0(df2$ID_not_complete, collapse = "|")                  # partial IDs in df2
Now transform df2$ID_not_complete (this pastes two vectors together positionally, so it relies on the filtered df1 and df2 listing the IDs in the same relative order):
df2$ID_not_complete <- paste0(sub("([^_]+)__(.*)", "\\1__", df1$ID[grepl(patt1, df1$ID)]),
                              grep(patt0, df2$ID_not_complete, value = TRUE))
Result:
df2
ID_not_complete
1 AX-35388475__rs16896864
2 AX-11425569__rs289621
3 AX-11102771__rs10261724
4 AX-42179369__rs1725074
5 AX-42144793569__rs1725

Remove first n words and take count

I have a data frame with a text column. I need to ignore or eliminate the first 2 words in each row and then count the occurrences of the remaining strings in that column.
b <- data.frame(text = c("hello sunitha what can I do for you?",
                         "hi john what can I do for you?"))
Expected output in data frame 'b': after removing the first 2 words, the count of 'what can I do for you?' should be 2.
You can use gsub to remove the first two words and then tapply to count, i.e.
i1 <- gsub("^\\w*\\s*\\w*\\s*", "", b$text)
tapply(i1, i1, length)
#what can I do for you?
# 2
If you need to remove any other range of words, i1 can be amended as follows:
i1 <- sapply(strsplit(as.character(b$text), ' '), function(i)paste(i[-c(2:4)], collapse = ' '))
tapply(i1, i1, length)
#hello I do for you? hi I do for you?
# 1 1
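As an aside, table() gives the same tallies as the tapply() idiom:
table(i1)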
library(magrittr)  # for the %>% pipe used below
b <- data.frame(text = c("hello sunitha what can I do for you?",
                         "hi john what can I do for you?"), stringsAsFactors = FALSE)
b$processed <- sapply(b$text, function(x)
  (strsplit(x, " ")[[1]] %>% .[-c(1:2)]) %>% paste0(., collapse = " "))
b$count <- sapply(b$processed, function(x) length(strsplit(x, " ")[[1]]))
> b
text processed count
1 hello sunitha what can I do for you? what can I do for you? 6
2 hi john what can I do for you? what can I do for you? 6
Are you looking for something like this? Watch out for stringsAsFactors = FALSE, else your texts will be of factor type and harder to work with.
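If the goal is the frequency of each processed string (the count of 2 asked for above) rather than a per-row word count, table() on the processed column gives that directly:
table(b$processed)
# what can I do for you?
#                      2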

R - looking up strings and exclude based on other string

I could not find an answer on how to count words in a data frame while excluding rows where another word is found.
I have got below df:
words <- c("INSTANCE find", "LA LA LA", "instance during",
"instance", "instance", "instance", "find instance")
df <- data.frame(words)
df$words_count <- grepl("instance", df$words, ignore.case = T)
This flags all instances of "instance", but I have been trying to also exclude any row where the word "find" is present.
I could add another grepl to look up "find" and exclude based on that, but I am trying to limit the number of lines in my code.
I'm sure there's a solution using a single regular expression, but you could do
df$words_count <- Reduce(`-`, lapply(c('instance', 'find'), grepl, df$words)) > 0
or
df$words_count <- Reduce(`&`, lapply(c('instance', '^((?!find).)*$'), grepl, df$words, perl = T, ignore.case = T))
This might be easier to read
library(tidyverse)
df$words_count <- c('instance', '^((?!find).)*$') %>%
  lapply(grepl, df$words, perl = T, ignore.case = T) %>%
  reduce(`&`)
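And for completeness, one candidate for that single regular expression (a sketch: a negative lookahead that forbids "find" while still requiring "instance"):
df$words_count <- grepl("^(?!.*\\bfind\\b).*\\binstance\\b", df$words,
                        perl = TRUE, ignore.case = TRUE)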
If all you need is the number of times "instance" appears in a string, with the count zeroed out whenever "find" appears anywhere in that string:
df$counts <- sapply(gregexpr("\\binstance\\b", words, ignore.case=TRUE), function(a) length(a[a>0])) *
!grepl("\\bfind\\b", words, ignore.case=TRUE)
df
# words counts
# 1 INSTANCE find 0
# 2 LA LA LA 0
# 3 instance during 1
# 4 instance 1
# 5 instance 1
# 6 instance 1
# 7 find instance 0

Remove comma and/or period except if a certain condition holds for the last occurrence in R

I would like to remove all commas and periods from a string, except when the string ends in a comma (or period) followed by one or two digits.
Some examples would be:
12.345.67 #would become 12345.67
12.345,67 #would become 12345,67
12.345,6 #would become 12345,6
12.345.6 #would become 12345.6
12.345 #would become 12345
1,2.345 #would become 12345
and so forth
A stringi solution using the same data as @Sotos would be:
library(stringi)
# step 1: repeatedly remove the last , or . while more than 2 characters follow it
repeat {
  i <- stri_locate_last_regex(x, "[,.]")[, 2] < (stri_length(x) - 2)
  i[is.na(i)] <- FALSE  # strings with no , or . left
  if (!any(i)) break
  x[i] <- stri_replace_last_regex(x[i], "[,.]", "")
}
# step 2: remove every remaining , or . except the last one
x <- stri_replace_all_regex(x, "[,.](?=.*[,.])", "")
> x
[1] "12345.67" "12345,67" "12345,6" "12234" "1234" "12.45"
Another option is to use the negative lookahead syntax (?!...) with a Perl-compatible regex:
df
# V1
# 1 12.345.67
# 2 12.345,67
# 3 12.345,6
# 4 12.345.6
# 5 12.345
# 6 1,2.345
df$V1 = gsub("[,.](?!\\d{1,2}$)", "", df$V1, perl = T)
df  # remove , or . except when followed by 1 or 2 digits at the end of the string
# V1
# 1 12345.67
# 2 12345,67
# 3 12345,6
# 4 12345.6
# 5 12345
# 6 12345
One solution is to count the characters after the last comma/period (nchar(word(x, -1, sep = ',|\\.'))); if the length is greater than 2, remove all delimiters (gsub(',|\\.', '', x)), otherwise remove just the first one (sub(',|\\.', '', x)).
library(stringr)
ifelse(nchar(word(x, -1, sep = ',|\\.')) > 2, gsub(',|\\.', '', x), sub(',|\\.', '', x))
#[1] "12345.67" "12345,67" "12345,6" "12234" "1234" "12.45"
DATA
x <- c("12.345.67", "12.345,67", "12.345,6", "1,2.234", "1.234", "1,2.45")

R: extract and paste keyword matches

I am new to R and have been struggling with this one. I want to create a new column that checks whether any of a set of words ("foo", "x", "y") exists in column 'text', and then writes the matched words into the new column.
I have a data frame that looks like this: a->
id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"
The correct output should be:
a2 ->
id text time username keywordtag
1 "hello x" 10 "me" x
2 "foo and y" 5 "you" foo,y
3 "nothing" 15 "everyone" 0
4 "x,y,foo" 0 "know" x,y,foo
I have this:
df1 <- data.frame(text = c("hello x", "foo and y", "nothing", "x,y,foo"))
terms <- c('foo', 'x', 'y')
df1$keywordtag <- apply(sapply(terms, grepl, df1$text), 1, function(x) paste(terms[x], collapse=','))
Which works, but crashes R when my needleList contains 12k words and my text has 155k rows. Is there a way to do this that won't crash R?
This is a variation on what you have done and what was suggested in the comments. It uses dplyr and stringr. There may be a more efficient way, but this should be less likely to crash your R session.
library(dplyr)
library(stringr)
terms <- c('foo', 'x', 'y')
term_regex <- paste0('(', paste(terms, collapse = '|'), ')')
### Solution: this uses dplyr::mutate and stringr::str_extract_all
df1 %>%
  mutate(keywordtag = sapply(str_extract_all(text, term_regex),
                             function(x) paste(x, collapse = ',')))
# text keywordtag
#1 hello x x
#2 foo and y foo,y
#3 nothing
#4 x,y,foo x,y,foo
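If the term list contains very short words, one refinement (a sketch, not in the original answer) is to bind the pattern with word boundaries so that, say, 'x' does not also match inside longer words:
term_regex <- paste0('\\b(', paste(terms, collapse = '|'), ')\\b')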
