Find string match in R - r

I have a data.frame with 2 columns and thousands of rows of random strings, like this:
Column1 Column2
"this is done in 1 hour" "in 1 hour"
I would like to get a new data.frame column like this:
Column3
"this is done"
So basically, match the string in Column2 and get the remainder of Column1. How should I approach this?
EDIT:
Taking a fixed-length substring would not solve the issue, since the length of the strings varies, so I can't do:
substrRight <- function(x, n){
substr(x, nchar(x)-n+1, nchar(x))
}
substrRight(x, 3)
So I would need something like grepl matching.

You can do it with a regular expression:
data <- data.frame(Column1 = "this is done in 1 hour", Column2 = "in 1 hour")
data$Column3 <- gsub(data$Column2, '', data$Column1) # Replace the first argument (pattern) with the second (replacement) in the third (text).
EDIT:
For more than 1 row, you can use mapply:
data <- data.frame(Column1 = c("this is done in 1 hour", "this is a test"), Column2 = c("in 1 hour", "a test"))
data$Column3 <- mapply(gsub, data$Column2, '', data$Column1)
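The mapply call above treats Column2 as a regular expression and leaves the space that preceded the removed text. A minimal sketch of a variant, assuming the goal is literal (non-regex) removal and that leftover whitespace should be trimmed; the anonymous helper and the trimws step are illustrative additions, not part of the answer:

```r
data <- data.frame(
  Column1 = c("this is done in 1 hour", "this is a test"),
  Column2 = c("in 1 hour", "a test"),
  stringsAsFactors = FALSE
)

# Remove the first literal occurrence of Column2 from Column1, row by row.
# fixed = TRUE stops characters such as . or ? from acting as regex operators.
data$Column3 <- mapply(
  function(pattern, text) sub(pattern, "", text, fixed = TRUE),
  data$Column2, data$Column1,
  USE.NAMES = FALSE
)

# Drop the whitespace left behind where the match was cut out.
data$Column3 <- trimws(data$Column3)
data$Column3
```

USE.NAMES = FALSE keeps the new column from carrying the Column2 values as element names.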

Here is an example of how you could do it:
# example data frame
testdata <- data.frame(colA=c("this is","a test"),colB=c("is","a"),stringsAsFactors=FALSE)
# adding the new column
newcol <- sapply(seq_len(nrow(testdata)),function(x) gsub(testdata[x,"colB"],"",testdata[x,"colA"],fixed=TRUE))
new.testdata <- transform(testdata,colC=newcol)
# result
new.testdata
# colA | colB | colC
# --------------------------
# 1 this is | is | th
# 2 a test | a | test
EDIT: gsub(str1,'',str2,fixed=TRUE) deletes all occurrences of str1 within str2, whereas sub would delete only the first occurrence. Because str1 is interpreted as a regular expression by default, it is important to set fixed=TRUE; otherwise the result goes wrong whenever str1 contains characters such as .\+?*{}[]. To address the comment, the following replaces only the last occurrence of str1 in str2, which gives the desired output. It works by reversing both strings, removing the first occurrence with sub, and reversing the result back:
revColA <- lapply(testdata[["colA"]],function(x) paste0(substring(x,nchar(x):1,nchar(x):1)))
revColA <- lapply(revColA,paste,collapse='')
revColB <- lapply(testdata[["colB"]],function(x) paste0(substring(x,nchar(x):1,nchar(x):1)))
revColB <- lapply(revColB,paste,collapse='')
revNewCol <- sapply(seq_len(nrow(testdata)),function(x) sub(revColB[x],"",revColA[x],fixed=TRUE))
newcol <- lapply(revNewCol,function(x) paste0(substring(x,nchar(x):1,nchar(x):1)))
newcol <- sapply(newcol,paste,collapse='')
new.testdata <- transform(testdata,colC=newcol)
# output
# colA | colB | colC
# --------------------------
# 1 this is | is | this
# 2 a test | a | test
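If the text to remove is always expected at the very end of the string, as in the question, the reversal can be avoided. A minimal sketch, assuming R 3.3.0 or later for endsWith; strip_suffix is a hypothetical helper name, and unlike the code above it leaves a string untouched when the pattern is not a suffix:

```r
# Hypothetical helper: remove `suffix` from `text` only where `text` ends with it.
# Both arguments are character vectors of the same length.
strip_suffix <- function(text, suffix) {
  has_suffix <- endsWith(text, suffix)
  out <- text
  out[has_suffix] <- substr(
    text[has_suffix],
    1,
    nchar(text[has_suffix]) - nchar(suffix[has_suffix])
  )
  out
}

testdata <- data.frame(colA = c("this is", "a test"),
                       colB = c("is", "a"),
                       stringsAsFactors = FALSE)

# Row 1 ends with "is" and is shortened; row 2 does not end with "a" and is kept.
result <- strip_suffix(testdata$colA, testdata$colB)
result
```

The comparison is literal, so no escaping of regex metacharacters is needed.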

Related

Create a new column based on matched string in another column using grep function

I'm trying to use the grep function to create a new column based on a matched string (no match = 0, match = 1), but I'm not getting the expected results.
#my data
data<- data.frame(col1 = c("name","no_match","123","0.19","rand?m","also_no_match"))
#string to match
match_txt <- "[0-9]\\.[0-9][0-9]|name|name2|wh??|[0-9]|\\?" #used "|" to match multiple strings
#create the new column using for loop
for (i in 1:nrow(data))
{
data$col2[i] <- grep(match_txt , data$col1[i])
}
# I get the error below:
# Error in data$col2[i] <- grep(match_txt, data$col1[i], ignore.case = TRUE) : replacement has length zero
#this is expected correct results:
expected_data <- data.frame(col1 = c("name","no_match","123","0.19","rand?m","also_no_match"),
col2 = c(1,0,1,1,1,0))
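grep and grepl are vectorised, so there is no need for a loop. grepl returns TRUE/FALSE for every element, so we can apply it to the whole column with the pattern and wrap it in as.integer to convert TRUE/FALSE to 1/0 respectively. (grep returns the indices of the matching elements instead, which is why a non-matching element gives a zero-length result inside the loop.)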
data$col2 <- as.integer(grepl("[0-9]\\.[0-9][0-9]|name|name2|wh??|[0-9]|\\?", data$col1))
data
# col1 col2
#1 name 1
#2 no_match 0
#3 123 1
#4 0.19 1
#5 rand?m 1
#6 also_no_match 0
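The difference between the two functions on a non-matching input can be checked directly. A small sketch with a shortened pattern and data, both chosen here only for illustration:

```r
col1 <- c("name", "no_match", "123")
pattern <- "name|[0-9]"

# grep returns the positions of matching elements; with no match it returns
# a zero-length integer vector, which cannot be assigned to data$col2[i].
no_hit <- grep(pattern, "no_match")
length(no_hit)

# grepl always returns one logical value per input element.
flags <- as.integer(grepl(pattern, col1))
flags
```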

Remove first n words and take count

I have a data frame with a text column. I need to ignore or eliminate the first 2 words and then count the resulting strings in that column.
b <- data.frame(text = c("hello sunitha what can I do for you?",
"hi john what can I do for you?"))
Expected output in data frame 'b': how can we remove the first 2 words, so that the count of 'what can I do for you?' = 2
You can use gsub to remove the first two words and then tapply and count, i.e.
i1 <- gsub("^\\w*\\s*\\w*\\s*", "", b$text)
tapply(i1, i1, length)
#what can I do for you?
# 2
If you need to remove an arbitrary range of words (here words 2 to 4), we can amend i1 as follows,
i1 <- sapply(strsplit(as.character(b$text), ' '), function(i)paste(i[-c(2:4)], collapse = ' '))
tapply(i1, i1, length)
#hello I do for you? hi I do for you?
# 1 1
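The first approach can be wrapped so that the number of leading words is a parameter. A minimal sketch, assuming words are separated by whitespace; drop_first_words is a hypothetical helper name, and strings with fewer than n + 1 words are returned unchanged:

```r
# Hypothetical helper: remove the first n whitespace-separated words.
drop_first_words <- function(x, n) {
  # n repetitions of "non-space run followed by space run" at the start
  pattern <- sprintf("^\\s*(\\S+\\s+){%d}", n)
  sub(pattern, "", x, perl = TRUE)
}

b <- data.frame(text = c("hello sunitha what can I do for you?",
                         "hi john what can I do for you?"),
                stringsAsFactors = FALSE)

trimmed <- drop_first_words(b$text, 2)
counts <- table(trimmed)  # frequency of each remaining string
counts
```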
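library(magrittr) # provides the %>% pipe used below
b <- data.frame(text = c("hello sunitha what can I do for you?", "hi john what can I do for you?"), stringsAsFactors = FALSE)
# split each text into words, drop the first two, and paste the rest back together
b$processed <- sapply(b$text, function(x) (strsplit(x, " ")[[1]] %>% .[-c(1:2)]) %>% paste0(., collapse = " "))
# number of words left in each processed string
b$count <- sapply(b$processed, function(x) length(strsplit(x, " ")[[1]]))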
> b
text processed count
1 hello sunitha what can I do for you? what can I do for you? 6
2 hi john what can I do for you? what can I do for you? 6
Are you looking for something like this? Watch out for stringsAsFactors = FALSE; otherwise your texts will be of type factor and harder to work with.

How to return rows of a df that contain strings from a character list

I have a character list. I would like to return rows in a df that contain any of the strings in the list in a given column.
I have tried things like:
hits <- df %>%
filter(column, any(strings))
strings <- c("ape", "bat", "cat")
head(df$column)
[1] "ape and some other text here"
[2] "just some random text"
[3] "Something about cats"
I would like only rows 1 and 3 to be returned.
Thanks in advance for the help.
Use grepl() with a regular expression matching any of the strings in your strings vector:
strings <- c("ape", "bat", "cat")
Firstly, you can collapse the strings vector to the regex you need:
regex <- paste(strings, collapse = "|")
Which gives:
> regex <- paste(strings, collapse = "|")
> regex
[1] "ape|bat|cat"
The pipe symbol | acts as an or operator, so this regex ape|bat|cat will match ape or bat or cat.
If your data.frame df looks like this:
> df
# A tibble: 3 x 1
column
<chr>
1 ape and some other text here
2 just some random text
3 something about cats
Then you can run the following line of code to return just the rows matching your desired strings:
df[grepl(regex, df$column), ]
The output is as follows:
> df[grepl(regex, df$column), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about cats
Note that the above example is case-sensitive: it will only match the lower case strings exactly as specified. You can overcome this easily using the ignore.case parameter of grepl() (note the upper case Cats):
> df[grepl(regex, df$column, ignore.case = TRUE), ]
# A tibble: 2 x 1
column
<chr>
1 ape and some other text here
2 something about Cats
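Collapsing with | assumes that none of the search strings contains regex metacharacters. A minimal sketch of an alternative that matches each string literally and combines the results, using a plain data.frame as an assumed stand-in for the tibble above:

```r
strings <- c("ape", "bat", "cat")
df <- data.frame(
  column = c("ape and some other text here",
             "just some random text",
             "Something about cats"),
  stringsAsFactors = FALSE
)

# One logical vector per search string (literal matching), OR-ed together.
keep <- Reduce(`|`, lapply(strings, function(s) grepl(s, df$column, fixed = TRUE)))

hits <- df[keep, , drop = FALSE]
hits
```

drop = FALSE keeps the result a data.frame even though it has a single column.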
This can be accomplished with a regular expression.
aColumn <- c("ape and some other text here","just some random text","Something about cats")
aColumn[grepl("ape|bat|cat",aColumn)]
...and the output:
> aColumn[grepl("ape|bat|cat",aColumn)]
[1] "ape and some other text here" "Something about cats"
>
One can also store the regular expression in an R object, as follows.
# use with a variable
strings <- "ape|cat|bat"
aColumn[grepl(strings,aColumn)]

R function to input character vector

I currently have 10 vectors that look like the following:
string1 <- c("house", "home", "cabin")
string2 <-c("hotel", "hostel", "motel")
and so on for 10 strings.
I'm an R newbie learning functions. I have the following code that I want to run across these 10 strings and turn into a function. The code takes one of these strings, searches for matches, and creates a new variable:
a$string.i <- (1:nrow(a) %in% c(sapply(string1, grep, a$Contents, fixed = TRUE))) +0
As I am new to R, I'm stumped on how to turn this into a function. Do I need to first define the number of strings, then set 'string1' in the above code to x? How do I set the name of the variable = to the name of the string?
Some sample data:
a <- read.table(text='Contents other
1 "a house a home" "111"
2 "cabin in the woods" "121"', header=TRUE)
If you need a function, you could try:
fun1 <- function(namePrefix, dat){ #assuming that the datasets have a common prefix i.e. `string`
pat <- paste0("^", namePrefix, "\\d")
nm1 <- ls(pattern=pat, envir=.GlobalEnv)
lst <- mget(nm1, envir=.GlobalEnv)
lst2 <- lapply(lst, function(x)
(1:nrow(dat) %in% c(sapply(x, grep, dat$Contents, fixed=TRUE)))+0) #your code
dat[names(lst2)] <- lst2
dat
}
fun1("string", a)
# Contents other string1 string2
#1 a house a home 111 1 0
#2 cabin in the woods 121 1 0
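The function above looks the vectors up in the global environment by name. A minimal sketch of a variant that takes them as a named list instead, so the list names become the new column names; flag_matches and its column argument are hypothetical names introduced here:

```r
# Hypothetical helper: add one 0/1 column per entry of `word_lists`,
# set to 1 where `column` contains any of that entry's words (literal match).
flag_matches <- function(dat, word_lists, column = "Contents") {
  for (nm in names(word_lists)) {
    hits <- Reduce(`|`, lapply(word_lists[[nm]], function(w) {
      grepl(w, dat[[column]], fixed = TRUE)
    }))
    dat[[nm]] <- as.integer(hits)
  }
  dat
}

a <- data.frame(Contents = c("a house a home", "cabin in the woods"),
                other = c("111", "121"),
                stringsAsFactors = FALSE)

res <- flag_matches(a, list(
  string1 = c("house", "home", "cabin"),
  string2 = c("hotel", "hostel", "motel")
))
res
```

Passing the list explicitly avoids depending on which objects happen to exist in the workspace.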

splitting filename text by underscores using R

In R I'd like to take a collection of file names in the format below and return the number to the right of the second underscore (this will always be a number) and the text string to the right of the third underscore (this will be combinations of letters and numbers).
I have file names in this format:
HELP_PLEASE_4_ME
I want to extract the number 4 and the text ME
I'd then like to create a new field within my data frame where these two types of data can be stored. Any suggestions?
Here is an option using regexec and regmatches to pull out the patterns:
matches <- regmatches(df$a, regexec("^.*?_.*?_([0-9]+)_([[:alnum:]]+)$", df$a))
df[c("match.1", "match.2")] <- t(sapply(matches, `[`, -1)) # the first element of each match is the full match, so drop it and keep only the capture groups.
Produces:
a match.1 match.2
1 HELP_PLEASE_4_ME 4 ME
2 SOS_WOW_3_Y34OU 3 Y34OU
This will break if any rows don't have the expected structure, but I think that is what you want to happen (i.e. be alerted that your data is not what you think it is). strsplit based approaches will require additional checking to ensure that your data is what you think it is.
And the data:
df <- data.frame(a=c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU"), stringsAsFactors=FALSE)
The obligatory stringr version of @BrodieG's quite spiffy answer:
library(stringr) # provides str_match_all
df[c("match.1", "match.2")] <-
t(sapply(str_match_all(df$a, "^.*?_.*?_([0-9]+)_([[:alnum:]]+)$"), "[", 2:3))
Put here for context only. You should accept BrodieG's answer.
Since you already know that you want the text that comes after the second and third underscore, you could use strsplit and take the third and fourth result.
> x <- "HELP_PLEASE_4_ME"
> spl <- unlist(strsplit(x, "_"))[3:4]
> data.frame(string = x, under2 = spl[1], under3 = spl[2])
## string under2 under3
## 1 HELP_PLEASE_4_ME 4 ME
Then for longer vectors, you could do something like the last two lines here.
## set up some data
> word1 <- c("HELLO", "GOODBYE", "HI", "BYE")
> word2 <- c("ONE", "TWO", "THREE", "FOUR")
> nums <- 20:23
> word3 <- c("ME", "YOU", "THEM", "US")
> XX <-paste0(word1, "_", word2, "_", nums, "_", word3)
> XX
## [1] "HELLO_ONE_20_ME" "GOODBYE_TWO_21_YOU"
## [3] "HI_THREE_22_THEM" "BYE_FOUR_23_US"
## ------------------------------------------------
## process it
> spl <- do.call(rbind, strsplit(XX, "_"))[, 3:4]
> data.frame(cbind(XX, spl))
## XX V2 V3
## 1 HELLO_ONE_20_ME 20 ME
## 2 GOODBYE_TWO_21_YOU 21 YOU
## 3 HI_THREE_22_THEM 22 THEM
## 4 BYE_FOUR_23_US 23 US
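As noted earlier, a strsplit approach needs its own check that each name has the expected structure. A minimal sketch of such a check, assuming exactly four underscore-separated parts with a numeric third part; the file names and column names are made up for illustration, and malformed names get NA rather than an error:

```r
files <- c("HELP_PLEASE_4_ME", "SOS_WOW_3_Y34OU", "BAD_NAME")
parts <- strsplit(files, "_", fixed = TRUE)

# A name is usable only if it has four parts and the third is all digits.
ok <- lengths(parts) == 4
ok[ok] <- grepl("^[0-9]+$", vapply(parts[ok], `[`, character(1), 3))

result <- data.frame(file = files,
                     number = NA_integer_,
                     label = NA_character_,
                     stringsAsFactors = FALSE)
result$number[ok] <- as.integer(vapply(parts[ok], `[`, character(1), 3))
result$label[ok] <- vapply(parts[ok], `[`, character(1), 4)
result
```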
