Searching fields using grepl in R

I'm trying to use grepl to flag some data that might be interesting in a genetics dataset I have.
An example of the data looks like this
test <- c("AAT,TAA,TGA,A,G", "A,AAT,AAAT,AATAAT", "CA,CAA,CAAA")
pattern <- c("TAA", "G", "CAA")
df <- data.frame(test, pattern)
What I am trying to do is to create a third column, say result that evaluates whether the value in the pattern column is in the test column.
I tried this:
df.result <- df %>% mutate(result = grepl(pattern, test))
But for some reason I get a TRUE, TRUE, FALSE in the result column, which isn't what I'm expecting - I would expect a TRUE, FALSE, TRUE result.
I've played around with things like adding a comma to the end of each field, but that didn't seem to work either.
Would appreciate any help with this!
Thanks,
Steve

Use the apply() function:
df$result <- apply(df, 1, FUN=function(x) grepl(x[2], x[1]))
df
# test pattern result
# 1 AAT,TAA,TGA,A,G TAA TRUE
# 2 A,AAT,AAAT,AATAAT G FALSE
# 3 CA,CAA,CAAA CAA TRUE
The apply function loops through each row of the df separately, feeding grepl with per row information. grepl cannot process a vector with three elements in the pattern argument. The help page says:
If a character vector of length 2 or more is supplied [as pattern], the first element is used with a warning.
Thus, the original command grepl(df$pattern, df$test) compared the first element from pattern (TAA) to the whole vector in test.
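The behaviour described above can be checked directly. The following is a minimal sketch in base R using the data from the question; the variable names first_only and rowwise are only illustrative. It matches the first pattern against all of test (which is what the original call effectively did), then pairs each pattern with its own row.

```r
test <- c("AAT,TAA,TGA,A,G", "A,AAT,AAAT,AATAAT", "CA,CAA,CAAA")
pattern <- c("TAA", "G", "CAA")

# What grepl(pattern, test) effectively computes: only pattern[1] ("TAA")
# is matched against every element of test. "AATAAT" contains "TAA",
# which explains the unexpected TRUE in the second position.
first_only <- grepl(pattern[1], test)
first_only
#> [1]  TRUE  TRUE FALSE

# Pairing pattern[i] with test[i] explicitly gives the expected result.
rowwise <- vapply(seq_along(test),
                  function(i) grepl(pattern[i], test[i]),
                  logical(1))
rowwise
#> [1]  TRUE FALSE  TRUE
```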

This can be otherwise done with mapply
df$result <- mapply(grepl, df$pattern, df$test)
df$result
#[1] TRUE FALSE TRUE

The stringi package provides string matching functions that are vectorised over both string and pattern;
library(stringi)
df %>% mutate(result = stri_detect_regex(test, pattern))
is one answer to the original question. To avoid substring matches (for example, the pattern "AAT" matching inside "AAAT"), anchor the pattern to commas or the ends of the string:
df %>% mutate(result = stri_detect_regex(test, stri_join('(^|,)', pattern, '(,|$)')))
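If an extra package is not wanted, whole-field matching can also be sketched in base R by splitting each string on commas and testing membership. This assumes the fields are separated by plain commas with no surrounding spaces, as in the question's data; has_field is a hypothetical helper name.

```r
test <- c("AAT,TAA,TGA,A,G", "A,AAT,AAAT,AATAAT", "CA,CAA,CAAA")
pattern <- c("TAA", "G", "CAA")

# TRUE only if p equals one complete comma-separated field of s.
has_field <- function(p, s) p %in% strsplit(s, ",", fixed = TRUE)[[1]]

# mapply pairs pattern[i] with test[i]; unname drops the names it adds.
result <- unname(mapply(has_field, pattern, test))
result
#> [1]  TRUE FALSE  TRUE

# Difference from plain substring matching:
grepl("AAT", "A,AAAT,AATAAT")      # TRUE, "AAT" occurs inside "AAAT"
has_field("AAT", "A,AAAT,AATAAT")  # FALSE, no field is exactly "AAT"
```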

Related

How to search a vector of words for words containing two specific letters

So I've got a vector of 5 letter words and I want to be able to create a function that extracts the words that contain ALL of the letters in the pattern.
For example, if my vector is ("aback", "abase", "abate", "agate", "allay") and I'm looking for words that contain BOTH "a" and "b", I want the function to return ("aback", "abase", "abate"). I don't care what position or how many times these letters occur in the words, only that the words contain both of them.
I've tried to do this by creating a function that is meant to combine grepl's with an &. But the problem here is the grepl function doesn't accept vectors as the pattern. My plan was for this function to achieve grepl("a", word_vec) & grepl("b", word_vec). I also need this to be scalable so if I want to search for all words containing "a" AND "b" AND "c", for example.
grepl_cat <- function(str, words_vec) {
pat <- str_split(str, "")
first_let = TRUE
for (i in 1:length(pat)) {
if (first_let){
result <- sapply(pat[i], grepl, x = word_vec)
first_let <- FALSE
}
print(pat[i])
result <- result & sapply(pat[i], grepl, x = word_vec)
}
return(result)
}
word_vec[grepl_cat("abc", word_vec)]
The function I've written above definitely isn't doing what it's intended to do.
I'm wondering if there an easier way to do this with regex patterns or there's a way to input each letter in the str into the grepl function as non-vectors.
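A possible solution in base R. Each lookahead checks for one letter anywhere in the word, so the letters need not be adjacent or in a particular order (lookaheads require perl = TRUE):
s <- c("aback", "abase", "abate", "agate", "allay")
subset(s, grepl("^(?=.*a)(?=.*b)", s, perl = TRUE))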
#> [1] "aback" "abase" "abate"
Another possible solution, based on tidyverse:
library(tidyverse)
s <- c("aback", "abase", "abate", "agate", "allay")
s %>%
data.frame(s = .) %>%
filter(str_detect(s, "^(?=.*a)(?=.*b)")) %>%
pull(s)
#> [1] "aback" "abase" "abate"
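If the letters must appear in this order (a, then b, then c), a regex solution would be:
^.*a.*b.*c.*$
You may add more letters as needed.
To match the letters in any order, use one lookahead per letter (in base R this needs perl = TRUE):
^(?=.*a)(?=.*b)(?=.*c).*$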
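The question also asked for something scalable to any number of letters. One way to sketch this in base R is to run one grepl per letter and combine the results with &, which is what the question's function was aiming for. contains_all is a hypothetical helper name, and the letters are matched literally (fixed = TRUE) and case-sensitively.

```r
# Return a logical vector: TRUE where a word contains every character
# of letters_str, in any position and any order.
contains_all <- function(letters_str, words) {
  chars <- strsplit(letters_str, "")[[1]]           # "abc" -> "a" "b" "c"
  hits <- lapply(chars, grepl, x = words, fixed = TRUE)
  # Combine the per-letter results with &, starting from all TRUE.
  Reduce(`&`, hits, rep(TRUE, length(words)))
}

word_vec <- c("aback", "abase", "abate", "agate", "allay")
word_vec[contains_all("ab", word_vec)]
#> [1] "aback" "abase" "abate"
word_vec[contains_all("abc", word_vec)]
#> [1] "aback"
```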

grep() and sub() and regular expression

I'd like to change the variable names in my data.frame from e.g. "pmm_StartTimev4_E2_C19_1" to "pmm_StartTimev4_E2_C19". So if the name ends with an underscore followed by any number it gets removed.
But I'd like for this to happen only if the variable name has the word "Start" in it.
I've got a muddled up bit of code that doesn't work. Any help would be appreciated!
# Current data frame:
dfbefore <- data.frame(a=c("pmm_StartTimev4_E2_C19_1","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19_2","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Desired data frame:
dfafter <- data.frame(a=c("pmm_StartTimev4_E2_C19","pmm_StartTimev4_E2_E2_C1","delivery_C1_C12"),b=c("pmm_StartTo_v4_E2_C19","complete_E1_C12_1","pmm_StartTo_v4_E2_C19"))
# Current code:
sub((.*{1,}[0-9]*).*","",grep("Start",names(df),value = TRUE)
How about something like this using gsub().
stripcol <- function(x) {
gsub("(.*Start.*)_\\d+$", "\\1", as.character(x))
}
dfnew <- dfbefore
dfnew[] <- lapply(dfbefore, stripcol)
We use the regular expression to look for "Start" and then grab everything but the underscore number at the end. We use lapply to apply the function to all columns.
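The same pattern idea can be tried on a plain character vector, or on names() if the strings really are variable names as the question's first sentence suggests. This is a small sketch; vars and df_names are made-up examples, not data from the question.

```r
vars <- c("pmm_StartTimev4_E2_C19_1", "complete_E1_C12_1", "pmm_StartTo_v4_E2_C19")

# Keep everything before a trailing "_<digits>", but only when "Start"
# occurs earlier in the string; other strings are returned unchanged.
cleaned <- sub("^(.*Start.*)_[0-9]+$", "\\1", vars)
cleaned
#> [1] "pmm_StartTimev4_E2_C19" "complete_E1_C12_1"      "pmm_StartTo_v4_E2_C19"

# Applied to column names of a (hypothetical) data frame:
df_names <- data.frame(pmm_StartTimev4_E2_C19_1 = 1:2, complete_E1_C12_1 = 3:4)
names(df_names) <- sub("^(.*Start.*)_[0-9]+$", "\\1", names(df_names))
names(df_names)
#> [1] "pmm_StartTimev4_E2_C19" "complete_E1_C12_1"
```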
doit <- function(x){
x <- as.character(x)
if(grepl("Start",x)){
x <- sub("_[0-9]+$","",x)
}
return(x)
}
apply(dfbefore,c(1,2),doit)
a b
[1,] "pmm_StartTimev4_E2_C19" "pmm_StartTo_v4_E2_C19"
[2,] "pmm_StartTimev4_E2_E2_C1" "complete_E1_C12_1"
[3,] "delivery_C1_C12" "pmm_StartTo_v4_E2_C19"
We can use sub to capture groups where the 'Start' substring is also present followed by an underscore and one or more numbers. In the replacement, use the backreference of the captured group. As there are multiple columns, use lapply to loop over the columns, apply the sub and assign the output back to the original data
out <- dfbefore
out[] <- lapply(dfbefore, sub,
pattern = "^(.*_Start.*)_\\d+$", replacement ="\\1")
out
dfafter[] <- lapply(dfafter, as.character)
all.equal(out, dfafter, check.attributes = FALSE)
#[1] TRUE

Trimming data frame in R with grep?

My dataframe, dat, has two columns which look like this:
value condition
2 learning/cat
4 learning/dog
1 naming/cat
6 naming/dog
I would like to 'trim' the data frame to only include rows in which condition contains "naming".
I've tried to do this with grep:
dat = dat[grep("naming", dat$condition, value = T)]
which causes the following error:
Error in `[.data.frame`(dat, grep("naming", dat$condition, value = T)) :
undefined columns selected
Can anyone suggest a fix? Any help would be greatly appreciated!
You can split up condition using separate from tidyr:
df = dat %>% separate(condition, into = c("condition1", "condition2"), sep = "/")
Then just use filter:
only_naming_df = df %>% filter(condition1 == "naming")
The error is easy to fix: add a comma after the closing parenthesis so that the result selects rows rather than columns, and drop value = T. I also want to give a list of the available options for this task. Below are solutions and comments from others and from me.
Use grep or grepl
grep returns the index (row number), while grepl returns a logical vector (TRUE or FALSE). Notice that when using grep in this case, value = T should not be added because it will return the string, which is not helpful for subsetting.
dat[grep("naming", dat$condition), ]
dat[grepl("naming", dat$condition), ]
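To make the difference concrete, here is a short sketch using the question's data, rebuilt inline so it runs on its own. It also shows one reason to prefer grepl when excluding rows: negative indexing with grep returns no rows at all when nothing matches. The pattern "zzz" is just a made-up non-matching example.

```r
dat <- data.frame(value = c(2, 4, 1, 6),
                  condition = c("learning/cat", "learning/dog",
                                "naming/cat", "naming/dog"),
                  stringsAsFactors = FALSE)

idx <- grep("naming", dat$condition)    # positions of the matches
idx
#> [1] 3 4
lgl <- grepl("naming", dat$condition)   # one TRUE/FALSE per row
lgl
#> [1] FALSE FALSE  TRUE  TRUE

# Both select the same rows.
all.equal(dat[idx, ], dat[lgl, ])
#> [1] TRUE

# Excluding rows when nothing matches: grepl keeps all rows,
# while -grep() yields an empty index and therefore zero rows.
nrow(dat[!grepl("zzz", dat$condition), ])
#> [1] 4
nrow(dat[-grep("zzz", dat$condition), ])
#> [1] 0
```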
Functions from dplyr and stringr
str_detect is equivalent to grepl(pattern, x), while str_which is equivalent to grep(pattern, x).
library(dplyr)
library(stringr)
dat %>% filter(str_detect(condition, "naming"))
dat %>% slice(str_which(condition, "naming"))
Data Preparation
# Create example dataframes
dat <- read.table(text = "value condition
2 learning/cat
4 learning/dog
1 naming/cat
6 naming/dog",
header = TRUE, stringsAsFactors = FALSE)

purrr::discard How to delete elements in a vector containing one or more specific strings

I would like to remove the elements containing '_1' and '_3' in the vector using the discard function from purrr. Here the example:
library(purrr)
x <- c("ABAC_13", "ZDRF73", "UYDS_12", "FGSH41", "GFSC_35" , "JHSC_29")
With discard we need to provide a logical vector indicating which values we need to discard.
To create a logical vector we use grepl giving TRUE values to the elements which have '_1' or '_3'
library(purrr)
discard(x, grepl("_1|_3", x))
#[1] "ZDRF73" "FGSH41" "JHSC_29"
And as @Lazarus Thurston commented, str_subset from stringr should be a better choice here.
str_subset(x, '_(1|3)', negate = TRUE)
As this is specific to tidyverse, we can use the syntax specific to it
library(tidyverse)
str_detect(x, "_[13]") %>%
discard(x, .)
#[1] "ZDRF73" "FGSH41" "JHSC_29"
If we need to remove the elements that contain an underscore followed by any digits
grep("_\\d+", x, invert = TRUE, value = TRUE)
#[1] "ZDRF73" "FGSH41"
or if it is specific to 1, 3
grep("_[13]", x, invert = TRUE, value = TRUE)
#[1] "ZDRF73" "FGSH41" "JHSC_29"
If we need to remove the substring part,
sub("_\\d+", '', x)
This task can be performed using grepl(). Basically, we want to find the elements that contain _1 or _3. The grepl output is a logical vector of TRUE/FALSE. We then remove those elements from the x vector by subsetting with the negation operator, i.e. x[!grepl("_1|_3", x)].
x <- c("ABAC_13", "ZDRF73", "UYDS_12", "FGSH41", "GFSC_35" , "JHSC_29")
x[!grepl("_1|_3", x)]
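If the list of substrings to exclude is held in a vector rather than typed into one regex, the same idea can be sketched in base R as below. drop_containing is a hypothetical helper, and it matches the substrings literally (fixed = TRUE), so regex metacharacters in them need no escaping.

```r
# Remove every element of x that contains at least one of the substrings.
drop_containing <- function(x, substrings) {
  per_substring <- lapply(substrings, grepl, x = x, fixed = TRUE)
  # Combine with |, starting from all FALSE ("no match yet").
  hit <- Reduce(`|`, per_substring, rep(FALSE, length(x)))
  x[!hit]
}

x <- c("ABAC_13", "ZDRF73", "UYDS_12", "FGSH41", "GFSC_35", "JHSC_29")
kept <- drop_containing(x, c("_1", "_3"))
kept
#> [1] "ZDRF73"  "FGSH41"  "JHSC_29"
```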

Removing Whitespace From a Whole Data Frame in R

I've been trying to remove the white space that I have in a data frame (using R). The data frame is large (>1gb) and has multiple columns that contains white space in every data entry.
Is there a quick way to remove the white space from the whole data frame? I've been trying to do this on a subset of the first 10 rows of data using:
gsub( " ", "", mydata)
This didn't seem to work, although R returned an output which I have been unable to interpret.
str_replace( " ", "", mydata)
R returned 47 warnings and did not remove the white space.
erase_all(mydata, " ")
R returned an error saying 'Error: could not find function "erase_all"'
I would really appreciate some help with this as I've spent the last 24hrs trying to tackle this problem.
Thanks!
A lot of the answers are older, so here in 2019 is a simple dplyr solution that will operate only on the character columns to remove trailing and leading whitespace.
library(dplyr)
library(stringr)
data %>%
mutate_if(is.character, str_trim)
## ===== 2020 edit for dplyr (>= 1.0.0) =====
df %>%
mutate(across(where(is.character), str_trim))
You can switch out the str_trim() function for other ones if you want a different flavor of whitespace removal.
# for example, remove all spaces
df %>%
mutate(across(where(is.character), str_remove_all, pattern = fixed(" ")))
If I understood you correctly, you want to remove all the white space from the entire data frame. I guess the code you are using is good for removing spaces in the column names. I think you should try this:
apply(myData, 2, function(x)gsub('\\s+', '',x))
Hope this works.
This will return a matrix however, if you want to change it to data frame then do:
as.data.frame(apply(myData, 2, function(x) gsub('\\s+', '', x)))
EDIT In 2020:
Using lapply and the trimws function with which = "both" (the default) removes leading and trailing spaces but not spaces inside the strings. Since the OP provided no input data, I am adding a dummy example to produce the results.
DATA:
df <- data.frame(val = c(" abc", " kl m", "dfsd "),
val1 = c("klm ", "gdfs", "123"),
num = 1:3,
num1 = 2:4,
stringsAsFactors = FALSE)
#situation: 1 (Using Base R), when we want to remove spaces only at the leading and trailing ends NOT inside the string values, we can use trimws
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified], trimws)
# situation: 2 (Using Base R) , when we want to remove spaces at every place in the dataframe in character columns (inside of a string as well as at the leading and trailing ends).
(This was the initial solution proposed using apply. Note that a solution using apply seems to work but would be very slow. Also, it is not very clear from the question whether the OP wanted to remove only leading/trailing blanks or every blank in the data.)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, cols_to_be_rectified] <- lapply(df[, cols_to_be_rectified],
function(x) gsub('\\s+', '', x))
## situation: 1 (Using data.table, removing only leading and trailing blanks)
library(data.table)
setDT(df)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, trimws), .SDcols = cols_to_be_rectified]
Output from situation1:
val val1 num num1
1: abc klm 1 2
2: kl m gdfs 2 3
3: dfsd 123 3 4
## situation: 2 (Using data.table, removing every blank inside as well as leading/trailing blanks)
cols_to_be_rectified <- names(df)[vapply(df, is.character, logical(1))]
df[, c(cols_to_be_rectified) := lapply(.SD, function(x) gsub('\\s+', '', x)), .SDcols = cols_to_be_rectified]
Output from situation2:
val val1 num num1
1: abc klm 1 2
2: klm gdfs 2 3
3: dfsd 123 3 4
Note the difference between the outputs of the two situations. In row 2, you can see that trimws removes only leading and trailing blanks, whereas the regex solution removes every blank.
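The difference is easiest to see on a plain character vector. This minimal sketch reuses the val column from the dummy data above; the names trimmed and stripped are only illustrative.

```r
val <- c(" abc", " kl m", "dfsd ")

# trimws removes leading and trailing whitespace only.
trimmed <- trimws(val)
trimmed
#> [1] "abc"  "kl m" "dfsd"

# The regex removes every run of whitespace, including inside the string.
stripped <- gsub("\\s+", "", val)
stripped
#> [1] "abc"  "klm"  "dfsd"
```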
I hope this helps , Thanks
One possibility involving just dplyr could be:
data %>%
mutate_if(is.character, trimws)
Or considering that all variables are of class character:
data %>%
mutate_all(trimws)
Since dplyr 1.0.0 (only strings):
data %>%
mutate(across(where(is.character), trimws))
Or if all columns are strings:
data %>%
mutate(across(everything(), trimws))
Picking up on Fremzy and the comment from Stamper, this is now my handy routine for cleaning up whitespace in data:
df <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
As others have noted this changes all types to character. In my work, I first determine the types available in the original and conversions required. After trimming, I re-apply the types needed.
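One way to sketch the "trim, then restore types" step is base R's type.convert, which guesses a type for each column from its trimmed text. Note that this infers types rather than re-applying the original ones, so results can differ from the input (for instance, an ID stored as text such as "007" would become the integer 7). The example data frame below is made up for illustration.

```r
df <- data.frame(id = c(" 1", "2 "),
                 name = c(" ann", "bob "),
                 score = c(1.5, 2),
                 stringsAsFactors = FALSE)

# Trimming every column turns all of them into character.
trimmed <- data.frame(lapply(df, trimws), stringsAsFactors = FALSE)
sapply(trimmed, class)
#>          id        name       score
#> "character" "character" "character"

# Let type.convert infer a type per column; as.is = TRUE keeps
# text columns as character instead of converting them to factor.
restored <- trimmed
restored[] <- lapply(trimmed, type.convert, as.is = TRUE)
sapply(restored, class)
#>          id        name       score
#>   "integer" "character"   "numeric"
```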
If your original types are OK, apply the solution from MarkusN below https://stackoverflow.com/a/37815274/2200542
Those working with Excel files may wish to explore the readxl package which defaults to trim_ws = TRUE when reading.
Picking up on Fremzy and Mielniczuk, I came to the following solution:
data.frame(lapply(df, function(x) if (is.character(x)) trimws(x) else x), stringsAsFactors = FALSE)
It works for mixed numeric/character data frames and manipulates only the character columns.
You could use trimws function in R 3.2 on all the columns.
myData[,c(1)]=trimws(myData[,c(1)])
You can loop this for all the columns in your dataset. It has good performance with large datasets as well.
If you're dealing with large data sets like this, you could really benefit from the speed of data.table.
library(data.table)
setDT(df)
for (j in names(df)) if (is.character(df[[j]])) set(df, j = j, value = trimws(df[[j]]))
I would expect this to be the fastest solution. This line of code uses the set operator of data.table, which loops over columns really fast. There is a nice explanation here: Fast looping with set.
R is simply not the right tool for such a file size. However, you have 2 options:
Use the ff and ffbase packages:
library(ff)
library(ffbase)
x <- read.csv.ffdf(file=your_file,header=TRUE, VERBOSE=TRUE,
first.rows=1e4, next.rows=5e4)
# splits is the number of chunks to process; define it before this line
x$split = as.ff(rep(seq(splits),each=nrow(x)/splits))
ffdfdply( x, x$split , BATCHBYTES=0,function(myData)
apply(myData,2,function(x)gsub('\\s+', '',x)))
Use sed (my preference):
sed -i -r "s/(\S)\s+(\S)/\1\2/g;s/^\s+//;s/\s+$//" your_file
If you want to maintain the variable classes in your data.frame - you should know that using apply will clobber them because it outputs a matrix where all variables are converted to either character or numeric. Building upon the code of Fremzy and Anthony Simon Mielniczuk you can loop through the columns of your data.frame and trim the white space off only columns of class factor or character (and maintain your data classes):
for (i in names(mydata)) {
if(class(mydata[, i]) %in% c("factor", "character")){
mydata[, i] <- trimws(mydata[, i])
}
}
I think that a simple approach with sapply, also works, given a df like:
dat<-data.frame(S=LETTERS[1:10],
M=LETTERS[11:20],
X=c(rep("A:A",3),"?","A:A ",rep("G:G",5)),
Y=c(rep("T:T",4),"T:T ",rep("C:C",5)),
Z=c(rep("T:T",4),"T:T ",rep("C:C",5)),
N=c(1:3,'4 ','5 ',6:10),
stringsAsFactors = FALSE)
You will notice that dat$N is going to become class character due to '4 ' & '5 ' (you can check with class(dat$N))
To get rid of the spaces in the numeric column, simply convert it to numeric with as.numeric or as.integer.
dat$N<-as.numeric(dat$N)
If you want to remove the leading and trailing spaces from every column, do:
dat.b<-as.data.frame(sapply(dat,trimws),stringsAsFactors = FALSE)
And again use as.numeric on column N (because sapply will convert it to character):
dat.b$N<-as.numeric(dat.b$N)
