How do I use grep on a data frame?

I have the following data frame:
> my.data
A.Seats B.Seats
1 14,15 14,15,16
2 7 7,8
3 12,13 16,17
4 <NA> 10,11
I would like to check if the string within any row in column "A.Seats" is found within the same row of column "B.Seats". So the output would look something like this:
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
But I don't know how to create this table. As a start, I tried using grep:
grep(my.data$A.Seats,my.data$B.Seats)
But I receive the following output
[1] 1
Warning message:
In grep(my.data$A.Seats, my.data$B.Seats) :
argument 'pattern' has length > 1 and only the first element will be used
...and I can't get past this warning. Any ideas as to how I can get the intended result?
Many Thanks

The "stringi" library has some vectorized functions that might be useful for something like this. I would suggest the stri_detect() function. Here's an example with some reproducible sample data. Note the difference in the values in the first and last row, and the difference in the results according to whether a regex or fixed approach was taken:
my.data <- data.frame(
A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"))
my.data
# A.Seats B.Seats
# 1 14,15 14,15,16
# 2 7 7,8
# 3 12,13 16,17
# 4 <NA> 10,11
# 5 14,19 14,15,16
library(stringi)
stri_detect(my.data$B.Seats, fixed = my.data$A.Seats)
# [1] TRUE TRUE FALSE NA FALSE
stri_detect(my.data$B.Seats, regex = gsub(",", "|", my.data$A.Seats))
# [1] TRUE TRUE FALSE NA TRUE
The first option above treats the values in my.data$A.Seats as a fixed string pattern. The second option treats it as a regular expression to match any of the values.
Note that this maintains NA as NA, but that can easily be changed to FALSE if you need to.
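Converting the NA results to FALSE, as mentioned above, could look like the following minimal sketch. The vector check is typed in by hand here as a stand-in for the result of the stri_detect() call, and the variable names are only illustrative.

```r
# Stand-in for the result of stri_detect(my.data$B.Seats, fixed = my.data$A.Seats)
check <- c(TRUE, TRUE, FALSE, NA, FALSE)

# Option 1: overwrite the missing entries on a copy
check_filled <- check
check_filled[is.na(check_filled)] <- FALSE

# Option 2: keep only values that are exactly TRUE (NA %in% TRUE is FALSE)
check_true <- check %in% TRUE
```

Both options give the same logical vector here; the second avoids modifying a copy in place.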
If you don't want to think too much about mapply, you can consider Vectorize to make a vectorized version of grepl. Something like the following should do it:
vGrepl <- Vectorize(grepl)
vGrepl(my.data$A.Seats, my.data$B.Seats) # pattern is fixed
# [1] 1 1 0 NA 0
vGrepl(gsub(",", "|", my.data$A.Seats), my.data$B.Seats) # pattern is regex
# 14|15 7 12|13 <NA> 14|19
# 1 1 0 NA 1
as.logical(vGrepl(my.data$A.Seats, my.data$B.Seats)) # coerce to logical
# [1] TRUE TRUE FALSE NA FALSE
Because this calls grepl on each element in the vector, I don't think this will scale well though.

This is an approach to get what you need
> List <- lapply(my.data, function(x) strsplit(as.character(x), ","))
> transform(my.data, Check=sapply(mapply("%in%", List[[1]], List[[2]]), any))
A.Seats B.Seats Check
1 14,15 14,15,16 TRUE
2 7 7,8 TRUE
3 12,13 16,17 FALSE
4 <NA> 10,11 FALSE
Here's an alternative using grep
> transform(my.data,
Check=sapply(suppressWarnings(mapply("grep", List[[1]], List[[2]])), any))
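One thing to keep in mind with substring-based checks (grep, grepl, stri_detect) is that a seat such as "7" is also found inside "17,18". The sketch below compares whole seat numbers after splitting on commas, in the spirit of the %in% approach above. The helper name seats_overlap is hypothetical, and it assumes that "found" means at least one seat of A.Seats occurs in B.Seats.

```r
# Hypothetical helper: TRUE when at least one seat listed in `a`
# also appears, as a whole seat number, in the matching element of `b`.
seats_overlap <- function(a, b) {
  a_tokens <- strsplit(as.character(a), ",", fixed = TRUE)
  b_tokens <- strsplit(as.character(b), ",", fixed = TRUE)
  # NA in `a` stays a single NA token, which is not in `b`, so the result is FALSE
  unname(mapply(function(x, y) any(x %in% y), a_tokens, b_tokens))
}

a_seats <- c("14,15", "7", "12,13", NA)
b_seats <- c("14,15,16", "7,8", "16,17", "10,11")
check <- seats_overlap(a_seats, b_seats)

# Substring matching versus whole-seat matching for "7" against "17,18"
substring_hit <- grepl("7", "17,18", fixed = TRUE)
token_hit <- seats_overlap("7", "17,18")
```

Replacing any() with all() in the helper would instead require every seat of A.Seats to be present in B.Seats.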

Related

How to match multiple columns without merge?

I have those two df's:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I tried to get exact matches. An exact match is if ID and Date are the same in both df's, for example: "TRZ00897" "2022-08-21" is an exact match, because it is present in both df's
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE, because "FFF" "2022-09-22" is in both df's. It is FALSE because the date "2022-09-22" already occurs in Data2 at index position 1, and match returns only the position of the first match.
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So for the last element, the index positions are 6 and 1, which are not equal --> FALSE
How can I change this? Which function should I use to get the correct answer?
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.
Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical output use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
Try match with asplit (since you have different column names for two dataframes, I have to manually remove the names using unname, which can be avoided if both of them have the same names)
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another, memory-expensive option is to use interaction
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6
With mapply and %in%:
apply(mapply(`%in%`, Data1, Data2), 1, all)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit: for a subset of columns:
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
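A caveat worth checking before relying on the column-wise %in% approach: it tests each column independently, so it may report TRUE for a row whose ID and date each occur somewhere in Data2 but never together in the same row. The small data frames below are made up for illustration (they reuse the question's column names) and compare that approach with the paste-based one.

```r
# Made-up data: "TRZ00897" and "2022-03-22" both occur in Data2,
# but not in the same row.
Data1 <- data.frame(ID1 = c("TRZ00897", "AAR9832"),
                    Date1 = c("2022-03-22", "2022-03-22"))
Data2 <- data.frame(ID = c("TRZ00897", "AAR9832"),
                    Date = c("2022-08-21", "2022-03-22"))

# Column-wise membership: each column is checked on its own
colwise <- apply(mapply(`%in%`, Data1, Data2), 1, all)

# Pair-wise membership: the ID/date combination must exist in Data2
pairwise <- paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
```

Here colwise is TRUE for both rows, while pairwise is FALSE for the first row, which is the behaviour the question describes as an exact match.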

How to use '%in%' operator in R?

I have been using the %in% operator for a long time since I knew about it.
However, I still don't understand how it works. At least, I thought that I knew how, but I am always unsure about the order of the operands.
Here you have an example:
This is my dataframe:
df <- data.frame("col1"=c(1,2,3,4,30,21,320,123,4351,1234,3,0,43), "col2"=rep("something",13))
This is how it looks
> df
col1 col2
1 1 something
2 2 something
3 3 something
4 4 something
5 30 something
6 21 something
7 320 something
8 123 something
9 4351 something
10 1234 something
11 3 something
12 0 something
13 43 something
Let's say I have a numerical vector:
myvector <- c(30,43,12,333334,14,4351,0,5,55,66)
And I want to check if all the numbers (or some) from my vector are in the previous dataframe. To do that, I always use %in%.
I thought of two approaches:
#common in both: 30, 4351, 0, 43
# are the numbers from df$col1 in my vector?
trial1 <- subset(df, df$col1 %in% myvector)
# are the numbers of the vector in df$col1?
trial2 <- subset(df, myvector %in% df$col1)
Both approaches make sense to me and they should give the same result. However, only the result from trial1 is okay.
> trial1
col1 col2
5 30 something
9 4351 something
12 0 something
13 43 something
What I don't understand is why the second way is giving me some common numbers and some which are not in the vector.
col1 col2
1 1 something
2 2 something
6 21 something
7 320 something
11 3 something
12 0 something
Could someone explain to me how the %in% operator works and why the second way gives me the wrong result?
Thanks very much in advance
Regards
An answer has already been given, but for a bit more detail, simply look at the result of %in%:
df$col1 %in% myvector
# [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE
This one is correct: it has one value per row of df, so subsetting keeps the rows where the value is TRUE, i.e. rows 5, 9, 12 and 13
versus
myvector %in% df$col1
# [1] TRUE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE
This one goes wrong: it has one value per element of myvector, not per row of df, so subsetting df keeps rows 1, 2, 6 and 7. And because its length is only 10, it is recycled for rows 11, 12 and 13 as TRUE, TRUE, FALSE, so you get rows 11 and 12 in your subset as well
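The recycling described above can be made visible with a short sketch. It rebuilds df and myvector from the question; the names keep, recycled, rows_trial1 and rows_trial2 are only illustrative.

```r
df <- data.frame(col1 = c(1, 2, 3, 4, 30, 21, 320, 123, 4351, 1234, 3, 0, 43),
                 col2 = rep("something", 13))
myvector <- c(30, 43, 12, 333334, 14, 4351, 0, 5, 55, 66)

# One value per element of myvector (length 10), not per row of df
keep <- myvector %in% df$col1

# Logical row indexing reuses the vector from the start once it runs out
recycled <- rep(keep, length.out = nrow(df))
rows_trial2 <- which(recycled)

# One value per row of df (length 13): the form that row filtering needs
rows_trial1 <- which(df$col1 %in% myvector)
```

As a rule of thumb, x %in% table returns one value per element of x, so the data frame column belongs on the left-hand side when filtering rows.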

How to use apply over a vector?

Suppose I have a data.frame like
a <- data.frame(col1=1:6,
col2=c('a','b',1,'c',2,3),
stringsAsFactors=F)
a
col1 col2
1 1 a
2 2 b
3 3 1
4 4 c
5 5 2
6 6 3
I want to have a vector saying which rows have col2 as a number. I'm trying something like
apply(a$col2,1,is.numeric)
or
apply(a$col2,FUN=is.numeric)
but it always says
Error in apply(a$col2, 1, is.numeric) :
dim(X) must have a positive length
If a$col2 (the X in apply) must be a matrix, then why does the help from the function say:
X: an array, including a matrix.
The help on arrays says:
An array in R can have one, two or more dimensions.
If an array can have only one dimension, then why can't a one-dimensional array be used in apply? What am I missing here?
(Beyond that, I still would like to know how to find the numeric rows in col2 without using a loop.)
First note that even the numbers in col2 are character: when they are combined with character elements in c(), they get coerced to character.
str(a)
## 'data.frame': 6 obs. of 2 variables:
## $ col1: int 1 2 3 4 5 6
## $ col2: chr "a" "b" "1" "c" ...
1) grepl: We should therefore use character processing, like this:
grepl("^\\d+$", a$col2)
## [1] FALSE FALSE TRUE FALSE TRUE TRUE
grepl is already vectorized, so we don't need apply or a related function to iterate over the elements of col2.
2) (s)apply: These also work but seem unnecessarily involved, given that grepl alone works:
sapply(a$col2, grepl, pattern = "^\\d+$")
## a b 1 c 2 3
## FALSE FALSE TRUE FALSE TRUE TRUE
apply(array(a$col2), 1, grepl, pattern = "^\\d+$")
## [1] FALSE FALSE TRUE FALSE TRUE TRUE
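3) type.convert: Another approach is to use type.convert, which converts a value to numeric if it can be represented as one. Then we can use is.numeric. (as.is = TRUE is passed explicitly because R 4.0.0 and later warn when it is omitted for character input.)
sapply(a$col2, function(x) is.numeric(type.convert(x, as.is = TRUE)))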
## a b 1 c 2 3
## FALSE FALSE TRUE FALSE TRUE TRUE
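As for why apply rejects a$col2 in the first place: a data frame column is a plain vector with no dim attribute, which is not the same thing as a one-dimensional array. The sketch below, which rebuilds the question's data frame, shows the difference that the array() call in 2) relies on.

```r
a <- data.frame(col1 = 1:6,
                col2 = c("a", "b", 1, "c", 2, 3),
                stringsAsFactors = FALSE)

# A plain vector has no dimensions, so apply() has no margin to iterate over
vec_dim <- dim(a$col2)

# array() attaches a dim attribute of length one, making a true 1-d array
arr_dim <- dim(array(a$col2))
```

Here vec_dim is NULL and arr_dim is 6, which is why apply(array(a$col2), 1, ...) runs while apply(a$col2, 1, ...) stops with the dim(X) error.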

Filter dataframe based on presence of sample in a separate list

I want to filter a dataframe with 1212 rows so that it contains only the samples listed in a separate list. The list has multiple values and I can't work out how to do this.
The df below is called RNASeq2
RNASeq2Norm_samples Substrng_RNASeq2Norm
1 TCGA-3C-AAAU-01A-11R-A41B-07 TCGA.3C.AAAU
2 TCGA-3C-AALI-01A-11R-A41B-07 TCGA.3C.AALI
3 TCGA-3C-AALJ-01A-31R-A41B-07 TCGA.3C.AALJ
4 TCGA-3C-AALK-01A-11R-A41B-07 TCGA.3C.AALK
5 TCGA-4H-AAAK-01A-12R-A41B-07 TCGA.4H.AAAK
6 TCGA-5L-AAT0-01A-12R-A41B-07 TCGA.5L.AAT0
7 TCGA-5L-AAT1-01A-12R-A41B-07 TCGA.5L.AAT1
8 TCGA-5T-A9QA-01A-11R-A41B-07 TCGA.5T.A9QA
.
.
.
1212
list = intersect_samples
intersect_samples: "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" ... 1097
I have tried this code, but it returns all the original 1212 samples:
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% intersect_samples,]
Yet if I try
RNASeq_filtered <- RNASeq2[RNASeq2$Substrng_RNASeq2Norm %in% "TCGA.3C.AAAU",]
it will return the correct row
str(RNASeq2)
'data.frame': 1212 obs. of 2 variables:
$ RNASeq2 : Factor w/ 1212 levels "TCGA-3C-AAAU-01A-11R-A41B-07",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Substrng_RNASeq2Norm: Factor w/ 1093 levels "TCGA.3C.AAAU",..: 1 2 3 4 5 6 7 8 9 10 ...
str(intersect_samples)
chr [1:1093] "TCGA.3C.AAAU" "TCGA.3C.AALI" "TCGA.3C.AALJ" "TCGA.3C.AALK" "TCGA.4H.AAAK" ...
AFAIK R does not offer a convenience function to find a vector of search strings in a vector of strings using partial matching ("sub-strings").
%in% is not the right function if you want to find a sub-string within a string, since it compares only whole strings.
Instead, use base R's grepl or the probably faster stri_detect_fixed function of the excellent stringi package.
Please note that I have abstracted the code and data (instead of using your code and data) for easier understanding.
library(stringi)
pattern = c("23", "45", "999")
data <- data.frame(row_num = 1:4,
string = c("123", "234", "345", "xyz"),
stringsAsFactors = FALSE)
# row_num string
# 1 1 123
# 2 2 234
# 3 3 345
# 4 4 xyz
string <- data$string # the column that contains the values to be filtered
# Iterate over each element in pattern and apply it to the string vector.
# Returns a logical vector of the same length as string (TRUE = found, FALSE = not found)
selected <- lapply(pattern, function(x) stri_detect_fixed(string, x))
# Or shorter:
# lapply(pattern, stri_detect_fixed, str = string)
selected # show the result (it is a list of logical vectors - one per search pattern element)
# [[1]]
# [1] TRUE TRUE FALSE FALSE
#
# [[2]]
# [1] FALSE FALSE TRUE FALSE
#
# [[3]]
# [1] FALSE FALSE FALSE FALSE
# "row-wise" reduce the logical vectors into one final vector using the logical "or" operator
# WARNING: Does not handle `NA`s correctly (FALSE | NA is NA, so a row without any match can end up as NA)
selected.rows <- Reduce("|", selected)
# [1] TRUE TRUE TRUE FALSE
# To handle NAs correctly (if you have NAs) you can use this (slower) code:
selected.rows <- rowSums(as.data.frame(selected), na.rm = TRUE) > 0
# Use the logical vector as row selector (TRUE returns the row, FALSE ignores the row):
string[selected.rows]
# [1] "123" "234" "345"
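For comparison, the same filtering can be sketched in base R without stringi, using grepl with fixed = TRUE so that characters such as "." in the search strings are taken literally. The names hits and keep are illustrative, and the data is the abstracted example from above.

```r
pattern <- c("23", "45", "999")
data <- data.frame(row_num = 1:4,
                   string = c("123", "234", "345", "xyz"),
                   stringsAsFactors = FALSE)

# One column per search string, one row per element of data$string.
# grepl() returns FALSE (not NA) for NA input, so rowSums() needs no na.rm here.
hits <- vapply(pattern,
               function(p) grepl(p, data$string, fixed = TRUE),
               logical(nrow(data)))

# Keep a row if at least one search string was found in it
keep <- rowSums(hits) > 0
filtered <- data[keep, ]
```

This returns the first three rows of data; whether it is faster or slower than the stringi version on 1212 rows has not been measured here.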

Select rows with identical columns from a data frame

I have a data frame with several columns.
I want to select the rows with no NAs (as with complete.cases)
and all columns identical.
E.g., for
> f <- data.frame(a=c(1,NA,NA,4),b=c(1,NA,3,40),c=c(1,NA,5,40))
> f
a b c
1 1 1 1
2 NA NA NA
3 NA 3 5
4 4 40 40
I want the vector TRUE,FALSE,FALSE,FALSE selecting just the first row because there all 3 columns are the same and none is NA.
I can do
Reduce("==",f[complete.cases(f),])
but that creates an intermediate data frame which I would love to avoid (to save memory).
Try this:
R > index <- apply(f, 1, function(x) all(x==x[1]))
R > index
[1] TRUE NA NA FALSE
R > index[is.na(index)] <- FALSE
R > index
[1] TRUE FALSE FALSE FALSE
The best (IMO) solution is from David Winsemius:
which( rowSums(f==f[[1]]) == length(f) )
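Note that which() returns row indices rather than the logical vector asked for in the question. A variation that yields TRUE,FALSE,FALSE,FALSE directly might look like the sketch below. It relies on na.rm = TRUE so that any NA in a row lowers the count below the number of columns; the name same is illustrative.

```r
f <- data.frame(a = c(1, NA, NA, 4), b = c(1, NA, 3, 40), c = c(1, NA, 5, 40))

# Compare every column with the first one, count the TRUEs per row,
# and require the count to equal the number of columns.
# Comparisons involving NA are dropped, so such rows can never reach the full count.
same <- unname(rowSums(f == f[[1]], na.rm = TRUE) == length(f))
```

This still builds a logical matrix of the same shape as f for the comparison, so it is not entirely free of intermediate objects.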
