subset of list of vector with grep? - r

I have a list of vector and I want to create a new list containing any value containing the letter 'a' but keep in internal structure.
l = list ( g1 = c('a','b','ca') ,
g2 = c('a','b') )
lapply(l, function(x) grep('a',x) )
lapply on provides the index number but what I want it to return are the values.
The end result should be a list with vector g1 containing a and ca whilst g2 with just a.
thanks!

Add value = TRUE.
lapply(l, function(x) grep('a', x, value = TRUE))
# $g1
# [1] "a" "ca"
#
# $g2
# [1] "a"

Alternatively, you can do:
lapply(l, function(x) x[grepl("a", x)])
$g1
[1] "a" "ca"
$g2
[1] "a"

If you want to try with tidyverse here are couple of approaches.
library(tidyverse)
map(l, ~grep('a', .x, value=T))
map(l, ~str_subset(.x, 'a')) # str_subset from stringr package is a wrapper for grep shown above.

Related

substitute the elements of a vector with values from dataframe

I need to substitute the elements of a vector which match the elements of a particular column in data frame in R.
Reproducible example:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"))
I need to get the following vector:
new<-c("A","T","Y","D")
What I tried is:
new <- a
new <- b$col2[match(a, b$col1)]
which does the substitution, but converts the unmatched elements into NAs.
Any help is appreciated
You can make a data.table from a and then update only the rows for which there is a match when joining with b.
library(data.table)
setDT(b)
data.table(a)[b, on = .(a = col1), a := i.col2][]
# a
# 1: A
# 2: T
# 3: Y
# 4: D
In base R you could use your current approach but replace the NAs with elements of a using ifelse
temp <- as.character(b$col2[match(a, b$col1)])
ifelse(is.na(temp), a, temp)
# [1] "A" "T" "Y" "D"
You can use replace in base R:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"), stringsAsFactors = F)
replace(a, which(a %in% b$col1), b$col2[b$col1 %in% a])
#[1] "A" "T" "Y" "D"

Summing a dataframe with lapply

Here I'm attempting to sum the columns for c1,c2,c3 and add total to a new column in res dataframe :
res <- data.frame("ID" = c(1,2), "c1" = c(1,2), "c2" = c(3,4), "c3" = c(5,6))
res_subset <- data.frame(res$c1 , res$c2 , res$c3)
tr <- t(res_subset)
s1 <- lapply(tr , function(x){
sum(x)
})
s1 contains :
s1
[[1]]
[1] 1
[[2]]
[1] 3
[[3]]
[1] 5
[[4]]
[1] 2
[[5]]
[1] 4
[[6]]
[1] 6
I take the transpose of the columns to be summed ( tr <- t(res_subset) ) as lapply executes function against each column but I'm attempting to execute function against row.
Is there an issue with how I take transpose as this appears to work for simpler example :
res1 <- data.frame("c1" = c(1,2), "c2" = c(3,4), "c3" = c(5,6))
lapply(res1 , function(x){
sum(x)
})
returns :
$c1
[1] 3
$c2
[1] 7
$c3
[1] 11
If I understood right what you need, just use rowSums() function.
res$sum <- rowSums(res[,2:4])
The function sum returns a scalar, which is not what you want here. Instead, col1 + col2 + ... gives the desired result. So you can use Reduce in combination with +:
res$sum <- Reduce(`+`, res[, c('c1','c2','c3')])
The + operator must be quoted with backticks, since we are using it as a function. (I think quoting with normal quotation marks is OK too.)
rowSums also works, but my understanding is that it will create an intermediate matrix, which is not efficient.

Remove duplicated elements from list

I have a list of character vectors:
my.list <- list(e1 = c("a","b","c","k"),e2 = c("b","d","e"),e3 = c("t","d","g","a","f"))
And I'm looking for a function that for any character that appears more than once across the list's vectors (in each vector a character can only appear once), will only keep the first appearance.
The result list for this example would therefore be:
res.list <- list(e1 = c("a","b","c","k"),e2 = c("d","e"),e3 = c("t","g","f"))
Note that it is possible that an entire vector in the list is eliminated so that the number of elements in the resulting list doesn't necessarily have to be equal to the input list.
We can unlist the list, get a logical list using duplicated and extract the elements in 'my.list' based on the logical index
un <- unlist(my.list)
res <- Map(`[`, my.list, relist(!duplicated(un), skeleton = my.list))
identical(res, res.list)
#[1] TRUE
Here is an alternative using mapply with setdiff and Reduce.
# make a copy of my.list
res.list <- my.list
# take set difference between contents of list elements and accumulated elements
res.list[-1] <- mapply("setdiff", res.list[-1],
head(Reduce(c, my.list, accumulate=TRUE), -1))
Keeping the first element of the list, we compute on subsequent elements and the a list of the cumulative vector of elements produced by Reduce with c and the accumulate=TRUE argument. head(..., -1) drops the final list item containing all elements so that the lengths align.
This returns
res.list
$e1
[1] "a" "b" "c" "k"
$e2
[1] "d" "e"
$e3
[1] "t" "g" "f"
Note that in Reduce, we could replace c with function(x, y) unique(c(x, y)) and accomplish the same ultimate output.
I found the solutions here very complex for my understanding and sought a simpler technique. Suppose you have the following list.
my_list <- list(a = c(1,2,3,4,5,5), b = c(1,2,2,3,3,4,4),
d = c("Mary", "Mary", "John", "John"))
The following much simpler piece of code removes the duplicates.
sapply(my_list, unique)
You will end up with the following.
$a
[1] 1 2 3 4 5
$b
[1] 1 2 3 4
$d
[1] "Mary" "John"
There is beauty in simplicity!

Extract distinct characters that differ between two strings

I have two strings, a <- "AERRRTX"; b <- "TRRA" .
I want to extract the characters in a not used in b, i.e. "ERX"
I tried the answer in Extract characters that differ between two strings , which uses setdiff. It returns "EX", because b does have "R" and setdiff will eliminate all three "R"s in a. My aim is to treat each character as distinct, so only two of the three R's in a should be eliminated.
Any suggestions on what I can use instead of setdiff, or some other approach to achieve my output?
A different approach using pmatch,
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
#[1] "E" "R" "X"
Another example,
a <- "Ronak";b<-"Shah"
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
# [1] "R" "o" "n" "k"
You can use the function vsetdiff from vecsets package
install.packages("vecsets")
library(vecsets)
a <- "AERRRTX"
b <- "TRRA"
Reduce(vsetdiff, strsplit(c(a, b), split = ""))
## [1] "E" "R" "X"
We can use Reduce() to successively eliminate from a each character found in b:
a <- 'AERRRTX'; b <- 'TRRA';
paste(collapse='',Reduce(function(as,bc) as[-match(bc,as,nomatch=length(as)+1L)],strsplit(b,'')[[1L]],strsplit(a,'')[[1L]]));
## [1] "ERX"
This will preserve the order of the surviving characters in a.
Another approach is to mark each character with its occurrence index in a, do the same for b, and then we can use setdiff():
a <- 'AERRRTX'; b <- 'TRRA';
pasteOccurrence <- function(x) ave(x,x,FUN=function(x) paste0(x,seq_along(x)));
paste(collapse='',substr(setdiff(pasteOccurrence(strsplit(a,'')[[1L]]),pasteOccurrence(strsplit(b,'')[[1L]])),1L,1L));
## [1] "ERX"
An alternative using data.table package`:
library(data.table)
x = data.table(table(strsplit(a, '')[[1]]))
y = data.table(table(strsplit(b, '')[[1]]))
dt = y[x, on='V1'][,N:=ifelse(is.na(N),0,N)][N!=i.N,res:=i.N-N][res>0]
rep(dt$V1, dt$res)
#[1] "E" "R" "X"

which() function for lists in R

This should be easy, but I am hoping to find out how to return the indices of a list that contain one element. For example, in the list below, let's say I want to find all indices where "a" is an element. I would want a function to return the index 1.
> x = list(c("a", "b"), "c")
> x
[[1]]
[1] "a" "b"
[[2]]
[1] "c"
> which(x=="a")
integer(0)
Of course, which() does not work here. Any help would be appreciated!
You need to iterate over the list elements and check for the element in each set. The
sapply(x, function(e) is.element('a', e))
## [1] TRUE FALSE
which(sapply(x, function(e) is.element('a', e)))
## [1] 1
The sapply expression returns a logical vector, indicating the presence of a each element of the list, and which returns the indices of the TRUE elements.
It's not exactly clear to me how you want result formatted. Since there are two list elements, it would be difficult to determine which list element the match came from when you have a longer list, should you simply want the indices as a vector. You could use which here. Just write
sapply(x, function(y) which(y == "a"))
Or you could use grep, which returns the index of the matched pattern. Here I'll show it used on the unlisted list, and then iterated over the list.
> grep("a", unlist(x))
# [1] 1
> sapply(x, function(y) grep("a", y))
# [[1]]
# [1] 1
# [[2]]
# integer(0)
You could also use %in% to see exactly where the occurrences of "a" are. This returns a logical vector.
> lapply(x, `%in%`, "a") ## or lapply(x, `==`, "a")
# [[1]]
# [1] TRUE FALSE
# [[2]]
# [1] FALSE

Resources