Remove duplicated elements from list - r

I have a list of character vectors:
my.list <- list(e1 = c("a","b","c","k"),e2 = c("b","d","e"),e3 = c("t","d","g","a","f"))
I'm looking for a function that, for any character that appears more than once across the list's vectors (within each vector a character can appear only once), keeps only the first appearance.
The result list for this example would therefore be:
res.list <- list(e1 = c("a","b","c","k"),e2 = c("d","e"),e3 = c("t","g","f"))
Note that it is possible that an entire vector in the list is eliminated so that the number of elements in the resulting list doesn't necessarily have to be equal to the input list.

We can unlist the list, use duplicated to get a logical vector, reshape that vector to the structure of 'my.list' with relist, and extract the elements of 'my.list' based on the resulting logical index
un <- unlist(my.list)
res <- Map(`[`, my.list, relist(!duplicated(un), skeleton = my.list))
identical(res, res.list)
#[1] TRUE
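Since the question notes that a whole vector may be eliminated, here is a minimal sketch that wraps the approach above in a function and then drops any vectors left empty. The name drop_repeats is made up for illustration, and dropping empty vectors (rather than keeping them as character(0)) is an assumption about what the asker wants.

```r
# Hypothetical helper: keep only the first appearance of each value across a list
drop_repeats <- function(lst) {
  flat <- unlist(lst, use.names = FALSE)
  # logical index with the same shape as lst: TRUE marks a first appearance
  keep <- relist(!duplicated(flat), skeleton = lst)
  out <- Map(`[`, lst, keep)
  # assumption: vectors that end up empty are removed from the result
  out[lengths(out) > 0]
}

my.list <- list(e1 = c("a","b","c","k"), e2 = c("b","d","e"), e3 = c("t","d","g","a","f"))
res <- drop_repeats(my.list)

# a case where the second vector is eliminated entirely
res2 <- drop_repeats(list(x = c("a","b"), y = c("b","a"), z = "c"))
names(res2)
# [1] "x" "z"
```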

Here is an alternative using mapply with setdiff and Reduce.
# make a copy of my.list
res.list <- my.list
# take set difference between contents of list elements and accumulated elements
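# SIMPLIFY=FALSE keeps the result a list even when the differences have equal lengths
res.list[-1] <- mapply("setdiff", res.list[-1],
head(Reduce(c, my.list, accumulate=TRUE), -1), SIMPLIFY=FALSE)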
Keeping the first element of the list as is, we compute the set difference between each subsequent element and the cumulative vector of all preceding elements, which Reduce produces with c and the accumulate=TRUE argument. head(..., -1) drops the final list item containing all elements so that the lengths align.
This returns
res.list
$e1
[1] "a" "b" "c" "k"
$e2
[1] "d" "e"
$e3
[1] "t" "g" "f"
Note that in Reduce, we could replace c with function(x, y) unique(c(x, y)) and accomplish the same ultimate output.
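If the Reduce call is hard to follow, the same idea can be sketched as an explicit loop that carries a running vector of values seen so far. This is only an illustration of the logic, not a replacement for the answer above; the variable names seen and res are made up here.

```r
my.list <- list(e1 = c("a","b","c","k"), e2 = c("b","d","e"), e3 = c("t","d","g","a","f"))

seen <- character(0)   # values encountered in earlier vectors
res <- my.list         # copy, so names and order are preserved
for (nm in names(my.list)) {
  # keep only values not seen in any earlier vector
  res[[nm]] <- setdiff(my.list[[nm]], seen)
  # then record this vector's values for the following iterations
  seen <- union(seen, my.list[[nm]])
}
res$e3
# [1] "t" "g" "f"
```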

I found the solutions here very complex for my understanding and sought a simpler technique. Suppose you have the following list.
my_list <- list(a = c(1,2,3,4,5,5), b = c(1,2,2,3,3,4,4),
d = c("Mary", "Mary", "John", "John"))
The following much simpler piece of code removes the duplicates within each vector.
sapply(my_list, unique)
You will end up with the following.
$a
[1] 1 2 3 4 5
$b
[1] 1 2 3 4
$d
[1] "Mary" "John"
There is beauty in simplicity!
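One thing worth checking before using this on the list from the original question: unique works on each vector separately, so values repeated across vectors are left alone. A quick sketch, using lapply instead of sapply so the result is always a list:

```r
my.list <- list(e1 = c("a","b","c","k"), e2 = c("b","d","e"), e3 = c("t","d","g","a","f"))

# unique() only looks inside each vector, so nothing changes here
within_only <- lapply(my.list, unique)
identical(within_only, my.list)
# [1] TRUE
```

So this approach fits lists like my_list above, where the repeats are inside each vector, but not the cross-vector case asked about.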

Related

subset of list of vector with grep?

I have a list of vectors and I want to create a new list containing only the values that contain the letter 'a', while keeping the internal structure.
l = list ( g1 = c('a','b','ca') ,
g2 = c('a','b') )
lapply(l, function(x) grep('a',x) )
lapply only provides the index numbers, but what I want it to return are the values.
The end result should be a list with vector g1 containing a and ca whilst g2 with just a.
thanks!
Add value = TRUE.
lapply(l, function(x) grep('a', x, value = TRUE))
# $g1
# [1] "a" "ca"
#
# $g2
# [1] "a"
Alternatively, you can do:
lapply(l, function(x) x[grepl("a", x)])
$g1
[1] "a" "ca"
$g2
[1] "a"
If you want to try with tidyverse here are couple of approaches.
library(tidyverse)
map(l, ~grep('a', .x, value = TRUE))
map(l, ~str_subset(.x, 'a')) # str_subset from stringr package is a wrapper for grep shown above.
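One caveat with all of the grep-based versions: the pattern is treated as a regular expression by default. If the string you are searching for contains characters such as "." that have a special meaning, grep's fixed = TRUE argument matches it literally. A small sketch with made-up data (l2 is not from the question):

```r
l2 <- list(g1 = c("a.b", "ab", "c"), g2 = c("a", "b."))

# fixed = TRUE: "." is matched as a literal dot
literal <- lapply(l2, function(x) grep(".", x, value = TRUE, fixed = TRUE))

# default: "." is a regex that matches any character, so every non-empty string matches
regex <- lapply(l2, function(x) grep(".", x, value = TRUE))

literal$g1
# [1] "a.b"
```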

Is there an R function for limiting the length of list elements?

I am struggling with a list manipulation in R right now. I have a list containing about 3000 elements, where each element is a character vector. The length of these character vectors is between 7 and 10.
I would like to manipulate this list so that character vectors containing more than 7 elements are limited to their first 7 elements, i.e. the 8th, 9th, and 10th element/word/number of the respective character vector is dropped.
Is there an easy way to do this? I hope you understand what I mean.
Thanks in advance!
You can use lapply as below:
mylist <- list(a = c("a", "b"),
b = c("a", "b", "c"))
mylist
$a
[1] "a" "b"
$b
[1] "a" "b" "c"
mylist2 <- lapply(mylist, function(x) {
x[seq_len(min(length(x), 2))]
})
mylist2
$a
[1] "a" "b"
$b
[1] "a" "b"
What you need is an auxiliary function that will shorten your vector. Something like
shorten_vector <- function(y, max_length = 7){
# NOTE: assumes that there are at least 7 elements in the vector.
y[seq_len(max_length)]
}
you can then shorten the vectors in your list with
lapply(your_list, shorten_vector)
Or better
lapply(your_list, head, 7) # Thanks Moody
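To see why head is the safer choice, here is a short sketch comparing the two on a vector shorter than the limit. The vectors short and long are made up for illustration.

```r
short <- c("a", "b", "c")
long  <- letters[1:10]

h_short <- head(short, 7)      # shorter than 7: returned unchanged
h_long  <- head(long, 7)       # longer than 7: truncated to the first 7
padded  <- short[seq_len(7)]   # plain indexing pads the missing positions with NA

padded
# [1] "a" "b" "c" NA  NA  NA  NA
```

In the question every vector has at least 7 elements, so both versions give the same result there; the difference only shows up for shorter vectors.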
Reproducible example
# Make an object for an example. A list of length 15
# where each element is a character vector between length 7 and 10
random_length <- sample(7:10, 15, replace = TRUE)
char_list <-
lapply(random_length,
function(x){
letters[seq_len(x)]
})
# utility function
shorten_vector <- function(y, max_length = 7){
y[seq_len(max_length)]
}
lapply(char_list,
shorten_vector)
Bonus
You said in a comment on Sonny's answer that you weren't really sure how the lapply worked. At its conceptual core, lapply is a wrapper around a for loop. The equivalent for loop would be
for(i in seq_along(char_list)){
char_list[[i]] <- shorten_vector(char_list[[i]])
}
char_list
The lapply just handles the iteration limits for you and looks a little cleaner on the screen.

r - dealing with NAs when using lapply to select sublist elements by position

I have a list of vectors, some of which are NA. I need to use lapply to select the second-to-last element of each vector. The problem is that NAs have length 1, so I cannot access their second-to-last element.
MyList <- list(a=c("a","b","c"),b=NA,c=c("d","e","f"))
VectorFromList <- unlist(lapply(MyList, function(x) return(x[length(x)-1])))
VectorFromList
a c
"b" "e"
As you can see, the resulting vector is shorter than the original input list, which is a problem if I want to append it as a column in a longer dataframe. My expected result is a vector of the same length as the original list:
[1] "b" NA "e"
How do I deal with NAs when using lapply to select subelements within a list?
Always select at least the first element; we can use max to keep the index from dropping below 1:
unlist(lapply(MyList, function(x) return(x[max(1,length(x)-1)])))
# a b c
# "b" NA "e"
or alternatively
sapply(MyList, function(x) return(x[max(1,length(x)-1)]))
mapply(`[[`, MyList, pmax(1, lengths(MyList)-1))
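If the result is going into a data frame column, a vapply variant may be worth considering, since it guarantees exactly one character value per list element. This is a sketch under the assumption that all non-NA vectors are character; the names second_last and df are made up.

```r
MyList <- list(a = c("a","b","c"), b = NA, c = c("d","e","f"))

second_last <- vapply(
  MyList,
  # as.character() turns the logical NA into NA_character_ so the types agree
  function(x) as.character(x[max(1, length(x) - 1)]),
  character(1)
)

# same length as MyList, so it can be used directly as a column
df <- data.frame(id = names(MyList), second_last = second_last)
nrow(df)
# [1] 3
```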

Make a lookup table from a data.frame

I have a data.frame which has only one unique non-NA value in all columns but one, which only has NA.
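data <- data.frame(A = c("egg", "egg"), B = c(NA, "bacon"), C = c("ham", "ham"), D = c(NA, NA), stringsAsFactors = TRUE) # the factor-based code below needs this since R 4.0.0, where the default became FALSE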
How can I use it to create a lookup table of the form below?
lookup <- make_lookup(key=unique_values(data), value=names(data))
lookup[["egg"]] # returns "A"
lookup[["bacon"]] # returns "B"
lookup[["ham"]] # returns "C"
lookup[["NA"]] # returns "D"
EDIT
Based on Frank's answer below, I'm trying to have my lookup table refer to multiple values.
keys <- lapply(data, function(x) if(is.factor(x)) levels(x) else "bacon")
vals <- names(data)
keys
$A
[1] "egg"
$B
[1] "bacon"
$C
[1] "ham"
$D
[1] "bacon"
vals
[1] "A" "B" "C" "D"
tapply(vals, keys, c)
Error in tapply(vals, keys, c) : arguments must have same length
Here is one way. The lookup is a vector:
keys <- sapply(data,function(x)if(is.factor(x))levels(x)else "NA")
vals <- names(data)
lookup <- setNames(vals,keys)
I've replaced NA with "NA" since I couldn't figure out how to use the former.
The syntax lookup[["egg"]] works, but also lookup["egg"]. The reverse lookup is rlookup <- keys, accessible the same way: rlookup["A"].
For keys with multiple values. If the keys may map to a vector of values, use
lookup <- tapply(vals,keys,c)
Try this out with keys <- sapply(data,function(x)if(is.factor(x))levels(x)else "bacon") and vals as above, for example (as in the OP's comment, below). Now the lookup is a list and so can only be accessed with double brackets: lookup[["bacon"]]. The reverse lookup works as before.
For general column classes. If the columns of data are not all factors, the if/else conditions will need to be modified or generalized. Here is a version of @akrun's generalized solution from the comments:
keys <- sapply(data,function(x)c(unique(as.character(x)[!is.na(x)]),"NA")[1])
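Putting the generalized keys together with setNames gives a complete, runnable version of the lookup. The second half is a sketch of the multi-value case using split rather than tapply; that swap is my own substitution, and keys2 is a made-up variable that forces column D to share the key "bacon".

```r
data <- data.frame(A = c("egg", "egg"), B = c(NA, "bacon"),
                   C = c("ham", "ham"), D = c(NA, NA))

# first non-NA value of each column, or "NA" if the column has none
keys <- sapply(data, function(x) c(unique(as.character(x)[!is.na(x)]), "NA")[1])
vals <- names(data)

lookup <- setNames(vals, keys)
lookup[["egg"]]
# [1] "A"
lookup[["NA"]]
# [1] "D"

# multi-value case: split() groups the column names by key and returns a plain list
keys2 <- keys
keys2[["D"]] <- "bacon"
lookup_multi <- split(vals, keys2)
lookup_multi[["bacon"]]
# [1] "B" "D"
```

Because the keys are built with as.character, this version does not depend on whether the columns are factors or plain character vectors.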

which() function for lists in R

This should be easy, but I am hoping to find out how to return the indices of the list elements that contain a given value. For example, in the list below, let's say I want to find all indices where "a" is an element. I would want a function to return the index 1.
> x = list(c("a", "b"), "c")
> x
[[1]]
[1] "a" "b"
[[2]]
[1] "c"
> which(x=="a")
integer(0)
Of course, which() does not work here. Any help would be appreciated!
You need to iterate over the list elements and check for the element in each one:
sapply(x, function(e) is.element('a', e))
## [1] TRUE FALSE
which(sapply(x, function(e) is.element('a', e)))
## [1] 1
The sapply expression returns a logical vector indicating the presence of 'a' in each element of the list, and which returns the indices of the TRUE elements.
It's not exactly clear to me how you want the result formatted. If you simply want the matching indices as one vector, it becomes difficult to tell which list element each match came from once the list gets longer. To keep that information you could use which inside sapply. Just write
sapply(x, function(y) which(y == "a"))
Or you could use grep, which returns the index of the matched pattern. Here I'll show it used on the unlisted list, and then iterated over the list.
> grep("a", unlist(x))
# [1] 1
> sapply(x, function(y) grep("a", y))
# [[1]]
# [1] 1
# [[2]]
# integer(0)
You could also use %in% to see exactly where the occurrences of "a" are. This returns a logical vector.
> lapply(x, `%in%`, "a") ## or lapply(x, `==`, "a")
# [[1]]
# [1] TRUE FALSE
# [[2]]
# [1] FALSE
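For reuse, the which/sapply idea from the first answer can be wrapped in a small function. This is a sketch; the name which_contains is made up, and vapply is used instead of sapply so the result is always a logical vector, even for an empty list.

```r
# Hypothetical helper: indices of the list elements that contain `value`
which_contains <- function(lst, value) {
  which(vapply(lst, function(e) value %in% e, logical(1)))
}

x <- list(c("a", "b"), "c", c("d", "a"))
idx <- which_contains(x, "a")
idx
# [1] 1 3

none <- which_contains(x, "z")   # no match gives integer(0)
```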
