Make a lookup table from a data.frame

I have a data.frame which has only one unique non-NA value in all columns but one, which only has NA.
data <- data.frame(A = c("egg", "egg"), B = c(NA, "bacon"), C = c("ham", "ham"), D = c(NA, NA))
How can I use it to create a lookup table of the form below?
lookup <- make_lookup(key=unique_values(data), value=names(data))
lookup[["egg"]] # returns "A"
lookup[["bacon"]] # returns "B"
lookup[["ham"]] # returns "C"
lookup[["NA"]] # returns "D"
EDIT
Based on Frank's answer below, I'm trying to have my lookup table refer to multiple values.
keys <- lapply(data, function(x) if(is.factor(x)) levels(x) else "bacon")
vals <- names(data)
keys
$A
[1] "egg"
$B
[1] "bacon"
$C
[1] "ham"
$D
[1] "bacon"
vals
[1] "A" "B" "C" "D"
tapply(vals, keys, c)
Error in tapply(vals, keys, c) : arguments must have same length

Here is one way. The lookup is a vector:
keys <- sapply(data,function(x)if(is.factor(x))levels(x)else "NA")
vals <- names(data)
lookup <- setNames(vals,keys)
I've replaced NA with the string "NA" since I couldn't figure out how to use a real NA as a key.
Both lookup[["egg"]] and lookup["egg"] work. The reverse lookup is rlookup <- keys, accessible the same way: rlookup["A"].
For keys with multiple values. If the keys may map to a vector of values, use
lookup <- tapply(vals,keys,c)
Try this out with keys <- sapply(data,function(x)if(is.factor(x))levels(x)else "bacon") and vals as above, for example (as in the OP's comment, below). Now the lookup is a list, so use double brackets to get the values: lookup[["bacon"]] (single brackets return a one-element list instead). The reverse lookup works as before.
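As a minimal sketch of the multi-valued case, the keys below are written out by hand rather than derived from data, so the example does not depend on whether the columns are factors. It also suggests why the version in the question's EDIT fails: tapply treats a list INDEX as several grouping factors, each of which must be as long as vals, whereas a plain vector of keys (as sapply returns) is a single grouping factor.

```r
vals <- c("A", "B", "C", "D")
# Hand-written keys for illustration: columns B and D share the key "bacon"
keys <- c(A = "egg", B = "bacon", C = "ham", D = "bacon")

# Group the column names by key; the result is a list-like array
lookup <- tapply(vals, keys, c)

bacon_cols <- lookup[["bacon"]]  # every column name stored under "bacon"
egg_cols   <- lookup[["egg"]]    # a single column name
print(bacon_cols)
print(egg_cols)
```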
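For general column classes. If the columns of data are not all factors (for instance in R 4.0.0 and later, where data.frame no longer converts strings to factors by default), the if/else conditions will need to be modified or generalized. Here is a version of @akrun's generalized solution from the comments: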
keys <- sapply(data,function(x)c(unique(as.character(x)[!is.na(x)]),"NA")[1])
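Putting the pieces together, here is a small end-to-end sketch using the generalized keys and the data from the question. It assumes nothing about column classes, so it should behave the same whether or not strings were converted to factors.

```r
# Data from the question
data <- data.frame(A = c("egg", "egg"), B = c(NA, "bacon"),
                   C = c("ham", "ham"), D = c(NA, NA))

# One key per column: the first non-NA value as a string, or "NA" if there is none
keys <- sapply(data, function(x) c(unique(as.character(x)[!is.na(x)]), "NA")[1])
vals <- names(data)

lookup  <- setNames(vals, keys)  # value -> column name
rlookup <- keys                  # column name -> value

print(lookup[["egg"]])
print(lookup[["NA"]])
print(rlookup[["B"]])
```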

Related

substitute the elements of a vector with values from dataframe

I need to substitute the elements of a vector which match the elements of a particular column in data frame in R.
Reproducible example:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"))
I need to get the following vector:
new<-c("A","T","Y","D")
What I tried is:
new <- a
new <- b$col2[match(a, b$col1)]
which does the substitution, but converts the unmatched elements into NAs.
Any help is appreciated

You can make a data.table from a and then update only the rows for which there is a match when joining with b.
library(data.table)
setDT(b)
data.table(a)[b, on = .(a = col1), a := i.col2][]
# a
# 1: A
# 2: T
# 3: Y
# 4: D
In base R you could use your current approach but replace the NAs with the corresponding elements of a using ifelse:
temp <- as.character(b$col2[match(a, b$col1)])
ifelse(is.na(temp), a, temp)
# [1] "A" "T" "Y" "D"

You can use replace in base R:
a<-c("A","B","C","D")
b<-data.frame(col1=c("B","C","E"),col2=c("T","Y","N"), stringsAsFactors = FALSE)
# Caution: this relies on the matching keys appearing in the same order in a and in b$col1
replace(a, which(a %in% b$col1), b$col2[b$col1 %in% a])
#[1] "A" "T" "Y" "D"
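If the rows of b may be in a different order from a, one option is to let match() pair each element of a with its own row of b. This is a sketch of that variant, not part of the original answer; the reordered b below is made up for illustration.

```r
a <- c("A", "B", "C", "D")
# Same lookup pairs as in the question, but with the rows deliberately reordered
b <- data.frame(col1 = c("E", "C", "B"), col2 = c("N", "Y", "T"),
                stringsAsFactors = FALSE)

idx <- match(a, b$col1)  # row of b for each element of a, NA where there is none
hit <- !is.na(idx)       # which elements of a have a replacement

# Replace only the matched elements, each with the value from its own row
new <- replace(a, hit, b$col2[idx[hit]])
print(new)
```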

subset of list of vector with grep?

I have a list of vectors and I want to create a new list containing every value that contains the letter 'a', while keeping the internal structure.
l = list ( g1 = c('a','b','ca') ,
g2 = c('a','b') )
lapply(l, function(x) grep('a',x) )
lapply only provides the index numbers, but what I want it to return are the values.
The end result should be a list with vector g1 containing a and ca whilst g2 with just a.
thanks!

Add value = TRUE.
lapply(l, function(x) grep('a', x, value = TRUE))
# $g1
# [1] "a" "ca"
#
# $g2
# [1] "a"

Alternatively, you can do:
lapply(l, function(x) x[grepl("a", x)])
$g1
[1] "a" "ca"
$g2
[1] "a"

If you want to try with tidyverse, here are a couple of approaches.
library(tidyverse)
map(l, ~grep('a', .x, value=TRUE))
map(l, ~str_subset(.x, 'a')) # str_subset from stringr package is a wrapper for grep shown above.

Is there a R function for limiting the length of list elements?

I am struggling with a list manipulation in R right now. I have a list containing about 3000 elements, where each element is a character vector. The length of these character vectors is between 7 and 10.
I would like to manipulate this list in such a way, that those character vectors, that contain more than 7 elements, are limited to only the first 7 elements - hence drop the 8th, 9th, and 10th element/word/number of the respective character vector of the list.
Is there an easy way to do this? I hope you understand what I mean.
Thanks in advance!

You can use lapply as below:
mylist <- list(a = c("a", "b"),
b = c("a", "b", "c"))
mylist
$a
[1] "a" "b"
$b
[1] "a" "b" "c"
mylist2 <- lapply(mylist, function(x) {
x[seq_len(min(length(x), 2))]
})
mylist2
$a
[1] "a" "b"
$b
[1] "a" "b"

What you need is an auxiliary function that will shorten your vector. Something like
shorten_vector <- function(y, max_length = 7){
# NOTE: assumes at least max_length elements; shorter vectors would be padded with NA.
y[seq_len(max_length)]
}
you can then shorten the vectors in your list with
lapply(your_list, shorten_vector)
Or better
lapply(your_list, head, 7) # Thanks Moody
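A quick illustration of why head is the safer choice: indexing past the end of a vector pads it with NA, while head simply returns what is there. The short vector below is a made-up example.

```r
short <- c("a", "b", "c")  # hypothetical vector with fewer than 7 elements

padded  <- short[seq_len(7)]  # indexing past the end fills with NA
trimmed <- head(short, 7)     # head() returns the vector unchanged

print(padded)
print(trimmed)
```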
Reproducible example
# Make an object for an example. A list of length 15
# where each element is a character vector between length 7 and 10
random_length <- sample(7:10, 15, replace = TRUE)
char_list <-
lapply(random_length,
function(x){
letters[seq_len(x)]
})
# utility function
shorten_vector <- function(y, max_length = 7){
y[seq_len(max_length)]
}
lapply(char_list,
shorten_vector)
Bonus
You said in a comment on Sonny's answer that you weren't really sure how the lapply worked. At its conceptual core, lapply is a wrapper around a for loop. The equivalent for loop would be
for(i in seq_along(char_list)){
char_list[[i]] <- shorten_vector(char_list[[i]])
}
char_list
The lapply just handles the iteration limits for you and looks a little cleaner on the screen.

Remove duplicated elements from list

I have a list of character vectors:
my.list <- list(e1 = c("a","b","c","k"),e2 = c("b","d","e"),e3 = c("t","d","g","a","f"))
And I'm looking for a function that for any character that appears more than once across the list's vectors (in each vector a character can only appear once), will only keep the first appearance.
The result list for this example would therefore be:
res.list <- list(e1 = c("a","b","c","k"),e2 = c("d","e"),e3 = c("t","g","f"))
Note that it is possible that an entire vector in the list is eliminated so that the number of elements in the resulting list doesn't necessarily have to be equal to the input list.

We can unlist the list, get a logical list using duplicated and extract the elements in 'my.list' based on the logical index
un <- unlist(my.list)
res <- Map(`[`, my.list, relist(!duplicated(un), skeleton = my.list))
identical(res, res.list)
#[1] TRUE
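The question notes that a whole vector may end up eliminated. With the approach above such a vector is left behind as an empty character vector; the sketch below adds a Filter(length, ...) step to drop those. The input list here is made up so that one element becomes empty, and the Filter step is an addition, not part of the original answer.

```r
# Hypothetical input in which every value of e2 already appears in e1
my.list <- list(e1 = c("a", "b"), e2 = c("b", "a"), e3 = c("a", "c"))

un   <- unlist(my.list)
keep <- relist(!duplicated(un), skeleton = my.list)  # TRUE for first appearances
res  <- Map(`[`, my.list, keep)

# Drop list elements that were emptied completely
res <- Filter(length, res)
print(res)
```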

Here is an alternative using mapply with setdiff and Reduce.
# make a copy of my.list
res.list <- my.list
# take set difference between contents of list elements and accumulated elements
res.list[-1] <- mapply("setdiff", res.list[-1],
head(Reduce(c, my.list, accumulate=TRUE), -1), SIMPLIFY = FALSE)
Keeping the first element of the list as is, we take the set difference between each subsequent element and the corresponding entry in the list of cumulative vectors that Reduce produces with c and accumulate=TRUE. head(..., -1) drops the final list item, which contains all elements, so that the lengths align. SIMPLIFY = FALSE keeps mapply from collapsing the result into a matrix when the remaining vectors happen to have equal lengths.
This returns
res.list
$e1
[1] "a" "b" "c" "k"
$e2
[1] "d" "e"
$e3
[1] "t" "g" "f"
Note that in Reduce, we could replace c with function(x, y) unique(c(x, y)) and accomplish the same ultimate output.
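The same idea can also be spelled out as an explicit loop that tracks the values seen so far. This is a plain-loop sketch of the technique rather than the mapply/Reduce code above; the variable name seen is introduced here for illustration.

```r
my.list <- list(e1 = c("a","b","c","k"), e2 = c("b","d","e"),
                e3 = c("t","d","g","a","f"))

seen <- character(0)  # values encountered in earlier vectors
res  <- my.list
for (nm in names(my.list)) {
  res[[nm]] <- setdiff(my.list[[nm]], seen)  # keep only values not seen before
  seen <- union(seen, my.list[[nm]])         # remember this vector's values
}
print(res)
```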

I found the solutions here very complex for my understanding and sought a simpler technique. Suppose you have the following list.
my_list <- list(a = c(1,2,3,4,5,5), b = c(1,2,2,3,3,4,4),
d = c("Mary", "Mary", "John", "John"))
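The following much simpler piece of code removes the duplicates within each vector. Note that it does not remove values repeated across different vectors, which is what the question asks for.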
sapply(my_list, unique)
You will end up with the following.
$a
[1] 1 2 3 4 5
$b
[1] 1 2 3 4
$d
[1] "Mary" "John"
There is beauty in simplicity!

Extract distinct characters that differ between two strings

I have two strings, a <- "AERRRTX"; b <- "TRRA".
I want to extract the characters in a not used in b, i.e. "ERX"
I tried the answer in Extract characters that differ between two strings, which uses setdiff. It returns "EX", because b does have "R" and setdiff will eliminate all three "R"s in a. My aim is to treat each character as distinct, so only two of the three R's in a should be eliminated.
Any suggestions on what I can use instead of setdiff, or some other approach to achieve my output?

A different approach using pmatch,
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
#[1] "E" "R" "X"
Another example,
a <- "Ronak";b<-"Shah"
a1 <- unlist(strsplit(a, ""))
b1 <- unlist(strsplit(b, ""))
a1[!1:length(a1) %in% pmatch(b1, a1)]
# [1] "R" "o" "n" "k"
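To see why this works, it can help to look at what pmatch returns. By default (duplicates.ok = FALSE) each position of a1 can be matched at most once, so the two R's of b claim two different R's of a. A small sketch of the intermediate step, using the strings from the question:

```r
a1 <- strsplit("AERRRTX", "")[[1]]
b1 <- strsplit("TRRA", "")[[1]]

# Position in a1 claimed by each character of b1 (NA if none is left to claim)
used <- pmatch(b1, a1)
print(used)

# Characters of a1 whose positions were not claimed
leftover <- paste(a1[!seq_along(a1) %in% used], collapse = "")
print(leftover)
```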

You can use the function vsetdiff from the vecsets package:
install.packages("vecsets")
library(vecsets)
a <- "AERRRTX"
b <- "TRRA"
Reduce(vsetdiff, strsplit(c(a, b), split = ""))
## [1] "E" "R" "X"

We can use Reduce() to successively eliminate from a each character found in b:
a <- 'AERRRTX'; b <- 'TRRA';
paste(collapse='',Reduce(function(as,bc) as[-match(bc,as,nomatch=length(as)+1L)],strsplit(b,'')[[1L]],strsplit(a,'')[[1L]]));
## [1] "ERX"
This will preserve the order of the surviving characters in a.
Another approach is to mark each character with its occurrence index in a, do the same for b, and then we can use setdiff():
a <- 'AERRRTX'; b <- 'TRRA';
pasteOccurrence <- function(x) ave(x,x,FUN=function(x) paste0(x,seq_along(x)));
paste(collapse='',substr(setdiff(pasteOccurrence(strsplit(a,'')[[1L]]),pasteOccurrence(strsplit(b,'')[[1L]])),1L,1L));
## [1] "ERX"

An alternative using the data.table package:
library(data.table)
x = data.table(table(strsplit(a, '')[[1]]))
y = data.table(table(strsplit(b, '')[[1]]))
dt = y[x, on='V1'][,N:=ifelse(is.na(N),0,N)][N!=i.N,res:=i.N-N][res>0]
rep(dt$V1, dt$res)
#[1] "E" "R" "X"
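The same counting idea can be sketched in base R with table, assuming the order of the result does not matter (table sorts the characters, so the output is not in the original order of a). Characters that occur only in b are dropped when b is converted to a factor with the levels of a.

```r
a <- "AERRRTX"; b <- "TRRA"

# Count each character of a, then count b over the same set of characters
ca <- table(strsplit(a, "")[[1]])
cb <- table(factor(strsplit(b, "")[[1]], levels = names(ca)))

# Occurrences in a that are not used up by b, never below zero
left <- pmax(as.vector(ca) - as.vector(cb), 0L)
res  <- rep(names(ca), left)
print(res)
```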
