R - two data frame columns to list of key-value pairs - r

Say I have a data frame
DF1 <- data.frame("a" = c("a", "b", "c"), "b" = 1:3)
What is the easiest way to turn this into a list?
DF2 <- list("a" = 1, "b" = 2, "c" = 3)
It must be really simple but I can't find out the answer.

You can use setNames and as.list
DF2 <- setNames(as.list(DF1$b), DF1$a)

Related

How to use the same R recode function on multiple variables without coding each?

From the recode examples, what if I have two variables where I want to apply the same recode?
factor_vec1 <- factor(c("a", "b", "c"))
factor_vec2 <- factor(c("a", "d", "f"))
How can I recode the same answer without writing a recode for each factor_vec? These don't work, do I need to learn how to use purrr to do it, or is there another way?
Output 1: recode(c(factor_vec1, factor_vec2), a = "Apple")
Output 2: recode(c(factor_vec2, factor_vec2), a = "Apple", b =
"Banana")
If there are not many items needed to be recoded, you can try a simple lookup table approach using base R.
v1 <- c("a", "b", "c")
v2 <- c("a", "d", "f")
# lookup table
lut <- c("a" ="Apple",
"b" = "Banana",
"c" = "c",
"d" = "d",
"f" = "f")
lut[v1]
lut[v2]
You can reuse the lookup table for any relevant variables. The results are:
> lut[v1]
a b c
"Apple" "Banana" "c"
> lut[v2]
a d f
"Apple" "d" "f"
Use lists to hold multiple vectors and then you can apply same function using lapply/map.
library(dplyr)
list_fac <- lst(factor_vec1, factor_vec2)
list_fac <- purrr::map(list_fac, recode, a = "Apple", b = "Banana")
You can keep the vectors in list itself (which is better) or get the changed vectors in global environment using list2env.
list2env(list_fac, .GlobalEnv)

Speeding up recoding of a character column in R

I have some data where each data point is associated with a character vector of varying length. For example, it might be generated by the following function:
library(tidyverse)
set.seed(27)
generate_keyset <- function(...) {
sample(LETTERS[1:5], size = rpois(n = 1, lambda = 10), replace = TRUE)
}
generate_keyset()
#> [1] "A" "C" "A" "A" "A" "A" "A" "E" "C" "C" "A" "D" "A" "D" "C" "A"
I would like to summarize this keyset by converting it to a single number score. The way this works is straightforward: each key in the keyset has a value, and to get the value of the entire keyset I sum over the values. The key-value map is a tibble with several hundred entries, but you can imagine it looks like:
key_value_map <- tribble(
~key, ~value,
"A", 1,
"B", -2,
"C", 8,
"D", -4,
"E", 0
)
Currently I am scoring keysets with the following function:
score_keyset <- function(keyset) {
merged_keysets_to_map <- data.frame(
key = keyset,
stringsAsFactors = FALSE
) %>%
left_join(key_value_map, by = "key")
sum(merged_keysets_to_map$value)
}
score_keyset(LETTERS[1:4])
#> [1] 3
This works fine, except it is very slow, and I need to do this operation about a million times. For example, I would like the following to be much faster:
n <- 1e4 # in practice I have n = 1e6
fake_data <- tibble(
keyset = map(1:n, generate_keyset)
)
library(tictoc)
tic()
scored_data <- fake_data %>%
mutate(
value = map_dbl(keyset, score_keyset)
)
toc()
I am sure this is some much better way to do this with indexing but it is escaping me at the moment. Help speeding this up is much appreciated.
Instead of doing a join and then sum, it would be more efficient if we use a named vector to match
library(tibble)
sum(deframe(key_value_map)[generate_keyset()])
Checking the timings, the OP's tic/toc showed 45.728 sec
tic()
v1 <- deframe(key_value_map)
scored_data2 <- fake_data %>%
mutate(
value = map_dbl(keyset, ~ sum(v1[.x]))
)
toc()
#0.952 sec elapsed
identical(scored_data, scored_data2)
#[1] TRUE

How to find common variables in different data frames?

I have several data frames with similar (but not identical) series of variables (columns). I want to find a way for R to tell me what are the common variables across different data frames.
Example:
`a <- c(1, 2, 3)
b <- c(4, 5, 6)
c <- c(7, 8, 9)
df1 <- data.frame(a, b, c)
b <- c(1, 3, 5)
c <- c(2, 4, 6)
df2 <- data.frame(b, c)`
With df1 and df2, I would want some way for R to tell me that the common variables are b and c.
1) For 2 data frames:
intersect(names(df1), names(df2))
## [1] "b" "c"
To get the names that are in df1 but not in df2:
setdiff(names(df1), names(df2))
1a) and for any number of data frames (i.e. get the names common to all of them):
L <- list(df1, df2)
Reduce(intersect, lapply(L, names))
## [1] "b" "c"
2) An alternative is to use duplicated since the common names will be the ones that are duplicated if we concatenate the names of the two data frames.
nms <- c(names(df1), names(df2))
nms[duplicated(nms)]
## [1] "b" "c"
2a) To generalize that to n data frames use table and look for the names that occur the same number of times as data frames:
L <- list(df1, df2)
tab <- table(unlist(lapply(L, names)))
names(tab[tab == length(L)])
## [1] "b" "c"
Use intersect:
intersect(colnames(df1),colnames(df2))
OR
We can also check for the colname using %in%:
colnames(df1)[colnames(df1) %in% colnames(df2)]
Output:
[1] "b" "c"

Subsetting data from a dataframe and taking specific values from the subsetted values

I want to check if values (in example below "letters") in 1 dataframe appear in another dataframe. And if that is the case, I want a value (in example below "ranking") which is specific for that value from the first dataframe to be added to the second dataframe... What I have now Is the following:
Df1 <- data.frame(c("A", "C", "E"), c(1:3))
colnames(Df1) <- c("letters", "ranking")
Df2 <- data.frame(c("A", "B", "C", "D", "E"))
colnames(Df2) <- c("letters")
Df2$rank <- ifelse(Df2$letters %in% Df1$letters, 1, 0)
However... Instead of getting a '1' when the letters overlap, I want to get the specific 'ranking' number from Df1.
Thanks!
What you're looking for is called a merge:
merge(Df2, Df1, by="letters", all.x=TRUE)
Also, fun fact, you can create a dataframe and name the columns at the same time (and you'll usually want to "turn off" strings as factors):
df1 <- data.frame(
letters = c("a", "b", "c"),
ranking = 1:3,
stringsAsFactors = FALSE)
dplyr package is best for this.
Df2 <- Df2 %>%
left_join(Df1,by = "letters")
this will show a NA for "D" if you want to keep it.
Otherwise you can do semi_join
DF2 <- Df2 %>%
semi_join(Df1, by = "letters")
And this will only keep the ones they have in common (intersection)

Joining lists into a vector [duplicate]

This question already has answers here:
Create sequence of repeated values, in sequence?
(3 answers)
Closed 6 years ago.
I want to create the following vector using a, b, c repeating each letter thrice:
BB<-c("a","a","a","b","b","b","c","c","c")
This is my code:
Alphabet<-c("a","b","c")
AA<-list()
for(i in 1:3){
AA[[i]]<-rep(Alphabet[i],each=3)
}
BB<-do.call(rbind,AA)
But I am getting a dataframe:
dput(BB)
structure(c("a", "b", "c", "a", "b", "c", "a", "b", "c"), .Dim = c(3L,
3L))
What I am doing wrong?
As Akrun mentioned we can use the same rep function
create a vector which consists of letters a,b,c
A <- c("A","B","C")
Apply rep function for the same vector, use each as sub function
AA <- rep(A,each=3)
print(AA)
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
You should use c function to concatenate, not the rbind. This will give you vector.
Alphabet<-c("a","b","c")
AA<-list()
for(i in 1:3){
AA[[i]]<-rep(Alphabet[i],each=3)
}
BB<-do.call(c,AA)
Akrun comment is also true, if thats what you want.
You can also concatenate the rep function like so:
BB <- c(rep("a", 3), rep("b", 3), rep("c", 3))
Here is a solution but note this form or appending is not efficient for large input arrays
Alphabet <- c("a","b","c")
bb <- c()
for (i in 1:length(Alphabet)) {
bb <- c(bb, rep(Alphabet[i], 3))
}

Resources