R pass column names to a function not as string

I am trying to pass column names to the following function.
unnest_dt <- function(tbl, ...) {
tbl <- as.data.table(tbl)
col <- ensyms(...)
clnms <- syms(setdiff(colnames(tbl), as.character(col)))
tbl <- as.data.table(tbl)
tbl <- eval(
expr(tbl[, lapply(.SD, unlist), by = list(!!!clnms), .SDcols = as.character(col)])
)
colnames(tbl) <- c(as.character(clnms), as.character(col))
tbl
}
The function is built for unnesting a data frame with multiple list columns. Consider the following use of the function on dummy data.
library(tibble)
df <- tibble(
a = LETTERS[1:5],
b = LETTERS[6:10],
list_column_1 = list(c(LETTERS[1:5]), "F", "G", "H", "I"),
list_column_2 = list(c(LETTERS[1:5]), "F", "G", "H", "I")
)
df <- unnest_dt(df,list_column_1,list_column_2)
It serves the purpose. However, I am trying to loop over this function, and I need to pass column names to it. For example, I want to be able to do the following:
library(dplyr)
col <- colnames(df %>% select_if(is.list))
df <- unnest_dt(df,col)
As expected, this gives the following error:
Error in [.data.table(tbl, , lapply(.SD, unlist), by = list(a, b, list_column_1, :
column or expression 3 of 'by' or 'keyby' is type list. Do not quote column names. Usage: DT[,sum(colC),by=list(colA,month(colB))]
Would anyone know how I can proceed with this? Any help would be greatly appreciated.

You can change the function to work with a character vector.
unnest_dt <- function(tbl, ...) {
tbl <- as.data.table(tbl)
col <- c(...)
clnms <- syms(setdiff(colnames(tbl), col))
tbl <- as.data.table(tbl)
tbl <- eval(
expr(tbl[, lapply(.SD, unlist), by = list(!!!clnms),
.SDcols = as.character(col)])
)
colnames(tbl) <- c(as.character(clnms), as.character(col))
tbl
}
and then use:
unnest_dt(df,col)
# a b list_column_1 list_column_2
#1: A F A A
#2: A F B B
#3: A F C C
#4: A F D D
#5: A F E E
#6: B G F F
#7: C H G G
#8: D I H H
#9: E J I I
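As a minimal sketch of the same idea without rlang, the grouping columns can be handed to data.table as a plain character vector. The name unnest_dt_chr and its cols argument are hypothetical, and the sketch assumes every list column in a row has the same length.

```r
library(data.table)

# Hypothetical variant: `cols` is a character vector naming the list columns.
unnest_dt_chr <- function(tbl, cols) {
  tbl <- as.data.table(tbl)
  # Every column that is not being unnested becomes a grouping column.
  by_cols <- setdiff(names(tbl), cols)
  tbl[, lapply(.SD, unlist), by = by_cols, .SDcols = cols]
}

df <- data.frame(a = LETTERS[1:5], b = LETTERS[6:10])
df$list_column_1 <- list(LETTERS[1:5], "F", "G", "H", "I")
df$list_column_2 <- list(LETTERS[1:5], "F", "G", "H", "I")

# Detect the list columns programmatically, then pass their names.
list_cols <- names(df)[vapply(df, is.list, logical(1))]
out <- unnest_dt_chr(df, list_cols)
```

Because the column names arrive as strings, this version can be called inside a loop over several data frames.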

Related

How to rename multiple Columns in R?

My goal is to get a concise way to rename multiple columns in a data frame. Let's consider a small data frame df as below:
df <- data.frame(a=1, b=2, c=3)
df
Let's say we want to change the names from a, b, and c to Y, Z, and E respectively.
Defining a character vector containing old names and new names.
df_names <- c(Y = "a", Z ="b", E = "c")
I would use this to rename the columns,
rename(df, !!!names)
df
suggestions?
One more !:
df <- data.frame(a=1, b=2, c=3)
df_names <- c(Y = "a", Z ="b", E = "c")
library(dplyr)
df %>% rename(!!!df_names)
## Y Z E
##1 1 2 3
A non-tidy way might be through match:
names(df) <- names(df_names)[match(names(df), df_names)]
df
## Y Z E
##1 1 2 3
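One caveat with the match approach: a column that has no entry in the lookup vector ends up with an NA name. A minimal base R sketch that renames only the matched columns might look like this (the shortened lookup vector is an assumption made for illustration):

```r
df <- data.frame(a = 1, b = 2, c = 3)
df_names <- c(Y = "a", Z = "b")  # assumed lookup with no entry for "c"

# Position of each current column name in the lookup; NA means "not listed".
idx <- match(names(df), df_names)
found <- !is.na(idx)

# Rename only the columns that have a new name; leave the others untouched.
names(df)[found] <- names(df_names)[idx[found]]
```

Here the column c keeps its original name while a and b become Y and Z.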
You could try:
sample(LETTERS[which(LETTERS %in% names(df) == FALSE)], size= length(names(df)), replace = FALSE)
[1] "S" "D" "N"
Here, you don't really care what the new names are, as you're using sample. Otherwise, a straightforward names(df) <- c('name1', 'name2'...

Function to select value from data frame based on value in list of lists

In R I need to write a function that applies the following rules:
fkt <- function(v1,v2){
- find list name within listOfLists that contains v1
- go to column in df, where listname == colname
- return value from row where df$col2 == v2
}
For Example:
df <- data.frame(Col1= c(1,2,3,4,5),
Col2= c("AAA","BBB","CCC","DDD","EEE"),
A = c("22","22","23","23","24"),
B = c("210","210","202","220","203"),
C = c("2000","2000","2010","2010","2200")
)
listOflists <- list(A <- c(1281, 1282, 1285, 1286, 1289),
B <- c(100,200,300,400,500,600,700,800,900,101,202,303,404,505,606),
C <-c(1000,1500,2000,2500,3000,3050,4000,4500,6000)
)
Then
fkt(800,"BBB")
> 210
I tried
fkt<- function(v1,v2){
r <- which(df$Col2== v1)
s <- ifelse(v2 %in% A, df$A[r],
ifelse( v2 %in% B ,df$A[r],df$C[r]))
return(s)
}
Alas, the result is NA.
And writing many ifelse() statements is not efficient - especially because the listOfLists might comprise >50 lists.
Can anyone advise me how to write this function in a general, efficient way as described above?
df <- data.frame(Col1= c(1,2,3,4,5),
Col2= c("AAA","BBB","CCC","DDD","EEE"),
A = c("22","22","23","23","24"),
B = c("210","210","202","220","203"),
C = c("2000","2000","2010","2010","2200"))
# Be careful: inside list(), = names the elements, whereas <- creates variables and leaves the elements unnamed
listOflists <- list(A = c(1281, 1282, 1285, 1286, 1289),
B = c(100,200,300,400,500,600,700,800,900,101,202,303,404,505,606),
C = c(1000,1500,2000,2500,3000,3050,4000,4500,6000))
f <- function(v1)
for (i in 1:length(listOflists))
if (v1 %in% listOflists[[i]]) return(names(listOflists)[i])
fkt <- function(v1,v2) df[df$Col2==v2,f(v1)]
fkt(800,"BBB")
#[1] 210
#Levels: 202 203 210 220
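The same lookup can be sketched without the explicit loop and without relying on global variables. The helper names find_list_name and fkt2 are hypothetical, and the sketch assumes character columns (stringsAsFactors = FALSE) so that no factor levels are printed.

```r
df <- data.frame(Col1 = 1:5,
                 Col2 = c("AAA", "BBB", "CCC", "DDD", "EEE"),
                 A = c("22", "22", "23", "23", "24"),
                 B = c("210", "210", "202", "220", "203"),
                 C = c("2000", "2000", "2010", "2010", "2200"),
                 stringsAsFactors = FALSE)
listOflists <- list(A = c(1281, 1282, 1285, 1286, 1289),
                    B = c(100, 200, 300, 400, 500, 600, 700, 800, 900,
                          101, 202, 303, 404, 505, 606),
                    C = c(1000, 1500, 2000, 2500, 3000, 3050, 4000, 4500, 6000))

# Name of the first list element that contains v1, or NA if none does.
find_list_name <- function(v1, lists) {
  hit <- vapply(lists, function(x) v1 %in% x, logical(1))
  if (!any(hit)) return(NA_character_)
  names(lists)[which(hit)[1]]
}

# Value in the matching column for the row(s) where Col2 equals v2.
fkt2 <- function(v1, v2, data, lists) {
  col <- find_list_name(v1, lists)
  if (is.na(col)) return(NA_character_)
  data[[col]][data$Col2 == v2]
}

res <- fkt2(800, "BBB", df, listOflists)
```

Returning NA when v1 is not found avoids an error from indexing with a NULL column name.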

Pass a df and variable argument in a function in R

How can I write a function with a df and variable argument, evaluating both? I read several posts and blog-posts from r-bloggers and I think I have some problem with the lazy-evaluation, but now I'm terribly confused.
This is my function:
RAM_char_func <- function(dataset, char_var){
a <- dataset[ , c("id", char_var)]
b <- a[[id]][is.na(a[[char_var]]) %in% FALSE]
c <- a[a[[id]] %in% b , ]
c
}
I get this:
Warning
Unknown or uninitialised column: 'char_var'.
This should give me a table (c) with two columns and a number of rows that depends on char_var. While the code works outside the function, I cannot get it working inside the function. I also tried the tidyverse approach with select and filter, but that doesn't work either.
I'm using R 3.5.1 with Mac OS X (High Sierra, 10.13.6) and R Studio (latest version).
dataframe
df <- data.frame(id = c(1:10),
var_10 = c(101:110),
var_25 = c("a", "b", NA, "c", NA, "d", NA, "e", "f", NA),stringsAsFactors = F)
df
Code Outside of the function is:
a <- df[ , c("id", "var_25")]
a
b <- a$id[is.na(a$var_25) %in% FALSE]
b
c <- a[a$id %in% b , ]
c
or
library(tidyverse)
df %>%
select(id, var_25) %>%
filter(is.na(var_25) %in% FALSE)
Here is one way to solve your issue:
library(tidyverse)
library(rlang)
remove.na <- function(data, col){
symcol <- enquo(col)
data %>% select(id, !!symcol) %>% filter(!is.na(!!symcol))
}
df %>% remove.na(var_25)
# id var_25
# 1 1 a
# 2 2 b
# 3 4 c
# 4 6 d
# 5 8 e
# 6 9 f
all_equal(df %>% select(id, var_25) %>% filter(!is.na(var_25)), df %>% remove.na(var_25))
[1] TRUE
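If the column name has to be passed as a string, as in the original question, a minimal sketch with dplyr could use all_of() for selection and the .data pronoun for filtering. The function name remove_na_chr is hypothetical, and the sketch assumes dplyr 1.0 or later.

```r
library(dplyr)

# Hypothetical helper: `col` is a column name given as a string.
remove_na_chr <- function(data, col) {
  data %>%
    select(id, all_of(col)) %>%      # all_of() selects by character vector
    filter(!is.na(.data[[col]]))     # .data[[col]] looks the column up by name
}

df <- data.frame(id = 1:10,
                 var_10 = 101:110,
                 var_25 = c("a", "b", NA, "c", NA, "d", NA, "e", "f", NA),
                 stringsAsFactors = FALSE)
res <- remove_na_chr(df, "var_25")
```

This keeps the call site identical to the base R function, RAM_char_func(df, "var_25"), while staying in a pipeline.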
In the spirit of "teach a man how to fish", you can use the reprex package to be sure you're running your buggy code in a clean environment, and the error will be very explicit (you can also just run the code in a new session as long as you don't have a messy Rprofile file) :
library(reprex)
reprex({
df <- data.frame(id = c(1:10),
var_10 = c(101:110),
var_25 = c("a", "b", NA, "c", NA, "d", NA, "e", "f", NA),stringsAsFactors = F)
RAM_char_func <- function(dataset, char_var){
a <- dataset[ , c("id", char_var)]
b <- a[[id]][is.na(a[[char_var]]) %in% FALSE]
c <- a[a[[id]] %in% b , ]
c
}
RAM_char_func(df,"var_25")
})
If you use Rstudio it will show the output in the Viewer tab, and the error that I get is :
> Error in (function(x, i, exact) if (is.matrix(i)) as.matrix(x)[[i]] else .subset2(x, : objet 'id' introuvable
(sorry for the french)
It tells you object id can't be found, then you can check why your code is trying to access an object named id and you'll find easily the mistake in the 2nd instruction of your function.
If finding the sinful line is hard for you, try running traceback() right after the error or call debugonce(RAM_char_func) then RAM_char_func(df,"var_25") and browse until you find the line which fails.
Your function contained an error: [[id]] must be [["id"]].
RAM_char_func <- function(dataset, char_var){
a <- dataset[ , c("id", char_var)]
b <- a[["id"]][is.na(a[[char_var]]) %in% FALSE]
c <- a[a[["id"]] %in% b , ]
c
}
data:
df <- data.frame(id = c(1:10),
var_10 = c(101:110),
var_25 = c("a", "b", NA, "c", NA, "d", NA, "e", "f", NA),stringsAsFactors = F)
call your function:
RAM_char_func(dataset=df,char_var="var_25")
# id var_25
#1 1 a
#2 2 b
#4 4 c
#6 6 d
#8 8 e
#9 9 f
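Since the three steps only keep the rows where char_var is not missing, the function can also be sketched as a single subsetting call. The name keep_non_missing and the id_var argument are hypothetical.

```r
# Hypothetical one-step version: filter rows and pick columns in one call.
keep_non_missing <- function(dataset, char_var, id_var = "id") {
  keep <- !is.na(dataset[[char_var]])
  dataset[keep, c(id_var, char_var), drop = FALSE]
}

df <- data.frame(id = 1:10,
                 var_10 = 101:110,
                 var_25 = c("a", "b", NA, "c", NA, "d", NA, "e", "f", NA),
                 stringsAsFactors = FALSE)
res <- keep_non_missing(df, "var_25")
```

Both column names are strings here, so [[ ]] and [ , ] never look for an object called id.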

data.table sum by group and return row with max value

I have a data.table in this fashion:
dd <- data.table(f = c("a", "a", "a", "b", "b"), g = c(1,2,3,4,5))
dd
I need to sum the values g by factor f, and finally return a single row data.table object that has the maximum value of g, but that also contains the factor information. i.e.
   f g
1: b 9
My closest attempt so far is
tmp3 <- dd[, sum(g), by = f][, max(V1)]
tmp3
Which results in:
> tmp3
[1] 9
EDIT: I'm ideally looking for a purely data.table piece of code/workflow. I'm surprised that, with all the fast split-apply-combine wizardry and the ability to subset your data in the form of 'example[i = subset, ]', I haven't found a straightforward way to subset on a single value condition.
Here's one way to do it:
library(data.table)
dd <- data.table(
f = c("a", "a", "a", "b", "b"),
g = c(1,2,3,4,5))
##
> dd[,list(g = sum(g)),by=f][which.max(g),]
f g
1: b 9
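Note that which.max returns only the first maximum. If several groups can tie for the largest sum and all of them should be returned, a minimal sketch is to compare against max(g) instead (the extra group "c" is an assumption added to create a tie):

```r
library(data.table)

# Assumed data: group "c" is added so that two groups share the top sum.
dd <- data.table(f = c("a", "a", "a", "b", "b", "c"),
                 g = c(1, 2, 3, 4, 5, 9))

# Sum g per group, then keep every row whose total equals the maximum.
totals <- dd[, .(g = sum(g)), by = f]
top <- totals[g == max(g)]
```

With which.max(g) in place of g == max(g), only the row for "b" would be kept.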
You can use dplyr syntax on a data.table, in this case:
library(dplyr)
dd %>%
group_by(f) %>%
summarise (g = sum(g)) %>%
top_n(1, g)
Source: local data table [1 x 2]
f g
1 b 9
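In more recent dplyr releases (1.0 and later), top_n is superseded by slice_max. A minimal sketch of the same pipeline on a plain data frame, assuming such a version is installed:

```r
library(dplyr)

dd <- data.frame(f = c("a", "a", "a", "b", "b"),
                 g = c(1, 2, 3, 4, 5))

# Sum g within each group, then keep the row with the largest total.
res <- dd %>%
  group_by(f) %>%
  summarise(g = sum(g)) %>%
  slice_max(g, n = 1)
```

By default slice_max keeps ties, so more than one row can come back when groups share the maximum.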

Subset and join a data frame by matching on nested list in R

I'm attempting to join two data frames, df and myData, according to elements in a column from each. The column in df purposefully contains nested lists, and I would like to join if an element in the nested list matches an element of myData. I'd like to keep unmatched rows in df (left join).
Here is an example, first without nested lists in df.
df = data.frame(a=1:5)
df$x1= c("a", "b", "g", "a", "a")
str(df)
'data.frame': 5 obs. of 2 variables:
$ a : int 1 2 3 4 5
$ x1: chr "a" "b" "g" "a" ...
myData <- data.frame(x1=c("a", "g", "q"), x2= c("za", "zg", "zq"), stringsAsFactors = FALSE)
Now, we can join on column x1:
#using a for loop
df$x2 <- NA
for(id in 1:nrow(myData)){
df$x2[df$x1 %in% myData$x1[id]] <- myData$x2[id]
}
Or using dplyr:
library(dplyr)
df = data.frame(a=1:5)
df$x1= c("a", "b", "g", "a", "a")
df %>%
left_join(myData)
Now, consider df with nested lists.
l1 = list(letters[1:5])
l2 = list(letters[6:10])
df = data.frame(a=1:5)
df$x1= c("a", "b", "g", l1, l2)
Using a for loop fails to match on elements of a nested list, as we expect:
df$x2 <- NA
for(id in 1:nrow(myData)){
df$x2[df$x1 %in% myData$x1[id]] <- myData$x2[id]
}
output:
df
a x1 x2
1 1 a za
2 2 b <NA>
3 3 g zg
4 4 a, b, c, d, e <NA>
5 5 f, g, h, i, j <NA>
Using dplyr:
df %>%
left_join(myData)
throws an error:
Joining by: c("x1", "x2")
Error: cannot join on column 'x1'
I think the solution needs to unlist the nested lists, but haven't sorted out how to work the unlist function into the above strategies.
I've also tried the above with the data.table package. How to accomplish this with data.table may be an additional question. But, to the extent that data.table handles lists within data frames, I wanted to include it, as it may provide the best solution.
My actual data is about 100,000 rows, so the matching on lists with base R could be a performance annoyance (another reason to consider data.table ?)
Fwiw, the use of nested lists (and other structures) within data frames is something I would often do in Python, and it may be there is a better way to structure the data in the first place in R.
Thoughts?
Here is a possible solution:
df$x2 <- NA
for(id in 1:nrow(df))
{
df$x2[id] <- ifelse(
length(ff <- myData$x2[which(myData$x1 == intersect(df$x1[[id]], myData$x1))])==0,
NA,
ff)
}
df
# a x1 x2
#1 1 a za
#2 2 b <NA>
#3 3 g zg
#4 4 a, b, c, d, e za
#5 5 f, g, h, i, j zg
There are some potential pitfalls with the above solution. For example, if we change l1 to have 2 possible matches (e.g. "a" and "g") :
l1 = list(letters[1:7])
df$x1= c("a", "b", "g", l1, l2)
This solution will not catch both matches, as is:
df$x2 <- NA
for(id in 1:nrow(df))
{
df$x2[id] <- ifelse(
length(ff <- myData$x2[which(myData$x1 == intersect(df$x1[[id]], myData$x1))])==0,
NA,
ff)
}
Warning message:
In myData$x1 == intersect(df$x1[[id]], myData$x1) :
longer object length is not a multiple of shorter object length
You could modify it to allow multiple matches, if needed. Here are two different ways to do that, one uses paste and one uses list in the way you did in the problem.
df$x2 <- NA
for(id in 1:nrow(df))
{
df$x2[id] <-
paste(if (length(ff <- myData$x2[which(myData$x1 %in% intersect(df$x1[[id]], myData$x1))])==0)
NA else
ff, collapse=", ")
}
df$x2 <- NA
for(id in 1:nrow(df))
{
df$x2[id] <-
list(if (length(ff <- myData$x2[which(myData$x1 %in% intersect(df$x1[[id]], myData$x1))])==0)
NA else
ff)
}
Both will return the following, but the underlying structure will be different:
a x1 x2
1 1 a za
2 2 b NA
3 3 g zg
4 4 a, b, c, d, e, f, g za, zg
5 5 f, g, h, i, j zg
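The list-valued variant can also be sketched without an explicit for loop by building a named lookup vector and applying it to each element of x1. The name lookup is hypothetical; the data are the extended example with two possible matches.

```r
myData <- data.frame(x1 = c("a", "g", "q"),
                     x2 = c("za", "zg", "zq"),
                     stringsAsFactors = FALSE)
df <- data.frame(a = 1:5)
df$x1 <- c("a", "b", "g", list(letters[1:7]), list(letters[6:10]))

# Named vector: names are the keys (x1), values are the payload (x2).
lookup <- setNames(myData$x2, myData$x1)

# For each row, collect every x2 whose key occurs in that row's x1 element.
df$x2 <- lapply(df$x1, function(keys) {
  hits <- unname(lookup[intersect(keys, names(lookup))])
  if (length(hits) == 0) NA_character_ else hits
})
```

The result is a list column, so row 4 holds both "za" and "zg" as separate elements rather than as one pasted string.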
I think this might work. When you're recursively operating on a list, it's a good idea to write a helper function to get the values.
getMatch <- function(x, y) {
z <- y[[2]][sort(match(x, y[[1]]))]
z[!length(z)] <- NA
z
}
> rapply(unname(df[-1]), getMatch, y = myData)
# [1] "za" NA "zg" "za" "zg"
Or we can assign a new column using within
> within(df, { x2 <- sapply(df$x1, getMatch, y = myData) })
# a x1 x2
#1 1 a za
#2 2 b <NA>
#3 3 g zg
#4 4 a, b, c, d, e za
#5 5 f, g, h, i, j zg
Here's a data.table option:
library(data.table)
# convert to data.table in place
setDT(myData)
# using Frank's extended example
l1 = list(letters[1:7])
l2 = list(letters[6:10])
dt = data.table(a=1:5, x1 = c("a", "b", "g", l1, l2))
# unlist the lists (and to be honest, that's how I would store the data,
# I think the column of lists is a bad idea), then set the keys, merge, and
# go back to columns of lists
setkey(dt[, unlist(x1), by = a], V1)[myData, x2 := i.x2][,
list(x1 = list(V1), x2 = list(na.omit(x2))), keyby = a]
# a x1 x2
#1: 1 a za
#2: 2 b
#3: 3 g zg
#4: 4 a,b,c,d,e,f, za,zg
#5: 5 f,g,h,i,j zg
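The chained expression above can be hard to read, so here is a step-by-step sketch of the same idea that uses merge instead of a keyed update join. The intermediate names long, joined and res are hypothetical.

```r
library(data.table)

myData <- data.table(x1 = c("a", "g", "q"), x2 = c("za", "zg", "zq"))
dt <- data.table(a = 1:5,
                 x1 = list("a", "b", "g", letters[1:7], letters[6:10]))

# 1. Long format: one row per (a, element of x1).
long <- dt[, .(x1 = unlist(x1)), by = a]

# 2. Left join on x1; elements without a match get NA in x2.
joined <- merge(long, myData, by = "x1", all.x = TRUE)

# 3. Collapse back to one row per a, keeping the matches as a list column.
res <- joined[, .(x2 = list(x2[!is.na(x2)])), keyby = a]
```

Rows of a with no match end up with an empty character vector in x2.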
