The problem
Hi all,
I am trying to join a few dataframes together dynamically. For me that means that I start with one dataframe, df_A, to which I want to join multiple other dataframes: df_B1, df_B2, df_B3, etc.
df_A contains a column for each of the df_B... tables to join against: Column_join_B1, Column_join_B2, Column_join_B3, etc. (although in reality these have obscure names). These names are also stored in a vector, df_A_join_names.
df_B1, df_B2, df_B3, etc. are stored in a list, df_B, which I understand is good practice :). This is also how I access them in my loop.
Each of these has two columns: one with the value to join against df_A, the other with information.
I even tried renaming the first column to match the column in df_A before the join, but to no avail.
What I am trying
left_join() does not allow me to simply use by = c(df_A_join_names[1], "Column_join_A"), so I have to use setNames, but I cannot get this to work.
Below is a function which I want to iterate in a loop:
my_join <- function(df_a, df_b, a_name, b_name){
  df_joined <- left_join(df_a, df_b,
                         by = setNames(b_name, a_name))
  return(df_joined)
}
I want to use this function in a loop to join all my df_B... dataframes against df_A.
for (i in 1:length(df_A_join_names)){
  df_A <- my_join(df_a = df_A,
                  df_b = df_B[i],
                  a_name = as.character(df_A_join_names[i]),
                  b_name = "Column_join_A")
}
Running this I get:
Error in UseMethod("tbl_vars") :
no applicable method for 'tbl_vars' applied to an object of class "list"
Some stuff to play with
#Making df_A
A_a <- seq(1,10, by = 1)
Column_join_B1 <- seq(11,20, by = 1)
Column_join_B2 <- seq(21,30, by = 1)
df_A <- data.frame(cbind(A_a, Column_join_B1, Column_join_B2) )
#Making df_B
Column_join_A <- seq(11,20, by = 1)
B_a <- LETTERS[1:10]
df_B1 <- data.frame(Column_join_A, B_a )
Column_join_A <- seq(21,30, by = 1)
B_b <- LETTERS[11:20]
df_B2 <- data.frame(Column_join_A, B_b)
# In my own code I make this using a loop. maybe not the prettiest.
df_B <- list()
df_B[[1]] <- df_B1
df_B[[2]] <- df_B2
df_A_join_names <- c("Column_join_B1", "Column_join_B2")
I'm trying to apply this:
Dplyr join on by=(a = b), where a and b are variables containing strings?
I'm curious to hear what you guys think!
There's no need to build a specific function; you can simply use setNames within the left_join call:
df_B_join_name <- "Column_join_A"
for (i in 1:length(df_A_join_names)){
  df_A <- left_join(df_A, df_B[[i]], by=c(setNames(nm = df_A_join_names[i], df_B_join_name)))
}
You were very close! The only thing you might need to change is the way you reference the data frame under list df_B. df_B[1] will still be a list, df_B[[1]] will return a data frame. I ran the code below and it worked for me.
for (i in 1:length(df_A_join_names)){
  df_A <- my_join(df_a = df_A,
                  df_b = df_B[[i]],
                  a_name = as.character(df_A_join_names[i]),
                  b_name = "Column_join_A")
}
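To see the difference for yourself, here is a minimal sketch with a made-up two-element list (toy_list and its contents are only for illustration, not the data from the question):

```r
# A toy list of data frames (hypothetical data)
toy_list <- list(data.frame(x = 1:2), data.frame(y = 3:4))

# Single brackets return a sub-list, which left_join() cannot handle
class(toy_list[1])    # "list"

# Double brackets return the element itself
class(toy_list[[1]])  # "data.frame"
```

This is why the error message mentions an object of class "list": the join received a one-element list rather than the data frame inside it.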
First, rename the first column of each data frame in df_B to match the corresponding column in df_A, so that df_B looks like this:
# [[1]]
# Column_join_B1 B_a
# 1 11 A
# 2 12 B
# . . .
# . . .
# . . .
# [[2]]
# Column_join_B2 B_b
# 1 21 K
# 2 22 L
# . . .
# . . .
# . . .
Next, use Reduce() in base or reduce() in purrr to apply left_join repeatedly. You don't even need the for loop.
Reduce(left_join, df_B, init = df_A)
# A_a Column_join_B1 Column_join_B2 B_a B_b
# 1 1 11 21 A K
# 2 2 12 22 B L
# 3 3 13 23 C M
# 4 4 14 24 D N
# 5 5 15 25 E O
# 6 6 16 26 F P
# 7 7 17 27 G Q
# 8 8 18 28 H R
# 9 9 19 29 I S
# 10 10 20 30 J T
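The renaming step is not spelled out above, so here is one possible sketch of it. It assumes the key column in every element of df_B is called Column_join_A, as in the question. To keep it free of package dependencies it uses base merge() with all.x = TRUE as a stand-in for left_join, so the column order differs from the output shown above. The names df_B_renamed and joined are my own.

```r
# Data from the question
df_A <- data.frame(A_a = 1:10, Column_join_B1 = 11:20, Column_join_B2 = 21:30)
df_B <- list(
  data.frame(Column_join_A = 11:20, B_a = LETTERS[1:10]),
  data.frame(Column_join_A = 21:30, B_b = LETTERS[11:20])
)
df_A_join_names <- c("Column_join_B1", "Column_join_B2")

# Rename the key column of each df_B element to its df_A counterpart
df_B_renamed <- Map(function(df, key) {
  names(df)[names(df) == "Column_join_A"] <- key
  df
}, df_B, df_A_join_names)

# Fold the list into df_A; merge() joins on the one column name they share
joined <- Reduce(function(x, y) merge(x, y, all.x = TRUE), df_B_renamed, df_A)
```

With dplyr loaded, the last line could be swapped for the Reduce(left_join, ...) call shown above.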
This is a follow-up to my previous question here, which @ronak_shah was kind enough to answer. I apologize, as some of this information may be redundant to anyone who saw that post, but I figured it best to post a new question rather than modify the previous version.
I would still like to iterate through a stored list of columns and procedures to create n new columns based on this list. In the example below, we start with 3 columns, a, b, c and a simple function, func1.
The data frame col_mod identifies which column should be changed, what the second argument to the function that changes them should be, and then generates a statement to execute the function. Each of these modifications should be an addition to the original data frame, rather than replacements of the specified columns. The new names of these columns should be a_new and c_new, respectively.
At the bottom of the reprex below, I am able to obtain my desired result manually, but as before, I would like to automate this using a mapping function.
I am attempting to use the same approach that was provided as an answer to my previous question, but I keep on getting the following error: "Error in get(as.character(FUN), mode = "function", envir = envir) : object 'func1(a,3)' of mode 'function' was not found"
If anyone can help, it would be much appreciated!
## fake data
dat <- data.frame(a = 1:5,
b = 6:10,
c = 11:15)
## function
func1 <- function(x, y) {x + y}
## modification list
col_mod <- data.frame("col" = c("a", "c"),
"y_val" = c(3, 4),
stringsAsFactors = FALSE) %>%
mutate(func = paste0("func1(", col, ",", y_val, ")"))
## desired end result
dat %>%
mutate(a_new = func1(a, 3),
c_new = func1(c, 4))
## attempting to generate new columns based on @ronak_shah's answer to my previous
## question, but this fails to run
dat[paste0(col_mod$col, '_new')] <- Map(function(x, y),
dat[col_mod$col], col_mod$func)
We can use pmap from purrr. For each row of col_mod, transmute creates one new column: the source column comes from 'col' (..1), the second argument from 'y_val' (..2), and the function from 'func' (..3). The new column name is built as a string with str_c (or paste) and assigned with :=, and the resulting columns are then bound to the original dataset.
library(dplyr)
library(purrr)
library(stringr)
col_mod$func <- 'func1'
pmap(col_mod, ~ dat %>%
       transmute(!! str_c(..1, "_new") := match.fun(..3)(!! rlang::sym(..1), ..2))) %>%
  bind_cols(dat, .)
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
If we want to parse the function call as it is, use parse_expr and eval, i.e. without changing the func column, so that it remains func1(a,3) and func1(c,4):
pmap(col_mod, ~ dat %>%
transmute(!! str_c(..1, "_new") :=
eval(rlang::parse_expr(..3)))) %>%
bind_cols(dat, .)
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
Or using base R with Map
dat[paste0(col_mod$col, '_new')] <- do.call(Map, c(f =
  function(x, y, z) eval(parse(text = z), envir = dat), unname(col_mod)))
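If the function is always func1, a variant that avoids parsing strings altogether might be easier to reason about. This is only a sketch under that assumption: it walks over the 'col' and 'y_val' columns with Map and looks up each source column by name. The name new_cols is my own.

```r
## data and function from the question
dat <- data.frame(a = 1:5, b = 6:10, c = 11:15)
func1 <- function(x, y) {x + y}
col_mod <- data.frame(col = c("a", "c"),
                      y_val = c(3, 4),
                      stringsAsFactors = FALSE)

## apply func1 to each named column with its matching y_val
new_cols <- Map(function(col, y) func1(dat[[col]], y),
                col_mod$col, col_mod$y_val)

## add the results as new columns next to the originals
dat[paste0(col_mod$col, "_new")] <- new_cols
```

The original columns a and c are left untouched, and a_new and c_new are appended to dat.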
I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
"v1" = rep(c("A", "B"), c(10,20)),
"v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
  df <- df[df$v1 == cat,]
  df <- df[order(-df$v2),]
  #Divide the customers into top percents
  nr <- nrow(df)
  p10 <- round(nr * .10, 0)
  cat("Number of people in the Top 10% :", p10, "\n")
  p20 <- round(nr * .20, 0)
  p11_20 <- p20-p10
  cat("Number of people in the 11-20% :", p11_20, "\n")
  #Keep only those customers in the top groups
  df <- df[1:p20,]
  #Create a variable to identify the percent group the customer is in
  top_pct <- integer(length = p10 + p11_20)
  #Identify those in each group
  top_pct[1:p10] <- 10
  top_pct[(p10+1):p20] <- 20
  #Add this variable to the data frame
  df$top_pct <- top_pct
  #Keep only custID and the new variable
  df <- subset(df, select = c(custID, top_pct))
  return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for(i in 1:length(res)) {
  names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for(i in 2:length(res)) {
  res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
Stick to your Stata instincts and use a single data set:
library(data.table)
DT <- data.table(data)
DT[, r := rank(v2)/.N, by = v1]
You can see the result by typing DT.
From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
DT[, g := {
  x = rep(0,.N)
  x[r>.8] = 20
  x[r>.9] = 10
  x
}]
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
Finally, you can subset with
DT[g > 0]
which gives
custID v1 v2 r g
1: 1 A 30 1.000 10
2: 2 A 29 0.900 20
3: 1 B 20 0.975 10
4: 2 B 19 0.875 20
5: 6 B 20 0.975 10
6: 7 B 19 0.875 20
These steps can also be chained together:
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
(Thanks to @ExperimenteR:)
To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
dcast(
  DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0],
  custID ~ v1, value.var = "g")
Currently, dcast requires the latest version of data.table, available (I think) from GitHub.
You don't need the function pf to achieve what you want. Try a dplyr/tidyr combo:
library(dplyr)
library(tidyr)
data %>%
  group_by(v1) %>%
  arrange(desc(v2)) %>%   # sort by v2 so the first rows are the top customers
  mutate(n=n()) %>%
  filter(row_number() <= round(n * .2)) %>%
  mutate(top_pct= ifelse(row_number()<=round(n* .1), 10, 20)) %>%
  select(custID, top_pct) %>%
  spread(v1, top_pct)
# custID A B
#1 1 10 10
#2 2 20 20
#3 6 NA 10
#4 7 NA 20
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(id=df$custID, top_pct))
})
Finding quantiles is done with quantile.
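As for the two loops in the question (renaming and then merging), one way to replace them is Map for the renaming and Reduce for the merging. The sketch below assumes res looks like the list printed in the question, so it is rebuilt by hand here. The name res_named is my own.

```r
# Hand-built stand-in for the list returned by lapply(v1Levels, pf, df = data)
res <- list(
  data.frame(custID = c(1, 2), top_pct = c(10, 20)),
  data.frame(custID = c(1, 6, 2, 7), top_pct = c(10, 10, 20, 20))
)
v1Levels <- c("A", "B")

# Rename the second column of each data frame after its category
res_named <- Map(function(df, lev) {
  names(df)[2] <- paste0(lev, "_top_pct")
  df
}, res, v1Levels)

# Merge all data frames pairwise on custID, keeping unmatched rows
res_m <- Reduce(function(x, y) merge(x, y, by = "custID", all = TRUE), res_named)
```

This works for any number of categories, since Reduce keeps merging the running result with the next data frame in the list.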
I have a list of files. I also have a list of "names" which I substr() from the actual filenames of these files. I would like to add a new column to each of the files in the list. This column will contain the corresponding element in "names", repeated once for each row of the file.
For example:
df1 <- data.frame(x = 1:3, y=letters[1:3])
df2 <- data.frame(x = 4:6, y=letters[4:6])
filelist <- list(df1,df2)
ID <- c("1A","IB")
for( i in length(filelist)){
filelist[i]$SampleID <- rep(ID[i],nrow(filelist[i])
// basically create a new column in each of the dataframes in filelist, and fill the column with repeated corresponding values of ID
My output should be like:
filelist[[1]] should be:
x y SampleID
1 1 a 1A
2 2 b 1A
3 3 c 1A
x y SampleID
1 4 d IB
2 5 e IB
3 6 f IB
and so on.....
Any idea how it could be done?
An alternate solution is to use cbind, taking advantage of the fact that R will recycle values of a shorter vector.
For Example
x <- df2 # from above
cbind(x, NewColumn="Singleton")
# x y NewColumn
# 1 4 d Singleton
# 2 5 e Singleton
# 3 6 f Singleton
There is no need for the use of rep. R does that for you.
Therefore, you could put filelist[[i]] <- cbind(filelist[[i]], SampleID = ID[i]) in your for loop or, as @Sven pointed out, you can use the cleaner mapply:
filelist <- mapply(cbind, filelist, "SampleID"=ID, SIMPLIFY=FALSE)
This is a corrected version of your loop:
for( i in seq_along(filelist)){
  filelist[[i]]$SampleID <- rep(ID[i],nrow(filelist[[i]]))
}
There were 3 problems:
A final ) was missing after the command in the body.
Elements of lists are accessed by [[, not by [. [ returns a list of length one. [[ returns the element only.
length(filelist) is just one value, so the loop runs for the last element of the list only. I replaced it with seq_along(filelist).
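The third point is easy to miss, so here is a small sketch that records which indices each loop header actually visits. The two-element list is a trimmed version of the one in the question, and the names visited_length and visited_seq are made up for this illustration.

```r
filelist <- list(data.frame(x = 1:3), data.frame(x = 4:6))

# length(filelist) is the single value 2, so the loop body runs once
visited_length <- integer(0)
for (i in length(filelist)) visited_length <- c(visited_length, i)

# seq_along(filelist) is 1, 2, so every element is visited
visited_seq <- integer(0)
for (i in seq_along(filelist)) visited_seq <- c(visited_seq, i)

visited_length  # 2
visited_seq     # 1 2

# seq_along() is also safe for an empty list, where 1:length() would count 1, 0
seq_along(list())  # integer(0)
```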
A more efficient approach is to use mapply for the task:
mapply(function(x, y) "[<-"(x, "SampleID", value = y) ,
filelist, ID, SIMPLIFY = FALSE)
This one worked for me:
Create a new column for every dataframe in a list; fill the values of the new column based on existing column. (In your case IDs).
# Create dummy data
df1<-data.frame(a = c(1,2,3))
df2<-data.frame(a = c(5,6,7))
# Create a list
l<-list(df1, df2)
> l
[[1]]
  a
1 1
2 2
3 3

[[2]]
  a
1 5
2 6
3 7
# add new column 'b'
# create 'b' values based on column 'a'
l2<-lapply(l, function(x)
cbind(x, b = x$a*4))
Results in:
> l2
[[1]]
  a  b
1 1  4
2 2  8
3 3 12

[[2]]
  a  b
1 5 20
2 6 24
3 7 28
In your case something like:
filelist<-lapply(filelist, function(x)
cbind(x, b = x$SampleID))
The purrr way, using map2
map2(filelist, ID, ~cbind(.x, SampleID = .y))
# [[1]]
#   x y SampleID
# 1 1 a       1A
# 2 2 b       1A
# 3 3 c       1A
#
# [[2]]
#   x y SampleID
# 1 4 d       IB
# 2 5 e       IB
# 3 6 f       IB
Or can also use
map2(filelist, ID, ~.x %>% mutate(SampleId = .y))
If you name the list, we can use imap and add the new column based on its name.
names(filelist) <- c("1A","IB")
imap(filelist, ~cbind(.x, SampleID = .y))
#imap(filelist, ~.x %>% mutate(SampleId = .y))
which is similar to using Map
Map(cbind, filelist, SampleID = names(filelist))
A tricky way:
names(filelist) <- ID
result <- ldply(filelist, data.frame)
data_lst <- list(
  data_1 = data.frame(c1 = 1:3, c2 = 3:1),
  data_2 = data.frame(c1 = 1:3, c2 = 3:1)
)
# Add the list element's name as a new column and return the data frame
f <- function (data, name){
  data$name <- name
  data
}
Map(f, data_lst , names(data_lst))
For my question I created a dummy data frame:
DF <- data.frame(a = rep(LETTERS[1:5], each=2), b = sample(40:49), c = sample(1:10))
a b c
1 A 49 2
2 A 43 3
3 B 40 7
4 B 47 1
5 C 41 9
6 C 48 8
7 D 45 6
8 D 42 5
9 E 46 10
10 E 44 4
How can I use the aggregation function on column a so that, for instance, for "A" the following value is calculated: (49 - 43) / (2 + 3)?
I started like:
aggregate(DF, by=list(DF$a), FUN=function(x) {
The problem I have is that I do not know how to access the 4 different cells 49, 43, 2 and 3
I tried x[[1]][1] and similar stuff but don't get it working.
Inside aggregate, the function FUN is applied independently to each column of your data. Here you want to use a function that takes two columns as inputs, so a priori, you can't use aggregate for that.
Instead, you can use ddply from the plyr package:
ddply(DF, "a", summarize, res = (b[1] - b[2]) / sum(c))
# a res
# 1 A 1.2000000
# 2 B -0.8750000
# 3 C -0.4117647
# 4 D 0.2727273
# 5 E 0.1428571
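If you would rather not load plyr, the same per-group calculation can be sketched in base R with split and sapply. The data frame below hard-codes the values printed in the question (the original used sample(), so your own DF will differ), and res and out are names of my choosing.

```r
# DF as printed in the question
DF <- data.frame(
  a = rep(LETTERS[1:5], each = 2),
  b = c(49, 43, 40, 47, 41, 48, 45, 42, 46, 44),
  c = c(2, 3, 7, 1, 9, 8, 6, 5, 10, 4)
)

# Split into one small data frame per level of a, then use both columns at once
res <- sapply(split(DF, DF$a), function(g) (g$b[1] - g$b[2]) / sum(g$c))

# Put the named vector back into a data frame
out <- data.frame(a = names(res), res = unname(res))
```

Because each piece passed to the function is a whole data frame, both b and c are available inside it, which is exactly what aggregate does not give you.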
When you aggregate, the FUN argument can be anything you want. Keep in mind that the value passed will either be a vector (if x is one column) or a little data.frame or matrix (if x is more than one). However, aggregate doesn't let you access the columns of a multi-column argument. For example:
aggregate( . ~ a, data = DF, FUN = function(x) diff(x[,1]) / sum(x[,2]) )
That fails with an error even though I used . (which takes all of the columns of DF that I'm not using elsewhere). To see what aggregate is trying to do there look at the following.
aggregate( . ~ a, data = DF, FUN = sum )
The two columns, b and c, were aggregated, but from the first attempt we know that you can't do something that accesses each column separately. So, strictly sticking with aggregate, you need two passes and three lines of code.
diffb <- aggregate( b ~ a, data = DF, FUN = diff )
Y <- aggregate( c ~ a, data = DF, FUN = sum )
# diff() gives b[2] - b[1], so negate it to get b[1] - b[2]
Y$c <- -diffb$b / Y$c
Now Y contains the result you want.
The by function is simpler than aggregate and all it does is split the original data.frame using the indices and then apply the FUN function.
l <- by( data = DF, INDICES = DF$a, FUN = function(x) -diff(x$b)/sum(x$c), simplify = FALSE )
You have to do a little to get the result back into a data.frame if you really want one.
data.frame(a = names(l), x = unlist(l))
Using data.table could be faster and easier.
DT <- data.table(DF)
DT[, (-1*diff(b))/sum(c), by=a]
a V1
1: A 1.2000000
2: B -0.8750000
3: C -0.4117647
4: D 0.2727273
5: E 0.1428571
Using aggregate, not so good. I didn't find a better way to do it using aggregate :( but here's an attempt.
B <- aggregate(DF$b, by=list(DF$a), diff)
C <- aggregate(DF$c, by=list(DF$a), sum)
data.frame(a=B[,1], Result=(-1*B[,2])/C[,2])
a Result
1 A 1.2000000
2 B -0.8750000
3 C -0.4117647
4 D 0.2727273
5 E 0.1428571
A data.table solution - for efficiency of time and memory.
library(data.table)
DT <- as.data.table(DF)
# diff(b) is b[2] - b[1], so negate it to get b[1] - b[2]
DT[, list(calc = -diff(b) / sum(c)), by = a]
You can use the base by() function:
listOfRows <- by(DF, DF$a,
  FUN=function(x){data.frame(a=x$a[1],res=(x$b[1] - x$b[2])/(x$c[1] + x$c[2]))})
newDF <- do.call(rbind, listOfRows)
I would like to aggregate a data.frame over 3 categories, with one of them varying. Unfortunately this one varying category contains NAs (actually it's the reason why it needs to vary). Thus I created a list of data.frames. Every data.frame within this list contains only complete cases with respect to three variables (with only one of them changing).
Let's reproduce this:
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# create a list of logical vectors (TRUE where the value is not NA)
noNAList <- function(vec){
  res <- !is.na(vec)
  return(res)
}
testTF <- lapply(mydata[,c("category","categoryA")],noNAList)
# create a list of data.frames
selectDF <- function(TFvec){
  res <- mydata[TFvec,]
  return(res)
}
# check x and see that it may contain NAs as long
# as it's not in one of the 3 categories I want to aggregate over
x <-lapply(testTF,selectDF)
## let's ddply get to work
library(plyr)
doddply <- function(df){
  ddply(df,.(group,size),summarize,sumTest = sum(someValue))
}
y <- lapply(x, doddply);y
y comes very close to what I want to get
group size sumTest
1 A L 375
2 A M 198
3 A H 185
4 B L 254
5 B M 259
6 B H 169
group size sumTest
1 A L 375
2 A M 204
3 A H 200
4 B L 254
5 B M 259
6 B H 169
But I need to implement aggregation over a third varying variable, which is in this case category and categoryA. Just like:
group size category sumTest sumTestTotal
1 A H 1 46 221
2 A H 2 46 221
3 A H 3 93 221
and so forth. How can I add names(x) to lapply, or do I need a loop or environment here?
Note that I want EITHER category OR categoryA added to the mix. In reality I have about 15 mutually exclusive categorical vars.
I think you might be making this really hard on yourself, if I understand your question correctly.
If you want to aggregate the data.frame 'mydata' by three (or four) variables, you would simply do this:
aggregate(someValue ~ group + size + category + categoryA, sum, data=mydata)
group size category categoryA someValue
1 A L 1 A 51
2 B L 1 A 19
3 A M 1 A 17
4 B M 1 A 63
aggregate will automatically remove rows that include NA in any of the categories. If someValue is sometimes NA, then you can add the parameter na.rm=T.
I also noted that you put a lot of unnecessary code into functions. For example:
# create a list of data.frames
selectDF <- function(TFvec){
  res <- mydata[TFvec,]
  res
}
Can be written like:
selectDF <- function(TFvec) mydata[TFvec,]
Also, using lapply to create a list of two data frames without the NA is overkill. Try this code:
x = list(mydata[!is.na(mydata$category),], mydata[!is.na(mydata$categoryA),])
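If you do need one result per categorical variable, as in the question, a base R sketch along these lines may help. It loops over the variable names, builds each formula with reformulate(), and lets aggregate drop the rows where that particular category is NA. The names cat_vars and res are my own, and it assumes the sum of someValue is the only statistic you need.

```r
# Sample data from the question
mydata <- warpbreaks
names(mydata) <- c("someValue", "group", "size")
mydata$category  <- c(1, 2, 3)
mydata$categoryA <- c("A", "A", "X", "X", "Z", "Z")
mydata$category[c(8, 10, 19)]  <- NA
mydata$categoryA[c(14, 1, 20)] <- NA

cat_vars <- c("category", "categoryA")

# One aggregation per varying category; the result list is named by variable
res <- lapply(setNames(cat_vars, cat_vars), function(v) {
  # Keep only the columns in play so NAs in the other category do not matter
  sub <- mydata[, c("someValue", "group", "size", v)]
  f <- reformulate(c("group", "size", v), response = "someValue")
  aggregate(f, data = sub, FUN = sum)
})

lapply(res, head, 3)
```

Extending this to your 15 or so variables should only require a longer cat_vars vector.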
I know the question explicitly requests a ddply()/lapply() solution.
But ... if you are willing to come on over to the dark side, here is a data.table()-based function that should do the trick:
# Convert mydata to a data.table
library(data.table)
dt <- data.table(mydata, key = c("group", "size"))
# Define workhorse function
myfunction <- function(dt, VAR) {
    # Keep only the rows where the varying category VAR is not NA
    dt[i = !is.na(dt[[VAR]]),
       j = {n <- sum(.SD[,someValue])
            .SD[, list(sumTest = sum(someValue),
                       sumTestTotal = n,
                       share = sum(someValue)/n),
                by = VAR]
           },
       by = key(dt)]
}
# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")
Here's how you could run this for a vector of different categorical variables:
catVars <- c("category", "categoryA")
ll <- lapply(catVars,
             FUN = function(X) {
                 do.call(myfunction, list(dt, X))
             })
names(ll) <- catVars
lapply(ll, head, 3)
# $category
# group size category sumTest sumTestTotal share
# [1,] A H 2 46 185 0.2486486
# [2,] A H 3 93 185 0.5027027
# [3,] A H 1 46 185 0.2486486
# $categoryA
# group size categoryA sumTest sumTestTotal share
# [1,] A H A 79 200 0.395
# [2,] A H X 68 200 0.340
# [3,] A H Z 53 200 0.265
Finally, I found a solution that might not be as slick as Josh's, but it works without any dark forces (data.table). You may laugh, but here's my reproducible example using the same sample data as in the question.
qual <- c("category","categoryA")
# get T / F vectors
noNAList <- function(vec){
  res <- !is.na(vec)
  return(res)
}
selectDF <- function(TFvec) mydata[TFvec,]
NAcheck <- lapply(mydata[,qual],noNAList)
# create a list of data.frames
listOfDf <- lapply(NAcheck,selectDF)
workhorse <- function(charVec,listOfDf){
  dfs <- list2env(listOfDf)
  # create expression list
  exlist <- list()
  for(i in 1:length(qual)){
    exlist[[qual[i]]] <- parse(text=paste("ddply(",qual[i],
      ",.(group,size,",qual[i],"),summarize,sumTest = sum(someValue))",
      sep=""))
  }
  # evaluate each expression in the environment holding the data frames
  res <- lapply(exlist,eval,envir=dfs)
  return(res)
}
Is this more like what you mean? I find your example extremely difficult to understand. In the below code, the method can take any column, and then aggregate by it. It can return multiple aggregation functions of someValue. I then find all the column names you would like to aggregate by, and then apply the function to that vector.
# Build a method to aggregate by column.
# (aggregate_by is a placeholder name; the grouping list below is reconstructed.)
aggregate_by = function (column) {
  by.list = list(mydata$group, mydata$size, mydata[[column]])
  names(by.list) = c('group','size',column)
  aggregate(mydata$someValue, by=by.list, function(x) c(sum=sum(x),mean=mean(x)))
}
# Find all the column names you want to aggregate by
cols = names(mydata)[!(names(mydata) %in% c('someValue','group','size'))]
# Apply the method to each column name.
lapply (cols, aggregate_by)