R ddply vector version

I am looking for a vector version of ddply.
I would like to do the following:
vector_ddply(frame1, frame2, ..., frameN, c("column1", "column2"), processingFunction);
Here all frames have both "column1" and "column2" and processingFunction takes N parameters.
Note that in my specific case it doesn't make sense to merge the N data frames into one.
The resulting frame would be made of the union of all the keys of the N frames.
Is there a way to achieve this?
Thanks

Let's start with some sample data:
ll <- list(
  f1 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = rnorm(4), p = 1:4),
  f2 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = rnorm(4), q = 1:4),
  f3 = data.frame(x = c("a", "b", "a", "b"), y = c(1, 1, 2, 2), z = rnorm(4), r = 1:4)
)
1. Solution: apply data.frame-wise
You want to ddply processingFunction on each data.frame individually and combine the results into one resulting data.frame:
ldply( ll, ddply, .(x, y), summarise, z = processingFunction(z) )
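For concreteness, here is a minimal runnable sketch of this solution, with mean as a hypothetical stand-in for processingFunction:
library(plyr)
processingFunction <- mean  # stand-in for illustration only
# ddply each frame by (x, y) and stack the per-frame results;
# ldply adds an .id column naming the source frame (f1, f2, f3)
res <- ldply(ll, ddply, .(x, y), summarise, z = processingFunction(z))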
2. Solution: apply on one rbinded data.frame
You want to apply processingFunction over all rows of the data.frames at once, so you should just rbind all the data.frames together into one large one. In case this is not directly possible because the individual frames do not all share the same columns, rbind on the common column subset:
commonCols <- Reduce( "intersect", lapply(ll, colnames) )
oneDf <- do.call( "rbind", lapply( ll, "[", commonCols ) )
ddply( oneDf, .(x,y), summarise, z = processingFunction(z) )
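If you would rather keep all columns instead of only the common ones, plyr's rbind.fill pads missing columns with NA; a short sketch:
library(plyr)
oneDf <- rbind.fill(ll)  # p, q, r are kept, filled with NA where a frame lacks them
ddply(oneDf, .(x, y), summarise, z = processingFunction(z))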

Related

Overriding data.table key order causes incorrect merge results

In the following example I use dplyr::arrange on a data.table with a key; this overrides the sort on that column:
x <- data.table(a = sample(1000:1100),
                b = sample(c("A", NA, "B", "C", "D"), 101, replace = TRUE),
                c = rep(letters, length.out = 101))
setkey(x, "a")
# lose order on datatable key
x <- dplyr::arrange(x, b)
y <- data.table(a = sample(1000:1100),
                f = rep(c(letters, NA), length.out = 101),
                g = rep(c("AA", "BB", NA, NA, NA, NA), length.out = 101))
setkey(y, "a")
res <- merge(x, y, by = c("a"), all.x = TRUE)
# try merge with key removed
res2 <- merge(x %>% as.data.frame() %>% as.data.table(), y, by = c("a"), all.x = TRUE)
# merge results are inconsistent
identical(res, res2)
I can see that if I ordered with x <- x[order(b)], I would maintain the sort on the key and the results would be consistent.
I am not sure why I cannot use dplyr::arrange and what relationship the sort key has with the merge. Any insight would be appreciated.
The problem is that dplyr::arrange(x, b) does not remove the sorted attribute from your data.table, unlike x <- x[order(b)] or setorder(x, "b").
The data.table way would be to use setorder in the first place, e.g.
library(data.table)
x <- data.table(a = sample(1000:1100),
                b = sample(c("A", NA, "B", "C", "D"), 101, replace = TRUE),
                c = rep(letters, length.out = 101))
setorder(x, "b", "a", na.last=TRUE)
Joins returning wrong results on data.tables that carry a key but are no longer sorted by it is a known bug (see issue #5361 in the data.table bug tracker).
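A quick way to see what is happening is to inspect key() after each kind of reordering. A minimal sketch (behaviour as described above, on versions affected by the bug):
library(data.table)
x <- data.table(a = 3:1, b = c("C", "A", "B"))
setkey(x, "a")
key(dplyr::arrange(x, b))  # still "a", although the rows are no longer sorted by a
key(x[order(b)])           # NULL: the key is dropped when rows are reordered
setorder(x, "b")
key(x)                     # NULL as well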

Assign multiple columns in R data.table while using "by" to reduce number of rows

I have a data.table on which I would like to perform a linear regression per group, and capture the slope and intercept. I would like the data to have one row per group, and (in addition to the grouping variable(s)) two columns with the slope and intercept from the regression.
I cannot get this to work; how can I do this?
Below I have a reproducible example, with two strategies I used.
dat <- data.table(
  x = rnorm(6),
  y = rnorm(6, 10),
  g = c("A", "A", "A", "B", "B", "B")
)
x <- dat[ , c("intercept", "slope") := as.list(coef(lm(y ~ x))), by = "g"]
y <- dat[ , .(model = .(lm(y ~ x))), by = "g"]
z <- dat[ , .(coef = list(coef(lm(y ~ x)))), by = "g"]
z[ , c("intercept", "slope") := list(map_dbl(coef, 1), map_dbl(coef, 2))]]
In x, I have the correct columns, but all rows are repeated (this makes sense because I use :=).
In y, I have the correct number of rows (one for each group), but I need to extract the intercept and slope later on.
z gives the expected result but feels inefficient.
Is there a way I can do this all in one go?
# without desired colnames
dat[, as.list(coef(lm(y ~ x))), by = g]
#    g (Intercept)           x
# 1: A    9.567597 -0.25231210
# 2: B   10.373024  0.01000639

# with desired colnames
dat[, .(intercept = coef(lm(y ~ x))[1],
        slope = coef(lm(y ~ x))[2]), by = g]
#    g intercept       slope
# 1: A  9.567597 -0.25231210
# 2: B 10.373024  0.01000639
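Note that the named variant above fits each lm twice per group. A small tweak (not part of the original answer) fits the model once per group and still gets the desired column names:
dat[, {
  co <- unname(coef(lm(y ~ x)))  # fit once per group
  .(intercept = co[1], slope = co[2])
}, by = g]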
Reproducible sample data:
set.seed(123)
dat <- data.table(
  x = rnorm(6),
  y = rnorm(6, 10),
  g = c("A", "A", "A", "B", "B", "B")
)

Using apply functions instead of for loops in R

I have been trying to replace a for loop in my code with an apply function. I have attempted it in all the possible ways, using sapply, lapply, apply, and mapply, but it never seems to work out. The original loop looks like this:
ds1 <- data.frame(col1 = c(NA, 2), col2 = c("A", "B"))
ds2 <- data.frame(colA = c("A", "B"), colB = c(90, 110))
for (i in 1:nrow(ds1)) {
  if (is.na(ds1$col1[i])) {
    ds1$col1[i] <- ds2[ds2[, "colA"] == ds1$col2[i], "colB"]
  }
}
My latest attempt with the apply family looks like this
ds1 <- data.frame(col1 = c(NA, 2), col2 = c("A", "B"))
ds2 <- data.frame(colA = c("A", "B"), colB = c(90, 110))
sFunc <- function(x, y, z) {
  if (is.na(x)) {
    return(z[z[, "colA"] == y, "colB"])
  } else {
    return(x)
  }
}
ds1$col1 <- sapply(ds1$col1, sFunc, ds1$col2, ds2)
This returns ds2$colB for each row. Can someone explain to me what I got wrong here?
sapply only iterates over the first vector you pass; the other arguments are treated as whole vectors on every iteration. To iterate over multiple vectors in parallel you need the multivariate apply, mapply.
sFunc <- function(x, y) {
  if (is.na(x)) {
    return(ds2[ds2[, "colA"] == y, "colB"])
  } else {
    return(x)
  }
}
mapply(sFunc, ds1$col1, ds1$col2)
#> [1] 90 2
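As an aside (not part of the original answers), no apply function is needed here at all; a fully vectorised sketch using match():
na_rows <- is.na(ds1$col1)
ds1$col1[na_rows] <- ds2$colB[match(ds1$col2[na_rows], ds2$colA)]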
A join would be useful here. You can do it in base R:
transform(merge(ds1, ds2, by.x = "col2", by.y = "colA"),
          col1 = ifelse(is.na(col1), colB, col1))[names(ds1)]
# col1 col2
#1 90 A
#2 2 B
Or with dplyr
library(dplyr)
inner_join(ds1, ds2, by = c("col2" = "colA")) %>%
  mutate(col1 = coalesce(col1, colB)) %>%
  select(all_of(names(ds1)))

Return a changed list in R via lapply(), but objects in list not changed

I'm trying to loop through a list of data frames, dropping columns that don't match some condition. Essentially, I want each data frame to end up missing one column. After executing the function, I'm able to change the LIST of data frames, but not the original data frames themselves.
df1 <- data.frame(
  a = c("John", "Peter", "Dylan"),
  b = c(1, 2, 3),
  c = c("yipee", "ki", "yay"))
df2 <- data.frame(
  a = c("Ray", "Bob", "Derek"),
  b = c(4, 5, 6),
  c = c("yum", "yummy", "donuts"))
df3 <- data.frame(
  a = c("Bill", "Sam", "Nate"),
  b = c(7, 8, 9),
  c = c("I", "eat", "cake"))
l <- list(df1, df2, df3)
drop_col <- function(x) {
  x <- x[, !names(x) %in% c("e", "b", "f")]
  return(x)
}
l <- lapply(l, drop_col)
When I call the list l, I get a list of data frames with the changes I want. When I call an element in the list, df1 or df2 or df3, they do not have a dropped column.
I've looked at this solution and many others; I'm obviously missing something.
The list l and the data frames df1, df2, etc. are independent objects; they have nothing to do with each other. One way to get the changed data frames back is to name the list elements and write them into the global environment:
l <- lapply(l, drop_col)
names(l) <- paste0("df", 1:3)
list2env(l, .GlobalEnv)
The problem is that when you are creating l, you are filling it with copies of your data frames df1, df2, df3.
In R, it is not generally possible to pass references to variables. One workaround is to write into an environment, as @Ronak Shah does.
Another is to use get() and <<- to change the variable within the function.
drop_cols <- function(x) {
  for (iter in x)
    do.call("<<-", list(iter, drop_col(get(iter))))
}
drop_cols(c("df1","df2","df3"))
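After the call, the global data frames themselves have changed, which you can verify with e.g.:
names(df1)
# [1] "a" "c"   (column b has been dropped from the global df1)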
df1 <- data.frame(
  a = c("John", "Peter", "Dylan"),
  b = c(1, 2, 3),
  c = c("yipee", "ki", "yay"))
df2 <- data.frame(
  a = c("Ray", "Bob", "Derek"),
  b = c(4, 5, 6),
  c = c("yum", "yummy", "donuts"))
df3 <- data.frame(
  a = c("Bill", "Sam", "Nate"),
  b = c(7, 8, 9),
  c = c("I", "eat", "cake"))
# Name the list elements:
l <- list(df1 = df1, df2 = df2, df3 = df3)
drop_col <- function(x) {
  x <- x[, !names(x) %in% c("e", "b", "f")]
  return(x)
}
l <- lapply(l, drop_col)
# View altered dfs:
View(l[["df1"]])  # [[ extracts the data frame itself rather than a one-element list

bind_rows() error caused by reading in a function?

The block below runs and produces df_all as intended. But when I uncomment the function at the top (I don't even apply it here, but I need it for other things) and rerun the same block, I get: Error in bind_rows_(x, .id): Argument 1 must be a data frame or a named atomic vector, not a function
library(data.table)
library(dplyr)
# addxtoy_newy_csv <- function(df) {
#   zdf1 <- df %>% filter(Variable == "s44")
#   setDT(df)
#   setDT(zdf1)
#   df[zdf1, Value := Value + i.Value, on = .(tstep, Variable, Scenario)]
#   setDF(df)
# }
tstep <- rep(c("a", "b", "c", "d", "e"), 5)
Variable <- c(rep(c("v"), 5), rep(c("w"), 5), rep(c("x"), 5), rep(c("y"), 5), rep(c("x"), 5))
Value <- c(1,2,3,4,5,10,11,12,13,14,33,22,44,57,5,3,2,1,2,3,34,24,11,11,7)
Scenario <- c(rep(c("i"), 20), rep(c("j"), 5) )
df1 <- data.frame(tstep, Variable, Value, Scenario)
tstep <- c("a", "b", "c", "d", "e")
Variable <- rep(c("x"), 5)
Value <- c(100, 34, 100,22, 100)
Scenario <- c(rep(c("i"), 5))
df2 <- data.frame(tstep, Variable, Value, Scenario)
setDT(df1)
setDT(df2)
df1[df2, Value := Value + i.Value, on=.(tstep, Variable, Scenario)]
setDF(df1)
df_all <- mget(ls(pattern="df*")) %>% bind_rows()
The pattern you use in ls() will match any object with a "d" in its name, so addxtoy_newy_csv gets included in the list of object names. The f* in your pattern means you currently search for "d, followed by zero or more f's". I think a safer pattern to use would be ^df.*, to match objects that start with "df":
df1 = data.frame(x = 1:3)
df2 = data.frame(x = 4:6)
adder = function(x) x + 1
ls(pattern = "df*")
ls(pattern = "^df.*")
