Maintaining order in split-apply-combine problems [duplicate] - r

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to ddply() without sorting?
I have the following data frame
dd1 = data.frame(cond = c("D","A","C","B","A","B","D","C"), val = c(11,7,9,4,3,0,5,2))
dd1
cond val
1 D 11
2 A 7
3 C 9
4 B 4
5 A 3
6 B 0
7 D 5
8 C 2
and now need to compute cumulative sums respecting the factor level in cond. The results should look like that:
> dd2 = data.frame(cond = c("D","A","C","B","A","B","D","C"), val = c(11,7,9,4,3,0,5,2), cumsum=c(11,7,9,4,10,4,16,11))
> dd2
cond val cumsum
1 D 11 11
2 A 7 7
3 C 9 9
4 B 4 4
5 A 3 10
6 B 0 4
7 D 5 16
8 C 2 11
It is important to receive the result data frame in the same order as the input data frame because there are other variables bound to that.
I tried ddply(dd1, .(cond), summarize, cumsum = cumsum(val)) but it didn't produce the result I expected.
Thanks

Use ave instead.
dd1$cumsum <- ave(dd1$val, dd1$cond, FUN=cumsum)
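ave() fits here because it returns a vector of the same length and in the same row order as its input, so nothing needs to be re-sorted afterwards. A minimal base R sketch that checks the result against the cumsum column from the question (the name expected is only for illustration):

```r
dd1 <- data.frame(cond = c("D", "A", "C", "B", "A", "B", "D", "C"),
                  val  = c(11, 7, 9, 4, 3, 0, 5, 2))

# Cumulative sum of val within each level of cond, in the original row order
dd1$cumsum <- ave(dd1$val, dd1$cond, FUN = cumsum)

# The cumsum column the question asks for
expected <- c(11, 7, 9, 4, 10, 4, 16, 11)
all.equal(dd1$cumsum, expected)
```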

If doing this by hand is an option, then split() and unsplit() with a suitable lapply() in between will do this for you.
dds <- split(dd1, dd1$cond)
dds <- lapply(dds, function(x) transform(x, cumsum = cumsum(x$val)))
unsplit(dds, dd1$cond)
The last line gives
> unsplit(dds, dd1$cond)
cond val cumsum
1 D 11 11
2 A 7 7
3 C 9 9
4 B 4 4
5 A 3 10
6 B 0 4
7 D 5 16
8 C 2 11
I separated the three steps, but these could be strung together or placed in a function if you are doing a lot of this.
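As one way to place the three steps in a function, here is a sketch in base R. The function name cumsum_by_group and its arguments are hypothetical, chosen only for illustration:

```r
# Hypothetical helper: split by a grouping column, add a cumulative sum,
# then unsplit so the rows come back in their original order.
cumsum_by_group <- function(data, group, value, new_col = "cumsum") {
  pieces <- split(data, data[[group]])
  pieces <- lapply(pieces, function(x) {
    x[[new_col]] <- cumsum(x[[value]])
    x
  })
  unsplit(pieces, data[[group]])
}

dd1 <- data.frame(cond = c("D", "A", "C", "B", "A", "B", "D", "C"),
                  val  = c(11, 7, 9, 4, 3, 0, 5, 2))
dd2 <- cumsum_by_group(dd1, "cond", "val")
dd2
```

Passing the column names as strings keeps the helper free of non-standard evaluation, which makes it easier to reuse on other data frames.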

A data.table solution:
library(data.table)
# := only works on a data.table, so convert with data.table(), not data.frame()
dt <- data.table(dd1)
dt[, c.val := cumsum(val), by = cond]
> dt
# cond val c.val
# 1: D 11 11
# 2: A 7 7
# 3: C 9 9
# 4: B 4 4
# 5: A 3 10
# 6: B 0 4
# 7: D 5 16
# 8: C 2 11


Repeat dataframe with new column in R

I have a dataframe:
my_df <- data.frame(var1 = c(1,2,3,4,5), var2 = c(6,7,8,9,10))
my_df
var1 var2
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
I also have a vector:
my_vec <- c("a", "b", "c")
I want to repeat the dataframe length(my_vec) times, filling in the values of a new variable with the vector values. Is there a simple way to do this? If possible, I'd like to do this in a dplyr chain. Desired output:
var1 var2 var3
1 1 6 a
2 2 7 a
3 3 8 a
4 4 9 a
5 5 10 a
6 1 6 b
7 2 7 b
8 3 8 b
9 4 9 b
10 5 10 b
11 1 6 c
12 2 7 c
13 3 8 c
14 4 9 c
15 5 10 c
We can use crossing or expand_grid from tidyr
library(tidyr)
crossing(my_df, var3 = my_vec)
#expand_grid(my_df, var3 = my_vec)
If the order is important, use arrange
library(dplyr)
crossing(my_df, var3 = my_vec) %>%
arrange(var3)
-output
# A tibble: 15 × 3
var1 var2 var3
<dbl> <dbl> <chr>
1 1 6 a
2 2 7 a
3 3 8 a
4 4 9 a
5 5 10 a
6 1 6 b
7 2 7 b
8 3 8 b
9 4 9 b
10 5 10 b
11 1 6 c
12 2 7 c
13 3 8 c
14 4 9 c
15 5 10 c
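If tidyr is not available, the same table can be built in base R by repeating the row indices. This is a sketch of an alternative, not the method used in the answer above; row_idx and repeated are illustrative names:

```r
my_df <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 = c(6, 7, 8, 9, 10))
my_vec <- c("a", "b", "c")

n <- nrow(my_df)
# Row indices 1..n repeated once per element of my_vec
row_idx <- rep(seq_len(n), times = length(my_vec))
repeated <- my_df[row_idx, ]
# Each vector value fills one complete copy of the data frame
repeated$var3 <- rep(my_vec, each = n)
rownames(repeated) <- NULL
repeated
```

Because the copies are laid out one after another, the rows already come out grouped by var3 and no sorting step is needed.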
Though I don't think this is likely to be the simplest answer in practice, I specifically saw that you wanted a dplyr chain that would solve this, and so I tried to do this without using the pre-existing functions that do this for you.
For your example specifically, you could use this chain with the tibble package functions add_column and add_row
my_df %>%
tibble::add_column(var3 = my_vec[1]) %>%
tibble::add_row(tibble::add_column(my_df, var3 = my_vec[2])) %>%
tibble::add_row(tibble::add_column(my_df, var3 = my_vec[3]))
which directly yields
var1 var2 var3
1 1 6 a
2 2 7 a
3 3 8 a
4 4 9 a
5 5 10 a
6 1 6 b
7 2 7 b
8 3 8 b
9 4 9 b
10 5 10 b
11 1 6 c
12 2 7 c
13 3 8 c
14 4 9 c
15 5 10 c
Though that principle can be extended a bit, the chain is hard-coded to three values and could be more adaptable to whatever you want to apply it to. So I decided to make a function to do it for you.
my_fxn <- function(frame, yourVector, new.col.name = paste0("var", NCOL(frame) + 1)) {
  # add_column() and add_row() are called as tibble::..., so no package needs attaching
  origcols <- colnames(frame)
  # seq_along() is safe when yourVector is empty, unlike 1:length(yourVector)
  for (i in seq_along(yourVector)) {
    # copy of the frame with the i-th vector value in a temporary column
    intermediateFrame <- tibble::add_column(
      frame,
      temp.name = rep_len(yourVector[[i]], nrow(frame))
    )
    # give the temporary column its final name
    colnames(intermediateFrame) <- append(origcols, new.col.name)
    if (i == 1) {
      Frame3 <- intermediateFrame
    } else {
      Frame3 <- tibble::add_row(Frame3, intermediateFrame)
    }
  }
  return(Frame3)
}
Running my_fxn(my_df, my_vec) should get you the same data frame/table that we got above.
I also experimented with using a for loop outside a function on its own to do this, but decided that it was getting to be overkill. That approach is definitely also possible, though.
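The same idea can also be written without an explicit for loop. The sketch below uses only base R (lapply() plus do.call(rbind, ...)) rather than the tibble functions above; repeat_with_labels and its arguments are hypothetical names:

```r
# Hypothetical helper: one labelled copy of `frame` per element of `labels`,
# stacked row-wise into a single data frame.
repeat_with_labels <- function(frame, labels, new_col = "var3") {
  pieces <- lapply(labels, function(label) {
    piece <- frame
    piece[[new_col]] <- label  # a single value is recycled down the column
    piece
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

my_df <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 = c(6, 7, 8, 9, 10))
my_vec <- c("a", "b", "c")
result <- repeat_with_labels(my_df, my_vec)
result
```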

Dynamic select expression in function [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
obj S1 PR1 S2 PR2
1 1 a 3 c 7
2 1 b 7 d 3
3 2 a 3 c 7
4 2 b 7 d 3
5 3 a 3 c 7
6 3 b 7 d 3
7 3 a 3 c 7
8 4 b 7 d 3
9 4 a 3 c 7
10 4 b 7 d 3
into this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
select(., obj, S = S2, PR = PR2))}
obj S PR
1 1 a 3
2 1 b 7
3 2 a 3
4 2 b 7
5 3 a 3
6 3 b 7
7 3 a 3
8 4 b 7
9 4 a 3
10 4 b 7
11 1 c 7
12 1 d 3
13 2 c 7
14 2 d 3
15 3 c 7
16 3 d 3
17 3 c 7
18 4 d 3
19 4 c 7
20 4 d 3
But I would like the function to be able to work with any number of columns. So it would also work if I had S1, S2, S3, S4 or if there was an additional category ie DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
keep <- enquo(keep)
build_exp <- function(x){
paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
parse_expr()
}
exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
sel_fun <- function(num){
df %>% select(!!keep,
!!!exps)
}
map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
df %>% select(!!keep,
!!sym(cnames[[1]]) := paste0(patterns[[1]], num),
!!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. Loop through the 'patterns' and the corresponding new column names ('cnames') using map2, gather into 'long' format, rename the 'val' column to the 'cnames' passed into the function, bind the columns (bind_cols) and select the columns of interest
stack_col <- function(dat, pat, cname, keep) {
purrr::map2(pat, cname, ~
dat %>%
dplyr::select(keep, matches(.x)) %>%
tidyr::gather(key, val, matches(.x)) %>%
dplyr::select(-key) %>%
dplyr::rename(!! .y := val)) %>%
dplyr::bind_cols(.) %>%
dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
# obj Species PR
#1 1 a 3
#2 1 b 7
#3 2 a 3
#4 2 b 7
#5 3 a 3
#6 3 b 7
#7 3 a 3
#8 4 b 7
#9 4 a 3
#10 4 b 7
#11 1 c 7
#12 1 d 3
#13 2 c 7
#14 2 d 3
#15 3 c 7
#16 3 d 3
#17 3 c 7
#18 4 d 3
#19 4 c 7
#20 4 d 3
Also, multiple patterns reshaping can be done with data.table::melt
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
value.name = c("Species", "PR"))[, variable := NULL][]
This solves your problem, although it does not fix your function:
The idea is to use gather and spread on the columns that start with a specific pattern. I build a regex that matches those column names, gather all of them, extract the group, and rename the groups using cnames. Finally, spread separates the values back into the new columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
mutate(rownumber = 1:n()) %>%
gather(new_variable, value, matches(complete_pattern)) %>%
mutate(group = str_extract(new_variable, complete_pattern),
group = str_replace_all(group, cnames),
group_number = str_extract(new_variable, "\\d+")) %>%
select(-new_variable) %>%
spread(group, value)
# obj rownumber group_number PR Species
# 1 1 1 1 3 a
# 2 1 1 2 7 c
# 3 1 2 1 7 b
# 4 1 2 2 3 d
# 5 2 3 1 3 a
# 6 2 3 2 7 c
# 7 2 4 1 7 b
# 8 2 4 2 3 d
# 9 3 5 1 3 a
# 10 3 5 2 7 c
# 11 3 6 1 7 b
# 12 3 6 2 3 d
# 13 3 7 1 3 a
# 14 3 7 2 7 c
# 15 4 8 1 7 b
# 16 4 8 2 3 d
# 17 4 9 1 3 a
# 18 4 9 2 7 c
# 19 4 10 1 7 b
# 20 4 10 2 3 d
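For comparison, here is a base R sketch that keeps the argument list the question asks for (patterns, nums, cnames, keep) but takes the kept columns as a character vector instead of an unquoted name. It assumes every column is named exactly pattern followed by number, such as S1 or PR2; the name stack_cols is hypothetical:

```r
df <- data.frame(obj = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
                 S1 = rep(c("a", "b"), length.out = 10),
                 PR1 = rep(c(3, 7), length.out = 10),
                 S2 = rep(c("c", "d"), length.out = 10),
                 PR2 = rep(c(7, 3), length.out = 10))

# Hypothetical helper: for each set number, pick keep + <pattern><num> columns,
# rename them to the output names, and stack the sets on top of each other.
stack_cols <- function(dat, patterns, nums, cnames, keep) {
  pieces <- lapply(nums, function(num) {
    piece <- dat[, c(keep, paste0(patterns, num)), drop = FALSE]
    names(piece) <- c(keep, cnames)
    piece
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

stacked <- stack_cols(df, patterns = c("S", "PR"), nums = 1:2,
                      cnames = c("Species", "PR"), keep = "obj")
stacked
```

Adding a further category such as DS1, DS2 only means extending patterns and cnames, and more sets only means a longer nums.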

How to delete duplicates but keep most recent data in R

I have the following two data frames:
df1 = data.frame(names=c('a','b','c','c','d'),year=c(11,12,13,14,15), Times=c(1,1,3,5,6))
df2 = data.frame(names=c('a','e','e','c','c','d'),year=c(12,12,13,15,16,16), Times=c(2,2,4,6,7,7))
I would like to know how I could merge the above df but only keeping the most recent Times depending on the year. It should look like this:
Names Year Times
a 12 2
b 12 1
c 16 7
d 16 7
e 13 4
I'm guessing that you do not mean to merge these but rather to combine them by stacking. Your question is ambiguous, since the "duplication" could occur at the dataframe level or at the vector level. Your example does not display any duplication at the dataframe level but does at the vector level. The best way to describe the problem is that you want the last (or max) Times entry within each group of names values:
> df1
names year Times
1 a 11 1
2 b 12 1
3 c 13 3
4 c 14 5
5 d 15 6
> df2
names year Times
1 a 12 2
2 e 12 2
3 e 13 4
4 c 15 6
5 c 16 7
6 d 16 7
> dfr <- rbind(df1,df2)
> dfr <-dfr[order(dfr$Times),]
> dfr[!duplicated(dfr, fromLast=TRUE) , ]
names year Times
1 a 11 1
2 b 12 1
6 a 12 2
7 e 12 2
3 c 13 3
8 e 13 4
4 c 14 5
5 d 15 6
9 c 15 6
10 c 16 7
11 d 16 7
> dfr[!duplicated(dfr$names, fromLast=TRUE) , ]
names year Times
2 b 12 1
6 a 12 2
8 e 13 4
10 c 16 7
11 d 16 7
This uses base R functions; there are also newer packages (such as plyr) that many feel make the split-apply-combine process more intuitive.
df <- rbind(df1, df2)
do.call(rbind, lapply(split(df, df$names), function(x) x[which.max(x$year), ]))
## names year Times
## a a 12 2
## b b 12 1
## c c 16 7
## d d 16 7
## e e 13 4
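We could also use aggregate. Note that max is applied to each column separately, so this matches the other answers only when the largest Times in a group sits in the same row as the largest year (as it does here):
df <- rbind(df1,df2)
aggregate(cbind(year, Times) ~ names, df, max)
# names year Times
# 1 a 12 2
# 2 b 12 1
# 3 c 16 7
# 4 d 16 7
# 5 e 13 4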
In case you wanted to see a data.table solution,
# load library
library(data.table)
# bind by row and convert to data.table (by reference)
df <- setDT(rbind(df1, df2))
# get the result
df[order(names, year), .SD[.N], by=.(names)]
The output is as follows:
names year Times
1: a 12 2
2: b 12 1
3: c 16 7
4: d 16 7
5: e 13 4
The final line orders the row-bound data by names and year, and then chooses the last observation (.SD[.N]) for each name.
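A base R sketch of the same "keep the row with the latest year per name" rule uses ave() to compute each group's maximum year and then filters on it. The names combined and latest are illustrative; if two rows in a group tie on the maximum year, both are kept:

```r
df1 <- data.frame(names = c("a", "b", "c", "c", "d"),
                  year = c(11, 12, 13, 14, 15),
                  Times = c(1, 1, 3, 5, 6))
df2 <- data.frame(names = c("a", "e", "e", "c", "c", "d"),
                  year = c(12, 12, 13, 15, 16, 16),
                  Times = c(2, 2, 4, 6, 7, 7))

combined <- rbind(df1, df2)
# Keep rows whose year equals the maximum year within their names group
is_latest <- combined$year == ave(combined$year, combined$names, FUN = max)
latest <- combined[is_latest, ]
latest <- latest[order(latest$names), ]
rownames(latest) <- NULL
latest
```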

Generate combination of data frame and vector

I know expand.grid is to create all combinations of given vectors. But is there a way to generate all combinations of a data frame and a vector by taking each row in the data frame as unique. For instance,
df <- data.frame(a = 1:3, b = 5:7)
c <- 9:10
how to create a new data frame that is the combination of df and c without expanding df:
df.c:
a b c
1 5 9
2 6 9
3 7 9
1 5 10
2 6 10
3 7 10
Thanks!
As for me the simplest way is merge(df, as.data.frame(c))
a b c
1 1 5 9
2 2 6 9
3 3 7 9
4 1 5 10
5 2 6 10
6 3 7 10
This may not scale when your dataframe has more than two columns per row, but you can just use expand.grid on the first column and then merge the second column in.
df <- data.frame(a = 1:3, b = 5:7)
c <- 9:10
combined <- expand.grid(a=df$a, c=c)
combined <- merge(combined, df)
> combined[order(combined$c), ]
a c b
1 1 9 5
3 2 9 6
5 3 9 7
2 1 10 5
4 2 10 6
6 3 10 7
You could also do something like this
do.call(rbind, lapply(9:10, function(x, d) data.frame(d, c = x), d = df))
# or using rbindlist as a fast alternative to do.call(rbind,list)
library(data.table)
rbindlist(lapply(9:10, function(x, d) data.frame(d, c = x), d = df))
or
rbindlist(Map(data.frame, c = 9:10, MoreArgs = list(a= 1:3,b=5:7)))
This question is really old but I found one more answer.
Use tidyr's expand_grid().
library(tidyr)
expand_grid(df, c)
# A tibble: 6 × 3
a b c
<int> <int> <int>
1 1 5 9
2 1 5 10
3 2 6 9
4 2 6 10
5 3 7 9
6 3 7 10
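To get the row order asked for in the question (all rows of df for c = 9, then all rows for c = 10) without extra packages, one option is to make the cross join explicit with merge(..., by = NULL), which returns the Cartesian product of the two data frames. In this sketch the vector is renamed to c_vals, an illustrative name, so that it does not mask base::c:

```r
df <- data.frame(a = 1:3, b = 5:7)
c_vals <- 9:10  # renamed from c in the question to avoid shadowing base::c

# by = NULL asks merge() for every pairing of a row of df with a value of c
df.c <- merge(df, data.frame(c = c_vals), by = NULL)
df.c
```

The rows of the first data frame vary fastest in the result, so each value of c is paired with an intact copy of df.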

How to calculate difference from initial value for each group in R?

I have data arranged like this in R:
indv time val
A 6 5
A 10 10
A 12 7
B 8 4
B 10 3
B 15 9
For each individual (indv) at each time, I want to calculate the change in value (val) from the initial time. So I would end up with something like this:
indv time val val_1 val_change
A 6 5 5 0
A 10 10 5 5
A 12 7 5 2
B 8 4 4 0
B 10 3 4 -1
B 15 9 4 5
Can anyone tell me how I might do this? I can use
ddply(df, .(indv), function(x)x[which.min(x$time), ])
to get a table like
indv time val
A 6 5
B 8 4
However, I cannot figure out how to make a column val_1 where the minimum values are matched up for each individual. However, if I can do that, I should be able to add column val_change using something like:
df['val_change'] = df['val'] - df['val_1']
EDIT: two excellent methods were posted below, however both rely on my time column being sorted so that small time values are on top of high time values. I'm not sure this will always be the case with my data. (I know I can sort first in Excel, but I'm trying to avoid that.) How could I deal with a case when the table appears like this:
indv time value
A 10 10
A 6 5
A 12 7
B 8 4
B 10 3
B 15 9
Here is a data.table solution that will be memory efficient, as it sets the new columns by reference within the data.table. Setting the key sorts by the key variables, so val[1] is the value at the earliest time for each indv.
library(data.table)
DT <- data.table(df)
# set key to sort by indv then time
setkey(DT, indv, time)
DT[, c('val1','change') := list(val[1], val - val[1]),by = indv]
# And to show it works....
DT
## indv time val val1 change
## 1: A 6 5 5 0
## 2: A 10 10 5 5
## 3: A 12 7 5 2
## 4: B 8 4 4 0
## 5: B 10 3 4 -1
## 6: B 15 9 4 5
Here's a plyr solution using ddply
ddply(df, .(indv), transform,
val_1 = val[1],
change = (val - val[1]))
indv time val val_1 change
1 A 6 5 5 0
2 A 10 10 5 5
3 A 12 7 5 2
4 B 8 4 4 0
5 B 10 3 4 -1
6 B 15 9 4 5
To get your second table try this:
ddply(df, .(indv), function(x) x[which.min(x$time), ])
indv time val
1 A 6 5
2 B 8 4
Edit 1
To deal with unsorted data, like the one you posted in your edit try the following
unsort <- read.table(text="indv time value
A 10 10
A 6 5
A 12 7
B 8 4
B 10 3
B 15 9", header = TRUE)
do.call(rbind, lapply(split(unsort, unsort$indv),
function(x) x[order(x$time), ]))
indv time value
A.2 A 6 5
A.1 A 10 10
A.3 A 12 7
B.4 B 8 4
B.5 B 10 3
B.6 B 15 9
Now you can apply the procedure described above to this sorted dataframe
Edit 2
A shorter way to sort your dataframe is to use the orderBy function from the doBy package
library(doBy)
orderBy(~ indv + time, unsort)
indv time value
2 A 6 5
1 A 10 10
3 A 12 7
4 B 8 4
5 B 10 3
6 B 15 9
Edit 3
You can even sort your df using ddply
ddply(unsort, .(indv, time), sort)
value time indv
1 5 6 A
2 10 10 A
3 7 12 A
4 4 8 B
5 3 10 B
6 9 15 B
You can do this with the base functions. Using your data:
df <- read.table(text = "indv time val
A 6 5
A 10 10
A 12 7
B 8 4
B 10 3
B 15 9", header = TRUE)
We first split() df on the indv variable
sdf <- split(df, df$indv)
Next we transform each component of sdf adding in the val_1 and val_change variables in a manner similar to how you suggest
sdf <- lapply(sdf, function(x) transform(x, val_1 = val[1],
val_change = val - val[1]))
Finally, we arrange for the individual components to be bound row-wise into a single data frame:
df <- do.call(rbind, sdf)
df
Which gives:
R> df
indv time val val_1 val_change
A.1 A 6 5 5 0
A.2 A 10 10 5 5
A.3 A 12 7 5 2
B.4 B 8 4 4 0
B.5 B 10 3 4 -1
B.6 B 15 9 4 5
Edit
To address the sorting issue the OP raises in the comments, modify the lapply() call to include a sorting step prior to the transform(). For example:
sdf <- lapply(sdf, function(x) {
x <- x[order(x$time), ]
transform(x, val_1 = val[1],
val_change = val - val[1])
})
In use we have
## scramble `df`
df <- df[sample(nrow(df)), ]
## split
sdf <- split(df, df$indv)
## apply sort and transform
sdf <- lapply(sdf, function(x) {
x <- x[order(x$time), ]
transform(x, val_1 = val[1],
val_change = val - val[1])
})
## combine
df <- do.call(rbind, sdf)
which again gives:
R> df
indv time val val_1 val_change
A.1 A 6 5 5 0
A.2 A 10 10 5 5
A.3 A 12 7 5 2
B.4 B 8 4 4 0
B.5 B 10 3 4 -1
B.6 B 15 9 4 5
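If the rows must stay in their original (possibly unsorted) order, a sketch with ave() avoids sorting altogether: for each indv it looks up val at the smallest time and recycles that value across the group. The helper name first_val is hypothetical, and the data is the unsorted table from the question's edit with the value column named val:

```r
df <- read.table(text = "indv time val
A 10 10
A 6 5
A 12 7
B 8 4
B 10 3
B 15 9", header = TRUE)

# Hypothetical helper: given the row indices of one group,
# return val at the row with the smallest time
first_val <- function(idx) df$val[idx][which.min(df$time[idx])]

# ave() hands each group's row indices to first_val and recycles the result
df$val_1 <- ave(seq_len(nrow(df)), df$indv, FUN = first_val)
df$val_change <- df$val - df$val_1
df
```

Because ave() works on row indices here, the data frame is never reordered, so any other columns bound to the rows stay aligned.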
