I have a dataframe and need to calculate the difference between successive entries within each ID. I would like to do this without creating individual dataframes for each ID and then joining them back together (my current solution). Here is an example with a similar structure to my data:
df = as.data.frame(matrix(nrow = 20,ncol =2 ))
names(df) = c("ID","number")
df$ID = sample(c("A","B","C"),20,replace = T)
df$number = rnorm(20,mean = 5)
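For reference, my current solution looks roughly like the sketch below: split into one dataframe per ID, compute the differences within each piece, and bind them back together (df.joined is just an illustrative name).
## current approach (sketch): one dataframe per ID, then rejoin
## note: rbind of the split pieces reorders the rows by ID
pieces <- lapply(split(df, df$ID), function(d) {
  d$dif <- c(NA, diff(d$number))
  d
})
df.joined <- do.call(rbind, pieces)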
I can easily calculate the difference between successive rows with this function (rollapply comes from the zoo package):
library(zoo)
roll.dif <- function(x) {
  difference = rollapply(x, width = 2, diff, fill = NA, align = "right")
  return(difference)
}
df$dif = roll.dif(df$number)
However, I would like to do this within each ID. Based on the answer to Apply function conditionally, I have tried using with and tapply:
with(df, tapply(number, ID, FUN = roll.dif))
I have also tried using by:
by(df$number,df$ID,FUN = roll.dif)
Both give me the answers I am looking for, but I cannot figure out how to get the results back into the dataframe. I would like the output to look like this:
ID number dif
1 A 3.967251 NA
2 B 3.771882 NA
3 A 5.920705 1.953454
4 A 7.517528 1.596823
5 B 5.252357 1.480475
6 B 4.811998 -0.440359
7 B 3.388951 -1.423047
8 A 5.284527 -2.233001
9 C 6.070546 NA
10 A 5.319934 0.035407
11 A 5.517615 0.197681
12 B 5.454738 2.065787
13 C 6.402359 0.331813
14 C 5.617123 -0.785236
15 A 5.692807 0.175192
16 C 4.902007 -0.715116
17 B 4.975184 -0.479554
18 A 6.05282 0.360013
19 C 3.677114 -1.224893
20 C 4.883414 1.2063
You can use the dplyr package like this:
library(dplyr)
df %>% group_by(ID) %>% mutate(dif = roll.dif(number))
We can use data.table:
library(data.table)
setDT(df)[, dif := roll.dif(number), by = ID]
Or a base R option is ave:
df$dif <- with(df, ave(number, ID, FUN = roll.dif))
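If you would rather avoid the zoo dependency altogether, a plain diff() wrapped to preserve the vector length works the same way inside ave(); a minimal sketch:
## same idea without zoo: prepend NA so the result matches the input length
simple.dif <- function(x) c(NA, diff(x))
df$dif <- with(df, ave(number, ID, FUN = simple.dif))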
This is a follow-up to my previous question here, which @ronak_shah was kind enough to answer. I apologize that some of this information may be redundant for anyone who saw that post, but it seemed best to post a new question rather than modify the previous one.
I would still like to iterate through a stored list of columns and procedures to create n new columns based on that list. In the example below, we start with three columns, a, b, and c, and a simple function, func1.
The data frame col_mod identifies which column should be changed, what the second argument to the function that changes them should be, and then generates a statement to execute the function. Each of these modifications should be an addition to the original data frame, rather than replacements of the specified columns. The new names of these columns should be a_new and c_new, respectively.
At the bottom of the reprex below, I am able to obtain my desired result manually, but as before, I would like to automate this using a mapping function.
I am attempting to use the same approach that was provided as an answer to my previous question, but I keep on getting the following error: "Error in get(as.character(FUN), mode = "function", envir = envir) : object 'func1(a,3)' of mode 'function' was not found"
If anyone can help would be much appreciated!
library(tidyverse)
## fake data
dat <- data.frame(a = 1:5,
                  b = 6:10,
                  c = 11:15)
## function
func1 <- function(x, y) {x + y}
## modification list
col_mod <- data.frame("col" = c("a", "c"),
                      "y_val" = c(3, 4),
                      stringsAsFactors = FALSE) %>%
  mutate(func = paste0("func1(", col, ",", y_val, ")"))
## desired end result
dat %>%
  mutate(a_new = func1(a, 3),
         c_new = func1(c, 4))
## attempting to generate new columns based on @ronak_shah's answer to my
## previous question, but it fails to run
dat[paste0(col_mod$col, '_new')] <- Map(function(x, y) match.fun(y)(x),
                                        dat[col_mod$col], col_mod$func)
We can use pmap from purrr: transmute a new column using the column name from 'col' (..1), the function name from 'func' (..3), and the 'y_val' (..2); assign (:=) the value to a new column name built with paste (or str_c); and bind the columns to the original dataset.
library(dplyr)
library(purrr)
library(stringr)
library(tibble)
col_mod$func <- 'func1'
pmap(col_mod, ~ dat %>%
       transmute(!! str_c(..1, "_new") :=
                   match.fun(..3)(!! rlang::sym(..1), ..2))) %>%
  bind_cols(dat, .)
-output
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
If we want to parse the function call as is, i.e. without changing the func column (it remains func1(a, 3) and func1(c, 4)), use parse_expr with eval:
pmap(col_mod, ~ dat %>%
       transmute(!! str_c(..1, "_new") :=
                   eval(rlang::parse_expr(..3)))) %>%
  bind_cols(dat, .)
-output
# a b c a_new c_new
#1 1 6 11 4 15
#2 2 7 12 5 16
#3 3 8 13 6 17
#4 4 9 14 7 18
#5 5 10 15 8 19
Or using base R with Map
dat[paste0(col_mod$col, '_new')] <- do.call(Map,
  c(f = function(x, y, z) eval(parse(text = z), envir = dat),
    unname(col_mod)))
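A more explicit base R route is a plain loop over the rows of col_mod; a sketch, assuming the func column still holds the full call strings (func1(a, 3) and func1(c, 4)) and that each call only references columns of dat:
## evaluate each stored call within dat and assign it to the new column
for (i in seq_len(nrow(col_mod))) {
  dat[[paste0(col_mod$col[i], "_new")]] <-
    eval(parse(text = col_mod$func[i]), envir = dat)
}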
I am trying to apply a function to the columns of a tibble or data.frame depending on the column index. This situation comes up for me frequently, so I give just one MWE:
library(tidyverse)
test <- data.frame(a = c(1,2,3), b = c(7,8,9), c = c(3,5,6))
test <- test %>% as_tibble() %>% mutate_all( ~lead(., 2))
This leads every column by 2 (just an example). But what I want is to lead the first column by 1, the second by 2, and so on, i.e. something like mutate_all(~lead(., col_number())).
For this little example, I know one way to do it:
test <- as.matrix(test)
for (i in 1:ncol(test)) {
  test[, i] <- lead(test[, i], i)
}
There might be other ways to do it too; I haven't thought about it much. (One needs to convert to a matrix first, otherwise it doesn't produce the right result; I don't really know why.)
But I'd like to do it with mutate or apply, in a way that can access the column index in general, for use in more complex examples.
Any idea?
One option is using purrr::map2_df to sequentially lead every column based on its column number.
purrr::map2_df(test, seq_along(test), dplyr::lead)
# A tibble: 3 x 3
# a b c
# <dbl> <dbl> <dbl>
#1 2 9 NA
#2 3 NA NA
#3 NA NA NA
We can also use base R Map
test[] <- Map(function(x, y) c(tail(x, -y), rep(NA, y)), test, seq_along(test))
We can use data.table's shift:
library(data.table)
setDT(test)[, Map(shift, .SD, n = 1:3, type = 'lead')]
# a b c
#1: 2 9 NA
#2: 3 NA NA
#3: NA NA NA
Or using purrr (passing the column index to shift's n argument so each column is led by its position):
library(purrr)
map2_dfr(test, 1:3, ~shift(.x, n = .y, type = 'lead'))
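Equivalently, if you prefer to stay with dplyr's lead() instead of data.table's shift, a small sketch:
## same pairwise idea; seq_along(test) supplies each column's lead distance
map2_dfr(test, seq_along(test), ~dplyr::lead(.x, n = .y))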
I currently face a problem in R that I know exactly how to deal with in Stata, but I have spent over two hours trying to accomplish it in R.
Using the data.frame below, I want to obtain exactly the first observation per group, where groups are formed by multiple variables and sorted by another variable, i.e. the data.frame mydata obtained by:
id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)
Should be transformed to:
id day value
1 1 10
1 2 15
1 3 20
2 1 40
2 2 30
3 2 22
3 3 24
4 1 11
4 2 11
4 3 12
That is, keep only one of the rows that share duplicate group identifiers (here that is only row 1: (id, day) = (1, 1)), sorting by value first so that the row with the lowest value is kept.
In Stata, this would simply be:
bys id day (value): keep if _n == 1
I found a piece of code on the web that properly does this if I first produce a single group identifier:
mydata$id1 <- paste(mydata$id, "000", mydata$day, sep = "")  ### the single group identifier
myid.uni <- unique(mydata$id1)
a <- length(myid.uni)
last <- c()
for (i in 1:a) {
  temp <- subset(mydata, id1 == myid.uni[i])
  if (dim(temp)[1] > 1) {
    last.temp <- temp[dim(temp)[1], ]
  } else {
    last.temp <- temp
  }
  last <- rbind(last, last.temp)
}
last
However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations in groups of about 6), this approach takes about 1.5 hours.
Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?
The dplyr package makes this kind of thing easier.
library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
Note that this command requires more memory in R than in Stata: in R, a new copy of the dataset is created while in Stata, rows are deleted in place.
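In more recent versions of dplyr (1.0.0 and later), slice_min() expresses the same idea directly; a sketch:
# keep the row with the smallest value within each (id, day) group
mydata %>%
  group_by(id, day) %>%
  slice_min(value, n = 1, with_ties = FALSE) %>%
  ungroup()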
I would order the data.frame at which point you can look into using by:
mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]
do.call(rbind, by(mydata, list(mydata$id, mydata$day),
FUN=function(x) head(x, 1)))
Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:
library(data.table)
DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
# id day value
# 1: 1 1 10
# 2: 1 2 15
# 3: 1 3 20
# 4: 2 1 40
# 5: 2 2 30
# 6: 3 2 22
# 7: 3 3 24
# 8: 4 1 11
# 9: 4 2 11
# 10: 4 3 12
Or, starting from scratch, you can use data.table in the following way:
DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
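If the helper ranking column should not remain in the result, it can be dropped afterwards; a sketch:
# remove the temporary n column from the filtered result
res <- DT[n == 1][, n := NULL][]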
And, by extension, in base R:
Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
rank(x, ties.method="first")))
mydata[Ranks == 1, ]
Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:
library(data.table)
mydata <- data.table(mydata)
mydata <- mydata[, .SD[1], by = .(id, day)]
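If the sorting still needs to happen, setorder() can do it in place first; a sketch that orders by value within (id, day) so the lowest value lands in the first row of each group:
setorder(mydata, id, day, value)  # sort in place before taking .SD[1]
mydata <- mydata[, .SD[1], by = .(id, day)]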
Using dplyr with magrittr pipes:
library(dplyr)
mydata <- mydata %>%
group_by(id, day) %>%
slice(1) %>%
ungroup()
If you don't add ungroup() at the end, dplyr's grouping structure will still be present and might interfere with your subsequent operations.
I'm using data.table to aggregate and collapse values by group. I know a method that works when I name the column directly, but I want to refer to the column programmatically, e.g. via names(dt)[1], while still grouping with by. The method I know is:
dt[,X := list(paste(X, collapse = ";")),by = list(Y,Z)]
What I want to do now is:
dt[,names(dt)[1] := list(paste(names(dt)[1], collapse = ";")),by = list(Y,Z)]
But this code just writes "X" on every line. Here is an example:
X <- c("a","b","c","d","e","f","g")
Y <- c(1,2,3,4,4,6,4)
Z <- c(10,11,23,8,8,1,3)
dt <- data.table(X,Y,Z)
This is the desired output. I need to refer to the column programmatically because I'm trying to do this on multiple columns (I have a data frame with 400 columns):
X Y Z
1: a 1 10
2: b 2 11
3: c 3 23
4: d;e 4 8
5: f 6 1
6: g 4 3
You should wrap names(dt)[1] inside get():
dt[,names(dt)[1] := list(paste(get(names(dt)[1]), collapse = ";")),by = list(Y,Z)]
Additionally, if you want to deduplicate your data you can use unique(dt).
To apply your function to multiple columns, you can use .SD in combination with lapply(). For example, pasting together the first two columns, grouped by Z:
dt[, lapply(.SD, function(x) paste(x, collapse = ";")), by = list(Z), .SDcols = names(dt)[1:2]]
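To scale this up to all 400 columns, one option (a sketch, assuming Y and Z are the grouping keys and every other column should be collapsed) is:
## collapse every non-grouping column within each (Y, Z) group
cols <- setdiff(names(dt), c("Y", "Z"))
dt[, lapply(.SD, paste, collapse = ";"), by = .(Y, Z), .SDcols = cols]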
For my question, I created a dummy data frame:
set.seed(007)
DF <- data.frame(a = rep(LETTERS[1:5], each=2), b = sample(40:49), c = sample(1:10))
DF
a b c
1 A 49 2
2 A 43 3
3 B 40 7
4 B 47 1
5 C 41 9
6 C 48 8
7 D 45 6
8 D 42 5
9 E 46 10
10 E 44 4
How can I use the aggregation function on column a so that, for instance, for "A" the following value is calculated: (49 - 43) / (2 + 3)?
I started with:
aggregate(DF, by=list(DF$a), FUN=function(x) {
...
})
The problem I have is that I do not know how to access the four different cells 49, 43, 2, and 3 inside FUN. I tried x[[1]][1] and similar things but can't get it working.
Inside aggregate, the function FUN is applied independently to each column of your data. Here you want to use a function that takes two columns as inputs, so a priori, you can't use aggregate for that.
Instead, you can use ddply from the plyr package:
ddply(DF, "a", summarize, res = (b[1] - b[2]) / sum(c))
# a res
# 1 A 1.2000000
# 2 B -0.8750000
# 3 C -0.4117647
# 4 D 0.2727273
# 5 E 0.1428571
When you aggregate, the FUN argument can be anything you want. Keep in mind that the value passed will be either a vector (if x is one column) or a little data.frame or matrix (if x is more than one). However, aggregate doesn't let you access the columns of a multi-column argument. For example:
aggregate( . ~ a, data = DF, FUN = function(x) diff(x[,1]) / sum(x[,2]) )
That fails with an error even though I used . (which takes all of the columns of DF that I'm not using elsewhere). To see what aggregate is trying to do there look at the following.
aggregate( . ~ a, data = DF, FUN = sum )
The two columns, b and c, were aggregated, but from the first attempt we know that you can't do something that accesses each column separately. So, strictly sticking with aggregate, you need two passes and three lines of code.
diffb <- aggregate( b ~ a, data = DF, FUN = diff )
Y <- aggregate( c ~ a, data = DF, FUN = sum )
Y$c <- -diffb$b / Y$c  # diff() returns b[2] - b[1], so negate to get b[1] - b[2]
Now Y contains the result you want.
The by function is simpler than aggregate and all it does is split the original data.frame using the indices and then apply the FUN function.
l <- by( data = DF, INDICES = DF$a, FUN = function(x) -diff(x$b)/sum(x$c), simplify = FALSE )
unlist(l)
You have to do a little work to get the result back into a data.frame if you really want one.
data.frame(a = names(l), x = unlist(l))
Using data.table could be faster and easier. Note that diff(b) returns b[2] - b[1], hence the -1 below to get b[1] - b[2].
library(data.table)
DT <- data.table(DF)
DT[, (-1*diff(b))/sum(c), by=a]
a V1
1: A 1.2000000
2: B -0.8750000
3: C -0.4117647
4: D 0.2727273
5: E 0.1428571
Using aggregate, not so good. I didn't find a better way to do it using aggregate :( but here's an attempt.
B <- aggregate(DF$b, by=list(DF$a), diff)
C <- aggregate(DF$c, by=list(DF$a), sum)
data.frame(a=B[,1], Result=(-1*B[,2])/C[,2])
a Result
1 A 1.2000000
2 B -0.8750000
3 C -0.4117647
4 D 0.2727273
5 E 0.1428571
A data.table solution, for efficiency of time and memory:
library(data.table)
DT <- as.data.table(DF)
DT[, list(calc = -diff(b) / sum(c)), by = a]  # negate diff to get b[1] - b[2]
You can use the base by() function:
listOfRows <- by(data = DF,
                 INDICES = DF$a,
                 FUN = function(x) {
                   data.frame(a = x$a[1],
                              res = (x$b[1] - x$b[2]) / (x$c[1] + x$c[2]))
                 })
newDF <- do.call(rbind, listOfRows)