R aggregate by multiple rows and apply summarizing function [duplicate]

Edit -- This question was originally titled << Long to wide data reshaping in R >>
I'm just learning R and trying to find ways to apply it to help out others in my life. As a test case, I'm working on reshaping some data, and I'm having trouble following the examples I've found online. What I'm starting with looks like this:
ID Obs 1 Obs 2 Obs 3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36
And what I want to end up with will look like this:
ID Obs 1 mean Obs 1 std dev Obs 2 mean Obs 2 std dev
1 x x x x
2 x x x x
3 x x x x
And so forth. What I'm unsure of is whether I need additional information in my long-form data, or what. I imagine that the math part (finding the mean and standard deviations) will be the easy part, but I haven't been able to find a way that seems to work to reshape the data correctly to start in on that process.
Thanks very much for any help.

This is an aggregation problem, not a reshaping problem as the question originally suggested -- we wish to aggregate each column into a mean and standard deviation by ID. There are many packages that handle such problems. In base R it can be done using aggregate like this (assuming DF is the input data frame):
ag <- aggregate(. ~ ID, DF, function(x) c(mean = mean(x), sd = sd(x)))
Note 1: A commenter pointed out that ag is a data frame in which some columns are matrices. Although that may seem strange at first, it actually simplifies access. ag has the same number of columns as the input DF. Its first column ag[[1]] is ID, and the ith column of the remainder, ag[[i+1]] (or equivalently ag[-1][[i]]), is the matrix of statistics for the ith input observation column. To access the jth statistic of the ith observation column, use ag[[i+1]][, j], which can also be written as ag[-1][[i]][, j].
On the other hand, suppose there are k statistic columns for each observation column in the input (k=2 in the question). If we flatten the output, then to access the jth statistic of the ith observation column we must use the more complex ag[[k*(i-1)+j+1]], or equivalently ag[-1][[k*(i-1)+j]].
For example, compare the simplicity of the first expression vs. the second:
ag[-1][[2]]
## mean sd
## [1,] 36.333 10.2144
## [2,] 32.250 4.1932
## [3,] 43.500 4.9497
ag_flat <- do.call("data.frame", ag) # flatten
ag_flat[-1][, 2 * (2-1) + 1:2]
## Obs_2.mean Obs_2.sd
## 1 36.333 10.2144
## 2 32.250 4.1932
## 3 43.500 4.9497
Note 2: The input in reproducible form is:
Lines <- "ID Obs_1 Obs_2 Obs_3
1 43 48 37
1 27 29 22
1 36 32 40
2 33 38 36
2 29 32 27
2 32 31 35
2 25 28 24
3 45 47 42
3 38 40 36"
DF <- read.table(text = Lines, header = TRUE)

There are a few different ways to go about it. reshape2 is a helpful package.
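For reference, a minimal reshape2 sketch (assuming the DF built in Note 2 of the previous answer): melt to long form, then dcast back with one aggregating function per statistic; the two results can be merged on ID if a single wide table is wanted.
library(reshape2)
long <- melt(DF, id.vars = "ID")                  # long form: ID, variable, value
dcast(long, ID ~ variable, fun.aggregate = mean)  # means of Obs_1..Obs_3 by ID
dcast(long, ID ~ variable, fun.aggregate = sd)    # standard deviations by ID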
Personally, I like using data.table
Below is a step-by-step walkthrough.
If myDF is your data.frame:
library(data.table)
DT <- data.table(myDF)
DT
# this will get you your mean and SD's for each column
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x)))]
# adding a `by` argument will give you the groupings
DT[, sapply(.SD, function(x) list(mean=mean(x), sd=sd(x))), by=ID]
# If you would like to round the values:
DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID]
# If we want to add names to the columns
wide <- setnames(
  DT[, sapply(.SD, function(x) list(mean=round(mean(x), 3), sd=round(sd(x), 3))), by=ID],
  c("ID", sapply(names(DT)[-1], paste0, c(".mean", ".SD"))))
wide
ID Obs.1.mean Obs.1.SD Obs.2.mean Obs.2.SD Obs.3.mean Obs.3.SD
1: 1 35.333 8.021 36.333 10.214 33.0 9.644
2: 2 29.750 3.594 32.250 4.193 30.5 5.916
3: 3 41.500 4.950 43.500 4.950 39.0 4.243
Also, this may or may not be helpful
> DT[, sapply(.SD, summary), .SDcols=names(DT)[-1]]
Obs.1 Obs.2 Obs.3
Min. 25.00 28.00 22.00
1st Qu. 29.00 31.00 27.00
Median 33.00 32.00 36.00
Mean 34.22 36.11 33.22
3rd Qu. 38.00 40.00 37.00
Max. 45.00 48.00 42.00

Here is probably the simplest way to go about it (with a reproducible example):
library(plyr)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
ddply(df, .(ID), summarize, Obs_1_mean=mean(Obs_1), Obs_1_std_dev=sd(Obs_1),
Obs_2_mean=mean(Obs_2), Obs_2_std_dev=sd(Obs_2))
ID Obs_1_mean Obs_1_std_dev Obs_2_mean Obs_2_std_dev
1 1 -0.13994642 0.8258445 -0.15186380 0.4251405
2 2 1.49982393 0.2282299 0.50816036 0.5812907
3 3 -0.09269806 0.6115075 -0.01943867 1.3348792
EDIT: The following approach saves you a lot of typing when dealing with many columns.
ddply(df, .(ID), colwise(mean))
ID Obs_1 Obs_2 Obs_3
1 1 -0.3748831 0.1787371 1.0749142
2 2 -1.0363973 0.0157575 -0.8826969
3 3 1.0721708 -1.1339571 -0.5983944
ddply(df, .(ID), colwise(sd))
ID Obs_1 Obs_2 Obs_3
1 1 0.8732498 0.4853133 0.5945867
2 2 0.2978193 1.0451626 0.5235572
3 3 0.4796820 0.7563216 1.4404602

Here is a dplyr solution.
set.seed(1)
df <- data.frame(ID=rep(1:3, 3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
library(dplyr)
df %>% group_by(ID) %>% summarise_each(funs(mean, sd))
# ID Obs_1_mean Obs_2_mean Obs_3_mean Obs_1_sd Obs_2_sd Obs_3_sd
# (int) (dbl) (dbl) (dbl) (dbl) (dbl) (dbl)
# 1 1 0.4854187 -0.3238542 0.7410611 1.1108687 0.2885969 0.1067961
# 2 2 0.4171586 -0.2397030 0.2041125 0.2875411 1.8732682 0.3438338
# 3 3 -0.3601052 0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692

Here's another take on the data.table answers, using @Carson's data, that's a bit more readable (and also a little faster, because it uses lapply instead of sapply):
library(data.table)
set.seed(1)
dt = data.table(ID=c(1:3), Obs_1=rnorm(9), Obs_2=rnorm(9), Obs_3=rnorm(9))
dt[, c(mean = lapply(.SD, mean), sd = lapply(.SD, sd)), by = ID]
# ID mean.Obs_1 mean.Obs_2 mean.Obs_3 sd.Obs_1 sd.Obs_2 sd.Obs_3
#1: 1 0.4854187 -0.3238542 0.7410611 1.1108687 0.2885969 0.1067961
#2: 2 0.4171586 -0.2397030 0.2041125 0.2875411 1.8732682 0.3438338
#3: 3 -0.3601052 0.8195368 -0.4087233 0.8105370 0.3829833 1.4705692

The updated dplyr solution, as of 2020, given that:
1: summarise_each_() is deprecated as of dplyr 0.7.0.
and
2: funs() is deprecated as of dplyr 0.8.0.
ag.dplyr <- DF %>% group_by(ID) %>% summarise(across(.cols = everything(), list(mean = mean, sd = sd)))

There is a helpful function in the psych package.
You should try the following implementation:
psych::describeBy(data$dependentvariable, group = data$groupingvariable)
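For instance, applied to the DF from the reproducible input in the first answer, it would look something like this (a sketch; output omitted):
psych::describeBy(DF[, -1], group = DF$ID)  # descriptive statistics for each Obs_* column, split by ID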

Related

calculate statistics and/or a function on multiple columns in subsets of a dataframe in R

I have a dataframe (sample of the following form):
DateTime Ind1 Ind2 V1 V2 Ac1 Ac2 w1 w2 w3 shift
2016-05-01 00:01:00 U A 5 7 20 100 50 70 200 1
2016-05-01 00:01:20 U A 5 7 20 109 35 77 140 1
2016-05-01 00:01:40 U A 5 7 40 120 55 97 160 1
...
2016-05-01 00:08:20 U A 5 7 15 157 70 70 204 2
...
2016-05-02 00:08:20 U A 5 7 28 147 65 90 240 2
...
2016-05-02 00:20:00 U A 5 7 35 210 45 100 167 3
I need a new dataframe where some statistics (e.g. mean, standard deviation) for the columns V1 to w3 are listed for each date-and-shift combination, something similar to the following:
Date shift Ind1 Ind2 avgV1 sdV1 avgV2 sdV2 avgAC1 ....
2016-05-01 1 U A 5.3 2.9 7.8 4.5 108 .....
2016-05-01 2 U A 6.7 3.5 8.9 5.0 99 .....
SOLUTION TRIED:
I can do the following steps.
1) extract date from DateTime
df$Date <- format(as.POSIXct(df$DateTime, format="%Y-%m-%d %H:%M:%S"), format="%Y-%m-%d")
2) label the data by date and shift.
df$DateShift <- paste(df$Date, df$shift)
3) for each subset, calculate some statistics on a col:
tmp_df <- data.frame(levels(as.factor(df$DateShift)))
avgV1 <- tapply(df$V1, df$DateShift, FUN=mean)
sdV1 <- tapply(df$V1, df$DateShift, FUN=sd)
avgV2<- tapply(df$V2, df$DateShift, FUN=mean)
....
However, I have more than 50 columns in the original dataframe, with different types of names (not as simple as in the example above).
Moreover, the statistics that I want to compute may vary (say, calculation of max and min, or some other user-defined function).
So I don't want to code by hand for the different combinations of columns and types of statistics (mean, standard deviation, etc.).
What is the way to automate this?
I am sure the dplyr solutions are coming, but the doBy package works very well for this kind of thing, unless you have many (millions+) rows, in which case it will be slow.
library(doBy)
df_avg <- summaryBy(. ~ Date + shift, FUN=c(mean, median, sd), data=df, na.rm=TRUE)
Will give a dataframe with V1.mean, V1.median, and so on.
The . ~ means "summarize all numeric variables". If you want to keep information from some factors in the dataframe, use the argument id.vars = ~somefac+somefac2, for example.
library(dplyr)
df %>%
mutate(Date = as.Date(DateTime)) %>%
group_by(Date, shift) %>%
summarise_each(funs(mean))
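Since summarise_each() and funs() are now deprecated, here is a sketch of the same idea with across() (this assumes dplyr >= 1.0 and that the measurement columns V1 through w3 are contiguous as in the sample); any other function can be dropped into the list:
library(dplyr)
df %>%
  mutate(Date = as.Date(DateTime)) %>%
  group_by(Date, shift) %>%
  summarise(across(V1:w3,
                   list(mean = ~mean(.x, na.rm = TRUE),
                        sd   = ~sd(.x, na.rm = TRUE))))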

Extracting corresponding other values in mutate when group_by is applied

I have a data frame with patient data and measurements of different variables over time.
The data frame looks a bit like this, but with more lab-value variables:
df <- data.frame(id=c(1,1,1,1,2,2,2,2,2),
time=c(0,3,7,35,0,7,14,28,42),
labvalue1=c(4.04,NA,2.93,NA,NA,3.78,3.66,NA,2.54),
labvalue2=c(NA,63.8,62.8,61.2,78.1,NA,77.6,75.3,NA))
> df
id time labvalue1 labvalue2
1 1 0 4.04 NA
2 1 3 NA 63.8
3 1 7 2.93 62.8
4 1 35 NA 61.2
5 2 0 NA 78.1
6 2 7 3.78 NA
7 2 14 3.66 77.6
8 2 28 NA 75.3
9 2 42 2.54 NA
I want to calculate, for each patient (unique ID), the decrease or slope per day between the first and last measurement, so that I can compare the slopes between patients. Time is in days. Eventually I want a new variable, e.g. diff_labvalues, for each lab value, that gives me for labvalue1:
For patient 1: (2.93-4.04)/(7-0) and for patient 2: (2.54-3.78)/(42-7) (for now ignoring the measurements in between, just last minus first); etc. for labvalue2, and so forth.
So far I have used dplyr and created the first1 and last1 functions, because first() and last() did not work with the NA values.
Then I grouped by 'id' and used mutate_all (because there are more lab values in the original df) to calculate the difference between the last1() and first1() lab values for each patient.
But I cannot find HOW to extract the corresponding time values (the delta-time value), which I need to calculate the slope of the decline.
Eventually I want something like this (last line):
first1 <- function(x) {
first(na.omit(x))
}
last1 <- function(x) {
last(na.omit(x))
}
df2 = df %>%
group_by(id) %>%
mutate_all(funs(diff=(last1(.)-first1(.)) / #it works until here
(time[position of last1(.)]-time[position of first1(.)]))) #something like this
Not sure if tidyverse even has a solution for this, so any help would be appreciated. :)
We can try
df %>%
group_by(id) %>%
filter(!is.na(labvalue1)) %>%
summarise(diff_labs = (last(labvalue1) - first(labvalue1))/(last(time) - first(time)))
# A tibble: 2 x 2
# id diff_labs
# <dbl> <dbl>
#1 1 -0.15857143
#2 2 -0.03542857
and
> (2.93-4.04)/ (7-0)
#[1] -0.1585714
> (2.54-3.78)/(42-7)
#[1] -0.03542857
Or another option is data.table
library(data.table)
setDT(df)[!is.na(labvalue1), .(diff_labs = (labvalue1[.N] - labvalue1[1])/(time[.N] - time[1])), id]
# id diff_labs
#1: 1 -0.15857143
#2: 2 -0.03542857
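For many labvalue* columns at once, a sketch of the same idea with across() (assuming dplyr >= 1.0); slope() is a helper introduced here only for illustration:
library(dplyr)
# per-group slope between the first and last non-missing value of a column
slope <- function(x, time) {
  ok <- !is.na(x)
  (last(x[ok]) - first(x[ok])) / (last(time[ok]) - first(time[ok]))
}
df %>%
  group_by(id) %>%
  summarise(across(starts_with("labvalue"), ~ slope(.x, time), .names = "diff_{.col}"))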

R - Create a new variable where each observation depends on another table and other variables in the data frame

I have the two following tables:
df <- data.frame(eth = c("A","B","B","A","C"),ZIP1 = c(1,1,2,3,5))
Inc <- data.frame(ZIP2 = c(1,2,3,4,5,6,7),A = c(56,98,43,4,90,19,59), B = c(49,10,69,30,10,4,95),C = c(69,2,59,8,17,84,30))
eth ZIP1 ZIP2 A B C
A 1 1 56 49 69
B 1 2 98 10 2
B 2 3 43 69 59
A 3 4 4 30 8
C 5 5 90 10 17
6 19 4 84
7 59 95 30
I would like to create a variable Inc in the df data frame where for each observation, the value is the intersection of the eth and ZIP of the observation. In my example, it would lead to:
eth ZIP1 Inc
A 1 56
B 1 49
B 2 10
A 3 43
C 5 17
A loop or brute force could solve it, but it takes time on my dataset; I'm looking for a more subtle way, maybe using data.table. It seems to me that this is a very standard question, and I apologize if it is; my inability to formulate a precise title for this problem (as you may have noticed) is maybe why I haven't found any similar question in searching the forum.
Thanks !
Sure, it can be done in data.table:
library(data.table)
setDT(df)
df[ melt(Inc, id.var="ZIP2", variable.name="eth", value.name="Inc"),
Inc := i.Inc
, on=c(ZIP1 = "ZIP2","eth") ]
The syntax for this "merge-assign" operation is X[i, Xcol := expression, on=merge_cols].
You can run the i = melt(Inc, id.var="ZIP", variable.name="eth", value.name="Inc") part on its own to see how it works. Inside the merge, columns from i can be referred to with i.* prefixes.
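For illustration, the melted lookup table begins like this (the values follow from the Inc defined above):
melt(Inc, id.var = "ZIP2", variable.name = "eth", value.name = "Inc")
#   ZIP2 eth Inc
# 1    1   A  56
# 2    2   A  98
# 3    3   A  43
# ...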
Alternately...
setDT(df)
setDT(Inc)
df[, Inc := Inc[.(ZIP1), eth, on="ZIP2", with=FALSE], by=eth]
This is built on a similar idea. The package vignettes are a good place to start for this sort of syntax.
We can use row/column indexing
df$Inc <- Inc[cbind(match(df$ZIP1, Inc$ZIP2), match(df$eth, colnames(Inc)))]
df
# eth ZIP1 Inc
#1 A 1 56
#2 B 1 49
#3 B 2 10
#4 A 3 43
#5 C 5 17
What about this?
library(reshape2)
merge(df, melt(Inc, id="ZIP2"), by.x = c("ZIP1", "eth"), by.y = c("ZIP2", "variable"))
ZIP1 eth value
1 1 A 56
2 1 B 49
3 2 B 10
4 3 A 43
5 5 C 17
Another option:
library(dplyr)
library(tidyr)
Inc %>%
gather(eth, value, -ZIP2) %>%
left_join(df, ., by = c("eth", "ZIP1" = "ZIP2"))
My solution (which may seem awkward):
for (i in 1:length(df$eth)) {
df$Inc[i] <- Inc[as.character(df$eth[i])][df$ZIP1[i], ]  # works here because ZIP2 equals the row number in Inc
}

Using apply() function to update the factor levels of multiple columns of a data frame in R

Straight to the question. Say I have the following data frame:
> head(temp)
Gender Age Agegroup
2 Male 63 61+
3 Male 60 50-60
4 Male 55 50-60
5 Male 36 30-39
7 Male 39 30-39
8 Male 63 61+
Calling a summary function:
> summary(temp)
Gender Age Agegroup
Male :864692 Min. :25.00 25-29:0
Female: 0 1st Qu.:35.00 30-39:205237
Median :45.00 40-49:235622
Mean :44.48 50-60:250977
3rd Qu.:54.00 61+ : 68807
Max. :64.00
As you can see, there are zero observations for the Female level and the 25-29 level. As a result, I don't need those levels. I remove them using the following code:
temp$Gender<-factor(temp$Gender)
temp$Agegroup<-factor(temp$Agegroup)
My question is: how would I use one of the apply functions to execute the code I used to remove the levels? Something that will look like:
# Pseudo code just to illustrate my purpose
temp[,c(1,3)]<-apply(temp[,c(1,3)],FUN=factor)
It will be handy in case I need to update the levels of lots of columns of a data frame. Thanks. Let me know if you need more clarification.
You're looking for droplevels.
Here's some sample data similar to yours:
set.seed(1)
mydf <- data.frame(A = factor(rep("M", 5), levels = c("M", "F")),
B = sample(20:50, 5, TRUE))
mydf$C <- cut(mydf$B, seq(0, 80, 10))
mydf
# A B C
# 1 M 28 (20,30]
# 2 M 31 (30,40]
# 3 M 37 (30,40]
# 4 M 48 (40,50]
# 5 M 26 (20,30]
summary(mydf)
# A B C
# M:5 Min. :26 (20,30]:2
# F:0 1st Qu.:28 (30,40]:2
# Median :31 (40,50]:1
# Mean :34 (0,10] :0
# 3rd Qu.:37 (10,20]:0
# Max. :48 (50,60]:0
# (Other):0
Now, let's use droplevels and see what happens:
mydf2 <- droplevels(mydf)
summary(mydf2)
# A B C
# M:5 Min. :26 (20,30]:2
# 1st Qu.:28 (30,40]:2
# Median :31 (40,50]:1
# Mean :34
# 3rd Qu.:37
# Max. :48
If you really wanted to use an *apply approach, perhaps you can use lapply as follows:
mydf3 <- mydf ## Create a copy of your original just in case
mydf3[] <- lapply(mydf3, factor)
summary(mydf3)
# A B C
# M:5 26:1 (20,30]:2
# 28:1 (30,40]:2
# 31:1 (40,50]:1
# 37:1
# 48:1
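Note that lapply(mydf3, factor) also converts the numeric column B into a factor, as the summary above shows. If that is not wanted, a small variant (a sketch) that re-factors only the columns that are already factors:
mydf4 <- mydf                       # work on another copy
is_fac <- sapply(mydf4, is.factor)  # which columns are factors?
mydf4[is_fac] <- lapply(mydf4[is_fac], factor)
summary(mydf4)                      # B stays numeric; unused levels of A and C are dropped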

split apply recombine, plyr, data.table in R

I am doing the classic split-apply-recombine thing in R. My data set is a bunch of firms over time. The "apply" step is running a regression for each firm and returning the residuals, so I am not aggregating by firm. plyr is great for this, but it takes a very, very long time to run when the number of firms is large. Is there a way to do this with data.table?
Sample Data:
dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21
I need to split by each id (namely 1 and 2). Run a regression, return the residuals and append it as a column to my data. Is there a way to do this using data.table?
DWin's answer is correct for v1.8.0 (as currently on CRAN). But in v1.8.1 (on the R-Forge repository), := now works by group. It works for non-contiguous groups too, so there is no need to setkey first for it to line up.
dtb <- as.data.table(dat)
dtb
dte id val1 val2
1: 2001-10-02 1 10 25
2: 2001-10-03 1 11 24
3: 2001-10-04 1 12 23
4: 2001-10-02 2 13 22
5: 2001-10-03 2 14 21
dtb[, resid:=residuals(lm(val1 ~ val2)), by=id]
dte id val1 val2 resid
1: 2001-10-02 1 10 25 1.631688e-15
2: 2001-10-03 1 11 24 -3.263376e-15
3: 2001-10-04 1 12 23 1.631688e-15
4: 2001-10-02 2 13 22 0.000000e+00
5: 2001-10-03 2 14 21 0.000000e+00
To upgrade to v1.8.1, just install from the R-Forge repo (R 2.15.0+ is needed when installing any binary package from R-Forge):
install.packages("data.table", repos="http://R-Forge.R-project.org")
or install from source if you can't upgrade to latest R. data.table itself only needs R 2.12.0+.
Extending to the 1MM-row case:
DT = data.table(dte=Sys.Date()+1:1000000,
id=sample(1:2, 1000000, repl=TRUE),
val1=runif(1000000), val2=runif(1000000) )
setkey(DT, id)
system.time(ans1 <- cbind(DT, DT[, residuals(lm(val1 ~ val2)), by="id"]) )
user system elapsed
12.272 0.872 13.182
ans1
dte id val1 val2 id V1
1: 2012-07-02 1 0.8369147 0.57553383 1 0.336647598
2: 2012-07-05 1 0.0109102 0.02532214 1 -0.488633325
3: 2012-07-06 1 0.4977762 0.16607786 1 -0.001952414
---
999998: 4750-05-27 2 0.1296722 0.62645838 2 -0.370627034
999999: 4750-05-28 2 0.2686352 0.04890710 2 -0.231952238
1000000: 4750-05-29 2 0.9981029 0.91626787 2 0.497948275
system.time(DT[, resid:=residuals(lm(val1 ~ val2)), by=id])
user system elapsed
7.436 0.648 8.107
DT
dte id val1 val2 resid
1: 2012-07-02 1 0.8369147 0.57553383 0.336647598
2: 2012-07-05 1 0.0109102 0.02532214 -0.488633325
3: 2012-07-06 1 0.4977762 0.16607786 -0.001952414
---
999998: 4750-05-27 2 0.1296722 0.62645838 -0.370627034
999999: 4750-05-28 2 0.2686352 0.04890710 -0.231952238
1000000: 4750-05-29 2 0.9981029 0.91626787 0.497948275
The example above only has 2 groups, is quite small at under 40MB, and Rprof shows 96% of the time is spent in lm. So in these cases := by group is not really for a speed advantage, but more for convenience; i.e., less code to write and no superfluous columns added to the output. As size grows, the avoidance of copies comes into play and the speed advantages start to show. In particular, transform in j will slow down terribly as the number of groups increases.
I'm guessing this needs to be sorted by "id" to line up properly. Luckily that happens automatically when you set the key:
dat <-read.table(text="dte, id, val1, val2
2001-10-02, 1, 10, 25
2001-10-03, 1, 11, 24
2001-10-04, 1, 12, 23
2001-10-02, 2, 13, 22
2001-10-03, 2, 14, 21
", header=TRUE, sep=",")
dtb <- data.table(dat)
setkey(dtb, "id")
dtb[, residuals(lm(val1 ~ val2)), by="id"]
#---------------
cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"])
#---------------
dte id val1 val2 id.1 V1
[1,] 2001-10-02 1 10 25 1 1.631688e-15
[2,] 2001-10-03 1 11 24 1 -3.263376e-15
[3,] 2001-10-04 1 12 23 1 1.631688e-15
[4,] 2001-10-02 2 13 22 2 0.000000e+00
[5,] 2001-10-03 2 14 21 2 0.000000e+00
> dat <- data.frame(dte=Sys.Date()+1:1000000,
id=sample(1:2, 1000000, repl=TRUE),
val1=runif(1000000), val2=runif(1000000) )
> dtb <- data.table(dat)
> setkey(dtb, "id")
> system.time( cbind(dtb, dtb[, residuals(lm(val1 ~ val2)), by="id"]) )
user system elapsed
1.696 0.798 2.466
> system.time( dtb[,transform(.SD,r = residuals(lm(val1~val2))),by = "id"] )
user system elapsed
1.757 0.908 2.690
EDIT from Matthew:
This is all correct for v1.8.0 on CRAN. With the small addition that transform in j is the subject of data.table wiki point 2: "For speed don't transform() by group, cbind() afterwards". But, := now works by group in v1.8.1 and is both simple and fast. See my answer for illustration (but no need to vote for it).
Well, I voted for it. Here is the console command to install v1.8.1 on a Mac (provided you have the proper XCode tools available, since it is only there in source form):
install.packages("data.table", repos= "http://R-Forge.R-project.org", type="source",
lib="/Library/Frameworks/R.framework/Versions/2.14/Resources/lib")
(For some reason I could not get the Mac GUI Package Installer to read r-forge as a repository.)
