mean and standard deviation by group for multiple variables [duplicate] - r

This question already has answers here:
plyr package writing the same function over multiple columns
(2 answers)
Closed 9 years ago.
I am sure this question has been answered before, but I would like to caclulate mean and sd by treatment for multiple variables (100s) all at once and cannot figure out how to do it aside from using a long winded ddply code.
This is a portion of my dataframe (g):
trt blk til res sand silt clay ibd1_6 ibd9_14 ibd_ave
1 CTK 1 CT K 74 15 11 1.323 1.593 1.458
2 CTK 2 CT K 71 15 14 1.575 1.601 1.588
3 CTK 3 CT K 72 14 14 1.551 1.594 1.573
4 CTR 1 CT R 72 15 13 1.560 1.647 1.604
5 CTR 2 CT R 73 14 13 1.612 1.580 1.596
6 CTR 3 CT R 73 13 14 1.709 1.577 1.643
7 ZTK 1 ZT K 72 16 12 1.526 1.546 1.536
8 ZTK 2 ZT K 71 16 13 1.292 1.626 1.459
9 ZTK 3 ZT K 71 17 12 1.623 1.607 1.615
10 ZTR 1 ZT R 66 16 18 1.719 1.709 1.714
11 ZTR 2 ZT R 67 17 16 1.529 1.708 1.618
12 ZTR 3 ZT R 66 17 17 1.663 1.655 1.659
I would like to have a function that does what ddply does, i.e ddply(g, trt, meanSand=mean(sand), sdSand=sd(sand), meanSilt=mean(silt). . . .) without having to write it all out. Any ideas? Thank you for your patience!

The function you will likely want to apply to your dataframe is aggregate() with either mean or sd as the function parameter.

assuming myDF is your original dataset:
library(data.table)
myDT <- data.table(myDF)
# Which variables to calculate All columns but the first five? :
variables <- tail( names(myDT), -5)
myDT[, lapply(.SD, function(x) list(mean(x), sd(x))), .SDcols=variables, by=list(trt, til)]
## OR Separately, if you prefer shorter `lapply` statements
myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
--
> myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 14.66667 13.00000 1.483000 1.596000 1.539667
# 2: CTR CT 14.00000 13.33333 1.627000 1.601333 1.614333
# 3: ZTK ZT 16.33333 12.33333 1.480333 1.593000 1.536667
# 4: ZTR ZT 16.66667 17.00000 1.637000 1.690667 1.663667
> myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 0.5773503 1.7320508 0.13908271 0.004358899 0.07112196
# 2: CTR CT 1.0000000 0.5773503 0.07562407 0.039576929 0.02514624
# 3: ZTK ZT 0.5773503 0.5773503 0.17015973 0.041797129 0.07800214
# 4: ZTR ZT 0.5773503 1.0000000 0.09763196 0.030892286 0.04816984

aggregate(g[, c("sand", "silt", "clay")], g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )
Using an anonymous function with aggregate.data.frame allows one to get both values with one call. You only want to pass in the columns to be aggregated.If you had a long list of columns and only wanted to exclude let's say the first 4 from calculations, it could be written as:
aggregate(g[, names(g)[-(1:4)], g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )

Related

Subset dataframe based on the condition in a column of another dataframe

I have two data frames where each line represent data from one individual. Lines in the first data frame (that enter the specific analysis of geometric morphometry) correspond to the lines in the second data frame (additional descriptions of animals as sampling site or sex). I would like to subset the first data frame based on the condition form the second data frame (e.g. select all lines of the first data frame that are females, but sex of the animal is defined in the second dataframe). It is possible to do it by adding new column to the first data frame, subset it based on this new column and remove the column. Is there any other more elegant way to do it?
df1
[,1] [,2] [,3] [,4] [,5] [,6]
IMGP6995.JPG -0.07612235 0.08189661 0.020690012 0.07532420 0.05373111 0.07139840
IMGP6997.JPG -0.06759482 0.09449720 0.022907275 0.08807724 0.05953926 0.08256468
IMGP6998.JPG -0.06902234 0.08418980 0.013522385 0.08186618 0.05375763 0.07769076
IMGP6999.JPG -0.07201136 0.08475765 0.009462017 0.08080315 0.06148776 0.07059229
IMGP7001.JPG -0.08112908 0.08485488 0.037193459 0.07971364 0.05834018 0.07917079
IMGP7012.JPG -0.07059829 0.07905529 0.021803102 0.07480276 0.04849282 0.07270644
IMGP7013.JPG -0.07176010 0.08561111 0.009568661 0.08297752 0.06374573 0.08272648
IMGP7014.JPG -0.06751993 0.08895038 0.016800152 0.08799522 0.04776876 0.08100145
IMGP7015.JPG -0.07945826 0.07844136 0.008176800 0.07431915 0.06471417 0.07348312
IMGP7017.JPG -0.06587874 0.09280032 0.010204330 0.09085868 0.05290771 0.08739235
df2
number site m m..evis. m..gonads. sex SL TL AP RP
37 10 KB 1.263 1.003 0.136 F 39.38949 47.72564 NA NA
38 11 KB 4.215 3.510 0.093 F 53.48064 65.29663 NA NA
39 12 KB 3.508 2.997 0.079 F 51.59589 64.76600 NA NA
40 13 KB 3.250 2.752 0.085 F 49.55853 61.74319 NA NA
41 14 KB 3.596 3.149 0.101 F 51.42303 64.79511 NA NA
42 10 KKB 3.257 2.451 0.270 M 55.07909 67.52057 1468.017 598.9462
43 11 KKB 3.493 2.275 0.666 M 54.24882 65.61726 1722.414 757.1050
44 12 KKB 3.066 2.210 0.300 M 53.56323 64.09848 1410.891 638.4123
45 13 KKB 3.294 2.193 0.652 M 51.66717 63.49136 1428.063 651.1915
46 14 KKB 2.803 1.871 0.582 M 50.91185 60.90951 1236.438 660.8433
df1 after subset
[,1] [,2] [,3] [,4] [,5] [,6]
IMGP6995.JPG -0.07612235 0.08189661 0.020690012 0.07532420 0.05373111 0.07139840
IMGP6997.JPG -0.06759482 0.09449720 0.022907275 0.08807724 0.05953926 0.08256468
IMGP6998.JPG -0.06902234 0.08418980 0.013522385 0.08186618 0.05375763 0.07769076
IMGP6999.JPG -0.07201136 0.08475765 0.009462017 0.08080315 0.06148776 0.07059229
IMGP7001.JPG -0.08112908 0.08485488 0.037193459 0.07971364 0.05834018 0.07917079
df1[df2$sex %in% "F", ]
# [,1] [,2] [,3] [,4] [,5] [,6]
# IMGP6995.JPG -0.07612235 0.08189661 0.020690012 0.07532420 0.05373111 0.07139840
# IMGP6997.JPG -0.06759482 0.09449720 0.022907275 0.08807724 0.05953926 0.08256468
# IMGP6998.JPG -0.06902234 0.08418980 0.013522385 0.08186618 0.05375763 0.07769076
# IMGP6999.JPG -0.07201136 0.08475765 0.009462017 0.08080315 0.06148776 0.07059229
# IMGP7001.JPG -0.08112908 0.08485488 0.037193459 0.07971364 0.05834018 0.07917079
Explanation
Your df1 looks like a matrix, not a data.frame. But the solution I provided will also work if df1 is a data frame.
df2$sex %in% "F" reports if sex matches F. and reports a logical vector with TRUE and FALSE. After that, you can use that to subset df1.
Data
df1 <- matrix(c(-0.07612235, 0.08189661, 0.020690012, 0.07532420, 0.05373111, 0.07139840,
-0.06759482, 0.09449720, 0.022907275, 0.08807724, 0.05953926, 0.08256468,
-0.06902234, 0.08418980, 0.013522385, 0.08186618, 0.05375763, 0.07769076,
-0.07201136, 0.08475765, 0.009462017, 0.08080315, 0.06148776, 0.07059229,
-0.08112908, 0.08485488, 0.037193459, 0.07971364, 0.05834018, 0.07917079,
-0.07059829, 0.07905529, 0.021803102, 0.07480276, 0.04849282, 0.07270644,
-0.07176010, 0.08561111, 0.009568661, 0.08297752, 0.06374573, 0.08272648,
-0.06751993, 0.08895038, 0.016800152, 0.08799522, 0.04776876, 0.08100145,
-0.07945826, 0.07844136, 0.008176800, 0.07431915, 0.06471417, 0.07348312,
-0.06587874, 0.09280032, 0.010204330, 0.09085868, 0.05290771, 0.08739235),
ncol = 6, byrow = TRUE)
rownames(df1) <- c("IMGP6995.JPG", "IMGP6997.JPG", "IMGP6998.JPG", "IMGP6999.JPG",
"IMGP7001.JPG", "IMGP7012.JPG", "IMGP7013.JPG", "IMGP7014.JPG",
"IMGP7015.JPG", "IMGP7017.JPG")
df2 <- read.table(text = " number site m m..evis. m..gonads. sex SL TL AP RP
37 10 KB 1.263 1.003 0.136 F 39.38949 47.72564 NA NA
38 11 KB 4.215 3.510 0.093 F 53.48064 65.29663 NA NA
39 12 KB 3.508 2.997 0.079 F 51.59589 64.76600 NA NA
40 13 KB 3.250 2.752 0.085 F 49.55853 61.74319 NA NA
41 14 KB 3.596 3.149 0.101 F 51.42303 64.79511 NA NA
42 10 KKB 3.257 2.451 0.270 M 55.07909 67.52057 1468.017 598.9462
43 11 KKB 3.493 2.275 0.666 M 54.24882 65.61726 1722.414 757.1050
44 12 KKB 3.066 2.210 0.300 M 53.56323 64.09848 1410.891 638.4123
45 13 KKB 3.294 2.193 0.652 M 51.66717 63.49136 1428.063 651.1915
46 14 KKB 2.803 1.871 0.582 M 50.91185 60.90951 1236.438 660.8433",
header = TRUE, stringsAsFactors = FALSE)

Averaging every n columns and keep first two columns in the new data.frame in r

I have a data frame for daily time series with 4 observation for every day (every 6 hours) for each x and y (I have 202552 cells).
> head(tab,10)
x y X1990.05.01.01.00.00 X1990.05.01.07.00.00 X1990.05.01.13.00.00 X1990.05.01.19.00.00 X1990.05.02.01.00.00 X1990.05.02.07.00.00 X1990.05.02.13.00.00
1 5.000 60 276.9105 277.8516 278.9908 279.2422 279.6751 279.8078 280.4396
2 5.125 60 276.8863 277.8682 278.9966 279.2543 279.6863 279.7885 280.4033
3 5.250 60 276.8621 277.8830 279.0006 279.2659 279.6989 279.7688 280.3661
4 5.375 60 276.8379 277.8969 279.0029 279.2772 279.7123 279.7477 280.3289
5 5.500 60 276.8142 277.9094 279.0033 279.2879 279.7257 279.7244 280.2909
6 5.625 60 276.7913 277.9224 279.0033 279.2987 279.7396 279.6993 280.2523
7 5.750 60 276.7707 277.9363 279.0020 279.3094 279.7531 279.6715 280.2142
8 5.875 60 276.7537 277.9520 279.0002 279.3202 279.7656 279.6406 280.1770
9 6.000 60 276.7416 277.9713 278.9980 279.3314 279.7773 279.6070 280.1407
10 6.125 60 276.7357 277.9946 278.9953 279.3435 279.7871 279.5707 280.1071
X1990.05.02.19.00.00 X1990.05.03.01.00.00 X1990.05.03.07.00.00 X1990.05.03.13.00.00 X1990.05.03.19.00.00 X1990.05.04.01.00.00 X1990.05.04.07.00.00
1 280.5674 280.3316 280.3796 280.2308 280.6216 280.6216 280.1842
2 280.5414 280.3106 280.3697 280.2133 280.6220 280.6368 280.2053
3 280.5145 280.2886 280.3594 280.1927 280.6184 280.6503 280.2227
4 280.4858 280.2653 280.3482 280.1703 280.6113 280.6619 280.2380
5 280.4562 280.2420 280.3379 280.1466 280.6010 280.6722 280.2492
6 280.4262 280.2192 280.3280 280.1219 280.5880 280.6816 280.2572
7 280.3957 280.1981 280.3209 280.0973 280.5732 280.6910 280.2613
8 280.3661 280.1793 280.3159 280.0748 280.5571 280.7009 280.2626
9 280.3384 280.1640 280.3155 280.0542 280.5414 280.7112 280.2599
10 280.3128 280.1542 280.3195 280.0385 280.5270
I'd like to compute the daily average for every 4 columns (as each day has 4 measurements). I was able to use this function but I need to keep x and y for each row.
### daily mean
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
DM<- data.frame(byapply(tab[3:2800], 4, rowMeans))
> head(DM, 10)
X1 X2 X3 X4 X5
1 278.2488 280.1225 280.3909 279.4138 276.6809
2 278.2514 280.1049 280.3789 279.4395 276.7141
3 278.2529 280.0871 280.3648 279.4634 276.7437
4 278.2537 280.0687 280.3488 279.4858 276.7691
5 278.2537 280.0493 280.3319 279.5066 276.7909
6 278.2539 280.0294 280.3143 279.5264 276.8090
7 278.2546 280.0086 280.2974 279.5453 276.8244
8 278.2565 279.9873 280.2818 279.5639 276.8377
9 278.2605 279.9658 280.2688 279.5819 276.8495
10 278.2673 279.9444 280.2598 279.5998 276.8611
Then I can use cbind to link daily means with each x and y
lonlat<-tab[-(3:2800)]
DMxy<- data.frame(cbind(lonlat, DM))
But I am looking for a way that I can compute the daily average directly by keeping the first two columns (x and y) in the new data frame (without deleting x and y) to minimize any possible error in cobind
Instead of
DM<- data.frame(byapply(tab[3:2800], 4, rowMeans))
try
DM2 <- cbind(byapply(tab[-(1:2)], 4, rowMeans), tab[1:2])
That will get you the desired result in a single step. Also, you minimize the chance of a mistake because you don't need to know the length of your dataframe; tab[-(1:2)] means "Every column in tab except the first two".
Classic textbook case to not store data in wide format due to needed operations such as grouped aggregation, specifically averaging. Consider melting your data into long format, and aggregate by the day for each X and Y grouping:
DATA (OP's posted example but filled in missing row 10 last two values)
txt= ' x y X1990.05.01.01.00.00 X1990.05.01.07.00.00 X1990.05.01.13.00.00 X1990.05.01.19.00.00 X1990.05.02.01.00.00 X1990.05.02.07.00.00 X1990.05.02.13.00.00 X1990.05.02.19.00.00 X1990.05.03.01.00.00 X1990.05.03.07.00.00 X1990.05.03.13.00.00 X1990.05.03.19.00.00 X1990.05.04.01.00.00 X1990.05.04.07.00.00
1 5.000 60 276.9105 277.8516 278.9908 279.2422 279.6751 279.8078 280.4396 280.5674 280.3316 280.3796 280.2308 280.6216 280.6216 280.1842
2 5.125 60 276.8863 277.8682 278.9966 279.2543 279.6863 279.7885 280.4033 280.5414 280.3106 280.3697 280.2133 280.6220 280.6368 280.2053
3 5.250 60 276.8621 277.8830 279.0006 279.2659 279.6989 279.7688 280.3661 280.5145 280.2886 280.3594 280.1927 280.6184 280.6503 280.2227
4 5.375 60 276.8379 277.8969 279.0029 279.2772 279.7123 279.7477 280.3289 280.4858 280.2653 280.3482 280.1703 280.6113 280.6619 280.2380
5 5.500 60 276.8142 277.9094 279.0033 279.2879 279.7257 279.7244 280.2909 280.4562 280.2420 280.3379 280.1466 280.6010 280.6722 280.2492
6 5.625 60 276.7913 277.9224 279.0033 279.2987 279.7396 279.6993 280.2523 280.4262 280.2192 280.3280 280.1219 280.5880 280.6816 280.2572
7 5.750 60 276.7707 277.9363 279.0020 279.3094 279.7531 279.6715 280.2142 280.3957 280.1981 280.3209 280.0973 280.5732 280.6910 280.2613
8 5.875 60 276.7537 277.9520 279.0002 279.3202 279.7656 279.6406 280.1770 280.3661 280.1793 280.3159 280.0748 280.5571 280.7009 280.2626
9 6.000 60 276.7416 277.9713 278.9980 279.3314 279.7773 279.6070 280.1407 280.3384 280.1640 280.3155 280.0542 280.5414 280.7112 280.2599
10 6.125 60 276.7357 277.9946 278.9953 279.3435 279.7871 279.5707 280.1071 280.3128 280.1542 280.3195 280.0385 280.5270 280.6581 280.3139'
df <- read.table(text=txt, header=TRUE)
CODE
library(reshape2)
mdf <- melt(df, id.vars = c('x', 'y'), variable.name = "day")
mdf$day <- gsub("X", "", mdf$day)
mdf$datetime <- as.POSIXct(mdf$day, format="%Y.%m.%d.%H.%M.%S")
mdf$day <- format(mdf$datetime, "%Y-%m-%d")
head(mdf)
# x y day value datetime
# 1 5.000 60 1990-05-01 276.9105 1990-05-01 01:00:00
# 2 5.125 60 1990-05-01 276.8863 1990-05-01 01:00:00
# 3 5.250 60 1990-05-01 276.8621 1990-05-01 01:00:00
# 4 5.375 60 1990-05-01 276.8379 1990-05-01 01:00:00
# 5 5.500 60 1990-05-01 276.8142 1990-05-01 01:00:00
# 6 5.625 60 1990-05-01 276.7913 1990-05-01 01:00:00
aggdf <- aggregate(value ~ x + y + day, mdf, FUN=mean)
aggdf <- with(aggdf, aggdf[order(x,y),]) # RE-ORDER BY X
row.names(aggdf) <- NULL # RESET ROW NAMES
head(aggdf, 12)
# x y day value
# 1 5.000 60 1990-05-01 278.2488
# 2 5.000 60 1990-05-02 280.1225
# 3 5.000 60 1990-05-03 280.3909
# 4 5.000 60 1990-05-04 280.4029
# 5 5.125 60 1990-05-01 278.2514
# 6 5.125 60 1990-05-02 280.1049
# 7 5.125 60 1990-05-03 280.3789
# 8 5.125 60 1990-05-04 280.4211
# 9 5.250 60 1990-05-01 278.2529
# 10 5.250 60 1990-05-02 280.0871
# 11 5.250 60 1990-05-03 280.3648
# 12 5.250 60 1990-05-04 280.4365

Automate regression by rows

I have a data.frame
set.seed(100)
exp <- data.frame(exp = c(rep(LETTERS[1:2], each = 10)), re = c(rep(seq(1, 10, 1), 2)), age1 = seq(10, 29, 1), age2 = seq(30, 49, 1),
h = c(runif(20, 10, 40)), h2 = c(40 + runif(20, 4, 9)))
I'd like to make a lm for each row in a data set (h and h2 ~ age1 and age2)
I do it by loop
exp$modelh <- 0
for (i in 1:length(exp$exp)){
age = c(exp$age1[i], exp$age2[i])
h = c(exp$h[i], exp$h2[i])
model = lm(age ~ h)
exp$modelh[i] = coef(model)[1] + 100 * coef(model)[2]
}
and it works well but takes some time with very large files. Will be grateful for the faster solution f.ex. dplyr
Using dplyr, we can try with rowwise() and do. Inside the do, we concatenate (c) the 'age1', 'age2' to create 'age', likewise, we can create 'h', apply lm, extract the coef to create the column 'modelh'.
library(dplyr)
exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )
gives the output
# exp re age1 age2 h h2 modelh
#1 A 1 10 30 19.23298 46.67906 68.85506
#2 A 2 11 31 17.73018 47.55402 66.17050
#3 A 3 12 32 26.56967 46.69174 84.98486
#4 A 4 13 33 11.69149 47.74486 61.98766
#5 A 5 14 34 24.05648 46.10051 82.90167
#6 A 6 15 35 24.51312 44.85710 89.21053
#7 A 7 16 36 34.37208 47.85151 113.37492
#8 A 8 17 37 21.10962 48.40977 74.79483
#9 A 9 18 38 26.39676 46.74548 90.34187
#10 A 10 19 39 15.10786 45.38862 75.07002
#11 B 1 20 40 28.74989 46.44153 100.54666
#12 B 2 21 41 36.46497 48.64253 125.34773
#13 B 3 22 42 18.41062 45.74346 81.70062
#14 B 4 23 43 21.95464 48.77079 81.20773
#15 B 5 24 44 32.87653 47.47637 115.95097
#16 B 6 25 45 30.07065 48.44727 101.10688
#17 B 7 26 46 16.13836 44.90204 84.31080
#18 B 8 27 47 20.72575 47.14695 87.00805
#19 B 9 28 48 20.78425 48.94782 84.25406
#20 B 10 29 49 30.70872 44.65144 128.39415
We could do this with the devel version of data.table i.e. v1.9.5. Instructions to install the devel version are here.
We convert the 'data.frame' to 'data.table' (setDT), create a column 'rn' with the option keep.rownames=TRUE. We melt the dataset by specifying the patterns in the measure to convert from 'wide' to 'long' format. Grouped by 'rn', we do the lm and get the coef. This can be assigned as a new column in the original dataset ('exp') while removing the unwanted 'rn' column by assigning (:=) it to NULL.
library(data.table)#v1.9.5+
modelh <- melt(setDT(exp, keep.rownames=TRUE), measure=patterns('^age', '^h'),
value.name=c('age', 'h'))[, {model <- lm(age ~h)
coef(model)[1] + 100 * coef(model)[2]},rn]$V1
exp[, modelh:= modelh][, rn := NULL]
exp
# exp re age1 age2 h h2 modelh
# 1: A 1 10 30 19.23298 46.67906 68.85506
# 2: A 2 11 31 17.73018 47.55402 66.17050
# 3: A 3 12 32 26.56967 46.69174 84.98486
# 4: A 4 13 33 11.69149 47.74486 61.98766
# 5: A 5 14 34 24.05648 46.10051 82.90167
# 6: A 6 15 35 24.51312 44.85710 89.21053
# 7: A 7 16 36 34.37208 47.85151 113.37492
# 8: A 8 17 37 21.10962 48.40977 74.79483
# 9: A 9 18 38 26.39676 46.74548 90.34187
#10: A 10 19 39 15.10786 45.38862 75.07002
#11: B 1 20 40 28.74989 46.44153 100.54666
#12: B 2 21 41 36.46497 48.64253 125.34773
#13: B 3 22 42 18.41062 45.74346 81.70062
#14: B 4 23 43 21.95464 48.77079 81.20773
#15: B 5 24 44 32.87653 47.47637 115.95097
#16: B 6 25 45 30.07065 48.44727 101.10688
#17: B 7 26 46 16.13836 44.90204 84.31080
#18: B 8 27 47 20.72575 47.14695 87.00805
#19: B 9 28 48 20.78425 48.94782 84.25406
#20: B 10 29 49 30.70872 44.65144 128.39415
Great (double) answer from #akrun.
Just a suggestion for your future analysis as you mentioned "it's an example of a bigger problem". Obviously, if you are really interested in building models rowwise then you'll create more and more columns as your age and h observations increase. If you get N observations you'll have to use 2xN columns for those 2 variables only.
I'd suggest to use a long data format in order to increase your rows instead of your columns.
Something like:
exp[1,] # how your first row (model building info) looks like
# exp re age1 age2 h h2
# 1 A 1 10 30 19.23298 46.67906
reshape(exp[1,], # how your model building info is transformed
varying = list(c("age1","age2"),
c("h","h2")),
v.names = c("age_value","h_value"),
direction = "long")
# exp re time age_value h_value id
# 1.1 A 1 1 10 19.23298 1
# 1.2 A 1 2 30 46.67906 1
Apologies if the "bigger problem" refers to something else and this answer is irrelevant.
With base R, the function sprintf can help us create formulas. And lapply carries out the calculation.
strings <- sprintf("c(%f,%f) ~ c(%f,%f)", exp$age1, exp$age2, exp$h, exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
exp$modelh <- unlist(lst)
exp
# exp re age1 age2 h h2 modelh
# 1 A 1 10 30 19.23298 46.67906 68.85506
# 2 A 2 11 31 17.73018 47.55402 66.17050
# 3 A 3 12 32 26.56967 46.69174 84.98486
# 4 A 4 13 33 11.69149 47.74486 61.98766
# 5 A 5 14 34 24.05648 46.10051 82.90167
# 6 A 6 15 35 24.51312 44.85710 89.21053
# 7 A 7 16 36 34.37208 47.85151 113.37493
# 8 A 8 17 37 21.10962 48.40977 74.79483
# 9 A 9 18 38 26.39676 46.74548 90.34187
# 10 A 10 19 39 15.10786 45.38862 75.07002
# 11 B 1 20 40 28.74989 46.44153 100.54666
# 12 B 2 21 41 36.46497 48.64253 125.34773
# 13 B 3 22 42 18.41062 45.74346 81.70062
# 14 B 4 23 43 21.95464 48.77079 81.20773
# 15 B 5 24 44 32.87653 47.47637 115.95097
# 16 B 6 25 45 30.07065 48.44727 101.10688
# 17 B 7 26 46 16.13836 44.90204 84.31080
# 18 B 8 27 47 20.72575 47.14695 87.00805
# 19 B 9 28 48 20.78425 48.94782 84.25406
# 20 B 10 29 49 30.70872 44.65144 128.39416
In the lapply function the expression as.formula(x) is what converts the formulas created in the first line into a format usable by the lm function.
Benchmark
library(dplyr)
library(microbenchmark)
set.seed(100)
big.exp <- data.frame(age1=sample(30, 1e4, T),
age2=sample(30:50, 1e4, T),
h=runif(1e4, 10, 40),
h2= 40 + runif(1e4,4,9))
microbenchmark(
plafort = {strings <- sprintf("c(%f,%f) ~ c(%f,%f)", big.exp$age1, big.exp$age2, big.exp$h, big.exp$h2)
lst <- lapply(strings, function(x) {model <- lm(as.formula(x));coef(model)[1] + 100 * coef(model)[2]})
big.exp$modelh <- unlist(lst)},
akdplyr = {big.exp %>%
rowwise() %>%
do({
age <- c(.$age1, .$age2)
h <- c(.$h, .$h2)
model <- lm(age ~ h)
data.frame(., modelh = coef(model)[1] + 100*coef(model)[2])
} )}
,times=5)
t: seconds
expr min lq mean median uq max neval cld
plafort 13.00605 13.41113 13.92165 13.56927 14.53814 15.08366 5 a
akdplyr 26.95064 27.64240 29.40892 27.86258 31.02955 33.55940 5 b
(Note: I downloaded the newest 1.9.5 devel version of data.table today, but continued to receive errors when trying to test it.
The results also differ fractionally (1.93 x 10^-8). Rounding likely accounts for the difference.)
all.equal(pl, ak)
[1] "Attributes: < Component “class”: Lengths (1, 3) differ (string compare on first 1) >"
[2] "Attributes: < Component “class”: 1 string mismatch >"
[3] "Component “modelh”: Mean relative difference: 1.933893e-08"
Conclusion
The lapply approach seems to perform well compared to dplyr with respect to speed, but it's 5 digit rounding may be an issue. Improvements may be possible. Perhaps using apply after converting to matrix to increase speed and efficiency.

Combining factor levels in R 3.2.1

In previous versions of R I could combine factor levels that didn't have a "significant" threshold of volume using the following little function:
whittle = function(data, cutoff_val){
#convert to a data frame
tab = as.data.frame.table(table(data))
#returns vector of indices where value is below cutoff_val
idx = which(tab$Freq < cutoff_val)
levels(data)[idx] = "Other"
return(data)
}
This takes in a factor vector, looks for levels that don't appear "often enough" and combines all of those levels into one "Other" factor level. An example of this is as follows:
> sort(table(data$State))
05 27 35 40 54 84 9 AP AU BE BI DI G GP GU GZ HN HR JA JM KE KU L LD LI MH NA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
OU P PL RM SR TB TP TW U VD VI VS WS X ZH 47 BL BS DL M MB NB RP TU 11 DU KA
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
BW ND NS WY AK SD 13 QC 01 BC MT AB HE ID J NO LN NM ON NE VT UT IA MS AO AR ME
4 4 4 4 5 5 6 6 7 7 7 8 8 8 9 10 11 17 23 26 26 30 31 31 38 40 44
OR KS HI NV WI OK KY IN WV AL CO WA MN NH MO SC LA TN AZ IL NC MI GA OH ** CT DE
45 47 48 57 57 64 106 108 112 113 120 125 131 131 135 138 198 200 233 492 511 579 645 646 840 873 1432
RI DC TX MA FL VA MD CA NJ PA NY
1782 2513 6992 7027 10527 11016 11836 12221 15485 16359 34045
Now when I use whittle it returns me the following message:
> delete = whittle(data$State, 1000)
Warning message:
In `levels<-`(`*tmp*`, value = c("Other", "Other", "Other", "Other", :
duplicated levels in factors are deprecated
How can I modify my function so that it has the same effect but doesn't use these "deprecated" factor levels? Converting to a character, tabling, and then converting to the character "Other"?
I've always found it easiest (less typing and less headache) to convert to character and back for these sorts of operations. Keeping with your as.data.frame.table and using replace to do the replacement of the low-frequency levels:
whittle <- function(data, cutoff_val) {
tab = as.data.frame.table(table(data))
factor(replace(as.character(data), data %in% tab$data[tab$Freq < cutoff_val], "Other"))
}
Testing on some sample data:
state <- factor(c("MD", "MD", "MD", "VA", "TX"))
whittle(state, 2)
# [1] MD MD MD Other Other
# Levels: MD Other
I think this verison should work. The levels<- function allows you to collapse by assigning a list (see ?levels).
whittle <- function(data, cutoff_val){
tab <- table(data)
shouldmerge <- tab < cutoff_val
tokeep <- names(tab)[!shouldmerge]
tomerge <- names(tab)[shouldmerge]
nv <- c(as.list(setNames(tokeep,tokeep)), list("Other"=tomerge))
levels(data)<-nv
return(data)
}
And we test it with
set.seed(15)
x<-factor(c(sample(letters[1:10], 100, replace=T), sample(letters[11:13], 10, replace=T)))
table(x)
# x
# a b c d e f g h i j k l m
# 5 11 8 8 7 5 13 14 14 15 2 3 5
y <- whittle(x, 9)
table(y)
# y
# b g h i j Other
# 11 13 14 14 15 43
It's worth adding to this answer that the new forcats package contains the fct_lump() function which is dedicated to this.
Using #MrFlick's data:
x <- factor(c(sample(letters[1:10], 100, replace=T),
sample(letters[11:13], 10, replace=T)))
library(forcats)
library(magrittr) ## for %>% ; could also load dplyr
fct_lump(x, n=5) %>% table
# b g h i j Other
#11 13 14 14 15 43
The n argument specifies the number of most common values to preserve.
Here's another way of doing it by replacing all the items below the threshold with the first and then renaming that level to Other.
whittle <- function(x, thresh) {
belowThresh <- names(which(table(x) < thresh))
x[x %in% belowThresh] <- belowThresh[1]
levels(x)[levels(x) == belowThresh[1]] <- "Other"
factor(x)
}

Add new columns to a data.table containing many variables

I want to add many new columns simultaneously to a data.table based on by-group computations. A working example of my data would look something like this:
Time Stock x1 x2 x3
1: 2014-08-22 A 15 27 34
2: 2014-08-23 A 39 44 29
3: 2014-08-24 A 20 50 5
4: 2014-08-22 B 42 22 43
5: 2014-08-23 B 44 45 12
6: 2014-08-24 B 3 21 2
Now I want to scale and sum many of the variables to get an output like:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57
A brute force implementation of my problem would be:
library(data.table)
set.seed(123)
d <- data.table(Time = rep(seq.Date( Sys.Date(), length=3, by="day" )),
Stock = rep(LETTERS[1:2], each=3 ),
x1 = sample(1:50, 6),
x2 = sample(1:50, 6),
x3 = sample(1:50, 6))
d[,x2_scale:=scale(x2),by=Stock]
d[,x3_scale:=scale(x3),by=Stock]
d[,x2_sum:=sum(x2),by=Stock]
d[,x3_sum:=sum(x3),by=Stock]
Other posts describing a similar issue (Add multiple columns to R data.table in one function call? and Assign multiple columns using := in data.table, by group) suggest the following solution:
d[, c("x2_scale","x3_scale"):=list(scale(x2),scale(x3)), by=Stock]
d[, c("x2_sum","x3_sum"):=list(sum(x2),sum(x3)), by=Stock]
But again, this would get very messy with a lot of variables and also this brings up an error message with scale (but not with sum since this isn't returning a vector).
Is there a more efficient way to achieve the required result (keeping in mind that my actual data set is quite large)?
I think with a small modification to your last code you can easily do both for as many variables you want
vars <- c("x2", "x3") # <- Choose the variable you want to operate on
d[, paste0(vars, "_", "scale") := lapply(.SD, function(x) scale(x)[, 1]), .SDcols = vars, by = Stock]
d[, paste0(vars, "_", "sum") := lapply(.SD, sum), .SDcols = vars, by = Stock]
## Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
## 1: 2014-08-22 A 13 14 32 -1.1338934 1.1323092 87 44
## 2: 2014-08-23 A 25 39 9 0.7559289 -0.3701780 87 44
## 3: 2014-08-24 A 18 34 3 0.3779645 -0.7621312 87 44
## 4: 2014-08-22 B 44 8 6 -0.4730162 -0.7258662 59 32
## 5: 2014-08-23 B 49 3 18 -0.6757374 1.1406469 59 32
## 6: 2014-08-24 B 15 48 8 1.1487535 -0.4147807 59 32
For simple functions (that don't need special treatment like scale) you could easily do something like
vars <- c("x2", "x3") # <- Define the variable you want to operate on
funs <- c("min", "max", "mean", "sum") # <- define your function
for(i in funs){
d[, paste0(vars, "_", i) := lapply(.SD, eval(i)), .SDcols = vars, by = Stock]
}
Another variation using data.table
vars <- c("x2", "x3")
d[, paste0(rep(vars, each=2), "_", c("scale", "sum")) := do.call(`cbind`,
lapply(.SD, function(x) list(scale(x)[,1], sum(x)))), .SDcols=vars, by=Stock]
d
# Time Stock x1 x2 x3 x2_scale x2_sum x3_scale x3_sum
#1: 2014-08-22 A 15 27 34 -1.1175975 121 0.7310560 68
#2: 2014-08-23 A 39 44 29 0.3073393 121 0.4085313 68
#3: 2014-08-24 A 20 50 5 0.8102582 121 -1.1395873 68
#4: 2014-08-22 B 42 22 43 -0.5401315 88 1.1226726 57
#5: 2014-08-23 B 44 45 12 1.1539172 88 -0.3274462 57
#6: 2014-08-24 B 3 21 2 -0.6137858 88 -0.7952265 57
Based on comments from #Arun, you could also do:
cols <- paste0(rep(vars, each=2), "_", c("scale", "sum"))
d[,(cols):= unlist(lapply(.SD, function(x) list(scale(x)[,1L], sum(x))),
rec=F), by=Stock, .SDcols=vars]
You're probably looking for a pure data.table solution, but you could also consider using dplyr here since it works with data.tables as well (no need for conversion). Then, from dplyr you could use the function mutate_all as I do in this example here (with the first data set you showed in your question):
library(dplyr)
dt %>%
group_by(Stock) %>%
mutate_all(funs(sum, scale), x2, x3)
#Source: local data table [6 x 9]
#Groups: Stock
#
# Time Stock x1 x2 x3 x2_sum x3_sum x2_scale x3_scale
#1 2014-08-22 A 15 27 34 121 68 -1.1175975 0.7310560
#2 2014-08-23 A 39 44 29 121 68 0.3073393 0.4085313
#3 2014-08-24 A 20 50 5 121 68 0.8102582 -1.1395873
#4 2014-08-22 B 42 22 43 88 57 -0.5401315 1.1226726
#5 2014-08-23 B 44 45 12 88 57 1.1539172 -0.3274462
#6 2014-08-24 B 3 21 2 88 57 -0.6137858 -0.7952265
You can easily add more functions to be calculated which will create more columns for you. Note that mutate_all applies the function to each column except the grouping variable (Stock) by default. But you can either specify the columns you only want to apply the functions to (which I did in this example) or you can specify which columns you don't want to apply the functions to (that would be, e.g. -c(x2,x3) instead of where I wrote x2, x3).
EDIT: replaced mutate_each above with mutate_all as mutate_each will be deprecated in the near future.
EDIT: cleaner version using functional. I think this is the closest to the dplyr answer.
library(functional)
funs <- list(scale=Compose(scale, c), sum=sum) # See data.table issue #783 on github for the need for this
cols <- paste0("x", 2:3)
cols.all <- outer(cols, names(funs), paste, sep="_")
d[,
c(cols.all) := unlist(lapply(funs, Curry(lapply, X=.SD)), rec=F),
.SDcols=cols,
by=Stock
]
Produces:
Time Stock x1 x2 x3 x2_scale x3_scale x2_sum x3_sum
1: 2014-08-22 A 15 27 34 -1.1175975 0.7310560 121 68
2: 2014-08-23 A 39 44 29 0.3073393 0.4085313 121 68
3: 2014-08-24 A 20 50 5 0.8102582 -1.1395873 121 68
4: 2014-08-22 B 42 22 43 -0.5401315 1.1226726 88 57
5: 2014-08-23 B 44 45 12 1.1539172 -0.3274462 88 57
6: 2014-08-24 B 3 21 2 -0.6137858 -0.7952265 88 57

Resources