Subset dataframe based on the condition in a column of another dataframe - r

I have two data frames where each line represents data from one individual. Lines in the first data frame (which enters the specific analysis of geometric morphometrics) correspond to the lines in the second data frame (additional descriptions of the animals, such as sampling site or sex). I would like to subset the first data frame based on a condition from the second data frame (e.g. select all lines of the first data frame that belong to females, where the sex of the animal is defined in the second data frame). It is possible to do this by adding a new column to the first data frame, subsetting on that column, and removing it again. Is there a more elegant way to do it?
df1
[,1] [,2] [,3] [,4] [,5] [,6]
IMGP6995.JPG -0.07612235 0.08189661 0.020690012 0.07532420 0.05373111 0.07139840
IMGP6997.JPG -0.06759482 0.09449720 0.022907275 0.08807724 0.05953926 0.08256468
IMGP6998.JPG -0.06902234 0.08418980 0.013522385 0.08186618 0.05375763 0.07769076
IMGP6999.JPG -0.07201136 0.08475765 0.009462017 0.08080315 0.06148776 0.07059229
IMGP7001.JPG -0.08112908 0.08485488 0.037193459 0.07971364 0.05834018 0.07917079
IMGP7012.JPG -0.07059829 0.07905529 0.021803102 0.07480276 0.04849282 0.07270644
IMGP7013.JPG -0.07176010 0.08561111 0.009568661 0.08297752 0.06374573 0.08272648
IMGP7014.JPG -0.06751993 0.08895038 0.016800152 0.08799522 0.04776876 0.08100145
IMGP7015.JPG -0.07945826 0.07844136 0.008176800 0.07431915 0.06471417 0.07348312
IMGP7017.JPG -0.06587874 0.09280032 0.010204330 0.09085868 0.05290771 0.08739235
df2
number site m m..evis. m..gonads. sex SL TL AP RP
37 10 KB 1.263 1.003 0.136 F 39.38949 47.72564 NA NA
38 11 KB 4.215 3.510 0.093 F 53.48064 65.29663 NA NA
39 12 KB 3.508 2.997 0.079 F 51.59589 64.76600 NA NA
40 13 KB 3.250 2.752 0.085 F 49.55853 61.74319 NA NA
41 14 KB 3.596 3.149 0.101 F 51.42303 64.79511 NA NA
42 10 KKB 3.257 2.451 0.270 M 55.07909 67.52057 1468.017 598.9462
43 11 KKB 3.493 2.275 0.666 M 54.24882 65.61726 1722.414 757.1050
44 12 KKB 3.066 2.210 0.300 M 53.56323 64.09848 1410.891 638.4123
45 13 KKB 3.294 2.193 0.652 M 51.66717 63.49136 1428.063 651.1915
46 14 KKB 2.803 1.871 0.582 M 50.91185 60.90951 1236.438 660.8433
df1 after subset
[,1] [,2] [,3] [,4] [,5] [,6]
IMGP6995.JPG -0.07612235 0.08189661 0.020690012 0.07532420 0.05373111 0.07139840
IMGP6997.JPG -0.06759482 0.09449720 0.022907275 0.08807724 0.05953926 0.08256468
IMGP6998.JPG -0.06902234 0.08418980 0.013522385 0.08186618 0.05375763 0.07769076
IMGP6999.JPG -0.07201136 0.08475765 0.009462017 0.08080315 0.06148776 0.07059229
IMGP7001.JPG -0.08112908 0.08485488 0.037193459 0.07971364 0.05834018 0.07917079

df1[df2$sex %in% "F", ]
# [,1] [,2] [,3] [,4] [,5] [,6]
# IMGP6995.JPG -0.07612235 0.08189661 0.020690012 0.07532420 0.05373111 0.07139840
# IMGP6997.JPG -0.06759482 0.09449720 0.022907275 0.08807724 0.05953926 0.08256468
# IMGP6998.JPG -0.06902234 0.08418980 0.013522385 0.08186618 0.05375763 0.07769076
# IMGP6999.JPG -0.07201136 0.08475765 0.009462017 0.08080315 0.06148776 0.07059229
# IMGP7001.JPG -0.08112908 0.08485488 0.037193459 0.07971364 0.05834018 0.07917079
Explanation
Your df1 looks like a matrix, not a data.frame, but the solution I provided will also work if df1 is a data frame.
df2$sex %in% "F" tests whether each value of sex matches "F" and returns a logical vector of TRUE and FALSE. You can then use that vector to subset the rows of df1.
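For illustration, the logical vector for the data below looks like this. Since sex contains no NAs here, == behaves the same way, though %in% never returns NA and is therefore a bit safer for subsetting:
df2$sex %in% "F"
# [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
df1[df2$sex == "F", ]  # equivalent in this case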
Data
df1 <- matrix(c(-0.07612235, 0.08189661, 0.020690012, 0.07532420, 0.05373111, 0.07139840,
-0.06759482, 0.09449720, 0.022907275, 0.08807724, 0.05953926, 0.08256468,
-0.06902234, 0.08418980, 0.013522385, 0.08186618, 0.05375763, 0.07769076,
-0.07201136, 0.08475765, 0.009462017, 0.08080315, 0.06148776, 0.07059229,
-0.08112908, 0.08485488, 0.037193459, 0.07971364, 0.05834018, 0.07917079,
-0.07059829, 0.07905529, 0.021803102, 0.07480276, 0.04849282, 0.07270644,
-0.07176010, 0.08561111, 0.009568661, 0.08297752, 0.06374573, 0.08272648,
-0.06751993, 0.08895038, 0.016800152, 0.08799522, 0.04776876, 0.08100145,
-0.07945826, 0.07844136, 0.008176800, 0.07431915, 0.06471417, 0.07348312,
-0.06587874, 0.09280032, 0.010204330, 0.09085868, 0.05290771, 0.08739235),
ncol = 6, byrow = TRUE)
rownames(df1) <- c("IMGP6995.JPG", "IMGP6997.JPG", "IMGP6998.JPG", "IMGP6999.JPG",
"IMGP7001.JPG", "IMGP7012.JPG", "IMGP7013.JPG", "IMGP7014.JPG",
"IMGP7015.JPG", "IMGP7017.JPG")
df2 <- read.table(text = " number site m m..evis. m..gonads. sex SL TL AP RP
37 10 KB 1.263 1.003 0.136 F 39.38949 47.72564 NA NA
38 11 KB 4.215 3.510 0.093 F 53.48064 65.29663 NA NA
39 12 KB 3.508 2.997 0.079 F 51.59589 64.76600 NA NA
40 13 KB 3.250 2.752 0.085 F 49.55853 61.74319 NA NA
41 14 KB 3.596 3.149 0.101 F 51.42303 64.79511 NA NA
42 10 KKB 3.257 2.451 0.270 M 55.07909 67.52057 1468.017 598.9462
43 11 KKB 3.493 2.275 0.666 M 54.24882 65.61726 1722.414 757.1050
44 12 KKB 3.066 2.210 0.300 M 53.56323 64.09848 1410.891 638.4123
45 13 KKB 3.294 2.193 0.652 M 51.66717 63.49136 1428.063 651.1915
46 14 KKB 2.803 1.871 0.582 M 50.91185 60.90951 1236.438 660.8433",
header = TRUE, stringsAsFactors = FALSE)

Related

Matching elements of two vectors based on proximity

I got two vectors:
a<-c(268, 1295, 1788, 2019, 2422)
b<-c(266, 952, 1295, 1791, 2018)
I want to match the elements of b to the elements of a, based on the smallest difference. So a[1] would be matched to b[1].
However, each element can only be matched to a single other element. It is possible that elements cannot be matched. If two elements of b have the smallest difference to the same element in a, then the element with the smaller difference is matched.
For example 952 and 1295 are closest to element a[2], as 1295 is closer (in this case even equal to) a[2] it would get matched with 1295.
The final solution for this particular example should look like this.
268 NA 1295 1788 2019 2422
266 952 1295 1791 2018 NA
Some of the items are not matched, and although it would be possible to match 952 and 2422, the code I need should not consider them a match, because matches were found in between them. The vectors are also strictly increasing.
With my coding capabilities I would use tons of if statements to solve this. But I was wondering whether this is a known problem with established terminology, or whether someone has an idea for an elegant solution.
A base R approach, although probably not the most elegant one:
# aux1: for each element of b (one column each), the distance to the
# closest element of a (row 1) and the index of that element (row 2)
aux1 <- apply(abs(outer(a, b, `-`)), 2, function(r) c(min(r), which.min(r)))
colnames(aux1) <- 1:length(b)
# aux2: for each element of a, the index of the matched element of b
# (ties resolved by the smaller distance; NA if nothing is matched)
aux2 <- tapply(aux1[1, ], factor(aux1[2, ], levels = 1:length(a)),
               function(x) as.numeric(names(which.min(x))))
# bind matched pairs, then append the unmatched elements of b
rbind(cbind(a, b = b[aux2]), cbind(a = NA, b = b[-aux2[!is.na(aux2)]]))
# a b
# [1,] 268 266
# [2,] 1295 1295
# [3,] 1788 1791
# [4,] 2019 2018
# [5,] 2422 NA
# [6,] NA 952
Here aux1 contains, for each element of b, the index of the closest element of a (2nd row) and the corresponding distance (1st row).
aux1
# [,1] [,2] [,3] [,4] [,5]
# [1,] 2 343 0 3 1
# [2,] 1 2 2 3 4
Then aux2 may already be enough for your purposes.
aux2
# 1 2 3 4 5
# 1 3 4 5 NA
aux1 showed some ties, but aux2 now gives which element of b (values) should be assigned to which element of a (names). The last line then binds the remaining unmatched elements.
In a more complex case we have
a <- c(932, 1196, 1503, 2819, 3317, 3845, 4118, 4544)
b <- c(1190, 1498, 2037, 2826, 3323, 4128, 4618, 1190, 1498, 2037, 2826, 3323, 4128, 4618)
# ....
rbind(cbind(a, b = b[aux2]), cbind(a = NA, b = b[-aux2[!is.na(aux2)]]))
# a b
# [1,] 932 NA
# [2,] 1196 1190
# [3,] 1503 1498
# [4,] 2819 2826
# [5,] 3317 3323
# [6,] 3845 NA
# [7,] 4118 4128
# [8,] 4544 4618
# [9,] NA 2037
# [10,] NA 1190
# [11,] NA 1498
# [12,] NA 2037
# [13,] NA 2826
# [14,] NA 3323
# [15,] NA 4128
# [16,] NA 4618

R: how to rewrite my code to run efficiently?

I have a matrix (named rating) with dim n x 140000 and another matrix (named trust) with dim n x n, where n varies with the group and can range from 1 to 15000. For each column of rating, I need to multiply the columns of trust element-wise by that column's entries. For example:
trust= rating=
a1 a2 a3 a4 a5 1 2 3 4 5 6 7 8
b1 b2 b3 b4 b5 2 5 7 8 9 2 1 6
c1 c2 c3 c4 c5 3 5 3 6 8 1 2 5
d1 d2 d3 d4 d5 4 7 8 2 4 5 6 7
e1 e2 e3 e4 e5 5 2 5 7 8 9 1 4
answer1= answer2=
a1.1 a2.2 a3.3 a4.4 a5.5 a1.2 a2.5 a3.5 a4.7 a5.2
b1.1 b2.2 b3.3 b4.4 b5.5 b1.2 b2.5 b3.5 b4.7 b5.2
c1.1 c2.2 c3.3 c4.4 c5.5 c1.2 c2.5 c3.5 c4.7 c5.2
d1.1 d2.2 d3.3 d4.4 d5.5 d1.2 d2.5 d3.5 d4.7 d5.2
e1.1 e2.2 e3.3 e4.4 e5.5 e1.2 e2.5 e3.5 e4.7 e5.2
and answer3 would use the 3rd column, and so on. Then sum the rows of answer1, answer2, ..., storing each result in a vector, and collect the vectors in a list for future use.
LstsumRbyT <- vector("list", ncol(rating))  # initialise the result list
for (k in 1:ncol(rating)) {
  clmy <- as.matrix(rating[, k])
  answer <- sweep(trust, MARGIN = 2, clmy, '*')
  sumtrustbyrating <- rowSums(answer)
  LstsumRbyT[[k]] <- sumtrustbyrating
}
It works perfectly if I restrict ncol(rating) to a small value (about 100). But the actual data has 140000 columns; the loop takes so long that I never get the final result. Please help me improve the performance of my code for this huge data set.
How about a matrix product? Or is that too slow?
rating <- matrix(c(1, 2, 3, 4, 5, 2, 5, 5, 6, 3, 3, 4, 1, 2, 1), ncol = 3)
trust <- matrix(rep(1:5, 5), 5, byrow = TRUE)
Running your code above yields
LstsumRbyT
[[1]]
[1] 55 55 55 55 55
[[2]]
[1] 66 66 66 66 66
[[3]]
[1] 27 27 27 27 27
which is the same as
trust %*% rating
[,1] [,2] [,3]
[1,] 55 66 27
[2,] 55 66 27
[3,] 55 66 27
[4,] 55 66 27
[5,] 55 66 27
If this isn't fast enough, it could be improved a bit with RcppArmadillo, I guess.
To add to the benchmarking discussion: if your for loop above is wrapped in a function f(), then I get
microbenchmark(trust %*% rating, f())
Unit: microseconds
expr min lq mean median uq max neval cld
trust %*% rating 1.418 1.7010 2.97663 2.7215 3.5965 14.452 100 a
f() 593.890 700.9775 764.00515 766.5535 792.6375 1511.104 100 b
which is quite a substantial speedup with the normal matrix product.
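If you need the result in the same list form as LstsumRbyT, the columns of the product can be split afterwards. A minimal sketch (assuming the full n x 140000 product fits in memory):
res <- trust %*% rating  # one matrix product instead of 140000 sweeps
LstsumRbyT <- lapply(seq_len(ncol(res)), function(k) res[, k])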
I would vectorize everything:
library(data.table)
set.seed(666)  # in order to have reproducible results
n <- 10        # number of cols and rows
(trust <- matrix(runif(n * n), ncol = n, nrow = n))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 0.77436849 0.77589308 0.98422408 0.4697785 0.2444375 0.06913359 0.7748744 0.60379428 0.7659585 0.13247078
[2,] 0.19722419 0.01637905 0.60134555 0.3976166 0.5309707 0.08462063 0.8120639 0.32826395 0.7758464 0.07851311
[3,] 0.97801384 0.09574478 0.03834435 0.8046367 0.1183959 0.12994557 0.2606025 0.66611781 0.3125150 0.37822385
[4,] 0.20132735 0.14216354 0.14149569 0.5088974 0.9833834 0.74613202 0.6515950 0.87478750 0.8422173 0.57962476
[5,] 0.36124443 0.21112624 0.80638553 0.6349154 0.8977528 0.03887918 0.9238039 0.06887527 0.3141499 0.53642512
[6,] 0.74261194 0.81125644 0.26668568 0.4942517 0.7385738 0.68563542 0.2661061 0.79346301 0.7565639 0.10853192
[7,] 0.97872844 0.03654720 0.04270205 0.2801309 0.3773107 0.14397736 0.2661330 0.57142701 0.9675244 0.74031515
[8,] 0.49811371 0.89163741 0.61217452 0.9087104 0.6061688 0.89107996 0.9109179 0.04894407 0.1694229 0.45178964
[9,] 0.01331584 0.48323641 0.55334840 0.7841162 0.5121943 0.08963612 0.5905635 0.98035135 0.6968752 0.64610821
[10,] 0.25994613 0.46666453 0.85350077 0.5589970 0.9892467 0.03773272 0.9181476 0.91453735 0.8726508 0.74929873
(rating <- matrix(sample(n * n), ncol = n, nrow = n))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] 58 19 13 25 23 96 38 100 47 93
[2,] 37 22 45 41 4 18 52 83 89 39
[3,] 87 36 15 40 94 11 31 63 35 10
[4,] 59 88 81 64 68 27 92 56 49 46
[5,] 24 90 8 44 43 82 14 57 79 66
[6,] 95 74 48 70 7 33 34 42 60 50
[7,] 26 65 73 61 32 12 97 98 9 69
[8,] 21 86 1 99 6 72 75 20 71 62
[9,] 29 85 55 30 53 80 77 2 28 51
[10,] 67 91 76 16 5 3 84 54 78 17
A function:
prod1 <- function(m1, m2) {
  res <- NULL
  if (dim(m1)[1] == dim(m2)[1])
    res <- rbindlist(data.table(rbindlist(data.table(
      lapply(seq_along(1:nrow(m2)), function(y) {
        lapply(seq_along(1:nrow(m1)[1]), function(x) m1[, x] * m2[y, x])
      })))$V1))
  return(res)
}
will produce the following (note that the sequence of arguments DOES matter):
(answer1 <- prod1(trust, rating))
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1: 44.9133724 14.7419685 12.7949130 11.744463 5.622062 6.636824 29.445226 60.379428 36.000049 12.319782
2: 11.4390031 0.3112020 7.8174921 9.940414 12.212325 8.123580 30.858427 32.826395 36.464780 7.301719
3: 56.7248030 1.8191509 0.4984765 20.115918 2.723107 12.474775 9.902897 66.611781 14.688207 35.174818
4: 11.6769863 2.7011073 1.8394440 12.722435 22.617819 71.628674 24.760610 87.478750 39.584213 53.905103
5: 20.9521768 4.0113985 10.4830118 15.872884 20.648315 3.732401 35.104546 6.887527 14.765046 49.887537
6: 43.0714926 15.4138724 3.4669138 12.356293 16.987197 65.821000 10.112033 79.346301 35.558503 10.093469
7: 56.7662495 0.6943967 0.5551267 7.003272 8.678146 13.821827 10.113054 57.142701 45.473646 68.849309
8: 28.8905951 16.9411108 7.9582688 22.717759 13.941883 85.543676 34.614880 4.894407 7.962877 42.016436
9: 0.7723185 9.1814918 7.1935292 19.602904 11.780468 8.605067 22.441414 98.035135 32.753133 60.088064
10: 15.0768755 8.8666260 11.0955099 13.974926 22.752673 3.622341 34.889611 91.453735 41.014587 69.684782
Finally, answer2 is given by the function
prod2 <- function(m1, m2) {
  res <- NULL
  if (dim(m1)[1] == dim(m2)[1])
    res <- rbindlist(data.table(rbindlist(data.table(
      lapply(seq_along(2:nrow(m2)), function(y) {
        lapply(seq_along(2:nrow(m1)[1]), function(x) m1[, x] * m2[y, x + 1])
      })))$V1))
  return(res)
}
and in particular answer2 <- prod2(trust, rating), yielding:
V1 V2 V3 V4 V5 V6 V7 V8 V9
1: 14.7130013 10.0866100 24.6056020 10.804906 23.46600 2.627076 77.48744 28.378331 71.23414
2: 3.7472596 0.2129277 15.0336387 9.145181 50.97318 3.215584 81.20639 15.428406 72.15371
3: 18.5822630 1.2446822 0.9586087 18.506645 11.36601 4.937932 26.06025 31.307537 29.06390
4: 3.8252197 1.8481260 3.5373923 11.704640 94.40481 28.353017 65.15950 41.115012 78.32621
5: 6.8636441 2.7446411 20.1596381 14.603053 86.18427 1.477409 92.38039 3.237138 29.21594
6: 14.1096269 10.5463338 6.6671419 11.367790 70.90308 26.054146 26.61061 37.292761 70.36044
7: 18.5958403 0.4751135 1.0675513 6.443011 36.22183 5.471140 26.61330 26.857069 89.97977
8: 9.4641605 11.5912864 15.3043631 20.900338 58.19221 33.861038 91.09179 2.300371 15.75633
9: 0.2530009 6.2820733 13.8337100 18.034672 49.17065 3.406172 59.05635 46.076514 64.80939
10: 4.9389764 6.0666389 21.3375191 12.856932 94.96768 1.433843 91.81476 42.983255 81.15652
Benchmarking
library(microbenchmark)
library("ggplot2")
set.seed(666)
global_func <- function(n) {
  trust <- matrix(runif(n * n), ncol = n, nrow = n)
  rating <- matrix(sample(n * n), ncol = n, nrow = n)
  prod1 <- function(m1, m2) {
    res <- NULL
    if (dim(m1)[1] == dim(m2)[1])
      res <- rbindlist(data.table(rbindlist(data.table(
        lapply(seq_along(1:nrow(m2)), function(y) {
          lapply(seq_along(1:nrow(m1)[1]), function(x) m1[, x] * m2[y, x])
        })))$V1))
    return(res)
  }
  prod2 <- function(m1, m2) {
    res <- NULL
    if (dim(m1)[1] == dim(m2)[1])
      res <- rbindlist(data.table(rbindlist(data.table(
        lapply(seq_along(2:nrow(m2)), function(y) {
          lapply(seq_along(2:nrow(m1)[1]), function(x) m1[, x] * m2[y, x + 1])
        })))$V1))
    return(res)
  }
  return(list(prod1(trust, rating), prod2(trust, rating)))
}
Let's compare run times against the number of cols/rows (n). Use with caution:
tm <- microbenchmark(global_func(10),
                     global_func(50),
                     global_func(100),
                     global_func(500),
                     times = 100)
autoplot(tm)

Outputting percentiles by filtering a data frame

Note that, as requested in the comments, this question has been revised.
Consider the following example:
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
I would like to, for each value of FILTER, create a data frame which contains the 1st, 2nd, ..., 99th percentiles of VALUE. The final product should be
PERCENTILE df_1 df_2 ... df_10
1 [first percentiles]
2 [second percentiles]
etc., where df_i is based on FILTER == i.
Note that FILTER, although it contains numbers, is actually categorical.
The way I have been doing this is by using dplyr:
nums <- 1:10
library(dplyr)
for (i in nums) {
  df_temp <- filter(df, FILTER == i)$VALUE
  assign(paste0("df_", i), quantile(df_temp, probs = (1:99)/100))
}
and then I would have to cbind these (with 1:99 in the first column), but I would rather not type in every single df name. I have considered using a loop on the names of these data frames, but this would involve using eval(parse()).
Here's a basic outline of a possibly smoother approach. I have not included every single aspect of your desired output, but the modification should be fairly straightforward.
df <- data.frame(FILTER = rep(1:10, each = 10), VALUE = 1:100)
df_s <- lapply(split(df, df$FILTER),
               FUN = function(x) quantile(x$VALUE, probs = c(0.25, 0.5, 0.75)))
out <- do.call(cbind, df_s)
colnames(out) <- paste0("df_", colnames(out))
> out
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
25% 3.25 13.25 23.25 33.25 43.25 53.25 63.25 73.25 83.25 93.25
50% 5.50 15.50 25.50 35.50 45.50 55.50 65.50 75.50 85.50 95.50
75% 7.75 17.75 27.75 37.75 47.75 57.75 67.75 77.75 87.75 97.75
I did this for just 3 quantiles to keep things simple, but it obviously extends. And you can add the 1:99 column afterwards as well.
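For the full output described in the question, the same approach extends directly. A sketch using all 99 percentiles plus a PERCENTILE column:
df_s <- lapply(split(df, df$FILTER),
               FUN = function(x) quantile(x$VALUE, probs = (1:99)/100))
out <- data.frame(PERCENTILE = 1:99, do.call(cbind, df_s))
names(out)[-1] <- paste0("df_", names(df_s))  # split() names the list by FILTER level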
I suggest that you use a list.
list_of_dfs <- list()
nums <- 1:10
for (i in nums) {
  list_of_dfs[[i]] <- nums * i  # placeholder data; your quantile vectors would go here
}
df <- data.frame(do.call("cbind", list_of_dfs))
colnames(df) <- paste0("df_", 1:10)
You'll get the result you want:
df_1 df_2 df_3 df_4 df_5 df_6 df_7 df_8 df_9 df_10
1 1 2 3 4 5 6 7 8 9 10
2 2 4 6 8 10 12 14 16 18 20
3 3 6 9 12 15 18 21 24 27 30
4 4 8 12 16 20 24 28 32 36 40
5 5 10 15 20 25 30 35 40 45 50
6 6 12 18 24 30 36 42 48 54 60
7 7 14 21 28 35 42 49 56 63 70
8 8 16 24 32 40 48 56 64 72 80
9 9 18 27 36 45 54 63 72 81 90
10 10 20 30 40 50 60 70 80 90 100
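Applied to the actual quantile problem, the list idea might look like this (a sketch, using the original df with FILTER and VALUE from the question; note that the toy code above reuses the name df):
list_of_dfs <- lapply(1:10, function(i)
  quantile(df$VALUE[df$FILTER == i], probs = (1:99)/100))
out <- data.frame(PERCENTILE = 1:99, do.call(cbind, list_of_dfs))
names(out)[-1] <- paste0("df_", 1:10)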
How about using get?
df <- data.frame(1:99)  # dummy first column with one row per percentile
for (i in nums) {
  df <- cbind(df, get(paste0("df_", i)))
}
# get rid of the first, useless column
df <- df[, -1]
# get names
names(df) <- paste0("df_", nums)
df

mean and standard deviation by group for multiple variables [duplicate]

This question already has answers here:
plyr package writing the same function over multiple columns
(2 answers)
Closed 9 years ago.
I am sure this question has been answered before, but I would like to calculate the mean and sd by treatment for multiple variables (100s of them) all at once, and I cannot figure out how to do it other than with a long-winded ddply call.
This is a portion of my dataframe (g):
trt blk til res sand silt clay ibd1_6 ibd9_14 ibd_ave
1 CTK 1 CT K 74 15 11 1.323 1.593 1.458
2 CTK 2 CT K 71 15 14 1.575 1.601 1.588
3 CTK 3 CT K 72 14 14 1.551 1.594 1.573
4 CTR 1 CT R 72 15 13 1.560 1.647 1.604
5 CTR 2 CT R 73 14 13 1.612 1.580 1.596
6 CTR 3 CT R 73 13 14 1.709 1.577 1.643
7 ZTK 1 ZT K 72 16 12 1.526 1.546 1.536
8 ZTK 2 ZT K 71 16 13 1.292 1.626 1.459
9 ZTK 3 ZT K 71 17 12 1.623 1.607 1.615
10 ZTR 1 ZT R 66 16 18 1.719 1.709 1.714
11 ZTR 2 ZT R 67 17 16 1.529 1.708 1.618
12 ZTR 3 ZT R 66 17 17 1.663 1.655 1.659
I would like to have a function that does what ddply does, i.e. ddply(g, trt, meanSand = mean(sand), sdSand = sd(sand), meanSilt = mean(silt), ...) without having to write it all out. Any ideas? Thank you for your patience!
The function you will likely want to apply to your dataframe is aggregate() with either mean or sd as the function parameter.
assuming myDF is your original dataset:
library(data.table)
myDT <- data.table(myDF)
# Which variables to calculate? Here, all columns but the first five:
variables <- tail(names(myDT), -5)
myDT[, lapply(.SD, function(x) list(mean(x), sd(x))), .SDcols=variables, by=list(trt, til)]
## OR Separately, if you prefer shorter `lapply` statements
myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
> myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 14.66667 13.00000 1.483000 1.596000 1.539667
# 2: CTR CT 14.00000 13.33333 1.627000 1.601333 1.614333
# 3: ZTK ZT 16.33333 12.33333 1.480333 1.593000 1.536667
# 4: ZTR ZT 16.66667 17.00000 1.637000 1.690667 1.663667
> myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 0.5773503 1.7320508 0.13908271 0.004358899 0.07112196
# 2: CTR CT 1.0000000 0.5773503 0.07562407 0.039576929 0.02514624
# 3: ZTK ZT 0.5773503 0.5773503 0.17015973 0.041797129 0.07800214
# 4: ZTR ZT 0.5773503 1.0000000 0.09763196 0.030892286 0.04816984
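For comparison, the same summary can be written with current dplyr (version >= 1.0, which postdates this question) via across(); a sketch under that assumption:
library(dplyr)
g %>%
  group_by(trt, til) %>%
  summarise(across(silt:ibd_ave, list(mean = mean, sd = sd)), .groups = "drop")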
aggregate(g[, c("sand", "silt", "clay")], g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )
Using an anonymous function with aggregate.data.frame allows one to get both values with one call. You only want to pass in the columns to be aggregated.If you had a long list of columns and only wanted to exclude let's say the first 4 from calculations, it could be written as:
aggregate(g[, names(g)[-(1:4)], g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )
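One caveat: when the function passed to aggregate() returns a vector, the resulting summary columns are stored as matrix columns. A sketch of flattening them into ordinary columns, should that matter downstream:
res <- aggregate(g[, c("sand", "silt", "clay")], by = list(trt = g$trt),
                 function(x) c(mean = mean(x), sd = sd(x)))
res_flat <- do.call(data.frame, res)  # expands e.g. sand into sand.mean and sand.sd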

vectorize this for loop (current row is dependent on row above)

Suppose I want to create n=3 random walk paths (pathlength = 100) given a pre-generated matrix (100x3) of plus/minus ones. The first path will start at 10, the second at 20, the third at 30:
set.seed(123)
given.rand.matrix <- replicate(3, sign(rnorm(100)))
path <- matrix(NA, 101, 3)
path[1, ] <- c(10, 20, 30)
for (j in 2:101) {
  path[j, ] <- path[j - 1, ] + given.rand.matrix[j - 1, ]
}
The end values (given the seed and rand matrix) are 14, 6, 34... which is the desired result... but...
Question: Is there a way to vectorize the for loop? The problem is that the path matrix is not yet fully populated when calculating. Thus, replacing the loop with
path[2:101, ] <- path[1:100, ] + given.rand.matrix
returns mostly NAs. I just want to know if this type of for loop is avoidable in R.
Thank you very much in advance.
Definitely vectorizable: Skip the initialization of path, and use cumsum over the matrix:
path <- apply(rbind(c(10, 20, 30), given.rand.matrix), 2, cumsum)
> head(path)
[,1] [,2] [,3]
[1,] 10 20 30
[2,] 9 19 31
[3,] 8 20 32
[4,] 9 19 31
[5,] 10 18 32
[6,] 11 17 31
> tail(path)
[,1] [,2] [,3]
[96,] 15 7 31
[97,] 14 8 32
[98,] 15 9 33
[99,] 16 8 32
[100,] 15 7 33
[101,] 14 6 34
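If the recurrence were anything other than a plain cumulative sum (so that cumsum no longer applies), Reduce() with accumulate = TRUE is a general, if slower, base R alternative. A sketch reproducing the same walk (asplit() needs R >= 3.6):
steps <- asplit(given.rand.matrix, 1)  # list of 100 step vectors
path2 <- Reduce(function(prev, step) prev + step, steps,
                init = c(10, 20, 30), accumulate = TRUE)
path2 <- do.call(rbind, path2)         # 101 x 3 matrix, identical to path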
