Iterative wilcox.test on a data.frame in R

I am trying (or rather, wishing I could try) to write a loop in R that runs the Wilcoxon test (wilcox.test) iteratively, comparing two groups of values in each row of a data.frame and returning, for each row, the p-value, which is then put in a data.frame together with its row label.
The data.frame is as follows:
> tab[1:5,]
mol E12 E15 E22 E25 E26 E27 E38 E44 E47
1 A 7362.40 2475.93 3886.06 5825.59 6882.00 3250.05 3406.65 6416.29 7786.73
2 B 5391.42 2037.88 3330.05 4043.83 5766.20 2591.69 3603.95 14431.89 8320.70
3 C 1195.89 241.24 252.46 865.97 1970.28 899.22 346.36 1135.86 1179.31
4 D 502.64 171.41 434.29 508.22 419.34 260.13 298.14 326.70 167.07
5 E 181.63 171.41 165.30 150.47 164.09 109.19 122.76 212.74 155.60
Column labels are: mol, the specific molecule evaluated (about 20 in total); E12 to E47, the samples for which the value of each molecule is measured.
Groups to be compared are:
P: samples E12, E25, E26, E27, E44. D: samples E15, E22, E38, E47.
The output should look like this:
mol p-value
A 1
B 0.5556
C 0.9048
etc.
I tried to use a for loop, but I am absolutely not able to manage it in this, for me complicated, context.
Any help with comments on the meaning of the instructions for a newbie like me is much appreciated.

apply() works like a loop over matrices and arrays. Here, with MARGIN = 1, it iterates along the rows. Each row, temporarily converted into a vector x, is passed to function(x) wilcox.test(x[P], x[D])$p.value, the result being one p-value per row. P and D are logical vectors specifying which elements of x belong to each group.
tab0 <- read.table(text="mol E12 E15 E22 E25 E26 E27 E38 E44 E47
A 7362.40 2475.93 3886.06 5825.59 6882.00 3250.05 3406.65 6416.29 7786.73
B 5391.42 2037.88 3330.05 4043.83 5766.20 2591.69 3603.95 14431.89 8320.70
C 1195.89 241.24 252.46 865.97 1970.28 899.22 346.36 1135.86 1179.31
D 502.64 171.41 434.29 508.22 419.34 260.13 298.14 326.70 167.07
E 181.63 171.41 165.30 150.47 164.09 109.19 122.76 212.74 155.60",
header=TRUE)
tab <- as.matrix(tab0[,-1])                                    # numeric matrix, without the mol column
P <- colnames(tab) %in% c("E12", "E25", "E26", "E27", "E44")   # logical mask for group P
D <- colnames(tab) %in% c("E15", "E22", "E38", "E47")          # logical mask for group D
pv <- apply(tab, 1, function(x) wilcox.test(x[P], x[D])$p.value)   # one p-value per row
data.frame(tab0[1], p.val=signif(pv, 4))
# mol p.val
# 1 A 0.5556
# 2 B 0.4127
# 3 C 0.1111
# 4 D 0.1905
# 5 E 0.9048
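For completeness, since the question mentions struggling with a for loop: the apply() call above is equivalent to the loop sketched below (assuming the tab, P and D objects already defined), which may be easier to follow line by line.
pv <- numeric(nrow(tab))                    # pre-allocate one slot per row
for (i in 1:nrow(tab)) {
  x <- tab[i, ]                             # current row as a named numeric vector
  pv[i] <- wilcox.test(x[P], x[D])$p.value  # compare group P against group D
}
data.frame(tab0[1], p.val = signif(pv, 4))  # same output as the apply() version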

Related

Slicing long form R data into multiple columns

I've been given some data that I've combined into long form, but I need to get it into a certain format for a deliverable. I've tinkered with dataframe and list options and cannot seem to find a way to get the data I have into the output form I need. Any thoughts and solutions are appreciated.
If the desired output form seems odd for R, it is because other people will open the resulting data in Excel for additional study. So I will save the final data as a csv or Excel file. The full data in the desired form will have 40 rows (+header) and 110 columns (55 student and score pairs).
Here is example code for my long form data:
class student score
1 a 0.4977
1 b 0.7176
1 c 0.9919
1 d 0.3800
1 e 0.7774
2 f 0.9347
2 g 0.2121
2 h 0.6517
2 i 0.1256
2 j 0.2672
3 k 0.3861
3 l 0.0134
3 m 0.3824
3 n 0.8697
3 o 0.3403
Here is an example of how I need the final data to appear:
class_1_student class_1_score class_2_student class_2_score class_3_student class_3_score
a 0.4977 f 0.9347 k 0.3861
b 0.7176 g 0.2121 l 0.0134
c 0.9919 h 0.6517 m 0.3824
d 0.3800 i 0.1256 n 0.8697
e 0.7774 j 0.2672 o 0.3403
Here is R code to generate the sample long form and desired form data:
set.seed(1)
d <- data.frame(
  class = c(rep(1, 5), rep(2, 5), rep(3, 5)),
  student = c(letters[1:5], letters[6:10], letters[11:15]),
  score = round(runif(15, 0, 1), 4)
)
d2 <- data.frame(
  class_1_student = d[1:5, 2],
  class_1_score = d[1:5, 3],
  class_2_student = d[6:10, 2],
  class_2_score = d[6:10, 3],
  class_3_student = d[11:15, 2],
  class_3_score = d[11:15, 3]
)
If it's helpful, I also have the student and score data in separate matrices (1 row per student and 1 column per class) that I could use to help generate the final data.
You can just split the data:
library(tidyverse)
split(select(d, -class), d$class) %>%
  imap(~ setNames(.x, str_c("class", .y, names(.x), sep = "_"))) %>%
  bind_cols()
Column binding will work only if the groups are of equal sizes.
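If your real groups might be unequal, one possible alternative is a tidyr reshape; a sketch (still with library(tidyverse) loaded; names_glue and names_vary need a reasonably recent tidyr):
d %>%
  group_by(class) %>%
  mutate(row = row_number()) %>%   # within-class row id to align the rows
  ungroup() %>%
  pivot_wider(id_cols = row, names_from = class,
              values_from = c(student, score),
              names_glue = "class_{class}_{.value}",
              names_vary = "slowest") %>%   # keeps each student/score pair adjacent
  select(-row)
Unequal groups then come out padded with NA rather than failing.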

Plot In R with Multiple Lines Based On A Particular Variable?

I have an accelerometer dataset and, let's say, some number n of observations for each subject (30 subjects total) of body acceleration over time.
I want to make a plot that shows these body-acceleration points for each subject in a different color on the y axis, with the x axis just an index. I tried this:
ggplot(data = filtered_data_walk,
       aes(x = seq_along(filtered_data_walk$'body-acceleration-mean-y-time'),
           y = filtered_data_walk$'body-acceleration-mean-y-time')) +
  geom_line(aes(color = filtered_data_walk$subject))
But the problem is that it doesn't superimpose the 30 lines; instead, they run alongside each other. In other words, I end up with n1 + n2 + n3 + ... + n30 x-index points, instead of max{n1, n2, ..., n30}. This is my first time posting, so I hope this makes sense (I know my formatting is bad).
One solution I thought of was to create a new variable which gives a value of 1 to n for all the observations of each subject. So, for example, if I had 6 observations for subject1, 4 observations for subject2, and 9 observations for subject3, this new variable would be sequenced like:
1 2 3 4 5 6 1 2 3 4 1 2 3 4 5 6 7 8 9
Is there an easy way to do this? Please help, ty.
Assuming your data is formatted as a data.frame or matrix, for a toy dataset like
x <- data.frame(replicate(5, rnorm(10)))
x
# X1 X2 X3 X4 X5
# 1 -1.36452272 -1.46446475 2.0444381 0.001585876 -1.1085990
# 2 -1.41303046 -0.14690269 1.6179084 -0.310162018 -1.5528733
# 3 -0.15319554 -0.18779791 -0.3005058 0.351619212 1.6282955
# 4 -0.38712167 -0.14867239 -1.0776359 0.106694311 -0.7065382
# 5 -0.50711166 -0.95992916 1.3522922 1.437085757 -0.7921355
# 6 -0.82377208 0.50423328 -0.5366513 -1.315263679 1.0604499
# 7 -0.01462037 -1.15213287 0.9910678 0.372623508 1.9002438
# 8 1.49721113 -0.84914197 0.2422053 0.337141898 1.2405208
# 9 1.95914245 -1.43041783 0.2190829 -1.797396822 0.4970690
# 10 -1.75726827 -0.04123615 -0.1660454 -1.071688768 -0.3331887
...you might be able to get there with something like
plot(x[,1], type='l', xlim=c(1, nrow(x)), ylim=c(min(x), max(x)))
for(i in 2:ncol(x)) lines(x[,i], col=i)
You could play with formatting some more, of course, do things with lty= and lwd= and maybe a color ramp of your own choosing, etc.
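For instance, one sketch with a built-in qualitative palette and heavier lines (hcl.colors() needs R >= 3.6):
cols <- hcl.colors(ncol(x), "Dark 3")   # one distinct color per column
plot(x[,1], type='l', col=cols[1], lwd=2,
     xlim=c(1, nrow(x)), ylim=range(x))
for(i in 2:ncol(x)) lines(x[,i], col=cols[i], lwd=2)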
If your data is in the format below...
x <- data.frame(id=c("A","A","A","B","B","B","B","C","C"), acc=rnorm(9))
x
# id acc
# 1 A 0.1796964
# 2 A 0.8770237
# 3 A -2.4413527
# 4 B 0.9379746
# 5 B -0.3416141
# 6 B -0.2921062
# 7 B 0.1440221
# 8 C -0.3248310
# 9 C -0.1058267
...you could get there with
maxn <- max(with(x, tapply(acc, id, length)))
ids <- sort(unique(x$id))
plot(x$acc[x$id==ids[1]], type='l', xlim=c(1,maxn), ylim=c(min(x$acc),max(x$acc)))
for(i in 2:length(ids)) lines(x$acc[x$id==ids[i]], col=i)
Hope this helps, and that I interpreted your problem right--
That's pretty quick to do if you are OK with using dplyr: group_by to enforce a separate counter for each subject, mutate to add the actual counter, and your ggplot should work. Example with the iris dataset:
library(dplyr)
library(ggplot2)
group_by(iris, Species) %>%
  mutate(index = seq_along(Petal.Length)) %>%
  ggplot() + geom_line(aes(x = index, y = Petal.Length, color = Species))
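If you prefer base R, ave() builds the same per-subject counter; a sketch using the id/acc data frame x from the earlier answer:
# seq_along() is applied separately within each id group
x$index <- ave(seq_along(x$id), x$id, FUN = seq_along)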

How to maintain the order of elements of a row when using the by and rbind functions in R?

I have written a function which takes a subset of the data based on the value of the name column. It computes the outlier limit for the column "mark" and replaces all the outliers.
However, when I try to combine these different subsets, the order of my elements changes. Is there any way by which I can maintain the order of my elements in the column "mark"?
My data set is:
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0
The function which I have written is:
data.frame(do.call("rbind", as.list(by(data, data$name,
  function(x) {
    apply(x[, .(mark)], 2, function(y) {
      y[y > (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <-
        (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))
      y
    })
  }))))
The result of the above function is the first column below (I've manually added back name for illustrative purposes):
mark NAME
100.000 ----- A
50.000 ----- A
210.000 ----- A
0.500 ----- B
90.000 ----- B
839.625 ----- B
100.000 ----- C
1200.000 ----- C
4875.000 ----- C
In the above result, the order of the values in the mark column is changed. Is there any way by which I can maintain the order of the elements?
Are you sure that code is doing what you think it is?
It looks like you're replacing any value greater than the median (the third value returned by quantile) plus 1.5*IQR with that same limit, the median + 1.5*IQR. Maybe that's what you intend, I don't know. The bigger problem is that you're doing that in an apply function, so it's going to re-calculate that median and IQR each iteration, updated with the previous rows already having been changed. I'd wager that's not what you intend, but I suppose I've seen stranger.
A better option might be to create an external function to do the work, which takes in all of the data, does the calculation, then outputs all the data. I like dplyr for this simply because it's clean.
Reading your data in (why the "----"?)
scores <- read.table(text="
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0", header=TRUE)
and creating a function that does something a little more sensible: it replaces any value greater than the 75% quantile (referenced by name so you know what it is), or less than the 25% quantile, with that limiting value:
scale_outliers <- function(data) {
  lim <- quantile(data, na.rm = TRUE)
  data[data > lim["75%"]] <- lim["75%"]
  data[data < lim["25%"]] <- lim["25%"]
  return(data)
}
Chaining this processing into dplyr::mutate is neat, and the result can then be passed on to ggplot. Here's the original data:
library(dplyr)
library(ggplot2)
gg1 <- scores %>% ggplot(aes(x = name, y = mark))
gg1 <- gg1 + geom_point() + geom_boxplot() + coord_cartesian(ylim = range(scores$mark))
gg1
And if we alter it with the new function, we get the data back without the rows shuffled around:
scores %>% mutate(new_mark = scale_outliers(mark))
#> name mark new_mark
#> 1 A 100.0 100
#> 2 B 0.5 90
#> 3 C 100.0 100
#> 4 A 50.0 90
#> 5 B 90.0 90
#> 6 B 1000.0 1000
#> 7 C 1200.0 1000
#> 8 C 5000.0 1000
#> 9 A 210.0 210
and we can plot that,
gg2 <- scores %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg2 <- gg2 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg2
Best of all, if you now want to do that quantile comparison group-wise (say, by the name column), it's as easy as adding dplyr::group_by(name):
gg3 <- scores %>% group_by(name) %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg3 <- gg3 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg3
A slightly refactored version of Hack-R's answer -- you can add an index to your data.table:
library(data.table)
data <- data.table(name = c("A","B","C","A","B","B","C","C","A"),
                   mark = c(100, 0.5, 100, 50, 90, 1000, 1200, 5000, 210))
data[, i := .I]
Then you perform your calculation but you keep the name and i:
df <- data.frame(do.call("rbind", as.list(
  by(data, data$name,
     function(x) cbind(i = x$i,
                       name = x$name,
                       apply(x[, .(mark)], 2, function(y) {
                         y[y > (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <-
                           (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))
                         y
                       })
     )))))
And finally you order using the index:
df[order(df$i),]
i name mark
1 1 A 100
4 2 B 0.5
7 3 C 100
2 4 A 50
5 5 B 90
6 6 B 839.625
8 7 C 1200
9 8 C 4875
3 9 A 210

Passing a character vector as arguments to a function in plyr

I suspect I'm Doing It Wrong, but I'd like to pass a character vector as an argument to a function in ddply. There's a lot of Q&A on removing quotes, etc. but none of it seems to work for me (eg. Remove quotes from a character vector in R and http://r.789695.n4.nabble.com/Pass-character-vector-to-function-argument-td3045226.html).
# reproducible data
df1<-data.frame(a=sample(1:50,10),b=sample(1:50,10),c=sample(1:50,10),d=(c("a","b","c","a","a","b","b","a","c","d")))
df2<-data.frame(a=sample(1:50,9),b=sample(1:50,9),c=sample(1:50,9),d=(c("e","f","g","e","e","f","f","e","g")))
df3<-data.frame(a=sample(1:50,8),b=sample(1:50,8),c=sample(1:50,8),d=(c("h","i","j","h","h","i","i","h")))
#make a list
list.1<-list(df1=df1,df2=df2,df3=df3)
# desired output
lapply(list.1, function(x) ddply(x, .(d), function(x) data.frame(am=mean(x$a), bm=mean(x$b), cm=mean(x$c))))
$df1
d am bm cm
1 a 31.00000 29.25000 18.50000
2 b 31.66667 24.33333 34.66667
3 c 18.50000 5.50000 24.50000
4 d 36.00000 39.00000 43.00000
$df2
d am bm cm
1 e 18.25000 32.50000 18
2 f 27.66667 41.33333 24
3 g 25.00000 7.50000 42
$df3
d am bm cm
1 h 36.00000 25.00000 20.50000
2 i 25.33333 37.33333 24.33333
3 j 32.00000 32.00000 46.00000
But my actual use-case has many new columns and different types of calculations that I want to calculate in the ddply function. So I want to do something like:
# here's a simple version of a function that I want to send to ddply
func <- "am=mean(x$a), bm=mean(x$b), cm=mean(x$c)"
# here's how I imagine it might work
lapply(list.1, function(x) ddply(x, .(d), function(x) data.frame(func)) )
# not the desired outcome...
$df1
d func
1 a am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 b am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 c am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
4 d am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
$df2
d func
1 e am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 f am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 g am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
$df3
d func
1 h am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
2 i am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
3 j am=mean(x$a), bm=mean(x$b), cm=mean(x$c)
I've tried noquote, deparse, eval(as.symbol()), do.call(data.frame, ...) and some of the methods here: https://github.com/hadley/devtools/wiki/Evaluation on func to no avail. The solution might be obvious at this point (ie. melt everything!), but in case it's not, here's a longer example that's closer to my use case:
# sample data
s <- 23 # number of samples
r <- 10 # number of runs per sample
el <- 17 # number of elements
mydata <- data.frame(ID = unlist(lapply(LETTERS[1:s], function(x) rep(x, r))),
run = rep(1:r, s))
# insert fake element data
mydata[letters[1:el]] <- lapply(1:el, function(i) rnorm(s*r, runif(1)*i^2))
# generate all combinations of 5 runs from ten runs
su <- 5 # number of runs to sample from ten runs
idx <- combn(unique(mydata$run), su)
# RSE function
RSE <- function(x) {100*( (sd(x)/sqrt(length(x)))/mean(x) )}
# make a list of dfs for all samples for each combination of five runs
# to prepare to calculate RSEs
combys1 <- lapply(1:ncol(idx), function(i) mydata[mydata$run %in% idx[,i],] )
# make a list of dfs with RSE for each ID, for each combination of runs
combys2 <- lapply(1:length(combys1), function(i) ddply(combys1[[i]], "ID", summarise, RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c)))
I want to replace RSEa=RSE(a), RSEb=RSE(b), RSEc=RSE(c), meana=mean(a), meanb=mean(b), meanc=mean(c) in the last line above with the object doRSE from here, to avoid lots of typing:
# prepare to calculate new colums with RSE and means
RSEs <- sapply(3:ncol(mydata), function(j) paste0("RSE",names(mydata[j])))
RSExs <- sapply(3:ncol(mydata), function(j) paste0("RSE(",names(mydata[j]),")"))
doRSE <- paste0(sapply(1:length(RSEs), function(x) paste0(RSEs[x],"=",RSExs[x])), collapse=",", sep="")
I'm open to solutions involving base, data.table and dirty tricks. Seems like these are close to what I want, but I can't quite translate them to my problem:
Pass character argument and evaluate,
Force evaluation of multiple variables using vector of character,
Using a vector of characters that correspond to an expression as an argument to a function
UPDATE: Here's the catch: I want to be able to modify the func in the simple example (or doRSE in my use-case) to create a bunch of new columns that result from various calculations on the existing columns, to explore the data. I want a workflow that allows the resulting dataframes to have new columns that were not in the original dataframes. Sorry that wasn't clearer in the original question. I can't see how to adapt @Marius's answer to do this, but @mnel's is helpful (see update below)
Working through @mnel's excellent dirty tricks, with some minor fixes I can get the desired result on my use-case:
# @mnel's solution, adapted (no period before eval)
combys2 <- lapply(combys1, function(x) do.call(ddply,c(.data = quote(x),
.variables = quote(.(ID)), .fun = quote(summarize),
eval(parse(text = sprintf('.(%s)', doRSE ))))))
head(combys2)
[[1]]
ID RSEa RSEb RSEc RSEd RSEe RSEf RSEg RSEh RSEi
1 A 168.30658 21.68632 5.657228 5.048057 4.162017 2.9581874 1.849009 0.6925148 0.4393491
2 B 26.55071 26.20427 4.782578 4.385409 2.342764 2.1813874 2.719625 1.1576681 0.6427935
3 C 73.83165 14.47216 8.154435 6.273202 3.046978 1.2179457 2.811405 1.1401837 0.8167067
4 D 31.96170 57.89260 9.438220 7.388410 3.755772 0.8601780 3.724875 0.8358204 0.9939387
5 E 63.22537 60.35532 5.839690 11.691304 3.828430 0.9217787 4.204300 0.8217187 0.7876634
6 F 56.37635 65.37907 4.149568 5.496308 2.227544 2.1548455 2.847291 1.1956212 0.2506518
7 G 69.32232 23.63214 4.255847 7.979225 4.917660 1.6185960 3.156521 0.3265555 0.8133279
8 H 29.82015 40.74184 7.372100 7.464792 2.749862 0.6054420 4.061368 0.9973909 1.3807720
9 I 50.58114 19.53732 2.989920 9.767678 4.000249 1.7451322 1.175397 0.9952093 0.9095086
10 J 92.96462 39.77475 6.140688 10.295668 3.407726 2.4663758 3.030444 0.5743419 0.9296482
11 K 90.72381 42.25092 2.483069 6.781054 3.142082 1.8080633 2.891740 1.1996176 0.8525290
12 L -385.24547 40.81267 4.506087 8.148382 2.976488 0.8304432 2.234134 0.2108664 0.4979777
13 M 22.77743 33.98332 2.913926 8.764639 2.307293 0.8366635 3.229944 1.0003125 0.3878567
14 N 66.75163 34.16087 6.611326 13.865377 1.285522 1.3863958 4.165575 0.7379386 0.4515194
15 O 37.37188 100.57479 5.738877 5.724862 2.839638 1.1366610 3.186332 0.7383855 0.3954544
16 P 17.08913 26.62210 6.060130 4.110893 2.688908 2.6970727 1.609043 1.3860834 0.8780010
17 Q 13.96392 74.92279 5.469304 8.467638 2.974131 1.2135436 3.284564 0.6232778 1.0759226
18 R 42.59899 30.75952 4.842832 8.764158 1.874020 1.5791048 3.427342 1.4479638 0.2964455
19 S 26.03307 15.56352 6.968717 7.783876 4.439733 2.0764179 4.683080 0.7459654 1.1268772
20 T 71.57945 33.81362 7.147049 11.201551 2.128315 2.2051611 2.419805 0.2688807 1.1559635
21 U 73.93002 11.77155 7.738910 7.207041 1.478491 1.4409844 4.042419 0.5883490 0.5585716
22 V 67.93166 39.54994 5.701551 8.636122 2.472963 1.6514199 2.627965 1.0359048 0.8747136
23 W 11.23057 12.51272 7.003448 7.424559 4.102693 0.6614847 2.246305 1.3422405 0.2665246
RSEj RSEk RSEl RSEm RSEn RSEo RSEp RSEq
1 0.6366733 0.3713819 2.1993487 0.3865293 0.5436581 0.9187585 0.4344699 0.8915868
2 0.3445095 0.2932025 1.8563179 0.5397595 1.0433388 0.3533622 0.1942316 0.1941072
3 0.2720344 0.5507595 2.0305726 0.4377259 0.8589854 0.5690906 0.1397337 0.4043247
4 0.6606667 0.6769112 3.4737352 0.5674656 1.2519256 0.8718298 0.1162969 0.8287504
5 0.4620774 0.5598069 1.9236112 0.7990046 0.9832732 0.6847352 0.4070675 0.9005185
6 0.7981610 0.4005493 0.9721068 0.2770989 1.7054674 0.3110139 0.4521183 0.8740444
7 0.3969116 0.4717575 4.1341106 0.7510628 0.9998299 0.5342292 0.4319642 1.1861705
8 0.2963956 0.2652221 0.4775827 0.2617120 0.8261874 0.5266087 0.1900943 0.2350553
9 0.2609359 0.5431035 2.6478440 0.1606919 0.7407281 0.6802262 0.1802069 0.7438792
10 0.4239787 0.8753544 3.4218030 0.5467869 0.7404017 0.5581173 0.3682014 0.6361436
11 0.4188502 0.8629862 4.4181479 0.1623873 0.8018811 0.5873609 0.3592134 0.5357984
12 0.5790265 0.5009210 3.7534287 0.1933726 0.5809601 0.5777868 0.3400925 0.4783890
13 0.3562582 0.2552756 2.1393219 0.1849345 0.5796194 0.6129469 0.3363311 0.4382125
14 0.7921502 0.6147990 2.9054634 0.5852325 1.4954072 0.9983203 0.2937837 0.7654504
15 0.5840424 0.2757707 1.5695675 0.3305385 0.8712636 0.5816490 0.1985457 0.7213289
16 0.3301280 0.3008273 2.9014987 0.4540833 0.5966479 0.9042004 0.1631630 0.7262141
17 0.5882511 0.2820978 3.0652666 0.4518936 1.3168151 0.4749311 0.2244693 0.6583083
18 0.4048816 0.3708787 3.2207478 0.2603412 1.3168318 0.3318745 0.3120436 0.6210711
19 0.4425123 0.3602076 3.7609863 0.5399527 0.8302572 0.3246904 0.1952143 0.2915325
20 0.5877835 0.6339015 1.6908570 0.3223056 0.5239339 0.6607198 0.2808094 0.3697380
21 0.4454056 0.7733354 4.3433420 0.4391075 0.5503594 0.5893406 0.2262403 0.2361512
22 0.9583940 0.6365843 3.0033951 0.6507968 0.8610046 0.6363198 0.2866719 0.5736855
23 0.4969730 0.3895182 2.0021608 0.3354475 1.4398250 0.7386870 0.2458906 0.3414804
...
You can do some ugly computing on the language using quote and plyr's .().
Reading https://github.com/hadley/devtools/wiki/Computing-on-the-language will probably help understand whether you really want to do this.
Anyway, one approach is to use .() to create your list of arguments, mimicking how summarize works:
.(am=mean(a), bm=mean(b), cm=mean(c))
and if you really wanted to use a character string
foo<- "am=mean(a), bm=mean(b), cm=mean(c)"
eval(parse(text = sprintf('.(%s)', foo )))
Use quote liberally to create your list to be passed to do.call,
for example
lapply(list.1, function(x) do.call(ddply,c(.data = quote(x),
.variables = quote(.(d)), .fun = quote(summarize),
.(am=mean(a), bm=mean(b), cm=mean(c)))))
Oh boy is that ugly.
Or, you could use data.tables
library(data.table)
listDT <- lapply(list.1, data.table)
lapply(listDT, function(x) x[,lapply(.SD, mean), by = 'd'])
or
mystuff <- sprintf('list(%s)', foo)
lapply(listDT, function(x) x[, eval(parse(text = mystuff)), by = 'd'])
However, if you had all the same columns in all your data.tables, it would be more efficient to create one large data.table (with an identifier for each element of the list) and work on that.
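Something along these lines, perhaps (a sketch assuming all the data.tables share the same columns; rbindlist()'s idcol argument adds the identifier):
bigDT <- rbindlist(listDT, idcol = "src")   # "src" labels the originating list element
bigDT[, lapply(.SD, mean), by = .(src, d)]  # all group-wise means in one pass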
Here's a ddply function that calculates the mean for all the columns that aren't d in your dataframes:
lapply(list.1, function(x) {
  ddply(x, .(d), function(df_part) {
    result_df <- data.frame(d = df_part$d[1])
    non_d_cols <- colnames(df_part)[!colnames(df_part) == "d"]
    for (col in non_d_cols) {
      col_mean <- mean(df_part[[col]])
      col_name <- paste0(col, "_mean")
      result_df[[col_name]] <- col_mean
    }
    return(result_df)
  })
})
That seems to me like the simplest way to do it, and it should generalize well to other calculations you might want to do on those columns. Maybe you could pass in a character vector argument of the columns you want to calculate the mean for, and use that in place of non_d_cols.
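For example, here is a sketch of that parameterisation, where mean_cols and its cols argument (a character vector naming the columns to average) are hypothetical names of my own choosing:
mean_cols <- function(x, group_col, cols) {
  ddply(x, group_col, function(df_part) {
    means <- lapply(df_part[cols], mean)                 # mean of each requested column
    setNames(data.frame(means), paste0(cols, "_mean"))   # rename to <col>_mean
  })
}
lapply(list.1, mean_cols, group_col = "d", cols = c("a", "b", "c"))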

If statement for weighted averages in R

I have a data file that is several million lines long, and contains information from many groups. Below is an abbreviated section:
MARKER GROUP1_A1 GROUP1_A2 GROUP1_FREQ GROUP1_N GROUP2_A1 GROUP2_A2 GROUP2_FREQ GROUP2_N
rs10 A C 0.055 1232 A C 0.055 3221
rs1000 A G 0.208 1232 A G 0.208 3221
rs10000 G C 0.134 1232 C G 0.8624 3221
rs10001 C A 0.229 1232 A C 0.775 3221
I would like to create a weighted average of the frequency (FREQ) variable (which in itself is straightforward); however, in this case some of the rows are mismatched (rows 3 & 4). If the letters do not line up, then the frequency of the second group needs to be subtracted from 1 before the weighted mean of that marker is calculated.
I would like to set up a simple IF statement, but I am unsure of the syntax of such a task.
Any insight or direction is appreciated!
Say you've read your data into a data frame called mydata. Then do the following:
mydata$GROUP2_FREQ <- mydata$GROUP2_FREQ - (mydata$GROUP1_A1 != mydata$GROUP2_A1)
It works because R treats TRUE values as 1 and FALSE values as 0.
EDIT: Try the following instead:
mydata$GROUP2_FREQ <- abs( (as.character(mydata$GROUP1_A1) !=
as.character(mydata$GROUP2_A1)) -
as.numeric(mydata$GROUP2_FREQ) )
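Once the frequencies are aligned, the weighted mean itself is the straightforward part the question alludes to; a sketch, using the column names from the sample data:
mydata$WEIGHTED_FREQ <- (mydata$GROUP1_FREQ * mydata$GROUP1_N +
                         mydata$GROUP2_FREQ * mydata$GROUP2_N) /
                        (mydata$GROUP1_N + mydata$GROUP2_N)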
