I have a data frame df with an ID column (called SNP) in which some IDs occur more than once:
CHR SNP A1 A2 MAF NCHROBS
1: 1 1:197 C T 0.3148 314
2: 1 1:205 G C 0.2058 314
3: 1 1:206 A C 0.0000 314
4: 1 1:219 C G 0.8472 314
5: 1 1:223 A C 0.7265 314
6: 1 1:224 G T 0.3295 314
7: 1 1:197 C T 0.3148 314
8: 1 1:205 G C 0.0000 314
9: 1 1:206 A C 0.0000 314
10: 1 1:219 C G 0.0000 314
11: 1 1:223 A C 0.0000 314
12: 1 1:224 G T 0.0000 314
13: 1 1:197 C T 0.4753 314
14: 1 1:205 G C 0.1964 314
15: 1 1:206 A C 0.0000 314
16: 1 1:219 C G 0.6594 314
17: 1 1:223 A C 0.8946 314
18: 1 1:224 G T 0.2437 314
I would like to calculate the mean and standard deviation (SD) from the values in the MAF-column that share the same ID.
library(data.table)  # fread()
library(purrr)       # map_df(); also re-exports the %>% pipe

df <-
  list.files(pattern = "*.csv") %>%
  map_df(~fread(.))

colMeans(df, rows = df$SNP == "1:197", cols = df$MAF)
# fails with an "unused arguments" error -- see the answers below
Why is it not possible to specify values based on conditions with colMeans?
Since you have a data.table,
df[, .(mu = mean(MAF), sigma = sd(MAF)), by = .(SNP) ]
# SNP mu sigma
# 1: 1:197 0.3683000 0.09266472
# 2: 1:205 0.1340667 0.11620023
# 3: 1:206 0.0000000 0.00000000
# 4: 1:219 0.5022000 0.44493914
# 5: 1:223 0.5403667 0.47545926
# 6: 1:224 0.1910667 0.17093936
If you prefer base (despite using data.table), then
aggregate(df$MAF, list(df$SNP), function(a) c(mu = mean(a), sigma = sd(a)))
# Group.1 x.mu x.sigma
# 1 1:197 0.36830000 0.09266472
# 2 1:205 0.13406667 0.11620023
# 3 1:206 0.00000000 0.00000000
# 4 1:219 0.50220000 0.44493914
# 5 1:223 0.54036667 0.47545926
# 6 1:224 0.19106667 0.17093936
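A small caveat with this aggregate() call: because the function returns a named vector, the result is stored as a single matrix column (printed as x.mu / x.sigma above). A minimal sketch to flatten it into ordinary columns, if that matters downstream:

res <- aggregate(df$MAF, list(SNP = df$SNP),
                 function(a) c(mu = mean(a), sigma = sd(a)))
res <- do.call(data.frame, res)  # splits the matrix column x into x.mu and x.sigma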
Using dplyr
library(dplyr)
df %>%
group_by(SNP) %>%
summarise(mean = mean(MAF),
sd = sd(MAF))
Gives us:
SNP mean sd
<chr> <dbl> <dbl>
1 1:197 0.368 0.0927
2 1:205 0.134 0.116
3 1:206 0 0
4 1:219 0.502 0.445
5 1:223 0.540 0.475
6 1:224 0.191 0.171
To answer your question as to why colMeans is not working:
If you look at the documentation of colMeans using ?colMeans, you will see that you are passing the wrong named arguments. The usage shown in the docs is colMeans(x, na.rm=FALSE, dims=1), so the function has no arguments named rows or cols. That is why your call fails with an unused arguments error.
As to whether it is possible to pass conditional statements to colMeans: you have to apply the condition to df itself, i.e. you can pass the subset of df as follows:
colMeans(df[df$SNP == "1:197", "MAF", drop=F], na.rm=F, dims=1)
Note that it is important to pass drop=F here, because you are subsetting a single column. When you subset a single column, the [ operator simplifies the result and converts the data frame to a numeric vector; with drop=F it preserves the dimensions of the original data frame. If a numeric vector is passed to colMeans, you will get an error, because colMeans requires x to have at least two dimensions.
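To see the difference concretely (a sketch; the error text is approximately what base R reports and may vary by version):

colMeans(df[df$SNP == "1:197", ]$MAF)  # plain numeric vector, no dimensions
# Error: 'x' must be an array of at least two dimensions
mean(df[df$SNP == "1:197", ]$MAF)      # mean() is the right tool for a vector
# [1] 0.3683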
As to the other question of how to calculate the column mean by group, others have highlighted quite nice approaches in this thread; any of them works, you just have to choose one.
You could use the tapply() function; note the value vector comes first and the grouping variable (SNP, coerced to a factor) second:

mean.SNP <- tapply(df$MAF, df$SNP, mean)
sd.SNP   <- tapply(df$MAF, df$SNP, sd)
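tapply() can also return both statistics at once; each cell of the result is then a named vector:

stats.SNP <- tapply(df$MAF, df$SNP, function(a) c(mu = mean(a), sigma = sd(a)))
stats.SNP[["1:197"]]
#         mu      sigma
# 0.36830000 0.09266472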
Related
I have a data.table to which I want to add columns of random binomial numbers, using one column as the number of trials and other columns as the probabilities:
require(data.table)
DT = data.table(
ID = letters[sample.int(26,10, replace = T)],
Quantity=as.integer(100*runif(10))
)
prob.vecs <- LETTERS[1:5]
DT[,(prob.vecs):=0]
set.seed(123)
DT[,(prob.vecs):=lapply(.SD, function(x){runif(.N,0,0.2)}), .SDcols=prob.vecs]
DT
ID Quantity A B C D E
1: b 66 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000
2: l 9 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927
3: u 38 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487
4: d 27 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909
5: o 81 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895
6: f 44 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121
7: d 81 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682
8: t 81 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249
9: x 79 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453
10: j 43 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554
Now I want to add five columns Quantity_A, Quantity_B, Quantity_C, Quantity_D, Quantity_E which apply rbinom with the corresponding probability and the quantity from the second column.
So for example the first entry for Quantity_A would be:
set.seed(741)
sum(rbinom(66,1,0.05751550))
# [1] 2
This problem seems very similar to this post: How do I pass column-specific arguments to lapply in data.table .SD? but I cannot seem to make it work. My try:
DT[,(paste0("Quantity_", prob.vecs)):= mapply(function(x, Quantity){sum(rbinom(Quantity, 1 , x))}, .SD), .SDcols = prob.vecs]
Error in rbinom(Quantity, 1, x) :
argument "Quantity" is missing, with no default
Any ideas?
I seem to have found a workaround, though I am not quite sure why it works (probably because rbinom is not vectorized in both arguments). First define an index:
DT[,Index:=.I]
and then do it by index:
DT[,(paste0("Quantity_", prob.vecs)):= lapply(.SD,function(x){sum(rbinom(Quantity, 1 , x))}), .SDcols = prob.vecs, by=Index]
set.seed(789)
ID Quantity A B C D E Index Quantity_A Quantity_B Quantity_C Quantity_D Quantity_E
1: c 37 0.05751550 0.191366669 0.17790786 0.192604847 0.02856000 1 0 4 7 8 0
2: c 51 0.15766103 0.090666831 0.13856068 0.180459809 0.08290927 2 3 5 9 19 3
3: r 7 0.08179538 0.135514127 0.12810136 0.138141056 0.08274487 3 0 0 2 2 0
4: v 53 0.17660348 0.114526680 0.19885396 0.159093484 0.07376909 4 8 4 16 12 3
5: d 96 0.18809346 0.020584937 0.13114116 0.004922737 0.03048895 5 17 3 12 0 4
6: u 52 0.00911130 0.179964994 0.14170609 0.095559194 0.02776121 6 1 3 8 6 0
7: m 43 0.10562110 0.049217547 0.10881320 0.151691908 0.04660682 7 6 1 7 6 2
8: z 3 0.17848381 0.008411907 0.11882840 0.043281587 0.09319249 8 1 0 2 1 1
9: m 3 0.11028700 0.065584144 0.05783195 0.063636202 0.05319453 9 1 0 0 0 0
10: o 4 0.09132295 0.190900730 0.02942273 0.046325157 0.17156554 10 0 0 0 0 0
The numbers look about right to me. A solution without the per-row index would still be appreciated.
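For what it's worth, here is an index-free sketch (untested against the seeds above): the sum of Quantity independent Bernoulli(p) draws has exactly a Binomial(Quantity, p) distribution, and rbinom() recycles its size and prob arguments across n, so the by=Index grouping can be dropped. The individual draws will differ from the indexed version for a given seed, but the distribution is identical:

DT[, (paste0("Quantity_", prob.vecs)) :=
     lapply(.SD, function(p) rbinom(.N, Quantity, p)),
   .SDcols = prob.vecs]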
I have Rank, Status & counts as a data frame created by aggregating a parent data frame. I would like to find the ratio/percentage as below, i.e. the ratio of incomplete to complete counts for each Rank.
Rank Status `n()`
<fct> <fct> <int> <ratio>
1 A Incomplete 602
2 A Complete 9443 602/9443
3 B Incomplete 1425
4 B Complete 10250 ----
5 C Incomplete 1347 ----
6 C Complete 6487
7 D Incomplete 1118
8 D Complete 3967
9 E Incomplete 715
10 E Complete 1948
I tried sapply() to iterate over the groups, calculate the ratios, and store them in another df, but is there a better way to do it? Otherwise, if a stacked bar plot could label the percentage/ratio as above, that would be great. The stacked bar I tried shows the percentage of the total count, not the ratio.
Thanks.
Using dplyr:
library(dplyr)
df <- data.frame(Rank = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E"),
Status = c("Incomplete", "Complete","Incomplete", "Complete",
"Incomplete", "Complete","Incomplete", "Complete",
"Incomplete", "Complete"),
Count = c(602, 9443, 1425, 10250, 1347, 6487, 1118, 3967, 715, 1948))
# Ratio
df %>% group_by(Rank) %>% mutate(Ratio = Count/sum(Count))
# A tibble: 10 x 4
# Groups: Rank [5]
# Rank Status Count Ratio
# <fct> <fct> <dbl> <dbl>
# 1 A Incomplete 602. 0.0599
# 2 A Complete 9443. 0.940
# 3 B Incomplete 1425. 0.122
# 4 B Complete 10250. 0.878
# 5 C Incomplete 1347. 0.172
# 6 C Complete 6487. 0.828
# 7 D Incomplete 1118. 0.220
# 8 D Complete 3967. 0.780
# 9 E Incomplete 715. 0.268
#10 E Complete 1948. 0.732
# Percentage
df %>% group_by(Rank) %>% mutate(Percentage = (Count/sum(Count))*100)
# A tibble: 10 x 4
# Groups: Rank [5]
# Rank Status Count Percentage
# <fct> <fct> <dbl> <dbl>
# 1 A Incomplete 602. 5.99
# 2 A Complete 9443. 94.0
# 3 B Incomplete 1425. 12.2
# 4 B Complete 10250. 87.8
# 5 C Incomplete 1347. 17.2
# 6 C Complete 6487. 82.8
# 7 D Incomplete 1118. 22.0
# 8 D Complete 3967. 78.0
# 9 E Incomplete 715. 26.8
#10 E Complete 1948. 73.2
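Since the question also asked about labelling the stacked bars, here is a hedged ggplot2 sketch built on the same per-Rank percentage; placing the labels with position_stack() is one reasonable choice, not the only one:

library(ggplot2)
df %>%
  group_by(Rank) %>%
  mutate(Percentage = Count / sum(Count) * 100) %>%
  ggplot(aes(x = Rank, y = Percentage, fill = Status)) +
  geom_col() +
  geom_text(aes(label = sprintf("%.1f%%", Percentage)),
            position = position_stack(vjust = 0.5))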
Using dcast in data.table.
Code:
library('data.table')
dcast(setDT(df), formula = Rank~Status, value.var = "count")[, ratio := Incomplete / Complete][]
If you have duplicate status within a given rank, for example Rank A has two Incomplete Status with count 602 and 605, then this will take care of it.
dcast(setDT(df2)[, .(count = sum(count)), by = .(Rank, Status)], # sum count by Status and Rank
formula = Rank~Status, value.var = "count")[, ratio := Incomplete / Complete][]
Output:
without duplicate Status
# Rank Complete Incomplete ratio
# 1: A 9443 602 0.06375093
# 2: B 10250 1425 0.13902439
# 3: C 6487 1347 0.20764606
# 4: D 3967 1118 0.28182506
# 5: E 1948 715 0.36704312
with duplicate Status
# Rank Complete Incomplete ratio
# 1: A 9443 1207 0.1278195
# 2: B 10250 1425 0.1390244
# 3: C 6487 1347 0.2076461
# 4: D 3967 1118 0.2818251
# 5: E 1948 715 0.3670431
Data:
without duplicate Status
df <- read.table(text='Rank Status `n()`
1 A Incomplete 602
2 A Complete 9443
3 B Incomplete 1425
4 B Complete 10250
5 C Incomplete 1347
6 C Complete 6487
7 D Incomplete 1118
8 D Complete 3967
9 E Incomplete 715
10 E Complete 1948')
colnames(df)[3] <- 'count'
with duplicate status:
df2 <- read.table(text='Rank Status `n()`
1 A Incomplete 602
2 A Incomplete 605
2.1 A Complete 9443
3 B Incomplete 1425
4 B Complete 10250
5 C Incomplete 1347
6 C Complete 6487
7 D Incomplete 1118
8 D Complete 3967
9 E Incomplete 715
10 E Complete 1948')
colnames(df2)[3] <- 'count'
I didn't use the dplyr package, but I think the following logic would work. Let's say your data frame is df.
# creating a sample data frame like yours
p <- c("Incomplete","Complete","Incomplete","Complete","Incomplete","Complete")
q <- c(604,9443,1425,10250,1347,6487)
# ignoring the ranks
df <- data.frame(Status = p, counts = q)
ratiovector <- numeric(NROW(df))  # zero-filled vector, one entry per row
kcomp <- which(df$Status == "Complete")
kincomp <- which(df$Status == "Incomplete")
# assumes Incomplete/Complete rows come in matching pairs, as above
ratiovector[kcomp] <- df$counts[kincomp] / df$counts[kcomp]
dfnew <- cbind(df, ratio = ratiovector)
dfnew
# if you want it in string form, convert it
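A sketch of that string form, matching the 602/9443 style from the question (ratio_str is a hypothetical column name):

dfnew$ratio_str <- ""
dfnew$ratio_str[kcomp] <- paste(df$counts[kincomp], df$counts[kcomp], sep = "/")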
In base R:
df$ratio <- ave(df$Count,df$Rank,FUN=function(x)x/sum(x))
# Rank Status Count ratio
# 1 A Incomplete 602 0.05993031
# 2 A Complete 9443 0.94006969
# 3 B Incomplete 1425 0.12205567
# 4 B Complete 10250 0.87794433
# 5 C Incomplete 1347 0.17194281
# 6 C Complete 6487 0.82805719
# 7 D Incomplete 1118 0.21986234
# 8 D Complete 3967 0.78013766
# 9 E Incomplete 715 0.26849418
# 10 E Complete 1948 0.73150582
I have been playing around with the newer data.table conditional merge feature and it is very cool. I have a situation where I have two tables, dtBig and dtSmall, and there are multiple row matches in both datasets when this conditional merge takes place. Is there a way to aggregate these matches using a function like max or min for these multiple matches? Here is a reproducible example that tries to mimic what I am trying to accomplish.
Set up environment
## docker run --rm -ti rocker/r-base
## install.packages("data.table", type = "source",repos = "http://Rdatatable.github.io/data.table")
Create two fake datasets
Create a "big" table with 50 rows (10 values for each ID):
library(data.table)
set.seed(1L)
# Simulate some data
dtBig <- data.table(ID=c(sapply(LETTERS[1:5], rep, 10, simplify = TRUE)), ValueBig=ceiling(runif(50, min=0, max=1000)))
dtBig[, Rank := frank(ValueBig, ties.method = "first"), keyby=.(ID)]
ID ValueBig Rank
1: A 266 3
2: A 373 4
3: A 573 5
4: A 909 9
5: A 202 2
---
46: E 790 9
47: E 24 1
48: E 478 2
49: E 733 7
50: E 693 6
Create a "small" dataset similar to the first, but with 10 rows (2 values for each ID)
dtSmall <- data.table(ID=c(sapply(LETTERS[1:5], rep, 2, simplify = TRUE)), ValueSmall=ceiling(runif(10, min=0, max=1000)))
ID ValueSmall
1: A 478
2: A 862
3: B 439
4: B 245
5: C 71
6: C 100
7: D 317
8: D 519
9: E 663
10: E 407
Merge
I next want to perform a merge by ID, merging only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig. I tried doing this two different ways. Method 2 gives me the desired output, but I am unclear why the two outputs differ at all; it seems like the join is just returning the last matched value.
## Method 1
dtSmall[dtBig, RankSmall := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
## Method 2
setorder(dtBig, ValueBig)
dtSmall[dtBig, RankSmall2 := max(i.Rank), by=.EACHI, on=.(ID, ValueSmall >= ValueBig)]
Results
ID ValueSmall RankSmall RankSmall2 DesiredRank
1: A 478 1 4 4
2: A 862 1 7 7
3: B 439 3 4 4
4: B 245 1 2 2
5: C 71 1 1 1
6: C 100 1 1 1
7: D 317 1 2 2
8: D 519 3 5 5
9: E 663 2 5 5
10: E 407 1 1 1
Is there a better data.table way of grabbing the max value in another data.table with multiple matches?
I next want to perform a merge by ID and needs to merge only where ValueSmall is greater than or equal to ValueBig. For the matches, I want to grab the max ranked value in dtBig.
setorder(dtBig, ID, ValueBig, Rank)
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), mult="last", x.Rank ]
]
ID ValueSmall r
1: A 478 4
2: A 862 7
3: B 439 4
4: B 245 2
5: C 71 1
6: C 100 1
7: D 317 2
8: D 519 5
9: E 663 5
10: E 407 1
I imagine it is considerably faster to sort dtBig and take the last matching row rather than to compute the max by .EACHI, but am not entirely sure. If you don't like sorting, just save the previous sort order so it can be reverted to afterwards.
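A minimal sketch of that save-and-revert idea (orig_order is a hypothetical helper column name):

dtBig[, orig_order := .I]            # remember the incoming row order
setorder(dtBig, ID, ValueBig, Rank)  # sort for the mult = "last" join
# ... run the join shown above ...
setorder(dtBig, orig_order)          # restore the original order
dtBig[, orig_order := NULL]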
Is there a way to aggregate these matches using a function like max or min for these multiple matches?
For this more general problem, .EACHI works; just make sure you're doing it for each row of the target table (dtSmall in this case), so...
dtSmall[, r :=
dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), max(x.Rank), by=.EACHI ]$V1
]
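The same pattern should work for min or any other aggregate over the matches, e.g. the minimum matching rank (same structure, untested):

dtSmall[, r_min :=
  dtBig[.SD, on=.(ID, ValueBig <= ValueSmall), min(x.Rank), by=.EACHI ]$V1
]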
I used the function ddply (package plyr) to calculate the mean of a response variable for each group "Trial" and "Treatment". I get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
This data frame suggests that in trial 4, treatment B, there are no observations for the response variable (as no such row appears in the data frame). So, is it possible to automatically add a row of zeros to the data frame (built with ddply) when there are no observations for a given combination?
I would like to get this data frame:
Trial Treatment N Mean
1 A 458 125.258
1 B 459 168.748
2 A 742 214.266
2 B 142 475.786
3 A 247 145.689
3 B 968 234.129
4 A 436 456.287
4 B 0 0
We can merge the original dataset with a data.frame created from the full combination of unique values in 'Trial' and 'Treatment'. This gives an output with the missing combinations filled with NA. If needed, these can be changed to 0, but it is usually better to keep the missing combinations as NA.

res <- merge(expand.grid(lapply(df1[1:2], unique)), df1, all.x=TRUE)
res[is.na(res)] <- 0  # only if zeros are preferred over NA
Or with dplyr/tidyr, we can use complete (from tidyr)
library(dplyr)
library(tidyr)
df1 %>%
complete(Trial, Treatment, fill= list(N=0, Mean=0))
# Trial Treatment N Mean
# (int) (chr) (dbl) (dbl)
#1 1 A 458 125.258
#2 1 B 459 168.748
#3 2 A 742 214.266
#4 2 B 142 475.786
#5 3 A 247 145.689
#6 3 B 968 234.129
#7 4 A 436 456.287
#8 4 B 0 0.000
I have the following data frame (this is only the head of the data frame). The ID column is subject (I have more subjects in the data frame, not only subject #99). I want to calculate the mean "rt" by "subject" and "condition" only for observations that have z.score (in absolute values) smaller than 1.
> b
subject rt ac condition z.score
1 99 1253 1 200_9 1.20862682
2 99 1895 1 102_2 2.95813507
3 99 1049 1 68_1 1.16862102
4 99 1732 1 68_9 2.94415384
5 99 765 1 34_9 -0.63991180
7 99 1016 1 68_2 -0.03191493
I know I can do it using tapply or dcast (from reshape2) after subsetting the data:
b1 <- subset(b, abs(z.score) < 1)
b2 <- dcast(b1, subject~condition, mean, value.var = "rt")
subject 34_1 34_2 34_9 68_1 68_2 68_9 102_1 102_2 102_9 200_1 200_2 200_9
1 99 1028.5714 957.5385 861.6818 837.0000 969.7222 856.4000 912.5556 977.7273 858.7800 1006.0000 1015.3684 913.2449
2 5203 957.8889 815.2500 845.7750 933.0000 893.0000 883.0435 926.0000 879.2778 813.7308 804.2857 803.8125 843.7200
3 5205 1456.3333 1008.4286 850.7170 1142.4444 910.4706 998.4667 935.2500 980.9167 897.4681 1040.8000 838.7917 819.9710
4 5306 1022.2000 940.5882 904.6562 1525.0000 1216.0000 929.5167 955.8571 981.7500 902.8913 997.6000 924.6818 883.4583
5 5307 1396.1250 1217.1111 1044.4038 1055.5000 1115.6000 980.5833 1003.5714 1482.8571 941.4490 1091.5556 1125.2143 989.4918
6 5308 659.8571 904.2857 966.7755 960.9091 1048.6000 904.5082 836.2000 1753.6667 926.0400 870.2222 1066.6667 930.7500
In the example above, every subject had observations that met the subset criteria for b1. However, it can happen that a certain subject has no observations left after subsetting. In that case I want b2 to show NA for that subject in the specific condition where no observations meet the subset criteria. Does anyone have an idea how to do that?
Any help will be greatly appreciated.
Best,
Ayala
There is a drop argument in dcast that you can use in this situation, but you'll need to convert subject to a factor.
Here is a dataset with a second subject ID that has no values that meet your condition that the absolute value of z.score is less than one.
library(reshape2)
bb = data.frame(subject=c(99,99,99,99,99,11,11,11), rt=c(100,150,2,4,10,15,1,2),
ac=rep(1,8), condition=c("A","A","B","D","C","C","D","D"),
z.score=c(0.2,0.3,0.2,0.3,.2,2,2,2))
If you reshape this to a wide format with dcast, you lose subject number 11 even with the drop argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 99 125 2 10 4
Make subject a factor.
bb$subject = factor(bb$subject)
Now you can dcast with drop = FALSE to keep all subjects in the wide dataset.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE)
subject A B C D
1 11 NaN NaN NaN NaN
2 99 125 2 10 4
To get NA instead of NaN you can use the fill argument.
dcast(subset(bb, abs(z.score) < 1), subject ~ condition, fun = mean,
value.var = "rt", drop = FALSE, fill = as.numeric(NA))
subject A B C D
1 11 NA NA NA NA
2 99 125 2 10 4
Is the following what you are after? I created a similar dataset bb:
library("plyr") ###needed for . function below
bb<- data.frame(subject=c(99,99,99,99,99,11,11,11),rt=c(100,150,2,4,10,15,1,2), ac=rep(1,8) ,condition=c("A","A","B","D","C","C","D","D"), z.score=c(0.2,0.3,0.2,0.3,1.5,-0.3,0.8,0.7))
bb
subject rt ac condition z.score
#1 99 100 1 A 0.2
#2 99 150 1 A 0.3
#3 99 2 1 B 0.2
#4 99 4 1 D 0.3
#5 99 10 1 C 1.5
#6 11 15 1 C -0.3
#7 11 1 1 D 0.8
#8 11 2 1 D 0.7
Then you call dcast with subset included:
cc<-dcast(bb,subject~condition, mean, value.var = "rt",subset = .(abs(z.score)<1))
cc
subject A B C D
#1 11 NaN NaN 15 1.5
#2 99 125 2 NaN 4.0