Compute stats for several columns at the same time using sapply - r

I have a dataframe as follows:
# A tibble: 6 x 4
   Placebo    High   Medium      Low
     <dbl>   <dbl>    <dbl>    <dbl>
1  0.0400  -0.04     0.0100   0.0100
2  0.04     0       -0.0100   0.04
3  0.0200  -0.1     -0.05    -0.0200
4  0.03    -0.0200   0.03    -0.00700
5 -0.00500 -0.0100   0.0200   0.0100
6  0.0300  -0.0100  NA       NA
Cohen's d for two of the columns can be computed with the cohen.d() function from the effsize package:
df <- data.frame(Placebo = c(0.0400, 0.04, 0.0200, 0.03, -0.00500, 0.0300),
Low = c(-0.04, 0, -0.1, -0.0200, -0.0100, -0.0100),
Medium = c(0.0100, -0.0100, -0.05, 0.03, 0.0200, NA ),
High = c(0.0100, 0.04, -0.0200, -0.00700, 0.0100, NA))
library(effsize)
cohen.d(as.vector(na.omit(df$Placebo)), as.vector(na.omit(df$High)))
Interestingly enough, I'm getting the following error with this code:
Error in data[, group] : incorrect number of dimensions
However, I would like to create a function that computes Cohen's d between one of the columns and each of the others.
To get Cohen's d for every column against Placebo, I would use something like:
sapply(df, function(i) cohen.d(pull(df, as.vector(na.omit(!!Placebo))), as.vector(na.omit(i))))
But I'm not sure this would work anyway.
Edit: I don't want to drop the full row, as Cohen's d can be computed for vectors of different lengths. Ideally, I would like to get the statistic with the NA values removed for each column independently.

It may be better to remove the NA values for each column separately, using a logical index of the rows that are complete in both 'Placebo' and that column:
library(dplyr)
library(effsize)
df %>%
  summarise(across(Low:High, ~ list({
    # keep only rows where both Placebo and the current column are non-missing
    i1 <- complete.cases(Placebo) & complete.cases(.x)
    cohen.d(Placebo[i1], .x[i1])
  })))
Or, if we want to use lapply/sapply, loop over the columns other than Placebo:
lapply(df[-1], function(x) {
  # drop rows with an NA in either Placebo or the current column
  x1 <- na.omit(cbind(df$Placebo, x))
  cohen.d(x1[, 1], x1[, 2])
})
-output
$Low
Cohen's d
d estimate: 1.947312 (large)
95 percent confidence interval:
lower upper
0.3854929 3.5091319
$Medium
Cohen's d
d estimate: 0.9622504 (large)
95 percent confidence interval:
lower upper
-0.5782851 2.5027860
$High
Cohen's d
d estimate: 0.8884639 (large)
95 percent confidence interval:
lower upper
-0.6402419 2.4171697
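If only the point estimates are needed, the same idea can be written with sapply and a hand-rolled Cohen's d that drops the NA values from each column independently, as the edit to the question asks. This is a minimal base R sketch: cohens_d_manual is a hypothetical helper (not part of effsize), and it assumes the usual pooled-standard-deviation definition of Cohen's d.

```r
# Data from the question (columns named as in the data.frame() call)
df <- data.frame(
  Placebo = c(0.04, 0.04, 0.02, 0.03, -0.005, 0.03),
  Low     = c(-0.04, 0, -0.1, -0.02, -0.01, -0.01),
  Medium  = c(0.01, -0.01, -0.05, 0.03, 0.02, NA),
  High    = c(0.01, 0.04, -0.02, -0.007, 0.01, NA)
)

# Hypothetical helper: Cohen's d with a pooled standard deviation.
# NA values are dropped from each vector separately, so the two
# groups may end up with different lengths.
cohens_d_manual <- function(x, y) {
  x <- x[!is.na(x)]
  y <- y[!is.na(y)]
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Compare Placebo against every other column; returns a named numeric vector
d_vs_placebo <- sapply(df[-1], function(col) cohens_d_manual(df$Placebo, col))
d_vs_placebo
```

For Low, which has no missing values, this agrees with the cohen.d() estimate shown above. For Medium and High it can differ slightly, because all six Placebo values are kept instead of only the rows that are complete in both columns.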

Related

how to calculate standard deviation of values in 10 intervals?

I want to calculate a standard deviation in steps of 10 in R.
For a large number of values, I want the SD of the values in each interval of width 10: 0-10, 10-20, 20-30, ...
Example: I have a vector:
exemple <- seq(0, 100, 10)
If I call sd(exemple), I get the standard deviation of all the values in exemple.
How can I instead calculate the standard deviation separately for each step of 10, i.e. between 0 and 10, between 10 and 20, between 20 and 30, etc.?
To be precise, each interval (0-10, 10-20, ...) contains several values.
For example, in the interval 0 to 10 the values are: 0.2, 0.3, 0.5, 0.7, 0.6, 0.7, 0.03, 0.09, 0.1, 0.05
An image for more illustrations :
Can someone help me please ?
You may use cut/findInterval to divide the data into groups and take sd of each group.
set.seed(123)
vec <- runif(100, max = 100)
tapply(vec, cut(vec, seq(0,100,10)), sd)
# (0,10] (10,20] (20,30] (30,40] (40,50] (50,60] (60,70] (70,80] (80,90] (90,100]
#3.438162 2.653866 2.876299 2.593230 2.353325 2.755474 2.454519 3.282779 3.658064 3.021508
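One detail worth checking with cut(): by default the bins are right-closed, e.g. (0,10], so a value of exactly 0 falls outside every bin and is silently dropped by tapply. A minimal sketch with made-up values (not from the question) showing how include.lowest = TRUE keeps it:

```r
# Made-up values for illustration; note the exact 0 and the exact 10
vals <- c(0, 2, 4, 10, 12, 18, 20)
breaks <- seq(0, 20, 10)

# Default: bins are (0,10] and (10,20]; the 0 gets NA and is left out
default_bins <- cut(vals, breaks)
sd_default <- tapply(vals, default_bins, sd)

# include.lowest = TRUE makes the first bin [0,10], so the 0 is kept
closed_bins <- cut(vals, breaks, include.lowest = TRUE)
sd_closed <- tapply(vals, closed_bins, sd)

sd_default
sd_closed
```

Whether a boundary value such as 10 should belong to the lower or the upper bin is a choice; right = FALSE switches to left-closed bins such as [0,10).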
Here is a solution using dplyr:
library(dplyr)
## Create a data frame with a random variable of 1000 values between 1 and 100
df <- data.frame(x = runif(1000, 1, 100))
## Create a grouping variable, binning by 10
df$group <- findInterval(df$x, seq(10, 100, by = 10))
## Calculate SD by group
df %>%
  group_by(group) %>%
  summarise(St.dev = sd(x))
# A tibble: 10 x 2
   group St.dev
 * <int>  <dbl>
 1     0   2.58
 2     1   2.88
 3     2   2.90
 4     3   2.71
 5     4   2.84
 6     5   2.90
 7     6   2.88
 8     7   2.68
 9     8   2.98
10     9   2.89

R: Tables with returns [Stargazer example]

I have a dataset containing different stock returns that I want to show in a stargazer table. The problem is that the last row in the dataframe contains NA's in 2 of the 3 columns. Also when I output with stargazer it shows mean, max, min, etc. I only want the actual return value that I have in my dataframe.
Example code:
#Creating dataframe
X <- data.frame("Group" = c("Value", "Growth", "HML"), "Excess of riskfree" = c(0.1, 0.2,NA),
"Excess of Market" = c(0.2,0.4,NA), "Nominal" = c(0.5, 0.6, 0.01))
#Displaying my dataframe
> X
Group Excess.of.riskfree Excess.of.Market Nominal
1 Value 0.1 0.2 0.50
2 Growth 0.2 0.4 0.60
3 HML NA NA 0.01
#Setting up stargazer table
stargazer(X, title="Table 1: Returns", align=T, digits=4, out="Table1_Ret.txt", no.space=T, flip=T)
#This gives the following table
Table 1: Returns
=====================================================
Statistic Excess.of.riskfree Excess.of.Market Nominal
-----------------------------------------------------
N 2 2 3
Mean 0.1500 0.3000 0.3700
St. Dev. 0.0707 0.1414 0.3158
Min 0.1000 0.2000 0.0100
Pctl(25) 0.1250 0.2500 0.2550
Pctl(75) 0.1750 0.3500 0.5500
Max 0.2000 0.4000 0.6000
-----------------------------------------------------
Basically, I want the stargazer table to be somewhat equal to the display of my dataframe in R (Group as Rows and variables as column names). And just display the return values, not the statistical approach which seems to be the default layout.
Doesn't necessarily have to be a table from the stargazer package, if there are other (simpler) solution I would be glad to receive that as well!
All you have to do is add the summary=FALSE option and set the flip option to FALSE:
stargazer(X, summary=FALSE, title="Table 1: Returns", align=TRUE, digits=4, out="Table1_Ret.txt", no.space=TRUE, flip=FALSE)
#Gives you this:
Table 1: Returns
====================================================
Group Excess.of.riskfree Excess.of.Market Nominal
----------------------------------------------------
1 Value 0.1000 0.2000 0.5000
2 Growth 0.2000 0.4000 0.6000
3 HML 0.0100
----------------------------------------------------
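Also: stargazer just leaves the NA cells blank. If you want NA to appear in the table, enter it as a string. Note that this turns those columns into character columns, so the digits option no longer applies to them: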
X <- data.frame("Group" = c("Value", "Growth", "HML"), "Excess of riskfree" = c(0.1, 0.2,"NA"),
"Excess of Market" = c(0.2,0.4,"NA"), "Nominal" = c(0.5, 0.6, 0.01))
#Then you get this:
Table 1: Returns
====================================================
Group Excess.of.riskfree Excess.of.Market Nominal
----------------------------------------------------
1 Value 0.1 0.2 0.5000
2 Growth 0.2 0.4 0.6000
3 HML NA NA 0.0100
----------------------------------------------------
Have a look at the stargazer documentation for further layout options.

Producing anova from already summarized data

I have a table that looks like this:
I'm trying to run aov() on the above table, but I'm only able to create a partial output. I'm not sure how to include the standard deviation in the calculation.
Right now I'm concatenating and repeating each group like so:
groups <- c(rep('LHS', 121), rep('HS', 546), rep('Jr', 97), rep('Bachelors', 253), rep('Graduate', 155))
And then doing the same for the means (since I don't have access to the original data sheet):
means <- c(rep(38.67, 121), rep(39.6, 546), rep(41.39, 97), rep(42.55, 253), rep(40.85, 155))
At this point I can create a data frame and then run aov on it:
df <- data.frame(groups, means)
groups.aov <- aov(means ~ groups, data = df)
Unfortunately summary(groups.aov) only gives me a partial result.
Df Sum Sq Mean Sq F value Pr(>F)
groups 4 2004 501 4.247e+27 <2e-16 ***
Residuals 1167 0 0
Any other way I can go, where I can factor in the SD?
We simulate some data so that we know the calculations are correct:
set.seed(100)
df = data.frame(
groups=rep(letters[1:4],times=seq(20,35,by=5)),
value=rnorm(110,rep(1:4,times=seq(20,35,by=5)),1))
We get back something like the table you see above:
library(dplyr)
res <- df %>% group_by(groups) %>% summarize_all(c(mean=mean,sd=sd,n=length))
total <- data.frame(groups="total",mean=mean(df$value),sd=sd(df$value),n=nrow(df))
rbind(res,total)
# A tibble: 5 x 4
groups mean sd n
<fct> <dbl> <dbl> <int>
1 a 0.937 1.14 20
2 b 1.91 0.851 25
3 c 3.01 0.780 30
4 d 4.01 0.741 35
5 total 2.70 1.42 110
ANOVA works with sums of squares. To get from a standard deviation back to a sum of squares, square it to get the variance and multiply by n-1; from the sums of squares you can then derive the F value. The detailed calculations:
# number of groups
ngroups=nrow(res)
# total sum of squares
SST = (total$sd^2)*(total$n-1)
#error within groups
SSE = sum((res$sd^2)*(res$n-1))
aovtable = data.frame(
  # residual df is the total n minus the number of groups
  Df = c(ngroups-1,total$n-ngroups),
  SumSq = c(SST-SSE,SSE)
)
aovtable$MeanSq = aovtable$SumSq / aovtable$Df
aovtable$F = c(aovtable$MeanSq[1]/aovtable$MeanSq[2],NA)
aovtable$p = c(pf(aovtable$F[1],aovtable$Df[1],aovtable$Df[2],lower.tail=FALSE),NA)
And we can compare the two results (the hand-computed table is rounded to four significant digits):
signif(aovtable, 4)
   Df  SumSq  MeanSq     F         p
1   3 140.60 46.8500 63.23 1.627e-23
2 106  78.55  0.7411    NA        NA
summary(aov(value~groups,data=df))
Df Sum Sq Mean Sq F value Pr(>F)
groups 3 140.56 46.85 63.23 <2e-16 ***
Residuals 106 78.55 0.74
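The same arithmetic can be wrapped in a small function that needs only the per-group means, standard deviations and sizes, which is all the question has. This is a sketch in base R: anova_from_summary is a hypothetical helper name, and it computes the between-group sum of squares directly from the group means, so the "total" row is not required. The check against aov() uses the simulated data from this answer.

```r
# Hypothetical helper: one-way ANOVA table from per-group summaries
anova_from_summary <- function(means, sds, ns) {
  k <- length(means)                 # number of groups
  N <- sum(ns)                       # total sample size
  grand_mean <- sum(ns * means) / N  # weighted mean of the group means
  ss_between <- sum(ns * (means - grand_mean)^2)
  ss_within  <- sum((ns - 1) * sds^2)
  df_between <- k - 1
  df_within  <- N - k
  f_value <- (ss_between / df_between) / (ss_within / df_within)
  data.frame(
    Df     = c(df_between, df_within),
    SumSq  = c(ss_between, ss_within),
    MeanSq = c(ss_between / df_between, ss_within / df_within),
    F      = c(f_value, NA),
    p      = c(pf(f_value, df_between, df_within, lower.tail = FALSE), NA),
    row.names = c("groups", "Residuals")
  )
}

# Check against aov() on simulated raw data
set.seed(100)
sim <- data.frame(
  groups = rep(letters[1:4], times = seq(20, 35, by = 5)),
  value  = rnorm(110, rep(1:4, times = seq(20, 35, by = 5)), 1)
)
grp_mean <- tapply(sim$value, sim$groups, mean)
grp_sd   <- tapply(sim$value, sim$groups, sd)
grp_n    <- tapply(sim$value, sim$groups, length)

from_summary <- anova_from_summary(grp_mean, grp_sd, grp_n)
from_raw <- summary(aov(value ~ groups, data = sim))[[1]]
from_summary
```

For the question's data, the function would be called with the five group means, the five standard deviations from the summary table, and the group sizes c(121, 546, 97, 253, 155).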

Wrong degrees of freedom in lsmeans and SE calculation in R

I have this sample data:
Sample Replication Days
1 1 10
1 1 14
1 1 13
1 1 14
2 1 NA
2 1 5
2 1 18
2 1 20
1 2 16
1 2 NA
1 2 18
1 2 21
2 2 15
2 2 7
2 2 12
2 2 14
I have four observations for each sample, with a total of 64 samples in each of the two replications. In total, I have 512 values across both replications. I also have some missing values designated as 'NA'. I performed an ANOVA on the mean values for each Sample in each Rep, which I generated using
library(tidyverse)
df <- Data %>% group_by(Sample, Rep) %>% summarise(Mean = mean(Days, na.rm = TRUE))
curve.anova <- aov(Mean~Rep+Sample, data=df)
Result of anova is:
> summary(curve.anova)
Df Sum Sq Mean Sq F value Pr(>F)
Rep 1 6.1 6.071 2.951 0.0915 .
Sample 63 1760.5 27.945 13.585 <2e-16 ***
Residuals 54 111.1 2.057
I created a table for mean and SE values,
ANOVA<-lsmeans(curve.anova, ~Sample)
ANOVA<-summary(ANOVA)
write.csv(ANOVA, file="Desktop/ANOVA.csv")
A few lines from file are:
Sample lsmean SE df lower.CL upper.CL
1 24.875 1.014145417 54 22.84176086 26.90823914
2 25.5 1.014145417 54 23.46676086 27.53323914
3 31.32575758 1.440722628 54 28.43728262 34.21423253
4 26.375 1.014145417 54 24.34176086 28.40823914
5 26.42424242 1.440722628 54 23.53576747 29.31271738
6 25.5 1.014145417 54 23.46676086 27.53323914
7 28.375 1.014145417 54 26.34176086 30.40823914
8 24.875 1.014145417 54 22.84176086 26.90823914
9 21.16666667 1.014145417 54 19.13342752 23.19990581
10 23.875 1.014145417 54 21.84176086 25.90823914
df for all 64 samples is 54 and the error bars in the ggplot are mostly equal for all the Samples. SE values are larger than the manually calculated values. Based on anova results, df=54 is for residuals.
I want to double-check that the ANOVA results are correct and that I am correctly generating the lsmeans and SE values, so that I can plot a bar graph in ggplot with confidence interval error bars.
I will appreciate any help. Thank you!
After reading your comments, I think your workflow has an issue. Basically, you are running the ANOVA on the means of the different samples rather than on the observations themselves.
So, in your example, when you are doing :
curve.anova <- aov(Mean~Rep+Sample, data=df)
You are comparing these values:
> df
# A tibble: 4 x 3
# Groups: Sample [2]
Sample Replication Mean
<dbl> <dbl> <dbl>
1 1 1 12.8
2 1 2 18.3
3 2 1 14.3
4 2 2 12
So, basically, you are comparing two groups with two values per group.
So, when you tried to remove the Replication grouping, you got an error, because the output of:
df = Data %>% group_by(Sample) %>% summarise(Mean = mean(Days, na.rm = TRUE))
is now:
# A tibble: 2 x 2
Sample Mean
<dbl> <dbl>
1 1 15.1
2 2 13
So, applying anova test on that dataset means that you are comparing two groups with one value each. So, you can't compute residuals and SE.
Instead, you should do it on the full dataset without trying to calculate the mean first:
anova_data <- aov(Days~Sample+Replication, data=Data)
anova_data2 <- aov(Days~Sample, data=Data)
Their outputs are:
> summary(anova_data)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 16.07 16.071 0.713 0.416
Replication 1 9.05 9.054 0.402 0.539
Residuals 11 247.80 22.528
2 observations deleted due to missingness
> summary(anova_data2)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 16.07 16.07 0.751 0.403
Residuals 12 256.86 21.41
2 observations deleted due to missingness
Now, you can apply lsmeans:
A_d = summary(lsmeans(anova_data, ~Sample))
A_d2 = summary(lsmeans(anova_data2, ~Sample))
> A_d
Sample lsmean SE df lower.CL upper.CL
1 15.3 1.8 11 11.29 19.2
2 12.9 1.8 11 8.91 16.9
Results are averaged over the levels of: Replication
Confidence level used: 0.95
> A_d2
Sample lsmean SE df lower.CL upper.CL
1 15.1 1.75 12 11.33 19.0
2 13.0 1.75 12 9.19 16.8
Confidence level used: 0.95
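To see where these standard errors come from: in a one-way model, the SE of each lsmean is sqrt(residual mean square / n), where n is the number of observations in that group. Groups of equal size therefore get identical SEs, which is why they differ from a per-group sd/sqrt(n) calculated by hand. A base R sketch that reproduces the A_d2 values, assuming the raw data shown at the end of this answer with Sample and Replication stored as factors:

```r
# Raw data from the end of this answer (Sample and Replication as factors)
Data <- data.frame(
  Sample      = factor(c(rep(1, 4), rep(2, 4), rep(1, 4), rep(2, 4))),
  Replication = factor(c(rep(1, 8), rep(2, 8))),
  Days        = c(10, 14, 13, 14, NA, 5, 18, 20, 16, NA, 18, 21, 15, 7, 12, 14)
)

fit <- aov(Days ~ Sample, data = Data)

# Residual mean square: residual sum of squares over residual df
mse <- deviance(fit) / df.residual(fit)

# Number of non-missing observations per Sample
n_per_sample <- tapply(!is.na(Data$Days), Data$Sample, sum)

# Pooled standard error and 95% confidence limits of each Sample mean
sample_mean <- tapply(Data$Days, Data$Sample, mean, na.rm = TRUE)
pooled_se   <- sqrt(mse / n_per_sample)
t_crit      <- qt(0.975, df.residual(fit))
manual <- data.frame(
  Sample   = names(sample_mean),
  lsmean   = as.numeric(sample_mean),
  SE       = as.numeric(pooled_se),
  lower.CL = as.numeric(sample_mean - t_crit * pooled_se),
  upper.CL = as.numeric(sample_mean + t_crit * pooled_se)
)
manual
```

With unequal group sizes the SEs differ only through n, which would explain two distinct SE values in the question's table (samples with and without a missing mean).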
Dropping Replication from the model barely changes the means and the SEs (which is good: it means your replicates are consistent, with little variability between them), and it narrows the confidence intervals slightly.
So, to plot it, you can:
library(ggplot2)
ggplot(A_d, aes(x=as.factor(Sample), y=lsmean)) +
geom_bar(stat="identity", colour="black") +
geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)
Based on your initial question, if you want to check that the output of the ANOVA is correct, you can simulate data with known group means like this:
d2 <- data.frame(Sample = c(rep(1,10), rep(2,10)),
Days = c(rnorm(10, mean =3), rnorm(10, mean = 8)))
Then,
curve.d2 <- aov(Days ~ Sample, data = d2)
ANOVA2 <- lsmeans(curve.d2, ~Sample)
ANOVA2 <- summary(ANOVA2)
And you get the following output:
> summary(curve.d2)
Df Sum Sq Mean Sq F value Pr(>F)
Sample 1 139.32 139.32 167.7 1.47e-10 ***
Residuals 18 14.96 0.83
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> ANOVA2
Sample lsmean SE df lower.CL upper.CL
1 2.62 0.288 18 2.02 3.23
2 7.90 0.288 18 7.29 8.51
Confidence level used: 0.95
And for the plot
ggplot(ANOVA2, aes(x=as.factor(Sample), y=lsmean)) +
geom_bar(stat="identity", colour="black") +
geom_errorbar(aes(ymin = lsmean - SE, ymax = lsmean + SE), width = .5)
As you can see, the lsmeans for d2 are close to 3 and 8, the means we set in the simulation. So I think your output is correct. The SEs are most likely equal because lsmeans derives them from the pooled residual variance of the model, so samples with the same number of observations get the same SE.
I hope this answer helps you.
Data
Data = data.frame(Sample = c(rep(1,4), rep(2,4),rep(1,4), rep(2,4)),
Replication = c(rep(1,8), rep(2,8)),
Days = c(10,14,13,14,NA,5,18,20,16,NA,18,21,15,7,12,14))

Combining and appending columns of different lengths, by row number, R

I'm working with biochemical data from subjects, analysing the results by sex. I have 19 biochemical tests to analyse for each sex, for each of two drugs (haematology and anatomy tests coming later).
For reasons of reproducibility of results and for preventing transcription errors, I am trying to summarise each test into one table. Included in the table output, I need a column for the Dunnett post hoc comparison p-values. Because the Dunnett test compares to the control results, with a control and 3 drug levels I only get 3 p-values. However, I have 4 mean and sd values.
Using ddply to get the mean and sd results (having limited the number of significant figures), I get a dataset that looks like this:
Sex<- c(rep("F",4), rep("M",4))
Druglevel <- c(rep(0:3,2))
Sample <- c(rep(10,8))
Mean <- c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57)
sd <- c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
Drug1Biochem1 <- data.frame(Sex, Druglevel, Sample, Mean, sd)
I have used glht in the package multcomp to perform the Dunnett tests on the aov object I constructed from undertaking a normal aov. I've extracted the p-values from the glht summary (I've rounded these to three decimal places). The male and female analyses have been run using separate ANOVA so I have one set of output for each sex. The female results are:
femaleR <- c(0.371, 0.973, 0.490)
and the male results are:
maleR <- c(0.862, 0.999, 0.738)
How can I append a column for the p-values to my original dataframe (Drug1Biochem1) so that both femaleR and maleR are in that final column, with row 1 and row 5 of that column empty (i.e. no p-values for the control)?
I wish to output the resulting combination to html, which can be inserted into a Word document so no transcription errors occur. I have set a seed value so that the results of the program are reproducible (when I finally stop debugging).
In summary, I would like a data frame (or table, or whatever I can output to html) that has the following format:
Sex Druglevel Sample Mean   sd p-value
F           0     10 0.44 0.07
F           1     10 0.50 0.07   0.371
F           2     10 0.46 0.09   0.973
F           3     10 0.49 0.12   0.490
M           0     10 0.48 0.18
M           1     10 0.55 0.19   0.862
M           2     10 0.47 0.13   0.999
M           3     10 0.57 0.41   0.738
For each test, I wish to reproduce this exact table. There will always be 4 groups per sex, and there will never be a p-value for the control, which will always be summarised in row 1 (F) and row 5 (M).
You could try merge
dN <- data.frame(Sex=rep(c('M', 'F'), each=3), Druglevel=1:3,
pval=c(maleR, femaleR))
merge(Drug1Biochem1, dN, by=c('Sex', 'Druglevel'), all=TRUE)
# Sex Druglevel Sample Mean sd pval
#1 F 0 10 0.44 0.07 NA
#2 F 1 10 0.50 0.07 0.371
#3 F 2 10 0.46 0.09 0.973
#4 F 3 10 0.49 0.12 0.490
#5 M 0 10 0.48 0.18 NA
#6 M 1 10 0.55 0.19 0.862
#7 M 2 10 0.47 0.13 0.999
#8 M 3 10 0.57 0.41 0.738
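Since the layout is fixed (control first within each sex), the column can also be filled by direct assignment instead of a merge. A minimal base R sketch, assuming the rows are ordered by Druglevel within each sex as in the question; the name result is only illustrative:

```r
# Data from the question
Sex <- c(rep("F", 4), rep("M", 4))
Druglevel <- c(rep(0:3, 2))
Sample <- c(rep(10, 8))
Mean <- c(0.44, 0.50, 0.46, 0.49, 0.48, 0.55, 0.47, 0.57)
sd <- c(0.07, 0.07, 0.09, 0.12, 0.18, 0.19, 0.13, 0.41)
Drug1Biochem1 <- data.frame(Sex, Druglevel, Sample, Mean, sd)
femaleR <- c(0.371, 0.973, 0.490)
maleR <- c(0.862, 0.999, 0.738)

# Start with an all-NA column, then fill the non-control rows of each sex
result <- Drug1Biochem1
result$pval <- NA_real_
result$pval[result$Sex == "F" & result$Druglevel != 0] <- femaleR
result$pval[result$Sex == "M" & result$Druglevel != 0] <- maleR
result
```

This relies on the p-values being in the same Druglevel order as the rows; merge is the safer choice when that ordering is not guaranteed.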

Resources