Related
In the following plot:
I would like to change count in legend to be in percent instead of number of counts. To generate the plot, I have written the following script:
library("RColorBrewer")
df <- read.csv("/home/adam/Desktop/data_norm.csv")
d <- ggplot(df, aes(case1, case2)) + geom_hex(bins = 30) + theme_bw() +
theme(text = element_text(face = "bold", size = 16)) + xlab("Case 2") + ylab("Case 1")
d <- d + scale_fill_gradientn(colors = brewer.pal(3,"Dark2"))
Using dput function to generate a reproducible example:
structure(list(ID = c(14L, 15L, 38L, 6L, 7L, 1L, 32L, 31L,
17L, 30L, 19L, 24L, 5L, 5L, 7L, 8L, 35L, 4L, 1L, 6L, 45L, 58L,
59L, 5L, 11L, 29L, 6L, 7L, 22L, 23L, 3L, 4L, 25L, 3L, 20L, 16L,
21L, 109L, 108L, 54L, 111L, 105L, 114L, 28L, 27L, 2L, 24L, 26L,
50L, 49L, 51L, 48L, 56L, 54L, 53L, 55L, 57L, 52L, 25L, 22L, 34L,
23L, 19L, 38L, 39L, 18L, 13L, 27L, 11L), case1 = c(2L, 0L,
0L, 0L, 4L, 17L, 11L, 7L, 9L, 11L, 14L, 5L, 1L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 26L, 0L, 16L, 0L, 0L, 6L, 4L, 1L, 10L,
3L, 13L, 13L, 12L, 6L, 0L, 0L, 11L, 0L, 0L, 0L, 0L, 3L, 16L,
4L, 3L, 0L, 0L, 0L, 11L, 0L, 0L, 0L, 0L, 0L, 8L, 5L, 7L, 8L,
7L, 4L, 0L, 1L, 15L, 2L, 19L, 2L), case2 = c(30L, 0L, 0L,
0L, 30L, 30L, 29L, 29L, 29L, 29L, 29L, 29L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 0L, 29L, 25L, 30L, 30L, 29L, 0L, 0L, 29L,
29L, 30L, 30L, 30L, 30L, 29L, 29L, 29L, 0L, 3L, 29L, 16L, 14L,
0L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 23L, 29L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 29L, 0L, 30L, 29L, 30L, 29L,
30L)), class = "data.frame", row.names = c(NA, -69L))
How can I change in the script so that I show the number of counts as percentage instead of showing the exact number of counts?
To change the way that the labels for a scale are shown without changing the underlying values, you can pass a reformatting function to the labels= argument of any scale_* function:
plot <- ggplot(df, aes(case1, case2)) + geom_hex(bins = 30) +
theme_bw() +
theme(text = element_text(face = "bold", size = 16)) +
xlab("Case 2") +
ylab("Case 1")
To convert from number of cases to percentage of total cases, we just divide each value by the total number of cases in df:
plot + scale_fill_gradientn(colors = brewer.pal(3,"Dark2"),
labels = function(x) x/nrow(df))
The answers to How can I change the Y-axis figures into percentages in a barplot? provide a number of ways to convert them into proper percentages, but the easiest is to use percent from the scales package (which is included with ggplot2):
plot + scale_fill_gradientn(colors = brewer.pal(3,"Dark2"),
labels = function(x) scales::percent(x/nrow(df)))
If you want to specify breaks so that the scale lists specific round percentages, note that the listed breaks will need to reference the original values, not the transformed percentages. You can do that by reversing whatever conversion you used in labels:
plot + scale_fill_gradientn(colors = brewer.pal(3,"Dark2"),
labels = function(x) scales::percent(x/nrow(df)),
breaks = c(.05, .1, .15, .2, .25, .3) * nrow(df))
I have a dataset with a variable that has a left-skewed distribution (the tail is on the left).
variable <- c(rep(35, 2), rep(36, 4), rep(37, 16), rep(38, 44), rep(39, 72), rep(40, 30))
I just want to make this data have a more normal distribution so I can perform an anova, but using log10, or log2 makes it still way left-skewed. What transformation can I use to make this data more normal?
EDIT: My model is: mod <- lme(reponse ~ variable*variable2, random=~group, data=data), so Kruskal Wallace would work except for the random effect and one predictor term thing. I did a Shapiro Wilk test, and my data is definitely non-normal. If justifiable, I would like to transform my data to give the ANOVA a better chance of detecting a significant result. Either that, or a mixed effect test for non-normal data.
#Ben Bolker - Thank you for your reply; I appreciate it. I did read your answer, but I'm still reading up on exactly what some of your suggestions mean (I'm very new to statistics). My p-value is fairly close to significant and I don't want to p-hack, but I also want to give my data the best chance I can of being significant. If I can't justify transforming my data or using something besides ANOVA, then so be it.
I've provided a dataframe snapshot below. My response variable is "temp.max", the maximum temperature at which a plant dies. My predictor variables are "growth.chamber" (either a 29 or 21 degree growth chamber) and "environment" (either field or forest). My random variable is "groupID" (the group the plants were raised in, consisting of 5-10 individuals). This is a reciprocal transplant experiment, so I raised both forest and field plants in both 21 and 29 degree chambers. What I want to know is if "temp.max" differs between field and forest populations, whether "temp.max" differs between growth chambers, and whether there is any interaction between environment and growth chamber in regards to temp.max. I would very, very much appreciate any help. Thank you.
> dput(data)
structure(list(groupID = structure(c(12L, 12L, 12L, 12L, 12L,
12L, 12L, 12L, 12L, 12L, 14L, 14L, 14L, 14L, 14L, 14L, 14L, 14L,
14L, 14L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 15L, 16L, 16L,
16L, 16L, 16L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L, 19L,
18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 17L, 17L, 17L,
17L, 17L, 17L, 17L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L, 20L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 6L, 6L, 6L, 6L, 6L, 6L, 6L,
6L, 6L, 6L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 7L, 7L, 7L, 7L, 7L, 7L,
7L, 7L, 7L, 7L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 13L, 13L, 13L,
13L, 13L, 13L, 13L, 13L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L,
9L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L,
11L, 11L, 11L, 8L, 8L, 8L, 8L, 8L), .Label = c("GRP_104", "GRP_111",
"GRP_132", "GRP_134", "GRP_137", "GRP_142", "GRP_145", "GRP_147",
"GRP_182", "GRP_192", "GRP_201", "GRP_28", "GRP_31", "GRP_40",
"GRP_68", "GRP_70", "GRP_78", "GRP_83", "GRP_92", "GRP_98"), class = "factor"),
individual = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 16L, 17L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 15L, 16L, 20L, 1L, 2L, 3L,
4L, 5L, 6L, 7L, 16L, 1L, 2L, 3L, 4L, 5L, 11L, 12L, 14L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L,
16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
4L, 5L), temp.max = c(39L, 35L, 39L, 39L, 35L, 40L, 40L,
40L, 40L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 38L, 38L,
38L, 39L, 39L, 40L, 38L, 40L, 39L, 39L, 40L, 40L, 39L, 39L,
39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 39L, 40L, 38L,
40L, 40L, 40L, 40L, 40L, 40L, 39L, 40L, 39L, 39L, 40L, 39L,
39L, 39L, 39L, 38L, 38L, 38L, 38L, 40L, 39L, 39L, 38L, 38L,
39L, 39L, 37L, 39L, 39L, 37L, 39L, 39L, 39L, 39L, 37L, 39L,
39L, 38L, 37L, 38L, 38L, 38L, 36L, 36L, 36L, 37L, 37L, 40L,
39L, 40L, 39L, 39L, 37L, 37L, 38L, 38L, 38L, 37L, 38L, 38L,
38L, 37L, 38L, 38L, 37L, 38L, 40L, 38L, 38L, 38L, 38L, 37L,
38L, 39L, 38L, 38L, 38L, 38L, 38L, 40L, 38L, 40L, 39L, 39L,
39L, 39L, 39L, 39L, 39L, 39L, 39L, 40L, 40L, 39L, 39L, 38L,
37L, 39L, 37L, 39L, 39L, 39L, 39L, 39L, 39L, 40L, 39L, 39L,
40L, 40L, 38L, 40L, 40L, 36L, 38L, 38L, 38L, 38L, 37L, 37L,
38L, 38L, 38L, 39L, 39L), environment = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("field", "forest"), class = "factor"), growth.chamber = c(29L,
29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 29L, 29L, 29L, 29L, 29L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 29L, 29L, 29L, 29L, 29L,
29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 29L, 29L, 29L,
29L, 29L, 29L, 29L, 29L, 29L, 29L, 21L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L, 29L, 29L, 29L, 29L, 29L, 21L, 21L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 29L, 29L, 29L, 29L,
29L, 29L, 29L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L,
21L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 21L, 21L, 21L,
21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L, 21L,
21L, 21L, 21L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L, 29L,
29L, 21L, 21L, 21L, 21L, 21L, 29L, 29L, 29L, 29L, 29L)), .Names = c("groupID",
"individual", "temp.max", "environment", "growth.chamber"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 21L, 22L, 23L, 24L, 25L,
26L, 27L, 28L, 29L, 30L, 41L, 42L, 43L, 44L, 45L, 46L, 47L, 48L,
49L, 58L, 59L, 60L, 61L, 62L, 68L, 69L, 70L, 71L, 72L, 73L, 74L,
75L, 76L, 77L, 88L, 89L, 90L, 91L, 92L, 93L, 94L, 95L, 96L, 97L,
108L, 109L, 110L, 111L, 112L, 113L, 114L, 122L, 123L, 124L, 125L,
126L, 127L, 128L, 129L, 130L, 139L, 140L, 141L, 142L, 143L, 144L,
145L, 146L, 147L, 148L, 158L, 159L, 160L, 161L, 162L, 163L, 164L,
165L, 166L, 167L, 178L, 179L, 180L, 181L, 182L, 188L, 189L, 190L,
191L, 192L, 193L, 194L, 195L, 196L, 197L, 208L, 209L, 210L, 211L,
212L, 213L, 214L, 222L, 223L, 224L, 225L, 226L, 227L, 228L, 229L,
230L, 231L, 242L, 243L, 244L, 245L, 246L, 247L, 248L, 249L, 258L,
259L, 260L, 261L, 262L, 263L, 264L, 265L, 272L, 273L, 274L, 275L,
276L, 277L, 278L, 279L, 280L, 281L, 292L, 293L, 294L, 295L, 296L,
297L, 298L, 299L, 300L, 301L, 312L, 313L, 314L, 315L, 316L, 322L,
323L, 324L, 325L, 326L), class = "data.frame")
tl;dr you probably don't actually need to worry about the skew here.
There are a few issues here, and since they're mostly statistical rather than programming-related, this question is probably more relevant for CrossValidated.
If I copied your data correctly, they're equivalent to this:
dd <- rep(35:40,c(2,4,16,44,72,30))
plot(table(dd))
Your data are discrete - that's why the density plot that #user113156 posts has distinct peaks.
Here are the issues:
the most important is that for most statistical purposes you're not actually interested in the Normality of the marginal distribution, which is what you're showing here. Rather, you want to know whether the distribution of the residuals from a model is Normal or not; for an ANOVA, this is equivalent to asking whether the distribution of values within each group is Normal (and the groups have similar within-group variances).
Normality is not very important; ANOVA is robust to moderate degrees of non-Normality (e.g. see here).
Log transformation modifies your data in the wrong direction (i.e. it will tend to increase the left skewness). In general fixing this kind of left-skewed data requires a transformation like raising to a power >1 (the opposite direction from log- or square root-transformation), but when the values are far from zero it doesn't usually help very much anyway.
Some statistical options if you are worried:
a non-parametric, rank-based test like the Kruskal-Wallis test (the rank-based analogue of 1-way ANOVA)
do an ANOVA, but use a permutation-based approach to test statistical significance.
use an ordinal model
use hierarchical bootstrapping (resample within replacement within and between clusters) to derive more robust confidence intervals on parameters
Your variable follows a discrete distribution. You have integer values ranging from 35 (n=2) to 40 (n=30). I think you need to carry out some ordinal analysis collapsing values from 35 to 37 that have fewer observations in one category. Otherwise you could perform a non-parametric analysis using kruskal.test() function.
I have bad news and good news.
the bad news is that I don't see statistically significant patterns in your data.
the good news is that, given the structure of your experimental design, you can analyze your data much more simply (you don't need mixed models)
load packages, adjust defaults
library(ggplot2); theme_set(theme_bw())
library(dplyr)
check structure of data
Tabulating the data confirms that this is a nested design; each group occurs within a single environment/growth chamber combination.
tt <- with(dd,table(groupID,
interaction(environment,growth.chamber)))
## exactly one non-zero entry per group
all(rowSums(tt>0)==1)
aggregate data
Convert growth.chamber to a categorical variable; collapse each group to its mean temp.max value (and record the number of observations per group)
dda <-(dd
%>% mutate(growth.chamber=factor(growth.chamber))
%>% group_by(groupID,environment,growth.chamber)
%>% summarise(n=n(),temp.max=mean(temp.max))
)
ggplot(dda,aes(growth.chamber,temp.max,
colour=environment))+
geom_boxplot(aes(fill=environment),alpha=0.2)+
geom_point(position=position_dodge(width=0.75),
aes(size=n),alpha=0.5)+
scale_size(range=c(3,7))
Analysis
Now that we've aggregated (without losing any information we care about), we can use a linear regression with weights specifying the number of samples per observation:
m1 <- lm(temp.max~growth.chamber*environment,weights=n,
data=dda)
Checking distribution etc. of residuals:
plot(m1)
This all looks fine; no indication of serious bias, heteroscedasticity, non-Normality, or outliers ...
summary(m1)
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.2558 0.2858 133.845 <2e-16 ***
## growth.chamber29 0.3339 0.4144 0.806 0.432
## environmentforest 0.2442 0.3935 0.620 0.544
## growth.chamber29:environmentforest 0.3240 0.5809 0.558 0.585
## Residual standard error: 1.874 on 16 degrees of freedom
## Multiple R-squared: 0.2364, Adjusted R-squared: 0.09318
## F-statistic: 1.651 on 3 and 16 DF, p-value: 0.2174
Or a coefficient plot (dotwhisker::dwplot(m1))
While the plot of the data doesn't look like it's just noise, the statistical analysis suggests that we can't really distinguish it from noise ...
I have function in R to get mean of anything, which is any column name of dataframe
mean_anything <- function(directory, anything){
files_full <- list.files(directory, full.names=TRUE)
seethis <- lapply(files_full, read.csv)
output <- do.call(rbind, seethis)
mean(output$anything, na.rm=TRUE)
}
If I invoke mean_anything("diet_data", "Age") I get
[1] NA
Warning message:
In mean.default(output$anything, na.rm = TRUE) :
argument is not numeric or logical: returning NA
However, if I replace
mean(output$anything, na.rm=TRUE)
with
mean(output$Age, na.rm=TRUE)
Then the function will output [1] 36.4
I tried using single and double quotes around anything, I tried output[anything], how to fix?
dput(output)
structure(list(Patient.Name = structure(c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L,
5L), .Label = c("Andy", "David", "John", "Mike", "Steve"), class = "factor"),
Age = c(30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L,
30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 35L, 35L, 35L, 35L,
35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L,
35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L, 35L,
35L, 35L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L,
22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L,
22L, 22L, 22L, 22L, 22L, 22L, 22L, 22L, 40L, 40L, 40L, 40L,
40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L,
40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L, 40L,
40L, 40L, 55L, 55L, 55L, 55L, 55L, 55L, 55L, 55L, 55L, 55L,
55L, 55L, 55L, 55L, 55L, 55L, 55L, 55L, 55L, 55L, 55L, 55L,
55L, 55L, 55L, 55L, 55L, 55L, 55L, 55L), Weight = c(140L,
140L, 140L, 139L, 138L, 138L, 138L, 138L, 138L, 138L, 138L,
138L, 137L, 137L, 138L, 139L, 139L, 137L, 137L, 137L, 137L,
137L, 137L, 135L, 135L, 135L, 135L, 135L, 135L, 135L, 210L,
209L, 209L, 209L, 209L, 209L, 209L, 208L, 208L, 208L, 208L,
208L, 208L, 207L, 206L, 206L, 206L, 205L, 205L, 205L, 205L,
204L, 204L, 204L, 203L, 203L, 202L, 202L, 202L, 201L, 175L,
175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L, 175L,
175L, 175L, 175L, 175L, 175L, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, 177L, 188L, 188L, 188L, 188L, 189L,
189L, 189L, 189L, 189L, 189L, 189L, 189L, 189L, 189L, 190L,
190L, 190L, 190L, 190L, 190L, 190L, 190L, 190L, 192L, 192L,
192L, 192L, 192L, 192L, 192L, 225L, 225L, 225L, 224L, 224L,
224L, 223L, 223L, 223L, 223L, 223L, 222L, 221L, 221L, 221L,
220L, 220L, 219L, 219L, 219L, 218L, 217L, 217L, 217L, 216L,
215L, 215L, 214L, 214L, 214L), Day = c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L,
23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L,
15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L,
27L, 28L, 29L, 30L)), .Names = c("Patient.Name", "Age", "Weight",
"Day"), row.names = c(NA, 150L), class = "data.frame")
I solved it.
mean(output[,anything], na.rm=TRUE)
i.e.
mean_anything <- function(directory, anything){
files_full <- list.files(directory, full.names=TRUE)
seethis <- lapply(files_full, read.csv)
output <- do.call(rbind, seethis)
mean(output[,anything], na.rm=TRUE)
}
Where anything is any column in the dataframe
I have a data.frame where the first 13 rows contain site/observation information. Each column represents 1 individual, however most individuals have an A and B observation (although some only have A while a few have an A, B, and C observation). I'd like to average each row for every individual, and create a new data.frame from this information.
Example (small subset with row 1, row 7, row 13, and row 56-61):
OriginalID Tree003A Tree003B Tree008B Tree013A
1 Township LY LY LY LY
7 COFECHA ID LY1A003A LY1A003B LY1A008B LY1A013A
13 PathLength 37.5455 54.8963 57.9732 64.0679
56 2006 1.538 1.915 0.827 2.722
57 2007 1.357 1.923 0.854 2.224
58 2008 1.311 2.204 0.669 2.515
59 2009 0.702 1.125 0.382 2.413
60 2010 0.937 1.556 0.907 2.315
61 2011 0.942 1.268 1.514 1.858
I'd like to create a new data.frame that averages each individual's annual observations, whether they have an A, A and B, or A B and C observation. Individual's IDs are in Row 7 (COFECHA ID):
Intended Output:
OriginalID Tree003avg Tree008avg Tree013avg
1 Township LY LY LY
7 COFECHA ID LY1A003avg LY1A008avg LY1A013avg
13 PathLength 46.2209 57.9732 64.0679
56 2006 1.727 0.827 2.722
57 2007 1.640 0.854 2.224
58 2008 1.758 0.669 2.515
59 2009 0.914 0.382 2.413
60 2010 1.247 0.907 2.315
61 2011 1.105 1.514 1.858
Any ideas on how to average the columns would be great. I've been trying to modify the following code, but due to the 13 rows of additional information at the top of the data.frame, I didn't know how to specify to only average rows 14:61.
rowMeans(subset(LY011B, select = c("LY1A003A", "LY1A003B")), na.rm=TRUE)
The code for a larger set of the data that I'm working with is:
> dput(LY011B)
structure(list(OriginalTreeID = structure(c(58L, 53L, 57L, 59L,
51L, 61L, 50L, 55L, 56L, 60L, 54L, 49L, 52L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L,
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L,
45L, 46L, 47L, 48L), .Label = c("1964", "1965", "1966", "1967",
"1968", "1969", "1970", "1971", "1972", "1973", "1974", "1975",
"1976", "1977", "1978", "1979", "1980", "1981", "1982", "1983",
"1984", "1985", "1986", "1987", "1988", "1989", "1990", "1991",
"1992", "1993", "1994", "1995", "1996", "1997", "1998", "1999",
"2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
"2008", "2009", "2010", "2011", "AnalysisDateTime", "COFECHA ID",
"CoreLetter", "PathLength", "Plot#", "RingCount", "SiteID", "SP",
"Subplot#", "Township", "Tree#", "YearLastRing", "YearLastWhiteWood"
), class = "factor"), Tree003A = structure(c(35L, 8L, 34L, 7L,
34L, 21L, 36L, 31L, 37L, 30L, 32L, 29L, 33L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 23L, 22L, 25L, 28L, 27L, 24L, 26L, 20L, 16L,
15L, 6L, 18L, 12L, 10L, 3L, 9L, 11L, 19L, 17L, 14L, 13L, 2L,
4L, 5L), .Label = c("", "0.702", "0.803", "0.937", "0.942", "0.961",
"003", "1", "1.09", "1.116", "1.124", "1.224", "1.311", "1.357",
"1.471", "1.509", "1.538", "1.649", "1.679", "1.782", "1999",
"2.084", "2.148", "2.162", "2.214", "2.313", "2.429", "2.848",
"2/19/2014 11:06", "2011", "23017323011sp1", "24", "37.5455",
"A", "LY", "LY1A003A", "sp1"), class = "factor"), Tree003B = structure(c(56L,
19L, 54L, 18L, 55L, 49L, 57L, 51L, 58L, 50L, 52L, 48L, 53L, 1L,
1L, 1L, 1L, 10L, 7L, 8L, 6L, 5L, 4L, 3L, 2L, 11L, 9L, 30L, 15L,
24L, 20L, 23L, 33L, 37L, 42L, 13L, 44L, 36L, 12L, 16L, 21L, 27L,
35L, 41L, 38L, 26L, 40L, 14L, 46L, 32L, 28L, 17L, 31L, 22L, 39L,
43L, 45L, 47L, 25L, 34L, 29L), .Label = c("", "0.073", "0.092",
"0.173", "0.174", "0.358", "0.413", "0.425", "0.58", "0.697",
"0.719", "0.843", "0.883", "0.896", "0.937", "0.941", "0.964",
"003", "1", "1.048", "1.067", "1.075", "1.097", "1.119", "1.125",
"1.176", "1.207", "1.267", "1.268", "1.27", "1.297", "1.402",
"1.429", "1.556", "1.662", "1.693", "1.704", "1.735", "1.76",
"1.792", "1.816", "1.881", "1.915", "1.92", "1.923", "2.155",
"2.204", "2/19/2014 11:06", "2000", "2011", "23017323011sp1",
"48", "54.8963", "A", "B", "LY", "LY1A003B", "sp1"), class = "factor"),
Tree008B = structure(c(59L, 24L, 57L, 23L, 58L, 52L, 60L,
54L, 61L, 53L, 55L, 51L, 56L, 19L, 14L, 13L, 22L, 7L, 8L,
9L, 4L, 6L, 3L, 1L, 2L, 10L, 25L, 47L, 43L, 49L, 46L, 40L,
50L, 48L, 44L, 17L, 36L, 31L, 27L, 30L, 39L, 37L, 34L, 45L,
38L, 32L, 41L, 29L, 42L, 33L, 28L, 26L, 21L, 11L, 15L, 16L,
18L, 12L, 5L, 20L, 35L), .Label = c("0.302", "0.31", "0.318",
"0.357", "0.382", "0.412", "0.452", "0.476", "0.5", "0.539",
"0.591", "0.669", "0.673", "0.787", "0.79", "0.827", "0.835",
"0.854", "0.879", "0.907", "0.917", "0.967", "008", "1",
"1.027", "1.037", "1.141", "1.152", "1.172", "1.263", "1.383",
"1.411", "1.446", "1.498", "1.514", "1.611", "1.671", "1.685",
"1.695", "1.719", "1.783", "1.879", "1.884", "1.927", "1.97",
"2.019", "2.069", "2.35", "2.696", "2.979", "2/19/2014 11:06",
"2000", "2011", "23017323011sp1", "48", "57.9732", "A", "B",
"LY", "LY1A008B", "sp1"), class = "factor"), Tree013A = structure(c(45L,
6L, 44L, 5L, 44L, 38L, 46L, 40L, 47L, 39L, 42L, 37L, 43L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 10L,
13L, 8L, 22L, 14L, 18L, 24L, 4L, 11L, 25L, 7L, 36L, 41L,
33L, 29L, 17L, 28L, 23L, 21L, 16L, 26L, 15L, 3L, 20L, 12L,
2L, 9L, 34L, 35L, 27L, 32L, 31L, 30L, 19L), .Label = c("",
"0.608", "0.916", "0.945", "013", "1", "1.125", "1.18", "1.388",
"1.423", "1.493", "1.498", "1.554", "1.579", "1.619", "1.629",
"1.719", "1.756", "1.858", "1.867", "1.869", "1.876", "1.9",
"1.916", "2.023", "2.089", "2.224", "2.246", "2.247", "2.315",
"2.413", "2.515", "2.547", "2.645", "2.722", "2.785", "2/19/2014 11:11",
"2002", "2011", "23017323011sp1", "3.375", "34", "64.0679",
"A", "LY", "LY1A013A", "sp1"), class = "factor")), .Names = c("OriginalTreeID",
"Tree003A", "Tree003B", "Tree008B", "Tree013A"), row.names = c(NA,
61L), class = "data.frame")
Here is another approach where most of the work is done
by rearranging the data with the reshape package.
After the data is "munged", it can be rearranged into almost anything
you want with the cast function.
# I'm used to the transpose
y = t(x)
# Make the first row the column names
# Also get rid of hashes. They make things difficult
library(stringr)
colnames(y) = str_replace( y[1,], "#", "" )
y = data.frame(y[-1,],check.names=FALSE)
# reshape the data by defining the "ID" variables
library(reshape)
z = melt(y,id.vars=c("Township","Plot","Subplot","Tree",
"CoreLetter","COFECHA ID","SiteID","SP","AnalysisDateTime"))
z$value = as.numeric(as.character(z$value))
# Now you can do lots of things!
# All the info you wanted is there, but it's in a different format
# than your "intended output"
cast( z, Tree ~ variable, mean, na.rm=TRUE )
# To get to your "intended output"
out = cast( z, Township + Plot + Subplot + Tree ~ variable, mean, na.rm=TRUE )
out[["COFECHA ID"]] = with(out,paste0(Township,Plot,Subplot,Tree,"avg"))
out2 = out[,c(1,ncol(out),8:(ncol(out)-1))]
out3 = cbind(colnames(out2),t(out2))
colnames(out3) = c("OriginalID",paste0("Tree",out$Tree,"avg"))
# For kicks, here are some other things. Have fun!
cast(z, Tree ~ variable, median, na.rm=TRUE ) # the median instead of the mean
cast(z, Tree + CoreLetter ~ variable ) # back to your original data
cast(z, CoreLetter ~ variable, length ) # How many measurements from each core?
cast(z, CoreLetter ~ variable, mean ) # The average across different cores
For even more fun!
library(ggplot2)
d = z[-c(1:16), ] # A not so pretty hack
colnames(d)[10] = "Year"
d$Year = as.integer(as.character(d$Year))
ggplot(d, aes(x=Year, y=value, group=Tree, color=Tree, shape=CoreLetter)) +
geom_point() + geom_smooth(method="loess",span=0.3)
Does this mean that early 2000's were dry?
try this.....
d.f <- your data structure...above
subset the data
d.f <- d[-(1:13), -1]
c.n <- colnames(d.f)
build the grouping var
f <- gsub(".?$", "", c.n)
f <- d[4, 2:ncol(d)]
split the dataframeinto sub-dataframes
d.f <- apply(d.f, 2, as.numeric)
d.f[is.na(d.f)] <- 0
d.f.g <- as.data.frame(t(d.f))
a <- split(d.f.g, f)
calculate the groupwise averages as colMeans (because transposed)
grp.means <- lapply(a, colMeans)
the grp.means is a list of dataframes each containing the date averages for each grp. re-form this as required, you'll probably want to transpose again.
I have the following data:
points <- structure(list(A = structure(c(1L, 1L, 2L, 2L, 3L, 4L,
4L, 5L, 5L, 6L, 6L, 7L, 8L, 9L, 9L, 10L, 11L, 12L, 12L, 13L,
14L, 15L, 16L, 16L, 17L, 17L, 18L, 18L, 19L, 19L, 20L, 21L, 21L,
22L, 23L, 24L, 24L, 25L, 26L, 26L, 27L, 28L, 29L, 30L, 30L, 31L,
32L, 32L, 33L, 34L, 34L, 35L, 35L, 36L, 36L, 37L, 37L, 38L, 38L,
39L, 39L, 40L, 41L), .Label = c("00017dd3-f55e-e011-854c-00237de2db9e",
"0005f624-565a-e011-854c-00237de2db9e", "0007b82f-bfe0-4b55-963e-be5a2a1e7f7b",
"00095b52-fd0a-e011-9264-00237de2db9e", "00098835-9554-4898-8d4b-82d42b8b4464",
"000a727f-8334-e011-854c-00237de2db9e", "000c0a31-f459-4365-aa3a-1978deb89f67",
"000e36a4-6e56-4851-8d36-2caf0bdd63ec", "000f05a6-cf94-4518-8de7-1773cbea8198",
"00105574-a775-43e8-8472-c8b294e46786", "00112a96-3c47-409c-83bd-6f30d8d77100",
"0012f133-f68e-e011-986b-78e7d1fa76f8", "0012f899-1c45-4917-90b7-11bea31e467e",
"0014606b-17b7-46d6-957f-e23b43fcc773", "001478e2-3e50-486c-ae3b-d1ceb36f0fd0",
"00159bab-ce82-454a-9343-f7d8f1500a68", "0015b84e-a48d-443e-936e-cabdb80604dc",
"0018f8ba-c289-4483-bf74-5cd0e6c6ae9e", "0019487f-f31e-4e3e-b499-fd48077f71f9",
"00199523-c42f-47fd-a44a-066fb726f6dd", "0019dace-41e1-439f-8b73-328d02537fe7",
"001a346e-2a15-45d4-9fb1-6b4e2448d362", "001b0c90-5c86-4290-bad3-0d6794a6bfe8",
"001c0d0d-3059-e011-854c-00237de2db9e", "001c9cbb-8c79-4cbf-bc50-219a70ab20b8",
"001dcf83-7492-e011-986b-78e7d1fa76f8", "001dd5cf-3e3b-4ceb-823c-346c15f88878",
"001e0ef7-b977-436a-ab20-8c4af4f5b230", "001fc407-da48-4c42-9325-7756b160cbbd",
"001fdaa1-9471-e011-81d2-78e7d1fa76f8", "0020029f-2667-4c03-b99f-d803eccd27d4",
"00218e00-896e-e011-81d2-78e7d1fa76f8", "002196af-60c7-4baf-abdb-589b3a481686",
"0021a908-7ff6-df11-9264-00237de2db9e", "0021bced-909a-e011-986b-78e7d1fa76f8",
"0021f0fb-cb9f-e011-986b-78e7d1fa76f8", "00228254-9b20-4d40-a4a5-a7c608f81dfa",
"002357ba-5656-4308-bb92-6cc97f50d7aa", "0025eafd-a64f-e011-854c-00237de2db9e",
"0026b36c-ebc2-43f0-a0f7-72f43b70530b", "00277e09-543e-449a-8571-38f71a21cee2"
), class = "factor"), B = structure(c(10L, 10L, 27L,
27L, 28L, 23L, 23L, 38L, 38L, 24L, 24L, 19L, 35L, 26L, 26L, 28L,
5L, 36L, 36L, 21L, 11L, 1L, 14L, 14L, 4L, 4L, 9L, 9L, 16L, 16L,
3L, 7L, 7L, 13L, 37L, 17L, 17L, 29L, 15L, 15L, 12L, 31L, 32L,
8L, 8L, 2L, 30L, 30L, 39L, 6L, 6L, 22L, 22L, 20L, 20L, 34L, 34L,
18L, 18L, 33L, 33L, 25L, 29L), .Label = c("Aashu", "Actonica Studio",
"appyminds", "blackink", "BroeckiE", "Challenge Solutions LLC",
"CPP_MSP", "Datentechnik Innovation GmbH", "DerekM", "Dimension Srl",
"Dmitry Kazarin", "edg3", "fruitymo", "Geckosan", "Genera Interactive SL",
"HandyWare", "Infinite Square", "JTO.C Sq.", "JuJuZ", "Kitten Flavour",
"Krofita", "Mark Agholor", "MCTronix.com", "Michael Snow", "michaloxo",
"mobilewares.net", "NotoMedia LLC", "OKR", "P.F. CHAUVET", "Panoylhs",
"Pratik Gandhi", "raavr", "ReadBooks", "RGP", "Seesmic", "The KeitaCorp",
"viileetek", "Violineage", "Yalla Apps"), class = "factor"),
Date = structure(c(1302926400, 1302926400, 1302408000,
1302408000, 1327467600, 1292994000, 1292994000, 1322370000,
1322370000, 1297486800, 1297486800, 1326949200, 1321333200,
1314763200, 1314763200, 1328418000, 1327381200, 1307505600,
1307505600, 1325221200, 1324530000, 1327381200, 1326862800,
1326862800, 1326171600, 1326171600, 1325566800, 1325566800,
1327122000, 1327122000, 1320379200, 1324702800, 1324702800,
1327726800, 1327986000, 1301544000, 1301544000, 1332302400,
1308369600, 1308369600, 1325912400, 1331611200, 1325912400,
1304481600, 1304481600, 1325653200, 1304395200, 1304395200,
1322542800, 1294117200, 1294117200, 1309147200, 1309147200,
1309320000, 1309320000, 1313208000, 1313208000, 1325739600,
1325739600, 1300334400, 1300334400, 1325826000, 1321938000
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("A",
"B", "Date"), row.names = c(NA, -63L), class = "data.frame")
I am trying to draw a scatter plot with the Y-axis sorted based on the date field. However, I am only able to do the following where I treat A as a factor. Any suggestions on how to achieve this?
p = ggplot(points, aes(Date, factor(A))) +
geom_point(aes(colour=factor(A)), size=1.8) +
scale_shape(solid=FALSE) +
scale_y_discrete("", breaks=NA)
The error comes when you ask ggplot to consider a date value as a discrete variable after just swapping the x and y positions. It goes away when you remove that:
p = ggplot(points, aes(x=A, y=Date)) +
geom_point(aes(colour=factor(A)), size=1.8) +
scale_shape(solid=FALSE)
p
Or you can get rid of the lines by applying the discrete axis call to the x-axis:
p = ggplot(points, aes(x=A, y=Date)) +
geom_point(aes(colour=factor(A)), size=1.8) +
scale_shape(solid=FALSE) +
scale_x_discrete("", breaks=NULL)
P
Unfortunately the title of the question doesn't seem to have a very clear connection to the text of the question, so I cannot tell if this was what you were asking for.