summarize data from csv using R - r

I'm new to R, and I wrote some code to summarize data from .csv file according to my needs.
here is the code.
raw <- read.csv("trees.csv")
looks like this
SNAME CNAME FAMILY PLOT INDIVIDUAL CAP H
1 Alchornea triplinervia (Spreng.) M. Arg. Tainheiro Euphorbiaceae 5 176 15 9.5
2 Andira fraxinifolia Benth. Angelim Fabaceae 3 321 12 6.0
3 Andira fraxinifolia Benth. Angelim Fabaceae 3 326 14 7.0
4 Andira fraxinifolia Benth. Angelim Fabaceae 3 327 18 5.0
5 Andira fraxinifolia Benth. Angelim Fabaceae 3 328 12 6.0
6 Andira fraxinifolia Benth. Angelim Fabaceae 3 329 21 7.0
#add 2 other rows
for (i in 1:nrow(raw)) {
raw$VOLUME[i] <- treeVolume(raw$CAP[i],raw$H[i])
raw$BASALAREA[i] <- treeBasalArea(raw$CAP[i])
}
#here comes.
I need a new data frame, with the mean of columns H and CAP and the sums of columns VOLUME and BASALAREA. This dataframe is grouped by column SNAME and subgrouped by column PLOT.
plotSummary = merge(
aggregate(raw$CAP ~ raw$SNAME * raw$PLOT, raw, mean),
aggregate(raw$H ~ raw$SNAME * raw$PLOT, raw, mean))
plotSummary = merge(
plotSummary,
aggregate(raw$VOLUME ~ raw$SNAME * raw$PLOT, raw, sum))
plotSummary = merge(
plotSummary,
aggregate(raw$BASALAREA ~ raw$SNAME * raw$PLOT, raw, sum))
The functions treeVolume and treeBasal area just return numbers.
treeVolume <- function(radius, height) {
return (0.000074230*radius**1.707348*height**1.16873)
}
treeBasalArea <- function(radius) {
return (((radius**2)*pi)/40000)
}
I'm sure that there is a better way of doing this, but how?

I can't manage to read your example data in, but I think I've made something that generally represents it...so give this a whirl. This answer builds off of Greg's suggestion to look at plyr and the functions ddply to group by segments of your data.frame and numcolwise to calculate your statistics of interest.
#Sample data
set.seed(1)
dat <- data.frame(sname = rep(letters[1:3],2), plot = rep(letters[1:3],2),
CAP = rnorm(6),
H = rlnorm(6),
VOLUME = runif(6),
BASALAREA = rlnorm(6)
)
#Calculate mean for all numeric columns, grouping by sname and plot
library(plyr)
ddply(dat, c("sname", "plot"), numcolwise(mean))
#-----
sname plot CAP H VOLUME BASALAREA
1 a a 0.4844135 1.182481 0.3248043 1.614668
2 b b 0.2565755 3.313614 0.6279025 1.397490
3 c c -0.8280485 1.627634 0.1768697 2.538273
EDIT - response to updated question
Ok - now that your question is more or less reproducible, here's how I'd approach it. First of all, you can take advantage of the fact that R is a vectorized meaning that you can calculate ALL of the values from VOLUME and BASALAREA in one pass, without looping through each row. For that bit, I recommend the transform function:
dat <- transform(dat, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))
Secondly, realizing that you intend to calculate different statistics for CAP & H and then VOLUME & BASALAREA, I recommend using the summarize function, like this:
ddply(dat, c("sname", "plot"), summarize,
meanCAP = mean(CAP),
meanH = mean(H),
sumVOLUME = sum(VOLUME),
sumBASAL = sum(BASALAREA)
)
Which will give you an output that looks like:
sname plot meanCAP meanH sumVOLUME sumBASAL
1 a a 0.5868582 0.5032308 9.650184e-06 7.031954e-05
2 b b 0.2869029 0.4333862 9.219770e-06 1.407055e-05
3 c c 0.7356215 0.4028354 2.482775e-05 8.916350e-05
The help pages for ?ddply, ?transform, ?summarize should be insightful.

Look at the plyr package. I will split the data by the SNAME variable for you, then you give it code to do the set of summaries that you want (mixing mean and sum and whatever), then it will put the pieces back together for you. You probably want either the 'ddply' or the 'daply' function in that package.

Related

Creating a loop with compare_means

I am trying to create a loop to use compare_means (ggpubr library in R) across all columns in a dataframe and then select only significant p.adjusted values, but it does not work well.
Here is some code
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10 10 2.2 55 13 HIGH
gut12 20 44 67 12 HIGH
gut34 5.5 3 89 33 LOW
gut26 4 45 23 4 LOW
library(ggpubr)
data<-list()
for (i in 2:length(df3)){
data<-compare_means(df3[[i]] ~ Cluster2, data=df3, paired = FALSE,p.adjust.method="bonferroni",method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to create an output to convert in dataframe with all the information contained in compare_means output
Thanks a lot
Try this:
library(ggpubr)
data<-list()
for (i in 2:(length(df3)-1)){
new<-df3[,c(i,"Cluster2")]
colnames(new)<-c("interest","Cluster2")
data<-compare_means(interest ~ Cluster2, data=new, paired = FALSE,p.adjust.method="bonferroni",method = "wilcox.test")
}

How to maintain the order of elements of a row when using by and rbind function in r?

I have written a function which takes a subset of data based on the value of name column.It Computes the outlier for column "mark" and replaces all the outliers.
However when I try to combine these different subsets, the order of my elements changes. Is there any way by which I can maintain the order of my elements in the column "mark"
My data set is:
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0
The function which I have written is :
data.frame(do.call("rbind", as.list(by(data, data$name,
function(x){apply(x[, .(mark)],2,
function(y) {y[y > (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))]
<- (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark));y})}))))
The result of the above function is the first column below (I've manually added back name for illustratory purposes):
mark NAME
100.000 ----- A
50.000 ----- A
210.000 ----- A
0.500 ----- B
90.000 ----- B
839.625 ----- B
100.000 ----- C
1200.000 ----- C
4875.000 ----- C
In the above result, the order of the values for mark column are changed. Is there any way by which I can maintain the order of the elements ?
Are you sure that code is doing what you think it is?
It looks like you're replacing any value greater than the median (third returned value of quantile) with the median + 1.5*IQR. Maybe that's what you intend, I don't know. The bigger problem is that you're doing that in an apply function, so it's going to re-calculate that median and IQR each iteration, updated with the previous rows already being changed. I'd wager that's not what you intend, but I suppose I've seen stranger.
A better option might be to create an external function to do the work, which takes in all of the data, does the calculation, then outputs all the data. I like dplyr for this simply because it's clean.
Reading your data in (why the "----"?)
scores <- read.table(text="
name mark
A 100.0
B 0.5
C 100.0
A 50.0
B 90.0
B 1000.0
C 1200.0
C 5000.0
A 210.0", header=TRUE)
and creating a function that does something a little more sensible; replaces any value greater than the 75% quantile (referenced by name so you know what it is) or less than the 25% quantile with that limiting value
scale_outliers <- function(data) {
lim <- quantile(data, na.rm = TRUE)
data[data > lim["75%"]] <- lim["75%"]
data[data < lim["25%"]] <- lim["25%"]
return(data)
}
Chaining this processing into dplyr::mutate is neat, and can then be passed on to ggplot. Here's the original data
gg1 <- scores %>% ggplot(aes(x=name, y=mark))
gg1 <- gg1 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg1
And if we alter it with the new function we get the data back without rows changed around
scores %>% mutate(new_mark = scale_outliers(mark))
#> name mark new_mark
#> 1 A 100.0 100
#> 2 B 0.5 90
#> 3 C 100.0 100
#> 4 A 50.0 90
#> 5 B 90.0 90
#> 6 B 1000.0 1000
#> 7 C 1200.0 1000
#> 8 C 5000.0 1000
#> 9 A 210.0 210
and we can plot that,
gg2 <- scores %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg2 <- gg2 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg2
Best of all, if you now want to do that quantile comparison group-wise (say, by the name column, it's as easy as using dplyr::group_by(name),
gg3 <- scores %>% group_by(name) %>% mutate(new_mark = scale_outliers(mark)) %>% ggplot(aes(x=name, y=new_mark))
gg3 <- gg3 + geom_point() + geom_boxplot() + coord_cartesian(ylim=range(scores$mark))
gg3
A slightly refactored version of Hack-R's answer -- you can add a index to your data.table:
data <- data.table(name = c("A", "B","C", "A","B","B","C","C","A"),mark = c(100,0.5,100,50,90,1000,1200,5000,210))
data[,i:=.I]
Then you perform your calculation but you keep the name and i:
df <- data.frame(do.call("rbind", as.list(
by(data, data$name,
function(x) cbind(i=x$i,
name=x$name,
apply(x[, .(mark)], 2,function(y) {y[y > (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark))] <- (quantile(x$mark, na.rm=TRUE)[[3]][[1]] + 1.5 * IQR(x$mark));y})
)))))
And finally you order using the index:
df[order(df$i),]
i name mark
1 1 A 100
4 2 B 0.5
7 3 C 100
2 4 A 50
5 5 B 90
6 6 B 839.625
8 7 C 1200
9 8 C 4875
3 9 A 210

How to make subgroups out of a data matrix with continuous and categorical data in R

I have a data set:
and wish to check assumptions. I would like to check normal distribution and deviation, and wish to make subgroups. For example, I would like to check the distribution of my 100 group for each of the variables listed (OM,redox,etc.). Is there a way I can make a "100" group for these variables and then test them??
Thank you for your help!
You can apply functions to subgroups using tapply:
set.seed(10)
df <- data.frame(id = 100:104,
redox = rnorm(25,mean = 20,sd = 10),
depth = runif(25,min = 10,max = 30))
tapply(df$redox,df$id,sd)
Which results in
> tapply(df$redox,df$id,sd)
100 101 102 103 104
6.181492 11.067056 4.863818 14.269076 7.962710
If you want to run a test on multiple columns simultaneously, use aggregate:
aggregate(df[,2:3],by = list(df$id),sd)
Which gives:
Group.1 redox depth
1 100 6.181492 6.319090
2 101 11.067056 5.869627
3 102 4.863818 2.808336
4 103 14.269076 3.438697
5 104 7.962710 6.296606
To test for normality, you can use shapiro.test:
aggregate(df[,2:3],by = list(df$id),function(x) shapiro.test(x)$statistic)

Calculate the mean per subject and repeat the value for each subject's row

This is the first time that I ask a question on stack overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observation. Basically, I have 83 subjects and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate (RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate (RT ~ su, data, mean)
as R returns this error:
Error in $<-.data.frame(tmp, "mean", value = list(su = 1:83, RT
= c(378.1328125, : replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]
Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
Simply merge your aggregated means data with full dataframe joined by the subject:
aggdf <- aggregate (RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
df <- merge(df, aggdf, by="su")
Another compelling way of handling this without generating extra data objects is by using group_by of dplyr package:
# Generating some data
data <- data.table::data.table(
su = sample(letters[1:5], size = 14, replace = TRUE),
RT = rnorm(14))[order(su)]
# Performing
> data %>% group_by(su) %>%
+ mutate(Mean = mean(RT)) %>%
+ ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978

Calculation using two subsets of variable column in long data frame with R Reshape

I have a dataframe that has two sets of data that I need to multiply for a calculation. A simple version would be
sample = data.frame(apples=c(10,20,25,30,40,NA,NA,15))
sample$oranges = c(25,60,90,86,10,67,45,10)
sample$oats = c(65,75,85,95,105,115,125,135)
sample$eggs = c(23,22,21,20,19,18,17,16)
sample$consumer =c('john','mark','luke','paul','peter','thomas','matthew','brian')
sample$mealtime = c('breakfast','lunch','lunch','snack','lunch','breakfast','snack','dinner')
s1 = melt(sample,id.vars=c(5,6),measure.vars=c(1:4))
and what I'm trying to do is something along the lines of
s2 = dcast(s1, mealtime ~ ., function(x) (x[variable == 'oranges'] * x[variable =='apples'])/sum(x[variable == 'apples'])
In practice its a much longer data.frame and a more elaborate calculation but the principle should be the same. Thanks -- first post to SO so apologies for any errors.
The output would be a data frame that has mealtimes as the Id var and the apple weighted average of the orange data as the values for each mealtime.
Something along the lines of
Group.1 x
1 breakfast 1.785714
2 dinner 1.071429
3 lunch 27.500000
4 snack 18.428571
This was calculated using
sample$wa = sample$oranges*sample$apples/sum(sample$apples)
aggregate(sample$wa,by=list(sample$mealtime),sum,na.rm=T)
which feels off mathematically but was meant to be a quick kludgy approximation.
This is a much better task for plyr than it is for reshape.
library(plyr)
s1<-ddply(sample,.(mealtime), function(x) {return(sum(x$apples,x$oranges))})
And now you have clarified the output:
ddply(sample,.(mealtime), summarize,
wavg.oranges = sum(apples * oranges, na.rm=TRUE) / sum(apples, na.rm=TRUE))
# mealtime wavg.oranges
# 1 breakfast 25.00000
# 2 dinner 10.00000
# 3 lunch 45.29412
# 4 snack 86.00000

Resources