Creating a loop with compare_means - r

I am trying to create a loop to use compare_means (ggpubr library in R) across all columns in a dataframe and then select only significant p.adjusted values, but it does not work well.
Here is some code
head(df3)
sampleID Actio Beta Gammes Traw Cluster2
gut10 10 2.2 55 13 HIGH
gut12 20 44 67 12 HIGH
gut34 5.5 3 89 33 LOW
gut26 4 45 23 4 LOW
library(ggpubr)
data<-list()
for (i in 2:length(df3)){
data<-compare_means(df3[[i]] ~ Cluster2, data=df3, paired = FALSE,p.adjust.method="bonferroni",method = "wilcox.test")
}
Error: `df3[i]` must evaluate to column positions or names, not a list
I would like to create an output to convert in dataframe with all the information contained in compare_means output
Thanks a lot

Try this:
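library(ggpubr)
data <- list()
for (i in 2:(length(df3) - 1)) {
  # select by name: c(i, "Cluster2") would coerce i to the string "2"
  new <- df3[, c(names(df3)[i], "Cluster2")]
  colnames(new) <- c("interest", "Cluster2")
  # store each result under its column name instead of overwriting data
  data[[names(df3)[i]]] <- compare_means(interest ~ Cluster2, data = new, paired = FALSE, p.adjust.method = "bonferroni", method = "wilcox.test")
}
# combine all results into one data frame, keeping the column name as an id
result <- dplyr::bind_rows(data, .id = "variable")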
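As a cross-check that does not depend on ggpubr, the same per-column test can be sketched in base R. This is a minimal sketch, assuming a toy df3 shaped like the one in the question; the names vars, pvals, results and significant are illustrative and not part of any package. It also makes one point explicit: the Bonferroni correction is applied once across all columns, because an adjustment made inside a call that performs a single comparison has nothing to correct over.

```r
# Toy data mirroring the question's df3
df3 <- data.frame(
  sampleID = c("gut10", "gut12", "gut34", "gut26"),
  Actio    = c(10, 20, 5.5, 4),
  Beta     = c(2.2, 44, 3, 45),
  Gammes   = c(55, 67, 89, 23),
  Traw     = c(13, 12, 33, 4),
  Cluster2 = c("HIGH", "HIGH", "LOW", "LOW")
)

# Every column except the id and the grouping variable
vars <- setdiff(names(df3), c("sampleID", "Cluster2"))

# One unpaired Wilcoxon rank-sum test per measurement column
pvals <- sapply(vars, function(v) {
  groups <- split(df3[[v]], df3$Cluster2)
  wilcox.test(groups[[1]], groups[[2]], paired = FALSE)$p.value
})

# Adjust across all columns at once, then collect everything in a data frame
results <- data.frame(
  variable = vars,
  p        = unname(pvals),
  p.adj    = unname(p.adjust(pvals, method = "bonferroni"))
)

# Keep only the rows that remain significant after adjustment
significant <- results[results$p.adj < 0.05, ]
```

With only two samples per group, as in the toy data, no test can reach significance, so significant comes back empty; the real data would need more rows per cluster.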

How to use Pivottabler on concatenated strings in R

I have a table that looks like this-
LDAutGroup PatientDays ExposedDays sex Ageband DrugGroup Prop LowerCI UpperCI concat
Group1 100 23 M 5 to 10 PSY 23 15.84 32.15 23 (15.84 -32.15) F
Group2 500 56 F 11 to 17 HYP 11.2 8.73 14.27 11.2 (8.73 -14.27)
Group3 300 89 M 18 and over PSY 29.67 24.78 35.07 29.67 (24.78 -35.07)
Group1 200 34 F 5 to 10 PSY 17 12.43 22.82 17 (12.43 -22.82)
Group2 456 78 M 11 to 17 ANX 17.11 13.93 20.83 17.11 (13.93 -20.83)
Following this, I want a pivot table that uses the concat column as the valueName. However, pivottabler only works on integer or numeric values. The following code runs correctly with any one of the Prop, LowerCI or UpperCI columns, but gives an error message for the concat column:
library(readr)
library(dplyr)
library(epitools)
library(gtools)
library(reshape2)
library(binom)
library(pivottabler)
pt <- PivotTable$new()
pt$addData(a)
pt$addColumnDataGroups("LDAutGroup")
pt$addColumnDataGroups("sex")
pt$addRowDataGroups("DrugGroup")
pt$addRowDataGroups("Ageband")
pt$defineCalculation(calculationName="TotalTrains", type="value", valueName="Prop")
pt$renderPivot()
Is there a way I can make this work on the concat column? I want a table that has the following layout and the cells populated with the strings in concat column in the table above
                   Group1    Group2    Group3
                   M    F    M    F    M    F
ANX  11 to 17
     18 and over
     Total
HYP  11 to 17
     18 and over
     5 to 10
     Total
PSY  18 and over
     5 to 10
     Total
I am the pivottabler package author.
As you say, pivottabler currently only pivots integer/numerical columns. A workaround exists however, using a custom cell calculation function to calculate the value in each cell. Custom calculation functions were intended for more complex use cases, so using them in this way is a sledgehammer approach, but it does the job, and I suppose makes sense in some scenarios, e.g. if you have other numerical pivot tables and want a uniform appearance for the pivot tables in your output.
Adapting an example from the package vignettes:
library(pivottabler)
library(dplyr)
trainsConcatendated <- mutate(bhmtrains, ConcatValue = paste(TOC, TrainCategory, sep=" "))
getConcatenatedValue <- function(pivotCalculator, netFilters, format, baseValues, cell) {
# get the data frame
trains <- pivotCalculator$getDataFrame("trainsConcatendated")
# apply the filters coming from the headers in the pivot table
filteredTrains <- pivotCalculator$getFilteredDataFrame(trains, netFilters)
# get the distinct values
distinctValues <- distinct(filteredTrains, ConcatValue)
# get the value of the concatenated column
# this just returns the first concatenated value for the cell
# if there are multiple values, the others are ignored
if(length(distinctValues$ConcatValue)==0) { tv <- "" }
else { tv <- distinctValues$ConcatValue[1] }
# build the return value
# the raw value must be numerical, so simply set this to zero
value <- list()
value$rawValue <- 0
value$formattedValue <- tv
return(value)
}
pt <- PivotTable$new()
pt$addData(trainsConcatendated)
pt$addColumnDataGroups("TrainCategory", addTotal=FALSE)
pt$addRowDataGroups("TOC", addTotal=FALSE)
pt$defineCalculation(calculationName="ConcatValue",
type="function", calculationFunction=getConcatenatedValue)
pt$renderPivot()
Subtotals make little sense here: a confidence bound (lower or upper) cannot be aggregated to subtotal levels the way a mean can, and neither can the concat string (at least not in a simple pivot table).
Without subtotals, you can use the tidyr library to report a character variable in a wide ("spread") table with two lines of code. The first line creates the column groups and the second reshapes the table to the wide format:
library(tidyr)
Table_Original <- unite(Table_Original, "Col_pivot", c("LDAutGroup", "sex"), sep = "_", remove = F)
Table_Pivot <- spread(Table_Original[ ,c("Col_pivot","DrugGroup", "Ageband", "concat")], Col_pivot, concat)
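The same two steps can also be sketched without tidyr, using base R's reshape(). This is only an illustration under stated assumptions: the data frame tbl is a hand-typed copy of the five rows in the question, and the stray trailing "F" in the first row's concat value is assumed to be a paste artifact and left out.

```r
# Hand-typed copy of the question's rows (only the columns needed here)
tbl <- data.frame(
  LDAutGroup = c("Group1", "Group2", "Group3", "Group1", "Group2"),
  sex        = c("M", "F", "M", "F", "M"),
  Ageband    = c("5 to 10", "11 to 17", "18 and over", "5 to 10", "11 to 17"),
  DrugGroup  = c("PSY", "HYP", "PSY", "PSY", "ANX"),
  concat     = c("23 (15.84 -32.15)", "11.2 (8.73 -14.27)",
                 "29.67 (24.78 -35.07)", "17 (12.43 -22.82)",
                 "17.11 (13.93 -20.83)"),
  stringsAsFactors = FALSE
)

# Step 1: build one key per output column (group and sex combined)
tbl$Col_pivot <- paste(tbl$LDAutGroup, tbl$sex, sep = "_")

# Step 2: one row per DrugGroup/Ageband, one column per Col_pivot value
wide <- reshape(
  tbl[, c("DrugGroup", "Ageband", "Col_pivot", "concat")],
  idvar     = c("DrugGroup", "Ageband"),
  timevar   = "Col_pivot",
  direction = "wide"
)
```

Cells with no matching row are filled with NA, and the new columns are named concat.Group1_M, concat.Group2_F and so on.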

Accessing the values by their row name and column name, instead of numbers

I have a table which has multiple columns and rows. I want to access each value by its column name and row name, and make a plot with these values.
The table looks like this with 101 columns:
IDs     Exam1  Exam2  Exam3  Exam4  ....  Exam100
Ellie   12            48     33           64
Kate    98     34     21     76
Joe     22     53     49                  72
Van     77            40     12
Xavier                       88           92
What I want is to be able to look up the mark for a given row (IDs) and a given column (exams), like this:
table[Ellie,Exam3] --> 48
table[Ellie,Exam100] --> 64
table[Ellie,Exam2] --> (empty)
Then, with these numbers, I want to see the distribution of Ellie's marks in Exam2, Exam3 and Exam100 compared with her marks in the rest of the exams.
I have almost figured out this part with R:
library(data.table)
library(ggplot2)
pdf("distirbution_given_row.pdf")
selectedvalues <- c(table[Ellie,Exam3] ,table[Ellie,Exam100])
library(plyr)
cdat <- ddply(selectedvalues, "IDs", summarise, exams.mean=mean(exams))
selectedvaluesggplot <- ggplot(selectedvalues, aes(x=IDs, colour=exams)) + geom_density() + geom_vline(data=cdat, aes(xintercept=exams.mean, colour=IDs), linetype="dashed", size=1)
dev.off()
This should plot Ellie's marks for the exams of interest against the rest of her marks (a blank should not be treated as zero; it stays blank).
Red: marks for Exam3, Exam100 and Exam2. Blue: marks for the remaining 97 exams.
(The code and the plot are taken as an example of ggplot2 from this link.)
All ideas are appreciated!
To access your data by name, you can at least do the following:
library(dplyr) # provides %>% and select()
df=data.frame(IDs=c("Ellie","Kate","Joe","Van","Xavier"),Exam1=c(12,98,22,77,NA),Exam2=c(NA,34,53,NA,NA),
Exam3=c(48,21,49,40,NA),Exam4=c(33,76,NA,12,88))
row.names(df)=df$IDs
df=df%>%select(-IDs)
> df['Joe','Exam2']
[1] 53
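The lookups from the question can be sketched the same way in base R alone. This is a minimal sketch, assuming the blanks in the table are stored as NA; the object names marks, ellie_selected, ellie_missing and ellie_mean are made up for the illustration.

```r
# Marks keyed by row name; blanks are NA, not zero
marks <- data.frame(
  Exam1   = c(12, 98, 22, 77, NA),
  Exam2   = c(NA, 34, 53, NA, NA),
  Exam3   = c(48, 21, 49, 40, NA),
  Exam100 = c(64, NA, 72, NA, 92),
  row.names = c("Ellie", "Kate", "Joe", "Van", "Xavier")
)

# Single cell, addressed by row name and column name
marks["Ellie", "Exam3"]

# Several exams for one student, returned as a named numeric vector
ellie_selected <- unlist(marks["Ellie", c("Exam3", "Exam100")])

# A blank stays NA, so it can be detected and skipped rather than counted as 0
ellie_missing <- is.na(marks["Ellie", "Exam2"])
ellie_mean <- mean(unlist(marks["Ellie", ]), na.rm = TRUE)
```

Note that the names must be quoted strings; table[Ellie,Exam3] without quotes would look for variables called Ellie and Exam3.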
Next, I prepared an example with randomly generated numbers to illustrate what you could do. First, let us create an example data frame:
df=as.data.frame(matrix(rnorm(505,50,10),ncol=101))
colnames(df)=c("IDs",paste0("Exam",as.character(1:100)))
df$IDs=c("Ellie","Kate","Joe","Van","Xavier")
To work with ggplot it is recommended to convert it to long format (gather() comes from the tidyr package):
df0=df%>%gather(key="exams",value="score",-IDs)
From here on you can play with your variables as desired. For instance plotting the density of the score per ID:
ggplot(df0, aes(x=score,col=IDs)) + geom_density()
or selecting only Exams 2,3,100 and plotting density for different exams
df0=df0%>%filter(exams=="Exam2"|exams=="Exam3"|exams=="Exam100")
ggplot(df0, aes(x=score,col=exams)) + geom_density()
IIUC, you want to plot, for each ID, the selected exams against all the other exams. Consider the following steps:
Reshape your data to long format, replacing NAs with zero if needed.
Run by() to subset the data by IDs, then build the aggregated mean data and the ggplot for each subset.
Within by(), create a SelectValues indicator column for the selected exams, then plot the densities with a dashed vertical line at each group's mean.
Data
txt = 'IDs Exam1 Exam2 Exam3 Exam4 Exam100
Ellie 12 NA 48 33 64
Kate 98 34 21 76 NA
Joe 22 53 49 NA 72
Van 77 NA 40 12 NA
Xavier NA NA NA 88 92'
exams_df <- read.table(text=txt, header = TRUE)
# ADD OTHER EXAM COLUMNS (SEEDED FOR REPRODUCIBILITY)
set.seed(444)
# use 5:99 here; seq(5:99) returns 1:95 and would overwrite Exam1 to Exam4
exams_df[paste0("Exam", 5:99)] <- replicate(99-4, sample(100, 5))
Reshape and Graph
library(ggplot2) # ONLY PACKAGE NEEDED
# FILL NA
exams_df[is.na(exams_df)] <- 0
# RESHAPE (BASE R VERSION)
exams_long_df <- reshape(exams_df,
timevar = "Exam",
times = names(exams_df)[grep("Exam", names(exams_df))],
v.names = "Score",
varying = names(exams_df)[grep("Exam", names(exams_df))],
new.row.names = 1:1000,
direction = "long")
# GRAPH BY EACH ID
by(exams_long_df, exams_long_df$IDs, FUN=function(df) {
df$SelectValues <- ifelse(df$Exam %in% c("Exam1", "Exam3", "Exam100"), "Select Exams", "All Else")
cdat <- aggregate(Score ~ SelectValues, df, FUN=mean)
ggplot(df, aes(Score, colour=SelectValues)) +
geom_density() + xlim(-50, 120) +
labs(title=paste(df$IDs[[1]], "Density Plot of Scores"), x ="Exam Score", y = "Density") +
geom_vline(data=cdat, aes(xintercept=Score, colour=SelectValues), linetype="dashed", size=1)
})

Class probabilities in Neural networks

I use the caret package with a multi-layer perceptron.
My dataset consists of a labelled output value, which can be either A,B or C. The input vector consists of 4 variables.
I use the following lines of code to calculate the class probabilities for each input value:
fit <- train(device~.,data=dataframetrain[1:100,], method="mlp",
trControl=trainControl(classProbs=TRUE))
(p=(predict(fit,newdata=dataframetest,type=("prob"))))
I thought that the class probabilities for each record must sum up to one. But I get the following:
rowSums(p)
# 1 2 3 4 5 6 7 8
# 1.015291 1.015265 1.015291 1.015291 1.015291 1.014933 1.015011 1.015291
# 9 10 11 12 13 14 15 16
# 1.014933 1.015206 1.015291 1.015291 1.015291 1.015224 1.015011 1.015291
Can anybody help me because I don't know what I did wrong.
There's probably nothing wrong, it just seems that caret returns the values of the neurons in the output layer without converting them to probabilities (correct me if I'm wrong). When using the RSNNS::mlp function outside of caret the rows of the predictions also don't sum to one.
Since all output neurons have the same activation function the outputs can be converted to probabilities by dividing the predictions by the respective row sum, see this question.
This behavior seems to be true when using method = "mlp" or method = "mlpWeightDecay" but when using method = "nnet" the predictions do sum to one.
Example:
library(RSNNS)
data(iris)
#shuffle the vector
iris <- iris[sample(1:nrow(iris),length(1:nrow(iris))),1:ncol(iris)]
irisValues <- iris[,1:4]
irisTargets <- iris[,5]
irisTargetsDecoded <- decodeClassLabels(irisTargets)
iris2 <- splitForTrainingAndTest(irisValues, irisTargetsDecoded, ratio=0.15)
iris2 <- normTrainingAndTestSet(iris2)
set.seed(432)
model <- mlp(iris2$inputsTrain, iris2$targetsTrain,
size=5, learnFuncParams=c(0.1), maxit=50,
inputsTest=iris2$inputsTest, targetsTest=iris2$targetsTest)
predictions <- predict(model,iris2$inputsTest)
head(rowSums(predictions))
# 139 26 17 104 54 82
# 1.0227419 1.0770722 1.0642565 1.0764587 0.9952268 0.9988647
probs <- predictions / rowSums(predictions)
head(rowSums(probs))
# 139 26 17 104 54 82
# 1 1 1 1 1 1
# nnet example --------------------------------------
library(caret)
training <- sample(seq_along(irisTargets), size = 100, replace = F)
modelCaret <- train(y = irisTargets[training],
x = irisValues[training, ],
method = "nnet")
predictionsCaret <- predict(modelCaret,
newdata = irisValues[-training, ],
type = "prob")
head(rowSums(predictionsCaret))
# 122 100 89 134 30 86
# 1 1 1 1 1 1
I don't know how much flexibility the caret package offers in these choices, but the standard way to make a neural net produce outputs which sum to one is to use the softmax function as the activation function in the output layer.
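To make the two normalizations mentioned in these answers concrete, here is a small sketch in base R. It assumes a made-up matrix of raw output-layer values (the numbers in raw are invented, not taken from caret or RSNNS), and softmax is a hand-written helper rather than a function from either package.

```r
# Hand-written softmax for one vector of raw scores
softmax <- function(z) {
  e <- exp(z - max(z))  # subtracting the max avoids overflow, result is unchanged
  e / sum(e)
}

# Invented raw outputs for two records and three classes
raw <- matrix(c(0.90, 0.08, 0.04,
                0.10, 0.85, 0.07),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("A", "B", "C")))

# Option 1: rescale each row by its sum (requires non-negative outputs)
probs_rescaled <- raw / rowSums(raw)

# Option 2: apply softmax row by row
probs_softmax <- t(apply(raw, 1, softmax))

rowSums(probs_rescaled)
rowSums(probs_softmax)
```

Both results have rows that sum to one, but the values differ: rescaling keeps the ratios between the raw outputs, while softmax exponentiates them first. Applying softmax after training is not the same as training a network with a softmax output layer, so treat option 2 as a post-hoc normalization only.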

plot new values for best fit nonlinear curve

I have created the best fit for a non linear function. It seems to be working correctly:
#define a function
fncTtr <- function(n,d) (d/n)*((sqrt(1+2*(n/d))-1))
#fit
dFit <- nls(dData$ttr~fncTtr(dData$n,d),data=dData,start=list(d=25),trace=T)
summary(dFit)
plot(dData$ttr~dData$n,main="Fitted d value",pch=19,)
xl <- seq(min(dData$n),max(dData$n), (max(dData$n) - min(dData$n))/1000)
lines(xl,predict(dFit,newdata=xl,col=blue)
The plot of my observations comes out correctly, but I am having problems displaying the best-fit curve on it. I create the independent variable xl with 1000 values and I want to compute the new values using the best fit. When I call the "lines" procedure, I get the error message:
Error in xy.coords(x, y) : 'x' and 'y' lengths differ
If I try to execute only the predict function:
a <-predict(dFit,newdata=xl)
str(a)
I can see that xl has 1000 components but "a" has only 16 components. Shouldn't I have the same number of values in a?
data used:
n ttr d
1 35 0.6951 27.739
2 36 0.6925 28.072
3 37 0.6905 28.507
4 38 0.6887 28.946
5 39 0.6790 28.003
6 40 0.6703 27.247
7 41 0.6566 25.735
8 42 0.6605 26.981
9 43 0.6567 27.016
10 44 0.6466 26.026
11 45 0.6531 27.667
12 46 0.6461 27.128
13 47 0.6336 25.751
14 48 0.6225 24.636
15 49 0.6214 24.992
16 50 0.6248 26.011
OK, I think I found the solution. The likely explanation is that the original formula refers to dData$n directly, so predict() keeps evaluating dData$n and ignores newdata, which is why it returns only the 16 fitted values.
When calling predict.nls, what you pass to the newdata argument has to be a named list or data frame whose name matches the predictor variable (here n), and that name has to match the one used in the original call to nls.
#Here I replaced dData$n with n
dFit <- nls(ttr~fncTtr(n,d),data=dData,start=list(d=25),trace=T)
plot(dData$ttr~dData$n,main="Fitted d value",pch=19)
xl <- seq(min(dData$n),max(dData$n), (max(dData$n) - min(dData$n))/1000)
a <- predict(dFit,newdata=list(n=xl))
length(a)==length(xl)
[1] TRUE
lines(xl,a,col="blue")
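For reference, the whole workflow can be put together as one self-contained script. This is a sketch that assumes the n and ttr columns printed in the question; yl is an illustrative name, and seq() is called with length.out so the grid size is explicit.

```r
# Data copied from the question (n and ttr columns only)
dData <- data.frame(
  n   = 35:50,
  ttr = c(0.6951, 0.6925, 0.6905, 0.6887, 0.6790, 0.6703, 0.6566, 0.6605,
          0.6567, 0.6466, 0.6531, 0.6461, 0.6336, 0.6225, 0.6214, 0.6248)
)

# Model function from the question
fncTtr <- function(n, d) (d / n) * (sqrt(1 + 2 * (n / d)) - 1)

# Refer to columns by bare name so predict() can substitute newdata
dFit <- nls(ttr ~ fncTtr(n, d), data = dData, start = list(d = 25))

# Fine grid over the observed range of n
xl <- seq(min(dData$n), max(dData$n), length.out = 1001)

# newdata must carry a column called n, matching the formula
yl <- predict(dFit, newdata = data.frame(n = xl))

# Observations plus the fitted curve
plot(ttr ~ n, data = dData, main = "Fitted d value", pch = 19)
lines(xl, yl, col = "blue")
```

The key design choice is in the nls call: using ttr and n instead of dData$ttr and dData$n is what lets predict() evaluate the formula on the new grid.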

summarize data from csv using R

I'm new to R, and I wrote some code to summarize data from .csv file according to my needs.
here is the code.
raw <- read.csv("trees.csv")
looks like this
SNAME CNAME FAMILY PLOT INDIVIDUAL CAP H
1 Alchornea triplinervia (Spreng.) M. Arg. Tainheiro Euphorbiaceae 5 176 15 9.5
2 Andira fraxinifolia Benth. Angelim Fabaceae 3 321 12 6.0
3 Andira fraxinifolia Benth. Angelim Fabaceae 3 326 14 7.0
4 Andira fraxinifolia Benth. Angelim Fabaceae 3 327 18 5.0
5 Andira fraxinifolia Benth. Angelim Fabaceae 3 328 12 6.0
6 Andira fraxinifolia Benth. Angelim Fabaceae 3 329 21 7.0
#add 2 other columns
for (i in 1:nrow(raw)) {
raw$VOLUME[i] <- treeVolume(raw$CAP[i],raw$H[i])
raw$BASALAREA[i] <- treeBasalArea(raw$CAP[i])
}
#here comes.
I need a new data frame, with the mean of columns H and CAP and the sums of columns VOLUME and BASALAREA. This dataframe is grouped by column SNAME and subgrouped by column PLOT.
plotSummary = merge(
aggregate(raw$CAP ~ raw$SNAME * raw$PLOT, raw, mean),
aggregate(raw$H ~ raw$SNAME * raw$PLOT, raw, mean))
plotSummary = merge(
plotSummary,
aggregate(raw$VOLUME ~ raw$SNAME * raw$PLOT, raw, sum))
plotSummary = merge(
plotSummary,
aggregate(raw$BASALAREA ~ raw$SNAME * raw$PLOT, raw, sum))
The functions treeVolume and treeBasalArea just return numbers.
treeVolume <- function(radius, height) {
return (0.000074230*radius**1.707348*height**1.16873)
}
treeBasalArea <- function(radius) {
return (((radius**2)*pi)/40000)
}
I'm sure that there is a better way of doing this, but how?
I can't manage to read your example data in, but I think I've made something that generally represents it...so give this a whirl. This answer builds off of Greg's suggestion to look at plyr and the functions ddply to group by segments of your data.frame and numcolwise to calculate your statistics of interest.
#Sample data
set.seed(1)
dat <- data.frame(sname = rep(letters[1:3],2), plot = rep(letters[1:3],2),
CAP = rnorm(6),
H = rlnorm(6),
VOLUME = runif(6),
BASALAREA = rlnorm(6)
)
#Calculate mean for all numeric columns, grouping by sname and plot
library(plyr)
ddply(dat, c("sname", "plot"), numcolwise(mean))
#-----
sname plot CAP H VOLUME BASALAREA
1 a a 0.4844135 1.182481 0.3248043 1.614668
2 b b 0.2565755 3.313614 0.6279025 1.397490
3 c c -0.8280485 1.627634 0.1768697 2.538273
EDIT - response to updated question
OK, now that your question is more or less reproducible, here's how I'd approach it. First of all, you can take advantage of the fact that R is vectorized, meaning that you can calculate ALL of the values for VOLUME and BASALAREA in one pass, without looping through each row. For that bit, I recommend the transform function:
dat <- transform(dat, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))
Secondly, realizing that you intend to calculate different statistics for CAP & H and then VOLUME & BASALAREA, I recommend using the summarize function, like this:
ddply(dat, c("sname", "plot"), summarize,
meanCAP = mean(CAP),
meanH = mean(H),
sumVOLUME = sum(VOLUME),
sumBASAL = sum(BASALAREA)
)
Which will give you an output that looks like:
sname plot meanCAP meanH sumVOLUME sumBASAL
1 a a 0.5868582 0.5032308 9.650184e-06 7.031954e-05
2 b b 0.2869029 0.4333862 9.219770e-06 1.407055e-05
3 c c 0.7356215 0.4028354 2.482775e-05 8.916350e-05
The help pages for ?ddply, ?transform, ?summarize should be insightful.
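If you would rather stay in base R, the same summary can be sketched with aggregate() and a single merge(). This is a minimal sketch, assuming a hand-typed raw data frame built from the six rows shown in the question (species names shortened); it is an alternative to the plyr code above, not a replacement for it.

```r
# Functions from the question, written with ^ (equivalent to ** in R)
treeVolume <- function(radius, height) {
  0.000074230 * radius^1.707348 * height^1.16873
}
treeBasalArea <- function(radius) {
  (radius^2 * pi) / 40000
}

# Hand-typed copy of the six rows shown in the question
raw <- data.frame(
  SNAME = c("Alchornea triplinervia", rep("Andira fraxinifolia", 5)),
  PLOT  = c(5, 3, 3, 3, 3, 3),
  CAP   = c(15, 12, 14, 18, 12, 21),
  H     = c(9.5, 6, 7, 5, 6, 7)
)

# Vectorized: both new columns are computed in one pass, no loop
raw <- transform(raw,
                 VOLUME    = treeVolume(CAP, H),
                 BASALAREA = treeBasalArea(CAP))

# Means of CAP and H, sums of VOLUME and BASALAREA, per species and plot
means <- aggregate(cbind(CAP, H) ~ SNAME + PLOT, data = raw, FUN = mean)
sums  <- aggregate(cbind(VOLUME, BASALAREA) ~ SNAME + PLOT, data = raw, FUN = sum)

plotSummary <- merge(means, sums, by = c("SNAME", "PLOT"))
```

Using bare column names with data = raw, rather than raw$CAP ~ raw$SNAME, also gives the result readable column names.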
Look at the plyr package. It will split the data by the SNAME variable for you; you then give it code to compute the set of summaries that you want (mixing mean and sum and whatever), and it will put the pieces back together for you. You probably want either the 'ddply' or the 'daply' function in that package.
