I am conducting a Kruskal-Wallis test to determine whether a measurement differs significantly between three groups. I use ggbetweenstats to determine which pairs of groups differ significantly.
Here is the code for sample data and the plot:
sampledata <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), group = c(1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2), measurement = c(0,
1, 200, 30, 1000, 6000, 1, 0, 0, 10000, 20000, 700, 65, 1, 8,
11000, 13000, 7000, 500, 3000)), class = "data.frame", row.names = c(NA,
20L))
library(ggstatsplot)
library(ggplot2)
ggbetweenstats(
data = sampledata,
x = group,
y = measurement,
type = "nonparametric",
plot.type = "box",
pairwise.comparisons = TRUE,
pairwise.display = "all",
centrality.plotting = FALSE,
bf.message = FALSE
)
You can see the results of the Kruskal-Wallis test at the top of the plot, as well as the pairwise group comparisons in the plot itself. Now I want to change the y axis to a logarithmic scale:
ggbetweenstats(
data = sampledata,
x = group,
y = measurement,
type = "nonparametric",
plot.type = "box",
pairwise.comparisons = TRUE,
pairwise.display = "all",
centrality.plotting = FALSE,
bf.message = FALSE
) +
ggplot2::scale_y_continuous(trans=scales::pseudo_log_trans(sigma = 1, base = exp(1)), limits = c(0,25000), breaks = c(0,1,10,100,1000,10000)
)
However, this removes the grouped analysis. I have tried different scaling solutions and browsed SO for a solution but couldn't find anything. Thank you for your help!
It seems that the y_position parameter in the geom_signif component is not affected by the y axis transformation. You will need to pass the log values of the desired bracket heights manually. In theory, you can pass these via the ggsignif.args parameter, but it seems that in the latest version of ggstatsplot this isn't possible because the y_position is hard-coded.
One way around this is to store the plot and then change the y positions after the fact. Here's a full reprex with the latest versions of ggplot2, ggstatsplot and their dependencies (at the time of writing):
sampledata <- structure(list(ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), group = c(1, 2, 3, 1, 2, 3,
1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2), measurement = c(0,
1, 200, 30, 1000, 6000, 1, 0, 0, 10000, 20000, 700, 65, 1, 8,
11000, 13000, 7000, 500, 3000)), class = "data.frame", row.names = c(NA,
20L))
library(ggstatsplot)
library(ggplot2)
library(scales)
p <- ggbetweenstats(
data = sampledata,
x = group,
y = measurement,
type = "nonparametric",
plot.type = "box",
pairwise.comparisons = TRUE,
pairwise.display = "all",
centrality.plotting = FALSE,
bf.message = FALSE
) + scale_y_continuous(trans = pseudo_log_trans(sigma = 1, base = exp(1)),
limits = c(0, exp(13)),
breaks = c(0, 10^(0:5)),
labels = comma)
#> Scale for y is already present.
#> Adding another scale for y, which will replace the existing scale.
i <- which(sapply(p$layers, function(x) inherits(x$geom, "GeomSignif")))
p$layers[[i]]$stat_params$y_position <- c(10, 10.8, 11.6)
p
Created on 2023-01-15 with reprex v2.0.2
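The bracket heights set in the reprex (10, 10.8, 11.6) are on the transformed scale, not in data units. As a rough sketch of how such values might be derived, the transformation object returned by pseudo_log_trans() can be applied to heights chosen in data units. The raw heights below (22000, 50000, 110000) are illustrative assumptions, not values taken from the answer.

```r
library(scales)

# Same transformation as the one used for the y axis in the reprex
tr <- pseudo_log_trans(sigma = 1, base = exp(1))

# Hypothetical bracket heights in data units (assumed for illustration)
raw_heights <- c(22000, 50000, 110000)

# Positions on the transformed scale, usable as y_position values
y_pos <- tr$transform(raw_heights)
round(y_pos, 1)

# The inverse maps transformed positions back to data units
tr$inverse(y_pos)
```

With sigma = 1 and base e the transform is asinh(x / 2), which is close to log(x) for large x, so these heights land near the 10, 10.8 and 11.6 used above.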
I am not familiar with if statements, loops, or functions in R. I have a dataset in which I want to adjust a variable (N) for the clustering of the study, using the formula N/(1 + (M - 1) * ICC), where N is the number of subjects, M is the size of the cluster and ICC is the intra-class correlation coefficient. I have all these variables in separate columns, with each row identifying a different study/sample size. Not all the studies have a clustering issue, so I need to apply this function only to the subset that has an ICC. I thought about something like this, but I know it is missing something, and I also don't know whether a loop with an if statement is the most efficient way to go.
for (i in df$N) { # for every sample size in df$N
if (df$ICC[i] != .) { # if ICC is not missing
df$N/(1 + (df$M - 1) * df$ICC) # adjust the sample size by dividing the N by the size
# of the cluster - 1 and multiply per the ICC of the study
} else {
df$N/1 #otherwise (ICC is missing) do nothing: ie., divide the N by 1.
}
}
Do you know how I could do this with something like this? Other solutions are also welcome! Thanks for any help or suggestion about this!
Here's an example of the dataset:
dput(head(df, 10))
structure(list(ID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5), ArmsID = c(0,
1, 0, 1, 0, 1, 0, 1, 0, 1), N = c(26, 34, 28, 27, 50, 52, 60,
65, 150, 152), Mean = c(10.1599998474121, 5.59999990463257, 8,
8.52999973297119, 17, 15.1700000762939, 48.0999984741211, 49,
57, 55.1315803527832), SD = c(6.30000019073486, 4.30000019073486,
5.6, 6.61999988555908, 6, 7.75, 10.1599998474121, 12, 11, 10.5495901107788
), SE = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), ICC = c(0.03,
0.02, NA, NA, 0.01, 0.003, NA, NA, NA, NA), M = c(5, 5, NA, NA,
17, 16, NA, NA, NA, NA)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
The . was meant to indicate missing data, i.e. NA. I want to apply the function that adjusts N only to the rows that have an ICC.
idx <- which(!is.na(df$ICC))
df$N[idx] <- df$N[idx]/(1 + (df$M[idx] - 1) * df$ICC[idx])
This code works correctly, thanks!
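For comparison, here is a minimal sketch of the same adjustment written with ifelse(), which keeps the original N and stores the result in a new column. The column name N_adj and the cut-down example data are assumptions made for illustration; the quantity 1 + (M - 1) * ICC is what is usually called the design effect.

```r
# Small example modelled on the question (ICC and M are NA for unclustered studies)
df <- data.frame(
  N   = c(26, 34, 28, 27, 50),
  ICC = c(0.03, 0.02, NA, NA, 0.01),
  M   = c(5, 5, NA, NA, 17)
)

# Design effect: 1 + (M - 1) * ICC; NA where the study is not clustered
deff <- 1 + (df$M - 1) * df$ICC

# Keep N unchanged when ICC is missing, otherwise divide by the design effect
df$N_adj <- ifelse(is.na(df$ICC), df$N, df$N / deff)
df
```

Because the whole computation is vectorised, no loop or explicit if statement is needed, and the unadjusted N stays available for checking.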
I have the following set of data:
Using R and tidyverse if possible I would like to sum column S based upon a condition on other columns. If my variable
condition_columns = c('A', 'B')
The output I am after is a data frame containing
Where the 490 is obtained by summing column S only when A=1 and the 250 comes from summing column S when B=1.
Could anyone suggest a tidyverse way of doing it?
Thank you,
Phil,
You can do this using summarize(across())
summarize(df, across(all_of(condition_columns), ~sum(S[.x==1])))
Output:
A B
1 490 250
Input:
structure(list(ID = 1:10, A = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
B = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0), S = c(10, 20, 30, 40,
50, 60, 70, 80, 90, 100)), class = "data.frame", row.names = c(NA,
-10L))
You may use the following (easy to understand) code:
df %>%
summarise(A = sum(A*S),
B = sum(B*S))
Output:
A B
1 490 250
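If you would rather not depend on dplyr, the same conditional sums can be sketched in base R by looping over the names in condition_columns. This is only an illustration built on the input data posted above, not part of either answer.

```r
# Input data as posted in the first answer
df <- data.frame(
  ID = 1:10,
  A  = c(0, 0, 0, 1, 1, 1, 1, 1, 1, 1),
  B  = c(1, 0, 1, 0, 1, 0, 1, 0, 1, 0),
  S  = c(10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
)
condition_columns <- c("A", "B")

# For each condition column, sum S over the rows where that column equals 1
sums <- sapply(df[condition_columns], function(col) sum(df$S[col == 1]))
sums
```

Unlike sum(A * S), this filters on the value 1 explicitly, so it would still behave as intended if a condition column held values other than 0 and 1.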
My data looks like this:
mydata <- data.frame(ID = c(1, 2, 3, 5, 6, 7, 9, 11, 12, 13), #patient ID
t1 = c(37, 66, 28, 60, 44, 24, 47, 44, 33, 47), #evaluation before
t4 = c(33, 45, 27, 39, 24, 29, 24, 37, 27, 42), #evaluation after
sexe = c(1, 2, 2, 1, 1, 1, 2, 2, 2, 1)) #subset
I would like to do a simple before-after graph.
So far, I managed to get this:
With this:
library(ggplot2)
ggplot(mydata) +
geom_segment(aes(x = 1, xend = 2, y = t1, yend = t4), size=0.6) +
scale_x_discrete(name = "Intervention", breaks = c("1", "2"), labels = c("T1", "T4"), limits = c(1, 2)) +
scale_y_continuous(name = "Var") + theme_bw()
I am facing multiple issues, can you help me to...
add a black circle at the beginning and the end of every line? (geom_point() doesn't work)
make the lines smoother (look how pixelated they are, especially the second one)?
decrease the blank space on the left and right sides of the graph?
add the median for T1 and T4 (in red), link those points, compare them with a paired Mann-Whitney test and print the p-value on the graph?
I would prefer not to reformat my database to long format, as I have a lot of other variables and timepoints (not shown here).
I have read other posts (such as here), but the solutions provided look so complicated for something that seems simple (yet I can't do it...).
Huge thanks for your help!
I will update the graph along with progression :)
EDIT
I would like not to reformat my database to long format as I have a lot of other variables and timepoints (not shown here)...
Here is what I would do! Please feel free to ask questions about what's going on here.
library(tidyverse)
mydata <- data.frame(ID = c(1, 2, 3, 5, 6, 7, 9, 11, 12, 13), #patient ID
t1 = c(37, 66, 28, 60, 44, 24, 47, 44, 33, 47), #evaluation before
t4 = c(33, 45, 27, 39, 24, 29, 24, 37, 27, 42), #evaluation after
sexe = c(1, 2, 2, 1, 1, 1, 2, 2, 2, 1))
pval <- wilcox.test(x = mydata$t1, y = mydata$t4, paired = TRUE, exact = FALSE)$p.value %>% round(2)
df <- mydata %>%
pivot_longer(2:3,names_to = "Time") %>% # Pivot into long-format
mutate(sexe = as.factor(sexe),
Time = as.factor(Time)) # Make factors
ggplot(df,aes(Time,value,color = sexe,group = ID)) +
geom_point() +
geom_line() +
stat_summary(inherit.aes = F,aes(Time,value),
geom = "point", fun = "median", col = "red",
size = 3, shape = 24,fill = "red"
) +
annotate("text", x = 1.7, y = 60, label = paste('P-Value is',pval)) +
coord_cartesian(xlim = c(1.4,1.6)) +
theme_bw()
Also be aware that it is common to have some variables which repeat through time, in addition to the long format data. See example here:
mydata <- data.frame(ID = c(1, 2, 3, 5, 6, 7, 9, 11, 12, 13), #patient ID
t1 = c(37, 66, 28, 60, 44, 24, 47, 44, 33, 47), #evaluation before
t4 = c(33, 45, 27, 39, 24, 29, 24, 37, 27, 42), #evaluation after
sexe = c(1, 2, 2, 1, 1, 1, 2, 2, 2, 1),
var1 = c(1:10),
var2 = c(1:10),
var3 = c(1:10))
df <- mydata %>%
pivot_longer(2:3,names_to = "Time") %>% # Pivot into long-format
mutate(sexe = as.factor(sexe),
Time = as.factor(Time))
I can address issue (1), the black circles:
First, you should tidy your data, so that one column holds the information for one variable (at the moment the 'Var' values on the plot are stored in two columns: 't1' and 't4'). You can achieve this with the tidyr package.
library(tidyr)
mydata_long <- pivot_longer(mydata, c(t1, t4), names_to = "t")
Now creating points is easy, and the rest of the code becomes a lot clearer:
We can tell ggplot that we want 't' groups on x-axis, their values on y-axis and in case of lines, we want them separate for every 'ID'.
ggplot(mydata_long) +
geom_line(aes(x = t, y = value, group = ID)) + # plotting lines
geom_point(aes(x = t, y = value)) + # plotting points
labs(x = "Intervention", y = "Var") + #changing labels
theme_bw()
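Since the question asks to avoid reshaping, here is a hedged sketch of how the points and medians might be drawn directly from the wide data, by giving each geom_point() call its own x position and using a continuous x scale. The limits c(0.8, 2.2) and the point sizes are arbitrary choices made for illustration.

```r
library(ggplot2)

mydata <- data.frame(ID = c(1, 2, 3, 5, 6, 7, 9, 11, 12, 13),
                     t1 = c(37, 66, 28, 60, 44, 24, 47, 44, 33, 47),
                     t4 = c(33, 45, 27, 39, 24, 29, 24, 37, 27, 42))

# Medians computed straight from the wide columns
med_t1 <- median(mydata$t1)
med_t4 <- median(mydata$t4)

p <- ggplot(mydata) +
  geom_segment(aes(x = 1, xend = 2, y = t1, yend = t4)) +  # one line per patient
  geom_point(aes(x = 1, y = t1)) +                         # circles at T1
  geom_point(aes(x = 2, y = t4)) +                         # circles at T4
  annotate("segment", x = 1, xend = 2, y = med_t1, yend = med_t4,
           colour = "red") +                               # link the medians
  annotate("point", x = c(1, 2), y = c(med_t1, med_t4),
           colour = "red", size = 3) +                     # median markers
  scale_x_continuous(name = "Intervention", breaks = c(1, 2),
                     labels = c("T1", "T4"), limits = c(0.8, 2.2)) +
  scale_y_continuous(name = "Var") +
  theme_bw()
p
```

Narrowing the x limits is what trims the blank space on either side. The long-format approach shown in the answers scales better once more timepoints are involved.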
I'm trying to do a 10-fold cross-validation and estimate the performance of a joint model by using parallel processing (parLapply). I'm trying to find out why I receive the error message:
"Error in checkForRemoteErrors(val): five nodes produced an error: object 'Week' not found"
The code looks as follows:
# Validation using 10-fold CV
library("parallel")
set.seed(123)
V <- 10
n <- nrow(dfC)
splits <- split(seq_len(n), sample(rep(seq_len(V), length.out = n)))
CrossValJM <- function (i) {
library("JM")
library("nlme")
trainingData <- dfL[!dfL$ID %in% i, ]
trainingData_ID <- trainingData[!duplicated(trainingData$ID), ]
testingData <- dfL[dfL$ID %in% i, ]
lmeFit <- lme(DA ~ ns(Week, 2), data = trainingData,
random = ~ ns(Week, 2) | ID)
coxFit <- coxph(Surv(TT_event, Event) ~ Gender * Age, data =
trainingData_ID,
x = TRUE)
jointFit <- jointModel(lmeFit, coxFit, timeVar = "Week")
pe <- prederrJM(jointFit, newdata = testingData, Tstart = 10,
Thoriz = 20)
auc <- aucJM(jointFit, newdata = testingData, Tstart = 10,
Thoriz = 20)
list(pe = pe, auc = auc)
}
cl <- makeCluster(5)
res <- parLapply(cl, splits, CrossValJM)
stopCluster(cl)
The function itself is accepted, but when I run the cluster commands I get this error saying that it cannot find objects used within the function. Should they be defined within the function itself? Or am I not using the parLapply function correctly?
P.S.: data looks as follows (dfL is a dataframe of length ~ 1000 and dfC ~ 200):
dfL <- data.frame(ID = c(1, 1, 1, 2, 2, 3), DA = c(0.4, 1.8, 1.2, 3.2, 3.6, 2.8), Week = c(0, 4, 16, 4, 20, 8), Event = c(1, 1, 1, 0, 0, 1), TT_Event = c(16, 20, 8), Gender = c(0, 0, 0, 1, 1, 0), Age = c(24, 24, 24, 56, 56, 76))
dfC <- data.frame(ID = c(1, 2, 3, 4, 5, 6), DA = c(1.2, 3.6, 2.8, 2.4, 1.9, 3.4), Week = c(16, 20, 8, 36, 24, 32), Event = c(1, 0, 1, 1, 1, 0), TT_Event = c(16, 20, 8, 36, 24, 32), Gender = c(0, 1, 0, 0, 1, 1), Age = c(24, 56, 76, 38, 44, 50))
Thnx :)
Closely related questions have already been answered on Stack Overflow.
Basically, you have three solutions:
use clusterExport() to export the variables you need to the clusters (the most common method)
pass all variables as arguments of your function CrossValJM() so that they are automatically exported to the clusters (the solution I prefer, the most programmatically correct one)
use the R package {future}, which should automatically detect the variables to export (the lazy solution, but it also seems to work well)
See for example this.
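As a minimal sketch of the first two options, using only the parallel package: the toy data frame dat and the helper functions fold_mean() and fold_mean_global() are hypothetical stand-ins for dfL and CrossValJM(), so this illustrates the mechanism rather than the joint model itself.

```r
library(parallel)

# Toy stand-in for dfL: 6 subjects with 2 rows each (assumed for illustration)
dat <- data.frame(ID = rep(1:6, each = 2), y = 1:12)
splits <- split(unique(dat$ID), rep(1:3, length.out = 6))

cl <- makeCluster(2)

# Option 2: pass the data as an argument, so it is shipped to the workers
fold_mean <- function(ids, data) mean(data$y[data$ID %in% ids])
res_arg <- parLapply(cl, splits, fold_mean, data = dat)

# Option 1: export the object by name, then refer to it inside the function
fold_mean_global <- function(ids) mean(dat$y[dat$ID %in% ids])
clusterExport(cl, "dat", envir = environment())
res_exp <- parLapply(cl, splits, fold_mean_global)

stopCluster(cl)

# Both approaches should give the same per-fold results
identical(res_arg, res_exp)
```

In the question's code this would correspond to either adding dfL as a second argument of CrossValJM() or calling clusterExport(cl, "dfL") before parLapply().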