I have a dataset of chess positions made up of 100 groups, where each group takes one of 50 positions ("Position_number") and one of two colours ("stm_white"). I want to run a linear regression for each Position_number subsample, with stm_white as the explanatory variable and stm_perform as the outcome variable. I then want to display the coefficient of stm_white and its associated confidence interval for each regression in a forest plot, so that I can easily see which Position_number subsamples give significant coefficients for stm_white and compare coefficients across positions. For example, the plot would have 50 y-axis categories labelled with the position numbers, the x-axis would represent the coefficient range, and a horizontal confidence bar would be drawn for each position number.
Where I'm stuck:
Getting the confidence interval bounds for each regression
Plotting each of the 50 coefficients (with confidence intervals) on one plot. (I think this is called a forest plot?)
This is how I currently get a list of the coefficients for each regression:
fits <- by(df, df[, "Position_number"],
           function(x) lm(stm_perform ~ stm_white, data = x))
# Combine the coefficients from each model into one matrix
do.call("rbind", lapply(fits, coef))
And here is a sample of 10 positions (apologies if there's a better way to show reproducible data):
>dput(droplevels(dfMWE[,c("Position_number","stm_white","stm_perform")]))
structure(list(Position_number = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 10, 10, 10, 10), stm_white = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1), stm_perform = c(0.224847134350316, -0.252000458803946,
0.263005239459311, -0.337712202569111, 0.525880930891169, -0.5,
0.514387184165999, 0.520136722035817, -0.471249436107731, -0.557311633762293,
-0.382774969095054, -0.256365477992672, -0.592466230584332, 0.420100239642119,
0.35728693116738, -0.239203909010858, 0.492804918290949, -0.377349804212738,
0.498560888290847, 0.650604627933873, 0.244481117928803, 0.225852022298169,
0.448376452689039, 0.305090287270497, 0.275461757157464, 0.0232950364735793,
-0.117225030904946, 0.103523492101814, 0.098301745397805, 0.435599509759579,
-0.323024628921732, -0.790798102797238, 0.326223812111678, -0.331305043692668,
0.300230596737942, -0.340292005855252, 0.196181480575316, -0.0606495585093978,
0.789844179758131, -0.0862623926308338, -0.560150145231903, 0.697345078589853,
-0.425719796345476, 0.65321716721887, -0.878090073942596, 0.393712176214572,
0.636076899687882, 0.530184680003902, -0.567228844342952, 0.767024918145021,
-0.207303615824231, -0.332581578126777, -0.511510891217792, 0.227871326531416,
-0.0140876421179904, -0.891010911045765, -0.617225030904946,
-0.335142021445235, -0.517262524432376, 0.676301669492737, 0.375998241382333,
-0.0882899718631629, -0.154706189382, -0.108431333126633, 0.204584592662721,
0.475554538879339, 0.0840205872617279, -0.403370826694226, -0.74253555894307,
0.182570385474772, -0.484175014735265, -0.332581578126777, -0.427127748605496,
0.474119069108831, -0.0668284645696687, -0.0262098994728823,
-0.255269593134965, -0.313699742316688, -0.485612815834001, 0.302654921410147,
-0.425719796345476, 0.65321716721887, 0.393712176214572, 0.60766106412682,
0.530184680003902, 0.384135895746244, 0.564400490240421, 0.767024918145021,
0.702182602090521, 0.518699777929559, -0.281243170101218, -0.283576305897061,
0.349395372066127, -0.596629173305774, 0.0849108889395813, -0.264122555898524,
0.593855385236178, -0.418698521631085, 0.269754586702576, -0.719919005947152,
0.510072446927438, -0.0728722513945044, -0.0849108889395813,
0.0650557537775339, 0.063669188530584, -0.527315973006493, -0.716423694102939,
-0.518699777929559, 0.349395372066127, -0.518699777929559, 0.420100239642119,
-0.361262250888275, 0.431358608116332, 0.104596852632671, 0.198558626418023,
0.753386077785615, 0.418698521631085, -0.492804918290949, -0.636076899687882,
-0.294218640287997, 0.617225030904946, -0.333860575416878, -0.544494573083008,
-0.738109032540419, -0.192575818328721, -0.442688366237707, 0.455505426916992,
0.13344335621046, 0.116471711943561, 0.836830966002895, -0.125024693001636,
0.400603203290743, -0.363923100312118, -0.157741327529574, -0.281243170101218,
-0.326223812111678, -0.548774335859742, 0.104058949158278, -0.618584122089031,
-0.148779202375097, -0.543066492022212, -0.790798102797238, -0.541637702714763,
0.166337530816562, -0.431358608116332, -0.471249436107731, -0.531618297828107,
-0.135452994588696, 0.444109038883147, -0.309993792719686, 0.472684026993507,
-0.672509643334985, -0.455505426916992, -0.0304828450187082,
-0.668694956307332, 0.213036720610531, -0.370611452782498, -0.100361684849949,
-0.167940159469667, -0.256580594295053, 0.41031649686005, 0.544494573083008,
-0.675040201040299, 0.683816314193659, 0.397841906825283, 0.384135895746244,
0.634743335052317, 0.518699777929559, -0.598013765769344, -0.524445461120661,
-0.613136820153143, 0.12949974225673, -0.337712202569111, -0.189904841395243,
0.588289971863163, 0.434184796930767, -0.703385003471829, 0.505756208411145,
0.445530625978324, -0.167137309739621, 0.437015271896404, -0.550199353253537,
-0.489927553072562, -0.791748837508184, 0.434184796930767, 0.264122555898524,
-0.282408276808469, -0.574280203654524, 0.167940159469667, -0.439849854768097,
-0.604912902007957, 0.420100239642119, 0.35728693116738, 0.239220254140668,
-0.276612130560829, -0.25746444105693, 0.593855385236178, -0.632070012100074,
0.314483587504712, 0.650604627933873, -0.226860086923233, -0.702182602090521,
0.25746444105693, -0.174474012638818, 0.0166045907672774, 0.535915926945102,
0.141635395826102, 0.420100239642119, 0.557311633762293, 0.593855385236178,
0.6961287704296, 0.0444945730830079, -0.234005329233511, 0.448376452689039,
-0.86655664378954, 0.22107824319756, 0.148051654147426, 0.543066492022212,
-0.448376452689039, 0.373300918333268)), row.names = c(NA, -220L
), groups = structure(list(Position_number = c(0, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10), .rows = structure(list(1:20, 21:40, 41:60,
61:80, 81:100, 101:120, 121:140, 141:160, 161:180, 181:200,
201:220), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 11L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
confint() gets you the confidence intervals of a fitted model.
forestplot() from the forestplot R package can draw the forest plot.
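For reference, confint() on a single lm fit returns a matrix with one row per term and the interval bounds as columns ("2.5 %" and "97.5 %" by default), which is what the row/column indexing below relies on. A minimal sketch, using the df from your code:

# Fit one subsample by hand to see the shape of confint()'s output
fit0 <- lm(stm_perform ~ stm_white, data = df[df$Position_number == 0, ])
confint(fit0)                 # rows: (Intercept), stm_white; columns: 2.5 %, 97.5 %
confint(fit0)["stm_white", ]  # the interval for stm_white, selected by row name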
library(dplyr)
library(forestplot)
results <- lapply(unique(df$Position_number), function(pos) {
  fit <- filter(df, Position_number == pos) %>%
    lm(data = ., stm_perform ~ stm_white)
  stm_white_lm_index <- 2  # the second term in the lm() output is "stm_white"
  ci <- confint(fit)       # matrix of 95% bounds, one row per term
  data.frame(Position_number = pos,
             coefficient = coef(fit)[stm_white_lm_index],
             lb = ci[stm_white_lm_index, 1],  # lower bound of the confidence interval
             ub = ci[stm_white_lm_index, 2])  # upper bound of the confidence interval
}) %>% bind_rows()  # bind_rows() combines the one-row data frames from the list
with(results, forestplot(Position_number, coefficient, lb, ub))
The forest plot shows the "Position_number" labels on the left and the regression coefficients of "stm_white" plotted with their 95% confidence intervals. You can further customize the plot; see forestplot::forestplot() or the package's introductory vignette by Max Gordon for details.
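For example, a lightly customized call might look like the following. This is only a sketch: zero, boxsize, xlab, and fpColors() are standard forestplot() options, but adjust them to taste.

library(forestplot)

# Same results data frame as above, with a reference line at 0 and an axis label
with(results,
     forestplot(labeltext = Position_number,  # row labels on the left
                mean      = coefficient,      # point estimate for stm_white
                lower     = lb,
                upper     = ub,
                zero      = 0,                # vertical "no effect" line
                boxsize   = 0.2,              # size of the estimate markers
                xlab      = "Coefficient of stm_white",
                col       = fpColors(box = "royalblue", line = "darkblue")))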
I have a function that should be run for every pairwise combination: each cell from one column paired with each cell of another column.
I can do it with a loop; however, I'm looking to vectorize the process or otherwise make it faster. As it stands, it would take me days to finish.
Ideally it would use the tidyverse, but any help would be appreciated.
My loop looks like this:
results <- data.frame(
  pathSubject1 = as.character(),
  pathSubject2 = as.character())
i <- 1 # Counter first loop
j <- 1 # Counter second loop

# Loop over subject 1
for (i in 1:dim(df)[1]) {   # Start of first loop
  # Loop over subject 2
  for (j in 1:dim(df)[1]) { # Start of second loop
    # Calculate my function for this pair of subjects
    tempPercentSync <- myFunc(df$subject1[i], df$subject2[j])
    results <- rbind(
      results,
      data.frame(
        pathSubject1 = df$value[i],
        pathSubject2 = df$value[j],
        syncData = nest(tempPercentSync)))
  } # End second loop
} # End first loop
My example function:
myFunc <- function(x, y) {
  temp <- dplyr::inner_join(
    as.data.frame(x),
    as.data.frame(y),
    by = "Time")
  out <- as.data.frame(summary(temp))
}
Example of my dataset using dput:
structure(list(value = c("data/ExportECG/101_1_1_0/F010.feather",
"data/ExportECG/101_1_1_0/F020.feather"), ID = c(101, 101), run = c(1,
1), timeComing = c(1, 1), part = c(0, 0), paradigm = c("F010",
"F020"), group = c(1, 1), subject1 = list(structure(list(Time = c(0,
0.5, 1, 1.5, 2, 2.5), subject1 = c(9.73940345482368, 9.08451907157601,
8.42963468832833, 7.77475030508065, 7.11986592183298, 7.24395122629289
)), .Names = c("Time", "subject1"), row.names = c(NA, 6L), class = "data.frame"),
structure(list(Time = c(0, 0.5, 1, 1.5, 2, 2.5), subject1 = c(58.3471156751544,
75.9103303197856, 83.014068283342, 89.7923167579699, 88.6748903116088,
84.7651306939912)), .Names = c("Time", "subject1"), row.names = c(NA,
6L), class = "data.frame")), subject2 = list(structure(list(
Time = c(0, 0.5, 1, 1.5, 2, 2.5), subject2 = c(77.7776200371528,
77.4139420609906, 74.9760822165258, 75.3915183650012, 77.5672070195079,
80.7418145918357)), .Names = c("Time", "subject2"), row.names = c(NA,
6L), class = "data.frame"), structure(list(Time = c(0, 0.5, 1,
1.5, 2, 2.5), subject2 = c(101.133666720578, 105.010792226714,
107.01541987713, 104.471173834529, 97.5910271952943, 92.9840354003295
)), .Names = c("Time", "subject2"), row.names = c(NA, 6L), class = "data.frame"))), .Names = c("value",
"ID", "run", "timeComing", "part", "paradigm", "group", "subject1",
"subject2"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L))
Output should look like:
pathSubject1
1 data/ExportECG/101_1_1_0/F010.feather
2 data/ExportECG/101_1_1_0/F010.feather
3 data/ExportECG/101_1_1_0/F020.feather
4 data/ExportECG/101_1_1_0/F020.feather
pathSubject2
1 data/ExportECG/101_1_1_0/F010.feather
2 data/ExportECG/101_1_1_0/F020.feather
3 data/ExportECG/101_1_1_0/F010.feather
4 data/ExportECG/101_1_1_0/F020.feather
data
1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 20, 5, 17, 14, 8, 11, 21, 6, 19, 16, 10, 13, 22, 7, 18, 15, 9, 12
2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 21, 6, 17, 14, 8, 12, 22, 7, 19, 16, 10, 13, 20, 5, 18, 15, 9, 11
3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 20, 5, 17, 14, 8, 11, 21, 7, 19, 16, 10, 13, 22, 6, 18, 15, 9, 12
4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 21, 6, 17, 14, 8, 12, 22, 7, 19, 16, 10, 13, 20, 5, 18, 15, 9, 11
Thank you!
I think you're looking for lapply (or a related function).
What's probably taking the most time is the rbind, because at each step in your loops the entire results object gets slightly larger, which means it gets fully copied every iteration. With lapply, all results are first calculated, and only then do you combine them with dplyr::bind_rows.
What you get is this:
results <- dplyr::bind_rows(lapply(1:dim(df)[1], function(i) {
  dplyr::bind_rows(lapply(1:dim(df)[1], function(j) {
    data.frame(
      pathSubject1 = df$value[i],
      pathSubject2 = df$value[j],
      syncData = tidyr::nest(myFunc(df$subject1[[i]], df$subject2[[j]])))
  }))
}))
Does that solve your problem?
EDIT: speeding things up
I've edited to use bind_rows instead of rbind_list, which is supposed to be faster. Furthermore, if you use [[i]] instead of [i] in the call to myFunc, you can drop the as.data.frame(x) there (and likewise for j and y).
Finally, you could optimize myFunc a bit by not assigning any intermediate results:
myFunc <- function(x, y) {
as.data.frame(summary(dplyr::inner_join(x, y, by = "Time")))
}
But my gut feeling says these will be small differences. To gain more speedup we need to reduce the actual computations, and then it matters what your actual data is, and what you need for your results.
Some observations, based on your example:
Do we need separate data.frames? We compare all values in df$subject1 with all values in df$subject2. In the example, first stacking everything into one large data.frame for subject1 and another for subject2 (with an extra label column if needed) would speed up the join; see the sketch after these observations.
Why a join? Right now the summary of the join gives only information that we could have gotten without a join as well.
We join on Time, but in the example the timestamps for subject1 and subject2 are identical. A check whether they are the same, followed by simply copying the columns, would be faster.
As the end result we have a data.frame with one column containing data.frames that hold the summary of each join. Is that the form you need? I think your code could be a lot faster if you only calculate the values you actually need. I haven't worked much with data.frames nested inside data.frames, and it could well be that bind_rows doesn't handle them efficiently; a simple list as a column of your data.frame might work better, as there's less overhead.
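To make the first observation concrete, here is a minimal sketch. It assumes the structure from your dput output (df$value holds the file path, and df$subject1 / df$subject2 hold data.frames sharing a Time column); what you then compute per pair depends on what you actually need.

library(dplyr)

# Stack all subject1 tables into one long data.frame, tagged with the file path
# via the list names, and do the same for subject2
all_s1 <- bind_rows(setNames(df$subject1, df$value), .id = "pathSubject1")
all_s2 <- bind_rows(setNames(df$subject2, df$value), .id = "pathSubject2")

# A single join on Time now produces every pairwise combination at once
# (recent dplyr versions may warn about a many-to-many join; that is expected here);
# per-pair results can be recovered by grouping on the two path columns
combined <- inner_join(all_s1, all_s2, by = "Time")

# e.g. the number of matched time points per pair, instead of a summary() per pair
counts <- count(combined, pathSubject1, pathSubject2)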
Finally, it could be that you're unable to reveal more about your real data, or it's too complicated.
In that case I think you could look around for profiling tools, which can help show you where most of the time is being spent. Personally, I like the profvis tool.
Put print(profvis::profvis({ mycode }, interval = seconds)) around a block of code, and after it finishes executing you see which lines took the most time and which functions are called under the hood.
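For instance, a sketch wrapping the lapply version from above (interval is the sampling interval in seconds; 0.01 is the default):

library(profvis)

# After the wrapped code finishes, a flame graph opens showing which lines
# and which internal functions took the most time
print(profvis({
  results <- dplyr::bind_rows(lapply(1:dim(df)[1], function(i) {
    dplyr::bind_rows(lapply(1:dim(df)[1], function(j) {
      data.frame(
        pathSubject1 = df$value[i],
        pathSubject2 = df$value[j],
        syncData = tidyr::nest(myFunc(df$subject1[[i]], df$subject2[[j]])))
    }))
  }))
}, interval = 0.01))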
In the example code, almost all of the time is spent in the row binding and in making data.frames, but with real data I expect other functions may turn out to be time-consuming.