Using R to create forest plots of coefficients for regressions on subsamples - r

I have a dataset of chess positions made up of 100 groups, with each group taking one of 50 positions ("Position_number") and one of two colours ("stm_white"). I want to run a linear regression for each Position_number subsample, where stm_white is the explanatory variable and stm_perform is the outcome variable. Then, I want to display the coefficient of stm_white and the associated confidence interval for each regression in a forest plot. The idea is to be able to easily see which Position_number subsample gives significant coefficients for stm_white, and to compare coefficients across positions. For example, the plot would have 50 y-axis categories labelled with each position number, the x-axis would represent the coefficient range, and the plot would display a horizontal confidence bar for each position number.
Where I'm stuck:
Getting the confidence interval bounds for each regression
Plotting each of the 50 coefficients (with confidence intervals) on one plot. (I think this is called a forest plot?)
This is how I current get a list of the coefficients for each regression:
fits <- by(df, df[,"Position_number"],
function(x) lm(stm_perform ~ stm_white, data = x))
# Combine coefficients from each model
do.call("rbind", lapply(fits, coef))
And here is a sample of 10 positions (apologies if there's a better way to show reproducible data):
>dput(droplevels(dfMWE[,c("Position_number","stm_white","stm_perform")]))
structure(list(Position_number = c(0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,
6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9,
9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10,
10, 10, 10, 10, 10, 10), stm_white = c(0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1), stm_perform = c(0.224847134350316, -0.252000458803946,
0.263005239459311, -0.337712202569111, 0.525880930891169, -0.5,
0.514387184165999, 0.520136722035817, -0.471249436107731, -0.557311633762293,
-0.382774969095054, -0.256365477992672, -0.592466230584332, 0.420100239642119,
0.35728693116738, -0.239203909010858, 0.492804918290949, -0.377349804212738,
0.498560888290847, 0.650604627933873, 0.244481117928803, 0.225852022298169,
0.448376452689039, 0.305090287270497, 0.275461757157464, 0.0232950364735793,
-0.117225030904946, 0.103523492101814, 0.098301745397805, 0.435599509759579,
-0.323024628921732, -0.790798102797238, 0.326223812111678, -0.331305043692668,
0.300230596737942, -0.340292005855252, 0.196181480575316, -0.0606495585093978,
0.789844179758131, -0.0862623926308338, -0.560150145231903, 0.697345078589853,
-0.425719796345476, 0.65321716721887, -0.878090073942596, 0.393712176214572,
0.636076899687882, 0.530184680003902, -0.567228844342952, 0.767024918145021,
-0.207303615824231, -0.332581578126777, -0.511510891217792, 0.227871326531416,
-0.0140876421179904, -0.891010911045765, -0.617225030904946,
-0.335142021445235, -0.517262524432376, 0.676301669492737, 0.375998241382333,
-0.0882899718631629, -0.154706189382, -0.108431333126633, 0.204584592662721,
0.475554538879339, 0.0840205872617279, -0.403370826694226, -0.74253555894307,
0.182570385474772, -0.484175014735265, -0.332581578126777, -0.427127748605496,
0.474119069108831, -0.0668284645696687, -0.0262098994728823,
-0.255269593134965, -0.313699742316688, -0.485612815834001, 0.302654921410147,
-0.425719796345476, 0.65321716721887, 0.393712176214572, 0.60766106412682,
0.530184680003902, 0.384135895746244, 0.564400490240421, 0.767024918145021,
0.702182602090521, 0.518699777929559, -0.281243170101218, -0.283576305897061,
0.349395372066127, -0.596629173305774, 0.0849108889395813, -0.264122555898524,
0.593855385236178, -0.418698521631085, 0.269754586702576, -0.719919005947152,
0.510072446927438, -0.0728722513945044, -0.0849108889395813,
0.0650557537775339, 0.063669188530584, -0.527315973006493, -0.716423694102939,
-0.518699777929559, 0.349395372066127, -0.518699777929559, 0.420100239642119,
-0.361262250888275, 0.431358608116332, 0.104596852632671, 0.198558626418023,
0.753386077785615, 0.418698521631085, -0.492804918290949, -0.636076899687882,
-0.294218640287997, 0.617225030904946, -0.333860575416878, -0.544494573083008,
-0.738109032540419, -0.192575818328721, -0.442688366237707, 0.455505426916992,
0.13344335621046, 0.116471711943561, 0.836830966002895, -0.125024693001636,
0.400603203290743, -0.363923100312118, -0.157741327529574, -0.281243170101218,
-0.326223812111678, -0.548774335859742, 0.104058949158278, -0.618584122089031,
-0.148779202375097, -0.543066492022212, -0.790798102797238, -0.541637702714763,
0.166337530816562, -0.431358608116332, -0.471249436107731, -0.531618297828107,
-0.135452994588696, 0.444109038883147, -0.309993792719686, 0.472684026993507,
-0.672509643334985, -0.455505426916992, -0.0304828450187082,
-0.668694956307332, 0.213036720610531, -0.370611452782498, -0.100361684849949,
-0.167940159469667, -0.256580594295053, 0.41031649686005, 0.544494573083008,
-0.675040201040299, 0.683816314193659, 0.397841906825283, 0.384135895746244,
0.634743335052317, 0.518699777929559, -0.598013765769344, -0.524445461120661,
-0.613136820153143, 0.12949974225673, -0.337712202569111, -0.189904841395243,
0.588289971863163, 0.434184796930767, -0.703385003471829, 0.505756208411145,
0.445530625978324, -0.167137309739621, 0.437015271896404, -0.550199353253537,
-0.489927553072562, -0.791748837508184, 0.434184796930767, 0.264122555898524,
-0.282408276808469, -0.574280203654524, 0.167940159469667, -0.439849854768097,
-0.604912902007957, 0.420100239642119, 0.35728693116738, 0.239220254140668,
-0.276612130560829, -0.25746444105693, 0.593855385236178, -0.632070012100074,
0.314483587504712, 0.650604627933873, -0.226860086923233, -0.702182602090521,
0.25746444105693, -0.174474012638818, 0.0166045907672774, 0.535915926945102,
0.141635395826102, 0.420100239642119, 0.557311633762293, 0.593855385236178,
0.6961287704296, 0.0444945730830079, -0.234005329233511, 0.448376452689039,
-0.86655664378954, 0.22107824319756, 0.148051654147426, 0.543066492022212,
-0.448376452689039, 0.373300918333268)), row.names = c(NA, -220L
), groups = structure(list(Position_number = c(0, 1, 2, 3, 4,
5, 6, 7, 8, 9, 10), .rows = structure(list(1:20, 21:40, 41:60,
61:80, 81:100, 101:120, 121:140, 141:160, 161:180, 181:200,
201:220), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 11L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))

confint() can get you the confidence interval of a model.
forestplot() from the forestplot R package can make you a forest plot.
library(dplyr)
library(forestplot)
results <- lapply(unique(df$Position_number), function(pos) {
fit = filter(df, Position_number == pos) %>%
lm(data = ., stm_perform ~ stm_white)
stm_white_lm_index = 2 # the second term in lm() output is "stm_white"
coefficient = coef(fit)[stm_white_lm_index]
lb = confint(fit)[stm_white_lm_index,1] # lower bound confidence
ub = confint(fit)[stm_white_lm_index,2] # upper bound confidence
output = data.frame(Position_number = pos, coefficient, lb, ub)
return(output)
}) %>% bind_rows() # bind_rows() combines output from each model in the list
with(results, forestplot(Position_number, coefficient, lb, ub))
The forest plot shows the "Position_number" labels on the left and the regression coefficients of "stm_white" with the 95% confidence intervals plotted. You can further customize the plot. See forestplot::forestplot() or this introduction by Max Gordon for details.

Related

R error when trying to cluster data using pam (package cluster)

I am trying to run k-means clustering on a data set which was preprocessed (categorical to dummy, na cleaning etc.).
here is an extract (head) of the data:
dput(head(clustering.set.in))
structure(list(activity_type = c(1, 1, 1, 1, 1, 1), app_id.PXkw7OJ1se = c(0,
1, 1, 1, 1, 0), app_id.PXszbKVa5M = c(0, 0, 0, 0, 0, 0), app_id.PXw3GFQKBm = c(1,
0, 0, 0, 0, 0), browser_version = c(48, 42, 9, 9, 48, 44), continent.AS = c(0,
1, 1, 0, 0, 0), continent.EU = c(0, 0, 0, 0, 1, 0), continent.SA = c(0,
0, 0, 0, 0, 0), f_activex = c(1, 1, 1, 1, 1, 1), f_atob = c(2,
2, 2, 2, 2, 2), f_audio = c(2, 2, 2, 2, 2, 2), f_battery = c(2,
2, 1, 1, 2, 2), f_bind = c(2, 2, 2, 2, 2, 2), f_flash = c(1,
2, 2, 2, 2, 2), f_getComputedStyle = c(2, 2, 2, 2, 2, 2), f_matchSelector = c(2,
2, 2, 2, 2, 2), f_mimeTypes = c(2, 2, 2, 2, 2, 2), f_mimeTypesLength = c(0,
8, 11, 55, 7, 8), f_navigationTiming = c(2, 2, 1, 2, 2, 2), f_orientationEvents = c(2,
1, 1, 1, 1, 1), f_plugins = c(2, 2, 2, 2, 2, 2), f_pluginsLength = c(0,
6, 6, 15, 5, 6), f_raf = c(2, 2, 2, 2, 2, 2), f_resourceTiming = c(2,
2, 1, 1, 2, 2), f_sse = c(2, 2, 2, 2, 2, 2), f_webgl = c(1, 2,
2, 2, 2, 1), f_websql = c(1, 2, 2, 2, 2, 2), f_xdr = c(1, 1,
1, 1, 1, 1), n_appCodeName = c(2, 2, 2, 2, 2, 2), n_doNotTrack = c(2,
2, 1, 2, 2, 2), n_geolocation = c(2, 2, 2, 2, 2, 2), n_mimeTypes = c(2,
2, 2, 2, 2, 2), n_platform.iPhone = c(0, 0, 0, 0, 0, 0), n_platform.Linux.armv7l = c(1,
0, 0, 0, 0, 0), n_platform.MacIntel = c(0, 0, 1, 1, 0, 0), n_platform.Win32 = c(0,
1, 0, 0, 1, 0), n_plugins = c(2, 2, 2, 2, 2, 2), n_product.Sub20030107 = c(1,
1, 1, 1, 1, 1), n_product.Sub20100101 = c(0, 0, 0, 0, 0, 0),
n_product.Submissing = c(0, 0, 0, 0, 0, 0), os_family.Android = c(1,
0, 0, 0, 0, 0), os_family.iOS = c(0, 0, 0, 0, 0, 0), os_family.Mac.OS.X = c(0,
0, 1, 1, 0, 0), os_family.Windows = c(0, 1, 0, 0, 1, 0),
os_version = c(6, 8.1, 10, 10, 7, 0), site_history_length = c(31,
1, 1, 1, 1, 1), w_chrome...loadTimes....csi....app....webstore....runtime.. = c(0,
1, 0, 0, 1, 0), w_chrome...loadTimes....csi.. = c(0, 0, 0,
0, 0, 0), w_chrome... = c(1, 0, 1, 1, 0, 0), window_dimensions = c(2,
1, 2, 2, 2, 2), window_history = c(50, 1, 1, 1, 1, 3)), .Names = c("activity_type",
"app_id.PXkw7OJ1se", "app_id.PXszbKVa5M", "app_id.PXw3GFQKBm",
"browser_version", "continent.AS", "continent.EU", "continent.SA",
"f_activex", "f_atob", "f_audio", "f_battery", "f_bind", "f_flash",
"f_getComputedStyle", "f_matchSelector", "f_mimeTypes", "f_mimeTypesLength",
"f_navigationTiming", "f_orientationEvents", "f_plugins", "f_pluginsLength",
"f_raf", "f_resourceTiming", "f_sse", "f_webgl", "f_websql",
"f_xdr", "n_appCodeName", "n_doNotTrack", "n_geolocation", "n_mimeTypes",
"n_platform.iPhone", "n_platform.Linux.armv7l", "n_platform.MacIntel",
"n_platform.Win32", "n_plugins", "n_product.Sub20030107", "n_product.Sub20100101",
"n_product.Submissing", "os_family.Android", "os_family.iOS",
"os_family.Mac.OS.X", "os_family.Windows", "os_version", "site_history_length",
"w_chrome...loadTimes....csi....app....webstore....runtime..",
"w_chrome...loadTimes....csi..", "w_chrome...", "window_dimensions",
"window_history"), row.names = c(NA, 6L), class = "data.frame")
I am trying to cluster kmeans this data sets (k=2)
and getting error message:
Error in pam(clustering.set.in, k) :
negative length vectors are not allowed
my line of code:
pam(clustering.set.in, 2)
Any suggestions ?
it turns out that one column has na values in it.
Removed it with
new.data[is.na(new.data)] <- 1
and it seems to work fine now

Partial Credit Model: How to calculate the item difficulty?

I have a dataframe with credits of participants for several items. I would like to calculate the item difficulty of those items. With the eRm Package I can calculate the difficulty for the different categories of each item:
Some data:
x1 <- c(1, 2, 1, 0, 3, 3, 0, 4, 4, 1, 0, 3, 2, 0, 4, 1, NA, 1, 1, NA, 0, 1, 2, 1, 1, 3, 0, 2, 1, 0)
x2 <- c(0, 1, 0, 3, 2, 0, 1, 2, 2, NA, 0, 1, 2, 2, NA, 1, 2, 1, 2, 1, 0, 2, 3, 0, 1, 1, 0, 1, 1, 3)
x3 <- c(NA, NA, 3, 0, 1, 2, 0, 1, 1, NA, 3, 0, 1, 2, 0, 1, 2, 1, 0, 1, 3, 1, 3, 0, 1, 1, 0, 1, 1, 0)
x4 <- c(3, 0, 2, 2, 3, 2, 1, 2, 0, 0, 1, 0, 1, 1, 0, 1, 2, 1, 1, 2, 0, 1, 1, 2, 1, 1, 0, 1, 1, 0)
x5 <- c(1, NA, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, NA, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0)
dat <- data.frame(x1, x2, x3, x4, x5)
library("eRm")
Calculation of category difficulties:
PCM(dat)
PCM(dat)$etapar
I do not need the difficulty for the categories, but for the whole item. How can I calculate the overall difficulty of each item?
Thank you very much in advance!

What is causing these upticks in this ggplot?

Example data:
df <- structure(
list(
group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 7, 7, 7, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11),
val = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000141111115226522, 0, 0, 0, 0.00127000000793487, 0.00070555554702878, 0.000141111115226522, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00127000000793487, 0.000282222230453044, 0, 0, 0.000141111115226522, 0.000282222230453044, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.00070555554702878, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.000141111115226522, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
),
.Names = c("group", "val"),
row.names = c(NA, -1000L),
class = c("data.table", "data.frame"),
.internal.selfref = <pointer: 0xe4d468>
)
I can plot this as a timeseries per group, like so:
ggplot(df, aes(x=unlist(by(val, group, seq_along)), y=val, group=group)) +
geom_line(alpha=0.5)
but I want to plot a rolling mean of the data, like so:
library(zoo)
ggplot(df, aes(x=unlist(by(val, group, seq_along)),
y=rollmean(val, 48, fill=NA), group=group)) +
geom_line(alpha=0.5)
But this adds upticks to the end of each line, that do not exist in the data:
The upticks at 130 and 670 do not exist in the data, nor do they exist in the rolling mean, as you can see with rollmean(df[group==5, val], 48, fill=NA). So what is causing them?
The first uptick does occur at exactly 132. In your rolling mean, you chose the default align, which sets it to center, meaning that each point is the mean of the previous k/2 and the future k/2 points. Since you set k=48, it means that point 132 will be the mean of (132-24):(132+24). You can verify that the first non-zero point is indeed 156.
# First non-zero value
min(which(df$val!=0))
# 156
You can also verify that the first non-zero value in the rolling mean is 132.
df$rollmean <- rollmean(df$val, 48, fill=NA)
min(which(df$rollmean!=0))
# 132
Additionally, it looks like you are applying your rolling mean across all groups, which you almost certainly don't want. Try splitting by group, like you did with by to create the time variable. Here is an example:
# Set a time variable before hand
df$time <- with(df, unlist(by(val, group, seq_along)))
df$group <- as.factor(df$group)
k=48
# Remove those groups wtihout enough values for rolling mean of k window
df.subset <- df[df$group %in% names(which(table(df$group) >= k)),]
# Calculate a rolling mean on each group
df.subset$rollmean <- unlist(by(df.subset$val, df.subset$group, FUN=rollmean, k=k, fill=NA))
# Plot
ggplot(df.subset, aes(x=time,
y=rollmean,
colour=group)) + geom_line()

Aggregate to compute the percentage of non-zero rows per group

What's the simplest way to compute the percentage of rows (1) containing ones and (2) containing zeros, per group?
Here's some small example data:
dat <- structure(list(rs = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0), group = c(3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1)), .Names = c("rs", "group"), row.names = c(NA,
-62L), class = "data.frame")
Here's what I've got so far (don't laugh!):
require(plyr)
tab <- as.data.frame(table(dat))
dc <- dcast(tab, group ~ rs)
dc <- dc[,-1]
dc[] <- lapply(dc, as.numeric)
data.frame(prop.table(as.matrix(dc), 1))
Which works fine:
X0 X1
1 1.0000000 0.00000000
2 0.8787879 0.12121212
3 0.9285714 0.07142857
But I'm sure there's a method that requires less typing.
Solutions with plyr and data.table most welcome.
table almost does what you want. Convert to ratios by dividing each set of values by its sum:
t(apply(table(dat), 2, function(x) x/sum(x)))
## group 0 1
## 1 1.0000000 0.00000000
## 2 0.8787879 0.12121212
## 3 0.9285714 0.07142857

how to define fill colours in ggplot histogram?

I have the following simple data
data <- structure(list(status = c(9, 5, 9, 10, 11, 10, 8, 6, 6, 7, 10,
10, 7, 11, 11, 7, NA, 9, 11, 9, 10, 8, 9, 10, 7, 11, 9, 10, 9,
9, 8, 9, 11, 9, 11, 7, 8, 6, 11, 10, 9, 11, 11, 10, 11, 10, 9,
11, 7, 8, 8, 9, 4, 11, 11, 8, 7, 7, 11, 11, 11, 6, 7, 11, 6,
10, 10, 9, 10, 10, 8, 8, 10, 4, 8, 5, 8, 7), statusgruppe = c(0,
0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, NA, 0, 1, 0, 1,
0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1,
1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0,
1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0)), .Names = c("status",
"statusgruppe"), class = "data.frame", row.names = c(NA, -78L
))
from that I'd like to make a histogram:
ggplot(data, aes(status))+
geom_histogram(aes(y=..density..),
binwidth=1, colour = "black",
fill="white")+
theme_bw()+
scale_x_continuous("Staus", breaks=c(min(data$status,na.rm=T), median(data$status, na.rm=T), max(data$status, na.rm=T)),labels=c("Low", "Middle", "High"))+
scale_y_continuous("Percent", formatter="percent")
Now - i'd like for the bins to take colou according to value - e.g. bins with value > 9 gets dark grey - everything else should be light grey.
I have tried with fill=statusgruppe, scale_fill_grey(breaks=9) etc. - but I can't get it to work. Any ideas?
Hopefully this should get you started:
ggplot(data, aes(status, fill = ..x..))+
geom_histogram(binwidth = 1) +
scale_fill_gradient(low = "black", high = "white")
ggplot(data, aes(status, fill = ..x.. > 9))+
geom_histogram(binwidth = 1) +
scale_fill_grey()
How about using fill=..count.. or fill=I(..count..>9) right after y=..density..? You have to tinker with the legend title and labels a bit, but it gets the coloring right.
EDIT:
It seems I misunderstood your question a bit. If you want to define color based on the x-coordinate, you can use the ..x.. automatic variable similarly.
What about scale_manual? Here's link to Hadley's site. I've used this function to set an appropriate fill colour for a boxplot. Not sure if it'll work with histogram, though...

Resources