Function within loop with if statement in R - r

I am not familiar with if statements, loops, or functions in R. I have a dataset where I want to adjust a variable (N) for the clustering of the study, using the formula N / (1 + (M - 1) * ICC), where N is the number of subjects, M is the cluster size, and ICC is the intra-class correlation coefficient. These variables are in separate columns, with each row identifying a different study/sample size. Not all of the studies have a clustering issue, so I need to apply this adjustment only to the subset that has an ICC. I thought about something like the code below, but I am not sure it is right, and I also don't know whether a loop with an if statement is the most efficient way to go.
for (i in seq_along(df$N)) {   # for every row (sample size) in df
  if (!is.na(df$ICC[i])) {     # if ICC is not missing
    # adjust the sample size: divide N by 1 + (cluster size - 1) * ICC of the study
    df$N[i] <- df$N[i] / (1 + (df$M[i] - 1) * df$ICC[i])
  } else {
    df$N[i] <- df$N[i] / 1     # otherwise (ICC is missing) do nothing, i.e. divide N by 1
  }
}
Do you know how I could do this with something like the above? Other solutions are also welcome! Thanks for any help or suggestions!
Here's an example of the dataset:
dput(head(df, 10))
structure(list(ID = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5), ArmsID = c(0,
1, 0, 1, 0, 1, 0, 1, 0, 1), N = c(26, 34, 28, 27, 50, 52, 60,
65, 150, 152), Mean = c(10.1599998474121, 5.59999990463257, 8,
8.52999973297119, 17, 15.1700000762939, 48.0999984741211, 49,
57, 55.1315803527832), SD = c(6.30000019073486, 4.30000019073486,
5.6, 6.61999988555908, 6, 7.75, 10.1599998474121, 12, 11, 10.5495901107788
), SE = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), ICC = c(0.03,
0.02, NA, NA, 0.01, 0.003, NA, NA, NA, NA), M = c(5, 5, NA, NA,
17, 16, NA, NA, NA, NA)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
Missing data are coded as NA. I want to apply the adjustment to N only to the rows that have an ICC.
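To make the goal concrete (my own illustration, using the first row of the data above): with N = 26, M = 5 and ICC = 0.03, the adjusted sample size would be 26 / (1 + (5 - 1) * 0.03) = 26 / 1.12 ≈ 23.2.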
idx <- which(!is.na(df$ICC))
df$N[idx] <- df$N[idx]/(1 + (df$M[idx] - 1) * df$ICC[idx])
This code correctly works, thanks!
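For completeness, an equivalent one-step alternative (my own sketch, not part of the accepted answer) uses ifelse() so that no index vector is needed:
# Adjust N only where ICC is available; leave it unchanged otherwise
df$N <- ifelse(is.na(df$ICC), df$N, df$N / (1 + (df$M - 1) * df$ICC))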

Related

r - Error in fda::Data2fd ((a01[1] <= arng[1]) && (arng[2] <= a01[2])) { : missing value where TRUE/FALSE needed

I wrote the following code in R
library(fda)
n_curves <- 15951
n_points <- 2537
argvals <- matrix(df_l$Time, nrow = n_points, ncol = n_curves)
y_mat <- matrix(df_l$Curve, nrow = n_points, ncol = n_curves)
W.obj <- Data2fd(argvals = argvals, y = y_mat, basisobj = basis, lambda = 0.5)
But I'm getting an error
Error in if ((a01[1] <= arng[1]) && (arng[2] <= a01[2])) { :
missing value where TRUE/FALSE needed
What does it mean, and how do I prevent it?
I'm using repeated-measures data, and I'm trying to do functional data analysis. My data has a lot of missing values (NA), and I suspect the NAs are the cause of the error.
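As an aside (my illustration, not from the thread): in R, any comparison involving NA evaluates to NA, and if() cannot branch on NA, which is exactly the message shown above. So if NAs from the data propagate into the internal range check of Data2fd, this error is the expected symptom:
rng <- range(c(1, 2, NA))   # NA NA -- a single NA poisons range()
if (rng[1] <= 0) TRUE       # Error: missing value where TRUE/FALSE needed
One possible way forward (untested here) is to drop or interpolate the missing Curve values before building y_mat, so that no NA reaches Data2fd.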
data:
> dput(head(df_l, 30))
structure(list(Time = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28,
29, 30), Curve = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 5, 10, 10, 10, 10, 8, 8, 8, 8,
8, 8)), row.names = c(NA, 30L), class = "data.frame")
> dput(head(basis, 5))
list(call = basisfd(type = type, rangeval = rangeval, nbasis = nbasis,
params = params, dropind = dropind, quadvals = quadvals,
values = values, basisvalues = basisvalues), type = "bspline",
rangeval = c(0, 2537), nbasis = 53, params = c(50.74, 101.48,
152.22, 202.96, 253.7, 304.44, 355.18, 405.92, 456.66, 507.4,
558.14, 608.88, 659.62, 710.36, 761.1, 811.84, 862.58, 913.32,
964.06, 1014.8, 1065.54, 1116.28, 1167.02, 1217.76, 1268.5,
1319.24, 1369.98, 1420.72, 1471.46, 1522.2, 1572.94, 1623.68,
1674.42, 1725.16, 1775.9, 1826.64, 1877.38, 1928.12, 1978.86,
2029.6, 2080.34, 2131.08, 2181.82, 2232.56, 2283.3, 2334.04,
2384.78, 2435.52, 2486.26))

Reverse score a subset of items based on variable predefined maximums

I have longitudinal data for which I would like to reverse score a subset of items using corresponding predefined maximum scores that are stored in a separate data frame.
In the example data below (df) there are three scores, DST, SOS, and VR, at two timepoints (baseline and wave 1). neg_skew.vars contains the scores that are to be reversed across timepoints. I would like to reverse scores based on the maximum possible value for that score, as stored in df.CP1.vars$max.vars. I'd like this to work when multiple scores with different maximum values are included in neg_skew.vars.
For example, "SOS.Score" is stored in neg_skew.vars, so I want all SOS.Score variables to be reversed (i.e., across timepoints); in the example data below this includes SOS.Score.baseline and SOS.Score.wave1. I want each value to be reversed using the corresponding maximum score for SOS, i.e., (20 + 1) - value, where 20 is the maximum value for SOS stored in df.CP1.vars. As DST is also negatively skewed, all DST scores (DST.Score.baseline and DST.Score.wave1) should be reversed as well, but with 16 as the maximum value, per df.CP1.vars, so (16 + 1) - value. This results in the desired data frame df_wanted below. VR.Score does not appear in neg_skew.vars, so no VR.Score variables (VR.Score.baseline, VR.Score.wave1) are reversed.
So far I have the code listed below under # reverse scores; however, this produces two undesired outcomes in the resulting data frame (df2): A) the columns for scores that are not being reversed (e.g., VR) are not retained, and B) the maximum used to reverse items is the observed maximum of that column (i.e., for that item at that timepoint) rather than the predefined maximum, which is a problem as the data is longitudinal.
The desired data should look like df_wanted below. I tried to set up a for loop but ran into problems combining it with the dplyr pipeline.
# required packages
library(dplyr)
# create relevant variables and data sets
CP1.vars <- c("DST.Score","SOS.Score", "VR.Score")
max.vars <- c(16,20,80)
df.CP1.vars <- data.frame(CP1.vars, max.vars)
df <- structure(list(
SOS.Score.baseline = c(4, 11, 7, 9, 10, 8, 6, 8, 7, 0, 9, 10),
SOS.Score.wave1 = c(NA, 7.5, 8.5, NA, NA, 6.66, NA, 6, 8, 8, 7, 8),
DST.Score.baseline = c(11, 10, 8, 8, 8, 8, 9, 9, 7, 6, 7, 6),
DST.Score.wave1 = c(NA, 10, 8.5, NA, NA, 8, NA, 9.33, 9, 7, 8, 8),
VR.Score.baseline = c(NA, 60, 38.5, 50, NA, 48, NA, 33, 49, 67, 78, 80),
VR.Score.wave1 = c(NA, 58, 38.5, NA, NA, 40, NA, 35, 49, 67, 78, 78)),
row.names = c(NA, 12L), class = "data.frame")
neg_skew.vars <- c("SOS.Score", "DST.Score")
# reverse scores
df2 <- df %>%
  select(contains(neg_skew.vars)) %>%
  mutate(across(everything(), ~ max(., na.rm = TRUE) + 1 - ., .names = "{.col}_r"))
# desired outcome (order of variables irrelevant)
df_wanted <- structure(list(
SOS.Score.baseline = c(4, 11, 7, 9, 10, 8, 6, 8, 7, 0, 9, 10),
SOS.Score.wave1 = c(NA, 7.5, 8.5, NA, NA, 6.66, NA, 6, 8, 8, 7, 8),
SOS.Score.baseline_r = c(17, 10, 14, 12, 11, 13, 15, 13, 14, 21, 12, 11),
SOS.Score.wave1_r = c(NA, 13.5, 12.5, NA, NA, 14.34, NA, 15, 13, 13, 14, 13),
DST.Score.baseline = c(11, 10, 8, 8, 8, 8, 9, 9, 7, 6, 7, 6),
DST.Score.wave1 = c(NA, 10, 8.5, NA, NA, 8, NA, 9.33, 9, 7, 8, 8),
DST.Score.baseline_r = c(6, 7, 9, 9, 9, 9, 8, 8, 10, 11, 10, 11),
DST.Score.wave1_r = c(NA, 7, 8.5, NA, NA, 9, NA, 7.67, 8, 10, 9, 9),
VR.Score.baseline = c(NA, 60, 38.5, 50, NA, 48, NA, 33, 49, 67, 78, 80),
VR.Score.wave1 = c(NA, 58, 38.5, NA, NA, 40, NA, 35, 49, 67, 78, 78)),
row.names = c(NA,12L), class = "data.frame")
You can use purrr::map_dfc to loop over neg_skew.vars, look up the maximum value directly from df.CP1.vars, and then bind the resulting data frame to the columns that remain unchanged.
library(tidyverse)
library(purrr)
df2 <- neg_skew.vars %>%
  map_dfc(function(a) df %>%
            select(matches(a)) %>%
            mutate(across(everything(),
                          ~ df.CP1.vars$max.vars[df.CP1.vars$CP1.vars == a] + 1 - .,
                          .names = "{.col}_r"))) %>%
  bind_cols(df %>% select(!contains(neg_skew.vars)))
This indeed leads to the desired outcome:
identical(df2, df_wanted)
#[1] TRUE
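An alternative sketch that stays inside a single mutate() (my addition; it assumes the timepoint suffixes are always .baseline or .wave1) uses across() with cur_column() to look up the maximum from the column's stem:
library(dplyr)
# named lookup of maxima: c(DST.Score = 16, SOS.Score = 20, VR.Score = 80)
max.lookup <- setNames(df.CP1.vars$max.vars, df.CP1.vars$CP1.vars)
df2_alt <- df %>%
  mutate(across(contains(neg_skew.vars),
                ~ max.lookup[[sub("\\.(baseline|wave1)$", "", cur_column())]] + 1 - .x,
                .names = "{.col}_r"))
The column order differs from df_wanted (the _r columns are appended at the end), so identical() will not be TRUE, but the values should match.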
Data:
# create relevant variables and data sets
CP1.vars <- c("DST.Score","SOS.Score", "VR.Score")
max.vars <- c(16,20,80)
df.CP1.vars <- data.frame(CP1.vars, max.vars)
df <- structure(list(
SOS.Score.baseline = c(4, 11, 7, 9, 10, 8, 6, 8, 7, 0, 9, 10),
SOS.Score.wave1 = c(NA, 7.5, 8.5, NA, NA, 6.66, NA, 6, 8, 8, 7, 8),
DST.Score.baseline = c(11, 10, 8, 8, 8, 8, 9, 9, 7, 6, 7, 6),
DST.Score.wave1 = c(NA, 10, 8.5, NA, NA, 8, NA, 9.33, 9, 7, 8, 8),
VR.Score.baseline = c(NA, 60, 38.5, 50, NA, 48, NA, 33, 49, 67, 78, 80),
VR.Score.wave1 = c(NA, 58, 38.5, NA, NA, 40, NA, 35, 49, 67, 78, 78)),
row.names = c(NA, 12L), class = "data.frame")
neg_skew.vars <- c("SOS.Score", "DST.Score")

StyleInterval with cut in column

I recently started exploring DT and I am stuck on something. Imagine the following table:
library(data.table)
dt <- data.table(group    = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 group2   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                 interval = c(NA, NA, 100, NA, NA, 150, NA, NA, 100),
                 value1   = c(1000, 10, 90, 2000, 30, 120, 1500, 25, 150),
                 value2   = c(1200, 10, 110, 2500, 35, 145, 2200, 40, 90))
Now I want to create a datatable with a style that checks the values in value1 and value2 and compares them with the value in the interval column. I tried something like this:
library(DT)
datatable(dt) %>% formatStyle(
  columns = c("value1", "value2"),
  backgroundColor = styleInterval(interval, c("red", "green"))
)
But interval is not recognized as an object, which leads me to believe that I cannot pass a column as the cuts argument. I also tried to pass some kind of function to valueColumns, but that didn't seem to be possible either.
Expected output: (screenshot of the desired highlighted table omitted)
It makes no sense to pass the full column to styleInterval: it requires n values for cuts and n + 1 values for values. Try the alternative below instead:
myCut <- sort(unique(dt$interval))
myCol <- rainbow(length(myCut) + 1)
formatStyle(datatable(dt),
            columns = c("value1", "value2"),
            backgroundColor = styleInterval(myCut, myCol))
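With this data myCut is c(100, 150), so (as I read the styleInterval help) values of value1/value2 up to the first cut get the first rainbow colour, values between the cuts the second, and values above 150 the third. Note that this still colours by fixed global cut points; a true row-by-row comparison against the interval column is not something styleInterval can express, as stated above.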

Progression of non-missing values that have missing values in-between

To continue on a previous topic:
Finding non-missing values between missing values
I would like to also find whether the value before the missing value(s) is smaller than, equal to, or larger than the one after the missing value(s).
To use the same example from before:
df = structure(list(FirstYStage = c(NA, 3.2, 3.1, NA, NA, 2, 1, 3.2,
3.1, 1, 2, 5, 2, NA, NA, NA, NA, 2, 3.1, 1), SecondYStage = c(NA,
3.1, 3.1, NA, NA, 2, 1, 4, 3.1, 1, NA, 5, 3.1, 3.2, 2, 3.1, NA,
2, 3.1, 1), ThirdYStage = c(NA, NA, 3.1, NA, NA, 3.2, 1, 4, NA,
1, NA, NA, 3.2, NA, 2, 3.2, NA, NA, 2, 1), FourthYStage = c(NA,
NA, 3.1, NA, NA, NA, 1, 4, NA, 1, NA, NA, NA, 4, 2, NA, NA, NA,
2, 1), FifthYStage = c(NA, NA, 2, NA, NA, NA, 1, 5, NA, NA, NA,
NA, 3.2, NA, 2, 3.2, NA, NA, 2, 1)), class = c("tbl_df", "tbl",
"data.frame"), row.names = c(NA, -20L))
Rows 13, 14 and 16 have non-missing values in between missing values. The output this time should be "same", "larger" and "same" for rows 13, 14 and 16, and say "N/A" for the other rows.
A straightforward approach would be to split, convert to numeric, take the last two values and compare them with an ifelse statement, i.e.
sapply(strsplit(do.call(paste, df)[c(13, 14, 16)], 'NA| '), function(i) {
  v1 <- as.numeric(tail(i[i != ''], 2))
  ifelse(v1[1] > v1[2], 'greater',
         ifelse(v1[1] == v1[2], 'same', 'smaller'))
})
#[1] "same" "smaller" "same"
NOTE
I took the previous answer as given (do.call(paste, df)[c(13, 14, 16)]).
A more generic approach (as noted by Ronak, taking the last two values will fail in some cases) would be:
sapply(strsplit(gsub("([[:digit:]])+\\s+[NA]+\\s+([[:digit:]])", '\\1_\\2',
                     do.call(paste, df)[c(13, 14, 16)]), ' '), function(i) {
  v1 <- i[grepl('_', i)]
  v2 <- strsplit(v1, '_')[[1]]
  ifelse(v2[1] > v2[2], 'greater',
         ifelse(v2[1] == v2[2], 'same', 'smaller'))
})
#[1] "same" "smaller" "same"

Stratify then impute in R - using mi()

I want to "stratify-then-impute" using the packages available in R.
That is, I am hoping to:
1) stratify my dataset using a binary variable called "arm". This variable has no missing data.
2) run an imputation model for the two subsets
3) combine the two imputed data sets
4) run a pooled analysis.
My dataset looks like:
dataSim <- structure(list(pid = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20), arm = c(0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1), X1 = c(0.1, NA, 0.51,
0.56, -0.82, NA, NA, NA, -0.32, 0.4, 0.58, NA, 0.22, -0.23, 1.49,
-1.88, -1.77, -0.94, NA, -1.34), X2 = c(NA, -0.13, NA, 1.2, NA,
NA, NA, 0.02, -0.04, NA, NA, 0.25, -0.81, -1.67, 1.01, 1.69,
-0.06, 0.07, NA, -0.11)), .Names = c("pid", "arm", "X1", "X2"
), row.names = c(NA, 20L), class = "data.frame")
To impute the data, I'm currently using the mi() function as follows:
library(mi)
data.1 <- dataSim[dataSim[,"arm"]==1,]
data.0 <- dataSim[dataSim[,"arm"]==0,]
data.miss.1 <- missing_data.frame(data.1)
data.miss.0 <- missing_data.frame(data.0)
imputations.1 <- mi(data.1, n.iter=5, n.chains=5, max.minutes=20, parallel=FALSE)
imputations.0 <- mi(data.0, n.iter=5, n.chains=5, max.minutes=20, parallel=FALSE)
complete(imputations.1) # viewing the imputed datasets
complete(imputations.0)
Then I don't know how to combine the 2 imputations in order to do a pooled analysis. I have unsuccessfully tried:
imputations <- rbind(imputations.0, imputations.1) # This doesn't work
# analysis.X1 <- pool(X1 ~ arm, data = imputations ) # This is what I want to run
I assume this stratified approach is a simplified version of including an interaction term when imputing, but I don't know how to do that either.
Thanks
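One possible route, sketched here as an untested illustration rather than an answer from the thread: it assumes complete() returns one completed data frame per chain (a list), and it pools the coefficient for arm by hand with Rubin's rules instead of a pool() call.
library(mi)
imp.list.1 <- complete(imputations.1)   # assumed: list of completed data frames, arm == 1
imp.list.0 <- complete(imputations.0)   # assumed: list of completed data frames, arm == 0
m <- length(imp.list.1)
# Pair up the j-th completed data set from each arm, stack them, fit the analysis model
fits <- lapply(seq_len(m), function(j) {
  dat <- rbind(imp.list.0[[j]][, c("pid", "arm", "X1", "X2")],
               imp.list.1[[j]][, c("pid", "arm", "X1", "X2")])
  lm(X1 ~ arm, data = dat)
})
est <- sapply(fits, function(f) coef(f)["arm"])
se  <- sapply(fits, function(f) summary(f)$coefficients["arm", "Std. Error"])
# Rubin's rules: pooled estimate, within-, between- and total variance
qbar <- mean(est)
W    <- mean(se^2)
B    <- var(est)
Tvar <- W + (1 + 1/m) * B
c(estimate = qbar, se = sqrt(Tvar))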

Resources