Reshaping data for use with geeglm() - r

Could you please help me figure out why I am getting an error?
Initially my data looks like this:
> attributes(compl)$names
[1] "UserID" "compl_bin" "Sex.x" "PHQ_base" "PHQ_Surv1" "PHQ_Surv2" "PHQ_Surv3"
[8] "PHQ_Surv4" "EFE" "Neuro" "Intervention.x" "depr0" "error1_1.x" "error1_2.x"
[15] "error1_3.x" "error1_4.x" "stress0" "stress1" "stress2" "stress3" "stress4"
[22] "hours1" "hours2" "hours3" "hours4" "subject"
First I reshape my data to prepare for geeglm:
compl$subject <- factor(rownames(compl))
nobs <- nrow(compl)
compl_long <- reshape(compl, idvar = "subject",
varying = list(c("PHQ_Surv1", "PHQ_Surv2" ,
"PHQ_Surv3", "PHQ_Surv4"),
c("error1_1.x", "error1_2.x",
"error1_3.x", "error1_4.x"),
c("stress1", "stress2", "stress3",
"stress4"),
c("hours1", "hours2", "hours3",
"hours4")),
v.names = c("PHQ", "error", "stress", "hours"),
times = c("1", "2", "3", "4"), direction = "long")
The reshaped data (compl_long) then has the following columns:
[1] "UserID" "compl_bin" "Sex.x" "PHQ_base" "EFE" "Neuro" "Intervention.x"
[8] "depr0" "stress0" "subject" "time" "PHQ" "error" "stress"
[15] "hours"
Then I use geeglm function:
library(geepack)
geeSand=(geeglm(PHQ~as.factor(compl_bin) + Neuro+PHQ_base+as.factor(depr0) +
EFE+as.factor(Sex.x) + as.factor(error)+stress+hours,
family = poisson, data=compl_long,
id=subject, corst="exchangeable"))
I am getting an error:
"Error in geese.fit(xx, yy, id, offset, soffset, w, waves = waves, zsca, :
nrow(zsca) and length(y) not match"
If I remove the variables as.factor(error) and hours, geeglm does not complain, and I get the output. The function does not work with the error and hours variables. I checked the lengths of all the variables, and they are equal. Could you please help me figure out what is wrong?
Many thanks!

Found this at: https://stat.ethz.ch/pipermail/r-help/2008-October/178337.html
"I'm pretty sure this is a bug in geese(), which should be reported to the maintainer of geepack. The problem is with the treatment of missing values.
If [one] looks at dim(na.omit(dat[,c("id","score","chem","time")])) one gets 44. In geese.fit() zsca is set equal to matrix(1,N,1) where N is set equal to length(id). But id has length 46 whereas the response y has been trimmed down to length 44 by eliminating any rows of the data where any of the variables involved are missing. Hence a problem.
The solution of the problem requires some code re-writing by the maintainer of geepack."
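Going by the diagnosis quoted above, one possible workaround is to drop the incomplete rows yourself before fitting, so that id and the response end up with the same length. The sketch below uses a small made-up data frame (toy_long and all of its values are illustrative, not the poster's data). The geeglm() call is left commented out because it needs geepack installed. The sort by subject is included because geeglm() expects the rows of each cluster to be contiguous, which is not the case straight after reshape().

```r
# Toy long-format data standing in for compl_long (values are made up)
toy_long <- data.frame(
  subject = factor(rep(1:4, each = 3)),
  PHQ     = c(3, 4, NA, 2, 2, 1, 5, NA, 6, 0, 1, 1),
  stress  = c(1, 2, 2, NA, 1, 1, 3, 3, 2, 0, 0, 1),
  hours   = c(60, 65, 70, 50, NA, 55, 80, 75, 70, 40, 45, 40)
)

# Keep only rows that are complete for every variable used in the model
model_vars   <- c("subject", "PHQ", "stress", "hours")
toy_complete <- toy_long[complete.cases(toy_long[, model_vars]), ]

# Make sure each subject's rows sit next to each other
toy_complete <- toy_complete[order(toy_complete$subject), ]

nrow(toy_long)      # 12
nrow(toy_complete)  # 8

# With geepack installed, the fit would then use the filtered data:
# library(geepack)
# fit <- geeglm(PHQ ~ stress + hours, family = poisson,
#               data = toy_complete, id = subject, corstr = "exchangeable")
```

If the error disappears on the filtered data, missing values in the added covariates (here error and hours) are the likely cause.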

Related

Model Prediction Partial Least Square Model

I am following the procedure explained Hair et al (2021) to run a partial least square model (seminR).
So far, it worked well. However, when using the predict function, I get the following error:
Parallel encountered this ERROR:
Must subset columns with a valid subscript vector.
x Subscript endogenous_items must be a simple vector, not a matrix.
Error in summary.connection(connection) : invalid connection
I have recently started working with R. My dataset is an Excel table with 579 obs. of 47 variables. How can I solve these problems? Thank you very much in advance.
That's my code:
library(seminr)
# Create measurement model
final_mm_ext <- constructs(
  composite("EA", multi_items("EA_", 1:3)),
  composite("DB", multi_items("DB_", 1:5)),
  composite("LTB", multi_items("LTB_", 1:5)),
  composite("SN", multi_items("SN_", 1:3)),
  composite("PBC", multi_items("PBC_", 1:3)),
  composite("SE", multi_items("SE_", 1:8)),
  composite("INT", multi_items("INT_", 1:2)),
  composite("B", multi_items("B_", 1:2)))
# Create structural model
final_sm_ext <- relationships(
  paths(from = c("DB", "LTB", "SN", "PBC", "SE", "INT"), to = c("B")),
  paths(from = c("EA"), to = c("INT")))
# Estimate the model
bike_final_model_ext <- estimate_pls(data = PLSdata,
  measurement_model = final_mm_ext,
  structural_model = final_sm_ext,
  inner_weights = path_weighting,
  missing = mean_replacement,
  missing_value = "-99")
summary_bike_final_model_ext <- summary(bike_final_model_ext)
# Generate predictions
predict_bike_final_model_ext <- predict_pls(model = bike_final_model_ext,
  technique = predict_DA, noFolds = 10, reps = 10)

Error message using mstate::msprep ??bug?

I have had a problem with an error abend using mstate::msprep to prepare my data for a pretty classical 3 state problem. I can run the code from the mstate package vignette with no difficulty. My problem is entirely parallel to the vignette example. Subjects receive an islet transplant, then may achieve insulin independence. Whether they do or do not, they may have islet graft failure (or loss of insulin independence if it was achieved.) The vignette example works with included covariates (retained by the keep = parameter). My version works fine if I don't include the keep parameter but fails consistently if I use the keep parameter. Since my example works perfectly well without the keep variable, I very much doubt that there is a problem with my main data. It must be some problem with the “keep” data. See below for the session output.
Neither data set has any missing data. I tried the vignette data limiting it to three covariates -- one categorical, one continuous, and the third with one of the event-time variables, exactly parallel to my three covariates. The vignette still works perfectly, but mine doesn’t. Both covariate "keep" lists are character vectors. In sum, I can't imagine a more parallel "real" question to the vignette example.
I have tracked the problem to a subroutine of msprep "msprepEngine" at line 85 at the second time through the processing loop, but I haven't been able to figure out what the problem is. I suspect that it is a bug, but since I can't identify it, I can't be sure.
I would be very grateful to anyone who can help me with this issue. The vignette code is available with the package. Unfortunately I am not free to share my problem's data, but as I said above, the program works perfectly without the keep parameter. There must be something about my "keep" covariates that is giving the system indigestion.
Thanks in advance for any suggestions.
Larry Hunsicker
> library(magrittr)
> library(survival)
> library(mstate)
>
> #Three state tmat:
> data(ebmt3)
> names(msbmt)
[1] "id" "from" "to" "trans" "Tstart" "Tstop" "time" "status" "dissub" "age"
[11] "prtime"
> dim(msbmt)
[1] 5577 11
> tmat <- trans.illdeath(names = c("Tx", "PR", "RelDeath"))
> covs <- c('dissub', 'age', 'drmatch', 'tcd', 'prtime')
> class(covs)
[1] "character"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"),
+ data = ebmt3, trans = tmat, id = 'id', keep = covs)
>
> names(insfree3)
[1] "PatientID" "YrFree" "Free" "YrLossFail" "LossFail" "StudyID" "IEQ_kg"
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID')
>
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> covs <- c('StudyID', 'IEQ_kg', 'YrFree')
> class(covs)
[1] "character"
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID', keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument
I found the problem, and it is a bug. I just don't know whose bug it is. msprep() works when data is a data.frame, but not when it is a tibble. My repro example:
> library(survival)
> library(mstate)
> library(dplyr)
> data(ebmt3)
> class(ebmt3)
[1] "data.frame"
> tmat <- transMat(x = list(c(2, 3), c(3), c()), names = c("Tx",
+ "PR", "RelDeath"))
> ebmt3$prtime <- ebmt3$prtime/365.25
> ebmt3$rfstime <- ebmt3$rfstime/365.25
> covs <- c("dissub", "age", "drmatch", "tcd", "prtime")
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
> ebmt3 <- as_tibble(ebmt3)
> class(ebmt3)
[1] "tbl_df" "tbl" "data.frame"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument
I tracked the error down to line 157 in msprep()
ddcovs <- lapply(1:nkeep, function(i) rep(keep[, i], tbl))
When data is a data.frame, this line works. When it is a tibble, it abends with the above error message.
It was my impression that things that work with a data.frame should also work with a tibble, since a tibble is a data.frame. So I'm not sure whether this is a bug in msprep() or in the code for a tibble. But the way to avoid the error is to be sure that the data parameter in the call to msprep() is a data.frame, but not a tibble.
Larry Hunsicker
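A likely explanation for the difference, offered as a sketch rather than a confirmed reading of the mstate source: on a data.frame, keep[, i] drops to a plain vector, whereas a tibble's [, i] always returns a one-column tibble. rep() then sees a list of length 1 and rejects a times vector of any other length. The base-R example below imitates the tibble behaviour with drop = FALSE; the data and the counts in tbl are made up.

```r
# Made-up covariates and a made-up "rows per subject" vector
df  <- data.frame(age = c(30, 45, 60), grp = c("a", "b", "a"))
tbl <- c(2, 1, 3)

vec     <- df[, 1]                # data.frame default: a plain numeric vector
one_col <- df[, 1, drop = FALSE]  # imitates a tibble: still a one-column table

rep(vec, tbl)  # works: 30 30 45 60 60 60

# The one-column table is a list of length 1, so a length-3 'times' fails
res <- tryCatch(rep(one_col, tbl), error = function(e) conditionMessage(e))
res

# Workaround from the post: hand msprep() a plain data.frame
# ebmt3 <- as.data.frame(ebmt3)
```

Calling as.data.frame() on the tibble immediately before msprep() keeps the rest of a dplyr pipeline unchanged.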

how to interpolate data within groups in R using seqtime?

I am trying to use seqtime (https://github.com/hallucigenia-sparsa/seqtime) to analyze time-series microbiome data, as follows:
meta = data.table::data.table(day=rep(c(15:27),each=3), condition =c("a","b","c"))
meta<- meta[order(meta$day, meta$condition),]
meta.ts<-as.data.frame(t(meta))
otu=matrix(1:390, ncol = 39)
oturar<-rarefyFilter(otu, min=0)
rarotu<-oturar$rar
time<-meta.ts[1,]
interp.otu<-interpolate(rarotu, time.vector = time,
method = "stineman", groups = meta$condition)
The interpolation returns the following error:
[1] "Processing group a"
[1] "Number of members 13"
intervals
0
12
[1] "Selected interval: 1"
[1] "Length of time series: 13"
[1] "Length of time series after interpolation: 1"
Error in stinepack::stinterp(time.vector, as.numeric(x[i, ]), xout = xout, :
The values of x must strictly increasing
I tried to change method to "hyman", but it returns the error below:
Error in interpolateSub(x = x, time.vector = time.vector, method = method) :
Time points must be provided in chronological order.
I am using R version 3.6.1 and I am a bit new to R.
Can anyone please tell me what I am doing wrong and how to get around these errors?
Many thanks!
I spent quite some time stumbling around trying to figure this out. It all comes down to the data structure of meta and the resulting time variable used as input for the time.vector parameter.
Because meta mixes numbers and strings, t(meta) produces a character matrix. When that matrix is converted to a data frame, all strings are automatically converted to factors (the default in R 3.6.1), and this includes day.
To adjust, you can edit your code to the following:
library(seqtime)
meta <- data.table::data.table(day=rep(c(15:27),each=3), condition =c("a","b","c"))
meta <- meta[order(meta$day, meta$condition),]
meta.ts <- as.data.frame(t(meta), stringsAsFactors = FALSE) # Set stringsAsFactors = FALSE
otu <- matrix(1:390, ncol = 39)
oturar <- rarefyFilter(otu, min=0)
rarotu <- oturar$rar
time <- as.integer(meta.ts[1,]) # Now 'day' is character, so convert to integer
interp.otu <- interpolate(rarotu, time.vector = time,
method = "stineman", groups = meta$condition)
As a bonus, read this blogpost for information on the stringsAsFactors parameter. Strings automatically being converted to factors is a common source of bewilderment.
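To see why a factor breaks the time vector, here is a minimal base-R illustration with made-up day values. Converting a factor straight to integer returns its internal level codes, not the numbers shown in its labels.

```r
# Day values stored as strings, then turned into a factor
day_chr <- c("15", "15", "16", "27")
day_fac <- factor(day_chr)

as.integer(day_fac)                # level codes: 1 1 2 3
as.integer(as.character(day_fac))  # the intended values: 15 15 16 27
```

Going through as.character() first (or avoiding the factor conversion altogether, as in the code above) recovers the real time points.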

Debug error in frame$yval2[where, 1L + nclass + 1L:nclass, drop = FALSE]: subscript out of bounds

I'm using rpart library to build a regression tree, with the following code:
skillcraft <- read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00272/SkillCraft1_Dataset.csv", header = T, sep =",")
skillcraft$LeagueIndex <- factor(skillcraft$LeagueIndex)
skillcraft <- skillcraft[-1]
skillcraft$Age <- as.numeric(levels(skillcraft$Age))[skillcraft$Age]
skillcraft$TotalHours <- as.numeric(
levels(skillcraft$TotalHours))[skillcraft$TotalHours]
skillcraft$HoursPerWeek <- as.numeric(
levels(skillcraft$HoursPerWeek))[skillcraft$HoursPerWeek]
skillcraft <- skillcraft[complete.cases(skillcraft),]
library(caret)
set.seed(133)
skillcraft_sampling_vector <- createDataPartition(
skillcraft$LeagueIndex, p = 0.8, list = F)
skillcraft_train <- skillcraft[skillcraft_sampling_vector,]
skillcraft_test <- skillcraft[-skillcraft_sampling_vector,]
library(rpart)
regtree <- rpart(LeagueIndex ~., data = skillcraft_train)
regtree_predictions <- predict(regtree, skillcraft_test)
The last line of this code is throwing the error:
Error in frame$yval2[where, 1L + nclass + 1L:nclass, drop = FALSE] :
subscript out of bounds
This doesn't seem very clear, but I've checked that both data frames (train and test) have the same structure and now I'm having trouble in finding a way to debug this code.
Can anyone help?
Thanks in advance!
My best guess is that the problem lies in the LeagueIndex factor. This variable was provided as ordinal data (from Bronze to Professional) and converted to a character factor "1", "2", "3", etc. up to "8".
It looks like in addition to your error with rpart, you get a warning when partitioning the data based on this factor:
In createDataPartition(skillcraft$LeagueIndex, p = 0.8, list = F) :
Some classes have no records ( 8 ) and these will be ignored
Apparently there are no records with LeagueIndex of 8. This seems to happen after you select the complete cases here:
skillcraft <- skillcraft[complete.cases(skillcraft),]
And all of the LeagueIndex=8 cases are removed as these will have missing data for Age, HoursPerWeek, and TotalHours (coerced to NA) when converted via as.numeric.
skillcraft[which(skillcraft$LeagueIndex == 8), c("Age", "HoursPerWeek", "TotalHours")]
Age HoursPerWeek TotalHours
3341 ? ? ?
3342 ? ? ?
3343 ? ? ?
...
Assuming you still want a factor, I believe this will work if you get rid of the unused factor level, for example by running:
skillcraft$LeagueIndex <- droplevels(skillcraft$LeagueIndex)
before partitioning the data. (You could do this on the training set alone in this example, but you would want the same factor levels in your test and training sets.)
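A small base-R illustration of the leftover level, using a made-up factor rather than the SkillCraft data: filtering rows does not remove a level from the factor's definition, but droplevels() does.

```r
# A factor declared with eight levels, of which only three are observed
league <- factor(c("1", "2", "2", "8"), levels = as.character(1:8))

# Filtering out the "8" rows keeps all eight levels in the definition
kept_before <- league[league != "8"]
nlevels(kept_before)  # 8

# droplevels() discards every level that no longer occurs
kept_after <- droplevels(kept_before)
levels(kept_after)    # "1" "2"
```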

String pulled directly from source data seems to not match string in source data

I have a string that is failing to evaluate as a match with itself. I am trying to do a simple subset based on one of 8 possible values in a column,
out <- df[df$`Var name` == "string",]
I've had it work multiple times with different strings, but for some reason this string fails. I have tried to get the exact string from the source (thinking there may be some character encoding issue) using the four avenues below, but have had no success. Even when I make an explicit call to a cell I know contains that string and copy that into an evaluation statement, it fails.
> df[i,j]
[1] "string"
df[i,j]=="string" # pasted from above line
I don't understand how I can explicitly paste the output I was just given and have it not match.
## attempts to get exact string to paste into subset statement
# from dput
"IF APPLICABLE – Which of the following best characterizes the expectations with"
# from calling a specific row/col (df[i, j])
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
# from the source pane of rstudio
IF APPLICABLE – Which of the following best characterizes the expectations with
# from the source excel file
IF APPLICABLE – Which of the following best characterizes the expectations with
I don't have a clue what could be going on here. I am explicitly drawing the string straight from the data and yet it still fails to evaluate as true. Is there something going on in the background that I'm not seeing? Am I overlooking something ridiculously simple?
edit:
I subset based on another way, below is a dput and actual example of what I'm doing:
> dput(temp)
structure(list(`Item Stem` = "IF APPLICABLE – Which of the following best characterizes the expectations with",
`Item Response` = "It was required.", orgchar_group = "locale",
`Org Characteristic` = "Rural", N = 487, percent = 34.5145287030475,
`Graphs note` = NA_character_, `Report note` = NA_character_,
`Other note` = NA_character_, subsig = 1, overall = 0, varname = NA_character_,
statsig = NA_real_, use = NA_real_, difference = 9.16044821292665), .Names = c("Item Stem",
"Item Response", "orgchar_group", "Org Characteristic", "N",
"percent", "Graphs note", "Report note", "Other note", "subsig",
"overall", "varname", "statsig", "use", "difference"), row.names = 288L, class = "data.frame")
> temp[1,1]
[1] "IF APPLICABLE – Which of the following best characterizes the expectations with"
> temp[1,1] == "IF APPLICABLE – Which of the following best characterizes the expectations with"
[1] FALSE
It turns out it was in fact a non-ASCII character. Shoutout to the commenters for helping me figure it out by 1) suggesting it and 2) showing that it worked for them.
I was able to figure it out using insights from here (& here) and here.
I used a grepl command (from @Tyler Rinker) to determine that there was in fact a non-ASCII character in my string, and a stringi command (from @hadley) to determine what kind. I then used a base solution from @Josh O'Brien to remove it. It turns out it was the dash (an en dash, –, rather than a plain ASCII hyphen).
# working in the temp df
> x <- temp[1,1]
> grepl("[^ -~]", x)
[1] TRUE
> stringi::stri_enc_mark(x)
[1] "UTF-8"
> iconv(x, "UTF-8", "ASCII", sub="")
[1] "IF APPLICABLE Which of the following best characterizes the expectations with"
# set x as df$`Var name` and reassign it to fix
df$`Var name` <- iconv(df$`Var name`, "UTF-8", "ASCII", sub="")
Still don't understand it enough to explain why it happened but it's fixed now.
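For anyone hitting something similar, here is one way to see exactly which character is involved, sketched in base R with a shortened made-up string. It is an illustration of how such a mismatch can arise, not a reconstruction of the original data. utf8ToInt() exposes the code points, so any value above 127 is non-ASCII.

```r
# Looks like a hyphen, but is an en dash (U+2013)
x     <- "IF APPLICABLE \u2013 Which"
typed <- "IF APPLICABLE - Which"  # typed with a plain ASCII hyphen

x == typed  # FALSE

# List the non-ASCII code points in the string
cp  <- utf8ToInt(x)
bad <- cp[cp > 127]
sprintf("U+%04X", bad)  # "U+2013"

# Replace the en dash with a plain hyphen instead of deleting it
fixed <- gsub("\u2013", "-", x, fixed = TRUE)
fixed == typed  # TRUE
```

Replacing the character rather than stripping it with iconv(..., sub = "") keeps the text readable and avoids leaving a double space behind.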
