Error message using mstate::msprep: possible bug?

I am getting an error when using mstate::msprep to prepare my data for a fairly classical three-state problem. I can run the code from the mstate package vignette with no difficulty, and my problem is entirely parallel to the vignette example. Subjects receive an islet transplant and may then achieve insulin independence. Whether or not they do, they may have islet graft failure (or loss of insulin independence, if it was achieved). The vignette example works with included covariates (retained by the keep parameter). My version works fine if I omit the keep parameter but fails consistently if I use it. Since my example works perfectly well without keep, I very much doubt that there is a problem with my main data; it must be some problem with the "keep" data. See below for the session output.
Neither data set has any missing data. I tried the vignette data limited to three covariates (one categorical, one continuous, and one of the event-time variables), exactly parallel to my three covariates. The vignette still works perfectly, but mine doesn't. Both covariate "keep" lists are character vectors. In sum, I can't imagine a more parallel "real" question to the vignette example.
I have tracked the problem to a subroutine of msprep, "msprepEngine", at line 85 on the second pass through the processing loop, but I haven't been able to figure out what the problem is. I suspect that it is a bug, but since I can't identify it, I can't be sure.
I would be very grateful to anyone who can help me with this issue. The vignette code is available with the package. Unfortunately I am not free to share my problem's data, but as I said above, the program works perfectly without the keep parameter. There must be something about my "keep" covariates that is giving the system indigestion.
Thanks in advance for any suggestions.
Larry Hunsicker
> library(magrittr)
> library(survival)
> library(mstate)
>
> #Three state tmat:
> data(ebmt3)
> names(msbmt)
[1] "id" "from" "to" "trans" "Tstart" "Tstop" "time" "status" "dissub" "age"
[11] "prtime"
> dim(msbmt)
[1] 5577 11
> tmat <- trans.illdeath(names = c("Tx", "PR", "RelDeath"))
> covs <- c('dissub', 'age', 'drmatch', 'tcd', 'prtime')
> class(covs)
[1] "character"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"),
+ data = ebmt3, trans = tmat, id = 'id', keep = covs)
>
> names(insfree3)
[1] "PatientID" "YrFree" "Free" "YrLossFail" "LossFail" "StudyID" "IEQ_kg"
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID')
>
> tmat3 <- trans.illdeath(names = c("Tx", "II", "LossFail"))
> covs <- c('StudyID', 'IEQ_kg', 'YrFree')
> class(covs)
[1] "character"
> IImt <- msprep(time = c(NA, 'YrFree', 'YrLossFail'),
+ status = c(NA, 'Free', 'LossFail'),
+ data = insfree3, trans = tmat3, id = 'PatientID', keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument

I found the problem, and it is a bug. I just don't know whose bug it is. msprep() works when data is a data.frame, but not when it is a tibble. My repro example:
> library(survival)
> library(mstate)
> library(dplyr)
> data(ebmt3)
> class(ebmt3)
[1] "data.frame"
> tmat <- transMat(x = list(c(2, 3), c(3), c()), names = c("Tx",
+ "PR", "RelDeath"))
> ebmt3$prtime <- ebmt3$prtime/365.25
> ebmt3$rfstime <- ebmt3$rfstime/365.25
> covs <- c("dissub", "age", "drmatch", "tcd", "prtime")
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
> ebmt3 <- as_tibble(ebmt3)
> class(ebmt3)
[1] "tbl_df" "tbl" "data.frame"
> msbmt <- msprep(time = c(NA, "prtime", "rfstime"),
+ status = c(NA, "prstat", "rfsstat"), data = ebmt3,
+ trans = tmat, keep = covs)
Error in rep(keep[, i], tbl) : invalid 'times' argument
I tracked the error down to line 157 in msprep():
ddcovs <- lapply(1:nkeep, function(i) rep(keep[, i], tbl))
When data is a data.frame, this line works. When it is a tibble, it fails with the above error message.
It was my impression that things that work with a data.frame should also work with a tibble, since a tibble is a data.frame. So I'm not sure whether this is a bug in msprep() or in the tibble code. Either way, the way to avoid the error is to make sure that the data argument in the call to msprep() is a plain data.frame and not a tibble, for example by converting it with as.data.frame() first.
Larry Hunsicker
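The difference most likely comes from single-column indexing: on a plain data.frame, keep[, i] drops to a vector, whereas a tibble keeps [, i] as a one-column tibble, so rep() sees a length-one list and rejects the longer times vector. Here is a minimal base-R sketch of that mechanism. It mimics the tibble case with drop = FALSE instead of loading tibble, and the data and the tbl vector are made up for illustration; they are not the objects inside msprep().

```r
# Toy data; 'tbl' is a hypothetical vector of per-row repeat counts.
df  <- data.frame(x = c(10, 20, 30), y = c("a", "b", "c"))
tbl <- c(1L, 2L, 3L)

# Plain data.frame: df[, 1] drops to a numeric vector, so rep() repeats
# each element the requested number of times.
vec_result <- rep(df[, 1], tbl)
print(vec_result)

# Tibble-like case: the single column stays wrapped in a length-one list,
# so a 'times' vector of length 3 no longer matches and rep() errors.
one_col <- as.list(df[, 1, drop = FALSE])
list_result <- tryCatch(rep(one_col, tbl),
                        error = function(e) conditionMessage(e))
print(list_result)
```

Under this reading, calling as.data.frame() on the tibble before msprep() restores the drop-to-vector behaviour that line 157 relies on.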

Related

Problem with for loop when downloading species occurrence data

I want to download occurrence data from the GBIF website using the following R script. When I run it, I get this error: "Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 1, 0)". I would greatly appreciate any help with this.
My data: data
My R script:
flist<-read_excel("Mekong fish.xlsx",sheet="Sheet1")
##Loop
fname<-list()
Occ<-list()
datfish<-list()
name_list<-unique(flist$Updated_name)
# loop over the species names and download their occurrence records
for (i in seq_along(name_list)) {
# first request: small query used only to read the total record count
Occ[[i]] <-occ_search(scientificName = name_list[i], limit=2)
# second request: fetch the records that have coordinates
fname[[i]]<-occ_search(scientificName = name_list[i],
fields = c("species", "country","decimalLatitude", "decimalLongitude"),
hasCoordinate=TRUE, limit= Occ[[i]]$meta[4],return ="data")
datfish[[i]]<-as.data.frame(fname[[i]]$data)
}
I got a different error:
Expecting logical in D1424 / R1424C4: got 'in Lao'
Expecting logical in D1426 / R1426C4: got 'in China'
Expecting logical in D1467 / R1467C4: got 'only Cambodia'
Expecting logical in D1469 / R1469C4: got 'only in VN'
Expecting logical in D1473 / R1473C4: got 'only in China'
Expecting logical in D1486 / R1486C4: got 'only in Malaysia'
Expecting logical in D1488 / R1488C4: got 'only 1 point in VN'
I think the problem is caused by some fields in the 4th column. I don't have the right packages installed to run your code, but I got a different error (package missing) once I dropped the fourth column.
flist<-read_excel("~/Downloads/Mekong fish.xlsx",sheet="Sheet1")
flist <- subset(flist, select = -4)
...
EDIT:
This worked for me. read_excel guessed that column 4 was logical (boolean). When I explicitly set it to text via col_types, it worked.
library(readxl)
library(rgbif)
library(raster)
flist<-read_excel("~/Downloads/Mekong fish.xlsx",
sheet="Sheet1",
col_types = c("numeric", "text", "numeric", "text"))
flist
##Loop
fname<-list()
Occ<-list()
datfish<-list()
name_list<-unique(flist$Updated_name)
# loop over the first two species names as a test
for (i in seq_along(name_list[1:2])) {
message(i)
# first request: small query used only to read the total record count
Occ[[i]] <-occ_search(scientificName = name_list[i], limit=2)
message(Occ[[i]])
fname[[i]]<-occ_search(scientificName = name_list[i],
fields = c("species", "country","decimalLatitude", "decimalLongitude"),
hasCoordinate=TRUE, limit= Occ[[i]]$meta[4],return ="data")
message(fname[[i]])
datfish[[i]]<-as.data.frame(fname[[i]]$data)
message(datfish[[i]])
}
> 1
> list(offset = 0, limit = 2, endOfRecords = FALSE, count = 15)list(list(name = c("Animalia", "Chordata", "Actinopterygii",
> "Cypriniformes", "Cyprinidae", "Aaptosyax", "Aaptosyax grypus"), key = c("1", "44", "204", "1153", "7336", "2363805", "2363806"),
> etc...

posix time comparison in r not behaving the same in for loop and apply function

Hello, I am having an interesting issue with R.
When I do:
touchtimepairs = structure(list(v..length.v.. = structure(c(1543323677.254, 1543323678.137, 1543323679.181, 1543323679.918, 1543323680.729, 1543323681.803, 1543323682.523, 1543323682.977,1543323683.519, 1543323684.454), class = c("POSIXct", "POSIXt"), tzone = "CEST"),v.2.length.v.. = structure(c(1543323678.137, 1543323679.181, 1543323679.918, 1543323680.729, 1543323681.803, 1543323682.523, 1543323682.977, 1543323683.519, 1543323684.454, 1543323690.793), class = c("POSIXct", "POSIXt"), tzone = "CEST")), .Names = c("v..length.v..", "v.2.length.v.."), row.names = c(NA, 10L), class = "data.frame")
data = data.frame(a = seq(1,10), b = seq(21,30), posixtime = touchtimepairs[,1])
for(x in seq(nrow(touchtimepairs))){
a = data[data$posixtime < touchtimepairs[x,2],]
}
it works without a problem and I get results back, but when I try to use apply
a = apply(touchtimepairs, 1,
function(x) data[data$posixtime < x[2],])
it no longer works: I get an empty data frame. The same happens with the subset() command.
Interestingly, when I use > instead of <, it works!
a = apply(touchtimepairs, 1,
function(x) data[data$posixtime > x[2],])
Then there is another issue: with the > comparison, apply gives a different result than the for loop (1951 lines with apply and 1897 with the for loop).
Can anyone reproduce this behavior?
The POSIX times also include milliseconds, if that is of any interest.
Many thanks
If you look at your data inside the apply anonymous function, you'll see the symptom that is causing your trouble.
apply(touchtimepairs, 1, class)
# 1 2 3 4 5 6 7 8 9 10
# "character" "character" "character" "character" "character" "character" "character" "character" "character" "character"
(It should be returning a 2-row matrix with POSIXct and POSIXt.) I should also note that I kept getting warnings about unknown timezone 'CEST'. I fixed it temporarily with attr(touchtimepairs[[1]], "tzone") <- "UTC", though that's just a kludge to stop the warnings on my console. It doesn't fix the problem and might just be my system. :-)
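To see the coercion in isolation: apply() first converts the data frame with as.matrix(), and a data frame of POSIXct columns becomes a character matrix, so each row reaches the function as strings. Below is a small sketch on made-up two-row data (the column names start and end and the toy data frame are my own, and I use "UTC" only to avoid the time zone warnings), together with a row-index loop that keeps the POSIXct class.

```r
# Toy stand-in for touchtimepairs.
tp <- data.frame(
  start = as.POSIXct(c(1543323677.254, 1543323678.137),
                     origin = "1970-01-01", tz = "UTC"),
  end   = as.POSIXct(c(1543323678.137, 1543323679.181),
                     origin = "1970-01-01", tz = "UTC")
)

# What apply() actually iterates over: a character matrix.
m <- as.matrix(tp)
row_classes <- apply(tp, 1, class)
print(row_classes)

# Indexing rows by position keeps the time class, so < compares real times.
data <- data.frame(
  a = 1:3,
  posixtime = as.POSIXct(c(1543323677, 1543323678, 1543323679),
                         origin = "1970-01-01", tz = "UTC")
)
counts <- sapply(seq_len(nrow(tp)),
                 function(i) sum(data$posixtime < tp$end[i]))
print(counts)
```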
Depending on how many columns of touchtimepairs you need at once, you have a few options:
If you really only need one of touchtimepairs at a time, then lapply will work:
lapply(touchtimepairs[[1]],
function(x) subset(data, posixtime < x))
If you need to use both columns at the same time, use an index on the rows:
lapply(seq_len(nrow(touchtimepairs)),
function(i) subset(data, posixtime < touchtimepairs[i,2]))
(where you'd also reference touchtimepairs[i,1] somehow).
Especially if you are trying to use both columns simultaneously, you can use Map:
Map(function(a, b) subset(data, a < posixtime & posixtime <= b),
touchtimepairs[[1]], touchtimepairs[[2]])
(This does not return anything in your sample data, so either the data is not the best representative sample, or you are not intending to use it in this fashion. Most likely the latter, I'm just guessing :-)
The biggest difference between Map and the *apply family is that it accepts one or more vectors/lists and zips them together. As an example of this "zipper" effect:
Map(func, 1:3, 11:13)
is effectively calling:
func(1, 11)
func(2, 12)
func(3, 13)
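A runnable version of the same zipper illustration, where func is just a hypothetical stand-in that adds its two arguments:

```r
# 'func' is a placeholder; any two-argument function would do.
func <- function(a, b) a + b

# Map() walks both vectors in parallel and returns a list.
zipped <- Map(func, 1:3, 11:13)

# The equivalent explicit calls.
manual <- list(func(1L, 11L), func(2L, 12L), func(3L, 13L))

print(unlist(zipped))
print(unlist(manual))
```

Because Map() always returns a list, wrap it in unlist() or do.call(rbind, ...) if you need a vector or a single data frame afterwards.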

R and Data Selection

I have a data table dt, as given below:
structure(list(IM = c(0.830088495575221, 0.681436210847976, 0.498810939357907,
0.47265400115141, 0.527908540685945, 0.580763582966226, 0.408069043807859,
0.467368671545006, 0.44662887412295, 0.0331974034502217, 0.0368210899219588,
0.0333698233772947, 0.0294312465832275, 0.578743426515361, 0.566950053134963,
0.808756701221038, 0.585507838980771, 0.61507839619537, 0.586388329979879,
0.794196637085474), CM = c(0.876991150442478, 0.996180290297937,
0.651605231866825, 0.824409902130109, 0.94418291862811, 0.961820851688693,
0.943861532396347, 1.10137922144883, 1.1524325077831, 0.128868067469359,
0.155932251596297, 0.159414951213752, 0.196968075413411, 1.19678937171326,
0.901168969181722, 3.42528220866977, 2.4377239516641, 2.0040870054458,
1.86099597585513, 1.51928615911568), RM = c(0.601769911504425,
0.495034377387319, 0.405469678953627, 0.368451352907311, 0.361802286482851,
0.320851688693098, 0.791548118347242, 0.816050925099649, 0.786622368849031,
0.545805622636092, 0.594370732740163, 0.594771872860171, 0.536043514857356,
0.617215610296153, 0.619287991498406, 0.602602774009141, 0.634069706132375,
0.596543561108693, 0.582203219315895, 0.695985131558462)), .Names = c("IM", "CM", "RM"), class = c("data.table", "data.frame"), row.names = c(NA, -20L), .internal.selfref = <pointer: 0x00000000003f0788>)
I have written a function as given below:
DSanity.markWinsorize <- function(dt, colnames)
{
PERnames <- unlist(lapply(colnames, function(x) paste0("PER",x)));
print(dt[,colnames])
if(length(colnames)>1)
{dt[,PERnames] <- sapply(dt[,colnames], Num.calPtile);}
else
{dt[,PERnames] <- Num.calPtile(dt[,colnames]);}
return(dt)
}
## Calculate Percentile score of a data vector
Num.calPtile <- function(x)
{
return((ecdf(x))(x))
}
The job of this function is to create new columns, calculating the percentile of each of the data points for the columns provided to the function markWinsorize.
Here I am trying to run the function markWinsorize:
colnames <- c('CM','AM','BM')
DSanity.markWinsorize(dt,colnames)
I get the following error:
> sdc1 <- DSanity.markWinsorize(sdc,colnames)
[1] "CM" "AM" "BM"
Error in approxfun(vals, cumsum(tabulate(match(x, vals)))/n, method = "constant", :
zero non-NA points
In addition: Warning message:
In xy.coords(x, y) : NAs introduced by coercion
It would be great if some of you can help me out here. Thanks.
Your approach is quite unwieldy. I recommend a completely new approach.
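library(dplyr)
colnames <- c("CM", "AM", "BM")
# select_(), mutate_each() and funs() are deprecated in current dplyr;
# all_of() and across() are the replacements
dt %>%
select(all_of(colnames)) %>%
mutate(across(everything(), ~ ntile(.x, 100)))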
I think this gives what you want (perhaps with the addition of %>% bind_cols(dt)).
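If you would rather keep the ecdf-based percentile from the question, note that the printed output [1] "CM" "AM" "BM" suggests dt[,colnames] returned the character vector of names rather than the columns, so ecdf() was fed strings. Note also that "AM" and "BM" are not columns in the dt shown. A minimal base-R sketch of the same idea, assuming a plain data.frame (add_percentiles and the toy data are hypothetical names of mine, not part of the original code):

```r
# Percentile score of each value within its own column (from the question).
Num.calPtile <- function(x) ecdf(x)(x)

# Hypothetical helper: adds a PER<name> column for each requested column.
add_percentiles <- function(df, cols) {
  missing_cols <- setdiff(cols, names(df))
  if (length(missing_cols) > 0) {
    stop("columns not found: ", paste(missing_cols, collapse = ", "))
  }
  # df[cols] always returns the columns themselves, never the names.
  df[paste0("PER", cols)] <- lapply(df[cols], Num.calPtile)
  df
}

# Toy data for illustration only.
toy <- data.frame(IM = c(0.1, 0.4, 0.2, 0.3), CM = c(4, 3, 2, 1))
out <- add_percentiles(toy, c("IM", "CM"))
print(out)
```

With a data.table you would convert first with as.data.frame(), or adapt the column selection to data.table's own syntax.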

Running `ctree` using `party` package, column as factor and not character

I have referred to "convert data.frame column format from character to factor", "Converting multiple data.table columns to factors in R" and "Convert column classes in data.table".
Unfortunately they did not solve my problem. I am working with the bodyfat dataset and my data frame is called bf. I added a column called agegrp to categorize persons of different ages as young, middle or old, thus:
bf$agegrp<-ifelse(bf$age<=40, "young", ifelse(bf$age>40 & bf$age<55,"middle", "old"))
This is the ctree analysis:
> set.seed(1234)
> modelsample<-sample(2, nrow(bf), replace=TRUE, prob=c(0.7, 0.3))
> traindata<-bf[modelsample==1, ]
> testdata<-bf[modelsample==2, ]
> predictor<-agegrp~DEXfat+waistcirc+hipcirc+kneebreadth
> bf_ctree<-ctree(predictor, data=traindata)
I got the following error:
Error in trafo(data = data, numeric_trafo = numeric_trafo, factor_trafo = factor_trafo, :
data class character is not supported
In addition: Warning message:
In storage.mode(RET@predict_trafo) <- "double" : NAs introduced by coercion
Since bf$agegrp is of class "character" I ran,
> bf$agegrp<-as.factor(bf$agegrp)
the agegrp column is now coerced to factor.
> class(bf$agegrp) gives [1] "factor".
I tried running the ctree again, but it throws the same error. Does anyone know what the root-cause of the problem is?
This works for me:
library(mboost) # source of bodyfat here; newer installs may need data("bodyfat", package = "TH.data")
library(party)
bf <- bodyfat
bf$agegrp <- cut(bf$age,c(0,40,55,100),labels=c("young","middle","old"))
predictor <- agegrp~DEXfat+waistcirc+hipcirc+kneebreadth
set.seed(1234)
modelsample <-sample(2, nrow(bf), replace=TRUE, prob=c(0.7, 0.3))
traindata <-bf[modelsample==1, ]
testdata <-bf[modelsample==2, ]
bf_ctree <-ctree(predictor, data=traindata)
plot(bf_ctree)
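A likely reason the original attempt kept failing (I am assuming traindata was not rebuilt after the conversion): traindata is a copy taken while agegrp was still character, so converting bf$agegrp afterwards does not touch it. In the code above, the factor is created before the split, which avoids this. A small base-R sketch on made-up ages:

```r
# Toy data, not the real bodyfat set.
bf <- data.frame(age = c(30, 45, 60, 38))
bf$agegrp <- ifelse(bf$age <= 40, "young",
                    ifelse(bf$age > 40 & bf$age < 55, "middle", "old"))

# Subset taken while agegrp is still character.
traindata <- bf[c(1, 2, 3), ]

# Converting the column in bf does not change the earlier copy.
bf$agegrp <- as.factor(bf$agegrp)
class_in_bf    <- class(bf$agegrp)
class_in_train <- class(traindata$agegrp)
print(c(bf = class_in_bf, traindata = class_in_train))

# Rebuilding the subset after the conversion picks up the factor.
traindata <- bf[c(1, 2, 3), ]
class_after_rebuild <- class(traindata$agegrp)
print(class_after_rebuild)
```

One small difference to be aware of: with the breaks c(0,40,55,100) and the default right-closed intervals, cut() puts age 55 in "middle", whereas the ifelse() version puts it in "old".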

Reshaping data for use with geeglm()

Could you please help me figure out why I am getting an error?
Initially my data looks like this:
> attributes(compl)$names
[1] "UserID" "compl_bin" "Sex.x" "PHQ_base" "PHQ_Surv1" "PHQ_Surv2" "PHQ_Surv3"
[8] "PHQ_Surv4" "EFE" "Neuro" "Intervention.x" "depr0" "error1_1.x" "error1_2.x"
[15] "error1_3.x" "error1_4.x" "stress0" "stress1" "stress2" "stress3" "stress4"
[22] "hours1" "hours2" "hours3" "hours4" "subject"
First I reshape my data to prepare for geeglm:
compl$subject <- factor(rownames(compl))
nobs <- nrow(compl)
compl_long <- reshape(compl, idvar = "subject",
varying = list(c("PHQ_Surv1", "PHQ_Surv2" ,
"PHQ_Surv3", "PHQ_Surv4"),
c("error1_1.x", "error1_2.x",
"error1_3.x", "error1_4.x"),
c("stress1", "stress2", "stress3",
"stress4"),
c("hours1", "hours2", "hours3",
"hours4")),
v.names = c("PHQ", "error", "stress", "hours"),
times = c("1", "2", "3", "4"), direction = "long")
-(Editor's note: not sure what this next output is from...)
[1] "UserID" "compl_bin" "Sex.x" "PHQ_base" "EFE" "Neuro" "Intervention.x"
[8] "depr0" "stress0" "subject" "time" "PHQ" "error" "stress"
[15] "hours"
Then I use geeglm function:
library(geepack)
geeSand=(geeglm(PHQ~as.factor(compl_bin) + Neuro+PHQ_base+as.factor(depr0) +
EFE+as.factor(Sex.x) + as.factor(error)+stress+hours,
family = poisson, data=compl_long,
id=subject, corst="exchangeable"))
I am getting an error:
"Error in geese.fit(xx, yy, id, offset, soffset, w, waves = waves, zsca, :
nrow(zsca) and length(y) not match"
If I remove the variables as.factor(error) and hours, geeglm does not complain and I get the output. The function does not work with the error and hours variables. I checked the lengths of all the variables, and they are equal. Could you please help me figure out what is wrong?
Many thanks!
Found this at: https://stat.ethz.ch/pipermail/r-help/2008-October/178337.html
"I'm pretty sure this is a bug in geese(), which should be reported to the maintainer of geepack. The problem is with the treatment of missing values.
If one looks at dim(na.omit(dat[,c("id","score","chem","time")])) one gets 44.
In geese.fit() zsca is set equal to matrix(1,N,1) where N is set equal to length(id). But id has length 46 whereas the response y has been trimmed down to length 44 by eliminating any rows of the data where any of the variables involved are missing. Hence a problem.
The solution of the problem requires some code re-writing by the maintainer of geepack."
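If missing values are indeed the cause here (an assumption; the question does not show them), a common workaround is to drop incomplete rows for the model variables yourself before calling geeglm(), so that id and the response have the same length. Here is a sketch on made-up long-format data. The toy data frame and its values are hypothetical; only the column names follow the question.

```r
# Toy long-format data with NAs in two of the model variables.
compl_long_toy <- data.frame(
  subject = factor(c(1, 2, 3, 1, 2, 3)),
  PHQ     = c(5, NA, 6, 7, 4, 3),
  hours   = c(40, 38, 50, NA, 45, 42)
)

# Keep only the columns used in the model, then drop incomplete rows.
model_vars <- c("subject", "PHQ", "hours")
n_all      <- nrow(compl_long_toy)
model_data <- na.omit(compl_long_toy[, model_vars])

# reshape() stacks rows by time, so re-sort to make each subject's rows
# contiguous before fitting.
model_data <- model_data[order(model_data$subject), ]
n_complete <- nrow(model_data)

print(c(all_rows = n_all, complete_rows = n_complete))
```

You would then pass model_data (with all of your real covariates included in model_vars) as the data argument to geeglm().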

Resources