Anesrake algorithm doesn't work with zero as a weight

I tried using the anesrake package, but it won't accept a weight of zero, giving the error message:
Error in while (range(weightvec)[2] > cap + 1e-04) { :
missing value where TRUE/FALSE needed
Sample code:
library(anesrake)

ipfdata <- read.csv("dummydata.csv", header = TRUE)
ipfdata$caseid <- 1:length(ipfdata$age)

sex <- c(0.30, 0.70)
age <- c(0.2, 0.1, 0.05, 0.05, 0.05, 0.05, 0.3, 0.2)
ses <- c(0.20, 0.20, 0.0)

targets <- list(sex, age, ses)
names(targets) <- c("sex", "age", "ses")

outsave <- anesrake(targets, ipfdata, caseid = ipfdata$caseid,
                    weightvec = NULL, cap = 10, verbose = TRUE, maxit = 50,
                    choosemethod = "total", type = "nolim", pctlim = 0.0001,
                    nlim = 10, iterate = TRUE, force1 = TRUE)
(sample code modified from this question: https://stackoverflow.com/questions/19458306/ipf-raking-using-anesrake-in-r-error)

I contacted the author about this issue, but the package was never updated. The only workaround I have found is to remove the rows belonging to the zero-weighted level before raking.
In the sample above, you would remove all rows in the third SES level and change the SES target vector from c(0.20, 0.20, 0.0) to c(0.20, 0.20), as sketched below.
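A minimal sketch of that workaround, assuming ses in dummydata.csv is a factor whose third level is the zero-weighted one (adjust the level selection to your data):

# Drop rows in the zero-weighted SES level, then drop the now-empty level
ipfdata <- ipfdata[ipfdata$ses != levels(factor(ipfdata$ses))[3], ]
ipfdata$ses <- droplevels(factor(ipfdata$ses))
ipfdata$caseid <- 1:nrow(ipfdata)   # re-index cases after dropping rows

ses <- c(0.20, 0.20)                # zero-weighted level removed;
                                    # force1 = TRUE rescales targets to sum to 1
targets <- list(sex = sex, age = age, ses = ses)

outsave <- anesrake(targets, ipfdata, caseid = ipfdata$caseid,
                    weightvec = NULL, cap = 10, verbose = TRUE, maxit = 50,
                    choosemethod = "total", type = "nolim", pctlim = 0.0001,
                    nlim = 10, iterate = TRUE, force1 = TRUE)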

Can't pass variable into function in R

I am trying to fit a list of dataframes, and I can't figure out why I can't define conc and t0 outside of the function. If I do it like this, I get the error:
Error in nls.multstart::nls_multstart(y ~ fit_drx_mono(assoc_time, t0, :
  There must be as many parameter starting bounds as there are parameters
conc <- 5e-9
t0 <- 127
nls.multstart::nls_multstart(
  y ~ fit_mono(assoc_time, t0, conc, kon, koff, ampon, ampoff),
  data = data_to_fit,
  iter = 100,
  start_lower = c(kon = 1e4, koff = 0.00001, ampon = 0.05, ampoff = 0),
  start_upper = c(kon = 1e7, koff = 0.5, ampon = 0.6, ampoff = 0.5),
  lower = c(kon = 0, koff = 0, ampon = 0, ampoff = 0))
When I specify the values inside the function call, everything works as it is supposed to, and at first I didn't understand why.
It turned out I cannot pass data = data_to_fit: with a data argument, the formula looks for variables only in that dataframe, so conc and t0 (defined outside it) were treated as two extra free parameters, and six parameters don't match four starting bounds. Once I defined every variable outside of the function without specifying data, it worked.
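An alternative sketch that keeps the data argument is to add the constants as columns of the data frame so the formula finds them there (fit_mono, data_to_fit, and the values are taken from the question; untested against the original data):

# Make the constants visible to the formula by storing them in the data
data_to_fit$conc <- 5e-9
data_to_fit$t0 <- 127

fit <- nls.multstart::nls_multstart(
  y ~ fit_mono(assoc_time, t0, conc, kon, koff, ampon, ampoff),
  data = data_to_fit,
  iter = 100,
  start_lower = c(kon = 1e4, koff = 0.00001, ampon = 0.05, ampoff = 0),
  start_upper = c(kon = 1e7, koff = 0.5, ampon = 0.6, ampoff = 0.5),
  lower = c(kon = 0, koff = 0, ampon = 0, ampoff = 0))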

Very weird behaviour of xgboost in R

I am learning to use the XGBoost package in R and I encountered some very weird behaviour that I'm not sure how to explain. Perhaps someone can give me some directions. I simplified the R code as much as possible:
rm(list = ls())
library(xgboost)

setwd("/home/my_username/Documents/R_files")
my_data <- read.csv("my_data.csv")
my_data$outcome_01 <- ifelse(my_data$outcome_continuous > 0.0, 1, 0)

reg_features   <- c("feature_1", "feature_2")
class_features <- c("feature_1", "feature_3")

set.seed(93571)
train_data <- my_data[seq(1, nrow(my_data), 2), ]

mm_reg_train <- model.matrix(~ . + 0, data = train_data[, reg_features])
train_DM_reg <- xgb.DMatrix(data = mm_reg_train, label = train_data$outcome_continuous)

var_nrounds <- 190
xgb_reg_model <- xgb.train(data = train_DM_reg, booster = "gbtree",
                           objective = "reg:squarederror",
                           nrounds = var_nrounds, eta = 0.07, max_depth = 5,
                           min_child_weight = 0.8, subsample = 0.6,
                           colsample_bytree = 1.0, verbose = FALSE)

mm_class_train <- model.matrix(~ . + 0, data = train_data[, class_features])
train_DM_class <- xgb.DMatrix(data = mm_class_train, label = train_data$outcome_01)

xgb_class_model <- xgb.train(data = train_DM_class, booster = "gbtree",
                             objective = "binary:logistic", eval_metric = "auc",
                             nrounds = 70, eta = 0.1, max_depth = 3,
                             min_child_weight = 0.5, subsample = 0.75,
                             colsample_bytree = 0.5, verbose = FALSE)

probabilities <- predict(xgb_class_model, newdata = train_DM_class, type = "response")
print(paste0("simple check: ", sum(probabilities)), quote = FALSE)
Here is the problem: the outcome of sum(probabilities) depends on the value of var_nrounds!
How can that be? var_nrounds enters only into xgb_reg_model, while the probabilities are computed with xgb_class_model, which should not know anything about the value of var_nrounds. The only thing I change in this code is var_nrounds, and yet the sum of the probabilities changes when I rerun it. It also changes deterministically: with var_nrounds = 190 I always get (with my data) 5324.3, and with var_nrounds = 285 I always get 5322.8. However, if I remove the line set.seed(93571), the result changes non-deterministically every time I rerun the code.
Could it be that XGBoost has some sort of built-in stochastic behaviour that changes depending on the number of rounds run beforehand in another model, and that is also controlled by setting a seed in the code before training? Any ideas?
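One hypothesis: the two calls to xgb.train share R's random number stream (the R package is seeded via set.seed(), and subsample < 1 requires random draws), so the number of rounds trained in the first model changes the RNG state that the second model sees. A quick diagnostic sketch, assuming that is the cause:

# If the shared RNG stream is the culprit, re-seeding immediately before the
# second model should make sum(probabilities) independent of var_nrounds.
set.seed(93571)
xgb_class_model <- xgb.train(data = train_DM_class, booster = "gbtree",
                             objective = "binary:logistic", eval_metric = "auc",
                             nrounds = 70, eta = 0.1, max_depth = 3,
                             min_child_weight = 0.5, subsample = 0.75,
                             colsample_bytree = 0.5, verbose = FALSE)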

R: make nomogram plot label italic

I am drawing a nomogram based on the dataset PimaIndiansDiabetes using the package rms:
library(mlbench)
library(rms)

data(PimaIndiansDiabetes)
ddist <- datadist(PimaIndiansDiabetes)
options(datadist = 'ddist')

lrm_model <- lrm(diabetes ~ .,
                 data = PimaIndiansDiabetes,
                 x = TRUE, y = TRUE)

nom <- nomogram(lrm_model,
                fun = plogis,
                fun.at = c(0.01, 0.05, seq(0.1, 0.9, by = 0.2), 0.95, 0.99),
                abbrev = TRUE,
                lp = FALSE,
                vnames = "labels",
                varname.label = TRUE,
                funlabel = "Diabetes")

plot(nom,
     fun.side = c(1, 3, 1, 3, 1, 3, 1, 3, 1),
     label.every = 1,
     lmgp = 0.15,
     xfrac = .4)
Below is the plot (figure omitted):
How do I make the axis names italic? (Ideally, only the 8 predictors' names would be italic.)
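No answer is recorded here, but one thing worth trying, as a sketch that assumes plot.nomogram inherits the base-graphics font setting rather than overriding it (note this would italicize all nomogram text, not just the 8 predictor names):

# font = 3 selects italic for base-graphics text drawn with the default font
op <- par(font = 3)
plot(nom,
     fun.side = c(1, 3, 1, 3, 1, 3, 1, 3, 1),
     label.every = 1,
     lmgp = 0.15,
     xfrac = .4)
par(op)   # restore the previous font setting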

How to sample discrete distribution conditional on factor within for if loop

I'm attempting to generate dummy data by sampling from a specific discrete distribution, conditional on the levels of a factor (a different distribution for each factor level), and then insert each random result into a new dataframe column in the row corresponding to that factor level. If you run the code below you will see that data$last is empty, and I'm not sure what I'm doing wrong. I've also tried it without the loop, setting the number of replications to 100 for each level, but then the distributions are incorrect.
# Create data frame with factor
set.seed(1)
ID <- 1:200
gender <- sample(x = c("Male", "Female"), 200, replace = T, prob = c(0.5, 0.5))
data <- data.frame(ID, gender)

# Generate random response based on discrete distribution conditional on gender
data$last <- for (i in 1:nrow(data)) {
  if (data$gender == "Male") {
    sample(x = c("Today", "Yesterday"), 1, replace = T, prob = c(0.8, 0.2))
  } else {
    sample(x = c("Today", "Yesterday"), 1, replace = T, prob = c(0.3, 0.7))
  }
}
You should rewrite your for-loop to assign each data$last value inside the loop:
for (i in 1:nrow(data)) {
  if (data$gender[i] == "Male") {
    data$last[i] <- sample(x = c("Today", "Yesterday"), 1, replace = T, prob = c(0.8, 0.2))
  } else {
    data$last[i] <- sample(x = c("Today", "Yesterday"), 1, replace = T, prob = c(0.3, 0.7))
  }
}
Or without a for-loop:
data$last <- ifelse(data$gender == "Male",
                    sample(x = c("Today", "Yesterday"), sum(data$gender == "Male"), replace = T, prob = c(0.8, 0.2)),
                    sample(x = c("Today", "Yesterday"), sum(data$gender != "Male"), replace = T, prob = c(0.3, 0.7)))
# Generate random response based on discrete distribution conditional on gender
data$last <- sapply(1:nrow(data), function(i) {
  if (data$gender[i] == "Male") {
    s <- sample(x = c("Today", "Yesterday"), 1, replace = T, prob = c(0.8, 0.2))
  } else {
    s <- sample(x = c("Today", "Yesterday"), 1, replace = T, prob = c(0.3, 0.7))
  }
  return(s)
})
Note that your condition tested the whole data$gender vector rather than a single element data$gender[i], and that a for-loop returns NULL, so assigning its result to data$last left the column empty. The sapply version returns each sampled value via return(s). A fully vectorized alternative is sketched below.
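For completeness, a vectorized sketch that draws each group's samples in one call and slots them into the right rows by logical indexing:

data$last <- NA_character_
is_male <- data$gender == "Male"
# One sample() call per group, sized to the number of rows in that group
data$last[is_male] <- sample(c("Today", "Yesterday"), sum(is_male),
                             replace = TRUE, prob = c(0.8, 0.2))
data$last[!is_male] <- sample(c("Today", "Yesterday"), sum(!is_male),
                              replace = TRUE, prob = c(0.3, 0.7))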

Error in nonlinear least squares in R - Logistic and Gompertz curves

I'm working on a model for variable y, in which I intend to use time as an explanatory variable. I've chosen a Gompertz and a logistic curve as candidates, but when I try to estimate the coefficients (using both nls and nls2), I get different errors (singularity or "step factor reduced below 'minFactor'"). I would really appreciate any help. Here is my code, along with a dput of the info object.
I chose the initial values according to the criteria in http://www.metla.fi/silvafennica/full/sf33/sf334327.pdf
library(nls2)
> dput(info)
structure(list(y = c(0.308, 0.279, 0.156, 0.214, 0.224, 0.222,
0.19, 0.139, 0.111, 0.17, 0.155, 0.198, 0.811, 0.688, 0.543,
0.536, 0.587, 0.765, 0.667, 0.811, 0.587, 0.617, 0.586, 0.633,
2.231, 2.202, 1.396, 1.442, 1.704, 2.59, 2.304, 3.026, 2.7, 3.275,
3.349, 3.936, 9.212, 8.773, 6.431, 6.983, 7.169, 9.756, 10.951,
13.938, 14.378, 18.406, 24.079, 28.462, 51.461, 46.555, 39.116,
43.982, 41.722), t = 1:53), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -53L))
summary(gomp_nls <- nls2(y ~ alpha*exp(-beta*exp(-gamma*t)),
                         data = info,
                         start = list(alpha = 40, beta = 4.9, gamma = 0.02),
                         algorithm = "default"))
summary(logist_nls <- nls2(y ~ alpha/(1 + beta*exp(-gamma*t)),
                           data = info,
                           start = list(alpha = 40, beta = 128, gamma = 0.02),
                           algorithm = "default"))
The "default" algorithm for nls2 is to use nls. You want to specify "brute-force" or one of the other algorithms for finding an initial value. The starting value should be a data frame of two rows such that it will fill in the hypercube so defined with potential starting values.
It will then evaluate the residual sum of squares at each of those starting values and return the starting values at which the formula gives the least sum of squares.
If you find that the result returned by nls2 is at the boundary of the region you defined then enlarge the region and try again. (You might not need this step if the starting value returned are good enough anyways.)
Finally run nls with the starting values you found.
library(nls2)
## 1
fo1 <- y ~ alpha*exp(-beta*exp(-gamma*t))
st1 <- data.frame(alpha = c(10, 100), beta = c(1, 100), gamma = c(0.01, 0.20))
fm1.0 <- nls2(fo1, data = info, start = st1, algorithm = "brute-force")
fm1 <- nls(fo1, data = info, start = coef(fm1.0))
## 2
fo2 <- y ~ alpha/(1+beta*exp(-gamma*t))
st2 <- data.frame(alpha = c(10, 1000), beta = c(1, 10000), gamma = c(0.01, 0.20))
fm2.0 <- nls2(fo2, data = info, start = st2, algorithm = "brute-force")
fm2 <- nls(fo2, data = info, start = coef(fm2.0))
# plot both fits
plot(y ~ t, info)
lines(fitted(fm1) ~ t, info, col = "blue")
lines(fitted(fm2) ~ t, info, col = "red")
Note
Note that, for the data shown, these two 2-parameter exponential models fit reasonably well, so if you are only interested in the range where y rises exponentially they could be alternatives to consider. (The first one below is better behaved because its coefficients are more similar in magnitude; the second may have scaling problems.)
fm3 <- nls(y ~ a * exp(b/t), info, start = c(a = 1, b = 1))
fm4 <- nls(y ~ a * t^b, info, start = c(a = .001, b = 6))
