I am writing the function below to run a skewness test for sample sizes 10, 20, 50 and 100, with 1000 replicates each.
library(moments)
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
My conditions
I reject the null hypothesis if the statistic satisfies either of two conditions:
less than -2
or greater than +2.
What I have
But my R code only checks the second condition.
What I want
How do I include both the first and the second condition in my function?
Wrapping the statistic in abs() is probably the easiest way to cover both conditions:
out <- t(sapply(c(10, 20, 50, 100), function(x)
table(abs(unlist(replicate(1000, skewness(rgamma(n = x, shape = 3, rate = 0.5))))) > 2)))
row.names(out) <- c(10, 20, 50, 100)
out
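One caveat with tabulating a logical vector directly: if no replicate (or every replicate) exceeds the threshold for some sample size, table() returns a single cell and sapply() can no longer build a rectangular result. A minimal sketch that guards against this by fixing the factor levels is shown below. The helper skew() is a hypothetical base-R stand-in for moments::skewness (third central moment divided by the 1.5 power of the second), used here only so the snippet runs without extra packages.

```r
set.seed(123)

# Hypothetical stand-in for moments::skewness (moment-based sample skewness)
skew <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

sizes <- c(10, 20, 50, 100)

out <- t(sapply(sizes, function(n) {
  stat <- replicate(1000, skew(rgamma(n = n, shape = 3, rate = 0.5)))
  # Fix both levels so every sample size yields a FALSE and a TRUE count
  reject <- factor(abs(stat) > 2, levels = c(FALSE, TRUE))
  table(reject)
}))
rownames(out) <- sizes
out
```

Dividing the TRUE column by 1000 gives the empirical rejection rate for each sample size.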
This may be a naive question, but I could not find a solution. I have a data frame of count data from a field survey and I want to predict species richness using Poisson regression. The survey is allocated to grids of equal size, but the number of surveys varies between grids, so I want to include the number of surveys per grid as an offset. The problem is that when I predict from the GLM using a raster stack, predict wants a raster layer for the offset variable (number of surveys per grid). How can I incorporate the offset variable into the raster stack so that I get a spatial prediction (i.e., the prediction should be a raster)? Below is a reproducible example (using fewer variables):
Create the dataframe:
bio2 <- c(12.74220, 14.10092, 13.82644, 14.30550, 15.02780, 14.88224, 13.98853, 14.89524, 15.59887, 13.98664, 14.75405,
15.38178, 14.50719, 15.00427, 12.77741, 13.25432, 12.91208, 15.75312, 15.36683, 13.33202, 12.55190, 14.94755,
13.52424, 14.75273, 14.42298, 15.37897, 12.02472, 15.49786, 14.28823, 13.01982, 13.60521, 15.07687, 14.17427,
13.24491, 14.84833, 13.52594, 13.92113, 11.39738, 14.31446, 12.10239)
bio9 <- c(26.30980, 26.52826, 27.03376, 23.93621, 26.48416, 26.05859, 25.37550, 25.34595, 25.34056, 23.37793, 25.74681,
22.72016, 22.00458, 24.37140, 22.95169, 24.52542, 24.63087, 22.86291, 23.10240, 23.79215, 24.86875, 21.40718,
23.84258, 21.91964, 25.97682, 24.97625, 22.31471, 19.64094, 23.93386, 25.87234, 25.99514, 17.17149, 20.72802,
18.22862, 24.51112, 24.33626, 23.90822, 23.43660, 23.07425, 20.71244)
count <- c(37, 144, 91, 69, 36, 32, 14, 34, 48, 168, 15, 21, 36, 29, 24, 16, 14, 11, 18, 64, 37, 31, 18, 9, 4,
16, 14, 10, 14, 43, 18, 88, 69, 26, 20, 5, 9, 75, 8, 26)
sitesPerGrid <- c(3, 16, 8, 5, 3, 3, 1, 3, 3, 29, 2, 4, 5, 2, 3, 4, 2, 1, 2, 9, 6, 3, 3, 2, 1, 2, 2, 1, 2, 5, 7, 15, 9, 4,
1, 1, 2, 22, 6, 5)
testdf <- data.frame(bio2, bio9, count, sitesPerGrid)
pois1 <- glm(count ~ bio2 + bio9, offset = log(sitesPerGrid), family = poisson (link = "log"), data = testdf)
Spatial prediction:
library(raster)
bio_2 <- bio_9 <- raster(nrow=5,ncol=8, xmn=0, xmx=1,ymn=0,ymx=1)
values(bio_2) <- bio2
values(bio_9) <- bio9
predRas <- stack(bio_2, bio_9)
names(predRas) <- c("bio2", "bio9")
pdPois <- raster::predict(predRas, pois1, type = "response")
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# variable lengths differ (found for 'bio9')
#In addition: Warning message:
#'newdata' had 16 rows but variables found have 40 rows
I get an error because predict expects a raster layer for sitesPerGrid. But I don't want to use sitesPerGrid as a predictor.
Update
Based on the comment and answer given by @robertHijmans, I tried the following code:
pdPois <- raster::predict(predRas, pois1, const = testdf[, "sitesPerGrid"], type = "response")
Again I get the following error:
Error in data.frame(..., check.names = FALSE) : arguments imply differing number of rows: 143811, 40
I see that this works, because the number of data points is the same as the number used to fit the model:
p <- predict(pois1, as.data.frame(predRas), type = "response")
However, this (taking two data points) does not work:
p <- predict(pois1, as.data.frame(predRas)[1:2,], type = "response")
#Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# variable lengths differ (found for 'bio9')
#In addition: Warning message:
#'newdata' had 2 rows but variables found have 40 rows
So, irrespective of the raster data, can you (and if so how?) use a model like this to make predictions to (any number of) new data points?
The problem is solved by supplying a raster for the offset variable. The raster is built from an assumed scenario: for example, I may want the prediction for one site per grid, for mean(sitesPerGrid), or for max(sitesPerGrid). If the scenario is mean(sitesPerGrid), the raster stack for prediction would be:
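# make a new raster for sitesPerGrid, using the bio_2 raster as a template
rasGrid <- bio_2
rasGrid[] <- mean(testdf$sitesPerGrid)
names(rasGrid) <- "sitesPerGrid"
predRas <- stack(bio_2, bio_9, rasGrid)
# layer names must match the variable names used in the model
names(predRas) <- c("bio2", "bio9", "sitesPerGrid")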
p <- raster::predict(predRas, pois1, type = "response")
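The same idea applies to plain data frames, independent of any raster. The sketch below uses hypothetical simulated data (toy, not the survey data above) and puts the offset inside the formula with offset(), so that predict() looks the offset variable up in newdata. Changing the assumed number of sites then simply rescales the predicted counts.

```r
set.seed(1)

# Hypothetical toy data, only for illustration
toy <- data.frame(
  bio2 = runif(30, 11, 16),
  bio9 = runif(30, 17, 27),
  sitesPerGrid = sample(1:10, 30, replace = TRUE)
)
toy$count <- rpois(30, lambda = toy$sitesPerGrid * exp(1 + 0.05 * toy$bio2))

# Offset written in the formula, so it is evaluated in newdata at predict time
fit <- glm(count ~ bio2 + bio9 + offset(log(sitesPerGrid)),
           family = poisson(link = "log"), data = toy)

# Any number of new points works as long as newdata carries the offset variable
newdat <- data.frame(bio2 = c(13, 14), bio9 = c(22, 25), sitesPerGrid = 1)
p1 <- predict(fit, newdata = newdat, type = "response")

newdat$sitesPerGrid <- 5
p5 <- predict(fit, newdata = newdat, type = "response")

# Five sites per grid gives five times the expected count of one site
p5 / p1
```

With a log link, the offset enters the prediction as a multiplier, which is why the choice of scenario (one site, the mean, the maximum) only scales the predicted surface.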
I'm a beginner with machine learning (and also R). I've figured out how to run some basic linear regression, elastic net, and random forest models in R and have gotten some decent results for a regression project (with a continuous dependent variable) that I'm working on.
I've been trying to learn how to use the gradient boosting algorithm and, in particular, the xgboost() command. My results are way worse here, though, and I'm not sure why.
I was hoping someone could take a look at my code and see if there are any glaring errors.
# Load dplyr (select, mutate, summarize, %>%) and ggplot2 (ggplot)
library(dplyr)
library(ggplot2)
# Create training data with and without the dependent variable
train <- data[1:split, ]
train.treat <- select(train, -c(y))
# Create test data with and without the dependent variable
test <- data[(split+1):nrow(data), ]
test.treat <- select(test, -c(y))
# Load the package xgboost
library(xgboost)
# Run xgb.cv
cv <- xgb.cv(data = as.matrix(train.treat),
label = train$y,
nrounds = 100,
nfold = 10,
objective = "reg:linear",
eta = 0.1,
max_depth = 6,
early_stopping_rounds = 10,
verbose = 0 # silent
)
# Get the evaluation log
elog <- cv$evaluation_log
# Determine and print how many trees minimize training and test error
elog %>%
summarize(ntrees.train = which.min(train_rmse_mean), # find the index of min(train_rmse_mean)
ntrees.test = which.min(test_rmse_mean)) # find the index of min(test_rmse_mean)
# The number of trees to use, as determined by xgb.cv
ntrees <- 25
# Run xgboost
model_xgb <- xgboost(data = as.matrix(train.treat), # training data as matrix
label = train$y, # column of outcomes
nrounds = ntrees, # number of trees to build
objective = "reg:linear", # objective
eta = 0.001,
depth = 10,
verbose = 0 # silent
)
# Make predictions
test$pred <- predict(model_xgb, as.matrix(test.treat))
# Plot predictions vs. actual values of y
ggplot(test, aes(x = pred, y = y)) +
geom_point() +
geom_abline()
# Calculate RMSE
test %>%
mutate(residuals = y - pred) %>%
summarize(rmse = sqrt(mean(residuals^2)))
How does this look?
Also, one thing I don't get about xgboost() is why I have to take out the dependent variable from the dataset in the "data" option and then add it back in the "label" option. Why do we do this?
My dataset has 809 observations and 108 independent variables. Here is an arbitrary subset:
structure(list(year = c(2019, 2019, 2019, 2019), ht = c(74, 76,
74, 73), wt = c(223, 234, 215, 215), age = c(36, 29, 32, 24),
gp_l1 = c(16, 16, 11, 14), gp_l2 = c(7, 0, 16, 0), gp_l3 = c(16,
15, 16, 0), gs_l1 = c(16, 16, 11, 13), gs_l2 = c(7, 0, 16,
0), gs_l3 = c(16, 15, 16, 0), cmp_l1 = c(372, 430, 226, 310
), cmp_l2 = c(154, 0, 297, 0), cmp_l3 = c(401, 346, 364,
0), att_l1 = c(597, 639, 365, 486), y = c(8, 71.5, 26, 22
)), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"
))
My RMSE from this xgboost() model is 31.7, whereas my random forest and glmnet models give RMSEs around 13. The prediction metric I'm comparing to has an RMSE of 15.5. I don't get why my xgboost() model does so much worse than my random forest and glmnet models.
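One difference that may account for part of the gap: xgb.cv is run with eta = 0.1, but the final xgboost() call uses eta = 0.001 with only 25 rounds (and passes depth, whereas the cross-validation call uses max_depth). The back-of-the-envelope sketch below is an idealization, not how xgboost computes anything internally: it assumes each tree fits the current residuals perfectly, so every round removes a fraction eta of what is left. The helper remaining_fraction() is hypothetical.

```r
# Idealized assumption: each boosting round removes a fraction `eta`
# of the remaining residual, so (1 - eta)^nrounds is left after nrounds
remaining_fraction <- function(eta, nrounds) (1 - eta)^nrounds

cv_setting    <- remaining_fraction(eta = 0.1,   nrounds = 25)
final_setting <- remaining_fraction(eta = 0.001, nrounds = 25)

round(c(cv = cv_setting, final = final_setting), 3)
```

Under this assumption, roughly 7% of the initial residual remains with eta = 0.1, but about 98% remains with eta = 0.001, so the final model has barely moved from its starting prediction. Refitting with the same eta and tree depth that were cross-validated would be a reasonable first check.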
How can I change the size of the labels in contourplot {lattice} so that the numeric contour labels do not overlap in this simulation? I tried, without success, things like labcex within the function call, or inserting par(cex=.5) before plotting.
Here is the code, as in the tutorial:
diams = c(150, 124, 100, 107, 170, 144,
113, 108, 92, 129, 123, 118)
# Obtain typical mean and standard deviation
mean(diams)
# [1] 123.1667
sd(diams)
# [1] 22.38438
# A simple function to calculate likelihood.
# Values quickly become very small
like = function(data, mu, sigma) {
like = 1
for(obs in data){
like = like * 1/(sqrt(2*pi)*sigma) *
exp(-1/2 * (obs - mu)^2/(sigma^2))
}
return(like)
}
# For example, likelihood of data if mu=120 and sigma=20
like(diams, 120, 20)
# [1] 3.476384e-24
# A simple function to calculate log-likelihood.
# Values will be easier to manage
loglike = function(data, mu, sigma) {
loglike = 0
for(obs in data){
loglike = loglike +
log(1/(sqrt(2*pi)*sigma) *
exp(-1/2 * (obs - mu)^2/(sigma^2)))
}
return(loglike)
}
# Example, log-likelihood of data if mu=122 and sigma=20
loglike(diams, 122, 20)
# [1] -53.88605
# Let's try some combinations of parameters and
# plot the results
params = expand.grid(mu = seq(50, 200, 1),
sigma = seq(10, 40, 1))
params$logL = with(params, loglike(diams, mu, sigma))
summary(params)
library(lattice)
contourplot(logL ~ mu*sigma, data = params, cuts = 20)
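A possible approach, as far as I can tell from ?panel.levelplot: lattice draws with grid rather than base graphics, so par(cex = ...) has no effect, and labcex belongs to the base contour() function. In lattice, the labels argument can instead be given as a list with a cex component. The sketch below rebuilds the same parameter grid, using dnorm(..., log = TRUE) in place of the hand-written loglike() purely for brevity.

```r
library(lattice)

diams <- c(150, 124, 100, 107, 170, 144,
           113, 108, 92, 129, 123, 118)

params <- expand.grid(mu = seq(50, 200, 1),
                      sigma = seq(10, 40, 1))

# Normal log-likelihood for every (mu, sigma) pair on the grid
params$logL <- mapply(function(m, s) sum(dnorm(diams, mean = m, sd = s, log = TRUE)),
                      params$mu, params$sigma)

# labels = list(cex = ...) shrinks the contour labels
p <- contourplot(logL ~ mu * sigma, data = params, cuts = 20,
                 labels = list(cex = 0.5))
print(p)
```

If the labels still collide, reducing cuts or passing explicit levels through at = is another option.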
After generating a model using cph() from rms, calling model$y will return a survival object. Is there a function that will "undo" the survival object and return a data frame?
I would like to be able to use a survival object as an argument for a function I am writing, but I also need the response data. I am trying to avoid passing the data as an argument and creating the model inside the function.
A minimal working example is provided below:
library(rms)
# generate data
time <- c(82, 73, 89, 79, 72, 87, 103, 83, 100, 79)
event <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 1)
covar <- c(15, 11, 11, 20, 12, 13, 10, 11, 10, 14)
df <- data.frame(time, event, covar)
# Cox model
dd <- datadist(df)
options(datadist = 'dd')
model <- cph(Surv(time, event) ~ covar, x=TRUE, y=TRUE, surv=TRUE, data=df)
# returns a survival object
model$y
# what I want is a data frame
want <- data.frame(time, event)
want
I think you can get what you want by using as.matrix:
pl <- as.matrix(model$y)
as.data.frame(pl)
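To illustrate why this works without fitting a model, here is a minimal sketch using a Surv object built directly with the survival package (which rms loads). For right-censored data the underlying matrix has the columns time and status, so the event indicator comes back under the name status rather than event.

```r
library(survival)

time  <- c(82, 73, 89, 79, 72, 87, 103, 83, 100, 79)
event <- c(0, 0, 1, 0, 1, 0, 0, 0, 1, 1)

# A right-censored survival object, similar in structure to model$y
s <- Surv(time, event)

# Strip the Surv class, leaving a plain two-column matrix
m <- as.matrix(s)
colnames(m)

# Rebuild a data frame with the original column names
want <- data.frame(time = m[, "time"], event = m[, "status"])
want
```

Other censoring types (for example interval or counting-process data) produce different columns, so the column names are worth checking before renaming.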