I have an input.dat file containing 11 parameters.
Using input.dat, I run a piece of software that produces a data set with (rows, columns) = (100, 3). I already have the correct (100, 3) data set.
I then calculate the cost function, i.e. the error between these two data sets.
I am trying to use optimize.minimize to find the set of parameters in input.dat that minimizes the cost function, i.e. the error between the two data sets.
Here input.dat contains 11 numbers.
y = the data I already have, shape (100, 3)
y_pred = output of the software run with input.dat, shape (100, 3)
cost_function = y - y_pred
optimize.minimize(cost_function, input.dat, method="BFGS")
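One thing worth checking in the setup above: a minimizer needs an objective that takes the parameter vector and returns a single number, whereas y - y_pred is a (100, 3) array. A minimal sketch of such an objective, using only the standard library, is below. The function run_model is a hypothetical stand-in for the external software (in practice it would write input.dat, run the program, and read the output back), and the sum of squared residuals is just one assumed choice of error measure.

```python
# Hypothetical stand-in for the external software: maps 11 parameters
# to a 100 x 3 table. A real version would write input.dat, run the
# program, and parse its output.
def run_model(params):
    return [[params[(3 * i + j) % len(params)] * (i + 1) for j in range(3)]
            for i in range(100)]


def make_cost(y):
    """Build a cost function that maps a parameter list to one scalar."""
    def cost(params):
        y_pred = run_model(params)
        # Sum of squared residuals over all 100 x 3 entries. The result is
        # a single number, not the 100 x 3 array y - y_pred.
        return sum((a - b) ** 2
                   for row, row_pred in zip(y, y_pred)
                   for a, b in zip(row, row_pred))
    return cost


true_params = [0.5 * k for k in range(11)]
y = run_model(true_params)      # pretend this is the known-correct data set
cost = make_cost(y)
start = [1.0] * 11              # starting guess, e.g. the numbers in input.dat
print(cost(start), cost(true_params))
```

With SciPy installed, this objective could then be passed as optimize.minimize(cost, start, method="BFGS"), where start holds the 11 numbers rather than the file name. BFGS estimates gradients by finite differences when none is supplied, so each iteration runs the software several times. If the output is noisy, a derivative-free method such as method="Nelder-Mead" may be worth trying.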
I am struggling to find the correct API for releasing memory for an object created by the H2O grid. This code was pre-written by someone else and I am currently maintaining it.
#train grid search
gbm_grid1 <- h2o.grid(algorithm = "gbm" #specifies gbm algorithm is used
,grid_id = paste("gbm_grid1",current_date,sep="_") #defines a grid identification
,x = predictors #defines column variables to use as predictors
,y = y #specifies the response variable
,training_frame = train1 #specifies the training frame
#gbm parameters to remain fixed
,nfolds = 5 #specify that the number of folds for cross-validation is 5 (this is acceptable here in order to reduce training time)
,distribution = "bernoulli" #specify that we are predicting a binary dependent variable
,ntrees = 1000 #specify the number of trees to build (1000 as essentially the maximum number of trees that can be built. Early stopping parameters defined later will make it unlikely our model will reach 1000 trees)
,learn_rate = 0.1 #specify the learn rate used for gradient descent optimization (goal is to use as small a learn rate as possible)
,learn_rate_annealing = 0.995 #specifies that the learn rate will perpetually decrease by a factor of 0.995 (this can help speed up training for our grid search)
,max_depth = tuned_max_depth
,min_rows = tuned_min_rows
,sample_rate = 0.8 #specify the amount of row observations used when making a split decision
,col_sample_rate = 0.8 #specify the amount of column observations used when making a split decision
,stopping_metric = "logloss" #specify loss function
,stopping_tolerance = 0.001 #specify minimum change required in stopping metric for individual model to continue training
,stopping_rounds = 5 #specify maximum amount of training rounds stopping metric has to change in excess of stopping tolerance
#specifies hyperparameters to fluctuate during model building in the grid search
,hyper_params = gbm_hp2
#specifies the search criteria that includes stop training metrics to speed up model building
,search_criteria = search_criteria2
#sets a reproducible seed
,seed = 123456
)
The problem is that I believe this code was written a while ago and has since been deprecated. h2o.rm(gbm_grid1) fails, and RStudio tells me that I require a hex identifier. So I assigned my object an identifier and tried h2o.rm(gbm_grid1, "identifier.hex"), and it tells me I cannot release this type of object.
The issue is I run out of memory if I move onto the next steps of the script. What should I do?
This is what I get with h2o.ls()
Yes, you can remove objects with h2o.rm(). You can use the variable name or key.
You can use h2o.ls() to check what objects are in memory. Also, you can add the argument cascade = TRUE to the rm method to remove sub-models.
See more here
I am having an issue implementing recency-weighting for xgboost training in R (i.e. passing a weight vector to xgb.DMatrix) - although the weighting affects the learning curve readout for the training set, it does not appear to have any impact at all on the actual model produced - performance in the test set is identical.
I can't seem to get to the bottom of this issue or generate a reproducible example. So instead I would like to pass the Date column of the features to a custom loss function, something like:
custom_loss <- function(preds, dat) {
  labels <- getinfo(dat, "label")
  dates <- [a vector corresponding to the dates associated with each prediction]
  grad <- f(dates) * -2 * (labels - preds)
  hess <- f(dates) * 2
  return(list(grad = grad, hess = hess))
}
[where f is an increasing function of the value in dates, so later samples matter more when training]
But I can't seem to figure out how to do this, any suggestions?
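One possible pattern, sketched here in Python rather than R, is to build the loss with a factory function so that the per-row weights are captured in a closure instead of being passed through the data object. The names make_recency_objective, ages_in_days and half_life are hypothetical. The exponential decay is only one assumed choice for f, and the weights must be in the same row order as the training data. In a real xgboost custom objective the labels would come from the data object (getinfo(dat, "label") in R) rather than from a second argument.

```python
def make_recency_objective(ages_in_days, half_life=30.0):
    """Return a weighted squared-error objective that remembers its weights."""
    # Assumed weighting f: a row's weight halves for every `half_life` days
    # of age, so more recent rows (smaller age) count more.
    weights = [0.5 ** (age / half_life) for age in ages_in_days]

    def objective(preds, labels):
        # Gradient and Hessian of w * (label - pred)^2 with respect to pred.
        grad = [w * 2.0 * (p - y) for w, p, y in zip(weights, preds, labels)]
        hess = [w * 2.0 for w in weights]
        return grad, hess

    return objective


objective = make_recency_objective([0, 30])
grad, hess = objective([1.0, 1.0], [0.0, 0.0])
print(grad, hess)
```

R functions are closures as well, so the same idea should carry over: define custom_loss inside a function that takes the dates vector, and return custom_loss from it.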
In my dataset, I have ants that switch between one state (in this case a resting state) and all other states over a period of time. I am attempting to fit an exponential distribution to the number of times an ant spends in a resting state for some duration of time (for instance, the ant may rest for 5 seconds 10 times, or it could rest for 6 seconds 5 times, etc.). While subjectively this distribution of durations seems to be exponential, I can't fit a single parameter exponential distribution (where the one parameter is rate) to the data. Is this possible to do with my dataset, or do I need to use a two parameter exponential distribution?
I am attempting to fit the data to the following equation (where lambda is rate):
lambda * exp(-lambda * x).
This, however, doesn't seem to be mathematically possible to fit to either the counts of my data or the probability density of my data. In R I attempt to fit the data with the following code:
fit = nls(newdata$x.counts ~ (b*exp(b*newdata$x.mids)), start =
list(x.counts = 1, x.mids = 1, b = 1))
When I do this, though, I get the following message:
Error in parse(text= x, keep.source = FALSE):
<text>:2:0: unexpected end of input
1: ~
I believe I am getting this because it's mathematically impossible to fit this particular equation to my data. Am I correct in this, or is there a way to transform the data or alter the equation so I can make it fit? I can also make it fit with the equation lambda * exp(mu * x), where mu is another free parameter, but my goal is to make this equation as simple as possible, so I would prefer to use the one-parameter version.
Here is the data, as I can't seem to find a way to attach it as a csv:
First, you have a typo in your formula: you forgot the - sign in the exponent, exp(-b*newdata$x.mids).
But this is not what is throwing the error. The start argument should be a list that initializes only the parameters being estimated (here b), not x.counts or x.mids.
So the correct version would be:
fit = nls(newdata$x.counts ~ b*exp(-b*newdata$x.mids), start = list(b = 1))
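As a cross-check on the least-squares fit, the one-parameter exponential also has a closed-form maximum-likelihood estimate: the rate is the reciprocal of the mean duration. The sketch below is in Python rather than R and uses made-up bin midpoints and counts (the original data are not shown here). Treating every observation as sitting at its bin midpoint is an approximation. It also illustrates why lambda * exp(-lambda * x) cannot match raw counts directly: it is a density, so it has to be multiplied by the total count and the bin width before being compared with counts.

```python
import math

# Hypothetical binned rest durations: bin midpoints (seconds) and counts.
mids = [0.5, 1.5, 2.5, 3.5, 4.5]
counts = [40, 24, 15, 9, 5]
width = 1.0                      # assumed bin width

n = sum(counts)
mean_duration = sum(m * c for m, c in zip(mids, counts)) / n
rate = 1.0 / mean_duration       # exponential MLE: rate = 1 / mean

# Expected count in each bin = n * width * density at the midpoint.
expected = [n * width * rate * math.exp(-rate * m) for m in mids]

for m, c, e in zip(mids, counts, expected):
    print(f"mid={m:.1f}  observed={c:3d}  expected={e:6.2f}")
```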
I have some time-to-event data, from which I need to generate around 200 shape/scale parameters for subgroups for a simulation model. I have analysed the data, and it best follows a Weibull distribution.
Normally I would use the fitdistrplus package and fitdist(x, "weibull") to do so. However, this data has been matched using kernel matching, and I have a variable of weighting values called km, so the fit needs to incorporate weights, which isn't something fitdist can do as far as I can tell.
With my gamma-distributed data, instead of using fitdist I did the calculation manually using the wtd.mean and wtd.var functions from the Hmisc package, which worked well. However, finding a similar formula for the Weibull is eluding me.
I've been testing a few options and comparing them against the fitdist results:
test_data <- rweibull(100, 0.676, 946)
fitweibull <- fitdist(test_data, "weibull", method = "mle", lower = c(0,0))
shape scale
0.6981165 935.0907482
I first tested this: The Weibull distribution in R (ExtDist)
m1 <- mle2(y~dweibull(shape=exp(lshape),scale=exp(lscale)),
which gave me lshape = -0.3919991 and lscale = 6.852033
The other thing I've tried is eweibull from the EnvStats package.
eweibull <- eweibull(test_data)
shape scale
0.698091 935.239277
However, while these are giving results, I still don't think I can fit my data with the weights into any of these.
Edit: I have also tried the similarly named eWeibull from the ExtDist package (which I'm not 100% sure still works, but does have a weibull function that takes a weight!). I get a lot of error messages about the inputs being non-computable (NA or infinite). If I do it with map, so map(test_data, test_km, eWeibull) I get [[NULL] for all 100 values. If I try it just with test_data, I get a long string of errors associated with optimx.
I have also tried fitDistr from propagate which gives errors that weights should be a specific length. For example, if both are set to be 100, I get an error that weights should be length 94. If I set it to 94, it tells me it has to be length of 132.
I need to be able to pass either a set of pre-weighted mean/var/sd etc data into the calculation, or have a function that can take data and weights and use them both in the calculation.
After much trial and error, I edited the eweibull function from the EnvStats package so that, instead of mean(x) and sd(x), it uses wtd.mean(x, w) and sqrt(wtd.var(x, w)). This now runs and outputs weighted values.
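For comparison, here is a sketch of a different route: a weighted maximum-likelihood fit, written in Python rather than R. This is not the edited eweibull described above. It maximizes the weighted log-likelihood sum(w_i * log f(x_i)) by solving the shape equation with bisection and then computing the scale in closed form. The function name weighted_weibull_mle and the search bounds lo and hi are assumptions. The sketch requires all x to be positive and assumes the shape lies between lo and hi.

```python
import math


def weighted_weibull_mle(x, w, lo=1e-3, hi=100.0, tol=1e-10):
    """Weighted MLE of (shape, scale) for positive data x with weights w."""
    m = max(x)
    z = [xi / m for xi in x]     # rescale so that z ** k cannot overflow
    total_w = sum(w)
    mean_log = sum(wi * math.log(zi) for wi, zi in zip(w, z)) / total_w

    def score(k):
        # Profile score equation for the shape; increasing in k, zero at the MLE.
        num = sum(wi * zi ** k * math.log(zi) for wi, zi in zip(w, z))
        den = sum(wi * zi ** k for wi, zi in zip(w, z))
        return num / den - 1.0 / k - mean_log

    while hi - lo > tol:         # bisection on the shape
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
    shape = 0.5 * (lo + hi)
    mean_pow = sum(wi * zi ** shape for wi, zi in zip(w, z)) / total_w
    scale = m * mean_pow ** (1.0 / shape)
    return shape, scale


# A weight of 2 on an observation should act like observing it twice.
s1, c1 = weighted_weibull_mle([1.0, 2.0, 3.0, 4.0], [1, 2, 1, 1])
s2, c2 = weighted_weibull_mle([1.0, 2.0, 2.0, 3.0, 4.0], [1, 1, 1, 1, 1])
print(s1, c1, s2, c2)
```

With all weights equal, this reduces to the ordinary Weibull MLE, so it can be sanity-checked against the fitdist output on unweighted test data.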
I am doing a regression with several categorical and continuous variables mixed together. To simplify my question: I want to create a regression model that predicts driving time for a given driver in different zones, given the miles driven. Let's say I have 5 different drivers and 2 zones in my training data.
I know I probably need to build 5*2=10 regression models for prediction. What I am using in R is
m <- lm(driving_time ~ factor(driver)+factor(zone)+miles)
But it seems that R doesn't expand the combinations. Is there a smart way to do the expansion automatically in R, or do I have to write the 10 regression models one by one? Thank you.
Please read ?formula. + in a formula means include that variable as a main effect. You seem to be looking for an interaction term between driver and zone. You create an interaction term using the : operator. There is also a shortcut to get both the main and interaction effects via the * operator.
There is some confusion as to whether you want miles to also interact, but I'll assume not here as you only mention 2 x 5 terms.
foo <- transform(foo, driver = factor(driver), zone = factor(zone))
m <- lm(driving_time ~ driver * zone + miles, data = foo)
Here I assume your data are in data frame foo. The first line separates the data processing from the model specification/fitting by converting the variables of interest to factors before fitting.
The formula then specifies main and interaction effects for driver and zone, plus a main effect for miles.
If you want interactions between all three, then either of the following:
m <- lm(driving_time ~ driver * zone * miles, data = foo)
m <- lm(driving_time ~ (driver + zone + miles)^3, data = foo)
would do that for you.