Confused about the code for over-sampling with R

The code below is about oversampling houses with more than 10 rooms. May I ask what prob = ifelse(housing.df$ROOMS>10, 0.9, 0.01) means? Thanks a lot.
s <- sample(row.names(housing.df), 5, prob = ifelse(housing.df$ROOMS>10, 0.9, 0.01))
housing.df[s, ]

I imagine the purpose of this code is first to check whether a given house in the data set has more than ten rooms. If it does, it gets a sampling weight of 0.9; otherwise it gets a weight of 0.01.
sample will then draw from the given row names using these associated weights, thus favouring houses with more than ten rooms when it samples. This creates your over-sample.
Is this what you mean?
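To see this in action, here is a minimal, self-contained sketch. The original housing.df is not shown in the question, so the toy data frame below is an assumption for illustration:

```r
set.seed(1)
# Toy stand-in for the question's housing.df: 50 houses with 4-14 rooms
housing.df <- data.frame(ROOMS = sample(4:14, 50, replace = TRUE))
row.names(housing.df) <- paste0("house", 1:50)

# Houses with more than 10 rooms get weight 0.9; all others get 0.01
w <- ifelse(housing.df$ROOMS > 10, 0.9, 0.01)

# Draw 5 row names without replacement, favouring the large houses
s <- sample(row.names(housing.df), 5, prob = w)
housing.df[s, , drop = FALSE]
```

Because the weights differ by a factor of 90, the five sampled rows will usually (though not always) be houses with more than ten rooms.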


R: Standardization with quota sampling

How do you properly standardize a variable that was collected with quota sampling? Let me explain.
I am working with survey data that was collected with quota sampling. Each quota is a different village (1500 in total). The questionnaire was applied to 10% of each village's population. The villages vary a lot in size: from tens of thousands to a mere few hundred.
I am working with logit models and want to standardize one of my dataframe's columns. Should I standardize it as is? Or would the population imbalance between the villages bias my standardized variable? Should I include population weights?
To illustrate with data, let's imagine there are only two villages (village 1 is big and village 2 is small). This is how the data would look:
total1 <- data.frame("response1" = c(0.4, -0.1, 2.1, 0.08, 0, -2.5),
                     "village.number" = c(1, 1, 1, 1, 2, 2))
The question stands: how do I standardize response1 when observations from village 1 are double those from village 2?
Thank you.
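No answer is recorded in this thread, but one commonly suggested approach is to standardize with a weighted mean and SD. The weighting scheme below, where each village contributes equal total weight, is an assumption for illustration rather than a recommendation from the thread:

```r
total1 <- data.frame("response1" = c(0.4, -0.1, 2.1, 0.08, 0, -2.5),
                     "village.number" = c(1, 1, 1, 1, 2, 2))

# Hypothetical weights: inverse of village sample size, so each village
# carries the same total weight regardless of how many rows it contributes
w <- 1 / ave(total1$response1, total1$village.number, FUN = length)

# Weighted mean and variance, then the standardized column
w.mean <- sum(w * total1$response1) / sum(w)
w.var  <- sum(w * (total1$response1 - w.mean)^2) / sum(w)
total1$response1.std <- (total1$response1 - w.mean) / sqrt(w.var)
```

With true population sizes available, w could instead be made proportional to each village's population divided by its sample size.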

Creating Group Constraints in PortfolioAnalytics for R

I am working to put together portfolio optimizations with 11 securities using the PortfolioAnalytics package in R. Of the 11, 5 are equity funds, 2 are preferred stock funds, 3 are fixed income, and 1 is a money market fund. I would like to fix my asset class allocations at 55% equity, 10% preferred, 30% fixed income, and 5% money market, fully invested with no leverage and no turnover. What I would hope to see as output is various permutations of portfolios with static asset class allocations.
I have tried to use the add.constraint function to achieve this and I've used the following code:
port <- add.constraint(portfolio = port, type = "group",
                       groups = list(c(1:5), c(6:7), c(8:10), c(11)),
                       group_min = c(0.55, 0.1, 0.3, 0.05),
                       group_max = c(0.55, 0.1, 0.3, 0.05),
                       group_pos = c(1, 1, 1, 1))
When I attempt to generate random portfolios I get the following error message:
rportfolios <- random_portfolios(port, permutations = 5000, rp_method = "sample")
Error in rp_transform(w = tmp_group_w, min_sum = cLO[j], max_sum = cUP[j], :
Infeasible portfolio created, perhaps increase max_permutations and/or adjust your parameters.
Any thoughts on where I am going wrong?
William, I think the problem is caused by the hard group constraints and the way the package's random portfolio generator works. Since the portfolios are created randomly, it is rare for the generator to produce one that exactly matches your criteria given the small number of permutations tried (i.e. 5000).
This may not be ideal for your problem, but if you provide a little bit of wiggle room in each group's min-max range, the random generator is much more likely to create a portfolio that falls within it. For example, instead of setting min = max = 0.55, try min = 0.5495 and max = 0.555, and at the same time increase the permutations to 10k or more. I had the same problem and resolved it this way.
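A sketch of the suggested adjustment (port here is the portfolio object from the question, so this fragment assumes the earlier setup and is not runnable on its own):

```r
library(PortfolioAnalytics)

tol <- 0.005                           # small tolerance band instead of hard equality
targets <- c(0.55, 0.10, 0.30, 0.05)

port <- add.constraint(portfolio = port, type = "group",
                       groups = list(1:5, 6:7, 8:10, 11),
                       group_min = targets - tol,
                       group_max = targets + tol)

# More permutations raise the chance of hitting the narrow feasible region
rportfolios <- random_portfolios(port, permutations = 10000, rp_method = "sample")
```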

Programming probability of patients being well in R

I'm going to preface this with the fact that I am a complete R novice.
I have the following problem:
Consider a simple model that progresses year-by-year. In year i, let W_i = patient is well, I_i = patient is ill, and D_i = patient is dead. Transitions can be modeled as a set of conditional probabilities.
Let L = number of years that the patient is well.
I have come up with the probability mass function of L to be P(L)=(1-p)(p)^{L-1}.
The given information is that a patient is well in year 1 and given their age and risk factors, P(W_{i+1}|W_{i})=0.2 for all i
The problem is to write a function in R that simulates the trajectory of a single patient and returns the number of years the patient is well.
I thought that this could be programmed in R as a binomial distribution using the rbinom function. For a single patient,
rbinom(1, 1, 0.2)
but I don't think that this would return the number of years that the patient is well. I'm thinking that the rbinom function should be the start, and that it would need to be paired with a way to count the number of years that a patient is well, but I don't know how to do that.
The next problem is to use R to simulate 1000 patient trajectories and find the sample mean of years of wellness. I'm assuming that this would be an extension of the previous part, just replacing the 1 patient with 1000. However I can't quite figure out where to replace the 1 with 1000: n or size
rbinom(n, size, prob)
This is assuming that using rbinom is the correct thing to do in the first place...
If I were to do this in another programming language (say Python) I would use a while loop conditional on patient_status=W and starting with L=0 iterate through the loop and add 1 each successful iteration. I'm not sure if R works in the same way.
Let's start with what rbinom(1, 1, 0.2) does: it returns 1 draw of the sum of 1 independent Bernoulli (that is, 0-1) random variable that has probability 0.2 of equalling 1. So, that line will only output 0 (which it will do 80% of the time) or 1 (which it will do the other 20% of the time). As you noted, this isn't what you want.
The issue here is the selection of a random variable. A binomial variable is great for something like, "I roll ten dice. How many land on 6?" because it has the following essential components:
outcomes dichotomized into success / failure
a fixed number (ten) of trials
a consistent probability of success (1/6)
independent trials (dice do not affect each other)
The situation you're describing doesn't have those features. So, what to do?
Option 1: Go with your instinct for a while() loop. I'll preface this by saying that while() loops are discouraged in R for various reasons (chiefly inefficiency). But, since you already understand the concept, let's run with it.
one_patient <- function(){
  status <- 1    # 1 = healthy, 0 = ill
  years <- (-1)  # count how many years completed while healthy
  while(status == 1){
    years <- years + 1           # this line will run at least one time
    status <- rbinom(1, 1, 0.2)  # your rbinom(1, 1, 0.2) line makes an appearance!
  }
  return(years)
}
Now, executing one_patient() will return the number of years the patient successfully transitioned from well to well. This will be at least 0, since years starts at -1 and is incremented at least once. It could be very high if the patient is lucky, though it most likely won't be. You can experiment with this by changing the 0.2 parameter to something more optimistic like 0.99 to simulate long life spans.
Option 2: Rethink the random variable. I mentioned above that the variable wasn't binomial; in fact, it's geometric. A situation like, "I roll a die until it lands on 6. How many rolls did it take?" is geometric because it has the following essential components:
outcomes dichotomized into success / failure
a consistent probability of success
repeated trials that terminate when the first success is reached
independent trials
Much like how binomial variables have useful functions in R such as rbinom(), pbinom(), qbinom(), dbinom(), there is a corresponding collection for geometric variables: rgeom(), pgeom(), qgeom(), dgeom().
To use rgeom(), we need to be careful about one detail: here, a "success" is characterized as the patient becoming ill, because that's when the experiment ends. (Above, by encoding the patient being well as 1, we're implicitly using the reverse perspective.) This means that the "success" probability is 0.8. rgeom(1, 0.8) will return the number of draws strictly before the first success, which is equivalent to the number of years the patient went from well to well, as above. Note that the 1 parameter refers to the number of times we want to run this experiment and not something else. Hence:
rgeom(1, 0.8)
will accomplish the same task as the one_patient() function we defined above. (That is, the distribution of outputs for each will be the same.)
For multiple patients, you can either wrap the one_patient() function inside replicate(), or you can just directly adjust the first parameter of rgeom(1, 0.8). The second option is much faster, though both are fast if just simulating 1000 patients.
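For the follow-up question about 1000 patients, a small illustration using the same reversed "success" probability of 0.8 as above:

```r
set.seed(42)              # for reproducibility
sims <- rgeom(1000, 0.8)  # one well-year count per patient
mean(sims)                # sample mean of well years; theoretical mean is (1 - 0.8)/0.8 = 0.25
```

So it is n, the first argument of rgeom(), that becomes 1000, because it counts how many times the whole experiment is repeated.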
Addendum
Proof that both have the same effect:
sims1 <- replicate(10000, one_patient())
hist(sims1, breaks = seq(-0.5, max(sims1) + 0.5, by = 1))
sims2 <- rgeom(10000, 0.8)
hist(sims2, breaks = seq(-0.5, max(sims2) + 0.5, by = 1))
Proof that rgeom() is faster:
library(microbenchmark)
microbenchmark(
replicate(10000, one_patient()),
rgeom(10000, 0.8)
)
# Unit: milliseconds
#                            expr     min       lq      mean   median       uq     max neval
# replicate(10000, one_patient()) 35.4520 38.77585 44.135562 43.82195 46.05920 73.5090   100
#               rgeom(10000, 0.8)  1.1978  1.22540  1.273766  1.23640  1.27485  1.9734   100

How to reproduce the H2o GBM class probability calculation

I've been using h2o.gbm for a classification problem, and wanted to understand a bit more about how it calculates the class probabilities. As a starting point, I tried to recalculate the class probability of a gbm with only 1 tree (by looking at the observations in the leafs), but the results are very confusing.
Let's assume my positive class variable is "buy" and negative class variable "not_buy" and I have a training set called "dt.train" and a separate test-set called "dt.test".
In a normal decision tree, the class probability for "buy" P(has_bought="buy") for a new data row (test-data) is calculated by dividing all observations in the leaf with class "buy" by the total number of observations in the leaf (based on the training data used to grow the tree).
However, h2o.gbm seems to do something different, even when I simulate a 'normal' decision tree (setting ntrees to 1, and all sample_rate parameters to 1). I think the best way to illustrate this confusion is by describing what I did step by step.
Step 1: Training the model
I do not care about overfitting or model performance. I want to make my life as easy as possible, so I've set ntrees to 1 and made sure all training data (rows and columns) are used for each tree and split by setting all sample_rate parameters to 1. Below is the code to train the model.
base.gbm.model <- h2o.gbm(
  x = predictors,
  y = "has_bought",
  training_frame = dt.train,
  model_id = "2",
  nfolds = 0,
  ntrees = 1,
  learn_rate = 0.001,
  max_depth = 15,
  sample_rate = 1,
  col_sample_rate = 1,
  col_sample_rate_per_tree = 1,
  seed = 123456,
  keep_cross_validation_predictions = TRUE,
  stopping_rounds = 10,
  stopping_tolerance = 0,
  stopping_metric = "AUC",
  score_tree_interval = 0
)
Step 2: Getting the leaf assignments of the training set
What I want to do, is use the same data that is used to train the model, and understand in which leaf they ended up in. H2o offers a function for this, which is shown below.
train.leafs <- h2o.predict_leaf_node_assignment(base.gbm.model, dt.train)
This will return the leaf node assignment (e.g. "LLRRLL") for each row in the training data. As we only have 1 tree, this column is called "T1.C1" which I renamed to "leaf_node", which I cbind with the target variable "has_bought" of the training data. This results in the output below (from here on referred to as "train.leafs").
Step 3: Making predictions on the test set
For the test set, I want to predict two things:
The prediction of the model itself P(has_bought="buy")
The leaf node assignment according to the model.
test.leafs <- h2o.predict_leaf_node_assignment(base.gbm.model, dt.test)
test.pred <- h2o.predict(base.gbm.model, dt.test)
After finding this, I've used cbind to combine these two predictions with the target variable of the test-set.
test.total <- h2o.cbind(dt.test[, c("has_bought")], test.pred, test.leafs)
The result of this, is the table below, from here on referred to as "test.total"
Unfortunately, I do not have enough rep points to post more than 2 links. But if you click on "Table 'test.total' combined with manual probability calculation" in step 5, it's basically the same table without the column "manual_prob_buy".
Step 4: Manually predicting probabilities
Theoretically, I should be able to predict the probabilities now myself. I did this by writing a loop, that loops over each row in "test.total". For each row, I take the leaf node assignment.
I then use that leaf-node assignment to filter the table "train.leafs", and check how many observations have a positive class (has_bought == 1) (posN) and how many observations are there in total (totalN) within the leaf associated with the test-row.
I perform the (standard) calculation posN / totalN, and store this in the test-row as a new column called "manual_prob_buy", which should be the probability of P(has_bought="buy") for that leaf. Thus, each test-row that falls in this leaf should get this probability.
This for-loop is shown below.
for(i in 1:nrow(dt.test)){
  leaf <- test.total[i, leaf_node]
  totalN <- nrow(train.leafs[train.leafs$leaf_node == leaf, ])
  posN <- nrow(train.leafs[train.leafs$leaf_node == leaf &
                             train.leafs$has_bought == "buy", ])
  test.total[i, manual_prob_buy := posN / totalN]
}
Step 5: Comparing the probabilities
This is where I get confused. Below is the updated "test.total" table, in which "buy" represents the probability P(has_bought="buy") according to the model and "manual_prob_buy" represents the manually calculated probability from step 4. As far as I know, these probabilities should be identical, given that I only used 1 tree and set the sample rates to 1.
Table "test.total" combined with manual probability calculation
The Question
I just don't understand why these two probabilities are not the same. As far as I know, I've set the parameters in such a way that it should just be like a 'normal' classification tree.
So the question: does anyone know why I find differences in these probabilities?
I hope someone could point me to where I might have made wrong assumptions. I just really hope I did something stupid, as this is driving me crazy.
Thanks!
Rather than compare the results from R's h2o.predict() with your own handwritten code, I recommend you compare with an H2O MOJO, which should match.
See an example here:
http://docs.h2o.ai/h2o/latest-stable/h2o-genmodel/javadoc/overview-summary.html#quickstartmojo
You can run that simple example yourself, and then modify it according to your own model and new row of data to predict on.
Once you can do that, you can look at the code and debug/single-step it in a java environment to see exactly how the prediction gets calculated.
You can find the MOJO prediction code on github here:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/easy/EasyPredictModelWrapper.java
The main cause of the large difference between your observed probabilities and h2o's predictions is your learning rate. With learn_rate = 0.001, the gbm adjusts the probabilities only a relatively small amount away from the overall base rate. If you set learn_rate = 1 you will have something much closer to a decision tree, and h2o's predicted probabilities will come much closer to the rates in each leaf node.
There is a secondary difference which will then become apparent, as your probabilities will still not match exactly. This is because the leaf values are fitted by gradient descent (the G in GBM) on the logistic loss function, rather than taken directly from the proportion of observations in each leaf node.
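A sketch of the retrained model this answer describes (it reuses the questioner's predictors and dt.train objects, so it is not runnable on its own):

```r
# Same single-tree setup as in the question, but with learn_rate = 1 so
# the predictions should approach the per-leaf rates of a plain decision tree
tree.like.model <- h2o.gbm(
  x = predictors,
  y = "has_bought",
  training_frame = dt.train,
  ntrees = 1,
  learn_rate = 1,   # the key change from the original 0.001
  max_depth = 15,
  sample_rate = 1,
  col_sample_rate = 1,
  col_sample_rate_per_tree = 1,
  seed = 123456
)
```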

Random sampling based on vector of probability weights

I have the vector d <- 1:100.
I want to sample k=3 times from this vector without replacement. I would like elements at a distance of length(d)/k from the first sampled element to have a higher probability of being sampled. I am not yet sure how much higher. I know that sample has a prob= argument; however, I can't seem to find a way for the prob= vector to be recalculated based on the location of the initial sample.
Any ideas?
Example:
d <- 1:100. Let's say the first trial samples d[30]=30. Then the elements of d that are near 0, 60 and 90 should have a higher probability of being sampled. So after the initial sample, the distribution of the sampling probabilities over the remaining elements of d is as in the image:
I think:
samp <- sample(1:100,1)
prob <- rep(1,100)
prob[samp]=0
MORE EDIT: I'm an idiot today. Now this will make the probability shape you asked for.
peke <- c(2, 5, 7, 10, 7, 5, 2)  # your 'triangle' probability
for (jj in c(-1, 1, 2)) {        # peaks 30 below and 30, 60 above the first draw
  idx  <- samp + jj * 30 + (-3:3)
  keep <- idx >= 1 & idx <= 100  # drop positions that fall off either end
  prob[idx[keep]] <- peke[keep]
}
newsamp <- sample(1:100, 1, prob = prob)
You may want to add a slight offset if that doesn't place the probability peaks where you wanted them.
