How to run a function on EACH of my observations in R?

My problem is as follows:
I have a dataset of 6,000 observations containing information on customers (each observation is one client's information).
I'm optimizing a given function (in my case a profit function) in order to find an optimal value for my variable of interest. Specifically, I'm looking for the optimal interest rate I should offer in order to maximize my expected profits.
I don't have any doubts about my function. The problem is that I don't know how I should proceed in order to apply this function to EACH OBSERVATION and obtain an OPTIMAL INTEREST RATE for EACH OF MY 6,000 CLIENTS (or observations, as you prefer).
Until now, it has been easy to find the UNIQUE optimum (the same for all clients) for this variable that would maximize my profits (that is, the global maximum, I guess). But what I need to know is how to apply my optimization problem to EACH of my 6,000 observations, INDIVIDUALLY, in order to have the optimal interest rate to offer to each customer (that is, 6,000 optimal interest rates, one for each of them).
I guess I should do something similar to a for loop, but my experience in this area is limited, and I'm quite frustrated already. What's more, I've tried using mapply(myfunction, mydata), but I only get error messages.
This is what my (really) simple code looks like now:
profits <- function(Rate)
  sum((Amount * (Rate - 1.2) / 100) *
      (1 / (1 + exp(0.600002438 - 0.140799335888812 *
        ((Previous.Rate - Rate) + (Competition.Rate - Rate))))))
And the result for ONE optimum for the entire sample:
> optimise(profits, lower = 0, upper = 100, maximum = TRUE)
$maximum
[1] 6.644821
$objective
[1] 1347291
So the thing is: how do I rewrite my code in order to maximize this and obtain the optimum of my variable of interest for EACH of my rows?
I hope I've been clear! Thank you all in advance!

It appears each of your customers is independent, so you can just put lapply() around the optimise() call:
results <- lapply(customer_list, function(one_customer) {
  # profits() must be written to use this customer's own values,
  # e.g. by taking the customer as a second argument
  optimise(function(Rate) profits(Rate, one_customer),
           lower = 0, upper = 100, maximum = TRUE)
})
This will return a very big list, where each element has a $maximum and an $objective. You can then extract and total the $maximum values to find just how rich you have become!
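For concreteness, here is a minimal sketch of the whole pattern, assuming the per-customer data sit in a data frame called mydata with columns Amount, Previous.Rate and Competition.Rate (the data frame name and the per-customer rewrite of profits are my assumptions, not code from the question):
# Same formula as the question's profits(), but written for a single row
profits_one <- function(Rate, cust) {
  (cust$Amount * (Rate - 1.2) / 100) *
    (1 / (1 + exp(0.600002438 - 0.140799335888812 *
      ((cust$Previous.Rate - Rate) + (cust$Competition.Rate - Rate)))))
}

# Optimise separately for each of the 6,000 rows
opt_list <- lapply(seq_len(nrow(mydata)), function(i) {
  optimise(profits_one, lower = 0, upper = 100, maximum = TRUE,
           cust = mydata[i, ])
})

optimal_rates    <- sapply(opt_list, `[[`, "maximum")
expected_profits <- sapply(opt_list, `[[`, "objective")
Because each call works on a single row, the sum() in the original function is no longer needed.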

Related

Making a for loop in R

I am just getting started with R, so I am sorry if I say things that don't make sense.
I am trying to make a for loop which does the following:
l_dtest[[1]]<-vector()
l_dtest[[2]]<-vector()
l_dtest[[3]]<-vector()
l_dtest[[4]]<-vector()
l_dtest[[5]]<-vector()
all the way up to any number that will be assigned as n. For example, if n were chosen to be 100, then it would repeat this all the way to l_dtest[[100]]<-vector().
I have tried multiple different attempts at doing this and here is one of them.
n <- 4
p <- (1:n)
l_dtest <- list()
for (i in p) {
  print((l_dtest[i] <- vector()) <- i)
}
Again, I am VERY new to R, so I don't know what I am doing or what is wrong with this loop.
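For what it's worth, here is a minimal sketch of the list-of-empty-vectors step described above; replicate() with simplify = FALSE is one idiomatic way to build it (whether an empty logical vector, which is what vector() returns, is really what the exercise intends is an assumption):
n <- 100
# A list whose elements l_dtest[[1]] ... l_dtest[[n]] are all empty vectors
l_dtest <- replicate(n, vector(), simplify = FALSE)

length(l_dtest)  # 100
l_dtest[[37]]    # logical(0), i.e. an empty vector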
The detailed background for why I need to do this is that I need to write an R function that receives as input the population size "n", runs a simulation of the model below with that population size, and returns the number of generations it took to reach an MRCA (most recent common ancestor).
Here is the model:
We assume the population size is constant at n. Generations are discrete and non-overlapping. The genealogy is formed by this random process: in each generation, each individual chooses two parents at random from the previous generation. The choices are made randomly and equally likely over the n possibilities, and each individual chooses twice. All choices are made independently. Thus, for example, it is possible that, when an individual chooses his two parents, he chooses the same individual twice, so that in fact he ends up with just one parent; this happens with probability 1/n.
I don't understand the specific step at the beginning of this post or why I need to do it, but my teacher said I do. I don't know if this helps, but the next step is choosing parents for the first person and then combining the lists from the step I posted with a previous step. It looks like this:
sample(1:5, 2, replace=T)
#[1] 1 2
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[1]]) # To my understanding, l_dtemp[[1]] now receives the list of descendants from l_d[[1]], because the latter chose l_dtemp[[1]] as its first parent
l_dtemp[[2]]<-union(l_dtemp[[2]], l_d[[1]]) # Same as above, but for l_d[[1]]'s second choice, which is l_dtemp[[2]]
sample(1:5, 2, replace=T)
#[1] 1 3
l_dtemp[[1]]<-union(l_dtemp[[1]], l_d[[2]])
l_dtemp[[3]]<-union(l_dtemp[[3]], l_d[[2]])
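And a rough sketch of how these manual steps might be wrapped in a loop over one whole generation (the names l_d and l_dtemp come from the question; the initialisation of l_d and everything else here is my assumption about what the exercise intends, not code from the thread):
n <- 5
l_d     <- as.list(1:n)                              # assumed start: each individual is its own "descendant"
l_dtemp <- replicate(n, vector(), simplify = FALSE)  # accumulators for the parents' generation

for (k in 1:n) {
  parents <- sample(1:n, 2, replace = TRUE)  # individual k picks two parents (possibly the same one)
  for (p in parents) {
    l_dtemp[[p]] <- union(l_dtemp[[p]], l_d[[k]])  # parent p inherits k's descendant list
  }
}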

Generating testing and training datasets with replacement in R

I have mirrored some code to perform an analysis, and everything is working correctly (I believe). However, I am trying to understand a few lines of code related to splitting the data up into 40% testing and 60% training sets.
To my current understanding, the code randomly assigns each row to group 1 or 2. Subsequently, all the rows assigned to 1 are pulled into the training set, and the 2's into the testing set.
Later, I realized that sampling with replacement is not what I wanted for my data analysis, although in this case I am unsure of what is actually being replaced. Currently, I do not believe it is the actual data itself being replaced, but rather the "1" and "2" placeholders. I am looking to understand exactly how these lines of code work. Based on my results, it seems as if they are accomplishing what I want, but I need to confirm whether or not the data itself is being replaced.
To test the lines in question, I created a dataframe with 10 unique values (1 through 10).
If the data values themselves were being sampled with replacement, I would expect to see some duplicates in "training1" or "testing2". I ran these lines of code 10 times with 10 different set.seed numbers, and the data values were never duplicated. To me, this suggests the data itself is not being replaced.
If I set replace= FALSE I get this error:
Error in sample.int(x, size, replace, prob) :
cannot take a sample larger than the population when 'replace = FALSE'
set.seed(8)
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))
training1 <- df[test == 1, ]
testing2 <- df[test == 2, ]
I'd like to split my data 60/40 into training and testing, although I am not sure that this is actually happening. I think the prob argument is not doing what I think it should. I've noticed that prob does not actually split the data into exactly 60 percent and 40 percent. In the case of the n=10 example, it can result in 7 training and 2 testing, or even 6 training and 4 testing. With my actual larger dataset of ~n=2000+, it averages out to be pretty close to 60/40 (i.e., 60.3/39.7).
The way you are sampling is bound to result in an undesired/random split size unless the number of observations is huge; this is, formally, the law of large numbers. To make a more deterministic split, decide on the size/number of observations for the train data and use it to sample from nrow(df):
set.seed(8)
# for a 60/40 train/test split
train_indx <- sample(x = 1:nrow(df),
                     size = 0.6 * nrow(df),
                     replace = FALSE)
train_df <- df[train_indx, ]
test_df  <- df[-train_indx, ]
I recommend splitting the data based on Mankind_008's answer. Since I ran quite a bit of analysis based on the original code, I spent a few hours looking into what it does exactly.
The original code:
test <-sample(2, nrow(df), replace = TRUE, prob = c(.6,.4))
Answer From ( https://www.datacamp.com/community/tutorials/machine-learning-in-r ):
"Note that the replace argument is set to TRUE: this means that you assign a 1 or a 2 to a certain row and then reset the vector of 2 to its original state. This means that, for the next rows in your data set, you can either assign a 1 or a 2, each time again. The probability of choosing a 1 or a 2 should not be proportional to the weights amongst the remaining items, so you specify probability weights. Note also that, even though you don’t see it in the DataCamp Light chunk, the seed has still been set to 1234."
One of my main concerns was that the data values themselves were being replaced. Rather, it seems that sampling with replacement just allows the 1 and 2 placeholders to be assigned over and over again based on the probabilities.
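As a quick, purely illustrative check of the point above (the toy data frame with 10 unique values is the one described in the question):
df <- data.frame(value = 1:10)

set.seed(8)
# Sampling WITH replacement from the labels 1 and 2, once per row of df
test <- sample(2, nrow(df), replace = TRUE, prob = c(.6, .4))

table(test)                                    # roughly, but not exactly, 60/40
anyDuplicated(df[test == 1, , drop = FALSE])   # 0: no data row appears twice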

Forcing a discrete time series to be monotonically decreasing

I have a series of evaluations. Each evaluation can take discrete values ranging from 0 to 4. The series should decrease over time. However, since values are inserted manually, errors can happen.
Therefore, I would like to modify my series to be monotonically decreasing. Moreover, I would like to minimize the number of evaluations modified. Finally, if two or more series satisfy these criteria, I would choose the one with the higher overall sum of values.
E.g.
Recorded evaluation
4332422111
Ideal evaluation
4332222111
Recorded evaluation
4332322111
Ideal evaluation
4333322111
(In this case, 4332222111 would have satisfied the criteria too, but I chose the one with the higher values.)
I tried a brute-force approach: generating all possible combinations, selecting those that are monotonically decreasing, and finally comparing each of these with the recorded one.
However, a series could be as long as 20 evaluations, and there would be far too many combinations.
x1 <- c(4,3,3,2,4,2,2,1,1,1)
x2 <- c(4,3,3,2,3,2,2,1,1,1)
You could almost certainly break this algorithm, but here's a first try: replace locations where the value increased with NA, then fill them in from the previous location.
dfun <- function(x) {
  r <- replace(x, which(c(0, diff(x)) > 0), NA)
  zoo::na.locf(r)
}
dfun(x1)
dfun(x2)
This gives the "less-ideal" answer in the second case.
For the record, I also tried
dfun2 <- function(x) {
  s <- as.stepfun(isoreg(-x))
  -s(seq_along(x))
}
but this doesn't handle the first example as desired.
You could also try to do this with discrete programming (about which I know almost nothing), or a slightly more sophisticated form of brute force: use a stochastic algorithm that strongly penalizes non-monotonicity and weakly penalizes the distance from the initial sequence, e.g. optim(..., method = "SANN") with a candidate function that adds or subtracts 1 from an element at random, as sketched below.
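For what it's worth, here is a minimal sketch of that last idea; the penalty weights and the neighbour move are assumptions, and SANN is stochastic, so it may need several runs or a larger maxit to land on the preferred answer:
fit_monotone <- function(x, maxit = 20000) {
  obj <- function(y) {
    1000 * sum(pmax(diff(y), 0)) +  # strong penalty for any increase
      sum(y != x) -                 # weak penalty per modified evaluation
      1e-3 * sum(y)                 # tiny tie-break favouring higher totals
  }
  # Candidate generator: nudge one element up or down by 1, staying in 0..4
  neighbour <- function(y) {
    i <- sample(seq_along(y), 1)
    y[i] <- max(0, min(4, y[i] + sample(c(-1, 1), 1)))
    y
  }
  optim(x, obj, gr = neighbour, method = "SANN",
        control = list(maxit = maxit))$par
}

fit_monotone(c(4, 3, 3, 2, 3, 2, 2, 1, 1, 1))  # stochastic; rerun if needed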

How to find global maximum in R optimization with bounds

I have five variables, each with bounds, and I am investing some amount of money in each channel. My question is: is there an optimizer or some logic to find the global maximum value of the given functional form? The sum across channels should not exceed my total spend.
parameters=c(10,120,105,121,180,140) #intercept and variable coefficients
spend=c(16,120,180,170,180) # total spend
total=sum(spend)
upper_bound=c(50,200,250,220,250)
lower_bound=c(10,70,100,90,70)
var1=seq(lower_bound[1],upper_bound[1],by=1)
var2=seq(lower_bound[2],upper_bound[2],by=1)
var3=seq(lower_bound[3],upper_bound[3],by=1)
var4=seq(lower_bound[4],upper_bound[4],by=1)
var5=seq(lower_bound[5],upper_bound[5],by=1)
The functional form is exp(beta_0 - beta_i / x_i) for each channel i.
I have used the expand.grid function to enumerate the possible combinations, but I am getting too many of them.
Here is my code.
seq_data <- expand.grid(var1 = var1, var2 = var2, var3 = var3,
                        var4 = var4, var5 = var5)
rs <- rowSums(seq_data)
seq_data <- seq_data[rs <= total, ]
seq_data1 <- seq_data
for (i in 1:length(seq_data))
  seq_data1[, i] <- exp(parameters[1] - parameters[i + 1] / seq_data1[, i])
How can I overcome this problem? Please suggest any other alternatives.
Thanks in advance.
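For illustration only, here is a minimal sketch of a continuous alternative to the grid search, assuming the quantity to maximise is the sum over channels of exp(beta_0 - beta_i / x_i) subject to the box bounds and sum(x) <= total (that reading of the functional form is an assumption). constrOptim() handles these linear constraints, though like any local optimiser it only guarantees a local maximum; since the objective is increasing in each x_i, the budget constraint should bind at the solution.
# Reuses parameters, lower_bound, upper_bound and total defined above.
# constrOptim() minimises, so we negate the objective.
obj <- function(x) -sum(exp(parameters[1] - parameters[-1] / x))

# Linear constraints ui %*% x - ci >= 0:
#   x_i >= lower_bound_i,  -x_i >= -upper_bound_i,  -sum(x) >= -total
ui <- rbind(diag(5), -diag(5), rep(-1, 5))
ci <- c(lower_bound, -upper_bound, -total)

start <- lower_bound + 1   # strictly feasible starting point
res <- constrOptim(start, obj, grad = NULL, ui = ui, ci = ci)

res$par      # suggested spend per channel
-res$value   # objective value at that spend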

lda.collapsed.gibbs.sampler model and top words ranking

I have a model generated by the function lda.collapsed.gibbs.sampler, from the lda package, and I need to know the "relevance" of the top words.
When using
top.topic.words(result$topics, 10, by.score=TRUE)
I get a list of the top 10 words for each topic, but I'd like to see the percentage of the topic that those 10 words represent. I guess the information exists, because there is a "score", but I'm not really familiar with the statistical methods behind the Gibbs sampler.
Thanks in advance!
I think something like this may be what you want:
for (ii in 1:nrow(result$topics)) {
  print(
    head(
      cumsum(
        sort(result$topics[ii, ], decreasing = TRUE)
      ),
      n = 20
    ) / result$topic_sums[ii]
  )
}
Let's break it down. If you want the fraction of Gibbs assignments, then that is easy. The LDA routine returns the number of assignments to each (word, topic) pair. So all you have to do is sort each row of the result$topics to get the top words (this is essentially what top.topic.words does if you set by.score=FALSE). Once you have it in sorted order you can just see, for each topic, how many counts occur for that word versus for the entire topic. To do that I divide by result$topic_sums which contains the total number of assignments of that topic. Finally, I use cumsum so you can see the running total weight for words in that topic.
