Optimizing the sum of a variable in R given a constraint - r

Using the following dataset:
ID=c(1:24)
COST=c(85,109,90,104,107,87,99,95,82,112,105,89,101,93,111,83,113,81,97,97,91,103,86,108)
POINTS=c(113,96,111,85,94,105,105,95,107,88,113,100,96,89,89,93,100,92,109,90,101,114,112,109)
mydata=data.frame(ID,COST,POINTS)
I need a R function that will consider all combinations of rows where the sum of 'COST' is less than a fixed value - in this case, $500 - and make the optimal selection based on the summed 'POINTS'.
Your help is appreciated.

So since this post is still open I thought I would give my solution. These kinds of problems are always fun. So, you can try to brute force the solution by checking all possible combinations (some 2^24, or over 16 million) one by one. This could be done by considering that for each combination, a value is either in it or not. Thinking in binary you could use the follow function code which was inspired by this post:
#DO NOT RUN THIS CODE
for(i in 1:2^24)
sum_points[i]<-ifelse(sum(as.numeric((intToBits(i)))[1:24] * mydata$COST) < 500,
sum(as.numeric((intToBits(i)))[1:24] * mydata$POINTS),
0)
I estimate this would take many hours to run. Improvements could be made with parallelization, etc, but still this is a rather intense calculation. This method will also not scale very well, as an increase by 1 (to 25 different IDs now) will double the computation time. Another option would be to cheat a little. For example, we know that we have to stay under $500. If we added up the n cheapest items, at would n would we definitely be over $500?
which(cumsum(sort(mydata$COST))>500)
[1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
So any more than 5 IDs chosen and we are definitely over $500. What else.
Well we can run a little code and take the max for that portion and see what that tells us.
sum_points<-1:10000
for(i in 1:10000)
sum_points[i]<-ifelse(sum(as.numeric((intToBits(i)))[1:24]) <6,
ifelse(sum(as.numeric((intToBits(i)))[1:24] * mydata$COST) < 500,
sum(as.numeric((intToBits(i)))[1:24] * mydata$POINTS),
0),
0)
sum_points[which.max(sum_points)]
[1] 549
So we have to try to get over 549 points with the remaining 2^24 - 10000 choices. But:
which(cumsum(rev(sort(mydata$POINTS)))<549)
[1] 1 2 3 4
Even if we sum the 4 highest point values, we still dont beat 549, so there is no reason to even search those. Further, the number of choices to consider must be greater than 4, but less than 6. My gut feeling tells me 5 would be a good number to try. Instead of looking at all 16 millions choices, we can just look at all of the ways to make 5 out of 24, which happens to be 24 choose 5:
num<-1:choose(24,5)
combs<-combn(24,5)
sum_points<-1:length(num)
for(i in num)
sum_points[i]<-ifelse(sum(mydata[combs[,i],]$COST) < 500,
sum(mydata[combs[,i],]$POINTS),
0)
which.max(sum_points)
[1] 2582
sum_points[2582]
[1] 563
We have a new max on the 2582nd iteration. To retrieve the IDs:
mydata[combs[,2582],]$ID
[1] 1 3 11 22 23
And to verify that nothing went wrong:
sum(mydata[combs[,2582],]$COST)
[1] 469 #less than 500
sum(mydata[combs[,2582],]$POINTS)
[1] 563 #what we expected.

Related

Specify multiple conditions in long form data in R

How do I index rows I need by with specifications?
id<-c(65,65,65,65,65,900,900,900,900,900,900,211,211,211,211,211,211,211,45,45,45,45,45,45,45)
age<-c(19,22,23,24,25,21,26,31,32,37,38,22,23,25,28,29,31,32,30,31,36,39,42,44,48)
stat<-c('intern','reg','manage1','left','reg','manage1','manage2','left','reg',
'reg','left','intern','left','intern','reg','left','reg','manage1','reg','left','intern','manage1','left','reg','manage2')
mydf<-data.frame(id,age,stat)
I need to create 5 variables:
m01time & m12time: measure the amount of years elapsed before becoming a level1 manager (manage1), and then since manage1 to manage2 regardless of whether or not it's at the same job. (numeric in years)
change: capture whether or not they experienced a job change between manage1 and manage2 (if 'left' happens somewhere in between manage1 and manage2), (0 or 1)
& 4: m1p & m2p: capture the position before becoming manager1 and manager2 (intern, reg, or manage1).
There's a lot of information I don't need here that I am not sure how to ignore (all the jobs 211 went through before going to one where they become a manager).
The end result should look something like this:
id m01time m02time change m1p m2p
1 65 4 NA NA reg <NA>
2 900 NA 5 0 <NA> manage1
3 211 1 NA NA reg <NA>
4 45 3 9 1 intern reg
I tried to use ifelse with lag() and lead() to capture some conditions, but there are more for loop type of jobs (such as how to capture a "left" somewhere in between) that I am not sure what to do with.
I'd calculate the variables the first three variables differently than m1p and m2p. Maybe there's an elegant unified approach that I don't see at the moment.
So for the last position before manager you could do:
mydt <- data.table(mydf)
mydt[,.(m1p=stat[.I[stat=="manage1"]-1],
m2p=stat[.I[stat=="manage2"]-1]),by=id]
The other variables are more conveniently calculated in a wide data.format:
dt <- dcast(unique(mydt,by=c("id","stat")),
formula=id~stat,value.var="age")
dt[,.(m01time = manage1-intern,
m12time = manage2-manage1,
change = manage1<left & left<manage2)]
Two caveats:
reshaping might be quite costly larger data sets
I (over-)simplified your dummy data by ignoring duplicates of id and stat

To use the correct test for independence

I have two groups (data.frame) in R called good and bad which contain good users and bad users respectively.
The group good contains game_id which is the id for a computergame and number which is how many times this game has been played.
For example good$game_id we get 1 2 3 ... 20. We have 20 games.
Similar good$number we get 45214 1254 23 ... 8914 which is the number the game has been played. For example has game_id==1 been played 45214 times in group good.
Similar for bad.
We also have the same number of users in the two groups.
So for head(good,20) we get
game_id number
1 45214
2 1254
...
20 8914
I want to investigate if there is dependence between the number of times a fixed computergame has been played.
For game_id==1 I would try to use Pearson's Chi test for 'Independence'.
In R I type chisq.test(good[1,2], bad[1,2]) to see if there is indepence between good and bad for game_id==1 but I get an error message: x and y must have same levels.
How can this problem be solved ?

How to convert values into percentage in r software

So this is the question.
Suppose you track your commute times for two weeks (10 days) and you find the following times in minutes
17 16 20 24 22 15 21 15 17 22
Suppose that the ‘24’ was a mistake, and it should have been 18. Write a code that fixes this, i.e. changing ‘24’ to ‘18’. Then compute for the new mean and standard deviation of the commute times.
Write a code which counts the number of instances that the commute time is at least 20 minutes. Then convert this into a percentage.
This is my solution for Q3 when I ran this code. I want to ask anybody if my solution is correct?
commute <- c(17,16,20,24,22,15,21,15,17,22)
commute[commute==24] <- 18
n <- length(commute)
sum((commute>=20)/n)
#[1] **0.4**
to complete the answer of the user20650, you could use a string formatted command to correctly display the outcome as a percentage as requested:
sprintf("%0.2f%%",100* mean(commute>=20))
[1] "40.00%"

How to optimize for loops in extremely large dataframe

I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber/integer, compdate/integer and judge/character,, representing individual cases completed in an administrative court. The data was imported from a stata dataset and the date field came in as integer, which is fine for my purposes. I want to create the caseload variable by calculating the number of cases completed by the judge within the 30 day window of the completion date of the case at issue.
here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. So for each record I'm taking a subset of the data for that particular judge and then subsetting the cases decided in the 30 day window, and then assigning the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
for(i in 1:length(x$idnumber)){
e<-x$compdate[i]
f<-e-29
a<-x[x$judge==x$judge[i] & !is.na(x$compdate),]
b<-a[a$compdate<=e & a$compdate>=f,]
x$caseload[i]<-length(b$idnumber)
}
It is working but it is taking extremely long to complete. How can I optimize this or do this easier. Sorry I'm very new to r and to programming -- I'm a law professor trying to analyze court data.... Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x<-x[order(x$judge,x$compdate),]
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
x$workload<-unlist(by(x$compdate,x$judge,rolling.window)))
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using #nograpes simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[,lookbk:=sapply(1:.N,function(i){
z <- compdate[i]-compdate[i:1]
older <- which(z>30)
if (length(older)) min(older)-1L else as(NA,'integer')
}),by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[,wload:=list(sapply(1:.N,function(i)
nrun[i]-ifelse(is.na(lookbk[i]),0,nrun[i-lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)

Sorting data in R

I have a dataset that I need to sort by participant (RECORDING_SESSION_LABEL) and by trial_number. However, when I sort the data using R none of the sort functions I have tried put the variables in the correct numeric order that I want. The participant variable comes out ok but the trial ID variable comes out in the wrong order for what I need.
using:
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(trial_number)),]
Participant ID comes out as:
118 118 118 etc. 211 211 211 etc. 306 306 306 etc.(which is fine)
trial_number comes out as:
1 1 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 2 2 20 20 .... (which is not what I want - it seems to be sorting lexically rather than numerically)
What I would like is trial_number to be order like this within each participant number:
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 ....
I have checked that these variables are not factors and are numeric and also tried without the 'as.numeric', but with no joy. Looking around I saw suggestions that sort() and mixedsort() might do the trick in place of 'order', both come up with errors. I am slowly pulling my hair out over what I think should be a simple thing. Can anybody help shed some light on how to do this to get what I need?
Even though you claim it is not a factor, it does behave exactly as if it were a factor. Testing if something is a factor can be tricky since a factor is just an integer vector with a levels attribute and a class label. If it is a factor, your code needs to have a call to as.character() nested inside the as.numeric():
fix_rep[order(as.numeric(RECORDING_SESSION_LABEL), as.numeric(as.character(trial_number))),]
To be really sure if it's a factor, I recommend the str() function:
str(trial_number)
I think it may be worthwhile for you to design your own function in this case. It wouldn't be too hard, basically you could just design a bubble-sort algorithm with a few alterations. These alterations could change each number to a string, and begin by sorting those with different numbers of digits into different bins (easily done by finding which numbers, which are now strings, have the greatest numbers of indices). Then, in a similar fashion, the numbers in these bins could be sorted by converting the least significant digit to a numeric type and checking to see which are the largest/smallest. If you're interested, I could come up with some code for this, however, it looks like the two above me have beat me to the punch with some of the built-in functions. I've never used those functions, so I'm not sure if they'll work as you intend, but there's no use in reinventing the wheel.

Resources