How to optimize for loops in extremely large dataframe - r

I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber/integer, compdate/integer and judge/character,, representing individual cases completed in an administrative court. The data was imported from a stata dataset and the date field came in as integer, which is fine for my purposes. I want to create the caseload variable by calculating the number of cases completed by the judge within the 30 day window of the completion date of the case at issue.
here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. So for each record I'm taking a subset of the data for that particular judge and then subsetting the cases decided in the 30 day window, and then assigning the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
for(i in 1:length(x$idnumber)){
e<-x$compdate[i]
f<-e-29
a<-x[x$judge==x$judge[i] & !is.na(x$compdate),]
b<-a[a$compdate<=e & a$compdate>=f,]
x$caseload[i]<-length(b$idnumber)
}
It is working but it is taking extremely long to complete. How can I optimize this or do this easier. Sorry I'm very new to r and to programming -- I'm a law professor trying to analyze court data.... Your help is appreciated. Thanks.
Ken

You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x<-x[order(x$judge,x$compdate),]
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
x$workload<-unlist(by(x$compdate,x$judge,rolling.window)))

I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using #nograpes simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[,lookbk:=sapply(1:.N,function(i){
z <- compdate[i]-compdate[i:1]
older <- which(z>30)
if (length(older)) min(older)-1L else as(NA,'integer')
}),by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[,wload:=list(sapply(1:.N,function(i)
nrun[i]-ifelse(is.na(lookbk[i]),0,nrun[i-lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)

Related

Rolling subset of data frame within for loop in R

Big picture explanation is I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based off other biological factors) for two years (2014 and 2015) with one value of PAR per day. See below the few first lines of the data frame (data frame name is "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two week windows (14 rows) from start to finish sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14 and the second window would include rows 8 to 21 and so forth. After subsetting, the data needs to be flipped in structure (currently using the melt function in the reshape2 package) so that the values of the PAR data are in one column and the variable of par14 or par15 is in the other column. Then I need to get rid of the NaN data and finally perform a wilcox rank sum test on each window comparing PAR by the variable year (par14 or par15). Below is the code I wrote to prove the concept of what I wanted and for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139) I got errors every which way I ran it. Additionally, this loop doesn't even take into account the sliding by one week aspect. I figured if I could just figure out how to get windows and run analysis via a loop first then I could try to parse through the sliding part. Basically I realize that what I explained I wanted and what I wrote this for loop to do are slightly different. The code below is sliding row by row or on a one day basis. I would greatly appreciate if the solution encompassed the sliding by a week aspect. I am fairly new to R and do not have extensive experience with for loops so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
par.sub=rollingpar[[i]:[i]+13, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
save.sub=wilcox.test(value~variable, par.sub)
for (j in 1:length(save.sub)){
wilcoxvalues$p.value[j]=save.sub$p.value
}
}
If anyone has a much better way to do this through a different package or function that I am unaware of I would love to be enlightened. I did try roll apply but ran into problems with finding a way to apply it to an entire data frame and not just one column. I have searched for assistance from the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated to a frustrated grad student :) and if I did not provide enough information please let me know.
Consider an lapply using a sequence of every 7 values through 365 days of year (last day not included to avoid single day in last grouping), all to return a dataframe list of Wilcox test p-values with Week indicator. Then later row bind each list item into final, single dataframe:
library(reshape2)
slidingWindow <- seq(1,364,by=7)
slidingWindow
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127
# [20] 134 141 148 155 162 169 176 183 190 197 204 211 218 225 232 239 246 253 260
# [39] 267 274 281 288 295 302 309 316 323 330 337 344 351 358
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
par.sub=rollingpar[i:(i+13), ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
data.frame(week=paste0("Week: ", i%/%7+1, "-", i%/%7+2),
p.values=wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)

Optimizing the sum of a variable in R given a constraint

Using the following dataset:
ID=c(1:24)
COST=c(85,109,90,104,107,87,99,95,82,112,105,89,101,93,111,83,113,81,97,97,91,103,86,108)
POINTS=c(113,96,111,85,94,105,105,95,107,88,113,100,96,89,89,93,100,92,109,90,101,114,112,109)
mydata=data.frame(ID,COST,POINTS)
I need a R function that will consider all combinations of rows where the sum of 'COST' is less than a fixed value - in this case, $500 - and make the optimal selection based on the summed 'POINTS'.
Your help is appreciated.
So since this post is still open I thought I would give my solution. These kinds of problems are always fun. So, you can try to brute force the solution by checking all possible combinations (some 2^24, or over 16 million) one by one. This could be done by considering that for each combination, a value is either in it or not. Thinking in binary you could use the follow function code which was inspired by this post:
#DO NOT RUN THIS CODE
for(i in 1:2^24)
sum_points[i]<-ifelse(sum(as.numeric((intToBits(i)))[1:24] * mydata$COST) < 500,
sum(as.numeric((intToBits(i)))[1:24] * mydata$POINTS),
0)
I estimate this would take many hours to run. Improvements could be made with parallelization, etc, but still this is a rather intense calculation. This method will also not scale very well, as an increase by 1 (to 25 different IDs now) will double the computation time. Another option would be to cheat a little. For example, we know that we have to stay under $500. If we added up the n cheapest items, at would n would we definitely be over $500?
which(cumsum(sort(mydata$COST))>500)
[1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
So any more than 5 IDs chosen and we are definitely over $500. What else.
Well we can run a little code and take the max for that portion and see what that tells us.
sum_points<-1:10000
for(i in 1:10000)
sum_points[i]<-ifelse(sum(as.numeric((intToBits(i)))[1:24]) <6,
ifelse(sum(as.numeric((intToBits(i)))[1:24] * mydata$COST) < 500,
sum(as.numeric((intToBits(i)))[1:24] * mydata$POINTS),
0),
0)
sum_points[which.max(sum_points)]
[1] 549
So we have to try to get over 549 points with the remaining 2^24 - 10000 choices. But:
which(cumsum(rev(sort(mydata$POINTS)))<549)
[1] 1 2 3 4
Even if we sum the 4 highest point values, we still dont beat 549, so there is no reason to even search those. Further, the number of choices to consider must be greater than 4, but less than 6. My gut feeling tells me 5 would be a good number to try. Instead of looking at all 16 millions choices, we can just look at all of the ways to make 5 out of 24, which happens to be 24 choose 5:
num<-1:choose(24,5)
combs<-combn(24,5)
sum_points<-1:length(num)
for(i in num)
sum_points[i]<-ifelse(sum(mydata[combs[,i],]$COST) < 500,
sum(mydata[combs[,i],]$POINTS),
0)
which.max(sum_points)
[1] 2582
sum_points[2582]
[1] 563
We have a new max on the 2582nd iteration. To retrieve the IDs:
mydata[combs[,2582],]$ID
[1] 1 3 11 22 23
And to verify that nothing went wrong:
sum(mydata[combs[,2582],]$COST)
[1] 469 #less than 500
sum(mydata[combs[,2582],]$POINTS)
[1] 563 #what we expected.

Test performing on counts

In R a dataset data1 that contains game and times. There are 6 games and times simply tells us how many time a game has been played in data1. So head(data1) gives us
game times
1 850
2 621
...
6 210
Similar for data2 we get
game times
1 744
2 989
...
6 711
And sum(data1$times) is a little higher than sum(data2$times). We have about 2000 users in data1 and about 1000 users in data2 but I do not think that information is relevant.
I want to compare the two datasets and see if there is a statistically difference and which game "causes" that difference.
What test should I use two compare these. I don't think Pearson's chisq.test is the right choice in this case, maybe wilcox.test is the right to chose ?

Any easy way to subset my consecutive data?

One of my data attributes is "Time",from o seconds to 8640 seconds (consecutive), my professor ask me to analysis my data (0-5 mins) (5-10 mins) (10 -15 mins) separately. I know the basic code to subset my data, like:
data<-data[which(data$Time<=300&data$>=0),]
But if I do in this way, I have to repeat again again and again... I am thinking, can I have an easy way to solve this problem? Anyone can help me? Thank you very much!!
Let us assume that your data set is as following, without loss of generality:
time <- seq(0,8640)
define your unit blocks of time as (you are looking at units of 5 minutes)
blocks <- seq(1, length(time), 300)
and then
groups <- cbind(time, cut(time, blocks))
accompanies each row in time with the corresponding block of 5 minutes (variable V2) it corresponds to. In fact:
R: set.seed(1234)
R: groups[sample(nrow(groups), 5), ]
time V2
983 983 4
5377 5377 18
5263 5263 18
5385 5385 18
7435 7435 25
Once so, you can perform any sort of operation of your data set grouping by the variable V2.

How to convert values into percentage in r software

So this is the question.
Suppose you track your commute times for two weeks (10 days) and you find the following times in minutes
17 16 20 24 22 15 21 15 17 22
Suppose that the ‘24’ was a mistake, and it should have been 18. Write a code that fixes this, i.e. changing ‘24’ to ‘18’. Then compute for the new mean and standard deviation of the commute times.
Write a code which counts the number of instances that the commute time is at least 20 minutes. Then convert this into a percentage.
This is my solution for Q3 when I ran this code. I want to ask anybody if my solution is correct?
commute <- c(17,16,20,24,22,15,21,15,17,22)
commute[commute==24] <- 18
n <- length(commute)
sum((commute>=20)/n)
#[1] **0.4**
to complete the answer of the user20650, you could use a string formatted command to correctly display the outcome as a percentage as requested:
sprintf("%0.2f%%",100* mean(commute>=20))
[1] "40.00%"

Resources