Any easy way to subset my consecutive data? - r

One of my data attributes is "Time",from o seconds to 8640 seconds (consecutive), my professor ask me to analysis my data (0-5 mins) (5-10 mins) (10 -15 mins) separately. I know the basic code to subset my data, like:
data<-data[which(data$Time<=300&data$>=0),]
But if I do in this way, I have to repeat again again and again... I am thinking, can I have an easy way to solve this problem? Anyone can help me? Thank you very much!!

Let us assume that your data set is as following, without loss of generality:
time <- seq(0,8640)
define your unit blocks of time as (you are looking at units of 5 minutes)
blocks <- seq(1, length(time), 300)
and then
groups <- cbind(time, cut(time, blocks))
accompanies each row in time with the corresponding block of 5 minutes (variable V2) it corresponds to. In fact:
R: set.seed(1234)
R: groups[sample(nrow(groups), 5), ]
time V2
983 983 4
5377 5377 18
5263 5263 18
5385 5385 18
7435 7435 25
Once so, you can perform any sort of operation of your data set grouping by the variable V2.

Related

Rolling subset of data frame within for loop in R

Big picture explanation is I am trying to do a sliding window analysis on environmental data in R. I have PAR (photosynthetically active radiation) data for a select number of sequential dates (pre-determined based off other biological factors) for two years (2014 and 2015) with one value of PAR per day. See below the few first lines of the data frame (data frame name is "rollingpar").
par14 par15
1356.3242 1306.7725
NaN 1232.5637
1349.3519 505.4832
NaN 1350.4282
1344.9306 1344.6508
NaN 1277.9051
989.5620 NaN
I would like to create a loop (or any other way possible) to subset the data frame (both columns!) into two week windows (14 rows) from start to finish sliding from one window to the next by a week (7 rows). So the first window would include rows 1 to 14 and the second window would include rows 8 to 21 and so forth. After subsetting, the data needs to be flipped in structure (currently using the melt function in the reshape2 package) so that the values of the PAR data are in one column and the variable of par14 or par15 is in the other column. Then I need to get rid of the NaN data and finally perform a wilcox rank sum test on each window comparing PAR by the variable year (par14 or par15). Below is the code I wrote to prove the concept of what I wanted and for the first subsetted window it gives me exactly what I want.
library(reshape2)
par.sub=rollingpar[1:14, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
wilcox.test(value~variable, par.sub)
#when melt flips a data frame the columns become value and variable...
#for this case value holds the PAR data and variable holds the year
#information
When I tried to write a for loop to iterate the process through the whole data frame (total rows = 139) I got errors every which way I ran it. Additionally, this loop doesn't even take into account the sliding by one week aspect. I figured if I could just figure out how to get windows and run analysis via a loop first then I could try to parse through the sliding part. Basically I realize that what I explained I wanted and what I wrote this for loop to do are slightly different. The code below is sliding row by row or on a one day basis. I would greatly appreciate if the solution encompassed the sliding by a week aspect. I am fairly new to R and do not have extensive experience with for loops so I feel like there is probably an easy fix to make this work.
wilcoxvalues=data.frame(p.values=numeric(0))
Upar=rollingpar$par14
for (i in 1:length(Upar)){
par.sub=rollingpar[[i]:[i]+13, ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
save.sub=wilcox.test(value~variable, par.sub)
for (j in 1:length(save.sub)){
wilcoxvalues$p.value[j]=save.sub$p.value
}
}
If anyone has a much better way to do this through a different package or function that I am unaware of I would love to be enlightened. I did try roll apply but ran into problems with finding a way to apply it to an entire data frame and not just one column. I have searched for assistance from the many other questions regarding subsetting, for loops, and rolling analysis, but can't quite seem to find exactly what I need. Any help would be appreciated to a frustrated grad student :) and if I did not provide enough information please let me know.
Consider an lapply using a sequence of every 7 values through 365 days of year (last day not included to avoid single day in last grouping), all to return a dataframe list of Wilcox test p-values with Week indicator. Then later row bind each list item into final, single dataframe:
library(reshape2)
slidingWindow <- seq(1,364,by=7)
slidingWindow
# [1] 1 8 15 22 29 36 43 50 57 64 71 78 85 92 99 106 113 120 127
# [20] 134 141 148 155 162 169 176 183 190 197 204 211 218 225 232 239 246 253 260
# [39] 267 274 281 288 295 302 309 316 323 330 337 344 351 358
# LIST OF WILCOX P VALUES DFs FOR EACH SLIDING WINDOW (TWO-WEEK PERIODS)
wilcoxvalues <- lapply(slidingWindow, function(i) {
par.sub=rollingpar[i:(i+13), ]
par.sub=melt(par.sub)
par.sub=na.omit(par.sub)
par.sub$variable=as.factor(par.sub$variable)
data.frame(week=paste0("Week: ", i%/%7+1, "-", i%/%7+2),
p.values=wilcox.test(value~variable, par.sub)$p.value)
})
# SINGLE DF OF ALL P-VALUES
wilcoxdf <- do.call(rbind, wilcoxvalues)

Optimizing the sum of a variable in R given a constraint

Using the following dataset:
ID=c(1:24)
COST=c(85,109,90,104,107,87,99,95,82,112,105,89,101,93,111,83,113,81,97,97,91,103,86,108)
POINTS=c(113,96,111,85,94,105,105,95,107,88,113,100,96,89,89,93,100,92,109,90,101,114,112,109)
mydata=data.frame(ID,COST,POINTS)
I need a R function that will consider all combinations of rows where the sum of 'COST' is less than a fixed value - in this case, $500 - and make the optimal selection based on the summed 'POINTS'.
Your help is appreciated.
So since this post is still open I thought I would give my solution. These kinds of problems are always fun. So, you can try to brute force the solution by checking all possible combinations (some 2^24, or over 16 million) one by one. This could be done by considering that for each combination, a value is either in it or not. Thinking in binary you could use the follow function code which was inspired by this post:
#DO NOT RUN THIS CODE
for(i in 1:2^24)
sum_points[i]<-ifelse(sum(as.numeric((intToBits(i)))[1:24] * mydata$COST) < 500,
sum(as.numeric((intToBits(i)))[1:24] * mydata$POINTS),
0)
I estimate this would take many hours to run. Improvements could be made with parallelization, etc, but still this is a rather intense calculation. This method will also not scale very well, as an increase by 1 (to 25 different IDs now) will double the computation time. Another option would be to cheat a little. For example, we know that we have to stay under $500. If we added up the n cheapest items, at would n would we definitely be over $500?
which(cumsum(sort(mydata$COST))>500)
[1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
So any more than 5 IDs chosen and we are definitely over $500. What else.
Well we can run a little code and take the max for that portion and see what that tells us.
sum_points<-1:10000
for(i in 1:10000)
sum_points[i]<-ifelse(sum(as.numeric((intToBits(i)))[1:24]) <6,
ifelse(sum(as.numeric((intToBits(i)))[1:24] * mydata$COST) < 500,
sum(as.numeric((intToBits(i)))[1:24] * mydata$POINTS),
0),
0)
sum_points[which.max(sum_points)]
[1] 549
So we have to try to get over 549 points with the remaining 2^24 - 10000 choices. But:
which(cumsum(rev(sort(mydata$POINTS)))<549)
[1] 1 2 3 4
Even if we sum the 4 highest point values, we still dont beat 549, so there is no reason to even search those. Further, the number of choices to consider must be greater than 4, but less than 6. My gut feeling tells me 5 would be a good number to try. Instead of looking at all 16 millions choices, we can just look at all of the ways to make 5 out of 24, which happens to be 24 choose 5:
num<-1:choose(24,5)
combs<-combn(24,5)
sum_points<-1:length(num)
for(i in num)
sum_points[i]<-ifelse(sum(mydata[combs[,i],]$COST) < 500,
sum(mydata[combs[,i],]$POINTS),
0)
which.max(sum_points)
[1] 2582
sum_points[2582]
[1] 563
We have a new max on the 2582nd iteration. To retrieve the IDs:
mydata[combs[,2582],]$ID
[1] 1 3 11 22 23
And to verify that nothing went wrong:
sum(mydata[combs[,2582],]$COST)
[1] 469 #less than 500
sum(mydata[combs[,2582],]$POINTS)
[1] 563 #what we expected.

How to convert values into percentage in r software

So this is the question.
Suppose you track your commute times for two weeks (10 days) and you find the following times in minutes
17 16 20 24 22 15 21 15 17 22
Suppose that the ‘24’ was a mistake, and it should have been 18. Write a code that fixes this, i.e. changing ‘24’ to ‘18’. Then compute for the new mean and standard deviation of the commute times.
Write a code which counts the number of instances that the commute time is at least 20 minutes. Then convert this into a percentage.
This is my solution for Q3 when I ran this code. I want to ask anybody if my solution is correct?
commute <- c(17,16,20,24,22,15,21,15,17,22)
commute[commute==24] <- 18
n <- length(commute)
sum((commute>=20)/n)
#[1] **0.4**
to complete the answer of the user20650, you could use a string formatted command to correctly display the outcome as a percentage as requested:
sprintf("%0.2f%%",100* mean(commute>=20))
[1] "40.00%"

Mismatching drawdown calculations

I would like to ask you to clarify the next question, which is of extreme importance to me, since a major part of my master's thesis relies on properly implementing the data calculated in the following example.
I hava a list of financial time series, which look like this (AUDUSD example):
Open High Low Last
1992-05-18 0.7571 0.7600 0.7565 0.7598
1992-05-19 0.7594 0.7595 0.7570 0.7573
1992-05-20 0.7569 0.7570 0.7548 0.7562
1992-05-21 0.7558 0.7590 0.7540 0.7570
1992-05-22 0.7574 0.7585 0.7555 0.7576
1992-05-25 0.7575 0.7598 0.7568 0.7582
From this data I calculate log returns for the column Last to obtain something like this
Last
1992-05-19 -0.0032957646
1992-05-20 -0.0014535847
1992-05-21 0.0010573620
1992-05-22 0.0007922884
Now I want to calculate the drawdowns in the above presented time series, which I achieve by using (from package PerformanceAnalytics)
ddStats <- drawdownsStats(timeSeries(AUDUSDLgRetLast[,1], rownames(AUDUSDLgRetLast)))
which results in the following output (here are just the first 5 lines, but it returns every single drawdown, including also one day long ones)
From Trough To Depth Length ToTrough Recovery
1 1996-12-03 2001-04-02 2007-07-13 -0.4298531511 2766 1127 1639
2 2008-07-16 2008-10-27 2011-04-08 -0.4003839141 713 74 639
3 2011-07-28 2014-01-24 2014-05-13 -0.2254426369 730 652 NA
4 1992-06-09 1993-10-04 1994-12-06 -0.1609854215 650 344 306
5 2007-07-26 2007-08-16 2007-09-28 -0.1037999707 47 16 31
Now, the problem is the following: The depth of the worst drawdown (according to the upper output) is -0.4298, whereas if I do the following calculations "by hand" I obtain
(AUDUSD[as.character(ddStats[1,1]),4]-AUDUSD[as.character(ddStats[1,2]),4])/(AUDUSD[as.character(ddStats[1,1]),4])
[1] 0.399373
To make things clearer, this are the two lines from the AUDUSD dataframe for from and through dates:
AUDUSD[as.character(ddStats[1,1]),]
Open High Low Last
1996-12-03 0.8161 0.8167 0.7845 0.7975
AUDUSD[as.character(ddStats[1,2]),]
Open High Low Last
2001-04-02 0.4858 0.4887 0.4773 0.479
Also, the other drawdown depts do not agree with the calculations "by hand". What I am missing? How come that this two numbers, which should be the same, differ for a substantial amount?
I have tried replicating the drawdown via:
cumsum(rets) -cummax(cumsum(rets))
where rets is the vector of your log returns.
For some reason when I calculate Drawdowns that are say less than 20% I get the same results as table.Drawdowns() & drawdownsStats() but when there is a large difference say drawdowns over 35%, then the Max Drawdown begin to diverge between calculations. More specifically the table.Drawdowns() & drawdownsStats() are overstated (at least what i noticed). I do not know why this is so, but perhaps what might help is if you use an confidence interval for large drawdowns (those over 35%) by using the Standard error of the drawdown. I would use: 0.4298531511/sqrt(1127) which is the max drawdown/sqrt(depth to trough). This would yield a +/- of 0.01280437 or a drawdown of 0.4169956 to 0.4426044 respectively, which the lower interval of 0.4169956 is much closer to you "by-hand" calculation of 0.399373. Hope it helps.

How to optimize for loops in extremely large dataframe

I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber/integer, compdate/integer and judge/character,, representing individual cases completed in an administrative court. The data was imported from a stata dataset and the date field came in as integer, which is fine for my purposes. I want to create the caseload variable by calculating the number of cases completed by the judge within the 30 day window of the completion date of the case at issue.
here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with. So for each record I'm taking a subset of the data for that particular judge and then subsetting the cases decided in the 30 day window, and then assigning the length of a vector in the subsetted dataframe to the caseload variable for the subject case, as follows:
for(i in 1:length(x$idnumber)){
e<-x$compdate[i]
f<-e-29
a<-x[x$judge==x$judge[i] & !is.na(x$compdate),]
b<-a[a$compdate<=e & a$compdate>=f,]
x$caseload[i]<-length(b$idnumber)
}
It is working but it is taking extremely long to complete. How can I optimize this or do this easier. Sorry I'm very new to r and to programming -- I'm a law professor trying to analyze court data.... Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x<-x[order(x$judge,x$compdate),]
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
x$workload<-unlist(by(x$compdate,x$judge,rolling.window)))
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using #nograpes simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[,lookbk:=sapply(1:.N,function(i){
z <- compdate[i]-compdate[i:1]
older <- which(z>30)
if (length(older)) min(older)-1L else as(NA,'integer')
}),by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[,wload:=list(sapply(1:.N,function(i)
nrun[i]-ifelse(is.na(lookbk[i]),0,nrun[i-lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)

Resources