Efficient (repeat) looping - r

I am trying to evaluate whether a price, price(k), in a given row k is equal to the one above it, price(k-1). If it is, I want to sum the volumes of the prior row and the row in question, volume(k-1)+volume(k), and then remove the row with the duplicate price, row k.
I have the following repeat loop which I am applying to a large dataset looking to delete repeated values.
k <- 1
repeat{
if( Prices$Price[ k + 1 ] == Prices$Price[ k ] ){
Prices$CumVolume[ k + 1 ] <- Prices$CumVolume[ k + 1 ] + Prices$CumVolume[ k ]
Prices <- Prices[ -k , ]
k <- k + 1
if( k > nrow( Prices ) ) break
}
}
The loop is very slow and I was wondering if there are ways to speed it up. Unfortunately I am relatively new to R and am having difficulty working out the best way to go about this.
Also, is there a way in R to observe which iteration the loop is currently up to, i.e. have it displayed in the workspace on each iteration?
Example data:
Date Time Price CumVolume Ret MeanRet VolRet
26 01-JAN-2009 21:30:01.783 96.660 537 0 0 0
31 01-JAN-2009 21:30:58.041 96.650 78 0 0 0
33 01-JAN-2009 21:34:09.589 96.640 60 0 0 0
35 01-JAN-2009 21:34:10.879 96.640 40 0 0 0
37 01-JAN-2009 21:35:55.001 96.635 50 0 0 0

It appears you want something like this:
DF <- read.table(text=" Date Time Price CumVolume Ret MeanRet VolRet
26 01-JAN-2009 21:30:01.783 96.660 537 0 0 0
31 01-JAN-2009 21:30:58.041 96.650 78 0 0 0
33 01-JAN-2009 21:34:09.589 96.640 60 0 0 0
35 01-JAN-2009 21:34:10.879 96.640 40 0 0 0
37 01-JAN-2009 21:35:55.001 96.635 50 0 0 0", header=TRUE)
#create a run id
DF$runs <- cumsum(c(TRUE, diff(DF$Price) != 0))
#sum per each price run
DF$CCVolume <- with(DF, ave(CumVolume, runs, FUN=sum))
#keep the first row of each run (duplicated(DF$Price) would also drop a price that recurs in a later, non-adjacent run)
DF[!duplicated(DF$runs), ]
# Date Time Price CumVolume Ret MeanRet VolRet runs CCVolume
#26 01-JAN-2009 21:30:01.783 96.660 537 0 0 0 1 537
#31 01-JAN-2009 21:30:58.041 96.650 78 0 0 0 2 78
#33 01-JAN-2009 21:34:09.589 96.640 60 0 0 0 3 100
#37 01-JAN-2009 21:35:55.001 96.635 50 0 0 0 4 50
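If the summed volume should overwrite CumVolume in place, as the loop in the question does, the same run-id idea can be wrapped in a small helper. This is only a sketch: the name collapse_runs is made up for illustration, it assumes the columns are called Price and CumVolume and that the data frame has at least one row, and it keeps the first row of each run (the loop in the question keeps the last).

```r
# Hypothetical helper: merge consecutive rows that share a Price,
# summing CumVolume within each run and keeping the first row of the run.
collapse_runs <- function(d) {
  # TRUE wherever the price differs from the row above; cumsum turns that into a run id
  runs <- cumsum(c(TRUE, diff(d$Price) != 0))
  # total volume per run, in run order
  vol <- tapply(d$CumVolume, runs, sum)
  out <- d[!duplicated(runs), ]
  out$CumVolume <- as.vector(vol)
  out
}

# Made-up prices: 96.640 appears in two separate runs (rows 3-4 and row 6)
Prices <- data.frame(
  Price     = c(96.660, 96.650, 96.640, 96.640, 96.635, 96.640),
  CumVolume = c(537, 78, 60, 40, 50, 10)
)
res <- collapse_runs(Prices)
res
```

Because everything is done with vectorised calls, the data frame is subset once instead of once per duplicate, which is where the original loop spends most of its time.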

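I think your code goes into an infinite loop because both the increment k <- k + 1 and the break sit inside the if condition, so k never advances when two neighbouring prices differ. I hope this is what you want (note that it merges all rows sharing a price, not only adjacent ones):
z <- unique(Prices$Price)
for (i in seq_along(z)) {
  dupindex <- which(Prices$Price == z[i])
  # nothing to merge when the price occurs only once
  if (length(dupindex) > 1) {
    # store the total on the last occurrence, then drop the earlier ones
    Prices$CumVolume[tail(dupindex, n = 1)] <- sum(Prices$CumVolume[dupindex])
    Prices <- Prices[-head(dupindex, -1), ]
  }
}
I hope it helps, thanks.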

Related

Nested xtab tables

I would like to produce nested tables for a multilevel factorial experiment. I have 10 paints examined for time to reach an end point under 4 levels of humidity, 3 temperatures and 2 wind speeds. Of course I have searched online, but without success.
Some sample code can be generated using:
## Made Up Data # NB the data is continuous whereas observations were made 40/168 so data is censored.
time3 <- 4*seq(1:24) # Dependent: times in hrs, runif is not really representative but will do
wind <- c(1,2) # Independent: factor draught on or off
RH <- c(0,35,75,95) # Independent: value for RH but can be processes as a factor
temp <- c(5,11,20) # Independent: value for temperature but can be processed as a factor
paint <- c("paintA", "paintB", "paintC") # Independent: Experimental material
# Combine into dataframe
dfa <- data.frame(rep(temp,8))
dfa$RH <- rep(RH,6)
dfa$wind <- rep(wind,12)
dfa$time3 <- time3
dfa$paint <- rep(paint[1],24)
# Replicate for different paints
dfb <- dfa
dfb$paint <- paint[2]
dfc <- dfa
dfc$paint <- paint[3]
dfx <- do.call("rbind", list(dfa,dfb,dfc))
# Rename first col
colnames(dfx)[1] <- "temp"
# Prepare xtab tables
tx <- xtabs(dfx$time3 ~ dfx$wind + dfx$RH + dfx$temp + dfx$paint)
tx
And the target I hope to obtain would be like this xtab example
This
tx <- xtabs(dfx$time3 ~ dfx$wind + dfx$RH + dfx$temp)
does not work well enough. I would also like to write to C:\file.csv for printing and reporting etc. Please advise on how to achieve the desired output.
You can paste the two variables you want to nest together. Since the items will be ordered lexicographically, you will need to zero-pad the temp variable, to get numerical ordering.
xtabs(time3~wind+paste(sprintf("%02d",temp),RH,sep=":")+paint,dfx)
, , paint = paintA
paste(sprintf("%02d", temp), RH, sep = ":")
wind 05:0 05:35 05:75 05:95 11:0 11:35 11:75 11:95 20:0 20:35 20:75 20:95
1 56 0 104 0 88 0 136 0 120 0 72 0
2 0 128 0 80 0 64 0 112 0 96 0 144
, , paint = paintB
paste(sprintf("%02d", temp), RH, sep = ":")
wind 05:0 05:35 05:75 05:95 11:0 11:35 11:75 11:95 20:0 20:35 20:75 20:95
1 56 0 104 0 88 0 136 0 120 0 72 0
2 0 128 0 80 0 64 0 112 0 96 0 144
, , paint = paintC
paste(sprintf("%02d", temp), RH, sep = ":")
wind 05:0 05:35 05:75 05:95 11:0 11:35 11:75 11:95 20:0 20:35 20:75 20:95
1 56 0 104 0 88 0 136 0 120 0 72 0
2 0 128 0 80 0 64 0 112 0 96 0 144
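Another option that may be closer to the nested layout asked for is ftable() from the stats package, which flattens a multi-way table with several variables nested in the rows and columns. The sketch below rebuilds the made-up data from the question in compact form. The choice of which variables go in rows and which in columns is only an assumption about the desired layout, and the output path is a placeholder in the temporary directory instead of C:\file.csv.

```r
# Rebuild the made-up data from the question in compact form
base <- data.frame(
  temp  = rep(c(5, 11, 20), 8),
  RH    = rep(c(0, 35, 75, 95), 6),
  wind  = rep(c(1, 2), 12),
  time3 = 4 * (1:24)
)
dfx <- do.call(rbind, lapply(c("paintA", "paintB", "paintC"),
                             function(p) data.frame(base, paint = p)))

# using data = dfx keeps the dimension names short (wind, RH, temp, paint)
tx <- xtabs(time3 ~ wind + RH + temp + paint, data = dfx)

# flatten: paint and wind nested in the rows, temp and RH nested in the columns
ft <- ftable(tx, row.vars = c("paint", "wind"), col.vars = c("temp", "RH"))
ft

# format() on an ftable returns a character matrix that includes the header
# cells, so it can be written out as comma-separated text
out_file <- file.path(tempdir(), "file.csv")  # placeholder path
write.table(format(ft, quote = FALSE), file = out_file, sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
```

Swapping the names in row.vars and col.vars changes the nesting order without touching the underlying xtabs result.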

r if else for loop with logical and boolean operators

I'm trying to use an if else statement to create a new column of binary data in my data frame, but what I get is all zeros...
command:
for(i in 1:nrow(asort)){
if(asort$recip==0 && asort$dist<.74){
asort$temp[i]<-0
} else{
asort$temp[i]<-1
}
}
#temp ends up being all 0's
In addition, I would actually like to test something along the lines of this:
# if the data in the recip column = 0 and the distance is < 0.74, OR if the distance is greater than 1.85, give me a zero, else 1
for(i in 1:nrow(asort)){
if(asort$recip==0 && asort$dist<.74 || asort$dist>1.85){
asort$temp[i]<-0
} else{
asort$temp[i]<-1
}
}
> head(asort)
coordinates CLASS_ID Flight UFID dist nnid nnid2 observed recip temp
157 (285293.3, 4426017) 0 F4_ F4_156 0.3857936 158 F4_157 0 0 0
158 (285293.2, 4426017) 0 F4_ F4_157 0.3857936 157 F4_156 0 0 0
259 (285255, 4426014) 0 F4_ F4_258 0.5226039 261 F4_260 1 0 0
261 (285255, 4426014) 0 F4_ F4_260 0.5226039 259 F4_258 1 0 0
402 (285275.3, 4426004) 0 F4_ F4_401 0.5598427 403 F4_402 1 0 0
403 (285275.6, 4426004) 0 F4_ F4_402 0.5598427 402 F4_401 1 0 0
Using df data.frame
dist <- runif(10, 0.3, 2)
recip<- c(0,1,1,0,1,0,1,0,0,1)
df <- data.frame(dist, recip)
and ifelse
df$temp<-ifelse(df$dist < 0.74 & df$recip == 0 , 0,
ifelse(df$dist > 1.85 & df$recip == 0, 0, 1))
> head(df)
# dist recip temp
#1 1.1878002 0 1
#2 0.4147835 1 1
#3 1.3707311 1 1
#4 0.9008034 0 1
#5 1.0220149 1 1
#6 1.9384069 0 0
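The loop in the question misbehaves because && and || expect single values while asort$recip and asort$dist are whole columns: older versions of R silently used only the first element, and recent versions raise an error. Note also that the nested ifelse above requires recip == 0 for the dist > 1.85 branch, which differs slightly from the condition as worded in the question. A sketch of that condition as literally stated, using the element-wise operators & and | on a small made-up data frame:

```r
# Small made-up data frame with the two columns used in the question
asort <- data.frame(
  recip = c(0, 0, 1, 1, 0, 1),
  dist  = c(0.50, 1.00, 0.50, 2.00, 1.90, 1.20)
)

# & and | work element by element; the parentheses make the grouping explicit
is_zero <- (asort$recip == 0 & asort$dist < 0.74) | asort$dist > 1.85

# 0 where the condition holds, 1 everywhere else
asort$temp <- ifelse(is_zero, 0, 1)
asort
```

No loop is needed, since the comparison and ifelse() both operate on the whole column at once.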

Product between two data.frames columns

I have two data.frames:
The first one is the coefficients of my regressions for each day:
> parametrosBase
beta0 beta1 beta4
2015-12-15 0.1622824 -0.012956819 -0.04637442
2015-12-16 0.1641884 -0.007914548 -0.06170213
2015-12-17 0.1623660 -0.005618474 -0.05914809
2015-12-18 0.1643263 0.005380472 -0.08533237
2015-12-21 0.1667710 0.003824588 -0.09040071
The second one is: the independent (x) variables:
> head(ir_dfSTORED)
ind m h0x h1x h4x beta0_h0x beta1_h1x beta4_h4x
1 2015-12-15 21 1 0.5642792 0.2859359 0 0 0
2 2015-12-15 42 1 0.3606713 0.2831963 0 0 0
3 2015-12-15 63 1 0.2550200 0.2334554 0 0 0
4 2015-12-15 84 1 0.1943071 0.1883048 0 0 0
5 2015-12-15 105 1 0.1561231 0.1544524 0 0 0
6 2015-12-15 126 1 0.1302597 0.1297947 0 0 0
> tail(ir_dfSTORED)
ind m h0x h1x h4x beta0_h0x beta1_h1x beta4_h4x
835 2015-12-21 2415 1 0.006799321 0.006799321 0 0 0
836 2015-12-21 2436 1 0.006740707 0.006740707 0 0 0
837 2015-12-21 2457 1 0.006683094 0.006683094 0 0 0
838 2015-12-21 2478 1 0.006626457 0.006626457 0 0 0
839 2015-12-21 2499 1 0.006570773 0.006570773 0 0 0
840 2015-12-21 2520 1 0.006516016 0.006516016 0 0 0
What I want is to multiply the beta0 column of "parametrosBase" by the h0x column of "ir_dfSTORED" and store the result in the beta0_h0x column, and the same for the others: beta1 and beta4.
The problem I'm facing is with the dates in the "ind" column. The multiplication has to respect the dates.
So, once the day changes in "ir_dfSTORED", I have to change to the same day in "parametrosBase".
For example:
The first row of the "parametrosBase" df,
2015-12-15 0.1622824 -0.012956819 -0.04637442
is fixed for the day 2015-12-15, and I do the product with it. Once I reach the day 2015-12-16, I have to use the second row of the "parametrosBase" df.
How can I do this?
Thanks a lot. :)
Maybe you should merge the two datasets first:
parametrosBase$ind <- rownames(parametrosBase)
df <- merge(ir_dfSTORED,parametrosBase)
df <- within(df,{
beta0_h0x <- beta0*h0x
beta1_h1x <- beta1*h1x
beta4_h4x <- beta4*h4x
})
Since I don't know the structure of the data, you may have to convert the dates from rownames to a date format in order for the merge to work. Using ind as the name of the date in parametrosBase is key to making merge work, otherwise you'll have to specify the variables to merge by.
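If the row order of ir_dfSTORED has to stay exactly as it is, a lookup with match() is an alternative to merging. The sketch below uses made-up miniature versions of the two data frames and assumes that the dates in ind and in the row names of parametrosBase are stored as identical character strings.

```r
# Made-up miniature versions of the two data frames from the question
parametrosBase <- data.frame(
  beta0 = c(0.16, 0.17),
  beta1 = c(-0.01, 0.02),
  beta4 = c(-0.05, -0.06),
  row.names = c("2015-12-15", "2015-12-16")
)
ir_dfSTORED <- data.frame(
  ind = c("2015-12-15", "2015-12-15", "2015-12-16"),
  h0x = c(1, 1, 1),
  h1x = c(0.5, 0.4, 0.3),
  h4x = c(0.2, 0.3, 0.1)
)

# for every row of ir_dfSTORED, the position of its date in parametrosBase
idx <- match(as.character(ir_dfSTORED$ind), rownames(parametrosBase))

# multiply each x column by the coefficient belonging to the same day
ir_dfSTORED$beta0_h0x <- parametrosBase$beta0[idx] * ir_dfSTORED$h0x
ir_dfSTORED$beta1_h1x <- parametrosBase$beta1[idx] * ir_dfSTORED$h1x
ir_dfSTORED$beta4_h4x <- parametrosBase$beta4[idx] * ir_dfSTORED$h4x
ir_dfSTORED
```

A date that is missing from parametrosBase gives NA in idx, so the corresponding products come out as NA instead of being dropped silently.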

mistake in multivePenal but not in frailtyPenal

The libraries used are:
library(survival)
library(splines)
library(boot)
library(frailtypack)
and the function used is in the frailtypack library.
In my data I have two recurrent events (delta.stable and delta.unstable) and one terminal event (delta.censor). There are some time-varying explanatory variables, such as the unemployment rate (u.rate), which is quarterly; that is why my dataset has been split by quarters.
Here is a link to the subsample used in the code below, in case it helps to find the mistake: https://www.dropbox.com/s/spfywobydr94bml/cr_05_males_services.rda
The problem is that it runs for a long time before the warning message appears.
The main variables of the survival function are as follows.
I have two recurrent events:
delta.unstable (unst.): takes the value one when the individual finds an unstable job.
delta.stable (stable): takes the value one when the individual finds a stable job.
And one terminal event:
delta.censor (d.censor): takes the value one when the individual has died, retired or emigrated.
row id contadorbis unst. stable d.censor .t0 .t
1 78 1 0 1 0 0 88
2 101 2 0 1 0 0 46
3 155 3 0 1 0 0 27
4 170 4 0 0 0 0 61
5 170 4 1 0 0 61 86
6 213 5 0 0 0 0 92
7 213 5 0 0 0 92 182
8 213 5 0 0 0 182 273
9 213 5 0 0 0 273 365
10 213 5 1 0 0 365 394
11 334 6 0 1 0 0 6
12 334 7 1 0 0 0 38
13 369 8 0 0 0 0 27
14 369 8 0 0 0 27 119
15 369 8 0 0 0 119 209
16 369 8 0 0 0 209 300
17 369 8 0 0 0 300 392
When I apply multivePenal I obtain the following message:
Error in aggregate.data.frame(as.data.frame(x), ...) :
arguments must have same length
In addition: Lost warning messages
In Surv(.t0, .t, delta.stable) : Stop time must be > start time, NA created
#### multivePenal function
fit.joint.05_malesP<-multivePenal(Surv(.t0,.t,delta.stable)~cluster(contadorbis)+terminal(as.factor(delta.censor))+event2(delta.unstable),formula.terminalEvent=~1, formula2=~as.factor(h.skill),data=cr_05_males_serv,Frailty=TRUE,recurrentAG=TRUE,cross.validation=FALSE,n.knots=c(7,7,7), kappa=c(1,1,1), maxit=1000, hazard="Splines")
I have checked whether Surv(.t0,.t,delta.stable) contains NA, and there are no NAs.
In addition, when I apply frailtyPenal to the same data for both possible combinations, the function runs well and I get results. I have spent a week looking at this without finding the key. I would appreciate any light on this problem.
#delta unstable+death
fit.joint.05_males<-frailtyPenal(Surv(.t0,.t,delta.unstable)~cluster(id)+u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(non.manual)+as.factor(municipio)+as.factor(spanish.speakers)+ as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+ as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+ as.factor(responsabilities)+
terminal(delta.censor),formula.terminalEvent=~u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(municipio)+as.factor(spanish.speakers)+as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+ as.factor(responsabilities),data=cr_05_males_services,n.knots=12,kappa1=1000,kappa2=1000,maxit=1000, Frailty=TRUE,joint=TRUE, recurrentAG=TRUE)
###Be patient. The program is computing ...
###The program took 2259.42 seconds
#delta stable+death
fit.joint.05_males<-frailtyPenal(Surv(.t0,.t,delta.stable)~cluster(id)+u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(non.manual)+as.factor(municipio)+as.factor(spanish.speakers)+as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+as.factor(responsabilities)+terminal(delta.censor),formula.terminalEvent=~u.rate+as.factor(h.skill)+as.factor(m.skill)+as.factor(municipio)+as.factor(spanish.speakers)+as.factor(no.spanish.speaker)+as.factor(Aged.16.19)+as.factor(Aged.20.24)+as.factor(Aged.25.29)+as.factor(Aged.30.34)+as.factor(Aged.35.39)+as.factor(Aged.40.44)+as.factor(Aged.45.51)+as.factor(older61)+as.factor(responsabilities),data=cr_05_males_services,n.knots=12,kappa1=1000,kappa2=1000,maxit=1000, Frailty=TRUE,joint=TRUE, recurrentAG=TRUE)
###The program took 3167.15 seconds
Because you provide neither information about the packages used nor the data necessary to run multivePenal or frailtyPenal, I can only help you with the Surv part (because I happened to have that package loaded).
The Surv warning message you provided (In Surv(.t0, .t, delta.stable) : Stop time must be > start time, NA created) suggests that something is strange with your variables .t0 (the time argument in Surv, referred to as 'start time' in the warning) and/or .t (the time2 argument, 'Stop time' in the warning). I checked this possibility with a simple example:
library(survival)
# read the data you feed `Surv` with
df <- read.table(text = "row id contadorbis unst. stable d.censor .t0 .t
1 78 1 0 1 0 0 88
2 101 2 0 1 0 0 46
3 155 3 0 1 0 0 27
4 170 4 0 0 0 0 61
5 170 4 1 0 0 61 86
6 213 5 0 0 0 0 92
7 213 5 0 0 0 92 182
8 213 5 0 0 0 182 273
9 213 5 0 0 0 273 365
10 213 5 1 0 0 365 394
11 334 6 0 1 0 0 6
12 334 7 1 0 0 0 38
13 369 8 0 0 0 0 27
14 369 8 0 0 0 27 119
15 369 8 0 0 0 119 209
16 369 8 0 0 0 209 300
17 369 8 0 0 0 300 392", header = TRUE)
# create survival object
mysurv <- with(df, Surv(time = .t0, time2 = .t, event = stable))
mysurv
# create a new data set where one .t for some reason is less than .t0
# on row five .t0 is 61, so I set .t to 60
df2 <- df
df2$.t[df2$.t == 86] <- 60
# create survival object using new data which contains at least one Stop time that is less than Start time
mysurv2 <- with(df2, Surv(time = .t0, time2 = .t, event = stable))
# Warning message:
# In Surv(time = .t0, time2 = .t, event = stable) :
# Stop time must be > start time, NA created
# i.e. the same warning message as you got
# check the survival object
mysurv2
# as you can see, the fifth interval contains NA
# I would recommend you check .t0 and .t in your data set carefully
# one way to examine rows where Stop time (.t) is not greater than start time (.t0) is:
df2[which(df2$.t0 >= df2$.t), ]
I am not familiar with multivePenal, but it seems that it does not accept a survival object which contains intervals with NA, whereas frailtyPenal might do so.
The authors of the package have told me that the function is not finished yet, so perhaps that is the reason that it is not working well.
I encountered the same error and arrived at this solution.
frailtyPenal() will not accept data.frames of different lengths. The data.frame used in Surv and the data.frame named in data= in frailtyPenal must have the same number of rows. I used a Cox regression to identify the incomplete cases, reset the survival object to exclude the missing cases and, finally, ran frailtyPenal:
library(survival)
library(frailtypack)
data(readmission)
#Reproduce the error
#change the first start time to NA
readmission[1,3] <- NA
#create a survival object with one missing time
surv.obj1 <- with(readmission, Surv(t.start, t.stop, event))
#observe the error
frailtyPenal(surv.obj1 ~ cluster(id) + dukes,
data=readmission,
cross.validation=FALSE,
n.knots=10,
kappa=1,
hazard="Splines")
#repair by resetting the surv object to omit the missing value(s)
#identify NAs using a Cox model
cox.na <- coxph(surv.obj1 ~ dukes, data = readmission)
#remove the NA cases from the original set to create complete cases
readmission2 <- readmission[-cox.na$na.action,]
#reset the survival object using the complete cases
surv.obj2 <- with(readmission2, Surv(t.start, t.stop, event))
#run frailtyPenal using the complete cases dataset and the complete cases Surv object
frailtyPenal(surv.obj2 ~ cluster(id) + dukes,
data = readmission2,
cross.validation = FALSE,
n.knots = 10,
kappa = 1,
hazard = "Splines")
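The incomplete or invalid rows can also be located directly, without fitting a Cox model first. The following is a minimal base R sketch on made-up data. The column names follow the readmission example, and it assumes the intervals that end up as NA in Surv() are exactly those with a missing endpoint or a stop time that is not after the start time, as the warning text suggests.

```r
# Made-up start/stop data; column names follow the readmission example
d <- data.frame(
  id      = c(1, 1, 2, 3, 3),
  t.start = c(0, 10, 0, NA, 5),
  t.stop  = c(10, 8, 7, 5, 5),
  event   = c(0, 1, 1, 0, 1)
)

# an interval is unusable if either end is missing or stop is not after start
bad <- is.na(d$t.start) | is.na(d$t.stop) | d$t.stop <= d$t.start

# row numbers of the problem intervals, for inspection
which(bad)

# complete cases only; build the Surv object and pass data = d_ok from here on
d_ok <- d[!bad, ]
d_ok
```

Building both the survival object and the data= argument from the same filtered data frame keeps their lengths in step, which is the condition described above.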

Using a column entry as a "selector" for datasets in R

My array looks like this:
Slide Index A B C DoseGroup
482 778 l 0 0 2 13Gy_p_75_42wk
483 778 r 0 0 2 13Gy_p_75_42wk
484 779 l 0 0 2 13Gy_p_75_42wk
485 779 r 0 0 2 13Gy_p_75_42wk
486 4700 l 2 2 2 14.25Gy_C_50pl_42wk
487 4700 r 0 0 1 14.25Gy_C_50pl_42wk
488 4701 l 0 0 1 14.25Gy_C_50pl_42wk
I would like to use the DoseGroup column's entries to be able to select the respective entries in the other columns. I would like to be able to tell R, e.g., "Do a wilcox.test between the 13Gy_p_75_42wk and the 14.25Gy_C_50pl_42wk datasets using column C."
How can I do this with R? Is there some kind of way to select all columns having the entry 14.25Gy_C_50pl_42wk?
I modified your data to add a third level in DoseGroup to make it more realistic.
txt <- "Slide Index A B C DoseGroup
778 l 0 0 2 13Gy_p_75_42wk
778 r 0 0 2 13Gy_p_75_42wk
779 l 0 0 2 13Gy_p_75_42wk
779 r 0 0 2 13Gy_p_75_42wk
4700 l 2 2 2 14.25Gy_C_50pl_42wk
4700 r 0 0 1 14.25Gy_C_50pl_42wk
4701 l 0 0 1 14.25Gy_C_50pl_42wk
4702 l 0 0 10 15Gy_C_50pl_42wk"
dat <- read.table(text = txt, header = TRUE)
wilcox.test(C ~ DoseGroup, data = dat,
subset = DoseGroup %in% c("13Gy_p_75_42wk", "14.25Gy_C_50pl_42wk"))
## Wilcoxon rank sum test with continuity correction
## data: C by DoseGroup
## W = 10, p-value = 0.1175
## alternative hypothesis: true location shift is not equal to 0
To select data, you can use either of these two commands.
dat[dat$DoseGroup == "14.25Gy_C_50pl_42wk", ]
subset(dat, DoseGroup == "14.25Gy_C_50pl_42wk")
These commands are basic R, and any introduction to R will show you how to use them.
So I urge you to read one if you really want to enjoy R.
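When several groups need to be compared in turn, it can be handy to split the column once and then pick groups by name. A small sketch using the C and DoseGroup values from the example above; the variable names by_group, x and y are only illustrative.

```r
# C and DoseGroup from the example data above
dat <- data.frame(
  C = c(2, 2, 2, 2, 2, 1, 1, 10),
  DoseGroup = c(rep("13Gy_p_75_42wk", 4),
                rep("14.25Gy_C_50pl_42wk", 3),
                "15Gy_C_50pl_42wk")
)

# one vector of C values per dose group, addressable by group name
by_group <- split(dat$C, dat$DoseGroup)
x <- by_group[["13Gy_p_75_42wk"]]
y <- by_group[["14.25Gy_C_50pl_42wk"]]

# exact = FALSE asks for the normal approximation, since the data contain ties
res <- wilcox.test(x, y, exact = FALSE)
res
```

The same by_group list can be reused for any other pair of dose groups without subsetting the data frame again.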
