Sampling and Calculation in R

I have a file that contains two columns (Time, VA). The file is large, and I managed to read it in R (using read and subset, which is not practical for a large file). Now I want to sample the data based on time, where each sample has a sample size and a sample shift. The sample size is fixed for the whole sampling process, e.g. sampleSize = 10 seconds. The sample shift determines the start point of each new sample (after the first sample). For example, if sampleShift = 4 sec and sampleSize = 10 sec, the second sample starts at 5 sec and extends for the 10-second sample size. For each sample I want to feed the VA values to a function for some calculation.
Sampling <- function(values) {
  # Perform the sampling
  lastRowNumber <- # specify the last row manually
  sampleSize <- 10
  lastValueInFile <- lastRowNumber - sampleSize
  for (i in 1:lastValueInFile) {
    EndOfShift <- 9 + i
    sample <- c(1:sampleSize)
    h <- 1
    for (j in i:EndOfShift) {
      sample[h] <- values[j, 1]
      h <- h + 1
    }
    print(sample)
    # Perform the calculation on the extracted sample
    #--Samp_Calculation <- SomFunctionDoCalculation(sample)
  }
}
The problems with my attempt are:
1) I have to specify lastRowNumber manually for each file I read.
2) I was sampling based on row numbers rather than the Time values, and the shift was only one row per sample.
file sample:
Time VA
0.00000 1.000
0.12026 2.000
0.13026 2.000
0.14026 2.000
0.14371 3.000
0.14538 4.000
..........
..........
15.51805 79.002
15.51971 79.015
15.52138 79.028
15.52304 79.040
15.52470 79.053
.............
Any suggestions for a more professional way to do this?

I've generated some test data as follows:
val <- data.frame (time=seq(from=0,to=15,by=0.01),VA=c(0:1500))
... then the function:
sampTime <- function(values, sampTimeLen) {
  # return a data frame for a random sample of the data frame -values-
  # of length -sampTimeLen-
  minTime <- values$time[1]
  maxTime <- values$time[length(values$time)] - sampTimeLen
  startTime <- runif(1, minTime, maxTime)
  values[(values$time >= startTime) & (values$time <= (startTime + sampTimeLen)), ]
}
... can be used as follows:
> sampTime(val,0.05)
time VA
857 8.56 856
858 8.57 857
859 8.58 858
860 8.59 859
861 8.60 860
... which I think is what you were looking for.
(EDIT)
Following the clarification that you want a sample from a specific time rather than a random time, this function should give you that:
sampTimeFrom <- function(values, sampTimeLen, startTime) {
  # return a data frame for a sample of the data frame -values-
  # of length -sampTimeLen- from a specific -startTime-
  values[(values$time >= startTime) & (values$time <= (startTime + sampTimeLen)), ]
}
... which gives:
> sampTimeFrom(val,0.05,0)
time VA
1 0.00 0
2 0.01 1
3 0.02 2
4 0.03 3
5 0.04 4
6 0.05 5
> sampTimeFrom(val,0.05,0.05)
time VA
6 0.05 5
7 0.06 6
8 0.07 7
9 0.08 8
10 0.09 9
11 0.10 10
If you want multiple samples, they can be delivered with sapply() like this:
> samples <- sapply(seq(from=0,to=0.15,by=0.05),function (x) sampTimeFrom(val,0.05,x))
> samples[,1]
$time
[1] 0.00 0.01 0.02 0.03 0.04 0.05
$VA
[1] 0 1 2 3 4 5
In this case the output will overlap but making the sampTimeLen very slightly smaller than the shift value (which is shown in the by= parameter of the seq) will give you non-overlapping samples. Alternatively, one or both of the criteria in the function could be changed from >= or <= to > or <.
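To combine sampTimeFrom() with the fixed sampleSize and sampleShift from the question, here is a minimal sketch. It assumes a new window starts every sampleShift seconds (4 sec, the hypothetical value from the question), and uses mean() purely as a stand-in for the calculation function on each window's VA values:

```r
# Test data and helper as defined earlier in this thread
val <- data.frame(time = seq(from = 0, to = 15, by = 0.01), VA = 0:1500)
sampTimeFrom <- function(values, sampTimeLen, startTime) {
  values[(values$time >= startTime) & (values$time <= (startTime + sampTimeLen)), ]
}

sampleSize  <- 10   # seconds, as in the question
sampleShift <- 4    # seconds, hypothetical value from the question

# One window start every sampleShift seconds, stopping while a full window still fits
starts  <- seq(from = min(val$time), to = max(val$time) - sampleSize, by = sampleShift)
samples <- lapply(starts, function(s) sampTimeFrom(val, sampleSize, s))

# Feed each window's VA values to your calculation; mean() is just a placeholder
results <- sapply(samples, function(s) mean(s$VA))
```

Note that nrow(values) (used implicitly via max(val$time)) removes the need to specify the last row manually, which was problem 1 in the question.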

Optimization function across multiple factors

I am trying to identify the appropriate thresholds for two activities which together generate the greatest success rate.
Listed below is an example of what I am trying to accomplish. For each location I am trying to identify the thresholds to use for activities 1 & 2, so that if either criterion is met we would guess 'yes' (1). I then need to make sure that we are guessing 'yes' for only a certain percentage of the total volume for each location, while maximizing our accuracy (our guess of 'yes' matching an 'outcome' of 1).
location <- c(1, 2, 3)
testFile <- data.frame(location  = rep.int(location, 20),
                       activity1 = round(rnorm(20, mean = 10, sd = 3)),
                       activity2 = round(rnorm(20, mean = 20, sd = 3)),
                       outcome   = rbinom(20, 1, 0.5))
set.seed(145)
act_1_thresholds <- seq(7, 12, 1)
act_2_thresholds <- seq(19, 24, 1)
I was able to accomplish this by creating a table that contains all of the possible unique combinations of thresholds for activities 1 & 2, and then merging it with each observation in the sample data set. However, with ~200 locations in the actual data set, each with thousands of observations, I quickly ran out of space.
I would like to create a function that takes the location id, set of possible thresholds for activity 1, and also for activity 2, and then calculates how often we would have guessed yes (i.e. the values in 'activity1' or 'activity2' exceed their respective thresholds we're testing) to ensure our application rate stays within our desired range (50% - 75%). Then for each set of thresholds which produce an application rate within our desired range we would want to store only the set of which maximizes accuracy, along with their respective location id, application rate, and accuracy rate. The desired output is listed below.
location act_1_thresh act_2_thresh application_rate accuracy_rate
1 1 13 19 0.52 0.45
2 2 11 24 0.57 0.53
3 3 14 21 0.67 0.42
I had tried writing this into a for loop, but was not able to navigate my way through the number of nested arguments I would have to make in order to account for all of these conditions. I would appreciate assistance from anyone who has attempted a similar problem. Thank you!
An example of how to calculate the application and accuracy rate for a single set of thresholds is listed below.
### Create yard IDs
location <- c(1,2,3)
### Create a single set of thresholds
single_act_1_threshold <- 12
single_act_2_threshold <- 20
### Calculate the simulated application, and success rate of thresholds mentioned above using historical data
library(data.table)  # for as.data.table
as.data.table(testFile)[,
  list(
    application_rate = round(sum(ifelse(single_act_1_threshold <= activity1 |
                                        single_act_2_threshold <= activity2, 1, 0)) /
                             nrow(testFile), 2),
    accuracy_rate = round(sum(ifelse((single_act_1_threshold <= activity1 |
                                      single_act_2_threshold <= activity2) & (outcome == 1), 1, 0)) /
                          sum(ifelse(single_act_1_threshold <= activity1 |
                                     single_act_2_threshold <= activity2, 1, 0)), 2)
  ),
  by = location]
Consider expand.grid, which builds a data frame of all combinations between both thresholds. Then use Map to iterate elementwise over both columns of that data frame to build a list of data tables (each of which now includes columns for its threshold values).
act_1_thresholds <- seq(7,12,1)
act_2_thresholds <- seq(19,24,1)
# ALL COMBINATIONS
thresholds_df <- expand.grid(th1=act_1_thresholds, th2=act_2_thresholds)
# USER-DEFINED FUNCTION
calc <- function(th1, th2)
  as.data.table(testFile)[, list(
    act_1_thresholds = th1,   # NEW COLUMN
    act_2_thresholds = th2,   # NEW COLUMN
    application_rate = round(sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)) /
                             nrow(testFile), 2),
    accuracy_rate = round(sum(ifelse((th1 <= activity1 | th2 <= activity2) & (outcome == 1), 1, 0)) /
                          sum(ifelse(th1 <= activity1 | th2 <= activity2, 1, 0)), 2)
  ), by = location]
# LIST OF DATA TABLES
dt_list <- Map(calc, thresholds_df$th1, thresholds_df$th2)
# NAME ELEMENTS OF LIST
names(dt_list) <- paste(thresholds_df$th1, thresholds_df$th2, sep="_")
# SAME RESULT AS POSTED EXAMPLE
dt_list$`12_20`
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 12 20 0.23 0.5
# 2: 2 12 20 0.23 0.5
# 3: 3 12 20 0.23 0.5
And if you need to combine all elements into one table, use data.table's rbindlist:
final_dt <- rbindlist(dt_list)
final_dt
# location act_1_thresholds act_2_thresholds application_rate accuracy_rate
# 1: 1 7 19 0.32 0.47
# 2: 2 7 19 0.32 0.47
# 3: 3 7 19 0.32 0.47
# 4: 1 8 19 0.32 0.47
# 5: 2 8 19 0.32 0.47
# ---
# 104: 2 11 24 0.20 0.42
# 105: 3 11 24 0.20 0.42
# 106: 1 12 24 0.15 0.56
# 107: 2 12 24 0.15 0.56
# 108: 3 12 24 0.15 0.56
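From final_dt, the desired output (one row per location, application rate within the 50%-75% band, accuracy maximized) can be obtained with one more filtering step. A base-R sketch follows; the small final_df below is fabricated purely so the logic is runnable end-to-end, and its values are not real results:

```r
# Hypothetical stand-in for final_dt, as a plain data frame
final_df <- data.frame(
  location         = rep(1:3, times = 3),
  act_1_thresholds = rep(c(7, 8, 12), each = 3),
  act_2_thresholds = rep(c(19, 20, 24), each = 3),
  application_rate = c(0.32, 0.60, 0.70, 0.55, 0.80, 0.52, 0.15, 0.65, 0.72),
  accuracy_rate    = c(0.47, 0.50, 0.42, 0.53, 0.40, 0.45, 0.56, 0.61, 0.42)
)

# Keep only rows whose application rate falls in the desired 50%-75% band
in_band <- final_df[final_df$application_rate >= 0.50 &
                    final_df$application_rate <= 0.75, ]

# For each location, keep the row with the highest accuracy
best <- do.call(rbind, lapply(split(in_band, in_band$location),
                              function(g) g[which.max(g$accuracy_rate), ]))
best
```

With data.table the same two steps are the one-liner `final_dt[application_rate >= 0.5 & application_rate <= 0.75, .SD[which.max(accuracy_rate)], by = location]`.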

Backtesting stock returns for a moving-average rule in R

I am trying to backtest stock returns given a 10-month moving-average rule. The rule is: if the price is above the 10-month average, buy; if it is below the 10-month average, hold the value constant.
I know how to do this in excel very easily, but I am having trouble in R.
Below is my approach in R:
# Download financial data
library(Quandl)
SPY <- Quandl("YAHOO/INDEX_GSPC", type = "xts", collapse = "monthly")
head(SPY)
# Calculate log returns
SPY$log_ret <- diff(log(SPY$Close))
# Calculate moving average of the closing price
SPY$MA.10 <- rollapply(SPY$Close, width = 10, FUN = mean)
# Create binary rule to determine when to buy and when to hold
# 1 = Buy
SPY$Action <- ifelse(SPY$MA.10 < SPY$Close, 1, 0)
# Create default value in a new column to backtest returns
SPY$Hit <- 100
# Calculate cumulative returns
SPY$Hit <- ifelse(SPY$Action == 1,
                  SPY[2:n, "Hit"] * (1 + SPY$log_ret),
                  lag.xts(SPY$Hit, k = 1))
Returns are calculated correctly when Action is 1, but when Action is not 1, SPY$Hit only lags once and then falls back to the default value of 100, whereas I would like it to hold the value from the last time Action == 1.
This formula is very easy to implement in MS Excel, but the issue in R is that I cannot keep the value constant from the last Action == 1. How can I do this so that I can see how well this simple trading strategy would have worked?
Please let me know if I can clarify this further, thank you.
Sample of the desired output:
Action Return Answer
[1,] 0 0.00 100.00000
[2,] 1 0.09 109.00000
[3,] 1 0.08 117.72000
[4,] 1 -0.05 111.83400
[5,] 1 -0.03 108.47898
[6,] 0 -0.02 108.47898
[7,] 0 0.01 108.47898
[8,] 0 0.06 108.47898
[9,] 1 -0.03 105.22461
[10,] 0 0.10 105.22461
[11,] 1 -0.05 99.96338
Here's my guess, let me know what you think.
# Looping
Hit <- matrix(100, nrow = nrow(SPY))
for (row in 11:nrow(SPY)) {  # start at 11 since the moving average leaves NAs
  if (SPY$Action[row] == 1) {
    Hit[row] <- Hit[row - 1] * (1 + SPY$log_ret[row])  # note the row - 1
  } else {
    Hit[row] <- Hit[row - 1]
  }
}
SPY$Hit <- Hit
cbind(SPY$Action, SPY$Hit)
For your sample:
x <- data.frame(Action = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1))
x$Return <- c(0, 0.09, 0.08, -0.05, -0.03, -0.02, 0.01, 0.06, -0.03, 0.10, -0.05)
x$Answer <- matrix(100, nrow = nrow(x))
for (row in 2:nrow(x)) {  # no NAs here, so we can start from row 2
  if (x$Action[row] == 1) {
    x$Answer[row] <- x$Answer[row - 1] * (1 + x$Return[row])
  } else {
    x$Answer[row] <- x$Answer[row - 1]
  }
}
x
x
Action Return Answer
1 0 0.00 100.00000
2 1 0.09 109.00000
3 1 0.08 117.72000
4 1 -0.05 111.83400
5 1 -0.03 108.47898
6 0 -0.02 108.47898
7 0 0.01 108.47898
8 0 0.06 108.47898
9 1 -0.03 105.22461
10 0 0.10 105.22461
11 1 -0.05 99.96338
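For what it's worth, the same result can also be had without a loop: when Action is 0 the period's growth factor is 1 (the value is held), otherwise it is 1 + Return, and cumprod() chains the factors together. A vectorized sketch on the same sample data:

```r
x <- data.frame(Action = c(0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1),
                Return = c(0, 0.09, 0.08, -0.05, -0.03, -0.02, 0.01, 0.06, -0.03, 0.10, -0.05))
# Growth factor is 1 when holding, (1 + Return) when invested; chain with cumprod
x$Answer <- 100 * cumprod(1 + ifelse(x$Action == 1, x$Return, 0))
x
```

This reproduces the Answer column above and scales to long series without the per-row branch.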
In Excel there are two ways to attain this:
1. On the Data tab, find Data Analysis, then Moving Average. In the dialog box, enter the input data range, the interval (in your case 10), and the output cell addresses. After finding the result, write this formula:
=IF(A2>B2, "Buy", "Hold")
where A2 holds the price and B2 holds the 10-month moving-average value.
2. Anywhere in the sheet, number cells horizontally 1 to 10 (the month number). In the row below, put each month's value. In the row below that, calculate the 10-month average. Finally, write the formula above to decide Buy or Hold.

R inference from one matrix to a data frame

I think this may be a very simple question, but since I'm new to R, I hope someone can give me an outline of how to solve it step by step. Thanks!
So the question is: I have an (n * 2) matrix (say m) whose first column holds the index of the data in another data frame (say d) and whose second column holds some value (a p-value).
What I want to do is: if the p-value of some row r in m is less than 0.05, plot the data in d at the index indicated in the first column of row r of matrix m.
..............
The data is somewhat like what I draw below:
m:
ind p_value
2 0.02
23 0.03
56 0.12
64 0.54
105 0.04
d:
gene_id s1 s2 s3 s4 ... sn
IDH1 0.23 3.01 0 0.54 ... 4.02
IDH2 0.67 0 8.02 10.54 ... 0.72
...
so IDH2 corresponds to the first line in m, whose index column is 2.
toplot <- d[m[m[,'p_value'] < .05, 'ind'], ] works!
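For completeness, here is a small self-contained version of that one-liner with made-up data, plus one hypothetical way to plot each selected row (the gene names and barplot choice are illustrative, not from the question):

```r
m <- cbind(ind     = c(2, 23, 56, 64, 105),
           p_value = c(0.02, 0.03, 0.12, 0.54, 0.04))
d <- data.frame(gene_id = paste0("gene", 1:110),
                s1 = runif(110), s2 = runif(110), s3 = runif(110),
                stringsAsFactors = FALSE)

# Inner subscript: rows of m with p < 0.05; their 'ind' column then indexes d
toplot <- d[m[m[, "p_value"] < 0.05, "ind"], ]
toplot$gene_id  # "gene2" "gene23" "gene105"

# One possible plot per selected row:
# for (i in seq_len(nrow(toplot)))
#   barplot(as.numeric(toplot[i, -1]), main = toplot$gene_id[i])
```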

How to find out the most occurring range in a list

I plotted a graph in R:
OBD=read.csv("OBD.CSV",header = TRUE,stringsAsFactors=FALSE)
x1 <- OBD$Time1
x2 <- OBD$Time2
y1<-OBD$Vehicle_speed
y2 <-OBD$Engine_speed
par(mar=c(5,4,4,5)+.1)
plot(x1,y1,type="l",col="yellow",ylab = "Vehicle speed")
par(new=TRUE)
plot(x2,y2,type="l",col="blue4",xaxt="n",yaxt="n",xlab="Time",ylab="")
axis(4)
mtext("Engine speed",side=4,line=3)
legend("topleft",col=c("blue4","yellow"),lty=1,legend=c("y1","y2"))
Sample data, CSV format:
Vehicle_speed,Time1,Engine_speed,Time2,Engine_torq,Time3,Acc_pedal,Time4,Eng_fuel_rate,Time5
4.98,0,650,0,11,0,0,0,1.15,0
4.98,0,650,0,11,0,0,0,1.2,0.002
4.96,0,650,0.001,11,0.001,0,0.001,1.2,0.003
4.96,0,651,0.001,11,0.001,0,0.001,1.2,0.005
4.94,0.001,651,0.001,11,0.001,0,0.001,1.2,0.007
4.94,0.001,651,0.001,11,0.001,0,0.002,1.2,0.008
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.01
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.012
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.013
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.015
4.87,0.002,649.5,0.002,11,0.002,0,0.003,1.15,0.017
4.87,0.002,649.5,0.002,11,0.002,0,0.004,1.15,0.018
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.02
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.022
4.82,0.002,650,0.003,11,0.003,0,0.005,1.2,0.023
From this table, I just want to find the most occurring engine speed and vehicle speed, or the most occurring range.
To find the most common (mode) vehicle speed, you can pull it from table:
mySpeeds <- table(df$Vehicle_speed)
modeSpeed <- as.numeric(names(mySpeeds)[which.max(mySpeeds)])
modeSpeed
[1] 4.85
To get such a value for a range of speeds, you should use cut:
# get range categories
df$speedRange <- cut(df$Vehicle_speed, breaks=c(-Inf, 4.85, 4.90, 4.95, Inf))
mySpeedsRange <- table(df$speedRange)
modeSpeedRange <- names(mySpeedsRange)[which.max(mySpeedsRange)]
modeSpeedRange
[1] "(4.85,4.9]"
cut takes a numeric variable and returns a factor variable based on the second (breaks) argument. You can supply breaks with a single number indicating the number of breaks, or a vector, indicating the unique cut points. I included -Inf and Inf to ensure full coverage.
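A tiny illustration of both ways of supplying breaks= (the speeds here are made up to match the sample data's range):

```r
speeds <- c(4.82, 4.85, 4.87, 4.89, 4.91, 4.94, 4.96, 4.98)

# A single number: cut into that many equal-width intervals
table(cut(speeds, breaks = 2))

# A vector: exact cut points; -Inf/Inf guarantee every value is covered
table(cut(speeds, breaks = c(-Inf, 4.90, Inf), labels = c("low", "high")))
```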
OBD <- read.csv(text = "Vehicle_speed,Time1,Engine_speed,Time2,Engine_torq,Time3,Acc_pedal,Time4,Eng_fuel_rate,Time5
4.98,0,650,0,11,0,0,0,1.15,0
4.98,0,650,0,11,0,0,0,1.2,0.002
4.96,0,650,0.001,11,0.001,0,0.001,1.2,0.003
4.96,0,651,0.001,11,0.001,0,0.001,1.2,0.005
4.94,0.001,651,0.001,11,0.001,0,0.001,1.2,0.007
4.94,0.001,651,0.001,11,0.001,0,0.002,1.2,0.008
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.01
4.91,0.001,650.5,0.001,11,0.001,0,0.002,1.2,0.012
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.013
4.89,0.001,650.5,0.002,11,0.002,0,0.003,1.15,0.015
4.87,0.002,649.5,0.002,11,0.002,0,0.003,1.15,0.017
4.87,0.002,649.5,0.002,11,0.002,0,0.004,1.15,0.018
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.02
4.85,0.002,650,0.002,11,0.002,0,0.004,1.15,0.022
4.82,0.002,650,0.003,11,0.003,0,0.005,1.2,0.023")
> table(OBD$Engine_speed)
649.5 650 650.5 651
2 6 4 3
Or for a couple of columns:
tables <- apply(OBD[ ,c(1,3,5)], 2, table)
> tables
$Vehicle_speed
4.82 4.85 4.87 4.89 4.91 4.94 4.96 4.98
1 2 2 2 2 2 2 2
$Engine_speed
649.5 650 650.5 651
2 6 4 3
$Engine_torq
11
15
To get only the most occurring value per column:
> lapply(tables, which.max)
$Vehicle_speed
4.85
2
$Engine_speed
650
2
$Engine_torq
11
1
Does this solve the problem?

Subtracting Values in Previous Rows: Ecological Lifetable Construction

I was hoping I could get some help. I am constructing a life table, not for insurance but for ecology (a cross-section of the population of any kind of wild fauna), so essentially censoring variables like smoker/non-smoker, pregnant, gender, health status, etc.:
AgeClass <- c(1, 2, 3, 4, 5, 6)
Sample <- c(100, 99, 87, 46, 32, 19)
for (i in 1:6) {
  PropSurv <- c(Sample / 100)
}
LifeTab1 <- data.frame(cbind(AgeClass, Sample, PropSurv))
Which gave me this:
ID AgeClass Sample PropSurv
1 1 100 1.00
2 2 99 0.99
3 3 87 0.87
4 4 46 0.46
5 5 32 0.32
6 6 19 0.19
I'm now trying to calculate the number that died in each interval (DeathInt) by taking the number that survived in each row and subtracting the number in the row below it (i.e. 100-99, then 99-87, then 87-46, and so forth), so that it looks like this:
ID AgeClass Sample PropSurv DeathInt
1 1 100 1.00 1
2 2 99 0.99 12
3 3 87 0.87 41
4 4 46 0.46 14
5 5 32 0.32 13
6 6 19 0.19 NA
I found this and this, and I wasn't sure if they answered my question as these guys subtracted values based on groups. I just wanted to subtract values by row.
Also, just as a side note: I did a for() to get the proportion that survived in each age group. I was wondering if there was another way to do it or if that's the proper, easiest way to do it.
Second note: If any R-users out there know of an easier way to do a life-table for ecology, do let me know!
Thanks!
If you have a vector x that contains numbers, you can calculate the differences between consecutive elements with the diff function.
In your case it would be:
LifeTab1$DeathInt <- c(-diff(Sample), NA)
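As for the side note: the for() loop isn't needed, because R's arithmetic is vectorized; dividing the sample-size vector by its first element gives the survival proportions in one step. A sketch putting the whole table together (column names follow the question):

```r
AgeClass <- c(1, 2, 3, 4, 5, 6)
Sample   <- c(100, 99, 87, 46, 32, 19)

LifeTab1 <- data.frame(AgeClass,
                       Sample,
                       PropSurv = Sample / Sample[1])  # no loop required
# Deaths per interval: each count minus the one below; NA for the last class
LifeTab1$DeathInt <- c(-diff(Sample), NA)
LifeTab1
```

Using Sample[1] rather than a hard-coded 100 also keeps PropSurv correct if the first age class ever has a different sample size.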