How to subset 'n' number of rows past a certain value? - r

I'm trying to subset a data.frame based on a 1 or 0 value the data.frame.
Here is some sample code;
> Test
Close High Low Dn.BB MaVg Up.BB Per.BB Dn.Brk
2007-02-27 6286.1 6434.7 6270.5 6305.813 6389.679 6473.544 -0.11752900 1
2007-02-28 6171.5 6286.1 6166.2 6237.635 6377.186 6516.737 -0.23695539 1
2007-03-01 6116.0 6230.7 6038.9 6164.470 6358.129 6551.787 -0.12514308 1
2007-03-02 6116.2 6164.4 6085.6 6110.807 6341.179 6571.550 0.01170495 0
2007-03-05 6058.7 6116.2 5989.6 6047.421 6318.100 6588.779 0.02083561 0
2007-03-06 6138.5 6138.5 6058.7 6018.953 6297.907 6576.861 0.21427696 0
2007-03-07 6156.5 6167.6 6106.1 6001.139 6278.136 6555.133 0.28043853 0
2007-03-08 6227.7 6233.1 6156.5 5997.989 6264.436 6530.882 0.43106389 0
2007-03-09 6245.2 6255.8 6190.3 6003.152 6250.207 6497.262 0.48986661 0
2007-03-12 6233.3 6276.3 6219.3 6007.297 6237.421 6467.546 0.49104464 0
2007-03-13 6161.2 6240.7 6161.2 6000.401 6223.429 6446.457 0.36049188 0
Here, I would like to have something that iterates along the data.frame and then splits out the subsets based on Dn.Brk > 0. I can only think of a loop method here and am not to familiar with sub-setting, so was wondering if anyone could point me in the right direction / provide some tips of functions / packages that could achive this?
A little more detail below;
Sub <- rep(0,nrow(Test))
for (i in nrow(Test)){
if (Test[i,8] > 0){Sub = Test(i:i+10,1)}
}
So, the above would, at every point where Test[i,8] > 0, select, Test$Close from i:i+10.
Ideally, I'd like every sample to be stored in a separate row/column in a new df. Is that possible?

You can use sapply here:
sapply(which(Test[, 8] > 0), function(z) Test$Close[z:(z+10)])
A few things to note in the loop you provided though:
You are not iterating: Your loop is from i in nrow(Test) which is effectively nrow(Test)
You would be overwriting Sub with each iteration

If you are still in search for doing it with a for loop here is the answer:
#### results list #####
results <- list()
for (i in rows.test){
if (test[i,8] > 0)
{
results[[i]] = test$Close[i:(i+10)]
}
else {results[[i]] = "no value"}
}
This could also be further parallelisable if your dataset is huge with a package called foreach. A good intro here: http://www.vikparuchuri.com/blog/parallel-r-loops-for-windows-and-linux/. You could also change "no value" to next if you want a list with only three named elements

Related

Attribute labels to scored items [function]

I don't know if the subject has already been find but here my problem :
I have a dataset from behaviors personality items scored from 1 to 8 and I would like to convert each scored according a range (e.g. 1-2 = Rare ; 3-5 = Occasionally ; 6-8 = Frequent).
I succeed to create new columns and put labels in it but I don't understand why I have same repetition in others columns :
Beh_data[,c(2,3,4,32,33,34)
enter image description here
You can see that columns with "_class" had the same outputs, and there are mistakes about correct match between labels and scores (e.g. row4 -- 8 put as Occasionally)
Here the function code :
l = unlist(names(Beh_data[,2:28]))
for (j in 1:length(l)) {
cl[j] = list(paste(l[j],"class",sep="_"))
for (k in 1:length(cl)) {
Beh_data[,cl[[k]] ] <- cl[[k]]
for(i in 1:nrow(Beh_data)){
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]<3, "Rare", Beh_data[,cl[[k]] ][i])
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]>2 & Beh_data[,l[j] ][i]<6, "Occasionally", Beh_data[,cl[[k] ] ][i])
Beh_data[,cl[[k]] ][i] <-ifelse(Beh_data[,l[j] ][i]>5, "Frequent", Beh_data[,cl[[k]] ][i])
}
}
}
I tried to see if it's could from a wrong annotation as cl[[k]] ] or something like this but it steels doesn't work
Do you have any ideas please ?
If you're open to a dplyr solution, I think its across and case_when functions are helpful here. It should also run faster since it's vectorized. This will create new columns like aff_sum_class which use the categorization you've specified.
library(dplyr)
Beh_data |>
mutate(across(aff_sum:qui_sum,
~case_when(. >= 6 ~ "Frequent",
. >= 3 ~ "Occasionally",
TRUE ~ "Rare"),
.names = "{.col}_class"))

How can I remove all rows in which one of the values satisfies a condition? apply() not working

I have a dataframe with two columns, and I want to remove all rows in which one of the values in each row is either smaller than 0 or bigger than a specified number (for the sake of argument let's call this 2000).
This is the dataframe
structure(list(xx = c(134.697838289433, 222.004361198059, 131.230956160172,
206.658871436917, 111.25078650042, 241.965831417648, 171.46912254679,
116.860666678254, 196.894985820028, 135.309699618638, 133.082437475133,
185.509376072318, 718.998297748551, 745.902984215293, 752.655615982603,
633.199684348903, 764.983924278636, 694.856525559398, 773.56532078895,
757.32358575657, 709.924023536199, 658.863564702233, 733.076690816291,
745.9306541374, 788.134444412421, 759.445624288787, 796.989170170713,
632.952543475636, 746.103571612919, 715.296116988119, 766.899107551248,
628.268453830605, 658.574104878488, 689.916530654021, 820.841422812349,
709.097957368612, 793.109262845978, 716.713801941779, 726.83260343463,
746.547080776193, 759.644057119419, 757.41275593749, 723.539527360327,
839.816318612061, 795.655016954661, 766.245386324182, 756.300015395758,
808.255074043333, 745.915083305187, 685.465492956583, 694.567959198318,
786.919467838804, 699.521900871042, 749.041223560884, 700.079697765533,
753.805501259023, 745.080253997501, 846.982894686656, 775.66384433188,
809.39649823454, 841.009469183585, 790.987061753069, 792.441925234251,
1377.97739642236, 1353.19738061511, 1259.94435540633, 1276.25060187203,
1331.26106031956, 1227.68481147557, 1345.95561236514, 1309.51489973952,
1285.62680259649, 1329.46388049714, 1256.00394500077, 1294.0505313591,
1349.09440181876, 1294.72661682462, 1339.38577920408, 1277.114896541,
1267.54884404031, 1291.32793111573, 1254.85565551553, 1298.78499697743,
1283.89664572036, 1273.92831816666, 1310.221891323, 1327.89682404014,
1310.81394400863, 595.342571560588, 689.892254230306, 562.390766853428,
736.319251501976, 609.577261412134, 641.591997384705, 682.957658696869,
580.320759093636, 560.64984978551, 643.487033739876, 688.457314818318,
631.156743281308, 659.535909106305), yy = c(1169.70954243065,
1259.830208937, 1172.21661417439, 1097.62724268622, 1198.15024522658,
1231.90665701131, 1211.36196331211, 1152.4207367321, 1287.57553021171,
1120.61366993258, 1234.70366243878, 1258.47454705197, 893.983957068268,
994.99854601335, 916.330965835536, 947.536265806389, 950.345051732045,
934.313361799171, 1018.76942964176, 918.182358835366, 1005.51128858608,
967.577307930044, 997.239384198691, 995.866808447868, 962.292293255127,
864.624084608006, 895.091604672023, 906.22162647536, 1024.45206885923,
908.693026118345, 923.625774785301, 931.801569764776, 1007.88553380827,
848.55309782664, 927.608364899483, 1024.60765786828, 1085.64295260059,
1057.90632135992, 1195.30607038065, 1151.39888340311, 1168.2831257626,
1137.15375447446, 1145.42393212912, 1108.89072769468, 1075.15451622384,
1129.91711324634, 1191.94330388541, 1132.41649984784, 1210.89342724886,
1100.60339252755, 1083.5987922884, 1056.69487941162, 1150.2707936581,
1055.75678264632, 1055.53323667429, 1049.79655119467, 1166.86598024805,
1141.82593378866, 1066.37755267981, 1160.55793904653, 1162.65728735716,
1060.29360609309, 1107.40480300404, 1825.01445883899, 1802.95011068891,
1692.84948509132, 1675.97166713074, 1758.10341887143, 1788.48414279738,
1680.15824054313, 1756.01930833023, 1706.98458587119, 1770.57687329296,
1692.21991398915, 1835.60585163662, 1790.6487914694, 1787.52076839767,
1704.25313427813, 1735.96312434652, 1813.02044772293, 1847.21159474717,
1725.63580525853, 1841.32016678, 1713.80845602987, 1770.39756152819,
1747.72988313376, 1778.13110060636, 1786.3871288087, 6.01666671271317,
19.2497357431764, 9.6964112500295, -3.23929433528044, 89.4863211231715,
86.0082947221296, 42.7982120490919, 2.19886414532234, 12.8780844043502,
30.694893442471, 7.58386594976601, 83.8385161493349, 36.4551491976192
)), row.names = 100:200, class = "data.frame")
First I create a function to eliminate points which satisfy the conditions.
routliers<-function(x){
if(x>2000|x<0){
rm(x)
}
}
Then I use the apply function across the rows to eliminate the points using the above function (the above dput() is named cds).
cds<-data.frame(apply(cds,1,routliers))
But this eliminates all points
length(cds)
[1]0
Interestingly, if I replace the rm() function with print(), then I do print out the desired points when using the apply function, but I receive the error "arguments imply differing number of rows: 0, 2". Also, I am not sure when I use the apply() function that the specified function applies to both columns of data, as I am not seeing any data points in print() that satisfies the condition for ONLY the second column of points. The first column is x co-ordinates, and the second column is y co-ordinates. I think the error "arguments imply differing number of rows:0,2" suggests that only the first value in the row is being tested against the function.
How can I write code in which rows are eliminated if one or more of the data points satisfies my condition?
This is easy to do when the columns are separate vectors, (x<-x[!condition]) however I cannot add them together again easily so I prefer to do this on a dataframe of points.
Please check if this code works for you, with df being the data you shared:
#Code
new <- df[!rowSums(df < 0 | df>2000) > 0, ]
Or this:
#Code 2
new <- df[which(apply(df,1,function(x) sum(x<0 | x>2000))==0),]
Let's make your function return TRUE for an outlier and FALSE for not an outlier. And it can be vectorized:
is_outlier = function(x) {
x > 2000 | x < 0
}
Here's how we'd use this to drop rows with outliers in a single column:
cds[!is_outlier(cds$xx), ]
For two columns, we can combine the is_outlier results with & or |. I can't tell from your text whether you want to remove rows where xx AND yy are outliers, or remove rows where xx OR yy are outliers. So pick the appropriate version:
cds[!is_outlier(cds$xx) & !is_outlier(cds$yy), ]
cds[!is_outlier(cds$xx) | !is_outlier(cds$yy), ]

How to handle "write.xlsx" error: arguments imply differing number of rows

I'm trying to write an xlsx file from a list of dataframes that I created but I'm getting an error due to missing data (I couldn't download it). I just want to write the xlsx file besides having this lacking data. Any help is appreciated.
For replication of the problem:
library(quantmod)
name_of_symbols <- c("AKER","YECO","SNOA")
research_dates <- c("2018-11-19","2018-11-19","2018-11-14")
my_symbols_df <- lapply(name_of_symbols, function(x) tryCatch(getSymbols(x, auto.assign = FALSE),error = function(e) { }))
my_stocks_OHLCV <- list()
for (i in 1:3) {
trade_date <- paste(as.Date(research_dates[i]))
OHLCV_data <- my_symbols_df[[i]][trade_date]
my_stocks_OHLCV[[i]] <- data.frame(OHLCV_data)
}
And you can see the missing data down here in my_stocks_OHLCV[[2]] and the write.xlsx error I'm getting:
print(my_stocks_OHLCV)
[[1]]
AKER.Open AKER.High AKER.Low AKER.Close AKER.Volume AKER.Adjusted
2018-11-19 2.67 3.2 1.56 1.75 15385800 1.75
[[2]]
data frame with 0 columns and 0 rows
[[3]]
SNOA.Open SNOA.High SNOA.Low SNOA.Close SNOA.Volume SNOA.Adjusted
2018-11-14 1.1 1.14 1.01 1.1 107900 1.1
write.xlsx(my_stocks_OHLCV, "C:/Users/MICRO/Downloads/Datasets_stocks/dux_OHLCV.xlsx")
Error in (function (..., row.names = NULL, check.rows = FALSE,
check.names = TRUE,:arguments imply differing number of rows: 1, 0
How do I run write.xlsx even though I have this missing data?
The main question you need to ask is, what do you want instead?
As you are working with stock data, the best idea, is that if you don't have data for a stock, then remove it. Something like this should work,
my_stocks_OHLCV[lapply(my_stocks_OHLCV,nrow)>0]
If you want a row full of NA or 0
Then use the lapply function and for each element of the list, of length 0, replace with either NA's, vector of 0's (c(0,0,0,0,0,0)) etc...
Something like this,
condition <- !lapply(my_stocks_OHLCV,nrow)>0
my_stocks_OHLCV[condition] <- data.frame(rep(NA,6))
Here we define the condition variable, to be the elements in the list where you don't have any data. We can then replace those by NA or swap the NA for 0. However, I can't think of a reason to do this.
A variation on your question, and one you could handle inside your for loop, is to check if you have data, and if you don't, replace the values there, with NAs, and you could given it the correct headers, as you know which stock it relates to.
Hope this helps.

Back testing for Stock Market with R

I am very new user for R and want to use R for back testing my Strategy. I try to combine some scripts found in web. However, it did not work according my idea. My problem is the transaction date cannot be generated according to my strategy design date.
library(quantmod)
library(lubridate)
stock1<-getSymbols("AAPL",src="yahoo",from="2016-01-01",auto.assign=F)
stock1<-na.locf(stock1)
stock1$EMA9<-EMA(Cl(stock1),n=9)
stock1$EMA19<-EMA(Cl(stock1),n=19)
stock1$EMACheck<-ifelse(stock1$EMA9>stock1$EMA19,1,0)
stock1$EMA_CrossOverUp<-ifelse(diff(stock1$EMACheck)==1,1,0)
stock1$EMA_CrossOverDown<-ifelse(diff(stock1$EMACheck)==-1,-1,0)
stock1<-stock1[index(stock1)>="2016-01-01",]
stock1_df<-data.frame(index(stock1),coredata(stock1))
colnames(stock1_df)<-c("Date","Open","High","Low","Close","Volume","Adj","EMA9","EMA19","EMACheck","EMACheck_up","EMACheck_down")
#To calculate the number of crossoverup transactions during the duration from 2016-01-01
sum(stock1_df$EMACheck_up==1 & index(stock1)>="2016-01-01",na.rm=T)
stock1_df$Date[stock1_df$EMACheck_up==1 & index(stock1)>="2016-01-01"]
sum(stock1_df$EMACheck_down==-1 & index(stock1)>="2016-01-01",na.rm=T)
stock1_df$Date[stock1_df$EMACheck_down==-1 & index(stock1)>="2016-01-01"]
#To generate the transcation according to the strategy
transaction_dates<-function(stock2,Buy,Sell)
{
Date_buy<-c()
Date_sell<-c()
hold<-F
stock2[["Hold"]]<-hold
for(i in 1:nrow(stock2)) {
if(hold == T) {
stock2[["Hold"]][i]<-T
if(stock2[[Sell]][i] == -1) {
#stock2[["Hold"]][i]<-T
hold<-F
}
} else {
if(stock2[[Buy]][i] == 1) {
hold<-T
stock2[["Hold"]][i]<-T
}
}
}
stock2[["Enter"]]<-c(0,ifelse(diff(stock2[["Hold"]])==1,1,0))
stock2[["Exit"]]<-c(ifelse(diff(stock2[["Hold"]])==-1,-1,0),0)
Buy_date <- stock2[["Date"]][stock2[["Enter"]] == 1]
Sell_date <- stock2[["Date"]][stock2[["Exit"]] == -1]
if (length(Sell_date)<length(Buy_date)){
#Sell_date[length(Sell_date)+1]<-tail(stock2[["Date"]],n=2)[1]
Buy_date<-Buy_date[1:length(Buy_date)-1]
}
return(list(DatesBuy=Buy_date,DatesSell=Sell_date))
}
#transaction dates generate:
stock1_df <- na.locf(stock1_df)
transactionDates<-transaction_dates(stock1_df,"EMACheck_up","EMACheck_down")
transactionDates
num_transaction1<-length(transactionDates[[1]])
Open_price<-function(df,x) {df[as.integer(rownames(df[df[["Date"]]==x,]))+1,][["Open"]]}
transactions_date<-function(df,x) {df[as.integer(rownames(df[df[["Date"]]==x,]))+1,][["Date"]]}
transactions_generate<-function(df,num_transaction)
{
price_buy<-sapply(1:num_transaction,function(x) {Open_price(df,transactionDates[[1]][x])})
price_sell<-sapply(1:num_transaction,function(x) {Open_price(df,transactionDates[[2]][x])})
Dates_buy<-as.Date(sapply(1:num_transaction,function(x) {transactions_date(df,transactionDates[[1]][x])}))
Dates_sell<-as.Date(sapply(1:num_transaction,function(x) {transactions_date(df,transactionDates[[2]][x])}))
transactions_df<-data.frame(DatesBuy=Dates_buy,DatesSell=Dates_sell,pricesBuy=price_buy,pricesSell=price_sell)
#transactions_df$return<-100*(transactions_df$pricesSell-transactions_df$pricesBuy)/transactions_df$pricesBuy
transactions_df$Stop_loss<-NA
return(transactions_df)
}
transaction_summary<-transactions_generate(stock1_df,num_transaction1)
transaction_summary$Return<-100*(transaction_summary$pricesSell-transaction_summary$pricesBuy)/transaction_summary$pricesBuy
transaction_summary
sum(transaction_summary$Return,na.rm=T)
Hi, I am very new user for R and want to use R for back testing my Strategy. I try to combine some scripts found in web. However, it did not work according my idea. My problem is the transaction date cannot be generated according to my strategy design date.
problem as this image
The code you have is to complicated for it's own good.
The issue lies in the fact that the functions Open_price and transactions_date look for use rownames to find a record number and then take the next one. But then instead of looking for the rownames again, it is used as an index. There it goes wrong.
If you look at the following result for the first date, it returns 40.
as.integer(rownames(stock1_df[stock1_df[["Date"]] == "2016-03-01", ]))
[1] 40
So the next record it would look for will be 41. But stock_df[41, ] is not the same as rowname 41. An issue with rownames is that if you filter / remove records from the data.frame the rownames don't change. To get the correct index number you should use which. If you look at the stock1_df, you can see that it returns 21 and we need record 22
which(stock1_df[["Date"]] == "2016-03-01")
[1] 21
I changed the Open_price and transactions_date functions to use the which function. This will now return the correct results.
Open_price <- function(df, x) {
df[which(df[["Date"]] == x) + 1, ][["Open"]]
}
transactions_date <- function(df, x) {
df[which(df[["Date"]] == x) + 1, ][["Date"]]
}
head(transaction_summary)
DatesBuy DatesSell pricesBuy pricesSell Stop_loss Return
1 2016-03-02 2016-04-25 100.51 105.00 NA 4.467215
2 2016-05-27 2016-06-20 99.44 96.00 NA -3.459374
3 2016-07-13 2016-09-12 97.41 102.65 NA 5.379322
4 2016-09-15 2016-11-02 113.86 111.40 NA -2.160547
5 2016-12-12 2017-06-13 113.29 147.16 NA 29.896728
6 2017-07-17 2017-09-19 148.82 159.51 NA 7.183166
A bit of advice, try to use spaces in your code. That makes it more readable. Look for example at this style guide. Your whole code be rewritten to only use stock1 without the need to turning it into a data.frame halfway your code. But for now the code does what it needs to do.

the condition has length > 1 and only the first element will be used while trying to use if case

I have a csv fie as:
score text
1 0 RT #RealJackEdwards: (2 of) a solution. 7 st yrs in playoffs, a Cup, a Final, a Prez Trophy. Yup, Boychuk trade a disaster; Bottom 6 fwds r…
I need to write all the tweets with negative score to a different file. I am trying to use if statement as:
if(stat$score < 0 )
write.csv(stat$text, file=paste('negtweetscore.csv'), row.names=TRUE)
But after running this code i am getting the following error message:
In if (stat$score < 0) write.csv(stat$text, file = paste("negtweetscore.csv"), :
the condition has length > 1 and only the first element will be used
You have to subset your data.frame properly:
write.csv(stat$text[stat$score<0], file=paste('negtweetscore.csv'), row.names=TRUE)

Resources