Applying a function to a few rows then the next few rows - r

I am trying to find the max of rows 2:5, then 3:6, then 4:7 and so on up to nrow(df). However, I am having trouble working out how to do this because I have never successfully used a for loop. Any help is greatly appreciated.
structure(c(76.89, 77.08, 77.05, 77.28, 77.28, 77.61, 77.03,
77.61, 77.28, 77.3, 77.37, 77.61, 76.7, 77, 76.98, 77.09, 77.21,
77.5, 76.74, 77.49, 76.98, 77.2, 77.29, 77.58, NA, 76.91, 77.27,
77.13, 77.24, 77.45, NA, 0.910154726303475, 0.0129416332341208,
0.220407104887854, 0.168306576903153, 0.20658489347966, NA, 0.117019893381879,
-0.3753073637893, -0.0518604952677195, -0.0388399792853642, 0.0645577792123914
), .indexCLASS = "Date", .indexTZ = "UTC", tclass = "Date", tzone = "UTC", class = c("xts",
"zoo"), index = structure(c(631324800, 631411200, 631497600,
631756800, 631843200, 631929600), tzone = "UTC", tclass = "Date"), .Dim = 6:7, .Dimnames = list(
NULL, c("open", "high", "low", "close", "avgco", "percenthigh",
"percentlow")))
Specifically I want to apply the max function over the AD1$high column for rows 2 through 5 then rows 3 through 6 and so on and have this in a new column.
Thank You

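You could do it by making three copies of your column (i.e. "high") and offsetting them so that one starts one value ahead and one starts one value behind. Then just take the max as you iterate across them. Note that this gives the max over a window of three adjacent rows; add further offset copies for a wider window.
y <- yourdata
t <- as.numeric(y[, "high"])  # plain numeric vector, drops the xts index
tback <- t[2:length(t)]       # tback[i] is the value one row after t[i]
tforward <- append(NA, t)     # tforward[i] is the value one row before t[i]
Using a loop:
maxvals <- numeric(length(t)) # preallocate the result before filling it
for (i in seq_along(t)) {
maxvals[i] <- max(c(t[i], tback[i], tforward[i]), na.rm = TRUE)
}
Output: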
> maxvals
[1] 77.61 77.61 77.61 77.37 77.61 77.61
Or you could do it more efficiently without a loop by initializing maxvals to the proper length and filling its values.
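A loop-free variant of the same three-copy idea can be sketched with base R's pmax(), which takes element-wise maxima across vectors. The vector name high and the helper names nxt and prv are illustrative only; the values are the "high" column from the question.

```r
# "high" column from the question's sample data, as a plain numeric vector
high <- c(77.03, 77.61, 77.28, 77.30, 77.37, 77.61)
n <- length(high)

nxt <- c(high[-1], NA)   # value from the following row (NA for the last row)
prv <- c(NA, high[-n])   # value from the preceding row (NA for the first row)

# element-wise max over the three aligned vectors, ignoring the NA padding
maxvals <- pmax(high, nxt, prv, na.rm = TRUE)
print(maxvals)
```

This reproduces the output shown above without an explicit loop.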

Using the zoo function "rollapply" solved my problem.
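For readers who want to see what that might look like, here is a minimal sketch using zoo::rollapply on the "high" values from the question. The window width of 4 (rows 2:5, 3:6, ...) follows the question; the left alignment and the NA fill are assumptions, since the comment does not say which settings were used.

```r
library(zoo)

# "high" column from the question, indexed by the dates in the dput output
high <- zoo(
  c(77.03, 77.61, 77.28, 77.30, 77.37, 77.61),
  as.Date(c("1990-01-03", "1990-01-04", "1990-01-05",
            "1990-01-08", "1990-01-09", "1990-01-10"))
)

# Rolling max over 4 consecutive rows; with align = "left" the value stored
# at row i covers rows i..i+3, and incomplete windows are filled with NA.
roll_max <- rollapply(high, width = 4, FUN = max, align = "left", fill = NA)
print(roll_max)
```

Changing align to "right" would instead store each window's max at its last row, which may suit a "new column" better depending on how the result is used.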

Related

Finding monthly average of columns using group_by function in R

I have a dataset with daily values, and I want to find the monthly average of each column. The following code used to work for me, but it no longer does and I don't understand why. It now gives me data1 as 1 obs. of 1 variable, which is NA.
data %>% group_by(month=floor_date(Timestamp, "month")) %>%
summarize(USDTRY=mean(USDTRY)) -> data1
The following is how my data looks:
dput(head(data))
structure(list(Timestamp = structure(c(1629417600, 1629331200,
1629244800, 1629158400, 1629072000, 1628812800), tzone = "UTC", class = c("POSIXct",
"POSIXt")), USDTRY = c(8.4852, 8.4939, 8.4485, 8.4284, 8.453,
8.5171), EURTRY = c(9.9325, 9.9311, 9.8916, 9.8746, 9.9618, 10.0539
), EURUSD = c(1.1696, 1.1674, 1.171, 1.1708, 1.1777, 1.1791),
BIST100 = c(1444.63, 1439.86, 1449.59, 1461.69, 1455.25,
1447.64), TR2YT = c(18.01, 18.01, 18.01, 18.01, 18.01, 18.15
), TR10YT = c(16.88, 16.87, 16.79, 16.8, 16.69, 16.77), TR_EURBON_2 = c(3.648673,
3.63085, 3.611969, 3.572728, 3.567871, 3.559959), TR_EURBON_10 = c(6.302608,
6.307343, 6.276473, 6.240502, 6.255035, 6.301358), BRENT = c(65.18,
66.45, 68.23, 69.03, 69.51, 70.59), WTI = c(62.32, 63.69,
65.46, 66.59, 67.29, 68.44), Altın = c(1780.8668, 1780.179,
1787.59, 1785.9556, 1787.2383, 1779.1515), Gümüş = c(23.01,
23.23, 23.4805, 23.6351, 23.8235, 23.74)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Any idea how I can solve it?
Thanks.
(Additionally, note that my Timestamp variable shows values such as 2021-08-01, 2021-08-18... when I view(data), but they appear as 1629417600, 1629331200 in the dput output.)
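No answer is recorded here, so the following is only a guess. A single NA row is what one would typically see if another package's summarize (for example plyr, loaded after dplyr) masks dplyr's version and the column contains missing values. Neither condition is confirmed by the question. Under those assumptions, a sketch that calls the dplyr functions explicitly and skips NA values would be:

```r
library(dplyr)
library(lubridate)

# First two columns of the question's sample data
data <- tibble(
  Timestamp = as.POSIXct(
    c(1629417600, 1629331200, 1629244800, 1629158400, 1629072000, 1628812800),
    origin = "1970-01-01", tz = "UTC"
  ),
  USDTRY = c(8.4852, 8.4939, 8.4485, 8.4284, 8.453, 8.5171)
)

# Namespace-qualified calls avoid any masking by other loaded packages;
# na.rm = TRUE keeps a missing value from turning the whole mean into NA.
data1 <- data %>%
  dplyr::group_by(month = lubridate::floor_date(Timestamp, "month")) %>%
  dplyr::summarize(USDTRY = mean(USDTRY, na.rm = TRUE))

print(data1)
```

All six sample rows fall in August 2021, so this produces one row for that month.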

R loop doesn't work while single command works

I am trying to convert many time series xts objects to tibbles, but the for loop I wrote does not work properly and I don't know why.
This happens not only with this particular task but also with other tasks I perform. I have a vector called "code", which contains the names of all the xts objects I want to convert.
code <- c('ABT','BA','CL','ROK')
for (i in code)
{
i <- tk_tbl(i, preserve_index = TRUE, rename_index = "index",
timetk_idx = FALSE, silent = FALSE)
}
What is strange is that if I convert a single object without the loop, it works beautifully and converts the xts "ABT" to a tibble "ABT":
ABT <- tk_tbl(ABT, preserve_index = TRUE, rename_index = "index",
timetk_idx = FALSE, silent = FALSE)
The error message for the first code is
Warning: No index to preserve. Object otherwise converted to tibble
successfully.
38: In tk_tbl.data.frame(as.data.frame(data), preserve_index, ... :
Edit:
tk_tbl is a function from the package timetk, and it is documented as "Coerce time-series objects to tibble."
And code is a vector containing names.
library(timetk)
code <- c('ABT','BA','CL','ROK')
> dput(head(ROK))
structure(c(8.14062, 8.15625, 8.03125, 7.78125, 7.6875, 7.71875,
8.25, 8.15625, 8.125, 7.90625, 7.71875, 7.75, 8.03125, 8.125,
7.90625, 7.65625, 7.625, 7.65625, 8.1875, 8.125, 7.90625, 7.71875,
7.65625, 7.6875, 109600, 80800, 138400, 151600, 96800, 258800,
0.684505, 0.67928, 0.660992, 0.645316, 0.640091, 0.642704),
class=c("xts", "zoo"), .indexCLASS = "Date", tclass = "Date",
.indexTZ = "UTC", tzone = "UTC", src = "yahoo",
updated = structure(1558826745.23035, class = c("POSIXct","POSIXt")),
index = structure(c(378604800, 378950400, 379036800,
379123200, 379209600, 379296000), tzone = "UTC", tclass = "Date"),
.Dim = c(6L, 6L), .Dimnames = list(NULL, c("ROK.Open", "ROK.High",
"ROK.Low", "ROK.Close", "ROK.Volume", "ROK.Adjusted")))
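It looks to me like you expect <- to do what assign() does. Inside the loop, i is just a character string such as "ABT", so you also need get() to fetch the object with that name before converting it.
I think you will get your expected result when you change your loop to:
for (i in code) {
assign(i, tk_tbl(get(i), preserve_index = TRUE, rename_index = "index", timetk_idx = FALSE, silent = FALSE))
}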
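The name-versus-object distinction can be illustrated in base R alone. In this sketch small matrices stand in for the xts objects and as.data.frame() stands in for tk_tbl(); both substitutions are assumptions made so the example runs without timetk.

```r
# Stand-in objects (plain matrices instead of xts) and their names
ABT <- matrix(1:4, ncol = 2, dimnames = list(NULL, c("Open", "Close")))
BA  <- matrix(5:8, ncol = 2, dimnames = list(NULL, c("Open", "Close")))
code <- c("ABT", "BA")

# Alternative without assign(): collect the objects in a named list first
originals <- mget(code)
converted <- lapply(originals, as.data.frame)

# Loop version: get() fetches the object named by the string,
# assign() rebinds that same name to the converted result
for (nm in code) {
  assign(nm, as.data.frame(get(nm)))
}

str(ABT)
```

Keeping the results in a named list (converted) is often easier to work with than many separately named objects, since later steps can use lapply() again.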

Change data to numeric type to determine which distribution fits better

I am trying to figure out which distribution best fits logarithmic stock returns. Here is my code:
library(TTR)
sign="^GSPC"
start=19900101
end=20160101
x <- getYahooData(sign, start = start, end = end, freq = "daily")
x$logret <- log(x$Close) - lag(log(x$Close))
x=x[,6]
I want to use the function descdist(x, discrete = FALSE), which I got from this amazing post (https://stats.stackexchange.com/questions/132652/how-to-determine-which-distribution-fits-my-data-best). Nonetheless, R gives me this error: Error in descdist(x, discrete = FALSE) : data must be a numeric vector. How do I transform my data into a numeric vector?
The output from dput(head(x)) is:
structure(c(NA, -0.00258888580664607, -0.00865029791190164, -0.00980414107803274,
0.00450431207515223, -0.011856706127011), class = c("xts", "zoo"
), .indexCLASS = "Date", .indexTZ = "UTC", tclass = "Date", tzone = "UTC", index = structure(c(631238400,
631324800, 631411200, 631497600, 631756800, 631843200), tzone = "UTC", tclass = "Date"), .Dim = c(6L,
1L), .Dimnames = list(NULL, "logret"))
Pre-process x using as.numeric(na.omit(x)), or simply run
descdist(as.numeric(na.omit(x)), discrete = FALSE)
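To see why this helps, here is a small base R sketch. A one-column matrix stands in for the xts object (an xts object is also a matrix with extra attributes), and is.vector() is used as a proxy for the kind of check descdist presumably performs; both are assumptions for illustration.

```r
# One-column matrix with a leading NA, mimicking the shape of x$logret
x <- matrix(
  c(NA, -0.00258888580664607, -0.00865029791190164,
    -0.00980414107803274, 0.00450431207515223, -0.011856706127011),
  ncol = 1, dimnames = list(NULL, "logret")
)

# A matrix carries a dim attribute, so it does not count as a plain vector
print(is.vector(x, mode = "numeric"))

# na.omit() drops the NA row; as.numeric() strips dim and other attributes
v <- as.numeric(na.omit(x))
print(is.vector(v, mode = "numeric"))
print(length(v))
```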

optimisation of a condition for loop in r

I have 2 datasets. One contains temperature measurements at 30-minute intervals:
ordered.temp<-structure(list(time = structure(c(1385244720, 1385246520, 1385248320,
1385250120, 1385251920, 1385253720, 1385255520, 1385257320, 1385259120,
1385260920), class = c("POSIXct", "POSIXt"), tzone = ""), temp = c(30.419,
29.34, 28.965, 28.866, 28.891, 28.866, 28.692, 28.419, 28.122,
27.85), hoboID = c(2392890L, 2392890L, 2392890L, 2392890L, 2392890L,
2392890L, 2392890L, 2392890L, 2392890L, 2392890L)), .Names = c("time",
"temp", "hoboID"), row.names = c(NA, 10L), class = "data.frame")
The other I created to be able to assign each measurement to a 3-hour bin:
df<-structure(list(start = structure(c(1385182800, 1385193600, 1385204400,
1385215200, 1385226000, 1385236800, 1385247600, 1385258400, 1385269200,
1385280000), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1385193600,
1385204400, 1385215200, 1385226000, 1385236800, 1385247600, 1385258400,
1385269200, 1385280000, 1385290800), class = c("POSIXct", "POSIXt"
), tzone = ""), b = 1:10), .Names = c("start", "end", "b"), row.names = c(NA,
10L), class = "data.frame")
For simplicity, I created a subset of the data, but in reality the temp dataframe is 460k rows long and growing bigger every year. I wrote a for loop to compare each line in temp with the lines in the bin dataframe (df) and assign it the corresponding b value.
m <- length(ordered.temp$time)
b <- numeric(m)
n <- length(df$start)
for (i in 1:m){
for (j in 1:n){
if (df$start[j] < ordered.temp$time[i] & ordered.temp$time[i] <= df$end[j]){
b[i] <- df$b[j]
print(i/dim(ordered.temp)[1]*100)
}
}
}
Running this loop with 460k rows takes a very long time (I ran the loop for 1 minute and calculated that it would take ±277 hours to complete). It is therefore imperative to speed this loop up, or to find an alternative method if that is not possible. However, I have no idea how to achieve the desired result. Any help would be greatly appreciated. Thanks.
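No answer is recorded here. One vectorized approach that might replace the nested loop is base R's findInterval(), sketched below. It assumes the bins are sorted and contiguous (each end equals the next start), which holds for the sample data but is not stated for the full dataset. Times are written as epoch seconds so the example is self-contained; the names bins, times and breaks are illustrative.

```r
# Bin edges from the sample df (3-hour bins, epoch seconds)
bins <- data.frame(
  start = seq(1385182800, by = 10800, length.out = 10),
  end   = seq(1385193600, by = 10800, length.out = 10),
  b     = 1:10
)

# Measurement times from the sample ordered.temp (30-minute steps)
times <- seq(1385244720, by = 1800, length.out = 10)

# All bin boundaries in one sorted vector: every start plus the final end
breaks <- c(bins$start, bins$end[nrow(bins)])

# left.open = TRUE makes each interval (start, end], matching the loop's
# condition start < time & time <= end
idx <- findInterval(times, breaks, left.open = TRUE)

# Times outside all bins get idx 0 or length(breaks); leave those as 0,
# which is what the original loop would also leave in b
inside <- idx >= 1 & idx <= nrow(bins)
b <- numeric(length(times))
b[inside] <- bins$b[idx[inside]]
print(b)
```

With the real data frames, as.numeric(ordered.temp$time), as.numeric(df$start) and as.numeric(df$end) would take the place of the hard-coded numbers. Because findInterval() uses a binary search rather than comparing every row with every bin, it should scale far better than the double loop.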

How to remove a row from zoo/xts object, given a timestamp

I was happily running with this code:
z=lapply(filename_list, function(fname){
read.zoo(file=fname,header=TRUE,sep = ",",tz = "")
})
xts( do.call(rbind,z) )
until Dirty Data came along with this at the end of one file:
Open High Low Close Volume
2011-09-20 21:00:00 1.370105 1.370105 1.370105 1.370105 1
and this at the start of the next file:
Open High Low Close Volume
2011-09-20 21:00:00 1.370105 1.371045 1.369685 1.3702 2230
So rbind.zoo complains about a duplicate.
I can't use something like:
y <- x[ ! duplicated( index(x) ), ]
as they are in different zoo objects, inside a list. And I cannot use aggregate, as suggested here because they are a list of zoo objects, not one big zoo object. And I can't get one big object 'cos of the duplicates. Catch-22.
So, when the going gets tough, the tough hack together some for loops (excuse the prints and a stop, as this isn't working code yet):
indexes <- do.call("c", unname(lapply(z, index)))
dups=duplicated(indexes)
if(any(dups)){
duplicate_timestamps=indexes[dups]
for(tix in 1:length(duplicate_timestamps)){
t=duplicate_timestamps[tix]
print("We have a duplicate:");print(t)
for(zix in 1:length(z)){
if(t %in% index(z[[zix]])){
print(z[[zix]][t])
if(z[[zix]][t]$Volume==1){
print("-->Deleting this one");
z[[zix]][t]=NULL #<-- PROBLEM
}
}
}
}
stop("There are duplicate bars!!")
}
The bit I've got stumped on is assigning NULL to a zoo row doesn't delete it (Error in NextMethod("[<-") : replacement has length zero). OK, so I do a filter-copy, without the offending item... but I'm tripping up on these:
> z[[zix]][!t,]
Error in Ops.POSIXt(t) : unary '!' not defined for "POSIXt" objects
> z[[zix]][-t,]
Error in `-.POSIXt`(t) : unary '-' is not defined for "POSIXt" objects
P.S. While high-level solutions to my real problem of "duplicate rows across a list of zoo objects" are very welcome, the question here is specifically about how to delete a row from a zoo object given a POSIXt index object.
A small bit of test data:
list(structure(c(1.36864, 1.367045, 1.370105, 1.36928, 1.37039,
1.370105, 1.36604, 1.36676, 1.370105, 1.367065, 1.37009, 1.370105,
5498, 3244, 1), .Dim = c(3L, 5L), .Dimnames = list(NULL, c("Open",
"High", "Low", "Close", "Volume")), index = structure(c(1316512800,
1316516400, 1316520000), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"),
structure(c(1.370105, 1.370115, 1.36913, 1.371045, 1.37023,
1.37075, 1.369685, 1.36847, 1.367885, 1.3702, 1.36917, 1.37061,
2230, 2909, 2782), .Dim = c(3L, 5L), .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), index = structure(c(1316520000,
1316523600, 1316527200), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"))
UPDATE: Thanks to G. Grothendieck for the row-deleting solution. In the actual code I followed the advice of Joshua and GSee to get a list of xts objects instead of a list of zoo objects. So my code became:
z=lapply(filename_list, function(fname){
xts(read.zoo(file=fname,header=TRUE,sep = ",",tz = ""))
})
x=do.call.rbind(z)
(As an aside, please note the call to do.call.rbind. This is because rbind.xts has some serious memory issues. See https://stackoverflow.com/a/12029366/841830 )
Then I remove duplicates as a post-process step:
dups=duplicated(index(x))
if(any(dups)){
duplicate_timestamps=index(x)[dups]
to_delete=x[ (index(x) %in% duplicate_timestamps) & x$Volume<=1]
if(nrow(to_delete)>0){
#Next line says all lines that are not in the duplicate_timestamp group
# OR are in the duplicate timestamps, but have a volume greater than 1.
print("Will delete the volume=1 entry")
x=x[ !(index(x) %in% duplicate_timestamps) | x$Volume>1]
}else{
stop("Duplicate timestamps, and we cannot easily remove them just based on low volume.")
}
}
If z1 and z2 are your zoo objects then to rbind while removing any duplicates in z2:
rbind( z1, z2[ ! time(z2) %in% time(z1) ] )
Regarding deleting points in a zoo object having specified times, the above already illustrates this but in general if tt is a vector of times to delete:
z[ ! time(z) %in% tt ]
or if we knew there were a single element in tt then z[ time(z) != tt ] .
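As a worked illustration of the idiom above, here is a sketch on a reduced version of the question's test data (only the Close and Volume columns). The UTC time zone is an assumption, since the original uses the local zone. Here the duplicate is dropped from the first object, because that is the one holding the Volume = 1 bar.

```r
library(zoo)

times1 <- as.POSIXct(c(1316512800, 1316516400, 1316520000),
                     origin = "1970-01-01", tz = "UTC")
times2 <- as.POSIXct(c(1316520000, 1316523600, 1316527200),
                     origin = "1970-01-01", tz = "UTC")

z1 <- zoo(cbind(Close = c(1.367065, 1.37009, 1.370105),
                Volume = c(5498, 3244, 1)), times1)
z2 <- zoo(cbind(Close = c(1.3702, 1.36917, 1.37061),
                Volume = c(2230, 2909, 2782)), times2)

# Timestamp of the row to delete (the bar that also starts z2)
tt <- times1[3]

# Keep every row of z1 whose time is not in tt
z1_clean <- z1[!time(z1) %in% tt, ]

# With the duplicate gone, rbind no longer complains
combined <- rbind(z1_clean, z2)
print(combined)
```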
rbind.xts will allow duplicate index values, so it could work if you convert to xts first.
x <- lapply(z, as.xts)
y <- do.call(rbind, x)
# keep last value of any duplicates
y <- y[!duplicated(index(y),fromLast=TRUE),]
I think you'll have better luck if you convert to xts first.
a <- structure(c(1.370105, 1.370105, 1.370105, 1.370105, 1), .Dim = c(1L,
5L), index = structure(1316570400, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), class = c("xts",
"zoo"))
b <- structure(c(1.370105, 1.371045, 1.369685, 1.3702, 2230), .Dim = c(1L,
5L), index = structure(1316570400, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), class = c("xts",
"zoo"))
(comb <- rbind(a, b))
# Open High Low Close Volume
#2011-09-20 21:00:00 1.370105 1.370105 1.370105 1.370105 1
#2011-09-20 21:00:00 1.370105 1.371045 1.369685 1.370200 2230
dupidx <- index(comb)[duplicated(index(comb))] # indexes of duplicates
tail(comb[dupidx], 1) #last duplicate
# now rbind the last duplicated row with all non-duplicated data
rbind(comb[!index(comb) %in% dupidx], tail(comb[dupidx], 1))
