I have two datasets. One contains measurements of temperature at 30-minute intervals:
ordered.temp<-structure(list(time = structure(c(1385244720, 1385246520, 1385248320,
1385250120, 1385251920, 1385253720, 1385255520, 1385257320, 1385259120,
1385260920), class = c("POSIXct", "POSIXt"), tzone = ""), temp = c(30.419,
29.34, 28.965, 28.866, 28.891, 28.866, 28.692, 28.419, 28.122,
27.85), hoboID = c(2392890L, 2392890L, 2392890L, 2392890L, 2392890L,
2392890L, 2392890L, 2392890L, 2392890L, 2392890L)), .Names = c("time",
"temp", "hoboID"), row.names = c(NA, 10L), class = "data.frame")
The other I created so that each measurement can be assigned to a 3-hour bin:
df<-structure(list(start = structure(c(1385182800, 1385193600, 1385204400,
1385215200, 1385226000, 1385236800, 1385247600, 1385258400, 1385269200,
1385280000), class = c("POSIXct", "POSIXt"), tzone = ""), end = structure(c(1385193600,
1385204400, 1385215200, 1385226000, 1385236800, 1385247600, 1385258400,
1385269200, 1385280000, 1385290800), class = c("POSIXct", "POSIXt"
), tzone = ""), b = 1:10), .Names = c("start", "end", "b"), row.names = c(NA,
10L), class = "data.frame")
For simplicity I created a subset of the data, but in reality the temperature data frame is 460k rows long and grows every year. I wrote a for loop to compare each row of ordered.temp with the rows of df and assign it the corresponding b value from the bin data frame:
m <- length(ordered.temp$time)
b <- numeric(m)
n <- length(df$start)
for (i in 1:m){
  for (j in 1:n){
    if (df$start[j] < ordered.temp$time[i] & ordered.temp$time[i] <= df$end[j]){
      b[i] <- df$b[j]
      print(i/dim(ordered.temp)[1]*100)
    }
  }
}
Running this loop over 460k rows takes a very long time (I ran it for one minute and extrapolated that completing it would take roughly 277 hours). It is therefore imperative to speed this loop up, or to find an alternative method if that is not possible, but I have no idea how to achieve the desired result. Any help would be greatly appreciated. Thanks.
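For reference, a vectorized lookup can replace the double loop entirely. Below is a minimal sketch using base R's findInterval() (its left.open argument needs R >= 3.3.0), assuming df is sorted by start and the bins are contiguous, i.e. each end equals the next start, as in the example; left.open = TRUE reproduces the loop's start < time <= end condition.
# One findInterval() pass instead of a 460k-row double loop
idx <- findInterval(ordered.temp$time, df$start, left.open = TRUE)
idx[idx == 0] <- NA                                         # at or before the first start
ordered.temp$b <- df$b[idx]
ordered.temp$b[ordered.temp$time > df$end[nrow(df)]] <- NA  # past the last end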
I have a data frame with all the information from a racing yacht for a given day (lg), and I wish to create a variable that tells me which race the yacht was in. The race start and finish times are in a separate df (RaceInfo). I can filter by race time, but the number of races per day varies, so it may need a loop.
Some Data
lg <- structure(list(Date = structure(c(18897, 18897, 18897, 18897,
18897, 18897, 18897, 18897, 18897, 18897), class = "Date"), Time = structure(c(1632725883,
1632725884, 1632725885, 1632725886, 1632725887, 1632725888, 1632725889,
1632725890, 1632725891, 1632725892), tzone = "", class = c("POSIXct",
"POSIXt")), Lat = c(43.2760531, 43.276059, 43.276065, 43.2760708,
43.2760766, 43.2760858, 43.276095, 43.2761, 43.276105, 43.2761095
), Lon = c(6.619109, 6.619136, 6.619163, 6.6191932, 6.6192235,
6.6192488, 6.619274, 6.6192988, 6.6193235, 6.6193532), Awa = c(-7.1,
-7.12, -7.15, -6.57, -6, -6.2, -6.4, -5.28, -4.15, 0.25), X = 1:10), row.names = c(NA,
-10L), class = "data.frame")
This is the yacht's onboard data.
More Data
RaceInfo <- structure(list(date = structure(c(18897, 18896), class = "Date"),
RaceStartTime = structure(c(1632738480, 1632751560), tzone = "", class = c("POSIXct",
"POSIXt")), RaceNum = c("1", "2"), RaceFinishTime = structure(c(1632751520,
1632753000), tzone = "", class = c("POSIXct", "POSIXt"))), row.names = c("event.2",
"1"), class = "data.frame")
The RaceInfo df gives the start and finish time of each race. As mentioned, there can be many races, and I need to assign a new variable lg$RaceNum based on the times given in RaceInfo.
My closest attempt is this, but loops are a weak point in my game:
for (i in RaceInfo$RaceNum){
  lg <- lg %>%
    mutate(Racenum = case_when(
      lg$Time >= subset(RaceInfo$RaceStartTime, RaceInfo$RaceNum == i) &
        lg$Time <= subset(RaceInfo$RaceFinishTime, RaceInfo$RaceNum == i) ~ i))
}
But this only returns the last number in the loop (each pass through mutate() rebuilds Racenum from scratch, so only the matches for the final i survive).
The methods mutate and case_when are really meant to assign calculated columns within a data frame, not to subset the data frame itself.
Instead, consider dplyr::filter (similar to base::subset), or even dplyr::between, and collect your iteration results into a list of data frames, then row-bind them at the end. (To subset by unique values more generally, see base::by.)
df_list <- lapply(RaceInfo$RaceNum, function(i) {
  res <- dplyr::filter(
    lg,
    dplyr::between(
      Time,
      RaceInfo$RaceStartTime[RaceInfo$RaceNum == i],
      RaceInfo$RaceFinishTime[RaceInfo$RaceNum == i]
    )
  )
  res$RaceNum <- i  # tag each subset with its race number
  res
})
final_df <- dplyr::bind_rows(df_list)
But as mentioned above, if your data is manageable and RaceInfo holds only a small set of distinct races, consider a cross join followed by a filter:
final_df <- dplyr::full_join(lg, RaceInfo, by = character()) %>%  # cross join
  dplyr::filter(dplyr::between(Time, RaceStartTime, RaceFinishTime))
I have a table (1) like this:
START;END;CATEGORY
20.05.2017 19:23:00;20.05.2017 19:27:00;A
20.05.2017 19:27:00;20.05.2017 19:32:00;B
20.05.2017 19:32:00;20.05.2017 19:38:00;A
and a table (2) like this:
TIMESTAMP;VALUES
20.05.2017 19:24:09;323
20.05.2017 19:23:12;2322
20.05.2017 19:27:55;23333
20.05.2017 19:36:12;123123
Now I want to join the category from table 1 onto table 2. The key is the timestamps: if the TIMESTAMP from table 2 lies between START and END of table 1, add the category. Basically, I want a table like this:
TIMESTAMP;VALUES;CATEGORY
20.05.2017 19:24:09;323;A
20.05.2017 19:23:12;2322;A
20.05.2017 19:27:55;23333;B
20.05.2017 19:36:12;123123;B
These are my attempts, but they aren't efficient:
I)
for(j in seq(dim(table1)[1])){
  for(i in seq(dim(table2)[1])){
    if(table2$TIMESTAMP[i] >= table1$START[j] & table2$TIMESTAMP[i] <= table1$END[j]){
      table2$CATEGORY[i] <- table1$CATEGORY[j]
    }
  }
}
II)
mapped_df <- data.frame()
for(i in seq(dim(table1)[1])){
  start <- as.POSIXct(table1$START[i])
  end <- as.POSIXct(table1$END[i])
  cat <- table1$CATEGORY[i]
  mapped_df <- rbind(mapped_df, data.frame(TIMESTAMP=seq(from=start, by=1, to=end), CATEGORY=cat))
}
merge(table2 , mapped_df)
Thanks in advance!
I have a preference for using SQL to do this. The sqldf package comes in handy.
Table1 <-
structure(
list(START = structure(c(1495322580, 1495322820, 1495323120),
class = c("POSIXct", "POSIXt"),
tzone = ""),
END = structure(c(1495322820, 1495323120, 1495323480),
class = c("POSIXct", "POSIXt"),
tzone = ""),
CATEGORY = c("A", "B", "A")),
class = "data.frame",
.Names = c("START", "END", "CATEGORY"),
row.names = c(NA, -3L)
)
Table2 <-
structure(
list(TIMESTAMP = structure(c(1495322649, 1495322592, 1495322875, 1495323372),
class = c("POSIXct", "POSIXt"),
tzone = ""),
VALUES = c(323L, 2322L, 23333L, 123123L)),
class = "data.frame",
.Names = c("TIMESTAMP", "VALUES"),
row.names = c(NA, -4L))
library(sqldf)
sqldf("SELECT T2.TIMESTAMP, T2.[VALUES], T1.CATEGORY
FROM Table2 T2
LEFT JOIN Table1 T1
ON T2.TIMESTAMP > T1.START AND T2.TIMESTAMP < T1.END")
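For completeness, the same interval join can be done without leaving R using a data.table non-equi join. This is a sketch, assuming data.table >= 1.9.8 and the Table1/Table2 objects defined above.
library(data.table)
setDT(Table1)
setDT(Table2)
# Update join: stamp each Table1 interval's CATEGORY onto the Table2 rows
# whose TIMESTAMP falls strictly inside it (same semantics as the SQL above)
Table2[Table1, CATEGORY := i.CATEGORY,
       on = .(TIMESTAMP > START, TIMESTAMP < END)]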
I have a data.frame in which every row is an episode with a start and an end timestamp:
test.DF <- structure(list(start = structure(c(1189494920, 1189495400, 1189496120,
1189496840, 1189497440, 1189498040, 1189498640, 1189501760, 1189503560,
1190453600, 1247458520, 1247480840, 1247482880, 1247483840, 1247485040,
1247486600, 1247487320, 1247488040, 1247488760, 1247490920, 1247491280,
1247492480, 1247493680, 1247502440, 1247503160, 1247503520, 1247548040,
1247549360, 1247550680, 1247552600, 1247553920, 1247557400, 1247558000,
1247558480, 1247559440, 1247560400, 1247563760, 1247564960, 1247566640,
1247567120, 1194935549, 1194936029, 1195722629, 1195724309, 1199691029,
1199692349, 1202560229, 1208063669, 1208322989, 1188188112), class = c("POSIXct",
"POSIXt"), tzone = ""), end = structure(c(1189495280, 1189495520,
1189496360, 1189497080, 1189497560, 1189498160, 1189498760, 1189501880,
1189503920, 1190453720, 1247458640, 1247480960, 1247483480, 1247484080,
1247485640, 1247486840, 1247487560, 1247488640, 1247490440, 1247491160,
1247491520, 1247492600, 1247493920, 1247502680, 1247503400, 1247504120,
1247549240, 1247550560, 1247551280, 1247552720, 1247554400, 1247557880,
1247558240, 1247559080, 1247559560, 1247560760, 1247563880, 1247565080,
1247566760, 1247567240, 1194935669, 1194936269, 1195722749, 1195724429,
1199691269, 1199692469, 1202560349, 1208063789, 1208323109, 1188204792
), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("start",
"end"), row.names = c(NA, 50L), class = "data.frame")
I would like to see the distribution of these episodes within a 24-hour cycle: either a histogram or a density plot, with the 24-hour day on the x axis. Is this possible? I would like to ignore the dates of the episodes.
By converting to POSIXlt format, you can easily extract the hour of each time:
par(mar=c(6,4,1,1))
Hour <- as.POSIXlt(test.DF$start)$hour
hist(Hour, breaks=seq(0, 23), main="Start time (hour)")
Edit: adding a value for every minute between start and end:
fun <- function(start.time, end.time){
seq.POSIXt(
as.POSIXlt(
paste0("2000-01-01 ", as.POSIXlt(start.time)$hour, ":", as.POSIXlt(start.time)$min)
),
as.POSIXlt(
paste0("2000-01-01 ", as.POSIXlt(end.time)$hour, ":", as.POSIXlt(end.time)$min)
),
by="min"
)
}
HM <- vector(mode="list", dim(test.DF)[1])
for(i in seq(HM)){
HM[[i]] <- fun(test.DF$start[i], test.DF$end[i])
}
HM2 <- as.POSIXlt(unlist(HM), origin="1970-01-01")
Hour <- HM2$hour
hist(Hour, breaks=seq(0, 23))
HourMinute <- HM2$hour + HM2$min/60
hist(HourMinute, breaks=seq(0, 24, by=1/60))
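If a density plot is preferred, the same minute-level vector can be fed to density() directly. A sketch: from/to merely clip the estimate to the 24-hour range, they do not wrap it around midnight.
plot(density(HourMinute, from = 0, to = 24),
     main = "Episode density over the day", xlab = "Hour")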
I was happily running with this code:
z=lapply(filename_list, function(fname){
read.zoo(file=fname,header=TRUE,sep = ",",tz = "")
})
xts( do.call(rbind,z) )
until Dirty Data came along with this at the end of one file:
Open High Low Close Volume
2011-09-20 21:00:00 1.370105 1.370105 1.370105 1.370105 1
and this at the start of the next file:
Open High Low Close Volume
2011-09-20 21:00:00 1.370105 1.371045 1.369685 1.3702 2230
So rbind.zoo complains about a duplicate.
I can't use something like:
y <- x[ ! duplicated( index(x) ), ]
as they are in different zoo objects, inside a list. I cannot use aggregate, as suggested here, because they are a list of zoo objects, not one big zoo object. And I can't get one big object because of the duplicates. Catch-22.
So, when the going gets tough, the tough hack together some for loops (excuse the prints and a stop, as this isn't working code yet):
indexes <- do.call("c", unname(lapply(z, index)))
dups <- duplicated(indexes)
if(any(dups)){
  duplicate_timestamps <- indexes[dups]
  for(tix in 1:length(duplicate_timestamps)){
    t <- duplicate_timestamps[tix]
    print("We have a duplicate:"); print(t)
    for(zix in 1:length(z)){
      if(t %in% index(z[[zix]])){
        print(z[[zix]][t])
        if(z[[zix]][t]$Volume == 1){
          print("-->Deleting this one")
          z[[zix]][t] <- NULL  #<-- PROBLEM
        }
      }
    }
  }
  stop("There are duplicate bars!!")
}
The bit I'm stumped on is that assigning NULL to a zoo row doesn't delete it (Error in NextMethod("[<-") : replacement has length zero). OK, so I'll do a filter-copy without the offending item... but I'm tripping up on these:
> z[[zix]][!t,]
Error in Ops.POSIXt(t) : unary '!' not defined for "POSIXt" objects
> z[[zix]][-t,]
Error in `-.POSIXt`(t) : unary '-' is not defined for "POSIXt" objects
P.S. While high-level solutions to my real problem of "duplicates rows across a list of zoo objects" are very welcome, the question here is specifically about how to delete a row from a zoo object given a POSIXt index object.
A small bit of test data:
list(structure(c(1.36864, 1.367045, 1.370105, 1.36928, 1.37039,
1.370105, 1.36604, 1.36676, 1.370105, 1.367065, 1.37009, 1.370105,
5498, 3244, 1), .Dim = c(3L, 5L), .Dimnames = list(NULL, c("Open",
"High", "Low", "Close", "Volume")), index = structure(c(1316512800,
1316516400, 1316520000), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"),
structure(c(1.370105, 1.370115, 1.36913, 1.371045, 1.37023,
1.37075, 1.369685, 1.36847, 1.367885, 1.3702, 1.36917, 1.37061,
2230, 2909, 2782), .Dim = c(3L, 5L), .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), index = structure(c(1316520000,
1316523600, 1316527200), class = c("POSIXct", "POSIXt"), tzone = ""), class = "zoo"))
UPDATE: Thanks to G. Grothendieck for the row-deleting solution. In the actual code I followed the advice of Joshua and GSee to get a list of xts objects instead of a list of zoo objects. So my code became:
z=lapply(filename_list, function(fname){
xts(read.zoo(file=fname,header=TRUE,sep = ",",tz = ""))
})
x=do.call.rbind(z)
(As an aside, please note the call to do.call.rbind. This is because rbind.xts has some serious memory issues. See https://stackoverflow.com/a/12029366/841830 )
Then I remove duplicates as a post-process step:
dups <- duplicated(index(x))
if(any(dups)){
  duplicate_timestamps <- index(x)[dups]
  to_delete <- x[(index(x) %in% duplicate_timestamps) & x$Volume <= 1]
  if(nrow(to_delete) > 0){
    # Keep every row that is not in the duplicate_timestamps group,
    # OR is in it but has a volume greater than 1.
    print("Will delete the volume=1 entry")
    x <- x[!(index(x) %in% duplicate_timestamps) | x$Volume > 1]
  }else{
    stop("Duplicate timestamps, and we cannot easily remove them just based on low volume.")
  }
}
If z1 and z2 are your zoo objects then to rbind while removing any duplicates in z2:
rbind( z1, z2[ ! time(z2) %in% time(z1) ] )
Regarding deleting points in a zoo object having specified times, the above already illustrates this but in general if tt is a vector of times to delete:
z[ ! time(z) %in% tt ]
or, if we knew there were a single element in tt, then z[ time(z) != tt ].
rbind.xts will allow duplicate index values, so it could work if you convert to xts first.
x <- lapply(z, as.xts)
y <- do.call(rbind, x)
# keep last value of any duplicates
y <- y[!duplicated(index(y),fromLast=TRUE),]
I think you'll have better luck if you convert to xts first.
a <- structure(c(1.370105, 1.370105, 1.370105, 1.370105, 1), .Dim = c(1L,
5L), index = structure(1316570400, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), class = c("xts",
"zoo"))
b <- structure(c(1.370105, 1.371045, 1.369685, 1.3702, 2230), .Dim = c(1L,
5L), index = structure(1316570400, tzone = "", tclass = c("POSIXct",
"POSIXt")), .indexCLASS = c("POSIXct", "POSIXt"), tclass = c("POSIXct",
"POSIXt"), .indexTZ = "", tzone = "", .Dimnames = list(NULL,
c("Open", "High", "Low", "Close", "Volume")), class = c("xts",
"zoo"))
(comb <- rbind(a, b))
# Open High Low Close Volume
#2011-09-20 21:00:00 1.370105 1.370105 1.370105 1.370105 1
#2011-09-20 21:00:00 1.370105 1.371045 1.369685 1.370200 2230
dupidx <- index(comb)[duplicated(index(comb))] # indexes of duplicates
tail(comb[dupidx], 1) #last duplicate
# now rbind the last duplicated row with all non-duplicated data
rbind(comb[!index(comb) %in% dupidx], tail(comb[dupidx], 1))
I am trying to improve the memory performance for the following example:
Baseline df with 4 rows:
df <- structure(list(sessionid = structure(c(1L, 2L, 3L, 4L), .Label =
c("AAA1", "AAA2","AAA3", "AAA4"), class = "factor"), bitrateinbps = c(10000000,
10000000, 10000000, 10000000), startdate = structure(c(1326758507, 1326758671,
1326759569, 1326760589), class = c("POSIXct", "POSIXt"), tzone = ""), enddate =
structure(c(1326765780, 1326758734, 1326760629, 1326761592), class = c("POSIXct",
"POSIXt"), tzone = "")), .Names = c("sessionid", "bitrateinbps", "startdate",
"enddate"), row.names = c(NA, 4L), class =
"data.frame")
Alternate df with 8 rows:
df <- structure(list(sessionid = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L),
.Label = c("AAA1", "AAA2", "AAA3", "AAA4", "AAA5", "AAA6", "AAA7", "AAA8"),
class = "factor"), bitrateinbps =c(10000000, 10000000, 10000000, 10000000,
10000000, 10000000, 10000000, 10000000), startdate = structure(c(1326758507,
1326758671, 1326759569, 1326760589, 1326761589, 1326762589, 1326763589, 1326764589),
class = c("POSIXct",
"POSIXt"), tzone = ""), enddate = structure(c(1326765780, 1326758734, 1326760629,
1326761592, 1326767592,
1326768592, 1326768700, 1326769592), class = c("POSIXct", "POSIXt"), tzone = "")),
.Names = c("sessionid",
"bitrateinbps", "startdate", "enddate"), row.names = c(NA, 8L), class =
"data.frame")
Run the analysis and check memory usage for the baseline df, then again for the alternate df:
library(xts)
fun0 <- function(i, d) {
idx0 <- seq(d$startdate[i],d$enddate[i],1) # create sequence for index
dat0 <- rep(1,length(idx0)) # create data over sequence
xts(dat0, idx0, dimnames=list(NULL,d$sessionid[i])) # xts object
}
# loop over each row and put each row into its own xts object
xl0 <- lapply(1:NROW(df), fun0, d=df)
# merge all the xts objects
xx0 <- do.call(merge, xl0)
# apply a function (e.g. colMeans) to each 15-minute period
xa0 <- period.apply(xx0, endpoints(xx0, 'minutes', 15), colSums, na.rm=TRUE)/900
xa1 <- t(xa0)
# convert from atomic vector to data frame
xa1 = as.data.frame(xa1)
# bind to df
out1 = cbind(df, xa1)
# print aggregate memory usage statistics
print(paste('R is using', memory.size(), 'MB out of limit', memory.limit(), 'MB'))
# create function to return matrix of memory consumption
object.sizes <- function()
{
return(rev(sort(sapply(ls(envir=.GlobalEnv), function (object.name)
object.size(get(object.name))))))
}
# print to console in table format
object.sizes()
Results are as follows:
4 row df:
xx0 = 292104 Bytes .... do.call(merge, xl0)
xl0 = 154648 Bytes .... lapply(1:NROW(df), fun0, d=df)
8 row df:
xx0 = 799480 Bytes .... do.call(merge, xl0)
xl0 = 512808 Bytes .... lapply(1:NROW(df), fun0, d=df)
I'm looking for something a little more memory-efficient than the merge and lapply steps so I can scale out the number of rows. Suggestions, ideally with comparative results for the alternatives, would be greatly appreciated.
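One direction that may help (a sketch only, with my own helper name overlap_secs, and not benchmarked against the xts version): skip the one-point-per-second expansion entirely and compute each session's overlap with every 15-minute window arithmetically, so the largest object is just an nrow(df)-by-n_windows numeric matrix.
# Window grid at 900-second (15-minute) resolution, in epoch seconds
w0 <- seq(from = as.numeric(as.POSIXct(trunc(min(df$startdate), "hours"))),
          to   = as.numeric(max(df$enddate)), by = 900)
w1 <- w0 + 900
# Seconds of [s, e] that fall inside each window, vectorized over the grid
overlap_secs <- function(s, e) pmax(0, pmin(e, w1) - pmax(s, w0))
m <- t(mapply(overlap_secs,
              as.numeric(df$startdate), as.numeric(df$enddate)))
colnames(m) <- format(as.POSIXct(w0, origin = "1970-01-01"), "%H:%M")
out2 <- cbind(df, m / 900)  # fraction of each window covered, per session
# (may differ from the xts result by one-second endpoint effects)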