Creating a for loop to subset data in R

I have a huge .csv data set with 2 columns (Date_Time and Q.vanda).
This is what the head and tail of the data look like:
> head(mdf.vanda)
Date_Time Q.vanda
1 1969-12-05 21:00:00 0
2 1969-12-05 21:01:00 4
3 1969-12-05 21:05:00 11
4 1969-12-05 21:20:00 17
5 1969-12-05 22:45:00 27
6 1969-12-05 22:55:00 23
> tail(mdf.vanda)
Date_Time Q.vanda
165738 2016-01-19 10:15:00 2995.25
165739 2016-01-19 10:30:00 2858.04
165740 2016-01-19 10:45:00 2956.94
165741 2016-01-19 11:00:00 2972.52
165742 2016-01-19 11:15:00 2776.99
165743 2016-01-19 11:30:00 3082.53
There are 48 years of data in between, and I want to create a for loop to subset them by year (e.g. from 1969/10/01 to 1970/10/01, 1970/10/01 to 1971/10/01, etc.).
I wrote some code, but it gives me an error that I am not able to resolve. I am pretty new to R, so feel free to suggest other code that you might think is more efficient for my purpose.
code:
# 48 season boundaries: '1969/10/01' through '2016/10/01'
cut <- as.POSIXct(strptime(paste0(1969:2016, "/10/01"), format = "%Y/%m/%d"))
df.sub <- as.data.frame(matrix(data=NA,nrow=14496, ncol=96)) #nrow = (31+30+31+31+28)*(4*24)[days * readings/day] , ncol = (48*2)[Seasons*cols]
i.odd <- seq(1,49, by=2)
for (i in 1:48) {
  df.sub[1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= cut[i] &
                                      mdf.vanda$Date_Time < cut[i+1]]),
         i.odd[i]:(i.odd[i]+1)] <-
    subset(mdf.vanda, mdf.vanda$Date_Time > cut[i] & mdf.vanda$Date_Time < cut[i+1])
}
Error:
Error in [<-.data.frame(*tmp*, 1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= :
replacement element 1 has 1595 rows, need 1596

You can split your data as shown below. (Incidentally, the error in your loop comes from the mismatched comparison operators: you size the target block with >= cut[i] but fill it with a subset using > cut[i], so the replacement is one row short, hence "1595 rows, need 1596".)
split(mdf.vanda,
      findInterval(as.Date(mdf.vanda$Date_Time),
                   seq(as.Date("1969-10-01"), as.Date("2016-10-01"), "1 year")))
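Unpacked with comments, as a sketch (assuming Date_Time is POSIXct, as in the question):
# Season boundaries: 1 Oct 1969, 1 Oct 1970, ..., 1 Oct 2016
breaks <- seq(as.Date("1969-10-01"), as.Date("2016-10-01"), by = "1 year")
# Bin index for each reading (1 through 47 for data spanning Dec 1969 to Jan 2016)
season <- findInterval(as.Date(mdf.vanda$Date_Time), breaks)
# One data frame per season, in a named list
by.season <- split(mdf.vanda, season)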

There is no need for a loop here. Base R has the cut function, which performs this very operation significantly faster than a loop, and you already have the break points defined in your "cut" variable.
#cut <- as.POSIXct(c('1969/10/01', ... ,'2016/10/01'),format = "%Y/%m/%d")
mytime <- cut(mdf.vanda$Date_Time, breaks = cut, include.lowest = TRUE)
The variable "mytime" is a factor the length of your data frame whose levels label the bin each observation falls into.
You could then use the split function to break your data frame into a list of data frames, or use the group_by function from the dplyr package for additional data processing, as sketched below.
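A minimal sketch of both routes, reusing the "mytime" factor from above (the mean of Q.vanda is just an illustrative summary):
# Base R: one data frame per season
seasons <- split(mdf.vanda, mytime)

# dplyr: per-season summaries
library(dplyr)
mdf.vanda %>%
  mutate(season = mytime) %>%
  group_by(season) %>%
  summarise(mean.Q = mean(Q.vanda, na.rm = TRUE))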

I suggest you have a look at the convenient quantmod package. Once you have time series data, you can use the apply.yearly function (from xts, which quantmod loads) to apply any function to every year of data.
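A sketch, assuming the data frame is first converted to an xts object; note that apply.yearly splits on calendar years, so the October-to-October seasons in the question would still need custom break points:
library(quantmod)   # loads xts, which provides apply.yearly
q <- xts(mdf.vanda$Q.vanda, order.by = mdf.vanda$Date_Time)
apply.yearly(q, mean)   # e.g. mean discharge per calendar year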

I want to understand why lapply exhausts memory but a for loop doesn't

I am working in R and trying to understand the best way to join data frames when one of them is very large.
I have a data frame which is not excruciatingly large but also not small (~80K observations of 8 variables, 144 MB). I need to match observations from this data frame to observations from another smaller data frame on the basis of a date range. Specifically, I have:
events.df <- data.frame(
  individual = c('A','B','C','A','B','C'),
  event = c(1,1,1,2,2,2),
  time = as.POSIXct(c('2014-01-01 08:00:00','2014-01-05 13:00:00',
                      '2014-01-10 07:00:00','2014-05-01 01:00:00',
                      '2014-06-01 12:00:00','2014-08-01 10:00:00'),
                    format="%Y-%m-%d %H:%M:%S"))
trips.df <- data.frame(
  individual = c('A','B','C'),
  trip = c('x1A','CA1B','XX78'),
  trip_start = as.POSIXct(c('2014-01-01 06:00:00','2014-01-04 03:00:00',
                            '2014-01-08 12:00:00'), format="%Y-%m-%d %H:%M:%S"),
  trip_end = as.POSIXct(c('2014-01-03 06:00:00','2014-01-06 03:00:00',
                          '2014-01-11 12:00:00'), format="%Y-%m-%d %H:%M:%S"))
In my case events.df contains around 80,000 unique events and I am looking to match them to events from the trips.df data frame, which has around 200 unique trips. Each trip has a unique trip identifier ('trip'). I would like to match based on whether the event took place during the date range defining a trip.
First, I tried fuzzy_inner_join from the fuzzyjoin package. It works great in principle:
fuzzy_inner_join(events.df, trips.df,
                 by = c('individual'='individual', 'time'='trip_start', 'time'='trip_end'),
                 match_fun = list(`==`, `>=`, `<=`))
individual.x event time individual.y trip trip_start trip_end
1 A 1 2014-01-01 08:00:00 A x1A 2014-01-01 06:00:00 2014-01-03 06:00:00
2 B 1 2014-01-05 13:00:00 B CA1B 2014-01-04 03:00:00 2014-01-06 03:00:00
3 C 1 2014-01-10 07:00:00 C XX78 2014-01-08 12:00:00 2014-01-11 12:00:00
but runs out of memory when I try to apply it to the larger data frames.
Here is a second solution I cobbled together:
trip.match <- function(tripid){
  individual <- trips.df$individual[trips.df$trip == tripid]
  start <- trips.df$trip_start[trips.df$trip == tripid]
  end <- trips.df$trip_end[trips.df$trip == tripid]
  tmp <- events.df[events.df$individual == individual &
                   events.df$time >= start &
                   events.df$time <= end, ]
  tmp$trip <- tripid
  return(tmp)
}
result <- data.frame(rbindlist(lapply(unique(trips.df$trip), trip.match)))  # rbindlist() is from data.table
This solution also breaks down because the list object returned by lapply is 25GB and the attempt to cast this list to a data frame also exhausts the available memory.
I have been able to do what I need to do using a for loop. Basically, I append a column onto events.df and loop through the unique trip identifiers and populate the new column in events.df accordingly:
events.df$trip <- NA
for (i in unique(trips.df$trip)) {
  individual <- trips.df$individual[trips.df$trip == i]
  start <- min(trips.df$trip_start[trips.df$trip == i])
  end <- max(trips.df$trip_end[trips.df$trip == i])
  events.df$trip[events.df$individual == individual &
                 events.df$time >= start &
                 events.df$time <= end] <- i
}
> events.df
individual event time trip
1 A 1 2014-01-01 08:00:00 x1A
2 B 1 2014-01-05 13:00:00 CA1B
3 C 1 2014-01-10 07:00:00 XX78
4 A 2 2014-05-01 01:00:00 <NA>
5 B 2 2014-06-01 12:00:00 <NA>
6 C 2 2014-08-01 10:00:00 <NA>
My question is this: I'm not a very advanced R programmer so I expect there is a more memory efficient way to accomplish what I'm trying to do. Is there?
Try creating a table that expands the trip ranges by hour and then merge it with the events. Here is an example using data.table, since data.table outperforms data.frame for larger datasets:
library(data.table)
tripsV <- unique(trips.df$trip)
tripExpand <- function(t){
  dateV <- seq(trips.df$trip_start[trips.df$trip == t],
               trips.df$trip_end[trips.df$trip == t],
               by = 'hour')
  data.table(trip = t, time = dateV)
}
trips.dt <- rbindlist(lapply(tripsV, tripExpand))
merge(events.df, trips.dt, by = 'time')
Output:
time individual event trip
1 2014-01-01 08:00:00 A 1 x1A
2 2014-01-05 13:00:00 B 1 CA1B
3 2014-01-10 07:00:00 C 1 XX78
So you are basically translating the trip table into a trip-hour long-form panel, which makes for easy merging with the event dataset. I haven't benchmarked it against your current method, but my hunch is that it will be more memory- and CPU-efficient.
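If memory stays tight, data.table's foverlaps is another option worth sketching for this kind of range join: it avoids materialising the trip-hour panel entirely and also matches on individual, not just on time. A sketch, assuming both tables fit in memory:
library(data.table)
events.dt <- as.data.table(events.df)
trips.dt2 <- as.data.table(trips.df)
# foverlaps needs an interval on both sides; events are instants,
# so give them zero-width intervals
events.dt[, c("start", "end") := .(time, time)]
setkey(trips.dt2, individual, trip_start, trip_end)
foverlaps(events.dt, trips.dt2,
          by.x = c("individual", "start", "end"),
          nomatch = NA)   # keep unmatched events, like a left join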
Consider splitting your data with data.table's split method, running fuzzy_inner_join on each subset, then calling rbindlist to bind all data frame elements together into a single output.
library(data.table)
library(fuzzyjoin)

df_list <- split(as.data.table(events.df), by = "individual")
fuzzy_list <- lapply(df_list, function(sub.df) {
  fuzzy_inner_join(sub.df, trips.df,
                   by = c('individual'='individual', 'time'='trip_start', 'time'='trip_end'),
                   match_fun = list(`==`, `>=`, `<=`))
})
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(df_list); gc()
final_df <- rbindlist(fuzzy_list)
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(fuzzy_list); gc()

Map a list of events (instants) to a list of periods (intervals) in R (with or without lubridate)

I have two data frames: one containing time periods marked with character unique IDs, and another containing events with another set of unique IDs associated with them.
Period DF (code):
periodID <- c("P_UID_00", "P_UID_01", "P_UDI_02", "P_UID_03")
periodStart <- as.POSIXct(c("2016/02/10 19:00", "2016/02/11 19:00",
"2016/02/12 19:00", "2016/02/13 19:00"))
periodEnd <- as.POSIXct(c("2016/02/10 21:00", "2016/02/11 21:00",
"2016/02/12 21:00", "2016/02/13 21:00"))
periodDF <- data.frame(periodID, periodStart, periodEnd)
Period DF:
periodID periodStart periodEnd
1 P_UID_00 2016-02-10 19:00:00 2016-02-10 21:00:00
2 P_UID_01 2016-02-11 19:00:00 2016-02-11 21:00:00
3 P_UDI_02 2016-02-12 19:00:00 2016-02-12 21:00:00
4 P_UID_03 2016-02-13 19:00:00 2016-02-13 21:00:00
Event DF (code):
eventID <- c("E_UID_00", "E_UID_01", "E_UDI_02", "E_UID_03")
eventTime <- as.POSIXct(c("2016/02/09 19:55:01", "2016/02/11 19:12:01",
"2016/02/11 20:22:01", "2016/02/15 19:00:01"))
eventDF <- data.frame(eventID, eventTime)
Event DF:
eventID eventTime
1 E_UID_00 2016-02-09 19:55:01
2 E_UID_01 2016-02-11 19:12:01
3 E_UDI_02 2016-02-11 20:22:01
4 E_UID_03 2016-02-15 19:00:01
I want to map the event times in the second DF to the time periods in the first DF in order to match the ID of the event to the ID of the period. Essentially, the result table I want should look like:
eventID periodID
1 E_UID_00 NA
2 NA P_UID_00
3 E_UID_01 P_UID_01
4 E_UDI_02 P_UID_01
5 NA P_UID_02
6 NA P_UID_03
7 E_UID_03 NA
I suppose this can be achieved by using lubridate to transform the start and end columns in the first DF into intervals and then using some form of apply with an instant %within% interval check, but I am not really familiar with lubridate and did not manage to produce working code.
Additional considerations:
- periods are completely arbitrary and can last from seconds to years
- periods never overlap, so this is not an issue
- more than one event could be associated with a time period
- it is possible for DFs to contain unassociatable events and time periods
- the solution must not include loops
- does not have to be solved with lubridate, in fact a solution with the base R will be even more welcome.
I actually managed to come up with code that produces exactly what I wanted using lubridate. So if anyone knows how to do this in base R, or simply a better way than the one suggested below, sharing it will be greatly appreciated! (A base R sketch is appended at the end of this answer.)
First off, the start and end times in the period DF should be converted to lubridate intervals:
library(lubridate)
intervalsP <- interval(periodStart, periodEnd)  # interval() builds intervals from start/end times
Step 2: A function should be created for checking whether an instant is located within a list of intervals. The only reason I have created a separate function is to be able to use it with apply:
PeriodAssign <- function(x, y){
  # x - instants
  # y - intervals
  variable1 <- mapply(`%within%`, x, y)
  if (length(y[variable1]) != 0) {
    as.character(y[variable1])
  } else {
    NA
  }
}
NOTE: I had to coerce the intervals to character because otherwise they were coerced to their lengths in seconds by the apply function, which makes them useless for matching; i.e. all four intervals in this example have the same length.
Step 3: The function can then be used on the event DF, and both DFs can then be merged to produce the DF I was looking for:
eventDF$intervals <- lapply(eventTime, PeriodAssign, intervalsP)
periodDF$intervals <- as.character(intervalsP)
mergedDF <- merge(periodDF, eventDF, by = "intervals")
presentableDF <- mergedDF[, c(2, 5)]
# adding in the unmatched Periods and Events
tDF1 <- data.frame(periodDF[!(periodDF$periodID %in% presentableDF$periodID), 1], NA)
colnames(tDF1) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF1)
tDF2 <- data.frame(NA, eventDF[!(eventDF$eventID %in% presentableDF$eventID), 1])
colnames(tDF2) <- c("periodID", "eventID")
presentableDF <- rbind(presentableDF, tDF2)
presentableDF <- presentableDF[order(presentableDF[,1]),]
The resulting DF looks like:
> presentableDF
periodID eventID
3 P_UID_00 <NA>
1 P_UID_01 E_UID_01
2 P_UID_01 E_UDI_02
4 P_UID_02 <NA>
5 P_UID_03 <NA>
6 <NA> E_UID_00
7 <NA> E_UID_03
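As promised, a base R sketch with no lubridate at all. It leans on the stated guarantee that periods never overlap, so each event matches at most one period:
# For each event, find the period (if any) whose [start, end] contains it
idx <- vapply(seq_len(nrow(eventDF)), function(i) {
  tm  <- eventDF$eventTime[i]
  hit <- which(tm >= periodDF$periodStart & tm <= periodDF$periodEnd)
  if (length(hit)) hit[1] else NA_integer_
}, integer(1))

matched <- data.frame(eventID  = eventDF$eventID,
                      periodID = periodDF$periodID[idx])

# Append the periods that caught no event, mirroring the desired output
unmatched <- !(periodDF$periodID %in% matched$periodID)
rbind(matched,
      data.frame(eventID = NA, periodID = periodDF$periodID[unmatched]))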

How to insert zeros in a data frame in R

I have the following data.frame, DF. It is already in R; we do not need to load it with read.csv or anything similar.
timeStamp count
1 2014-01-15 14:30:00 2
2 2014-01-15 16:30:00 3
3 2014-01-15 17:00:00 2
4 2014-01-15 17:15:00 1
I have an "independent seq of timestamps", say tmpSeq from 2014-01-15 14:00:00 to 2014-01-22 13:00:00. I want to get a List of counts from this data.frame and insert zeros for timeStamp not present in data.frame but in the tmpSeq
Assuming your sequence is in 15 minute increments, merge against the full grid and zero-fill the gaps:
DF <- data.frame(timeStamp=as.POSIXct(c("2014-01-15 14:30:00","2014-01-15 16:30:00",
                                        "2014-01-15 17:00:00","2014-01-15 17:15:00")),
                 count=c(2,3,2,1))
tmpSeq <- seq(as.POSIXct("2014-01-15 14:00:00"),
              as.POSIXct("2014-01-22 13:00:00"), by="15 mins")
DF <- merge(DF, data.frame(timeStamp=tmpSeq), all=TRUE)  # merge on timeStamp only
DF$count[is.na(DF$count)] <- 0                           # zero-fill the missing readings
should do it. (Merging in a data frame that already carries count=0 would instead duplicate the timestamps that have real counts, because merge matches on both shared columns.)
Generally, it is better to work with a time series package when you deal with time series objects. Using the xts package you can use rbind to merge two time series.
First I create the short time series.
Then I generate the long ts, assuming it is a regular ts with a 15-minute interval.
Finally I merge the 2 series using rbind.
Here is my code:
library(xts)
dat = as.xts(read.zoo(text='
time Stamp count ## a small hack here to read your data
1 2014-01-15 14:30:00 2
2 2014-01-15 16:30:00 3
3 2014-01-15 17:00:00 2
4 2014-01-15 17:15:00 1',
header=TRUE,
index=1:2,
format='%Y-%m-%d %H:%M:%S',tz=''))
## generate the long ts
tmpSeq <- seq.POSIXt(as.POSIXct('2014-01-15 14:00:00'),
                     as.POSIXct('2014-01-22 13:00:00'), by = '15 mins')
tmpSeq <- xts(x = rep(0, length(tmpSeq)), tmpSeq)
## insert dat values in tmpSeq
rbind(tmpSeq, dat)
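One caveat: rbind keeps both rows when a timestamp occurs in both series (e.g. 14:30 would appear once with 0 and once with 2). If you want a single observation per timestamp, a sketch of an alternative is an outer merge followed by a row-wise sum, which works here because the long series is all zeros:
m <- merge(tmpSeq, dat)                              # outer join on the index
counts <- xts(rowSums(coredata(m), na.rm = TRUE), index(m))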
It seems what you are looking for is a merge. Look at this post: How to join (merge) data frames (inner, outer, left, right)?
You need a right outer join (if you make tmpSeq your right data frame).
Edit:
Adding the merge statement to make the answer clearer:
Right outer: merge(x = DF, y = data.frame(timeStamp = tmpSeq), all.y = TRUE), then set the resulting NA counts to 0.

Split Date to Day, Month and Year for ffdf Data in R

I'm using R's ff package with an ffdf object named MyData (dim = c(10819740, 16)). I'm trying to split the variable Date into Day, Month and Year, and add these 3 variables to the existing ffdf MyData.
For instance: my date column is named SalesReportDate, with virtual and physical vmode double after I changed SalesReportDate with as.Date(, format = "%m/%d/%Y").
Example of SalesReportDate are as follow:
> B
SalesReportDate
1 2013-02-01
2 2013-05-02
3 2013-05-04
4 2013-10-06
5 2013-15-10
6 2013-11-01
7 2013-11-03
8 2013-30-02
9 2013-12-12
10 2014-01-01
I've referred to Split date into different columns for year, month and day and tried to apply it, but I keep getting errors.
So, is there any way for me to do this? Thanks in advance.
Credit to @jwijffels for this great solution:
require(ffbase)
# with() on an ffdf evaluates the expression chunk-wise; by = 250000 sets the chunk size
MyData$SalesReportDateYear  <- with(MyData["SalesReportDate"], format(SalesReportDate, "%Y"), by = 250000)
MyData$SalesReportDateMonth <- with(MyData["SalesReportDate"], format(SalesReportDate, "%m"), by = 250000)
MyData$SalesReportDateDay   <- with(MyData["SalesReportDate"], format(SalesReportDate, "%d"), by = 250000)
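Note that format() yields character vectors. If numeric columns are wanted instead, the same chunked pattern should accept a cast (a sketch, untested against ff):
MyData$SalesReportDateYear <- with(MyData["SalesReportDate"],
                                   as.integer(format(SalesReportDate, "%Y")),
                                   by = 250000)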

How to merge pairs of Dates and values contained in a single csv

We have a csv file with dates in Excel serial format and NAVs for Manager A and Manager B, as follows:
Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521
We want to create a zoo object with the following structure [Dates, Manager A Nav, Manager B Nav].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck on the fact that the dates are in Excel format and don't know how to fix that. Is the problem set up correctly?
If all you want to do is convert the dates to proper dates, you can do this easily enough. The thing you need to know is the origin date: your numbers represent the integer and fractional number of days that have passed since it. Usually this is Jan 0, 1900 (go figure), but be careful, as I don't think this is always the case. You can try this...
# Excel origin is day 0 on Jan 0 1900, but treats 1900 as leap year so...
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
If you are interested in keeping the times as well, you can use as.POSIXct, but you must also specify the timezone (UTC by default):
data$Date <- as.POSIXct(data$Date, origin = "1899/12/30" )
head(data)
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877
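To get from here to the zoo object the question asks for ([Date, Manager A NAV, Manager B NAV]), one sketch is to build a zoo series per manager and outer-merge them on the date index. This assumes both date columns hold the as.Date versions from the first snippet above (column names as produced by read.csv); the trailing blank rows in Manager A's columns must be dropped before indexing:
library(zoo)
okA <- !is.na(data$Date)                 # Manager A's columns are padded with NAs
zA  <- zoo(data$Manager.A[okA], order.by = data$Date[okA])
zB  <- zoo(data$Manager.B,      order.by = data$Date.1)
nav <- merge(A = zA, B = zB)             # outer join; NA where a manager has no NAV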
