How to insert zeros in a data frame in R

I have the following data.frame, DF.
DF is already in R; we do not need to load it with read.csv or anything similar.
timeStamp count
1 2014-01-15 14:30:00 2
2 2014-01-15 16:30:00 3
3 2014-01-15 17:00:00 2
4 2014-01-15 17:15:00 1
I have an "independent seq of timestamps", say tmpSeq from 2014-01-15 14:00:00 to 2014-01-22 13:00:00. I want to get a List of counts from this data.frame and insert zeros for timeStamp not present in data.frame but in the tmpSeq

Assuming your sequence is in 15-minute increments:
DF <- data.frame(timeStamp = as.POSIXct(c("2014-01-15 14:30:00", "2014-01-15 16:30:00",
                                          "2014-01-15 17:00:00", "2014-01-15 17:15:00")),
                 count = c(2, 3, 2, 1))
tmpSeq <- seq(as.POSIXct("2014-01-15 14:00:00"),
              as.POSIXct("2014-01-22 13:00:00"), by = "15 mins")
## merge on timeStamp only: merging a count = 0 column as well would
## duplicate every timestamp already present in DF
DF <- merge(DF, data.frame(timeStamp = tmpSeq), all = TRUE)
DF$count[is.na(DF$count)] <- 0  # zero-fill the timestamps that were missing
should do it.
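If all you need is the vector of counts aligned to tmpSeq, rather than a merged data frame, match() is a direct alternative; a minimal sketch, assuming DF still holds the original four rows:
counts <- DF$count[match(tmpSeq, DF$timeStamp)]  # NA where tmpSeq has no match in DF
counts[is.na(counts)] <- 0                       # insert the zeros
result <- data.frame(timeStamp = tmpSeq, count = counts)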

Generally, it is better to work with a dedicated time-series package when you deal with time-series objects. Using the xts package, you can use rbind to combine two time series.
First I create the short time series.
Then I generate the long series, assuming it is a regular series with a 15-minute interval.
Finally I combine the two series using rbind.
Here is my code:
library(xts)
dat <- as.xts(read.zoo(text = '
time Stamp count ## a small hack here to read your data
1 2014-01-15 14:30:00 2
2 2014-01-15 16:30:00 3
3 2014-01-15 17:00:00 2
4 2014-01-15 17:15:00 1',
  header = TRUE,
  index = 1:2,  # paste columns 1:2 (date and clock time) into the index
  format = '%Y-%m-%d %H:%M:%S', tz = ''))
## generate the long series, filled with zeros
tmpSeq <- seq.POSIXt(as.POSIXct('2014-01-15 14:00:00'),
                     as.POSIXct('2014-01-22 13:00:00'), by = '15 mins')
tmpSeq <- xts(x = rep(0, length(tmpSeq)), order.by = tmpSeq)
## insert dat values in tmpSeq
rbind(tmpSeq, dat)
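Note that rbind keeps both rows when a timestamp occurs in both series, so the four timestamps from dat appear twice: once with 0 and once with the real count. A small de-duplication sketch (it assumes rbind.xts keeps argument order on identical timestamps, which current xts does), plus the conversion back to a data frame if one is wanted:
res <- rbind(tmpSeq, dat)
res <- res[!duplicated(index(res), fromLast = TRUE)]  # on ties, keep the row from dat
out <- data.frame(timeStamp = index(res), count = as.numeric(res))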

It seems what you are looking for is a 'merge'. Look at this post: How to join (merge) data frames (inner, outer, left, right)?
You need a right outer join (if you make tmpSeq your right data frame).
Edit:
Adding the merge statement to make the answer clearer. Merge on timeStamp only; if a zero-filled count column is part of the join key, none of the DF rows match and every count comes back 0:
DF2 <- merge(x = DF, y = data.frame(timeStamp = tmpSeq), by = "timeStamp", all.y = TRUE)  # right outer
DF2$count[is.na(DF2$count)] <- 0  # zero-fill the timestamps missing from DF
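The same right join reads naturally with dplyr as well; a sketch, assuming dplyr is installed and DF holds the original four rows:
library(dplyr)
data.frame(timeStamp = tmpSeq) %>%
  left_join(DF, by = "timeStamp") %>%  # full grid on the left == right join onto DF
  mutate(count = coalesce(count, 0))   # zero-fill the missing counts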

Related

I want to understand why lapply exhausts memory but a for loop doesn't

I am working in R and trying to understand the best way to join data frames when one of them is very large.
I have a data frame which is not excruciatingly large but also not small (~80K observations of 8 variables, 144 MB). I need to match observations from this data frame to observations from another smaller data frame on the basis of a date range. Specifically, I have:
events.df <- data.frame(
  individual = c('A', 'B', 'C', 'A', 'B', 'C'),
  event = c(1, 1, 1, 2, 2, 2),
  time = as.POSIXct(c('2014-01-01 08:00:00', '2014-01-05 13:00:00', '2014-01-10 07:00:00',
                      '2014-05-01 01:00:00', '2014-06-01 12:00:00', '2014-08-01 10:00:00'),
                    format = "%Y-%m-%d %H:%M:%S"))
trips.df <- data.frame(
  individual = c('A', 'B', 'C'),
  trip = c('x1A', 'CA1B', 'XX78'),
  trip_start = as.POSIXct(c('2014-01-01 06:00:00', '2014-01-04 03:00:00', '2014-01-08 12:00:00'),
                          format = "%Y-%m-%d %H:%M:%S"),
  trip_end = as.POSIXct(c('2014-01-03 06:00:00', '2014-01-06 03:00:00', '2014-01-11 12:00:00'),
                        format = "%Y-%m-%d %H:%M:%S"))
In my case events.df contains around 80,000 unique events and I am looking to match them to events from the trips.df data frame, which has around 200 unique trips. Each trip has a unique trip identifier ('trip'). I would like to match based on whether the event took place during the date range defining a trip.
First, I have tried fuzzy_inner_join from the fuzzyjoin library. It works great in principle:
fuzzy_inner_join(events.df, trips.df,
                 by = c('individual' = 'individual', 'time' = 'trip_start', 'time' = 'trip_end'),
                 match_fun = list(`==`, `>=`, `<=`))
individual.x event time individual.y trip trip_start trip_end
1 A 1 2014-01-01 08:00:00 A x1A 2014-01-01 06:00:00 2014-01-03 06:00:00
2 B 1 2014-01-05 13:00:00 B CA1B 2014-01-04 03:00:00 2014-01-06 03:00:00
3 C 1 2014-01-10 07:00:00 C XX78 2014-01-08 12:00:00 2014-01-11 12:00:00
but runs out of memory when I try to apply it to the larger data frames.
Here is a second solution I cobbled together:
trip.match <- function(tripid){
  individual <- trips.df$individual[trips.df$trip == tripid]
  start <- trips.df$trip_start[trips.df$trip == tripid]
  end <- trips.df$trip_end[trips.df$trip == tripid]
  tmp <- events.df[events.df$individual == individual &
                     events.df$time >= start &
                     events.df$time <= end, ]
  tmp$trip <- tripid
  return(tmp)
}
result <- data.frame(rbindlist(lapply(unique(trips.df$trip), trip.match)))
This solution also breaks down because the list object returned by lapply is 25GB and the attempt to cast this list to a data frame also exhausts the available memory.
I have been able to do what I need to do using a for loop. Basically, I append a column onto events.df and loop through the unique trip identifiers and populate the new column in events.df accordingly:
events.df$trip <- NA
for(i in unique(trips.df$trip)){
  individual <- trips.df$individual[trips.df$trip == i]
  start <- min(trips.df$trip_start[trips.df$trip == i])
  end <- max(trips.df$trip_end[trips.df$trip == i])
  events.df$trip[events.df$individual == individual &
                   events.df$time >= start &
                   events.df$time <= end] <- i
}
> events.df
individual event time trip
1 A 1 2014-01-01 08:00:00 x1A
2 B 1 2014-01-05 13:00:00 CA1B
3 C 1 2014-01-10 07:00:00 XX78
4 A 2 2014-05-01 01:00:00 <NA>
5 B 2 2014-06-01 12:00:00 <NA>
6 C 2 2014-08-01 10:00:00 <NA>
My question is this: I'm not a very advanced R programmer so I expect there is a more memory efficient way to accomplish what I'm trying to do. Is there?
Try creating a table that expands the trip ranges by hour and then merging with the events. Here is an example (using data.table, which outperforms data.frame for larger datasets):
library(data.table)
tripsV <- unique(trips.df$trip)
tripExpand <- function(t){
  # one row per hour between this trip's start and end
  dateV <- seq(trips.df$trip_start[trips.df$trip == t],
               trips.df$trip_end[trips.df$trip == t],
               by = 'hour')
  data.table(trip = t, time = dateV)
}
trips.dt <- rbindlist(lapply(tripsV, tripExpand))
# join on time alone; this assumes event times fall exactly on the hour grid
merge(events.df,
      trips.dt,
      by = 'time')
Output:
time individual event trip
1 2014-01-01 08:00:00 A 1 x1A
2 2014-01-05 13:00:00 B 1 CA1B
3 2014-01-10 07:00:00 C 1 XX78
So you are basically translating the trip table into a trip-hour long-form panel dataset, which makes for easy merging with the event dataset. I haven't benchmarked it against your current method, but my hunch is that it will be more memory- and CPU-efficient.
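For what it's worth, data.table's non-equi joins can do this matching directly, with no hour-by-hour expansion at all; a minimal sketch (not part of the original answer; assumes data.table >= 1.9.8):
library(data.table)
ev <- as.data.table(events.df)
tr <- as.data.table(trips.df)
# update join: tag each event with the trip whose window contains it
ev[tr, trip := i.trip, on = .(individual, time >= trip_start, time <= trip_end)]
ev  # events with no matching trip keep trip = NA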
Consider splitting your data with data.table's split method, running fuzzy_inner_join on each subset, then calling rbindlist to bind all data frame elements together into a single output.
library(data.table)
library(fuzzyjoin)
# split.data.table is an S3 method, not an exported function,
# so convert to a data.table first and call split()
df_list <- split(as.data.table(events.df), by = "individual")
fuzzy_list <- lapply(df_list, function(sub.df) {
  fuzzy_inner_join(sub.df, trips.df,
                   by = c('individual' = 'individual', 'time' = 'trip_start', 'time' = 'trip_end'),
                   match_fun = list(`==`, `>=`, `<=`)
  )
})
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(df_list); gc()
final_df <- rbindlist(fuzzy_list)
# REMOVE TEMP OBJECT AND CALL GARBAGE COLLECTOR
rm(fuzzy_list); gc()

Creating a for loop to subset data in R

I have a huge dataset that, in .csv format, has 2 columns (Date_Time and Q.vanda).
This is what the head and tail of the data look like:
> head(mdf.vanda)
Date_Time Q.vanda
1 1969-12-05 21:00:00 0
2 1969-12-05 21:01:00 4
3 1969-12-05 21:05:00 11
4 1969-12-05 21:20:00 17
5 1969-12-05 22:45:00 27
6 1969-12-05 22:55:00 23
> tail(mdf.vanda)
Date_Time Q.vanda
165738 2016-01-19 10:15:00 2995.25
165739 2016-01-19 10:30:00 2858.04
165740 2016-01-19 10:45:00 2956.94
165741 2016-01-19 11:00:00 2972.52
165742 2016-01-19 11:15:00 2776.99
165743 2016-01-19 11:30:00 3082.53
There are 48 years of data in between, and I want to create a for loop to subset them by year (e.g. from 1969/10/01 to 1970/10/01, from 1970/10/01 to 1971/10/01, etc.).
I wrote some code, but it gives an error that I am not able to resolve. I am pretty new to R, so feel free to suggest other code that you think is more efficient for my purpose.
code:
cut <- as.POSIXct(paste0(1969:2016, "/10/01"), format = "%Y/%m/%d")
# the 48 season boundaries, '1969/10/01' through '2016/10/01'
df.sub <- as.data.frame(matrix(data = NA, nrow = 14496, ncol = 96))
# nrow = (31+30+31+31+28)*(4*24) [days * readings/day], ncol = 48*2 [seasons * cols]
i.odd <- seq(1, 49, by = 2)
for (i in 1:48) {
  df.sub[1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= cut[i] & mdf.vanda$Date_Time < cut[i+1]]),
         i.odd[i]:(i.odd[i]+1)] <- subset(mdf.vanda, mdf.vanda$Date_Time > cut[i] & mdf.vanda$Date_Time < cut[i+1])
}
Error:
Error in `[<-.data.frame`(`*tmp*`, 1:length(mdf.vanda$Date_Time[mdf.vanda$Date_Time >= :
  replacement element 1 has 1595 rows, need 1596
You can split your data as shown:
split(mdf.vanda,
      findInterval(as.Date(mdf.vanda$Date_Time),
                   seq(as.Date("1969-10-01"), as.Date("2016-10-01"), "1 year")))
There is no need for a loop here. Base R has the cut function to perform this very operation, significantly faster than a loop, since you already have the break points defined in your "cut" variable:
#cut <- as.POSIXct(c('1969/10/01', ... ,'2016/10/01'), format = "%Y/%m/%d")
mytime <- cut(mdf.vanda$Date_Time, breaks = cut, include.lowest = TRUE)
The variable "mytime" is a vector the length of your data frame, with a label binning each row.
You could then use the split function to break your data frame into a list of data frames (see the sketch below), or use the group_by function from the dplyr library for additional data processing.
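Continuing the sketch above, splitting on those bins is one line (by_year is a hypothetical name):
by_year <- split(mdf.vanda, mytime)  # named list, one data frame per Oct-to-Oct interval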
I suggest you have a look at the convenient quantmod package. Once you have a time-series object, you can use the apply.yearly function (from the xts package, which quantmod loads) to apply any function to every year of data.
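A minimal sketch of that route, assuming mdf.vanda$Date_Time is already POSIXct; note that apply.yearly splits on calendar years, not the October-to-October ranges in the question:
library(xts)
x <- xts(mdf.vanda$Q.vanda, order.by = mdf.vanda$Date_Time)
apply.yearly(x, mean)  # e.g. the yearly mean of Q.vanda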

Interpolation of constrained gaps

In continuation of the following question:
Efficient dynamic addition of rows in dataframe and dynamic calculation in R
I have the following table:
Lines <- "D1,Diff
1,20/11/2014 16:00,0.01
2,20/11/2014 17:00,0.02
3,20/11/2014 19:00,0.03 <-- Gap I
4,21/11/2014 16:00,0.04
5,21/11/2014 17:00,0.06 <-- Gap II
6,21/11/2014 20:00,0.10"
As can be seen, there is a gap at 18:00 on 20/11/2014 and two gaps, at 18:00 and 19:00, on 21/11/2014.
An additional gap lies between 20/11/2014 19:00 and 21/11/2014 16:00.
I would like to interpolate (fill in) values only where the gap between rows is up to 3 hours.
The required result should be as followed (in dataframe format):
Lines <- "D1,Diff
1,20/11/2014 16:00,0.01
2,20/11/2014 17:00,0.02
3,20/11/2014 18:00,0.025<-- Added lines
4,20/11/2014 19:00,0.03
5,21/11/2014 16:00,0.04
6,21/11/2014 17:00,0.06
6,21/11/2014 18:00,0.073 <--
6,21/11/2014 19:00,0.086 <--
6,21/11/2014 20:00,0.10"
Here is the code I currently use; the problem is that it also fills the gap between days, which is longer than 3 hours:
library(zoo)
z <- read.zoo(text = Lines, tz = "", format = "%d/%m/%Y %H:%M", sep = ",")
interpolated1 <- na.approx(z, xout = seq(start(z), end(z), "hours"))
We can merge z with a zero-width zoo series z0 based on a grid of hours. This transforms z into an hourly series with NAs. Then use the maxgap argument of na.approx, as shown below, to fill only the desired gaps. This still leaves NAs in the longer gaps, so remove those using na.omit.
fortify.zoo(z3) would transform the result into a data frame, but since z3 (the resulting series with only gaps up to length 3 filled) is a time series, that is probably not a good idea; it is better to leave it as a zoo object so that you can use all the facilities of zoo.
No packages other than zoo are used.
z0 <- zoo(, seq(start(z), end(z), "hours"))         # zero-width series on an hourly grid
z3 <- na.omit(na.approx(merge(z, z0), maxgap = 3))  # fill gaps of at most 3 hours
giving:
> z3
2014-11-20 16:00:00 2014-11-20 17:00:00 2014-11-20 18:00:00 2014-11-20 19:00:00
0.01000000 0.02000000 0.02500000 0.03000000
2014-11-21 16:00:00 2014-11-21 17:00:00 2014-11-21 18:00:00 2014-11-21 19:00:00
0.04000000 0.06000000 0.07333333 0.08666667
2014-11-21 20:00:00
0.10000000
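If a data frame is required after all, the conversion mentioned above is a single call (the column names below are chosen for illustration):
df3 <- fortify.zoo(z3)  # a data frame with the index plus the series values
names(df3) <- c("D1", "Diff")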
Source 1: Creating a specific sequence of date/times in R (answer by mnel, Sep 13 2012; edited by Matt Dowle, Sep 13 2012)
&
Source 2: Creating regular 15-minute time-series from irregular time-series (answer by mnel, Sep 13 2012; edited by Dirk Eddelbuettel, May 3 2012)
library(zoo)
library(xts)
library(data.table)
library(devtools)
devtools::install_github("iembry-USGS/ie2misc")
library(ie2misc)
# iembry released a version of ie2misc so you should be able to install
# the package now
# `na.interp1` is a function that combines zoo's `na.approx` and pracma's
# `interp1`
The rest of the code starts after the creation of your z zoo object
## Source 1 begins
startdate <- as.character((start(z)))
# set the start date/time as the 1st entry in the time series and make
# this a character vector.
start <- as.POSIXct(startdate)
# transform the character vector to a POSIXct object
enddate <- as.character((end(z)))
# set the end date/time as the last entry in the time series and make
# this a character vector.
end <- as.POSIXct(enddate)
# transform the character vector to a POSIXct object
gridtime <- seq(from = start, by = 3600, to = end)
# create a sequence beginning with the start date/time with a 60 minute
# interval ending at the end date/time
## Source 1 ends
## Source 2 begins
timeframe <- data.frame(rep(NA, length(gridtime)))
# create 1 NA column spaced out by the gridtime to complement the single
# column of z
timelength <- xts(timeframe, order.by = gridtime)
# create a xts time series object using timeframe and gridtime
zDate <- merge(timelength, z)
# merge the z zoo object and the timelength xts object
## Source 2 ends
The next steps involve the process of interpolating your data as requested.
Lines <- as.data.frame(zDate)
# to data.frame from zoo
Lines[, "D1"] <- as.POSIXct(rownames(Lines))
# create a column named D1 holding the date-times as POSIXct
# (so it can be converted to numeric for the interpolation below)
Lines <- setDT(Lines)
# create data.table out of data.frame
setcolorder(Lines, c(3, 2, 1))
# set the column order as the 3rd column followed by the 2nd and 1st
# columns
Lines <- Lines[, 3 := NULL]
# remove the 3rd column
setnames(Lines, 2, "diff")
# change the name of the 2nd column to diff
Lines <- setDF(Lines)
# return to data.frame
rowsinterps1 <- which(is.na(Lines$diff))
# index of rows of Lines that have NA (to be interpolated)
xi <- as.numeric(Lines[rowsinterps1, 1])
# the date-times for diff to be interpolated, in numeric format
interps1 <- na.interp1(as.numeric(Lines$D1), Lines$diff, xi = xi,
                       na.rm = FALSE, maxgap = 3)
# the interpolated values, where only gaps of up to size 3 are filled
Lines[rowsinterps1, 2] <- interps1
# replace the NAs in diff with the interpolated diff values
Lines <- na.omit(Lines) # remove rows with NAs
Lines
This is the Lines data.frame:
Lines
D1 diff
1 2014-11-20 16:00:00 0.01000000
2 2014-11-20 17:00:00 0.02000000
3 2014-11-20 18:00:00 0.02500000
4 2014-11-20 19:00:00 0.03000000
25 2014-11-21 16:00:00 0.04000000
26 2014-11-21 17:00:00 0.06000000
27 2014-11-21 18:00:00 0.07333333
28 2014-11-21 19:00:00 0.08666667
29 2014-11-21 20:00:00 0.10000000

Adding missing rows

The format of my Excel data file is:
day value
01-01-2000 00:00:00 4
01-01-2000 00:01:00 3
01-01-2000 00:02:00 1
01-01-2000 00:04:00 1
I open my file with this:
ts = read.csv(file=pathfile, header=TRUE, sep=",")
How can I add the missing rows, with a zero in the “value” column, into the data frame? Example output:
day value
01-01-2000 00:00:00 4
01-01-2000 00:01:00 3
01-01-2000 00:02:00 1
01-01-2000 00:03:00 0
01-01-2000 00:04:00 1
This is now completely automated in the padr package. Takes only one line of code.
original <- data.frame(
day = as.POSIXct(c("01-01-2000 00:00:00",
"01-01-2000 00:01:00",
"01-01-2000 00:02:00",
"01-01-2000 00:04:00"), format="%m-%d-%Y %H:%M:%S"),
value = c(4, 3, 1, 1))
library(padr)
library(dplyr) # for the pipe operator
original %>% pad %>% fill_by_value(value)
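To see what each step contributes, the pipeline can be split in two (the comments describe the documented behavior of each padr call):
original %>% pad()                           # inserts the missing 2000-01-01 00:03:00 row with value NA
original %>% pad() %>% fill_by_value(value)  # ...and then turns that NA into 0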
See vignette("padr") or this blog post for how it works.
I think this is a more general solution, which relies on creating a sequence of all timestamps, using that as the basis for a new data frame, and then filling in your original values in that df where applicable.
# convert original `day` to POSIXct
ts$day <- as.POSIXct(ts$day, format="%m-%d-%Y %H:%M:%S", tz="GMT")
# generate a sequence of all minutes in your first day;
# 946684800 is 2000-01-01 00:00:00 GMT in seconds since the epoch
minAsNumeric <- 946684800 + seq(0, 60*60*24, by=60)
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="1970-01-01", tz="GMT") # convert those minutes to POSIXct
# build complete dataframe
newdata <- as.data.frame(minAsPOSIX)
newdata$value <- ts$value[match(newdata$minAsPOSIX, ts$day)] # fill in original `value`s where they match exactly
newdata$value[is.na(newdata$value)] <- 0 # replace NAs with 0
Try:
ts = read.csv(file=pathfile, header=TRUE, sep=",", stringsAsFactors=FALSE)
ts.tmp = rbind(ts, list("01-01-2000 00:03:00", 0))
ts.out = ts.tmp[order(ts.tmp$day),]
Notice that you need to force-load the strings in the first column as characters, not factors, otherwise you will have issues with the rbind. To make the day column a factor again after that, just do:
ts.out$day = as.factor(ts.out$day)
tidyr offers the nice complete function to generate rows for implicitly missing data. I use replace_na to turn the NA values into 0 in a second step.
# assumes day has already been converted to POSIXct, as in the answers above
ts %>%
  tidyr::complete(day = seq.POSIXt(min(day), max(day), by = "min")) %>%
  dplyr::mutate(value = tidyr::replace_na(value, 0))
Notice that I set the granularity of the dates to minutes, since your dataset expects a row every minute.

How to merge pairs of Dates and values contained in a single csv

We have a csv file with Dates in Excel numeric format and NAV for Manager A and Manager B, as follows:
Date,Manager A,Date,Manager B
41346.6666666667,100,40932.6666666667,100
41347.6666666667,100,40942.6666666667,99.9999936329992
41348.6666666667,100,40945.6666666667,99.9999936397787
41351.6666666667,100,40946.6666666667,99.9999936714362
41352.6666666667,100,40947.6666666667,100.051441180137
41353.6666666667,100,40948.6666666667,100.04877283951
41354.6666666667,100.000077579585,40949.6666666667,100.068400298752
41355.6666666667,100.00007861475,40952.6666666667,100.070263374822
41358.6666666667,100.000047950872,40953.6666666667,99.9661095940006
41359.6666666667,99.9945012295984,40954.6666666667,99.8578245935173
41360.6666666667,99.9944609274138,40955.6666666667,99.7798031949116
41361.6666666667,99.9944817907402,40956.6666666667,100.029523604978
41366.6666666667,100,40960.6666666667,100.14859511024
41367.6666666667,99.4729804387476,40961.6666666667,99.7956029017769
41368.6666666667,99.4729804387476,40962.6666666667,99.7023420799123
41369.6666666667,99.185046151864,40963.6666666667,99.6124531927299
41372.6666666667,99.1766469096966,40966.6666666667,99.5689030038018
41373.6666666667,98.920738006398,40967.6666666667,99.5701493637685
,,40968.6666666667,99.4543885041996
,,40969.6666666667,99.3424528379521
We want to create a zoo object with the structure [Date, Manager A NAV, Manager B NAV].
After reading the csv file with:
data = read.csv("...", header=TRUE, sep=",")
we set an index for splitting the object and use lapply to split:
INDEX <- seq(1, by = 2, length = ncol(data) / 2)
data.zoo <- lapply(INDEX, function(i, data) data[i:(i+1)], data = zoo(data))
I'm stuck on the fact that the Dates are in Excel format and don't know how to convert them. Is the problem set up correctly?
If all you want to do is convert the dates to proper dates, that is easy enough. The thing you need to know is the origin date: your numbers represent the integer and fractional number of days that have passed since it. Usually this is Jan 0 1900 (go figure), but be careful, as I don't think this is always the case. You can try this...
# Excel's origin is day 0 on Jan 0 1900, but Excel wrongly treats 1900 as a leap year, so...
data$Date <- as.Date( data$Date , origin = "1899/12/30")
data$Date.1 <- as.Date( data$Date.1 , origin = "1899/12/30")
# For more info see ?as.Date
If you are interested in keeping the times as well, you can use as.POSIXct, but you must also specify the timezone (UTC by default). Note that as.POSIXct counts in seconds since the origin, so multiply the Excel day counts by 86400, and apply this to the original numeric column, not the already-converted Dates:
data$Date <- as.POSIXct(data$Date * 86400, origin = "1899-12-30", tz = "UTC")
head(data)
# Date Manager.A Date.1 Manager.B
# 1 2013-03-13 16:00:00 100 2012-01-24 100.00000
# 2 2013-03-14 16:00:00 100 2012-02-03 99.99999
# 3 2013-03-15 16:00:00 100 2012-02-06 99.99999
# 4 2013-03-18 16:00:00 100 2012-02-07 99.99999
# 5 2013-03-19 16:00:00 100 2012-02-08 100.05144
# 6 2013-03-20 16:00:00 100 2012-02-09 100.04877
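To finish with the zoo object the question asks for, here is a rough sketch. It assumes both date columns were converted with as.Date as above, uses the column names read.csv produces, and drops the trailing NA rows that pad Manager A's columns:
library(zoo)
okA <- !is.na(data$Date)  # Manager A's columns are NA-padded at the bottom
zA <- zoo(data$Manager.A[okA], data$Date[okA])
zB <- zoo(data$Manager.B, data$Date.1)
z <- merge(ManagerA = zA, ManagerB = zB)  # outer join on the date index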
