How to concatenate 2 columns in the same data set in R

I have a data set in the following order:
Date Time Open High Low Close Volume NumberOfTrades BidVolume AskVolume
1 2011/12/22 02:00:00 5805.5 5820.5 5804.0 5820.5 253 96 161 71
2 2011/12/22 02:01:00 5819.0 5820.0 5813.0 5817.0 77 57 43 23
3 2011/12/22 02:02:00 5816.5 5820.0 5816.0 5819.0 30 22 9 14
I need to add a column before column A ("Date") that will be A+B ("Date" + "Time"); then I will be able to make my data set an xts object (xts needs a unique time index).
The final result will be something like:
DateTime Date Time Open High Low Close Volume NumberOfTrades BidVolume AskVolume
1 2011/12/22 02:00:00 2011/12/22 02:00:00 5805.5 5820.5 5804.0 5820.5 253 96 161 71
Thanks

Use paste to combine the Date and Time columns and as.POSIXct to convert the string to date-time class.
Assuming your data frame is called df:
df$DateTime = as.POSIXct(paste(df$Date, df$Time))
After you've added DateTime to your data frame, per @RichardScriven's comment, you can rearrange the order of the columns as follows:
df = df[ , c(length(df), 1:(length(df)-1))]
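(In current dplyr, relocate() expresses the same reordering; a one-line sketch, assuming dplyr >= 1.0.0 is available:)
df = dplyr::relocate(df, DateTime) # moves DateTime to the first column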
Or, you can add DateTime as the first column as follows:
df = data.frame(DateTime=as.POSIXct(paste(df$Date, df$Time)), df)
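Since the end goal is an xts object, the new DateTime column can serve as the index directly. A minimal sketch, assuming the sample data above is in df (the format string is inferred from the sample rows):
library(xts)
df$DateTime = as.POSIXct(paste(df$Date, df$Time), format = "%Y/%m/%d %H:%M:%S")
df_xts = xts(df[, c("Open", "High", "Low", "Close", "Volume")], order.by = df$DateTime) # unique, ordered time index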

Related

Thoughts on how to speed up replacing a column value when conditions between two objects (dataframe or datatable) are met in a loop?

I'm trying to get some input on how I might speed up a for loop that I've written. Essentially, I have a dataframe (DF1) where each row provides the latitude and longitude of a point at a given time. The variables therefore are the lat and long for the point and the timestamp (date and time object). In addition, each row is a unique timestamp for a unique point (in other words, no repeats).
I'm trying to match it up to weather data contained in a netcdf file. The hourly weather data is a 3-dimensional file that includes the lats and longs for the grids, the timestamps, and the values for that weather variable. I'll call the weather variable u for now.
What I want to end up with: I have a column for the u values in DF1. It starts out with only missing values. In the end I want to replace those missing values in DF1 with the appropriate u value from the netcdf file.
How I've done this so far: I have constructed a for loop that can extract the appropriate u values for the nearest grid to each point in DF1. For example, for lat x and long y in DF1 I find the nearest grid in the netcdf file and extract the full timeseries. Then, within the for loop, I amend the DF1$u values with the appropriate data. DF1 is a rather large dataframe and the netcdf is even bigger (DF1 is just a small subset of the full netcdf).
Some pseudo-data:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/5/2021 04:00:00"
, '2/1/2021 06:00:00'
, "1/7/2021 01:00:00"
, "2/2/2021 01:00:00"
, "2/5/2021 02:00:00")
lat <- c(34,36,41,50,20,40)
long <- c(55,50,-89,-175,-155,25)
DF1 <- data.frame(ID, datetime, lat, long)
DF1$u <- NA
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 NA
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
Here is an example of the type of for loop I've constructed; I've left out some of the more specific details that aren't relevant:
for (i in 1:nrow(DF1)) {
### a number of steps here that identify the closest grid to each point. ####
mat_index <- as.vector(arrayInd(which.min(dist_mat), dim(dist_mat)))
# Extract the time series for the lat and long that are closest. u is the weather variable, time is the datetime variable, and geometry is the corresponding lat long list item.
df_u <- data.frame(u=data_u[mat_index[2],mat_index[1],],time=timestamp,geometry=rep(psj[i,1],dim(data_u)[3]))
# To make things easier I separate geometry into a lat and a long col
df_u <- tidyr::separate(df_u, geometry, into = c("long", "lat"), sep = ",")
df_u$long <- gsub("c\\(", "", df_u$long)
df_u$lat <- gsub("\\)", "", df_u$lat)
# I find that datatables are a bit easier for these types of tasks, so I set the full timeseries data and DF1 to data table (in reality I set df1 as a data table outside of the for loop)
df_u <- setDT(df_u)
# Then I use merge to join the two datatables, replace the missing value with the appropriate value from df_u, and then drop all the unnecessary columns in the final step.
df_u_new <- merge(windu_df, df_u, by =c("lat","long","time"), all.x = T)
df_u_new[, u := ifelse(is.na(u.x), u.y, u.x)]
windu_df <- df_u_new[, .(time, lat, long, u)]
}
This works, but given the sheer size of the dataframe/datatables that I'm working with, I wonder if there is a faster way to do that last step in particular. I know merge is probably the slowest way to do this, but I kept running into issues using match() and inner_join.
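One hedged suggestion for that last step: data.table's update join can modify windu_df in place, avoiding the allocation of df_u_new and the re-assignment on every iteration. A sketch, assuming windu_df and df_u are both data.tables and the join columns have matching types (df_u's lat/long are character after the gsub calls above, so they may need as.numeric() first):
windu_df[df_u, on = .(lat, long, time), u := fifelse(is.na(u), i.u, u)] # i.u is df_u's u; existing non-NA values are kept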
Unfortunately I'm not able to give fully reproducible data here given that I'm working with a netcdf file, but df_u looks something like this for the first iteration:
ID <-c("1","2","3","4","5","6")
datetime <- c("1/1/2021 01:00:00"
, "1/1/2021 02:00:00"
, "1/1/2021 03:00:00"
, "1/1/2021 04:00:00"
, "1/1/2021 05:00:00"
, "1/1/2021 06:00:00")
lat <- c(34,34,34,34,34,34)
long <- c(55,55,55,55,55,55)
u <- c(2.8,3.6,2.1,5.6,4.4,2.5)
df_u <- data.frame(ID, datetime, lat, long,u)
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/1/2021 02:00:00 34 55 3.6
3 3 1/1/2021 03:00:00 34 55 2.1
4 4 1/1/2021 04:00:00 34 55 5.6
5 5 1/1/2021 05:00:00 34 55 4.4
6 6 1/1/2021 06:00:00 34 55 2.5
Once u is amended in DF1 it should read:
ID datetime lat long u
1 1 1/1/2021 01:00:00 34 55 2.8
2 2 1/5/2021 04:00:00 36 50 NA
3 3 2/1/2021 06:00:00 41 -89 NA
4 4 1/7/2021 01:00:00 50 -175 NA
5 5 2/2/2021 01:00:00 20 -155 NA
6 6 2/5/2021 02:00:00 40 25 NA
In the next iteration, the for loop will extract the relevant weather data for the lat and long in row 2, retaining the 2.8 already filled in for row 1 and replacing the NA in row 2 with a value.
EDIT: The NETCDF data covers an entire year (every hour for a year) over a decently large spatial area; it's ERA5 data. In addition, DF1 has thousands of unique lat/long and timestamp observations.

Time difference between rows of a dataframe

I have been combing the R part of StackOverflow for quite a while looking for a proper answer, but nothing I have seen seems to apply to my problem.
I have a dataset of this format (I have adapted it into what seems to be the easiest form to work with; the stop_sequence values are normally just incremental numbers for each stop):
route_short_name trip_id direction_id departure_time stop_sequence
33A 1.1598.0-33A-b12-1.451.I 1 16:15:00 start
33A 1.1598.0-33A-b12-1.451.I 1 16:57:00 end
41C 10.3265.0-41C-b12-1.277.I 1 08:35:00 start
41C 10.3265.0-41C-b12-1.277.I 1 09:26:00 end
41C 100.3260.0-41C-b12-1.276.I 1 09:40:00 start
41C 100.3260.0-41C-b12-1.276.I 1 10:53:00 end
114 1000.987.0-114-b12-1.86.O 0 21:35:00 start
114 1000.987.0-114-b12-1.86.O 0 22:02:00 end
39 10000.2877.0-39-b12-1.242.I 1 11:15:00 start
39 10000.2877.0-39-b12-1.242.I 1 12:30:00 end
It is basically a bus trips dataset. All I want is to get the duration of each trip, something like this:
route_short_name trip_id direction_id duration
33A 1.1598.0-33A-b12-1.451.I 1 42
41C 10.3265.0-41C-b12-1.277.I 1 51
41C 100.3260.0-41C-b12-1.276.I 1 73
114 1000.987.0-114-b12-1.86.O 0 27
39 10000.2877.0-39-b12-1.242.I 1 75
I have tried a lot of things, but in no case have I managed to group the data by trip_id and then work on the two values of each group. I must have misunderstood something, but I do not know what.
Does anyone have a clue?
We can also do this without converting to 'wide' format (assuming that 'stop_sequence' is 'start' followed by 'end' for each 'route_short_name', 'trip_id', and 'direction_id').
Convert 'departure_time' to a datetime column, group by 'route_short_name', 'trip_id', and 'direction_id', then take the difftime of the last 'departure_time' relative to the first 'departure_time':
df1 %>%
mutate(departure_time = as.POSIXct(departure_time, format = '%H:%M:%S')) %>%
group_by(route_short_name, trip_id, direction_id) %>%
summarise(duration = as.numeric(difftime(last(departure_time), first(departure_time), units = 'mins')))
# A tibble: 5 x 4
# Groups: route_short_name, trip_id [?]
# route_short_name trip_id direction_id duration
# <chr> <chr> <int> <dbl>
#1 114 1000.987.0-114-b12-1.86.O 0 27
#2 33A 1.1598.0-33A-b12-1.451.I 1 42
#3 39 10000.2877.0-39-b12-1.242.I 1 75
#4 41C 10.3265.0-41C-b12-1.277.I 1 51
#5 41C 100.3260.0-41C-b12-1.276.I 1 73
Try this. Right now you have your dataframe in "long" format, but it would be easier to calculate the time difference in "wide" format. The spread function from tidyr (loaded with the tidyverse package) will take your data from long to wide. From there you can use the mutate function to add the new column you want. as.numeric(difftime(end, start)) will keep the difference unit in minutes.
library(tidyverse)
wide_df <-
spread(your_df,key = stop_sequence, value = departure_time) %>%
mutate(timediff = as.numeric(difftime(end,start)))
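Note that spread has since been superseded in tidyr by pivot_wider; an equivalent sketch of the same idea:
wide_df <-
pivot_wider(your_df, names_from = stop_sequence, values_from = departure_time) %>%
mutate(timediff = as.numeric(difftime(end, start, units = "mins")))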
If you want to learn more about "tidy" data (and spreading and gathering), see this link to Hadley's book

How to make a dataframe with a large number of rows in R [duplicate]

I am new to R but have turned to it to solve a problem with a large data set I am trying to process. Currently I have 4 columns of data (Y values) set against minute-interval timestamps (month/day/year hour:min) (X values) as below:
timestamp tr tt sr st
1 9/1/01 0:00 1.018269e+02 -312.8622 -1959.393 4959.828
2 9/1/01 0:01 1.023567e+02 -313.0002 -1957.755 4958.935
3 9/1/01 0:02 1.018857e+02 -313.9406 -1956.799 4959.938
4 9/1/01 0:03 1.025463e+02 -310.9261 -1957.347 4961.095
5 9/1/01 0:04 1.010228e+02 -311.5469 -1957.786 4959.078
The problem I have is that some timestamp values are missing - e.g. there may be a gap between 9/1/01 0:13 and 9/1/01 0:27 and such gaps are irregular through the data set. I need to put several of these series into the same database and because the missing values are different for each series, the dates do not currently align on each row.
I would like to generate rows for these missing timestamps and fill the Y columns with blank values (no data, not zero), so that I have a continuous time series.
I'm honestly not quite sure where to start (not really used R before so learning as I go along!) but any help would be much appreciated. I have thus far installed chron and zoo, since it seems they might be useful.
Thanks!
This is an old question, but I just wanted to post a dplyr way of handling this, as I came across this post while searching for an answer to a similar problem. I find it more intuitive and easier on the eyes than the zoo approach.
library(dplyr)
ts <- seq.POSIXt(as.POSIXct("2001-09-01 0:00"), as.POSIXct("2001-09-01 0:07"), by = "min")
ts <- format.POSIXct(ts,'%m/%d/%y %H:%M')
df <- data.frame(timestamp=ts)
data_with_missing_times <- full_join(df,original_data)
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 NA NA NA NA
7 09/01/01 00:06 NA NA NA NA
8 09/01/01 00:07 NA NA NA NA
Also using dplyr, this makes it easier to do something like change all those missing values to something else, which came in handy for me when plotting in ggplot.
data_with_missing_times %>% mutate(across(everything(), ~ ifelse(is.na(.x), 0, .x)))
timestamp tr tt sr st
1 09/01/01 00:00 15 15 78 42
2 09/01/01 00:01 20 64 98 87
3 09/01/01 00:02 31 84 23 35
4 09/01/01 00:03 21 63 54 20
5 09/01/01 00:04 15 23 36 15
6 09/01/01 00:05 0 0 0 0
7 09/01/01 00:06 0 0 0 0
8 09/01/01 00:07 0 0 0 0
I think the easiest thing is to set the Date first as already described, convert to zoo, and then just do a merge:
df$timestamp<-as.POSIXct(df$timestamp,format="%m/%d/%y %H:%M")
df1.zoo<-zoo(df[,-1],df[,1]) #set date to Index
df2 <- merge(df1.zoo,zoo(,seq(start(df1.zoo),end(df1.zoo),by="min")), all=TRUE)
start and end are taken from your df1 (original data), and you set by, e.g. "min", as your example requires. all = TRUE sets all values at the missing dates to NA.
Date padding is implemented in the padr package in R. If your data frame has its date-time variable stored as POSIXct or POSIXlt, all you need to do is:
library(padr)
pad(df_name)
See vignette("padr") or this blog post for how it works.
I think this can be accomplished by using complete from the tidyr package. (Note that only the timestamp should be completed; listing the value columns inside complete() as well would expand the frame to every combination of their values.)
library(tidyverse)
df <- df %>%
complete(timestamp = seq.POSIXt(min(timestamp), max(timestamp), by = "min"))
You can also supply an explicit start and end date instead of using min(timestamp) and max(timestamp).
# some made-up data
originaldf <- data.frame(timestamp=c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:04"),
tr = rnorm(4,0,1),
tt = rnorm(4,0,1))
originaldf$minAsPOSIX <- as.POSIXct(originaldf$timestamp, format="%m/%d/%y %H:%M", tz="GMT")
# Generate vector of all minutes
ndays <- 1 # number of days to generate
minAsNumeric <- 60*60*24*243 + seq(0,60*60*24*ndays,by=60) # 243 days after the 2001-01-01 origin = 2001-09-01
# convert those minutes to POSIX
minAsPOSIX <- as.POSIXct(minAsNumeric, origin="2001-01-01", tz="GMT")
# new df
newdf <- merge(data.frame(minAsPOSIX),originaldf,all.x=TRUE, by="minAsPOSIX")
In case you want to substitute the NA values acquired by any method mentioned above with zeroes, you can do this:
df[is.na(df)] <- 0
(I originally wanted to comment this on Ibollar's answer but I lack the necessary reputation, thus I posted it as an answer)
df1.zoo <- zoo(df1[,-1], as.POSIXlt(df1[,1], format = "%Y-%m-%d %H:%M:%S")) #set date to Index: Notice that column 1 is Timestamp type and is named as "TS"
full.frame.zoo <- zoo(NA, seq(start(df1.zoo), end(df1.zoo), by="min")) # zoo object
full.frame.df <- data.frame(TS = as.POSIXlt(index(full.frame.zoo), format = "%Y-%m-%d %H:%M:%S")) # convert zoo object to data frame
full.vancouver <- merge(full.frame.df, df1, all = TRUE) # merge
I was looking for something similar, except that instead of filling in missing timestamps my data was in months and days, so I wanted to generate a sequence of months that would cater for leap years et cetera. I used lubridate:
library(lubridate)
date <- df$timestamp[1]
date_list <- c(date)
while (date < df$timestamp[nrow(df)]){
date <- date %m+% months(1)
date_list <- c(date_list,date)
}
date_list <- format(as.Date(date_list),"%Y-%m-%d")
df_1 <- data.frame(months=date_list, stringsAsFactors = F)
This will give me a list of dates in incremental months. Then I join
df_with_missing_months <- full_join(df_1,df)
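For what it's worth, seq.Date can build the same monthly sequence without the loop (a sketch, assuming df$timestamp holds dates that as.Date can parse):
date_list <- format(seq(as.Date(df$timestamp[1]), as.Date(df$timestamp[nrow(df)]), by = "month"), "%Y-%m-%d")
df_1 <- data.frame(months = date_list, stringsAsFactors = FALSE)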
There have been some advances in handling time series data in R; e.g. the tsibble package added such time series manipulations in a tidy way:
library(tsibble)
library(lubridate)
ts <- lubridate::mdy_hm(c("9/1/01 0:00","9/1/01 0:01","9/1/01 0:03","9/1/01 0:27"))
originaldf <- tsibble(timestamp = ts,
tr = rnorm(4,0,1),
tt = rnorm(4,0,1),
index = timestamp)
originaldf %>%
fill_gaps()

Removing multiple data entries based on a total number of entries per day

I start with a data frame titled 'dat' in R that looks like the following:
datetime lat long id extra step
1 8/9/2014 13:00 31.34767 -81.39117 36 1 31.38946
2 8/9/2014 17:00 31.34767 -81.39150 36 1 11155.67502
3 8/9/2014 23:00 31.30683 -81.28433 36 1 206.33342
4 8/10/2014 5:00 31.30867 -81.28400 36 1 11152.88177
What I need to do is find out what days have less than 3 entries and remove all entries associated with those days from the original data.
I initially did this by the following:
library(plyr)
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
### count using just the date so you can ID which days have fewer than 3 points
datecount<- count(dat2, "date")
datecount<- subset(datecount, datecount$freq < 3)
This ends up producing the following:
row.names date freq
1 49 2014-09-26 1
2 50 2014-09-27 2
3 135 2014-12-21 2
Which is great, but I cannot figure out how to remove the entries from these days with less than three entries from the original 'dat' because this is a compressed version of the original data frame.
So to try and deal with this I have come up with another way of looking at the problem. I will use the strptime and cbind from above:
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
And I will utilize the column titled "extra". I would like to create a new column which is the result of summing the values in the "extra" column by the simplified strptime dates, and then apply that value to all entries from that date, like the following:
date datetime lat long id extra extra_sum
1 2014-08-09 8/9/2014 13:00 31.34767 -81.39117 36 1 3
2 2014-08-09 8/9/2014 17:00 31.34767 -81.39150 36 1 3
3 2014-08-09 8/9/2014 23:00 31.30683 -81.28433 36 1 3
4 2014-08-10 8/10/2014 5:00 31.30867 -81.28400 36 1 4
5 2014-08-10 8/10/2014 13:00 31.34533 -81.39317 36 1 4
6 2014-08-10 8/10/2014 17:00 31.34517 -81.39317 36 1 4
7 2014-08-10 8/10/2014 23:00 31.34483 -81.39283 36 1 4
8 2014-08-11 8/11/2014 5:00 31.30600 -81.28317 36 1 2
9 2014-08-11 8/11/2014 13:00 31.34433 -81.39300 36 1 2
The code that creates the "extra_sum" column is what I am struggling with.
After creating this I can simply subset my data to all entries that have a value >2. Any help figuring out how to use my initial methodology or this new one to remove days with fewer than 3 entries from my initial data set would be much appreciated!
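For the extra_sum step specifically, base R's ave() broadcasts a per-group sum back onto every row; a sketch, assuming dat2 from above (grouping on the character form of date, since POSIXlt columns don't group cleanly):
dat2$extra_sum <- ave(dat2$extra, as.character(dat2$date), FUN = sum) # per-date sum of extra, repeated on every row of that date
dat3 <- dat2[dat2$extra_sum >= 3, ] # keep only days with 3 or more entries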
The plyr way.
library(plyr)
datetime <- dat$datetime
###strip the time down to only have the date no hh:mm:ss
date <- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2 <-cbind(date, dat)
dat3 <- ddply(dat2, .(date), function(df){
if (nrow(df)>=3) {
return(df)
} else {
return(NULL)
}
})
I recommend using the data.table package
library(data.table)
dat<-data.table(dat)
dat$Date<-as.Date(as.character(dat$datetime), format = "%m/%d/%Y")
dat_sum<-dat[, .N, by = Date ]
dat_3plus<-dat_sum[N>=3]
dat<-dat[Date%in%dat_3plus$Date]
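For comparison, a dplyr sketch of the same filter (assuming the same dat and datetime format):
library(dplyr)
dat %>%
mutate(date = as.Date(datetime, format = "%m/%d/%Y")) %>%
group_by(date) %>%
filter(n() >= 3) %>%
ungroup()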
