Issue merging dataframes in R using POSIXct - r

I have two dataframes (per_frame, values) - The first contains POSIXct values for a 24 hour period at 15 minute intervals.
periods = seq(as.POSIXct("2019-06-01 04:00:00", tz = "UTC"), as.POSIXct("2019-06-02 03:45:00", tz = "UTC"), by = 900)
per_frame = data.frame(Period = periods)
The second contains a column for some of the time values above (but not all) and another for 'average value'.
Period               avg_value
2019-06-01 04:45:00  4
2019-06-01 05:00:00  7
2019-06-01 05:45:00  9
2019-06-01 08:45:00  2
2019-06-01 10:00:00  4
I want to create a new dataframe that adds the average values where available to the first dataframe, leaving 'missing values' where there aren't any. I thought this could be achieved easily using the below:
Combined= merge(per_frame, values, by = "Period", all.x = TRUE)
However, the new dataframe it creates has incorrect values for each Period. It is adding values to some time periods that don't have a corresponding average value in the values dataframe. I'm not sure what I'm doing incorrectly here?

Apologies - I realised after some investigation that the timezones used in the two dataframes were different, hence the mismatch when merging. I'm not actually sure why this happened, as I'm using the same data import to generate both the start and end values for the first dataframe and the values for the second. I was able to override it, though, using the 'tz' argument of the as.POSIXct function.
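A minimal sketch of the fix described above, assuming the values dataframe is typed in by hand from the table in the question and that both dataframes are meant to be in UTC. Passing the same explicit tz to every as.POSIXct call means both Period columns describe the same instants, so merge() matches only the rows that really correspond:

```r
# Build the 15-minute grid with an explicit time zone
periods <- seq(as.POSIXct("2019-06-01 04:00:00", tz = "UTC"),
               as.POSIXct("2019-06-02 03:45:00", tz = "UTC"),
               by = 900)
per_frame <- data.frame(Period = periods)

# Assumed reconstruction of the second dataframe, parsed with the same tz
values <- data.frame(
  Period = as.POSIXct(c("2019-06-01 04:45:00", "2019-06-01 05:00:00",
                        "2019-06-01 05:45:00", "2019-06-01 08:45:00",
                        "2019-06-01 10:00:00"), tz = "UTC"),
  avg_value = c(4, 7, 9, 2, 4)
)

# Left join: keep every period, fill avg_value with NA where there is no match
Combined <- merge(per_frame, values, by = "Period", all.x = TRUE)
```

Checking attr(per_frame$Period, "tzone") and attr(values$Period, "tzone") before merging is a quick way to spot this kind of mismatch.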

Related

How to subset datetime stamps to randomly keep only one value per day in R?

I am a GIS analyst and using R for a project. I am a bit rusty with R code. I have data in csv format from radio collared foxes with datetime stamps and GPS locations. However, throughout our study the time interval changed so some of the dates have 3 records per day and some have only one. For example:
[1] 2014-12-24 03:00:00
[2] 2014-12-24 12:00:00
[3] 2014-12-24 22:00:00.
There are duplicate datetimes as well that I need to thin, but they have different GPS locations:
[55] 2015-11-03 12:00:00
[56] 2015-11-03 12:00:00.
Ultimately I need just one record per day and I would like it to randomly choose which one is deleted so that I end up with a mix of time values. For example:
[1] 2014-12-24 12:00:00
[2] 2014-12-25 22:00:00.
I tried !duplicated() with the date only in a separate column, but the problem is that it only keeps the first value for each date, so all the times would be at 3:00 am. Example code:
oneaday6730 <- xFox6730[!duplicated(xFox6730$Date), , drop = FALSE]
With dplyr, assuming mydf is your data:
library(dplyr)
mydf %>%
  group_by(Date) %>%
  sample_n(1) -> result
Note that I'm making some assumptions on the structure of your data, in particular that the Date column contains the date you want to sample on.
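For comparison, the same idea can be sketched in base R by shuffling the rows first and then applying !duplicated(), so the record kept for each day is a random one rather than the earliest. The fox data frame and its datetime and id columns are made up for illustration:

```r
set.seed(42)  # only so the illustration is reproducible

# Hypothetical collar data: several fixes per day, including a duplicate datetime
stamps <- as.POSIXct(c("2014-12-24 03:00:00", "2014-12-24 12:00:00",
                       "2014-12-24 22:00:00", "2014-12-25 03:00:00",
                       "2014-12-25 22:00:00", "2015-11-03 12:00:00",
                       "2015-11-03 12:00:00"), tz = "UTC")
fox <- data.frame(datetime = stamps, id = seq_along(stamps))
fox$Date <- as.Date(fox$datetime, tz = "UTC")

# Shuffle the rows, then keep the first row seen for each date
shuffled <- fox[sample(nrow(fox)), ]
one_per_day <- shuffled[!duplicated(shuffled$Date), ]

# Restore chronological order
one_per_day <- one_per_day[order(one_per_day$datetime), ]
```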

how to iterate based on a condition, and assign aggregated value to a row in new dataframe in R

I have a large dataset of stock prices with 203615 rows and 2 columns (price and Timestamp), in the format below:
price(USD) | Timestamp
3.5 | 2014-01-01 20:00:00
2 | 2014-01-01 20:15:00
5 | 2014-01-01 20:15:00
----
4 | 2014-01-31 23:00:00
5 | 2014-01-31 23:00:00
4.5 | 2014-01-31 23:00:00
203615 2.3 | 2014-01-31 23:00:00
The timestamp varies from "2014-01-01 20:00:00" to "2014-01-31 23:00:00" at intervals of 15 minutes (rounded to 15 minutes). I have several transactions on the same timestamp.
I have to group rows into timestamp windows one day long, calculate the min, max and mean of the price and the number of rows within each window, and assign them to a row in a new dataframe for every iteration, from the starting date ("2014-01-02 20:00:00") until it reaches the end timestamp ("2014-01-31 23:00:00").
Note: the iteration has to be done for every 15 minutes.
I have tried a while loop. Please help me with this and tell me if there are any packages I can use.
This is my own code which I used as a way of creating a window of time (the prior 24 hours) to iterate over and create min and max values for a project I am working on...
inter is the interval I worked on in the loop
raw is the data frame name
i is the specific row from which the datetime column was selected from raw
I started my intervals at 97th row ( (i in 97:nrow(raw) ) because the stamps were taken at 15 minute intervals and I wanted a 24 hour backward window, so I needed to leave 96 intervals to pull from...I could not reach back into time I had no data for...so I started far enough into my data to leave room for those intervals.
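for (i in 97:nrow(raw)) {
  # start of the 24-hour window that ends at row i
  inter <- raw$datetime[i] - as.difftime(24, units = "hours")
  # assumption: the original omits how temp is built; here it is the rows inside that window
  temp <- raw[raw$datetime > inter & raw$datetime <= raw$datetime[i], ]
  raw$deltaAirTemp_24[i] <- max(temp$Air.Temperature) - min(temp$Air.Temperature)
}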
The key is getting into a real date time format. Run str() on the field with the dates. If they come back as anything but Factor, use:
as.POSIXct(yourdate$field, format = "%Y-%m-%d %H:%M:%S")
If they come back from str(yourdatecolumn here) as Factor, then wrap the field in as.character(), as in as.POSIXct(as.character(yourdate$field), format = "%Y-%m-%d %H:%M:%S"), to be sure it does not coerce the date into a level number and then a time.
Get them into a consistent date format, then construct something like the above to extract the periods you need. difftime is in the base package and works well; you can use positive and negative intervals with it. I hope this helps!
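Applied to the price question, the windowing idea above could be sketched as follows. This is a rough illustration on made-up data, not the answerer's code: the prices data frame, its toy price column, and the summaries column names are all assumptions. Each window covers the 24 hours up to and including its end timestamp, and the window end advances in 15-minute steps:

```r
# Toy data: one price per 15-minute timestamp (real data may have several)
ts <- seq(as.POSIXct("2014-01-01 20:00:00", tz = "UTC"),
          by = "15 min", length.out = 200)
prices <- data.frame(Timestamp = ts, price = seq_along(ts) %% 7 + 1)

# Window ends: every 15 minutes, starting one day after the first timestamp
window_ends <- seq(min(ts) + 24 * 3600, max(ts), by = "15 min")

# One summary row per window, bound together at the end
summaries <- do.call(rbind, lapply(seq_along(window_ends), function(k) {
  end <- window_ends[k]
  in_window <- prices$Timestamp > end - 24 * 3600 & prices$Timestamp <= end
  p <- prices$price[in_window]
  data.frame(window_end = end, min = min(p), max = max(p),
             mean = mean(p), n = length(p))
}))
```

Building the rows in a list and calling rbind once avoids growing a data frame inside the loop, which tends to be slow on a couple of hundred thousand rows.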

R multiple columns group by

I have a dataset x_output that looks like this:
timestamp city wait_time weekday
2015-07-14 09:00:00 Boston 1.4 Tuesday
2015-07-14 09:01:00 Boston 2.5 Tuesday
2015-07-14 09:02:00 Boston 2.8 Tuesday
2015-07-14 09:03:00 Boston 1.6 Tuesday
2015-07-14 09:04:00 Boston 1.5 Tuesday
2015-07-14 09:05:00 Boston 1.4 Wednesday
I would like to find the mean wait_time, grouped by city, weekday, and time. Basically, given your city, what is the average wait time for Monday, for example? Then Tuesday?
I'm having difficulty creating the time column given x_output$timestamp; I'm currently using:
x_output$time <- strsplit(as.character(x_output$timestamp), split = " ")[[1]][2]
However, that simply puts "09:00" in every row, not the correct time for each individual row.
Secondly, I need to have a 3-way grouping to find the mean wait_time given city, weekday and time. This is something that's fairly straightforward to do in python pandas, but I can find very little documentation on it in R (and unfortunately I need to do it in R, not python).
I've looked into using data.table but that hasn't seemed to work. Is there a simple function like there would be in python pandas (eg. df.groupby(['col1', 'col2', 'col3']).mean())?
Mean wait_time grouped by city, weekday, time:
library(plyr)
ddply(x_output, .(city, weekday, time), summarize, avg=mean(wait_time))
If you wanted data.table (x_output has to be a data.table, which setDT() takes care of):
library(data.table)
setDT(x_output)[, list(avg = mean(wait_time)), by = .(city, weekday, time)]
I'm having difficulty creating the time column given x_output$timestamp
Well, what is the time column supposed to have in it? Just the time component of timestamp? Is timestamp a POSIXct or a string?
If it is a POSIXct, then you can just format it as a character string, specifying the time format:
x_output$time <- format(x_output$timestamp, '%H:%M')
# or as.factor(format(...)) if you need it to be a factor.
# in data.table: x[, time := format(timestamp, '%H:%M')]
This will make the time column a string with the hour and minutes. See ?strptime for more options on converting that datetime to a string (e.g. if you want to include seconds).
If it is a string, you could strsplit and extract the second component of each element:
vapply(strsplit(x_output$timestamp, ' '), '[', character(1), 2)
which will give you "HH:MM:SS" as your time format. If you want to do a custom time format, it is probably best to convert your timestamp string into a POSIXct and back out to the specific format, as already mentioned.
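Putting the two parts together, here is a small base R sketch of the three-way grouping, assuming timestamp is a POSIXct. The sample rows are invented so that one group has more than one observation; aggregate() is used in place of plyr or data.table only to keep the example dependency-free:

```r
# Hypothetical sample in the shape of x_output
x_output <- data.frame(
  timestamp = as.POSIXct(c("2015-07-14 09:00:00", "2015-07-14 09:00:00",
                           "2015-07-14 09:01:00", "2015-07-15 09:00:00"),
                         tz = "UTC"),
  city = "Boston",
  wait_time = c(1.4, 2.6, 2.8, 1.6),
  weekday = c("Tuesday", "Tuesday", "Tuesday", "Wednesday"),
  stringsAsFactors = FALSE
)

# Per-row time of day; format() is vectorised, unlike strsplit(...)[[1]][2]
x_output$time <- format(x_output$timestamp, "%H:%M")

# Mean wait_time for each city / weekday / time combination
avg_wait <- aggregate(wait_time ~ city + weekday + time,
                      data = x_output, FUN = mean)
```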

difftime for multiple dates in r

I have chemistry water data taken from a river. Normally, the sample dates were on a Wednesday every two weeks. The data record starts in 1987 and ends in 2013.
Now, I want to re-check if there are any inconsistencies within the data, that is if the samples are really taken every 14 days. For that task I want to use the r function difftime. But I have no idea on how to do that for multiple dates.
Here is some data:
Date Value
1987-04-16 12:00:00 1,5
1987-04-30 12:00:00 1,2
1987-06-25 12:00:00 1,7
1987-07-14 12:00:00 1,3
Can you tell me on how to use the function difftime properly in that case or any other function that does the job. The result should be the number of days between the samplings and/or a true and false for the 14 days.
Thanks to you guys in advance. Any google-fu was to no avail!
Assuming your data.frame is named dd, you'll want to verify that the Date column is being treated as a date. Most times R will read them as a character, which gets converted to a factor in a data.frame. If class(dd$Date) is "character" or "factor", run
dd$Date<-as.POSIXct(as.character(dd$Date), format="%Y-%m-%d %H:%M:%S")
Then you can do a simple diff() to get the time difference in days
diff(dd$Date)
# Time differences in days
# [1] 14 56 19
# attr(,"tzone")
# [1] ""
so you can check which ones are over 14 days.
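To get the TRUE/FALSE flag the question asks for, the differences can be converted to plain numbers and compared with 14. A short sketch using the four sample dates, with tz = "UTC" as an assumption so daylight saving changes cannot shift the gaps; the gap_days and on_schedule column names are invented here:

```r
dd <- data.frame(Date = c("1987-04-16 12:00:00", "1987-04-30 12:00:00",
                          "1987-06-25 12:00:00", "1987-07-14 12:00:00"),
                 stringsAsFactors = FALSE)
dd$Date <- as.POSIXct(dd$Date, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Days since the previous sample; the first row has no predecessor
gap_days <- as.numeric(diff(dd$Date), units = "days")
dd$gap_days <- c(NA, gap_days)

# TRUE where the sample was taken exactly 14 days after the previous one
dd$on_schedule <- dd$gap_days == 14
```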

Subsetting dataframe by day according to most non zero data

I have an example dataframe:
a <- c(1:6)
b <- c("05/12/2012 05:00","05/12/2012 06:00","06/12/2012 05:00",
"06/12/2012 06:00", "07/12/2012 09:00","07/12/2012 07:00")
c <-c("0","0","0","1","1","1")
df1 <- data.frame(a,b,c,stringsAsFactors = FALSE)
Firstly, I want to make sure R recognises the date and time format, so I used:
df1$b <- strptime(df1$b, "%d/%m/%Y %H:%M")
However, this can't be right, as R always aborts my session as soon as I try to view the new dataframe.
Assuming that this gets resolved, I want to get a subset of the data according to whichever day in the dataframe contains the most data in 'c' that is not a zero. In the above example I should be left with the two data points on 7th Dec 2012.
I also have an additional, related question.
If I want to be left with a subset of the data with the most non zero values between a certain time period in the day (say between 07:00 and 08:00), how would I go about doing this?
Any help on the above problems would be greatly appreciated.
Well, the good news is that I have an answer for you, and the bad news is that you have more questions to ask yourself. First the bad news: you need to consider how you want to treat multiple days that have the same number of non-zero values for 'c'. I'm not going to address that in this answer.
Now the good news: this is really simple.
Step 1: First, let's reformat your data frame. Since we're changing data types on a couple of the variables (b to datetime and c to numeric), we need to create a new data frame or recalibrate the old one. I prefer to preserve the original and create a new one, like so:
a <- df1$a
b <- strptime(df1$b, "%d/%m/%Y %H:%M")
c <- as.numeric(df1$c)
hour <- as.numeric(format(b, "%H"))
date <- format(b, "%x")
df2 <- data.frame(a, b, c, hour, date)
# a b c hour date
# 1 1 2012-12-05 05:00:00 0 5 12/5/2012
# 2 2 2012-12-05 06:00:00 0 6 12/5/2012
# 3 3 2012-12-06 05:00:00 0 5 12/6/2012
# 4 4 2012-12-06 06:00:00 1 6 12/6/2012
# 5 5 2012-12-07 09:00:00 1 9 12/7/2012
# 6 6 2012-12-07 07:00:00 1 7 12/7/2012
Notice that I also added 'hour' and 'date' variables. This is to make our data easily sortable by those fields for our later aggregation function.
Step 2: Now, let's calculate how many non-zero values there are for each day between the hours of 06:00 and 08:00. Since we're using the 'hour' values, this means the values of '6' and '7' (represents 06:00 - 07:59).
library(plyr)
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c))
# a b c hour date non_zero
# 1 2 2012-12-05 06:00:00 0 6 12/5/2012 0
# 2 4 2012-12-06 06:00:00 1 6 12/6/2012 1
# 3 6 2012-12-07 07:00:00 1 7 12/7/2012 1
The 'plyr' package is wonderful for things like this. The 'ddply' function specifically takes data frames as both input and output (hence the "dd"), and the 'mutate' function allows us to preserve all the data while adding additional columns. In this case, we want a sum of 'c' for each day in .(date). Subsetting our data by the hours is taken care of in the data argument df2[df2$hour %in% 6:7,], which says to show us the rows where the hour value is in the set {6,7}.
Step 3: The final step is just to subset the data by the max number of non-zero values. We can drop the extra columns we used and go back to our original three.
subset_df <- df2[df2$non_zero==max(df2$non_zero),1:3]
# a b c
# 2 4 2012-12-06 06:00:00 1
# 3 6 2012-12-07 07:00:00 1
Good luck!
Update: At the OP's request, I am writing a new 'ddply' function that will also include a time column for plotting.
df2 <- ddply(df2[df2$hour %in% 6:7,], .(date), mutate, non_zero=sum(c), plot_time=as.numeric(format(b, "%H")) + as.numeric(format(b, "%M")) / 60)
subset_df <- df2[df2$non_zero==max(df2$non_zero),c("a","b","c","plot_time")]
We need to collapse the time down into one continuous variable, so I chose hours. Leaving any data in a time format will require us to fiddle with stuff later, and using a string format (like "hh:mm") will limit the types of functions you can use on it. Continuous numbers are the most flexible, so here we get the number of hours as.numeric(format(b, "%H")) and add it to the number of minutes divided by 60 as.numeric(format(b, "%M")) / 60 to convert the minutes into units of hours. Also, since we're dealing with more columns, I've switched the final subset statement to name the columns we want, rather than referring to the numbers. Once I'm dealing with columns that aren't in continuous order, I find that using names is easier to debug.
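For the first part of the question (the day with the most non-zero values, with no hour filter), a base R sketch without plyr might look like this. It parses with as.POSIXct rather than strptime so the column stored in the data frame is a POSIXct, and, as with the answer above, ties between days are not handled: which.max simply takes the first maximum. The day, non_zero_per_day, best_day and busiest names are introduced here for illustration:

```r
df1 <- data.frame(
  a = 1:6,
  b = c("05/12/2012 05:00", "05/12/2012 06:00", "06/12/2012 05:00",
        "06/12/2012 06:00", "07/12/2012 09:00", "07/12/2012 07:00"),
  c = c("0", "0", "0", "1", "1", "1"),
  stringsAsFactors = FALSE
)

# Convert the types; tz = "UTC" is an assumption for the example
df1$b <- as.POSIXct(df1$b, format = "%d/%m/%Y %H:%M", tz = "UTC")
df1$c <- as.numeric(df1$c)
df1$day <- as.Date(df1$b, tz = "UTC")

# Count non-zero values of c per day, then keep the rows of the top day
non_zero_per_day <- tapply(df1$c != 0, df1$day, sum)
best_day <- names(non_zero_per_day)[which.max(non_zero_per_day)]
busiest <- df1[format(df1$day) == best_day, ]
```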
Agreeing with Jack. Sounds like a corrupted installation of R. The first thing to try would be to delete the .RData file that holds the results of the prior session. Such files are hidden on both Mac and Windows, so unless you "reveal" the 'dotfiles' (system files), the OS file manager (Finder.app or Windows Explorer) will not show them. How you find and delete that file is an OS-specific task. It's going to be in your working directory, and you will need to do the deletion outside of R, since once R is started it will have locked access to it. It's also possible to get a corrupt .Rhistory file, but in my experience that is not usually the source of the problem.
If that is not successful, you may need to reinstall R.

Resources