Merge repeated measurements dataframe in R

I would like to merge two data frames of repeated measurements. Both have the format shown below; the only difference is that the first one has observation1 while the other has observation2.
Location Date Time observation1
1 1/1/2000 6:00 20
1 1/1/2000 7:00 14
1 1/1/2000 8:00 35
1 1/2/2000 6:00 20
1 1/2/2000 7:00 14
1 1/2/2000 8:00 35
2 1/1/2000 6:00 10
2 1/1/2000 7:00 14
2 1/1/2000 8:00 45
2 1/2/2000 6:00 30
2 1/2/2000 7:00 24
2 1/2/2000 8:00 35
.
.
100 10/31/2000 6:00 80
100 10/31/2000 7:00 80
100 10/31/2000 8:00 80
I want to process them so for each location at a specific date and time, the observation1 and observation2 can match up.
I planned to use a for loop to do it, meaning I pick one row from dataframe1, match it with dataframe2, and then pick another row from dataframe1 and do it over and over. But since the dataframes both have several millions of rows, this is super slow.
Can anyone suggest a more efficient way? Thanks!

Following @Anrew Taylor's suggestion, a direct way of doing this is to use merge(). A reproducible example is below:
Location = c(1,1,2,3,4,1)
Date1 = c(as.Date("2014-01-01"), as.Date("2000-01-01"), as.Date("2005-01-01"), as.Date("2001-12-01"), as.Date("2001-11-01"), as.Date("2001-10-01"))
Time1 = c(20,30,40,50,60,70)
Observation1 = c(1,2,3,4,5,6)
Date2 = c(as.Date("2014-10-01"), as.Date("2001-01-01"), as.Date("2005-01-01"), as.Date("2001-12-01"), as.Date("2001-11-01"), as.Date("2001-10-01"))
Time2 = c(20,20,40,50,50,70)
Observation2 = c(7,8,9,10,11,12)
data1 = data.frame(Location = Location, Date = Date1, Time = Time1, Observation1 = Observation1)
data2 = data.frame(Location = Location, Date = Date2, Time = Time2, Observation2 = Observation2)
merge(data1,data2, by = c("Date", "Time", "Location"))
That will return:
        Date Time Location Observation1 Observation2
1 2001-10-01   70        1            6           12
2 2001-12-01   50        3            4           10
3 2005-01-01   40        2            3            9
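To tie this back to the format in the question, here is a minimal sketch using hypothetical miniature versions of the two data frames (the names df1 and df2 and all values are made up for illustration). It shows the default inner join next to all = TRUE, which keeps rows that appear in only one data frame and fills the missing observation with NA. Because merge() is vectorized, it avoids the row-by-row loop; for several million rows, the merge method from the data.table package may be faster still, though that is not tested here.

```r
# Hypothetical miniature data in the question's format (values are made up)
df1 <- data.frame(
  Location = c(1, 1, 2),
  Date = c("1/1/2000", "1/1/2000", "1/1/2000"),
  Time = c("6:00", "7:00", "6:00"),
  observation1 = c(20, 14, 10),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Location = c(1, 2, 2),
  Date = c("1/1/2000", "1/1/2000", "1/1/2000"),
  Time = c("6:00", "6:00", "7:00"),
  observation2 = c(5, 7, 9),
  stringsAsFactors = FALSE
)

# Columns that identify one measurement
keys <- c("Location", "Date", "Time")

# Inner join: only location/date/time combinations present in both
inner <- merge(df1, df2, by = keys)

# Full outer join: every combination, with NA where one side is missing
outer <- merge(df1, df2, by = keys, all = TRUE)

print(inner)
print(outer)
```

Use all.x = TRUE or all.y = TRUE instead of all = TRUE if only one side's unmatched rows should be kept.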

Related

separating data with respect to month, day, year and hour in R

I have two columns in a data frame first is water consumption and the second column is for date+hour. for example
Value Time
12.2 1/1/2016 1:00
11.2 1/1/2016 2:00
10.2 1/1/2016 3:00
The data is for 4 years and I want to create separate columns for month date year and hour.
I would appreciate any help
We can convert to Datetime and then extract the components. We assume the format of 'Time' column is 'dd/mm/yyyy H:M' (in case it is different i.e. 'mm/dd/yyyy H:M', change the dmy_hm to mdy_hm)
library(dplyr)
library(lubridate)
df1 %>%
  mutate(Time = dmy_hm(Time), month = month(Time),
         year = year(Time), hour = hour(Time))
# Value Time month year hour
#1 12.2 2016-01-01 01:00:00 1 2016 1
#2 11.2 2016-01-01 02:00:00 1 2016 2
#3 10.2 2016-01-01 03:00:00 1 2016 3
In base R, we can either use strptime or as.POSIXct and then use either format or extract components
df1$Time <- strptime(df1$Time, "%d/%m/%Y %H:%M")
transform(df1, month = Time$mon+1, year = Time$year + 1900, hour = Time$hour)
# Value Time month year hour
#1 12.2 2016-01-01 01:00:00 1 2016 1
#2 11.2 2016-01-01 02:00:00 1 2016 2
#3 10.2 2016-01-01 03:00:00 1 2016 3
data
df1 <- structure(list(Value = c(12.2, 11.2, 10.2), Time = c("1/1/2016 1:00",
"1/1/2016 2:00", "1/1/2016 3:00")), class = "data.frame", row.names = c(NA,
-3L))
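The question also asks for the day of the month, which the outputs above do not include. A minimal base R sketch, assuming the same 'dd/mm/yyyy H:M' format as above (swap %d and %m if the data is month-first), parses the column once and pulls each component out with format(). The variable name parsed and the tz = "UTC" setting are my own choices for illustration.

```r
# Same sample data as in the answer
df1 <- data.frame(
  Value = c(12.2, 11.2, 10.2),
  Time = c("1/1/2016 1:00", "1/1/2016 2:00", "1/1/2016 3:00"),
  stringsAsFactors = FALSE
)

# Parse once; a fixed time zone keeps the extracted hour unambiguous
parsed <- as.POSIXct(df1$Time, format = "%d/%m/%Y %H:%M", tz = "UTC")

# format() returns zero-padded strings such as "01"; convert to integers
df1$day   <- as.integer(format(parsed, "%d"))
df1$month <- as.integer(format(parsed, "%m"))
df1$year  <- as.integer(format(parsed, "%Y"))
df1$hour  <- as.integer(format(parsed, "%H"))

print(df1)
```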

bifurcate count basis datetime in R

I have below-mentioned dataframe in R.
DF
ID Datetime Value
T-1 2020-01-01 15:12:14 10
T-2 2020-01-01 00:12:10 20
T-3 2020-01-01 03:11:11 25
T-4 2020-01-01 14:01:01 20
T-5 2020-01-01 18:07:11 10
T-6 2020-01-01 20:10:09 15
T-7 2020-01-01 15:45:23 15
By utilizing the above-mentioned dataframe, I want to bifurcate the count basis month and time bucket considering the Datetime.
Required Output:
Month Count Sum
Jan-20 7 115
12:00 AM to 05:00 AM 2 45
06:00 AM to 12:00 PM 0 0
12:00 PM to 03:00 PM 1 20
03:00 PM to 08:00 PM 3 35
08:00 PM to 12:00 AM 1 15
You can bin the hours of the day by using hour from the lubridate package and then cut from base R, before summarizing with dplyr.
Here, I am assuming that your Datetime column is actually in a date-time format and not just a character string or factor. If it is a character string or factor, convert it first with DF$Datetime <- as.POSIXct(as.character(DF$Datetime)).
library(tidyverse)
DF$bins <- cut(lubridate::hour(DF$Datetime), c(-1, 5.99, 11.99, 14.99, 19.99, 24))
levels(DF$bins) <- c("00:00 to 05:59", "06:00 to 11:59", "12:00 to 14:59",
"15:00 to 19:59", "20:00 to 23:59")
newDF <- DF %>%
  group_by(bins, .drop = FALSE) %>%
  summarise(Count = length(Value), Total = sum(Value))
This gives the following result:
newDF
#> # A tibble: 5 x 3
#> bins Count Total
#> <fct> <int> <dbl>
#> 1 00:00 to 05:59 2 45
#> 2 06:00 to 11:59 0 0
#> 3 12:00 to 14:59 1 20
#> 4 15:00 to 19:59 3 35
#> 5 20:00 to 23:59 1 15
And if you want to add January as a first row (though I'm not sure how much sense this makes in this context) you could do:
newDF %>%
summarise(bins = "January", Count = sum(Count), Total = sum(Total)) %>% bind_rows(newDF)
#> # A tibble: 6 x 3
#> bins Count Total
#> <chr> <int> <dbl>
#> 1 January 7 115
#> 2 00:00 to 05:59 2 45
#> 3 06:00 to 11:59 0 0
#> 4 12:00 to 14:59 1 20
#> 5 15:00 to 19:59 3 35
#> 6 20:00 to 23:59 1 15
Incidentally, the reproducible version of the data I used for this was:
structure(list(ID = structure(1:7, .Label = c("T-1", "T-2", "T-3",
"T-4", "T-5", "T-6", "T-7"), class = "factor"), Datetime = structure(c(1577891534,
1577837530, 1577848271, 1577887261, 1577902031, 1577909409, 1577893523
), class = c("POSIXct", "POSIXt"), tzone = ""), Value = c(10,
20, 25, 20, 10, 15, 15)), class = "data.frame", row.names = c(NA,
-7L))
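For comparison, here is a base R sketch of the same binning without the tidyverse. It uses cut() with right = FALSE so the breaks can be whole hours rather than 5.99, 11.99 and so on. The hours and values vectors are typed in by hand from the question's table, and the variable names are my own.

```r
# Hour of day and Value for the seven rows in the question
hours  <- c(15, 0, 3, 14, 18, 20, 15)
values <- c(10, 20, 25, 20, 10, 15, 15)

# right = FALSE makes each interval closed on the left: [0,6), [6,12), ...
bins <- cut(
  hours,
  breaks = c(0, 6, 12, 15, 20, 24),
  right = FALSE,
  labels = c("00:00 to 05:59", "06:00 to 11:59", "12:00 to 14:59",
             "15:00 to 19:59", "20:00 to 23:59")
)

# table() keeps empty factor levels, so the 06:00 bin shows a count of 0
counts <- table(bins)

# tapply() gives NA for empty bins; replace those with 0
totals <- tapply(values, bins, sum)
totals[is.na(totals)] <- 0

result <- data.frame(
  bins = names(totals),
  Count = as.vector(counts),
  Total = as.vector(totals)
)
print(result)
```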

create an unique week variable NOT depending on the calendar in R

I have a daily revenue time series df from 01-01-2014 to 15-06-2017, and I want to aggregate the daily revenue to weekly revenue and make weekly predictions. Before I aggregate the revenue, I need to create a continuous week variable, which will NOT start from week 1 again when a new year starts. Since 01-01-2014 was not a Monday, I decided to start my first week from 06-01-2014.
My df now looks like this
date year month total
7 2014-01-06 2014 1 1857679.4
8 2014-01-07 2014 1 1735488.0
9 2014-01-08 2014 1 1477269.9
10 2014-01-09 2014 1 1329882.9
11 2014-01-10 2014 1 1195215.7
...
709 2017-06-14 2017 6 1677476.9
710 2017-06-15 2017 6 1533083.4
I want to create a unique week variable starting from 2014-01-06 until the last row of my dataset (1257 rows in total), which is 2017-06-15.
I wrote a loop:
week = c()
for (i in 1:179) {
week = rep(i,7)
print(week)
}
However, the result of this loop is not saved for each iteration. When I type week, it just shows 179,179,179,179,179,179,179
Where is the problem and how can I add 180, 180, 180, 180 after the repeat loop?
And if I add more data after 2017-06-15, how can I create the week variable automatically based on the last date in the data? (In other words, I would rather not have to count the daily observations, divide by 7, and handle the leftover days by hand to get the week index.)
Thank you!
Does this work?
library(lubridate)
#DATA
x = data.frame(date = seq.Date(from = ymd("2014-01-06"),
to = ymd("2017-06-15"), length.out = 15))
#Add year and week for each date
x$week = year(x$date) + week(x$date)/100
#Convert the addition of year and week to factor and then to numeric
x$week_variable = as.numeric(as.factor(x$week))
#Another alternative
x$week_variable2 = floor(as.numeric(x$date - min(x$date))/7) + 1
x
# date week week_variable week_variable2
#1 2014-01-06 2014.01 1 1
#2 2014-04-05 2014.14 2 13
#3 2014-07-04 2014.27 3 26
#4 2014-10-02 2014.40 4 39
#5 2014-12-30 2014.52 5 52
#6 2015-03-30 2015.13 6 65
#7 2015-06-28 2015.26 7 77
#8 2015-09-26 2015.39 8 90
#9 2015-12-24 2015.52 9 103
#10 2016-03-23 2016.12 10 116
#11 2016-06-21 2016.25 11 129
#12 2016-09-18 2016.38 12 141
#13 2016-12-17 2016.51 13 154
#14 2017-03-17 2017.11 14 167
#15 2017-06-15 2017.24 15 180
Here is the answer:
week = c()
for (i in 1:184) {
for (j in 1:7) {
week[j+(i-1)*7] = i
}
}
week = as.data.frame(week)
I created a week variable running from week 1 to week 184 (the end of my dataset). Each week number is repeated 7 times because there are 7 days in a week. I then assigned the week variable to my data frame.
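As for why the loop in the question only keeps 179: each iteration assigns a brand-new vector to week, so the previous one is overwritten rather than extended. A minimal sketch of two ways around that is below, assuming one row per calendar day from 2014-01-06 to 2017-06-15 as the question describes (the dates vector is generated here for illustration). The second approach derives the week index from the dates themselves, so it keeps working when more rows are added.

```r
# One date per day over the range given in the question
dates <- seq(as.Date("2014-01-06"), as.Date("2017-06-15"), by = "day")

# Fix for the loop: append to the existing vector instead of overwriting it
week_loop <- c()
for (i in 1:179) {
  week_loop <- c(week_loop, rep(i, 7))
}

# Vectorized alternative: whole weeks elapsed since the first date, plus 1
week <- as.integer(as.numeric(dates - min(dates)) %/% 7) + 1L

length(dates)      # number of daily rows
max(week)          # index of the last (partial) week
sum(week == 180)   # days in that partial week
```

With these dates the final week is only partly filled, which matches the four trailing 180s mentioned in the question.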

Removing multiple data entries based on a total number of entries per day

I start with a data frame titled 'dat' in R that looks like the following:
datetime lat long id extra step
1 8/9/2014 13:00 31.34767 -81.39117 36 1 31.38946
2 8/9/2014 17:00 31.34767 -81.39150 36 1 11155.67502
3 8/9/2014 23:00 31.30683 -81.28433 36 1 206.33342
4 8/10/2014 5:00 31.30867 -81.28400 36 1 11152.88177
What I need to do is find out which days have fewer than 3 entries and remove all entries associated with those days from the original data.
I initially did this by the following:
library(plyr)
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
### count using just the date so you can ID which days have fewer than 3 points
datecount<- count(dat2, "date")
datecount<- subset(datecount, datecount$freq < 3)
This ends up producing the following:
row.names date freq
1 49 2014-09-26 1
2 50 2014-09-27 2
3 135 2014-12-21 2
Which is great, but I cannot figure out how to remove the entries from these days with less than three entries from the original 'dat' because this is a compressed version of the original data frame.
So to try and deal with this I have come up with another way of looking at the problem. I will use the strptime and cbind from above:
datetime<-dat$datetime
###strip the time down to only have the date no hh:mm:ss
date<- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2<-cbind(date, dat)
And I will utilize the column titled "extra". I would like to create a new column which is the result of summing the values in this "extra" column by the simplified strptime dates. But find a way to apply this new value to all entries from that date, like the following:
date datetime lat long id extra extra_sum
1 2014-08-09 8/9/2014 13:00 31.34767 -81.39117 36 1 3
2 2014-08-09 8/9/2014 17:00 31.34767 -81.39150 36 1 3
3 2014-08-09 8/9/2014 23:00 31.30683 -81.28433 36 1 3
4 2014-08-10 8/10/2014 5:00 31.30867 -81.28400 36 1 4
5 2014-08-10 8/10/2014 13:00 31.34533 -81.39317 36 1 4
6 2014-08-10 8/10/2014 17:00 31.34517 -81.39317 36 1 4
7 2014-08-10 8/10/2014 23:00 31.34483 -81.39283 36 1 4
8 2014-08-11 8/11/2014 5:00 31.30600 -81.28317 36 1 2
9 2014-08-11 8/11/2014 13:00 31.34433 -81.39300 36 1 2
The code that creates the "extra_sum" column is what I am struggling with.
After creating this I can simply subset my data to all entries that have a value >2. Any help figuring out how to use my initial methodology or this new one to remove days with fewer than 3 entries from my initial data set would be much appreciated!
The plyr way.
library(plyr)
datetime <- dat$datetime
###strip the time down to only have the date no hh:mm:ss
date <- strptime(datetime, format = "%m/%d/%Y")
### bind the date to the old data
dat2 <-cbind(date, dat)
dat3 <- ddply(dat2, .(date), function(df){
if (nrow(df)>=3) {
return(df)
} else {
return(NULL)
}
})
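I recommend using the data.table package:
library(data.table)
dat <- data.table(dat)
# keep only the date part of each datetime
dat$Date <- as.Date(as.character(dat$datetime), format = "%m/%d/%Y")
# count entries per day
dat_sum <- dat[, .N, by = Date]
# days with at least 3 entries
dat_3plus <- dat_sum[N >= 3]
# keep only rows from those days
dat <- dat[Date %in% dat_3plus$Date]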
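The second approach described in the question, an extra_sum column that repeats the per-day total on every row, can be sketched in base R with ave(). This is only an illustration on a hypothetical six-row sample (the lat, long, id and step columns are left out), not a tested drop-in for the full data.

```r
# Hypothetical sample: three entries on 8/9, two on 8/10, one on 8/11
dat <- data.frame(
  datetime = c("8/9/2014 13:00", "8/9/2014 17:00", "8/9/2014 23:00",
               "8/10/2014 5:00", "8/10/2014 13:00", "8/11/2014 5:00"),
  extra = 1,
  stringsAsFactors = FALSE
)

# as.Date() reads only as much of the string as the format needs,
# so the trailing time of day is ignored
dat$date <- as.Date(dat$datetime, format = "%m/%d/%Y")

# ave() applies sum() within each date and returns a value for every row
dat$extra_sum <- ave(dat$extra, dat$date, FUN = sum)

# Keep only days with at least 3 entries
kept <- dat[dat$extra_sum >= 3, ]
print(kept)
```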

create a new column based on the subtraction results from two columns

I have two large data sets like these:
df1=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'))
df2=data.frame(subject = c(rep(1, 10), rep(2, 10)), day=c(1,1,2,2,3,3,9,9,15,15,1,1,2,2,3,3,9,9,15,15),dtime=c('4/16/2012 6:15','4/16/2012 15:16','4/18/2012 7:15','4/18/2012 21:45','4/19/2012 7:05','4/19/2012 23:17','4/28/2012 7:15','4/28/2012 21:12','5/1/2012 7:15','5/1/2012 15:15','4/23/2012 6:45','4/23/2012 16:45','4/25/2012 6:45','4/25/2012 21:30','4/26/2012 6:45','4/26/2012 22:00','5/2/2012 7:00','5/2/2012 22:00','5/8/2012 6:45','5/8/2012 15:45'))
...
In df2, 'dtime' contains two time points for each subject on each day. For each 'stime' in df1, I want to subtract the second 'dtime' time point for the same subject and day in df2. If the result is positive, that observation should get the second time point from 'dtime'; otherwise it should get the first. For example, for subject 1 on day 1, ('4/16/2012 6:25'-'4/16/2012 15:16')<0, so we give the first time point '4/16/2012 6:15' to this observation; ('4/16/2012 17:22'-'4/16/2012 15:16')>0, so we give the second time point '4/16/2012 15:16' to this observation. The expected output should look like this:
df3=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'), dtime=c('4/16/2012 6:15','4/16/2012 6:15','4/16/2012 15:16','4/16/2012 15:16','4/16/2012 15:16','4/18/2012 7:15','4/19/2012 7:05','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 15:15','5/1/2012 15:15','.','4/23/2012 6:45','4/23/2012 6:45','4/23/2012 16:45','4/23/2012 16:45','4/25/2012 6:45','4/26/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 15:45','5/8/2012 15:45'))
...
I used the code below to do this. However, because 'dtime' is missing for day 19, R kept giving me this error:
df1$dtime <- apply(df1, 1, function(x){
choices <- df2[ df2$subject==as.numeric(x["subject"]) &
df2$day==as.numeric(x["day"]) , "dtime"]
if( as.POSIXct(x["stime"], format="%m/%d/%Y %H:%M") <
as.POSIXct(choices[2],format="%m/%d/%Y %H:%M") ) {
choices[1]
}else{ choices[2] }
} )
Error in if (as.POSIXct(x["stime"], format = "%m/%d/%Y %H:%M") < as.POSIXct(choices[2], : missing value where TRUE/FALSE needed
Does anyone have an idea how to solve this problem?
As a start, I inputted the two data frames in to try things out. Here is what I am thinking in terms of a pseudo-code approach (will leave you to finish the code). df1, when inputted, looks like the following:
subject day stime
1 1 1 4/16/2012 6:25
2 1 1 4/16/2012 7:01
3 1 1 4/16/2012 17:22
4 1 1 4/16/2012 17:45
5 1 1 4/16/2012 18:13
6 1 2 4/18/2012 6:50
7 1 3 4/19/2012 6:55
8 1 15 5/1/2012 6:28
9 1 15 5/1/2012 7:00
10 1 15 5/1/2012 16:28
11 1 15 5/1/2012 17:00
12 2 1 4/23/2012 5:56
13 2 1 4/23/2012 6:30
14 2 1 4/23/2012 16:55
15 2 1 4/23/2012 17:20
16 2 2 4/25/2012 6:32
17 2 3 4/26/2012 6:28
18 2 15 5/8/2012 5:54
19 2 15 5/8/2012 6:30
20 2 15 5/8/2012 15:55
21 2 15 5/8/2012 16:30
Why not try the following:
First, write a simple loop that will let you step through each value in the stime column of df1 and the dtime column of df2. To make this easy, you could convert the df1 and df2 data frames into matrices if you like (using as.matrix(), which is my preference).
After you grab the first value in row 1, column 3 from df1, which is 4/16/2012 6:25, pull out the 6:25 and store it in a temporary variable ... let's call this variable a
Do the exact same thing for df2, which you also want to compare to, and store this in a temporary variable, except grab the variable from the relevant position ... let's call this variable b
Subtract the two temporary variables (you may need to write some code to get the two parts set up so that you can easily do an a-b and get a numerical answer. That said, I will leave that up to you).
Check whether the answer is positive or negative using a simple conditional if statement
Get the value of a or b depending on the output from your conditional check
Add this new value to a new data table, with the appropriate subject and day. You have called this df3.
I'm getting different answers than you. First I made a copy of df1 to work with:
df4 <- df1
df4$dtime <- apply(df4, 1, function(x){
choices <- df2[ df2$subject==as.numeric(x["subject"]) &
df2$day==as.numeric(x["day"]) , "dtime"]
if( as.POSIXct(x["stime"], format="%m/%d/%Y %H:%M") <
as.POSIXct(choices[1],format="%m/%d/%Y %H:%M") ) {
choices[1]
}else{ choices[2] }
} )
#----------------------------------------------
subject day stime dtime
1 1 1 4/16/2012 6:25 4/16/2012 15:16
2 1 1 4/16/2012 7:01 4/16/2012 15:16
3 1 1 4/16/2012 17:22 4/16/2012 15:16
4 1 1 4/16/2012 17:45 4/16/2012 15:16
5 1 1 4/16/2012 18:13 4/16/2012 15:16
6 1 2 4/18/2012 6:50 4/18/2012 7:15
7 1 3 4/19/2012 6:55 4/19/2012 7:05
8 1 15 5/1/2012 6:28 5/1/2012 7:15
9 1 15 5/1/2012 7:00 5/1/2012 7:15
10 1 15 5/1/2012 16:28 5/1/2012 15:15
11 1 15 5/1/2012 17:00 5/1/2012 15:15
12 2 1 4/23/2012 5:56 4/23/2012 6:45
13 2 1 4/23/2012 6:30 4/23/2012 6:45
14 2 1 4/23/2012 16:55 4/23/2012 16:45
15 2 1 4/23/2012 17:20 4/23/2012 16:45
16 2 2 4/25/2012 6:32 4/25/2012 6:45
17 2 3 4/26/2012 6:28 4/26/2012 6:45
18 2 15 5/8/2012 5:54 5/8/2012 6:45
19 2 15 5/8/2012 6:30 5/8/2012 6:45
20 2 15 5/8/2012 15:55 5/8/2012 15:45
21 2 15 5/8/2012 16:30 5/8/2012 15:45
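Neither answer above addresses the original error directly. It comes from the day 19 row: no row of df2 matches, so choices is empty, choices[2] is NA, and if() cannot evaluate an NA condition. A minimal sketch of a guard is below, using a hypothetical three-row cut of df1 and a two-row cut of df2. It follows the rule stated in the question (compare stime with the second dtime) and returns NA when fewer than two time points are available. The helper name pick_dtime is made up for illustration.

```r
# Hypothetical cut-down data; the day 19 row has no counterpart in df2
df1 <- data.frame(
  subject = c(1, 1, 1),
  day = c(1, 1, 19),
  stime = c("4/16/2012 6:25", "4/16/2012 17:22", "5/5/2012 17:00"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  subject = c(1, 1),
  day = c(1, 1),
  dtime = c("4/16/2012 6:15", "4/16/2012 15:16"),
  stringsAsFactors = FALSE
)

fmt <- "%m/%d/%Y %H:%M"

# Hypothetical helper: choose a dtime for one row of df1
pick_dtime <- function(subject, day, stime) {
  choices <- df2$dtime[df2$subject == subject & df2$day == day]
  # Guard: without two time points there is nothing to compare against
  if (length(choices) < 2) return(NA_character_)
  s  <- as.POSIXct(stime, format = fmt, tz = "UTC")
  d2 <- as.POSIXct(choices[2], format = fmt, tz = "UTC")
  # stime before the second time point gets the first, otherwise the second
  if (s < d2) choices[1] else choices[2]
}

df1$dtime <- mapply(pick_dtime, df1$subject, df1$day, df1$stime,
                    USE.NAMES = FALSE)
print(df1)
```

The expected output in the question marks the missing value with '.'; replacing NA afterwards with df1$dtime[is.na(df1$dtime)] <- "." would reproduce that.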
