create a new column based on the subtraction results from two columns - r

I have two large data sets like these:
df1=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'))
df2=data.frame(subject = c(rep(1, 10), rep(2, 10)), day=c(1,1,2,2,3,3,9,9,15,15,1,1,2,2,3,3,9,9,15,15),dtime=c('4/16/2012 6:15','4/16/2012 15:16','4/18/2012 7:15','4/18/2012 21:45','4/19/2012 7:05','4/19/2012 23:17','4/28/2012 7:15','4/28/2012 21:12','5/1/2012 7:15','5/1/2012 15:15','4/23/2012 6:45','4/23/2012 16:45','4/25/2012 6:45','4/25/2012 21:30','4/26/2012 6:45','4/26/2012 22:00','5/2/2012 7:00','5/2/2012 22:00','5/8/2012 6:45','5/8/2012 15:45'))
...
In df2, 'dtime' contains two time points for each subject on each day. For each observation in df1, I want to subtract the second time point for the same subject and day in df2 from 'stime'. If the result is positive, the observation gets the second time point from 'dtime'; otherwise it gets the first. For example, for subject 1 on day 1, ('4/16/2012 6:25' - '4/16/2012 15:16') < 0, so this observation gets the first time point, '4/16/2012 6:15'; ('4/16/2012 17:22' - '4/16/2012 15:16') > 0, so this observation gets the second time point, '4/16/2012 15:16'. The expected output should look like this:
df3=data.frame(subject = c(rep(1, 12), rep(2, 10)), day =c(1,1,1,1,1,2,3,15,15,15,15,19,1,1,1,1,2,3,15,15,15,15),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/16/2012 17:22','4/16/2012 17:45','4/16/2012 18:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/1/2012 16:28','5/1/2012 17:00','5/5/2012 17:00','4/23/2012 5:56','4/23/2012 6:30','4/23/2012 16:55','4/23/2012 17:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/8/2012 15:55','5/8/2012 16:30'), dtime=c('4/16/2012 6:15','4/16/2012 6:15','4/16/2012 15:16','4/16/2012 15:16','4/16/2012 15:16','4/18/2012 7:15','4/19/2012 7:05','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 15:15','5/1/2012 15:15','.','4/23/2012 6:45','4/23/2012 6:45','4/23/2012 16:45','4/23/2012 16:45','4/25/2012 6:45','4/26/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 15:45','5/8/2012 15:45'))
...
I used the code below to do this. However, because there is no 'dtime' for day 19, R kept giving me the error shown after the code:
df1$dtime <- apply(df1, 1, function(x){
choices <- df2[ df2$subject==as.numeric(x["subject"]) &
df2$day==as.numeric(x["day"]) , "dtime"]
if( as.POSIXct(x["stime"], format="%m/%d/%Y %H:%M") <
as.POSIXct(choices[2],format="%m/%d/%Y %H:%M") ) {
choices[1]
}else{ choices[2] }
} )
Error in if (as.POSIXct(x["stime"], format = "%m/%d/%Y %H:%M") < as.POSIXct(choices[2], : missing value where TRUE/FALSE needed
Does anyone have an idea how to solve this problem?

As a start, I entered the two data frames to try things out. Here is a pseudo-code approach (I will leave you to finish the code). When printed, df1 looks like the following:
subject day stime
1 1 1 4/16/2012 6:25
2 1 1 4/16/2012 7:01
3 1 1 4/16/2012 17:22
4 1 1 4/16/2012 17:45
5 1 1 4/16/2012 18:13
6 1 2 4/18/2012 6:50
7 1 3 4/19/2012 6:55
8 1 15 5/1/2012 6:28
9 1 15 5/1/2012 7:00
10 1 15 5/1/2012 16:28
11 1 15 5/1/2012 17:00
12 2 1 4/23/2012 5:56
13 2 1 4/23/2012 6:30
14 2 1 4/23/2012 16:55
15 2 1 4/23/2012 17:20
16 2 2 4/25/2012 6:32
17 2 3 4/26/2012 6:28
18 2 15 5/8/2012 5:54
19 2 15 5/8/2012 6:30
20 2 15 5/8/2012 15:55
21 2 15 5/8/2012 16:30
Why not try the following:
First, write a simple loop that steps through each value in the stime column of df1 and looks up the matching dtime values in df2. To make this easy, you could convert df1 and df2 to matrices with as.matrix() if you like (that is my preference).
After you grab the first value in row 1, column 3 of df1, which is 4/16/2012 6:25, pull out the 6:25 and store it in a temporary variable; let's call this variable a.
Do exactly the same for df2, the data frame you want to compare against, grabbing the value from the relevant position; let's call this variable b.
Subtract the two temporary variables. You may need to write some code to put the two values in a form where a - b gives a numerical answer; I will leave that up to you.
Check whether the answer is positive or negative using a simple if statement.
Get the value of a or b depending on the result of that check.
Add this new value, with the appropriate subject and day, to a new data frame. You have called this df3.
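The steps above can be sketched roughly as follows. This is only one possible reading, not the asker's final code: it compares full date-times rather than just the clock time, uses a small made-up subset of the question's data, and returns NA (instead of the '.' in the expected output) when df2 has no pair of time points for a subject and day, which is the case that triggers the "missing value where TRUE/FALSE needed" error. The helper name pick_dtime and the choice of tz = "UTC" are assumptions for illustration.

```r
fmt <- "%m/%d/%Y %H:%M"

# Small hypothetical subset of the question's data (not the full data sets).
df1 <- data.frame(subject = c(1, 1, 1), day = c(1, 1, 19),
                  stime = c("4/16/2012 6:25", "4/16/2012 17:22", "5/5/2012 17:00"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(subject = c(1, 1), day = c(1, 1),
                  dtime = c("4/16/2012 6:15", "4/16/2012 15:16"),
                  stringsAsFactors = FALSE)

# pick_dtime is a hypothetical helper, not part of the original posts.
pick_dtime <- function(subject, day, stime) {
  choices <- df2$dtime[df2$subject == subject & df2$day == day]
  # No pair of time points for this subject/day: return NA rather than
  # letting if() fail on a missing comparison.
  if (length(choices) < 2) return(NA_character_)
  s <- as.POSIXct(stime, format = fmt, tz = "UTC")
  second <- as.POSIXct(choices[2], format = fmt, tz = "UTC")
  # Positive difference (stime after the second time point) -> second point.
  if (s > second) choices[2] else choices[1]
}

df1$dtime <- mapply(pick_dtime, df1$subject, df1$day, df1$stime,
                    USE.NAMES = FALSE)
df1
```

The length check is the part that matters for the reported error; the rest follows the rule described in the question.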

I'm getting different answers than you. First I made a copy of df1 to work with:
df4 <- df1
df4$dtime <- apply(df4, 1, function(x){
choices <- df2[ df2$subject==as.numeric(x["subject"]) &
df2$day==as.numeric(x["day"]) , "dtime"]
if( as.POSIXct(x["stime"], format="%m/%d/%Y %H:%M") <
as.POSIXct(choices[1],format="%m/%d/%Y %H:%M") ) {
choices[1]
}else{ choices[2] }
} )
#----------------------------------------------
subject day stime dtime
1 1 1 4/16/2012 6:25 4/16/2012 15:16
2 1 1 4/16/2012 7:01 4/16/2012 15:16
3 1 1 4/16/2012 17:22 4/16/2012 15:16
4 1 1 4/16/2012 17:45 4/16/2012 15:16
5 1 1 4/16/2012 18:13 4/16/2012 15:16
6 1 2 4/18/2012 6:50 4/18/2012 7:15
7 1 3 4/19/2012 6:55 4/19/2012 7:05
8 1 15 5/1/2012 6:28 5/1/2012 7:15
9 1 15 5/1/2012 7:00 5/1/2012 7:15
10 1 15 5/1/2012 16:28 5/1/2012 15:15
11 1 15 5/1/2012 17:00 5/1/2012 15:15
12 2 1 4/23/2012 5:56 4/23/2012 6:45
13 2 1 4/23/2012 6:30 4/23/2012 6:45
14 2 1 4/23/2012 16:55 4/23/2012 16:45
15 2 1 4/23/2012 17:20 4/23/2012 16:45
16 2 2 4/25/2012 6:32 4/25/2012 6:45
17 2 3 4/26/2012 6:28 4/26/2012 6:45
18 2 15 5/8/2012 5:54 5/8/2012 6:45
19 2 15 5/8/2012 6:30 5/8/2012 6:45
20 2 15 5/8/2012 15:55 5/8/2012 15:45
21 2 15 5/8/2012 16:30 5/8/2012 15:45

Related

Get the first date of the first five-consecutive dates in a list using R

I have the following data:
structure(list(V1 = c("1979-01-28", "1979-01-29", "1979-01-30",
"1979-02-13", "1979-02-14", "1979-02-17", "1979-02-18", "1979-02-19",
"1979-02-20", "1979-02-21", "1979-02-22", "1979-02-23", "1979-03-07",
"1979-03-14", "1979-03-18", "1979-03-29", "1979-03-30", "1979-03-31",
"1979-04-01", "1979-04-02", "1979-04-03", "1979-04-04", "1979-04-05")), class =
"data.frame", row.names = c(NA,-23L))
This is a list of dates. The interval is daily but with gaps.
I would like to get the first date of the earliest run of five consecutive days.
So in the example above, the expected output is "1979-02-17".
Right now, I am getting the dates manually. How can I do this in R?
I'll appreciate any help on this.
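Using rle and diff:
df$V1[with(rle(diff(as.Date(df$V1)) == 1), {
# five consecutive dates give a run of four one-day differences
inds <- which.max(values & lengths >= 4)
# seq_len() also handles a run that starts at the first row (inds == 1)
sum(lengths[seq_len(inds - 1)]) + 1
})]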
#[1] "1979-02-17"
How about this:
df=data.frame("V1"=df$V1)
df$V2=difftime(c(tail(df$V1,-1),NA),df$V1)
tmp=rle(as.numeric(df$V2))
df$V3=rep(tmp$lengths,tmp$lengths)
df
V1 V2 V3
1 1979-01-28 24 hours 2
2 1979-01-29 24 hours 2
3 1979-01-30 336 hours 1
4 1979-02-13 24 hours 1
5 1979-02-14 72 hours 1
6 1979-02-17 24 hours 6
7 1979-02-18 24 hours 6
8 1979-02-19 24 hours 6
9 1979-02-20 24 hours 6
10 1979-02-21 24 hours 6
11 1979-02-22 24 hours 6
12 1979-02-23 288 hours 1
13 1979-03-07 168 hours 1
14 1979-03-14 96 hours 1
15 1979-03-18 264 hours 1
16 1979-03-29 24 hours 3
17 1979-03-30 24 hours 3
18 1979-03-31 24 hours 3
19 1979-04-01 23 hours 1
20 1979-04-02 24 hours 3
21 1979-04-03 24 hours 3
22 1979-04-04 24 hours 3
23 1979-04-05 NA hours 1
# five consecutive dates show up as a run of four equal 24-hour gaps
df$V1[which.max(df$V3>=4)]
[1] "1979-02-17"
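For reuse, the rle and diff idea can be wrapped in a small function. This is a minimal sketch under a few assumptions: the dates are sorted, contain no missing values, and "five consecutive dates" means four successive one-day differences. The name first_run_start and its n argument are hypothetical, and the dates below are the first ten from the question.

```r
# first_run_start is a hypothetical helper; n is the required run length in days.
first_run_start <- function(dates, n = 5) {
  d <- as.Date(dates)
  runs <- rle(diff(d) == 1)
  # n consecutive dates correspond to n - 1 consecutive one-day differences
  hit <- which(runs$values & runs$lengths >= n - 1)
  if (length(hit) == 0) return(as.Date(NA))
  # rows covered by the runs before the first qualifying run, plus one
  d[sum(runs$lengths[seq_len(hit[1] - 1)]) + 1]
}

dates <- c("1979-01-28", "1979-01-29", "1979-01-30", "1979-02-13", "1979-02-14",
           "1979-02-17", "1979-02-18", "1979-02-19", "1979-02-20", "1979-02-21")
first_run_start(dates)
# [1] "1979-02-17"
```

Converting with as.Date() rather than comparing date-times keeps the differences in whole days, so daylight-saving shifts (such as the 23-hour gap in the printed table above) do not break a run.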

create an unique week variable NOT depending on the calendar in R

I have a daily revenue time series df from 01-01-2014 to 15-06-2017. I want to aggregate the daily revenue to weekly revenue and make weekly predictions. Before aggregating, I need to create a continuous week variable that does NOT restart at week 1 when a new year begins. Since 01-01-2014 was not a Monday, I decided to start my first week on 06-01-2014.
My df now looks like this:
date year month total
7 2014-01-06 2014 1 1857679.4
8 2014-01-07 2014 1 1735488.0
9 2014-01-08 2014 1 1477269.9
10 2014-01-09 2014 1 1329882.9
11 2014-01-10 2014 1 1195215.7
...
709 2017-06-14 2017 6 1677476.9
710 2017-06-15 2017 6 1533083.4
I want to create a unique week variable starting from 2014-01-06 until the last row of my dataset (1257 rows in total), which is 2017-06-15.
I wrote a loop:
week = c()
for (i in 1:179) {
week = rep(i,7)
print(week)
}
However, the result is not saved across iterations. When I type week, it just shows 179,179,179,179,179,179,179.
Where is the problem, and how can I add 180, 180, 180, 180 after the loop?
And if I add more data after 2017-06-15, how can I create the week variable automatically based on the last date in the data? (In other words, I don't want to count the daily observations, divide by 7, and handle the leftover days by hand to get the week index.)
Thank you!
Does this work?
library(lubridate)
#DATA
x = data.frame(date = seq.Date(from = ymd("2014-01-06"),
to = ymd("2017-06-15"), length.out = 15))
#Add year and week for each date
x$week = year(x$date) + week(x$date)/100
#Convert the addition of year and week to factor and then to numeric
x$week_variable = as.numeric(as.factor(x$week))
#Another alternative
x$week_variable2 = floor(as.numeric(x$date - min(x$date))/7) + 1
x
# date week week_variable week_variable2
#1 2014-01-06 2014.01 1 1
#2 2014-04-05 2014.14 2 13
#3 2014-07-04 2014.27 3 26
#4 2014-10-02 2014.40 4 39
#5 2014-12-30 2014.52 5 52
#6 2015-03-30 2015.13 6 65
#7 2015-06-28 2015.26 7 77
#8 2015-09-26 2015.39 8 90
#9 2015-12-24 2015.52 9 103
#10 2016-03-23 2016.12 10 116
#11 2016-06-21 2016.25 11 129
#12 2016-09-18 2016.38 12 141
#13 2016-12-17 2016.51 13 154
#14 2017-03-17 2017.11 14 167
#15 2017-06-15 2017.24 15 180
Here is the answer:
week = c()
for (i in 1:184) {
for (j in 1:7) {
week[j+(i-1)*7] = i
}
}
week = as.data.frame(week)
I created a week variable running from week 1 to week 184 (the end of my dataset). Each week number is repeated 7 times because there are 7 days in a week. I then assigned the week variable to my data frame.
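As an alternative to filling the vector in a loop, the week index can be derived from the dates themselves, so it keeps working when new rows are appended. The sketch below is an illustration only: it uses made-up revenue values, assumes the first row is the Monday that starts week 1, and borrows the column names date and total from the question.

```r
# Hypothetical daily data: 17 consecutive days starting on Monday 2014-01-06.
df <- data.frame(
  date  = seq(as.Date("2014-01-06"), by = "day", length.out = 17),
  total = rep(100, 17)
)

# Days elapsed since the first date, grouped into blocks of 7.
# The index never resets at a new year and copes with a partial last week.
df$week <- as.integer(df$date - min(df$date)) %/% 7L + 1L

# Weekly revenue.
weekly <- aggregate(total ~ week, data = df, FUN = sum)
weekly
```

Because the index depends only on the date, missing days do not shift later weeks, which a fixed rep() over row positions would do.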

R get the particular hour from the time

I have the following data
Num Date Time
1 2015.05.21 12:12:12
2 2015.05.22 13:12:12
3 2015.05.23 14:12:12
4 2015.05.24 15:12:12
5 2015.05.25 16:12:12
By using weekdays(as.Date(data$Date, format='%Y.%m.%d')) I can get the corresponding days of the week. Also by using months I can get the corresponding months. Is there a way to get the hour only in a new column? Something like hours(as.Date(data$Time, format='%H:%M:%S')) which will provide me the following output.
Num Date Time Hour
1 2015.05.21 12:12:12 12
2 2015.05.22 13:12:12 13
3 2015.05.23 14:12:12 14
4 2015.05.24 15:12:12 15
5 2015.05.25 16:12:12 16
R doesn't have a native data type for just time values without dates. With the sample data
dd<-read.table(text="Num Date Time
1 2015.05.21 12:12:12
2 2015.05.22 13:12:12
3 2015.05.23 14:12:12
4 2015.05.24 15:12:12
5 2015.05.25 16:12:12", header=T, stringsAsFactors=F)
You can do
transform(dd, Hour=as.POSIXlt(paste(Date, Time), format="%Y.%m.%d %H:%M:%S")$hour)
to get
Num Date Time Hour
1 1 2015.05.21 12:12:12 12
2 2 2015.05.22 13:12:12 13
3 3 2015.05.23 14:12:12 14
4 4 2015.05.24 15:12:12 15
5 5 2015.05.25 16:12:12 16
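A variation on the same idea, shown as a sketch: parse the combined date and time once with as.POSIXct and pull the hour out with format(). The first three rows of the sample data are retyped here so the snippet runs on its own; the variable name stamp and the choice of tz = "UTC" are assumptions.

```r
dd <- data.frame(
  Num  = 1:3,
  Date = c("2015.05.21", "2015.05.22", "2015.05.23"),
  Time = c("12:12:12", "13:12:12", "14:12:12"),
  stringsAsFactors = FALSE
)

# Combine date and time into one date-time, then extract the hour as an integer.
stamp <- as.POSIXct(paste(dd$Date, dd$Time),
                    format = "%Y.%m.%d %H:%M:%S", tz = "UTC")
dd$Hour <- as.integer(format(stamp, "%H"))
dd
```

Keeping stamp around can be handy if weekdays or months are needed later, since weekdays() and months() accept it directly.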

Merge repeated measurements dataframe in R

I would like to merge two data frames of repeated measurements. Both have the format below; the only difference is that the first has observation1 while the other has observation2.
Location Date Time observation1
1 1/1/2000 6:00 20
1 1/1/2000 7:00 14
1 1/1/2000 8:00 35
1 1/2/2000 6:00 20
1 1/2/2000 7:00 14
1 1/2/2000 8:00 35
2 1/1/2000 6:00 10
2 1/1/2000 7:00 14
2 1/1/2000 8:00 45
2 1/2/2000 6:00 30
2 1/2/2000 7:00 24
2 1/2/2000 8:00 35
.
.
100 10/31/2000 6:00 80
100 10/31/2000 7:00 80
100 10/31/2000 8:00 80
I want to process them so that, for each location at a specific date and time, observation1 and observation2 are matched up.
I planned to use a for loop: pick one row from dataframe1, match it against dataframe2, then pick the next row from dataframe1, and so on. But since both data frames have several million rows, this is super slow.
Can anyone suggest a more efficient way? Thanks!
Following @Andrew Taylor:
A direct way of doing it is to use merge(). A reproducible example is below:
Location = c(1,1,2,3,4,1)
Date1 = c(as.Date("2014-01-01"), as.Date("2000-01-01"), as.Date("2005-01-01"), as.Date("2001-12-01"), as.Date("2001-11-01"), as.Date("2001-10-01"))
Time1 = c(20,30,40,50,60,70)
Observation1 = c(1,2,3,4,5,6)
Date2 = c(as.Date("2014-10-01"), as.Date("2001-01-01"), as.Date("2005-01-01"), as.Date("2001-12-01"), as.Date("2001-11-01"), as.Date("2001-10-01"))
Time2 = c(20,20,40,50,50,70)
Observation2 = c(7,8,9,10,11,12)
data1 = data.frame(Location = Location, Date = Date1, Time = Time1, Observation1 = Observation1)
data2 = data.frame(Location = Location, Date = Date2, Time = Time2, Observation2 = Observation2)
merge(data1,data2, by = c("Date", "Time", "Location"))
That will return:
Date Time Location Observation1 Observation2
1 2001-10-01 70 1 6 12
2 2001-12-01 50 3 4 10
3 2005-01-01 40 2 3 9
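If rows that appear in only one of the two data frames should be kept as well, merge() takes an all argument. The sketch below uses made-up values in the question's column layout; the names obs1, obs2 and both are hypothetical.

```r
obs1 <- data.frame(
  Location = c(1, 1, 2),
  Date = c("1/1/2000", "1/1/2000", "1/1/2000"),
  Time = c("6:00", "7:00", "6:00"),
  observation1 = c(20, 14, 10),
  stringsAsFactors = FALSE
)
obs2 <- data.frame(
  Location = c(1, 2, 2),
  Date = c("1/1/2000", "1/1/2000", "1/1/2000"),
  Time = c("6:00", "6:00", "7:00"),
  observation2 = c(5, 6, 7),
  stringsAsFactors = FALSE
)

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA.
both <- merge(obs1, obs2, by = c("Location", "Date", "Time"), all = TRUE)
both
```

With the default all = FALSE, only the key combinations present in both data frames are returned, as in the example above.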

Assigning values in a sequence to a group of consecutive rows leaving some rows empty

I'm trying to group several consecutive rows (assigning them the same value) while leaving some rows empty (when a certain condition is not fulfilled).
My data are locations (xy coordinates), the date/time at which they were measured, and the time span between measures. Somewhat simplified, they look like this:
ID X Y Time Span
1 3445 7671 0:00 -
2 3312 7677 4:00 4
3 3309 7680 12:00 8
4 3299 7681 16:00 4
5 3243 7655 20:00 4
6 3222 7612 4:00 8
7 3260 7633 0:00 4
8 3254 7641 8:00 8
9 3230 7612 0:00 16
10 3203 7656 4:00 4
11 3202 7678 8:00 4
12 3159 7609 20:00 12
...
I'd like to assign a value to every sequence of locations that are measured within a time span of 4 hours, and make my data look like this:
ID X Y Time Span Sequence
1 3445 7671 0:00 - -
2 3312 7677 4:00 4 1
3 3309 7680 12:00 8 NA
4 3299 7681 16:00 4 2
5 3243 7655 20:00 4 2
6 3222 7612 4:00 8 NA
7 3260 7633 0:00 4 3
8 3254 7641 8:00 8 NA
9 3230 7612 0:00 16 NA
10 3203 7656 4:00 4 4
11 3202 7678 8:00 4 4
12 3159 7609 20:00 12 NA
I've tried several approaches with a "for" loop plus an "ifelse" condition, like:
Sequence <- for (i in 1:max(ID)) {
ifelse (Span <= 4, i+1, "NA")
}
without any luck. I know my attempt is incorrect, but my programming skills are really basic and I haven't found any similar problem on the web.
Any ideas would be much appreciated!
Here is a longish one liner:
ifelse(x <- DF$Span == 4, cumsum(c(head(x, 1), tail(x, -1) - head(x, -1) == 1)), NA)
# [1] NA 1 NA 2 2 NA 3 NA NA 4 4 NA
Explanation:
x is a vector of TRUE/FALSE showing where Span is 4.
tail(x, -1) is a safe way of writing x[2:length(x)]
head(x, -1) is a safe way of writing x[1:(length(x)-1)]
tail(x, -1) - head(x, -1) == 1 is a vector of TRUE/FALSE showing where we went from Span != 4 to Span == 4.
Since the vector above is one element shorter than x, I prepended head(x, 1) to it. head(x, 1) is a safe way of writing x[1].
Then I take the cumsum so it converts the vector TRUE/FALSE into a vector of increasing integers: where Span jumps from !=4 to ==4 it increases by 1, otherwise stays constant.
Everything is wrapped into an ifelse so you only see numbers where x is TRUE, i.e., where Span == 4.
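The explanation above can be followed step by step on a plain vector. This is a sketch with assumed names (span, is4, starts, seq_id), and the "-" in the first row of the question's data is represented here as NA:

```r
# Span values from the question; the first row has no span.
span <- c(NA, 4, 8, 4, 4, 8, 4, 8, 16, 4, 4, 12)

# TRUE where the span is exactly 4 hours.
is4 <- !is.na(span) & span == 4

# TRUE where a run of 4-hour spans begins (this row is 4, the previous is not).
starts <- is4 & !c(FALSE, head(is4, -1))

# The running count of run starts gives the sequence number; blank out other rows.
seq_id <- ifelse(is4, cumsum(starts), NA)
seq_id
# [1] NA  1 NA  2  2 NA  3 NA NA  4  4 NA
```

Splitting the one-liner this way makes each intermediate vector easy to inspect.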
Here's another alternative using rle and rep. We'll assume that your data.frame is named "test".
First, initialize your "Sequence" column, filling it with NA.
test$Sequence <- NA
Second, specify the condition that you are matching, in this case, test$Span == 4.
x <- test$Span == 4
Third, use the combination of rle's output (lengths and values) to get how many times each new run in the sequence occurs.
spanSeq <- rle(x)$lengths[rle(x)$values == TRUE]
Finally, use rep with the times argument set to the result obtained in step 3. Subset the required values of test$Sequence according to the index matched by test$Span == 4, and replace them with your new sequence.
test$Sequence[x] <- rep(seq_along(spanSeq), times = spanSeq)
test
# ID X Y Time Span Sequence
# 1 1 3445 7671 0:00 - NA
# 2 2 3312 7677 4:00 4 1
# 3 3 3309 7680 12:00 8 NA
# 4 4 3299 7681 16:00 4 2
# 5 5 3243 7655 20:00 4 2
# 6 6 3222 7612 4:00 8 NA
# 7 7 3260 7633 0:00 4 3
# 8 8 3254 7641 8:00 8 NA
# 9 9 3230 7612 0:00 16 NA
# 10 10 3203 7656 4:00 4 4
# 11 11 3202 7678 8:00 4 4
# 12 12 3159 7609 20:00 12 NA
Once you understand the steps involved, you can also do this directly with within(). The following would give you the same result:
within(test, {
Sequence <- NA
spanSeq <- rle(Span == 4)$lengths[rle(Span == 4)$values == TRUE]
Sequence[Span == 4] <- rep(seq_along(spanSeq), times = spanSeq)
rm(spanSeq)
})
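A plain for loop also works:
# start a new sequence number whenever a run of 4-hour spans begins
Sequence <- rep(NA, length(Span))
count <- 0
for (i in seq_along(Span)) {
is4 <- isTRUE(Span[i] == 4)
if (is4 && (i == 1 || !isTRUE(Span[i - 1] == 4))) count <- count + 1
if (is4) Sequence[i] <- count
}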
