Group by date difference with respect to first date in R

I have a dataset that looks like this.
Id Date1 Cars
1 2007-04-05 72
2 2014-01-07 12
2 2018-07-09 10
2 2018-07-09 13
3 2005-11-19 22
3 2005-11-23 13
4 2010-06-17 38
4 2010-09-23 57
4 2010-09-23 41
4 2010-10-04 17
What I would like to do is, for each Id, get the date difference with respect to the first (earliest) date for that Id: (2nd earliest date - earliest date), (3rd earliest date - earliest date), (4th earliest date - earliest date), and so on.
I would end up with a dataset like this
Id Date1 Cars Diff
1 2007-04-05 72 NA
2 2014-01-07 12 NA
2 2018-07-09 10 1644 = (2018-07-09 - 2014-01-07)
2 2018-07-09 13 1644 = (2018-07-09 - 2014-01-07)
3 2005-11-19 22 NA
3 2005-11-23 13 4 = (2005-11-23 - 2005-11-19)
4 2010-06-17 38 NA
4 2010-09-23 57 98 = (2010-09-23 - 2010-06-17)
4 2010-09-23 41 98 = (2010-09-23 - 2010-06-17)
4 2010-10-04 17 109 = (2010-10-04 - 2010-06-17)
I am unclear on how to accomplish this. Any help would be much appreciated. Thanks

First, convert Date1 to the Date class:
df$Date1 <- as.Date(df$Date1)
You can then subtract the first value within each Id. This can be done using dplyr:
library(dplyr)
df %>% group_by(Id) %>% mutate(Diff = as.integer(Date1 - first(Date1)))
# Id Date1 Cars Diff
# <int> <date> <int> <int>
# 1 1 2007-04-05 72 0
# 2 2 2014-01-07 12 0
# 3 2 2018-07-09 10 1644
# 4 2 2018-07-09 13 1644
# 5 3 2005-11-19 22 0
# 6 3 2005-11-23 13 4
# 7 4 2010-06-17 38 0
# 8 4 2010-09-23 57 98
# 9 4 2010-09-23 41 98
#10 4 2010-10-04 17 109
Or with data.table:
library(data.table)
setDT(df)[, Diff := as.integer(Date1 - first(Date1)), Id]
Or in base R:
df$Diff <- with(df, ave(as.integer(Date1), Id, FUN = function(x) x - x[1]))
Replace the 0s with NA if you want the output exactly as shown above.
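For instance, a small sketch of two ways to do that, assuming the Diff column computed above: dplyr's na_if(), or plain base indexing.
library(dplyr)
df %>%
  group_by(Id) %>%
  mutate(Diff = na_if(as.integer(Date1 - first(Date1)), 0L)) %>%
  ungroup()

# or, after computing df$Diff by any of the methods above:
df$Diff[df$Diff == 0] <- NA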

Related

Split data frame based on group into list in defined order in R

I have a simple data.frame
ts_df <- data.frame(val = c(20,30,40,50,21,26,11,41,47,41),
                    cycle = c(3,3,3,3,2,2,2,1,1,1),
                    date = as.Date(c("1985-06-30","1985-09-30","1985-12-31","1986-03-31","1986-06-30","1986-09-30","1986-12-31","1987-03-31","1987-06-30","1987-09-30")))
and I need to split ts_df by group while keeping the resulting list ordered by date.
list_ts_df <- split(ts_df, ts_df$cycle)
So instead of
> list_ts_df
$`1`
val cycle date
8 41 1 1987-03-31
9 47 1 1987-06-30
10 41 1 1987-09-30
$`2`
val cycle date
5 21 2 1986-06-30
6 26 2 1986-09-30
7 11 2 1986-12-31
$`3`
val cycle date
1 20 3 1985-06-30
2 30 3 1985-09-30
3 40 3 1985-12-31
4 50 3 1986-03-31
I need desired output as
> list_ts_df
$`1`
val cycle date
1 20 3 1985-06-30
2 30 3 1985-09-30
3 40 3 1985-12-31
4 50 3 1986-03-31
$`2`
val cycle date
5 21 2 1986-06-30
6 26 2 1986-09-30
7 11 2 1986-12-31
$`3`
val cycle date
8 41 1 1987-03-31
9 47 1 1987-06-30
10 41 1 1987-09-30
Is there any simple way to achieve the desired output? Thank you very much for any advice.
We can order the dataset by date first and then split on 'cycle', creating a factor whose levels are the unique elements in their order of appearance:
t1 <- ts_df[order(ts_df$date),]
split(t1, factor(t1$cycle, levels = unique(t1$cycle)))
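Note the element names will then be "3", "2", "1", following the factor levels; if you want them numbered positionally, as printed in the desired output, one small sketch:
list_ts_df <- split(t1, factor(t1$cycle, levels = unique(t1$cycle)))
names(list_ts_df) <- seq_along(list_ts_df) # renames to "1", "2", "3" in order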

Finding discrepancy between two data sets when setdiff is not working

I have data for the spot price and the day-ahead price for hour 2 and hour 3. They are shown below, and run from 2015-12-31 all the way back to 2011-01-01.
> head(da2)
Date Price Hour
43802 2015-12-31 12.56 2
43778 2015-12-30 23.59 2
43754 2015-12-29 17.07 2
> head(sp2)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 2 17.15
2 2015-12-30 2 26.23
3 2015-12-29 2 23.01
> head(da3)
Date Price Hour
43803 2015-12-31 10.46 3
43779 2015-12-30 23.55 3
43755 2015-12-29 16.52 3
> head(sp3)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 3 12.96
2 2015-12-30 3 25.65
3 2015-12-29 3 23.59
I tried to put da2$Price and sp2$Price together, and the same again for hour 3.
But unfortunately, I get this:
> rpdf2<-data.frame(da2$Date,da2$Price,sp2$Price)
Error in data.frame(da2$Date, da2$Price, sp2$Price) :
arguments imply differing number of rows: 1826, 1822
> rpdf3<-data.frame(da3$Date,da3$Price,sp3$Price)
Error in data.frame(da3$Date, da3$Price, sp3$Price) :
arguments imply differing number of rows: 1821, 1825
So I applied
> setdiff(paste(da2$Date), paste(sp2$Date))
and found
[1] "2014-03-30" "2013-03-31" "2012-03-25" "2011-03-27"
which was fine. But when I did setdiff(paste(da3$Date), paste(sp3$Date)), it showed me character(0).
There must be a difference of four observations, but I cannot find those four. Can anyone help me with this situation? Thank you.
When I run setdiff(da3$Date, sp3$Date) instead (without paste()), the result is
[1] 16800.04 16799.04 16798.04 16797.04 16796.04 16795.04 16794.04 16793.04 16792.04 16791.04 16790.04 16789.04 16788.04 16787.04 16786.04 16785.04 16784.04
[18] 16783.04 16782.04 16781.04 16780.04 16779.04 16778.04 16777.04 16776.04 16775.04 16774.04 16773.04 16772.04 16771.04 16770.04 16769.04 16768.04 16767.04
[35] 16766.04 16765.04 16764.04 16763.04 16762.04 16761.04 16760.04 16759.04 16758.04 16757.04 16756.04 16755.04 16754.04 16753.04 16752.04 16751.04 16750.04
[52] 16749.04 16748.04 16747.04 16746.04 16745.04 16744.04 16743.04 16742.04 16741.04 16740.04 16739.04 16738.04 16737.04 16736.04 16735.04 16734.04 16733.04
[69] 16732.04 16731.04 16730.04 16729.04 16728.04 16727.04 16726.04 16725.04 16724.04 16723.04 16722.04 16721.04 16720.04 16719.04 16718.04 16717.04 16716.04
[86] 16715.04 16714.04 16713.04 16712.04 16711.04 16710.04 16709.04 16708.04 16707.04 16706.04 16705.04 16704.04 16703.04 16702.04 16701.04 16700.04 16699.04
and so on.
One way (of many) to tackle this, instead of looking directly for the differences, is to find a way to join your tables that will work regardless. To do so you simply need to generate a complete sequence of all dates from the first date in your data to the last, then left-join it to each of your day-ahead and spot price data frames in turn. Missing date rows in each table will then show up as NA columns in the joined result.
An example sequence, shortened to one month only for illustration:
somedates = seq(as.Date("2015-12-01"), as.Date("2015-12-31"), by = "day")
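For the full range in the question you'd instead start at 2011-01-01:
somedates = seq(as.Date("2011-01-01"), as.Date("2015-12-31"), by = "day")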
Now generate some test data, each table with four randomly missing dates, to simulate your da2, da3, sp2 and sp3 tables:
library(dplyr)
set.seed(0)
da2 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 2, price = 20)
set.seed(1)
da3 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 3, price = 21)
set.seed(2)
sp2 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 2, price = 19)
set.seed(3)
sp3 = data.frame(Date = sample(somedates, 27)) %>%
  mutate(hour = 3, price = 18)
Joining the da2, da3, sp2 and sp3 tables
With the test data generated, joining the tables to the complete sequence of dates (as a data frame) is straightforward. (NB I haven't replaced the joined column names with more meaningful versions in the result below).
all =
  left_join(data.frame(Date = somedates), da2, by = "Date") %>%
  left_join(da3, by = "Date") %>%
  left_join(sp2, by = "Date") %>%
  left_join(sp3, by = "Date")
Results from the test data joined
> all
Date hour.x price.x hour.y price.y hour.x.x price.x.x hour.y.y price.y.y
1 2015-12-01 2 20 3 21 2 19 3 18
2 2015-12-02 2 20 3 21 2 19 3 18
3 2015-12-03 NA NA 3 21 2 19 3 18
4 2015-12-04 2 20 3 21 2 19 3 18
5 2015-12-05 2 20 3 21 2 19 3 18
6 2015-12-06 2 20 3 21 2 19 3 18
7 2015-12-07 2 20 3 21 2 19 NA NA
8 2015-12-08 2 20 3 21 2 19 3 18
9 2015-12-09 2 20 3 21 NA NA 3 18
10 2015-12-10 2 20 3 21 NA NA 3 18
11 2015-12-11 2 20 3 21 2 19 3 18
12 2015-12-12 NA NA 3 21 2 19 3 18
13 2015-12-13 2 20 NA NA 2 19 NA NA
14 2015-12-14 2 20 3 21 2 19 3 18
15 2015-12-15 2 20 3 21 2 19 3 18
16 2015-12-16 2 20 3 21 2 19 3 18
17 2015-12-17 2 20 3 21 2 19 3 18
18 2015-12-18 2 20 NA NA 2 19 3 18
19 2015-12-19 NA NA 3 21 2 19 3 18
20 2015-12-20 2 20 NA NA NA NA 3 18
21 2015-12-21 2 20 3 21 2 19 3 18
22 2015-12-22 2 20 3 21 2 19 3 18
23 2015-12-23 2 20 3 21 2 19 3 18
24 2015-12-24 2 20 3 21 2 19 NA NA
25 2015-12-25 2 20 3 21 2 19 3 18
26 2015-12-26 2 20 3 21 2 19 3 18
27 2015-12-27 2 20 3 21 2 19 3 18
28 2015-12-28 2 20 3 21 2 19 3 18
29 2015-12-29 2 20 3 21 2 19 3 18
30 2015-12-30 2 20 3 21 NA NA 3 18
31 2015-12-31 NA NA NA NA 2 19 NA NA
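Once joined, the dates missing from each source can be read straight off the NA pattern; a short sketch using the suffixed column names the joins produced above:
all$Date[is.na(all$price.x)]   # dates missing from da2
all$Date[is.na(all$price.y)]   # dates missing from da3
all$Date[is.na(all$price.x.x)] # dates missing from sp2
all$Date[is.na(all$price.y.y)] # dates missing from sp3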
Edit: I note the numeric dates you posted as a result of your setdiff have a 0.04 time component on top of the whole-number date. For the join to work you would either need to add that time component to the generated date sequence or convert each date-time to a whole day. The latter can be done fairly simply:
da2$Date = trunc.Date(da2$Date, "days")
da3$Date = trunc.Date(da3$Date, "days")
sp2$Date = trunc.Date(sp2$Date, "days")
sp3$Date = trunc.Date(sp3$Date, "days")
You'd do this before the joins.
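Alternatively, a sketch assuming the Date columns are POSIXct date-times: convert them to plain Date class before building the join (note that as.Date() on POSIXct uses UTC unless you pass tz, so match it to your data's timezone):
da2$Date <- as.Date(da2$Date) # assumes POSIXct input; pass tz = ... if needed
da3$Date <- as.Date(da3$Date)
sp2$Date <- as.Date(sp2$Date)
sp3$Date <- as.Date(sp3$Date)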

reshaping 3 columns into matrix

I have 3 columns of data that I would like to reshape into a matrix where the columns are the created_at values and the rows are the citibike_station_id values.
head(sample)
available_bike_count created_at citibike_station_id
1 21 2015-10-08 00:00:00 72
2 7 2015-10-08 20:10:00 72
3 18 2015-10-08 06:50:00 72
4 19 2015-10-08 10:10:00 72
5 18 2015-10-08 02:30:00 72
6 17 2015-10-08 05:00:00 72
> dim(sample)
[1] 69511 3
Therefore, I have to group by created_at and by citibike_station_id
> length(unique(sample$created_at))
[1] 145
> length(unique(sample$citibike_station_id))
[1] 482
created_at represents 10-minute time intervals: there should be 145 columns, as there are 145 unique time intervals (representing one day of data), and 482 rows, as there are 482 unique values of citibike_station_id.
This is an example of what the data should look like in the end - however, in this example the column names are from a different day and year.
head(data[1:6])
station_id X2014.08.18.20.00.00 X2014.08.18.20.10.00 X2014.08.18.20.20.00
1 1 1 0 0
2 2 18 18 19
3 3 5 4 4
4 4 21 20 20
5 5 9 10 8
6 6 9 9 9
X2014.08.18.20.30.00 X2014.08.18.20.40.00
1 2 1
2 18 18
3 4 4
4 21 22
5 5 7
6 9 9
How would one do this with dplyr and tidyr?
library(dplyr)
library(tidyr)
matrix <- sample %>%
  group_by(created_at, citibike_station_id) %>%
  spread(citibike_station_id, created_at)
However, this does not work. Would the reshape2 package provide a better solution?
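For the record, a hedged sketch of a reshape that should work, assuming each station/time combination appears at most once: spread() wants the future column names (created_at) as its key and the cell contents (available_bike_count) as its value, with no group_by() needed. reshape2::dcast() expresses the same pivot as a formula:
library(dplyr)
library(tidyr)
wide <- sample %>%
  spread(created_at, available_bike_count) # rows: citibike_station_id, columns: created_at

# or, with reshape2
library(reshape2)
wide <- dcast(sample, citibike_station_id ~ created_at,
              value.var = "available_bike_count")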

Conditional column-wise or row-wise subtraction in a data frame

I need to do a column-wise subtraction and row-wise subtraction in R.
id on fail
1 10-10-2014 11-11-2014
1 11-10-2014 12-12-2014
1 12-10-2014 12-01-2015
2 13-10-2014 12-02-2015
2 14-10-2014 15-03-2015
2 15-10-2014 15-04-2015
2 16-10-2014 16-05-2015
3 17-10-2014 16-06-2015
3 18-10-2014 17-07-2015
3 19-10-2014 17-08-2015
3 20-10-2014 17-09-2015
For example, in the above table, whenever a new id appears the subtraction should be column-wise (that row's fail minus its on date); otherwise it should be row-wise (the current fail minus the previous row's fail). I need a result like this:
id on fail res
1 10-10-2014 11-11-2014 32
1 11-10-2014 12-12-2014 31
1 12-10-2014 12-01-2015 31
2 13-10-2014 12-02-2015 122
2 14-10-2014 15-03-2015 31
2 15-10-2014 15-04-2015 31
2 16-10-2014 16-05-2015 31
3 17-10-2014 16-06-2015 242
3 18-10-2014 17-07-2015 31
3 19-10-2014 17-08-2015 31
3 20-10-2014 17-09-2015 31
As of now I am using the following code:
data[,2] <- as.Date(data[,2],format="%d-%m-%Y")
data[,3] <- as.Date(data[,3],format="%d-%m-%Y")
x <- as.numeric(diff(data[,3]))
DF <- read.table(text="id on fail
1 10-10-2014 11-11-2014
1 11-10-2014 12-12-2014
1 12-10-2014 12-01-2015
2 13-10-2014 12-02-2015
2 14-10-2014 15-03-2015
2 15-10-2014 15-04-2015
2 16-10-2014 16-05-2015
3 17-10-2014 16-06-2015
3 18-10-2014 17-07-2015
3 19-10-2014 17-08-2015
3 20-10-2014 17-09-2015", header=TRUE)
DF[, 2:3] <- lapply(DF[, 2:3], as.Date, format = "%d-%m-%Y")
# Row-wise: difference between consecutive 'fail' dates (NA for the very first row)
DF$res <- c(NA, diff(DF$fail))
# Column-wise: where a new id starts, overwrite res with that row's fail - on
first_of_id <- c(TRUE, diff(DF$id) != 0)
DF[first_of_id, "res"] <- DF[first_of_id, "fail"] - DF[first_of_id, "on"]
# id on fail res
# 1 1 2014-10-10 2014-11-11 32
# 2 1 2014-10-11 2014-12-12 31
# 3 1 2014-10-12 2015-01-12 31
# 4 2 2014-10-13 2015-02-12 122
# 5 2 2014-10-14 2015-03-15 31
# 6 2 2014-10-15 2015-04-15 31
# 7 2 2014-10-16 2015-05-16 31
# 8 3 2014-10-17 2015-06-16 242
# 9 3 2014-10-18 2015-07-17 31
# 10 3 2014-10-19 2015-08-17 31
# 11 3 2014-10-20 2015-09-17 31
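A dplyr version of the same logic is possible too; a sketch, grouping by id and using lag() for the row-wise case:
library(dplyr)
DF %>%
  group_by(id) %>%
  mutate(res = as.integer(ifelse(row_number() == 1,
                                 fail - on,               # first row of an id: column-wise
                                 fail - lag(fail)))) %>%  # otherwise: row-wise
  ungroup()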

ddply-type functionality on multiple dataframes

I have two dataframes that are structured as follows:
Dataframe A:
id sqft traf month
1 1030 16 35 1
1 1030 15 32 2
2 1027 1 31 1
2 1027 2 31 2
Dataframe B:
id price frequency month day
1 1030 8 196 1 1
2 1030 9 101 1 15
3 1030 10 156 1 30
4 1030 3 137 2 1
5 1030 7 190 2 15
6 1027 6 188 1 1
7 1027 1 198 1 15
8 1027 2 123 1 30
9 1027 4 185 2 1
10 1027 5 122 2 15
I want to output certain summary statistics (centered around each unique id) from both of these dataframes. This would be easy with ddply if, say, I wanted the mean price for each id and month (split by id and month) from Dataframe B, or the average ratio of sqft to traf for each id (split by id).
But what would be a potential solution if I wanted to build combined variables from both dataframes? For instance, how would I get the average price for each id/month (Dataframe B) divided by the sqft for that id/month (Dataframe A)?
The varying frequencies at which the two dataframes are measured make combining them not easily doable. The only solution I've found so far is to ddply the first dataframe to extract the average sqft per id/month and then pass that value into a second ddply call on the second dataframe.
Is there a more efficient, less convoluted way to do this? I would be splitting both dataframes on the same variables (id and month).
Thanks in advance for any suggestions!
In the case of the sample data, you could merge the two data sets like this (by specifying all.y = TRUE you can make sure that all rows of dfb are kept and, in this case, corresponding entries of dfa are repeated accordingly)
dfall <- merge(dfa, dfb, by = c("id", "month"), all.y=TRUE)
# id month sqft traf price frequency day
#1 1027 1 1 31 6 188 1
#2 1027 1 1 31 1 198 15
#3 1027 1 1 31 2 123 30
#4 1027 2 2 31 4 185 1
#5 1027 2 2 31 5 122 15
#6 1030 1 16 35 8 196 1
#7 1030 1 16 35 9 101 15
#8 1030 1 16 35 10 156 30
#9 1030 2 15 32 3 137 1
#10 1030 2 15 32 7 190 15
Then, you can use ddply as usual:
ddply(dfall, .(id, month), mutate, newcol = mean(price)/sqft)
# id month sqft traf price frequency day newcol
#1 1027 1 1 31 6 188 1 3.0000000
#2 1027 1 1 31 1 198 15 3.0000000
#3 1027 1 1 31 2 123 30 3.0000000
#4 1027 2 2 31 4 185 1 2.2500000
#5 1027 2 2 31 5 122 15 2.2500000
#6 1030 1 16 35 8 196 1 0.5625000
#7 1030 1 16 35 9 101 15 0.5625000
#8 1030 1 16 35 10 156 30 0.5625000
#9 1030 2 15 32 3 137 1 0.3333333
#10 1030 2 15 32 7 190 15 0.3333333
Edit: if you're looking for better performance, consider using dplyr instead of plyr. The equivalent dplyr code (including the merge) is:
library(dplyr)
dfall <- dfb %>%
  left_join(dfa, by = c("id", "month")) %>%
  group_by(id, month) %>%
  dplyr::mutate(newcol = mean(price)/sqft) # dplyr:: added to avoid confusion with plyr::mutate
Of course, you could also check out data.table which is also very efficient.
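For instance, a data.table sketch of the same merge-then-mutate (dfa[dfb, on = ...] keeps all rows of dfb, mirroring the left join above):
library(data.table)
setDT(dfa); setDT(dfb)
dfall <- dfa[dfb, on = c("id", "month")] # join keeping every row of dfb
dfall[, newcol := mean(price) / sqft, by = .(id, month)]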
AFAIK ddply is not designed to be used with different data frames at the same time.
dplyr does well here. This code merges the data frames, gets price and sqft means by unique id/month combination, then creates a new variable pricePerSqft.
require(dplyr)
dfa %>%
  left_join(dfb, by = c("id", "month")) %>%
  group_by(id, month) %>%
  summarize(
    avgPrice = mean(price),
    avgSqft = mean(sqft)) %>%
  mutate(pricePerSqft = round(avgPrice / avgSqft, 2))
Here's the result:
id month avgPrice avgSqft pricePerSqft
1 1027 1 3.0 1 3.00
2 1027 2 4.5 2 2.25
3 1030 1 9.0 16 0.56
4 1030 2 5.0 15 0.33
