I am trying to sort my data frame by two columns. The first column is a number, either 0 or 6317, and the second column is a date in mon-yy format (e.g. Sep-13), from January to December across different years.
Below is a selection of my data frame:
number date count
1 0 Sep-13 10
2 0 Jan-12 15
3 0 Feb-13 4
4 0 Oct-12 13
5 0 Nov-13 14
6 6317 Jan-12 20
7 6317 Nov-13 40
8 6317 Dec-13 20
9 6317 Feb-13 10
10 6317 Oct-12 15
11 6317 Oct-13 19
I have used the following commands:
orderbydate <- count[order(as.Date(count$date, format = "%b-%y")), ]
and
orderbydate <- count[order(count[, 1], count[, 2]), ]
I was planning on having it look something like this in the end:
date     6317    0
Jan-12     20   15
Feb-12      8   10
Mar-12     15   20
...       ...  ...
Jan-13     18   19
Feb-13     10    4
Mar-13     14    2
Apr-13     11    9
We can convert to yearmon class with zoo and then arrange
library(dplyr)
library(zoo)
count %>%
arrange(number, as.yearmon(date, '%b-%y'))
Or, without using packages, convert 'date' to Date class by pasting on a day (e.g. "01") and then order:
count[order(count$number, as.Date(paste0(count$date, "-01"), "%b-%y-%d")),]
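As a quick sanity check of the format string (a sketch; note that %b is locale-dependent, so month abbreviations must match your locale):
> as.Date(paste0("Sep-13", "-01"), "%b-%y-%d")
[1] "2013-09-01"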
You can use the cast() function from the reshape package. The code below will yield the desired wide format (you may still need to order the rows chronologically, as above):
library(reshape)
new_df <- cast(count, date~number)
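The reshape package is quite old; for reference, a hedged equivalent with tidyr (assumes tidyr 1.0+ for pivot_wider()), reusing the paste trick from the base R answer above to order the rows chronologically:
library(tidyr)
# one column per 'number', one row per month
wide <- pivot_wider(count, names_from = number, values_from = count)
wide <- wide[order(as.Date(paste0(wide$date, "-01"), "%b-%y-%d")), ]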
Here's another option without using any packages:
DATA:
number date count
0 Sep-13 10
0 Jan-12 15
0 Feb-13 4
0 Oct-12 13
0 Nov-13 14
6317 Jan-12 20
6317 Nov-13 40
6317 Dec-13 20
6317 Feb-13 10
6317 Oct-12 15
6317 Oct-13 19
CODE:
dt <- read.table('clipboard', header = T, stringsAsFactors = F)
# append a dummy day so '%b-%y-%d' can parse a complete date
dt$date <- as.Date(paste(dt$date, '01', sep = '-'), format = '%b-%y-%d')
> dt
number date count
1 0 2013-09-01 10
2 0 2012-01-01 15
3 0 2013-02-01 4
4 0 2012-10-01 13
5 0 2013-11-01 14
6 6317 2012-01-01 20
7 6317 2013-11-01 40
8 6317 2013-12-01 20
9 6317 2013-02-01 10
10 6317 2012-10-01 15
11 6317 2013-10-01 19
To get what you indicated above, we can use merge:
> merge(dt[dt$number == 6317, 2:3], dt[dt$number == 0, 2:3], by = 'date', suffixes = c('_6317', '_0'), all = TRUE)
date count_6317 count_0
1 2012-01-01 20 15
2 2012-10-01 15 13
3 2013-02-01 10 4
4 2013-09-01 NA 10
5 2013-10-01 19 NA
6 2013-11-01 40 14
7 2013-12-01 20 NA
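If you want the date labels back in the original Mon-YY form, format() them afterwards (a sketch, assuming the merge result above is saved as res; the rows are already in chronological order, so converting to character is safe):
res <- merge(dt[dt$number == 6317, 2:3], dt[dt$number == 0, 2:3],
             by = 'date', suffixes = c('_6317', '_0'), all = TRUE)
res$date <- format(res$date, '%b-%y')  # Date -> 'Jan-12' style labels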
I have the following data:
Date Price
2-Jul-13 20
3-Jul-13 22
4-Jul-13 30
5-Jul-13 18
8-Jul-13 12
9-Jul-13 24
10-Jul-13 28
11-Jul-13 14
The output has to be:
Date       Price  day_diff  week_diff
2-Jul-13      20         0          4
3-Jul-13      22         2
4-Jul-13      30         8
5-Jul-13      18       -12
8-Jul-13      12        -6
9-Jul-13      24        12         -4
10-Jul-13     28         4
11-Jul-13     14       -14
12-Jul-13     18         4
15-Jul-13     12        -6
16-Jul-13     20         8        ...
...
To calculate day_diff, the first value is taken as 0 and then 22 - 20 = 2, and so on. To calculate week_diff, the next week starts on 9-Jul-13, so 24 - 20 = 4; similarly, the following week starts on 16-Jul-13, so 20 - 24 = -4, and so on.
Please help me with this.
Please provide your data with dput() in the future.
The data:
read.table(
text = " Date Price
2-Jul-13 20
3-Jul-13 22
4-Jul-13 30
5-Jul-13 18
8-Jul-13 12
9-Jul-13 24
10-Jul-13 28
11-Jul-13 14
12-Jul-13 18
15-Jul-13 12
16-Jul-13 20 ",
header = T
) -> df
Solution:
library(tidyverse)
library(lubridate)
df %>%
  mutate(
    Date = dmy(Date),
    # day-over-day change in Price (NA for the first row)
    day_diff = Price - lag(Price),
    # start of the week containing each Date (weeks begin on Tuesday)
    week_date = floor_date(Date, unit = 'week', week_start = 2),
    # Price on week-start rows, 0 otherwise
    week_number = ifelse(Date == week_date, Price, 0),
    # next week's starting price minus this week's; lead(, 5) works
    # because the data has exactly five weekday rows per week
    week_diff = lead(week_number, 5) - week_number
  ) %>%
  select(Date, Price, day_diff, week_diff) -> output_df
Output
> output_df
Date Price day_diff week_diff
1 2013-07-02 20 NA 4
2 2013-07-03 22 2 0
3 2013-07-04 30 8 0
4 2013-07-05 18 -12 0
5 2013-07-08 12 -6 0
6 2013-07-09 24 12 -4
7 2013-07-10 28 4 NA
8 2013-07-11 14 -14 NA
9 2013-07-12 18 4 NA
10 2013-07-15 12 -6 NA
11 2013-07-16 20 8 NA
I have a dataset that looks like this.
Id Date1 Cars
1 2007-04-05 72
2 2014-01-07 12
2 2018-07-09 10
2 2018-07-09 13
3 2005-11-19 22
3 2005-11-23 13
4 2010-06-17 38
4 2010-09-23 57
4 2010-09-23 41
4 2010-10-04 17
What I would like to do is, for each Id, get the date difference of every row with respect to the earliest date for that Id: (2nd earliest - earliest), (3rd earliest - earliest), (4th earliest - earliest), and so on.
I would end up with a dataset like this
Id Date1 Cars Diff
1 2007-04-05 72 NA
2 2014-01-07 12 NA
2 2018-07-09 10 1644 = (2018-07-09 - 2014-01-07)
2 2018-07-09 13 1644 = (2018-07-09 - 2014-01-07)
3 2005-11-19 22 NA
3 2005-11-23 13 4 = (2005-11-23 - 2005-11-19)
4 2010-06-17 38 NA
4 2010-09-23 57 98 = (2010-09-23 - 2010-06-17)
4 2010-09-23 41 98 = (2010-09-23 - 2010-06-17)
4 2010-10-04 17 109 = (2010-10-04 - 2010-06-17)
I am unclear on how to accomplish this. Any help would be much appreciated. Thanks
Change Date1 to Date class:
df$Date1 = as.Date(df$Date1)
You can then subtract the first value within each Id. This can be done with dplyr:
library(dplyr)
df %>% group_by(Id) %>% mutate(Diff = as.integer(Date1 - first(Date1)))
# Id Date1 Cars Diff
# <int> <date> <int> <int>
# 1 1 2007-04-05 72 0
# 2 2 2014-01-07 12 0
# 3 2 2018-07-09 10 1644
# 4 2 2018-07-09 13 1644
# 5 3 2005-11-19 22 0
# 6 3 2005-11-23 13 4
# 7 4 2010-06-17 38 0
# 8 4 2010-09-23 57 98
# 9 4 2010-09-23 41 98
#10 4 2010-10-04 17 109
Or with data.table:
setDT(df)[, Diff := as.integer(Date1 - first(Date1)), Id]
Or with base R:
df$diff <- with(df, ave(as.integer(Date1), Id, FUN = function(x) x - x[1]))
Replace the 0s with NA if you want the output exactly as shown.
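A minimal sketch of that replacement (assuming the dplyr result above; blanking only the first row of each Id avoids wiping out any genuine zero-day gaps):
df %>%
  group_by(Id) %>%
  mutate(Diff = as.integer(Date1 - first(Date1)),
         # blank the reference row itself rather than every zero
         Diff = replace(Diff, row_number() == 1, NA)) %>%
  ungroup()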
I was looking to separate rows of data by Cue and add a row that calculates the per-column average for each group. Here is an example:
Before:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379
After:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 0.67978 0.51071 0.31723
4 4 22 0.26855 0.17487 0.22461
5 4 20 0.15106 0.48767 0.49072
6 0.209 0.331 0.357
7 7 18 0.11627 0.12604 0.2832
8 7 24 0.50201 0.14252 0.21454
9 0.309 0.134 0.248
10 12 16 0.27649 0.96008 0.42114
11 12 18 0.60852 0.21637 0.18799
12 0.442 0.588 0.304
13 22 20 0.32867 0.65308 0.29388
14 22 24 0.25726 0.37048 0.32379
15 0.292 0.511 0.308
So in the "after" example, line 3 is the average of lines 1 and 2 (line 6 is the average of lines 4 and 5, etc...).
Any help/information would be greatly appreciated!
Thank you!
You can use base R to do something like this:
Reduce(rbind, by(data, data[1], function(x) rbind(x, c(NA, NA, colMeans(x[-(1:2)])))))
Cue ITI a b c
1 0 16 0.820620 0.521850 0.276790
2 0 24 0.538940 0.499570 0.357670
3 NA NA 0.679780 0.510710 0.317230
32 4 22 0.268550 0.174870 0.224610
4 4 20 0.151060 0.487670 0.490720
31 NA NA 0.209805 0.331270 0.357665
5 7 18 0.116270 0.126040 0.283200
6 7 24 0.502010 0.142520 0.214540
33 NA NA 0.309140 0.134280 0.248870
7 12 16 0.276490 0.960080 0.421140
8 12 18 0.608520 0.216370 0.187990
34 NA NA 0.442505 0.588225 0.304565
9 22 20 0.328670 0.653080 0.293880
10 22 24 0.257260 0.370480 0.323790
35 NA NA 0.292965 0.511780 0.308835
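The row names come out jumbled (3, 32, 4, 31, ...) because rbind() de-duplicates them; if that bothers you, reset them (a sketch, assuming the result above is saved as out):
out <- Reduce(rbind, by(data, data[1],
                        function(x) rbind(x, c(NA, NA, colMeans(x[-(1:2)])))))
rownames(out) <- NULL  # renumber the rows 1..n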
Here is one idea: split the data frame, perform the computation on each piece, and then combine the pieces.
DF_list <- split(DF, f = DF$Cue)
DF_list2 <- lapply(DF_list, function(x){
df_temp <- as.data.frame(t(colMeans(x[, -c(1, 2)])))
df_temp[, c("Cue", "ITI")] <- NA
df <- rbind(x, df_temp)
return(df)
})
DF2 <- do.call(rbind, DF_list2)
rownames(DF2) <- 1:nrow(DF2)
DF2
# Cue ITI a b c
# 1 0 16 0.820620 0.521850 0.276790
# 2 0 24 0.538940 0.499570 0.357670
# 3 NA NA 0.679780 0.510710 0.317230
# 4 4 22 0.268550 0.174870 0.224610
# 5 4 20 0.151060 0.487670 0.490720
# 6 NA NA 0.209805 0.331270 0.357665
# 7 7 18 0.116270 0.126040 0.283200
# 8 7 24 0.502010 0.142520 0.214540
# 9 NA NA 0.309140 0.134280 0.248870
# 10 12 16 0.276490 0.960080 0.421140
# 11 12 18 0.608520 0.216370 0.187990
# 12 NA NA 0.442505 0.588225 0.304565
# 13 22 20 0.328670 0.653080 0.293880
# 14 22 24 0.257260 0.370480 0.323790
# 15 NA NA 0.292965 0.511780 0.308835
DATA
DF <- read.table(text = " Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379", header = TRUE)
A data.table approach; if someone can offer improvements, I'd be keen to hear them.
library(data.table)
dt <- data.table(df)
dt2 <- dt[, lapply(.SD, mean), by = Cue][,ITI := NA][]
data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
Cue ITI a b c
1: 0 16 0.820620 0.521850 0.276790
2: 0 24 0.538940 0.499570 0.357670
3: NA NA 0.679780 0.510710 0.317230
4: 4 22 0.268550 0.174870 0.224610
5: 4 20 0.151060 0.487670 0.490720
6: NA NA 0.209805 0.331270 0.357665
If you want to keep the Cue values on the average rows to identify each group, just drop the [is.na(ITI), Cue := NA] from the last line.
I would use group_by() and summarise() from the dplyr package to get a data frame with the average values, then rbind() the new data frame to the original one and sort by Cue:
df_averages <- df_orig %>%
  group_by(Cue) %>%
  summarise(ITI = NA, a = mean(a), b = mean(b), c = mean(c)) %>%
  ungroup()
df_all <- rbind(df_orig, df_averages) %>% arrange(Cue)
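A more compact variant of the same idea (an assumption on my part: dplyr 1.0+ for across()):
df_averages <- df_orig %>%
  group_by(Cue) %>%
  # average every measurement column at once; ITI stays blank on average rows
  summarise(ITI = NA, across(a:c, mean), .groups = 'drop')
df_all <- rbind(df_orig, df_averages) %>% arrange(Cue)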
I have data for the spot price and the day-ahead price for hour 2 and hour 3, shown below. The rows run from 2015-12-31 back to 2011-01-01.
> head(da2)
Date Price Hour
43802 2015-12-31 12.56 2
43778 2015-12-30 23.59 2
43754 2015-12-29 17.07 2
> head(sp2)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 2 17.15
2 2015-12-30 2 26.23
3 2015-12-29 2 23.01
> head(da3)
Date Price Hour
43803 2015-12-31 10.46 3
43779 2015-12-30 23.55 3
43755 2015-12-29 16.52 3
> head(sp3)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 3 12.96
2 2015-12-30 3 25.65
3 2015-12-29 3 23.59
I tried to put da2$Price and sp2$Price together, and the same again for hour 3, but unfortunately I get this:
> rpdf2<-data.frame(da2$Date,da2$Price,sp2$Price)
Error in data.frame(da2$Date, da2$Price, sp2$Price) :
arguments imply differing number of rows: 1826, 1822
> rpdf3<-data.frame(da3$Date,da3$Price,sp3$Price)
Error in data.frame(da3$Date, da3$Price, sp3$Price) :
arguments imply differing number of rows: 1821, 1825
So I applied setdiff(paste(da2$Date), paste(sp2$Date)) and found
[1] "2014-03-30" "2013-03-31" "2012-03-25" "2011-03-27"
which was fine. But when I did setdiff(paste(da3$Date), paste(sp3$Date)), it showed me character(0).
There must be a difference of four observations, but I cannot find them. Can anyone help me with this situation? Thank you.
When I run setdiff(da3$Date, sp3$Date) instead, the result is
[1] 16800.04 16799.04 16798.04 16797.04 16796.04 16795.04 16794.04 16793.04 16792.04 16791.04 16790.04 16789.04 16788.04 16787.04 16786.04 16785.04 16784.04
[18] 16783.04 16782.04 16781.04 16780.04 16779.04 16778.04 16777.04 16776.04 16775.04 16774.04 16773.04 16772.04 16771.04 16770.04 16769.04 16768.04 16767.04
[35] 16766.04 16765.04 16764.04 16763.04 16762.04 16761.04 16760.04 16759.04 16758.04 16757.04 16756.04 16755.04 16754.04 16753.04 16752.04 16751.04 16750.04
[52] 16749.04 16748.04 16747.04 16746.04 16745.04 16744.04 16743.04 16742.04 16741.04 16740.04 16739.04 16738.04 16737.04 16736.04 16735.04 16734.04 16733.04
[69] 16732.04 16731.04 16730.04 16729.04 16728.04 16727.04 16726.04 16725.04 16724.04 16723.04 16722.04 16721.04 16720.04 16719.04 16718.04 16717.04 16716.04
[86] 16715.04 16714.04 16713.04 16712.04 16711.04 16710.04 16709.04 16708.04 16707.04 16706.04 16705.04 16704.04 16703.04 16702.04 16701.04 16700.04 16699.04
and so on.
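Those values are days since 1970-01-01 with a fractional time-of-day attached; setdiff() drops the Date class and returns a bare numeric vector. Restoring the class shows the actual dates, e.g.:
> as.Date(16800.04, origin = "1970-01-01")
[1] "2015-12-31"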
One way (of many) to tackle this is, instead of looking directly for the differences, to find a way to join your tables that will work regardless. To do so, you simply generate a complete sequence of all dates from the first date in your data to the last, then left-join it to each of your day-ahead and spot price data frames in turn. Missing date rows in each table will show as NA columns in the resulting joined table.
An example sequence, shortened to one month for this example; you'd start it at 2011-01-01 instead:
somedates = seq(as.Date("2015-12-01"), as.Date("2015-12-31"), by = "day")
Generate some test data, each table with four randomly missing dates, to simulate your da2, da3, sp2 and sp3 tables:
library(dplyr)
set.seed(0)
da2 = data.frame(Date = sample(somedates, 27)) %>%
mutate(hour = 2, price = 20)
set.seed(1)
da3 = data.frame(Date = sample(somedates, 27)) %>%
mutate(hour = 3, price = 21)
set.seed(2)
sp2 = data.frame(Date = sample(somedates, 27)) %>%
mutate(hour = 2, price = 19)
set.seed(3)
sp3 = data.frame(Date = sample(somedates, 27)) %>%
mutate(hour = 3, price = 18)
Joining the da2, da3, sp2 and sp3 tables
With the test data generated, joining the tables to the complete sequence of dates (as a data frame) is straightforward. (NB I haven't replaced the joined column names with more meaningful versions in the result below).
all =
left_join(data.frame(Date = somedates), da2, by = "Date") %>%
left_join(da3, by = "Date") %>%
left_join(sp2, by = "Date") %>%
left_join(sp3, by = "Date")
Results from the test data joined
> all
Date hour.x price.x hour.y price.y hour.x.x price.x.x hour.y.y price.y.y
1 2015-12-01 2 20 3 21 2 19 3 18
2 2015-12-02 2 20 3 21 2 19 3 18
3 2015-12-03 NA NA 3 21 2 19 3 18
4 2015-12-04 2 20 3 21 2 19 3 18
5 2015-12-05 2 20 3 21 2 19 3 18
6 2015-12-06 2 20 3 21 2 19 3 18
7 2015-12-07 2 20 3 21 2 19 NA NA
8 2015-12-08 2 20 3 21 2 19 3 18
9 2015-12-09 2 20 3 21 NA NA 3 18
10 2015-12-10 2 20 3 21 NA NA 3 18
11 2015-12-11 2 20 3 21 2 19 3 18
12 2015-12-12 NA NA 3 21 2 19 3 18
13 2015-12-13 2 20 NA NA 2 19 NA NA
14 2015-12-14 2 20 3 21 2 19 3 18
15 2015-12-15 2 20 3 21 2 19 3 18
16 2015-12-16 2 20 3 21 2 19 3 18
17 2015-12-17 2 20 3 21 2 19 3 18
18 2015-12-18 2 20 NA NA 2 19 3 18
19 2015-12-19 NA NA 3 21 2 19 3 18
20 2015-12-20 2 20 NA NA NA NA 3 18
21 2015-12-21 2 20 3 21 2 19 3 18
22 2015-12-22 2 20 3 21 2 19 3 18
23 2015-12-23 2 20 3 21 2 19 3 18
24 2015-12-24 2 20 3 21 2 19 NA NA
25 2015-12-25 2 20 3 21 2 19 3 18
26 2015-12-26 2 20 3 21 2 19 3 18
27 2015-12-27 2 20 3 21 2 19 3 18
28 2015-12-28 2 20 3 21 2 19 3 18
29 2015-12-29 2 20 3 21 2 19 3 18
30 2015-12-30 2 20 3 21 NA NA 3 18
31 2015-12-31 NA NA NA NA 2 19 NA NA
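Per the NB above, the suffixed names are not meaningful; one way to rename them (a sketch, assuming the column order printed above):
names(all) <- c('Date', 'hour_da2', 'price_da2', 'hour_da3', 'price_da3',
                'hour_sp2', 'price_sp2', 'hour_sp3', 'price_sp3')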
Edit: I note the numeric dates you posted as the result of your setdiff() have a 0.04 time component on top of the whole-number date. To get the join to work, you would either need to add this component to the date sequence or, more simply, truncate each date to a whole number. I have now tested this, and the truncation can be done fairly simply:
da2$Date = trunc.Date(da2$Date, "days")
da3$Date = trunc.Date(da3$Date, "days")
sp2$Date = trunc.Date(sp2$Date, "days")
sp3$Date = trunc.Date(sp3$Date, "days")
You'd do this before the joins.
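Alternatively, if the real Date columns are POSIXct (as the <dttm> printout for sp2 and sp3 suggests), converting with as.Date() drops the time component in one step (a sketch):
# as.Date() on POSIXct discards the time-of-day entirely
sp2$Date <- as.Date(sp2$Date)
sp3$Date <- as.Date(sp3$Date)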
I have 3 columns of data that I would like to reshape into a matrix where the columns are the created_at values and the rows are the citibike_station_id values.
head(sample)
available_bike_count created_at citibike_station_id
1 21 2015-10-08 00:00:00 72
2 7 2015-10-08 20:10:00 72
3 18 2015-10-08 06:50:00 72
4 19 2015-10-08 10:10:00 72
5 18 2015-10-08 02:30:00 72
6 17 2015-10-08 05:00:00 72
> dim(sample)
[1] 69511 3
Therefore, I have to group by created_at and by citibike_station_id
> length(unique(sample$created_at))
[1] 145
> length(unique(sample$citibike_station_id))
[1] 482
created_at represents 10-minute time intervals: there should be 145 columns, as there are 145 unique time intervals (representing one day of data), and 482 rows, as there are 482 unique values of citibike_station_id.
This is an example of what the data should look like in the end (in this example the column names are from a different day and year):
head(data[1:6])
station_id X2014.08.18.20.00.00 X2014.08.18.20.10.00 X2014.08.18.20.20.00
1 1 1 0 0
2 2 18 18 19
3 3 5 4 4
4 4 21 20 20
5 5 9 10 8
6 6 9 9 9
X2014.08.18.20.30.00 X2014.08.18.20.40.00
1 2 1
2 18 18
3 4 4
4 21 22
5 5 7
6 9 9
How would one do this with dplyr and tidyr?
library(dplyr)
library(tidyr)
matrix <- sample %>%
group_by(created_at, citibike_station_id)%>%
spread(citibike_station_id, created_at)
However, this does not work. Would the reshape2 package provide a better solution?
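A minimal sketch of one way this could work, assuming each station/interval pair appears at most once: spread() needs no group_by(); it takes a key column (which becomes the new columns) and a value column directly, and reshape2's dcast() gives the same result:
library(tidyr)
wide <- spread(sample, key = created_at, value = available_bike_count)

library(reshape2)
wide2 <- dcast(sample, citibike_station_id ~ created_at,
               value.var = 'available_bike_count')

# drop the station id column for a pure 482 x 145 matrix
m <- as.matrix(wide[, -1])
rownames(m) <- wide$citibike_station_id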