reshaping 3 columns into matrix - r

I have 3 columns of data that I would like to reshape into a matrix where the columns are created_at and rows are citibike_station_id
head(sample)
available_bike_count created_at citibike_station_id
1 21 2015-10-08 00:00:00 72
2 7 2015-10-08 20:10:00 72
3 18 2015-10-08 06:50:00 72
4 19 2015-10-08 10:10:00 72
5 18 2015-10-08 02:30:00 72
6 17 2015-10-08 05:00:00 72
> dim(sample)
[1] 69511 3
Therefore, I have to group by created_at and by citibike_station_id
> length(unique(sample$created_at))
[1] 145
> length(unique(sample$citibike_station_id))
[1] 482
created_at represents 10-minute time intervals, so there should be 145 columns (one per unique interval, covering one day of data) and 482 rows (one per unique value of citibike_station_id).
This is an example of what the data should look like in the end - however, in this example the column names are from a different day and year.
head(data[1:6])
station_id X2014.08.18.20.00.00 X2014.08.18.20.10.00 X2014.08.18.20.20.00
1 1 1 0 0
2 2 18 18 19
3 3 5 4 4
4 4 21 20 20
5 5 9 10 8
6 6 9 9 9
X2014.08.18.20.30.00 X2014.08.18.20.40.00
1 2 1
2 18 18
3 4 4
4 21 22
5 5 7
6 9 9
How would one do this with dplyr and tidyr?
library(dplyr)
library(tidyr)
matrix <- sample %>%
group_by(created_at, citibike_station_id)%>%
spread(citibike_station_id, created_at)
However, this does not work. Would the reshape2 package provide a better solution?
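One way to get the station-by-time layout described in the question is sketched here in base R with tapply(), which returns a matrix directly. The toy data frame `bikes` is an assumption standing in for `sample` (renamed so it does not mask base R's sample() function), and its values are illustrative only. With tidyr, `spread(sample, created_at, available_bike_count)` should give the same wide layout as a data frame; the attempt in the question most likely fails because the key and value arguments name the wrong columns.

```r
# Toy stand-in for the question's data (values are illustrative only).
bikes <- data.frame(
  available_bike_count = c(21, 7, 30),
  created_at = c("2015-10-08 00:00:00", "2015-10-08 00:10:00",
                 "2015-10-08 00:00:00"),
  citibike_station_id = c(72, 72, 79),
  stringsAsFactors = FALSE
)

# Rows = stations, columns = time stamps. Station/time pairs with no
# observation become NA; x[1] keeps the first value if a pair repeats.
bike_matrix <- tapply(
  bikes$available_bike_count,
  list(station = bikes$citibike_station_id, time = bikes$created_at),
  function(x) x[1]
)

dim(bike_matrix)  # stations by time stamps
```

On the full data this should produce a 482 x 145 matrix, provided each station/time pair occurs at most once.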

Related

r group by date difference with respect to first date

I have a dataset that looks like this.
Id Date1 Cars
1 2007-04-05 72
2 2014-01-07 12
2 2018-07-09 10
2 2018-07-09 13
3 2005-11-19 22
3 2005-11-23 13
4 2010-06-17 38
4 2010-09-23 57
4 2010-09-23 41
4 2010-10-04 17
For each Id, I would like the difference in days between each date and the earliest date for that Id: (2nd earliest date - earliest date), (3rd earliest date - earliest date), (4th earliest date - earliest date), and so on.
I would end up with a dataset like this
Id Date1 Cars Diff
1 2007-04-05 72 NA
2 2014-01-07 12 NA
2 2018-07-09 10 1644 = (2018-07-09 - 2014-01-07)
2 2018-07-09 13 1644 = (2018-07-09 - 2014-01-07)
3 2005-11-19 22 NA
3 2005-11-23 13 4 = (2005-11-23 - 2005-11-19)
4 2010-06-17 38 NA
4 2010-09-23 57 98 = (2010-09-23 - 2010-06-17)
4 2010-09-23 41 98 = (2010-09-23 - 2010-06-17)
4 2010-10-04 17 109 = (2010-10-04 - 2010-06-17)
I am unclear on how to accomplish this. Any help would be much appreciated. Thanks
First convert Date1 to Date class.
df$Date1 = as.Date(df$Date1)
Then subtract the first value within each Id. This can be done using dplyr.
library(dplyr)
df %>% group_by(Id) %>% mutate(Diff = as.integer(Date1 - first(Date1)))
# Id Date1 Cars Diff
# <int> <date> <int> <int>
# 1 1 2007-04-05 72 0
# 2 2 2014-01-07 12 0
# 3 2 2018-07-09 10 1644
# 4 2 2018-07-09 13 1644
# 5 3 2005-11-19 22 0
# 6 3 2005-11-23 13 4
# 7 4 2010-06-17 38 0
# 8 4 2010-09-23 57 98
# 9 4 2010-09-23 41 98
#10 4 2010-10-04 17 109
Using data.table:
library(data.table)
setDT(df)[, Diff := as.integer(Date1 - first(Date1)), by = Id]
Or in base R:
df$Diff <- with(df, ave(as.integer(Date1), Id, FUN = function(x) x - x[1]))
Replace the 0s with NA if you want the output exactly as shown in the question.
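If only the first row of each Id should be NA, as in the question's expected output, blanking every 0 could also hit later rows that happen to share the earliest date. A minimal base R sketch that sets just the first row per group to NA follows. It assumes the rows are already sorted by date within each Id, and the data frame is a shortened copy of the question's data.

```r
# Shortened copy of the question's data.
df <- data.frame(
  Id    = c(1, 2, 2, 2, 3, 3),
  Date1 = as.Date(c("2007-04-05", "2014-01-07", "2018-07-09",
                    "2018-07-09", "2005-11-19", "2005-11-23")),
  Cars  = c(72, 12, 10, 13, 22, 13)
)

# Days since the first date of each Id; the first row of every group
# is set to NA instead of 0.
df$Diff <- with(df, ave(as.integer(Date1), Id,
                        FUN = function(x) c(NA_integer_, (x - x[1])[-1])))

df
```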

How to subset dataframe on dates?

I have a panel data frame in R with many rows. I want to subset it to include only the last 10 days of each month (or the last observation within 10 days of the end of the month). However, the months vary in length and not every month includes end-of-month observations. I need a subset that contains the final 10 (or 5) days of every month.
CIV50s = CIV50sub %>%
select(cusip, date, impl_volatility) %>%
group_by(year(date), month(date), cusip) %>%
summarize(impl_volatility = tail(impl_volatility, 1)) %>%
mutate(date = make_date(`year(date)`, `month(date)`))
I have tried this, but it only gives me the last observation of each month. I need either the last 10 days or the last observations within 10 days of the end of the month.
my dataset looks like this:
Here are two possible solutions. The first is quick but imprecise: extract the day of the month from each date and keep the days from a fixed cutoff (20 here) onward. This is not exact because months have different lengths.
library(dplyr)
library(lubridate)
df <- data.frame(t=seq(ymd('2018-01-01'),ymd('2019-01-01'),by='days'))
#extract day of month
df$day <- as.numeric(format(df$t,'%d'))
df %>% filter(day>=20) # can change this to 21 or other number
t day
1 2018-01-20 20
2 2018-01-21 21
3 2018-01-22 22
4 2018-01-23 23
5 2018-01-24 24
6 2018-01-25 25
7 2018-01-26 26
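The other option is to add the length of each month, compute how many days remain until the end of the month, then filter on that difference. Note that February is hard-coded as 28 days here, so leap years are not handled. Either option still works if some of the last days of a month are missing from the data.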
df %>% mutate(month=as.numeric(format(t,'%m')),
month.length=case_when(month %in% c(1,3,5,7,8,10,12)~31,
month==2~28,
TRUE~30),
diff=month.length-day) %>%
filter(diff<=10)
t day month month.length diff
1 2018-01-21 21 1 31 10
2 2018-01-22 22 1 31 9
3 2018-01-23 23 1 31 8
4 2018-01-24 24 1 31 7
5 2018-01-25 25 1 31 6
6 2018-01-26 26 1 31 5
7 2018-01-27 27 1 31 4
8 2018-01-28 28 1 31 3
9 2018-01-29 29 1 31 2
10 2018-01-30 30 1 31 1
11 2018-01-31 31 1 31 0
12 2018-02-18 18 2 28 10
13 2018-02-19 19 2 28 9
14 2018-02-20 20 2 28 8
15 2018-02-21 21 2 28 7
16 2018-02-22 22 2 28 6
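The month lengths can also be computed rather than hard-coded, which covers leap years. The sketch below is one possible base R variant, not part of the original answer: `last_day_of_month()` is a hypothetical helper written for this example, and the data frame uses a leap year (2020) to show the February case.

```r
# Hypothetical helper: last calendar day of the month for each date.
last_day_of_month <- function(d) {
  first_of_month <- as.Date(format(d, "%Y-%m-01"))
  # Adding 31 days to the 1st always lands in the following month.
  first_of_next <- as.Date(format(first_of_month + 31, "%Y-%m-01"))
  first_of_next - 1
}

df <- data.frame(t = seq(as.Date("2020-01-01"), as.Date("2020-12-31"),
                         by = "day"))

# Days remaining until the end of the month (0 on the last day).
df$days_left <- as.integer(last_day_of_month(df$t) - df$t)

# days_left < 10 keeps exactly the last ten calendar days of each month.
last10 <- df[df$days_left < 10, ]
head(last10)
```

Because the filter only looks at the rows that exist, it keeps whatever observations fall inside the last ten calendar days, even when some of those days are missing from the data.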

Order by date and other column

I am trying to sort my data frame by two columns. The first column is a number, either 0 or 6317, and the second column is the date in mon-yy format (for example Jan-12), covering January to December of different years.
Below is just a selection of my dataframe:
number date count
1 0 Sep-13 10
2 0 Jan-12 15
3 0 Feb-13 4
4 0 Oct-12 13
5 0 Nov-13 14
6 6317 Jan-12 20
7 6317 Nov-13 40
8 6317 Dez-13 20
9 6317 Feb-13 10
10 6317 Oct-12 15
11 6317 Oct-13 19
I have used the following commands
orderbydate <- count[order(as.Date(count$date, format = "%b-%y")), ]
and
orderbydate <- count[order(count[,1], count[,2]),]
I was planning on having it look something like this in the end.
date 6317 0
Jan-12 20 15
Feb-12 8 10
Mrch-12 15 20
. . .
. . .
. . .
Jan-13 18 19
Feb-13 10 4
Mrch-13 14 2
Apr-13 11 9
We can convert to yearmon class with zoo and then arrange
library(dplyr)
library(zoo)
count %>%
arrange(number, as.yearmon(date, '%b-%y'))
Or, without using packages, convert 'date' to Date class by pasting on a day (e.g. 01) and then order:
count[order(count$number, as.Date(paste0(count$date, "-01"), "%b-%y-%d")),]
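Parsing with "%b" depends on the session locale, so a locale-independent sort key may be useful in some setups. The sketch below is an assumption-laden alternative, not the answer's method: it matches the three-letter prefix against base R's `month.abb` constant (English abbreviations) and orders by number, year and month. The small `count` data frame is an excerpt of the question's data.

```r
# Excerpt of the question's data.
count <- data.frame(
  number = c(0, 0, 0, 6317, 6317, 6317),
  date   = c("Sep-13", "Jan-12", "Feb-13", "Nov-13", "Jan-12", "Oct-12"),
  count  = c(10, 15, 4, 40, 20, 15),
  stringsAsFactors = FALSE
)

# Month number from the English abbreviation; non-English spellings such
# as "Dez" give NA here and would be sorted last within their year.
month_num <- match(substr(count$date, 1, 3), month.abb)
year_num  <- as.integer(substr(count$date, 5, 6))

sorted <- count[order(count$number, year_num, month_num), ]
sorted
```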
You can use the cast() function from the reshape package.
The code below will yield the desired wide layout:
library(reshape)
new_df <- cast(count, date ~ number, value = "count")
Here's another option without using any packages:
DATA:
number date count
0 Sep-13 10
0 Jan-12 15
0 Feb-13 4
0 Oct-12 13
0 Nov-13 14
6317 Jan-12 20
6317 Nov-13 40
6317 Dec-13 20
6317 Feb-13 10
6317 Oct-12 15
6317 Oct-13 19
CODE:
dt <- read.table('clipboard', header = T, stringsAsFactors = F)
dt$date <- as.Date(paste(dt$date, '01', sep = '-'), format = '%b-%y-%d')
> dt
number date count
1 0 2013-09-01 10
2 0 2012-01-01 15
3 0 2013-02-01 4
4 0 2012-10-01 13
5 0 2013-11-01 14
6 6317 2012-01-01 20
7 6317 2013-11-01 40
8 6317 2013-12-01 20
9 6317 2013-02-01 10
10 6317 2012-10-01 15
11 6317 2013-10-01 19
To get the layout you indicated above, we can use merge (all = TRUE keeps dates that appear in only one of the two groups):
> merge(dt[dt$number == 6317, 2:3], dt[dt$number == 0, 2:3], by = 'date', suffixes = c('_6317', '_0'), all = TRUE)
date count_6317 count_0
1 2012-01-01 20 15
2 2012-10-01 15 13
3 2013-02-01 10 4
4 2013-09-01 NA 10
5 2013-10-01 19 NA
6 2013-11-01 40 14
7 2013-12-01 20 NA

Finding discrepancy between two data sets when setdiff is not working

I have data for the spot price and the day-ahead price for hour 2 and hour 3, shown below. The rows run from 2015-12-31 back to 2011-01-01.
> head(da2)
Date Price Hour
43802 2015-12-31 12.56 2
43778 2015-12-30 23.59 2
43754 2015-12-29 17.07 2
> head(sp2)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 2 17.15
2 2015-12-30 2 26.23
3 2015-12-29 2 23.01
> head(da3)
Date Price Hour
43803 2015-12-31 10.46 3
43779 2015-12-30 23.55 3
43755 2015-12-29 16.52 3
> head(sp3)
# A tibble: 6 x 3
Date Hour Price
<dttm> <chr> <dbl>
1 2015-12-31 3 12.96
2 2015-12-30 3 25.65
3 2015-12-29 3 23.59
I tried to put da2$Price and sp2$Price together, and again the same for hour 3.
But unfortunately, I get this.
> rpdf2<-data.frame(da2$Date,da2$Price,sp2$Price)
Error in data.frame(da2$Date, da2$Price, sp2$Price) :
arguments imply differing number of rows: 1826, 1822
> rpdf3<-data.frame(da3$Date,da3$Price,sp3$Price)
Error in data.frame(da3$Date, da3$Price, sp3$Price) :
arguments imply differing number of rows: 1821, 1825
So I ran setdiff(paste(da2$Date), paste(sp2$Date)) and got
[1] "2014-03-30" "2013-03-31" "2012-03-25" "2011-03-27"
That was fine. But when I ran setdiff(paste(da3$Date), paste(sp3$Date)), it returned character(0).
There must be a difference of 4 observations, but I cannot find them. Can anyone help me with this situation? Thank you.
Running setdiff(da3$Date, sp3$Date) directly gives:
[1] 16800.04 16799.04 16798.04 16797.04 16796.04 16795.04 16794.04 16793.04 16792.04 16791.04 16790.04 16789.04 16788.04 16787.04 16786.04 16785.04 16784.04
[18] 16783.04 16782.04 16781.04 16780.04 16779.04 16778.04 16777.04 16776.04 16775.04 16774.04 16773.04 16772.04 16771.04 16770.04 16769.04 16768.04 16767.04
[35] 16766.04 16765.04 16764.04 16763.04 16762.04 16761.04 16760.04 16759.04 16758.04 16757.04 16756.04 16755.04 16754.04 16753.04 16752.04 16751.04 16750.04
[52] 16749.04 16748.04 16747.04 16746.04 16745.04 16744.04 16743.04 16742.04 16741.04 16740.04 16739.04 16738.04 16737.04 16736.04 16735.04 16734.04 16733.04
[69] 16732.04 16731.04 16730.04 16729.04 16728.04 16727.04 16726.04 16725.04 16724.04 16723.04 16722.04 16721.04 16720.04 16719.04 16718.04 16717.04 16716.04
[86] 16715.04 16714.04 16713.04 16712.04 16711.04 16710.04 16709.04 16708.04 16707.04 16706.04 16705.04 16704.04 16703.04 16702.04 16701.04 16700.04 16699.04
and so on.
One way (of many) to tackle this is, instead of looking directly for the differences, to join the tables in a way that works regardless. Generate a complete sequence of all dates from the first date on your list to the last, then left-join each of your day-ahead and spot price data frames to it in turn. Dates missing from a table show up as NA in that table's columns of the joined result.
Example sequence, shortened to one month only for this exemplar. You'd start it at 2011-01-01 instead.
somedates = seq(as.Date("2015-12-01"), as.Date("2015-12-31"), by = "day")
Generate some test data each with four randomly missed dates to simulate your da2, da3, sp2 and sp3 tables:
library(dplyr)
set.seed(0)
da2 = data.frame(Date = sample(somedates, 27)) %>%
mutate(hour = 2, price = 20)
set.seed(1)
da3 = data.frame(Date = sample(somedates, 27)) %>%
mutate(hour = 3, price = 21)
set.seed(2)
sp2 = data.frame(Date = sample(somedates, 27)) %>%
mutate(hour = 2, price = 19)
set.seed(3)
sp3 = data.frame(Date = sample(somedates, 27)) %>%
mutate(hour = 3, price = 18)
Joining the da2, da3, sp2 and sp3 tables
With the test data generated, joining the tables to the complete sequence of dates (as a data frame) is straightforward. (NB I haven't replaced the joined column names with more meaningful versions in the result below).
all =
left_join(data.frame(Date = somedates), da2, by = "Date") %>%
left_join(da3, by = "Date") %>%
left_join(sp2, by = "Date") %>%
left_join(sp3, by = "Date")
Results from the test data joined
>all
Date hour.x price.x hour.y price.y hour.x.x price.x.x hour.y.y price.y.y
1 2015-12-01 2 20 3 21 2 19 3 18
2 2015-12-02 2 20 3 21 2 19 3 18
3 2015-12-03 NA NA 3 21 2 19 3 18
4 2015-12-04 2 20 3 21 2 19 3 18
5 2015-12-05 2 20 3 21 2 19 3 18
6 2015-12-06 2 20 3 21 2 19 3 18
7 2015-12-07 2 20 3 21 2 19 NA NA
8 2015-12-08 2 20 3 21 2 19 3 18
9 2015-12-09 2 20 3 21 NA NA 3 18
10 2015-12-10 2 20 3 21 NA NA 3 18
11 2015-12-11 2 20 3 21 2 19 3 18
12 2015-12-12 NA NA 3 21 2 19 3 18
13 2015-12-13 2 20 NA NA 2 19 NA NA
14 2015-12-14 2 20 3 21 2 19 3 18
15 2015-12-15 2 20 3 21 2 19 3 18
16 2015-12-16 2 20 3 21 2 19 3 18
17 2015-12-17 2 20 3 21 2 19 3 18
18 2015-12-18 2 20 NA NA 2 19 3 18
19 2015-12-19 NA NA 3 21 2 19 3 18
20 2015-12-20 2 20 NA NA NA NA 3 18
21 2015-12-21 2 20 3 21 2 19 3 18
22 2015-12-22 2 20 3 21 2 19 3 18
23 2015-12-23 2 20 3 21 2 19 3 18
24 2015-12-24 2 20 3 21 2 19 NA NA
25 2015-12-25 2 20 3 21 2 19 3 18
26 2015-12-26 2 20 3 21 2 19 3 18
27 2015-12-27 2 20 3 21 2 19 3 18
28 2015-12-28 2 20 3 21 2 19 3 18
29 2015-12-29 2 20 3 21 2 19 3 18
30 2015-12-30 2 20 3 21 NA NA 3 18
31 2015-12-31 NA NA NA NA 2 19 NA NA
Edit: I note that the numeric dates in your setdiff output carry a 0.04 fractional (time-of-day) component on top of the whole-number date. For the join to work you would either have to add the same component to the date sequence or, more simply, truncate each date to a whole day before joining. Since sp2 and sp3 print as <dttm> columns, they also need converting to Date so that the column types match:
da2$Date = trunc(da2$Date, "days")
da3$Date = trunc(da3$Date, "days")
sp2$Date = as.Date(trunc(sp2$Date, "days"))
sp3$Date = as.Date(trunc(sp3$Date, "days"))
Do this before the joins.
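A note on why the second setdiff() call in the question may have returned character(0): setdiff(x, y) only reports elements of x that are missing from y. The error message shows da3 with 1821 rows and sp3 with 1825, so the extra dates are probably in sp3 and would only appear with the arguments swapped. A minimal sketch with made-up dates (the two vectors are illustrative, not the asker's data):

```r
# Illustrative stand-ins for da3$Date and sp3$Date.
da3_dates <- as.Date(c("2015-12-31", "2015-12-29"))
sp3_dates <- as.Date(c("2015-12-31", "2015-12-30", "2015-12-29"))

# Compare on the calendar day only, as text, so that any time-of-day
# component does not affect the comparison.
da_days <- format(da3_dates, "%Y-%m-%d")
sp_days <- format(sp3_dates, "%Y-%m-%d")

only_in_da <- setdiff(da_days, sp_days)  # dates present only in da3
only_in_sp <- setdiff(sp_days, da_days)  # dates present only in sp3

only_in_da
only_in_sp
```

Checking both directions shows which table is missing which dates without building the full join.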

Cumulative function for a specific range of values

I have a table with a column "Age" that takes values from 1 to 10, and a column "Population" that gives a value for each age. I want a cumulative sum of the population counted from each age upward: the total for ages 1 and above, for ages 2 and above, and so on. That is, the resulting vector should be (203, 180, ...). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum and rev:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22
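The rev/cumsum approach relies on the rows being sorted by Age. As a cross-check that does not depend on row order, the same totals can be computed directly from the definition, summing the population of all rows at or above each age. This is a sketch using the question's Age and Population columns:

```r
df <- data.frame(
  Age = 1:10,
  Population = c(23, 12, 32, 33, 15, 10, 19, 18, 19, 22)
)

# For each age a, add up the population of every row with Age >= a.
df$sum_above <- sapply(df$Age, function(a) sum(df$Population[df$Age >= a]))

df
```

This is quadratic in the number of rows, so the cumsum version is preferable for large tables.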
