I'm a relative R novice trying to do more with dplyr. I have a large data frame with an id column, end (dates stored as POSIXct), and D (a code for the outcome, about 6 different types). A sample:
id end D
1143 1996-08-10 KT
1148 2000-07-27 KT
1150 2004-07-02 KT
1158 2001-11-03 KT
I want to create a subset for outcome KT. Within this outcome, many ids have 1 to 4 outcomes with separate dates. I want to create a wide data frame that looks like this:
id datetx1 datetx2 datetx3 datetx4
1915 2014-10-13 2014-10-18 <NA> <NA>
2715 2006-09-17 2006-09-17 2006-09-17 2008-02-01
5089 2007-02-02 2007-02-11 <NA> <NA>
5266 2007-04-16 2010-07-14 2010-07-14 <NA>
I have been trying to use dplyr and tidyr with some success at going from long to wide, but am getting stuck with the dates with this approach:
transplant<-filter(outcomes, D == "KT") %>%
filter(!is.na(date)) %>%
group_by(id) %>%
arrange((end)) %>%
mutate(number = row_number()) %>%
spread(number, date)
The problem is with spread, which creates numeric values in the new columns that I can't seem to coerce into dates. I have resorted to converting the date with as.character before the spread and converting it back with as.Date afterwards, which works, but am I missing something about how tidyr handles dates?
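For what it's worth, here is a minimal sketch of the long-to-wide step using tidyr's pivot_wider, which keeps the Date class of the value column. The sample rows, the non-KT code "DX", and the datetx prefix are made up for illustration, and it uses Date rather than POSIXct to keep the example small, so treat it as a sketch rather than a drop-in answer.

```r
library(dplyr)
library(tidyr)

# Hypothetical sample data; "DX" is an invented non-KT outcome code
outcomes <- data.frame(
  id  = c(1915, 1915, 5089, 5089, 5266, 5266, 5266, 1143),
  end = as.Date(c("2014-10-13", "2014-10-18", "2007-02-02", "2007-02-11",
                  "2007-04-16", "2010-07-14", "2010-07-14", "1996-08-10")),
  D   = c(rep("KT", 7), "DX")
)

transplant <- outcomes %>%
  filter(D == "KT", !is.na(end)) %>%
  arrange(id, end) %>%
  group_by(id) %>%
  mutate(number = row_number()) %>%   # 1, 2, 3, ... within each id
  ungroup() %>%
  select(id, number, end) %>%
  # one column per within-id sequence number; values stay Date
  pivot_wider(names_from = number, values_from = end, names_prefix = "datetx")

transplant
```

Ids with fewer events than the maximum get NA in the trailing columns.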
newdf=data.frame(date=as.Date(c("2021-01-04","2021-01-05","2021-01-06","2021-01-07")),
time=c("10:32:29","11:25","12:18:42","09:58"))
This is my data frame. I want to calculate the time difference, in hours, between each pair of consecutive days. Note that some time values do not contain seconds, so they first have to be converted to a standard form. Could you please suggest a method in R that handles both steps?
Paste date and time together into one column, use parse_date_time to convert the values to a standard date-time format (POSIXct), and use difftime to calculate the difference between consecutive times in hours.
library(dplyr)
library(tidyr)
library(lubridate)
newdf %>%
  unite(datetime, date, time, sep = ' ') %>%
  mutate(datetime = parse_date_time(datetime, c('Ymd HMS', 'Ymd HM')),
         # units must be named: the third positional argument of difftime() is tz
         difference_in_hours = round(as.numeric(difftime(datetime,
                                     lag(datetime), units = 'hours')), 2))
# datetime difference_in_hours
#1 2021-01-04 10:32:29 NA
#2 2021-01-05 11:25:00 24.88
#3 2021-01-06 12:18:42 24.90
#4 2021-01-07 09:58:00 21.66
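If you would rather avoid the extra packages, the same idea can be sketched in base R. This assumes the times are always either HH:MM or HH:MM:SS and that a fixed time zone (UTC here, my choice) is acceptable; the helper names time_full and hours are mine.

```r
newdf <- data.frame(
  date = as.Date(c("2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07")),
  time = c("10:32:29", "11:25", "12:18:42", "09:58")
)

# Append ":00" to times that only have hours and minutes (5 characters)
time_full <- ifelse(nchar(newdf$time) == 5,
                    paste0(newdf$time, ":00"),
                    newdf$time)

# Combine date and time into POSIXct; UTC is an assumption
datetime <- as.POSIXct(paste(newdf$date, time_full),
                       format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# diff() returns a difftime; ask for hours explicitly when converting
hours <- as.numeric(diff(datetime), units = "hours")
newdf$difference_in_hours <- c(NA, hours)
```

The first row has no previous day, so its difference is NA.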
I'm having trouble putting my question into words (hence the weird title), but:
I want to create a new column Earnings, that will take the value of the Price of the Date that matches the Last trading day. Like this :
For the first row, the last trading day is 2014-02-17, so I check in the Date column, and in the 5th row, the Date is equal to 2014-02-17. So I take the price of the 5th row which is 235 and assign it to all rows that have 2014-02-17 as the Last trading day.
Price Date `Last trading day` Earnings
<dbl> <date> <date> <dbl>
224. 2013-01-02 2014-02-17 235
224. 2013-01-02 2014-02-17 235
224. 2013-01-02 2014-02-17 235
224. 2013-01-02 2014-04-19 260
235. 2014-02-17 2014-04-19 260
260. 2014-04-19 2014-06-17 253
I tried this, but it doesn't work :
library(dplyr)
library(plyr)
df<-data %>%
group_by(`Last trading day`) %>%
mutate(Earnings = if_else(data$Date==data$`Last trading day`, Price, NA_real_))
Thanks a lot for your help.
We can use match :
df$Earnings <- df$Price[match(df$Last_trading_day, df$Date)]
Using it in dplyr pipe :
library(dplyr)
df %>% mutate(Earnings = Price[match(Last_trading_day, Date)])
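To see what match is doing here, a small self-contained sketch built from the rows shown in the question (with the column renamed to Last_trading_day) may help. match returns, for each Last_trading_day, the position of the first equal value in Date, and that position is then used to index Price.

```r
df <- data.frame(
  Price = c(224, 224, 224, 224, 235, 260),
  Date = as.Date(c("2013-01-02", "2013-01-02", "2013-01-02",
                   "2013-01-02", "2014-02-17", "2014-04-19")),
  Last_trading_day = as.Date(c("2014-02-17", "2014-02-17", "2014-02-17",
                               "2014-04-19", "2014-04-19", "2014-06-17"))
)

# Position in Date of each Last_trading_day (NA when there is no such Date)
idx <- match(df$Last_trading_day, df$Date)

# Look up the Price at those positions
df$Earnings <- df$Price[idx]
df
```

In this six-row excerpt the last row stays NA because 2014-06-17 does not occur in Date; in the full data it would pick up the price of that day.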
Another option is to join dataframe with itself.
library(dplyr)
df %>% left_join(df, by = c('Last_trading_day' = 'Date'))
I replaced the spaces in the column name Last trading day with underscores.
We can remove the data$ prefix: data$Date takes the whole column and ignores the grouping, instead of using only the values within each group.
library(dplyr)
data %>%
group_by(`Last trading day`) %>%
mutate(Earnings = if_else(Date== `Last trading day`, Price, NA_real_))
Or another option is case_when
data %>%
group_by(`Last trading day`) %>%
mutate(Earnings = case_when(Date== `Last trading day` ~ Price))
Also, as we are comparing elementwise, we don't need any group_by
data %>%
mutate(Earnings = if_else(Date== `Last trading day`, Price, NA_real_))
Or with case_when remove the group_by
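A minimal sketch of that variant follows. The first six rows come from the question; the seventh row (Date 2014-06-17, Price 253) is my assumption, added so that at least one row has matching dates.

```r
library(dplyr)

data <- tibble(
  Price = c(224, 224, 224, 224, 235, 260, 253),
  Date = as.Date(c("2013-01-02", "2013-01-02", "2013-01-02", "2013-01-02",
                   "2014-02-17", "2014-04-19", "2014-06-17")),
  `Last trading day` = as.Date(c("2014-02-17", "2014-02-17", "2014-02-17",
                                 "2014-04-19", "2014-04-19", "2014-06-17",
                                 "2014-06-17"))
)

# No group_by needed: the comparison is row by row.
# Rows where no condition matches get NA.
res <- data %>%
  mutate(Earnings = case_when(Date == `Last trading day` ~ Price))
res
```

Note that this only fills Earnings on rows where the two dates coincide; every other row is NA, so it is not a lookup of the price on the last trading day.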
The above solutions were based on the code the OP showed. If we instead need to look values up across the two columns, a data.table self-join is an option:
library(data.table)
setDT(df)[df, on = .(Last_trading_day = Date)]
I have been trying to do the following within the dplyr package but without success.
I need to find which levels of a certain factor column are present in every level of another factor column, in my case, it is a Year column. This could be an example dataset:
ID Year
1568 2013
1341 2013
1568 2014
1341 2014
1261 2014
1348 2015
1568 2015
1341 2015
So I would like a list of the ID names that are present in every year. In the above example would be:
"1568", "1341"
I have been trying with dplyr to first group_by the Year column and then summarise the data somehow, but without success.
Thanks
Using dplyr, we can group_by ID and keep the groups that have the same number of unique Year values as the complete data.
library(dplyr)
df %>%
  group_by(ID) %>%
  filter(n_distinct(Year) == n_distinct(.$Year)) %>%
  pull(ID) %>%
  unique()
#[1] 1568 1341
Here is a base R solution using intersect() + split()
comm <- Reduce(intersect,split(df$ID,df$Year))
such that
> comm
[1] 1568 1341
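Spelled out on the example data, the base R approach looks like this. split gives one vector of IDs per year, and Reduce applies intersect pairwise across those vectors, so only IDs present in every year survive. The intermediate name by_year is mine.

```r
df <- data.frame(
  ID   = c(1568, 1341, 1568, 1341, 1261, 1348, 1568, 1341),
  Year = c(2013, 2013, 2014, 2014, 2014, 2015, 2015, 2015)
)

# One vector of IDs per year
by_year <- split(df$ID, df$Year)

# intersect(intersect(ids_2013, ids_2014), ids_2015)
comm <- Reduce(intersect, by_year)
comm
```

If ID is stored as a factor, converting it with as.character first should avoid intersecting the underlying integer codes by accident.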
I have a dataframe (df) like the following:
derv market date
-10.7803563 S&P 500 Index 2008-01-02
-15.6922552 S&P 500 Index 2008-01-03
-15.7648483 S&P 500 Index 2008-01-04
-10.2294744 S&P 500 Index 2008-01-07
-0.5918593 S&P 500 Index 2008-01-08
8.1518987 S&P 500 Index 2008-01-09
.....
84.1518987 S&P 500 Index 2014-12-31
and I want to find the 10 trading days in df before a specific day. For example, 2008-01-12.
I have thought of using dplyr like the following:
df %>% select(derv,Market,date) %>%
filter(date > 2008-01-12 - 10 & Date <2008-01-12)
but the issue I am having is about how to index the 10 trading days before the specific day. The code I have above is not working and I do not know how to deal with it in the case of using dplyr.
Another issue is that the specific day (e.g. 2008-01-12) may or may not be in df. If the specific day is in df, I think I only need to go back 9 days; but if it is not in df, I need to go back 10 rows. I am not sure whether I am correct here, and this is the part that confuses me.
Would greatly appreciate any insight.
Using dplyr and data.table::rleid()
Example data:
set.seed(123)
df=data.frame(derv=rnorm(18),Date=as.Date(c(1,2,3,4,6,7,9,11,12,13,14,15,18,19,20,21,23,24),origin="2008-01-01"))
A column with an index is created in order to select no more than 10 days before the chosen date.
library(dplyr)
library(data.table)
df %>%
filter(Date < "2008-01-19") %>%
mutate(id = rleid(Date)) %>%
filter(id > (max(id)-10)) %>%
ungroup() %>%
select(derv,Date)
derv Date
1 -1.0678237 2008-01-04
2 -0.2179749 2008-01-05
3 -1.0260044 2008-01-07
4 -0.7288912 2008-01-08
5 -0.6250393 2008-01-10
6 -1.6866933 2008-01-12
7 0.8377870 2008-01-13
8 0.1533731 2008-01-14
9 -1.1381369 2008-01-15
10 1.2538149 2008-01-16
EDIT: Procrastinatus Maximus' solution is shorter and only requires dplyr
df %>% filter(Date < "2008-01-19") %>% filter(row_number() > (max(row_number())-10))
This gives the same output.
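Assuming a dplyr version that has slice_tail (1.0.0 or later), the same selection can probably be written a little more directly. Because the filter is strict (Date < target), it should not matter whether the target day itself is present in df: the result is the last 10 available rows before it either way. The name target is mine.

```r
library(dplyr)

set.seed(123)
df <- data.frame(
  derv = rnorm(18),
  Date = as.Date(c(1, 2, 3, 4, 6, 7, 9, 11, 12, 13, 14, 15, 18, 19, 20, 21, 23, 24),
                 origin = "2008-01-01")
)

target <- as.Date("2008-01-19")

last10 <- df %>%
  filter(Date < target) %>%   # strictly before the day of interest
  arrange(Date) %>%           # make sure rows are in date order
  slice_tail(n = 10)          # keep the 10 most recent rows
last10
```

If fewer than 10 rows exist before the target, slice_tail simply returns what is there.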
So the answer to this question really depends on how your dates are stored in R. But let's assume ISO 8601, which is what it looks like based on your code.
So first let's make some data.
mydates <- as.Date("2007-06-22")
mydates<-c(mydates[1]+1:11, mydates[1]+14:19)
StockPrice<-c(1:17)
df<-data.frame(mydates,StockPrice)
Then specify the date of interest, like @stats_guy did:
dateofinterest<-as.Date("2007-07-11")
I'd say use subset, and just subtract 11 from your date, since it is already stored as a Date and day arithmetic works directly.
foo<-subset(df, mydates<dateofinterest & mydates>(dateofinterest-11))
Then you'll have a nice span of 10 days, but I'm not sure whether you want 10 trading days or just 10 consecutive calendar days, even if that means your list of prices might have fewer than 10 entries. I intentionally made my dataset with breaks, like real market data, to illustrate that point. So I came up with 8 values over the 10-day period instead of 10. Interested to hear what you're actually looking for.
Say you were actually looking for 10 trading days. Just to play devil's advocate, you could assume that there won't be more than 10 days of no trading. So we go 20 days back in time before your date of interest.
foo<-subset(df, mydates<dateofinterest & mydates>(dateofinterest-20))
Then we check the subset to see whether it contains more than 10 trading days, using an if statement. If there are more than 10 rows, you have too many days. We just trim the subset, foo, to the right length: start from the bottom (the latest date) and keep that row plus the 9 entries above it. Now you have ten trading days in a nice tidy dataset.
if (nrow(foo)>10){
foo<-foo[(nrow(foo)-9):(nrow(foo)),]
}
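As an alternative sketch that does not rely on the 20-day assumption, you could take everything before the date of interest and keep the last 10 rows with tail. This assumes df is already sorted by date, as the example data is.

```r
mydates <- as.Date("2007-06-22")
mydates <- c(mydates[1] + 1:11, mydates[1] + 14:19)
StockPrice <- c(1:17)
df <- data.frame(mydates, StockPrice)

dateofinterest <- as.Date("2007-07-11")

# All rows strictly before the date of interest, then the last 10 of them
before <- df[df$mydates < dateofinterest, ]
foo <- tail(before, 10)
foo
```

tail returns fewer than 10 rows if fewer are available, so no if statement is needed.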
Right now I have two dataframes. One contains over 11 million rows of a start date, end date, and other variables. The second dataframe contains daily values for heating degree days (basically a temperature measure).
set.seed(1)
library(lubridate)
date.range <- ymd(paste(2008,3,1:31,sep="-"))
daily <- data.frame(date=date.range,value=runif(31,min=0,max=45))
intervals <- data.frame(start=daily$date[1:5],end=daily$date[c(6,9,15,24,31)])
In reality my daily dataframe has every day for 9 years and my intervals dataframe has entries that span over arbitrary dates in this time period. What I wanted to do was to add a column to my intervals dataframe called nhdd that summed over the values in daily corresponding to that time interval (end exclusive).
For example, in this case the first entry of this new column would be
sum(daily$value[1:5])
and the second would be
sum(daily$value[2:8]) and so on.
I tried using the following code
intervals <- mutate(intervals,nhdd=sum(filter(daily,date>=start&date<end)$value))
This is not working and I think it might have something to do with not referencing the columns correctly but I'm not sure where to go.
I'd really like to use dplyr to solve this and not a loop because 11 million rows will take long enough using dplyr. I tried using more of lubridate but dplyr doesn't seem to support the Period class.
Edit: I'm actually using dates from as.Date now instead of lubridate, but the basic question of how to refer to a different dataframe from within mutate still stands.
eps <- .Machine$double.eps
library(dplyr)
intervals %>%
rowwise() %>%
mutate(nhdd = sum(daily$value[between(daily$date, start, end - eps )]))
# start end nhdd
#1 2008-03-01 2008-03-06 144.8444
#2 2008-03-02 2008-03-09 233.4530
#3 2008-03-03 2008-03-15 319.5452
#4 2008-03-04 2008-03-24 531.7620
#5 2008-03-05 2008-03-31 614.2481
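One caveat, as far as I can tell: the sums above include the end date. The first value is the sum of six days, not the five asked for in the question. The reason is that .Machine$double.eps is far too small to change a Date stored as a number of days in the tens of thousands, so end - eps is still equal to end, and between() is inclusive on both sides. A minimal sketch of an end-exclusive version, using explicit comparisons and as.Date instead of lubridate (the name res is mine):

```r
library(dplyr)

set.seed(1)
daily <- data.frame(date = as.Date("2008-03-01") + 0:30,
                    value = runif(31, min = 0, max = 45))
intervals <- data.frame(start = daily$date[1:5],
                        end = daily$date[c(6, 9, 15, 24, 31)])

# Subtracting machine epsilon does not move a Date of this magnitude
eps <- .Machine$double.eps
all(intervals$end - eps == intervals$end)

res <- intervals %>%
  rowwise() %>%
  # start inclusive, end exclusive
  mutate(nhdd = sum(daily$value[daily$date >= start & daily$date < end])) %>%
  ungroup()
res
```

With this version the first entry equals sum(daily$value[1:5]) and the second equals sum(daily$value[2:8]), as described in the question.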
If you find the dplyr solution a bit slow (mainly because of rowwise), you might want to use data.table for speed.
library(data.table)
setkey(setDT(intervals), start, end)
setDT(daily)[, date1 := date]
foverlaps(daily, by.x = c("date", "date1"), intervals)[, sum(value), by=c("start", "end")]
# start end V1
#1: 2008-03-01 2008-03-06 144.8444
#2: 2008-03-02 2008-03-09 233.4530
#3: 2008-03-03 2008-03-15 319.5452
#4: 2008-03-04 2008-03-24 531.7620
#5: 2008-03-05 2008-03-31 614.2481