I want to create a new dummy variable that takes the value 1 if my observation falls within a certain set of date ranges, and 0 if it does not. My dataset is a list of political contributions over a 10-year range, and I want to make a dummy variable to mark whether the donation came during a certain range of dates. I have 10 date ranges I'm looking at.
Does anyone know if the right way to do this is to create a loop? I've been looking at this question, which seems similar, but I think mine would be a bit more complicated: Creating a weekend dummy variable
By way of example, what I have is a variable listing the dates on which contributions were recorded, and I want to create a dummy to show whether each contribution came during a budget crisis. So, if there were a budget crisis from 2010-02-01 until 2010-03-25 and another from 2009-06-05 until 2009-07-30, the variable would ideally look like this:
Contribution Date  Budget Crisis
2009-06-01         0
2009-06-06         1
2009-07-30         1
2009-07-31         0
2010-01-31         0
2010-03-05         1
2010-03-26         0
Thanks yet again for your help!
This looks like a good opportunity to use the %in% syntax of the match(...) function.
dat <- data.frame(
  ContributionDate = as.Date(c("2009-06-01", "2009-06-06", "2009-07-30", "2009-07-31",
                               "2010-01-31", "2010-03-05", "2010-03-26")),
  CrisisYes = NA
)
# All days covered by the two budget crises
crisisDates <- c(seq(as.Date("2010-02-01"), as.Date("2010-03-25"), by = "1 day"),
                 seq(as.Date("2009-06-05"), as.Date("2009-07-30"), by = "1 day"))
# 1 if the contribution date falls on a crisis day, 0 otherwise
dat$CrisisYes <- as.numeric(dat$ContributionDate %in% crisisDates)
dat
ContributionDate CrisisYes
1 2009-06-01 0
2 2009-06-06 1
3 2009-07-30 1
4 2009-07-31 0
5 2010-01-31 0
6 2010-03-05 1
7 2010-03-26 0
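Since you mention having ten date ranges, the same %in% trick scales if you build the crisis-day vector from a small table of ranges. A rough sketch, where ranges is a hypothetical object holding one row per crisis (only the two example crises are filled in here):
# Hypothetical table of crisis ranges; add one row per crisis
ranges <- data.frame(start = as.Date(c("2009-06-05", "2010-02-01")),
                     end   = as.Date(c("2009-07-30", "2010-03-25")))
# Expand every range into its individual days, then reuse %in% as above
crisisDates <- do.call(c, mapply(seq, ranges$start, ranges$end,
                                 MoreArgs = list(by = "1 day"), SIMPLIFY = FALSE))
dat$CrisisYes <- as.numeric(dat$ContributionDate %in% crisisDates)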
My data frame in RStudio is as follows:
StudyID FITDate.1 ScopeDate.1 ScopeDate.2 ScopeDate.3 ScopeDate.4
1 2014-05-15 2010-06-02 2014-05-28 2014-08-01 2015-10-27
2 2017-11-29 2018-02-27
3 2015-10-04 2016-06-24 2017-01-18
I have a variable "FITDate.1" that indicates the date of the FIT test, and several variables "ScopeDate.x" that indicate the dates of multiple scope tests.
In my research, a person can have only one FIT test date but can have multiple scope dates. Clinically, if a person has a FIT test, then he will be referred for a scope test. However, this person may also receive scope tests for other reasons.
So if the date of a scope test is right after the date of a FIT test, then we define them as highly related.
I want to create a variable "FITrelatedscopedate" to hold the dates of FIT-related scopes. For example, in the row with StudyID == 1, the date of "FITDate.1" is 2014-05-15, which falls right between ScopeDate.1 (2010-06-02) and ScopeDate.2 (2014-05-28). So the date value 2014-05-28 of ScopeDate.2 is what I need, and I will use 2014-05-28 as the FIT-related scope date and write it into the new variable "FITrelatedscopedate".
I think I have to use a loop, but I have no experience writing one. Have you solved a similar problem before? Do you know any code that would do this? Thanks, any help is appreciated.
Here is one approach with the tidyverse, assuming you start with two long data.frames: one for FIT testing and the other for endoscopy.
df_fit <- data.frame(
StudyID = 1:3,
FITDate = as.Date(c("2014-05-15", "2017-11-29", "2015-10-04"))
)
df_fit
StudyID FITDate
1 1 2014-05-15
2 2 2017-11-29
3 3 2015-10-04
df_scope <- data.frame(
StudyID = c(1,1,1,1,2,3,3),
ScopeDate = as.Date(c("2010-06-02", "2014-05-28", "2014-08-01", "2015-10-27", "2018-02-27",
"2016-06-24", "2017-01-18"))
)
df_scope
StudyID ScopeDate
1 1 2010-06-02
2 1 2014-05-28
3 1 2014-08-01
4 1 2015-10-27
5 2 2018-02-27
6 3 2016-06-24
7 3 2017-01-18
First, you can do a left_join by StudyID to add the scope dates to the FIT data. Then, you can filter to keep only scope dates after FIT testing. For each StudyID, use slice to retain only the first row (this assumes dates are in chronological order; if not, add arrange(ScopeDate) earlier in the pipe - let me know if you need help with this).
Then, you can right_join back to df_fit so that those FIT testing dates without endoscopy will have NA for the ScopeDate. The final statement with mutate will calculate the time duration between endoscopy and FIT testing.
library(tidyverse)
left_join(
df_fit,
df_scope,
by = "StudyID"
) %>%
filter(ScopeDate > FITDate) %>%
group_by(StudyID) %>%
slice(1) %>%
right_join(df_fit) %>%
mutate(Duration = ScopeDate - FITDate)
Output
StudyID FITDate ScopeDate Duration
<dbl> <date> <date> <drtn>
1 1 2014-05-15 2014-05-28 13 days
2 2 2017-11-29 2018-02-27 90 days
3 3 2015-10-04 2016-06-24 264 days
Let me know if this works for you. A data.table approach can be considered if you need something faster and have a very large dataset.
If you need the Duration as a numeric column, you can use as.numeric(ScopeDate - FITDate).
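Regarding the data.table suggestion above, here is a rough, unbenchmarked sketch of the same logic (assuming the df_fit and df_scope objects from this answer):
library(data.table)
fit_dt   <- as.data.table(df_fit)
scope_dt <- as.data.table(df_scope)
# Join all scope dates to each FIT date, keep only those after the FIT date,
# take the earliest per StudyID, then merge back so IDs without a match keep NA
joined      <- merge(fit_dt, scope_dt, by = "StudyID", all.x = TRUE)
first_scope <- joined[ScopeDate > FITDate, .(ScopeDate = min(ScopeDate)), by = .(StudyID, FITDate)]
res         <- merge(fit_dt, first_scope, by = c("StudyID", "FITDate"), all.x = TRUE)
res[, Duration := ScopeDate - FITDate]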
How do I index the rows I need based on specific conditions?
id   <- c(65, 65, 65, 65, 65, 900, 900, 900, 900, 900, 900,
          211, 211, 211, 211, 211, 211, 211, 45, 45, 45, 45, 45, 45, 45)
age  <- c(19, 22, 23, 24, 25, 21, 26, 31, 32, 37, 38,
          22, 23, 25, 28, 29, 31, 32, 30, 31, 36, 39, 42, 44, 48)
stat <- c("intern", "reg", "manage1", "left", "reg", "manage1", "manage2", "left", "reg",
          "reg", "left", "intern", "left", "intern", "reg", "left", "reg", "manage1",
          "reg", "left", "intern", "manage1", "left", "reg", "manage2")
mydf <- data.frame(id, age, stat)
I need to create 5 variables:
m01time & m12time: measure the number of years elapsed before becoming a level-1 manager (manage1), and then from manage1 to manage2, regardless of whether or not it's at the same job (numeric, in years)
change: capture whether or not they experienced a job change between manage1 and manage2 (i.e., whether 'left' appears somewhere between manage1 and manage2) (0 or 1)
m1p & m2p: capture the position held just before becoming manager1 and manager2 (intern, reg, or manage1).
There's a lot of information here that I don't need and am not sure how to ignore (e.g., all the jobs id 211 went through before the one where they become a manager).
The end result should look something like this:
id m01time m12time change m1p m2p
1 65 4 NA NA reg <NA>
2 900 NA 5 0 <NA> manage1
3 211 1 NA NA reg <NA>
4 45 3 9 1 intern reg
I tried to use ifelse with lag() and lead() to capture some conditions, but there are more loop-like tasks (such as capturing a "left" somewhere in between) that I am not sure what to do with.
I'd calculate the first three variables differently than m1p and m2p. Maybe there's an elegant unified approach that I don't see at the moment.
So for the last position before manager you could do:
library(data.table)
mydt <- as.data.table(mydf)
# For each id, the position held in the row just before the first manage1 / manage2
mydt[, .(m1p = shift(stat)[match("manage1", stat)],
         m2p = shift(stat)[match("manage2", stat)]), by = id]
The other variables are more conveniently calculated in a wide data format:
dt <- dcast(unique(mydt, by = c("id", "stat")),
            formula = id ~ stat, value.var = "age")
dt[, .(id,
       m01time = manage1 - intern,
       m12time = manage2 - manage1,
       change  = as.integer(manage1 < left & left < manage2))]
Two caveats:
reshaping might be quite costly for larger data sets
I (over-)simplified your dummy data by ignoring duplicates of id and stat (see the sketch below for a change calculation that keeps them)
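For completeness, a hedged sketch of a change calculation that works on the original long data without dropping duplicates: per id, it checks whether any 'left' row falls between the first manage1 and the first manage2 (and assumes manage1 occurs before manage2 when both exist).
mydt[, {
  i1 <- match("manage1", stat)   # first manage1 row within this id
  i2 <- match("manage2", stat)   # first manage2 row within this id
  .(change = if (is.na(i1) || is.na(i2)) NA_integer_
             else as.integer(any(stat[seq(i1, i2)] == "left")))
}, by = id]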
Firstly: I have seen other posts about AVERAGEIF translations from Excel into R, but I didn't see one that worked for my specific case, and I couldn't get one to work myself.
I have a dataset which encompasses the daily pricings of a bunch of listings.
It looks like this
listing_id date price
1 1000 1/2/2015 $100
2 1200 2/4/2016 $150
Sample of the dataset (and desired outcome) # https://send.firefox.com/download/228f31e39d18738d/#rlMmm6UeGxgbkzsSD5OsQw
The dataset I would like to have has only the date and the average prices of all listings on that date. The goal is to get a (different) dataframe which would look something like this so I can work with it:
Date Average Price
1 4/5/2015 204.5438
2 4/6/2015 182.6439
3 4/7/2015 176.553
4 4/8/2015 182.0448
5 4/9/2015 183.3617
6 4/10/2015 205.0997
7 4/11/2015 197.0118
8 4/12/2015 172.2943
I created this in Excel using the AVERAGEIF function (and copy-pasting by value) from the sample provided above.
I tried to format the data in Excel first, where I could use the AVERAGEIF function to take the average if the date matches a specific date. The problem with this is that the dataset consists of 30 million rows and Excel only allows about 1 million, so it didn't work.
What I have done so far: I created a data frame in R (where I want the average prices to go) using
Avg = data.frame("Date" =1:2, "Average Price"=1:2)
Avg[nrow(Avg) + 2036,] = list("v1","v2")
Avg$Date = seq(from = as.Date("2015-04-05"), to = as.Date("2020-11-01"), by = 'day')
I tried to create an AVERAGEIF-like function based on this article and another but could not get it to work.
I hope this is enough information to go on; otherwise, I would be more than happy to provide more.
If your question is how to replicate the AVERAGEIF function, you can use logical indexing:
R code:
> df
Dates Prices
1 1 100
2 2 120
3 3 150
4 1 320
5 2 250
6 3 210
7 1 102
8 2 180
9 3 150
idx <- df$Dates == 1 # Positions where condition is true
mean(df$Prices[idx]) # Prints same output as Excel
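For the per-date averages described in the question, the same idea can be applied to every date at once. A sketch, assuming the full data sit in a data frame called listings with date and price columns (hypothetical names taken from the question's sample):
# If price is stored as text like "$100", strip the symbols first
listings$price <- as.numeric(gsub("[$,]", "", listings$price))
# Base R: mean price per date (the AVERAGEIF-per-date equivalent)
avg_by_date <- aggregate(price ~ date, data = listings, FUN = mean)
# Or the same with dplyr
library(dplyr)
avg_by_date <- listings %>%
  group_by(date) %>%
  summarise(average_price = mean(price, na.rm = TRUE))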
I have a dataset that looks somewhat like this (the actual dataset is ~150000 lines with additional columns of fluff information such as company name, etc.):
Date return1 return2 rank
01/31/2008 0.05434 0.23413 3
01/31/2008 0.03423 0.43423 4
01/31/2008 0.65277 0.23423 1
01/31/2008 0.02342 0.47234 4
02/31/2008 0.01463 0.01231 4
02/31/2008 0.13456 0.52552 2
02/31/2008 0.34534 0.36663 1
02/31/2008 0.00324 0.56463 3
...
12/31/2015 0.21234 0.02333 2
12/31/2015 0.07245 0.87234 1
12/31/2015 0.47282 0.12998 1
12/31/2015 0.99022 0.03445 2
Basically I need to calculate the date-specific correlation between return1 and rank (so the corr. on 01/31/2008, 02/31/2008, and so on). I know I can split the data using the split() function, but I am unsure how to get the date-specific correlation. The real data has about 260 entries per date and around 68 dates, so manually subsetting the original table and performing calculations is time-consuming and, more importantly, more susceptible to error.
My ultimate goal is to create a time series of the correlations on different dates.
Thank you in advance!
I had this same problem earlier, except I wasn't calculating correlation. What I would do is:
library(dplyr)
a %>% group_by(Date) %>% summarise(Correlation = cor(return1, rank))
And this will provide, for each date, a correlation value between return1 and rank. Don't forget that you can specify what kind of correlation you would like (e.g. Spearman).
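Since the question mentions split(), a base-R sketch of the same calculation (assuming the data frame is called a, as above):
# Correlation of return1 and rank within each date, returned as a named vector
cors <- sapply(split(a, a$Date), function(d) cor(d$return1, d$rank))
# Rank-based (Spearman) version
cors_spearman <- sapply(split(a, a$Date), function(d) cor(d$return1, d$rank, method = "spearman"))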
I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
..in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)
What shall I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today - should it be the tenure up until today, or should it be NA?
First, I shall say I disagree with the previous answer. For a subscription still active today, the time should be treated neither as the tenure up until today nor as NA. What do we know exactly about those subscriptions? We know they have lasted up until today; although we don't know exactly how long their lifetimes are, we do know they are longer than the tenure observed so far.
This is a situation known as right-censor in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to:
id t1 t2 status(3=interval_censored)
1 2 2 3
2 3 NA 3
3 1 1 3
Then you will need to change your R Surv object from:
Surv(time=tenure_in_months, event=status, type="right")
to:
Surv(t1, t2, type="interval2")  # with "interval2", the censoring pattern is inferred from the NAs in t2, so no event argument is needed
See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of computational details can be found: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful.
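A minimal sketch of that interval2 coding, using the three example rows above (the month values are illustrative):
library(survival)
# t1/t2 in months; equal values = exact event, NA in t2 = still active (right censored)
subs <- data.frame(t1 = c(2, 3, 1),
                   t2 = c(2, NA, 1))
obj <- Surv(subs$t1, subs$t2, type = "interval2")
fit <- survfit(obj ~ 1, data = subs)   # uses the Turnbull estimate for interval-censored data
plot(fit)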
If a missing end date means that the subscription is still active, then you need to take the time until the current date as the censoring time.
NA won't work in the survival object; those cases would simply be omitted. That is not what you want, because these cases contain important information about survival.
SQL code to get the time until the event (use in the SELECT part of the query):
DATEDIFF(M, start_date, ISNULL(end_date, GETDATE())) AS tenure_in_months
BTW:
I would use the difference in days for my analysis; it does not make sense to round the time off to months.
You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct. The status of 0 for id 2 indicates it's right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
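Pulling the last two answers together, a rough R sketch of the censor-date approach; censor_date below is a placeholder for illustration and should be the real data-extraction date:
library(survival)
subscriptions <- data.frame(
  id         = 1:3,
  start_date = as.Date(c("2013-06-01", "2013-06-01", "2013-08-01")),
  end_date   = as.Date(c("2013-08-25", NA, "2013-09-12"))
)
censor_date <- as.Date("2013-10-01")   # hypothetical data-collection date
# Replace missing end dates with the censor date; flag cancelled vs. still active
end_or_censor <- subscriptions$end_date
end_or_censor[is.na(end_or_censor)] <- censor_date
subscriptions$status      <- as.integer(!is.na(subscriptions$end_date))   # 1 = cancelled, 0 = active
subscriptions$tenure_days <- as.numeric(end_or_censor - subscriptions$start_date)
fit <- survfit(Surv(tenure_days, status) ~ 1, data = subscriptions)
plot(fit)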