This is my data:
Year1 <- c(2015,2013,2012,2018)
Year2 <- c(2017,2015,2014,2020)
my_data <- data.frame(Year1, Year2)
I need an if statement that returns 1 when year 1 equals 2015 OR 2016 AND year 2 is greater than 2016. Currently, my code looks like this:
my_data <- my_data %>%
mutate(Y_2016=ifelse(my_data$Year1==2015|2016 & my_data$Year1>2016,1,0))
But this does not work and only seems to check the condition if Year 2 is greater than 2016, since it returns 1 even for the last row when Year 1 is 2018 and Year 2 is 2020.
Thank you for your help!
Instead of my_data$Year1==2015|2016, use %in%, as in my_data$Year1 %in% c(2015,2016).
There is also a typo: my_data$Year1>2016 should test Year2, i.e. my_data$Year2>2016.
Since you are using dplyr, you do not need to prefix every variable with my_data$.
my_data%>%
mutate(Y_2016=ifelse(Year1 %in% c(2015,2016) & Year2>2016,1,0))
Year1 Year2 Y_2016
1 2015 2017 1
2 2013 2015 0
3 2012 2014 0
4 2018 2020 0
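To see why the original condition returned 1 for the last row, here is a small sketch on plain vectors (the names broken and fixed are only illustrative). In R, & binds more tightly than |, and a bare non-zero number such as 2016 is coerced to TRUE, so the original test effectively becomes Year1 == 2015 | Year1 > 2016.

```r
Year1 <- c(2015, 2013, 2012, 2018)
Year2 <- c(2017, 2015, 2014, 2020)

# Parsed as: Year1 == 2015 | (2016 & Year1 > 2016)
# The bare 2016 is treated as TRUE, so Year2 is never consulted.
broken <- Year1 == 2015 | 2016 & Year1 > 2016

# %in% tests membership in the set of years, then Year2 is checked separately.
fixed <- Year1 %in% c(2015, 2016) & Year2 > 2016

print(broken)
print(fixed)
```

The broken version flags the first and last rows, which matches the behaviour described in the question; the fixed version flags only the first row.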
I'm working with panel data and I want to keep only the ids whose first observation does not already have v_1 = 1.
This is similar to the bysort command in Stata.
In the example I want to keep only the observations for id 61312 and not those for 42848.
Thanks
dd <- read.table(text="
id year v_1
61312 2015 0
61312 2016 0
61312 2017 1
61312 2018 1
42848 2016 1
42848 2017 0", header=TRUE)
You can use group_by and filter from dplyr to help with this task
library(dplyr)
dd %>%
group_by(id) %>%
filter(first(v_1) != 1)
We use group_by so that first() looks at the first value of v_1 within each id.
You can use -
subset(dd, id %in% unique(id)[v_1[!duplicated(id)] != 1])
# id year v_1
#1 61312 2015 0
#2 61312 2016 0
#3 61312 2017 1
#4 61312 2018 1
v_1[!duplicated(id)] keeps only the first v_1 value of each id, and we select only those ids whose first value is not equal to 1.
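The one-liner can be unpacked into its intermediate steps. This is only a sketch on the example data; the names first_rows, first_v1, keep_ids and kept are introduced here for illustration.

```r
dd <- data.frame(
  id   = c(61312, 61312, 61312, 61312, 42848, 42848),
  year = c(2015, 2016, 2017, 2018, 2016, 2017),
  v_1  = c(0, 0, 1, 1, 1, 0)
)

# TRUE on the first row of each id, FALSE on the repeats
first_rows <- !duplicated(dd$id)

# v_1 on those first rows, in the same order as unique(dd$id)
first_v1 <- dd$v_1[first_rows]

# ids whose first observation does not have v_1 == 1
keep_ids <- unique(dd$id)[first_v1 != 1]

kept <- dd[dd$id %in% keep_ids, ]
print(kept)
```

Note that this relies on the rows of each id already being in time order, as they are in the example.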
I have a data set with the variable 'months' from 1 to 12, but I need to change the numbers to month names, i.e. "1" needs to be January and so on. What's the easiest way to do this?
R has a built-in vector called month.name. For your purpose you could do something like the following:
# Some dummy data
set.seed(1)
df <- data.frame(
month = sample(1:12, size = 10)
)
# Now use your integer month to subset month.name
df$month2 <- month.name[df$month] # Also has month.abb
df
month month2
1 9 September
2 4 April
3 7 July
4 1 January
5 2 February
6 5 May
7 3 March
8 8 August
9 6 June
10 11 November
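If the month names will later be sorted or plotted, plain character strings sort alphabetically (April before January). One option, sketched here with made-up input, is to store them as a factor whose levels follow month.name; the variable names are illustrative.

```r
months <- c(3, 1, 12, 1)

# Character lookup, then a factor with calendar-ordered levels
month_names <- factor(month.name[months], levels = month.name)

# Sorting now follows the calendar rather than the alphabet
sorted_names <- as.character(sort(month_names))
print(sorted_names)
```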
I want to subset my dataframe from Sept 2017 to April 2018. My dataframe is like this:-
Year Month Day Avg_Temp
2017 8 31 20
2017 9 1 22
.
.
.
2018 4 30 26
2018 5 1 30
I want that my dataset from 1 Sept 2017 to 30 April 2018.
Year Month Day Avg_Temp
2017 9 1 22
.
.
.
2018 4 30 26
I am able to subset based on just the year:
df <-df[df$YEAR>="2017" & df$YEAR<="2018", ]
But I need to subset from month as well. Any help would be great
Try this option:
df <- df[(df$Year == 2017 & df$Month >= 9) |
(df$Year == 2018 & df$Month <= 4), ]
By the way, you might want to consider storing your dates as a proper date type, including a day component.
Here is a dplyr approach:
require(tidyverse)
df<-data.frame(Year=c(2018,2017,2017,2017,2018,2018,2018),
Month=c(9,8,10,4,9,3,4),Day=c(13,12,14,15,17,15,14))
df %>%
filter(Year==2017&Month>=9|Year==2018&Month<=4)
Which yields:
Year Month Day
1 2017 10 14
2 2018 3 15
3 2018 4 14
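Perhaps it would be easier if the three date components were encoded in one Date column:
# Build the date from Year, Month and Day (not df$Date, which does not exist yet)
df$Date <- as.Date(paste(df$Year, df$Month, df$Day, sep = '-'))
df$Year <- NULL
df$Month <- NULL
# Inclusive bounds so that 1 Sept 2017 and 30 April 2018 are both kept
df <- df[df$Date >= as.Date('2017-09-01') & df$Date <= as.Date('2018-04-30'), ]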
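A self-contained sketch of the single Date column idea, using only the four rows visible in the question and inclusive bounds so that 1 Sept 2017 and 30 April 2018 are both kept. The names temps, in_range and subset_temps are illustrative.

```r
temps <- data.frame(
  Year     = c(2017, 2017, 2018, 2018),
  Month    = c(8, 9, 4, 5),
  Day      = c(31, 1, 30, 1),
  Avg_Temp = c(20, 22, 26, 30)
)

# Combine the three components into one Date column
temps$Date <- as.Date(paste(temps$Year, temps$Month, temps$Day, sep = "-"))

# Inclusive range test on real dates
in_range <- temps$Date >= as.Date("2017-09-01") &
  temps$Date <= as.Date("2018-04-30")

subset_temps <- temps[in_range, ]
print(subset_temps)
```

Working with a Date column means the same comparison keeps working if the range later spans more than two calendar years.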
I have a dataset of patents, where I have recorded 1) the month and year associated with a patent renewal and 2) whether the patent holder chose to pay the patent fee or let the patent lapse. So
patentid fee1date fee1paid fee2date fee2paid
1 May 2010 True May 2013 False
2 May 2010 True April 2014 True
What I want to do is develop a count of the number of renewals by month and by year, as well as the number of abandoned patents, as follows:
date renewed lapsed
May 2010 2 0
How might I count the data that I have now? Thank you!
EDIT: The key point is to aggregate these across different columns. The issue that I am running into now is that when I try using the count library, it treats 2 renewals in May 2010 as two separate values.
Using dplyr
require(tidyr)
require(dplyr)
data %>% gather(year,value, -Patent.ID) %>%
separate('year',c('Fee','N','Act')) %>%
spread(Act,value) %>%
unite(Fee, Fee,N, sep = '.') %>%
group_by(Date) %>%
summarise(R=sum(Paid=='True'), NotR=sum(Paid=='False'))
# A tibble: 3 x 3
Date R NotR
<chr> <int> <int>
1 April 2014 1 0
2 May 2010 2 0
3 May 2013 0 1
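For comparison, the same counts can be sketched in base R by stacking the date and paid columns into a long format and summing per date. This assumes the column names from the question (fee1date, fee1paid, and so on) and logical paid columns; patents, long and counts are illustrative names.

```r
patents <- data.frame(
  patentid = c(1, 2),
  fee1date = c("May 2010", "May 2010"),
  fee1paid = c(TRUE, TRUE),
  fee2date = c("May 2013", "April 2014"),
  fee2paid = c(FALSE, TRUE),
  stringsAsFactors = FALSE
)

# One row per (patent, fee) event
long <- data.frame(
  date = c(patents$fee1date, patents$fee2date),
  paid = c(patents$fee1paid, patents$fee2paid),
  stringsAsFactors = FALSE
)

# Sum paid and unpaid events within each date
renewed <- tapply(long$paid, long$date, sum)
lapsed  <- tapply(!long$paid, long$date, sum)

counts <- data.frame(
  date    = names(renewed),
  renewed = as.vector(renewed),
  lapsed  = as.vector(lapsed),
  stringsAsFactors = FALSE
)
print(counts)
```

The dates are plain strings here, so the rows come out in alphabetical order rather than chronological order.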
Data
data <- read.table(text="
'Patent ID' 'Fee 1 Date' 'Fee 1 Paid' 'Fee 2 Date' 'Fee 2 Paid'
1 'May 2010' True 'May 2013' False
2 'May 2010' True 'April 2014' True
",header=T, stringsAsFactors = F)
Problem Description:
I am trying to calculate recency for each salesman: the most recent value in the Year column where the target-achieved indicator equals 1. If the indicator is 0 in every year for that salesman, choose the minimum year instead.
Data:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
1 AA-5468 2012 1
2 AA-5468 2013 0
3 AA-5468 2014 0
4 AA-5468 2015 0
5 AA-5468 2016 1
6 AL-3791 2012 1
7 AL-3791 2013 1
8 AL-3791 2014 0
9 AL-3893 2015 0
10 AL-3893 2016 0
Expected Output:
Salesman_ID Year Yearly_Targets_Achieved_Indicator
<chr> <dbl> <dbl>
1 AA-5468 2016 1
2 AL-3791 2013 1
9 AL-3893 2015 0
Using the tidyverse, I suggest the following code:
library(tidyverse)
Prashant_df <- data.frame(
c("AA-5468","AA-5468","AA-5468","AA-5468","AA-5468","AL-3791","AL-3791","AL-3791","AL-3893","AL-3893"),
c(2012,2013,2014,2015,2016,2012,2013,2014,2015,2016),
c(1,0,0,0,1,1,1,0,0,0)
)
names(Prashant_df) <- c("Salesman_ID","Year","Yearly_Targets_Achieved_Indicator")
# One row per salesman: the latest year in which the target was achieved,
# or the earliest year if it was never achieved.
# (Taking max(Year) over the whole group would return 2014 for AL-3791, not 2013.)
Prashant_df_collapsed <- Prashant_df %>%
  group_by(Salesman_ID) %>%
  summarise(
    Year=if (any(Yearly_Targets_Achieved_Indicator==1)) max(Year[Yearly_Targets_Achieved_Indicator==1]) else min(Year),
    Yearly_Targets_Achieved_Indicator=max(Yearly_Targets_Achieved_Indicator))
You can store both maximum and minimum year for each salesman, and the maximum of your binary variable.
newdf = df %>% group_by(Salesman_ID) %>% summarise(
maximum = max(Year),
minimum = min(Year),
maxInd = max(Yearly_Targets_Achieved_Indicator))
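From this you can construct most of your resulting variable: use minimum when maxInd is 0. Note that maximum is the latest year overall, so for salesmen with maxInd equal to 1 you still need the latest year in which the indicator was 1, e.g. max(Year[Yearly_Targets_Achieved_Indicator == 1]).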
Using Base R:
c(by(dat,dat[1],function(x)if(all(x[,3]==0)) x[1,2] else max(x[which(x[,3]==1),2])))
AA-5468 AL-3791 AL-3893
2016 2013 2015
This code is a bit messy but produces the desired output. Here is the explanation:
First group by Salesman_ID. Then, for each group, check whether all the indicators are zero. If so, return the first year (the minimum, given that the rows are sorted by year); otherwise, return the latest (maximum) year among the rows whose indicator is 1.
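The same logic can be written more readably with split() and sapply(). This is a sketch on the data from the question; dat, hit and recency are illustrative names, and min() is used instead of the first row so that the result does not depend on the row order.

```r
dat <- data.frame(
  Salesman_ID = c("AA-5468", "AA-5468", "AA-5468", "AA-5468", "AA-5468",
                  "AL-3791", "AL-3791", "AL-3791", "AL-3893", "AL-3893"),
  Year = c(2012, 2013, 2014, 2015, 2016, 2012, 2013, 2014, 2015, 2016),
  Yearly_Targets_Achieved_Indicator = c(1, 0, 0, 0, 1, 1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

# One data frame per salesman, then one year per data frame
recency <- sapply(split(dat, dat$Salesman_ID), function(g) {
  hit <- g$Yearly_Targets_Achieved_Indicator == 1
  # Latest year with a hit, or the earliest year if there was never a hit
  if (any(hit)) max(g$Year[hit]) else min(g$Year)
})
print(recency)
```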