I would like to count the number of months a person has worked.
separation_month is the calendar month of dismissal if there was one, and is 0 if the person was not dismissed in the current year (2017).
I want to count the months from hire date to dismissal date (if the person was dismissed).
If he was not dismissed, he worked until the end of the current year, so I want to count all 12 months of 2017 plus the months from earlier years.
structure(list(id = 1:5, current_year = c(2017L, 2017L, 2017L,
2017L, 2017L), hire_month = c(2L, 9L, 10L, 3L, 2L), hire_year = c(2016L,
2014L, 1980L, 2017L, 2017L), separation_month = c(0L, 3L, 4L,
4L, 0L)), class = "data.frame", row.names = c(NA, -5L))
id current_year hire_month hire_year separation_month
1 1 2017 2 2016 0
2 2 2017 9 2014 3
3 3 2017 10 1980 4
4 4 2017 3 2017 4
5 5 2017 2 2017 0
E.g. for the first observation, I expect there to be 23 months (he worked for 11 months in 2016 and for 12 months in 2017 since he was not separated from his job).
Stata:
gen months_worked = separation_month+ (separation_month==0)*12
replace months_worked = months_worked + (current_year-hire_year)*12-hire_month+1
R:
library(dplyr)
df %>%
  mutate(months_worked = separation_month + (separation_month < 1) * 12,
         months_worked = months_worked + (current_year - hire_year) * 12 - hire_month + 1)
Another Stata solution:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int current_year byte hire_month int hire_year byte separation_month
1 2017 2 2016 0
2 2017 9 2014 3
3 2017 10 1980 4
4 2017 3 2017 4
5 2017 2 2017 0
end
gen wanted = 1 + cond(separation_month == 0, ym(2017, 12) - ym(hire_year, hire_month), ym(2017, separation_month) - ym(hire_year, hire_month))
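The month-index idea behind ym() can also be sketched in base R. This is only an illustration: month_index() is a hypothetical helper (not a built-in), and it counts months from year 0 rather than from 1960 as Stata does, which makes no difference once two indexes are subtracted.

```r
# Example data from the question
df <- data.frame(
  id = 1:5,
  current_year = 2017L,
  hire_month = c(2L, 9L, 10L, 3L, 2L),
  hire_year = c(2016L, 2014L, 1980L, 2017L, 2017L),
  separation_month = c(0L, 3L, 4L, 4L, 0L)
)

# Hypothetical helper: running month number, playing the same role as ym()
month_index <- function(year, month) year * 12L + month

# separation_month == 0 means "not dismissed", so the last month worked is December
last_month <- ifelse(df$separation_month == 0L, 12L, df$separation_month)

# +1 because the hire month itself counts as a month worked
df$months_worked <- 1L +
  month_index(df$current_year, last_month) -
  month_index(df$hire_year, df$hire_month)

df$months_worked
# 23 31 439 2 11
```

The first value is the 23 months expected in the question, and the rest agree with the arithmetic in the two Stata versions.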
What we have:
companyID year status
1 2010
1 2011
1 2012 2
1 2013
1 2014
2 2007
2 2008
2 2009 2
2 2010
2 2011
2 2012 1
2 2013
For companyID 1: the only recorded status is a 2 in 2012. I want R to set every earlier observation for that company to status 1, and every later observation to status 2.
For companyID 2: there is a status 2 in 2009. I want R to set every earlier observation to status 1, then carry status 2 forward until a status 1 shows up again (in 2012), which is then carried forward in turn.
(In short: fill the rows before the first recorded status with the other value, then carry each recorded status forward until the next change, where a change is either a new company or a status that was already recorded in the original data frame.)
The result we want to achieve looks like this:
companyID year status
1 2010 1
1 2011 1
1 2012 2
1 2013 2
1 2014 2
2 2007 1
2 2008 1
2 2009 2
2 2010 2
2 2011 2
2 2012 1
2 2013 1
We have a large dataset, so doing this by hand is not feasible. Is there a way to do this for both companyIDs at once (and hence for all the thousands of observations we have) in R?
Here is one way:
library(dplyr)
library(tidyr)
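df %>%
  group_by(companyID) %>%
  # carry each recorded status forward within a company
  fill(status) %>%
  # the NAs that remain precede the first recorded status: give them the other value
  mutate(status = replace(status, is.na(status),
                          ifelse(na.omit(status)[1] == 1, 2, 1))) %>%
  ungroup()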
# companyID year status
# <int> <int> <dbl>
# 1 1 2010 1
# 2 1 2011 1
# 3 1 2012 2
# 4 1 2013 2
# 5 1 2014 2
# 6 2 2007 1
# 7 2 2008 1
# 8 2 2009 2
# 9 2 2010 2
#10 2 2011 2
#11 2 2012 1
#12 2 2013 1
data
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), year = c(2010L, 2011L, 2012L, 2013L, 2014L,
2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L), status = c(NA,
NA, 2L, NA, NA, NA, NA, 2L, NA, NA, 1L, NA)),
class = "data.frame", row.names = c(NA, -12L))
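If you would rather not load extra packages, the same two steps (carry forward, then back-fill the leading NAs with the other value) can be sketched in base R with ave(). fill_status() is a hypothetical helper written for this example; it assumes the rows are already sorted by year within each company and that status only takes the values 1 and 2.

```r
df <- data.frame(
  companyID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  year = c(2010:2014, 2007:2013),
  status = c(NA, NA, 2, NA, NA, NA, NA, 2, NA, NA, 1, NA)
)

# Hypothetical helper, applied to one company's status vector at a time
fill_status <- function(s) {
  known <- s[!is.na(s)]
  if (length(known) == 0) return(s)  # no recorded status: nothing to fill from
  # number of recorded values seen so far (0 before the first one)
  seen <- cumsum(!is.na(s))
  # index 1 is NA (before the first recorded value), index k + 1 is the k-th recorded value
  filled <- c(NA, known)[seen + 1]
  # rows before the first recorded status get the other value
  filled[is.na(filled)] <- if (known[1] == 1) 2 else 1
  filled
}

df$status <- ave(df$status, df$companyID, FUN = fill_status)
df$status
# 1 1 2 2 2 1 1 2 2 2 1 1
```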
I would like to create a new data frame by merging two data frames of unequal size on two matching columns, replacing the missing values with 0.
These are two examples of the data frames I have:
df1
ID YEAR_INTERVIEW ID_HOUSEHOLD
1 2017 300
1 2018 300
1 2019 300
2 2017 150
2 2018 150
2 2019 150
3 2017 420
3 2018 420
df2
ID YEAR_INTERVIEW YEARS_EDU
1 2017 10
1 2018 10
1 2019 10
3 2017 3
3 2018 3
*Note that in the second data frame I don't have information for individual 2.
I would like to get the following data frame:
df3
ID YEAR_INTERVIEW ID_HOUSEHOLD YEARS_EDU
1 2017 300 10
1 2018 300 10
1 2019 300 10
2 2017 150 0
2 2018 150 0
2 2019 150 0
3 2017 420 3
3 2018 420 3
I am trying:
df3<-merge(df1,df2, by="ID", all=TRUE)
df3<-merge(df1,df2, by="ID","YEAR_INTERVIEW", all=TRUE)
The first option replicates hundreds of ID observations with years of interviews while the second gives me 0 values.
Any help would be much appreciated :) THANK YOU
The by argument needs to be a single character vector, which we can create with c(). Also, all = TRUE is a full join; here it should be a left join, i.e. all.x = TRUE. Where there is no match, the element will be NA by default.
out <- merge(df1,df2, by=c("ID","YEAR_INTERVIEW"), all.x=TRUE)
The NAs can be converted to 0
out$YEARS_EDU[is.na(out$YEARS_EDU)] <- 0
-output
out
# ID YEAR_INTERVIEW ID_HOUSEHOLD YEARS_EDU
#1 1 2017 300 10
#2 1 2018 300 10
#3 1 2019 300 10
#4 2 2017 150 0
#5 2 2018 150 0
#6 2 2019 150 0
#7 3 2017 420 3
#8 3 2018 420 3
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
YEAR_INTERVIEW = c(2017L,
2018L, 2019L, 2017L, 2018L, 2019L, 2017L, 2018L), ID_HOUSEHOLD = c(300L,
300L, 300L, 150L, 150L, 150L, 420L, 420L)), class = "data.frame",
row.names = c(NA,
-8L))
df2 <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L),
YEAR_INTERVIEW = c(2017L,
2018L, 2019L, 2017L, 2018L), YEARS_EDU = c(10L, 10L, 10L, 3L,
3L)), class = "data.frame", row.names = c(NA, -5L))
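To see why the first attempt in the question blows up, it can help to compare row counts. A small sketch using the same example data (the object names by_id_only and by_both are just for illustration):

```r
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  YEAR_INTERVIEW = c(2017L, 2018L, 2019L, 2017L, 2018L, 2019L, 2017L, 2018L),
  ID_HOUSEHOLD = c(300L, 300L, 300L, 150L, 150L, 150L, 420L, 420L)
)
df2 <- data.frame(
  ID = c(1L, 1L, 1L, 3L, 3L),
  YEAR_INTERVIEW = c(2017L, 2018L, 2019L, 2017L, 2018L),
  YEARS_EDU = c(10L, 10L, 10L, 3L, 3L)
)

# Matching on ID alone pairs every year of an ID in df1 with every year of that ID in df2
by_id_only <- merge(df1, df2, by = "ID", all = TRUE)

# Matching on both columns keeps one row per ID/year combination in df1
by_both <- merge(df1, df2, by = c("ID", "YEAR_INTERVIEW"), all.x = TRUE)
by_both$YEARS_EDU[is.na(by_both$YEARS_EDU)] <- 0L

nrow(by_id_only)  # 16 (3 x 3 for ID 1, 3 unmatched for ID 2, 2 x 2 for ID 3)
nrow(by_both)     # 8, the same as df1
```

With hundreds of interview years per ID, that per-ID cross product is what produces the replicated observations described in the question.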
My dataframe looks like this:
Index Year Renovation
1 2012 1
1 2018 1
2 2012 1
2 2018 1
3 2012 0
3 2018 0
I would like to change the Renovation variable for 2012 to '0', IF the renovation variable for 2018 was "1". So I am facing a double condition here. How can I do this in R?
You can use ifelse to check both conditions. Within each Index, Renovation[match(2018, Year)] picks out the Renovation value of the 2018 row.
library(dplyr)
df %>%
group_by(Index) %>%
mutate(Renovation = ifelse(Year == 2012 &
Renovation[match(2018, Year)] == 1, 0, Renovation))
# Index Year Renovation
# <int> <int> <dbl>
#1 1 2012 0
#2 1 2018 1
#3 2 2012 0
#4 2 2018 1
#5 3 2012 0
#6 3 2018 0
data
df <- structure(list(Index = c(1L, 1L, 2L, 2L, 3L, 3L), Year = c(2012L,
2018L, 2012L, 2018L, 2012L, 2018L), Renovation = c(1L, 1L, 1L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -6L))
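The same update can also be sketched in base R with logical indexing, in case dplyr is not available. This assumes, as in the example, at most one 2018 row per Index; renovated_2018 is just a name chosen for this example.

```r
df <- data.frame(
  Index = c(1L, 1L, 2L, 2L, 3L, 3L),
  Year = c(2012L, 2018L, 2012L, 2018L, 2012L, 2018L),
  Renovation = c(1L, 1L, 1L, 1L, 0L, 0L)
)

# Index values whose 2018 row has Renovation == 1
renovated_2018 <- df$Index[df$Year == 2018 & df$Renovation == 1]

# Set the 2012 rows of exactly those Index values to 0
df$Renovation[df$Year == 2012 & df$Index %in% renovated_2018] <- 0L

df$Renovation
# 0 1 0 1 0 0
```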
I have a dataframe, called dets_per_month, that looks like so...
Zone month yearcollected total
1 Jul 2017 183
1 Jul 2015 18
1 Aug 2015 202
1 Aug 2017 202
1 Aug 2017 150
1 Sep 2017 68
2 Apr 2018 65
2 Jun 2018 25
2 Sep 2018 278
I'm trying to input 0's for months where there are no totals in a particular zone. This is the code I tried using to input those 0's
complete(dets_per_month, nesting(zone, month), yearcollected = 2016:2018, fill = list(count = 0))
But the output of this doesn't give me any 0's, instead it adds on columns from my original dataframe.
Can anyone tell me how to get 0's for this?
You could use complete after grouping by Zone and yearcollected. We can use month.abb, which is a built-in constant holding the abbreviated English month names.
library(dplyr)
df %>%
group_by(Zone, yearcollected) %>%
tidyr::complete(month = month.abb, fill = list(total = 0))
# Zone yearcollected month total
# <int> <int> <chr> <dbl>
# 1 1 2015 Apr 0
# 2 1 2015 Aug 202
# 3 1 2015 Dec 0
# 4 1 2015 Feb 0
# 5 1 2015 Jan 0
# 6 1 2015 Jul 18
# 7 1 2015 Jun 0
# 8 1 2015 Mar 0
# 9 1 2015 May 0
#10 1 2015 Nov 0
# … with 27 more rows
data
df <- structure(list(Zone = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
month = structure(c(3L, 3L, 2L, 2L, 2L, 5L, 1L, 4L, 5L), .Label = c("Apr",
"Aug", "Jul", "Jun", "Sep"), class = "factor"), yearcollected = c(2017L,
2015L, 2015L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L),
total = c(183L, 18L, 202L, 202L, 150L, 68L, 65L, 25L, 278L
)), class = "data.frame", row.names = c(NA, -9L))
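For comparison, here is a base R sketch of what complete is doing here: build every Zone/yearcollected/month combination for the Zone/year pairs that occur in the data, left-join the observed totals onto it, and turn the unmatched totals into 0. The names groups, grid and out are only for this example, and month is stored as character rather than factor to keep the join simple.

```r
df <- data.frame(
  Zone = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  month = c("Jul", "Jul", "Aug", "Aug", "Aug", "Sep", "Apr", "Jun", "Sep"),
  yearcollected = c(2017L, 2015L, 2015L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L),
  total = c(183L, 18L, 202L, 202L, 150L, 68L, 65L, 25L, 278L),
  stringsAsFactors = FALSE
)

# Zone/year pairs that actually occur in the data
groups <- unique(df[c("Zone", "yearcollected")])

# Cross join with the 12 month abbreviations (by = NULL gives the Cartesian product)
grid <- merge(groups, data.frame(month = month.abb, stringsAsFactors = FALSE), by = NULL)

# Left join the observed totals; months without data come back as NA
out <- merge(grid, df, by = c("Zone", "yearcollected", "month"), all.x = TRUE)
out$total[is.na(out$total)] <- 0L

nrow(out)  # 37: 3 groups x 12 months, plus the duplicated Zone 1 / Aug / 2017 row
```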
I have 2 dataframes that I need to loop through.
Df1[1:5,]
year month Vol
1 2015 7 4.82e-05
2 2015 6 5.91e-05
3 2015 5 6.56e-05
4 2015 4 6.10e-05
5 2015 3 7.85e-05
Df2[1:5,]
year month IB
1 2015 7 0
2 2015 4 1
3 2015 3 0
4 2015 6 1
5 2015 5 0
I need to loop through DF1, compare the months from DF1 and DF2, and if they are the same then set DF1$IB<-DF2$IB. I tried using sapply, but I get this error
tmp<-sapply(DF1$month,function(x){if(DF2$month==x){
DF1$IB<-DF2$IB
}})
Warning messages:
1: In if (DF2$month == x) { :
the condition has length > 1 and only the first element will be used
.....
Any help would be greatly appreciated. Otherwise I would have to resort to multiple for loops, and since DF1 is 900K rows long and DF2 is 300 rows long, that seems very inefficient to me.
With the latest development version of data.table (v1.9.5, installable from GitHub) you don't need to set keys; setDT(df1)[df2, on = c("year","month")] is enough to add the IB column. This gives:
year month Vol IB
1: 2015 7 4.82e-05 0
2: 2015 4 6.10e-05 1
3: 2015 3 7.85e-05 0
4: 2015 6 5.91e-05 1
5: 2015 5 6.56e-05 0
If the year/month combinations are not the same in both datasets, you have to join the other way round so that every row of df1 is kept:
setDT(df2)[df1, on = c("year","month")]
which gives:
year month IB Vol
1: 2015 7 0 4.82e-05
2: 2015 6 1 5.91e-05
3: 2015 5 0 6.56e-05
4: 2015 4 1 6.10e-05
5: 2015 3 NA 7.85e-05
Used data for second example:
df1 <- structure(list(year = c(2015L, 2015L, 2015L, 2015L, 2015L), month = c(7L, 6L, 5L, 4L, 3L), Vol = c(4.82e-05, 5.91e-05, 6.56e-05, 6.1e-05, 7.85e-05)), .Names = c("year", "month", "Vol"), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
df2 <- structure(list(year = c(2015L, 2015L, 2015L, 2015L, 2015L), month = c(7L, 4L, 2L, 6L, 5L), IB = c(0L, 1L, 0L, 1L, 0L)), .Names = c("year", "month", "IB"), class = "data.frame", row.names = c("1", "2", "3", "4", "5"))
If your Df1 is that large, data.table might be a better choice than merge.
library(data.table)
setkey(setDT(Df1),year,month)[setDT(Df2),IB:=IB]
Df1
# year month Vol IB
# 1: 2015 3 7.85e-05 0
# 2: 2015 4 6.10e-05 1
# 3: 2015 5 6.56e-05 0
# 4: 2015 6 5.91e-05 1
# 5: 2015 7 4.82e-05 0
So this converts Df1 to a data.table and keys it on year and month, then does a data.table join on Df2 (also converted to a data.table), and finally adds the IB column from Df2 to Df1.
Using a more realistic example:
set.seed(1)
Df1 <- data.frame(year=rep(2015,1e6),
month=sample(3:7,1e6,replace=TRUE),
Vol=rnorm(1e6))
system.time(result.mrg <- merge(Df1,Df2,by=c("year","month")))
# user system elapsed
# 11.8 0.0 11.8
system.time(result.dt <- setkey(setDT(Df1),year,month)[setDT(Df2),IB:=IB])
# user system elapsed
# 0.07 0.00 0.06
identical(result.mrg$IB, result.dt$IB)
# [1] TRUE
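If installing data.table is not an option, the lookup can also be sketched in base R with match() on a combined year/month key. This is a rough alternative rather than a benchmarked one; it assumes each year/month pair occurs at most once in the lookup table, and key1 and key2 are names made up for this example.

```r
df1 <- data.frame(
  year = 2015L,
  month = c(7L, 6L, 5L, 4L, 3L),
  Vol = c(4.82e-05, 5.91e-05, 6.56e-05, 6.1e-05, 7.85e-05)
)
df2 <- data.frame(
  year = 2015L,
  month = c(7L, 4L, 2L, 6L, 5L),
  IB = c(0L, 1L, 0L, 1L, 0L)
)

# One string key per row, e.g. "2015 7"
key1 <- paste(df1$year, df1$month)
key2 <- paste(df2$year, df2$month)

# For every row of df1, find the matching row of df2 (NA where there is none)
df1$IB <- df2$IB[match(key1, key2)]

df1$IB
# 0 1 0 1 NA
```

match() is vectorised, so no explicit loop over the rows of the large data frame is needed, and rows without a partner in df2 simply get NA, as in the second data.table join above.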