I have a dataframe, called dets_per_month, that looks like so...
**Zone month yearcollected total**
1 Jul 2017 183
1 Jul 2015 18
1 Aug 2015 202
1 Aug 2017 202
1 Aug 2017 150
1 Sep 2017 68
2 Apr 2018 65
2 Jun 2018 25
2 Sep 2018 278
I'm trying to insert 0s for months that have no total in a particular zone. This is the code I tried:
complete(dets_per_month, nesting(zone, month), yearcollected = 2016:2018, fill = list(count = 0))
But the output doesn't give me any 0s; instead, it adds on columns from my original dataframe.
Can anyone tell me how to get 0's for this?
You could use complete after grouping by Zone and yearcollected. We can use month.abb, which is a built-in constant holding the abbreviated English month names.
library(dplyr)
df %>%
group_by(Zone, yearcollected) %>%
tidyr::complete(month = month.abb, fill = list(total = 0))
# Zone yearcollected month total
# <int> <int> <chr> <dbl>
# 1 1 2015 Apr 0
# 2 1 2015 Aug 202
# 3 1 2015 Dec 0
# 4 1 2015 Feb 0
# 5 1 2015 Jan 0
# 6 1 2015 Jul 18
# 7 1 2015 Jun 0
# 8 1 2015 Mar 0
# 9 1 2015 May 0
#10 1 2015 Nov 0
# … with 27 more rows
data
df <- structure(list(Zone = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
month = structure(c(3L, 3L, 2L, 2L, 2L, 5L, 1L, 4L, 5L), .Label = c("Apr",
"Aug", "Jul", "Jun", "Sep"), class = "factor"), yearcollected = c(2017L,
2015L, 2015L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L),
total = c(183L, 18L, 202L, 202L, 150L, 68L, 65L, 25L, 278L
)), class = "data.frame", row.names = c(NA, -9L))
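For readers who prefer base R, the same zero-filling can be sketched without tidyr. This is only an illustration, assuming month is stored as character rather than as a factor; grid and out are names introduced here and are not part of the answer above.

```r
# Sample data from the question (month kept as character for simplicity)
df <- data.frame(
  Zone = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  month = c("Jul", "Jul", "Aug", "Aug", "Aug", "Sep", "Apr", "Jun", "Sep"),
  yearcollected = c(2017L, 2015L, 2015L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L),
  total = c(183L, 18L, 202L, 202L, 150L, 68L, 65L, 25L, 278L),
  stringsAsFactors = FALSE
)

# Every observed Zone/year pair crossed with all 12 month abbreviations;
# merge() on two data frames with no shared columns returns the Cartesian product
grid <- merge(unique(df[c("Zone", "yearcollected")]),
              data.frame(month = month.abb, stringsAsFactors = FALSE))

# Left join the observed totals onto the full grid, then replace NA with 0
out <- merge(grid, df, by = c("Zone", "yearcollected", "month"), all.x = TRUE)
out$total[is.na(out$total)] <- 0
```

As in the tidyr output, duplicated observations (the two Aug 2017 rows in Zone 1) are kept as separate rows, so the result has 37 rows rather than 36.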
Related
I would like to count the numbers of months a person has worked for.
Separation_month refers to the calendar month of dismissal if there was one and is equal to 0 if the person was not dismissed in the current year (2017).
I want to count the months from hire date to dismissal date (if the person was dismissed).
If he was not dismissed, he worked until the end of the current year, so I want to count all 12 months of 2017 plus the months from earlier years.
structure(list(id = 1:5, current_year = c(2017L, 2017L, 2017L,
2017L, 2017L), hire_month = c(2L, 9L, 10L, 3L, 2L), hire_year = c(2016L,
2014L, 1980L, 2017L, 2017L), separation_month = c(0L, 3L, 4L,
4L, 0L)), class = "data.frame", row.names = c(NA, -5L))
id current_year hire_month hire_year separation_month
1 1 2017 2 2016 0
2 2 2017 9 2014 3
3 3 2017 10 1980 4
4 4 2017 3 2017 4
5 5 2017 2 2017 0
E.g. for the first observation, I expect there to be 23 months (he worked for 11 months in 2016 and for 12 months in 2017 since he was not separated from his job).
Stata:
gen months_worked = separation_month+ (separation_month==0)*12
replace months_worked = months_worked + (current_year-hire_year)*12-hire_month+1
R:
library(dplyr)
df %>%
mutate(months_worked = separation_month + (separation_month<1)*12,
months_worked = months_worked + (current_year-hire_year)*12-hire_month+1
)
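The same arithmetic can be checked in base R against the sample data. This is a minimal sketch; last_month is a helper name introduced here for illustration.

```r
# Sample data from the question
df <- data.frame(
  id = 1:5,
  current_year = rep(2017L, 5),
  hire_month = c(2L, 9L, 10L, 3L, 2L),
  hire_year = c(2016L, 2014L, 1980L, 2017L, 2017L),
  separation_month = c(0L, 3L, 4L, 4L, 0L)
)

# Last month worked in the current year: 12 when there was no dismissal
last_month <- ifelse(df$separation_month == 0, 12L, df$separation_month)

# Months from hire to last month worked, counting both endpoints
df$months_worked <- (df$current_year - df$hire_year) * 12L +
  last_month - df$hire_month + 1L

df$months_worked  # 23 31 439 2 11
```

The first value, 23, matches the expectation stated in the question (11 months in 2016 plus 12 in 2017).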
Another Stata solution:
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id int current_year byte hire_month int hire_year byte separation_month
1 2017 2 2016 0
2 2017 9 2014 3
3 2017 10 1980 4
4 2017 3 2017 4
5 2017 2 2017 0
end
gen wanted = 1 + cond(separation_month == 0, ym(2017, 12) - ym(hire_year, hire_month), ym(2017, separation_month) - ym(hire_year, hire_month))
I would like to create a new data frame by merging two unequal data frames by matching two columns and replace with 0 the missing values.
These are two examples of the data frames I have:
df1
ID YEAR_INTERVIEW ID_HOUSEHOLD
1 2017 300
1 2018 300
1 2019 300
2 2017 150
2 2018 150
2 2019 150
3 2017 420
3 2018 420
df2
ID YEAR_INTERVIEW YEARS_EDU
1 2017 10
1 2018 10
1 2019 10
3 2017 3
3 2018 3
*Note that in the second data frame I don't have information for individual 2.
I would like to get the following data frame:
df3
ID YEAR_INTERVIEW ID_HOUSEHOLD YEARS_EDU
1 2017 300 10
1 2018 300 10
1 2019 300 10
2 2017 150 0
2 2018 150 0
2 2019 150 0
3 2017 420 3
3 2018 420 3
I am trying:
df3<-merge(df1,df2, by="ID", all=TRUE)
df3<-merge(df1,df2, by="ID","YEAR_INTERVIEW", all=TRUE)
The first option replicates hundreds of ID observations with years of interviews while the second gives me 0 values.
Any help would be much appreciated :) THANK YOU
The by argument needs to be a single character vector, which we can create with c(). Also, all = TRUE performs a full join, whereas a left join is needed here, so use all.x = TRUE. Where there is no match, the element will be NA by default.
out <- merge(df1,df2, by=c("ID","YEAR_INTERVIEW"), all.x=TRUE)
The NAs can be converted to 0
out$YEARS_EDU[is.na(out$YEARS_EDU)] <- 0
-output
out
# ID YEAR_INTERVIEW ID_HOUSEHOLD YEARS_EDU
#1 1 2017 300 10
#2 1 2018 300 10
#3 1 2019 300 10
#4 2 2017 150 0
#5 2 2018 150 0
#6 2 2019 150 0
#7 3 2017 420 3
#8 3 2018 420 3
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
YEAR_INTERVIEW = c(2017L,
2018L, 2019L, 2017L, 2018L, 2019L, 2017L, 2018L), ID_HOUSEHOLD = c(300L,
300L, 300L, 150L, 150L, 150L, 420L, 420L)), class = "data.frame",
row.names = c(NA,
-8L))
df2 <- structure(list(ID = c(1L, 1L, 1L, 3L, 3L),
YEAR_INTERVIEW = c(2017L,
2018L, 2019L, 2017L, 2018L), YEARS_EDU = c(10L, 10L, 10L, 3L,
3L)), class = "data.frame", row.names = c(NA, -5L))
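To see why joining on ID alone inflates the row count, as described in the question, the two merges can be compared on the sample data. This is a small sketch; by_id and by_both are names introduced here for illustration.

```r
df1 <- data.frame(
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  YEAR_INTERVIEW = c(2017L, 2018L, 2019L, 2017L, 2018L, 2019L, 2017L, 2018L),
  ID_HOUSEHOLD = c(300L, 300L, 300L, 150L, 150L, 150L, 420L, 420L)
)
df2 <- data.frame(
  ID = c(1L, 1L, 1L, 3L, 3L),
  YEAR_INTERVIEW = c(2017L, 2018L, 2019L, 2017L, 2018L),
  YEARS_EDU = c(10L, 10L, 10L, 3L, 3L)
)

# Joining on ID alone pairs every df1 row with every df2 row of the same ID
by_id <- merge(df1, df2, by = "ID", all = TRUE)
nrow(by_id)    # 16 = 3*3 (ID 1) + 3 (ID 2, unmatched) + 2*2 (ID 3)

# Joining on both keys keeps exactly one row per df1 row
by_both <- merge(df1, df2, by = c("ID", "YEAR_INTERVIEW"), all.x = TRUE)
nrow(by_both)  # 8
```

With hundreds of IDs and several interview years each, the first form grows quickly, which is consistent with the replication the question reports.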
My dataframe looks like this:
Index Year Renovation
1 2012 1
1 2018 1
2 2012 1
2 2018 1
3 2012 0
3 2018 0
I would like to change the Renovation variable for 2012 to '0', IF the renovation variable for 2018 was "1". So I am facing a double condition here. How can I do this in R?
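You can use ifelse within each Index group to test both conditions: the row is from 2012, and the Renovation value of the 2018 row of the same Index is 1.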
library(dplyr)
df %>%
group_by(Index) %>%
mutate(Renovation = ifelse(Year == 2012 &
Renovation[match(2018, Year)] == 1, 0, Renovation))
# Index Year Renovation
# <int> <int> <dbl>
#1 1 2012 0
#2 1 2018 1
#3 2 2012 0
#4 2 2018 1
#5 3 2012 0
#6 3 2018 0
data
df <- structure(list(Index = c(1L, 1L, 2L, 2L, 3L, 3L), Year = c(2012L,
2018L, 2012L, 2018L, 2012L, 2018L), Renovation = c(1L, 1L, 1L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -6L))
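A base R alternative can be sketched as follows, assuming each Index has at most one 2018 row. The name renovated_2018 is introduced here for illustration.

```r
df <- data.frame(
  Index = c(1L, 1L, 2L, 2L, 3L, 3L),
  Year = c(2012L, 2018L, 2012L, 2018L, 2012L, 2018L),
  Renovation = c(1L, 1L, 1L, 1L, 0L, 0L)
)

# Index values whose 2018 row has Renovation == 1
renovated_2018 <- df$Index[df$Year == 2018 & df$Renovation == 1]

# Zero out the 2012 rows belonging to those Index values
df$Renovation[df$Year == 2012 & df$Index %in% renovated_2018] <- 0L

df$Renovation  # 0 1 0 1 0 0
```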
I have two lists of data frames. Each list has 6 data frames.
The dataframes have the same columns, but in list1 the dataframes have info from 2015 to 2017 and list2 has info for 2018, like below:
List1$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
List2$A
Name Value Year
AAA 543 2018
BBB 248 2018
I want to merge the dataframes from both lists. So I want in the end just one list of dataframes with all the info for all years.
Some dataframes in list1 already have info for 2018, so when I merge them with the others I want those 2018 values to be replaced.
Newlist$A
Name Value Year
AAA 123 2015
BBB 456 2016
CCC 789 2017
AAA 543 2018
BBB 248 2018
I tried this, but it didn't work:
data<- lapply(list1,list2, function (x,y) merge(x,y))
How can I do this?
It's always helpful to include a sample of data with dput, but here's an attempt, untested because no data were provided:
library(tidyverse)
map2(list1, list2, ~bind_rows(.y, .x) %>% group_by(Name, Year) %>% slice(1))
We bind the rows (with list2 first), then group by Name and Year and take the first occurrence with slice, so that for any repeated Name/Year combination the value from the second data frame (list2) is kept.
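The same idea (put list2 first, then keep the first row per Name/Year) can be sketched in base R with Map and duplicated. The data below follow the question's example, except that the 2018 value for AAA in list1 is changed to a hypothetical 999 so the replacement is visible. newlist follows the question's naming.

```r
list1 <- list(A = data.frame(
  Name = c("AAA", "BBB", "CCC", "AAA"),
  Value = c(123, 456, 789, 999),   # 999 is a hypothetical stale 2018 value
  Year = c(2015, 2016, 2017, 2018),
  stringsAsFactors = FALSE
))
list2 <- list(A = data.frame(
  Name = c("AAA", "BBB"),
  Value = c(543, 248),
  Year = c(2018, 2018),
  stringsAsFactors = FALSE
))

newlist <- Map(function(x, y) {
  combined <- rbind(y, x)                            # list2 rows first so they win
  keep <- !duplicated(combined[c("Name", "Year")])   # first row per Name/Year
  merged <- combined[keep, ]
  merged <- merged[order(merged$Year, merged$Name), ]
  rownames(merged) <- NULL
  merged
}, list1, list2)

newlist$A$Value  # 123 456 789 543 248
```

Map pairs the data frames by position, so this assumes both lists hold their data frames in the same order.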
We could first bind everything into one long data frame and then remove the "2018" entries from list 1 wherever list 2 also has an entry.
To do this we could put both lists into one list and rbind them after adding a column that identifies the source list. Together with by/ave, that column later helps to remove the "2018" duplicates that stem from list 1 while keeping those that don't occur in list 2.
The trick for the latter is to use rev(seq_along(x)).
To demonstrate, I have created sample data that probably resembles yours.
# list the lists
L <- list(L1=L1, L2=L2)
# add id column to sublists
L <- lapply(seq(L), function(x)
Map(`[<-`, L[[x]], "list", value=substr(names(L)[x], 2, 2)))
# rbind lists to long data frame
d <- do.call(rbind, unlist(L, recursive=FALSE))
# remove 2018 duplicates of list L1, keep if no 2018 in list L2
do.call(rbind, by(d, d$name, function(y) {
i <- cbind(y, id=ave(y$year, y$year, FUN=function(z) rev(seq_along(z))))
i[!i$id == 2, ]
}))
Result
# name value year list id
# A.A.1 A 998 2015 1 1
# A.A.4 A 456 2016 1 1
# A.A.7 A 312 2017 1 1
# A.A.13 A 478 2018 2 1
# B.A.2 B 1592 2015 1 1
# B.A.5 B 1072 2016 1 1
# B.A.8 B 673 2017 1 1
# B.A.21 B 445 2018 2 1
# C.A.3 C 957 2015 1 1
# C.A.6 C 199 2016 1 1
# C.A.9 C 2165 2017 1 1
# C.A.31 C 342 2018 2 1
# D.B.1 D 877 2015 1 1
# D.B.4 D 876 2016 1 1
# D.B.7 D 482 2017 1 1
# D.B.13 D 1077 2018 2 1
# E.B.2 E 370 2015 1 1
# E.B.5 E 1475 2016 1 1
# E.B.8 E 768 2017 1 1
# E.B.11 E 385 2018 1 1 <- this stems from list 1!
# F.B.3 F 421 2015 1 1
# F.B.6 F 930 2016 1 1
# F.B.9 F 1105 2017 1 1
# F.B.31 F 1836 2018 2 1
Data
L1 <- list(A = structure(list(name = structure(c(1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A", "B", "C"), class = "factor"),
value = c(1371, 565, 363, 633, 404, 106, 1512, 95, 2018,
63, 1305, 2287), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)), B = structure(list(name = structure(c(1L, 2L, 3L, 1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("D", "E", "F"), class = "factor"),
value = c(1389, 279, 133, 636, 284, 2656, 2440, 1320, 307,
1781, 172, 1215), year = c(2015L, 2015L, 2015L, 2016L, 2016L,
2016L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L)), class = "data.frame", row.names = c(NA,
-12L)))
L2 <- list(A = structure(list(name = structure(1:3, .Label = c("A",
"B", "C"), class = "factor"), value = c(1895, 430, 257), year = c(2018,
2018, 2018)), class = "data.frame", row.names = c(NA, -3L)),
B = structure(list(name = structure(c(1L, 3L), .Label = c("D",
"E", "F"), class = "factor"), value = c(1763, 640), year = c(2018,
2018)), row.names = c(1L, 3L), class = "data.frame"))
L2$B <- L2$B[-2, ] # intentionally remove a value
Hi, I am a Stata user and I am trying to port my code to R. I have panel data as shown below, and I am looking for a command that creates a constant variable according to the year and quarter of each row. In Stata this would be done with gen new_variable = yq(year, quarter)
My dataframe looks like this:
id year quarter
1 2007 1
1 2007 2
1 2007 3
1 2007 4
1 2008 1
1 2008 2
1 2008 3
1 2008 4
1 2009 1
1 2009 2
1 2009 3
1 2009 4
2 2007 1
2 2007 2
2 2007 3
2 2007 4
2 2008 1
2 2008 2
2 2008 3
2 2008 4
3 2009 2
3 2009 3
3 2010 2
3 2010 3
My expected output should look like this (the values inside new_variable are arbitrary; I'm just looking for a constant value that is always the same for each year and quarter):
id year quarter new_variable
1 2007 1 220
1 2007 2 221
1 2007 3 222
1 2007 4 223
1 2008 1 224
1 2008 2 225
1 2008 3 226
1 2008 4 227
1 2009 1 228
1 2009 2 229
1 2009 3 230
1 2009 4 231
2 2007 1 220
2 2007 2 221
2 2007 3 222
2 2007 4 223
2 2008 1 224
2 2008 2 225
2 2008 3 226
2 2008 4 227
3 2009 2 229
3 2009 3 230
3 2010 2 233
3 2010 3 234
Any of these will work:
# basic: just concatenate year and quarter
df$new_variable = paste(df$year, df$quarter)
# made for this, has additional options around
# ordering of the categories and including unobserved combos
df$new_variable = interaction(df$year, df$quarter)
# for an integer value, 1 to the number of combos
df$new_variable = as.integer(factor(paste(df$year, df$quarter)))
Here are two options:
library(dplyr) # with dplyr
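df %>% mutate(new_variable = group_indices(., year, quarter))
# in dplyr >= 1.0.0, passing grouping columns to group_indices() is deprecated; an equivalent is:
# df %>% group_by(year, quarter) %>% mutate(new_variable = cur_group_id()) %>% ungroup()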
library(data.table) # with data.table
setDT(df)[, new_variable := .GRP, .(year, quarter)]
Data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), year = c(2007L,
2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 2009L, 2009L,
2009L, 2009L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L,
2008L, 2009L, 2009L, 2010L, 2010L), quarter = c(1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
2L, 3L, 2L, 3L)), .Names = c("id", "year", "quarter"), class = "data.frame", row.names = c(NA,
-24L))
1) yearqtr The yearqtr class in the zoo package does this. yearqtr objects have a type of double with the value year + 0 for Q1, year + 1/4 for Q2, etc. When displayed they are shown in a meaningful way; however, they can still be manipulated as if they were plain numbers, e.g. if yq is a yearqtr variable then yq + 1 is the same quarter in the next year.
library(zoo)
transform(df, new_variable = as.yearqtr(year + (quarter - 1)/4))
1a) or
transform(df, new_variable = as.yearqtr(paste(year, quarter, sep = "-")))
Either of these gives:
id year quarter new_variable
1 1 2007 1 2007 Q1
2 1 2007 2 2007 Q2
3 1 2007 3 2007 Q3
4 1 2007 4 2007 Q4
5 1 2008 1 2008 Q1
... etc ...
2) 220 If you specifically wanted to assign 220 to the first date and have each subsequent quarter increment by 1 then:
transform(df, new_variable = as.numeric(factor(4 * year + quarter)) + 220 - 1)
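For a value that follows the convention of Stata's yq(), which counts quarters elapsed since 1960 Q1, a purely arithmetic version can be sketched. The R function name yq is introduced here for illustration and is not part of any package used above. The rows are a small subset of the question's data.

```r
# Quarters elapsed since 1960 Q1, so yq(1960, 1) is 0
yq <- function(year, quarter) 4L * (year - 1960L) + quarter - 1L

df <- data.frame(year = c(2007L, 2007L, 2009L, 2010L, 2010L),
                 quarter = c(1L, 2L, 4L, 2L, 3L))
df$new_variable <- yq(df$year, df$quarter)

df$new_variable  # 188 189 199 201 202
```

Because the value is computed from the calendar rather than from the observed combinations, an unobserved quarter such as 2010 Q1 still leaves a gap between 2009 Q4 and 2010 Q2. Adding a constant offset of 32 reproduces the 220-based numbering of the question's expected output, including 233 for 2010 Q2.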