df <- structure(list(id = c(1L, 1L, 2L, 3L, 3L, 4L), hire_year = c(2017L,
2017L, 2017L, 2017L, 2016L, 2016L)), class = "data.frame", row.names = c(NA,
-6L))
id hire_year
1 1 2017
2 1 2017
3 2 2017
4 3 2017
5 3 2016
6 4 2016
**Expected output**
id hire_year dummy
1 1 2017 0
2 1 2017 0
3 2 2017 1
4 3 2017 0
5 3 2016 0
6 4 2016 1
How can I create a dummy that equals 1 if an id appears only once, and 0 otherwise?
With tidyverse, we can group by the id, then use the number of observations within an ifelse statement.
library(tidyverse)
df %>%
group_by(id) %>%
mutate(dummy = ifelse(n() == 1, 1, 0))
Or we could add the number of observations per id as a new column, then recode that column based on the condition.
df %>%
  add_count(id, name = "dummy") %>%
  mutate(dummy = ifelse(dummy == 1, 1, 0))
Output
id hire_year dummy
1 1 2017 0
2 1 2017 0
3 2 2017 1
4 3 2017 0
5 3 2016 0
6 4 2016 1
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
structure(list(id = c(1L, 1L, 2L, 3L, 3L, 4L), hire_year = c(2017L,
2017L, 2017L, 2017L, 2016L, 2016L)), class = "data.frame", row.names = c(NA,
-6L)
) %>%
add_count(id, name = 'dummy') %>%
mutate(
dummy = as.integer(dummy == 1)
)
#> id hire_year dummy
#> 1 1 2017 0
#> 2 1 2017 0
#> 3 2 2017 1
#> 4 3 2017 0
#> 5 3 2016 0
#> 6 4 2016 1
Created on 2022-03-04 by the reprex package (v2.0.0)
We can use ave in base R, as shown below:
> transform(df, dummy = +(ave(id, id, FUN = length) == 1))
id hire_year dummy
1 1 2017 0
2 1 2017 0
3 2 2017 1
4 3 2017 0
5 3 2016 0
6 4 2016 1
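Another base R possibility, shown here only as a sketch, avoids counting altogether: an id is unique exactly when it is not flagged by duplicated() scanning from either end. The data frame is rebuilt from the question so the snippet runs on its own; the variable name repeated is just an illustrative choice.

```r
# Same data as in the question
df <- data.frame(
  id        = c(1L, 1L, 2L, 3L, 3L, 4L),
  hire_year = c(2017L, 2017L, 2017L, 2017L, 2016L, 2016L)
)

# TRUE for every row whose id occurs more than once:
# duplicated() flags later copies, fromLast = TRUE flags earlier copies
repeated <- duplicated(df$id) | duplicated(df$id, fromLast = TRUE)

# dummy is 1 when the id appears exactly once, 0 otherwise
df$dummy <- as.integer(!repeated)
df
```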
A data.table solution:
library(data.table)
DT <- structure(list(id = c(1L, 1L, 2L, 3L, 3L, 4L), hire_year = c(2017L,
2017L, 2017L, 2017L, 2016L, 2016L)), class = "data.frame", row.names = c(NA,
-6L))
# Convert into data.table
setDT(DT)
# Count the number of times each "id" shows up
DT[, count := .N, by = .(id)]
# Create a dummy variable that equals 1 if count == 1
DT[, dummy := fifelse(count == 1, 1, 0)]
id hire_year count dummy
<int> <int> <int> <num>
1: 1 2017 2 0
2: 1 2017 2 0
3: 2 2017 1 1
4: 3 2017 2 0
5: 3 2016 2 0
6: 4 2016 1 1
What we have:
companyID year status
1 2010
1 2011
1 2012 2
1 2013
1 2014
2 2007
2 2008
2 2009 2
2 2010
2 2011
2 2012 1
2 2013
For companyID 1: I have an observation with status 2 in year 2012. I want R to set every earlier observation for that company to status 1, and every observation after 2012 to status 2.
For companyID 2: I have an observation with status 2 in year 2009. I want R to set every earlier observation for that company to status 1, then keep status 2 until a status 1 shows up again.
(To sum up: fill in the other value (1) before the first value that is already there (2), then carry each known value forward until there is a change. A change is either a new company or a status that was already recorded in the original data frame.)
The result would look like the following, which is what we want to achieve:
companyID year status
1 2010 1
1 2011 1
1 2012 2
1 2013 2
1 2014 2
2 2007 1
2 2008 1
2 2009 2
2 2010 2
2 2011 2
2 2012 1
2 2013 1
We have a large dataset, so doing this manually is not feasible. Is there a way to code this in R for both companyIDs at once (and hence for all the thousands of observations we have)?
Here is one way:
library(dplyr)
library(tidyr)
df %>%
  group_by(companyID) %>%
  fill(status) %>%
  mutate(status = replace(status, is.na(status),
                          ifelse(na.omit(status)[1] == 1, 2, 1))) %>%
  ungroup()
# companyID year status
# <int> <int> <dbl>
# 1 1 2010 1
# 2 1 2011 1
# 3 1 2012 2
# 4 1 2013 2
# 5 1 2014 2
# 6 2 2007 1
# 7 2 2008 1
# 8 2 2009 2
# 9 2 2010 2
#10 2 2011 2
#11 2 2012 1
#12 2 2013 1
data
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), year = c(2010L, 2011L, 2012L, 2013L, 2014L,
2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L), status = c(NA,
NA, 2L, NA, NA, NA, NA, 2L, NA, NA, 1L, NA)),
class = "data.frame", row.names = c(NA, -12L))
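For readers who prefer to avoid extra packages, the same rule can be sketched in base R. This is only one possible formulation: the helper name fill_status and the new column status_filled are made up for illustration, and the code assumes the status codes are exactly 1 and 2, as in the question.

```r
# Same data as in the question
df <- data.frame(
  companyID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
  year      = c(2010:2014, 2007:2013),
  status    = c(NA, NA, 2L, NA, NA, NA, NA, 2L, NA, NA, 1L, NA)
)

# Hypothetical helper: fill one company's status vector
fill_status <- function(s) {
  known <- which(!is.na(s))
  if (length(known) == 0) return(s)   # no recorded status to anchor on
  # For each row, how many recorded values lie at or before it (0 = none yet)
  last_known <- findInterval(seq_along(s), known)
  after <- last_known > 0
  out <- s
  # Carry the most recent recorded status forward
  out[after] <- s[known[last_known[after]]]
  # Rows before the first recorded status get the other code (1 <-> 2)
  out[!after] <- 3L - s[known[1]]
  out
}

# Apply the helper within each company
df$status_filled <- ave(df$status, df$companyID, FUN = fill_status)
df
```

The rows are assumed to be sorted by year within each company, as they are in the example data.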
I have a dataframe, called dets_per_month, that looks like so...
**Zone month yearcollected total**
1 Jul 2017 183
1 Jul 2015 18
1 Aug 2015 202
1 Aug 2017 202
1 Aug 2017 150
1 Sep 2017 68
2 Apr 2018 65
2 Jun 2018 25
2 Sep 2018 278
I'm trying to input 0's for months where there are no totals in a particular zone. This is the code I tried using to input those 0's
complete(dets_per_month, nesting(zone, month), yearcollected = 2016:2018, fill = list(count = 0))
But the output of this doesn't give me any 0's, instead it adds on columns from my original dataframe.
Can anyone tell me how to get 0's for this?
You could use complete after grouping by Zone and yearcollected. month.abb is a built-in constant holding the abbreviated English month names.
library(dplyr)
df %>%
group_by(Zone, yearcollected) %>%
tidyr::complete(month = month.abb, fill = list(total = 0))
# Zone yearcollected month total
# <int> <int> <chr> <dbl>
# 1 1 2015 Apr 0
# 2 1 2015 Aug 202
# 3 1 2015 Dec 0
# 4 1 2015 Feb 0
# 5 1 2015 Jan 0
# 6 1 2015 Jul 18
# 7 1 2015 Jun 0
# 8 1 2015 Mar 0
# 9 1 2015 May 0
#10 1 2015 Nov 0
# … with 27 more rows
data
df <- structure(list(Zone = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
month = structure(c(3L, 3L, 2L, 2L, 2L, 5L, 1L, 4L, 5L), .Label = c("Apr",
"Aug", "Jul", "Jun", "Sep"), class = "factor"), yearcollected = c(2017L,
2015L, 2015L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L),
total = c(183L, 18L, 202L, 202L, 150L, 68L, 65L, 25L, 278L
)), class = "data.frame", row.names = c(NA, -9L))
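The same completion can be sketched in base R, which may help show what complete is doing: build every Zone/year/month combination, left-join the observed rows onto it, and replace the resulting NA totals with 0. The object names pairs, grid and full are illustrative only, and month is stored as character here rather than as a factor.

```r
# Same data as in the question (month kept as character)
df <- data.frame(
  Zone          = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  month         = c("Jul", "Jul", "Aug", "Aug", "Aug", "Sep", "Apr", "Jun", "Sep"),
  yearcollected = c(2017L, 2015L, 2015L, 2017L, 2017L, 2017L, 2018L, 2018L, 2018L),
  total         = c(183L, 18L, 202L, 202L, 150L, 68L, 65L, 25L, 278L)
)

# Zone/year pairs that actually occur in the data
pairs <- unique(df[c("Zone", "yearcollected")])

# merge() on data frames with no shared columns gives the Cartesian product,
# so every observed pair is crossed with all 12 months
grid <- merge(pairs, data.frame(month = month.abb))

# Left join the observed totals; unmatched months come back as NA
full <- merge(grid, df, all.x = TRUE)
full$total[is.na(full$total)] <- 0L

# Order months by calendar position instead of alphabetically
full$month <- factor(full$month, levels = month.abb)
full <- full[order(full$Zone, full$yearcollected, full$month), ]
head(full, 12)
```

Note that Zone 1 has two August 2017 rows in the original data, and both are kept by the join.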
I've got monthly year over year data in a long format that I'm trying to spread with two columns. The only examples I've seen include a single key.
> dput(df)
structure(list(ID = c("a", "a", "a", "a", "a", "a", "a", "a",
"a", "b", "b", "b", "b", "b", "b", "b", "b", "b"), Year = c(2015L,
2015L, 2015L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L, 2015L,
2015L, 2015L, 2016L, 2016L, 2016L, 2017L, 2017L, 2017L), Month = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), Value = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 6L, 7L, 8L,
9L, 10L, 11L, 12L, 13L, 14L)), .Names = c("ID", "Year", "Month",
"Value"), class = "data.frame", row.names = c(NA, -18L))
I'm trying to get it into a wide format with the years as columns 3:5, and one row per month per ID:
ID Month 2015 2016 2017
a 1 1 2 3
a 2 1 2 3
a 3 1 2 3
b 1 6 9 12
b 2 7 10 13
b 3 8 11 14
I tried the following, which fails with this error:
by_month_over_years = spread(df,key = c(Year,Month), Value)
Error: `var` must evaluate to a single number or a column name, not an integer vector
library(tidyr)
library(dplyr)
df %>% group_by(ID) %>% spread(Year, Value)
# A tibble: 6 x 5
# Groups: ID [2]
ID Month `2015` `2016` `2017`
<chr> <int> <int> <int> <int>
1 a 1 1 2 3
2 a 2 1 2 3
3 a 3 1 2 3
4 b 1 6 9 12
5 b 2 7 10 13
6 b 3 8 11 14
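In tidyr 1.0.0 and later, spread() is superseded by pivot_wider(), so the same reshape might be written as below. This is a sketch assuming a reasonably recent tidyr is installed; the data frame is rebuilt from the question's dput output, and the result name wide is arbitrary.

```r
library(tidyr)

# Same data as in the question
df <- data.frame(
  ID    = rep(c("a", "b"), each = 9),
  Year  = rep(rep(2015:2017, each = 3), times = 2),
  Month = rep(1:3, times = 6),
  Value = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 6:14)
)

# Year supplies the new column names, Value fills the cells;
# the remaining columns (ID and Month) identify each output row
wide <- pivot_wider(df, names_from = Year, values_from = Value)
wide
```

Only Year needs to be a key here, because ID and Month are kept as identifying columns rather than spread.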
library(reshape2) # or data.table, for dcast
# value.var is named explicitly so dcast does not have to guess it
dcast(df, ID + Month ~ Year, value.var = "Value")
# ID Month 2015 2016 2017
# 1 a 1 1 2 3
# 2 a 2 1 2 3
# 3 a 3 1 2 3
# 4 b 1 6 9 12
# 5 b 2 7 10 13
# 6 b 3 8 11 14
Here is a base R option with reshape
reshape(df, idvar = c('ID', 'Month'), direction = 'wide', timevar = 'Year')
# ID Month Value.2015 Value.2016 Value.2017
#1 a 1 1 2 3
#2 a 2 1 2 3
#3 a 3 1 2 3
#10 b 1 6 9 12
#11 b 2 7 10 13
#12 b 3 8 11 14
Hi, I am a Stata user and I am trying to port my code to R. I have panel data as shown below, and I am looking for a command that creates a variable whose value is determined by the year and quarter of each row. In Stata this would be done with gen new_variable = yq(year, quarter)
My data frame looks like this:
id year quarter
1 2007 1
1 2007 2
1 2007 3
1 2007 4
1 2008 1
1 2008 2
1 2008 3
1 2008 4
1 2009 1
1 2009 2
1 2009 3
1 2009 4
2 2007 1
2 2007 2
2 2007 3
2 2007 4
2 2008 1
2 2008 2
2 2008 3
2 2008 4
3 2009 2
3 2009 3
3 2010 2
3 2010 3
My expected output should look like this (the values in new_variable are arbitrary; I just need a value that is always the same for a given year and quarter):
id year quarter new_variable
1 2007 1 220
1 2007 2 221
1 2007 3 222
1 2007 4 223
1 2008 1 224
1 2008 2 225
1 2008 3 226
1 2008 4 227
1 2009 1 228
1 2009 2 229
1 2009 3 230
1 2009 4 231
2 2007 1 220
2 2007 2 221
2 2007 3 222
2 2007 4 223
2 2008 1 224
2 2008 2 225
2 2008 3 226
2 2008 4 227
3 2009 2 229
3 2009 3 230
3 2010 2 233
3 2010 3 234
Any of these will work:
# basic: just concatenate year and quarter
df$new_variable = paste(df$year, df$quarter)
# made for this, has additional options around
# ordering of the categories and including unobserved combos
df$new_variable = interaction(df$year, df$quarter)
# for an integer value, 1 to the number of combos
df$new_variable = as.integer(factor(paste(df$year, df$quarter)))
Here are two options:
library(dplyr) # with dplyr
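# group_indices(., ...) is deprecated as of dplyr 1.0.0; cur_group_id() is its replacement
df %>% group_by(year, quarter) %>% mutate(new_variable = cur_group_id()) %>% ungroup()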
library(data.table) # with data.table
setDT(df)[, new_variable := .GRP, .(year, quarter)]
Data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), year = c(2007L,
2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 2009L, 2009L,
2009L, 2009L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L,
2008L, 2009L, 2009L, 2010L, 2010L), quarter = c(1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
2L, 3L, 2L, 3L)), .Names = c("id", "year", "quarter"), class = "data.frame", row.names = c(NA,
-24L))
1) yearqtr The yearqtr class in the zoo package does this. yearqtr objects have a type of double with the value year + 0 for Q1, year + 1/4 for Q2, etc. They are displayed in a meaningful way, yet they can still be manipulated as if they were plain numbers, e.g. if yq is a yearqtr variable then yq + 1 is the same quarter in the next year.
library(zoo)
transform(df, new_variable = as.yearqtr(year + (quarter - 1)/4))
1a) or
transform(df, new_variable = as.yearqtr(paste(year, quarter, sep = "-")))
Either of these gives:
id year quarter new_variable
1 1 2007 1 2007 Q1
2 1 2007 2 2007 Q2
3 1 2007 3 2007 Q3
4 1 2007 4 2007 Q4
5 1 2008 1 2008 Q1
... etc ...
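2) 220 If you specifically wanted to assign 220 to the earliest quarter in the data and have each subsequent calendar quarter increment by 1 (so that a skipped quarter still advances the counter and 2010 Q2 gets 233, as in the expected output), then:
transform(df, new_variable = 4 * year + quarter - min(4 * year + quarter) + 220)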
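If the goal is to reproduce the numbers Stata itself produces, a small helper can mimic yq(), which counts quarters elapsed since 1960 Q1 (so 1960 Q1 is 0). The function name yq_index and its origin_year argument are hypothetical names chosen for this sketch, not part of any package.

```r
# Same data as in the question
df <- data.frame(
  id      = c(rep(1L, 12), rep(2L, 8), rep(3L, 4)),
  year    = c(rep(2007:2009, each = 4), rep(2007:2008, each = 4),
              2009L, 2009L, 2010L, 2010L),
  quarter = c(rep(1:4, 3), rep(1:4, 2), 2L, 3L, 2L, 3L)
)

# Hypothetical helper: quarters elapsed since Q1 of origin_year
yq_index <- function(year, quarter, origin_year = 1960L) {
  4L * (year - origin_year) + (quarter - 1L)
}

df$new_variable <- yq_index(df$year, df$quarter)
head(df)
```

Because the index is pure arithmetic on year and quarter, it does not depend on which quarters happen to be present in the data.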