Create a constant variable linked to year and quarter - R

Hi, I am a Stata user and I am trying to port my code to R. I have panel data as shown below, and I am looking for a command that creates a variable whose value is constant for each year and quarter combination. In Stata this would be done with gen new_variable = yq(year, quarter)
My data frame looks like this:
id year quarter
1 2007 1
1 2007 2
1 2007 3
1 2007 4
1 2008 1
1 2008 2
1 2008 3
1 2008 4
1 2009 1
1 2009 2
1 2009 3
1 2009 4
2 2007 1
2 2007 2
2 2007 3
2 2007 4
2 2008 1
2 2008 2
2 2008 3
2 2008 4
3 2009 2
3 2009 3
3 2010 2
3 2010 3
My expected output should look like this. (The values inside new_variable are arbitrary; I just need a value that is always the same for a given year and quarter.)
id year quarter new_variable
1 2007 1 220
1 2007 2 221
1 2007 3 222
1 2007 4 223
1 2008 1 224
1 2008 2 225
1 2008 3 226
1 2008 4 227
1 2009 1 228
1 2009 2 229
1 2009 3 230
1 2009 4 231
2 2007 1 220
2 2007 2 221
2 2007 3 222
2 2007 4 223
2 2008 1 224
2 2008 2 225
2 2008 3 226
2 2008 4 227
3 2009 2 229
3 2009 3 230
3 2010 2 233
3 2010 3 234

Any of these will work:
# basic: just concatenate year and quarter
df$new_variable = paste(df$year, df$quarter)
# made for this, has additional options around
# ordering of the categories and including unobserved combos
df$new_variable = interaction(df$year, df$quarter)
# for an integer value, 1 to the number of combos
df$new_variable = as.integer(factor(paste(df$year, df$quarter)))

Here are two options:
library(dplyr) # with dplyr; group_indices(., year, quarter) is deprecated since dplyr 1.0.0
df %>% group_by(year, quarter) %>% mutate(new_variable = cur_group_id()) %>% ungroup()
library(data.table) # with data.table
setDT(df)[, new_variable := .GRP, .(year, quarter)]
Data
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), year = c(2007L,
2007L, 2007L, 2007L, 2008L, 2008L, 2008L, 2008L, 2009L, 2009L,
2009L, 2009L, 2007L, 2007L, 2007L, 2007L, 2008L, 2008L, 2008L,
2008L, 2009L, 2009L, 2010L, 2010L), quarter = c(1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
2L, 3L, 2L, 3L)), .Names = c("id", "year", "quarter"), class = "data.frame", row.names = c(NA,
-24L))

1) yearqtr The yearqtr class in the zoo package does this. yearqtr objects are stored as doubles with the value year + 0 for Q1, year + 1/4 for Q2, etc. They are displayed in a meaningful way but can still be manipulated as if they were plain numbers, e.g. if yq is a yearqtr variable then yq + 1 is the same quarter in the next year.
library(zoo)
transform(df, new_variable = as.yearqtr(year + (quarter - 1)/4))
1a) or
transform(df, new_variable = as.yearqtr(paste(year, quarter, sep = "-")))
Either of these gives:
id year quarter new_variable
1 1 2007 1 2007 Q1
2 1 2007 2 2007 Q2
3 1 2007 3 2007 Q3
4 1 2007 4 2007 Q4
5 1 2008 1 2008 Q1
... etc ...
2) 220 If you specifically wanted to assign 220 to the first date and have each subsequent quarter increment by 1 then:
transform(df, new_variable = as.numeric(factor(4 * year + quarter)) + 220 - 1)
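One caveat with the factor-based version: it numbers only the quarters that actually occur in the data, so a missing quarter (2010 Q1 in the question's data) does not leave a gap. If the index should advance by one per calendar quarter regardless of what is observed, plain arithmetic is enough. Below is a minimal base-R sketch; the helper names yq_stata and yq_anchored and the small panel data frame are made up for illustration. yq_stata assumes Stata's convention that quarterly dates count quarters since 1960 Q1 (with 1960 Q1 = 0).

```r
# Hypothetical helpers, base R only.

# Stata-style quarterly index: quarters elapsed since 1960 Q1 (1960 Q1 = 0).
yq_stata <- function(year, quarter) (year - 1960L) * 4L + (quarter - 1L)

# Index anchored so that a chosen base quarter maps to a chosen value
# (defaults: 2007 Q1 -> 220, as in the question's expected output).
yq_anchored <- function(year, quarter, base_year = 2007L, base_value = 220L) {
  (year - base_year) * 4L + (quarter - 1L) + base_value
}

# A few rows in the shape of the question's data (2010 Q1 is absent).
panel <- data.frame(
  id      = c(1L, 1L, 3L, 3L),
  year    = c(2007L, 2009L, 2010L, 2010L),
  quarter = c(1L, 4L, 2L, 3L)
)

panel$new_variable <- yq_anchored(panel$year, panel$quarter)
panel$stata_yq     <- yq_stata(panel$year, panel$quarter)
print(panel)
```

With these inputs new_variable comes out as 220, 231, 233, 234, which keeps the gap for the unobserved 2010 Q1.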

Related

Re-label Year ID data nested in person to time point (enumerate time point)

I have longitudinal data where respondents were recruited as cohorts. Right now I have the year in which each person took the survey, but I want to create a new column that simply counts whether it is the first, second, or third time that person took the survey.
Original Table
PersonID SurveyYear SurveyQ1Rating SurveyQ2Rating Gender
12 2013 5 4 f
12 2012 4 4 f
12 2010 3 3 f
2 2007 4 4 m
2 2008 3 3 m
2 2009 3 5 m
2 2010 5 5 m
2 2013 2 2 m
5 2013 4 4 f
5 2014 5 5 f
Target Table (where I created a new column, SurveyTime, to mark the ith time a person took the survey)
PersonID SurveyYear SurveyTime SurveyQ1Rating SurveyQ2Rating Gender
12 2013 3 5 4 f
12 2012 2 4 4 f
12 2010 1 3 3 f
2 2007 1 4 4 m
2 2008 2 3 3 m
2 2009 3 3 5 m
2 2010 4 5 5 m
2 2013 5 2 2 m
5 2013 1 4 4 f
5 2014 2 5 5 f
A base solution:
df |>
transform(SurveyTime = ave(SurveyYear, PersonID, FUN = rank))
Its dplyr equivalent:
library(dplyr)
df %>%
group_by(PersonID) %>%
mutate(SurveyTime = dense_rank(SurveyYear)) %>%
ungroup()
Data
df <- structure(list(PersonID = c(12L, 12L, 12L, 2L, 2L, 2L, 2L, 2L,
5L, 5L), SurveyYear = c(2013L, 2012L, 2010L, 2007L, 2008L, 2009L,
2010L, 2013L, 2013L, 2014L), SurveyQ1Rating = c(5L, 4L, 3L, 4L,
3L, 3L, 5L, 2L, 4L, 5L), SurveyQ2Rating = c(4L, 4L, 3L, 4L, 3L,
5L, 5L, 2L, 4L, 5L), Gender = c("f", "f", "f", "m", "m", "m",
"m", "m", "f", "f")), class = "data.frame", row.names = c(NA, -10L))
Using data.table
library(data.table)
setDT(df)[, SurveyTime := frank(SurveyYear), PersonID]
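The three answers agree on the sample data, but they can differ if a person has two rows for the same year: rank() and frank() default to average ranks for ties, while dense_rank() returns consecutive integers. If ties are possible and you want the dense behaviour without extra packages, something like the following sketch may help. The helper dense_rank_base and the small survey data frame are made up for illustration.

```r
# Hypothetical base-R dense rank: position of each value among the sorted unique values.
dense_rank_base <- function(x) match(x, sort(unique(x)))

# Toy data; person 2 has two rows for 2007 to show how ties behave.
survey <- data.frame(
  PersonID   = c(12L, 12L, 12L, 2L, 2L, 2L),
  SurveyYear = c(2013L, 2012L, 2010L, 2007L, 2007L, 2009L)
)

# Dense rank within person: tied years share a number and no number is skipped.
survey$SurveyTime <- ave(survey$SurveyYear, survey$PersonID, FUN = dense_rank_base)

# For comparison: rank() with its default ties.method = "average".
survey$AvgRank <- ave(survey$SurveyYear, survey$PersonID, FUN = rank)

print(survey)
```

Here SurveyTime is 3, 2, 1, 1, 1, 2, whereas AvgRank is 3, 2, 1, 1.5, 1.5, 3.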

In R: How to tell R that a value should be inserted into a categorical column while applying two conditions

What we have:
companyID year status
1 2010
1 2011
1 2012 2
1 2013
1 2014
2 2007
2 2008
2 2009 2
2 2010
2 2011
2 2012 1
2 2013
For companyID 1: there is an observation with status 2 in year 2012. I want R to set every observation before it to status 1 (by companyID), and every observation after it to status 2 (still per company).
For companyID 2: there is an observation with status 2 in year 2009. I want R to set every observation before it to status 1 (by companyID), and then keep status 2 until a status 1 shows up again (still per company).
(Summing up: fill in the other value (1) before the first value that is already there (2), then continue with 2 until there is another change, where a change is either a new company or a status change already recorded in the original data frame.)
This would then look like the following, which is what we want to achieve:
companyID year status
1 2010 1
1 2011 1
1 2012 2
1 2013 2
1 2014 2
2 2007 1
2 2008 1
2 2009 2
2 2010 2
2 2011 2
2 2012 1
2 2013 1
We have a large dataset, which is why doing this manually is not feasible. Is there a way to code this for both companyIDs simultaneously (and hence for all the thousands of observations we have) in R?
Here is one way:
library(dplyr)
library(tidyr)
df %>%
  group_by(companyID) %>%
  fill(status) %>%
  mutate(status = replace(status, is.na(status),
                          ifelse(na.omit(status)[1] == 1, 2, 1))) %>%
  ungroup()
# companyID year status
# <int> <int> <dbl>
# 1 1 2010 1
# 2 1 2011 1
# 3 1 2012 2
# 4 1 2013 2
# 5 1 2014 2
# 6 2 2007 1
# 7 2 2008 1
# 8 2 2009 2
# 9 2 2010 2
#10 2 2011 2
#11 2 2012 1
#12 2 2013 1
data
df <- structure(list(companyID = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L), year = c(2010L, 2011L, 2012L, 2013L, 2014L,
2007L, 2008L, 2009L, 2010L, 2011L, 2012L, 2013L), status = c(NA,
NA, 2L, NA, NA, NA, NA, 2L, NA, NA, 1L, NA)),
class = "data.frame", row.names = c(NA, -12L))
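For anyone who wants to see the logic spelled out without dplyr/tidyr, here is a rough base-R sketch of the same idea: carry the last known status forward, then give the leading gap the opposite of the first known status. The helper fill_status and the firms data frame are hypothetical names, and the sketch assumes the rows are already sorted by year within each company and that status only takes the values 1 and 2.

```r
# Hypothetical helper: fill one company's status vector.
fill_status <- function(s) {
  known <- which(!is.na(s))
  if (length(known) == 0L) return(s)        # nothing to fill from
  idx <- cumsum(!is.na(s))                  # 0 before the first known value
  filled <- c(NA, s[known])[idx + 1L]       # last observation carried forward
  # Leading gap gets the "other" status relative to the first known value.
  filled[idx == 0L] <- if (s[known[1]] == 1) 2 else 1
  filled
}

firms <- data.frame(
  companyID = c(rep(1L, 5), rep(2L, 7)),
  year      = c(2010:2014, 2007:2013),
  status    = c(NA, NA, 2L, NA, NA, NA, NA, 2L, NA, NA, 1L, NA)
)

# Apply the helper separately within each company.
firms$status <- ave(firms$status, firms$companyID, FUN = fill_status)
print(firms)
```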

Changing value of row based on condition in other condition

My dataframe looks like this:
Index Year Renovation
1 2012 1
1 2018 1
2 2012 1
2 2018 1
3 2012 0
3 2018 0
I would like to change the Renovation variable for 2012 to 0 if the Renovation variable for 2018 is 1 for the same Index. So I am facing a double condition here. How can I do this in R?
You can use ifelse to check the condition within each Index group.
library(dplyr)
df %>%
group_by(Index) %>%
mutate(Renovation = ifelse(Year == 2012 &
Renovation[match(2018, Year)] == 1, 0, Renovation))
# Index Year Renovation
# <int> <int> <dbl>
#1 1 2012 0
#2 1 2018 1
#3 2 2012 0
#4 2 2018 1
#5 3 2012 0
#6 3 2018 0
data
df <- structure(list(Index = c(1L, 1L, 2L, 2L, 3L, 3L), Year = c(2012L,
2018L, 2012L, 2018L, 2012L, 2018L), Renovation = c(1L, 1L, 1L,
1L, 0L, 0L)), class = "data.frame", row.names = c(NA, -6L))
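The same two-part condition can also be written in base R by first collecting the Index values that were renovated in 2018 and then overwriting their 2012 rows. This is only a sketch; reno and renovated_2018 are made-up names, and the data is a copy of the example above.

```r
reno <- data.frame(
  Index      = c(1L, 1L, 2L, 2L, 3L, 3L),
  Year       = c(2012L, 2018L, 2012L, 2018L, 2012L, 2018L),
  Renovation = c(1L, 1L, 1L, 1L, 0L, 0L)
)

# Step 1: which Index values have Renovation == 1 in 2018?
renovated_2018 <- reno$Index[reno$Year == 2018 & reno$Renovation == 1]

# Step 2: set the 2012 rows of exactly those Index values to 0.
reno$Renovation[reno$Year == 2012 & reno$Index %in% renovated_2018] <- 0L

print(reno)
```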

how to sum based on three different condition in R

The following is my data.
gcode code year P Q
1 101 2000 1 3
1 101 2001 2 4
1 102 2000 1 1
1 102 2001 4 5
1 102 2002 2 6
1 102 2003 6 5
1 103 1999 6 1
1 103 2000 4 2
1 103 2001 2 1
2 104 2000 1 3
2 104 2001 2 4
2 105 2001 4 5
2 105 2002 2 6
2 105 2003 6 5
2 105 2004 6 1
2 106 2000 4 2
2 106 2001 2 1
gcode 1 has three different codes: 101, 102 and 103. The years they all have in common are 2000 and 2001. I want to sum P and Q over those common years and drop the remaining rows. I want to do the same for gcode 2 as well.
How can I get the result like this?
gcode year P Q
1 2000 1+1+4 3+1+2
1 2001 2+4+2 4+5+1
2 2001 2+4+2 4+5+1
We can split the data by gcode, keep only the years that are present for every code within that gcode, and then aggregate P and Q by gcode and year.
do.call(rbind, lapply(split(df, df$gcode), function(x) {
aggregate(cbind(P, Q)~gcode+year,
subset(x, year %in% Reduce(intersect, split(x$year, x$code))), sum)
}))
# gcode year P Q
#1.1 1 2000 6 6
#1.2 1 2001 8 10
#2 2 2001 8 10
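To see what the Reduce(intersect, ...) step does on its own, here is a small sketch restricted to the gcode 1 rows. The object names g1, years_by_code and common_years are made up for illustration.

```r
# code/year pairs for gcode 1, copied from the question's data
g1 <- data.frame(
  code = c(101L, 101L, 102L, 102L, 102L, 102L, 103L, 103L, 103L),
  year = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L, 1999L, 2000L, 2001L)
)

# One vector of years per code
years_by_code <- split(g1$year, g1$code)

# Fold intersect() over the list: years present for every code
common_years <- Reduce(intersect, years_by_code)
print(common_years)

# Rows that survive the subset() step in the answer
print(g1[g1$year %in% common_years, ])
```

Only 2000 and 2001 remain, which is why the 1999, 2002 and 2003 rows are dropped before aggregating.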
Using dplyr with similar logic we can do
library(dplyr)
df %>%
group_split(gcode) %>%
purrr::map_df(. %>%
group_by(year) %>%
filter(n_distinct(code) == n_distinct(.$code)) %>%
group_by(gcode, year) %>%
summarise_at(vars(P:Q), sum))
data
df <- structure(list(gcode = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), code = c(101L, 101L, 102L, 102L,
102L, 102L, 103L, 103L, 103L, 104L, 104L, 105L, 105L, 105L, 105L,
106L, 106L), year = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L,
1999L, 2000L, 2001L, 2000L, 2001L, 2001L, 2002L, 2003L, 2004L,
2000L, 2001L), P = c(1L, 2L, 1L, 4L, 2L, 6L, 6L, 4L, 2L, 1L,
2L, 4L, 2L, 6L, 6L, 4L, 2L), Q = c(3L, 4L, 1L, 5L, 6L, 5L, 1L,
2L, 1L, 3L, 4L, 5L, 6L, 5L, 1L, 2L, 1L)), class = "data.frame",
row.names = c(NA, -17L))
An option using data.table package:
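years <- DT[, {
  m <- min(year)
  # shift by 1 so the earliest year lands in bin 1 (tabulate() ignores values below 1)
  ty <- tabulate(year - m + 1L)
  .(year = which(ty == uniqueN(code)) + m - 1L)
}, gcode]
DT[years, on = .(gcode, year),
   by = .EACHI, .(P = sum(P), Q = sum(Q))]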
output:
gcode year P Q
1: 1 2000 6 6
2: 1 2001 8 10
3: 2 2001 8 10
data:
library(data.table)
DT <- fread("gcode code year P Q
1 101 2000 1 3
1 101 2001 2 4
1 102 2000 1 1
1 102 2001 4 5
1 102 2002 2 6
1 102 2003 6 5
1 103 1999 6 1
1 103 2000 4 2
1 103 2001 2 1
2 104 2000 1 3
2 104 2001 2 4
2 105 2001 4 5
2 105 2002 2 6
2 105 2003 6 5
2 105 2004 6 1
2 106 2000 4 2
2 106 2001 2 1")
I came up with the following solution. First, I counted how many times each year appears for each gcode. I also counted how many unique codes exist for each gcode. Then I joined the two results using left_join() and kept the rows that have the same value in n_year and n_code. Next, I joined the original data frame, which is called mydf. Finally, I defined groups by gcode and year and summed P and Q for each group.
library(dplyr)
left_join(count(mydf, gcode, year, name = "n_year"),
group_by(mydf, gcode) %>% summarize(n_code = n_distinct(code))) %>%
filter(n_year == n_code) %>%
left_join(mydf, by = c("gcode", "year")) %>%
group_by(gcode, year) %>%
summarize_at(vars(P:Q),
.funs = list(~sum(.)))
# gcode year P Q
# <int> <int> <int> <int>
#1 1 2000 6 6
#2 1 2001 8 10
#3 2 2001 8 10
Another idea
I was reviewing this question later and came up with the following idea, which is much simpler. First, I defined groups by gcode and year. For each group, I counted how many data points exist using add_count(). Then I defined groups again with gcode only. For each gcode group, I kept the rows that meet n == n_distinct(code), where n is the column created by add_count(). If the number in n matches the number returned by n_distinct(), the year in that row is present for every code. Finally, I defined groups by gcode and year again and summed the values in P and Q.
group_by(mydf, gcode, year) %>%
add_count() %>%
group_by(gcode) %>%
filter(n == n_distinct(code)) %>%
group_by(gcode, year) %>%
summarize_at(vars(P:Q),
.funs = list(~sum(.)))
# This is the same code in data.table.
setDT(mydf)[, check := .N, by = .(gcode, year)][,
.SD[check == uniqueN(code)], by = gcode][,
lapply(.SD, sum), .SDcols = P:Q, by = .(gcode, year)][]

Grouping data and then assigning values to variable names stored in strings - R

I am trying to migrate this activity from excel/SQL to R and I am stuck - any help is very much appreciated. Thanks !
Format of Data:
There are unique customer ids. Each customer has purchases in different groups in different years.
Objective:
For each customer id - get one row of output. Use variable names stored in column and create columns - for each column assign sum of amount. Create a similar column and assign as 1 or 0 depending on presence or absence of revenue.
SOURCE:
Cust_ID Group Year Variable_Name Amount
1 1 A 2009 A_2009 2000
2 1 B 2009 B_2009 100
3 2 B 2009 B_2009 300
4 2 C 2009 C_2009 20
5 3 D 2009 D_2009 299090
6 3 A 2011 A_2011 89778456
7 1 B 2011 B_2011 884
8 1 C 2010 C_2010 34894
9 3 D 2010 D_2010 389849
10 2 A 2013 A_2013 742
11 1 B 2013 B_2013 25661
12 2 C 2007 C_2007 393
13 3 D 2007 D_2007 23
OUTPUT:
Cust_ID A_2009 B_2009 C_2009 D_2009 A_2011 …. A_2009_P B_2009_P
1 sum of amount .. 1 0 ….
2
3
dput of original data:
dat <- structure(list(Cust_ID = c(1L, 1L, 2L, 2L, 3L, 3L, 1L, 1L, 3L,
2L, 1L, 2L, 3L), Group = c("A", "B", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D"), Year = c(2009L, 2009L, 2009L,
2009L, 2009L, 2011L, 2011L, 2010L, 2010L, 2013L, 2013L, 2007L,
2007L), Variable_Name = c("A_2009", "B_2009", "B_2009", "C_2009",
"D_2009", "A_2011", "B_2011", "C_2010", "D_2010", "A_2013", "B_2013",
"C_2007", "D_2007"), Amount = c(2000L, 100L, 300L, 20L, 299090L,
89778456L, 884L, 34894L, 389849L, 742L, 25661L, 393L, 23L)), .Names = c("Cust_ID",
"Group", "Year", "Variable_Name", "Amount"), class = "data.frame", row.names = c(NA,
-13L))
One option:
intm <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name,data=dat))
result <- data.frame(aggregate(Amount~Cust_ID, data=dat,sum),intm,(intm > 0)+0 )
Result (abridged):
Cust_ID Amount A_2009 A_2011 ... A_2009.1 A_2011.1
1 1 63539 2000 0 ... 1 0
2 2 1455 0 0 ... 0 0
3 3 90467418 0 89778456 ... 0 1
If the names are a concern, they can easily be fixed via:
names(result) <- gsub("\\.1$","_P",names(result))
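If you would rather name the indicator columns up front instead of renaming them afterwards, a variation along these lines might work. It is only a sketch on a cut-down toy data frame; toy, amounts, present and wide are made-up names, and the _P suffix follows the question's desired output.

```r
# Toy data in the same shape as the question's (made-up amounts).
toy <- data.frame(
  Cust_ID       = c(1L, 1L, 2L, 3L),
  Variable_Name = c("A_2009", "B_2009", "B_2009", "A_2011"),
  Amount        = c(2000L, 100L, 300L, 50L)
)

# Sum of Amount per customer (rows) and variable name (columns).
amounts <- as.data.frame.matrix(xtabs(Amount ~ Cust_ID + Variable_Name, data = toy))

# 1/0 presence flags, named with a _P suffix before combining.
present <- (amounts > 0) + 0L
colnames(present) <- paste0(colnames(amounts), "_P")

wide <- data.frame(Cust_ID = as.integer(rownames(amounts)),
                   amounts, present, row.names = NULL)
print(wide)
```

Naming the flag columns before calling data.frame() avoids relying on the automatic .1 suffix that R adds to duplicated column names.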
