I have a dataset where individuals are observed in different years (e.g., individual 1 is observed in 2012 and 2014, while individuals 2 and 3 are only observed in 2016). I would like to expand the data so that each individual has 3 rows (2012, 2014 and 2016), in order to create a panel dataset with an indicator for whether the individual is observed in that year.
My initial dataset is:
year  individual_id  rank
2012  1              11
2014  1              16
2016  2              76
2016  3              125
And I would like to get something like this:
year  individual_id  rank  present
2012  1              11    1
2014  1              16    1
2016  1              .     0
2012  2              .     0
2014  2              .     0
2016  2              76    1
2012  3              .     0
2014  3              .     0
2016  3              125   1
So far I have tried to play with "expand":
bys individual_id: egen count=count(year)
replace count=3-count+1
bys individual_id: replace count=. if _n>1
expand count
which gives me 3 rows per individual. Unfortunately, this just duplicates one of the initial rows, and I am unable to get from there to the desired dataset.
Thanks in advance for your help!
You can use expand.grid to create a data frame of all combinations of your inputs. Then full join the two tables and add a condition to determine whether the individual was present that year.
library(dplyr)
dt = data.frame(
  year = c(2012,2014,2016,2016),
  individual_id = c(1,1,2,3),
  rank = c(11,16,76,125)
)
exp = expand.grid(year = c(2012,2014,2016), individual_id = c(1:3))
dt %>%
  full_join(exp, by = c("year","individual_id")) %>%
  mutate(present = ifelse(!is.na(rank), 1, 0)) %>%
  arrange(individual_id, year)
year individual_id rank present
1 2012 1 11 1
2 2014 1 16 1
3 2016 1 NA 0
4 2012 2 NA 0
5 2014 2 NA 0
6 2016 2 76 1
7 2012 3 NA 0
8 2014 3 NA 0
9 2016 3 125 1
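As an alternative sketch, tidyr's complete() can build the same grid without a separate expand.grid and join. This assumes tidyr is installed alongside dplyr. Setting present to 1 before completing means the indicator does not rely on rank being non-missing in the observed rows, which may or may not matter for your data.

```r
library(dplyr)
library(tidyr)

dt <- data.frame(
  year = c(2012, 2014, 2016, 2016),
  individual_id = c(1, 1, 2, 3),
  rank = c(11, 16, 76, 125)
)

panel <- dt %>%
  mutate(present = 1) %>%                  # every original row is an observed one
  complete(individual_id, year,            # add the missing id/year combinations
           fill = list(present = 0)) %>%   # new rows get present = 0, rank stays NA
  arrange(individual_id, year)

panel
```

The result should have one row per individual and year, with rank left as NA wherever the individual was not observed.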
I have a df like this in R
year id value time
2012 1 180 1
2012 1 149 1
2010 2 131 0
2010 2 120 0
2010 2 120 0
2010 2 16 0
2010 2 120 0
2012 2 50 1
I want to create a dummy variable that is 1 if the id appears in both 2010 and 2012 in the year column, like this:
year id value time both
2012 1 180 1 0
2012 1 149 1 0
2010 2 131 0 1
2010 2 120 0 1
2010 2 120 0 1
2010 2 16 0 1
2010 2 120 0 1
2012 2 50 1 1
The following code first splits the ids by year into a list, then finds the ids that occur in every available year. Each value of the id column is then checked against those ids, and the result is saved as 0/1 in a separate column named both_resp (the column called both in the question):
df <- ... your dataframe ...
idsPerYear <- split(df$id, df$year)
idsInAllYears <- Reduce(intersect, idsPerYear)
df$both_resp <- as.numeric( df$id %in% idsInAllYears )
or an alternative with hardcoded values:
df$both_resp <- as.numeric(
  df$id %in% intersect(
    df[ df$year == 2010, "id"],
    df[ df$year == 2012, "id"]
  )
)
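For comparison, here is a minimal dplyr sketch of the same idea, assuming the data frame from the question and that "both" means the id has at least one row in 2010 and at least one in 2012. The column name both follows the question; nothing else here comes from the original answer.

```r
library(dplyr)

df <- data.frame(
  year  = c(2012, 2012, 2010, 2010, 2010, 2010, 2010, 2012),
  id    = c(1, 1, 2, 2, 2, 2, 2, 2),
  value = c(180, 149, 131, 120, 120, 16, 120, 50),
  time  = c(1, 1, 0, 0, 0, 0, 0, 1)
)

df <- df %>%
  group_by(id) %>%
  # TRUE only if both target years occur among this id's rows
  mutate(both = as.integer(all(c(2010, 2012) %in% year))) %>%
  ungroup()

df
```

The grouped check is evaluated once per id and recycled to all of that id's rows, so no separate lookup vector is needed.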
So I have a dataframe structured like this:
df <- data.frame("id" = c(rep("a",4),rep("b",4)),
"Year" = c(2020,2019,2018,2017,
2020,2019,2018,2017),
"value" = c(30,20,0,0,
70,50,30,0))
> df
id Year value
1 a 2020 30
2 a 2019 20
3 a 2018 0
4 a 2017 0
5 b 2020 70
6 b 2019 50
7 b 2018 30
8 b 2017 0
What I want to do is create a new column which has the same values as the value column, except wherever there is a 0 value it looks at the closest year with a non-zero value and applies that value to all 0 rows by each id. So the output should be:
> df
id Year value newoutput
1 a 2020 30 30
2 a 2019 20 20
3 a 2018 0 20
4 a 2017 0 20
5 b 2020 70 70
6 b 2019 50 50
7 b 2018 30 30
8 b 2017 0 30
So for id a, the years 2018 and 2017 both have 0 values and need to be amended. The closest year with a non-zero value is 2019, so we take that year's value, 20, and apply it to both 2018 and 2017. Similarly for id b.
Any ideas on how to do this using dplyr?
A possible solution, based on dplyr and cummax:
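library(dplyr)
df %>%
  group_by(id) %>%
  # lag() gives the previous row's value; it only counts where the current value is 0,
  # and cummax() carries it forward over the remaining zero rows
  mutate(newoutput = value + cummax((value == 0) * lag(value, default = 0))) %>%
  ungroup()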
#> # A tibble: 8 × 4
#> id Year value newoutput
#> <chr> <dbl> <dbl> <dbl>
#> 1 a 2020 30 30
#> 2 a 2019 20 20
#> 3 a 2018 0 20
#> 4 a 2017 0 20
#> 5 b 2020 70 70
#> 6 b 2019 50 50
#> 7 b 2018 30 30
#> 8 b 2017 0 30
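The cummax trick relies on the zeros sitting at the end of each group, as they do in the example. If zeros can also appear between non-zero values, a sketch along the following lines may be more robust. It assumes tidyr is available and that "closest year" means the nearest later year, which is how the expected output reads.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  id    = rep(c("a", "b"), each = 4),
  Year  = rep(2020:2017, times = 2),
  value = c(30, 20, 0, 0, 70, 50, 30, 0)
)

res <- df %>%
  group_by(id) %>%
  arrange(desc(Year), .by_group = TRUE) %>%    # latest year first within each id
  mutate(newoutput = na_if(value, 0)) %>%      # treat zeros as missing
  fill(newoutput, .direction = "down") %>%     # carry the last non-missing value down
  ungroup()

res
```

A group whose most recent year is itself 0 would keep NA in newoutput with this approach; whether that is the desired behaviour is a judgment call.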
My dataset is like this
df
ID Year County APV sample
1 2014 A 1 1
1 2015 A 1 1
1 2016 A 0 0
1 2017 A NA 0
1 2018 A NA 0
1 2019 A NA 0
2 2014 B 1 1
2 2015 B 1 1
2 2016 B 1 1
2 2017 B 1 1
2 2018 B 0 0
2 2019 B NA 0
3 2014 A 1 1
3 2015 A 1 1
3 2016 A 0 0
3 2017 A NA 0
3 2018 A NA 0
3 2019 A NA 0
And so on
So I want to tabulate this data.
If I only want to tabulate by year
datos<-as.data.frame(table(df$APV==0 & df$sample==0, by=df$Year))
the data set that I obtain looks like this:
df1
Var1 by Freq
FALSE 2014 3
TRUE 2014 0
FALSE 2015 3
TRUE 2015 0
FALSE 2016 1
TRUE 2016 2
. . .
. . .
. . .
Here FALSE means the firm is still open.
How can I tabulate by year and County?
APV marks the first closure of the enterprise (the 0), so I want to know how many enterprises closed in each year and county.
There are two approaches, table and xtabs, plus a do.call variant for when the columns are chosen dynamically.
I added !is.na(APV) for two reasons: (1) it wasn't clear to me what you expected to happen there; and (2) table was actually more robust to NA than xtabs, so I wanted the two results to be the same. The premise of the two approaches is the same, but they do appear to handle NAs differently.
table
You might just need to know that table takes an arbitrary number of arguments, so
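df$Var1 <- with(df, !is.na(APV) & APV == 0 & sample == 0)  # closure flag, same condition as in the xtabs example
head(as.data.frame(table(df$Var1, df$Year, df$County)))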
# Var1 Var2 Var3 Freq
# 1 FALSE 2014 A 2
# 2 TRUE 2014 A 0
# 3 FALSE 2015 A 2
# 4 TRUE 2015 A 0
# 5 FALSE 2016 A 0
# 6 TRUE 2016 A 2
While the names are lost, it still works.
xtabs
out <- as.data.frame(
xtabs(~ Var1 + Year + County,
data = transform(df, Var1 = (!is.na(APV) & APV == 0 & sample == 0)))
)
head(out)
# Var1 Year County Freq
# 1 FALSE 2014 A 2
# 2 TRUE 2014 A 0
# 3 FALSE 2015 A 2
# 4 TRUE 2015 A 0
# 5 FALSE 2016 A 0
# 6 TRUE 2016 A 2
(I used transform for simplicity.)
do.call for dynamic columns
out2 <- as.data.frame(
do.call(table, subset(transform(df, Var1 = (!is.na(APV) & APV == 0 & sample == 0)),
select = c(Var1, Year, County)))
)
(same results)
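If only the number of closures per year and county is needed, rather than the full TRUE/FALSE cross-tabulation, a dplyr sketch like the one below may be enough. It rebuilds the sample rows shown in the question and uses the same closure condition as above; the output column name closed is my own choice, not something from the question.

```r
library(dplyr)

df <- data.frame(
  ID     = rep(1:3, each = 6),
  Year   = rep(2014:2019, times = 3),
  County = rep(c("A", "B", "A"), each = 6),
  APV    = c(1, 1, 0, NA, NA, NA,  1, 1, 1, 1, 0, NA,  1, 1, 0, NA, NA, NA),
  sample = c(1, 1, 0, 0, 0, 0,     1, 1, 1, 1, 0, 0,   1, 1, 0, 0, 0, 0)
)

closures <- df %>%
  filter(!is.na(APV), APV == 0, sample == 0) %>%  # keep only the closure rows
  count(Year, County, name = "closed")            # one row per year/county with closures

closures
```

Unlike table and xtabs, count only returns combinations that actually occur, so year/county pairs with no closures are simply absent.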
I am trying to create a new table from an existing one.
I've selected the columns I need: Month, Year, and Temperature. There is one row for each day.
I've managed to add another column, with a 1 or 0 for each day above freezing.
I would now like to aggregate the rows so I have a row for each season and a column for each year, as well as an annual total row.
KAN_U <- kan_u_df %>%
  select(Year, MonthOfYear, AirTemperature.C.)
KAN_U$Melt <- as.numeric(KAN_U$AirTemperature.C. > 0)
head(KAN_U)
Year MonthOfYear AirTemperature.C. Melt
1 2009 4 -999.00 0
2 2009 4 -25.30 0
3 2009 4 -23.44 0
4 2009 4 -28.18 0
5 2009 4 -32.15 0
6 2009 4 -24.35 0
I would like my final table to look like this:
Total Winter Spring Summer Autumn
2009 10 0 2 7 1
2010 10 0 2 7 1
2011 10 0 2 7 1
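One possible way to get a table of that shape in base R is sketched below. The small data frame is made up purely for illustration, and the month-to-season mapping (Dec-Feb as Winter, Mar-May as Spring, and so on, with December counted in its own calendar year) is an assumption that may not match the intended definition.

```r
# Hypothetical daily data with the same column names as in the question
KAN_U <- data.frame(
  Year = c(2009, 2009, 2009, 2009, 2010, 2010, 2010),
  MonthOfYear = c(1, 4, 7, 7, 7, 10, 12),
  AirTemperature.C. = c(-20.1, 0.5, 2.3, 1.1, 3.0, 0.2, -15.0)
)
KAN_U$Melt <- as.numeric(KAN_U$AirTemperature.C. > 0)

# Assumed season for each month 1..12
season_by_month <- c("Winter", "Winter", "Spring", "Spring", "Spring", "Summer",
                     "Summer", "Summer", "Autumn", "Autumn", "Autumn", "Winter")
KAN_U$Season <- factor(season_by_month[KAN_U$MonthOfYear],
                       levels = c("Winter", "Spring", "Summer", "Autumn"))

# Sum the melt days per year and season, then prepend the annual total
tab <- xtabs(Melt ~ Year + Season, data = KAN_U)
out <- cbind(Total = rowSums(tab), tab)
out
```

This gives one row per year and one column per season; t(out) would flip it to seasons in rows and years in columns if that layout is preferred.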
Let's say I have two data frames. Each has a DAY, a MONTH, and a YEAR column along with one other variable, C and P, respectively. I want to merge the two data frames in two different ways. First, I merge by date:
test<-merge(data1,data2,by.x=c("DAY","MONTH","YEAR"),by.y=c("DAY","MONTH","YEAR"),all.x=T,all.y=F)
This works perfectly. The second merge is the one I'm having trouble with. I have currently merged the value for January 5, 1996 from data1 and the value for January 5, 1996 from data2 into one data frame, but now I would like to merge a third value onto each row of the new data frame. Specifically, I want to merge the value for January 4, 1996 from data2 with the two values from January 5, 1996. Any tips on getting merge to be flexible in this way?
sample data:
data1
C DAY MONTH YEAR
1 1 1 1996
6 5 1 1996
5 8 1 1996
3 11 1 1996
9 13 1 1996
2 14 1 1996
3 15 1 1996
4 17 1 1996
data2
P DAY MONTH YEAR
1 1 1 1996
4 2 1 1996
8 3 1 1996
2 4 1 1996
5 5 1 1996
2 6 1 1996
7 7 1 1996
4 8 1 1996
6 9 1 1996
1 10 1 1996
7 11 1 1996
3 12 1 1996
2 13 1 1996
2 14 1 1996
5 15 1 1996
9 16 1 1996
1 17 1 1996
Make a new column that is a Date type, not just day, month, and year integers. You can use as.Date() to do this, though you will need to look up the right format for the format= argument given your string. Let's call that column D1 in both data frames. Now do data2$D2 = data2$D1 + 1, so that each data2 row is keyed to the following day. The key point here is that Date types allow simple date arithmetic. Now just merge with by.x = "D1" (from data1) and by.y = "D2" (from data2).
In case that was confusing, the bottom line is that you need to convert your columns to Date types so that you can do date arithmetic.
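A minimal sketch of that idea with the sample data from the question might look like this. The helper to_date and the column name P_prev are names I made up for illustration; the shifted key is built in a small separate data frame so that the previous day's P does not collide with the same-day P.

```r
data1 <- data.frame(C = c(1, 6, 5, 3, 9, 2, 3, 4),
                    DAY = c(1, 5, 8, 11, 13, 14, 15, 17),
                    MONTH = 1, YEAR = 1996)
data2 <- data.frame(P = c(1, 4, 8, 2, 5, 2, 7, 4, 6, 1, 7, 3, 2, 2, 5, 9, 1),
                    DAY = 1:17, MONTH = 1, YEAR = 1996)

# Hypothetical helper: build a Date column from the three integer columns
to_date <- function(d) {
  as.Date(paste(d$YEAR, d$MONTH, d$DAY, sep = "-"), format = "%Y-%m-%d")
}
data1$D1 <- to_date(data1)
data2$D1 <- to_date(data2)

# First merge: same-day values of C and P
same_day <- merge(data1, data2[, c("D1", "P")], by = "D1", all.x = TRUE)

# Second merge: shift data2 forward one day so that Jan 4 lines up with Jan 5
prev_day <- data.frame(D1 = data2$D1 + 1, P_prev = data2$P)
result <- merge(same_day, prev_day, by = "D1", all.x = TRUE)
result
```

Rows with no previous day in data2 (here January 1) end up with NA in P_prev because of all.x = TRUE.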