creating a 'survival' dummy in R (panel data)

I would like to create a dummy in my unbalanced panel that will be equal to 0 if the firm closed the next year (here 'closed' means it was not recorded in the next time period) and 1 otherwise.
My data look like this:
firm_id year
1 90
1 92
2 90
2 92
2 94
2 96
3 90
So my desired output would look like:
firm_id year dummy
1 90 1
1 92 0
2 90 1
2 92 1
2 94 1
2 96 1
3 90 0
I am unsure how to approach this problem. My original idea was to count the number of years associated with each firm_id: if a firm has 4 years, always assign 1; if a firm has 3 years, assign 1 to the first 2 years and 0 to the 3rd. But then I discovered I had firms that entered the panel later, so this method will not work. Is there a better approach that would solve this issue?

Assign the value 0 to dummy where the row is the last entry for its firm_id and its year is not the maximum year in the panel; all other rows keep 1.
df$dummy <- 1
df$dummy[!duplicated(df$firm_id, fromLast = TRUE) & df$year != max(df$year)] <- 0
df
# firm_id year dummy
#1 1 90 1
#2 1 92 0
#3 2 90 1
#4 2 92 1
#5 2 94 1
#6 2 96 1
#7 3 90 0
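To see what the index does: duplicated() with fromLast = TRUE flags every row that has a later duplicate, so its negation marks the last entry per firm_id. A quick illustration with the firm_id column from the example:
duplicated(c(1, 1, 2, 2, 2, 2, 3), fromLast = TRUE)
# [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE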
This requires your data to be sorted by firm_id and year, as in the example. If it is not sorted, you can use order to sort it first:
df <- df[with(df, order(firm_id, year)), ]
data
df <- structure(list(firm_id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L), year = c(90L,
92L, 90L, 92L, 94L, 96L, 90L)), class = "data.frame", row.names = c(NA, -7L))

We can use duplicated with fromLast = TRUE inside mutate; the unary + converts the logical result into the 0/1 dummy:
library(dplyr)
df %>%
  mutate(dummy = +(duplicated(firm_id, fromLast = TRUE) | year == max(year)))
firm_id year dummy
1 1 90 1
2 1 92 0
3 2 90 1
4 2 92 1
5 2 94 1
6 2 96 1
7 3 90 0
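An equivalent grouped formulation, for readers who prefer explicit per-firm logic, is sketched below; it assumes rows are sorted by year within each firm, and max_year is a helper name introduced here:
library(dplyr)
max_year <- max(df$year)  # final period observed anywhere in the panel
df %>%
  group_by(firm_id) %>%
  mutate(dummy = +(row_number() < n() | year == max_year)) %>%  # 0 only on a firm's last row before the panel ends
  ungroup()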
data
df <- structure(list(firm_id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L), year = c(90L,
92L, 90L, 92L, 94L, 96L, 90L)), class = "data.frame", row.names = c(NA, -7L))

Related

merge IDs of three data frames to obtain multiple rows

I have 3 data frames, and my sample IDs vary across them. I want to merge them to obtain one data frame with multiple rows for the same ID but not the same values.
This is what I have:
df1
ID tstart
1  12
2  4
df2
ID tstart
2  40
3  15
df3
ID tstart
2  80
3  80
This is what I want:
ID tstart
1  12
2  4
2  40
3  15
3  80
Now I want to create a new variable. I have this:
ID tstart tstop result1 result2
1  12     20    5       NA
2  4      40    10      NA
2  40     80    NA      52
3  15     80    68      NA
3  80     100   NA      56
And I want a new variable so that the data frame looks like this:
ID tstart tstop result
1  12     20    5
2  4      40    10
2  40     80    52
3  15     80    68
3  80     100   56
We can use bind_rows to bind the datasets and then get the distinct rows:
library(dplyr)
bind_rows(df1, df2, df3) %>%
  distinct()
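For a self-contained run, the three inputs can be reconstructed from the question's tables (the names df1, df2 and df3 follow the question):
df1 <- data.frame(ID = c(1L, 2L), tstart = c(12L, 4L))
df2 <- data.frame(ID = c(2L, 3L), tstart = c(40L, 15L))
df3 <- data.frame(ID = c(2L, 3L), tstart = c(80L, 80L))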
For the second case, if the input already has the 'tstart' and 'tstop' columns, we can coalesce the 'result' columns:
dfnew %>%
  mutate(result = coalesce(result1, result2), .keep = 'unused')
-output
ID tstart tstop result
1 1 12 20 5
2 2 4 40 10
3 2 40 80 52
4 3 15 80 68
5 3 80 100 56
data
dfnew <- structure(list(ID = c(1L, 2L, 2L, 3L, 3L), tstart = c(12L, 4L,
40L, 15L, 80L), tstop = c(20L, 40L, 80L, 80L, 100L), result1 = c(5L,
10L, NA, 68L, NA), result2 = c(NA, NA, 52L, NA, 56L)),
class = "data.frame", row.names = c(NA,
-5L))
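As a quick illustration, coalesce() picks the first non-missing value position by position across its arguments, which is exactly how 'result1' and 'result2' collapse into one column here:
library(dplyr)
coalesce(c(5L, 10L, NA, 68L, NA), c(NA, NA, 52L, NA, 56L))
# [1]  5 10 52 68 56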

How to change values with a condition in a dataframe in R?

I have a dataframe (myData) something like this:
User X Y Similar
A 1 4 100
A 1 2 100
A 1 1 100
A 3 2 80
A 2 1 20
A 2 4 100
B 3 1 50
B 4 2 90
B 1 3 100
To something like this:
User X Y Similar
A 1 4 0
A 1 2 0
A 1 1 0
A 3 2 80
A 2 1 20
A 2 4 100
B 3 1 50
B 4 2 90
B 1 3 0
Question
I want to change the values in the Similar column to 0 under a condition: if X == 1 and Similar == 100. How do I do that in R?
Thanks
We create a logical index based on 'X' and 'Similar', then use it to assign 0 to the matching values of 'Similar':
i1 <- with(myData, X == 1 & Similar == 100)
myData$Similar[i1] <- 0
-output
myData
# User X Y Similar
#1 A 1 4 0
#2 A 1 2 0
#3 A 1 1 0
#4 A 3 2 80
#5 A 2 1 20
#6 A 2 4 100
#7 B 3 1 50
#8 B 4 2 90
#9 B 1 3 0
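The same replacement can also be written with dplyr; this is a sketch of an alternative, not part of the original answer:
library(dplyr)
myData %>%
  mutate(Similar = replace(Similar, X == 1 & Similar == 100, 0L))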
data
myData <- structure(list(User = c("A", "A", "A", "A", "A", "A", "B", "B",
"B"), X = c(1L, 1L, 1L, 3L, 2L, 2L, 3L, 4L, 1L), Y = c(4L, 2L,
1L, 2L, 1L, 4L, 1L, 2L, 3L), Similar = c(100L, 100L, 100L, 80L,
20L, 100L, 50L, 90L, 100L)), class = "data.frame", row.names = c(NA,
-9L))

how to sum based on three different conditions in R

The following is my data.
gcode code year P Q
1 101 2000 1 3
1 101 2001 2 4
1 102 2000 1 1
1 102 2001 4 5
1 102 2002 2 6
1 102 2003 6 5
1 103 1999 6 1
1 103 2000 4 2
1 103 2001 2 1
2 104 2000 1 3
2 104 2001 2 4
2 105 2001 4 5
2 105 2002 2 6
2 105 2003 6 5
2 105 2004 6 1
2 106 2000 4 2
2 106 2001 2 1
gcode 1 has 3 different codes: 101, 102 and 103. They all share the same years (2000 and 2001). I want to sum up P and Q for these years and drop the irrelevant rows. I want to do the same for gcode 2 as well.
How can I get the result like this?
gcode year P Q
1 2000 1+1+4 3+1+2
1 2001 2+4+2 4+5+1
2 2001 2+4+2 4+5+1
We can split the data by gcode, subset each piece to the common years that are present for all of its codes, and aggregate the data by gcode and year:
do.call(rbind, lapply(split(df, df$gcode), function(x) {
  aggregate(cbind(P, Q) ~ gcode + year,
            subset(x, year %in% Reduce(intersect, split(x$year, x$code))), sum)
}))
# gcode year P Q
#1.1 1 2000 6 6
#1.2 1 2001 8 10
#2 2 2001 8 10
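The key step is Reduce(intersect, split(x$year, x$code)), which keeps only the years shared by every code; for gcode 1 it evaluates like this (values taken from the sample data):
Reduce(intersect, list(c(2000, 2001),              # years of code 101
                       c(2000, 2001, 2002, 2003),  # years of code 102
                       c(1999, 2000, 2001)))       # years of code 103
# [1] 2000 2001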
Using dplyr with similar logic, we can do:
library(dplyr)
df %>%
  group_split(gcode) %>%
  purrr::map_df(. %>%
    group_by(year) %>%
    filter(n_distinct(code) == n_distinct(.$code)) %>%
    group_by(gcode, year) %>%
    summarise_at(vars(P:Q), sum))
data
df <- structure(list(gcode = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), code = c(101L, 101L, 102L, 102L,
102L, 102L, 103L, 103L, 103L, 104L, 104L, 105L, 105L, 105L, 105L,
106L, 106L), year = c(2000L, 2001L, 2000L, 2001L, 2002L, 2003L,
1999L, 2000L, 2001L, 2000L, 2001L, 2001L, 2002L, 2003L, 2004L,
2000L, 2001L), P = c(1L, 2L, 1L, 4L, 2L, 6L, 6L, 4L, 2L, 1L,
2L, 4L, 2L, 6L, 6L, 4L, 2L), Q = c(3L, 4L, 1L, 5L, 6L, 5L, 1L,
2L, 1L, 3L, 4L, 5L, 6L, 5L, 1L, 2L, 1L)), class = "data.frame",
row.names = c(NA, -17L))
An option using the data.table package (note that tabulate counts values starting at 1, so we shift by one to keep the minimum year as a candidate):
years <- DT[, {
  m <- min(year)
  ty <- tabulate(year - m + 1)
  .(year = which(ty == uniqueN(code)) + m - 1)
}, gcode]
DT[years, on = .(gcode, year),
   by = .EACHI, .(P = sum(P), Q = sum(Q))]
output:
gcode year P Q
1: 1 2000 6 6
2: 1 2001 8 10
3: 2 2001 8 10
data:
library(data.table)
DT <- fread("gcode code year P Q
1 101 2000 1 3
1 101 2001 2 4
1 102 2000 1 1
1 102 2001 4 5
1 102 2002 2 6
1 102 2003 6 5
1 103 1999 6 1
1 103 2000 4 2
1 103 2001 2 1
2 104 2000 1 3
2 104 2001 2 4
2 105 2001 4 5
2 105 2002 2 6
2 105 2003 6 5
2 105 2004 6 1
2 106 2000 4 2
2 106 2001 2 1")
I came up with the following solution. First, I counted how many times each year appears for each gcode. I also counted how many unique codes exist for each gcode. Then I joined the two results using left_join(). Next, I kept the rows that have the same values in n_year and n_code. Then I joined the original data frame, which is called mydf. Finally, I defined groups by gcode and year and summed up P and Q for each group:
library(dplyr)
left_join(count(mydf, gcode, year, name = "n_year"),
          group_by(mydf, gcode) %>% summarize(n_code = n_distinct(code))) %>%
  filter(n_year == n_code) %>%
  left_join(mydf, by = c("gcode", "year")) %>%
  group_by(gcode, year) %>%
  summarize_at(vars(P:Q), .funs = list(~sum(.)))
# gcode year P Q
# <int> <int> <int> <int>
#1 1 2000 6 6
#2 1 2001 8 10
#3 2 2001 8 10
Another idea
I was reviewing this question later and came up with the following idea, which is much simpler. First, I defined groups by gcode and year. For each group, I counted how many data points exist using add_count(). Then I defined groups again with gcode only. For each gcode group, I kept the rows that meet n == n_distinct(code), where n is the column created by add_count(). If the number in n matches the number returned by n_distinct(), the year in that row exists for every code. Finally, I defined groups by gcode and year again and summed up the values in P and Q:
group_by(mydf, gcode, year) %>%
  add_count() %>%
  group_by(gcode) %>%
  filter(n == n_distinct(code)) %>%
  group_by(gcode, year) %>%
  summarize_at(vars(P:Q), .funs = list(~sum(.)))
# This is the same code in data.table.
setDT(mydf)[, check := .N, by = .(gcode, year)][,
.SD[check == uniqueN(code)], by = gcode][,
lapply(.SD, sum), .SDcols = P:Q, by = .(gcode, year)][]

reshaping a dataframe for prediction

I just picked up the reshape package today and I'm having some trouble understanding how it works.
I have the following dataframe:
name workoutnum time weight raceid final position
tommy 1 12 140 1 2
tommy 2 14 140 1 2
tommy 3 11 140 1 2
sarah 1 10 115 1 1
sarah 2 10 115 1 1
sarah 3 11 115 1 1
sarah 4 15 115 1 1
How would I put all this in one row? So the dataframe would look like:
name workoutnum1 workoutnum2 workoutnum3 workoutnum4 time1 time2 time3 time4 weight raceid final_position
tommy 1 1 1 0 12 14 11 NA 140 1 2
sarah 1 1 1 1 10 10 11 15 115 1 1
So all columns would be attached to the workout values.
Is this even the proper way to do it?
reshape seems like a natural part of what you want to do, but won't get you all the way there.
Here's a reshape2 approach that fully melts the data, then casts it back to data.frame, with some tweaks along the way to get the desired output.
Note that in the call to melt(), the variables in the id.vars argument will remain wide. Then in dcast(), the variable that'll be cast wide is on the RHS of the ~.
library(reshape2)
library(dplyr)
# fully melt the data
d_melt <- melt(d, id.vars = c("name", "raceid", "position", "weight"))
# index the variables within name and variable
d_melt <- d_melt %>%
  group_by(name, variable) %>%
  mutate(i = row_number(),
         wide_variable = paste0(variable, i))
# cast as wide
d_wide <- dcast(d_melt, name + raceid + position + weight ~ wide_variable, value.var = "value")
# replace the workoutnum indices with indicators for missingness
# (mutate_each()/funs() are defunct in current dplyr; across() is the modern equivalent)
d_wide %>% mutate(across(matches("workoutnum\\d"), ~ ifelse(!is.na(.x), 1L, 0L)))
# name raceid position weight time1 time2 time3 time4 workoutnum1 workoutnum2
# 1 sarah 1 1 115 10 10 11 15 1 1
# 2 tommy 1 2 140 12 14 11 NA 1 1
# workoutnum3 workoutnum4
# 1 1 1
# 2 1 0
Data:
structure(list(name = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("sarah", "tommy"), class = "factor"), workoutnum = c(1L, 2L, 3L, 1L, 2L, 3L, 4L), time = c(12L, 14L, 11L, 10L, 10L, 11L, 15L), weight = c(140L, 140L, 140L, 115L, 115L, 115L, 115L), raceid = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)), .Names = c("name", "workoutnum", "time", "weight", "raceid", "position"), class = "data.frame", row.names = c(NA, -7L))
Here's an approach using dcast from "data.table", which reshapes a little more like the reshape function in base R.
The only change I've made to the data is the inclusion of another "time" variable, though, as pointed out by @rawr in the comments, it almost seems like your "workoutnum" is the time variable.
I've used getanID from my "splitstackshape" package to generate the "time" variable, but you can create this variable in many different ways.
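The code below refers to the data as mydf with the final position column named final_position; since the question doesn't give a dput, here is one possible construction based on its table (the name mydf and the underscored column name are assumptions):
mydf <- data.frame(
  name = c("tommy", "tommy", "tommy", "sarah", "sarah", "sarah", "sarah"),
  workoutnum = c(1L, 2L, 3L, 1L, 2L, 3L, 4L),
  time = c(12L, 14L, 11L, 10L, 10L, 11L, 15L),
  weight = c(140L, 140L, 140L, 115L, 115L, 115L, 115L),
  raceid = 1L,
  final_position = c(2L, 2L, 2L, 1L, 1L, 1L, 1L)
)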
library(splitstackshape)
dcast(getanID(mydf, c("name", "raceid", "final_position")),
      name + raceid + final_position ~ .id,
      value.var = c("workoutnum", "time", "weight"))
## name raceid final_position workoutnum_1 workoutnum_2 workoutnum_3
## 1: sarah 1 1 1 2 3
## 2: tommy 1 2 1 2 3
## workoutnum_4 time_1 time_2 time_3 time_4 weight_1 weight_2 weight_3 weight_4
## 1: 4 10 10 11 15 115 115 115 115
## 2: NA 12 14 11 NA 140 140 140 NA
If you're using getanID, you can also use reshape like this:
reshape(getanID(mydf, c("name", "raceid", "final_position")),
        idvar = c("name", "raceid", "final_position"), timevar = ".id",
        direction = "wide")
## name raceid final_position workoutnum.1 time.1 weight.1 workoutnum.2 time.2
## 1: tommy 1 2 1 12 140 2 14
## 2: sarah 1 1 1 10 115 2 10
## weight.2 workoutnum.3 time.3 weight.3 workoutnum.4 time.4 weight.4
## 1: 140 3 11 140 NA NA NA
## 2: 115 3 11 115 4 15 115
but dcast would be more efficient in general.

Subset of data with criteria on two columns

I would like to create a subset of the data consisting of Units that have a higher score in QTR 4 than in QTR 1 (an upward trend). It doesn't matter whether QTR 2 or 3 is present.
Unit QTR Score
5 4 34
1 1 22
5 3 67
2 4 78
3 2 39
5 2 34
1 2 34
5 1 67
1 3 70
1 4 89
3 4 19
Subset would be:
Unit QTR Score
1 1 22
1 2 34
1 3 70
1 4 89
I've tried variants of something like this:
upward_subset <- subset(mydata,Unit if QTR=4~Score > QTR=1~Score)
Thank you for your time
If the dataframe is named "d", then this succeeds on your test set (units missing QTR 1 or QTR 4 simply don't qualify):
d[d$Unit %in% names(which(
  sapply(split(d, d$Unit), function(dd)
    isTRUE(dd$Score[dd$QTR == 4][1] > dd$Score[dd$QTR == 1][1]))
)), ]
#-------------
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
An alternative in two steps, with the data frame named "test" (note the comparison is QTR 4 against QTR 1):
result <- unlist(
  by(test,
     test$Unit,
     function(x) x$Score[x$QTR == 4] > x$Score[x$QTR == 1])
)
test[test$Unit %in% names(result[result == TRUE]), ]
Unit QTR Score
2 1 1 22
7 1 2 34
9 1 3 70
10 1 4 89
A solution using data.table (probably there are better versions than what I have at the moment).
Note: this assumes the QTR values for a given Unit are unique.
Data:
df <- structure(list(Unit = c(5L, 1L, 5L, 2L, 3L, 5L, 1L, 5L, 1L, 1L,
3L), QTR = c(4L, 1L, 3L, 4L, 2L, 2L, 2L, 1L, 3L, 4L, 4L), Score = c(34L,
22L, 67L, 78L, 39L, 34L, 34L, 67L, 70L, 89L, 19L)), .Names = c("Unit",
"QTR", "Score"), class = "data.frame", row.names = c(NA, -11L
))
Solution:
dt <- data.table(df, key=c("Unit", "QTR"))
dt[, Score[Score[QTR == 4] > Score[QTR == 1]], by=Unit]
Unit V1
1: 1 22
2: 1 34
3: 1 70
4: 1 89
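If you also want to keep the QTR column, a variant of the same idea returns each qualifying Unit's full rows (a sketch under the same uniqueness assumption):
dt[, if (isTRUE(Score[QTR == 4][1] > Score[QTR == 1][1])) .SD, by = Unit]
#    Unit QTR Score
# 1:    1   1    22
# 2:    1   2    34
# 3:    1   3    70
# 4:    1   4    89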
