I have a data frame like this in R:
year id value time
2012 1 180 1
2012 1 149 1
2010 2 131 0
2010 2 120 0
2010 2 120 0
2010 2 16 0
2010 2 120 0
2012 2 50 1
I want to create a dummy variable that is 1 if the id appears in both 2010 and 2012 in the year column, like this:
year id value time both
2012 1 180 1 0
2012 1 149 1 0
2010 2 131 0 1
2010 2 120 0 1
2010 2 120 0 1
2010 2 16 0 1
2010 2 120 0 1
2012 2 50 1 1
The following code first creates a list holding the ids observed in each year, then finds the ids that appear in every available year. Each value in the id column of the data frame is then tested against those ids, and the result is saved as 0/1 in a separate column named both_resp:
# df is your data frame
idsPerYear <- split(df$id, df$year)
idsInAllYears <- Reduce(intersect, idsPerYear)
df$both_resp <- as.numeric( df$id %in% idsInAllYears )
or an alternative with hardcoded values:
df$both_resp <- as.numeric(
  df$id %in% intersect(
    df[df$year == 2010, "id"],
    df[df$year == 2012, "id"]
  )
)
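As a rough check of the idea on the sample data from the question, here is a minimal base R sketch. It uses ave() to test, per id, whether both years occur. The name target_years is an assumption introduced for illustration, and the column is called both to match the desired output.

```r
# Sample data copied from the question
df <- data.frame(
  year  = c(2012, 2012, 2010, 2010, 2010, 2010, 2010, 2012),
  id    = c(1, 1, 2, 2, 2, 2, 2, 2),
  value = c(180, 149, 131, 120, 120, 16, 120, 50),
  time  = c(1, 1, 0, 0, 0, 0, 0, 1)
)

# Assumed for illustration: the two years that must both be present
target_years <- c(2010, 2012)

# For each id, check whether every target year occurs among its rows;
# ave() repeats the single per-id result on all rows of that id
df$both <- ave(df$year, df$id,
               FUN = function(y) as.numeric(all(target_years %in% y)))
df$both
```

Unlike the Reduce(intersect, ...) version, this only looks at the years listed in target_years, so extra years in the data would not change the result.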
I have a dataset where I observe individuals in different years (e.g., individual 1 is observed in 2012 and 2014, while individuals 2 and 3 are only observed in 2016). I would like to expand the data for each individual (i.e., each individual would have 3 rows: 2012, 2014 and 2016) in order to create a panel dataset with an indicator for whether an individual is observed or not.
My initial dataset is:
year  individual_id  rank
2012  1              11
2014  1              16
2016  2              76
2016  3              125
And I would like to get something like this:
year  individual_id  rank  present
2012  1              11    1
2014  1              16    1
2016  1              .     0
2012  2              .     0
2014  2              .     0
2016  2              76    1
2012  3              .     0
2014  3              .     0
2016  3              125   1
So far I have tried to play with "expand":
bys researcher: egen count=count(year)
replace count=3-count+1
bys researcher: replace count=. if _n>1
expand count
which gives me 3 rows per individual. Unfortunately, this just copies one of the initial rows, and I am unable to get from there to the final desired dataset.
Thanks in advance for your help!
You can use expand.grid to create a data frame of all combinations of your inputs. Then full join the two tables and add a condition that flags whether the individual was present in that year.
library(dplyr)
dt = data.frame(
  year = c(2012,2014,2016,2016),
  individual_id = c(1,1,2,3),
  rank = c(11,16,76,125)
)
exp = expand.grid(year = c(2012,2014,2016), individual_id = c(1:3))
dt %>%
  full_join(exp, by = c("year","individual_id")) %>%
  mutate(present = ifelse(!is.na(rank), 1, 0)) %>%
  arrange(individual_id, year)
year individual_id rank present
1 2012 1 11 1
2 2014 1 16 1
3 2016 1 NA 0
4 2012 2 NA 0
5 2014 2 NA 0
6 2016 2 76 1
7 2012 3 NA 0
8 2014 3 NA 0
9 2016 3 125 1
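The same grid-then-join idea can also be sketched in base R without dplyr. This is only one possible variant: it builds the grid from the values found in the data rather than typing them in, and the names grid and panel are made up for the example.

```r
dt <- data.frame(
  year = c(2012, 2014, 2016, 2016),
  individual_id = c(1, 1, 2, 3),
  rank = c(11, 16, 76, 125)
)

# All year x individual combinations, taken from the data itself
grid <- expand.grid(year = sort(unique(dt$year)),
                    individual_id = sort(unique(dt$individual_id)))

# Left join: keep every grid row; rank is NA where there was no observation
panel <- merge(grid, dt, by = c("year", "individual_id"), all.x = TRUE)
panel$present <- as.integer(!is.na(panel$rank))

# Sort by individual, then year, and reset the row names
panel <- panel[order(panel$individual_id, panel$year), ]
rownames(panel) <- NULL
panel
```

Note that presence is inferred from a missing rank, so this assumes observed rows never have an NA rank themselves.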
I have a dataset which, simplified, looks something like this:
YEAR = c(2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010,2010,2010,2010)
FROM = c("A","C","B","D","B","A","C","A","C","B","A","D","B","A")
TO = c("B","D","C","A","D","C","B","B","A","D","D","C","A","D")
DATA = data.frame(YEAR,FROM,TO)
YEAR FROM TO
2009 A B
2009 C D
2009 B C
2009 D A
2009 B D
2009 A C
2009 C B
2010 A B
2010 C A
2010 B D
2010 A D
2010 D C
2010 B A
2010 A D
What I want is two additional columns, say OCC_FROM and OCC_TO, holding the cumulative count of occurrences of that row's FROM value and TO value, respectively, in either the FROM or the TO column in prior rows, by YEAR. Like this:
YEAR FROM TO OCC_FROM OCC_TO
2009 A B 0 0
2009 C D 0 0
2009 B C 1 1
2009 C A 2 1
2009 B D 2 1
2009 A C 2 3
2009 C B 4 3
2010 A B 0 0
2010 C A 0 1
2010 B D 1 0
2010 A B 2 2
2010 D C 1 1
2010 B A 3 3
2010 A D 4 2
What I've managed to produce, with the help of the question "Cumulative count in R", is this, which is obviously not quite what I want since it doesn't take YEAR into consideration:
DATA$OCC_FROM = sapply(1:length(DATA$FROM),function(i)sum(DATA$FROM[i]==DATA$FROM[1:i]))+sapply(1:length(DATA$FROM),function(i)sum(DATA$FROM[i]==DATA$TO[1:i]))-1
DATA$OCC_TO = sapply(1:length(DATA$TO),function(i)sum(DATA$TO[i]==DATA$FROM[1:i]))+sapply(1:length(DATA$TO),function(i)sum(DATA$TO[i]==DATA$TO[1:i]))-1
YEAR FROM TO OCC_FROM OCC_TO
2009 A B 0 0
2009 C D 0 0
2009 B C 1 1
2009 C A 2 1
2009 B D 2 1
2009 A C 2 3
2009 C B 4 3
2010 A B 3 4
2010 C A 5 4
2010 B D 5 2
2010 A B 5 6
2010 D C 3 6
2010 B A 7 6
2010 A D 7 4
Edit: I also want to be able to sum two columns cumulatively based on FROM and TO, by YEAR as before. For simplicity I'll use OCC_FROM and OCC_TO. Like this:
YEAR FROM TO OCC_FROM OCC_TO TOTAL_FROM TOTAL_TO
2009 A B 0 0 0 0
2009 C D 0 0 0 0
2009 B C 1 1 0 0
2009 C A 2 1 1 0
2009 B D 2 1 1 0
2009 A C 2 3 1 3
2009 C B 4 3 6 3
2010 A B 0 0 0 0
2010 C A 0 1 0 0
2010 B D 1 0 0 0
2010 A B 2 2 1 1
2010 D C 1 1 0 0
2010 B A 3 3 3 3
2010 A D 4 2 6 1
You could try
prevCount <- function(x) {
  eq <- outer(x,x,"==")
  eq <- eq & upper.tri(eq)
  eqInt <- ifelse(eq, 1, 0)
  return(apply(eqInt,2,sum))
}
DATA$OCC_FROM <- ave(DATA$FROM, DATA$YEAR, FUN=prevCount )
prevCount is a function that, within a year, returns the number of elements prior to each element that are identical. The ave call then applies this per year.
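Rolling up the corrections from the comments, we get
# Interleave FROM and TO row by row: FROM[1], TO[1], FROM[2], TO[2], ...
ord <- order(c(1:nrow(DATA), 1:nrow(DATA)))
targets <- c(as.character(DATA$FROM), as.character(DATA$TO))[ord]
yr <- c(DATA$YEAR, DATA$YEAR)[ord]
# targets is character, so ave() returns the counts as character; convert back
res <- as.integer(ave(targets, yr, FUN=prevCount))
DATA$occ_from <- res[seq(1, length(res), 2)]
DATA$occ_to <- res[seq(2, length(res), 2)]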
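Also, the prevCount function can be simplified as follows (the running count includes the current element, so subtract 1 to count only prior ones):
prevCount <- function(x) {ave(x==x, x, FUN=cumsum) - 1}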
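To see what "number of prior identical elements" means in practice, here is a small self-contained sketch comparing a matrix-based count with a grouped running count on a toy vector. The function names prev_count_outer and prev_count_ave are invented for this example and are not part of the answer above.

```r
# Matrix version: compare all pairs, keep only matches that occur earlier
prev_count_outer <- function(x) {
  eq <- outer(x, x, "==")
  # upper.tri keeps entries [i, j] with i < j, i.e. earlier element i
  colSums(eq & upper.tri(eq))
}

# Grouped running count; subtract 1 so the current element is not counted
prev_count_ave <- function(x) {
  ave(rep(1, length(x)), x, FUN = cumsum) - 1
}

x <- c("A", "B", "A", "C", "B", "A")
prev_count_outer(x)  # 0 0 1 0 1 2
prev_count_ave(x)    # 0 0 1 0 1 2
```

The matrix version builds an n-by-n comparison, so the grouped version is likely the better choice for long vectors.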
# Your data - this is not the data at the top of your question but from your
# solution [the 'from' and 'to' don't correspond exactly between your question
# and solution]
YEAR <- c(2009,2009,2009,2009,2009,2009,2009,2010,2010,2010,2010,2010,2010,2010)
FROM <- c("A","C","B","C","B","A","C","A","C","B","A","D","B","A")
TO <- c( "B","D","C","A","D","C","B","B","A","D","B","C","A","D")
mydf <- data.frame(YEAR,FROM,TO)
names(mydf) <- tolower(names(mydf))
#---------------------------------------------------
# Function to get cumulative sum across columns by group
f <- function(from , to){
  # combine the columns 'from' and 'to' alternately
  l <- c(rbind(from , to))
  # Get and sum duplicate values
  dup <- duplicated(l)
  sums <- ave(dup , l , FUN= cumsum)
  # Reshape data & output
  out <- t(matrix(sums ,2))
  colnames(out) <- c("occ_from","occ_to")
  out
}
# Not considering year
f(mydf$from , mydf$to)
# data.frame(mydf , f(mydf$from , mydf$to))
# Calculate by year
s <- split(mydf , mydf$year)
d <- do.call(rbind,lapply(s,function(i) f(i[,"from"],i[,"to"])))
(mydf <- data.frame(mydf , d , row.names=NULL))
I would like to create a panel from a dataset that has one observation for every given time period such that every unit has a new observation for every time period. Using the following example:
id <- seq(1:4)
year <- c(2005, 2008, 2008, 2007)
y <- c(1,0,0,1)
frame <- data.frame(id, year, y)
frame
id year y
1 1 2005 1
2 2 2008 0
3 3 2008 0
4 4 2007 1
For each unique ID, I would like there to be a unique observation for the year 2005, 2006, 2007, and 2008 (the lower and upper time periods on this frame), and set the outcome y to 0 for all the times in which there isn't an existing observation, such that the new frame looks like:
id year y
1 1 2005 1
2 1 2006 0
3 1 2007 0
4 1 2008 0
....
13 4 2005 0
14 4 2006 0
15 4 2007 1
16 4 2008 0
I haven't had much success with loops. Any and all thoughts would be greatly appreciated.
1) reshape2. Create a grid g of all years and id values crossed and rbind it with frame.
Then, using the reshape2 package, cast frame from long to wide form and melt it back to long form. Finally, rearrange the rows and columns as desired.
The lines ending in a single # only ensure that every year is present, so if we knew that were already the case those lines could be omitted. The line ending in ## only rearranges the rows and columns, so if that did not matter that line could be omitted too.
library(reshape2)
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
wide <- dcast(frame, year ~ id, fill = 0, fun = sum, value.var = "y")
long <- melt(wide, id = "year", variable.name = "id", value.name = "y")
long <- long[order(long$id, long$year), c("id", "year", "y")] ##
giving:
> long
id year y
1 1 2005 1
2 1 2006 0
3 1 2007 0
4 1 2008 0
5 2 2005 0
6 2 2006 0
7 2 2007 0
8 2 2008 0
9 3 2005 0
10 3 2006 0
11 3 2007 0
12 3 2008 0
13 4 2005 0
14 4 2006 0
15 4 2007 1
16 4 2008 0
2) aggregate. A shorter solution is to run just the two lines that end with # above and then follow them with an aggregate as shown. This solution uses no add-on packages.
g <- with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), y = 0)) #
frame <- rbind(frame, g) #
aggregate(y ~ year + id, frame, sum)[c("id", "year", "y")]
This gives the same answer as solution (1), except that, as noted by a commenter, solution (1) makes id a factor whereas this solution does not.
Using data.table:
require(data.table)
DT <- data.table(frame, key=c("id", "year"))
comb <- CJ(1:4, 2005:2008) # like 'expand.grid', but faster + sets key
ans <- DT[comb][is.na(y), y:=0L] # perform a join (DT[comb]), then set NAs to 0
# id year y
# 1: 1 2005 1
# 2: 1 2006 0
# 3: 1 2007 0
# 4: 1 2008 0
# 5: 2 2005 0
# 6: 2 2006 0
# 7: 2 2007 0
# 8: 2 2008 0
# 9: 3 2005 0
# 10: 3 2006 0
# 11: 3 2007 0
# 12: 3 2008 0
# 13: 4 2005 0
# 14: 4 2006 0
# 15: 4 2007 1
# 16: 4 2008 0
Maybe not an elegant solution, but anyway:
# use the full range of years, so years with no observation (2006) are included
df <- expand.grid(id=id, year=min(year):max(year))
frame <- frame[frame$y != 0,]
df$y <- 0
df2 <- rbind(frame, df)
df2 <- df2[!duplicated(df2[,c("id", "year")]),]
df2 <- df2[order(df2$id, df2$year),]
rownames(df2) <- NULL
df2
# id year y
# 1 1 2005 1
# 2 1 2006 0
# 3 1 2007 0
# 4 1 2008 0
# 5 2 2005 0
# 6 2 2006 0
# 7 2 2007 0
# 8 2 2008 0
# 9 3 2005 0
# 10 3 2006 0
# 11 3 2007 0
# 12 3 2008 0
# 13 4 2005 0
# 14 4 2006 0
# 15 4 2007 1
# 16 4 2008 0
I am interested in learning how a specific factor, such as foreign investment, behaves in the 5 years before and after a change, e.g. the outbreak of civil war.
This is the structure of my data (the factor is not included here):
year country change time
2001 A 0 ? (-1)
2002 A 1 0
2003 A 0 ? (+1)
2004 A 0 ? (+2)
2002 B 0 ? (-2)
2003 B 0 ? (-1)
2004 B 1 0
...
I am seeking to replace the question marks by the respective values in brackets, e.g., "-1" for the year prior to change (t-1) and "+1" for the year following change (t+1). The presence of change is coded with 1.
How would you do this? I am grateful for any suggestions.
> dat <- read.table(text="year country change time
+ 2001 A 0 ?(-1)
+ 2002 A 1 0
+ 2003 A 0 ?(+1)
+ 2004 A 0 ?(+2)
+ 2002 B 0 ?(-2)
+ 2003 B 0 ?(-1)
+ 2004 B 1 0
+ ", header=TRUE)
> with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) )
$A
[1] -1 0 1 2
$B
[1] -2 -1 0
> dat$time <-unlist( with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
>
Slightly less complex would be to use ave instead of unlist(tapply(...)):
> dat$time <- with(dat, ave(change, country, FUN=function(x) seq(length(x))-which(x==1) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
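Both versions above count row positions, so they rely on each country having one row per consecutive year, sorted by year. A hedged alternative, assuming exactly one row with change == 1 per country, is to subtract the year of the change directly. The name change_year is introduced here for illustration.

```r
dat <- data.frame(
  year    = c(2001, 2002, 2003, 2004, 2002, 2003, 2004),
  country = c("A", "A", "A", "A", "B", "B", "B"),
  change  = c(0, 1, 0, 0, 0, 0, 1)
)

# year * change is nonzero only on the change row, so the per-country
# maximum is the year of the change (assumes one change per country)
change_year <- ave(dat$year * dat$change, dat$country, FUN = max)

# Event time is the distance in calendar years from the change
dat$time <- dat$year - change_year
dat$time
```

Because it works on calendar years rather than row order, this sketch should still give sensible values if a country has missing years or unsorted rows; countries with no change at all would need separate handling.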