I have a df called laws with a row for each law (one for each id):
laws <- data.frame(id=c(1,2,3),beginyear=c(2001,2002,2005),endyear=c(2003,2005,2006), law1=c(0,0,1), law2=c(1,0,1))
from which I want to create second called idyear with a row for each id and year:
idyear <- data.frame(id=c(rep(1,6),rep(2,6),rep(3,6)), year=(rep(c(2001:2006),3)), law1=c(rep(0,16),1,1), law2=c(1,1,1,rep(0,13),1,1))
How would I efficiently go about writing some code to get the idyear df output from the laws df? The two law variables are indicator variables == 1 if the idyear$year is >= laws$beginyear AND idyear$year is <= laws$endyear.
I am a beginner with R, but I'm willing to try anything (apply, loops, etc.) to get this to work.
1) base expand.grid will create an 18 x 2 data frame of all id and year combinations and then merge will merge it back together with laws. Zero out any law1 and law2 entry for which year is not between beginyear and endyear. Finally drop the beginyear and endyear columns. No packages are used.
g <- with(laws, expand.grid(year = min(beginyear):max(endyear), id = id))
m <- merge(g, laws)
m[m$year < m$beginyear | m$year > m$endyear, c("law1", "law2")] <- 0
m <- subset(m, select = - c(beginyear, endyear))
# check
identical(m, idyear)
## [1] TRUE
2) magrittr This is the same solution as (1) except we have used magrittr pipelines to express it. Note the mixture of pipe operators.
library(magrittr)
laws %$%
expand.grid(year = min(beginyear):max(endyear), id = id) %>%
merge(laws) %$%
{ .[year < beginyear | year > endyear, c("law1", "law2")] <- 0; .} %>%
subset(select = - c(beginyear, endyear))
Update: Fixed. Added (2).
A solution using tidyverse. The last as.data.frame() is optional, which just convert the tbl to a data frame.
library(tidyverse)
idyear <- laws %>%
mutate(year = map2(beginyear, endyear, `:`)) %>%
unnest() %>%
complete(id, year = full_seq(year, period = 1L), fill = list(law1 = 0L, law2 = 0L)) %>%
select(-beginyear, -endyear) %>%
as.data.frame()
idyear
# id year law1 law2
# 1 1 2001 0 1
# 2 1 2002 0 1
# 3 1 2003 0 1
# 4 1 2004 0 0
# 5 1 2005 0 0
# 6 1 2006 0 0
# 7 2 2001 0 0
# 8 2 2002 0 0
# 9 2 2003 0 0
# 10 2 2004 0 0
# 11 2 2005 0 0
# 12 2 2006 0 0
# 13 3 2001 0 0
# 14 3 2002 0 0
# 15 3 2003 0 0
# 16 3 2004 0 0
# 17 3 2005 1 1
# 18 3 2006 1 1
Use of mapply function can help.
# Function to expand year between begin and end
gen_data <- function(x_id, x_beginyear, x_endyear, x_law1, x_law2){
df <- data.frame(x_id, x_beginyear:x_endyear, x_law1, x_law2)
df
}
idyearlst <- data.frame()
idyearlst <- rbind(idyearlst, mapply(gen_data, laws$id, laws$beginyear,
laws$endyear, laws$law1, laws$law2))
# Finally convert list to data.frame
idyear <- setNames(do.call(rbind.data.frame, idyearlst), c("id", "year", "law1", "law2"))
Result will be like:
> idyear
id year law1 law2
V1.1 1 2001 0 1
V1.2 1 2002 0 1
V1.3 1 2003 0 1
V2.4 2 2002 0 0
V2.5 2 2003 0 0
V2.6 2 2004 0 0
V2.7 2 2005 0 0
V3.8 3 2005 1 1
V3.9 3 2006 1 1
Kind of an ugly approach, but I think it gets what you're after, using G. Grothendieck's g expand.grid data frame as a base, and your laws dataframe.
new.df <- data.frame(t(apply(g, 1, function(x){
yearspan = laws[laws$id == x['id'], 'beginyear']:laws[laws$id == x['id'], 'endyear']
law1 = laws$law1[laws$id == x['id'] & x['year'] %in% yearspan]
law2 = laws$law2[laws$id == x['id'] & x['year'] %in% yearspan]
x['law1'] = ifelse(length(law1 > 0), law1, 0)
x['law2'] = ifelse(length(law2 > 0), law2, 0)
return(x)
})))
> new.df
id year law1 law2
1 1 2001 0 1
2 1 2002 0 1
3 1 2003 0 1
4 1 2004 0 0
5 1 2005 0 0
6 1 2006 0 0
7 2 2001 0 0
8 2 2002 0 0
9 2 2003 0 0
10 2 2004 0 0
11 2 2005 0 0
12 2 2006 0 0
13 3 2001 0 0
14 3 2002 0 0
15 3 2003 0 0
16 3 2004 0 0
17 3 2005 1 1
18 3 2006 1 1
Libraries:
dplyr (for arrange, not really necessary)
Data:
laws <- data.frame(id=c(1,2,3),
beginyear=c(2001,2002,2005),
endyear=c(2003,2005,2006),
law1=c(0,0,1), law2=c(1,0,1))
g <- with(laws, expand.grid(id = id, year = min(beginyear):max(endyear)))
g <- arrange(g, id)
Dear all: I'm thinking of creating a "time to event" variable in R and need your expertice to get it done. Below you can see a small sample of my data. The time variable is in years and it starts at 0 and resets itself when Event = 1.
In the real data the observation period starts in 1989 but there are some countries (that had not ratified certain conventions before 1989) that come in later on, like the US in the example below. Whenever it starts, the first value for the "time to event" variable should be zero.
Thanks for all suggestions!
Country year Event **Time-to-event**
USA 2000 0 0
USA 2001 0 1
USA 2002 1 2
USA 2003 0 0
USA 2004 0 1
USA 2005 0 2
USA 2006 1 3
USA 2007 0 0
USA 2008 1 1
USA 2009 0 0
USA 2010 0 1
USA 2011 0 2
USA 2012 0 3
We can use ave
i1 <- with(df2, ave(Event, Country, FUN=
function(x) cumsum(c(TRUE, diff(x)<0))))
df2$Time_to_event <- with(df2, ave(i1, i1, Country, FUN= seq_along)-1)
df2$Time_to_event
#[1] 0 1 2 0 1 2 3 0 1 0 1 2 3
count_until(x) is always equal to rev(count_since(rev(x))).
one might use something like this:
count_since<-function(trigger)
{
i <- seq_along(trigger)
(i - cummax(i*trigger))*cummax(trigger)
}
count_until<-function(x)rev(count_since(rev(x)))
> count_until(1:10%%5==0)
[1] 4 3 2 1 0 4 3 2 1 0
I have two datasets – data A and data B. Data A contains 30.000 observations while data B has 10.000 observations. Both datasets have 156 countries – noted with their ISO–number.
I want to add some of the variables in data B to data A (let's say the variable Y*). However, I face problems when merging these two datasets.
Below you can see the samples of the datasets
Data A
Country ISO year X
A 1 1990 0
A 1 1991 0
A 1 1992 0
A 1 1993 0
A 1 1994 1
B 2 1990 0
B 2 1991 0
B 2 1992 0
B 2 1993 0
B 2 1994 1
Data B
Country ISO year Y*
A 1 1990 1
A 1 1994 0
B 2 1990 1
B 2 1992 0
So I am interested in getting the variable Y* into my data A. To be more precise, I want to add it by country and year.
Below you see the code that I use to add the Y* variable. I have used this code many times and it works perfectly. I cannot figure out why it doesn't work in this case.
variables <- c("Country", "year", "Y*")
newdata <- merge(DataA, DataB[,variables], by=c("Country","Year"), all.x=TRUE)
When I run this code, I get "newdata" with the variable Y* but with 5 times more rows than Data A.
Question: Is there any relatively simple and efficient ways of doing this properly? Is there something with the structure of dataset B that creates more rows? In any ways, I am grateful for all kinds of suggestions that could solve this problem.
This is the outcome I want to get:
Country ISO year X Y*
A 1 1990 0 1
A 1 1991 0 0
A 1 1992 0 0
A 1 1993 0 0
A 1 1994 1 0
B 2 1990 0 1
B 2 1991 0 0
B 2 1992 0 0
B 2 1993 0 0
B 2 1994 1 0
Using the merge. Make sure to readjust the values of the Y* variable
z <- merge(DataA,DataB, by = intersect(names(DataA), names(DataB)), all = TRUE)
require(dplyr)
left_join(DataA,DataB %>% select(Country,year,Y*), by=c("Country"="Country","year"="year"))
I have a categorical dataset that I am trying to summarize that has inherent differences in the nature of questions that were asked. The data below represent a questionnaire that had standard close-ended questions, but also questions where one could choose multiple answers from a list. "village" and "income" represent close-ended questions. "responsible.1"...etc... represent a list where the respondent either said yes or no to each.
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
j both DLNR NA DEQ NA Public
k regular.income DLNR NA NA NA NA
k regular.income DLNR CRM DEQ Mayor NA
l both DLNR NA NA Mayor NA
j both DLNR CRM NA Mayor NA
m regular.income DLNR NA NA NA Public
What I want is a 3-way table output with "village" and the suite of of "responsible" responsible variables wrapped up into a ftable. This way, I could use the table with numerous R packages for graphs and analyses.
RESPONSIBLE
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
j both 2 1 1 1 1
k regular income 2 1 1 1 0
l both 1 0 0 1 0
m regular income 1 0 0 0 1
as.data.frame(table(village, responsible.1) would get me the first, but I can't figure out how to get the entire thing wrapped up in a nice ftable.
> aggregate(dat[-(1:2)], dat[1:2], function(x) sum(!is.na(x)) )
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
1 j both 2 1 1 1 1
2 l both 1 0 0 1 0
3 k regular.income 2 1 1 1 0
4 m regular.income 1 0 0 0 1
I'm guessing you actually had another grouping vector , perhaps the first "responsible" column?
I don't really understand the sorting rules but reversing the order of the grouping columns may be closer to what you posted:
> aggregate(dat[-(1:2)], dat[2:1], function(x) sum(!is.na(x)) )
INCOME VILLAGE responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
1 both j 2 1 1 1 1
2 regular.income k 2 1 1 1 0
3 both l 1 0 0 1 0
4 regular.income m 1 0 0 0 1
I am interested in learning how a specific factor such as foreign investments behaves 5 years before and after change, e.g. outbreak of civil war.
This is the structure of my data (the factor is not included here):
year country change time
2001 A 0 ? (-1)
2002 A 1 0
2003 A 0 ? (+1)
2004 A 0 ? (+2)
2002 B 0 ? (-2)
2003 B 0 ? (-1)
2004 B 1 0
...
I am seeking to replace the question marks by the respective values in brackets, e.g., "-1" for the year prior to change (t-1) and "+1" for the year following change (t+1). The presence of change is coded with 1.
How would you do this? I am grateful for any suggestions.
> dat <- read.table(text="year country change time
+ 2001 A 0 ?(-1)
+ 2002 A 1 0
+ 2003 A 0 ?(+1)
+ 2004 A 0 ?(+2)
+ 2002 B 0 ?(-2)
+ 2003 B 0 ?(-1)
+ 2004 B 1 0
+ ", header=TRUE)
> with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) )
$A
[1] -1 0 1 2
$B
[1] -2 -1 0
> dat$time <-unlist( with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
>
Slightly less complex would be to use ave instead of unlist(tapply(...))
> dat$time <- with(dat, ave(change, country, FUN=function(x) seq(length(x))-which(x==1) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0