Dear all: I'm thinking of creating a "time to event" variable in R and need your expertise to get it done. Below you can see a small sample of my data. The time variable is in years; it starts at 0 and resets itself in the year after Event = 1.
In the real data the observation period starts in 1989 but there are some countries (that had not ratified certain conventions before 1989) that come in later on, like the US in the example below. Whenever it starts, the first value for the "time to event" variable should be zero.
Thanks for all suggestions!
Country year Event **Time-to-event**
USA 2000 0 0
USA 2001 0 1
USA 2002 1 2
USA 2003 0 0
USA 2004 0 1
USA 2005 0 2
USA 2006 1 3
USA 2007 0 0
USA 2008 1 1
USA 2009 0 0
USA 2010 0 1
USA 2011 0 2
USA 2012 0 3
We can use ave
i1 <- with(df2, ave(Event, Country, FUN=
function(x) cumsum(c(TRUE, diff(x)<0))))
df2$Time_to_event <- with(df2, ave(i1, i1, Country, FUN= seq_along)-1)
df2$Time_to_event
#[1] 0 1 2 0 1 2 3 0 1 0 1 2 3
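One caveat with diff(x) < 0: two events in consecutive years would not start a new spell, because the difference between them is 0. Below is a minimal sketch of a variant that numbers spells by the count of earlier events instead. The helper name time_to_event is made up for this illustration, and df2 is rebuilt here from the sample in the question.

```r
# Sample data rebuilt from the question (single country only).
df2 <- data.frame(
  Country = "USA",
  year    = 2000:2012,
  Event   = c(0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0)
)

# Hypothetical helper: years elapsed since the start of the current spell.
time_to_event <- function(event, group) {
  # Spell id = number of events strictly before the current row, per group.
  spell <- ave(event, group, FUN = function(x) cumsum(c(0, head(x, -1))))
  # Position within each (group, spell) block, starting at 0.
  ave(seq_along(event), group, spell, FUN = seq_along) - 1
}

df2$Time_to_event <- time_to_event(df2$Event, df2$Country)
df2$Time_to_event

# Back-to-back events: the counter resets after each of them.
consecutive <- time_to_event(c(0, 1, 1, 0), rep("X", 4))
```

Like the answer above, this counts rows rather than calendar years, so it assumes one row per country and year with no gaps.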
Since count_until(x) is always equal to rev(count_since(rev(x))), one might use something like this:
count_since<-function(trigger)
{
i <- seq_along(trigger)
(i - cummax(i*trigger))*cummax(trigger)
}
count_until<-function(x)rev(count_since(rev(x)))
> count_until(1:10%%5==0)
[1] 4 3 2 1 0 4 3 2 1 0
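To see what the two helpers do on an irregular pattern, here is a small sketch with a made-up trigger vector (the vector x is an assumption for illustration; the functions are repeated so the block runs on its own).

```r
count_since <- function(trigger) {
  i <- seq_along(trigger)
  # Distance to the most recent TRUE; forced to 0 before the first TRUE.
  (i - cummax(i * trigger)) * cummax(trigger)
}
count_until <- function(x) rev(count_since(rev(x)))

x <- c(FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)
since <- count_since(x)   # 0 0 0 1 2 0 1
until <- count_until(x)   # 2 1 0 2 1 0 0
```

Note that 0 is ambiguous here: it marks the trigger itself, but also positions before the first trigger (count_since) or after the last one (count_until).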
I have two datasets: data A and data B. Data A contains 30,000 observations while data B has 10,000 observations. Both datasets cover 156 countries, identified by their ISO number.
I want to add some of the variables in data B to data A (let's say the variable Y*). However, I face problems when merging these two datasets.
Below you can see the samples of the datasets
Data A
Country ISO year X
A 1 1990 0
A 1 1991 0
A 1 1992 0
A 1 1993 0
A 1 1994 1
B 2 1990 0
B 2 1991 0
B 2 1992 0
B 2 1993 0
B 2 1994 1
Data B
Country ISO year Y*
A 1 1990 1
A 1 1994 0
B 2 1990 1
B 2 1992 0
So I am interested in getting the variable Y* into my data A. To be more precise, I want to add it by country and year.
Below you see the code that I use to add the Y* variable. I have used this code many times and it works perfectly. I cannot figure out why it doesn't work in this case.
variables <- c("Country", "year", "Y*")
newdata <- merge(DataA, DataB[,variables], by=c("Country","year"), all.x=TRUE)
When I run this code, I get "newdata" with the variable Y* but with 5 times more rows than Data A.
Question: Is there a relatively simple and efficient way of doing this properly? Is there something in the structure of dataset B that creates the extra rows? In any case, I am grateful for any suggestions that could solve this problem.
This is the outcome I want to get:
Country ISO year X Y*
A 1 1990 0 1
A 1 1991 0 0
A 1 1992 0 0
A 1 1993 0 0
A 1 1994 1 0
B 2 1990 0 1
B 2 1991 0 0
B 2 1992 0 0
B 2 1993 0 0
B 2 1994 1 0
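Use merge, and make sure to readjust the values of the Y* variable afterwards (rows of Data A without a match in Data B get NA):
z <- merge(DataA, DataB, by = intersect(names(DataA), names(DataB)), all = TRUE)
Or with dplyr (a non-syntactic column name such as Y* has to be wrapped in backticks):
library(dplyr)
left_join(DataA, DataB %>% select(Country, year, `Y*`), by = c("Country", "year"))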
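A left join returns more rows than Data A only when a key combination occurs more than once in Data B (or when the join runs on fewer keys than intended). Below is a minimal sketch of checking for that before merging and then filling the unmatched rows with 0, as in the desired output. The sample frames are rebuilt from the question, and the column is called Y instead of Y* purely to keep the name syntactic.

```r
# Sample data rebuilt from the question; "Y" stands in for "Y*".
DataA <- data.frame(
  Country = rep(c("A", "B"), each = 5),
  ISO     = rep(1:2, each = 5),
  year    = rep(1990:1994, times = 2),
  X       = rep(c(0, 0, 0, 0, 1), times = 2)
)
DataB <- data.frame(
  Country = c("A", "A", "B", "B"),
  ISO     = c(1L, 1L, 2L, 2L),
  year    = c(1990L, 1994L, 1990L, 1992L),
  Y       = c(1, 0, 1, 0)
)

keys <- c("Country", "year")

# Each key combination must be unique in DataB, otherwise rows multiply.
stopifnot(anyDuplicated(DataB[keys]) == 0)

newdata <- merge(DataA, DataB[c(keys, "Y")], by = keys, all.x = TRUE)
newdata <- newdata[order(newdata$Country, newdata$year), ]

# Rows without a match come back as NA; recode them to 0.
newdata$Y[is.na(newdata$Y)] <- 0
```

If the check fails on the real data, DataB[duplicated(DataB[keys]), ] lists the offending rows.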
I have unbalanced panel data with a binary variable indicating whether the event occurred or not. I want to control for time dependency. The way to do this is to control for the time that has elapsed since the event last occurred.
Here is a reproducible example, with a vector of what I am trying to achieve. Thanks!
id year onset time_since_event
1 1 1989 0 0
2 1 1990 0 1
3 1 1991 1 2
4 1 1992 0 0
5 1 1993 0 1
6 1 1994 0 2
7 2 1989 0 0
8 2 1990 1 1
9 2 1991 0 0
10 2 1992 1 1
11 2 1993 0 2
12 2 1994 0 3
13 3 1991 0 0
14 3 1992 0 1
15 3 1993 0 2
id <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3)
year <- c(1989,1990,1991,1992,1993,1994,1989,1990,1991,1992,1993,1994,1991,1992,1993)
onset <- c(0,0,1,0,0,0,0,1,0,1,0,0,0,0,0)
time_since_event <- c(0,1,2,0,1,2,0,1,0,1,2,3,0,1,2) # what I want to create (values as in the table above)
df <- data.frame(id, year, onset, time_since_event)
Try this:
id <- c(1,1,1,1,1,2,2,2,2,3,3)
year <- c(1989,1990,1991,1992,1993,1989,1990,1991,1992,1991,1992)
onset <- c(0,0,1,0,0,0,1,0,1,0,0)
period <- c(0, cumsum(onset)[-length(onset)])
time_since_event <- ave(year, id, period, FUN=function(x) x-x[1])
df <- data.frame(id, year, onset, time_since_event)
I created a variable called period which describes the different periods until each event. It doesn't matter that the period numbers carry over from one id to the next, since we're going to group by id and by period, so the count will start over if it's a new id or a new period.
Using the ave() function allows us to assign values within each grouping. Here we're analyzing year based on the grouping variables id and period. The function I used at the end just subtracts the first value from the current value within each grouping.
I am interested in learning how a specific factor such as foreign investments behaves 5 years before and after change, e.g. outbreak of civil war.
This is the structure of my data (the factor is not included here):
year country change time
2001 A 0 ? (-1)
2002 A 1 0
2003 A 0 ? (+1)
2004 A 0 ? (+2)
2002 B 0 ? (-2)
2003 B 0 ? (-1)
2004 B 1 0
...
I am seeking to replace the question marks by the respective values in brackets, e.g., "-1" for the year prior to change (t-1) and "+1" for the year following change (t+1). The presence of change is coded with 1.
How would you do this? I am grateful for any suggestions.
> dat <- read.table(text="year country change time
+ 2001 A 0 ?(-1)
+ 2002 A 1 0
+ 2003 A 0 ?(+1)
+ 2004 A 0 ?(+2)
+ 2002 B 0 ?(-2)
+ 2003 B 0 ?(-1)
+ 2004 B 1 0
+ ", header=TRUE)
> with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) )
$A
[1] -1 0 1 2
$B
[1] -2 -1 0
> dat$time <-unlist( with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
Slightly less complex would be to use ave instead of unlist(tapply(...))
> dat$time <- with(dat, ave(change, country, FUN=function(x) seq(length(x))-which(x==1) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
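Both versions above assume exactly one change per country and one row per year: which(x==1) returns nothing for a country without a change and several positions for a country with more than one. Below is a hedged sketch that works from the year column instead of row positions and returns NA when a country has no change. The helper name rel_time and the choice to use the first change are assumptions made for this illustration.

```r
dat <- data.frame(
  year    = c(2001:2004, 2002:2004),
  country = c("A", "A", "A", "A", "B", "B", "B"),
  change  = c(0, 1, 0, 0, 0, 0, 1)
)

# Hypothetical helper: years relative to the first change in one country.
rel_time <- function(year, change) {
  onset <- year[change == 1]
  if (length(onset) == 0) return(rep(NA_real_, length(year)))
  year - onset[1]
}

# ave() over row indices keeps the result in the original row order.
dat$time <- ave(seq_len(nrow(dat)), dat$country,
                FUN = function(i) rel_time(dat$year[i], dat$change[i]))

# Keep only the window of 5 years before and after the change.
window <- subset(dat, !is.na(time) & abs(time) <= 5)
```

Because the offset is computed from year, a missing year inside a country shows up as a gap in time rather than being silently closed.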
I have a variable that is a factor :
$ year : Factor w/ 8 levels "2003","2004",..: 4 6 4 2 4 1 3 3 7 2 ...
I would like to create 8 dummy variables, named "2003", "2004" etc that take the value 0 or 1 depending on the value that the variable "year" takes. The nearest I could come up with is
dt1 <- cbind (dt1, model.matrix(~dt1$year - 1) )
But this has two unfortunate consequences:
The dummy variables are named dt1$year2003, not just "2003", "2004" etc
It seems that NA rows are omitted altogether by model.matrix (so the above command fails due to different lengths when NA is present in the year variable).
Of course I can get around these problems with more code, but I like my code to be as concise as possible (within reason) so if anyone can suggest better ways to make the dummy variables I would be obliged.
This is as concise as I could get. The na.action option takes care of the NA values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of columns is pretty deeply hard-coded, don't see any way to override it within model.matrix ...
options(na.action=na.pass)
dt1 <- data.frame(year=factor(c(NA,2003:2005)))
dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
c("year",levels(dt1$year)))
As pointed out above, you may run into trouble in some contexts with column names that are not legal R variable names.
year 2003 2004 2005
1 <NA> NA NA NA
2 2003 1 0 0
3 2004 0 1 0
4 2005 0 0 1
You could use ifelse() which won't omit na rows (but I guess you might not count it as being "as concise as possible"):
dt1 <- data.frame(year=factor(rep(2003:2010, 10))) # example data
dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
# ...
head(dt1)
# year yr2003 yr2004 yr2005
# 1 2003 1 0 0
# 2 2004 0 1 0
# 3 2005 0 0 1
# 4 2006 0 0 0
# 5 2007 0 0 0
# 6 2008 0 0 0
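The repeated ifelse() lines can be generated from the factor levels in one step. Below is a minimal sketch using only base R; it keeps NA rows (as NA) and names the columns after the levels. The object names dummies and dt2 are chosen for this illustration.

```r
dt1 <- data.frame(year = factor(c(NA, 2003:2005)))

# One 0/1 column per level; sapply() names the columns after the levels.
dummies <- sapply(levels(dt1$year), function(lv) as.integer(dt1$year == lv))

# cbind() on a data frame keeps the non-syntactic names "2003", "2004", ...
dt2 <- cbind(dt1, dummies)
names(dt2)
```

Columns named like numbers have to be accessed with quotes or backticks, e.g. dt2[["2003"]] or dt2$`2003`.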
library(caret) provides a very simple function (dummyVars) to create dummy variables, especially when you have more than one factor variable. But you have to make sure the target variables are factors, e.g. if your Sales$year is numeric, you have to convert it to a factor: as.factor(Sales$year)
Suppose we have the original dataset 'Sales' as follows:
year Sales Region
1 2010 3695.543 North
2 2010 9873.037 West
3 2008 3579.458 West
4 2005 2788.857 North
5 2005 2952.183 North
6 2008 7255.337 West
7 2005 5237.081 West
8 2010 8987.096 North
9 2008 5545.343 North
10 2008 1809.446 West
Now we can create two dummy variables simultaneously:
>library(lattice)
>library(ggplot2)
>library(caret)
>Salesdummy <- dummyVars(~., data = Sales, levelsOnly = TRUE)
>Sdummy <- predict(Salesdummy, Sales)
The outcome will be:
2005 2008 2010 Sales RegionNorth RegionWest
1 0 0 1 3695.543 1 0
2 0 0 1 9873.037 0 1
3 0 1 0 3579.458 0 1
4 1 0 0 2788.857 1 0
5 1 0 0 2952.183 1 0
6 0 1 0 7255.337 0 1
7 1 0 0 5237.081 0 1
8 0 0 1 8987.096 1 0
9 0 1 0 5545.343 1 0
10 0 1 0 1809.446 0 1
It looks like the website is blocking direct access from Curl.
library(XML)
library(RCurl)
theurl <- "http://www.london2012.com/medals/medal-count/"
page <- getURL(theurl)
page # fail
[1] "<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don't have permission to access \"http://www.london2012.com/medals/medal-count/\" on this server.<P>\nReference #18.358a503f.1343590091.c056ae2\n</BODY>\n</HTML>\n"
Let's try to see if we can access it directly from the Table.
page <- readHTMLTable(theurl)
No luck there:
Error in htmlParse(doc) : error in creating parser for http://www.london2012.com/medals/medal-count/
How would you go about getting this table into R?
Update: in response to comments and some toying around, faking a user agent string worked to get the content. But readHTMLTable returns an error.
page <- getURLContent(theurl, useragent="Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2")
It looks like this works:
rr <- readHTMLTable(page,header=FALSE)
rr2 <- setNames(rr[[1]],
c("rank","country","gold","silver","bronze","junk","total"))
rr3 <- subset(rr2,select=-junk)
## oops, numbers all got turned into factors ...
tmpf <- function(x) { as.numeric(as.character(x)) }
rr3[,-2] <- sapply(rr3[,-2],tmpf)
head(rr3)
## rank country gold silver bronze total
## 1 1 People's Republic of China 6 4 2 12
## 2 2 United States of America 3 5 3 11
## 3 3 Italy 2 3 2 7
## 4 4 Republic of Korea 2 1 2 5
## 5 5 France 2 1 1 4
## 6 6 Democratic People's Republic of Korea 2 0 1 3
with(rr3,dotchart(total,country))
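The detour through as.character() in tmpf matters: calling as.numeric() directly on a factor returns the internal level codes, not the numbers printed on screen. A small illustration with a made-up factor:

```r
f <- factor(c("6", "3", "12"))

codes  <- as.numeric(f)                # level codes: 3 2 1
values <- as.numeric(as.character(f))  # the actual numbers: 6 3 12
```

The levels are sorted as strings ("12", "3", "6"), which is why the codes bear no relation to the values.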
Here is what I came up with using regular expressions. It is very specific to this page and definitely not better than the readHTMLTable approach in the other answer; it is more to show that you can go quite far with text mining in R:
# file <- "~/Documents/R/medals.html"
# page <- readChar(file,file.info(file)$size)
library(RCurl)
theurl <- "http://www.london2012.com/medals/medal-count/"
page <- getURLContent(theurl, useragent="Mozilla/5.0 (Windows NT 6.1; rv:15.0) Gecko/20120716 Firefox/15.0a2")
# Remove html tags:
page <- gsub("<(.|\n)*?>","",page)
# Remove newlines and tabs:
page <- gsub("\\n","",page)
# match table:
page <- regmatches(page,regexpr("(?<=Total).*(?=Detailed)",page,perl=TRUE))
# Extract country+medals+rank
codes <-regmatches(page,gregexpr("\\d+[^\\r]*\\d+",page,perl=TRUE))[[1]]
codes <- codes[seq(1,length(codes)-2,by=2)]
# Extract country and medals:
Names <- gsub("\\d","",codes)
Medals <- sapply(regmatches(codes,gregexpr("\\d",codes)),function(x)x[(length(x)-2):length(x)])
# Create data frame:
data.frame(
Country = Names,
Gold = as.numeric(Medals[1,]),
Silver = as.numeric(Medals[2,]),
Bronze = as.numeric(Medals[3,]))
And the output:
Country Gold Silver Bronze
1 People's Republic of China 6 4 2
2 United States of America 3 5 3
3 Italy 2 3 2
4 Republic of Korea 2 1 2
5 France 2 1 1
6 Democratic People's Republic of Korea 2 0 1
7 Kazakhstan 2 0 0
8 Australia 1 1 1
9 Brazil 1 1 1
10 Hungary 1 1 1
11 Netherlands 1 1 0
12 Russian Federation 1 0 3
13 Georgia 1 0 0
14 South Africa 1 0 0
15 Japan 0 2 3
16 Great Britain 0 1 1
17 Colombia 0 1 0
18 Cuba 0 1 0
19 Poland 0 1 0
20 Romania 0 1 0
21 Taipei (Chinese Taipei) 0 1 0
22 Azerbaijan 0 0 1
23 Belgium 0 0 1
24 Canada 0 0 1
25 Republic of Moldova 0 0 1
26 Norway 0 0 1
27 Serbia 0 0 1
28 Slovakia 0 0 1
29 Ukraine 0 0 1
30 Uzbekistan 0 0 1