How to create dummy variables? - r

I have a variable that is a factor:
$ year : Factor w/ 8 levels "2003","2004",..: 4 6 4 2 4 1 3 3 7 2 ...
I would like to create 8 dummy variables, named "2003", "2004" etc that take the value 0 or 1 depending on the value that the variable "year" takes. The nearest I could come up with is
dt1 <- cbind (dt1, model.matrix(~dt1$year - 1) )
But this has two unfortunate consequences:
The dummy variables are named dt1$year2003, not just "2003", "2004" etc.
NA rows seem to be omitted altogether by model.matrix, so the above command fails due to differing lengths when the year variable contains NA.
Of course I can get around these problems with more code, but I like my code to be as concise as possible (within reason) so if anyone can suggest better ways to make the dummy variables I would be obliged.

This is as concise as I could get. The na.action option takes care of the NA values (I would rather do this with an argument than with a global options setting, but I can't see how). The naming of the columns is pretty deeply hard-coded; I don't see any way to override it within model.matrix ...
options(na.action=na.pass)
dt1 <- data.frame(year=factor(c(NA,2003:2005)))
dt2 <- setNames(cbind(dt1,model.matrix(~year-1,data=dt1)),
c("year",levels(dt1$year)))
As pointed out above, you may run into trouble in some contexts with column names that are not legal R variable names.
year 2003 2004 2005
1 <NA> NA NA NA
2 2003 1 0 0
3 2004 0 1 0
4 2005 0 0 1
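If neither the global option nor the hard-coded column names are acceptable, one base R alternative is to skip model.matrix and compare the factor against each of its levels. This is only a sketch of a swapped-in technique, not the approach used above; the name dummies is made up for the illustration.

```r
# Example data from the answer above: one NA plus three years
dt1 <- data.frame(year = factor(c(NA, 2003:2005)))

# One 0/1 column per factor level; sapply() names the columns after the levels.
# Comparing NA with a level gives NA, so NA rows are kept rather than dropped.
dummies <- sapply(levels(dt1$year), function(lv) as.integer(dt1$year == lv))

# cbind() on a data frame does not mangle the names "2003", "2004", ...
dt2 <- cbind(dt1, dummies)
names(dt2)
```

No global option is changed here, and the NA row stays in place with NA in every dummy column. The non-syntactic column names still need backticks or dt2[["2003"]] to be accessed.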

You could use ifelse(), which won't omit NA rows (but I guess you might not count it as being "as concise as possible"):
dt1 <- data.frame(year=factor(rep(2003:2010, 10))) # example data
dt1 <- within(dt1, yr2003<-ifelse(year=="2003", 1, 0))
dt1 <- within(dt1, yr2004<-ifelse(year=="2004", 1, 0))
dt1 <- within(dt1, yr2005<-ifelse(year=="2005", 1, 0))
# ...
head(dt1)
# year yr2003 yr2004 yr2005
# 1 2003 1 0 0
# 2 2004 0 1 0
# 3 2005 0 0 1
# 4 2006 0 0 0
# 5 2007 0 0 0
# 6 2008 0 0 0
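The repeated within() calls can be collapsed into a loop over the factor levels. The following is a minimal sketch under the same example data; the yr prefix simply follows the naming used above.

```r
# Same example data as above
dt1 <- data.frame(year = factor(rep(2003:2010, 10)))

# Add one yrXXXX column per level instead of writing each line by hand
for (lv in levels(dt1$year)) {
  dt1[[paste0("yr", lv)]] <- ifelse(dt1$year == lv, 1, 0)
}

head(dt1, 3)
```

Because ifelse() propagates NA, a missing year would give NA in every dummy column rather than dropping the row.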

The caret package provides a very simple function, dummyVars, for creating dummy variables, which is especially handy when you have more than one factor variable. You have to make sure the target variables are factors, though: if Sales$year is numeric, for example, convert it first with as.factor(Sales$year).
Suppose we have the original dataset 'Sales' as follows:
year Sales Region
1 2010 3695.543 North
2 2010 9873.037 West
3 2008 3579.458 West
4 2005 2788.857 North
5 2005 2952.183 North
6 2008 7255.337 West
7 2005 5237.081 West
8 2010 8987.096 North
9 2008 5545.343 North
10 2008 1809.446 West
Now we can create two dummy variables simultaneously:
library(lattice)
library(ggplot2)
library(caret)
Salesdummy <- dummyVars(~ ., data = Sales, levelsOnly = TRUE)
Sdummy <- predict(Salesdummy, Sales)
The outcome will be:
2005 2008 2010 Sales RegionNorth RegionWest
1 0 0 1 3695.543 1 0
2 0 0 1 9873.037 0 1
3 0 1 0 3579.458 0 1
4 1 0 0 2788.857 1 0
5 1 0 0 2952.183 1 0
6 0 1 0 7255.337 0 1
7 1 0 0 5237.081 0 1
8 0 0 1 8987.096 1 0
9 0 1 0 5545.343 1 0
10 0 1 0 1809.446 0 1

Related

Turn dataframe with a row for each id and law (with begin and end years) into a file with a row for each id and year

I have a df called laws with a row for each law (one for each id):
laws <- data.frame(id=c(1,2,3),beginyear=c(2001,2002,2005),endyear=c(2003,2005,2006), law1=c(0,0,1), law2=c(1,0,1))
from which I want to create a second data frame called idyear with a row for each id and year:
idyear <- data.frame(id=c(rep(1,6),rep(2,6),rep(3,6)), year=(rep(c(2001:2006),3)), law1=c(rep(0,16),1,1), law2=c(1,1,1,rep(0,13),1,1))
How would I efficiently go about writing some code to get the idyear df output from the laws df? The two law variables are indicator variables == 1 if the idyear$year is >= laws$beginyear AND idyear$year is <= laws$endyear.
I am a beginner with R, but I'm willing to try anything (apply, loops, etc.) to get this to work.
1) base. expand.grid creates an 18 x 2 data frame of all id and year combinations, and merge then joins it with laws on id. Next, zero out every law1 and law2 entry whose year is not between beginyear and endyear. Finally, drop the beginyear and endyear columns. No packages are used.
g <- with(laws, expand.grid(year = min(beginyear):max(endyear), id = id))
m <- merge(g, laws)
m[m$year < m$beginyear | m$year > m$endyear, c("law1", "law2")] <- 0
m <- subset(m, select = - c(beginyear, endyear))
# check
identical(m, idyear)
## [1] TRUE
2) magrittr. This is the same solution as (1) except we have used magrittr pipelines to express it. Note the mixture of pipe operators.
library(magrittr)
laws %$%
expand.grid(year = min(beginyear):max(endyear), id = id) %>%
merge(laws) %$%
{ .[year < beginyear | year > endyear, c("law1", "law2")] <- 0; .} %>%
subset(select = - c(beginyear, endyear))
Update: Fixed. Added (2).
A solution using the tidyverse. The final as.data.frame() is optional; it just converts the tbl to a plain data frame.
library(tidyverse)
idyear <- laws %>%
mutate(year = map2(beginyear, endyear, `:`)) %>%
unnest(year) %>%
complete(id, year = full_seq(year, period = 1L), fill = list(law1 = 0L, law2 = 0L)) %>%
select(-beginyear, -endyear) %>%
as.data.frame()
idyear
# id year law1 law2
# 1 1 2001 0 1
# 2 1 2002 0 1
# 3 1 2003 0 1
# 4 1 2004 0 0
# 5 1 2005 0 0
# 6 1 2006 0 0
# 7 2 2001 0 0
# 8 2 2002 0 0
# 9 2 2003 0 0
# 10 2 2004 0 0
# 11 2 2005 0 0
# 12 2 2006 0 0
# 13 3 2001 0 0
# 14 3 2002 0 0
# 15 3 2003 0 0
# 16 3 2004 0 0
# 17 3 2005 1 1
# 18 3 2006 1 1
The mapply function can help.
# Function to expand year between begin and end
gen_data <- function(x_id, x_beginyear, x_endyear, x_law1, x_law2){
df <- data.frame(x_id, x_beginyear:x_endyear, x_law1, x_law2)
df
}
idyearlst <- data.frame()
idyearlst <- rbind(idyearlst, mapply(gen_data, laws$id, laws$beginyear,
laws$endyear, laws$law1, laws$law2))
# Finally convert list to data.frame
idyear <- setNames(do.call(rbind.data.frame, idyearlst), c("id", "year", "law1", "law2"))
The result looks like this (only the years within each begin/end span are generated):
> idyear
id year law1 law2
V1.1 1 2001 0 1
V1.2 1 2002 0 1
V1.3 1 2003 0 1
V2.4 2 2002 0 0
V2.5 2 2003 0 0
V2.6 2 2004 0 0
V2.7 2 2005 0 0
V3.8 3 2005 1 1
V3.9 3 2006 1 1
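The same expansion can be written a little more directly with Map, which always returns a list, followed by a single rbind. This is a sketch rather than a drop-in replacement; the names pieces and active are made up for the example, and like the mapply version it only produces the years inside each law's span.

```r
laws <- data.frame(id = c(1, 2, 3),
                   beginyear = c(2001, 2002, 2005),
                   endyear = c(2003, 2005, 2006),
                   law1 = c(0, 0, 1), law2 = c(1, 0, 1))

# One small data frame per row of laws, covering beginyear..endyear
pieces <- Map(function(id, begin, end, law1, law2) {
  data.frame(id = id, year = begin:end, law1 = law1, law2 = law2)
}, laws$id, laws$beginyear, laws$endyear, laws$law1, laws$law2)

# Stack the pieces into one data frame
active <- do.call(rbind, pieces)
active
```

To get the full 18-row idyear layout, the years outside each span would still have to be added with zeros, for example by merging against an expand.grid of all id and year combinations as in the base solution.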
Kind of an ugly approach, but I think it gets what you're after, using G. Grothendieck's g expand.grid data frame as a base, and your laws dataframe.
new.df <- data.frame(t(apply(g, 1, function(x){
yearspan = laws[laws$id == x['id'], 'beginyear']:laws[laws$id == x['id'], 'endyear']
law1 = laws$law1[laws$id == x['id'] & x['year'] %in% yearspan]
law2 = laws$law2[laws$id == x['id'] & x['year'] %in% yearspan]
x['law1'] = ifelse(length(law1) > 0, law1, 0)
x['law2'] = ifelse(length(law2) > 0, law2, 0)
return(x)
})))
> new.df
id year law1 law2
1 1 2001 0 1
2 1 2002 0 1
3 1 2003 0 1
4 1 2004 0 0
5 1 2005 0 0
6 1 2006 0 0
7 2 2001 0 0
8 2 2002 0 0
9 2 2003 0 0
10 2 2004 0 0
11 2 2005 0 0
12 2 2006 0 0
13 3 2001 0 0
14 3 2002 0 0
15 3 2003 0 0
16 3 2004 0 0
17 3 2005 1 1
18 3 2006 1 1
Libraries:
dplyr (for arrange, not really necessary)
Data:
laws <- data.frame(id=c(1,2,3),
beginyear=c(2001,2002,2005),
endyear=c(2003,2005,2006),
law1=c(0,0,1), law2=c(1,0,1))
g <- with(laws, expand.grid(id = id, year = min(beginyear):max(endyear)))
g <- arrange(g, id)

How to create Time-to-Event variable?

Dear all: I'm thinking of creating a "time to event" variable in R and need your expertise to get it done. Below you can see a small sample of my data. The time variable is in years; it starts at 0 and resets itself after a year in which Event = 1.
In the real data the observation period starts in 1989 but there are some countries (that had not ratified certain conventions before 1989) that come in later on, like the US in the example below. Whenever it starts, the first value for the "time to event" variable should be zero.
Thanks for all suggestions!
Country year Event **Time-to-event**
USA 2000 0 0
USA 2001 0 1
USA 2002 1 2
USA 2003 0 0
USA 2004 0 1
USA 2005 0 2
USA 2006 1 3
USA 2007 0 0
USA 2008 1 1
USA 2009 0 0
USA 2010 0 1
USA 2011 0 2
USA 2012 0 3
We can use ave
i1 <- with(df2, ave(Event, Country, FUN=
function(x) cumsum(c(TRUE, diff(x)<0))))
df2$Time_to_event <- with(df2, ave(i1, i1, Country, FUN= seq_along)-1)
df2$Time_to_event
#[1] 0 1 2 0 1 2 3 0 1 0 1 2 3
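A variant that may be easier to reason about is to start a new spell in the year after each event, using a lagged cumulative sum as the spell id. The sketch below rebuilds the sample data from the question; time_since_event is a hypothetical helper name, not something from the answer above. Unlike the diff-based grouping, it also resets the counter when events occur in two consecutive years.

```r
# Sample data from the question
df2 <- data.frame(Country = "USA", year = 2000:2012,
                  Event = c(0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0))

# Hypothetical helper: years elapsed since the start of the current spell
time_since_event <- function(event, group) {
  # Spell id increases in the row AFTER each event, within each group
  spell <- ave(event, group, FUN = function(x) cumsum(c(0, head(x, -1))))
  # Count rows within each group/spell combination, starting at 0
  ave(seq_along(event), group, spell, FUN = seq_along) - 1
}

df2$Time_to_event <- time_since_event(df2$Event, df2$Country)
df2$Time_to_event
```

The data are assumed to be sorted by year within each country, since the counter is based on row position.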
One might use something like the following. Note that count_until(x) is always equal to rev(count_since(rev(x))):
count_since<-function(trigger)
{
i <- seq_along(trigger)
(i - cummax(i*trigger))*cummax(trigger)
}
count_until<-function(x)rev(count_since(rev(x)))
> count_until(1:10%%5==0)
[1] 4 3 2 1 0 4 3 2 1 0

Merge/Append in R – how to add variables without generating more rows

I have two datasets, data A and data B. Data A contains 30,000 observations while data B has 10,000 observations. Both datasets cover 156 countries, identified by their ISO numbers.
I want to add some of the variables in data B to data A (let's say the variable Y*). However, I face problems when merging these two datasets.
Below you can see the samples of the datasets
Data A
Country ISO year X
A 1 1990 0
A 1 1991 0
A 1 1992 0
A 1 1993 0
A 1 1994 1
B 2 1990 0
B 2 1991 0
B 2 1992 0
B 2 1993 0
B 2 1994 1
Data B
Country ISO year Y*
A 1 1990 1
A 1 1994 0
B 2 1990 1
B 2 1992 0
So I am interested in getting the variable Y* into my data A. To be more precise, I want to add it by country and year.
Below you see the code that I use to add the Y* variable. I have used this code many times and it works perfectly. I cannot figure out why it doesn't work in this case.
variables <- c("Country", "year", "Y*")
newdata <- merge(DataA, DataB[,variables], by=c("Country","year"), all.x=TRUE)
When I run this code, I get "newdata" with the variable Y* but with 5 times more rows than Data A.
Question: Is there a relatively simple and efficient way of doing this properly? Is there something in the structure of dataset B that creates more rows? In any case, I am grateful for any suggestions that could solve this problem.
This is the outcome I want to get:
Country ISO year X Y*
A 1 1990 0 1
A 1 1991 0 0
A 1 1992 0 0
A 1 1993 0 0
A 1 1994 1 0
B 2 1990 0 1
B 2 1991 0 0
B 2 1992 0 0
B 2 1993 0 0
B 2 1994 1 0
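Using merge. Make sure to readjust the values of the Y* variable afterwards, since rows without a match come back as NA rather than 0:
z <- merge(DataA, DataB, by = intersect(names(DataA), names(DataB)), all = TRUE)
Alternatively, with dplyr (note the backticks around the non-syntactic name Y*):
library(dplyr)
left_join(DataA, DataB %>% select(Country, year, `Y*`), by = c("Country", "year"))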
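A left merge only adds rows when the key combination occurs more than once in the right-hand table, so it is worth checking data B for duplicated Country/year pairs before merging. The sketch below rebuilds the sample data from the question; the column Y stands in for Y*, and the duplicate check is an assumption about what goes wrong in the real data, not something the sample shows.

```r
DataA <- data.frame(Country = rep(c("A", "B"), each = 5),
                    ISO = rep(1:2, each = 5),
                    year = rep(1990:1994, 2),
                    X = rep(c(0, 0, 0, 0, 1), 2))
DataB <- data.frame(Country = c("A", "A", "B", "B"),
                    ISO = c(1L, 1L, 2L, 2L),
                    year = c(1990L, 1994L, 1990L, 1992L),
                    Y = c(1, 0, 1, 0))

# 0 means no Country/year pair is repeated in DataB; anything else
# would multiply the matching rows of DataA in the merge below
anyDuplicated(DataB[c("Country", "year")])

# Left merge on the two keys only, then turn unmatched NAs into 0
newdata <- merge(DataA, DataB[c("Country", "year", "Y")],
                 by = c("Country", "year"), all.x = TRUE)
newdata$Y[is.na(newdata$Y)] <- 0
newdata
```

If anyDuplicated() reports a duplicate in the real data B, collapsing it to one row per country and year first (for example with unique() or aggregate()) should keep the merged result at the same number of rows as data A.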

Reformat categorical data in R

I have a categorical dataset that I am trying to summarize that has inherent differences in the nature of questions that were asked. The data below represent a questionnaire that had standard close-ended questions, but also questions where one could choose multiple answers from a list. "village" and "income" represent close-ended questions. "responsible.1"...etc... represent a list where the respondent either said yes or no to each.
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
j both DLNR NA DEQ NA Public
k regular.income DLNR NA NA NA NA
k regular.income DLNR CRM DEQ Mayor NA
l both DLNR NA NA Mayor NA
j both DLNR CRM NA Mayor NA
m regular.income DLNR NA NA NA Public
What I want is a 3-way table output with "village" and the suite of "responsible" variables wrapped up into an ftable. This way, I could use the table with numerous R packages for graphs and analyses.
RESPONSIBLE
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
j both 2 1 1 1 1
k regular income 2 1 1 1 0
l both 1 0 0 1 0
m regular income 1 0 0 0 1
as.data.frame(table(village, responsible.1)) would get me the first, but I can't figure out how to get the entire thing wrapped up in a nice ftable.
> aggregate(dat[-(1:2)], dat[1:2], function(x) sum(!is.na(x)) )
VILLAGE INCOME responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
1 j both 2 1 1 1 1
2 l both 1 0 0 1 0
3 k regular.income 2 1 1 1 0
4 m regular.income 1 0 0 0 1
I'm guessing you actually had another grouping vector, perhaps the first "responsible" column?
I don't really understand the sorting rules but reversing the order of the grouping columns may be closer to what you posted:
> aggregate(dat[-(1:2)], dat[2:1], function(x) sum(!is.na(x)) )
INCOME VILLAGE responsible.1 responsible.2 responsible.3 responsible.4 responsible.5
1 both j 2 1 1 1 1
2 regular.income k 2 1 1 1 0
3 both l 1 0 0 1 0
4 regular.income m 1 0 0 0 1
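If an actual 3-way table is wanted rather than a data frame of counts, one possible route is to reshape the answers to long form and let xtabs do the counting. This is a sketch under the assumption that a non-NA entry means "yes"; the names long, question and answered are made up for the example.

```r
# Sample data from the question
dat <- data.frame(
  VILLAGE = c("j", "k", "k", "l", "j", "m"),
  INCOME = c("both", "regular.income", "regular.income",
             "both", "both", "regular.income"),
  responsible.1 = "DLNR",
  responsible.2 = c(NA, NA, "CRM", NA, "CRM", NA),
  responsible.3 = c("DEQ", NA, "DEQ", NA, NA, NA),
  responsible.4 = c(NA, NA, "Mayor", "Mayor", "Mayor", NA),
  responsible.5 = c("Public", NA, NA, NA, NA, "Public")
)

# Long form: one row per respondent and question, 1 if answered, 0 if NA
resp <- dat[-(1:2)]
long <- data.frame(
  VILLAGE = rep(dat$VILLAGE, times = ncol(resp)),
  INCOME = rep(dat$INCOME, times = ncol(resp)),
  question = rep(names(resp), each = nrow(dat)),
  answered = as.integer(!is.na(unlist(resp)))
)

# 3-way table of "yes" counts, flattened for printing
tab <- xtabs(answered ~ VILLAGE + INCOME + question, data = long)
ftable(tab)
```

Village and income combinations that never occur in the data show up as rows of zeros in the flattened table.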

Time-series data: How to code t-1 and t+1 based on value of specific variable in t0?

I am interested in learning how a specific factor such as foreign investments behaves 5 years before and after change, e.g. outbreak of civil war.
This is the structure of my data (the factor is not included here):
year country change time
2001 A 0 ? (-1)
2002 A 1 0
2003 A 0 ? (+1)
2004 A 0 ? (+2)
2002 B 0 ? (-2)
2003 B 0 ? (-1)
2004 B 1 0
...
I am seeking to replace the question marks by the respective values in brackets, e.g., "-1" for the year prior to change (t-1) and "+1" for the year following change (t+1). The presence of change is coded with 1.
How would you do this? I am grateful for any suggestions.
> dat <- read.table(text="year country change time
+ 2001 A 0 ?(-1)
+ 2002 A 1 0
+ 2003 A 0 ?(+1)
+ 2004 A 0 ?(+2)
+ 2002 B 0 ?(-2)
+ 2003 B 0 ?(-1)
+ 2004 B 1 0
+ ", header=TRUE)
> with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) )
$A
[1] -1 0 1 2
$B
[1] -2 -1 0
> dat$time <-unlist( with(dat, tapply(change, country,
function(x) seq(length(x))-which(x==1) ) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
>
Slightly less complex would be to use ave instead of unlist(tapply(...))
> dat$time <- with(dat, ave(change, country, FUN=function(x) seq(length(x))-which(x==1) ) )
> dat
year country change time
1 2001 A 0 -1
2 2002 A 1 0
3 2003 A 0 1
4 2004 A 0 2
5 2002 B 0 -2
6 2003 B 0 -1
7 2004 B 1 0
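The position-based version relies on every country having exactly one change year and no gaps between years. A variant that works from the calendar year instead is sketched below; event_year is a name made up for the example, and the assumption is still that each country has at most one change.

```r
dat <- data.frame(year = c(2001:2004, 2002:2004),
                  country = c(rep("A", 4), rep("B", 3)),
                  change = c(0, 1, 0, 0, 0, 0, 1))

# Year of the change, repeated on every row of the same country
# (NA for a country that never has change == 1)
event_year <- with(dat, ave(ifelse(change == 1, year, NA), country,
                            FUN = function(y) y[!is.na(y)][1]))

# Signed distance in years from the change
dat$time <- dat$year - event_year
dat
```

Because the distance is computed from year rather than row position, a missing year in the panel does not shift the values, and a country without any change simply gets NA.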
