Translating Stata code into R

I am a newbie when it comes to time series data analysis in R, and I am having trouble translating a bit of Stata code into R for a replication project.
The Stata code from the original analysis, with a comment stating its intent, is the following:
#### Delete extra yearc observations with different wartypes #####
drop if yearc==yearc[_n+1] & wartype!="CIVIL"
drop if yearc==yearc[_n-1] & wartype!="CIVIL"
So, translated, I keep the rows in which the country is having a civil war and delete the rows in which there is an interstate war during the same years.
I have named the data object (i.e., the data set)
mywar
in R.
I am assuming I somehow do a conditional ifelse statement, or something similar, such as:
invisible(mywar$yearc <- ifelse(mywar$yearc==n-1 | mywar$yearc==n+1 | mywar$wartype!=civil, NA,
mywar$yearc)) # I am assuming I cannot condition ifelse statements like this; but, this is how I imagine it
mywar <- mywar[!is.na(mywar$yearc),]
EDIT:
So perhaps an example
> b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
> c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
> df <- data.frame(b,c)
> df$j <- ifelse(df$b==n-1 & df$b==n+1 & df$c!="civil", NA, df$b)
> df
b c j
1 1970 inter 1970
2 1970 civil 1970
3 1970 intra 1970
4 1971 civil 1971
5 1982 civil 1982
6 1999 inter 1999
7 1999 civil 1999
8 2000 civil 2000
9 2001 civil 2001
10 2002 civil 2002
So, what I was trying to do was create NAs for rows 1, 3, and 6, as they are duplicate years in my logistic regression on the onset of civil war (I am not interested in inter and intra wars, however defined), so that I can delete these rows from my data set. Here, I just recreated column b. (Note that the country ids are missing from this made-up data. Assume that these ten entries represent the same country, for instance Somalia.) So, I am interested in how to delete these types of rows in a data set with 28,000 rows.

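dplyr is also a good way; you just need to "keep" instead of "drop". The coalesce() calls treat the missing neighbour of the first and last rows as "no match", so those rows are not silently dropped by filter().
library(dplyr)
# keep civil wars, plus any row whose year differs from both neighbouring rows
filter(mywar, wartype == "CIVIL" |
  !(coalesce(yearc == lead(yearc), FALSE) | coalesce(yearc == lag(yearc), FALSE)))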

You're focusing on Stata's if qualifier, but it sounds like you simply want to subset the data frame--hence your use of the drop command in Stata. I also learned Stata before R and was confused since I relied so heavily on the if qualifier in Stata and immediately pursued ifelse in R. But, I later realized that the more relevant technique in R revolved around subsetting. There is a subset() command, but most people prefer subsetting by using brackets (see code below).
In your original question you ask how to do two things:
how to delete observations (i.e. rows) that are coded "inter" or "intra" on column C, and
how to mark them as missing
Sample Data
b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
df <- data.frame(b,c)
df
b c
1 1970 inter
2 1970 civil
3 1970 intra
4 1971 civil
5 1982 civil
6 1999 inter
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
1. Dropping Observations
If you want to delete observations that are not "civil" in column C, you can subset the data frame to only keep those cases that are "civil":
df2 <- df[df$c=="civil",]
df2
b c
2 1970 civil
4 1971 civil
5 1982 civil
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
The above code creates a new data frame, df2, that is a subset of df, but you can also completely overwrite the original data frame:
df <- df[df$c=="civil",]
Or, you can generate a new one and then remove the old one, if you don't like your workspace cluttered with lots of data frames:
df2 <- df[df$c=="civil",]
rm(df)
2. Marking Observations as Missing
If you want to mark observations that are not "civil" in column C, you can do that by overwriting them as NA:
df$c[df$c != "civil"] <- NA
df
b c
1 1970 <NA>
2 1970 civil
3 1970 <NA>
4 1971 civil
5 1982 civil
6 1999 <NA>
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
You could then use listwise deletion (see the na.omit() command) to remove the cases from whatever analyses you're doing.
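As a minimal sketch of that listwise deletion step, using the same sample data (the object name df_complete is just illustrative):

```r
# Rebuild the sample data and mark non-civil rows as missing
b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
df <- data.frame(b, c, stringsAsFactors = FALSE)
df$c[df$c != "civil"] <- NA

# na.omit() drops every row that has an NA in any column
df_complete <- na.omit(df)
nrow(df_complete)
```

Many modelling functions, glm() included, apply na.omit by default through their na.action argument, so an explicit call is mostly useful when you want to inspect the reduced data frame first.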
Side Note
Your original Stata code seeks to subset when column b is a duplicate and column c is "inter" or "intra". However, the way your sample data were presented, this seemed to be a redundant concern, which is why my solution above only looks at column c. However, if you want to match your Stata code as closely as possible, you can do that by
df <- df[order(df$b, df$c),]
df$duplicate <- duplicated(df$b)
df2 <- df[df$c=="civil" & df$duplicate==FALSE,]
which
orders the data chronologically by year and then alphabetically by war
creates a new variable that specifies whether column b is a duplicate year
subsets the data frame to remove undesirable cases.
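For comparison, here is a sketch of a more literal reading of the Stata logic, which drops a row only when its year is shared with another row and its war type is not civil. It assumes all rows belong to a single country, and the extra 1985 "inter" row is made up to show that a non-civil war in a year of its own is kept. With several countries you would have to check for duplicates on country and year together.

```r
# Sample data plus one hypothetical non-civil war in a year of its own (1985)
b <- c(1970, 1970, 1970, 1971, 1982, 1985, 1999, 1999, 2000)
c <- c("inter", "civil", "intra", "civil", "civil", "inter", "inter", "civil", "civil")
df <- data.frame(b, c, stringsAsFactors = FALSE)

# TRUE for every row whose year occurs more than once (first or later occurrence)
dup_year <- duplicated(df$b) | duplicated(df$b, fromLast = TRUE)

# Keep civil wars, plus any row whose year is not duplicated
df2 <- df[df$c == "civil" | !dup_year, ]
df2
```

Unlike the Stata version, this does not depend on the rows being sorted, because duplicated() looks at the whole column and not only at the neighbouring rows.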

Try changing your | operator to &.
Here is some made up data:
R> b <- c(rep(1:4, each=3))
R> c <- 1:length(b)
R> df <- data.frame(c,b)
R> df$j <- ifelse(df$b != 2 & df$b != 3 & df$b != 1, NA, df$b)
R> df
c b j
1 1 1 1
2 2 1 1
3 3 1 1
4 4 2 2
5 5 2 2
6 6 2 2
7 7 3 3
8 8 3 3
9 9 3 3
10 10 4 NA
11 11 4 NA
12 12 4 NA
That last line of your code mywar <- mywar[!is.na(mywar$yearc),] should work fine as well

Related

converting continuous number into a binary value

I have a dataset with a column called BirthYear that holds the years in which people were born, and I need to create a new column that says "young" if their BirthYear is > 1993 and "old" if their BirthYear is < 1993. I've tried using the if function but I can't seem to get it to work. I would appreciate it if you could let me know how to do it, thanks!
I also like cut() for this, especially if you want the result to be a factor.
year <- sample(1989:1999, size=20, replace=TRUE) # Arbitrary vector of years
breaks <- c(-Inf, 1993, Inf) # The 3 bounds of the 2 intervals
labels <- c("old", "young") # The 2 labels of the 2 intervals
binary <- cut(x=year, breaks=breaks, labels=labels, right=FALSE) # right=FALSE puts 1993 in "young"
# Inspect
data.frame(year, binary)
The result:
year binary
1 1993 young
2 1997 young
3 1989 old
4 1998 young
5 1999 young
6 1989 old
7 1994 young
8 1991 old
9 1991 old
10 1991 old
...
This is close to a duplicate, but involves custom labels.
If you have to inspect more than one variable eventually, look at dplyr::case_when().
Another option is dplyr::recode_factor(), as below:
library(dplyr)
set.seed(1)
year <- sample(1970:2005, size=10, replace=TRUE)
year
#[1] 2001 1975 1979 1994 1974 1973 1985 1994 1975 1981
# TRUE means born after 1993, i.e. "Young"
recode_factor(as.factor(year > 1993), 'TRUE' = "Young", 'FALSE' = "Old")
#[1] Young Old Old Young Old Old Old Young Old Old
#Levels: Young Old
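For a single cutoff, base R's vectorised ifelse() is probably the shortest route. This sketch assumes a data frame called people (a made-up name) with a BirthYear column, and it assigns 1993 itself to "old", since the question leaves that year unspecified.

```r
# Hypothetical data; only the BirthYear column name comes from the question
people <- data.frame(BirthYear = c(1989, 1993, 1994, 2001))

# ifelse() tests every element at once, unlike if(), which expects a single value
people$AgeGroup <- ifelse(people$BirthYear > 1993, "young", "old")
people
```

The result is a character column; wrap it in factor() if a factor is needed, as with cut().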

efficiently creating a panel data.frame from cross sections with unharmonized column names

I need to create a panel data set (long format) from multiple yearly (cross-sectional) data sets. The variables of interest have different names in the individual data sets and I need to harmonize them.
I loaded the data frames into a list and now want to manipulate the names using lapply or a chunk of code that allows binding the data frames. I can see several ways of doing this, but would like to use one that works with little code on a large list of data.frames, so that I can do this for several variables and easily change specifics later on.
So what I am looking for is either a way to rename the columns, so that I can simply use bind_rows() from dplyr or an equivalent method, or a way to rename and bind the data sets in one step. Since I need to do this for several variables, it might be safer to keep the two steps apart.
To illustrate, here is an example:
a <- data.frame(id=c("Marc", "Julia", "Rico"), year=2000:2002, laborincome=1:3)
b <- data.frame(id=c("Marc", "Julia", "Rico"), earningsfromlabor=2:4, year=2003:2005)
dflist <- list(a, b)
equivalent_vars <- c("laborincome", "earningsfromlabor")
newnanme <- "income"
Desired result:
data.frame(id=c("Marc", "Julia", "Rico"), income=c(1,2,3,2,3,4), year=2000:2005)
id income year
1 Marc 1 2000
2 Julia 2 2001
3 Rico 3 2002
4 Marc 2 2003
5 Julia 3 2004
6 Rico 4 2005
We could use setnames from data.table
library(data.table)
do.call(rbind, Map(setnames, dflist, old = equivalent_vars, new = newnanme))
# id year income
#1 Marc 2000 1
#2 Julia 2001 2
#3 Rico 2002 3
#4 Marc 2003 2
#5 Julia 2004 3
#6 Rico 2005 4
Or we can use rename() from dplyr with the := operator
library(dplyr)
library(purrr)
map2_df(dflist, equivalent_vars, ~ .x %>%
rename(!! (newnanme) := !! .y)) %>%
select(id, income, year)
# id income year
#1 Marc 1 2000
#2 Julia 2 2001
#3 Rico 3 2002
#4 Marc 2 2003
#5 Julia 3 2004
#6 Rico 4 2005
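If you would rather avoid extra packages, the same two steps (rename, then bind) can be sketched in base R. The names harmonized and panel are illustrative, and newname is spelled correctly here, unlike newnanme in the question.

```r
a <- data.frame(id = c("Marc", "Julia", "Rico"), year = 2000:2002, laborincome = 1:3)
b <- data.frame(id = c("Marc", "Julia", "Rico"), earningsfromlabor = 2:4, year = 2003:2005)
dflist <- list(a, b)
equivalent_vars <- c("laborincome", "earningsfromlabor")
newname <- "income"

# Rename whichever of the equivalent columns each data frame contains
harmonized <- lapply(dflist, function(d) {
  names(d)[names(d) %in% equivalent_vars] <- newname
  d
})

# rbind() on data frames matches columns by name, so column order may differ
panel <- do.call(rbind, harmonized)
panel
```

Because the lookup uses %in%, the order of equivalent_vars does not have to follow the order of the data frames in the list, which may help when the list is long.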

Adding data points in a column by factors in R

The data.frame my_data consists of two columns ("PM2.5" and "years") and around 6,400,000 rows. It holds data points for pollutant levels of "PM2.5" for the years 1999, 2002, 2005 and 2008.
This is what I have done to the data.frame:
{
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
}
I want to find the sum of all PM2.5 levels (i.e. the sum of all data points under PM2.5) for each year. How can I do it?
(The image shows the first 20 rows of the data.frame.
Since the column "years" is sorted, it shows only 1999.)
Say this is your data:
library(plyr) # <- don't forget to tell us what libraries you are using
# give us an easy sample set
my_data <- data.frame(year=sample(c("1999","2002","2005","2008"), 10, replace=T), PM2.5 = rnorm(10,mean = 5))
my_data <- arrange(my_data,year)
my_data$year <- as.factor(my_data$year)
my_data$PM2.5 <- as.numeric(my_data$PM2.5)
> my_data
year PM2.5
1 1999 5.556852
2 2002 5.508820
3 2002 4.836500
4 2002 3.766266
5 2005 6.688936
6 2005 5.025600
7 2005 4.041670
8 2005 4.614784
9 2005 4.352046
10 2008 6.378134
One way to do it (out of many, many ways already shown by a simple google search):
> with(my_data, (aggregate(PM2.5, by=list(year), FUN="sum")))
Group.1 x
1 1999 5.556852
2 2002 14.111586
3 2005 24.723037
4 2008 6.378134
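Two other base R spellings of the same per-year sum, as a sketch on made-up data (the seed and the names totals and totals_vec are arbitrary):

```r
set.seed(42)  # arbitrary seed so the made-up sample is reproducible
my_data <- data.frame(
  year  = sample(c("1999", "2002", "2005", "2008"), 10, replace = TRUE),
  PM2.5 = rnorm(10, mean = 5)
)

# Formula interface: sum PM2.5 within each year, keeping the column names
totals <- aggregate(PM2.5 ~ year, data = my_data, FUN = sum)

# tapply() gives the same sums as a named vector
totals_vec <- tapply(my_data$PM2.5, my_data$year, sum)
totals
```

The formula version returns a data frame with the original column names instead of Group.1 and x, which is usually easier to merge or plot afterwards.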

Grouping and conditions without loop (big data)

I have several observations of the same groups, and for each observation I have a year.
dat = data.frame(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995))
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
For each observation, I would like to know whether another observation of the same group meets given conditions relative to the focal observation, e.g.: "Is there any other observation (than the focal one) that was made during the last 6 years (counting back from the focal year) in the same group?"
Ideally the dataframe should be like that
group year six_years
1 a 2000 1 # there is another member of group a that is year = 1996 (2000-6 = 1994, this value is inside the threshold)
2 a 1996 0
3 a 1975 0
4 b 2002 0
5 b 2010 0
6 b 1980 0
7 c 1990 1
8 c 1986 0
9 c 1995 1
Basically, for each row we should look into the subset for its group and see if any(dat$year == conditions). It is very easy to do with a for loop, but that is of no use here: the dataframe is massive (several million rows) and a loop would take forever.
I am searching for an efficient way with vectorized functions or a fast package.
Thanks !
EDITED
Actually thinking about it you will probably have a lot of recurring year/group combinations, in which case much quicker to pre-calculate the frequencies using count() - which is also a plyr function:
9M rows took ~4 sec
require(plyr)
dat <- data.frame(group = sample(c("a","b","c"),size=9000000,replace=TRUE),
year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),size=9000000,replace=TRUE))
test<-function(y,g,df){
d<-df[df$year>=y-6 &
df$year<y &
df$group== g,]
return(nrow(d))
}
rollup<-function(){
summ<-count(dat) # add a frequency to each combination
return(ddply(summ,.(group,year),transform,t=test(as.numeric(year),group,summ)*freq))
}
system.time(rollup())
user system elapsed
3.44 0.42 3.90
My dataset had too many different groups, and the plyr option proposed by Troy was too slow.
I found a hack (experts would probably say "an ugly one") with package data.table : the idea is to merge the data.table with itself quickly with the fast merge function. It gives every possible combination between a given year of a group and all others years from the same group.
Then proceed with an ifelse for every row with the condition you're looking for.
Finally, aggregate everything with a sum function to know how many times every given years can be found in a given timespan relative to another year.
On my computer, it took a few milliseconds, instead of the hours that plyr would probably have taken.
dat = data.table(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995), key = "group")
Produces this :
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
Then :
z = merge(dat, dat, by = "group", all = T, allow.cartesian = T) # super fast
z$sixyears = ifelse(z$year.y >= z$year.x - 6 & z$year.y < z$year.x, 1, 0) # creates a 0/1 column for our condition
z$sixyears = as.numeric(z$sixyears) # we want to sum this up after
z$year.y = NULL # useless column now
z2 = z[ , list(sixyears = sum(sixyears)), by = list(group, year.x)]
Years with another year of the same group in the last six years are given a "1":
group year x
1 a 1975 0
2 b 1980 0
3 c 1986 0
4 c 1990 1 # e.g. here there is another "c" which was in the timespan 1990 -6 ..
5 c 1995 1 # <== this one. This one too has another reference in the last 6 years, two rows above.
6 a 1996 0
7 a 2000 1
8 b 2002 0
9 b 2010 0
Icing on the cake : it deals with NA seamlessly.
Here's another possibility also using data.table but including diff().
dat <- data.table(group = rep(c("a","b","c"), each = 3),
year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),
key = "group")
valid_case <- subset(dat[,list(valid_case = diff(year)), by=key(dat)],
abs(valid_case)<6)
dat$valid_case <- ifelse(dat$group %in% valid_case$group, 1, 0)
I am not sure how this compares in terms of speed or NA handling (I think it should be fine with NAs since they propagate in diff() and abs()), but I certainly find it more readable. Joins are really fast in data.table, but I'd have to think that avoiding them altogether helps. There's probably a more idiomatic way to do the condition in the ifelse statement using data.table joins. That could potentially speed things up, although in my experience %in% has never been the limiting factor.
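The diff() idea can also be applied row by row in base R, which gives the per-observation flag the question asks for. This is only a sketch: it assumes no year is repeated within a group, because a repeated year would produce a gap of 0 and hide an earlier observation.

```r
dat <- data.frame(group = rep(c("a", "b", "c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995))

# Sort so that each row's predecessor is the closest earlier year in its group
dat <- dat[order(dat$group, dat$year), ]

# Gap to the previous year within the group (NA for the first row of a group)
gap <- ave(dat$year, dat$group, FUN = function(y) c(NA, diff(y)))

# Flag rows whose nearest earlier observation is at most 6 years back
dat$six_years <- as.integer(!is.na(gap) & gap <= 6)
dat
```

Only the nearest earlier year needs checking: if that one is more than 6 years back, every other earlier year is too. The cost is one sort plus one pass per group, with no self-join.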

R - Bootstrap by several column criteria

So what I have is data of cod weights at different ages. This data is taken at several locations over time.
What I would like to create is "weight at age", basically a mean value of weights at a certain age. I want to do this for each location in each year.
However, the ages are not sampled the same way (all old fish caught are measured, while younger fish are subsampled), so I can't just compute a normal average; I would like to bootstrap samples.
The bootstrap should draw 5 random weight values at an age, compute their mean, repeat this 1000 times, and then average the means. The values should be able to be used again (sampling with replacement). This should be done for each age at every AreaCode for every year. Dependent factors: Year-location-Age.
So here's an example of what my data could look like.
df <- data.frame( Year= rep(c(2000:2008),2), AreaCode = c("39G4", "38G5","40G5"), Age = c(0:8), IndWgt = c(rnorm(18, mean=5, sd=3)))
> df
Year AreaCode Age IndWgt
1 2000 39G4 0 7.317489899
2 2001 38G5 1 7.846606144
3 2002 40G5 2 0.009212455
4 2003 39G4 3 6.498688035
5 2004 38G5 4 3.121134937
6 2005 40G5 5 11.283096043
7 2006 39G4 6 0.258404136
8 2007 38G5 7 6.689780137
9 2008 40G5 8 10.180511929
10 2000 39G4 0 5.972879108
11 2001 38G5 1 1.872273650
12 2002 40G5 2 5.552962065
13 2003 39G4 3 4.897882549
14 2004 38G5 4 5.649438631
15 2005 40G5 5 4.525012587
16 2006 39G4 6 2.985615831
17 2007 38G5 7 8.042884181
18 2008 40G5 8 5.847629941
AreaCode contains the different locations, in reality I have 85 different levels. The time series stretches 1991-2013, the ages 0-15. IndWgt contain the weight. My whole data frame has a row length of 185726.
Also, not every age exists for every location and every year. I don't know if this would be a problem; I mention it just so the script isn't based on references to certain row numbers. There are some NA values in the weight column, but I could just remove them beforehand.
I was thinking that I maybe should use replicate, and apply or another plyr function. I've tried to understand the boot function but I don't really know if I would write my arguments under statistics, and in that case how. So yeah, basically I have no idea.
I would be thankful for any help I can get!
How about this with plyr. I think from the question you wanted to bootstrap only the "young" fish weights and use actual means for the older ones. If not, just replace the ifelse() statement with its last argument.
require(plyr)
#cod<-read.csv("cod.csv",header=T) #I loaded your data from csv
bootstrap <- function(Age, IndWgt) {
  # Age is constant within each Year/AreaCode/Age group, so test its first value
  if (Age[1] > 2) {
    mean(IndWgt) # old fish: plain mean
  } else {
    # young fish: mean of 1000 bootstrap means of 5 resampled weights
    mean(replicate(1000, mean(sample(IndWgt, 5, replace = TRUE))))
  }
}
ddply(cod,.(Year,AreaCode,Age),summarize,boot_mean=bootstrap(Age,IndWgt))
Year AreaCode Age boot_mean
1 2000 39G4 0 6.650294
2 2001 38G5 1 4.863024
3 2002 40G5 2 2.724541
4 2003 39G4 3 5.698285
5 2004 38G5 4 4.385287
6 2005 40G5 5 7.904054
7 2006 39G4 6 1.622010
8 2007 38G5 7 7.366332
9 2008 40G5 8 8.014071
PS: If you want to sample all ages in the same way, no need for the function, just:
ddply(cod,.(Year,AreaCode,Age),
summarize,
boot_mean=mean(replicate(1000,mean(sample(IndWgt,5,replace = TRUE)))))
Since you don't provide enough code, it's too hard (lazy) for me to test it properly. You should get your first step using the following code. If you wrap this into replicate, you should get your end result that you can average.
part.result <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = function(x) {
# draw 5 weights with replacement and return their mean
get.em <- sample(x, size = 5, replace = TRUE)
mean(get.em)
})
To handle any missing combination of year/age/location, you could probably add an if statement checking for NULL/NA and producing a warning and/or skipping the iteration.
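Putting the pieces together, a self-contained base R sketch could look like the following. The helper name boot_mean, its arguments and the seed are made up for illustration. The helper indexes with sample.int() because sample(x, ...) treats a single number x as the range 1:x, which would give wrong draws for any group with only one fish.

```r
set.seed(1)  # arbitrary seed for a reproducible illustration
df <- data.frame(Year = rep(2000:2008, 2),
                 AreaCode = c("39G4", "38G5", "40G5"),
                 Age = 0:8,
                 IndWgt = rnorm(18, mean = 5, sd = 3))

# Mean of n_boot bootstrap means, each from n_draw weights drawn with replacement
boot_mean <- function(x, n_draw = 5, n_boot = 1000) {
  x <- x[!is.na(x)]
  # Index with sample.int() because sample(x, ...) misbehaves when length(x) == 1
  mean(replicate(n_boot, mean(x[sample.int(length(x), n_draw, replace = TRUE)])))
}

# One value per Year/AreaCode/Age combination that actually occurs in the data
weight_at_age <- aggregate(IndWgt ~ Year + AreaCode + Age, data = df, FUN = boot_mean)
head(weight_at_age)
```

aggregate() only returns combinations that are present, so missing year/age/location cells are skipped without any special handling, and rows with NA weights are dropped by its default na.action.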
