How do I avoid a slow loop with large data set? - r

Consider this data set:
> DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
+ country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
+ action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
+ signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
+ ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002))
> DATA
Agreement_number country action signature_date ratification_date
1 Canada signature 2000 NA
1 Canada ratification NA 2001
1 USA signature 2000 NA
1 USA ratification NA 2002
2 Canada signature 2001 NA
2 Canada ratification NA 2001
2 USA signature 2002 NA
2 USA ratification NA 2002
As you can see, half of the rows carry duplicate information. For a small data set like this it is easy to remove the duplicates: I could use the coalesce function (dplyr package), drop the "action" column and then delete the redundant rows, though there are many other ways. The final result should look like this:
> DATA <- data.frame( Agreement_number = c(1,1,2,2),
+ country = c("Canada", "USA", "Canada","USA"),
+ signature_date = c(2000,2000,2001,2002),
+ ratification_date = c(2001, 2002, 2001, 2002))
> DATA
Agreement_number country signature_date ratification_date
1 Canada 2000 2001
1 USA 2000 2002
2 Canada 2001 2001
2 USA 2002 2002
The problem is that my real data set is MUCH bigger (102000 x 270) and has many more variables. The real data is also more irregular and has more missing values. The coalesce function seems very slow, and the best loop I could write so far still takes 5-10 minutes to run.
Is there a simple way of doing this that would be faster? I have the feeling that there must be some function in R for this kind of operation, but I couldn't find one.

I think you need dcast. The version in the data.table library calls itself "fast", and in my experience, it is speedy on large datasets.
First, let's create one column that holds either the signature_date or the ratification_date, depending on the action:
library(data.table)
setDT(DATA)[, date := ifelse(action == "ratification", ratification_date, signature_date)]
Now, let's cast it so that the actions become the columns and the value is the date:
wide <- dcast(DATA, Agreement_number + country ~ action, value.var = 'date')
So wide looks like this
Agreement_number country ratification signature
1 1 Canada 2001 2000
2 1 USA 2002 2000
3 2 Canada 2001 2001
4 2 USA 2002 2002
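The cast result names its columns after the action values, whereas the question asks for signature_date and ratification_date. A minimal sketch of one way to get there, assuming the toy DATA from the question (rebuilt here so the snippet runs on its own) and using data.table's setnames() and setcolorder():

```r
library(data.table)

# Toy data from the question
DATA <- data.table(
  Agreement_number  = c(1, 1, 1, 1, 2, 2, 2, 2),
  country           = rep(c("Canada", "Canada", "USA", "USA"), 2),
  action            = rep(c("signature", "ratification"), 4),
  signature_date    = c(2000, NA, 2000, NA, 2001, NA, 2002, NA),
  ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002)
)

# One date column, then long-to-wide as in the answer
DATA[, date := ifelse(action == "ratification", ratification_date, signature_date)]
wide <- dcast(DATA, Agreement_number + country ~ action, value.var = "date")

# Rename and reorder by reference to match the layout requested in the question
setnames(wide, c("signature", "ratification"),
         c("signature_date", "ratification_date"))
setcolorder(wide, c("Agreement_number", "country",
                    "signature_date", "ratification_date"))
wide
```

Both setnames() and setcolorder() modify the table in place, so no copy of the wide table is made.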

The OP has said that the production data has about 100k rows x 270 columns and that speed is a concern. Therefore, I suggest using data.table.
I'm aware that Harland has also proposed data.table and dcast(), but the solution below takes a different approach: it brings the rows into the correct order and copies the ratification_date to the signature row. After some clean-up we get the desired result.
library(data.table)
# coerce to data.table,
# make sure that the actions are ordered properly, not alphabetically
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# order the rows to make sure that signature row and ratification row are
# subsequent for each agreement and country
setorder(DATA, Agreement_number, country, action)
# copy the ratification date from the row below but only within each group
result <- DATA[, ratification_date := shift(ratification_date, type = "lead"),
               by = c("Agreement_number", "country")][
  # keep only signature rows, remove action column
  action == "signature"][, action := NULL]
result
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
Data
The OP has mentioned that his production data has 270 columns. To simulate this I've added two dummy columns:
set.seed(123L)
DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
dummy1 = rep(sample(4), each = 2L),
dummy2 = rep(sample(LETTERS[1:4]), each = 2L))
Note that set.seed() is used for repeatable results when sampling.
Agreement_number country action signature_date ratification_date dummy1 dummy2
1 1 Canada signature 2000 NA 2 D
2 1 Canada ratification NA 2001 2 D
3 1 USA signature 2000 NA 3 A
4 1 USA ratification NA 2002 3 A
5 2 Canada signature 2001 NA 1 B
6 2 Canada ratification NA 2001 1 B
7 2 USA signature 2002 NA 4 C
8 2 USA ratification NA 2002 4 C
Addendum: dcast() with additional columns
Harland has suggested using data.table and dcast(). Besides several other flaws, his answer doesn't handle the additional columns the OP has mentioned.
The dcast() approach below also returns the additional columns:
library(data.table)
# coerce to data table
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# use already existing column to "coalesce" dates
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]
# dcast from long to wide form, note that ... refers to all other columns
result <- dcast(DATA, Agreement_number + country + ... ~ action,
value.var = "signature_date")
result
Agreement_number country dummy1 dummy2 signature ratification
1: 1 Canada 2 D 2000 2001
2: 1 USA 3 A 2000 2002
3: 2 Canada 1 B 2001 2001
4: 2 USA 4 C 2002 2002
Note that this approach will change the order of columns.

Here is another data.table solution using uwe-block's data.frame. It is similar to uwe-block's method, but uses max to collapse the data.
# convert data.frame to data.table and factor variables to character variables
library(data.table)
setDT(DATA)[, names(DATA) := lapply(.SD,
function(x) if(is.factor(x)) as.character(x) else x)]
# collapse data set, by agreement and country. Take max of remaining variables.
DATA[, lapply(.SD, max, na.rm=TRUE), by=.(Agreement_number, country)][,action := NULL][]
The lapply runs through variables not included in the by statement and calculates the maximum after removing NA values. The next link in the chain drops the unneeded action variable and the final (unnecessary) link prints the output.
This returns
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
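One caveat, given that the real data is said to have more missing values: for a numeric column, max(x, na.rm = TRUE) returns -Inf with a warning when every value in a group is NA. A small sketch of a workaround follows; max_or_na is a hypothetical helper of my own (not a data.table function) and the toy table is made up for illustration.

```r
library(data.table)

# Hypothetical helper: like max(x, na.rm = TRUE), but returns an NA of the
# column's own type when every value in the group is missing.
max_or_na <- function(x) {
  if (all(is.na(x))) x[1L] else max(x, na.rm = TRUE)
}

# Made-up data: group 2 has no value at all in column a
DT <- data.table(id = c(1, 1, 2, 2),
                 a  = c(NA, 5, NA, NA),
                 b  = c("x", NA, NA, "y"))

# Collapse each group to one row, column by column
res <- DT[, lapply(.SD, max_or_na), by = id]
res
```

Group 2 then keeps NA in column a instead of -Inf, so the collapsed table can still be filtered with is.na().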

Related

r merge data with different year

I would like to merge two data sets using different years.
My data look like the samples below, with more than 1,000 firms over a 20-year span.
I want to merge the data to examine the impact of firm A's ratio at t on firm A's count at t+1.
Data A
firm year ratio
A 1990 0.2
A 1991 0.3
...
B 1990 0.1
Data B
firm tyear count
A 1990 2
A 1991 6
...
B 1990 4
Expected Output
firm year ratio count
A 1990 0.2 6
Any suggestion for code to merge data?
Thank you
This should get you started on the dataset, just make sure you do the right lag/lead transformation on the table.
library(data.table)
dt.a.years <- data.table(Year =seq(from = 1990, to = 2010, by = 1L))
dt.b.years <- data.table(Year =seq(from = 1990, to = 2010, by = 1L))
dt.merged <- merge( x = dt.a.years
, y = dt.b.years[, .(Year, lag.Year = shift(Year, n = 1, fill = NA))]
, by.x = "Year"
, by.y = "lag.Year")
>dt.merged
Year Year.y
1: 1990 1991
2: 1991 1992
3: 1992 1993
4: 1993 1994
5: 1994 1995
6: 1995 1996
7: 1996 1997
8: 1997 1998
9: 1998 1999
How about this:
A$tyear <- A$year + 1  # the year in B that each row of A should be matched to
AB <- merge(A, B, by = c('firm', 'tyear'), all = FALSE)
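To check the year-shift idea against the expected output in the question, here is a self-contained sketch using only the three rows of each table that the question shows (the elided rows are left out, not guessed):

```r
# Rows shown in the question
A <- data.frame(firm  = c("A", "A", "B"),
                year  = c(1990, 1991, 1990),
                ratio = c(0.2, 0.3, 0.1))
B <- data.frame(firm  = c("A", "A", "B"),
                tyear = c(1990, 1991, 1990),
                count = c(2, 6, 4))

# Match ratio at year t with count at year t + 1
A$tyear <- A$year + 1
AB <- merge(A, B, by = c("firm", "tyear"), all = FALSE)

# Keep the columns from the expected output
AB[, c("firm", "year", "ratio", "count")]
```

With all = FALSE, firm-years that have no count in the following year are dropped; all.x = TRUE would keep them with count set to NA.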

Translating Stata code into R

I am a general newbie when it comes to time series data analysis in R, and I am having trouble translating a bit of Stata code into R for a replication project.
The Stata code from the original analysis, with a comment stating its intent, is the following:
#### Delete extra yearc observations with different wartypes #####
drop if yearc==yearc[_n+1] & wartype!="CIVIL"
drop if yearc==yearc[_n-1] & wartype!="CIVIL"
So, translated, I keep the rows in which the country is having a civil war and delete the rows in which there is an interstate war during the same years.
I have named the data object (i.e., the data set) mywar in R.
I am assuming I somehow do a conditional ifelse statement, or something similar, such as:
invisible(mywar$yearc <- ifelse(mywar$yearc==n-1 | mywar$yearc==n+1 | mywar$wartype!=civil, NA,
mywar$yearc)) # I am assuming I cannot condition ifelse statements like this; but, this is how I imagine it
mywar <- mywar[!is.na(mywar$yearc),]
EDIT:
So perhaps an example
> b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
> c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
> df <- data.frame(b,c)
> df$j <- ifelse(df$b==n-1 & df$b==n+1 & df$c!="civil", NA, df$b)
> df
b c j
1 1970 inter 1970
2 1970 civil 1970
3 1970 intra 1970
4 1971 civil 1971
5 1982 civil 1982
6 1999 inter 1999
7 1999 civil 1999
8 2000 civil 2000
9 2001 civil 2001
10 2002 civil 2002
So, what I was trying to do was create NAs for rows 1, 3, and 6, as they are duplicate years in my logistic regression on the onset of civil war (I am not interested in inter and intra wars, however defined), so that I can delete these rows from my data set. Here, I just recreated column b. (Note that the country ids are missing from this made-up data, but assume that these ten entries represent the same country, for instance Somalia.) So, I am interested in how to delete these types of rows in a data set with 28,000 rows.
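dplyr is also a good way; you just need to "keep" instead of "drop":
library(dplyr)
# coalesce(..., TRUE) treats the missing neighbour of the first and last row as a different year
filter(mywar, wartype == "CIVIL" |
  (coalesce(yearc != lead(yearc), TRUE) & coalesce(yearc != lag(yearc), TRUE)))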
You're focusing on Stata's if qualifier, but it sounds like you simply want to subset the data frame--hence your use of the drop command in Stata. I also learned Stata before R and was confused since I relied so heavily on the if qualifier in Stata and immediately pursued ifelse in R. But, I later realized that the more relevant technique in R revolved around subsetting. There is a subset() command, but most people prefer subsetting by using brackets (see code below).
In your original question you ask how to do two things:
1. how to delete observations (i.e. rows) that are coded "inter" or "intra" on column C, and
2. how to mark them as missing
Sample Data
b <- c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002)
c <- c("inter", "civil", "intra", "civil", "civil", "inter", "civil", "civil", "civil", "civil")
df <- data.frame(b,c)
df
b c
1 1970 inter
2 1970 civil
3 1970 intra
4 1971 civil
5 1982 civil
6 1999 inter
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
1. Dropping Observations
If you want to delete observations that are not "civil" in column C, you can subset the data frame to only keep those cases that are "civil":
df2 <- df[df$c=="civil",]
df2
b c
2 1970 civil
4 1971 civil
5 1982 civil
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
The above code creates a new data frame, df2, that is a subset of df, but you can also completely overwrite the original data frame:
df <- df[df$c=="civil",]
Or, you can generate a new one and then remove the old one, if you don't like your workspace cluttered with lots of data frames:
df2 <- df[df$c=="civil",]
rm(df)
2. Marking Observations as Missing
If you want to mark observations that are not "civil" in column C, you can do that by overwriting them as NA:
df$c[df$c != "civil"] <- NA
df
b c
1 1970 <NA>
2 1970 civil
3 1970 <NA>
4 1971 civil
5 1982 civil
6 1999 <NA>
7 1999 civil
8 2000 civil
9 2001 civil
10 2002 civil
You could then use listwise deletion (see the na.omit() command) to remove the cases from whatever analyses you're doing.
Side Note
Your original Stata code seeks to subset when column b is a duplicate and column c is "inter" or "intra". However, the way your sample data were presented, this seemed to be a redundant concern, which is why my solution above only looks at column c. However, if you want to match your Stata code as closely as possible, you can do that by
df <- df[order(df$b, df$c),]
df$duplicate <- duplicated(df$b)
df2 <- df[df$c=="civil" & df$duplicate==FALSE,]
which
1. orders the data chronologically by year and then alphabetically by war type,
2. creates a new variable that specifies whether column b is a duplicate year, and
3. subsets the data frame to remove undesirable cases.
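For a more literal translation, note that Stata runs the two drop commands one after the other, so the second one sees the data already reduced by the first, and a non-civil row in a year that no neighbouring row shares is kept. A base R sketch of that sequence follows, under my own assumptions: the columns are renamed year and wartype, the rows are already sorted, there are no missing years, and the data has at least two rows.

```r
# Sample data from the question (column names here are my own choice)
df <- data.frame(
  year    = c(1970, 1970, 1970, 1971, 1982, 1999, 1999, 2000, 2001, 2002),
  wartype = c("inter", "civil", "intra", "civil", "civil",
              "inter", "civil", "civil", "civil", "civil")
)

# drop if year == year[_n+1] & wartype != "civil"
same_as_next <- c(df$year[-1] == df$year[-nrow(df)], FALSE)
step1 <- df[!(same_as_next & df$wartype != "civil"), ]

# drop if year == year[_n-1] & wartype != "civil", evaluated on the reduced data
same_as_prev <- c(FALSE, step1$year[-1] == step1$year[-nrow(step1)])
step2 <- step1[!(same_as_prev & step1$wartype != "civil"), ]
step2
```

On this sample the rows originally numbered 1, 3 and 6 are removed, which are the ones the question wanted to mark.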
Try changing your | operator to &.
Here is some made up data:
R> b <- c(rep(1:4, each=3))
R> c <- 1:length(b)
R> df <- data.frame(c,b)
R> df$j <- ifelse(df$b != 2 & df$b != 3 & df$b != 1, NA, df$b)
R> df
c b j
1 1 1 1
2 2 1 1
3 3 1 1
4 4 2 2
5 5 2 2
6 6 2 2
7 7 3 3
8 8 3 3
9 9 3 3
10 10 4 NA
11 11 4 NA
12 12 4 NA
That last line of your code mywar <- mywar[!is.na(mywar$yearc),] should work fine as well

Grouping and conditions without loop (big data)

I have several observations of the same groups, and for each observation I have a year.
dat = data.frame(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995))
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
For each observation, I would like to know whether another observation of the same group meets given conditions relative to the focal observation, e.g.: "Is there any other observation (than the focal one) in the same group that was made during the six years before the focal year?"
Ideally the dataframe should be like that
group year six_years
1 a 2000 1 # there is another member of group a that is year = 1996 (2000-6 = 1994, this value is inside the threshold)
2 a 1996 0
3 a 1975 0
4 b 2002 0
5 b 2010 0
6 b 1980 0
7 c 1990 1
8 c 1986 0
9 c 1995 1
Basically, for each row we should look into the subset for its group and see if any(dat$year == conditions). This is very easy to do with a for loop, but that is of no use here: the data frame is massive (several million rows) and a loop would take forever.
I am looking for an efficient way using vectorized functions or a fast package.
Thanks !
EDITED
Actually, thinking about it, you will probably have a lot of recurring year/group combinations, in which case it is much quicker to pre-calculate the frequencies using count(), which is also a plyr function:
9M rows took ~4 sec
require(plyr)
dat <- data.frame(group = sample(c("a","b","c"),size=9000000,replace=TRUE),
year = sample(c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995),size=9000000,replace=TRUE))
test<-function(y,g,df){
d<-df[df$year>=y-6 &
df$year<y &
df$group== g,]
return(nrow(d))
}
rollup<-function(){
summ<-count(dat) # add a frequency to each combination
return(ddply(summ,.(group,year),transform,t=test(as.numeric(year),group,summ)*freq))
}
system.time(rollup())
user system elapsed
3.44 0.42 3.90
My dataset had too many different groups, and the plyr option proposed by Troy was too slow.
I found a hack (experts would probably say "an ugly one") with the data.table package: the idea is to merge the data.table with itself using its fast merge function. This gives every possible combination between a given year of a group and all other years from the same group.
Then apply an ifelse to every row with the condition you're looking for.
Finally, aggregate everything with a sum to find out how many other years fall within the given timespan relative to each year.
On my computer, it took a few milliseconds, instead of the hours that plyr was probably going to take:
dat = data.table(group = rep(c("a","b","c"),each = 3), year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990,1986,1995), key = "group")
Produces this :
group year
1 a 2000
2 a 1996
3 a 1975
4 b 2002
5 b 2010
6 b 1980
7 c 1990
8 c 1986
9 c 1995
Then :
z = merge(dat, dat, by = "group", all = TRUE, allow.cartesian = TRUE) # super fast
z$sixyears = ifelse(z$year.y >= z$year.x - 6 & z$year.y < z$year.x, 1, 0) # creates a 0/1 column for our condition
z$sixyears = as.numeric(z$sixyears) # we want to sum this up after
z$year.y = NULL # useless column now
z2 = z[ , list(sixyears = sum(sixyears)), by = list(group, year.x)]
Years with another year of the same group in the last six years are given a "1":
group year.x sixyears
1 a 1975 0
2 b 1980 0
3 c 1986 0
4 c 1990 1 # e.g. here there is another "c" which was in the timespan 1990 -6 ..
5 c 1995 1 # <== this one. This one too has another reference in the last 6 years, two rows above.
6 a 1996 0
7 a 2000 1
8 b 2002 0
9 b 2010 0
Icing on the cake : it deals with NA seamlessly.
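A self-merge builds every within-group pair of years, which can get large when groups are big. As a possible alternative, here is a sketch with a non-equi join, assuming a data.table version that supports inequality conditions in on=; the helper column lo and the result name counts are my own.

```r
library(data.table)

dat <- data.table(group = rep(c("a", "b", "c"), each = 3),
                  year  = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995))

# Lower bound of the six-year window for each focal row
dat[, lo := year - 6]

# For each focal row (i), find rows of the same group (x) with
# lo <= x.year < focal year, and count the non-missing matches
counts <- dat[dat,
              on = .(group, year >= lo, year < year),
              .(n = sum(!is.na(x.year))),
              by = .EACHI]

# by = .EACHI returns one row per focal row, in the original order
dat[, six_years := as.integer(counts$n > 0)]
dat[, lo := NULL]
dat
```

The strict upper bound (year < year) keeps the focal row from matching itself; it also means a second observation in the same year does not count, as in the merge version above.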
Here's another possibility also using data.table but including diff().
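library(data.table)
dat <- data.table(group = rep(c("a","b","c"), each = 3),
                  year = c(2000, 1996, 1975, 2002, 2010, 1980, 1990, 1986, 1995),
                  key = "group")
# differences between consecutive years within each group; keep gaps below 6
valid_case <- subset(dat[, list(valid_case = diff(year)), by = key(dat)],
                     abs(valid_case) < 6)
# flags every row of a group that has at least one such gap
dat$valid_case <- ifelse(dat$group %in% valid_case$group, 1, 0)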
I am not sure how this compares in terms of speed or NA handling (I think it should be fine with NAs since they propagate in diff() and abs()), but I certainly find it more readable. Joins are really fast in data.table, but I'd have to think that avoiding them altogether helps. There's probably a more idiomatic way to write the condition in the ifelse statement using data.table joins. That could potentially speed things up, although in my experience %in% has never been the limiting factor.

ddply and adding columns

I have a data frame with columns year|country|growth_rate. I wanted to find the country with the highest growth rate in every year, which I did with:
ddply(data, .(year), summarise, highest=max(growth_rate))
and I got a data frame with 2 columns: year and highest.
I would like to add a third column here showing the country that had that max growth_rate, but I can't figure out how to do this.
R> data = data.frame(year = rep(1990:1993, 2), growth_rate = runif(8), country = rep(c("US", "FR"), each = 4))
R> data
year growth_rate country
1 1990 0.82785327 US
2 1991 0.86724498 US
3 1992 0.84813164 US
4 1993 0.35884355 US
5 1990 0.92792399 FR
6 1991 0.08659153 FR
7 1992 0.26732516 FR
8 1993 0.37819132 FR
R> ddply(data, .(year), summarize, highest = max(growth_rate), country = country[which.max(growth_rate)])
year highest country
1 1990 0.9279240 FR
2 1991 0.8672450 US
3 1992 0.8481316 US
4 1993 0.3781913 FR

Convert Dataframe to key value pair list in R [duplicate]

How can I 'unpivot' a table? What is the proper technical term for this?
UPDATE: The term is called melt
I have a data frame for countries and data for each year
Country 2001 2002 2003
Nigeria 1 2 3
UK 2 NA 1
And I want to have something like
Country Year Value
Nigeria 2001 1
Nigeria 2002 2
Nigeria 2003 3
UK 2001 2
UK 2002 NA
UK 2003 1
I still can't believe I beat Andrie with an answer. :)
> library(reshape)
> my.df <- read.table(text = "Country 2001 2002 2003
+ Nigeria 1 2 3
+ UK 2 NA 1", header = TRUE)
> my.result <- melt(my.df, id = c("Country"))
> my.result[order(my.result$Country),]
Country variable value
1 Nigeria X2001 1
3 Nigeria X2002 2
5 Nigeria X2003 3
2 UK X2001 2
4 UK X2002 NA
6 UK X2003 1
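In the output above the years come back as X2001, X2002, ... because read.table() makes the column names syntactically valid. A dependency-free sketch of getting from there to the Country/Year/Value layout in the question, using base R's stack() in place of melt() (the object name long is my own):

```r
# Same input as above; header = TRUE turns 2001 into X2001
my.df <- read.table(text = "Country 2001 2002 2003
Nigeria 1 2 3
UK 2 NA 1", header = TRUE)

# Wide to long: stack() returns the columns 'values' and 'ind'
long <- cbind(my.df[1], stack(my.df[-1]))

# Strip the X prefix and store the year as an integer
long$Year <- as.integer(sub("^X", "", as.character(long$ind)))

# Order and rename to match the layout asked for in the question
long <- long[order(long$Country, long$Year), c("Country", "Year", "values")]
names(long)[3] <- "Value"
long
```

The same sub() call works on the variable column produced by melt().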
The base R reshape approach for this problem is pretty ugly, particularly since the names aren't in a form that reshape likes. It would be something like the following, where the first setNames line modifies the column names into something that reshape can make use of.
reshape(
setNames(mydf, c("Country", paste0("val.", c(2001, 2002, 2003)))),
direction = "long", idvar = "Country", varying = 2:ncol(mydf),
sep = ".", new.row.names = seq_len(prod(dim(mydf[-1]))))
A better alternative in base R is to use stack, like this:
cbind(mydf[1], stack(mydf[-1]))
# Country values ind
# 1 Nigeria 1 2001
# 2 UK 2 2001
# 3 Nigeria 2 2002
# 4 UK NA 2002
# 5 Nigeria 3 2003
# 6 UK 1 2003
There are also new tools for reshaping data now available, like the "tidyr" package, which gives us gather. Of course, the tidyr:::gather_.data.frame method just calls reshape2::melt, so this part of my answer doesn't necessarily add much except introduce the newer syntax that you might be encountering in the Hadleyverse.
library(tidyr)
gather(mydf, year, value, `2001`:`2003`) ## Note the backticks
# Country year value
# 1 Nigeria 2001 1
# 2 UK 2001 2
# 3 Nigeria 2002 2
# 4 UK 2002 NA
# 5 Nigeria 2003 3
# 6 UK 2003 1
All three options here would need reordering of rows if you want the row order you showed in your question.
A fourth option would be to use merged.stack from my "splitstackshape" package. Like base R's reshape, you'll need to modify the column names to something that includes a "variable" and "time" indicator.
library(splitstackshape)
merged.stack(
setNames(mydf, c("Country", paste0("V.", 2001:2003))),
var.stubs = "V", sep = ".")
# Country .time_1 V
# 1: Nigeria 2001 1
# 2: Nigeria 2002 2
# 3: Nigeria 2003 3
# 4: UK 2001 2
# 5: UK 2002 NA
# 6: UK 2003 1
Sample data
mydf <- structure(list(Country = c("Nigeria", "UK"), `2001` = 1:2, `2002` = c(2L,
NA), `2003` = c(3L, 1L)), .Names = c("Country", "2001", "2002",
"2003"), row.names = 1:2, class = "data.frame")
You can use the melt command from the reshape package. See here: http://www.statmethods.net/management/reshape.html
Probably something like melt(myframe, id=c('Country'))

Resources