How to drop observations conditional on interval of numbers that varies by group - r

I have a dataframe of about 8,000 country-year observations. I want to model the correlates of an event that occurs in certain country-years. To do so properly, I need to drop observations after the event starts until it ends. The events can vary in length from less than one year to 30 years. In my df, I have a column that identifies the threshold_year and termination_year for each event. This column obviously contains many NAs for those countries and years that do not experience the event.
How do I drop observations that fall between the threshold and termination years for specific countries? I tried the approach from "filtering observations from time series conditionally by group", but it yields an empty dataset.
See code I have attempted below. (BTW, this is my first question on SO).
df <- structure(list(country_id = c(475, 150, 475, 475, 475, 475, 475, 150, 475, 475, 475), year = c(1962, 1967, 1964, 1965, 1966, 1967, 1968, 1968, 1970, 1971, 1972), event = c(0L, 0L, 0L, 0L, 1L, 3L, 0L, 0L, 0L, 0L, 0L), threshold_year = c(NA, NA, NA, NA, 1966, 1967, NA, NA, NA, NA, NA), termination_year = c(NA, NA, NA, NA, 1966, 1970, NA, NA, NA, NA, NA)), .Names = c("country_id", "year", "event", "threshold_year", "termination_year"), row.names = 90:100, class = "data.frame")
df2 <- df %>%
group_by(country_id) %>%
filter(year<=threshold_year & year>termination_year)
I expect a smaller df, perhaps with about 7,000 observations. My attempts typically produce 0 observations.
EDIT
I discovered an inelegant and clumsy process for resolving this issue. I joined my complete dataframe with my threshold dataframe by country only, not year. This adds a column with threshold and termination years for every country that has an event. It also creates a lot of duplicates, but that doesn't matter. Since I no longer have NAs in my threshold and termination columns, I can easily code a dummy variable for each observation to determine whether it falls within the threshold and termination years. I can also concatenate country ids and years. Once I subset my lengthy dataframe by whether the dummy = 1, I can then easily create a list of all country-years that need to be dropped. I then go back to my original data and threshold data set, left_join by both country and year, then subset this data by !(id_year %in% drops).
df_drops <- left_join(df, threshold_df, by=c("id"="id"))
df_drops$drops <- ifelse(df_drops$year>df_drops$threshold_year & df_drops$year<=df_drops$termination_year, 1,0)
df_drops$obs_to_drop <- ifelse(df_drops$drops==1, paste(df_drops$id,df_drops$year, sep="_"), NA)
drops <- unique(df_drops$obs_to_drop)
df2 <- left_join(df, threshold_df, by=c("id"="id","year"="threshold_year"))
df2$id_year <- paste(df2$id,df2$year,sep="_")
df3 <- subset(df2, !(df2$id_year %in% drops))

I am assuming that you have a list of thresholds that are specific to each group. If so, you can put the thresholds in a new data frame, then merge them with your original country-year data frame, and finally filter. My toy example below assumes that the end date is 2 years after the start date.
library(dplyr)
df <- data.frame(country=rep(letters[1:20],each=50),
                 year=sample(1999:2018,50,TRUE))
threshold <- data.frame(country=letters[1:20],
                        start=as.numeric(sample(1999:2016,20,TRUE))) %>%
  mutate(end=start + 2)
# keep only the rows that fall inside each country's start-end window
df %>% left_join(.,threshold, by="country") %>%
  filter(year>=start & year<=end)
country year start end
1 a 2016 2014 2016
2 a 2016 2014 2016
3 a 2014 2014 2016
4 a 2015 2014 2016
5 a 2015 2014 2016
6 a 2015 2014 2016
7 a 2014 2014 2016
8 b 2006 2004 2006
9 b 2004 2004 2006
10 b 2004 2004 2006
11 b 2005 2004 2006
12 b 2006 2004 2006
13 b 2004 2004 2006
14 b 2004 2004 2006
15 b 2006 2004 2006
16 b 2006 2004 2006
17 b 2006 2004 2006
18 b 2006 2004 2006
19 c 2010 2008 2010
20 c 2009 2008 2010
...
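The filter above keeps the rows inside each window, while the question asks for the opposite. A minimal base R sketch of the inverted filter might look like this, assuming one window per country. The toy values are made up for illustration, and rows with no threshold are treated as lying outside every window.

```r
# Toy data (hypothetical values): two countries, five years each
df <- data.frame(country = rep(c("a", "b"), each = 5),
                 year = rep(2001:2005, times = 2))
threshold <- data.frame(country = c("a", "b"),
                        start = c(2002, 2004))
threshold$end <- threshold$start + 2

# Attach each country's window to its rows
merged <- merge(df, threshold, by = "country", all.x = TRUE)

# TRUE for rows outside the window; countries without a window are kept
outside <- is.na(merged$start) |
  merged$year < merged$start |
  merged$year > merged$end
kept <- merged[outside, ]
kept
```

Putting is.na() first in the condition matters: a bare comparison against NA returns NA, and a row whose condition is NA is not kept by filter() or subset().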

I think my guess was right: simply adding | is.na(threshold_year) is enough, at least for the sample data provided. Without it, every comparison against an NA threshold returns NA, and filter() drops those rows.
df %>% group_by(country_id) %>%
filter((year <= threshold_year & year > termination_year) | is.na(threshold_year))
# # A tibble: 9 x 5
# # Groups: country_id [2]
# country_id year event threshold_year termination_year
# <dbl> <dbl> <int> <dbl> <dbl>
# 1 475 1962 0 NA NA
# 2 150 1967 0 NA NA
# 3 475 1964 0 NA NA
# 4 475 1965 0 NA NA
# 5 475 1968 0 NA NA
# 6 150 1968 0 NA NA
# 7 475 1970 0 NA NA
# 8 475 1971 0 NA NA
# 9 475 1972 0 NA NA
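Note that the is.na() condition only removes the rows that carry a threshold themselves. If the goal is also to drop the later country-years that fall inside an event window (1968 and 1970 for country 475 in the sample), one possible base R sketch is below. It assumes the drop rule from the question's EDIT (year > threshold_year and year <= termination_year); the names events and in_window are hypothetical helpers, not part of the original code.

```r
df <- data.frame(
  country_id = c(475, 150, 475, 475, 475, 475, 475, 150, 475, 475, 475),
  year = c(1962, 1967, 1964, 1965, 1966, 1967, 1968, 1968, 1970, 1971, 1972),
  event = c(0L, 0L, 0L, 0L, 1L, 3L, 0L, 0L, 0L, 0L, 0L),
  threshold_year = c(NA, NA, NA, NA, 1966, 1967, NA, NA, NA, NA, NA),
  termination_year = c(NA, NA, NA, NA, 1966, 1970, NA, NA, NA, NA, NA)
)

# One row per event: the country and its start/end years
events <- df[!is.na(df$threshold_year),
             c("country_id", "threshold_year", "termination_year")]
names(events) <- c("country_id", "start", "end")

# For every observation, check whether it falls inside any window
# belonging to the same country (assumed rule: start < year <= end)
in_window <- mapply(function(cid, yr) {
  ev <- events[events$country_id == cid, ]
  any(yr > ev$start & yr <= ev$end)
}, df$country_id, df$year)

df_kept <- df[!in_window, ]
df_kept
```

Countries with no event get an empty events subset, so any() returns FALSE and all of their rows are kept.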

Related

merging financial time series data

I have several data frames. In each one, the first column is the list of country names, which is the same in all data frames, and the remaining columns are the years for which the indicator is measured; the years are also the same across data frames. What are the ways to merge the datasets by the first column? How can I merge them into a multidimensional array? Dataset example:
country name  2005   ....  2020
Aruba         23591
Angola        1902

country name  2005   ....  2020
Aruba         -8.8
Angola        -3.5
Doing a full_join
library(dplyr)
full_join(DataSet1,DataSet2, by = 'country name')
changes the name of the columns and the data is not accessible.
1) Assuming the data frames in the Note at the end, we can use bind_rows
library(dplyr)
bind_rows(DF1, DF2, .id = "id")
giving the following which takes all the rows from both data frames and identifies which data frame each row came from.
id countryName 2005 2006
1 1 Aruba 1 2
2 1 Angola 3 4
3 2 Aruba 11 12
4 2 Angola 13 14
2) Another possibility is to create a 3d array
library(abind)
a <- abind(DF1[-1], DF2[-1], along = 3, new.names = list(DF1$countryName,NULL,1:2))
a
giving this 3d array where the dimensions correspond to the country name, the year and the originating data.frame.
, , 1
2005 2006
Aruba 1 2
Angola 3 4
, , 2
2005 2006
Aruba 11 12
Angola 13 14
We can get various slices:
> a["Aruba",,]
1 2
2005 1 11
2006 2 12
> a[,"2005",]
1 2
Aruba 1 11
Angola 3 13
> a[,,2]
2005 2006
Aruba 11 12
Angola 13 14
Note
DF1 <- structure(list(countryName = c("Aruba", "Angola"), `2005` = c(1L,
3L), `2006` = c(2L, 4L)), class = "data.frame", row.names = c(NA, -2L))
DF2 <- structure(list(countryName = c("Aruba", "Angola"), `2005` = c(11L,
13L), `2006` = c(12L, 14L)), class = "data.frame", row.names = c(NA, -2L))
> DF1
countryName 2005 2006
1 Aruba 1 2
2 Angola 3 4
> DF2
countryName 2005 2006
1 Aruba 11 12
2 Angola 13 14
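If the goal is instead a single wide table whose columns stay distinguishable, base R merge() accepts a suffixes argument. A small sketch using DF1 and DF2 from the Note; the suffix labels _ds1 and _ds2 are placeholders chosen for illustration, not names from the question.

```r
DF1 <- data.frame(countryName = c("Aruba", "Angola"),
                  `2005` = c(1L, 3L), `2006` = c(2L, 4L),
                  check.names = FALSE)
DF2 <- data.frame(countryName = c("Aruba", "Angola"),
                  `2005` = c(11L, 13L), `2006` = c(12L, 14L),
                  check.names = FALSE)

# Year columns occur in both inputs, so each gets a suffix
# identifying the data frame it came from
wide <- merge(DF1, DF2, by = "countryName",
              suffixes = c("_ds1", "_ds2"))
wide
```

The key column comes first, followed by the columns of DF1 and then those of DF2, and rows are sorted by countryName.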

Nested for loop to create a column, using two data sets in R

I want to create a new variable (named "treatment") in a dataset using two different datasets. My original datasets are two big datasets with other variables; however, for simplicity, let's say I have the following datasets:
#individual level data, birth years
a <- data.frame (country_code = c(2,2,2,10,10,10,10,8),
birth_year = c(1920,1930,1940,1970,1980,1990, 2000, 1910))
#country level reform info with affected cohorts
b <- data.frame(country_code = c(2,10,10,11),
lower_cutoff = c(1928, 1975, 1907, 1934),
upper_cutoff = c(1948, 1995, 1927, 1948),
cohort = c(1938, 1985, 1917, 1942))
Dataset a is an individual-level dataset with birth year information, and dataset b is country-level data with some country reform information. Based on dataset b, I want to create a treatment column in dataset a. Treatment is 1 if the birth_year is between the cohort and upper_cutoff, and 0 if it is between lower_cutoff and cohort. Anything else should be NA.
After creating an empty treatment column, I used the following code below:
for(i in 1:nrow(a)) {
for(j in 1:nrow(b)){
a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] &
a$birth_year[i] >= b$cohort[j] &
a$birth_year[i]<= b$upper_cutoff[j], "1",
ifelse(a$country_code[i] == b$country_code[j] &
a$birth_year[i] < b$cohort[j] &
a$birth_year[i]>= b$lower_cutoff[j], "0", NA))
}
}
As well as:
for(i in 1:nrow(a)) {
for(j in 1:nrow(b)){
a[i, "treatment"] <- case_when(a[i,"country_code"] == b[j, "country_code"] &
a[i,"birth_year"] >= b[j,"cohort"] &
a[i,"birth_year"]<= b[j,"upper_cutoff"] ~ 1,
a[i,"country_code"] == b[j, "country_code"] &
a[i,"birth_year"] < b[j,"cohort"]&
a[i,"birth_year"]>= b[j,"lower_cutoff"] ~ 0)
}
}
Both codes run, but they only return NAs. The following is the result I want to get:
treatment <- c(NA, 0, 1, NA, 0, 1, NA, 0)
Any ideas about what is wrong? Or any other suggestions? Thanks in advance!
A join avoids the nested loops. With the data as defined in the question, treatment is 1 between cohort and upper_cutoff and 0 between lower_cutoff and cohort. Try this approach with dplyr:
library(dplyr)
left_join(a,b, by="country_code") %>%
mutate(treatment = case_when(
(birth_year>=cohort & birth_year<=upper_cutoff)~1,
(birth_year<cohort & birth_year>=lower_cutoff)~0
))
Output:
country_code birth_year lower_cutoff upper_cutoff cohort treatment
1 2 1920 1928 1948 1938 NA
2 2 1930 1928 1948 1938 0
3 2 1940 1928 1948 1938 1
4 10 1970 1975 1995 1985 NA
5 10 1970 1907 1927 1917 NA
6 10 1980 1975 1995 1985 0
7 10 1980 1907 1927 1917 NA
8 10 1990 1975 1995 1985 1
9 10 1990 1907 1927 1917 NA
10 10 2000 1975 1995 1985 NA
11 10 2000 1907 1927 1917 NA
12 8 1910 NA NA NA NA
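The joined output has one row per individual-reform pair, so country 10 appears twice per person. If one row per individual is needed, a possible base R sketch is below. It assumes the rule stated in the question (1 between cohort and upper_cutoff, 0 between lower_cutoff and cohort) and treats a person as treated if any reform of their country matches.

```r
a <- data.frame(country_code = c(2, 2, 2, 10, 10, 10, 10, 8),
                birth_year = c(1920, 1930, 1940, 1970, 1980, 1990, 2000, 1910))
b <- data.frame(country_code = c(2, 10, 10, 11),
                lower_cutoff = c(1928, 1975, 1907, 1934),
                upper_cutoff = c(1948, 1995, 1927, 1948),
                cohort = c(1938, 1985, 1917, 1942))

a$treatment <- vapply(seq_len(nrow(a)), function(i) {
  # Reforms that apply to this person's country (possibly none)
  r <- b[b$country_code == a$country_code[i], ]
  yr <- a$birth_year[i]
  if (any(yr >= r$cohort & yr <= r$upper_cutoff)) return(1)
  if (any(yr >= r$lower_cutoff & yr < r$cohort)) return(0)
  NA_real_
}, numeric(1))
a
```

Country 8 has no reform in b, so its row gets NA here, in line with the outputs shown in the answers.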
Try this for loop
for(i in 1:nrow(a)){
x <- which(a$country_code[i] == b$country_code)
a$treatment[i] <- NA
for(j in x){
if(a$birth_year[i] %in% b$cohort[j]:b$upper_cutoff[j]){
a$treatment[i] <- 1
}
if(a$birth_year[i] >= b$lower_cutoff[j] && a$birth_year[i] < b$cohort[j]){ # exclude the cohort year itself so it stays 1
a$treatment[i] <- 0
}
}
}
Output
country_code birth_year treatment
1 2 1920 NA
2 2 1930 0
3 2 1940 1
4 10 1970 NA
5 10 1980 0
6 10 1990 1
7 10 2000 NA
8 8 1910 NA
I found the mistake in my code. Apparently, I should have used a break to avoid overwriting the variable I'm creating. But, I'm still open to other answers.
for(i in 1:nrow(a)) {
for(j in 1:nrow(b)){
if(!is.na(a$treatment[i])){break} #to make it stop if I already assign a value
a$treatment[i] <- ifelse(a$country_code[i] == b$country_code[j] &
a$birth_year[i] >= b$cohort[j] &
a$birth_year[i]<= b$upper_cutoff[j], "1",
ifelse(a$country_code[i] == b$country_code[j] &
a$birth_year[i] < b$cohort[j] &
a$birth_year[i]>= b$lower_cutoff[j], "0", NA))
}
}

How could I split a data.frame?

I have 50 synoptic stations precipitation data from 1986 to 2015.
I need to sort the related information for the period of years from 2007 to 2015 for each station separately. I mean there are three variables:
the station's name
the specific year
the amount of precipitation
I need the result for each station separately.
Does anyone know how to use "split" for this purpose?
Could you please write the code starting from "read.table"?
If your task is simply to split the dataframe by year you can use split:
split(df, f = df$year)
Illustrative data:
set.seed(123)
df <- data.frame(
  station = sample(LETTERS[1:3], 10, replace = TRUE),
  year = paste0("201", sample(1:9, 10, replace = TRUE)),
  precipitation = sample(333:444, 10, replace = TRUE)
)
Result:
$`2011`
station year precipitation
5 C 2011 406
8 C 2011 399
$`2013`
station year precipitation
7 B 2013 393
9 B 2013 365
$`2015`
station year precipitation
2 C 2015 410
$`2016`
station year precipitation
4 C 2016 444
$`2017`
station year precipitation
3 B 2017 404
$`2019`
station year precipitation
1 A 2019 432
6 A 2019 412
10 B 2019 349
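Since the question asks for the 2007 to 2015 period for each station separately, a possible variation is to subset the years first and then split by station instead of by year. A minimal sketch with made-up values; the column names station, year and precipitation are assumed, and in practice df would come from read.table() on a file not shown in the question.

```r
# Hypothetical data standing in for the result of read.table()
df <- data.frame(
  station = rep(c("A", "B"), each = 4),
  year = rep(c(2005, 2007, 2010, 2015), times = 2),
  precipitation = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Keep the years of interest, then build one data frame per station
recent <- subset(df, year >= 2007 & year <= 2015)
by_station <- split(recent, recent$station)
by_station$A
```

Each element of by_station is an ordinary data frame and can be accessed by station name.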

Add stringA or stringB to values in column based on condition

I have a data table "dates" such as:
dates <- data.frame(date1=c("2015","1998","2000","1991"),
date2=c("98","00","18","92"))
dates <- mutate_if(dates,is.factor,as.character)
where the values in "dates" are of class character.
I want to make "date2" a 4-digit number. For this I would like the following condition:
If "date2" starts with 9 add a 19 before the value
If "date2" starts with anything else add a 20
I have done a lot of research, but I cannot find how to prepend a string to an existing string based on a condition.
Afterthought: how can we deal with NA values so that a "19" or "20" is not assigned to them?
A regex-free alternative:
d2int <- as.integer(dates$date2)
dates[["date2n"]] <- as.character(d2int + ifelse(d2int > 18, 1900, 2000))
dates
date1 date2 date2n
1 2015 98 1998
2 1998 00 2000
3 2000 18 2018
4 1991 92 1992
5 2015 89 1989
6 1998 18 2018
7 2000 19 1919
8 1991 NA <NA>
Where:
dates <- data.frame(
date1=c("2015","1998","2000","1991"),
date2=c("98","00","18","92", "89", "18", "19", "NA"),
stringsAsFactors = FALSE
)
You can use lubridate and try something like:
Input:
dates <- data.frame(date1=c("2015","1998","2000","1991", "1991", "1991"),
date2=c("98","00","18","92", "88", NA))
use:
library(dplyr)
library(lubridate)
dates %>%
mutate(date2 = as.integer(date2)) %>%
mutate(date3 = if_else(date2+2000 > year(today()), date2+1900, date2+2000))
which gives:
date1 date2 date3
1 2015 98 1998
2 1998 0 2000
3 2000 18 2018
4 1991 92 1992
5 1991 88 1988
6 1991 NA NA
p.s. added two rows to the input data to show how this handles NA values
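For the rule exactly as stated in the question (prefix "19" when date2 starts with 9, otherwise "20", and leave NA alone), a string-based sketch in base R could look like this. The vector name date2_full is a hypothetical choice.

```r
date2 <- c("98", "00", "18", "92", NA)

# Choose the century prefix from the first character, then paste it on;
# the outer ifelse() keeps missing values missing instead of "NANA"
prefix <- ifelse(startsWith(date2, "9"), "19", "20")
date2_full <- ifelse(is.na(date2), NA_character_, paste0(prefix, date2))
date2_full
```

The explicit is.na() guard is needed because paste0() would otherwise turn NA into the literal text "NA".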

How do I avoid a slow loop with large data set?

Consider this data set:
> DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
+ country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
+ action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
+ signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
+ ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002))
> DATA
Agreement_number country action signature_date ratification_date
1 Canada signature 2000 NA
1 Canada ratification NA 2001
1 USA signature 2000 NA
1 USA ratification NA 2002
2 Canada signature 2001 NA
2 Canada ratification NA 2001
2 USA signature 2002 NA
2 USA ratification NA 2002
As you can see, half of the rows have duplicate information. For a small data set like this it is really easy to remove duplicates. I could use the coalesce function (dplyr package), get rid of the "action" column and then erase all the irrelevant rows, though there are many other ways. The final result should look like this:
> DATA <- data.frame( Agreement_number = c(1,1,2,2),
+ country = c("Canada", "USA", "Canada","USA"),
+ signature_date = c(2000,2000,2001,2002),
+ ratification_date = c(2001, 2002, 2001, 2002))
> DATA
Agreement_number country signature_date ratification_date
1 Canada 2000 2001
1 USA 2000 2002
2 Canada 2001 2001
2 USA 2002 2002
The problem is that my real data set is MUCH bigger (102000 x 270) and there are many more variables. The real data is also more irregular and there are more missing values. The coalesce function seems very slow. The best loop I could make so far still takes up to 5-10 minutes to run.
Is there a simple way of doing this which would be faster? I have the feeling that there must be some function in R for that kind of operation, but I couldn't find any.
I think you need dcast. The version in the data.table library calls itself "fast", and in my experience, it is speedy on large datasets.
First, let's create one column which is either the signature_date or ratification_date, depending on the action
library(data.table)
setDT(DATA)[, date := ifelse(action == "ratification", ratification_date, signature_date)]
Now, let's cast it so that the action are the columns and the value is the date
wide <- dcast(DATA, Agreement_number + country ~ action, value.var = 'date')
So wide looks like this
Agreement_number country ratification signature
1 1 Canada 2001 2000
2 1 USA 2002 2000
3 2 Canada 2001 2001
4 2 USA 2002 2002
The OP has said that his production data has 100 k rows x 270 columns, and that speed is a concern. Therefore, I suggest using data.table.
I'm aware that Harland has also proposed using data.table and dcast(), but the solution below is a different approach. It puts the rows in the correct order and copies the ratification_date to the signature row. After some clean-up we get the desired result.
library(data.table)
# coerce to data.table,
# make sure that the actions are ordered properly, not alphabetically
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# order the rows to make sure that signature row and ratification row are
# subsequent for each agreement and country
setorder(DATA, Agreement_number, country, action)
# copy the ratification date from the row below but only within each group
result <- DATA[, ratification_date := shift(ratification_date, type = "lead"),
by = c("Agreement_number", "country")][
# keep only signature rows, remove action column
action == "signature"][, action := NULL]
result
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
Data
The OP has mentioned that his production data has 270 columns. To simulate this I've added two dummy columns:
set.seed(123L)
DATA <- data.frame(Agreement_number = c(1,1,1,1,2,2,2,2),
country = c("Canada","Canada", "USA", "USA", "Canada","Canada", "USA", "USA"),
action = c("signature", "ratification","signature", "ratification", "signature", "ratification","signature", "ratification"),
signature_date = c(2000,NA,2000,NA, 2001, NA, 2002, NA),
ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002),
dummy1 = rep(sample(4), each = 2L),
dummy2 = rep(sample(LETTERS[1:4]), each = 2L))
Note that set.seed() is used for repeatable results when sampling.
Agreement_number country action signature_date ratification_date dummy1 dummy2
1 1 Canada signature 2000 NA 2 D
2 1 Canada ratification NA 2001 2 D
3 1 USA signature 2000 NA 3 A
4 1 USA ratification NA 2002 3 A
5 2 Canada signature 2001 NA 1 B
6 2 Canada ratification NA 2001 1 B
7 2 USA signature 2002 NA 4 C
8 2 USA ratification NA 2002 4 C
Addendum: dcast() with additional columns
Harland has suggested using data.table and dcast(). Besides several other flaws, his answer doesn't handle the additional columns the OP has mentioned.
The dcast() approach below will return also the additional columns:
library(data.table)
# coerce to data table
setDT(DATA)[, action := ordered(action, levels = c("signature", "ratification"))]
# use already existing column to "coalesce" dates
DATA[action == "ratification", signature_date := ratification_date]
DATA[, ratification_date := NULL]
# dcast from long to wide form, note that ... refers to all other columns
result <- dcast(DATA, Agreement_number + country + ... ~ action,
value.var = "signature_date")
result
Agreement_number country dummy1 dummy2 signature ratification
1: 1 Canada 2 D 2000 2001
2: 1 USA 3 A 2000 2002
3: 2 Canada 1 B 2001 2001
4: 2 USA 4 C 2002 2002
Note that this approach will change the order of columns.
Here is another data.table solution using uwe-block's data.frame. It is similar to uwe-block's method, but uses max to collapse the data.
# convert data.frame to data.table and factor variables to character variables
library(data.table)
setDT(DATA)[, names(DATA) := lapply(.SD,
function(x) if(is.factor(x)) as.character(x) else x)]
# collapse data set, by agreement and country. Take max of remaining variables.
DATA[, lapply(.SD, max, na.rm=TRUE), by=.(Agreement_number, country)][,action := NULL][]
The lapply runs through variables not included in the by statement and calculates the maximum after removing NA values. The next link in the chain drops the unneeded action variable and the final (unnecessary) link prints the output.
This returns
Agreement_number country signature_date ratification_date dummy1 dummy2
1: 1 Canada 2000 2001 2 D
2: 1 USA 2000 2002 3 A
3: 2 Canada 2001 2001 1 B
4: 2 USA 2002 2002 4 C
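For comparison, when every agreement-country pair has at most one signature row and one ratification row, as in the sample data, the same reshaping can be sketched in base R by splitting on action and merging the two halves. This only illustrates the idea under that assumption; no claim is made about its speed on the 102000 x 270 data, and the names sig and rat are hypothetical.

```r
DATA <- data.frame(
  Agreement_number = c(1, 1, 1, 1, 2, 2, 2, 2),
  country = c("Canada", "Canada", "USA", "USA",
              "Canada", "Canada", "USA", "USA"),
  action = rep(c("signature", "ratification"), times = 4),
  signature_date = c(2000, NA, 2000, NA, 2001, NA, 2002, NA),
  ratification_date = c(NA, 2001, NA, 2002, NA, 2001, NA, 2002)
)

keys <- c("Agreement_number", "country")

# Each half keeps only the date column that is filled for that action
sig <- DATA[DATA$action == "signature", c(keys, "signature_date")]
rat <- DATA[DATA$action == "ratification", c(keys, "ratification_date")]

# all = TRUE keeps pairs that have only one of the two actions
result <- merge(sig, rat, by = keys, all = TRUE)
result
```

Pairs with a missing signature or ratification row end up with NA in the corresponding date column rather than being dropped.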
