Baffling error using dataprep function in R Synth package

I am trying to use the 'Synth' package in R to explore the effect that certain coups had on economic growth in the countries where they occurred, but I'm hung up on an error I can't understand. When I attempt to run dataprep(), I get the following:
Error in dataprep(foo = World, predictors = c("rgdpe.pc", "population.ln", :
unit.variable not found as numeric variable in foo.
That's puzzling because my data frame, World, does include a numeric id called "idno" as specified in the call to dataprep().
Here is the script I'm using. It ingests a .csv with the requisite data from GitHub. The final step, the call to dataprep(), is where the error occurs. I would appreciate help figuring out why it arises and how to avoid it so I can get on to the synth() step that follows.
library(dplyr)
library(Synth)
# DATA INGESTION AND TRANSFORMATION
World <- read.csv("https://raw.githubusercontent.com/ulfelder/coups-and-growth/master/data.raw.csv", stringsAsFactors=FALSE)
World$rgdpe.pc = World$rgdpe/World$pop # create per capita version of GDP (PPP)
World$idno = as.numeric(as.factor(World$country)) # create numeric country id
World$population.ln = log(World$population/1000) # population size in 1000s, logged
World$trade.ln = log(World$trade) # trade as % of GDP, logged
World$civtot.ln = log1p(World$civtot) # civil conflict scale, +1 and logged
World$durable.ln = log1p(World$durable) # political stability, +1 and logged
World$polscore = with(World, ifelse(polity >= -10, polity, NA)) # create version of Polity score that's missing for -66, -77, and -88
World <- World %>% # create clocks counting years since last coup (attempt) or 1950, whichever is most recent
  arrange(countrycode, year) %>%
  mutate(cpt.succ.d = ifelse(cpt.succ.n > 0, 1, 0),
         cpt.any.d = ifelse(cpt.succ.n > 0 | cpt.fail.n > 0, 1, 0)) %>%
  group_by(countrycode, idx = cumsum(cpt.succ.d == 1L)) %>%
  mutate(cpt.succ.clock = row_number()) %>%
  ungroup() %>%
  select(-idx) %>%
  group_by(countrycode, idx = cumsum(cpt.any.d == 1L)) %>%
  mutate(cpt.any.clock = row_number()) %>%
  ungroup() %>%
  select(-idx) %>%
  mutate(cpt.succ.clock.ln = log1p(cpt.succ.clock), # include +1 log versions
         cpt.any.clock.ln = log1p(cpt.any.clock))
# THAILAND 2006
THI.coup.year = 2006
THI.years = seq(THI.coup.year - 5, THI.coup.year + 5)
# Get names of countries that had no coup attempts during the window the analysis will cover. If you wanted to restrict the comparison to a
# specific region or in any other categorical way, this would be the place to do that as well.
THI.controls <- World %>%
  filter(year >= min(THI.years) & year <= max(THI.years)) %>% # filter to desired years
  group_by(idno) %>% # organize by country
  summarise(coup.ever = sum(cpt.any.d)) %>% # get counts by country of years with coup attempts during that period
  filter(coup.ever==0) %>% # keep only the ones with 0 counts
  select(idno) # cut down to country ids
THI.controls = unlist(THI.controls) # convert that data frame to a vector
names(THI.controls) = NULL # strip the vector of names
THI.synth.dat <- dataprep(
foo = World,
predictors = c("rgdpe.pc", "population.ln", "trade.ln", "fcf", "govfce", "energy.gni", "polscore", "durable.ln", "cpt.any.clock.ln", "civtot.ln"),
predictors.op = "mean",
time.predictors.prior = seq(from = min(THI.years), to = THI.coup.year - 1),
dependent = "rgdpe.pc",
unit.variable = "idno",
unit.names.variable = "country",
time.variable = "year",
treatment.identifier = unique(World$idno[World$country=="Thailand"]),
controls.identifier = THI.controls,
time.optimize.ssr = seq(from = THI.coup.year, to = max(THI.years)),
time.plot = THI.years
)

Too long for a comment.
Your dplyr statement:
World <- World %>% ...
converts World from a data.frame to a tbl_df object (see the dplyr docs). Unfortunately, this causes mode(World[,"idno"]) to return "list" rather than "numeric", so dataprep()'s test for a numeric unit.variable fails.
You can fix this by using
`World <- as.data.frame(World)`
just before the call to dataprep(...).
Unfortunately (again) you now get a different error which may be due to the logic of your dplyr statement.
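For illustration, a quick check along these lines (a sketch, assuming World has been built by the script above) shows what the unit.variable test reacts to:
class(World)          # "tbl_df" "tbl" "data.frame" after the dplyr pipeline
mode(World[, "idno"]) # "list" for a tibble, because [, "idno"] returns a one-column tibble
World <- as.data.frame(World)
mode(World[, "idno"]) # now "numeric", so dataprep()'s check on unit.variable passes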

Finding all nearest neighbour to all data points using sparklyr

I would like to use sparklyr find the nearest neighbour for each point in a dataset.
I've found sparklyr::ml_approx_nearest_neighbors() uses a key argument (a single feature vector) to find the nearest neighbour, so I guess I'd iterate over that for each point. Should I use this with lapply(), or is this inefficient?
Here's an example (I've modified from here) where I take the titanic dataset and attempt to find the nearest 2 neighbours from the same dataset using the first 700 data points. It returns the point itself, and the next closest as expected, but I suspect the entire pipeline reruns for each data point making this inefficient.
Is there a better way, please?
library(sparklyr)
library(titanic)
library(dplyr)
library(magrittr)
sc <- spark_connect(method = "databricks") # create a spark connection object
# clean dataset
df_titanic <- titanic::titanic_train %>%
dplyr::select(Survived, Pclass, Sex, Age, SibSp, Parch, Fare) %>%
dplyr::rename_all(tolower) %>% # make the col names lower case
dplyr::mutate(sex = ifelse(sex == 'male', 1, 0), id = 1:nrow(.)) %>% # turn sex to an integer
dplyr::filter_all(dplyr::all_vars(!is.na(.))) # remove NAs
sdf_titanic <- sparklyr::copy_to(sc, df_titanic, overwrite = T) # copy to spark
input_cols <- c('pclass', 'sex', 'age', 'sibsp', 'parch', 'fare') # features list
## append a vectorised list of the features we're interested in
sdf_titanic_va <- ft_vector_assembler(sdf_titanic,
input_cols = input_cols,
output_col = 'features')
brp_lsh <- sparklyr::ft_bucketed_random_projection_lsh(
sc,
input_col = 'features',
output_col = 'hash',
bucket_length = 2,
num_hash_tables = 3
)
brp_fit <- ml_fit(brp_lsh, sdf_titanic_va) ## fit the LSH to our data to get the hashes
id1_input <- sdf_titanic_va %>%
dplyr::filter(id %in% 1:700) %>%
dplyr::pull(features)
lapply(id1_input, function(x) ml_approx_nearest_neighbors(
brp_fit,
sdf_titanic_va,
key = x,
dist_col = 'dist_col',
num_nearest_neighbors = 2
))
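For reference (not part of the original post), sparklyr also wraps Spark's approximate similarity join on a fitted LSH model, which scores all candidate pairs in one Spark job instead of calling the pipeline once per key. A rough sketch, assuming the objects built above; the distance threshold is an assumption you would need to tune, and the names of the id columns in the joined result should be checked against the actual output schema:
nn_pairs <- sparklyr::ml_approx_similarity_join(
  brp_fit,            # fitted LSH model from ml_fit() above
  sdf_titanic_va,     # dataset A
  sdf_titanic_va,     # dataset B (self-join)
  threshold = 10,     # assumed upper bound on distance; tune for your features
  dist_col = 'dist_col'
) %>%
  dplyr::filter(id_a != id_b) %>%                             # drop each point's match with itself
  dplyr::group_by(id_a) %>%
  dplyr::filter(dist_col == min(dist_col, na.rm = TRUE)) %>%  # keep the closest neighbour per point
  dplyr::ungroup()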

Multiple file processing in R, looping over variable for data processing

I've written several functions for cleaning and processing 15 samples from the American Community Survey (ACS). This workflow is very laborious and repetitive: reading in each file, applying my functions, and moving on to the next survey year.
My current workflow is like this:
library(tidyverse)
library(ids)
wage_2005 <- haven::read_dta("~/Data/ACS/2005_ACS.dta") %>%
gen.wages(wage_2005) %>%
reg.variables() %>%
wage.adj(year = 2005) %>%
wage.sample(year = 2005)
And moving on to 2006, 2007, and so on until 2019. For instance,
wage_2006 <- haven::read_dta("~/Data/ACS/2006_ACS.dta") %>%
gen.wages(wage_2006) %>%
reg.variables() %>%
wage.adj(year = 2006) %>%
wage.sample(year = 2006)
What I would like to do is process each sample iteratively: loop through the files in succession, use the year variable in each file to apply the appropriate processing for that survey year, and store the result for each survey year in a list.
As a first step, I have written some code, reading in the files using an sapply function:
files <- list.files(path = "~/Data/ACS", full.names = TRUE)
data_files <- sapply(files, function(x) {
  haven::read_dta(file = x)
}, USE.NAMES = TRUE, simplify = FALSE)
But this takes an enormous amount of memory, as the files come from the Census Bureau and are quite large. I am stuck on the next steps: iteratively processing each file, applying my functions, and storing the result for each in a list.
Some pseudo code to give a clearer idea:
for year in years
read in data file
apply functions
store results
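One way to turn that pseudo code into R, sketched under the assumption that your files are named like "2005_ACS.dta" and that gen.wages(), reg.variables(), wage.adj(), and wage.sample() each take the data frame as their first argument:
files <- list.files(path = "~/Data/ACS", full.names = TRUE)
years <- as.numeric(sub("_ACS\\.dta$", "", basename(files))) # pull the survey year out of each file name
wages <- lapply(seq_along(files), function(i) {
  haven::read_dta(files[i]) %>% # read one survey year at a time
    gen.wages() %>%
    reg.variables() %>%
    wage.adj(year = years[i]) %>%
    wage.sample(year = years[i])
})
names(wages) <- years # a named list, one cleaned sample per survey year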
Say for example that I have three sets of data, something like
acs_2005 <-
data.frame(id = random_id(n = 1000, bytes = 16, use_openssl = TRUE),
wage = runif(1000, min = 0, max = 100),
year = 2005)
acs_2006 <-
data.frame(id = random_id(n = 1000, bytes = 16, use_openssl = TRUE),
wage = runif(1000, min = 0, max = 100),
year = 2006)
acs_2007 <-
data.frame(id = random_id(n = 1000, bytes = 16, use_openssl = TRUE),
wage = runif(1000, min = 0, max = 100),
year = 2007)
data <- list(acs_2005 = acs_2005, acs_2006 = acs_2006, acs_2007 = acs_2007)
And let's say they are written out as csv files, standing in for the survey files on disk:
lapply(seq_along(data), function(i) write.csv(data[[i]],
       file = paste0(names(data)[i], ".csv"),
       row.names = FALSE))
My custom function is,
wage_summarize <- function(df, year) {
  mutate(df, wage = case_when(
    year == 2005 ~ wage/0.7903,
    year == 2006 ~ wage/0.8112,
    year == 2007 ~ wage/0.8323)) %>%
    group_by(year) %>%
    summarize(wage = mean(wage, na.rm = TRUE))
}
How would I iterate through this list of data frames when the function depends on the year variable to perform the operation, in this case a hypothetical inflation adjustment?
Any help or guidance in this would be much appreciated, thank you!
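For the toy data above, one way to run wage_summarize() over the whole list is a plain lapply() (a sketch using the objects defined in the question; the year argument is taken from each element):
results <- lapply(data, function(df) wage_summarize(df, year = unique(df$year)))
results # a named list with one one-row summary per survey year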
This should serve as guidance, given the information you provided:
library(tidyverse)
# Simulate multiple data
# that has been loaded by some
# read_data-function
data_list <- list(
mtcars,
diamonds,
iris
)
# Iterate through the list
# of data with some function
data_list <- data_list %>% map(
.f = function(x) {
x %>% mutate(
row_id = row_number()
)
}
)
Here we loaded the data and stored it in a list, which simulates reading one dataset at a time, and then applied a function to each element using dplyr. The output is a list of the same length!
Please refer to Programming with dplyr for more information on applying custom functions to your data.
If you want to do this in parallel, that is also possible, but it is OS-specific; on Unix-alikes, mclapply() from the parallel package is the go-to function.
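For example, a minimal parallel variant of the map() call above could look like this (a sketch; mclapply() forks the R process, so it is Unix-only, and mc.cores = 4 is an assumption about the machine):
library(parallel)
data_list <- mclapply(data_list, function(x) {
  x %>% mutate(row_id = row_number()) # same per-element transformation as above
}, mc.cores = 4)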
You asked for some guidance, and this is what I could provide with the information you gave.
Why not just combine the list of data frames into one?
library(dplyr)
bind_rows(data) %>%
mutate(wage = wage/case_when(
year == 2005 ~ 0.7903,
year == 2006 ~ 0.8112,
year == 2007 ~ 0.8323)) %>%
group_by(year) %>%
summarise(wage = mean(wage, na.rm = TRUE))
# year wage
# <dbl> <dbl>
#1 2005 63.0
#2 2006 61.9
#3 2007 59.8
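If you still want one result per survey year stored in a list, as the question asks, a possible follow-up (assuming dplyr 0.8 or later, where group_split() is available):
bind_rows(data) %>%
  mutate(wage = wage/case_when(
    year == 2005 ~ 0.7903,
    year == 2006 ~ 0.8112,
    year == 2007 ~ 0.8323)) %>%
  group_by(year) %>%
  summarise(wage = mean(wage, na.rm = TRUE)) %>%
  group_split(year) # a list with one one-row summary per survey year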

Conditionally calculating average time between events by group in R

I am working with a call log data set from a telephone hotline service. There are three call outcomes: Answered, Abandoned, and Engaged. I am trying to find out the average time taken by each caller to contact the hotline again after they abandoned the previous call. The time difference could be expressed in seconds, minutes, hours, or days, and I would like to get all four if possible.
Here is some mock data with the variables I am working with:
library(wakefield)#for generating the Status variable
library(dplyr)
library(stringi)
library(Pareto)
library(uuid)
n_users<-1300
n_rows <- 365000
set.seed(1)
#data<-data.frame()
Date<-seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day")
Date<-sample(rep(Date,each=1000),replace = T)
u <- runif(length(Date), 0, 60*60*12) # "noise" to add or subtract from some timepoint
CallDateTime <- as.POSIXlt(u, origin = paste(Date, "00:00:00"))
CallDateTime
CallOutcome<-r_sample_factor(x = c("Answered", "Abandoned", "Engaged"), n=length(Date))
CallOutcome
data<-data.frame(Date,CallDateTime,CallOutcome)
relative_probs <- rPareto(n = n_users, t = 1, alpha = 0.3, truncation = 500)
unique_ids <- UUIDgenerate(n = n_users)
data$CallerId <- sample(unique_ids, size = n_rows, prob = relative_probs, replace = TRUE)
data<-data%>%arrange(CallDateTime)
head(data)
So to reiterate, if a caller abandons their call (represented by "Abandoned" in the CallOutcome column), I would like to know the average time taken for the caller to make another call to the service, in the four time units I have mentioned. Any pointers on how I can achieve this would be great :)
For each caller, keep the rows where the current call is "Abandoned" and the next call is not, together with that following call. The difference in time between each such pair of rows is the time the caller took to reach the service again after abandoning; averaging those durations gives the average time.
library(dplyr)
data %>%
  #Test the answer on smaller subset
  #slice(1:1000) %>%
  arrange(CallerId, CallDateTime) %>%
  group_by(CallerId) %>%
  filter(CallOutcome == 'Abandoned' & dplyr::lead(CallOutcome) != 'Abandoned' |
         CallOutcome != 'Abandoned' & dplyr::lag(CallOutcome) == 'Abandoned') %>%
  mutate(group = rep(row_number(), each = 2, length.out = n())) %>%
  group_by(group, .add = TRUE) %>%
  summarise(avg_sec = difftime(CallDateTime[2], CallDateTime[1], units = 'secs')) %>%
  mutate(avg_sec = as.numeric(mean(avg_sec)),
         avg_min = avg_sec/60,
         avg_hour = avg_min/60,
         avg_day = avg_hour/24) -> result
result
First, I would create the lead variable (basically calculating what the "next" value is, by group). Then it's just a matter of using whatever unit you want for difftime. A density plot can help you analyze these differences, as shown below.
data <- data %>%
  group_by(CallerId) %>%
  mutate(CallDateTime_Next = lead(CallDateTime)) %>%
  ungroup() %>%
  mutate(
    diff_days = difftime(CallDateTime_Next, CallDateTime, units = 'days'),
    diff_hours = difftime(CallDateTime_Next, CallDateTime, units = 'hours'),
    diff_mins = difftime(CallDateTime_Next, CallDateTime, units = 'mins'),
    diff_secs = difftime(CallDateTime_Next, CallDateTime, units = 'secs')
  )
data %>%
filter(CallOutcome == 'Abandoned') %>%
ggplot() +
geom_density(aes(x = diff_days))
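To get the overall averages in all four units the question asks for, a short follow-up to the block above (a sketch; it averages the gap after every abandoned call that is followed by another call from the same caller):
data %>%
  filter(CallOutcome == 'Abandoned', !is.na(CallDateTime_Next)) %>%
  summarise(avg_secs  = mean(as.numeric(diff_secs)),
            avg_mins  = mean(as.numeric(diff_mins)),
            avg_hours = mean(as.numeric(diff_hours)),
            avg_days  = mean(as.numeric(diff_days)))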

Multiply a grouped data frame by a matrix dplyr

My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made a dummy version of my problem, which I think is clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution. In reality my dimensions are 7 states * 200 industries * 350 occupations, so it becomes rather data hungry.
library(dplyr)
library(tidyr)
# create industry data frame
set.seed(12345)
ind_df <- data.frame(State = c(rep("a", len =6),rep("b", len =6),rep("c", len =6)),
industry = rep(c("Ind1","Ind2","Ind3","Ind4","Ind5","Ind6"), len = 18),
emp = rnorm(18,20,2))
# create occupation data frame
Occ_df <- data.frame(State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len =5)),
occupation = rep(c("Occ1","Occ2","Occ3","Occ4","Occ5"), len = 15),
emp = rnorm(15,10,1))
# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6*5,1,0.5),6,5) %>% as.data.frame()
# name cols in the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)
# solution
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
  group_by(State) %>%
  mutate(Occ1 = emp*Occ1,
         Occ2 = emp*Occ2,
         Occ3 = emp*Occ3,
         Occ4 = emp*Occ4,
         Occ5 = emp*Occ5
  )
Ind_combined <- Ind_combined %>%
  gather(key = "occupation",
         value = "emp2",
         -State,
         -industry,
         -emp
  )
Ind_combined <- Ind_combined %>%
  group_by(State, occupation) %>%
  summarise(emp2 = sum(emp2))
Occ_df <- left_join(Occ_df, Ind_combined)
My solution seems pretty inefficient; is there a better / faster way to do this?
Also, I am not quite sure how to get to this, but the expected outcome would be another column added to Occ_df, called emp2, derived from the ind_df emp column and Ind_Occ_Conc. I have tried to step this out for Occupation 1; essentially, Ind_Occ_Conc contains weights and the result is a weighted average.
I'm not sure what you want to do with the sum(Ind$emp*Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a, get expected outcomes for all states (with dplyr):
Ind <- ind_df %>% group_by(State) %>%
summarize(rez = sum(emp))
# Then do some computations on Ind, which is a N element vector (one for each state)
# ...
# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
The final output would then have the Ind values in a new column: one value for state a, one for b, and one for c.
Hope it will help ;)
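For completeness (not part of either post), the emp2 values from the question's own solution can also be produced with one matrix multiplication per state, which avoids the wide-to-long intermediate step. A sketch, assuming the industries appear within each state in the same order as the rows of Ind_Occ_Conc:
conc <- as.matrix(Ind_Occ_Conc)                 # 6 industries x 5 occupations
emp_by_state <- split(ind_df$emp, ind_df$State) # industry employment vector per state
Occ_totals <- do.call(rbind, lapply(names(emp_by_state), function(s) {
  data.frame(State = s,
             occupation = colnames(conc),
             emp2 = as.vector(emp_by_state[[s]] %*% conc)) # 1x6 %*% 6x5 = one total per occupation
}))
Occ_df <- left_join(Occ_df, Occ_totals, by = c("State", "occupation"))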

In R, is it possible to include the same row in multiple groups, or is there other workaround?

I've measured N2O flux from soil at multiple timepoints in the day (not equally spaced). I'm trying to calculate the total N2O flux from soil for a subset of days by finding the area under the curve for each given day. I know how to do this when using only measures from the given day; however, I'd like to include the last measure of the previous day and the first measure of the following day to improve the estimation of the curve.
Here's an example to give a more concrete idea:
library(MESS)
library(lubridate)
library(dplyr)
Generate Reproducible Example
datetime <- seq(ymd_hm('2015-04-07 11:20'), ymd_hm('2015-04-13 00:00'), by = 'hours')
dat <- data.frame(datetime, day = day(datetime), Flux = rnorm(n = length(datetime), mean = 400, sd = 20))
useDate <- data.frame(day = c(7:12), DateGood = c("No", "Yes", "Yes", "No", "Yes", "No"))
dat <- left_join(dat, useDate)
Some days are "bad" (too many missing measures) and some are "Good" (usable). The goal is to filter all measurements (rows) that occurred on a "Good" day as well as the last measurement from the day before and the first measurement on the next day.
out <- dat %>%
mutate(lagDateGood = lag(DateGood),
leadDateGood = lead(DateGood)) %>%
filter(lagDateGood != "No" | leadDateGood != "No")
Now I need to calculate the area under the curve - this is not correct
out2 <- out %>%
group_by(day) %>%
mutate(hourOfday = hour(datetime) + minute(datetime)/60) %>%
summarize(auc = auc(x = hourOfday, y = Flux, from = 0, to = 24, type = "spline"))
The trouble is that I don't include the measurements at the end of the previous day and the start of the following day when calculating the AUC. Also, I get an estimate of flux for day 10, which is a "bad" day.
I think the crux of my question has to do with groups. Some measurements need to be in multiple groups (for example, the last measurement on day 8 would be used in estimating the AUC for both day 8 and day 9). Do you have suggestions for how I could form new groups? Or might there be a completely different way to achieve the goal?
For what it's worth, this is what I did. The answer really lies in the question I linked to in the comments. Starting with the dataframe "out" from the question:
#Now I need to calculate the area under the curve for each day
n <- nrow(out)
extract <- function(ix) out[seq(max(1, min(ix)-1), min(n, max(ix) + 1)), ]
res <- lapply(split(1:n, out$day), extract)
calcTotalFlux <- function(df) {
  if (nrow(df) < 10) { # make sure the day has at least 10 measures
    NA
  } else {
    day_midnight <- floor_date(df$datetime[2], "day")
    df %>%
      mutate(time = datetime - day_midnight) %>%
      summarize(TotalFlux = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
  }
}
do.call("rbind", lapply(res, calcTotalFlux))
TotalFlux
7 NA
8 585230.2
9 579017.3
10 NA
11 563689.7
12 NA
Here's another way, more in line with the suggestions of @Alex Brown.
# Another way
last <- out %>%
  group_by(day) %>%
  filter(datetime == max(datetime)) %>%
  ungroup() %>%
  mutate(day = day + 1)
first <- out %>%
  group_by(day) %>%
  filter(datetime == min(datetime)) %>%
  ungroup() %>%
  mutate(day = day - 1)
d <- rbind(out, last, first) %>%
  group_by(day) %>%
  arrange(datetime)
n_measures_per_day <- d %>%
  summarize(n = n())
d <- left_join(d, n_measures_per_day) %>%
  filter(n > 4)
TotalFluxDF <- d %>%
  mutate(timeAtMidnight = floor_date(datetime[3], "day"),
         time = datetime - timeAtMidnight) %>%
  summarize(auc = auc(x = time, y = Flux, from = 0, to = 1440, type = "spline"))
TotalFluxDF
Source: local data frame [3 x 2]
day auc
(dbl) (dbl)
1 8 585230.2
2 9 579017.3
3 11 563689.7
