Multiple file processing in R, looping over a variable for data processing

I've written several functions for cleaning and processing 15 samples from the American Community Survey (ACS). This workflow is very laborious and repetitive: reading in each file, applying my functions, and moving on to the next survey year.
My current workflow is like this:
library(tidyverse)
library(ids)

wage_2005 <- haven::read_dta("~/Data/ACS/2005_ACS.dta") %>%
  gen.wages() %>%
  reg.variables() %>%
  wage.adj(year = 2005) %>%
  wage.sample(year = 2005)
And moving on to 2006, 2007, and so on until 2019. For instance,
wage_2006 <- haven::read_dta("~/Data/ACS/2006_ACS.dta") %>%
  gen.wages() %>%
  reg.variables() %>%
  wage.adj(year = 2006) %>%
  wage.sample(year = 2006)
What I would like to do is process each sample iteratively: loop through the files in succession, use the year variable in each file to apply the appropriate processing for that survey year, and store the result for each survey year in a list.
As a first step, I have written some code that reads in the files using sapply:
files <- list.files(path = "~/Data/ACS", full.names = TRUE)

data_files <- sapply(files, function(x) {
  haven::read_dta(file = x)
}, USE.NAMES = TRUE, simplify = FALSE)
But this takes an enormous amount of memory, as the files come from the Census Bureau and are quite large. I am stuck on the next steps: iteratively processing each file, applying my functions, and storing the result in a list.
Some pseudo code to give a clearer idea:
for year in years
  read in data file
  apply functions
  store results
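For concreteness, that pseudocode might look like the sketch below. It assumes the cleaning functions above (gen.wages, reg.variables, wage.adj, wage.sample) are defined and that each file is named like 2005_ACS.dta, so the year can be recovered from the file name; only one raw file is held in memory at a time.
files <- list.files(path = "~/Data/ACS", full.names = TRUE)
years <- as.numeric(sub("_ACS\\.dta$", "", basename(files)))

results <- vector("list", length(files))
names(results) <- years
for (i in seq_along(files)) {
  # read one file, process it, and keep only the processed result
  results[[i]] <- haven::read_dta(files[i]) %>%
    gen.wages() %>%
    reg.variables() %>%
    wage.adj(year = years[i]) %>%
    wage.sample(year = years[i])
}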
Say for example that I have three sets of data, something like
acs_2005 <-
  data.frame(id = random_id(n = 1000, bytes = 16, use_openssl = TRUE),
             wage = runif(1000, min = 0, max = 100),
             year = 2005)
acs_2006 <-
  data.frame(id = random_id(n = 1000, bytes = 16, use_openssl = TRUE),
             wage = runif(1000, min = 0, max = 100),
             year = 2006)
acs_2007 <-
  data.frame(id = random_id(n = 1000, bytes = 16, use_openssl = TRUE),
             wage = runif(1000, min = 0, max = 100),
             year = 2007)
data <- list(acs_2005 = acs_2005, acs_2006 = acs_2006, acs_2007 = acs_2007)
And let's say they are written out as CSV files, so they can be read back in as in the real workflow:
lapply(seq_along(data), function(i) {
  write.csv(data[[i]], file = paste0(names(data)[i], ".csv"), row.names = FALSE)
})
My custom function is:
wage_summarize <- function(df, year) {
  mutate(df, wage = case_when(
    year == 2005 ~ wage / 0.7903,
    year == 2006 ~ wage / 0.8112,
    year == 2007 ~ wage / 0.8323)) %>%
    group_by(year) %>%
    summarize(wage = mean(wage, na.rm = TRUE))
}
How would I iterate through this list of data frames when the function depends on the year variable to perform the operation, in this case hypothetically adjusting for inflation?
Any help or guidance in this would be much appreciated, thank you!

This should serve as guidance, given the information you provided:
library(tidyverse)

# Simulate multiple data sets that have been
# loaded by some read_data function
data_list <- list(
  mtcars,
  diamonds,
  iris
)

# Iterate through the list
# of data with some function
data_list <- data_list %>% map(
  .f = function(x) {
    x %>% mutate(
      row_id = row_number()
    )
  }
)
Here we loaded the data and stored it in a list, which simulates reading one data set at a time. We then applied a function to each element using dplyr. The output is a list of the same length!
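Applied to the question's own example, the same pattern is a one-liner (a sketch): because mutate()'s data masking resolves year to the column inside each data frame, wage_summarize can be mapped over the list directly, and its year argument is never actually used.
# one summarized data frame per survey year, kept in a named list
results <- map(data, wage_summarize)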
Please refer to Programming with dplyr for more information on writing custom functions for your data.
If you want to do this in parallel, that is also possible, but it is OS-specific: if you are on a UNIX-alike, then mclapply() is your go-to function.
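A minimal sketch of the parallel version (it assumes a UNIX-alike; mc.cores sets the number of worker processes):
library(parallel)

# same per-element transformation as above, spread across two cores
data_list <- mclapply(
  data_list,
  function(x) x %>% mutate(row_id = row_number()),
  mc.cores = 2
)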
You asked for some guidance, and this is what I could provide with the information you gave.

Why not just combine the list of data frames into one?
library(dplyr)

bind_rows(data) %>%
  mutate(wage = wage / case_when(
    year == 2005 ~ 0.7903,
    year == 2006 ~ 0.8112,
    year == 2007 ~ 0.8323)) %>%
  group_by(year) %>%
  summarise(wage = mean(wage, na.rm = TRUE))
#   year  wage
#  <dbl> <dbl>
# 1 2005  63.0
# 2 2006  61.9
# 3 2007  59.8
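The same approach also works straight from the CSV files written above, without holding the list in memory first (a sketch; the file-name pattern is an assumption):
# read every acs_*.csv in the working directory and combine
files <- list.files(pattern = "^acs_\\d{4}\\.csv$")
acs <- bind_rows(lapply(files, read.csv))
The mutate()/summarise() pipeline above then applies unchanged to acs.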

Related

How to sample without replacement within groups in R

I have a data frame which contains a 'year' variable with values between 1 and 100000 repeating multiple times. I have another data frame with 1000 'loss amounts' with an associated probability for each loss. I'd like to merge loss amounts onto the year data frame by sampling from the loss amounts table. I want to sample without replacement within each level of the year variable e.g. within each level of the year variable the loss amounts should be unique.
Reproducible example below, where I can only get it to sample without replacement across the full 'year' dataset, not within the different levels of the year variable as required. Is there a way of doing this (ideally without using loops, as I need the code to run quickly)?
library(tidyverse)

# mean frequency
freq <- 100
years <- 100000

# create data frame with the number of losses in each year
num_losses <- rpois(years, freq)
year <- tibble(index = 1:length(num_losses), num = num_losses)
year <- map2(year$index, year$num, function(x, y) rep(x, y)) %>%
  unlist() %>%
  tibble(year = .)

# lookup table with loss amounts
lookup <- tibble(prob = runif(1000, 0, 1),
                 amount = rgamma(1000, shape = 1.688, scale = 700000)) %>%
  mutate(total_prob = cumsum(prob) / sum(prob),
         pdf = total_prob - lag(total_prob),
         pdf = ifelse(is.na(pdf), total_prob, pdf))

# add amounts to the year table by sampling from the lookup table
sample_from_lookup <- function(number) {
  sample(lookup$amount, number, replace = FALSE, prob = lookup$pdf)
}
amounts <- sample_from_lookup(nrow(year))
year <- tibble(year = year$year, amount = amounts)
According to your description, maybe you can try replicate within your sample_from_lookup, i.e.,
sample_from_lookup <- function(number) {
  replicate(number,
            sample(lookup$amount,
                   1,
                   replace = FALSE,
                   prob = lookup$pdf))
}
In this case, you need to set size = 1 in your sample() call.
I ended up using split to break the 'year' data into groups within a list, then running the (slightly amended) sample_from_lookup function on each element of the list using map. Amended code below.
# mean frequency
freq <- 5
years <- 100

# create data frame with the number of losses in each year
num_losses <- rpois(years, freq)
year <- tibble(index = 1:length(num_losses), num = num_losses)
year <- map2(year$index, year$num, function(x, y) rep(x, y)) %>%
  unlist() %>%
  tibble(year = .)
year_split <- split(year, year$year)

# lookup table
lookup <- tibble(prob = runif(1000, 0, 1),
                 amount = rgamma(1000, shape = 1.688, scale = 700000)) %>%
  mutate(total_prob = cumsum(prob) / sum(prob),
         pdf = total_prob - lag(total_prob),
         pdf = ifelse(is.na(pdf), total_prob, pdf))

# add amounts to the year table by sampling from the lookup table
sample_from_lookup <- function(x) {
  number <- NROW(x)
  sample(lookup$amount, number, replace = FALSE, prob = lookup$pdf)
}
amounts <- map(year_split, sample_from_lookup) %>% unlist() %>% tibble(amount = .)
year <- tibble(year = year$year, amount = amounts$amount)
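For what it's worth, the split/map pair can also be written as a single grouped mutate, since sample() is then evaluated once per group (a sketch, assuming year and lookup as defined above):
year <- year %>%
  group_by(year) %>%
  mutate(amount = sample(lookup$amount, n(), replace = FALSE, prob = lookup$pdf)) %>%
  ungroup()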

How to calculate correlation by group

I am trying to run a for loop to calculate correlations for each level of a factor variable. I have 16 rows of data for each of 32 teams in my data set. I want to correlate year with points for each of the teams individually. I can do this one by one but want to get better at looping.
correlate <- data %>%
  select(Team, Year, Points_Game) %>%
  filter(Team == "ARI") %>%
  select(Year, Points_Game)
cor(correlate)
I made an object "teams" by:
teams <- levels(data$Team)
A little help in using [i] to iterate over all 32 teams to get each team's correlation of year and points would be greatly appreciated!
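For reference, the literal [i] loop might look like this minimal sketch (assuming data with Team, Year, and Points_Game columns, and teams as defined above):
# one correlation per team, stored in a named vector
r <- setNames(numeric(length(teams)), teams)
for (i in seq_along(teams)) {
  team_rows <- data$Team == teams[i]
  r[i] <- cor(data$Year[team_rows], data$Points_Game[team_rows])
}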
require(dplyr)

# dummy data
data <- data.frame(
  Team = sapply(1:32, function(x) paste0("T", x)),
  Year = rep(c(2000:2009), 32),
  Points_Game = rnorm(320, 100, 10)
)

# find correlation of Year and Points_Game for each team
# r - correlation coefficient
correlate <- data %>%
  group_by(Team) %>%
  summarise(r = cor(Year, Points_Game))
The data.table way:
library(data.table)

# dummy data (same as @Aleksandr's)
dat <- data.table(
  Team = sapply(1:32, function(x) paste0("T", x)),
  Year = rep(c(2000:2009), 32),
  Points_Game = rnorm(320, 100, 10)
)

# find correlation of Year and Points_Game for each Team
result <- dat[, .(r = cor(Year, Points_Game)), by = Team]

Multiply a grouped data frame by a matrix in dplyr

My problem:
I have two data frames, one for industries and one for occupations. They are nested by state, and show employment.
I also have a concordance matrix, which shows the weights of each of the occupations in each industry.
I would like to create a new employment number in the Occupation data frame, using the Industry employments and the concordance matrix.
I have made a dummy version of my problem, which I think is clear:
Update
I have solved the issue, but I would like to know if there is a more elegant solution. In reality my dimensions are 7 states x 200 industries x 350 occupations, so the approach becomes rather data hungry.
library(tidyverse)

# create industry data frame
set.seed(12345)
ind_df <- data.frame(
  State = c(rep("a", len = 6), rep("b", len = 6), rep("c", len = 6)),
  industry = rep(c("Ind1", "Ind2", "Ind3", "Ind4", "Ind5", "Ind6"), len = 18),
  emp = rnorm(18, 20, 2)
)

# create occupation data frame
Occ_df <- data.frame(
  State = c(rep("a", len = 5), rep("b", len = 5), rep("c", len = 5)),
  occupation = rep(c("Occ1", "Occ2", "Occ3", "Occ4", "Occ5"), len = 15),
  emp = rnorm(15, 10, 1)
)

# create concordance matrix
Ind_Occ_Conc <- matrix(rnorm(6 * 5, 1, 0.5), 6, 5) %>% as.data.frame()

# name rows and columns of the concordance matrix
colnames(Ind_Occ_Conc) <- unique(Occ_df$occupation)
rownames(Ind_Occ_Conc) <- unique(ind_df$industry)

# solution
Ind_combined <- cbind(Ind_Occ_Conc, ind_df)
Ind_combined <- Ind_combined %>%
  group_by(State) %>%
  mutate(Occ1 = emp * Occ1,
         Occ2 = emp * Occ2,
         Occ3 = emp * Occ3,
         Occ4 = emp * Occ4,
         Occ5 = emp * Occ5)
Ind_combined <- Ind_combined %>%
  gather(key = "occupation",
         value = "emp2",
         -State, -industry, -emp)
Ind_combined <- Ind_combined %>%
  group_by(State, occupation) %>%
  summarise(emp2 = sum(emp2))
Occ_df <- left_join(Occ_df, Ind_combined)
My solution seems pretty inefficient; is there a better/faster way to do this?
Also, I am not quite sure how to get to this, but the expected outcome would be another column added to Occ_df called emp2, derived from the ind_df emp column and Ind_Occ_Conc. I have tried to step this out for Occupation 1; essentially Ind_Occ_Conc contains weights, and the result is a weighted average.
I'm not sure what you want to do with the sum(Ind$emp * Occ1_coeff) line, but maybe this is what you're looking for:
# Instead of doing the computation only for state a,
# get expected outcomes for all states (with dplyr):
Ind <- ind_df %>%
  group_by(State) %>%
  summarize(rez = sum(emp))

# Then do some computations on Ind, which is an
# N-element vector (one for each state)
# ...

# And finally, join Ind and Occ_df using merge
Occ_df <- merge(x = Occ_df, y = Ind, by = "State", all = TRUE)
Final output would then have Ind values in a new column: one value for all a, one value for b and one value for c.
Hope it will help ;)
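Since the concordance really is a matrix, a more direct route is plain matrix multiplication per state. This is a sketch, not part of the answer above; it assumes the objects from the question, with ind_df sorted by State and then industry as in the question's setup.
conc <- as.matrix(Ind_Occ_Conc)                  # 6 industries x 5 occupations
emp_by_state <- split(ind_df$emp, ind_df$State)  # one length-6 vector per state
emp2_list <- lapply(emp_by_state, function(e) drop(e %*% conc))  # 1 x 5 per state

# reshape to long form and join onto Occ_df
emp2_df <- data.frame(
  State = rep(names(emp2_list), each = ncol(conc)),
  occupation = rep(colnames(conc), times = length(emp2_list)),
  emp2 = unlist(emp2_list)
)
Occ_df <- left_join(Occ_df, emp2_df, by = c("State", "occupation"))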

Baffling error using dataprep function in R Synth package

I am trying to use the 'Synth' package in R to explore the effect that certain coups had on economic growth in the countries where they occurred, but I'm hung up on an error I can't understand. When I attempt to run dataprep(), I get the following:
Error in dataprep(foo = World, predictors = c("rgdpe.pc", "population.ln", :
unit.variable not found as numeric variable in foo.
That's puzzling because my data frame, World, does include a numeric id called "idno" as specified in the call to dataprep().
Here is the script I'm using. It ingests a .csv with the requisite data from GitHub. The final step, the call to dataprep(), is where the error arises. I would appreciate help in figuring out why this error arises and how to avoid it, so I can get on to the synth() part to follow.
library(dplyr)
library(Synth)
# DATA INGESTION AND TRANSFORMATION
World <- read.csv("https://raw.githubusercontent.com/ulfelder/coups-and-growth/master/data.raw.csv", stringsAsFactors=FALSE)
World$rgdpe.pc = World$rgdpe/World$pop # create per capita version of GDP (PPP)
World$idno = as.numeric(as.factor(World$country)) # create numeric country id
World$population.ln = log(World$population/1000) # population size in 1000s, logged
World$trade.ln = log(World$trade) # trade as % of GDP, logged
World$civtot.ln = log1p(World$civtot) # civil conflict scale, +1 and logged
World$durable.ln = log1p(World$durable) # political stability, +1 and logged
World$polscore = with(World, ifelse(polity >= -10, polity, NA)) # create version of Polity score that's missing for -66, -77, and -88
World <- World %>% # create clocks counting years since last coup (attempt) or 1950, whichever is most recent
  arrange(countrycode, year) %>%
  mutate(cpt.succ.d = ifelse(cpt.succ.n > 0, 1, 0),
         cpt.any.d = ifelse(cpt.succ.n > 0 | cpt.fail.n > 0, 1, 0)) %>%
  group_by(countrycode, idx = cumsum(cpt.succ.d == 1L)) %>%
  mutate(cpt.succ.clock = row_number()) %>%
  ungroup() %>%
  select(-idx) %>%
  group_by(countrycode, idx = cumsum(cpt.any.d == 1L)) %>%
  mutate(cpt.any.clock = row_number()) %>%
  ungroup() %>%
  select(-idx) %>%
  mutate(cpt.succ.clock.ln = log1p(cpt.succ.clock),  # include +1 log versions
         cpt.any.clock.ln = log1p(cpt.any.clock))
# THAILAND 2006
THI.coup.year = 2006
THI.years = seq(THI.coup.year - 5, THI.coup.year + 5)

# Get names of countries that had no coup attempts during the window the
# analysis will cover. If you wanted to restrict the comparison to a specific
# region or in any other categorical way, this would be the place to do that.
THI.controls <- World %>%
  filter(year >= min(THI.years) & year <= max(THI.years)) %>%  # filter to desired years
  group_by(idno) %>%  # organize by country
  summarise(coup.ever = sum(cpt.any.d)) %>%  # count years with coup attempts per country
  filter(coup.ever == 0) %>%  # keep only the ones with 0 counts
  select(idno)  # cut down to country ids
THI.controls = unlist(THI.controls)  # convert that data frame to a vector
names(THI.controls) = NULL  # strip the vector of names
THI.synth.dat <- dataprep(
  foo = World,
  predictors = c("rgdpe.pc", "population.ln", "trade.ln", "fcf", "govfce",
                 "energy.gni", "polscore", "durable.ln", "cpt.any.clock.ln",
                 "civtot.ln"),
  predictors.op = "mean",
  time.predictors.prior = seq(from = min(THI.years), to = THI.coup.year - 1),
  dependent = "rgdpe.pc",
  unit.variable = "idno",
  unit.names.variable = "country",
  time.variable = "year",
  treatment.identifier = unique(World$idno[World$country == "Thailand"]),
  controls.identifier = THI.controls,
  time.optimize.ssr = seq(from = THI.coup.year, to = max(THI.years)),
  time.plot = THI.years
)
Too long for a comment.
Your dplyr statement:
World <- World %>% ...
converts World from a data.frame to a tbl_df object (read the dplyr docs). Unfortunately, this causes mode(World[, "idno"]) to return "list", not "numeric", so dataprep's test for a numeric unit.variable fails.
You can fix this by adding
World <- as.data.frame(World)
just before the call to dataprep(...).
Unfortunately (again), you now get a different error, which may be due to the logic of your dplyr statement.
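A quick way to see the difference for yourself (a small sketch, using a tibble to stand in for the tbl_df):
library(dplyr)

tb <- tibble(idno = 1:3)
mode(tb[, "idno"])                 # "list": single-bracket indexing on a tibble returns a tibble
mode(as.data.frame(tb)[, "idno"])  # "numeric": a data.frame drops a single column to a vector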

Updating a list of models with a list of data in data.table

I'm working with multiple time series models, and hoping to manage them with data.table:
pkg <- c("data.table", "magrittr", "forecast")
sapply(pkg, library, character.only = TRUE)
df <- data.frame(
group = rep(c("a", "b", "c"), each = 10),
val = sample(1:10, 30, replace = TRUE)
) %>% as.data.table
I can successfully generate the model (and a time series ready for the next step):
t1 <- df[, list(
  tsAll = list(val %>% as.ts),
  mod1 = list(val %>% as.ts %>% window(1, 7) %>% ets)
), by = group]
I'm now trying to update the ets model, which involves passing 'tsAll' into the ets function and providing the model to use (mod1).
This does not work:
t1[, lapply(tsAll, ets, model = mod1)]
I've also tried:
t1[, lapply(tsAll, ets, model = mod1[[1]])]
This runs, but it looks like the same model is returned for every row.
Possibly I have solved the question but run into this issue:
Why is using update on a lm inside a grouped data.table losing its model data?
Can anyone help with the next step?
Is this what you want:
t1[, list(lapply(tsAll, ets, model = mod1[[1]])), by = group]$V1
I put the result in a list so that the data type is preserved (rather than being converted into a vector), and did the operation by group, since each group has its own model.
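If you want each group's series paired with that group's own model in a new column, a Map-based sketch (assuming t1 as built above, with one row per group) does the row-wise pairing explicitly:
# pair each tsAll element with its corresponding mod1 element
t1[, mod2 := Map(function(y, fit) ets(y, model = fit), tsAll, mod1)]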
