Related
Imagine a dataframe:
df1 <- tibble::tribble( ~City, ~Year, ~Burger_cost, ~Cola_cost, ~Resident_AVGGrowth_cm,
"Abu Dhabi", 2020, 2, 3, 175,
"Abu Dhabi", 2019, 1, 3, 174,
"Abu Dhabi", 2018, 1, 2, 173,
"New York", 2020, 4, 5, 500,
"New York", 2019, 3, 5, 184,
"New York", 2018, 2, 3, 183,
"Abu Dhabi", 2020, 2, 3, 175,
"Abu Dhabi", 2019, 1, 3, 174,
"Abu Dhabi", 2018, 1, 2, 173,
"Abu Dhabi", 2017, 1, NA, 100,
"London", 2020, 5, 6, 186,
"London", 2019, 4, 6, 188,
"London", 2018, 3, 5, 185,
"New York", 2020, 4, 5, 185,
"New York", 2019, 3, 5, 184,
"New York", 2018, 3, 3, 183,
"London", 2020, 5, 6, 186,
"London", 2019, 4, 6, 188,
"London", 2018, 3, 5, 185)
There can be many such inputs. For example, the London data is identical for all years, so its duplicates can simply be deleted. The Abu Dhabi data is identical for 2018-2020 but differs for 2017.
The New York data contains a discrepancy in 2018 for the burger cost (and for the average resident height).
The average resident height is clearly erroneous in row 4 for NY (the discrepancy is far too large), but a duplicate value exists in row 16 (in this case we would prefer to delete row 4 and keep row 16, based on the criterion that no person can be shorter than 50 cm or taller than 4 meters [400 cm] [especially in the mean values :)]).
Rows 6 and 16 (for NY, 2018) contain different values for the burger cost, which can only be resolved by a human (say, the 3 USD variant is correct, but R needs to show the discrepancy to the end user).
The question: can you suggest neat methods for performing these operations? What do you usually use to solve such issues?
I have only just started thinking about this problem.
It would be an easy task in C#. I have several strategies in mind, but I would be extremely interested to know what approaches you use for solving such issues :) Any ideas are much appreciated :)
Perhaps this helps: grouped by 'City', create a flag for outliers in 'Resident_AVGGrowth_cm' using boxplot(); then add 'Year' to the grouping and create a second flag for unique values across the columns Burger_cost to Resident_AVGGrowth_cm, using n_distinct inside if_all (which returns TRUE only if all the columns it loops over return TRUE for the row). Finally, grouped by City, filter out the cities whose rows are all duplicates (e.g. London), and also remove the rows with the outlier flag.
library(dplyr)
df1 %>%
  group_by(City) %>%
  mutate(outlier_flag = Resident_AVGGrowth_cm %in%
           boxplot(Resident_AVGGrowth_cm, plot = FALSE)$out) %>%
  group_by(Year, .add = TRUE) %>%
  mutate(flag_all_unq = if_all(Burger_cost:Resident_AVGGrowth_cm,
                               ~ n_distinct(.x, na.rm = TRUE) == 1)) %>%
  group_by(City) %>%
  filter(!all(flag_all_unq)) %>%
  ungroup() %>%
  filter((!outlier_flag) | flag_all_unq)
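As a complementary sketch (not part of the answer above): the exact duplicates can also be dropped up front with distinct(), and the City/Year groups that still disagree can be pulled out explicitly so that a human can resolve them:
library(dplyr)
# drop rows that are exact duplicates of another row (handles the repeated London block)
df1_unique <- distinct(df1)
# City/Year groups that still report conflicting values: show these to the end user
conflicts <- df1_unique %>%
  group_by(City, Year) %>%
  filter(n_distinct(Burger_cost) > 1 | n_distinct(Resident_AVGGrowth_cm) > 1) %>%
  ungroup()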
I have panel data that looks like this:
x <- structure(list(id2 = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 8,
8, 8, 8, 8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11), private = c(1,
1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,
0, 0, 0, 0, 1, 1, 1), capex = c(-0.003423963, -0.028064674, -0.03058208,
-0.00186256, -0.010839419, 0.052905358, 0.058931317, 0.065547734,
0.007644231, -0.025942514, 0.00163772, -0.007530502, 0.010706151,
0.025040116, 0.035105374, 0.036772128, 0.03886272, 0.045399148,
0.042642809, 0.080788992, 0.080848917, 0.057645567, 0.057636742,
0.046084184, 0.041080192, 0.05690382, 0.057325598, 0.051791377,
0.070084445, 0.069627948, 0.077849329, 0.077247024, 0.081251919,
0.071702167, 0.078424804, 0.077482991, 0.078969546, 0.076208547,
0.059055354, 0.056043826, 0.029450929, 0.016044363, 0.048353843,
0.047607671, 0.046497576, 0.047454875, 0.050881654, 0.047155183,
0.055546004, 0.057564467), roa = c(-0.003078416, -0.035302367,
-0.01884984, 0.002839225, -0.001113289, 0.024291474, 0.040153231,
0.051482957, 0.026102768, 0.005372915, 0.004466314, -0.025178509,
-0.002043936, -0.069235161, 0.023604594, 0.010512878, 0.021912357,
0.016721437, -0.09472625, 0.04061316, 0.074661337, 0.0214584,
0.047743626, 0.013149141, -0.010418181, 0.025671346, 0.031785361,
0.084893651, 0.018490626, 0.024941774, 0.023567598, 0.031878859,
0.029931206, 0.043837443, 0.041305128, 0.041293647, 0.039307728,
0.046259467, 0.017479861, 0.029429797, 0.023826957, 0.00763736,
0.017485917, 0.017156925, 0.006504506, 0.021350464, 0.032917287,
0.036106978, 0.04545372, 0.049348988), year = c(2011, 2012, 2013,
2014, 2015, 2011, 2012, 2013, 2011, 2012, 2013, 2014, 2015, 2011,
2012, 2013, 2014, 2015, 2011, 2012, 2013, 2014, 2015, 2011, 2012,
2013, 2014, 2015, 2011, 2012, 2013, 2014, 2015, 2011, 2012, 2013,
2014, 2015, 2011, 2012, 2013, 2014, 2011, 2012, 2013, 2014, 2015,
2011, 2012, 2013)), row.names = c(NA, -50L), class = c("tbl_df",
"tbl", "data.frame"))
Here id2 is the firm ID and private is an indicator of private/public status. My goal is to run a randomization test for R-squared as follows:
regress roa on capex for private firms (i.e. private==1) and public firms (private==0) separately, and get the observed difference in R-squared
randomly assign firms to private/public status (note that the data is a panel)
rerun the regressions and get the difference in R-squared for the random sample
repeat this, say, 1000 times
measure the p-value as the number of times the randomly generated difference in R-squared is larger than the observed difference, divided by the number of iterations (1,000)
My issue is that this code takes ages to run; it would be great if someone has an idea of a better way to do this.
You will need the estimatr and tidyverse packages to run this code:
library(estimatr)
library(tidyverse)
# run model 1: public firms (private == 0)
mod1 <- lm_robust(roa ~ capex,
                  clusters = id2,
                  se_type = "stata",
                  data = x,
                  subset = private == 0)
# run model 2: private firms (private == 1)
mod2 <- lm_robust(roa ~ capex,
                  clusters = id2,
                  se_type = "stata",
                  data = x,
                  subset = private == 1)
# obtain the observed difference in adjusted R-squared
R2.obs1 <- summary(mod1)$adj.r.squared
R2.obs2 <- summary(mod2)$adj.r.squared
diff_r2_obs <- R2.obs2 - R2.obs1
# create a list for the simulated differences in r2
simulated_r2 <- list()
# prepare the loop
set.seed(8)
nreps <- 1000
for (i in 1:nreps) {
  x1 <- x %>% # randomize the variable private, taking into account that each id appears several times
    distinct(id2, private) %>%
    mutate(private1 = sample(private), replace = T) %>%
    left_join(x, by = "id2")
  model_m <- lm_robust(roa ~ capex,
                       clusters = id2,
                       se_type = "stata",
                       data = x1,
                       subset = private1 == 0)
  R2.obs_m <- summary(model_m)$adj.r.squared
  model_f <- lm_robust(roa ~ capex,
                       clusters = id2,
                       se_type = "stata",
                       data = x1,
                       subset = private1 == 1)
  R2.obs_f <- summary(model_f)$adj.r.squared
  r2_diff_sim <- R2.obs_f - R2.obs_m
  simulated_r2[i] <- r2_diff_sim
}
simulated_r2 <- unlist(simulated_r2)
exceed_count <- length(simulated_r2[simulated_r2 >= diff_r2_obs])
p_val <- exceed_count / nreps
p_val
I initialized simulated_r2 as a vector rather than a list; we benefit a lot when we predefine its length=. I also replaced the dplyr code with a transform() approach, which appears to be faster. It actually runs quite fast with 1000 reps on this data. (Note that there was a mistake in your code when using sample: the replace= argument is outside the call but should be inside it!)
nreps <- 1000
simulated_r2 <- numeric(length=nreps)
set.seed(8)
system.time(
for (i in seq_len(nreps)) {
x1 <- transform(subset(x, !duplicated(id2)),
private=sample(private, replace=T))
model_m <- estimatr::lm_robust(roa ~ capex, clusters=id2, se_type="stata",
data=x1, subset=private == 0)
R2.obs_m <- summary(model_m)$adj.r.squared
model_f <- estimatr::lm_robust(roa ~ capex, clusters=id2, se_type="stata",
data=x1, subset=private == 1)
R2.obs_f <- summary(model_f)$adj.r.squared
r2_diff_sim <- R2.obs_f - R2.obs_m
simulated_r2[i] <- r2_diff_sim
}
)
# user system elapsed
# 8.262 0.000 8.267
(p_val <- length(simulated_r2[simulated_r2 >= diff_r2_obs])/nreps)
# [1] 0.273
Note that warnings occur if you set replace=TRUE, probably when the number of observations in one of the 0/1 subsets gets too small. You should rethink your approach a little.
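If those warnings matter for your application, one option (just a sketch, not part of the answer above; the min_n threshold is an arbitrary illustrative choice) is to redraw any permutation that leaves one of the two status groups with too few firms:
# sketch: resample the private indicator until both groups contain at least min_n firms
resample_private <- function(p, min_n = 3) {
  repeat {
    p_new <- sample(p, replace = TRUE)
    if (min(table(factor(p_new, levels = c(0, 1)))) >= min_n) return(p_new)
  }
}
# inside the loop this would replace the plain sample() call:
# x1 <- transform(subset(x, !duplicated(id2)), private = resample_private(private))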
Parallelize
If this is too slow, you could parallelize the code.
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterExport(cl, 'x')
clusterSetRNGStream(cl, 8) ## set seed
nreps <- 10000
system.time(
simulated_r2 <- parSapplyLB(cl, seq_len(nreps), \(i) {
x1 <- transform(subset(x, !duplicated(id2)),
private=sample(private, replace=F))
model_m <- estimatr::lm_robust(roa ~ capex, clusters=id2, se_type="stata",
data=x1, subset=private == 0)
R2.obs_m <- summary(model_m)$adj.r.squared
model_f <- estimatr::lm_robust(roa ~ capex, clusters=id2, se_type="stata",
data=x1, subset=private == 1)
R2.obs_f <- summary(model_f)$adj.r.squared
R2.obs_f - R2.obs_m
})
)
# user system elapsed
# 0.067 0.058 17.820
(p_val <- length(simulated_r2[simulated_r2 >= diff_r2_obs])/nreps)
# [1] 0.257
stopCluster(cl)
This takes a different approach from jay.sf's nice solution. Here, I use data.table, create the random permutations of private/public status, and use a small function to get the difference(s) in R-squared:
library(data.table)
library(estimatr)
setDT(x)
set.seed(8)
nperms=1000
# get permutations of public/private status
perms = cbind(unique(x[, .(id2)]),
              unique(x[, .(id2, private)])[, lapply(1:nperms, function(p) sample(private, replace=F))])
# function to get R-sq differences
obsR <- function(r,c,id,p) {
r0 = lm_robust(r[p==0]~c[p==0],clusters=id[p==0],se_type="stata")$adj.r.squared
r1 = lm_robust(r[p==1]~c[p==1],clusters=id[p==1],se_type="stata")$adj.r.squared
r1-r0
}
# empirical r-sq diff
empirR = x[, obsR(roa,capex,id2,private)]
# simulated r-sq differences
simR = x[perms, on=.(id2)][, sapply(.SD, function(v) obsR(roa,capex,id2,v)), .SDcols = patterns("V")]
# pvalue
sum(simR>=empirR)/nperms
With the small provided dataset, this can do a thousand estimates and get the p-value in under 5 seconds on my machine.
user system elapsed
4.56 0.14 4.70
The objective of my code is to apply a percentile-based cutoff on a specific column defined by a group.
I found several threads on SO such as:
Efficient way to filter one data frame by ranges in another
Subsetting data frame with multiple date conditions for ranges in between
How to filter cases in a data.table by multiple conditions defined in another data.table
Unfortunately, these threads either don't apply the filter by group or don't use data.table or base R.
I am specifically looking for a method without a join. A base R method would be fine, but I would really love a data.table-based method because my data is huge. I was able to do what I want with a join, but I am looking for an even better method that avoids the join if possible.
Here's my input data:
Input_File <- structure(list(Zone = c("East", "East", "East", "East", "East",
"East", "East", "West", "West", "West", "West", "West", "West",
"West"), Fiscal.Year = c(2016, 2016, 2016, 2016, 2016, 2016,
2017, 2016, 2016, 2016, 2017, 2017, 2018, 2018), Transaction.ID = c(132,
133, 134, 135, 136, 137, 171, 171, 172, 173, 175, 176, 177, 178
), L.Qty = c(3, 0, 0, 1, 0, 0, 1, 1, 1, 2, 2, 1, 2, 1), A.Qty = c(0,
0, 0, 2, 2, 3, 0, 0, 0, 0, 0, 3, 0, 0), I.Qty = c(2, 2, 2, 0,
1, 0, 3, 0, 0, 0, 1, 0, 1, 1)), .Names = c("Zone", "Fiscal.Year",
"Transaction.ID", "L.Qty", "A.Qty", "I.Qty"), row.names = c(NA,
-14L), class = "data.frame")
Here's my code (using join):
Input_File <- data.table::as.data.table(Input_File)
Q <- data.table::as.data.table(data.frame(Zone = c("East", "West"),
                                          Ten_percentile = c(2017, 2018)))
O <- Q[Input_File, on = "Zone"][Fiscal.Year >= Ten_percentile]
A brief explanation of my code: I am applying the Ten_percentile cutoff to Fiscal.Year, grouped by Zone.
Here's the cutoff table:
Q
Zone Ten_percentile
1: East 2017
2: West 2018
Here's the expected output:
O
Zone Ten_percentile Fiscal.Year Transaction.ID L.Qty A.Qty I.Qty
1: East 2017 2017 171 1 0 3
2: West 2018 2018 177 2 0 1
3: West 2018 2018 178 1 0 1
and here's the output in dput format
structure(list(Zone = structure(c(1L,2L,2L),
.Label = c("East","West"), class = "factor"),
Ten_percentile = c(2017,2018,2018),
Fiscal.Year = c(2017,2018,2018),
Transaction.ID = c(171,177,178), L.Qty = c(1,2,1),
A.Qty = c(0,0,0), I.Qty = c(3,1,1)),
.Names = c("Zone","Ten_percentile","Fiscal.Year","Transaction.ID",
"L.Qty","A.Qty","I.Qty"), class = "data.frame", row.names = c(NA,
-3L))
Thanks in advance for any help extended to me. I am a big fan of data.table. Hence, I want to learn different ways to solve the same problem and become well versed in data.table and base-R.
We can do a non-equi join
res <- as.data.table(Input_File)[Q, c(.SD, list(Ten_percentile = Ten_percentile)),
on = .(Zone, Fiscal.Year >= Ten_percentile)]
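Since the question explicitly asks for a way that avoids a join, here is a rough join-free sketch: look up each row's cutoff with match() and subset on the comparison. It assumes every Zone occurs exactly once in Q, and it does not carry the Ten_percentile column into the output:
library(data.table)
setDT(Input_File)
# vectorised lookup of each row's cutoff, then a plain subset (no join)
cutoff <- Q$Ten_percentile[match(Input_File$Zone, Q$Zone)]
O2 <- Input_File[Fiscal.Year >= cutoff]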
I have a dataset with x countries over y years.
I would like to do a certain analysis (indicated below, but this code is not the problem).
The problem: I would like to run the analysis I already have a number of times, each time on a different dataset containing another combination of the x countries and y years. To be clear: I would like to do the analysis for EACH possible combination of the x countries and the y years.
The code that I would like to execute for each version of the dataset (see further down for an explanation of the dataset):
library(stats)
##### the analysis for one dataset ####
d <- data.frame(outcome_spring = rep(1, 999), outcome_summer = rep(1, 999),
                outcome_autumn = rep(1, 999), outcome_winter = rep(1, 999))
o <- lapply(1:999, function(i) {
  # keep only the observations with rainfed at or below the threshold i
  Alldata_Rainfed <- subset(Alldata, rainfed <= i)
  outcome_spring <- sum(Alldata_Rainfed$spring)
  outcome_summer <- sum(Alldata_Rainfed$summer)
  outcome_autumn <- sum(Alldata_Rainfed$autumn)
  outcome_winter <- sum(Alldata_Rainfed$winter)
  d[i, ] <- c(outcome_spring, outcome_summer, outcome_autumn, outcome_winter)
})
combination <- as.data.frame(do.call(rbind, o)) # the output I want is another dataset for each unique input dataset
#### the end of the analysis for one dataset ####
Desired output
That means that as output I need the same number of datasets (named "combination" in the example) as there are possible combinations of the x countries and y years.
As an example, imagine having the following dataset (the real dataset has over 500,000 observations, 15 countries, and 9 years):
> dput(Alldata)
structure(list(country = c("belgium", "belgium", "belgium", "belgium",
"germany", "germany", "germany", "germany"), year = c(2004, 2005,
2005, 2013, 2005, 2009, 2013, 2013), spring = c(23, 24, 45, 23,
1, 34, 5, 23), summer = c(25, 43, 654, 565, 23, 1, 23, 435),
autumn = c(23, 12, 4, 12, 24, 64, 23, 12), winter = c(34,
45, 64, 13, 346, 74, 54, 45), irrigation = c(10, 30, 40,
300, 288, 500, 996, 235), id = c(1, 2, 2, 3, 4, 5, 6, 6)), datalabel = "", time.stamp = "14 Nov 2016 20:09", .Names = c("country",
"year", "spring", "summer", "autumn", "winter", "irrigation",
"id"), formats = c("%9s", "%9.0g", "%9.0g", "%9.0g", "%9.0g",
"%9.0g", "%9.0g", "%9.0g"), types = c(7L, 254L, 254L, 254L, 254L,
254L, 254L, 254L), val.labels = c("", "", "", "", "", "", "",
""), var.labels = c("", "", "", "", "", "", "", "group(country year)"
), row.names = c("1", "2", "3", "4", "5", "6", "7", "8"), version = 12L, class = "data.frame")
In the example above, I already created an id for each combination of country and year. That means I want to make datasets with all observations that have combinations of the following ids:
dataset 1_2_3_4_5: ids 1, 2, 3, 4, 5 (so this dataset only misses the observations with id = 6)
dataset 1_2_3_4_6: ids 1, 2, 3, 4, 6 (but not 5)
dataset 1_2: ids 1, 2 (but not all the rest)
dataset 3_4_5: ids 3, 4, 5 (but not all the rest)
....
etc. Note that I gave each dataset the name of the ids it includes; otherwise it will be hard for me to distinguish all the different datasets from each other. Other names are fine too, as long as I can distinguish between the datasets!
Thanks for your help!
EDIT: it is possible that certain datasets give no results (because the second loop loops over irrigation and certain combinations might not have any irrigation), but then the output should just be a dataset with missing values.
Not sure if this is the most efficient way of doing this, but I think it should work:
# create a df to store the results of all combinations
result=data.frame()
The next loops are based on the combn() function, which creates all possible combinations of a vector (here the ids), taking m elements at a time.
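For reference, the columns of the matrix returned by combn() are the individual combinations, e.g.:
combn(1:4, 2)
#      [,1] [,2] [,3] [,4] [,5] [,6]
# [1,]    1    1    1    2    2    3
# [2,]    2    3    4    3    4    4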
for (i in 2:max(Alldata$id)) {
  combis = combn(unique(Alldata$id), i)
  for (j in 1:ncol(combis)) {
    sub = Alldata[Alldata$id %in% combis[, j], ]
    out = sub[1, ] # use your function here
    out$label = paste(combis[, j], collapse = '') # provide an id so you know which combination this result belongs to
    result = rbind(result, out) # append it to the previous output
  }
}
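If you would rather keep every combination as its own, separately named dataset (as described in the question) instead of stacking the results into one data frame, a sketch along these lines could work; it only builds and names the subsets, and you would then run your analysis on each list element:
ids <- unique(Alldata$id)
# all combinations of at least two ids, as a flat list of vectors
combos <- unlist(lapply(2:length(ids), function(m) combn(ids, m, simplify = FALSE)),
                 recursive = FALSE)
# one data frame per combination, named after the ids it contains
datasets <- lapply(combos, function(cc) Alldata[Alldata$id %in% cc, ])
names(datasets) <- sapply(combos, paste, collapse = "_")
# e.g. datasets[["1_2_3_4_5"]] holds all observations except those with id 6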
This is a follow-on question to a question I posted earlier (see Sum over rows with multiple changing conditions R data.table for more details). I want to calculate how many times the 3 subjects have experienced an event in the last 5 years, so I have been summing over a rolling window using rollapply from the zoo package. This assumes that experience 5 years ago is as important as experience 1 year ago (same weighting), so now I want to include a time decay for the experience that enters the sum. In other words, experience 5 years ago should not enter the sum with the same weight as experience 1 year ago.
In my case I want to include an age-dependent decay (although for other applications faster or slower decays, such as square roots or squares, could also be possible).
For example, let's assume I have the following data (I build on the previous data for clarity):
mydf <- data.frame (Year = c(2000, 2001, 2002, 2004, 2005,
2007, 2000, 2001, 2002, 2003,
2003, 2004, 2005, 2006, 2006, 2007),
Name = c("Tom", "Tom", "Tom", "Fred", "Gill",
"Fred", "Gill", "Gill", "Tom", "Tom",
"Fred", "Fred", "Gill", "Fred", "Gill", "Gill"))
# Create an indicator for the experience
mydf$Ind <- 1
# Load require packages
library(data.table)
library(zoo)
# Set data.table
setDT(mydf)
setkey(mydf, Name,Year)
# Perform cartesian join to calculate experience. I2 is the new experience indicator
m <- mydf[CJ(unique(Name),seq(min(Year)-5, max(Year))),allow.cartesian=TRUE][,
list(Ind = unique(Ind), I2 = sum(Ind,na.rm=TRUE)),
keyby=list(Name,Year)]
# This is the approach I have been taking so far. Note that is a simple rolling sum of I2
m[,Exp := rollapply(I2, 5, function(x) sum(head(x,-1)),
align = 'right', fill=0),by=Name]
So the question now is: how can I include an age-dependent decay in this calculation? To model this, I need to divide each experience by its age before it enters the sum.
I have been trying to get it to work using something along these lines:
m[,Exp_age := rollapply(I2, 5, function(x) sum(head(x,-1)/(tail((Year))-head(Year,-1))),
align = 'right', fill=0),by=Name]
But it does not work. I think my main problem is that I cannot get the age of the experience right, so that I can divide by it in the sum. The result should look like the Exp_age column in the myres data.frame below:
myres <- data.frame(Name = c("Fred", "Fred", "Fred", "Fred", "Fred",
"Gill", "Gill", "Gill", "Gill", "Gill", "Gill",
"Tom", "Tom", "Tom", "Tom", "Tom"),
Year = c(2003, 2004, 2004, 2006, 2007, 2000, 2001, 2005,
2005, 2006, 2007, 2000, 2001, 2002, 2002, 2003),
Ind = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
Exp = c(0, 1, 1, 3, 4, 0, 1, 1, 1, 2, 3, 0, 1, 2, 2, 4),
Exp_age = c(0, 1, 1, 1.333333333, 1.916666667, 0, 1, 0.45,
0.45, 2.2, 2, 0, 1, 1.5, 1.5, 2.833333333))
Any pointers would be greatly appreciated!
If I understand you correctly, you are trying to do a rollapply with width=5 and, rather than a simple sum, you want a weighted sum, where the weights are the age of the experience relative to the 5-year window. I would do this: first set the key in your data.table so that it is properly ordered by Name and Year; then you know that the last item in your x variable is the youngest and the first item is the oldest (you do this in your code already). I can't quite tell which way you want the weights to go (whether the youngest or the oldest should get the greatest weight), but you get the point:
setkey(m, Name, Year)
my_fun = function(x) { w = 1:length(x); sum(x * w) }
m[, Exp_age := rollapply(I2, width = 5, by = 1, fill = NA, FUN = my_fun,
                         by.column = FALSE, align = "right"), by = Name]
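If the decay should literally be 1/age, as described in the question (experience from one year ago counts in full, experience from k years ago is divided by k, and the current year is excluded), a rough sketch in the spirit of the question's own rollapply attempt is shown below. It uses the same ages-1-to-4 window as the Exp column, so it may not reproduce every value of the posted Exp_age column exactly:
# sketch: divide the experience in each of the 4 previous years by its age
# (1 year ago -> /1, ..., 4 years ago -> /4); the current year is excluded.
# This relies on m having one row per Name and consecutive Year values, as built above.
decay_sum <- function(x) sum(head(x, -1) / rev(seq_len(length(x) - 1)))
m[, Exp_age2 := rollapply(I2, width = 5, FUN = decay_sum,
                          align = "right", fill = 0), by = Name]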