Resampling in nested groups in R

I have run across similar questions, but have not been able to find an answer for my specific needs.
I have a data set with a nested group design, and I need to randomly sample (with replacement) within each group; the number of resampling events must equal the number of samples (i.e., rows) per group. Additionally, the nested groups have multiple columns of data. See the example df below.
I have code using the dplyr package, but am moving away from dplyr because I have to continuously update my code as dplyr changes function names and operations...which is annoying to say the least. Yes...I know there are several ways to circumvent this issue, but I have decided it is time to cast aside the dplyr crutches and learn how to do this kind of data wrangling in base R.
Working dplyr code:
Resample_function <- function(data) {
  data %>%
    group_by(GROUP, YEAR) %>%
    slice(sample(n(), replace = TRUE)) %>%
    ungroup()
}
I have tried various combinations of aggregate, ave, and the apply family of functions...but my ability to deal with nested group designs in base R is limited, to say the least.
Below I have provided an example data set (df) and what the results should look like. Note that the resampling procedure will produce different results each time, but the number of resamples per nested group should be the same.
One final request...I am open to all options (e.g., library(data.table), library(boot), etc.), as it would be great if others find this post useful. Additionally, some of these packages can be more efficient than base R. However, I prefer solutions that do not require the installation and loading of additional packages.
Thanks in advance for your help.
Take care.
df <- read.table(text = "GROUP YEAR VAR1 VAR2
a 2018 1.0 1.0
a 2018 2.0 2.0
b 2018 10 10
b 2018 20 20
b 2018 30 30
b 2018 40 40
b 2019 50 50
b 2019 60 60
b 2019 70 70
b 2019 80 80
b 2019 90 90
b 2019 100 100
b 2019 110 110
b 2019 120 120
b 2019 130 130
b 2019 140 140
b 2019 150 150
b 2019 160 160
b 2019 170 170
b 2019 180 180
b 2020 190 190
b 2020 200 200
b 2020 210 210", header = TRUE)
result <- read.table(text = "GROUP YEAR VAR1 VAR2
a 2018 1 1
a 2018 1 1
b 2018 20 20
b 2018 30 30
b 2018 30 30
b 2018 20 20
b 2019 70 70
b 2019 170 170
b 2019 50 50
b 2019 150 150
b 2019 70 70
b 2019 150 150
b 2019 100 100
b 2019 120 120
b 2019 50 50
b 2019 160 160
b 2019 90 90
b 2019 150 150
b 2019 170 170
b 2019 180 180
b 2020 190 190
b 2020 190 190
b 2020 190 190", header = TRUE)

You can perform this kind of shuffling in base R using ave:
Resample_function <- function(data) {
  # within each GROUP/YEAR combination, resample the row indices with replacement;
  # x[sample.int(length(x), replace = TRUE)] avoids sample()'s surprising
  # behaviour when a group contains a single row
  idx <- with(data, ave(seq_len(nrow(data)), GROUP, YEAR,
                        FUN = function(x) x[sample.int(length(x), replace = TRUE)]))
  new_data <- data[idx, ]
  rownames(new_data) <- NULL
  return(new_data)
}
Resample_function(df)
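Since the question explicitly lists data.table as an acceptable option, here is a sketch of an equivalent one-liner (assuming the data.table package is installed; .SD is the per-group subset of rows):
library(data.table)
dt <- as.data.table(df)
# within each GROUP/YEAR, pick .N row indices with replacement
dt[, .SD[sample(.N, replace = TRUE)], by = .(GROUP, YEAR)]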

Related

How to subset columns based on the value in another column in R

I'm looking to subset multiple columns based on a value (a year) that is stored elsewhere in the data. For example, I have columns reflecting various data, and another column containing a year. My data looks something like this:
Individual  Age 2010  weight 2010  Age 2011  Weight 2011  Age 2012  Weight 2012  Age 2013  Weight 2013  Year
A           53        50           85        100          82        102          56        90           2013
B           22        NA           23        75           NA        68           25        60           2013
C           33        65           34        64           35        70           NA        75           2010
D           NA        70           28        NA           29        78           30        55           2012
E           NA        NA           64        90           NA        NA           NA        NA           2011
I want to create a new column that reflects the data the 'Year' column indicates. For example, subsetting the data for 'Individual' A from 2013, and for 'Individual' D from 2012.
My end goal is to have a table that looks like:
Individual  Age  Weight
A           56   90
B           25   60
C           33   65
D           29   78
E           64   90
Is there any way to subset the yearly columns based on the year given in the final column?
I made a subset of your data and came up with the following (could be more elegant but this works):
library(dplyr)
library(stringr)
Individual <- c("A","B","C","D","E")
Age2010 <- c(53,22,33,NA,NA)
`weight 2010` <- c(50,NA,65,70,NA)
Age2011 <- c(85,23,34,28,64)
Weight2011 <- c(100,75,64,NA,90)
# note: cbind() goes through a matrix, so every column ends up as character
df <- as.data.frame(cbind(Individual, Age2010, `weight 2010`, Age2011, Weight2011))
colnames(df) <- str_replace_all(colnames(df), " ", "") # remove spaces
# create a data frame for each year (prob could do this using `apply`)
df2010 <- df %>% select(Individual, contains("2010")) %>% mutate(year = 2010) %>% rename(weight = weight2010, age = Age2010)
df2011 <- df %>% select(Individual, contains("2011")) %>% mutate(year = 2011) %>% rename(weight = Weight2011, age = Age2011)
final <- bind_rows(df2010, df2011)
Of course, you can extend this for the remaining years in your dataset. You will then have a year variable to perform your analyses.
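To then collapse this to the requested one-row-per-individual table, you could join each individual's chosen year back in and filter on it. A sketch, using the Year column from the question's data (the subset above only covers 2010 and 2011, so only individuals C and E would survive the filter):
Year <- c(2013, 2013, 2010, 2012, 2011) # chosen year per individual, from the question
final %>%
  inner_join(data.frame(Individual, Year), by = "Individual") %>%
  filter(year == Year) %>% # keep only each individual's chosen year
  select(Individual, age, weight)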

Efficient data.table method to generate additional rows given random numbers

I have a large data.table for which I want to generate a random number (using two columns) and perform a calculation. Then I want to perform this step 1,000 times. I am looking for a way to do this efficiently without a loop.
Example data:
> dt <- data.table(Group=c(rep("A",3),rep("B",3)),
Year=rep(2020:2022,2),
N=c(300,350,400,123,175,156),
Count=c(25,30,35,3,6,8),
Pop=c(1234,1543,1754,2500,2600,2400))
> dt
Group Year N Count Pop
1: A 2020 300 25 1234
2: A 2021 350 30 1543
3: A 2022 400 35 1754
4: B 2020 123 3 2500
5: B 2021 175 6 2600
6: B 2022 156 8 2400
> dt[, rate := rpois(.N, lambda=Count)/Pop*100000]
> dt[, value := N*(rate/100000)]
> dt
Group Year N Count Pop rate value
1: A 2020 300 25 1234 1944.8947 5.8346840
2: A 2021 350 30 1543 2009.0732 7.0317563
3: A 2022 400 35 1754 1938.4265 7.7537058
4: B 2020 123 3 2500 120.0000 0.1476000
5: B 2021 175 6 2600 115.3846 0.2019231
6: B 2022 156 8 2400 416.6667 0.6500000
I want to be able to do this calculation for value 1,000 times, and keep all instances (with an indicator column for 1-1,000 indicating which run) without using a loop. Any suggestions?
Maybe you can try replicate, like below:
n <- 1000
dt[, paste0(c("rate", "value"), rep(1:n, each = 2)) :=
     replicate(n, list(u <- rpois(.N, lambda = Count) / Pop * 100000,
                       N * (u / 100000)))]
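The wide result above has no run indicator, though. If you prefer the long format with an indicator column, as the question describes, one sketch (starting again from the original six-row dt, before the wide columns are added; the column name run is my choice) is to stack n copies of the table and draw all the Poisson values in one vectorised call:
n <- 1000
sim <- dt[rep(seq_len(.N), n)]                 # stack n copies of the table
sim[, run := rep(seq_len(n), each = nrow(dt))] # replicate number, 1 to n
sim[, rate := rpois(.N, lambda = Count) / Pop * 100000]
sim[, value := N * (rate / 100000)]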

How to add the value of a row to other rows based on some criteria in R?

I have panel data for costs, sampled monthly for various product types. I also have "Generic" costs, which don't belong to any product type. A super simple representative df looks like this:
type <- c("A","A","B","B","C","C","Generic","Generic")
year <- c(2020,2020,2020,2020,2020,2020,2020,2020)
month <- c(1,2,1,2,1,2,1,2)
cost <- c(1,2,3,4,5,6,600,630)
volume <- c(10,11,20,21,30,31,60,63)
df <- data.frame(type,year,month,cost,volume)
type year month cost volume
A 2020 1 1 10
A 2020 2 2 11
B 2020 1 3 20
B 2020 2 4 21
C 2020 1 5 30
C 2020 2 6 31
Generic 2020 1 600 60
Generic 2020 2 630 63
I need to distribute the "Generic" costs to product types according to their "Volume".
For example,
For 2020-1, the volume ratio of
product type A: 10 / (10 + 20 + 30) = 1/6
product type B: 20 / (10 + 20 + 30) = 2/6
product type C: 30 / (10 + 20 + 30) = 3/6
For 2020-2, the volume ratio of
product type A: 11 / (11 + 21 + 31) = 11/63
product type B: 21 / (11 + 21 + 31) = 21/63
product type C: 31 / (11 + 21 + 31) = 31/63
So, I would like to distribute "Generic" costs for 2020-1 to product types like this:
1/6 * 600 = 100 for product type A
2/6 * 600 = 200 for product type B
3/6 * 600 = 300 for product type C
Similarly for 2020-2, I would like to distribute "Generic" costs like:
11/63 * 630 = 110 for product type A
21/63 * 630 = 210 for product type B
31/63 * 630 = 310 for product type C
In the end, I would like to end up with the following data frame:
type year month new_cost volume
A 2020 1 101 10
A 2020 2 112 11
B 2020 1 203 20
B 2020 2 214 21
C 2020 1 305 30
C 2020 2 316 31
I already have the total volume in the original data frame within the "Generic" type, so there is no need to calculate that separately.
I was trying to do these calculations via dplyr package's group_by() and mutate() functions, but I couldn't figure out how.
Any help is appreciated.
We can do this using data.table, by first merging in the generic costs separately and spreading them according to the percentage of volume made up by each type in each month/year:
library(data.table)
setDT(df)
# split off the generic costs into their own table
generic <- df[type == "Generic"]
setnames(generic, "cost", "generic_cost")
df <- df[type != "Generic"]
# each type's share of total volume within each year/month
df[, volume_ratio := volume / sum(volume), by = c("year", "month")]
# merge the generic cost back in and allocate it by volume share
df <- merge(df, generic[, c("year", "month", "generic_cost")], by = c("year", "month"))
df[, new_cost := cost + (generic_cost * volume_ratio)]
Which gives us:
df
year month type cost volume volume_ratio generic_cost new_cost
1: 2020 1 A 1 10 0.1666667 600 101
2: 2020 1 B 3 20 0.3333333 600 203
3: 2020 1 C 5 30 0.5000000 600 305
4: 2020 2 A 2 11 0.1746032 630 112
5: 2020 2 B 4 21 0.3333333 630 214
6: 2020 2 C 6 31 0.4920635 630 316
This has a few extra columns, but new_cost is the column of interest.
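Since the question specifically asks about group_by() and mutate(), here is a dplyr sketch of the same allocation, starting from the original df and assuming exactly one Generic row per year/month group:
library(dplyr)
df %>%
  group_by(year, month) %>%
  # add each type's volume share of the group's generic cost
  mutate(new_cost = cost + volume / sum(volume[type != "Generic"]) * cost[type == "Generic"]) %>%
  ungroup() %>%
  filter(type != "Generic")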

tapply based on multiple indexes in R

I have a data frame, much like this one:
ref=rep(c("A","B"),each=240)
year=rep(rep(2014:2015,each=120),2)
month=rep(rep(1:12,each=10),4)
values=c(rep(NA,200),rnorm(100,2,1),rep(NA,50),rnorm(40,4,2),rep(NA,90))
DF=data.frame(ref,year,month,values)
I would like to compute the maximum number of consecutive NAs per reference, per year.
I have created a function that works out the maximum number of consecutive NAs, but it can only be applied based on one variable.
For example,
func <- function(x) {
  # length of the longest run in is.na(x) (note: this also counts runs of non-NA values)
  max(rle(is.na(x))$lengths)
}
with(DF, tapply(values,ref, func))
# A B
# 200 90
with(DF, tapply(values,year, func))
# 2014 2015
# 120 90
So there are a maximum of 200 consecutive NAs in ref A in total, and maximum of 90 in ref B, which is correct. There are also 120 NAs in 2014, and 90 in 2015.
What I'd like is a result per ref and year, such as:
A 2015 80
A 2014 120
B 2015 90
B 2014 50
There are multiple ways of doing this; one is with the plyr library:
library(plyr)
ddply(DF,c('ref','year'),summarise,NAs=max(rle(is.na(values))$lengths))
ref year NAs
1 A 2014 120
2 A 2015 80
3 B 2014 60
4 B 2015 90
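Note that B/2014 comes out as 60 here rather than the 50 the question expects: max(rle(is.na(values))$lengths) returns the longest run of either NAs or non-NAs, and in B/2014 the longest run is actually 60 consecutive non-NA values. A sketch of an NA-only variant (func_na is a name I made up):
func_na <- function(x) {
  r <- rle(is.na(x))
  # keep only the lengths of the TRUE (i.e., NA) runs
  if (any(r$values)) max(r$lengths[r$values]) else 0L
}
with(DF, tapply(values, list(ref, year), func_na))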
Using your function, you could also try:
with(DF, tapply(values,list(ref,year), func))
which gives a slightly different output
2014 2015
A 120 80
B 60 90
By using melt() on this result, however, you can get to the same data frame.
Very similar to the tapply solution above. I find aggregate gives a better output than tapply though.
with(DF, aggregate(list(Value = values),list(Year = year,ref = ref), func))
Year ref Value
1 2014 A 120
2 2015 A 80
3 2014 B 60
4 2015 B 90
I like the recipe format:
library(dplyr)
# recode the NAs as 1 so they can be counted
DF$values[is.na(DF$values)] <- 1
# note: this counts all NAs per ref/year, which matches the longest run
# only because the NAs within each group happen to be contiguous here
DF %>%
  filter(values == 1) %>%
  group_by(ref, year) %>%
  mutate(csum = cumsum(values)) %>%
  summarise(max(csum))
Source: local data frame [4 x 3]
Groups: ref [?]
ref year max(csum)
(fctr) (int) (dbl)
1 A 2014 120
2 A 2015 80
3 B 2014 50
4 B 2015 90

Reshaping a data frame --- changing rows to columns

Suppose that we have a data frame that looks like
set.seed(7302012)
county <- rep(letters[1:4], each=2)
state <- rep(LETTERS[1], times=8)
industry <- rep(c("construction", "manufacturing"), 4)
employment <- round(rnorm(8, 100, 50), 0)
establishments <- round(rnorm(8, 20, 5), 0)
data <- data.frame(state, county, industry, employment, establishments)
state county industry employment establishments
1 A a construction 146 19
2 A a manufacturing 110 20
3 A b construction 121 10
4 A b manufacturing 90 27
5 A c construction 197 18
6 A c manufacturing 73 29
7 A d construction 98 30
8 A d manufacturing 102 19
We'd like to reshape this so that each row represents a (state and) county, rather than a county-industry, with columns construction.employment, construction.establishments, and analogous versions for manufacturing. What is an efficient way to do this?
One way is to subset
construction <- data[data$industry == "construction", ]
names(construction)[4:5] <- c("construction.employment", "construction.establishments")
And similarly for manufacturing, then do a merge. This isn't so bad if there are only two industries, but imagine that there are 14; this process would become tedious (though made less so by using a for loop over the levels of industry).
Any other ideas?
This can be done with base R's reshape, if I understand your question correctly:
reshape(data, direction="wide", idvar=c("state", "county"), timevar="industry")
# state county employment.construction establishments.construction
# 1 A a 146 19
# 3 A b 121 10
# 5 A c 197 18
# 7 A d 98 30
# employment.manufacturing establishments.manufacturing
# 1 110 20
# 3 90 27
# 5 73 29
# 7 102 19
Also using the reshape package:
library(reshape)
m <- reshape::melt(data)
cast(m, state + county~...)
Yielding:
> cast(m, state + county~...)
state county construction_employment construction_establishments manufacturing_employment manufacturing_establishments
1 A a 146 19 110 20
2 A b 121 10 90 27
3 A c 197 18 73 29
4 A d 98 30 102 19
I personally use base reshape, so I probably should have shown this with reshape2 (Wickham), but I forgot there was a reshape2 package. It's slightly different:
library(reshape2)
m <- reshape2::melt(data)
dcast(m, state + county~...)
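For completeness, the same reshape is nowadays usually written with tidyr's pivot_wider(), which superseded reshape2. A sketch:
library(tidyr)
# one row per state/county; one column per industry/measure combination
pivot_wider(data,
            names_from = industry,
            values_from = c(employment, establishments))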
