Efficient data.table method to generate additional rows given random numbers - r

I have a large data.table in which I want to generate a random number (using two columns) and perform a calculation. Then I want to repeat this step 1,000 times. I am looking for a way to do this efficiently without a loop.
Example data:
> dt <- data.table(Group = c(rep("A", 3), rep("B", 3)),
                   Year = rep(2020:2022, 2),
                   N = c(300, 350, 400, 123, 175, 156),
                   Count = c(25, 30, 35, 3, 6, 8),
                   Pop = c(1234, 1543, 1754, 2500, 2600, 2400))
> dt
Group Year N Count Pop
1: A 2020 300 25 1234
2: A 2021 350 30 1543
3: A 2022 400 35 1754
4: B 2020 123 3 2500
5: B 2021 175 6 2600
6: B 2022 156 8 2400
> dt[, rate := rpois(.N, lambda=Count)/Pop*100000]
> dt[, value := N*(rate/100000)]
> dt
Group Year N Count Pop rate value
1: A 2020 300 25 1234 1944.8947 5.8346840
2: A 2021 350 30 1543 2009.0732 7.0317563
3: A 2022 400 35 1754 1938.4265 7.7537058
4: B 2020 123 3 2500 120.0000 0.1476000
5: B 2021 175 6 2600 115.3846 0.2019231
6: B 2022 156 8 2400 416.6667 0.6500000
I want to be able to do this calculation for value 1,000 times, and keep all instances (with an indicator column for 1-1,000 indicating which run) without using a loop. Any suggestions?

Maybe you can try replicate as below; it builds all 1,000 pairs of rate/value columns in a single := assignment:
n <- 1000
dt[, paste0(c("rate", "value"), rep(1:n, each = 2)) :=
     replicate(n, list(u <- rpois(.N, lambda = Count) / Pop * 100000,
                       N * (u / 100000)))]
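If you would rather keep the runs stacked in long format with an indicator column, as the question describes, here is a minimal sketch (the column name run is my own choice):
n <- 1000
# replicate the table n times and tag each copy with a run id
sims <- dt[rep(seq_len(nrow(dt)), times = n)]
sims[, run := rep(seq_len(n), each = nrow(dt))]
# draw a fresh Poisson rate for every row, then compute value
sims[, rate := rpois(.N, lambda = Count) / Pop * 100000]
sims[, value := N * (rate / 100000)]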

Related

How to add the value of a row to other rows based on some criteria in R?

I have panel data for costs, sampled monthly for various product types. I also have "Generic" costs that don't belong to any product type. A super simple representative df looks like this:
type <- c("A","A","B","B","C","C","Generic","Generic")
year <- c(2020,2020,2020,2020,2020,2020,2020,2020)
month <- c(1,2,1,2,1,2,1,2)
cost <- c(1,2,3,4,5,6,600,630)
volume <- c(10,11,20,21,30,31,60,63)
df <- data.frame(type,year,month,cost,volume)
type year month cost volume
A 2020 1 1 10
A 2020 2 2 11
B 2020 1 3 20
B 2020 2 4 21
C 2020 1 5 30
C 2020 2 6 31
Generic 2020 1 600 60
Generic 2020 2 630 63
I need to distribute the "Generic" costs to product types according to their "Volume".
For example,
For 2020-1, the volume ratio of
product type A: 10 / (10 + 20 + 30) = 1/6
product type B: 20 / (10 + 20 + 30) = 2/6
product type C: 30 / (10 + 20 + 30) = 3/6
For 2020-2, the volume ratio of
product type A: 11 / (11 + 21 + 31) = 11/63
product type B: 21 / (11 + 21 + 31) = 21/63
product type C: 31 / (11 + 21 + 31) = 31/63
So, I would like to distribute "Generic" costs for 2020-1 to product types like this:
1/6 * 600 = 100 for product type A
2/6 * 600 = 200 for product type B
3/6 * 600 = 300 for product type C
Similarly for 2020-2, I would like to distribute "Generic" costs like:
11/63 * 630 = 110 for product type A
21/63 * 630 = 210 for product type B
31/63 * 630 = 310 for product type C
In the end, I would like to end up with the following data frame:
type year month new_cost volume
A 2020 1 101 10
A 2020 2 112 11
B 2020 1 203 20
B 2020 2 214 21
C 2020 1 305 30
C 2020 2 316 31
I already have the total volume in the original data frame within the "Generic" type, so there is no need to calculate that separately.
I was trying to do these calculations via dplyr package's group_by() and mutate() functions, but I couldn't figure out how.
Any help is appreciated.
We can do this using data.table, by first splitting out the generic costs and then spreading them according to the share of volume made up by each type in each month/year:
df <- setDT(df)
generic <- df[type == "Generic"]
setnames(generic, "cost", "generic_cost")
df <- df[type != "Generic"]
df[, volume_ratio := volume / sum(volume), by = c("year", "month")]
df <- merge(df, generic[, c("year", "month", "generic_cost")], by = c("year", "month"))
df[, new_cost := cost + (generic_cost * volume_ratio)]
Which gives us:
df
year month type cost volume volume_ratio generic_cost new_cost
1: 2020 1 A 1 10 0.1666667 600 101
2: 2020 1 B 3 20 0.3333333 600 203
3: 2020 1 C 5 30 0.5000000 600 305
4: 2020 2 A 2 11 0.1746032 630 112
5: 2020 2 B 4 21 0.3333333 630 214
6: 2020 2 C 6 31 0.4920635 630 316
This leaves a few extra columns, but new_cost is the column of interest.
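Since the question asked about dplyr's group_by() and mutate(), a rough equivalent is sketched below; it assumes exactly one "Generic" row per year/month, as in the example:
library(dplyr)
df %>%
  group_by(year, month) %>%
  # pull the group's single Generic cost and each type's share of volume
  mutate(generic_cost = cost[type == "Generic"],
         volume_ratio = volume / sum(volume[type != "Generic"])) %>%
  filter(type != "Generic") %>%
  mutate(new_cost = cost + generic_cost * volume_ratio) %>%
  ungroup()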

Convert values using a conversion table R

I am currently running statistical models on ACT and SAT scores. To help clean my data, I want to convert the ACT scores into its SAT equivalent. I found the following table online:
ACT SAT
<dbl> <dbl>
1 36 1590
2 35 1540
3 34 1500
4 33 1460
5 32 1430
6 31 1400
7 30 1370
8 29 1340
9 28 1310
10 27 1280
I want to replace the column ACT_Composite with the number in the SAT column of the conversion table. For instance, if one row displays an ACT_Composite score of 35, I want to input 1540.
If anyone has ideas on how to accomplish this, I would greatly appreciate it.
In base R you can use merge directly:
#Reading score table
df <- read.table(header = TRUE, text ="ACT SAT
36 1590
35 1540
34 1500
33 1460
32 1430
31 1400
30 1370
29 1340
28 1310
27 1280")
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.frame with 50 sample scores
df1 <- data.frame(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
                by.x = "ACT_Composite",
                by.y = "ACT",
                all.x = TRUE,
                sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1 31 1400
2 31 1400
3 31 1400
4 31 1400
5 31 1400
6 36 1590
In data.table you can use merge:
library(data.table)
#Setting seed to reproduce df1
set.seed(1234)
# Create a data.table with 50 sample scores
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
# left-join df1 with df with keys ACT_Composite and ACT
result <- merge(df1, df,
                by.x = "ACT_Composite",
                by.y = "ACT",
                all.x = TRUE,
                sort = FALSE)
#The first 6 values of result
head(result)
ACT_Composite SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
Alternatively, in data.table you can also try a join:
set.seed(1234) # reset the seed so df1 matches the output below
df1 <- data.table(ACT_Composite = sample(27:36, 50, replace = TRUE))
setDT(df) # you need to convert your look-up table df into a data.table
result <- df[df1, on = c(ACT = "ACT_Composite")]
head(result)
ACT SAT
1: 36 1590
2: 32 1430
3: 31 1400
4: 35 1540
5: 31 1400
6: 32 1430
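For a pure lookup like this you can also skip the merge entirely; a small sketch with base R's match(), assuming df is the conversion table from above:
# look up each ACT score's position in the conversion table
df1$SAT <- df$SAT[match(df1$ACT_Composite, df$ACT)]
head(df1)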

Resampling in nested groups in R

I have run across similar questions, but have not been able to find an answer for my specific needs.
I have a data set with a nested group design and I need to randomly sample (with replacement) within each group and the number of resampling events must equal the number of samples (i.e., rows) per group. Additionally, the nested groups have multiple columns of data. See the example df below.
I have code using the dplyr package, but am moving away from dplyr as I have to continuously update my code whenever dplyr changes function names and behavior...which is annoying to say the least. Yes...I know there are several ways to circumvent this issue, but I have decided it is time to cast aside the dplyr crutches and learn how to execute data wrangling using base R.
Working dplyr code:
Resample_function <- function(data1) {
  group_by(data1, GROUP, YEAR) %>%
    slice(sample(n(), replace = TRUE)) %>%
    ungroup()
}
I have tried various combinations of aggregate, ave, and the apply family of functions...but my ability to deal with nested group designs in base R is limited to say the least.
Below I have provided an example data set (df) and what the results should look like. Note that the resampling procedure will produce different results each time, but the number of resamples per nested group should be the same.
One final request...I am open to all options (e.g., library(data.table), library(boot), etc.) as it would be great if others find this post useful. Additionally, some of these packages can be more efficient than base R. However, I prefer solutions that do not require the installation and loading of additional packages.
Thanks in advance for your help.
Take care.
df <- read.table(text = "GROUP YEAR VAR1 VAR2
a 2018 1.0 1.0
a 2018 2.0 2.0
b 2018 10 10
b 2018 20 20
b 2018 30 30
b 2018 40 40
b 2019 50 50
b 2019 60 60
b 2019 70 70
b 2019 80 80
b 2019 90 90
b 2019 100 100
b 2019 110 110
b 2019 120 120
b 2019 130 130
b 2019 140 140
b 2019 150 150
b 2019 160 160
b 2019 170 170
b 2019 180 180
b 2020 190 190
b 2020 200 200
b 2020 210 210", header = TRUE)
result <- read.table(text = "GROUP YEAR VAR1 VAR2
a 2018 1 1
a 2018 1 1
b 2018 20 20
b 2018 30 30
b 2018 30 30
b 2018 20 20
b 2019 70 70
b 2019 170 170
b 2019 50 50
b 2019 150 150
b 2019 70 70
b 2019 150 150
b 2019 100 100
b 2019 120 120
b 2019 50 50
b 2019 160 160
b 2019 90 90
b 2019 150 150
b 2019 170 170
b 2019 180 180
b 2020 190 190
b 2020 190 190
b 2020 190 190", header = TRUE)
You can perform this kind of shuffling in base R using ave:
Resample_function <- function(data) {
  new_data <- data[with(data, ave(seq(nrow(data)), GROUP, YEAR,
                                  FUN = function(x) sample(x, replace = TRUE))), ]
  rownames(new_data) <- NULL
  return(new_data)
}
Resample_function(df)
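Since the question welcomes other packages, an equivalent data.table sketch samples row indices with replacement within each nested group:
library(data.table)
setDT(df)
# .N is the group size, so each group is resampled to its own row count
df[, .SD[sample(.N, replace = TRUE)], by = .(GROUP, YEAR)]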

Accessing previous rows to make a calculation in a current row in R

I have data in which I need to make calculations within sub-groups of rows. Specifically, I am calculating the nitrogen (N) recovery efficiency of crops between different field treatments. This requires finding the difference in N uptake of crops (totN) between a plot with a given amount of N applied (Nrate) and the control plot with 0 N applied, then dividing by the Nrate.
This is how the data looks for just one year:
year drain Nrate totN
2016 C 0 190
2016 C 100 220
2016 C 200 230
2016 N 0 130
2016 N 100 200
2016 N 200 220
I have gotten this far, thanks to Performing calculations between rows in R, but I am not sure how to reference the control row (Nrate = 0) within the sub-group each row is in.
This is where I am:
library(tidyverse)
df <- data.frame(year = c(2016, 2016, 2016, 2016, 2016, 2016),
                 drain = c("C", "C", "C", "N", "N", "N"),
                 Nrate = c(0, 100, 200, 0, 100, 200),
                 totN = c(190, 220, 230, 130, 200, 220))
df %>%
  group_by(year, drain) %>%
  mutate(id = row_number()) %>%
  mutate(RE = ifelse(id != 1,
                     (totN - <the totN where Nrate=0 for same year and drain>) / Nrate,
                     NA))
This is what I expect to get:
year drain Nrate totN RE
2016 C 0 190 NA
2016 C 100 220 0.3 #(220-190)/100
2016 C 200 230 0.2 #(230-190)/200
2016 N 0 130 NA
2016 N 100 200 0.7 #(200-130)/100
2016 N 200 220 0.45 #(220-130)/200
We may subset 'totN' by indexing or, if the data are already ordered within each group, use first(totN):
library(dplyr)
df %>%
  group_by(year, drain) %>%
  mutate(RE = na_if((totN - totN[Nrate == 0]) / Nrate, NaN)) %>%
  ungroup()
-output
# A tibble: 6 x 5
# year drain Nrate totN RE
# <dbl> <chr> <dbl> <dbl> <dbl>
#1 2016 C 0 190 NA
#2 2016 C 100 220 0.3
#3 2016 C 200 230 0.2
#4 2016 N 0 130 NA
#5 2016 N 100 200 0.7
#6 2016 N 200 220 0.45
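For completeness, the first(totN) variant mentioned above could look like this (a sketch, assuming the Nrate == 0 control row comes first within each group):
df %>%
  group_by(year, drain) %>%
  # first(totN) is the control row's uptake when groups are ordered
  mutate(RE = if_else(Nrate == 0, NA_real_, (totN - first(totN)) / Nrate)) %>%
  ungroup()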

Rolling multi regression in R data table

Say I have an R data.table DT which has a list of returns:
Date Return
2016-01-01 -0.01
2016-01-02 0.022
2016-01-03 0.1111
2016-01-04 -0.006
...
I want to do a rolling multi regression of the previous N observations of Return predicting the next Return over some window K. E.g., over the last K = 120 days, do a regression of the last N = 14 observations to predict the next observation. Once I have this regression I want to use the predict function to get a prediction for each row based on the regression. In pseudocode it would be something like:
DT[, Prediction := predict(lm(Return[prev K - N -1] ~ Return[N observations prev for each observation]), Return[N observations previous for this observation])]
To be clear, I want to do a multiple regression, so if N were 3 it would be:
lm(Return ~ Return[-1] + Return[-2] + Return[-3]) ## where the negatives are the prev rows
How do I write this (as efficiently as possible)?
Thanks
If I understand correctly you want a quarterly auto-regression.
There's a related thread on time-series with data.table here.
You can set up a rolling date in data.table like this (see the link above for more context):
# Example for quarterly data
quarterly[, rollDate := leftBound]
storeData[, rollDate := date]
setkey(quarterly, "rollDate")
setkey(storeData, "rollDate")
Since you only provided a few rows of example data, I extended the series through 2019 and made up random return values.
First, get your data set up:
require(forecast)
require(xts)
DT <- read.table(con <- file("clipboard"))
dput(DT) # the dput was too long to display here
DT[, 1] <- as.POSIXct(strptime(DT[, 1], "%m/%d/%Y"))
DT[, 2] <- as.double(DT[, 2])
dat <- xts(DT$V2, order.by = DT$V1)
x.ts <- to.quarterly(dat) # 120 days ~ one quarter
dat.Open dat.High dat.Low dat.Close
2016 Q1 1292 1292 1 698
2016 Q2 138 1290 3 239
2016 Q3 451 1285 5 780
2016 Q4 355 1243 27 1193
2017 Q1 878 1279 4 687
2017 Q2 794 1283 12 411
2017 Q3 858 1256 9 1222
2017 Q4 219 1282 15 117
2018 Q1 554 1286 32 432
2018 Q2 630 1272 30 46
2018 Q3 310 1288 18 979
2019 Q1 143 1291 10 184
2019 Q2 250 1289 8 441
2019 Q3 110 1220 23 571
Then you can do a rolling ARIMA model, with or without re-estimation, like this (modeling the quarterly close as the univariate series; the horizon h is assumed to be one step ahead):
y <- x.ts[, "dat.Close"]
fit <- auto.arima(y)
order <- arimaorder(fit)
h <- 1           # forecast horizon (assumed one step ahead)
n <- nrow(y)     # number of rolling forecast origins
fcmat <- matrix(0, nrow = n, ncol = h)
for (i in 1:n) {
  x <- window(y, end = 2017.99 + (i - 1) / 4)
  refit <- Arima(x, order = order[1:3], seasonal = order[4:6])
  fcmat[i, ] <- forecast(refit, h = h)$mean
}
Here's a good related resource with several examples of different ways you might construct this: http://robjhyndman.com/hyndsight/rolling-forecasts/
You have to have the lags in columns anyway, so if I understand you correctly you can do something like this, say for a lag of 3:
setkey(DT, Date)
lag_max <- 3
for (i in 1:lag_max) {
  # shift by i, not 1, so each column is a deeper lag
  set(DT, NULL, paste0("lag", i), shift(DT[["Return"]], i, type = "lag"))
}
# na.exclude pads the fitted values with NA so the lengths match
DT[, prediction := predict(lm(Return ~ lag1 + lag2 + lag3, na.action = na.exclude))]
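That fits one regression over all rows. A rough sketch of the rolling version the question asks for (re-fitting on only the previous K rows before predicting each row; K = 120 as in the question, and the column name prediction_roll is my own) could look like this, unoptimized, just to show the structure:
K <- 120
DT[, prediction_roll := vapply(seq_len(.N), function(j) {
  if (j <= K) return(NA_real_)   # not enough history yet
  win <- DT[(j - K):(j - 1)]     # previous K rows only
  fit <- lm(Return ~ lag1 + lag2 + lag3, data = win)
  predict(fit, newdata = DT[j])
}, numeric(1))]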
