Create iteration function to apply to a list of dataframes - r

I was wondering how to perform a recursive calculation over a list of dataframes? I have previously created a for loop that works but I would now like to convert this into a function that I can then apply to a list of dataframes. The equation that the for loop is based on is:
api(i) = k * api(i-1) + rain(i)
For loop example:
k <- 0.9 # assign k value
rain <- runif(n =50, min = 0, max = 4) # generate rainfall data
api <- numeric(length(rain)) # create api vector
api[1] <- k * rain[1] # set first api value
# for loop to calculate api
for (i in 2:length(rain)) {
api[i] <- k * api[i-1] + rain[i]
}
Example data:
library(lubridate)
date_time <-
seq(ymd_hm('2022-01-01 00:00'), ymd_hm('2022-02-01 23:45'), by = '15 mins')
precip <- runif(n = length(date_time),
min = 0,
max = 4)
df1 <- data.frame(date_time, precip)
date_time <-
seq(ymd_hm('2022-08-01 00:00'), ymd_hm('2022-09-01 23:45'), by = '15 mins')
precip <- runif(n = length(date_time),
min = 0,
max = 4)
df2 <- data.frame(date_time, precip)
list_Df <- list(df1, df2)
I would like to assign the api object to a new column in each dataframe (e.g. df$api)
Thanks for your help!
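One possible approach, as a minimal sketch: wrap the loop in a function and use lapply to add the new column to every dataframe in the list. The names calc_api and add_api are made up for illustration, and the two small dataframes stand in for df1 and df2 above.

```r
# Hypothetical helper: api(i) = k * api(i-1) + rain(i), with api(1) = k * rain(1)
calc_api <- function(rain, k = 0.9) {
  api <- numeric(length(rain))
  if (length(rain) == 0) return(api)
  api[1] <- k * rain[1]
  # seq_along(rain)[-1] is empty for a length-1 input, unlike 2:length(rain)
  for (i in seq_along(rain)[-1]) {
    api[i] <- k * api[i - 1] + rain[i]
  }
  api
}

# Hypothetical wrapper: add the result as a new column and return the dataframe
add_api <- function(df, k = 0.9) {
  df$api <- calc_api(df$precip, k = k)
  df
}

# Small stand-ins for df1 and df2
df1 <- data.frame(precip = c(1, 2, 3))
df2 <- data.frame(precip = c(0, 4, 0, 1))
list_Df <- list(df1, df2)

# Apply to every dataframe in the list
list_Df <- lapply(list_Df, add_api, k = 0.9)
```

lapply returns a new list, so the result has to be assigned back to list_Df (or to a new name) for the api column to persist.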

Related

Does preallocation of R list improve loop run time? how?

I am running a simulation in R, in which the outputs should be stored in numeric vectors in a variable of the type list. However, I am wondering why when I preallocated the list with numeric vectors, the computational time remains the same instead of reducing. My code is similar to the following hypothetical cases in which I have to use nested loops and store the results in the list.
Here is the code for the case without preallocation:
n_times <- 5000
my_list <- list()
Sys.time()
start_time <- Sys.time()
for( i in 1:n_times){
for (j in 1:10){
df <- data.frame(y = rnorm(n = 200, mean = sample.int(10,1), sd = 4),
x1 = rnorm(n = 200, mean = sample.int(10,1), sd = 1),
x2 = rnorm(n = 200, mean = sample.int(10,1), sd = 4))
model <- lm(y ~ x1 + x2, data = df)
my_list[[as.character(j)]][i] <- summary(model)$r.squared
}
}
end_time <- Sys.time()
end_time - start_time
and here is the code for the case with preallocation:
# number of times the simulation to be run
n_times <- 5000
# preallocating the list of length 10 with numeric vectors of length n_times
my_list <- replicate(10, vector("numeric", n_times), simplify = F)
names(my_list) <- as.character(1:10)
Sys.time()
start_time <- Sys.time()
for( i in 1:n_times){
for (j in 1:10){
df <- data.frame(y = rnorm(n = 200, mean = sample.int(10,1), sd = 4),
x1 = rnorm(n = 200, mean = sample.int(10,1), sd = 1),
x2 = rnorm(n = 200, mean = sample.int(10,1), sd = 4))
model <- lm(y ~ x1 + x2, data = df)
my_list[[as.character(j)]][i] <- summary(model)$r.squared
}
}
end_time <- Sys.time()
end_time - start_time
I think preallocating a list with just 5000 * 10 elements doesn't take much time. After profiling your code, most of the time goes to the lm and data.frame calls.
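A rough way to check this yourself, sketched with system.time rather than a full profiler: time the modelling work and the storage step separately. The run count is reduced from the question's 5000 so it finishes quickly, and the variable names (t_model, t_grow, t_prealloc) are just for illustration.

```r
set.seed(1)
n_times <- 200 # smaller than the 5000 in the question, so this runs quickly

# Time only the data.frame + lm + summary work
t_model <- system.time(
  for (i in seq_len(n_times)) {
    df <- data.frame(y = rnorm(n = 200, mean = sample.int(10, 1), sd = 4),
                     x1 = rnorm(n = 200, mean = sample.int(10, 1), sd = 1),
                     x2 = rnorm(n = 200, mean = sample.int(10, 1), sd = 4))
    r2 <- summary(lm(y ~ x1 + x2, data = df))$r.squared
  }
)

# Time only the storage step: growing a vector versus preallocating it
grown <- numeric(0)
t_grow <- system.time(for (i in seq_len(n_times)) grown[i] <- 0.5)
prealloc <- numeric(n_times)
t_prealloc <- system.time(for (i in seq_len(n_times)) prealloc[i] <- 0.5)

# Elapsed seconds for each part
c(model = t_model[["elapsed"]],
  grow = t_grow[["elapsed"]],
  prealloc = t_prealloc[["elapsed"]])
```

If the model timing dominates the two storage timings, as I would expect here, preallocation has little room to change the total run time.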

Randomizing a distribution of data in a list

I have a data frame df that I would like to separate into a training set and a test set. Instead of getting only a single training and test set, I would like to get a distribution of them (n = 100).
I tried to do this with lapply, but the values for each element in the list end up being exactly the same. How do I randomize the values in the two lists (i.e., train.data and test.data)?
The expected output would be a list for both train.data and test.data, each containing 100 elements with different subsets of df in both of them.
library(lubridate)
library(tidyverse)
library(caret)
date <- rep_len(seq(dmy("01-01-2013"), dmy("31-12-2013"), by = "days"), 300)
ID <- rep(c("A","B","C"), 50)
class <- rep(c("N","M"), 50)
df <- data.frame(value = runif(length(date), min = 0.5, max = 25),
ID,
class)
training.samples <- df$class %>%
createDataPartition(p = 0.6, list = FALSE)
n <- 100
train.data <- lapply(1:n, function(x){
df[training.samples, ]
})
test.data <- lapply(1:n, function(x){
df[-training.samples, ]
})
Try using replicate
f1 <- function(dat, colnm) {
s1 <- createDataPartition(dat[[colnm]], p = 0.6,
list = FALSE)
return(list(train.data = dat[s1,], test.data = dat[-s1,]))
}
n <- 100
out <- replicate(n, f1(df, "class"), simplify = FALSE)
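The reason the lapply version returns identical elements is that the partition is drawn once, outside the function, so every iteration reuses the same indices. Below is a self-contained sketch of the same idea that draws a fresh partition on each call. Note the assumptions: it uses base R sample() instead of createDataPartition, so the split is not stratified by class, and split_once is a made-up name.

```r
set.seed(42)
df <- data.frame(value = runif(300, min = 0.5, max = 25),
                 ID = rep(c("A", "B", "C"), 100),
                 class = rep(c("N", "M"), 150))

# Hypothetical helper: draw a new random 60/40 split on every call
split_once <- function(dat, p = 0.6) {
  idx <- sample(nrow(dat), size = floor(p * nrow(dat)))
  list(train.data = dat[idx, ], test.data = dat[-idx, ])
}

n <- 100
out <- replicate(n, split_once(df), simplify = FALSE)

# Pull the pieces apart into the two lists asked for in the question
train.data <- lapply(out, `[[`, "train.data")
test.data <- lapply(out, `[[`, "test.data")
```

The same two lapply calls should also work on the out object produced by f1 above, since it returns a list with the same element names.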

Computing NTILE in R for RFM analysis

I'm trying to create a dataframe that computes 10 ntiles (deciles) based on Recency, Frequency and Monetary. I have most of it set up, but I can't figure out why my code is returning three ntiles when I'm asking for 10. I'm currently at a standstill. The next step will be calculating the percentage of customers in each ntile.
Here is my code:
rm(list = ls())
setwd("/Users/a76475/Documents/Customer_Analytics")
rfm<-read.csv("cdnow_students_transaction.csv")
#Set up test and calibration samples
rfm$DATE <- as.Date(rfm$DATE, format = "%m/%d/%y")
calib <- subset(rfm, rfm$DATE<"1997-09-29")
valid <- subset(rfm, rfm$DATE>"1997-09-30")
#Aggregate for frequency, monetary, and recency -- Calibration
recency<- aggregate(DATE ~ ID, data =calib, FUN = max)
colnames(recency) <- c("ID","Recency")
frequency <- aggregate(DOLLARS ~ ID, data =calib, FUN = length)
colnames(frequency) <- c("ID","Frequency")
monetary <- aggregate(DOLLARS ~ ID, data =calib, FUN = mean)
colnames(frequency) <- c("ID","Monetary")
calib <- merge(frequency, monetary, by = "ID")
calib <- merge(calib, recency, by = "ID")
#Aggregate for frequency, monetary, and recency -- Validation
recency<- aggregate(DATE ~ ID, data =valid, FUN = max)
colnames(recency) <- c("ID","Recency")
frequency <- aggregate(DOLLARS ~ ID, data =valid, FUN = length)
colnames(frequency) <- c("ID","Frequency")
monetary <- aggregate(DOLLARS ~ ID, data =valid, FUN = mean)
colnames(frequency) <- c("ID","Monetary")
valid <- merge(frequency, monetary, by = "ID")
valid <- merge(valid, recency, by = "ID")
colnames(valid) <- c("ID","FREQ","MONETARY","RECENCY")
colnames(calib) <- c("ID","FREQ","MONETARY","RECENCY")
calib$RECENCY <- NULL
#Create recency score
#For validation
for (i in 1:nrow(valid)) {
valid$RECENCY1[i] = as.numeric(max(valid$RECENCY) - valid$RECENCY[i])
}
valid$RECENCY <- valid$RECENCY1
valid$RECENCY1 <- NULL
#For calibration
for (i in 1:nrow(calib)) {
calib$RECENCY1[i] = as.numeric(max(valid$RECENCY) - calib$RECENCY[i])
}
calib$RECENCY <- calib$RECENCY1
calib$RECENCY1 <- NULL
#Merge datasets
rfm <- merge(calib,valid, by="ID", all.x = TRUE)
#Create Column for retention%
require(dplyr)
rfm$monetary.ntile <- ntile(rfm$MONETARY.y,10)
rfm$freq.ntile <- ntile(rfm$FREQ.y,10)
rfm$recency.ntile <- ntile(rfm$RECENCY,10)
For example, if you want 10 buckets for Recency, Frequency and Monetary Ratio:
set.seed(1)
n <- 100
df <- data.frame(
R = runif(n, 1, 365),
F = runif(n, 1, 5),
M = runif(n, 0, 100)
)
apply(df, 2, function(col) {
breaks <- quantile(col, probs = seq(0, 1, length.out = 11)) # 11 break points give 10 buckets
findInterval(col, breaks, rightmost.closed = TRUE) # keep the maximum value in bucket 10
})
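For the next step mentioned in the question (the percentage of customers in each ntile), a small base R sketch on simulated recency values. It assumes 11 quantile break points for 10 buckets and rightmost.closed = TRUE so the maximum stays in the top bucket; the variable names are illustrative only.

```r
set.seed(1)
recency <- runif(100, 1, 365) # simulated values, not the cdnow data

# 11 break points (0%, 10%, ..., 100%) define 10 buckets
breaks <- quantile(recency, probs = seq(0, 1, length.out = 11))
bucket <- findInterval(recency, breaks, rightmost.closed = TRUE)

# Count and percentage of observations per bucket
counts <- table(bucket)
pct <- round(100 * prop.table(counts), 1)
pct
```

With real data that contains many tied values (for example, lots of customers with the same frequency), the quantile breaks can coincide and some buckets may end up empty, so it is worth checking table(bucket) before relying on the percentages.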

Passing data to forecast.lm using dplyr and do

I am having trouble passing data to forecast.lm in a dplyr do. I want to build several models based on a factor - hour - and then forecast from these models using new data.
Building on previous excellent examples here is my data example:
require(dplyr)
require(forecast)
# Training set
df.h <- data.frame(
hour = factor(rep(1:24, each = 100)),
price = runif(2400, min = -10, max = 125),
wind = runif(2400, min = 0, max = 2500),
temp = runif(2400, min = - 10, max = 25)
)
# Forecasting set
df.f <- data.frame(
hour = factor(rep(1:24, each = 10)),
wind = runif(240, min = 0, max = 2500),
temp = runif(240, min = - 10, max = 25)
)
# Bind training & forecasting
df <- rbind(df.h, data.frame(df.f, price=NA))
# Do a training model and then forecast using the new data
df <- rbind(df.h, data.frame(df.f, price=NA))
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- Arima(hist$price, xreg = hist[,3:4], order = c(1,1,0))
data.frame(fore[], price=forecast.Arima(fit, xreg = fore[ ,2:3])$mean)
})
res
This works excellently with a time series model, but with lm I have a problem passing the data into the forecasting part.
My corresponding lm example looks like this:
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- lm(hist$price ~ wind + temp, data = hist)
data.frame(fore[], price = forecast.lm(fit, newdata = fore[, 2:3])$mean)
})
The problem is that I can't get data into the newdata argument. If you add hist$ in the fit section, you can't reference the forecast data, and for some reason if you add data = fore it can't find it - but it can in the time series example.
The problem is that forecast.lm expects that fit has a data component. If you use glm or tslm, that is true. But lm objects don't generally have a data component. So you need to manually add fit$data <- hist for forecast.lm to work properly.
res <- group_by(df, hour) %>% do({
hist <- .[!is.na(.$price), ]
fore <- .[is.na(.$price), c('hour', 'wind', 'temp')]
fit <- lm(price ~ wind + temp, data = hist)
fit$data <- hist # have to add data manually
data.frame(fore[], price = forecast.lm(fit, newdata = fore[, 2:3])$mean)
})
This is actually a known issue.
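If forecast.lm itself is not essential, base R predict() may be a simpler route, because it only needs the fitted model and newdata. This is a different technique from the one above, sketched under the assumption that point predictions are all you need. It uses lapply over the hour levels instead of do, and fewer hours and rows than the question so it stays short.

```r
set.seed(1)

# Training set (3 hours instead of 24, to keep the sketch small)
df.h <- data.frame(
  hour = factor(rep(1:3, each = 50)),
  price = runif(150, min = -10, max = 125),
  wind = runif(150, min = 0, max = 2500),
  temp = runif(150, min = -10, max = 25)
)

# Forecasting set
df.f <- data.frame(
  hour = factor(rep(1:3, each = 5)),
  wind = runif(15, min = 0, max = 2500),
  temp = runif(15, min = -10, max = 25)
)

# Fit one lm per hour and predict on the matching rows of df.f
res <- do.call(rbind, lapply(levels(df.h$hour), function(h) {
  hist <- df.h[df.h$hour == h, ]
  fore <- df.f[df.f$hour == h, ]
  # Use bare column names in the formula so newdata can be matched by name
  fit <- lm(price ~ wind + temp, data = hist)
  fore$price <- predict(fit, newdata = fore)
  fore
}))
res
```

Writing the formula as price ~ wind + temp rather than hist$price ~ wind + temp matters here as well: predict looks the variables up by name in newdata.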

randomly assign teachers to a school with dplyr or similar?

Suppose I have a data frame with 8 schools and their characteristics, and another with 48 teachers and their characteristics. I can generate some fake data with the following code:
library(dplyr)
library(geosphere)
set.seed(6232015)
n.schools <-8
n.teachers <- 48
makeRandomString <- function(pre, n=1, length=12) {
randomString <- c(1:n) # initialize vector
for (i in 1:n) {
randomString[i] <- paste0(pre,'.', paste(sample(c(0:9, letters, LETTERS),
length, replace=TRUE),
collapse=""))
}
return(randomString)
}
gen.teachers <- function(n.teachers){
Teacher.ID <- makeRandomString(pre= 'T', n = n.teachers, length = 20)
Teacher.exp <- runif(n = n.teachers, min = 1, max = 30)
Teacher.Other <- sample(c(0,1), replace = T, prob = c(0.5, 0.5), size = n.teachers)
Teachers <- data.frame(Teacher.ID, Teacher.exp, Teacher.Other)
return(Teachers)
}
gen.schools <- function(n.schools){
School.ID <- makeRandomString(pre= 'S', n = n.schools, length = 20)
School.lat <- runif(n = n.schools, min = -2, max = 2)
School.long <- runif(n = n.schools, min = -2, max = 2)
Schools <- data.frame(School.ID, School.lat, School.long) %>%
rowwise() %>% mutate (School.distance = distHaversine(p1 = c(School.long, School.lat),
p2 = c(0, 0), r = 3961))
return(Schools)
}
Teachers <- gen.teachers(n.teachers = n.teachers)
Schools <- gen.schools(n.schools = n.schools)
To each school, I want to assign 6 teachers (every teacher gets 1 and only 1 school). I could use:
Teachers %>% sample_n(6)
That gives me a list of 6 teachers, which I could assign to a school, remove from my original pool, and then keep going with a loop. My guess/hope is that there is a much easier way of doing this.
Thanks for the help!
In the context of your code
sample(rep(Schools$School.ID, each = 6))
gives a random sequence of schools in which each School.ID appears 6 times. Set Teachers$AssignedSchool to this sample and each teacher has an assigned school.
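A self-contained sketch of that assignment, with simplified stand-in IDs in place of the random strings generated in the question:

```r
set.seed(6232015)

# Simplified stand-ins for the Schools and Teachers data frames
Schools <- data.frame(School.ID = paste0("S.", 1:8))
Teachers <- data.frame(Teacher.ID = paste0("T.", 1:48))

# Repeat each school 6 times, shuffle, and attach to the teachers
Teachers$AssignedSchool <- sample(rep(Schools$School.ID, each = 6))

# Check: every school should appear exactly 6 times
table(Teachers$AssignedSchool)
```

This relies on the number of teachers being exactly 6 times the number of schools; otherwise the assignment on the left-hand side would fail because the lengths differ.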
