This is my first question on Stack Overflow, so please bear with me if I make any mistakes or omit necessary information.
I have a dataset consisting of a time series where I need to find the 5-day rolling average of a binary variable for each specific hour of the day. An example of my data can be created using:
library(dplyr)
library(zoo)
set.seed(69)
df <- data.frame(Hour = rep(1:24, times = 10),
                 Reg = round(runif(24 * 10, 0, 1)),
                 HumidityLevel = runif(24 * 10, 0, 100))
# Daily (24-hour) min and max of HumidityLevel, repeated for every hour of the day
df_ranges <- data.frame(LowerRange = rep(rollapply(df$HumidityLevel, 24, min, by = 24), each = 24),
                        UpperRange = rep(rollapply(df$HumidityLevel, 24, max, by = 24), each = 24))
df <- cbind(df, df_ranges)
I have computed the simple rolling average using the following code:
df <- df %>%
  group_by(Hour) %>%
  mutate(AvgReg = lag(rollapplyr(Reg, 5, mean, na.rm = TRUE, partial = TRUE), n = 1))
What I need to do is compute the rolling average of Reg using only those previous rows where HumidityLevel lies within the range for that specific day. The lower and upper boundaries of the range are given by two columns (LowerRange, UpperRange), whose values are the lowest and highest HumidityLevel values of the day.
For instance, a day may have levels between 20 and 54. The rolling average for Hour 1 of that day should then be computed using previous Hour 1 observations with a HumidityLevel greater than or equal to 20 and less than or equal to 54.
I hope that my question makes sense.
This is my desired output:
desired_output <- data.frame(RowNum = 1:10,
                             Hour = rep(1, times = 10),
                             Reg = c(1, 0, 0, 1, 0, 1, 0, 0, 0, 0),
                             HumidityLevel = c(28.36, 65.02, 1.12, 49.61, 24.50, 98.16, 77.33, 97.03, 47.03, 85.71),
                             LowerBoundary = c(5.67, 7.50, 1.12, 19.32, 0.01, 6.94, 7.48, 0.71, 2.85, 1.59),
                             UpperBoundary = c(93.60, 89.37, 97.25, 99.63, 91.92, 98.16, 98.48, 99.98, 99.70, 98.86),
                             AvgReg = c(NA, 1, 0.5, 0.5, 0.5, 0.5, 0.6, 0.4, 0.4, 0.2))
Using data.table, you can use between() to filter the rows and shift() + frollmean() for the calculation:
setDT(df)[
  between(HumidityLevel, LowerRange, UpperRange),   # keep only rows inside the day's range
  new_col := shift(                                 # lag by one, as in your lag(..., n = 1)
    frollmean(Reg, c(seq_len(min(5, .N)), rep(5, max(0, .N - 5))), adaptive = TRUE)
  ),
  by = Hour
]
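For intuition, the adaptive window vector grows 1, 2, ..., 5 and then stays at 5, which mimics your rollapplyr(..., partial = TRUE); for example, with a hypothetical group of 7 rows:
> n <- 7  # stand-in for .N
> c(seq_len(min(5, n)), rep(5, max(0, n - 5)))
[1] 1 2 3 4 5 5 5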
I need, with the help of the map() function, to apply the above for each element.
How can I do so?
As dt is of class data.table, you can build a vector of the columns of interest (i.e. your items; below I use grepl() on the names) and then apply your weighting function to each of those columns using .SD and .SDcols, together with by:
qs <- names(dt)[grepl("^q", names(dt))]
dt[, (paste0(qs, "wt")) := lapply(.SD, \(q) 1 / (sum(!is.na(q)) / .N)),
   by = .(sex, education_code, age), .SDcols = qs]
As mentioned in the comments, you are missing a dt <- in your dt[, .(ID, education_code, age, sex, item = q1_1)], which makes the column item unavailable in the following line dt[, no_respond := is.na(item)].
Your weighting scheme is not absolutely clear to me; however, assuming you want to do what your code does here, I would go with a dplyr solution to iterate over the columns.
# your data without the no_respond column and with the missing value in q2_3 corrected
dt <- data.table::data.table(
  ID = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  education_code = c(20, 50, 20, 60, 20, 10, 5, 12, 12, 12),
  age = c(87, 67, 56, 52, 34, 56, 67, 78, 23, 34),
  sex = c("F", "M", "M", "M", "F", "M", "M", "M", "M", "M"),
  q1_1 = c(NA, 1, 5, 3, 1, NA, 3, 4, 5, 1),
  q1_2 = c(NA, 1, 5, 3, 1, 2, NA, 4, 5, 1),
  q1_3 = c(NA, 1, 5, 3, 1, 2, 3, 4, 5, 1),
  q1_text = c(NA, 1, 5, 3, 1, 2, 3, 4, 5, 1),
  q2_1 = c(NA, 1, 5, 3, 1, 2, 3, 4, 5, 1),
  q2_2 = c(NA, 1, 5, 3, 1, 2, 3, 4, 5, 1),
  q2_3 = c(NA, 1, 5, 3, 1, NA, NA, 4, 5, 1),
  q2_text = c(NA, 1, 5, 3, 1, NA, 3, 4, 5, 1))
library(dplyr)
dt %>%
  group_by(sex, education_code, age) %>%   # group the data by sex, education_code, age
  add_count() %>%                          # add a column n with the number of rows in each group
  mutate(across(starts_with("q"),          # for each column starting with "q"
                ~ 1 / (sum(!is.na(.)) / n),  # create a new column following your weight calculation
                .names = '{.col}_wgt')) %>%  # name the new column with the suffix "_wgt"
  ungroup()
I am trying to make a table like this -
The table contains several scenarios and risk_type columns.
The scenarios are basically filters. For example:
0 - loan_age > 18
1 - interest_rate > 8%
2 - interest_rate > 18% AND referee == "MALE" AND new_LTV > 50
risk_type are columns in the original dataset like
A - flood risk
B - wildfire risk
C - foundation risk
What I want to do is to create a summary table of all these different risks for all the filters.
This is what the data looks like -
Damage and new LTV are functions of the risk score, and I want to filter for risk score > 4.
Edit - The first 5 rows of the dummy dataframe.
structure(list(ID = c(1, 2, 3, 4, 5), LTV_value = c(43, 43, 32,
34, 35), loan_age = c(17, 65, 32, 33, 221), referee = c("MALE",
"FEMALE", "MALE", "MALE", "FEMALE"), interest_rate = c(0.02,
0.03, 0.05, 0.0633333333333333, 0.0783333333333333), value = c(70000,
80000, 90000, 1e+05, 45000), flood_risk_score = c(3, 4, 5, 0,
1), wildfire_risk_score = c(3, 4, 3, 3, 2), foundation_risk_score = c(5,
5, 2, 0, 1), flood_damage = c(21000, 32000, 45000, 0, 4500),
wildfire_damage = c(21000, 32000, 27000, 30000, 9000), foundation_damage = c(35000,
40000, 18000, 0, 4500), new_LTV_flood = c(40, 39, 27, 34,
34), new_LTV_wildfire = c(40, 39, 29, 31, 33), new_LTV_foundation = c(38,
38, 30, 34, 34)), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))
So far I have tried these methods.
risk_list <- c("flood_risk_score",
               "wildfire_risk_score",
               "foundation_risk_score")
table_append <- NULL  # collects one summary row per risk type
for (i in risk_list) {
  table <- df %>%
    filter(df[i] > 3) %>%
    summarise(Count = n(),
              mean = mean(value, na.rm = TRUE),
              LTV = mean(LTV_value))
  # Using rbind() to append the output of one iteration to the dataframe
  table_append <- rbind(table_append, table)
}
This helps me get the values for all the risk scores; however, I have two issues here:
I am unable to filter according to a filter list
For the filter list, I tried this code, but I am unable to use it in a loop:
filters_list <- list(which(df$interest_rate > 0.08),
                     which(df$loan_age > 18))
For the LTV update, each risk type has a different new LTV.
All of them need to be filtered for high LTV using their own new_LTV columns.
risk_type_list <- c("flood",
                    "wildfire",
                    "foundation")
for (i in paste0(risk_type_list, "_risk_score")) {
  table <- df %>%
    filter(df[i] > 3) %>%
    summarise(Count = n())
  # Using rbind() to append the output of one iteration to the dataframe
  table_append <- rbind(table_append, table)
}
In the end, I want code that will generate this summary from the given data by applying the required filters for all the different risk types, and that also uses their respective new LTV values.
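A sketch of one possible approach (column names are taken from the dput above, interest rates are kept on the 0-1 scale, the risk-score cutoff of 4 follows the prose, and the risk-specific new_LTV condition of scenario 2 is left out for brevity) is to store the scenario filters as quoted expressions and loop over both lists:
library(dplyr)
library(purrr)

# Hypothetical encoding of the scenarios as quoted filter expressions
scenarios <- list("0" = quote(loan_age > 18),
                  "1" = quote(interest_rate > 0.08),
                  "2" = quote(interest_rate > 0.18 & referee == "MALE"))
risk_types <- c("flood", "wildfire", "foundation")

summary_table <- map_dfr(names(scenarios), function(s) {
  map_dfr(risk_types, function(r) {
    df %>%
      filter(!!scenarios[[s]],                           # apply the scenario filter
             .data[[paste0(r, "_risk_score")]] > 4) %>%  # and the risk-score cutoff
      summarise(scenario = s,
                risk_type = r,
                Count = n(),
                mean_value = mean(value, na.rm = TRUE),
                mean_new_LTV = mean(.data[[paste0("new_LTV_", r)]], na.rm = TRUE))
  })
})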
I'm currently trying to do a stratified split in R to create train and test datasets.
The problem posed to me is the following:
split the data into a train and test sample such that 70% of the data
is in the train sample. To ensure a similar distribution of price
across the train and test samples, use createDataPartition from the
caret package. Set groups to 100 and use a seed of 1031. What is the
average house price in the train sample?
The dataset is a set of houses with prices, along with other data points.
For some reason, when I run the following code, the output I get is labeled as incorrect in the practice problem simulator. Can anyone spot an issue with my code? Any help is much appreciated, since I'm trying to avoid learning this language incorrectly.
dput(head(houses))
library(ISLR); library(caret); library(caTools)
options(scipen=999)
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = houses$price,p = 0.7,list = F, groups = 100)
train = houses[split,]
test = houses[-split,]
nrow(train)
nrow(test)
nrow(houses)
mean(train$price)
mean(test$price)
Output
> dput(head(houses))
structure(list(id = c(7129300520, 6414100192, 5631500400, 2487200875,
1954400510, 7237550310), price = c(221900, 538000, 180000, 604000,
510000, 1225000), bedrooms = c(3, 3, 2, 4, 3, 4), bathrooms = c(1,
2.25, 1, 3, 2, 4.5), sqft_living = c(1180, 2570, 770, 1960, 1680,
5420), sqft_lot = c(5650, 7242, 10000, 5000, 8080, 101930), floors = c(1,
2, 1, 1, 1, 1), waterfront = c(0, 0, 0, 0, 0, 0), view = c(0,
0, 0, 0, 0, 0), condition = c(3, 3, 3, 5, 3, 3), grade = c(7,
7, 6, 7, 8, 11), sqft_above = c(1180, 2170, 770, 1050, 1680,
3890), sqft_basement = c(0, 400, 0, 910, 0, 1530), yr_built = c(1955,
1951, 1933, 1965, 1987, 2001), yr_renovated = c(0, 1991, 0, 0,
0, 0), age = c(59, 63, 82, 49, 28, 13)), row.names = c(NA, -6L
), class = c("tbl_df", "tbl", "data.frame"))
>
> library(ISLR); library(caret); library(caTools)
> options(scipen=999)
>
> set.seed(1031)
> #STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
> split = createDataPartition(y = houses$price,p = 0.7,list = F, groups = 100)
>
> train = houses[split,]
> test = houses[-split,]
>
> nrow(train)
[1] 15172
> nrow(test)
[1] 6441
> nrow(houses)
[1] 21613
>
> mean(train$price)
[1] 540674.2
> mean(test$price)
[1] 538707.6
I tried to reproduce it manually using sample_frac from the dplyr package and the cut2 function from the Hmisc package. The results are almost the same, but still not identical.
It looks like there might be a problem with the pseudo-random number generator or with some rounding.
In my opinion your code is correct.
Is it possible that in previous steps you were supposed to remove some outliers or pre-process the dataset in some other way?
library(caret)
options(scipen=999)
library(dplyr)
library(ggplot2) # to use diamonds dataset
library(Hmisc)
diamonds$index = 1:nrow(diamonds)
set.seed(1031)
# I use diamonds dataset from ggplot2 package
# g parameter (in cut2) - number of quantile groups
split = diamonds %>%
  group_by(cut2(diamonds$price, g = 100)) %>%
  sample_frac(0.7) %>%
  pull(index)
train = diamonds[split,]
test = diamonds[-split,]
> mean(train$price)
[1] 3932.75
> mean(test$price)
[1] 3932.917
set.seed(1031)
#STRATIFIED RANDOM SAMPLING with groups of 100, stratified on price, 70% in train
split = createDataPartition(y = diamonds$price,p = 0.7,list = T, groups = 100)
train = diamonds[split$Resample1,]
test = diamonds[-split$Resample1,]
> mean(train$price)
[1] 3932.897
> mean(test$price)
[1] 3932.572
This sampling procedure should result in a mean that approximates the population mean.
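As a quick sanity check, both of the means above sit right at the population mean:
> mean(diamonds$price)
[1] 3932.8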
I want to add a modification factor to an existing equation to fit data. The original equation is defined through a function because the variable N_l is a vector of numbers: the function goes through all possible values in N_l and selects the largest outcome of the equation. The original function is defined as:
library(utils)
R <- function(x) {
  N_b <- x[1]
  N_l <- x[2]
  A <- x[3]
  x.sqr <- x[4]
  S <- x[10]
  e <- x[grepl("e_\\d", names(x))]
  f <- sapply(seq(N_l), function(k) max(Multi.Presence$m[k] * ((k / N_b) +
         (A * combn(e, k, sum) / x.sqr))))
  c(val = max(f), pos = which.max(f))
}
DATA.GIRDER1 <- cbind(DATA.GIRDER1, vars = t(apply(DATA.GIRDER1, 1, R)))
colnames(DATA.GIRDER1)[12:13] <- c("Proposed.Girder1","Lanes")
The equation defined in the function is:
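In symbols (reconstructed from the function body):

$$R = \max_{1 \le k \le N_l} \; \max_{\substack{E \subseteq \{e_1,\dots,e_5\} \\ |E| = k}} \; m_k \left( \frac{k}{N_b} + \frac{A \sum_{e \in E} e}{x_{\mathrm{sqr}}} \right)$$

where $m_k$ is Multi.Presence$m[k] and $E$ ranges over all size-$k$ combinations of the $e$ columns.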
The first 5 rows of the dataframe DATA.GIRDER1 and dataframe Multi.Presence are provided:
> dput(DATA.GIRDER1[(1:5), ])
structure(list(N_b = c(5, 5, 5, 5, 5), N_l = c(4, 4, 4, 4, 4),
A = c(-12, -12, -12, -12, -12), x.sqr = c(1440, 1440,
1440, 1440, 1440), e_1 = c(21.8, 21.8, 21.8, 21.8, 21.8),
e_2 = c(9.8, 9.8, 9.8, 9.8, 9.8), e_3 = c(-2.2, -2.2, -2.2,
-2.2, -2.2), e_4 = c(-14.2, -14.2, -14.2, -14.2, -14.2),
e_5 = c(0, 0, 0, 0, 0), S = c(12, 12, 12, 12, 12),
R = c(0.59189685884369, 0.583646426252063,
0.556293941275237, 0.576160481501275, 0.597435112708129)),
row.names = c(NA, 5L), class = "data.frame")
> dput(Multi.Presence)
structure(list(N_l = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), m = c(1.2,
1, 0.85, 0.65, 0.65, 0.65, 0.65, 0.65, 0.65, 0.65)), row.names = c(NA,
-10L), class = "data.frame")
The theoretical data to fit the equation to is CSi.Girder1. At the moment, the way the function is set up, it calculates the maximum R for each row of dataframe DATA.GIRDER1.
I want to add a regression term based on variable S in dataframe DATA.GIRDER1 to the second part of the equation, in order to find the parameters a and b that best fit the data in CSi.Girder1. The desired output would implement the equation below:
To use nlsLM I need to define a function for the equation such as:
library(minpack.lm)
Proposed.Girder1 <- function(N_b, N_l, A, x.sqr, e_1, e_2, e_3, e_4, e_5, S, a, b) {
  R <- function(x) {
    N_b <- x[1]
    N_l <- x[2]
    A <- x[3]
    x.sqr <- x[4]
    e <- x[grepl("e_\\d", names(x))]
    f <- sapply(seq(N_l), function(k) max(Multi.Presence$m[k] * ((k / N_b) +
           (A * combn(e, k, sum) / x.sqr) * (b * S^a))))
    c(val = max(f), pos = which.max(f))
  }
  DATA.GIRDER1 <- cbind(DATA.GIRDER1, vars = t(apply(DATA.GIRDER1, 1, R)))
  colnames(DATA.GIRDER1)[12:13] <- c("Proposed.Girder1", "Lanes")
  return(R)
}
Girder1_nlsLM <- nlsLM(R ~ Proposed.Girder1(N_b, N_l, A, x.sqr, e_1, e_2, e_3, e_4, e_5, S, a, b),
                       data = DATA.GIRDER1,
                       start = c(a = 0.01, b = 0.01))
summary(Girder1_nlsLM)
But this is not successful, and I get the error:
Error in model.frame.default(formula = ~R + N_b + N_l + A + x.sqr + e_1 + :
object is not a matrix
How can I add this modification factor in terms of variable S to solve for the parameters a and b?
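One plausible restructuring (a sketch only: nlsLM expects the right-hand-side function to return a numeric vector with one fitted value per row, rather than to modify DATA.GIRDER1 as a side effect or return a function) would be:
library(minpack.lm)

# Sketch: vectorise over rows and return max(f) for each one; a and b stay free parameters
Proposed.Girder1.fit <- function(N_b, N_l, A, x.sqr, e_1, e_2, e_3, e_4, e_5, S, a, b) {
  e_mat <- cbind(e_1, e_2, e_3, e_4, e_5)
  sapply(seq_along(N_b), function(i) {
    e <- e_mat[i, ]
    f <- sapply(seq(N_l[i]), function(k)
      max(Multi.Presence$m[k] * ((k / N_b[i]) +
            (A[i] * combn(e, k, sum) / x.sqr[i]) * (b * S[i]^a))))
    max(f)  # one predicted R per row
  })
}

Girder1_nlsLM <- nlsLM(R ~ Proposed.Girder1.fit(N_b, N_l, A, x.sqr,
                                                e_1, e_2, e_3, e_4, e_5, S, a, b),
                       data = DATA.GIRDER1,
                       start = c(a = 0.01, b = 0.01))
Note that the max() calls make the model non-smooth, so the numeric-derivative fit may converge slowly or depend on the starting values.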
My question builds on another one previously posted by someone: mapply for all arguments' combinations [R]
I want to apply a function to multiple arguments using mapply, and this works with my code below. But I want to add a condition such that NOT ALL tmin and tmax values are combined: only the first tmin with the first tmax, the second tmin with the second tmax, and so on (tmin == 0.01 goes with tmax == 0.99 and tmin == 0.05 with tmax == 0.95, but tmin == 0.01 should not be combined with tmax == 0.95).
Each tmin/tmax pair should, however, still be combined with ALL variables, as in the expand.grid() call below.
In the end I should have a data frame like the one called "alltogether", but with 15 rows under the described condition instead of the 75 I get now.
I could just filter rows with dplyr::filter afterwards, but is there a nice way to include this condition in the function?
Here is an example data frame:
dataframe <- data.frame(personID = 1:10,
                        Var1 = c(4, 6, 3, 3, 7, 1, 20, NA, 12, 2),
                        Var2 = c(5, 4, 5, 6, 9, 14, 14, 1, 0, NA),
                        Var3 = c(NA, 15, 12, 0, NA, NA, 2, 7, 6, 7),
                        Var4 = c(0, 0, 0, 0, 1, 0, 1, 4, 2, 1),
                        Var5 = c(12, 15, 11, 10, 10, 15, NA, 10, 13, 11))
and here is the code I have so far:
des <- function(var, tmin, tmax) {
  v <- var[var >= quantile(var, probs = tmin, na.rm = TRUE) &
             var <= quantile(var, probs = tmax, na.rm = TRUE)]
  d <- psych::describe(v)
  df <- cbind(variable = deparse(substitute(var)), tmin = tmin, tmax = tmax, d)
  print(df)
}
args <- expand.grid(var = dataframe[, c("Var2", "Var4", "Var5")],
                    tmin = c(0.01, 0.05, 0.1, 0.2, 0.25),
                    tmax = c(0.99, 0.95, 0.9, 0.8, 0.75))
alltogether <- do.call("rbind", mapply(FUN = des, var = args$var, tmin = args$tmin,
                                       tmax = args$tmax, SIMPLIFY = FALSE))
Thank you for helping!
Edit:
The expected output is the one obtained after filtering the "alltogether" dataframe with the following code (15 obs. of 16 variables):
alltogether <- alltogether %>%
  dplyr::filter((tmin == 0.01 & tmax == 0.99) |
                  (tmin == 0.05 & tmax == 0.95) |
                  (tmin == 0.1 & tmax == 0.9) |
                  (tmin == 0.2 & tmax == 0.8) |
                  (tmin == 0.25 & tmax == 0.75))
OK, here's a solution to both problems. Unfortunately, I couldn't get one working with mapply, so I had to rely on a good old for loop (it's still faster, though, since it skips all the extra calculations). I also changed the function so it gives you the variable names you wanted. The biggest difference is that I'm not using expand.grid but merge. Finally, it incorporates your comment from above.
des <- function(var, tmin, tmax, cor.var, cor.method = c("spearman", "pearson", "kendall")) {
  var[var < quantile(var, probs = tmin, na.rm = TRUE) |
        var > quantile(var, probs = tmax, na.rm = TRUE)] <- NA
  d <- psych::describe(var)
  correlation <- cor(cor.var, var, use = "pairwise.complete.obs", method = match.arg(cor.method))
  df <- cbind(variable = names(var), tmin = tmin, tmax = tmax, d, correlation)
  names(df)[length(names(df))] <- paste0("correlation_with_", names(cor.var))
  print(df)
}
minmax <- data.frame(tmin = c(0.01, 0.05, 0.1, 0.2, 0.25),
                     tmax = c(0.99, 0.95, 0.9, 0.8, 0.75))
args <- merge(c("Var2", "Var4", "Var5"), minmax)  # cross join: each variable with each tmin/tmax pair
args[, 1] <- as.character(args[, 1])
alltogether <- NULL
for (i in 1:nrow(args)) {
  alltogether <- rbind(alltogether, des(var = dataframe[args[i, 1]],
                                        tmin = args[i, 2], tmax = args[i, 3],
                                        cor.var = dataframe["Var1"]))
}
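This yields one summary row per variable and tmin/tmax pair, i.e. the 15 rows you asked for:
> nrow(alltogether)
[1] 15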