For each variable j, I want to take the maximum of 0 and j, and then sum these maxima across k variables of a data frame data. In Stata, I did it as follows:
gen sum=max(0,x)+max(0,y)+max(0,z)+...+max(0,k)
In R I used the following approach:
data$sum<-ifelse(data$x<0,0,data$x*1)+ifelse(data$y<0,0,data$y*1)+ifelse(data$z<0,0,data$z*1)+...+ifelse(data$k<0,0,data$k*1)
I was wondering whether there is a more efficient alternative in R to do the same thing.
Try this:
mycols <- c("x", "y", "z", "k")
data$sum <- rowSums(data[mycols] * (data[mycols] > 0))
Check with some sample data:
data <- data.frame(x = runif(10) - 0.5,
y = runif(10) - 0.5,
z = runif(10) - 0.5,
k = runif(10) - 0.5)
identical(rowSums(data[mycols] * (data[mycols] > 0)), # mine
ifelse(data$x < 0, 0, data$x * 1) + # yours
ifelse(data$y < 0, 0, data$y * 1) +
ifelse(data$z < 0, 0, data$z * 1) +
ifelse(data$k < 0, 0, data$k * 1))
# [1] TRUE
Alternatives to flodel's excellent solution, noting the first looks quite a bit like your Stata code.
with( data, # terrible name for an R object, BTW
pmax(x, 0) + pmax(y, 0) + pmax(z, 0) +pmax(k,0) )
rowSums( apply(data[-5], 2, pmax, 0) )
The second one is probably slower, but it is in the running for this R-golf competition. Also a matrix math solution:
as.matrix( (data[,1:4] > 0 )* data[, 1:4]) %*% rep(1, 4 )
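One more variant along the same lines, as a minimal sketch on made-up data. The data frame dat and the names clipped_sum and reduce_sum are illustrative and not from the question. pmax() keeps the dim attribute of a matrix argument, so the clipping can be done once and passed to rowSums(); the Reduce() version works column by column and stays close to the Stata expression.

```r
# Hypothetical toy data; only the column names follow the question.
set.seed(1)
dat <- data.frame(x = runif(10) - 0.5, y = runif(10) - 0.5,
                  z = runif(10) - 0.5, k = runif(10) - 0.5)
mycols <- c("x", "y", "z", "k")

# pmax() preserves the matrix shape, so rowSums() applies directly.
clipped_sum <- rowSums(pmax(as.matrix(dat[mycols]), 0))

# Column-by-column version: pmax(x, 0) + pmax(y, 0) + ...
reduce_sum <- Reduce(`+`, lapply(dat[mycols], pmax, 0))

# Reference: the ifelse() approach from the question.
ifelse_sum <- with(dat, ifelse(x < 0, 0, x) + ifelse(y < 0, 0, y) +
                        ifelse(z < 0, 0, z) + ifelse(k < 0, 0, k))
```

All three vectors should agree up to floating-point rounding, so all.equal() is a safer comparison here than identical().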
Not the question, but writing out every variable in Stata is likely to be tedious and error-prone. There is likely to be scope for a loop here:
gen sum = 0
quietly foreach v of var varlist {
replace sum = sum + `v' if inrange(`v', 0, .)
}
where you must work out what the varlist should be.
I am running the code below to evaluate a function at each value of r.
For each element of r, the function calculates the sum of elements of a matrix product. Before doing this, values of M are adjusted based on a kernel function.
# (1) set-up with toy data
r <- seq(0, 10, 1)
bw <- 25
M <- matrix(data = c(0, 1, 2,
1, 0, 1,
2, 1, 0), nrow = 3, ncol = 3)
X <- matrix(rep(1, 9), 3, 3)
#
# (2) computation
res <- c()
# loop, calculating sum, Epanechnikov kernel
for(i in seq_along(r)) {
res[i] <- sum(
# Epanechnikov kernel
ifelse(-bw < (M - r[i]) & (M - r[i]) < bw,
3 * (1 - ((M - r[i])^2 / bw^2)) / (4*bw),
0) * X,
na.rm = TRUE
)
}
# result
res
I am looking for recommendations to speed this up using base R. Thank you!
Using outer:
Mr <- outer(c(M), r, "-")
colSums(3*(1 - Mr^2/bw^2)/4/bw*(abs(Mr) < bw)*c(X))
#> [1] 0.269424 0.269760 0.269232 0.267840 0.265584 0.262464 0.258480 0.253632 0.247920 0.241344 0.233904
I'll also note that the original for loop solution can be sped up by pre-allocating res (e.g., res <- numeric(length(r))) prior to the for loop.
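To check the two approaches against each other, here is a small sketch that uses the toy data from the question. The helper name epan_sum is hypothetical and only wraps the body of the original loop; res_loop is pre-allocated as suggested above.

```r
# Toy data from the question.
r  <- seq(0, 10, 1)
bw <- 25
M  <- matrix(c(0, 1, 2,
               1, 0, 1,
               2, 1, 0), nrow = 3, ncol = 3)
X  <- matrix(1, 3, 3)

# Hypothetical helper: Epanechnikov-weighted sum for a single value of r.
epan_sum <- function(ri) {
  d <- M - ri
  sum(ifelse(abs(d) < bw, 3 * (1 - d^2 / bw^2) / (4 * bw), 0) * X)
}

# Loop version with a pre-allocated result vector.
res_loop <- numeric(length(r))
for (i in seq_along(r)) res_loop[i] <- epan_sum(r[i])

# Vectorised version: one column of differences per value of r.
Mr <- outer(c(M), r, "-")
res_outer <- colSums(3 * (1 - Mr^2 / bw^2) / (4 * bw) * (abs(Mr) < bw) * c(X))

all.equal(res_loop, res_outer)
```

The outer() version builds a length(M) by length(r) matrix in memory, which is fine for data of this size but worth keeping in mind for large M or long r.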
How can I sample from a piecewise-defined distribution in R? For example, a random variable X is N(0,1) if p=1 and N(0,2) if p=0. I tried the following code:
if(p==1)(X=rnorm(1,0,2))?
You can use ifelse:
X <- function(size){
ifelse(sample(0:1,size,replace = TRUE),rnorm(size,0,1),rnorm(size,0,2))
}
50% of the time (on average), X will sample from a N(0,1) variable and the other 50% of the time it will sample from N(0,2).
How it works can be seen more clearly if you change the definition of X so that the means of the two variables sampled from are different:
X <- function(size){
ifelse(sample(0:1,size,replace = TRUE),rnorm(size,0,1),rnorm(size,4,1))
}
Then hist(X(10000)) yields a bimodal histogram with peaks near 0 and 4.
library(tidyverse)  # case_when() comes from dplyr
# define the function pieces; each draws n values so that every
# element of p gets its own random draw
g <- function(n) rnorm(n, 0, 2)
h <- function(n) rnorm(n, 0, 1)
# define the input
p <- c(1, 0, 1, 1, 0)
# longer input
# p <- sample(c(0, 1), 2000, replace = TRUE)
piecewise_function <- function(p) {
  n <- length(p)
  case_when(p == 1 ~ g(n),      # a condition, a tilde, and a value
            p == 0 ~ h(n),
            TRUE ~ NA_real_)    # what to do if neither condition is met
}
piecewise_function(p)
Try any of these, where n is the sample size and p is a vector of 0s and 1s of length n:
rnorm(n, 0, 1 * (p == 1) + 2 * (p == 0))
rnorm(n, 0, ifelse(p == 1, 1, 2))
rnorm(n, 0, 1 + !p)
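A quick way to see that the vectorised sd argument does what is intended is to draw a large sample and compare the empirical standard deviations per group. This is only a sketch: the seed, the sample size and the use of rbinom() to generate p are assumptions. Note also that the third argument of rnorm() is a standard deviation, so if the 2 in N(0,2) is meant as a variance, sqrt(2) would be needed instead.

```r
set.seed(42)
n <- 10000
p <- rbinom(n, 1, 0.5)            # assumed: p is 0 or 1 with equal probability

# One draw per element of p, with the sd chosen element-wise.
x <- rnorm(n, mean = 0, sd = ifelse(p == 1, 1, 2))

# Empirical sd within each group; names are "0" and "1".
sd_by_p <- tapply(x, p, sd)
sd_by_p
```

The value for group "1" should be close to 1 and the value for group "0" close to 2.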
Given an lm object, I need to create a function based on its variables, which are represented as a character vector. I have tried to use a combination of eval and expr to create a function f that would then be used in obj and in the nlm optimisation of the latter.
library(tidyverse)
df <- drop_na(airquality)
model <- lm(Ozone~. - Temp, data = df, x=TRUE, y=TRUE)
base_vars <- all.vars(formula(model)[-2])
k <- length(base_vars)
f <- function(base_df, x, y, parms) {
with(base_df, parms[1] +
eval(expr(paste(paste(paste0('parms[', 2:(k+1), ']'), base_vars, sep = '*'), collapse = '+'))) +
log(parms[k+2] * (x - parms[k+3] ^ 2)))
}
obj <- function(parms, y, x) mean((residuals(model) - f(df, x, y, parms))^2)
fit <- with(data, nlm(obj, c(0, 0, 0, 0, 0, 0, 0), y = e, x = x))
But calling f(model$x, df$Temp, model$y, c(0, 0, 0, 0, 0, 0, 0)) results in the following error:
Error in eval(substitute(expr), data, enclos = parent.frame()) :
numeric 'envir' arg not of length one
4.
eval(substitute(expr), data, enclos = parent.frame())
3.
with.default(base_df, parms[1] + eval(expr(paste(paste(paste0("parms[",
2:(k + 1), "]"), base_vars, sep = "*"), collapse = "+"))) +
log(parms[k + 2] * (x - parms[k + 3]^2)))
2.
with(base_df, parms[1] + eval(expr(paste(paste(paste0("parms[",
2:(k + 1), "]"), base_vars, sep = "*"), collapse = "+"))) +
log(parms[k + 2] * (x - parms[k + 3]^2)))
1.
f(model$x, df$Temp, model$y, c(0, 0, 0, 0, 0, 0, 0))
I believe there might be a conflict between the eval environment and the environment implied by the with function, but I can't figure out why. Any ideas how I can create a custom function f for models with varying variables?
Expected output for the f(model$x, df$Temp, model$y, c(0, 0, 0, 0, 0, 0, 0)) would be:
with(base_df, parms[1]+parms[2]*Solar.R+parms[3]*Wind+parms[4]*Temp+parms[5]*Month+
parms[6]*Day+log(parms[7] * (Temp - parms[8] ^ 2)))
but for a different model it could be something like:
with(base_df,
parms[1]+parms[2]*var1+parms[3]*var2+log(parms[4]*(var3-parms[5]^2)))
so the number of variables and parameters is different with every call.
R supports computing on the language, but it should not be your first option. If you do it, it should never involve text processing of code. You don't have a case here where you need to compute on the language. I have no idea how you thought your attempt would work but I don't know the expr function and I refuse to install package tidyverse and its ginormous dependency tree.
Also, you generally should avoid with outside of interactive use. But with is not the problem here.
Here is how I would do this:
df <- airquality[complete.cases(airquality),]
model <- lm(Ozone~. - Temp, data = df)
f <- function(base_df, x, parms) {
m <- model.matrix(model, data = base_df)
k <- ncol(m)
stopifnot(length(parms) == (k + 2L))
#I use exp(parms[k+1]) to ensure a positive value within the log
m %*% parms[seq_len(k)] + log(exp(parms[k + 1L]) * (x - parms[k + 2L] ^ 2))
}
obj <- function(parms, y, x, base_df) mean((residuals(model) - f(base_df, x, parms))^2)
#some x:
x <- rpois(nrow(df), 10)
fit <- nlm(obj, c(0, 0, 0, 0, 0, 0, 0), x = x, base_df = df)
#works
You don't seem to use y and thus I removed it from the code.
Note how I create the design matrix for the linear part (using model.matrix) and use matrix multiplication with the parameters. You also need to ensure that log doesn't return Inf/-Inf/NaN.
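As a small illustration of the design-matrix idea (a sketch, not part of the fit above): multiplying the model matrix by the fitted coefficients reproduces the fitted values of the linear model, which is why the linear part of f needs no hand-built formula. The names m and lin are illustrative.

```r
df <- airquality[complete.cases(airquality), ]
model <- lm(Ozone ~ . - Temp, data = df)

# Design matrix: an intercept column plus one column per predictor.
m <- model.matrix(model, data = df)
colnames(m)

# Linear predictor via matrix multiplication with the coefficients.
lin <- drop(m %*% coef(model))

all.equal(unname(lin), unname(fitted(model)))
```

Because the number of columns comes from the model itself, the same code works unchanged for a model with a different set of predictors.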
I think @Roland gave a good answer covering your actual problem. I am isolating what I think you were specifically asking based on the question title, with no comment on whether it is a good idea or not. It probably isn't in this use case.
But what you were looking for more than likely is eval_tidy() from rlang. I left the :: function notation in just so it's obvious what package is being used here.
Note I fixed a couple things that seemed to be errors in the code. I am also using all ones instead of zeros to test in parms due to the log.
library(rlang)
library(tidyr)
# dropped y since it was an unused argument
f <- function(base_df, x, parms) {
# set an expression to evaluate using parse_expr()
.f <- rlang::parse_expr(paste(paste(paste0('parms[', 2:(k+1), ']'),
base_vars, sep = '*'), collapse = '+'))
# use eval_tidy() with the data mask
y_part1 <- rlang::eval_tidy(.f, data = base_df)
y_part2 <- log(parms[k + 2] * (x - parms[k + 3] ^ 2))
parms[1] + y_part1 + y_part2
}
# using your code
df <- tidyr::drop_na(airquality)
model <- lm(Ozone~. - Temp, data = df, x=TRUE, y=TRUE)
base_vars <- all.vars(formula(model)[-2])
k <- length(base_vars)
# changed to all ones, I think this is what you wanted for length
parms <- rep(1, k + 3)
method_1 <- f(df, df$Temp, parms)
method_2 <- with(df, parms[1]+parms[2]*Solar.R+parms[3]*Wind+parms[4]*Temp+parms[5]*Month+
parms[6]*Day+log(parms[7] * (Temp - parms[8] ^ 2)))
all.equal(method_1, method_2)
# [1] TRUE
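For comparison, the same value can be computed without building or parsing any code, by selecting the columns by name and using matrix multiplication. This is a base R sketch under the same assumptions as above (all-ones parms, Temp as x); base_vars is written out by hand here, and by_matrix and by_hand are illustrative names.

```r
df <- airquality[complete.cases(airquality), ]
base_vars <- c("Solar.R", "Wind", "Temp", "Month", "Day")
k <- length(base_vars)
parms <- rep(1, k + 3)

# Linear part: columns picked by name, weighted by parms[2..k+1].
lin <- drop(as.matrix(df[base_vars]) %*% parms[2:(k + 1)])
by_matrix <- parms[1] + lin + log(parms[k + 2] * (df$Temp - parms[k + 3]^2))

# The hand-written expression from the question.
by_hand <- with(df, parms[1] + parms[2] * Solar.R + parms[3] * Wind +
                    parms[4] * Temp + parms[5] * Month + parms[6] * Day +
                    log(parms[7] * (Temp - parms[8]^2)))

all.equal(unname(by_matrix), unname(by_hand))
```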
I have a data.frame where each row contains the parameter values for a function (i.e. one function per row). I would like to draw these functions. The functions are conditional, should only be drawn for certain values, and have kinks (due to min/max levels). See the example of what I am trying to achieve:
I originally considered using curve() or stat_function (ggplot2-library). But I could not see how it would be possible to draw the curves only for certain values (see a, b, c), without generating a data.frame.
So I created a function that generates the plot data:
N = 10000;
PrisstrukturToPlotdata = function(s){
# Create empty data.frame:
A <- data.frame(Site=rep(s$Site, N), bid=1:N, Saelger=rep(NA, N), Koeber=rep(NA, N), stringsAsFactors=FALSE)
# Fill out the data.frame:
for (i in 1:N) {
# Don't draw below:
if(i > s$Mindste_bud*s$Kurs) {
# First parenthesis is the condition to ensure we are above min, second parenthesis is in-between min and max, and third parenthesis is the condition for above max:
A[i, ]$Saelger = s$Saelger_Fast_salaer*s$Kurs + i*s$Saelger_Andet_pct +
(i*s$Saelger_Variable_salaer <= s$Saelger_Min_variable_salaer*s$Kurs) *
s$Saelger_Min_variable_salaer*s$Kurs +
(i*s$Saelger_Variable_salaer > s$Saelger_Min_variable_salaer | (s$Saelger_Max_variable_salaer != 0 & i*s$Saelger_Variable_salaer < s$Saelger_Max_variable_salaer*s$Kurs)) *
i*s$Saelger_Variable_salaer +
(s$Saelger_Max_variable_salaer != 0 & i*s$Saelger_Variable_salaer >= s$Saelger_Max_variable_salaer*s$Kurs) *
s$Saelger_Max_variable_salaer*s$Kurs;
A[i, ]$Koeber = s$Koeber_Fast_salaer*s$Kurs + i*s$Koeber_Variable_salaer;
}
}
return(A)
}
library(plyr)
Plotdata = adply(Prisstruktur, 1, PrisstrukturToPlotdata, .expand = FALSE)
Conditionality explained: There is a minimum value, below which the curve should not be drawn at all; if(i > s$Mindste_bud*s$Kurs).
Then there is a percentage i*s$Saelger_Variable_salaer with a minimum and a maximum level (to complicate things, not all functions have a max; for those without one the max value is just 0). If the percentage is below the minimum, the minimum level should be used. If the percentage is above the max, then the maximum level should be used. In between, the percentage should be used.
The script above works okay for N=100 or even N=1000, but when I go to N=10000 or above it takes ages to run. I am guessing this is due to all the conditional statements, but I am not sure how to do this in a more efficient manner?
Dummy data:
Site = c('A', 'B', 'C')
Mindste_bud = c(300, 0 , 0)
Saelger_Fast_salaer = c(0, 250, 2)
Saelger_Variable_salaer = c(0.12, 0.16, 0.10)
Saelger_Min_variable_salaer = c(250, 0, 0)
Saelger_Max_variable_salaer = c(0, 0, 250)
Saelger_Andet_pct = c(0, 0, 0)
Koeber_Fast_salaer = c(95, 0, 0)
Koeber_Variable_salaer = c(0.2, 0.25, 0)
Kurs = c(1, 1, 5.430)
Prisstruktur = data.frame(Site, Mindste_bud, Saelger_Fast_salaer, Saelger_Variable_salaer, Saelger_Min_variable_salaer, Saelger_Max_variable_salaer, Saelger_Andet_pct, Koeber_Fast_salaer, Koeber_Variable_salaer, Kurs, stringsAsFactors = FALSE)
You don't need a loop in your function. I doubt that you need all N = 10000 data points to get a nice plot. I've added structure to your code by using more whitespace and some ifelse functions for clarity.
PrisstrukturToPlotdata <- function(s, N = 10000, Length = 101) {
  # Evaluate the curve on a grid of Length points rather than at every integer bid
  n <- seq(s$Mindste_bud * s$Kurs + 1, N, length.out = Length)
  data.frame(
    Bid = n,
    Saelger =
      s$Saelger_Fast_salaer * s$Kurs +
      n * s$Saelger_Andet_pct +
      ifelse(
        n * s$Saelger_Variable_salaer <= s$Saelger_Min_variable_salaer * s$Kurs,
        s$Saelger_Min_variable_salaer * s$Kurs,
        0
      ) +
      ifelse(
        n * s$Saelger_Variable_salaer > s$Saelger_Min_variable_salaer |
          (s$Saelger_Max_variable_salaer != 0 &
            n * s$Saelger_Variable_salaer < s$Saelger_Max_variable_salaer * s$Kurs),
        n * s$Saelger_Variable_salaer,
        0
      ) +
      ifelse(
        s$Saelger_Max_variable_salaer != 0 & n * s$Saelger_Variable_salaer >= s$Saelger_Max_variable_salaer * s$Kurs,
        s$Saelger_Max_variable_salaer * s$Kurs,
        0
      ),
    Koeber = s$Koeber_Fast_salaer * s$Kurs + n * s$Koeber_Variable_salaer
  )
}
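The min/max rule described in the question (use the minimum below it, the maximum above it, the percentage in between) can also be read as a clamp, which pmax() and pmin() express directly. The sketch below follows that prose description and is not a line-for-line translation of the conditions in the code above; the helper clamp_fee, its argument names and the example bids are hypothetical.

```r
# Hypothetical helper: percentage fee clamped between a floor and an
# optional cap. cap_fee == 0 is read as "no cap", as in the question.
clamp_fee <- function(bid, rate, floor_fee, cap_fee) {
  fee <- pmax(bid * rate, floor_fee)          # never below the floor
  if (cap_fee != 0) fee <- pmin(fee, cap_fee) # never above the cap, if any
  fee
}

bids <- c(100, 1000, 3000, 20000)

# Site A from the dummy data: 12% with a floor of 250 and no cap.
fee_a <- clamp_fee(bids, rate = 0.12, floor_fee = 250, cap_fee = 0)

# Site C from the dummy data: 10% with no floor and a cap of 250 * Kurs.
fee_c <- clamp_fee(bids, rate = 0.10, floor_fee = 0, cap_fee = 250 * 5.43)
```

Both calls are vectorised over bid, so no loop over individual bids is needed.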
I'm pretty new to R, and am struggling a bit with it. I have the following code:
repeat {
if (t > 1000)
break
else {
y1 <- rpois(50, 15)
y2 <- rpois(50, 15)
y <- c(y1, y2)
p_0y <- matrix(nrow = max(y) - min(y), ncol = 1)
i = min(y)
while (i <= max(y)) {
p_0y[i - min(y), ] = (length(which(y1 == i))/50)
i <- i + 1
}
p_y <- matrix(nrow = max(y) - min(y), ncol = 1)
j = min(y)
while (j <= max(y)) {
p_y[j - min(y), ] = (length(which(y == j))/100)
j <- j + 1
}
p_0yx <- p_0y[rowSums(p_0y == 0) == 0]
p_yx <- p_y[rowSums(p_0y == 0) == 0]
g = 0
logvect <- matrix(nrow = (length(p_yx)), ncol = 1)
while (g <= (length(p_yx))) {
logvect[g, ] = (p_0yx[g])/(p_yx[g])
g <- g + 1
}
p_0yx %*% (log2(logvect))
print(p_0yx %*% (log2(logvect)))
t <- t + 1
}
}
I am happy with everything up to the last line, but instead of printing the value of p_0yx %*% (log2(logvect)) to the screen I would like to store it in another vector. Any ideas? I have tried doing it in a similar way as in the nested loop, but it doesn't seem to work.
Thanks
The brief answer is to first declare a variable. Put it before everything you've posted here. I'm going to call it temp. It will hold all of the values.
temp <- numeric(1000)
Then, instead of your print line use
temp[t] <- p_0yx %*% log2(logvect)
As an aside, your code is doing some weird things. Look at the first index of p_0y: it is effectively an index to item 0 in that matrix, but R starts indexing at 1. When you set the number of rows in that matrix you use max(y) - min(y). If the max is 10 and the min is 1, then there are only 9 rows. I'm betting you really wanted to add one. Also, your code is very un-R-like with all of the unnecessary while loops. For example, your whole last loop (and the initialization of logvect) can be replaced with:
logvect = (p_0yx)/(p_yx)
But back to the errors.. and some more Rness... could the following code...
p_0y <- matrix(nrow = max(y) - min(y), ncol = 1)
i = min(y)
while (i <= max(y)) {
p_0y[i - min(y), ] = (length(which(y1 == i))/50)
i <- i + 1
}
maybe be replaced more correctly with?
p_0y <- numeric(max(y) - min(y) + 1)
p_0y[sort(unique(y1)) - min(y) + 1] = table(y1)/50
p_0y <- matrix(p_0y, ncol = 1)
(similar rethinking of the rest of your code could eliminate the rest of the loops as well)
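Putting both points together, here is one possible loop-free version of the inner computation with the results stored in a pre-allocated vector. It is a sketch under assumptions: n_rep is kept small for illustration, the names lev, keep and out are made up, and table() on a factor with explicit levels is used so that values that never occur are counted as zero.

```r
set.seed(1)
n_rep <- 5                       # the question uses 1000 repetitions
out <- numeric(n_rep)            # pre-allocated storage for the results

for (t in seq_len(n_rep)) {
  y1 <- rpois(50, 15)
  y2 <- rpois(50, 15)
  y  <- c(y1, y2)

  # Relative frequencies over the common range min(y):max(y).
  lev  <- min(y):max(y)
  p_0y <- as.numeric(table(factor(y1, levels = lev))) / 50
  p_y  <- as.numeric(table(factor(y,  levels = lev))) / 100

  # Keep only values that occur in y1, as in the original code.
  keep <- p_0y > 0
  out[t] <- sum(p_0y[keep] * log2(p_0y[keep] / p_y[keep]))
}
out
```

Since y contains y1, p_y is positive wherever p_0y is, so the ratio inside log2() is always finite.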