R function not evaluating properly on h2o dataset

I'm trying to create a function that linearly splines a variable in an h2o dataset, but I can't get h2o to evaluate the function properly.
Here's my initial attempt at an intermediate spline:
df <- data.frame( AGE = sample(1:100, 1e6, replace = TRUE))
df_A.hex <- as.h2o( df, 'df_A.hex' )
simple_spline <- function( x, L, U ) min( max(x-L,0), U-L)
spline_vector <- Vectorize( simple_spline, vectorize.args = 'x', USE.NAMES = FALSE )
df_A.hex[, 'AGE_12_24'] <- spline_vector( df_A.hex[, 'AGE'], 12, 24)
And here is the result:
AGE AGE_12_24
1 9 12
2 7 12
3 33 12
4 84 12
5 86 12
6 25 12
I tried using pmin and pmax, on the assumption that maybe it wasn't vectorizing the columns, but I get the following error:
> simple_spline <- function( x, L, U ) pmin( pmax(x-L,0), U-L)
> df_A.hex[, 'AGE_12_24'] <- simple_spline( df_A.hex[, 'AGE'], 12, 24)
Error in each[change] : invalid subscript type 'environment'
I'm guessing it's because the pmin and pmax aren't implemented in h2o?
I also tried using apply, but also hit an error:
> simple_spline <- function( x, L, U ) min( max(x-L,0), U-L)
> df_A.hex[, 'AGE_12_24'] <- apply( df_A.hex[, 'AGE'], 1, simple_spline, 12, 24)
> [1] "Lookup failed to find min"
Error in .process.stmnt(stmnt, formalz, envs) :
Don't know what to do with statement: min
I could write a function that iteratively overwrites the spline column like so:
df_A.hex[, 'AGE_12_24'] <- df_A.hex[, 'AGE'] - 12
df_A.hex[, 'AGE_12_24'] <- h2o.ifelse( df_A.hex[, 'AGE_12_24'] < 0, 0, df_A.hex[, 'AGE_12_24'] )
df_A.hex[, 'AGE_12_24'] <- h2o.ifelse( df_A.hex[, 'AGE_12_24'] > 12, 12, df_A.hex[, 'AGE_12_24'] )
This gets me my expected result:
AGE AGE_12_24
1 9 0
2 7 0
3 33 12
4 84 12
5 86 12
6 25 12
But it's a fairly ugly way of getting there. I'd like to know what I'm doing wrong and how to have a function pass on the values to the h2o frame.

Unfortunately, you can't pass additional parameters to the H2O R apply() method (I've reported the bug here).
Even if you hardcode the parameters so that apply() runs, it still won't evaluate correctly:
library(h2o)
h2o.init()
df <- data.frame( AGE = c(9,7,33,84,86,25))
df_A.hex <- as.h2o( df, 'df_A.hex' )
L = 12
U = 24
simple_spline <- function(x) { min( max(x-L,0), U-L )}
apply(df_A.hex, 1, simple_spline)
C1
1 -3
2 -5
3 21
4 72
5 74
6 13
Note that the output above is just x - L: the max() and min() calls had no effect. I think your best bet is to use your iterative method, or to experiment with apply() (without passing additional parameters) until you can trust the results you see.
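For reference, the target values can be checked on an ordinary R vector, outside h2o. This is a minimal sketch in base R only; the names age, spline_ref and collapsed are made up for illustration. It also shows that the scalar min()/max() form reduces a plain vector to a single number, which may be related to the column of 12s in the question, although I have not verified how h2o evaluates that expression.

```r
# Plain base R, no h2o: reference values for the spline on an ordinary vector.
age <- c(9, 7, 33, 84, 86, 25)
L <- 12
U <- 24

# Element-wise version: clamp (age - L) to the interval [0, U - L].
spline_ref <- pmin(pmax(age - L, 0), U - L)
print(spline_ref)  # 0 0 12 12 12 12

# Scalar version: max() and min() reduce the whole vector to one number.
collapsed <- min(max(age - L, 0), U - L)
print(collapsed)   # 12
```

The element-wise values match the expected result from the iterative h2o.ifelse approach in the question.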

Related

Atomic value regarding sequence function reinforcement learning

Brief description:
I have a matrix of states and actions (1:25, nrow = 5), and I want to select the next row: wherever I am in a given row, I want all the positions in the row below it. For example, for input 8 the output should be 4 9 14 19 24. I came up with a function, but whenever I run it I get an error in environment_mat$cellnumb: $ operator is invalid for atomic vectors.
Can you maybe help a lad out here?
states <- seq(1,5, by = 1)
actions <- seq(1,5, by = 1)
state_sequence <- cbind(merge(states,states), state = seq(1, length(states)*length(actions)))
environment_mat <- matrix(state_sequence$state, nrow = length(states), ncol= length(actions))
rewards_mat <- matrix(data = c(-100,10,50,16,32,40,-100,80,41,7,50,1,-100,
85,2,16,98,4,-100,8,32,45,95,78,-100), nrow = 5)
environment_mat
nextCells <- function(curCell) {
nexSta <- seq(0, max(states)-1, by = 1)*max(states)+
environment[environment_mat$cellnum == curCell,]$y
return(nexSta)
}
nextCells(24)
As explained above, I tried multiple things, but I cannot come up with a function other than this one.
You may be looking for the modulo operator %%.
nextCells <- function(curCell) {
environment_mat[(which(environment_mat == curCell) - 1L) %% nrow(environment_mat) + 2L,]
}
nextCells(24)
#> [1] 5 10 15 20 25
nextCells(8)
#> [1] 4 9 14 19 24
Or, more simply:
nextCells <- function(curCell) {
environment_mat[(curCell - 1L) %% nrow(environment_mat) + 2L,]
}
nextCells(24)
#> [1] 5 10 15 20 25
nextCells(8)
#> [1] 4 9 14 19 24
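Both versions index one row past the end of the matrix when the cell is already in the last row. A possible guard, as a sketch: next_cells_safe is a hypothetical name, and it assumes the cells are numbered 1..n in column-major order, as in the question's matrix.

```r
# Assumes cells are numbered 1..n column by column, as in the question.
environment_mat <- matrix(1:25, nrow = 5)

# Hypothetical helper: returns the row below cur_cell's row,
# or an empty vector when cur_cell is already in the last row.
next_cells_safe <- function(cur_cell, mat = environment_mat) {
  row_idx <- (cur_cell - 1L) %% nrow(mat) + 1L  # row that holds cur_cell
  if (row_idx == nrow(mat)) {
    return(integer(0))                          # no next row
  }
  mat[row_idx + 1L, ]
}

print(next_cells_safe(8))   # 4 9 14 19 24
print(next_cells_safe(25))  # integer(0)
```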

Centered moving average in R (without using packages)

I have been constructing a function for centered moving average in R (without using any packages), and have encountered a challenge as below:
As you know, the centered moving average includes the concept of incorporating the 'incomplete portions' (i.e. at the beginning and the end of the datapoint). For example, consider below vector p:
p <- c(10,20,30,40,50,60,70,80,90)
In this case, centered moving average that I am interested in looks like this:
x <- ((10+20)/2, (10+20+30)/3, (20+30+40)/3 ..... (70+80+90)/3, (80+90)/2)
To achieve this, I tried a function built on if statements, as below:
wd means window size
mov_avg <- function(p, wd) {
x <- c(0, cumsum(p))
if ((p > p[1])&(p < p[length(p)])) {
neut <- 1:(length(p)-(wd-1))
upper <- neut+(wd-1)
x <- (x[upper]-x[neut])/(upper-neut)
} else if (p==p[1]) {
neut <- 0
upper <- neut+3
x <- (x[upper]-x[neut])/(upper-1-neut)
} else if (p==p[length(p)]) {
upper <-(length(p)+1)
neut <- (length(p)-(wd-2))
x <- (x[upper]-x[neut])/(upper-neut)
}
return(x)
}
Then I entered below line to execute:
mov_avg(p, 3)
I encountered errors as below:
numeric(0)
Warning messages:
1: In if ((p > p[1]) & (p < p[length(p)])) { :
the condition has length > 1 and only the first element will be used
2: In if (p == p[1]) { :
the condition has length > 1 and only the first element will be used
Could someone help me out in making this a working function?
Thank you!
How about something like this in base R:
window <- 3
p <- c(10,20,30,40,50,60,70,80,90)
x <- c(NA, p, NA)
sapply(seq_along(x[-(1:(window - 1))]), function(i)
mean(x[seq(i, i + window - 1)], na.rm = T))
#[1] 15 20 30 40 50 60 70 80 85
The trick is to add flanking NAs and then use mean with na.rm = T.
I know you said "without using packages", but the same is even shorter using zoo::rollapply
library(zoo)
rollapply(c(NA, p, NA), 3, mean, na.rm = T)
#[1] 15 20 30 40 50 60 70 80 85
We could also use rowMeans
rowMeans(embed(c(NA, p, NA), 3)[, 3:1], na.rm = TRUE)
#[1] 15 20 30 40 50 60 70 80 85
Another method is to create a function where we can adjust with variable windows
mov_avg <- function(p, window) {
mean_number = numeric()
index = 1
while(index < length(p)) {
if (index == 1 | index == length(p) - 1)
mean_number = c(mean_number, mean(p[index:(index + window - 2)]))
else
mean_number = c(mean_number, mean(p[index:(index + window - 1)]))
index = index + 1
}
mean_number
}
mov_avg(p, 3)
#[1] 15 30 40 50 60 70 80 85
mov_avg(p, 2)
#[1] 10 25 35 45 55 65 75 80
Take the row means of a three-column matrix whose columns are x itself, x shifted down by one (with the mean of the first two elements prepended), and x shifted up by one (with the mean of the last two elements appended). Here x is the original data vector (p in the question).
apply( matrix( c(x,
c( (x[1]+x[2])/2, head(x,-1) ),
c( tail(x,-1), sum( tail(x,2))/2) ),
ncol = 3),
1, mean)
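The NA-padding idea from the first answer can be wrapped in a function for any odd window size. This is only a sketch: centered_mov_avg is a made-up name, and it assumes the window is odd so that it can be centered on each point.

```r
# Hypothetical helper: centered moving average with shrinking windows at the edges.
# Assumes an odd window so that the window can be centered on each element.
centered_mov_avg <- function(p, window = 3) {
  stopifnot(window >= 1, window %% 2 == 1)
  half <- (window - 1) / 2
  # Pad both ends with NA so every position has a full-width window.
  padded <- c(rep(NA_real_, half), p, rep(NA_real_, half))
  vapply(
    seq_along(p),
    function(i) mean(padded[i:(i + window - 1)], na.rm = TRUE),
    numeric(1)
  )
}

p <- c(10, 20, 30, 40, 50, 60, 70, 80, 90)
print(centered_mov_avg(p, 3))  # 15 20 30 40 50 60 70 80 85
```

Because na.rm = TRUE drops the padding, the edge values are averages over the incomplete windows, which is what the question asks for.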

Trouble applying function to data frame

Toy example:
> myfn = function(a,x){sum(a*x)}
> myfn(a=2, x=c(1,2,3))
[1] 12
Good so far. Now:
> df = data.frame(a=c(4,5))
> df$ans = myfn(a=df$a, x=c(1,2,3))
Warning message:
In a * x : longer object length is not a multiple of shorter object length
> df
a ans
1 4 26
2 5 26
What I want to happen is that for the first row, it is as if I called myfn(a=4, x=c(1,2,3)), giving an answer of 24, and for the second row, it is as if I called myfn(a=5, x=c(1,2,3)), giving an answer of 30. How do I do this? Thank you.
EDIT: slightly more complex version. Now suppose that the function is
myfn = function(a,b, x){sum((a+b)*x)}
and that I have the data frame
df = data.frame(a=c(4,5), b=c(6,7), c=c(9,9))
I want to create df$ans such that, for the first row, it is as if I called myfn(a=4, b=6, x=c(1,2,3)), and for the second row, it is as if I called myfn(a=5, b=7, x=c(1,2,3)); that is, use df$a for a, df$b for b, and ignore df$c.
Something like this would work:
myfn = function(a,x){
return(sum(a*x))
}
df <- data.frame(a=c(4,5))
df$ans <- apply(df, 1, myfn, x = c(1,2,3))
df
a ans
1 4 24
2 5 30
** Edited Based On User Edit **
df = data.frame(a=c(4,5), b=c(6,7), c=c(9,9))
df$ans <- apply(df[, c("a", "b")], 1, function(y) sum((y['a']+y['b'])*c(1,2,3)))
a b c ans
1 4 6 9 60
2 5 7 9 72
There are several ways this can be done, each with its own charms. If you don't want to modify the function, I would just do
mapply(myfn, df$a, df$b, MoreArgs = list(x = 1:3))
Alternatively, you can bake the iteration right into the function, e.g.,
myfn = function(a,b, x){
sapply(a+b, function(ab) {
sum(ab*x)
})
}
myfn(df$a, df$b, 1:3)
That's probably the way I would do it.
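Here is the mapply() route run end to end on the data frame from the edited question, as a small sketch. The column names a, b and c come from that data frame; the shortcut at the end is my own addition and relies on sum((a+b)*x) being equal to (a+b)*sum(x) when a and b are single numbers.

```r
myfn <- function(a, b, x) sum((a + b) * x)
df <- data.frame(a = c(4, 5), b = c(6, 7), c = c(9, 9))

# One call of myfn per row: a and b vary, x is passed unchanged via MoreArgs.
df$ans <- mapply(myfn, df$a, df$b, MoreArgs = list(x = 1:3))
print(df)

# For this particular function the loop can be avoided entirely.
shortcut <- (df$a + df$b) * sum(1:3)
print(shortcut)  # 60 72
```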

Replacing the value in between vectors when there is a defined difference

I have a question about replacing values within a vector.
The algorithm should find the value to replace when a certain condition is met: in this case, the number whose difference from the previous number is -20. So I would prefer to use the diff function.
Here is what I mean
x <- c(20,20,0,20,0,5)
> diff(x)
[1] 0 -20 20 -20 5
So in this case the 0s produce the differences of -20, and I want to change those 0s to 20.
I know the easiest solution is to assign directly, x[3] <- 20 or x[5] <- 20.
However, the location of the 0s is always different, so I need an automated process that can do that. Thanks!
**EDIT
if we need to do this in a grouped data.frame
> df
x gr
1 20 1
2 20 1
3 0 1
4 20 1
5 0 1
6 5 1
7 33 2
8 0 2
9 20 2
10 0 2
11 20 2
12 0 2
How can we implement this ?
modify <- function(x){
value_search = c(0, 33)
value_replacement = c(20, 44)
for (k in 1:length(value_search)) {
index_position = which(x %in% value_search[k])
replacement = value_replacement[k]
for (i in index_position) {
x[i] = replacement
}
}
}
df%>%
group_by(gr)%>%
mutate(modif_x=modify(x))
Error in mutate_impl(.data, dots) :
Evaluation error: 'match' requires vector arguments.
You can do it using which to get the position, i.e.
x[which(diff(x) == -20)+1] <- 20
x
#[1] 20 20 20 20 20 5
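For the grouped data frame in the edit, one option is to apply the same diff() rule within each group using base R's ave(). This is a sketch under the assumption that the rule should restart in every group; fix_drop and x_fixed are hypothetical names.

```r
df <- data.frame(
  x  = c(20, 20, 0, 20, 0, 5, 33, 0, 20, 0, 20, 0),
  gr = rep(1:2, each = 6)
)

# Hypothetical helper: set to 20 every value that sits 20 below its predecessor.
fix_drop <- function(v) {
  v[which(diff(v) == -20) + 1] <- 20
  v
}

# ave() splits x by gr, applies fix_drop to each piece, and reassembles in order.
df$x_fixed <- ave(df$x, df$gr, FUN = fix_drop)
print(df)
```

In group 2 the first 0 stays as it is, because its difference from 33 is -33 rather than -20.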
If you want a generic way to replace values of a vector based on particular values, I would approach it this way.
x = c(20,20,0,20,0,5)
value_search = 0
value_replacement = 20
index_position = which(x %in% value_search)
for (i in index_position) {
x[i] = value_replacement
}
But this only works for a single value. If you want to look for multiple values, you can use a nested loop as below:
x = c(20,20,0,20,0,5,33)
value_search = c(0, 33)
value_replacement = c(20, 44)
for (k in 1:length(value_search)) {
index_position = which(x %in% value_search[k])
replacement = value_replacement[k]
for (i in index_position) {
x[i] = replacement
}
}
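The nested loop can also be written without loops by using match(), a different technique from the one above. A minimal sketch with the same search and replacement values; idx and hit are made-up names.

```r
x <- c(20, 20, 0, 20, 0, 5, 33)
value_search <- c(0, 33)
value_replacement <- c(20, 44)

# match() gives, for each element of x, its position in value_search (or NA).
idx <- match(x, value_search)
hit <- !is.na(idx)

# Replace only the matched elements with the corresponding replacement value.
x[hit] <- value_replacement[idx[hit]]
print(x)  # 20 20 20 20 20 5 44
```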
In response to OP's edits:
There are any number of ways to do this:
x = c(20,20,0,20,0,5,33)
gr = c(1,1,1,1,2,2,2)
df = data.frame(x, gr)
func_replace <- function(source, value_search, value_replacement) {
  # loop over the search values, not over the elements of source
  for (k in seq_along(value_search)) {
    # look for matches in the vector passed in, not in the global x
    index_position = which(source %in% value_search[k])
    source[index_position] = value_replacement[k]
  }
  return(source)
} # func_replace
value_search = c(0, 33)
value_replacement = c(20, 44)
gr_value = 1
# replace only within group gr_value; rows of other groups get NA
df$replacement = with(df, ifelse(gr == gr_value, func_replace(x, value_search, value_replacement), NA))

Take the first unique value from a function

This is my function:
g <- function(x,y){
x <- (x-y):x
y <- 1:30 # ------> (y is always fixed 1:30)
z<- outer(x,y,fv) # ---->(fv is a previous function)
s <- colSums(z)
which(s==max(s),arr.ind=T)
}
It tells me the position of the max value in s. I basically have a problem in choosing y because given a small y, the max(s) appears more than once in s. For example:
#given x=53
> g(53,1)
[1] 13 16 20 22 25 26 27
> g(53,2)
[1] 20 25 26
> g(53,3)
[1] 20 25 26
> g(53,4)
[1] 20 25 26
> g(53,5)
[1] 20 25
> g(53,6)
[1] 25 -----> This is the only result i would like from my function (right y=6)
Another example:
# given x=71
> g(71,1)
[1] 7 9 14
> g(71,2)
[1] 7 14
> g(71,3)
[1] 14 -----> my desired result (right y=3)
Therefore, I would like a function that returns the first unique solution, using the smallest possible y (e.g. g(53) = 25, g(71) = 14, ...). Any help? Thanks
This is a simplified example. I hope it makes the question clearer:
#The idea is the same:
n <- 1:9
e <- rep(n,500)
p<- sample(e) # --->(Need to sample in order to have more max later (mixed matrix)
mat <- matrix(p,90)
g <- function(x,y){
x <- (x-y):x
k <- rowSums(mat[,x])
which(k==max(k), arr.ind=T)
}
#In my sample matrix :
k <- rowSums(mat[,44:45])
which(k==max(k), arr.ind=T)
[1] 44 71 90
#In fact
g(45,1)
[1] 44 71 90 # ---> more than one solution
g(45,2)
[1] 90 # ----> I would like to pick up this value, which is the first unique solution given x=45
Therefore, I would like a function that returns the first unique solution for the smallest possible y given x (in this new example: g(45) = 90, ...).
I got it. It is a bit long, but I think it is right.
Taking the second, simplified example:
g <- function(x,y){
x <- (x-y):x
k <- rowSums(mat[,x])
q <- which(k==max(k), arr.ind=T)
length(q)
}
gv <- Vectorize(g)
l <- function(x){
y<- 1:30 # <- (until 30 to be sure)
z<- outer(x,y,gv)
y <- which.min(z) # <- (min is surely length=1 and which.min takes the first)
x <- (x-y):x
k <- rowSums(mat[,x])
q <- which(k==max(k), arr.ind=T)
q
}
l(45)
[1] 90
It seems like you could just do this with a recursive function. Consider the following:
set.seed(42)
n = 1:9
e = rep(n, 500)
p = sample(e)
mat = matrix(p, 90)
g <- function(x, y=1) {
xv <- (x-y):x
k <- rowSums(mat[, xv])
i <- which(k == max(k), arr.ind=T)
n <- length(i)
if (n == 1) {
return(y) # want to know the min y that solves the problem, right?
} else {
y <- y + 1 # increase y by 1
g(x,y) # run our function again with a new value of y
}
}
You should now be able to run g(45) and get 1 as the result, since that is the value of y that solves the problem, and g(33) to get 2.
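The recursion never stops if no y gives a unique maximum, and x - y can drop below 1. A bounded loop avoids both and can return the row position the question asked for as well as y. This is only a sketch: first_unique is a hypothetical name, the cap of 30 is borrowed from the question, and it assumes x is a valid column index of mat.

```r
set.seed(42)
mat <- matrix(sample(rep(1:9, 500)), 90)  # 90 x 50 matrix, as in the example

# Hypothetical helper: smallest y (up to max_y) for which the row sums over
# columns (x - y):x have a single maximum. Returns NULL when there is none.
first_unique <- function(x, max_y = 30) {
  for (y in seq_len(min(max_y, x - 1))) {
    k <- rowSums(mat[, (x - y):x, drop = FALSE])
    hits <- which(k == max(k))
    if (length(hits) == 1) {
      return(list(y = y, row = hits))
    }
  }
  NULL
}

res <- first_unique(45)
print(res)
```

The exact numbers depend on the random sample, so no expected output is shown here.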
