I have a problem performing row-wise operations with the 'apply' function in R. I want to calculate the distance between two points:
d <- function(x, y){
length <- norm(x - y, type = "2")  # Euclidean (L2) norm of the coordinate difference
as.numeric(length)
}
The coordinates are given by two dataframes:
start <- data.frame(
a = c(7, 5, 17, 1),
b = c(5, 17, 1, 2))
stop <- data.frame(
b = c(5, 17, 1, 2),
c = c(17, 1, 2, 1))
My goal is to calculate successive distances given by the start and stop coordinates. I want it to work like:
d(start[1,], stop[1,])
d(start[2,], stop[2,])
d(start[3,], stop[3,])
etc...
I have tried:
apply(X = start, MARGIN = 1, FUN = d, y = stop)
which brought some strange results. Can you please help me find the proper solution? I know how to perform the operation with dplyr's rowwise() function, but I would like to use base R only.
Can you also explain why I received such strange results with apply()?
As for the strange results: apply(X = start, MARGIN = 1, FUN = d, y = stop) passes each row of start as x but always the entire stop data frame as y, so every call computes a distance against all of stop at once instead of against the matching row. To get pairwise distances, loop over the sequence of row indices and apply d to the corresponding rows:
sapply(seq_len(nrow(start)), function(i) d(start[i,], stop[i,]))
[1] 12.165525 20.000000 16.031220 1.414214
Or, if we want to use apply(), cbind the two data frames into a single one and then subset by indexing inside the function:
apply(cbind(start, stop), 1, FUN = function(x) d(x[1:2], x[3:4]))
[1] 12.165525 20.000000 16.031220 1.414214
Or use dapply() from the collapse package for efficiency:
library(collapse)
dapply(cbind(start, stop), MARGIN = 1, parallel = TRUE,
FUN = function(x) d(x[1:2], x[3:4]))
[1] 12.165525 20.000000 16.031220 1.414214
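For this particular distance, row-wise application can be avoided entirely: d() is just the Euclidean norm of the row differences, so a fully vectorized base R one-liner gives the same result (a small sketch, assuming start and stop always line up row by row and column by column):
# square the element-wise differences, sum within each row, take the root
sqrt(rowSums((start - stop)^2))
[1] 12.165525 20.000000 16.031220  1.414214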
Long time reader, first time poster. I have not found any previous questions about my current problem. I would like to create multiple linear functions which I can later apply to variables. I have a data frame of slopes, df_slope, and a data frame of constants, df_constant.
Dummy data:
df_slope <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 = c(2, 3, 4, 5, 6), var3 = c(-1, 1, 0, -10, 1))
df_constant <- data.frame(var1 = c(3, 4, 6, 7, 9), var2 = c(2, 3, 4, 5, 6), var3 = c(-1, 7, 8, 0, -1))
I would like to construct functions such as
myfunc <- function(slope, constant, trvalue){
result <- trvalue*slope+constant
return(result)}
where the slope and constant values are
slope<- df_slope[i,j]
constant<- df_constant[i,j]
I have tried many ways, for example like this, creating a data frame of functions with a for loop:
myfunc_all<-data.frame()
for(i in 1:5){
for(j in 1:3){
myfunc_all[i,j]<-function (x){ x*df_slope[i,j]+df_constant[i,j] }
full_func[[i]][j]<- func_full
}
}
without success. The slope-constant values are paired up, so that df_slope[i,j] is paired with df_constant[i,j]. The desired end result would be some kind of data frame from which I can call a function by giving it the coordinates, for example like this:
myfunc_all[i,j]
but any form would be great. For example
myfunc_all[2,1]
in our case would be
function (x){ x*2+4}
which I can apply to different x values. I hope my problem is clear.
So you have a slight problem with lazy evaluation and variable scopes when you are using a for loop to build functions (see here for more info). It's a bit safer to use something like mapply which will create closures for you. Try
myfunc_all <- with(expand.grid(1:5, 1:3), mapply(function(i, j) {
  function(x) {
    x * df_slope[i, j] + df_constant[i, j]
  }
}, Var1, Var2))
dim(myfunc_all) <- c(5, 3)
This will create an array-like object. The only difference is that you need to use double brackets to extract a function. For example
myfunc_all[[2,1]](0)
# [1] 4
myfunc_all[[5,3]](0)
# [1] -1
Alternatively, you can write a function that returns a function. That would look like
myfunc_all <- (function(slopes, constants) {
function(i, j)
function(x) x*slopes[i,j]+constants[i,j]
})(df_slope, df_constant)
Then, rather than using brackets, you call the function with parentheses.
myfunc_all(2,1)(0)
# [1] 4
myfunc_all(5,3)(0)
# [1] -1
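For completeness, the for loop itself can also be made safe by forcing the index values inside a helper function, so each closure keeps its own copy of i and j (a sketch of the standard force() pattern, reusing df_slope and df_constant from the question):
make_fn <- function(i, j) {
  force(i); force(j)  # evaluate the promises now, before the loop variables change
  function(x) x * df_slope[i, j] + df_constant[i, j]
}
myfunc_all <- vector("list", 15)
dim(myfunc_all) <- c(5, 3)
for (i in 1:5) for (j in 1:3) myfunc_all[[i, j]] <- make_fn(i, j)
myfunc_all[[2, 1]](0)
# [1] 4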
Another way to sidestep the closure problem is to store functions that take the row index explicitly as an argument, so nothing has to be captured from the loop variable at all:
df_slope <- data.frame(var1 = c(1, 2, 3, 4, 5), var2 = c(2, 3, 4, 5, 6), var3 = c(-1, 1, 0, -10, 1))
df_constant <- data.frame(var1 = c(3, 4, 6, 7, 9), var2 = c(2, 3, 4, 5, 6), var3 = c(-1, 7, 8, 0, -1))
functions <- vector(mode = "list", length = nrow(df_slope))
for (i in 1:nrow(df_slope)) {
  # i is a parameter of the stored function, not a captured loop variable,
  # so lazy evaluation cannot bite here; note that single-bracket df_slope[i]
  # selects the i-th *column*, not the i-th row
  functions[[i]] <- function(i, x) { df_slope[i] * x + df_constant[i] }
}
f <- function(i, x) {
  functions[[i]](i, x)  # pick the i-th stored function and forward i to it
}
f(1, 1:10)
f(3, 5:10)
I have data that is roughly in the following format but very large, broken up into groups by the class and uniqueId variables. Each location is a row-wise (x, y) pair.
df <-
data.frame(
x = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
y = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
class = c(0, 0, 0, 0, 0, 1, 0, 1, 0),
uniqueId = c("1-2-3", "1-2-3", "1-2-3", "1-2-4", "1-2-4", "1-2-4", "1-3-2", "1-3-2", "1-3-2"),
partialId = c("1.2", "1.2", "1.2", "1.2", "1.2", "1.2", "1.3", "1.3", "1.3")
)
The function I am using should go through the data frame and calculate the smallest distance to another object within the same uniqueId but of a different class than the current row. To do this I've broken my data into chunks in the following way.
library(dplyr)
indexes <-
df %>%
select(partialId) %>%
unique()
library(doParallel)
cl <- makePSOCKcluster(24)
registerDoParallel(cl)
j <- 1
class_separation <- c()
starttime <- Sys.time()
while(j <= nrow(indexes)) {
test <- df %>% filter(partialId == indexes$partialId[j])
n <- nrow(test)
vec <- numeric(n)
vec <- foreach(k = 1:n, .combine = 'c', .multicombine = F) %dopar% {
c(
min(
apply(
test[test$uniqueId == test$uniqueId[k] & test$class != test$class[k], c("x","y")],
1,
function(x) dist(rbind(c(test$x[k],test$y[k]), c(x[1], x[2])))
)
)
)
}
class_separation <- c(class_separation, vec)
j <- j + 1
}
endtime <- Sys.time()
stopwatch <- endtime - starttime
closeAllConnections()
registerDoSEQ()
gc()
df <- cbind(df, class_separation)
When handling single plays or small batches, this code seems to operate just fine. However, when handling the full dataset I am getting results that are obviously incorrect. I know there must be a flaw in how I am calculating these distances since there is very little chance the dist() function itself or %dopar% is at fault. I have changed to %do% and my results do not change.
As an example of the discrepancy, comparing the class_separation column from the full data run against the small example above gives wildly different results, and I'm not sure why.
After a day of thinking about this, I found the problem: it is in how I was sending my data to dist().
For example, if we intended to pass
dist(rbind(c(1, 1), c(6, 6)))
dist(rbind(c(1, 1), c(9, 9)))
What we actually pass is dist(rbind(c(1, 1), c(6, 6, 9, 9)))
This is obviously not what I want. I needed both distances, and then to select the minimum between them or add in other conditionals. The way I found to do this was the rdist package.
foreach(i = 1:nrow(df), .combine = 'c', .multicombine = F, .packages = c('tidyverse',
'rdist')) %dopar% {
min(
cdist(
df[df$class != df$class[i] & df$uniqueId == df$uniqueId[i], ] %>% select(x, y),
df %>% select(x, y) %>% slice(i)
)
)
}
For our test data this returns the vector
Inf Inf Inf 2.828427 1.414214 1.414214 1.414214 1.414214 1.414214
Which is exactly what I needed. The first three entries have no class == 1 rows for their uniqueId and therefore return Inf; row 4 is twice as far from row 6 as row 5 is, all sharing the same uniqueId; and row 9 is equidistant from rows 8 and 10. Whether this solution will be sufficiently fast remains to be tested.
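As a sanity check on the test data, the same quantity can be computed in plain base R without foreach or rdist (a sketch; likely too slow for the full data, but useful for verifying results):
min_sep <- sapply(seq_len(nrow(df)), function(i) {
  # rows with the same uniqueId but the other class
  other <- df[df$uniqueId == df$uniqueId[i] & df$class != df$class[i], c("x", "y")]
  if (nrow(other) == 0) return(Inf)  # no other-class rows for this uniqueId
  min(sqrt((other$x - df$x[i])^2 + (other$y - df$y[i])^2))
})
min_sep
# [1] Inf Inf Inf 2.828427 1.414214 1.414214 1.414214 1.414214 1.414214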
I want to perform a simple task in R. I want to call a method on an object which has not been assigned to any variable yet.
Like this:
a <- c(5, 2, 11, 3)
b <- order(a, decreasing = TRUE)[1:floor(0.1 * length(.))]
So I guess I would like to find out what to pass to the length function here. I know that I can perform it like this:
a <- c(5, 2, 11, 3)
b <- order(a, decreasing = TRUE)
b <- b[1:floor(0.1 * length(b))]
But I wanted to make it like I wrote above.
There is, as far as I know, no implemented way that achieves higher efficiency than the base code
a <- c(5, 2, 11, 3)
b <- order(a, decreasing = TRUE)
b[1:floor(0.1 * length(b))]
However, one can achieve something similar to what you are asking using magrittr, dplyr, or similar packages, which allow for piping calls. It would look similar to
a <- c(5, 2, 11, 3)
c <- a %>% order(., decreasing = TRUE) %>% .[1:floor(0.1 * length(.))]
identical(b[1:floor(0.1 * length(b))],c)
[1] TRUE
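Since R 4.1, the native pipe plus an anonymous function achieves the same one-liner in base R without any packages (note the parentheses around the lambda, which the native pipe requires):
a <- c(5, 2, 11, 3)
b <- order(a, decreasing = TRUE) |> (\(x) x[1:floor(0.1 * length(x))])()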
I have some data like so:
a <- c(1, 2, 9, 18, 6, 45)
b <- c(12, 3, 34, 89, 108, 44)
c <- c(0.5, 3.3, 2.4, 5, 13, 2)
df <- data.frame(a, b,c)
I need to lag many variables at once for a very large time-series analysis with dozens of variables, without typing each one out. In short, I would like to create variables a.lag1, b.lag1 and c.lag1 and be able to add them to the original df specified above. I figure the best way to do so is with a custom function, something along the lines of:
lag.fn <- function(x) {
  assign(paste(x, "lag1", sep = "."), lag(x, n = 1L))
  return(assign(paste(x, "lag1", sep = "."), lag(x, n = 1L)))
}
The desired output is:
a.lag1 <- c(NA, 1, 2, 9, 18, 6, 45)
b.lag1 <- c(NA, 12, 3, 34, 89, 108, 44)
c.lag1 <- c(NA, 0.5, 3.3, 2.4, 5, 13, 2)
However, I don't get what I am looking for. Should I change the environment to the global environment? I would like to be able to use cbind to add the results to the original df. Thanks.
Easy using dplyr. Don't call data frames df; it may cause confusion with the function of the same name. I'm using df1.
library(dplyr)
df1 <- df1 %>%
mutate(a.lag1 = lag(a),
b.lag1 = lag(b),
c.lag1 = lag(c))
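With newer dplyr (>= 1.0), across() avoids typing each column by hand (assuming every column of df1 should be lagged):
df1 %>%
  mutate(across(everything(), lag, .names = "{.col}.lag1"))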
Note that the desired lagged vectors in the question are one element longer than the original columns, so they cannot simply be added back to the data frame. What you can do is create a zoo series. Note that the lag specified in lag.zoo can be a vector of lags, as in the second example below.
library(zoo)
z <- merge(a = zoo(a), b = zoo(b), c = zoo(c))
lag(z, -1) # lag all columns
lag(z, 0:-1) # each column and its lag
We can use mutate_all (funs() is deprecated in current dplyr, so a lambda list is used here):
library(dplyr)
df %>%
  mutate_all(list(lag = ~ lag(.)))
If everything else fails, you can use a simple base R function:
my_lag <- function(x, steps = 1) {
c(rep(NA, steps), x[1:(length(x) - steps)])
}
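Applied to every column at once, this adds all the lagged variables to the original data frame in one step. Unlike the length-7 vectors shown in the question, my_lag() keeps each column at its original length, so the result fits straight back into df:
df[paste0(names(df), ".lag1")] <- lapply(df, my_lag)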
I have an R list with numeric vectors of different lengths. Something like this.
l = list(a = c(0, 1, 2), b = c(0, 1), c = c(0, 1, NA), d = c(0, 1, 5))
I want to identify the vectors that have values of 0, 1, or NA and, therefore, can be converted to logical vectors. In the above example, I would identify vectors b and c.
To do this, I am attempting something like this.
is.logical.vector = lapply(l, FUNCTION_NAME)
But I'm not sure what function to use in place of FUNCTION_NAME (that's just a placeholder for illustrative purposes). I need something that can take a vector like allowed = c(0, 1, NA) and ensure that only the values in allowed are represented in the elements of a vector (like those in list l).
Do you know if such a function exists? Alternatively, do you know how I could construct such a function without an explicit for loop? Thank you in advance!
By the sounds of it, you are looking for a combination of all and %in%:
vapply(l, function(z) all(z %in% c(0, 1, NA)), logical(1L))
# a b c d
# FALSE TRUE TRUE FALSE
Alternatively, you can use lapply:
lapply(l, function(z) all(z %in% c(0, 1, NA)))
FYI, as.logical(5) or even as.logical(-5) also evaluate to TRUE, so your condition "therefore, can be converted to logical vectors" doesn't quite match what you actually seem to be asking for :-)
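If the end goal is the conversion itself, the logical flags from vapply() can drive it directly (a small sketch):
ok <- vapply(l, function(z) all(z %in% c(0, 1, NA)), logical(1L))
l[ok] <- lapply(l[ok], as.logical)  # convert only the qualifying vectors
str(l$b)
# logi [1:2] FALSE TRUE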