R - SparkR - referencing variable within `spark.lapply` - r

My goal is to convert some work that uses the usual lapply in R to SparkR::spark.lapply. I'm not sure what the proper way is to pass extra variables into the function. lapply has the ... parameter to pass in extra arguments, but spark.lapply does not.
Here's a toy example, where the list we map over isn't actually used, but it gets the point across. My particular case needs to use a sparse matrix so I'll use that in my example.
library(Matrix)
x <- Matrix(c(1, 0, 0, 2, 3, 4), nrow = 2, sparse = TRUE)
iter <- list(1, 1)
# using lapply
> lapply(iter, function(ii) nrow(x))
[[1]]
[1] 2
[[2]]
[1] 2
# sparkr
> library(SparkR)
> spark.lapply(iter, function(ii) nrow(x))
object 'x' not found
I am running this in Databricks, with R 4.2.2. So, how would I pass in other variables?
Thank you!
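For reference, the two plain-R mechanisms the question alludes to can be sketched as follows. This is only an illustration in base R: a dense matrix stands in for the sparse one, `make_worker` is a hypothetical helper name, and whether `spark.lapply` ships the enclosing environment to the workers on Databricks has not been verified here.

```r
# Base matrix used as a stand-in for the sparse matrix in the question.
x <- matrix(c(1, 0, 0, 2, 3, 4), nrow = 2)
iter <- list(1, 1)

# 1. lapply forwards extra named arguments to the function through `...`.
res_dots <- lapply(iter, function(ii, m) nrow(m), m = x)

# 2. A function factory (hypothetical helper `make_worker`) stores `m` in the
#    returned function's enclosing environment, so no global lookup is needed.
make_worker <- function(m) {
  force(m)  # evaluate the argument now rather than lazily
  function(ii) nrow(m)
}
worker <- make_worker(x)
res_closure <- lapply(iter, worker)
```

The second form only needs a one-argument function, which is the shape `spark.lapply` expects, so it may be worth trying there.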

Related

Use an arrow assignment function as R purrr map2

I have a list of rasters:
filelist <- as.list(c("rasterA", "rasterB")) # name them
rasterlist <- lapply(filelist, function(x) raster::raster(x)) # read them in
And a list of CRSes corresponding to those rasters:
crslist <- as.list(c("crsA.Rds", "crsB.Rds")) # name them
crslist %<>% lapply(function(x) readRDS(x)) # read them in
To fill the @crs slot in a raster you use (e.g.):
raster::crs(rasterA) <- crsA
But I can't work out how to use this arrow assignment in a purrr::map2 call. The example in the R cheat sheet is map2(x, y, sum), but sum works on vectors, is directionless, and contains all of its terms within its brackets.
Reading the map2 help, I tried the formula format:
purrr::map2(rasterlist, crslist, ~ raster::crs(.x) <- .y)
But no dice:
Error in as_mapper(.f, ...) : object '.y' not found
Does anyone know how I'd do this? Currently I'm using a loop which is simple, but I'm trying to force myself to learn mapping and keep hitting these kinds of walls.
for (i in seq_along(rasterlist)) {
crs(rasterlist[[i]]) <- crslist[[i]]
}
Thanks!
[Screenshots not reproduced: starting condition, after the for loop, and after using the raster_assigner function instead.]
Hopefully this is salvageable because I've got other use cases like this. Cheers!
We may need
purrr::map2(rasterlist, crslist, ~ {crs(.x) <- .y;.x})
It is not related to the raster - as the simple example below shows the same error
> map2(list(1, 2, 3), list(3, 4, 5), ~ names(.x) <- .y)
Error in as_mapper(.f, ...) : object '.y' not found
> map2(list(1, 2, 3), list(3, 4, 5), ~ {names(.x) <- .y; .x})
[[1]]
3
1
[[2]]
4
2
[[3]]
5
3
However, this won't raise an error in Map, but the result is not correct: the returned output lacks the names unless we wrap the body in {} and return x.
> Map(function(x, y) names(x) <- y, list(1, 2, 3), list(3, 4, 5))
[[1]]
[1] 3
[[2]]
[1] 4
[[3]]
[1] 5
Could do a dedicated function:
raster_assigner <- function(raster, crs_input){
raster::crs(raster) <- crs_input
raster
}
purrr::map2(rasterlist, crslist, raster_assigner)
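The reason the unbraced formula fails comes down to operator precedence: in R, `~` binds more tightly than `<-`, so `~ names(.x) <- .y` is parsed as an assignment whose left-hand side is a formula, and R then tries to evaluate `.y` immediately. A minimal base-R sketch that inspects the parse tree (no purrr or raster required; the variable names are illustrative):

```r
# Capture both expressions unevaluated and look at the outermost call.
bad  <- quote(~ names(.x) <- .y)
good <- quote(~ { names(.x) <- .y; .x })

outer_bad  <- as.character(bad[[1]])   # the outermost operator is `<-`
outer_good <- as.character(good[[1]])  # the outermost operator is `~`

# Base-R equivalent of the working map2 call: modify, then return the object.
named <- Map(function(x, y) { names(x) <- y; x },
             list(1, 2, 3), list(3, 4, 5))
```

Wrapping the body in braces keeps the whole assignment inside the formula, and ending with `.x` returns the modified object rather than the assigned value.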

how to use apply (or sapply) with columns of matrix or dataframe as function args

I know this is a bonehead newbie question, but I've been trying to figure it out for quite awhile and need some input. Basically, I'm trying to learn how to use the apply family to omit for loops, specifically how to set up the call so that columns of a matrix serve as arguments to the function. I'll use a simple call to the rbinom function as an example.
Example: this for loop works fine. The data are a set of integers and a set of probabilities
success <- rep(-1, times=10) # initialize result var
num <- sample.int(20, 10) # get 10 random integers
p <- runif(10) # get 10 random probabilities
for (i in 1:10) {
success[i]= rbinom(n=1, size=num[i],prob=p[i]) # number successes in 1 trial
}
But how to do the same thing with the apply family? I first put the data into 2 columns of a matrix, thinking that was the right start. However, the following does NOT work, obviously due to my poor understanding of how to set up a call to apply.
myData <- matrix(nrow=10, ncol=2)
myData[,1] <- num
myData[,2] <- p
success <- apply(myData, rbinom, n=1, size=myData[,1], prob=myData[,2])
Any tips are greatly appreciated! I'm coming to R from Fortran, and trying to port over a lot of code that is loaded with DO loops, so I really need to get my head around this.
lapply, sapply, apply only deal with one vector/list at a time. That is, apply will only call its function for one column at a time. What you need is mapply or Map.
myData <- matrix(nrow=10, ncol=2)
myData[,1] <- num
myData[,2] <- p
mapply(rbinom, n = 1, myData[,1], myData[,2])
# [1] 5 4 11 8 3 3 17 8 0 11
Just like lapply returns a list, so does Map; similarly, just like sapply, mapply will return a vector or array if all return values are compatible, otherwise it returns a list as well.
These calls are equivalent:
sapply(1:3, function(z) z + 1)
mapply(function(z) z + 1, 1:3)
but mapply and Map allow an arbitrary number of lists/vectors, so for instance
func <- function(X,Y,Z) X^2+2*Y-Z
Map(func, 1:9, 11:19, 21:29)
## effectively the same as
list(
func(1, 11, 21),
func(2, 12, 22),
func(3, 13, 23),
...,
func(9, 19, 29)
)
The equivalent call of that with sapply for your data would be
sapply(seq_len(nrow(myData)), function(ind) {
rbinom(n = 1, size = myData[ind,1], prob = myData[ind,2])
})
though I personally feel that mapply is easier to read.
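As a small variation on the mapply call above, the varying arguments can be named explicitly and the constant `n = 1` passed through `MoreArgs`, which makes it clearer which inputs are iterated over. The sketch below uses its own seed and a hypothetical data frame `my_df`; it also notes that `rbinom` is itself vectorized over `size` and `prob`, so for this particular function no apply-family call is strictly needed.

```r
set.seed(1)
num <- sample.int(20, 10)  # 10 random integers
p   <- runif(10)           # 10 random probabilities

# Name the vectorized arguments; pass the constant through MoreArgs.
success <- mapply(rbinom, size = num, prob = p, MoreArgs = list(n = 1))

# Same idea when the inputs are columns of a data frame (hypothetical my_df).
my_df <- data.frame(num = num, p = p)
success_df <- with(my_df,
                   mapply(rbinom, size = num, prob = p, MoreArgs = list(n = 1)))

# rbinom is vectorized over size and prob, so one call draws all 10 values.
success_vec <- rbinom(n = length(num), size = num, prob = p)
```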

Faster for loop in R

I want to extract only the rows matched by a particular index from the matrix.
Is there a way to speed up the for loop?
for(x in 1:dim(gene)[1]){
for(y in 1:dim(geno)[1]){
if(grepl(gene[x,2], geno[y], fixed = TRUE)){
geno_gene <- rbind(geno_gene, geno[y,2:dim(geno)[2]])
next
}
}
}
Share a bit of the gene and geno datasets. What is the actual problem? Is it slow? Compared to what?
At the very least your loop is subject to R's copy-on-modify behaviour. Using the address function from pryr, we can see that after we call rbind the object m is a copy of its former self:
library(pryr)
m <- matrix(1:6, nrow = 3, ncol = 2)
address(m)
new_row <- c(8, 9)
m <- rbind(m, new_row)
address(m)
Matrix m changes address after rbind:
> m <- matrix(1:6, nrow = 3, ncol = 2)
> address(m)
[1] "0x18a0d4737f0"
> new_row <- c(8, 9)
> m <- rbind(m, new_row)
> address(m)
[1] "0x18a0fcb7450"
This means that R has created a copy of m, bound the new row to it, assigned the name m to the copy, and left it to the garbage collector to dispose of the original version of m. It does this on every iteration of your loop. This is a well-known reason why R loops can be slow. See Hadley Wickham's writing on the topic for more information.
One potential way out is to assess whether the operation your loop implements can be replaced by a vectorized function in R (and there are many vectorized functions).
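Since the asker's data were never shared, the following is only a sketch on made-up stand-ins for `gene` and `geno` (the column names are assumptions). It shows one way to avoid growing the result with rbind: run one vectorized grepl call per gene, collect the matching row indices, and subset once at the end.

```r
# Hypothetical stand-ins for the asker's data; column names are assumptions.
gene <- data.frame(id = 1:3, name = c("abc", "xyz", "foo"),
                   stringsAsFactors = FALSE)
geno <- data.frame(label = c("g_abc_1", "g_nomatch", "g_foo_2", "abc_again"),
                   v1 = 1:4, v2 = 5:8, stringsAsFactors = FALSE)

# One vectorized grepl call per gene replaces the inner loop over geno rows.
hits <- lapply(gene$name, function(pat) {
  which(grepl(pat, geno$label, fixed = TRUE))
})

# Collect all matching row indices, then subset once instead of calling rbind
# on every match. The first column (the label) is dropped, as in the loop.
idx <- unlist(hits)
geno_gene <- geno[idx, -1, drop = FALSE]
```

The rows come out in the same order the nested loop would produce (outer loop over genes, inner loop over geno rows), but the result is allocated once.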

How to loop over a sliding window of data with lapply?

How can I use lapply() to "loop" over a multi-column dataset and apply a function? Normally, I would use rollapply(), but for reasons that aren't worth going into, the analytics in this case only work with lapply(). I know how to run a function over an expanding window. But how can lapply() be used with a sliding window? For example, here's a toy example in which manually changing the range works with a function I'll call my_fun for a multi-column dataset (dat1):
set.seed(78)
dat1 <- as.data.frame(matrix(rnorm(1000), ncol = 20, nrow = 50))
my_fun <-function(x) {
a <-apply(x,1,mean)
}
test.1 <-my_fun(dat1[1:10])
test.2 <-my_fun(dat1[2:11])
test.3 <-my_fun(dat1[3:12])
Using lapply() for an expanding window works too, i.e., for ranges 1:10, 1:11, 1:12:
test.a <-lapply(seq(10, 12), function(x) my_fun(dat1[1:x]))
My question: is there any way to use lapply to replicate the sliding window analysis via the 3 manual examples above? I've tried several possibilities, using rep() and replicate(), for example, but so far no success. Any insight would be greatly appreciated.
test.a <-lapply(seq(1, 3), function(x) my_fun(dat1[x:(x+9)]))
In fact, it can be done with rollapply like this:
library(zoo)
res <- t(rollapply(t(dat1), 10, function(x) my_fun(t(x)), by.column = FALSE))
# verify that res[, i] equals test.i for i = 1,2,3
all.equal(res[, 1], test.1)
## [1] TRUE
all.equal(res[, 2], test.2)
## [1] TRUE
all.equal(res[, 3], test.3)
## [1] TRUE
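The lapply answer above hard-codes three windows. A minimal generalization, assuming a window width of 10 columns as in the question (`width` and `starts` are illustrative names), computes every valid start position from the number of columns:

```r
set.seed(78)
dat1 <- as.data.frame(matrix(rnorm(1000), ncol = 20, nrow = 50))
my_fun <- function(x) apply(x, 1, mean)

width  <- 10
# Every start column for which a full window of `width` columns still fits.
starts <- seq_len(ncol(dat1) - width + 1)

# One result per window: columns s, s + 1, ..., s + width - 1.
windows <- lapply(starts, function(s) my_fun(dat1[s:(s + width - 1)]))
```

With 20 columns and a width of 10 this yields 11 windows, the first three of which correspond to test.1, test.2, and test.3 in the question.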

How can I use a frame in apply?

I really like using the frame syntax in R. However, if I try to do this with apply, it gives me an error that the input is a vector, not a frame (which is correct). Is there a similar function to mapply which will let me keep using the frame syntax?
df = data.frame(x = 1:5, y = 1:5)
# This works, but is hard to read because you have to remember what's
# in column 1
apply(df, 1, function(row) row[1])
# I'd rather do this, but it gives me an error
apply(df, 1, function(row) row$x)
You can't use $ on an atomic vector, but I guess you want to use it for readability. You can use the [ subsetter with a column name instead.
Here is an example. Please provide a reproducible example next time; questions about R make little sense without data.
set.seed(1234)
gidd <- data.frame(region=sample(letters[1:6],100,rep=T),
wbregion=sample(letters[1:6],100,rep=T),
foodshare=rnorm(100,0,1),
consincPPP05 = runif(100,0,5),
stringsAsFactors=F)
apply(gidd, ## I am applying it in all the grid here!
1,
function(row) {
similarRows = gidd[gidd$wbregion == row['region'] &
gidd$consincPPP05 > .8 * as.numeric(row['consincPPP05']),
]
return(mean(similarRows$foodshare))
})
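Note that with apply I need to convert to a numeric: apply coerces the data frame to a matrix first, and because some columns are character, every value in the row arrives as a character string.
You can also use plyr or data.table for a clean syntax, for example: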
apply(df,1,function(row)row[1]*2)
is equivalent to
ddply(df, 1, summarise, z = x*2)
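To make the coercion point concrete, here is a small base-R sketch (`df2`, `by_name`, and `by_row` are illustrative names). It shows that name-based `[` indexing works inside apply, that a mixed-type data frame reaches the function as character strings, and that iterating over row indices keeps each row as a one-row data frame, so `$` still works.

```r
df <- data.frame(x = 1:5, y = 1:5)

# Inside apply each row is a named atomic vector, so index by name with [.
by_name <- apply(df, 1, function(row) row["x"])

# With a character column present, apply coerces every value to character.
df2 <- data.frame(n = 1:3, s = c("a", "b", "c"), stringsAsFactors = FALSE)
coerced <- apply(df2, 1, function(row) row["n"])

# Iterating over row indices keeps each row as a data frame, so $ is allowed.
by_row <- sapply(seq_len(nrow(df)), function(i) {
  row <- df[i, ]
  row$x + row$y
})
```

The index-based version is slower on large data frames, so it is best seen as a readability trade-off rather than a general replacement for apply.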
