Alternative to mapply to select sample - r

I wrote a function using mapply to select samples from a dataset, but is there a faster way to do this without mapply? It is slow, and my real dataset is larger. My goal is to rely more on matrix/vector operations and less on lists.
# A list of sets of data to sample from
bl <- list(list(c(1, 2),c(2, 3), c(3, 4), c(4, 5), c(5, 6), c(6, 7), c(7, 8), c(8, 9)),
list(c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 6), c(5, 6, 7), c(6, 7, 8)),
list(c(1, 2, 3, 4, 5), c(2, 3, 4, 5, 6), c(3, 4, 5, 6, 7), c(4, 5, 6, 7, 8), c(5, 6, 7, 8, 9)))
#Number of elements to be selected
kn <- c(5, 4, 3)
#Total number of elements in each set
nb <- c(8, 6, 5)
# This outputs a list, but I would prefer a matrix
bl_func <- function() mapply(function(x, y, z) {
x[sample.int(y, z, replace = TRUE)]
}, bl, nb, kn, SIMPLIFY = FALSE)
EDIT
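As suggested by @LMc, I tried parallel::mcmapply. It may pay off on large inputs, but on this small example the benchmark below shows the parallel version is much slower than plain mapply: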
mc.cores=parallel::detectCores()-1
bl_func <- function() parallel::mcmapply(function(x, y, z) {
x[sample.int(y, z, replace = TRUE)]
}, bl, nb, kn, SIMPLIFY = FALSE)
bl_func.0 <- function() mapply(function(x, y, z) {
x[sample.int(y, z, replace = TRUE)]
}, bl, nb, kn, SIMPLIFY = FALSE)
library(microbenchmark)
microbenchmark(
para = bl_func(),
nopara = bl_func.0(),
times = 100
)
Unit: microseconds
   expr      min       lq  mean  median    uq   max neval
   para 11601.12 18176.46 19901 20402.4 21872 26457   100
 nopara    37.34    90.86  1275   246.5  1311  9159   100
I am still curious, though, about other ways to speed things up without parallel processing. Any ideas will be appreciated!

Use a tool designed for speed and large datasets, e.g. data.table.
To do this you would need to reshape your data from lists to a data.table, which is a good idea in any case.
Here is an attempt:
require(data.table)
x = lapply(bl, function(x) data.table( t(data.frame(x) ) ) )
x = lapply(x, melt)
for( i in 1:length(x) ) x[[i]][, group := i]
x = rbindlist(x)
Now the original list of lists is structured as a data.table with 3 columns: value contains the actual data, variable gives the position of each value within its vector (V1, V2, ...), and group gives the list ID.
> head(x)
   variable value group
1:       V1     1     1
2:       V1     2     1
3:       V1     3     1
4:       V1     4     1
5:       V1     5     1
6:       V1     6     1
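data.table has a by argument, which lets us sample rows of each subset (.SD) grouped by one or more columns. Note that sample(nb, 1) below draws the sample size at random from nb; replace it with the size you need for each group: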
x[,.SD[ sample( .N, sample(nb,1) , replace = TRUE ) ],by = group ]
   group variable value
1:     1       V2     6
2:     1       V2     5
3:     1       V1     6
4:     1       V1     7
5:     1       V1     3
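If the goal is a matrix per set rather than a list of vectors, one possible sketch in base R is to stack each set's vectors into a matrix once and then sample whole rows by index. This is not the data.table approach above; the helper name sample_rows is made up for illustration, and it assumes all vectors within a set have the same length (true for bl).

```r
bl <- list(
  list(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(5, 6), c(6, 7), c(7, 8), c(8, 9)),
  list(c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 6), c(5, 6, 7), c(6, 7, 8)),
  list(c(1, 2, 3, 4, 5), c(2, 3, 4, 5, 6), c(3, 4, 5, 6, 7),
       c(4, 5, 6, 7, 8), c(5, 6, 7, 8, 9))
)
kn <- c(5, 4, 3)

# Stack each set's vectors into a matrix: one row per vector.
# This is done once, outside the sampling step.
mats <- lapply(bl, function(set) do.call(rbind, set))

# Hypothetical helper: draw k rows with replacement from matrix m
sample_rows <- function(m, k) {
  m[sample.int(nrow(m), k, replace = TRUE), , drop = FALSE]
}

set.seed(1)
res <- Map(sample_rows, mats, kn)
```

Each element of res is a matrix with kn[i] rows, so the repeated work is a single integer-indexed subset per set. Because nrow(m) is used, the separate nb vector is not needed here.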

Related

How to export the frequency tables (categorical data) and descriptive analysis (continuous variables) from a big dataframe from R into excel?

I have this dataframe:
df = data.frame(x = c(1,0,0,0,1,1,1), y = c(2,2,2,2,3,3,2),
z = 1:7, m = c(1,2,3,1,2,3,1) )
df$x = factor(df$x)
df$y = factor(df$y)
df$m = factor(df$m)
I want to extract all the descriptive analysis information of each of these variables into excel in a simple way so I can present the results of my work.
There is little information here and it is unclear how you want to present this, but if you would like summary statistics (tables for categorical variables, median/IQR for continuous ones), you can use lapply across all columns. This returns a list, which you can then manipulate and export as you see fit.
Data
df <- data.frame(x = c(1, 0, 0, 0, 1, 1, 1),
y = c(2, 2, 2, 2, 3, 3, 2),
z = 1:7,
m = c(1, 2, 3, 1, 2, 3, 1))
cats <- c("x", "y", "m")
df[cats] <- lapply(df[cats], as.factor)
Run lapply
results_list <- lapply(df[1:ncol(df)], summary)
Output:
> results_list
$x
0 1
3 4

$y
2 3
5 2

$z
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    1.0     2.5     4.0     4.0     5.5     7.0

$m
1 2 3
3 2 2
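To get this into Excel, one simple option is to flatten the list into a single long table and write it as CSV, which Excel opens directly. The sketch below uses only base R; the column names (variable, statistic, value) and the output file name are assumptions chosen for illustration. A native .xlsx file would need an add-on package such as writexl or openxlsx instead of write.csv.

```r
df <- data.frame(x = c(1, 0, 0, 0, 1, 1, 1),
                 y = c(2, 2, 2, 2, 3, 3, 2),
                 z = 1:7,
                 m = c(1, 2, 3, 1, 2, 3, 1))
cats <- c("x", "y", "m")
df[cats] <- lapply(df[cats], as.factor)
results_list <- lapply(df, summary)

# Turn each summary into rows of (variable, statistic, value)
summary_rows <- lapply(names(results_list), function(nm) {
  s <- results_list[[nm]]
  data.frame(variable = nm, statistic = names(s), value = as.numeric(s))
})
summary_table <- do.call(rbind, summary_rows)

# Hypothetical output location; replace with your own path
out_file <- file.path(tempdir(), "summary_table.csv")
write.csv(summary_table, out_file, row.names = FALSE)
```

For factors the statistic column holds the level and value holds its count; for the numeric column it holds the six-number summary.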

Extract a vector from nested lists in R

I am trying to extract a vector from a nested list based on the value of another variable/element within the same nested list. Hopefully my example will explain what I'm trying to do.
To begin, I have a list of lists like so:
## Create the inner lists
# Inner list 1
listInner1 <- list(
value = c(0.25),
index = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
left = c(),
right = c()
)
listInner1$left$index <- c(1, 2, 3, 4, 5)
listInner1$left$good <- TRUE
listInner1$right$index <- c(6, 7, 8, 8, 10)
listInner1$right$good <- TRUE
# Inner list 2
listInner2 <- list(
value = c(1.5),
index = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
left = c(),
right = c()
)
listInner2$left$index <- c(1, 2, 3)
listInner2$left$good <- TRUE
listInner2$right$index <- c(4, 5, 6, 7, 8, 9, 10)
listInner2$right$good <- TRUE
# Inner list 3
listInner3 <- list(
value = c(0.5),
index = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
left = c(),
right = c()
)
listInner3$left$index <- c(1, 2, 3, 4, 5)
listInner3$right$index <- c( 6, 7, 8, 9, 10)
listInner3$left$left$index <- c(2, 4, 6, 8, 10)
listInner3$left$right$index <- c(1, 3, 5 ,7, 9)
listInner3$left$left$good <- TRUE
listInner3$left$right$good <- TRUE
# put all inner lists into single list object
listMiddle <- list(listInner1, listInner2, listInner3)
# one more list for fun
listMaster <- list(listMiddle)
As you can see, some of the left and right elements of the nested lists contain the element good = TRUE and some don't.
What I'm trying to do is this: if a particular nested list contains the element good = TRUE, extract the element index from that same nested list.
For example, manually creating my desired output for the above example would look something like this:
ans <- list(
index.1 = c(1, 2, 3, 4, 5),
index.2 = c(6, 7, 8, 8, 10),
index.3 = c(1, 2, 3),
index.4 = c(4, 5, 6, 7, 8, 9, 10),
index.5 = c(2, 4, 6, 8, 10),
index.6 = c(1, 3, 5 ,7, 9)
)
The object ans contains all the index vectors that are contained within a nested list that also contains good = TRUE.
Any suggestions as to how I could do this?
Here is an option. We first bind the nested elements into a more approachable format with rrapply. We then get the positions of the 'good' columns, extract the corresponding 'index' elements by looping over both with map2 (subsetting by the TRUE values), transpose the list, keep only the elements with length greater than 0, flatten the list, and set the names (if needed).
library(purrr)
library(rrapply)
library(stringr)
library(dplyr)
out <- rrapply(listMaster, how = 'bind')
i1 <- grep('good', names(out))
map2(out[i1-1], out[i1], `[`) %>%
transpose %>%
map( ~ keep(.x, lengths(.x) > 0)) %>%
flatten %>%
setNames(str_c('index.', seq_along(.)))
Output:
$index.1
[1] 1 2 3 4 5
$index.2
[1] 6 7 8 8 10
$index.3
[1] 1 2 3
$index.4
[1] 4 5 6 7 8 9 10
$index.5
[1] 2 4 6 8 10
$index.6
[1] 1 3 5 7 9
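For comparison, the same idea can be sketched in base R with a small recursive function: walk the list depth-first and, whenever a node has good = TRUE, collect that node's index. This is an alternative to the rrapply pipeline above, not a restatement of it; the function name collect_good and the small tree used here are made up for illustration.

```r
# Hypothetical helper: depth-first walk collecting `index` from every
# sub-list that carries good = TRUE
collect_good <- function(node) {
  if (!is.list(node)) return(list())
  # [[ ]] matches names exactly and returns NULL when the name is absent
  found <- if (isTRUE(node[["good"]])) list(node[["index"]]) else list()
  children <- lapply(node, collect_good)
  # Flatten the children's results by one level and append them
  c(found, unlist(children, recursive = FALSE, use.names = FALSE))
}

# Small illustrative tree (not the listMaster from the question)
tree <- list(
  list(value = 0.25,
       left  = list(index = c(1, 2, 3), good = TRUE),
       right = list(index = c(4, 5))),
  list(left = list(left = list(index = c(2, 4), good = TRUE)))
)

res <- collect_good(tree)
res <- setNames(res, paste0("index.", seq_along(res)))
```

Here res holds c(1, 2, 3) and c(2, 4); the right branch without good = TRUE is skipped. The same call on listMaster should visit the nodes in the order used for ans in the question, since both are depth-first.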

R - removing data table rows based on two values

I have a large data frame (tbl_df) with approximately the following information:
data <- data.frame(Energy = sample(1:200, 100, replace = T), strip1 = sample(1:12, 100, replace = T), strip2 = sample(1:12, 100, replace = T))
It has 3 columns. The first is energy, the second and third are strip numbers (where energy was deposited).
Each strip has a different threshold and these are stored in two numeric arrays, each position in the array is for the corresponding strip number:
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)
These tell me the minimum amount of energy the strip can receive. What I want to be able to do is remove the rows from the data frame where BOTH strips do not have over the required threshold.
As an example, if I have the row:
Energy = 4, strip1 = 2, strip2 = 2
Then I would remove this row as although strip2 has a lower threshold than 4, strip1 has a threshold of 6 and so there isn't enough energy here.
Apologies if this question is worded poorly, I couldn't seem to find anything like it in old questions.
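# Look up each strip's threshold by strip number and compare it with the deposited energy
filter1 <- data$Energy >= threshold_strip1[data$strip1]
filter2 <- data$Energy >= threshold_strip2[data$strip2]
# Keep only rows where both strips meet their threshold
data <- subset(data, filter1 & filter2)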
I'd maybe do...
library(data.table)
setDT(data)
# structure lower-bound rules
threshes = list(threshold_strip1, threshold_strip2)
lbDT = data.table(
strip_loc = rep(seq_along(threshes), lengths(threshes)),
strip_num = unlist(lapply(threshes, seq_along)),
thresh = unlist(threshes)
)
# loop over strip locations (strip1, strip2, etc)
# marking where threshold is not met
data[, keep := TRUE]
lbDT[, {
onexpr = c(sprintf("strip%s==s", strip_loc), "Energy<th")
data[.(s = strip_num, th = thresh), on=onexpr, keep := FALSE]
NULL
}, by=strip_loc]
What about this? Using dplyr:
require(dplyr)
data2 <- data %>%
mutate(
strip1_value = threshold_strip1[strip1],
strip2_value = threshold_strip2[strip2],
to_keep = Energy > strip1_value & Energy > strip2_value
) %>%
filter(to_keep == TRUE)
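As a quick sanity check of the threshold lookup used in these answers, the sketch below runs a few hand-picked rows, including the example row from the question (Energy = 4, strip1 = 2, strip2 = 2). The data frame check is made up for illustration, and using >= rather than > is an assumption about whether an energy exactly at the threshold should be kept.

```r
threshold_strip1 <- c(4, 6, 3, 7, 7, 1, 2, 5, 8, 10, 2, 2)
threshold_strip2 <- c(5, 3, 5, 7, 6, 2, 7, 7, 10, 2, 2, 2)

# Hand-picked rows; the first is the example from the question
check <- data.frame(Energy = c(4, 7, 1),
                    strip1 = c(2, 2, 6),
                    strip2 = c(2, 2, 6))

# Indexing a threshold vector by strip number gives one threshold per row
ok <- check$Energy >= threshold_strip1[check$strip1] &
      check$Energy >= threshold_strip2[check$strip2]
kept <- check[ok, ]
```

Only the second row survives: the first fails strip1's threshold of 6, and the third meets strip1's threshold of 1 but not strip2's threshold of 2.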

For each row, return the column name of the largest value whilst assigning ties to new groups

I have three variables, and I want to create a new variable showing which column has the highest number. Data:
x= c(5, 1, 4, 5, 5, 1, 1)
y= c(1, 2, 4, 5, 1, 4, 1)
z= c(1, 1, 5, 3, 5, 4, 1)
data <-data.frame(x, y, z)
Importantly, if there is a tie I want this to be indicated too, so that:
1 = x is highest
2 = y is highest
3 = z is highest
4 = x and y tie for highest
5 = x and z tie for highest
6 = y and z tie for highest
7 = x, y and z are all equally high
I've tried below, but it doesn't handle the ties correctly.
data$Highest <- apply(data, 1, which.max)
data
PS. The correct new variable that I would like to get from the data above should be:
correct= c(1, 2, 3, 4, 5, 6, 7)
fun <- function(v) {
stopifnot(length(v) == 3L)
if (anyNA(v)) stop("NA values in input")
if (length(unique(v)) == 1L) return(7L)
rk <- rank(v)
if (max(rk) %% 1 == 0L) return(which.max(rk))
test <- rk %% 1 != 0L
if (sum(test) == 2L) return(sum(which(test)) + 1L)
stop("undefined case")
}
apply(data, 1, fun)
#[1] 1 2 3 4 5 6 7
You can do:
library(plyr)
combn2 <- function(x, y) combn(y, x, paste, collapse="")
x = unlist(sapply(1:ncol(data), combn2, names(data)))
vec = alply(data, 1, function(u) which(paste(names(data)[max(u)==u], collapse='')==x))
#unlist(vec)
#1 2 3 4 5 6 7
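A fully vectorized variant is also possible. The sketch below marks which columns equal the row maximum, encodes that pattern as a small integer key (x = 1, y = 2, z = 4), and maps the key to the codes requested in the question. The names is_max, key and lookup are made up for illustration, and it assumes exactly three columns with no NA values.

```r
x <- c(5, 1, 4, 5, 5, 1, 1)
y <- c(1, 2, 4, 5, 1, 4, 1)
z <- c(1, 1, 5, 3, 5, 4, 1)
data <- data.frame(x, y, z)

m <- as.matrix(data)
# Row-wise maximum without apply(); recycling compares it down each column
is_max <- m == do.call(pmax, data)

# Encode the pattern of maxima: x = 1, y = 2, z = 4
key <- is_max[, 1] * 1 + is_max[, 2] * 2 + is_max[, 3] * 4

# Map keys 1..7 (x, y, xy, z, xz, yz, xyz) to the codes from the question
lookup <- c(1, 2, 4, 3, 5, 6, 7)
highest <- lookup[key]
```

For the example data, highest matches the correct vector given in the question.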

extract information from a data frame parametrically (via a menu selection)

I would like to extract information from a data frame parametrically.
That is:
A <- c(3, 10, 20, 30, 40)
B <- c(30, 100, 200, 300, 400)
DF <- data.frame(A, B)
DF[A%in%c(1, 2, 3, 4, 5), ] # it works
# But what if this is the case,
# which comes for example out of a user-menu selection:
m <- "A%in%"
k <- c(1, 2, 3, 4, 5)
# How can we make something like that work:
DF[eval(parse(text=c(m, k))), ]
This works:
DF[eval(parse(text = paste0(m, deparse(k)))), ]
# A B
#1 3 30
However, eval(parse()) should be avoided. Maybe this would be an alternative for you?
x <- "A"
fun <- "%in%"
k <- c(1, 2, 3, 4, 5)
DF[getFunction(fun)(get(x), k), ]
# A B
#1 3 30
Also,
DF[eval(parse(text=paste(m, substitute(k)))),]
or
DF[eval(parse(text=paste(m, quote(k)))),]
or
DF[eval(parse(text=paste(m, "k"))),]
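Another way to avoid eval(parse()) is to keep the menu selection as separate strings and resolve them with match.fun and [[ ]]. This is a minimal sketch; the names col_name and op_name are made up, and it assumes the menu can supply the column and the operator separately rather than as one string like "A%in%".

```r
A <- c(3, 10, 20, 30, 40)
B <- c(30, 100, 200, 300, 400)
DF <- data.frame(A, B)

# Hypothetical menu selections
col_name <- "A"
op_name <- "%in%"
k <- c(1, 2, 3, 4, 5)

# Look up the operator as a function and the column by name
op <- match.fun(op_name)
res <- DF[op(DF[[col_name]], k), ]
```

Using DF[[col_name]] looks the column up inside the data frame itself, so it does not depend on a free-standing vector A existing in the workspace.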
