extract information from a data frame parametrically (via a menu selection) - r

I would like to extract information from a data frame parametrically.
That is:
A <- c(3, 10, 20, 30, 40)
B <- c(30, 100, 200, 300, 400)
DF <- data.frame(A, B)
DF[A%in%c(1, 2, 3, 4, 5), ] # it works
# But what if this is the case,
# which comes for example out of a user-menu selection:
m <- "A%in%"
k <- c(1, 2, 3, 4, 5)
# How can we make something like that work:
DF[eval(parse(text=c(m, k))), ]

This works:
DF[eval(parse(text = paste0(m, deparse(k)))), ]
# A B
#1 3 30
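However, eval(parse()) should generally be avoided, since it runs arbitrary text as code and is hard to debug. If the menu can supply the column name and the operator separately, this might be an alternative for you: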
x <- "A"
fun <- "%in%"
k <- c(1, 2, 3, 4, 5)
DF[getFunction(fun)(get(x), k), ]
# A B
#1 3 30
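Note that get(x) above finds the free-standing vector A in the workspace, not the column inside DF. A minimal sketch that looks the column up in DF itself and uses match.fun from base R; the names col, op, keep and res are only illustrative:

```r
A <- c(3, 10, 20, 30, 40)
B <- c(30, 100, 200, 300, 400)
DF <- data.frame(A, B)

col <- "A"               # column name, e.g. from the menu (illustrative)
op  <- "%in%"            # operator name, e.g. from the menu (illustrative)
k   <- c(1, 2, 3, 4, 5)

# Look up the operator by name and apply it to the column taken from DF
keep <- match.fun(op)(DF[[col]], k)
res <- DF[keep, ]
res
```

This keeps working if A only exists as a column of DF, and the same pattern should apply to other binary operators such as ">" or "==".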

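Also, any of the following work, because k is kept as a name in the parsed text and is only looked up when the expression is evaluated: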
DF[eval(parse(text=paste(m, substitute(k)))),]
or
DF[eval(parse(text=paste(m, quote(k)))),]
or
DF[eval(parse(text=paste(m, "k"))),]

Related

Return dataframe modified by function in R

I wrote a function that searches for outliers in each row of a data frame. What I'd like to get at the end is the modified data frame, with the new column x$outliers_numb, as a return value rather than just printed output. I added return() at the end but it doesn't seem to work. Any ideas?
library(dplyr)  # provides %>% and mutate()
outliers <- function(x, s, e){
  # x = dataframe
  # s = index of first col to take
  # e = index of last column to take
  p <- x
  for(i in s:e){
    Q1 <- quantile(p[,i], 0.25, names = FALSE)
    Q3 <- quantile(p[,i], 0.75, names = FALSE)
    iqr <- IQR(p[,i])
    low <- Q1 - iqr*1.5
    up <- Q3 + iqr*1.5
    # TRUE where the value lies outside the 1.5 * IQR fences
    p[,i] <- ((p[,i] < low) | (p[,i] > up))
  }
  # number of flagged columns per row
  p <- p %>% mutate(outliers_numb = rowSums(p[,s:e]))
  x$outliers_numb <- p$outliers_numb
  return(x)
}
#example
w <- data.frame(col1 = c(1, 2, 3, 4, 5, 90, 6),
col2 = c(13, 60, 13, 18, 13, 12, 0),
col3 = c(1, 899, 5, 4, 3, 8, 6))
outliers(w, 1, 3)
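The function already returns the modified data frame; calling it at the top level only prints that result and leaves w unchanged. Just assign the return value to a new variable: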
dataframe_to_reus <- outliers(w, 1, 3)
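To illustrate that the result persists once it is assigned, here is a sketch of the same computation in base R only (no dplyr). The name count_outliers is hypothetical, and the fences are the same 1.5 * IQR rule as in the question:

```r
# Hypothetical base-R variant of the function from the question
count_outliers <- function(x, s, e) {
  # One logical column per input column: TRUE where the value is outside the fences
  flags <- vapply(x[s:e], function(col) {
    q <- quantile(col, c(0.25, 0.75), names = FALSE)
    fence <- 1.5 * (q[2] - q[1])
    col < q[1] - fence | col > q[2] + fence
  }, logical(nrow(x)))
  x$outliers_numb <- rowSums(flags)  # number of flagged columns per row
  x
}

w <- data.frame(col1 = c(1, 2, 3, 4, 5, 90, 6),
                col2 = c(13, 60, 13, 18, 13, 12, 0),
                col3 = c(1, 899, 5, 4, 3, 8, 6))

result <- count_outliers(w, 1, 3)  # w itself is not modified
result$outliers_numb
# 0 2 0 0 0 1 1
```

R functions work on a copy of their arguments, so the new column only exists in the returned object, which is why the assignment is needed.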

Alternative to mapply to select sample

I wrote a function using mapply to select samples from a dataset, but is there a faster way that avoids mapply? It is slow and I have a larger dataset. My goal is to use more matrix/vector operations and fewer list operations.
#A list of a set of data to be selected
bl <- list(list(c(1, 2),c(2, 3), c(3, 4), c(4, 5), c(5, 6), c(6, 7), c(7, 8), c(8, 9)),
list(c(1, 2, 3), c(2, 3, 4), c(3, 4, 5), c(4, 5, 6), c(5, 6, 7), c(6, 7, 8)),
list(c(1, 2, 3, 4, 5), c(2, 3, 4, 5, 6), c(3, 4, 5, 6, 7), c(4, 5, 6, 7, 8), c(5, 6, 7, 8, 9)))
#Number of elements to be selected
kn <- c(5, 4, 3)
#Total number of elements in each set
nb <- c(8, 6, 5)
#This output a list but preferably I would like a matrix
bl_func <- function() mapply(function(x, y, z) {
x[sample.int(y, z, replace = TRUE)]
}, bl, nb, kn, SIMPLIFY = FALSE)
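EDIT
As suggested by @LMc, I tried parallel::mcmapply. Note that in the timings below the parallel version (para) is slower than plain mapply (nopara) on this small example, presumably because of the overhead of starting the worker processes: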
mc.cores=parallel::detectCores()-1
bl_func <- function() parallel::mcmapply(function(x, y, z) {
x[sample.int(y, z, replace = TRUE)]
}, bl, nb, kn, SIMPLIFY = FALSE)
bl_func.0 <- function() mapply(function(x, y, z) {
x[sample.int(y, z, replace = TRUE)]
}, bl, nb, kn, SIMPLIFY = FALSE)
library(microbenchmark)
microbenchmark(
para = bl_func(),
nopara = bl_func.0(),
times = 100
)
Unit: microseconds
   expr      min       lq  mean  median    uq   max neval
   para 11601.12 18176.46 19901 20402.4 21872 26457   100
 nopara    37.34    90.86  1275   246.5  1311  9159   100
I am still curious, though, about other ways to speed things up without resorting to parallel processing. Any ideas will be appreciated!
Use a tool designed for speed and large datasets, e.g. data.table.
To do this you need to reshape your data from nested lists into a data.table, which is a good idea in any case.
Here is an attempt:
require(data.table)
x = lapply(bl, function(x) data.table( t(data.frame(x) ) ) )
x = lapply(x, melt)
for( i in 1:length(x) ) x[[i]][, group := i]
x = rbindlist(x)
Now the original list of lists is structured in a data.table with 3 columns: the value containing the actual data, the variable defining the vectors within each list and the group defining the list ID.
> head(x)
variable value group
1: V1 1 1
2: V1 2 1
3: V1 3 1
4: V1 4 1
5: V1 5 1
6: V1 6 1
data.table has a by argument, which means we can sample rows of each subset (.SD) grouped by one or several columns in the data.table, like this:
x[,.SD[ sample( .N, sample(nb,1) , replace = TRUE ) ],by = group ]
group variable value
1: 1 V2 6
2: 1 V2 5
3: 1 V1 6
4: 1 V1 7
5: 1 V1 3
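If a matrix per group is preferred and data.table is not an option, a base-R sketch along these lines might help. It assumes every vector within a group has the same length (as in the question), and the helper name sample_rows is hypothetical:

```r
# Rebuild the question's data: bl[[1]] holds c(1, 2), ..., c(8, 9), and so on
bl <- list(lapply(1:8, function(i) i + 0:1),
           lapply(1:6, function(i) i + 0:2),
           lapply(1:5, function(i) i + 0:4))
kn <- c(5, 4, 3)  # number of vectors to draw per group
nb <- c(8, 6, 5)  # number of vectors available per group

# Hypothetical helper: bind one group's vectors into a matrix (one row per
# vector), then draw k of its n rows with replacement in a single indexing step
sample_rows <- function(vecs, n, k) {
  m <- do.call(rbind, vecs)
  m[sample.int(n, k, replace = TRUE), , drop = FALSE]
}

res <- Map(sample_rows, bl, nb, kn)  # list with one matrix per group
lapply(res, dim)
```

This still iterates over the groups, but within each group the selection is a single matrix subset. For repeated sampling, building the matrices once outside the function would avoid redoing the rbind step on every call.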

Unpack a list by duplicating elements longer than 1

I have the following list that I wish to unpack (aka expand) using only base R.
For example, I want to turn this:
b <- list(a = c(1, 2), b = 1, d = c(5, 7))
into the equivalent of:
list(a = 1, a = 2, b = 1, d = 5, d = 7)
I have this function that works if only one named element has length > 1 but not if there are multiple elements:
expand_list <- function(listx){
long_elements <- as.numeric(which(lapply(listx, length) > 1))
short_elements <- as.numeric(which(lapply(listx, length) == 1))
res <- lapply(long_elements, function(x){
as.list(setNames(listx[[x]], rep(names(listx)[x], length(listx[[x]]))))
})
expanded_elements <- res[[1]]
c(listx[short_elements], expanded_elements)
}
expand_list(b)
You can use stack followed by setNames to achieve that (the example adds an extra element c to show that several short elements are handled too):
y <- list(a = c(1, 2), b = 1, c = 2, d = c(5, 7))
x <- stack(y)
as.list(setNames(x$values, x$ind))
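An equivalent sketch without stack, using unlist and lengths from base R. It assumes all elements are atomic vectors of a common type, since unlist coerces them to one type; the name expanded is only illustrative:

```r
b <- list(a = c(1, 2), b = 1, d = c(5, 7))

# Flatten the values, then repeat each name once per value it contributed
expanded <- setNames(as.list(unlist(b, use.names = FALSE)),
                     rep(names(b), lengths(b)))

identical(expanded, list(a = 1, a = 2, b = 1, d = 5, d = 7))
# TRUE
```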

Creating a data frame from nrow results of different vectors/data frames

I have some vectors in different data.frames. I want to count the number of observations in each of them and make a list out of it. The first column should hold the data frame names and the second column the number of observations in each data frame. A minimal example could be:
x <- c(1, 3, 4, 5, 6)
x1 <- data.frame(x)
y <- c(3, 9)
y1 <- data.frame(y)
z <- c(23, 43, 23, 12, 1, 3, 7,8,9)
z1 <- data.frame(z)
a <- nrow(x1)
b <- nrow(y1)
c <- nrow(z1)
d <- c(a, b, c)
e <- data.frame(d)
e
The output e looks like this,
> e
d
1 5
2 2
3 9
However, I want that in this way,
> e
df.name nobs
1 x1 5
2 y1 2
3 z1 9
Any help would be greatly appreciated.
Is this what you want?
x <- c(1, 3, 4, 5, 6)
x1 <- data.frame(x)
y <- c(3, 9)
y1 <- data.frame(y)
z <- c(23, 43, 23, 12, 1, 3, 7,8,9)
z1 <- data.frame(z)
library(purrr)
targlist <- list(x1,y1,z1)
data.frame(
names=unlist(map(targlist,names)),
nobs=unlist(map(targlist,nrow))
)
If any data frame has more than one column, names() returns several values for it and the names column will no longer line up with nobs. In that case you may want names=paste0("x",1:length(targlist)) instead. But this works for your example.
We can do this with base R
stack(lapply(mget(c("x1", "y1", "z1")), nrow))[2:1]
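If the column names from the question (df.name and nobs) matter, a small base-R sketch that builds the table from a named list could look like this. The list name dfs is an assumption for illustration:

```r
x1 <- data.frame(x = c(1, 3, 4, 5, 6))
y1 <- data.frame(y = c(3, 9))
z1 <- data.frame(z = c(23, 43, 23, 12, 1, 3, 7, 8, 9))

# Collect the data frames in a named list so the names travel with the data
dfs <- list(x1 = x1, y1 = y1, z1 = z1)

e <- data.frame(df.name = names(dfs),
                nobs = vapply(dfs, nrow, integer(1)),
                row.names = NULL)
e
```

Because the names come from the list rather than from the column names, this does not depend on each data frame having exactly one column.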

Divide vector with grouping vector

I have two vectors which I would like to combine into one data frame. The values of one vector (value) need to be split across two columns. The second vector, nc, gives the number of values for each observation. If nc is 1, only one value is given in value (which goes into val1) and 999 is to be written in the second column (val2).
What is an idiomatic R way to split the vector value and populate the two columns of df? I suspect I'm missing something very obvious, but I can't make progress at the moment... Many thanks!
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
# result by hand
df <- data.frame(nc = nc,
val1 = c(6, 3, 4, 1, 2, 2, 6, 5, 6, 5),
val2 = c(999, 5, 999, 6, 1, 999, 6, 4, 4, 999))
Here's an approach based on this answer:
set.seed(123)
nc <- sample(1:2, 10, replace = TRUE)
value <- sample(1:6, sum(nc), replace = TRUE)
splitUsing <- function(x, pos) {
unname(split(x, cumsum(seq_along(x) %in% cumsum(replace(pos, 1, pos[1] + 1)))))
}
combineValues <- function(vals, nums) {
mydf <- data.frame(cbind(nums, do.call(rbind, splitUsing(vals, nums))))
mydf$V3[mydf$nums == 1] <- 999
return(mydf)
}
df <- combineValues(value, nc)
I think this is what you are looking for. I'm not sure it is the fastest way, but it should do the trick.
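df <- data.frame(nc = nc, val1 = NA, val2 = NA)  # pre-allocate the result
count <- 0  # position in value of the last element used so far
for (i in seq_along(nc)) {
  count <- count + nc[i]
  if (nc[i] == 1) {
    df$val1[i] <- value[count]
    df$val2[i] <- 999
  } else {
    df$val1[i] <- value[count - 1]
    df$val2[i] <- value[count]
  }
}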
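The same bookkeeping can also be done without a loop. The sketch below assumes nc only contains 1 or 2, as in the question, and uses a small fixed input instead of the seeded sample so that the result does not depend on the R version; the name first is illustrative:

```r
nc    <- c(1, 2, 1, 2)
value <- c(6, 3, 5, 4, 1, 6)

# Position in value of the first element belonging to each observation
first <- cumsum(nc) - nc + 1

df <- data.frame(nc   = nc,
                 val1 = value[first],
                 # second element where there is one, 999 otherwise
                 val2 = ifelse(nc == 2, value[first + 1], 999))
df
#   nc val1 val2
# 1  1    6  999
# 2  2    3    5
# 3  1    4  999
# 4  2    1    6
```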
