dplyr::mutate comparing each value to vector, collapsing with any/all

I have a dataset of true values (location) that I'm attempting to compare to a vector of estimated values using dplyr. My code below runs but produces a warning and an incorrect result. How do I compare each value of data$location to every value of est.locations and collapse the resulting vector of comparisons to TRUE if all of them are greater than 20?
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
  mutate(false.neg = all(abs(location - est.locations) > 20))
num location false.neg
1 1 453.4281 FALSE
2 2 454.4260 FALSE
3 3 718.0420 FALSE
4 4 801.2217 FALSE
5 5 802.7981 FALSE
6 6 854.2148 FALSE
7 7 873.6085 FALSE
8 8 901.0217 FALSE
9 9 1032.8321 FALSE
10 10 1240.3547 FALSE
Warning message:
In c(...) :
longer object length is not a multiple of shorter object length
The context of the question is dplyr, but I'm open to other suggestions that may be faster. This is a piece of a larger calculation I'm doing on birth-death mcmc chains for 3000 iterations * 200 datasets. (i.e. repeated many times and the number of locations will be different among datasets and for each iteration.)
UPDATE (10/13/15):
I'm going to mark akrun's solution as the answer. A linear algebra approach is a natural fit for this problem and with a little tweaking this will work for calculating both FNR and FPR (FNR should need an (l)apply by iteration, FPR should be one large vector/matrix operation).
JohannesNE's solution points out the issue with my initial approach: all() collapses the comparisons for every row into a single value, when I intended the operation to be row-wise. This also leads me to think there is likely a dplyr solution using rowwise() and do(), as sketched below.
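A minimal sketch of that idea (assuming the data and est.locations from the original example; with current dplyr, rowwise() alone is enough and do() isn't needed):
data %>%
  rowwise() %>%
  mutate(false.neg = all(abs(location - est.locations) > 20)) %>%
  ungroup()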
I attempted to limit the scope of the question in my initial post. But for added context, the full problem is on a Bayesian mixture model with an unknown number of components, where the components are defined by a 1D point process. Estimation results in a 'random effects' chain similar in structure to the version of est.locations below. The length mismatch is a result of having to estimate the number of components.
## Clarification of problem
options("max.print" = 100)
set.seed(1)
# True values (number of items and their location)
true.locations <-
  data.frame("num" = 1:10,
             "location" = runif(10, 0, 1500) %>% sort)
# Mcmc chain of item-specific values ('random effects')
iteration <<- 0
est.locations <-
  lapply(sample(10:14, 3000, replace = T), function(x) {
    iteration <<- iteration + 1
    total.items <- rep(x, x)
    num <- 1:x
    location <- runif(x, 0, 1500) %>% sort
    data.frame(iteration, total.items, num, location)
  }) %>% do.call(rbind, .)
print(est.locations)
iteration total.items num location
1 1 11 1 53.92243818
2 1 11 2 122.43662006
3 1 11 3 203.87297671
4 1 11 4 641.70211495
5 1 11 5 688.19477968
6 1 11 6 1055.40283048
7 1 11 7 1096.11595818
8 1 11 8 1210.26744065
9 1 11 9 1220.61185888
10 1 11 10 1362.16553219
11 1 11 11 1399.02227302
12 2 10 1 160.55916378
13 2 10 2 169.66834129
14 2 10 3 212.44257723
15 2 10 4 228.42561489
16 2 10 5 429.22830291
17 2 10 6 540.42659572
18 2 10 7 594.58339156
19 2 10 8 610.53964624
20 2 10 9 741.62600969
21 2 10 10 871.51458277
22 3 13 1 10.88957267
23 3 13 2 42.66629869
24 3 13 3 421.77297967
25 3 13 4 429.95036650
[ reached getOption("max.print") -- omitted 35847 rows ]

You can use sapply (here inside mutate, though this doesn't really take advantage of what mutate offers).
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
  mutate(false.neg = sapply(location, function(x) {
    all(abs(x - est.locations) > 20)
  }))
num location false.neg
1 1 92.67941 TRUE
2 2 302.52290 FALSE
3 3 398.26299 TRUE
4 4 558.18585 FALSE
5 5 859.28005 TRUE
6 6 943.67107 TRUE
7 7 991.19669 TRUE
8 8 1347.58453 TRUE
9 9 1362.31168 TRUE
10 10 1417.01290 FALSE
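If you want the return type checked, vapply is a drop-in replacement for sapply here (same output, assuming the same data; logical(1) declares that each call must return a single logical):
data %>%
  mutate(false.neg = vapply(location, function(x) {
    all(abs(x - est.locations) > 20)
  }, logical(1)))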

We can use outer for this kind of comparison. We get every pairwise difference between 'location' and 'est.locations', take the abs, compare with 20, negate (!), take rowSums, and negate again, so that the result is TRUE exactly when all the elements in a row are greater than 20.
data$false.neg <- !rowSums(!(abs(outer(data$location, est.locations, FUN = "-")) > 20))
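A hedged sketch of the per-iteration FNR calculation mentioned in the update (assuming true.locations and the long-format est.locations chain from the clarification above): the same outer() comparison can be applied per iteration with split() and lapply().
fn.by.iter <- lapply(split(est.locations$location, est.locations$iteration),
  function(est) {
    # TRUE when no estimate falls within 20 of a true location (a false negative)
    !rowSums(abs(outer(true.locations$location, est, FUN = "-")) <= 20)
  })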

Related

Generate list of datasets with randomly selected features in R

I have a dataset with 20 features. I wish to create a list of datasets with random subsets of features from the original data set.
For example - [dataset[, c(1,3,4)], dataset[, c(2,3,5,11,20)]].
I am trying the following code for the same
selectors = array(runif(2000), dim = c(100, 20, 1))
list_datasets = vector("list", 100)
i = 1
while (i < 100) {
  list_datasets[[i]] = dataset[, selectors[i, , 1] > 0.5]
  i = i + 1
}
Here, my dataset has 20 features and I need 100 datasets, each with a random subset of those 20 features. So I created the selectors array by first generating a vector of 2000 random values and then reshaping it to 100 x 20. In the while loop, I then add a feature to list_datasets[[i]] only if the random value generated for it is greater than 0.5. Hopefully I was able to explain myself.
But this is very slow. I am new to R and want to know what is the best way to achieve what I am trying to do.
I'm not totally sure if I understand your setup, so correct me if I missed something. My understanding is that you have a dataset (I created a fake dataset of size 100 rows x 20 features) and want to create 100 new datasets using a random subset of features. You generate a random subset of features by generating random uniform values and checking if each is > 0.5.
I have two options here, one using lapply and one using a for loop.
The apply functions are generally faster than loops (and I think you want to use a for loop, not a while loop here anyway).
Other changes:
1) Use a boolean mask as #Krash suggested, since you can check if each value is greater than 0.5 outside the loop, as whether or not it is does not depend on i.
2) selectors can be 2d
set.seed(123)
# Original dataset: assume it's 100 x 20 features
dataset <- array(rnorm(2000), dim = c(100, 20))
## Original (Option 0: while loop)
system.time({
  # Select features: 100 x 20 x 1 (one row per dataset)
  selectors = array(runif(2000), dim = c(100, 20, 1))
  # Initialize list
  list_datasets = vector("list", 100)
  # Fill in list
  i = 1
  while (i < 100) {
    list_datasets[[i]] = dataset[, selectors[i, , 1] > 0.5]
    i = i + 1  # This causes an off-by-one error, as list_datasets[[100]] is never filled in
  }
})
## user system elapsed
## 0.006 0.000 0.006
# Option 1: for loop
system.time({
  # Select: boolean mask: 100 x 20 (need one row to create each dataset)
  selectors = array(runif(2000), dim = c(100, 20))
  selectors = selectors < 0.5
  # Initialize list
  list_datasets = vector("list", 100)
  # Fill in list
  for (i in 1:100) {
    list_datasets[[i]] = dataset[, selectors[i, ]]
  }
})
## user system elapsed
## 0.004 0.000 0.005
# Option 2: lapply
system.time({
  # Select: boolean mask: 100 x 20 (need one row to create each dataset)
  selectors = array(runif(2000), dim = c(100, 20))
  selectors = selectors < 0.5
  # Fill in list
  list_datasets <- lapply(1:100, FUN = function(x) dataset[, selectors[x, ]])
})
## user system elapsed
## 0.003 0.000 0.003
Obviously the time these statements take varies from run to run, but hopefully some of these suggested changes will improve the speed.
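Timings this short are mostly noise. For a steadier comparison you could repeat each option many times; a sketch, assuming the microbenchmark package is installed:
library(microbenchmark)
microbenchmark(
  for_loop = for (i in 1:100) list_datasets[[i]] <- dataset[, selectors[i, ]],
  lapply   = lapply(1:100, function(x) dataset[, selectors[x, ]]),
  times = 50  # run each expression 50 times and summarize
)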
Just as a check to make sure the code did what I wanted it to do:
# Check number of cols per dataset
list_datasets %>%
  purrr::map_int(~ncol(.))
## [1] 8 7 9 12 11 13 11 10 10 14 14 7 8 10 10 9 14 10 6 11 13 8 7 8 10 12 9 11 9 9 13
## [32] 12 8 14 11 11 8 10 11 8 10 13 12 10 6 10 10 12 9 9 10 11 7 8 11 9 11 9 7 9 9 11
## [63] 14 9 9 9 9 13 13 14 12 9 10 9 12 8 11 14 9 7 12 7 6 11 11 7 9 8 12 10 12 9 11
## [94] 13 12 16 9 8 11 10
Other thoughts: rather than creating a selectors array of random uniforms with each row corresponding to a new dataset, you could add a line like this inside your loop (or inside the lapply FUN).
include_feature <- sample(0:1, size = 20, replace = TRUE)
include_feature
## [1] 0 0 1 0 0 0 1 1 1 0 1 1 1 0 1 0 0 0 0 1
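Putting that together, a minimal sketch of this sampling approach (drawing TRUE/FALSE directly so the mask can index columns without conversion):
list_datasets <- lapply(1:100, function(x) {
  include_feature <- sample(c(TRUE, FALSE), size = 20, replace = TRUE)
  dataset[, include_feature]  # keep only the sampled features
})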

calculating column sum for certain row

I am trying to compute, for each row, the sum of that row and the previous 5 rows of a column, in R, using the following code:
df <- data.frame(count = 1:10)
for (loop in 1:nrow(df)) {
  df[loop, "acc_sum"] <- sum(df[max(1, loop - 5):loop, "count"])
}
But I don't like the explicit loop here; how can I modify it? Thanks.
According to your question, your desired result is:
df
# count acc_sum
# 1 1 1
# 2 2 3
# 3 3 6
# 4 4 10
# 5 5 15
# 6 6 21
# 7 7 27
# 8 8 33
# 9 9 39
# 10 10 45
This can be done like this:
df <- data.frame(count=1:10)
library(zoo)
df$acc_sum <- rev(rollapply(rev(df$count), 6, sum, partial = TRUE, align = "left"))
To obtain this result, we reverse the order of df$count, take a rolling sum of width 6 (partial = TRUE and align = "left" are important here, so the shortened windows at the end are still summed), and reverse the result back into the original order.
rev(rollapply(rev(df$count), 6, sum, partial = TRUE, align = "left"))
# [1] 1 3 6 10 15 21 27 33 39 45
Note that this sums 6 elements, not 5. According to the code in your question, this gives the same output. If you just want to sum 5 rows, just replace the 6 with a 5.
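If you'd rather avoid the zoo dependency, the same 6-element running sum can be built from cumsum in base R; a sketch, assuming the df above:
cs <- cumsum(df$count)
# subtract the cumulative sum from 6 positions earlier (0 when out of range)
df$acc_sum <- cs - c(rep(0, 6), head(cs, -6))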

Subset columns using logical vector

I have a dataframe and I want to drop the columns whose NA rate is above 70%, or where a single dominant value takes up over 99% of rows. How can I do that in R?
I find it easy to select rows with a logical vector in the subset function, but how can I do something similar for columns? For example, if I write:
isNARateLt70 <- function(column) {
  # some code
}
apply(dataframe, 2, isNARateLt70)
Then how can I use the resulting vector to subset the dataframe?
If you have a data.frame like
dd <- data.frame(matrix(rpois(7*4,10),ncol=7, dimnames=list(NULL,letters[1:7])))
# a b c d e f g
# 1 11 2 5 9 7 6 10
# 2 10 5 11 13 11 11 8
# 3 14 8 6 16 9 11 9
# 4 11 8 12 8 11 6 10
You can subset with a logical vector using one of
mycols <- c(T, F, F, T, F, F, T)
dd[mycols]
dd[, mycols]
There's really no need to write a function when we have colMeans (thanks #MrFlick for the advice to change from colSums()/nrow(), and shown at the bottom of this answer).
Here's how I would approach your function if you want to use sapply on it later.
> d <- data.frame(x = rep(NA, 5), y = c(1, NA, NA, 1, 1),
z = c(rep(NA, 3), 1, 2))
> isNARateLt70 <- function(x) mean(is.na(x)) <= 0.7
> sapply(d, isNARateLt70)
# x y z
# FALSE TRUE TRUE
Then, to subset with the above line your data using the above line of code, it's
> d[sapply(d, isNARateLt70)]
But as mentioned, colMeans works just the same,
> d[colMeans(is.na(d)) <= 0.7]
# y z
# 1 1 NA
# 2 NA NA
# 3 NA NA
# 4 1 1
# 5 1 2
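The question's second criterion (a single value dominating over 99% of rows) can be handled the same way; a sketch, where dominantRateLt99 is a made-up helper name:
# Hypothetical helper: TRUE if the most frequent value (NA included)
# covers at most 99% of rows
dominantRateLt99 <- function(x) max(table(x, useNA = "ifany")) / length(x) <= 0.99
d[sapply(d, dominantRateLt99)]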
Maybe this will help too. The 2 in apply() means the function is applied column-wise to the data frame cars.
> columns <- apply(cars, 2, function(x) {mean(x) > 10})
> columns
speed dist
TRUE TRUE
> cars[1:10, columns]
speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10
7 10 18
8 10 26
9 10 34
10 11 17
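One caveat with apply() on a data frame: it coerces the data frame to a matrix first, so with mixed column types everything becomes character and mean() will fail. sapply(df, ...), as in the answer above, avoids that coercion.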

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types. (Unless you want to resort to complex numbers.) head with a negative argument drops that many elements from the end of the vector; try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it isn't there yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
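For example, on a short vector (a sketch; columns 2 and 3 are the series lagged by 1 and 2, which is why [, c(3, 2)] picks out the two offset columns):
embed(c(NA, NA, 1:4), 3)
#      [,1] [,2] [,3]
# [1,]    1   NA   NA
# [2,]    2    1   NA
# [3,]    3    2    1
# [4,]    4    3    2
And for the sum/mean follow-up in the question, something like rowMeans(df[c("b1", "b2")], na.rm = TRUE) works directly on the two-column version.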
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8

Is there any way to bind data to data.frame by some index?

# Say I have a situation like this
user_id = c(1:5, 1:5)
time = c(1:10)
visit_log = data.frame(user_id, time)
# And I've written a method to calculate intervals
interval <- function(data) {
  interval = c(Inf)
  for (i in seq(1, length(data$time))) {
    intv = data$time[i] - data$time[i - 1]
    interval = append(interval, intv)
  }
  data$interval = interval
  return(data)
}
# But when I want to get the intervals by user_id and bind them to the data.frame,
# I can't find a proper way. Is there any method to get something like
new_data = merge(by(visit_log, INDICES = visit_log$user_id, FUN = interval))
#And the result should be
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
We can replace your loop with the diff() function, which computes the differences between adjacent elements of a vector, for example:
> diff(c(1,3,6,10))
[1] 2 3 4
We can prepend Inf to the differences via c(Inf, diff(x)).
The next thing we need is to apply the above to each user_id individually. For that there are many options, but here I use aggregate(). Confusingly, this function returns a data frame with a time component that is itself a matrix. We need to convert that matrix to a vector, relying on the fact that in R, matrices are filled column-first. Finally, we add an interval column to the input data, as per your original version of the function.
interval <- function(x) {
  diffs <- aggregate(time ~ user_id, data = x, function(y) c(Inf, diff(y)))
  diffs <- as.numeric(diffs$time)
  x <- within(x, interval <- diffs)
  x
}
Here is a slightly expanded example, with 3 time points per user, to illustrate the above function:
> visit_log = data.frame(user_id = rep(1:5, 3), time = 1:15)
> interval(visit_log)
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
11 1 11 5
12 2 12 5
13 3 13 5
14 4 14 5
15 5 15 5
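Among the "many options" for a per-group calculation, base R's ave() is arguably the simplest here, since it returns results in the original row order; a sketch using the same visit_log:
visit_log$interval <- ave(visit_log$time, visit_log$user_id,
                          FUN = function(y) c(Inf, diff(y)))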
