Generate list of datasets with randomly selected features in R

I have a dataset with 20 features. I wish to create a list of datasets, each containing a random subset of features from the original dataset.
For example: list(dataset[, c(1,3,4)], dataset[, c(2,3,5,11,20)]).
I am trying the following code:
selectors = array(runif(2000), dim = c(100, 20, 1))
list_datasets = vector("list", 100)
i = 1
while (i < 100) {
  list_datasets[[i]] = dataset[, selectors[i, , 1] > 0.5]
  i = i + 1
}
Here my dataset has 20 features and I need 100 datasets, each with a random subset of those 20 features. So I created the selectors array by first generating a vector of 2000 random values and then reshaping it to 100 x 20. In the while loop, I then add a feature to list_datasets[[i]] only if the random value generated for it is greater than 0.5. Hopefully I was able to explain myself.
But this is very slow. I am new to R and would like to know the best way to achieve this.

I'm not totally sure if I understand your setup, so correct me if I missed something. My understanding is that you have a dataset (I created a fake dataset of size 100 rows x 20 features) and want to create 100 new datasets using a random subset of features. You generate a random subset of features by generating random uniform values and checking if each is > 0.5.
I have two options here, one using lapply and one using a for loop.
The apply functions are often faster than explicit loops (and you want a for loop, not a while loop, here anyway).
Other changes:
1) Use a logical mask as @Krash suggested: the whole mask (each value greater than 0.5) can be computed outside the loop, since it does not depend on i.
2) selectors can be 2-dimensional; the third dimension of length 1 adds nothing.
set.seed(123)
# Original dataset: assume it's 100 x 20 features
dataset <- array(rnorm(2000), dim = c(100, 20))
## Original (Option 0: while loop)
system.time({
  # Selectors: 100 x 20 x 1 (one slice per dataset)
  selectors = array(runif(2000), dim = c(100, 20, 1))
  # Initialize list
  list_datasets = vector("list", 100)
  # Fill in list
  i = 1
  while (i < 100) {
    list_datasets[[i]] = dataset[, selectors[i, , 1] > 0.5]
    i = i + 1 # Off-by-one error: list_datasets[[100]] is never filled in
  }
})
## user system elapsed
## 0.006 0.000 0.006
# Option 1: for loop
system.time({
  # Selectors: 100 x 20 logical mask (one row per dataset)
  selectors = array(runif(2000), dim = c(100, 20))
  selectors = selectors < 0.5
  # Initialize list
  list_datasets = vector("list", 100)
  # Fill in list
  for (i in 1:100) {
    list_datasets[[i]] = dataset[, selectors[i, ]]
  }
})
## user system elapsed
## 0.004 0.000 0.005
# Option 2: lapply
system.time({
  # Selectors: 100 x 20 logical mask (one row per dataset)
  selectors = array(runif(2000), dim = c(100, 20))
  selectors = selectors < 0.5
  # Fill in list
  list_datasets <- lapply(1:100, FUN = function(x) dataset[, selectors[x, ]])
})
## user system elapsed
## 0.003 0.000 0.003
The timings vary from run to run, of course, but hopefully some of these changes will improve the speed.
Just as a check to make sure the code did what I wanted it to do:
# Check number of cols per dataset
library(purrr)  # also provides %>%
list_datasets %>%
  map_int(~ ncol(.x))
## [1] 8 7 9 12 11 13 11 10 10 14 14 7 8 10 10 9 14 10 6 11 13 8 7 8 10 12 9 11 9 9 13
## [32] 12 8 14 11 11 8 10 11 8 10 13 12 10 6 10 10 12 9 9 10 11 7 8 11 9 11 9 7 9 9 11
## [63] 14 9 9 9 9 13 13 14 12 9 10 9 12 8 11 14 9 7 12 7 6 11 11 7 9 8 12 10 12 9 11
## [94] 13 12 16 9 8 11 10
Other thoughts: Rather than creating a selectors array via random uniforms with each row corresponding to a new dataset, you could add a line like this inside your loop (or lapply FUN).
include_feature <- sample(0:1, size = 20, replace = TRUE)
include_feature
## [1] 0 0 1 0 0 0 1 1 1 0 1 1 1 0 1 0 0 0 0 1
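Putting that suggestion together: a minimal sketch (using a fake 100 x 20 dataset, since the original data isn't shown) that draws a fresh logical mask inside each lapply call and guards against the rare draw that selects no features at all:

```r
set.seed(123)
# Stand-in for the asker's data: 100 rows x 20 features
dataset <- as.data.frame(matrix(rnorm(2000), nrow = 100, ncol = 20))

list_datasets <- lapply(1:100, function(i) {
  # Fresh logical mask per dataset; resample if nothing was selected
  keep <- sample(c(TRUE, FALSE), size = 20, replace = TRUE)
  while (!any(keep)) keep <- sample(c(TRUE, FALSE), size = 20, replace = TRUE)
  dataset[, keep, drop = FALSE]  # drop = FALSE keeps a data frame even for 1 column
})
```

drop = FALSE matters here: without it, a mask that keeps a single column would silently return a vector instead of a data frame.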

Related

R Select N evenly spaced out elements in vector, including first and last

I'm looking for a way to extract evenly spaced elements in a vector. I'd like a general way to do this because I am trying to specify the values that I want in a plotly chart. I tried using pretty but that only seems to work with ggplot2.
I'm pretty much looking for an R version of this question that was answered for python.
Here's a sample set: a vector of 23 elements (23 is prime, so the vector cannot be split into equal-length groups).
x <- 1:23
Ideally, there would be a function that takes a number for the spacing (n) and that splits x into a subset of n evenly spaced values that also includes the first and last element. For example:
split_func(x, n = 4)
[1] 1 4 8 12 16 20 23
The output elements are centered between the first and last elements and are spaced by 4, with the exception of the first/second and second-to-last/last gaps.
A couple other examples:
split_func(x, n = 5)
[1] 1 5 10 15 20 23 # either this
[1] 1 4 9 14 19 23 # or this would work
split_func(1:10, n = 3)
[1] 1 3 6 9 10 # either this
[1] 1 2 5 8 10 # or this would work
split_func(1:27, n = 6)
[1] 1 5 11 17 23 27
Is there a function that does this already?
Try this:
split_func <- function(x, by) {
  r <- diff(range(x))
  out <- seq(0, r - by - 1, by = by)
  c(round(min(x) + c(0, out - 0.51 + (max(x) - max(out)) / 2), 0), max(x))
}
split_func(1:23, 4)
# [1] 1 4 8 12 16 20 23
split_func(1:23, 5)
# [1] 1 4 9 14 19 23
split_func(1:10, 3)
# [1] 1 4 7 10
split_func(1:27, 6)
# [1] 1 5 11 17 23 27
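For comparison, a simpler variant (my own sketch, not part of the original answer; split_func2 and its k formula are hypothetical names) builds the indices with seq(..., length.out = k), which guarantees the endpoints exactly at the cost of only approximately even interior spacing:

```r
split_func2 <- function(x, by) {
  # Number of points implied by the requested spacing, endpoints included
  k <- ceiling(diff(range(x)) / by) + 1
  x[round(seq(1, length(x), length.out = k))]
}
split_func2(1:23, 4)  # 1 5 8 12 16 19 23 -- close to, but not identical to, split_func
```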

Referring to elements of a list of matrices

I've got a list of matrices created as follows:
#create the database
vect_date <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)
vect <- c(48,40,32,36,37,37,20,15,15,24,24,10,10,10)
vect <- as.data.frame(cbind(vect_date, vect))
vect <- vect[order(vect$vect_date),]
#create levels depending on vect$vect value
vect$level <- 1
for (i in 2:length(vect$vect)) {
  vect$level[i] <- ifelse(vect$vect[i] == vect$vect[i-1],
                          vect$level[i-1], vect$level[i-1] + 1)
}
#create the list
monotone <- split(vect, f=vect$level)
Now I would like to change the vect$vect value of each of these levels/matrices depending on the vect$vect value of the subsequent matrix. I guess the issue comes down to indexing elements and using for loops, but I don't know how to do that.
As an example, suppose I want to change vect$vect whenever the subsequent level's value is 10: the vect$vect values of that level should be multiplied by 100, obtaining:
vect <- c(48,40,32,36,37,37,20,15,15,2400,2400,10,10,10)
Any help would be great!
I think you can use factor in R first to get your levels:
vect_date <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14)
vect <- c(48,40,32,36,37,37,20,15,15,24,24,10,10,10)
vect <- as.data.frame(cbind(vect_date, vect))
vect <- vect[order(vect$vect_date),]
vect$level = factor(vect$vect,levels=unique(vect$vect))
vect$level = as.numeric(vect$level)
So if we want to change the level that comes directly before the rows where vect is 10, we can do:
level_tochange = vect$level[vect$vect==10] - 1
level_tochange
[1] 8 8 8
This tells us we need to change the rows with level == 8. Note that I use %in% below, so this still works when more than one level precedes a run of 10s:
rows_tochange = which(vect$level %in% level_tochange)
vect$vect[rows_tochange] = vect$vect[rows_tochange]*100
vect
vect_date vect level
1 1 48 1
2 2 40 2
3 3 32 3
4 4 36 4
5 5 37 5
6 6 37 5
7 7 20 6
8 8 15 7
9 9 15 7
10 10 2400 8
11 11 2400 8
12 12 10 9
13 13 10 9
14 14 10 9
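An alternative sketch of the same idea using rle() (my own variant, not part of the original answer): run-length encoding builds the levels directly as runs of equal values, so the factor step is not needed.

```r
vect <- c(48, 40, 32, 36, 37, 37, 20, 15, 15, 24, 24, 10, 10, 10)
r <- rle(vect)
# Flag each run whose *following* run has the value 10
to_scale <- c(r$values[-1] == 10, FALSE)
r$values[to_scale] <- r$values[to_scale] * 100
out <- inverse.rle(r)
out  # 48 40 32 36 37 37 20 15 15 2400 2400 10 10 10
```

Because rle() merges consecutive equal values into one run, a run of 10s is a single unit, and only the run immediately before it gets scaled.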

For Loop Adding Extra Rows to The Data Frame

Hello, I am very new to the programming world and to data science, and I am trying to work my way through it.
I am trying to assign values to a column in a data frame using a for loop, such that the data frame is divided into ten groups and every row in each group is assigned a rank: rows 1 to 10 get rank 1, rows 11 to 20 get rank 2, and so on. The original dimension of the subset data frame is 100 x 6.
My data frame looks like this (screenshot omitted).
The codes I have written are:
x <- round(nrow(subset) / 10)
a = 1
for (j in 1:10) {
  for (i in a:x) {
    subset[i, "rank"] = j
  }
  j = j + 1
  a = x + 1
  x = x * j
}
However, the loop kept running and kept appending rows to the data frame. I had to stop it manually, and by then the subset data frame had grown to 17926 x 6.
Please help me understand where I am going wrong with this loop.
P.S. subset is a data frame name and not the subset function in R
Thanks in Advance !!
It might be better to start working with vectorized calculations instead of loops; this will serve you well in the future.
For example:
df <- data.frame(x = 1:100)
df$rank <- (df$x - 1) %/% 10 + 1
df
results in:
x rank
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 2
12 12 2
13 13 2
14 14 2
15 15 2
16 16 2
17 17 2
18 18 2
19 19 2
20 20 2
21 21 3
22 22 3
23 23 3
24 24 3
25 25 3
How about something like this:
subset$Rank <- ceiling(as.numeric(rownames(subset))/10)
The as.numeric converts the row name into a number; dividing it by 10 and rounding up should give you what you need. Let me know if I've misunderstood.
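To see why the original loop misbehaves: x = x * j makes the inner loop's upper bound explode (10, 20, 60, 240, ...) past nrow(subset), and indexed assignment beyond the last row grows a data frame rather than failing; the j = j + 1 line has no lasting effect, since for resets j each iteration. A corrected version of the loop, sketched on a stand-in data frame df:

```r
df <- data.frame(x = 1:100)   # stand-in for the asker's `subset`
df$rank <- NA_integer_        # pre-allocate the column first
for (j in 1:10) {
  # Rows (j-1)*10 + 1 through j*10 belong to group j
  rows <- ((j - 1) * 10 + 1):(j * 10)
  df$rank[rows] <- j
}
# Equivalent without any loop:
# df$rank <- rep(1:10, each = 10)
```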

dplyr::mutate comparing each value to vector, collapsing with any/all

I have a dataset of true values (location) that I'm attempting to compare to a vector of estimated values using dplyr. My code below results in an error message. How do I compare each value of data$location to every value of est.locations and collapse the resulting vector to true if all comparisons are greater than 20?
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
mutate(false.neg = (all(abs(location - est.locations) > 20)))
num location false.neg
1 1 453.4281 FALSE
2 2 454.4260 FALSE
3 3 718.0420 FALSE
4 4 801.2217 FALSE
5 5 802.7981 FALSE
6 6 854.2148 FALSE
7 7 873.6085 FALSE
8 8 901.0217 FALSE
9 9 1032.8321 FALSE
10 10 1240.3547 FALSE
Warning message:
In c(...) :
longer object length is not a multiple of shorter object length
The context of the question is dplyr, but I'm open to other suggestions that may be faster. This is a piece of a larger calculation I'm doing on birth-death mcmc chains for 3000 iterations * 200 datasets. (i.e. repeated many times and the number of locations will be different among datasets and for each iteration.)
UPDATE (10/13/15):
I'm going to mark akrun's solution as the answer. A linear algebra approach is a natural fit for this problem and with a little tweaking this will work for calculating both FNR and FPR (FNR should need an (l)apply by iteration, FPR should be one large vector/matrix operation).
JohannesNE's solution points out the issue with my initial approach -- the use of any() reduces the number of rows to a single value, when instead I intended to do this operation row-wise. Which also leads me to think there is likely a dplyr solution using rowwise() and do().
I attempted to limit the scope of the question in my initial post. But for added context, the full problem is on a Bayesian mixture model with an unknown number of components, where the components are defined by a 1D point process. Estimation results in a 'random effects' chain similar in structure to the version of est.locations below. The length mismatch is a result of having to estimate the number of components.
## Clarification of problem
options("max.print" = 100)
set.seed(1)
# True values (number of items and their location)
true.locations <-
data.frame("num" = 1:10,
"location" = runif(10, 0, 1500) %>% sort)
# Mcmc chain of item-specific values ('random effects')
iteration <<- 0
est.locations <-
lapply(sample(10:14, 3000, replace=T), function(x) {
iteration <<- iteration + 1
total.items <- rep(x, x)
num <- 1:x
location <- runif(x, 0, 1500) %>% sort
data.frame(iteration, total.items, num, location)
}) %>% do.call(rbind, .)
print(est.locations)
iteration total.items num location
1 1 11 1 53.92243818
2 1 11 2 122.43662006
3 1 11 3 203.87297671
4 1 11 4 641.70211495
5 1 11 5 688.19477968
6 1 11 6 1055.40283048
7 1 11 7 1096.11595818
8 1 11 8 1210.26744065
9 1 11 9 1220.61185888
10 1 11 10 1362.16553219
11 1 11 11 1399.02227302
12 2 10 1 160.55916378
13 2 10 2 169.66834129
14 2 10 3 212.44257723
15 2 10 4 228.42561489
16 2 10 5 429.22830291
17 2 10 6 540.42659572
18 2 10 7 594.58339156
19 2 10 8 610.53964624
20 2 10 9 741.62600969
21 2 10 10 871.51458277
22 3 13 1 10.88957267
23 3 13 2 42.66629869
24 3 13 3 421.77297967
25 3 13 4 429.95036650
[ reached getOption("max.print") -- omitted 35847 rows ]
You can use sapply (here inside mutate, although this does not really take advantage of dplyr's features).
library(dplyr)
data <- data.frame("num" = 1:10, "location" = runif(10, 0, 1500) %>% sort)
est.locations <- runif(12, 0, 1500) %>% sort
data %>%
mutate(false.neg = sapply(location, function(x) {
all(abs(x - est.locations) > 20)
}))
num location false.neg
1 1 92.67941 TRUE
2 2 302.52290 FALSE
3 3 398.26299 TRUE
4 4 558.18585 FALSE
5 5 859.28005 TRUE
6 6 943.67107 TRUE
7 7 991.19669 TRUE
8 8 1347.58453 TRUE
9 9 1362.31168 TRUE
10 10 1417.01290 FALSE
We can use outer for this kind of comparison. We compute every pairwise difference between 'location' and 'est.locations', take the abs, compare with 20, negate (!), take rowSums, and negate again, so the result is TRUE exactly when every element in the row is greater than 20.
data$false.neg <- !rowSums(!(abs(outer(data$location, est.locations, FUN = "-")) > 20))
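As a sanity check (my own addition, with freshly simulated data rather than the question's), the outer() one-liner agrees with the row-wise sapply() version:

```r
set.seed(42)
location <- sort(runif(10, 0, 1500))
est.locations <- sort(runif(12, 0, 1500))

# Row-wise: for each true location, are all estimates farther than 20 away?
v1 <- sapply(location, function(x) all(abs(x - est.locations) > 20))
# Vectorized: count estimates within 20 over the full 10 x 12 grid, require zero
v2 <- rowSums(abs(outer(location, est.locations, "-")) <= 20) == 0
identical(v1, v2)  # TRUE
```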

Is there any way to bind data to data.frame by some index?

# Say I have a situation like this
user_id = c(1:5,1:5)
time = c(1:10)
visit_log = data.frame(user_id, time)
# And I've written a function to calculate intervals
interval <- function(data) {
  interval = c(Inf)
  for (i in seq(2, length(data$time))) {
    intv = data$time[i] - data$time[i-1]
    interval = append(interval, intv)
  }
  data$interval = interval
  return(data)
}
#But when I want to get intervals by user_id and bind them to the data.frame,
#I can't find a proper way
#Is there any method to get something like
new_data = merge(by(visit_log, INDICE=visit_log$user_id, FUN=interval))
#And the result should be
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
We can replace your loop with the diff() function, which computes the differences between adjacent elements of a vector. For example:
> diff(c(1,3,6,10))
[1] 2 3 4
To that we can prepend Inf to the differences via c(Inf, diff(x)).
The next thing we need is to apply the above to each user_id individually. For that there are many options, but here I use aggregate(). Confusingly, this function returns a data frame with a time component that is itself a matrix. We need to convert that matrix to a vector, relying upon the fact that in R, columns of matrices are filled first. Finally, we add and interval column to the input data as per your original version of the function.
interval <- function(x) {
  diffs <- aggregate(time ~ user_id, data = x, function(y) c(Inf, diff(y)))
  diffs <- as.numeric(diffs$time)
  x <- within(x, interval <- diffs)
  x
}
Here is a slightly expanded example, with 3 time points per user, to illustrate the above function:
> visit_log = data.frame(user_id = rep(1:5, 3), time = 1:15)
> interval(visit_log)
user_id time interval
1 1 1 Inf
2 2 2 Inf
3 3 3 Inf
4 4 4 Inf
5 5 5 Inf
6 1 6 5
7 2 7 5
8 3 8 5
9 4 9 5
10 5 10 5
11 1 11 5
12 2 12 5
13 3 13 5
14 4 14 5
15 5 15 5
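A more compact alternative, assuming base R's ave() is acceptable: it applies the function within each user_id group and returns the results already aligned to the original row order, so no reassembly step is needed.

```r
visit_log <- data.frame(user_id = rep(1:5, 3), time = 1:15)
# Per-user intervals: Inf for each user's first visit, diffs thereafter
visit_log$interval <- ave(visit_log$time, visit_log$user_id,
                          FUN = function(t) c(Inf, diff(t)))
```

Note the FUN passed to ave() must return a vector the same length as its input group, which c(Inf, diff(t)) does.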
