using sample() or an equivalent on 2 variables of a dataframe - r

I have a large dataset, a subset of which looks like this:
Var1 Var2
9 29_13x
14 41y
9 51_13x
4 101_13x
14 105y
14 109y
9 113_13x
9 114_13x
14 116y
14 123y
4 124_13x
14 124y
14 126y
4 134_13x
4 135_13x
4 137_13x
9 138_13x
4 139_13x
14 140y
9 142_13x
4 143_13x
My code sits inside a loop and I would like to be able to sample, without replacement, a certain number of Var2 values (defined by the loop iteration) from each of the different Var1 categories. So for i=4 I'd like to get something like this:
29_13x
51_13x
113_13x
138_13x
which are all from Var1=9
41y
109y
126y
140y
from Var1=14, and
101_13x
134_13x
137_13x
139_13x
all from Var1=4.
I can't get sample() to work across more than one variable and can't find any other way to do this. Any suggestions would be greatly appreciated.

Here are two options.
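Both assume mydf is the data frame shown in the question; to make the examples below self-contained, it can be rebuilt from that table (values copied from the question):
mydf <- data.frame(
  Var1 = c(9, 14, 9, 4, 14, 14, 9, 9, 14, 14, 4,
           14, 14, 4, 4, 4, 9, 4, 14, 9, 4),
  Var2 = c("29_13x", "41y", "51_13x", "101_13x", "105y", "109y",
           "113_13x", "114_13x", "116y", "123y", "124_13x", "124y",
           "126y", "134_13x", "135_13x", "137_13x", "138_13x",
           "139_13x", "140y", "142_13x", "143_13x"),
  stringsAsFactors = FALSE  # keep Var2 as character, matching the output below
)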
Using sample with by or tapply:
by(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
Here's some example output with tapply:
out <- tapply(mydf$Var2, mydf$Var1, FUN=function(x) sample(x, 4))
out
# $`4`
# [1] "101_13x" "143_13x" "124_13x" "134_13x"
#
# $`9`
# [1] "114_13x" "113_13x" "142_13x" "29_13x"
#
# $`14`
# [1] "116y" "109y" "140y" "105y"
You can also extract individual vectors by index position or by name:
out[[3]]
# [1] "116y" "126y" "124y" "105y"
out[["14"]]
# [1] "116y" "126y" "124y" "105y"
Subsetting based on a random variable ranked within a grouping variable:
x <- rnorm(nrow(mydf))
mydf[ave(x, mydf$Var1, FUN = rank) %in% 1:4, ]
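If dplyr is available, a more recent per-group sampling idiom is slice_sample() (a sketch, assuming dplyr >= 1.0.0):
library(dplyr)
mydf %>%
  group_by(Var1) %>%
  slice_sample(n = 4) %>%  # samples without replacement by default
  ungroup()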

Related

Replicate rows of dataframe

df <- data.frame(x = c(5, 4, 10, 7), y = c(rep(1, 2), rep(2, 2)))
I'm trying to replicate each x y times and then save the result to a variable, so that I get:
a = c(5, 4)
then
a = c(10, 10, 7, 7)
It's probably an easy one, but I'm new to programming. Thanks in advance.
You can use the split function, which creates a list with one element per distinct value of y:
split(rep(df$x, df$y), rep(df$y, df$y))
$`1`
[1] 5 4
$`2`
[1] 10 10 7 7
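To get each piece into its own variable, as in the question, index the resulting list by name (a usage sketch):
out <- split(rep(df$x, df$y), rep(df$y, df$y))
a <- out[["1"]]  # 5 4
a <- out[["2"]]  # 10 10 7 7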

How to find the index of the value sampled?

In R, I would like to know how I can find the index/indices of the value(s) sampled, for example using the function sample.
In Matlab, it appears this is quite easily done by requesting the output argument idx in the function datasample. Explicitly, taken from Matlab's documentation page for datasample:
[y,idx] = datasample(data,k,...) returns an index vector indicating
which values datasample sampled from data.
I would like to know if such a thing can be accomplished in R, and how.
Example:
set.seed(12)
sample(c(0.3,78,45,0.8,0.3,0.8,77), size=1, replace=TRUE)
0.3
How can I know which of the two 0.3's was that one?
We can create a named vector and then sample:
v1 <- c(LETTERS[1:10], LETTERS[1])
names(v1) <- seq_along(v1)
v2 <- sample(v1, 20, replace=TRUE)
as.integer(names(v2))
#[1] 10 11 4 2 1 4 6 9 1 1 2 9 2 2 2 3 4 7 3 6
Using the OP's data
set.seed(12)
v1 <- c(0.3,78,45,0.8,0.3,0.8,77)
names(v1) <- seq_along(v1)
set.seed(12)
sample(v1, size=1, replace=TRUE)
# 1
#0.3
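An alternative that needs no names is to sample index positions with seq_along() and then look the values up; with the same seed this should reproduce the draw above (a minimal sketch):
set.seed(12)
x <- c(0.3, 78, 45, 0.8, 0.3, 0.8, 77)
idx <- sample(seq_along(x), size = 1, replace = TRUE)  # position drawn
idx     # which of the seven elements was sampled
x[idx]  # the sampled value itself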

Named arrays, dataframes and matrices

If I split my data matrix into groups of rows according to class labels in another vector y, like this, the result is something with 'names':
> X <- matrix(c(1,2,3,4,5,6,7,8),nrow=4,ncol=2)
> y <- c(1,3,1,3)
> X_split <- split(as.data.frame(X),y)
$`1`
V1 V2
1 1 5
3 3 7
$`3`
V1 V2
2 2 6
4 4 8
I want to loop through the results and do some operations on each matrix, for example sum the elements or sum the columns. How do I access each matrix in a loop so I can do that?
labels = names(X_split)
for (k in labels) {
  # How do I get X_split[k] as a matrix?
  sum_class = sum(X_split[k])  # Doesn't work
}
In fact, I don't really want to deal with dataframes and named arrays at all. Is there a way I can call split without as.data.frame and get a list of matrices or something similar?
To split without converting to a data frame
X_split <- list(X[c(1, 3), ], X[c(2, 4), ])
More generally, to write it in terms of a vector y of length nrow(X), indicating the group to which each row belongs, you can write this as
X_split <- lapply(unique(y), function(i) X[y == i, ])
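Note that lapply() returns an unnamed list here, so names(X_split) would be NULL in the question's loop; the group labels can be attached with setNames() (a small sketch):
X_split <- setNames(lapply(unique(y), function(i) X[y == i, ]), unique(y))
names(X_split)
# [1] "1" "3"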
To sum the results
X_sum <- lapply(X_split, sum)
# [[1]]
# [1] 16
# [[2]]
# [1] 20
(or use sapply if you want the result as a vector)
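The question also asks about summing the columns; since the list elements are plain matrices, colSums() applies directly (a sketch using X_split from above):
# each column of the result holds one group's column sums:
# c(4, 12) for group 1 and c(6, 14) for group 3
sapply(X_split, colSums)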
Another option is not to split in the first place and just sum per y. Here's a possible data.table approach
library(data.table)
as.data.table(X)[, sum(sapply(.SD, sum)), by = y]
# y V1
# 1: 1 16
# 2: 3 20
Pretty sure operating directly on the matrix is most efficient:
tapply(rowSums(X),y,sum)
# 1 3
# 16 20

Creating groups of equal sum in R

I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for (i in 1:100000) {
  store <- store + test$x[i]
  if (store < total/3) {
    test$y[i] <- 1
  } else {
    if (store < 2*total/3) {
      test$y[i] <- 2
    } else {
      test$y[i] <- 3
    }
  }
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).
1. I never like resorting to loops, especially with nested ifs, when a vectorized approach is available - with even 100,000+ records this code becomes quite slow.
2. This method would become impossibly complex to code for a larger number of groups (not necessarily the looping, but the ifs).
3. It requires pre-ordering of the column. I might not be able to get around this one.
4. As a nuance (not that it makes a difference), the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum - integer-dividing the running total by a third of the grand total turns it straight into a group number:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
1 2 3
1666683334 1666683334 1666683332
I thought that the cumsum/modulo division approach was very elegant, but it does return a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
1 2 3
1666636245 1666684180 1666729575
> sum(test$x)/3
[1] 1666683333
So I thought I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x),
c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
91099 116379 129539
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
91099 116379 129539
33245 33235 33520
> table(test$z)
1 2 3
57734 23915 18351
I must admit to puzzlement regarding the naming of the entries in z2. (They appear to be the first three values of cumsum(test$x): findInterval() returns bin numbers 1-3, and using those to index cumsum(test$x) repeats its first three elements as the group labels.)
Or you can just cut on the cumsum
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- cut_interval(cumsum(test$x), n = 3, labels = 1:3)
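Whichever variant is used, the allocation can be sanity-checked the same way as above (each sum should land near sum(test$x) / 3):
tapply(test$x, test$z, sum)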
You can use fold() from groupdata2 and get an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(test, k = 3, num_col = "x")
# Show the first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
## x .folds
## <dbl> <fct>
## 1 1 1
## 2 2 3
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 2
## 10 10 3
# Check the sum and number of elements per group
library(dplyr)  # for the %>% pipe
test %>%
  dplyr::group_by(.folds) %>%
  dplyr::summarize(sum_ = sum(x),
                   n_members = dplyr::n())
## # A tibble: 3 x 3
## .folds sum_ n_members
## <fct> <dbl> <int>
## 1 1 1666690952 33333
## 2 2 1666716667 33334
## 3 3 1666642381 33333

Random number selection from a data-frame

I have created a dataframe of "errors" following the steps outlined in Bernaards & Sijtsma's (2000) two-way method for missing data imputation. In order to complete my calculation for missing data, I need to make a random selection of a SINGLE NUMBER from this error dataframe and add it to my already calculated missing data values.
I am familiar with the sample() function, but I am not looking for a random sample of a row or a column, but rather one individual cell from a data-frame. Is there a simple way to do this, such as a single "select random number()" command? Is there an alternative method I have yet to explore?
Any help is greatly appreciated.
It's easier if you can convert to a matrix instead of a data frame, but on the assumption that you need to keep different data types or have some similar limitation:
foo <- as.data.frame(matrix(runif(20), nrow = 4, ncol = 5))
foo[sample(nrow(foo), 1), sample(ncol(foo), 1)]
will pick a random element.
Similar to what @CarlWitthoft answered, you can convert your data frame to a matrix to make sure you sample a single random cell:
> set.seed(10)
> M <- data.frame(matrix(runif(20), nrow = 4, ncol = 5))
> M
# X1 X2 X3 X4 X5
# 1 0.5074782 0.08513597 0.6158293 0.1135090 0.05190332
# 2 0.3067685 0.22543662 0.4296715 0.5959253 0.26417767
# 3 0.4269077 0.27453052 0.6516557 0.3580500 0.39879073
# 4 0.6931021 0.27230507 0.5677378 0.4288094 0.83613414
> sample(as.matrix(M), 1)
# [1] 0.2641777 ## came from row 2, column 5
> sample(as.matrix(M), 1)
# [1] 0.113509 ## came from row 1, column 4
> sample(as.matrix(M), 1)
# [1] 0.4288094 ## came from row 4, column 4
> sample(as.matrix(M), 1)
# [1] 0.2723051 ## came from row 4, column 2
seq(as.matrix(M)) will show you all the cell numbers (top to bottom, left to right). You could also sample from that.
> seq(as.matrix(M))
# [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
> sample(seq(as.matrix(M)), 1)
# [1] 15
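If, as in the earlier question, you also need to know which cell was drawn, base R's arrayInd() converts a sampled linear position back to a row/column pair (a sketch using M from above):
m <- as.matrix(M)
idx <- sample(length(m), 1)  # linear cell index, column-major order
arrayInd(idx, dim(m))        # row and column of the sampled cell
m[idx]                       # the sampled value itself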
