** edited because I'm a doofus - with replacement, not without **
I have a large-ish (>500k rows) dataset with 421 groups, defined by two grouping variables. Sample data as follows:
df<-data.frame(group_one=rep((0:9),26), group_two=rep((letters),10))
head(df)
group_one group_two
1 0 a
2 1 b
3 2 c
4 3 d
5 4 e
6 5 f
...and so on.
What I want is some number (k = 12 at the moment, but that number may vary) of stratified samples, by membership in (group_one x group_two). Membership in each group should be indicated by a new column, sample_membership, which has a value of 1 through k (again, 12 at the moment). I should be able to subset by sample_membership and get up to 12 distinct samples, each of which is representative when considering group_one and group_two.
Final data set would thus look something like this:
group_one group_two sample_membership
1 0 a 1
2 0 a 12
3 0 a 5
4 1 a 5
5 1 a 7
6 1 a 9
Thoughts? Thanks very much in advance!
Maybe something like this?:
library(dplyr)
df %>%
group_by(group_one, group_two) %>%
mutate(sample_membership = sample(1:12, n(), replace = FALSE))
Here's a one-line data.table approach, which you should definitely consider if you have a long data.frame.
library(data.table)
setDT(df)
df[, sample_membership := sample.int(12, .N, replace=TRUE), keyby = .(group_one, group_two)]
df
# group_one group_two sample_membership
# 1: 0 a 9
# 2: 0 a 8
# 3: 0 c 10
# 4: 0 c 4
# 5: 0 e 9
# ---
# 256: 9 v 4
# 257: 9 x 7
# 258: 9 x 11
# 259: 9 z 3
# 260: 9 z 8
For sampling without replacement, use replace=FALSE, but as noted elsewhere, make sure you have fewer than k members per group. OR:
If you want to use "sampling without unnecessary replacement" (making this up -- not sure what the right terminology is here) because you have more than k members per group but still want to keep the groups as evenly sized as possible, you could do something like:
# example with bigger groups
k <- 12L
big_df <- data.frame(group_one=rep((0:9),260), group_two=rep((letters),100))
setDT(big_df)
big_df[, sample_round := rep(1:.N, each=k, length.out=.N), keyby = .(group_one, group_two)]
big_df[, sample_membership := sample.int(k, .N, replace=FALSE), keyby = .(group_one, group_two, sample_round)]
head(big_df, 15) # you can see first repeat does not occur until row k+1
Within each "sampling round" (first k observations in the group, second k observations in the group, etc.) there is sampling without replacement. Then, if necessary, the next sampling round makes all k assignments available again.
This approach would really evenly stratify the sample (but perfectly even is only possible if you have a multiple of k members in each group).
Here is a base R method, that assumes that your data.frame is sorted by groups:
# get number of observations for each group
groupCnt <- with(df, aggregate(group_one, list(group_one, group_two), FUN=length))$x
# for reproducibility, set the seed
set.seed(1234)
# get sample by group
df$sample <- c(sapply(groupCnt, function(i) sample(12, i, replace=TRUE)))
Untested example using dplyr, if it doesn't work it might point you in the right direction.
library( dplyr )
set.seed(123)
df <- data.frame(
group_one = as.integer( runif( 1000, 1, 6) ),
group_two = sample( LETTERS[1:6], 1000, TRUE)
) %>%
group_by( group_one, group_two ) %>%
mutate(
sample_membership = sample( seq(1, length(group_one) ), length(group_one), FALSE)
)
Good luck!
Related
Let's say I want to order the iris dataset (as a data.table) by Species, keeping observations grouped by species and randomly ordering across species.
How do I do that?
I am not talking about generating a random order within groups (species).
My intuition was to write the code bellow. But it actually creates the within species random variable. Well at least it makes the question reproducible
d <- iris %>% data.table
set.seed('12345')
d[,g:=runif(.N),Species]
You may do a binary search in i. A smaller example:
d <- data.table(Species = rep(letters[1:4], each = 2), ri = 1:8)
set.seed(1)
d[.(sample(unique(Species))), on = "Species"]
# Species ri
# 1: b 3
# 2: b 4
# 3: d 7
# 4: d 8
# 5: c 5
# 6: c 6
# 7: a 1
# 8: a 2
Alternatively you could do:
e <- d[, .N, Species]
e[, g2 := runif(.N)]
d <- e[, .(Species, g2)][d, on = 'Species']
We can randomly sample from a series 1...N where N is the # of levels of the factor (Species) in question.
We then map the new order to a column and sort by it. Broken apart into steps for illustration it looks like this:
tmp <- sample_n(as.data.frame(seq(1,length(unique(d$Species)))),3)[,1]
d$index <- tmp[as.numeric(d$Species)]
d <- d[order(d$index),]
You could compact this into 1 line/step:
d <- d[order(sample_n(as.data.frame(seq(1,length(unique(d$Species)))),3)[,1][as.numeric(d$Species)]),]
I want to replace the nth consecutive occurrence of a particular code in my data frame. This should be a relatively easy task but I can't think of a solution.
Given a data frame
df <- data.frame(Values = c(1,4,5,6,3,3,2),
Code = c(1,1,2,2,2,1,1))
I want a result
df_result <- data.frame(Values = c(1,4,5,6,3,3,2),
Code = c(1,0,2,2,2,1,0))
The data frame is time-ordered so I need to keep the same order after replacing the values. I guess that nth() or duplicate() functions could be useful here but I'm not sure how to use them. What I'm missing is a function that would count the number of consecutive occurrences of a given value. Once I have it, I could then use it to replace the nth occurrence.
This question had some ideas that I explored but still didn't solve my problem.
EDIT:
After an answer by #Gregor I wrote the following function which solves the problem
library(data.table)
library(dplyr)
replace_nth <- function(x, nth, code) {
y <- data.table(x)
y <- y[, code_rleid := rleid(y$Code)]
y <- y[, seq := seq_along(Code), by = code_rleid]
y <- y[seq == nth & Code == code, Code := 0]
drop.cols <- c("code_rleid", "seq")
y %>% select(-one_of(drop.cols)) %>% data.frame() %>% return()
}
To get the solution, simply run replace_nth(df, 2, 1)
Using data.table:
library(data.table)
setDT(df)
df[, code_rleid := rleid(df$Code)]
df[, seq := seq_along(Code), by = code_rleid]
df[seq == 2 & Code == 1, Code := 0]
df
# Values Code code_rleid seq
# 1: 1 1 1 1
# 2: 4 0 1 2
# 3: 5 2 2 1
# 4: 6 2 2 2
# 5: 3 2 2 3
# 6: 3 1 3 1
# 7: 2 0 3 2
You could combine some of these (and drop the extra columns after). I'll leave it clear and let you make modifications as you like.
So I currently face a problem in R that I exactly know how to deal with in Stata, but have wasted over two hours to accomplish in R.
Using the data.frame below, the result I want is to obtain exactly the first observation per group, while groups are formed by multiple variables and have to be sorted by another variable, i.e. the data.frame mydata obtained by:
id <- c(1,1,1,1,2,2,3,3,4,4,4)
day <- c(1,1,2,3,1,2,2,3,1,2,3)
value <- c(12,10,15,20,40,30,22,24,11,11,12)
mydata <- data.frame(id, day, value)
Should be transformed to:
id day value
1 1 10
1 2 15
1 3 20
2 1 40
2 2 30
3 2 22
3 3 24
4 1 11
4 2 11
4 3 12
By keeping only one of the rows with one or multiple duplicate group-identificators (here that is only row[1]: (id,day)=(1,1)), sorting for value first (so that the row with the lowest value is kept).
In Stata, this would simply be:
bys id day (value): keep if _n == 1
I found a piece of code on the web, which properly does that if I first produce a single group identifier :
mydata$id1 <- paste(mydata$id,"000",mydata$day, sep="") ### the single group identifier
myid.uni <- unique(mydata$id1)
a<-length(myid.uni)
last <- c()
for (i in 1:a) {
temp<-subset(mydata, id1==myid.uni[i])
if (dim(temp)[1] > 1) {
last.temp<-temp[dim(temp)[1],]
}
else {
last.temp<-temp
}
last<-rbind(last, last.temp)
}
last
However, there are a few problems with this approach:
1. A single identifier needs to be created (which is quickly done).
2. It seems like a cumbersome piece of code compared to the single line of code in Stata.
3. On a medium-sized dataset (below 100,000 observations grouped in lots of about 6), this approach would take about 1.5 hours.
Is there any efficient equivalent to Stata's bys var1 var2: keep if _n == 1 ?
The package dplyr makes this kind of things easier.
library(dplyr)
mydata %>% group_by(id, day) %>% filter(row_number(value) == 1)
Note that this command requires more memory in R than in Stata: in R, a new copy of the dataset is created while in Stata, rows are deleted in place.
I would order the data.frame at which point you can look into using by:
mydata <- mydata[with(mydata, do.call(order, list(id, day, value))), ]
do.call(rbind, by(mydata, list(mydata$id, mydata$day),
FUN=function(x) head(x, 1)))
Alternatively, look into the "data.table" package. Continuing with the ordered data.frame from above:
library(data.table)
DT <- data.table(mydata, key = "id,day")
DT[, head(.SD, 1), by = key(DT)]
# id day value
# 1: 1 1 10
# 2: 1 2 15
# 3: 1 3 20
# 4: 2 1 40
# 5: 2 2 30
# 6: 3 2 22
# 7: 3 3 24
# 8: 4 1 11
# 9: 4 2 11
# 10: 4 3 12
Or, starting from scratch, you can use data.table in the following way:
DT <- data.table(id, day, value, key = "id,day")
DT[, n := rank(value, ties.method="first"), by = key(DT)][n == 1]
And, by extension, in base R:
Ranks <- with(mydata, ave(value, id, day, FUN = function(x)
rank(x, ties.method="first")))
mydata[Ranks == 1, ]
Using data.table, assuming the mydata object has already been sorted in the way you require, another approach would be:
library(data.table)
mydata <- data.table(my.data)
mydata <- mydata[, .SD[1], by = .(id, day)]
Using dplyr with magrittr pipes:
library(dplyr)
mydata <- mydata %>%
group_by(id, day) %>%
slice(1) %>%
ungroup()
If you don't add ungroup() to the end dplyr's grouping structure will still be present and might mess up some of your subsequent functions.
I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
"v1" = rep(c("A", "B"), c(10,20)),
"v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
df <- df[df$v1 == cat,]
df <- df[order(-df$v2),]
#Divide the customers into top percents
nr <- nrow(df)
p10 <- round(nr * .10, 0)
cat("Number of people in the Top 10% :", p10, "\n")
p20 <- round(nr * .20, 0)
p11_20 <- p20-p10
cat("Number of people in the 11-20% :", p11_20, "\n")
#Keep only those customers in the top groups
df <- df[1:p20,]
#Create a variable to identify the percent group the customer is in
top_pct <- integer(length = p10 + p11_20)
#Identify those in each group
top_pct[1:p10] <- 10
top_pct[(p10+1):p20] <- 20
#Add this variable to the data frame
df$top_pct <- top_pct
#Keep only custID and the new variable
df <- subset(df, select = c(custID, top_pct))
return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
summary(res)
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
print(res)
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
#
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for(i in 1:length(res)) {
names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for(i in 2:length(res)) {
res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
print(res_m)
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
Stick to your Stata instincts and use a single data set:
require(data.table)
DT <- data.table(data)
DT[,r:=rank(v2)/.N,by=v1]
You can see the result by typing DT.
From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
DT[,g:={
x = rep(0,.N)
x[r>.8] = 20
x[r>.9] = 10
x
}]
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
Finally, you can subset with
DT[g>0]
which gives
custID v1 v2 r g
1: 1 A 30 1.000 10
2: 2 A 29 0.900 20
3: 1 B 20 0.975 10
4: 2 B 19 0.875 20
5: 6 B 20 0.975 10
6: 7 B 19 0.875 20
These steps can also be chained together:
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
(Thanks to #ExperimenteR:)
To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
dcast(
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0],
custID~v1)
Currently, dcast requires the latest version of data.table, available (I think) from Github.
You don't need the function pf to achieve what you want. Try dplyr/tidyr combo
library(dplyr)
library(tidyr)
data %>%
group_by(v1) %>%
arrange(desc(v2))%>%
mutate(n=n()) %>%
filter(row_number() <= round(n * .2)) %>%
mutate(top_pct= ifelse(row_number()<=round(n* .1), 10, 20)) %>%
select(custID, top_pct) %>%
spread(v1, top_pct)
# custID A B
#1 1 10 10
#2 2 20 20
#3 6 NA 10
#4 7 NA 20
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
lapply(split(data, data$v1), function(df) {
cutoff <- quantile(df$v2, c(0.8, 0.9))
top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
na.omit(data.frame(id=df$custID, top_pct))
})
Finding quantiles is done with quantile.
I am trying to group a column of my data.frame/data.table into three groups, all with equal sums.
The data is first ordered from smallest to largest, such that group one would be made up of a large number of rows with small values, and group three would have a small number of rows with large values. This is accomplished in spirit with:
test <- data.frame(x = as.numeric(1:100000))
store <- 0
total <- sum(test$x)
for(i in 1:100000){
store <- store + test$x[i]
if(store < total/3){
test$y[i] <- 1
} else {
if(store < 2*total/3){
test$y[i] <- 2
} else {
test$y[i] <- 3
}
}
}
While successful, I feel like there must be a better way (and maybe a very obvious solution that I am missing).
I never like resorting to loops, especially with nested ifs, when a vectorized approach is available - with even 100,000+ records this code becomes quite slow
This method would become impossibly complex to code to a larger number of groups (not necessarily the looping, but the ifs)
Requires pre-ordering of the column. Might not be able to get around this one.
As a nuance (not that it makes a difference) but the data to be summed would not always (or ever) be consecutive integers.
Maybe with cumsum:
test$z <- cumsum(test$x) %/% (ceiling(sum(test$x) / 3)) + 1
This is more or less a bin-packing problem.
Use the binPack function from the BBmisc package:
library(BBmisc)
test$bins <- binPack(test$x, sum(test$x)/3+1)
The sums of the 3 bins are nearly identical:
tapply(test$x, test$bins, sum)
1 2 3
1666683334 1666683334 1666683332
I thought that the cumsum/modulo division approach was very elegant, but it does retrun a somewhat irregular allocation:
> tapply(test$x, test$z, sum)
1 2 3
1666636245 1666684180 1666729575
> sum(test)/3
[1] 1666683333
So I though I would first create a random permutation and offer something similar:
test$x <- sample(test$x)
test$z2 <- cumsum(test$x)[ findInterval(cumsum(test$x),
c(0, 1666683333*(1:2), sum(test$x)+1))]
> tapply(test$x, test$z2, sum)
91099 116379 129539
1666676164 1666686837 1666686999
This also achieves a more even distribution of counts:
> table(test$z2)
91099 116379 129539
33245 33235 33520
> table(test$z)
1 2 3
57734 23915 18351
I must admit to puzzlement regarding the naming of the entries in z2.
Or you can just cut on the cumsum
test$z <- cut(cumsum(test$x), breaks = 3, labels = 1:3)
or use ggplot2::cut_interval instead of cut:
test$z <- cut_interval(cumsum(test$x), n = 3, labels = 1:3)
You can use fold() from groupdata2 and get an almost equal number of elements per group:
# Create data frame
test <- data.frame(x = as.numeric(1:100000))
# Use fold() to create 3 numerically balanced groups
test <- groupdata2::fold(k = 3, num_col = "x")
# Watch first 10 rows
head(test, 10)
## # A tibble: 10 x 2
## # Groups: .folds [3]
## x .folds
## <dbl> <fct>
## 1 1 1
## 2 2 3
## 3 3 2
## 4 4 1
## 5 5 2
## 6 6 2
## 7 7 1
## 8 8 3
## 9 9 2
## 10 10 3
# Check the sum and number of elements per group
test %>%
dplyr::group_by(.folds) %>%
dplyr::summarize(sum_ = sum(x),
n_members = dplyr::n())
## # A tibble: 3 x 3
## .folds sum_ n_members
## <fct> <dbl> <int>
## 1 1 1666690952 33333
## 2 2 1666716667 33334
## 3 3 1666642381 33333