Random Sample From a Dataframe With Specific Count - r

This question is probably best illustrated with an example.
Suppose I have a dataframe df with a binary variable b (values of b are 0 or 1). How can I take a random sample of size 10 from this dataframe so that I have 2 instances where b=0 in the random sample, and 8 instances where b=1 in the dataframe?
Right now, I know that I can do df[sample(nrow(df),10,] to get part of the answer, but that would give me a random amount of 0 and 1 instances. How can I specify a specific amount of 0 and 1 instances while still taking a random sample?

Here's an example of how I'd do this... take two samples and combine them. I've written a simple function so you can "just take one sample."
With a vector:
pop <- sample(c(0,1), 100, replace = TRUE)
yoursample <- function(pop, n_zero, n_one){
c(sample(pop[pop == 0], n_zero),
sample(pop[pop == 1], n_one))
}
yoursample(pop, n_zero = 2, n_one = 8)
[1] 0 0 1 1 1 1 1 1 1 1
Or, if you are working with a dataframe with some unique index called id:
# Where d1 is your data you are summarizing with mean and sd
dat <- data.frame(
id = 1:100,
val = sample(c(0,1), 100, replace = TRUE),
d1 = runif(100))
yoursample <- function(dat, n_zero, n_one){
c(sample(dat[dat$val == 0,"id"], n_zero),
sample(dat[dat$val == 1,"id"], n_one))
}
sample_ids <- yoursample(dat, n_zero = 2, n_one = 8)
sample_ids
mean(dat[dat$id %in% sample_ids,"d1"])
sd(dat[dat$id %in% sample_ids,"d1"])

Here is a suggestion:
First create a sample of 0 and 1 with id column.
Then sample 2:8 df's with condition and bind them together:
library(tidyverse)
set.seed(123)
df <- as_tibble(sample(0:1,size=50,replace=TRUE)) %>%
mutate(id = row_number())
df1 <- df[ sample(which (df$value ==0) ,2), ]
df2 <- df[ sample(which (df$value ==1), 8), ]
df_final <- bind_rows(df1, df2)
value id
<int> <int>
1 0 14
2 0 36
3 1 21
4 1 24
5 1 2
6 1 50
7 1 49
8 1 41
9 1 28
10 1 33

library(tidyverse)
set.seed(123)
df <- data.frame(a = letters,
b = sample(c(0,1),26,T))
bind_rows(
df %>%
filter(b == 0) %>%
sample_n(2),
df %>%
filter(b == 1) %>%
sample_n(8)
) %>%
arrange(a)
a b
1 d 1
2 g 1
3 h 1
4 l 1
5 m 1
6 o 1
7 p 0
8 q 1
9 s 0
10 v 1

Related

Update specific values in a dataframe based on array index position

Let's say I have a dataframe
> colA <- c(1, 14, 8)
> colB <- c(4, 8, 9)
> colC <- c(1, 2, 14)
> df <- data.frame(c(colA, colB, colC))
> df
colA colB colC
1 1 4 1
2 14 8 2
3 8 9 14
What I want to do is create a second data frame which has the same structure as df, but has 1 whenever a specific number is found, and 0 otherwise, e.g., if the number were 14, df2 would look like this
> df2
colA colB colC
1 0 0 0
2 1 0 0
3 0 0 1
I thought I could create a 3x3 data frame of 0s (df2), use which() to get the index for the number in df, and then use that index to change what shows up in df2
> number <- 14
> index <- which(df == number)
> index
[1] 2 9
or perhaps more helpfully
> index <- which(df == number, arr.ind = T)
> index
row col
[1,] 2 1
[2,] 3 3
However I am unsure how to use this index to specifiy which values in the df of NAs should be TRUE and which FALSE (i.e. how to reverse the which)?
NB - I will actually be testing this for multiple numbers, so I figured I would do it inside a for loop. So I want the final DF to show ones for every location which has any of the numbers (i.e. gradually switching the 0's "on" to 1's
> numbers <- c(14, 9, 1
> for(i in numbers){
> index <- which(df == numbers, arr.ind = T)
> #then do whatever needs to be done to change the index locations in df2
P.S., in general, I work in the tidyverse, so tidyverse specific solutions would be grand, but base r would also be brilliant.
Ohh, and yes, this is for day 4 of Advent of Code - it's a useful challenge to help this non-expert coder learn.
Thanks
Here's a full example how it could be done.
Data
df <- structure(list(colA = c(1, 14, 8), colB = c(4, 8, 9), colC = c(1,
2, 14)), class = "data.frame", row.names = c(NA, -3L))
base R
data.frame( sapply( df, function(x) as.numeric( x == 14 | x == 8 ) ))
colA colB colC
1 0 0 0
2 1 1 0
3 1 0 1
for any number in a loop
setNames( data.frame( matrix( rowSums( sapply( c(14,8,1), function(x)
df==x ) ), dim(df) ) ), colnames( df ) )
colA colB colC
1 1 0 1
2 1 1 0
3 1 0 1
dplyr
library(dplyr)
df %>% summarise_all( ~ as.numeric( .x == 14 | .x == 8 ) )
colA colB colC
1 0 0 0
2 1 1 0
3 1 0 1
# or
df %>% summarise( across( everything(), ~ as.numeric( .x == 14 | .x == 8 ) ) )
colA colB colC
1 0 0 0
2 1 1 0
3 1 0 1

Detect sequences of ordered strings and group them using R

I have a string vector with about 500K elements in it and I want to assign a value to each of the element to show the group number of each element.
The grouping criteria goes like this:
a group number is assigned consecutively from the top of the list
Each element should be assigned different groups unless if a minimum of 3 consecutive elements are in ascending alphabetical order, in which these consecutive elements will be in one group.
How do I do this in R?
For example and expected output:
> my_strings <- c("xx1", "1xxx", "abc.xyz", "a", "ad022", "ghj1", "kf1", "991r",
+ "jdd", "12vd", "r34o", "z", "034mh")
> expected_output <- c(1, 2, 3, 4, 4, 4, 4, 5, 6, 7, 7, 7, 8)
> (df <- data.frame(input = my_strings, output = expected_output))
input output
1 xx1 1
2 1xxx 2
3 abc.xyz 3
4 a 4
5 ad022 4
6 ghj1 4
7 kf1 4
8 991r 5
9 jdd 6
10 12vd 7
11 r34o 7
12 z 7
13 034mh 8
So far, I attempt to use dplyr::lead and assign order based on two consecutive elements. I don't know how to proceed from here though.
res <- as_tibble(my_strings) %>%
mutate(after = lead(my_strings))
res$pre_group = apply(res, 1, function(x) order(c(x[1], x[2]))[2])
(Dang, this was a tough one :-)
tidyverse
library(dplyr)
df %>%
mutate(r1 = cumsum(c(TRUE, diff(rank(input)) < 0)) + 0) %>%
group_by(r1) %>%
mutate(r2 = r1 + seq(0, 0.9*(n() < 3), len = n()) / n()) %>%
ungroup() %>%
mutate(r1 = with(list(rl = rle(r2)$lengths), rep(seq_along(rl), times = rl))) %>%
select(-r2)
# # A tibble: 13 x 3
# input output r1
# <chr> <dbl> <int>
# 1 xx1 1 1
# 2 1xxx 2 2
# 3 abc.xyz 3 3
# 4 a 4 4
# 5 ad022 4 4
# 6 ghj1 4 4
# 7 kf1 4 4
# 8 991r 5 5
# 9 jdd 6 6
# 10 12vd 7 7
# 11 r34o 7 7
# 12 z 7 7
# 13 034mh 8 8
(The lengthy with(...) in the mutate is just an inline version of data.table::rleid.)
data.table
library(data.table)
as.data.table(df)[
, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ][
, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ][
, r1 := rleid(r1) ]
If you want to blur the lines of R-dialects a little, then
library(data.table)
library(magrittr)
as.data.table(df) %>%
.[, r1 := cumsum(c(TRUE, diff(rank(input)) < 0)) + 0 ] %>%
.[, r1 := r1 + seq(0, 0.9*(.N < 3), len = .N), by = .(r1) ] %>%
.[, r1 := rleid(r1) ]
Notes:
... + 0 is short-hand for as.numeric(...). This is because data.table enforces the column's original class when updating a column; since the first definition of r1 (without +0) would be integer, the next reassignment of r1 returns numeric. However, since data.table persists the original class, the numbers will be coerced (truncated) to integer and my efforts halted.
seq(0, 0.9*(...)) reduces to seq(0,0) when there are three or more in a group, which results in a no-op on that group. (This uses dplyr's n() and data.table's .N for group-size.)
the implementations differ slightly because dplyr prohibits modifying the grouping variable(s); data.table has no issue with this. (I'm not certain which direction is correct or better ...)
Not nearly as good as r2evans', but also seems to give the result.
x <- my_strings
n <- length(x)
c(FALSE,x[-1L] > x[-n]) &
c(FALSE,FALSE,x[-1L][-1L] > x[-n][-(n-1)]) &
c(FALSE,FALSE,FALSE,x[-1L][-1L][-1L] > x[-n][-(n-1)][-(n-2)])
(lead(x, 1) > x & lead(x,2) > lead(x,1)) |
(lag(x, 1) < x & lead(x,1) > x) |
(lag(x, 1) < x & lag(x,2) < lag(x,1)) -> condition
condition[is.na(condition)] <- FALSE # remove NAs
#to visualize
tibble(lag(x,2), lag(x,1), x, lead(x,1), lead(x,2), condition)
# There may be a better way than a loop
cur_class <- 0
classes <- integer(n)
for(i in 1:(n)){
if(!condition[i]){ #not in a sequence
cur_class <- cur_class + 1
classes[i] <- cur_class
} else if(!condition[i-1]){ #first of a sequence
cur_class <- cur_class + 1
classes[i] <- cur_class
} else{ #mid-sequence
classes[i] <- cur_class
}
}
tibble(x, classes, condition*1L)
# A tibble: 13 x 3
# x classes `condition * 1L`
# <chr> <dbl> <int>
# 1 xx1 1 0
# 2 1xxx 2 0
# 3 abc.xyz 3 0
# 4 a 4 1
# 5 ad022 4 1
# 6 ghj1 4 1
# 7 kf1 4 1
# 8 991r 5 0
# 9 jdd 6 0
# 10 12vd 7 1
# 11 r34o 7 1
# 12 z 7 1
# 13 034mh 8 0

How to get sum by each factor level?

I have filtered data and one of the columns has 5 factor levels and I want to get sum for each of the factor level.
I am using the below code
levels(df_Temp$ATYPE)
[1] "a" "b" "c" "d" "Unknown"
I am using the below code
cast(df_Temp,ATYPE~AFTER_ADM, sum, value = "CHRGES")
but the output I am getting is as below
ATYPE 0 1
1 a 0 2368968.39
2 b 0 3206567.47
3 c 0 19551.19
4 e 0 2528688.12
I want to all the factor levels and sum as "0" for those missing data of factors level.
So the desired output is
ATYPE 0 1
1 a 0 2368968.39
2 b 0 3206567.47
3 c 0 19551.19
4 d 0 0
5 e 0 2528688.12
Using xtabs from base R
xtabs(CHRGES ~ ATYPE + AFTER_ADM, subset(df_Temp, ATYPE != "e"))
# AFTER_ADM
#ATYPE 0 1
# a 0.00000000 -5.92270971
# b -1.68910431 0.05222349
# c -0.26869311 0.16922669
# d 1.44764443 -1.59011411
# e 0.00000000 0.00000000
data
set.seed(24)
df_Temp <- data.frame(ATYPE = sample(letters[1:5], 20, replace = TRUE),
AFTER_ADM = sample(0:1, 20, replace = TRUE), CHRGES = rnorm(20))
If I understand your question correctly, you can use dplyr. First I created an example dataset:
set.seed(123)
x <- sample(letters[1:5], 1e3, replace = T)
x[x == "e"] <- "Unknown"
y <- sample(1:100, 1e3, replace = T)
df1 <- data.frame(ATYPE = factor(x), AFTER_ADM = y)
df1$AFTER_ADM[df1$ATYPE == "Unknown"] <- NA
head(df1, 10)
ATYPE AFTER_ADM
1 b 28
2 d 60
3 c 17
4 Unknown NA
5 Unknown NA
6 a 48
7 c 78
8 Unknown NA
9 c 7
10 c 45
And then use group_by and summarise to get the sum and the counts. I was not sure if you would want the counts for the factor levels but it is easy to take out if you are not interested:
library(dplyr)
df1 %>%
group_by(ATYPE) %>%
summarise(sum_AFTER_ADM = sum(AFTER_ADM, na.rm = T),
n_ATYPE = n())
# A tibble: 5 x 3
ATYPE sum_AFTER_ADM n_ATYPE
<fct> <int> <int>
1 a 10363 198
2 b 11226 206
3 c 9611 203
4 d 9483 195
5 Unknown 0 198
Another possible solution using dplyr and tidyr. Using count and complete from the two packages will help solve your problem.
library(dplyr)
library(tidyr)
#using iris as toy data
iris2 <- iris %>%
filter(Species != "setosa")
#count data and then fill n with 0
ir3 <- count(iris2, Species) %>%
complete(Species, fill = list(n =0))

Group data by factor level, then transform to data frame with colname being levels?

There is my problem that I can't solve it:
Data:
df <- data.frame(f1=c("a", "a", "b", "b", "c", "c", "c"),
v1=c(10, 11, 4, 5, 0, 1, 2))
data.frame:f1 is factor
f1 v1
a 10
a 11
b 4
b 5
c 0
c 1
c 2
# What I want is:(for example, fetch data with the number of element of some level == 2, then to data.frame)
a b
10 4
11 5
Thanks in advance!
I might be missing something simple here , but the below approach using dplyr works.
library(dplyr)
nlevels = 2
df1 <- df %>%
add_count(f1) %>%
filter(n == nlevels) %>%
select(-n) %>%
mutate(rn = row_number()) %>%
spread(f1, v1) %>%
select(-rn)
This gives
# a b
# <int> <int>
#1 10 NA
#2 11 NA
#3 NA 4
#4 NA 5
Now, if you want to remove NA's we can do
do.call("cbind.data.frame", lapply(df1, function(x) x[!is.na(x)]))
# a b
#1 10 4
#2 11 5
As we have filtered the dataframe which has only nlevels observations, we would have same number of rows for each column in the final dataframe.
split might be useful here to split df$v1 into parts corresponding to df$f1. Since you are always extracting equal length chunks, it can then simply be combined back to a data.frame:
spl <- split(df$v1, df$f1)
data.frame(spl[lengths(spl)==2])
# a b
#1 10 4
#2 11 5
Or do it all in one call by combining this with Filter:
data.frame(Filter(function(x) length(x)==2, split(df$v1, df$f1)))
# a b
#1 10 4
#2 11 5
Here is a solution using unstack :
unstack(
droplevels(df[ave(df$v1, df$f1, FUN = function(x) length(x) == 2)==1,]),
v1 ~ f1)
# a b
# 1 10 4
# 2 11 5
A variant, similar to #thelatemail's solution :
data.frame(Filter(function(x) length(x) == 2, unstack(df,v1 ~ f1)))
My tidyverse solution would be:
library(tidyverse)
df %>%
group_by(f1) %>%
filter(n() == 2) %>%
mutate(i = row_number()) %>%
spread(f1, v1) %>%
select(-i)
# # A tibble: 2 x 2
# a b
# * <dbl> <dbl>
# 1 10 4
# 2 11 5
or mixing approaches :
as_tibble(keep(unstack(df,v1 ~ f1), ~length(.x) == 2))
Using all base functions (but you should use tidyverse)
# Add count of instances
x$len <- ave(x$v1, x$f1, FUN = length)
# Filter, drop the count
x <- x[x$len==2, c('f1','v1')]
# Hacky pivot
result <- data.frame(
lapply(unique(x$f1), FUN = function(y) x$v1[x$f1==y])
)
colnames(result) <- unique(x$f1)
> result
a b
1 10 4
2 11 5
I'd like code this, may it helps for you
library(reshape2)
library(dplyr)
aa = data.frame(v1=c('a','a','b','b','c','c','c'),f1=c(10,11,4,5,0,1,2))
cc = aa %>% group_by(v1) %>% summarise(id = length((v1)))
dd= merge(aa,cc) #get the level
ee = dd[dd$aa==2,] #select number of level equal to 2
ee$id = rep(c(1,2),nrow(ee)/2) # reset index like (1,2,1,2)
dcast(ee, id~v1,value.var = 'f1')
all done!

convert data frame of counts to proportions by conditions in R

I would need to expand on this question: convert data frame of counts to proportions in R
I need to calculate proportion by one condition and retain the information of the dataset.
Reproducible example:
ID <- rep(c(1,2,3), each=3)
trial <- rep("a", 9)
variable1 <- sample(1:10, 9)
variable2 <- sample(1:10, 9)
variable3 <- sample(1:10, 9)
condition <- rep(c("i","j","k"), 3)
dat <- data.frame(cbind(ID, trial,variable1,variable2,variable3,condition))
For each variable I would like to have the proportion by the ID (i.e. 3 times)
Ideally the new variables would be stored in the same database as dat$variable1_p
I know how to do the trick by a series of for loops but I would like to learn how to use the apply function. Also to be able to expand it to more conditions if necessary.
We can use adply from the plyr package:
library(plyr)
adply(dat, 1, function(x)
c('variable1_p' = x$variable1 / sum(dat[x$ID == dat$ID,]$variable1)))
# ID trial variable1 variable2 variable3 condition variable1_p
# 1 1 a 3 5 4 i 0.20000000
# 2 1 a 8 9 9 j 0.53333333
# 3 1 a 4 4 8 k 0.26666667
# 4 2 a 7 10 5 i 0.50000000
# 5 2 a 6 8 10 j 0.42857143
# 6 2 a 1 1 7 k 0.07142857
# 7 3 a 10 6 3 i 0.47619048
# 8 3 a 9 7 6 j 0.42857143
# 9 3 a 2 3 2 k 0.09523810
Another option is to use dplyr, which would handle cases where there is more than one row per condition per ID:
library(dplyr)
dat %>%
group_by(ID, condition) %>%
mutate(sum_v1_cond = sum(variable1)) %>%
ungroup() %>%
group_by(ID) %>%
mutate(variable1_p = sum_v1_cond / sum(variable1)) %>%
select(-sum_v1_cond)
Edit - here's a full solution for variable1, variable2, and variable3:
adply(dat, 1, function(x)
c('variable1_p' = x$variable1 / sum(dat[x$ID == dat$ID,]$variable1),
'variable2_p' = x$variable2 / sum(dat[x$ID == dat$ID,]$variable2),
'variable3_p' = x$variable3 / sum(dat[x$ID == dat$ID,]$variable3)))
Data:
set.seed(123)
ID <- rep(c(1,2,3), each=3)
trial <- rep("a", 9)
variable1 <- sample(1:10, 9)
variable2 <- sample(1:10, 9)
variable3 <- sample(1:10, 9)
condition <- rep(c("i","j","k"), 3)
dat <- data.frame(ID, trial,variable1,variable2,variable3,condition,
stringsAsFactors = FALSE)

Resources