Simple random sampling from groups with specified sample size - R

So I have a dataframe (my.df) which I have grouped by the variable "strat". Each row consists of numerous variables. Example of what it looks like is below - I've simplified my.df for this example since it is quite large. What I want to do next is draw a simple random sample from each group. If I wanted to draw 5 observations from each group I would use this code:
new_df <- my.df %>% group_by(strat) %>% sample_n(5)
However, I have a different specified sample size for each group. I have these sample sizes in a vector nj:
nj <- c(3, 4, 2)
So ideally, I would want 3 observations from my first stratum, 4 observations from my second stratum and 2 observations from my last stratum. I'm not sure how I can sample by group using each group's own sample size (without having to write out the sampling step however many times I need to)? Thanks in advance!
my.df looks like:
var1 var2 strat
15 3 1
13 5 3
8 6 2
12 70 3
11 10 1
14 4 2

You can use stratified from my "splitstackshape" package.
Here's some sample data:
set.seed(1)
my.df <- data.frame(var1 = sample(100, 20, TRUE),
var2 = runif(20),
strat = sample(3, 20, TRUE))
table(my.df$strat)
#
# 1 2 3
# 5 9 6
Here's how you can use stratified:
library(splitstackshape)
# nj needs to be a named vector
nj <- c("1" = 3, "2" = 4, "3" = 2)
stratified(my.df, "strat", nj)
# var1 var2 strat
# 1: 72 0.7942399 1
# 2: 39 0.1862176 1
# 3: 50 0.6684667 1
# 4: 21 0.2672207 2
# 5: 69 0.4935413 2
# 6: 91 0.1255551 2
# 7: 78 0.4112744 2
# 8: 7 0.3403490 3
# 9: 27 0.9347052 3
table(.Last.value$strat)
#
# 1 2 3
# 3 4 2

Since your posted data is too small to draw those sample sizes from, let us consider this example on the iris dataset:
library(tidyverse)
nj <- c(3, 5, 6)
set.seed(1)
iris %>% group_split(Species) %>% map2_df(nj, ~sample_n(.x, size = .y))
# A tibble: 14 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 4.6 3.1 1.5 0.2 setosa
2 4.4 3 1.3 0.2 setosa
3 5.1 3.5 1.4 0.2 setosa
4 6 2.7 5.1 1.6 versicolor
5 6.3 2.5 4.9 1.5 versicolor
6 5.8 2.6 4 1.2 versicolor
7 6.1 2.9 4.7 1.4 versicolor
8 5.8 2.7 4.1 1 versicolor
9 6.4 2.8 5.6 2.2 virginica
10 6.9 3.2 5.7 2.3 virginica
11 6.2 3.4 5.4 2.3 virginica
12 6.9 3.1 5.1 2.3 virginica
13 6.7 3 5.2 2.3 virginica
14 7.2 3.6 6.1 2.5 virginica
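On purrr >= 1.0.0, map2_df() is superseded; a sketch of the same logic written with map2() plus list_rbind():
library(dplyr)
library(purrr)
set.seed(1)
iris %>% group_split(Species) %>% map2(nj, ~sample_n(.x, size = .y)) %>% list_rbind()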

You can bring the nj values into the dataframe as a column and then use sample_n by group.
library(dplyr)
df %>%
mutate(nj = nj[strat]) %>%
group_by(strat) %>%
sample_n(size = min(first(nj), n()))
Note that the above works because strat has values 1, 2, 3. For a general solution when the groups do not have such values you could use:
df %>%
mutate(nj = nj[match(strat, unique(strat))]) %>%
group_by(strat) %>%
sample_n(size = min(first(nj), n()))
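If the strata labels are not simple consecutive integers, another option is to keep the per-stratum sizes in a small lookup table and join it on before sampling. This is a sketch; nj_df is a hypothetical helper, and it assumes nj is ordered to match the sorted stratum labels:
library(dplyr)
nj_df <- data.frame(strat = sort(unique(df$strat)), nj = nj)
df %>%
left_join(nj_df, by = "strat") %>%
group_by(strat) %>%
sample_n(size = min(first(nj), n())) %>%
ungroup() %>%
select(-nj)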

Related

Setting values to NA in one column based on conditions in another column

Here's a simplified mock dataframe:
df1 <- data.frame(amb = c(2.5,3.6,2.1,2.8,3.4,3.2,1.3,2.5,3.2),
warm = c(3.6,5.3,2.1,6.3,2.5,2.1,2.4,6.2,1.5),
sensor = c(1,1,1,2,2,2,3,3,3))
I'd like to set all values in the "amb" column to NA if they're in sensor 1, but retain the values in the "warm" column for sensor 1. Here's what I'd like the final output to look like:
amb warm sensor
NA 3.6 1
NA 5.3 1
NA 2.1 1
2.8 6.3 2
3.4 2.5 2
3.2 2.1 2
1.3 2.4 3
2.5 6.2 3
3.2 1.5 3
Using R version 4.0.2, Mac OS X 10.13.6
A possible solution, based on dplyr:
library(dplyr)
df1 %>%
mutate(amb = ifelse(sensor == 1, NA, amb))
#> amb warm sensor
#> 1 NA 3.6 1
#> 2 NA 5.3 1
#> 3 NA 2.1 1
#> 4 2.8 6.3 2
#> 5 3.4 2.5 2
#> 6 3.2 2.1 2
#> 7 1.3 2.4 3
#> 8 2.5 6.2 3
#> 9 3.2 1.5 3
This seems to be best handled with the vectorized replacement function is.na<-:
is.na(df1$amb) <- df1$sensor %in% c(1)  # the c() isn't actually needed here
But to be most general, and to use a proper test for equality between floating point numbers, the answer might be:
is.na(df1$amb) <- abs(df1$sensor - 1) < 1e-16
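For comparison, the same replacement written as plain base R subassignment on the example df1:
df1$amb[df1$sensor == 1] <- NA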

R: Split all groups in half (dplyr)

My data is grouped, but I would like to split each group in two, as illustrated in the example below. It doesn't really matter what the content of group_half will be; it can be anything like 1a/1b or 1.1/1.2. Any recommendations on how to do this using dplyr? Thanks!
col_1 <- c(23,31,98,76,47,65,23,76,3,47,54,56)
group <- c(1,1,1,1,2,2,2,2,3,3,3,3)
group_half <- c(1.1, 1.1, 1.2, 1.2, 2.1, 2.1, 2.2, 2.2, 3.1, 3.1, 3.2, 3.2)
df1 <- data.frame(col_1, group, group_half)
# col_1 group group_half
# 23 1 1.1
# 31 1 1.1
# 98 1 1.2
# 76 1 1.2
# 47 2 2.1
# 65 2 2.1
# 23 2 2.2
# 76 2 2.2
# 3 3 3.1
# 47 3 3.1
# 54 3 3.2
# 56 3 3.2
Here are two options:
If you always have an even number of rows in each group:
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(group_half = paste(group, rep(1:2, each = n()/2), sep = '.')) %>%
ungroup
# col_1 group group_half
# <dbl> <dbl> <chr>
# 1 23 1 1.1
# 2 31 1 1.1
# 3 98 1 1.2
# 4 76 1 1.2
# 5 47 2 2.1
# 6 65 2 2.1
# 7 23 2 2.2
# 8 76 2 2.2
# 9 3 3 3.1
#10 47 3 3.1
#11 54 3 3.2
#12 56 3 3.2
This will work irrespective of the number of rows in each group:
df1 %>%
group_by(group) %>%
mutate(group_half = paste(group,as.integer(row_number() > n()/2) + 1, sep = '.')) %>%
ungroup
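A third option, sketched here on the assumption that row order within each group defines which half a row belongs to, is dplyr::ntile(), which also copes with odd group sizes:
library(dplyr)
df1 %>%
group_by(group) %>%
mutate(group_half = paste(group, ntile(row_number(), 2), sep = '.')) %>%
ungroup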

Select varying number of top_n for different groups using dplyr

I have the following dataframe. I would prefer to solve this problem with dplyr.
For each zone I want at minimum two values; values > 4.0 are preferred.
Therefore, for zone 10 all values (being > 4.0) are kept. For zone 20, the top two values are picked. Similarly for zone 30.
zone <- c(rep(10,4), rep(20, 4), rep(30, 4))
set.seed(1)
value <- c(4.5,4.3,4.6, 5,5, rep(3,7)) + round(rnorm(12, sd = 0.1),1)
df <- data.frame(zone, value)
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 2.9
7 20 3.0
8 20 3.1
9 30 3.1
10 30 3.0
11 30 3.2
12 30 3.0
The desired output is as follows
> df
zone value
1 10 4.4
2 10 4.3
3 10 4.5
4 10 5.2
5 20 5.0
6 20 3.1
7 30 3.1
8 30 3.2
I thought of using top_n but it picks the same number for each zone.
You could dynamically calculate n in top_n:
library(dplyr)
df %>% group_by(zone) %>% top_n(max(sum(value > 4), 2), value)
# zone value
# <dbl> <dbl>
#1 10 4.4
#2 10 4.3
#3 10 4.5
#4 10 5.2
#5 20 5
#6 20 3.1
#7 30 3.1
#8 30 3.2
You can also do it like this:
library(tidyverse)
df %>%
group_by(zone) %>%
filter(row_number(-value) <= 2 | value > 4)
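If you are on dplyr >= 1.0.0, where top_n() is superseded by slice_max(), a sketch that computes the per-zone count explicitly via group_modify() (the 4.0 threshold is taken from the question):
library(dplyr)
df %>%
group_by(zone) %>%
group_modify(~ {
  n_keep <- max(sum(.x$value > 4), 2)  # keep at least two rows per zone
  slice_max(.x, value, n = n_keep, with_ties = FALSE)
}) %>%
ungroup()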

Creating new column names from existing column names using paste function

Assume I have a data frame df with variables A, B and C in it. I would like to create 3 more corresponding columns with names A_ranked, B_ranked and C_ranked. It doesn't matter how I will fill them for the sake of this question, so let's assume that I will set them all to 5. I tried the following code:
for (i in 1:length(df)){
df%>%mutate(
paste(colnames(df)[i],"ranked", sep="_")) = 5
}
I also tried:
for (i in 1:length(df)){
df%>%mutate(
as.vector(paste(colnames(df)[i],"ranked", sep="_")) = 5
}
And:
for (i in 1:length(df)){
df$paste(colnames(df)[i],"ranked", sep="_")) = 5
}
None of them seems to work. Can somebody please tell me what is the correct way to do this?
Here is a data.table option using the iris data set (here we create 4 more columns based on colnames of existing columns).
# data
df <- iris[, 1:4]
str(df)
# new columns
library(data.table)
setDT(df)[, paste(colnames(df), "ranked", sep = "_") := 5][]
# output
Sepal.Length Sepal.Width Petal.Length Petal.Width Sepal.Length_ranked
1: 5.1 3.5 1.4 0.2 5
2: 4.9 3.0 1.4 0.2 5
3: 4.7 3.2 1.3 0.2 5
4: 4.6 3.1 1.5 0.2 5
5: 5.0 3.6 1.4 0.2 5
---
146: 6.7 3.0 5.2 2.3 5
147: 6.3 2.5 5.0 1.9 5
148: 6.5 3.0 5.2 2.0 5
149: 6.2 3.4 5.4 2.3 5
150: 5.9 3.0 5.1 1.8 5
Sepal.Width_ranked Petal.Length_ranked Petal.Width_ranked
1: 5 5 5
2: 5 5 5
3: 5 5 5
4: 5 5 5
5: 5 5 5
---
146: 5 5 5
147: 5 5 5
148: 5 5 5
149: 5 5 5
150: 5 5 5
# If you want to fill the new columns with different values you can try something like
setDT(df)[, paste(colnames(df), "ranked", sep = "_") := list(Sepal.Length/2,
Sepal.Width/2,
Petal.Length/2,
Petal.Width/2)][]
This should work:
df[paste(names(df), "ranked", sep = "_")] <- 5
df
# A B C A_ranked B_ranked C_ranked
# 1 1 2 3 5 5 5
Data:
df <- data.frame(A = 1, B = 2, C = 3)
Does this help?
dat <- data.frame(A=5,B=5,C=5)
dat %>%
mutate_each(funs(ranked=sum)) %>%
head()
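Note that mutate_each() and funs() are deprecated in current dplyr; a hedged sketch of the modern equivalent uses across() with a .names spec (dplyr >= 1.0.0):
library(dplyr)
dat <- data.frame(A = 5, B = 5, C = 5)
dat %>%
mutate(across(everything(), sum, .names = "{.col}_ranked"))
# adds A_ranked, B_ranked and C_ranked columns, each holding the column sum (5)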

cbind multiple, individual columns in a single data frame using column numbers

I have a single data frame of 100 columns and 25 rows. I would like to cbind different groupings of columns (sometimes as many as 30 columns) in several new data frames without having to type out each column name every time.
Some columns that I want fall individually, e.g. 6 and 72, and some lie next to each other, e.g. columns 23, 24, 25, 26 (23:26).
Usually I would use:
z <- cbind(visco$fish, visco$bird)
for example, but I have too many columns and need to create too many new data frames to be typing the name of every column that I need every time. Generally I do not attach my data.
I would like to use column numbers, something like:
z <- cbind(6 , 72 , 23:26, data=visco)
and also retain the original column names, not the automatically generated V1, V2. I have tried adding deparse.level = 2, but my column names then become "visco$fish" rather than the original "fish".
I feel there should be a simple answer to this, but so far I have failed to find anything that works as I would like.
df <- data.frame(AA = 11:15, BB = 2:6, CC = 12:16, DD = 3:7, EE = 23:27)
df
# AA BB CC DD EE
# 1 11 2 12 3 23
# 2 12 3 13 4 24
# 3 13 4 14 5 25
# 4 14 5 15 6 26
# 5 15 6 16 7 27
df1 <- data.frame(cbind(df,df,df,df))
df1
# AA BB CC DD EE AA.1 BB.1 CC.1 DD.1 EE.1 AA.2 BB.2 CC.2 DD.2 EE.2 AA.3 BB.3
# 1 11 2 12 3 23 11 2 12 3 23 11 2 12 3 23 11 2
# 2 12 3 13 4 24 12 3 13 4 24 12 3 13 4 24 12 3
# 3 13 4 14 5 25 13 4 14 5 25 13 4 14 5 25 13 4
# 4 14 5 15 6 26 14 5 15 6 26 14 5 15 6 26 14 5
# 5 15 6 16 7 27 15 6 16 7 27 15 6 16 7 27 15 6
# CC.3 DD.3 EE.3
# 1 12 3 23
# 2 13 4 24
# 3 14 5 25
# 4 15 6 26
# 5 16 7 27
Result <- data.frame(cbind(df1[,c(1:5,14:17,20)]))
Result
# AA BB CC DD EE DD.2 EE.2 AA.3 BB.3 EE.3
# 1 11 2 12 3 23 3 23 11 2 23
# 2 12 3 13 4 24 4 24 12 3 24
# 3 13 4 14 5 25 5 25 13 4 25
# 4 14 5 15 6 26 6 26 14 5 26
# 5 15 6 16 7 27 7 27 15 6 27
Note: columns with the same name are automatically disambiguated by R, which appends .1, .2, .3 to each repeated appearance.
Here's an example of how to do this using the select function from dplyr, which should be your go-to package for this type of data wrangling:
> library(dplyr)
> df <- head(iris)
> df
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
>
>## select by variable name
>newdf <- df %>% select(Sepal.Length, Sepal.Width,Species)
> newdf
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
6 5.4 3.9 setosa
>## select by variable indices
> newdf <- df %>% select(1:2,5)
> newdf
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
6 5.4 3.9 setosa
However, I'm not sure why you would need to do this? Can you not run your analyses on the original dataframe?
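Applied to the columns from your question (visco being your original data frame), a minimal sketch:
library(dplyr)
z <- visco %>% select(6, 72, 23:26)  # column positions; the original names are kept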
I understand your question as subsetting a large dataframe into smaller ones, which can be achieved in different ways. One way is with the data.table package, which retains the column names while letting you subset by column indices.
If you have your data as a dataframe, you can just do:
DT<- data.table(df)
# You still have to define your subsets of columns you need to create
sub_1<-c(2,3)
sub_2<-c(2:5,9)
sub_3<-c(1:2,5:6,10)
DT[ ,sub_2, with = FALSE]
Output
bird cat dog rat car
1: 0.2682538 0.1386834 0.01633384 0.5336649 0.43432878
2: 0.2418727 0.7530654 0.26999873 0.2679446 0.00859734
3: 0.1211858 0.2563736 0.92637523 0.8572615 0.63165705
4: 0.4556401 0.2343427 0.09324584 0.8731174 0.50098461
5: 0.1646126 0.9258622 0.86957980 0.3636781 0.89608415
Data
require("data.table")
DT <- data.table(matrix(runif(10*10),5,10))
colnames(DT) <- c("fish","bird","cat","dog","rat","tiger","insect","boat","car", "cycle")
Try this:
z <- visco[c(6,72,23:26)]
