sampling by group in R

sampling by group in R - r

I have a large sample data of healthcare data called oct
Providers ID date ICD
Billy 4504 9/11 f.11
Billy 5090 9/10 r.05
Max 4430 9/01 k.11
Mindy 0812 9/30 f.11
etc.
I want a random sample of ID numbers for each provider. I have tried.
review <- oct %>% group_by(Providers) %>% do (sample(oct$ID, size = 5, replace= FALSE, prob = NULL))

Example using dplyr::sample_n
library(dplyr)
set.seed(1)
mtcars %>% group_by(cyl) %>% sample_n(3)
# A tibble: 9 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
2 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
3 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
4 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
5 21 6 160 110 3.9 2.88 17.0 0 1 4 4
6 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
7 15 8 301 335 3.54 3.57 14.6 0 1 5 8
8 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
9 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
If you'd like to just select a specific variable (ID in your question):
set.seed(1)
mtcars %>%
group_by(cyl) %>%
sample_n(3) %>%
pull(mpg)
[1] 22.8 32.4 33.9 19.7 21.0 19.2 15.0 15.5 14.7

Related

Create list-column of dataframes with cur_data() in R. Problems with accessing data

I do the following:
mtcars %>%
group_by(cyl) %>%
summarise(
model = list(lm(disp ~ mpg, data = cur_data())),
data = list(dat = cur_data())
) -> df
However, when I want to access the list-column data, it gives me this error:
> df$data
$dat
Error: Can't subset elements that don't exist.
x Locations 2, 3, 4, 5, 6, etc. don't exist.
ℹ There are only 1 element.
While the actual glimpse looks like this:
> glimpse(df)
Rows: 3
Columns: 3
$ cyl <dbl> 4, 6, 8
$ model <list> [<233.067448, -4.797961, -15.673940, 30.702798, 17.126060, 1.086485, -11.509437, 0.68342…
$ data <named list> [<tbl_df[11 x 11]>, <tbl_df[7 x 11]>, <tbl_df[14 x 11]>]
Not really sure what is going wrong here...

Change the order of operation.
library(dplyr)
mtcars %>%
group_by(cyl) %>%
summarise(
data = list(dat = cur_data()),
model = list(lm(disp ~ mpg, data = cur_data())),
) -> df
df$data
#$dat
# A tibble: 11 x 10
# mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 22.8 108 93 3.85 2.32 18.6 1 1 4 1
# 2 24.4 147. 62 3.69 3.19 20 1 0 4 2
# 3 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
# 4 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1
# 5 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2
#...
#...
#$dat
# A tibble: 7 x 10
# mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 21 160 110 3.9 2.62 16.5 0 1 4 4
#2 21 160 110 3.9 2.88 17.0 0 1 4 4
#3 21.4 258 110 3.08 3.22 19.4 1 0 3 1
#4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
#5 19.2 168. 123 3.92 3.44 18.3 1 0 4 4
#6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4
#7 19.7 145 175 3.62 2.77 15.5 0 1 5 6
#$dat
# A tibble: 14 x 10
# mpg disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 18.7 360 175 3.15 3.44 17.0 0 0 3 2
# 2 14.3 360 245 3.21 3.57 15.8 0 0 3 4
# 3 16.4 276. 180 3.07 4.07 17.4 0 0 3 3
# 4 17.3 276. 180 3.07 3.73 17.6 0 0 3 3
# 5 15.2 276. 180 3.07 3.78 18 0 0 3 3
#...
I don't know the exact reason for this error but my guess is that after doing model = list(lm(disp ~ mpg, data = cur_data())) cur_data() now consists of current grouped dataframe as well as the model which is causing issues in storing the data.

Change column names in nested df using map function

I am working on a project where I created a function to edit column names of a given df:
fix_names <- function(a, b, c) {
if (is.data.frame(a) == TRUE & is.character(b) == TRUE & is.character(c) == TRUE) {
str_replace_all(colnames(a), pattern = b, replacement = c)
} else {
return("invalid inputs")
}
}
And then I have a column, data, that contains four data frames. I am trying to rename the columns of all the data frames in data using my function above inside of a map function. It's successful in fixing the names, but I cannot figure out how to apply it to the df since the output is a list and the data frames are nested. Here's what I have:
map(.x = df$data, ~fix_names(., "OldName", "NewName"))
Thank you!
Edit: adding example df using mtcars
data(mtcars)
mtcars %>%
group_by(cyl) %>%
nest() -> nestMtcars
map(.x = nestMtcars$data, ~fix_names(., "mpg", "MPG"))

You could transpose the nested list to run the map function, and transpose it back to its original form :
library(stringr)
library(purrr)
fix_names <- function(a, b, c) {
if (is.data.frame(a) == TRUE & is.character(b) == TRUE & is.character(c) == TRUE) {
colnames(a) <- str_replace_all(colnames(a), pattern = b, replacement = c)
a
} else {
return("invalid inputs")
}
}
nestMtcars %>% transpose %>%
map(~{.x$data <- fix_names(.x$data,"mpg","MPG"); .x}) %>%
transpose
$cyl
$cyl[[1]]
[1] 6
$cyl[[2]]
[1] 4
$cyl[[3]]
[1] 8
$data
$data[[1]]
# A tibble: 7 x 10
MPG disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 160 110 3.9 2.62 16.5 0 1 4 4
2 21 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 258 110 3.08 3.22 19.4 1 0 3 1
4 18.1 225 105 2.76 3.46 20.2 1 0 3 1
5 19.2 168. 123 3.92 3.44 18.3 1 0 4 4
6 17.8 168. 123 3.92 3.44 18.9 1 0 4 4
7 19.7 145 175 3.62 2.77 15.5 0 1 5 6
$data[[2]]
# A tibble: 11 x 10
MPG disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 108 93 3.85 2.32 18.6 1 1 4 1
2 24.4 147. 62 3.69 3.19 20 1 0 4 2
3 22.8 141. 95 3.92 3.15 22.9 1 0 4 2
4 32.4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 30.4 75.7 52 4.93 1.62 18.5 1 1 4 2
6 33.9 71.1 65 4.22 1.84 19.9 1 1 4 1
7 21.5 120. 97 3.7 2.46 20.0 1 0 3 1
8 27.3 79 66 4.08 1.94 18.9 1 1 4 1
9 26 120. 91 4.43 2.14 16.7 0 1 5 2
10 30.4 95.1 113 3.77 1.51 16.9 1 1 5 2
11 21.4 121 109 4.11 2.78 18.6 1 1 4 2
$data[[3]]
# A tibble: 14 x 10
MPG disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 18.7 360 175 3.15 3.44 17.0 0 0 3 2
2 14.3 360 245 3.21 3.57 15.8 0 0 3 4
3 16.4 276. 180 3.07 4.07 17.4 0 0 3 3
4 17.3 276. 180 3.07 3.73 17.6 0 0 3 3
5 15.2 276. 180 3.07 3.78 18 0 0 3 3
6 10.4 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 460 215 3 5.42 17.8 0 0 3 4
8 14.7 440 230 3.23 5.34 17.4 0 0 3 4
9 15.5 318 150 2.76 3.52 16.9 0 0 3 2
10 15.2 304 150 3.15 3.44 17.3 0 0 3 2
11 13.3 350 245 3.73 3.84 15.4 0 0 3 4
12 19.2 400 175 3.08 3.84 17.0 0 0 3 2
13 15.8 351 264 4.22 3.17 14.5 0 1 5 4
14 15 301 335 3.54 3.57 14.6 0 1 5 8

summarise from dplyr data return

I am using summarise to perform a calculation for each row of my data frame. Although the results are ok as shown in my console, I cant seem to be apple to insert them in the same data frame, or even create a new one. Any help? I am grouping them first based on a column (postcode) and then performing the calculation for all rows with the same postcode. Thank you in advance
my_data %>%
group_by(as.character(Postcode))%>%
summarise(wgt_inw_PC = sum(E_W_GEM))

We just need to assign it back to the object
library(dplyr)
my_data <- my_data %>%
group_by(Postcode = Postcode)%>%
summarise(wgt_inw_PC = sum(E_W_GEM, na.rm = TRUE))
Or another option is the compound assignment operator from magrittr (%<>%)
library(magrittr)
my_data %<>%
group_by(Postcode = Postcode)%<>%
summarise(wgt_inw_PC = sum(E_W_GEM, na.rm = TRUE))

I am not sure if this is what you're looking for (I recommend using reproducible examples when asking questions on SO), but here's a code for conducting a calculation in groups and then appending this to your data frame:
library(dplyr)
#>
#> Attaching package: 'dplyr'
mtcars %>%
group_by(cyl)%>%
mutate(wgt = sum(wt))
#> # A tibble: 32 x 12
#> # Groups: cyl [3]
#> mpg cyl disp hp drat wt qsec vs am gear carb wgt
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 21.8
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 21.8
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 25.1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.8
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 56.0
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 21.8
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 56.0
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 25.1
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 25.1
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 21.8
#> # ... with 22 more rows

custom grouped dplyr function (sample_n)

I am trying to apply a sampling function in a grouped fashion to a data frame, where it should sample n samples from each group, or all group members if the group size is smaller than n.
Using dplyr, I first tried
library(dplyr)
mtcars %>% group_by(cyl) %>% sample_n(2)
This works when n is smaller than all the group sizes but does not take the full group when I choose n larger than the group size (note that there are 7 cars in one of the cyl groups):
mtcars %>% group_by(cyl) %>% sample_n(8)
Error: `size` must be less or equal than 7 (size of data),
set `replace` = TRUE to use sampling with replacement
I tried to solve this by creating an adapted group_n function like so:
sample_n_or_all <- function(tbl, n) {
if (nrow(tbl) < n)return(tbl)
sample_n(tbl, n)
}
but using my custom function (mtcars %>% group_by(cyl) %>% sample_n_or_all(8)) generates the same error.
Any suggestions how I can adapt my function so I can apply it to each of the groups? Or another solution to the problem?

We could check the number of rows in the group and pass the value to sample_n accordingly.
library(dplyr)
n <- 8
temp <- mtcars %>% group_by(cyl) %>% sample_n(if(n() < n) n() else n)
temp
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
# 4 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
# 5 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 7 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 8 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 9 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#10 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
# … with 13 more rows
We can check number of rows in each group after that.
table(temp$cyl)
#4 6 8
#8 7 8
table(mtcars$cyl)
# 4 6 8
#11 7 14

We can do this without using a logical condition with pmin
library(dplyr)
tmp <- mtcars %>%
group_by(cyl) %>%
sample_n(pmin(n(), n))
# A tibble: 23 x 11
# Groups: cyl [3]
# mpg cyl disp hp drat wt qsec vs am gear carb
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
# 2 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
# 3 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
# 4 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
# 5 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
# 6 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
# 7 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# 8 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
# 9 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
#10 21 6 160 110 3.9 2.62 16.5 0 1 4 4
# … with 13 more rows
-checking
table(tmp$cyl)
# 4 6 8
# 8 7 8

use assign() inside purrr:walk()

I have a number of dataframes and a series of changes I want to make to each of them. For this example, let to the desired change be simply making each data frame a tibble using as_tibble. I know there are various ways of doing this, but I'd like to do this using purrr:walk.
For data frames df1 and df2,
df1 <- mtcars
df2 <- mtcars
I'd like to do the equivalent of
df1 %<>% as_tibble
df2 %<>% as_tibble
using walk. My attempt:
library(tidyverse)
walk(c(df1, df2), ~ assign(deparse(substitute(.)), as_tibble(.)))
This runs but does not make the desired change:
is_tibble(df1)
#> [1] FALSE

Here is how you can combine assign with walk (see the comments the code for more explanation)-
library(tidyverse)
# data
df1 <- mtcars
df2 <- mtcars
# creating tibbles
# this creates a list of objects with names ("df1", "df2")
tibble::lst(df1, df2) %>%
purrr::walk2(
.x = names(.), # names to assign
.y = ., # object to be assigned
.f = ~ assign(x = .x,
value = tibble::as.tibble(.y),
envir = .GlobalEnv)
)
# checking the newly created tibbles
df1
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
df2
#> # A tibble: 32 x 11
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
#> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
#> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
#> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
#> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
#> 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
#> 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
#> 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
#> 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
#> 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
#> # ... with 22 more rows
Created on 2018-11-13 by the reprex package (v0.2.1)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

sampling by group in R - r

Related

Create list-column of dataframes with cur_data() in R. Problems with accessing data

Change column names in nested df using map function

summarise from dplyr data return

custom grouped dplyr function (sample_n)

use assign() inside purrr:walk()

Categories

Resources