Okay so it's one of those days where a previously working piece of code suddenly breaks. Here's a reprex of the code in question:
test = data.frame(factor1 = sample(1:5, 10, replace=T),
factor2 = sample(letters[1:5], 10, replace=T),
variable = sample(100:200, 10))
group_vars = c('factor1','factor2') %>% paste(., collapse = ',')
> test %>% dplyr::group_by_(group_vars)
Error in parse(text = x) : <text>:1:8: unexpected ','
1: factor1,
^
Now I sweaaaar this worked until today. Of course dplyr is trying to do away with the 'x_' functions anyway, but I've tried to plug everything I can think of into group_by()- using combinations of !!, !!!, sym(), quo(), enquo(), etc and can't figure it out. I've tried not pasting the column names together and AT BEST it simply takes the first one and ignores everything else. Most commonly I get the following error message:
Error: Column <chr> must be length 10 (the number of rows) or one, not 2
I've also read over Hadley's dplyr programming guide (https://dplyr.tidyverse.org/articles/programming.html), WHICH SEEMS to cover the issue, except that I'm generating the column names internally and not accepting them as arguments to the function. Has anyone come across this or understand quoting well enough to know a solution to this?
Also, to be clear, this works when only using a single grouping variable. The problem is with multiple groups.
Thanks!
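Instead of pasting the names together and using group_by_ (deprecated, and it would not work here anyway, because it parses each string as a single expression and 'factor1,factor2' is not valid R), we can pass the character vector directly to group_by_at: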
library(dplyr)
group_vars <- c('factor1','factor2')
test %>%
group_by_at(group_vars)
# A tibble: 10 x 3
# Groups: factor1, factor2 [10]
# factor1 factor2 variable
# <int> <fct> <int>
# 1 1 d 145
# 2 5 e 119
# 3 4 a 181
# 4 3 e 155
# 5 3 d 164
# 6 3 b 135
# 7 4 e 137
# 8 4 d 197
# 9 2 d 142
#10 2 c 110
Another option is to convert the names to symbols (syms from rlang) and splice them (!!!) into group_by:
test %>%
group_by(!!! rlang::syms(group_vars))
If we go the paste route, then one option is parse_exprs (from rlang), which splits a string on ';' into separate expressions:
group_vars = c('factor1','factor2') %>% paste(., collapse = ';')
test %>%
group_by(!!! rlang::parse_exprs(group_vars))
# A tibble: 10 x 3
# Groups: factor1, factor2 [10]
# factor1 factor2 variable
# <int> <fct> <int>
# 1 1 d 145
# 2 5 e 119
# 3 4 a 181
# 4 3 e 155
# 5 3 d 164
# 6 3 b 135
# 7 4 e 137
# 8 4 d 197
# 9 2 d 142
#10 2 c 110
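In dplyr 1.0.0 and later, the scoped verbs such as group_by_at are superseded by across(), so a more current spelling of the same idea may be useful. The following is a minimal sketch, not part of the original answer; the small data frame and the name grp_cols are made up for illustration.

```r
library(dplyr)

# Small illustrative data frame with the same column names as `test`
# in the question (the values are invented for this sketch).
test <- data.frame(
  factor1  = c(1, 1, 2, 2),
  factor2  = c("a", "a", "b", "c"),
  variable = c(10, 20, 30, 40)
)

# `grp_cols` is a hypothetical name; it avoids shadowing dplyr::group_vars().
grp_cols <- c("factor1", "factor2")

# all_of() selects the columns named in the character vector;
# across() hands them to group_by() as grouping columns.
grouped <- test %>%
  group_by(across(all_of(grp_cols)))

group_vars(grouped)
```

Because all_of() takes a plain character vector, no pasting, parsing, or unquoting is needed.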
I'm looking for a more eloquent way to write R code for a kind of case that I've encountered more than once. Here is an example of the data and some code that accomplishes the result I want:
library(tidyverse)
df <- tibble(id = 1:5, primary_county = 101:105, secondary_county = 201:205)
specific_counties <- c(101, 103, 202, 205)
df |>
mutate(target_area =
primary_county %in% specific_counties | secondary_county %in% specific_counties)
The result is:
# A tibble: 5 × 4
id primary_county secondary_county target_area
<int> <int> <int> <lgl>
1 1 101 201 TRUE
2 2 102 202 TRUE
3 3 103 203 TRUE
4 4 104 204 FALSE
5 5 105 205 TRUE
I want to know if there is a way to get the same result using code that would be more succinct and eloquent if I were dealing with more columns of the "..._county" variety. Specifically, in my code above, the expression %in% specific_counties must be repeated with an | for each extra column I want to handle. Is there a way to not have to repeat this kind of phrase multiple times?
These logical rowwise operations are superbly well handled by dplyr::if_any() or dplyr::if_all():
library(dplyr)
df %>%
mutate(target_area = if_any(ends_with('county'), ~. %in% specific_counties))
# A tibble: 5 × 4
id primary_county secondary_county target_area
<int> <int> <int> <lgl>
1 1 101 201 TRUE
2 2 102 202 TRUE
3 3 103 203 TRUE
4 4 104 204 FALSE
5 5 105 205 TRUE
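Since if_all() is mentioned but not shown, here is a small sketch of how the two differ. It reuses df from the question; the vector both_counties is a hypothetical lookup chosen so that the two functions give different answers.

```r
library(dplyr)

df <- tibble(id = 1:5, primary_county = 101:105, secondary_county = 201:205)

# Hypothetical lookup vector: row 1 matches in both columns,
# row 3 matches in the primary column only.
both_counties <- c(101, 201, 103)

flagged <- df %>%
  mutate(
    # TRUE when at least one *_county column matches
    any_match = if_any(ends_with("county"), ~ .x %in% both_counties),
    # TRUE only when every *_county column matches
    all_match = if_all(ends_with("county"), ~ .x %in% both_counties)
  )

flagged
```

Here any_match is TRUE for rows 1 and 3, while all_match is TRUE for row 1 only.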
We can also use:
purrr::reduce with |,
rowSums with as.logical
purrr::pmap_lgl with any(c(...) %in% x)
library(purrr)
library(dplyr)
df %>%
mutate(target_area = reduce(across(ends_with('county'), ~.x %in% specific_counties),
`|`))
## OR ##
df %>%
mutate(target_area = rowSums(across(ends_with('county'), ~.x %in% specific_counties)) %>%
as.logical)
## OR ##
df %>%
mutate(target_area = pmap_lgl(across(ends_with('county')),
~any(c(...) %in% specific_counties)))
For reference, this other answer of mine shows similar usages for if_any, and reduce(|) in a filter() operation:
R - Remove rows from dataframe that contain only zeros in numeric columns, base R and pipe-friendly methods?
Additional related questions/answers:
Logical function across multiple columns using "any" function
How to create a new column based on if any of a subset of columns are NA with the dplyr
This generalizes a little beyond what you have, though I'm not sure how "eloquent" I'd call it:
df %>%
mutate(
target_area = rowSums(
sapply(select(cur_data(), matches("_county")),
`%in%`, specific_counties)) > 0
)
# # A tibble: 5 x 4
# id primary_county secondary_county target_area
# <int> <int> <int> <lgl>
# 1 1 101 201 TRUE
# 2 2 102 202 TRUE
# 3 3 103 203 TRUE
# 4 4 104 204 FALSE
# 5 5 105 205 TRUE
Or you can list the columns explicitly, replacing the select(.., matches(..)) with list(primary_county, secondary_county).
Add as many columns to the list(..) as you want.
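As a sketch of the explicit-list variant described above (using the question's df and specific_counties; the result name res is just for illustration):

```r
library(dplyr)

df <- tibble(id = 1:5, primary_county = 101:105, secondary_county = 201:205)
specific_counties <- c(101, 103, 202, 205)

res <- df %>%
  mutate(
    # sapply() returns a logical matrix with one column per listed vector;
    # rowSums() > 0 is TRUE when any column matched in that row.
    target_area = rowSums(
      sapply(list(primary_county, secondary_county),
             `%in%`, specific_counties)) > 0
  )

res$target_area
```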
I have two data frames:
dat <- data.frame(Digits_Lower = 1:5,
Digits_Upper = 6:10,
random = 20:24)
dat
#> Digits_Lower Digits_Upper random
#> 1 1 6 20
#> 2 2 7 21
#> 3 3 8 22
#> 4 4 9 23
#> 5 5 10 24
cb <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"),
x = 1:2,
y = 3:4)
cb
#> Digits x y
#> 1 Digits_Lower 1 3
#> 2 Digits_Upper 2 4
I am trying to perform some operation on multiple columns in dat similar to these examples: In data.table: iterating over the rows of another data.table and R multiply columns by values in second dataframe. However, I
am hoping to operate on these columns with an extended expression for every value in its corresponding row in cb. The solution should be applicable
for a large dataset. I have created this for-loop so far.
dat.loop <- dat
for(i in seq_len(nrow(cb)))
{
#create new columns from the Digits column of `cb`
dat.loop[paste0("disp", sep = '.', cb$Digits[i])] <-
#some operation using every value in a column in `dat` with its corresponding row in `cb`
(dat.loop[, cb$Digits[i]]- cb$y[i]) * cb$x[i]
}
dat.loop
#> Digits_Lower Digits_Upper random disp.Digits_Lower disp.Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
I will then perform operations on the data that I appended to dat in dat.loop applying a similar
for-loop, and then perform yet another operation on those values. My dataset is very large, and I imagine
my use of for-loops will become cumbersome. I am wondering:
Would another method improve efficiency such as using data.table or tidyverse?
How would I go about using another method, or improving my for-loop? My main confusion is how to write concise code
to perform operations on columns in dat with corresponding rows in cb. Ideally, I would split my for-loop into
multiple functions that would for example, avoid indexing into cb for the same values over and over again or appending unnecessary data to my dataframe, but I'm not really sure how to
do this.
Any help is appreciated!
EDIT:
I've modified the code @Desmond provided to make it more generic, since dat and cb will come from user-supplied files,
and dat can have a varying number of columns and column names that I will be operating on (columns in dat will always start with
"Digits_" and will be specified in the "Digits" column of cb).
library(tidytable)
results <- dat %>%
crossing.(cb) %>%
mutate_rowwise.(disp = (get(`Digits`)-y) *x ) %>%
pivot_wider.(names_from = Digits,
values_from = disp,
names_prefix = "disp_")
results2 <- results %>%
fill.(starts_with("disp"), .direction = c("downup"), .by = 'random') %>%
select.(-c(x,y)) %>%
distinct.()
results2
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Here's a tidyverse solution:
crossing generates combinations from both datasets
case_when to apply your logic
pivot_wider, filter and bind_cols to clean up the output
To scale this to a large dataset, I suggest using the tidytable package. After loading it, simply replace crossing() with crossing.(), pivot_wider() with pivot_wider.(), etc
library(tidyverse)
dat <- data.frame(
Digits_Lower = 1:5,
Digits_Upper = 6:10,
random = 20:24
)
cb <- data.frame(
Digits = c("Digits_Lower", "Digits_Upper"),
x = 1:2,
y = 3:4
)
results <- dat |>
crossing(cb) |>
mutate(disp = case_when(
Digits == "Digits_Lower" ~ (Digits_Lower - y) * x,
Digits == "Digits_Upper" ~ (Digits_Upper - y) * x
)) |>
pivot_wider(names_from = Digits,
values_from = disp,
names_prefix = "disp_")
results |>
filter(!is.na(disp_Digits_Lower)) |>
select(-c(x, y, disp_Digits_Upper)) |>
bind_cols(results |>
filter(!is.na(disp_Digits_Upper)) |>
select(disp_Digits_Upper))
#> # A tibble: 5 × 5
#> Digits_Lower Digits_Upper random disp_Digits_Lower disp_Digits_Upper
#> <int> <int> <int> <int> <int>
#> 1 1 6 20 -2 4
#> 2 2 7 21 -1 6
#> 3 3 8 22 0 8
#> 4 4 9 23 1 10
#> 5 5 10 24 2 12
Created on 2022-08-20 by the reprex package (v2.0.1)
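For comparison, here is a base R sketch that avoids both the explicit for-loop and the crossing/pivot round trip. It is an alternative to the approach above, not a restatement of it; the names digit_cols and new_cols are hypothetical, and it assumes every name in cb$Digits exists as a column of dat.

```r
dat <- data.frame(Digits_Lower = 1:5, Digits_Upper = 6:10, random = 20:24)
cb  <- data.frame(Digits = c("Digits_Lower", "Digits_Upper"), x = 1:2, y = 3:4)

# Column names to operate on, taken from cb (as.character() guards
# against factors in older R versions).
digit_cols <- as.character(cb$Digits)
new_cols   <- paste0("disp_", digit_cols)

# Map() walks the rows of cb in parallel: one column name with its x and y.
# Each call returns a whole new column, so nothing is computed row by row.
dat[new_cols] <- unname(Map(
  function(col, x, y) (dat[[col]] - y) * x,
  digit_cols, cb$x, cb$y
))

dat
```

Each row of cb is read once, and the arithmetic itself is vectorised over the rows of dat, which is usually what matters for a large dataset.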
My function is defined as follows: I subset a data frame to a specific name and return the first 5 rows.
Bestideas <- function(x) {
topideas <- subset(Masterall, Masterall$NAME == x) %>%
slice(1:5)
return(topideas)
}
I would then like to apply the function, to an entire df (with one column of Names), so that the function is applied to each name on the list and binds it into a new df, containing the first five ideas from all unique names. Through research - I have arrived at the following:
bestideas_collection = lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
However, it doesn't work. It returns a data frame with only five ideas in total, from 5 different names. As there are 30 unique names in my list, I expected 30*5 = 150 ideas in the "bestideas_collection" variable. I get this error message:
"longer object length is not a multiple of shorter object length"
Further, if I do it manually for each name, it works just as intended - which makes me think that the function works fine, and that the issue is with the lapply function.
holder <- Bestideas("NAME 1")
bestideas_collection <- bind_rows(bestideas_collection,holder)
holder <- Bestideas("NAME 2")
bestideas_collection <- bind_rows(bestideas_collection,holder)
holder <- Bestideas("NAME 3")
bestideas_collection <- bind_rows(bestideas_collection,holder)
...
Can anyone help me if I am using the function wrong, or do you have alternative methods of doing it? I have already tried with a for-loop - but it gives me the same error as with the lapply function.
I don't have your data, so I tried to reproduce your problem on a fabricated set. I was unable to do so. With a very simple case, your function works as expected.
library(dplyr)
set.seed(123)
Masterall <- data.frame(NAME = rep(LETTERS, 10), value = rnorm(260)) %>%
group_by(NAME) %>% arrange(desc(value))
UNIQUE_NAMES_DF <- LETTERS
lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
# A tibble: 130 x 2
# Groups: NAME [26]
NAME value
<chr> <dbl>
1 A 1.65
2 A 1.44
3 A 0.838
4 A 0.563
5 A 0.181
6 B 1.37
7 B 0.452
8 B 0.153
9 B -0.0450
10 B -0.0540
# ... with 120 more rows
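Is your UNIQUE_NAMES_DF a data.frame? If so, that is the trouble. lapply iterates over the elements of its first argument, and the elements of a data.frame are its columns, so Bestideas receives a whole column of names at once instead of one name at a time. The comparison Masterall$NAME == x then recycles that vector, which is where the "longer object length" warning and the unexpected rows come from. Here is an example: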
UNIQUE_NAMES_DF <- data.frame(NAME = LETTERS, other = sample(letters))
lapply(UNIQUE_NAMES_DF, Bestideas) %>% bind_rows()
# A tibble: 12 x 2
# Groups: NAME [11]
NAME value
<chr> <dbl>
1 C -0.785
2 D 0.385
3 E -0.371
4 F 1.13
5 I 1.10
6 N -0.641
7 P -1.02
8 Q -0.0341
9 U -1.07
10 X -0.0834
11 Z 1.26
12 Z -0.739
I do not know the structure of your UNIQUE_NAMES_DF, but if you just feed the column with the names into your lapply, it should work:
lapply(UNIQUE_NAMES_DF$NAME, Bestideas) %>% bind_rows()
# A tibble: 130 x 2
# Groups: NAME [26]
NAME value
<chr> <dbl>
1 A 1.65
2 A 1.44
3 A 0.838
4 A 0.563
5 A 0.181
6 B 1.37
7 B 0.452
8 B 0.153
9 B -0.0450
10 B -0.0540
# ... with 120 more rows
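If the goal is simply "the first five rows per name", the helper function and lapply can be skipped altogether. The following is a sketch on the same fabricated data, assuming dplyr 1.0.0 or later for slice_head(); it is an alternative, not what the question's code does internally.

```r
library(dplyr)

set.seed(123)
# Fabricated data, as above: 26 names with 10 rows each.
Masterall <- data.frame(NAME = rep(LETTERS, 10), value = rnorm(260))

bestideas_collection <- Masterall %>%
  group_by(NAME) %>%       # one group per unique name
  slice_head(n = 5) %>%    # first 5 rows of each group, in current row order
  ungroup()

nrow(bestideas_collection)
```

As with the original function, "first five" means the first five in the existing row order, so sort beforehand if a ranking is intended.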
I am having a problem with the following script. After converting the min and max columns of the data.frame base to numeric using dplyr, they seem to "convert" back to character. The result, which should be 582, ends up being 513.
base%>%
mutate(ocor=str_count(pass,letter))%>%
filter(ocor%>%between(min,max))%>%
count()
To correct the problem, I tried to convert the variables inside the dplyr pipeline. However, they seem to convert back.
base%>%
mutate(ocor=str_count(pass,letter))%>%
mutate(across(.cols = c('min', 'max'), .fns = ~ as.numeric(.)))%>%
filter(ocor%>%between(min,max))%>%
count()
class(base$max)
class(base$min)
n
1 513
> class(base$max)
[1] "character"
> class(base$min)
[1] "character"
Not using dplyr I got the correct result, an example:
a<-base%>%
mutate(ocor=str_count(pass,letter))%>%
select(ocor)
class(base$max)
class(base$min)
base$max<-as.integer(base$max)
base$min<-as.integer(base$min)
sum(a >= base$min & a <= base$max)
[1] 582
I can't understand what's going on. An example of the database for clarification:
head(base)
min max letter pass ocor
1 2 6 c fcpwjqhcgtffzlbj 2
2 6 9 x xxxtwlxxx 6
3 7 10 q nfbrgwqlvljgq 2
4 2 3 g gjggg 4
5 2 6 s sjsssss 6
6 4 13 b mdbctbzgcpdjbhsdctrd 3
The original base, without changes:
> head(base)
V1 V2 V3
1 2-6 c: fcpwjqhcgtffzlbj
2 6-9 x: xxxtwlxxx
3 5-6 w: wwwwlwwwh
4 7-10 q: nfbrgwqlvljgq
5 2-3 g: gjggg
6 9-11 q: qqqqqqnqgqq
The changes:
base<-read.table('base.txt')
library(tidyverse)
base<-base%>%
separate(V1,c('min','max'),'-')%>%
rename(letter=V2,pass=V3)%>%
mutate(letter = str_replace(letter,':',''))
That's because you are not altering base.
%>% does not assign the result to a variable. I.e.
base %>% mutate(foo=bar(x))
does not alter base. It will just show the result on the console (and none if you are running the script or calling it from a function).
You might be confusing the pipe-operator with %<>% (found in the package magrittr) which uses the left-hand variable as input for the pipe, and overwrites the variable with the modified result.
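A tiny sketch of the difference, using a made-up two-row data frame (base_demo and converted are hypothetical names):

```r
library(dplyr)

# Hypothetical stand-in for `base`, with min stored as character.
base_demo <- data.frame(min = c("2", "6"), stringsAsFactors = FALSE)

# The pipeline result is printed (or discarded); base_demo is untouched.
base_demo %>% mutate(min = as.numeric(min))
class(base_demo$min)   # still "character"

# Assigning the result is what keeps the conversion.
converted <- base_demo %>% mutate(min = as.numeric(min))
class(converted$min)   # "numeric"
```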
Try
base <- base%>%
mutate(ocor=str_count(pass,letter))%>%
mutate(across(.cols = c('min', 'max'), .fns = ~ as.numeric(.)))%>%
filter(ocor%>%between(min,max))%>%
count()
Re. the issue with min and max being converted back to characters, I cannot reproduce.
Re. the issue with filtering not working as expected: between doesn't seem to handle vectors for its left and right inputs (at least in the dplyr version used here). A fairly new feature that helps is rowwise:
Without rowwise:
base%>%
mutate(ocor=str_count(pass,letter))%>%
mutate(across(.cols = c('min', 'max'), .fns = ~ as.numeric(.)))%>%
mutate(between(ocor, min,max))
min max letter pass ocor between(ocor, min, max)
1 2 6 c fcpwjqhcgtffzlbj 2 TRUE
2 6 9 x xxxtwlxxx 6 TRUE
3 7 10 q nfbrgwqlvljgq 2 TRUE
4 2 3 g gjggg 4 TRUE
5 2 6 s sjsssss 6 TRUE
6 4 13 b mdbctbzgcpdjbhsdctrd 3 TRUE
With rowwise:
base%>%
mutate(ocor=str_count(pass,letter))%>%
mutate(across(.cols = c('min', 'max'), .fns = ~ as.numeric(.)))%>%
rowwise %>% mutate(between(ocor, min,max))
# A tibble: 6 x 6
# Rowwise:
min max letter pass ocor `between(ocor, min, max)`
<dbl> <dbl> <chr> <chr> <int> <lgl>
1 2 6 c fcpwjqhcgtffzlbj 2 TRUE
2 6 9 x xxxtwlxxx 6 TRUE
3 7 10 q nfbrgwqlvljgq 2 FALSE
4 2 3 g gjggg 4 FALSE
5 2 6 s sjsssss 6 TRUE
6 4 13 b mdbctbzgcpdjbhsdctrd 3 FALSE
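An alternative that sidesteps both between and rowwise is to write the comparison with the vectorised operators >= and <=, which compare element by element. This is a sketch on the six rows shown in the question, not the answer's own method; base_small and checked are hypothetical names.

```r
library(dplyr)
library(stringr)

# The six example rows from the question, with min/max still as character.
base_small <- data.frame(
  min    = c("2", "6", "7", "2", "2", "4"),
  max    = c("6", "9", "10", "3", "6", "13"),
  letter = c("c", "x", "q", "g", "s", "b"),
  pass   = c("fcpwjqhcgtffzlbj", "xxxtwlxxx", "nfbrgwqlvljgq",
             "gjggg", "sjsssss", "mdbctbzgcpdjbhsdctrd"),
  stringsAsFactors = FALSE
)

checked <- base_small %>%
  mutate(
    across(all_of(c("min", "max")), as.numeric),  # convert the bounds first
    ocor     = str_count(pass, letter),           # occurrences of letter in pass
    in_range = ocor >= min & ocor <= max          # element-wise comparison
  )

sum(checked$in_range)
```

On these six rows the in_range column matches the rowwise output above (TRUE, TRUE, FALSE, FALSE, TRUE, FALSE).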
Please see attached image for the best way I can describe my question.
I promise I did attempt to research this first, and I saw a few answers that fit close, but many of them required listing off each variable (in this image, this would be each encounter #), and my data has approximately 15 million lines of code, with about 10,000 different encounter #'s.
I would appreciate any assistance!
As an alternative, you can also use the data.table package. Especially on large datasets, data.table will give you an enormous performance boost. Applied to the data as used by @r2evans:
library(data.table)
setDT(df)[, .(n_uniq_enc = uniqueN(encounter)), by = patient]
This will lead to the following result:
patient n_uniq_enc
1: 123 5
2: 456 5
Lacking a reproducible example, here's some sample data:
set.seed(42)
df <- data.frame(patient = sample(c(123,456), size=30, replace=TRUE), encounter=sample(c(12,34,56,78,90), size=30, replace=TRUE))
head(df)
# patient encounter
# 1 456 78
# 2 456 90
# 3 123 34
# 4 456 78
# 5 456 12
# 6 456 90
Base R:
aggregate(x = df$encounter, by = list(patient = df$patient),
FUN = function(a) length(unique(a)))
# patient x
# 1 123 5
# 2 456 5
or (by @20100721's suggestion):
aggregate(encounter~.,FUN = function(t) length(unique(t)),data = df)
Using dplyr:
library(dplyr)
group_by(df, patient) %>%
summarize(numencounters = length(unique(encounter)))
# # A tibble: 2 x 2
# patient numencounters
# <dbl> <int>
# 1 123 5
# 2 456 5
Update: @2100721 informed me of n_distinct, effectively the same as length(unique(...)):
group_by(df, patient) %>%
summarize(numencounters = n_distinct(encounter))
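One more dplyr spelling, sketched on a small hand-made data frame so the answer can be checked by eye (the values and the name enc_counts are invented for this example): drop duplicate patient/encounter pairs first, then count the rows left per patient.

```r
library(dplyr)

# Hand-made data: patient 123 has encounters 12 and 34,
# patient 456 has encounters 56, 78 and 90 (with repeats).
df <- data.frame(
  patient   = c(123, 123, 123, 456, 456, 456, 456),
  encounter = c(12, 12, 34, 56, 78, 78, 90)
)

enc_counts <- df %>%
  distinct(patient, encounter) %>%         # one row per unique pair
  count(patient, name = "numencounters")   # rows per patient

enc_counts
```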