I'm looking for a more eloquent way to write R code for a kind of case that I've encountered more than once. Here is an example of the data and some code that accomplishes the result I want:
library(tidyverse)
df <- tibble(id = 1:5, primary_county = 101:105, secondary_county = 201:205)
specific_counties <- c(101, 103, 202, 205)
df |>
  mutate(target_area =
           primary_county %in% specific_counties | secondary_county %in% specific_counties)
The result is:
# A tibble: 5 × 4
id primary_county secondary_county target_area
<int> <int> <int> <lgl>
1 1 101 201 TRUE
2 2 102 202 TRUE
3 3 103 203 TRUE
4 4 104 204 FALSE
5 5 105 205 TRUE
I want to know if there is a way to get the same result using code that would be more succinct and eloquent if I were dealing with more columns of the "..._county" variety. Specifically, in my code above, the expression %in% specific_counties must be repeated, joined by |, for each extra column I want to handle. Is there a way to avoid repeating this kind of phrase multiple times?
These logical rowwise operations are superbly well handled by dplyr::if_any() or dplyr::if_all():
library(dplyr)
df %>%
  mutate(target_area = if_any(ends_with('county'), ~ .x %in% specific_counties))
# A tibble: 5 × 4
id primary_county secondary_county target_area
<int> <int> <int> <lgl>
1 1 101 201 TRUE
2 2 102 202 TRUE
3 3 103 203 TRUE
4 4 104 204 FALSE
5 5 105 205 TRUE
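For completeness, if_all() works the same way when every "_county" column must match. A quick sketch, not part of the original answer (the column name all_target is illustrative):
df %>%
  # all_target (illustrative name) is TRUE only when every *_county column matches
  mutate(all_target = if_all(ends_with('county'), ~ .x %in% specific_counties))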
We can also use:
purrr::reduce with |,
rowSums with as.logical, or
purrr::pmap_lgl with any(c(...) %in% x).
library(purrr)
library(dplyr)
df %>%
  mutate(target_area = reduce(across(ends_with('county'), ~ .x %in% specific_counties),
                              `|`))
## OR ##
df %>%
  mutate(target_area = rowSums(across(ends_with('county'), ~ .x %in% specific_counties)) %>%
           as.logical())
## OR ##
df %>%
  mutate(target_area = pmap_lgl(across(ends_with('county')),
                                ~ any(c(...) %in% specific_counties)))
For reference, this other answer of mine shows similar usages for if_any, and reduce(|) in a filter() operation:
R - Remove rows from dataframe that contain only zeros in numeric columns, base R and pipe-friendly methods?
Additional related questions/answers:
Logical function across multiple columns using "any" function
How to create a new column based on if any of a subset of columns are NA with the dplyr
This offers a little more than what you have, though I'm not sure how "eloquent" I'd call it:
df %>%
  mutate(
    target_area = rowSums(
      sapply(select(cur_data(), matches("_county")),
             `%in%`, specific_counties)) > 0
  )
# # A tibble: 5 x 4
# id primary_county secondary_county target_area
# <int> <int> <int> <lgl>
# 1 1 101 201 TRUE
# 2 2 102 202 TRUE
# 3 3 103 203 TRUE
# 4 4 104 204 FALSE
# 5 5 105 205 TRUE
Or you can list the columns explicitly, replacing the select(.., matches(..)) with list(primary_county, secondary_county).
Add as many columns to the list(..) as you want.
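A sketch of that explicit-list variant, for illustration (same idea as the code above, just with the columns named outright):
df %>%
  mutate(
    # each listed column is checked against specific_counties; any match => TRUE
    target_area = rowSums(
      sapply(list(primary_county, secondary_county),
             `%in%`, specific_counties)) > 0
  )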
I'd like to create a new variable called POPULATION that holds the sum of the values of the variable P1, grouped by the variable CODASC. It seemed easy at first, but I'm struggling. Since I have to do this for a lot of variables and for several datasets, I really need a quick way of doing it! If anyone can help me, I would really appreciate it!
Many thanks,
Ilaria
My data frame looks like this:
PROCOM SEZ2011 SEZ CODASC P1 P47 P62 P131 E1 E3 ST15 A46
<int> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 48017 480000000000 60001 4 251 25 9 20 70 40 19 20
2 48017 480000000000 60002 3 15 1 0 1 4 4 0 3
3 48017 480000000000 60003 2 20 7 2 1 1 1 1 1
4 48017 480000000000 60004 3 253 21 4 10 63 40 49 22
5 48017 480000000000 60005 5 3 0 1 0 1 1 0 2
6 48017 480000000000 60006 1 161 19 7 5 27 17 26 13
And my code looks like this:
df <- df %>%
  group_by(CODASC) %>%
  mutate(POPULATION = sum(P1, na.rm = T))
To apply sum within a group across multiple variables you could do, as an example:
library(dplyr)
df %>%
  group_by(CODASC) %>%
  mutate(across(P1:last_col(), sum, .names = "{.col}_sum")) %>%
  ungroup()
To apply this across multiple data frames (if you're grouping by the same variable and summing the same columns), you can iterate through them easily with purrr if they're in a list:
library(purrr)
library(dplyr)
l <- list(df, df, df)
map(l, ~ .x %>%
      group_by(CODASC) %>%
      mutate(across(P1:last_col(), sum, .names = "{.col}_sum")) %>%
      ungroup())
Your code looks like it does what you want; you're just looking for a way to streamline it across multiple columns?
It looks like your first 4 columns are some identifiers. If you want to summarise all remaining columns you can do something like:
df <- df %>%
  group_by(PROCOM, SEZ2011, SEZ, CODASC) %>%
  summarise_all(sum) ## or whatever function you want here
See https://dplyr.tidyverse.org/reference/summarise_all.html for more details on summarise_all() and summarise_at().
If you want to create a function to apply to many datasets, perhaps check out writing functions (https://swcarpentry.github.io/r-novice-inflammation/02-func-R/) and the apply family of functions; a minimal sketch follows.
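As a minimal sketch of that idea (the helper name summarise_groups and its interface are illustrative, not from the original answer):
library(dplyr)

# Hypothetical helper: group by any set of id columns, then sum all remaining columns
summarise_groups <- function(data, ...) {
  data %>%
    group_by(...) %>%
    summarise_all(sum)
}

# Usage on the data above:
# summarise_groups(df, PROCOM, SEZ2011, SEZ, CODASC)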
I'm fairly new to R and struggling to get this. The type of problem I'm trying to address involves one data frame containing books and the start and end pages of a particular chapter.
book <- c("Dune", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea")
chapt.start <- c(300, 8, 94, 150, 600)
chapt.end <- c(310, 19, 110, 158, 630)
df1 <- data.frame(book, chapt.start, chapt.end)
df1
             book chapt.start chapt.end
1 Dune 300 310
2 LOTR 8 19
3 LOTR 94 110
4 OriginOfSpecies 150 158
5 OldManSea 600 630
My second dataframe contains a list of book titles and a single page.
title <- c("LOTR", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea", "OldManSea")
page <- c(4, 12, 30, 200, 620, 650)
df2 <- data.frame(title, page)
df2
title page
1 LOTR 4
2 LOTR 12
3 LOTR 30
4 OriginOfSpecies 200
5 OldManSea 620
6 OldManSea 650
What I'm trying to ask is, for each row in df1, whether df2 contains any rows with the corresponding book title and a page within the chapter, i.e. df2$title == df1$book and df2$page > df1$chapt.start and df2$page < df1$chapt.end.
The desired output for these data would be FALSE, TRUE, FALSE, FALSE, TRUE
Is this best approached with some kind of for loop, ifelse, sapply, or something different? Thanks for your help, people!
This is a range-based join. There are three good ways to do this in R. All of these return the matching page number itself instead of TRUE/FALSE; it should be straightforward to convert to logical with something like !is.na(page).
sqldf
library(sqldf)
sqldf(
  "select df1.*, df2.page
   from df1
   left join df2 on df1.book = df2.title
     and df2.page between df1.[chapt.start] and df1.[chapt.end]")
# book chapt.start chapt.end page
# 1 Dune 300 310 NA
# 2 LOTR 8 19 12
# 3 LOTR 94 110 NA
# 4 OriginOfSpecies 150 158 NA
# 5 OldManSea 600 630 620
fuzzyjoin
(Edited out, see #IanCampbell's answer.)
data.table
library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT2[, p2 := page][DT1, on = .(title == book, p2 >= chapt.start, p2 <= chapt.end)]
# title page p2 p2.1
# <char> <num> <num> <num>
# 1: Dune NA 300 310
# 2: LOTR 12 8 19
# 3: LOTR NA 94 110
# 4: OriginOfSpecies NA 150 158
# 5: OldManSea 620 600 630
The reason I add p2 as a copy of page is that data.table on range-joins replaces the left's (inequality) column with those from the right (or something like that), so we'd lose that bit of info.
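To get back to the TRUE/FALSE the question asked for, a small sketch along the lines suggested above (res and target_area are illustrative names, not from the original answer):
res <- DT2[, p2 := page][DT1, on = .(title == book, p2 >= chapt.start, p2 <= chapt.end)]
res[, target_area := !is.na(page)]  # TRUE where a page fell inside the chapter range
res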
You're looking for a non-equi join. This can be accomplished in many ways, but I prefer the fuzzyjoin package:
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
                by = c("book" = "title", "chapt.start" = "page", "chapt.end" = "page"),
                match_fun = c(`==`, `<=`, `>=`))
             book chapt.start chapt.end title page
1 Dune 300 310 <NA> NA
2 LOTR 8 19 LOTR 12
3 LOTR 94 110 <NA> NA
4 OriginOfSpecies 150 158 <NA> NA
5 OldManSea 600 630 OldManSea 620
From here it's easy to get to the desired output:
library(dplyr)
fuzzy_left_join(df1, df2,
                by = c("book" = "title", "chapt.start" = "page", "chapt.end" = "page"),
                match_fun = c(`==`, `<=`, `>=`)) %>%
  mutate(result = !is.na(page)) %>%
  select(-c(title, page))
             book chapt.start chapt.end result
1 Dune 300 310 FALSE
2 LOTR 8 19 TRUE
3 LOTR 94 110 FALSE
4 OriginOfSpecies 150 158 FALSE
5 OldManSea 600 630 TRUE
Using dplyr only, i.e. without purrr or fuzzyjoin:
df2 %>%
  right_join(df1 %>% mutate(id = row_number()), by = c("title" = "book")) %>%
  group_by(id, title) %>%
  summarise(desired = ifelse(is.na(as.logical(sum(chapt.start <= page & page <= chapt.end))),
                             FALSE,
                             as.logical(sum(chapt.start <= page & page <= chapt.end))))
# A tibble: 5 x 3
# Groups: id [5]
id title desired
<int> <chr> <lgl>
1 1 Dune FALSE
2 2 LOTR TRUE
3 3 LOTR FALSE
4 4 OriginOfSpecies FALSE
5 5 OldManSea TRUE
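As an aside, the same grouped check can be written more compactly with any() and na.rm = TRUE; a sketch, not part of the original answer, which behaves identically on this data:
df2 %>%
  right_join(df1 %>% mutate(id = row_number()), by = c("title" = "book")) %>%
  group_by(id, title) %>%
  # any() over an all-NA group (a book with no matching pages) returns FALSE with na.rm = TRUE
  summarise(desired = any(chapt.start <= page & page <= chapt.end, na.rm = TRUE),
            .groups = "drop")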
Another approach uses purrr, without joining the data.
First, create a logical check variable for df1:
library(dplyr)
library(purrr)
# This function is designed to take ..., a row of data coming from pmap,
# and then look up whether any record in df2 matches the conditions
look_up_check_df1 <- function(..., page_df) {
  book_record <- tibble(...)
  any_record <- page_df %>%
    filter(title == book_record[["book"]],
           page >= book_record[["chapt.start"]],
           page <= book_record[["chapt.end"]])
  nrow(any_record) > 0
}
df1$check <- pmap_lgl(df1, look_up_check_df1, page_df = df2)
df1
#> book chapt.start chapt.end check
#> 1 Dune 300 310 FALSE
#> 2 LOTR 8 19 TRUE
#> 3 LOTR 94 110 FALSE
#> 4 OriginOfSpecies 150 158 FALSE
#> 5 OldManSea 600 630 TRUE
The same logic, applied to df2:
# If the check is for df2, the function just needs to be revised a bit
look_up_check <- function(..., book_chapters_df) {
  page_record <- tibble(...)
  any_record <- book_chapters_df %>%
    filter(book == page_record[["title"]],
           chapt.start <= page_record[["page"]],
           chapt.end >= page_record[["page"]])
  nrow(any_record) > 0
}
# Run pmap_lgl, which passes each row of df2 into look_up_check
# and returns a vector of logical TRUE/FALSE
df2$check <- pmap_lgl(df2, look_up_check, book_chapters_df = df1)
df2
#> title page check
#> 1 LOTR 4 FALSE
#> 2 LOTR 12 TRUE
#> 3 LOTR 30 FALSE
#> 4 OriginOfSpecies 200 FALSE
#> 5 OldManSea 620 TRUE
#> 6 OldManSea 650 FALSE
Created on 2021-04-12 by the reprex package (v1.0.0)
I'm working with the data frame below, which is just part of the full data, and I need to condense the duplicate numbers in the id column into one row. I want to preserve the row that has the highest sbp number, unless it's 300 or over, in which case I want to discard that too.
So for example, for the first three rows that have id as 13480, I want to keep the row that has 124 and discard the other two.
id,sex,visits,sbp
13480,M,2,124
13480,M,3,306
13480,M,4,116
13520,M,2,124
13520,M,3,116
13520,M,4,120
13580,M,2,NA
13580,M,3,124
This is the farthest I got; I've been trying to tweak it but I'm not sure I'm on the right track:
maxsbp <- split(sbp, sbp$sbp)
r <- data.frame()
for (i in 1:length(maxsbp)) {
  one <- maxsbp[[i]]
  index <- which(one$sbp == max(one$sbp))
  select <- one[index, ]
  r <- rbind(r, select)
}
r1 <- r[!(sbp$sbp>=300),]
r1
I think a tidy solution would work quite well here. I would first filter out all values of 300 or above, since you do not want to keep any value at or beyond that threshold, then group_by id, arrange in descending order, and keep the first row.
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
my.df %>%
  filter(sbp < 300) %>%   # keep only values below 300
  group_by(id) %>%        # group by id
  arrange(-sbp) %>%       # arrange by sbp in descending order
  top_n(1, sbp)           # retain the first value, i.e. the largest
# A tibble: 3 x 3
# Groups: id [3]
# id sex sbp
# <dbl> <chr> <dbl>
#1 13480 M 124
#2 13520 M 124
#3 13580 M 124
In R, you'll very rarely require explicit for loops for tasks like this.
There are functions available that will help you perform such grouped operations.
For example, in base R you can use subset and ave :
subset(df, sbp == ave(sbp, id, FUN = function(x) max(x[x < 300], na.rm = TRUE)))
# id sex visits sbp
#1 13480 M 2 124
#4 13520 M 2 124
#8 13580 M 3 124
The same can be done using dplyr, whose syntax is a little easier to understand.
library(dplyr)
df %>%
  group_by(id) %>%
  filter(sbp == max(sbp[sbp < 300], na.rm = TRUE))
slice_head() can also be used:
my.df <- data.frame("id" = c(13480,13480,13480,13520,13520,13520,13580,13580),
"sex" = c("M","M","M","M","M","M","M","M"),
"sbp"= c(124,306,116,124,116,120,NA,124))
> my.df
id sex sbp
1 13480 M 124
2 13480 M 306
3 13480 M 116
4 13520 M 124
5 13520 M 116
6 13520 M 120
7 13580 M NA
8 13580 M 124
Note that the filter must come before the slice: if you slice first, id 13480's largest value (306) is kept per group and then filtered away, dropping that id entirely. So proceed like this:
my.df %>%
  filter(sbp < 300) %>%
  group_by(id, sex) %>%
  arrange(desc(sbp)) %>%
  slice_head()
# A tibble: 3 x 3
# Groups:   id, sex [3]
     id sex     sbp
  <dbl> <chr> <dbl>
1 13480 M       124
2 13520 M       124
3 13580 M       124
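For reference, slice_max() (dplyr 1.0+) expresses the same "largest row per group" step directly; a sketch, assuming the same filtered data:
my.df %>%
  filter(sbp < 300) %>%
  group_by(id, sex) %>%
  slice_max(sbp, n = 1, with_ties = FALSE)  # keep the single largest sbp per group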
Okay so it's one of those days where a previously working piece of code suddenly breaks. Here's a reprex of the code in question:
test = data.frame(factor1 = sample(1:5, 10, replace=T),
factor2 = sample(letters[1:5], 10, replace=T),
variable = sample(100:200, 10))
group_vars = c('factor1','factor2') %>% paste(., collapse = ',')
> test %>% dplyr::group_by_(group_vars)
Error in parse(text = x) : <text>:1:8: unexpected ','
1: factor1,
^
Now I swear this worked until today. Of course dplyr is trying to do away with the underscore-suffixed functions anyway, but I've tried to plug everything I can think of into group_by(), using combinations of !!, !!!, sym(), quo(), enquo(), etc., and can't figure it out. I've tried not pasting the column names together, and at best it simply takes the first one and ignores everything else. Most commonly I get the following error message:
Error: Column <chr> must be length 10 (the number of rows) or one, not 2
I've also read over Hadley's dplyr programming guide (https://dplyr.tidyverse.org/articles/programming.html), which seems to cover the issue, except that I'm generating the column names internally rather than accepting them as arguments to a function. Has anyone come across this, or does anyone understand quoting well enough to know a solution?
Also, to be clear, this works when only using a single grouping variable. The problem is with multiple groups.
Thanks!
Instead of pasting and using group_by_ (deprecated, and it would not work here anyway because it expects NSE), we can use the vector directly in group_by_at:
library(dplyr)
group_vars <- c('factor1','factor2')
test %>%
  group_by_at(group_vars)
# A tibble: 10 x 3
# Groups: factor1, factor2 [10]
# factor1 factor2 variable
# <int> <fct> <int>
# 1 1 d 145
# 2 5 e 119
# 3 4 a 181
# 4 3 e 155
# 5 3 d 164
# 6 3 b 135
# 7 4 e 137
# 8 4 d 197
# 9 2 d 142
#10 2 c 110
Or another option is to convert the strings to symbols (syms from rlang) and evaluate them (!!!) within group_by:
test %>%
  group_by(!!! rlang::syms(group_vars))
If we go by the route of paste, then one option is parse_exprs (from rlang); note that the collapse separator becomes ';', since parse_exprs splits expressions on semicolons and newlines:
group_vars = c('factor1','factor2') %>% paste(., collapse = ';')
test %>%
  group_by(!!! rlang::parse_exprs(group_vars))
# A tibble: 10 x 3
# Groups: factor1, factor2 [10]
# factor1 factor2 variable
# <int> <fct> <int>
# 1 1 d 145
# 2 5 e 119
# 3 4 a 181
# 4 3 e 155
# 5 3 d 164
# 6 3 b 135
# 7 4 e 137
# 8 4 d 197
# 9 2 d 142
#10 2 c 110
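In current dplyr (1.0+), the same selection is usually written with across(all_of()); a sketch, not part of the original answer:
test %>%
  group_by(across(all_of(c('factor1', 'factor2'))))  # select grouping columns from a character vector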
Here's the dummy data:
cases <- rep(1:5,times=2)
var1 <- as.numeric(c(450,100,250,999,200,500,980,10,700,1000))
var2 <- as.numeric(c(111,222,333,444,424,634,915,12,105,152))
maindata1 <- data.frame(cases,var1,var2)
df1 <- maindata1 %>%
  filter(var1 > 950) %>%
  distinct(cases) %>%
  select(cases)
table1 <- maindata1 %>%
  filter(cases == 2 | cases == 4 | cases == 5) %>%
  arrange(cases)
> table1
cases var1 var2
1 2 100 222
2 2 980 915
3 4 999 444
4 4 700 105
5 5 200 424
6 5 1000 152
I'm trying to build a data frame which contains all the data for cases where var1 > 950, so it would show every value of var1 for those cases (including the values < 950) and all values of var2, and would drop all cases where var1 never reaches 950. table1 produces the desired data frame, but I had to enter the filtering conditions manually. Is there a way to use df1$cases as a filtering condition to extract the same data frame?
I'm new to R and trying to learn data manipulation mainly with dplyr, because its syntax is almost understandable to a layman. If someone can offer a solution based on dplyr, that would be fantastic; of course, I'm willing to hear solutions based on other packages as well.
Filter by max(var1) in each group defined by cases:
maindata1 %>%
  group_by(cases) %>%
  filter(max(var1) > 950) %>%
  arrange(cases)
# cases var1 var2
# 1 2 100 222
# 2 2 980 915
# 3 4 999 444
# 4 4 700 105
# 5 5 200 424
# 6 5 1000 152
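To use df1$cases directly as the filtering condition, as the question asked, here is a sketch (assuming dplyr is loaded):
# Keep every row of maindata1 whose case appears in df1
maindata1 %>%
  filter(cases %in% df1$cases) %>%
  arrange(cases)

# Equivalently, with a filtering join:
maindata1 %>%
  semi_join(df1, by = "cases") %>%
  arrange(cases)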