I would like to optimize my code.
I'm using str_detect for a lot of selections. To make the code easier to maintain, I would like the filter pattern to come from an externally defined object. I can do that, but only by wrapping the object in as.character(). Is it possible to do it in a tidier way?
Working example demonstrating the issue. This is the classic way, and it works:
> tbl %>% mutate(twentys = case_when(
+ str_detect(fruit, "20") ~ T) )
# A tibble: 4 x 3
x fruit twentys
<int> <chr> <lgl>
1 1 apple 20 TRUE
2 2 banana 20 TRUE
3 3 pear 10 NA
4 4 pineapple 10 NA
This is how I imagined I could do it, but it doesn't work:
> twenty <- 20
> tbl %>% mutate(twentys = case_when(
+ str_detect(fruit, twenty) ~ T) )
Error: Problem with `mutate()` input `twentys`.
x no applicable method for 'type' applied to an object of class "c('double', 'numeric')"
i Input `twentys` is `case_when(str_detect(fruit, twenty) ~ T)`.
Run `rlang::last_error()` to see where the error occurred.
This is the cumbersome way, using as.character(), that I would like to optimize.
> tbl %>% mutate(twentys = case_when(
+ str_detect(fruit, as.character(twenty)) ~ T) )
# A tibble: 4 x 3
x fruit twentys
<int> <chr> <lgl>
1 1 apple 20 TRUE
2 2 banana 20 TRUE
3 3 pear 10 NA
4 4 pineapple 10 NA
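You can use grepl if you don't want to convert twenty to character yourself. grepl coerces its pattern argument with as.character() internally, so a numeric value is accepted.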
library(dplyr)
tbl %>% mutate(twentys = case_when(grepl(twenty, fruit) ~ TRUE))
# x fruit twentys
#1 1 apple 20 TRUE
#2 2 banana 20 TRUE
#3 3 pear 10 NA
#4 4 pineapple 10 NA
data
tbl <- structure(list(x = 1:4, fruit = c("apple 20", "banana 20", "pear 10",
"pineapple 10")), class = "data.frame", row.names = c(NA, -4L))
twenty <- 20
We can still use str_detect if the pattern is converted to character with str_c:
library(dplyr)
library(stringr)
tbl %>%
mutate(twenty = case_when(str_detect(fruit, str_c(twenty)) ~ TRUE))
Or wrap with paste
tbl %>%
mutate(twenty = case_when(str_detect(fruit, paste(twenty)) ~ TRUE))
data
tbl <- structure(list(x = 1:4, fruit = c("apple 20", "banana 20", "pear 10",
"pineapple 10")), class = "data.frame", row.names = c(NA, -4L))
twenty <- 20
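If the pattern is reused in many places, one option is to hide the conversion in a small helper. The sketch below is only an illustration: detect_value is a hypothetical name, not part of stringr, and it assumes the value should be matched literally (hence fixed()) rather than as a regular expression.

```r
library(dplyr)
library(stringr)

tbl <- data.frame(
  x = 1:4,
  fruit = c("apple 20", "banana 20", "pear 10", "pineapple 10")
)

# Hypothetical helper: coerce any atomic value to a literal (non-regex) pattern
# so callers can pass numbers or strings without thinking about the type.
detect_value <- function(string, value) {
  str_detect(string, fixed(as.character(value)))
}

twenty <- 20

res <- tbl %>%
  mutate(twentys = case_when(detect_value(fruit, twenty) ~ TRUE))
```

The conversion still happens, but only in one place, so the calling code stays free of as.character().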
I have a data frame in R that looks like this:
structure(list(items = c("Apple", "Apple, Pear", "Apple, Pear, Banana"
)), row.names = c(NA, -3L), class = "data.frame")
I would like to create new columns for each item in the "items" column and count the frequency of each item. For example, I want to create an "Apple" column that contains the frequency of "Apple" in the "items" column, a "Pear" column that contains the frequency of "Pear" in the "items" column, and so on.
The final data frame should look like this:
structure(list(items = c("Apple", "Apple, Pear", "Apple, Pear, Banana"
), Apple = c(3, 3, 3), Pear = c(2, 2, 2), Banana = c(1, 1, 1)), row.names = c(NA,
-3L), class = "data.frame")
I have tried using the mutate() and str_count() functions from the dplyr and stringr packages, but I'm not sure how to get the final data frame that I want.
Here is the code that I have tried so far:
items %>%
mutate(Apple = str_count(items, "Apple"),
Pear = str_count(items, "Pear"),
Banana = str_count(items, "Banana"))
This gets me part way there, but I'm not sure how to create a new column for each item and count the frequency of each item. Can someone help me figure out how to do this in R?
You can wrap str_count with sum:
items %>%
mutate(Apple = sum(str_count(items, "Apple")),
Pear = sum(str_count(items, "Pear")),
Banana = sum(str_count(items, "Banana")))
items Apple Pear Banana
1 Apple 3 2 1
2 Apple, Pear 3 2 1
3 Apple, Pear, Banana 3 2 1
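This is especially useful when you have many rows and values.
Here is a solution that separates the rows, counts each item, combines the counts with the original data using cbind, and finally pivots wider and fills the NAs. Note that the cbind step relies on the number of distinct items being compatible with the number of rows in df, as it is here: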
library(dplyr)
library(tidyr)
df %>%
separate_rows(items, sep='\\,') %>%
count(items1 = trimws(items)) %>%
cbind(df) %>%
pivot_wider(names_from = items1, values_from = n) %>%
fill(-items, .direction = "downup")
items Apple Banana Pear
<chr> <int> <int> <int>
1 Apple 3 1 2
2 Apple, Pear 3 1 2
3 Apple, Pear, Banana 3 1 2
Using map: loop over the words of interest, use transmute to return a single column holding the total count of each word in the items column, and bind the output to the original data:
library(purrr)
library(dplyr)
library(stringr)
map_dfc(c("Apple", "Pear", "Banana"), ~ df1 %>%
transmute(!! .x := sum(str_count(items, .x)))) %>%
bind_cols(df1, .)
-output
items Apple Pear Banana
1 Apple 3 2 1
2 Apple, Pear 3 2 1
3 Apple, Pear, Banana 3 2 1
Or another option is to split the column 'items', use mtabulate and cbind the columns after getting the colSums
library(qdapTools)
cbind(df1, as.list(colSums(mtabulate(strsplit(df1$items, ",\\s*")))))
items Apple Banana Pear
1 Apple 3 1 2
2 Apple, Pear 3 1 2
3 Apple, Pear, Banana 3 1 2
You can try the following,
library(tidyverse)
df <- structure(list(items = c(
"Apple", "Apple, Pear", "Apple, Pear, Banana"
)),
row.names = c(NA,-3L),
class = "data.frame")
total_count <- function(x, word) {
paste0(x, collapse = ", ") %>%
stringr::str_count(word)
}
df %>%
mutate(Apple = total_count(items, "Apple"),
Pear = total_count(items, "Pear"),
Banana = total_count(items, "Banana"))
#> items Apple Pear Banana
#> 1 Apple 3 2 1
#> 2 Apple, Pear 3 2 1
#> 3 Apple, Pear, Banana 3 2 1
Created on 2023-01-04 with reprex v2.0.2
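The answers above all spell out the item names by hand. As a rough base R sketch, the names can instead be discovered from the data. This assumes items are separated by a comma plus optional whitespace, and it compares whole tokens, so "Apple" would not be counted inside a longer word such as "Pineapple" (unlike str_count).

```r
df <- data.frame(items = c("Apple", "Apple, Pear", "Apple, Pear, Banana"))

# Split every row into its individual items (exact tokens, not substrings).
tokens <- unlist(strsplit(df$items, ",\\s*"))
item_names <- unique(tokens)

# Total number of occurrences of each item across the whole column.
totals <- vapply(item_names, function(w) sum(tokens == w), numeric(1))

# Each total is a length-one value, so it is recycled down every row.
result <- cbind(df, as.list(totals))
result
```

New items that appear in the data later get their own column without any change to the code.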
I have a tidy data.frame in this format:
library(tidyverse)
df = data.frame(name = c("Clarence","Clarence","Clarence","Shelby","Shelby", "Patricia","Patricia"), fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot", "Banana", "Grapes"))
df
# name fruit
#1 Clarence Apple
#2 Clarence Banana
#3 Clarence Grapes
#4 Shelby Apple
#5 Shelby Apricot
#6 Patricia Banana
#7 Patricia Grapes
I want to compare the overlaps between groups in a pairwise manner (i.e. if both people have an apple that counts as an overlap of 1) so that I end up with a dataframe that looks like this:
df2 = data.frame(names = c("Clarence-Shelby", "Clarence-Patricia", "Shelby-Patricia"), n_overlap = c(1, 2, 0))
df2
# names n_overlap
#1 Clarence-Shelby 1
#2 Clarence-Patricia 2
#3 Shelby-Patricia 0
Is there an elegant way to do this in the tidyverse framework? My real dataset is much larger than this and will be grouped on multiple columns.
If the 0 overlap is not important, a solution is:
> df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
name.x name.y n
1 Clarence Patricia 2
2 Clarence Shelby 1
If you really need non-overlapping pairs:
> a = df %>% inner_join(df,by="fruit") %>% filter(name.x<name.y) %>% count(name.x,name.y)
> b = as.data.frame(t(combn(sort(unique(df$name)),2)))
> colnames(b)=colnames(a)[1:2]
> a %>% full_join(b) %>% replace_na(list(n=0))
Joining, by = c("name.x", "name.y")
name.x name.y n
1 Clarence Patricia 2
2 Clarence Shelby 1
3 Patricia Shelby 0
Try this,
combinations <- apply(combn(unique(df$name), 2), 2, function(z) paste(sort(z), collapse = "-"))
combinations
# [1] "Clarence-Shelby" "Clarence-Patricia" "Patricia-Shelby"
library(dplyr)
df %>%
group_by(fruit) %>%
summarize(names = paste(sort(unique(name)), collapse = "-")) %>%
right_join(tibble(names = combinations), by = "names") %>%
group_by(names) %>%
summarize(n_overlap = sum(!is.na(fruit)))
# # A tibble: 3 x 2
# names n_overlap
# <chr> <int>
# 1 Clarence-Patricia 2
# 2 Clarence-Shelby 1
# 3 Patricia-Shelby 0
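One caveat with the group_by(fruit) approach is that a fruit shared by three or more people is pasted into a single three-name label, which no longer matches any pair. A minimal base R sketch that avoids this by intersecting the fruit sets of every pair directly is shown below. The names fruits_by_name and pairs are introduced here for illustration, and pair labels come out in alphabetical order rather than the order in the question.

```r
df <- data.frame(
  name = c("Clarence", "Clarence", "Clarence", "Shelby", "Shelby",
           "Patricia", "Patricia"),
  fruit = c("Apple", "Banana", "Grapes", "Apple", "Apricot",
            "Banana", "Grapes")
)

# One character vector of fruits per person (names sorted alphabetically).
fruits_by_name <- split(df$fruit, df$name)

# Every unordered pair of people, one pair per column.
pairs <- combn(names(fruits_by_name), 2)

# Size of the intersection of the two fruit sets for each pair.
n_overlap <- apply(pairs, 2, function(p) {
  length(intersect(fruits_by_name[[p[1]]], fruits_by_name[[p[2]]]))
})

df2 <- data.frame(
  names = apply(pairs, 2, paste, collapse = "-"),
  n_overlap = n_overlap
)
df2
```

Pairs with no shared fruit are kept with a count of 0 because every pair is enumerated up front.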
I have a data frame that mixes single-value and multi-value observations.
dataset <- c("Apple;Banana;Kiwi", "orange", "Apple;Banana", "orange" )
dataset <- as.data.frame(dataset)
My output :
dataset
1 Apple;Banana;Kiwi
2 orange
3 Apple;Banana
4 orange
What I want: split all the combinations of values into pairs across 2 columns and count them, to make a graph
from   | to     | weight
Apple  | Banana | 2
Apple  | Kiwi   | 1
Banana | Kiwi   | 1
orange | NA     | 2
What I tried :
dataset2 <- dataset %>%
separate_rows(dataset, sep = ";")
We may use combn on each row and get the frequency
stack(table(unlist(lapply(strsplit(dataset$dataset, ";"),
function(x) if(length(x) > 1) combn(x, 2, FUN = toString) else x))))[2:1]
-output
ind values
1 Apple, Banana 2
2 Apple, Kiwi 1
3 Banana, Kiwi 1
4 orange 2
You could do:
library(dplyr)
result <-
do.call(rbind, lapply(strsplit(dataset$dataset, ';'), function(x) {
if(length(x) == 1) return(c(x, NA_character_))
do.call(rbind, lapply(1:(length(x) - 1), function(i) c(x[i], x[i+1])))
}))
as.data.frame(table(paste(result[,1], result[,2]))) %>%
tidyr::separate(Var1, into = c('from', 'to'), sep = ' ') %>%
mutate(to = ifelse(to == 'NA', NA, to),
weight = Freq) %>%
select(-Freq)
#> from to weight
#> 1 Apple Banana 2
#> 2 Banana Kiwi 1
#> 3 orange <NA> 2
Another possible solution:
library(tidyverse)
pmap(dataset, ~ if (str_detect(.x, ";"))
{combn(.x %>% str_split(";") %>% unlist, 2, str_c, collapse=";")} else {.x}) %>%
map_dfr(data.frame) %>%
separate(1, ";", into = c("from", "to"), fill = "right") %>%
count(from, to, name = "weight")
#> from to weight
#> 1 Apple Banana 2
#> 2 Apple Kiwi 1
#> 3 Banana Kiwi 1
#> 4 orange <NA> 2
Or without purrr:
library(tidyverse)
dataset %>%
rowwise %>%
mutate(from = ifelse(str_detect(dataset, ";"), combn(dataset %>%
str_split(";") %>% unlist, 2, str_c, collapse=";") %>% list,
list(dataset))) %>%
unnest_longer(from) %>%
separate(from, ";", into = c("from", "to"), fill = "right") %>%
count(from, to, name = "weight")
#> # A tibble: 4 × 3
#> from to weight
#> <chr> <chr> <int>
#> 1 Apple Banana 2
#> 2 Apple Kiwi 1
#> 3 Banana Kiwi 1
#> 4 orange <NA> 2
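The same idea can also be written with an explicit per-row helper, which some may find easier to read. This is only a sketch: pairs_for is a hypothetical helper name, and it assumes every row contains at least one non-empty value.

```r
library(dplyr)

dataset <- data.frame(
  dataset = c("Apple;Banana;Kiwi", "orange", "Apple;Banana", "orange")
)

# Hypothetical helper: turn one ";"-separated string into a from/to table.
pairs_for <- function(s) {
  parts <- strsplit(s, ";", fixed = TRUE)[[1]]
  # A single value has no partner, so 'to' is NA.
  if (length(parts) < 2) {
    return(data.frame(from = parts, to = NA_character_))
  }
  m <- combn(parts, 2)  # one pair per column
  data.frame(from = m[1, ], to = m[2, ])
}

edges <- do.call(rbind, lapply(dataset$dataset, pairs_for))

# Count identical pairs; rows with NA in 'to' are kept as their own group.
res <- count(edges, from, to, name = "weight")
res
```

Because combn enumerates every pair within a row, non-adjacent pairs such as Apple and Kiwi are included.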
This fails:
library(tidyverse)
myFn <- function(nmbr){
case_when(
nmbr > 3 ~ letters[1:3],
TRUE ~ letters[1:2]
)
}
myFn(4)
# Error: `TRUE ~ letters[1:2]` must be length 3 or one, not 2
# Run `rlang::last_error()` to see where the error occurred.
Why does it fail? Why is case_when built in such a way that its branches can't return different-length vectors? I'd like myFn to work so that I can do things like:
tibble(fruit = c("apple", "grape"),
count = 3:4) %>%
mutate(bowl = myFn(count)) %>%
unnest(col = "bowl")
and get
# A tibble: 5 x 3
fruit count bowl
<chr> <int> <chr>
1 apple 3 a
2 apple 3 b
3 grape 4 a
4 grape 4 b
5 grape 4 c
I can get it to work - by writing a non-vectorized myFn using if/else, then wrapping it in map, but why should I have to?
As noted in my comments, case_when is vectorized, so your function needs to return exactly one element for each row of input. However, each of those elements can be a list of length 0 or more (and arbitrary complexity). Try this:
myFn <- function(nmbr){
case_when(
nmbr > 3 ~ list(letters[1:3]),
TRUE ~ list(letters[1:2])
)
}
tibble(fruit = c("apple", "grape"),
count = 3:4) %>%
mutate(bowl = myFn(count))
# # A tibble: 2 x 3
# fruit count bowl
# <chr> <int> <list>
# 1 apple 3 <chr [2]>
# 2 grape 4 <chr [3]>
tibble(fruit = c("apple", "grape"),
count = 3:4) %>%
mutate(bowl = myFn(count)) %>%
unnest(col = "bowl")
# # A tibble: 5 x 3
# fruit count bowl
# <chr> <int> <chr>
# 1 apple 3 a
# 2 apple 3 b
# 3 grape 4 a
# 4 grape 4 b
# 5 grape 4 c
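For comparison, the non-vectorized if/else plus map approach mentioned in the question can be sketched as follows. The helper name bowl_for is made up for this illustration; it handles one count at a time and map builds the list column.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical scalar helper: takes a single number, returns a vector
# whose length depends on that number.
bowl_for <- function(n) {
  if (n > 3) letters[1:3] else letters[1:2]
}

res <- tibble(fruit = c("apple", "grape"),
              count = 3:4) %>%
  mutate(bowl = map(count, bowl_for)) %>%  # list column, one element per row
  unnest(cols = bowl)
res
```

Both versions end up with a list column before unnest; the case_when version simply builds the list inside the vectorized call instead of iterating row by row.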
Extending r2Evans's answer, you can unlist before returning. Here is the code I use as part of my nesting algorithm.
test_fn <- function(StockCode) {
L <- dplyr::case_when(stringr::str_starts(StockCode, "A") ~ list(7500),
stringr::str_starts(StockCode, "FL") ~ list(6000),
stringr::str_starts(StockCode, "PIPE") ~ list(6500),
stringr::str_starts(StockCode, "RHS") ~
list(c(8000, 12000)),
stringr::str_starts(StockCode, "RND") ~ list(6000),
stringr::str_starts(StockCode, "SQ") ~ list(6000),
TRUE ~
list(c(9000, 10500, 12000, 13500, 15000, 16500, 18000))) %>%
unlist()
return(L)
}
test_fn("RHS")
test_fn("CH")
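The output is the vector stored in the matching list element; for example, test_fn("RHS") returns c(8000, 12000). Note that unlist() flattens everything into one vector, so this suits a single stock code; with a vector of codes the per-element grouping is lost.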
I have data, a simplified version of which looks like this:
df_current <- data.frame(
start = c('yes', rep('no', 5), 'yes', rep('no', 3)),
season = c('banana', rep('to update', 5), 'apple', rep('to update', 3)),
stringsAsFactors = F
)
Let's say that the "start" variable indicates when a new season starts, and I can use that in combination with a date variable (not included) to indicate where apple and banana season start. Once this is done, I want to update the rest of the rows in the "season" column. All of the rows which currently have the value "to update" should be updated to have the value of the type of fruit whose season has most recently started (the rows are arranged by date). In other words, I want the data to look like this:
df_desired <- data.frame(
start = c('yes', rep('no', 5), 'yes', rep('no', 3)),
season = c(rep('banana', 6), rep('apple', 4)),
stringsAsFactors = F
)
I had assumed that something like the following would work:
updated <- df_current %>%
rowwise() %>%
mutate(season = case_when(
season != 'to update' ~ season,
season == 'to update' ~ lag(season)
))
However, that generates NAs at all the 'to update' values.
An easy way would be to replace "to update" with NA and then use fill.
library(dplyr)
library(tidyr)
df_current %>%
mutate(season = replace(season, season == "to update", NA)) %>%
fill(season)
# start season
#1 yes banana
#2 no banana
#3 no banana
#4 no banana
#5 no banana
#6 no banana
#7 yes apple
#8 no apple
#9 no apple
#10 no apple
Using the same logic you can also use zoo::na.locf to fill missing values with latest non-missing values.
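A minimal sketch of the zoo variant, assuming the zoo package is installed. It works on the plain character vector; na.rm = FALSE keeps the vector length unchanged even if the first value were missing.

```r
library(zoo)

season <- c("banana", rep("to update", 5), "apple", rep("to update", 3))

# Mark the placeholder as missing, then carry the last observation forward.
season[season == "to update"] <- NA
filled <- na.locf(season, na.rm = FALSE)
filled
```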
You get NAs because rowwise() makes each case_when evaluation see a season vector containing only a single value, so lag(season) is always NA. Here is an alternative base R solution that uses rle:
x <- rle(df_current$season)
x
#> Run Length Encoding
#> lengths: int [1:4] 1 5 1 3
#> values : chr [1:4] "banana" "to update" "apple" "to update"
x$values[x$values == "to update"] <- x$values[which(x$values == "to update") - 1]
x
#> Run Length Encoding
#> lengths: int [1:4] 1 5 1 3
#> values : chr [1:4] "banana" "banana" "apple" "apple"
df_current$season <- inverse.rle(x)
df_current
#> start season
#> 1 yes banana
#> 2 no banana
#> 3 no banana
#> 4 no banana
#> 5 no banana
#> 6 no banana
#> 7 yes apple
#> 8 no apple
#> 9 no apple
#> 10 no apple
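To see why the rowwise() attempt in the question produced NA, it may help to look at what lag returns in each situation. The small sketch below uses made-up vectors named single and whole. It also shows that dropping rowwise() alone would not be enough, because lag only looks one position back.

```r
library(dplyr)

# Under rowwise(), 'season' is a length-one vector, so there is nothing to lag.
single <- dplyr::lag("to update")

# On a whole column, lag shifts by one position only: the second
# "to update" receives "to update", not "banana".
whole <- dplyr::lag(c("banana", "to update", "to update"))

single
whole
```

This is why the answers fill forward over runs of missing values rather than looking back a fixed number of rows.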
We can use na_if
library(dplyr)
library(tidyr)
df_current %>%
mutate(season = na_if(season, "to update")) %>%
fill(season)
# start season
#1 yes banana
#2 no banana
#3 no banana
#4 no banana
#5 no banana
#6 no banana
#7 yes apple
#8 no apple
#9 no apple
#10 no apple