I would like to run a function to test whether a value exists in a dataset. I've looked for answers and have since found a workaround, but it's not as neat, and I'm curious why my initial attempt failed.
Here's a simplified dataset:
df <- structure(list(country = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("IRE","USA"),
class = "factor"), year = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("1990", "1995", "2000"),
class = "factor")), class = "data.frame", row.names = c(NA, -6L))
> df
country year
1 IRE 1990
2 IRE 1995
3 IRE 2000
4 USA 1990
5 USA 1995
6 USA 2000
I would like my function to return a 1 if a particular country code and year are present or a 0 otherwise. This is the working code:
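library(dplyr)
myFunc <- function(x, y) {
  # keep only the rows for the requested year
  p.ans <- df %>% filter(year == y)
  # 1 if the country appears among those rows, 0 otherwise
  ifelse(x %in% p.ans$country, 1, 0)
}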
> myFunc("USA", 1995)
[1] 1
> myFunc("USA", 1997)
[1] 0
But why doesn't this alternative code work? Is there a variation of it that would?
myFunc <- function(x,y){
df %>% filter(year == y) %>% ifelse(x %in% country, 1, 0)
}
> myFunc("USA", 1997)
Error in ifelse(., x %in% country, 1, 0) : unused argument (0)
Thanks!
But why doesn't this alternative code work?
Because x %>% f(y) is the same as f(x, y). Thus the code you wrote is equal to
ifelse(filter(df, year == y), x %in% country, 1, 0)
… which is not how the ifelse function works.
Instead, you could write
df %>% filter(year == y) %>% pull(country) %>% {ifelse(x %in% ., 1, 0)}
Here you need to surround the ifelse function call with {…} to prevent the pipe from inserting the right-hand side as the first argument into the function call (we want to use x %in% . as the first argument rather than just .).
… or, if you’ve loaded the ‘magrittr’ package, you can use %$% instead of %>% pull(…) %>% {…}:
df %>% filter(year == y) %$% ifelse(x %in% country, 1, 0)
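As a minimal sketch of the braces variant wrapped back into a function (the name my_func_piped is hypothetical, and as.integer() is swapped in for ifelse(..., 1, 0) purely for brevity), assuming the data frame from the question:

```r
library(dplyr)

# Same data as in the question, rebuilt in a compact form
df <- data.frame(
  country = factor(c("IRE", "IRE", "IRE", "USA", "USA", "USA")),
  year    = factor(c("1990", "1995", "2000", "1990", "1995", "2000"))
)

# Hypothetical name; returns 1L if country x has a row for year y, else 0L
my_func_piped <- function(x, y) {
  df %>%
    filter(year == y) %>%
    pull(country) %>%
    # braces stop the pipe from inserting the left-hand side as first argument
    { as.integer(x %in% .) }
}

my_func_piped("USA", 1995)
my_func_piped("USA", 1997)
```

The first call should give 1 and the second 0, matching the working version in the question.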
I am trying to group my dataframe and set the last N values of a column in each group to NA. I can do it for N = 1 like so:
df %>% group_by(ID) %>% mutate(target = c(target[-n()], NA))
But I am struggling to get it to work for any N.
This is my current attempt:
df %>% group_by(ID) %>% mutate(target = c(target[1:(abs(n()-1))], NA))
But this seems to fail for groups of size 1
I also tried:
df %>% group_by(ID) %>% mutate(target = ifelse(n()==1, target, c(target[1:(abs(n()-1))], NA)))
But the else clause never takes effect.
Any advice would be appreciated, thanks.
We could use
library(dplyr)
df %>%
group_by(ID) %>%
mutate(target = replace(target, tail(row_number(), N), NA))
You can use case_when() for a vectorized solution in dplyr.
library(dplyr)
df %>%
group_by(ID) %>%
mutate(target = case_when(row_number() <= n() - N ~ target))
My thanks to @akrun for pointing out that case_when() defaults to an NA of the proper type, so case_when() automatically fills the last N with NAs.
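To see how both suggestions behave when a group has fewer than N rows, here is a small sketch on made-up data (the tibble toy and the value N <- 2 are assumptions for illustration only):

```r
library(dplyr)

# Made-up data: group "a" has three rows, group "b" has only one
toy <- tibble(ID = c("a", "a", "a", "b"), target = c(1, 2, 3, 4))
N <- 2

# replace(): blank out the positions returned by tail(row_number(), N)
res_replace <- toy %>%
  group_by(ID) %>%
  mutate(target = replace(target, tail(row_number(), N), NA)) %>%
  ungroup()

# case_when(): keep a value only if it is not among the last N rows
res_case <- toy %>%
  group_by(ID) %>%
  mutate(target = case_when(row_number() <= n() - N ~ target)) %>%
  ungroup()

res_replace$target
res_case$target
```

Both versions should give 1 NA NA NA: the single-row group is simply set to NA entirely instead of raising an error.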
Update
The solution by @akrun is more performant: when benchmarked at times = 50 repetitions apiece
library(microbenchmark)
big_df <- tibble(ID = rep(letters, 100000)) %>%
mutate(target = row_number()) %>%
group_by(ID)
microbenchmark(
times = 50,
Greg = {
big_df %>%
mutate(target = case_when(row_number() <= n() - N ~ target))
},
akrun = {
big_df %>%
mutate(target = replace(target, tail(row_number(), N), NA))
}
)
it is about 35% faster than mine at scale (2600000 rows):
Unit: milliseconds
expr min lq mean median uq max neval
Greg 82.6337 90.9669 128.93278 96.35760 213.3593 258.8570 50
akrun 52.4519 55.8314 63.40997 61.43945 64.1082 196.4069 50
Here is an alternative suggestion:
We could define the top N after using arrange in descending order (with -x), apply our ifelse statement and rearrange:
library(dplyr)
N = 2
df %>%
group_by(id) %>%
arrange(-x, .by_group = TRUE) %>%
mutate(x = ifelse(row_number() <= N, NA, x)) %>%
arrange(x, .by_group = TRUE)
df <- structure(list(x = c(2, 4, 6, 1, 2, 5, 6, 7, 3, 4, 5, 6), id = c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -12L))
Does anyone know how I can loop pipe operator code through multiple data frames?
I have quite a few data frames named by year (df_1990, df_1991 ... df_2020). However, not all years are included (i.e. df_1993, df_2012 and 3 more years are not available). To account for this, I manually created a list to store all the data frames for the looping (do enlighten me if there's a faster way to do this).
df_list = list(df_1990, df_1991, ..., df_2020)
for (i in df_list) {
...
}
The data frames are pretty simple, with just 2 columns: Item (character field) and Cost (numeric field).
Item     Cost
Book_A   3.00
Book_B   5.00
...      ...
Sample code for one such data frame:
df = structure(list(Item = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 2L,
3L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 2L, 2L, 1L, 3L, 1L), .Label = c("Book A",
"Book B", "Book C"), class = "factor"), Cost = c(5, 3.5, 12,
6, 8, 3, 6, 3.5, 3.8, 13, 5.1, 7, 11.5, 3.8, 5.5, 6.5, 13.5,
5.5, 3.5, 1.2)), class = "data.frame", row.names = c(NA, -20L
))
Does anyone know how I can add in the following code into the ... portion of the for loop code above? Thank you!
df %>%
group_by(Item) %>%
summarise(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier + offset,
lower_value = outlier - offset) %>%
left_join(df, by = 'Item') %>%
transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
The code basically detects outliers (e.g. a cost that is more than 60% above or below the mean cost of that particular item) and outputs a column of "Y" and "N" values, one for each row. (Credit for the code goes to Ronak Shah.)
Ideally, the new column should appear in each data frame of the list, so that the list can be exported to Excel format.
Thank you!
Personally, I would move the data wrangling code into a function and then use lapply to loop over your list of data frames.
library(dplyr)
df_list <- list(df, df, df)
prep_data <- function(x) {
x %>%
group_by(Item) %>%
summarise(
outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier + offset,
lower_value = outlier - offset
) %>%
left_join(x, by = "Item") %>%
transmute(Item, Cost, Outlier = ifelse(Cost < lower_value | Cost > higher_value, "Y", "N"))
}
df_prep <- lapply(df_list, prep_data)
lapply(df_prep, head, 2)
#> [[1]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
#>
#> [[2]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
#>
#> [[3]]
#> # A tibble: 2 × 3
#> Item Cost Outlier
#> <fct> <dbl> <chr>
#> 1 Book A 5 N
#> 2 Book A 3.5 N
If you want to do it via a for loop then you could achieve the same result like so:
df_prep <- list()
for (i in seq_along(df_list)) {
df_prep[[i]] <- prep_data(df_list[[i]])
}
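On the side question of building the list by hand: one possible shortcut, sketched here under the assumption that the yearly data frames live in the current environment and follow the df_YYYY naming pattern, is to collect them with ls() and mget(). The three toy data frames below are made up for illustration.

```r
# Made-up yearly data frames; note the gap in years (no df_1992 or df_1993)
df_1990 <- data.frame(Item = c("Book A", "Book B"), Cost = c(3, 5))
df_1991 <- data.frame(Item = c("Book A", "Book B"), Cost = c(4, 6))
df_1994 <- data.frame(Item = c("Book A", "Book B"), Cost = c(2, 7))

# Environment in which the data frames were created
env <- environment()

# Names of all objects that look like df_ followed by four digits
df_names <- ls(envir = env, pattern = "^df_[0-9]{4}$")

# Fetch the objects into a list that is named after them
df_list <- mget(df_names, envir = env)

names(df_list)
```

Because the list is named, the result of lapply(df_list, prep_data) keeps the year labels, so missing years need no special handling.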
Why don't you put all your data into one data frame?
df_list = list(df_1990 = df_1990, df_1991 = df_1991, ..., df_2020 = df_2020)
df2 = dplyr::bind_rows(df_list, .id = 'Year')
then you only have to add the variable Year into the group_by statement:
group_by(Year, Item)
If you need to, you can always convert it back to a list of dataframes:
df2 %>%
tidyr::nest(data = Item:Cost) %>%
pull(data, name = Year)
Btw, you can also improve the code for the outlier detection, by omitting the join:
df2 %>%
group_by(Year, Item) %>%
mutate(outlier = mean(Cost),
offset = outlier * 0.6,
higher_value = outlier + offset,
lower_value = outlier - offset) %>%
transmute(Item, Cost, Outlier = if_else(Cost < lower_value | Cost > higher_value, 'Y', 'N'))
Using mutate instead of summarise copies the result of mean(Cost) to every row of the group, so no join is needed.
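A small end-to-end sketch of this single-data-frame approach on made-up data (the two toy data frames, the column name avg, and the use of abs() as an equivalent rewrite of the two-sided comparison are all assumptions for illustration):

```r
library(dplyr)

# Made-up named list standing in for the yearly data frames
df_list <- list(
  df_1990 = data.frame(Item = c("Book A", "Book A", "Book A", "Book B"),
                       Cost = c(5, 5, 20, 6)),
  df_1991 = data.frame(Item = c("Book A", "Book B"), Cost = c(4, 7))
)

flagged <- bind_rows(df_list, .id = "Year") %>%   # list names become the Year column
  group_by(Year, Item) %>%
  mutate(
    avg = mean(Cost),                             # group mean, repeated on every row
    Outlier = if_else(abs(Cost - avg) > 0.6 * avg, "Y", "N")
  ) %>%
  ungroup() %>%
  select(Year, Item, Cost, Outlier)

# Back to a list with one data frame per year, if that is needed for export
by_year <- split(flagged, flagged$Year)

flagged
```

Here only the Book A row with a cost of 20 in df_1990 should be flagged, since the group mean is 10 and the tolerated range is 4 to 16.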
Assume I have a dataset with two columns, Location and Product, that shows how many of each product is sold at each location. I create a contingency table for the number of each product sold at each location:
data %>%
  group_by(Location, Product) %>%
  summarize(n = n()) %>%
  pivot_wider(names_from = Product, values_from = n)
Now, imagine that instead of a single Product column, I have US_Product, Japan_Product,..., Germany_Product. How can I create my contingency tables in a for loop?
NOTE: when I create a vector of products like p<-c("Product1", "Product2",..., "Product3") and loop through these products, I get an error message because these are strings and not variable names.
Here is a minimal example:
Location <- c("AB","ON","MN","AB","ON")
Product1<-c("Type1","Type2","Type1","Type3","Type1")
Product2<-c("Type3","Type2","Type3","Type3","Type2")
Product3<-c("Type1","Type2","Type1","Type1","Type1")
data <- tibble(Location,Product1,Product2,Product3)
data%>%
group_by(Location,Product1)%>%
summarize(n=n()) %>%
pivot_wider(names_from = Product1, values_from = n) #this works as expected
#now I want to do the same thing in a loop
prodV <- c("Product1","Product2","Product3")
for (i in c(1:3)){
var <- prodV[i]
data%>%
group_by(Location,var)%>%
summarize(n=n()) %>%
pivot_wider(names_from = var, values_from = n)
}
If we need to use it in a loop, then one option is map
library(dplyr)
library(purrr)
library(tidyr)
map(p, ~
data%>%
group_by_at(vars("Location", .x)) %>%
summarize(n=n()) %>%
pivot_wider(names_from = .x, values_from = n))
Using a reproducible example
data(mtcars)
p <- c("cyl", "vs", "am")
map(p, ~
mtcars %>%
group_by_at(vars('gear', .x)) %>%
summarise(n = n()) %>%
pivot_wider(names_from = .x, values_from = n) )
Or if we use a for loop, then create an empty list to store the output from each iteration ('out'), loop over the 'p' values, and change only the .x part from map while assigning the output to each element of 'out' list
out <- vector('list', length(p))
names(out) <- p
for(p1 in p) {
out[[p1]] <- data %>%
group_by_at(vars("Location", p1)) %>%
summarize(n = n()) %>%
pivot_wider(names_from = p1, values_from = n)
}
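In dplyr 1.0 and later, group_by_at() is superseded. A possible equivalent, sketched here with across() and all_of() on a trimmed copy of the question's data (the names tables and v are hypothetical, and lapply() is swapped in for map()), could look like this:

```r
library(dplyr)
library(tidyr)

# Trimmed copy of the example data from the question
data <- tibble(
  Location = c("AB", "ON", "MN", "AB", "ON"),
  Product1 = c("Type1", "Type2", "Type1", "Type3", "Type1"),
  Product2 = c("Type3", "Type2", "Type3", "Type3", "Type2")
)

prodV <- c("Product1", "Product2")

# One contingency table per product column; v holds the column name as a string
tables <- lapply(prodV, function(v) {
  data %>%
    group_by(across(all_of(c("Location", v)))) %>%   # select columns by name
    summarise(n = n(), .groups = "drop") %>%
    pivot_wider(names_from = all_of(v), values_from = n)
})
names(tables) <- prodV

tables$Product1
```

all_of() tells dplyr and tidyr that the character vector holds column names, which is what the plain group_by(Location, var) call in the question was missing.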
Not sure if the following is the thing you are after. Below is a base R solution to make contingency tables:
p <- c("US_Product","Japan_product","Germany_Product")
res <- Map(function(x) table(df[c("Location",x)]),p)
such that
> res
$US_Product
US_Product
Location a b c
XX 2 0 1
YY 1 1 2
$Japan_product
Japan_product
Location d e f
XX 0 2 1
YY 3 0 1
$Germany_Product
Germany_Product
Location g i j
XX 0 3 0
YY 1 1 2
Dummy data:
df <-
structure(list(Location = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
2L), .Label = c("XX", "YY"), class = "factor"), US_Product = structure(c(1L,
3L, 1L, 2L, 1L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"),
Japan_product = structure(c(2L, 2L, 3L, 3L, 1L, 1L, 1L), .Label = c("d",
"e", "f"), class = "factor"), Germany_Product = structure(c(2L,
2L, 2L, 2L, 3L, 1L, 3L), .Label = c("g", "i", "j"), class = "factor")), class = "data.frame", row.names = c(NA,
-7L))
I was able to handle the problem using group_by_at as opposed to group_by. According to the question "dplyr: whats the difference between group_by and group_by_ functions?", if the inputs are quoted strings, the SE (standard evaluation) versions of functions should be used instead of the NSE (non-standard evaluation) versions; please see the linked question for a detailed explanation.
prodV <- c("Product1","Product2","Product3")
for (i in c(1:3)){
var <- prodV[i]
a<-data%>%
group_by_at(vars("Location",var))%>%
summarize(n=n()) %>%
pivot_wider(names_from = var, values_from = n)
print(a)
}
This is an extension of the question I asked here where I was looking for a way to automate my labeling of subjects into groups based on if their data matched my filter.
Prior to attempting the automated labeling, this is what I had.
library(tidyverse)
df <- structure(list(Subj_ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L),
Location = c(1, 2, 3, 1, 4, 2, 1, 2, 5)), class = "data.frame",
row.names = c(NA, -9L))
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
complete.df <- df2 %>% filter(complete.cases(.))
In my actual data, there are some rows that have NA's and I need to be able to filter for both complete and incomplete cases so I can review the sub-data sets separately if needed.
My new code, which assigns a subject to a group based on whether they have a Location data point of 4 or 5, looks like this:
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
df3 <- df2 %>% ##this chunk breaks filter(complete.cases(.))
group_by(Subj_ID) %>%
mutate(group2 = case_when(any(Location == 4) | any(Location == 5) ~ "YES", TRUE ~ "NO"))
complete.df <- df3 %>% filter(complete.cases(.))
Once I generate df3 by mutating df2, my filter(complete.cases(.)) subsequently fails.
Yet, if I generate df3 by manual recoding, it works! Like so:
df2 <- df %>%
mutate(group=
if_else(Subj_ID ==1,
"Treatment",
if_else(Subj_ID == 2,
"Control","Withdrawn")))
df3 <- df2 %>%
mutate(group2=
if_else(Subj_ID ==2 |
Subj_ID ==3,
"TRUE", "FALSE"))
complete.df <- df3 %>% filter(complete.cases(.))
Thoughts?
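The grouping attribute added by group_by is what causes the issue: within each group, filter expects a logical vector as long as the group (3 rows), but complete.cases(.) is evaluated on the whole data frame and returns 9 values. It can be solved by ungrouping and then applying the filter. The OP's last code block (manual coding) does not create a grouping attribute, and thus it works.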
library(dplyr)
df3 %>%
ungroup %>%
filter(complete.cases(.))
Or, instead of complete.cases in filter, we can use !is.na with filter_all, which works without removing the grouping attribute (all_vars keeps only the rows in which every column is non-missing, matching complete.cases):
df3 %>%
filter_all(all_vars(!is.na(.)))
The OP mentioned that the last code block works, but it doesn't have any grouping attribute. If we create one, it fails too:
df3 %>%
group_by(group) %>%
filter(complete.cases(.))
Error: Result must have length 3, not 9
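A compact sketch of the ungroup-then-filter fix, on made-up data that contains one NA so that both subsets are non-empty (the data values and the use of %in% instead of two any() calls are assumptions for illustration):

```r
library(dplyr)

# Made-up data with a single missing Location
df <- data.frame(
  Subj_ID  = c(1L, 1L, 2L, 2L, 3L, 3L),
  Location = c(1, NA, 4, 2, 1, 5)
)

# Label each subject by whether any of its rows has Location 4 or 5
df3 <- df %>%
  group_by(Subj_ID) %>%
  mutate(group2 = if_else(any(Location %in% c(4, 5)), "YES", "NO"))

# Drop the grouping first, so complete.cases(.) and filter agree on length
complete.df   <- df3 %>% ungroup() %>% filter(complete.cases(.))
incomplete.df <- df3 %>% ungroup() %>% filter(!complete.cases(.))

nrow(complete.df)
nrow(incomplete.df)
```

With this data, five rows end up in complete.df and the single row with the missing Location ends up in incomplete.df.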
I have this data.frame:
df <- structure(list(att_number = structure(1:3, .Label = c("0", "1",
"2"), class = "factor"), `1` = structure(c(2L, 3L, 1L), .Label = c("1026891",
"412419", "424869"), class = "factor"), `10` = structure(c(2L,
1L, 3L), .Label = c("235067", "546686", "92324"), class = "factor"),
`2` = structure(c(3L, 1L, 2L), .Label = c("12729", "7569",
"9149"), class = "factor")), .Names = c("att_number", "1",
"10", "2"), row.names = c(NA, -3L), class = "data.frame")
It looks like this, with numbers as the column names:
att_number 1 10 2
0 412419 546686 9149
1 424869 235067 12729
2 1026891 92324 7569
Within a dplyr chain, I would like to order the columns in ascending order, like this:
att_number 1 2 10
0 412419 9149 546686
1 424869 12729 235067
2 1026891 7569 92324
I've tried using select_, but it doesn't want to work according to plan. Any idea on how I can do this? Here's my feeble attempt:
names_order <- names(df)[-1] %>%
as.numeric %>%
.[order(.)] %>%
as.character %>%
c('att_number', .)
df %>%
select_(.dots = names_order)
Error: Position must be between 0 and n
Update:
For newer versions of dplyr (>= 0.7.0):
library(tidyverse)
sort_names <- function(data) {
name <- names(data)
chars <- keep(name, grepl, pattern = "[^0-9]") %>% sort()
nums <- discard(name, grepl, pattern = "[^0-9]") %>%
as.numeric() %>%
sort() %>%
sprintf("%s", .)
select(data, !!!c(chars, nums))
}
sort_names(df)
Original:
You need backticks around the numeric column names to stop select_ from trying to interpret them as column positions:
library(tidyverse)
sort_names <- function(data) {
name <- names(data)
chars <- keep(name, grepl, pattern = "[^0-9]") %>% sort()
nums <- discard(name, grepl, pattern = "[^0-9]") %>%
as.numeric() %>%
sort() %>%
sprintf("`%s`", .)
select_(data, .dots = c(chars, nums))
}
sort_names(df)
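If only one known identifier column has a non-numeric name, a shorter variant is possible. This is a sketch rather than a drop-in replacement for sort_names: it assumes the identifier column is att_number, stores the values as plain numbers instead of factors for brevity, and uses all_of() from recent dplyr versions.

```r
library(dplyr)

# Same shape as the question's data; check.names = FALSE keeps "1", "10", "2"
df <- data.frame(
  att_number = 0:2,
  `1`  = c(412419, 424869, 1026891),
  `10` = c(546686, 235067, 92324),
  `2`  = c(9149, 12729, 7569),
  check.names = FALSE
)

# Column names that should be ordered numerically rather than alphabetically
num_cols <- setdiff(names(df), "att_number")
sorted   <- num_cols[order(as.numeric(num_cols))]

# all_of() treats the strings as column names, never as positions
res <- df %>% select(att_number, all_of(sorted))

names(res)
```

The resulting column order should be att_number, 1, 2, 10.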