I ran into an annoying issue earlier today with a dataframe I had been given that has hundreds of columns. I was trying to select rows from it using a list I had created in a different process. When I filtered using the list, I got an empty dataframe in return. After struggling with this for a while, I realized that the massive dataframe I was selecting from also had a column with the same name as my list, and that my filter was using the column in preference to the list.
My question is: is there a better way to filter dataframes than the way I am doing it now? I do not like that it is ambiguous whether a column or a list is used. Here is a minimal example to show this:
Consider a dataframe which has two columns, a and b:
library(tidyverse)
df = tibble(a = c("first", "second", "third"),
b = c("2", "3", "4"))
# A tibble: 3 × 2
# a b
# <chr> <chr>
# 1 first 2
# 2 second 3
# 3 third 4
I would then like to select rows from this dataframe using a list of values I created using a different process. Notice that the first list is named b, which is also the name of one of the columns in the df.
b = c("first")
d = c("first")
These two commands are almost the same, except that the first filters based on the column (and therefore returns nothing) and the second filters based on the list (and therefore returns the first row):
# Returns Nothing:
df %>%
filter(a %in% b)
# # A tibble: 0 × 2
# … with 2 variables: a <chr>, b <chr>
# ℹ Use `colnames()` to see all variable names
# Returns Desired Row(s)
df %>%
filter(a %in% d)
# A tibble: 1 × 2
# a b
# <chr> <chr>
# 1 first 2
Is there a better way to filter which is less ambiguous? I guess I would like an error or something like that. I realize this is kind of an edge case.
You can use .data$ and .env$ from rlang to distinguish between the variable in the data set and the object in the environment.
df %>%
filter(a %in% .env$b)
# A tibble: 1 × 2
# a b
# <chr> <chr>
# 1 first 2
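As a minimal sketch of the same idea, reusing the df and b from the question, both pronouns can be spelled out so that neither side of %in% is ambiguous: .data$a always refers to the column and .env$b always refers to the object in the calling environment. A side effect that may help here is that .data$ raises an error for a column that does not exist instead of silently falling back to the environment. The name explicit is only for illustration.

```r
library(dplyr)

df <- tibble(a = c("first", "second", "third"),
             b = c("2", "3", "4"))
b <- c("first")

# .data$a is the column a; .env$b is the vector b defined above,
# even though df also has a column called b
explicit <- df %>%
  filter(.data$a %in% .env$b)

explicit
```

Writing both pronouns is more verbose, but it makes the intent visible to anyone reading the filter later.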
You can use !! to evaluate the vector b, rather than use the variable b from the dataset. It also works with vectors that are not also variable names in the data, like d. So, if you expect this to happen a lot, you could always prefix the vector of values you're filtering on with !! and you won't run into this problem.
library(tidyverse)
df = tibble(a = c("first", "second", "third"),
b = c("2", "3", "4"))
b = c("first")
d = c("first")
df %>%
filter(a %in% !!b)
#> # A tibble: 1 × 2
#> a b
#> <chr> <chr>
#> 1 first 2
df %>%
filter(a %in% !!d)
#> # A tibble: 1 × 2
#> a b
#> <chr> <chr>
#> 1 first 2
Created on 2023-02-10 by the reprex package (v2.0.1)
I've downloaded a dataset from OpenML: https://www.openml.org/search?type=data
I intentionally picked a dataset with many features and "0 missing values". Now I have found that some features have the value '?'. Therefore I would like to count, for every feature, how often the value '?' appears (in that column of my data.frame).
My question seems so easy, but I'm sorry, I have not found an answer so far. Everything I have tried seems to be a bit "overkill" and is not working:
frequency
I tried out frequency. I think I picked up somewhere that it's supposed to give me a list of which values occur and how often. But trying it out I found that "frequency returns the number of samples per unit time and deltat the time interval between observations (see ts)."
[1] 1
> frequency(phpvqZpLa[,2])
[1] 1
> frequency(phpvqZpLa)
[1] 1
> ?frequency
table
I thought about using table. But that's not really what I want. I'm looking for something much simpler :D
I am quite new to R and this is my second question in this forum. Therefore I am very happy about helpful answers to my question, but also about comments on how I could or should improve my question, or a link to a very similar question (which I did not find before).
edit
After I tried out the suggestion of Shibaprasadb (which seemed to answer my problem), the question marks were not counted correctly:
> colSums(phpvqZpLa[,6]=='?')
weight
0
> phpvqZpLa[1,6]
# A tibble: 1 x 1
weight
<chr>
1 '?'
Always try to provide a dummy data frame. That makes the job much easier.
You can do this:
#Creating a dummy data frame
a <- c(1, 2, 4,'?', 58, 90, '?')
b <- c('?', 89, 90, 100, '?', 67, 900)
c <- c(57, 71, '?', '?', '?',76, 90)
df <- data.frame(a,b,c)
colSums(df=='?')
Output:
a b c
2 2 3
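The edit to the question suggests that in the real data the cells may hold the quote characters themselves ('?' rather than ?). That is an assumption based on the printed tibble, but if it holds, the comparison has to include the quotes. A small sketch with made-up data (the weight values below are hypothetical):

```r
# Hypothetical column where the missing marker is stored with literal quotes
quoted <- data.frame(weight = c("'?'", "72", "'?'", "65"))

colSums(quoted == "?")    # the bare question mark matches nothing
colSums(quoted == "'?'")  # including the quotes matches the two marked cells
```

Printing a few raw values with print() or dput() is a quick way to check which of the two forms the data actually contains.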
The tidyverse, especially dplyr, is excellent for these operations.
Using the example data by @danloo, we can first replace all ? values with NA. There is a function designed specifically for that, na_if. After that we use normal dplyr syntax to summarise with a list of functions: sum(is.na(.x)), which counts the NA elements, and mean(is.na(.x)), which gives the rate of NAs, for every column (everything()):
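library(dplyr)
data %>%
# apply na_if() column by column; calling it on the whole data frame fails in dplyr >= 1.1.0
mutate(across(everything(), ~na_if(.x, "?"))) %>%
summarise(across(everything(), list(sum=~sum(is.na(.x)), mean=~mean(is.na(.x)))))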
# A tibble: 1 x 4
col_a_sum col_a_mean col_b_sum col_b_mean
<int> <dbl> <int> <dbl>
1 1 0.25 2 0.5
With the data from @Shibaprasadb:
a_sum a_mean b_sum b_mean c_sum c_mean
1 2 0.2857143 2 0.2857143 3 0.4285714
library(tidyverse)
data <- tibble(
col_a = c("?", "a", "b", "c"),
col_b = c("f", "?", "?", "g")
)
data
#> # A tibble: 4 x 2
#> col_a col_b
#> <chr> <chr>
#> 1 ? f
#> 2 a ?
#> 3 b ?
#> 4 c g
data %>%
pivot_longer(everything()) %>%
mutate(is_na = value == "?") %>%
count(name, is_na) %>%
filter(is_na) %>%
select(name, n)
#> # A tibble: 2 x 2
#> name n
#> <chr> <int>
#> 1 col_a 1
#> 2 col_b 2
Created on 2021-09-14 by the reprex package (v2.0.1)
I would like to be able to create a list of variables, and then perform a count of a target variable for each level of each variable in the list. For clear presentation of the results, I'd like my end result to take the form of four columns: Variable, Level, Result, and Count.
Consider this partially-there example, borrowing heavily from Brad Cannel's answer at dplyr- group by in a for loop r:
df <- tibble(
var1 = c(rep("a", 5), rep("b", 5)),
var2 = c(rep("c", 3), rep("d", 7)),
var3 = rnorm(10),
result=c("good","bad","good","bad","good","bad","good","bad","good","bad")
)
groups <- c(quo(var1), quo(var2)) # Create a list of quosures
results<-list()
for (i in seq_along(groups)) {
results[[i]]<-df %>%
group_by(!!groups[[i]]) %>% # Unquote with !!
count(result)
}
all_results<-bind_rows(results)
At this point, the column n has the counts that I'd like. Rather than having columns named var1 and var2, I'm hoping to produce a result that looks like:
desired_results<-tibble(
variable=c("var1","var1","var1","var1","var2","var2","var2","var2"),
level=c("a","a","b","b","c","c","d","d"),
result=c("bad","good","bad","good","bad","good","bad","good"),
n=c(2,3,3,2,1,2,4,3)
)
I have tried using mutate in the loop to produce my result, but can't get the formatting correct:
for (i in seq_along(groups)) {
results[[i]]<-df %>%
group_by(!!groups[[i]]) %>% # Unquote with !!
count(result)%>%
mutate(level=!!groups[[i]])%>%
mutate(variable=groups[i])%>%
ungroup()%>%
select(variable,level,result,n)
}
I figured out how to "get there" using pivot_longer, like so (although the columns still need renaming afterwards):
all_results2<-all_results%>%
pivot_longer(cols=c(-result,-n))%>%
filter(!(is.na(value)))
I'd really like to know how I could avoid this and just produce a column that houses the variable name right there in the for loop, and I'm guessing I'm missing some key bit of syntax. Any help in finding and explaining the solution would be greatly appreciated!
This can be done with pivot_longer, without looping and then binding the rows:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = var1:var2, names_to = 'variable',
values_to ='level') %>%
count(variable, level, result)
-output
# A tibble: 8 x 4
variable level result n
<chr> <chr> <chr> <int>
1 var1 a bad 2
2 var1 a good 3
3 var1 b bad 3
4 var1 b good 2
5 var2 c bad 1
6 var2 c good 2
7 var2 d bad 4
8 var2 d good 3
Another option in the larger tidyverse, instead of the for loop, would be a call to purrr::map_dfr.
library(tidyverse)
groups <- c("var1", "var2")
map_dfr(groups,
~ tibble(variable = .x,
count(df, level = !!sym(.x), result)))
#> # A tibble: 8 x 4
#> variable level result n
#> <chr> <chr> <chr> <int>
#> 1 var1 a bad 2
#> 2 var1 a good 3
#> 3 var1 b bad 3
#> 4 var1 b good 2
#> 5 var2 c bad 1
#> 6 var2 c good 2
#> 7 var2 d bad 4
#> 8 var2 d good 3
Created on 2021-07-21 by the reprex package (v0.3.0)
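To address the literal question of keeping the for loop, here is a minimal sketch that stores the variable names as plain strings and looks each column up with the .data[[ ]] pronoun. This is one possible approach, not the method from either answer above; the names vars and v are introduced only for illustration, and var3 is omitted because it plays no role in the counts.

```r
library(dplyr)

df <- tibble(
  var1   = c(rep("a", 5), rep("b", 5)),
  var2   = c(rep("c", 3), rep("d", 7)),
  result = c("good","bad","good","bad","good","bad","good","bad","good","bad")
)

vars <- c("var1", "var2")                 # column names as strings
results <- vector("list", length(vars))

for (i in seq_along(vars)) {
  v <- vars[[i]]
  results[[i]] <- df %>%
    count(level = .data[[v]], result) %>% # count by the column named in v
    mutate(variable = v, .before = 1)     # record which column was counted
}

all_results <- bind_rows(results)
all_results
```

Because the name is already a string, it can be written straight into the variable column, which is the step that is awkward when the loop iterates over quosures.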
Suppose you have a dataframe with ids and the elements assigned to each id. For example:
example <- data.frame(id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
vals = c("a","b",'c','d','e','a','b','d','c',
'd','f','g','h','a','k','l','m', 'a',
'b', 'c'))
I want to find all possible pair combinations. The main struggle here is not which R functions to use, but the logic. How can I iterate through all elements and find patterns? For instance, a was picked together with b 3 times in my sample dataframe. But the original dataframe has more than 30k rows, so I cannot count these combinations manually. How do I automate the process of counting how often each pair of elements is picked together?
I was thinking about widening my df with pivot_wider and then using map_lgl to find matches. Then I faced the problem that applying map_lgl to every pair of elements would take a lot of time.
I asked nearly the same question less than a month ago; fellow users answered it, but the result is not what I need.
Do you have any ideas how to create a dataframe with all possible combinations of values for all ids?
I understand that this code is slow, but here is another example that gets the expected output using the tidyverse.
What I do here is first create a nested dataframe by id, then produce all pair combinations for each id, unnest the dataframe, and finally count the pairs.
library(tidyverse)
example <- data.frame(
id = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
vals = c("a","b",'c','d','e','a','b','d','c','d','f','g','h','a','k','l','m','a','b', 'c')
)
example %>%
nest(dataset = -id) %>%
mutate(dataset = map(dataset, function(dataset) {
# ids with a single value cannot form a pair
if (nrow(dataset) < 2) return(NULL)
combn(dataset$vals, 2) %>%
t() %>%
as_tibble(.name_repair = ~c("val1", "val2"))
})) %>%
unnest(cols = dataset) %>%
group_by(val1, val2) %>%
summarize(n = n(), .groups = "drop") %>%
arrange(desc(n), val1, val2)
#> # A tibble: 34 x 3
#> val1 val2 n
#> <chr> <chr> <int>
#> 1 a b 3
#> 2 a c 2
#> 3 a d 2
#> 4 b c 2
#> 5 b d 2
#> 6 a e 1
#> 7 a k 1
#> 8 a l 1
#> 9 b e 1
#> 10 c d 1
#> # … with 24 more rows
Created on 2021-03-04 by the reprex package (v1.0.0)
This won't (can't) be fast for many IDs. If it is too slow, you need to parallelize or implement it in a compiled language (e.g., using Rcpp).
We sort vals. We can then create all combinations of two items, grouped by ID. We exclude IDs with only one item. Finally we tabulate the result.
library(data.table)
setDT(example)
setorder(example, id, vals)
example[, if (.N > 1) split(combn(vals, 2), 1:2), by = id][, .N, by = c("1", "2")]
# 1 2 N
# 1: a b 3
# 2: a c 2
# 3: a d 3
# 4: a e 1
# 5: b c 2
# 6: b d 2
# 7: b e 1
#<...>
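The sorting step matters because combn() keeps the input order: without it, id 4 contributes the pair (d, a) rather than (a, d), which is why an unsorted count reports a and d together only twice while the sorted one reports three times. A base R sketch of the same logic, under the assumption that a repeated value within one id should count once (the names pairs_per_id, all_pairs and pair_counts are illustrative):

```r
example <- data.frame(
  id   = c(1,1,1,1,1,2,2,2,3,4,4,4,4,4,4,4,5,5,5,5),
  vals = c("a","b","c","d","e","a","b","d","c","d",
           "f","g","h","a","k","l","m","a","b","c")
)

# For each id, sort the values so that every pair is written the same way round
pairs_per_id <- lapply(split(example$vals, example$id), function(v) {
  v <- sort(unique(v))
  if (length(v) < 2) return(NULL)      # a single value cannot form a pair
  m <- t(combn(v, 2))
  data.frame(val1 = m[, 1], val2 = m[, 2], n = 1L)
})

# Stack the per-id tables (dropping ids without pairs) and add up the counts
all_pairs   <- do.call(rbind, Filter(Negate(is.null), pairs_per_id))
pair_counts <- aggregate(n ~ val1 + val2, data = all_pairs, FUN = sum)

# Most frequent pairs first
pair_counts[order(-pair_counts$n), ][1:5, ]
```

This is not meant to be faster than the data.table version; it only spells out the sort, combine, tabulate sequence step by step.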
I would like to identify all rows of a tibble that have been altered by a mutate.
My real data has multiple columns and the mutate function changes more than one column at once.
# library
library(tidyverse)
# get df
df <- tibble(name=c("A","B","C","D"),value=c(1,2,3,4))
# mutate df
dfnew <- df %>%
mutate(value=case_when(name=="A" ~ value+1, TRUE ~value)) %>%
mutate(name=case_when(name=="B" ~ "K", TRUE ~name))
Created on 2020-04-26 by the reprex package (v0.3.0)
Now I am looking for a way to compare all rows of df with dfnew and identify all rows with at least one change.
The desired output would be:
# desired output:
#
# # A tibble: 2 x 2
# name value
# <chr> <dbl>
# 1 A 2
# 2 K 2
You can do:
anti_join(dfnew, df)
name value
<chr> <dbl>
1 A 2
2 K 2
@tmfmnk's response does the trick, but if you'd like to use a loop (e.g. for the flexibility of issuing different messages or warnings depending on what you're checking), you could do:
output <- list()
for (i in 1:nrow(dfnew)) {
if (all(df[i, ] == dfnew[i, ])) {
next
}
output[[i]] <- dfnew[i, ]
}
bind_rows(output)
# A tibble: 2 x 2
name value
<chr> <dbl>
1 A 2
2 K 2
We can also use setdiff from dplyr
library(dplyr)
setdiff(dfnew, df)
# A tibble: 2 x 2
# name value
# <chr> <dbl>
#1 A 2
#2 K 2
Or using fsetdiff from data.table
library(data.table)
fsetdiff(setDT(dfnew), setDT(df))
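The join-based and set-based answers compare row values, so they report which rows of dfnew have no identical counterpart in df. If the row positions are also of interest, a positional comparison is one option. The sketch below assumes that both tibbles have the same rows in the same order and contain no NA values; the name changed is illustrative.

```r
library(dplyr)

df <- tibble(name = c("A", "B", "C", "D"), value = c(1, 2, 3, 4))
dfnew <- df %>%
  mutate(value = case_when(name == "A" ~ value + 1, TRUE ~ value)) %>%
  mutate(name  = case_when(name == "B" ~ "K", TRUE ~ name))

# Compare cell by cell; a row counts as changed if any of its cells differs.
# unname() drops the row labels so the result is a plain logical vector.
changed <- unname(rowSums(df != dfnew) > 0)

which(changed)    # positions of the altered rows
dfnew[changed, ]  # the altered rows themselves
```

With NA values in the data, df != dfnew returns NA for those cells, so the comparison would need extra handling before it can be trusted.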
I have a question about using dplyr inside a for loop to create multiple datasets. The code without the loop works well, but the code with the for loop does not give the expected result and throws an error.
The error message was:
"Error in UseMethod("select_") : no applicable method for 'select_' applied to an object of class "character""
Could anyone point me in the right direction?
The code below worked
B <- data %>% select (column1) %>% group_by (column1) %>% arrange (column1) %>% summarise (n = n ())
The code below did not work
column_list <- c ('column1', 'column2', 'column3')
for (b in column_list) {
a <- data %>% select (b) %>% group_by (b) %>% arrange (b) %>% summarise (n = n () )
assign (paste0(b), a)
}
Don't use assign; use lists instead.
We can use the _at variants in dplyr, which work with column names given as character strings.
library(dplyr)
split_fun <- function(df, col) {
df %>% group_by_at(col) %>% summarise(n = n()) %>% arrange_at(col)
}
and then use lapply/map to apply it to different columns
purrr::map(column_list, ~split_fun(data, .))
This will return a list of data frames, each of which can be accessed individually with [[ if needed.
Using an example with mtcars:
df <- mtcars
column_list <- c ('cyl', 'gear', 'carb')
purrr::map(column_list, ~split_fun(df, .))
#[[1]]
# A tibble: 3 x 2
# cyl n
# <dbl> <int>
#1 4 11
#2 6 7
#3 8 14
#[[2]]
# A tibble: 3 x 2
# gear n
# <dbl> <int>
#1 3 15
#2 4 12
#3 5 5
#[[3]]
# A tibble: 6 x 2
# carb n
# <dbl> <int>
#1 1 7
#2 2 10
#3 3 3
#4 4 10
#5 6 1
#6 8 1
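The _at verbs are marked as superseded since dplyr 1.0.0 in favour of across(). A sketch of the same helper written that way, returning a named list so each result can be looked up by column name (count_by and results are hypothetical names used only here):

```r
library(dplyr)

# Count rows per level of the column whose name is given as a string
count_by <- function(df, col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(n = n(), .groups = "drop")
}

column_list <- c("cyl", "gear", "carb")

# Naming the input vector gives a named list of results
results <- lapply(setNames(column_list, column_list), count_by, df = mtcars)

results$cyl
```

all_of() makes it explicit that col holds column names rather than a column of the data, which avoids the same kind of name clash between data and environment discussed for filter().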